• abhibeckert@beehaw.org · 2 years ago

    I’d bet sites blocking ChatGPT will regret it when (not if) Bing starts using it for search engine relevance.

  • dbilitated@aussie.zone · 2 years ago

    I’d rather like it if they train it on stuff I say. I want the AI of tomorrow to reflect my thoughts.

    Seriously, I would much prefer that gold-tier journalism and news sites let it crawl, so that when people use it to make choices in the future they’re guided to better ones.

    It is honestly so hard to know what will happen, though; it’s so complicated that it’s virtually guaranteed we’re not correctly anticipating the consequences of any of this. I’m not really even talking about the AI, I’m talking about the effects on society, which are a lot more complex.

  • ashtrix@lemmy.ca · 2 years ago

    Yeah, it’s already too late. Why didn’t they provide this before they scraped everyone’s websites?

    • P03 Locke@lemmy.dbzer0.com · 2 years ago

      You think Google thought about robots.txt before they developed their search engine? Nah, it’s all public Internet, and they scraped away.

      A non-zero percentage of websites will bother to follow these instructions, but it might as well be zero.
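      For reference, the opt-out being discussed is just a standard robots.txt rule; per OpenAI’s GPTBot documentation, blocking it entirely looks like:

      ```text
      User-agent: GPTBot
      Disallow: /
      ```

      Whether the crawler actually honors it is, of course, a matter of trust, which is rather the point of this thread.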

      • The Doctor@beehaw.org · 2 years ago

        Very early on, at least, their spiders respected robots.txt.

        I know there are folks who block all of the Big G in their robots.txt files on principle; might be worth asking them whether it works or not.
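        Side note: the "respecting" part is mechanical enough to sketch. Python’s standard library ships a robots.txt parser, so here is roughly how a rules-respecting crawler decides whether a fetch is allowed (the rules are inlined for illustration; a real crawler would download them from the site’s /robots.txt, and example.com is a placeholder host):

        ```python
        # Sketch of how a well-behaved crawler consults robots.txt,
        # using the stdlib urllib.robotparser.
        from urllib.robotparser import RobotFileParser

        rules = """\
        User-agent: Googlebot
        Disallow: /

        User-agent: *
        Allow: /
        """

        parser = RobotFileParser()
        # parse() accepts the file's lines; read() would fetch a real URL instead.
        parser.parse(rules.splitlines())

        # Googlebot matches its specific record and is blocked everywhere;
        # any other agent falls through to the catch-all and is allowed.
        print(parser.can_fetch("Googlebot", "https://example.com/page"))     # False
        print(parser.can_fetch("SomeOtherBot", "https://example.com/page"))  # True
        ```

        The catch, as the rest of this thread notes, is that nothing forces a crawler to run a check like this before fetching.
        
        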

        • chameleon@kbin.social · 2 years ago

          I do, and I can confirm there are no requests (except for robots.txt and the odd /favicon.ico). Google sorta respects robots.txt. They do have a weird gotcha, though: blocked URLs still show up in search, they just appear with a useless description. Their suggestion for avoiding that boils down to “don’t block us, let us crawl and just tell us not to use the result, just trust us!”, when they could very easily change that behavior to make more sense. Not a single damn person with Google blocked in robots.txt wants to be indexed. Their logic about password-protecting pages kind of makes sense, but my concern isn’t security; it’s that I don’t like them (or Bing or Yandex).

          Another gotcha I’ve seen linked is that the ad-targeting bot for Google AdSense (a different crawler) doesn’t respect a * exclusion, but that kind of makes sense, since it will only ever visit your site if you place AdSense ads on it.

          And I suppose they’ll train Bard on all data they scraped because of course. Probably no way to opt out of that without opting out of Google Search as well.
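          For anyone running into that gotcha: the two mechanisms Google documents really do work at different layers, which is why a robots.txt block alone doesn’t keep a URL out of results. A rough sketch of both (example.com is a placeholder):

          ```text
          # robots.txt block: the crawler stays out, but the URL can still be
          # listed in results (with a bare description) if other sites link to it.
          User-agent: Googlebot
          Disallow: /

          # noindex: the page must be crawlable so the directive can be seen,
          # but it is then dropped from results. Either form works:
          <meta name="robots" content="noindex">    # in the page's <head>
          X-Robots-Tag: noindex                     # as an HTTP response header
          ```

          Which is exactly the “let us crawl, just trust us” trade-off described above: you only get de-listing by letting the crawler in.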

  • Tibert@compuverse.uk · 2 years ago

    Like it is useful… OpenAI already got all the useful info out of the websites.

    Though maybe for the sites generating new content it may have a use. But all the content from before that is already lost to ChatGPT.

  • On@kbin.social · 2 years ago

    Is it possible that they offloaded the scraping to a different company to avoid direct litigation now that they’re out in the open? So they can say, “We didn’t scrape your website, and you can’t prove it.”

    Like how DDG, Ecosia, and Qwant use Bing for their data, or how the feds buy data from data brokers. Outsource the dirty job like every tech company does, and shift the blame if caught doing something unlawful.

    It seems they’re trying to garner some positive PR after they scraped through everything without anyone noticing.

    • TehPers@beehaw.org · 2 years ago (edited)

      Why would they be concerned about litigation? As far as I know, scraping is completely legal in most if not all countries (including the US, which I’m more familiar with and where they’re headquartered), as long as you’re respecting copyright and correctly handling PII (which they claim to be making an effort on).

  • breaks.ʟᴏʟ@lemmy.studio · 2 years ago

    But for large website operators, the choice to block large language model (LLM) crawlers isn’t as easy as it may seem. Making some LLMs blind to certain website data will leave gaps of knowledge that could serve some sites very well (such as sites that don’t want to lose visitors if ChatGPT supplies their information for them), but it may also hurt others. For example, blocking content from future AI models could decrease a site’s or a brand’s cultural footprint if AI chatbots become a primary user interface in the future. As a thought experiment, imagine an online business declaring that it didn’t want its website indexed by Google in the year 2002—a self-defeating move when that was the most popular on-ramp for finding information online.

    Really curious how this will end up.

    • axibzllmbo@beehaw.org · 2 years ago

      That’s an interesting point that I hadn’t considered; the comparison to Google indexing in the early 2000s may prove to be very apt, given the number of people I’ve seen using ChatGPT as a search engine.

    • The Bard in Green@lemmy.starlightkel.xyz · 2 years ago

      Hilariously, unless ALL Lemmy instances do this, anyone who federates with you will have to block it too, or any communities they sync with you will be available on their instances…