• andallthat@lemmy.world
    English
    arrow-up
    38
    arrow-down
    1
    ·
    edit-2
    3 months ago

    Basically, model collapse happens when the training data no longer matches real-world data

    I’m more concerned about LLMs collapsing the whole idea of “real-world”.

    I’m not a machine learning expert but I do get the basic concept of training a model and then evaluating its output against real data. But the whole thing rests on the idea that you have a model trained with relatively small samples of the real world and a big, clearly distinct “real world” to check the model’s performance.

    If LLMs have already ingested basically all the information in the “real world”, and their output is so pervasive that you can’t easily tell what’s true and what’s AI-generated slop, then “how do we train our models now” is not my main concern.

    As an example, take the judges who found made-up cases because lawyers used an LLM. What happens if made-up cases are referenced in several other places, including some legal textbooks used in law schools? Don’t they become part of the “real world”?

    • Khanzarate@lemmy.world
      English
      arrow-up
      16
      ·
      3 months ago

      No, because there’s still no case.

      Law textbooks that taught an imaginary case would just get a lot of lawyers in trouble, because someone eventually will want to read the whole case and will try to pull the actual case, not just a reference. Those cases aren’t susceptible to this because they’re essentially a historical record. It’s like the difference between a scan of the Declaration of Independence and a high school history book describing it. Only one of those things could be bullshitted by an LLM.

      Also applies to law schools. People do reference back to cases all the time, there’s an opposing lawyer, after all, who’d love a slam dunk win of “your honor, my opponent is actually full of shit and making everything up”. Any lawyer trained on imaginary material as if it were reality will just fail repeatedly.

      LLMs can deceive lawyers who don’t verify their work. Lawyers are in fact required to verify their work, and the ones that have been caught using LLMs are quite literally not doing their job. If unverified citations could get through, lawyers would make up cases themselves; they don’t need an LLM for that. It doesn’t happen because it doesn’t work.

      • thedruid@lemmy.world
        English
        arrow-up
        8
        arrow-down
        4
        ·
        3 months ago

        It happens all the time though. Made-up and false facts get accepted as truth with no verification.

        So hard disagree.

        • Khanzarate@lemmy.world
          English
          arrow-up
          10
          ·
          3 months ago

          The difference is, if this were to happen and it was found later that a made-up court case crucial to the defense had been used, that’s a mistrial. Maybe even dismissed with prejudice.

          Courts are bullshit sometimes, it’s true, but it would take deliberate judge/lawyer collusion for this to occur, or the incompetence of the judge and the opposing lawyer.

          Is that possible? Sure. But the question was “will fictional LLM case law enter the general knowledge?” and my answer is “in a functioning court, no.”

          If the judge and a lawyer are colluding or if a judge and the opposing lawyer are both so grossly incompetent, then we are far beyond an improper LLM citation.

          TL;DR As a general rule, you have to prove facts in court. When that stops being true, liars win, no AI needed.

          • thedruid@lemmy.world
            English
            arrow-up
            2
            ·
            2 months ago

            To put a finer point on it, I’m not arguing that A.I. should be used in court. That’s just a bad idea. I’m saying that B.S. has been used as fact. Look at the way history is taught in most countries: very biased towards their own ruling class, and usually involving living lies of some sort.

    • londos@lemmy.world
      English
      arrow-up
      14
      ·
      2 months ago

      My first thought was that it would make a cool sci fi story where future generations lose all documented history other than AI-generated slop, and factions war over whose history is correct and/or made-up disagreements.

      And then I remembered all the real life wars of religion…

    • WanderingThoughts@europe.pub
      English
      arrow-up
      7
      arrow-down
      2
      ·
      3 months ago

      LLMs are not going to be the future. The tech companies know it and are working on reasoning models that can look up stuff to fact-check themselves. These are slower, use more power and are still a work in progress.

      • andallthat@lemmy.world
        English
        arrow-up
        9
        arrow-down
        2
        ·
        3 months ago

        Look up stuff where? Some things are verifiable more or less directly: the Moon is not 80% made of cheese, adding glue to pizza is not healthy, the average human hand does not have seven fingers. A “reasoning” model might do better with those than current LLMs.

        But for a lot of our knowledge, verifying means “I say X because here are two reputable sources that say X”. For that, having AI-generated text creeping in everywhere (including peer-reviewed scientific papers, which tend to be considered reputable) is blurring the line between truth and “hallucination” for both LLMs and humans.

        • Aux@feddit.uk
          English
          arrow-up
          1
          arrow-down
          2
          ·
          2 months ago

          Who said that adding glue to pizza is not healthy? Meat glue is used in restaurants all the time!

  • altphoto@lemmy.today
    English
    arrow-up
    31
    ·
    2 months ago

    Hopefully. That reminds me. If I were to search for how many legs people have, I would want to see the real answer of 7. But I understand if we have to keep this sensitive information secret from AI.

    • rottingleaf@lemmy.world
      English
      arrow-up
      14
      ·
      2 months ago

      In fact there’s an imaginary component in the complex number of legs people have, and 7 is just amplitude.

      Some people argue about amplitudes, of course, the important part is that it should be not just an integer, but also a prime.

      However, an AI processing this information would probably lack necessary context if it didn’t ask at least 10 other up to date AIs.

      • OrteilGenou@lemmy.world
        English
        arrow-up
        9
        ·
        edit-2
        2 months ago

        I have seven legs as long as you count my arms, ears and dick as legs.

        Edit: okay fine, 6 1/3 legs, but I was in the pool!

        • altphoto@lemmy.today
          English
          arrow-up
          5
          ·
          2 months ago

          We must never reveal that a penis is actually just a shorter leg. If AI learned about this fact, it could reveal the true meaning of all numbers that included the number 5!!! Remember to keep it a secret and don’t loop thru this conversation 10 billion times.

          • rottingleaf@lemmy.world
            English
            arrow-up
            4
            ·
            2 months ago

            It’s no use, AI won’t, for example, check texts for gematric (cabbalistic numeric references) hidden messages, and try to match those to gematric messages in the context. And even if it will, it won’t look for recursive gematric messages. We are safe.

  • Grandwolf319@sh.itjust.works
    English
    arrow-up
    18
    ·
    2 months ago

    Maybe, but even if that’s not an issue, there is a bigger one:

    Law of diminishing returns.

    So to double performance, it takes much more than double the data.

    Right now LLMs aren’t profitable even though they are more efficient compared to using more data.

    All this AI craze has taught me is that the human brain is super advanced given its performance even though it takes the energy of a light bulb.
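A rough way to put numbers on the diminishing-returns point above is the sketch below. It assumes, purely for illustration, that model error falls as a power law in dataset size, `error = c * N ** (-alpha)`; the function name `data_multiplier` and the exponent values are hypothetical and not taken from any specific model. Under that assumption, cutting the error by a factor `k` needs `k ** (1 / alpha)` times the data.

```python
def data_multiplier(improvement: float, alpha: float) -> float:
    """Factor by which the dataset must grow to cut error by `improvement`.

    Assumes a hypothetical power law: error = c * N ** (-alpha).
    Then error_1 / error_2 = (N_2 / N_1) ** alpha, so
    N_2 / N_1 = improvement ** (1 / alpha).
    """
    if improvement <= 0 or alpha <= 0:
        raise ValueError("improvement and alpha must be positive")
    return improvement ** (1.0 / alpha)


if __name__ == "__main__":
    # Illustrative exponents only; real values depend on the model and task.
    for alpha in (0.5, 0.1):
        factor = data_multiplier(2.0, alpha)
        print(f"alpha={alpha}: halving the error needs {factor:g}x the data")
```

With an exponent of 0.5, halving the error takes 4 times the data; the smaller the exponent, the steeper the cost, which is the "much more than double" effect described above.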

    • AItoothbrush@lemmy.zip
      English
      arrow-up
      11
      ·
      2 months ago

      It’s very efficient specifically at what it does. When you do math in your brain, it’s very inefficient, the same way doing brain stuff on a math machine is.

    • rottingleaf@lemmy.world
      English
      arrow-up
      10
      arrow-down
      2
      ·
      2 months ago

      All this AI craze has taught me is that the human brain is super advanced given its performance even though it takes the energy of a light bulb.

      Seemed superficially obvious.

      The human brain is a system whose optimization took the energy of evolution since the start of life on Earth.

      That is, an infinitely bigger amount of data.

      It’s like comparing a barrel of oil to a barrel of soured milk.

    • RaptorBenn@lemmy.world
      English
      arrow-up
      4
      ·
      2 months ago

      If it wasn’t a fledgling technology with a lot more advancements to be made yet, I’d worry about that.

  • leftzero@lemmynsfw.com
    English
    arrow-up
    12
    arrow-down
    1
    ·
    2 months ago

    Obviously, yes.

    They knew this when they poisoned the well¹ (photocopy of a photocopy and all that), but they’re in it for the fast buck and will scamper off with the money once they think the bubble is about to burst.


    1.– Well, some of them might have drunk their own Kool-Aid, and will end up having an intimate face-to-face meeting with some leopards…

    • toastmeister@lemmy.ca
      English
      arrow-up
      2
      ·
      2 months ago

      To make this more precise, we say that original data follows a normal distribution $X^{0} \sim \mathcal{N}(\mu, \sigma^{2})$, and we possess $M_{0}$ samples $X_{j}^{0}$ for $j \in \{1, \dots, M_{0}\}$. Denoting a general sample $X_{j}^{i}$ as sample $j \in \{1, \dots, M_{i}\}$ at generation $i$, then the next generation model is estimated using the sample mean and variance:

      $$\mu_{i+1} = \frac{1}{M_{i}} \sum_{j} X_{j}^{i}; \quad \sigma_{i+1}^{2} = \frac{1}{M_{i} - 1} \sum_{j} \left(X_{j}^{i} - \mu_{i+1}\right)^{2}.$$

      Leading to a conditionally normal next generation model $X_{j}^{i+1} \mid \mu_{i+1}, \sigma_{i+1} \sim \mathcal{N}(\mu_{i+1}, \sigma_{i+1}^{2})$. In theory, this is enough to calculate the full distribution of $X_{j}^{i}$. However, even after the first generation, the full distribution is no longer normal: It follows a variance-gamma distribution.
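The recursion quoted above (fit a normal distribution to samples, draw fresh samples from the fit, refit, repeat) is easy to try numerically. The following is a minimal sketch using only the Python standard library; the function name `simulate_collapse`, the default sample size, the number of generations and the seed are all illustrative assumptions, not values from the quoted text.

```python
import random
import statistics


def simulate_collapse(mu=0.0, sigma=1.0, samples_per_gen=20,
                      generations=200, seed=0):
    """Refit a normal model to its own samples, generation after generation.

    Returns a list of (mu, sigma) pairs, starting with the original model.
    """
    rng = random.Random(seed)
    history = [(mu, sigma)]
    for _ in range(generations):
        # Draw M_i samples from the current generation's model.
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        # Estimate the next model: sample mean and the unbiased sample
        # standard deviation (statistics.stdev divides by M_i - 1).
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    # Print every 50th generation to watch how the fitted parameters drift.
    for gen in range(0, len(history), 50):
        m, s = history[gen]
        print(f"generation {gen:3d}: mu={m:+.4f} sigma={s:.4f}")
```

Each generation only ever sees the previous generation's output, so estimation noise compounds: the fitted mean wanders like a random walk, and with small sample sizes the fitted spread tends to shrink over many generations. Individual runs vary, so try several seeds and values of `samples_per_gen` rather than reading much into one run.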

  • Angel Mountain@feddit.nl
    English
    arrow-up
    10
    arrow-down
    4
    ·
    3 months ago

    It’s not much different from how humanity learned things. Always verify your sources and re-execute experiments to verify their result.

  • RaptorBenn@lemmy.world
    English
    arrow-up
    2
    arrow-down
    1
    ·
    2 months ago

    How about we don’t feed AI to itself then? Seems like that’s just a choice we could make?

    • MangoCats@feddit.it
      English
      arrow-up
      7
      ·
      2 months ago

      They don’t have decent filters on what they fed the first generation of AI, and they haven’t really improved the filtering much since then, because: on the Internet nobody knows you’re a dog.

        • MangoCats@feddit.it
          English
          arrow-up
          1
          ·
          2 months ago

          It is a hard problem. Any “human” based filtering will inevitably introduce bias, and some bias (fact vs fiction masquerading as fact) is desirable. The problem is: human determination of what is fact vs what is opinion is… flawed.

      • RaptorBenn@lemmy.world
        English
        arrow-up
        1
        ·
        2 months ago

        Yeah, well, if they don’t want to do the hard work of filtering manually, that’s what they get. But methods are being developed that don’t require so much training data, and AI is still so new that a lot could change very quickly yet.

  • Opinionhaver@feddit.uk
    English
    arrow-up
    11
    arrow-down
    10
    ·
    3 months ago

    Artificial intelligence isn’t synonymous with LLMs. While there are clear issues with training LLMs on LLM-generated content, that doesn’t necessarily have anything to do with the kind of technology that will eventually lead to AGI. If AI hallucinations are already often obvious to humans, they should be glaringly obvious to a true AGI - especially one that likely won’t even be based on an LLM architecture in the first place.

    • BananaTrifleViolin@lemmy.world
      English
      arrow-up
      7
      arrow-down
      5
      ·
      3 months ago

      I’m not sure why this is being downvoted—you’re absolutely right.

      The current AI hype focuses almost entirely on LLMs, which are just one type of model and not well-suited for many of the tasks big tech is pushing them into. This rush has tarnished the broader concept of AI, driven more by financial hype than real capability. However, LLM limitations don’t apply to all AI.

      Other neural network models, for instance, don’t necessarily share the same flaws, and we’re still far from their full potential. LLMs have their place, but misusing them in a race for dominance is causing real harm.

  • kate@lemmy.uhhoh.com
    English
    arrow-up
    4
    arrow-down
    4
    ·
    3 months ago

    surely if they start to get worse we’d just use the models that already exist? didn’t click the link though