Well, that’s awesome.

  • conciselyverbose · 3 months ago

    The problem is that LLM output isn’t human speech, and any dataset that includes it cannot be an accurate representation of human speech.

    It’s not “LLMs convinced humans to use ‘delve’ a lot”. It’s “this dataset is muddy as hell because a huge proportion of it is randomly generated noise”.

    • NuXCOM_90Percent@lemmy.zip · 3 months ago

      What is “human speech”? Again, so many people around the world have picked up idioms and speaking cadences from the media they consume. A great example: two of my best friends are from the UK but have been in the US long enough that their families make fun of them. Yet their kid pronounces “aluminium” the British way, “al-you-min-ee-uhm”, even though they both say “al-ooh-min-um”. Why? Because he watches a cartoon where they pronounce it the British way.

      And I already referenced SoCal-ification, which is heavily driven by the screenwriters and actors who live in LA. Again, do we not speak “human speech” because it was artificially influenced?

      Like, yeah, LLMs are “tainted” with the word “delve” (which I am pretty sure comes from YouTube scripts anyway, but…). So are people. There is a lot of value in researching WHY a given word or idiom becomes so popular, but at the end of the day… people be saying “delve” a lot.

      • conciselyverbose · 3 months ago

        Speech written by a human. It’s not complicated.

        It cannot possibly be human speech if it was produced by a machine.