Well, that’s awesome.

  • @conciselyverbose
    20
    7 hours ago

    The problem is that LLMs aren’t human speech and any dataset that includes them cannot be an accurate representation of human speech.

    It’s not “LLMs convinced humans to use ‘delve’ a lot”. It’s “this dataset is muddy as hell because a huge proportion of it is randomly generated noise”.

    • @[email protected]
      -5
      6 hours ago

      What is “human speech”? Again, so many people (around the world) have picked up idioms and speaking cadences based on the media they consume. A great example is that two of my best friends are from the UK but have been in the US long enough that their families make fun of them. Yet their kid actually pronounces it “al-you-min-ee-uhm” even though they both say “al-ooh-min-um”. Why? Because he watches a cartoon where they pronounce it the British way.

      And I already referenced SoCal-ification, which is heavily driven by screenwriters and actors who live in LA. Again, do we not speak “human speech” because it was artificially influenced?

      Like, yeah, LLMs are “tainted” with the word “delve” (which I am pretty sure comes from YouTube scripts anyway, but…). So are people. There is a lot of value in researching WHY a given word or idiom becomes so popular but, at the end of the day… people be saying “delve” a lot.

      • @conciselyverbose
        6
        4 hours ago

        Speech written by a human. It’s not complicated.

        It cannot possibly be human speech if it was produced by a machine.