If you give me several paragraphs instead of a single sentence, do you still think it’s impossible to tell?
It’s not its biological origins that make the brain hard to understand, but its complexity. For example, we understand how the heart works pretty well.
While LLMs are nowhere near as complex as a brain, they’re complex enough to make it extremely difficult to understand.
But then there comes the question: if they’re so difficult to understand, how did people make them in the first place?
The way they did it actually bears some similarities to evolution. They created an “empty” model - a large neural network that wasn’t doing anything useful or meaningful. But its behavior depended on billions of parameters, and if you tweak a parameter, that behavior changes slightly.
Then they expended an enormous amount of computing power tweaking parameters, each tweak slightly improving the model’s ability to model language. While doing this, they didn’t know what each number meant. They didn’t know how or why each tweak was improving the model - just that each tweak was an improvement.
Unlike evolution, each tweak isn’t random. There’s an algorithm called back-propagation that can tell you how to tweak the neural network to make it predict some known data slightly better. But unfortunately it doesn’t tell you anything about why a given tweak is good, or what each parameter change means. Hence why we don’t understand how LLMs work.
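To make the “tweaking” concrete, here’s a minimal sketch of one gradient-descent step on a toy two-layer network (plain NumPy, made-up data, millions of times smaller than a real LLM - not how actual LLM training code looks): back-propagation hands you a direction to nudge every parameter, but nothing in it says why a nudge helps or what any individual number means.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8)) * 0.1   # parameters start out "empty"/meaningless
W2 = rng.normal(size=(8, 3)) * 0.1

def forward(x):
    h = np.tanh(x @ W1)              # hidden activations
    return h @ W2, h                 # raw scores for 3 possible "next tokens"

def loss_and_grads(x, target):
    scores, h = forward(x)
    p = np.exp(scores) / np.exp(scores).sum()   # softmax over the 3 options
    loss = -np.log(p[target])                   # how badly we predicted the target
    # Back-propagation: the chain rule gives the gradient of the loss with
    # respect to every parameter -- a direction to tweak, not an explanation.
    dscores = p.copy()
    dscores[target] -= 1.0
    dW2 = np.outer(h, dscores)
    dh = W2 @ dscores
    dW1 = np.outer(x, dh * (1 - h ** 2))        # gradient through tanh
    return loss, dW1, dW2

x, target = rng.normal(size=4), 1               # one made-up training example
for step in range(200):
    loss, dW1, dW2 = loss_and_grads(x, target)
    W1 -= 0.1 * dW1                             # each tweak improves the loss a little...
    W2 -= 0.1 * dW2                             # ...but no number "means" anything on its own
print("final loss:", loss)
```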
One final clarification: we do have some understanding at a high level - just like we have some understanding of how a brain works. We have a much better understanding of LLMs than of brains, of course, but we can’t really explain either.
It’s not that nobody took the time to understand. Researchers have been trying to “un-blackbox” neural networks pretty much since they’ve been around. It’s just an extremely complex problem.
Logistic regression (which is like a neural network but with just one node) is pretty well understood - but even then sometimes it can learn some pretty unintuitive coefficients and it can be tricky to understand why.
With LLMs - which are enormous by comparison - understanding how they work in detail simply isn’t a tractable problem.
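For a sense of the “one node” case, here’s a hedged sketch using scikit-learn on made-up data: with two coefficients you can at least print them and stare at them, yet correlated features can already make them come out unintuitive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)                    # x2 is almost a copy of x1
y = (x1 + rng.normal(scale=0.5, size=1000) > 0).astype(int)   # only x1 "really" matters

model = LogisticRegression().fit(np.column_stack([x1, x2]), y)
print("coefficients:", model.coef_)
# With two numbers we can at least read them off, but because x1 and x2 are
# nearly collinear the weight gets split between them in a fairly arbitrary
# way -- already a bit of work to explain. An LLM has billions of such numbers.
```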
I don’t see how that affects my point.
Today’s AI detectors can’t tell the output of today’s LLMs apart from human-written text. A future AI detector WILL be able to tell apart the output of today’s LLMs. Of course, a future AI detector won’t be able to tell apart the output of a future LLM.
So the claim that “all text after 2023 is forever contaminated” just isn’t true. At any point in time, only recent text could be “contaminated”. Researchers would simply have to be a bit more careful about including it.
Not really. If it’s truly impossible to tell the text apart, then it doesn’t really pose a problem for training AI. Otherwise, next-gen AI will be able to tell apart text generated by current-gen AI, and it will get filtered out. So only the most recent data will have unfiltered shitty AI-generated stuff, but they don’t train AI on super-recent text anyway.
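As a rough sketch of that filtering (the detector function here is purely hypothetical, standing in for whatever future classifier exists): only text from the “contaminated” window needs screening, everything older is kept as-is.

```python
from datetime import date

CUTOFF = date(2023, 1, 1)   # roughly when LLM output became widespread

def looks_ai_generated(text: str) -> bool:
    """Hypothetical stand-in for a future detector trained on today's LLM output."""
    raise NotImplementedError

def filter_training_corpus(documents):
    """Keep pre-cutoff text unconditionally; screen only the recent stuff."""
    for doc in documents:                      # doc = {"text": ..., "date": date(...)}
        if doc["date"] < CUTOFF:
            yield doc                          # predates LLMs, can't be contaminated
        elif not looks_ai_generated(doc["text"]):
            yield doc                          # recent, but passed the detector
        # anything else gets dropped
```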
They don’t redistribute. They learn information about the material they’ve been trained on - not the material itself - and can use it to generate material they’ve never seen.
Language models actually do learn things, in the sense that the information encoded in the trained model isn’t (usually) taken directly from the training data; instead, it’s information that describes the training data but is new. That’s why they can generate text that’s never appeared in the data.
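A toy way to see “describes the data, but is new”: a character bigram model (vastly simpler than an LLM, but the same principle - this is just an illustrative sketch, not how LLMs are implemented). What it stores is a table of transition counts, not the training strings, and sampling from that table produces strings that were never in the data.

```python
import random
from collections import defaultdict

training_names = ["anna", "annette", "hannah", "joanna", "susanna"]

# "Training" stores counts of which character follows which -- a description
# of the data, not the data itself.
counts = defaultdict(lambda: defaultdict(int))
for name in training_names:
    for a, b in zip("^" + name, name + "$"):   # ^ marks the start, $ the end
        counts[a][b] += 1

def sample(rng):
    """Generate a string by walking the learned transition counts."""
    out, ch = "", "^"
    while True:
        nxt = rng.choices(list(counts[ch]), weights=list(counts[ch].values()))[0]
        if nxt == "$":
            return out
        out += nxt
        ch = nxt

generated = {sample(random.Random(i)) for i in range(20)}
# Many of these strings appear nowhere in training_names -- new text,
# produced purely from information *about* the data.
print(generated - set(training_names))
```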
It’s specifically distribution of the work or derivatives that copyright prevents.
So you could make an argument that an LLM that’s memorized the book and can reproduce (parts of) it upon request is infringing. But one that’s merely trained on the book, but hasn’t memorized it, should be fine.
Why should such a thing be assumed???
That last point is completely impossible. Don’t forget that I don’t have to run the official lemmy software on my instance. I can make changes: for example, I can add a feature to my instance like “log every post in a separate, local database before deleting it from lemmy”. Nobody else but me will know this feature exists. Or (to be AGPL compliant) have a separate tool to regularly back up my lemmy database, undoing deletions.
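To give a sense of how small such a change would be (pure illustration - this is not Lemmy’s actual code or schema, just a hypothetical sqlite sketch), one wrapper around the delete path is all it takes, and nothing visible to other instances changes:

```python
import sqlite3

shadow = sqlite3.connect("shadow_archive.db")   # a local DB nobody else knows about
shadow.execute(
    "CREATE TABLE IF NOT EXISTS deleted_posts (id INTEGER, author TEXT, body TEXT)"
)

def delete_post(db, post_id):
    """Modified delete handler: quietly archive the post, then delete as normal."""
    row = db.execute(
        "SELECT id, author, body FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    if row:
        shadow.execute("INSERT INTO deleted_posts VALUES (?, ?, ?)", row)
        shadow.commit()
    db.execute("DELETE FROM posts WHERE id = ?", (post_id,))   # all the fediverse ever sees
    db.commit()
```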
As for the second point: I’d say making local votes private and non-local public will be worse for privacy due to causing confusion.
I tried Mastodon at one point, but couldn’t enjoy it. They reject recommendation algorithms in favor of chronological order. I get their reasons, but that meant my feed was mostly full of things I didn’t care about, so I left. Things may be better now that they’ve added the ability to follow hashtags, I don’t know.
Yea, didn’t watch the video, but had to post exactly this!
But honestly, I think people will do better long term if they have to put in even just a little bit of legwork to find the communities with the right fit, and ignore the rest.
That kinda misses the point, though. For me it’s more about promoting decentralization than it’s about whether people’s reasons to want to join all communities on a topic make sense (they actually can for niche topics). Without a feature like that, I fear people will just all join the largest community on the topic and “centralize” it.
I see it as compensating for the disadvantages people have. So if one student has lower test scores, but achieved them despite going to an underfunded school and having a part-time job, then that student’s scores are actually more impressive than those of someone who scored better but had private tutors throughout high school. Once you account for people’s disadvantages, you should naturally get a more diverse student body.
And of course minority students have disadvantages that should be accounted for. But they don’t affect everyone the same way, and racial quotas are a very lazy way to address that. Instead, admissions should look at the individual circumstances of each student.
I, personally, want things to be decentralized. I want to have 100+ technology communities that are all relevant. But for that to be practical, there needs to be a simple mechanism for people to follow the topic “technology”, and get the content of all these 100+ communities merged together (then perhaps manually block some of them that have bad moderation). Unless we have such mechanism, we’ll end up with one main big technology community, and all others will be secondary.
I’m hoping for two features: Let communities “follow” other communities - so one community’s content also shows up on the other. And let me group communities together on my personal feed, if they don’t want to follow each other for some reason. For now, I stay mostly on the home page, which aggregates everything - but I’d much prefer to be able to browse by topic and still have some aggregation.
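Roughly the kind of aggregation I have in mind (the community names and the post/fetch function are made up for illustration): merge several communities’ feeds into one “topic” feed, after dropping the ones you’ve blocked.

```python
from heapq import merge

# Hypothetical "topic" group, as if I could follow technology as a whole.
technology = ["technology@lemmy.world", "tech@beehaw.org", "technology@lemmy.ml"]
blocked = {"technology@lemmy.ml"}      # e.g. one community whose moderation I dislike

def topic_feed(fetch_posts, communities, blocked=frozenset()):
    """Merge per-community feeds (each sorted newest-first) into one feed."""
    feeds = [fetch_posts(c) for c in communities if c not in blocked]
    # heapq.merge keeps the combined feed sorted without re-sorting everything.
    return merge(*feeds, key=lambda post: post["published"], reverse=True)
```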
Why? Colleges can still give preference to students who live in poor neighborhoods or bad school districts. What’s the problem with that approach?
It’s not really human rights violations that drive US sanctions anyway. There’s plenty of other countries that are rife with human rights violations, but don’t get sanctions. So long as they listen when US interests are concerned, they’re fine.
If I was mostly staying on the home page, looking at the aggregate feed, I wouldn’t care. But since I tend to browse by community, I see it as a big problem actually.
Whether a specific reason for defederating is a good idea depends on the instance IMO. I don’t think a “general purpose” instance should defederate on ideological grounds.
That said, they should defederate instances whose members are too disruptive. But right now we can only speculate about how hexbear members will behave. We’ll only know once they actually federate.