There are VERY FEW fully open LLMs. Most are the equivalent of source-available in licensing and at best, they’re only partially open source because they provide you with the pretrained model.
To be fully open source they need to publish both the model and the training data. The importance is being “fully reproducible” in order to make the model trustworthy.
In that vein there’s at least one project that’s turning out great so far:
Not just LLMs but all kinds of models are equivalent to freeware, aka the model itself and other essential bits for it to work. I won’t even call it source available as there is no source.
Take Redis as an example. I can still go grab the source and compile a binary that works. This doesn’t apply to ML models.
Of course one can argue the training process isn’t deterministic, so even with the exact training corpus it can’t produce a bit-identical model across multiple runs. However, I would argue the same corpus provides the chance to train a model of similar or equivalent performance. Hence the openness of the training corpus is an absolute requirement for a model to qualify as FOSS.
I’ve seen this said multiple times, but I’m not sure where the idea that model training is inherently non-deterministic is coming from. I’ve trained a few very tiny models deterministically before…
You sure you can train a model deterministically down to each bit? Like feeding them into sha256sum will yield the same hash?
Yes of course, there’s nothing gestalt about model training, fixed inputs result in fixed outputs
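To make the sha256sum question concrete, here is a minimal sketch, assuming a toy two-parameter model trained on the CPU in plain Python. The function names and the seed values are made up for illustration. It only shows that seeded training on the same code and machine gives byte-identical weights; large GPU runs need extra care (for example, parallel floating-point reductions can change summation order), which this toy does not cover.

```python
import hashlib
import random
import struct


def train_tiny_model(seed: int, steps: int = 200) -> list[float]:
    """Fit y = 2x + 1 with plain SGD; all randomness comes from `seed`."""
    rng = random.Random(seed)  # private RNG, so no global state leaks in
    w = rng.uniform(-1, 1)     # seeded weight initialisation
    b = rng.uniform(-1, 1)
    learning_rate = 0.05
    for _ in range(steps):
        x = rng.uniform(-1, 1)  # seeded sampling of the "training data"
        target = 2.0 * x + 1.0
        error = (w * x + b) - target
        w -= learning_rate * error * x
        b -= learning_rate * error
    return [w, b]


def weights_digest(weights: list[float]) -> str:
    """Hash the exact IEEE-754 bytes, like running sha256sum on a checkpoint."""
    raw = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(raw).hexdigest()


run_a = weights_digest(train_tiny_model(seed=42))
run_b = weights_digest(train_tiny_model(seed=42))
print("same seed, same hash:", run_a == run_b)
```

The point of hashing the packed bytes rather than comparing printed floats is that it matches the question as asked: two runs count as identical only if every bit agrees.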
Fortunately, LLMs don’t really need to be fully open source to get almost all of the benefits of open source. From a safety and security perspective it’s fine because the model weights don’t really do anything; all of the actual work is done by the framework code that’s running them, and if you can trust that due to it being open source you’re 99% of the way there. The LLM model just sits there transforming the input text into the output text.
From a customization standpoint it’s a little worse, but we’re coming up with a lot of neat tricks for retraining and fine-tuning model weights in powerful ways. The most recent big development I’ve heard of is abliteration, a technique that lets you isolate a particular “feature” of an LLM and either enhance it or remove it. The first big use of it is to modify various “censored” LLMs to remove their ability to refuse to comply with instructions, so that all those “safe” and “responsible” AIs like Goody-2 can be turned into something that’s actually useful. A more fun example is MopeyMule, a LLaMA3 model that has had all of his hope and joy abliterated.
So I’m willing to accept open-weight models as being “nearly as good” as a full-blown open source model. I’d like to see full-blown open source models develop more, sure, but I’m not terribly concerned about having to rely on an open-weight model to make an AI system work for the immediate term.
I suppose the importance of the openness of the training data depends on your view of what a model is doing.
If you feel like a model is more like a media file that the model loaders are playing back, where the prompt is more of a type of control over how you access this model, then yes, I suppose from a trustworthiness aspect there’s not much to the model’s training corpus being open.
I see models more in terms of how any other text encoder or serializer would work, if you were, say, manually encoding text. While there is a very low chance of any “malicious code” being executed, the importance is in the fact that you can check the expectations about how your inputs are being encoded against what the provider is telling you.
As an example attack vector, much like any malicious-replacement attack: if I were to download a pre-trained model from what I thought was a reputable source, but was man-in-the-middled and provided with a maliciously trained model, suddenly the system I was relying on that uses that model is compromised in terms of the expected text output. Obviously that exact problem could be fixed with some hash checking, but I hope you see that in some cases even that wouldn’t be enough (such as malicious “official” provenance).
As these models become more prevalent, being able to guarantee integrity will become more and more of an issue.
Even if you trained the AI yourself from scratch you still can’t be confident you know what the AI is going to say under any given circumstance. LLMs have an inherent unpredictability to them. That’s part of their purpose, they’re not databases or search engines.
if I were to download a pre-trained model from what I thought was a reputable source, but was man-in-the-middled and provided with a maliciously trained model
This is a risk for anything you download off the Internet, even source code could be MITMed to give you something with malicious stuff embedded in it. And no, I don’t believe you’d read and comprehend every line of it before you compile and run it. You need to verify checksums.
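For anyone who hasn’t done the checksum step before, a rough sketch of it in Python follows. The helper names and the stand-in file are made up for illustration; a real check would use the digest the publisher lists, ideally fetched over a different channel than the download itself. As noted earlier in the thread, this only proves you got the file the publisher intended, not that the publisher’s model is benign.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte weights never sit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, published_hex: str) -> bool:
    """Compare the local file's digest with the one the publisher lists."""
    return hmac.compare_digest(sha256_of_file(path), published_hex.strip().lower())


# Demo with a stand-in "model file"; real weights would be a downloaded checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.bin"
    model_path.write_bytes(b"pretend these are weights")
    published = sha256_of_file(model_path)  # what the publisher would list
    intact_ok = verify_download(model_path, published)

    model_path.write_bytes(b"pretend these are w3ights")  # simulate tampering
    tampered_ok = verify_download(model_path, published)

print("intact file accepted:", intact_ok)
print("tampered file accepted:", tampered_ok)
```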
As I said above, the real security comes from the code that’s running the LLM model. If someone wanted to “listen in” on what you say to the AI, they’d need to compromise that code to have it send your inputs to them. The model itself can’t do that. If someone wanted to have the model delete data or mess with your machine, it would be the execution framework of the model that’s doing that, not the model itself. And so forth.
You can probably come up with edge cases that are more difficult to secure, such as a troubleshooting AI whose literal purpose is messing with your system’s settings and whatnot, but that’s why I said “99% of the way there” in my original comment. There’s always edge cases.
what about redistributability?
That would be part of what’s required for them to be “open-weight”.
A plain old binary LLM model is somewhat equivalent to compiled object code, so redistributability is the main thing you can “open” about it compared to a “closed” model.
An LLM model is more malleable than compiled object code, though; as I described above, there are various ways you can mutate an LLM model without needing its “source code.” So it’s not exactly equivalent to compiled object code.
Is abliteration based off the research by the Anthropic team? When they got Claude to say it was the golden gate bridge?
Ironically, as far as I’m aware it’s based off of research done by some AI decelerationists over on the alignment forum who wanted to show how “unsafe” open models were in the hopes that there’d be regulation imposed to prevent companies from distributing them. They demonstrated that the “refusals” trained into LLMs could be removed with this method, allowing it to answer questions they considered scary.
The open LLM community responded by going “coooool!” and adapting the technique as a general tool for “training” models in various other ways.
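For the curious, the core arithmetic behind this kind of feature editing can be sketched in a few lines. This is a toy under stated assumptions, not the actual tooling: the 3-number “activations” are invented, the helper names are hypothetical, and real implementations work on a transformer’s internal activations or weights. The idea, as I understand it, is to estimate a direction as the difference between mean activations with and without the feature, then project that direction out (or add more of it).

```python
import math

Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: list[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def feature_direction(with_feature: list[Vector], without_feature: list[Vector]) -> Vector:
    """Unit vector pointing from the 'without' mean toward the 'with' mean."""
    diff = [a - b for a, b in zip(mean(with_feature), mean(without_feature))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def ablate(activation: Vector, direction: Vector) -> Vector:
    """Remove the component along `direction`: a - (a . d) d."""
    scale = dot(activation, direction)
    return [a - scale * d for a, d in zip(activation, direction)]


def enhance(activation: Vector, direction: Vector, strength: float) -> Vector:
    """Push the activation further along `direction`: a + strength * d."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Invented activations: the second coordinate is where the "feature" lives.
refusing = [[1.0, 2.0, 0.5], [1.2, 1.8, 0.4]]
complying = [[1.0, 0.0, 0.5], [1.2, 0.2, 0.4]]

direction = feature_direction(refusing, complying)
sample = [0.3, 0.9, -0.2]
ablated = ablate(sample, direction)

print("direction:", direction)
print("component before:", dot(sample, direction))
print("component after: ", dot(ablated, direction))
```

The same direction serves both uses mentioned above: `ablate` removes the feature, `enhance` exaggerates it.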
The importance is being “fully reproducible” in order to make the model trustworthy.
Well that’s a problem, because even with training data that’s impossible by design.
I’m not sure where you get that idea. Model training isn’t inherently non-deterministic. Making fully reproducible models is 360ai’s apparent entire modus operandi.
Check out the Dolphin-trained LLMs, he did one for Mistral and one for Phi-2. Uncensored and OSS.
If a layman may ask, what are folks even using AI/LLMs for mostly? Aside from playing around with some for 10-15 mins out of simple curiosity, I don’t have a practical use for platforms like ChatGPT. I’m just wondering what the average tech enthusiast uses these for, outside of academia.
I teach language. I get paid for my time in front of students, not the time it takes to prepare their lessons and the materials. I use AI to quickly reference grammar rules, to fabricate example dialogs in specific scenarios to practice, and to suggest activities to do in class to practice the target grammar. I never do exactly as it says, just take it as kind of a source of suggestions for me to build from.
That sounds like a time saver for sure. I imagine that some of those elements (grammar rules) are widely available everywhere, while others (practice dialogues, activity suggestions focused on the use of language) would require a fairly specific training model.
Well, LLMs are quite literally trained on language, so asking it to simulate a conversation between a hotel clerk and a guest who is upset that they can’t find the hair dryer is pretty much what it’s best at doing.
You can even build the dialogs with students. Have them introduce a scenario for the LLM to manufacture, then have the students suggest variables to apply, such as the clerk being hungry and in a bad mood while the guest is actually drunk after returning from a club in order to see how the language changes, then have the students act it out for laughs.
A friend of mine and I have gotten used to using it during our conversations. We do fast fact-checking or find a good first opinion regarding silly topics. We often find it faster than digging through search-engine results and interpreting scattered information. We have used it for thought experiments, intuitive or ELI5 explanations of topics that we don’t really know about, finding peer-reviewed sources for whatever it is that we’re interested in, or asking questions that would be harder to operationalize into effective search-engine queries than to ask in natural language. We always, always ask for citations and links, so that we can discard hallucinations.
Thanks for sharing! I’m probably too set in my ways to ever utilize AI for things like this. I never use virtual assistants like Alexa or Google either, as I like to vet and interpret the source of information myself. Having the citations would be handy, but ultimately I’d want to read them myself so the AI/VA just becomes an added step.
we use it to classify data that needs to be sent to one of three endpoints. chatgpt tells our tool where it belongs. there are probably more practical ways to do this, but the customer wanted AI in his product so here we are 🤷
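A setup like that can be sketched roughly as below. Everything here is hypothetical: the endpoint names and URLs are placeholders, and the classifier is passed in as a plain function so that no particular vendor API is assumed (the keyword version is just a stand-in for the LLM call). The one design point worth showing is that the model’s answer is checked against an allow-list, with a fallback, because an LLM can return a label you never defined.

```python
from typing import Callable

# Hypothetical endpoints; the real ones are not given in the thread.
ENDPOINTS = {
    "billing": "https://example.invalid/billing",
    "support": "https://example.invalid/support",
    "sales": "https://example.invalid/sales",
}
FALLBACK_LABEL = "support"


def route(record: str, classify: Callable[[str], str]) -> str:
    """Return the endpoint URL for `record`, never trusting the label blindly."""
    label = classify(record).strip().lower()
    if label not in ENDPOINTS:
        label = FALLBACK_LABEL  # unknown or malformed answer from the model
    return ENDPOINTS[label]


def keyword_classifier(record: str) -> str:
    """Stand-in for the LLM call: a deliberately boring keyword match."""
    text = record.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "quote" in text or "pricing" in text:
        return "sales"
    return "support"


print(route("Please resend my invoice", keyword_classifier))
print(route("anything", lambda _record: "Somewhere Else Entirely"))
```

Swapping `keyword_classifier` for a function that calls a hosted or local model leaves `route` unchanged, which also makes the “more practical ways” easy to compare.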
What’s FOSS-AI? A model everyone can download and use for free? Or, in the OSS spirit, that everything needs to be open and without discrimination of use, aka an OSS training data corpus and no AUP attached?
Or you mean the inference engine running those models?
Everything which is not BigTech. Preferably FOSS, at least not BigTech, just alternatives to for example OpenAI.
So you’re including free models like freeware, not FOSS only, by non big tech.
Your choice of models will be quite limited, as the compute resources and training corpus needed to make a viable base model aren’t something just anyone can put together.
Agreed, but there have been big projects that have been open source. I can imagine* an AI (LLM) being developed fully FOSS. It would be rare, but I can see it happening if a big foundation got behind it. Maybe Mozilla, or another that tries to keep the spirit of their mission statement.
*Imagine: I’m not too familiar with all of the current, public, and free models out there, just a few. This was just me making a hopeful guess about if it might be actually happening now.
Edward Snowden isn’t god
I know that’s a shock to some…
Anyway I think he means self hosted options. I would recommend Ollama with a frontend
LM Studio https://github.com/lmstudio-ai/lms
I wouldn’t use it personally
I’m just convinced all of y’all asking about this are in a huge circle jerk that never ends, but refuses to understand how it all works.
A model is a model. It’s a simplified way of narrowing down thresholds of confidences. It’s a pretty basic sorting algorithm that runs super fast on accelerated hardware.
You people seem to think it’s like fucking magic that steals your soul.
Don’t send information over the wire, and you’re golden. Learn how it works, and stop asking dumb questions like this is all brand new, PLEASE.
There is a difference between a general scare about the AI buzzword and legitimate distrust in online services which are closely connected to American spying institutions (regardless of whether they are AI or not).
If my calorie tracker app appointed a (former) NSA official to its board, I would be looking for alternatives too. This is not about AI, this is about a company with huge sets of private data being closely interconnected with American spy institutions.
Sad that you don’t seem to be able to distinguish between legitimate security questions and badly informed hypes/scares as soon as a buzzword like AI occurs.
Read the last part of my comment again. Seems I very clearly grasp the concern.
I did read this part, and while this is generally true, there are use cases of such large models. Some of them require the input of personal data (find bugs in my code, formalize this email, scan this picture for text and translate it, draw an anime version of this picture of my friend tom)
So people being wary of the security implications of such large models are certainly not
in a huge circle jerk that never ends, but refuses to understand how it all works.
Sure you can just call them all dumb using ai like the mainstream (putting in personal data) and attribute it to an unwillingness to understand, but this doesn’t match the reality. Most people don’t even understand how an operating system functions, which components work online and which offline and who can access which of their information, let alone know how “AI” works and what the security implications are.
So if people ask those questions, hoping there are alternatives they can use safely, your answer of “no, u just dumb, machine can’t harm you, its not magic, just don’t put in data in”
is not only rude but also missing the point. Most useful/fun/mainstream ways DO, in fact, put in data.
Your explaining basic models also doesn’t help, as the concern here is not mainly/only the model but American spy institutions being able to access all the prompts you put in, maybe categorizing you into personality clusters depending on your usage of language, or assigning tags for which political stance a user has (and with entities like the NSA I could imagine far worse).
Also “A model is a model” is not very accurate in such cases. When someone has control and secrecy over each aspect of the model, it would be very well possible for entities like the NSA to manipulate the content the model puts out in arbitrary directions. A government controlling and manipulating information the public receives is a red flag for a lot of people (rightfully so IMHO).
How are people supposed to get better in digital privacy topics if you just tell them to shut up and insult them when they ask questions trying to learn? You acting like you are in your ivory tower of genius isn’t helping anyone.
I’m not calling anybody dumb. I’m saying they’re being willfully ignorant and assuming this is all brand new tech that is mysterious, rather than learning about how it works.
A lot of people are hyped by the “Hype & PR” machine right now instead of being (appropriately IMHO) suspicious and using critical thinking.
Your comment certainly feels like you look/kick down on people instead of giving them a helping hand getting up.
With your attitude you are driving away people who want to do exactly what you want of them: educating theirselfs.
You are being contra productive to your own demands is what I’m saying
Not sure I should take a lot of interest in the thoughts of someone on the Internet trying to act socially or intellectually superior, but using the word “theirselfs”, which is not a word.
I am definitely “Contra productive” though. I can beat it on a single life with no Konami codes 🫰
If your argument comes down to shaming people for not being native English speakers, I think everything has been said here.
A model is that feeling you feel after eating baby carrots
A model is the smoothness of a dog on leather
A model is a salad that tastes like mud
A model is…
My documented process https://fabien.benetou.fr/Content/SelfHostingArtificialIntelligence but honestly I just tinker with this. Most of that isn’t useful IMHO except some pieces, e.g. STT/TTS, from time to time. The LLM aspect itself is too unreliable, and I do like 2 relatively recent papers on the topic, namely:
- No “Zero-Shot” Without Exponential Data https://arxiv.org/abs/2404.04125
- ChatGPT is bullshit https://link.springer.com/article/10.1007/s10676-024-09775-5
which are respectively saying that the long tail makes it practically impossible to train AI to be correct in rare cases, and that “hallucinations” is a marketing misnomer that should be replaced by “bullshit”: output meant to convince people without caring for veracity.
Still, despite all this criticism it is a very popular topic, hyped up to be the “future” of computing. Consequently I did want to both try and help others to do so rather than imagine that it was restricted to a kind of “elite”. I try to keep the page up to date but so far, to be honest, I do it mostly defensively, to be able to genuinely criticize because I did take the time to try, not reject it wholesale.
PS: I also try the state of the art, both closed- and open-source, via APIs, e.g. OpenAI or Mistral, but only for evaluation purposes, not as tools that are part of my daily usage.