I didn’t expect that an 8B-F16 model weighing 16GB on disk could run on my laptop with only 16GB of RAM and an integrated GPU. It was painfully slow, around 0.3 t/s, but it ran. Then I learned that you can effectively run a model straight from storage without loading it into memory, and I confirmed that this was exactly what was happening: memory usage stayed constant at around 20% whether the model was running or not. The problem is that gpt4all-chat runs every model larger than 1.5B this way, and the difference is huge, since the 1.5B model runs at 20 t/s. Even a distilled 6.7B_Q8 model with roughly 7GB on disk, which has plenty of room (12GB of RAM free), didn’t move the memory usage and was also very slow (3 tokens/sec). I’m pretty new to this field, so I’m probably missing something basic, but I just followed the instructions for downloading and compiling it.
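For what it’s worth, the behavior you’re describing sounds like memory-mapped loading: the weights stay on disk and get paged in on demand, so RAM usage barely changes but every cache miss costs a disk read. Backends built on llama.cpp expose knobs for this. Here’s a minimal sketch using llama-cpp-python (not gpt4all-chat itself, and the model path is just a placeholder) to illustrate the two settings involved:

```python
from llama_cpp import Llama

# use_mmap=True  -> weights are mapped from disk and paged in lazily (low RAM use, can be slow)
# use_mlock=True -> ask the OS to pin the mapped weights in RAM so they aren't evicted
#                   (only helps if the model actually fits in free memory)
llm = Llama(
    model_path="./models/model-q8_0.gguf",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=0,   # integrated GPU: keep everything on CPU
    use_mmap=True,
    use_mlock=True,
)

out = llm("Explain memory mapping in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

I don’t know whether gpt4all-chat exposes an equivalent toggle in its UI, but this is the mechanism underneath, so that’s where I’d look first.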
I’ll vouch for Koboldcpp. I currently use the CUDA version, and it has most of what you’d need to dial in settings that work for you. Just remember to save what works best as a .kcpps, or else you’ll be entering it manually every time you boot it up (though saving doesn’t work on Linux afaik, and it’s a pain that it doesn’t).