And it’s hard to tell what the difference is. Apple’s ‘built from the ground up for AI’ chips just have more RAM. What’s the difference with CPUs? Do they just have more onboard graphics processing that can also be used for matrix multiplication?
The stupid difference is supposed to be that they have some tensor math accelerators like the ones that have been on GPUs for three generations now. Except they’re small and slow and can barely run anything locally, so if you care about “AI” you’re probably using a dedicated GPU instead of an “NPU”.
And because local AI features have been largely useless, so far there is no software that will, say, take advantage of NPU processing for stuff like image upscaling while using the GPU tensor calculations for in-game raytracing or whatever. You’re not even offloading any workload to the NPU when you’re using your GPU, regardless of what you’re using it for.
For Apple stuff where it’s all integrated it’s probably closer to what you describe, just using the integrated GPU acceleration. I think there are some specific optimizations for the kind of tensor math used in AI as opposed to graphics, but it’s mostly the same thing.
Seems silly to try to get the CPU to do GPU stuff, just upgrade the GPU.
The idea is having tensor acceleration built into SoCs for portable devices so they can run models locally on laptops, tablets and phones.
Because, you know, server-side ML model calculations are expensive, so offloading compute to the client makes them cheaper.
But this gen can’t really run anything useful locally so far, as far as I can tell. Most of the demos during the ramp-up to these were thoroughly underwhelming and nowhere near what you get from server-side services.
Of course they could have just called the “NPU” a new GPU feature and made it work closer to how this is run on dedicated GPUs, but I suppose somebody thought that branding this as a separate device was more marketable.
EU should introduce regulation that prohibits client-side AI/ML processing for applications that require internet access. Show the cost upfront. Let’s see how many people pay for that.
That is a weird proposal.
It’s definitely weird that everyone is panicking about data center processing costs but not about the exact same hardware powering high end gaming devices that have skyrocketed from 100W to 450W in a few years, but ultimately if you want to run a model locally you can run a model locally. I’m not sure how you’d regulate that, it’s just software.
Hell, I don’t even think distributing the load is a terrible idea, it’s just that the models you can run locally on 40 TOPS kinda suck compared to the order of magnitude more processing you get on modern GPUs.
I’m not talking about stable diffusion or anything like that.
I meant that whatever Twitter has, or any similar chatbot or AI assistant feature in an app, should run server-side and not put a load on customers’ devices.
Yeah, no, I get the spirit of the thing. I’m just saying that… well, for one that it wouldn’t be a bad idea if it worked, it just doesn’t at the moment. But more importantly that regulations don’t work like that. You can’t just make rules that go “hey you guys specifically have to run this software on a server specifically”. You can already run assistants locally using a whole bunch of downloadable models, it’d be a huge overreach to tell people and companies that they CAN make the software and run it, but only remotely. That’s just… not how rules and regulations are put together.
Basically yes. They come with an NPU (Neural processing unit) which is hardware acceleration for matrix multiplications. It cannot do graphics. Slap whatever NPU into the chip, boom: AI laptop!
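To make the “hardware acceleration for matrix multiplications” bit concrete, here is a minimal sketch in plain Python of one fully connected neural-net layer. The function names (`matmul`, `dense_layer`) and the toy weights are made up for illustration and don't reflect how any actual NPU is programmed. The point is only that nearly all the work is multiply-and-add, which is the operation this kind of hardware does in bulk.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def relu(x):
    """Simple activation function: negative values become zero."""
    return max(0.0, x)


def dense_layer(inputs, weights, bias):
    """One fully connected layer: relu(inputs @ weights + bias)."""
    products = matmul(inputs, weights)  # this is where almost all the work is
    return [[relu(v + b) for v, b in zip(row, bias)] for row in products]


# Toy example (hypothetical numbers): 1 input with 3 features, mapped to 2 outputs.
x = [[1.0, 2.0, 3.0]]
w = [[0.5, -1.0],
     [0.25, 0.0],
     [-1.0, 1.0]]
b = [0.5, 0.5]
y = dense_layer(x, w, b)
print(y)
```

A real model just stacks many of these layers with much larger matrices, which is why a block of dedicated multiply-accumulate units is enough to call something an “AI” chip.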
Matrix multiplication is also largely what graphics cards do, I wonder how the NPUs are different.
Modern graphics cards pack a lot of functionality: shading units, ray tracing, video encoding/decoding. The NPU is just the part needed to accelerate neural nets.
But you can accelerate neural nets better with a GPU, right? They’ve got a lot more parallel matrix multiplication compute than any NPU you can slap on a CPU.
It all depends on the GPU. If it’s something integrated in the CPU it will probably not do better; if it’s a $2000 dedicated GPU with 48GB of VRAM it will be very powerful for neural net computing. NPUs are most often implemented as small, low-power, embedded solutions. Their goal is not to compete with data centers or workstations, it’s to enable some basic “AI” features on portable devices. E.g. a “smart” camera with object recognition to give you alerts.
The Apple chips also have a wide interface to the RAM. That means you can run chatbots (LLMs) and other AI workloads that are memory-bound at crazy speeds compared to an Intel (or AMD) computer.
Really? How fast is the memory bus compared to x86? And did they just double the bus bandwidth by doubling the memory?
I’m dubious because they only now went to 16GB of RAM as base, which has been standard on x86 for almost a decade.
Depending on the chip, they have somewhere from 100 to 400 GB/s. I’m not sure about the numbers for Intel processors. I think the consumer processors have about 50 to 80 GB/s (roughly Alder Lake, dual-channel DDR5). Mine seems to have way less. And a recent GPU will be somewhere in the range of 400 to 1000 GB/s. But consumer graphics cards stop at 24GB of VRAM and these flagship models are super expensive, even compared to Apple products.
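A rough way to see why those bandwidth numbers matter for LLMs: when a dense model generates text one token at a time, it has to read more or less all of its weights from memory for every token, so bandwidth divided by model size gives a ceiling on tokens per second. Back-of-the-envelope sketch below. The bandwidth figures are the ballpark numbers from this comment; the 7B-parameter, 4-bit model is a hypothetical example, and real speeds will land below this ceiling.

```python
def model_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (weights only; ignores KV cache and overhead)."""
    return params_billion * bits_per_weight / 8


def max_tokens_per_second(bandwidth_gb_s, size_gb):
    """Upper bound for memory-bound decoding: every token reads all weights once."""
    return bandwidth_gb_s / size_gb


# Hypothetical model: 7 billion parameters quantized to 4 bits per weight.
size = model_size_gb(7, 4)  # 3.5 GB of weights

# Ballpark bandwidths in GB/s, taken from the ranges quoted in the comment.
systems = {
    "x86 dual-channel DDR5 (low)": 50,
    "x86 dual-channel DDR5 (high)": 80,
    "Apple (low end)": 100,
    "Apple (high end)": 400,
    "recent GPU (high end)": 1000,
}

estimates = {name: max_tokens_per_second(bw, size) for name, bw in systems.items()}

for name, tps in estimates.items():
    print(f"{name:30s} <= {tps:6.1f} tokens/s")
```

The absolute numbers are optimistic, but the ratios between systems are the interesting part: with the same model, the ceiling scales directly with memory bandwidth.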
The people from the llama.cpp project did some measurements and I believe the Apple “Metal” framework seems to outperform the x86 computers by an order of magnitude or so. I’m not sure, it’s been some time since I skimmed the discussions on their GitHub page.
Apple is also much faster because the integrated graphics are actually usable for LLMs.
The base M is just a bit faster than an Intel/AMD laptop if you can get their graphics working. The M Pro is 2x as fast (as its memory bus is 2x as wide). The M Max is 4x as fast.
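The “bus is 2x as wide, so 2x as fast” part is just arithmetic: peak bandwidth is bus width times transfer rate. Quick sketch below. The 128/256/512-bit widths and the 6400 MT/s rate are assumptions picked for illustration, not confirmed specs for any particular chip, and the speed-up only carries over to LLMs as long as the workload is memory-bound.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak bandwidth: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000  # MB/s -> GB/s


RATE = 6400  # assumed memory transfer rate in MT/s

# Assumed bus widths in bits for each tier (illustrative only).
tiers = {"base": 128, "Pro": 256, "Max": 512}

bandwidth = {name: peak_bandwidth_gb_s(bits, RATE) for name, bits in tiers.items()}

for name, gb_s in bandwidth.items():
    ratio = gb_s / bandwidth["base"]
    print(f"{name:4s} {tiers[name]:3d}-bit  {gb_s:6.1f} GB/s  ({ratio:.0f}x base)")
```

Under those assumptions the three tiers come out at roughly 100, 200 and 400 GB/s, which lines up with the 100 to 400 GB/s range mentioned earlier in the thread.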
AMD is coming out with something more competitive in 2025 though, Strix Halo.