Really shows Intel’s vision and technological superiority. No reason not to choose Intel’s processors, as they are the gold standard of AI CPU design.
Microsoft doesn’t seem to advertise the acceleration capabilities of SPR in these instances. I’ve heard good things about AMX for inference. Here’s VMware accelerating a 7b model, but in the scope of large-scale training and inference, I don’t think there’s a role to be played by SPR’s accelerators.
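For reference, that kind of CPU-only AMX inference is usually wired up roughly like this. Just a minimal sketch assuming a Sapphire Rapids host with intel_extension_for_pytorch installed; the model name and prompt are placeholders, not the exact VMware setup:

```python
# Sketch of bf16 CPU inference that dispatches matmuls to AMX tiles on SPR.
# Assumes transformers + intel_extension_for_pytorch are installed; the model
# ID below is just an illustrative 7B checkpoint.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder 7B model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize() swaps in oneDNN kernels; bf16 GEMMs then run on AMX
# when the host CPU reports the amx flags (check `lscpu`).
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tok("Sapphire Rapids AMX inference test:", return_tensors="pt")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```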
I think it’s just a difference of platform validation time.
We’re beginning to see Genoa AI platforms roll out now from several CSPs.
Intel’s accelerator strategy and focus on memory bandwidth are paying off in a big way.
First time in a while I’ve seen Intel execute something well and catch AMD with their pants down, despite Sapphire Rapids being a lemon in most respects.
AMD’s data center sales are taking off, and Intel is still struggling.
SPR is not a competitive product for the vast majority of workloads. It’s fine here because nobody cares about the CPU performance. The cloud providers are probably paying a premium for Bergamo chips, and you don’t need a powerful 128-core part here.
This^
SPR is cheaper than Genoa and Bergamo, and the supply of those EPYC chips has not been as abundant as SPR’s.
There are advantages to SPR over Zen 4 EPYC in ML/AI workloads, and while MI300X will be doing the grunt work of the training and inference, some model weights/parameters could be offloaded to the CPU with minimal performance loss in the event the VRAM buffer overflows to system memory. CPU-only inference could also be tested for model performance on weaker hardware, or be utilized if all the MI300Xs are busy and there are unused CPU cycles (which is likely for these workloads). SPR generally outperforms Genoa in inference, so there’s some merit to its selection over the latter.
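To illustrate the offload idea, here’s a rough sketch of one common way to spill weights to system memory, assuming a Hugging Face transformers + accelerate stack; the model name and memory caps are made-up placeholders, not Microsoft’s actual configuration:

```python
# Sketch of spilling model weights to CPU RAM when they don't fit in VRAM.
# Assumes transformers + accelerate are installed; model ID and memory limits
# below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # hypothetical large model
tok = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" fills the GPU(s) first and places the remaining layers in
# system memory; max_memory caps what each device may hold.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "70GiB", "cpu": "512GiB"},  # illustrative limits only
)

prompt = "Offloading a few transformer layers to system memory"
# Embeddings assumed on GPU 0; PyTorch exposes ROCm devices as "cuda" as well.
inputs = tok(prompt, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```

Whether the spilled layers run acceptably then comes down to memory bandwidth and CPU matmul throughput, which is where the SPR-vs-Genoa comparison would actually matter.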
Regardless, though, this decision by Microsoft just boils down to cost and availability.
It’s more likely MS didn’t think taking a whole bunch of Genoa systems out of use elsewhere would be worth it. The backlog for Genoa throughout most of H1 made them nearly as unobtainable as H100s.
Most of the CPU time in these sorts of systems is usually taken up by relatively basic PCIe traffic management. More likely, SPR and Genoa are basically interchangeable as far as this is concerned, and SPR Xeon just had less opportunity cost.
If there were actually any special sauce that made a tangible difference with this type of setup, there would be an epic bum rush by everyone to buy up SPR Xeons to host all their H100s, but they’re clearly not doing that. Nvidia would also have made a far bigger and more public stink over Intel’s failure to deliver SPR on time, given that DGX-H100 depends on it.
Sapphire Rapids is better for AI. Intel spent silicon real estate on AI accelerators.
But what are the accelerators doing? AMX surely isn’t being used, since it contributes nothing next to eight massive GPUs? I doubt QAT has a use either, since these GPUs are going to be fed by P2P DMA from the 400 Gbps NICs they each have.