Am I reading those CUDA core projections right?
GA102 to AD102 increased by about 80%, but the jump from AD102 to GB202 is only slightly above 30%, on top of no large gains from going to 3nm?
Might not turn out that impressive after all.
It’s highly likely to be a major architecture update, so core count alone won’t be a good indicator of performance.
It isn’t a major architecture update. Nvidia’s slides from Ampere’s release stated that the next two architectures after Ampere would be part of the same family.
Performance gains will come from improving the RT and tensor cores, from using an improved node (probably N4X) to allow higher clock speeds at the same voltages, and from increasing the number of SMs across the product stack. The maturity of the 5nm process will allow Nvidia to use larger dies than they could with Ada.
lmao
and HW support for DLSS features and CUDA as a programming platform.
It might be “a major architecture update” in terms of the amount of work that Nvidia engineering will have to put in to pull off all the new features and RT/TC/DLSS/CUDA improvements without regressing PPA (power, performance, area). That’s where the years of effort will be sunk. It may also bring large perf improvements in selected application categories and operating modes, but only a very minor improvement in “perf per SM per clock” in no-DLSS rasterization on average.
‘Ampere Next’ referred to the datacenter lineup, which ended up being the biggest architectural change in datacenter GPUs since Volta vs GP100. And ‘Ampere Next Next’ referred to datacenter Blackwell, which is MCM, so again a big change.
True, completely forgot that there wasn’t a very large overhaul last gen.
Maybe GB202 is not the top chip, and the top chip is named GB200.
I mean, you’d expect this die to be called GB102 based on the recent numbering scheme, right? Why jump to 202 right out of the gate? They haven’t done that in the past: AD100 is the compute die and AD102, 103, 104… are the gaming dies. In fact this has been extremely consistent all the way back to Pascal. Even when there is a compute uarch variant that is different (and GP100 is quite different from GP102 etc.), it’s still called the 100.
But if there is another die above it, you’d call it GB100 (like Maxwell GM200, or Fermi GF100). Which is obviously already taken, GB100 is the compute die. So you bump the whole numbering series to 200, meaning the top gaming die is GB200.
There is also precedent for calling the biggest gaming die the x110, like GK110 or the Fermi GF110 (in the 500 series). But they haven’t done that in a long time, since Kepler. Probably because it ruins the “bigger number = smaller die” rule of thumb.
Of course it’s possible the 512b rumor was bullshit, or this one is bullshit. But it’s certainly an odd flavor of bullshit - if you were making something up, wouldn’t you make up something that made sense? Odd details like that potentially lend it credibility, because you’d call it GB102 if you were making it up. It will also be easy to corroborate across future rumors, if nobody ever mentions GB200-series chips again, then this was probably just bullshit, and vice versa. Just like Angstronomics and the RDNA3 leak, once he’d nailed the first product the N32/N33 information was highly credible.
Well, the 512-bit rumour was from kopite, and this is also kopite, saying he misinterpreted a 128MB L2$ to mean a 512-bit bus.
It has already leaked: GB200 is a chiplet design that will be exclusive to server customers. GB202 will be used for the 5090.
You should be looking at transistor amount if anything at all, “cuda cores” is only somewhat useful when looking at different products within the same generation.
Still very accurate if you know what to look for.
For example, knowing why CUDA cores scale differently in Ampere vs Turing lets you predict how an Ampere GPU will perform against a Turing GPU.
It’s also why we knew Ada would scale linearly, except for the 4090, which was nerfed to be more efficient.
I guess people don’t dig into white papers to learn how and why the architectures perform as they do.
without scaling DRAM bandwidth anywhere near as much, only partially compensating for that with a much bigger L2.
For the 5090, on the other hand, we might also get a clock increase (another 1.15x?) and a proportional 1:1 DRAM bandwidth increase (unlike Ampere -> Ada) by a factor of 1.5 thanks to GDDR7, with no bus width increase necessary (1.5 = 1.3 * 1.15). So this is a 1.5x perf increase from 4090 -> 5090, which has to be further multiplied by whatever u-architectural improvements might bring, like Qesa is saying.
Unlike Qesa, though, I’m personally not very optimistic about those u-architectural improvements being major. To get from the 1.5x that comes out of the node speed increase and the node shrink (subdued and downscaled by the node cost increase) to the recently rumored 1.7x, one would need a 13% perf and perf/W improvement (1.7 / 1.5 = 1.13), which sounds just about realistic. I’m betting it’ll be even a little less, yielding more like 1.6x as a proper average. That 1.7x might have been the result of measuring very few apps, or an outright “up to 1.7x” with the “up to” getting lost during the leak (if there even was a leak).
1.6x is absolutely huge, and it’s no wonder nobody’s increasing the bus width: it’s unnecessary for a great product and even more expensive now than it was on 5nm (DRAM controllers are big and barely shrink).
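As a rough sketch of the arithmetic in this post: the 1.3x is read here as the SM-count increase and the 1.15x as the clock increase, which is my interpretation of "1.5 = 1.3 * 1.15". The variable names are only for illustration, and treating the factors as independent multipliers is itself an assumption.

```python
def compound(*factors: float) -> float:
    """Multiply speed-up factors together, assuming they are independent."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result

# Figures quoted in the post (rumors, not measurements).
sm_scaling = 1.30      # assumed: ~30% more SMs
clock_scaling = 1.15   # assumed: ~15% higher clocks
rumored_total = 1.7    # the rumored 4090 -> 5090 uplift

# Uplift from SM count and clocks alone, before any architectural gains.
baseline = compound(sm_scaling, clock_scaling)

# Extra per-SM, per-clock gain the architecture would have to deliver
# for the rumored total to hold.
required_uarch_gain = rumored_total / baseline

print(f"baseline from SMs and clocks: {baseline:.3f}x")
print(f"uarch gain needed for {rumored_total}x: {required_uarch_gain:.3f}x")
```

Swapping `rumored_total` for 1.6 shows how much smaller the required architectural gain gets under the more conservative guess.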
More memory bandwidth does not translate 1:1 to more performance. The GPU core is by far the most important. Even at 4K the current 1TB/s memory bandwidth is sufficient and overclocking the core is what gets you the most performance.
We’ve also seen that the 128-bit 4060Ti 16GB with its pitiful bandwidth can utilize its full 16GB VRAM without any issues at 1440P.
So if you’re trying to estimate performance gains, the core is where you should look for now, especially if Blackwell keeps the increased L2 cache (Ampere’s L2 was only a few megabytes, so it was a radical change, and it definitely worked well for AMD with RDNA2 too). Unless you’re doing 8K gaming, the extra memory bandwidth will have minimal impact.
If I understand this article and what kopite7kimi said correctly, it sounds like a 33% cache increase, which he assumed meant a 33% memory controller increase. So 128MB, from which the 512-bit figure was originally derived. That doesn’t seem like a huge jump in cache compared to the current 96MB.
GDDR7 is supposed to start at 32Gbps, but there are also some claims of 36Gbps. If you average the cache (33%) and memory speed (60%) increases, we’re talking maybe 45% more effective bandwidth.
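The back-of-the-envelope average above, written out. This is only a sketch: the 60% memory speed figure is taken from the post as-is, and a plain average of cache growth and DRAM speed growth is a heuristic, not a real model of effective bandwidth.

```python
# Figures from the post (rumored, not confirmed).
current_l2_mb = 96
rumored_l2_mb = 128
memory_speed_gain = 0.60   # assumed per the post: ~60% faster memory

# Relative L2 growth: 96MB -> 128MB.
cache_gain = rumored_l2_mb / current_l2_mb - 1

# Naive "effective bandwidth" estimate: simple mean of the two gains.
effective_gain = (cache_gain + memory_speed_gain) / 2

print(f"cache gain: {cache_gain:.1%}")
print(f"averaged effective bandwidth gain: {effective_gain:.1%}")
```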
It’s because AD102 is already a huge monolithic die, a little over 600mm², with the reticle limit of roughly 858mm² (26mm x 33mm) being the theoretical maximum. In short: the bigger the die, the lower the yield. There will never be a gaming GPU much bigger than 600mm², because then you’re looking at terrifying prices.
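To put a rough shape on "bigger die, lower yield", here is a minimal sketch using the textbook Poisson yield approximation, Y = exp(-A * D0). The defect density of 0.1 defects/cm² is a made-up illustrative value, not a real foundry number, and the model ignores the fact that GPUs are usually sold with some units disabled to salvage partly defective dies.

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float = 0.1) -> float:
    """Fraction of defect-free dies under a simple Poisson model.

    defects_per_cm2 is a hypothetical value chosen for illustration.
    """
    area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)

for area in (300, 600, 800):
    print(f"{area} mm^2 die: {poisson_yield(area):.1%} defect-free")
```

On top of the yield drop, a bigger die also means fewer candidates per wafer, so the cost per good die climbs faster than the area does.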
A node shrink only helps so much, and you don’t want a GPU sucking 800 watts either. The wider memory bus also takes up extra space. The 5090 is still monolithic so a ~30% improvement in performance sounds plausible.
Even worse, AMD (understandably) is “skipping” RDNA4 high-end to both maximize AI production and give its engineers more time to get a chiplet design with multiple graphics chiplets working well.
RDNA5 will likely be high end again, but until then, the 7900XTX or a refresh of it will likely remain the fastest AMD card.
Which means next gen Nvidia pricing will go through the roof. We’re probably looking at a $999 16-20GB 5070(Ti) that matches AMD flagship performance, so I wouldn’t be surprised if a 24GB 5080 is priced at $1500-1750 MSRP as the gaming flagship and the hybrid 32GB 5090 at $2500-3000 MSRP. Remember the 3090Ti launched at $2000 despite gaming competition from the $1000 6950XT.
… And people will buy them. RIP GPU prices for the next 3 years, at least.
It’s expected to be like Ampere. Ampere was a 17% increase in SMs (RTX 3090 Ti vs Titan RTX), but the SM itself was improved such that it yielded about 33% improvement per SM in ‘raster’ and massive improvements in occupancy for RT workloads. So the 3090 Ti ended up 46% faster in ‘raster’ vs the Titan RTX.
The TPC and GPC of Blackwell are rumored to be overhauled, with a less certain rumor that the SM is also being improved.
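A quick sanity check on how those Ampere figures combine, using only the numbers quoted above. The variable names are just for illustration. Multiplied together, 17% more SMs and 33% per SM come to about 1.56x rather than 1.46x, so the 46% total would imply closer to 25% per SM; which of the quoted figures is rounded or off isn’t clear from the post.

```python
# Figures quoted in the post.
sm_count_gain = 1.17        # 17% more SMs
per_sm_gain_quoted = 1.33   # ~33% more raster performance per SM
total_gain_quoted = 1.46    # 46% faster overall

# What the two quoted factors give when multiplied together.
compounded = sm_count_gain * per_sm_gain_quoted

# What per-SM gain the quoted total would imply instead.
implied_per_sm = total_gain_quoted / sm_count_gain

print(f"SM count x per-SM gain: {compounded:.3f}x")
print(f"per-SM gain implied by the 46% total: {implied_per_sm:.3f}x")
```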
I would bet on it not being a significant upgrade as well. Performance per transistor in Ada vs Ampere actually went down, mostly because they spent so much transistor budget on cache: 170% more transistors for 65% more performance, if you assume the full potential AD102 die is 15% faster than a 4090. But even before AMD and Nvidia started playing with massive caches, you could always predict relatively accurately how fast a GPU would be based on transistor count.
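The performance-per-transistor claim, worked through as a small sketch. Both inputs are the figures from the post (170% more transistors, 65% more performance for a hypothetical full AD102); nothing here is measured.

```python
# Ada vs Ampere, per the post.
transistor_ratio = 1 + 1.70   # 170% more transistors -> 2.70x
performance_ratio = 1 + 0.65  # 65% more performance  -> 1.65x

# Performance delivered per transistor, relative to Ampere.
perf_per_transistor = performance_ratio / transistor_ratio

print(f"relative perf per transistor: {perf_per_transistor:.2f}x")
print(f"change vs Ampere: {perf_per_transistor - 1:+.0%}")
```

So under those assumptions each transistor delivers roughly 0.61x what it did on Ampere, which is the drop being described.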
Cache is not shrinking in die area with new nodes. If this thing has 128MB, and the die as a whole stays at 600mm², the area dedicated to logic would be smaller than on AD102. Unless they start stacking cache, that is. It would not shock me if most or all of the L2 is on a second layer.
I’ve heard 60% more logic density going from 5nm to 3nm, but who knows how optimistic those numbers are, as they are probably a best case scenario. I can’t help but imagine a real GPU would reach 40-50% more logic density at most, and I can’t imagine a performance uplift higher than that, unless they make it an 800mm² die, which I don’t believe.
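One way to see why a headline 60% logic density gain shrinks in a real chip is a blended estimate where only the logic part of the die scales. This is a sketch under stated assumptions: the 70% logic share is a hypothetical split, not a figure for any actual GPU, and cache/analog area is assumed not to shrink at all.

```python
def effective_density_gain(logic_fraction: float,
                           logic_scale: float,
                           other_scale: float = 1.0) -> float:
    """Overall density gain when only part of the die scales well.

    logic_fraction: share of the old die that is logic (hypothetical input).
    logic_scale:    density gain for logic on the new node (1.6 = 60% denser).
    other_scale:    density gain for cache/analog (1.0 = no shrink at all).
    """
    # Area of the same design on the new node, as a fraction of the old area.
    new_area = logic_fraction / logic_scale + (1 - logic_fraction) / other_scale
    return 1 / new_area

# Assumed split: 70% logic, 30% cache/analog that does not shrink.
gain = effective_density_gain(logic_fraction=0.7, logic_scale=1.6)
print(f"effective density gain: {gain:.2f}x")
```

With those inputs the whole-chip gain comes out noticeably below the headline logic figure, and it gets worse the larger the share of the die spent on cache.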