• 0 Posts
  • 13 Comments
Joined 1 year ago
Cake day: October 25, 2023


  • by improving the RT & tensor cores

    and HW support for DLSS features and CUDA as a programming platform.

    It might be “a major architecture update” in terms of the amount of work Nvidia engineering will have to put in to pull off all the new features and the RT/TC/DLSS/CUDA improvements without regressing PPA - that's where the years of effort will be sunk. That could yield large perf improvements in selected application categories and operating modes, but only a very minor improvement in “perf per SM per clock” in no-DLSS rasterization on average.



  • GA102 to AD102 increased by about 80%

    without scaling DRAM bandwidth anywhere near as much, only partially compensating for that with a much bigger L2.

    For the 5090, on the other hand, we might also have a clock increase in play (another 1.15x?) on top of a ~1.3x increase in unit count, plus, unlike Ampere -> Ada, a proportional 1:1 DRAM bandwidth increase by a factor of 1.5 thanks to GDDR7 (no bus width increase necessary; 1.5 ≈ 1.3 * 1.15). So that's a 1.5x perf increase 4090 -> 5090, which then has to be multiplied by whatever the u-architectural improvements bring, like Qesa is saying.

    Unlike Qesa, though, I'm personally not very optimistic about those u-architectural improvements being very major. To get from the 1.5x that comes out of the node speed increase and the node shrink (itself subdued and downscaled by the node cost increase) to the recently rumored 1.7x, one would need a further 13% perf and perf/W improvement (1.7 / 1.5 ≈ 1.13), which sounds just about realistic. I'm betting it'll be even a little bit less, yielding more like a proper 1.6x average; that 1.7x might have been the result of measuring very few apps, or an outright “up to 1.7x” with the “up to” getting lost during the leak (if there even was a leak). (The whole chain is collected in a quick arithmetic sketch at the end of this comment.)

    1.6x is absolutely huge, and no wonder nobody’s increasing the bus width: it’s unnecessary for yielding a great product and even more expensive now than it was on 5nm (DRAM controllers almost don’t shrink and are big).
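
    For anyone who wants to poke at these numbers, here's a minimal back-of-the-envelope sketch in Python of the chain above. The 1.3x unit-count and 1.15x clock factors are my assumptions from this thread, not leaked specs, so treat the outputs as the same rough guesses.

    ```python
    # Back-of-the-envelope scaling for the rumored 4090 -> 5090 jump.
    # Every input is a guess from the discussion above, not a leaked spec.

    unit_scale = 1.30       # assumed increase in SM count from the node shrink
    clock_scale = 1.15      # assumed clock increase on the faster node

    compute_scale = unit_scale * clock_scale   # ~1.5x raw shader throughput
    bandwidth_scale = 1.5                      # GDDR7 on an unchanged bus width

    rumored_total = 1.7                        # rumored overall perf uplift
    uarch_gain_needed = rumored_total / compute_scale

    print(f"compute scaling:   {compute_scale:.2f}x")
    print(f"bandwidth scaling: {bandwidth_scale:.2f}x (keeps compute vs. bandwidth roughly 1:1)")
    print(f"u-arch gain needed for {rumored_total}x total: {uarch_gain_needed:.2f}x")
    ```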




  • but all they would need to do is look at like the top 100 games played every year

    My main hypothesis on this subject: perhaps they already did, and out of the top 100 games only 2 could be accelerated via this method, even after exhaustively checking all possible affinities and scheduling schemes, and only on CPUs with 2 or more 4-core clusters of E-cores.

    The hypothesis is supported by the following observations:

    1. how many behavioral requirements the game threads might need to satisfy
    2. how temporally stable the thread behaviors might need to be, probably disqualifying apps with any in-app task scheduling / load balancing
    3. the signal that they possibly didn't find a single game where one 4-core E-cluster is enough (how rarely applicable must this be, if they apparently needed 2+ for… some reason?)
    4. the odd choice of Metro Exodus, as pointed out by HUB - it's a single-player game with very high visual fidelity, pretty far down the list of CPU-limited games (did nothing else benefit?)
    5. the fact that neither of the supported games (Metro and Rainbow 6) is based on either of the two most popular game engines (Unity and Unreal), possibly reducing how many apps could be hoped to show similar behavior and benefit.

    Now, perhaps the longer list of games they show on their screenshot is actually the list of games that benefit, and we only got 2 for now because those are the only ones they've figured out (at the moment) how to detect thread identities in (possibly not too far off from something as curious as this), or maybe that list is something else entirely and not indicative of anything. Who knows.

    And then there comes the discussion you're having about implementation, scaling, and maintenance, with its own can of worms. (For what the basic affinity-pinning experiment even looks like, see the sketch below.)
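
    To make the “checking all possible affinities” part above concrete, here is a minimal sketch of the per-process building block of such an experiment: pinning a chosen process to one 4-core E-core cluster and then measuring whether frame times change. This is not Intel's actual APO mechanism (APO steers individual threads), and the core indices are hypothetical - the real E-core numbering depends on the specific CPU's topology.

    ```python
    # Minimal affinity-pinning sketch (requires the psutil package).
    # NOT Intel's APO: APO schedules individual threads, while process-level
    # affinity is only the crudest approximation of the same idea.
    import psutil

    # Hypothetical logical-CPU ids of one 4-core E-core cluster; look up the
    # real mapping for your CPU before trying anything like this.
    E_CLUSTER_0 = [16, 17, 18, 19]

    def pin_process_to_cores(pid: int, cores: list[int]) -> None:
        """Restrict every thread of the given process to the listed cores."""
        psutil.Process(pid).cpu_affinity(cores)

    # Usage (12345 is a placeholder PID of a running game):
    # pin_process_to_cores(12345, E_CLUSTER_0)
    ```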





  • I wholeheartedly agree about MaxTech on average, but in this particular instance… the video is actually pretty good.

    Or rather, if one watches it with the right filter, the material that remains is quite good. Namely, one should discard all claims based on pure numerology - numbers from the config, the current year, the “RAM used” figure shown by Activity Monitor (for starters, a big chunk of that figure is very optional file cache, which only marginally improves perf but grows to use however much RAM you give it - and that's just the start of the problems with that number), etc. (A tiny sketch of the used-vs-available distinction is at the end of this comment.)

    The actual experiments with applications done on the machine, the performance measurements (seconds, etc.), and the demonstrations of responsiveness (switching from tab to tab on camera) are quite well done. In fact, the other YouTube videos on the subject rarely include quantifiable performance / timing measurements and limit themselves to demos (or pure handwaving and numerology).

    Of course, the conclusions, “recommendations”, etc. in their exact wording also need to be taken with half a metric ton of salt, but there is still a lot of surprisingly good signal in that video, as noted.
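
    As a quick, generic illustration of the file-cache point above (not anything taken from the video): the gap between a naive “used” reading and what applications can actually claim is easy to see with a cross-platform sketch like this.

    ```python
    # Rough illustration (requires the psutil package) of why raw "memory used"
    # readings overstate pressure: the OS parks file cache in otherwise idle RAM
    # and hands it back the moment applications actually need the memory.
    import psutil

    vm = psutil.virtual_memory()
    gib = 1024 ** 3
    print(f"total:     {vm.total / gib:6.1f} GiB")
    print(f"used:      {vm.used / gib:6.1f} GiB")
    print(f"free:      {vm.free / gib:6.1f} GiB   # completely untouched RAM")
    print(f"available: {vm.available / gib:6.1f} GiB   # free + reclaimable cache: what apps can claim")
    ```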


  • Let's put some science to it, shall we?

    Using Digital Foundry's video as the main source for ballpark perf estimates, it seems that in gaming applications, depending on the game, the M1 Max is anywhere from 2.1x to a staggering 4.5x slower than a desktop 3090 (a 350W GPU), with the geomean sitting at an embarrassing 2.76x. In rendering pro apps, on the other hand, using Blender as an example, the difference is quite a bit smaller (though still huge): 1.78x.

    From Apple's event today it seems pretty clear that the information on the generic slides pertains to gaming performance, and on the dedicated pro apps slides - to pro apps (with ray tracing). The M3 Max / M1 Max ratio in gaming applications therefore appears to be 1.5x, which would still put the M3 Max at 1.84x slower than the 3090 (2.76 / 1.5). Looks like it will take an M3 Ultra to beat the 3090 in games.

    In pro apps (rendering), however, M3 Max / M1 Max is declared to be a staggering 2.5x, moving the M3 Max from being 1.78x slower than the 3090 to being ~1.4x faster than it (2.5 / 1.78) - or, put another way, the 3090 delivering only 0.71x of the M3 Max's performance. (The whole chain of arithmetic is collected in a short sketch at the end of this comment.)

    Translating all of this to the 4000 series using TechPowerUp's ballpark figures, it appears that in gaming applications the M3 Max is going to be only very slightly faster than… a desktop 4060 (non-Ti; 115W). At the same time, the very same M3 Max is going to be a bit faster than a desktop 4080 (a 320W GPU) in ray-tracing 3D rendering pro applications (like Redshift and Blender).

    An added detail: a desktop 4080 is a 16 GB VRAM GPU, and the largest consumer-grade card - the 4090 - has 24 GB, while the M3 Max can be configured with up to 128 GB of unified RAM even in a laptop enclosure, of which probably about 100 GB or so will be available as VRAM - roughly 5x more than on the Nvidia side. As the other (unjustly downvoted) commenter said, that makes a number of pro tasks that are comically impossible on Nvidia (they simply do not run) very much possible on the M3 Max.

    So, anywhere from a desktop 4060 to a desktop 4080 depending on the application: in games, a 4060; in pro apps, “up to a 4080” depending on the app (and a 4080 in at least some of the ray-tracing 3D rendering applications).

    Where that puts a CAD app I've no idea - probably something like 1/3 of the way from the games figure and 2/3 of the way from the pro apps figure? Something like 1.45x slower than a desktop 3090? Which puts it somewhere between a desktop 4060 Ti and a desktop 4070.

    I’m sure you can find how to translate all of that from desktop nvidia cards used here to their laptop variants (which are very notably slower).

    I have to highlight for the audience once again the absolutely massive difference in performance improvement between games and 3D rendering pro apps: M3 Max / M1 Max, as announced by Apple today, is 1.5x in games but 2.5x in 3D rendering pro apps - and, relative to Nvidia, the M1 Max was already noticeably slower in games than it presumably should have been, given how it performed in 3D rendering apps.
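
    For transparency, here is the entire chain of ballpark arithmetic from this comment in one place. Every input is one of the rough estimates quoted above (the Digital Foundry figures and Apple's claimed 1.5x / 2.5x uplifts), so the outputs are the same rough guesses, not measurements.

    ```python
    # Ballpark arithmetic from the comment above; all inputs are rough
    # estimates from Digital Foundry's testing and Apple's announced uplifts.

    m1max_vs_3090_games = 2.76     # M1 Max ~2.76x slower than a desktop 3090 in games (geomean)
    m1max_vs_3090_render = 1.78    # ...and ~1.78x slower in Blender-style rendering

    m3_uplift_games = 1.5          # Apple's claimed M3 Max / M1 Max uplift in games
    m3_uplift_render = 2.5         # ...and in ray-traced 3D rendering pro apps

    m3max_vs_3090_games = m1max_vs_3090_games / m3_uplift_games      # ~1.84x slower
    m3max_vs_3090_render = m3_uplift_render / m1max_vs_3090_render   # ~1.40x faster

    print(f"M3 Max in games:     {m3max_vs_3090_games:.2f}x slower than a desktop 3090")
    print(f"M3 Max in rendering: {m3max_vs_3090_render:.2f}x faster than a desktop 3090 "
          f"(3090 at {1 / m3max_vs_3090_render:.2f}x of the M3 Max)")

    # Very rough CAD guess: go 1/3 of the way from the games ratio toward the
    # rendering ratio (both expressed as 3090 speed / M3 Max speed).
    games_ratio = m3max_vs_3090_games
    render_ratio = 1 / m3max_vs_3090_render
    cad_ratio = games_ratio + (render_ratio - games_ratio) / 3
    print(f"CAD-ish guess:       ~{cad_ratio:.2f}x slower than a desktop 3090")
    ```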