GTC Nvidia’s Blackwell GPU architecture is barely out of the cradle – and the graphics chip giant is already looking to extend its lead over rival AMD with an Ultra-themed refresh of the technology.
Announced on stage at Nvidia’s GPU Technology Conference (GTC) in San Jose, California, on Tuesday by CEO and leather jacket aficionado Jensen Huang, the Blackwell Ultra family of accelerators boasts up to 15 petaFLOPS of dense 4-bit floating-point performance and up to 288 GB of HBM3e memory per chip.
And if you’re primarily interested in deploying GPUs for AI inference, that’s a bigger deal than you might think. While training is generally limited by how much compute you can throw at the problem, inference is primarily a memory-bound workload. The more memory you have, the bigger the model you can serve.
According to Ian Buck, Nvidia veep of hyperscale and HPC, Blackwell Ultra will enable reasoning models such as DeepSeek-R1 to be served at 10x the throughput of the Hopper generation, meaning queries that previously took upwards of a minute to answer can now be completed in as little as ten seconds.
With 288 GB of capacity across eight stacks of HBM3e memory onboard, a single Blackwell Ultra GPU can now run substantially larger models. At FP4, Meta’s Llama 405B could fit on a single GPU with plenty of vRAM left over for key-value caches.
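The back-of-the-envelope math here is ours, not Nvidia's, but it's straightforward: at four bits per weight, a 405-billion-parameter model needs roughly 200 GB just for its weights.

```python
# Rough sketch of the fit: FP4 weight footprint of Llama 405B vs 288 GB of
# HBM3e. Our arithmetic, not Nvidia's; decimal gigabytes, weights only.
params = 405e9           # parameter count of Meta's Llama 405B
bytes_per_param = 0.5    # FP4 = 4 bits = half a byte per weight

weights_gb = params * bytes_per_param / 1e9  # ~202.5 GB of weights
headroom_gb = 288 - weights_gb               # ~85.5 GB for KV cache, activations
print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

That leaves something like 85 GB free – hence the "plenty of vRAM left over" for key-value caches.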
To achieve this higher capacity, Nvidia’s Blackwell Ultra swapped last-gen’s eight-high HBM3e stacks for fatter 12-high modules, boosting capacity by 50 percent. However, we’re told that memory bandwidth remains the same at a still class-leading 8 TB/s.
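The capacity bump falls directly out of the taller stacks. Here's the arithmetic, assuming the 24 Gbit (3 GB) DRAM dies typical of HBM3e – our math, not Nvidia's spec sheet:

```python
# Per-GPU HBM3e capacity from stack height. Assumes 3 GB (24 Gbit) dies,
# standard for HBM3e; our arithmetic, not Nvidia's spec sheet.
stacks = 8      # stacks per GPU, unchanged from last gen
gb_per_die = 3  # 24 Gbit HBM3e DRAM die

b200_gb = stacks * 8 * gb_per_die    # eight-high stacks: 192 GB
ultra_gb = stacks * 12 * gb_per_die  # 12-high stacks: 288 GB, a 50 percent bump
```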
If any of this sounds familiar, it should: Nvidia followed a similar playbook with the H200, which was essentially just an H100 with faster, higher-capacity HBM3e onboard. This time around, though, Nvidia isn't just strapping on more memory – it has also juiced peak floating-point performance by 50 percent, at least for FP4 anyway.
Nvidia tells us that FP8 and FP16/BF16 performance is unchanged from last gen.
More memory, more compute, more GPUs
While many have fixated on Nvidia's $30,000 or $40,000 chips, it's worth remembering that Hopper, Blackwell, and now its Ultra refresh aren't one chip so much as a family of products running the gamut from PCIe add-in cards and servers to rack-scale systems and even entire supercomputing clusters.
In the datacenter, Nvidia will offer Blackwell Ultra in both its more traditional HGX servers and its rack-scale NVL72 offerings.
Nvidia's HGX form factor has, at least for the past few generations, featured up to eight air-cooled GPUs stitched together by a high-speed NVLink switch fabric. This time, however, Nvidia has opted to cram twice as many GPUs into a box, in a config it's calling the B300 NVL16.
According to Nvidia, the Blackwell-based B300 NVL16 will deliver 7x the compute and 4x the memory capacity of its most powerful Hopper systems, which, by our calculation, works out to 112 petaFLOPS of dense FP4 compute and 4.6 terabytes of HBM3e memory capacity. However, this also suggests that each B300 GPU's floating-point performance will top out at just 7 petaFLOPS of dense FP4 – the same as the Blackwell B100-series chips Nvidia announced last year.
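Those figures are easy to cross-check – again, this is our arithmetic, derived from Nvidia's 7x and 4x claims:

```python
# Cross-checking the B300 NVL16 figures; our math, not Nvidia's.
gpus = 16
total_fp4_pflops = 112   # 7x Nvidia's most powerful Hopper system
hbm_per_gpu_gb = 288

per_gpu_pflops = total_fp4_pflops / gpus     # 7 PFLOPS dense FP4 per GPU
total_hbm_tb = gpus * hbm_per_gpu_gb / 1000  # 4.608 TB, quoted as 4.6 TB
```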
For even larger workloads, Nvidia will also offer the accelerators in its Superchip form factor. Like last year's GB200, the GB300 Superchip pairs two Blackwell Ultra GPUs – with a combined 576 GB of HBM3e memory – with a 72-core Arm-compatible Grace CPU.
Up to 36 of these Superchips can be stitched together using Nvidia’s NVLink switches to form an NVL72 rack-scale system. But rather than the 13.5 terabytes of HBM3e of last year’s model, the Grace-Blackwell GB300-based systems will offer up to 20 terabytes of vRAM. What’s more, Buck says the system has been redesigned for this generation with an eye toward improved energy efficiency and serviceability.
And if that’s still not big enough, eight of these racks can be combined to form a GB300 SuperPOD system containing 576 Blackwell Ultra GPUs and 288 Grace CPUs.
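Scaling these configs up is simple multiplication – a quick sanity check of the rack and pod tallies, using our own arithmetic from the quoted specs:

```python
# NVL72 rack and GB300 SuperPOD tallies; our math from the quoted specs.
superchips_per_rack = 36
gpus_per_superchip = 2
hbm_per_gpu_gb = 288
racks_per_pod = 8

gpus_per_rack = superchips_per_rack * gpus_per_superchip  # 72 GPUs per rack
rack_hbm_tb = gpus_per_rack * hbm_per_gpu_gb / 1000       # ~20.7 TB, "up to 20 TB"
pod_gpus = racks_per_pod * gpus_per_rack                  # 576 Blackwell Ultra GPUs
pod_cpus = racks_per_pod * superchips_per_rack            # 288 Grace CPUs
```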
Where does this leave Blackwell?
Given Blackwell Ultra's larger memory capacity, it'd be easy to look at Nvidia's line-up and question whether the chip will end up cannibalizing shipments of the non-Ultra variant.
However, the two platforms are clearly aimed at different markets, with Nvidia presumably charging a premium for its Ultra SKUs.
In a press briefing ahead of Huang's keynote address today, Nvidia's Buck described three distinct AI scaling laws – pre-training scaling, post-training scaling, and test-time scaling – each of which requires compute resources to be applied in different ways.
At least on paper, Blackwell Ultra's higher memory capacity should make it well suited to the third of these regimes, test-time scaling, as the extra headroom allows customers to serve up larger models – AKA inference – faster or at higher volumes.
Meanwhile, for those building large clusters for compute-bound training workloads, we expect the standard Blackwell parts to continue to see strong demand. After all, there’s little sense in paying extra for memory you don’t necessarily need.
With that said, there’s no reason why you wouldn’t use a GB300 for training. Nvidia tells us the higher HBM capacity and faster 800G networking offered by its ConnectX-8 NICs will contribute to higher training performance.
Competition
Nvidia's Blackwell Ultra processors are expected to start trickling out sometime in the second half of 2025, putting them in contention with AMD's upcoming Instinct MI355X accelerators – which now find themselves in an awkward spot. We'd say the same about Intel's Gaudi3, but that was already true when it was announced.
Since launching its MI300-series GPUs in late 2023, AMD has leaned on memory as its main point of differentiation: its accelerators packed more of it (192 GB and later 256 GB) than Nvidia's (141 GB and later 192 GB), making them attractive to customers, such as Microsoft or Meta, deploying models at multi-hundred-billion- or even trillion-parameter scale.
The MI355X will see AMD juice memory capacity to 288 GB of HBM3e and bandwidth to 8 TB/s, matching Blackwell Ultra on both counts. What's more, AMD claims the chips will close the compute gap considerably, promising floating-point performance roughly on par with Nvidia's B200.
However, at a system level, Nvidia's new HGX B300 NVL16 systems will offer twice the memory and roughly 50 percent higher FP4 floating-point performance. If that weren't enough, AMD's answer to Nvidia's NVL72 is still a generation away in its forthcoming MI400 platform.
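The "twice the memory" figure follows from node-level math – assuming, and this is our assumption, that AMD sticks with the eight-GPU baseboard configuration it used for the MI300 series:

```python
# System-level memory comparison; assumes an eight-GPU MI355X node (our
# assumption – AMD's MI300-series shipped on eight-GPU baseboards).
b300_nvl16_tb = 16 * 288 / 1000   # 4.608 TB across an HGX B300 NVL16
mi355x_node_tb = 8 * 288 / 1000   # 2.304 TB across eight MI355X

ratio = b300_nvl16_tb / mi355x_node_tb  # 2.0 – twice the memory per system
```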
This may explain why, during its last earnings call, AMD CEO Lisa Su revealed that her company planned to move up the release of its MI355X from late in the second half to the middle of the year. Team Red also has the potential to undercut its rival on pricing and availability, a strategy it’s used to great effect in its ongoing effort to steal share from Intel. ®