Pascal
Thursday, Apr 28, 2016 · 3900 words · approx 19 mins to read

Preamble: here's the original unedited version of a guest article I wrote for The Tech Report on NVIDIA's latest Pascal microarchitecture, GP100 (the first chip to implement it, shipping in the Tesla P100 product), and what it might mean for future GeForce-focused chips and products. The Tech Report first published it on April 27th 2016.
Introduction
Telling you that Nvidia recently announced the first GPU that implements their latest Pascal microarchitecture, GP100, is probably a bit of a redundant opening gambit for this short walkthrough of Pascal and its impact on the modern GPU landscape. It's been the talk of the PC technology space since Nvidia CEO Jen-Hsun Huang announced it during the first keynote of the company's recent GPU Technology Conference (GTC) in sunny San Jose on April 5th. He did it in his inimitable style as the best keynote-giver in Silicon Valley, draped in that lovely leather jacket I'd steal from his very back if we were ever in the same place at the same time again.
Nvidia couldn't release the first Pascal until volume production of 16nm transistor technology was possible, because of the size the chip needed to be to afford a generational leap over the prior GPUs powering the Tesla product line, so it feels like Pascal's been a long time coming. In reality, we haven't deviated too far from Nvidia's typical modus operandi of a roadmap announcement at a GTC, followed by products a couple of years down the line.
However, because 28nm manufacturing has lasted such a long time in discrete GPU land, with AMD and Nvidia skipping the use of 20nm at the various foundries because of its unsuitability for the needs of high-power semiconductor devices, people have been clamouring for the first products on new viable production technology to see what advancements they’d bring to the table. Volume manufacturing for 28HP at TSMC started back in late 2011, remember!
Now that Pascal is here, at least in announcement form, I jumped at the chance to reprise my 2009 analysis of Nvidia’s Fermi when Jeff dropped me a line recently. Back then, Fermi had been announced at GTC in September 2009, but they only really talked about it from the standpoint of GPU compute. Scott asked me to have a go at reasoning about the graphics bits of Fermi back then, and what it’d mean for GeForce. Sounds familiar!
My task is a little different this time, because we were also told the basic graphics-focused makeup of GP100 at GTC, so I don’t have to do too much there (and get some of it wrong like I did last time!). However, reading the Pascal tea leaves leaves me wondering if GP100 will actually be used in GeForce products.
So what of Pascal and the products you'll likely want to buy this year, if you're in the market for a new GPU like I am and know nothing else powerful is coming on 28nm? Let's start with a brief recap of the last generation to see where things ended up on 28nm, prior to the GP100 announcement, before we jump into the new stuff. Be warned: if you're not interested in the bigger building blocks of GPU designs and lots of talk about how many of them are present, here be dragons. Still with me? Great, because some context and background always helps set the scene. I'll switch to the authorial we now, since we're on this weird journey through Blaise's semiconductor namesake together.
Maxwell Recap
We were actually going to take you all the way back to Fermi here, but after collating all of the research to take that 7 year trip down memory lane, we realised that a backdrop of Maxwell and Maxwell 2 is enough.
You see, Maxwell never really showed up in true Tesla product line form, like GP100 has for Pascal. Even the biggest manifestation of the Maxwell 2 microarchitecture, GM200, made some design choices that were definitely focused on satisfying consumer GeForce customers, rather than the folks that might want to buy it in Tesla form for HPC applications.
Key to that is support for double precision arithmetic, or FP64. FP64 has no real place in what you might call a true GPU, because of the nature of graphics rendering, but it is needed for certain applications and algorithms: especially those where a highly parallel machine that looks a lot like a GPU is also a good fit for the core algorithms, and where the ratio of FP64 to lesser-precision computation favours having a lot of FP64 performance baked into the design.
You’d expect a HPC-focused Maxwell to have at least 1/3rd throughput FP64, like the big Kepler, GK110, that came before it. Instead, GM200 had almost the bare minimum of FP64 performance — 1/32nd of the FP32 rate — without cutting it out of the design altogether. We’ll circle back to that thought later.
The rest of the Maxwell microarchitecture, however, especially in Maxwell 2, was typical of a graphics-focused design, and of how Nvidia have enjoyed scaling out their designs in recent generations, from the building block of an SM upwards.
Nvidia group a number of SMs into a structure called a GPC, which could stand on its own as a full GPU, and indeed GPCs do operate independently. A GPC has everything needed to go about the business of graphics, including a full front-end up to and including a rasteriser, the SMs which provide all of the compute and texturing ability along with their required fixed-function bits like schedulers and shared memory, and a connection to the outside world and memory through the now-standard L2 cache hierarchy.
Maxwell GPCs contain 4 SMs, and each Maxwell SM is a collection of 4 32-wide main scalar SIMD ALUs, each with its own scheduler, where the 32 lanes in each SIMD operate in unison, as you'd expect of a modern scalar SIMD design. Along for the ride in an SM is the texture hardware, which lets the GPU get nicer access to spatially coherent (and usually filtered) data, normally to render your games with, but also to do useful things for compute algorithms; fusing off the texture hardware for HPC-focused designs doesn't make too much sense, unless you're trying to hide that it used to be a GPU of course. Per Maxwell SM there's 8 samples per clock of texturing ability.
GM200 is a 6 GPC design. So 6 front-ends, 6 rasterisers, 6 sets of back-ends and connections to the shared 3MiB L2 cache part of the memory hierarchy, and a total of 24 SMs (and thus 24 * 4 * vec32 SIMDs, and 24 * 8 samples per clock of texturing) across the whole chip. At over 1GHz in all shipping configurations, and well over that in GeForce GTX 980 Ti form, especially the overclocked partner boards, it’s overall the most powerful single GPU that’s shipped to date.
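If you want to sanity check that claim, the peak throughput maths falls straight out of the unit counts above. A quick back-of-the-envelope sketch, with the 1GHz clock as our assumption for round numbers (shipping parts boost higher):

```cpp
#include <cstdio>

int main() {
    // GM200 as described above: 6 GPCs x 4 SMs, each SM with 4 x 32-wide SIMDs.
    const double sms       = 6 * 4;   // 24 SMs
    const double fp32lanes = 4 * 32;  // 128 FP32 lanes per SM
    const double fma_flops = 2;       // a fused multiply-add counts as two FLOPs
    const double clock_ghz = 1.0;     // assumed; the 980 Ti boosts well past this

    printf("GM200 peak FP32: %.1f TFLOPS\n",
           sms * fp32lanes * fma_flops * clock_ghz / 1000.0);
    return 0;  // ~6.1 TFLOPS at 1GHz
}
```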
If GM200 sounds big, that's because it absolutely is: at just over 600mm2, fabricated by TSMC on their 28HP technology, it's pretty much the biggest GPU Nvidia could have made without tipping over the edge of the yield curve. While big GPUs lend themselves to salvage, because it's easy to sell them in cut-down form, you still need yields to be decent to extract a profit from a GPU configuration with the bits you are able to turn on, in the context of the competitive landscape of the day.
So that's our GP100 backdrop in a nutshell. What I'm trying to get at by painting yet another picture of the big Maxwell here is that it's overwhelmingly just a really big consumer GPU, not a HPC part. The lack of FP64 performance really does materially hurt its usefulness in HPC applications, and Nvidia can't ignore that forever. Intel are shipping the new Knights Landing (KNL) Xeon Phi now, and it's an FP64 beast capable of tricks that other GPU-like designs can't pull, such as, you know, booting an OS by itself, because each SIMD vector unit is managed by a set of decently capable x86 cores!
GP100
So our Maxwell and GM200 recap highlights that GP100 has its work cut out in a particular field: HPC. Let’s take a 10,000ft view of how it’s been designed to tackle that market as an overall product, before we dive into some of the details.
It’s still an “SMs in collections of GPCs” design, so we don’t have any new things to learn in terms of how it works at the microarchitecture level, at least as far as the basics go. Nvidia have resurrected the TPC nomenclature as something to call a pair of SMs, but we can mostly ignore it for the purposes of this look at things.
A full, unfettered GP100 is a 6 GPC design, each GPC containing 10 SMs. Nvidia announced that the first shipping GP100s in the Tesla P100 product would have 56 SMs enabled. Nvidia are therefore disabling 4 SMs (highly likely one TPC in each of two different GPCs, rather than two TPCs in the same GPC), for yield reasons we assume. GP100 is 610mm2, manufactured by TSMC on their 16FF+ node.
16FF+ is definitely mature, but GP100 is easily the biggest and most complex design to have been manufactured by TSMC on that technology so far. Given the potential customers, if Nvidia could reliably turn on all 60 SMs across the chip for the Tesla P100 product at this point in the production cycle of the GP100 chip, you bet they absolutely would. Power isn't really a concern here, so it has to be yields.
The Pascal SM in GP1xx is actually much smaller than the GM2xx SM in terms of the main hardware. It's just two 32-wide main SIMD ALUs this time, rather than 4. There are also big changes afoot in this main ALU, but let's punt on that for the time being. Also along for the ride is a separate 16-wide FP64 ALU per main SIMD, giving the design "half rate" FP64. If we multiply out all of the numbers that describe the GP100 design, you'll see exactly what that rate ends up as: 5.3 TFLOPS. Good googly moogly; most of the GPUs I work on for a day job have around 1/10th of that for FP32 performance, and literally zero FP64 ability at all. If you're a HPC person and your code needs FP64 performance to go fast, GP100 is your very best friend.
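Here's that multiplication written out, using the same FMA-counts-as-two-FLOPs convention as before and assuming the 1.48GHz boost clock Nvidia quoted for Tesla P100:

```cpp
#include <cstdio>

int main() {
    // Tesla P100 as announced: 56 enabled SMs, each with 32 FP64 lanes
    // (one 16-wide FP64 ALU per 32-wide FP32 SIMD, two SIMDs per SM).
    const double sms       = 56;
    const double fp64lanes = 32;
    const double fma_flops = 2;     // FMA counts as two FLOPs
    const double clock_ghz = 1.48;  // P100's quoted boost clock

    printf("P100 peak FP64: %.1f TFLOPS\n",
           sms * fp64lanes * fma_flops * clock_ghz / 1000.0);
    return 0;  // ~5.3 TFLOPS of FP64, doubling to ~10.6 for FP32
}
```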
Pascal has a familiar L1/shared memory-into-L2 cache hierarchy, as we've seen on Kepler and Maxwell, and it's 4MiB in size on GP100. That changes the "L2 size per SM" ratio significantly compared to GM200 and Maxwell, and not in the bigger-is-better direction: 56 enabled SMs sharing 4MiB of L2 in GP100 works out to roughly 73KiB per SM, versus the 128KiB per SM of GM200's 24 SMs sharing 3MiB.
That said, while there are half the 32-wide ALUs per SM in GP100 compared to GM200, there's no reduction in the size of the register file (RF) each SM has access to, giving GP100 twice the RF per ALU compared to GM200. For certain classes of data-dense code, the kind you tend to find in HPC applications, that's a very welcome change in the new chip.
6 GPCs of 10 SMs, each SM with lots of welcome FP64 ALU performance and a large RF, demand to be fed by a beefy memory subsystem, to give a nice "bytes-per-FLOP" ratio, which is the metric that really matters for devices like this. As an aside, if you think the 4MiB of L2 GP100 has is a lot of on-chip memory, there's actually more than 3x that in total RF if you add it all up. Nvidia are making use of High Bandwidth Memory (HBM) for GP100. I'll leave the gory details for later, but there's a huge increase in external memory bandwidth for GP100 compared to what was possible in GM200 and other GPUs that relied on GDDR5, and that's even with Nvidia using conservative clocks for the HBM configuration they've chosen to go with.
From a HPC standpoint at least, we're pretty much done with our high-level view of how GP100 is constructed: 6 GPCs, 10 SMs per GPC, each SM with 256KiB of RF, 4 SMs turned off across the chip (as pairs, i.e. TPCs, one TPC each for two unlucky GPCs), all sharing a 4MiB L2 and then on through to a very wide, high-throughput HBM memory system.
Let's take a closer look at the SM and see what's changed in the ALUs and how they interact with the register file. The changes help both HPC and graphics applications, so they're particularly interesting.
GP100 - FP16
The biggest change in the Pascal microarchitecture at the SM level is the support for native FP16 arithmetic. Rather than dedicate a separate ALU structure to it, like with the FP64 hardware, Pascal runs FP16 arithmetic via clever reuse of the FP32 hardware. It won't be completely apparent how it goes about that until the ISA is released, but Nvidia have disclosed that the hardware supports data packing and unpacking from the regular 32-bit wide registers in the huge RF we discussed earlier, along with the required sub-addressing. It's therefore highly likely that each FP32 SIMD lane in the ALU is effectively split into a "vec2" type of arrangement, with vec2 FP16 instructions addressing the two halves of a single register in the ISA.
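For a flavour of what that vec2 arrangement looks like from the programmer's side, CUDA already exposes packed FP16 through its half2 type, where one instruction operates on both 16-bit halves of a 32-bit register. A minimal sketch (the kernel and its shape are ours, purely illustrative):

```cuda
#include <cuda_fp16.h>

// y = a*x + y on packed FP16 pairs. Each __half2 is two FP16 values in one
// 32-bit register, so a single instruction does the work of two; that's
// where the up-to-2x rate comes from.
__global__ void saxpy_fp16x2(int n, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Fused multiply-add on both packed halves at once.
        y[i] = __hfma2(a, x[i], y[i]);
    }
}
```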
It’s probably completely identical to how Nvidia supported FP16 in the Maxwell microarchitecture in the Tegra X1. So Pascal isn’t actually the first Nvidia design of the modern era to support native FP16, but it is the first design destined for a discrete GPU that supports it.
Because the FP16 capability is part of the same ALU needed to support FP32, it's reasonably cheap to design in, in terms of area, and there are benefits for a couple of big classes of program that might be run on a GP100 in its useful lifetime. Because GP100 currently powers only Tesla products (and maybe only ever will), Nvidia's messaging around the FP16 support focuses on how it helps deep learning algorithms, allowing for a big performance jump and a reduction in required storage and movement of the data required to feed those algorithms (primarily saving bandwidth, although GP100 isn't shy of having lots of that, as we'll talk about shortly).
The second obvious big winner for native FP16 support is graphics. The throughput of the FP16 hardware is up to 2x (and that "up to" is important, because it highlights that there is a vectorisation aspect to it; it's not just "free"), and lots of modern shader programs can be run at reduced precision, so if the shading language in the graphics API being used supports it, native FP16 support in a GPU can be taken advantage of. FP16 support is part of the major graphics APIs these days, so as a graphics architecture in a GeForce, Pascal is ideally placed to see big potential benefits in performance for traditional uses of the microarchitecture as well. A win-win in terms of green-lighting those changes versus what was there in the main ALU for Maxwell, and Kepler before that.
HBM2
We’re in the home stretch of describing what’s new in Pascal compared to Maxwell, in the context of GP100. AMD debuted High Bandwidth Memory first, putting it to critically acclaimed use with their Fiji GPU in a range of Radeon consumer products. HBM brings two big benefits, and AMD took advantage of them both: lots and lots of dedicated bandwidth, and a much smaller package size because of how it works.
HBM individually connects the memory channels of a number of DRAM devices directly to the GPU, by virtue of very clever physical packaging and a new wiring technology. The DRAM devices are stacked on top of each other, and the parallel channels connect to the GPU via a big piece of passive silicon with wires etched into it, called an interposer. The GPU sits on top of the interposer, with the DRAM stacks sat right next to it, all of that together in one package.
Nvidia’s pictures of the GP100 package (and cool NVLink physical interconnect) show you what I mean. You can see the four individual stacks of DRAM devices, each delivering a 1024-bit memory interface to the GPU. High-end GPUs have bounced between 256-bit and 512-bit for a long time, prior to HBM. Now with HBM you get 1024-bit per stack. Each stack has a maximum memory capacity defined by the JEDEC standard, so aggregate memory bandwidth and memory capacity are intrinsically linked in designs that use HBM.
Connected to each GP100 via the interposer are 4 1024-bit stacks, each made up of 4 8Gib DRAMs. 4 x 4 x 8Gib gives us 16GiB of total memory connected to each GPU. The peak clock of HBM2 in the specification is 1000MHz, giving a per-stack bandwidth of 256GB/sec, or 1TB/sec across a 4-stack setup. Nvidia, likely for a number of reasons rather than one main one, have chosen 700MHz (1400MHz effective, because it's a DDR memory). GP100 therefore has just a touch less than 720GB/sec of memory bandwidth, or around double that of the fastest possible GDDR5-equipped GPU on a 384-bit bus, 384-bit being Nvidia's bus width of choice for the high-end in the last couple of generations.
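The arithmetic, if you want to check our working, with the resulting bytes-per-FLOP thrown in:

```cpp
#include <cstdio>

int main() {
    // Four HBM2 stacks, each 1024 bits wide, at 700MHz DDR = 1.4Gb/s per pin.
    const double stacks       = 4;
    const double bus_bits     = 1024;
    const double gbps_per_pin = 1.4;

    double gb_per_sec = stacks * bus_bits * gbps_per_pin / 8.0;  // 716.8 GB/sec
    // Against ~10.6 TFLOPS of peak FP32, that's roughly 0.07 bytes per FLOP.
    printf("GP100 bandwidth: %.1f GB/sec\n", gb_per_sec);
    return 0;
}
```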
The downside of all of that bandwidth is monetary cost. The interposer silicon necessarily has to be at least as big as the GPU plus four footprints of a single DRAM stack in a GP100 package, and we already spoke about the GP100 die being a faintly ridiculous 610mm2 on a modern 16nm process. So the interposer is probably on the order of 1000mm2. We could work it out together, you and I, but my eyeballing of the package in Nvidia's whitepaper tells me that I'm close, so let's keep our digital callipers in our drawers. 1000mm2 single pieces of silicon, with etched features remember, so there's lithography involved, are still expensive, even if the features are regular and reasonably straightforward to image and manufacture. They're cut from the same 300mm wafers as normal processors, so you can only get a relatively small handful of them per wafer, especially because their long sides will result in quite a lot of wastage of the circular wafer.
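A rough upper bound on interposers per wafer, assuming our ~1000mm2 eyeballed size is about right:

```cpp
#include <cstdio>

int main() {
    // Interposers per 300mm wafer, using our ~1000mm2 guess for the die size.
    const double wafer_area = 3.14159 * 150.0 * 150.0;  // ~70,686 mm2
    const double die_area   = 1000.0;

    // Ideal area-only bound; a long rectangular die wastes a lot of the
    // circular wafer's edge, so real gross counts land meaningfully lower.
    printf("upper bound: %d interposers per wafer\n",
           (int)(wafer_area / die_area));  // 70 at best
    return 0;
}
```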
We wouldn't be surprised if interposer manufacturing cost today results in a per-interposer cost of around 2 of Nvidia's low-end discrete graphics cards in their entirety: GPU, memories, PCB, display connectors, SMT components, the lot.
So now that we have a good picture of the changes wrought in the microarchitecture and the memory system, with Pascal and its first outing in GP100, we can have a go at puzzling over what the first GeForce products that contain Pascal might look like.
GeForce Pascals - A Total Guess
Let me switch back to the first person, to make it clear that these guesses are just that, so that Jeff can absolve himself of my mistakes, as I undoubtedly make them, when he looks at the GeForce Pascals for real.
I think it really boils down to two things, as far as what GeForce configurations of Pascal will look like: FP64 and the use of HBM2. To repeat what we concluded earlier, FP64 is completely useless for graphics, and it costs a lot of area, especially with dedicated SIMDs to run it alongside the main FP32+FP16 pipeline, like GP100 has. So to keep costs down for consumers, given how expensive 16nm manufacture is, I'm expecting it to effectively disappear in the chips that arrive to power GeForce models. It'll still be there, because it can't disappear completely, but it'll probably just be 1/32nd rate, like in GM200.
Then there's HBM2. I'd have argued for its inclusion in GeForce Pascals a few months ago, but GDDR5X is on the way, with its doubling of prefetch length and probable (fairly large) increase in effective clock speed. It'll be cheaper to use than HBM2 at similar aggregate bandwidths (and it's cheaper at the on-chip PHY level too, saving costs there as well as on the interposer and stack packaging), and it doesn't have such strict rules tying bandwidth to capacity. That lets Nvidia offer memory sizes on their GeForce products other than the 4, 8, 12 and 16GiB they'd be tied to if they'd used HBM2.
So I would guess that there's at least one chip, still really big but quite a bit smaller than 610mm2, potentially with very similar overall throughput to GP100 in the metrics that define graphics, and with less memory capacity but still plenty of overall bandwidth. Some rumours say that's called GP102. I think it's 56-60 SMs, 1/32nd FP64 and more than 8GiB of 384-bit GDDR5X, and if it exists then it's likely for a Titan first and maybe an enthusiast's favourite "Ti" product later down the line.
Then there's the GM204 replacement, for the pair of high-end GeForce non-Tis that Nvidia like to do these days. It's likely called GP104. Again with token FP64 throughput (remember, these are GPUs, not HPC offerings), 8GiB of 256-bit GDDR5X, and I'm guessing at 40 SMs or thereabouts, plus all the associated machinery in terms of texturing and back-end throughput that implies, in around 300mm2.
After that I don’t really want to put a flag in the ground, but expect something else for the “1060” part of the product line and something for the “1050” and below, probably at around 100mm2. Maybe by then we’re onto GP2xx and a little kick for the design in some small ways.
Conclusion
To wrap up, we left out discussion of some other really interesting bits of GP100, should you want to go read about them yourself. Nvidia’s whitepaper is good, so I’d recommend reading it and focusing on two things: NVLink and that part of the platform architecture (it’s part of how they build the DGX-1 rackable supercomputer product), along with the fact that GP100 can now effectively service its own page faults with minimal host intervention. That feature has some really exciting applications for graphics, but it’s too big of a topic to cover here. We don’t know whether that feature will make it into GeForce Pascals, but it’s definitely worth keeping an eye out for.
Anyway, we hope that trip through a Maxwell refresher, a Pascal and GP100 overview, HBM2 and its associated costs, and then some guesses about the other Pascals aside from GP100, has whetted your appetite for the upcoming pitched battle between Nvidia on one side, with these new possible things, and AMD on the other, with their new Polaris microarchitecture. I for one need something to drive an Oculus Rift. Maybe two, one per eye. Jeff's also jonesin' for a fix of big powerful GPUs, given the Rift and Vive that have shown up at TR HQ recently. It's about time new GPUs made possible by new manufacturing showed up, and the groundswell of VR adoption is probably going to be a really good kicker for whatever hits the market from both Nvidia and AMD.