Drivers
Wednesday, Aug 16, 2017 · 2500 words · approx 12 mins to read
Introduction and disclaimer
Having helped to get Vega out the door, I thought it’d be nice to start writing about work-related things again where situations and subject matter permit.
What you’re about to read is not an exact description of how my employer, AMD RTG, or my previous employer, Imagination Technologies PowerVR, develop and release their GPU drivers. It draws on how they both do it in broad terms: the idea of this essay is to give laypersons an insight into what it takes to deliver one part of a much larger product. It is not necessarily the opinion of either AMD or Imagination.
I’ll use the word driver a lot, but I really mean the entire GPU software stack here, be it firmware, client API drivers, compilers, control panels or anything else shipped alongside the lower-level code that makes a GPU do something interesting on your screen.
I’ve worked for RTG for the best part of 6 months now, following a 7 year spell at PowerVR. Now that I’ve spent a good bit of time there, I’m more surprised by the similarities between the two companies and how they go about shipping a complete GPU product than I was by the initial shock of the differences. So despite their different sizes, budgets, markets and products, both groups operate and go about a lot of things in much the same way.
The goal for both technology groups is ultimately the same, with one major extra task for AMD. At a suitably high level, both groups research, develop, test and then ship graphics IP that gets integrated into a final product. RTG go on to produce many of the ASICs that form the heart of those final products themselves, whereas PowerVR stop at the IP delivery stage to the end customer making the SoC (bar their test chip programme).
And it’s the same for all GPU companies. The hardware has to go out with a driver and that driver needs to be developed somehow. The internal processes, decision making, timelines and complexities of that are almost completely hidden from the layperson. Hopefully shining a bit of a light on it will be helpful.
And psst, it’s highly likely the same for anything that needs a driver to function, just adapted to the market and whatever software runs on top. Here’s how it works for graphics as I’ve observed it.
New GPU product development and release is ruled by a schedule, and therefore key milestones along the way. For the sake of argument, let’s call out just a few in chronological order: IP completion (so everything done, final RTL delivered), tape out, first silicon bringup, second silicon bringup (there’s usually at least a second one, even if you’re lucky and it’s just metal changes), release run-up, final QA and release.
For GPUs, at each of those milestones you need a driver of some kind, even pre-silicon. Before final RTL is delivered you need to have stood the driver up on the emulator, even if it’s just enough to boot a minimal OS and run targeted tests to ensure the design works in several key ways.
So driver development starts way before hardware is finished and is then timed to release into the product schedule at each milestone. The pace picks up as the product gets closer to release, because both the hardware and driver are in better shape.
In practice, there are often dozens more milestones to hit, and associated drivers to cut, before the first final driver makes it into the hands of the public. First final driver is an important construct to understand, because hopefully there’ll be many more released after that to improve the product, be it compatibility or performance or both, across the useful lifetime of the hardware in question. That’s especially true for GPUs because of the complexity involved.
By that I mean the total complexity, from the driver source itself all the way up to the games banging away at it with whatever the rendering engineers have come up with next as their way to draw. It’s forever changing across the entire software stack, from the GPU’s BIOS or firmware up to the games.
If you’re unlucky, and this is unfortunately common in mobile, you’ll only get one driver for the entirety of a major OS release on a given device. You can’t download a new driver on your phone like you can on the PC. Even certain PC operating systems have no real GPU driver updates out-of-band of the OS being updated.
Frequent releases, overall blessing that they arguably are, also come with their own schedules and release cadence. On Windows it tends to be: release to a chosen calendar regularity unless a major game or new Windows client comes out.
The key element to understand is that there’s no concept of just one driver. Even if just one ever ships in public, there were hundreds of bringup, pre-alpha, alpha, beta, customer RC, internal RC, press, ISV bugfix test, and other custom one-off builds along the way. All timed to a specific part in the product schedule and released in its own specific way, continuing long after public hardware release.
The features of the hardware that the driver exposes are pushed from two major directions for GPUs. The first are those locked in at product definition time and have to ship, because the APIs the product has to support were decided at that point. Baseline API features have to work so that API certification programmes are passed and the product can claim support for whatever it needs to. Almost all of that side of things is non-negotiable, other than when it comes to performance. I’ll come back to that negotiation.
The other axis of feature development is the set of things the hardware is capable of but that the APIs it supports don’t expose. Some of that is transparent and just happens under the hood with no developer involvement, but most of those things are exposed by extensions in today’s APIs. This is the more interesting set when it comes to driver development.
To get a candidate list of things that the driver could expose transparently or via extension, to figure out what to work on, it’s usually up to the product or program manager to canvass various folks internal and external to the business so that they can put a list together. There’s usually also active lobbying from various folks who want to champion their particular pet feature(s) that they think would be good to get out there.
IHVs will also always talk to ISVs to help drive development of that list. After all, they’re the ones who’ll eventually program the thing. ISVs don’t have to be external to the company: most GPU companies will hopefully have at least a demo team, developer technology engineers, and maybe a pathfinding research team. Driver engineers themselves are also a good source for inspiration!
Hopefully the list is focused on execution efficiency first, to make it easier for the hardware to do a good job with whatever the developer is expressing during rendering or compute, but the proposed list could contain absolutely anything within the bounds of the hardware’s ability.
So now there’s a list, how does it get developed and shipped in the driver? In my experience there’s never just one person responsible as the gate, or the final arbiter for the order. In practice, it takes a lot of discussion between a lot of people all across the business.
It might make sense to actually never develop feature or extension X because of external impetus or market condition D. X is now in the bin. Maybe game A is coming out in a timeframe that suits your next product launch and the ISV has asked for feature Z. Feature Z is now higher up the list than feature Y, say. Maybe most people in the BU support development of Y but there’s no immediate need to ship it. Sometimes extension Q supersedes extension U from 7 drivers ago, so now there needs to be a plan for what happens to U in the wild.
That’s the key thing to understand: priorities for features can and do change based on a wide range of internal and external factors. It’s something that’s constantly reassessed, can often be overridden by someone lobbying loudly for their favourite thing, and is almost always in some state of flux. You build what you have to ship as your priority, and then everything else is a nice-to-have that’s developed against changing market conditions and customer requirements, ideally gated by someone with a great top-level view of what the market needs. In practice that single person is often ten or more.
In practice, the number of people that process directly affects is small in overall terms, assuming all an extension or feature exists for is to make something faster, or better looking at the same performance. Driver engineers need to know what to work on, QA and release teams need to know what to test, validate and package, and ISVs consuming it on the other side in apps and games need it to be there. But the end user doesn’t really care yet.
Whatever the means of getting it into the external ISV’s hands, the vendor has to ship first (ideally as betas so they have a heads up and time to play with whatever it is, but often just at final release time when the driver is available to the public) before the end user’s experience is affected by the existence of whatever it is.
Knowing there’s an overall product schedule that the driver schedules have to run alongside, and knowing that there’s a complex and malleable process for deciding what goes into the driver that’s not a must-ship feature, who builds it?
At both RTG and PowerVR it’s a set of discrete teams of software engineers that gets the driver coding done. Because it’s software, there’s also a surrounding microcosm of managers, QA, packaging and ISV testing labs, plus associated teams like mine in Game Engineering that seed ISVs with betas and help run their new and existing content on it, often on unreleased products, and report bugs and problems in both directions.
So you have very focused teams of people pressing the keys to write the code, and a much wider group of people involved in arranging, corralling and slinging real content in anger on the hardware, before it gets into the end user’s hands.
A quick word on engineering team overlap since it’s a commonly-held misconception: driver teams tend to be split per-API, broadly speaking. That means even if you have a driver stack where a common layer underpins multiple APIs — RTG have this today with PAL, which is the common layer for our DirectX 12 and Vulkan drivers — the folks working on implementing each client API on top tend to be separate.
Those separate teams have their own schedules and external influences, and in AMD’s case often separate geographies too. So just because a feature exists in Direct3D, say, doesn’t mean it is, will be, or even has to be exposed in any other API.
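For the programmers reading, here’s a minimal sketch of that split. It’s purely illustrative and is not how PAL or any real driver is structured: `CommonLayer`, `ClientDriver`, `expose`, `supports` and the feature names are all hypothetical, made up for this example. The only point is that a feature living in the shared layer doesn’t automatically show up in every API built on top of it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommonLayer:
    """Hypothetical stand-in for a shared layer underneath several client APIs."""
    hardware_features: frozenset


@dataclass
class ClientDriver:
    """One client API driver, owned by its own team with its own schedule."""
    api_name: str
    common: CommonLayer
    exposed: set = field(default_factory=set)

    def expose(self, feature: str) -> None:
        # A team can only expose what the hardware (via the common layer) can do.
        if feature not in self.common.hardware_features:
            raise ValueError(f"{feature} is not supported by the hardware")
        self.exposed.add(feature)

    def supports(self, feature: str) -> bool:
        # What matters to an ISV is what this API exposes, not what the hardware can do.
        return feature in self.exposed


# Both drivers sit on the same common layer and the same hardware capabilities.
common = CommonLayer(frozenset({"feature_x", "feature_y"}))
d3d12 = ClientDriver("Direct3D 12", common)
vulkan = ClientDriver("Vulkan", common)

# Assumption for the example: only one team has shipped feature_x so far.
d3d12.expose("feature_x")

for driver in (d3d12, vulkan):
    print(driver.api_name, driver.supports("feature_x"))
```

Each `ClientDriver` keeps its own `exposed` set, which mirrors the way each team makes its own call about what ships and when.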
PowerVR have done outward driver marketing maybe zero times that I can remember, despite it being just as important an aspect of product development as it is at RTG. That’s because of the out-of-band updating problem and the fact that it’s up to the customer what driver is used on any given PowerVR GPU.
For RTG, and because it’s the PC and the drivers for PC platforms carry extra value over just the parts that let games interact with the hardware, the driver is outwardly marketed alongside new hardware releases, and then separately throughout the year as new and improved drivers are released on their own.
On PC, it’s an important part of getting the message out about the overall proposition for the products. It’s also a commonly missed thing in GPU product reviews, sadly, primarily because of the pressure to evaluate the basic performance and capability of the thing in question in games.
So the marketing team has to work hard to shine a light on the good things the driver is capable of, especially if it’s interesting and holds a lot of potential benefit to the end-user in some way, but where a traditional product review might not take a look at it.
I want to wrap up by highlighting a few key points and summarising what’s above. Firstly, there’s the huge number of moving parts at the business level. Just to get a driver out of the door takes software engineering, product management, regular management, executives, ISVs, customers, QA and ISV lab work. All need to be coordinated (hopefully!) to get something new for you to run.
That all runs to a complex schedule that’s completely intertwined with the hardware’s. Certain kinds of drivers need to be available at different points in that schedule. The number of individual executed builds before any public release on new hardware is massive.
As for what can go into a GPU driver outside of the must-ship feature list, that’s probably the hardest thing that gets worked on outside of the actual software engineering of the code itself. Modern APIs map quite well to how GPUs actually work, but still leave a good amount of potential performance and efficiency on the table.
Working out the best way, and the best order, to expose all of that is tough and takes a lot of time. The must-ship baseline feature set always has its own bugs or necessary improvements, which often take priority over any new stuff. Then there’s stuff that can’t be exposed for whatever reason, or that doesn’t make sense to expose for its own myriad reasons. Maybe the timing isn’t right. Maybe the demand for the feature is minimal.
In GPU land, once you ship a combined hardware and software solution that’s capable of thing X, X needs to stay there effectively for all eternity and ideally spread across the stack. It’s not quite that bad in practice and things can be removed, but that’s a huge pain point and why (mostly) throwing it all out to start again with Mantle, Vulkan, Metal and Direct3D 12 has been so cathartic for the graphics industry.
As a wise man once said, “the history of graphics pipeline evolution is littered with things that seemed a good idea to someone that turn out to be terrible ideas, not used very much, and/or a burden on everyone forever since it’s almost impossible to deprecate anything. We stick things in too easily.”
He’s absolutely right. Labouring drivers for those new APIs with the feature-led baggage of old is something we need to be wary of.