Semiconductors from idea to product
Wednesday, Feb 25, 2015 · 9900 words · approx 47 mins to read
Disclaimer: what you’re about to read is not an exact description of how my employer, Imagination Technologies, and its customers take semiconductor IP from idea to end user product. It draws on how they do it, but that’s it.
This essay is designed to be a guide to understanding how any semiconductor device is made, regardless of whether it’s purely an in-house design or licensed IP, or something in between. I’ll touch on chips for consumer devices since that’s what I work on most, but the process applies almost universally to any chip in any device.
I’ve never read a really great top-to-bottom description of the process, and it’s something I’d have loved to have read years before I joined a semiconductor IP company. By writing it I hope it helps others in the same position. If you’re at all interested in chip manufacture and how those chips are made and selected for consumer products, this should hopefully be a great read.
If you’re a publication, online or print, and want to run this piece or a customised version of it, please get in touch! The Tech Report ran a lightly edited version on their site, published on April 22nd 2015.
On 17th June 2015, a colleague at Imagination, Alex Voica, wrote a piece that references this one, explaining in detail how the IP licensing model works and how the major players set the tone for the industry. It’s highly recommended reading if you like what’s below.
It all starts with an idea, you see. Not quite at the level of “I want to build a smartphone!”, although understanding that the smartphone might be a target application for the idea will be a great help in letting the idea take shape. No, we’re going to talk about things a little bit further down, at the level of the silicon (but not for long!) chips that do all the computing in modern devices, be they smartphones or otherwise.
All of the chips I can think of, even the tiniest and most specialised chips that perform just a few functions, are made up of much smaller building blocks underneath. If you want to perform any non-trivial amount of computation, even just on a single input, you’re going to need a design that builds on top of foundational blocks.
So whether the idea is “let’s build the high-performance GPU that’ll go in the chips that go into smartphones!”, or something that’s much simpler, the idea (almost) never gets built in its entirety like that, as one monolithic piece of technology. It usually must be built from smaller building blocks. The primary reason, especially these days, is that it’s incredibly rare that one single person can hold the design for a chip in her or his head in its entirety, in order to build it from start to finish and make sure it works. Modern chips are complex, usually consisting of at least a couple of hundred million transistors in most consumer products and often much much more. Most main processors in a modern desktop or laptop are well over a billion transistors. There’s maybe over a billion transistors in your pocket, in the main chip in your phone.
So you overwhelmingly can’t build it as a monolithic thing, because humans just don’t work that way. Instead, the idea must be broken down into blocks. Maybe a single person can design, build, assemble and test all of the blocks themselves, but blocks are a must. I’ll talk a lot about blocks, so apologies if the word offends somehow, or it means “I hate your cat” in your native language. I definitely love your cat.
For simplicity’s sake, I’m going to talk about most common processors these days, that all take at least a year to make. Nothing in the semiconductor business happens really quickly. It really does normally take years to go from an idea about a chip, all the way through design, build, validation, integration, testing, sampling, possible rework, and mass production. All before the product can be sold and you hold it in your hands, put it under your TV, drive it, fly it, use it to read books, or whatever else the chip finds itself in these days.
The lifetime of a new chip is therefore never short. There are some macro views of the semiconductor industry where you might think that’s the case: for example, the modern smartphone system-on-chip (SoC) vendor might be able to go from project start to chip mass production in a short matter of months, but that’s because all they’re doing is integrating the already designed, built, validated and tested building blocks that other people made. Tens of thousands of man years went into all of the constituent building blocks before the chip vendor got hold of them and turned them into the full SoC.
Years. Not months, weeks, days or anything silly like that, at least for the main chips performing complex computation in modern consumer electronics and related industries.
Knowing what you need
A chip development taking years means there’s a certain amount of hopefully accurate prediction to be done. Smart chip designers are data-driven folks that don’t trust instinct or read tea leaves. They don’t make decisions based on whether the headline in today’s paper started with the third letter of the name of the second dog they had in their first house as a kid. Knowing what to design is almost pure data analysis.
Data inputs come in to the chip designer from everywhere: marketing teams, sales people, existing customers, potential new customers, product planners, project managers, and competitive and performance analysis folks like me. Then there’s the data they get from experience, because they built something similar last time and they know how well (or not) it worked.
Therefore the chip designer’s first job is to filter all of that data and use it as the foundation of the model of what they’re going to build, knowing as much as possible about the contextual life of the chip when it finally comes into existence. What kind of products is it going to go in, eventually? What does the customer expect as a jump over the last thing someone sold them? Is there a minimal bar for new performance, or a requirement for some new features? Are trends in battery life, materials science, or the manufacture of chips by the foundry changing?
What about costs? Costs play an enormous role in things. There’s no point designing something that costs $20 if your competitor can sell their closely functional and performing equivalent for $10. Knowing your cost structure for any chip is probably the thing that shapes your top-level bounds the most, as a chip designer. Every choice you make has a cost, direct or indirect.
Say your chip needs a Widget A, and Widget A is 20 square millimetres in area on the process technology of your foundry. Your total chip cost lets you design something that’s 80 square millimetres, because every square millimetre costs you 20 cents, your customer won’t pay more than $20 for the full chip, and you really need that 25% markup over the cost of manufacture (a 20% gross margin on the $20 price) to pay for the next chip. Widgets B through Z only have 60mm^2 left, and really a bit less than 60 because it’s incredibly hard to lay out everything on the chip so that there are no gaps. Sometimes you even want gaps, for power or heat reasons. I’ll come back to that theme later.
That’s both a direct (your chip can’t cost more than $16 to fab) and an indirect (choosing Widget A affects your further choices of Widgets B through Z) set of costs to model.
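The budgeting arithmetic can be sketched in a few lines. This is only an illustration using the numbers from the example ($20 price ceiling, $16 fab cost, 20 cents per square millimetre, a 20 square millimetre Widget A). The function and parameter names are made up for the sketch, and a real cost model would also account for things like yield, packaging and test.

```python
def die_budget(price_cents, fab_cost_cents, cents_per_mm2, committed_mm2):
    """Return (max die area, area left for other blocks, gross margin, markup).

    All names here are hypothetical; money is kept in cents to avoid
    rounding surprises.
    """
    max_area_mm2 = fab_cost_cents / cents_per_mm2    # direct cost bound
    remaining_mm2 = max_area_mm2 - committed_mm2     # indirect cost of Widget A
    profit_cents = price_cents - fab_cost_cents
    gross_margin = profit_cents / price_cents        # profit as a share of price
    markup = profit_cents / fab_cost_cents           # profit as a share of cost
    return max_area_mm2, remaining_mm2, gross_margin, markup


area, left, margin, markup = die_budget(
    price_cents=2000, fab_cost_cents=1600, cents_per_mm2=20, committed_mm2=20
)
print(f"max die area: {area} mm^2, left for Widgets B..Z: {left} mm^2")
print(f"gross margin: {margin:.0%}, markup over cost: {markup:.0%}")
```

The sketch reports the $4 difference both as a fraction of the price and as a fraction of the fab cost, since those two ratios are easy to mix up when talking about margins.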
So the chip designer takes all of those inputs, feeds them into her or his models — and there’s usually a lot of spreadsheet gymnastics here, more than you might think! — and decides what Widgets they need for their chip, intercepting all of the top-level context about the chip, when it will be made and when it will come into the world, to make sure everything known about its design and manufacture is taken advantage of as much as possible.
We now know the designer needs some blocks to build their chip out of and they’ve made the hard decisions about what they believe they need. Where do those blocks come from these days?
Buy it in or build it yourself
If you’re a semiconductor behemoth like Intel, where you literally have the ability not just to design the chip yourself, but also manufacture it because you own the chip fabrication machinery too, you invariably build the blocks yourself. Say you’re the lead designer for the next-generation Intel Core i8-6789K xPro Extreme Edition Hyper Fighting. These days a product like the Intel Core is not just the CPU like it used to be, where everything else in the system lies on the other end of a bus the CPU is connected to and talks to. Today’s real Intel Core, say the Core i7-4790K, is a CPU, memory controller and internal fabric, GPU, big last level cache, video encoder, display controller, PCI Express root complex, and more. So let’s assume the i8-6789K is probably at least all of those things.
As lead designer of something like the i8-6789K, there’s probably almost nothing on the i7-4790K chip that your predecessor who designed it bought from outside Intel, or that you’ll now buy in from a 3rd party as a building block. I’d like to think there’s at least one block that Intel didn’t design, but I wouldn’t be surprised if someone told me there were zero 3rd-party pieces.
Intel do make chips where they get the blocks from outside of their company, but the vast majority of their revenue comes from chip sales where the entire design is almost completely an Intel one.
So where are you going to get them from? Intel obviously have design teams for each and every block of the chip as we’ve just found out. It’s incredibly expensive, but the competitive advantages are enormous. Knowing all of your block designs are coming from your own company, on timescales you (hopefully!) control, where your competitors have no idea what you’re building, and you have full design-level control over every part that results in a flipflop to be flipflopped, is really compelling. That vertical integration is overwhelmingly an excellent idea if you can afford it, because it lets you put economies of scale to work for you to amortise the incredibly expensive capital expenditure required.
You can see that build-it-yourself mentality elsewhere in the chip industry. Qualcomm do as much as they can. NVIDIA are trying their very best. Apple are beating the rest of the consumer device world to death with their ability to do it, allowing them to vertically integrate as much as they can, and lots of that is built on them doing it themselves, at the chip’s block level.
At the other end of the scale in consumer devices like phones and tablets, you have vendors that are master integrators, but design none of the blocks themselves. They go shopping, get the blueprints for the blocks from other suppliers, connect them up and ship the result, often very quickly. It’s comparatively cheap and easy for them to do it, and, primarily because it’s also cheap and easy for someone else to do it too, they’re in a horrible, slow, squeezing, cost-down race to the bottom that only a few will survive unless they can differentiate.
So buy it in or build it yourself is largely a matter of capital expenditure, expertise and supporting shipping volume. There’s incredible extra nuance to it depending on the company making the chip — some vendors will take a block design in-house where previously they bought it before, not because doing so will make them any more money directly, but simply because it’ll increase the size of the smile on the customer’s face when they use the thing the chip is in — but those are the big factors.
Now we know where the blocks tend to come from: if you’re rich and your customers love your stuff so your competition matters less, if they can even compete with you at all, and you ship loads of whatever it is you make, you can go ahead and try and do as much of it as you can yourself. If your cost structure and competitive environment mean things are tighter, you need to go shopping. I’ve written about how you should go shopping, if you want to nip off and read about that too.
Regardless, someone needs to design the blocks.
Modern block design starts with an architect. The architect is responsible for the what and the why, and part of the how, of the block. But they’re usually not responsible for the rest of the how, or the where. I’ll explain what the hell I’m talking about, I promise.
The what is reasonably obvious. Let’s take the GPU because I’m incredibly fond of things that make pixels. What does a GPU do? It processes graphics workloads. The why doesn’t mean, “why does it process graphics workloads?”. That much should be obvious. The why means, “why does it process graphics workloads in this or that particular way”. The fundamentals of computing machinery mean there are infinite ways to skin any complex computational cat.
Sticking with GPUs, they have to process the pixels on your screen, right? Well you could architect a GPU that processes all of them individually, one after the other. You could architect a GPU that processes all of them individually, one after the other, in a random order. Or you could architect a GPU that processes a bunch of them in parallel together in a tile-based fashion because they have some level of complete independence, yet also some inherent level of connected properties, and exploiting spatial locality in the memory hierarchy leads to great things in a modern processor. Or you could architect a GPU that does nothing but render pictures of cats and hope that’s what the user wants. It’s the architect’s job to figure out why their Widget should process inputs and yield outputs a particular way.
In terms of modern consumer oriented semiconductors, these blocks all have a certain heft. A certain boundary, complexity and physical size. They’re not trivial, they’re almost always programmable in some way, and they tend to be busy with memory a lot of the time. So the why is never simple, and architects today usually can’t operate alone because of that.
So the block architect is operating in many similar ways to the full chip designer we talked about earlier. Interestingly, because it’s so expensive to develop a chip and you want to be able to reuse blocks from one design to the next if you can, they’re basically black boxes. A block needs ways to get data in, do some work and get resulting data out, but it often does that work in complete isolation from the rest of the system without sharing data or resources, and usually goes about it completely differently from other blocks. Computation is not computation, if you catch my drift.
A CPU goes about its business in a completely different fashion — from memory in, through computation, to memory out — than a GPU, never mind DSPs, modems, video encoders, video decoders, display pipelines and everything else on a modern complex chip.
The architect therefore needs to understand the use of their block by the software and rest of the system, how it connects to that system as a black box of sorts, and be broadly aware of some of the more physical properties of how it’ll end up in the chip. But because it’s a black box, almost everything can be an implementation detail. That’s the how part.
You often find that block architects, including those at companies that do it all in-house, will design those blocks with a common interface to the outside world that’s shared with other similar blocks in the same family. It lets the full chip architect make changes to their bigger design and swap out certain blocks for others, without making material changes to anything but the eventual layout. It also makes it possible for the block designer to make multiple variants of their block, specialised in certain ways to address certain markets, without making those variants any more complicated to integrate than each other.
Think of it like swapping out one CPU in your PC for another. They share the same interface, the pins in the CPU’s case and what travels across them, with the outside world, but the implementation could be completely different.
Software helps here, presenting a uniform layer to the rest of the software system for the hardware underneath, too. Drivers for certain blocks let you keep a common interface at the software level while changing the implementation underneath whenever it’s needed. It’s the same principle.
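A minimal sketch of the common-interface idea, in software terms. Everything here is hypothetical (the `Block` interface, the two variants and `run_chip` are invented for illustration). The point is only that the integrator codes against the interface, so swapping one variant for another changes nothing on the integration side.

```python
from abc import ABC, abstractmethod


class Block(ABC):
    """Hypothetical common interface: data in, some work, data out."""

    @abstractmethod
    def process(self, data):
        ...


class SmallBlock(Block):
    """Area-focused variant: works through the input one element at a time."""

    def process(self, data):
        out = []
        for x in data:
            out.append(x + 1)
        return out


class WideBlock(Block):
    """Performance-focused variant: works through the input four at a time."""

    def process(self, data):
        out = []
        for i in range(0, len(data), 4):
            out.extend(x + 1 for x in data[i:i + 4])
        return out


def run_chip(block, data):
    # The integrator relies only on the shared interface,
    # never on how a particular variant is built underneath.
    return block.process(data)


print(run_chip(SmallBlock(), [1, 2, 3, 4, 5]))
print(run_chip(WideBlock(), [1, 2, 3, 4, 5]))
```

Both calls give the same result; only the implementation behind the interface differs, which is the property that lets a family of blocks be swapped in and out of a larger design.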
Software helps drive the underlying block architecture, too, but not completely. Whether running a software instruction set architecture (ISA) or implementing drivers for a client API executed by the block, the architecture of the block can be completely different underneath while staying compatible, allowing differentiation on performance, power efficiency, area and features. Look at GPUs as a block, where there’s a lot of differentiation in underlying architecture in modern SoCs, yet they all implement support for a common set of APIs.
Lastly, as mentioned, the where part of the block tends not to matter as much to the architect. Especially in modern full system-on-chip designs, if you could take a look at the physical layout of them and identify the blocks, you’d see them all over the place. One vendor will always place the video encoder and decoder next to each other, but another will place them separately. In terms of power, sometimes discrete blocks will share power delivery or a power island, but sometimes not. The chip’s physical topology doesn’t really matter to the block architect. That’s the domain of the hardware person that implements it. More about the hardware person later.
First, we need to talk about simulation.
When the architect has completed their design, at least for modern processors in consumer electronics, it’s handed off to two separate teams. The simulation team’s job is to take the design and create a functionally accurate software implementation of the design, somewhat obviously called the simulation. The functionally accurate part is important: the simulation models the what part of the design to make sure it does what it’s supposed to, which means that for all of the same inputs accepted by the real hardware version, it will produce the same outputs. But that also means it doesn’t have to implement the exact same how as the hardware. So it doesn’t have to be cycle accurate, which is really important (and actually, because it’s a software simulation, it means it very probably can’t be cycle accurate).
So the simulated software design acts for all intents and purposes like the hardware design, it just doesn’t perform like it. In reality it’s many orders of magnitude slower for complex designs. For example, the simulated models of the GPUs that I’m used to working on can take minutes or even hours to generate single frames of output, depending on what’s being rendered. Hours. So while the simulation is a great way to verify the design is correct, it’s still nowhere near the hardware in terms of the performance.
For the curious, simulators tend to be written in C++, often using a C++ class library called SystemC. SystemC has features and idioms that allow the simulator writer to more easily model the functionality of the parts of the design that will be operational at the exact same point in time.
The benefit of the simulation is that it’s much cheaper and faster to produce than the hardware implementation. It’s much easier to test, verify and debug, too, since it’s just software (if you’re a simulator engineer, rest assured I completely understand that it’s not just software, as if there’s no inherent complexity compared to other types of software, before you hunt me down and kill me in my sleep).
So when an architecture is finished and handed off to the hardware and simulator teams, the simulator team is always going to finish first, and really they have to because the hardware team are going to use the simulator model to help verify that their hardware implementation works correctly for the same inputs and outputs!
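To make the split between functional accuracy and cycle accuracy concrete, here is a toy sketch. It assumes a made-up multiply-accumulate block: `functional_model` plays the role of the simulator (right answer, no notion of clocks), while `PipelinedMac` is a hypothetical clock-by-clock model standing in for the hardware implementation. Real simulators and real RTL are vastly more involved; the comparison at the end is the part that matters.

```python
def functional_model(pairs):
    """Functionally accurate: same outputs for the same inputs, no clocks."""
    return sum(a * b for a, b in pairs)


class PipelinedMac:
    """Toy cycle-stepped model: stage 1 multiplies, stage 2 accumulates."""

    def __init__(self):
        self.mul_reg = None  # pipeline register between the two stages
        self.acc = 0
        self.cycles = 0

    def clock(self, operands=None):
        # Stage 2 consumes whatever stage 1 produced on the previous cycle.
        if self.mul_reg is not None:
            self.acc += self.mul_reg
        # Stage 1 multiplies this cycle's operands, if there are any.
        self.mul_reg = operands[0] * operands[1] if operands else None
        self.cycles += 1


def run_cycle_model(pairs):
    mac = PipelinedMac()
    for pair in pairs:
        mac.clock(pair)
    mac.clock()  # one extra cycle to drain the pipeline
    return mac.acc, mac.cycles


stimulus = [(1, 2), (3, 4), (5, 6)]
expected = functional_model(stimulus)
result, cycles = run_cycle_model(stimulus)
print(expected, result, cycles)
assert result == expected  # the reference model checks the implementation
```

The two models agree on the output, but only the cycle-stepped one can say anything about how long the work took, which is exactly why the functional model can be finished first and then used as the reference.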
So we now understand that there’s an architecture team responsible for coming up with the way the block should work, a simulator team that models it in a functionally accurate way in software, and also a hardware team that’s responsible for expressing the architecture in terms of the final physical design. There’s a lot of overlap between architecture and hardware in many cases (and in my experience great hardware people tend to be really good architects in many ways, and vice versa). For example, there’s no point in the architecture team designing something that has to be able to absorb a certain number of clocks of memory latency if the hardware team then implements the main computational pipeline depth so that it keeps a lot less work in flight and can’t hide that latency. Architecture and hardware therefore work very closely with each other to make sure the design is respected and implemented correctly.
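The memory latency example can be put into rough numbers with Little’s law: the amount of work that has to be in flight is the latency multiplied by the rate at which work is issued. The sketch below is a simple steady-state model with invented figures (200 clocks of latency, one request per clock); the function names are hypothetical and real pipelines are messier than this.

```python
import math


def work_needed_in_flight(latency_cycles, requests_per_cycle):
    """Little's law: requests that must be outstanding to fully hide latency."""
    return math.ceil(latency_cycles * requests_per_cycle)


def achieved_fraction_of_peak(latency_cycles, requests_per_cycle, capacity):
    """Share of peak throughput reached when only `capacity` requests fit in flight."""
    needed = latency_cycles * requests_per_cycle
    return min(1.0, capacity / needed)


# Assumed figures: the architect designed for 200 clocks of memory latency
# at one request per clock.
print(work_needed_in_flight(200, 1))
# If the implementation can only keep 50 requests in flight, the block
# spends most of its time waiting on memory in this model.
print(achieved_fraction_of_peak(200, 1, 50))
print(achieved_fraction_of_peak(200, 1, 256))
```

In this model, an implementation that holds a quarter of the required work in flight reaches only a quarter of peak throughput, which is the kind of mismatch close collaboration between the two teams is meant to catch.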
That said, there’s still a lot of the hardware implementation of a design that can be done entirely as a black box where the architect might not actually know how it works underneath. That’s the great thing about designing something in modular fashion; as long as your part of the design accepts the right inputs and produces the right outputs, the implementation can sometimes, but not always, be just details for the hardware person to worry about.
That keeps things simpler for the architecture team, and lets the hardware team focus on what they’re good at in terms of the physical implementation. It requires a really good level of trust and communication between the architecture team and the hardware team. Misinterpretation of each other can cause bottlenecks to appear in the design where the architect didn’t expect it, or in the worst case even completely broken physical implementations. There are block designs in chips that were successful in the market, but shipped with hardware-level defects caused by the implementation not really gelling with what the architecture team had in mind for the design. Get it really wrong and you can’t ship.
The real meat of the where when it comes to the building blocks of a modern chip is in power, area and physical layout. Most blocks are rectilinear to make them easier to lay out, but that tends to waste space compared to more complex layouts, such as those where the tooling wasn’t capable and a human got involved to reduce the wasted area on the chip that doesn’t contain any working logic. Sometimes wasted area is desirable, for routing or power delivery reasons, and sometimes you will always have extra space on a chip because you’re pad limited. But sometimes you really need to pack blocks together as closely as possible for the smallest possible area, and it’s the hardware team’s job to design a block that can potentially be laid out in a flexible way.
Let’s talk about that in more detail.
The output of the hardware team is RTL: a register-transfer level description of the design, written in a hardware description language (HDL). The two dominant HDLs are Verilog and VHDL, but there’s no clear winner between them on the market, at least not that I’ve been able to discern. Some companies implement their hardware with one and some the other, and the two languages can be used together in different blocks in a single chip, integrated by the physical design teams.
RTL is a human-written and human- and machine-readable expression of the movement of data and processing logic for the blocks on a chip. It’s fed into the electronic design automation (EDA) tools operated by the physical design team. More on that later.
RTL encodes what needs to happen at the logical level of each part of the design, accepting inputs, working on those inputs in some way, before moving the data out at the back end. It’s very much like normal computer code. The hardware programmers build libraries of common processing parts in RTL, which allow you to build more complex structures.
Going back to my favourite example, the GPU, you tend to want to multiply two numbers at almost every stage of the architecture. There’s no point in the guy coding the blender, say, and the girl coding one of the ALU datapaths, to write their own interpretation of a floating point multiplier. Instead they can collaborate, build something together that jointly fits their individual needs because they tend to want to operate on the same kinds of numbers, building the multiplier once.
Then in the RTL for their blender or shader core, they can import that shared multiplier and instantiate it inside that larger encompassing block, saving time to implement their part of the design and reducing the cost of validation later.
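The same reuse pattern looks like this in ordinary software terms. This is an analogy only, not RTL: `fmul`, `blend` and `alu_mad` are hypothetical names, and Python floats stand in for whatever number format the real multiplier would handle.

```python
def fmul(a, b):
    """The shared multiplier: written and validated once."""
    return a * b


def blend(src, dst, alpha):
    """Blender: mixes two values, reusing the shared multiplier twice."""
    return fmul(src, alpha) + fmul(dst, 1.0 - alpha)


def alu_mad(a, b, c):
    """ALU datapath: multiply-add, reusing the same multiplier."""
    return fmul(a, b) + c


print(blend(1.0, 0.0, 0.5))
print(alu_mad(2.0, 3.0, 1.0))
```

Any fix or optimisation made inside `fmul` reaches both users at once, and it only needs to be validated in one place.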
One of the key things to understand about RTL is that it can be debugged, tested and validated before a chip is created. There are various software and hardware platforms that can take RTL and execute it pretty much directly, without the need to take the RTL all the way to a physical chip to see if it works.
Emulation and FPGAs
Remember, the complexity of a modern chip design is measured in billions of transistors these days. The physical area of a modern system-on-chip in today’s smartphones and tablets starts at around 50 square millimetres and triples from there at the top end. The chips in more powerful laptops and desktop computers, especially the GPUs, run to several billions of transistors and upwards of 500 square millimetres on modern process technology.
You can imagine that costs a lot of money to manufacture, and I’ll talk a bit more about that later. Because of the inherent costs, if you weren’t able to test your design functionally before taking it into physical chip form, it’d be impossible to design anything complex. Being able to prototype your design before the complex “back-end” physical processes get started is therefore completely fundamental to modern chip development.
We talked about simulation earlier, where there exists a functional software implementation that completely mimics what the hardware is supposed to do, just not exactly how it does it at the cycle level. But what about ways for executing the RTL ahead of having to turn it into a physical chip, in that desirable, cycle-accurate manner?
There are a number of options for doing so these days, ranging from the “cheap” to the incredibly expensive. Depending on the size of the block you want to implement and how it connects to the outside world, FPGAs might be an option. FPGAs, as the initialism eventually suggests, are reconfigurable arrays of programmable gates. The gates in question are the fundamental logic gates that processors are made out of, like the “and” gate, which takes two inputs and outputs a logical 1 if both inputs are 1, otherwise it outputs a logical 0.
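As a small illustration of how larger logic is composed out of those fundamental gates, here is a sketch that builds a 4-bit adder from nothing but AND, OR and XOR. It models the gates as Python functions on 0 and 1; the function names are invented for the example, and an FPGA would of course realise this in configurable hardware rather than run it as code.

```python
def and_gate(a, b):
    return a & b  # 1 only if both inputs are 1


def or_gate(a, b):
    return a | b  # 1 if either input is 1


def xor_gate(a, b):
    return a ^ b  # 1 if the inputs differ


def full_adder(a, b, carry_in):
    """Add three single bits using only the gates above."""
    partial = xor_gate(a, b)
    total = xor_gate(partial, carry_in)
    carry_out = or_gate(and_gate(a, b), and_gate(partial, carry_in))
    return total, carry_out


def add_4bit(x, y):
    """Ripple-carry adder: chain four full adders, lowest bit first."""
    carry = 0
    result = 0
    for i in range(4):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


print(add_4bit(5, 6))
print(add_4bit(9, 8))
```

The second call overflows four bits, so the sum comes back as a low result plus a carry out, the same way it would in hardware.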
The FPGA programming process takes as input the RTL the hardware team creates. Remember that RTL is a logical representation of the function of the hardware, so it maps particularly well to FPGAs. Processing speeds for today’s fastest and largest FPGAs that can implement the biggest designs are in the low single MHz range for large blocks. A far cry from the GHz-class designs you might expect for something like a modern CPU, but only a couple of orders of magnitude away from the frequency of things like GPUs or DSPs.
Regardless of the low clock speed, it’s still an incredible improvement in speed compared to simulation. Crucially, it’s also cycle accurate! That property of cycle accuracy exposes the design, no matter how it’s implemented in final silicon, to performance analysis. If you’re able to connect the FPGA implementation to other FPGAs or prototypical silicon that helps you implement the wider system architecture the design might find itself in, you can start to figure out how fast it’s going to be in real systems.
There’s usually some kind of disconnect between the performance in FPGA form and the final shipping silicon, but it’s usually enough to give you an idea and start to work on tuning the design for performance and its performance relative to what it communicates with on the outside world.
Then you have a class of full-block emulation systems that, as long as you have enough of them, can be configured together to emulate very large designs in full. The “in full” property is important. Back to GPUs again (sorry!). Say your GPU design consists of a front-end, a shader core, and a back-end, to keep it simple. Imagine the design is such that even the largest FPGAs you can get your hands on are only big enough to hold the design for the shader core, but not the front- and back-end architecture as well. You’d have to split the design across multiple FPGAs, or not implement it in FPGAs at all depending on the inter-block complexity and communication you need between those parts in order to make the design work.
Emulators, which after years of consolidation in the EDA industry are now usually also produced by the tools vendors like Cadence and Synopsys, are enormous. I really mean that: a large installation of a modern emulator that’s big enough for a large chip design can easily fill a big room, and the room tends to need to be specially constructed due to the power and cooling requirements.
The emulator has the nice property, as well as being able to implement even large designs in full if you have enough of them connected together, that it can also be set to appear to a connected real host system as the real device. So you can boot your operating system, load the driver for the GPU and run the full software stack as if it was a real device, many months ahead of ever seeing the design in silicon.
The advantages of that ease-of-use and ability to use the full software stack, just like with FPGAs, are hard to overstate. It’s great for the driver writer, the performance analyst like me, the hardware team implementing the RTL, or maybe architecture who might want changes to be made based on the full run-time data you’re now able to collect using the full software stack, all because the emulator lets you pretend to the system that the design is real.
Full block-level emulation is slow, even slower than FPGAs, at less than 1 MHz, and also many times more expensive than any other solution (millions of dollars to buy something that will be useful to the designer of a modern consumer system-on-chip, for example, which is why they actually tend to be leased). But it’s still much cheaper and much faster in terms of turnaround time compared to full silicon production.
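To get a feel for what those clock rates mean in practice, here is a back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration: a frame that takes 20 million clock cycles to render, final silicon at 500 MHz, an FPGA prototype at 2 MHz and an emulator at 0.5 MHz.

```python
def seconds_per_frame(cycles_per_frame, clock_hz):
    """Wall-clock time to run one frame's worth of cycles at a given clock."""
    return cycles_per_frame / clock_hz


CYCLES_PER_FRAME = 20_000_000  # hypothetical workload

# Hypothetical clock rates, in Hz, for each way of running the design.
platforms = {
    "final silicon": 500_000_000,
    "FPGA prototype": 2_000_000,
    "emulator": 500_000,
}

for name, clock_hz in platforms.items():
    seconds = seconds_per_frame(CYCLES_PER_FRAME, clock_hz)
    print(f"{name}: {seconds:g} seconds per frame")
```

With these made-up figures, a frame that final silicon would finish in a fraction of a second takes seconds on the FPGA and tens of seconds on the emulator. That is slow, but still comfortably ahead of the minutes or hours per frame mentioned earlier for the software simulation.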
So, let’s imagine that every step in the process we’ve talked about has been followed, up to and including full emulation of the design in a giant emulation platform that fills a huge custom-designed building, complete with fully plumbed-in liquid cooling. The design now provably works. It’s tested with real software. The driver writer is happy with the driver running against the simulated and emulated models. There are no last minute hardware bug fixes to be made (hopefully!). It passes the full regression suite. The team responsible for delivery to the customer (be that an internal customer if you’re someone like Intel or Qualcomm, or an external customer if you’re shipping to someone like MediaTek or Rockchip that only tends to work with outsourced designs) has signed the design off.
The RTL is then shipped to the customer. It’s an important milestone in the journey to full silicon, and for an IP-only supplier it’s pretty much the last point in the journey, but there’s still so much to be done in order to get it into a working chip in a device. So what’s next? One of the coolest and most mysterious processes, at least from my vantage point.
The RTL that describes the hardware then needs to be turned into logic in the form of transistors. That’s where EDA tools and the physical design team come in. The transformation of RTL into actual transistors on the chip is a process called synthesis.
If you understand how computer programming works, you’ll be familiar with the concept of compilation. You take code written in a high level language and transform it through a series of steps, generating and consuming various intermediate representations of the code, until the final hardware code generation happens, targeted at the instruction set architecture of the processor that’ll consume it at runtime. Interestingly, modern processors also tend to take that binary representation of their instruction set and internally transform it into other representations they can understand, all hidden from the programmer.
The key point to take out of that is that while there are multiple complex transformations of the original high-level code, the computational meaning isn’t changed. If the programmer wants to add two numbers together, you better not change the addition to a divide. You can argue that it is changed at certain steps, but I’d argue, at least for simplicity’s sake, that it’s actually just optimised.
That same “compilation” happens during RTL synthesis, where the functional meaning of the hardware design, encoded in RTL, isn’t changed, but the final representation, in this case usually a really cool binary interchange format called GDS2, is now something that can help generate the actual transistors on the chip. Think of it as the map from the described logic in the RTL to the transistor structures on the silicon.
The output of the synthesis EDA tools is built from a library of standard cells. The cells are collections of transistors that implement a certain structure in the silicon. The easiest one to point at is the SRAM cell, not least because they tend to be readily visible on a photograph of a physical chip floorplan! It’s a large, highly-regular rectilinear structure, built out of cellular building blocks.
The cells are tied to the foundry process of course, and tend to be provided by either the foundries themselves — someone like TSMC, or Samsung LSI — or the EDA tools vendor like Synopsys. It’s common for IP vendors to partner up with the cell library vendor to create a tooling flow for an implementation of a particular block that’s optimised for a certain foundry process and set of cell libraries, and their best operating conditions, to guarantee a set of performance characteristics.
So synthesising the RTL is the act of turning the human-readable HDL into cellular blocks of transistors, in effect. And because transistors have physical dimensions, they need to be laid out in relation to one another.
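To make the “same meaning, different representation” idea concrete, here’s a toy sketch in Python. It’s not how a real synthesis tool works, and the function names are my own invention: it just describes a 1-bit full adder twice, once behaviourally (the RTL-like view) and once as a netlist built purely from a single hypothetical NAND2 standard cell, then checks exhaustively that the two agree.

```python
from itertools import product


def nand2(a, b):
    """Stand-in for a single 2-input NAND standard cell (hypothetical library)."""
    return 1 - (a & b)


def full_adder_behavioural(a, b, cin):
    """'RTL-like' description: just say what the block should compute."""
    total = a + b + cin
    return total & 1, total >> 1  # (sum, carry out)


def full_adder_netlist(a, b, cin):
    """'Synthesised' description: the same function using only NAND2 cells."""
    n1 = nand2(a, b)
    n2 = nand2(a, n1)
    n3 = nand2(b, n1)
    x = nand2(n2, n3)      # x = a XOR b
    n4 = nand2(x, cin)
    n5 = nand2(x, n4)
    n6 = nand2(cin, n4)
    s = nand2(n5, n6)      # s = x XOR cin
    cout = nand2(n1, n4)   # cout = (a AND b) OR (x AND cin)
    return s, cout


def equivalent():
    """Exhaustively compare both descriptions over all 8 input combinations."""
    return all(
        full_adder_behavioural(*bits) == full_adder_netlist(*bits)
        for bits in product((0, 1), repeat=3)
    )


print(equivalent())
```

Exhaustive checking only works here because there are three inputs; real designs need far more sophisticated equivalence checking, but the principle is the same.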
Because the GDS2 is a full physical representation of the structures on the chip, it has inherent size, and it might surprise some to realise that modern chip physical manufacture is a 3D process (note that I’m not talking about modern 3-dimensional transistors). Not only does the bottom silicon layer spread out in width and height in a single polysilicon layer, but the design also spreads upwards in terms of metal.
Every modern microprocessor has a metal stack. The reasons why will hopefully be obvious, or at least become obvious shortly. Think of the full chip now, with individual blocks that we talked about the design and implementation of, all connected to each other in the final large chip design. But how are they connected? Tiny wires! The silicon layer is a single planar structure, but the connections between blocks aren’t implemented in the silicon. They’re implemented in metal wires, and so the wires have to go upwards.
In today’s large designs, there’s no way to fit the wiring into just a single layer, so there tends to be a full stack of metal layers, all snaking around and through each other like 3D spaghetti with insulating material in between. The more metal layers the more costly the design is to manufacture, and you can guess that with modern large designs with many layers in the metal stack, and a large silicon layer, that it’s not a task for a human being to lay out. At least not entirely.
Starting again at the silicon layer, the blocks occupying a shared planar surface, they need to be placed beside each other. For manufacturing reasons, the chips on a wafer need to be square or rectilinear, but the blocks inside don’t, although it helps if they are of course for layout simplicity. Imagine a variant of Tetris, where you have to not only get differently shaped blocks to fit together in an optimal space, but where an extra constraint is that related blocks also have to be near each other to keep the interconnection between them — the metal layer stack — as simple as possible.
You can probably quite easily understand that certain layouts of the blocks on a chip will cause you to need fewer long wires to cross the chip to connect them to each other, or to bus fabrics or what have you, resulting in a less complex metal stack.
You can also probably imagine that the number of possible permutations of block layout on the bottom silicon layer, and wire layouts in the metal stack, is mind boggling. Tiny changes in block placement at the silicon layer can cause the wiring complexity to balloon, for example. Searching through the possible combinations of block placement and wiring complexity therefore tends to be done by computer.
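Here’s a deliberately tiny sketch of that search, in Python. Everything in it is made up for illustration: six hypothetical blocks, a 3x2 grid of slots, a handful of connections, and Manhattan distance as a crude stand-in for wire length. It brute-forces every placement, which is exactly what a real tool can’t afford to do.

```python
from itertools import permutations

# Hypothetical floorplan: six slots on a 3x2 grid, as (x, y) coordinates.
SLOTS = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
# Hypothetical blocks and the point-to-point connections between them.
BLOCKS = ["cpu", "gpu", "dram_ctrl", "video", "isp", "modem"]
NETS = [
    ("cpu", "dram_ctrl"),
    ("gpu", "dram_ctrl"),
    ("video", "dram_ctrl"),
    ("isp", "video"),
    ("cpu", "modem"),
]


def wirelength(placement):
    """Total Manhattan distance over all nets: a crude proxy for wiring cost."""
    total = 0
    for a, b in NETS:
        (ax, ay), (bx, by) = placement[a], placement[b]
        total += abs(ax - bx) + abs(ay - by)
    return total


def best_placement():
    """Try every assignment of blocks to slots and keep the cheapest one."""
    best_cost, best = None, None
    for order in permutations(SLOTS):
        placement = dict(zip(BLOCKS, order))
        cost = wirelength(placement)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, placement
    return best_cost, best


naive = dict(zip(BLOCKS, SLOTS))  # blocks dropped into slots in list order
print(wirelength(naive), best_placement()[0])
```

Six blocks means 720 placements to try. The count grows factorially with the number of blocks, before you even consider block shapes or the routing of individual wires, which is why real tools rely on heuristics rather than brute force.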
The bounding conditions for that search tend to be block frequency (because the clock for the chip has to propagate through all of the blocks), power (more transistors equals more power, relative to a fixed input voltage and frequency) and area (because the area has a physical cost in dollars for the silicon and the metal stack that wires it all up). There are many more in reality, and the EDA tools vendors tend to sell the software that figures it all out. The important thing to note is that finding an optimal layout for one input factor can cause huge changes in all of the others.
Now it is possible for a human being to lay out parts of the design, and there are reasons why that might be desirable. The layout software is guided by an engineer or set of engineers, but it might only have a certain number of possible search strategies baked into it, and they’ll all be bound by the characteristics we talked about before: power, area, frequency and so on.
However there’s one important characteristic to the boundary conditions of the layout software: time. Because it’s an exponential problem to solve in terms of numbers of input conditions to find a final layout solution for, the software has to somehow limit its runtime. If the software isn’t sophisticated enough to find an acceptable layout for the usual input parameters, it’s possible for skilled folks to step in and either partially or fully guide the software to the solution the chip designer is looking for.
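The power relationship can be put into rough numbers with the standard first-order model for CMOS dynamic power, P = a × C × V² × f. This is the textbook approximation rather than anything taken from a specific EDA tool, it ignores leakage entirely, and the values below are invented. Switched capacitance is used as a loose proxy for “more transistors”.

```python
import math


def dynamic_power_watts(activity, capacitance_farads, voltage, frequency_hz):
    """First-order CMOS dynamic power: activity * C * V^2 * f (leakage ignored)."""
    return activity * capacitance_farads * voltage ** 2 * frequency_hz


# Illustrative values only: 10% switching activity, 2 nF switched capacitance,
# 1.0 V supply, 1 GHz clock.
base = dynamic_power_watts(0.1, 2e-9, 1.0, 1.0e9)

# Twice the switched capacitance (roughly, twice the logic) at the same V and f.
more_transistors = dynamic_power_watts(0.1, 4e-9, 1.0, 1.0e9)

# Same logic and clock, but at 0.8 V instead of 1.0 V.
lower_voltage = dynamic_power_watts(0.1, 2e-9, 0.8, 1.0e9)

print(base, more_transistors, lower_voltage)
```

Power scales linearly with capacitance and frequency but with the square of voltage, which is part of why the voltage a chip can run at matters so much.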
Fully or partially laid out digital logic in complex, large microprocessors is rare, but it does happen. You can imagine that the reasons why it does happen are to really optimise the last few percent of a design in a certain way, to do the best possible job on performance, power and area. I’ve looked at dozens of large chip designs in the last few years, in detail, and I’ve seen hand-optimised layout just once. I think. I say I think because it’s very hard to tell as you can imagine.
It’s worth talking about clocking of large chips here, just briefly, since it’s such a big part of certain designs. The way a clock is applied to digital logic means that the clock propagates through a block’s cells, driving them, pretty much in a wave, carried by the wires that route through the metal stack we talked about earlier. The clocks applied to a modern large design are complex, because they’re designed to cover a large range and move between levels and lock quickly. They’re also varied, because there’s no one clock to govern a single system-on-chip, so you find anywhere from a few to dozens of clock sources in today’s designs.
The length of the wires is the biggest factor in figuring out the delay of the clock travelling along them, and it’s that delay that’s the main limit for the peak operating frequency of a given block. Factor that in to whatever else is sharing the same clock source and it’s possible for a single block in a modern design to limit the peak frequency of all other blocks that share the same clock.
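As a rough sketch of that last point in Python: if you assume each block has a single worst-case delay that the clock period must cover, the peak frequency is just the reciprocal of that delay, and a shared clock is limited by the slowest block on it. The block names and delays here are hypothetical.

```python
def max_frequency_mhz(worst_delay_ns):
    """Peak frequency if the clock period must cover the worst-case delay."""
    return 1000.0 / worst_delay_ns


def shared_clock_mhz(block_delays_ns):
    """Blocks sharing one clock can only run as fast as the slowest of them."""
    return min(max_frequency_mhz(d) for d in block_delays_ns.values())


# Hypothetical worst-case delays, in nanoseconds, for blocks on one clock.
BLOCK_DELAYS_NS = {"cpu": 1.0, "video": 1.25, "isp": 2.0}

for name, delay in BLOCK_DELAYS_NS.items():
    print(name, max_frequency_mhz(delay))
print("shared clock:", shared_clock_mhz(BLOCK_DELAYS_NS))
```

In this made-up example the CPU could reach 1000 MHz on its own, but the ISP holds everything on the shared clock to 500 MHz.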
The clock is effectively a tree, with it starting at the root source where it’s generated, moving along the branches, which are the main feeding wires it drives, to the leaves of the tree, which are the wires furthest away from the source.
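That tree structure maps neatly onto code. The sketch below, in Python, uses a made-up tree with invented wire delays in picoseconds. It walks from the root to each leaf, adds up the delays, and reports the difference between the earliest and latest arrival, which is usually called clock skew.

```python
# Hypothetical clock tree: each node maps to (child, wire_delay_ps) pairs.
CLOCK_TREE = {
    "pll": [("trunk_a", 40), ("trunk_b", 55)],
    "trunk_a": [("leaf_cpu", 30), ("leaf_gpu", 45)],
    "trunk_b": [("leaf_video", 20)],
}


def leaf_arrivals(tree, node, elapsed=0):
    """Return {leaf: total delay from the root} by walking every branch."""
    children = tree.get(node, [])
    if not children:
        return {node: elapsed}  # a leaf: the clock has arrived
    arrivals = {}
    for child, wire_delay in children:
        arrivals.update(leaf_arrivals(tree, child, elapsed + wire_delay))
    return arrivals


arrivals = leaf_arrivals(CLOCK_TREE, "pll")
skew_ps = max(arrivals.values()) - min(arrivals.values())
print(arrivals, skew_ps)
```

Real clock trees insert buffers and balance branch lengths to keep that skew small, which is one reason the tree takes up area of its own.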
The clock tree is sometimes able to account for a non-trivial amount of the area of a design, since it needs to route effectively, and there are a large number of complex blocks for the tree to feed in a modern design. I point out the clocking just to give you an idea that not all of the area on a modern chip is dedicated to computational logic, and that clocking setups and clock variation strategies to keep power under control are an increasingly large part of modern chip designs.
When I was first introduced to semiconductor manufacture, I can’t even remember when now, there was always talk of this mythical process called the tapeout. I always wondered why it was called that, because I couldn’t fathom where a tape was ever involved in semiconductor design or manufacture. I fathomed right, because it’s a legacy term from back in the day when the final data required to perform semiconductor manufacture was actually delivered on a tape or tapes, and from even further back in semiconductor history when masks for the lithography were literally created by masking with tape.
These days it’s transferred electronically of course. So what are the constituent parts of a tapeout, as far as the chip design goes? The big part is the GDS2, produced by the synthesis and layout steps. It’s sent to a mask house in order for it to be turned into a literal photon-blocking mask. More about why in the next part.
The mask, or normally a set of masks for today’s designs, is the most critical part of the manufacturing process. The mask set can be modified by the foundry after its creation by the company that takes the GDS2 as input to create the mask, but generally it’s set in stone and can’t be altered, so it’s critical that the mask house gets it right.
The delivery of GDS2 to the mask company is digital, but the mask is obviously a physical object. It looks really cool in person, if you’re ever lucky enough to see one; it’s big enough that you can, without a microscope, see the constituent blocks of the design in good detail.
The mask set and some associated metadata are then sent to the foundry for manufacture.
This part is a series of books on its own, and I’m no expert, so I’m going to be brief. Silicon dice, and I presume it’ll be the same with the replacement materials for silicon when they eventually arrive, have their circuits etched into them by a combination of lithography and chemical processes. And it’s not really just silicon either; the actual material is a mixture of silicon and dopants, to give the resulting transistors certain electrical and switching properties.
Laser light is shone through the mask created in the previous step. These days that means specific, very short wavelengths of ultraviolet, I believe usually 193nm UV, and since the transistor feature size is smaller than the light wavelength, the process is called sub-wavelength lithography. The light then passes through complex optics that focus and steady the mask beams (including through water for some foundry processes, using the water as a lens!), allowing a photoresist layer on the silicon to be exposed to the UV.
The unexposed photoresist is washed off of the wafer, then a complex set of processes happens that etches the circuits, gets rid of the exposed photoresist, and lays down the metal stack (among other things).
The wafer is moved underneath the lithographic hardware, to manufacture each individual die on the wafer, and apparently there’s an incredibly complex set of computations happening in real-time while that’s happening to correct the optical assembly and the laser emission, to make sure the dice are etched without defects. Some process nodes require multiple exposures through the mask set per chip, increasing time and cost.
The wafers are then cut so that each individual die can be taken out of the wafer. I believe modern wafers, even for tiny dice, are cut with what’s effectively a saw, rather than something like a laser. The margin for error is incredibly small, given how tightly the dice are packed together on the wafer.
Testing and packaging
After the dice are cut from the wafer, they need to be packaged into something that can be placed into the final device. Packaging at this point depends entirely on the chip and the target device market. For big PC chips like CPUs and GPUs, that usually means placement onto an organic substrate that connects the metal pins on the die to larger balls or pins on the underside of the package.
Packaging tends to happen externally to the foundry, at least if you don’t own your own foundry like Intel or Samsung, via 3rd party companies. The supply chain for semiconductors is actually really quite long. Most people assume that for manufacture into the final packaged chip, the foundry does everything after receiving the design from the designer, but that’s not true. That adds some latency to the production process, on top of the time taken for manufacture by the foundry. The cut dice are sent to the packaging house for that step, then the packaging house sends them to another place for testing. There’s been some consolidation in recent years of this part of the production process, with packaging and testing houses becoming the same entity by merger or acquisition. Geographically, almost all of those for hire are in Taiwan.
Testing is the point in the process where you figure out if the chip is going to work or not. Certain tests can be performed ahead of full packaging, on the bare die and even while still in full wafer form, but there are some that obviously can’t, where you need the chip to become completely functional, so it needs to be powered on fully and run certain external or self tests to determine operational functionality.
There are obviously certain other steps of testing, usually longer completely functional tests with full software stacks, where the packages are placed into form factor devices and run through long run-time tests in varied operating conditions, to check the chip can run in all of the environments it will ever find itself in. Those kinds of test stacks tend to be done by the chip vendor, with the chip in situ in a device form factor that’s representative of what you’ll finally buy.
If the chip vendor is happy at this point and testing completes properly, the design is signed off for limited production.
Yields and binning
Yields are computable at this point. You know how many wafer starts you had, you know how many chips came back working, and you can start to bin those chips at various grades. Because of the inherent nature of a physical manufacturing process, despite the high degree of control over the whole process from the individual wafers upwards, not every chip is the same.
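As a back-of-the-envelope sketch in Python: the function below uses a widely quoted approximation for how many whole dice fit on a round wafer (it is only an estimate, and it ignores scribe lines and edge exclusion), then computes yield as working chips over candidate dice. The wafer size, die area and good-die count are invented for illustration.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Common approximation: wafer area / die area, minus an edge-loss term."""
    d, s = wafer_diameter_mm, die_area_mm2
    return int(math.pi * d ** 2 / (4 * s) - math.pi * d / math.sqrt(2 * s))


def yield_fraction(good_dies, wafer_starts, dies_per_wafer):
    """Fraction of all candidate dice that came back working."""
    return good_dies / (wafer_starts * dies_per_wafer)


# Hypothetical run: two 300mm wafers of a 100 mm^2 die, 1024 working chips.
per_wafer = gross_dies_per_wafer(300, 100)
print(per_wafer, yield_fraction(1024, 2, per_wafer))
```

With those made-up numbers you get about 640 candidate dice per wafer and a yield of 80%.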
You want it to be that way, but there are inherent things that stop it being the case. Sometimes you have functional defects, where blocks of logic on the chip just don’t work, but where you can suffer that and sell the chip as a slightly different SKU with different performance, with the defective block turned off.
Sometimes you have process variation, where everything works the same as another copy of the same chip, but it won’t clock as high at the same voltage. So you have to test the chip functionally, for defects, and operationally, to find out where it sits on the voltage/frequency/power curve and what it can reliably run at. Binning is inherently time consuming and therefore adds significant costs to things, but it’s the only way to be able to guarantee you can sell as many dice as possible from a given production run.
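A minimal sketch of binning in Python, under entirely made-up rules: each tested die is described by how many of its units are defective and the highest clock it ran at reliably, and it is sorted into a hypothetical SKU. The bin names and thresholds are mine, not any vendor’s.

```python
from collections import Counter

# Hypothetical thresholds, in MHz, for the speed grades.
FLAGSHIP_MHZ = 2000
MINIMUM_MHZ = 1600


def bin_die(bad_units, max_stable_mhz):
    """Assign one tested die to a (made-up) SKU bin, or scrap it."""
    if bad_units > 1 or max_stable_mhz < MINIMUM_MHZ:
        return "scrap"       # too broken or too slow to sell
    if bad_units == 1:
        return "cut_down"    # sold with the defective unit turned off
    return "flagship" if max_stable_mhz >= FLAGSHIP_MHZ else "mainstream"


def summarise(dies):
    """Count dice per bin and work out what fraction can be sold at all."""
    counts = Counter(bin_die(bad, mhz) for bad, mhz in dies)
    sellable = sum(n for name, n in counts.items() if name != "scrap")
    return counts, sellable / len(dies)


# Hypothetical test results: (defective units, highest stable clock in MHz).
tested = [(0, 2100), (0, 1700), (1, 2050), (2, 2200), (0, 1500), (0, 2000)]
counts, sellable_fraction = summarise(tested)
print(dict(counts), sellable_fraction)
```

Without the cut-down and mainstream bins, only two of these six dice would be sellable; with them, four are.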
Otherwise, if your products demand uniform performance so your customers know exactly what they’re buying, and you only have a couple of performance levels to sell at, say a tablet and a phone, then you’ll discard some of your production run unless your foundry is fantastic. It’s all a big tradeoff between complexity of the chip, complexity of manufacturing and the target devices the chip is supposed to go into in the end.
There are various stages of production in a chip’s lifetime. The first part already happened, if you’ve been following along. There have been enough wafer runs to produce enough chips to sign-off basic functional and operational tests. Hundreds of chips, but not thousands, usually.
Then modern chips usually go into a wider, but still not full, production run, to create enough chips for device vendors to use to create the final devices they intend to sell, to make sure the chip behaves properly in those devices. More on that soon.
Then there’s full mass production. At this point there’s a lot of money at stake for the chip vendor. They place an order with the foundry that can’t usually be altered; because of how the foundries work, which is related in part to the manufacture section earlier, the chip being produced in a foundry can’t be changed quickly. Therefore to amortise the cost of swapping one set of masks and wafers out — remember this whole process takes place in completely clean environments with no contaminants that could find themselves landing on the chip and spoiling the lithography, and every swap of the mask set and wafers is a chance to introduce something that could compromise yields — you either have to place a big order, or you have to wait for the foundry to be doing something special with the production run for some reason.
That does happen from time to time: for example, a new fab building comes online, so the foundry will dedicate time and energy swapping out the wafer types, optics and mask sets more often than normal, to produce a bunch of different designs, to test out the production pipeline and make sure the fab is operational. Often they’ll produce different designs on the same wafer at this point! Vendor A’s chips might be right next to Vendor B’s chips on the wafer, without ever knowing, for certain kinds of wafer starts.
Mass production is usually on the order of at least hundreds of thousands of chips, at least for consumer device designs, if not tens or even the low hundreds of millions over the production lifetime of some devices. Economies of scale kick in big time here, especially for the bigger chip designers. Some companies are able to keep an entire fab building consumed with a single chip design, for a single target market, for extended periods of weeks or longer at a time.
The economics of production mean that it’s not financially viable for a fab to run cold with no wafer starts, or be constantly swapping and changing design starts. So chip vendors that can guarantee volumes and longer runs of production get priority over the smaller vendors that don’t need as much, or need many more designs to go into production.
Early prototype sampling
We talked a little bit about sampling earlier in the testing and packaging section, but it’s worth quickly revisiting. Sampling is the part of the production process that happens ahead of mass production. A limited number of wafers are started to make sure the chip works, before mass production orders are placed.
It’s at that point that chip vendors sample the silicon to potential customers, as well as themselves for internal testing. The potential customers take delivery of a small number of chips, usually in the tens or hundreds, in order to test them out on prototype boards, well ahead of device form factor creation. Sometimes before anyone even understands what the form factor is even going to be like, for new markets or device types.
I find this part fascinating because of the varied sizes, shapes, colour and functional variations these devices come in. Some are expensive, single-PCB designs with a nice socket for the test chips to go into and everything integrated pretty close to the full form factor design.
But some are crazy multi-level, tiered PCB assemblies of various sizes, shapes and colours, connected with completely non-standard connectors you’ve never seen before and might never see again, with no nice socket, so to change the system-on-chip you have to swap out an entire PCB assembly. They’re real frankenmachines where I sometimes wonder how they can be transported safely without breaking, since they have an inherent fragility.
Lastly, I wanted to talk about the semiconductor intellectual property (IP) business model. I haven’t really touched on it specifically, although it’s been implied throughout, in the sections on block architecture, HDL, hardware, and the back-end physical design processes.
The semiconductor IP model is one where a company, the most famous example being Arm, goes through every part in the process I’ve outlined, from idea up to the RTL. The RTL is what they nominally sell, and usually they stop there. They only sell the source code to a block or blocks, but not a full chip design, the chip itself, or anything like an end-user device.
But increasingly semiconductor IP businesses are crossing the boundary from the RTL delivery into the later stages, and they have to do that to stay competitive and help their customers with ever more complicated tapeouts and chip productions.
It’s not uncommon now for IP vendors to supply, often in conjunction with an EDA vendor, not just the RTL, but the RTL along with tools, scripts and a set of cell library choices, in order for the buyer to run them at synthesis time to generate the block with a very specific, tightly-controlled set of physical properties. It’s not quite an already fully synthesised “macro” instantiation of the IP, ready to integrate with other chosen blocks in the chip design at that point.
It’s more a set of hard and fast rules that effectively say, “if you synthesise exactly like this, we guarantee this area, power and frequency for this IP”. They do that because it’s increasingly hard for a back-end physical design team to get the most out of a complicated block like a CPU, GPU, ISP, modem, or video, and really optimise it, especially for power and area.
So the block vendor gets involved now to help that process go as smoothly as possible. It’s still somewhat common to see fully synthesised macro sales, but that’s falling away in favour of a tightly integrated tools flow, especially one that’s aimed at integration into certain chip types for certain markets, where the extra costs can be justified in favour of a better chip in the end.
The transition from full chip vendor where you did everything yourself, like Intel still do and which some other companies are moving towards and some others away from, to semiconductor IP where you’re either buying or selling blocks in RTL form, has come about almost solely because of the costs involved in meeting the increased complexity of designs, in order to then meet rising customer demand and expectation as chip technology, features and performance march forward at a rapid pace.
That said, the semiconductor IP market is changing, too. I can’t talk about how, but it’s increasingly exciting to work in this space where there’s a high rate of change in how things are done, lots of new players coming and going, and an increasing reliance on the IP business model in order to make today’s complex chips that make their way into consumer electronics and related markets.
The final product
Hopefully you now have a much better idea of the complexity, timescales, costs, number of steps, risks, number of suppliers and the design and architecture aspects of today’s modern processors. Most of what I’ve written applies just as much to companies like Intel who are enormous and do almost everything themselves, as to the much smaller, yet incredibly agile vendors like Actions Semiconductor (who I bet you’ve never even heard of!), who buy everything for a chip from semiconductor IP vendors, and just do the parts that happen after the RTL is bought and delivered, before handing off their chip to a device OEM.
If you want to find a nice average point for your understanding of consumer chips, especially SoCs, where the vendor is not quite an Intel, but not quite the cheapest, tiniest, fastest — and thus least tested, integrated and proven — vendor or chip at the other end of the market, things tend to happen on this kind of scale:
At least a million units shipped of something that takes at least a year from idea to mass production, that costs at least $10M to get to that production, all the way from architecture through IP selection and purchase, integration, bring-up and testing, through to early production, packaging, sampling and then finally mass production.
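Taking those two floor figures at face value, the up-front cost carried by each chip is simple division. This Python sketch covers only that up-front cost, not the per-chip cost of wafers, packaging and testing, and the ten-million-unit case is a hypothetical comparison rather than a figure from above.

```python
def upfront_cost_per_unit(upfront_dollars, units_shipped):
    """Spread the cost of getting to production evenly over every chip shipped."""
    return upfront_dollars / units_shipped


# The floor figures from above: $10M to reach production, a million units.
print(upfront_cost_per_unit(10_000_000, 1_000_000))

# Hypothetical: the same spend spread over ten million units.
print(upfront_cost_per_unit(10_000_000, 10_000_000))
```

That’s $10 of up-front cost per chip at a million units and $1 at ten million, which is the economies of scale argument in miniature.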
Just a few short years ago the timescales and costs were doubled for the average vendor. Just goes to show how quickly things are moving and yet still how complex everything is getting, in order to go from initial idea for a chip into the device in your hand, in a computer on or under your desk, or increasingly found in places like your car.