At Nvidia’s GPU Technology Conference in 2010, CEO Jen-Hsun Huang made some pretty dramatic claims about his company’s future GPU architecture, code-named Kepler. Huang predicted the chip would be nearly three times more efficient, in terms of FLOPS per watt, than the firm’s prior Fermi architecture. Those improvements, he said, would go “far beyond” the traditional advances chip companies can squeeze out of the move to a newer, smaller fabrication process. The gains would come from changes to the chip’s architecture, design, and software together.
Fast forward to today, and it’s time to see whether Nvidia has hit its mark. The first chip based on the Kepler architecture is hitting the market, aboard a new graphics card called the GeForce GTX 680, and we now have a clear sense of what was involved in the creation of this chip. Although Kepler’s fundamental capabilities are largely unchanged versus the last generation, Nvidia has extensively refined and polished nearly every aspect of this GPU with an eye toward improved power efficiency.
Kepler was developed under the direction of lead architect John Danskin and Sr. VP of GPU engineering Jonah Alben. Danskin and Alben told us their team took a rather different approach to chip development than what’s been common at Nvidia in the past, with much closer collaboration between the different disciplines involved, from the architects to the chip designers to the compiler developers. An idea that seemed brilliant to the architects would be nixed if it didn’t work well in silicon or didn’t serve the shared goal of building a very power-efficient processor.
Although Kepler is, in many ways, the accumulation of many small refinements, Danskin identified the two biggest changes as the revised SM (or shader multiprocessor, the GPU’s processing “core”) and a vastly improved memory interface. Let’s start by looking at the new SM, which Nvidia calls the SMX, because it gives us the chance to drop a massive block diagram on you. Warm up your scroll wheels for this baby.
Logical block diagrams of the Kepler SMX (left) and Fermi SM (right).
To some extent, GPUs are just massive collections of floating-point computing power, and the SM is the locus of that power. The SM is where nearly all of the graphics processing work takes place, from geometry processing to pixel shading and texture sampling.
As you can see, Kepler’s SMX is clearly more powerful than past generations, because it’s over 700 pixels tall in block diagram form. More notably, the SMX packs a heaping helping of ALUs, which Nvidia has helpfully labeled as “cores.” I’d contend the SM itself is probably the closest analog to a CPU core, so we’ll avoid that terminology. Whatever you call it, though, the new SMX has more raw computing power: 192 ALUs versus 32 ALUs in the Fermi SM. According to Alben, about half of the Kepler team was devoted to building the SMX, which is a new design, not a derivative of Fermi’s SM.
The organization of the SMX’s execution units isn’t truly apparent in the diagram above. Although Nvidia likes to talk about them as individual “cores,” the ALUs are actually grouped into execution units of varying widths. In the SMX, there are four 16-ALU-wide vector execution units and four 32-wide units. Each of the four schedulers in the diagram above is associated with one vec16 unit and one vec32 unit. There are eight special function units per scheduler to handle, well, special math functions like transcendentals and interpolation. (Incidentally, the partial use of vec32 units is apparently how the GF114 got to have 48 ALUs in its SM, a detail Alben let slip that we hadn’t realized before.)
Although each of the SMX’s execution units works on multiple data simultaneously according to its width (and we’ve called them vector units as a result), work is scheduled on them according to Nvidia’s customary scheme, in which the elements of a pixel or thread are processed sequentially on a single ALU. (AMD has recently adopted a similar scheduling format in its GCN architecture.) As in the past, Nvidia schedules its work in groups of 32 pixels or threads known as “warps.” Those vec32 units should be able to output a completed warp in each clock cycle, while the vec16 units and SFUs will require multiple clocks to output a warp.
The increased parallelism in the SMX is a consequence of Nvidia’s decision to seek power efficiency with Kepler. In Fermi and prior designs, Nvidia used deep pipelining to achieve high clock frequencies in its shader cores, which typically ran at twice the speed of the rest of the chip.
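
For readers who want to check the SMX arithmetic, here is a minimal Python sketch of the unit counts and warp throughput described in this article. It is a back-of-the-envelope model, not Nvidia's scheduler: the function names are our own, and the clocks-per-warp figure assumes an idealized unit that retires one result per lane per clock with no stalls or pipeline latency. Treating the eight SFUs attached to a scheduler as a single 8-wide unit is likewise an assumption made for illustration.

```python
from math import ceil

# Figures taken from the article's description of the Kepler SMX.
WARP_SIZE = 32            # pixels or threads per warp
VEC16_UNITS = 4           # 16-ALU-wide execution units per SMX
VEC32_UNITS = 4           # 32-ALU-wide execution units per SMX
SFUS_PER_SCHEDULER = 8    # special function units per scheduler


def total_alus(vec16_units=VEC16_UNITS, vec32_units=VEC32_UNITS):
    """Sum the ALUs across all vec16 and vec32 units in one SMX."""
    return vec16_units * 16 + vec32_units * 32


def clocks_per_warp(unit_width, warp_size=WARP_SIZE):
    """Idealized clocks for a unit of the given width to output one warp.

    Assumption: one result per lane per clock, ignoring latency and stalls.
    """
    return ceil(warp_size / unit_width)


if __name__ == "__main__":
    print("ALUs per SMX:", total_alus())
    for name, width in [("vec32", 32), ("vec16", 16), ("SFU group", SFUS_PER_SCHEDULER)]:
        print(f"{name}: {clocks_per_warp(width)} clock(s) per warp")
```

Under those assumptions, the four vec16 and four vec32 units add up to the 192 ALUs cited for the SMX, a vec32 unit completes a warp in one clock, and a vec16 unit needs two. The four-clock result for the SFU group follows only from our simplified model; Nvidia has said no more than that those units take multiple clocks per warp.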