People of ACM - Norm Jouppi
November 4, 2025
In the early 1980s, you were involved (along with John Hennessy, Dave Patterson, and others) in the development of the RISC and MIPS microprocessors (two instruction set architectures that have shaped modern computing). MIPS, for example, emphasizes simplicity and speed. Why was MIPS needed at the time?
Let me start by saying the opportunity to work with Turing Award winners John and Dave over my career has been a tremendous blessing. Returning to technical issues, general-purpose instruction set architectures had gotten progressively more complex over time, which made it hard to pipeline and use other forms of instruction-level parallelism (like issuing multiple instructions per clock cycle). These complex architectures were called complex instruction set computers (CISCs). At the time, with 4-micron technologies, we could only have 25K transistors per chip, and 32-bit CISC computers required multiple chips to implement. By using simpler instructions and microarchitectures, we were able to get higher performance from pipelining and simple statically-scheduled parallel instruction issue, while eliminating work at runtime by enabling more optimization at compile time. This, in turn, enabled the first single-chip 32-bit microprocessors (the Berkeley RISC and the Stanford MIPS). BTW, one of John’s sayings during the MIPS project was “Never put off until runtime what you can do at compile time.” We’ve followed that wise advice in the design of our TPUs.
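As a rough illustration of that compile-time philosophy (a toy Python sketch, not the actual MIPS tooling; the instruction encoding and helper name are invented for illustration): early MIPS exposed a one-cycle load delay slot, and the compiler tried to fill it with an independent instruction rather than relying on hardware interlocks to stall the pipeline.

```python
# Toy static scheduler: try to fill the delay slot after each load ("lw")
# with a later, independent instruction, so the hardware never has to stall.
# Each instruction is modeled as (opcode, destination, list of sources).

def fill_load_delay_slot(instrs):
    out = list(instrs)
    for i, (op, dest, srcs) in enumerate(out):
        if op != "lw" or i + 2 >= len(out):
            continue
        nxt, later = out[i + 1], out[i + 2]
        # Hoist 'later' into the delay slot only if 'nxt' needs the loaded
        # value and the reordering breaks no data dependences.
        if (dest in nxt[2] and dest not in later[2]
                and nxt[1] not in later[2] and later[1] not in nxt[2]
                and nxt[1] != later[1]):
            out[i + 1], out[i + 2] = later, nxt
    return out

program = [
    ("lw",  "r1", ["r10"]),        # load r1 from memory
    ("add", "r2", ["r1", "r3"]),   # uses r1 -> would stall right after the load
    ("sub", "r4", ["r5", "r6"]),   # independent -> can fill the delay slot
]

for ins in fill_load_delay_slot(program):
    print(ins)
```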
In the 1990 paper you presented at ISCA, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” you introduced two new buffers to improve memory hierarchy design. What are buffers, and how did your buffers improve the memory of computer systems?
During that era, we could fit a processor on a single chip (making it higher performance and more power efficient), but we didn’t have much room left over for instruction and data caches. Also, direct-mapped caches had lower access times, but higher miss rates. My idea was to get the best of both worlds by augmenting a direct-mapped cache with a much smaller associative cache that could be accessed in parallel without significant additional delay, effectively providing some associativity. Similarly, the prefetch buffers (holding 1-4 cache lines each) were a way to fetch data before it was needed, but not replace useful items in the cache if the prefetched data wasn’t needed. In other words, it’s a way of getting higher efficiency from the memory system without significantly increasing access time or hardware costs. Variants of stream buffers (for prefetching) have been adopted by many computer systems. That said, the small fully associative cache holding cache lines that had been replaced due to a lack of associativity (a “victim cache”) tends to garner more attention even though it is the less useful of the two techniques. An indie band in Texas even named themselves “Victim Cache” after reading about it.
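As a rough sketch of the victim-cache idea (a toy Python model; the class name, sizes, and address handling are illustrative, not from the paper): a small fully associative buffer catches lines evicted from the direct-mapped cache by conflict misses, so addresses that ping-pong in the same set can still hit.

```python
from collections import OrderedDict

class DirectMappedWithVictimCache:
    """Toy model: a direct-mapped cache backed by a tiny fully associative
    victim cache holding recently evicted lines (sizes are illustrative)."""

    def __init__(self, num_sets=64, victim_entries=4):
        self.num_sets = num_sets
        self.lines = {}                 # set index -> cached address
        self.victim = OrderedDict()     # small fully associative buffer
        self.victim_entries = victim_entries

    def access(self, addr):
        idx = addr % self.num_sets
        if self.lines.get(idx) == addr:
            return "hit"
        if addr in self.victim:
            # Swap: promote the victim line, demote the conflicting line.
            self.victim.pop(addr)
            if idx in self.lines:
                self._insert_victim(self.lines[idx])
            self.lines[idx] = addr
            return "victim hit"
        # Miss in both: fill from memory, push any evicted line to the victim cache.
        if idx in self.lines:
            self._insert_victim(self.lines[idx])
        self.lines[idx] = addr
        return "miss"

    def _insert_victim(self, addr):
        self.victim[addr] = True
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)   # drop the oldest entry

# Addresses 0 and 64 conflict in the same direct-mapped set; without the
# victim cache they would keep evicting each other.
cache = DirectMappedWithVictimCache()
for a in [0, 64, 0, 64]:
    print(a, cache.access(a))
```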
Why are Tensor Processing Units (TPUs) an advancement over the graphics processing units (GPUs) that fueled machine learning advances a decade ago?
GPUs were originally designed for graphics (BTW, I’ve worked on some GPUs over my career). Then GPUs were extended to high-performance computing. In the last decade they’ve also added support for shorter-precision datatypes, which are good for ML. So GPUs effectively have baggage from multiple application domains, as well as maintaining some compatibility with decades of previous designs. We’ve designed TPUs from the ground up to be optimized for ML, and nothing else. That narrows the domain that TPUs address, but it also enables us to optimize more efficiently for that single domain.
This has led to different design decisions – for example, we’ve used multidimensional tori, which tensor computations map well to and which avoid the cost and power needed for the more general-purpose interconnect switch networks required by other applications. We’ve also used much larger threads of control, increasing efficiency, with functional units computing 256x256 matrix multiplies using a single instruction and reusing fetched data operands 256 times or more. This saves SRAM accesses, which require much more energy than compute. Our most recent TPU (Ironwood) has a single core per reticle-sized chip, vs. CPUs with over 100 cores per chip, and GPUs with thousands of thread units per chip.
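A back-of-the-envelope sketch of why that operand reuse matters (illustrative Python, using the 256x256 figure mentioned above): each fetched input element of an n-by-n matrix multiply participates in n multiply-accumulates, so on-chip reuse grows with the tile size and SRAM traffic per MAC shrinks accordingly.

```python
def matmul_reuse(n):
    """Operand reuse for an n x n by n x n matrix multiply, assuming each
    input element is fetched from SRAM once and then reused on-chip."""
    macs = n ** 3                  # n*n outputs, each a length-n dot product
    inputs_fetched = 2 * n * n     # elements of the two input matrices
    uses_per_element = n           # each input element feeds n of the MACs
    return macs, inputs_fetched, uses_per_element

for n in (8, 128, 256):
    macs, fetched, uses = matmul_reuse(n)
    print(f"{n}x{n}: {macs:>10} MACs from {fetched:>7} fetched elements "
          f"(~{uses} uses per element)")
```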
What was your work in telepresence about?
I like to call that period of my work my “technical midlife crisis” (like Pablo Picasso’s Blue Period). At that time, making CPU cores larger didn’t improve performance per cost, and on-chip multiprocessor research had reached a consensus. So the future looked like we would just be increasing the number of cores per CPU chip by 2X with each generation of lithographic scaling for the foreseeable future, which also created a bit of a “winter” for computer architecture research.
On top of this, on a personal note, my wife and I had two young kids and a new baby. Given that I had to travel a lot for my work, it increased my interest in capturing all the human interaction details of in-person meetings, such as 360-degree gaze preservation (which is much harder than just preserving eye contact). I was also interested in exploring the ability to move around and talk with different groups of people at a remote location, high dynamic range multi-channel surround audio to preserve the “cocktail party effect,” etc. But in 2003 my mutually-immersive robotic telepresence project was frankly ahead of its time, so I got more involved in HPC instead. Maybe in another decade or two it will be the right time for it (using humanoid robots, much higher BW wireless LANs, etc.). I co-authored the 2004 ACM Multimedia paper about it. Ironically, it was the only paper submission in my career for which I received 5 out of 5 reviews, all with the highest possible rating.
What is the next revolution in computer architecture?
Lately I’ve been very excited by the diversity and breadth of recent computer architecture conferences. For example, attendance at the last ISCA conference set a record by a wide margin. With the end of Moore’s Law and the end of Dennard Scaling (i.e., transistors aren’t becoming cheaper with scaling, and their power is increasing rapidly), I see tremendous interest in domain-specific architectures as a means to improve efficiency for all kinds of systems. So I encourage everyone to attend our computer architecture conferences to find out!
Norman P. Jouppi is a Vice President and Engineering Fellow at Google. He joined Google to develop ML accelerators, and is the tech lead for Google’s Tensor Processing Units (TPUs). Jouppi is recognized for trailblazing innovations in computer memory systems. He was also the lead designer of several microprocessors, contributed to the design of graphics accelerators, and extensively researched telepresence.
Jouppi has served the ACM in many capacities, including as the Chair of SIGARCH and as a member of the ACM Council. He received the ACM-IEEE CS Eckert-Mauchly Award for pioneering contributions to the design and analysis of high-performance processors and memory systems, and the IEEE CS Seymour Cray Computer Engineering Award for his work on ML supercomputers. He is a Fellow of the ACM, IEEE, and AAAS, and a member of the NAE.