People of ACM - Bill Dally

September 19, 2017

In a talk you gave in 2011 titled “Power, Programmability and Granularity: The Challenges of Exascale Computing,” you said that “reaching an exascale computer by the end of the decade, and enabling the continued performance scaling of smaller systems, requires significant research breakthroughs in three key areas: power efficiency, programmability, and execution granularity.” From the vantage point of 2017, in which area(s) have we made the most strides, and in which area(s) are we still lagging?

We have made the most progress on power efficiency and we are lagging on programmability. In 2011, supercomputers built from our Fermi generation of GPUs achieved less than 2 billion floating-point operations per second per watt (GFLOPS/W). The efficiency of our new Tesla V100 GPU is 25 GFLOPS/W. This is more than a 10x improvement in six years, with very little of the improvement coming from semiconductor technology; it is almost all due to improved architecture and circuits.
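As a rough back-of-the-envelope check, a reader can work out the improvement factor and what these efficiencies imply for the power draw of an exaflop machine, using only the figures quoted above (nothing here beyond that arithmetic):

```cpp
#include <cstdio>

// Back-of-the-envelope arithmetic using only the efficiency figures quoted
// above: Fermi-era systems at ~2 GFLOPS/W (2011) and Tesla V100 at 25 GFLOPS/W.
int main() {
    const double fermi_gflops_per_watt = 2.0;   // 2011 figure quoted above
    const double v100_gflops_per_watt  = 25.0;  // 2017 figure quoted above

    // Improvement factor over six years.
    printf("Efficiency improvement: %.1fx\n",
           v100_gflops_per_watt / fermi_gflops_per_watt);        // 12.5x

    // Power needed to sustain 1 exaflop/s (1e9 GFLOPS) at each efficiency.
    const double exaflop_in_gflops = 1e9;
    printf("Exaflop power at Fermi efficiency: %.0f MW\n",
           exaflop_in_gflops / fermi_gflops_per_watt / 1e6);     // 500 MW
    printf("Exaflop power at V100 efficiency:  %.0f MW\n",
           exaflop_in_gflops / v100_gflops_per_watt / 1e6);      // 40 MW
    return 0;
}
```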

On the other hand, the industry has made relatively little progress on improving programmability. Most HPC applications are written in Message Passing Interface (MPI) + X (where X is the language used within the node) and have many machine dependencies embedded in the source code. Such codes require considerable effort to port from one supercomputer to another. We envision a future where applications are coded in a target-independent language and the mapping to different target machines is largely automated. The Legion programming system, which we are developing in collaboration with Stanford and several US Department of Energy (DOE) labs, is a big step in this direction.
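To illustrate the "MPI + X" pattern and the kind of machine dependence it bakes into source code, here is a minimal sketch (MPI across nodes plus OpenMP within the node; the per-rank problem size and the one-rank-per-node assumption are exactly the sort of per-machine tuning choices referred to above, and are illustrative, not drawn from any real application):

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

// Minimal "MPI + X" sketch: MPI distributes a 1-D array across nodes,
// OpenMP parallelizes the local work within each node. The per-rank size
// and rank-to-node mapping are machine-dependent choices that typically
// need retuning when porting to a different supercomputer.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int local_n = 1 << 20;              // per-rank problem size (tuned per machine)
    std::vector<double> x(local_n, rank);

    // "X" = OpenMP: on-node parallel update and partial sum.
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < local_n; ++i) {
        x[i] = 2.0 * x[i] + 1.0;
        local_sum += x[i];
    }

    // MPI: combine per-node partial sums across the machine.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("ranks=%d  sum=%g\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}
```

A target-independent system such as Legion aims to remove exactly these hard-coded mapping decisions from the application source, leaving them to the runtime and mapper instead.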

NVIDIA’s DRIVE PX2 is an AI supercomputer that is being used in both the development and production of autonomous vehicles. Can you tell us a little about the DRIVE PX2’s architecture and capabilities?

Our scalable line of DRIVE PX automotive AI computers and the DriveWorks software stack that runs on it provide an ideal platform on which to build an autonomous vehicle. Our next-generation DRIVE PX system will use our Xavier system-on-chip (SoC), which includes Volta cores and a dedicated deep-learning accelerator (DLA). This gives it 30 trillion operations per second (TOPS) of deep-learning inference performance, which is required to run multiple deep neural networks (DNNs) on real-time data from the suite of light detection and ranging sensors (LIDARs), cameras, and other sensors on a modern autonomous vehicle (AV), so that the system accurately perceives and maps all objects in the car’s environment. The DRIVE PX systems package the Xavier chip with the redundancy required to achieve the automotive safety integrity level (ASIL) needed for up to Level 5 autonomy (performance that equals that of a human driver).

Our DriveWorks software provides a platform that includes modules for perception, localization (to a highly detailed map), prediction, and planning. Multiple DNNs are used to understand the vehicle’s environment, identify lane markings, detect objects (such as people or vehicles), find open space, and suggest trajectories. AV vendors can use this package as a starting point and adapt it to their own sensor suite, training data, and control approach.
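Purely as an illustration of the pipeline shape described above (this is not the DriveWorks API; every type and function name below is hypothetical), an AV stack with perception, localization, prediction, and planning stages can be sketched as a simple per-frame loop:

```cpp
// Hypothetical types standing in for sensor data and world state; these names
// are illustrative only and do not correspond to DriveWorks interfaces.
struct SensorFrame { /* camera images, LIDAR point clouds, other sensor data */ };
struct ObjectList  { /* detected vehicles, pedestrians, lane markings, free space */ };
struct Pose        { /* vehicle position and orientation within an HD map */ };
struct Predictions { /* likely future trajectories of surrounding objects */ };
struct Trajectory  { /* proposed path and speed profile for the ego vehicle */ };

// Each stage mirrors a module described above; in a real stack the perception
// and prediction stages would each run several DNNs on the incoming frame.
ObjectList  perceive(const SensorFrame &)                  { return {}; }
Pose        localize(const SensorFrame &)                  { return {}; }
Predictions predict (const ObjectList &, const Pose &)     { return {}; }
Trajectory  plan    (const Predictions &, const Pose &)    { return {}; }

// One iteration of the control loop, run for every batch of sensor data.
Trajectory step(const SensorFrame &frame) {
    ObjectList  objects = perceive(frame);
    Pose        pose    = localize(frame);
    Predictions preds   = predict(objects, pose);
    return plan(preds, pose);
}
```

An AV vendor adapting such a stack would swap in its own sensor suite, trained networks, and planner behind the same stage boundaries.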

What are we learning from early tests of autonomous vehicles (and their onboard computers) that will shape the development of these computers going forward?

Very high levels of deep-learning inference performance are critical to providing accurate perception, prediction, and planning functions with low latency. As we collect more data, we use larger models, and our networks become more accurate at detecting obstacles, predicting the intent of other cars, discriminating small objects in the road, and using this information to provide a safe ride. Running larger models on multiple high-resolution sensors requires tens of tera-operations per second (TOPS) of performance within a power budget of tens of watts. All of this must be done with sufficient redundancy to provide highly reliable operation.
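To make the "tens of TOPS" figure concrete, here is a rough sizing sketch; the per-frame operation count, camera count, frame rate, and number of networks are illustrative assumptions chosen for the example, not figures from the interview:

```cpp
#include <cstdio>

// Illustrative sizing of inference throughput for a multi-sensor AV stack.
// Every parameter value below is an assumption made for this sketch.
int main() {
    const double gops_per_frame = 10.0;  // operations per image per network, in GOP (assumed)
    const double frames_per_sec = 30.0;  // per-camera frame rate (assumed)
    const int    num_cameras    = 8;     // high-resolution cameras (assumed)
    const int    num_networks   = 6;     // DNNs run on each frame: detection, lanes, free space, ... (assumed)

    // Total sustained inference demand in tera-operations per second.
    const double tops = gops_per_frame * frames_per_sec * num_cameras * num_networks / 1000.0;
    printf("Required inference throughput: %.1f TOPS\n", tops);      // ~14.4 TOPS

    // Efficiency the platform must deliver within a 30 W budget.
    printf("Efficiency needed at 30 W: %.2f TOPS/W\n", tops / 30.0); // ~0.48 TOPS/W
    return 0;
}
```

Larger models, higher-resolution sensors, or additional networks push this quickly toward the tens of TOPS cited above.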

What is an area of computer architecture or high performance computing that you believe has great potential and is not getting enough attention?

With the end of Moore’s Law, application- and domain-specific architectures are the most promising way to continue scaling performance and efficiency, and they deserve more attention. These can take the form of either dedicated accelerators, like our Deep Learning Accelerator, or domain-specific instructions, like the tensor instructions that provide deep-learning performance in Volta. A bioinformatics accelerator I developed with Yatish Turakhia at Stanford achieves a speedup of 10,000x compared with a top-of-the-line Xeon. In contrast, research on conventional processor architecture is at the point of diminishing returns.
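As a concrete illustration of domain-specific instructions, the sketch below uses CUDA's warp matrix (WMMA) API, which exposes Volta's tensor instructions to programmers. It multiplies a single 16x16 tile and is a minimal example compiled for sm_70, not production GEMM code:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Minimal use of Volta tensor instructions via the CUDA WMMA API: one warp
// computes C = A * B for a single 16x16x16 tile in FP16 with FP32
// accumulation. Launch with one warp, e.g. tensor_core_tile<<<1, 32>>>(a, b, c).
__global__ void tensor_core_tile(const half *a, const half *b, float *c) {
    // Fragments map tile data onto the tensor cores' register layout.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                // start with C = 0
    wmma::load_matrix_sync(a_frag, a, 16);            // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // tensor-core multiply-accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

A single such instruction performs an entire small matrix multiply-accumulate, which is where Volta's deep-learning throughput advantage over scalar floating-point instructions comes from.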

William J. “Bill” Dally is Chief Scientist and Senior Vice President of Research at NVIDIA. He is also a professor at Stanford University, where he directs a research group developing novel processor and network architectures and new digital design techniques.

Dally received the ACM-IEEE CS Eckert-Mauchly Award in 2010 and the IEEE Seymour Cray Award in 2004. He has been recognized for fundamental contributions to the system and network architecture, signaling, routing and synchronization technology that is used in most large parallel computers today. Dally’s Imagine processor employed stream processing to significantly improve the power, speed and efficiency of high performance computers. His Merrimac streaming supercomputer project evolved into graphics processing unit (GPU) computing. In 2002, he was named an ACM Fellow.