People of ACM - Rachid Guerraoui
April 22, 2025
What is an important way in which distributed computing has advanced since you began your career?
I started my research career with the advent of the Internet. At that time, it wasn’t necessary to coordinate the activities of computers except for small subnetworks of a few machines. The situation has significantly changed since then—hundreds of processes in a multiprocessor architecture now need to synchronize their access to shared data. At the same time, blockchain-like applications require thousands of computers all over the world to maintain the same shared state.
One of your most cited papers in the ACM Digital Library is “On the Correctness of Transactional Memory.” In the paper, you, along with co-author Michal Kapalka, introduce the concept of opacity. What role does transactional memory play in a distributed system? Why was the concept of opacity an improvement over the existing state of the art in the field?
Exploiting multiprocessor architectures to speed up applications is appealing, but not trivial. One needs to “divide” an application into pieces that run on parallel processors, which also need to synchronize their access to shared data. A simple way to do this, from the application programmer’s perspective, is to use the abstraction of transactional memory. Implementing the abstraction itself is nevertheless very challenging. Our paper addresses the question of what it means precisely to provide a correct implementation of that abstraction. Our book, Principles of Transactional Memory, and a full implementation of a transactional memory system (SwissTM) followed that paper.
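To make the abstraction concrete, here is a minimal sketch of a toy software transactional memory in Python. It is an illustration only, not SwissTM and not the formal definition from the paper. All names (TVar, Transaction, atomically, transfer) are hypothetical, and a single global lock stands in for the far more refined machinery of a real system. The sketch revalidates earlier reads on every new read, so that even a transaction that will later abort never observes an inconsistent snapshot, which is the intuition behind opacity.

```python
import threading

class Conflict(Exception):
    """Raised when a transaction observes a stale snapshot."""

class TVar:
    """A shared variable with a version counter (hypothetical name)."""
    def __init__(self, value):
        self.value = value
        self.version = 0

# Assumption: one global lock guards validation and commit, for simplicity.
_commit_lock = threading.Lock()

class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version seen at first read
        self.writes = {}  # TVar -> value buffered until commit

    def _validate(self):
        # Abort if any variable read so far has since been overwritten.
        for tvar, version in self.reads.items():
            if tvar.version != version:
                raise Conflict()

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        with _commit_lock:
            # Revalidate on every read so the snapshot stays consistent.
            self._validate()
            self.reads.setdefault(tvar, tvar.version)
            return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value

    def commit(self):
        with _commit_lock:
            self._validate()
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1

def atomically(body):
    """Run body(tx) as a transaction, retrying on conflict."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue

def transfer(src, dst, amount):
    """Move amount from src to dst as one atomic step."""
    def body(tx):
        tx.write(src, tx.read(src) - amount)
        tx.write(dst, tx.read(dst) + amount)
    atomically(body)
```

From the programmer's side, transfer(a, b, 1) reads like sequential code. The retry loop and version checks are hidden behind atomically, which is the appeal of the abstraction.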
Another of your highly cited papers in the ACM Digital Library is “The Next 700 BFT Protocols.” In the paper, you and your co-authors show how to derive a BFT protocol in a simple way. What is a BFT protocol, and why is it important to derive such protocols?
What we typically call a BFT (Byzantine Fault-Tolerant) protocol is a distributed algorithm that tolerates nodes that can behave arbitrarily, e.g., computers hacked by malicious players. BFT protocols that seek to achieve agreement among nodes of a distributed system are key to building the abstraction of the universal internet computer. For example, they need to replicate the state of a Turing machine over many nodes while providing the illusion of an immortal single machine. This replication must occur even if some of the nodes crash or behave in an arbitrary manner.
These protocols are notoriously hard to design and implement. There also isn’t any “one-size-fits-all” protocol that applies to all situations. In the paper mentioned above (co-written with Marko Vukolic, Nikola Knezevic, and Vivien Quema), we show how, starting from a protocol that has been proven correct and whose implementation has been tested exhaustively, one can develop, with simple modifications, new protocols that each perform best under different conditions.
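One small ingredient of Byzantine fault-tolerant replication can be sketched in a few lines of Python. The sketch below is not a BFT protocol and is not taken from the paper. It only illustrates two standard facts: classical protocols need n >= 3f + 1 replicas to tolerate f Byzantine ones, and a client can trust a reply once f + 1 replicas report the same value, because at least one of them must be correct. The function names are hypothetical.

```python
from collections import Counter

def max_faults(n):
    """Largest f such that n >= 3f + 1 replicas are available."""
    return (n - 1) // 3

def accept_reply(replies, f):
    """Return the value reported by at least f + 1 replicas, else None.

    With at most f Byzantine replicas, f + 1 matching replies include
    at least one from a correct replica.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None
```

For example, with four replicas and f = 1, accept_reply([42, 42, 42, 7], 1) returns 42 despite the one arbitrary answer, while accept_reply([42, 7, 13], 1) returns None because no value has enough support yet.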
In the introduction to the recent paper, “Byzantine Machine Learning: A Primer,” you (along with co-authors Nirupam Gupta and Rafael Pinot) note that because of “the growing computational demand of machine learning tasks…the training procedures for these algorithms” are “fragmented into several (simpler) sub-tasks that are distributed on different machines (or nodes).” Will you tell us a little about this paper?
The most important challenge of machine learning is robustness—essentially, the ability to develop trustworthy models. These models must be developed without trusting individual data sources or underlying machines involved in the learning. Working with my students and postdocs (Peva Blanchard, Mahdi El Mhamdi, Sebastien Rouault and Julien Steiner), we defined the notion of Byzantine Machine Learning.
We kept working on this fascinating problem to better understand what can (and cannot) be done. With my former postdocs Nirupam Gupta and Rafael Pinot, we wrote the paper cited above, which was followed by a book, Robust Machine Learning - Distributed Methods for Safe AI, that summarizes the main results and open problems in the field. A full companion library, which allows computing professionals to experiment with various robust ML protocols, was also developed in our lab.
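The idea of learning without trusting individual machines can be illustrated with a robust aggregation rule. The Python sketch below is not drawn from the paper, the book, or the lab's library. It uses the coordinate-wise median, a simple and well-known robust aggregator, as a stand-in, and the function names and toy gradients are assumptions made for illustration. It contrasts plain averaging, which a single faulty worker can skew arbitrarily, with the median, which stays close to the honest workers' values as long as they form a majority.

```python
from statistics import mean, median

def average(gradients):
    """Plain averaging: one outlier can move the result arbitrarily far."""
    return [mean(coord) for coord in zip(*gradients)]

def coordinate_wise_median(gradients):
    """Take the median of each coordinate across all workers."""
    return [median(coord) for coord in zip(*gradients)]

# Toy data (assumed): three honest workers and one Byzantine worker.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
byzantine = [[1000.0, -1000.0]]
gradients = honest + byzantine

skewed = average(gradients)                  # dominated by the outlier
robust = coordinate_wise_median(gradients)   # stays near the honest values
```

Here the averaged gradient is pulled far away from every honest worker, while each coordinate of the median remains within the range of the honest gradients.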
What has been a challenge in your efforts to improve computer science education in Africa? What is an approach that has been especially effective in these efforts?
The first challenge was to identify serious interlocutors from Africa who are both influential and genuinely motivated to promote science in general, as well as computer science specifically. After several attempts, I met some Moroccan government officials: the president of OCP (the biggest company in Morocco), and the president of UM6P (a young and very ambitious university in Morocco). Along with colleagues from Europe and the US, they were instrumental in launching several initiatives (including conferences, exchange programs, and new degrees) that promoted computer science in Africa through a strong partnership with EPFL.
The second challenge was to convince students and companies that there cannot be any AI development (or any development at all) without a strong system of computer science education.
The third challenge has been to go beyond temporary initiatives and build something long-term. I was fortunate to meet excellent faculty and staff in Morocco who are now leading these efforts.
Rachid Guerraoui is a Professor in the School of Computer and Communications Sciences at EPFL. He is recognized as one of the leading figures in distributed computing, an area of research that aims to make multiple computers work together to solve a common problem.
His innovations address the fundamental theory and practice of distributed computing, including problems such as agreement and reliable information dissemination. He is also interested in distributed computing abstractions such as transactional memory and concurrent data structures. More recently, Guerraoui has published influential papers in the field of Byzantine machine learning, which explores the intersection of robust machine learning and distributed computing.
In addition to his technical contributions, Guerraoui has initiated several dynamic programs to make quality computer science education accessible throughout Africa. Already an ACM Fellow and recipient of the Dahl-Nygaard Award, he was recently named the recipient of the inaugural ACM Luiz André Barroso Award. This new award was established to recognize researchers from historically underrepresented communities who have made fundamental contributions to computer science.