People of ACM - Nesime Tatbul

May 23, 2024

Why is this an exciting time to be working in large-scale data management systems?

Technological advances in our capability to generate, collect, store, and process digital information have enabled everything to become data-driven. In fact, data is one of the key driving forces behind the rapid progress we have been witnessing in AI/ML recently. As much as scalable data management is a crucial component of any AI/ML system infrastructure, data systems themselves can also greatly benefit from AI/ML advances in becoming more performant, adaptive, and easier to use. More generally, there are plenty of new research opportunities and challenges as we enhance our data systems both to leverage the rapidly evolving hardware and software landscape and to ensure that they continue to deliver scalable performance in the face of novel data-intensive workloads. It is really exciting to be a member of the vibrant database community and witness its growing real-world impact.

What is a challenge you are working on right now in your role as Senior Research Scientist at Intel’s Parallel Computing Lab?

My current research broadly investigates how we can improve data systems through machine learning and observability. I have been pursuing a few different projects focusing on AI-driven data systems, analytics, and applications. The time series data type plays a special role in many of these projects, as it is essential to observing, understanding, and managing any dynamic system. For example, observability data comes in the form of heterogeneous time series (e.g., metrics, logs, traces), from large numbers of telemetry data sources, at high and varying volumes. There is a need to efficiently collect and query this data to help engineers keep track of system behavior, debug or explain performance issues, and potentially use this data for further system modeling and automation. We are building a new system called "Mach" to make such data more easily accessible with a low resource footprint. We are also exploring new analytical techniques over time-varying data, such as anomaly detection.

You are known for your contributions to stream processing, including the Aurora/Borealis and S-Store systems. For those who are unfamiliar, what is stream processing and how have recent innovations in this area improved large-scale data management?

Stream processing is a database technology that has been around for more than two decades. It was originally motivated by highly dynamic and distributed data sources such as sensors or stock tickers, and the need to communicate and process data pouring in from these sources in an online fashion, in near real time. Stream processing systems introduced a new paradigm where such data could be continuously processed in memory, instead of first being inserted into a traditional disk-based database to be later queried in an offline fashion. This way, one could handle higher volumes of fresh data at lower latencies.
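As a rough illustration of the continuous, in-memory style described above, here is a minimal Python sketch. It is not taken from Aurora/Borealis, S-Store, or any other system mentioned in this interview; the function name `sliding_average`, the window size, and the sample stock ticks are all assumptions made for the example. Each tick is processed as it arrives, and a result is emitted immediately, without first storing the data in a database and querying it later.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def sliding_average(
    ticks: Iterable[Tuple[int, float]], window: int
) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, mean of the most recent `window` prices) per tick."""
    recent = deque(maxlen=window)  # only the current window is kept in memory
    total = 0.0
    for timestamp, price in ticks:
        if len(recent) == window:
            total -= recent[0]  # this value is evicted by the append below
        recent.append(price)
        total += price
        yield timestamp, total / len(recent)


# Hypothetical stock ticks as (timestamp, price) pairs.
ticks = [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0)]
results = list(sliding_average(ticks, window=2))
print(results)
```

Because the function is a generator, it works equally well on an unbounded source such as a socket or a message queue; the list here only stands in for such a source.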

The early-generation systems focused on defining new languages, models, and architectures as foundations. During the big data era, streams represented the velocity dimension of big data and there was widespread industrial development and adoption of open-source streaming systems that could reliably operate at large scales (e.g., in cloud data centers). Today, stream-based processing has become an integral part of any modern data infrastructure, efficiently supporting not only continuous machine learning model training and inference pipelines but also real-time data ingestion and decision making in many advanced applications such as clickstream analysis, autonomous driving, and cloud observability.

At SIGMOD 2021, you and co-authors presented “Bao: Making Learned Query Optimization Practical.” What would be the benefit of applying machine learning techniques to query optimization? Why has it been impractical up until this point?

Database systems count on their query optimizers to find efficient plans to execute their query workloads. This is a performance-critical task, but also hard to get right due to the complex nature of the problem. Traditional optimizers have been largely rule- or heuristics-based, and are prone to planning mistakes with serious performance and cost consequences, especially as data and query workloads change over time. The key insight is that, by adding an adaptive learning loop to a query optimizer, we can detect these mistakes and correct them. Bao, the Bandit optimizer, provides a practical way to realize this idea, mainly because it can be integrated into traditional optimizers by leveraging their existing query hinting mechanisms, and because its model training time is relatively short. In follow-up work, we have further generalized Bao into AutoSteer, a generic framework to automatically steer query optimization in any SQL database, further expanding Bao’s practicality through reduced human supervision and reduced dependency on database-specific capabilities.
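To make the idea of an adaptive learning loop over query hints more concrete, here is a deliberately simplified Python sketch. It is not Bao's actual algorithm: it is a plain epsilon-greedy bandit that ignores the query itself, and the hint set names, the simulated latencies, and the function names are all hypothetical. It only shows the general loop of choosing a hint set, observing the latency, and updating an estimate.

```python
import random

# Hypothetical hint sets; a real system would map these to optimizer hints.
HINT_SETS = ["default", "no_nested_loop", "no_hash_join"]


def simulated_latency(hint_set: str, rng: random.Random) -> float:
    """Stand-in for executing a query under a hint set (made-up numbers)."""
    base = {"default": 100.0, "no_nested_loop": 60.0, "no_hash_join": 140.0}
    return base[hint_set] + rng.uniform(-10.0, 10.0)


def run_bandit(rounds: int = 500, epsilon: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    counts = {h: 0 for h in HINT_SETS}
    mean_latency = {h: 0.0 for h in HINT_SETS}
    for _ in range(rounds):
        untried = [h for h in HINT_SETS if counts[h] == 0]
        if untried:
            choice = untried[0]  # try every hint set at least once
        elif rng.random() < epsilon:
            choice = rng.choice(HINT_SETS)  # explore occasionally
        else:
            choice = min(HINT_SETS, key=mean_latency.get)  # exploit best so far
        latency = simulated_latency(choice, rng)
        counts[choice] += 1
        # Incremental update of the running mean latency for this hint set.
        mean_latency[choice] += (latency - mean_latency[choice]) / counts[choice]
    return counts, mean_latency


counts, mean_latency = run_bandit()
best = min(mean_latency, key=mean_latency.get)
print(best, counts)
```

In this toy setup the loop quickly settles on the hint set with the lowest simulated latency while still occasionally re-checking the others, which is what lets such a loop notice when a previously good choice stops being good.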

Your current role at Intel and MIT spans across industry and academia. What is it like to be working in such a dual setting?

Working for an industrial research lab while being embedded in an academic environment is a unique and exciting experience. It exposes me both to cutting-edge research pushing the boundaries of our field and to the latest industrial technology trends and challenging real-world problems. There is huge potential for impact by bridging these two worlds. I feel very fortunate to be exploring these opportunities together with a diverse group of collaborators from both sides.

Nesime Tatbul is a Senior Research Scientist at Intel’s Parallel Computing Lab (PCL) and MIT’s Computer Science and Artificial Intelligence Lab (CSAIL). Her research interests are broadly in large-scale data management systems and modern data-intensive applications, with a recent focus on learned data systems, time series analytics, and observability data management. She is best known for her contributions to stream processing, which include the Aurora/Borealis systems (now TIBCO StreamBase) and the S-Store system.

Among her honors, she was a co-recipient of an ACM SIGMOD Research Highlight Award (2022), an ACM SIGMOD Best Paper Award (2021), and two ACM SIGMOD Best Demonstration Awards (2005 and 2019). She was recently named an ACM Distinguished Member for foundational scientific contributions in streaming data systems and time series analytics.