People of ACM - Marcelo Arenas

January 25, 2022

What is the semantic web, and what is its relationship to the management of data?
The web was developed with human users in mind, which resulted in many tools that make it more friendly and understandable for such users. However, a nicely designed web page can be very difficult for a computer to interpret. The aim of the semantic web is to provide the means to make the web machine-understandable, which in concrete terms means developing a series of standards, methodologies, techniques and tools to formally specify the semantics of data on the web.

Three fundamental standards for the semantic web are Uniform Resource Identifiers (URIs), the Resource Description Framework (RDF) and the query language SPARQL. URIs are identifiers for web resources, just as URLs are identifiers for web pages; such resources may be anything, from a digital file or a web page to a book author. RDF is a data model for specifying relationships between web resources. An RDF file, or dataset, can be thought of as a graph whose nodes are web resources and whose edges specify relationships between them. For example, in an RDF dataset for a social network, a node represents a person and an edge specifies a relationship between two people, such as one person following another or two people being friends. Finally, SPARQL is the standard query language for extracting information from an RDF dataset.
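To make the data model concrete, here is a minimal sketch of the social-network example in Python, using the rdflib library (my choice of tool, not one named in the interview); the http://example.org/ namespace and the follows relation are invented purely for illustration.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")  # invented namespace for this sketch

g = Graph()
# Nodes are web resources identified by URIs; edges are relationships.
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.bob, FOAF.name, Literal("Bob")))
g.add((EX.alice, EX.follows, EX.bob))  # "alice follows bob"

# SPARQL is the standard language for extracting information from RDF.
query = """
PREFIX ex: <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?followerName ?followedName WHERE {
    ?follower ex:follows ?followed .
    ?follower foaf:name ?followerName .
    ?followed foaf:name ?followedName .
}
"""
for row in g.query(query):
    print(f"{row.followerName} follows {row.followedName}")  # Alice follows Bob
```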

Many of the usual issues in data management appear in the semantic web, in particular when managing an RDF dataset. This is not surprising, since an RDF dataset, on one hand, can naturally be thought of as a graph database, and, on the other hand, can directly be stored as a relational database. Hence, classic data management problems such as storing, cleaning, integrating, querying and reasoning are relevant for the semantic web. Moreover, the development of the semantic web poses new challenges for data management, particularly given the highly distributed nature of data on the web.
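As one illustration of the relational view mentioned above: an RDF dataset is a set of subject-predicate-object triples, so it can be stored as a single three-column table, with graph patterns becoming joins. The sketch below uses Python's sqlite3 module with invented data; it is a simplified illustration, not a production triple-store design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per RDF triple: a direct relational encoding of the graph.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("http://example.org/alice", "http://example.org/follows", "http://example.org/bob"),
        ("http://example.org/bob", "http://example.org/follows", "http://example.org/carol"),
    ],
)

# A two-edge graph pattern becomes a relational self-join:
# whom does the person that alice follows, in turn, follow?
rows = conn.execute(
    """
    SELECT t2.object FROM triples t1 JOIN triples t2
      ON t1.object = t2.subject
    WHERE t1.subject = 'http://example.org/alice'
      AND t1.predicate = 'http://example.org/follows'
      AND t2.predicate = 'http://example.org/follows'
    """
).fetchall()
print(rows)  # [('http://example.org/carol',)]
```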

Your paper “Semantics and complexity of SPARQL” (co-authored with Jorge Pérez and Claudio Gutierrez) won the Semantic Web Science Association Ten-Year Award. What is a key insight of this paper?
The World Wide Web Consortium (W3C) defines the standards for the semantic web, such as RDF and SPARQL. In 2006, when the paper “Semantics and complexity of SPARQL” was published, the W3C was working on the standardization of the syntax and semantics of SPARQL. This was not an easy task, not only because of the effort needed to understand the requirements for querying data on the web, but also because some of the features needed for SPARQL differed from the traditional operators in database query languages. In particular, the open nature of web data, which is constantly changing, makes the ability to retrieve optional information, if available, an important functionality. The key insights of this paper are the definitions we proposed for a simple algebraic syntax and a formal semantics for SPARQL. This proposal allowed us to carry out the first detailed analysis of the computational complexity of evaluating queries in SPARQL, which helped clarify the complexity of evaluating the different operators of the language and, in particular, of the operator being proposed to retrieve optional information. The semantics of SPARQL finally adopted by the W3C was based on this proposal, which I think is its most important contribution, and many subsequent investigations of query languages for the semantic web have built on it.
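To illustrate the optional-information operator discussed here, the following rdflib sketch (again with invented data) runs a SPARQL query whose OPTIONAL clause returns an email address when one is recorded, while still returning the person when it is not.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, FOAF.mbox, URIRef("mailto:alice@example.org")))
g.add((EX.bob, FOAF.name, Literal("Bob")))  # Bob has no mailbox recorded

# OPTIONAL binds ?email when available, but does not filter out ?person
# rows that lack it: the behavior needed for incomplete web data.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email WHERE {
    ?person foaf:name ?name .
    OPTIONAL { ?person foaf:mbox ?email }
}
"""
for row in g.query(query):
    print(row.name, row.email)  # Alice mailto:alice@example.org, then Bob None
```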

One of your most-cited recent papers is “Foundations of Modern Query Languages for Graph Databases.” Why are graph databases important and what are some exciting research directions in this area?
There are many reasons why graph databases are important and popular. Graphs are a simple and natural way to represent data; in fact, in many domains, such as social, communication, transport and contact-tracing networks, data can be conceptualized in a simple and intuitive way using graph databases. Graphs also offer a flexible data model, in which updates can easily be carried out by adding and removing edges, and for which lightweight data integration methods can be developed.
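Here is a quick sketch of that flexibility, using rdflib's in-memory graph as a stand-in for a graph database (the data is invented): an update is just the addition or removal of an edge, and introducing a new edge label requires no schema migration.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.alice, EX.follows, EX.bob))     # add an edge
g.add((EX.alice, EX.blocks, EX.carol))    # new edge label, no schema change needed
g.remove((EX.alice, EX.follows, EX.bob))  # remove an edge

print(len(g))  # 1 edge (triple) remains
```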

The aim of the paper “Foundations of Modern Query Languages for Graph Databases” was to identify the fundamental features of graph databases that are common across different views and implementations of this technology. In this sense, a key issue at this point, which has received considerable attention, is the development of a standard query language for graph databases that includes such common functionalities. In terms of exciting research directions, the incorporation of knowledge into graphs, usually referred to as knowledge graphs, is posing many interesting new challenges. How can knowledge graphs be constructed and updated? How can they be integrated? How can the knowledge in them be validated? What is the right way to query them? How can we reason with and about them? How can the ideas from the semantic web, especially with respect to representing vocabularies and ontologies, be used in this area? How can all of this be integrated with developments in artificial intelligence, particularly graph neural networks and graph embeddings? These are all very interesting research questions.

How might the World Wide Web look and function differently in five to ten years?
I can imagine that the ability of machines to understand web data will be much greater in five to ten years. This might seem obvious to some, but what I think will be interesting is the combination of techniques coming from different areas. On one hand, we have the techniques and standards developed in the semantic web to represent the semantics of data on the web. On the other hand, we have artificial intelligence techniques that are making enormous strides in handling human tasks, such as translating text from one language to another. It will be very interesting to see the proper integration of these techniques for dealing with web data.

Marcelo Arenas is a Professor at the Pontificia Universidad Católica de Chile and Director of the Millennium Institute for Foundational Research on Data. His research interests are in the areas of data management, database systems, applications of logic in computer science, knowledge representation and the semantic web.

Among his honors, Arenas received a SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention, the Semantic Web Science Association Ten-Year Award, and nine Best Paper awards at various conferences. He was recently named an ACM Distinguished Member for outstanding scientific contributions to computing.