People of ACM - Ricardo Baeza-Yates

May 29, 2018

NTENT offers a technology that ranks search results by using machine learning (ML) and natural language processing (NLP) to analyze documents and queries. How are ML advances applied to semantics to transform search?

ML is transforming our world, and hence it was natural to couple its power with semantic technology. This can range from simply adding semantic features when we use ML to understand text, to more complicated architectures, such as predicting which semantic rule should be applied in a given NLP step. As our main goal is to predict the intention behind a query, that is, the task that a person wants to perform, ML is the natural tool to use, including ranking the different plausible tasks as well as the answers within each task, which naturally handles the ambiguity of using a few terms to specify our needs. This is even more relevant for voice queries, which are longer and thus naturally have more semantics encoded in them.

You are known for the Baeza-Yates-Gonnet algorithm, a string-matching algorithm. What key insights led Gaston Gonnet and you to introduce this algorithm, and why is it important for search?

To put this in context, this algorithm was part of my PhD thesis at the University of Waterloo and Gaston was my supervisor. In string matching, a natural way to represent a search algorithm is a finite automaton. If you use a non-deterministic one, you just need as many states as the length of the string to be searched. However, if you transform the automaton to a deterministic one, the number of states can grow a lot, even exponentially, if you are searching for regular expressions.

On the other hand, running the non-deterministic automaton is much slower than the deterministic one. However, each state needs just one bit to be represented (active or not active), and thus our key insight was that we could encode the non-deterministic automaton in a binary sequence and update all the states in parallel by using bitwise computer operations over those sequences (therefore, this algorithm is also called the "shift-or" algorithm). That is, we could encode strings of length 32 or 64 in a standard computer, which in practice is more than enough to search for any string.
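The mechanism described above can be sketched in a few lines. The following is a minimal Python illustration of the shift-or idea (the function name and variable names are my own, not from the original algorithm's publication): each bit of the word `D` holds one state of the non-deterministic automaton, and a single shift plus a bitwise OR updates all states at once for each text character.

```python
def shift_or_search(pattern, text):
    """Bit-parallel (shift-or) exact string matching.

    Returns the start positions of all occurrences of `pattern` in `text`.
    Bit i of the state word D is 0 iff the automaton state "matched the
    first i+1 pattern characters" is active (0 = active in shift-or).
    """
    m = len(pattern)
    assert 0 < m <= 64, "pattern must fit in one machine word"
    all_ones = (1 << m) - 1

    # Preprocess: B[c] has bit i cleared wherever pattern[i] == c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, all_ones) & ~(1 << i)

    D = all_ones              # all states inactive initially
    match_bit = 1 << (m - 1)  # final (accepting) state
    positions = []
    for j, c in enumerate(text):
        # One shift and one OR update every automaton state in parallel
        D = ((D << 1) | B.get(c, all_ones)) & all_ones
        if D & match_bit == 0:           # accepting state is active
            positions.append(j - m + 1)  # match starts here
    return positions
```

For example, `shift_or_search("ana", "banana")` finds the two overlapping occurrences starting at positions 1 and 3. Note that the search loop does constant work per text character regardless of the pattern length, which is exactly the "small automaton, linear time" property described above.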

So, we had the best of two worlds: small automata with linear searching time. This idea was later extended to other types of searches, including patterns with errors, creating its own class of pattern matching algorithms (called bit-parallel pattern matching), which has been applied not only to text searching, but also to computational biology and other areas.

One of the talks you have given as an ACM Distinguished Speaker is “Bias and the Web.” What is bias on the web, and how can it be prevented from becoming a more entrenched problem in the years ahead?

Bias on the web is a vicious cycle that starts with data bias, which is amplified by algorithmic bias (including sampling bias) and in turn reinforced through different interaction biases (such as presentation bias and position bias), continuing with our own self-selection biases that reflect our cultural and cognitive biases. The cycle is closed by personalization techniques that process our usage data or, worse, by our own biased contributions to web content in the form of comments, opinions, blogs, etc. This makes the task of finding truthful and fair relevant content by search engines even more difficult. You can find all the details in the article that appears in the June 2018 issue of Communications of the ACM.

Preventing this vicious cycle is a paramount social and technical challenge: just consider the “fake news” problem. The first step is to be aware of all these biases—this was the motivation for starting this talk in 2016. The second step is to mitigate as much as possible the bias at each stage of the cycle. That implies debiasing web data, using better sampling techniques, developing more transparent and accountable algorithms, designing fairer user interfaces, and debiasing usage data. This may imply that a web-based company must lose some revenue, but there is no other way to get around the problem if we want to build a better web for a better world. As users, we also need to be more conscious about our decisions and check the sources and claims of every piece of information that we consume and publish, as being neutral is the main way to help improve machine learning algorithms.

You spent a large part of your career at the Universidad de Chile in Santiago and you have taught and worked around the world. What are some of the unique challenges and opportunities for the computing field in South America?

After my PhD I wanted to return to my home country and try to contribute to the development of computer science in Chile and Latin America. At that time, in 1989, there were a handful of people with PhDs there. So, I did my research career there for 15 years before moving to Barcelona, doing a lot of service to the local and regional community through the Chilean Computer Science Society, CLEI (the association of all CS departments in Latin America), CYTED (Ibero-American Program of Science and Technology for Development) and UNESCO. The main challenge was lack of resources, so research in general is much harder.

On the other hand, the potential opportunities were large, as even small changes could have a significant impact, because many people in the region are eager to learn. So, I want to believe that I contributed to my alma mater with a few “bytes,” as today it is one of the top three CS departments in the region and is in the top 100 in the world according to the QS World University Rankings. In other words, working where you can really make a difference is much more satisfying than working where you are just one more researcher in a large group.

Ricardo Baeza-Yates is a Chilean-Spanish computer scientist who currently serves as CTO of NTENT, a semantic search technology company. Prior to joining NTENT, he held a number of leadership positions at Yahoo Labs at locations around the globe—most recently as VP of Research at Yahoo’s headquarters in Sunnyvale, California. Baeza-Yates is also part-time Director of Computer Science Programs for Northeastern University’s Silicon Valley campus. Earlier in his career, he was a full professor at the University of Chile as well as at Pompeu Fabra University in Barcelona, Spain, where he still has part-time appointments. He is also an adjunct professor at the University of Waterloo, Canada, his PhD alma mater.

The second edition of his book Modern Information Retrieval (co-authored with Berthier Ribeiro-Neto), won the 2012 Book of the Year Award from the Association for Information Science and Technology. Baeza-Yates was named an ACM Fellow in 2009 for contributions to the development of algorithms and information retrieval techniques. He is an ACM Distinguished Speaker and serves on the Editorial Board of ACM Books.