People of ACM - Fabrizio Sebastiani
July 12, 2016
How did you become interested in the field of information retrieval, and specifically, in text classification?
In the early ’90s my research interests were actually at the intersection of mathematical logic and artificial intelligence. At that time, a small community of researchers was starting to investigate a new approach to information retrieval that had its foundations in mathematical logic. This topic looked very challenging; I had the opportunity to join an EU-funded project on this theme, and this was my de facto entrance into the (textual) information retrieval arena.
Over the years, the idea of viewing information retrieval (IR) in terms of logic became less attractive to me, since I perceived the difficulty of straitjacketing information needs and textual documents (whose semantics is inherently fluid, subjective, and context-dependent) into logical formulae, which have a drastically more rigid nature. At the same time, it became increasingly clear to me that the subjective and context-dependent nature of language could be harnessed by looking at language not as something crystallized in dictionaries and grammar textbooks, but as something that manifests itself in language use. This led me straightaway to machine learning, and to text classification, which at that time was the most visible meeting ground of IR and machine learning.
What are some promising areas of research in text classification that will make significant advances in the years ahead?
Text classification started out as the task of classifying professionally authored textual documents, such as newswire reports or abstracts of scientific papers. Many believe that this is now an essentially solved problem, since current systems perform fairly accurately at this, and since in recent years the accuracy we can obtain on this task has plateaued. One of the challenges that text classification is confronting nowadays, and on which I expect significant advances in the years to come, is to obtain accurate results on informal text, such as what can be found in social media. For instance, making sense of tweets is often much harder than making sense of newswires from a news agency, since the former are (unlike the latter) often riddled with urban jargon, acronyms, obscure abbreviations, and ungrammaticalities of all types, and since the former are also much shorter (and thus carry less information) than the latter.
In the ’90s, text classification essentially meant classification by topic. Another challenging problem is classifying text according to dimensions other than topic, such as classification by opinion, or sentiment. Sentiment classification has been investigated for quite some years now, but is inherently much more difficult than classification by topic, since in order to express sentiment humans use a bewildering array of linguistic devices (including sarcasm, irony, innuendo, and understatement), which are hard for a computer program to capture. Capturing sentiment is of key importance for a very wide range of applications, so we are going to see increased investment from industry in this area, from which significant advances will likely result.
Some of your recent research has involved a task called “quantification”; what is it exactly?
In many applications of classification, what we are really interested in is not the class to which an individual item belongs, but the percentage of items that belong to a certain class; this is true of most applications in the social sciences, market research, political science, and epidemiology, just to mention a few. The task of estimating the above-mentioned percentage via supervised learning is called “quantification.” Sure, we can perform quantification by classifying all items and counting how many of them have been assigned the class; but it would be akin to, say, finding the maximum element in a set by first ranking its elements and then picking the top-ranked one. As in this latter case, we are faced with a more general task (classification) and a more specific task (quantification), and intuition suggests that we should solve the more specific task directly; in the “maximum” example we obtain a gain in efficiency, while in the “quantification” example researchers are seeking a gain in accuracy. Quantification is going to be more and more important with the advent of Big Data: in these contexts, we simply cannot afford to pay attention to individual data items, and the only results that matter are at the aggregate level.
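A minimal sketch of the distinction described above, offered as an illustration and not as the method discussed in the interview. It contrasts plain "classify and count" with one well-known correction, adjusted classify and count, which rescales the raw count using the classifier's true-positive and false-positive rates. The function names, the simulated classifier, and all the numbers below are assumptions made for the example; in practice the two error rates would themselves have to be estimated on held-out labeled data.

```python
import random


def classify_and_count(predictions):
    """Estimate class prevalence as the fraction of items predicted positive."""
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the raw count using the classifier's error rates.

    If p is the true prevalence, the expected fraction predicted positive is
        observed = tpr * p + fpr * (1 - p)
    so solving for p gives
        p = (observed - fpr) / (tpr - fpr)
    """
    observed = classify_and_count(predictions)
    if tpr == fpr:
        # The classifier carries no signal; no correction is possible.
        return observed
    estimate = (observed - fpr) / (tpr - fpr)
    # Clip to the valid range, since sampling noise can push it outside [0, 1].
    return min(1.0, max(0.0, estimate))


def simulate_predictions(n_items, prevalence, tpr, fpr, rng):
    """Hypothetical imperfect classifier: labels positives with probability
    tpr and negatives with probability fpr."""
    n_positive = round(n_items * prevalence)
    predictions = []
    for i in range(n_items):
        rate = tpr if i < n_positive else fpr
        predictions.append(1 if rng.random() < rate else 0)
    return predictions


# Assumed values, chosen only for illustration.
TRUE_PREVALENCE = 0.10
TPR, FPR = 0.80, 0.10

rng = random.Random(0)
preds = simulate_predictions(100_000, TRUE_PREVALENCE, TPR, FPR, rng)

cc_estimate = classify_and_count(preds)
acc_estimate = adjusted_classify_and_count(preds, TPR, FPR)

print(f"true prevalence:             {TRUE_PREVALENCE:.3f}")
print(f"classify and count:          {cc_estimate:.3f}")
print(f"adjusted classify and count: {acc_estimate:.3f}")
```

Under these assumed error rates, the raw count is biased because false positives from the large negative class outnumber the positives that are missed, while the adjusted estimate lands close to the true prevalence. This is one way to see why a classifier that looks reasonable on individual items can still give a poor aggregate estimate.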
What value does the ACM SIGIR Conference on Research and Development in Information Retrieval bring to the field? Are there any parts of the SIGIR 2016 conference program that you are particularly excited about?
SIGIR is the “landmark” conference in the field of IR. While a number of other IR-related conferences (such as WSDM, ICTIR, and CHIIR, all ACM-sponsored) have recently sprung up, each of them is narrower in scope, since it tends to focus on some specific subarea of IR. SIGIR is still the conference where you want to be in order to obtain a comprehensive view of the recent developments in IR at large. I should add that IR is not just “a query box and ten blue links.” In other words, while web search is the task every outsider associates with the term “information retrieval,” it is by no means the only endeavor addressed by IR, which also deals with recommendation, filtering, community question answering, information extraction, e-discovery, and others. Additionally, (hyper)text is not the only medium being addressed by IR, which also deals with music, spoken text, graphics, imagery, and video.
Concerning the SIGIR 2016 program, I find the entire program exciting! However, if I had to single out one aspect, it would be the parts of the program dealing with the use of “deep learning” technology in IR. So far, deep learning has had a much bigger impact on NLP than on IR; however, SIGIR 2016 will include a keynote lecture, a tutorial, a workshop (the latter two very highly subscribed), and several contributed papers, all devoted to probing the relationships between deep learning and IR. I am thus curious to see whether SIGIR 2016 will mark the emergence of a new trend; I expect to leave this conference with many new ideas for my own research.
You are a ski mountaineer with the Alpine Club of Italy. Ski mountaineers climb up mountains on foot (while carrying their skis) and then ski back down sheer peaks and untamed landscapes. What was the most difficult mountain you skied down and what was the experience like?
I like to think that the most difficult mountain is the one I have not yet climbed. Among those I have climbed and skied down, the most difficult was the “Punta Rossa della Grivola,” a peak in the Italian Western Alps. It is not a terribly challenging mountain per se, but during that trip I met my future wife, and I remember finding it difficult to concentrate on the mountain.
Fabrizio Sebastiani is a Principal Scientist at Qatar Computing Research Institute, Qatar Foundation, and a Senior Research Scientist at the Institute for the Science and Technologies of Information (ISTI), an institute of the Italian National Research Council (CNR). His research interests lie at the crossroads of information retrieval, machine learning, and human language technology, with particular emphasis on text classification, opinion mining, and their applications.
He has published 64 peer-reviewed articles on topics including information retrieval, text analytics, and opinion mining. Sebastiani is an active member of the ACM Special Interest Group on Information Retrieval (SIGIR) and is serving as General Co-Chair of the 2016 ACM SIGIR Conference, taking place July 17-21 in Pisa, Italy.