People of ACM - Regina Barzilay

January 4, 2018

Along with colleague Kevin Knight and your student Ben Snyder, you developed a machine learning program that translated the ancient language of Ugaritic. How is your program different from popular language translators, and what are some potential future applications of this approach?

The main challenge in decipherment is the lack of parallel data (such as aligned translations in English and French). Modern machine learning methods utilize millions of such parallel sentences to learn correspondences between the vocabularies of the two languages and their grammatical constructions. Colloquially speaking, these models learn to translate the word "dog" to its French equivalent because of the way the two words co-occur in aligned sentences. However, for most languages that are undeciphered today, we do not have such parallel resources. Therefore, the decipherment process has to be guided by different patterns, which requires novel models. For instance, in the case of Ugaritic, our model is built on the fact that both Hebrew and Ugaritic descend from the same Proto-Semitic language, which means that their alphabets, vocabularies and morphologies are closely related. We were able to design a Bayesian model that incorporates these assumptions and learns to map the alphabets and cognates (words in two languages that come from the same proto-root).
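
To make the intuition concrete, here is a toy sketch in Python of that underlying idea; it is not the actual Bayesian model from the paper. If two scripts encode closely related vocabularies, one can search for the symbol mapping that turns the most lost-script words into words of the known language. All of the words, symbols and alphabets below are invented for illustration.

```python
# Toy illustration of decipherment without parallel data: find the one-to-one
# symbol mapping under which the most lost-script words decode to words of a
# known related language. (The real model replaces this brute-force search
# with Bayesian inference over alphabet and cognate mappings.)
from itertools import permutations

known_vocab = {"yam", "el", "am", "lama"}   # invented "known language" words
lost_words = ["123", "45", "23", "5232"]    # invented "lost script" words

lost_alphabet = sorted(set("".join(lost_words)))  # symbols: 1 2 3 4 5
known_alphabet = list("aelmy")                    # candidate target letters

def score(mapping):
    """Count how many lost-script words decode to known-vocabulary words."""
    table = str.maketrans(dict(zip(lost_alphabet, mapping)))
    return sum(w.translate(table) in known_vocab for w in lost_words)

best = max(permutations(known_alphabet, len(lost_alphabet)), key=score)
table = str.maketrans(dict(zip(lost_alphabet, best)))
print(dict(zip(lost_alphabet, best)))             # e.g. {'1': 'y', ...}
print([w.translate(table) for w in lost_words])   # ['yam', 'el', 'am', 'lama']
```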

While our machine translation systems become better and better every year, our progress in decipherment is quite modest. I hope that in the near future machines will help us to uncover the mysteries of Linear A and Elamite, Bronze Age writing systems that are currently undeciphered.

Among the many challenges of having machines understand and communicate with humans, in which area do you expect to see significant progress in the next 10 years?

Most of the progress in natural language processing comes from access to large amounts of annotated data. To give an example, consider the textual entailment task, where the goal is to predict whether the truth of sentence B follows from sentence A (e.g., "The meeting is scheduled on Sunday" entails "The meeting is scheduled on the weekend"). In 2011, the size of the benchmark dataset was 7,000 examples; in 2015 it became 570,000. The techniques we have today can effectively utilize these large datasets, but show little improvement on small corpora. For many languages and applications, we will never have sufficient data for training our annotation-hungry neural models. This limitation is well understood by many in the community, and I hope we will see more exciting developments in the area of semi-supervised methods. The reason we as humans can learn from very few examples relates to the vast prior knowledge we have accumulated before solving a new task. It would be really powerful if our models for language understanding could have the same capacity. It may come from intelligently utilizing raw text data, from incorporating multimodal signals such as vision, or from effectively communicating with humans and utilizing their knowledge.
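
As a concrete picture of the task (though not of any particular benchmark), here is a minimal sketch in Python: each example is a premise, a hypothesis, and an entailment label, and the crude word-overlap heuristic below stands in for a trained model. Real entailment systems are learned classifiers; this baseline only shows the shape of the data.

```python
# Textual entailment as data: (premise, hypothesis, does-premise-entail-it).
examples = [
    ("The meeting is scheduled on Sunday",
     "The meeting is scheduled on the weekend", True),
    ("The meeting is scheduled on Sunday",
     "The meeting was cancelled", False),
]

def entails(premise: str, hypothesis: str, threshold: float = 0.7) -> bool:
    """Crude baseline: fraction of hypothesis words found in the premise.
    A stand-in for a trained classifier, for illustration only."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h) >= threshold

for premise, hypothesis, gold in examples:
    print(entails(premise, hypothesis), "gold:", gold)
```

This baseline happens to label both toy examples correctly, but word overlap fails as soon as entailment requires world knowledge (e.g., that Sunday is part of the weekend), which is exactly why large annotated datasets matter.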

You are developing computer vision programs that can scan huge datasets of mammogram images and predict which women who are currently free of breast cancer may be at the greatest risk in the future. What is the greatest challenge in developing this technology?

Unfortunately, the greatest challenges in developing this technology are not technical in nature. Most of them have to do with accessing the data, curating the records to create clean training sets, and integrating the models into a clinical pipeline. While the medical community is eager to adopt the newest technologies, both sides have to work hard to understand each other's language and to build trust. Of course, there are many interesting technical challenges along the way: for instance, how to ensure that deep learning models generalize across different populations, and how to make the models interpretable for physicians so that they can utilize predictions in practice. In my ideal world, this data would be public so that the whole research community could contribute. But currently, only very few of us have access to it, which slows progress.

What advice would you offer a younger colleague just starting out in the field?

The advice that I always give to my students is to work on the problems that matter to you. Motivation may vary from person to person, be it societal good, intellectual curiosity, or a personal obsession with the topic (as is often true in my case). The most important point is to listen carefully to yourself and select the problem accordingly. While this advice sounds pretty obvious, it is not easy to follow. At any point in time, a research community collectively defines the topics du jour, and these popular questions often gain the most attention. Working on them makes it easier to publish your results and to gain funding and recognition. Therefore, it takes a concerted effort not to go with the flow, and not to be afraid to be different.

Regina Barzilay is the Delta Electronics Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory (CSAIL). Her research interests include natural language processing (NLP) and machine learning. Recently, she has been active in applying machine learning methods to cancer research and drug design.

In October 2017, Barzilay received a MacArthur Fellowship, often referred to as a “genius grant.” She was cited for “significant contributions to a wide range of problems in computational linguistics, including both interpretation and generation of human language.”