People of ACM - Dong Yu

March 14, 2023

What is the difference between speech processing and natural language understanding?

Speech processing deals with speech signals and includes tasks such as speech analysis (e.g., emotion detection and speaker verification from speech signals), speech coding, speech enhancement and separation, automatic speech recognition (ASR), and speech synthesis (often referred to as "text-to-speech"). These techniques may be packaged as standalone applications (e.g., dictation tools) or as components of a natural language processing system, such as the command recognizer in a command-and-control system or the recognition and synthesis modules in a speech-to-speech translation system. Speech processing focuses on the conversion between the speech signal and the corresponding text, not on understanding the converted text. Natural language understanding, by contrast, focuses on the semantic understanding of natural language, often in text form, although some end-to-end systems aim to understand natural language directly from the speech signal.
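The division of labor described above can be sketched in a few lines: a speech processing front end maps the waveform to text, and a natural language understanding component maps that text to meaning. The `fake_asr` and `toy_nlu` functions below are hypothetical stubs for illustration only, not real systems.

```python
# Illustrative sketch of where speech processing ends and NLU begins.
# `fake_asr` stands in for a real recognizer (hypothetical stub);
# `toy_nlu` is a trivial keyword-based intent classifier.

def fake_asr(audio_samples):
    # A real ASR system would map the waveform to text; here we
    # simply pretend the signal decodes to a fixed command.
    return "turn on the living room lights"

def toy_nlu(text):
    # NLU operates on the text produced by ASR, not on the waveform.
    if "turn on" in text:
        return {"intent": "switch_on", "target": text.split("turn on the ")[-1]}
    return {"intent": "unknown"}

transcript = fake_asr([0.0] * 16000)  # one second of dummy 16 kHz audio
print(toy_nlu(transcript))
```

The point of the sketch is the interface: the recognizer's output is plain text, and everything downstream of that boundary is understanding rather than signal processing.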

What advance in your field has surprised you the most since you began your career?

The whole field of speech processing has converged on the deep learning approach, thanks to the availability of large training datasets, high-performance computing, and advanced modeling and training techniques. When we pioneered the paradigm shift in ASR in 2010 with a 30% word error rate reduction on challenging benchmarks, we did not foresee the rapid progress that has happened in the last several years.

At the recent APSIPA signal processing conference, you gave a keynote in which you discussed significant advances made by Tencent’s AI Lab on voice processing. The real-world applications of Tencent’s work include improvements to music separation, in-car voice processing, and online meetings. Will you discuss the underlying paradigm shift that is responsible for all of these advances?

In short, all of these advances were made by applying advanced deep learning techniques. Of course, for different tasks we designed different models and loss functions to inject prior knowledge and to satisfy the constraints set by the applications. We were able to achieve significantly improved music separation performance, in-vehicle speech signal processing performance (including automatic echo cancellation, speech enhancement, and multi-talker speech separation and recognition), and speech and music coding performance (especially under weak network connection conditions). All of these newly developed techniques have been deployed in products to benefit users. Working in an industrial lab, we are better able to understand the needs of real-world applications and to impact people more quickly.
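As one concrete illustration of a task-specific objective, many speech and music separation systems in the literature are trained and evaluated with the scale-invariant signal-to-noise ratio (SI-SNR); this is a common choice in the field, not necessarily the exact loss used in the work discussed above. A minimal NumPy sketch:

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = scale * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10(
        (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    )

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)          # stand-in for a clean source
noisy = clean + 0.1 * rng.standard_normal(16000)

print(si_snr(clean, clean))  # a perfect estimate scores very high
print(si_snr(noisy, clean))  # a noisy estimate scores lower
```

Because the projection removes any overall gain on the estimate, the metric rewards recovering the source's shape rather than its loudness, which is why it is popular for separation training.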

What is an example of a natural language understanding research breakthrough you would like to see come to fruition in the near future?

We've seen surprising progress in natural language understanding recently. For example, ChatGPT, a dialog system developed by OpenAI which is fueled by a huge amount of data and a giant model, quickly impressed millions of people. However, current ChatGPT-like systems still lack strong reasoning and fact-checking ability. I think a further improved system should be able to conduct more complicated reasoning steps, exploit multi-modal information, and construct more precise world knowledge to understand languages better. Such a system will be able to learn by itself by observing the world and reading various documents.

Dong Yu is a Distinguished Scientist and Vice General Manager at Tencent AI Lab, which has teams in Shenzhen, Beijing, and Seattle. He has published more than 300 papers on topics including automatic speech recognition, speech processing, and natural language processing. As a volunteer, Yu served as an Associate Editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) from 2011 to 2015 and as an Associate Editor of ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) from 2017 to 2019. He chaired the IEEE Speech and Language Processing Technical Committee in 2021–2022.

Yu has received many Best Paper Awards, including the prestigious IEEE Signal Processing Society Best Paper Award in 2013, 2016, 2020, and 2022. He was recently named an ACM Fellow for contributions in speech processing and deep learning applications.