People of ACM - Xuedong Huang

July 25, 2017

You founded Microsoft’s Voice Recognition Group in 1993. How did you first become interested in voice recognition and what were the main challenges in the field at that time?

When I was a graduate student studying AI at Tsinghua University, I was fascinated by the challenge of entering Chinese on a computer using a keyboard designed for Western characters. I realized that speech would be important to the future of computing, well beyond helping 1.2 billion Chinese speakers. I am glad my experience at Tsinghua helped me go deep on both signal processing and pattern recognition, some of the most interesting and challenging AI problems at that time.

I left Tsinghua in 1987 to finish my PhD research at the University of Edinburgh. During my PhD years I was fortunate to team up with many world-class researchers in Edinburgh and at Carnegie Mellon University to develop semi-continuous hidden Markov models, a concept of parameter sharing and tying that is still relevant for modern speech recognition systems today. After graduating from Edinburgh, I started working at Carnegie Mellon as a junior research faculty member. Microsoft convinced me that joining Microsoft Research could help bring speech recognition to the mass market. It was one of the best decisions of my career, because it has helped bring technology that makes people’s lives a bit better.

You led the team responsible for developing Microsoft Translator, which can perform voice translation in multiple languages. What recent research breakthroughs made these advances possible?

Deep neural nets dramatically improved the performance of a wide range of sequence-to-sequence mapping tasks. Speech recognition, optical character recognition, image recognition, machine translation, and text-to-speech synthesis all benefited from deep learning. Combined with quality training data, powerful computing, and ready-to-use deep learning toolkits such as Microsoft’s CNTK and Google’s TensorFlow, we can deal with sequence-to-sequence mapping tasks very effectively. Microsoft Translator translates voice from one language to another, which is a perfect sequence-to-sequence mapping problem for deep neural nets.
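As a purely illustrative sketch (not Microsoft Translator’s actual model), here is a minimal encoder-decoder network of the kind such toolkits make easy to build. It assumes TensorFlow’s Keras API; the vocabulary sizes and dimensions are made up.

```python
# Minimal encoder-decoder (sequence-to-sequence) sketch with tf.keras.
# Illustrative only; vocabulary sizes and dimensions are hypothetical.
import tensorflow as tf
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, dim = 8000, 8000, 256  # hypothetical sizes

# Encoder: embed the source token sequence and keep the final LSTM state.
enc_in = layers.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, h, c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: generate target tokens conditioned on the encoder's final state.
dec_in = layers.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out = layers.LSTM(dim, return_sequences=True)(dec_emb, initial_state=[h, c])
logits = layers.Dense(tgt_vocab)(dec_out)

model = Model([enc_in, dec_in], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

A production translation system would go further, for example running the decoder step by step with beam search at inference time and adding attention so the decoder can look at all encoder states rather than only the final one.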

One unique feature we offer helps the PowerPoint presenter. With Microsoft Presentation Translator, anyone can download the plugin and add subtitles to a PowerPoint slide show in as many as 60 languages. This is a great example of using AI to break down language barriers and enhance our overall productivity.

You’ve recently estimated that voice recognition has improved by 20% every year for the last 20 years and that, in a few years, computers will be as adept as humans at understanding speech. What makes you confident in that prediction?

I have personally witnessed the amazing journey of speech recognition performance over the past 30 years. The first computer I used for speech recognition was an IBM PC XT in 1985. We added a TI TMS320 DSP to make real-time speech recognition possible, a setup amazingly similar to today’s modern PC equipped with an Nvidia GPU, despite the exponential speedup over the past 30 years thanks to Moore’s Law. As the Communications of the ACM article “A Historical Perspective of Speech Recognition,” which I authored with James Baker and Raj Reddy, summarized, the relative error rate reduction has been around 15-20% each year, thanks to the collective research push of the whole speech recognition community. It is a combination of more advanced machine learning algorithms, more realistic training data, and more powerful computing paradigms.
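To see what a steady 15-20% relative reduction means when compounded, here is a small back-of-the-envelope calculation; the 40% starting word error rate is a hypothetical number chosen purely for illustration.

```python
# Back-of-the-envelope: compounding a 15-20% annual relative error reduction.
# The 40% starting word error rate is hypothetical, for illustration only.
start_wer = 40.0  # percent

for annual_reduction in (0.15, 0.20):
    wer = start_wer
    for year in range(20):
        wer *= (1 - annual_reduction)  # apply one year's relative reduction
    print(f"{annual_reduction:.0%}/year for 20 years: {start_wer}% -> {wer:.2f}%")

# At 15%/year the error rate falls roughly 26x over 20 years;
# at 20%/year it falls roughly 87x.
```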

For example, the Switchboard task is to transcribe conversational speech between two speakers over a telephone network. Our research team achieved a historic human parity milestone on this task last year. About six months after Microsoft reached a 5.8% error rate, IBM’s research team announced that they had reduced the error rate further to 5.5%. If history is any lesson, I am confident that my prediction will hold for a number of years to come.

Many experts believe that we are only getting started in the fields of speech recognition and natural language processing. What is the next big frontier or significant challenge in this field?

Speech recognition benefited from the amazing progress of data-driven machine learning, using tools such as hidden Markov models and deep neural networks. Natural language processing is fairly broad. For machine translation, the progress has been similar. In fact, machine translation benefited from many lessons and advances in speech recognition, with a number of paradigm shifts from traditional NLP technology to statistical machine translation to neural machine translation. In our human evaluation (on a scale of 1 to 4), we looked at both Google’s and Microsoft’s MT. There was a very significant quality improvement when both Microsoft and Google switched from statistical machine translation to neural machine translation in the spring of 2016, from 2.6 to better than 3.4. However, unlike Microsoft’s Switchboard speech transcription human parity milestone, machine translation will probably take some time to reach human quality, although it is getting close.

General natural language understanding is still a fantastic research challenge. If general-purpose language understanding were a box, I’d say the current state of the art is shaped like the letter “T.” Modern search engines such as Google or Bing are broad but not very deep. There are many excellent domain-specific conversational bots that are deep, such as Microsoft’s AI-based customer support agent that can answer deep technical questions, but their domain coverage is not as broad as a web search engine’s. DeepMind’s AlphaGo is deep but narrow in the same sense when compared with a web search engine.

What do you think the next immediate computing paradigm shift will be?

AI is going to play a very important role in amplifying all vertical services. Breakthroughs in perception areas such as computer speech and computer vision will happen before breakthroughs in cognition such as reasoning and knowledge acquisition. Speech and vision technologies will enable natural multimodal interaction that provides differentiated value by taking deep advantage of personal data and knowledge.

We recently introduced Project Prague. The difference between our new gesture engine and others is that we moved away from template-based gesture recognition. With the introduction of our new gesture dictionary, defined by finger position and movement, Project Prague created a paradigm shift similar to the one in speech recognition, which moved away from the word-template approach to a phoneme-based dictionary framework that offers unparalleled flexibility in dealing with new words or new gesture patterns. I think the microphone plus camera will finally free people from being tethered to computing devices, enabling the next immediate wave of the ambient computing revolution.
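As a purely conceptual sketch (not Project Prague’s actual API), the following illustrates the idea of composing gestures from a small dictionary of hand-pose primitives, the way words are built from phonemes, rather than matching whole-gesture templates; all pose and gesture names here are hypothetical.

```python
# Conceptual sketch only: composing gestures from a small "pose" dictionary,
# analogous to building words from phonemes. Not Project Prague's actual API.
from typing import List

# Hypothetical hand-pose primitives (the "phonemes" of gestures).
POSES = {"open_palm", "fist", "pinch", "point"}

# Gestures are defined as sequences of poses, so new gestures can be added
# by composition instead of collecting new whole-gesture templates.
GESTURES = {
    "grab":    ["open_palm", "fist"],
    "release": ["fist", "open_palm"],
    "select":  ["point", "pinch"],
}

def recognize(pose_sequence: List[str]) -> str:
    """Return the first gesture whose pose pattern occurs in the observed sequence."""
    for name, pattern in GESTURES.items():
        n = len(pattern)
        for i in range(len(pose_sequence) - n + 1):
            if pose_sequence[i:i + n] == pattern:
                return name
    return "unknown"

print(recognize(["open_palm", "fist"]))            # -> grab
print(recognize(["point", "pinch", "open_palm"]))  # -> select
```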

Xuedong Huang is a Microsoft Technical Fellow in AI and Research and is the company’s Chief Speech Scientist. As the head of Microsoft’s spoken language initiatives, he played an instrumental role in developing many high-profile speech products including Cortana, Microsoft Translator, Microsoft Cognitive Services and Cognitive Toolkit (CNTK), and other AI technologies used in Microsoft Office, Windows and Azure.

His honors include being selected as Asian American Engineer of the Year (2011) and Wired's “Next List 2016: 25 Geniuses Who Are Creating the Future of Business.” Huang holds over 100 patents and has published more than 100 scientific papers and two books. In 2016 he was named an ACM Fellow for contributions to spoken language processing.