People of ACM - Jingdong Wang

December 7, 2021

What does your role as Chief Architect for Computer Vision at Baidu's Artificial Intelligence Group involve?
Our team conducts product-driven and cutting-edge computer vision/deep learning/AI research and develops practical computer vision applications to support AI cloud, Baidu image and video search, intelligent driving, and other platform/product teams. We also actively collaborate with leading researchers, faculty members and scholars.

Will you give us one example of an exciting research development happening at the intersection of computer vision and multimedia search?
Over the past 14 years, I have primarily focused on multimedia search and deep learning-based computer vision architectures and applications. In the early part of this period, I worked closely with the Bing multimedia search team and focused on problems of both research significance and practical importance. Among the many exciting developments are salient object detection for the Bing image search color filter, visual features for an improved image search ranker, and approximate nearest neighbor (ANN) search for image search by example.

ANN search is also a fundamental problem in machine learning and computational geometry. The goal of ANN search is to find the approximate nearest items to a query item in a large database under a defined distance metric, such as Euclidean distance or cosine distance. I started research on ANN in 2009 and made an interesting observation: if one item is close to the query item, its neighboring items are likely to be close to the query item as well. Motivated by this observation, I developed a practical ANN search algorithm that uses a neighborhood graph and kd-trees as the index, where the kd-trees serve to select starting items and to address the disconnectivity of the neighborhood graph. We further designed an efficient algorithm to construct the neighborhood graph index. These two works were published at ACM MM 2012 and CVPR 2012.
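The observation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published algorithm: the neighborhood graph is built by brute force, the starting items are passed in directly rather than selected with kd-trees, and the names `build_knn_graph`, `graph_search`, `seeds`, and `budget` are hypothetical.

```python
import heapq
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(points, k):
    """Brute-force k-nearest-neighbor graph: O(n^2), for illustration only."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted((euclidean(p, q), j) for j, q in enumerate(points) if j != i)
        graph.append([j for _, j in dists[:k]])
    return graph


def graph_search(points, graph, query, seeds, top_k=1, budget=64):
    """Best-first search: repeatedly expand the closest unexpanded item.

    Neighbors of a close item are likely to be close too, so the search
    walks the graph toward the query instead of scanning every item.
    `seeds` are the starting items (assumed given; the interview describes
    kd-trees for this step). `budget` caps the number of expansions.
    """
    visited = set(seeds)
    frontier = [(euclidean(points[s], query), s) for s in seeds]
    heapq.heapify(frontier)
    results = list(frontier)
    expansions = 0
    while frontier and expansions < budget:
        _, node = heapq.heappop(frontier)
        expansions += 1
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                entry = (euclidean(points[nb], query), nb)
                heapq.heappush(frontier, entry)
                results.append(entry)
    return [idx for _, idx in heapq.nsmallest(top_k, results)]


# Small demo on random 2-D points (fixed seed for repeatability).
random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
graph = build_knn_graph(points, k=8)
query = (0.5, 0.5)
print(graph_search(points, graph, query, seeds=[0, 50, 100], top_k=3))
```

Starting from several seeds reduces the chance that the search stays stuck in one disconnected part of the graph, which is the issue the kd-trees address in the approach described above.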

The algorithm was later adopted as a core component in many Microsoft products. In 2014, we shipped the full solution of index building and search techniques to Bing image search and Bing Ads. To the best of our knowledge, our neighborhood graph-based ANN search algorithm is the first of its kind to be used in a large-scale real product. Since 2017, the graph-based ANN search algorithm has been further improved and deployed to web search engines with hundreds of billions of deeply-learned document vectors, and the technology with solid-state drives (SSDs) was published at NeurIPS 2021. In 2019, the source code of our algorithm was made publicly available on GitHub, and it is now widely known and used.

You also have done significant work in the area of deep learning for computer vision. In the paper Deep High-Resolution Representation Learning for Visual Recognition, you and your co-authors explored how to build a universal neural network, the high-resolution network (HRNet) for visual recognition. What is HRNet and how does it advance the neural architecture design?
HRNet is a universal architecture that learns high-resolution representations. It is broadly applicable to general computer vision tasks, especially position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection.

Prior to HRNet, almost all backbones, including AlexNet (2012), GoogLeNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016), were classification architectures originally developed for image classification. Most other visual recognition tasks, such as semantic segmentation, human pose estimation, and object detection, are position-sensitive and require spatially fine representations. HRNet breaks the previous golden rule that a classification architecture is the backbone for position-sensitive tasks and is designed to directly learn high-resolution (spatially fine) representations.

HRNet is conceptually different from classification architectures in that it is designed from scratch rather than derived from a classification architecture. It breaks the dominant design rule of connecting the convolutions in series from high resolution to low resolution, which goes back to LeNet-5 and is followed by the classification networks mentioned above, and instead maintains high-resolution representations throughout the entire network.
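The contrast between the two designs can be illustrated with a toy sketch. It is not HRNet itself: there are no learned convolutions, the "feature maps" are plain 1-D lists, and every function name is hypothetical. It only shows how a serial high-to-low pipeline loses resolution, while parallel streams that repeatedly exchange information keep a full-resolution stream to the end.

```python
def downsample(x):
    """Halve resolution by averaging adjacent pairs (stand-in for a strided convolution)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def upsample(x, factor):
    """Nearest-neighbor upsampling by an integer factor."""
    return [v for v in x for _ in range(factor)]


def serial_high_to_low(x, stages):
    """Classification-style design: each stage lowers the resolution."""
    for _ in range(stages):
        x = downsample(x)
    return x


def fuse(streams):
    """Give every stream the sum of all streams, resampled to its own resolution."""
    fused = []
    for i, target in enumerate(streams):
        total = list(target)
        for j, other in enumerate(streams):
            if j == i:
                continue
            if j > i:
                # Lower-resolution stream: upsample to the target size.
                resampled = upsample(other, 2 ** (j - i))
            else:
                # Higher-resolution stream: downsample to the target size.
                resampled = other
                for _ in range(i - j):
                    resampled = downsample(resampled)
            total = [a + b for a, b in zip(total, resampled)]
        fused.append(total)
    return fused


def parallel_high_resolution(x, stages):
    """Keep the full-resolution stream and add one lower-resolution stream per stage."""
    streams = [x]
    for _ in range(stages):
        streams.append(downsample(streams[-1]))
        streams = fuse(streams)
    return streams


signal = [float(i) for i in range(16)]  # input length assumed to be a power of two
print(len(serial_high_to_low(signal, 2)))
print([len(s) for s in parallel_high_resolution(signal, 2)])
```

In the serial version only the coarsest representation survives, whereas the parallel version still holds a stream at the input resolution after the same number of stages.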

How will computer vision advance in the near future? And what will be some practical applications for customers of Baidu?
I will mention two of the many potential advances. One is that computer vision might no longer be studied separately: computer vision and other modalities, e.g., natural language and speech, might be studied together, as the tools in these areas are now shared and are all based on deep learning. The other is that the impact of computer vision in practical applications will increase.

Our team has been exploring computer vision for applications of practical significance. A few examples include image and video search, Baidu Maps, digital transformation with optical character recognition (OCR), visual perception for intelligent driving and transportation, power inspection, and deepfake detection.

What advice would you offer a younger colleague just starting out in the field?
It is a very exciting moment to work on computer vision, which is one of the hottest fields in artificial intelligence. I am also delighted and encouraged to see new generations joining the computer vision field.

My advice to young people is to start your research with a relatively easy, but still important, problem to gain research experience and build confidence. Then seek out challenging and potentially impactful research problems. I also encourage young people to study one problem deeply until they have enough experience and understanding of research, and then choose to work on a different problem, or on more problems.

My additional advice is to seek opportunities to work with or learn concretely from researchers who are active and have great expertise in your field. Focus on one or two topics, publish high-quality and high-impact papers, and build your reputation rather than publishing many incremental papers.

Finally, though deep learning is the dominant tool in computer vision, I would encourage younger researchers to understand the computer vision domain and learn the fundamentals and foundational techniques, such as probability, matrix computation, machine learning, and so on.

Jingdong Wang is Chief Architect for Computer Vision at Baidu, an AI company with a strong internet foundation that offers a full AI stack. Before joining Baidu, Wang spent many years at Microsoft Research Asia, most recently serving as Senior Principal Research Manager with its Visual Computing Group. His areas of interest include neural architecture design, human pose estimation, semantic segmentation, image classification, object detection, large-scale indexing, and salient object detection.

Wang has served as an Associate Editor for leading journals such as IEEE Transactions on Pattern Analysis and Machine Intelligence, and has been an area chair for premier conferences in vision, multimedia, and AI, such as CVPR, ICCV, ECCV, ACM MM, IJCAI, and AAAI. Wang is an ACM Distinguished Member and a Fellow of the International Association for Pattern Recognition (IAPR).