People of ACM - Wenwu Zhu

November 21, 2023

Speaking broadly, what has been an important advance in multimedia research that has had a significant impact on the field in the last 10 years?

Over the past 10 years, AI has become deeply embedded in multimedia and has significantly advanced the field across the full multimedia computing life cycle, from multimedia generation, coding and compression, and analysis to multimedia communications and delivery. In particular, one important advance in multimedia research that has had a significant impact on the field is the development and widespread adoption of deep learning techniques, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers with pretraining.

Deep learning has revolutionized multimedia research by enabling remarkable progress in areas such as image classification, object detection, speech recognition, video analysis, and multimedia search and recommendation. CNNs have shown their power in tasks such as image classification, object detection, and semantic segmentation. They have achieved unprecedented levels of accuracy and have become a fundamental tool for various multimedia applications. RNNs, on the other hand, have had a significant impact on sequential data processing tasks, such as speech recognition and video captioning. Their capability to capture temporal dependencies hidden in data makes them well suited for tasks including, but not limited to, language translation, speech synthesis, and sentiment analysis. Furthermore, the combination of deep learning with other technologies (e.g., generative adversarial networks (GANs) for image synthesis, reinforcement learning (RL) for interactive multimedia systems, and variational autoencoders (VAEs) for AIGC) has also opened new avenues for research and application development in the field of multimedia.

The availability of large-scale datasets such as ImageNet and COCO, along with advancements in computational power and parallel computing, has contributed to the success of deep learning in multimedia research. These developments have allowed researchers to train deeper and more complex neural networks, leading to improved performance and generalization capabilities. The recent success of large pretrained models such as ChatGPT further demonstrates the potential of Transformers with pretraining.

Overall, the advancements in deep learning techniques and their variants, along with the availability of large-scale data and increased computational power, have significantly propelled multimedia research forward in the last 10 years, enabling breakthroughs in various domains and paving the way for new applications and innovations.

Your most cited paper “Structural Deep Network Embedding” (co-authored with Daixin Wang and Peng Cui) looks at challenges to mining information from networks, including social media networks. How will the approach to mining information that you and your co-authors propose in this paper help with real-world applications?

In real-world scenarios, graphs serve as fundamental and ubiquitous data structures for representing relationships between entities, such as social networks, financial networks, and protein networks. Extracting valuable information from graph data has been a major concern for both industry and academia. However, the presence of complex and diverse relationships between nodes has posed significant challenges for graph analysis.

Graph representation learning, or network embedding, is an effective technique for graph/network data analytics that enables a deeper understanding of the hidden characteristics of graph/network data. Our paper, "Structural Deep Network Embedding," introduces an innovative approach that tackles this problem by establishing the connection between deep representations and graph topology. We utilize deep autoencoders to map nodes into compact vector representations, while leveraging techniques such as Laplacian Eigenmaps and graph reconstruction to simultaneously preserve the first-order and second-order proximities between nodes. Consequently, our approach effectively captures the structural relationships between nodes in a vectorized form, yielding practical applications across various domains.
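
To make the two proximity terms concrete, here is a minimal numpy sketch of the kind of joint objective described above, with a toy graph and randomly initialized embeddings standing in for a trained autoencoder. The variable names, decoder choice, and hyperparameter values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy graph: adjacency matrix S for 4 nodes (symmetric, unweighted).
S = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

rng = np.random.default_rng(0)
# Stand-ins for a trained autoencoder: Y holds 2-D node embeddings,
# S_hat is the decoder's reconstruction of each neighborhood vector.
Y = rng.normal(size=(4, 2))
S_hat = 1 / (1 + np.exp(-Y @ Y.T))  # simple inner-product decoder

beta, alpha = 5.0, 0.1  # illustrative hyperparameter values

# Second-order proximity: reconstruction error on neighborhood vectors,
# with errors on observed edges weighted more heavily (factor beta).
B = np.where(S > 0, beta, 1.0)
loss_2nd = float(np.sum(((S_hat - S) * B) ** 2))

# First-order proximity: Laplacian-Eigenmaps-style penalty that pulls
# the embeddings of directly linked nodes together.
loss_1st = float(sum(S[i, j] * np.sum((Y[i] - Y[j]) ** 2)
                     for i in range(4) for j in range(4)))

total = loss_2nd + alpha * loss_1st
print(f"second-order: {loss_2nd:.3f}, first-order: {loss_1st:.3f}, total: {total:.3f}")
```

In the full model, both terms are minimized jointly by gradient descent over the autoencoder's weights, so that nodes that link to each other (first-order) or share neighbors (second-order) end up close in the embedding space.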

One prominent application of our method lies in graph-based anomaly detection, where fraudsters often engage in close interactions. For instance, in financial fraud detection, fraudsters frequently transact with one another, while in spam web detection, spam websites are interconnected. By utilizing our method, we can map the vector representations of fraudsters closer together in the latent space, resulting in strong fraud detection performance. Additionally, our method has proven useful in social recommendation tasks given its ability to effectively capture social relationships between users, leading to more accurate and personalized recommendations.

In essence, our method excels at capturing highly nonlinear structural relations between nodes. When an application assumes that nodes that are topologically close in the graph should share similar properties, our method is able to capture their structural relations, being both discriminative and informative for downstream applications. The innovation of our approach lies in its ability to bridge the gap between deep representation learning and graph analysis, offering practical utility in various real-world applications.

Our work has been widely and successfully applied in many real-world tasks related to network science, such as social network data processing, biological information processing, and recommender systems, as well as on social platforms such as WeChat and StarTimes, covering a broad range of applications including social network evolution, social behavior prediction, and information propagation analysis.

You are recognized for proposing an end-to-end Quality of Service (QoS) control method for MPEG-4 video streaming and delivery over the Internet and wireless networks. How was your method an advancement over the existing state-of-the-art at the time? Video streaming has greatly improved since the early days of the Internet, when on-screen videos were the size of postage stamps and were often pixelated. What is the next frontier in streaming content over the Internet?

Video applications typically have stringent latency and packet-loss requirements, which the Internet of two decades ago could not adequately support. It is thus a challenging problem to design an efficient video streaming system that provides satisfactory perceptual quality while achieving high resource utilization. The advancement our method brought was its unique approach to Quality of Service (QoS) control for streaming video over the Internet and wireless networks. Specifically, our end-to-end QoS control method handles this by minimizing the likelihood of network congestion and effectively matching the video stream rate to the available network bandwidth, thereby avoiding the severe packet losses that would cause substantial degradation in video quality. This significantly improves not only the video quality but also the overall user experience.
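
The general idea of matching the stream rate to the available bandwidth can be sketched with a simple AIMD-style control loop. This is an illustrative toy, not the actual control scheme from the paper; the function name, feedback signals, and constants are all assumptions:

```python
def adapt_rate(rate_kbps, loss_rate, bw_estimate_kbps,
               increase_step=50, decrease_factor=0.8, loss_threshold=0.02):
    """Toy AIMD-style sender rate control: back off multiplicatively on a
    congestion signal (high packet loss), probe additively otherwise, and
    never exceed the estimated available bandwidth."""
    if loss_rate > loss_threshold:   # congestion detected: reduce rate sharply
        rate_kbps *= decrease_factor
    else:                            # no congestion: probe for more bandwidth
        rate_kbps += increase_step
    return min(rate_kbps, bw_estimate_kbps)

# Simulated receiver feedback over a few reporting intervals:
# (observed loss rate, estimated available bandwidth in kbps).
rate = 800.0
for loss, bw in [(0.0, 1200), (0.0, 1200), (0.05, 900), (0.0, 900)]:
    rate = adapt_rate(rate, loss, bw)
    print(round(rate))
```

The key property, as in the answer above, is that the sending rate tracks the available bandwidth from both directions: it ramps up while the network is underutilized and backs off quickly when loss indicates congestion, keeping packet loss (and hence video-quality degradation) low.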

As pointed out in our paper, “Two Decades of Internet Video Streaming: A Retrospective View,” published in the special issue celebrating the 20th anniversary of ACM Multimedia, with more and more content being shared over social networks and streamed over the Internet, mining social behaviors and content propagation patterns is key to improving users' quality of experience in the future. Social content propagation patterns in online social networks play an increasingly important role in shaping how multimedia content is generated, distributed, and consumed. Leveraging these propagation patterns and designing effective delivery frameworks can immensely improve the quality of experience and service provision for global users. Motivated by large-scale measurements, we proposed a propagation-based social-aware delivery framework using a hybrid edge-cloud and peer-assisted architecture. This work received the Best Paper Award at ACM Multimedia 2012 for introducing the use of intelligence from social propagation for video streaming.

For the next frontier in streaming content, I believe there is increasing interest in immersive media services and applications, such as 360-degree video streaming, augmented and virtual reality, and the recent metaverse experiences. These applications bring high fidelity, immersive interaction, and open data exchange between people and the environment. The key to maximizing these applications is leveraging edge computing paradigms. Handling AI models and user data well, with QoS-guaranteed inference latency and accuracy, can empower edge devices and enhance immersive multimedia experiences. We have been working on edge-aware model search and quantization, as well as hardware-aware inference deployment. It is exciting to anticipate how new advancements in edge intelligence will reshape the landscape of video streaming, and of multimedia as a whole.

What is an emerging area of research in your field that will have a significant impact in the coming years?

As I mentioned in one of my recent works on “Multimedia Intelligence,” I believe that explainable AI is expected to have a significant impact on multimedia research in the coming years. Explainable AI aims to make machine learning models more transparent and interpretable, allowing users to understand the reasoning and logic behind a model's predictions.

Traditional machine learning models, especially deep neural networks, are often considered “black boxes” since it is challenging to understand the underlying processes and logic that lead to a specific prediction. This lack of interpretability can cause issues when such models are used in risk-sensitive multimedia applications, such as healthcare, finance, or criminal justice, where accurate and reliable predictions are crucial.

Explainable AI techniques focus on developing models that provide more interpretable and explainable outputs. This can be achieved by incorporating more transparent models, such as decision trees, rule-based systems, and symbolic reasoning systems, to gain insights into the predictions made by complex models. The concept of “Multimedia Intelligence” is precisely about promoting the ability to interpret and reason when processing multimedia data via a family of models built on powerful deep architectures.
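
One common way transparent models yield insight into a complex model, as described above, is the surrogate-model approach: probe the black box and fit a simple, readable rule to its decisions. The sketch below is a hypothetical illustration; the scoring function, features, and rule-fitting procedure are all invented for this example:

```python
# Hypothetical black-box scorer (a stand-in for a deep model) that
# flags a transaction as risky from two features.
def black_box(amount, n_recent_txns):
    return 1 if (0.7 * amount + 0.3 * n_recent_txns) > 70 else 0

# Probe the black box on a small grid of inputs.
samples = [(a, n) for a in range(0, 101, 10) for n in range(0, 11)]
labels = [black_box(a, n) for a, n in samples]

def fit_stump(samples, labels):
    """Fit a depth-1 decision stump as an interpretable surrogate: scan
    axis-aligned thresholds and keep the rule that best reproduces the
    black box's decisions (its 'fidelity')."""
    best = None
    for feat in (0, 1):
        for t in sorted({s[feat] for s in samples}):
            preds = [1 if s[feat] > t else 0 for s in samples]
            fidelity = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or fidelity > best[0]:
                best = (fidelity, feat, t)
    return best

fidelity, feat, thresh = fit_stump(samples, labels)
print(f"surrogate rule: feature[{feat}] > {thresh} (fidelity {fidelity:.2f})")
```

The extracted threshold rule is something a domain expert can read and audit directly, which is exactly the kind of transparency the prediction of the original model lacks; the fidelity score quantifies how faithfully the simple rule mimics the complex one.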

By making machine learning models more explainable when dealing with multimedia problems, we can increase trust in their predictions, facilitate their adoption in sensitive domains, and enable users to better understand and utilize the capabilities of AI in their own fields. This will have a significant impact on both the academic and industrial sides of multimedia in the coming years.

Wenwu Zhu is a Professor at Tsinghua University in Beijing, China. His research interests include multimedia intelligence and graph machine learning.

Zhu has been a leading volunteer with ACM conferences including the ACM Conference on Information and Knowledge Management (CIKM), where he served as General Co-Chair in 2019, and ACM Multimedia, where he served as General Co-Chair in 2018 and Technical Program Chair in 2014. He received the ACM SIGMM Technical Achievement Award in 2023, and he has received many Best Paper Awards at ACM and IEEE conferences. Zhu is a Fellow of AAAS and IEEE, and was recently named an ACM Fellow for contributions to multimedia networking and network representation learning.