Abstract
The convergence of multimodal fusion and embodied intelligence has emerged as a pivotal frontier in advancing autonomous driving systems. Modern autonomous vehicles rely on integrating heterogeneous data streams, such as visual, LiDAR, tactile, and auditory inputs, to perceive dynamic environments and make human-like decisions. However, existing approaches often face challenges in achieving robust cross-modal alignment, real-time adaptability, and contextual reasoning under complex scenarios (e.g., occlusions, unpredictable human behavior, or adverse weather). For instance, traditional bird's-eye-view (BEV) methods struggle with fine-grained 3D scene understanding, while isolated sensor modalities fail to resolve ambiguities in long-tail driving scenarios.
Embodied intelligence further demands that systems "think" and "act" in a human-centric manner, integrating physical interaction, spatial reasoning, and iterative self-correction. Recent breakthroughs, such as vision-tactile fusion systems for deformable object manipulation and transformer-based multimodal frameworks for driver-like scene comprehension, highlight the potential of combining multimodal perception with cognitive architectures. However, gaps persist in generalizing these capabilities to real-world deployment, particularly in balancing computational efficiency, ethical considerations, and safety-critical decision-making.
This workshop aims to bridge these gaps by fostering interdisciplinary discussions on multi-modal fusion techniques, embodied reasoning frameworks, and scalable learning paradigms for autonomous driving.
Keywords: autonomous driving, embodied intelligence, multimodal fusion.
Workshop Schedule
Cognitive Learning of Autonomous Vehicles in Open-Road Traffic
Speaker: Prof. Xiao Wang, Anhui University
Highlight: This talk presents a cognitive learning framework that enables autonomous vehicles (AVs) to achieve human-like adaptability and safety in dynamic open-road environments. The framework integrates (1) Spatial-Temporal Attention (STA) mechanisms that infer road-user intentions by dynamically prioritizing critical spatial regions and temporal segments; (2) Social Compliance Estimation via formalized "absolute right-of-way" (A_ROW) metrics and ROW-violation indices, ensuring socially acceptable interactions; and (3) Deep Evolutionary Reinforcement Learning (DERL), which combines Twin Delayed DDPG (TD3) with genetic algorithms to expand the policy search space, avoid local optima, and balance safety, efficiency, and comfort. The framework demonstrates robust adaptation across traffic densities, bridging machine precision with human-like cognitive flexibility for safer, more interpretable AV integration into real-world traffic.
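A minimal sketch of the evolutionary outer loop behind DERL is shown below. The fitness function is a hypothetical placeholder for the weighted safety/efficiency/comfort objective, and the point where TD3's gradient-trained policy would be injected into the population is marked in a comment; this is an illustration of the general scheme, not the talk's implementation.

```python
# Sketch of DERL's evolutionary outer loop: a population of policy parameter
# vectors is evolved with a genetic algorithm, while a gradient-based learner
# (TD3 in the talk) would periodically inject its policy into the population.
import numpy as np

rng = np.random.default_rng(0)
POP, DIM, GENS = 20, 64, 50

def fitness(theta):
    # Placeholder objective standing in for simulated driving returns
    # (safety, efficiency, comfort); real DERL would roll out the policy.
    return -np.sum((theta - 0.5) ** 2)

population = rng.normal(size=(POP, DIM))
for gen in range(GENS):
    scores = np.array([fitness(ind) for ind in population])
    elite = population[np.argsort(scores)[-POP // 2:]]          # selection
    parents = elite[rng.integers(0, len(elite), size=(POP, 2))]
    mask = rng.random((POP, DIM)) < 0.5                          # crossover
    children = np.where(mask, parents[:, 0], parents[:, 1])
    children += 0.05 * rng.normal(size=children.shape)           # mutation
    # In DERL, one slot would be overwritten by the current TD3 policy here,
    # so gradient information keeps steering the evolutionary search.
    population = children

print("best fitness:", max(fitness(ind) for ind in population))
```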
Beyond Discovery: An Identification-Aware Bayesian Optimization Approach And Its Applications in Transportation
Speaker: Prof. Zhiyuan Liu, Southeast University
Highlight: In simulation-based transportation systems—such as autonomous driving in adverse weather—decision optimization under noisy environments remains a core challenge. This talk introduces an Identification-Aware Bayesian Optimization framework, which goes beyond merely finding high-performing solutions to robustly identifying them. By balancing exploration and exploitation with adaptive acquisition strategies, the proposed method enhances decision reliability and sample efficiency. Applications in traffic signal control and trajectory planning demonstrate how this approach supports embodied intelligence in uncertain, multimodal scenarios.
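The sketch below illustrates the general idea on a noisy 1-D toy problem: a standard Gaussian-process BO loop with a UCB acquisition, plus an illustrative re-evaluation rule that spends samples pinning down the incumbent's value rather than only exploring. It is a generic stand-in for intuition, not the talk's exact acquisition strategy.

```python
# Generic Bayesian optimization on a noisy objective, with an illustrative
# "identification-aware" twist: if the incumbent optimum is still uncertain,
# re-evaluate it to shrink posterior variance before exploring further.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
f = lambda x: -(x - 0.3) ** 2              # latent objective (unknown in practice)
noisy = lambda x: f(x) + 0.05 * rng.normal()

X, y = [[0.1], [0.9]], [noisy(0.1), noisy(0.9)]
gp = GaussianProcessRegressor(kernel=RBF(0.2) + WhiteKernel(0.01))
grid = np.linspace(0, 1, 201).reshape(-1, 1)

for t in range(25):
    gp.fit(np.array(X), np.array(y))
    mu, sd = gp.predict(grid, return_std=True)
    ucb = mu + 2.0 * sd                    # exploration/exploitation trade-off
    x_next = grid[np.argmax(ucb)]
    # Identification-aware rule (illustrative): while the incumbent's value
    # is still uncertain, sample it again instead of exploring elsewhere.
    if sd[np.argmax(mu)] > 0.1:
        x_next = grid[np.argmax(mu)]
    X.append([float(x_next[0])]); y.append(noisy(float(x_next[0])))

print("identified optimum near x =", grid[np.argmax(gp.predict(grid)), 0])
```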
ECAFormer: Low-light Image Enhancement using Dual Cross Attention
Speaker: Dr. Weikai Li, Chongqing Jiaotong University
Highlight: In real-world autonomous driving, vehicles must operate safely across a wide range of challenging visual conditions — from night-time urban roads to dim tunnels and rainy or foggy environments. Among these, low-light conditions are especially problematic, as they can severely compromise the performance of perception modules by obscuring critical visual cues, increasing noise, and reducing image contrast. Enhancing image quality in such scenarios is therefore essential for improving safety and reliability in autonomous systems. This presentation introduces ECAFormer, a lightweight and effective framework specifically designed for low-light image enhancement. By employing a novel dual cross-attention mechanism, ECAFormer enables mutual enhancement of semantic and visual features across multiple scales, effectively balancing global brightness correction and local detail preservation. Extensive experiments on benchmark datasets and real-world dark road scenes show that the model significantly improves visibility and robustness while maintaining computational efficiency.
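A minimal PyTorch sketch of a dual cross-attention block in this spirit is given below; the layer sizes, normalization placement, and stream names ("semantic" and "visual") are assumptions for illustration, not the published ECAFormer architecture.

```python
# Two feature streams attend to each other so each is enhanced by the other:
# semantic features are refined by visual detail, visual features by context.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.sem_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_from_sem = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s, self.norm_v = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, sem, vis):
        s, _ = self.sem_from_vis(query=sem, key=vis, value=vis)
        v, _ = self.vis_from_sem(query=vis, key=sem, value=sem)
        # Residual connections keep the original features; attention adds
        # cross-stream enhancement on top.
        return self.norm_s(sem + s), self.norm_v(vis + v)

tokens = 16 * 16                       # a 16x16 feature map flattened to tokens
sem, vis = torch.randn(2, tokens, 64), torch.randn(2, tokens, 64)
sem_out, vis_out = DualCrossAttention()(sem, vis)
print(sem_out.shape, vis_out.shape)    # torch.Size([2, 256, 64]) twice
```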
Multimodal Fusion Perception and Cognitive Computing
Speaker: Prof. Hui Zhang, Beijing Jiaotong University
Highlight: In modern urban traffic systems, complex traffic scenarios pose significant challenges to traffic management and intelligent transportation systems. Addressing these challenges effectively requires leveraging multi-view and multimodal information for collaborative perception. This presentation explores how information can be acquired from multiple perspectives by combining different sensors and data sources, such as cameras, LiDAR, and in-vehicle sensors, and how diverse modalities, such as images and videos, can be jointly analyzed and interpreted. It introduces attention-based methods for fusing and processing multimodal information, enabling systems to integrate data more effectively, and discusses how multimodal data can be transformed into higher-level semantic understanding of traffic scenes, including models of pedestrians, vehicles, and road conditions. Finally, it examines key challenges and future directions for collaborative perception in intelligent transportation.
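As a concrete illustration of attention-based fusion, the sketch below combines camera, LiDAR, and in-vehicle sensor features with learned attention weights in a shared embedding space; the feature dimensions and scoring network are assumptions, not the presenter's model.

```python
# Per-modality features are projected into a shared space and combined with
# softmax attention weights, so informative modalities dominate the fusion.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dims, shared=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, shared) for d in dims)
        self.score = nn.Linear(shared, 1)       # scores each modality

    def forward(self, feats):
        z = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)
        w = torch.softmax(self.score(torch.tanh(z)), dim=1)   # (B, M, 1)
        return (w * z).sum(dim=1)                # attention-weighted fusion

# Hypothetical per-sample features from camera, LiDAR, and CAN-bus sensors.
cam, lidar, can = torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 32)
fused = AttentionFusion(dims=[512, 256, 32])([cam, lidar, can])
print(fused.shape)                               # torch.Size([4, 128])
```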
Predicting Where Human Drivers Should Look
Speaker: Prof. Zhixiong Nan, Chongqing University
Highlight: According to the WHO report on road traffic injuries, nearly 1.3 million people die in traffic accidents every year, and a high proportion of these deaths result from driver distraction. Predicting where human drivers should look is therefore critically important for advanced driver-assistance systems and applications. This presentation discusses how to accurately predict human driver attention by fusing multiple data sources, such as cameras and in-vehicle sensors. It introduces a method that predicts driver attention by simulating the process through which human drivers accumulate driving experience, and a second method that integrates multi-fold top-down guidance with bottom-up features, as sketched below.
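The sketch below illustrates the second idea in miniature: bottom-up features extracted from a dashcam frame are concatenated with top-down guidance channels (hypothetical object/lane/vanishing-point maps) to predict a per-pixel gaze probability map. The tiny architecture is an illustrative assumption, not the talk's actual model.

```python
# Bottom-up pathway extracts visual features; top-down guidance maps are
# injected by concatenation before a small head predicts the gaze map.
import torch
import torch.nn as nn

class GazePredictor(nn.Module):
    def __init__(self, bottom_up_ch=32, guidance_ch=3):
        super().__init__()
        self.backbone = nn.Sequential(            # bottom-up feature extractor
            nn.Conv2d(3, bottom_up_ch, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(                # fuses both pathways
            nn.Conv2d(bottom_up_ch + guidance_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1))

    def forward(self, image, guidance):
        feats = self.backbone(image)              # bottom-up pathway
        fused = torch.cat([feats, guidance], dim=1)  # add top-down guidance
        return torch.sigmoid(self.head(fused))    # per-pixel gaze probability

image = torch.randn(1, 3, 120, 160)       # dashcam frame
guidance = torch.rand(1, 3, 120, 160)     # e.g., object/lane/vanishing-point maps
print(GazePredictor()(image, guidance).shape)    # torch.Size([1, 1, 120, 160])
```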
Panel Discussion: Social and Economic Impact of Embodied Intelligence for Autonomous Vehicles
Organizers
Hui Zhang
Beijing Jiaotong University
Weikai Li
Chongqing Jiaotong University
Zhixiong Nan
Chongqing University
Zhiyuan Liu
Southeast University
Xiao Wang
Anhui University