Abstract:
To address the management of unauthorized unmanned aerial vehicles (UAVs) in the low-altitude economy, a multimodal fusion method for UAV trajectory prediction based on LiDAR and radar information is proposed. A deep fusion network, termed the Multi-Modal Fusion Framework, is designed for this task. The framework consists of two main components: modality-specific feature extraction networks and a bidirectional cross-attention fusion module. This architecture fully leverages the complementary information in LiDAR and radar point clouds, capturing both spatial geometric structure and dynamic reflection characteristics. In the feature extraction stage, independent yet structurally identical feature encoders are designed for the LiDAR and radar data. Following feature extraction, the model employs a bidirectional cross-attention mechanism to achieve information complementarity and semantic alignment between the two modalities. To validate the effectiveness of the proposed model, the MMAUD dataset, used in the CVPR 2024 UG2+ UAV Tracking and Pose-Estimation Challenge, is adopted for training and testing. Experimental results demonstrate that the proposed multimodal fusion model significantly improves the accuracy of trajectory and position prediction. Ablation studies further confirm that the chosen loss functions and post-processing strategies each contribute to model performance. By efficiently utilizing multimodal data, the model provides a robust solution for trajectory prediction of unauthorized UAVs in the low-altitude economy.
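To make the fusion step concrete, the following is a minimal PyTorch-style sketch of a bidirectional cross-attention fusion module of the kind the abstract describes; it is an illustrative reconstruction, not the authors' implementation. The class and variable names (BidirectionalCrossAttentionFusion, lidar_feat, radar_feat), the residual-plus-LayerNorm structure, and the final concatenation-and-projection fusion are all assumptions.

```python
# Minimal sketch of bidirectional cross-attention fusion between two
# modality feature sequences, assuming PyTorch and inputs of shape
# (batch, tokens, dim) with equal token counts per modality.
import torch
import torch.nn as nn


class BidirectionalCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One attention block per direction: LiDAR features attend to
        # radar features, and radar features attend to LiDAR features.
        self.lidar_to_radar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.radar_to_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_lidar = nn.LayerNorm(dim)
        self.norm_radar = nn.LayerNorm(dim)
        # Project the concatenated enhanced features back to `dim`.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, lidar_feat: torch.Tensor, radar_feat: torch.Tensor) -> torch.Tensor:
        # Queries come from one modality; keys/values from the other.
        lidar_enh, _ = self.lidar_to_radar(lidar_feat, radar_feat, radar_feat)
        radar_enh, _ = self.radar_to_lidar(radar_feat, lidar_feat, lidar_feat)
        # Residual connections preserve each modality's original features.
        lidar_enh = self.norm_lidar(lidar_feat + lidar_enh)
        radar_enh = self.norm_radar(radar_feat + radar_enh)
        # Token-wise concatenation assumes both sequences share a length.
        return self.fuse(torch.cat([lidar_enh, radar_enh], dim=-1))


# Usage sketch with hypothetical shapes: 64 tokens, 256-dim features.
lidar_tokens = torch.randn(4, 64, 256)
radar_tokens = torch.randn(4, 64, 256)
fused = BidirectionalCrossAttentionFusion()(lidar_tokens, radar_tokens)
```

Under these assumptions, each modality's queries retrieve complementary context from the other modality, while the residual paths keep the original geometric and reflection features intact before the fused representation is passed to the trajectory predictor.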