Overview: This project focuses on aligning multimodal data (LiDAR point clouds and RGB images) to enhance perception in autonomous systems. Inspired by the Q-Former architecture from BLIP-2, the proposed model integrates a query-based transformer module that aligns the spatial relationships between the two modalities and generates shared scene-level embeddings in a unified feature space.
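As a rough illustration of the idea, the sketch below shows a Q-Former-style module in PyTorch in which a set of learnable queries cross-attends to concatenated LiDAR and image tokens. It is a minimal sketch only: the class name `QueryFusion` and all dimensions and hyperparameters are illustrative assumptions, not taken from the repository.

```python
import torch
import torch.nn as nn

class QueryFusion(nn.Module):
    """Illustrative Q-Former-style module: learnable queries cross-attend to
    LiDAR and RGB token embeddings to produce a shared scene-level feature."""

    def __init__(self, dim=256, num_queries=32, num_heads=8, num_layers=2, dropout=0.1):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, dropout=dropout, batch_first=True
        )
        # Each decoder layer self-attends over the queries, then cross-attends
        # into the concatenated sensor tokens.
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, lidar_tokens, image_tokens):
        # lidar_tokens: (B, N_l, dim) point-cloud features
        # image_tokens: (B, N_i, dim) RGB features
        memory = torch.cat([lidar_tokens, image_tokens], dim=1)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        fused = self.decoder(queries, memory)   # (B, num_queries, dim)
        return fused.mean(dim=1)                # scene-level embedding, (B, dim)


# Example usage with random tensors standing in for backbone outputs
fusion = QueryFusion()
lidar = torch.randn(4, 1024, 256)    # e.g., pooled point-cloud tokens
rgb = torch.randn(4, 196, 256)       # e.g., ViT patch tokens
scene_embedding = fusion(lidar, rgb)  # (4, 256)
```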
GitHub Repository: View Source Code on GitHub
Key Features:
Results and Impact
Experiments showed that the model successfully aligns LiDAR and RGB embeddings, making it applicable to autonomous driving and robotic perception tasks. With dropout regularization and hyperparameter tuning, the model achieved stable performance despite computational constraints.
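The report should be consulted for the exact training objective; a common way to align embeddings from two modalities (used here purely for illustration, not necessarily in this project) is a symmetric contrastive loss over paired scene embeddings, sketched below with hypothetical per-modality embedding tensors.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(lidar_emb, image_emb, temperature=0.07):
    """InfoNCE-style loss that pulls matched LiDAR/RGB scene embeddings
    together and pushes mismatched pairs apart. Generic illustration only."""
    lidar_emb = F.normalize(lidar_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = lidar_emb @ image_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each LiDAR embedding should match its paired image embedding, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```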
Technologies used:
This project contributes to advancing multimodal sensor fusion techniques, helping autonomous systems achieve more accurate and robust scene understanding.
Below is the full project report. You can view it directly here or download it.