Multimodal Perception and Fusion for Robust Human-Robot Interaction in Indoor Environments

Abstract
This paper presents a multimodal perception and fusion framework for enhancing human-robot interaction (HRI) in indoor environments. The proposed system integrates visual, auditory, and textual modalities to enable robust real-time understanding of user commands, gestures, and contextual cues. We design a hybrid fusion architecture that combines cross-attention mechanisms with feature alignment modules to correlate heterogeneous inputs from microphones, RGB-D cameras, and natural language interfaces. In particular, we introduce a unified transformer-based encoder that supports synchronized understanding of voice commands and visual cues such as pointing gestures or facial expressions. The fused representations drive a high-level interaction policy that governs robotic responses in service-oriented tasks. We evaluate the system across multiple indoor HRI scenarios, including object retrieval, information query, and spatial navigation. Experimental results in both simulated and physical settings demonstrate that our multimodal approach significantly improves task success rate, intent recognition accuracy, and user satisfaction compared to unimodal baselines. The findings underscore the importance of tightly coupled multimodal fusion in building context-aware, socially responsive robotic systems.
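
To make the fusion design concrete, the sketch below illustrates one plausible realization of the cross-attention fusion described in the abstract: per-modality features are projected into a shared embedding space (feature alignment), and language tokens attend over the concatenated audio-visual context before an intent prediction. This is a minimal illustrative sketch, not the authors' implementation; the feature dimensions, module names, intent vocabulary size, and the choice of PyTorch are all assumptions.

```python
# Illustrative sketch only: a minimal cross-attention fusion module in the spirit
# of the architecture described in the abstract. Dimensions, module names, and the
# fusion strategy are assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses speech, vision, and language token sequences via cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_intents: int = 16):
        super().__init__()
        # Project each modality into a shared embedding space (feature alignment).
        self.audio_proj = nn.Linear(128, d_model)   # e.g. log-mel / speech-encoder features
        self.visual_proj = nn.Linear(512, d_model)  # e.g. RGB-D backbone features
        self.text_proj = nn.Linear(300, d_model)    # e.g. word embeddings
        # Language queries attend over the concatenated audio-visual context tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.intent_head = nn.Linear(d_model, n_intents)  # hypothetical intent set

    def forward(self, audio, visual, text):
        # audio: (B, Ta, 128), visual: (B, Tv, 512), text: (B, Tt, 300)
        context = torch.cat([self.audio_proj(audio), self.visual_proj(visual)], dim=1)
        queries = self.text_proj(text)
        fused, _ = self.cross_attn(queries, context, context)
        fused = self.norm(fused + queries)          # residual connection over text queries
        # Pool over text tokens and predict an interaction intent.
        return self.intent_head(fused.mean(dim=1))


if __name__ == "__main__":
    model = CrossModalFusion()
    logits = model(torch.randn(2, 50, 128),   # audio frames
                   torch.randn(2, 20, 512),   # visual tokens
                   torch.randn(2, 8, 300))    # text tokens
    print(logits.shape)  # torch.Size([2, 16])
```

In this sketch the language stream acts as the query so that spoken or typed commands are grounded against what the robot currently sees and hears; the resulting pooled embedding is the kind of fused representation that a downstream interaction policy could consume.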