LocateNet: Large Multimodal Models for Text-Guided Object Localization

Abstract

This work investigates text-guided object localization and presents a unified framework that establishes fine-grained semantic correspondences between natural language expressions and visual regions. Existing multimodal models focus primarily on global semantic understanding and often fail to capture the attribute-level cues, relational semantics, and spatial structures required for accurate grounding. To address these limitations, the proposed method introduces a cross-modal structural fusion mechanism that jointly encodes linguistic constraints and visual representations, allowing the model to refine spatial cues at multiple semantic levels. A hierarchical spatial reasoning module further strengthens region-level discrimination by integrating structural cues into the localization process. The framework is evaluated under diverse conditions, including varying instruction complexity, multi-scale perturbations, and occlusion levels, for a comprehensive assessment of its robustness. Extensive experiments on a standard benchmark show substantial improvements in Acc@0.5, IoU, Pointing Game accuracy, and AUC, indicating that the model delivers both precise boundary estimation and stable semantic pointing. Additional sensitivity studies confirm that the approach maintains consistent localization quality even when visual inputs degrade or language instructions grow more complex. By enabling accurate, interpretable, text-driven spatial grounding, this work offers a practical solution for applications that require fine-grained cross-modal understanding.
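For reference, the sketch below computes the localization metrics named above (IoU, Acc@0.5, and Pointing Game accuracy) for axis-aligned boxes. It reflects only the standard definitions of these metrics; the box format, function names, threshold, and toy values are illustrative assumptions and are not taken from the paper's evaluation code.

```python
# Minimal sketch of the evaluation metrics cited in the abstract.
# Boxes are assumed to be (x1, y1, x2, y2) tuples; all names and values
# here are hypothetical, not the paper's implementation.
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]


def iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of predicted boxes whose IoU with the ground truth reaches `thr` (Acc@0.5 when thr=0.5)."""
    hits = sum(iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(preds)


def pointing_game_acc(points: Sequence[Tuple[float, float]], gts: Sequence[Box]) -> float:
    """Fraction of predicted points that fall inside the ground-truth box."""
    hits = sum(
        g[0] <= x <= g[2] and g[1] <= y <= g[3]
        for (x, y), g in zip(points, gts)
    )
    return hits / len(points)


if __name__ == "__main__":
    preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
    gts = [(12, 8, 48, 52), (30, 30, 60, 60)]
    print(f"Acc@0.5:  {acc_at_iou(preds, gts):.2f}")                       # 0.50
    print(f"Pointing: {pointing_game_acc([(30, 30), (5, 5)], gts):.2f}")   # 0.50
```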