Vision-Oriented Multi-Object Tracking via Transformer-Based Temporal and Attention Modeling

Abstract
Multi-Object Tracking (MOT) is a core task in visual understanding, with wide applications in intelligent surveillance, autonomous driving, and behavior analysis. Traditional MOT methods still struggle in complex scenarios involving occlusion, crowded targets, and identity switches. Building on the TransTrack model, this paper proposes an improved MOT algorithm. The algorithm introduces a multi-scale attention mechanism to enhance target representation and a trajectory memory module to model temporal consistency, effectively reducing identity switches under occlusion. To improve robustness on hard samples, a dynamic sample reweighting strategy guides the model to focus on difficult examples during training, thereby improving generalization. Extensive comparison experiments and ablation studies on the MOT17 dataset show that the proposed method outperforms mainstream approaches across multiple metrics, demonstrating strong accuracy and stability. This study offers a new perspective on optimizing Transformer-based tracking models and lays a foundation for future applications in high-density tracking scenarios.
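The abstract names a trajectory memory module but does not give its formulation. The sketch below illustrates one plausible realization in PyTorch: each track keeps an exponential-moving-average appearance embedding, and a target that reappears after occlusion is re-associated to the most similar stored track by cosine similarity instead of being assigned a new identity. The class name TrajectoryMemory, the momentum and sim_threshold parameters, and the EMA/cosine-matching design are all assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a trajectory memory module (illustrative assumption,
# not the paper's implementation).
from typing import Dict, Optional

import torch
import torch.nn.functional as F


class TrajectoryMemory:
    """Keeps an exponential moving average (EMA) of each track's appearance
    embedding so that targets re-emerging after occlusion can be re-associated
    by feature similarity instead of receiving a new identity."""

    def __init__(self, momentum: float = 0.9, sim_threshold: float = 0.6):
        self.momentum = momentum
        self.sim_threshold = sim_threshold
        self.bank: Dict[int, torch.Tensor] = {}  # track_id -> L2-normalized embedding

    def update(self, track_id: int, embedding: torch.Tensor) -> None:
        """Blend a new embedding into the stored one for an active track."""
        emb = F.normalize(embedding.detach(), dim=-1)
        if track_id in self.bank:
            blended = self.momentum * self.bank[track_id] + (1 - self.momentum) * emb
            self.bank[track_id] = F.normalize(blended, dim=-1)
        else:
            self.bank[track_id] = emb

    def match(self, embedding: torch.Tensor) -> Optional[int]:
        """Return the stored track id most similar to a newly detected,
        currently unmatched target, or None if no stored track reaches
        the cosine-similarity threshold."""
        if not self.bank:
            return None
        emb = F.normalize(embedding.detach(), dim=-1)
        ids = list(self.bank.keys())
        stored = torch.stack([self.bank[i] for i in ids])   # (N, D)
        sims = stored @ emb                                  # cosine similarities
        best = int(torch.argmax(sims))
        return ids[best] if sims[best] >= self.sim_threshold else None
```

Under these assumptions, the tracker would call update() for every matched track each frame and fall back to match() for detections left unmatched by the Transformer's query association, which is one common way to carry identity across short occlusions.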