
Data-Driven Failure Perception for Distributed Systems through Operational State Representation Learning

Abstract

Distributed systems play a critical supporting role in cloud computing and large-scale information infrastructures. They run in complex environments, span large numbers of nodes, and change state frequently. As a result, failures are often hard to observe, propagate along complex paths, and affect wide portions of the system. Traditional failure detection methods rely on manual rules and adapt poorly to dynamic environments. To address these limitations, this paper focuses on failure perception in distributed systems and proposes a unified data-driven modeling approach. The method starts from multi-node operational data and constructs a holistic representation of the system state. It learns normal behavior patterns from the system's operational modes and uses the degree of deviation from those patterns as the basis for failure perception, which enables automatic identification of complex operating conditions. The modeling accounts for both the temporal evolution and the global consistency of system states, so the model can form stable state representations even when many nodes execute in parallel and are strongly coupled. Comparative experiments on a public distributed system dataset show strong overall performance across multiple evaluation metrics. The results confirm the effectiveness and practical value of data-driven modeling for distributed system failure perception. This study provides a feasible technical solution for intelligent monitoring and system state awareness in complex distributed environments.
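The deviation-based idea described above can be illustrated with a minimal Python sketch. This is not the paper's model: simple per-dimension statistics stand in for the learned state representation, and the node count, the metric names (CPU load, latency), the trailing window size, and the threshold margin of 1.2 are all assumptions chosen for illustration. The data is synthetic.

```python
import math
import random
from statistics import mean, pstdev


def system_state(node_metrics):
    """Flatten per-node metric lists into one system-level state vector."""
    return [value for node in node_metrics for value in node]


def fit_normal_profile(states):
    """Learn per-dimension mean and standard deviation from normal-operation states."""
    dims = list(zip(*states))
    means = [mean(d) for d in dims]
    # Floor the standard deviation so constant dimensions do not divide by zero.
    stds = [max(pstdev(d), 1e-6) for d in dims]
    return means, stds


def deviation_score(state, profile):
    """Root-mean-square z-score of a state against the learned normal profile."""
    means, stds = profile
    squared = [((x - m) / s) ** 2 for x, m, s in zip(state, means, stds)]
    return math.sqrt(sum(squared) / len(squared))


def smoothed_scores(states, profile, window=3):
    """Average each score with its predecessors in a trailing window (temporal context)."""
    raw = [deviation_score(s, profile) for s in states]
    return [mean(raw[max(0, i - window + 1): i + 1]) for i in range(len(raw))]


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible


def sample(cpu_mean):
    """Synthetic snapshot: 3 hypothetical nodes, each reporting (CPU load, latency in ms)."""
    return system_state(
        [[rng.gauss(cpu_mean, 0.05), rng.gauss(20.0, 2.0)] for _ in range(3)]
    )


# Learn the normal profile from failure-free operation only.
normal_states = [sample(0.4) for _ in range(200)]
profile = fit_normal_profile(normal_states)

# Assumed thresholding rule: a margin above the largest score seen during normal operation.
threshold = 1.2 * max(smoothed_scores(normal_states, profile))

# Simulated failure: CPU load on every node jumps far outside the normal range.
faulty_states = [sample(0.9) for _ in range(5)]
alerts = [score > threshold for score in smoothed_scores(faulty_states, profile)]
print(alerts)
```

The sketch trains only on normal data, which mirrors the abstract's framing of failure perception as deviation from learned normal behavior. Concatenating all node metrics into one vector is the simplest possible way to score the system as a whole rather than node by node; a learned representation would replace both `system_state` and `fit_normal_profile`.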
