Stateful Detection in High Throughput Distributed Systems

Date: 2026-01-22

Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, Saurabh Bagchi

Dependable Computing Systems Lab (DCSL)

School of Electrical and Computer Engineering, Purdue University

Email: {gkhanna, ilaguna, faarshad, sbagchi}@purdue.edu

Abstract

With the increasing speed of computers and the complexity of applications, many of today’s distributed systems exchange data at a high rate. Significant work has been done on error detection through external fault tolerance systems. However, a high data rate coupled with complex detection can exhaust the capacity of the fault tolerance system, resulting in low detection accuracy. We present a new stateful detection mechanism which observes the exchanged application messages, deduces the application state, and matches it against anomaly-based rules. We extend our previous framework (the Monitor) to incorporate a sampling approach which adjusts the rate of verified messages. The sampling approach avoids the previously reported breakdown in the Monitor's capacity at high application message rates, reduces the overall detection cost, and allows the Monitor to provide accurate detection. We apply the approach to a reliable multicast protocol (TRAM) and demonstrate its performance by comparing it with our previous framework.

1. Introduction

The proliferation of high bandwidth applications and the increase in the number of consumers of distributed applications have caused them to operate at increasingly high data rates. Many of these distributed systems form parts of critical infrastructures, with real-time requirements. Hence it is imperative to provide error detection functionality to the applications. Error detection can broadly be classified as stateless detection and stateful detection. In the former, detection is done on individual messages by matching certain characteristics of the message, such as the length of the payload of the message. A more powerful approach for error detection is the stateful approach, in which the error detection system builds up state related to the application by aggregating multiple messages.

The rules are then based on the state, and thus on aggregated rather than instantaneous information. Stateful detection is regarded as a powerful mechanism for building dependable distributed systems [19][20]. Stateful detection models can be specified using various formalisms, such as State Transition Diagrams, Petri nets, or UML. Though the merits of stateful detection seem to be well accepted, scaling a stateful detection system with an increasing number of application entities or an increasing data rate is a challenge. This is due to the increased processing load of tracking application state and matching rules against that state. This problem has been documented for stateful firewalls that match rules on state spread across multiple, possibly distant, messages [19]. The stateful error detection system has to be designed without increasing the footprint of the system. Thus, throwing hardware or memory at the problem is not enough, because the application system also scales up and demands more from the detection system.
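The distinction between stateless and stateful detection can be illustrated with a minimal sketch. It is not taken from the Monitor; the message fields (`type`, `sender`, `time`, `payload`), the payload limit, and the NACK-rate rule are all hypothetical and chosen only for illustration. The stateless check looks at a single message, while the stateful check aggregates earlier messages from the same sender before deciding.

```python
from collections import defaultdict

MAX_PAYLOAD = 1024  # hypothetical payload limit in bytes


def stateless_check(msg):
    """Stateless rule: decide using only the fields of this one message."""
    return len(msg["payload"]) <= MAX_PAYLOAD


class StatefulCheck:
    """Stateful rule: at most `limit` NACKs per sender within `window` seconds.

    The per-sender list of NACK timestamps is the state that the detector
    builds up by aggregating multiple messages.
    """

    def __init__(self, limit=3, window=1.0):
        self.limit = limit
        self.window = window
        self.nack_times = defaultdict(list)

    def check(self, msg):
        if msg["type"] != "NACK":
            return True
        times = self.nack_times[msg["sender"]]
        times.append(msg["time"])
        # Discard timestamps that have fallen out of the sliding window.
        cutoff = msg["time"] - self.window
        times[:] = [t for t in times if t > cutoff]
        return len(times) <= self.limit


def make_nack(sender, time):
    """Build a small hypothetical NACK message."""
    return {"type": "NACK", "sender": sender, "time": time, "payload": b"x"}
```

Each NACK in a burst passes the stateless check on its own; only the stateful check, which remembers the earlier NACKs, can flag the burst. The per-sender bookkeeping is also where the extra processing and memory cost discussed above comes from.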

In our earlier work on developing an error detection system, we developed the Monitor ([1], [7]), which provides detection by only observing the messages exchanged between the protocol entities (PEs). The Monitor is said to verify a set of PEs when it is monitoring them. The Monitor is provided with a representation of the protocol behavior of the PEs being verified, in the form of a state transition diagram (STD), along with a set of stateful anomaly-based rules. The Monitor uses an observer model whereby it does not have any information about the internal state of the PEs. The Monitor performs two primary tasks on observing a message. First, it performs the state transition corresponding to the PE based on the observed message. Note that the state of the PE estimated by the Monitor may differ from the real state of the entity, since not all messages related to state changes are necessarily observable at the Monitor. Second, it performs rule matching for the rules associated with the particular state and message combination. We observe that the Monitor has a breaking point in terms of (1) the incoming message rate or (2) the number of entities that it can verify, beyond which the accuracy and latency of its detection suffer [7]. The drop in accuracy or rise in latency is very sharp beyond the breaking point. We observe through a test-bed experiment that as the incoming packet rate into a single Monitor is increased beyond 100 pkt/s, the Monitor system breaks down on a standard Linux box. In other words, its latency becomes exceedingly high and its detection accuracy tends to zero. This effect is shown in Figure 1. The breakdown is caused by the processing capacity at the Monitor being exhausted. Messages therefore see long waiting times and, once the buffer becomes full, are also dropped. Thus, for reasonable operation, the Monitor can only support data rates below the breaking point.
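The two per-message tasks described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Monitor's actual implementation: the class name `MonitorSketch`, the three-state STD, the message types, and the rule `nack_rate_ok` are hypothetical, and rules are assumed to be keyed by the estimated state before the transition.

```python
class MonitorSketch:
    """Observer that tracks an estimated state per PE and matches rules."""

    def __init__(self, std, rules, start_state):
        self.std = std                  # {(state, msg_type): next_state}
        self.rules = rules              # {(state, msg_type): [rule, ...]}
        self.start_state = start_state
        self.state = {}                 # estimated state of each PE
        self.history = {}               # observed (msg_type, time) pairs per PE

    def observe(self, pe, msg_type, t):
        """Process one observed message; return names of violated rules."""
        state = self.state.get(pe, self.start_state)
        hist = self.history.setdefault(pe, [])
        hist.append((msg_type, t))
        # Task 1: state transition. Unknown combinations leave the estimate
        # unchanged (an assumption of this sketch).
        self.state[pe] = self.std.get((state, msg_type), state)
        # Task 2: match the rules tied to this state and message combination.
        matched = self.rules.get((state, msg_type), [])
        return [rule.__name__ for rule in matched if not rule(hist)]


def nack_rate_ok(hist):
    """Hypothetical rule: at most 2 NACKs within the last second."""
    latest = hist[-1][1]
    recent = [m for m, t in hist if m == "NACK" and t > latest - 1.0]
    return len(recent) <= 2


# Hypothetical STD for a receiver in a reliable multicast session.
STD = {
    ("IDLE", "DATA"): "RECEIVING",
    ("RECEIVING", "NACK"): "RECOVERING",
    ("RECOVERING", "DATA"): "RECEIVING",
}
RULES = {("RECEIVING", "NACK"): [nack_rate_ok]}
```

Every observed message costs one STD lookup plus the evaluation of all matching rules, so the work per second grows with the message rate and with the number of PEs being verified. This per-message cost is what eventually exhausts the processing capacity described above.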

[Figure 1: Latency (ms) of the baseline Monitor. Plot not reproduced.]