Skip to content

The Phi Accrual Failure Detector

Published: at 04:00 AMSuggest Changes

Problem definition

Detecting failure in a distributed system setting is a desirable task for many obvious reasons. This paper introduces an implementation of an adaptive (accrual) failure detector. In this φ\varphi accrual failure detector, the conditions of the network is accumulated and used to update the probabilistic model for failure suspicion. Compares to the existing models in 2004, which output of suspicion level is binary, this implementation has the advantage of returning a real-value suspicion level. The authors compared their implementation to Chen Fault Detection and Bertier Fault Detection. For the benchmark scheme, they set up two computers between Japan and Switzerland transferring “heart beat” signal from Japan. They then later analyzed the collected data over a week and reported the result.

Notations

Note: The setting in this paper is sending heart beat signals from Japan to Switzerland.

NotationExplaination
φ\varphiSuspection value. Higher value means the higher chance the failure happened.
Φ\PhiHyperparameter. Threshold for φ\varphi.
Δi\Delta_i”Heart beat” signal period.
Δto\Delta_{to}Timeout for transmission.
Δtr\Delta_{tr}Average transmission time experienced by the messages.
α\alphaΔtrΔi+α\Delta_{tr} \approx \Delta_{i} + \alpha
qqMaster process that monitors other process for failure dection.
ppWorker process that sends “heart beat” signals.
TDT_DTime until q begins to suspect p permanently in case of failure happened.
λM\lambda_MAverage mistake rate at which a failure detector generates wrong suspicions.
P\lozenge PEventually perfect failure detector class.
susp_level_p(t) \mathrm{susp\_level\_p}(t)Suspicion level of p at time t.
ThighT_{high}Dynamic threshold upperbounds susp_levelpsusp\_level_p.
TlowT_{low}Dynamic threshold lowerbounds susp_levelpsusp\_level_p.
TlastT_{last}The time when thest most recent heart beat was received.
tnowt_{now}The current time.
Plater(t)P_{later}(t)The probability that the next heart beat will arrive after tt time unit since the last one.

Method

As mentioned above, this paper proposed an abstract accrual failure detector and a simple implementation of the idea with only 2 processes. This figure illustrates the differences in the conventional failure detection architecture and the proposed accrual failure detection architecture.

phifail

The main different in the proproposed architecture is the ability to return many suspicion levels instead of just binary levels. This scheme enable the system to perform many action as well as adaptive action based on the suspicion input level. In the proposed architecture, the suspicion level is represented by a value called φ\varphi. The suspicion level is defined by a logarithmic scale:

φ(tnow)log10(Plater(tnowTlast))\varphi(t_{now}) \triangleq -\log_{10}(P_{later}(t_{now} - T_{last}))

This formula is intuitive in the sense that it penaltize the delay tnowTlastt_{now} - T_{last} by a log scale of some pre-defined probabilistic model Plater(t)P_{later}(t). In this paper, the authors defined a threshold Φ\Phi for φ\varphi. Since this φ\varphi variable is computed in the log-scale, Φ\Phi also has logarithmic meaning. Each of the unit step increase of Φ\Phi will lead to ten times confident interval of failure detection. However, this fact only means that the confident about a failure dection is high, it doesn’t take into account the speed of the dectection.

The probabilistic model is given by the formula:

Plater(t)=1σ2πt+e(xμ)22σ2dx=1F(t)\begin{align} P_{later}(t) &= \frac{1}{\sigma \sqrt{2\pi}} \int_{t}^{+\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx \\ &= 1 - F(t) \end{align}

Note that the formula above is just the implementation of the abstract accrual failure detector in this paper. Theoretically speaking, we can choose any computable PlatertP_{later}{t} that is suitable to our need. The picture below demonstrate this probabilistic model.

iCDFt{ width=50% }

Until this point, we have the suspicion level φ\varphi and the probabilistic model for computing φ\varphi. In order to adapt the network condition into the failure detection scheme, the authors created a sized window with size WSWS. When the heart beat signal is arrived, it time stamp is stored into the window. The mean μ\mu and variance σ2\sigma^2 of the data in the window is maintained as the window received new data. In addtion, there are two variable keeping track of sum and sum of square of all element in the window are also maintained for computation convenient. The dataflow of the proposed implementation:

Flow

The dataflow figure above captured all the essential steps of the proposed algorithm. From the network, heartbest signal is collected and its time period is stored in the sampling window. To my understanding, the sampling window is a FIFO unit that removes oldest data when new data is added. The mean and variance for the probabilistic model Plater(t)P_{later}(t) is then computed using the data in the sampling window. At every time step, the value of φ\varphi is computed using the probabilistic model, the time of last arrival and the current time. This value φ\varphi will then be used in different application for various action. For example, in action 1, the threshold for φ\varphi is Φ1\Phi_1. Let’s say in this moment, φ>Φ1\varphi > \Phi_1, hence the machine will perform action 1 (e.g. warning, reallocate resources, etc.). On the other hand, action 2 has the threshold of Φ2\Phi_2, which is larger than φ\varphi at the moment, hence no action is performed. More interestingly, the multi-level suspicion φ\varphi enables the use of parametric action, which means the machine doesn’t have to behave in a binary manner (performs action or not), but it can perform actions to a certain degree adapting to the current situation.

Experiment

As mentioned above, the setting for experiment is a heartbeat signal transmission between Switzerland and Japan. Three failure detection schemes are compared: Chen FD, Bertier FD, and (this) ϕ\phi FD. The window size of 1,000 is used for all failure detectors. In this paper, the authors conducted 4 experiments:

More detail is provided in the authors’ paper.

Conclusion

This post only provides very high level abstraction of the authors’ work. I left out many discussion on the propertiesof failure detector, time period or heartbeat signal and the effect of network delay. Nevertheless, the results provided in this paper showed that the new scheme doesn’t imply additional cost in term of performance while it yields much better deteciton results under the authors’ benchmark. On another note, the authors also stated that based on their experimental result, it was sufficient to use normal distribution for the probabilistic model.


Previous Post
Deep compression case study - AlexNet
Next Post
Network Motifs