An adaptive failure detection scheme

This post is the summary of the paper: * The $\varphi$ accrual failure detector* by
N. Hayashibara, X. Défago, and T.Katayama. In Proc. 23rd IEEE Intl. Symp. on Reliable
Distributed Systems (SRDS’04), pp. 66-78, Florianópolis, Brazil, October 2004. IEEE CS Press.

More at: D2S Laboratory

Detecting failure in a distributed system setting is a desirable task for many obvious reasons. This paper introduces an implementation of an adaptive (accrual) failure detector. In this \(\varphi\) accrual failure detector, the conditions of the network is accumulated and used to update the probabilistic model for failure suspicion. Compares to the existing models in 2004, which output of suspicion level is binary, this implementation has the advantage of returning a real-value suspicion level. The authors compared their implementation to Chen Fault Detection and Bertier Fault Detection. For the benchmark scheme, they set up two computers between Japan and Switzerland transferring “heart beat” signal from Japan. They then later analyzed the collected data over a week and reported the result.

**Input**: A set of master processes \(\mathcal{M}\) and a set of worker processes \(\mathcal{W}\). The master processes are in charge of monitoring the worker processes.**Output**: An indicator of failure for each worker processes.**Assumption**: In this paper, for the shake of simplicity, the authors assumed that the master processes will never crash. Furthermore, only one worker and one master scheme was discussed in this paper.

*Note: The setting in this paper is sending heart beat signals from Japan to Switzerland.*

Notation | Explaination |
---|---|

\(\varphi\) | Suspection value. Higher value means the higher chance the failure happened. |

\(\Phi\) | Hyperparameter. Threshold for \(\varphi\). |

\(\Delta_i\) | “Heart beat” signal period. |

\(\Delta_{to}\) | Timeout for transmission. |

\(\Delta_{tr}\) | Average transmission time experienced by the messages. |

\(\alpha\) | \(\Delta_{tr} \approx \Delta_{i} + \alpha\) |

\(q\) | Master process that monitors other process for failure dection. |

\(p\) | Worker process that sends “heart beat” signals. |

\(T_D\) | Time until q begins to suspect p permanently in case of failure happened. |

\(\lambda_M\) | Average mistake rate at which a failure detector generates wrong suspicions. |

\(\lozenge P\) | Eventually perfect failure detector class. |

\(\mathrm{susp\_level\_p}(t)\) | Suspicion level of p at time t. |

\(T_{high}\) | Dynamic threshold upperbounds \(susp\_level_p\). |

\(T_{low}\) | Dynamic threshold lowerbounds \(susp\_level_p\). |

\(T_{last}\) | The time when thest most recent heart beat was received. |

\(t_{now}\) | The current time. |

\(P_{later}(t)\) | The probability that the next heart beat will arrive after \(t\) time unit since the last one. |

As mentioned above, this paper proposed an abstract accrual failure detector and a simple implementation of the idea with only 2 processes. This figure illustrates the differences in the conventional failure detection architecture and the proposed accrual failure detection architecture.

The main different in the proproposed architecture is the ability to return
many suspicion levels instead of just binary levels. This scheme enable the system
to perform many action as well as adaptive action based on the suspicion input
level. In the proposed architecture, the *suspicion level* is represented by
a value called $\varphi$. The suspicion level is defined by a logarithmic scale:

This formula is intuitive in the sense that it penaltize the delay $t_{now} - T_{last}$
by a log scale of some pre-defined probabilistic model $P_{later}(t)$. In this
paper, the authors defined a threshold $\Phi$ for $\varphi$. Since this $\varphi$
variable is computed in the log-scale, $\Phi$ also has logarithmic meaning. Each of
the unit step increase of $\Phi$ will lead to ten times confident interval of
failure detection. However, this fact only means that the confident about a failure
dection is high, it doesn’t take into account the *speed* of the dectection.

The probabilistic model is given by the formula:

\[\begin{align} P_{later}(t) &= \frac{1}{\sigma \sqrt{2\pi}} \int_{t}^{+\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx \\ &= 1 - F(t) \end{align}\]Note that the formula above is just the implementation of the abstract accrual failure detector in this paper. Theoretically speaking, we can choose any computable $P_{later}{t}$ that is suitable to our need. The picture below demonstrate this probabilistic model.

Until this point, we have the suspicion level $\varphi$ and the probabilistic model for computing $\varphi$. In order to adapt the network condition into the failure detection scheme, the authors created a sized window with size \(WS\). When the heart beat signal is arrived, it time stamp is stored into the window. The mean $\mu$ and variance $\sigma^2$ of the data in the window is maintained as the window received new data. In addtion, there are two variable keeping track of sum and sum of square of all element in the window are also maintained for computation convenient. The dataflow of the proposed implementation:

The dataflow figure above captured all the essential steps of the proposed
algorithm. From the network, heartbest signal is collected and its time period
is stored in the sampling window. To my understanding, the sampling window
is a FIFO unit that removes oldest data when new data is added. The mean and
variance for the probabilistic model $P_{later}(t)$ is then computed using
the data in the sampling window. At every time step, the value of $\varphi$
is computed using the probabilistic model, the time of last arrival and the
current time. This value $\varphi$ will then be used in different application
for various action. For example, in *action 1*, the threshold for $\varphi$ is
$\Phi_1$. Let’s say in this moment, $\varphi > \Phi_1$, hence the machine will
perform *action 1* (e.g. warning, reallocate resources, etc.). On the other hand,
*action 2* has the threshold of $\Phi_2$, which is larger than $\varphi$ at
the moment, hence no action is performed. More interestingly, the multi-level
suspicion $\varphi$ enables the use of parametric action, which means the machine
doesn’t have to behave in a binary manner (performs action or not), but it can
perform actions to a certain degree adapting to the current situation.

As mentioned above, the setting for experiment is a heartbeat signal transmission between Switzerland and Japan. Three failure detection schemes are compared: Chen FD, Bertier FD, and (this) $\phi$ FD. The window size of 1,000 is used for all failure detectors. In this paper, the authors conducted 4 experiments:

- Exp1: Average mistake rate $\lambda_M$. This experiment aims to provide some reasoning between the average mistake rate and the threshold $\Phi$.
- Exp2: Average detection time $T_D$. In this experiment, the relation between average detection time and the threshold is studied. Consistent with what I mentioned above, while the average mistake rate $\lambda_M$ decreased with high threshold (8-12), the detection time is increased significantly.
- Exp3: Effect of window size. The window size is plotted against the mistake rate. There are three lines representing 3 values of $\Phi$: 1,3,5. The result showed that larger widnow size leads to lower mistake rate. The result for different values of $\Phi$ is also consistent with experiment 2.
- Exp4: Comparision with Chen FD and Bertier FD. The authors conducted two experiment in this category. First experiment is comparision in the internet setting and the second is in the LAN setting. In both experiment, $\phi$ FD outperformed the other two methods.

More detail is provided in the authors’ paper.

This post only provides very high level abstraction of the authors’ work. I left out many discussion on the propertiesof failure detector, time period or heartbeat signal and the effect of network delay. Nevertheless, the results provided in this paper showed that the new scheme doesn’t imply additional cost in term of performance while it yields much better deteciton results under the authors’ benchmark. On another note, the authors also stated that based on their experimental result, it was sufficient to use normal distribution for the probabilistic model.