Recovering From Failure

Asynchrony and Failure

Diffusion is designed around asynchrony to achieve high performance. With an UpdateStream, developers can achieve high throughput by updating topics asynchronously, that is, by applying further topic updates before receiving the results of prior updates.
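The difference between waiting for each result and applying updates ahead of their results can be sketched in a few lines. This is an illustrative simulation only: `FakeUpdateStream`, its `set` method and the `latency` parameter are hypothetical stand-ins written for this example and are not the Diffusion Client API.

```python
import asyncio


class FakeUpdateStream:
    """Hypothetical stand-in for an update stream; NOT the Diffusion API."""

    def __init__(self, latency=0.01):
        self.latency = latency
        self.applied = []

    async def set(self, value):
        # Simulate the round trip before the server's result arrives.
        await asyncio.sleep(self.latency)
        self.applied.append(value)
        return value


async def update_sequentially(stream, values):
    # Wait for the result of each update before applying the next one.
    return [await stream.set(value) for value in values]


async def update_pipelined(stream, values):
    # Apply every update first, then collect the results afterwards.
    pending = [asyncio.create_task(stream.set(value)) for value in values]
    return await asyncio.gather(*pending)


def demo():
    values = list(range(5))
    sequential = asyncio.run(update_sequentially(FakeUpdateStream(), values))
    pipelined = asyncio.run(update_pipelined(FakeUpdateStream(), values))
    return sequential, pipelined
```

Both functions return the same results in the same order. The pipelined version overlaps the simulated round trips instead of paying for each one in turn, which is where the throughput gain comes from.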

If a topic update fails, however, all subsequent topic updates also fail, including those “in flight” that have already been applied via an UpdateStream. The Diffusion Client API provides features to calculate what has been lost following a failure, re-establish communications, and reapply the lost topic updates.
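One way to picture "calculating what has been lost" is a ledger that records each update when it is sent and removes it when the server confirms it. Whatever remains after a failure is the set of lost updates. The sketch below assumes this bookkeeping model; `InFlightLedger` and its methods are hypothetical names invented for illustration and do not describe how the Diffusion client is implemented.

```python
class InFlightLedger:
    """Hypothetical bookkeeping helper; NOT part of the Diffusion API."""

    def __init__(self):
        self._pending = {}  # update id -> value, kept in send order
        self._next_id = 0

    def sent(self, value):
        # Record an update at the moment it is applied to the stream.
        update_id = self._next_id
        self._next_id += 1
        self._pending[update_id] = value
        return update_id

    def acknowledged(self, update_id):
        # The server confirmed this update, so it is no longer in flight.
        self._pending.pop(update_id, None)

    def lost(self):
        # After a failure, everything still pending must be reapplied.
        return list(self._pending.values())


def demo():
    ledger = InFlightLedger()
    ids = [ledger.sent(value) for value in ["a", "b", "c", "d"]]
    # The server confirms the first two updates, then a failure occurs.
    ledger.acknowledged(ids[0])
    ledger.acknowledged(ids[1])
    return ledger.lost()
```

In this example `demo()` returns `["c", "d"]`: the two updates that were sent but never confirmed.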

The Recovery Mechanism

Some Diffusion server errors are ephemeral: they occur only when the Diffusion cluster changes its membership. After an ephemeral error occurs, it is safe to re-establish the UpdateStream and reapply any lost topic updates.

The RecoverableUpdateStream is a thin wrapper around the UpdateStream. It accounts topic updates against matching server responses to build a list of in-flight topic updates. When user code detects an error and the error is ephemeral, it calls the RecoverableUpdateStream's recover method, which does the following:
  1. Re-establish its UpdateStream.
  2. Reapply any in-flight topic updates - handling any errors in the process.
  3. Return control to user code, or signal an error if recovery was impossible.
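The three steps above can be sketched as a small simulation. Everything here is an assumption made for illustration: `EphemeralError`, `RecoveryError`, `FlakyStream` and `RecoverableStreamSketch` are hypothetical classes, and the real RecoverableUpdateStream may differ in its signatures and behaviour.

```python
class EphemeralError(Exception):
    """Hypothetical marker for an error that is safe to recover from."""


class RecoveryError(Exception):
    """Hypothetical signal that recovery was impossible."""


class FlakyStream:
    """Hypothetical stream that fails permanently after `fail_after` updates."""

    def __init__(self, store, fail_after=None):
        self.store = store
        self.remaining = fail_after  # None means the stream never fails

    def set(self, value):
        if self.remaining is not None:
            if self.remaining <= 0:
                raise EphemeralError("cluster membership changed")
            self.remaining -= 1
        self.store.append(value)


class RecoverableStreamSketch:
    """Thin wrapper that tracks in-flight updates; NOT the Diffusion API."""

    def __init__(self, stream_factory):
        self._factory = stream_factory
        self._stream = stream_factory()
        self._in_flight = []

    def set(self, value):
        self._in_flight.append(value)
        self._stream.set(value)  # on failure the value stays in flight
        self._in_flight.pop()    # on success it is accounted for

    def recover(self):
        # Step 1: re-establish the underlying stream.
        self._stream = self._factory()
        # Step 2: reapply in-flight updates in their original order.
        while self._in_flight:
            try:
                self._stream.set(self._in_flight[0])
            except EphemeralError as error:
                raise RecoveryError("recovery was impossible") from error
            self._in_flight.pop(0)
        # Step 3: returning normally hands control back to user code.


def demo():
    store = []
    streams = iter([FlakyStream(store, fail_after=2), FlakyStream(store)])
    stream = RecoverableStreamSketch(lambda: next(streams))
    for value in ["a", "b", "c", "d"]:
        try:
            stream.set(value)
        except EphemeralError:
            pass  # the failed update remains in flight until recovery
    stream.recover()
    return store
```

In this simulation the first stream accepts "a" and "b" and then fails, so "c" and "d" remain in flight. Calling `recover` swaps in a fresh stream and reapplies them, leaving the store with all four values in their original order.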