Recovering From Failure
Asynchrony and Failure
Diffusion is architected with asynchrony in mind to achieve high performance. Using its UpdateStream, developers can achieve high throughput by updating topics asynchronously: applying further topic updates before receiving the results of prior updates.
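The asynchronous pattern can be illustrated with a minimal, library-independent sketch. It does not use the Diffusion Client API: `set_value` is a hypothetical stand-in for an asynchronous topic update, and the example only shows that several updates are started before any result is awaited.

```python
import asyncio


async def set_value(value):
    # Hypothetical stand-in for an asynchronous topic update (not a Diffusion call).
    await asyncio.sleep(0)  # simulate waiting for the server's response
    return value


async def main():
    # Start every update without waiting for the result of the previous one.
    pending = [asyncio.create_task(set_value(v)) for v in range(3)]
    # Only now collect the results; gather returns them in submission order.
    return await asyncio.gather(*pending)


results = asyncio.run(main())
```

Awaiting each update before starting the next would add a full round trip per update; starting them all first lets the round trips overlap.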
If a topic update fails, however, all subsequent topic updates also fail, including those “in flight” that have already been applied via an UpdateStream. The Diffusion Client API provides features to calculate what was lost following a failure, re-establish communications, and reapply the lost topic updates.
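One way to reason about what has been lost is to remember every update that has been sent but not yet acknowledged. The sketch below is illustrative only: `InFlightTracker` is a hypothetical helper, not part of the Diffusion Client API, and it assumes that results arrive in the same order the updates were sent.

```python
from collections import deque


class InFlightTracker:
    """Hypothetical helper: remembers updates sent but not yet acknowledged."""

    def __init__(self):
        self._pending = deque()  # updates in send order, oldest first

    def sent(self, value):
        # Record an update as soon as it has been handed to the stream.
        self._pending.append(value)

    def acknowledged(self):
        # Assumes results arrive in send order, so the oldest update is done.
        return self._pending.popleft()

    def lost(self):
        # After a failure, every unacknowledged update must be reapplied.
        return list(self._pending)


tracker = InFlightTracker()
for value in ["a", "b", "c"]:
    tracker.sent(value)
tracker.acknowledged()          # the result for "a" arrives
lost_updates = tracker.lost()   # the stream fails before "b" and "c" complete
```

Here the updates still pending at the moment of failure, "b" and "c", are the ones that would need to be reapplied.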
The Recovery Mechanism
Some Diffusion server errors are ephemeral: they occur only when the Diffusion cluster changes its membership. After an ephemeral error occurs, it is safe to re-establish the UpdateStream and reapply any lost topic updates. The recovery mechanism performs the following steps:
- Re-establish the UpdateStream.
- Reapply any in-flight topic updates, handling any errors in the process.
- Return control to user code, or signal an error if recovery was impossible.
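The three steps above can be sketched as a small retry loop. This is a simplified model under stated assumptions, not the Diffusion implementation: `EphemeralError`, `recover`, `open_stream`, `max_attempts`, and `FlakyStream` are all hypothetical names introduced for illustration, and the stream is assumed to expose a single `set(value)` method.

```python
class EphemeralError(Exception):
    """Hypothetical marker for an error caused by a cluster membership change."""


def recover(open_stream, in_flight, max_attempts=3):
    """Re-establish a stream and reapply in-flight updates.

    open_stream: hypothetical factory returning an object with a set(value) method.
    in_flight: updates sent before the failure whose results were never received.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            stream = open_stream()       # step 1: re-establish the stream
            for value in in_flight:      # step 2: reapply lost updates in order
                stream.set(value)
            return stream                # step 3: return control to the caller
        except EphemeralError as error:
            last_error = error           # membership changed again: retry
    # step 3, failure case: recovery was impossible within the attempt budget
    raise RuntimeError("recovery failed") from last_error


class FlakyStream:
    """Stand-in stream whose first instance fails, to exercise the retry path."""

    instances = 0

    def __init__(self, applied):
        FlakyStream.instances += 1
        self._fail = FlakyStream.instances == 1
        self._applied = applied

    def set(self, value):
        if self._fail:
            raise EphemeralError("cluster membership changed")
        self._applied.append(value)


applied = []
stream = recover(lambda: FlakyStream(applied), ["b", "c"])
```

In this run the first stream fails on its first update, so the loop opens a second stream and reapplies both updates. Only errors marked as ephemeral are retried; any other exception propagates to the caller immediately.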