Server clusters for high availability
Consider how to replicate session, topic and configuration information across a cluster of Diffusion™ servers to increase availability and reliability.
Diffusion uses a datagrid to share session and topic information between Diffusion servers within a cluster, providing high availability for clients connecting to load-balanced servers.
Diffusion uses Hazelcast™ as its datagrid. Hazelcast is a third-party product that is included in the Diffusion server installation and runs within the Diffusion server process.
The datagrid is responsible for the formation of clusters and the exchange of replicated data. These clusters operate on a peer-to-peer basis and by default there is no hierarchy of servers within the cluster.
Servers reflect session and topic information into the datagrid. If a server becomes unavailable, another server can access the session and topic information that is stored in the datagrid and take over the responsibilities of the first server.
As well as session and topic information, servers can use configuration replication to replicate configuration items such as security stores, topic views and metric collectors.
Configuration replication is active if session or topic replication is enabled, or it can be enabled separately.
- control authentication handler requests
- missing topic notifications
- request-response messaging
Some client control operations are cluster-aware. A command that targets a specific session is routed to the server in the cluster that hosts that session. A command sent to a session filter is applied to all matching sessions across the cluster. The cluster-aware operations are:
- changeRoles
- close
- setConflated
- setSessionProperties
- getSessionProperties
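The routing behavior of cluster-aware operations can be pictured with a small toy model. This is only an illustrative sketch: the `Server` and `Cluster` classes and the `route` and `apply_to_filter` methods are hypothetical names invented for this example and are not part of the Diffusion API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Server:
    """Hypothetical stand-in for one server in a cluster."""
    name: str
    # Maps a session id to that session's properties.
    sessions: Dict[str, dict] = field(default_factory=dict)


class Cluster:
    """Hypothetical stand-in for a cluster of servers."""

    def __init__(self, servers: List[Server]) -> None:
        self.servers = servers

    def route(self, session_id: str) -> Optional[Server]:
        """Return the server that hosts the session, or None if no server does."""
        for server in self.servers:
            if session_id in server.sessions:
                return server
        return None

    def apply_to_filter(
        self,
        matches: Callable[[dict], bool],
        command: Callable[[dict], None],
    ) -> List[Tuple[str, str]]:
        """Apply a command to every matching session on every server."""
        touched = []
        for server in self.servers:
            for session_id, properties in server.sessions.items():
                if matches(properties):
                    command(properties)
                    touched.append((server.name, session_id))
        return touched


cluster = Cluster([
    Server("A", {"s1": {"region": "eu"}}),
    Server("B", {"s2": {"region": "eu"}, "s3": {"region": "us"}}),
])

# A command for a single session goes to the server that hosts it.
print(cluster.route("s2").name)  # B

# A command for a session filter reaches matching sessions on all servers.
print(cluster.apply_to_filter(
    lambda props: props["region"] == "eu",
    lambda props: props.update(conflated=True),
))  # [('A', 's1'), ('B', 's2')]
```

In this model the caller never needs to know which server hosts a session, which mirrors the description above: the cluster resolves the target for the caller.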
See Configuring the Diffusion server to use replication and Replication.xml for more details.
Considerations
- By default Hazelcast uses multicast to discover other nodes to replicate data to. This is not secure for production use. In production, configure your Hazelcast nodes to replicate data only with explicitly defined nodes. For more information, see Configuring the Hazelcast datagrid.
- When Diffusion servers are merged into a cluster, the servers can have inconsistent replicated data. Unresolved inconsistencies can cause unpredictable behavior, due to issues such as conflicts between updaters. If the inconsistencies cannot be resolved, this is known as "split-brain". The inconsistent Diffusion server or servers are shut down and must be restarted.
Diffusion servers in a cluster can become inconsistent in a number of circumstances; for example, if a network partitions and then heals.
The quorum setting can help prevent inconsistencies due to network partitions. It enables you to set a minimum size for a cluster, below which the servers in a cluster will all shut down.
You should choose a quorum value so that after a network partition, the smaller cluster will shut down instead of attempting to heal. The servers from the smaller cluster can then be restarted and join the cluster cleanly, avoiding inconsistencies.
If you want to use the quorum feature, use an odd number of servers and set the value to just over half the cluster size. For example, if you have 5 servers in a cluster, set the quorum value to 3.
Note that servers shut down by the quorum feature will not restart automatically.
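- An ideally sized cluster contains at least 3 nodes, and no more than 5 without consultation. Design your cluster to contain an odd number of servers: an odd-sized cluster cannot split into two equal halves, so after a two-way network partition one side always holds a majority and the cluster can recover from a "split-brain".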
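
The quorum guidance above comes down to simple arithmetic, sketched here in Python. The function names `majority_quorum` and `surviving_partitions` are hypothetical helpers written for this illustration; they are not Diffusion configuration settings or API calls.

```python
from typing import List


def majority_quorum(cluster_size: int) -> int:
    """Return the smallest number of servers that is more than half the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def surviving_partitions(partition_sizes: List[int], quorum: int) -> List[int]:
    """Return the partition sizes that still meet the quorum after a split."""
    return [size for size in partition_sizes if size >= quorum]


# Five servers: a quorum of 3 is just over half the cluster size.
quorum = majority_quorum(5)
print(quorum)  # 3

# A 3/2 partition leaves exactly one side running.
print(surviving_partitions([3, 2], quorum))  # [3]

# Four servers split 2/2: neither side meets the quorum of 3.
print(surviving_partitions([2, 2], majority_quorum(4)))  # []
```

The last case shows why an odd cluster size is recommended: with an even number of servers, an even split leaves no partition large enough to keep running.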