
7.1. Why monitoring matters

If one or more nodes in a BDR group are down, DDL locking for DDL replication will wait indefinitely or until cancelled. DDL locking requires consensus across all nodes, not just a quorum, so every node must be reachable. It is therefore important to monitor for node outages.

Global sequence chunk allocations can also be disrupted if half or more of the nodes are down or unreachable. See Global Sequence Voting.
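The two availability conditions above can be expressed as simple arithmetic. The following Python is only an illustrative sketch: the function names are hypothetical and not part of BDR, and it assumes that the reachable count includes the local node.

```python
def ddl_lock_possible(total_nodes: int, reachable_nodes: int) -> bool:
    """DDL locking needs every node in the group, not just a quorum."""
    return reachable_nodes == total_nodes


def sequence_allocation_possible(total_nodes: int, reachable_nodes: int) -> bool:
    """Allocation is disrupted once half or more of the nodes are down."""
    down_nodes = total_nodes - reachable_nodes
    return 2 * down_nodes < total_nodes


# Hypothetical 4-node group with one node down:
# DDL locking blocks, but sequence chunk allocation can still proceed.
print(ddl_lock_possible(4, 3))             # False
print(sequence_allocation_possible(4, 3))  # True
# With two of four nodes down, both are disrupted.
print(sequence_allocation_possible(4, 2))  # False
```

Note that the DDL locking condition is the stricter of the two: a single unreachable node is enough to block it.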

Because DDL locking and global sequence allocations insert messages into the replication stream, a node that is extremely behind on replay will cause similar disruption to one that is entirely down.

Protracted node outages can also cause disk space exhaustion, resulting in other nodes rejecting writes or performing emergency shutdowns. Because every node connects to every other node, each node holds a replication slot for every downstream peer node. Replication slots ensure that an upstream (sending) server retains enough write-ahead log (WAL) in pg_xlog to resume replay from the point the downstream peer (receiver) last replayed on that slot. If a peer stops consuming data on a slot, or falls increasingly behind on replay, the server that holds that slot will accumulate WAL until it runs out of disk space on pg_xlog. This can happen even when the downstream peer is online and replaying, if it cannot receive and replay changes as fast as the upstream node generates them.

It is therefore important to have automated monitoring in place, so that the administrator is alerted when replication slots start falling badly behind and can act before disk space runs out.
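As a rough illustration of such a check, the Python sketch below estimates how much WAL each slot is holding back and reports the slots that exceed a threshold. It assumes the inputs have already been fetched from the upstream server, for example the current WAL position from pg_current_xlog_location() and each slot's restart_lsn from the pg_replication_slots view. The helper names, slot names, positions and threshold are all hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual WAL position such as '16/B374D848' to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slots_behind(current_lsn, slots, max_lag_bytes):
    """Return (slot_name, lag_bytes) for slots retaining more than max_lag_bytes.

    slots maps a slot name to its restart_lsn, or None if the slot has not
    reserved any WAL yet. The result is sorted with the worst slot first.
    """
    current = lsn_to_int(current_lsn)
    lagging = []
    for name, restart_lsn in slots.items():
        if restart_lsn is None:
            continue  # nothing retained for this slot
        lag = current - lsn_to_int(restart_lsn)
        if lag > max_lag_bytes:
            lagging.append((name, lag))
    return sorted(lagging, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: one peer nearly caught up, one far behind.
sample_slots = {
    "slot_node_b": "1/FFFFF000",
    "slot_node_c": "0/10000000",
    "slot_node_d": None,
}
for name, lag in slots_behind("2/0", sample_slots, max_lag_bytes=1024 ** 3):
    print(f"ALERT: {name} is retaining {lag} bytes of WAL")
```

A suitable threshold depends on how much free space pg_xlog has and how quickly the node generates WAL, so the 1 GiB figure here is only a placeholder.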