8.1. Executing DDL on BDR systems

A BDR group is not the same as a standalone PostgreSQL server. It is based on asynchronous multi-master replication without a central locking and transaction co-ordinator. This has important implications when executing DDL.

To safely manipulate the database schema in an asynchronous multi-master setup, all pending changes have to be replicated first. Otherwise it is possible that a replicated row contains data for a column that has since been dropped, or carries no data for a column that is marked NOT NULL. More complex cases also exist. To handle this problem, BDR acquires a so-called DDL lock the first time a transaction makes schema changes.

Acquiring the global DDL lock requires contacting all nodes in a BDR group, asking them to replicate all pending changes and to prevent further changes from being made. Once all nodes are fully caught up, the originator of the DDL lock is free to perform schema changes and replicate them to the other nodes. While the global DDL lock is held by a node, no node may perform any DDL or make any changes to rows.

This means that schema changes, unlike data changes, can only be performed while all configured nodes are reachable and keeping up reasonably well with the current write rate. If DDL absolutely must be performed while a node is down, that node has to be removed from the configuration first (using bdr.bdr_part_by_node_names). (Once removed, a node cannot be added back; it must be decommissioned and a new node joined in its place.)
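
For example, a minimal sketch of removing an unreachable node before performing DDL, assuming the departed node was registered under the (illustrative) name 'node2':

-- Remove the unreachable node from the BDR group so that DDL lock
-- acquisition no longer waits for it. The node name is illustrative.
SELECT bdr.bdr_part_by_node_names(ARRAY['node2']);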

8.1.1. The DDL lock

DDL is a heavier-weight operation in BDR than on standalone PostgreSQL. Performing DDL on any node will acquire a "global DDL lock". This causes new transactions that attempt write operations on any node to pause (block) until the DDL lock is released or canceled. Existing write transactions are given a grace period (controlled by bdr.max_ddl_lock_delay) in which to complete; if they do not finish within it, they are aborted (canceled) with the error:

FATAL:  terminating connection due to conflict with recovery
DETAIL:  User was holding a relation lock for too long.

New writes continue to be blocked until the DDL operation has replicated to all nodes, has been applied, and all nodes have confirmed to the DDL originator that the changes were applied, or until the transaction performing the DDL is canceled (aborted) by the user or administrator. All writes are blocked, even those that do not affect the objects the in-progress DDL is modifying.

While any transaction on any node holds the DDL lock, DDL attempted by any other transaction on any node will immediately fail with an error like:

ERROR:  database is locked against ddl by another node
HINT:  Node (6313760193895071967,1,16385) in the cluster is already performing DDL

There is no grace period for conflicting DDL (schema changes), only for DML (row writes).
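
The length of the DML grace period is governed by bdr.max_ddl_lock_delay, mentioned above. A hedged sketch of adjusting it cluster-wide, assuming the setting accepts a duration and may be changed with ALTER SYSTEM (the value shown is purely illustrative, not a recommendation):

-- Shorten the grace period before conflicting write transactions are canceled.
-- Depending on the setting's context, a configuration reload or a restart may
-- be needed for the change to take effect.
ALTER SYSTEM SET bdr.max_ddl_lock_delay = '10s';
SELECT pg_reload_conf();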

If the transaction holding the DDL lock is taking too long to complete, or DDL lock acquisition is getting stuck because of replication delays or unreachable nodes, you can cancel the transaction that is acquiring the lock: simply pg_terminate_backend() the backend that is taking or holding the DDL lock, as shown below. The whole process is crash-safe, so this is safe to do at any point.
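
A sketch of doing so, assuming the offending session can be recognised in pg_stat_activity by the statement it is running (the filter and pid below are illustrative, not a built-in way of finding the lock holder):

-- Find the session running the DDL (the filter is illustrative).
SELECT pid, usename, state, query
  FROM pg_stat_activity
 WHERE query ILIKE 'ALTER %'
    OR query ILIKE 'CREATE %'
    OR query ILIKE 'DROP %';

-- Terminate it, substituting the pid found above.
SELECT pg_terminate_backend(12345);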

8.1.1.1. How the DDL lock works

If you don't care how the global DDL lock works you can skip this section, but understanding it will be useful when you're diagnosing issues.

There are two levels to the DDL lock: the global DDL lock, which only one node at a time may hold, and the local DDL lock, which each node has separately. When the global DDL lock is held, all the local DDL locks are held too.

The (somewhat simplified) process of DDL lock acquisition is:

  1. A normal user backend attempts something that requires the DDL lock

  2. The BDR command filter notices that the DDL lock is needed, pauses the user's command, and requests that the local BDR node acquire the global DDL lock

  3. The local BDR node acquires its own local DDL lock. It will now reject any incoming lock requests from other nodes, cancel write transactions if needed, and pause new write transactions if needed.

  4. The local BDR node writes a message into its replication stream asking every other node to take their local DDL locks and to reply to confirm they have done so

  5. Every node that gets the request acquires the local DDL lock to prevent concurrent DDL and writes, then checks with every other node to see that they've all replayed all outstanding changes from each other

  6. When a node confirms all its peers have fully replayed its pending replication stream it replies to the lock requestor to say the lock is granted

  7. When all peers have confirmed lock acquisition, the requesting node knows it now holds the global DDL lock

  8. The requesting node makes the required schema changes

  9. The requesting node writes the fact that it's done with the DDL lock to its WAL in the form of a lock release message

  10. The requesting node releases its local DDL lock and resumes normal write operations

  11. The other nodes replay the lock release message and release their local DDL locks, resuming normal write operations

Critically, this means that every BDR node must complete a two-way communication with every other BDR node before the DDL lock can be granted. This communication happens over the replication stream, so replication lag, network slowness, and outages all delay the DDL locking process. While the system is in the process of acquiring the DDL lock, many nodes will hold their local DDL locks and will be rejecting other DDL requests or, if the lock mode requires it, rejecting writes.

This means that schema changes and anything else that takes the DDL lock should only be performed when all nodes are reachable and there isn't a big replication delay between any pair of nodes.
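
One way to check the latter before starting, sketched here under the assumption that inspecting each node's outbound lag is sufficient (pg_stat_replication only shows lag from the local node to its peers, so run it on every node):

-- Approximate replication lag, in bytes, from this node to each peer
-- (PostgreSQL 9.4 function and column names).
SELECT application_name,
       pg_xlog_location_diff(pg_current_xlog_location(), flush_location) AS lag_bytes
  FROM pg_stat_replication;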

It also means that if lock acquisition gets stuck waiting on a node that is down or unreachable, all writes stop until that node recovers or the DDL is canceled.

If the DDL lock request is canceled by the requesting node, all the other reachable nodes will release their locks. So if your system is hung up on a DDL lock request that is making no progress, you can simply cancel the statement requesting the lock and normal operation will resume.

8.1.2. Minimising the impact of DDL

To minimise the impact of DDL, transactions performing DDL should be short, should not be combined with large numbers of row changes, and should avoid long-running foreign key lookups, as illustrated below.
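
As an illustration (the table and column names are hypothetical), keep the schema change in its own short transaction and perform related bulk row changes separately, after the DDL has replicated:

-- A short, dedicated DDL transaction: the global DDL lock is held only briefly.
BEGIN;
ALTER TABLE my_table ADD COLUMN note text;
COMMIT;

-- Backfill in a separate transaction (or in batches) so that large numbers
-- of row changes are not mixed into the DDL transaction.
UPDATE my_table SET note = '';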

If DDL is holding the system up for too long, it is possible and safe to cancel the DDL on the originating node as you would cancel any other statement, e.g. with Control-C in psql or with pg_cancel_backend().

Once the DDL operation has committed on the originating node, it can no longer be canceled or aborted. You must wait for it to apply successfully on all other nodes and for their confirmations to be replayed back to the originator. This is why it is important to keep DDL transactions short and fast.

Because DDL is disruptive in BDR, it is possible to configure the system so that, by default, transactions cannot run DDL that requires the heavy global lock. This is controlled by the bdr.permit_ddl_locking setting. If it is set to false, any command that would acquire the global DDL lock is instead rejected with an ERROR. This helps prevent unintended global DDL lock acquisitions. You can make this the default for a database, user or group with

 ALTER ROLE username SET bdr.permit_ddl_locking = false;

or

 ALTER DATABASE dbname SET bdr.permit_ddl_locking = false;

or set it globally in postgresql.conf. Then when you intend to perform disruptive DDL, explicitly permit it:

BEGIN;
SET LOCAL bdr.permit_ddl_locking = true;
-- Do your schema changes here
COMMIT;