Part 5: Fault Tolerance & High Availability
Killing nodes for fun and profit. How CockroachDB survives failures without waking you up at 3 AM.
The “Cockroach” name comes from the idea that it’s hard to kill. Let’s see how it handles failure.
Replication Factor
By default, every range has 3 replicas. This means you can lose 1 node and the remaining 2 of 3 replicas still form a majority (a quorum), so the range keeps accepting writes. Raise the replication factor to 5 and the cluster tolerates 2 simultaneous failures.
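The quorum arithmetic is simple enough to spell out. This is a minimal sketch of the math, not CockroachDB code; the function names are my own:

```python
# Illustrative arithmetic: how many node failures a given
# replication factor can survive. Not CockroachDB code.

def majority(replicas: int) -> int:
    """Votes needed for a quorum among `replicas` replicas."""
    return replicas // 2 + 1

def failures_tolerated(replicas: int) -> int:
    """Nodes that can be lost while a majority remains."""
    return replicas - majority(replicas)

print(failures_tolerated(3))  # 3 replicas -> survives 1 failure
print(failures_tolerated(5))  # 5 replicas -> survives 2 failures
```

Note that 4 replicas tolerate no more failures than 3 (the majority is still lost once 2 nodes die), which is why replication factors are kept odd.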
Scenario: Node Failure
Imagine you have 3 nodes: A, B, C.
- Node A dies (someone tripped over the power cord).
- The cluster notices Node A is gone.
- For the ranges where Node A was the Leaseholder, the remaining replicas (on B and C) hold an election.
- A new Leaseholder is chosen within seconds.
- The cluster continues serving traffic.
Self-Healing
If Node A stays down for more than 5 minutes (the default of the server.time_until_store_dead cluster setting), the cluster declares it dead. It then creates new replicas of the under-replicated ranges on other nodes (if any are available) to get back to the replication factor of 3.
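The repair logic amounts to: discard replicas on dead nodes, then place new replicas on live nodes that don't already hold one, until the target factor is restored. A toy sketch under those assumptions (node names and the function are hypothetical, not CockroachDB internals):

```python
# Sketch of up-replication after a node is declared dead.
# Illustrative only; node names are hypothetical.

TARGET_RF = 3

def repair(range_replicas: set[str], live_nodes: set[str]) -> set[str]:
    """Drop replicas on dead nodes, then add replicas on live nodes
    that don't already hold one, until the target RF is restored."""
    replicas = range_replicas & live_nodes
    for node in sorted(live_nodes - replicas):
        if len(replicas) >= TARGET_RF:
            break
        replicas.add(node)
    return replicas

# Range was on A, B, C; node A died; D and E are spare nodes.
print(repair({"A", "B", "C"}, live_nodes={"B", "C", "D", "E"}))
```

With no spare node available, the range simply stays under-replicated until capacity returns, which is why clusters are usually run with more nodes than the replication factor.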
Zero Downtime Upgrades
Because of this architecture, you can perform rolling upgrades with zero downtime. Stop a node, upgrade the binary, start it back up, wait for it to rejoin the cluster, then repeat with the next node. As long as only one node is down at a time, every range keeps its majority.
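A toy check makes the "one node at a time" rule concrete: at every step of a rolling restart, a 3x-replicated range never loses its majority. Illustrative only, using the node names from above:

```python
# Toy check that a rolling restart (one node down at a time) never
# costs a 3x-replicated range its majority. Illustrative only.

NODES = ["A", "B", "C"]

def has_quorum(replicas: list[str], down: set[str]) -> bool:
    """True if a strict majority of replicas is still up."""
    up = [r for r in replicas if r not in down]
    return len(up) * 2 > len(replicas)

for node in NODES:  # restart nodes one at a time
    assert has_quorum(NODES, down={node})
print("rolling restart kept quorum at every step")
```

Take two nodes down at once, though, and `has_quorum` fails; that is exactly the mistake a careless upgrade script can make.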
Conclusion
High availability isn’t an add-on; it’s the default. You sleep better at night knowing your database can handle a node failure. But what about data consistency?