Part 5: Fault Tolerance & High Availability
Killing nodes for fun and profit. How CockroachDB survives failures without waking you up at 3 AM.
The “Cockroach” name comes from the idea that it’s hard to kill. Let’s see how it handles failure.
Replication Factor
By default, every range has 3 replicas. This means you can lose 1 node and the remaining 2 of 3 replicas still form a majority (a quorum), so the range keeps accepting writes. Raise the replication factor to 5 and the cluster tolerates 2 simultaneous failures.
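The quorum arithmetic is simple enough to spell out. This is a minimal sketch of the math, not CockroachDB code; the function names are my own:

```python
# Illustrative arithmetic: how many node failures a given
# replication factor can survive. Not CockroachDB code.

def majority(replicas: int) -> int:
    """Votes needed for a quorum among `replicas` replicas."""
    return replicas // 2 + 1

def failures_tolerated(replicas: int) -> int:
    """Nodes that can be lost while a majority remains."""
    return replicas - majority(replicas)

print(failures_tolerated(3))  # 3 replicas -> survives 1 failure
print(failures_tolerated(5))  # 5 replicas -> survives 2 failures
```

Note that 4 replicas tolerate no more failures than 3 (the majority is still lost once 2 nodes die), which is why replication factors are kept odd.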
Scenario: Node Failure
Imagine you have 3 nodes: A, B, C.
- Node A dies (someone tripped over the power cord).
- The cluster notices Node A is gone.
- For the ranges where Node A was the Leaseholder, the remaining replicas (on B and C) hold an election.
- A new Leaseholder is chosen within seconds.
- The cluster continues serving traffic.
Self-Healing
If Node A stays down for more than 5 minutes (the default of the server.time_until_store_dead cluster setting), the cluster declares it dead. It then creates new replicas of the under-replicated ranges on other nodes (if any are available) to get back to the replication factor of 3.
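The repair logic amounts to: discard replicas on dead nodes, then place new replicas on live nodes that don't already hold one, until the target factor is restored. A toy sketch under those assumptions (node names and the function are hypothetical, not CockroachDB internals):

```python
# Sketch of up-replication after a node is declared dead.
# Illustrative only; node names are hypothetical.

TARGET_RF = 3

def repair(range_replicas: set[str], live_nodes: set[str]) -> set[str]:
    """Drop replicas on dead nodes, then add replicas on live nodes
    that don't already hold one, until the target RF is restored."""
    replicas = range_replicas & live_nodes
    for node in sorted(live_nodes - replicas):
        if len(replicas) >= TARGET_RF:
            break
        replicas.add(node)
    return replicas

# Range was on A, B, C; node A died; D and E are spare nodes.
print(repair({"A", "B", "C"}, live_nodes={"B", "C", "D", "E"}))
```

With no spare node available, the range simply stays under-replicated until capacity returns, which is why clusters are usually run with more nodes than the replication factor.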
Zero Downtime Upgrades
Because of this architecture, you can perform rolling upgrades with zero downtime. Stop a node, upgrade the binary, start it back up, wait for it to rejoin the cluster, then repeat with the next node. As long as only one node is down at a time, every range keeps its majority.
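A toy check makes the "one node at a time" rule concrete: at every step of a rolling restart, a 3x-replicated range never loses its majority. Illustrative only, using the node names from above:

```python
# Toy check that a rolling restart (one node down at a time) never
# costs a 3x-replicated range its majority. Illustrative only.

NODES = ["A", "B", "C"]

def has_quorum(replicas: list[str], down: set[str]) -> bool:
    """True if a strict majority of replicas is still up."""
    up = [r for r in replicas if r not in down]
    return len(up) * 2 > len(replicas)

for node in NODES:  # restart nodes one at a time
    assert has_quorum(NODES, down={node})
print("rolling restart kept quorum at every step")
```

Take two nodes down at once, though, and `has_quorum` fails; that is exactly the mistake a careless upgrade script can make.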
Conclusion
High availability isn’t an add-on; it’s the default. You sleep better at night knowing your database can handle a node failure. But what about data consistency?