Overview
If you ask any SRE how they test high availability in their clusters, they will usually describe a symmetric partition: they pull the virtual network plug on a node, watch the remaining nodes form a quorum, and observe the isolated node scream into the void until it’s reconnected.
But in production environments, networks rarely fail so cleanly.
Instead, we face gray failures: highly localized, asymmetric network degradations where Node A can talk to Node B, Node B can talk to Node C, but Node A cannot talk to Node C.
The catalyst for this post was a recent incident in my home lab. While updating firmware and tweaking VLAN profiles across my Omada SDN switches, a subtle trunk-port tagging mismatch caused a temporary routing black hole. My Omada controller reported every switch “Online” and basic gateway pings were succeeding. Yet, the distributed state machine backing my local services began behaving like it was possessed. API calls stalled, write latencies spiked, and Raft-based components fell into seemingly endless election thrash.
To understand exactly how modern consensus engines survive this flavor of infrastructure chaos, I built a zero-dependency Linux network namespace sandbox to simulate this asymmetric VLAN leak on an active etcd cluster.
The Mechanics of Asymmetric Partitions
In a classic 3-node cluster, a symmetric partition splits the network cleanly:
If Node 3 was the leader, Node 1 and Node 2 realize heartbeats have stopped, increment their Raft term, hold an election, and select a new leader. Node 3 sits in isolation, unable to write because it cannot reach a majority. When the partition heals, Node 3 sees a higher term, steps down to a follower, and reconciles its state. This is predictable.
Now, consider an asymmetric partition (the “gray” failure):
Here, Node 1 and Node 3 are blocked from communicating with each other, but both maintain a healthy link to Node 2.
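What makes this topology nasty is that no node ever looks isolated. The toy model below is only an illustration (the `BLOCKED` list and the helper names are mine, not part of any tool): it counts how many cluster members each node can still see once the node1/node3 link is cut.

```shell
#!/usr/bin/env bash
# Toy model of the asymmetric partition: only the node1<->node3 link is cut.
NODES=(node1 node2 node3)
BLOCKED=("node1:node3")   # undirected blocked links (illustrative)

# can_reach A B: succeeds unless the A<->B link is listed in BLOCKED.
can_reach() {
  local a="$1" b="$2" pair
  [[ "$a" == "$b" ]] && return 0
  for pair in "${BLOCKED[@]}"; do
    [[ "$pair" == "$a:$b" || "$pair" == "$b:$a" ]] && return 1
  done
  return 0
}

# visible_count NODE: how many members NODE can see, itself included.
visible_count() {
  local self="$1" peer count=0
  for peer in "${NODES[@]}"; do
    can_reach "$self" "$peer" && count=$((count + 1))
  done
  echo "$count"
}

for n in "${NODES[@]}"; do
  echo "$n sees $(visible_count "$n") of ${#NODES[@]} members"
done
# node1 sees 2 of 3 members
# node2 sees 3 of 3 members
# node3 sees 2 of 3 members
```

Every node still sees a majority, so every node believes it is entitled to take part in (or win) an election. That is the difference from a symmetric split, where the minority side knows it is the minority.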
If Node 2 is the cluster leader, it can easily replicate state to both followers because it has independent pathways to both. But what happens if Node 3 experiences a transient blip, stops receiving heartbeats from Node 2, and decides to run for election?
In a naive implementation of Raft, Node 3 would:
- Increment its local term (e.g., from Term 2 to Term 3).
- Transition to Candidate state.
- Broadcast a MsgVote request.
When Node 2 receives a MsgVote with a higher term (Term 3), basic Raft rules dictate that Node 2 must immediately step down to a follower and accept the new term. Node 3 only needs its own vote plus Node 2's to reach a majority, so it may well win (Node 2 grants the vote only if Node 3's log is up to date). Either way, the damage is done.
The healthy leader was just deposed by a “disruptive follower” reacting to a transient blip. Worse, whichever of Node 1 or Node 3 ends up leading cannot heartbeat the other, so the unreachable side eventually times out, bumps the term again, and forces yet another step-down. The write path stalls during every election, and leadership can keep bouncing for as long as the partition lasts.
To solve this, modern implementations of Raft rely on two critical guardrails: Check-Quorum and Pre-Vote.
The Raft Rescue Team: Pre-Vote and Check-Quorum
Check-Quorum
Introduced to prevent isolated leaders from clinging to power, Check-Quorum forces the active leader to monitor whether it can still reach a majority of peers. The leader tracks which followers have responded recently. If it has not heard from a majority (e.g., 2 out of 3 nodes, counting itself) within an election timeout period, it voluntarily steps down to follower status without waiting for a peer to challenge it.
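The arithmetic behind that decision is small enough to sketch in a few lines of bash. This is a simplified illustration, not etcd's actual code; `quorum_size` and `leader_keeps_lease` are names I made up for the example.

```shell
#!/usr/bin/env bash
# quorum_size N: smallest majority of an N-member cluster.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# leader_keeps_lease TOTAL ACTIVE_PEERS
# ACTIVE_PEERS = followers heard from during the last election timeout.
# The leader counts itself, then compares against the quorum size.
leader_keeps_lease() {
  local total="$1" active_peers="$2"
  (( active_peers + 1 >= $(quorum_size "$total") ))
}

leader_keeps_lease 3 1 && echo "3 nodes, 1 follower responding: stay leader"
leader_keeps_lease 3 0 || echo "3 nodes, 0 followers responding: step down"
```

In our asymmetric scenario, a leader on Node 2 always passes this check (it hears from both followers), while a leader on Node 1 or Node 3 passes it too, because one responding follower is enough in a 3-node cluster.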
Pre-Vote
Pre-Vote directly solves the “disruptive follower” problem. Before a node is allowed to increment its term and transition to a full Candidate state, it must run a lightweight, non-disruptive Pre-Candidate phase.
During Pre-Vote:
- The node does not increment its term.
- It transitions to Pre-Candidate and sends a MsgPreVote request to its peers.
- Peers only vote “Yes” if:
  - The candidate’s log is at least as up-to-date as theirs.
  - They have not heard from the active leader within their election timeout window.
- If the Pre-Candidate collects a majority of “Yes” votes, it officially elevates to Candidate, increments the term, and starts a real election.
If a healthy leader is actively heartbeating the rest of the cluster, those healthy nodes will reject the MsgPreVote. The disruptive node is sent back to its corner without bumping the cluster’s global term or forcing the leader to step down.
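The two conditions a peer checks before answering “Yes” can be written as a tiny decision function. Treat this as a sketch of the rule described above rather than a real implementation: the function names and the millisecond arguments are assumptions made for the example.

```shell
#!/usr/bin/env bash
# log_up_to_date CAND_TERM CAND_INDEX MY_TERM MY_INDEX
# Raft's comparison: the higher last-log term wins; equal terms compare index.
log_up_to_date() {
  local ct="$1" ci="$2" mt="$3" mi="$4"
  (( ct > mt || (ct == mt && ci >= mi) ))
}

# grant_prevote CAND_TERM CAND_INDEX MY_TERM MY_INDEX MS_SINCE_LEADER ELECTION_TIMEOUT_MS
grant_prevote() {
  local since_leader="$5" timeout="$6"
  # Still hearing from a leader inside the timeout window: refuse.
  (( since_leader < timeout )) && return 1
  log_up_to_date "$1" "$2" "$3" "$4"
}

# A peer that heard from the leader 250 ms ago (timeout 1250 ms) refuses.
grant_prevote 2 10 2 10 250 1250 || echo "pre-vote rejected: leader still alive"
# Same logs, but no leader contact for 1500 ms: the peer agrees.
grant_prevote 2 10 2 10 1500 1250 && echo "pre-vote granted: leader looks dead"
```

Note that a rejection costs nothing: no term changes hands, so the cluster state is exactly what it was before the question was asked.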
Building the Sandbox
Rather than spinning up three distinct Proxmox VMs, which introduces configuration management, disk, and hypervisor routing overhead, we can build this entire testing laboratory using native Linux network namespaces (ip netns), virtual ethernet (veth) pairs, and a software bridge (br-consensus) inside a single VM.
This keeps our network entirely deterministic, allowing us to control the virtual cables directly with standard kernel firewall utilities.
Here is the simplified orchestrator script, consensus-sandbox.sh:
#!/usr/bin/env bash
set -euo pipefail

BRIDGE_NAME="br-consensus"
BRIDGE_IP="10.200.0.1/24"

declare -A NODES=(
  ["node1"]="10.200.0.11"
  ["node2"]="10.200.0.12"
  ["node3"]="10.200.0.13"
)

log() { echo -e "\e[1;32m[+]\e[0m $1"; }
err() { echo -e "\e[1;31m[-]\e[0m $1"; exit 1; }

check_deps() {
  [[ "$EUID" -eq 0 ]] || err "This script must run as root (uses ip netns)."
  command -v etcd &>/dev/null || err "etcd binary not found. Run: apt install etcd-server"
  command -v etcdctl &>/dev/null || err "etcdctl binary not found."
}

cleanup() {
  log "Cleaning up old namespaces, logs, and processes..."
  pkill etcd || true
  for node in "${!NODES[@]}"; do
    ip netns del "ns-$node" 2>/dev/null || true
    rm -f "/tmp/etcd-${node}.log"
  done
  ip link delete "$BRIDGE_NAME" type bridge 2>/dev/null || true
}

setup_network() {
  log "Configuring network bridge ($BRIDGE_NAME)..."
  ip link add name "$BRIDGE_NAME" type bridge
  ip addr add "$BRIDGE_IP" dev "$BRIDGE_NAME"
  ip link set dev "$BRIDGE_NAME" up
  # Enable kernel IP forwarding
  sysctl -w net.ipv4.ip_forward=1 >/dev/null
  for node in "${!NODES[@]}"; do
    ip="${NODES[$node]}"
    log "Provisioning namespace ns-$node with IP $ip..."
    ip netns add "ns-$node"
    ip link add "veth-$node" type veth peer name "veth-ns-$node"
    # Connect host side to bridge
    ip link set "veth-$node" master "$BRIDGE_NAME"
    ip link set "veth-$node" up
    # Connect peer side inside namespace
    ip link set "veth-ns-$node" netns "ns-$node"
    ip netns exec "ns-$node" ip link set dev "veth-ns-$node" name eth0
    ip netns exec "ns-$node" ip addr add "$ip/24" dev eth0
    ip netns exec "ns-$node" ip link set dev eth0 up
    ip netns exec "ns-$node" ip link set dev lo up
    ip netns exec "ns-$node" ip route add default via 10.200.0.1
  done
}

start_cluster() {
  log "Launching 3-node etcd cluster inside network namespaces..."
  local initial_cluster="node1=http://10.200.0.11:2380,node2=http://10.200.0.12:2380,node3=http://10.200.0.13:2380"
  for node in "${!NODES[@]}"; do
    local ip="${NODES[$node]}"
    # Start background etcd daemon inside its network namespace
    ip netns exec "ns-$node" etcd \
      --name "$node" \
      --initial-advertise-peer-urls "http://$ip:2380" \
      --listen-peer-urls "http://$ip:2380" \
      --listen-client-urls "http://$ip:2379,http://127.0.0.1:2379" \
      --advertise-client-urls "http://$ip:2379" \
      --initial-cluster-token "etcd-token" \
      --initial-cluster "$initial_cluster" \
      --initial-cluster-state "new" \
      --heartbeat-interval 250 \
      --election-timeout 1250 \
      --experimental-initial-corrupt-check=true \
      > "/tmp/etcd-${node}.log" 2>&1 &
  done
  log "Waiting 5 seconds for cluster initialization..."
  sleep 5
  log "Verifying cluster connectivity:"
  ip netns exec ns-node2 etcdctl --endpoints=http://10.200.0.12:2379 member list
}

case "${1:-}" in
  setup)
    check_deps
    cleanup
    setup_network
    start_cluster
    log "Sandbox active! Logs redirected to /tmp/etcd-[node1-3].log"
    ;;
  cleanup)
    cleanup
    log "All clean."
    ;;
  *)
    echo "Usage: $0 {setup|cleanup}"
    exit 1
    ;;
esac
Save this script, make it executable, install etcd natively on your sandbox VM, and fire it up:
sudo apt update && sudo apt install -y etcd-server etcd-client
chmod +x consensus-sandbox.sh
sudo ./consensus-sandbox.sh setup
The script instantiates our three nodes, configures the network paths, spins up the daemons, and prints the active membership list:
[+] Configuring network bridge (br-consensus)...
[+] Provisioning namespace ns-node1 with IP 10.200.0.11...
[+] Provisioning namespace ns-node2 with IP 10.200.0.12...
[+] Provisioning namespace ns-node3 with IP 10.200.0.13...
[+] Launching 3-node etcd cluster inside network namespaces...
[+] Waiting 5 seconds for cluster initialization...
[+] Verifying cluster connectivity:
2d9c8d6763c8e547, started, node3, http://10.200.0.13:2380, http://10.200.0.13:2379, false
6a644955dcee6657, started, node1, http://10.200.0.11:2380, http://10.200.0.11:2379, false
e5b8e06b7b84e2a8, started, node2, http://10.200.0.12:2380, http://10.200.0.12:2379, false
[+] Sandbox active! Logs redirected to /tmp/etcd-[node1-3].log
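Before breaking anything, it helps to confirm that every namespace can actually reach every other one. The helper below is a hypothetical addition (it is not part of consensus-sandbox.sh): it only prints one ping command per ordered node pair, so you can review the list before running it as root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one ping command per ordered node pair.
# Names and IPs mirror the sandbox script; indexed arrays keep the order stable.
NODE_NAMES=(node1 node2 node3)
NODE_IPS=(10.200.0.11 10.200.0.12 10.200.0.13)

ping_matrix_cmds() {
  local i j
  for i in "${!NODE_NAMES[@]}"; do
    for j in "${!NODE_NAMES[@]}"; do
      (( i == j )) && continue
      # -c 1: single probe, -W 1: wait at most one second for the reply
      echo "ip netns exec ns-${NODE_NAMES[$i]} ping -c 1 -W 1 ${NODE_IPS[$j]}"
    done
  done
}

ping_matrix_cmds
```

To execute the probes instead of just listing them, something like `ping_matrix_cmds | sudo bash -x` should work. On a healthy sandbox all six should succeed.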
Simulating the Omada VLAN Leak
With the cluster healthy and stable, we now inject our asymmetric gray failure. We want to cut off communication strictly between Node 1 (10.200.0.11) and Node 3 (10.200.0.13), leaving Node 2 (10.200.0.12) acting as our healthy bridge.
We execute this by running native iptables rules directly inside the target namespaces:
# Block outgoing packets from Node 1 to Node 3
sudo ip netns exec ns-node1 iptables -A OUTPUT -d 10.200.0.13 -j DROP
# Block outgoing packets from Node 3 to Node 1
sudo ip netns exec ns-node3 iptables -A OUTPUT -d 10.200.0.11 -j DROP
This creates our exact failure state. Let’s analyze the logs to see how the system reacts.
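Since you will probably inject and heal this partition several times, a small generator saves retyping. This is a sketch under my own naming (`partition_cmds` is not part of the sandbox script): it prints the same two iptables rules as above for "add", and the matching `-D` deletions for "del".

```shell
#!/usr/bin/env bash
# partition_cmds ACTION NODE_A IP_A NODE_B IP_B
# ACTION is "add" (inject the partition) or "del" (heal it).
# The function only prints commands; nothing is executed here.
partition_cmds() {
  local action="$1" a="$2" ip_a="$3" b="$4" ip_b="$5" flag
  case "$action" in
    add) flag="-A" ;;
    del) flag="-D" ;;
    *)   echo "usage: partition_cmds {add|del} NODE_A IP_A NODE_B IP_B" >&2
         return 1 ;;
  esac
  echo "ip netns exec ns-$a iptables $flag OUTPUT -d $ip_b -j DROP"
  echo "ip netns exec ns-$b iptables $flag OUTPUT -d $ip_a -j DROP"
}

# Print the commands that would heal the node1/node3 partition.
partition_cmds del node1 10.200.0.11 node3 10.200.0.13
```

Piping the output into `sudo bash` applies the rules. Healing with `del` lets you watch the prober warnings stop without tearing down the whole lab.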
Deconstructing the Empirical Evidence
Let’s inspect /tmp/etcd-node2.log first to see how the cluster initially elected its leader during setup:
{"level":"info","ts":"2026-06-20T00:52:04.214573Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 is starting a new election at term 1"}
{"level":"info","ts":"2026-06-20T00:52:04.214607Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 became pre-candidate at term 1"}
{"level":"info","ts":"2026-06-20T00:52:04.214666Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 [logterm: 1, index: 3] sent MsgPreVote request to 2d9c8d6763c8e547 at term 1"}
{"level":"info","ts":"2026-06-20T00:52:04.215219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 has received 2 MsgPreVoteResp votes and 0 vote rejections"}
{"level":"info","ts":"2026-06-20T00:52:04.215250Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 became candidate at term 2"}
{"level":"info","ts":"2026-06-20T00:52:04.250524Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 has received 2 MsgVoteResp votes and 0 vote rejections"}
{"level":"info","ts":"2026-06-20T00:52:04.250537Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 became leader at term 2"}
Phase 1: The Pre-Vote Sequence
At 00:52:04.214, node2 (e5b8e06b7b84e2a8) timed out and wanted to start an election. Thanks to Pre-Vote:
- It transitioned to pre-candidate at Term 1.
- It broadcast a MsgPreVote request to its peers (node1 and node3).
- It received 2 MsgPreVoteResp affirmative responses.
- Only after confirming it could form a majority did it transition to candidate at Term 2, send the formal MsgVote requests, collect the ballots, and crown itself leader.
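Reading raw JSON logs gets tiring quickly, so here is a rough filter that reduces a log file to its Raft state transitions. It assumes the log lines look like the excerpt above (a "ts" field followed later by a "msg" of the form "<id> became <state> at term <n>"); `raft_transitions` is a name I picked for this sketch.

```shell
#!/usr/bin/env bash
# raft_transitions LOGFILE
# Prints "<timestamp> <member-id> <new-state> <term>" for every state change.
raft_transitions() {
  local logfile="$1"
  # Keep only lines whose msg reports a state change, then pull out the fields.
  grep -E '"msg":"[0-9a-f]+ became [a-z-]+ at term [0-9]+"' "$logfile" \
    | sed -E 's/.*"ts":"([^"]+)".*"msg":"([0-9a-f]+) became ([a-z-]+) at term ([0-9]+)".*/\1 \2 \3 \4/'
}

# Summarize whichever sandbox logs exist; silently skip if there are none.
for f in /tmp/etcd-node*.log; do
  [[ -r "$f" ]] || continue
  echo "== $f"
  raft_transitions "$f"
done
```

Run against the node2 log excerpt above, this would print one line each for the pre-candidate, candidate, and leader transitions. An election storm shows up immediately as a long run of lines with climbing term numbers.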
This confirms our local cluster is actively using the Pre-Vote extension. Now, let’s see what happened inside /tmp/etcd-node1.log after we injected the network partition at 00:52:38:
{"level":"warn","ts":"2026-06-20T00:52:38.740590Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"2d9c8d6763c8e547","rtt":"286.005µs","error":"dial tcp 10.200.0.13:2380: i/o timeout"}
{"level":"warn","ts":"2026-06-20T00:52:43.740828Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"2d9c8d6763c8e547","rtt":"286.005µs","error":"dial tcp 10.200.0.13:2380: i/o timeout"}
And similarly, inside /tmp/etcd-node3.log:
{"level":"warn","ts":"2026-06-20T00:52:38.733577Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"6a644955dcee6657","rtt":"376.667µs","error":"dial tcp 10.200.0.11:2380: i/o timeout"}
{"level":"warn","ts":"2026-06-20T00:52:43.734841Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"6a644955dcee6657","rtt":"376.667µs","error":"dial tcp 10.200.0.11:2380: i/o timeout"}
Analysis of the Failure Mode
This is where the magic of Raft safety features shines. Let’s trace why the cluster didn’t fall apart:
- Physical Isolation: node1 and node3 immediately threw i/o timeout warnings when trying to peer with each other. The TCP stream was broken.
- Quorum Maintenance: node2 remained the leader. Since node2 can communicate with node1 and node3, it successfully heartbeats both of them.
- No election storms: Why didn’t node1 or node3 try to call a new election? Even though they are isolated from each other, they are still receiving continuous, healthy heartbeats from the leader node2. Their internal election timers are constantly being reset. They have no reason to rebel.
But what if we turned Pre-Vote off? Let’s trace that scenario step-by-step to expose the “Disruptive Follower” failure mode:
If we ran this test with a legacy consensus implementation (where pre-vote is set to false), and node3 lost a heartbeat frame from the leader due to random packet jitter or switch buffer queue delays, this would happen:
1. node3’s election timer expires.
2. node3 immediately increments the term from 2 to 3.
3. node3 sends a MsgVote with Term 3 to node2.
4. node2 receives MsgVote with Term 3. Even though node2 has a healthy link to node1, Raft mandates that any node seeing a higher term must immediately step down to a follower.
5. node2 steps down. The cluster loses its leader.
6. node3 tries to gather votes, but it cannot reach node1. Its own vote plus node2’s is still a majority (2 out of 3), so if its log is current it wins and becomes leader at Term 3. It still cannot heartbeat node1.
7. Meanwhile, node1 realizes nobody is sending it leader heartbeats anymore. It times out, increments the term to 4, and sends MsgVote. node2 sees the higher term and passes it along, and node3 is deposed in turn.
8. The cluster enters a loop of repeated elections, with leadership bouncing between nodes that cannot see each other and writes stalling each time, all because one node’s term increment forced the healthy leader to step down.
By leaving pre-vote enabled (which, as far as I can tell, is the default in etcd v3.5+ and modern HashiCorp Consul), node3 is forced to ask node2 if an election is warranted before incrementing the term. node2 is itself the active leader and still has quorum, so it declines (or simply ignores) the Pre-Vote: “No, the cluster already has a healthy leader.”
No term increment occurs, no leader step-down occurs, and our cluster remains 100% online.
SRE Rules for Consensus Architecture
Simulating these edge cases in my sandbox clarified several architectural patterns we must enforce when running stateful infrastructure.
1. Never Disable Pre-Vote and Check-Quorum
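If you are operating a self-hosted Raft-based store such as etcd or Consul, ensure these protections are not disabled. (ZooKeeper and Corosync tackle the same problem with their own protocols, ZAB and Totem, so these specific flags do not apply there.) For etcd, the --pre-vote flag has existed since v3.4 and, as far as I can tell, defaults to true from v3.5 onward. If you are running legacy deployments, explicitly configure:
--pre-vote=true
Check-Quorum has no user-facing etcd flag that I know of; the server enables it internally. The --experimental-initial-corrupt-check=true flag in the sandbox script is unrelated to elections: it checks data consistency against peers at startup.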
2. Configure Network to Drop, Not Reject
In our sandbox, we used DROP rules. Why? Because when a switch port drops packets, the TCP segment is silently lost, forcing the kernel to wait for a connection timeout (i/o timeout).
If we used REJECT, the firewall would immediately return an ICMP Destination Unreachable packet. The consensus engine would instantly catch the rejection and could behave differently. In the real world, switch configurations, bad fiber runs, or VLAN tagging leaks result in silent drops, which are far more damaging than active rejections. Always test with DROP.
3. Adjust Election Timeouts to Match Network Topologies
If your nodes are spread across different switches (or physical virtualization hosts), ensure your timeouts are adjusted. If your heartbeat-interval is too close to your election-timeout, standard network spikes (or STP convergence delays) will cause nodes to trigger the Pre-Vote routine unnecessarily.
A general rule of thumb for stable local infrastructure:
- Heartbeat Interval: 250ms
- Election Timeout: 1250ms (at least 5x the heartbeat interval, which to my knowledge is also the minimum ratio etcd accepts)
This gives the system ample buffer to absorb transient packet loss without immediately trying to execute state changes.
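If you template these values across several clusters, a quick sanity check catches a too-tight ratio before the daemon does. The helpers below are illustrative (the names are mine), and the "missed heartbeats" figure is only a rough lower bound: it ignores the randomization that Raft implementations add to election timers.

```shell
#!/usr/bin/env bash
# timeouts_ok HEARTBEAT_MS ELECTION_MS
# Succeeds when the election timeout is at least 5x the heartbeat interval.
timeouts_ok() {
  local hb="$1" el="$2"
  (( el >= 5 * hb ))
}

# missed_heartbeats_tolerated HEARTBEAT_MS ELECTION_MS
# Rough count of consecutive lost heartbeats a follower absorbs before
# its election timer can fire (integer division, no timer jitter).
missed_heartbeats_tolerated() {
  local hb="$1" el="$2"
  echo $(( el / hb - 1 ))
}

timeouts_ok 250 1250 && echo "250/1250: ok, tolerates ~$(missed_heartbeats_tolerated 250 1250) missed heartbeats"
timeouts_ok 250 1000 || echo "250/1000: election timeout too tight"
```

For nodes separated by slower or busier links, the same check still applies; you simply scale both numbers up together.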
Tearing Down the Lab
When you are done analyzing the logs, clean up your virtual interfaces and namespaces using the script’s cleanup routine:
sudo ./consensus-sandbox.sh cleanup
This ensures your host routing table, virtual interfaces, and network namespaces are cleanly deleted.
Building and breaking namespaces locally like this is one of the most powerful tools in an engineer’s toolkit. It takes consensus mechanisms out of the textbook and directly onto your screen, proving that even when switches and trunk configurations fail, the software we write and configure can still stand strong.