Venish Joe Clarence

Simulating Asymmetric Partitions with Raft Pre-Vote

Overview

If you ask any SRE how they test high availability in their clusters, they will usually describe a symmetric partition: they pull the virtual network plug on a node, watch the remaining nodes form a quorum, and observe the isolated node scream into the void until it’s reconnected.

But in production environments, networks rarely fail so cleanly.

Instead, we face gray failures: highly localized, asymmetric network degradations where Node A can talk to Node B, Node B can talk to Node C, but Node A cannot talk to Node C.

The catalyst for this post was a recent incident in my home lab. While updating firmware and tweaking VLAN profiles across my Omada SDN switches, a subtle trunk-port tagging mismatch caused a temporary routing black hole. My Omada controller reported every switch “Online” and basic gateway pings were succeeding. Yet the distributed state machine backing my local services began behaving like it was possessed. API calls stalled, write latencies spiked, and Raft-based components thrashed through one election after another.

To understand exactly how modern consensus engines survive this flavor of infrastructure chaos, I built a zero-dependency Linux network namespace sandbox to simulate this asymmetric VLAN leak on an active etcd cluster.

The Mechanics of Asymmetric Partitions

In a classic 3-node cluster, a symmetric partition splits the network cleanly:

[Figure: symmetric partition]

If Node 3 was the leader, Node 1 and Node 2 realize heartbeats have stopped, increment their Raft term, hold an election, and select a new leader. Node 3 sits in isolation, unable to write because it cannot reach a majority. When the partition heals, Node 3 sees a higher term, steps down to a follower, and reconciles its state. This is predictable.

Now, consider an asymmetric partition (the “gray” failure):

[Figure: asymmetric partition]

Here, Node 1 and Node 3 are blocked from communicating with each other, but both maintain a healthy link to Node 2.

If Node 2 is the cluster leader, it can easily replicate state to both followers because it has independent pathways to both. But what happens if Node 3 experiences a transient blip, stops receiving heartbeats from Node 2, and decides to run for election?

In a naive implementation of Raft, Node 3 would:

  1. Increment its local term (e.g., from Term 2 to Term 3).
  2. Transition to Candidate state.
  3. Broadcast a MsgVote request.

When Node 2 receives a MsgVote with a higher term (Term 3), Raft rules dictate that Node 2 must immediately step down to a follower and accept the new term. Node 3 cannot reach Node 1, so its election hinges entirely on Node 2's vote. Even if it wins, Node 1 will never hear its heartbeats and will eventually call an election of its own.

The active leader was just deposed by a “disruptive follower” that could not reach the whole cluster. Once Node 2 steps down, the cluster’s write path stalls, and leadership can keep bouncing through repeated elections.
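The step-down rule at the heart of this failure is small enough to sketch. The function below is an illustrative model only: `on_vote_request` and its output format are names I made up for this post, not anything from etcd. It shows how a naive Raft node reacts to a vote request: any higher term forces it to adopt that term and become a follower, whether or not the sender could ever lead the cluster.

```shell
# Illustrative model of naive Raft term handling (no Pre-Vote, no Check-Quorum).
# on_vote_request ROLE CURRENT_TERM MSG_TERM
# Prints the node's resulting "role term" after seeing a MsgVote.
on_vote_request() {
    role=$1 current_term=$2 msg_term=$3
    if [ "$msg_term" -gt "$current_term" ]; then
        # A higher term always wins: adopt it and step down.
        echo "follower $msg_term"
    else
        # Stale or same-term request: state is unchanged.
        echo "$role $current_term"
    fi
}

# Node 2 is leader at term 2; Node 3 campaigns at term 3.
on_vote_request leader 2 3   # prints: follower 3
```

Note that nothing in this rule asks whether the candidate can reach a majority, which is exactly the gap the two guardrails close.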

To solve this, modern implementations of Raft rely on two critical guardrails: Check-Quorum and Pre-Vote.

The Raft Rescue Team: Pre-Vote and Check-Quorum

Check-Quorum

Introduced to prevent disrupted leaders from clinging to power, Check-Quorum forces the active leader to monitor whether it can still reach a majority of peers. The leader tracks which peers have responded recently. If it has not heard from a majority (e.g., 2 out of 3 nodes, counting itself) within an election timeout, it voluntarily steps down to follower status without waiting for a peer to challenge it.
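As a rough sketch of that decision, assuming the leader keeps a "milliseconds since last response" value per peer (the function name `check_quorum` and its arguments are hypothetical, chosen for illustration), the check boils down to counting recently heard peers against a majority:

```shell
# Simplified Check-Quorum model; not etcd's actual implementation.
# check_quorum ELECTION_TIMEOUT_MS CLUSTER_SIZE PEER_SILENCE_MS...
# Each PEER_SILENCE_MS is the time since that peer last responded.
check_quorum() {
    timeout_ms=$1 cluster_size=$2
    shift 2
    active=1                      # the leader always counts itself
    for silence_ms in "$@"; do
        if [ "$silence_ms" -lt "$timeout_ms" ]; then
            active=$((active + 1))
        fi
    done
    majority=$((cluster_size / 2 + 1))
    if [ "$active" -ge "$majority" ]; then
        echo "stay-leader"
    else
        echo "step-down"
    fi
}

check_quorum 1250 3 100 9000    # one peer fresh, one silent: stay-leader
check_quorum 1250 3 4000 9000   # both peers silent: step-down
```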

Pre-Vote

Pre-Vote directly solves the “disruptive follower” problem. Before a node is allowed to increment its term and transition to a full Candidate state, it must run a lightweight, non-disruptive Pre-Candidate phase.

[Figure: Pre-Vote flow]

During Pre-Vote:

  1. The node does not increment its term.
  2. It transitions to Pre-Candidate and sends a MsgPreVote request to its peers.
  3. Peers only vote “Yes” if:
    • The candidate’s log is at least as up-to-date as theirs.
    • They have not heard from the active leader within their election timeout window.
  4. If the Pre-Candidate collects a majority of “Yes” votes, it officially elevates to Candidate, increments the term, and starts a real election.

If a healthy leader is actively heartbeating the rest of the cluster, those healthy nodes will reject the MsgPreVote. The disruptive node is sent back to its corner without bumping the cluster’s global term or forcing the leader to step down.
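The peer-side decision can be modelled in a few lines. This is a sketch under my own assumptions: `grant_prevote` and its argument order are invented for illustration, and real implementations track more state than this. It applies the two conditions listed above: reject if a leader was heard within the election timeout, otherwise grant only if the candidate's log is at least as up-to-date.

```shell
# Illustrative Pre-Vote decision on the receiving peer.
# grant_prevote CAND_LAST_TERM CAND_LAST_INDEX MY_LAST_TERM MY_LAST_INDEX \
#               MS_SINCE_LEADER ELECTION_TIMEOUT_MS
grant_prevote() {
    cand_term=$1 cand_index=$2 my_term=$3 my_index=$4
    since_leader_ms=$5 timeout_ms=$6

    # A recently heard leader means no election is warranted.
    if [ "$since_leader_ms" -lt "$timeout_ms" ]; then
        echo "reject"
        return
    fi
    # The candidate's log must be at least as up-to-date:
    # a higher last term wins; equal terms compare by index.
    if [ "$cand_term" -gt "$my_term" ] ||
       { [ "$cand_term" -eq "$my_term" ] && [ "$cand_index" -ge "$my_index" ]; }; then
        echo "grant"
    else
        echo "reject"
    fi
}

grant_prevote 2 10 2 10 100 1250    # leader heard 100 ms ago: reject
grant_prevote 2 10 2 10 2000 1250   # leader silent, logs equal: grant
grant_prevote 2 9 2 10 2000 1250    # leader silent, candidate behind: reject
```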

Building the Sandbox

Rather than spinning up three distinct Proxmox VMs, which introduces configuration management, disk, and hypervisor routing overhead, we can build this entire testing laboratory using native Linux network namespaces (ip netns), virtual ethernet (veth) pairs, and a software bridge (br-consensus) inside a single VM.

This keeps our network entirely deterministic, allowing us to control the virtual cables directly with standard kernel firewall utilities.

Here is the simplified orchestrator script, consensus-sandbox.sh:

#!/usr/bin/env bash

set -euo pipefail

BRIDGE_NAME="br-consensus"
BRIDGE_IP="10.200.0.1/24"

declare -A NODES=(
    ["node1"]="10.200.0.11"
    ["node2"]="10.200.0.12"
    ["node3"]="10.200.0.13"
)

log() { echo -e "\e[1;32m[+]\e[0m $1"; }
err() { echo -e "\e[1;31m[-]\e[0m $1"; exit 1; }

check_deps() {
    [[ "$EUID" -eq 0 ]] || err "This script must run as root (uses ip netns)."
    command -v etcd &>/dev/null || err "etcd binary not found. Run: apt install etcd-server"
    command -v etcdctl &>/dev/null || err "etcdctl binary not found."
}

cleanup() {
    log "Cleaning up old namespaces, logs, and processes..."
    # -x matches the exact process name, so etcdctl and friends are left alone
    pkill -x etcd || true
    
    for node in "${!NODES[@]}"; do
        ip netns del "ns-$node" 2>/dev/null || true
        rm -f "/tmp/etcd-${node}.log"
    done

    ip link delete "$BRIDGE_NAME" type bridge 2>/dev/null || true
}

setup_network() {
    log "Configuring network bridge ($BRIDGE_NAME)..."
    ip link add name "$BRIDGE_NAME" type bridge
    ip addr add "$BRIDGE_IP" dev "$BRIDGE_NAME"
    ip link set dev "$BRIDGE_NAME" up

    # Enable kernel IP forwarding
    sysctl -w net.ipv4.ip_forward=1 >/dev/null

    for node in "${!NODES[@]}"; do
        local ip="${NODES[$node]}"
        log "Provisioning namespace ns-$node with IP $ip..."

        ip netns add "ns-$node"
        ip link add "veth-$node" type veth peer name "veth-ns-$node"
        
        # Connect host side to bridge
        ip link set "veth-$node" master "$BRIDGE_NAME"
        ip link set "veth-$node" up

        # Connect peer side inside namespace
        ip link set "veth-ns-$node" netns "ns-$node"
        ip netns exec "ns-$node" ip link set dev "veth-ns-$node" name eth0
        ip netns exec "ns-$node" ip addr add "$ip/24" dev eth0
        ip netns exec "ns-$node" ip link set dev eth0 up
        ip netns exec "ns-$node" ip link set dev lo up
        ip netns exec "ns-$node" ip route add default via 10.200.0.1
    done
}

start_cluster() {
    log "Launching 3-node etcd cluster inside network namespaces..."
    
    local initial_cluster="node1=http://10.200.0.11:2380,node2=http://10.200.0.12:2380,node3=http://10.200.0.13:2380"

    for node in "${!NODES[@]}"; do
        local ip="${NODES[$node]}"
        
        # Start background etcd daemon inside its network namespace
        ip netns exec "ns-$node" etcd \
            --name "$node" \
            --initial-advertise-peer-urls "http://$ip:2380" \
            --listen-peer-urls "http://$ip:2380" \
            --listen-client-urls "http://$ip:2379,http://127.0.0.1:2379" \
            --advertise-client-urls "http://$ip:2379" \
            --initial-cluster-token "etcd-token" \
            --initial-cluster "$initial_cluster" \
            --initial-cluster-state "new" \
            --heartbeat-interval 250 \
            --election-timeout 1250 \
            --experimental-initial-corrupt-check=true \
            > "/tmp/etcd-${node}.log" 2>&1 &
    done

    log "Waiting 5 seconds for cluster initialization..."
    sleep 5
    
    log "Verifying cluster connectivity:"
    ip netns exec ns-node2 etcdctl --endpoints=http://10.200.0.12:2379 member list
}

case "${1:-}" in
    setup)
        check_deps
        cleanup
        setup_network
        start_cluster
        log "Sandbox active! Logs redirected to /tmp/etcd-[node1-3].log"
        ;;
    cleanup)
        cleanup
        log "All clean."
        ;;
    *)
        echo "Usage: $0 {setup|cleanup}"
        exit 1
        ;;
esac

Save this script, make it executable, install etcd natively on your sandbox VM, and fire it up:

sudo apt update && sudo apt install -y etcd-server etcd-client
chmod +x consensus-sandbox.sh
sudo ./consensus-sandbox.sh setup

The script instantiates our three nodes, configures the network paths, spins up the daemons, and prints the active membership list:

[+] Configuring network bridge (br-consensus)...
[+] Provisioning namespace ns-node1 with IP 10.200.0.11...
[+] Provisioning namespace ns-node2 with IP 10.200.0.12...
[+] Provisioning namespace ns-node3 with IP 10.200.0.13...
[+] Launching 3-node etcd cluster inside network namespaces...
[+] Waiting 5 seconds for cluster initialization...
[+] Verifying cluster connectivity:
2d9c8d6763c8e547, started, node3, http://10.200.0.13:2380, http://10.200.0.13:2379, false
6a644955dcee6657, started, node1, http://10.200.0.11:2380, http://10.200.0.11:2379, false
e5b8e06b7b84e2a8, started, node2, http://10.200.0.12:2380, http://10.200.0.12:2379, false
[+] Sandbox active! Logs redirected to /tmp/etcd-[node1-3].log
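The etcd logs refer to members by hex ID rather than by name, so a small lookup helper is handy when reading them. The sketch below assumes the comma-separated `member list` format shown above (ID, status, name, peer URL, client URL, learner flag); `member_name` is a hypothetical helper of my own, not an etcdctl feature.

```shell
# member_name ID: read `etcdctl member list` output on stdin, print the member's name.
member_name() {
    awk -F', ' -v id="$1" '$1 == id { print $3 }'
}

# Example using the member list printed by the sandbox.
member_name e5b8e06b7b84e2a8 <<'EOF'
2d9c8d6763c8e547, started, node3, http://10.200.0.13:2380, http://10.200.0.13:2379, false
6a644955dcee6657, started, node1, http://10.200.0.11:2380, http://10.200.0.11:2379, false
e5b8e06b7b84e2a8, started, node2, http://10.200.0.12:2380, http://10.200.0.12:2379, false
EOF
```

Against the live sandbox, the same helper can be fed from `sudo ip netns exec ns-node2 etcdctl --endpoints=http://10.200.0.12:2379 member list`.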

Simulating the Omada VLAN Leak

With the cluster healthy and stable, we now inject our asymmetric gray failure. We want to cut off communication strictly between Node 1 (10.200.0.11) and Node 3 (10.200.0.13), leaving Node 2 (10.200.0.12) acting as our healthy bridge.

We execute this by running native iptables rules directly inside the target namespaces:

# Block outgoing packets from Node 1 to Node 3
sudo ip netns exec ns-node1 iptables -A OUTPUT -d 10.200.0.13 -j DROP

# Block outgoing packets from Node 3 to Node 1
sudo ip netns exec ns-node3 iptables -A OUTPUT -d 10.200.0.11 -j DROP

This creates our exact failure state. Let’s analyze the logs to see how the system reacts.
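Since you will probably inject and heal this partition several times, it may be worth wrapping the two rules in helpers. The following is a minimal sketch, not part of consensus-sandbox.sh: the function names and the `DRY_RUN` switch are my own additions, and healing simply deletes the same rules with `iptables -D`. With `DRY_RUN=1` the commands are printed instead of executed, so you can inspect them without root.

```shell
# Set DRY_RUN=1 to print the commands instead of running them (no root needed).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# partition_rule ACTION NODE PEER_IP ; ACTION is -A (inject) or -D (heal)
partition_rule() {
    run ip netns exec "ns-$2" iptables "$1" OUTPUT -d "$3" -j DROP
}

inject_partition() {
    partition_rule -A node1 10.200.0.13
    partition_rule -A node3 10.200.0.11
}

heal_partition() {
    partition_rule -D node1 10.200.0.13
    partition_rule -D node3 10.200.0.11
}

# Preview the rules without touching the network.
DRY_RUN=1 inject_partition
```

Run `inject_partition` and `heal_partition` as root (without `DRY_RUN`) to apply and remove the rules for real.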

Deconstructing the Empirical Evidence

Let’s inspect /tmp/etcd-node2.log first to see how the cluster initially elected its leader during setup:

{"level":"info","ts":"2026-06-20T00:52:04.214573Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 is starting a new election at term 1"}
{"level":"info","ts":"2026-06-20T00:52:04.214607Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 became pre-candidate at term 1"}
{"level":"info","ts":"2026-06-20T00:52:04.214666Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 [logterm: 1, index: 3] sent MsgPreVote request to 2d9c8d6763c8e547 at term 1"}
{"level":"info","ts":"2026-06-20T00:52:04.215219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 has received 2 MsgPreVoteResp votes and 0 vote rejections"}
{"level":"info","ts":"2026-06-20T00:52:04.215250Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 became candidate at term 2"}
{"level":"info","ts":"2026-06-20T00:52:04.250524Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 has received 2 MsgVoteResp votes and 0 vote rejections"}
{"level":"info","ts":"2026-06-20T00:52:04.250537Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e5b8e06b7b84e2a8 became leader at term 2"}
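The full log is noisy, so it helps to filter it down to role changes. The helper below is a sketch that assumes the "became ... at term N" message format visible in the excerpt above; `raft_transitions` is a name I picked, and the sample line is abridged from the real log.

```shell
# raft_transitions: read etcd JSON log lines on stdin, print only role changes.
raft_transitions() {
    grep -o '"msg":"[0-9a-f]* became [a-z-]* at term [0-9]*"' |
        sed -e 's/^"msg":"//' -e 's/"$//'
}

# Example with one (abridged) line from the node2 log.
printf '%s\n' '{"level":"info","logger":"raft","msg":"e5b8e06b7b84e2a8 became leader at term 2"}' |
    raft_transitions
```

On the sandbox, `raft_transitions < /tmp/etcd-node2.log` gives a compact timeline of pre-candidate, candidate, and leader transitions.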

Phase 1: The Pre-Vote Sequence

At 00:52:04.214, node2 (e5b8e06b7b84e2a8) timed out and wanted to start an election. Thanks to Pre-Vote:

  1. It transitioned to pre-candidate at Term 1.
  2. It broadcast a MsgPreVote request to its peers (node1 and node3).
  3. It received 2 MsgPreVoteResp affirmative responses.
  4. Only after confirming it could form a majority did it transition to candidate at Term 2, send the formal MsgVote requests, collect the ballots, and crown itself leader.

This confirms our local cluster is actively using the Pre-Vote extension. Now, let’s see what happened inside /tmp/etcd-node1.log after we injected the network partition at 00:52:38:

{"level":"warn","ts":"2026-06-20T00:52:38.740590Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"2d9c8d6763c8e547","rtt":"286.005µs","error":"dial tcp 10.200.0.13:2380: i/o timeout"}
{"level":"warn","ts":"2026-06-20T00:52:43.740828Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"2d9c8d6763c8e547","rtt":"286.005µs","error":"dial tcp 10.200.0.13:2380: i/o timeout"}

And similarly, inside /tmp/etcd-node3.log:

{"level":"warn","ts":"2026-06-20T00:52:38.733577Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"6a644955dcee6657","rtt":"376.667µs","error":"dial tcp 10.200.0.11:2380: i/o timeout"}
{"level":"warn","ts":"2026-06-20T00:52:43.734841Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"6a644955dcee6657","rtt":"376.667µs","error":"dial tcp 10.200.0.11:2380: i/o timeout"}
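To see at a glance which peer each node considers unreachable, the prober warnings can be tallied per remote peer ID. This sketch assumes the warning format shown in the excerpts above; `unhealthy_peers` is a hypothetical helper and the sample lines are abridged.

```shell
# unhealthy_peers: count "prober detected unhealthy status" warnings per remote peer ID.
# Output: one "PEER_ID COUNT" line per peer.
unhealthy_peers() {
    grep 'prober detected unhealthy status' |
        grep -o '"remote-peer-id":"[0-9a-f]*"' |
        cut -d'"' -f4 |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example with two (abridged) warnings like the ones in the node1 log.
printf '%s\n%s\n' \
    '{"level":"warn","msg":"prober detected unhealthy status","remote-peer-id":"2d9c8d6763c8e547"}' \
    '{"level":"warn","msg":"prober detected unhealthy status","remote-peer-id":"2d9c8d6763c8e547"}' |
    unhealthy_peers
```

If the partition is truly asymmetric, node1's log should list only node3's ID, node3's log only node1's ID, and node2's log neither.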

Analysis of the Failure Mode

This is where the magic of Raft safety features shines. Let’s trace why the cluster didn’t fall apart:

  1. Physical Isolation: node1 and node3 immediately threw i/o timeout warnings when trying to peer with each other. The TCP stream was broken.
  2. Quorum Maintenance: node2 remained the leader. Since node2 can communicate with node1 and node3, it successfully heartbeats both of them.
  3. No election storms: Why didn’t node1 or node3 try to call a new election? Even though they are isolated from each other, they are still receiving continuous, healthy heartbeats from the leader node2. Their internal election timers are constantly being reset. They have no reason to rebel.

But what if we turned Pre-Vote off? Let’s trace that scenario step-by-step to expose the “Disruptive Follower” failure mode:

If we ran this test with a legacy consensus implementation (where pre-vote is set to false), and node3 lost a heartbeat frame from the leader due to random packet jitter or switch buffer queue delays, this would happen:

  1. node3’s election timer expires.
  2. node3 immediately increments the term from 2 to 3.
  3. node3 sends a MsgVote with Term 3 to node2.
  4. node2 receives MsgVote with Term 3. Even though node2 has a healthy link to node1, Raft mandates that any node seeing a higher term must immediately step down to a follower.
  5. node2 steps down. The cluster loses its leader.
  6. node3 votes for itself and asks node2 for the second vote it needs (a majority in a 3-node cluster is 2 votes, counting the candidate's own). It cannot reach node1 at all. If node3's log is behind, node2 refuses and the election fails; if node2 grants the vote, node3 becomes a leader whose heartbeats never reach node1.
  7. Either way, node1 stops receiving leader heartbeats. It times out, increments the term to 4, and sends MsgVote.
  8. The cluster can keep cycling through elections like this, stalling writes each time, all because one node’s term increment forced the healthy leader to step down.
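The quorum arithmetic behind these steps is easy to get wrong, so here is a small sketch of it. The helper names are my own; the rule itself is standard Raft: a candidate needs floor(N/2) + 1 votes in an N-node cluster, and its own vote counts toward that total.

```shell
# majority N: votes needed to win in an N-node cluster (the self-vote counts).
majority() {
    echo $(( $1 / 2 + 1 ))
}

# election_result CLUSTER_SIZE VOTES_RECEIVED (including the self-vote)
election_result() {
    if [ "$2" -ge "$(majority "$1")" ]; then
        echo "won"
    else
        echo "lost"
    fi
}

majority 3            # prints: 2
election_result 3 1   # self-vote only: lost
election_result 3 2   # self-vote plus one peer: won
```

In a 3-node cluster, then, a candidate that can reach just one peer is one granted vote away from winning, which is why the reachable peer's decision matters so much in this topology.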

By leaving pre-vote enabled (which is the default in recent etcd releases and modern HashiCorp Consul), node3 is forced to ask its peers whether an election is warranted before incrementing the term. node2 answers, in effect, “No, I am the leader and I still have a quorum; I reject your Pre-Vote.”

No term increment occurs, no leader step-down occurs, and our cluster remains 100% online.

SRE Rules for Consensus Architecture

Simulating these edge cases in my sandbox clarified several architectural patterns we must enforce when running stateful infrastructure.

1. Never Disable Pre-Vote and Check-Quorum

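If you are operating a self-hosted Raft-based store such as etcd or Consul, ensure these protections are not disabled. (ZooKeeper and Corosync rely on different protocols, ZAB and Totem, so these Raft settings do not apply to them.) For etcd, --pre-vote has existed as a flag since v3.4 and is on by default in recent releases; if you are running legacy deployments, configure it explicitly. I also keep the startup data-corruption check enabled, although that flag is a data-integrity guard and is unrelated to Check-Quorum:

--pre-vote=true
--experimental-initial-corrupt-check=true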

2. Configure Network to Drop, Not Reject

In our sandbox, we used DROP rules. Why? Because when a switch port silently drops packets, nothing comes back to the sender, forcing the kernel to wait for a connection timeout (the i/o timeout seen in the logs).

If we used REJECT, the firewall would immediately return an ICMP Destination Unreachable packet. The consensus engine would instantly catch the rejection and could behave differently. In the real world, switch configurations, bad fiber runs, or VLAN tagging leaks result in silent drops, which are far more damaging than active rejections. Always test with DROP.
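If you want to compare both behaviours in the sandbox, a tiny rule builder keeps the two variants side by side. This is only a sketch: `block_args` is a hypothetical helper that prints iptables arguments. It relies on REJECT's default response, which is an ICMP port-unreachable message (one kind of Destination Unreachable).

```shell
# block_args MODE DEST_IP: print iptables arguments for blocking outbound traffic.
block_args() {
    case "$1" in
        drop)   echo "-A OUTPUT -d $2 -j DROP" ;;     # silent loss, sender times out
        reject) echo "-A OUTPUT -d $2 -j REJECT" ;;   # immediate ICMP error to sender
        *)      echo "usage: block_args {drop|reject} DEST_IP" >&2; return 1 ;;
    esac
}

block_args drop 10.200.0.13
block_args reject 10.200.0.13
```

The output is meant to be spliced into the namespace command, for example `sudo ip netns exec ns-node1 iptables $(block_args drop 10.200.0.13)`.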

3. Adjust Election Timeouts to Match Network Topologies

If your nodes are spread across different switches (or physical virtualization hosts), ensure your timeouts are adjusted. If your heartbeat-interval is too close to your election-timeout, standard network spikes (or STP convergence delays) will cause nodes to trigger the Pre-Vote routine unnecessarily.

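A general rule of thumb for stable local infrastructure is to keep election-timeout at several multiples of heartbeat-interval. The sandbox above uses a 5:1 ratio (--heartbeat-interval 250, --election-timeout 1250).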

This gives the system ample buffer to absorb transient packet loss without immediately trying to execute state changes.
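A quick sanity check for a pair of settings could look like the sketch below. The default minimum ratio of 5 is an assumption taken from the sandbox's own flags (250 ms heartbeat, 1250 ms election timeout), not an official recommendation, and `timeout_ratio_ok` is a name I made up; pass a third argument to use a stricter ratio.

```shell
# timeout_ratio_ok HEARTBEAT_MS ELECTION_MS [MIN_RATIO]
# MIN_RATIO defaults to 5 (assumed here; matches the sandbox's 250/1250 settings).
timeout_ratio_ok() {
    min_ratio=${3:-5}
    if [ "$2" -ge $(( $1 * min_ratio )) ]; then
        echo "ok"
    else
        echo "too-tight"
    fi
}

timeout_ratio_ok 250 1250      # prints: ok
timeout_ratio_ok 250 500       # prints: too-tight
timeout_ratio_ok 100 500 10    # stricter 10:1 ratio, prints: too-tight
```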

Tearing Down the Lab

When you are done analyzing the logs, clean up your virtual interfaces and namespaces using the script’s cleanup routine:

sudo ./consensus-sandbox.sh cleanup

This ensures your host routing table, virtual interfaces, and network namespaces are cleanly deleted.

Building and breaking namespaces locally like this is one of the most powerful tools in an engineer’s toolkit. It takes consensus mechanisms out of the textbook and directly onto your screen, proving that even when switches and trunk configurations fail, the software we write and configure can still stand strong.


