Tuesday, February 17, 2026
Viewstamped Replication Revisited: A Deep Dive into Distributed Consensus
Viewstamped Replication (VR) is a consensus protocol, originally developed in the 1980s, for building replicated services that tolerate crash failures. These notes walk through the "Revisited" version of the protocol, section by section.
1. Introduction
The paper introduces an updated version of Viewstamped Replication (VR), a protocol originally developed in the 1980s for handling crash failures in replicated services. VR enables a service (like a file system or lock manager) to run on multiple nodes, maintaining a consistent state accessible to clients despite individual node failures.
This "Revisited" version differs from the original in several key ways:
- Improved Protocol: It incorporates optimizations inspired by later work on Byzantine Fault Tolerance (PBFT), making it simpler and more performant.
- No Disk Requirement: The protocol achieves persistence through replication rather than mandatory disk writes during normal operations.
- Reconfiguration: It introduces a new protocol to change group membership (adding/removing nodes) and adjust the failure threshold dynamically.
- Application Independence: The protocol is presented as a standalone layer, unlike previous iterations that were tightly coupled with specific database or file system implementations.
2. Background
This section establishes the environment and assumptions for VR.
- Assumptions: VR handles crash failures (nodes stop working) but not Byzantine failures (malicious attacks). It operates in an asynchronous network where messages can be lost, delayed, or reordered, but will eventually be delivered if retried.
- Replica Groups: The system uses a group of 2f + 1 replicas to tolerate f failures. This size ensures that any "quorum" of f + 1 replicas will overlap with any other quorum in at least one node, guaranteeing data consistency.
- Architecture: Clients run user code which talks to a local VR proxy. This proxy communicates with the replicas. Replicas run the VR code, which manages the protocol and makes "up-calls" to the service code (the actual application being replicated) only when requests are committed.
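The quorum overlap argument is simple counting: two sets of size f + 1 drawn from 2f + 1 replicas sum to 2f + 2 members, so they must share at least one. A brute-force check of this property (a standalone sketch, not code from the paper):

```python
# Quorum intersection in a 2f+1 replica group: any two quorums of size
# f+1 must share at least one replica, since (f+1) + (f+1) = 2f+2 > 2f+1.
from itertools import combinations

def quorums(n_replicas, quorum_size):
    """All subsets of the replica set of the given size."""
    return list(combinations(range(n_replicas), quorum_size))

f = 2
n = 2 * f + 1          # 5 replicas tolerate f = 2 crash failures
q = f + 1              # quorum size 3

# Every pair of quorums overlaps in at least one replica.
assert all(set(a) & set(b)
           for a in quorums(n, q)
           for b in quorums(n, q))
```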
3. Overview
VR relies on a primary replica to order client requests. Other replicas act as backups.
- Normal Operation: The primary dictates the order of operations. Backups follow this order.
- View Changes: If the primary fails, the system performs a "view change" to select a new primary. Correctness requires that the new view reflects all operations committed in previous views. This is guaranteed because the new view starts from a quorum of f + 1 replicas, at least one of which must know about the latest committed request.
- Recovery: Failed nodes can recover and rejoin. To do so safely, they must retrieve the most recent state from the group to ensure they don't vote in a quorum with outdated information.
4. The VR Protocol
This section details the core algorithms.
4.1 Normal Operation
- State: Each replica maintains a configuration, a view-number (identifies the current primary), an op-number (sequence number for requests), a log (history of requests), a commit-number, and a client-table (deduplication of requests).
- Process:
- Client sends a REQUEST to the primary.
- Primary assigns an op-number, adds it to its log, and broadcasts a PREPARE message to backups.
- Backups check the sequence, add to their log, and reply with PREPAREOK.
- Once the primary receives f PREPAREOKs from backups (which, together with the primary itself, constitutes a quorum of f + 1), the operation is committed.
- The primary executes the operation, replies to the client, and informs backups of the commit (often piggybacked on the next PREPARE or via a COMMIT message).
- Backups execute the operation after learning of the commit.
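The steps above can be sketched from the primary's side. This is a deliberately minimal model assuming a fixed view, no failures, and in-order commits; the class and method names (`Primary`, `on_request`, `on_prepareok`) are illustrative, not from the paper:

```python
# Minimal sketch of VR normal operation at the primary: assign op-numbers,
# count PREPAREOKs, and commit once a quorum holds the entry.
class Primary:
    def __init__(self, f):
        self.f = f
        self.op_number = 0
        self.commit_number = 0
        self.log = []                 # history of requests, in op-number order
        self.prepare_oks = {}         # op-number -> set of acknowledging backups

    def on_request(self, request):
        """Assign the next op-number and return the PREPARE to broadcast."""
        self.op_number += 1
        self.log.append(request)
        self.prepare_oks[self.op_number] = set()
        return ("PREPARE", self.op_number, request, self.commit_number)

    def on_prepareok(self, backup_id, op_number):
        """Commit once f backups acknowledge: with the primary itself,
        that is a quorum of f + 1 replicas holding the entry."""
        self.prepare_oks[op_number].add(backup_id)
        if (len(self.prepare_oks[op_number]) >= self.f
                and op_number == self.commit_number + 1):
            self.commit_number = op_number
            return ("REPLY", op_number)   # execute the op, reply to the client
        return None

p = Primary(f=1)                          # 3 replicas: primary + 2 backups
p.on_request("set x=1")
assert p.on_prepareok(backup_id=1, op_number=1) == ("REPLY", 1)
```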
4.2 View Changes
- Trigger: Backups expect regular communication from the primary. If a timeout occurs, they initiate a view change.
- Protocol:
- A replica advances its view-number and broadcasts STARTVIEWCHANGE.
- When a replica receives f STARTVIEWCHANGE messages for its view-number from other replicas, it sends a DOVIEWCHANGE message to the new primary containing its log and state.
- The new primary waits for f + 1 DOVIEWCHANGE messages (including its own). It selects the log from the message with the highest view-number, breaking ties by the highest op-number (the most up-to-date history).
- The new primary installs this state, sets its status to normal, and broadcasts STARTVIEW.
- Backups install the new state and process any uncommitted operations in the new log.
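The new primary's log selection rule is just a lexicographic maximum over (view-number, op-number). A sketch, with an illustrative message shape of `(last_normal_view, op_number, log)`:

```python
# Sketch of log selection during a view change: among the f + 1
# DOVIEWCHANGE messages, pick the log from the most recent normal view,
# breaking ties by the highest op-number.
def select_log(doviewchange_msgs):
    """Each message: (last_normal_view, op_number, log)."""
    best = max(doviewchange_msgs, key=lambda m: (m[0], m[1]))
    return best[2]

msgs = [
    (3, 7, ["ops 1..7 from view 3"]),
    (4, 5, ["ops 1..5 from view 4"]),   # most recent view wins, even if shorter
    (3, 9, ["ops 1..9 from view 3"]),
]
assert select_log(msgs) == ["ops 1..5 from view 4"]
```

Picking the most recent view first matters: a longer log from an older view may contain operations that were never committed, while the log from the latest normal view is guaranteed to contain everything committed so far.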
4.3 Recovery
- Problem: A recovering node cannot simply rejoin because it may have "forgotten" a vote it cast before crashing, potentially violating quorum guarantees.
- Solution: The node enters a recovering state and sends a RECOVERY message containing a unique nonce. It waits for f + 1 RECOVERYRESPONSE messages that echo the nonce, including one from the current primary. This allows it to learn the current view and sync its state/log before participating in decisions again.
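The nonce ensures the node only counts responses to its current recovery attempt, not stale ones from before a crash. A sketch under those assumptions (class and field names are illustrative):

```python
# Sketch of the recovery quorum check: match RECOVERYRESPONSE messages
# against a fresh nonce and proceed only after f + 1 responses, one of
# them from the current primary.
import secrets

class RecoveringNode:
    def __init__(self, f):
        self.f = f
        self.nonce = secrets.token_hex(8)   # fresh per recovery attempt
        self.responses = {}                 # replica_id -> (is_primary, state)

    def on_response(self, replica_id, nonce, is_primary, state):
        """Return True once the node may rejoin the protocol."""
        if nonce != self.nonce:
            return False                    # stale response from an old attempt
        self.responses[replica_id] = (is_primary, state)
        have_primary = any(p for p, _ in self.responses.values())
        return len(self.responses) >= self.f + 1 and have_primary

node = RecoveringNode(f=1)
assert node.on_response(1, node.nonce, False, "log-tail") is False
assert node.on_response(2, node.nonce, True, "log-tail") is True
```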
4.4 Non-deterministic Operations
For operations that rely on local non-deterministic values (like reading a local clock), the primary must compute the value or predict it, record it in the log, and transmit it to backups so all nodes execute the exact same state transition.
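For example, an operation that records "the current time" must not let each replica read its own clock. A sketch of the pattern (function names are illustrative): the primary samples the clock once, logs the value, and every replica applies the logged value.

```python
# Sketch of a non-deterministic operation made deterministic: the primary
# resolves the non-determinism up front and records the result in the log
# entry; backups reuse the recorded value instead of reading their clocks.
import time

def primary_prepare_now_op():
    """Primary samples its clock once; the value travels in the PREPARE."""
    return {"op": "now", "value": time.time()}

def execute(entry, state):
    """Deterministic execution on every replica: use the logged value,
    never call time.time() here."""
    if entry["op"] == "now":
        state = dict(state, last_timestamp=entry["value"])
    return state

entry = primary_prepare_now_op()
s1 = execute(entry, {})      # primary applying the entry
s2 = execute(entry, {})      # a backup applying the same entry
assert s1 == s2              # identical state transition on all nodes
```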
4.5 Client Recovery
Clients track request numbers to handle dropped messages. If a client crashes, it queries the cluster to find its last executed request number and increments it to avoid reusing numbers.
5. Pragmatics
This section addresses implementation efficiency, specifically log management.
5.1 Efficient Recovery
- Sending the full log during recovery is too expensive. Instead, replicas create checkpoints of the application state (snapshots).
- The log is truncated (garbage collected) up to the checkpoint.
- Recovering nodes fetch the application state (checkpoint) and only the small tail of the log. Merkle trees can be used to optimize the transfer of large application states by only sending changed pages.
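Checkpointing and truncation fit in a few lines. A sketch (class and method names are illustrative; real implementations would persist the snapshot and use Merkle trees for large state, as noted above):

```python
# Sketch of checkpointing with log truncation: snapshot the application
# state at some op-number, garbage-collect the log prefix it covers, and
# hand recovering nodes the snapshot plus only the log tail.
class ReplicaLog:
    def __init__(self):
        self.entries = {}            # op-number -> request
        self.checkpoint = None       # (op-number, application state snapshot)

    def append(self, op_number, request):
        self.entries[op_number] = request

    def take_checkpoint(self, op_number, app_state):
        self.checkpoint = (op_number, app_state)
        # truncate: drop everything the checkpoint already covers
        self.entries = {n: r for n, r in self.entries.items() if n > op_number}

    def recovery_payload(self):
        """What a recovering node fetches: snapshot + small log tail."""
        return self.checkpoint, sorted(self.entries.items())

log = ReplicaLog()
for n in range(1, 6):
    log.append(n, "req-%d" % n)
log.take_checkpoint(3, {"x": 3})
ckpt, tail = log.recovery_payload()
assert ckpt == (3, {"x": 3})
assert tail == [(4, "req-4"), (5, "req-5")]
```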
5.2 State Transfer
If a node lags behind (without crashing), it uses GETSTATE messages to fetch missing log entries. If the gap is too large, it may fetch a checkpoint/snapshot similar to a recovering node.
5.3 View Changes
To keep view change messages small, replicas can send only a suffix of their log, rather than the whole history, assuming the new primary is likely already up to date.
6. Optimizations
Techniques to improve throughput and latency:
6.1 Witnesses
In a group of 2f+1, only f+1 nodes act as "active" replicas (running service code/storing state). The remaining f are "witnesses" that only participate in recovery or view changes. This reduces the number of nodes performing the actual work.
6.2 Batching
The primary collects multiple client requests and runs the consensus protocol once for the whole batch, reducing overhead under high load.
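The batching idea can be sketched at the primary: one consensus slot carries many requests. The flush policy here (size-based) and all names are illustrative; real systems typically also flush on a timer:

```python
# Sketch of batching at the primary: buffer incoming requests and run one
# round of the protocol for the whole batch, amortizing message overhead.
class BatchingPrimary:
    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.pending = []
        self.op_number = 0

    def on_request(self, request):
        """Buffer the request; flush a full batch as a single PREPARE."""
        self.pending.append(request)
        if len(self.pending) >= self.max_batch:
            return self.flush()
        return None

    def flush(self):
        batch, self.pending = self.pending, []
        self.op_number += 1          # one consensus slot for the whole batch
        return ("PREPARE", self.op_number, batch)

p = BatchingPrimary(max_batch=3)
assert p.on_request("a") is None
assert p.on_request("b") is None
assert p.on_request("c") == ("PREPARE", 1, ["a", "b", "c"])
```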
6.3 Fast Reads
- Reads at Primary: The primary can execute reads immediately without consulting backups if it holds "leases" ensuring it is still the primary.
- Reads at Backups: Backups can serve reads if slightly stale data is acceptable. To ensure causality (a client seeing their own writes), the client tracks their last request number, and backups wait until they have executed up to that point before replying.
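The causal-read rule at a backup reduces to one comparison: has the backup executed at least as far as the client's last write? A sketch under that assumption (names are illustrative):

```python
# Sketch of causal reads at a backup: the client sends the op-number of
# its last write, and the backup serves the read only once it has
# executed at least that far; otherwise it defers.
class Backup:
    def __init__(self):
        self.executed_up_to = 0      # highest op-number executed locally
        self.store = {}

    def apply_commit(self, op_number, key, value):
        self.store[key] = value
        self.executed_up_to = op_number

    def read(self, key, min_op_number):
        """Serve the read only if causality is satisfied; else defer."""
        if self.executed_up_to < min_op_number:
            return None              # caller retries after more commits arrive
        return self.store.get(key)

b = Backup()
b.apply_commit(5, "x", 1)
assert b.read("x", min_op_number=7) is None   # too stale for this client
b.apply_commit(7, "x", 2)
assert b.read("x", min_op_number=7) == 2      # client now sees its own write
```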
7. Reconfiguration
This describes how to change the replica group (e.g., replacing a failed node or moving to a new data center) or the failure threshold (f).
- Mechanism: Reconfiguration is treated as a special client request (RECONFIGURATION) that transitions the system to a new epoch.
- The Protocol:
- The request goes through the normal commitment process in the old group.
- Once committed, the old group stops processing client requests.
- The state is transferred to the new group (new nodes are brought up to date).
- The new group begins processing in the new epoch (starting at view 0).
- Safety: The protocol ensures that the old group has stopped processing requests before the new group starts, preventing "split-brain" scenarios. Old nodes do not shut down until they receive confirmation (EPOCHSTARTED) that the new group is active.
- Client Redirection: Clients interacting with the old group are informed of the new configuration via a NEWEPOCH message.
8. Correctness
An informal discussion of why the protocol is safe and live.
- View Changes: Safety is maintained because the new view's initial state is derived from a quorum of the previous view. The "quorum intersection property" guarantees that at least one node in the new quorum holds all previously committed operations.
- Recovery: Correctness relies on the recovering node not participating until it is synced with a current quorum.
- Reconfiguration: Because the reconfiguration is a committed operation in the old group, and the new group is initialized from that state, all prior history is preserved.
9. Conclusions
The paper concludes by summarizing the contributions: a practical, efficient, disk-less replication protocol that handles crash failures, supports dynamic group reconfiguration, and provides high performance through batching and read optimizations.