Distributed
How does Raft consensus work in Areev?
Areev uses OpenRaft 0.9 with a multi-raft architecture where each shard runs an independent Raft group, providing strong consistency for writes within a shard.
The context database partitions AI memory across shards, and each shard maintains its own Raft log, leader election, and state machine. A three-node cluster with three shards typically distributes leadership so that each node leads one shard, balancing write load across the cluster. The state machine apply step deserializes the replicated command and executes it against the local Fjall LSM engine; a write is durable only after a majority of replicas acknowledge it.
Cluster membership is managed via the chitchat gossip protocol for node liveness and metadata exchange. Nodes discover each other through seed addresses passed via --seed-nodes (with --clustered, --node-id, and --cluster-id) or DNS-based discovery. The NodeJoined, NodeLeft, LeaderElected, and LeaderLost audit events record all membership and leadership transitions. LeaderLost events provide DORA Art. 11 ICT incident evidence for regulated deployments.
Multi-Raft Architecture:
Shard 0: Raft Group 0 (leader: node-0, followers: node-1, node-2)
Shard 1: Raft Group 1 (leader: node-1, followers: node-0, node-2)
Shard 2: Raft Group 2 (leader: node-2, followers: node-0, node-1)
State machine apply:
1. Leader receives write command
2. Raft replicates to majority
3. State machine deserializes command
4. Local Areev engine executes the write
5. Audit event replicated on all nodes
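The apply step above (deserialize, then execute locally) can be sketched as follows. This is a minimal illustration, not Areev's implementation: the command names (Put/Delete) are hypothetical, and a plain dict stands in for the Fjall LSM engine.

```python
from dataclasses import dataclass

# Hypothetical replicated commands; Areev's real command set is not public.
@dataclass
class Put:
    namespace: str
    key: str
    value: bytes

@dataclass
class Delete:
    namespace: str
    key: str

class LocalEngine:
    """Stand-in for the local Fjall LSM engine: one dict per namespace."""

    def __init__(self):
        self.data = {}

    def apply(self, cmd):
        # Steps 3-4 of the pipeline above: by the time apply() runs, the
        # command has already been replicated to a majority, so executing
        # it locally cannot diverge from the other replicas.
        if isinstance(cmd, Put):
            self.data.setdefault(cmd.namespace, {})[cmd.key] = cmd.value
        elif isinstance(cmd, Delete):
            self.data.get(cmd.namespace, {}).pop(cmd.key, None)

engine = LocalEngine()
engine.apply(Put("agents", "grain-1", b"context"))
```

Because every replica applies the same committed log in the same order, the deterministic apply function is what keeps the per-shard state machines identical.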
How does sharding distribute data across nodes?
Consistent hash sharding with 150 virtual nodes per shard distributes autonomous memory grains evenly across the cluster, with two sharding modes available.
PerNamespace mode (the default) hashes the namespace to determine shard placement, so all grains in a namespace land on the same shard. This preserves namespace-level locality and keeps cross-grain queries local to one shard. PerEntity mode hashes namespace + subject together, distributing grains by entity for higher write parallelism in entity-heavy AI agent memory workloads.
Resharding between modes uses a five-phase migration: prepare, scan, transfer, verify, commit. During migration, writes are dual-routed to both old and new shard assignments until the commit phase completes. The ShardReassigned audit event records every reassignment with source shard, destination shard, and grain count.
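The dual-routing rule during a mode migration can be sketched as below. The mapping of phases to routing behavior is an assumption drawn from the description above (dual-route until the commit phase completes); `None` stands for "no migration in flight".

```python
# Phase names from the five-phase migration described above.
PHASES = ("prepare", "scan", "transfer", "verify", "commit")

def route_write(phase, old_shard, new_shard):
    """Return the shards a write must reach for one grain.

    phase is one of PHASES while a migration is in flight, or None once
    the commit phase has completed and only the new assignment is live.
    """
    if phase is None:                   # migration committed: new shard only
        return [new_shard]
    if old_shard == new_shard:          # assignment unchanged by resharding
        return [old_shard]
    return [old_shard, new_shard]       # dual-route until commit completes
```

Dual routing trades extra write amplification during the migration window for the guarantee that neither assignment ever misses a write.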
PerNamespace (default):
hash(namespace) → shard_id
All grains in a namespace land on the same shard.
PerEntity:
hash(namespace + subject) → shard_id
Grains sharded by entity (namespace + subject).
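Both routing modes can be sketched over a consistent hash ring. The 150 virtual nodes per shard come from the text; the vnode naming scheme (`shard-{i}-vnode-{n}`) and the use of SHA-256 are assumptions made for this sketch, not Areev's actual scheme.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    """Consistent hash ring with a fixed number of vnodes per shard."""

    def __init__(self, num_shards, vnodes=150):
        points = []
        for shard in range(num_shards):
            for v in range(vnodes):
                points.append((_h(f"shard-{shard}-vnode-{v}"), shard))
        points.sort()
        self.keys = [p for p, _ in points]
        self.shards = [s for _, s in points]

    def shard_for(self, mode, namespace, subject=None):
        if mode == "per_namespace":
            key = _h(namespace)                 # namespace-level locality
        else:  # "per_entity"
            key = _h(f"{namespace}:{subject}")  # entity-level parallelism
        # First ring point at or after the key, wrapping around.
        i = bisect.bisect_left(self.keys, key) % len(self.keys)
        return self.shards[i]
```

Note that in `per_namespace` mode the subject never enters the hash, which is exactly why all grains in a namespace land on one shard.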
How does distributed recall merge results across shards?
Distributed recall scatters the query to all shard leaders in parallel, gathers results with a per-shard timeout, and applies cross-shard reciprocal rank fusion (RRF) to merge rankings.
The scatter phase sends a gRPC ForwardQuery to every shard leader concurrently. Each shard executes the query against its local context database indexes (Tantivy BM25 for text, HNSW for vectors, hexastore for relations) and returns ranked SearchHit results. The gather phase collects responses and applies RRF fusion to produce a unified ranking across shards. Post-fusion filters include diversity reranking and contradiction detection.
Consent-dependent queries are automatically upgraded to Linearizable consistency to ensure the latest consent state is reflected — this auto-upgrade cannot be bypassed by clients. Three consistency levels are available: Eventual (read from any replica), BoundedStaleness (follower reads within max lag), and Linearizable (leader reads only). The DistributedRecall audit event includes shard response counts and a contradiction_coverage_incomplete flag for partial results.
distributed_recall(params) →
1. Scatter: gRPC ForwardQuery to all shard leaders
2. Gather: collect SearchHit results (timeout per shard)
3. RRF fusion: merge rankings across shards
4. Global filters: diversity reranking, contradiction detection
5. Consistency: auto-upgrade for consent-dependent queries
6. Audit: DistributedRecall event with correlation_id
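The RRF fusion step (step 3 above) is a standard formula and can be sketched directly. Each shard contributes 1 / (k + rank) per hit it returned; k = 60 is the conventional RRF constant, assumed here since the text does not state Areev's value.

```python
def rrf_fuse(shard_results, k=60):
    """Merge per-shard rankings with reciprocal rank fusion.

    shard_results is one ranked list of hit IDs per shard; the fused score
    of a hit is the sum over shards of 1 / (k + rank).
    """
    scores = {}
    for hits in shard_results:
        for rank, hit_id in enumerate(hits, start=1):
            scores[hit_id] = scores.get(hit_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; tie-break on the ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A hit that ranks highly on several shards accumulates score from each of them, so cross-shard agreement dominates any single shard's local ranking.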
How does cross-shard GDPR erasure work?
GDPR Art. 17 crypto-erasure across multiple shards uses a two-phase commit (2PC) protocol coordinated by the node that receives the erasure request.
In the prepare phase, the coordinator sends ForgetPrepare(user_id_token) to all shards. Each shard marks the user’s grains as pending-delete and emits a ForgetPrepared audit event. In the commit phase, each shard destroys the user’s DEK, deletes all grains, purges index entries, and emits a ForgetCommitted audit event recording the grain count, destroyed key fingerprint, and tombstoned hook event count. If any shard fails to prepare, the coordinator sends ForgetAbort and all shards remove the pending marks.
User IDs are never stored in the Raft log — only blind tokens are transmitted between nodes. This design ensures that the AI memory replication layer never contains personally identifiable information, even in the append-only Raft log. Stale prepares that exceed the timeout are automatically cleaned up.
2PC Forget Protocol:
Phase 1 — Prepare:
Coordinator → all shards: ForgetPrepare(user_id_token)
Each shard: mark user's grains as pending-delete
Each shard: audit ForgetPrepared event
Phase 2 — Commit:
Coordinator → all shards: ForgetCommit
Each shard: destroy DEK, delete grains, purge indexes
Each shard: audit ForgetCommitted event
On failure: ForgetAbort → shards remove pending marks
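The coordinator's decision rule in the protocol above is the classic 2PC rule: commit only on unanimous prepare. The vote strings and message shape below are illustrative assumptions, not Areev's wire format.

```python
def prepare_message(user_id_token):
    # Only the blind token crosses the wire; the raw user ID never enters
    # the Raft log or inter-node messages.
    return {"type": "ForgetPrepare", "token": user_id_token}

def decide(prepare_votes):
    """Coordinator decision: every shard must have prepared (grains marked
    pending-delete) before ForgetCommit is sent; any failure aborts."""
    if prepare_votes and all(v == "prepared" for v in prepare_votes):
        return "ForgetCommit"   # shards destroy DEKs, delete grains, purge indexes
    return "ForgetAbort"        # shards remove pending-delete marks
```

Because the destructive step (DEK destruction) happens only in the commit phase, an abort leaves every shard fully recoverable.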
How does cluster security protect inter-node communication?
All inter-node communication uses mutual TLS (mTLS) with per-node certificates containing areev://node/{node_id} SAN URIs for identity verification.
Each node in the cluster holds a unique TLS certificate issued by a shared cluster CA. The leader validates the SAN URI of any node that forwards a write, preventing unauthorized nodes from injecting data into the context database. Certificate rotation is handled out-of-band: Areev watches the certificate file for changes and reloads it without a restart.
User IDs are pseudonymized in all audit entries using HMAC-SHA256, and each node maintains its own per-node encryption keys (no shared DEKs across nodes). This ensures that compromising one node’s storage does not expose AI agent memory data on other nodes.
mTLS identity:
- Each node: unique TLS certificate
- SAN URI: areev://node/{node_id}
- Cluster CA: shared trusted root
- Write validation: leader checks forwarding node's SAN
Privacy:
- User IDs never in Raft log (blind tokens only)
- HMAC-SHA256 pseudonymization for audit entries
- Per-node encryption keys (no shared DEKs)
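The SAN-based identity check can be sketched as a parse step: extract the node ID from the areev://node/{node_id} URI (the scheme is from the text) and compare it against known cluster membership. The parser details below are an assumption for illustration; the membership lookup is omitted.

```python
def node_id_from_san(san_uri):
    """Extract the node ID from an areev://node/{node_id} SAN URI, or
    return None if the URI does not match the expected shape.

    The leader would check the returned ID against cluster membership
    before accepting a forwarded write.
    """
    prefix = "areev://node/"
    if not san_uri.startswith(prefix):
        return None
    node_id = san_uri[len(prefix):]
    # Reject empty IDs and anything with extra path segments.
    return node_id if node_id and "/" not in node_id else None
```

Binding identity to a URI SAN rather than a hostname keeps node identity stable across IP changes and rescheduling, which matters for StatefulSet-style deployments.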
Related
- Kubernetes: Kubernetes StatefulSet cluster deployment
- Configuration: Cluster configuration options
- Crypto-Erasure: Single-node erasure details
- Audit Trail: Distributed audit events