MG Format

What is the .mg binary format?

The .mg format is the on-disk serialization format that the Areev context database uses to persist every AI memory grain as a compact, optionally encrypted binary blob.

Each .mg blob starts with a 9-byte header followed by a canonical MessagePack payload. The header encodes five fields in a fixed layout: version (u8, currently 1), flags (u8, 8-bit field for signing/encryption/compression and metadata flags), grain_type (u8, mapping to one of the 10 OMS grain types), ns_hash (u16 big-endian, a hash of the namespace for fast partition routing), and created_at_sec (u32 big-endian, Unix timestamp in seconds). The header is always unencrypted, allowing the autonomous memory engine to route and filter grains without decrypting the payload.

The MessagePack payload contains the grain’s content, metadata key-value pairs, subject identifier, and any embedding vector. When encryption is enabled, the payload is encrypted with AES-256-GCM using a random per-user data encryption key (DEK) wrapped by the customer master key (CMK). The wrapped DEK is stored separately in the keys partition. The AI agent memory engine uses SHA-256 content addressing — the hash of the complete .mg blob (header + payload) serves as its storage key in Fjall, enabling deduplication and integrity verification.

How does the 9-byte header work?

The header packs five fields into exactly 9 bytes with no padding, using big-endian encoding for multi-byte fields.

The version byte (offset 0) identifies the .mg format version, allowing the context database to handle format migrations. The flags byte (offset 1) is an 8-bit field with the following layout:

Bit   Mask   Meaning
0     0x01   COSE Sign1 signing enabled
1     0x02   AES-256-GCM encryption enabled
2     0x04   zstd compression enabled
3     0x08   content_refs present in payload
4     0x10   embedding_refs present in payload
5     0x20   AI-generated content flag
6-7   0xC0   Sensitivity level (0-3)
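As an illustration of the layout above, here is a minimal Python sketch that decodes a flags byte. The masks come from the table; the names `Flags` and `decode_flags` are hypothetical, and the sketch assumes the sensitivity level is read by shifting the 0xC0 bits down by six.

```python
from dataclasses import dataclass

# Masks taken from the flags table.
FLAG_SIGNED = 0x01
FLAG_ENCRYPTED = 0x02
FLAG_COMPRESSED = 0x04
FLAG_CONTENT_REFS = 0x08
FLAG_EMBEDDING_REFS = 0x10
FLAG_AI_GENERATED = 0x20
SENSITIVITY_MASK = 0xC0
SENSITIVITY_SHIFT = 6  # assumption: bits 6-7 form a 2-bit integer


@dataclass(frozen=True)
class Flags:  # hypothetical container, not part of the format itself
    signed: bool
    encrypted: bool
    compressed: bool
    content_refs: bool
    embedding_refs: bool
    ai_generated: bool
    sensitivity: int  # 0-3


def decode_flags(byte: int) -> Flags:
    """Split one flags byte into its individual fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("flags must fit in one byte")
    return Flags(
        signed=bool(byte & FLAG_SIGNED),
        encrypted=bool(byte & FLAG_ENCRYPTED),
        compressed=bool(byte & FLAG_COMPRESSED),
        content_refs=bool(byte & FLAG_CONTENT_REFS),
        embedding_refs=bool(byte & FLAG_EMBEDDING_REFS),
        ai_generated=bool(byte & FLAG_AI_GENERATED),
        sensitivity=(byte & SENSITIVITY_MASK) >> SENSITIVITY_SHIFT,
    )


flags = decode_flags(0x82)  # binary 1000 0010
print(flags.encrypted, flags.sensitivity)  # True 2
```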

The grain_type byte (offset 2) maps to one of the 10 OMS grain types: 0x01=belief, 0x02=event, 0x03=state, 0x04=workflow, 0x05=action, 0x06=observation, 0x07=goal, 0x08=reasoning, 0x09=consensus, 0x0A=consent. The ns_hash field (offsets 3-4) stores a 16-bit hash of the namespace string, used for fast shard routing in distributed mode without parsing the full payload.
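The grain_type mapping can be expressed as an enumeration. This is a sketch for readers writing their own tooling; `GrainType` and `grain_type_name` are hypothetical names, and only the byte values are taken from the list above.

```python
from enum import IntEnum


class GrainType(IntEnum):
    """The 10 OMS grain types, keyed by their header byte."""
    BELIEF = 0x01
    EVENT = 0x02
    STATE = 0x03
    WORKFLOW = 0x04
    ACTION = 0x05
    OBSERVATION = 0x06
    GOAL = 0x07
    REASONING = 0x08
    CONSENSUS = 0x09
    CONSENT = 0x0A


def grain_type_name(byte: int) -> str:
    """Return the lowercase type name, rejecting bytes outside 0x01-0x0A."""
    try:
        return GrainType(byte).name.lower()
    except ValueError:
        raise ValueError(f"unknown grain_type byte: {byte:#04x}") from None


print(grain_type_name(0x08))  # reasoning
```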

The created_at_sec field (offsets 5-8) stores the creation timestamp as a 32-bit big-endian Unix epoch in seconds. This provides coarse time ordering at the header level — the full millisecond-precision timestamp lives in the MessagePack payload as created_at (compacted to ca). The header’s fixed 9-byte size means the autonomous memory engine can read it with a single 9-byte I/O operation, making header-only scans (for filtering by grain type or namespace hash) efficient even on spinning disks.

Offset  Size   Field            Encoding
0       1      version          u8 (currently 1)
1       1      flags            u8 bitfield (sign|encrypt|compress|refs|ai|sensitivity)
2       1      grain_type       u8 (0x01-0x0A, 10 OMS types)
3       2      ns_hash          u16 big-endian
5       4      created_at_sec   u32 big-endian
9       ...    payload          canonical MessagePack (optionally encrypted)
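The layout above maps directly onto a fixed-size struct. The following Python sketch packs and parses the 9-byte header with the standard `struct` module; `MgHeader`, `pack_header`, `unpack_header`, and `split_blob` are hypothetical helper names, and the sample field values are made up for illustration.

```python
import struct
from typing import NamedTuple

# ">" selects big-endian with no padding: u8, u8, u8, u16, u32 = 9 bytes.
HEADER = struct.Struct(">BBBHI")


class MgHeader(NamedTuple):
    version: int
    flags: int
    grain_type: int
    ns_hash: int
    created_at_sec: int


def pack_header(header: MgHeader) -> bytes:
    return HEADER.pack(*header)


def unpack_header(blob: bytes) -> MgHeader:
    """Parse the header from the first 9 bytes of a blob."""
    if len(blob) < HEADER.size:
        raise ValueError(f"need {HEADER.size} header bytes, got {len(blob)}")
    return MgHeader(*HEADER.unpack_from(blob, 0))


def split_blob(blob: bytes):
    """Return (header, payload bytes); the payload may still be encrypted."""
    return unpack_header(blob), blob[HEADER.size:]


header = MgHeader(version=1, flags=0x02, grain_type=0x02,
                  ns_hash=0xBEEF, created_at_sec=1_700_000_000)
raw = pack_header(header)
print(len(raw), raw.hex())  # 9 010202beef6553f100
```

Because the header is never encrypted, a scanner can call `unpack_header` on the first 9 bytes and filter by grain_type or ns_hash without touching the payload.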

How does content addressing work?

Every grain is keyed by the SHA-256 hash of its complete .mg blob, enabling deduplication, integrity verification, and deterministic storage addressing.

When the AI agent memory engine receives a grain for storage, it assembles the complete .mg blob (9-byte header + payload) and computes the SHA-256 hash of the entire blob. This hash becomes the grain’s primary key in Fjall. If a grain with the same content hash already exists, the context database detects the duplicate and returns the existing grain’s ID instead of creating a new entry. This content-addressed design ensures that an identical blob is stored exactly once, regardless of how many times it is written. Because the hash covers the header as well as the payload, two writes deduplicate only when their header fields, including created_at_sec, also match.
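The write path can be sketched in a few lines of Python. `GrainStore` is a hypothetical in-memory stand-in for the Fjall partition, and `make_blob` builds a blob with made-up header values; neither reflects the engine's real API. Since the hash covers header and payload together, the sketch also shows that a different created_at_sec yields a different key. Whether the engine normalizes timestamps before hashing is not stated here.

```python
import hashlib
import struct


class GrainStore:
    """Hypothetical in-memory stand-in for the Fjall grain partition."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes):
        """Store a blob under its SHA-256 key; return (key, was_created)."""
        key = hashlib.sha256(blob).digest()  # hash covers header + payload
        if key in self._blobs:
            return key, False  # duplicate: hand back the existing key
        self._blobs[key] = blob
        return key, True


def make_blob(created_at_sec: int, payload: bytes) -> bytes:
    # Illustrative header: version 1, no flags, event grain, arbitrary ns_hash.
    header = struct.pack(">BBBHI", 1, 0x00, 0x02, 0x1234, created_at_sec)
    return header + payload


store = GrainStore()
payload = b"\x81\xa7content\xa5hello"  # MessagePack for {"content": "hello"}

key1, created1 = store.put(make_blob(1_700_000_000, payload))
key2, created2 = store.put(make_blob(1_700_000_000, payload))  # same bytes
key3, created3 = store.put(make_blob(1_700_000_001, payload))  # header differs

print(created1, created2, key1 == key2)  # True False True
print(created3, key3 == key1)            # True False
```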

A Bloom filter over the superseded set gives fast, definite negative answers to supersession checks: if the filter reports that a grain is not in the superseded set, the engine skips the Fjall lookup for supersession status entirely. Only a positive answer is probabilistic, and the filter’s false positive rate is tuned to 1%. Integrity verification happens on every read: the engine recomputes the SHA-256 hash of the complete .mg blob (header + payload) and compares it to the storage key, detecting any corruption or tampering.
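The two checks in this paragraph can be sketched as follows. The `BloomFilter` class is a generic textbook construction sized for a 1% false positive rate, not the engine's actual implementation; `is_superseded` and `verify_blob` are hypothetical names, and a Python set stands in for the Fjall lookup.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter sketch; not the engine's actual implementation."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Standard sizing: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def is_superseded(key: bytes, bloom: BloomFilter, superseded: set) -> bool:
    if not bloom.might_contain(key):
        return False          # definite negative: skip the store lookup
    return key in superseded  # possible positive: confirm against the store


def verify_blob(storage_key: bytes, blob: bytes) -> bool:
    """Recompute SHA-256 over header + payload and compare to the key."""
    return hashlib.sha256(blob).digest() == storage_key


old_key = hashlib.sha256(b"old grain blob").digest()
new_key = hashlib.sha256(b"new grain blob").digest()
bloom = BloomFilter(expected_items=1000)
superseded = {old_key}
bloom.add(old_key)

print(is_superseded(old_key, bloom, superseded))  # True
print(is_superseded(new_key, bloom, superseded))  # False
print(verify_blob(new_key, b"new grain blob"))    # True
```

A Bloom filter never produces false negatives, so skipping the lookup on a negative answer is safe; a false positive only costs one extra lookup.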

# The content hash appears in API responses
curl -s http://localhost:4009/api/memories/abc123 | python3 -c "
import sys, json
grain = json.load(sys.stdin)
print(f'Content hash: {grain[\"content_hash\"]}')
print(f'Grain type: {grain[\"grain_type\"]}')
"

How does encryption integrate with the .mg format?

The flags byte in the header indicates whether the MessagePack payload is encrypted, and the per-user DEK is stored separately in the keys partition.

When the encrypt flag (bit 1, mask 0x02) is set, the context database encrypts the entire MessagePack payload using AES-256-GCM with the user’s random 256-bit DEK and a 96-bit nonce. The ciphertext and its 128-bit authentication tag replace the plaintext payload in the .mg blob. The DEK is wrapped (encrypted) by the CMK using the configured key backend (local, Vault, AWS KMS, or PKCS#11 HSM) and stored in a separate Fjall partition keyed by user ID.

This envelope encryption design means the AI memory engine can rotate the CMK without re-encrypting grains: only the DEK wrapping changes, while the grain ciphertext stays as written. Crypto-erasure (GDPR Art. 17 compliance) destroys the user’s wrapped DEK, rendering all of that user’s grain ciphertext permanently unrecoverable without touching the grain blobs themselves. The autonomous memory engine verifies the GCM authentication tag on every read, detecting any tampering or corruption in the encrypted payload.
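The envelope scheme can be illustrated with the third-party Python `cryptography` package (assumed to be installed). This is a sketch under several assumptions that the text above does not specify: the DEK is wrapped with AES key wrap (RFC 3394) as a stand-in for the configured key backend, no associated data is bound to the ciphertext, the nonce is simply returned alongside the ciphertext, and `keys_partition` is a plain dict standing in for the Fjall keys partition.

```python
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.keywrap import aes_key_unwrap, aes_key_wrap


def encrypt_payload(dek: bytes, payload: bytes):
    """AES-256-GCM with a fresh 96-bit nonce; output is ciphertext + 16-byte tag."""
    nonce = os.urandom(12)
    return nonce, AESGCM(dek).encrypt(nonce, payload, None)


def decrypt_payload(dek: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    """Raises InvalidTag if the ciphertext or tag was modified."""
    return AESGCM(dek).decrypt(nonce, ciphertext, None)


# Hypothetical key material: a local CMK and one user's random DEK.
cmk_v1 = os.urandom(32)
dek = AESGCM.generate_key(bit_length=256)
keys_partition = {"user-42": aes_key_wrap(cmk_v1, dek)}  # stand-in for Fjall

payload = b"\x81\xa7content\xa5hello"  # MessagePack for {"content": "hello"}
nonce, ciphertext = encrypt_payload(dek, payload)

# CMK rotation: only the wrapped DEK is rewritten; the grain is untouched.
cmk_v2 = os.urandom(32)
unwrapped = aes_key_unwrap(cmk_v1, keys_partition["user-42"])
keys_partition["user-42"] = aes_key_wrap(cmk_v2, unwrapped)
rotated_dek = aes_key_unwrap(cmk_v2, keys_partition["user-42"])
roundtrip = decrypt_payload(rotated_dek, nonce, ciphertext)

# Tamper detection: flipping one bit makes tag verification fail.
tampered = bytes([ciphertext[0] ^ 0x01]) + ciphertext[1:]
try:
    decrypt_payload(rotated_dek, nonce, tampered)
    tamper_detected = False
except InvalidTag:
    tamper_detected = True

# Crypto-erasure: deleting the wrapped DEK leaves the ciphertext undecryptable.
del keys_partition["user-42"]
```

After the final `del`, the ciphertext is still stored but no key in `keys_partition` can decrypt it, which is the property crypto-erasure relies on.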