# Conditional Memory via Scalable Lookup (Engram)

**Paper:** "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models"
**Authors:** Xin Cheng, Wangding Zeng, Damai Dai, et al. (Peking University & DeepSeek-AI)
**arXiv:** [2601.07372](https://arxiv.org/abs/2601.07372)

---

## Core Idea

Transformers lack a native "knowledge lookup" primitive — they waste early layers reconstructing static patterns (named entities, idioms, formulaic phrases) through expensive computation that could be done via a simple table lookup. The paper introduces **conditional memory** as a new sparsity axis complementary to MoE (conditional computation). The concrete instantiation is **Engram**: a module that modernizes classic N-gram embeddings for O(1) lookup, injected into specific transformer layers.

## How Engram Works

1. **Tokenizer Compression:** A surjective mapping collapses semantically equivalent tokens (e.g., `Apple` / ` apple`) into canonical IDs, reducing effective vocab size by ~23%.

2. **Multi-Head Hashing:** For each position, extract suffix N-grams (typically 2-grams and 3-grams). Each N-gram is hashed via K independent hash heads into large embedding tables of prime size. All retrieved embeddings are concatenated into a single memory vector `e_t`.

3. **Context-Aware Gating:** The current hidden state `h_t` acts as a Query; the retrieved memory `e_t` provides Key and Value via learned projections. A sigmoid gate `alpha_t` controls how much memory is injected — if context contradicts the retrieved embedding (hash collision, polysemy), the gate suppresses it.

4. **Depthwise Causal Convolution:** A short (kernel=4) causal conv1d with SiLU expands the receptive field after gating.

5. **Residual injection:** The module's output is added to the hidden state through a residual connection, after which the layer's standard Attention + FFN run as usual.
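
Steps 1 and 2 above can be illustrated with a minimal, stdlib-only sketch. It shows how canonicalized suffix N-grams might be turned into table indices by several hash heads. The table sizes, multipliers, the `canon_map` dictionary, and all function names are assumptions made for illustration; the paper's actual hash constants and table layout are not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed prime table sizes, one per hash head (65537, 2**17 - 1, 2**19 - 1).
TABLE_SIZES = [65537, 131071, 524287]
# Assumed odd multipliers, one per hash head (illustrative constants only).
MULTIPLIERS = [0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35]
MASK = (1 << 64) - 1  # keep the accumulator within 64 bits


def canonical_id(token_id: int, canon_map: Dict[int, int]) -> int:
    """Tokenizer compression: map a raw token ID to its canonical ID."""
    return canon_map.get(token_id, token_id)


def hash_ngram(ngram: Sequence[int], head: int) -> int:
    """Multiplicative-XOR hash of one N-gram into the table owned by `head`."""
    acc = 0
    for tok in ngram:
        acc = ((acc ^ tok) * MULTIPLIERS[head]) & MASK
    return acc % TABLE_SIZES[head]


def engram_indices(
    token_ids: Sequence[int],
    t: int,
    canon_map: Dict[int, int],
    orders: Tuple[int, ...] = (2, 3),
) -> List[Tuple[int, int, int]]:
    """Return (order, head, index) for every suffix N-gram ending at position t."""
    ids = [canonical_id(tok, canon_map) for tok in token_ids[: t + 1]]
    out = []
    for n in orders:
        if len(ids) < n:
            continue  # not enough left context for this N-gram order
        ngram = ids[-n:]
        for head in range(len(TABLE_SIZES)):
            out.append((n, head, hash_ngram(ngram, head)))
    return out
```

Because the indices depend only on token IDs, they can be computed before any layer runs. In a full module, each index would select one row of an embedding table, and the selected rows would be concatenated into `e_t`.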

## Key Findings

### Sparsity Allocation (U-Shaped Scaling Law)
Given a fixed total parameter budget split between MoE experts and Engram memory, with rho denoting the fraction allocated to MoE experts:
- Pure MoE (rho=100%) is suboptimal
- Pure Engram (rho=0%) is also suboptimal (memory can't replace computation)
- **Sweet spot: ~75-80% MoE, ~20-25% Engram** — stable across model scales
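
As a worked example of the allocation arithmetic, the sketch below splits a budget according to rho, read here as the fraction of the budget given to MoE experts (consistent with rho=100% meaning pure MoE). The function name and the 20B budget are hypothetical and are not taken from the paper.

```python
from typing import Tuple


def split_sparse_budget(total_params: int, rho: float) -> Tuple[int, int]:
    """Split a parameter budget into (moe_params, engram_params).

    rho is the fraction given to MoE experts; the remainder goes to Engram.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    moe_params = round(total_params * rho)
    return moe_params, total_params - moe_params


# Hypothetical 20B budget at rho = 0.75: 15B for experts, 5B for Engram tables.
moe, engram = split_sparse_budget(20_000_000_000, 0.75)
```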

### Large-Scale Results (Engram-27B vs MoE-27B, iso-param & iso-FLOPs)
- Knowledge tasks: MMLU +3.0, CMMLU +4.0 (expected)
- Reasoning: BBH +5.0, ARC-Challenge +3.7, DROP +3.3 (surprising)
- Code/Math: HumanEval +3.0, MATH +2.4, GSM8K +2.2 (surprising)

### Why It Helps Reasoning (Not Just Knowledge)
- **Effective depth increase:** Engram relieves early layers of reconstructing static patterns, so that shallow layers of the Engram model behave like deeper layers of the MoE-only baseline (CKA analysis shows layer 5 of the Engram model aligns with roughly layer 12 of the baseline)
- **Frees attention capacity:** By offloading local dependencies to lookups, attention heads can focus on global context — Multi-Query NIAH jumps from 84.2 to 97.0

### Inference Efficiency
- Deterministic hash-based addressing (not dynamic routing) enables **prefetching from host memory** — 100B-parameter Engram table offloaded to CPU RAM with <3% throughput overhead
- Zipfian distribution of N-grams naturally supports multi-level caching (HBM > DRAM > SSD)
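
To see why a Zipfian access pattern favors tiered caching, the following sketch computes what share of lookups a small, fast tier would serve if it held only the most frequent entries. It assumes an idealized Zipf law with exponent 1 and a static cache of the top-ranked entries; real N-gram frequencies and the paper's caching policy may differ.

```python
def zipf_coverage(num_entries: int, cached: int, exponent: float = 1.0) -> float:
    """Fraction of lookups served by caching the `cached` most frequent entries.

    Assumes the entry of rank r is accessed with probability
    proportional to 1 / r**exponent (an idealized Zipf law).
    """
    weights = [1.0 / (r ** exponent) for r in range(1, num_entries + 1)]
    return sum(weights[:cached]) / sum(weights)


# Caching the top 1% of a 10,000-entry table covers roughly 53% of lookups
# under this idealized distribution.
top_one_percent = zipf_coverage(10_000, 100)
```

Under these assumptions, a fast tier holding a small fraction of the table absorbs a disproportionate share of traffic, which is the property a multi-level cache relies on.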

### Placement Matters
- **Layer 2 is optimal** for single-injection (one round of attention gives enough context for gating, but early enough to save depth)
- Splitting memory across layers 2 and 6 (or 2 and 15 at larger scale) is even better

---

## Relevance to nanochat

### 1. Direct Architectural Opportunity: Engram Module

nanochat's `NanoChatModel` is a vanilla GPT-2 decoder with standard `Embedding + [Block(Attn + MLP)] + LMHead`. If the paper's key insight carries over to this scale, **nanochat spends part of its shallow 4-12 layers reconstructing local patterns that could be looked up**.

A minimal Engram module for nanochat would involve:
- A new `EngramModule` class with:
  - Hash function(s) mapping bigram/trigram token IDs to embedding table indices
  - A learnable embedding table (`nn.Embedding` with prime-sized vocab)
  - A gating mechanism: `sigmoid(RMSNorm(h) . RMSNorm(W_k @ e) / sqrt(d))` to produce scalar gate
  - Value projection: `W_v @ e`, gated and added as residual
- Injection point: after `Block` index 1 (i.e., layer 2 in 1-indexed) — the paper shows one attention round is enough context for effective gating
- In `NanoChatModel.forward()`, after the second block's output, apply `x = x + engram(x, token_ids)`
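
The gating and value-projection bullets can be sketched for a single position in plain Python. This is a simplified illustration, not nanochat or paper code: the RMSNorm here has no learnable scale, the causal convolution is omitted, and the names `engram_gate` and `engram_residual` are hypothetical. A real module would use batched tensor operations.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm without a learnable scale (a simplification)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def engram_gate(h: Vector, e: Vector, w_k: Matrix) -> float:
    """Scalar gate: sigmoid(RMSNorm(h) . RMSNorm(W_k @ e) / sqrt(d))."""
    q = rms_norm(h)
    k = rms_norm(matvec(w_k, e))
    score = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(h))
    return 1.0 / (1.0 + math.exp(-score))


def engram_residual(h: Vector, e: Vector, w_k: Matrix, w_v: Matrix) -> Vector:
    """Return h + alpha * (W_v @ e), the gated residual update."""
    alpha = engram_gate(h, e, w_k)
    v = matvec(w_v, e)
    return [hi + alpha * vi for hi, vi in zip(h, v)]
```

When the projected memory points in the same direction as the hidden state, the gate moves toward 1; when it points the opposite way, the gate moves toward 0, which is the suppression behavior described for collisions and polysemy.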

This is feasible even at nanochat's `nano` scale (4 layers, 256-dim). Even a small embedding table (e.g., 50K-500K entries x 64-dim) would let the model offload common bigram/trigram patterns and focus its limited depth on harder compositional tasks.

### 2. Improving Memory Management

nanochat's `MemoryManager` (in `memory.py`) currently does pure sliding-window truncation — exactly the kind of "all tokens treated equally" limitation the docstring flags. Engram's findings suggest a concrete improvement path:

- **Selective retention via N-gram importance:** Tokens completing high-gate-activation N-grams (static entities, key phrases) could be prioritized in the KV-cache, while tokens in low-activation regions (filler, whitespace) are truncated first. This would be a step toward the "importance-based retention" listed in MemoryManager's TODOs.
- **Static pattern offloading:** If nanochat had an Engram-style lookup, the KV-cache wouldn't need to retain context for local pattern reconstruction — the information lives in the embedding table. This effectively increases the useful context window without changing `max_cache_len`.
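
A possible shape for the selective-retention idea is sketched below. It is purely hypothetical: it assumes a per-position gate score is available from an Engram-style module (which nanochat does not currently have), and it does not reflect the actual `MemoryManager` interface. The function name and the `recent_window` parameter are illustrative.

```python
from typing import List, Sequence


def select_positions_to_keep(
    gate_scores: Sequence[float],
    max_cache_len: int,
    recent_window: int,
) -> List[int]:
    """Choose which cache positions survive truncation.

    The most recent `recent_window` positions are always kept. The remaining
    budget goes to the older positions with the highest gate scores.
    """
    n = len(gate_scores)
    if n <= max_cache_len:
        return list(range(n))  # nothing needs to be evicted
    recent_window = min(recent_window, max_cache_len)
    recent = list(range(n - recent_window, n))
    budget = max_cache_len - recent_window
    # Rank older positions by gate score, breaking ties toward newer tokens.
    older = sorted(
        range(n - recent_window),
        key=lambda i: (gate_scores[i], i),
        reverse=True,
    )[:budget]
    return sorted(older) + recent
```

Keeping a recent window unconditionally preserves the behavior of plain sliding-window truncation for the newest tokens, so the gate scores only influence which older tokens are retained.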

### 3. Context Window Efficiency for Chat

nanochat's `ChatSession._build_prompt()` does oldest-turn-first truncation with `max_context_tokens=1024`. The paper shows Engram dramatically improves long-context performance by freeing attention from local dependencies. For nanochat, this means:

- An Engram module could make the model's effective context "feel" longer than 1024 tokens, because attention isn't wasted on local pattern matching within the window
- The paper reports that Engram-27B matches baseline long-context performance at 82% of the training FLOPs. That result is at 27B scale, so it is suggestive rather than conclusive for nanochat, but it hints that even small Engram tables could be high-leverage for context-constrained models

### 4. Practical Implementation Considerations

- **Hash function:** The paper uses lightweight multiplicative-XOR hashing. For nanochat this is trivial — a few lines of torch ops on token ID pairs/triples
- **Table sizing:** Use prime-sized tables to reduce hash collisions. Even at nanochat's scale, a ~100K-entry table with 64-dim embeddings takes about 25.6 MB in fp32 (100K x 64 x 4 bytes), which is negligible
- **Training:** Engram embeddings should use higher LR (5x in the paper) with Adam and no weight decay. Conv params initialized to zero (identity at start)
- **No MoE needed:** nanochat uses dense FFNs, not MoE. The paper's sparsity allocation framework is MoE-specific, but the core Engram mechanism works independently — it's a parallel residual branch, not a replacement for the FFN
- **Weight tying note:** nanochat ties `wte` and `lm_head`. Engram is a separate embedding space (keyed by N-grams, not unigrams) and doesn't interfere with this
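
The table-sizing arithmetic can be checked with a short sketch that picks the first prime at or above a target entry count and computes the resulting footprint. The helper names are hypothetical, and fp32 storage (4 bytes per parameter) is an assumption.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for table-sized integers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def table_bytes(entries: int, dim: int, bytes_per_param: int = 4) -> int:
    """Memory footprint of an entries x dim embedding table."""
    return entries * dim * bytes_per_param


# A ~100K-entry, 64-dim table stored in fp32.
entries = next_prime(100_000)
megabytes = table_bytes(entries, 64) / 1e6
```

With multiple hash heads or N-gram orders, each table adds its own footprint, so the total scales linearly with the number of tables.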

### 5. Connections to Existing TODOs

The paper directly addresses several items from nanochat's `memory.py` planned improvements:
- "Landmark/sink tokens to anchor important context": Engram's gating mechanism identifies which token positions carry static pattern completions; these are natural landmark candidates
- "Learned context compression": Engram is a form of this, since static patterns are "compressed" into O(1) lookups rather than reconstructed through multiple layers
- "Persistent memory bank across sessions": the Engram embedding table is itself a persistent, static memory bank (trained once, then reused unchanged at inference). It is a different flavor from the episodic cross-session memory in the TODO, but it supports the value of persistent parametric memory

### 6. What This Doesn't Solve

- **Cross-session episodic memory:** Engram stores statistical patterns from training data, not conversation-specific memory. nanochat's TODO about persistent memory across sessions is a different problem (more like RAG or episodic memory)
- **Dynamic context summarization:** Engram handles static patterns; the TODO about summarizing old turns into fewer tokens requires a different approach (perhaps a small summarizer network)
- **FlashAttention:** Engram is orthogonal to the attention kernel optimization discussed in `summary_flash_attention.md` — both are valuable and complementary improvements
