Samantha LLM — Persistent Memory for AI Assistants

Documentation and website for samantha-llm


Architecture

This page covers how Samantha LLM works under the hood. You don’t need any of this to use it — this is for the curious.

The Big Picture

Samantha LLM is a Python script with no external dependencies beyond the standard library. It orchestrates three things:

  1. Bootstrap — Assembles a prompt and hands it to your AI agent
  2. Memory — Maintains structured knowledge across sessions
  3. Subconscious — Automatically extracts insights after each session

Everything is plain text. Memories are Markdown. Indexes are JSON. Config is JSON. There’s no database, no daemon, no background service (except the post-session worker, which exits when done).

Session Lifecycle

samantha-llm start
       |
       v
Read config.json (agents, personas, defaults)
       |
       v
Create .ai-cerebrum symlink → samantha-llm repo
       |
       v
Check for unmerged subconscious sessions
       |
       v
Assemble bootstrap prompt (template + persona)
       |
       v
Launch agent with prompt as argument
       |
       v
 --- You work normally ---
       |
       v
Session ends → symlink cleaned up
       |
       v
Background worker launches (subconscious pipeline)
       |
       v
Memories ready for next session

Persona Injection

The bootstrap prompt is assembled from two pieces:

  1. Template (BOOTSTRAP_PROMPT.md) — Fixed instructions for initialization, memory loading, and core behaviors
  2. Persona file (persona/main.md or a custom file) — Who the assistant is

At startup, the persona content replaces a {PERSONA_CONTENT_HERE} placeholder in the template. The combined prompt is passed as an argument to your agent’s command. From the agent’s perspective, it’s just a very detailed system prompt.

BOOTSTRAP_PROMPT.md          persona/main.md
        |                          |
        +--- merge at placeholder --+
        |
        v
  Full bootstrap prompt
        |
        v
  claude "$(prompt)"
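The merge step above is a plain string substitution. The sketch below assumes the template contains `{PERSONA_CONTENT_HERE}` exactly once. The function names `assemble_prompt`, `build_agent_command` and `load_bootstrap` are hypothetical and do not come from the samantha-llm source. It builds the agent command as an argument list, so the whole prompt reaches the agent as one argv entry, which is what `claude "$(prompt)"` does in a shell.

```python
from pathlib import Path

PLACEHOLDER = "{PERSONA_CONTENT_HERE}"


def assemble_prompt(template: str, persona: str) -> str:
    """Merge persona text into the bootstrap template at the placeholder."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template is missing {PLACEHOLDER}")
    return template.replace(PLACEHOLDER, persona)


def build_agent_command(agent: str, prompt: str) -> list[str]:
    """Pass the full prompt as a single argument, like: claude "$(prompt)"."""
    return [agent, prompt]


def load_bootstrap(repo: Path, persona_file: str = "persona/main.md") -> str:
    """Read BOOTSTRAP_PROMPT.md and a persona file from the repo and merge them."""
    template = (repo / "BOOTSTRAP_PROMPT.md").read_text(encoding="utf-8")
    persona = (repo / persona_file).read_text(encoding="utf-8")
    return assemble_prompt(template, persona)
```

Handing the list to `subprocess.run(build_agent_command("claude", prompt))` avoids shell quoting entirely, so quotes or `$` characters inside the persona cannot break the command.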

Rather than copying files into every project, Samantha LLM creates a symlink:

your-project/.ai-cerebrum  →  /path/to/samantha-llm/

This gives the agent access to the bootstrap prompt, persona files, memory directories, and core processes — all from a single repository. Multiple projects share the same symlink target, which means they share the same memories and persona.

Auto-created symlinks are removed when the session ends. Manual links (samantha-llm link) persist until explicitly removed.
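One way to get "removed when the session ends, unless it was created manually" is a context manager that only deletes a link it created itself. This is a minimal sketch under that assumption. `cerebrum_link` is a hypothetical name, and the real script may track ownership differently.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

LINK_NAME = ".ai-cerebrum"


@contextmanager
def cerebrum_link(project: Path, repo: Path) -> Iterator[Path]:
    """Create project/.ai-cerebrum -> repo for one session, then clean up."""
    link = project / LINK_NAME
    created = False
    # Leave any existing entry alone (including a manual `samantha-llm link`).
    if not link.exists() and not link.is_symlink():
        os.symlink(repo.resolve(), link, target_is_directory=True)
        created = True
    try:
        yield link
    finally:
        # Remove only the link this session created; manual links persist.
        if created and link.is_symlink():
            link.unlink()
```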

Memory Architecture

Directory Layout

.ai/
├── short-term-memory/.ai/     30-90 day lifecycle
├── long-term-memory/.ai/      Permanent
├── current-tasks/.ai/         Active projects
├── work-experience/.ai/       Completed project archive
├── procedural-memory/.ai/     Operational runbooks
└── subconscious/              Session processing workspace

Index-First Loading

Each memory directory maintains an index.json that categorizes memories by importance, topic, and recency. During bootstrap, the assistant reads indexes first, then selectively loads only the memories relevant to the current session.

This keeps startup fast regardless of how many memories exist.
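The sketch below shows index-first loading. It reads every `index.json` up front and opens a memory file only when its index entry passes a filter. The index schema is an assumption made for illustration: a top-level `"memories"` list whose entries carry a `"file"` name plus metadata. The function names are also hypothetical.

```python
import json
from pathlib import Path
from typing import Callable

MEMORY_DIRS = [
    "short-term-memory/.ai",
    "long-term-memory/.ai",
    "current-tasks/.ai",
    "work-experience/.ai",
    "procedural-memory/.ai",
]


def read_indexes(root: Path) -> list[dict]:
    """Collect index entries from every memory directory without opening memories."""
    entries = []
    for sub in MEMORY_DIRS:
        index_path = root / sub / "index.json"
        if not index_path.is_file():
            continue
        data = json.loads(index_path.read_text(encoding="utf-8"))
        for entry in data.get("memories", []):  # assumed schema
            entries.append({**entry, "dir": str(root / sub)})
    return entries


def load_selected(entries: list[dict], wanted: Callable[[dict], bool]) -> dict[str, str]:
    """Open only the memory files whose index entry passes the filter."""
    return {
        e["file"]: (Path(e["dir"]) / e["file"]).read_text(encoding="utf-8")
        for e in entries
        if wanted(e)
    }
```

Startup cost scales with the size of the indexes rather than the number of memories, because memory files that are not selected are never read.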

Bootstrap Loading Order

  1. Critical memories — Flagged critical: true, loaded every session
  2. High-priority — Frequently referenced (reference_count >= 5)
  3. Project-specific — Tagged with the current project
  4. Recent high-importance (importance: high, from the last 30 days)
  5. Procedural — Runbooks matching the current context
  6. Long-term — Accessed on-demand during the session
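Tiers 1 to 4 of the order above can be read as a ranking over index entries. The sketch assumes index entries with the keys `critical`, `reference_count`, `importance`, a hypothetical `projects` tag list, and a hypothetical ISO `date`. Procedural runbooks (tier 5) and long-term memories (tier 6) are left out, since they load through context matching and on demand respectively.

```python
from datetime import date, timedelta
from typing import Optional


def load_tier(entry: dict, project: str, today: date) -> Optional[int]:
    """Return the bootstrap tier (1-4) for an index entry, or None to defer it."""
    if entry.get("critical") is True:
        return 1
    if entry.get("reference_count", 0) >= 5:
        return 2
    if project in entry.get("projects", []):  # hypothetical field
        return 3
    created = date.fromisoformat(entry.get("date", "1970-01-01"))  # hypothetical field
    if entry.get("importance") == "high" and today - created <= timedelta(days=30):
        return 4
    return None


def bootstrap_order(entries: list[dict], project: str, today: date) -> list[dict]:
    """Keep tiered entries and sort them by tier; sort is stable within a tier."""
    picked = [(load_tier(e, project, today), e) for e in entries]
    picked = [(t, e) for t, e in picked if t is not None]
    picked.sort(key=lambda pair: pair[0])
    return [e for _, e in picked]
```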

Importance Escalation

Memory importance increases automatically as memories prove useful:

References   Action
3+           Candidate for importance: high
5+           Candidate for long-term memory transfer
10+          Should definitely be in long-term memory

Low-importance memories older than 30 days are pruned. High-importance ones are promoted to long-term storage where they persist indefinitely.
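The escalation table and the pruning rule map onto two small functions. The thresholds come from the table and the 30-day window comes from the text. The function names and return labels are hypothetical. The sketch also assumes that promotion happens at review time regardless of age, since the exact timing is not specified here.

```python
from datetime import date, timedelta


def escalation(reference_count: int) -> str:
    """Map a memory's reference count onto the escalation table."""
    if reference_count >= 10:
        return "should be in long-term memory"
    if reference_count >= 5:
        return "candidate for long-term transfer"
    if reference_count >= 3:
        return "candidate for importance: high"
    return "no change"


def triage(importance: str, created: date, today: date) -> str:
    """Decide what happens to a short-term memory when it is reviewed."""
    if importance == "high":
        return "promote"  # moves to long-term storage and persists indefinitely
    if importance == "low" and today - created > timedelta(days=30):
        return "prune"
    return "keep"
```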

Subconscious Pipeline

After each session, a background worker analyzes the conversation and generates structured memories. The pipeline has six phases.

Phase 1: Transcript Capture

Terminal I/O is recorded during the session. When the session ends, the recording is saved and the background worker launches as a detached process.
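On POSIX systems, a "detached" worker usually means a child process in its own session, so it is not killed when the terminal that launched it closes. The sketch below does this with the standard library's `subprocess.Popen(start_new_session=True)`. `launch_detached` is a hypothetical name, and the recording mechanism itself is not shown.

```python
import subprocess


def launch_detached(cmd: list[str], log_path: str) -> subprocess.Popen:
    """Start a background worker that outlives the parent's terminal (POSIX)."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # setsid(): no SIGHUP when the terminal closes
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor
```

The parent can return right away. The worker writes its progress to the log and exits when processing finishes, which matches "exits when done" above.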

Phase 2: Chunking

Conversations longer than ~150K characters are split into chunks at natural boundaries.

Each chunk receives a summary of previous chunks for context continuity. This means conversations of any length can be processed.
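Here is a minimal sketch of chunking with a rolling summary. It assumes that "natural boundaries" means blank-line-separated blocks; the real boundary rules are not listed here. The `summarize` callback stands in for an LLM call. A single block larger than the limit becomes its own oversized chunk rather than being cut mid-block.

```python
from typing import Callable

MAX_CHARS = 150_000


def split_at_boundaries(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for block in text.split("\n\n"):
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks


def with_context(chunks: list[str], summarize: Callable[[str], str]) -> list[str]:
    """Prefix each chunk with a running summary of every chunk before it."""
    out: list[str] = []
    summary = ""
    for chunk in chunks:
        prefix = f"Summary of earlier conversation:\n{summary}\n\n" if summary else ""
        out.append(prefix + chunk)
        summary = summarize(f"{summary}\n{chunk}" if summary else chunk)
    return out
```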

Phase 3: LLM Analysis

Each chunk is sent to an LLM (Anthropic API preferred, Claude CLI as fallback) with a structured analysis prompt. The prompt instructs the LLM to extract the session's key insights, such as the decisions that were made.
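The "API preferred, CLI as fallback" rule amounts to a small backend selector. This sketch assumes the API counts as available when `ANTHROPIC_API_KEY` is set (the variable Anthropic's official SDKs read), and that the CLI counts as available when a `claude` executable is on `PATH`. How samantha-llm actually detects each backend is not described here, and `pick_backend` is a hypothetical name.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def pick_backend(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Prefer the Anthropic API, fall back to the Claude CLI, else report none."""
    if env.get("ANTHROPIC_API_KEY"):
        return "api"
    if which("claude"):
        return "cli"
    return None  # no backend: leave the session unprocessed for a later retry
```

Injecting `env` and `which` keeps the choice testable without touching the real environment.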

Phase 4: Parsing and Merging

Raw LLM output is parsed into structured data. For chunked conversations, results from all chunks are merged and deduplicated — a decision mentioned in chunk 2 and chunk 5 appears only once in the final output.
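The sketch below shows one way to merge per-chunk results so that a decision repeated in chunk 2 and chunk 5 is kept once. It assumes that each chunk's parsed output is a dict from category name to a list of strings (the category names are illustrative), and it treats items as duplicates when they match after case and whitespace normalization. The real merger may use a looser similarity test.

```python
def merge_chunk_results(results: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge per-chunk findings, keeping the first occurrence of each item."""
    merged: dict[str, list[str]] = {}
    seen: dict[str, set[str]] = {}
    for result in results:
        for category, items in result.items():
            bucket = merged.setdefault(category, [])
            keys = seen.setdefault(category, set())
            for item in items:
                key = " ".join(item.lower().split())  # ignore case and spacing
                if key not in keys:
                    keys.add(key)
                    bucket.append(item)
    return merged
```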

Phase 5: Memory Generation

Parsed results are converted into a memory file with proper YAML frontmatter. Importance is assessed automatically based on insight density — sessions with many decisions score higher than sessions with routine work.
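Because the script uses only the standard library, the frontmatter has to be written by hand rather than with a YAML package. The sketch renders merged insights as Markdown with a small frontmatter block. Its density heuristic weights decisions more heavily, which matches the description above, but the specific thresholds (3 decisions, 8 or 3 total items) are invented for illustration.

```python
from datetime import date


def assess_importance(insights: dict[str, list[str]]) -> str:
    """Score by insight density; decision-heavy sessions score higher."""
    decisions = len(insights.get("decisions", []))
    total = sum(len(items) for items in insights.values())
    if decisions >= 3 or total >= 8:  # hypothetical thresholds
        return "high"
    if total >= 3:
        return "medium"
    return "low"


def render_memory(title: str, insights: dict[str, list[str]], day: date) -> str:
    """Emit Markdown with a YAML frontmatter block, written without a YAML library."""
    lines = [
        "---",
        f"date: {day.isoformat()}",
        f"importance: {assess_importance(insights)}",
        "---",
        "",
        f"# {title}",
    ]
    for category, items in insights.items():
        lines += ["", f"## {category.replace('_', ' ').title()}"]
        lines += [f"- {item}" for item in items]
    return "\n".join(lines) + "\n"
```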

Phase 6: Procedural Extraction

A lightweight secondary analysis identifies operational patterns that could become runbooks. This runs only when the main analysis produces meaningful results (>500 characters).
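The 500-character gate is a simple guard in front of the secondary pass. In this sketch, `maybe_extract_procedures` and the `extract` callback (standing in for the runbook-extraction LLM call) are hypothetical names.

```python
from typing import Callable, Optional

MIN_ANALYSIS_CHARS = 500


def maybe_extract_procedures(
    analysis: str, extract: Callable[[str], list[str]]
) -> Optional[list[str]]:
    """Run the runbook pass only when the main analysis exceeds 500 characters."""
    if len(analysis) <= MIN_ANALYSIS_CHARS:
        return None  # too thin to contain a reusable procedure; skip the extra call
    return extract(analysis)
```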

Session Isolation

Each session gets its own workspace under .ai/subconscious/.ai/sessions/. Multiple concurrent sessions don’t interfere with each other. Workspaces are archived after a successful merge.
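Concurrent sessions stay isolated as long as each one gets a directory name that no other session can pick. The sketch combines a UTC timestamp with a random suffix under `.ai/subconscious/.ai/sessions/`, and archives a workspace by moving it. The naming scheme, the archive location, and both function names are assumptions.

```python
import shutil
import uuid
from datetime import datetime, timezone
from pathlib import Path


def new_session_workspace(ai_root: Path) -> Path:
    """Create a unique directory under .ai/subconscious/.ai/sessions/."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    name = f"{stamp}-{uuid.uuid4().hex[:8]}"
    workspace = ai_root / "subconscious" / ".ai" / "sessions" / name
    workspace.mkdir(parents=True, exist_ok=False)  # fail loudly on a collision
    return workspace


def archive_workspace(workspace: Path, archive_dir: Path) -> Path:
    """Move a merged session's workspace out of the active sessions directory."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(workspace), str(archive_dir / workspace.name)))
```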

Procedural Memory Triggers

Runbooks load automatically when context signals match. To prevent false positives, the system requires multiple signals from different categories:

Category          Examples
repo_signals      Repository name
path_signals      File paths being edited
keyword_signals   Terms in conversation
domain_signals    Broader context phrases

Rules:

This multi-signal approach prevents a runbook about “developing framework X” from loading every time someone merely uses framework X.
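The "developing framework X" example can be expressed as a rule requiring hits in at least two distinct signal categories. The category names come from the table above. The two-category minimum, the substring matching, and the shape of `context` (category name mapped to observed text) are assumptions, because the exact rules are not spelled out here.

```python
from dataclasses import dataclass, field

CATEGORIES = ("repo_signals", "path_signals", "keyword_signals", "domain_signals")


@dataclass
class Runbook:
    name: str
    signals: dict[str, list[str]] = field(default_factory=dict)


def matched_categories(runbook: Runbook, context: dict[str, str]) -> set[str]:
    """Return the categories in which at least one runbook signal appears."""
    hits = set()
    for category in CATEGORIES:
        observed = context.get(category, "").lower()
        if any(sig.lower() in observed for sig in runbook.signals.get(category, [])):
            hits.add(category)
    return hits


def should_load(runbook: Runbook, context: dict[str, str], min_categories: int = 2) -> bool:
    """Load only when signals from several different categories agree."""
    return len(matched_categories(runbook, context)) >= min_categories
```

Mentioning framework X in another repository matches only `keyword_signals`, so the runbook stays unloaded. Working inside the framework's own repository adds a `repo_signals` hit, which pushes it over the threshold.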

Ralph Mode

Ralph mode is an opt-in iterative coding loop for tasks with clear, machine-verifiable success criteria — test suites passing, linting clean, benchmarks met.

Named after Ralph Wiggum (iteration over perfection), it follows a cycle:

Work → Evaluate → Document → Decide → Monitor
  ^                                      |
  +--------------------------------------+
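The cycle can be sketched as a bounded loop around a machine-verifiable check. In this sketch, `work` and `evaluate` are caller-supplied callbacks. "Document" is reduced to a log line, and "Monitor" is reduced to a hard iteration cap; both are simplifications, and all names are hypothetical.

```python
from typing import Callable


def ralph_loop(
    work: Callable[[int], None],
    evaluate: Callable[[], bool],
    max_iterations: int = 10,
) -> tuple[bool, list[str]]:
    """Repeat Work -> Evaluate -> Document -> Decide until the check passes."""
    log: list[str] = []
    for i in range(1, max_iterations + 1):  # Monitor: hard cap on iterations
        work(i)  # Work
        passed = evaluate()  # Evaluate: e.g. the test suite's exit status
        log.append(f"iteration {i}: {'pass' if passed else 'fail'}")  # Document
        if passed:  # Decide: stop as soon as the criteria are met
            return True, log
    return False, log
```

A real `evaluate` could be `lambda: subprocess.run(["pytest", "-q"]).returncode == 0`. This is the kind of clear, machine-verifiable criterion Ralph mode is meant for.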

Key design principles:

Guardrails

Stored in .ai/ralph-guardrails/, guardrails are learned constraints:

## Don't Use pip

**Context:** Docker image build kept failing

### What Failed
Used `pip install` — project standard is `uv`

### Correct Approach
Always use `uv pip install --system`

Guardrails are loaded at the start of each Ralph iteration and grow over time. They’re the system’s way of not making the same mistake twice.
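Loading guardrails at the start of an iteration can be as simple as concatenating the files in `.ai/ralph-guardrails/`. This sketch assumes one Markdown file per guardrail and alphabetical load order. `load_guardrails` is a hypothetical name.

```python
from pathlib import Path


def load_guardrails(ai_root: Path) -> str:
    """Concatenate every guardrail file for injection into the next iteration."""
    guard_dir = ai_root / "ralph-guardrails"
    if not guard_dir.is_dir():
        return ""  # no lessons learned yet
    parts = [p.read_text(encoding="utf-8").strip() for p in sorted(guard_dir.glob("*.md"))]
    return "\n\n".join(part for part in parts if part)
```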

Design Principles

A few principles that shaped the architecture:
