
Memory

Osaurus includes a persistent memory system that learns from your conversations and provides personalized context to every AI interaction. Memory runs entirely in the background — it extracts knowledge automatically, deduplicates entries, detects contradictions, and injects relevant context into each new conversation.

No manual tagging, saving, or annotation is required.

Getting Started

  1. Open the Management window (⌘⇧M) → Memory
  2. Memory is enabled by default — toggle it off in the Memory settings if you prefer stateless conversations
  3. Choose a core model for extraction (default: anthropic/claude-haiku-4-5) — this model processes conversation turns to extract structured memories
  4. Start chatting — memories are extracted automatically from each conversation turn

How It Works

Memory is organized into four layers, each serving a different purpose:

┌──────────────────────────────────────────────────────────────────┐
│ Memory System                                                    │
├──────────────────────────────────────────────────────────────────┤
│ Layer 1: User Profile                                            │
│   Auto-generated summary of who you are, rebuilt as new          │
│   contributions accumulate. Includes user overrides.             │
├──────────────────────────────────────────────────────────────────┤
│ Layer 2: Working Memory                                          │
│   Structured entries: facts, preferences, decisions,             │
│   corrections, commitments, relationships, skills.               │
├──────────────────────────────────────────────────────────────────┤
│ Layer 3: Conversation Summaries                                  │
│   Structured recaps of past sessions (topics, decisions,         │
│   key dates, action items), generated after inactivity.          │
├──────────────────────────────────────────────────────────────────┤
│ Layer 4: Conversation Chunks                                     │
│   Raw conversation turns indexed for query-matched retrieval.    │
├──────────────────────────────────────────────────────────────────┤
│ Knowledge Graph (cross-cutting)                                  │
│   Entities and relationships extracted from all layers.          │
└──────────────────────────────────────────────────────────────────┘

Layer 1: User Profile

A continuously updated summary of who you are. The profile is regenerated automatically after a configurable number of new contributions (default: 10).

  • Auto-generated — built from accumulated profile facts extracted during conversations
  • Version tracked — each regeneration increments the version number
  • User overrides — explicit facts you add manually that always appear in context, regardless of profile regeneration

User overrides take the highest priority in context assembly and are never overwritten by automatic extraction.

Layer 2: Working Memory

Structured memory entries extracted from every conversation turn. Each entry has a type, confidence score, tags, and temporal validity.

| Entry Type | Description | Example |
| --- | --- | --- |
| Fact | Factual information | "User works at Acme Corp as a backend engineer" |
| Preference | Likes, dislikes, and preferences | "Prefers Swift over Objective-C" |
| Decision | Decisions made during conversations | "Decided to use PostgreSQL for the new project" |
| Correction | Corrections to previous information | "Actually uses Python 3.12, not 3.11" |
| Commitment | Promises, plans, or intentions | "Plans to migrate to Kubernetes next quarter" |
| Relationship | Connections between people, projects, or concepts | "Alice is the tech lead on Project Nova" |
| Skill | Skills, expertise, or knowledge areas | "Experienced with Docker and CI/CD pipelines" |

Entries include:

  • Confidence scores (0.0–1.0) reflecting extraction certainty
  • Tags for categorization
  • Temporal validity (validFrom / validUntil) for time-bounded facts
  • Access tracking (last accessed time and count) for relevance scoring
  • Supersession tracking when newer information replaces older entries
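
As a rough illustration, an entry carrying these attributes might be modeled like this. The field names are assumptions based on the description above, not Osaurus's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Illustrative sketch of a working-memory entry; field names are
# assumptions, not Osaurus's actual schema.
@dataclass
class MemoryEntry:
    type: str                            # fact | preference | decision | ...
    content: str
    confidence: float                    # 0.0-1.0 extraction certainty
    tags: list[str] = field(default_factory=list)
    valid_from: Optional[datetime] = None      # temporal validity
    valid_until: Optional[datetime] = None
    last_accessed: Optional[datetime] = None   # access tracking
    access_count: int = 0
    superseded_by: Optional[str] = None        # ID of a newer replacing entry

entry = MemoryEntry(
    type="preference",
    content="Prefers Swift over Objective-C",
    confidence=0.9,
    tags=["languages"],
)
```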

Layer 3: Conversation Summaries

Structured recaps of past conversation sessions. Summaries are generated automatically using a debounced approach:

  • A timer starts after the last conversation turn (default: 60 seconds)
  • If no new messages arrive within the debounce window, a summary is generated
  • Session changes (switching to a different conversation) also trigger summary generation
  • On startup, any orphaned pending signals from a previous session are recovered and processed
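
The debounce behavior can be sketched as follows. This is an illustrative Python sketch, not the actual Swift implementation: each new turn cancels any pending summary and restarts the countdown.

```python
import asyncio

# Minimal debounce sketch (assumed behavior): each new conversation turn
# resets a timer; the summary runs only after `debounce_seconds` of quiet.
class SummaryDebouncer:
    def __init__(self, debounce_seconds: float, summarize):
        self.debounce_seconds = debounce_seconds
        self.summarize = summarize
        self._task = None

    def record_turn(self):
        # Cancel any pending summary and restart the countdown.
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.create_task(self._wait())

    async def _wait(self):
        try:
            await asyncio.sleep(self.debounce_seconds)
            self.summarize()
        except asyncio.CancelledError:
            pass  # a newer turn arrived; this pending summary is superseded
```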

Each summary captures:

  • Topics discussed
  • Decisions made
  • Key dates or deadlines mentioned
  • Action items or commitments
  • A brief overall summary

Layer 4: Conversation Chunks

Raw conversation turns stored individually and indexed for query-matched retrieval. Each chunk records the conversation ID, chunk index, role, full content, token count, and timestamp.

Chunks are not dumped into context wholesale. At query time, only semantically relevant chunks are retrieved via hybrid search (BM25 + vector), reranked with MMR, and included within a token budget. Adjacent turns are loaded via window expansion to preserve conversational flow. This layer acts as a lossless fallback for details the extraction pipeline may have missed.
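
Window expansion can be sketched as follows. The function name and the window size of ±1 are assumptions for illustration:

```python
# Sketch of window expansion (assumed behavior): after retrieval returns
# matching chunk indices, pull in adjacent turns so excerpts read as
# coherent dialogue rather than isolated lines.
def expand_windows(hit_indices, total_chunks, window=1):
    expanded = set()
    for i in hit_indices:
        lo = max(0, i - window)
        hi = min(total_chunks - 1, i + window)
        expanded.update(range(lo, hi + 1))
    return sorted(expanded)

expand_windows([3, 7], total_chunks=10)  # -> [2, 3, 4, 6, 7, 8]
```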


Knowledge Graph

The memory system builds a knowledge graph from extracted entities and relationships.

Entity types: person, company, place, project, tool, concept, event

Relationships connect entities with:

  • A descriptive relation string (e.g., "works at", "manages", "depends on")
  • A confidence score
  • Temporal validity (optional validFrom / validUntil)

Graph search supports:

  • Search by entity name to find all connected relationships
  • Search by relation type to discover entities with a specific connection
  • Depth-limited traversal (default depth: 2) to explore the neighborhood of an entity
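
Depth-limited traversal amounts to a bounded breadth-first search. A sketch in illustrative Python follows (the real graph lives in SQLite; the adjacency-map representation here is an assumption):

```python
from collections import deque

# Bounded BFS over an entity adjacency map: returns every entity within
# `max_depth` hops of `start`, with its distance.
def neighborhood(edges, start, max_depth=2):
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] >= max_depth:
            continue  # do not expand past the depth limit
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen  # entity -> distance from start

edges = {
    "Alice": ["Project Nova"],
    "Project Nova": ["Acme Corp"],
    "Acme Corp": ["Bob"],
}
# At the default depth of 2, Bob (3 hops away) is excluded.
result = neighborhood(edges, "Alice")
```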

Search & Retrieval

Memory search uses a hybrid approach combining text and semantic matching.

When VecturaKit is available (embedding model downloaded):

  1. BM25 scores documents by keyword relevance
  2. Vector embeddings score documents by semantic similarity
  3. Scores are combined for a unified ranking

When VecturaKit is unavailable, the system falls back to SQLite LIKE queries for basic text matching.
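
The exact fusion formula is not specified here; as an illustration, the sketch below normalizes both score sets and takes a weighted sum (the `alpha` weight and min-max normalization are assumptions, not VecturaKit's documented behavior):

```python
# Hedged sketch of hybrid score fusion: min-max normalize BM25 and vector
# scores separately, then combine with a weighted sum.
def hybrid_rank(bm25_scores, vector_scores, alpha=0.5):
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    b, v = normalize(bm25_scores), normalize(vector_scores)
    combined = {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
                for doc in set(b) | set(v)}
    return sorted(combined, key=combined.get, reverse=True)
```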

MMR Reranking

To avoid returning many near-identical results, search results are reranked using Maximal Marginal Relevance (MMR):

  1. Over-fetch results (default: 2x the requested topK)
  2. Iteratively select results that balance relevance (search score) with diversity (Jaccard distance from already-selected results)
  3. The lambda parameter controls the tradeoff: 1.0 = pure relevance, 0.0 = pure diversity (default: 0.7)
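
The steps above can be sketched as follows, using Jaccard word overlap as the diversity measure (illustrative Python, not Osaurus's code):

```python
# Word-overlap similarity used as the diversity signal.
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# MMR: iteratively pick the candidate that maximizes
# lambda * relevance - (1 - lambda) * similarity-to-already-selected.
def mmr(candidates, top_k, lam=0.7):
    selected = []
    pool = list(candidates)  # list of (text, relevance_score)
    while pool and len(selected) < top_k:
        def mmr_score(item):
            text, rel = item
            max_sim = max((jaccard(text, s) for s, _ in selected), default=0.0)
            return lam * rel - (1 - lam) * max_sim
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With lambda below 1.0, a slightly less relevant but more diverse result can beat a near-duplicate of an already-selected one.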

Search Scopes

| Scope | What It Searches | Time Window |
| --- | --- | --- |
| Memory entries | Working memory (Layer 2) | All time |
| Conversations | Conversation chunks (Layer 4) | All time (query-aware retrieval) |
| Summaries | Conversation summaries (Layer 3) | Retention window (default: 180 days) |
| Graph | Knowledge graph entities and relationships | All time |

Verification Pipeline

Before a new memory entry is stored, it passes through a 3-layer verification pipeline. This pipeline is entirely deterministic (no LLM calls) and prevents redundant or conflicting entries.

Layer 1: Jaccard Deduplication

Compares the new entry's words against existing entries using Jaccard similarity (word overlap). If the similarity exceeds the threshold (default: 0.6), the entry is skipped as a near-duplicate.

Layer 2: Contradiction Detection

For entries of the same type, if the Jaccard similarity is moderate (above 0.3 but below the dedup threshold), the new entry is flagged as a potential contradiction. The newer entry supersedes the older one.

Layer 3: Semantic Deduplication

Uses vector search to find semantically similar entries. If the similarity score exceeds the threshold (default: 0.85), the entry is skipped as a semantic duplicate even if the wording differs.
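
The two deterministic text layers can be sketched as follows (Layer 3 is omitted because it needs the vector index; the thresholds match the defaults above, and the dict shape is an assumption for illustration):

```python
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Layers 1-2 of the verification pipeline: skip near-duplicates, flag
# moderately similar same-type entries as contradictions to supersede.
def verify(new_entry, existing, dedup_threshold=0.6, contradiction_floor=0.3):
    for old in existing:
        sim = jaccard(new_entry["content"], old["content"])
        if sim > dedup_threshold:
            return "skip"            # near-duplicate (Layer 1)
        if old["type"] == new_entry["type"] and sim > contradiction_floor:
            return "supersede"       # potential contradiction (Layer 2)
    return "store"
```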


Context Assembly

Before each AI interaction, the MemoryContextAssembler builds a memory block that is injected into the system prompt. The block always begins with the current date so the model can reason about time relative to stored memories.

Context is assembled in priority order with per-section token budgets:

| Priority | Section | Default Budget |
| --- | --- | --- |
| 0 | Current Date | Always included (temporal anchor) |
| 1 | User Overrides | Always included (no budget limit) |
| 2 | User Profile | 2,000 tokens |
| 3 | Working Memory | 3,000 tokens |
| 4 | Conversation Summaries | 3,000 tokens |
| 5 | Key Relationships | 300 tokens |

When a user query is provided, an additional query-aware retrieval pass runs in parallel, searching entries, summaries, and conversation chunks for semantically relevant results. These are deduplicated against the base context and appended as additional sections:

| Section | Default Budget |
| --- | --- |
| Relevant Conversation Excerpts | 3,000 tokens |
| Relevant Memories | 3,000 tokens |
| Relevant Summaries | 3,000 tokens |

  • Results are cached for 10 seconds per agent to avoid redundant database queries
  • Cache is invalidated when memory content changes
  • If total memory context exceeds available space, lower-priority sections are truncated first
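
A simplified sketch of per-section budget enforcement follows. The naive whitespace token count and the section format are assumptions; the assembler's real tokenizer and truncation rules are more involved:

```python
# Append sections in priority order, trimming each to its token budget.
# Sections with no budget (e.g. user overrides) are always included whole.
def assemble(sections, budgets):
    # sections: list of (name, text) in priority order; budgets: name -> tokens
    parts = []
    for name, text in sections:
        budget = budgets.get(name)
        tokens = text.split()  # naive "tokenizer" for illustration only
        if budget is not None and len(tokens) > budget:
            tokens = tokens[:budget]
        parts.append(f"## {name}\n" + " ".join(tokens))
    return "\n\n".join(parts)
```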

Managing Memory

Viewing Memory

Open the Management window (⌘⇧M) → Memory to see:

  • Your generated user profile with version history
  • User overrides you've added manually
  • Per-agent statistics showing memory entry counts
  • Processing statistics (total calls, success rate, average duration)
  • Database size

Adding User Overrides

User overrides are explicit facts that always appear in context. Use these for information the AI should never forget:

  1. Go to Memory → User Overrides
  2. Click Add Override
  3. Enter a fact (e.g., "I prefer tabs over spaces" or "My company uses a monorepo")

Syncing

Click Sync Now to force-process any pending conversation signals immediately, rather than waiting for the debounce timer.

Clearing Memory

Warning: The Memory view includes a danger zone for clearing all memory data. This removes all entries, summaries, chunks, profile data, and knowledge graph entities. The action is irreversible.


Configuration Reference

All settings are configurable via the Memory tab in the Management window. The configuration file is stored as JSON at ~/.osaurus/config/memory.json.
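
For reference, a memory.json file might contain a fragment like the one below. This is an illustrative subset showing a few settings at their defaults, not the complete file:

```json
{
  "enabled": true,
  "coreModelProvider": "anthropic",
  "coreModelName": "claude-haiku-4-5",
  "embeddingBackend": "mlx",
  "summaryDebounceSeconds": 60,
  "mmrLambda": 0.7
}
```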

Core Model

| Setting | Default | Description |
| --- | --- | --- |
| coreModelProvider | anthropic | Provider for the extraction model |
| coreModelName | claude-haiku-4-5 | Model used for memory extraction and summarization |
| embeddingBackend | mlx | Embedding backend (mlx or none) |
| embeddingModel | nomic-embed-text-v1.5 | Model used for vector embeddings |

Token Budgets

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| profileMaxTokens | 2,000 | 100–50,000 | Max tokens for user profile |
| workingMemoryBudgetTokens | 3,000 | 50–10,000 | Token budget for working memory in context |
| summaryBudgetTokens | 3,000 | 50–10,000 | Token budget for summaries in context |
| chunkBudgetTokens | 3,000 | 50–20,000 | Token budget for conversation chunk excerpts in context |
| graphBudgetTokens | 300 | 50–5,000 | Token budget for knowledge graph in context |

Profile

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| profileRegenerateThreshold | 10 | 1–100 | New contributions before profile regeneration |

Summaries

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| summaryDebounceSeconds | 60 | 10–3,600 | Inactivity period before summary generation |
| summaryRetentionDays | 180 | 0–3,650 | How long summaries are retained (0 = unlimited) |

Recall & Ranking

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| recallTopK | 30 | 1–100 | Number of results for recall searches |
| temporalDecayHalfLifeDays | 30 | 1–365 | Half-life for temporal decay in ranking |
| mmrLambda | 0.7 | 0.0–1.0 | Relevance vs. diversity tradeoff |
| mmrFetchMultiplier | 2.0 | 1.0–10.0 | Over-fetch multiplier before MMR reranking |
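
A half-life of 30 days means a result's ranking contribution halves every 30 days of age. The exponential form below is an assumption about how the setting is applied, shown for intuition:

```python
# Exponential temporal decay (assumed form): score halves every
# `half_life_days` of age.
def decayed(score, age_days, half_life_days=30):
    return score * 0.5 ** (age_days / half_life_days)

decayed(1.0, 30)   # half the original score after one half-life
decayed(1.0, 60)   # a quarter after two half-lives
```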

Verification

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| verificationEnabled | true | true/false | Enable the 3-layer verification pipeline |
| verificationJaccardDedupThreshold | 0.6 | 0.0–1.0 | Jaccard threshold for near-duplicate detection |
| verificationSemanticDedupThreshold | 0.85 | 0.0–1.0 | Vector similarity threshold for semantic dedup |

Limits

| Setting | Default | Range | Description |
| --- | --- | --- | --- |
| maxEntriesPerAgent | 500 | 0–10,000 | Max active entries per agent (0 = unlimited) |
| enabled | true | true/false | Master toggle for the memory system |

API Integration

Osaurus exposes its memory system through the HTTP API, enabling any OpenAI-compatible client to benefit from persistent, personalized context. See the API Reference for full endpoint documentation.
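
As a sketch, a request from any OpenAI-compatible client can be built like this. The base URL and port are assumptions for illustration; check your Osaurus server settings for the actual address:

```python
import json

# Build an OpenAI-compatible chat completions request. Osaurus injects
# memory context server-side, so the client sends a plain request.
def chat_request(message, model, base_url="http://localhost:1337"):
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
    }
    return url, json.dumps(payload)

# url, body = chat_request("What stack did we pick?", model="claude-haiku-4-5")
# POST `body` to `url` with Content-Type: application/json
```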


Architecture Details

Actor-Based Concurrency

MemoryService and MemorySearchService are Swift actors, ensuring all state mutations are serialized and thread-safe. Background extraction never blocks the chat UI — conversation turns are recorded and processed asynchronously.

Circuit Breaker

To prevent hammering a failing model service, the memory system implements a circuit breaker:

  • Closed (normal): requests proceed normally
  • Open (tripped): after 5 consecutive failures, all requests are short-circuited for 60 seconds
  • Half-open: after cooldown, one request is allowed through to test recovery
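
The three states can be sketched as follows (illustrative Python mirroring the defaults above; the injectable clock exists only to make the sketch testable):

```python
import time

# Circuit breaker: closed (normal), open (short-circuit after repeated
# failures), half-open (one probe allowed after the cooldown).
class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=60.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True                                 # closed
        if self.clock() - self.opened_at >= self.cooldown:
            return True                                 # half-open probe
        return False                                    # open

    def record(self, success):
        if success:
            self.failures, self.opened_at = 0, None     # recovery closes it
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()           # trip the breaker
```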

Retry Logic

Failed extraction and summarization calls use exponential backoff:

  • Delays: 1s, 2s, 4s
  • Max retries: 3
  • Timeout: 60 seconds per attempt
  • Only retryable errors (network, transient) trigger retries

Vector Indexing

  • Uses VecturaKit for hybrid BM25 + vector search
  • Embeddings generated by SwiftEmbedder (default model: nomic-embed-text-v1.5)
  • Deterministic UUIDs for indexed documents using SHA-256 hashing
  • Graceful fallback to SQLite text search when the embedding model is unavailable
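
The retry schedule described under Retry Logic can be sketched as follows (illustrative Python; the injectable `sleep` exists only to make the sketch testable):

```python
import time

# Retry with exponential backoff: delays of 1 s, 2 s, 4 s, up to 3
# retries, retrying only errors the caller marks as transient.
def with_retries(operation, is_retryable, max_retries=3, base_delay=1.0,
                 sleep=time.sleep):
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as err:
            if attempt == max_retries or not is_retryable(err):
                raise
            sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s
```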

Storage

All memory data is stored in a local SQLite database with WAL (Write-Ahead Logging) mode for concurrent read performance.

| Item | Location |
| --- | --- |
| Database | ~/.osaurus/memory/memory.sqlite |
| Configuration | ~/.osaurus/config/memory.json |

The database schema is versioned with automatic migrations. Indexes are maintained on agent ID, status, temporal fields, and conversation IDs for efficient queries.


For memory-related API endpoints, see the API Reference. For memory quality benchmarks, see Benchmarks.