AI Agent Memory Systems — Short-Term, Long-Term, and RAG

May 14, 2026 · Intermediate

30-Second Takeaway

Problem Solved: Agents without memory start from scratch every session. Three-layer memory gives Agents conversation history, preferences, and knowledge access.
Core Method: Short-term (message list + window management), long-term (JSON/SQLite persistence), RAG (vector DB + embedding retrieval).
Key Insight: "Sliding window + periodic summarization + JSON file long-term memory" is sufficient for most use cases. Only introduce RAG for large document collections.
What You'll Gain: Cross-session memory for your Agent, and a clear sense of which memory layer to use when.

Agents and humans share a key trait: without memory, neither can get much done.

Imagine an Agent helping you with a three-day data analysis project. Day one, you tell it where the data lives and what chart style you prefer. Day two, you ask it to continue — if it remembers nothing from day one, you have to repeat every instruction.

That's what memory systems solve. This article breaks down three Agent memory mechanisms, from concept to code.

The Three-Layer Memory Model

Type	Analogy	Lifetime	Implementation
Short-Term	Working memory	Current session	Message list
Long-Term	Notebook	Cross-session persistent	Database / files
RAG	Library	On-demand retrieval	Vector database

Layer 1: Short-Term Memory (Conversation Window)

Short-term memory is simply the messages list. Our Agent from the previous article already uses it:

messages = [
    {"role": "system", "content": "You are..."},
    {"role": "user", "content": user_input},
    {"role": "assistant", "content": "Let me search..."},
    {"role": "tool", "content": "Search results..."},
    {"role": "assistant", "content": "Based on the search..."}
]

The problem: model context windows are finite. What happens when the conversation grows too long?

Window Management Strategies

1. Sliding Window — keep only the last N messages. Simple and brute-force, but loses early key information.
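A sliding window can be sketched in a few lines. This is a minimal illustration, not the article's reference implementation: the function name `sliding_window` and the `max_messages` parameter are assumptions, and it adds one design choice of its own by always keeping system messages so the Agent's instructions are never trimmed.

```python
def sliding_window(messages, max_messages=20):
    """Keep every system message plus the most recent `max_messages` others."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    # Do not start the window on an orphaned tool result: chat APIs generally
    # expect a tool message to follow the assistant turn that requested it.
    while recent and recent[0]["role"] == "tool":
        recent = recent[1:]
    return system + recent
```

Calling `messages = sliding_window(messages)` before each model request keeps the list bounded. Everything older than the window is simply gone, which is the information loss described above.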

2. Smart Summarization — periodically compress conversation history using the model. Turn "user prefers blue charts, data is in data/ directory" into a compact system prompt, replacing verbose raw dialogue:

def compress_history(messages, client):
    """Compress conversation history into a one-paragraph summary."""
    # Use only the last 20 messages, truncating each to 200 characters.
    # `content` can be None on assistant messages that only carry tool calls.
    recent = "\n".join(
        f"{m['role']}: {(m.get('content') or '')[:200]}" for m in messages[-20:]
    )
    summary_prompt = "Summarize the key information from this conversation:\n" + recent
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": summary_prompt}],
    )
    return response.choices[0].message.content

3. Tiered Window — keep recent messages raw, slightly older ones as summaries, drop the oldest. Balances context and efficiency.
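One way to picture the tiered window is as a function that splits the history into three bands. The sketch below is an assumption-laden illustration: `tiered_window`, `keep_recent`, and `summarize_span` are hypothetical names, and the summarizer is passed in as a plain callable so the logic can be tried without calling a model.

```python
from typing import Callable


def tiered_window(
    messages: list,
    summarize: Callable[[list], str],
    keep_recent: int = 10,
    summarize_span: int = 20,
) -> list:
    """Recent messages stay raw, older ones become a summary, the oldest are dropped."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    split = max(len(rest) - keep_recent, 0)
    older, recent = rest[:split], rest[split:]
    if not older:
        return system + rest  # nothing to compress yet

    # Only the newest `summarize_span` of the older messages are summarized;
    # anything before that is dropped entirely.
    middle = older[max(len(older) - summarize_span, 0):]
    summary_msg = {
        "role": "system",
        "content": "Summary of earlier conversation:\n" + summarize(middle),
    }
    return system + [summary_msg] + recent
```

With the `compress_history` function above, the callable could be `lambda msgs: compress_history(msgs, client)`. Note that this re-summarizes on every call; caching the summary between turns would be a natural refinement.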

📌 Practical advice: "Sliding window + periodic summarization" covers the vast majority of use cases. Only reach for more complex approaches when you need precise recall of historical details.

Layer 2: Long-Term Memory (Persistent Storage)

Long-term memory lets the Agent remember across sessions — your preferences, project structure, past task results.

Simplest Implementation: File Storage

import json
import os
from datetime import datetime

MEMORY_FILE = "agent_memory.json"

def load_memory() -> dict:
    """Read the memory file, or return an empty structure on first run."""
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            return json.load(f)
    return {"facts": [], "preferences": {}}

def save_fact(key: str, value: str):
    """Append one timestamped fact and rewrite the memory file."""
    memory = load_memory()
    memory["facts"].append({"key": key, "value": value, "time": datetime.now().isoformat()})
    with open(MEMORY_FILE, "w") as f:
        json.dump(memory, f, indent=2)

def get_relevant_context(query: str = "") -> str:
    """On startup, inject memory into the system prompt."""
    # `query` is optional and unused in this simple version: every stored fact is returned.
    memory = load_memory()
    facts = "\n".join([f"- {f['key']}: {f['value']}" for f in memory["facts"]])
    return f"Known user information:\n{facts}"

Usage: load memory at the start of each conversation and inject it into the system message:

system_prompt = f"""You are a helpful assistant.
{get_relevant_context()}
Answer user questions based on known information."""

Advanced: Structured Memory

For more complex scenarios, use SQLite with categorized memories:

CREATE TABLE memory (
    id INTEGER PRIMARY KEY,
    category TEXT,      -- 'preference', 'project', 'person', 'fact'
    key TEXT,
    value TEXT,
    importance REAL,    -- 0.0 to 1.0, determines if injected into context
    created_at TIMESTAMP,
    last_accessed TIMESTAMP
);

On injection, only select high-importance or recently accessed memories to prevent context bloat.
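The selection rule can be expressed as a single query. The following is a sketch using Python's built-in `sqlite3` module against the schema above; the helper names (`open_memory`, `remember`, `select_for_injection`) and the thresholds (importance of 0.7, a 7-day recency window, a limit of 20 rows) are illustrative assumptions, not values from the article.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS memory (
    id INTEGER PRIMARY KEY,
    category TEXT,
    key TEXT,
    value TEXT,
    importance REAL,
    created_at TIMESTAMP,
    last_accessed TIMESTAMP
)
"""


def open_memory(path=":memory:"):
    """Open (or create) the memory database; defaults to an in-memory DB."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def remember(conn, category, key, value, importance, now=None):
    """Insert one memory row, stamped as an ISO-8601 UTC string."""
    stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    conn.execute(
        "INSERT INTO memory (category, key, value, importance, created_at, last_accessed) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (category, key, value, importance, stamp, stamp),
    )
    conn.commit()


def select_for_injection(conn, min_importance=0.7, recent_days=7, limit=20, now=None):
    """Return (key, value) pairs that are high-importance OR recently accessed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=recent_days)).isoformat(timespec="seconds")
    rows = conn.execute(
        "SELECT key, value FROM memory "
        "WHERE importance >= ? OR last_accessed >= ? "
        "ORDER BY importance DESC, last_accessed DESC LIMIT ?",
        (min_importance, cutoff, limit),
    ).fetchall()
    return rows
```

Timestamps are stored as same-format UTC strings so that plain string comparison in SQL matches chronological order. The `LIMIT` acts as a hard cap on how much memory can reach the prompt, whatever the scores say.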

Layer 3: RAG Memory (Retrieval-Augmented Generation)

The first two layers are great for "meta-information" — preferences, state, facts. But when you have large volumes of documents — codebases, manuals, research papers — you need RAG.

The RAG pipeline:

1. Index: Split documents into chunks, convert to vectors via embedding model, store in vector database
2. Retrieve: When the user asks, convert the question to a vector, search for the most similar document chunks
3. Generate: Inject retrieved chunks into the prompt, let the model answer based on this information

import chromadb
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")
db = chromadb.PersistentClient(path="./agent_rag_db")
collection = db.get_or_create_collection("knowledge_base")

# 1. Index — store documents
def index_document(doc_id: str, content: str):
    chunks = [content[i:i+500] for i in range(0, len(content), 500)]
    for i, chunk in enumerate(chunks):
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=chunk
        ).data[0].embedding
        collection.add(
            ids=[f"{doc_id}_{i}"],
            embeddings=[embedding],
            documents=[chunk]
        )

# 2. Retrieve — find relevant snippets
def retrieve(query: str, top_k: int = 5) -> str:
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return "\n\n".join(results["documents"][0])

# 3. Inject into Agent — append to system prompt
rag_context = retrieve(user_input)
system_prompt += f"\n\nReference knowledge base:\n{rag_context}"
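To see what the retrieve step is doing conceptually, here is a dependency-free sketch of similarity search. It is deliberately a toy: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `retrieve_toy` with its `min_score` cutoff is a hypothetical helper, not part of ChromaDB or the OpenAI client. Only the ranking logic (score every chunk, filter, sort, take the top k) carries over to the real pipeline.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_toy(query, chunks, top_k=2, min_score=0.5):
    """Rank chunks by similarity to the query, dropping weak matches."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    scored = [s for s in scored if s[0] >= min_score]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

The `min_score` filter matters more than it looks: without it, a chunk that shares only filler words with the question can still land in the top k and end up in the prompt.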

How the Three Layers Work Together

In practice, the three layers aren't mutually exclusive — each has its role:

Scenario	Which Layer
Remember what the user just said	Short-term (message list)
Remember user preferences, project paths	Long-term (JSON/SQLite)
Look up technical docs, research papers	RAG (vector database)
Find relevant functions in a codebase	RAG + directory tree index
Summarize completed tasks	Long-term + periodic summarization

A Complete Memory-Aware Agent

Integrating all three layers into our Agent loop:

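def run_agent_with_memory(user_input: str, user_id: str = "default"):
    # Load long-term memory. The load_memory() defined earlier takes no
    # arguments and reads a single file; passing user_id here assumes a
    # per-user variant of it.
    long_term = load_memory(user_id)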

    # RAG retrieval
    rag_context = retrieve(user_input)

    # Build system prompt (three-layer fusion)
    system = f"""You are the user's AI assistant.
## User Preferences
{long_term}

## Knowledge Base Reference
{rag_context}

## Conversation Guidelines
- Prioritize information from the knowledge base
- Remember user preferences for future conversations"""

    messages = [{"role": "system", "content": system}]
    messages.extend(load_recent_history(user_id)[-20:])  # Short-term
    messages.append({"role": "user", "content": user_input})

    # ReAct loop (unchanged)
    for turn in range(10):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            # Proactively save newly learned facts
            extract_and_save_facts(msg.content, user_id)
            return msg.content
        # ... execute tools ...

    return "Max turns reached"

Common Pitfalls

Memory bloat — indiscriminately saving everything bloats the context. Fix: add importance scoring, only inject high-score memories.
Stale memory — old preferences no longer apply but keep getting injected. Fix: add TTL expiration, or periodically ask the model to re-confirm.
Privacy leaks — sensitive information stored in memory gets exposed in later conversations. Fix: mark sensitive info as non-persistent, or encrypt storage.
RAG noise — retrieving irrelevant chunks misleads the model. Fix: add relevance threshold filtering, or let the model judge whether retrieved content is actually useful.
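For the stale-memory pitfall, a small filter over the JSON facts can be enough. The sketch below assumes the fact format written by `save_fact` earlier (`key`, `value`, and an ISO-format `time`); the function name `current_facts` and the 90-day default are illustrative assumptions. It keeps only the newest entry per key, then drops anything older than the TTL.

```python
from datetime import datetime, timedelta


def current_facts(facts, ttl_days=90, now=None):
    """Keep the newest fact per key, then drop facts older than `ttl_days`."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=ttl_days)

    newest = {}
    for fact in facts:
        seen = newest.get(fact["key"])
        if seen is None or (
            datetime.fromisoformat(fact["time"]) > datetime.fromisoformat(seen["time"])
        ):
            newest[fact["key"]] = fact

    return [f for f in newest.values() if datetime.fromisoformat(f["time"]) >= cutoff]
```

Applying this to `load_memory()["facts"]` before building the system prompt means an updated preference replaces the old one instead of both being injected side by side.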

Next Steps

📖 Foundational: Write Your First AI Agent
📖 Next: Agent Error Recovery & Self-Correction
📖 Advanced: Building an Agent Framework from Scratch

Frequently Asked Questions

Q: Core difference between short-term and long-term memory?

A: Short-term lives in the current session's message list. Long-term persists to files or databases across sessions.

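Q: When should I use RAG versus simple long-term memory?

A: Use long-term memory for structured facts and preferences, and RAG for large volumes of unstructured documents. As a rough rule of thumb: under ~1,000 records, long-term memory is enough; beyond ~10,000, use RAG. In between, it is a judgment call.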

Q: Which vector database?

A: ChromaDB is open-source and runs locally, which makes it a good fit for prototypes. Pinecone is a managed cloud service aimed at production use. Start with ChromaDB.

Q: How to prevent memory bloat?

A: Three strategies: importance scoring, TTL expiration, and summarization compression. Don't store everything indiscriminately.