AI Agent Memory Systems — Short-Term, Long-Term, and RAG
30-Second Takeaway
- Problem Solved: Agents without memory start from scratch every session. Three-layer memory gives Agents conversation history, preferences, and knowledge access.
- Core Method: Short-term (message list + window management), long-term (JSON/SQLite persistence), RAG (vector DB + embedding retrieval).
- Key Insight: "Sliding window + periodic summarization + JSON file long-term memory" is sufficient for most use cases. Only introduce RAG for large document collections.
- What You'll Gain: Cross-session memory for your Agent, plus a clear sense of which memory layer to use when.
Agents and humans share a key trait: without memory, neither gets much done.
Imagine an Agent helping you with a three-day data analysis project. Day one, you tell it where the data lives and what chart style you prefer. Day two, you ask it to continue — if it remembers nothing from day one, you have to repeat every instruction.
That's what memory systems solve. This article breaks down three Agent memory mechanisms, from concept to code.
The Three-Layer Memory Model
| Type | Analogy | Lifetime | Implementation |
| --- | --- | --- | --- |
| Short-Term | Working memory | Current session | Message list |
| Long-Term | Notebook | Cross-session persistent | Database / files |
| RAG | Library | On-demand retrieval | Vector database |
Layer 1: Short-Term Memory (Conversation Window)
Short-term memory is simply the messages list. Our Agent from the previous article already uses it:
messages = [
    {"role": "system", "content": "You are..."},
    {"role": "user", "content": user_input},
    {"role": "assistant", "content": "Let me search..."},
    {"role": "tool", "content": "Search results..."},
    {"role": "assistant", "content": "Based on the search..."}
]
The problem: model context windows are finite. What happens when the conversation grows too long?
Window Management Strategies
1. Sliding Window — keep only the last N messages. Simple and brute-force, but loses early key information.
2. Smart Summarization — periodically compress conversation history using the model. Many turns of verbose raw dialogue collapse into a compact note such as "user prefers blue charts, data is in data/ directory", which is then injected as a system message in place of the original messages:
def compress_history(messages, client):
    """Compress conversation history into a one-paragraph summary."""
    # Use the last 20 messages, truncating each to 200 characters.
    # `content` can be None on assistant messages that only carry tool calls.
    recent = "\n".join(
        f"{m['role']}: {(m.get('content') or '')[:200]}" for m in messages[-20:]
    )
    summary_prompt = "Summarize the key information from this conversation:\n" + recent
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": summary_prompt}],
    )
    return response.choices[0].message.content
3. Tiered Window — keep recent messages raw, slightly older ones as summaries, drop the oldest. Balances context and efficiency.
📌 Practical advice: "Sliding window + periodic summarization" covers the vast majority of use cases. Only reach for more complex approaches when you need precise recall of historical details.
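A minimal sketch of that combination, under stated assumptions: `sliding_window`, `compact_history`, and the thresholds are hypothetical names and values, and `summarize` is any callable that turns a list of messages into a string (for example, a wrapper around `compress_history`). One caveat: if the history contains tool calls, cutting between an assistant tool call and its tool response can produce a message list the API rejects, so a real implementation would cut on turn boundaries.

```python
from typing import Callable


def sliding_window(messages: list[dict], keep_last: int = 20) -> list[dict]:
    """Keep every system message plus the last `keep_last` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-keep_last:] if keep_last > 0 else []
    return system + recent


def compact_history(
    messages: list[dict],
    summarize: Callable[[list[dict]], str],
    max_messages: int = 40,
    keep_recent: int = 10,
) -> list[dict]:
    """Once history exceeds `max_messages`, replace older turns with one summary."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= max_messages:
        return messages  # still small enough, nothing to do
    older, recent = rest[:-keep_recent], rest[-keep_recent:]
    note = {
        "role": "system",
        "content": "Summary of earlier conversation:\n" + summarize(older),
    }
    return system + [note] + recent
```

Calling `compact_history` after each turn keeps the list bounded: it is a no-op until the threshold is crossed, then the older turns shrink to a single summary message while the most recent ones stay verbatim.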
Layer 2: Long-Term Memory (Persistent Storage)
Long-term memory lets the Agent remember across sessions — your preferences, project structure, past task results.
Simplest Implementation: File Storage
import json
import os
from datetime import datetime

MEMORY_FILE = "agent_memory.json"

def load_memory() -> dict:
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            return json.load(f)
    return {"facts": [], "preferences": {}}

def save_fact(key: str, value: str):
    memory = load_memory()
    memory["facts"].append({"key": key, "value": value, "time": datetime.now().isoformat()})
    with open(MEMORY_FILE, "w") as f:
        json.dump(memory, f, indent=2)

def get_relevant_context(query: str = "") -> str:
    """On startup, inject memory into the system prompt."""
    # `query` is unused here: this simple version returns every stored fact.
    memory = load_memory()
    facts = "\n".join([f"- {f['key']}: {f['value']}" for f in memory["facts"]])
    return f"Known user information:\n{facts}"
Usage: load memory at the start of each conversation and inject it into the system message:
system_prompt = f"""You are a helpful assistant.
{get_relevant_context()}
Answer user questions based on known information."""
Advanced: Structured Memory
For more complex scenarios, use SQLite with categorized memories:
CREATE TABLE memory (
    id INTEGER PRIMARY KEY,
    category TEXT,          -- 'preference', 'project', 'person', 'fact'
    key TEXT,
    value TEXT,
    importance REAL,        -- 0.0 to 1.0, determines if injected into context
    created_at TIMESTAMP,
    last_accessed TIMESTAMP
);
On injection, only select high-importance or recently accessed memories to prevent context bloat.
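One way to sketch that selection step in Python with the standard `sqlite3` module. The function names, the thresholds (`min_importance`, `recent_seconds`, `limit`), and the choice to store timestamps as Unix epoch seconds are assumptions for illustration, not part of the schema above.

```python
import sqlite3
import time
from typing import Optional

# Same columns as the table above; IF NOT EXISTS makes setup repeatable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS memory (
    id INTEGER PRIMARY KEY,
    category TEXT,
    key TEXT,
    value TEXT,
    importance REAL,
    created_at TIMESTAMP,
    last_accessed TIMESTAMP
)
"""


def select_memories(
    conn: sqlite3.Connection,
    min_importance: float = 0.7,
    recent_seconds: float = 7 * 24 * 3600,
    limit: int = 20,
    now: Optional[float] = None,
) -> list[tuple]:
    """Return (category, key, value) rows that are important OR recently accessed."""
    now = time.time() if now is None else now
    return conn.execute(
        """
        SELECT category, key, value FROM memory
        WHERE importance >= ? OR last_accessed >= ?
        ORDER BY importance DESC, last_accessed DESC
        LIMIT ?
        """,
        (min_importance, now - recent_seconds, limit),
    ).fetchall()


def format_memories(rows: list[tuple]) -> str:
    """Render the selected rows as a bullet list for the system prompt."""
    return "\n".join(f"- [{cat}] {key}: {value}" for cat, key, value in rows)
```

The `LIMIT` clause acts as a hard cap, so even a large table contributes a bounded number of lines to the prompt.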
Layer 3: RAG Memory (Retrieval-Augmented Generation)
The first two layers are great for "meta-information" — preferences, state, facts. But when you have large volumes of documents — codebases, manuals, research papers — you need RAG.
The RAG pipeline:
- Index: Split documents into chunks, convert them to vectors with an embedding model, and store them in a vector database
- Retrieve: When the user asks, convert the question to a vector, search for the most similar document chunks
- Generate: Inject retrieved chunks into the prompt, let the model answer based on this information
import chromadb
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")
db = chromadb.PersistentClient(path="./agent_rag_db")
collection = db.get_or_create_collection("knowledge_base")

# 1. Index — store documents
def index_document(doc_id: str, content: str):
    chunks = [content[i:i+500] for i in range(0, len(content), 500)]
    for i, chunk in enumerate(chunks):
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=chunk
        ).data[0].embedding
        collection.add(
            ids=[f"{doc_id}_{i}"],
            embeddings=[embedding],
            documents=[chunk]
        )

# 2. Retrieve — find relevant snippets
def retrieve(query: str, top_k: int = 5) -> str:
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return "\n\n".join(results["documents"][0])

# 3. Inject into Agent — append to system prompt
rag_context = retrieve(user_input)
system_prompt += f"\n\nReference knowledge base:\n{rag_context}"
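Note that `retrieve` returns all `top_k` chunks regardless of how similar they are to the query. A hedged sketch of a relevance filter follows. It assumes the query result also carries a `distances` list parallel to `documents` (passing `include=["documents", "distances"]` to `collection.query` makes that explicit) and that smaller distances mean closer matches. The function name and the `max_distance` default are placeholders: a usable threshold depends on the embedding model and the distance metric, so it has to be tuned on real queries.

```python
def filter_by_distance(results: dict, max_distance: float = 0.8) -> list[str]:
    """Keep only the chunks whose distance to the query is within the threshold."""
    docs = results["documents"][0]    # chunks for the first (only) query
    dists = results["distances"][0]   # matching distances, same order
    return [doc for doc, dist in zip(docs, dists) if dist <= max_distance]
```

If the filter returns an empty list, it is usually better to inject no knowledge base section at all than to pad the prompt with weak matches.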
How the Three Layers Work Together
In practice, the three layers aren't mutually exclusive — each has its role:
| Scenario | Which Layer |
| --- | --- |
| Remember what the user just said | Short-term (message list) |
| Remember user preferences, project paths | Long-term (JSON/SQLite) |
| Look up technical docs, research papers | RAG (vector database) |
| Find relevant functions in a codebase | RAG + directory tree index |
| Summarize completed tasks | Long-term + periodic summarization |
A Complete Memory-Aware Agent
Integrating all three layers into our Agent loop:
# load_recent_history, extract_and_save_facts, and TOOLS are assumed to be
# defined elsewhere (not shown here).
def run_agent_with_memory(user_input: str, user_id: str = "default"):
    # Load long-term memory: the formatted fact list from the Layer 2 JSON store.
    # That store is a single shared file; key it by user_id for multi-user setups.
    long_term = get_relevant_context(user_input)
    # RAG retrieval
    rag_context = retrieve(user_input)
    # Build system prompt (three-layer fusion)
    system = f"""You are the user's AI assistant.
## User Preferences
{long_term}
## Knowledge Base Reference
{rag_context}
## Conversation Guidelines
- Prioritize information from the knowledge base
- Remember user preferences for future conversations"""
    messages = [{"role": "system", "content": system}]
    messages.extend(load_recent_history(user_id)[-20:])  # Short-term
    messages.append({"role": "user", "content": user_input})
    # ReAct loop (unchanged)
    for turn in range(10):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            # Proactively save newly learned facts
            extract_and_save_facts(msg.content, user_id)
            return msg.content
        # ... execute tools ...
    return "Max turns reached"
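The loop above calls `load_recent_history`, which this article does not define. Here is a minimal sketch of one possible implementation, assuming one JSON Lines file per user. The `append_history` helper, the `agent_history` directory, and the `history_dir` parameter are hypothetical, and `user_id` is assumed to be safe to use as a filename.

```python
import json
from pathlib import Path

HISTORY_DIR = Path("agent_history")  # hypothetical storage location


def append_history(
    user_id: str, role: str, content: str, history_dir: Path = HISTORY_DIR
) -> None:
    """Append one message to the user's history file (one JSON object per line)."""
    history_dir.mkdir(parents=True, exist_ok=True)
    with (history_dir / f"{user_id}.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "content": content}) + "\n")


def load_recent_history(
    user_id: str, limit: int = 20, history_dir: Path = HISTORY_DIR
) -> list[dict]:
    """Return up to `limit` of the most recent messages, oldest first."""
    path = history_dir / f"{user_id}.jsonl"
    if not path.exists() or limit <= 0:
        return []
    with path.open(encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    return [json.loads(line) for line in lines[-limit:]]
```

An append-only file keeps each write cheap, since saving a message never requires rewriting the earlier ones. To persist the conversation, the agent would call `append_history` for the user message and again for the final answer.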
Common Pitfalls
- Memory bloat — indiscriminately saving everything bloats the context. Fix: add importance scoring, only inject high-score memories.
- Stale memory — old preferences no longer apply but keep getting injected. Fix: add TTL expiration, or periodically ask the model to re-confirm.
- Privacy leaks — sensitive information stored in memory gets exposed in later conversations. Fix: mark sensitive info as non-persistent, or encrypt storage.
- RAG noise — retrieving irrelevant chunks misleads the model. Fix: add relevance threshold filtering, or let the model judge whether retrieved content is actually useful.
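As an illustration of the TTL fix for stale memory, here is a small sketch that filters the `facts` list from the JSON store before it is injected. It assumes each fact carries the ISO-format `time` field written by `save_fact`. The function name and the 90-day default are assumptions to tune per use case.

```python
from datetime import datetime, timedelta
from typing import Optional


def drop_expired(
    facts: list[dict], ttl_days: int = 90, now: Optional[datetime] = None
) -> list[dict]:
    """Return only the facts saved within the last `ttl_days` days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=ttl_days)
    return [f for f in facts if datetime.fromisoformat(f["time"]) >= cutoff]
```

Filtering at read time leaves the stored file untouched, so an expired fact can still be recovered or re-confirmed later.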
📖 Next: Agent Error Recovery & Self-Correction — teaching your Agent to fix its own mistakes
Frequently Asked Questions
Q: Core difference between short-term and long-term memory?
A: Short-term lives in the current session's message list. Long-term persists to files or databases across sessions.
Q: When RAG vs simple long-term memory?
A: Long-term for structured facts/preferences. RAG for large volumes of unstructured documents. As a rough rule of thumb: under ~1,000 records, long-term memory is enough; beyond ~10,000, use RAG.
Q: Which vector database?
A: ChromaDB is open-source and runs locally — great for prototypes. Pinecone is managed cloud for production. Start with ChromaDB.
Q: How to prevent memory bloat?
A: Three strategies: importance scoring, TTL expiration, and summarization compression. Don't store everything indiscriminately.