Agents and humans share a key trait: without memory, nothing gets done.
Imagine an Agent helping you with a three-day data analysis project. Day one, you tell it where the data lives and what chart style you prefer. Day two, you ask it to continue — if it remembers nothing from day one, you have to repeat every instruction.
That's what memory systems solve. This article breaks down three Agent memory mechanisms, from concept to code.
| Type | Analogy | Lifetime | Implementation |
|---|---|---|---|
| Short-Term | Working memory | Current session | Message list |
| Long-Term | Notebook | Cross-session persistent | Database / files |
| RAG | Library | On-demand retrieval | Vector database |
Short-term memory is simply the messages list. Our Agent from the previous article already uses it:
```python
messages = [
    {"role": "system", "content": "You are..."},
    {"role": "user", "content": user_input},
    {"role": "assistant", "content": "Let me search..."},
    {"role": "tool", "content": "Search results..."},
    {"role": "assistant", "content": "Based on the search..."},
]
```
The problem: model context windows are finite. What happens when the conversation grows too long?
1. Sliding Window — keep only the last N messages. Simple and brute-force, but it loses key information from early in the conversation (see the sketch after this list).
2. Smart Summarization — periodically compress the conversation history using the model itself. Turn "user prefers blue charts, data is in the data/ directory" into a compact system prompt that replaces the verbose raw dialogue:

```python
def compress_history(messages, client):
    """Compress conversation history into a one-paragraph summary."""
    # Tool and assistant messages can carry empty or non-string content; guard for it
    lines = [f"{m['role']}: {str(m.get('content') or '')[:200]}" for m in messages[-20:]]
    summary_prompt = "Summarize the key information from this conversation:\n" + "\n".join(lines)
    summary = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": summary_prompt}],
    ).choices[0].message.content
    return summary
```
3. Tiered Window — keep recent messages raw, compress slightly older ones into summaries, drop the oldest. This balances context fidelity against efficiency.
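Here is a minimal sketch of strategies 1 and 3, reusing `compress_history` from above; the `keep_recent` cutoff is an arbitrary placeholder, and the code assumes the system message sits at index 0:

```python
def sliding_window(messages, keep_recent: int = 10) -> list:
    """Strategy 1: keep the system message plus the last N messages."""
    return messages[:1] + messages[1:][-keep_recent:]

def tiered_window(messages, client, keep_recent: int = 10) -> list:
    """Strategy 3: summarize older messages, keep recent ones verbatim."""
    if len(messages) <= keep_recent + 1:
        return messages
    system, older, recent = messages[:1], messages[1:-keep_recent], messages[-keep_recent:]
    summary = compress_history(older, client)
    note = {"role": "system", "content": f"Earlier conversation summary: {summary}"}
    return system + [note] + recent
```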
Long-term memory lets the Agent remember across sessions — your preferences, project structure, past task results.
```python
import json
import os
from datetime import datetime

MEMORY_FILE = "agent_memory.json"

def load_memory() -> dict:
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            return json.load(f)
    return {"facts": [], "preferences": {}}

def save_fact(key: str, value: str):
    memory = load_memory()
    memory["facts"].append({"key": key, "value": value, "time": datetime.now().isoformat()})
    with open(MEMORY_FILE, "w") as f:
        json.dump(memory, f, indent=2)

def get_relevant_context() -> str:
    """On startup, format stored memory for injection into the system prompt."""
    memory = load_memory()
    facts = "\n".join(f"- {f['key']}: {f['value']}" for f in memory["facts"])
    return f"Known user information:\n{facts}"
```
Usage: load memory at the start of each conversation and inject it into the system message:
system_prompt = f"""You are a helpful assistant.
{get_relevant_context()}
Answer user questions based on known information."""
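The write side is symmetric: call `save_fact` whenever the user states something durable. The keys below are illustrative, echoing the day-one example:

```python
save_fact("chart_style", "prefers blue charts")
save_fact("data_location", "data/ directory")
```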
For more complex scenarios, use SQLite with categorized memories:
```sql
CREATE TABLE memory (
    id INTEGER PRIMARY KEY,
    category TEXT,          -- 'preference', 'project', 'person', 'fact'
    key TEXT,
    value TEXT,
    importance REAL,        -- 0.0 to 1.0, determines if injected into context
    created_at TIMESTAMP,
    last_accessed TIMESTAMP
);
```
On injection, select only high-importance or recently accessed memories to prevent context bloat.
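A minimal selection sketch, assuming the schema above lives in a SQLite file named `agent_memory.db`; the 0.7 threshold and seven-day window are placeholder values to tune:

```python
import sqlite3

def load_important_memories(db_path: str = "agent_memory.db", limit: int = 20) -> list:
    """Pull only memories worth injecting: high importance first, then recently used."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT category, key, value FROM memory
        WHERE importance >= 0.7
           OR last_accessed >= datetime('now', '-7 days')
        ORDER BY importance DESC, last_accessed DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return [f"- [{category}] {key}: {value}" for category, key, value in rows]
```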
The first two layers are great for "meta-information" — preferences, state, facts. But when you have large volumes of documents — codebases, manuals, research papers — you need RAG.
The RAG pipeline:
```python
import chromadb
from openai import OpenAI

client = OpenAI(base_url="...", api_key="...")
db = chromadb.PersistentClient(path="./agent_rag_db")
collection = db.get_or_create_collection("knowledge_base")

# 1. Index — split documents into chunks and store their embeddings
def index_document(doc_id: str, content: str):
    # Naive fixed-size chunking; production systems split on semantic boundaries
    chunks = [content[i:i + 500] for i in range(0, len(content), 500)]
    for i, chunk in enumerate(chunks):
        embedding = client.embeddings.create(
            model="text-embedding-3-small", input=chunk
        ).data[0].embedding
        collection.add(
            ids=[f"{doc_id}_{i}"],
            embeddings=[embedding],
            documents=[chunk],
        )

# 2. Retrieve — find the snippets most relevant to the query
def retrieve(query: str, top_k: int = 5) -> str:
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return "\n\n".join(results["documents"][0])

# 3. Inject into the Agent — append retrieved snippets to the system prompt
rag_context = retrieve(user_input)
system_prompt += f"\n\nReference knowledge base:\n{rag_context}"
```
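Indexing is a one-time (or per-update) step at startup; the paths below are purely illustrative:

```python
# Index project documents once at startup (paths are illustrative)
for path in ["docs/architecture.md", "docs/style_guide.md"]:
    with open(path) as f:
        index_document(path, f.read())
```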
In practice, the three layers aren't mutually exclusive — each has its role:
| Scenario | Which Layer |
|---|---|
| Remember what the user just said | Short-term (message list) |
| Remember user preferences, project paths | Long-term (JSON/SQLite) |
| Look up technical docs, research papers | RAG (vector database) |
| Find relevant functions in a codebase | RAG + directory tree index |
| Summarize completed tasks | Long-term + periodic summarization |
Integrating all three layers into our Agent loop:
```python
def run_agent_with_memory(user_input: str, user_id: str = "default"):
    # Load long-term memory (formatted by get_relevant_context above)
    long_term = get_relevant_context()
    # RAG retrieval
    rag_context = retrieve(user_input)

    # Build the system prompt (three-layer fusion)
    system = f"""You are the user's AI assistant.

## User Preferences
{long_term}

## Knowledge Base Reference
{rag_context}

## Conversation Guidelines
- Prioritize information from the knowledge base
- Remember user preferences for future conversations"""

    messages = [{"role": "system", "content": system}]
    messages.extend(load_recent_history(user_id)[-20:])  # Short-term (helper sketched below)
    messages.append({"role": "user", "content": user_input})

    # ReAct loop (unchanged; TOOLS comes from the previous article)
    for turn in range(10):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            # Proactively save newly learned facts (helper sketched below)
            extract_and_save_facts(msg.content, user_id)
            return msg.content
        messages.append(msg)
        # ... execute tools and append their results as "tool" messages ...
    return "Max turns reached"
```
📖 Next: Agent Error Recovery & Self-Correction — teaching your Agent to fix its own mistakes