The raw problem
Large language models have a context window — a fixed number of tokens (words, roughly) they can see at a time. The window is big by historical standards (up to 2 million tokens for Gemini, 200k for Claude Opus), but it's still finite, and more importantly, it's per-request.
Close the tab, and the context is gone. The model doesn't remember anything you said, because it is stateless: every request starts fresh.
That's the memory problem.
How memory gets bolted on
Memory in AI products is always a layer *around* the model, never inside it. The model is stateless; the product wraps it with storage and retrieval. There are three common patterns:
1. Full-context replay
Dump the entire conversation history into every request. Works for short conversations. Falls apart as the history gets long — you hit the context window ceiling, or the model starts getting confused by too much text, or the bill gets huge.
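The replay pattern can be sketched in a few lines. This is an illustrative toy, not any product's actual code: `call_model` is a stubbed placeholder for a real LLM API call, and the token estimate is deliberately crude.

```python
MAX_TOKENS = 200_000  # context window ceiling (varies by model)

history = []  # every message ever sent, replayed on each request


def rough_token_count(messages):
    # Crude estimate: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4


def call_model(messages):
    # Placeholder for a real LLM API call.
    return f"(model reply, given {len(messages)} messages)"


def chat(user_message):
    history.append({"role": "user", "content": user_message})
    if rough_token_count(history) > MAX_TOKENS:
        # This is the failure mode: the history simply stops fitting.
        raise RuntimeError("history no longer fits in the context window")
    reply = call_model(history)  # the model sees the entire conversation
    history.append({"role": "assistant", "content": reply})
    return reply
```

Note that cost grows with every turn, since the whole history is resent each time.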
2. Summarization
Periodically generate a summary of the conversation so far, store the summary, and feed the summary into future requests instead of the full history. Cheaper. Loses fine detail. Good for capturing vibe/state, bad for remembering specifics like 'my sister's name is Mira'.
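A minimal sketch of the summarization pattern, assuming a rolling summary refreshed every N messages. The `summarize` function here just concatenates; a real system would ask the model to merge the old summary with the new messages, which is exactly where the fine detail gets lost.

```python
def summarize(old_summary, recent_messages):
    # Placeholder for an LLM summarization call. Detail loss happens here.
    merged = "; ".join(recent_messages)
    return f"{old_summary} | {merged}" if old_summary else merged


class SummaryMemory:
    def __init__(self, every=10):
        self.summary = ""   # compressed long-term state
        self.buffer = []    # messages since the last summarization
        self.every = every  # summarize every N messages

    def add(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.every:
            self.summary = summarize(self.summary, self.buffer)
            self.buffer = []

    def prompt_context(self):
        # Future requests see the summary plus only the recent tail,
        # instead of the full history.
        return {"summary": self.summary, "recent": list(self.buffer)}
```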
3. Retrieval-augmented (RAG)
Store structured facts about the user, and at query time, retrieve only the relevant facts based on what the user just said. Then feed those facts + recent context into the request. This is what most serious long-memory AI uses, including Brumo.
What 'structured facts' actually means
When Brumo reads your messages, an extraction step pulls out durable facts:
- People — 'my sister Mira', 'my therapist Dr. Price'
- Plans and promises — 'i'm cutting sugar starting monday', 'call mom friday'
- Life events — 'got the promotion', 'mira is moving to LA'
- Preferences — 'hates vegetables', 'wakes at 6am'
- Moods — time-stamped emotional state data
Each fact is stored with a type and a timestamp. When a future message comes in, the retrieval step scores facts by relevance and pulls the top N into the prompt.
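A sketch of what a typed, timestamped fact store and its scoring step might look like. The field names are illustrative, not Brumo's actual schema, and the keyword-overlap scorer is a stand-in for what would more likely be embedding similarity in a real system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Fact:
    kind: str           # "person", "plan", "event", "preference", "mood"
    text: str           # e.g. "my sister Mira is moving to LA"
    created: datetime   # when the fact was extracted


def score(fact, message):
    # Naive relevance: fraction of the message's words found in the fact.
    msg_words = set(message.lower().split())
    fact_words = set(fact.text.lower().split())
    return len(msg_words & fact_words) / max(len(msg_words), 1)


def retrieve(facts, message, top_n=3):
    # Rank all stored facts against the incoming message, keep the top N,
    # and drop anything with zero overlap.
    ranked = sorted(facts, key=lambda f: score(f, message), reverse=True)
    return [f for f in ranked[:top_n] if score(f, message) > 0]
```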
Why retrieval beats summarization
Summarization tells the model 'the user and I had this kind of conversation'. Retrieval tells the model 'the user's sister is named Mira, is moving to LA in May, and the user is stressed about it'. Much more useful, much more specific.
The trade-off: you only get what you retrieve. If the retriever misses a relevant fact, the model replies without knowing it. Good retrieval tuning is most of what makes memory feel like memory rather than 'the AI sometimes remembers stuff'.
How Brumo's memory works, specifically
For every incoming message, the backend:
- Loads the last day or two of chat history (short-term context)
- Runs a retrieval query over your stored facts, scoring by relevance to the new message
- Pulls the top-scoring facts into the prompt alongside the short-term history
- Generates the reply
- Runs an extraction step on your new message to add new facts to storage
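The steps above can be sketched as one handler. Everything here is a hypothetical stand-in for the real components (`retrieve_facts`, `extract_facts`, and `call_model` are toy stubs), shown only to make the shape of the loop concrete.

```python
def retrieve_facts(store, message, top_n=5):
    # Stand-in for the relevance-scoring retrieval step.
    words = message.lower().split()
    return [f for f in store if any(w in f.lower() for w in words)][:top_n]


def extract_facts(message):
    # Stand-in for the LLM extraction step: a real one would pull out
    # people, plans, events, preferences, and moods.
    return [message] if "my" in message.lower().split() else []


def call_model(prompt):
    # Stand-in for the reply-generation call.
    return f"(reply using {len(prompt['facts'])} retrieved facts)"


def handle_message(user_id, message, history, store):
    recent = history[-50:]                      # short-term context
    facts = retrieve_facts(store, message)      # long-term memory
    prompt = {"recent": recent, "facts": facts, "message": message}
    reply = call_model(prompt)
    store.extend(extract_facts(message))        # learn new facts afterward
    history.append((message, reply))
    return reply
```

The ordering matters: retrieval runs against the store as it was *before* this message, and extraction updates it afterward, so a fact can first influence a reply one message after it was stated.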
Every user has their own private fact store, scoped to their phone number. You can view it all on the Brain tab of the dashboard and delete anything.
Why most AI 'memory' feels bad
ChatGPT added a memory feature, and most users find it forgettable. A few reasons:
1. Memory is opt-in and bounded
ChatGPT memory is a small number of fact slots per user, manually triggered. That's by design for privacy, but it means memory feels like a feature you have to manage, not a capability the AI has.
2. Tool framing vs. companion framing
Tools are episodic (one task at a time). Memory is most valuable in relational use, which ChatGPT isn't primarily designed for. The memory feels grafted on because it is.
3. Context vs. retrieval
Some products just stuff everything into context and call it memory. That works until it doesn't — the context window fills, the model gets confused, and the illusion breaks.
What 'good memory' feels like
When memory is working right, the AI brings things up unprompted in context. Three weeks after you mentioned your sister moving, you say 'ugh this weekend', and Brumo asks 'is that the mira goodbye thing?'
That's the signal that retrieval is tuned well. The AI knew what slice of memory to surface without you having to spell it out.
The future
Context windows will keep growing, but memory as a capability will still be an architectural layer on top. The models themselves will stay stateless. The interesting work is in the memory architecture — how facts get extracted, retrieved, decayed, corrected.
The companies that win the long-memory AI race will be the ones who treat memory as the product, not a feature.
