Every Task Deserves Its Own Memory Harness

Wenbo Pan, Shujie Liu, Xiangyang Zhou, Shiwei Zhang, Wanlu Shi, Mirror Xu, Xiaohua Jia — City University of Hong Kong · Microsoft

Large language model agents rely on a memory harness to write, organize, retrieve, and use past experience. A harness that works well for one task often fails on another, because conversation, embodied planning, and specialized reasoning each demand different storage and retrieval behavior. M★ automatically discovers a task-optimized memory harness for each task through reflective code evolution.

Rather than choosing among a fixed set of designs, M★ represents a memory harness as an executable Python program and searches over that program directly. Three parts of the program are optimized together:

Schema — dataclasses (KnowledgeItem, Query) deciding what the memory stores and how it is queried.
Logic — the write() and read() methods, over a toolkit of SQLite, ChromaDB, and a budget-limited LLM.
Instruction — prompt constants that steer how the agent summarizes, queries, and answers.

The task agent that uses the memory is held fixed, so any change in score is attributable to the memory. Across four tasks spanning conversation, embodied planning, and expert reasoning, M★ improves over static memory harnesses on every task, and the programs it discovers are structurally distinct across domains. The Evolution Loop tab shows how programs are discovered; the Inspector tab opens the recorded runs program by program.
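The controlled comparison described above can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the M★ evaluation code: `fixed_agent` plays the frozen task agent, `KeepEverything` and `KeepLastOnly` are two made-up memories, and `EPISODES` is invented data. Because the agent never changes, the gap between the two scores comes only from the memory.

```python
from typing import Callable, Protocol


class Memory(Protocol):
    """Minimal memory interface assumed for this sketch only."""
    def write(self, text: str) -> None: ...
    def read(self, query: str) -> str: ...


class KeepEverything:
    """Toy memory that returns every note it has seen."""
    def __init__(self) -> None:
        self.notes: list[str] = []

    def write(self, text: str) -> None:
        self.notes.append(text)

    def read(self, query: str) -> str:
        return "\n".join(self.notes)


class KeepLastOnly:
    """Toy memory that remembers only the most recent note."""
    def __init__(self) -> None:
        self.last = ""

    def write(self, text: str) -> None:
        self.last = text

    def read(self, query: str) -> str:
        return self.last


def fixed_agent(question: str, context: str) -> str:
    """Stand-in for the frozen task agent: 'yes' if the keyword is in the context."""
    return "yes" if question in context else "no"


def score(memory_factory: Callable[[], Memory], episodes) -> float:
    """Accuracy of the fixed agent when paired with one kind of memory."""
    correct = 0
    for notes, question, expected in episodes:
        memory = memory_factory()          # fresh memory per episode
        for note in notes:
            memory.write(note)
        answer = fixed_agent(question, memory.read(question))
        correct += answer == expected
    return correct / len(episodes)


# Invented episodes: (notes written to memory, question keyword, expected answer).
EPISODES = [
    (["key is under the mat", "door is red"], "key", "yes"),
    (["door is red", "cat is asleep"], "cat", "yes"),
    (["door is red"], "key", "no"),
]

results = {cls.__name__: score(cls, EPISODES) for cls in (KeepEverything, KeepLastOnly)}
```

Here `KeepEverything` answers all three episodes correctly, while `KeepLastOnly` misses the first one because the relevant note was overwritten.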

The interface every program implements

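from dataclasses import dataclass

# Instruction: prompt constants.
INSTRUCTION_KNOWLEDGE_ITEM = "..."
INSTRUCTION_QUERY          = "..."
INSTRUCTION_RESPONSE       = "..."

# Schema: what the memory stores.
@dataclass
class KnowledgeItem:
    summary: str

# Schema: how the memory is queried.
@dataclass
class Query:
    query_text: str

# Logic: write() and read() over the toolkit.
class KnowledgeBase:
    def __init__(self, toolkit):
        # Collection handle obtained from the toolkit's ChromaDB backend.
        self.col = toolkit.chroma.collection("kb")

    def write(self, item, raw_text=""):
        # Store the item's summary; raw_text is accepted but unused here.
        self.col.add(documents=[item.summary])

    def read(self, query):
        # Retrieve up to five matching documents and join them into one string.
        r = self.col.query([query.query_text], n=5)
        return "\n".join(r["documents"][0])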
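To make the idea of a structurally different program concrete, here is a hedged sketch of a variant that keeps the same `write()`/`read()` shape but changes both schema and logic. It is not a program discovered by M★. The `topic` field, the `SqliteKnowledgeBase` name, and the keyword-matching strategy are assumptions for illustration, and it uses Python's standard `sqlite3` module directly because the toolkit's SQLite API is not shown here.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class KnowledgeItem:
    summary: str
    topic: str = ""   # hypothetical extra schema field


@dataclass
class Query:
    query_text: str
    topic: str = ""   # hypothetical optional filter


class SqliteKnowledgeBase:
    """Illustrative variant: keyword lookup in SQLite instead of vector search."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute("CREATE TABLE IF NOT EXISTS kb (topic TEXT, summary TEXT)")

    def write(self, item, raw_text=""):
        # Parameterized insert; raw_text is accepted for interface parity but unused.
        self.conn.execute(
            "INSERT INTO kb (topic, summary) VALUES (?, ?)",
            (item.topic, item.summary),
        )

    def read(self, query):
        # Substring match on the summary, optionally restricted to one topic,
        # newest rows first, capped at five results.
        sql = "SELECT summary FROM kb WHERE summary LIKE ?"
        params = [f"%{query.query_text}%"]
        if query.topic:
            sql += " AND topic = ?"
            params.append(query.topic)
        sql += " ORDER BY rowid DESC LIMIT 5"
        rows = self.conn.execute(sql, params).fetchall()
        return "\n".join(row[0] for row in rows)


kb = SqliteKnowledgeBase(sqlite3.connect(":memory:"))
kb.write(KnowledgeItem("Alice moved to Paris in 2021", topic="alice"))
kb.write(KnowledgeItem("Bob adopted a dog", topic="bob"))
kb.write(KnowledgeItem("Alice started a new job", topic="alice"))
answer = kb.read(Query("Alice", topic="alice"))
```

The point of the sketch is that a single extra schema field changes what `read()` can filter on, which is the kind of coupled schema-and-logic change a search over whole programs can make.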

Figure. Left: one turn of the reflective code-evolution loop — sample a parent, evaluate it, mutate it from its failures, and pass quality gates. Right: the phylogeny of evolved programs. Each completed turn of the loop grows one new program on each of the four task quadrants, in the order they were produced during the recorded runs.
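The four steps named in the caption (sample a parent, evaluate it, mutate it from its failures, pass quality gates) can be sketched as a small loop. This is a toy under stated assumptions, not the M★ implementation: a "program" is reduced to one integer `top_k`, the reflector is a hard-coded rule instead of an LLM, and `CASES`, `MAX_K`, the score-weighted sampling, and all function names are invented for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Program:
    top_k: int                      # toy stand-in for an entire memory program
    parent: Optional[int] = None    # index of the parent in the population
    score: float = 0.0


CASES = [1, 2, 2, 4, 6]   # invented cases: each needs at least this many retrieved items
MAX_K = 8                 # invented quality gate: retrieval budget


def evaluate(program):
    """Return (score, failures): fraction of cases solved and the unmet needs."""
    failures = [need for need in CASES if program.top_k < need]
    return 1 - len(failures) / len(CASES), failures


def mutate(program, failures):
    """Toy reflector: raise top_k just enough to fix the easiest failure."""
    return Program(top_k=min(failures) if failures else program.top_k)


def passes_gates(program):
    """Toy quality gate: reject programs outside the allowed budget."""
    return 1 <= program.top_k <= MAX_K


def evolve(seeds, iterations, rng):
    population = []
    for seed in seeds:
        seed.score, _ = evaluate(seed)
        population.append(seed)
    for _ in range(iterations):
        # 1. Sample a parent, favoring higher scores (small floor keeps all reachable).
        weights = [p.score + 0.05 for p in population]
        idx = rng.choices(range(len(population)), weights=weights)[0]
        parent = population[idx]
        # 2. Evaluate it to collect its failures.
        _, failures = evaluate(parent)
        # 3. Mutate it based on those failures.
        child = mutate(parent, failures)
        child.parent = idx
        # 4. Keep the child only if it passes the quality gates.
        if not passes_gates(child):
            continue
        child.score, _ = evaluate(child)
        population.append(child)
    return population


population = evolve([Program(1), Program(2), Program(3)], iterations=20, rng=random.Random(0))
best = max(population, key=lambda p: p.score)
```

Each entry's `parent` index records the tree structure, so the returned list is enough to draw a phylogeny like the one in the figure.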

[Interactive panels: best score over iterations; evolution tree with per-program source.]

Run record. Shown are the two runs for which per-iteration source has been released. Click any program to read its full Python; use “Diff vs parent” to see exactly what the reflector changed. The memory harness M★ discovers for conversational question answering is structurally unlike the one it discovers for embodied planning, although both begin from the same three seed programs.