Sibyl Capital’s File Memory Hits 95.6% on LongMemEval Benchmark

Author: Andrew
Published on:
Published in: AI

This is the kind of update that sounds amazing until you picture the real world around it. A “new memory architecture” that scores 95.6% on a benchmark, ships as three simple installs, turns on with one command, and plugs into popular AI coding tools? That’s the dream. And it’s also exactly how you sneak something powerful into people’s workflow before they’ve asked the boring question: what, exactly, are we letting this thing remember—and who gets to hold that memory?

Based on what’s been shared publicly, Sibyl Capital says it improved a file-based memory system and scored 95.6% on LongMemEval. That’s the headline flex: it can keep long-term context better than before. They also claim it’s now easy to use, packaged into three pip installs and activated with a single command, so it can slot into AI coding tools like Claude Code, Codex, and Hermes Agent.

Ease is not a small detail here. Ease is the whole story.

The hard part with “memory” in AI tools is not whether you can build it. The hard part is whether it should be on by default, and whether regular people can tell what’s going into it. A file-based approach sounds grounded, almost comforting, like “it’s just files on your machine.” But “file-based” can still mean a lot of things. Where are the files stored? What gets written? How long is it kept? What gets synced? What gets logged? If you can’t answer those without reading and tracing the code, then the feature is not “simple.” It’s just easy to install.

Benchmarks like LongMemEval are useful, but they also have a way of hypnotizing people. 95.6% feels like you’re basically done. The problem is that scoring well at remembering the right bits in a test is not the same thing as remembering the right bits in someone’s actual working life. Real work is messy. It’s passwords pasted into terminals. It’s customer names. It’s “quick, try this token.” It’s internal links. It’s proprietary code. It’s things you didn’t even realize were sensitive until they’re somewhere they shouldn’t be.
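
To make the "passwords pasted into terminals" problem concrete: a memory layer could at least scrub the obvious secrets before anything touches disk. Here is a minimal Python sketch of that idea. The patterns, the `redact` function, and the placeholder format are my own illustration, not anything Sibyl Capital has published, and three regexes will miss most of what a real secret scanner catches.

```python
import re

# Hypothetical rules for illustration only; real scanners ship far larger rule sets.
SECRET_PATTERNS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    ("aws_access_key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    # HTTP bearer tokens, e.g. "Bearer eyJhbGciOi...".
    ("bearer_token", re.compile(r"\bbearer\s+[a-z0-9._\-]{20,}", re.IGNORECASE)),
    # Assignments such as "password=hunter2" or "API_KEY: abc123".
    ("key_value_secret", re.compile(
        r"\b(password|passwd|token|api[_-]?key|secret)\s*[=:]\s*\S+",
        re.IGNORECASE,
    )),
]


def redact(text: str) -> tuple[str, list[str]]:
    """Replace every pattern match with a placeholder and report which rules fired."""
    hits = []
    for name, pattern in SECRET_PATTERNS:
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            hits.append(name)
    return text, hits
```

Calling `redact("export API_KEY=abc123 && make deploy")` returns the text with the assignment swapped for a placeholder, plus the names of the rules that matched. A filter like this cannot recognise a customer name or a sensitive sentence written in plain language, which is exactly the gap the benchmark number does not measure.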

Now add the other detail they highlighted: authentication options. Users can authenticate with a wallet signature, an email plus a terminal code, or even a tiny USDC transfer from a mobile wallet.

That part makes me uneasy.

Not because wallet login is automatically bad. But because it’s an indicator of where this might be headed: identity tied to usage, usage tied to a persistent memory layer, and a path toward monetizing or metering access. The “sub-cent USDC transfer” detail is clever—frictionless payments, global, no card forms. It also nudges the product into a zone where people treat it like infrastructure: always on, always connected, always accumulating history.

Imagine you’re a developer using an AI coding tool all day. You turn on this memory layer because it promises better context and fewer repeats. Week one, it’s great. The assistant stops asking the same questions. It remembers your preferences. It recalls that weird build step your repo needs. You feel faster.

Week four, you’re debugging a production incident at 2 a.m. You paste something you shouldn’t paste. Or you describe a customer situation in plain language. Or you share an internal endpoint. The tool is “helpful” and stores it because it’s doing what you asked: remember. Later, someone else on your team inherits your setup. Or you export your environment. Or the memory files get copied to a place you didn’t think about. Suddenly “better memory” looks like “better leakage.”

And if you’re not a developer, the risk is still there. Say you’re a founder using these tools to ship faster. You connect it, you feed it docs, you let it “learn your business.” It starts to feel like a private assistant. But the reality of many toolchains is that private turns into shared the moment you collaborate, or the moment you troubleshoot, or the moment you try to move machines. Convenience is how data changes hands.

To be fair, there’s a strong argument on the other side: long-term memory is the missing piece. Without it, these tools are goldfish. People waste time re-explaining everything. If Sibyl really made memory easy to add, and if it truly stays local and controllable, that’s a real win. A file-based approach could be the more responsible path compared to stuffing everything into some remote account by default. And giving multiple auth methods could just be about accessibility—letting different kinds of users get started quickly.

But the thing I don’t like is the default framing: “look at the score, look how easy it is.” When the selling point is “one command,” the product is silently asking users to skip the step where they decide what they’re comfortable with. Memory isn’t a normal feature. Memory is a policy decision. It changes how you behave because you stop thinking about what you’ve told the system. People get sloppy when they trust recall.

And there’s a bigger consequence if this trend keeps going: the tools that win won’t just be the ones that code well. They’ll be the ones that collect the richest personal context. That’s great for performance. It’s terrible for boundaries. The winner becomes whoever convinces you to pour your working life into their “memory,” then makes it painful to leave because leaving means forgetting.

I’m not saying Sibyl’s approach is wrong. I’m saying the bar should be higher than a benchmark score and a clean install. If this is going to sit inside the tools people use to write code, ship products, and handle real user data, then “what gets remembered, where it lives, and how you delete it” should be the headline, not the afterthought.
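
For a sense of what "headline, not afterthought" could look like in code, here is a small Python sketch of a file-backed memory where those three answers are visible in the interface. Everything in it is hypothetical (the `FileMemory` class, the `memory.jsonl` filename, the seven-day default) and it is not a description of how Sibyl's system works. The design choice it illustrates is forgetting by default: nothing is written unless the caller passes `remember=True`.

```python
import json
import time
from pathlib import Path


class FileMemory:
    """Hypothetical file-backed memory: nothing is stored unless the caller opts in."""

    def __init__(self, root, ttl_seconds=7 * 24 * 3600):
        # Where it lives: one plain-text file in a directory the user chose.
        self.path = Path(root) / "memory.jsonl"
        self.ttl_seconds = ttl_seconds
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def _load(self, now):
        """Read all entries from disk, dropping any that have expired."""
        if not self.path.exists():
            return []
        lines = self.path.read_text().splitlines()
        entries = [json.loads(line) for line in lines if line.strip()]
        return [e for e in entries if e["expires_at"] > now]

    def _save(self, entries):
        self.path.write_text("".join(json.dumps(e) + "\n" for e in entries))

    def observe(self, text, remember=False, now=None):
        """What gets remembered: only text the caller explicitly marks."""
        now = time.time() if now is None else now
        if not remember:
            return False  # forgetting is the default
        entries = self._load(now)  # expired entries are pruned on every write
        entries.append({"text": text, "expires_at": now + self.ttl_seconds})
        self._save(entries)
        return True

    def recall(self, now=None):
        """Return the text of every entry that has not expired."""
        now = time.time() if now is None else now
        return [e["text"] for e in self._load(now)]

    def forget_all(self):
        """How you delete it: remove the single file that holds everything."""
        self.path.unlink(missing_ok=True)
```

With this shape, `observe("some pasted token")` is a no-op, `observe("repo needs the codegen step first", remember=True)` writes one line to a file you can open in an editor, and `forget_all()` deletes that file. The `now` parameter exists only so expiry can be exercised without waiting a week.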

So here’s the question I actually care about: should AI memory tools be built so they remember by default, or should they be built so forgetting is the default and remembering is something you have to deliberately earn?
