Running Kimi K2.5 on RTX 3060 Using 768GB Optane at 4 t/s

Author: Andrew
Published on:
Published in: AI

This is either a glimpse of a more open AI future, or a really clever party trick that people are going to misunderstand on purpose.

Running a one‑trillion‑parameter model on a single RTX 3060 is the kind of headline that makes everyone’s brain short‑circuit. It reads like: “You don’t need a data center anymore.” And if you’ve been priced out of serious AI work, that’s an intoxicating idea.

But the fine print matters, because the fine print is basically the whole story.

From what’s been shared publicly, someone got the Kimi K2.5 model running on an RTX 3060 12GB GPU and hit over 4 tokens per second. The “how” is the real point: they paired the cheap consumer GPU with 768GB of retired Intel Optane memory. The Optane isn’t doing magic compute. It’s acting like a huge memory layer that can hold most of the model’s weights so the GPU doesn’t have to.
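To see why the memory pool is the whole trick, it helps to do the arithmetic. The sketch below is back-of-the-envelope Python, not the experimenter's actual configuration: the post doesn't say which quantization was used, so the bit widths are hypothetical, and the calculation counts only raw weights (no KV cache, activations, or runtime overhead).

```python
TOTAL_PARAMS = 1e12      # "one trillion parameters", as described in the article
GPU_VRAM_GB = 12         # RTX 3060 12GB
OPTANE_POOL_GB = 768     # Optane pool size from the article


def weight_footprint_gb(num_params, bits_per_weight):
    """Raw weight storage in GB (10^9 bytes); ignores all runtime overhead."""
    return num_params * bits_per_weight / 8 / 1e9


def placement(bits_per_weight):
    """Say which memory tier, if any, could hold every weight at once."""
    size_gb = weight_footprint_gb(TOTAL_PARAMS, bits_per_weight)
    if size_gb <= GPU_VRAM_GB:
        return "fits in VRAM"
    if size_gb <= OPTANE_POOL_GB:
        return "fits in the Optane pool"
    return "does not fit"


# Hypothetical quantization levels; the article does not state the real one.
for bits in (16, 8, 4):
    size = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit weights: {size:.0f} GB -> {placement(bits)}")
```

Under these assumptions, of the three widths tried only 4 bits per weight lands inside 768GB, and nothing comes remotely close to 12GB of VRAM. That is why the big pool is doing the heavy lifting.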

And the other “how” is architectural: this model is Mixture‑of‑Experts. It doesn’t use all trillion parameters for every token. It activates a smaller slice, reported as 32 billion parameters, while the rest sit there like a library you don’t open on every page.
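If “activates a smaller slice” sounds abstract, here is a toy version of the idea. This is a generic top‑k routing sketch in plain Python, not Kimi K2.5's actual router: the expert count, the scores, and the scalar “experts” are all made up for illustration, and a real model computes the router scores from each token with learned weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and normalize their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_scores, k):
    """Run only the chosen experts and blend their outputs by weight."""
    routes = route_top_k(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    used = [i for i, _ in routes]
    return output, used


# Hypothetical toy setup: 8 "experts" that each just scale their input.
experts = [lambda x, scale=n: scale * x for n in range(8)]
router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up scores

output, used = moe_forward(1.0, experts, router_scores, k=2)
print("experts used:", used, "of", len(experts))
print("share of parameters active in the reported figures:", 32e9 / 1e12)
```

Six of the eight toy experts never run for this input. Scale that idea up and you get the reported ratio: 32 billion active out of a trillion is about 3% of the model doing work on any given token.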

That combination is genuinely interesting. It’s also dangerously easy to turn into the wrong conclusion.

The right conclusion is: clever memory tricks plus sparse models can make “impossible” models usable on scrappy hardware, sometimes. The wrong conclusion is: trillion‑parameter models are now practical on your gaming PC.

Because look at what had to happen here. First, you need a very specific, very large pool of memory—768GB is not normal “I upgraded my RAM” territory. It’s a pile of retired enterprise hardware that most people can’t just grab. Second, you’re accepting a speed that might be fine for a demo, but not for the way people actually want to use these tools.

Four tokens per second sounds okay until you imagine the real scenario. Say you’re trying to write a long email, or debug code, or ask for a plan and then iterate. That back‑and‑forth matters more than raw “it runs.” If every response feels like waiting for an old laptop to open a big spreadsheet, you don’t build habits around it. You go back to whatever is faster, even if it’s less private or costs money.
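To put numbers on that wait, here is a small sketch. The response lengths are hypothetical, not measurements from the experiment, and the calculation ignores the time spent processing the prompt, so real waits would likely be longer.

```python
def wait_seconds(response_tokens, tokens_per_second=4.0):
    """Time to generate a response of the given length, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return response_tokens / tokens_per_second


# Hypothetical response sizes, chosen only to illustrate the scale.
examples = {"short answer": 100, "long email": 500, "code review": 1500}

for label, tokens in examples.items():
    print(f"{label}: {wait_seconds(tokens):.0f} s")
# prints:
# short answer: 25 s
# long email: 125 s
# code review: 375 s
```

Two minutes for a long email is tolerable once. It gets old fast when you are on your fifth revision.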

Still, I don’t want to dismiss this as a toy. I think the experiment is a warning shot to the “only giants can do this” story. It suggests a path where hardware that’s “obsolete” for one market becomes a weapon for another. Retired Optane sitting around as e‑waste becomes the thing that unlocks local models for small teams, hobbyists, and countries that can’t just rent huge clusters whenever they feel like it.

That’s not just geeky. That’s political.

If you can run a serious model locally, you don’t have to send your prompts, your customer data, your legal drafts, or your internal documents to someone else’s servers. A small clinic could keep patient notes in‑house. A company could let engineers query internal code without exposing it. A journalist could work with sensitive material without uploading it. These are boring, practical freedoms—and they matter more than flashy benchmarks.

But there’s another side, and it’s not comfortable: the same “AI on cheap hardware” story also lowers the cost for spam, scams, and harassment. If a person can run a capable model at home without guardrails, nobody can rate‑limit them. Nobody can kick them off a platform. That doesn’t mean “ban it.” It means we should stop pretending that access has no downside.

There’s also a subtle economic trap here. If everyone gets excited about “running a trillion parameters on a 3060,” a lot of people will waste time chasing the headline rather than choosing the right tool. They’ll buy random parts, fight drivers, tune settings, and end up with something that technically works but doesn’t fit their life. Meanwhile, the sensible move for many users might still be: run a smaller local model that’s actually fast, or pay for a hosted model when speed matters.

I’m also skeptical of how portable this setup really is. Optane is “retired” for a reason: Intel announced in 2022 that it was winding down the product line. Supplies won’t last forever. Support won’t get better over time. If this approach depends on a shrinking pool of odd hardware, it’s not a stable foundation. It’s a clever bridge.

But bridges matter. They change what people build next.

The bigger idea here isn’t “one RTX 3060 can do anything.” It’s that the bottleneck is shifting. Memory and movement of weights can matter as much as raw compute. Sparse models that “wake up” only the parts they need can change the economics. And once a few people prove a thing is possible, a lot more people start optimizing it, simplifying it, and making it less weird.
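One rough way to see the “movement of weights” point is to ask how much data has to move per token. The sketch below assumes the pessimistic case where every active weight is read from the slow memory tier once per token, with nothing cached in VRAM, and it assumes a hypothetical 4‑bit quantization. Neither assumption comes from the original experiment, so treat the outputs as illustrative ceilings, not measurements.

```python
ACTIVE_PARAMS = 32e9  # reported active parameters per token


def bytes_per_token(active_params, bits_per_weight):
    """Bytes of weights touched per generated token under the stated assumption."""
    return active_params * bits_per_weight / 8


def required_bandwidth_gbs(tokens_per_second, bits_per_weight,
                           active_params=ACTIVE_PARAMS):
    """Memory bandwidth (GB/s) needed to sustain a given generation speed."""
    return tokens_per_second * bytes_per_token(active_params, bits_per_weight) / 1e9


def max_tokens_per_second(bandwidth_gbs, bits_per_weight,
                          active_params=ACTIVE_PARAMS):
    """Speed ceiling if weight reads are the only bottleneck."""
    return bandwidth_gbs * 1e9 / bytes_per_token(active_params, bits_per_weight)


# Hypothetical 4-bit weights at the article's roughly 4 tokens per second.
print(required_bandwidth_gbs(4, 4), "GB/s needed if nothing stays in VRAM")

# Hypothetical bandwidth figure, just to show the inverse calculation.
print(max_tokens_per_second(32, 4), "tokens/s ceiling at 32 GB/s")
```

Notice that the GPU's compute speed never appears in these formulas. In this simplified picture, generation speed is set by how fast weights can be fetched, which is exactly why keeping frequently used weights on the GPU and shrinking the active slice both pay off.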

The tension is simple: do we want AI capability to concentrate in the hands of the few who can pay for endless compute, or spread outward through hacks like this that make power messier and harder to control?

If hardware tricks and sparse models keep pushing “big model, small machine” forward, what should we prioritize more: making that power widely available, or building new limits around how it can be used?
