Zyphra’s Zamba2-VL Speeds Vision-Language AI With Hybrid Mamba2

Author: Andrew
Published in: AI

This is the kind of AI progress I trust more than “it’s smarter now.” Making models faster to respond is a real improvement. It’s also the kind of improvement that quietly changes what people will build, because speed isn’t just a nice-to-have. Speed decides what feels usable, what feels magical, and what gets abandoned after a week.

Zyphra just released Zamba2-VL, a vision-language model. In plain terms: it can look at images and handle text, and it’s designed to start answering much faster. The headline claim is big: time-to-first-token, the delay between sending a prompt and seeing the first piece of the answer, is cut by about an order of magnitude. That’s not “a bit quicker.” That’s the difference between a tool that feels like a conversation and a tool that feels like a loading screen.
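If you want to check a time-to-first-token claim against your own setup, the measurement itself is simple. Here is a minimal Python sketch, assuming your model client exposes output as an iterable of tokens. The names `time_to_first_token` and `fake_stream` are hypothetical helpers for illustration, not part of any Zyphra API, and `fake_stream` only simulates a model with `time.sleep`.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream."""
    start = clock()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # The first token has arrived: this is the delay the user feels.
            ttft = clock() - start
        count += 1
    if ttft is None:
        raise ValueError("stream produced no tokens")
    total = clock() - start
    return ttft, total, count


def fake_stream(prefill_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model: waits, then yields tokens one by one."""
    time.sleep(prefill_s)  # simulated "think before you speak" delay
    for i in range(n_tokens):
        time.sleep(per_token_s)  # simulated per-token generation time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = time_to_first_token(fake_stream(0.2, 0.01, 20))
    print(f"first token after {ttft:.3f}s, {n} tokens in {total:.3f}s")
```

Reporting the first-token delay separately from the total time matters, because the two can move independently: a model can finish sooner overall and still feel slower if it sits silent at the start.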

The interesting part is how they did it. Zamba2-VL is a hybrid model. It mixes Mamba2-style state-space layers with a smaller number of transformer blocks. You don’t need to care about the math to care about the trade: they’re trying to keep enough of what makes modern models work, while ditching some of what makes them slow and expensive at the start of a response.

And that “start of a response” is where a lot of products secretly fail.

People talk about AI like it’s only about accuracy. But in real life, latency is accuracy’s older sibling. If the model takes too long to speak, you stop asking. If it answers quickly, you forgive a lot. You’ll ask again. You’ll integrate it into your day. You’ll start depending on it.

Imagine you’re using a phone app to read a document you just scanned. If it waits and waits before it even begins to answer, it doesn’t matter if the final answer is perfect. You’ll just manually search the PDF. Or you’ll give up. Now imagine the same app starts responding instantly. Even if it’s not the best model on earth, you’ll feel like you have a helper, not a chore.

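That’s why Zyphra’s aim of near-linear-time prefill matters. Prefill is the “think before you speak” phase: the model reads the whole prompt, image included, before it produces its first token. Near-linear means that work grows roughly in step with the size of the input instead of ballooning as the input gets longer. Cutting that time is like cutting the time a barista takes to even acknowledge you. It changes the whole vibe.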
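To make the scaling argument concrete, here is a toy cost model in Python. It is a sketch under loud assumptions: it counts abstract "units of work", not real FLOPs or milliseconds, it ignores everything except how cost grows with prompt length, and the `state_size` constant is a made-up number rather than a Zamba2-VL parameter. It does not model the hybrid itself, only the two extremes it mixes.

```python
from typing import Callable


def quadratic_prefill_cost(n_tokens: int) -> int:
    """Toy cost for full self-attention: every token interacts with every token."""
    return n_tokens * n_tokens


def linear_prefill_cost(n_tokens: int, state_size: int = 64) -> int:
    """Toy cost for a recurrent state-space scan: fixed work per token.

    state_size is a hypothetical constant chosen for illustration only.
    """
    return n_tokens * state_size


def growth_when_doubled(cost: Callable[[int], int], n_tokens: int) -> float:
    """How many times more work a prompt twice as long requires."""
    return cost(2 * n_tokens) / cost(n_tokens)


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000, 8_000):
        print(n, quadratic_prefill_cost(n), linear_prefill_cost(n))
    print("quadratic growth:", growth_when_doubled(quadratic_prefill_cost, 1_000))
    print("linear growth:", growth_when_doubled(linear_prefill_cost, 1_000))
```

In this toy model, doubling the prompt quadruples the quadratic cost but only doubles the linear one. That gap is why the start of a response is where long inputs hurt most.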

Where I get opinionated: this kind of release pushes the industry toward “good enough, everywhere” instead of “best in the cloud.” And I think that’s mostly good. On-device and edge use isn’t just a tech preference; it changes who holds power. If more of this work can happen on your own hardware, you’re less dependent on a remote service, less tied to constant connectivity, and potentially sending less data off your device.

Potentially. That’s the catch.

Faster local models can be a privacy win, but they can also become the default excuse not to talk about privacy at all. “Don’t worry, it’s on-device,” companies will say, while still logging prompts, still syncing, still collecting analytics, still nudging you into accounts. Speed can become a smokescreen. People love convenience, and convenience makes us lazy about asking hard questions.

Performance-wise, Zyphra isn’t claiming this model beats the biggest models across the board. Public reporting says it does well on certain tasks like counting and document question answering, but it’s not dominant on all metrics. I actually like that honesty, because it points to a more realistic future: lots of specialized models that feel fast and capable, without pretending they’re the single best brain.

But that also creates a new kind of mess. If you’re a developer picking a model, “fast” is a seductive metric. It’s easy to demo. It’s easy to sell internally. And it can hide the fact that the model is weaker in the exact corner case your users care about.

Say you’re building a tool for insurance claims where a user uploads a photo and asks what it shows. If the model replies instantly but misses a key detail, you’ve just traded trust for speed. Or say you’re building a study app that helps students interpret charts. If the model is great at counting items but shaky at reasoning about what the chart implies, you might accidentally create a confident wrong-answer machine that trains people to accept nonsense quickly.
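One way to keep “fast” from winning by default is to treat accuracy on your own hard cases as a gate and speed as the tiebreaker. The Python sketch below shows that selection rule. Everything in it is hypothetical: the `Candidate` fields, the `pick_model` helper, and the example numbers are invented for illustration and are not measurements of any real model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    ttft_ms: float               # measured time-to-first-token, in milliseconds
    corner_case_accuracy: float  # fraction correct on your own hard examples


def pick_model(candidates: List[Candidate], min_accuracy: float) -> Optional[Candidate]:
    """Pick the fastest model among those that clear the accuracy bar.

    Returns None when no candidate is accurate enough, so the caller has to
    face that fact instead of silently shipping the fastest option.
    """
    eligible = [c for c in candidates if c.corner_case_accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.ttft_ms)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    candidates = [
        Candidate("fast-but-shaky", ttft_ms=80.0, corner_case_accuracy=0.71),
        Candidate("slower-but-solid", ttft_ms=650.0, corner_case_accuracy=0.93),
    ]
    choice = pick_model(candidates, min_accuracy=0.90)
    print(choice.name if choice else "no model clears the bar")
```

The design choice is the order of operations: filter on the corner cases first, then optimize for latency. Ranking by speed first and checking quality later is how the confident wrong-answer machine gets shipped.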

There’s also the open-source angle. Zyphra released the models under Apache 2.0, with code and weights publicly available. That’s a real contribution. It means small teams can experiment without begging for access. It means researchers can inspect and adapt. It means the broader community can stress-test the claims.

It also means the model can be used by anyone, for anything. I’m not doing the pearl-clutching “open-source is dangerous” routine, because closed models are dangerous too. But speed plus accessibility does lower the barrier to mass use. If it becomes easier to build apps that read documents, understand images, and respond instantly, we’re going to see a flood of tools that feel authoritative even when they’re not.

The real question for me is whether this kind of architecture pushes the market toward better behavior, or just faster habits. If the default experience becomes instant answers about your screenshots, your forms, your receipts, and your photos, do we become more capable, or just more dependent on a system that can be confidently wrong at high speed?

So here’s what I actually want to know: when fast, open models like this spread into everyday apps, will we use that speed to bring AI closer to users and their privacy, or will we use it to ship more shallow “instant” helpers that people trust too much?

