MoonMath Open-Sources a HIP bf16 Attention Kernel That Beats AITER v3 on MI300X

This is the kind of thing that sounds boring until you realize what it implies: AMD’s fastest AI chips might be held back less by hardware and more by the software culture around them.

MoonMath AI says it has open-sourced a bf16 attention kernel for AMD’s MI300X, written in HIP rather than hand-tuned assembly, that still beats AMD’s own AITER v3 “on every shape and rounding mode” it tested. If that holds up, it’s embarrassing in the healthy way, the kind that forces a reset.

Because the usual story goes like this: if you want top speed, you pay the “assembly tax.” You hire the rare people who can write fragile, device-specific code. You accept that every new chip or compiler tweak might break something. You get performance, but you also get a codebase that only a few people on Earth can safely touch. That’s not a strategy, that’s a hostage situation.

MoonMath is basically arguing you can have most of the speed without building a giant pile of hand-written GCN assembly. Their kernel is “non-assembly” in the sense that it’s HIP, but with a twist: tiny one-instruction inline-assembly wrappers so they can pick exact opcodes while still letting the compiler do register allocation. That’s a very specific bet. It says: we don’t need to micromanage everything, we just need to control the few points where the compiler might choose the wrong instruction.
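To make the "one-instruction wrapper" idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not taken from MoonMath's code: the name `add_f32_exact`, the `SKETCH_HD` macro, and the choice of `v_add_f32_e32` as the pinned opcode are all assumptions for illustration. The asm string fixes the instruction, while the "v" constraints only say "some vector register" and leave the actual register choice to the compiler. When the file is not being compiled for an AMD GPU target, it falls back to ordinary C++ so it still builds on a host.

```cpp
#include <cassert>

// Hypothetical helper for illustration; not MoonMath's actual code.
#if defined(__HIPCC__)
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

// Adds two floats. On an AMD GPU target the opcode is pinned by the asm
// string; everywhere else the compiler is free to pick the instruction.
SKETCH_HD inline float add_f32_exact(float a, float b) {
#if defined(__AMDGCN__)
    float out;
    // One instruction, nothing else. "=v" / "v" ask for vector registers
    // but let the compiler decide which ones to allocate.
    asm volatile("v_add_f32_e32 %0, %1, %2" : "=v"(out) : "v"(a), "v"(b));
    return out;
#else
    return a + b;  // host fallback, assumed equivalent for this sketch
#endif
}

// Small sanity check that can run on the host.
inline void check_add_f32_exact() {
    assert(add_f32_exact(1.5f, 2.25f) == 3.75f);
}
```

The point of the pattern is how little is hand-written: one opcode per wrapper, with scheduling and register allocation still in the compiler's hands.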

I like that bet, and I also don’t fully trust it yet.

The performance claims are bold: faster than AITER v3 across an 8K–128K token sweep, with geomean speedups around 1.08×–1.18× depending on rounding mode, and up to 1.26× peak. They also claim bigger gains versus a modular baseline. Numbers like that aren’t “nice to have.” In attention workloads, that’s the difference between “we can serve this model” and “we can’t afford to.”
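A quick note on what "geomean speedup" means, since the headline numbers depend on it. Speedups are ratios, so benchmark sweeps usually average them geometrically rather than arithmetically. The sketch below shows the arithmetic only; the function names are mine, and it contains no figures from MoonMath's benchmark.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Speedup for one shape: how many times faster the candidate is.
double speedup(double baseline_ms, double candidate_ms) {
    assert(baseline_ms > 0.0 && candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}

// Geometric mean of per-shape speedups: exp of the mean of the logs.
double geomean(const std::vector<double>& ratios) {
    assert(!ratios.empty());
    double log_sum = 0.0;
    for (double r : ratios) {
        assert(r > 0.0);  // a ratio must be positive to take its log
        log_sum += std::log(r);
    }
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

With made-up ratios of 1.0 and 4.0, the geometric mean is 2.0 while the arithmetic mean would be 2.5, which is why one outlier shape moves a geomean less than it moves a plain average.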

But the real argument isn’t the exact multiplier. It’s the maintenance argument. MoonMath is saying: we can keep performance high while lowering long-term burden. That’s a big deal because the dirty secret of high-performance kernels isn’t writing them once; it’s keeping them correct and fast across compiler versions, driver updates, and the constant churn of model shapes.

If you’ve ever been on a team that depends on a performance-critical kernel, you know the feeling. The model changes slightly. Sequence lengths shift. Suddenly the “fast path” isn’t the fast path. Or it is, but it’s wrong in one corner case. Now you’re burning weeks chasing a regression that only shows up in production loads. Assembly makes that worse because there are fewer tools, fewer eyes, and fewer people willing to touch it.

MoonMath’s post also gives a pretty grounded explanation of where the speed comes from: scheduling and memory placement, not one magic trick. Eight waves are split into two groups with barriers on each iteration, so matrix-core work overlaps with softmax and prefetch. K stays in LDS, V stays hot in L1, and Q and the accumulators stay in registers. That’s the kind of stuff that decides whether your expensive GPU is actually doing math or just waiting.
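One plausible reading of that two-group schedule is a ping-pong: while one group of waves keeps the matrix cores busy, the other does softmax and prefetch, and they swap roles each iteration. The toy model below only shows that alternation on the host. The names, the even split into four waves per group, and the strict swap every iteration are my assumptions, not details confirmed by MoonMath's post.

```cpp
#include <cassert>
#include <string>

enum class Phase { MatrixCore, SoftmaxAndPrefetch };

constexpr int kWaves = 8;                         // from the post
constexpr int kGroups = 2;                        // from the post
constexpr int kWavesPerGroup = kWaves / kGroups;  // assumed even split

constexpr int group_of(int wave) { return wave / kWavesPerGroup; }

// Assumed ping-pong: the groups trade roles on every iteration.
constexpr Phase phase_of(int wave, int iter) {
    return ((group_of(wave) + iter) % 2 == 0) ? Phase::MatrixCore
                                              : Phase::SoftmaxAndPrefetch;
}

// One text row per iteration: 'M' for matrix-core work, 'S' for
// softmax and prefetch, one character per wave.
std::string timeline(int iterations) {
    assert(iterations >= 0);
    std::string out;
    for (int it = 0; it < iterations; ++it) {
        for (int w = 0; w < kWaves; ++w) {
            out += (phase_of(w, it) == Phase::MatrixCore) ? 'M' : 'S';
        }
        out += '\n';  // the row break stands in for the per-iteration barrier
    }
    return out;
}
```

Calling `timeline(2)` gives the rows `MMMMSSSS` and `SSSSMMMM`: in every iteration both kinds of work are in flight, which is the overlap the post describes.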

And that’s where the stakes get real.

Imagine you’re a company that wants to move inference off NVIDIA because pricing or supply is killing you. You buy MI300X boxes. The hardware is strong. But the software stack is the make-or-break layer. If the best kernels are locked up in fragile assembly and only improve when the vendor prioritizes them, you’re stuck waiting. Open, high-performance kernels change the balance. They give teams an escape hatch: you can profile, patch, and ship improvements without begging for roadmap attention.

That’s good for AMD too, even if it stings. Because the competition isn’t “does AMD have fast chips.” It’s “can real teams get real performance without a priesthood.”

Now, the pushback: vendor kernels aren’t always losing because vendor engineers are worse. Sometimes they’re solving a different problem. They might be optimizing for broader coverage, safer numerical behavior, fewer sharp edges across drivers, or stability under weird workloads. A community kernel can pick a narrower target (MI300X, specific shapes, specific choices) and win benchmarks. That’s not cheating, but it’s a different game.

As for the “no visible quality regression” claim in SGLang diffusion: cool, but also squishy. “Visible” depends on who’s looking, what they’re comparing, and how hard they try to find artifacts. It matters because attention kernels touch numerics, and rounding modes exist for a reason. A speed win that quietly changes outputs in rare cases is the kind of bug that doesn’t show up in a demo but does show up when a customer complains.
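To see why rounding mode is not a footnote, here is a small sketch of two common ways to turn a float into bf16: chopping off the low 16 bits (truncation) and round-to-nearest-even. Which modes MoonMath actually benchmarked is not spelled out here, so treat these two as assumed examples. The function names are mine, and NaN inputs are deliberately left out of the rounding path.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer.
static std::uint32_t float_bits(float x) {
    std::uint32_t b;
    std::memcpy(&b, &x, sizeof b);
    return b;
}

// Widen a bf16 bit pattern back to float (the low 16 bits become zero).
float bf16_to_float(std::uint16_t h) {
    std::uint32_t b = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &b, sizeof x);
    return x;
}

// Truncation: keep the top 16 bits, drop the rest.
std::uint16_t bf16_truncate(float x) {
    return static_cast<std::uint16_t>(float_bits(x) >> 16);
}

// Round-to-nearest-even: add a bias so ties go to the even bf16 value.
std::uint16_t bf16_round_nearest_even(float x) {
    assert(x == x);  // NaN is not handled in this sketch
    std::uint32_t b = float_bits(x);
    std::uint32_t lsb = (b >> 16) & 1u;  // lowest bit that survives
    b += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(b >> 16);
}
```

Feed both functions 1.005859375f: truncation gives 1.0, round-to-nearest-even gives 1.0078125. One step in the last bit looks harmless, but attention sums thousands of such values, so a kernel is only a drop-in replacement if it matches the baseline under the same mode.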

Still, I’m glad this is happening in public. If MoonMath’s kernel really beats AITER v3 across shapes and rounding modes, that’s a signal: the bottleneck isn’t hardware, it’s iteration speed and openness. And if it doesn’t hold up under independent testing, that’s still useful, because it forces clearer benchmarks and better baselines.

The bigger question is what AMD does next: do they treat this as a threat to control, or as proof that their ecosystem can move faster when outsiders can compete on performance without writing a thousand lines of brittle assembly?

If open kernels like this keep winning, does AMD lean into a more community-driven “best code wins” stack, or do they double down on vendor-owned kernels and hope that’s enough to keep developers loyal?
