This is the kind of thing that sounds boring until you realize what it implies: AMD’s fastest AI chips might be held back less by hardware and more by the software culture around them.
MoonMath AI says it open-sourced a bf16 attention kernel for AMD’s MI300X, written in HIP rather than hand-tuned assembly, and that it still beats AMD’s own AITER v3 “on every shape and rounding mode” they tested. If that holds up, it’s embarrassing in the healthy way, the kind that forces a reset.
Because the usual story goes like this: if you want top speed, you pay the “assembly tax.” You hire the rare people who can write fragile, device-specific code. You accept that every new chip or compiler tweak might break something. You get performance, but you also get a codebase that only a few people on Earth can safely touch. That’s not a strategy; that’s a hostage situation.
MoonMath is basically arguing you can have most of the speed without building a giant pile of hand-written GCN assembly. Their kernel is “non-assembly” in the sense that it’s HIP, but with a twist: tiny one-instruction inline-assembly wrappers so they can pick exact opcodes while still letting the compiler do register allocation. That’s a very specific bet. It says: we don’t need to micromanage everything, we just need to control the few points where the compiler might choose the wrong instruction.
I like that bet, and I also don’t fully trust it yet.
The performance claims are bold: faster than AITER v3 across an 8K–128K token sweep, with geomean speedups around 1.08×–1.18× depending on rounding mode, and up to 1.26× peak. They also claim bigger gains versus a modular baseline. Numbers like that aren’t “nice to have.” In attention workloads, that’s the difference between “we can serve this model” and “we can’t afford to.”
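A quick aside on what a “geomean speedup” is actually measuring: it averages per-shape ratios multiplicatively, so a 2× win on one shape and a 2× loss on another net out to 1.0× instead of the flattering 1.25× an arithmetic mean would report. Here is a minimal Python sketch. The function names and the toy ratios are mine, purely for illustration, and none of the numbers come from MoonMath’s benchmark.

```python
import math


def speedups_from_timings(baseline_ms, candidate_ms):
    """Per-shape speedup = baseline time / candidate time (hypothetical helper)."""
    if len(baseline_ms) != len(candidate_ms):
        raise ValueError("timing lists must be the same length")
    return [b / c for b, c in zip(baseline_ms, candidate_ms)]


def geomean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Toy ratios, NOT benchmark data: one shape twice as fast, one twice as slow.
toy = [2.0, 0.5]
arithmetic = sum(toy) / len(toy)  # 1.25, which overstates the result
geometric = geomean(toy)          # approximately 1.0, the wins and losses cancel
```

If you rerun someone’s sweep yourself, feeding per-shape timings through something like this is a cheap way to check that a headline geomean isn’t being carried by one or two shapes.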
But the real argument isn’t the exact multiplier. It’s the maintenance argument. MoonMath is saying: we can keep performance high while lowering long-term burden. That’s a big deal because the dirty secret of high-performance kernels isn’t writing them once. It’s keeping them correct and fast across compiler versions, driver updates, and the constant churn of model shapes.
If you’ve ever been on a team that depends on a performance-critical kernel, you know the feeling. The model changes slightly. Sequence lengths shift. Suddenly the “fast path” isn’t the fast path. Or it is, but it’s wrong in one corner case. Now you’re burning weeks chasing a regression that only shows up in production loads. Assembly makes that worse because there are fewer tools, fewer eyes, and fewer people willing to touch it.
MoonMath’s post also gives a pretty grounded explanation of where the speed comes from: scheduling and memory placement. Not one magic trick. Eight waves split into two groups, barriers per iteration, overlapping matrix-core work with softmax and prefetch. Keeping K in LDS, keeping V hot in L1, keeping Q and accumulators in registers. That’s the kind of stuff that decides whether your expensive GPU is actually doing math or just waiting.
And that’s where the stakes get real.
Imagine you’re a company that wants to move inference off NVIDIA because pricing or supply is killing you. You buy MI300X boxes. The hardware is strong. But the software stack is the make-or-break layer. If the best kernels are locked up in fragile assembly and only improve when the vendor prioritizes them, you’re stuck waiting. Open, high-performance kernels change the balance. They give teams an escape hatch: you can profile, patch, and ship improvements without begging for roadmap attention.
That’s good for AMD too, even if it stings. Because the competition isn’t “does AMD have fast chips.” It’s “can real teams get real performance without a priesthood.”
Now, the pushback: vendor kernels aren’t always losing because vendor engineers are worse. Sometimes they’re solving a different problem. They might be optimizing for broader coverage, safer numerical behavior, fewer sharp edges across drivers, or stability under weird workloads. A community kernel can pick a narrower target (MI300X, specific shapes, specific choices) and win benchmarks. That’s not cheating, but it’s a different game.
And the “no visible quality regression” claim for SGLang diffusion is cool, but also squishy. “Visible” depends on who’s looking, what they’re comparing, and how hard they try to find artifacts. It matters because attention kernels touch numerics, and rounding modes exist for a reason. A speed win that quietly changes outputs in rare cases is the kind of bug that doesn’t show up in a demo but does show up when a customer complains.
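To make the rounding-mode worry concrete: bf16 keeps only the top 16 bits of a float32, and how you discard the bottom 16 changes the answer. The sketch below is my own illustration in Python, not MoonMath’s code. I’m assuming “rounding mode” refers to how fp32 values are narrowed to bf16, and the two modes shown (truncation and round-to-nearest-even) are common choices I picked for the example, not necessarily the ones they benchmarked. It handles finite inputs only.

```python
import struct


def f32_bits(x):
    """Reinterpret a Python float as float32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits):
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bf16_truncate(x):
    """Narrow to bf16 by dropping the low 16 bits (round toward zero)."""
    return bits_f32(f32_bits(x) & 0xFFFF0000)


def to_bf16_nearest_even(x):
    """Narrow to bf16 with round-to-nearest, ties to even (finite inputs only)."""
    bits = f32_bits(x)
    lsb = (bits >> 16) & 1          # lowest bit that survives in bf16
    rounded = bits + 0x7FFF + lsb   # bias so that exact ties land on even
    return bits_f32(rounded & 0xFFFF0000)


# A value sitting three quarters of the way between two adjacent bf16 numbers.
x = 1 + 2**-8 + 2**-9
low = to_bf16_truncate(x)       # 1.0
high = to_bf16_nearest_even(x)  # 1.0078125, i.e. 1 + 2**-7
```

One bf16 step near 1.0 is about 0.8%, so the two modes can disagree by that much on a single value. Whether that adds up to anything you can see in an image is exactly the question “no visible regression” leaves open.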
Still, I’m glad this is happening in public. If MoonMath’s kernel really beats AITER v3 across shapes and rounding modes, that’s a signal: the bottleneck isn’t hardware, it’s iteration speed and openness. And if it doesn’t hold up under independent testing, that’s still useful, because it forces clearer benchmarks and better baselines.
The bigger question is what AMD does next: do they treat this as a threat to control, or as proof that their ecosystem can move faster when outsiders can compete on performance without writing a thousand lines of brittle assembly?
If open kernels like this keep winning, does AMD lean into a more community-driven “best code wins” stack, or do they double down on vendor-owned kernels and hope that’s enough to keep developers loyal?