This is the part of the AI boom that actually matters, and it’s also the part people love to ignore: the “magic” isn’t just the model. It’s the rack.
Perplexity is saying it can serve its post-trained Qwen3 235B models better by running them on NVIDIA’s GB200 NVL72 Blackwell racks. In plain terms: they moved the same kind of big model onto a newer, more tightly connected chunk of hardware, and they’re getting faster results, especially for models that split work across many specialized parts (mixture-of-experts). They even point to a specific win: all-reduce latency dropping from the 586.1 microseconds they were seeing on their previous setup. That’s not a number most people care about, but it’s a real signal that the system is coordinating faster, which is exactly what you want when you’re trying to answer tons of questions without everything bogging down.
Here’s my take: this is impressive engineering, but it should also make you a little uneasy about where the leverage is.
Because what Perplexity is really announcing is not “we found a new way to think.” It’s “we bought access to the newest, most expensive plumbing, and now we can push more tokens through it.” That’s not nothing. It can mean lower wait times, fewer timeouts, and a better experience for users. It can make a big model feel less like a lab demo and more like a product you can depend on.
But it also quietly reinforces a reality a lot of AI fans don’t want to admit: the winners may be the companies with the best hardware relationships, not the companies with the cleverest ideas.
When you read “rack-scale NVLink interconnect” and “enhanced tensor cores,” what that translates to is coordination. These models are not a single brain in a box. They’re more like a team of specialists passing notes constantly. If the note-passing is slow, everything slows. If it’s fast, you can do more at once and still respond quickly. So shaving down that coordination delay is a big deal for high-throughput inference, where you’re serving many people at the same time.
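If “all-reduce” means nothing to you, here’s a rough sketch of what that latency number is actually timing. This is a toy example with made-up sizes on a CPU backend, not Perplexity’s benchmark or serving stack; the point is just that every generation step has to wait on this collective round-trip, so its latency sits directly in the user’s path.

```python
# Toy illustration of an all-reduce: every worker holds a partial result,
# and all of them must exchange and combine those partials before the next
# step can start. Sizes and worker count here are illustrative only.
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int, n_iters: int = 100) -> None:
    # The CPU "gloo" backend lets this run anywhere; production serving would
    # use NCCL over NVLink, which is exactly the hop the new racks speed up.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    tensor = torch.ones(1 << 16)  # this worker's partial result

    # Warm up, then time the collective: this is the "note-passing" every
    # token generation step waits for.
    for _ in range(10):
        dist.all_reduce(tensor)

    start = time.perf_counter()
    for _ in range(n_iters):
        dist.all_reduce(tensor)
    elapsed_us = (time.perf_counter() - start) / n_iters * 1e6

    if rank == 0:
        print(f"avg all-reduce latency: {elapsed_us:.1f} microseconds")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # stand-in for the many GPUs sharing one model
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

On a real rack the same call runs over NVLink instead of a loopback socket, and the whole pitch of the new hardware is that this hop gets cheaper, so you can coordinate more specialists per token without the user ever feeling the wait.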
Imagine you’re a regular user asking a complicated question. You don’t see “all-reduce.” You see the page loading. You see whether the answer streams smoothly or stutters. You see whether the tool feels calm under pressure or panics at peak hours. If Perplexity can keep latency down while serving a huge model, that’s a real competitive edge.
Now imagine you’re a small team trying to compete. You don’t have a line to the newest racks. You’re renting what you can, when you can, at prices that can swing hard. You may have a better product idea, better design, better safety work, better focus. But the user only notices that your competitor feels faster and more reliable. That’s the brutal part: speed becomes the brand, even when the underlying “intelligence” isn’t that different.
The obvious pushback is: so what? Faster is good. Efficiency is good. And yes, if the hardware makes serving cheaper per query, that could eventually mean lower prices or better access. That’s the optimistic version: better racks turn into better tools for more people.
But the pessimistic version is hard to shake. Hardware advantages don’t just make things faster. They shape what gets built. If only a few players can reliably run giant MoE systems at high throughput, then the market starts to bend around their constraints and incentives. The “open” side of AI can keep releasing weights, but if the best way to run those weights is locked behind scarce infrastructure, openness becomes a vibe, not a reality.
There’s also a second-order effect people hand-wave away: once you can serve bigger models efficiently, you’re tempted to use bigger models everywhere, even when you don’t need them. Not because it’s best for users, but because it’s a flex and because it locks people in. If your app becomes dependent on a giant model that only runs well on a certain setup, switching becomes painful. That’s not a technical problem. That’s a business strategy wearing a technical costume.
And I’m not even calling Perplexity cynical for doing this. They’re competing in a market where users punish slow tools instantly. If you can cut latency and increase throughput, you do it. The question is what kind of ecosystem that creates.
If this trend continues, we’ll get a world where “AI progress” looks like hardware cycles: the next rack, the next interconnect, the next advantage that’s hard to copy. That can produce real improvements, but it can also narrow the field. More dependency, fewer viable competitors, and a lot of innovation that quietly shifts from “better answers” to “better access.”
So yeah, it’s a win. It’s also a warning light.
If the future of “better AI” depends more and more on who can secure the best racks, what happens to the people and companies who can’t?