There's a story the AI industry has been telling itself lately about what it takes to compete in the agentic era. It goes something like this: if you want truly real-time inference, fast enough to make an AI agent feel less like a vending machine and more like a colleague, then you need to buy new hardware.
That means specialized silicon: the exotic chips that Cerebras, Groq (which was kinda-sorta bought by Nvidia), and SambaNova have spent years and billions building. The thesis got a very loud endorsement last month, when Cerebras went public in a blockbuster IPO that valued it at $56 billion and confirmed that fast inference is now its own infrastructure category.
A small team in Paris would beg to differ.
Kog, an 11-person AI infrastructure startup founded in 2023, has just opened a public tech preview of its inference engine, making a claim that runs counter to prevailing wisdom. On a single node of eight AMD MI300X GPUs (the kind already humming away in enterprise datacenters), Kog says it generates more than 3,000 output tokens per second for a single user request, putting it in the same speed bracket as the dedicated-silicon crowd, but on standard kit.
The pitch is an enticing one as the AI frenzy gives way to anxiety over soaring operating costs: you may not need to migrate to a new hardware ecosystem to get dedicated-silicon speeds. You might just need someone to use the GPUs you already own a lot more cleverly.
"It's not only the hardware."
"A growing part of the AI industry assumes that truly real-time AI [would] require entirely new hardware architectures," said Nicolas Constant, Kog's Sales & Talent Lead, in an interview ahead of the launch. He noted that the recent Cerebras IPO "reinforced [that assumption] even further."