Avi Weinstock

How is Groq so Fast? An Overview of Groq's TSP Architecture

An overview of Groq's (surprisingly easy to read) whitepaper

You may have seen Groq’s recent demo, with state-of-the-art LLM inference speed. But how is it so fast?

We did a deep dive into Groq’s whitepaper (surprisingly easy to read!) to find out.

This blog post was adapted from the original Twitter thread, which can be found here.

Tensor Streaming Processors vs Conventional CPUs/GPUs

You may be wondering how custom processors could be faster than conventional processors, which have had decades of optimizations put into them.

The answer is that many of those optimizations improve average-case performance but not worst-case performance, and predictable worst-case performance is what enables the end-to-end optimization at the heart of Groq's design: scheduling data transfers at compile time.

The main idea is that their processors (Tensor Streaming Processors, henceforth TSPs) execute instructions in lockstep, as if synchronized to a common clock. This lets data transfers be scheduled at compile time, with TSPs sending and receiving data using their instruction pointers as implicit timestamps.
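To make that concrete, here is a minimal sketch in Python (with made-up cycle counts and field names, not Groq's actual ISA) of what a compile-time transfer schedule looks like: because both chips run deterministically, a cycle number fully determines when data leaves one TSP and when it arrives at another.

```python
# Minimal sketch, not Groq's ISA: cycle numbers act as implicit timestamps,
# so a transfer is fully described by its send cycle plus the fixed link latency.
from dataclasses import dataclass

@dataclass
class ScheduledTransfer:
    src_tsp: int       # sending chip
    dst_tsp: int       # receiving chip
    send_cycle: int    # cycle at which the sender issues its send instruction
    link_latency: int  # fixed, known latency of the link between the two chips

    @property
    def receive_cycle(self) -> int:
        # The receiver never asks whether data is ready; it simply reads its
        # input at this precomputed cycle.
        return self.send_cycle + self.link_latency

# The compiler lays out the whole dataflow graph up front.
schedule = [
    ScheduledTransfer(src_tsp=0, dst_tsp=1, send_cycle=100, link_latency=40),
    ScheduledTransfer(src_tsp=0, dst_tsp=2, send_cycle=100, link_latency=65),
]

for t in schedule:
    print(f"TSP{t.src_tsp} sends at cycle {t.send_cycle}; "
          f"TSP{t.dst_tsp} reads at cycle {t.receive_cycle}")
```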

This wouldn’t be straightforward to do with conventional CPUs/GPUs, whose optimizations, like caching, branch prediction, and speculative execution, introduce nondeterminism in instruction timing.

The TSP architecture avoids these for the sake of determinism and replaces them with a higher quantity of ALUs and memory than would be present in a conventional CPU or GPU made on a comparable process.

Optimal Cache Eviction

Additionally, conventional cache eviction policies may not serve machine learning workloads well: eviction policies on general-purpose CPUs approximate the problem of anticipating which memory the program will need in the future, which is undecidable for general Turing-complete programs.

But with many machine learning workloads, the computation graph is known ahead of time, so the memory needed at each program point is decidable. A compiler can therefore account for memory latency explicitly, reducing or eliminating the benefit of a conventional cache.
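As an illustration of why a known access sequence changes the picture, the sketch below implements Belady's optimal (MIN) eviction policy, which is only possible when future accesses are known, as they are for a static computation graph. (Groq's TSP sidesteps hardware caching entirely in favor of compiler-managed on-chip memory, so this is an analogy for the underlying point, not a description of their hardware.)

```python
# Belady's MIN policy: evict the resident block whose next use is farthest
# in the future. Requires knowing the full access trace ahead of time,
# which is exactly what a static ML computation graph provides.
def belady_misses(accesses, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(accesses):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):
                future = accesses[i + 1:]
                return future.index(b) if b in future else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

trace = ["w0", "x", "w1", "x", "w0", "y", "w1", "x"]  # hypothetical access trace
print(belady_misses(trace, capacity=2))  # optimal miss count for this trace
```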

TSP Synchronization

There’s still some timing variance from clock skew and physical connection quality, so the TSPs synchronize themselves by calibrating one-byte counters for use as a shared clock between directly connected TSPs, with dedicated instructions that align the instruction stream to the TSP’s counter.
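A hypothetical sketch of the counter mechanic (the class and method names below are invented for illustration, not Groq's instruction set): each TSP keeps a small wrapping counter, and an align-style instruction stalls the instruction stream until the counter reaches an agreed value, so directly connected chips resume in lockstep despite skew.

```python
# Hedged sketch: a one-byte counter shared (after calibration) by directly
# connected TSPs, plus an "align" operation that stalls until a target value.
class TSPClock:
    def __init__(self, skew_ticks: int):
        self.counter = skew_ticks % 256  # one-byte counter wraps at 256

    def tick(self, n: int = 1):
        self.counter = (self.counter + n) % 256

    def align_to(self, target: int) -> int:
        """Stall (measured in ticks) until the counter reaches `target`."""
        stall = (target - self.counter) % 256
        self.tick(stall)
        return stall

a, b = TSPClock(skew_ticks=3), TSPClock(skew_ticks=0)
print(a.align_to(10), b.align_to(10))  # both streams resume at counter == 10
```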

One benefit to compile time scheduling of data transfers is that more of the physical connections between TSPs can be in use simultaneously.

Concretely, using paths A->B->C and A->C simultaneously gives more bandwidth than only using A->C, but without a global view of all the data flow, dynamic routing protocols might conservatively not use A->B->C in case it would interfere with other uses of B’s bandwidth.
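A toy calculation (with an assumed per-link bandwidth, not a measured Groq figure) shows the payoff of that global view:

```python
# Toy arithmetic, not measured Groq numbers: with a compile-time view of all
# traffic, the scheduler knows B's links are otherwise idle and can split
# A->C traffic across both routes.
link_gbps = 100                      # assumed per-link bandwidth

direct_only = link_gbps              # A->C only
with_detour = link_gbps + link_gbps  # A->C plus the A->B->C detour
print(direct_only, with_detour)      # 100 vs. 200 Gb/s of usable A->C bandwidth
```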

Another benefit is that since the sending TSP knows at which instruction to send data, and the receiving TSP knows at which instruction the data will have arrived, there is no need for request/response protocol messages at runtime. This both reduces latency (by eliminating a round trip) and increases bandwidth utilization (through smaller headers and fewer messages).
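A back-of-the-envelope sketch with made-up numbers shows where both savings come from:

```python
# Back-of-the-envelope sketch with assumed numbers: dropping the runtime
# handshake saves a round trip of latency, and dropping per-message headers
# returns that fraction of the wire to payload.
one_way_latency_us = 1.0  # assumed one-way link latency
header_bytes = 64         # assumed per-message header size in a dynamic protocol
payload_bytes = 4096

# Dynamic protocol: a request and a response cross the link before the data
# itself arrives (three one-way trips in total).
dynamic_latency = 3 * one_way_latency_us
# Compile-time schedule: the data just shows up when expected.
scheduled_latency = one_way_latency_us

dynamic_efficiency = payload_bytes / (payload_bytes + header_bytes)
scheduled_efficiency = 1.0  # no runtime headers needed when both sides know the schedule

print(f"latency: {dynamic_latency:.1f}us -> {scheduled_latency:.1f}us")
print(f"wire efficiency: {dynamic_efficiency:.2%} -> {scheduled_efficiency:.0%}")
```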

Conclusion

Check out the papers if you want to read more.

About Us

Zellic specializes in securing emerging technologies. Our security researchers have uncovered vulnerabilities in the most valuable targets, from Fortune 500s to DeFi giants.

Developers, founders, and investors trust our security assessments to ship quickly, confidently, and without critical vulnerabilities. With our background in real-world offensive security research, we find what others miss.

Contact us for an audit that’s better than the rest. Real audits, not rubber stamps.