> how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level
Every time the high-level architecture of models changes, there are new lower-level optimizations to be done. Even recent releases like GPT-OSS add new areas for improvement, like MXFP4, that require the lower-level parts to be created and optimized.
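To make the MXFP4 point concrete, here is a rough sketch of what dequantizing one microscaling block looks like. It assumes the OCP MX layout (blocks of 32 FP4 E2M1 codes sharing one power-of-two scale) and takes the scale exponent as a plain int rather than the biased E8M0 encoding; it is illustrative, not GPT-OSS's actual kernel.

```python
# Hedged sketch: dequantizing one MXFP4 block (OCP microscaling format).
# Assumes 32 FP4 (E2M1) codes plus one shared power-of-two scale per block.
import numpy as np

# The 8 representable E2M1 magnitudes (the sign bit is handled separately).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_mxfp4_block(codes: np.ndarray, scale_exp: int) -> np.ndarray:
    """codes: 32 uint8 values, each holding a 4-bit FP4 code (bit 3 = sign).
    scale_exp: shared block exponent, so the scale is 2**scale_exp."""
    signs = np.where(codes & 0b1000, -1.0, 1.0)
    mags = E2M1_VALUES[codes & 0b0111]
    return signs * mags * (2.0 ** scale_exp)

# Example: one block of 32 random 4-bit codes with a shared scale of 2**-2.
block = np.random.randint(0, 16, size=32, dtype=np.uint8)
print(dequantize_mxfp4_block(block, scale_exp=-2))
```

The point is that every such format change means new packing, dequant, and matmul paths at the kernel level, even if the model code above it barely changes.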
How often do new hardware-level optimizations get created for LLMs and tensor physics workloads? How reconfigurable are TPUs? Are there any standardized feature flags for TPUs yet?
Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?
> How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?
There are multiple TPU vendors.
I listed multiple AI accelerator TPU products in the comment you are replying to.
> How reconfigurable are TPUs?
TIL Google's TPUs are reconfigurable via Optical Circuit Switches (OCS), which can switch the interconnect between, for example, 3D torus and twisted torus configurations.
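A rough sketch of what that topology difference means, using a simplified twist rule where the wrap-around link on the X axis also shifts the Y coordinate. Dimensions and the twist rule here are illustrative, not Google's actual OCS wiring.

```python
# Neighbor map for a plain 3D torus vs. a (simplified) twisted torus.
def torus_neighbors(x, y, z, X, Y, Z, twist=0):
    nbrs = []
    for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
        nx, ny, nz = x + dx, y + dy, z + dz
        if nx == X or nx == -1:        # wrapping around the X dimension
            ny = (ny + twist) % Y      # twisted torus: the wrap shifts Y
        nbrs.append((nx % X, ny % Y, nz % Z))
    return nbrs

print(torus_neighbors(3, 1, 0, X=4, Y=4, Z=4))           # plain 3D torus
print(torus_neighbors(3, 1, 0, X=4, Y=4, Z=4, twist=1))  # twisted variant
```

The OCS layer lets the same physical chips be rewired into either neighbor map (or partitioned into smaller slices) without recabling.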
(FWIW also, quantum libraries mostly have line qubits and lattice qubits. There is a recent "Layer Coding" paper aiming to surpass surface coding.)
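For reference, this is what line vs. lattice qubits look like in one real quantum library (Cirq), where the 2D-lattice type is called GridQubit; the circuit itself is just a throwaway example.

```python
import cirq

line = cirq.LineQubit.range(4)                                      # qubits 0..3 on a line
grid = [cirq.GridQubit(r, c) for r in range(2) for c in range(3)]   # 2x3 lattice

circuit = cirq.Circuit(cirq.H(line[0]), cirq.CNOT(line[0], line[1]))
print(circuit)
```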
But back to classical TPUs: I had already started drafting a reply to myself to improve that criterion; and then, paraphrasing from 2.5pro:
> Don't rank by TOPS/Whr alone; rank by TOPS/Whr @ [Specific Precision]. Don't rank by memory bandwidth alone; rank by effective bandwidth @ [Specific Precision].
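A toy illustration of why the precision qualifier matters: the same chip at the same power budget gets very different efficiency numbers depending on which precision's peak throughput you quote. All figures below are made up.

```python
# Hedged worked example: efficiency ratio depends on the quoted precision.
def tops_per_watt(peak_tops: float, power_w: float) -> float:
    return peak_tops / power_w

# Hypothetical accelerator specs (fabricated numbers, same power envelope).
accelerator = {
    "INT8":  {"peak_tops": 400.0, "power_w": 300.0},
    "FP16":  {"peak_tops": 200.0, "power_w": 300.0},
    "MXFP4": {"peak_tops": 800.0, "power_w": 300.0},
}

for precision, spec in accelerator.items():
    eff = tops_per_watt(spec["peak_tops"], spec["power_w"])
    print(f"{eff:.2f} TOPS/W @ {precision}")
```

So a single TOPS/Whr number without the precision (and without sustained vs. peak, and effective vs. raw bandwidth) is easy to game in marketing material.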