> how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level
Every time the high-level architecture of models changes, there are new lower-level optimizations to be done. Even recent releases like GPT-OSS add new areas for improvement, like MXFP4, that require the lower-level parts to be created and optimized.
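To make the MXFP4 point concrete, here is a rough sketch of what dequantizing one microscaling block looks like. It assumes the OCP MX layout (blocks of 32 FP4 E2M1 codes sharing one power-of-two scale) and takes the scale exponent as a plain int rather than the biased E8M0 encoding; it is illustrative, not GPT-OSS's actual kernel.

```python
# Hedged sketch: dequantizing one MXFP4 block (OCP microscaling format).
# Assumes 32 FP4 (E2M1) codes plus one shared power-of-two scale per block.
import numpy as np

# The 8 representable E2M1 magnitudes (the sign bit is handled separately).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_mxfp4_block(codes: np.ndarray, scale_exp: int) -> np.ndarray:
    """codes: 32 uint8 values, each holding a 4-bit FP4 code (bit 3 = sign).
    scale_exp: shared block exponent, so the scale is 2**scale_exp."""
    signs = np.where(codes & 0b1000, -1.0, 1.0)
    mags = E2M1_VALUES[codes & 0b0111]
    return signs * mags * (2.0 ** scale_exp)

# Example: one block of 32 random 4-bit codes with a shared scale of 2**-2.
block = np.random.randint(0, 16, size=32, dtype=np.uint8)
print(dequantize_mxfp4_block(block, scale_exp=-2))
```

The point is that every such format change means new packing, dequant, and matmul paths at the kernel level, even if the model code above it barely changes.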
How often do new hardware-level optimizations get created for LLMs and tensor physics workloads? How reconfigurable are TPUs? Are there any standardized feature flags for TPUs yet?
Is TOPS/Whr a good efficiency metric for TPUs and for LLM model hosting operations?
> How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?
There are multiple TPU vendors.
I listed multiple AI accelerator TPU products in the comment you are replying to.
> How reconfigurable are TPUs?
TIL Google's TPUs are reconfigurable via Optical Circuit Switches (OCS), which can switch the interconnect between, for example, 3D torus and twisted torus configurations.
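A rough sketch of what that topology difference means, using a simplified twist rule where the wrap-around link on the X axis also shifts the Y coordinate. Dimensions and the twist rule here are illustrative, not Google's actual OCS wiring.

```python
# Neighbor map for a plain 3D torus vs. a (simplified) twisted torus.
def torus_neighbors(x, y, z, X, Y, Z, twist=0):
    nbrs = []
    for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
        nx, ny, nz = x + dx, y + dy, z + dz
        if nx == X or nx == -1:        # wrapping around the X dimension
            ny = (ny + twist) % Y      # twisted torus: the wrap shifts Y
        nbrs.append((nx % X, ny % Y, nz % Z))
    return nbrs

print(torus_neighbors(3, 1, 0, X=4, Y=4, Z=4))           # plain 3D torus
print(torus_neighbors(3, 1, 0, X=4, Y=4, Z=4, twist=1))  # twisted variant
```

The OCS layer lets the same physical chips be rewired into either neighbor map (or partitioned into smaller slices) without recabling.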
(FWIW also, quantum libraries mostly have line qubits and lattice qubits. There is a recent "Layer Coding" paper aiming to surpass surface coding.)
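For reference, this is what line vs. lattice qubits look like in one real quantum library (Cirq), where the 2D-lattice type is called GridQubit; the circuit itself is just a throwaway example.

```python
import cirq

line = cirq.LineQubit.range(4)                                      # qubits 0..3 on a line
grid = [cirq.GridQubit(r, c) for r in range(2) for c in range(3)]   # 2x3 lattice

circuit = cirq.Circuit(cirq.H(line[0]), cirq.CNOT(line[0], line[1]))
print(circuit)
```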
But back to classical TPUs: I had already started drafting a reply to myself to improve that criterion; and then, paraphrasing from 2.5pro:
> Don't rank by TOPS/Whr alone; rank by TOPS/Whr @ [Specific Precision]. Don't rank by memory bandwidth alone; rank by effective bandwidth @ [Specific Precision].
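A toy illustration of why the precision qualifier matters: the same chip at the same power budget gets very different efficiency numbers depending on which precision's peak throughput you quote. All figures below are made up.

```python
# Hedged worked example: efficiency ratio depends on the quoted precision.
def tops_per_watt(peak_tops: float, power_w: float) -> float:
    return peak_tops / power_w

# Hypothetical accelerator specs (fabricated numbers, same power envelope).
accelerator = {
    "INT8":  {"peak_tops": 400.0, "power_w": 300.0},
    "FP16":  {"peak_tops": 200.0, "power_w": 300.0},
    "MXFP4": {"peak_tops": 800.0, "power_w": 300.0},
}

for precision, spec in accelerator.items():
    eff = tops_per_watt(spec["peak_tops"], spec["power_w"])
    print(f"{eff:.2f} TOPS/W @ {precision}")
```

So a single TOPS/Whr number without the precision (and without sustained vs. peak, and effective vs. raw bandwidth) is easy to game in marketing material.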