Fix Intel CPU Throttling on Linux (github.com/erpalma)
189 points by ladyanita22 on March 5, 2023 | 137 comments


This is really not the correct approach. https://github.com/intel/thermal_daemon ought to do a better job without ignoring manufacturer thermal limits (I reverse engineered Intel's Dynamic Power and Thermal Framework a few years back, and upstream kernels should have everything needed now: https://mjg59.dreamwidth.org/54923.html)


Thank you for your comment.

I installed thermald on my Lenovo T480 with Debian Bookworm and I get 20% better results in stress-ng. The fans are a bit louder now under high load and off under low load.

Without thermald:

  $ stress-ng --matrix 0 -t 3m --metrics-brief
  stress-ng: info:  [3755113] setting to a 180 second (3 mins, 0.00 secs) run per stressor
  stress-ng: info:  [3755113] dispatching hogs: 8 matrix
  stress-ng: info:  [3755113] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
  stress-ng: info:  [3755113]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
  stress-ng: info:  [3755113] matrix          2278812    180.00   1437.43      0.27     12660.06        1585.04

With thermald:

  $ stress-ng --matrix 0 -t 3m --metrics-brief
  stress-ng: info:  [3755550] setting to a 180 second (3 mins, 0.00 secs) run per stressor
  stress-ng: info:  [3755550] dispatching hogs: 8 matrix
  stress-ng: info:  [3755550] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
  stress-ng: info:  [3755550]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
  stress-ng: info:  [3755550] matrix          2791272    180.00   1404.32      0.57     15507.06        1986.83

I just installed it using apt and did no extra configuration. My system was already configured for balanced power mode anyway. Why is thermald not installed by default on desktop installations?
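
For reference, the whole setup on my side was just the following (a sketch assuming Debian's standard thermald packaging, which normally enables and starts the service on install):

  sudo apt install thermald
  # verify it is running and see what it logged at startup
  systemctl status thermald
  sudo journalctl -u thermald -b | head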


Is dptfxtract still needed?

The thermald man page says this:

> In some newer platforms the auto creation of the config file is done by a companion tool "dptfxtract". This tool can be downloaded from "https://github.com/intel/dptfxtract". It is suggested as parts of the install process, run dptfxtract.

The dptfxtract GitHub project (https://github.com/intel/dptfxtract) says Intel discontinued the project.


I found it out myself: it is not needed with recent thermald. The README says this:

> Thermald version 2.0 and later has in built parser for thermal tables. So this utility is not required. Make sure that thermald "--adaptive" option is used.

https://github.com/intel/dptfxtract/blob/master/README.txt
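
If your distribution's unit file doesn't already pass --adaptive, a minimal sketch of a systemd override that adds it (the exact ExecStart line varies by packaging, so check "systemctl cat thermald" first):

  sudo systemctl edit thermald
  # in the drop-in that opens, override ExecStart, e.g.:
  #   [Service]
  #   ExecStart=
  #   ExecStart=/usr/sbin/thermald --systemd --dbus-enable --adaptive
  sudo systemctl restart thermald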


What? Doesn't Debian come with thermald by default?


I have had people tell me that they don't care if their computers break, as long as they run faster in the meantime. Some manufacturers genuinely set the limits way too low for their own hardware.

I also use this tool to bypass manufacturer limits in battery mode, which are intended to make it seem like the battery is not undersized for the CPU's power draw. Sometimes I'd rather have more CPU for less time.


I recall hearing CPU Tmax is designed for a 10 year lifespan.

I have never used a laptop for a decade, nor have I had the CPU fail. So perhaps faster performance and shorter lifespan is okay.


There are a lot of 10+ year old CPUs in use, with heatsinks so clogged that they're at their thermal limit all the time, so I think 10 years is a highly conservative estimate. It also assumes absolute max Vcore (which, if you've seen the datasheets, is far higher than normal values, and something that only extreme --- as in, liquid nitrogen --- overclockers will exceed); it's more likely that other things, like motherboard capacitors, will fail first.


I kept my last system (i5-2500K) for a decade, and only replaced it (i5-12400F) because it was getting flaky powering back on when turned off, which I assumed was the motherboard. If it still worked then I probably wouldn't have upgraded.


I have a i7 laptop of the same generation that I'm using to type this. I've replaced the screen, bezel, keyboard, touchpad, WiFi board, charger (3 or 4 times, thanks a lot Dell), battery. Upgraded the memory and added an SSD. Reapplied CPU paste once or twice. Never had a single problem with the CPU. Actually I don't know that I've ever had a computer with a CPU failure.

In all I've spent about 25% of the cost of the laptop over the years in repairs, but the result is that I'm able to keep using a device that's almost 12 years old instead of junking it. Saved quite a bit of money vs a new device, too.

Incidentally, it's interesting with regard to this issue that these old bulky laptops with beefy chips are actually quite good at avoiding thermal throttling. Despite having a traditional loud fan, I rarely even hear it come on.


This reminds me of Trigger's broom.


For non-Brits: Trigger's broom is a joke revolving around the idea of the Ship of Theseus.

In the show, Trigger is a street sweeper who has "kept the same broom" for 10 years, only periodically replacing the brush and occasionally replacing the handle.

The joke, of course, is that it's no longer the same broom.

https://www.youtube.com/watch?v=56yN2zHtofM


In parts of the US it's "grampa's axe": I've replaced the handle 3 times and the head twice!


Even though I'm of the same type (a boosted MacBook Pro with an SSD and memory beyond what is officially supported), I think integrating almost everything onto one board, skipping connectors and sockets, is more cost-effective and more "green" overall.

Instead of silicon dies being individually packaged, socketed, etc., every piece of silicon will move closer together inside one package and handle everything.

Faster, cheaper, more energy efficient.


I'm impressed that you only spent 25% of the laptop's purchase price on so many repairs. Where did you get the parts?


He cannibalized the flesh of others, so his companion may live a damned existence!!

(Response crafted by anti-repair lobby gpt)


> ...was getting flaky powering back on when turned off, which I assumed was the motherboard.

This is my PC right now. Sometimes it does a boot loop, where it can get to BIOS but not GRUB and I'm not sure why, as there are no error messages. Pulling the plug and plugging it back in helps.

I'm not sure whether it's the motherboard, or maybe a disk not wanting to read data or something. Guess I'll have to wait and see, eventually swapping out parts one by one once it fails entirely.


Did you ever overclock, though? My 2500K went from 5 to 4 GHz over the course of its life.


No. Ironically I bought the unlocked version then never took it past stock speed. That's why I didn't bother getting the unlocked new one.


FWIW, I have three ThinkPad T420 and T420s machines running around the household, used daily; they're at about a decade.

I just replaced my main desktop, which was an FX-8350, but that computer is now on its second life with a friend and shows no signs of stopping.

There was a time when a 10-year-old computer was absolute rubbish. These days, if you upgrade from a spinning drive to an SSD, well-made 10-year-old computers are just fine :)

Basically, in three decades of personal computing, I've never had a CPU failure, other than that one time I crushed an Athlon XP with the heatsink :O


My 2005 vintage T42p wants to have a word with you. In all its years it has seen the following parts replaced: the (PATA) 2.5" hard drive, now using an SSD through a PATA-SATA adapter.

That's it.

I still regularly use this machine even though it is no longer my main portable device. The only thing keeping me from using it as such is its 32-bit CPU, now that 32-bit support has mostly gone the way of the dodo. I much prefer its 4:3 form factor, 1600x1200 screen, keyboard and construction over the P50 I'm using in its stead.


I still use a T420 (can't stand chiclet keyboards) and a repaste does the trick.


My T430s is still going strong.


IIRC Intel consumer CPUs target 5 years (or is it 3? I'm not sure, and the info is not public) at a 30% duty cycle.

Of course lots of CPUs make it for longer, but 10 years is certainly not a design target.


How relevant is that figure when Intel CPUs universally outperform that by leaps and bounds?


If you actually use consumer CPUs a lot (say, you are a kernel maintainer and compile Linux all day), some of them may not last as long as you think.

Or if you put consumer CPUs in embedded equipment that is always (or at least often) under load, Intel will not care much if you have a high failure rate after a few years, even if you buy in high volume. They have dedicated SKUs with better expected durability.


I don't understand why this is a problem on Linux and not on Windows.

The computers don't break on Windows, so why should Linux users accept this overly conservative approach that limits the performance of their computers?


Intel ship DPTF drivers for Windows, so the appropriate thermal policy is configured at boot time. If you install thermald then Linux will do the same.


In the case of the T480, the limit depends on whether the laptop is in your lap or on the table. In the latter case it allows a much higher limit.

This check doesn’t work on Linux, so for safety reasons the former limit is enforced.


How does that check work? Do you have details?


> without ignoring manufacturer thermal limits

The whole point is to ignore them because they're horrible and hold back what these CPUs can really do. Fuck the manufacturers playing these stupid marketing games.

Intel warrants their CPUs at TjMax 24/7, they'll automatically throttle when they hit that limit, and disabling all this other throttling crap makes them run that way for full performance.


The whole point is that the CPU is only a single part of the equation. Yes, you're not going to burn out the CPU itself by unlimiting PL1/2 (although if the system vendor cheaped out on power circuitry because they'd only designed for sustained 20W draw then you might burn that out), but you're now generating more heat than the system is designed to dissipate. This may result in obvious outcomes like the chassis heating up enough to burn your legs, but it may also result in other components being operated outside their thermal limits and their lifetime being shortened as a result.


The experience of many users shows otherwise --- ThrottleStop has similar scary disclaimers, but I have yet to see any evidence of hardware damage caused by opening up the power limits and letting the CPU naturally attain its full performance.

> but you're now generating more heat than the system is designed to dissipate

I believe that's more accurately rephrased as "now you have more performance than they wanted you to have", because that's how it's being used in practice.

FYI, it is painfully obvious that you're toeing some sort of company line here.

Edit: and getting flagged for pointing out the truth should itself be quite telling. ;-)


> "now you have more performance than they wanted you to have"

"An unthrottled CPU without any OS-level policy will generate enough heat that the case temperature may rise to levels that violate safety regulations" can certainly be interpreted as "now you have more performance than they wanted you to have", yes.

> FYI, it is painfully obvious that you're toeing some sort of company line here.

I work for a company that develops trucks. While thermal dissipation is something that we do need to worry about, I can promise that it is absolutely not at the scale involved here. I've also been pretty consistent in criticising Intel for refusing to document DPTF. Whose company line do you think I'm toeing here?


"An unthrottled CPU without any OS-level policy will generate enough heat that the case temperature may rise to levels that violate safety regulations" can certainly be interpreted as "now you have more performance than they wanted you to have", yes.

Despite literally zero evidence of that actually being a problem in practice?

> I've also been pretty consistent in criticising Intel for refusing to document DPTF. Whose company line do you think I'm toeing here?

Not Intel, but there's plenty of evidence that DPTF just cripples performance; and why else would you jump in to an article about unlocking manufacturer-crippled products with FUD like "not the correct approach"? It's like posting "may cause processor damage" and advocating for stock speeds on every article about overclocking.

For the record, I've been overclocking (not at extreme or competitive levels) since the early 90s and haven't killed any hardware because of it. The system I'm posting this from has been on a 50% overclock for over a decade and it's still rock-solid stable.


> Despite literally zero evidence of that actually being a problem in practice?

Sustained contact with metal at a temperature as low as 45C is sufficient to cause discomfort, and it's a surprisingly small rise above that before you can actually cause burns (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...). Skin temperature is a standard input to DPTF policies, and during my testing I was certainly able to get the skin temperature of my test laptop well above that level - DPTF would trigger throttling instead.

> why else would you jump in to an article about unlocking manufacturer-crippled products with FUD like "not the correct approach"?

DPTF defaults to a "safe" configuration, where "safe" is a euphemism for "almost unusably slow". If what you care about is your laptop not being almost unusably slow, it's reasonable to choose an approach that obtains that without introducing any additional thermal concerns. If what you care about is obtaining the absolute maximum performance possible from your hardware with no regard to any safety or longevity concerns, then yes, implementing the DPTF policy is probably not what you want to do. But recommending the latter without providing any information about the tradeoffs is irresponsible.


For a layman's overview of this, Linus from LinusTechTips gives a basic rundown of the relevant safety regulation and some numbers in one of his videos featuring the M1 Macbook Air and its cooling solution.[1]

[1]: https://youtu.be/ghDvyItIHTY?t=319


[flagged]


While developing my DPTF implementation for Linux I caused the outer skin of my test laptop to reach a temperature that would have caused burns to my legs if I had left it sitting on them for more than a couple of minutes. Now, I have working thermoreception, so I'd probably have noticed this and put it somewhere that wasn't my legs before that happened, but there are various scenarios in which people don't and manufacturers of objects that are intended to be put in your lap are kind of reluctant to design them in such a way that they could cause injury, which is why the manufacturer policies ensure that doesn't happen.

If you're fully aware of the possible consequences of ignoring those policies then go wild. I think it's the wrong choice for most people, and as I said I think it's irresponsible to suggest people change the thermal policy in that way without ensuring they're aware of the possible consequences. If saying "You should only do this if you're ok with your laptop potentially becoming hot enough to burn human skin" is authoritarian then I guess I'm authoritarian?


Your work on reverse engineering DPTF seems abandoned. Is it that everything was upstreamed? Because there's still a performance gap between Windows and Linux.

Also, why is Intel so reluctant to fix this themselves? It seems so simple to me: release the docs and fix thermald. And still, here we are having to use throttled for those wanting a little bit more punch.

And finally, how is it that nobody is creating a custom, patched thermald that takes all this into account? Is the reverse engineering process too complicated, and do we not have the full picture yet? It shouldn't be too difficult to read the values Windows uses and reverse engineer them, although it may be tougher than it seems at first.


I worked on it to the extent that it worked for my use case, given it was a spare time project. It's all been merged upstream, and Intel (to their credit) are maintaining it now even though still refusing to document it.


I don't understand why Intel is so reluctant to do things right.


> And assuming that humans won't have a natural "this is hot, better not touch it" response is absurd.

FWIW, the incidence of people burning themselves on hot pads / body warmers would strongly suggest otherwise:

https://www.hmpgloballearningnetwork.com/site/wounds/reviews...


> Despite literally zero evidence of that actually being a problem in practice?

You choosing to not even look for evidence so you could make an argument in ignorance does not automatically mean it doesn't exist. This is literally the first google result for "laptop case temperature causing burns": https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4292129/

Also look at this fun condition: https://en.wikipedia.org/wiki/Erythema_ab_igne

> Temperatures between 43 and 47 °C can cause this skin condition; modern laptops can generate temperatures in this range. Indeed, laptops with powerful processors can reach temperatures of 50 °C and be associated with burns.

Some more results:

https://escholarship.org/uc/item/4n04r793

https://www.researchgate.net/publication/23164210_Thigh_Burn...

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3853603/


Maybe you should read my rebuttal first? https://news.ycombinator.com/item?id=35026981


So your rebuttal consists of "just put it off your legs, duh". Got it.


That's a flagged and dead reply to a sibling comment which would look schizophrenic as a response to this one - on top of everything else that is already wrong with it.


> Not Intel, but there's plenty of evidence that DPTF just cripples performance; and why else would you jump in to an article about unlocking manufacturer-crippled products with FUD like "not the correct approach"?

Seems like there are quite a few other reasons why someone would "jump in", including being a relative expert. Maybe you could think of a few others yourself?


Slap enough cooling on a CPU and it’s never going to thermal throttle.

They added thermal throttling because most users only max out their CPU a tiny fraction of the time, so the thermal inertia can compensate for inadequate cooling. It's infrequent enough that most users could even disable it without noticing a difference. The problem is that under the wrong conditions (inadequate cooling plus high ambient temperatures plus an extreme workload) you get a dead CPU.

And no this isn’t theoretical, one summer working without AC I cooked a CPU before they added thermal protection. It actually happened to several people in the same heatwave.


Closed-loop thermal management a.k.a. "throttling" was an inevitable consequence of the invention of SpeedStep.


I recall 3 different stages, thermal protection was first designed to save chips if the cooling fans failed etc. It was very conservative because the only option was to turn the system off. SpeedStep was designed to save laptop battery life, so they never needed to thermal throttle in normal operation.

Only fairly recently with Turbo Boost did Intel combine both ideas. It’s actually a fairly clever optimization as single threaded workloads benefit quite a bit from the extra thermal overhead.


The Pentium III was the last Intel CPU where shutdown was the only possible response to overheating, when the CPU asserted THERMTRIP#. The Pentium 4 has bidirectional PROCHOT# and the Thermal Control Circuit, which divides the clock down whenever the processor is hot, and continues to run.


I had forgotten about P4-style thermal protection, but that wasn't supposed to trip in normal operation the way SpeedStep and Turbo Boost are.


It's also what enables some of the more modern packaging technologies.

Silicon aging is worse at higher temperatures, people are going to kill their 7/10nm chips far quicker if they keep overclocking as they've previously done. But that'll be Big Silicon's fault too, no doubt.


FWIW, I've managed to make multiple X1C6s thermally trip and shut off without changing from the safe configuration, which rather impressed me because I didn't know that was feasible these days.

Mostly I mention it to say that there's not _no_ reason for thermal limits in modern setups...


The EC may trigger an emergency shutdown if any monitored component (not just the CPU) ends up outside its thermal limits.


Indeed; I've replaced the laptop since then, so I don't think I have the log, but I believe it very explicitly logged it was the CPU temperature provoking it before actually shutting down.


Is this normally logged anywhere? You may have just saved me RMAing a laptop.


I had a computer with a broken GPU fan and it thermally shut off. There were kernel log entries about temperature warnings, though I don't remember if they mentioned which component. Systemd should preserve them before the system shuts off.


I believe that's called a defective product --- and reminds me of a certain somewhat infamous series of Toshiba laptops which would overheat and shut down if you spent too long in BIOS setup, because the fans were entirely software controlled.

There's zero reason for all this extra "thermal management" crap beyond corporate greed.


You can design hardware that's capable of dissipating the maximum heat output of every component simultaneously, but you'll end up with hardware that's significantly larger, heavier, and costs more to manufacture. Or you can design hardware that's capable of dissipating the expected heat output under a range of use-cases, and devolve policy management to the OS so it can (based on user preference) prioritise appropriately, allowing you to produce smaller, lighter machines that either cost less or generate more profit. Even if the motivation for cutting costs is to increase profit margins, users generally still seem to prefer the smaller, lighter hardware that works just fine under most circumstances.


No... That's what you, the manufacturer looking for a rationale to justify extracting more profit want.

What people want is a box that does computing within spec and without melting, sterilizing their laps/hands, or otherwise introducing malbehaviors they have to work around.

People want stuff that doesn't or is very difficult to break. When it does break, they want it to be a straightforward repair.


That is them, not me.


I had an HP laptop that did that in Windows or Haiku. I rewired the fan to run 100% all of the time and aside from losing some battery time the laptop ran fine. And when plugged into the external power supply I would run it all weekend (4-days) non-stop with no problems.


If you remove the thermal limit, and if this breaks your cpu, for example, within one year, should intel then replace your cpu?

Thermal limits like this are really about managing the manufacturer's liabilities, and protecting the expected lifetime of the product.


The ultimate limit that prevents processor damage, called THERMTRIP, is not possible to disable AFAIK --- it's a purely hardware feature. The one below that, PROCHOT, can be disabled, but it's not useful to do that because it'll just hit the THERMTRIP temperature and shut down instead.

Like I said, Intel warrants their CPUs to be operational at TjMax continuously, and that's the temperature they'll reach and stay at automatically if not given enough cooling. There's plenty of stories of people with computers whose heatsink has somehow detached or become so severely clogged as to be at that limit all the time, and the CPUs survive just fine.

This article isn't about that; it's about manufacturers artificially limiting performance beyond that to hit a marketing target like battery life or power consumption.


What? No, it isn't. It's about something that only affects Ultrabook-style devices, and it's extremely clear that it's not to hit a marketing target because if you run Windows the drivers disable the PL1/2 limitations.


The existence of ThrottleStop and the experience of many others shows that this is also a persistent problem on Windows --- CPU is barely getting warm, and sometimes the fan doesn't even turn on, but then gets throttled to some insanely low speed. Search "800MHz throttling" for plenty of complaints. Disabling all the DPTF and other extra power management bloat puts the CPU speed and performance back where it should be.


IEC 62368-1 places an upper bound on the temperature of materials in electronic devices that are likely to come into contact with users' skin, and that's incorporated into EU regulations[1]. You have two ways to achieve that - add enough cooling such that it's impossible for the device to ever produce enough heat to get that hot, or throttle components that are generating heat if the device is getting close to that temperature. The former requires a cooling system that's going to be significant overkill for most users most of the time, resulting in more expensive hardware that's bigger and heavier.

So, instead, you see the latter. This means that, yes, even if using official Intel DPTF drivers, systems will still occasionally thermally throttle. They may do so even if the CPU itself is not at a dangerous temperature, because the rest of the system may still be too hot and the CPU is the easiest tunable knob in terms of heat generation.

So, yeah, you can still obtain better performance by disabling DPTF and overriding power limits. And in the process you'll end up with a system that will become hotter than permitted by international standards, and you'll risk various types of failure ranging from inconvenient (the adhesive sticking things like rubber feet to the bottom of the machine tending to melt) to expensive (the unexpected levels of thermal expansion cycles causing components to fail earlier). It's fine that you're not worried about these things, but it's just straight up wrong to recommend that people override those limits without being clear about potential outcomes.

[1] https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CEL...

(Edit: replaced IEC 60950-1 with 62368-1, which is the up to date version, and linked to EU incorporation of that spec)


I feel like you're being overly charitable to laptop manufacturers. If such manufacturers always considered the thermal limits of their products when advertising their performance, then I don't think many people would complain.

Since that isn't the case, however, many people get a laptop that isn't capable of performing the tasks it's advertised for. The excuse of "well, we didn't think you'd really need that much power" doesn't really cut it, when said power was strongly advertised on the box.


Oh yeah the fact that a given laptop has a given CPU certainly doesn't mean that it's going to run at the same speed as another laptop with the same CPU, and it's unfortunate that manufacturers don't tend to give users the tools needed to figure out whether the thermal limitations are going to affect them. But review benchmarks do generally show the performance distinctions fairly clearly?


I'm sure there are benchmarks out there that comprehensively cover the thermal behavior of laptops, but many sadly don't. I've had many laptops with great benchmarks that completely fall apart under prolonged load (>1 hour), usually because the benchmark only ran 10-30 minutes, during which the other components didn't heat up enough for throttling to occur.

I don't think the average laptop user should need to concern themselves with acquiring the necessary knowledge to discern good laptop benchmarks.


> I don't think the average laptop user should need to concern themselves with acquiring the necessary knowledge to discern good laptop benchmarks.

They don't. The average laptop user doesn't have sustained workloads that keep the system running at 100% capacity for over an hour, so they don't need and shouldn't want benchmarks reflecting such a use case. They're genuinely better served by benchmarks reflecting workloads that come in shorter bursts.


I assume people got burned by the artificial segmentation of the CPUs solely for pricing purposes.

Trusting Intel to provide accurate info on the actual performance of a chip feels too naive at this point.


If Intel wanted to use power/thermal limits for market segmentation they'd be locked down like the turbo frequency tables on non-K CPUs.


Can't that be hacked too?


I'm not aware of any modern CPUs getting hacked to unlock the multiplier. All you can do is BCLK overclocking


What exactly is preventing it?


But on Windows there's no such problem, so why segment on Linux and not on Windows?


Because Windows is mainstream and Linux is not.


I don't understand why this is a problem on Linux and not on Windows.

It shouldn't be too difficult to correct on Linux. Why does Windows take the manufacturer's limits into account while Linux basically ignores them?


It is corrected on Linux. Just install thermald.


Fedora and Ubuntu ship with this. So my understanding is that it may or may not be as good as in Windows, but it's not running under the most conservative approach by default


Yes. And if you run thermald with the adaptive policy enabled, you shouldn't see any benefit from throttled. It's possible that you will, but if so that's likely a bug.


I used to disable turbo boost for the longest time, but if someone finally fixed CPU scaling and thermal controls, I might give thermald a try.


Hi Matthew, I've been a huge fan of your work ever since, back in 2010 or 2011, I got the Gobi 2000 mobile broadband chip working on Linux on my ThinkPad W510 thanks to you!

As for the pros and cons of the `throttled` project, yes, this might not be the "officially desired" approach, but I know several people who have been using it to great success for years. The reality is, unfortunately, that particularly those of us who use Linux machines for work and order the most recent & powerful ThinkPads or Dells we can get our hands on often realize later (when the machine arrives) that the default settings handicap the machine to such a degree that we can barely work. Unfortunately, not everyone is a kernel developer or knows how these things work, so quick fixes are often welcome, even if they limit the hardware's lifetime (after all, we'll buy a new device in a few years anyway).

What exacerbates this problem is that it's all very opaque: The "official" way to solve these issues (which, as far as I understand you, is installing thermald?) is not really communicated anywhere, nor does thermald come preinstalled on any of the major distributions AFAIK[0]. What's worse, thermald often doesn't even solve the throttling issues without installing further patches. On top of that, BIOS updates by the manufacturers also seem to play a major role, as manufacturers like Lenovo introduce different performance modes and things like "lap mode" etc. To be honest, to this day I haven't quite understood how these things interact and whose responsibility it is to fix things.

In my particular case, I have been using a ThinkPad X1 Carbon Gen9 (which Lenovo says "supports" Linux) and at some point, after installing numerous BIOS updates and working my way through hundreds of posts on the Lenovo forums, I just gave up: My machine still regularly throttles down to 800 MHz per core and 16W under medium load until I hit the secret Fn + H key chord to tell the BIOS to switch back to high performance mode and set the thermal limit back to the maximum.

Do you happen to have a recommendation for me as to where I should start looking (again) for a solution? Does thermald fix these issues these days? (I know that when I last looked into it, it didn't.)

[0]: (EDIT) I take that back, it looks like thermald does come preinstalled on Fedora and Ubuntu these days. At least it's present on my new Ubuntu 22.04.2 installation. Unfortunately, that doesn't really help me since (and now I remember reading this last time I looked into thermald) according to the changelog[1] for v2.3:

> - thermald will not run on Lenovo platforms with lap mode sysfs entry

Great, so I still have nowhere to go from here it seems.

[1]: https://github.com/intel/thermal_daemon


You can make thermald ignore that protection by using the `--ignore-cpu-id` flag. See thread leading up to: https://github.com/intel/thermal_daemon/issues/268#issuecomm...


Thanks, I didn't know about that parameter! However, it does sound somewhat dangerous – sure, I could give it a shot but I basically have no idea how thermald and the firmware interact and what the consequences would be. :\ I mean, I want to ignore/disable lap mode (because that one has been just plain stupid) but I don't want thermald to completely ignore what CPU I have, either.

(Essentially we're back again to the situation I described: It's all very opaque to me.)


This looks like an excellent tool for people repurposing old laptops as servers by putting their motherboard in a different chassis and adding some proper cooling of their own to the board. May need to cool parts close to the CPU as well if the board wasn't designed to transport that much heat.

If you try to do this to your laptop, well... there's a reason you can't legally sell laptops that heat up beyond 40-45℃. Expose yourself to that all you want, but be prepared for hardware damage, overheated skin, or decreased sperm count due to putting an overheated laptop in your lap.

I wouldn't call this a fix in the same way I wouldn't call throwing out your smoke alarm a fix for the constant flat battery beeping.


I want my laptop to be more predictable and reliable and with great battery life instead of having more performance.

But thanks to turbo boost, sometimes my laptop is hot playing a youtube video but cool when compiling code or the other way around. There is no predictability on how long a compilation will take or how long the battery will last, since it would depend on N thermal and power factors.

At least to me, this feels like what happens when marketing designs products instead of product managers. I recently bought an Intel 12th gen i5-1240P laptop (an Asus Zenbook), and this processor boosts from 1.7 GHz to 4.4 GHz, i.e. more than twice the base frequency. That's absurd. I'd rather have a stable ~2 GHz than have the processor boost up to ~4 GHz while surfing the web.

Hence, we wouldn't need tools like this if, at least on laptops, Intel released chips with no turbo boost or a smaller boost range.


We'll probably see physical "turbo" switches on laptops at some point as a gimmick, but that would honestly be ideal at this point.


...and then there is me using

  echo "75000000" | sudo tee /sys/class/powercap/intel-rapl/intel-rapl\:0/constraint_1_power_limit_uw 
to cap my i7-10700 to prevent it from overpowering the system fan by peaking to 200+ watts.
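
If you want the cap to survive reboots, one sketch using systemd-tmpfiles (the file name is made up, and the powercap path and constraint numbering can differ between machines, so verify them first):

  # /etc/tmpfiles.d/rapl-cap.conf (hypothetical file name)
  w /sys/class/powercap/intel-rapl/intel-rapl:0/constraint_1_power_limit_uw - - - - 75000000

  # apply immediately without rebooting
  sudo systemd-tmpfiles --create /etc/tmpfiles.d/rapl-cap.conf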


It's honestly so frustrating. I bought an XPS 13 two and a half years ago and it's been a nightmare getting it to perform. I had to do the following things to make it run on non-turbo boosted clockspeeds without throttling:

- Liquid metal TIM

- Thermal pads + heat pipes connected to chassis to dissipate heat (Yes this means the bottom chassis heats up a lot)

- Disable the intel_rapl_msr Linux driver + disable BD_PROCHOT via MSR (sketch at the end of this comment)

Laptop has worked like a charm since. I really don't want a super thin laptop. I want a small laptop. I wouldn't mind a 2 cm thick 13 inch laptop. But I can't handle a 15 inch laptop. I just find it way too large to be seriously portable.
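
For reference, a sketch of the BD_PROCHOT part using msr-tools, assuming your CPU exposes the bi-directional PROCHOT enable as bit 0 of MSR 0x1FC (the bit ThrottleStop toggles on most recent Intel parts; double-check for your model, and note the change does not persist across reboots):

  sudo modprobe msr
  # read the current value of MSR 0x1FC (MSR_POWER_CTL)
  val=$(sudo rdmsr -c 0x1fc)
  echo "$val"
  # clear bit 0 and write it back on all CPUs to disable BD PROCHOT
  sudo wrmsr -a 0x1fc $(( val & ~1 ))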


14 inch laptops are the sweet spot for me, easily fit in a backpack with a protective travel case and support larger memory & more performant CPUs.

e.g. ThinkPad T14 Gen 3 AMD, Ryzen 7 pro 6850U, 32 gb LPDDR5-6400MHz for around 1000 EUR

Having used an XPS 13 and XPS 15 I was underwhelmed and none of Dell's laptops hit the sweet spot for me.


It seems there is an endless supply of people who know just enough to write some system programs but not enough to learn basic energy accounting. You cannot simply make a CPU run faster by writing MSRs. The current goes in and the heat goes out and the temperature goes up. You can't make it just work under arbitrary parameters.


> You cannot simply make a CPU run faster by writing MSRs

I like such generalised statements. You can read about the Xeon v3 hack and ThrottleStop's PowerCut. Each is just "writing MSRs", with the funny side effect of your CPU taking in more current.


Sure you can. It just makes the cooling fan actually start spinning and do its job.

Yeah. It's that bad. I have a Thinkpad P14s.


Can concur, the defaults cause throttling before fans start spinning properly.

What's worse, these things have an accelerometer that causes the same type of throttling if you move your laptop.

Fuck Intel-based clothes-iron laptops so hard.


You can, because they come crippled by default now.

The benchmarks do not lie.


Why is intel like this =\


Manufacturers (rightly or wrongly) believe users want machines that are as thin and light as possible. This makes a bunch of things more complicated, including managing system thermals. Heat generated from the CPU has to go somewhere. As you get thinner, it's hard to get as much airflow and so fans are less effective. As you reduce the amount of material in the chassis, less heat can be dumped into it without it heating up enough to potentially be uncomfortable for the user. Larger internal batteries become another source of heat while charging.

Handling all of this safely becomes difficult, especially because there isn't necessarily a policy that satisfies all your users. But you can't leave it purely up to the OS either, because the OS has no idea of what the thermal characteristics of the platform are. So rather than attempting to encode all of this policy directly into firmware, Intel wrote the Dynamic Power and Thermal Framework (DPTF) spec, providing a mechanism for the firmware to share information about thermal control interfaces, interactions, and desired temperature bounds, and then let the OS make policy control decisions around that. Until the OS indicates it's ready to take over, the firmware imposes a default safe policy that's guaranteed to avoid any thermal issues, albeit at the cost of performance.

Of course, this only works if the OS knows how to do this, and Intel never publicly documented it so I had to reverse engineer it instead.


Another example of how being open source friendly boils down to "it depends on the green paper" even for the companies that do market themselves as such.

This is not the only area where Intel doesn't really support Linux, some of their GPU models also come to mind, like the PowerVR based ones in the past.


Is Linux taking a conservative approach because they're ignoring the DPTF? Why is this not a problem on Windows?


By default Linux is taking no approach at all, which means the thermal policy is the most conservative the firmware can configure. If you install thermald then Linux will implement the same policy that Windows does and should have the same performance.


I was very surprised by some of the thermal characteristics of my i7-13700k. My previous build was an i7-4790k, so it's been a minute. I had to undervolt this thing and cap its max TDP (disable boost modes -- it has boost modes which are very thirsty) to get it to complete benchmarks while staying under 90°C (with a top of the line case, very good fans / circulation, and a large AIO). It's great now, but undertuning the thing is a total departure from what I recall from '00s and '10s gaming machines.


> i7-13700k

253 W max turbo power is not that crazy by today's standards.

> top of the line case, very good fans / circulation, and a large AIO

I think you'll find that what people consider good cooling for a desktop has changed somewhat in the last decade. My first GPU didn't even have a fan, but today it's fairly common for enthusiast builds to have an external radiator. I dunno what you consider large, but most AIOs only have slightly more surface area than large air coolers so they really aren't worth it for sustained workloads like gaming or ML training. Custom loops have always been the go-to solution.


What AIO did you use? I just built a new PC with an i9-13900k and an MSI MEG Coreliquid 360 AIO cooler.

It benchmarks really well and I’ve never seen it over 50°C, the fans are really quiet, and I haven’t changed any of the configuration for it.

On the flip side I’ve got a i9-12900k in a different PC with air cooling and a more compact case and between that and the graphics card, the smaller machine runs super hot and noisy.


It's a CoolerMaster 360 AIO in a midtower Corsair case.

I was using Cinebench for benchmarking. I haven't seen any performance hits, in benchmarks or typical usage, so I'm comfortable with all of it.

I decided to undervolt the thing based on about a dozen or so Reddit threads worth of conversation about the thermal properties of these 13th generation Intel chips.


I wish I could undervolt my laptop.

As noted by the author:

> ===== Notice that undervolt is typically locked from 10th gen onwards! =====

I can't even modify the BIOS due to BootGuard and the keys burned into the CPU.

Hopefully there will be a way to leak/extract the keys someday, as this creates real e-waste for fake security.


Boot Guard is based on RSA signatures, so the private key material isn't present on the device.


I was thinking more about the recent Alder Lake Lenovo leaks :)

Right now I'm fighting with nohz_full and rcu_nocbs to limit the wakes per second, with irqaffinity on top to isolate to one core what must happen so that the others can stay in PC8 when I'm not compiling.

I'm down to about 350 wakeups per second on a 12th gen Alder Lake with E and P cores (running Wayland with a few terminals, Thunderbird and Edge), but I wish I had the ability to also undervolt to go beyond 4W.
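
For anyone wanting to try the same, a sketch of the kernel boot parameters involved (the core list is illustrative for an 8-thread layout, keeping core 0 as the housekeeping core; nohz_full needs a kernel built with CONFIG_NO_HZ_FULL):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="quiet nohz_full=1-7 rcu_nocbs=1-7 irqaffinity=0"

  sudo update-grub   # or grub2-mkconfig, depending on the distro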


The laptop builder likely didn’t spend the extra $2 to properly cool the CPU, so the CPU slows down to prevent burning out or burning your lap? The CPU being smart about its own temp is a good thing.


That's not what's happening in this case. Linux is getting much lower performance than Windows on the same laptop due to a firmware bug.


No, Linux is getting lower performance than Windows because distributions aren't shipping the tooling needed to inform the firmware that the OS can manage platform thermals, and so the firmware is defaulting to a safe state. Of course, the reason they're failing to do so is because Intel never publicly documented any of this.


I don't understand what's so difficult about this.

Why is Intel not providing the documentation? And what is it so difficult to reverse engineer?

Is this tool correctly reading the DPTF data and correcting for it?


> I don't understand what's so difficult about this.

It requires parsing multiple undocumented binary file formats, making sense of how the information in each is interlinked, and writing code that keeps track of system state, compares it against the firmware policy, and then imposes the appropriate constraints.

> Why is Intel not providing the documentation?

I have no idea.

> And what is it so difficult to reverse engineer?

https://github.com/mjg59/thermal_daemon/compare/028bde5bf0f1... was my initial implementation of this (remember to click "Load diff" for anything that's too large for github to display by default). There's a fair amount of complexity here!

> Is this tool correctly reading the DPTF data and correcting for it?

Thermald reads the DPTF data, throttled doesn't. I can't promise that thermald reads it correctly given the lack of a spec, but to the best of my knowledge it does.


This isn't Intel's fault... unless you consider them providing things like adjustable power limits a problem. Its CPUs have had automatic thermal throttling and will shutdown on catastrophic overheating ever since the Pentium II.

It's all the fault of manufacturers who want to both save cost with inadequate heatsinks and impose arbitrary restrictions on their products. The software in this article looks like the Linux equivalent of ThrottleStop, a Windows application that was the first to expose the truth behind it all.


I'm not sure how failing to publicly document the DPTF specification is anything other than Intel's fault. The CPUs are not running in such a constrained configuration under Windows, for example, because Intel supply drivers to configure them appropriately.


Exactly, the CPUs run much more loosely on Windows than on Linux. It shouldn't be too difficult to correct this...


> It shouldn't be too difficult to correct this...

In the sense of "You can install thermald and it will be corrected", yes. In the sense of "The only reason this works on Linux is that I reverse engineered the Intel spec because Intel refuses to publish it", then I'd disagree with the characterisation that it shouldn't have been too difficult, because it was really pretty difficuly.


Windows doesn't need DPTF either --- a quick search online shows just how many others are getting much better performance after applying NoDPTF.reg and similar fixes.


You're being weirdly obtuse here. Yes, you can make hardware perform faster by running it outside its design limits. That doesn't mean it's a good idea in the general case, even if the tradeoffs make sense for you as an individual.


"design limits", more like "marketing limits".

It's certainly a good idea if you realise just how much they're fleecing you.


Consumers: keep buying whatever garbage Intel puts out each year

Consumers: "Why would Intel do this?"


Why does Intel allow firmware to control power management? That's a long story but it's very boring and hardly evil.


What tools can do the opposite? I have a refurb ThinkPad X1 Carbon, running Debian 11 with i3, that I use for creative writing (vim/markdown/pandoc). I'd like the battery to last as long as possible.


Check /sys/class/powercap - if you have some RAPL entries there you can set the maximum power draw of the CPU. But in general if you have a fixed workload (ie, your system wants to do a certain amount of work, not use a certain percentage of CPU) then reducing CPU power limits will result in the CPU slowing down enough that it has to stay awake for longer to do that work, and will (counter-intuitively) actually consume more power to do the same amount of work. Running the CPU fast to get the work done quickly means the CPU can then put itself in a low-power state that shuts down a lot of ancillary components, saving more power than running the CPU at half the speed for twice as long.
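
A quick way to see what's there before changing anything (paths and zone layout vary between machines, and some systems expose subzones too):

  # list the RAPL zones and their current limits (values are in microwatts)
  grep . /sys/class/powercap/*/name 2>/dev/null
  grep . /sys/class/powercap/*/constraint_*_power_limit_uw 2>/dev/null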


Throttled is great for undervolting your CPU


The real solution for laptops ~ just disable boost! (Granted, you might have to disable it via /sys every boot, but that can be scripted...)

Laptops shouldn't be using boost anyways, because their form factors and CPU coolers just can't handle the heat output.
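
A sketch of what that looks like, depending on which cpufreq driver is in use (check which one you have first; intel_pstate and acpi-cpufreq expose different knobs):

  # intel_pstate: writing 1 disables turbo, 0 re-enables it
  echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

  # acpi-cpufreq: writing 0 disables boost
  echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost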


If you want really long battery life and no heat at all, you can downclock all the way to something like 200-400MHz. A recent CPU at that speed is actually quite usable for things like text editing and reading documentation.
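
One sketch of doing that with cpufreq (values are in kHz; many CPUs won't actually go below their hardware minimum, often somewhere around 400-800 MHz):

  # cap every core at 400 MHz
  for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq; do
    echo 400000 | sudo tee "$f"
  done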

> Laptops shouldn't be using boost anyways, because their form factors and CPU coolers just can't handle the heat output.

On the other hand, if it's plugged in much of the time, then let it boost as much as it can, with speed only thermally limited. Otherwise you're not getting the true performance you paid for.


> If you want really long battery life and no heat at all, you can downclock all the way to something like 200-400MHz. A recent CPU at that speed is actually quite usable for things like text editing and reading documentation.

Linux does grant the user that flexibility, so if someone actually wants that, they can have it.

The max non-boost frequency is usually the sweet spot for performance and efficiency.

> On the other hand, if it's plugged in much of the time, then let it boost as much as it can, with speed only thermally limited. Otherwise you're not getting the true performance you paid for.

If the user wants to live with a potentially reduced laptop lifespan, sure thing. But it's just not worth it for a laptop, frankly, given their limited thermal cooling capacities. That CPU will degrade over time when run at that level of heat.


> That CPU will degrade over time when run at that level of heat.

If Intel warrants their CPUs to be at TjMax 24/7, it's a good sign that it shouldn't be a problem. I have not heard of overheating killing CPUs since the days when AMD CPUs didn't have any thermal protection[1], and I've cleaned out machines which were heavily clogged with dust and had been thermally throttling all the time for many years (the service was prompted by their owners complaining about their computers being slow). In one memorable case the push-in heatsink pins must not have been fully inserted originally, since they came out at some point and the heatsink was not even touching the CPU anymore, yet the CPU kept running for years in that state.

[1] There's a famous TomsHardware video about that: https://www.youtube.com/watch?v=y39D4529FM4


> If Intel warrants their CPUs to be at TjMax 24/7, it's a good sign that it shouldn't be a problem.

It certainly does work. I was working on some manufacturing equipment in 2018. The 2010-release 1st-gen i3 had a centimeter gap between the IHS and the HSF. The Intel HSF's thermal compound had never been touched; it was like new. The CPU had run thermally throttled to about 700 MHz for eight years, continuously. Once the HSF was properly attached, the slowness and thermal throttling went away.


> That CPU will degrade over time

I overvolted and overclocked pretty much every chip I've owned from my 20 year old Athlon 64 to my 0 year old RTX 4090. None of them have degraded. If you watch overclocking livestreams you'll see just how much abuse it takes to get any sort of reaction from silicon.


Same here; within sensible ranges it won't do harm. It's the current that destroys the longevity of the silicon, not so much the clock or voltage.


I'd say it's neither. The only failures I've seen in the data center were caused by differential thermal expansion cycles and broken solder balls. The same phenomenon kills game consoles but not ML/mining GPUs, which spend all day and night at max power/current and constant temperature.


Until you power cycle them a couple of times... My old AMD GPU even held a record for how "good" the ASIC still was or was not. TL;DR: Yes, your GPU wears and becomes slower over time; eventually it breaks. Not because of bad thermal coefficients but because of the current... Even more so when OC'ing, because U=I*R. You can benchmark it yourself. The broken solder balls of the past were attributed to the transition from leaded to lead-free solder. If I recall correctly, it was mostly NVIDIA and Apple who suffered from this, and only temporarily, for 1 or 2 generations.


> Laptops shouldn't be using boost anyways, because their form factors and CPU coolers just can't handle the heat output.

That's quite a sweeping statement. The adequacy of cooling seems to depend a lot on the device and its configuration.

I've had two small form-factor ThinkPads at default configuration over the last nine years, and there have been zero problems with heat. The CPUs have conservative TDP limits, and the cooling seems adequate for that.

One of the laptops ended up with a broken keyboard after ~seven years, but if that hadn't happened, it would probably still be in daily (and not always particularly light) use, as it had been until then.

I'm sure some manufacturers and models choose their parts less conservatively, and put too powerful CPUs or GPUs (or set their cTDP too high) in a chassis that can't really handle it. For them, the maximum turbo allowed by their CPU/configuration might be too much over prolonged periods. Some of those devices might end up failing due to thermal issues within some years.

But for more conservatively configured laptops (such as business ones), disabling turbo would probably quite needlessly limit their performance. Unless you're aiming for a much, much longer lifespan than almost anybody uses their devices.



