
T520's NVS 4200M quits after 15-ish minutes of load

Posted: Sun Jul 09, 2023 11:15 pm
by FuzzleSnuz
My trusty T520 has developed an issue with its 4200M and I'm stuck scratching my head as to what has gone wrong. Fruitlessly searching on the googles didn't help, but it did lead me to discover this neat forum. I'm hoping one of the knowledgeable fellows around here can help me identify what's wrong with my T520.

My T520 is a model 4243CU6, 99% stock. Original i5-2520M, 1600x900, NVS 4200M, etc. My only changes are putting an mSATA SSD in the 3G/4G radio slot, adding a usb 3 expresscard, and replacing the ultrabay after it failed a year ago. It has been my daily laptop for the last 7 years, mostly for VS, office, miscellany, and some games.

After many years of smooth sailing, something appears to have gone wrong with the NVS 4200M. About 6 or so months ago, it developed a behavior of partially quitting after running under load for 15 or so minutes. The symptoms are very strange. I think this is all best described with a chronology of events.

Observations

Scenario 1:
1. Fresh boot.
2. Run something which will consistently load the 4200M at a high level. For this scenario, my testing go-to is a particular game which keeps the gpu utilization in the 70-76% range. 4200M temp is at a stable 180-183°F. (yeah, that's really hot, but pretty typical for laptops of this era)
3. Everything is fine for the first 15-ish minutes, game runs fine (for a 4200M) at a stable 30 fps, until eventually...
4. Some unknown switch is flipped and the 4200M craps out. The gpu utilization skyrockets to 97-100% and the game is now struggling to run at about 0.66 fps. 4200M temp is now at a stable 151-154°F.

Scenario 2:
1. Fresh boot.
2. Run something which will consistently load the 4200M at a low level. For this scenario, my testing go-to is watching an mpeg-dash video stream in a chromium-based browser that decodes it with nvdec, which keeps the gpu utilization in the 12-14% range. 4200M temp is at a stable 145-149°F.
3. Everything is fine for the first 4-6 or so hours, gpu video decoding is fine, until eventually...
4. That unknown event happens and the 4200M craps out. Exact same symptoms as in Scenario 1. Only a reboot solves the problem.

Once the crap-out event hits, all applications using the 4200M for anything non-negligible (i.e. everything except dwm) are now slowed to a crawl - this includes currently running applications as well as any applications launched after the crap-out event. It doesn't matter if the application is intensive (like a game) or basic (like video decoding). Closing the applications and waiting for a while doesn't help. I've waited up to 5 hours, no dice. The only solution is to reboot the machine.

Here's how things look from the perspective of my addgadgets GPU meter:
[Image: addgadgets GPU meter screenshot]
Again, even though the gpu utilization is inexplicably maxed, all gpu-using applications are actually running at arthritic snail speed. It's the opposite of what you'd expect.

Background

This issue spontaneously started happening about half a year ago. It never happened before then. I've had the same nvidia driver for the last 4 years; but for good measure, I tested with every single previous nvidia driver I've had installed (and also the ancient R320 driver that the lenovo website recommends) and the crap-out now occurs with all of them, so the cause of the problem is elsewhere. It's also not related to the applications I use for testing in Scenarios 1 and 2. The game I use in Scenario 1 is one that hasn't changed since 2009 and I ran it flawlessly on the T520 for many years before this problem started happening. Same blistering 180°F temps, too. As for Scenario 2, well Google has made a mockery of software versioning but I can't imagine that Chromium's gpu h264 decoder pipeline substantially changed right at the exact moment this issue started happening on my T520.

Other potentially relevant details:
- Optimus is enabled, and the NVS 4200M is in "Optimal Power" mode with the basic desktop perf level keeping it at 33% max frequency. It will go to max frequency (810 MHz) correctly for demanding applications.
- I've only had the 376.33, 385.41, 385.90, 392.56, and 392.58 nvidia drivers installed. The 392.58 driver has been installed for the last 4 years (long before the 4200M started crapping out). Never had this issue before with any of them.
- The 4200M's idle temp is 120°F when the T520 is on its dock.
- Applications that intermittently load the gpu (e.g. typical WPF things like VS) do not cause the crap-out. Or perhaps they might, but only if I left them running and redrawing for something like 3 weeks, which is not something I have tested.
- I've used the same genuine 90W AC adapter(s), genuine 55++ batteries, and series 3 thinkpad mini-dock for the last 7 years.
- This problem occurs both when my T520 is on its dock and when off-dock & plugged directly into the AC adapter.
- I use Windows 7 Professional x64, which has all its shots and is currently updated to 6.1.7601.26561 (June 2023).
- BIOS is the latest official unmodified one with the spectre mitigation.

The two 100% reliable methods of inducing the crap-out event suggest to me that the cause of the problem is related to GPU throughput, which doesn't say much directly but in turn could be related to a number of things. At first, I thought the 4200M was simply overheating and entering some kind of emergency cooling mode that isn't smart enough to automatically end, but Scenario 2 seems to disprove that. My next guess would be some kind of permanent heat damage (not sure how or what kind) to the NVS 4200M that came from those 180°F temps after many years, but in the last 7 years I have only put about 50 hours @ 180°F into the 4200M. Also, it will hold itself just fine in the 180-183°F range for those 15 minutes before the crap-out actually occurs. The temp does not sharply increase right before the crap-out. My last guess is some kind of transfer threshold between the cpu and gpu that, when exceeded, causes the crap-out - which itself would just be indicative of some more specific cause, like some failing PCB component, since there is no virtual pcie bandwidth police in Windows or the nvidia driver. I wouldn't even know where to start if the problem is with the circuits.


At any rate, thank you for reading this post which turned out to be an essay (oops), and I hope somebody out there knows what is going on with my T520. I know the quickest answer is simply "4200M go bad, buy new mobo", but I am hoping for a more exacting diagnosis and potentially a cheaper fix.

Thanks much in advance to anyone with advice.

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Mon Jul 10, 2023 12:33 am
by TPFanatic
I'll spitball it sounds memory related.
I believe the DGPU will use its VRAM first before sharing system memory. One or the other could be failing.
I understand with Optimus the Intel HD graphics and memory controller (both part of the removable CPU) are just as important as the Nvidia for actually displaying the accelerated graphics.
So between RAM and the CPU you could add them to the parts shotgun and see if anything changes. Otherwise, it's likely unfortunately as you said, NVS 4200m bad.

In case the NVS 4200m is faulty, I suggest NOT ever setting in BIOS the display device to "Dedicated". This tends to brick laptops if the DGPU is defective.

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Mon Jul 10, 2023 1:44 am
by RealBlackStuff
When was the last time you refreshed the thermal paste between CPU/GPU and fan?

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Tue Jul 11, 2023 5:59 pm
by FuzzleSnuz
Hi and thanks for the replies!
RealBlackStuff wrote:
Mon Jul 10, 2023 1:44 am
When was the last time you refreshed the thermal paste between CPU/GPU and fan?
Hmm, well that would be... never. Now I hope that the refurbisher did so way back when, but even still that means roughly 7 years of the same paste at best. Most of what I do on my T520 keeps the average CPU utilization below 35% and the GPU below 5% so I've never really felt much pressure to put on fresh paste. But I do know that this type of T520 has the more advanced heat pipe + rear vent to accommodate the 4200M. Maybe this larger cooler is more prone to pumping out the paste than the basic one on HD 3000 only T520s? I don't have a lot of knowledge or experience with thermal paste other than a little more than the basics and putting it on chips that are bigger and in bigger chassis with simpler cooling setups (desktops and older game consoles).

If old thermal paste is the problem (or at least part of it), what do you recommend for regular maintenance to prevent this in the future? Would, say, re-applying (high quality) paste every 5 years be good enough? This laptop mostly does things with sporadic cpu requirements, i.e. a bunch of near idling, then full bore when compiling something in VS, then back to near idling, then spikes while VS chews on something in the background, then back to idle-ish, etc.

TPFanatic wrote:
Mon Jul 10, 2023 12:33 am
I'll spitball it sounds memory related.
I believe the DGPU will use its VRAM first before sharing system memory. One or the other could be failing.
I understand with Optimus the Intel HD graphics and memory controller (both part of the removable CPU) are just as important as the Nvidia for actually displaying the accelerated graphics.
So between RAM and the CPU you could add them to the parts shotgun and see if anything changes. Otherwise, it's likely unfortunately as you said, NVS 4200m bad.

In case the NVS 4200m is faulty, I suggest NOT ever setting in BIOS the display device to "Dedicated". This tends to brick laptops if the DGPU is defective.
I do have about half a shell's worth of a part shotgun already loaded, in the form of an i7-2760QM and some gelid gc-extreme paste. When the 4200M craps out, chromium and some other applications are smart enough to revert to software for things like video decoding, and the stock i5-2520M really struggles with modern video format decoding (like VP9) and also with WARP performance on contemporary applications that have too many visual effects. So I thought I'd at least give the CPU some help, in case fixing the 4200M doesn't work out, so that things are at least a little better when it starts to crap-out. And also so my T520 can be a little more like the W520 it/I wishes it was. I know the i7 comes with increased thermal demands, but I can always put the 2520M back in if I need to. And if the signs point to my 2520M having damage related to optimus & the memory controller like suggested, I'll just get a 2540M for it instead.

So I will certainly try a CPU swap soon-ish and see if that has any improvement, which ofc entails fresh paste like RealBlackStuff was suggesting. From what I read, the gelid gc-extreme is a popular choice for laptops and their unique cooling setups, but I'm open to something else if you guys happen to know what the T520's favorite paste is.

And the memory in my T520 might be the stock model RAM units (or equivalent), but certainly not the original 2011 units since they have some numbers on them that read like a date in 2014. They are samsung made and stamped with a lenovo label. I don't have any extra ram of this type on hand, but I will add that to my list of things to buy if the CPU swap doesn't yield any improvement.

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Wed Jul 12, 2023 12:56 am
by RealBlackStuff
How to apply thermal paste: http://www.arcticsilver.com/pdf/appmeth ... d_v1.1.pdf

You mention Gelid GC-Extreme paste, that would be fine.
With your heat problems, I bet that fan is probably full of dust and the paste has fully dried out.
The first thing to do RIGHT NOW is to get the T520 manual from the HMM link at the top of this page.
Open the T520, remove the cooler, clean its plates plus CPU and GPU properly with 90-99% Isopropyl alcohol, clean all the dust from fan and fan-blades, apply thermal paste on CPU and GPU, put the cooler back on properly and put the rest of the parts back.
That Gelid paste takes about 48 hours to properly cure.
Unless the T520 starts overheating again sooner, I suggest repeating the above at least once every 2 years.

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Thu Jul 13, 2023 1:45 am
by FuzzleSnuz
RealBlackStuff wrote:
Wed Jul 12, 2023 12:56 am
How to apply thermal paste: http://www.arcticsilver.com/pdf/appmeth ... d_v1.1.pdf

You mention Gelid GC-Extreme paste, that would be fine.
With your heat problems, I bet that fan is probably full of dust and the paste has fully dried out.
The first thing to do RIGHT NOW is to get the T520 manual from the HMM link at the top of this page.
Open the T520, remove the cooler, clean its plates plus CPU and GPU properly with 90-99% Isopropyl alcohol, clean all the dust from fan and fan-blades, apply thermal paste on CPU and GPU, put the cooler back on properly and put the rest of the parts back.
That Gelid paste takes about 48 hours to properly cure.
Unless the T520 starts overheating again sooner, I suggest repeating the above at least once every 2 years.
Back when I replaced the ultrabay about a year ago, I took the opportunity to disassemble several of the T520's layers (using that same HMM) so that I could give it a good dust cleaning. I was a little surprised that it was less dust than I expected. I will give it another dust cleaning when I re-do the paste / potentially swap the CPU. Probably the latter. I hate confounding variables when I'm trying to learn something new, but I also hate wasting expensive paste.

So my guess is that dust might not be a big factor here, but the beat to [censored] old paste probably is. If I can remember to I'll take some photos of the current paste for the record before removing it. I doubt I'd be able to tell if it is still healthy or totally used up just by looking at it.

You mention overheating and I do worry about that. I know laptop cooling from this era chronically sucks, but I don't like seeing those 180°F temps under max load either. To the best of your knowledge, what would you say is a healthy but also reasonably achievable temperature range for the 2520M and 4200M (at max load) in this model of T520? On quality paste that still has some life in it.

Thanks for all the pointers and advice.

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Thu Jul 13, 2023 6:34 am
by RealBlackStuff
With decent paste:
Idling 30-40 C / 85-105 F
Load 60-70 C / 140-160 F
Full blast max. 90 C / 195 F
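
Since the readings earlier in this thread are in Fahrenheit and the ranges above are easiest to read in Celsius, here is a rough sketch for lining the two up. The band edges (30-40, 60-70, 90 C) are taken from the list above; the function names, the labels, and the way the gaps between bands are handled are assumptions made only for illustration.

```python
def f_to_c(temp_f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def classify(temp_c: float) -> str:
    """Place a Celsius reading against the rough bands listed above.

    Band edges come from the post; the labels and the treatment of the
    gaps between bands (40-60 C, 70-90 C) are illustrative assumptions.
    """
    if temp_c < 30.0:
        return "below idle band"
    if temp_c <= 40.0:
        return "idle band"
    if temp_c < 60.0:
        return "between idle and load bands"
    if temp_c <= 70.0:
        return "load band"
    if temp_c <= 90.0:
        return "above load band, under the stated maximum"
    return "over the stated maximum"


# Mid-range values of the readings reported earlier in this thread (Fahrenheit).
readings_f = {
    "idle on dock": 120,
    "light load (video decode)": 147,
    "heavy load (game)": 182,
}

for label, temp_f in readings_f.items():
    temp_c = f_to_c(temp_f)
    print(f"{label}: {temp_f} F = {temp_c:.1f} C -> {classify(temp_c)}")
```

Any other reading can be added to `readings_f` to see where it falls; the bands are ballpark figures, not hard limits.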

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Mon Jul 17, 2023 5:51 pm
by FuzzleSnuz
After replacing my T520's original i5-2520M with an i7-2760QM, the NVS 4200M is now behaving correctly again. No more crapping out, no matter how hard I try to reproduce that symptom. Since I changed two variables (replacement processor + different model of processor), it is impossible to say with 100% certainty what the exact problem was. But I can at least conclude that either one or both of the helpful diagnoses from TPFanatic and RealBlackStuff were correct.

I was a little surprised when I removed the cooler assembly to get to the CPU. Either the refurbisher shelled out extra money for a cooler assembly FRU with machine pre-applied paste, or much more likely my T520 was still rocking the very first paste it got from the factory in 2011. To my amateur eye, it looked like the vast majority of the paste that was once in the interface area had been pumped out to the sides over the years.
https://i.postimg.cc/hPv9XW8K/paste-on-all.jpg

The new gelid gc-extreme paste I applied is working well. Now that it has had a few days to fully cure, the 4200M sits in the 150-152°F range at 98% utilization. Way better than the 180-183°F at 75% utilization I was getting before. I wonder how the gelid paste compares to whatever paste the Lenovo factories used (circa 2011) and how long the gelid would take to get to the state shown in the photo above. But I certainly won't let it sit for another 12 years before getting refreshed.

At any rate, thanks again to TPFanatic and RealBlackStuff for the advice and help. I've learned some things and I'll be keeping an eye on the temps to tell when it is time to add new paste. I guess I expected too much of the refurbisher to give it fresh paste. And I'll also be prepared to replace the CPU again eventually if indeed the Sandy Bridge memory controller is something that goes bad under these conditions.

Now it's time to update my signature b/c this T520 is no longer stock 8)

Re: T520's NVS 4200M quits after 15-ish minutes of load

Posted: Fri Jul 21, 2023 4:41 pm
by TPFanatic
Good job and very glad that it's now behaving!

The factory Lenovo paste doesn't last past the warranty period. Every secondhand ThinkPad I've had my hands on between 2016 and now has had dried up disgusting ineffective factory paste. :D On the dGPU, workstation, and any other infamously hot-headed models this is a slow and hot path to death. I personally love the Arctic brand paste, I did well in the past with AS-5 and my current favorite is MX-6.