The system:

MSI Raider GE67 HX 12UHS

Intel Core i9-12900HX

NVIDIA GeForce RTX 3080 Ti (laptop)

32 GiB RAM

Windows 11 Pro 64-bit

The problem:

Intermittently (usually 2-3 times per day), the system crashes, usually ending in a blue screen with one of several stop codes. Codes I’ve seen include:

HYPERVISOR_ERROR

CLOCK_WATCHDOG_TIMEOUT

VIDEO_TDR_FAILURE

IRQL_NOT_LESS_OR_EQUAL

Sometimes the system hangs but the blue screen never comes, and I have to power it off manually. When this happens, the fans go to full speed and yet the laptop quickly becomes incredibly hot if I don’t power it off as soon as possible, suggesting that the CPU or GPU is maxing out for some reason.

Event Viewer shows nothing out of the ordinary in the lead-up to the crash.

Things I’ve ruled out:

I initially thought it only happened while plugged in, so I bought a new power supply. That didn’t affect how often it happens, and I have since seen it happen on battery as well.

I also initially thought it was more frequent while playing games that use the dedicated graphics card, but I’m not sure that’s actually true; I have seen it happen while just watching YouTube.

At one point I felt that it happened more when I moved the laptop or plugged in USB devices, but I think that may be magical thinking; I have never been able to trigger it on purpose by doing those things.

One pattern that does seem real: after a crash, if I let the laptop restart automatically, it often crashes again within a short time, whereas shutting down fully and then powering it back on buys more time before the next incident.

Solutions I’ve tried:

I updated the BIOS and the Intel firmware to the latest versions available on MSI’s website, but that doesn’t seem to have helped. I also updated my NVIDIA drivers.

A possibly related issue:

A week or so before this happened for the first time, I updated the BIOS to fix a different issue. What happened then was: I was unintentionally playing a game on battery, and didn’t notice until the “low battery - switching to Super Battery” warning appeared and began throttling system performance. I plugged the laptop in, but performance didn’t improve. I restarted, and performance was terrible across all applications, even Firefox. I checked Resource Monitor and noticed that the CPU was being throttled down to around 0.16 GHz. Event Viewer was showing warnings saying that the processor was being limited by system firmware.

I tried using various Windows and MSI power management settings to resolve the issue, which persisted across restarts, fully charging the battery, etc. In the end, I solved it by updating the BIOS (to a version that is now one version back from the most current one).

It was a while, maybe a week, after running the update that the crash happened for the first time.

Current theory:

Is it possible I screwed up the BIOS update somehow? I noticed that it instructs you to return clock speeds to stock before doing the update. I don’t think I’ve manually adjusted them, but MSI’s “MSI Center” software seems to offer automatic adjustment. It was set to “Balanced” when I did the most recent update, but it may have been set to “Auto” when I did the first one, which I guess could be a problem if the CPU was automatically overclocked.

  • GrundlButter@lemmy.dbzer0.com
    1 year ago

    Doesn’t seem like you have tested the RAM, have you? Repeated BSODs with different error codes can be a sign of bad RAM, and loading memtest on a bootable USB is a great way to test for that.

    • ryven@lemmy.dbzer0.com (OP)
      11 months ago

      Update: memtest86 passed! That’s good, I guess, but I really did think this was the best suggestion, so I’m kind of surprised. I’m going to find a test for the graphics card, and if it passes I’m following the other recommendation to clean reinstall the OS.

      • GrundlButter@lemmy.dbzer0.com
        11 months ago

        Good and bad news indeed. I think you’ve got the right course of action: if it’s not a discernible piece of hardware, then a nuclear approach to software is warranted. BIOS/microcode updates are another thing I would try as well. I wish you luck!

        • ryven@lemmy.dbzer0.com (OP)
          10 months ago

          Hey just so you know, I finally got around to fixing this after puttering around with it a bit at a time for months, and long story short the SSD was failing, despite several test programs claiming it was good (???). New SSD is running fine.

          Edit: Well, that didn’t last long. The bluescreens are back on the new hardware with a clean install. New hypothesis: whatever is causing them is also what caused the previous SSD to fail. Rather than sacrifice additional components trying to figure it out, I’m just going to call it here and see if it’s still under warranty.

    • ryven@lemmy.dbzer0.com (OP)
      1 year ago

      Sounds like a good call, I’ll make sure I’ve got everything important backed up and try it this weekend.

      • sicjoke@lemmy.world
        1 year ago

        It sounds like a nuclear option but it should avoid what would otherwise be a lot of fucking about. Fingers crossed a reinstall sorts it and you don’t have an underlying hardware issue.

        • ryven@lemmy.dbzer0.com (OP)
          10 months ago

          Update: it was the SSD, despite some tests I ran claiming otherwise. Reinstalling Windows failed, but an install on a fresh SSD is running fine!

          Edit: Well, that didn’t last long; see the edit on my reply in the memtest thread above.

  • Appoxo@lemmy.dbzer0.com
    1 year ago

    For better troubleshooting of BSODs, search for “BlueScreenView” online.
    It makes it way easier to analyze the dumps.

    You could also try Safe Mode with Networking to see if it’s something caused by software or the OS.
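
Before opening the dumps in a viewer, it can help to confirm that Windows is actually writing them and to line their timestamps up with the crashes. The following is only a rough sketch, not something from this thread: it assumes the default minidump folder `C:\Windows\Minidump` has not been relocated, and `list_minidumps` is a made-up helper name. The numeric values in the table are the documented bug check codes for the four stop codes named in the post, which is how they tend to show up in dump analysis tools.

```python
from datetime import datetime
from pathlib import Path

# Assumption: default minidump location on a standard Windows install.
MINIDUMP_DIR = Path(r"C:\Windows\Minidump")

# Documented bug check values for the stop codes named in the original post.
BUG_CHECK_CODES = {
    "IRQL_NOT_LESS_OR_EQUAL": 0x0000000A,
    "CLOCK_WATCHDOG_TIMEOUT": 0x00000101,
    "VIDEO_TDR_FAILURE": 0x00000116,
    "HYPERVISOR_ERROR": 0x00020001,
}


def list_minidumps(folder=MINIDUMP_DIR):
    """Return (timestamp, size_in_bytes, path) for each .dmp file, newest first."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    dumps = []
    for path in folder.glob("*.dmp"):
        info = path.stat()
        dumps.append((datetime.fromtimestamp(info.st_mtime), info.st_size, path))
    # Sort on the timestamp only, most recent crash first.
    return sorted(dumps, key=lambda d: d[0], reverse=True)


if __name__ == "__main__":
    for name, code in BUG_CHECK_CODES.items():
        print(f"{name}: 0x{code:08X}")
    dumps = list_minidumps()
    if not dumps:
        print(f"No minidumps found in {MINIDUMP_DIR}")
    for when, size, path in dumps:
        print(f"{when:%Y-%m-%d %H:%M:%S}  {size:>10,} bytes  {path.name}")
```

Reading that folder may require running from an elevated prompt. If it stays empty after a crash, that would fit the hangs described in the post where the blue screen never appears, since a hard hang typically leaves no dump behind.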