I've got a really weird problem.

Two years ago I bought a set of 4x 16 GB DDR4-3200 modules, single-sided, and put them in a Ryzen 5600 desktop that I almost never turned off. It worked without issue.

This weekend I wanted to dust off the PC, so I took all the components out, replaced the thermal paste and so on.

Turned the PC on again and it apparently worked fine, until after a while Linux got pissed about running out of memory. Out of memory? With 64 GB of RAM? I checked with dmidecode -t memory and saw that one channel was reported as completely empty.
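For anyone wanting to run the same check, here is a minimal sketch of picking the empty slots out of `dmidecode -t memory` output. It assumes the usual layout where each "Memory Device" block carries `Size:` and `Locator:` fields and an empty slot reports "No Module Installed". The `SAMPLE` text and slot names are made up for illustration and are not output from this machine.

```python
# Illustrative sample only: trimmed, hypothetical dmidecode-style text.
SAMPLE = """\
Handle 0x0010, DMI type 17, 84 bytes
Memory Device
    Size: 16 GB
    Locator: DIMM_A1
    Speed: 3200 MT/s

Handle 0x0011, DMI type 17, 84 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM_B1
    Speed: Unknown
"""


def parse_memory_devices(dmidecode_text):
    """Return one dict of 'Key: Value' fields per 'Memory Device' block."""
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            # A new slot description starts here.
            current = {}
            devices.append(current)
        elif not stripped:
            # A blank line ends the current block.
            current = None
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def empty_slots(devices):
    """List the locators of slots that report no module."""
    return [
        d.get("Locator", "?")
        for d in devices
        if d.get("Size", "").startswith("No Module")
    ]


print(empty_slots(parse_memory_devices(SAMPLE)))  # prints ['DIMM_B1']
```

On a real system the text would come from running `dmidecode -t memory` as root. If a slot that physically holds a module shows up in this list, the board is not detecting it.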

Shut down the PC, reseated the modules in the second channel, rebooted, saw 64 GB. One hour later, kernel panic. Rebooted into memtest86+, error in memory. What? Removed one module, error. Removed two modules, no error. Switched the modules, no error. What??

Placed the two modules that are passing the test in another computer, error. Put back in the original computer, pass test. AAAAAAAAAAAAAAA

Now I downclocked from 3200 to 2400 and everything seems to be working fine.

What could it be? Have I been cursed?

Do the slots degrade after a few reinsertions to the point that they can’t sustain 3200 anymore?

  • breadsmasher@lemmy.world
    1 month ago

    Maybe the contacts were damaged on reinsert? Not just degrading / wearing down, but physically damaged

  • Nougat@fedia.io
    1 month ago

    … dust off the PC …

    It’s not at all out of the question that some filth got into your connector(s). Hit them with a mess of canned air and try again?

    • Moonrise2473@feddit.it (OP)
      1 month ago

      It might be. After all, I took out all the components and then dusted the case with compressed air (didn’t let the fans spin).

    • infeeeee@lemm.ee
      1 month ago

      That’s the most common thing, happened to me multiple times. Even a very small amount of dust in the slot can cause issues like that.

  • hsdkfr734r@feddit.nl
    1 month ago

    I don’t think that you will see a difference in performance. :)

    SO-DIMM and DIMM sockets have a somewhat limited rated durability of just 25 mating cycles. link

    I never reached that limit. And I’m not sure if this is related to your case.

    • Asifall@lemmy.world
      1 month ago

      I wonder what that 25 number actually means. It’s 25 across multiple slot types so I’m guessing it’s less a measured value and more a quality control number based on their most fragile product.

      Probably something like this: a sample is cycled 25 times, and if fewer than X% still test as being in spec, they know something is wrong with the current batch. But again, that’s mostly a guess, and the actual durability experienced by the end user would vary significantly depending on what the acceptable failure rate is.

      • hsdkfr734r@feddit.nl
        1 month ago

        I think so too. Most likely most of the sockets will survive more than 25 cycles. Maybe it’s a specified minimum durability which is guaranteed for nearly all sockets.

  • _haha_oh_wow_
    1 month ago

    Inspect the channels for debris. Hit the RAM contacts and slot with contact cleaner (don’t get any on your skin).

  • aubeynarf@lemmynsfw.com
    1 month ago

    RAM is easily damaged by static discharge. Were you wearing a ground strap and taking care not to let the memory modules touch any ungrounded surfaces while you were handling them?

    Static damage can often appear as marginal or intermittent failures, probably more often than complete failure.

    • Moonrise2473@feddit.it (OP)
      1 month ago

      No, I manhandled them and put them on a random shelf. I was under the impression that modern electronics are designed to withstand that kind of light abuse. I saw an ElectroBOOM video where he tries and fails to fry RAM with electrostatic discharge.

      • dylanmorgan@slrpnk.net
        1 month ago

        Newer components are, if anything, more vulnerable to ESD because they have more delicate construction.

  • Asifall@lemmy.world
    1 month ago

    Placed the two modules that are passing the test in another computer, error

    So you put the RAM you thought was good in another motherboard and it failed memtest? I’d interpret that to mean one of three things:

    A) the problem is in one of those modules you switched

    B) separate problems occurred on both motherboards either due to unrelated issues or the memory being seated incorrectly (this is really unlucky)

    C) there’s a problem with the modules you switched and an unrelated problem either in the other modules or in your primary motherboard (you poor bastard)

    Did you take note of where in memory memtest was finding errors? If it wasn’t in the same general area between runs, then it’s more likely to be a motherboard issue.

    • Moonrise2473@feddit.it (OP)
      1 month ago

      On the X370 Ryzen motherboard the test always failed at test #5, and it appeared to be shifted bytes (expected FEFEFEFE got 00FEFEFEFE)

      On a lowest-end H-series Intel motherboard it just beeps and won’t even boot in dual channel. In single channel it boots and passes the test. The Intel motherboard has those shitty RAM slots where there’s only one clip on a single side and the other is fixed (to save 1¢ I guess), so it’s a bit difficult to ensure proper contact
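One way to read a memtest mismatch like the one above is to XOR the expected and actual words, which shows exactly which bits and bytes came back wrong. This is only a sketch: the pasted "got" value has ten hex digits, so it can't be taken literally as a 32-bit word, and the `actual` value below is a hypothetical stand-in (0x00FEFEFE), not the real reading.

```python
def differing_bits(expected, actual):
    """Return the bit positions (0 = least significant) where the words differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def differing_bytes(expected, actual, width_bytes=4):
    """Return the byte lanes (0 = least significant) containing a wrong bit."""
    diff = expected ^ actual
    return [i for i in range(width_bytes) if (diff >> (8 * i)) & 0xFF]


expected = 0xFEFEFEFE
actual = 0x00FEFEFE  # hypothetical value, assumed for illustration only

print(differing_bytes(expected, actual))  # prints [3]
print(differing_bits(expected, actual))   # prints [25, 26, 27, 28, 29, 30, 31]
```

If the same byte lane shows up on every failure, that points at specific data lines (one chip, or a few contacts in the slot). Errors scattered across lanes look more like a signal integrity or timing problem.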

  • brygphilomena@lemmy.world
    1 month ago

    You put new thermal paste on things? Did you remove the CPU as well? You could have damaged some pins there too.

    The delay before the failure sounds like it could be caused by the components expanding with heat.

    Take it apart and look at the pins on the RAM, the RAM slots, and the CPU (if you removed that) for any damage.

    • Moonrise2473@feddit.it (OP)
      1 month ago

      I put the new thermal stuff only on the CPU, specifically that new Honeywell material. It’s a bit smaller than the CPU: I ordered a 3x3 cm piece after measuring a Core i3 that I had on hand, while the Ryzen has a bigger IHS and a 4x4 cm piece would fit better

      I’m thinking maybe I tightened the cooler too much, but it’s the OEM one, so it shouldn’t allow overtightening because it has stoppers on the threads… unless the Honeywell pad is too thick for that