…according to a Twitter post by the Chief Information Security Officer of Grand Canyon Education.

So, does anyone else find it odd that the file that caused everything CrowdStrike to freak out, C-00000291-00000000-00000032.sys, was 42KB of blank/null values, while the replacement file, C-00000291-00000000-00000033.sys, was 35KB and looked like a normal, if not obfuscated sys/.conf file?

Also, apparently CrowdStrike had at least 5 hours to work on the problem between the time it was discovered and the time it was fixed.

  • @Imgonnatrythis
    10 points, 2 months ago

    Maybe. But I’d like to think I’d just say something clever like, “says here that this year the pommel horse will be replaced by yours truly!”

    • @[email protected]
      17 points, 2 months ago

      Problem is that software cannot deal with unexpected situations like a human brain can. Computers do exactly what a programmer tells them to do, nothing more, nothing less. So if a situation arises that the programmer hasn’t written code for, then there will be a crash.

      • @[email protected]
        2 points, 2 months ago

        Poorly written code can’t.

        In this case:

        1. Load config data
        2. If data is valid:
          1. Use config data
        3. If data is invalid:
          1. Crash entire OS

        Is just poor code.
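
For illustration, here is roughly what that flow could look like with step 3 changed from "crash" to "fall back". This is a user-space Python sketch, not CrowdStrike's actual driver logic: the `MAGIC` header, the `DEFAULT_RULES` fallback, and both function names are hypothetical, and a real kernel driver would be written in C with very different constraints.

```python
# Hypothetical format: a 4-byte magic header followed by newline-separated rules.
MAGIC = b"CFG1"
DEFAULT_RULES = ["builtin-rule"]  # assumed safe built-in fallback


def is_valid(data: bytes) -> bool:
    """Reject empty files, all-null files, and files without the expected header."""
    if len(data) <= len(MAGIC):
        return False
    if data.count(0) == len(data):  # a file consisting only of null bytes
        return False
    return data.startswith(MAGIC)


def load_rules(data: bytes) -> list[str]:
    """Use the config when it is valid; otherwise fall back instead of crashing."""
    if not is_valid(data):
        return list(DEFAULT_RULES)
    body = data[len(MAGIC):].decode("utf-8", errors="replace")
    return [line for line in body.splitlines() if line]
```

With this, a 42KB buffer of null bytes is rejected up front and the caller gets the built-in rules back. Whether falling back is acceptable for a security product is a separate policy question.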

        • @[email protected]
          14 points, 2 months ago

          When talking about the driver level, you can’t always just proceed to the next thing when an error happens.

          Imagine if you went in for open heart surgery but the doctor forgot to put in the new valve while he was in there. He can’t just stitch you up and tell you to get on with it, you’ll be bleeding away inside.

          In this specific case we’re talking about security for business devices and critical infrastructure. If a security driver is compromised, in a lot of cases it may legitimately be better for the computer to not run at all, because a security compromise could mean it’s open season for hackers on your sensitive device. We’ve seen hospitals held ransom, we’ve seen customer data swiped from major businesses. A day of downtime is arguably better than those outcomes.

          The real answer here is CrowdStrike needs a more reliable CI/CD pipeline. A failure of this magnitude is inexcusable and represents a major systemic failure in their development process. But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.

          • Morphit
            5 points, 2 months ago

            This error isn’t intentionally crashing because of a security risk, though that could happen. It’s a null pointer exception, which suggests there were no static or runtime checks in place to prevent it or handle it more gracefully. This was presumably a bug in the driver for a long time, then a faulty config file came and triggered the crashes. Better static analysis and testing of the kernel driver is one aspect, how these live config updates are deployed and monitored is another.

          • @[email protected]
            2 points, 2 months ago

            But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.

            In which case this should’ve been documented behaviour and probably configurable.

          • @[email protected]
            1 point, edited 1 month ago

            That’s a bad analogy. CrowdStrike’s driver encountering an error isn’t the same as not having disk IO or memory corruption. If CrowdStrike’s driver didn’t load at all or wasn’t installed, the system could still boot.

            It should absolutely be expected that if the CrowdStrike driver itself encounters an error, there should be a process that allows the system to gracefully recover. The issue is that CrowdStrike likely thought of their code as not being able to crash as they likely only ever tested with good configs, and thus never considered a graceful failure of their driver.

            • @[email protected]
              1 point, 1 month ago

              I don’t doubt that in this case it’s both silly and unacceptable that their driver was having this catastrophic failure, and it was probably caused by systemic failure at the company, likely driven by hubris and/or cost-cutting measures.

              Although I wouldn’t take it as a given that the system should be allowed to continue if the anti-virus doesn’t load properly more generally.

              For an enterprise business system, it’s entirely plausible that if a crucial anti-virus driver can’t load properly then the system itself may be compromised by malware, or at the very least the system may be unacceptably vulnerable to malware if it’s allowed to finish booting. At that point the risk of harm that may come from allowing the system to continue booting could outweigh the cost of demanding manual intervention.

              In this specific case, given the scale and fallout of the failure, it probably would’ve been preferable to let the system continue booting to a point where it could receive a new update, but all I’m saying is that I’m not surprised more generally that an OS just goes ahead and treats an anti-virus driver failure as BSOD-worthy.

          • @[email protected]
            1 point, 2 months ago

            You know there’s a whole other scenario where the system can simply boot the last known good config.

              • @[email protected]
                1 point, edited 1 month ago

                The following:

                • An internal backup of previous configs
                • Encrypted copies
                • Massive warnings in the system that current loaded config has failed integrity check

                There’s a load of other checks that could be employed. This is literally no different than securing the OS itself.

                This is essentially a solved problem, but even then it’s impossible to make any system 100% secure. As the person you replied to said: “this is poor code”
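
As a rough sketch of the first and third bullets (an internal backup of the previous config plus a loud warning on a failed integrity check), something like the following could work. Everything here is an assumption for illustration: the `ConfigStore` class, the idea that each update ships with an expected SHA-256, and the in-memory backup. Encrypted copies are left out. Note that a checksum only catches corruption or tampering after publication; it would not help if the vendor published a broken file together with a matching hash.

```python
import hashlib
from typing import Optional


def checksum(data: bytes) -> str:
    """SHA-256 of the config contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()


class ConfigStore:
    """Keeps the last config that passed its integrity check (hypothetical design)."""

    def __init__(self) -> None:
        self.last_known_good: Optional[bytes] = None
        self.warnings: list[str] = []

    def apply(self, data: bytes, expected_sha256: str) -> Optional[bytes]:
        """Return the config to run with: the new one if intact, else the backup."""
        if data and checksum(data) == expected_sha256:
            self.last_known_good = data  # the new config becomes the backup
            return data
        # Integrity check failed: warn loudly and keep running on the backup.
        self.warnings.append("config failed integrity check; using last known good")
        return self.last_known_good
```

If there is no backup yet, `apply` returns `None` and the caller has to decide what "no config" means, which is exactly the policy argument in this thread.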

                Edit: just to add, failure for the system to boot should NEVER be the desired outcome. Especially when the party implementing that is a 3rd party service. The people who set up these servers are expecting them to operate for things to work. Nothing is gained from a non-booting critical system and there is literally EVERYTHING to lose. If it’s critical then it must be operational.

                • The 3rd party service is AV. You do not want to boot a potentially compromised or insecure system that is unable to start its AV properly, and have it potentially access other critical systems. That’s a recipe for a perhaps more local but also more painful disaster. It makes sense that a critical enterprise system does not boot if something is off. No AV means the system is a security risk and should not boot and connect to other critical/sensitive systems, period.

                  These sorts of errors should be alleviated through backup systems and prevented by not auto-updating these sorts of systems.

                  Sure, for a personal PC I would not necessarily want a BSOD, I’d prefer if it just booted and alerted the user. But for enterprise servers? Best not.

                  • @[email protected]
                    1 point, 1 month ago

                    Sure, for a personal PC I would not necessarily want a BSOD, I’d prefer if it just booted and alerted the user. But for enterprise servers? Best not.

                    You have that backwards. I work as a dev and system admin for a medium sized company. You absolutely do not want any server to ever not boot. You absolutely want to know immediately that there’s an issue that needs to be addressed ASAP, but a loss of service generally means loss of revenue and, even worse, a loss of reputation. If your server is briefly at a lower protection level that’s not an issue unless you’re actively being targeted and attacked. But if that’s the case then getting notified of an issue can get some people to deal with it immediately.

        • @[email protected]
          10 points, 2 months ago

          I agree that the code is probably poor but I doubt it was a conscious decision to crash the OS.

          The code is probably just:

          1. Load config data
          2. Do something with data

          And 2 fails unexpectedly because the data is garbage and was never checked for validity.
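
To make that failure mode concrete, here is a small Python sketch under made-up assumptions: a record whose first four bytes are a little-endian index into a lookup table. The `HANDLERS` table and both functions are hypothetical and are not taken from the real driver. In Python the unchecked version raises an exception on garbage; in kernel-mode C the analogous out-of-range read could take the whole machine down.

```python
import struct
from typing import Optional

HANDLERS = ["allow", "block", "log"]  # hypothetical lookup table


def handler_unchecked(record: bytes) -> str:
    """Trusts the data: garbage input surfaces as an exception here."""
    (index,) = struct.unpack("<I", record[:4])
    return HANDLERS[index]


def handler_checked(record: bytes) -> Optional[str]:
    """Validates length and range before touching the table."""
    if len(record) < 4:
        return None
    (index,) = struct.unpack("<I", record[:4])
    if index >= len(HANDLERS):
        return None
    return HANDLERS[index]
```

The checked version costs two comparisons and turns "crash" into "no handler", which the caller can then log or reject.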

          • Morphit
            3 points, 2 months ago

            You can still catch the error at runtime and do something appropriate. That might be to say this update might have been tampered with and refuse to boot, but more likely it’d be to just send an error report back to the developers that an unexpected condition is being hit and just continue without loading that one faulty definition file.
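
A minimal sketch of that "report it and carry on without the faulty file" approach, in Python for readability. The parser, the file names, and the `report` callback are all hypothetical stand-ins; a real implementation would live in the driver or its user-space service and would send telemetry rather than call a Python function.

```python
from typing import Callable


def parse_definition(data: bytes) -> list[str]:
    """Hypothetical parser: raises ValueError on malformed input."""
    if not data or data.count(0) == len(data):
        raise ValueError("definition file is empty or all null bytes")
    # decode() raises UnicodeDecodeError, a ValueError subclass, on bad bytes.
    return data.decode("utf-8").splitlines()


def load_definitions(
    files: dict[str, bytes],
    report: Callable[[str, Exception], None],
) -> dict[str, list[str]]:
    """Load every definition file that parses; report and skip the ones that don't."""
    loaded = {}
    for name, data in files.items():
        try:
            loaded[name] = parse_definition(data)
        except ValueError as exc:
            report(name, exc)  # e.g. send an error report back to the developers
    return loaded
```

One bad file then costs you that file's definitions and generates a report, instead of costing you the boot.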

            • @[email protected]
              2 points, 1 month ago

              Unfortunately, an OS that covers such cases is a lost monetization opportunity, fuck the system, use a Linux distro, you get the idea. Microsoft makes money off of tech support for people too unversed in computers to fix it themselves.

    • @[email protected]
      7 points, 2 months ago

      I’m gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEOs have been hyping AI up so much, what could possibly go wrong?

      • @[email protected]
        7 points, 2 months ago

        What are the chances that CrowdStrike started using AI to do their update deployments, and they just won’t admit it?