
Fail Open or Fail Closed?

Unless you’ve been hiding under a rock, by now you’ve seen the news, and perhaps even been impacted by the CrowdStrike update that unfortunately blue-screened Windows machines around the world and brought various industries to a halt.

I don’t want to throw shade, and I don’t think that we as an industry should. Sure, it’s a massive screw-up on a global scale that will likely have far-reaching ramifications and consequences, and we shouldn’t underplay its significance and impact.

But remember – we live and operate in a world where criminals are trying to do bad things every day, and the complex software we build to keep those criminals at bay will likely contain bugs and defects that won’t easily be uncovered without wide-scale production usage.

Which brings me to a topic that should always be an important design criterion, and one the CrowdStrike issue has once again brought to the forefront: when there is a fault in production, should the system fail open or fail closed?

To state it simply, failing closed means that a device or system stops, or otherwise shuts down and prevents further operation, when a failure condition occurs; in contrast, failing open means that the error is essentially ignored and the system or device continues operating as if everything were normal.

In the world of physical devices, the preferred model is generally obvious. Consider, for example, a lawnmower with a hand-closed lever that must be held down at all times to operate. Any time the lever is released, or any time the sensor even thinks that the lever is released, the blade stops rotating; this behavior is deliberately designed to ensure physical safety. The system fails closed, and rightly so.

In other domains, such as cybersecurity, the answer is not always as obvious. In general, solutions should fail closed in situations where security concerns override the need for access, and fail open when access is deemed the first and foremost priority. Take a physical security example in which a card reader controls an electronic lock on an access door. You would probably want the system to fail closed if the card reader fails, as otherwise anyone and everyone could potentially gain access; however, you may want the system to fail open on a power failure, to ensure that people inside the building can safely exit in an emergency.
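To make the card-reader example concrete, here is a minimal Python sketch of a per-fault fail-mode policy. Everything in it is hypothetical and illustrative: the `ReaderFault` and `PowerFailure` exceptions, the `FAIL_POLICY` table, and the `door_should_unlock` function are not part of any real access-control product. It simply shows that failing open versus failing closed can be an explicit, reviewable decision made per type of failure, rather than an accident of how errors happen to propagate.

```python
from enum import Enum
from typing import Callable


class FailMode(Enum):
    OPEN = "open"      # on a fault, allow access (availability first)
    CLOSED = "closed"  # on a fault, deny access (security first)


class ReaderFault(Exception):
    """Hypothetical: the card reader could not read or validate a badge."""


class PowerFailure(Exception):
    """Hypothetical: the lock controller lost mains power."""


# Hypothetical policy mirroring the card-reader example in the text:
# a reader fault fails closed, a power failure fails open.
FAIL_POLICY = {
    ReaderFault: FailMode.CLOSED,
    PowerFailure: FailMode.OPEN,
}

# Faults nobody anticipated fall back to failing closed.
DEFAULT_MODE = FailMode.CLOSED


def door_should_unlock(check_access: Callable[[], bool]) -> bool:
    """Run the access check; if it raises, apply the fail mode for that fault."""
    try:
        return check_access()
    except Exception as fault:
        mode = DEFAULT_MODE
        for fault_type, policy_mode in FAIL_POLICY.items():
            if isinstance(fault, fault_type):
                mode = policy_mode
                break
        return mode is FailMode.OPEN


if __name__ == "__main__":
    def unreadable_badge() -> bool:
        raise ReaderFault("badge unreadable")

    def power_lost() -> bool:
        raise PowerFailure("mains power lost")

    print(door_should_unlock(lambda: True))      # valid badge: True
    print(door_should_unlock(unreadable_badge))  # reader fault fails closed: False
    print(door_should_unlock(power_lost))        # power failure fails open: True
```

Defaulting unanticipated faults to closed is a deliberate choice for a security door; a system where access is the first priority might reasonably choose the opposite default.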

I have seen various comments in the past few days arguing that the CrowdStrike system should have failed open instead of failing closed with the notorious Blue Screen of Death (BSOD). Had it done so, we might instead be complaining about (and digging out from) a set of widespread breaches and attacks rather than recovering from the BSOD, and it isn’t necessarily obvious which situation is inherently worse.

Instead, the question we need to ask ourselves as it pertains to cybersecurity is how one particular failure impacts the overall system, where “overall” refers not just to one software component, and not even just to one physical machine, but to the full set of machines, components, systems, and processes that together form an organization and its complete cybersecurity strategy and approach. Unfortunately, many of our overall systems today are still designed without resiliency in mind, without asking “what happens if component X fails?” If we assume the worst, and design the overall cybersecurity approach on the assumption that bad things will happen, then we can tolerate a given component failing open and still be confident that we can keep our organizations protected (at least for a period of time until the specific issue is fixed).

This is exactly how HYAS believes we should be thinking about cybersecurity – and in fact the overall notion of designing resiliency into cybersecurity systems is supported by the United States government and international governments alike. The role of Protective DNS is specifically to identify and stop an attack that has already gotten through other cybersecurity layers, whether because of a new tactic or technique that evades detection, a fault or software defect, or even human error. However the malware or attack got into the environment, Protective DNS will see the telltale signs of the breach – the outbound communication to command-and-control for instructions, attack progression, and data exfiltration – and prevent that communication from happening, rendering the attack inert.
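The layered idea can be illustrated with a deliberately simplified Python sketch, assuming a hypothetical endpoint check and a hypothetical DNS blocklist. The names `KNOWN_BAD_PROCESSES`, `C2_BLOCKLIST`, and the three functions are invented for illustration and do not describe how HYAS Protect or any other product is implemented. The point is that when the endpoint layer is faulted and fails open, a separate protective DNS layer can still refuse to resolve a command-and-control domain, so the outbound connection never happens.

```python
# Hypothetical indicators; a real protective DNS service would rely on
# continuously updated threat intelligence rather than a hard-coded set.
KNOWN_BAD_PROCESSES = {"malware.exe"}
C2_BLOCKLIST = {"c2.example.net", "exfil.example.org"}


def endpoint_layer_allows(process_name: str, agent_healthy: bool) -> bool:
    """Endpoint check that fails open when its agent is faulted."""
    if not agent_healthy:
        return True  # fault: this layer fails open and lets everything through
    return process_name not in KNOWN_BAD_PROCESSES


def dns_layer_allows(domain: str) -> bool:
    """Protective DNS check: block the query name or any parent domain on the list."""
    labels = domain.lower().rstrip(".").split(".")
    # For "a.c2.example.net", check "a.c2.example.net", "c2.example.net", "example.net".
    for i in range(len(labels) - 1):
        if ".".join(labels[i:]) in C2_BLOCKLIST:
            return False
    return True


def outbound_allowed(process_name: str, domain: str, agent_healthy: bool) -> bool:
    """A connection proceeds only if every layer allows it."""
    return endpoint_layer_allows(process_name, agent_healthy) and dns_layer_allows(domain)


if __name__ == "__main__":
    # The endpoint agent is faulted and fails open, but the C2 lookup is still blocked.
    print(outbound_allowed("malware.exe", "beacon.c2.example.net", agent_healthy=False))  # False
    # Ordinary traffic keeps flowing while the agent is down.
    print(outbound_allowed("browser.exe", "www.example.com", agent_healthy=False))  # True
```

This only helps against threats the DNS layer can recognize; the sketch shows why one component failing open for a short period can be tolerable in a layered design, not that any single layer is sufficient on its own.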

The future is about resiliency, about security-in-depth, and about combining and integrating solutions so that we not only stop attacks but are truly proactive and resilient in spite of the changing landscape, attack vectors, techniques and, yes, software defects. Ultimately, a cyber-resilient system architecture allows individual components to fail open (for short periods of time) and can therefore help ensure not just continued access and usage but appropriate levels of security as well.



David Ratner
CEO, HYAS
