Cloud Outages: Who?s at Fault?- Valutrics

Cloud outages do happen. So, how can you and your IT group minimize the impact on the company?

Imagine a scenario where the unthinkable happens. Your company’s cloud provider suffers a major outage that grinds business to a halt. While the IT department and CIO are busy placing blame of the outage on the provider, the rest of the organization is likely to blame the internal IT staff for the disruption in service. So that begs the question, where does the responsibility for outages lie, and what can be done to mitigate outage risk?

An “unthinkable” scenario of a catastrophic cloud outage isn’t nearly as far-fetched as some might think. Want some recent proof? How about Amazon’s massive 5-hour AWS outage that occurred in February? This impacted things such as Quora, GigHub and Docker. Another recent example is when Rackspace faced a 3-hour worldwide cloud outage that impacted popular SaaS products including Cisco Spark.

Clearly, cloud outages remain a fairly common occurrence. IT leadership must recognize this and understand that they aren’t off the hook when it comes to outages that could impact a company’s bottom line. When you outsource infrastructure management to a third-party cloud provider, you’re trusting that the provider will adhere to the level of accessibility as outlined in their service level agreement (SLA). But, it’s important to note that a transference of trust in supporting underlying network components is not a blanket transference of responsibility when outages occur. For IT departments, the SLA is your first line of defense. If an SLA does not meet your requirements, it’s up to you to seek out providers that offer more robust solutions with higher penalties if the agreement level is breached.

Beyond the SLA, there are plenty of other ways that cloud customers can limit the impact of a major service provider outage. One way is to leverage a hybrid cloud approach where you load balance between on-premises and public cloud resources. That way, an outage in one segment of the infrastructure will not completely knock your applications offline. Multi-cloud strategies are also becoming popular. This is especially true now that administrators have a wide array of multi-cloud management platforms that significantly reduce the effort required when working inside differently-architected cloud environments.

For those of you that already have robust plans and network designs in place that sufficiently reduce the impact of a potential public cloud outage, I have one more question for you: What is your strategy surrounding uptime of shadow IT applications?

Even though an employee or department skirted standard operating procedures for application usage on the corporate network, it remains the duty of the IT department to track down these apps and do whatever is possible to manage accessibility risk. This is where a shadow IT outreach program could be used to identify and wrap protection around unauthorized applications that remain critical to the business.

The one caveat to cloud outage risk mitigation is that it’s not going to be free. Cloud service providers that offer more robust infrastructures and improved SLA’s are going to demand a premium price. So too is the time and money spent implementing hybrid, multi-cloud and other cloud resiliency protocols and procedures.

If the business makes the universal decision to not pay for this type of risk mitigation, that’s one thing. But IT departments and IT leadership must, at minimum, perform their due diligence and provide a cost/benefit analysis based on the probability that a cloud outage will economically impact business operations. In some cases, that economic impact will be so low that the added time and money spent to bolster cloud resiliency is not worth the investment. But for most, at least some form of added protection will be money well spent. It’s simply up to the IT department to determine exactly where that level of protection should be.