Any system is only as good as its availability. When you develop an application that will be deployed to the cloud, it is important to understand that the remoteness of the execution platform is a disadvantage when the application fails. Do you know how long it will take you to deploy an emergency fix to your cloud environment?
Development processes are executed every day and we routinely deploy fixes in a methodical and thoughtful manner. But what happens when the system in the cloud is down inexplicably?
Part of our approach to designing applications for the cloud is to design them in a fail-safe and fail-soft manner. The application architect and designers need to give as much attention to detecting and handling when the system is failing as they do to the normal operation of the system.
Customer interactions with a failing system must be handled in a way that preserves customer data and tries to recover the user's input. It is not uncommon to find cute messages like “Oops something went wrong” or “Well this is embarrassing” in modern cloud-based applications; they tell us the system knows there’s a problem and it’s being handled.
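As a minimal sketch of that fail-soft behavior, the handler below tries the normal save path and, if it fails, parks the user's input before returning a friendly message. The names `handle_submission`, `PENDING_DIR` and the injected `save` callable are hypothetical, and a temporary directory stands in for whatever durable storage a real system would use.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Hypothetical holding area for input that could not be saved.
# Assumption: a real system would use durable storage, not a temp directory.
PENDING_DIR = Path(tempfile.gettempdir()) / "pending_submissions"


def handle_submission(form_data, save):
    """Try the normal save path; on failure, keep the input and fail soft."""
    try:
        save(form_data)
        return {"ok": True, "message": "Saved."}
    except Exception:
        # Preserve what the user typed so it can be replayed after recovery.
        PENDING_DIR.mkdir(parents=True, exist_ok=True)
        reference = uuid.uuid4().hex
        (PENDING_DIR / f"{reference}.json").write_text(json.dumps(form_data))
        return {
            "ok": False,
            "reference": reference,
            "message": "Oops, something went wrong. Your input has been kept.",
        }
```

The reference returned to the caller gives support staff a way to find and replay the parked input once the system has recovered.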
We must also have technology in place that reports these incidents, with diagnostic data, to the service management teams. The first we hear of an outage should not be the 6 o’clock news or an email from an irate customer. This too must be part of the architecture and design of the system.
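One way to picture such reporting is a small helper that packages an exception with diagnostic data and hands it to a delivery channel. This is only a sketch: `build_incident_report`, `report_incident` and the injected `send` callable are hypothetical names, and the set of fields collected is an assumption rather than a prescribed schema.

```python
import json
import platform
import sys
import traceback
from datetime import datetime, timezone


def build_incident_report(exc, context):
    """Collect diagnostic data about a failure into a plain dictionary."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": platform.node(),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "context": context,  # e.g. request id, user action (assumed fields)
    }


def report_incident(exc, context, send):
    """Send the report via `send`; fall back to stderr if delivery fails."""
    report = build_incident_report(exc, context)
    try:
        send(json.dumps(report))
    except Exception:
        # Reporting must never hide the original failure.
        print(json.dumps(report), file=sys.stderr)
    return report
```

In practice `send` might post to a paging or ticketing system; passing it in as a parameter keeps the sketch independent of any particular service.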
Of course, critical to our cloud applications are the numerous third-party services we incorporate to give our applications a rich user experience and to save us coding effort. Wherever possible, it is important to get Service Level Agreements in place with these publishers, especially around the volatility of the service’s interface.
The Software-as-a-Service (SaaS) features we incorporate may change without warning, or may be unavailable from time to time, so we need to build detection of both into our architecture too.
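A simple check can tell these two cases apart: the service not answering at all, and the service answering with a shape we no longer recognize. The sketch below assumes a hypothetical `fetch` callable that returns the service's parsed response as a dictionary, and a list of field names our application depends on; neither comes from any specific provider's API.

```python
def check_third_party(fetch, required_fields):
    """Classify a third-party service as ok, unavailable, or interface_changed."""
    try:
        payload = fetch()
    except Exception as exc:
        # Any failure to obtain a response is treated as unavailability.
        return {"status": "unavailable", "detail": str(exc)}

    if not isinstance(payload, dict):
        return {"status": "interface_changed", "detail": "response is not an object"}

    # A missing field suggests the publisher changed the interface.
    missing = [field for field in required_fields if field not in payload]
    if missing:
        return {"status": "interface_changed", "detail": f"missing fields: {missing}"}

    return {"status": "ok", "detail": ""}
```

Run on a schedule, a check like this lets the application degrade gracefully or raise an incident before customers notice the change.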
All software systems fail from time to time, and cloud-based systems are no exception. When we have to face this challenge, the detection, management and remediation should be a well-rehearsed process that can be executed calmly and efficiently. Make sure you build this activity into your test planning.