As I explained in part one of this blog, many organisations fail to recognise the need for disaster recovery (DR) planning for their cloud-based applications. Even when the need is understood it can be challenging to put effective plans in place. One challenge is that developing a comprehensive plan requires collaboration and engagement across multiple functions within an organisation – it cannot simply be a job for IT. Another challenge is that many IT services now rely on multiple application components, some of which may be running in the cloud, while others are running in the organisation’s data centres. Building an effective disaster recovery plan (DRP) therefore requires a structured, cross-functional approach that focuses on the resiliency of complete IT services, not individual workloads.
To begin DR planning, it is important to ask some probing questions about how your organisation currently handles recovery. The answers that arise may be vague or uncomfortable, yet that is exactly what makes the process valuable. If a cloud platform experiences an outage or incident, can you recover critical systems from the cloud? How long would it take? Have you tested it? Acknowledging where gaps exist can help to direct your efforts and bring a sense of urgency to stakeholders who have overlooked the risks.
When a workload fails, the service it supports will be down for your users. This can cause significant damage, both for internal users, who suffer a loss of productivity, and for the customers who rely on your services and whose trust cannot be taken for granted. Bringing back the service requires coordination, and this must be managed promptly to limit the extent of the disruption. It is worth emphasising that it is the responsibility of the cloud customer to ensure that DR procedures are in place, because the cloud providers are not contractually bound to provide this service.
Effective DR planning starts with a Business Impact Assessment (BIA). This cross-functional exercise identifies all the IT services used by the organisation, establishes the impact (operational and financial) that service downtime would cause, and consequently determines the disaster recovery requirements for each service. Many IT organisations maintain a Service Catalogue as part of their Configuration Management Database (CMDB). This can simplify the process of identifying a full list of IT services. If such a catalogue does not exist, then the inventory must be established through a discovery process.
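To quantify the DR requirements of your IT services, it is useful to consider two critical metrics: the Recovery Time Objective (RTO), which is the maximum acceptable time to restore a service after an outage, and the Recovery Point Objective (RPO), which is the maximum acceptable amount of data loss, expressed as the period of time before the outage for which data may not be recoverable.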
In practice, there can often be a trade-off between these two objectives – IT services can be restored quickly, but with greater data loss, or they can be recovered with less data loss, but more slowly. As you might imagine, lower RTOs and RPOs typically require more costly technology solutions.
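As a rough illustration of how these two objectives can be checked after an incident or a test, the sketch below compares the measured downtime and data-loss window against the targets. It takes the RTO as the maximum acceptable downtime and the RPO as the maximum acceptable window of lost data. The class, function name, timestamps and targets are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTargets:
    rto: timedelta  # maximum acceptable time to restore the service
    rpo: timedelta  # maximum acceptable window of lost data


def assess_recovery(targets, last_recovery_point, outage_start, service_restored):
    """Compare what actually happened with the agreed targets."""
    downtime = service_restored - outage_start
    # Data written after the last recovery point (backup, snapshot, replica) is lost.
    data_loss_window = outage_start - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= targets.rto,
        "rpo_met": data_loss_window <= targets.rpo,
    }


# Hypothetical service with a 4-hour RTO and a 1-hour RPO.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = assess_recovery(
    targets,
    last_recovery_point=datetime(2024, 1, 10, 8, 30),
    outage_start=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 12, 0),
)
print(result)
```

In this example the service was down for three hours and lost thirty minutes of data, so both targets are met. Tightening the RPO to fifteen minutes would require more frequent recovery points, which is where the cost trade-off appears.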
Having established the business impact for all IT services used by the organisation and determined the acceptable RTO and RPO for each, the next step is to understand all the IT application components on which each IT service depends. Building out a dependency map for each IT service will ensure that the appropriate recovery measures are put in place for all the necessary application components, whether they are running in the data centre or in the cloud.
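A dependency map of this kind can be represented very simply. The sketch below is one possible way to do so, assuming a hypothetical service called online-store and invented component names. It lists every component a service relies on, directly or indirectly, and derives an order in which components could be recovered so that each one comes back after the things it depends on. It uses Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the components in its set.
dependencies = {
    "online-store": {"web-frontend", "order-api"},
    "web-frontend": {"order-api"},
    "order-api": {"orders-db", "identity-service"},
    "orders-db": set(),
    "identity-service": set(),
}


def recovery_order(dep_map):
    """Return components ordered so every dependency precedes its dependants."""
    return list(TopologicalSorter(dep_map).static_order())


def components_for(service, dep_map):
    """Collect every component a service depends on, directly or indirectly."""
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        for dep in dep_map.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


order = recovery_order(dependencies)
needed = components_for("online-store", dependencies)
print("Recover in this order:", order)
print("online-store relies on:", sorted(needed))
```

In a real environment the map would more likely be drawn from a CMDB or a discovery tool than written by hand, and it would also record where each component runs, whether in the data centre or in a particular cloud.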
After that, you will need to evaluate the current data protection and resiliency capabilities supporting each IT application, to determine whether they can collectively deliver the required RTO and RPO for the IT services. This needs to be done in a holistic manner, considering the impact of the most severe outage. For example, the right technology might be in place to recover a single application within the required RTO, but would that technology support the recovery of tens, hundreds or even thousands of applications in parallel? Can you use the same technical solutions across your data centres and your cloud environments? Needing multiple tools will complicate your DR procedures. Having evaluated the current technology capabilities, you can then identify additional technical solutions to fill the gaps.
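The question of scale can be made concrete with some simple arithmetic. The sketch below is a deliberately simplified model, not a sizing tool: it assumes every application takes the same time to recover and that the recovery tooling can process a fixed number of applications at once, in successive waves. The function name and all figures are hypothetical.

```python
import math


def estimated_recovery_minutes(app_count, minutes_per_app, parallel_slots):
    """Estimate total recovery time when applications are restored in waves."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    waves = math.ceil(app_count / parallel_slots)
    return waves * minutes_per_app


rto_minutes = 240  # hypothetical 4-hour RTO

# One application, 30 minutes to recover, tooling handles 20 at a time.
single = estimated_recovery_minutes(1, 30, 20)

# The same tooling faced with 1,000 applications after a major outage.
many = estimated_recovery_minutes(1000, 30, 20)

print(f"1 application: {single} min, within RTO: {single <= rto_minutes}")
print(f"1,000 applications: {many} min, within RTO: {many <= rto_minutes}")
```

Under these assumptions a single application is back in 30 minutes, but 1,000 applications need 50 waves and 1,500 minutes, far beyond a four-hour RTO. Real recoveries are less uniform, but the exercise shows why capability should be judged against the most severe outage rather than a single failure.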
While deploying the right recovery tools is critical, technology alone is not sufficient to assure DR. A vital step is to create a hierarchical set of recovery plans that can be used to step the organisation through a DR. The top-level plans will document how the recovery activities are coordinated, while lower-level plans will include step-by-step procedures for recovering each IT service. Building and maintaining these plans is a significant investment, but they will be critical to ensuring an effective recovery from a major incident.
To ensure the plans will work in practice, they need to be tested on a regular basis. Tests should be conducted at least once a year, and more frequently for critical applications. Testing can itself create a risk of an incident if it involves the use of live data. However, it is an essential part of DR planning that you should never ignore.
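Keeping track of when each service was last tested is easy to automate. The sketch below flags services whose DR test is overdue. The annual interval reflects the guidance above, while the 90-day interval for critical services, the tier names and the service names are assumptions made purely for illustration.

```python
from datetime import date, timedelta

# Assumed test intervals per tier; only the annual minimum comes from the guidance above.
TEST_INTERVALS = {
    "critical": timedelta(days=90),
    "standard": timedelta(days=365),
}


def overdue_tests(last_tested, tiers, today):
    """Return the services whose last DR test is older than their tier allows."""
    overdue = []
    for service, tested_on in last_tested.items():
        interval = TEST_INTERVALS[tiers[service]]
        if today - tested_on > interval:
            overdue.append(service)
    return sorted(overdue)


# Hypothetical services and test dates.
tiers = {"payments": "critical", "intranet": "standard", "hr-portal": "standard"}
last_tested = {
    "payments": date(2024, 1, 15),
    "intranet": date(2023, 9, 1),
    "hr-portal": date(2023, 3, 1),
}

print(overdue_tests(last_tested, tiers, today=date(2024, 6, 1)))
```

With these dates, payments and hr-portal are reported as overdue, while intranet is still within its annual window.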
Public cloud offers organisations a highly scalable and resilient platform for hosting workloads. Used properly, it can enhance the resiliency of your IT services. However, adopting public cloud does not absolve you of your responsibility for service availability and disaster recovery. While cloud offers many building-blocks to support a DR strategy, you need to use these, combined with other technologies and procedures, to deliver a cohesive plan.
Achieving multi-cloud resiliency requires a comprehensive approach, covering many of the same considerations that would go into DR for the data assets you own. Multi-cloud DR also brings further complexities regarding where data is located, what dependencies exist and how data and workloads can be recovered in the event of an adverse situation with the cloud provider.
The goal of DR planning and testing is to ensure that recovery is possible in line with your RPO and RTO targets, which in turn will give you assurance that the impact of any downtime on your customers, both internal and external, will stay within the limits you have agreed.
To find out more about the approach and principles you can apply to prepare for unexpected situations, read our comprehensive Multi-Cloud Disaster Recovery guide.