The first thing that comes to mind when thinking about disaster recovery is probably a natural disaster. However, businesses need to adopt disaster recovery (DR) for a wide range of scenarios. In recent years the number of ransomware attacks has increased significantly, making ransomware the number one cause of DR events. Power outages, natural disasters, human error, and hardware failures follow as causes of DR events.
Although organizations realize the importance of implementing a robust DR solution for business continuity, traditional DR solutions can be complex and unreliable, leaving many organizations less than confident that their DR plan will work when needed.
The first step when defining a disaster recovery architecture is to understand the business requirements. Two important metrics are the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO defines the maximum acceptable amount of data loss, measured in time, and therefore determines how frequently backups are created and stored. The RTO defines the maximum acceptable amount of time it takes to recover from a DR event. Both of these numbers also play into the total availability of an application, typically defined in percentages, for example 99.9% from 8am-8pm, Mon-Fri.
Another important aspect is the frequency of DR testing. The more frequently you can test and validate the DR plans, the lower the risk of recovery issues arising during an actual DR event. More frequent and faster testing yields a higher degree of confidence and predictability in the results during an actual recovery.
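As a rough worked example of how an availability target translates into a downtime budget, the sketch below computes the allowed downtime for the 99.9%, 8am-8pm, Mon-Fri target mentioned above. The function name and the 52-week year are assumptions made for illustration only.

```python
def downtime_budget_minutes(availability_pct: float,
                            hours_per_day: float,
                            days_per_week: int,
                            weeks: int = 1) -> float:
    """Return the downtime allowed inside the service window, in minutes."""
    window_minutes = hours_per_day * days_per_week * weeks * 60
    return window_minutes * (1 - availability_pct / 100)


# 99.9% availability over a 12-hour, 5-day service window
weekly = downtime_budget_minutes(99.9, 12, 5)
yearly = downtime_budget_minutes(99.9, 12, 5, weeks=52)
print(f"per week: {weekly:.1f} min, per year: {yearly / 60:.2f} h")
```

Under these assumptions the budget is about 3.6 minutes per week, or roughly 3.1 hours per year, which gives a sense of how tightly the RTO has to be chosen to stay within the availability target.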
VxRail customers can choose from a wide range of DR solutions, based on their business needs. In this blog post we are going to explore VMware Cloud Disaster Recovery as a cloud-based DR solution.
VMware Cloud Disaster Recovery is a cloud-based solution that combines efficient cloud storage with simple SaaS-based management for IT resiliency at scale. Some of the underlying technology was acquired via Datrium.
VMware Cloud Disaster Recovery (VMware Cloud DR) can be used to protect vSphere virtual machines by replicating them periodically to the cloud and recovering them as needed to a target VMware Cloud on AWS Software Defined Data Center ("SDDC"). The target SDDC can be created immediately prior to performing a recovery and doesn’t need to be provisioned to support the replications in the steady state.
VMware Cloud DR Architecture
The overall system architecture can be divided into three main elements:
On premises VxRail cluster (Managed by the end user)
Cloud based filesystem as replication target (Managed by VMware)
VMware Cloud on AWS SDDC (Managed by VMware/AWS)
Let’s look at the VMware Cloud DR specific components in more detail:
VMware Cloud DR DRaaS Connector ("DRaaS Connector") – a virtual appliance installed in the VMware vSphere environment where the virtual machines to be protected are running under normal circumstances.
Scale-Out Cloud File System ("SCFS") – a cloud component that enables the efficient storage of backups of the protected virtual machines in cloud storage and allows virtual machines to be recovered very quickly without a time-consuming data rehydration process.
SaaS Orchestrator ("Orchestrator") – a cloud component that presents a user interface (UI) to consume the Service Offering and includes several disaster recovery orchestration capabilities to automate the disaster recovery process.
DR SDDC Deployment Options
Just-in-Time Deployment: Just-in-time deployment of a cloud DR site presents an attractive alternative to continuously maintaining a warm standby cloud DR site. With just-in-time deployment, the recurring costs of a cloud DR site are eliminated entirely until a failover occurs and cloud resources are provisioned.
The on-demand nature of public clouds allows DRaaS to reduce the operating costs of DR by deploying the bulk of the DR infrastructure programmatically following a DR event. During steady-state operation, DRaaS maintains a minimal, low-cost AWS cloud footprint to accommodate cloud backups with no ongoing charges for the cloud DR site.
Ahead-of-Time Deployment: In cases where a DR site has the secondary function of executing non-DR workloads during regular operation, an SDDC can be provisioned before failover.
If the sole purpose of the Cloud DR site is to take over workload execution in the event of a disaster and it remains otherwise unutilized, further significant cost savings are possible with the just-in-time deployment.
Ahead-of-time vs. just-in-time provisioning of the SDDC is a trade-off between costs and RTO. With ahead-of-time SDDC provisioning, SDDC creation latency is eliminated. Just-in-time SDDC provisioning dramatically lowers the costs but increases the RTO by deploying the SDDC only in the event of a failover.
Pilot Light: In Pilot Light mode, DRaaS enables a smaller subset of SDDC hosts to be deployed ahead of time for recovering critical applications with lower RTO requirements.
This deployment model allows organizations to reduce the total cost of cloud infrastructure by keeping a scaled-down version of a fully functional environment always running in warm-standby while ensuring that core applications are readily available when a disaster event is triggered.
With Pilot Light mode, DRaaS presents an option for administrators to add extra SDDC hosts through Cloud Bursting and fail over the remaining applications. Expanding the SDDC by adding hosts happens in minutes, providing a lower RTO for all applications than the just-in-time deployment RTO at a fraction of the cost of the ahead-of-time deployment. A full SDDC deployment is a more time-consuming operation with a higher RTO impact than SDDC expansion. Pilot Light mode is an efficient solution with a range of options to balance costs and RTO.
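The cost versus RTO trade-off between the three deployment options can be sketched as a simple selection rule. This is an illustrative model, not part of the product: the class, the function, and the pilot-light expansion time are assumptions. The 120-minute figure reflects the approximate just-in-time provisioning time of two hours, and the pilot-light entry models the applications that have to wait for host expansion rather than the critical ones already covered by the standing hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentMode:
    name: str
    provisioning_minutes: int  # delay before VM recovery can start


# Ordered from lowest to highest standing cost. Values are illustrative:
# 120 min approximates a full SDDC deployment; 15 min is an assumed
# stand-in for "expansion happens in minutes".
MODES = [
    DeploymentMode("just-in-time", 120),
    DeploymentMode("pilot-light", 15),
    DeploymentMode("ahead-of-time", 0),
]


def cheapest_mode_meeting_rto(rto_minutes: int, recovery_minutes: int) -> str:
    """Return the lowest-cost mode whose provisioning delay plus the
    (assumed) VM recovery time still fits within the RTO."""
    for mode in MODES:
        if mode.provisioning_minutes + recovery_minutes <= rto_minutes:
            return mode.name
    raise ValueError("no deployment mode meets this RTO")


print(cheapest_mode_meeting_rto(rto_minutes=240, recovery_minutes=30))
print(cheapest_mode_meeting_rto(rto_minutes=60, recovery_minutes=30))
```

With a four-hour RTO the just-in-time option is sufficient, while a one-hour RTO pushes the choice toward Pilot Light in this model.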
Setup Process
The steps below outline the implementation process of VMware Cloud DR.
Dashboard
The SaaS Orchestrator dashboard provides a cloud-based UI to help manage the on-prem and cloud configuration components of VMware Cloud DR.
Protection Groups
Protection groups are a way of grouping virtual machines that will be recovered together. A protection group contains virtual machines whose data will be replicated by the DRaaS Connector to the Scale-Out Cloud File System following the same protection policy.
The protection policy defines how frequently snapshots are taken and how long each recovery point is retained in the cloud-based Scale-Out Cloud File System. In many cases, a protection group will consist of the virtual machines that support a service or application such as email or an accounting system.
For example, an application might consist of a two-server database cluster, three application servers and four web servers. In most cases, it would not be beneficial to fail over part of this application, so all virtual machines would be included in a single protection group. Creating a protection group for each application or service has the benefit of selective testing.
Having a protection group for each application enables non-disruptive, low-risk testing of individual applications, so application owners can test disaster recovery plans as needed.
Virtual machines can belong to more than one Protection Group. Protection Groups can belong to more than one DR plan.
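The relationship between protection groups, policies, and virtual machines can be pictured with a small data model. This is a minimal sketch with hypothetical class and VM names; it does not reflect the actual VMware Cloud DR object model or API.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class ProtectionPolicy:
    snapshot_interval: timedelta  # how often a snapshot is taken
    retention: timedelta          # how long each recovery point is kept

    def retained_recovery_points(self) -> int:
        # Number of recovery points available once retention is full
        return self.retention // self.snapshot_interval


@dataclass
class ProtectionGroup:
    name: str
    policy: ProtectionPolicy
    vms: set[str] = field(default_factory=set)


def groups_by_vm(groups: list[ProtectionGroup]) -> dict[str, list[str]]:
    """Map each VM to the protection groups it belongs to."""
    membership: dict[str, list[str]] = {}
    for group in groups:
        for vm in sorted(group.vms):
            membership.setdefault(vm, []).append(group.name)
    return membership


policy = ProtectionPolicy(timedelta(hours=4), timedelta(days=7))
email = ProtectionGroup("email", policy, {"mail01", "mail02", "shared-dns"})
accounting = ProtectionGroup("accounting", policy, {"acct-db", "shared-dns"})

print(policy.retained_recovery_points())                # 42
print(groups_by_vm([email, accounting])["shared-dns"])  # ['email', 'accounting']
```

A four-hour snapshot interval with seven days of retention yields 42 recovery points, and the shared VM appears in both groups, matching the rule that a virtual machine can belong to more than one protection group.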
DR Plans
A DR Plan defines what is going to fail over, where it is going to go in the cloud configuration, how it is going to come online, and in what order.
If something is incomplete or incorrect it will show a warning mark. On the right, the plan items are grouped by what they do in the plan. The first three items (general, sites, and groups) set the scope: the type of plan, the source and destination sites, and the protection groups that are part of this plan.
The second group of five items in the list are essentially the mappings from the source vCenter to the destination vCenter – from the data center to the cloud. This includes datastores and vCenter folders, compute resources, and virtual networks. When creating a plan, the UI helps guide you in this mapping setup. These are very similar to the mappings needed if you were in a vCenter environment and you wanted to put something into inventory.
The next group falls into the customization category: this is where the IP addresses can be manipulated or remapped. This can be done either for specific static IP addresses or for complete ranges when the virtual machines are placed into inventory.
There is also a script capability to extend the functionality of any DR plan. Each DR plan can have one script VM associated with it, and that script VM can be Windows or Linux. This does not depend on any particular SDK or API; it simply allows the DR plan execution to call scripts in that script virtual machine to perform something in the context of the current plan step.
The fourth area is the sequence. This is the actual ordering of the recovery steps, and we'll look at that in a little more detail in a moment, but the recovery steps are the recipe for what to do first, second, third, fourth, and so forth.
The last part of the DR plan structure is the alerting mechanism that sends email alerts to the configured administrator(s) of VMware Cloud Disaster Recovery.
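Remapping a complete range usually means keeping the host portion of each address while swapping the network portion. The sketch below illustrates that idea with Python's standard `ipaddress` module; the function name and the example subnets are hypothetical, and it is not how the product implements its IP customization.

```python
import ipaddress


def remap_ip(address: str, source_net: str, target_net: str) -> str:
    """Move an address from the source range to the same offset
    in the target range (both ranges must be the same size)."""
    src = ipaddress.ip_network(source_net)
    dst = ipaddress.ip_network(target_net)
    addr = ipaddress.ip_address(address)
    if addr not in src:
        raise ValueError(f"{address} is not in {source_net}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("source and target ranges must be the same size")
    offset = int(addr) - int(src.network_address)
    return str(dst.network_address + offset)


print(remap_ip("192.168.10.25", "192.168.10.0/24", "10.20.30.0/24"))
# 10.20.30.25
```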
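To make the plan structure more concrete, the sketch below models a DR plan as a small data structure with a check that mimics the warning marks for incomplete mappings. All class, field, VM, and network names are hypothetical and only loosely follow the sections described above; only the network mapping is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class DRPlan:
    name: str
    protection_groups: list[str]
    network_map: dict[str, str] = field(default_factory=dict)  # source -> target
    steps: list[str] = field(default_factory=list)             # recovery order

    def warnings(self, vm_networks: dict[str, str]) -> list[str]:
        """List VMs whose source network has no target mapping."""
        return [
            f"{vm}: no mapping for network '{net}'"
            for vm, net in sorted(vm_networks.items())
            if net not in self.network_map
        ]


plan = DRPlan(
    name="accounting-failover",
    protection_groups=["accounting"],
    network_map={"prod-web": "cloud-web"},
    steps=["recover database tier", "recover app tier", "recover web tier"],
)

# Hypothetical inventory: which source network each protected VM uses
inventory = {"web01": "prod-web", "db01": "prod-db"}
for warning in plan.warnings(inventory):
    print(warning)  # db01: no mapping for network 'prod-db'
```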
Planned Migration and Disaster Recovery
Running a recovery plan differs from testing a recovery plan. Testing a recovery plan does not disrupt virtual machines at the protected site. There are no dependencies between the protected site and the recovery site when it comes to recovery.
The first step for recovery is to ensure that an SDDC is deployed or to get one deployed if required. This SDDC could be a “just in time” SDDC, a pilot-light SDDC, or a fully provisioned cloud site, whichever makes the most sense based on requirements. For “just in time” and on-demand recovery there isn’t an always-on SDDC running in the cloud. In these situations, it will take approximately two hours to provision and prepare the new SDDC.
After the SDDC is in place, the recovery point to fail over to can be chosen. The recovery point could be the last good replication point or something hours, days, or even weeks older if that is required by circumstances (e.g. ransomware, data corruption).
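Choosing an older recovery point in a ransomware scenario amounts to picking the most recent point taken before the suspected compromise. A minimal sketch, with a hypothetical helper and invented timestamps:

```python
from datetime import datetime


def latest_point_before(recovery_points: list[datetime],
                        cutoff: datetime) -> datetime:
    """Return the newest recovery point taken strictly before the cutoff."""
    candidates = [point for point in recovery_points if point < cutoff]
    if not candidates:
        raise ValueError("no recovery point predates the cutoff")
    return max(candidates)


# Snapshots taken every four hours (illustrative values)
points = [datetime(2024, 1, 10, hour) for hour in (0, 4, 8, 12)]

# Assume the encryption is believed to have started at 07:15
compromised_at = datetime(2024, 1, 10, 7, 15)
print(latest_point_before(points, compromised_at))  # 2024-01-10 04:00:00
```

The most recent snapshot (12:00) and the one before it (08:00) are skipped because they were taken after the assumed compromise time.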
Once the recovery plan has finished the failover and recovered the virtual machines, there is the choice to commit the plan and continue running at the recovery site or to roll back. The rollback process is similar to cleaning up after a test. The recovered VMs are powered off and the SDDC is returned to the state it was in prior to executing the plan.
Failback
After the disaster has been resolved, returning to normal operations is just as easy as failing over in the event of a disaster. Simply select the desired plan, duplicate it, and then reverse its direction. Once the new plan is created, run through the health checks to make sure that everything is ready to fail back. Changes may need to be made to the plan or the environments depending on what happened while operating in the cloud or while restoring the on-prem datacenter. The health check process will provide guidance on what needs to be addressed. Then the failback plan can be executed.
The failback process uses changed block tracking to minimize the amount of data that needs to be replicated back to the on-prem site, from the Scale-Out Cloud File System to the DRaaS Connector.
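The effect of transferring only changed blocks can be illustrated with a toy comparison of two disk images. Note the assumption: real change tracking records writes as they happen rather than comparing images afterwards, so this sketch only shows the outcome (a short list of blocks to send), not the mechanism used by the product.

```python
def changed_blocks(base: bytes, current: bytes, block_size: int = 4) -> list[int]:
    """Return the indices of blocks that differ between two images."""
    n_blocks = (max(len(base), len(current)) + block_size - 1) // block_size
    changed = []
    for i in range(n_blocks):
        start = i * block_size
        if base[start:start + block_size] != current[start:start + block_size]:
            changed.append(i)
    return changed


on_prem = b"AAAABBBBCCCCDDDD"        # state before the failover
in_cloud = b"AAAAXBBBCCCCDDDDEEEE"   # state after running in the cloud

print(changed_blocks(on_prem, in_cloud))  # [1, 4]
```

Only two of the five blocks would need to travel back in this example: the modified block and the newly appended one.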
At the end of the failback, all virtual machines are restored to the same point in time that the cloud instance was last running. At that point, the related cloud compute resources are no longer needed, and the VMware Cloud on AWS SDDC could be reduced in size or even eliminated, depending on requirements.
It is important to note here that a failback operation is a planned activity and there will be some downtime of the applications. The downtime occurs during the snapshot and replication stages of this process, and its length depends on how much data has changed during the DR operation period as well as on network bandwidth.
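A back-of-the-envelope estimate of the replication stage can help when planning the failback window. The function below is a rough sketch; the efficiency factor and the example figures are assumptions, and it ignores snapshot time and any data reduction applied in transit.

```python
def replication_minutes(changed_gb: float,
                        bandwidth_mbps: float,
                        efficiency: float = 0.8) -> float:
    """Estimate minutes needed to transfer the changed data.

    efficiency is an assumed fraction of the link that is usable in practice.
    """
    bits = changed_gb * 8 * 1000**3
    seconds = bits / (bandwidth_mbps * 1000**2 * efficiency)
    return seconds / 60


# 200 GB changed while running in the cloud, 1 Gbps link
print(f"{replication_minutes(200, 1000):.1f} min")  # 33.3 min
```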