Cross-Region Disaster Recovery for AWS EFS

In this article, I will share what I learned about configuring AWS EFS to survive a region failure.

Context

Elastic File System (EFS) is a popular, fully managed file system service from Amazon Web Services. Typical use cases include sharing files between compute services such as EC2, ECS, EKS, and Lambda. For disaster recovery, EFS has long been able to handle an Availability Zone (AZ) failure through its Multi-AZ configuration. About two years ago, AWS added replication across regions, in which the file system in the primary region supports both reads and writes while the file system in the secondary region supports only reads. This made it possible for EFS to survive the failure of a complete region.

Architecture

Uni-Directional Replication

As you can see from the diagram above, EFS currently supports uni-directional replication. This allows the data in the file system in the primary region to be synchronized with the file system in the secondary region.

One thing to note when creating applications that leverage this solution is that the secondary region's filesystem is accessible in read-only mode.
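As a rough illustration of how such a uni-directional replication might be set up, here is a minimal sketch. It assumes a boto3-style EFS client is passed in as `efs_client`; the function name `start_replication` and the file system ID and region values in the usage note are hypothetical, not taken from any real setup.

```python
def start_replication(efs_client, source_fs_id, destination_region):
    """Create a one-way replication from source_fs_id into destination_region.

    efs_client is assumed to be a boto3 EFS client created for the
    primary (source) region.
    """
    response = efs_client.create_replication_configuration(
        SourceFileSystemId=source_fs_id,
        # Giving only a Region asks EFS to create a new destination file system.
        Destinations=[{"Region": destination_region}],
    )
    # Each destination entry describes a read-only replica file system.
    return [dest["FileSystemId"] for dest in response["Destinations"]]
```

With boto3 installed, this could be called as `start_replication(boto3.client("efs", region_name="us-east-1"), "fs-primary-id", "us-west-2")`, where the ID and both regions are placeholders. Applications in the secondary region would then mount the returned file system knowing that it only serves reads.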

Implementation

Implementing DR for the file systems comes down to two operations, failover and failback. Before listing the EFS-specific steps, here is what each term means.

Failover is all about smoothly shifting your important workloads to another location when the primary site is down. The aim is to minimize the impact of a disaster or service interruption on your business and customers. Let's try to understand this using the example of a virtual machine (VM).

After resolving any issues at your primary site following a disaster, you can move your business operations back to the original VM. Failback is the process of recovering the original VM on the source host or a new location and returning workloads from the VM replica to the original VM. However, since changes may have occurred in the VM replica during failover, it's crucial to synchronize the original VM and the replica before failback to avoid losing critical information. During failback, only the changed data is sent back to the original system.

Failover Steps
  1. Delete the replication configuration on the Primary EFS so that the Secondary EFS becomes independent and can accept both read and write requests
  2. Once the Primary region is back up, create a replication from the Secondary EFS to the Primary EFS so that the changes are synced upstream

Failback Steps
  1. When it is time to fail back and there are no ongoing operations on the Secondary EFS, delete the replication from the Secondary EFS to the Primary EFS
  2. Create the replication again from the Primary EFS to the Secondary EFS so that the file systems return to the state they were in before the disaster
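
The failover and failback steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive runbook. The clients are assumed to be boto3 EFS clients, one per region, and the function names are hypothetical. Two details go beyond the steps listed above and reflect my understanding of the current EFS API, so verify them against the documentation: replicating onto an existing file system appears to require disabling its replication overwrite protection first, and the delete call is issued here against the region that holds the replication destination, which matters when the primary region is unreachable.

```python
def fail_over(secondary_client, primary_fs_id):
    """Failover step 1: end the replication so the secondary accepts writes."""
    secondary_client.delete_replication_configuration(
        SourceFileSystemId=primary_fs_id
    )


def replicate_to_existing(source_client, dest_client,
                          source_fs_id, dest_fs_id, dest_region):
    """Replicate source_fs_id onto an existing file system dest_fs_id.

    Assumption: the destination's overwrite protection must be disabled
    first, and its current contents are replaced by the source's data.
    """
    dest_client.update_file_system_protection(
        FileSystemId=dest_fs_id,
        ReplicationOverwriteProtection="DISABLED",
    )
    return source_client.create_replication_configuration(
        SourceFileSystemId=source_fs_id,
        Destinations=[{"Region": dest_region, "FileSystemId": dest_fs_id}],
    )


def fail_back(primary_client, secondary_client,
              primary_fs_id, secondary_fs_id, secondary_region):
    """Failback steps 1 and 2: drop the reverse replication, restore the original."""
    # Step 1: stop replicating from the secondary to the primary.
    primary_client.delete_replication_configuration(
        SourceFileSystemId=secondary_fs_id
    )
    # Deletion is asynchronous; real automation would wait for it to finish
    # before step 2. That wait is omitted in this sketch.
    # Step 2: replicate from the primary to the secondary again.
    return replicate_to_existing(
        primary_client, secondary_client,
        primary_fs_id, secondary_fs_id, secondary_region,
    )
```

Failover step 2 is the same helper with the roles swapped: once the primary region is back, `replicate_to_existing(secondary_client, primary_client, secondary_fs_id, primary_fs_id, primary_region)` would sync the changes upstream.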

These operations can be automated with either the AWS CLI or the EFS API.
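
An automated script will usually want to check the state of a replication before deleting or recreating it, for example to confirm that the last sync is recent before failing back. The sketch below shows one way this might look. It assumes a boto3 EFS client is passed in as `efs_client`; the function name `replication_status` and the shape of the returned report are hypothetical choices made for this example.

```python
from datetime import datetime, timezone


def replication_status(efs_client, source_fs_id, now=None):
    """Report status and sync lag for each destination of source_fs_id."""
    now = now or datetime.now(timezone.utc)
    response = efs_client.describe_replication_configurations(
        FileSystemId=source_fs_id
    )
    report = []
    for replication in response["Replications"]:
        for dest in replication["Destinations"]:
            # The timestamp is absent until the first sync has completed.
            last_sync = dest.get("LastReplicatedTimestamp")
            lag = (now - last_sync).total_seconds() if last_sync else None
            report.append({
                "file_system_id": dest["FileSystemId"],
                "region": dest["Region"],
                "status": dest["Status"],
                "lag_seconds": lag,
            })
    return report
```

A script could, for instance, proceed with failback only when every entry has status `ENABLED` and a lag below a threshold that the team considers acceptable.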

Conclusion

The EFS replication feature makes it possible to build DR-capable solutions that support business continuity. It still does not enable an active-active DR strategy, but hopefully AWS will introduce bi-directional replication to accommodate this.