Automated WCH database backups for paid tenants currently run on a schedule in Production. We have also tested the restore steps for the Database DR scenario and documented them here: https://github.ibm.com/DX/prod-infra-rtp-backup-service/wiki/Disaster-Recovery-Backup
There are additional requirements and gaps for Database DR that need to be addressed to give us a more complete solution.
We first need to produce a design that addresses these gaps so that we can size and prioritize them for implementation.
Automated process to empty out Kafka service data during restore
The DR restore returns the databases to the point in time at which the backups were taken. Pending Kafka messages may therefore be invalid, because they were based on a different database state, and need to be reset.
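One way to reason about the reset is as an offset plan: for each partition, move the consumer group's committed offset to the log end so that messages queued before the restore are skipped. The sketch below is a minimal illustration under that assumption. `plan_offset_reset`, its input dictionaries, and the topic name are hypothetical, and the function only computes the plan; the actual Kafka admin calls are left out. Whether skipping is sufficient, or the topics must also be purged, is a question for the design.

```python
from typing import Dict, Tuple

Partition = Tuple[str, int]  # (topic, partition number)


def plan_offset_reset(
    committed: Dict[Partition, int],
    log_end: Dict[Partition, int],
) -> Dict[Partition, Tuple[int, int]]:
    """Return {partition: (new_offset, discarded_count)} for one consumer group.

    Hypothetical helper: inputs are plain dictionaries that a caller would
    fill from whatever Kafka admin tooling is in use.
    """
    plan = {}
    for partition, end in log_end.items():
        # Assumption: a partition with no committed offset starts at 0.
        current = committed.get(partition, 0)
        # Everything between the committed offset and the log end is pending.
        discarded = max(end - current, 0)
        plan[partition] = (end, discarded)
    return plan


if __name__ == "__main__":
    # Illustrative values only; "tenant-events" is a made-up topic name.
    committed = {("tenant-events", 0): 120, ("tenant-events", 1): 75}
    log_end = {("tenant-events", 0): 150, ("tenant-events", 1): 75}
    for partition, (new_offset, discarded) in plan_offset_reset(committed, log_end).items():
        print(partition, "->", new_offset, "discarding", discarded)
```

Computing the plan separately from applying it would let an administrator review how many messages are dropped before the restore proceeds.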
Inclusion of Akamai data during backup/restore
Back up and restore the Akamai data together with the Cloudant databases.
Backup and restore individual tenants
Allow the administrator to restore individual tenants (instead of the entire system during DR restore).
Support incremental backup/restore functionality
Use Kafka messages or delta Cloudant backups to back up and restore changes made to the database on top of the existing base backup.
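The idea of restoring a delta over a base can be sketched as replaying an ordered list of change records onto a snapshot. This is only an illustration: `apply_changes` is a hypothetical helper, and the record shape (`id`, `deleted`, `doc`) is loosely modeled on a change feed that includes document bodies, not taken from the existing backup service.

```python
from typing import Any, Dict, Iterable

Doc = Dict[str, Any]


def apply_changes(base: Dict[str, Doc], changes: Iterable[Doc]) -> Dict[str, Doc]:
    """Replay change records over a copy of the base snapshot.

    Assumption: `changes` is ordered oldest first, so a later record
    for the same document id overrides an earlier one.
    """
    restored = dict(base)  # shallow copy; the base snapshot is not modified
    for change in changes:
        doc_id = change["id"]
        if change.get("deleted"):
            # The document was removed after the base backup was taken.
            restored.pop(doc_id, None)
        else:
            # The document was created or updated after the base backup.
            restored[doc_id] = change["doc"]
    return restored


if __name__ == "__main__":
    base = {"a": {"_id": "a", "v": 1}, "b": {"_id": "b", "v": 1}}
    delta = [
        {"id": "a", "doc": {"_id": "a", "v": 2}},
        {"id": "b", "deleted": True},
        {"id": "c", "doc": {"_id": "c", "v": 1}},
    ]
    print(sorted(apply_changes(base, delta)))
```

Under this model a restore needs the base backup plus every delta taken since, applied in order, so the design would have to decide how long a delta chain may grow before a new base is taken.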
Determine impact of Kafka DB Replication on Backup/Restore
Determine how the backup service could be improved by using Kafka messages.
Convert scheduling for backup/purge to use the prod-infra-rtp-scheduling-service
Move the existing scheduled backup processing from the prod-infra-backup-schedule service to the prod-infra-rtp-scheduling-service.
Support ability to restore non-replicated tenants from the failing DC into the surviving DC
It may be faster to recover from a failing data center (DC) by restoring its non-replicated tenants into the surviving data center. This item covers supporting that scenario as part of the DR restore.
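As a rough illustration of the selection step in this scenario, the sketch below picks the tenants that live in the failing DC and have no replica elsewhere. The `Tenant` record, its fields, and the DC labels are hypothetical; the real tenant registry may expose this information differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tenant:
    tenant_id: str    # hypothetical identifier
    home_dc: str      # data center that hosts the tenant's databases
    replicated: bool  # True if the tenant is already replicated to another DC


def tenants_to_restore(tenants: List[Tenant], failing_dc: str) -> List[str]:
    """Return the ids of tenants that must be restored into the surviving DC."""
    return sorted(
        t.tenant_id
        for t in tenants
        if t.home_dc == failing_dc and not t.replicated
    )


if __name__ == "__main__":
    # Illustrative data only; "dc-a" and "dc-b" are made-up labels.
    tenants = [
        Tenant("t1", "dc-a", replicated=True),
        Tenant("t2", "dc-a", replicated=False),
        Tenant("t3", "dc-b", replicated=False),
    ]
    print(tenants_to_restore(tenants, failing_dc="dc-a"))
```

Replicated tenants are skipped on the assumption that their data is already present in the surviving DC.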
What is your industry? | Non-Industry Specific |
What is the idea priority? | High |