You take business continuity seriously. You have a working backup in place and you are confident that you can quickly restore your environment if something happens.
Are you sure? How do you prove readiness to auditors, to your boss, to your team, to yourself?
The only way to have that confidence is to know your key performance indicators (KPIs) for business continuity, and to prove you are meeting those performance objectives. The time to find out whether your business continuity measures are working isn’t during the disaster. It’s well before.
Business Continuity KPIs: It’s All about the Data
Every data center is made up of four major components: 1) physical facility, 2) hardware, 3) software, and 4) data. If you lose the first three, you’ll certainly scramble. But if you’ve done your business continuity homework, then you’ll have a remote site for mission-critical failover, a bare metal restore project plan, and a detailed procedure for restoring applications. But how about #4 — your data?
The reality is that numbers 1, 2, and 3 are replaceable. You have to do it fast and right, but you can buy facilities, hardware, and software. What you can’t do is call up the data store and get your missing or pre-corrupted data back.
So how much time is recovery really going to take? Critical applications cost money every minute they’re down. And even if you recover all of your data up to the point of corruption or loss, that is scant comfort if you’ve lost 5 hours’ worth of transactions.
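To put rough numbers on that question, here is a minimal Python sketch of the arithmetic. Every figure in it (cost per minute, transaction volume, transaction value) is a hypothetical placeholder, and the `outage_cost` helper is an illustration rather than a standard formula; substitute your own estimates.

```python
def outage_cost(downtime_minutes, cost_per_minute,
                lost_hours, transactions_per_hour, value_per_transaction):
    """Estimate the direct cost of one outage (all inputs are assumptions)."""
    # Cost of the application being unavailable while you recover.
    downtime_loss = downtime_minutes * cost_per_minute
    # Cost of transactions that can never be recovered.
    data_loss = lost_hours * transactions_per_hour * value_per_transaction
    return downtime_loss + data_loss


# Hypothetical example: 90 minutes to recover, 5 hours of transactions lost.
total = outage_cost(
    downtime_minutes=90,
    cost_per_minute=500,
    lost_hours=5,
    transactions_per_hour=200,
    value_per_transaction=40,
)
print(f"Estimated direct loss: ${total:,}")
```

The first term grows with recovery time and the second with the amount of data lost, so the two need to be measured and controlled separately.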
So… How Bad Can it Get?
Pretty bad. Data risks are no hollow threat or unlikely event.
- Natural disasters – obvious culprits include earthquakes, fires, floods, and tornadoes.
- Operator error – it’s amazing how much damage a single well-meaning person can accomplish with sufficient user rights.
- Malicious employee – revenge, spying, fraud; it happens.
- Malicious outsider – coming to a town near you with ransomware, hacking, and viruses.
- Hardware failure – when your server perishes for no apparent reason and takes your data with it; and when your storage insists that it’s over threshold when it’s at 65%.
- Software failure – we have all been there. What do you mean Exchange is down?!
- Power outage – that sinking feeling when the grid goes dark and your current transactions are toast.
- Terrorist attacks – this one is no joking matter.
The losses from even one of these threats can easily scale to thousands, tens of thousands, or even millions of dollars, thanks to the direct impact on revenue, reputation damage that lowers sales, and fines and sanctions from lawsuits and regulatory agencies.
Protect Your Data with Business Continuity KPIs
Provable metrics and KPIs are the way to fight back, and to ensure business continuity in case of data loss or corruption. In business there is no dramatic distinction between metrics and KPIs: a KPI is a specific type of metric that assigns business value to another metric. In the case of business continuity, the underlying metric is compute uptime, and the goal is to keep it high enough to minimize financial losses related to downtime.
The business continuity KPIs are recovery point objectives (RPO) and recovery time objectives (RTO), which serve the more general uptime metric. Establish application-specific RTOs and RPOs, and use incident and testing reports to quantify how well you are meeting them (or, where you fall short, to drive remediation).
Recovery time objectives (RTO) set the goals for recovery time per application. It doesn’t do much good to report that yearly downtime stayed within the overall threshold if all of that downtime hit the mission-critical ERP system. Assign RTOs by application priority, that is, by how long an application can be down before it causes serious business loss. Critical systems come first; other systems can take a while longer.
Recovery point objectives (RPO) assign recovery points by application: how much data you can lose from a given application before the business experiences financial loss. Critical applications may require near-zero or zero RPO.
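As a concrete illustration, the following Python sketch records an RTO and RPO per application and checks measured results from a recovery test or incident report against them. The application names, objective values, and the `Objective` and `RecoveryTest` types are all hypothetical; this is one possible way to track the KPIs, not a prescribed one.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rto: timedelta  # longest tolerable time to restore the application
    rpo: timedelta  # longest tolerable window of lost data


@dataclass(frozen=True)
class RecoveryTest:
    app: str
    recovery_time: timedelta  # measured time until the application was usable
    data_loss: timedelta      # measured gap between last good copy and the outage


# Hypothetical objectives, tighter for more critical applications.
OBJECTIVES = {
    "erp": Objective(rto=timedelta(minutes=15), rpo=timedelta(0)),
    "email": Objective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "archive": Objective(rto=timedelta(hours=72), rpo=timedelta(hours=24)),
}


def evaluate(test: RecoveryTest) -> dict:
    """Compare one measured recovery against that application's objectives."""
    objective = OBJECTIVES[test.app]
    return {
        "app": test.app,
        "rto_met": test.recovery_time <= objective.rto,
        "rpo_met": test.data_loss <= objective.rpo,
    }


# Hypothetical test results.
results = [
    evaluate(RecoveryTest("erp", timedelta(minutes=12), timedelta(0))),
    evaluate(RecoveryTest("email", timedelta(hours=5), timedelta(minutes=30))),
]
for result in results:
    print(result)
```

In this made-up run the ERP system meets both objectives, while email meets its RPO but misses its RTO, which is the kind of per-application finding an overall uptime figure would hide.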
3-Part Business Continuity Plan
Successful business continuity depends on three tasks: assign strategic RTOs and RPOs by application priority, combine on-premise and remote data protection sites, and set and track recovery performance.
- Assign RPOs and RTOs by application priority. One application might be fine with backing up to tape and keeping to a monthly tape rotation. Other applications may meet objectives with nightly backup, but instead of going to tape the data goes to disk and/or replicates to the cloud. Mission and business-critical applications need more frequent backup to meet tighter RTO and RPO. Top solutions include snapshots, data replication to remote sites or the cloud, continuous backup, and applications that go live on partial data restore while completing full restore in the background. These solutions are more expensive than nightly backup, but are the only way to achieve RTO within minutes, and zero RPO.
- Combine on-premise and the cloud. Active data is hard to protect if it only exists on-premise. Losing lower-priority data may not be significant if you can restore it from nightly backup, but if it’s a mission-critical application – or if you lose your backup in the same disaster that took your data – this can be a very big loss. Backing up to the cloud will protect your backup data from site-wide disasters. Investing in cloud-based failover is another level of protection that helps you meet the most stringent recovery time and recovery point objectives. If bandwidth causes performance problems, back up data to a disk-based appliance that continuously replicates to the cloud.
- Optimize, monitor, and test RPO and RTO. There is no such thing as “set and forget” in business continuity. Use regular testing and verification to make sure that your data will restore on time and on point. Never leave this to chance. If you’re using a cloud provider for backup and recovery, build your RTO and RPO into the service level agreement. It will cost more than throwing archives into cold cloud storage, but the cost will be a fraction of what you will spend if your critical data cannot be restored on time. And as your application mix changes over time, revisit RTO and RPO as needed.
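The first task in the plan, matching protection methods to application priority, can be sketched in a few lines of Python. The tier list below reuses the methods named above, but the data-loss window attached to each tier is an assumed round number for illustration, as is the `pick_tier` helper; real windows depend on your products and schedules.

```python
from datetime import timedelta

# (worst-case data-loss window, protection method).
# The windows are hypothetical and should be replaced with measured values.
TIERS = [
    (timedelta(0), "continuous backup or replication"),
    (timedelta(hours=1), "frequent snapshots replicated to a remote site or the cloud"),
    (timedelta(hours=24), "nightly backup to disk, replicated to the cloud"),
    (timedelta(days=30), "tape backup with monthly rotation"),
]


def pick_tier(required_rpo: timedelta) -> str:
    """Return the least aggressive tier whose window still fits the RPO."""
    # Walk from the loosest tier to the tightest and stop at the first fit.
    for window, method in sorted(TIERS, key=lambda tier: tier[0], reverse=True):
        if window <= required_rpo:
            return method
    raise ValueError("RPO cannot be negative")


# Hypothetical requirements for three applications.
for app, rpo in [
    ("erp", timedelta(0)),
    ("email", timedelta(hours=2)),
    ("archive", timedelta(days=45)),
]:
    print(f"{app}: {pick_tier(rpo)}")
```

Choosing the loosest tier that still satisfies the objective reflects the cost trade-off described in the plan: tighter protection costs more, so it is reserved for the applications that need it.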