Disciplines · 08 · Continuity

Most businesses have a backup. Fewer have a plan.

Somewhere along the way, "backup" and "disaster recovery" collapsed into the same phrase. The tape vendor sold you off-site. The cloud vendor sold you off-site-with-extra-steps. Both called it DR. Neither was wrong on the technology — but both were wrong on what DR actually is. Disaster recovery isn't a product. It's a process. The question isn't whether your data sits in two places. The question is whether anyone in the room knows what to do at 2 a.m. when it all stops.

The frame

Backup and DR aren't the same exercise.

Most of the businesses that call us about continuity don't start by asking for a runbook. They ask for a backup tool, or a DR plan, or "something better than what we've got." When we dig in, the underlying assumption is almost always the same: backup and DR are two names for the same thing, and if we're paying for one we've got the other.

They aren't. A backup is a copy of data. A DR plan is what a business does when the copy has to get used. Those are different problems, owned by different people, tested in different ways, and they fail for different reasons.

Conflating them is how a business ends up with redundant off-site backups that have never been restored from, and a DR binder written three years ago that nobody in the current IT team has read. The backup is running. The DR plan is decoration. That's the shape we walk into.

Where they diverge

Two exercises, two different questions.

The easiest way to see the gap is to line the two up against each other. Same event — a ransomware encryption, a building fire, a primary region going dark — but two different answers the business is trying to deliver.

Backup

The copy that sits ready.

A scheduled, restorable copy of data — files, databases, mailboxes, virtual machines. Its whole job is to exist, quietly, until the day it's needed.

  • The question it answers — can we get this file, mailbox, or VM back?
  • What it protects — data.
  • Success metric — the backup job completed; a test restore produces the file.
  • Who owns it — IT operations, usually a single named engineer.
  • How it's tested — run a restore. Does the file come back?
  • How it fails — quietly. Jobs error, nobody looks, months pass.

Disaster recovery

The plan that kicks in.

A coordinated response to an event that disrupts the business's ability to operate. It isn't a product. It's a runbook, a named caller, a decision tree, and the muscle memory to use all three under pressure.

  • The question it answers — can we keep operating if the primary environment is gone?
  • What it protects — operations. The ability of the business to function.
  • Success metric — services are back inside RTO with data loss inside RPO, and the business kept running.
  • Who owns it — shared. IT runs the mechanics; the business owns the decision to invoke.
  • How it's tested — tabletop, failover exercise, or the real thing.
  • How it fails — loudly. Nobody knows who calls it. Dependencies aren't in the plan.

These are related exercises. They share tooling. They share a vocabulary. But they aren't the same project. A business with a great backup and no DR plan has a file it can retrieve and no idea what to do with it. A business with a great DR plan and weak backups has a runbook that starts with "locate the most recent recoverable copy" and ends in a long silence. The two have to be built together, but they have to be built separately.
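
That quiet backup failure mode is cheap to catch. As a minimal sketch in Python (the directory, age threshold, and alert channel are all placeholders, not any real product's interface), a scheduled freshness check that refuses to stay silent:

    #!/usr/bin/env python3
    """Backup freshness check: a sketch, not a product.

    Assumes backups land as files under BACKUP_DIR and that anything
    older than MAX_AGE_HOURS means the job has quietly stopped. Wire
    the print statements to email, chat, or paging as needed.
    """
    from pathlib import Path
    import sys
    import time

    BACKUP_DIR = Path("/backups/nightly")   # placeholder path
    MAX_AGE_HOURS = 26                      # nightly job, plus slack for overruns

    def newest_backup_age_hours(directory: Path) -> float | None:
        """Age in hours of the newest file in the directory, or None if empty."""
        if not directory.is_dir():
            return None
        mtimes = [p.stat().st_mtime for p in directory.iterdir() if p.is_file()]
        if not mtimes:
            return None
        return (time.time() - max(mtimes)) / 3600

    def main() -> int:
        age = newest_backup_age_hours(BACKUP_DIR)
        if age is None:
            print(f"ALERT: no backups found in {BACKUP_DIR}")
            return 1
        if age > MAX_AGE_HOURS:
            print(f"ALERT: newest backup is {age:.1f}h old (limit {MAX_AGE_HOURS}h)")
            return 1
        print(f"OK: newest backup is {age:.1f}h old")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The nonzero exit code is the point. Hang it off a scheduler or an existing monitor, and "nobody looks" stops being an available failure mode.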

What makes it a process

A runbook, a named caller, and a practiced room.

"DR is a process" is easy to say. Here's what it actually means in the room. A runbook — written down, current, and reachable when the primary systems aren't. A named person with explicit authority to invoke it, so the decision doesn't default upward until it stalls. Criteria that define what counts as a DR-level event, so the call gets made on evidence instead of instinct. A list of questions the person on the call is supposed to be asking in the first ten minutes. And a tabletop cadence that makes sure all of the above still works when the people in the roles have changed.

That's the process. Without it, a business has a collection of recovery capabilities and no idea how to sequence them under stress. With it, a business has something closer to a reflex.
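
One way to keep that process from decaying back into a binder is to write its skeleton down as a reviewable artifact. A sketch in Python follows; every name, criterion, and question in it is an invented placeholder for whatever the business actually decides:

    from dataclasses import dataclass

    @dataclass
    class Runbook:
        """The skeleton of a DR process as a reviewable artifact.

        All names, criteria, and questions below are illustrative
        placeholders; the real content comes from the business.
        """
        invoker: str                  # named person with authority to invoke
        backup_invoker: str           # who calls it when the invoker is unreachable
        invoke_criteria: list[str]    # what counts as a DR-level event
        first_ten_minutes: list[str]  # questions to ask before touching anything
        last_tabletop: str            # date of the last exercise

    EXAMPLE = Runbook(
        invoker="J. Doe (operations lead)",          # hypothetical
        backup_invoker="A. Smith (infrastructure)",  # hypothetical
        invoke_criteria=[
            "Primary site unreachable for more than 30 minutes",
            "Confirmed ransomware encryption on production storage",
        ],
        first_ten_minutes=[
            "What is down, and what is confirmed still up?",
            "Is the most recent backup reachable and intact?",
            "Who has been notified, and who makes the call?",
        ],
        last_tabletop="2025-01-15",
    )

Version it next to the infrastructure code and the runbook stops being a document someone wrote once. It becomes a diff someone has to approve.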

The stance on Continuity

Backup is a task. DR is a process.

The business conversation

The two numbers the business owns.

Recovery Time Objective and Recovery Point Objective are the two numbers the architecture answers to. RTO is how long the business can tolerate being down. RPO is how much data it can tolerate losing. Every tool choice, every tier, every dollar spent on continuity traces back to those two numbers. But almost nobody gets them from the business side first.

The conversation tends to go one of two ways. In the first, we mention RTO and RPO and a blank stare washes over the room. That's a useful signal — it means the business has never been asked to own those decisions, and the numbers that drive the architecture are numbers IT invented on the business's behalf. From there we run a mini tabletop: what the terms mean, why they matter, and what happens to the business at four hours offline, versus twenty-four, versus seventy-two. The answer isn't "faster is better." The answer is whatever the business can actually afford, paired with the honesty that a faster answer costs more.

In the second, the targets already exist — the business came to the conversation with "four-hour RTO, one-hour RPO" already on a page somewhere. That's a different kind of work. We go through how often the system is actually validated against those targets, because targets that have never been tested are aspirations, not commitments. And we expand the frame to the cloud posture — what does continuity look like when the workloads aren't in your datacenter anymore?
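
For the businesses that do arrive with targets, "validated against those targets" has a precise meaning. A minimal sketch of the arithmetic, reusing the four-hour RTO and one-hour RPO from the example above against an invented failover exercise:

    from datetime import datetime, timedelta

    # The example targets from above: four-hour RTO, one-hour RPO.
    RTO = timedelta(hours=4)   # how long the business can tolerate being down
    RPO = timedelta(hours=1)   # how much data it can tolerate losing

    def incident_met_targets(outage_start: datetime,
                             service_restored: datetime,
                             last_good_backup: datetime) -> dict[str, bool]:
        """Compare one incident (or one test) against the agreed targets.

        RTO is measured from outage start to service restoration.
        RPO is measured from the last recoverable copy to the outage.
        """
        downtime = service_restored - outage_start
        data_loss = outage_start - last_good_backup
        return {"rto_met": downtime <= RTO, "rpo_met": data_loss <= RPO}

    # A hypothetical exercise: down at 02:00, restored at 05:30,
    # last good backup taken at 01:45 the same night.
    print(incident_met_targets(
        outage_start=datetime(2025, 3, 4, 2, 0),
        service_restored=datetime(2025, 3, 4, 5, 30),
        last_good_backup=datetime(2025, 3, 4, 1, 45),
    ))  # -> {'rto_met': True, 'rpo_met': True}

Every tabletop and failover exercise should end by putting exactly that pair of answers on the record.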

In the cloud

What backup and DR look like now.

The conversation shifts when the estate is cloud-first. The failure modes change, the tooling changes, and the misconceptions change with them.

On M365, the most dangerous assumption is also the most common: that Microsoft is handling backup. They aren't. Microsoft is responsible for keeping the service available. Your data inside the service is your responsibility — they provide the tools for you to take it, but the taking is on you. SharePoint and Exchange have retention features that cover a lot of the everyday "I deleted the wrong file" scenarios, and for many businesses those features are enough. For others — especially in regulated industries — the retention policies inside the tenant don't satisfy compliance or governance requirements, and an independent backup is still required. Sometimes that backup is another cloud service. Sometimes, for the right reasons, it's still on-premises.

On Azure, the toolset is broad. Azure Backup for VMs and files. Azure Site Recovery for full-workload replication and failover. Regional pairs and zone-redundant storage for resilience against localized outages. Immutable backup vaults for ransomware resilience. Which of those gets used on a given engagement isn't a technology question — it's the business question running underneath every continuity conversation: how much downtime is tolerable, how much data loss is tolerable, and which data is actually critical. Every business answers those three differently. Price is always part of the answer.
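
To make the shape of that decision concrete, here is an illustrative heuristic only; the thresholds are invented, and on a real engagement price and per-workload criticality move every one of these lines:

    def suggest_pattern(rto_hours: float, rpo_hours: float) -> str:
        """Map downtime and data-loss tolerance to a continuity pattern.

        Thresholds are illustrative, not a sizing guide.
        """
        if rto_hours < 1 and rpo_hours < 0.25:
            # Near-zero tolerance: standby compute kept continuously in
            # sync, with orchestrated failover.
            return "hot standby (continuous replication, automated failover)"
        if rto_hours <= 4:
            # Hours of tolerance: replicated workloads brought up on
            # demand, in the style of site-recovery failover.
            return "warm standby (replication, runbook-driven failover)"
        # A day or more of tolerance: restore from backup into rebuilt
        # or pre-provisioned infrastructure. Cheapest, slowest.
        return "backup and restore"

    for rto, rpo in [(0.5, 0.1), (4, 1), (48, 24)]:
        print(f"RTO {rto}h / RPO {rpo}h -> {suggest_pattern(rto, rpo)}")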

On the M365 assumption

The single most common misconception in cloud-era continuity is that Microsoft is backing up your tenant. They aren't. They're keeping the service running. What happens to your data inside that service — retention, recoverability, long-term preservation — is your call to make. Sometimes the built-in features are enough. Often, especially once compliance enters the room, they aren't.

Microsoft provides the tools for you to take responsibility — not responsibility itself.
The preflight

What we check first, every time.

When a client tells us "yes, we have DR," the first engagement hour is usually the same six questions. These are the gaps that show up over and over. None of them are sophisticated. All of them are the difference between a continuity story and a continuity posture.

The quiet failure modes behind claimed DR posture

  1. Backups that have never been restored from

     A backup that hasn't produced a working restore is a hope, not a backup. The first thing we ask is when the last successful restore test was and what it covered. If the answer is vague, that is the answer. (A minimal verification sketch follows this list.)

  2. A runbook written three years ago

     Runbooks age out. The people named in them move on. The phone numbers stop working. The tenant IDs change. A current runbook and a plausible runbook aren't the same document.

  3. No named person with authority to invoke DR

     When nobody has explicit authority to make the call, the decision defaults upward — and then stalls while the person who'd make it is in the air, on vacation, or unreachable. A named caller, with a named backup, is what moves the clock from decision to action.

  4. RTO and RPO numbers nobody owns

     If the targets were set by IT and never ratified by the business, the architecture is guessing. The numbers have to come from the people who pay the cost of being down, not the people who pay the cost of the infrastructure.

  5. Off-site backup with no compute on the other end

     A replicated backup that can't be run on its own isn't disaster recovery. It's a second copy of the data with a long wait for a rebuilt environment attached. Real DR assumes the primary environment isn't coming back on its own.

  6. Dependencies that aren't in the plan

     DNS. Identity. The one internal app that everything else authenticates against. The certificate authority. These are the silent dependencies that a plan built around "recover the application servers" quietly assumes into existence. When they aren't in the plan, the plan doesn't work.
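
For the first item on that list, "tested" doesn't need to be elaborate. A minimal sketch of restore verification by checksum, in Python, with placeholder paths:

    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Stream a file through SHA-256 so large restores don't exhaust memory."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def restore_matches(source: Path, restored: Path) -> bool:
        """True only if the restored copy is byte-identical to the source."""
        return sha256(source) == sha256(restored)

    # Placeholder paths: point these at a real file and its test restore.
    if restore_matches(Path("/data/ledger.db"), Path("/restore-test/ledger.db")):
        print("restore verified: contents match")
    else:
        print("ALERT: restored copy does not match source")

Run it on a schedule against a rotating sample of critical files, and "when was the last successful restore test" gets an answer with a date on it.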

The real deliverable

The artifact is the runbook. The outcome is sleep.

Every continuity engagement produces a stack of tangible artifacts. A runbook. Scheduled backup tasks that actually work. An architecture that answers to the numbers the business owns. A tabletop exercise that's been run at least once. Documentation that describes what "good" looks like in enough detail that a different person could step into the role.

Those artifacts matter. But the deliverable that the business actually hired us for is less visible. It's the shift in how the question gets answered when something happens. Before the engagement, the answer to "what do we do?" is "I don't know — what now?" After, the answer is a sequence. Someone calls it. Someone invokes it. Someone starts working the plan. The business doesn't stop trembling — real incidents are still stressful — but it stops freezing.

That's the real deliverable. Peace of mind. The confidence that if something bad happens on a Tuesday at 2 a.m., the response is action instead of a long silence. The artifacts are the proof. The sleep is the product.

Book a call

When's the last time you actually tested a restore?

If the answer is "I'm not sure," "we haven't," or "the backup runs nightly so we should be fine" — that's the conversation to have. The good news is that a credible recovery story is usually closer, cheaper, and less dramatic than people expect. Most of the work is turning what you already have into something that would actually hold up at 2 a.m.

Or reach us directly: info@fouronesixit.ca · (647) 371-0400