What is disaster recovery planning for IBM MQ?

DR planning documents how messaging services are restored after site loss, corruption, or prolonged outage. It defines recovery time and recovery point objectives, roles, dependencies, backup queue managers, and test schedules, not only technology choices.

How is DR different from high availability?

HA handles node or process failure within or near a site with seconds-to-minutes failover. DR addresses catastrophic loss—datacenter fire, region outage, ransomware—often involving a secondary site and longer recovery windows.

What are RTO and RPO in MQ DR?

Recovery time objective (RTO) is the maximum acceptable downtime. Recovery point objective (RPO) is the maximum acceptable data loss. Message persistence and the logging and replication strategy determine whether RPO is zero or minutes of messages.

Who should sign off an MQ DR plan?

Business owners for revenue-critical flows, enterprise architecture, security, operations, and the application teams that consume MQ should all sign off. Technology teams alone cannot set RTO/RPO without business input.

How often should DR plans be tested?

At least annually for full DR exercises on critical hubs; quarterly component tests (failover, log restore, client reconnect) are common in regulated industries. Untested plans are assumptions.

MainframeMaster

Disaster Recovery Planning

Disaster recovery planning for IBM MQ is the discipline of deciding—before smoke fills the datacenter—how your organization will restore message queuing when the primary site is gone, untrusted, or unreachable for hours. High availability keeps a hub running when one server fails; disaster recovery answers what happens when the entire building, region, or storage array is lost. Beginners often treat HA products as complete protection. They are not. Multi-instance queue managers, Native HA, and RDQM excel at fast failover within a resilience domain. When that domain disappears, you need a documented path to a backup queue manager, restored logs, redirected clients, and validated message backlog handling. This tutorial walks the planning lifecycle: business impact, recovery objectives, tiered architectures, documentation, dependencies, security during crisis, and testing culture. The goal is a plan operators can execute at 3 a.m. without inventing steps.

Start With Business Impact, Not Technology

List every application that puts or gets messages through each critical queue manager. For each flow, capture peak message rate, average message size, persistence requirement, whether transactions use syncpoint or XA, and regulatory retention rules. Payment authorization might tolerate two minutes of downtime but zero lost persistent messages. Marketing email triggers might tolerate an hour and best-effort delivery. Without this matrix, architects over-build DR for low-value traffic and under-build for settlement hubs. Interview application owners with concrete scenarios: primary datacenter unavailable, ransomware encrypting shared storage, operator error deleting queue manager data. Their answers become recovery time objective (RTO) and recovery point objective (RPO) numbers with names attached—not anonymous SLAs.
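One way to keep this matrix honest is to hold it as structured data rather than a slide. The sketch below is illustrative only: the field names, the two sample flows, their owners, and every number are assumptions loosely based on the examples in this section, not a standard MQ schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One row of the business impact matrix (hypothetical field names)."""
    name: str
    owner: str                # named person or role accountable for the numbers
    peak_msgs_per_sec: float
    avg_msg_bytes: int
    persistent: bool
    rto_minutes: int          # longest tolerable outage
    rpo_seconds: int          # largest tolerable window of lost messages


# Hypothetical flows; replace with figures agreed in owner interviews.
FLOWS = [
    Flow("marketing-email-trigger", "Marketing ops lead", 50.0, 1024, False, 60, 3600),
    Flow("payment-authorization", "Payments COO", 800.0, 2048, True, 2, 0),
]


def strictest_first(flows):
    """Order flows so the tightest RPO, then the tightest RTO, comes first."""
    return sorted(flows, key=lambda f: (f.rpo_seconds, f.rto_minutes))


for flow in strictest_first(FLOWS):
    print(f"{flow.name}: RTO {flow.rto_minutes} min, "
          f"RPO {flow.rpo_seconds} s, owner {flow.owner}")
```

Sorting by strictness puts the flows that drive the DR design at the top of the review, and the owner field keeps each number attached to a name.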

Common DR planning artifacts
Artifact	Purpose	Typical owner
Business impact analysis	Prioritize flows and set RTO/RPO	Business continuity
MQ dependency map	Channels, LDAP, DB2, CICS, DNS, firewalls	Middleware ops
DR runbook	Step-by-step recovery commands	MQ administrators
Client reconnect guide	CCDT, connection names, ports	Application teams
Test report	Proof RTO/RPO met in drill	All signatories

RTO and RPO in Plain Terms

Recovery time objective is the longest period messaging may be unavailable before business harm becomes unacceptable. It includes detection, decision to declare disaster, network redirection, starting the backup queue manager, log replay, channel restarts, and application reconnect—not only the strmqm command. Recovery point objective is how much message data you may lose. Synchronous replication and shared storage with strict fencing target RPO near zero for persistent messages. Asynchronous replication across regions may accept seconds or minutes of lag—document that lag as RPO. Non-persistent messages are often excluded from RPO guarantees by design; say so explicitly so auditors do not assume otherwise.
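Because RTO covers every phase from detection to application reconnect, it can help to add the phases up explicitly. The following is a minimal sketch; the phase names follow the paragraph above, but the durations and the 10-minute target are invented for illustration and should come from your own drill measurements.

```python
# Hypothetical phase estimates in minutes; measure these in drills.
PHASES = {
    "detection": 2.0,
    "declare_disaster": 3.0,
    "network_redirection": 1.5,
    "start_backup_queue_manager": 1.0,
    "log_replay": 1.5,
    "channel_restarts": 0.5,
    "application_reconnect": 1.0,
}


def rto_budget(phases, target_minutes):
    """Return (total estimated minutes, headroom against the target)."""
    total = sum(phases.values())
    return total, target_minutes - total


total, headroom = rto_budget(PHASES, target_minutes=10.0)
print(f"Estimated recovery: {total} min, headroom: {headroom} min")
if headroom < 0:
    slowest = max(PHASES, key=PHASES.get)
    print(f"Target missed; longest phase is '{slowest}'")
```

In this made-up example starting the queue manager takes one minute, yet the plan still misses its target because declaring the disaster takes three.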

Tiered DR Strategies

Not every queue manager warrants a hot standby in another region. Tiering saves cost and complexity. Tier 1: synchronous or near-synchronous HA plus automated DR to a warm backup queue manager with pre-defined channels and DNS. Tier 2: warm standby with periodic object replication and log shipping—manual or semi-automated activation. Tier 3: cold backup—media restore from tape or object storage, longest RTO, acceptable only for non-critical workloads. Map each queue manager in your estate to a tier and publish it. Architects reviewing a new application should look up the tier instead of renegotiating DR from scratch per project.

Tier 1: RTO under fifteen minutes, RPO zero for persistent traffic—payment hubs, fraud scoring.
Tier 2: RTO one to four hours, small RPO—internal logistics, batch feeds with replay files.
Tier 3: RTO next business day—development mirrors, low-volume reporting queues.
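A published tier table can double as a lookup for architects. The sketch below picks the least expensive tier that still meets a flow's requirements. The Tier 1 and Tier 2 RTO bounds come from the list above; the RPO bounds for Tiers 2 and 3 and the 24-hour reading of "next business day" are assumptions made only so the example runs.

```python
# (tier, worst-case RTO in minutes, worst-case RPO in seconds), cheapest first.
# RPO bounds for tiers 2 and 3 are assumed values, not from the tier list.
TIERS = [
    (3, 24 * 60, 24 * 3600),
    (2, 4 * 60, 300),
    (1, 15, 0),
]


def cheapest_tier(required_rto_min, required_rpo_sec):
    """Return the cheapest tier whose worst case satisfies both requirements.

    Returns None when the requirement is tighter than Tier 1 can promise,
    which signals a design conversation rather than a lookup.
    """
    for tier, rto_min, rpo_sec in TIERS:
        if rto_min <= required_rto_min and rpo_sec <= required_rpo_sec:
            return tier
    return None


print(cheapest_tier(15, 0))          # payment hub style requirement
print(cheapest_tier(240, 600))       # internal logistics style requirement
print(cheapest_tier(2880, 172800))   # reporting queue style requirement
```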

HA and DR Work Together

Document how HA failover transitions into DR when the whole site fails. Example: an RDQM HA group loses quorum because two of its three nodes are in the flooded datacenter. The surviving node cannot run the queue manager without quorum, so the DR runbook starts the backup queue manager in the secondary region using shipped logs and object exports. Another pattern: a multi-instance queue manager on a shared SAN fails when the SAN is corrupted. The standby cannot start cleanly against damaged data, so the runbook restores from backup media to the DR site. Each product has failure modes HA alone cannot fix. Your plan should list those modes and the decision tree operators follow.
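The decision tree can be written down as a small function so that the branches are unambiguous. This is a simplified sketch of the two patterns described above; the three inputs and the action names are hypothetical, and a real runbook would have more branches and human approval steps.

```python
from enum import Enum


class Action(Enum):
    WAIT_FOR_HA = "let HA failover complete within the site"
    ACTIVATE_DR = "declare disaster and start the backup queue manager"
    RESTORE_FROM_BACKUP = "restore from backup media at the DR site"


def next_action(site_reachable, ha_has_quorum, storage_trusted):
    """Pick the runbook branch from three observed facts (hypothetical inputs)."""
    if not storage_trusted:
        # Corrupted or encrypted storage: a standby would start on bad data.
        return Action.RESTORE_FROM_BACKUP
    if site_reachable and ha_has_quorum:
        # Ordinary node failure: HA is designed for this.
        return Action.WAIT_FOR_HA
    # Site lost or quorum lost: HA cannot recover on its own.
    return Action.ACTIVATE_DR


# RDQM example: two of three nodes flooded, surviving node has no quorum.
print(next_action(site_reachable=True, ha_has_quorum=False, storage_trusted=True).value)
# Multi-instance example: shared SAN corrupted.
print(next_action(site_reachable=True, ha_has_quorum=True, storage_trusted=False).value)
```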

Scope: What the Plan Must Cover

Queue manager names, platforms (Linux, Windows, z/OS), and HA products in use.
Network: listeners, TLS certificates, firewall rules, DNS aliases, cluster repository locations.
Security: CONNAUTH, CHLAUTH, certificates expiring during DR—renewal paths from DR site.
Applications: CCDT versions stored off-site, reconnect options, idempotent consumers.
Operations: who declares disaster, communication tree, rollback if primary returns mid-DR.
Data: log retention, backup frequency, object definitions export (MQSC or GitOps).

Object Definitions and Configuration Drift

Messages live in logs and queues; routing lives in object definitions: queues, channels, authinfo, listeners. DR fails when the backup site has outdated MQSC. Treat authoritative definitions as code: keep them in version control, export them nightly, or capture them through infrastructure-as-code pipelines. After DR activation, operators should not hand-type hundreds of DEFINE CHANNEL commands under pressure. Automate baseline restore, then apply emergency deltas documented in the runbook. On z/OS, include the CSQINP1 and CSQINP2 initialization input data sets, started-task procedures, and RACF profiles in scope; on distributed platforms, include authority records (for example, as setmqaut commands) and the OS users and groups they reference.
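Drift between sites can be caught by comparing exported definitions before a disaster rather than during one. The sketch below is a simplified illustration: it assumes each export has already been flattened to one DEFINE statement per line, and the object names and attributes in the sample text are invented.

```python
def parse_definitions(mqsc_text):
    """Map 'TYPE(NAME)' to its whitespace-normalized DEFINE line.

    Assumes one complete DEFINE statement per line; object names are kept
    case-sensitive because MQ object names are.
    """
    definitions = {}
    for raw in mqsc_text.splitlines():
        line = " ".join(raw.split())
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "DEFINE":
            continue
        definitions[parts[1]] = line
    return definitions


def drift(primary_text, backup_text):
    """Return objects missing from, extra on, or different at the backup site."""
    primary = parse_definitions(primary_text)
    backup = parse_definitions(backup_text)
    missing = sorted(set(primary) - set(backup))
    extra = sorted(set(backup) - set(primary))
    changed = sorted(k for k in set(primary) & set(backup) if primary[k] != backup[k])
    return missing, extra, changed


# Hypothetical exports for illustration.
PRIMARY = """\
DEFINE QLOCAL('PAY.IN') MAXDEPTH(50000)
DEFINE CHANNEL('TO.CORE') CHLTYPE(SDR) CONNAME('core.example.com(1414)') XMITQ('CORE.XMIT')
"""
BACKUP = """\
DEFINE QLOCAL('PAY.IN')   MAXDEPTH(5000)
"""

missing, extra, changed = drift(PRIMARY, BACKUP)
print("missing at backup:", missing)
print("extra at backup:", extra)
print("attribute drift:", changed)
```

Running a comparison like this on a schedule turns "the backup site has outdated MQSC" from a surprise during activation into a routine ticket.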

Backlog and Cutover Messaging

During DR, transmission queues may hold hours of traffic. Plan disk on the backup site for maximum credible backlog: rate × duration × average message size × safety factor. Decide whether applications replay from upstream systems or drain XMITQ after channels restart. Some estates freeze producers at the load balancer while MQ recovers, then allow catch-up with throttling to avoid overwhelming consumers. Document maximum queue depth and actions when MAXDEPTH approaches—pause non-critical feeds first.
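The backlog formula above translates directly into a sizing helper. This is a rough sketch: the message rate, outage duration, message size, and safety factor are example inputs, and the result ignores per-message header overhead and log space, which a real sizing exercise should add.

```python
def backlog_bytes(msgs_per_sec, outage_seconds, avg_msg_bytes, safety_factor=1.5):
    """Estimate queue storage needed: rate x duration x size x safety factor."""
    return msgs_per_sec * outage_seconds * avg_msg_bytes * safety_factor


# Hypothetical inputs: 500 msg/s for a 4-hour outage at 4 KiB per message.
needed = backlog_bytes(
    msgs_per_sec=500,
    outage_seconds=4 * 3600,
    avg_msg_bytes=4096,
    safety_factor=1.5,
)
print(f"Plan for about {needed / 2**30:.1f} GiB of queue storage at the backup site")
```

Comparing this figure with the disk actually provisioned at the backup site is a quick check to include in every DR test.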

Regulatory and Audit Expectations

Financial and healthcare regulators expect evidence of tested DR, not slide decks. Retain test reports with date, participants, actual RTO achieved, anomalies, and remediation tickets. Separate HA failover tests from full DR tests—both matter. Auditors often ask whether privileged access during DR is logged and whether break-glass accounts differ from production norms. Include security officer review in annual plan updates.

Tutorial: DR Plan Outline for One Hub

DR PLAN: QM_PAYMENT (Tier 1)
Business owner: Payments COO | RTO: 10 min | RPO: 0 persistent
 
PRIMARY: RDQM 3-node, dc-east
BACKUP: QM_PAYMENT_DR, dc-west (warm, manual activation)
 
Dependencies: LDAP auth.example.com, DNS mq-pay.example.com,
  TLS cert *.mq-pay.example.com (vault path /certs/mq-pay)
 
Activation steps (summary):
  1. Declare disaster (ops lead + business)
  2. Isolate dc-east on the network so the primary cannot restart alongside the backup
  3. Start QM_PAYMENT_DR (strmqm) on dc-west per runbook §4
  4. Verify DISPLAY QMSTATUS, listener, critical queues
  5. Update DNS mq-pay -> dc-west LB
  6. Notify app teams; verify CCDT reconnect
  7. START channels per channel list appendix B
  8. Monitor CURDEPTH until backlog < threshold
 
Test: Full DR drill 2026-Q2, next 2026-Q4

Explainer: Fire Drill for the Post Office

DR planning is deciding which backup post office opens when the main building burns, how mail bags in trucks are rerouted, and how long customers wait for letters already in transit. You write the plan while calm, practice it yearly, and fix anything that confused people during practice.

Explain Like I'm Five

Your favorite toy store might close if the roof breaks. DR planning is picking another store that can sell the same toys, writing directions for grown-ups to move the toys there, and practicing the move before the roof actually breaks so nobody cries on toy day.

Practice Exercises

Exercise 1: RTO/RPO Workshop

For a card authorization queue with persistent messages, draft RTO and RPO with business justification in three sentences each.

Exercise 2: Dependency Map

List ten non-MQ dependencies that could block DR activation for one queue manager.

Exercise 3: Tier Assignment

Assign Tier 1–3 to five fictional queue managers with different business criticality.

Frequently Asked Questions

Test Your Knowledge

1. DR planning primarily answers:

Which cipher to use
How to restore messaging after major outage
How to tune BATCHSZ
How to define topics only

2. RPO measures:

Acceptable data loss window
Channel batch size
Listener port
Topic string depth

3. HA versus DR:

HA is local/fast; DR is site-level/slower
They are identical
DR is always faster
HA never uses logs

4. A good DR plan includes:

Runbooks, roles, test dates, dependencies
Only queue names
Only JCL
No client impact analysis
