Disaster Recovery Planning

Disaster recovery planning for IBM MQ is the discipline of deciding—before smoke fills the datacenter—how your organization will restore message queuing when the primary site is gone, untrusted, or unreachable for hours. High availability keeps a hub running when one server fails; disaster recovery answers what happens when the entire building, region, or storage array is lost. Beginners often treat HA products as complete protection. They are not. Multi-instance queue managers, Native HA, and RDQM excel at fast failover within a resilience domain. When that domain disappears, you need a documented path to a backup queue manager, restored logs, redirected clients, and validated message backlog handling. This tutorial walks the planning lifecycle: business impact, recovery objectives, tiered architectures, documentation, dependencies, security during crisis, and testing culture. The goal is a plan operators can execute at 3 a.m. without inventing steps.

Start With Business Impact, Not Technology

List every application that puts or gets messages through each critical queue manager. For each flow, capture peak message rate, average message size, persistence requirement, whether transactions use syncpoint or XA, and regulatory retention rules. Payment authorization might tolerate two minutes of downtime but zero lost persistent messages. Marketing email triggers might tolerate an hour and best-effort delivery. Without this matrix, architects over-build DR for low-value traffic and under-build for settlement hubs. Interview application owners with concrete scenarios: primary datacenter unavailable, ransomware encrypting shared storage, operator error deleting queue manager data. Their answers become recovery time objective (RTO) and recovery point objective (RPO) numbers with names attached—not anonymous SLAs.
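The matrix described above is easier to keep current as structured data than as a slide. What follows is a minimal sketch in Python; the field names, the owners, and every number other than the two-minute and one-hour tolerances mentioned above are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    """One row of the business impact matrix (field names are hypothetical)."""
    name: str
    owner: str               # named person or team who agreed the numbers
    peak_msgs_per_sec: float
    avg_msg_bytes: int
    persistent: bool
    rto_minutes: int         # longest tolerable outage
    rpo_seconds: int         # tolerable loss window for persistent messages

def by_urgency(flows):
    """Order flows strictest first: lowest RPO, then lowest RTO."""
    return sorted(flows, key=lambda f: (f.rpo_seconds, f.rto_minutes))

flows = [
    Flow("marketing-email", "Marketing ops", 50, 2_000, False, 60, 3600),
    Flow("payment-auth", "Payments COO", 400, 1_500, True, 2, 0),
]

for f in by_urgency(flows):
    print(f"{f.name}: RTO {f.rto_minutes} min, RPO {f.rpo_seconds} s, owner {f.owner}")
```

Sorting by RPO and then RTO puts the flows that need the most expensive protection at the top of the review list.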

Common DR planning artifacts
| Artifact | Purpose | Typical owner |
|---|---|---|
| Business impact analysis | Prioritize flows and set RTO/RPO | Business continuity |
| MQ dependency map | Channels, LDAP, DB2, CICS, DNS, firewalls | Middleware ops |
| DR runbook | Step-by-step recovery commands | MQ administrators |
| Client reconnect guide | CCDT, connection names, ports | Application teams |
| Test report | Proof RTO/RPO met in drill | All signatories |

RTO and RPO in Plain Terms

Recovery time objective is the longest period messaging may be unavailable before business harm becomes unacceptable. It includes detection, decision to declare disaster, network redirection, starting the backup queue manager, log replay, channel restarts, and application reconnect—not only the strmqm command. Recovery point objective is how much message data you may lose. Synchronous replication and shared storage with strict fencing target RPO near zero for persistent messages. Asynchronous replication across regions may accept seconds or minutes of lag—document that lag as RPO. Non-persistent messages are often excluded from RPO guarantees by design; say so explicitly so auditors do not assume otherwise.
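Because RTO covers every phase listed above, it helps to add the phases up explicitly. The sketch below does that in Python; the per-phase minutes are made-up placeholders, to be replaced with timings measured in a drill.

```python
# Hypothetical phase timings in minutes; replace with values measured in a drill.
phases = {
    "detection": 3,
    "declare disaster": 5,
    "network redirection": 4,
    "start backup queue manager": 2,
    "log replay": 6,
    "channel restarts": 3,
    "application reconnect": 4,
}

def rto_report(phases, rto_target_minutes):
    """Return the end-to-end total and whether it fits inside the target."""
    total = sum(phases.values())
    return total, total <= rto_target_minutes

total, met = rto_report(phases, rto_target_minutes=15)
print(f"end-to-end {total} min, target met: {met}")
```

In this invented example, starting the queue manager accounts for 2 of 27 minutes, which is why a plan that times only that step tends to miss its objective.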

Tiered DR Strategies

Not every queue manager warrants a hot standby in another region. Tiering saves cost and complexity. Tier 1: synchronous or near-synchronous HA plus automated DR to a warm backup queue manager with pre-defined channels and DNS. Tier 2: warm standby with periodic object replication and log shipping—manual or semi-automated activation. Tier 3: cold backup—media restore from tape or object storage, longest RTO, acceptable only for non-critical workloads. Map each queue manager in your estate to a tier and publish it. Architects reviewing a new application should look up the tier instead of renegotiating DR from scratch per project.

  1. Tier 1: RTO under fifteen minutes, RPO zero for persistent traffic—payment hubs, fraud scoring.
  2. Tier 2: RTO one to four hours, small RPO—internal logistics, batch feeds with replay files.
  3. Tier 3: RTO next business day—development mirrors, low-volume reporting queues.
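The tier lookup can be sketched as a small function. The thresholds come from the list above; treating everything between the Tier 1 and Tier 3 boundaries as Tier 2, and the queue manager names other than QM_PAYMENT, are assumptions of this sketch.

```python
def assign_tier(rto_minutes, rpo_seconds):
    """Map agreed objectives to a DR tier.

    Tier 1 needs RTO under fifteen minutes and zero RPO; Tier 2 covers
    RTO up to four hours; anything slower is Tier 3.
    """
    if rto_minutes < 15 and rpo_seconds == 0:
        return 1
    if rto_minutes <= 4 * 60:
        return 2
    return 3

# Hypothetical estate: name -> (RTO in minutes, RPO in seconds)
estate = {
    "QM_PAYMENT": (10, 0),
    "QM_LOGISTICS": (120, 60),
    "QM_REPORT_DEV": (24 * 60, 3600),
}

for qmgr, (rto, rpo) in estate.items():
    print(f"{qmgr}: Tier {assign_tier(rto, rpo)}")
```

Publishing the output of a lookup like this gives architects one place to check a queue manager's tier.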

HA and DR Work Together

Document how HA failover transitions into DR when the whole site fails. Example: an RDQM HA group loses quorum when two of its three nodes are in the flooded datacenter. The surviving node cannot run the queue manager without quorum, so the DR runbook starts the backup queue manager in the secondary region using shipped logs and object exports. Another pattern: a multi-instance queue manager on a shared SAN fails when the SAN is corrupted. The standby cannot start cleanly, so the runbook restores from backup media to the DR site. Each product has failure modes HA alone cannot fix. Your plan should list those modes and the decision tree operators follow.
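The decision tree for the two examples above can be written down as code so that it is unambiguous. This is a deliberately simplified sketch: the function names and action strings are hypothetical, and a real tree would have more branches than quorum and storage health.

```python
def has_quorum(reachable_nodes, total_nodes=3):
    """A strict majority of the HA group's nodes must be reachable."""
    return reachable_nodes > total_nodes // 2

def next_action(reachable_nodes, storage_intact, total_nodes=3):
    """Pick the runbook branch for a simplified failure picture."""
    if not storage_intact:
        return "restore from backup media at DR site"
    if has_quorum(reachable_nodes, total_nodes):
        return "let HA fail over locally"
    return "declare disaster and activate DR queue manager"

print(next_action(reachable_nodes=1, storage_intact=True))
print(next_action(reachable_nodes=2, storage_intact=True))
print(next_action(reachable_nodes=3, storage_intact=False))
```

Writing the branches out this way makes it easier to spot a failure mode that has no documented action.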

Scope: What the Plan Must Cover

  • Queue manager names, platforms (Linux, Windows, z/OS), and HA products in use.
  • Network: listeners, TLS certificates, firewall rules, DNS aliases, cluster repository locations.
  • Security: CONNAUTH, CHLAUTH, certificates expiring during DR—renewal paths from DR site.
  • Applications: CCDT versions stored off-site, reconnect options, idempotent consumers.
  • Operations: who declares disaster, communication tree, rollback if primary returns mid-DR.
  • Data: log retention, backup frequency, object definitions export (MQSC or GitOps).
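One item from the security bullet above, certificates that expire while you are running from the DR site, is easy to check ahead of time. The sketch below assumes a hand-maintained inventory of certificate labels and expiry dates; the labels, dates, and 60-day window are invented for illustration.

```python
from datetime import date, timedelta

def expiring(certs, today, window_days=60):
    """Return labels of certificates whose expiry falls within the window."""
    cutoff = today + timedelta(days=window_days)
    return sorted(label for label, not_after in certs.items() if not_after <= cutoff)

# Hypothetical inventory: certificate label -> expiry date
certs = {
    "qm_payment_dr": date(2026, 7, 1),
    "ldap-client": date(2027, 1, 15),
}

print(expiring(certs, today=date(2026, 6, 1)))
```

Running a check like this before each drill shows which renewals must be possible from the DR site.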

Object Definitions and Configuration Drift

Messages live in logs and queues; routing lives in object definitions: queues, channels, authinfo, listeners. DR fails when the backup site has outdated MQSC. Treat authoritative definitions as code in version control, exported nightly or captured by infrastructure-as-code pipelines. After DR activation, operators should not hand-type hundreds of DEFINE CHANNEL commands under pressure. Automate baseline restore, then apply emergency deltas documented in the runbook. On z/OS, include the CSQINP1 and CSQINP2 initialization input data sets, started-task procedures, and RACF profiles in scope; on distributed platforms, include authority records (the setmqaut equivalents of those profiles) and the OS users and groups they reference.
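Drift between the primary and DR definitions can be caught by comparing two MQSC exports, for example files from a nightly dump. The Python sketch below only detects objects that are missing at the DR site; it does not compare attributes, and the abbreviated MQSC text is invented for illustration.

```python
def object_names(mqsc_text):
    """Collect object tokens such as QLOCAL('PAY.IN') from DEFINE statements."""
    found = set()
    for line in mqsc_text.splitlines():
        words = line.strip().split()
        # Lines starting with '*' are MQSC comments and are skipped implicitly.
        if len(words) >= 2 and words[0].upper() == "DEFINE":
            found.add(words[1])
    return found

# Abbreviated, hypothetical exports from each site
primary = """\
* primary export
DEFINE QLOCAL('PAY.IN') MAXDEPTH(50000)
DEFINE CHANNEL('TO.BANK') CHLTYPE(SDR)
"""
dr = """\
DEFINE QLOCAL('PAY.IN') MAXDEPTH(50000)
"""

missing = object_names(primary) - object_names(dr)
print(sorted(missing))  # defined on primary, absent at DR
```

A check like this fits naturally into the nightly export job, so that drift is flagged long before an activation.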

Backlog and Cutover Messaging

During DR, transmission queues may hold hours of traffic. Plan disk on the backup site for the maximum credible backlog: message rate × outage duration × average message size × safety factor. Decide whether applications replay from upstream systems or drain the transmission queues after channels restart. Some estates freeze producers at the load balancer while MQ recovers, then allow catch-up with throttling to avoid overwhelming consumers. Document each critical queue's MAXDEPTH and the actions to take as CURDEPTH approaches it, pausing non-critical feeds first.
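The sizing formula above is simple enough to work through in a few lines. The inputs here (400 messages per second, a four-hour outage, 1,500-byte messages, a safety factor of 2) are illustrative assumptions. The result covers message payload only, so leave extra room for message headers and logs.

```python
def backlog_bytes(msgs_per_sec, outage_seconds, avg_msg_bytes, safety_factor=1.5):
    """Rate x duration x average message size x safety factor."""
    return msgs_per_sec * outage_seconds * avg_msg_bytes * safety_factor

# Hypothetical inputs for one hub
needed = backlog_bytes(
    msgs_per_sec=400,
    outage_seconds=4 * 3600,
    avg_msg_bytes=1500,
    safety_factor=2.0,
)
print(f"{needed / 1e9:.1f} GB of queue storage")
```

Using the peak rate rather than the average is the more conservative choice when the outage might coincide with a busy period.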

Regulatory and Audit Expectations

Financial and healthcare regulators expect evidence of tested DR, not slide decks. Retain test reports with date, participants, actual RTO achieved, anomalies, and remediation tickets. Separate HA failover tests from full DR tests—both matter. Auditors often ask whether privileged access during DR is logged and whether break-glass accounts differ from production norms. Include security officer review in annual plan updates.

Tutorial: DR Plan Outline for One Hub

```text
DR PLAN: QM_PAYMENT (Tier 1)
Business owner: Payments COO | RTO: 10 min | RPO: 0 persistent

PRIMARY: RDQM 3-node, dc-east
BACKUP:  QM_PAYMENT_DR, dc-west (warm, manual activation)

Dependencies: LDAP auth.example.com, DNS mq-pay.example.com,
  TLS cert *.mq-pay.example.com (vault path /certs/mq-pay)

Activation steps (summary):
1. Declare disaster (ops lead + business)
2. Stop false promotion risk in dc-east (network isolate)
3. START QM_PAYMENT_DR on dc-west per runbook §4
4. Verify DISPLAY QMSTATUS, listener, critical queues
5. Update DNS mq-pay -> dc-west LB
6. Notify app teams; verify CCDT reconnect
7. START channels per channel list appendix B
8. Monitor CURDEPTH until backlog < threshold

Test: Full DR drill 2026-Q2, next 2026-Q4
```
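During a drill of the outline above, recording when each activation step completes produces the "actual RTO achieved" figure for the test report. The timestamps and step labels in this Python sketch are invented; only the 10-minute target comes from the plan.

```python
from datetime import datetime

def achieved_rto_minutes(declared_at, step_times):
    """Minutes from disaster declaration to the last completed step."""
    finished = max(step_times.values())
    return (finished - declared_at).total_seconds() / 60

# Hypothetical drill log
declared = datetime(2026, 5, 12, 3, 0)
step_times = {
    "start QM_PAYMENT_DR": datetime(2026, 5, 12, 3, 4),
    "DNS updated": datetime(2026, 5, 12, 3, 7),
    "channels started": datetime(2026, 5, 12, 3, 12),
}

rto = achieved_rto_minutes(declared, step_times)
print(f"achieved RTO {rto:.0f} min, target 10 min, met: {rto <= 10}")
```

A missed target, as in this invented drill, becomes a remediation ticket and is recorded as such in the report.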

Explainer: Fire Drill for the Post Office

DR planning is deciding which backup post office opens when the main building burns, how mail bags in trucks are rerouted, and how long customers wait for letters already in transit. You write the plan while calm, practice it yearly, and fix anything that confused people during practice.

Explain Like I'm Five

Your favorite toy store might close if the roof breaks. DR planning is picking another store that can sell the same toys, writing directions for grown-ups to move the toys there, and practicing the move before the roof actually breaks so nobody cries on toy day.

Practice Exercises

Exercise 1: RTO/RPO Workshop

For a card authorization queue with persistent messages, draft RTO and RPO with business justification in three sentences each.

Exercise 2: Dependency Map

List ten non-MQ dependencies that could block DR activation for one queue manager.

Exercise 3: Tier Assignment

Assign Tier 1–3 to five fictional queue managers with different business criticality.

Test Your Knowledge

1. DR planning primarily answers:

  • Which cipher to use
  • How to restore messaging after major outage
  • How to tune BATCHSZ
  • How to define topics only

2. RPO measures:

  • Acceptable data loss window
  • Channel batch size
  • Listener port
  • Topic string depth

3. HA versus DR:

  • HA is local/fast; DR is site-level/slower
  • They are identical
  • DR is always faster
  • HA never uses logs

4. A good DR plan includes:

  • Runbooks, roles, test dates, dependencies
  • Only queue names
  • Only JCL
  • No client impact analysis

Read time: 22 min
Author: MainframeMaster
Verified: IBM MQ 9.3 documentation