Disaster recovery planning for IBM MQ is the discipline of deciding—before smoke fills the datacenter—how your organization will restore message queuing when the primary site is gone, untrusted, or unreachable for hours. High availability keeps a hub running when one server fails; disaster recovery answers what happens when the entire building, region, or storage array is lost. Beginners often treat HA products as complete protection. They are not. Multi-instance queue managers, Native HA, and RDQM excel at fast failover within a resilience domain. When that domain disappears, you need a documented path to a backup queue manager, restored logs, redirected clients, and validated message backlog handling. This tutorial walks the planning lifecycle: business impact, recovery objectives, tiered architectures, documentation, dependencies, security during crisis, and testing culture. The goal is a plan operators can execute at 3 a.m. without inventing steps.
List every application that puts or gets messages through each critical queue manager. For each flow, capture peak message rate, average message size, persistence requirement, whether transactions use syncpoint or XA, and regulatory retention rules. Payment authorization might tolerate two minutes of downtime but zero lost persistent messages. Marketing email triggers might tolerate an hour and best-effort delivery. Without this matrix, architects over-build DR for low-value traffic and under-build for settlement hubs. Interview application owners with concrete scenarios: primary datacenter unavailable, ransomware encrypting shared storage, operator error deleting queue manager data. Their answers become recovery time objective (RTO) and recovery point objective (RPO) numbers with names attached—not anonymous SLAs.
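One way to keep this matrix reviewable is to hold it as structured data rather than in meeting notes. The following is a minimal sketch, not a prescribed schema: the `MessageFlow` class, its field names, the queue manager names, the owners, and every number are hypothetical placeholders for values your interviews would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageFlow:
    """One row of the business impact matrix (all fields are illustrative)."""
    name: str
    queue_manager: str
    owner: str               # named person or team who agreed to the numbers
    peak_msgs_per_sec: float
    avg_msg_bytes: int
    persistent: bool
    transaction_mode: str    # "none", "syncpoint", or "xa"
    rto_minutes: float
    rpo_seconds: float


# Hypothetical sample rows; real values come from application owners.
flows = [
    MessageFlow("payment-auth", "QM_AUTH", "Payments COO",
                1200, 2048, True, "xa", 2, 0),
    MessageFlow("fraud-score", "QM_AUTH", "Risk lead",
                300, 1024, True, "syncpoint", 15, 0),
    MessageFlow("marketing-email", "QM_MKT", "CRM lead",
                50, 8192, False, "none", 60, 3600),
]


def strictest_per_queue_manager(rows):
    """Return {queue_manager: (lowest RTO in minutes, lowest RPO in seconds)}."""
    result = {}
    for row in rows:
        rto, rpo = result.get(row.queue_manager, (float("inf"), float("inf")))
        result[row.queue_manager] = (min(rto, row.rto_minutes),
                                     min(rpo, row.rpo_seconds))
    return result


for qmgr, (rto, rpo) in strictest_per_queue_manager(flows).items():
    print(f"{qmgr}: RTO {rto} min, RPO {rpo} s")
```

Taking the strictest objective per queue manager reflects the idea that a shared hub must be recovered to the standard of its most demanding flow.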
| Artifact | Purpose | Typical owner |
|---|---|---|
| Business impact analysis | Prioritize flows and set RTO/RPO | Business continuity |
| MQ dependency map | Channels, LDAP, DB2, CICS, DNS, firewalls | Middleware ops |
| DR runbook | Step-by-step recovery commands | MQ administrators |
| Client reconnect guide | CCDT, connection names, ports | Application teams |
| Test report | Proof RTO/RPO met in drill | All signatories |
Recovery time objective is the longest period messaging may be unavailable before business harm becomes unacceptable. It includes detection, decision to declare disaster, network redirection, starting the backup queue manager, log replay, channel restarts, and application reconnect—not only the strmqm command. Recovery point objective is how much message data you may lose. Synchronous replication and shared storage with strict fencing target RPO near zero for persistent messages. Asynchronous replication across regions may accept seconds or minutes of lag—document that lag as RPO. Non-persistent messages are often excluded from RPO guarantees by design; say so explicitly so auditors do not assume otherwise.
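Because RTO covers every phase from detection to application reconnect, it helps to add the phases up before committing to a number. The sketch below is only an arithmetic illustration: the 10-minute target and all phase durations are invented, and real figures should come from timed drills.

```python
RTO_MINUTES = 10.0  # hypothetical target agreed with the business

# Hypothetical phase estimates in minutes; replace with measured drill timings.
phase_minutes = {
    "detection": 2.0,
    "decision to declare disaster": 3.0,
    "network redirection": 1.0,
    "start backup queue manager": 1.0,
    "log replay": 1.5,
    "channel restarts": 0.5,
    "application reconnect": 2.0,
}


def rto_budget(phases, target):
    """Return (total planned minutes, slack against the target)."""
    total = sum(phases.values())
    return total, target - total


total, slack = rto_budget(phase_minutes, RTO_MINUTES)
print(f"planned recovery: {total:.1f} min, slack: {slack:+.1f} min")
# prints "planned recovery: 11.0 min, slack: -1.0 min"
```

In this invented example, starting the queue manager takes one minute, yet the plan still misses the target because detection and the declaration decision consume half the budget.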
Not every queue manager warrants a hot standby in another region. Tiering saves cost and complexity.

- Tier 1: synchronous or near-synchronous HA plus automated DR to a warm backup queue manager with pre-defined channels and DNS.
- Tier 2: warm standby with periodic object replication and log shipping, activated manually or semi-automatically.
- Tier 3: cold backup restored from media on tape or object storage; longest RTO, acceptable only for non-critical workloads.

Map each queue manager in your estate to a tier and publish it. Architects reviewing a new application should look up the tier instead of renegotiating DR from scratch per project.
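A published tier map can be as simple as a lookup table that fails loudly when a queue manager has no tier. This is a sketch under stated assumptions: the `Tier` enum, the `ESTATE` registry, and the queue manager names other than the idea of tiering itself are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    ONE = 1    # HA plus automated DR to a warm backup
    TWO = 2    # warm standby, manual or semi-automated activation
    THREE = 3  # cold backup restored from media


# Hypothetical published registry; in practice this might live in version control.
ESTATE = {
    "QM_PAYMENT": Tier.ONE,
    "QM_REPORTING": Tier.TWO,
    "QM_ARCHIVE": Tier.THREE,
}


def lookup_tier(queue_manager):
    """Return the published tier, or raise if none has been assigned."""
    try:
        return ESTATE[queue_manager]
    except KeyError:
        raise LookupError(
            f"{queue_manager} has no published DR tier; assign one before go-live"
        ) from None


print(lookup_tier("QM_PAYMENT").name)
```

Raising an error for an unknown queue manager, rather than defaulting to the lowest tier, forces the tiering conversation to happen before an application goes live.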
Document how HA failover transitions into DR when the whole site fails. Example: an RDQM HA group loses quorum because two of its three nodes sit in the flooded datacenter. The surviving node cannot run the queue manager without quorum, so the DR runbook starts the backup queue manager in the secondary region using shipped logs and object exports. Another pattern: a multi-instance queue manager on a shared SAN fails because the SAN is corrupted. The standby instance cannot start cleanly from the same damaged data, so operators restore from backup media at the DR site. Each product has failure modes that HA alone cannot fix. Your plan should list those modes and the decision tree operators follow.
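The two patterns above can be written down as an explicit decision function so the runbook and the drill use the same logic. This is a deliberately simplified sketch: the three boolean inputs, the `Action` names, and the ordering of checks are assumptions for illustration, and a real decision tree would have more branches and human approval steps.

```python
from enum import Enum


class Action(Enum):
    HA_FAILOVER = "let HA fail over locally; no disaster declaration"
    START_DR_BACKUP = ("start backup queue manager in secondary region "
                       "from shipped logs and object exports")
    RESTORE_FROM_MEDIA = "restore queue manager from backup media at DR site"


def next_action(site_reachable, ha_has_quorum, storage_intact):
    """Pick the recovery path for one queue manager (illustrative only)."""
    if not storage_intact:
        # Corrupted shared storage: the standby would start from the same bad data.
        return Action.RESTORE_FROM_MEDIA
    if not site_reachable or not ha_has_quorum:
        # Site lost or quorum lost: HA cannot promote, so DR takes over.
        return Action.START_DR_BACKUP
    return Action.HA_FAILOVER


# Flooded datacenter: two of three RDQM nodes gone, storage elsewhere is fine.
print(next_action(site_reachable=False, ha_has_quorum=False, storage_intact=True).value)
```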
Messages live in logs and queues; routing lives in object definitions: queues, channels, authinfo objects, and listeners. DR fails when the backup site runs outdated MQSC definitions. Treat authoritative definitions as code: keep them in version control, export them nightly, or capture them in infrastructure-as-code pipelines. After DR activation, operators should not hand-type hundreds of DEFINE CHANNEL commands under pressure. Automate the baseline restore, then apply the emergency deltas documented in the runbook. On z/OS, include the CSQINP initialization input data sets (CSQINP1, CSQINP2), started-task procedures, and RACF profiles in scope; on distributed platforms, include the authority records set with setmqaut and the OS users and groups they reference.
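Drift between primary and backup definitions can be detected by comparing two MQSC exports. The sketch below assumes each export holds one complete DEFINE statement per line (the kind of output `dmpmqcfg` with `-o 1line` is intended to produce for diffing); the parser, the function names, and the sample definitions are hypothetical, and it ignores ALTER and SET statements entirely.

```python
import re

# Matches: DEFINE <TYPE>('<NAME>') <attributes...>
DEFINE_RE = re.compile(r"^DEFINE\s+(\w+)\s*\(\s*'?([^')]+)'?\s*\)\s*(.*)$",
                       re.IGNORECASE)


def parse_definitions(text):
    """Map (object type, object name) -> attribute string for each DEFINE line."""
    objects = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):  # skip blanks and MQSC comments
            continue
        match = DEFINE_RE.match(line)
        if match:
            obj_type, name, attrs = match.groups()
            objects[(obj_type.upper(), name)] = " ".join(attrs.split())
    return objects


def drift(primary_text, backup_text):
    """Report objects missing, extra, or different at the backup site."""
    primary = parse_definitions(primary_text)
    backup = parse_definitions(backup_text)
    shared = primary.keys() & backup.keys()
    return {
        "missing_at_backup": sorted(primary.keys() - backup.keys()),
        "extra_at_backup": sorted(backup.keys() - primary.keys()),
        "changed": sorted(k for k in shared if primary[k] != backup[k]),
    }


# Hypothetical sample exports.
primary_export = """\
* nightly export, primary site
DEFINE QLOCAL('PAY.AUTH.REQ') MAXDEPTH(50000) DEFPSIST(YES)
DEFINE CHANNEL('PAY.TO.BANK') CHLTYPE(SDR) CONNAME('bank.example.com(1414)') XMITQ('BANK.XMITQ')
DEFINE QLOCAL('BANK.XMITQ') USAGE(XMITQ)
"""
backup_export = """\
DEFINE QLOCAL('PAY.AUTH.REQ') MAXDEPTH(5000) DEFPSIST(YES)
DEFINE QLOCAL('BANK.XMITQ') USAGE(XMITQ)
"""

report = drift(primary_export, backup_export)
for category, keys in report.items():
    print(category, keys)
```

Running a comparison like this on a schedule turns "the backup site has outdated MQSC" from a surprise during activation into a routine ticket.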
During DR, transmission queues may hold hours of traffic. Plan disk on the backup site for the maximum credible backlog: rate × duration × average message size × safety factor. Decide whether applications replay from upstream systems or whether transmission queues drain after channels restart. Some estates freeze producers at the load balancer while MQ recovers, then allow catch-up with throttling to avoid overwhelming consumers. Document the maximum queue depth for each critical queue and the actions to take when current depth (CURDEPTH) approaches MAXDEPTH, pausing non-critical feeds first.
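The sizing formula above is simple enough to compute directly. In this sketch the message rate, outage duration, message size, and safety factor are invented inputs, and the estimate covers message payload only; per-message header and log overhead are not modeled and would need to be folded into the safety factor.

```python
def backlog_messages(msgs_per_sec, outage_seconds):
    """Number of messages that accumulate while channels are down."""
    return msgs_per_sec * outage_seconds


def backlog_bytes(msgs_per_sec, outage_seconds, avg_msg_bytes, safety_factor=1.5):
    """rate x duration x average message size x safety factor, in bytes."""
    return backlog_messages(msgs_per_sec, outage_seconds) * avg_msg_bytes * safety_factor


# Hypothetical inputs: 500 msg/s for a 4-hour outage, 4 KiB messages, 2x safety.
rate = 500
outage = 4 * 3600
size = 4096

depth = backlog_messages(rate, outage)
disk = backlog_bytes(rate, outage, size, safety_factor=2.0)

print(f"expected depth: {depth:,} messages")
print(f"disk to provision: {disk / 2**30:.2f} GiB")
```

The expected depth is worth comparing against the MAXDEPTH of the queues that will hold the backlog, since a queue can fill by message count long before the disk fills by bytes.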
Financial and healthcare regulators expect evidence of tested DR, not slide decks. Retain test reports with date, participants, actual RTO achieved, anomalies, and remediation tickets. Separate HA failover tests from full DR tests—both matter. Auditors often ask whether privileged access during DR is logged and whether break-glass accounts differ from production norms. Include security officer review in annual plan updates.
    DR PLAN: QM_PAYMENT (Tier 1)
    Business owner: Payments COO | RTO: 10 min | RPO: 0 persistent

    PRIMARY: RDQM 3-node, dc-east
    BACKUP: QM_PAYMENT_DR, dc-west (warm, manual activation)

    Dependencies: LDAP auth.example.com, DNS mq-pay.example.com,
      TLS cert *.mq-pay.example.com (vault path /certs/mq-pay)

    Activation steps (summary):
    1. Declare disaster (ops lead + business)
    2. Stop false promotion risk in dc-east (network isolate)
    3. START QM_PAYMENT_DR on dc-west per runbook §4
    4. Verify DISPLAY QMSTATUS, listener, critical queues
    5. Update DNS mq-pay -> dc-west LB
    6. Notify app teams; verify CCDT reconnect
    7. START channels per channel list appendix B
    8. Monitor CURDEPTH until backlog < threshold

    Test: Full DR drill 2026-Q2, next 2026-Q4
DR planning is deciding which backup post office opens when the main building burns, how mail bags in trucks are rerouted, and how long customers wait for letters already in transit. You write the plan while calm, practice it yearly, and fix anything that confused people during practice.
Your favorite toy store might close if the roof breaks. DR planning is picking another store that can sell the same toys, writing directions for grown-ups to move the toys there, and practicing the move before the roof actually breaks so nobody cries on toy day.
For a card authorization queue with persistent messages, draft RTO and RPO with business justification in three sentences each.
List ten non-MQ dependencies that could block DR activation for one queue manager.
Assign Tier 1–3 to five fictional queue managers with different business criticality.
1. DR planning primarily answers:
2. RPO measures:
3. HA versus DR:
4. A good DR plan includes: