High availability standards for IBM MQ answer two questions executives ask during an outage: how long until we are back, and did we lose messages? The first is the recovery time objective (RTO); the second is the recovery point objective (RPO). Without a written RTO and RPO per tier, architects default to best effort, and operations discovers that shared storage was not mounted on the standby during the only failover that mattered. HA is not one product checkbox: multi-instance, RDQM, active/passive, queue sharing groups, and cross-site channels each trade cost, complexity, and data loss risk differently. This tutorial defines enterprise HA standards that beginners can apply: classify workloads, pick patterns, document failover runbooks, test on schedule, and align CCDT and DNS with the real recovery path rather than the happy-path diagram from year one.
Tier 1 payments might require RTO under five minutes and RPO zero, with synchronous replication or shared disk plus automated failover. Tier 3 reporting queues might accept RTO four hours and RPO fifteen minutes. Standards assign each queue or application to a tier. A shared queue manager must then be built to the strictest tier required by any queue it hosts; alternatively, split workloads across queue managers so that one demanding application does not force Tier 1 provisioning on everything.
| Tier | RTO target | RPO target | Typical pattern |
|---|---|---|---|
| Tier 1 critical | Under 5 min | Zero | RDQM, QSG, or sync replication |
| Tier 2 business | Under 30 min | Minutes | Multi-instance, async replication |
| Tier 3 batch | Hours | Hours | DR site manual start |
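The tier assignment rule can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool: the numeric targets are assumptions derived from the examples in this section (the table gives only "Minutes" and "Hours" for some cells), and the queue and queue manager names are hypothetical.

```python
from dataclasses import dataclass

# Assumed numeric targets per tier: (max RTO minutes, max RPO minutes).
# Tier 1 and Tier 3 follow the examples in the text; Tier 2's RPO of
# 5 minutes is an illustrative guess for the table's "Minutes".
TIER_TARGETS = {
    1: (5, 0),
    2: (30, 5),
    3: (240, 15),
}


@dataclass(frozen=True)
class QueueRequirement:
    name: str            # hypothetical queue name
    queue_manager: str   # hypothetical queue manager name
    rto_minutes: float   # longest outage the business accepts
    rpo_minutes: float   # largest message-loss window the business accepts


def required_tier(req: QueueRequirement) -> int:
    """Return the least strict tier whose targets still satisfy the requirement."""
    for tier in sorted(TIER_TARGETS, reverse=True):  # try 3, then 2, then 1
        tier_rto, tier_rpo = TIER_TARGETS[tier]
        if tier_rto <= req.rto_minutes and tier_rpo <= req.rpo_minutes:
            return tier
    raise ValueError(f"{req.name}: no tier meets RTO/RPO; needs a custom design")


def strictest_tier_per_queue_manager(reqs):
    """A shared queue manager inherits the strictest (lowest-numbered) tier it hosts."""
    result = {}
    for req in reqs:
        tier = required_tier(req)
        result[req.queue_manager] = min(tier, result.get(req.queue_manager, tier))
    return result
```

Running this over a queue catalog shows quickly where a single Tier 1 queue is dragging an otherwise Tier 3 queue manager into expensive infrastructure, which is the case for splitting queue managers.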
Multi-instance runs active and standby on shared storage; IBM MQ coordinates failover when the active host fails. Standards require shared storage HA, fencing, and network stability. Applications reconnect with client reconnect options or CCDT with multiple connection names. Failover still implies brief unavailability—size RTO accordingly. Test killing the active node in maintenance windows and measure reconnect time from representative clients.
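To turn a failover test into a number that can be compared with the RTO, one option is to have a representative client log a timestamped connect probe every second and then compute the longest outage window afterwards. The sketch below assumes such a log already exists as `(timestamp_seconds, connected)` pairs; how the probe itself connects to the queue manager is left out, and the function names are illustrative.

```python
def longest_outage_seconds(samples):
    """Longest gap between the last good probe before a failure and the
    first good probe after it.

    samples: iterable of (timestamp_seconds, connected) sorted by time.
    An outage that has not ended by the last sample is not counted, so
    keep probing until the client has reconnected.
    """
    longest = 0.0
    last_ok = None
    in_outage = False
    for timestamp, connected in samples:
        if connected:
            if in_outage and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            in_outage = False
        else:
            in_outage = True
    return longest


def meets_rto(samples, rto_seconds):
    """True if the measured worst-case outage fits inside the RTO."""
    return longest_outage_seconds(samples) <= rto_seconds
```

Measuring from the last successful probe rather than the first failed one is deliberately pessimistic: the failure could have happened at any point in between, and an RTO claim should survive the worst case.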
Replicated Data Queue Manager (RDQM) replicates queue manager data to a peer node using the replication technology supported for your platform. Understand how synchronous versus asynchronous replication affects RPO and put latency. Standards document which queues may live on RDQM and which require a z/OS QSG. Licensing and platform support matrices belong in the architecture decision record.
A queue sharing group (QSG) on z/OS places shared queues in coupling facility (CF) structures so that multiple queue manager members access the same queues. The queues survive a member failure as long as the CF and the remaining members are healthy. Standards cover structure sizing, CF recovery, and repository role. Not every workload belongs in the CF; deep queues with huge messages may stay private to one member, per capacity guidance.
HA runbook excerpt (customize):
1. Confirm incident scope: host, site, or CF structure
2. DISPLAY QSG / dspmq on all members
3. If active/passive: execute documented takeover script
4. Verify channels RESTART from DR CONNAME in CCDT
5. Validate depth drain on critical queues within RTO clock
DR often means starting queue managers at a secondary site and redirecting channels. Standards maintain current CONNAME lists in the CCDT, a DNS TTL low enough for the DR switch, and sequence number procedures after restore. Coordinating RESET CHANNEL with partners is mandatory after restoring from an inconsistent backup. XMITQ depth at failover time becomes catch-up workload, so size DR consumer capacity to drain it within the RTO.
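The catch-up workload lends itself to a quick back-of-the-envelope check: a backlog only drains at the rate by which consumers outpace new arrivals. The sketch below assumes steady average rates, which real traffic rarely has, so treat the result as a planning estimate; the function name and the figures in the usage note are illustrative.

```python
def catch_up_seconds(backlog_messages, consume_rate, arrival_rate):
    """Estimate time to drain a backlog at the DR site.

    backlog_messages: queue depth at failover time
    consume_rate:     messages per second the DR consumers can process
    arrival_rate:     messages per second still arriving during catch-up
    """
    if backlog_messages < 0:
        raise ValueError("backlog cannot be negative")
    net_drain_rate = consume_rate - arrival_rate
    if net_drain_rate <= 0:
        # Consumers never get ahead of new traffic, so the backlog never clears.
        raise ValueError("DR consumers cannot keep up with arrivals")
    return backlog_messages / net_drain_rate
```

For example, a backlog of 120,000 messages with consumers handling 500 messages per second against 300 per second of new arrivals drains in 600 seconds. If that exceeds the time left on the RTO clock after failover, the DR site needs more consumer capacity.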
Client applications must use connection names listing primary and alternate hosts, reconnect options, and heartbeat intervals appropriate for RTO. Hard-coding one IP violates HA standards. JMS and .NET connection factories need the same multi-host configuration tested quarterly.
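A standard like this is easier to enforce when configuration reviews can be automated. The sketch below checks a connection name list written in the comma-separated `host(port)` form and flags lists with fewer than two distinct hosts. The host names in the usage note are hypothetical, the port default of 1414 is the conventional MQ listener port, and the check says nothing about whether the alternate host is actually reachable.

```python
import re

DEFAULT_PORT = 1414  # conventional IBM MQ listener port

# One list entry: a host, optionally followed by a port in parentheses.
_ENTRY = re.compile(r"^\s*([^\s(),]+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_conname(conname):
    """Split 'host1(1414),host2(1414)' into [(host, port), ...]."""
    entries = []
    for part in conname.split(","):
        match = _ENTRY.match(part)
        if match is None:
            raise ValueError(f"unparseable connection name entry: {part!r}")
        host, port = match.group(1), match.group(2)
        entries.append((host, int(port) if port else DEFAULT_PORT))
    return entries


def ha_violations(conname):
    """Return a list of findings; an empty list means the check passed."""
    hosts = {host.lower() for host, _ in parse_conname(conname)}
    findings = []
    if len(hosts) < 2:
        findings.append("fewer than two distinct hosts")
    return findings
```

Calling `ha_violations("10.0.0.5(1414)")` reports the single-host finding, while `ha_violations("mqprod1.example.com(1414),mqdr1.example.com(1414)")` returns an empty list. A static check like this complements the quarterly failover test; it does not replace it.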
HA is not hoping fires never happen. It is a second fire station with trucks fueled, roads mapped, and firefighters who practiced the route—so when station one floods, town still gets water, within the minutes you promised.
If one toy store closes, another store opens quickly so kids still get toys, and we agree how many toys we might lose in the box during the move.
Assign RTO/RPO tiers to five queues in your catalog.
Walk through DR runbook in tabletop; list three gaps.
Measure client reconnect time after killing active multi-instance node in lab.
1. RTO measures:
2. RPO measures:
3. QSG on z/OS provides:
4. HA testing should be: