MQ HA Standards

High availability standards for IBM MQ answer two questions executives ask during an outage: how long until we are back, and did we lose messages? Without a written RTO (recovery time objective) and RPO (recovery point objective) per tier, architects default to best effort, and operations discovers that shared storage was never mounted on the standby during the only failover that mattered. HA is not one product checkbox: multi-instance queue managers, RDQM, active/passive clusters, queue sharing groups, and cross-site channels each trade cost, complexity, and data loss risk differently. This tutorial defines enterprise HA standards that beginners can apply: classify workloads, pick patterns, document failover runbooks, test on schedule, and align CCDT and DNS with the real recovery path, not the happy-path diagram from year one.

RTO and RPO Tiers

Tier 1 payments might require an RTO under five minutes and an RPO of zero, met with synchronous replication or shared disk plus automated failover. Tier 3 reporting queues might accept an RTO of four hours and an RPO of fifteen minutes. Standards assign each queue or application to a tier. Infrastructure then implements the strictest tier required by any queue on a shared queue manager, or splits workloads across separate queue managers so that one demanding application does not force over-provisioning for everything else.

Example HA tiers (customize per enterprise)
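| Tier            | RTO target   | RPO target | Typical pattern                   |
|-----------------|--------------|------------|-----------------------------------|
| Tier 1 critical | Under 5 min  | Zero       | RDQM, QSG, or sync replication    |
| Tier 2 business | Under 30 min | Minutes    | Multi-instance, async replication |
| Tier 3 batch    | Hours        | Hours      | DR site manual start              |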
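The tier assignment can be sketched in a few lines of Python. This is a minimal illustration, not a catalog tool: the queue names, queue manager names, and the `CATALOG` structure are hypothetical, and tier 1 is assumed to be the strictest tier.

```python
# Hypothetical catalog entries: (queue name, queue manager, tier).
# Tier 1 is assumed to be the strictest tier.
CATALOG = [
    ("PAYMENTS.IN", "QM1", 1),
    ("ORDERS.IN", "QM1", 2),
    ("REPORTS.NIGHTLY", "QM2", 3),
    ("INVOICES.OUT", "QM2", 2),
]


def required_tier_per_qmgr(catalog):
    """Return the strictest (lowest-numbered) tier hosted on each queue manager."""
    strictest = {}
    for _queue, qmgr, tier in catalog:
        strictest[qmgr] = min(tier, strictest.get(qmgr, tier))
    return strictest


def split_candidates(catalog):
    """List queues whose own tier is looser than the tier their queue manager must meet."""
    required = required_tier_per_qmgr(catalog)
    return [(q, qm, t) for q, qm, t in catalog if t > required[qm]]


if __name__ == "__main__":
    print(required_tier_per_qmgr(CATALOG))
    print(split_candidates(CATALOG))
```

Queues returned by `split_candidates` are the ones paying for more availability than they need; they are candidates for a move to a separate, cheaper queue manager.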

Multi-Instance Queue Managers

A multi-instance queue manager runs an active and a standby instance on separate servers against shared storage; IBM MQ coordinates failover when the active host fails. Standards require highly available shared storage, fencing, and network stability. Applications reconnect using client reconnect options or a CCDT with multiple connection names. Failover still implies brief unavailability, so size the RTO accordingly. Test by killing the active node in maintenance windows and measuring reconnect time from representative clients.
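One way to measure reconnect time from a client's point of view is to poll a connection attempt and record how long it takes to succeed. The sketch below is deliberately library-neutral: `try_connect` is a hypothetical callable that you supply (for example, a wrapper around whatever MQ client library your application uses) and that returns True on a successful connection.

```python
import time


def measure_reconnect(try_connect, timeout_s=300.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll try_connect() until it succeeds; return elapsed seconds, or None on timeout.

    try_connect is a caller-supplied callable (an assumption of this sketch)
    that makes one connection attempt and returns True or False.
    clock and sleep are injectable so the function can be exercised without waiting.
    """
    start = clock()
    while True:
        if try_connect():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Start the measurement at the moment the active node is killed, and repeat it from several representative clients; the slowest result, not the average, is the number to compare against the RTO.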

RDQM Standards

A replicated data queue manager (RDQM) replicates queue manager data between Linux nodes. In its HA configuration, a group of three nodes replicates synchronously; in its DR configuration, data is replicated to a recovery node either synchronously or asynchronously. Understand how synchronous versus asynchronous replication affects RPO and MQPUT latency. Standards document which queues may live on RDQM and which require a z/OS QSG. Licensing and platform support matrices belong in the architecture decision record.
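The RPO and latency trade-off can be made concrete with back-of-the-envelope arithmetic. The sketch below uses a deliberately simple model, and both formulas are assumptions for illustration: asynchronous replication puts at risk roughly the messages written during the replication lag, and synchronous replication adds roughly one network round trip to each persistent write.

```python
def messages_at_risk(replication_lag_s, persistent_msgs_per_s):
    """Rough upper bound on messages not yet replicated if the primary is lost.

    Simplified model (assumption): everything written during the lag window is exposed.
    """
    return replication_lag_s * persistent_msgs_per_s


def sync_put_latency_ms(local_write_ms, round_trip_ms):
    """Rough per-message latency when each persistent write waits for the remote copy.

    Simplified model (assumption): local write time plus one network round trip.
    """
    return local_write_ms + round_trip_ms


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(messages_at_risk(2.0, 500))      # async: 2 s lag at 500 msg/s
    print(sync_put_latency_ms(1.5, 4.0))   # sync: 1.5 ms write + 4 ms round trip
```

Replace the illustrative inputs with measured lag, message rate, and round-trip time from your own environment before putting any number into a standard.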

z/OS Queue Sharing Groups

A QSG places shared queues in coupling facility (CF) structures so that multiple queue manager members access the same queues, which lets the workload survive a member failure as long as the CF and the remaining members are healthy. Standards cover structure sizing, CF recovery, and the role of the Db2 shared repository. Not every workload belongs in the CF; deep queues with very large messages may stay private to one member, per capacity guidance.

HA runbook excerpt (customize):

  1. Confirm incident scope: host, site, or CF structure.
  2. Run DISPLAY GROUP on z/OS QSG members, or dspmq on distributed hosts, for all members.
  3. If active/passive: execute the documented takeover script.
  4. Verify that channels restart using the DR CONNAME in the CCDT.
  5. Validate depth drain on critical queues within the RTO clock.

Cross-Site and DR Channels

DR often means starting queue managers at a secondary site and redirecting channels. Standards maintain current CONNAME lists in the CCDT, keep DNS TTLs low enough for a DR switch, and define sequence number procedures after a restore. If a queue manager is restored from a backup that is inconsistent with its partners, channel sequence numbers no longer match, so RESET CHANNEL must be coordinated with each partner. XMITQ depth at failover time becomes catch-up workload, so size DR consumer capacity to drain it.
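The catch-up workload lends itself to a quick estimate: a backlog drains only as fast as consumers outpace new arrivals. The function below is a minimal sketch of that arithmetic; the rates and backlog figures are hypothetical inputs you would take from your own monitoring.

```python
import math


def drain_time_s(backlog_msgs, consume_rate, arrival_rate):
    """Seconds needed to clear a backlog, given rates in messages per second.

    Returns math.inf when consumers cannot outpace arrivals, which means the
    DR site is under-sized for catch-up.
    """
    net_rate = consume_rate - arrival_rate
    if net_rate <= 0:
        return math.inf
    return backlog_msgs / net_rate


if __name__ == "__main__":
    # Illustrative numbers only: 120,000 queued messages,
    # DR consumers handle 500 msg/s while 300 msg/s keep arriving.
    print(drain_time_s(120_000, 500, 300))
```

Compare the result with the tier's RTO: if the drain time alone exceeds the RTO, the DR consumers need more capacity regardless of how fast the queue managers start.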

Client and CCDT HA Standards

Client applications must use connection names listing primary and alternate hosts, reconnect options, and heartbeat intervals appropriate for RTO. Hard-coding one IP violates HA standards. JMS and .NET connection factories need the same multi-host configuration tested quarterly.
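A standard like this is easier to enforce when it can be checked automatically. The sketch below parses a connection name list in the usual `host(port),host(port)` form and flags lists with fewer than two distinct hosts. The function names and the example host names are hypothetical; 1414 is used as the default port when an entry omits one.

```python
import re

# One entry: a host name or address, optionally followed by "(port)".
_ENTRY = re.compile(r"^\s*([^\s(),]+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_conname(conname, default_port=1414):
    """Split 'host1(1414),host2(1414)' into a list of (host, port) tuples."""
    entries = []
    for part in conname.split(","):
        match = _ENTRY.match(part)
        if not match:
            raise ValueError(f"unparseable connection name entry: {part!r}")
        entries.append((match.group(1).lower(), int(match.group(2) or default_port)))
    return entries


def violations(conname):
    """Return a list of HA-standard violations for one connection name list."""
    hosts = {host for host, _port in parse_conname(conname)}
    problems = []
    if len(hosts) < 2:
        problems.append("fewer than two distinct hosts")
    return problems


if __name__ == "__main__":
    print(violations("10.0.0.5(1414)"))
    print(violations("mqa.example.com(1414),mqb.example.com(1414)"))
```

A check like this can run against exported CCDT or connection factory settings in a build pipeline, so a single-host configuration is caught before the quarterly test rather than during it.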

HA Testing Program

  1. Tabletop review of runbook yearly.
  2. Non-prod failover test each quarter.
  3. Production controlled failover for Tier 1 annually if policy allows.
  4. Measure actual RTO and RPO; update standards if missed.
  5. Fix CCDT and automation gaps found in tests.
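Step 4 of the program can be sketched as a comparison of measured results against tier targets. The `TARGETS` values below are assumptions for illustration: tier 1 follows the example table (under 5 minutes, zero loss), while the 5-minute RPO for tier 2 is a placeholder for the table's "Minutes". The `FailoverResult` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverResult:
    tier: int
    measured_rto_s: float   # seconds from failure to restored service
    measured_rpo_s: float   # seconds of data lost, 0 if none


# Assumed targets in seconds: (RTO must be under, RPO must not exceed).
TARGETS = {
    1: (300, 0),      # under 5 min, zero loss
    2: (1800, 300),   # under 30 min; 5 min RPO is a placeholder for "Minutes"
}


def missed_targets(result, targets=TARGETS):
    """Return which objectives a failover test missed: 'RTO', 'RPO', both, or none."""
    rto_target, rpo_target = targets[result.tier]
    missed = []
    if result.measured_rto_s >= rto_target:   # target is "under", so equal counts as a miss
        missed.append("RTO")
    if result.measured_rpo_s > rpo_target:
        missed.append("RPO")
    return missed


if __name__ == "__main__":
    print(missed_targets(FailoverResult(tier=1, measured_rto_s=420, measured_rpo_s=0)))
```

Any non-empty result is a trigger for step 4: either fix the implementation or revise the documented target, but do not leave the two in disagreement.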

Explainer: Spare Fire Station

HA is not hoping fires never happen. It is a second fire station with trucks fueled, roads mapped, and firefighters who practiced the route—so when station one floods, town still gets water, within the minutes you promised.

Explain Like I'm Five: HA Standards

If one toy store closes, another store opens quickly so kids still get toys, and we agree how many toys we might lose in the box during the move.

Practice Exercises

Exercise 1

Assign RTO/RPO tiers to five queues in your catalog.

Exercise 2

Walk through the DR runbook in a tabletop review; list three gaps.

Exercise 3

Measure client reconnect time after killing the active multi-instance node in a lab.

Frequently Asked Questions

Test Your Knowledge

1. RTO measures:

  • Time to restore service
  • Message size
  • Cipher strength
  • Queue name length

2. RPO measures:

  • Acceptable data loss window
  • Channel color
  • TLS version
  • JCL lines

3. QSG on z/OS provides:

  • Shared queues across members
  • Only FTP
  • Kafka topics
  • COBOL compile

4. HA testing should be:

  • Scheduled and documented
  • Never performed
  • Only after outage
  • Optional forever
Published
Read time: 25 min
Author: MainframeMaster
Verified: IBM MQ 9.4 HA documentation