Failover is the moment high availability proves it was not only a slide deck. The active queue manager stops responding—hardware fault, hung process, datacenter network partition, or planned maintenance that went wrong—and operations or automation promotes the standby, replays logs if needed, opens listeners, and waits for applications to reconnect. Durations measured in minutes can cost millions in card authorization or settlement windows. Failover quality depends on detection speed, storage accessibility, fencing of the old active, client libraries with reconnect, and runbooks operators trust at 3 a.m. This tutorial walks the failover timeline for multi-instance and replicated HA, lists failure modes, covers indoubt transactions, documents testing discipline, and ties failover to RTO and RPO language executives understand.
| Phase | Typical owner | Risk if skipped |
|---|---|---|
| Detection | Monitoring / cluster | Long outage before anyone acts |
| Fencing | HA cluster / storage | Split brain corruption |
| Promotion | MQ ops / automation | Standby never starts |
| Recovery | Queue manager | Inconsistent queues |
| Client reconnect | Application teams | False “MQ down” while QM is up |
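
The client reconnect phase in the table can be sketched from the application side as a retry loop with exponential backoff and jitter. This is a minimal illustration, not the behaviour of any MQ client library: `connect` is a hypothetical callable standing in for whatever the application uses to open its connection, and `ConnectionError` is an assumed failure type.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, backing off between attempts.

    connect is a hypothetical callable supplied by the application;
    it is assumed to raise ConnectionError while the standby is still
    being promoted.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # Exponential ceiling, capped, with full jitter so that many
            # clients do not all reconnect in the same instant.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The jitter matters during failover: hundreds of clients that retry on the same fixed schedule can overload a queue manager that has only just opened its listeners.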
The standby instance on the second host, started with strmqm -x, detects the loss of the active instance, acquires the storage locks, and starts the queue manager using the same data path. Shared NFS or SAN storage must mount cleanly; stale NFS handles are a classic cause of failed failover. Operators verify which instance is active with status commands (for example, dspmq -x) before attempting a manual start on either node.
```
/* Post-failover checks; commands vary by release
DISPLAY QMSTATUS
DISPLAY LSSTATUS
DISPLAY CHSTATUS summary
Verify application MQCONN success from each tier */
```
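
Checking which instance is active before a manual start can be scripted. The sketch below assumes the status text comes from `dspmq -x` and that it follows the layout in `SAMPLE_OUTPUT`; both the sample and the host names are illustrative, and the real layout may differ by release, so treat the regular expression as an assumption to verify against your own output.

```python
import re
from typing import List

# Illustrative sample only: assumed layout of multi-instance status output.
SAMPLE_OUTPUT = """\
QMNAME(QM1) STATUS(Running)
    INSTANCE(hosta) MODE(Active)
    INSTANCE(hostb) MODE(Standby)
"""

_INSTANCE = re.compile(r"INSTANCE\((?P<host>[^)]+)\)\s+MODE\((?P<mode>[^)]+)\)")


def instances_by_mode(output: str, mode: str) -> List[str]:
    """Return the hosts whose MODE(...) matches mode, case-insensitively."""
    return [
        m.group("host")
        for m in _INSTANCE.finditer(output)
        if m.group("mode").lower() == mode.lower()
    ]


def single_active_host(output: str) -> str:
    """Return the one active host, or raise if there are zero or several."""
    active = instances_by_mode(output, "Active")
    if len(active) != 1:
        raise RuntimeError(f"expected exactly one active instance, found {active}")
    return active[0]


if __name__ == "__main__":
    print(single_active_host(SAMPLE_OUTPUT))
```

Raising on zero or multiple active instances is deliberate: either result means an operator should stop and investigate rather than start another instance.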
Replicated data queue managers (RDQM) elect a new primary from the replicas using quorum. Native HA follows product-specific orchestration, often through Kubernetes operators. In both cases failover still implies brief unavailability during election and client reconnect; it is not zero-second and transparent to every application unless paired with advanced client and network design.
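
The quorum idea can be illustrated with a generic strict-majority rule. This is not the product's election logic, which is not described here; it is only a sketch of why a three-node group tolerates one failure and why an isolated node must not promote itself.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """True when a strict majority of the group can see each other.

    reachable_nodes counts the local node itself plus every peer it
    can currently reach.
    """
    if total_nodes <= 0 or not 0 <= reachable_nodes <= total_nodes:
        raise ValueError("reachable_nodes must be between 0 and total_nodes")
    return reachable_nodes > total_nodes // 2


if __name__ == "__main__":
    for reachable in range(4):
        print(reachable, "of 3 reachable ->", has_quorum(reachable, 3))
```

With three nodes, two can still elect a primary, while a single partitioned node cannot; that asymmetry is what prevents two primaries after a network split.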
Stopping one queue sharing group member is operational maintenance, not a full-site failover. Shared queues remain in the coupling facility (CF), and applications connected to other members continue. A full sysplex or CF failure triggers larger DR playbooks; see the queue sharing group and DR tutorials.
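When the active instance fails, sender channels lose their TCP connections, and partners retry according to the channel retry attributes (SHORTRTY, SHORTTMR, LONGRTY, LONGTMR); DISCINT only controls when an idle channel disconnects. After failover, the channel initiator on the new active instance must restart listeners and channels. In non-floating designs where the active host IP changes, remote partners that use a fixed IP may need firewall or DNS updates.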
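
How long a partner channel stays down after failover depends on where it is in its retry cycle. The sketch below is a simplified model of the short and long retry attributes (SHORTRTY, SHORTTMR, LONGRTY, LONGTMR): it assumes each attempt is instantaneous and that the first attempt happens one short interval after the failure. The default values shown are the ones commonly documented for these attributes; check your own channel definitions before relying on them.

```python
from itertools import islice
from typing import Iterator, Optional


def retry_offsets(
    shortrty: int = 10,
    shorttmr: int = 60,
    longrty: int = 999_999_999,
    longtmr: int = 1200,
) -> Iterator[int]:
    """Yield the seconds after the failure at which each retry would start."""
    elapsed = 0
    for _ in range(shortrty):      # short retries: frequent, limited count
        elapsed += shorttmr
        yield elapsed
    for _ in range(longrty):       # long retries: sparse, effectively unbounded
        elapsed += longtmr
        yield elapsed


def first_retry_at_or_after(recovery_seconds: int, **attrs: int) -> Optional[int]:
    """Return the first retry offset that falls at or after recovery."""
    for offset in retry_offsets(**attrs):
        if offset >= recovery_seconds:
            return offset
    return None


if __name__ == "__main__":
    print(list(islice(retry_offsets(), 3)))
    print(first_retry_at_or_after(90))
    print(first_retry_at_or_after(700))
```

Under these assumptions, a failover that completes in 90 seconds is picked up by the retry at 120 seconds, but one that takes 700 seconds misses the short retries and waits for the first long retry at 1800 seconds, unless an operator restarts the channel manually.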
Global transactions that were prepared but not committed appear indoubt after restart. MQ and Db2 administrators resolve them according to the transaction coordinator's procedures. Applications may see duplicate messages after rollback and redelivery, so idempotency keys are mandatory for payment flows.
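
The idempotency-key requirement can be sketched as a consumer that records each key it has applied and ignores redeliveries. `PaymentLedger` and its field names are hypothetical, and the in-memory dictionary is a stand-in: a real payment flow would store the key durably, in the same unit of work as the business update.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PaymentLedger:
    """Hypothetical in-memory ledger that applies each payment at most once."""

    applied: Dict[str, int] = field(default_factory=dict)
    balance_cents: int = 0

    def apply(self, idempotency_key: str, amount_cents: int) -> bool:
        """Apply the payment; return False if this key was already applied."""
        if idempotency_key in self.applied:
            return False  # redelivered after rollback: safe to ignore
        self.applied[idempotency_key] = amount_cents
        self.balance_cents += amount_cents
        return True


if __name__ == "__main__":
    ledger = PaymentLedger()
    print(ledger.apply("auth-0001", 2500))  # first delivery
    print(ledger.apply("auth-0001", 2500))  # duplicate after failover
    print(ledger.balance_cents)
```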
Planned tests kill the active node or disconnect storage in a controlled window. Measure detection-to-first-successful-MQPUT. Compare to RTO. Fix runbook gaps. Unplanned failover is the test you never wanted—post-incident review updates the same runbook.
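
A drill measurement can be recorded as three timestamps and compared with the RTO. The class and field names below are hypothetical. As an assumption, `meets_rto` compares the full outage, from fault injection to the first successful MQPUT, which is stricter than the detection-to-first-MQPUT interval alone; both intervals are exposed so either can be reported.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    """Timestamps captured during one planned failover drill."""

    fault_injected: datetime
    fault_detected: datetime
    first_successful_put: datetime

    @property
    def detection_time(self) -> timedelta:
        return self.fault_detected - self.fault_injected

    @property
    def detection_to_first_put(self) -> timedelta:
        return self.first_successful_put - self.fault_detected

    @property
    def total_outage(self) -> timedelta:
        return self.first_successful_put - self.fault_injected

    def meets_rto(self, rto: timedelta) -> bool:
        """Pass when the whole outage fits inside the recovery time objective."""
        return self.total_outage <= rto


if __name__ == "__main__":
    drill = DrillResult(
        fault_injected=datetime(2024, 1, 1, 2, 0, 0),
        fault_detected=datetime(2024, 1, 1, 2, 0, 20),
        first_successful_put=datetime(2024, 1, 1, 2, 2, 5),
    )
    print(drill.detection_time, drill.detection_to_first_put, drill.total_outage)
    print(drill.meets_rto(timedelta(minutes=3)))
```

Keeping detection time separate from recovery time shows which phase to fix: slow detection points at monitoring, while slow recovery points at storage, promotion, or client reconnect.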
Failover is passing the baton when the first runner falls. The second runner must already be on the track and trained; the baton is the queue manager name and persistent data.
If the main teacher is sick, the substitute teacher takes the class list and keeps teaching—the class name stays the same even though the person at the desk changed.
Write a one-page failover runbook outline for a multi-instance hub.
List indoubt transaction checks after unplanned failover with Db2 and CICS.
Define pass/fail criteria for a quarterly failover drill.
1. Failover moves service from:
2. After failover apps should:
3. Log replay during failover:
4. Failover testing should be: