Failover

Failover is the moment high availability proves it was not only a slide deck. The active queue manager stops responding—hardware fault, hung process, datacenter network partition, or planned maintenance that went wrong—and operations or automation promotes the standby, replays logs if needed, opens listeners, and waits for applications to reconnect. Durations measured in minutes can cost millions in card authorization or settlement windows. Failover quality depends on detection speed, storage accessibility, fencing of the old active, client libraries with reconnect, and runbooks operators trust at 3 a.m. This tutorial walks the failover timeline for multi-instance and replicated HA, lists failure modes, covers indoubt transactions, documents testing discipline, and ties failover to RTO and RPO language executives understand.

Failover Timeline

  1. Failure occurs: crash, hang, or loss of heartbeat to partner.
  2. Detection: cluster manager, standby instance noticing the released file locks, or operator alert.
  3. Fencing: prevent the old active from writing shared or replicated data.
  4. Promotion: standby starts the queue manager as active on shared or replicated storage.
  5. Log recovery: replay to a consistent point; indoubt resolution.
  6. Listeners and channels start; DISPLAY QMSTATUS shows running.
  7. Applications reconnect; depth and traffic normalize.

Phases and owners

Phase            | Typical owner        | Risk if skipped
Detection        | Monitoring / cluster | Long outage before anyone acts
Fencing          | HA cluster / storage | Split-brain corruption
Promotion        | MQ ops / automation  | Standby never starts
Recovery         | Queue manager        | Inconsistent queues
Client reconnect | Application teams    | False “MQ down” while QM is up
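
The client reconnect phase is where application teams most often need code of their own. The following is a minimal sketch of a retry loop with capped exponential backoff and jitter, assuming a hypothetical connect() callable that raises ConnectionError while the queue manager is unavailable. IBM MQ clients also offer automatic client reconnection, which may make a hand-written loop unnecessary; the function and parameter names here are illustrative only.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def reconnect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:  # assumed error type for this sketch
            last_error = exc
            # Exponential backoff capped at max_delay, with jitter so that
            # many clients do not hit the new active instance at once.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

Passing sleep as a parameter keeps the loop testable without real waiting; in production the default time.sleep applies.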

Multi-Instance Failover

The standby instance on the second host, started with strmqm -x, keeps trying to acquire the file locks on the shared queue manager data. When the active instance's locks are released, the standby acquires them and starts the queue manager using the same data path. Shared NFS or SAN storage must mount cleanly; stale NFS handles are a classic failover failure. Operators verify which instance is active, for example with dspmq -x, before attempting a manual start on either node.
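
Checking which instance is active can be scripted. The sketch below parses text in the shape that dspmq -x typically prints (a QMNAME line followed by INSTANCE/MODE lines). The sample text, host names, and helper names are assumptions for illustration; check the exact output of your release before relying on the patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical sample modelled on typical `dspmq -x` output.
SAMPLE = """\
QMNAME(QM1)                    STATUS(Running)
    INSTANCE(hosta) MODE(Active)
    INSTANCE(hostb) MODE(Standby)
"""


def parse_instances(dspmq_output: str) -> Dict[str, Dict[str, str]]:
    """Map each queue manager name to {host: mode}."""
    result: Dict[str, Dict[str, str]] = {}
    current = None
    for line in dspmq_output.splitlines():
        qm = re.search(r"QMNAME\(([^)]*)\)", line)
        if qm:
            current = qm.group(1)
            result[current] = {}
            continue
        inst = re.search(r"INSTANCE\(([^)]*)\)\s+MODE\(([^)]*)\)", line)
        if inst and current is not None:
            result[current][inst.group(1)] = inst.group(2)
    return result


def active_host(dspmq_output: str, qmname: str) -> Optional[str]:
    """Return the single active host, or None if there is not exactly one."""
    instances = parse_instances(dspmq_output).get(qmname, {})
    actives = [host for host, mode in instances.items() if mode == "Active"]
    return actives[0] if len(actives) == 1 else None
```

Returning None unless exactly one instance reports Active is deliberate: a runbook step should stop and escalate rather than guess when the answer is ambiguous.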

    * Post-failover checks; commands vary by release
    DISPLAY QMSTATUS ALL
    DISPLAY LSSTATUS(*) ALL
    DISPLAY CHSTATUS(*) ALL
    * Then verify application MQCONN success from each tier

RDQM and Native HA Failover

Replicated data queue managers elect a new primary from replicas using quorum. Native HA follows product-specific orchestration, often driven by Kubernetes operators. Failover still implies brief unavailability during election and client reconnect; it is not transparent to applications unless they use automatic client reconnection and the network design routes them to the new active instance.
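
The quorum idea can be illustrated with the general majority rule. This is a sketch of that rule only, not of any product's election algorithm; the function name is hypothetical.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """True if this node, counting itself, can see a strict majority."""
    if total_nodes <= 0 or not 0 <= reachable_nodes <= total_nodes:
        raise ValueError("invalid node counts")
    return reachable_nodes > total_nodes // 2
```

With three nodes, a partition of two keeps quorum and a lone node does not, which is why an isolated old primary should not keep accepting work.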

z/OS Member Bounce Versus Failover

Stopping one queue sharing group member is operational maintenance, not full-site failover. Shared queues remain in the coupling facility (CF), and applications connected to other members continue. Full sysplex or CF failure triggers larger DR playbooks; see the queue sharing group and DR tutorials.

Channels During Failover

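Sender channels lose their TCP connection; partners detect the loss through heartbeat or keepalive settings (HBINT, KAINT) and retry according to the channel retry attributes (SHORTRTY, SHORTTMR, LONGRTY, LONGTMR). DISCINT controls when an idle channel disconnects, not how it retries. After failover, the channel initiator on the new active instance must restart listeners and channels. Remote partners that connect to a fixed IP address may need firewall or DNS updates if the active host's IP changed, as it does in designs without a floating address.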
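
How long a partner takes to reconnect depends on its retry settings. The sketch below models a sender channel's retry schedule from the short and long retry attributes (SHORTRTY, SHORTTMR, LONGRTY, LONGTMR). The default values shown and the evenly spaced timing model are simplifying assumptions; check them against your own channel definitions.

```python
from typing import Iterator, Optional


def retry_offsets(
    shortrty: int = 10,
    shorttmr: int = 60,
    longrty: int = 999_999_999,
    longtmr: int = 1200,
) -> Iterator[int]:
    """Yield seconds after the first failure at which each retry is attempted."""
    elapsed = 0
    for _ in range(shortrty):  # short retries come first
        elapsed += shorttmr
        yield elapsed
    for _ in range(longrty):  # then the slower long retries
        elapsed += longtmr
        yield elapsed


def first_retry_after(outage_seconds: int, **attrs: int) -> Optional[int]:
    """Return the offset of the first retry at or after the outage ends."""
    for offset in retry_offsets(**attrs):
        if offset >= outage_seconds:
            return offset
    return None
```

Under these assumed values, first_retry_after(700) returns 1800: an outage that outlasts the short retries can leave a channel waiting for the next long retry unless an operator starts it manually.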

Transactions and Indoubt UOW

Global transactions left prepared but not committed appear indoubt after restart. MQ and Db2 administrators resolve them by following the transaction coordinator's procedures, because the coordinator holds the commit-or-backout decision. Applications may see duplicate messages after rollback and redelivery, so idempotency keys are mandatory for payment flows.
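
An idempotency key lets a consumer recognise a redelivered message and skip it. The following is a minimal in-memory sketch; the class name and the choice of key are hypothetical, and a real payment flow would keep the keys in a durable store updated in the same transaction as the business effect.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Apply a handler at most once per idempotency key."""

    def __init__(self, handler: Callable[[bytes], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # in-memory only for this sketch

    def process(self, key: str, payload: bytes) -> bool:
        """Return True if handled, False if skipped as a duplicate."""
        if key in self._seen:
            return False
        self._handler(payload)
        # Record the key only after the handler succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen.add(key)
        return True
```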

Failover Runbook Essentials

  • Escalation contacts for MQ, storage, network, and apps.
  • Commands to identify active instance and start standby safely.
  • CCDT or DNS change steps if floating name not used.
  • Verification scripts: test put/get on critical queues.
  • Communication template for business stakeholders.
  • Back-out procedure if failover makes things worse—fail back criteria.
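
The put/get verification step can be sketched as a small round-trip check. The put and get callables are hypothetical wrappers around whatever client library the site uses, and the queue names in any real list are site-specific. On a live queue a plain get may remove a business message, so a dedicated probe queue or a get by correlation ID is the safer design.

```python
import uuid
from typing import Callable, Dict, Iterable


def verify_queues(
    queues: Iterable[str],
    put: Callable[[str, bytes], None],
    get: Callable[[str], bytes],
) -> Dict[str, bool]:
    """Round-trip a uniquely tagged probe message through each queue."""
    results: Dict[str, bool] = {}
    for name in queues:
        probe = f"failover-probe-{uuid.uuid4()}".encode()
        try:
            put(name, probe)
            results[name] = get(name) == probe
        except Exception:
            # Any failure to put or get counts as a failed check.
            results[name] = False
    return results
```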

Testing Failover

Planned tests kill the active node or disconnect storage in a controlled window. Measure detection-to-first-successful-MQPUT. Compare to RTO. Fix runbook gaps. Unplanned failover is the test you never wanted—post-incident review updates the same runbook.
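
The measurement can be reduced to two timestamps and a target. This is a minimal sketch, assuming the drill records when the failure was detected and when the first MQPUT succeeded; the function and field names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Union


def drill_report(
    detected_at: datetime,
    first_put_ok_at: datetime,
    rto: timedelta,
) -> Dict[str, Union[float, bool]]:
    """Compare detection-to-first-successful-MQPUT time against the RTO."""
    outage = first_put_ok_at - detected_at
    if outage < timedelta(0):
        raise ValueError("first successful put precedes detection")
    return {
        "outage_seconds": outage.total_seconds(),
        "rto_seconds": rto.total_seconds(),
        "passed": outage <= rto,
    }
```

Keeping the report as plain numbers makes it easy to track drill results from quarter to quarter.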

Explainer: Relay Race Baton Pass

Failover is passing the baton when the first runner falls. The second runner must already be on the track and trained; the baton is the queue manager name and persistent data.

Explain Like I'm Five

If the main teacher is sick, the substitute teacher takes the class list and keeps teaching—the class name stays the same even though the person at the desk changed.

Practice Exercises

Exercise 1

Write a one-page failover runbook outline for a multi-instance hub.

Exercise 2

List indoubt transaction checks after unplanned failover with Db2 and CICS.

Exercise 3

Define pass/fail criteria for a quarterly failover drill.

Frequently Asked Questions

Test Your Knowledge

1. Failover moves service from:

  • Failed active to standby
  • DLQ to XMITQ
  • Topic to queue
  • JCL to PROC

2. After failover apps should:

  • Reconnect to same QM name
  • Rename all queues
  • Disable TLS
  • Delete logs

3. Log replay during failover:

  • Recovers persistent state
  • Deletes all messages
  • Only affects topics
  • Skips BSDS

4. Failover testing should be:

  • Scheduled and documented
  • Never done
  • Only in production unplanned
  • Only for channels

Read time: 26 min
Author: MainframeMaster
Verified: IBM MQ 9.3 documentation