What is failover in IBM MQ?

Failover is the process of moving queue manager service from a failed active node to a standby node or alternate member so applications can reconnect and messaging resumes with the same logical queues and persistent data.

How long does MQ failover take?

Duration depends on detection time, log replay, storage mount, and client reconnect—seconds to minutes. Define recovery time objective with the business and measure in drills.

Do clients need new queue names after failover?

Usually no. The queue manager name stays the same in active/passive designs. Connection hostnames or ports in CCDT may change unless DNS or a floating address swings automatically.

What happens to in-flight transactions during failover?

Uncommitted work may back out or remain indoubt until the transaction manager resolves after restart. XA and syncpoint coordinators need procedures for indoubt UOW after failover.

How often should failover be tested?

At least quarterly for production-critical hubs, with documented results and application sign-off. Untested failover is a plan, not a guarantee.

MainframeMaster

Failover

Failover is the moment high availability proves it was not only a slide deck. The active queue manager stops responding—hardware fault, hung process, datacenter network partition, or planned maintenance that went wrong—and operations or automation promotes the standby, replays logs if needed, opens listeners, and waits for applications to reconnect. Durations measured in minutes can cost millions in card authorization or settlement windows. Failover quality depends on detection speed, storage accessibility, fencing of the old active, client libraries with reconnect, and runbooks operators trust at 3 a.m. This tutorial walks the failover timeline for multi-instance and replicated HA, lists failure modes, covers indoubt transactions, documents testing discipline, and ties failover to RTO and RPO language executives understand.

Failover Timeline

Failure occurs: crash, hang, or loss of heartbeat to partner.
Detection: cluster manager, standby instance monitoring the active instance's file locks, or operator alert.
Fencing: prevent the old active from writing shared or replicated data.
Promotion: standby starts the queue manager as active on shared/replicated storage.
Log recovery: replay to a consistent point; indoubt resolution.
Listeners and channels start; DISPLAY QMSTATUS shows running.
Applications reconnect; depth and traffic normalize.
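
The timeline above can be turned into numbers after a drill. The following is a minimal sketch, assuming you have collected one timestamp per phase from your own logs; the event names, timestamps, and the 120-second RTO are all hypothetical values chosen for illustration.

```python
from datetime import datetime

# Hypothetical drill timestamps, one per timeline step (replace with real log times).
events = [
    ("failure", datetime(2024, 1, 1, 2, 0, 0)),
    ("detection", datetime(2024, 1, 1, 2, 0, 20)),
    ("fencing", datetime(2024, 1, 1, 2, 0, 25)),
    ("promotion", datetime(2024, 1, 1, 2, 0, 40)),
    ("log_recovery", datetime(2024, 1, 1, 2, 1, 10)),
    ("listeners_up", datetime(2024, 1, 1, 2, 1, 20)),
    ("apps_reconnected", datetime(2024, 1, 1, 2, 1, 50)),
]


def phase_durations(events):
    """Return seconds elapsed between each consecutive pair of events."""
    return {
        f"{a}->{b}": (tb - ta).total_seconds()
        for (a, ta), (b, tb) in zip(events, events[1:])
    }


def total_outage(events):
    """Seconds from the first event (failure) to the last (apps reconnected)."""
    return (events[-1][1] - events[0][1]).total_seconds()


RTO_SECONDS = 120  # assumed objective, agreed with the business

durations = phase_durations(events)
outage = total_outage(events)
slowest = max(durations, key=durations.get)

print(durations)
print(f"outage={outage:.0f}s slowest_phase={slowest} within_rto={outage <= RTO_SECONDS}")
```

Breaking the outage down by phase shows where the time goes, which is usually more useful for runbook fixes than the total alone.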

Phases and owners
Phase	Typical owner	Risk if skipped
Detection	Monitoring / cluster	Long outage before anyone acts
Fencing	HA cluster / storage	Split brain corruption
Promotion	MQ ops / automation	Standby never starts
Recovery	Queue manager	Inconsistent queues
Client reconnect	Application teams	False “MQ down” while QM is up

Multi-Instance Failover

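The standby instance on the second host is started with strmqm -x and waits on the file locks held by the active instance. When the active instance fails and its locks are released, the standby acquires them and starts the queue manager using the same data path. Shared storage must mount cleanly; stale NFS handles are a classic failover failure. Before attempting a manual start on either node, operators verify which instance is active with a status command such as dspmq -x, so that the same queue manager is never started by hand on both nodes.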
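
As an illustration of checking instance status before a manual start, here is a small sketch that parses text in the general shape printed by dspmq -x. The sample output, host names, and helper names are assumptions for illustration; the exact layout can vary by release, so check it against your own systems before relying on the pattern.

```python
import re

# Illustrative text in the general shape of `dspmq -x` output (assumed, not captured).
SAMPLE = """\
QMNAME(QM1)                                               STATUS(Running)
    INSTANCE(hosta) MODE(Active)
    INSTANCE(hostb) MODE(Standby)
"""


def parse_instances(text):
    """Map each queue manager name to {host: mode} for its listed instances."""
    result = {}
    current = None
    for line in text.splitlines():
        qm = re.search(r"QMNAME\(([^)]*)\)", line)
        if qm:
            current = qm.group(1)
            result[current] = {}
            continue
        inst = re.search(r"INSTANCE\(([^)]*)\)\s+MODE\(([^)]*)\)", line)
        if inst and current is not None:
            result[current][inst.group(1)] = inst.group(2)
    return result


def active_host(instances, qmgr):
    """Return the single active host, or None if there is not exactly one."""
    actives = [h for h, m in instances.get(qmgr, {}).items() if m == "Active"]
    return actives[0] if len(actives) == 1 else None


instances = parse_instances(SAMPLE)
print(instances)
print("active:", active_host(instances, "QM1"))
```

Returning None when there is not exactly one active instance is deliberate: an ambiguous answer should stop an operator from starting anything by hand.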

* Post-failover checks (MQSC); exact commands and attributes vary by release
DISPLAY QMSTATUS ALL
DISPLAY LSSTATUS(*) ALL
DISPLAY CHSTATUS(*)
* Then verify application MQCONN success from each tier

RDQM and Native HA Failover

Replicated data queue managers (RDQM) elect a new primary from replicas using quorum. Native HA follows product-specific orchestration, often through Kubernetes operators. In both cases failover still implies a brief period of unavailability during election and client reconnect; it is not transparent to every application unless paired with automatic client reconnection and suitable network design.
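
The quorum idea can be shown with a few lines of arithmetic. This is a generic majority-rule sketch, not the product's actual election logic; the function names and the three-member group size are assumptions for illustration.

```python
def has_quorum(reachable: int, total: int) -> bool:
    """True when strictly more than half of the members can see each other."""
    return reachable > total // 2


def partition_outcomes(total: int, partition_sizes):
    """For each partition size, report whether that side may elect a primary."""
    return {size: has_quorum(size, total) for size in partition_sizes}


# A three-member group split 2 / 1 by a network partition.
outcomes = partition_outcomes(3, (2, 1))
print(outcomes)
```

With a strict majority rule, at most one side of any partition can hold quorum, which is what prevents two primaries from running at once.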

z/OS Member Bounce Versus Failover

Stopping one queue sharing group member is operational maintenance, not full-site failover. Shared queues remain on the coupling facility (CF), and applications connected to other members continue. A full sysplex or CF failure triggers larger DR playbooks; see the queue sharing group and DR tutorials.

Channels During Failover

Sender channels lose their TCP connection; partners retry according to the channel retry attributes (SHORTRTY, SHORTTMR, LONGRTY, LONGTMR). After failover, listeners and channels must be started on the new active instance. Remote partners that use a fixed IP may need firewall or DNS updates if the active host IP changed in non-floating designs.
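
To see how retry settings interact with failover duration, the sketch below estimates when a retrying sender channel would next attempt a connection after the outage ends. It is a simplified model: it assumes retries start at the moment of failure, ignores the time each connection attempt takes, and uses SHORTRTY(10), SHORTTMR(60), LONGRTY(999999999), and LONGTMR(1200) as assumed values. Confirm the actual values with DISPLAY CHANNEL on your own definitions.

```python
import math


def first_retry_after(outage_s, shortrty=10, shorttmr=60,
                      longrty=999_999_999, longtmr=1200):
    """Seconds from failure to the first retry at or after the outage ends.

    Short retries fire every `shorttmr` seconds, `shortrty` times; after
    that, long retries fire every `longtmr` seconds, `longrty` times.
    Returns None if all retries are exhausted before the outage ends.
    """
    short_window = shortrty * shorttmr
    if shortrty > 0 and outage_s <= short_window:
        n = max(1, math.ceil(outage_s / shorttmr))
        return n * shorttmr
    n = max(1, math.ceil((outage_s - short_window) / longtmr))
    if n > longrty:
        return None
    return short_window + n * longtmr


for outage in (90, 600, 700):
    print(f"outage {outage}s -> next attempt at {first_retry_after(outage)}s")
```

Under these assumed values, an outage that runs slightly past the short-retry window leaves the channel waiting for the next long retry, so partner recovery can lag well behind the queue manager itself unless someone starts the channel manually.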

Transactions and Indoubt UOW

Global transactions left prepared but not committed appear indoubt after restart. MQ and Db2 administrators resolve per coordinator procedures. Applications may see duplicate messages after rollback and redelivery—idempotency keys are mandatory for payment flows.
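
A minimal sketch of the idempotency idea follows, assuming each message carries a business key (here a hypothetical payment identifier) that stays the same on redelivery. The class and field names are illustrative, and the in-memory set stands in for what would need to be a durable store updated in the same unit of work as the business change.

```python
class IdempotentProcessor:
    """Apply each business key at most once, even if the message is redelivered."""

    def __init__(self):
        self._seen = set()   # stand-in for a durable store of processed keys
        self.applied = []    # stand-in for the real side effect (e.g. a ledger)

    def handle(self, key, amount):
        """Return True if the payment was applied, False if it was a duplicate."""
        if key in self._seen:
            return False
        self.applied.append((key, amount))
        self._seen.add(key)
        return True


processor = IdempotentProcessor()
first = processor.handle("pay-001", 100)
redelivered = processor.handle("pay-001", 100)  # same message after failover
print(first, redelivered, processor.applied)
```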

Failover Runbook Essentials

Escalation contacts for MQ, storage, network, and apps.
Commands to identify active instance and start standby safely.
CCDT or DNS change steps if floating name not used.
Verification scripts: test put/get on critical queues.
Communication template for business stakeholders.
Back-out procedure if failover makes things worse—fail back criteria.

Testing Failover

Planned tests kill the active node or disconnect storage in a controlled window. Measure detection-to-first-successful-MQPUT. Compare to RTO. Fix runbook gaps. Unplanned failover is the test you never wanted—post-incident review updates the same runbook.
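
One way to measure time to first successful put during a drill is a polling loop around a probe. The sketch below is hedged accordingly: `probe` is a hypothetical callable that you would implement with your own MQ client to attempt a test put, and the demo replaces it with a fake that succeeds on the third attempt, using a simulated clock so the example runs instantly.

```python
import time


def time_to_first_success(probe, timeout_s=600.0, interval_s=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True; return elapsed seconds or None on timeout."""
    start = clock()
    while True:
        try:
            ok = probe()
        except Exception:
            ok = False  # a failed connect or put counts as "still down"
        elapsed = clock() - start
        if ok:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)


# Demo with a simulated clock and a fake probe that succeeds on the third call.
_now = [0.0]
_calls = [0]


def fake_probe():
    _calls[0] += 1
    return _calls[0] >= 3


def fake_sleep(seconds):
    _now[0] += seconds


measured = time_to_first_success(fake_probe, timeout_s=10, interval_s=2,
                                 clock=lambda: _now[0], sleep=fake_sleep)
RTO_SECONDS = 120  # assumed objective
print(f"measured={measured}s within_rto={measured is not None and measured <= RTO_SECONDS}")
```

Start the loop at the moment the failure is injected or detected, depending on which interval your RTO definition covers, and record the result alongside the drill notes.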

Explainer: Relay Race Baton Pass

Failover is passing the baton when the first runner falls. The second runner must already be on the track and trained; the baton is the queue manager name and persistent data.

Explain Like I'm Five

If the main teacher is sick, the substitute teacher takes the class list and keeps teaching—the class name stays the same even though the person at the desk changed.

Practice Exercises

Exercise 1

Write a one-page failover runbook outline for a multi-instance hub.

Exercise 2

List indoubt transaction checks after unplanned failover with Db2 and CICS.

Exercise 3

Define pass/fail criteria for a quarterly failover drill.

Frequently Asked Questions

Test Your Knowledge

1. Failover moves service from:

Failed active to standby
DLQ to XMITQ
Topic to queue
JCL to PROC

2. After failover apps should:

Reconnect to same QM name
Rename all queues
Disable TLS
Delete logs

3. Log replay during failover:

Recovers persistent state
Deletes all messages
Only affects topics
Skips BSDS

4. Failover testing should be:

Scheduled and documented
Never done
Only in production unplanned
Only for channels
