Rolling upgrades let large estates install IBM MQ 9.4 without a single global Sunday blackout: upgrade the standby multi-instance node first, fail over, then upgrade the former active node; roll Kubernetes pods one by one; or take queue sharing group (QSG) members through maintenance in a sequence that keeps shared queues available. The phrase sounds like magic zero downtime, but the real result depends on application reconnect behavior, transaction length, channel retry settings, and whether you actually built HA or merely installed two sets of binaries. A standalone queue manager on one VM cannot roll in the HA sense: you run endmqm and accept the outage. This tutorial covers rolling upgrade prerequisites, the multi-instance step sequence at a conceptual level, Kubernetes operator rolling updates, client and channel behavior during failover, a QSG overview, testing failover before the upgrade weekend, and common failures when the standby was never validated.
Exact commands differ by release and platform, so never run a failover in production without checking the IBM documentation for your version. Shared storage must mount cleanly on both nodes; split-brain prevention is built into the product, but storage faults can still cause outages.

    # Illustrative checks: verify IBM doc for your release
    dspmq -m QM1 -x
    DISPLAY QMSTATUS ALL
    * After failover test in lab:
    * - Application reconnect count
    * - Channel RETRY events in AMQERR

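One way to turn "was the standby ever validated" into a mechanical check is to parse the `dspmq -x` listing before starting the roll. The sketch below assumes the listing prints one `INSTANCE(...) MODE(...)` line per instance; that layout and the helper names `has_standby` and `count_instances` are illustrative assumptions, so compare them against real output from your release first.

```shell
#!/bin/sh
# Sketch only: the INSTANCE(...) MODE(Active|Standby) layout is an assumption
# about dspmq -x output; confirm it on your own release before relying on it.

# has_standby reads dspmq -x style text on stdin.
# Exit status 0 means at least one line reports MODE(Standby).
has_standby() {
    grep -q 'MODE(Standby)'
}

# count_instances prints how many INSTANCE(...) lines appear on stdin.
count_instances() {
    grep -c 'INSTANCE('
}

# Intended use on a real node (not executed here):
#   dspmq -m QM1 -x | has_standby || echo "no standby: do not start the roll"
```

A check like this only proves that a standby is registered, not that it can take over; a lab failover is still the real test.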
A StatefulSet rollingUpdate replaces pods one at a time, giving each pod its termination grace period to shut down. The MQ container must quiesce within that time, either through endmqm or an operator preStop hook. The PersistentVolumeClaim reattaches to the new pod, and recovery runs on strmqm. The readiness probe must wait until the queue manager accepts connections; routing Service traffic too early causes storms of MQRC 2059 (MQRC_Q_MGR_NOT_AVAILABLE) errors. Upgrade the operator chart and the native image tag in coordinated steps, following the compatibility table. Native HA replicas behave similarly to multi-instance at a logical level.
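The readiness rule above amounts to "poll a health check until it passes, and give up after a bounded number of tries". A minimal sketch of that loop for a lab smoke test follows; `wait_until_ready` is a hypothetical helper, and the probe command in the usage comment (`chkmqready` run through `kubectl exec` against a pod named `qm1-0`) is an assumption about your image and naming, not something to copy unverified.

```shell
#!/bin/sh
# wait_until_ready CHECK_CMD MAX_ATTEMPTS DELAY_SECONDS
# Runs CHECK_CMD until it exits 0 or MAX_ATTEMPTS is reached.
# Prints the attempt count on success; returns 1 on timeout.
wait_until_ready() {
    check_cmd=$1
    max_attempts=$2
    delay_seconds=$3
    attempt=1
    while :; do
        if sh -c "$check_cmd" >/dev/null 2>&1; then
            echo "ready after $attempt attempt(s)"
            return 0
        fi
        # Stop without a trailing sleep once the budget is used up.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "not ready after $max_attempts attempt(s)" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
}

# Hypothetical lab usage (pod name and probe command are assumptions):
#   wait_until_ready "kubectl exec qm1-0 -- chkmqready" 30 5
```

In a real cluster the kubelet runs the probe itself; a loop like this is only useful for timing how long a replaced pod takes to become connectable during a rehearsal.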
| Platform | Mechanism | Watch item |
|---|---|---|
| Linux multi-instance | Failover between nodes | Shared storage health |
| Kubernetes MQ | Pod rolling update | Probe and PVC bind |
| z/OS QSG | Member maintenance order | CF structures |
| Single VM | Not rolling—planned outage | Drain messages first |
Message channels to remote queue managers may drop when the local queue manager fails over. Partner sender channels enter RETRY; make sure the retry settings (SHORTRTY and SHORTTMR, LONGRTY and LONGTMR) allow enough time for recovery. Cluster channels may rebalance. Clients on SVRCONN channels with reconnect options enabled retry against the surviving listener address. Channels hard-coded to a node IP that is down during the upgrade fail until DNS or the VIP moves, so fix the architecture before the upgrade weekend.
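Whether the retry timers "allow recovery" is simple arithmetic: retry count times retry interval for the short phase, plus the same for the long phase, compared with the failover time you measured in the lab. The sketch below treats that sum as a rough upper bound on how long a sender keeps trying; the helper names and the sample numbers are illustrative, not recommended settings.

```shell
#!/bin/sh
# retry_window_seconds SHORTRTY SHORTTMR LONGRTY LONGTMR
# Prints the approximate total time, in seconds, that a channel keeps retrying.
retry_window_seconds() {
    echo $(( $1 * $2 + $3 * $4 ))
}

# covers_outage WINDOW_SECONDS OUTAGE_SECONDS
# Exit status 0 when the retry window is at least as long as the outage.
covers_outage() {
    [ "$1" -ge "$2" ]
}

# Example with illustrative values: 10 short retries every 60 s,
# then 5 long retries every 1200 s.
#   retry_window_seconds 10 60 5 1200    # prints 6600
```

If the window is shorter than the measured failover, the partner channel stops retrying and needs a manual start after the upgrade.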
In-doubt XA transactions survive failover but may need resolution after the upgrade if coordinators disagree. Drain or complete transactions before the upgrade when possible. Batch jobs that hold a syncpoint open for hours block a clean handoff, so schedule batch work away from the upgrade window.
For small queue managers with tolerant maintenance windows, a simple endmqm upgrade may cost less than building HA solely to enable rolling upgrades. A greenfield cloud deployment may accept a brief outage if it holds no messages yet. The business decides; engineering documents the expected outage honestly, in seconds.
Rolling upgrades are fixing one airplane engine while the other engine keeps flying—you swap carefully so the plane never falls, instead of turning off both engines at once.
Execute lab multi-instance failover and record client reconnect behavior.
Check whether production CONNAME uses IP or DNS/VIP.
Write rolling upgrade runbook outline with rollback triggers.
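For the CONNAME exercise above, a small classifier can flag entries that point at a literal node address. This is a sketch under stated assumptions: it handles a single `host(port)` entry, recognizes only IPv4 literals, and the helper names `conname_host` and `conname_kind` are hypothetical. Comma-separated connection lists and IPv6 would need extra handling.

```shell
#!/bin/sh
# conname_host strips an optional "(port)" suffix from one CONNAME entry.
conname_host() {
    printf '%s\n' "${1%%"("*}"
}

# conname_kind prints "ip" for an IPv4-style literal and "name" otherwise.
conname_kind() {
    host=$(conname_host "$1")
    case $host in
        ''|*[!0-9.]*) echo name ;;   # empty, or contains a non-digit, non-dot character
        *.*.*.*)      echo ip ;;     # digits and dots only, at least three dots
        *)            echo name ;;
    esac
}

# Example with made-up values:
#   conname_kind '10.1.2.3(1414)'            # prints ip
#   conname_kind 'mqvip.example.com(1414)'   # prints name
```

An "ip" result is not automatically wrong, since a VIP is also an address, but each one deserves a note in the runbook saying which it is.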
1. Rolling upgrade needs:
2. Multi-instance upgrade order:
3. Client reconnect matters because:
4. Single standalone QM upgrade: