Rolling Upgrades

Rolling upgrades are how large estates move to IBM MQ 9.4 without a single global Sunday blackout: upgrade the standby multi-instance node first, fail over, then upgrade the former active; roll Kubernetes pods one by one; or take queue sharing group (QSG) members through maintenance in a sequence that keeps shared queues available. The phrase sounds like magic zero downtime, but the real outage depends on application reconnect, transaction length, channel retry, and whether you actually built HA or merely installed two sets of binaries. A standalone queue manager on one VM cannot roll in the HA sense: you run endmqm and accept the outage. This tutorial covers rolling upgrade prerequisites, the multi-instance step sequence at a conceptual level, Kubernetes operator rolling updates, client and channel behavior during failover, a QSG overview, testing failover before upgrade weekend, and common failures when the standby was never validated.

Prerequisites for Rolling Upgrade

  • Proven HA: multi-instance, RDQM, QSG, or Kubernetes, each with tested failover.
  • Applications use automatic client reconnect, or connection naming that follows the active instance.
  • Channels use a reconnect-friendly CONNAME (e.g. a VIP or DNS name), not the hard-coded IP of a node that may be down.
  • Short transactions: a long unit of work (UOW) blocks a clean handoff.
  • Non-prod rolling drill completed with the same automation as prod.
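
The CONNAME prerequisite above can be audited with a few lines of script. The following is a minimal sketch, not a supported tool: the function name literal_ip_hosts is hypothetical, and it assumes CONNAME values in the usual comma-separated host(port) form. It only reports which hosts are literal IP addresses.

```python
import ipaddress
import re

# Matches one CONNAME entry such as "mq-vip.example.com(1414)" or "10.0.0.5".
_ENTRY = re.compile(r"^\s*([^()\s]+?)\s*(?:\((\d+)\))?\s*$")


def literal_ip_hosts(conname: str) -> list[str]:
    """Return the hosts in a CONNAME string that are literal IP addresses."""
    flagged = []
    for entry in conname.split(","):
        match = _ENTRY.match(entry)
        if not match:
            continue  # skip entries this sketch does not understand
        host = match.group(1)
        try:
            ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal, so treated as a DNS name
        flagged.append(host)
    return flagged


# Hypothetical example values, not taken from any real estate.
print(literal_ip_hosts("10.0.0.5(1414),mq-vip.example.com(1414)"))
```

A flagged address is not automatically wrong: a floating VIP is also a literal IP, and this check cannot tell a VIP from a node address. Treat the output as a list of entries to review by hand.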

Multi-Instance Rolling Sequence (Conceptual)

  1. Confirm active and standby status with dspmq -m QM1 -x or platform equivalent.
  2. Upgrade MQ binaries on standby host (or node) per migration guide.
  3. Upgrade standby queue manager instance to new version.
  4. Controlled failover so applications move to upgraded standby.
  5. Upgrade former active host binaries and instance.
  6. Fail back if required by runbook; validate symmetric versions.

Exact commands differ by release and platform. Never run a failover in production without checking it against the IBM documentation for your version. Shared storage must mount cleanly on both nodes; split-brain prevention is built into the product, but storage faults still hurt.

    # Illustrative checks - verify IBM doc for your release
    dspmq -m QM1 -x
    DISPLAY QMSTATUS ALL
    * After failover test in lab:
    * - Application reconnect count
    * - Channel RETRY events in AMQERR
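
Step 1 of the sequence is easier to enforce when a script reads the instance status rather than a tired human. The sketch below parses text shaped like dspmq -x output and refuses to proceed unless it sees exactly one active and at least one standby instance. The sample text, the host names nodea and nodeb, and both function names are assumptions for illustration; compare the layout with what your release actually prints before relying on it.

```python
import re

# Assumed layout of `dspmq -m QM1 -x` output; verify against your own system.
SAMPLE = """\
QMNAME(QM1)   STATUS(Running)
    INSTANCE(nodea) MODE(Active)
    INSTANCE(nodeb) MODE(Standby)
"""

_INSTANCE = re.compile(r"INSTANCE\(([^)]+)\)\s+MODE\(([^)]+)\)")


def instance_modes(dspmq_output: str) -> dict[str, str]:
    """Map each instance host to the mode reported in the text."""
    return {host: mode for host, mode in _INSTANCE.findall(dspmq_output)}


def ready_to_roll(dspmq_output: str) -> bool:
    """True only when exactly one Active and at least one Standby are listed."""
    modes = list(instance_modes(dspmq_output).values())
    return modes.count("Active") == 1 and modes.count("Standby") >= 1


print(instance_modes(SAMPLE))
print(ready_to_roll(SAMPLE))
```

A missing standby line makes ready_to_roll return False, which is the point: an upgrade that starts with no standby is a planned outage, not a roll.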

Kubernetes and MQ Operator

A StatefulSet rollingUpdate replaces pods one at a time, giving each pod its termination grace period to shut down. The MQ container must quiesce within that period, through endmqm or an operator preStop hook. The PersistentVolumeClaim reattaches to the new pod, and recovery runs on strmqm. The readiness probe must not report ready until the queue manager accepts connections; premature Service traffic causes storms of reason code 2059 (MQRC_Q_MGR_NOT_AVAILABLE). Upgrade the operator chart and the image tag in coordinated steps, per the compatibility table. Native HA replicas behave similarly to multi-instance at a logical level.
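
The readiness rule can be rehearsed outside the cluster with a small polling loop. This is a rough sketch under stated assumptions: listener_accepts and wait_until_ready are hypothetical helper names, and a successful TCP connect to the listener port is used as a stand-in for "accepts connections". A real probe should use whatever readiness check your container image and operator document.

```python
import socket
import time
from typing import Callable


def listener_accepts(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the listener port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int, delay: float,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final failed attempt
    return False
```

A lab usage might be wait_until_ready(lambda: listener_accepts("mq-lab.example.com", 1414), attempts=30, delay=2.0), where the host name and port are placeholders. The sleep parameter is injectable so the loop can be tested without waiting.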

Rolling upgrade patterns by platform

Platform              | Mechanism                   | Watch item
Linux multi-instance  | Failover between nodes      | Shared storage health
Kubernetes MQ         | Pod rolling update          | Probe and PVC bind
z/OS QSG              | Member maintenance order    | CF structures
Single VM             | Not rolling: planned outage | Drain messages first

Channels During Failover

Message channels to remote queue managers may drop when the local queue manager fails over. Partner sender channels enter RETRY; ensure the retry counts and intervals (SHORTRTY/SHORTTMR and LONGRTY/LONGTMR) leave enough room for recovery. Cluster channels may rebalance. SVRCONN clients with reconnect options retry to the surviving listener address. Channels hard-coded to a node IP that is down during the upgrade fail until DNS or the VIP moves, so fix the architecture before rolling weekend.
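
The retry arithmetic is worth doing before the weekend rather than during it. The sketch below is a simplified model, not how the channel initiator is implemented: it assumes retries fire at fixed multiples of SHORTTMR for SHORTRTY attempts and then every LONGTMR seconds, and it ignores the time each attempt takes. The function name next_retry_delay is hypothetical, and the default values are the commonly cited ones; confirm yours with DISPLAY CHANNEL.

```python
def next_retry_delay(outage_s: int, shortrty: int = 10, shorttmr: int = 60,
                     longrty: int = 999_999_999, longtmr: int = 1200):
    """Seconds from partner recovery to the next scheduled retry.

    Returns None when the modelled retry attempts are exhausted first.
    """
    short_window = shortrty * shorttmr
    if shortrty > 0 and outage_s <= short_window:
        ticks = max(1, -(-outage_s // shorttmr))  # ceiling division
        return ticks * shorttmr - outage_s
    if longrty == 0:
        return None  # no long retries configured in this model
    long_elapsed = outage_s - short_window
    ticks = max(1, -(-long_elapsed // longtmr))
    if ticks > longrty:
        return None
    return ticks * longtmr - long_elapsed


print(next_retry_delay(90))   # inside the short retry window
print(next_retry_delay(700))  # just past it, now waiting on LONGTMR
```

Under this model a 90-second failover costs a partner channel about 30 more seconds, while a 700-second one costs about 1100, because the channel has already dropped to the long interval. That cliff is the reason to keep failover inside the short retry window or to restart partner channels manually.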

Transactions and Indoubt Work

In-doubt XA transactions survive failover but may need resolution after the upgrade if coordinators disagree. Drain or complete transactions before the upgrade when possible. Batch jobs that hold a syncpoint open for hours block a clean handoff, so schedule batch away from the upgrade window.
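
A pre-upgrade gate can flag long units of work before anyone types a failover command. The sketch below assumes you have already collected a start time per connection by some means (for example from the UOW start date and time attributes that DISPLAY CONN reports); the collection step is not shown. The function name and the connection labels are hypothetical.

```python
from datetime import datetime, timedelta


def long_running_uows(uow_starts: dict[str, datetime], now: datetime,
                      limit: timedelta) -> list[str]:
    """Return connection labels whose unit of work has been open longer than limit."""
    return sorted(name for name, started in uow_starts.items()
                  if now - started > limit)


# Hypothetical data for illustration only.
now = datetime(2025, 1, 12, 2, 0, 0)
starts = {
    "BATCH.NIGHTLY": datetime(2025, 1, 11, 22, 30, 0),
    "WEB.ORDERS": datetime(2025, 1, 12, 1, 59, 50),
}
print(long_running_uows(starts, now, limit=timedelta(minutes=5)))
```

Anything this returns is a candidate to drain, complete, or reschedule before the window opens.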

Testing Before Production Roll

  1. Fail over from active to standby in the lab without upgrading, and measure client recovery time.
  2. Upgrade lab multi-instance using production runbook verbatim.
  3. Run channel and application load test during failover.
  4. Document maximum observed outage seconds for executives.
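
The number in step 4 should come from measurement, not memory. One way to get it, sketched here under assumptions, is to run a probe at a fixed interval during the drill, record a timestamp and a success flag for each attempt, and compute the longest failed stretch afterwards. The function name longest_outage and the sample data are hypothetical; the probe itself (a put/get test, a client connect) is whatever your drill uses.

```python
def longest_outage(samples: list[tuple[float, bool]]) -> float:
    """Longest span in seconds from a first failed probe to the next success.

    samples is a time-ordered list of (timestamp_seconds, probe_succeeded).
    An outage still open at the last sample is counted up to that sample.
    """
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts  # outage starts at the first failed probe
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None
    if down_since is not None:
        longest = max(longest, samples[-1][0] - down_since)
    return longest


# Hypothetical one-second probe results around a failover.
drill = [(0, True), (1, False), (2, False), (3, False), (4, True), (5, True)]
print(longest_outage(drill))
```

The result is only as fine-grained as the probe interval, so state the interval next to the figure you report.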

When Rolling Is Not Worth It

For small queue managers with tolerant maintenance windows, a simple endmqm upgrade may cost less than building HA solely to enable rolling. A greenfield cloud deployment may accept a brief outage if it holds no messages yet. The business decides; engineering documents honest outage seconds.

Explain Like I'm Five: Rolling Upgrades

Rolling upgrades are fixing one airplane engine while the other engine keeps flying—you swap carefully so the plane never falls, instead of turning off both engines at once.

Practice Exercises

Exercise 1

Execute lab multi-instance failover and record client reconnect behavior.

Exercise 2

Check whether production CONNAME uses IP or DNS/VIP.

Exercise 3

Write rolling upgrade runbook outline with rollback triggers.

Frequently Asked Questions

Test Your Knowledge

1. Rolling upgrade needs:

  • Multiple instances or pods
  • Only one QM ever
  • No clients
  • Deleted logs

2. Multi-instance upgrade order:

  • Follow IBM documented sequence
  • Random
  • Delete active first
  • Skip standby

3. Client reconnect matters because:

  • Failover must reach surviving instance
  • Channels ignore TCP
  • Logs stop
  • Topics vanish

4. Single standalone QM upgrade:

  • Usually has outage window
  • Never needs endmqm
  • No backup
  • No testing
Published
Read time: 22 min
Author: MainframeMaster
Verified: IBM MQ 9.4 documentation