Recovery processing is where logging internals, message persistence, the repository manager, and locking meet in production, usually at 2 a.m.: when a host dies, when strmqm sits at ninety percent CPU for an hour, or when AMQ messages mention media recovery and nobody on the bridge wants to touch the log directory. Recovery is not one button. It is a phased pipeline that the queue manager runs to make persistent data and object definitions consistent before it accepts applications again. Beginners confuse restart with restore from backup; veterans know that most routine restarts replay only a small amount of recent log after a clean shutdown, while true disaster recovery needs backed-up log and queue manager data plus IBM procedures. This tutorial walks through normal restart recovery, replay after unclean shutdown, rollback of uncommitted work and resolution of in-doubt transactions, media recovery indicators, an overview of multi-instance and HA interactions, operator commands you may see referenced in runbooks, and what to capture before calling support. It completes the advanced internals track started on the logging and persistence pages.
DISPLAY QMSTATUS during early start may show states indicating recovery in progress on some platforms—applications attempting MQCONN receive reason codes until ready. Automation should health-check readiness, not only process existence.
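A readiness check along these lines can be sketched in Python. This is a minimal sketch, assuming the automation has already captured the text printed by the `dspmq` control command (lines of the form `QMNAME(...) STATUS(...)`); the helper names `parse_dspmq` and `is_ready` are hypothetical, and the exact status strings should be confirmed against your release.

```python
import re

# Assumption: only "Running" means applications can connect to this instance.
# Other dspmq statuses (for example "Starting" or "Running as standby") are
# treated as not ready.
READY_STATUSES = {"Running"}

# Matches dspmq-style lines such as: QMNAME(QM1)      STATUS(Running)
_DSPMQ_LINE = re.compile(
    r"QMNAME\((?P<name>[^)]+)\)\s+STATUS\((?P<status>[^)]+)\)"
)

def parse_dspmq(output):
    """Return a dict mapping queue manager name to its status string."""
    return {
        m.group("name").strip(): m.group("status").strip()
        for m in _DSPMQ_LINE.finditer(output)
    }

def is_ready(output, qmgr):
    """True only if the named queue manager is listed with a ready status."""
    return parse_dspmq(output).get(qmgr) in READY_STATUSES

# Illustrative captured output, not taken from a real system.
sample = (
    "QMNAME(QM1)          STATUS(Starting)\n"
    "QMNAME(QM2)          STATUS(Running)\n"
)
print(is_ready(sample, "QM1"))  # False: process exists, but still starting
print(is_ready(sample, "QM2"))  # True
```

A production check would likely go one step further and attempt a real connection, since a running queue manager does not guarantee that listeners and channels are up.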
| Shutdown type | Typical replay | Risk level |
|---|---|---|
| endmqm (clean) | Minimal | Low |
| Kill process / power loss | From last checkpoint forward | Medium (restart duration) |
| Storage corruption | May fail; media recovery needed | High |
| Manual log deletion | Cannot complete | Critical (data loss) |
The log is a time-ordered journal of persistent changes. Replay reads forward from a known good point, reapplies operations for transactions that committed before the crash, and backs out changes for transactions that did not. In-doubt channels and XA participants can leave in-doubt units of work that need administrative resolution: dspmqtrn lists in-doubt transactions, rsvmqtrn resolves them, and RESOLVE CHANNEL handles in-doubt channel batches, depending on the scenario. A long replay means either a large span of log since the last checkpoint or a very large number of operations. Tune checkpoint policy and application commit frequency for next time; for now, wait, or follow support guidance if the duration is excessive.
    # After unclean shutdown: start the queue manager and observe recovery
    strmqm QM1
    # Monitor the AMQERR logs and queue manager status
    runmqsc QM1 <<'EOF'
    DISPLAY QMSTATUS ALL
    DISPLAY LSSTATUS(*)
    EOF
    # Resolve in-doubt work only per runbook; example inspection:
    dspmqtrn -m QM1
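The replay rule described above (reapply committed work, back out the rest) can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the queue manager's actual algorithm or log format: `LogRecord` and `replay` are hypothetical names, the log is a plain list, and uncommitted operations are skipped rather than applied and then undone, which gives the same end state in this simplified model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogRecord:
    txn: str         # transaction identifier
    op: str          # "put", "get", or "commit"
    queue: str = ""  # target queue (unused for "commit")
    msg: str = ""    # message payload (unused for "commit")

def replay(log, initial=None):
    """Rebuild queue contents by applying only committed transactions.

    `initial` is the known good starting state (queue name -> messages).
    """
    # Pass 1: find which transactions reached a commit record.
    committed = {r.txn for r in log if r.op == "commit"}
    # Copy the starting state so the caller's data is not mutated.
    queues = {q: list(msgs) for q, msgs in (initial or {}).items()}
    # Pass 2: apply operations in log order, skipping uncommitted work.
    for r in log:
        if r.op == "commit" or r.txn not in committed:
            continue
        contents = queues.setdefault(r.queue, [])
        if r.op == "put":
            contents.append(r.msg)
        elif r.op == "get":
            contents.remove(r.msg)
    return queues

log = [
    LogRecord("T1", "put", "Q1", "a"),
    LogRecord("T2", "put", "Q1", "b"),  # T2 never commits
    LogRecord("T1", "commit"),
    LogRecord("T3", "get", "Q1", "a"),  # T3 never commits
]
print(replay(log))  # {'Q1': ['a']}
```

Message "b" never appears because T2 did not commit, and message "a" stays on the queue because T3's destructive get did not commit either.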
Checkpoints flush queue file pages and mark log positions so that replay can skip settled work. Infrequent checkpoints lengthen an unclean restart; over-aggressive checkpointing increases steady-state I/O. Balance the two against workload: a 24x7 retail system may accept more checkpoint I/O, while an overnight batch hub may accept a longer but rare restart window. See logging internals for LogPrimaryFiles and related sizing.
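The trade-off can be made concrete with a little arithmetic. The sketch below assumes a deliberately simple model in which a checkpoint is taken after every fixed number of log records; the function name `checkpoint_tradeoff` and the numbers are illustrative, and real queue managers use additional checkpoint triggers and may need to replay from before the last checkpoint when long-running transactions are open.

```python
def checkpoint_tradeoff(records_written, interval):
    """Return (checkpoints_taken, records_to_replay) for a crash that
    happens after `records_written` log records, assuming one checkpoint
    every `interval` records."""
    checkpoints_taken = records_written // interval
    records_to_replay = records_written - checkpoints_taken * interval
    return checkpoints_taken, records_to_replay

# Same crash point, two checkpoint policies.
print(checkpoint_tradeoff(95_000, 10_000))  # (9, 5000)
print(checkpoint_tradeoff(95_000, 50_000))  # (1, 45000)
```

In this model the frequent policy pays for nine checkpoints of steady-state I/O to keep replay at 5,000 records, while the infrequent policy pays for one checkpoint and replays 45,000.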
Recovery must leave the object catalog consistent with queue contents: there cannot be a defined queue pointing at missing files, or orphaned pages referencing deleted objects. Repository manager internals explains the catalog's role; persistence internals explains queue files. Copying only some directories produces a mismatched set that breaks recovery, so restore queue manager data and logs together from a supported backup.
Media recovery comes into play when log files are missing or incorrectly truncated, or when queue data does not match the end of the log chain. Symptoms include the queue manager failing early in startup with messages that mention media recovery, or the documentation for your release directing you to explicit operator tools (for example, rcrmqobj to recreate a damaged object from a media image when linear logging is in use). Response pattern: stop making it worse (do not delete more files), locate the last good backup taken after a clean endmqm, and call IBM support with FFST files and error logs if unsure. Multi-instance queue managers and HA products have their own failover recovery; follow that product's runbook instead of single-node media steps.
Multi-instance queue managers fail over to a standby instance; recovery processing still runs on the instance that takes over, but it is transparent to many clients if automatic reconnect is configured. Replicated data queue managers and stretch clusters add cross-site consistency concerns, so recovery there is more than local log replay. Read the failover and replicated data tutorials for operational paths; this page covers the internals of a single queue manager.
Back up the qmgr data, log, and queue files together, following the IBM backup guide for your platform. A clean endmqm quiesces activity first, so the copy is consistent. Document whether any backups are only crash-consistent, as VM snapshots taken without a quiesce are; those are weaker. Test a restore yearly in a lab. dmpmqcfg captures object definitions but not messages, so restoring messages requires disciplined data and log backups.
Recovery is rewinding your movie to the last bookmark (checkpoint), replaying scenes you had already decided were official (committed), and erasing scenes you had not finished deciding (uncommitted)—so the story makes sense before anyone watches again.
Recovery is fixing a board game after the table got bumped—using the notebook of moves to put every piece back where it should be before the next player rolls the dice.
Time strmqm after a clean endmqm versus after kill -9, on a lab queue manager with the same queue depth.
Write a one-page runbook: symptoms → checks → escalate to support.
Verify last backup date and whether endmqm preceded it.
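For the timing exercise above, a small Python helper can record how long a command takes to return. This is a minimal sketch: `time_command` is a hypothetical helper name, and the `strmqm QM1` invocation in the comment assumes a lab queue manager called QM1, as in the earlier command example.

```python
import subprocess
import time

def time_command(argv):
    """Run a command to completion and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    result = subprocess.run(argv, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.returncode

# Lab usage (assumes IBM MQ is installed and QM1 exists):
#   elapsed, code = time_command(["strmqm", "QM1"])
#   print(f"strmqm returned {code} after {elapsed:.1f}s")
```

Run it once after a clean endmqm and once after killing the queue manager processes, keeping queue depth the same, and record both durations in the runbook.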
1. Recovery after crash uses:
2. Clean endmqm before backup:
3. Non-persistent on crash:
4. Media recovery is for: