What is recovery processing in IBM MQ?

Recovery processing is the queue manager startup and maintenance logic that replays logs, reconciles queue files and repository, rolls back incomplete transactions, and optionally runs media recovery when log or data files are missing or damaged—returning the system to a consistent messaging state.

Why is restart slow after power failure?

Unclean shutdown leaves uncheckpointed log records to replay. Large log volume, millions of in-flight persistent messages, or damaged storage extend replay and page recovery time before channels and applications can connect.

What is media recovery?

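When active log files are missing or queue data is inconsistent with the log, supported media recovery procedures apply: restore from backups taken after a clean shutdown, or follow IBM-guided steps. On queue managers that use linear logging, damaged objects can also be recreated from the log with rcrmqobj, using media images recorded by rcdmqimg. This is not the same as an everyday strmqm after a clean endmqm.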

Will non-persistent messages survive recovery?

Generally no—non-persistent messages in memory or not fully logged are lost on crash. Design business-critical traffic as persistent with syncpoint where required.

Should I delete log files to fix a stuck queue manager?

No—deleting active logs without IBM support procedures causes data loss and corruption. Capture diagnostics, verify disk, and follow documented recovery or open a support case.

MainframeMaster

Recovery Processing

Recovery processing is where logging internals, message persistence, the repository manager, and locking meet in production, usually at 2 a.m.: when a host dies, when strmqm sits at ninety percent CPU for an hour, or when AMQ messages mention media recovery and nobody on the bridge wants to touch the log directory. Recovery is not one button. It is a phased pipeline that the queue manager runs to make persistent data and object definitions consistent before it accepts applications again. Beginners confuse restart with restore from backup. Veterans know that most routine restarts follow a clean shutdown and replay only a small amount of recent log, while a true disaster needs backed-up log and queue manager copies plus IBM procedures. This tutorial walks through normal restart recovery, replay after unclean shutdown, rollback of incomplete transactions and resolution of in-doubt ones, media recovery indicators, an overview of multi-instance and HA interactions, operator commands you may see referenced in runbooks, and what to capture before calling support. It completes the advanced internals track started on the logging and persistence pages.

Phases of Queue Manager Startup Recovery

Bootstrap: locate qmgr directory, verify accessibility, read configuration.
Log recovery: replay since last consistent checkpoint; redo committed, undo uncommitted work.
Repository reconcile: ensure object catalog matches recovered state.
Queue file recovery: apply page updates for persistent messages as needed.
Service start: listeners, channels per configuration, command server ready.

DISPLAY QMSTATUS during early start may show states indicating recovery in progress on some platforms—applications attempting MQCONN receive reason codes until ready. Automation should health-check readiness, not only process existence.
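The readiness check described above can be sketched in Python. This is a minimal sketch, not an official health probe: it assumes the `dspmq -m <name>` command is on the PATH and prints a line containing `STATUS(...)`, and the function names are hypothetical. It treats only `Running` as ready, so `Starting` or `Running as standby` are reported as not ready even though the process exists.

```python
import re
import subprocess
from typing import Optional

# Assumption: only a fully started queue manager counts as ready.
READY_STATUSES = {"Running"}


def parse_dspmq_status(output: str) -> Optional[str]:
    """Extract the STATUS(...) value from dspmq output, or None if absent."""
    match = re.search(r"STATUS\(([^)]*)\)", output)
    return match.group(1) if match else None


def is_ready(output: str) -> bool:
    """True only when the reported status is in READY_STATUSES."""
    return parse_dspmq_status(output) in READY_STATUSES


def check_queue_manager(name: str) -> bool:
    """Run dspmq for one queue manager (needs an IBM MQ installation)."""
    result = subprocess.run(
        ["dspmq", "-m", name], capture_output=True, text=True, check=False
    )
    return result.returncode == 0 and is_ready(result.stdout)
```

An automation job would poll `check_queue_manager("QM1")` with a timeout rather than testing whether the queue manager process is alive.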

Clean Versus Unclean Shutdown

Shutdown type impact
Shutdown type	Typical replay	Risk level
endmqm (clean)	Minimal	Low
Kill process / power loss	From last checkpoint forward	Medium (mainly restart duration)
Storage corruption	May fail and require media recovery	High
Manual log deletion	Cannot complete	Critical (data loss)

Log Replay in Plain Language

The log is a time-ordered journal of persistent changes. Replay reads forward from a known good point, reapplies operations for transactions that committed before the crash, and backs out changes for transactions that did not. Channels and XA participants may leave in-doubt transactions that require administrative resolution (for example dspmqtrn and rsvmqtrn on distributed platforms, RESOLVE CHANNEL in MQSC, or RESOLVE INDOUBT on z/OS, depending on the scenario). A long replay means either a large amount of log written since the last checkpoint or a very large number of operations. Tune checkpoint policy and application commit frequency for next time; for the current restart, wait, or follow support guidance if the duration is excessive.

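```text
# After unclean shutdown: start the queue manager and observe recovery (shell)
strmqm QM1
# Watch AMQERR01.LOG during startup, then check overall status
dspmq -m QM1
# List in-doubt transactions; resolve only per runbook
dspmqtrn -m QM1
# Once the queue manager is running, inspect status in MQSC
runmqsc QM1
DISPLAY QMSTATUS ALL
DISPLAY LSSTATUS(*)
```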
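The redo and undo behaviour described in this section can be modeled with a toy log. The sketch below is a conceptual illustration only: the record layout, operation names, and the `replay` function are invented for this example and do not reflect the IBM MQ log format. It assumes a well-formed log in which every GET refers to a message already put.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (transaction id, operation, queue name, message)
LogRecord = Tuple[str, str, str, str]


def replay(log: List[LogRecord]) -> Dict[str, List[str]]:
    """Rebuild queue contents from a toy log, keeping committed work only."""
    # Pass 1: find every transaction that reached COMMIT before the crash.
    committed = {txid for txid, op, _, _ in log if op == "COMMIT"}
    queues: Dict[str, List[str]] = {}
    # Pass 2: reapply operations in log order for committed transactions.
    for txid, op, queue, message in log:
        if txid not in committed:
            continue  # never committed: its changes are left out (backed out)
        if op == "PUT":
            queues.setdefault(queue, []).append(message)
        elif op == "GET":
            queues[queue].remove(message)
    return queues


crash_log: List[LogRecord] = [
    ("T1", "PUT", "ORDERS", "m1"),
    ("T1", "COMMIT", "", ""),
    ("T2", "PUT", "ORDERS", "m2"),   # no commit before the crash
    ("T3", "PUT", "ORDERS", "m3"),
    ("T3", "COMMIT", "", ""),
    ("T4", "GET", "ORDERS", "m1"),   # destructive get never committed
]
recovered = replay(crash_log)
```

In this model `m2` disappears because T2 never committed, and `m1` stays on the queue because the get by T4 is backed out.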

Checkpointing and Replay Boundaries

Checkpoints flush queue file pages and mark log positions so that replay can skip settled work. Infrequent checkpoints lengthen restart after an unclean shutdown, while over-aggressive checkpointing increases steady-state I/O. Balance the two against the workload: a 24x7 retail system may accept more checkpoint I/O, whereas an overnight batch hub may accept a longer restart window because restarts are rare. See logging internals for LogPrimaryFiles and related sizing.
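The trade-off can be made concrete with a back-of-envelope estimate. This is a deliberately crude model, assuming replay time scales linearly with the volume of log written since the last checkpoint; the function name and the throughput figure are illustrative assumptions, not measured IBM MQ numbers.

```python
def estimated_replay_seconds(log_bytes_since_checkpoint: int,
                             replay_bytes_per_second: float) -> float:
    """Rough restart estimate: log volume divided by replay throughput."""
    if replay_bytes_per_second <= 0:
        raise ValueError("replay throughput must be positive")
    return log_bytes_since_checkpoint / replay_bytes_per_second


GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumed figures: 8 GiB of log since the checkpoint, 50 MiB/s replay rate.
estimate = estimated_replay_seconds(8 * GIB, 50 * MIB)
```

Measure the real replay rate in a lab (see Exercise 1) before using a figure like this in a recovery time objective.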

Repository and Message Data Together

Recovery must leave the catalog consistent with queue contents: a defined queue must not point at missing files, and no orphaned pages may reference deleted objects. Repository manager internals explains the catalog's role; persistence internals explains queue files. A partial restore that copies only some directories breaks recovery, so restore the full set together from a supported backup.

Media Recovery Scenarios

Media recovery comes into play when log files are missing or incorrectly truncated, or when queue data does not match the end of the log chain. Symptoms include the queue manager failing early in startup with messages that mention media recovery, or documentation for your release directing you to explicit operator tools. The response pattern: stop making it worse (do not delete more files), locate the last good backup taken after a clean endmqm, and call IBM support with FFST files and error logs if unsure. Multi-instance queue managers and HA products have their own failover recovery, so follow that product's runbook instead of single-node media steps.

In-Doubt and XA Recovery

Two-phase commit leaves transactions in doubt until the coordinator decides to commit or roll back.
An in-doubt channel may block related message flow until it is resolved.
Automation that resolves in-doubt work without understanding it can lose or duplicate messages. Use the documented RESOLVE semantics.
Database-MQ XA requires a coordinated recovery order per the middleware guide.
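The rule behind these points is that an in-doubt transaction should follow the coordinator's recorded decision, and anything without a known decision should go to a human. A minimal sketch of that triage follows; the function name, the transaction identifiers, and the decision strings are hypothetical, and the sketch only plans actions without calling any IBM MQ command.

```python
from typing import Dict, Iterable, List, Tuple


def plan_resolution(
    in_doubt: Iterable[str],
    coordinator_decisions: Dict[str, str],
) -> Tuple[List[str], List[str], List[str]]:
    """Split in-doubt transaction ids into commit, backout, and escalate."""
    commit: List[str] = []
    backout: List[str] = []
    escalate: List[str] = []
    for txid in in_doubt:
        decision = coordinator_decisions.get(txid)
        if decision == "commit":
            commit.append(txid)
        elif decision == "rollback":
            backout.append(txid)
        else:
            # Unknown outcome: never guess, hand over to the runbook owner.
            escalate.append(txid)
    return commit, backout, escalate


plan = plan_resolution(
    ["X1", "X2", "X3"],
    {"X1": "commit", "X2": "rollback"},
)
```

Here `X3` lands in the escalate list because the coordinator has no recorded outcome for it.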

High Availability and Recovery

Multi-instance queue managers fail over to standby; recovery processing still occurs on the surviving instance but transparently to many clients if reconnect is configured. Replicated data queue managers and stretch clusters add cross-site consistency concerns—recovery is not only local log replay. Read failover and replicated data tutorials for operational paths; this page covers single-QM internals mindset.

Backup Discipline

Back up the queue manager data, logs, and queue files per the IBM backup guide for your platform. A clean endmqm quiesces activity before the copy. Document whether your backups are only crash-consistent, as with VM snapshots taken without quiescing the queue manager; such snapshots give weaker guarantees. Test a restore at least yearly in a lab. dmpmqcfg captures object definitions but not messages, so restoring messages depends on disciplined data and log backups.
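One check that is easy to automate is whether a backup set contains both the queue manager data and its log directory. The sketch below assumes the backup mirrors a layout of `qmgrs/<name>` and `log/<name>` under one root, which matches the default data layout on Linux but may differ on your system; the function name is hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed layout: <backup_root>/qmgrs/<qmgr> and <backup_root>/log/<qmgr>.
REQUIRED_PARTS = ("qmgrs", "log")


def missing_backup_parts(backup_root: Path, qmgr: str) -> List[str]:
    """Return the required top-level parts that the backup set lacks."""
    return [
        part
        for part in REQUIRED_PARTS
        if not (backup_root / part / qmgr).is_dir()
    ]
```

An empty result only shows that both directories exist. It says nothing about whether they were taken at the same point in time, which still needs the clean endmqm discipline described above.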

What Operators Should Capture Before Support

Complete AMQERR log around failure time.
FFST (.FDC) files from the same time window, noting the probe ID in each.
dspmqver and platform storage errors (SAN, disk full).
Whether shutdown was clean; what changed recently (disk move, upgrade, script).
Short trace if reproducible in lab—not hours of production trace without guidance.
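Gathering the first two items can be scripted so that nothing is forgotten under pressure. This is a minimal sketch, assuming error logs are named `AMQERR*.LOG` and FFST files end in `.FDC`; the function name is hypothetical, and the directories to pass in (for example the system-wide and per-queue-manager `errors` directories) depend on your installation.

```python
import tarfile
from pathlib import Path
from typing import List

# Assumed file name patterns for error logs and FFST files.
PATTERNS = ("AMQERR*.LOG", "*.FDC")


def collect_diagnostics(error_dirs: List[Path], archive: Path) -> List[str]:
    """Bundle matching files into a gzip tarball; return the names added."""
    added: List[str] = []
    with tarfile.open(str(archive), "w:gz") as tar:
        for index, directory in enumerate(error_dirs):
            for pattern in PATTERNS:
                for path in sorted(directory.glob(pattern)):
                    # Prefix with the index so two "errors" dirs cannot clash.
                    arcname = f"{index}_{directory.name}/{path.name}"
                    tar.add(str(path), arcname=arcname)
                    added.append(arcname)
    return added
```

The script copies files only. It does not prune or delete anything, in keeping with the rule not to touch log or data directories during an incident.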

Explainer: Rewinding a Movie After a Power Cut

Recovery is rewinding your movie to the last bookmark (checkpoint), replaying scenes you had already decided were official (committed), and erasing scenes you had not finished deciding (uncommitted)—so the story makes sense before anyone watches again.

Explain Like I'm Five: Recovery Processing

Recovery is fixing a board game after the table got bumped—using the notebook of moves to put every piece back where it should be before the next player rolls the dice.

Practice Exercises

Exercise 1

Time strmqm after a clean endmqm versus after kill -9 on a lab queue manager with the same queue depth.

Exercise 2

Write a one-page runbook: symptoms → checks → escalate to support.

Exercise 3

Verify last backup date and whether endmqm preceded it.

Test Your Knowledge

1. Recovery after crash uses:

Log replay
Only client cache
DNS flush
JCL restart

2. Clean endmqm before backup:

Recommended practice
Forbidden
Deletes all queues
Disables log

3. Non-persistent on crash:

Usually lost
Always on tape
Duplicated
Becomes persistent

4. Media recovery is for:

Missing or damaged logs/data
Every daily start
Client upgrade
Topic rename only

