A channel stuck in retry is an operations emergency dressed as a yellow status light. IBM MQ keeps scheduling reconnects, the error logs repeat, transmission queue depth climbs, and dashboards show RETRY for so long that on-call teams normalize it. Unlike a brief network blip that clears within a couple of short retries, a stuck channel reports the same LASTCHLERR every cycle (wrong port, certificate rejected, CHLAUTH block, sequence mismatch) until someone stops the retry theater and fixes the root cause. This tutorial teaches beginners to break the loop safely: when to STOP versus let retry continue, how to read retry counters, how to coordinate with the partner team, how to avoid duplicate RESET mistakes, and how to communicate with business stakeholders watching queue depth.
A healthy RETRY lasts minutes, tracks changing network conditions, and succeeds when the listener returns. A stuck RETRY shows identical errors across multiple long timer periods, often spanning change windows or weekends. Compare timestamps in the error log: if ten attempts over six hours all report connection refused to the same IP, the cause is a persistent misconfiguration (listener down, wrong port, or a firewall rule that rejects the connection), not a flaky network. If errors alternate between TLS and CHLAUTH, fix security before doing any sequence work. Document the first RETRY time in the ticket; SLA reports use it.
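The timestamp comparison above can be sketched in a few lines of Python. This is a minimal illustration, not an IBM MQ API: the function name `is_stuck_retry`, the thresholds, and the assumption that attempts have already been extracted from the error log as `(timestamp, error_text)` pairs are all hypothetical.

```python
from datetime import datetime, timedelta


def is_stuck_retry(attempts, min_attempts=3, min_span=timedelta(minutes=30)):
    """Return True when every attempt failed with the same error over a long span.

    attempts: chronological list of (timestamp, error_text) pairs, assumed to
    have been pulled from the error log by some other tool.
    """
    if len(attempts) < min_attempts:
        return False
    # Normalize the text so trivial case or spacing differences do not matter.
    distinct_errors = {text.strip().lower() for _, text in attempts}
    span = attempts[-1][0] - attempts[0][0]
    return len(distinct_errors) == 1 and span >= min_span


# Ten identical failures spread over six hours, as in the example above.
start = datetime(2024, 1, 6, 2, 0)
refused = [(start + timedelta(minutes=40 * i), "connection refused") for i in range(10)]
print(is_stuck_retry(refused))
```

Two attempts one minute apart, or attempts whose error text changes between cycles, return `False` under these thresholds, which should be tuned to the channel's own retry timers.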
| LASTCHLERR pattern | Likely cause | Action |
|---|---|---|
| Connection refused | Listener down or wrong port | Start the listener or correct the port in the channel CONNAME |
| SSL or handshake | Cipher or cert trust | See SSL handshake tutorial |
| CHLAUTH | Rule BLOCK or no MAP | DISPLAY CHLAUTH |
| Sequence or protocol | Sequence numbers out of step after a DR failover | Coordinated RESET CHANNEL with the partner |
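For alert enrichment, the table above could be encoded as a first-pass lookup. The sketch below is an assumption-laden illustration: `TRIAGE_RULES` and `triage` are hypothetical names, and the keyword matching is deliberately crude, so it suggests where to look rather than diagnosing anything.

```python
# (keyword, likely cause, suggested action), in the same order as the table rows.
TRIAGE_RULES = [
    ("connection refused", "Listener down or wrong port", "Check listener state and the CONNAME port"),
    ("ssl", "Cipher or cert trust", "Follow the SSL handshake tutorial"),
    ("handshake", "Cipher or cert trust", "Follow the SSL handshake tutorial"),
    ("chlauth", "Rule BLOCK or no MAP", "Run DISPLAY CHLAUTH"),
    ("sequence", "Sequence state out of step", "Plan a coordinated RESET"),
    ("protocol", "Sequence state out of step", "Plan a coordinated RESET"),
]


def triage(last_error):
    """Map free-form error text to a (likely cause, suggested action) pair."""
    text = last_error.lower()
    for keyword, cause, action in TRIAGE_RULES:
        if keyword in text:
            return cause, action
    # Anything unrecognized goes to a human with the full status output.
    return "Unknown", "Escalate with full channel status output"


print(triage("Connection refused by remote host"))
```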
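```
* Capture the current state before touching anything
DISPLAY CHSTATUS('PARIS.TO.LONDON') ALL
* Break the retry loop
STOP CHANNEL('PARIS.TO.LONDON')
* Check how many messages are waiting on the transmission queue
DISPLAY QSTATUS('SYSTEM.XMITQ.PARIS') CURDEPTH
* Fix root cause (example: correct the CONNAME port)
ALTER CHANNEL('PARIS.TO.LONDON') CHLTYPE(SDR) CONNAME('host.corp(1414)')
START CHANNEL('PARIS.TO.LONDON')
* Confirm the channel reaches RUNNING
DISPLAY CHSTATUS('PARIS.TO.LONDON')
```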
STOP CHANNEL ends the current instance and halts the retry cycle; by default the channel then stays STOPPED until someone issues START CHANNEL. It does not by itself fix sequence state, so pair it with RESET when LASTCHLERR requires it. Inform partner operations before you STOP a bilateral channel so they do not simultaneously debug the wrong queue manager. Capture the DISPLAY CHSTATUS ALL output in the ticket before STOP for post-mortems.
SHORTRTY and LONGRTY are finite. When they are exhausted, the channel may appear INACTIVE while messages remain on the XMITQ, and business teams assume messaging is healthy because the queue manager is up. Monitoring must alert on XMITQ depth combined with a channel that is not RUNNING, not only on queue manager status. Some automation issues START CHANNEL periodically and recreates retry storms; disable runaway scripts until CONNAME is valid. After exhaustion, fixing the configuration and issuing START is enough if sequence state is still aligned.
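To see how long a retry budget lasts, multiply each retry count by its interval. The sketch below assumes the retry intervals come from the channel's SHORTTMR and LONGTMR attributes (in seconds); the specific values and the function names `retry_budget_seconds` and `needs_alert` are illustrative, and real elapsed time will differ slightly because each timer starts after a failed attempt.

```python
def retry_budget_seconds(shortrty, shorttmr, longrty, longtmr):
    """Approximate total wait, in seconds, before both retry counts are used up."""
    return shortrty * shorttmr + longrty * longtmr


def needs_alert(channel_status, xmitq_depth):
    """Alert when messages are waiting and the channel is not RUNNING."""
    return xmitq_depth > 0 and channel_status.upper() != "RUNNING"


# Illustrative tuning: 10 short retries at 60 s, then 72 long retries at 1200 s.
budget = retry_budget_seconds(shortrty=10, shorttmr=60, longrty=72, longtmr=1200)
print(f"retries exhausted after roughly {budget / 3600:.1f} hours")
print(needs_alert("INACTIVE", 5000))
```

With these example values the budget is 600 + 86,400 = 87,000 seconds, a little over a day, so a channel that broke on Friday evening could be out of retries before Monday. The depth check matters because an INACTIVE channel with an empty XMITQ is normal, while the same status with a backlog is not.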
Your SDR stuck in RETRY may be caused by their listener being down or by their RCVR CHLAUTH rules blocking your MCAUSER. Split the work on the bridge call: the sender team proves TCP connectivity to the listener port, while the receiver team proves LISTENER STATUS(RUNNING) and no CHLAUTH block, both in the same minute. A shared packet capture ends arguments. For cluster channels, multiple receivers may accept traffic, so a stuck RETRY on one cluster path may not stop all routes and can mask a partial outage.
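The sender-side half of that check can be a plain TCP probe run from the sending host. This is a minimal sketch using Python's standard `socket` module; the function name `probe_tcp` and its return labels are hypothetical, and a result of "open" only shows that something accepts connections on the port, not that it is the right listener or that TLS and CHLAUTH will pass.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify one TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, which often points at a firewall silently dropping packets.
        return "timeout"
    except OSError:
        # DNS failure, unreachable network, and similar problems.
        return "error"


# Example call from the sender host, using the CONNAME from the listing above:
# probe_tcp("host.corp", 1414)
```

Recording the result with a timestamp lets both teams line it up against the receiver's listener status for the same minute.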
Temporary mitigations include routing through an alternate channel pair, but only when the architecture supports it; never duplicate production feeds without a deduplication design. Draining the XMITQ to a file or to an alternate queue manager is a major decision that requires audit approval.
Anti-pattern: a cron job that only issues START CHANNEL without reading LASTCHLERR. Anti-pattern: raising LONGRTY to 999999 to silence alerts. Anti-pattern: disabling CHLAUTH to see whether RETRY clears, which proves security was the blocker but leaves you exposed. Better pattern: an event-driven alert on RETRY lasting longer than N minutes, with the LASTCHLERR text attached. Better pattern: change management that requires CONNAME verification before go-live.
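The event-driven alert could look roughly like the following. It is a sketch under stated assumptions: the function `retry_alert`, its parameters, and the 30-minute default for N are hypothetical, and the status and error text are assumed to be collected by whatever monitoring agent is already in place.

```python
from datetime import datetime, timedelta


def retry_alert(channel, status, retry_since, last_error, now,
                threshold=timedelta(minutes=30)):
    """Return an alert message once RETRY has lasted at least threshold, else None."""
    # Accept both the tutorial's RETRY label and the RETRYING value shown by DISPLAY CHSTATUS.
    if not status.upper().startswith("RETRY"):
        return None
    elapsed = now - retry_since
    if elapsed < threshold:
        return None
    minutes = int(elapsed.total_seconds() // 60)
    # Attach the error text so the responder starts from the cause, not the symptom.
    return f"{channel} in RETRY for {minutes} min; last error: {last_error}"


first_retry = datetime(2024, 1, 6, 2, 0)
print(retry_alert("PARIS.TO.LONDON", "RETRYING", first_retry,
                  "connection refused", now=datetime(2024, 1, 6, 2, 45)))
```

Unlike the START CHANNEL cron job, this rule never touches the channel; it only escalates with enough context for a human to fix the root cause.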
Stuck in retry is an alarm that rings every few minutes while nobody gets out of bed to fix the broken door. The house is not secure until someone repairs the lock rather than buying a louder alarm.
Your toy phone keeps calling your friend but nobody fixed the broken wire—so it rings forever until a grown-up fixes the wire instead of turning up the volume.
Write an on-call runbook section: RETRY more than 30 minutes with same LASTCHLERR.
Role-play sender versus receiver checks for connection refused stuck RETRY.
List monitoring metrics that detect stuck RETRY before users call the help desk.
1. Same LASTCHLERR every RETRY suggests:
2. STOP CHANNEL is used to:
3. Stuck in retry with growing XMITQ is:
4. After DR sequence mismatch, stuck RETRY may need: