Channel Stuck in Retry

A channel stuck in retry is an operations emergency dressed as a yellow status light. IBM MQ keeps scheduling reconnects, the error logs repeat, transmission queue depth climbs, and dashboards show RETRY for so long that on-call teams normalize it. Unlike a brief network blip that clears within a couple of short retries, a stuck channel reports the same LASTCHLERR every cycle (wrong port, certificate rejected, CHLAUTH block, sequence mismatch) until someone stops the retry theater and fixes the root cause. This tutorial teaches beginners to break the loop safely: when to STOP versus let retry continue, how to read retry counters, how to coordinate with the partner, how to avoid duplicate RESET mistakes, and how to use communication templates for business stakeholders watching queue depth.

Stuck Versus Healthy RETRY

Healthy RETRY lasts minutes, tracks changing network conditions, and succeeds when the listener returns. Stuck RETRY shows identical errors across multiple long retry intervals, often spanning change windows or weekends. Compare timestamps in the error log: if ten attempts over six hours all report connection refused to the same IP, the fault is persistent (a wrong port, a stopped listener, or a bad firewall rule), not a flaky network. If errors alternate between TLS and CHLAUTH, fix security before doing sequence work. Document the first RETRY time in the ticket; SLA reports use it.
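
The timestamp comparison can be sketched in Python. This is a minimal sketch, assuming the error log has already been parsed into (timestamp, error text) pairs. The function name `is_stuck` and the sample error string are hypothetical, and the thresholds of ten attempts and six hours come from the example above, not from any IBM MQ default.

```python
from datetime import datetime, timedelta

def is_stuck(attempts, min_attempts=10, min_span=timedelta(hours=6)):
    """Return True when the most recent attempts all failed with the same error.

    attempts: list of (datetime, error_text) tuples, oldest first.
    Thresholds are illustrative assumptions, not IBM MQ defaults.
    """
    if len(attempts) < min_attempts:
        return False
    recent = attempts[-min_attempts:]
    # One distinct error text across every recent attempt means nothing changed.
    same_error = len({text for _, text in recent}) == 1
    long_enough = recent[-1][0] - recent[0][0] >= min_span
    return same_error and long_enough

# Ten identical failures, 40 minutes apart (hypothetical sample data).
start = datetime(2024, 1, 6, 2, 0)
log = [(start + timedelta(minutes=40 * i), "connection refused 10.0.0.5(1414)")
       for i in range(10)]
print(is_stuck(log))
```

Comparing the set of distinct error texts, rather than counting failures, is what separates a persistent fault from a flaky network: a healthy RETRY tends to show varying errors or an eventual success.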

Stuck RETRY patterns and likely causes

LASTCHLERR pattern    | Likely cause                 | Action
Connection refused    | Listener down or wrong port  | Fix LISTENER CONNAME
SSL or handshake      | Cipher or cert trust         | See SSL handshake tutorial
CHLAUTH               | Rule BLOCK or no MAP         | DISPLAY CHLAUTH
Sequence or protocol  | DR skew                      | Coordinated RESET
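
The table can be turned into a first-pass triage lookup. The sketch below is illustrative only: the causes and actions are copied from the table, while the substring matching, the function name `triage`, and the fallback row are assumptions. Real error text varies by platform and version.

```python
# (substring to look for, likely cause, action) taken from the table above.
# Matching on lowercase substrings is an illustrative assumption.
RETRY_PATTERNS = [
    ("connection refused", "Listener down or wrong port", "Fix LISTENER CONNAME"),
    ("ssl", "Cipher or cert trust", "See SSL handshake tutorial"),
    ("handshake", "Cipher or cert trust", "See SSL handshake tutorial"),
    ("chlauth", "Rule BLOCK or no MAP", "DISPLAY CHLAUTH"),
    ("sequence", "DR skew", "Coordinated RESET"),
    ("protocol", "DR skew", "Coordinated RESET"),
]

def triage(error_text):
    """Return (likely cause, action) for the first pattern found in error_text."""
    lowered = error_text.lower()
    for needle, cause, action in RETRY_PATTERNS:
        if needle in lowered:
            return cause, action
    # Hypothetical fallback row; not part of the table.
    return "Unknown", "Escalate with full error text"

print(triage("connection refused by 10.0.0.5(1414)"))
```

A lookup like this only suggests where to look first; the action column still needs a human to confirm the cause before changing anything.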

Breaking the Loop: STOP and Investigate

DISPLAY CHSTATUS('PARIS.TO.LONDON') ALL
STOP CHANNEL('PARIS.TO.LONDON')
DISPLAY QSTATUS('SYSTEM.XMITQ.PARIS') CURDEPTH
* Fix root cause, for example the CONNAME port
ALTER CHANNEL('PARIS.TO.LONDON') CHLTYPE(SDR) CONNAME('host.corp(1414)')
START CHANNEL('PARIS.TO.LONDON')
DISPLAY CHSTATUS('PARIS.TO.LONDON')

STOP CHANNEL ends the current instance and, by default, leaves the channel in STOPPED state, so retry attempts cease until an operator issues START CHANNEL. It does not by itself fix sequence state; pair it with RESET when LASTCHLERR requires it. Inform partner operations before issuing STOP on bilateral channels so they do not simultaneously debug the wrong queue manager. Capture the DISPLAY CHSTATUS ALL output in the ticket before STOP for post-mortems.
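
To keep the order of operations consistent across incidents, the command sequence can be generated rather than typed. The following is a sketch under stated assumptions: the helper name `build_break_loop_script` is hypothetical, the ALTER line is only the example CONNAME fix shown above, and a real incident may need a different fix in its place. The output is meant to be reviewed by an operator before it is run.

```python
def build_break_loop_script(channel, xmitq, conname):
    """Return MQSC lines in the order used above: capture, stop, inspect, fix, restart."""
    return [
        f"DISPLAY CHSTATUS('{channel}') ALL",  # capture evidence before STOP
        f"STOP CHANNEL('{channel}')",
        f"DISPLAY QSTATUS('{xmitq}') CURDEPTH",
        # Example fix only: corrects CONNAME on a sender channel.
        f"ALTER CHANNEL('{channel}') CHLTYPE(SDR) CONNAME('{conname}')",
        f"START CHANNEL('{channel}')",
        f"DISPLAY CHSTATUS('{channel}')",
    ]

script = build_break_loop_script(
    "PARIS.TO.LONDON", "SYSTEM.XMITQ.PARIS", "host.corp(1414)"
)
print("\n".join(script))
```

Placing the full status display first means the pre-STOP state lands in the ticket even when the operator is rushing.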

When Retry Counts Exhaust

SHORTRTY and LONGRTY are finite. When exhausted, the channel may appear INACTIVE while messages remain on XMITQ—business teams think messaging is healthy because the queue manager is up. Monitoring must alert on XMITQ depth and channel NOT RUNNING, not only on QM status. Some automation issues START CHANNEL periodically, recreating retry storms—disable runaway scripts until CONNAME is valid. After exhaustion, fixing config and START is enough if sequence state is still aligned.
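
The monitoring rule described here can be written as a small predicate. This is a minimal sketch, assuming the monitoring tool already supplies the queue manager state, the channel status string, and the XMITQ depth; the function name `should_alert` and its parameters are hypothetical.

```python
def should_alert(qmgr_running, channel_status, xmitq_depth):
    """Alert when messages are waiting and the channel is not RUNNING,
    even if the queue manager itself is healthy."""
    if not qmgr_running:
        return True
    return xmitq_depth > 0 and channel_status != "RUNNING"

# Queue manager is up, but the channel gave up and messages are waiting.
print(should_alert(True, "INACTIVE", 1200))
```

Checking depth and channel status together avoids paging for a channel that is idle with an empty queue while still catching the case where a healthy queue manager hides a stalled channel.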

Partner-Side Blind Spots

Your SDR stuck in RETRY may be caused by their listener being down or by a CHLAUTH rule on their RCVR blocking your MCAUSER. Split the bridge call: the sender team proves the TCP port is reachable while the receiver team proves LISTENER STATUS(RUNNING) and no CHLAUTH block, both in the same minute. A shared packet capture ends arguments. For cluster channels, multiple receivers may accept traffic, so stuck RETRY on one cluster path may not stop all routes and can mask a partial outage.
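
For the sender-side half of that check, a plain TCP connect test is often enough. The sketch below uses only the Python standard library; the function name `tcp_port_open` and the three-second timeout are assumptions. A successful connect shows that the port is reachable from this host. It does not show that the MQ handshake, TLS, or CHLAUTH will succeed.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Example call from the sender host, using the CONNAME from the channel definition:
# tcp_port_open("host.corp", 1414)
```

Run it from the sender queue manager's host, not from a workstation, because firewall rules usually depend on the source address.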

Business Impact While Stuck

  • XMITQ depth and oldest message age—report to application owners.
  • Downstream batch missed cutoffs if messages are time-sensitive.
  • Reply-to and request-reply timeouts if replies cannot return.
  • Cluster publication delays if cluster channels are stuck.

Temporary mitigations include routing through alternate channel pairs only when architecture supports it—never duplicate production feeds without deduplication design. Draining XMITQ to file or alternate QM is a major decision requiring audit approval.

Automation and Anti-Patterns

Anti-patterns:

  • A cron job that only issues START CHANNEL without reading LASTCHLERR.
  • Raising LONGRTY to 999999 to silence alerts.
  • Disabling CHLAUTH to see if RETRY clears: it proves security was the blocker but leaves you exposed.

Better patterns:

  • An event-driven alert on RETRY lasting longer than N minutes, with the LASTCHLERR text attached.
  • Change management that requires CONNAME verification before go-live.
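
The duration-based alert can be sketched as follows. It assumes a poller that records when the channel first entered retry and the most recent error text; the function name `retry_alert`, the 30-minute default for N, and the message format are all hypothetical. The tutorial calls the state RETRY, while DISPLAY CHSTATUS reports it as RETRYING, so the sketch accepts both.

```python
from datetime import datetime, timedelta

def retry_alert(status, retry_since, last_error, now,
                threshold=timedelta(minutes=30)):
    """Return an alert string if the channel has been retrying longer than
    threshold, otherwise None."""
    if status not in ("RETRY", "RETRYING") or retry_since is None:
        return None
    elapsed = now - retry_since
    if elapsed <= threshold:
        return None
    minutes = int(elapsed.total_seconds() // 60)
    # Attach the error text so the on-call engineer sees the cause, not just the state.
    return f"Channel in RETRY for {minutes} min; last error: {last_error}"

t0 = datetime(2024, 1, 6, 2, 0)
print(retry_alert("RETRYING", t0, "connection refused", t0 + timedelta(minutes=45)))
```

Carrying the error text in the alert is the point of the pattern: it lets the responder go straight to the likely cause instead of restarting the channel blindly.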

Recovery Validation

  1. CHSTATUS RUNNING both sides if applicable.
  2. XMITQ depth decreasing or stable at zero for test.
  3. Test message put and confirmed at consumer.
  4. No new sequence errors in log for one hour.
  5. Update CMDB and close ticket with root cause class.
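
The first four checks can be recorded as a simple pass/fail summary for the ticket. This is a sketch under stated assumptions: the function names and inputs are hypothetical, the values are expected to come from whatever tooling already collects channel status and queue depth, and the CMDB update in step 5 stays a manual task.

```python
def depth_recovering(samples):
    """True when depth samples never rise and end at zero or below the first sample."""
    if not samples:
        return False
    non_increasing = all(b <= a for a, b in zip(samples, samples[1:]))
    return non_increasing and (samples[-1] == 0 or samples[-1] < samples[0])

def recovery_checks(sender_status, receiver_status, xmitq_depths,
                    test_message_confirmed, sequence_errors_last_hour):
    """Return a dict of check name to bool. receiver_status may be None
    when the partner side is not visible to you."""
    return {
        "channel running": sender_status == "RUNNING"
        and receiver_status in ("RUNNING", None),
        "xmitq depth decreasing or at zero": depth_recovering(xmitq_depths),
        "test message confirmed": bool(test_message_confirmed),
        "no sequence errors for one hour": sequence_errors_last_hour == 0,
    }

result = recovery_checks("RUNNING", "RUNNING", [500, 120, 0], True, 0)
print(all(result.values()))
```

Returning each check separately, rather than a single boolean, shows which step is still failing when recovery is only partial.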

Explainer: Alarm Clock That Keeps Ringing

Stuck in retry is an alarm that rings every few minutes while nobody gets out of bed to fix the broken door. The house is not secure until someone repairs the lock; buying a louder alarm does not help.

Explain Like I'm Five: Channel Stuck in Retry

Your toy phone keeps calling your friend but nobody fixed the broken wire—so it rings forever until a grown-up fixes the wire instead of turning up the volume.

Practice Exercises

Exercise 1

Write an on-call runbook section: RETRY more than 30 minutes with same LASTCHLERR.

Exercise 2

Role-play sender versus receiver checks for connection refused stuck RETRY.

Exercise 3

List monitoring metrics that detect stuck RETRY before users call the help desk.

Test Your Knowledge

1. Same LASTCHLERR every RETRY suggests:

  • Permanent config fault
  • Random success soon
  • No problem
  • DLQ full

2. STOP CHANNEL is used to:

  • Halt retry attempts for investigation
  • Delete XMITQ
  • Disable TLS globally
  • Remove CHLAUTH

3. Stuck in retry with growing XMITQ is:

  • Delivery SLA risk
  • Healthy
  • Expected always
  • Client-only

4. After DR sequence mismatch, stuck RETRY may need:

  • Coordinated RESET
  • Higher MAXDEPTH only
  • New model queue
  • Disable listener

Author: MainframeMaster
Verified: IBM MQ 9.3 documentation