Channel Retry Loops

A channel retry loop is what operations sees when DISPLAY CHSTATUS shows RETRY hour after hour, the transmission queue depth climbs, and AMQERR fills with the same channel error every few minutes. IBM MQ is doing what it was configured to do (schedule another connect after SHORTTMR or LONGTMR), but the underlying fault never cleared. Beginners increase retry counts or restart the queue manager; experienced teams read LASTCHLERR once, fix TLS or CONNAME, and the loop ends on the next successful bind. This tutorial explains how retry loops form, how short and long retry phases interact, why XMITQ backlog is the business impact, how to distinguish a flapping network from a wrong certificate, when RESET CHANNEL helps, and how to prevent loops through monitoring and change control.

Anatomy of a Retry Loop

A sender channel (CHLTYPE SDR) reads messages from its XMITQ and opens a session to the partner receiver. If TCP fails, TLS handshake fails, or MQ channel negotiation fails, the instance moves to RETRY. The queue manager counts down the remaining retry attempts and waits. When the timer expires, MQ tries again. If nothing changed on the network or configuration, the same failure occurs and the channel returns to RETRY. This is a loop: same channel name, same error family, predictable interval. Loops differ from a single retry after a brief blip; loops last beyond your incident threshold and correlate with a monotonic increase in XMITQ depth.

Retry timer attributes (sender channel)

Attribute   Phase                  Effect on loop
SHORTRTY    Short retry count      How many quick attempts before long phase
SHORTTMR    Short retry interval   Seconds between early retries; fast loop if low
LONGRTY     Long retry count       Additional attempts after short phase exhausted
LONGTMR     Long retry interval    Slower loop; still endless if fault remains
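
To see which values a channel is actually using, the four attributes can be displayed from the command line. The following is a minimal sketch: the helper name show_retry_timers is hypothetical, and the channel and queue manager names in the example call are assumptions based on the PARIS.TO.LONDON example used elsewhere in this tutorial.

```shell
# Hypothetical helper: print the retry timer attributes of one channel.
# $1 = channel name, $2 = queue manager name (both are assumptions here).
show_retry_timers() {
    echo "DISPLAY CHANNEL('$1') SHORTRTY SHORTTMR LONGRTY LONGTMR" | runmqsc "$2"
}

# Example call (requires an MQ installation and authority to run runmqsc):
# show_retry_timers PARIS.TO.LONDON QM_PARIS
```

On Multiplatforms the documented defaults are, as far as the product documentation states, SHORTRTY(10), SHORTTMR(60), LONGRTY(999999999) and LONGTMR(1200); confirm them against your own channel definitions rather than relying on these numbers.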

Symptoms Operations Notices

  • DISPLAY CHSTATUS shows STATUS(RETRY) with the remaining-attempt counters (SHORTRTS, LONGRTS) counting down.
  • XMITQ CURDEPTH near MAXDEPTH; remote applications may see indirect backlog.
  • AMQ9208 or SSL-related AMQ messages repeat at SHORTTMR or LONGTMR cadence.
  • Partner receiver shows no RUNNING instance or listener never sees connect.
  • Monitoring alerts on channel not RUNNING and queue depth percentage.
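
The first two symptoms above can be checked from a script. This is a rough sketch, not a monitoring product: channel_in_retry and depth_percent are hypothetical helper names, and the text match assumes the usual runmqsc output in which the state appears as STATUS(RETRY).

```shell
# Hypothetical helper: succeed if DISPLAY CHSTATUS output on stdin
# contains a channel instance in RETRY state.
channel_in_retry() {
    grep -qF 'STATUS(RETRY)'
}

# Hypothetical helper: integer percentage of MAXDEPTH currently used.
# $1 = CURDEPTH, $2 = MAXDEPTH
depth_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Example (requires runmqsc; names are assumptions):
# echo "DISPLAY CHSTATUS('PARIS.TO.LONDON') STATUS" | runmqsc QM_PARIS \
#     | channel_in_retry && echo "channel is retrying"
```

Combining the two signals (channel in RETRY and depth percentage above a threshold) is usually a better alert condition than either alone, because a short retry with an empty queue has little business impact.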

Common Root Causes

Network and listener

Firewall rule removed, wrong port in CONNAME, listener STOPPED, or DNS pointing to a decommissioned host. TCP timeouts produce retry loops that look like network outages. Test the partner port with telnet or nc from the sending host during the incident, not from your laptop unless that matches the channel path.
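
The port test can be driven directly from the channel's CONNAME value. The sketch below assumes the common host(port) form of CONNAME, an nc build that supports -z and -w, and hypothetical helper and host names; when no port is given it falls back to 1414, the MQ default.

```shell
# Hypothetical helper: split a CONNAME such as 'london.example.com(1414)'
# into "host port". Falls back to port 1414 when no port is given.
# Only the first entry of a comma-separated CONNAME list is used.
parse_conname() {
    first=${1%%,*}
    host=${first%%"("*}
    case $first in
        *"("*")") port=${first#*"("}; port=${port%")"} ;;
        *) port=1414 ;;
    esac
    echo "$host $port"
}

# Hypothetical helper: test TCP reachability of the partner listener.
# Run it on the sending queue manager's host, not on a workstation.
check_partner_port() {
    set -- $(parse_conname "$1")
    nc -z -w 5 "$1" "$2"
}

# Example (host name is an assumption):
# check_partner_port 'london.example.com(1414)' && echo "listener reachable"
```

A successful TCP connect only rules out the network and listener; TLS and channel authentication failures happen after this point.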

TLS and certificates

Expired personal certificate, missing intermediate CA, cipher mismatch on SSLCIPH, or SSLCAUTH REQUIRED without client cert. LASTCHLERR and AMQ9638-class messages point here. Fixing retry timers does not renew a certificate.

Channel authentication

CHLAUTH rules block partner IP, QMNAME, or SSLPEER DN. AMQERR often names the rule. The loop continues until the rule is corrected or the partner presents the expected identity.
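
One way to see which rule would match an inbound connection is DISPLAY CHLAUTH with MATCH(RUNCHECK), run on the receiving queue manager. The sketch below is an illustration under assumptions: chlauth_runcheck is a hypothetical helper name, and the IP address and queue manager names in the example call are placeholders.

```shell
# Hypothetical helper: ask the receiving queue manager which CHLAUTH rule
# would apply to an inbound connection.
# $1 = channel name, $2 = partner IP address,
# $3 = partner (sending) queue manager name,
# $4 = local (receiving) queue manager to run the command on.
chlauth_runcheck() {
    echo "DISPLAY CHLAUTH('$1') MATCH(RUNCHECK) ADDRESS('$2') QMNAME('$3')" | runmqsc "$4"
}

# Example (all values are placeholders):
# chlauth_runcheck PARIS.TO.LONDON 192.0.2.10 QM_PARIS QM_LONDON
```

If the rule returned is a blocking rule, the fix belongs in the CHLAUTH configuration or in the identity the partner presents, not in the retry timers.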

Sequence numbers and partner state

After restore or DR, sequence number mismatch prevents RUNNING. RESET CHANNEL on both sides may be required per runbook after confirmed consistent backup state—not as a first action.

Breaking the Loop: Triage Steps

  1. DISPLAY CHSTATUS(channel) ALL — note STATUS, LASTCHLERR, CONNAME, SSL attributes.
  2. Read AMQERR at last retry timestamp on both queue managers if accessible.
  3. Verify listener and network path from sending host.
  4. Compare channel definitions: name, CHLTYPE pair, TLS settings.
  5. Fix root cause; test one manual START CHANNEL or wait for next retry cycle.
  6. RESET CHANNEL only if the fault was a sequence number mismatch; then confirm RUNNING and XMITQ draining.
  7. Review XMITQ MAXDEPTH and alerting if outage duration was longer than design.
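
```shell
# Inspect channel state and transmission queue depth on QM_PARIS
echo "DISPLAY CHSTATUS('PARIS.TO.LONDON') ALL" | runmqsc QM_PARIS
echo "DISPLAY QLOCAL('QM_LONDON.XMIT') CURDEPTH MAXDEPTH" | runmqsc QM_PARIS
# Read the most recent error log entries
tail -30 /var/mqm/qmgrs/QM_PARIS/errors/AMQERR01.log
# After the fix (RESET CHANNEL only for a sequence number mismatch)
echo "RESET CHANNEL('PARIS.TO.LONDON')" | runmqsc QM_PARIS
echo "START CHANNEL('PARIS.TO.LONDON')" | runmqsc QM_PARIS
```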

Retry Loops vs Increasing Retries

Raising SHORTRTY or LONGRTY tolerates longer partner outages during planned maintenance; it does not fix wrong configuration. Use higher retries when business-approved outage windows exceed the current total wait, which is roughly SHORTRTY times SHORTTMR plus LONGRTY times LONGTMR. Document the maximum acceptable XMITQ depth for that wait. If loops run indefinitely in production without a planned outage, treat them as misconfiguration, not capacity tuning.
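
The wait and the backlog it implies are simple arithmetic. The sketch below uses hypothetical helper names and illustrative numbers (6 long retries, 50 messages per second); it ignores the time each connect attempt itself takes, so treat the result as a lower bound.

```shell
# Hypothetical helper: seconds a channel keeps retrying before both
# retry phases are exhausted.
# Arguments: SHORTRTY SHORTTMR LONGRTY LONGTMR
retry_window_seconds() {
    echo $(( $1 * $2 + $3 * $4 ))
}

# Hypothetical helper: messages that accumulate on the XMITQ at a steady
# put rate. $1 = messages per second, $2 = seconds
backlog_messages() {
    echo $(( $1 * $2 ))
}

window=$(retry_window_seconds 10 60 6 1200)   # 10*60 + 6*1200 = 7800
backlog_messages 50 "$window"                 # prints 390000
```

Comparing the backlog figure with the XMITQ MAXDEPTH shows whether the queue can absorb the planned outage or will fill before the retries run out.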

Explainer: Hamster Wheel

A retry loop is a hamster wheel: the channel keeps running in place—RETRY—without delivering mail. Mail piles up on the cart (XMITQ) beside the wheel. Stop fixing the wheel speed; open the door (fix TLS, network, or auth) so the hamster can exit to RUNNING.

Prevention

  • Alert on channel not RUNNING longer than N minutes.
  • Certificate expiry monitoring thirty days ahead.
  • CHLAUTH change review with partner IP and DN validation.
  • XMITQ depth percent thresholds tied to channel status dashboards.
  • Post-incident review: was LASTCHLERR read before first RESET?
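
The thirty-day certificate check above can be approximated with openssl. This is a sketch under assumptions: cert_valid_for_days is a hypothetical helper name, and it expects the certificate to have been exported from the queue manager's key repository to a PEM file first.

```shell
# Hypothetical helper: succeed if the PEM certificate in $1 is still valid
# $2 days from now (default 30). Relies on "openssl x509 -checkend", which
# exits non-zero when the certificate expires within the given seconds.
cert_valid_for_days() {
    openssl x509 -checkend $(( ${2:-30} * 86400 )) -noout -in "$1" >/dev/null
}

# Example (the path is a placeholder):
# cert_valid_for_days /path/to/qmgr-cert.pem 30 \
#     || echo "certificate expires within 30 days"
```

Run from a scheduler, a check like this raises the alarm before the channel ever reaches RETRY for an expired certificate.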

Explain Like I'm Five: Channel Retry Loops

Your toy car tries to drive to a friend's house but the bridge is out. Every few minutes it tries again and cannot cross. Toys pile up in the car trunk because they cannot be delivered. Fix the bridge—not the timer on how often the car tries.

Practice Exercises

Exercise 1

Lab: break CONNAME, observe RETRY and XMITQ depth over three SHORTTMR cycles. Record LASTCHLERR.

Exercise 2

Write a runbook that gives different actions for a retry loop reporting AMQ9638 versus one reporting AMQ9208.

Exercise 3

Calculate XMITQ messages accumulated during 4 hours of retry at 200 msg/sec put rate.

Frequently Asked Questions

Test Your Knowledge

1. Retry loop means channel keeps:

  • Entering RETRY after failed connect
  • Running forever
  • Deleting XMITQ
  • Disabling TLS

2. Fix root cause before:

  • RESET CHANNEL only
  • DELETE QMGR
  • Removing all queues
  • Disabling logs

3. XMITQ grows during retry loop because:

  • Messages cannot be transmitted
  • Consumers too fast
  • MAXDEPTH is zero
  • DLQ disabled

4. LASTCHLERR helps identify:

  • Last failure reason on channel
  • Queue CURDEPTH only
  • LDAP port
  • JCL class

Read time: 20 min
Author: MainframeMaster
Verified: IBM MQ 9.3 documentation