Production troubleshooting interviews ask whether you can run an incident without making it worse. The panel listens for safe sequencing: confirm scope, protect data, gather evidence, apply the smallest fix that restores service, document the timeline, and schedule root cause analysis once service is stable. Technical depth matters (knowing RETRYING versus STOPPED, when to RESET CHANNEL, how the backout threshold moves messages to the DLQ), but so does judgment: you do not restart a queue manager during peak settlement without approval and a backup plan. This page lists common production interview questions with model answers tied to AMQERR, FDC, trace, metrics, and bridge communication.
Acknowledge the alert and assign an incident commander. Confirm customer impact (which flows stopped). Identify the affected queue managers, channels, and queues. Establish a war room chat with the app, network, and DBA teams. Gather DISPLAY QSTATUS and DISPLAY CHSTATUS snapshots. Apply known quick fixes only if they are low risk (restart a dead consumer, fix an obvious firewall rule). Communicate every fifteen minutes with impact and ETA. After service is restored, capture the timeline for the post-incident review and engage problem management to deliver root cause within 48 hours.
Restart a queue manager only when IBM support or the runbook indicates an unrecoverable internal state and stakeholders accept a planned outage, or when a clean shutdown is required for maintenance. It is not the first action for a channel in retry: prefer fixing the channel, consumer, or authorization first. Always assess in-flight persistent messages and the applications that need to coordinate around the restart.
AMQERR is the queue manager error log. Messages include AMQ codes with text explaining failures—channel errors, authority failures, resource problems. Tail the log on the failing host filtered by time of incident. Correlate with application timestamp and MsgId. Share relevant lines in the incident ticket, not the whole multi-gigabyte file.
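When sharing only the relevant lines, it can help to first see which AMQ codes dominate the incident window. The following is a minimal sketch, assuming the log text has already been cut down to the time of the incident; the sample lines are hypothetical and written for illustration, not copied from a real log.

```python
import re
from collections import Counter

# Matches identifiers such as AMQ9999 or AMQ9999E
# (an optional trailing severity letter is allowed).
AMQ_CODE = re.compile(r"\bAMQ\d{4}[A-Z]?\b")


def count_amq_codes(log_text: str) -> Counter:
    """Count how often each AMQ code appears in a chunk of error-log text."""
    return Counter(AMQ_CODE.findall(log_text))


# Hypothetical excerpt for illustration only.
sample = """\
AMQ9999E: Channel 'TO.PARTNER' to host 'partner.example' ended abnormally.
AMQ9202E: Remote host not available, retry later.
AMQ9999E: Channel 'TO.PARTNER' to host 'partner.example' ended abnormally.
"""

for code, occurrences in count_amq_codes(sample).most_common():
    print(code, occurrences)
```

The most frequent code is usually a good starting point for the ticket summary, with the full matching lines attached underneath.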
First Failure Data Capture records are written for serious queue manager or channel agent failures. They assist IBM support analysis. In interviews, mention locating FDC per IBM documentation and opening a case for repeated FDC storms—not deleting queue managers blindly.
Use trace when logs are insufficient and you can reproduce or isolate the failing traffic. Start it with strmqtrc using the trace types appropriate to the failing component, reproduce the problem once, stop it with endmqtrc, and analyze the output with support tools. Always disable trace afterward, because it costs performance and disk space. Mention change control approval in regulated estates.
    DISPLAY QSTATUS('PAYMENTS.REQUEST') CURDEPTH IPPROCS OPPROCS
    DISPLAY CHSTATUS(*) WHERE(STATUS NE RUNNING)
    DISPLAY QMSTATUS
    * Production triage trio: queue depth, broken channels, QM health
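The DISPLAY QSTATUS attributes lend themselves to a quick first-pass reading. Below is a hedged sketch, assuming runmqsc prints attributes in the usual `NAME(VALUE)` form; the sample output is hand-written for illustration, and the triage wording in `triage_queue` is a hypothetical rule of thumb rather than an official procedure.

```python
import re

# runmqsc prints attributes as NAME(VALUE) pairs.
ATTR = re.compile(r"([A-Z]+)\(([^)]*)\)")


def parse_attrs(mqsc_output: str) -> dict:
    """Collect NAME(VALUE) pairs from MQSC display output into a dict."""
    return {name: value.strip() for name, value in ATTR.findall(mqsc_output)}


def triage_queue(attrs: dict) -> str:
    """Give a first-pass reading of queue depth versus attached readers."""
    depth = int(attrs.get("CURDEPTH", 0))
    readers = int(attrs.get("IPPROCS", 0))
    if depth > 0 and readers == 0:
        return "backlog with no consumer attached: check the consuming application"
    if depth > 0:
        return "backlog with consumers attached: check consumer speed or poison messages"
    return "queue is empty: look upstream (channels, producers)"


# Hand-written sample for illustration only.
sample = """
   QUEUE(PAYMENTS.REQUEST)                 TYPE(QUEUE)
   CURDEPTH(12000)                         IPPROCS(0)
   OPPROCS(3)
"""

attrs = parse_attrs(sample)
print(attrs["QUEUE"], "->", triage_queue(attrs))
```

A depth of 12000 with IPPROCS at 0 points at the consumer side before anything else, which is the kind of reasoning the panel wants to hear spoken aloud.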
Do not RESET CHANNEL repeatedly without fixing root cause—that can worsen sequence number issues. Verify network path, certificates, CHLAUTH, partner maintenance window. If partner confirms fix, allow retry or coordinated channel restart per runbook. Monitor XMITQ; if disk risk, escalate to business for traffic pause. Document sequence number errors separately—they need IBM procedure, not guesswork.
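To decide when a growing XMITQ justifies escalation, a rough headroom estimate from two depth samples is often enough. This is a minimal sketch of that arithmetic, assuming linear growth and using the queue's maximum depth as a stand-in for the real limit; the function name and the numbers are hypothetical.

```python
def minutes_until_full(depth_then: int, depth_now: int,
                       minutes_between: float, max_depth: int):
    """Estimate minutes until the queue reaches max_depth at the observed growth rate.

    Returns None when the depth is flat or falling.
    """
    growth_per_min = (depth_now - depth_then) / minutes_between
    if growth_per_min <= 0:
        return None
    return max(0.0, (max_depth - depth_now) / growth_per_min)


# Two samples taken five minutes apart: 10,000 then 12,000 messages,
# against a hypothetical maximum depth of 50,000.
print(minutes_until_full(10_000, 12_000, 5, 50_000))  # prints 95.0
```

An estimate like this gives the bridge update a concrete number: at the current rate the queue has roughly an hour and a half of headroom, so the traffic-pause decision can be scheduled rather than rushed.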
Compare the cipher suite, certificate chain, and trust store on both sides, the CHLAUTH SSLPEERMAP rules, and the expiry dates in the local and partner keystores. Test the handshake independently of MQ, for example with openssl s_client or an equivalent tool. Roll back the certificate if the new intermediate CA is missing on the partner side.
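Expiry dates are the easiest of these checks to automate. The sketch below, assuming you already have the certificate's `notAfter` string in the format Python's `ssl` module uses (for example from `SSLSocket.getpeercert()`), computes the days remaining; the dates are made up for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate notAfter string such as
    'Jan  1 00:00:00 2030 GMT'. Negative means already expired."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


# Hypothetical values for illustration.
now = datetime(2029, 12, 22, tzinfo=timezone.utc)
print(days_until_expiry("Jan  1 00:00:00 2030 GMT", now))  # prints 10.0
```

Running the same check against both the local and the partner certificate shows quickly whether expiry explains the handshake failure before you move on to cipher and chain comparisons.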
| Action | Risk | When appropriate |
|---|---|---|
| Restart consumer app | Low-medium | Process hung, queue backing up |
| START CHANNEL | Medium | Channel STOPPED after known fix |
| RESET CHANNEL | Medium-high | Runbook allows; seqnum understood |
| endmqm / strmqm | High | Approved maintenance or IBM direction |
| ALTER QMGR CHLAUTH(DISABLED) | Critical | Never propose for production in an interview answer |
Browse a sample of messages with amqsbcg or approved tooling; avoid amqsget for sampling, because it removes the messages it reads. Identify the common error: schema version, bad currency code, missing field. Stop requeueing until the publisher or consumer is patched. Clear the DLQ only after stakeholder sign-off and an audit trail export. Adjust BOTHRESH if the backout threshold proved too aggressive or too lenient.
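A poison message makes the consumer roll back repeatedly; each rollback increments the message's backout count until it reaches BOTHRESH and the consumer (or its messaging layer, such as JMS) moves the message to the queue named in BOQNAME. Fix the consumer to handle the message or skip it with an audit record. For the interview, explain why infinite retry hurts more than DLQ routing: the poison message blocks everything behind it and burns CPU, while a routed message is parked for analysis and the flow continues.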
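The backout mechanism can be illustrated with a small simulation. This is a sketch in plain Python, not MQ client code: the in-memory queue, the `drain` function, and the failing `handler` are all hypothetical stand-ins for a transactional consumer that honours a BOTHRESH-style limit.

```python
from collections import deque


def drain(messages, handler, backout_threshold):
    """Simulate a consumer that rolls back on failure and diverts poison messages."""
    queue = deque((body, 0) for body in messages)  # (body, backout_count)
    processed, backout_queue = [], []
    while queue:
        body, backouts = queue.popleft()
        if backouts >= backout_threshold:
            # Threshold reached: park the message instead of retrying forever.
            backout_queue.append(body)
            continue
        try:
            handler(body)
            processed.append(body)
        except ValueError:
            # Rollback: the message returns to the head with its count incremented.
            queue.appendleft((body, backouts + 1))
    return processed, backout_queue


def handler(body):
    """Hypothetical business logic that cannot parse one message."""
    if body == "bad-currency":
        raise ValueError("unparseable message")


done, parked = drain(["ok-1", "bad-currency", "ok-2"], handler, backout_threshold=3)
print(done, parked)  # prints ['ok-1', 'ok-2'] ['bad-currency']
```

Without the threshold check the loop would never reach "ok-2", which is the blocking behaviour that makes unbounded retry worse than routing the message aside.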
Investigate log I/O, disk latency, CPU saturation, long-running syncpoints holding locks, channel bandwidth, and network packet loss. Use accounting and statistics data. Low queue depth rules out a simple consumer stop, so look at round-trip time and log waits.
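When a coupling facility structure fills, shared queues in the queue sharing group (QSG) stop accepting puts, so the impact is broader than one application queue. Involve the z/OS systems programmer; the fix may need a structure alter or a message drain. This is a different playbook from hitting MAXDEPTH on a distributed queue manager.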
The toy factory stopped moving toys. You look at the alarm bell log (AMQERR), see which belt is stuck (queue), check the tunnel between buildings (channel), fix the safe problem, and tell everyone when toys move again—without smashing the whole factory on the first guess.
Write a five-line bridge update for this scenario: channel RETRYING with 12k XMITQ depth.
List the DISPLAY commands you would run in the first ten minutes of a queue-full incident.
Draft post-incident review headings for a poison message outbreak.
1. AMQERR contains:
2. Before requeueing DLQ messages you should:
3. strmqtrc is used to:
4. Bridge call communication should include: