IBM MQ Production Troubleshooting Interview Questions

Production troubleshooting interviews ask whether you can run an incident without making it worse. The panel listens for safe sequencing: confirm scope, protect data, gather evidence, apply the smallest fix that restores service, document the timeline, and schedule root-cause analysis once service is stable. Technical depth matters (knowing RETRYING versus STOPPED, when to RESET CHANNEL, how the backout threshold routes messages to a backout queue or the DLQ), but so does judgment: you do not restart a queue manager during peak settlement without approval and a backup plan. This page lists common production interview questions with model answers tied to AMQERR, FDC, trace, metrics, and bridge communication.

Incident Leadership Questions

Walk me through how you handle a severity-1 MQ incident.

Acknowledge the alert and assign an incident commander. Confirm customer impact (which flows have stopped). Identify the affected queue managers, channels, and queues. Open a war-room chat with the application, network, and DBA teams. Capture DISPLAY QSTATUS and DISPLAY CHSTATUS snapshots. Apply known quick fixes only if they are low risk (restart a dead consumer, correct an obvious firewall rule). Communicate every fifteen minutes with impact and ETA. After service is restored, capture the timeline for the post-incident review and engage problem management for root cause within 48 hours.
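The fifteen-minute communication rhythm is easier to keep when each update has a fixed shape. The following is a minimal sketch of such a template in Python; the class name `BridgeUpdate` and its fields are hypothetical and only illustrate the "impact, action, ETA, next update" structure described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Cadence taken from the model answer: one update every fifteen minutes.
UPDATE_INTERVAL = timedelta(minutes=15)


@dataclass
class BridgeUpdate:
    """One status message for the incident bridge (hypothetical structure)."""
    sent_at: datetime
    impact: str
    current_action: str
    eta: str

    def next_update_due(self) -> datetime:
        # The next update is promised relative to when this one was sent.
        return self.sent_at + UPDATE_INTERVAL

    def render(self) -> str:
        return (
            f"[{self.sent_at:%H:%M}] Impact: {self.impact} | "
            f"Action: {self.current_action} | ETA: {self.eta} | "
            f"Next update: {self.next_update_due():%H:%M}"
        )
```

Stating the time of the next update in every message lets the bridge stay quiet between updates without anyone wondering whether the incident has been forgotten.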

When would you restart a queue manager?

When IBM Support or the runbook indicates an unrecoverable internal state and stakeholders accept a planned outage, or when a clean shutdown is required for maintenance. Not as the first action for a retrying channel. Prefer fixing the channel, consumer, or authority problem first. Always assess in-flight persistent messages and the applications that coordinate with the queue manager.

Diagnostic Tools

What is AMQERR and how do you use it?

AMQERR refers to the queue manager error logs: AMQERR01.LOG is the current file, with AMQERR02.LOG and AMQERR03.LOG holding older entries, in the queue manager's errors directory. Messages include AMQ codes with text explaining failures: channel errors, authority failures, resource problems. Tail the log on the failing host, filtered to the time of the incident. Correlate with the application timestamp and MsgId. Share the relevant lines in the incident ticket, not the whole multi-gigabyte file.

What are FDC files?

First Failure Data Capture records are written for serious queue manager or channel agent failures. They assist IBM support analysis. In interviews, mention locating FDC per IBM documentation and opening a case for repeated FDC storms—not deleting queue managers blindly.

When do you enable MQ trace?

When logs are insufficient and you can reproduce or isolate traffic. Use strmqtrc with appropriate classes for the failing component, reproduce once, endmqtrc, analyze with support tools. Disable trace afterward; trace impacts performance and disk. Mention change control approval in regulated estates.

    * Production triage trio: queue depth, broken channels, QM health
    DISPLAY QSTATUS('PAYMENTS.REQUEST') CURDEPTH IPPROCS OPPROCS
    DISPLAY CHSTATUS(*) WHERE(STATUS NE RUNNING)
    DISPLAY QMSTATUS ALL
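Snapshots of DISPLAY output are more useful in a ticket when the broken channels are pulled out automatically. The sketch below parses captured runmqsc text in Python. It assumes the usual `KEY(VALUE)` layout with each object introduced by a line starting with an AMQ message id; the exact header text varies by version, and the function names are hypothetical.

```python
import re

# KEY(VALUE) pairs; the value may contain one nested pair of parentheses,
# as in CONNAME(10.0.0.9(1414)).
ATTRIBUTE = re.compile(r"([A-Z0-9]+)\(((?:[^()]|\([^()]*\))*)\)")


def parse_display_output(text):
    """Split captured DISPLAY output into one attribute dict per object."""
    objects = []
    for line in text.splitlines():
        if line.lstrip().startswith("AMQ"):
            objects.append({})  # assumed: each AMQ header line starts an object
            continue
        if objects:
            for key, value in ATTRIBUTE.findall(line):
                objects[-1][key] = value.strip()
    return [obj for obj in objects if obj]


def channels_needing_attention(text):
    """Return (channel, status) for every channel that is not RUNNING."""
    return [
        (obj.get("CHANNEL"), obj.get("STATUS"))
        for obj in parse_display_output(text)
        if obj.get("STATUS") != "RUNNING"
    ]
```

Keywords printed without parentheses, such as CURRENT, are ignored by this sketch because they carry no value to compare.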

Channel and Network Incidents

Partner channel RETRYING for hours. Safe actions?

Do not RESET CHANNEL repeatedly without fixing root cause—that can worsen sequence number issues. Verify network path, certificates, CHLAUTH, partner maintenance window. If partner confirms fix, allow retry or coordinated channel restart per runbook. Monitor XMITQ; if disk risk, escalate to business for traffic pause. Document sequence number errors separately—they need IBM procedure, not guesswork.
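Deciding when to escalate for a traffic pause is easier with a rough estimate of how long the transmission queue has before it fills. The following Python sketch does that arithmetic from two depth samples. It assumes linear growth, which real traffic rarely follows exactly, and the function name `minutes_until_full` is hypothetical.

```python
def minutes_until_full(depth_then, depth_now, minutes_between, max_depth):
    """Estimate minutes until the queue reaches max_depth.

    Returns None when the queue is not growing. Assumes the growth rate
    observed between the two samples continues unchanged.
    """
    growth_per_minute = (depth_now - depth_then) / minutes_between
    if growth_per_minute <= 0:
        return None
    return (max_depth - depth_now) / growth_per_minute
```

For example, a queue that grew from 10,000 to 12,000 messages in ten minutes, with a maximum depth of 50,000, is growing at 200 messages per minute and has about 190 minutes left at that rate. The same estimate should be made for the disk behind the queue, since storage can run out before the depth limit is reached.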

SSL handshake failures after cert renewal.

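Compare the cipher specs (SSLCIPH), certificate chain, and trust store on both sides, the CHLAUTH SSLPEERMAP rules, and the expiry dates in the local and partner keystores. Test the handshake independently of MQ, in the spirit of openssl s_client, to see which chain each side actually presents. Roll back the certificate if the new intermediate CA is missing on the partner side.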
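The expiry-date comparison in particular is simple to script. The sketch below uses Python's standard `ssl.cert_time_to_seconds` to turn a certificate's notAfter string into days remaining. It assumes the string is in the `Mon DD HH:MM:SS YYYY GMT` form that `openssl x509 -enddate` prints after `notAfter=`; how the string is extracted from an MQ keystore is outside this sketch, and the function name is hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until notAfter; negative once expired."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return int((expiry_epoch - now.timestamp()) // 86400)


if __name__ == "__main__":
    # Illustrative value only, not taken from a real certificate.
    remaining = days_until_expiry(
        "Jan  1 00:00:00 2025 GMT", datetime.now(timezone.utc)
    )
    print(f"days remaining: {remaining}")
```

Running the same check against both the local and the partner certificate before a renewal window gives an early warning that is independent of the channel ever failing.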

Production action risk levels

Action                | Risk        | When appropriate
Restart consumer app  | Low-medium  | Process hung, queue backing up
START CHANNEL         | Medium      | Channel STOPPED after known fix
RESET CHANNEL         | Medium-high | Runbook allows; seqnum understood
endmqm / strmqm       | High        | Approved maintenance or IBM direction
Disable CHLAUTH       | Critical    | Never offer this as a production answer in an interview

Queue Depth and Poison Messages

DLQ suddenly receives thousands of messages.

Browse a sample of messages with amqsbcg or approved tooling; browsing leaves them on the queue, whereas amqsget would remove them. Identify the common error: schema version, bad currency code, missing field. Stop requeueing until the publisher or consumer is patched. Clear the DLQ only after stakeholder sign-off and an audit trail export. Adjust BOTHRESH if the backout threshold proved too aggressive or too lenient.
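Once a sample has been taken, grouping it by failure reason usually shows whether there is one root cause or several. A minimal Python sketch follows. It assumes each sampled message has already been reduced to a dict with hypothetical `reason` and `dest_queue` keys (for instance taken from the dead-letter header); how those values are extracted depends on the tooling in use.

```python
from collections import Counter


def summarize_dlq_sample(messages):
    """Count sampled DLQ messages by (reason, destination queue).

    Returns a list of ((reason, dest_queue), count), most common first.
    """
    counts = Counter((m["reason"], m["dest_queue"]) for m in messages)
    return counts.most_common()
```

A single dominant pair points at one publisher or consumer to patch; a long tail of different reasons suggests the sample is too small or that several problems overlap.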

One message blocks the queue (poison).

Consumer rolls back repeatedly; backout counter increments until BOQNAME receives it. Fix consumer to handle or skip with audit. For interview: explain why infinite retry hurts more than DLQ routing.
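The backout mechanism can be illustrated with a small simulation. The Python sketch below is a model, not MQ client code: plain lists stand in for the queues, and the function and parameter names are hypothetical. It reflects the point that the queue manager only maintains the backout count, while the consuming application (or its client library) compares that count with the threshold and moves the message.

```python
def process_with_backout(messages, handler, backout_threshold):
    """Deliver each message until it succeeds or reaches the threshold.

    Returns (processed, backout_queue). A handler exception models a
    rollback, after which the same message is redelivered.
    """
    processed, backout_queue = [], []
    for body in messages:
        backout_count = 0
        while True:
            if backout_count >= backout_threshold:
                backout_queue.append(body)  # stands in for a put to BOQNAME
                break
            try:
                handler(body)
            except Exception:
                backout_count += 1  # rollback increments the backout count
                continue
            processed.append(body)
            break
    return processed, backout_queue
```

Without the threshold check the inner loop would never exit for a message the handler cannot process, which is the infinite-retry case: every message behind the poison one waits forever, whereas routing it aside lets the rest of the queue drain.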

Performance Incidents

Everything is slow but depth is low.

Investigate log I/O, disk latency, CPU saturation, long-running syncpoint holding locks, channel bandwidth, or network packet loss. Use accounting and statistics. Depth low rules out simple consumer stop—look at round-trip time and log waits.

z/OS Production Notes

Coupling facility structure full.

Shared queues in QSG stop accepting puts; broader than one application queue. Involve z/OS systems programmer; may need structure alter or message drain. Different playbook from distributed MAXDEPTH.

Communication and Post-Incident

What do you put in the incident ticket?

  • Start and end time, impact, queues and channels affected.
  • Commands run and who approved them.
  • AMQERR excerpts and sample MsgIds.
  • Workaround applied and permanent fix ticket reference.
  • Lessons learned and monitoring gap.

Explain Like I'm Five: Production Troubleshooting

The toy factory stopped moving toys. You look at the alarm bell log (AMQERR), see which belt is stuck (queue), check the tunnel between buildings (channel), fix the safe problem, and tell everyone when toys move again—without smashing the whole factory on the first guess.

Practice Exercises

Exercise 1

Write a five-line bridge update for this scenario: a channel in RETRYING with 12k messages on the XMITQ.

Exercise 2

List the DISPLAY commands you would run in the first ten minutes of a queue-full incident.

Exercise 3

Draft post-incident review headings for a poison message outbreak.

Frequently Asked Questions

Test Your Knowledge

1. AMQERR contains:

  • Queue manager error messages
  • Only JCL
  • Browser cache
  • COBOL listing

2. Before requeueing DLQ messages you should:

  • Fix root cause
  • Always delete QM
  • Disable all consumers
  • Turn off TLS

3. strmqtrc is used to:

  • Capture detailed diagnostics
  • Start channel always
  • Format disk
  • Compile Java

4. Bridge call communication should include:

  • Impact, ETA, workaround
  • Only blame
  • No status
  • Secret passwords
Read time: 21 min
Author: MainframeMaster
Verified: IBM MQ 9.4 documentation