What is different about production troubleshooting interviews?

They focus on live incident command: severity, blast radius, communication, evidence collection, safe changes, rollback, and post-incident review—not only technical root cause. Panels may include operations managers who care about MTTR and customer impact.

Should I mention strmqtrc in interviews?

Yes, with caution: trace is powerful and can fill disk. Say that you would enable trace for the one failing process under IBM guidance, reproduce the problem, stop the trace with endmqtrc, and analyze the output, rather than leaving trace on for weeks.

What logs do production MQ admins check first?

AMQERR on the queue manager, FDC files for severe errors, DISPLAY CHSTATUS and QSTATUS, application logs with MsgId, and platform logs (firewall, Kubernetes events). Order depends on symptom.

How do I answer if I have no on-call experience?

Use lab scenarios and describe the process you would follow. Honesty plus structured method beats inventing war stories.

What tutorials support this page?

AMQERR logs, FDC files, MQ trace, channel retry loops, SSL failures, authentication failures, and poison messages tutorials align with these interview themes.

MainframeMaster

IBM MQ Production Troubleshooting Interview Questions

Production troubleshooting interviews ask whether you can run an incident without making it worse. The panel listens for safe sequencing: confirm scope, protect data, gather evidence, apply the smallest fix that restores service, document the timeline, and schedule root cause analysis after stability. Technical depth matters (knowing RETRYING versus STOPPED, when to RESET CHANNEL, how a backout threshold gets poison messages requeued to a backout queue or the DLQ), but so does judgment: you do not restart a queue manager during peak settlement without approval and a backup plan. This page lists common production interview questions with model answers tied to AMQERR, FDC, trace, metrics, and bridge communication.

Incident Leadership Questions

Walk me through how you handle a severity-1 MQ incident.

Acknowledge the alert and assign an incident commander. Confirm customer impact (which flows stopped). Identify affected queue managers, channels, and queues. Establish a war room chat with app, network, and DBA teams. Gather DISPLAY QSTATUS and CHSTATUS snapshots. Apply known quick fixes only if low risk (restart a dead consumer, fix an obvious firewall rule). Communicate every fifteen minutes with impact and ETA. After restore, capture the timeline for post-incident review and engage problem management so that root cause is delivered within 48 hours.

When would you restart a queue manager?

When IBM support or runbook indicates unrecoverable internal state and stakeholders accept planned outage—or when clean shutdown is required for maintenance. Not as first action for channel retry. Prefer fixing channel, consumer, or auth first. Always assess in-flight persistent messages and coordinating applications.

Diagnostic Tools

What is AMQERR and how do you use it?

AMQERR refers to the queue manager error logs: AMQERR01.LOG is the current file, with AMQERR02.LOG and AMQERR03.LOG holding older entries. Messages include AMQ codes with text explaining failures such as channel errors, authority failures, and resource problems. Read the log on the failing host, filtered to the incident time window. Correlate entries with application timestamps and MsgId. Share relevant lines in the incident ticket, not the whole multi-gigabyte file.
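One way to decide which lines deserve a place in the ticket is to count message codes in an excerpt first. The sketch below is illustrative only: the sample excerpt and channel name are simplified inventions, and the regular expression assumes codes shaped like AMQ followed by four or five digits and an optional severity letter.

```python
import re
from collections import Counter

# Assumed code shape: "AMQ" + 4 or 5 digits + optional severity letter.
AMQ_CODE = re.compile(r"\bAMQ\d{4,5}[IWEST]?\b")


def count_amq_codes(log_text: str) -> Counter:
    """Count how often each AMQ message code appears in a log excerpt."""
    return Counter(AMQ_CODE.findall(log_text))


# Invented, simplified sample; real entries carry headers, explanations, and actions.
SAMPLE = """\
AMQ9999E: Channel 'TO.PARTNER' ended abnormally.
AMQ9001I: Channel 'TO.PARTNER' ended normally.
AMQ9999E: Channel 'TO.PARTNER' ended abnormally.
"""

if __name__ == "__main__":
    # Most frequent codes first, as a starting point for what to quote in the ticket.
    for code, hits in count_amq_codes(SAMPLE).most_common():
        print(code, hits)
```

The most frequent code is usually a reasonable first thing to look up, but a single rare severe message can matter more than a noisy informational one.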

What are FDC files?

First Failure Data Capture records are written for serious queue manager or channel agent failures. They assist IBM support analysis. In interviews, mention locating FDC per IBM documentation and opening a case for repeated FDC storms—not deleting queue managers blindly.

When do you enable MQ trace?

When logs are insufficient and you can reproduce or isolate traffic. Use strmqtrc with trace types and a process filter appropriate to the failing component, reproduce once, stop with endmqtrc, and analyze with support tools. Never leave trace running afterward; trace impacts performance and disk. Mention change control approval in regulated estates.
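The "start, reproduce once, always stop" discipline can be sketched as a small wrapper. This is a minimal sketch, not a supported procedure: the queue manager name QM1, the program name amqrmppa, and the `-t detail` trace type are example values, and the options to use in a real incident should come from IBM support or your runbook. The `run` parameter is a hypothetical hook so the commands can be printed in a dry run instead of executed.

```python
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_bounded_trace(qmgr: str, program: str,
                      reproduce: Callable[[], None],
                      run: Runner) -> None:
    """Start trace for one program, reproduce once, and always end trace."""
    run(["strmqtrc", "-m", qmgr, "-t", "detail", "-p", program])
    try:
        reproduce()
    finally:
        # Trace is ended even if the reproduction step raises.
        run(["endmqtrc", "-m", qmgr])


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    run_bounded_trace("QM1", "amqrmppa",
                      reproduce=lambda: None,
                      run=lambda cmd: print(" ".join(cmd)))
```

On a real host, `run` could be something like `lambda cmd: subprocess.run(cmd, check=True)`; the point of the `finally` block is that a failed reproduction never leaves trace switched on.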

```mqsc
DISPLAY QSTATUS('PAYMENTS.REQUEST') CURDEPTH IPPROCS OPPROCS
DISPLAY CHSTATUS(*) WHERE(STATUS NE RUNNING)
DISPLAY QMSTATUS
* Production triage trio: queue depth, broken channels, QM health
```
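Output from commands like these is easier to compare across snapshots once it is parsed. The following is a rough sketch that assumes the usual ATTR(VALUE) layout of a single DISPLAY QSTATUS response; the sample text is illustrative rather than captured from a live system, and the parser does not handle values that themselves contain parentheses (for example CONNAME).

```python
import re
from typing import Dict

# Matches ATTR(VALUE) pairs; assumes VALUE has no nested parentheses.
ATTR = re.compile(r"([A-Z]+)\(([^)]*)\)")


def parse_attrs(mqsc_output: str) -> Dict[str, str]:
    """Collect ATTR(VALUE) pairs from one DISPLAY response."""
    return {name: value.strip() for name, value in ATTR.findall(mqsc_output)}


def consumer_missing(status: Dict[str, str]) -> bool:
    """True when messages are waiting but nothing has the queue open for input."""
    return int(status["CURDEPTH"]) > 0 and int(status["IPPROCS"]) == 0


# Illustrative response text, not taken from a real queue manager.
SAMPLE = """\
AMQ8450I: Display queue status details.
   QUEUE(PAYMENTS.REQUEST)                 TYPE(QUEUE)
   CURDEPTH(12000)                         IPPROCS(0)
   OPPROCS(3)
"""

if __name__ == "__main__":
    status = parse_attrs(SAMPLE)
    print(status["QUEUE"], "consumer missing:", consumer_missing(status))
```

A depth above zero with IPPROCS at zero points at a stopped or disconnected consumer, which is the low-risk fix to check before touching channels or the queue manager.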

Channel and Network Incidents

Partner channel RETRYING for hours. Safe actions?

Do not RESET CHANNEL repeatedly without fixing root cause—that can worsen sequence number issues. Verify network path, certificates, CHLAUTH, partner maintenance window. If partner confirms fix, allow retry or coordinated channel restart per runbook. Monitor XMITQ; if disk risk, escalate to business for traffic pause. Document sequence number errors separately—they need IBM procedure, not guesswork.
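When deciding whether to escalate for a traffic pause, a simple linear estimate of how long the transmission queue has before it fills can anchor the conversation. This is a back-of-the-envelope sketch with made-up numbers; it looks only at message depth against MAXDEPTH and says nothing about disk usage, which depends on message sizes and logging.

```python
from typing import Optional


def minutes_until_full(depth_then: int, depth_now: int,
                       minutes_between: float, max_depth: int) -> Optional[float]:
    """Linear estimate of minutes until depth reaches max_depth; None if not growing."""
    growth_per_minute = (depth_now - depth_then) / minutes_between
    if growth_per_minute <= 0:
        return None
    return (max_depth - depth_now) / growth_per_minute


if __name__ == "__main__":
    # Illustrative samples: 9,000 then 12,000 messages ten minutes apart, MAXDEPTH 50,000.
    print(minutes_until_full(9000, 12000, 10, 50000))
```

With those illustrative numbers the queue grows by 300 messages a minute and has a little over two hours of headroom, which is the kind of figure a bridge update can state plainly.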

SSL handshake failures after cert renewal.

Compare the cipher configuration, certificate chain, and trust store on both sides, the CHLAUTH SSLPEERMAP rules, and expiry dates in the local and partner keystores. Test the handshake independently with openssl s_client or an equivalent tool. Roll back the certificate if the new intermediate CA is missing on the partner side.
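Expiry checks are easy to get wrong by eye, so a small calculation helps. The sketch below assumes you already have the certificate's notAfter text in the form printed by OpenSSL tooling (for example "Jan  5 09:34:43 2018 GMT"); how you extract it from a keystore is outside this example.

```python
import ssl


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between now_epoch and a certificate's notAfter timestamp (negative if expired)."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now_epoch) / 86400


if __name__ == "__main__":
    import time
    # Example notAfter string; replace with the value from the certificate under test.
    print(days_until_expiry("Jan  5 09:34:43 2018 GMT", time.time()))
```

A negative result means the certificate has already expired; running the same check against both the local and the partner certificate covers the case where only one side was renewed.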

Production action risk levels
Action	Risk	When appropriate
Restart consumer app	Low-medium	Process hung, queue backing up
START CHANNEL	Medium	Channel STOPPED after known fix
RESET CHANNEL	Medium-high	Runbook allows; seqnum understood
endmqm / strmqm	High	Approved maintenance or IBM direction
ALTER QMGR CHLAUTH(DISABLED)	Critical	Never propose in a production interview answer

Queue Depth and Poison Messages

DLQ suddenly receives thousands of messages.

Browse a sample of messages with amqsbcg or approved tooling; avoid destructive gets such as amqsget until the messages are safely exported. Identify the common error: schema version, bad currency code, missing field. Stop requeueing until the publisher or consumer is patched. Clear the DLQ only after stakeholder sign-off and an audit trail export. Adjust BOTHRESH if the backout threshold proved too aggressive or too lenient.

One message blocks the queue (poison).

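The consumer rolls back repeatedly and the message's backout count increments on each redelivery. The queue manager only stores BOTHRESH and BOQNAME; it is the consumer (or a layer such as the JMS client) that compares the backout count with BOTHRESH and requeues the message to BOQNAME. Fix the consumer to handle the message or skip it with an audit record. For the interview: explain why infinite retry hurts more than DLQ routing.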
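The backout decision a consumer makes on each delivery can be sketched as a small function. This is a simplified model, loosely based on how JMS-style clients treat the threshold; the fallback to the DLQ when no backout queue is named, the treatment of a zero threshold as "no limit", and the queue name PAYMENTS.BACKOUT are assumptions for illustration.

```python
from typing import Optional


def route_message(backout_count: int, bothresh: int, boqname: Optional[str]) -> str:
    """Decide what a consumer does with a message it has just received."""
    if bothresh <= 0 or backout_count < bothresh:
        # Below the threshold (or no threshold set): make a normal processing attempt.
        return "process"
    if boqname:
        # Threshold reached: move the poison message to the backout queue.
        return f"requeue:{boqname}"
    # Assumed fallback when no backout queue is defined.
    return "requeue:DLQ"


if __name__ == "__main__":
    for count in range(5):
        print(count, route_message(count, 3, "PAYMENTS.BACKOUT"))
```

The key point for an interview answer is visible in the first branch: without a threshold, the same message is attempted forever and everything behind it waits.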

Performance Incidents

Everything is slow but depth is low.

Investigate log I/O, disk latency, CPU saturation, long-running syncpoint holding locks, channel bandwidth, or network packet loss. Use accounting and statistics. Depth low rules out simple consumer stop—look at round-trip time and log waits.

z/OS Production Notes

Coupling facility structure full.

Shared queues in QSG stop accepting puts; broader than one application queue. Involve z/OS systems programmer; may need structure alter or message drain. Different playbook from distributed MAXDEPTH.

Communication and Post-Incident

What do you put in the incident ticket?

Start and end time, impact, queues and channels affected.
Commands run and who approved them.
AMQERR excerpts and sample MsgIds.
Workaround applied and permanent fix ticket reference.
Lessons learned and monitoring gap.

Explain Like I'm Five: Production Troubleshooting

The toy factory stopped moving toys. You look at the alarm bell log (AMQERR), see which belt is stuck (queue), check the tunnel between buildings (channel), fix the safe problem, and tell everyone when toys move again—without smashing the whole factory on the first guess.

Practice Exercises

Exercise 1

Write a five-line bridge update for this scenario: channel RETRYING with 12k XMITQ depth.

Exercise 2

List the DISPLAY commands you would run in the first ten minutes of a queue-full incident.

Exercise 3

Draft post-incident review headings for a poison message outbreak.

Test Your Knowledge

1. AMQERR contains:

Queue manager error messages
Only JCL
Browser cache
COBOL listing

2. Before requeueing DLQ messages you should:

Fix root cause
Always delete QM
Disable all consumers
Turn off TLS

3. strmqtrc is used to:

Capture detailed diagnostics
Start channel always
Format disk
Compile Java

4. Bridge call communication should include:

Impact, ETA, workaround
Only blame
No status
Secret passwords
