Stress Testing

Stress testing IBM MQ means pushing the system harder than a load test does, until something gives: a queue fills, the log disk saturates, channels go into RETRYING, consumers fall behind, or CPU is pegged. The goal is not to break production deliberately; it is to learn where the cliff edge lies, so that capacity planning includes margin and operators recognize early warning signs before customers do. Load testing asks whether we can handle Black Friday traffic. Stress testing asks what happens on Black Friday plus thirty percent from a duplicate feed bug. Beginners confuse the two and either never test beyond averages or run uncontrolled spikes in production. This tutorial explains stress methodology, ramp strategies, failure modes to observe, safety controls, recovery validation, and how results feed capacity planning and monitoring thresholds.

Stress Versus Load

Load test vs stress test

| Aspect | Load test | Stress test |
| --- | --- | --- |
| Target intensity | Planned peak | Beyond peak until failure |
| Success criterion | Meet SLA at peak | Document max and failure mode |
| Risk tolerance | Low | Controlled breakage in test |
| Output | Baseline metrics | Ceiling and degradation curve |

Ramp Strategies

A step ramp increases the producer rate every five minutes until errors appear. Plot latency and queue depth at each step to find the knee of the curve. A spike test doubles the rate instantly for ten minutes to simulate a misconfigured batch replay. A soak at one hundred twenty percent of peak for several hours reveals memory leaks and log archive fill. Run each strategy separately; a single marathon that mixes them makes the results hard to interpret.
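A step ramp is easy to write down as a schedule before any driver is started. The following is a minimal sketch, not a load driver: the `RampStep` and `step_ramp` names are made up for this illustration, and the rates are example values only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RampStep:
    start_s: int  # seconds since the start of the run
    rate: int     # producer target rate in msg/s


def step_ramp(start_rate: int, increment: int, hold_s: int, max_rate: int) -> List[RampStep]:
    """Hold each rate for hold_s seconds, then add increment, up to max_rate."""
    if increment <= 0 or hold_s <= 0:
        raise ValueError("increment and hold_s must be positive")
    steps = []
    rate, elapsed = start_rate, 0
    while rate <= max_rate:
        steps.append(RampStep(elapsed, rate))
        rate += increment
        elapsed += hold_s
    return steps


# Example values (assumptions): +500 msg/s every 5 minutes, from 5,000 to 7,500.
schedule = step_ramp(start_rate=5000, increment=500, hold_s=300, max_rate=7500)
for step in schedule:
    print(f"t+{step.start_s // 60:>3} min  {step.rate} msg/s")
```

Writing the schedule down first makes it possible to label every latency and depth sample with the step it belongs to.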

Failure Modes to Watch

  • MQRC_Q_FULL (reason code 2053) when CURDEPTH hits MAXDEPTH: puts fail, and producers must retry or give up.
  • Filesystem full on queue or log paths: the queue manager may stop.
  • Channel limit reached or partner rejection: transmission queue (XMITQ) depth explodes.
  • Listener backlog: clients cannot connect even though MQ internals are healthy.
  • Poison-message or retry storms: CPU flat, depth barely moves, because consumers keep backing out the same messages.
  • In-doubt transactions after a forced stop: these need transaction manager recovery.
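When a stress driver logs the reason code of every failed MQI call, tallying those codes shows which failure mode came first. The sketch below assumes the driver already collects reason codes as integers. The numeric values are standard IBM MQ reason codes, while the short descriptions and the `tally_failures` helper are this example's own shorthand.

```python
from collections import Counter
from typing import Iterable

# Standard IBM MQ reason codes; the descriptions are simplified for this sketch.
FAILURE_MODES = {
    2053: "MQRC_Q_FULL: queue reached MAXDEPTH",
    2009: "MQRC_CONNECTION_BROKEN: connection to the queue manager was lost",
    2059: "MQRC_Q_MGR_NOT_AVAILABLE: queue manager stopped or unreachable",
    2538: "MQRC_HOST_NOT_AVAILABLE: client could not reach the listener",
}


def tally_failures(reason_codes: Iterable[int]) -> Counter:
    """Count failures per mode; unknown codes are kept, not dropped."""
    return Counter(
        FAILURE_MODES.get(rc, f"unclassified reason code {rc}") for rc in reason_codes
    )


# Hypothetical sample of codes collected during one ramp step.
observed = [2053, 2053, 2009, 2053, 2538]
for label, count in tally_failures(observed).most_common():
    print(f"{count:>4}  {label}")
```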

Safety and Ethics

Isolate the test network from production partners by using test channels or stub receivers. Snapshot disks or use disposable VMs so the environment is easy to rebuild. Schedule test windows with infrastructure teams. Never point stress drivers at production queue names without governance. Document the rollback: endmqm, clear test queues, restore from image.
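One way to enforce the rule about production queue names is a guard that the stress driver calls before it connects. This is a sketch under assumed conventions: the allowlisted queue manager name and the `STRESS.` and `TEST.` queue prefixes are hypothetical and should be replaced by whatever naming standard the site actually uses.

```python
# Hypothetical allowlist; adapt to the site's own naming standard.
ALLOWED_QMGRS = {"QM_PERF"}
ALLOWED_QUEUE_PREFIXES = ("STRESS.", "TEST.")


def assert_safe_target(qmgr: str, queue: str) -> None:
    """Raise before any connection is made if the target is not a test target."""
    if qmgr not in ALLOWED_QMGRS:
        raise RuntimeError(f"queue manager {qmgr!r} is not on the stress-test allowlist")
    if not queue.startswith(ALLOWED_QUEUE_PREFIXES):
        raise RuntimeError(f"queue {queue!r} does not use a test prefix")


assert_safe_target("QM_PERF", "STRESS.PAYMENT.IN")  # passes silently
print("target accepted")
```

A guard like this does not replace governance, but it turns a typo in a driver configuration into an immediate, explicit failure.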

Observing Degradation

Healthy systems degrade gracefully: latency rises before total failure, and errors arrive as explicit reason codes. Unhealthy systems exhibit hangs, in-doubt states, or partial writes. Capture the queue manager error logs (AMQERR01.LOG) and operating system metrics during the stress run. Note whether depth recovers after producers stop, because the drain rate under stress defines the recovery time.
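Recovery time can be estimated from the backlog and the net drain rate. The sketch below assumes steady get and put rates, which real consumers rarely deliver exactly, so treat the result as a first approximation. The function name and sample numbers are illustrative.

```python
def drain_seconds(depth: int, get_rate: float, put_rate: float = 0.0) -> float:
    """Estimate seconds to empty a queue at steady get and put rates (msg/s)."""
    net_rate = get_rate - put_rate
    if net_rate <= 0:
        # The backlog never clears while puts keep pace with gets.
        return float("inf")
    return depth / net_rate


# Hypothetical backlog: 90,000 messages, consumers at 2,000 get/s,
# residual producers still putting 500 msg/s.
print(drain_seconds(depth=90_000, get_rate=2_000, put_rate=500))  # 60.0
```

Comparing this estimate with the drain time actually observed shows whether consumers slow down under backlog.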

Recovery Validation

  1. Stop stress drivers.
  2. Verify queue manager RUNNING and listeners up.
  3. Drain or purge test queues per policy.
  4. Resolve in-doubt transactions.
  5. Re-run a short load test at fifty percent of peak to confirm normal behavior.
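The queue manager check in the list above can be scripted around the `dspmq` command, which prints lines of the form `QMNAME(name) STATUS(status)`. The following is a sketch that assumes `dspmq` is on the PATH of the user running it. The helper names are invented for this example, and the parser is shown against a sample string so it can be tried without an MQ installation.

```python
import re
import subprocess
from typing import Dict

STATUS_RE = re.compile(r"QMNAME\((?P<name>[^)]+)\)\s+STATUS\((?P<status>[^)]+)\)")


def parse_dspmq(output: str) -> Dict[str, str]:
    """Map queue manager name to status from dspmq output."""
    return {m["name"]: m["status"] for m in STATUS_RE.finditer(output)}


def queue_manager_running(name: str) -> bool:
    """Run dspmq for one queue manager and report whether it is Running."""
    result = subprocess.run(
        ["dspmq", "-m", name], capture_output=True, text=True, check=False
    )
    return parse_dspmq(result.stdout).get(name) == "Running"


# Sample output line, so the parser can be exercised without MQ installed.
sample = "QMNAME(QM_PERF)                                   STATUS(Running)"
print(parse_dspmq(sample))
```

Listener, queue depth, and in-doubt checks need MQSC or PCF commands and are left out of this sketch.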

Using Results

Record the maximum sustainable msg/s and bytes/s reached before p99 latency doubled or errors exceeded one percent. Set monitoring alerts at seventy percent of that ceiling. Update the capacity plan headroom. File defects for non-linear failures, such as a small depth causing a disproportionate slowdown.
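The ceiling rule can be applied mechanically to per-step measurements. The sketch below assumes that "doubled" means twice the p99 measured at the lowest-rate step, which is one reasonable reading and not the only one. The `StepResult` type and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StepResult:
    rate: int      # offered load in msg/s
    p99_ms: float  # 99th percentile latency during the step
    sent: int      # messages attempted during the step
    errors: int    # failed puts during the step


def max_sustainable(steps: List[StepResult]) -> Optional[int]:
    """Highest rate reached before p99 doubles the baseline or errors pass 1%."""
    ordered = sorted(steps, key=lambda s: s.rate)
    if not ordered:
        return None
    baseline = ordered[0].p99_ms  # assumption: the lowest-rate step is the baseline
    best = None
    for step in ordered:
        if step.p99_ms >= 2 * baseline or step.errors * 100 > step.sent:
            break
        best = step.rate
    return best


def alert_threshold(ceiling: int, percent: int = 70) -> int:
    """Alert level as a whole-number percentage of the ceiling."""
    return ceiling * percent // 100


# Hypothetical results from a step ramp.
results = [
    StepResult(5000, 40.0, 1_500_000, 0),
    StepResult(5500, 45.0, 1_650_000, 0),
    StepResult(6000, 55.0, 1_800_000, 0),
    StepResult(6500, 70.0, 1_950_000, 0),
    StepResult(7000, 95.0, 2_100_000, 0),
    StepResult(7500, 400.0, 2_250_000, 30_000),
]
ceiling = max_sustainable(results)
print(ceiling, alert_threshold(ceiling))  # 6500 4550
```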

Tutorial: Stress Run Log Template

```text
STRESS RUN 2026-05-17 - QM_PERF
Ramp: +500 msg/s every 5 min, persistent 2KB
Failure at: 7,500 msg/s - MQRC_Q_FULL on PAYMENT.IN
CURDEPTH: 100,000 / MAXDEPTH 100,000
Channel TO.HUB: RUNNING, XMITQ depth 0
Recovery: stop producers, 45 min drain at 3,200 get/s
Max sustainable (p99 < 500ms): ~6,800 msg/s
Action: raise MAXDEPTH + disk OR add consumer instance
```

Explainer: How Many People Fit in the Elevator

Stress testing keeps adding people to the elevator until the alarm rings—you learn the real limit, not the sign that says eight when twelve seemed fine until someone brought luggage.

Explain Like I'm Five

Stress testing is putting more and more marbles in the jar until marbles start falling on the floor—then you know the jar is too small for that many at once.

Practice Exercises

Exercise 1

Design a step ramp from 2,000 to 10,000 msg/s in five steps.

Exercise 2

List recovery steps after stress test fills the log disk.

Exercise 3

Convert stress test ceiling to a monitoring alert threshold with margin.

Frequently Asked Questions

Test Your Knowledge

1. Stress testing pushes load:

  • Beyond planned peak
  • Only to fifty percent
  • Only on DLQ
  • Only off hours production

2. A common stress outcome is:

  • MQRC_Q_FULL or disk full
  • Automatic free capacity
  • Higher MAXMSGL always
  • Fewer logs

3. After stress test you should:

  • Document limits and recovery
  • Delete all queues
  • Disable TLS
  • Skip capacity plan

4. Stress tests belong in:

  • Isolated test environment
  • Production checkout
  • Partner prod first
  • DLQ only
Read time: 19 min
Author: MainframeMaster
Verified: IBM MQ 9.3 documentation