Cross-Region Recovery

Cross-region recovery is what you execute when an entire geography stops being a safe place to run IBM MQ—not when one Linux host reboots. Cloud availability zones fail together more often than marketing suggests; hurricanes, fiber cuts, and misconfigured backbone routes take whole regions offline. Enterprises spread queue managers across us-east and us-west, London and Dublin, or Tokyo and Osaka to survive those events. Cross-region recovery combines backup queue managers, log shipping or asynchronous replication, global DNS or traffic management, and runbooks tested under realistic WAN conditions. Latency between regions affects replication lag and therefore RPO. Legal constraints affect whether cross-border DR is even allowed. This tutorial explains regional architecture patterns, network and TLS design, declaration procedures, failback, cloud-specific notes, and how cross-region recovery differs from multi-instance in one datacenter.

Regional Topology Patterns

Cross-region MQ patterns
Pattern | Description | RPO tendency
Active/passive per region | HA inside region; DR to second region | Minutes with async ship
Active/active regions | Each region serves local traffic; replication for shared data | Varies by queue
Hub in region A, DR hub in B | Spokes reconnect via DNS to DR hub | Backlog on spokes during outage
Stretch cluster (rare) | HA nodes split across regions | Low if quorum holds; fragile on WAN

Many banks choose the first pattern, active/passive per region: a production hub with RDQM or MIQM (multi-instance queue manager) in region A, and a warm backup queue manager in region B receiving shipped logs or storage replication. The third pattern, hub in region A with a DR hub in B, suits global retail: each country has a local hub, but a central settlement hub in region A fails over to region B. The fourth pattern, the stretch cluster, is seductive but operationally difficult, because a WAN partition can destroy quorum. Document which pattern you use per queue manager.

Latency, Bandwidth, and RPO

When a product replicates synchronously over a WAN, every persistent write waits for at least one round trip to the remote region, so each hundred milliseconds of round-trip time is added to write latency. That is why synchronous cross-region replication is often too slow for application SLAs. Asynchronous replication accepts lag instead: if the primary region fails, messages not yet replicated are lost, and your RPO must be at least as large as the worst lag you tolerate. Size network links for peak log generation, not average lunch-hour traffic. Compression and dedicated MPLS help. Monitor bytes shipped per minute against the primary log write rate; lag alarms belong on the same wall as channel retry alarms.
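The relationship between log write rate, shipping rate, and RPO can be made concrete with a little arithmetic. The following is a minimal sketch, not an IBM MQ API: the function names, the 80% warning threshold, and the example rates are all assumptions chosen for illustration.

```python
def backlog_bytes(write_rate_bps, ship_rate_bps, seconds, start_backlog=0.0):
    """Unshipped log bytes after `seconds` at constant rates (never negative)."""
    return max(0.0, start_backlog + (write_rate_bps - ship_rate_bps) * seconds)


def lag_seconds(backlog, write_rate_bps):
    """Approximate lag: seconds of primary log writes not yet in the DR region."""
    if write_rate_bps <= 0:
        return 0.0
    return backlog / write_rate_bps


def lag_alarm(lag_s, rpo_s, warn_fraction=0.8):
    """Classify lag against the RPO; warn_fraction is an assumed threshold."""
    if lag_s > rpo_s:
        return "breach"
    if lag_s >= warn_fraction * rpo_s:
        return "warn"
    return "ok"


# Example: primary writes 10 MB/s of log at peak, the link ships 8 MB/s.
peak_backlog = backlog_bytes(10e6, 8e6, seconds=60)
print(lag_seconds(peak_backlog, 10e6), lag_alarm(12, rpo_s=300))
```

In this sketch a one-minute peak on an undersized link already leaves twelve seconds of log unshipped, which shows why the link should be sized for peak rather than average write rate.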

DNS, Load Balancers, and Clients

Applications should not hard-code region A IP addresses. Use fully qualified names backed by low-TTL DNS records that swing to the region B load balancer on DR declaration. A client channel definition table (CCDT) can list multiple connection names in priority order, and clients that support this try the next host after a failure. Test that corporate DNS caches do not stick to the dead region for hours; a TTL of five minutes is common in DR designs. TLS certificates must be valid for the DNS name clients use, including DR-specific names if applicable.
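The priority-order behaviour described above can be sketched in a few lines. This is an illustration of the idea only, not the logic of any MQ client library: it checks plain TCP reachability rather than starting a channel, and the host names and port are hypothetical.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


def first_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Walk endpoints in priority order and return the first that answers."""
    for host, port in endpoints:
        if probe(host, port):
            return (host, port)
    return None


# Hypothetical priority list: region A first, then the DR listener in region B.
PRIORITY = [("mq-a.example.com", 1414), ("mq-b.example.com", 1414)]
```

Passing the probe as a parameter keeps the ordering logic testable in a drill without touching the network, for example by simulating region A as unreachable.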

Network Security Across Regions

Firewalls must allow DR activation paths before a disaster: region B listeners, return channels from partners, and management-plane access for operators over VPN. CHLAUTH rules and TLS settings on cross-region channels need matching cipher specifications at both ends. Some clouds charge egress fees for log shipping, so finance should know. Private connectivity (VPC peering, ExpressRoute, Direct Connect) keeps logs off the public internet.

Declaration and Failback

  1. Incident commander confirms region A unrecoverable within RTO window—not only one AZ glitch.
  2. Isolate region A queue managers from network to prevent split brain if power returns.
  3. Activate region B backup per runbook; verify QMSTATUS and critical queues.
  4. Swing DNS; notify partners with DR CONNAME appendix.
  5. Throttle producers if backlog risks MAXDEPTH on critical queues.
  6. Operate in region B until region A is rebuilt and consistency proven.
  7. Failback: planned reverse replication, quiesce B, sync, swing DNS back—often harder than failover.
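Step 5 in the list above is easier to execute under pressure if the throttle thresholds are agreed in advance. A minimal sketch, assuming the current depth and MAXDEPTH of a queue have already been read by your monitoring; the 60% and 85% thresholds and the action names are assumptions, not IBM recommendations.

```python
def depth_ratio(curdepth: int, maxdepth: int) -> float:
    """Fraction of MAXDEPTH currently used by the queue."""
    if maxdepth <= 0:
        raise ValueError("maxdepth must be positive")
    return curdepth / maxdepth


def throttle_action(curdepth: int, maxdepth: int,
                    slow_at: float = 0.6, stop_at: float = 0.85) -> str:
    """Map queue fill level to a producer action (thresholds are assumed)."""
    ratio = depth_ratio(curdepth, maxdepth)
    if ratio >= stop_at:
        return "pause"   # ask producers to stop until the backlog drains
    if ratio >= slow_at:
        return "slow"    # ask producers to reduce their send rate
    return "normal"


# Example: a critical queue with MAXDEPTH 5000 during DR operation.
for depth in (1000, 3500, 4500):
    print(depth, throttle_action(depth, 5000))
```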

Failback deserves its own runbook section. Teams that drill failover but not failback strand operations in DR region for years. Schedule failback exercises less frequently than failover but do not skip them.

Cloud Region Considerations

In AWS, Azure, or GCP, place backup queue managers in a different region than primary, not only another availability zone. Use encrypted object storage for log shipping landing zones with cross-region replication policies. Kubernetes Native HA namespaces need node pools in DR region. Tag resources for cost chargeback per region. Automate infrastructure with Terraform modules parameterized by region code—QM1_use1 and QM1_usw2 should differ only by variables, not copy-paste.

Data Residency and Compliance

EU personal data may require DR within EU. Some nations prohibit replication to US clouds. Cross-region recovery plans must be reviewed by legal before architecture sign-off. When in-country DR is required, your second region might be another city in the same country—not another continent. Document data classification per queue; do not replicate PCI queues to a region without PCI certification.
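A per-queue classification is only useful if something checks it before replication is configured. The sketch below shows one way such a check might look; the policy table, classification labels, and queue names are hypothetical, and a real policy must come from legal and compliance review.

```python
from typing import Dict, List, Optional, Set

# Hypothetical policy: classification -> regions where replicas may live.
# None means "no restriction". All names here are illustrative only.
POLICY: Dict[str, Optional[Set[str]]] = {
    "eu-personal": {"eu-west-1", "eu-central-1"},
    "pci": {"eu-west-1"},
    "public": None,
}


def residency_violations(queue_classes: Dict[str, str], dr_region: str,
                         policy: Dict[str, Optional[Set[str]]] = POLICY) -> List[str]:
    """Return one message per queue that must not be replicated to dr_region."""
    problems = []
    for queue, cls in sorted(queue_classes.items()):
        if cls not in policy:
            # Unknown classes are treated as violations rather than allowed.
            problems.append(f"{queue}: unclassified or unknown class '{cls}'")
            continue
        allowed = policy[cls]
        if allowed is not None and dr_region not in allowed:
            problems.append(f"{queue}: class '{cls}' not permitted in {dr_region}")
    return problems


QUEUES = {"SETTLE.IN": "eu-personal", "CARD.AUTH": "pci", "RATES.PUB": "public"}
print(residency_violations(QUEUES, "eu-central-1"))
```

Treating an unknown classification as a violation is a deliberate choice in this sketch: an unclassified queue should block sign-off rather than be replicated by default.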

Hub-and-Spoke During Regional Outage

Spokes buffer on transmission queues when the regional hub dies—store-and-forward protects spoke data while DNS moves to DR hub. Spoke disk must handle multi-hour backlog. After DR hub starts, channels drain XMITQ; consumers may see delayed bursts. Coordinate with business on whether timestamps on messages matter for SLA reporting.
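Whether a spoke disk can absorb a multi-hour backlog is a sizing calculation worth doing before the outage. A rough sketch, assuming a steady message rate and average size; the 20% per-message storage overhead and 25% free-space headroom are assumed figures, not measured values.

```python
def xmitq_backlog_bytes(msgs_per_sec: int, avg_msg_bytes: int,
                        outage_seconds: int, overhead_pct: int = 20) -> int:
    """Estimated bytes queued on a spoke transmission queue during an outage."""
    raw = msgs_per_sec * avg_msg_bytes * outage_seconds
    # Integer arithmetic keeps the estimate exact and reproducible.
    return raw * (100 + overhead_pct) // 100


def fits_on_disk(required_bytes: int, free_bytes: int,
                 headroom_pct: int = 25) -> bool:
    """True if the backlog fits while leaving headroom_pct of free space unused."""
    usable = free_bytes * (100 - headroom_pct) // 100
    return required_bytes <= usable


# Example: 50 msg/s of 4 KiB messages buffered for a four-hour outage.
need = xmitq_backlog_bytes(50, 4096, 4 * 3600)
print(need, fits_on_disk(need, 5 * 1024**3))
```

With these example figures the backlog is roughly 3.5 GB, which fits a spoke with 5 GiB free but not one with 4 GiB free once headroom is reserved.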

Tutorial: Regional DR Checklist

CROSS-REGION DR CHECKLIST: QM_SETTLE
Primary region: eu-west-1 | DR region: eu-central-1
RTO: 30 min | RPO: 5 min (async log ship)

[ ] Backup QM objects synced (last: nightly Git pipeline)
[ ] Log ship lag alert < 5 min (PagerDuty)
[ ] DNS settle.mq.example.com TTL 300s -> DR LB tested
[ ] TLS cert covers settle.mq.example.com on DR listeners
[ ] Partner channels appendix: 12 banks updated in 2025 drill
[ ] CCDT v3.4 in artifact repo with both region hosts
[ ] Legal: EU-only data, DR region in EU confirmed
[ ] Failback runbook owner assigned

Explainer: Two Cities, One Chain Store

A store chain keeps a warehouse in another city. If City A floods, trucks reroute to City B warehouse. Customers still see the same store brand; only the warehouse address behind the scenes changed.

Explain Like I'm Five

If the playground on the east side of town is closed, everyone meets at the west-side playground instead. Grown-ups already wrote the map and told parents both addresses before the east playground broke.

Practice Exercises

Exercise 1

Estimate RPO with async replication lag averaging eight seconds and shipping failures twice daily for two minutes.

Exercise 2

Write DNS swing steps for a client using three-host CCDT.

Exercise 3

List three legal questions for cross-border MQ DR in healthcare.

Frequently Asked Questions


Test Your Knowledge


1. Cross-region recovery addresses:

  • Entire region loss
  • Single channel retry
  • One poison message
  • Topic wildcard typo

2. Async replication across regions usually:

  • Increases RPO versus sync local HA
  • Eliminates all lag
  • Removes TLS need
  • Deletes logs

3. Data residency may require:

  • DR site within same country
  • Any US region only
  • No backup
  • Public internet queues

4. DNS in cross-region DR:

  • Swings clients to DR listeners
  • Replaces MQ logs
  • Defines MAXDEPTH
  • Creates topics only
Read time: 22 min
Author: MainframeMaster
Verified: IBM MQ 9.3 documentation