When a CICS region fails or is shut down, recovery and restart are the processes that bring it back to a consistent, running state. CICS supports several restart types: warm restart (after a normal shutdown, restoring state from the catalogs and system log), cold restart (reinitializing the region and discarding most saved state), and emergency restart (after an abnormal termination, using the system log to back out work that was in flight). The recovery manager uses the system log to determine what was in progress at failure and to drive backout, while forward recovery logs allow committed changes to be reapplied to a restored copy of a data set. This page explains what recovery and restart mean, how each restart type works, when each is used, and how logging supports recovery.
Imagine the power goes out while you are building a tower of blocks. When the power comes back, you have two choices: you can try to put the tower back the way it was using a photo you took (like warm restart), or you can clear the table and start over (like cold restart). CICS does something similar. While it runs, it writes down what it is doing in a log. When you restart, it can read that log and undo things that did not finish (backout), and a saved copy of the data can have finished changes redone on top of it (forward recovery), so that data is consistent. Warm restart uses the saved information after a tidy shutdown; cold restart ignores it and starts fresh; emergency restart is what happens after an unexpected stop, when CICS reads its notes and undoes anything that was left half finished.
Recovery is the set of actions that restore data and resource state to a consistent point after a failure. In CICS, recovery involves the recovery manager, the system log, forward recovery logs, and (for in-flight transactions) dynamic transaction backout. Restart is the process of bringing the CICS region back up so it can accept work again. Restart uses recovery: during restart, CICS applies log data to restore state, back out incomplete units of work, and ensure that resources (files, queues, etc.) are consistent. The choice of restart type (warm, cold, or emergency) determines how much of the saved state is used and how much is rebuilt from configuration.
CICS offers three main restart types. Each uses a different mix of saved state and reinitialization.
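| Type | What it does | When it is used | Relative speed |
|---|---|---|---|
| Warm | Restores state from the catalogs and system log | After a normal shutdown, when the saved data is valid | Fastest |
| Cold | Reinitializes the region and reinstalls resource definitions | Saved data is corrupted, or a clean state is wanted | Slower |
| Emergency | Reads the system log and backs out in-flight units of work | After an abnormal termination | Depends on how much work must be backed out |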
Warm restart is the preferred method when the system data and logs are valid, and it is what CICS performs after a normal shutdown. CICS reads the existing system data (for example the global and local catalogs) and the system log, which may reside in a coupling facility log stream, to restore the region to its state at shutdown. It restores installed resource definitions and resource status and validates integrity. Benefits include faster startup, preserved system state, and minimal data loss. Requirements include valid catalog data sets, an intact system log, and available resources. If warm restart succeeds, the region comes back with its previous configuration and state. IBM generally recommends specifying START=AUTO, which lets CICS choose a warm restart after a normal shutdown or an emergency restart after a failure; a cold start is needed only in rare cases such as global catalog or system log corruption.
Cold restart initializes the region from scratch. It does not rely on the previous run's state; instead it rebuilds system structures and reinstalls resource definitions from the CSD (CICS system definition data set) or other configuration. Cold restart is used when warm restart fails (e.g. because of corrupted or missing data) or when you intentionally want a clean state. It takes longer than warm restart because every definition is reinstalled. After a cold restart, work that was in flight when the previous run ended is not backed out from the previous run's log, so recoverable resources may be left holding uncommitted changes; you then rely on forward recovery from a backup or on application-level recovery for durable resources (e.g. databases, queues).
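As a rough illustration of the forward recovery idea mentioned here, the sketch below reapplies logged after-images to a backup copy. It is a toy model under stated assumptions: the function name `forward_recover`, the dictionary standing in for a data set, and the `(key, after_image)` log layout are all hypothetical and do not reflect a real CICS log format or utility.

```python
def forward_recover(backup, forward_log):
    """Rebuild a data set by reapplying logged changes to a backup copy.

    backup      -- dict standing in for the restored backup (hypothetical model)
    forward_log -- list of (key, after_image) pairs in the order they were
                   logged; an after_image of None marks a deletion
    """
    recovered = dict(backup)  # work on a copy so the backup stays untouched
    for key, after_image in forward_log:
        if after_image is None:
            recovered.pop(key, None)       # the record was deleted
        else:
            recovered[key] = after_image   # the record was added or updated
    return recovered
```

The log must be applied in the order it was written, because a later after-image for the same key supersedes an earlier one.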
Emergency restart is the recovery path CICS takes after an abnormal termination, such as a region failure or an immediate shutdown. With START=AUTO, CICS selects it automatically when the previous run did not end normally. During emergency restart, CICS reads the system log to find the units of work that were in flight at the time of failure and backs out their uncommitted changes to recoverable resources; units of work that were in doubt are held until their outcome can be resolved with the coordinating system. The time it takes depends on how much in-flight work has to be backed out. Because it depends on the system log and the global catalog, it is not available when those are lost or corrupted; in that case a cold start is the fallback.
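In practice the choice between warm and emergency restart is usually made by CICS itself: with the START=AUTO system initialization parameter, it performs a warm restart if the previous run shut down normally and an emergency restart if it did not. The sketch below models only that decision. The function and its boolean input are hypothetical illustrations, not CICS interfaces, and other START values (such as INITIAL and STANDBY) are deliberately left out.

```python
from enum import Enum


class RestartType(Enum):
    WARM = "warm"
    COLD = "cold"
    EMERGENCY = "emergency"


def choose_restart_type(start_option, last_shutdown_was_normal):
    """Toy decision rule for the restart type (hypothetical helper).

    start_option             -- "AUTO" or "COLD", mirroring the START parameter
    last_shutdown_was_normal -- True if the previous run ended cleanly
    """
    if start_option == "COLD":
        # An explicit cold start ignores how the previous run ended.
        return RestartType.COLD
    if start_option == "AUTO":
        # Clean shutdown: saved state can be restored as-is (warm).
        # Otherwise the log must be processed to back out in-flight work.
        if last_shutdown_was_normal:
            return RestartType.WARM
        return RestartType.EMERGENCY
    raise ValueError(f"unsupported start option: {start_option!r}")
```

For example, `choose_restart_type("AUTO", False)` returns `RestartType.EMERGENCY`, which matches the case of a region that failed.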
The CICS recovery manager coordinates recovery during restart. It uses the system log to determine which units of work were in progress at the time of failure and what updates they had made. For units of work that were in flight, it drives backout: undoing changes so that resources are left as if the transaction had not run. Units of work that were in doubt (prepared but awaiting a commit or backout decision from a coordinator) are held until that decision is known. For resources that support forward recovery (e.g. some file types), forward recovery logs record committed changes so that a forward recovery utility can reapply them to a restored backup copy. The system log is therefore critical; if it is lost or corrupted, neither warm nor emergency restart may be possible and a cold start may be the only option. Protecting and backing up the system log (and related data sets) is part of recovery planning.
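The log analysis described above can be sketched as a single pass that tracks the last known state of each unit of work. This is a simplified model, not the real system log format: the `LogRecord` type and the record kinds `"update"`, `"prepare"`, `"commit"`, and `"backout"` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    uow_id: str
    kind: str  # "update", "prepare", "commit", or "backout" (hypothetical)


def classify_units_of_work(log):
    """Map each unit of work to "completed", "in-doubt", or "in-flight"."""
    status = {}
    for record in log:  # records are assumed to be in the order written
        if record.kind in ("commit", "backout"):
            status[record.uow_id] = "completed"   # outcome is on the log
        elif record.kind == "prepare":
            status[record.uow_id] = "in-doubt"    # waiting for a decision
        elif record.kind == "update":
            # First sighting means in flight; never downgrade a later state.
            status.setdefault(record.uow_id, "in-flight")
        else:
            raise ValueError(f"unknown record kind: {record.kind!r}")
    return status
```

In this model, only the units of work left as `"in-flight"` would be candidates for backout at restart; `"in-doubt"` ones wait for their coordinator.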
Dynamic transaction backout is the mechanism that undoes a single failed transaction's updates (e.g. file updates, queue changes) so the system stays consistent. Backout can happen during normal running (when a task abends) or during restart (when the recovery manager processes the log and backs out in-flight work). Recovery and restart therefore use the same backout logic; at restart, the recovery manager identifies the units of work that did not complete and triggers backout for them. Applications that use syncpoints and recoverable resources (e.g. file control, DB2) participate in this, so that after restart durable data reflects only completed units of work.
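The backout idea can be illustrated with a minimal sketch that records a before-image for every update and restores them in reverse order. The `UnitOfWork` class and the dictionary standing in for a recoverable resource are hypothetical; real CICS backout works from log records rather than from an in-memory list.

```python
_ABSENT = object()  # marks "the key did not exist before the update"


class UnitOfWork:
    """Toy unit of work over a dict that stands in for a recoverable file."""

    def __init__(self, store):
        self.store = store
        self.before_images = []  # (key, previous value), in update order

    def write(self, key, value):
        # Save the before-image first, then apply the update.
        self.before_images.append((key, self.store.get(key, _ABSENT)))
        self.store[key] = value

    def commit(self):
        # At a syncpoint the updates become permanent; nothing left to undo.
        self.before_images.clear()

    def back_out(self):
        # Undo the most recent update first so earlier images win last.
        for key, previous in reversed(self.before_images):
            if previous is _ABSENT:
                self.store.pop(key, None)
            else:
                self.store[key] = previous
        self.before_images.clear()
```

Restoring in reverse order matters when one key is updated more than once: the oldest before-image is applied last, so the resource ends up exactly as it was before the unit of work started.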
1. Which CICS restart type uses existing system data and logs?
2. When would you use a cold restart?
3. What role does the system log play in CICS recovery?