Start I/O Error Recovery

When VSAM starts an I/O (via a channel program), the channel and device perform the operation. Sometimes the I/O fails: the device might be unavailable, a hardware error might occur, or the media might be damaged. The system must detect the failure, possibly retry, and eventually report the error to the application so it can take action. When a job ends abnormally (ABEND or CANCEL) while a VSAM cluster is open, the catalog information for the cluster may no longer match the data set itself; recovery procedures (such as IDCAMS VERIFY) are used to restore consistency. This page explains how Start I/O error recovery works: what happens when an I/O fails, how status and sense data describe the failure, how retries behave, and how to use VERIFY and related steps to recover a VSAM cluster after abnormal termination.

What Happens When an I/O Fails

After the channel program is started, the channel and device execute the I/O. When they finish, the channel subsystem signals the CPU with an I/O interruption and presents status, which shows whether the I/O completed normally or with an error. When the device status includes unit check, the device and control unit also make sense data available: additional bytes that describe the failure (e.g. equipment check, data check, intervention required). The I/O supervisor in z/OS receives the status and decides what to do. For some errors it retries the I/O (for example, transient conditions that might succeed on a second try). For permanent errors, or after retries are exhausted, the I/O supervisor marks the I/O as failed and POSTs the ECB with an error indication. The code that started the I/O (e.g. VSAM or the I/O driver) then gets control with the error and can return a non-zero return code and reason code to the application so the application knows the request failed.
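The flow described above can be sketched as a small simulation. This is not z/OS code: IoStatus, Ecb and drive_io are hypothetical names invented for illustration, and the retry limit of 2 is an assumption, not a documented system value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IoStatus:
    """Outcome of one simulated I/O attempt (hypothetical model)."""
    ok: bool
    transient: bool = False  # True if another attempt might succeed
    sense: bytes = b""       # sense bytes, present only on error


@dataclass
class Ecb:
    """Stand-in for an event control block: posted once with the result."""
    posted: bool = False
    error: bool = False
    sense: bytes = b""


def drive_io(start_io: Callable[[], IoStatus], ecb: Ecb, max_retries: int = 2) -> int:
    """Run the I/O, retry transient failures, then post the ECB.

    Returns the number of attempts that were made.
    """
    attempts = 0
    while True:
        attempts += 1
        status = start_io()
        if status.ok:
            ecb.posted, ecb.error = True, False
            return attempts
        # Permanent error, or transient error with retries exhausted:
        # post the ECB with an error indication and keep the sense data.
        if not status.transient or attempts > max_retries:
            ecb.posted, ecb.error, ecb.sense = True, True, status.sense
            return attempts
```

The waiter (VSAM in the real system) only ever sees the posted ECB; the retries happen underneath it.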

Sense Data and Status

Sense data is device-specific. Different device types (e.g. 3390 DASD, tape) have different sense formats. For DASD, the sense bytes might indicate conditions such as equipment check, data check, or intervention required (the device is not ready). The I/O supervisor and the device-specific error recovery procedures (ERPs) use sense data to classify the error and to decide whether to retry. Errors are also recorded in the system error log (LOGREC), which the EREP program formats into reports; EREP itself reports on errors and does not recover from them. As an application programmer using VSAM you typically do not see the raw sense data; instead you see return codes and feedback codes in the RPL, or a file status in a high-level language such as COBOL. Those codes are derived from the I/O status and sense data by the access method. So when a VSAM GET or PUT fails, the code tells you the general kind of failure (e.g. I/O error, record not found); the detailed sense data is used internally for recovery and logging.
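As a rough illustration of how sense data can drive classification, the sketch below decodes the first sense byte and maps it to an error class. The bit names follow the layout commonly documented for sense byte 0 on IBM devices, but the exact format is device-specific, so check the reference for your device type. The classification policy (which conditions count as transient) and the names ErrorClass, decode_sense_byte0 and classify are assumptions made for this example.

```python
from enum import Enum
from typing import List

# Sense byte 0 bit masks as commonly documented for IBM devices.
# Treat these as illustrative; the device reference is authoritative.
SENSE_BYTE0_BITS = [
    (0x80, "command reject"),
    (0x40, "intervention required"),
    (0x20, "bus out check"),
    (0x10, "equipment check"),
    (0x08, "data check"),
    (0x04, "overrun"),
]


class ErrorClass(Enum):
    PERMANENT = "permanent"   # retrying will not help
    OPERATOR = "operator"     # someone must make the device ready
    TRANSIENT = "transient"   # a retry might succeed


def decode_sense_byte0(sense: bytes) -> List[str]:
    """Return the names of the conditions flagged in sense byte 0."""
    if not sense:
        raise ValueError("no sense data presented")
    return [name for mask, name in SENSE_BYTE0_BITS if sense[0] & mask]


def classify(sense: bytes) -> ErrorClass:
    """Hypothetical policy: map flagged conditions to an error class."""
    conditions = set(decode_sense_byte0(sense))
    if conditions & {"command reject", "equipment check"}:
        return ErrorClass.PERMANENT
    if "intervention required" in conditions:
        return ErrorClass.OPERATOR
    if conditions:
        return ErrorClass.TRANSIENT
    return ErrorClass.PERMANENT  # nothing recognised: do not retry blindly
```

Real error recovery procedures look at many more sense bytes than this; the point is only that the decision to retry is made from the sense data, below the level the application sees.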

Retry Behaviour

The z/OS I/O supervisor and the device support may retry certain I/O operations automatically. For example, a temporary hardware condition might cause one attempt to fail and the next to succeed. The number of retries and the conditions that trigger them are decided by the device's error recovery procedures and the access method, not by the application. VSAM may also retry in some cases (e.g. retry a read or write once or a few times before returning an error to the application). For permanent errors (e.g. a permanent I/O error, or "no such device"), retries will not help, and the error is reported to the application. Your program should check the return code and reason code after each VSAM call and handle errors: log them, retry at the application level if it makes sense, or terminate with a clear message.
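An application-level retry should be bounded and should log each failure. The sketch below assumes a hypothetical request callable that returns a (return code, feedback code) pair; with_retry, VsamRequestError and the default of three attempts are illustrative choices, not part of any VSAM interface.

```python
import logging
import time
from typing import Callable, Tuple

log = logging.getLogger("vsam.retry")


class VsamRequestError(Exception):
    """Raised when a request still fails after the allowed attempts."""

    def __init__(self, return_code: int, feedback: int, attempts: int):
        super().__init__(
            f"VSAM request failed: rc={return_code} feedback={feedback} "
            f"after {attempts} attempt(s)"
        )
        self.return_code = return_code
        self.feedback = feedback
        self.attempts = attempts


def with_retry(
    request: Callable[[], Tuple[int, int]],
    is_retryable: Callable[[int, int], bool],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> int:
    """Issue the request until it succeeds; return the attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        return_code, feedback = request()
        if return_code == 0:
            return attempt
        log.warning("attempt %d failed: rc=%d feedback=%d",
                    attempt, return_code, feedback)
        if not is_retryable(return_code, feedback):
            break  # e.g. record not found: retrying cannot help
        if attempt < max_attempts and delay_seconds > 0:
            time.sleep(delay_seconds)
    raise VsamRequestError(return_code, feedback, attempt)
```

Keeping the retry decision in a separate is_retryable function makes the policy easy to review: only errors that might clear on their own should be retried, and everything else should fail fast.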

Steps in I/O error detection and recovery

| Step | Description |
| --- | --- |
| I/O completes with error | Channel/device reports status and, on unit check, sense data. The I/O supervisor may retry (transient error) or post the ECB with an error (permanent error). |
| VSAM receives error | The access method gets control with the error indication. It may retry internally or return the error to the application (e.g. non-zero return code, feedback code in the RPL). |
| Application handling | The application checks return and reason codes. It may retry the operation, switch to another file, or abend. For critical data, logging and recovery procedures are used. |
| Abnormal termination | If the job abends or is cancelled while VSAM has the cluster open, run IDCAMS VERIFY before reopening so that the catalog's end-of-file information matches the data set. |

Abnormal Termination and VERIFY

If a job ends abnormally, for example because it abends (ABEND) or is cancelled (CANCEL), while a VSAM cluster is open for output, the cluster is not closed properly. The catalog may still show the cluster as open for output, and its end-of-file information (such as the high-used RBA) and statistics may not reflect records that were written before the failure. Before you reopen that cluster (in the same or another job), you should run the IDCAMS VERIFY command. VERIFY compares the end-of-file information in the catalog with the data set itself, updates the catalog where they differ, and resets the open-for-output indicator. VERIFY does not repair damaged records or undo partial application updates; on releases that support it, the RECOVER parameter additionally completes or backs out an interrupted control area (CA) reclaim. You can identify the cluster by data set name (VERIFY DATASET(dsname)) or by DD name (VERIFY FILE(ddname)). If you use FILE(ddname), the step that runs VERIFY must include a DD statement with that name that points to the cluster. The cluster is normally located through the standard catalog search; JOBCAT and STEPCAT DD statements are disabled by default on current z/OS releases, so do not rely on them. After VERIFY completes successfully, you can open the cluster again. If VERIFY reports problems, you may need to run EXAMINE or other recovery procedures as documented by IBM or your site.
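To show what a VERIFY step looks like, here is a minimal sketch that renders the JCL for an IDCAMS step running VERIFY DATASET(dsname). The helper verify_step, the default step name, and the data set name PROD.CUSTOMER.KSDS used in the usage note are hypothetical; a JOB statement and any site-specific parameters are left out.

```python
import re

# One data set name qualifier: 1-8 characters, starting with a letter or
# national character (#, @, $), followed by letters, digits, nationals or hyphens.
_QUALIFIER = re.compile(r"[A-Z#@$][A-Z0-9#@$-]{0,7}")


def verify_step(dsname: str, step_name: str = "VERIFY") -> str:
    """Return JCL text for an IDCAMS step that verifies one cluster."""
    qualifiers = dsname.split(".")
    if len(dsname) > 44 or not all(_QUALIFIER.fullmatch(q) for q in qualifiers):
        raise ValueError(f"not a valid data set name: {dsname!r}")
    return "\n".join([
        f"//{step_name:<8} EXEC PGM=IDCAMS",
        "//SYSPRINT DD SYSOUT=*",
        "//SYSIN    DD *",
        f"  VERIFY DATASET({dsname})",
        "/*",
    ])
```

Calling print(verify_step("PROD.CUSTOMER.KSDS")) produces a five-line step that can be placed after a JOB statement. Because DATASET names the cluster directly, the step needs no DD statement for the cluster itself; the FILE(ddname) form would need one.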

Application-Level Error Handling

In your program, after each VSAM request (GET, PUT, DELETE, etc.) you should check the return code and the RPL feedback. At the macro level, VSAM sets a return code in register 15 (0 for success, 8 for a logical error, 12 for a physical I/O error) and a feedback code in the RPL that gives the specific reason; high-level languages surface the same information through their own status fields. A non-zero return code indicates that the request did not complete successfully. Common reasons include: end of file (for sequential read), record not found (for keyed read), I/O error, and logic errors (e.g. invalid RPL option). For I/O errors, you might log the error and the key or RBA, retry the operation a limited number of times, or abend so that the job can be restarted and VERIFY can be run. For batch jobs that update VSAM, it is good practice to design for restart: if the job fails partway through, you can fix the cause (e.g. run VERIFY, fix data), and then restart the job from a checkpoint or from the beginning, depending on your design.
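The decision logic described above can be sketched as a small dispatch function. The return code values (8 for logical errors, 12 for physical errors) and the feedback codes for end of data (4), duplicate key (8) and record not found (16) follow the VSAM macro interface; the function name next_action and the action labels it returns are assumptions chosen for this example.

```python
# Register 15 return codes from a VSAM request macro.
RC_OK = 0
RC_LOGICAL_ERROR = 8
RC_PHYSICAL_ERROR = 12

# A few RPL feedback codes that accompany a logical error.
FDBK_END_OF_DATA = 4
FDBK_DUPLICATE_KEY = 8
FDBK_RECORD_NOT_FOUND = 16


def next_action(return_code: int, feedback: int) -> str:
    """Map a return code and feedback code to an application action."""
    if return_code == RC_OK:
        return "continue"
    if return_code == RC_PHYSICAL_ERROR:
        # I/O error: log the key or RBA, then retry a limited number of times.
        return "log-and-retry"
    if return_code == RC_LOGICAL_ERROR:
        if feedback == FDBK_END_OF_DATA:
            return "end-of-file"
        if feedback == FDBK_RECORD_NOT_FOUND:
            return "not-found"
        if feedback == FDBK_DUPLICATE_KEY:
            return "duplicate"
        return "abend"  # unexpected logic error: stop and investigate
    return "abend"      # any other return code is treated as fatal here
```

End of file and record not found are normal outcomes that the program should handle in line; only the unexpected cases lead to an abend and a restart.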

EXAMINE and Further Recovery

If VERIFY does not fully resolve the problem, or if you need to diagnose damage, IDCAMS provides the EXAMINE command. EXAMINE analyzes the structural consistency of the index and data components of a key-sequenced cluster (or of a catalog) and reports what it finds; it does not repair the data set. If EXAMINE reports errors, repair usually means rebuilding or restoring the cluster according to your site's procedures. Your installation may have procedures that say when to run EXAMINE and what parameters to use. In general, VERIFY is the first step after an abnormal end; EXAMINE is used when VERIFY is not enough or when you need to inspect the cluster for damage. After any recovery, you may also run LISTCAT to confirm the cluster is in the expected state before reopening it in your application.
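As a sketch of the diagnostic follow-up, the helper below builds the IDCAMS control statements for an EXAMINE of both components followed by a LISTCAT of the cluster. The function name diagnostic_commands is hypothetical, and the choice to run both INDEXTEST and DATATEST and to list ALL fields is an assumption; your site's procedures may call for different parameters.

```python
from typing import List


def diagnostic_commands(dsname: str) -> List[str]:
    """Return IDCAMS control statements to inspect a cluster after recovery.

    The statements are indented so they start after column 1, as IDCAMS
    expects for SYSIN input.
    """
    return [
        # Check the index component, then the data component.
        f"  EXAMINE NAME({dsname}) INDEXTEST DATATEST",
        # Show the catalog entry, including statistics and high-used RBA.
        f"  LISTCAT ENTRIES({dsname}) ALL",
    ]
```

The two statements can be placed in the SYSIN of a single IDCAMS step. EXAMINE output is a report to read, not a fix, so treat any error message as a reason to stop and follow the recovery procedure rather than reopen the cluster.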

Key Takeaways

  • When an I/O fails, the channel/device report status and possibly sense data; the I/O supervisor may retry (transient errors) or post the ECB with error (permanent or after retries).
  • VSAM returns errors to the application via return codes and RPL feedback codes; the application should check them and handle errors (log, retry, or abend).
  • After abnormal job termination (ABEND/CANCEL) with a VSAM cluster open, run IDCAMS VERIFY (FILE or DATASET) before reopening so that the catalog's end-of-file information matches the data set and the open-for-output indicator is reset.
  • Sense data describes the device-level failure; the access method translates it into reason codes for the application.
  • Use EXAMINE when VERIFY is insufficient or when diagnosing cluster damage; follow your site's recovery procedures.

Explain Like I'm Five

When the robot (the channel) tries to get the box (the I/O) and something goes wrong, it sends back a note (sense data) saying what went wrong. Sometimes the teacher (the system) lets the robot try again. If it still fails, the teacher tells you (the application) that the box could not be fetched. If you had to leave the room suddenly (abend) while the box was half open, the teacher has to tidy up (VERIFY) before anyone can use that box again.

Test Your Knowledge

1. When should you run VERIFY on a VSAM cluster?

  • Before every OPEN
  • After abnormal job termination (ABEND/CANCEL) when the cluster was open
  • Only for KSDS
  • Never

2. What is sense data used for?

  • To read records
  • To describe why an I/O failed (device/control unit status)
  • To define the cluster
  • To POST the ECB

3. Who can retry a failed I/O?

  • Only the application
  • The I/O supervisor and possibly VSAM (e.g. for transient errors)
  • Only the channel
  • No one

Author: MainframeMaster
Reviewed by: MainframeMaster team
Verified: IBM z/OS 2.5 documentation
Sources: IBM DFSMS Access Method Services, z/OS VSAM documentation
Applies to: z/OS 2.5