What happens when a VSAM I/O fails?

When an I/O operation fails (e.g. hardware error, device not ready, permanent error), the channel subsystem presents status to the I/O supervisor. The supervisor may retry the I/O (for transient errors) or post the ECB with an error. VSAM receives the error and returns a non-zero return code, with a feedback (reason) code in the RPL, to the application. If the job terminates abnormally (ABEND, CANCEL) before the cluster is closed, the catalog may no longer reflect the true end of the data set; IDCAMS VERIFY is used to bring the catalog information back in line with the data set.

What is VERIFY in VSAM?

VERIFY is an IDCAMS command used after abnormal termination of a job that had a VSAM cluster open for update. Because the cluster was never closed properly, the catalog may still show it as open for output and may hold out-of-date end-of-data information. VERIFY opens the data set, determines where the data actually ends, and updates the catalog to match. It does not back out or redo application updates. You run VERIFY (e.g. VERIFY FILE(ddname) or VERIFY DATASET(dsname)) before reopening the cluster.

What is sense data in mainframe I/O?

Sense data is device-specific status information returned when an I/O operation fails. The device and control unit provide sense bytes that describe the error (e.g. seek error, data check, not ready). The I/O supervisor and error recovery routines use sense data to decide whether to retry and to report the error to the application. VSAM and other access methods may surface this as a reason code or in the RPL/feedback area.

Does VSAM retry I/O on error?

The z/OS I/O supervisor and the device-specific error recovery procedures may retry transient I/O errors before anything is reported to the access method. For permanent errors, no amount of retry will succeed; the I/O is reported as failed and the application receives an error. The exact retry behaviour depends on the error type, the device, and the error recovery procedures involved.

MainframeMaster

Start I/O Error Recovery

When VSAM starts an I/O (via a channel program), the channel and device perform the operation. Sometimes the I/O fails: the device might be unavailable, a hardware error might occur, or the media might be damaged. The system must detect the failure, possibly retry, and eventually report the error to the application so it can take action. When a job ends abnormally (ABEND or CANCEL) while a VSAM cluster is open, the catalog and cluster may be left in an inconsistent state; recovery procedures (such as IDCAMS VERIFY) are used to restore consistency. This page explains how Start I/O error recovery works: what happens when I/O fails, sense data and status, retry behaviour, and how to use VERIFY and other steps to recover VSAM after abnormal termination.

What Happens When an I/O Fails

After the channel program is started, the channel subsystem and device execute the I/O. When they finish, they signal the CPU with an I/O interrupt and present status. The status shows whether the I/O completed normally or with an error. If the device presents unit check in its status, sense data is available: additional bytes that describe the failure (e.g. equipment check, data check, intervention required). The I/O supervisor in z/OS receives this and decides what to do. For some errors it may retry the I/O (for example, transient conditions that might succeed on a second try). For permanent errors, or after retries are exhausted, the I/O supervisor marks the I/O as failed and POSTs the ECB with an error indication. The code that started the I/O (e.g. VSAM or the I/O driver) then gets control with the error and returns a non-zero return code and reason code to the application, so the application knows the request failed.
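The first decision described above (did the I/O end normally, and is sense data available?) can be sketched as a decode of the device status byte. This is an illustrative model only: the bit values follow the conventional channel device-status layout, the function names are hypothetical, and real z/OS code would also inspect the subchannel status, which this sketch ignores.

```python
from enum import IntFlag


class DeviceStatus(IntFlag):
    """Conventional device-status bits presented at I/O completion."""
    ATTENTION = 0x80
    STATUS_MODIFIER = 0x40
    CONTROL_UNIT_END = 0x20
    BUSY = 0x10
    CHANNEL_END = 0x08
    DEVICE_END = 0x04
    UNIT_CHECK = 0x02
    UNIT_EXCEPTION = 0x01


def completed_normally(status: int) -> bool:
    """Simplified rule: exactly channel end plus device end, nothing else."""
    normal = DeviceStatus.CHANNEL_END | DeviceStatus.DEVICE_END
    return DeviceStatus(status) == normal


def sense_available(status: int) -> bool:
    """Unit check signals that sense data describes the failure."""
    return bool(DeviceStatus(status) & DeviceStatus.UNIT_CHECK)


if __name__ == "__main__":
    for status in (0x0C, 0x0E):
        print(f"{status:#04x}: normal={completed_normally(status)} "
              f"sense={sense_available(status)}")
```

A status of X'0C' is treated as a clean ending here, while X'0E' adds unit check and would lead the error recovery code to look at the sense bytes.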

Sense Data and Status

Sense data is device-specific. Different device types (e.g. 3390 DASD, tape) have different sense formats. For DASD, the sense bytes might indicate conditions such as equipment check, data check, or intervention required (device not ready). The I/O supervisor and the device error recovery procedures use sense data to classify the error and to decide whether to retry; errors are also recorded for later reporting (for example, through EREP). As an application programmer using VSAM you typically do not see the raw sense data; instead you see return codes and reason codes (e.g. in the RPL or in the language runtime). The access method derives those codes from the I/O completion status. So when a VSAM GET or PUT fails, the return and reason codes tell you the general kind of failure (e.g. physical I/O error, record not found); the detailed sense data is used internally for recovery and logging.
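To make the idea of sense bytes concrete, here is a minimal sketch that names the conditions flagged in sense byte 0. It assumes the traditional layout of the first six bits of byte 0; the remaining bits and all later bytes are device-dependent and are ignored. The table and function name are illustrative, not part of any z/OS interface.

```python
# Assumed layout of sense byte 0 (bits 0-5); later bits and bytes vary by device.
SENSE_BYTE0_BITS = [
    (0x80, "command reject"),
    (0x40, "intervention required"),
    (0x20, "bus out check"),
    (0x10, "equipment check"),
    (0x08, "data check"),
    (0x04, "overrun"),
]


def describe_sense(sense: bytes) -> list[str]:
    """Return the names of the conditions set in sense byte 0."""
    if not sense:
        return []
    byte0 = sense[0]
    return [name for mask, name in SENSE_BYTE0_BITS if byte0 & mask]


if __name__ == "__main__":
    print(describe_sense(bytes([0x08, 0x00, 0x00])))
    print(describe_sense(bytes([0x50])))
```

An application using VSAM would not normally run anything like this; the point is only that the device reports a small set of coded conditions which the system later condenses into return and reason codes.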

Retry Behaviour

The z/OS I/O supervisor and the device support may retry certain I/O operations automatically. For example, a temporary hardware condition might cause one attempt to fail and the next to succeed. The number of retries and the conditions that trigger them are determined by the device-specific error recovery procedures rather than by the application. VSAM generally reports an error to the application only after that system-level recovery has given up. For permanent errors (e.g. a permanent I/O error, or "no such device"), retries will not help; the error is reported to the application. Your program should check the return and reason codes after each VSAM call and handle errors: log them, retry at the application level if it makes sense, or terminate with a clear message.
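Application-level retry, where it makes sense at all, should be bounded and should never loop on a permanent error. The sketch below shows one way to express that rule. The exception classes and the with_retries helper are hypothetical names introduced for illustration; how a real program decides that an error is transient depends on the codes it receives.

```python
class TransientIOError(Exception):
    """Hypothetical: an error that might succeed if the request is repeated."""


class PermanentIOError(Exception):
    """Hypothetical: an error that retrying cannot fix."""


def with_retries(operation, max_attempts=3):
    """Call operation(), repeating only on TransientIOError, up to max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TransientIOError as exc:
            last_error = exc  # remember the failure and try again
    # Retries exhausted: report the failure as permanent to the caller.
    raise PermanentIOError(
        f"gave up after {max_attempts} attempts"
    ) from last_error


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_read():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientIOError("device busy")
        return "record"

    print(with_retries(flaky_read), "after", attempts["count"], "attempts")
```

A PermanentIOError raised by the operation itself is not caught, so it reaches the caller on the first attempt instead of being retried.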

Steps in I/O error detection and recovery
Step	Description
I/O completes with error	Channel subsystem and device present status (e.g. subchannel/device status and, on unit check, sense data). I/O supervisor may retry (transient) or post ECB with error (permanent).
VSAM receives error	Access method gets control with error indication. It may retry internally or return error to the application (e.g. non-zero return code, reason code in RPL).
Application handling	Application checks return/reason codes. It may retry the operation, switch to another file, or abend. For critical data, logging and recovery procedures are used.
Abnormal termination	If the job abends or is cancelled while VSAM has the cluster open for update, run IDCAMS VERIFY to bring the catalog's end-of-data information and open indicator back in line with the data set before reopening.

Abnormal Termination and VERIFY

If a job ends abnormally (for example, it abends or is cancelled) while a VSAM cluster is open for update, the cluster is never closed properly. The catalog may still show the cluster as open for output, and the end-of-data information it holds may be older than what is actually in the data set. Before you reopen that cluster for processing (in the same or another job), you should run the IDCAMS VERIFY command. VERIFY opens the data set, determines where the data really ends, and updates the catalog to match. It does not back out or reapply your application's updates; records that were half-processed when the job failed remain the application's responsibility. On releases that support it, the RECOVER option additionally completes or backs out interrupted VSAM internal processing such as CA reclaim; check the IDCAMS documentation for your level of z/OS. You can specify the cluster by data set name (VERIFY DATASET(dsname)) or by DD name (VERIFY FILE(ddname)). If you use FILE(ddname), the step that runs VERIFY must contain a DD statement with that name pointing to the cluster. The cluster is located through the normal catalog search; JOBCAT and STEPCAT DD statements are not supported on current z/OS releases. After VERIFY completes successfully, you can open the cluster again. If VERIFY reports problems, you may need to run EXAMINE or other recovery procedures as documented by IBM or your site.
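A VERIFY step is short enough to generate from a recovery script. The sketch below builds the text of an IDCAMS step that issues VERIFY DATASET(dsname). The helper name, the default step name and the data set name in the usage example are made up for illustration; JOB statement details and SYSOUT class are site-specific and left out.

```python
def verify_step(dsname: str, stepname: str = "VERIFY") -> str:
    """Return JCL text for an IDCAMS step that runs VERIFY on one cluster."""
    if not dsname or len(dsname) > 44:
        raise ValueError("data set name must be 1 to 44 characters")
    if not stepname or len(stepname) > 8:
        raise ValueError("step name must be 1 to 8 characters")
    lines = [
        f"//{stepname.upper():<8} EXEC PGM=IDCAMS",
        "//SYSPRINT DD SYSOUT=*",
        "//SYSIN    DD *",
        # IDCAMS control statements start in column 2 or later.
        f"  VERIFY DATASET({dsname.upper()})",
        "/*",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # PROD.CUSTOMER.KSDS is a placeholder cluster name.
    print(verify_step("prod.customer.ksds"))
```

The generated step would be placed after a valid JOB statement and submitted before the failed job is restarted.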

Application-Level Error Handling

In your program, after each VSAM request (GET, PUT, DELETE, etc.) you should check the return code and the RPL feedback. With the assembler macros, the return code comes back in register 15 and the reason (feedback) code is held in the RPL; in COBOL and other high-level languages the same information surfaces through the file status. A non-zero return code indicates that the request did not complete successfully, and the reason code says why. Common reasons include: end of file (for sequential read), record not found (for keyed read), I/O error, and logic errors (e.g. invalid RPL option). For I/O errors, you might log the error and the key or RBA, retry the operation a limited number of times, or abend so that the job can be restarted and VERIFY can be run. For batch jobs that update VSAM, it is good practice to design for restart: if the job fails partway through, you can fix the cause (e.g. run VERIFY, fix data), and then restart the job from a checkpoint or from the beginning, depending on your design.
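The checking logic above can be sketched as a small classifier. It assumes the commonly documented convention for VSAM request macros: return code 0 for success, 8 for a logical error and 12 for a physical error, with a handful of well-known logical-error feedback codes. Treat the table as an illustrative subset to confirm against IBM's macro documentation; the function name is hypothetical.

```python
# Assumed subset of feedback codes that accompany return code 8 (logical error).
LOGICAL_REASONS = {
    4: "end of data",
    8: "duplicate key",
    16: "record not found",
}


def classify(return_code: int, feedback: int) -> str:
    """Turn a (return code, feedback code) pair into a short description."""
    if return_code == 0:
        return "ok"
    if return_code == 8:
        return LOGICAL_REASONS.get(
            feedback, f"logical error (feedback {feedback})"
        )
    if return_code == 12:
        return "physical I/O error"
    return f"unexpected return code {return_code}"


if __name__ == "__main__":
    for rc, fdbk in [(0, 0), (8, 4), (8, 16), (12, 4)]:
        print(rc, fdbk, "->", classify(rc, fdbk))
```

A program would typically treat "end of data" as the normal end of a sequential read, handle "record not found" in its business logic, and log and stop on a physical I/O error.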

EXAMINE and Further Recovery

If VERIFY does not fully resolve the problem, or if you need to diagnose damage, IDCAMS provides the EXAMINE command. EXAMINE analyses the structural integrity of the index and data components of a key-sequenced cluster (or of a catalog) and reports what it finds; it is a diagnostic tool and does not repair the data set. If EXAMINE reports damage, recovery usually means restoring from a backup or rebuilding the cluster, following your installation's procedures. Your installation may also have procedures that say when to run EXAMINE and what parameters to use. In general, VERIFY is the first step after an abnormal end; EXAMINE is used when VERIFY is not enough or when you need to inspect the cluster for structural damage. After any recovery, you may also run LISTCAT to confirm the cluster is in the expected state before reopening it in your application.

Key Takeaways

When an I/O fails, the channel/device report status and possibly sense data; the I/O supervisor may retry (transient errors) or post the ECB with error (permanent or after retries).
VSAM returns errors to the application via return codes and RPL reason codes; the application should check them and handle errors (log, retry, or abend).
After abnormal job termination (ABEND/CANCEL) with a VSAM cluster open for update, run IDCAMS VERIFY (FILE or DATASET) to bring the catalog's end-of-data information and open indicator back in line with the data set before reopening. VERIFY does not back out application updates.
Sense data describes the device-level failure; the access method translates it into reason codes for the application.
Use EXAMINE when VERIFY is insufficient or when diagnosing cluster damage; it reports structural problems but does not repair them, so follow your site's recovery procedures.

Explain Like I'm Five

When the robot (the channel) tries to get the box (the I/O) and something goes wrong, it sends back a note (sense data) saying what went wrong. Sometimes the teacher (the system) lets the robot try again. If it still fails, the teacher tells you (the application) that the box could not be fetched. If you had to leave the room suddenly (abend) while the box was half open, the teacher has to tidy up (VERIFY) before anyone can use that box again.

Test Your Knowledge

1. When should you run VERIFY on a VSAM cluster?

Before every OPEN
After abnormal job termination (ABEND/CANCEL) when the cluster was open
Only for KSDS
Never

2. What is sense data used for?

To read records
To describe why an I/O failed (device/control unit status)
To define the cluster
To POST the ECB

3. Who can retry a failed I/O?

Only the application
The I/O supervisor and possibly VSAM (e.g. for transient errors)
Only the channel
No one

