CSV (comma-separated values) is a common format for data exchange. On the mainframe, DFSORT can read CSV input, parse it into fixed-length fields, sort or filter by columns, and write fixed-length or formatted output. This page covers parsing CSV with PARSE, handling quoted fields (where a comma appears inside a value), dealing with header rows, and considerations for character set (EBCDIC vs ASCII) and variable-length input.
CSV is a text format where each record is a line and each field is separated by a comma. Example: Name,Age,City and Alice,30,NYC. If a value contains a comma, it is often enclosed in double quotes, e.g. "Smith, John",25,Boston. DFSORT does not "know" CSV by name; you use the same PARSE mechanism as for any delimited file, with comma as the delimiter and optional handling for quotes.
To turn CSV into fixed-length fields, use INREC or OUTFIL with PARSE. For each column you define a parsed field (e.g. %01, %02) with:
After defining all parsed fields, use BUILD to place them at fixed positions in the output record. The exact BUILD syntax (e.g. 1:%01,11:%02) is product-dependent; the numbers usually indicate starting position and possibly length.
| Option | Meaning |
|---|---|
| ENDBEFR=C',' | Field ends at the next comma; used for each CSV column. |
| STARTAFT=C'"' | Start the field after the opening double quote; used for quoted CSV fields. |
| ENDBEFR=C'"' | End the field at the closing double quote so commas inside are not delimiters. |
| FIXLEN=n | Output length for the parsed value; padded or truncated to fit the fixed-length output. |
| SKIPREC=1 | Skip the first input record (e.g. CSV header row). |
Input CSV has three columns: ID (numeric), Name, City. Example record: 100,Alice,NYC. You want fixed-length output: ID in positions 1–3, Name in 4–23 (20 bytes), City in 24–33 (10 bytes).
12345OPTION COPY INREC PARSE=(%01=(ENDBEFR=C',',FIXLEN=3), %02=(ENDBEFR=C',',FIXLEN=20), %03=(ENDBEFR=C',',FIXLEN=10)), BUILD=(1:%01,4:%02,24:%03)
PARSE splits each record on commas. %01 gets "100", %02 gets "Alice" (padded to 20 bytes), %03 gets "NYC" (padded to 10). BUILD places them at 1, 4, and 24. With INREC, the reformatted record is what gets written (and with OPTION COPY, no sort is performed). If you use SORT FIELDS= instead of OPTION COPY, you can sort by any of these positions.
In standard CSV, a field that contains a comma (or a newline) is enclosed in double quotes. For example: "Smith, John",25,Boston. If you used only ENDBEFR=C',', the first field would be incorrectly parsed as "Smith and the next as John". To treat the whole quoted string as one field, define that parsed field to start after the opening quote and end before the closing quote:
1%01=(STARTAFT=C'"',ENDBEFR=C'"',FIXLEN=20)
STARTAFT=C'"' means "start after the first double quote"; ENDBEFR=C'"' means "end when you see the next double quote." So the value between the quotes (including any comma) is taken as one field. You use this for the quoted column; other columns can still use ENDBEFR=C','. Some DFSORT versions offer PAIR=QUOTE or similar for CSV; check your manual.
Many CSV files have a first record that is a header (column names), e.g. Name,Age,City. You usually do not want to sort that row with the data. Use OPTION SKIPREC=1 to skip the first record. DFSORT will not read it into the sort phase, so it will not appear in SORTOUT. If you need the header in the output file, you must add it by another method: for example, a separate step that writes the header then appends the sorted data, or an OUTFIL option that writes a header line (e.g. HEADER=) and then the records. The exact options depend on your DFSORT release.
INREC PARSE runs before the sort and before INCLUDE/OMIT. The record is converted from CSV to fixed-length first. So you can SORT FIELDS= on the fixed positions and use INCLUDE/OMIT conditions on them. Use INREC when you need to sort or filter by CSV columns. OUTFIL PARSE runs when building the output for that OUTFIL. The input to the OUTFIL may already be in CSV (or fixed) form; PARSE then converts the delimited record to fixed format for that output dataset. Use OUTFIL when you only need the parsed layout for a specific output and the main sort/filter use the original record. When using OUTFIL with PARSE you still need a SORT or MERGE or OPTION COPY so that there is a defined processing flow.
CSV is often stored as variable-length records (RECFM=VB): each line is one record, so record length varies. In VB format, the data starts after the 4-byte RDW (record descriptor word), so the first data byte is at position 5. PARSE operates on the record content (the data after the RDW). If your input is VB and output is fixed, ensure the BUILD output length matches the output DCB LRECL; product messages (e.g. ICE222A) may indicate length mismatches.
On z/OS, datasets are typically EBCDIC. CSV files created on Windows or Unix are often ASCII. The comma and double-quote characters have different hex values in EBCDIC and ASCII. If you receive a CSV in ASCII and read it as EBCDIC, PARSE may not find the commas or quotes correctly. Options: (1) Convert the file to EBCDIC before the sort (e.g. using a conversion utility or iconv); (2) Use a dataset that is tagged with the correct encoding so the system converts on read; (3) Ensure the CSV is already in EBCDIC when produced. This is especially important when CSV is transferred from distributed systems to the mainframe.
CSV can have empty fields: A,,C means the second field is empty. With ENDBEFR=C',', the parser still advances: the second parsed field will be zero bytes (or whatever is between the two commas). That empty value is then placed in the FIXLEN area, usually as padding (spaces). Trailing commas (e.g. A,B,) may produce an empty last field; again it will be padded to FIXLEN. Your BUILD and downstream logic should account for possible blank or empty parsed fields if your data can have them.
Imagine a row of boxes with commas between them: apple, banana, cherry. CSV is like that: words separated by commas. Sometimes a box has a comma inside it, so we put that box in quotes: "apple, pie". DFSORT looks at the line, finds each comma (or the quotes), and takes out each piece. We tell it how big each piece should be (FIXLEN) and where to put it in a new row (BUILD). So we turn a line of comma-separated words into a fixed row of boxes that the mainframe can sort and use.
1. What is the main DFSORT feature used to read CSV fields?
2. How do you handle a CSV field that contains a comma (e.g. "Smith, John")?
3. Why might you use INREC PARSE instead of OUTFIL PARSE for CSV?
4. What is a common pitfall with CSV on the mainframe?
5. How can you skip or handle a CSV header row?