MainframeMaster

CSV Handling in DFSORT

CSV (comma-separated values) is a common format for data exchange. On the mainframe, DFSORT can read CSV input, parse it into fixed-length fields, sort or filter by columns, and write fixed-length or formatted output. This page covers parsing CSV with PARSE, handling quoted fields (where a comma appears inside a value), dealing with header rows, and considerations for character set (EBCDIC vs ASCII) and variable-length input.

Data Transformation
Progress0 of 0 lessons

What Is CSV?

CSV is a text format where each record is a line and each field is separated by a comma. Example: Name,Age,City and Alice,30,NYC. If a value contains a comma, it is often enclosed in double quotes, e.g. "Smith, John",25,Boston. DFSORT does not "know" CSV by name; you use the same PARSE mechanism as for any delimited file, with comma as the delimiter and optional handling for quotes.

Parsing CSV with PARSE

To turn CSV into fixed-length fields, use INREC or OUTFIL with PARSE. For each column you define a parsed field (e.g. %01, %02) with:

  • ENDBEFR=C',' — The field is everything from the current position up to (but not including) the next comma. So the first field is from the start of the record to the first comma; the second is from after that comma to the next comma; and so on.
  • FIXLEN=n — The extracted value is written into a fixed-length area of n bytes. Shorter values are typically padded (e.g. with spaces); longer values may be truncated. FIXLEN is required so that BUILD can place fields at known positions.

After defining all parsed fields, use BUILD to place them at fixed positions in the output record. The exact BUILD syntax (e.g. 1:%01,11:%02) is product-dependent; the numbers usually indicate starting position and possibly length.

Key Options for CSV

Options commonly used for CSV in DFSORT
OptionMeaning
ENDBEFR=C','Field ends at the next comma; used for each CSV column.
STARTAFT=C'"'Start the field after the opening double quote; used for quoted CSV fields.
ENDBEFR=C'"'End the field at the closing double quote so commas inside are not delimiters.
FIXLEN=nOutput length for the parsed value; padded or truncated to fit the fixed-length output.
SKIPREC=1Skip the first input record (e.g. CSV header row).

Example: Simple CSV to Fixed-Length

Input CSV has three columns: ID (numeric), Name, City. Example record: 100,Alice,NYC. You want fixed-length output: ID in positions 1–3, Name in 4–23 (20 bytes), City in 24–33 (10 bytes).

text
1
2
3
4
5
OPTION COPY INREC PARSE=(%01=(ENDBEFR=C',',FIXLEN=3), %02=(ENDBEFR=C',',FIXLEN=20), %03=(ENDBEFR=C',',FIXLEN=10)), BUILD=(1:%01,4:%02,24:%03)

PARSE splits each record on commas. %01 gets "100", %02 gets "Alice" (padded to 20 bytes), %03 gets "NYC" (padded to 10). BUILD places them at 1, 4, and 24. With INREC, the reformatted record is what gets written (and with OPTION COPY, no sort is performed). If you use SORT FIELDS= instead of OPTION COPY, you can sort by any of these positions.

Quoted CSV Fields

In standard CSV, a field that contains a comma (or a newline) is enclosed in double quotes. For example: "Smith, John",25,Boston. If you used only ENDBEFR=C',', the first field would be incorrectly parsed as "Smith and the next as John". To treat the whole quoted string as one field, define that parsed field to start after the opening quote and end before the closing quote:

text
1
%01=(STARTAFT=C'"',ENDBEFR=C'"',FIXLEN=20)

STARTAFT=C'"' means "start after the first double quote"; ENDBEFR=C'"' means "end when you see the next double quote." So the value between the quotes (including any comma) is taken as one field. You use this for the quoted column; other columns can still use ENDBEFR=C','. Some DFSORT versions offer PAIR=QUOTE or similar for CSV; check your manual.

Header Row

Many CSV files have a first record that is a header (column names), e.g. Name,Age,City. You usually do not want to sort that row with the data. Use OPTION SKIPREC=1 to skip the first record. DFSORT will not read it into the sort phase, so it will not appear in SORTOUT. If you need the header in the output file, you must add it by another method: for example, a separate step that writes the header then appends the sorted data, or an OUTFIL option that writes a header line (e.g. HEADER=) and then the records. The exact options depend on your DFSORT release.

INREC vs OUTFIL for CSV

INREC PARSE runs before the sort and before INCLUDE/OMIT. The record is converted from CSV to fixed-length first. So you can SORT FIELDS= on the fixed positions and use INCLUDE/OMIT conditions on them. Use INREC when you need to sort or filter by CSV columns. OUTFIL PARSE runs when building the output for that OUTFIL. The input to the OUTFIL may already be in CSV (or fixed) form; PARSE then converts the delimited record to fixed format for that output dataset. Use OUTFIL when you only need the parsed layout for a specific output and the main sort/filter use the original record. When using OUTFIL with PARSE you still need a SORT or MERGE or OPTION COPY so that there is a defined processing flow.

Variable-Length (VB) CSV Input

CSV is often stored as variable-length records (RECFM=VB): each line is one record, so record length varies. In VB format, the data starts after the 4-byte RDW (record descriptor word), so the first data byte is at position 5. PARSE operates on the record content (the data after the RDW). If your input is VB and output is fixed, ensure the BUILD output length matches the output DCB LRECL; product messages (e.g. ICE222A) may indicate length mismatches.

Character Set: EBCDIC vs ASCII

On z/OS, datasets are typically EBCDIC. CSV files created on Windows or Unix are often ASCII. The comma and double-quote characters have different hex values in EBCDIC and ASCII. If you receive a CSV in ASCII and read it as EBCDIC, PARSE may not find the commas or quotes correctly. Options: (1) Convert the file to EBCDIC before the sort (e.g. using a conversion utility or iconv); (2) Use a dataset that is tagged with the correct encoding so the system converts on read; (3) Ensure the CSV is already in EBCDIC when produced. This is especially important when CSV is transferred from distributed systems to the mainframe.

Empty Fields and Trailing Commas

CSV can have empty fields: A,,C means the second field is empty. With ENDBEFR=C',', the parser still advances: the second parsed field will be zero bytes (or whatever is between the two commas). That empty value is then placed in the FIXLEN area, usually as padding (spaces). Trailing commas (e.g. A,B,) may produce an empty last field; again it will be padded to FIXLEN. Your BUILD and downstream logic should account for possible blank or empty parsed fields if your data can have them.

Explain It Like I'm Five

Imagine a row of boxes with commas between them: apple, banana, cherry. CSV is like that: words separated by commas. Sometimes a box has a comma inside it, so we put that box in quotes: "apple, pie". DFSORT looks at the line, finds each comma (or the quotes), and takes out each piece. We tell it how big each piece should be (FIXLEN) and where to put it in a new row (BUILD). So we turn a line of comma-separated words into a fixed row of boxes that the mainframe can sort and use.

Exercises

  1. Write PARSE and BUILD for a CSV with four columns: A (5 bytes), B (10 bytes), C (15 bytes), D (5 bytes). Assume no quoted fields.
  2. Why do we use STARTAFT and ENDBEFR with the quote character for one column instead of ENDBEFR=C','?
  3. When would you use OPTION SKIPREC=1 in a CSV job?
  4. If your CSV file is in ASCII and your mainframe dataset is EBCDIC, what can go wrong with PARSE? What would you do?

Quiz

Test Your Knowledge

1. What is the main DFSORT feature used to read CSV fields?

  • INCLUDE only
  • PARSE with ENDBEFR=C',' — each field is defined as "up to the next comma" and given a FIXLEN for output
  • OUTREC only
  • SUM only

2. How do you handle a CSV field that contains a comma (e.g. "Smith, John")?

  • PARSE cannot handle it
  • Use quoted fields: STARTAFT=C'"' and ENDBEFR=C'"' so the parser takes the value between the double quotes, and the comma inside is ignored
  • Use FINDREP first
  • Use two PARSE passes

3. Why might you use INREC PARSE instead of OUTFIL PARSE for CSV?

  • INREC is always faster
  • When you need to sort or filter (INCLUDE/OMIT) by CSV columns — INREC runs before the sort so the reformatted fixed-length record is what gets sorted and filtered
  • OUTFIL PARSE does not exist
  • Only for VB files

4. What is a common pitfall with CSV on the mainframe?

  • CSV is not supported
  • Character set: CSV files may be in ASCII (e.g. from PC) while mainframe uses EBCDIC — you may need conversion or ensure the file is in the correct code page
  • PARSE is only for fixed-length
  • Headers are required

5. How can you skip or handle a CSV header row?

  • PARSE automatically skips the first record
  • Use OPTION SKIPREC=1 to skip the first record so it is not sorted or written; the header is not passed to SORTOUT
  • Use INCLUDE to keep only data rows
  • Headers cannot be skipped