MainframeMaster

How DFSORT Works Internally (Phases)

DFSORT processes your data in a fixed order: first it reads and optionally reformats or filters records, then it sorts or merges them, then it reformats and writes the result. Understanding these phases helps you know when each control statement runs and how to design your job correctly.

The Three Main Phases

DFSORT runs your job in three main stages: input, sort (or merge), and output. Each stage has a specific role. Control statements are tied to one of these stages, so knowing the phase order tells you when your INREC, INCLUDE, OMIT, OUTREC, and OUTFIL logic runs.

Phase 1: Input Phase

In the input phase, DFSORT reads records from the input dataset(s): the SORTIN DD for a sort or copy, or SORTIN01, SORTIN02, and so on for a merge. For each record read, the following happen in logical order:

  • Read the record: The raw record is read from the input dataset. For variable-length (VB) datasets, the record includes the 4-byte RDW; for fixed-length (FB), the record is the full LRECL bytes.
  • Apply INCLUDE or OMIT (if specified): DFSORT filters before it reformats, so the positions in an INCLUDE or OMIT condition refer to the original input record. If you coded INCLUDE, only records that satisfy the INCLUDE condition are kept; all others are discarded. If you coded OMIT, records that satisfy the OMIT condition are discarded; all others are kept. Filtering here reduces the number of records that reach INREC and the sort phase, which can save time and memory. If you specify neither INCLUDE nor OMIT, every record is kept.
  • Apply INREC (if specified): If you coded INREC, DFSORT now reformats each record that survived the filter. It can shorten the record, reorder fields, add constants, convert dates, build new keys, or apply IFTHEN logic. The result of INREC is the "record" that the rest of the job sees, so the sort keys you specify in SORT FIELDS= refer to this reformatted record, not the original input. INREC is optional; if you omit it, the record is passed through unchanged.

After the input phase, DFSORT has a set of records in memory (or in work datasets) that are ready to be sorted or merged. So the input phase answers: "What records do we have, and in what layout?"
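The input-phase steps can be sketched with a few control statements. This is an illustration only: the 80-byte layout (record type in column 1, customer id in columns 10 to 19, amount in columns 40 to 49) is an assumption made for the example, not something taken from a real dataset.

```text
* Assumed input: FB, LRECL=80.
*   col 1       record type
*   cols 10-19  customer id
*   cols 40-49  amount
* Keep only records whose type byte is 'A'.
  INCLUDE COND=(1,1,CH,EQ,C'A')
* Shrink each kept record to 21 bytes: type, id, amount.
  INREC BUILD=(1,1,10,10,40,10)
* Sort positions refer to the 21-byte INREC layout,
* where the customer id now starts in col 2.
  SORT FIELDS=(2,10,CH,A)
```

The order in which the statements are coded does not change the order in which DFSORT applies them; the phase order is fixed.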

Phase 2: Sort or Merge Phase

In the sort phase, DFSORT takes the records that survived the input phase and arranges them in order. How it does that depends on whether you used a SORT or a MERGE:

  • SORT — DFSORT uses the keys you specified in SORT FIELDS= to compare records and reorder them. The result is a single stream of records in ascending or descending order by those keys. Internally this may use in-memory sorting and/or external sortwork datasets (temporary datasets used when data does not fit in memory). The exact algorithm (e.g. how many passes, how much memory) is chosen by DFSORT and options like SIZE.
  • MERGE — You have two or more input streams that are already sorted by the same key. DFSORT does not re-sort; it merges the streams by repeatedly comparing the next record from each stream and writing the smallest (or largest) to the output stream. So the merge phase is a linear pass that preserves order. MERGE is faster than SORT when data is already sorted because it avoids the full sort algorithm.
  • OPTION COPY — If you specify OPTION COPY and do not specify SORT FIELDS=, DFSORT skips the sort phase entirely. Records are passed from input to output in the order they were read (after INCLUDE/OMIT). So the "sort phase" in that case is effectively "no reordering."
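The three cases above correspond to three alternative control statements, of which a given step uses one. A minimal sketch, assuming a 10-byte character key in columns 10 to 19 (a hypothetical layout):

```text
* Alternative 1: sort on the key, ascending.
  SORT FIELDS=(10,10,CH,A)

* Alternative 2: merge inputs that are already in key order
* (allocated as SORTIN01, SORTIN02, and so on).
  MERGE FIELDS=(10,10,CH,A)

* Alternative 3: copy records in input order, no reordering.
  OPTION COPY
```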

If you use SUM, records with the same key are collapsed as part of or right after the sort/merge phase, once those records are adjacent and can be combined. SUM FIELDS=(...) adds up the named numeric fields into the one record that is kept; SUM FIELDS=NONE keeps one record per key and discards the duplicates without adding anything. So SUM is logically part of "what we do with the ordered stream" before we write it out.

Phase 3: Output Phase

In the output phase, DFSORT takes the ordered (and possibly summed) records and writes them to the output dataset(s). For each record to be written, the following apply:

  • Apply OUTREC (if specified) — OUTREC reformats the record that is about to be written. The record at this point is the one that came out of the sort/merge (and SUM if used)—so it is the post–input-phase, post-sort record. OUTREC can reorder fields, add spaces, edit numbers, add sequence numbers, do FINDREP, and use IFTHEN. The result of OUTREC is what actually gets written. If you do not specify OUTREC, the record is written as-is (the same layout as after the sort phase).
  • Write to SORTOUT and/or OUTFIL — The primary output stream is written to the dataset allocated to the SORTOUT DD. If you defined OUTFIL control statements, DFSORT may also write to additional DD names (e.g. OUTFIL FNAMES=REPORT) or split output (SPLIT, SPLITBY, etc.). Each OUTFIL can have its own INCLUDE/OMIT and reformatting, so the output phase can produce multiple different outputs from the same sorted stream.

So the output phase answers: "What do we write, and where?"
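As a sketch of the output phase producing more than one output from the same sorted stream, the statements below use two OUTFIL groups. The DD names TYPEA and OTHERS and the record layout are hypothetical; both DDs would have to be allocated in the JCL.

```text
* Assumed layout: col 1 = record type, cols 10-19 = key,
* cols 40-49 = amount.
  SORT FIELDS=(10,10,CH,A)
* First output: type 'A' records only, reduced to key,
* one blank, and amount.
  OUTFIL FNAMES=TYPEA,
         INCLUDE=(1,1,CH,EQ,C'A'),
         BUILD=(10,10,X,40,10)
* Second output: every record not selected by another OUTFIL,
* written in its sorted layout.
  OUTFIL FNAMES=OTHERS,SAVE
```

Each OUTFIL group filters and reformats independently, so the two datasets can differ in both content and layout.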

Why the Order Matters

Knowing that INREC runs before the sort and OUTREC runs after the sort helps you avoid mistakes. For example:

  • If you want to sort by a field that you are building (e.g. a computed key or a combination of columns), you must build that field in INREC, not OUTREC. If you build it only in OUTREC, the sort phase never sees it, so sorting cannot use it.
  • If you want to shorten records before sorting to save memory and improve performance, you must use INREC. OUTREC only affects what is written; the sort phase still sees the full record if you did not shorten it in INREC.
  • If you want the final report layout (e.g. column headers, edited numbers, sequence numbers), you do that in OUTREC or in OUTFIL build. Doing it in INREC would change what gets sorted and might change the order or content of the data you want in the report.
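The first point can be illustrated with a key built in INREC and removed again in OUTREC. This is a sketch under an assumed layout (FB, 80 bytes, key made of columns 1 to 10 plus columns 71 to 80), not a recommended design.

```text
* Assumed input: FB, LRECL=80.
* Copy both key pieces past the end of the record so they
* form one contiguous 20-byte key in cols 81-100.
  INREC OVERLAY=(81:1,10,91:71,10)
* The sort phase sees the extended record, so it can use
* the key that INREC built.
  SORT FIELDS=(81,20,CH,A)
* Drop the temporary key before writing.
  OUTREC BUILD=(1,80)
```

In this simple case the same order could be requested directly with two sort fields, SORT FIELDS=(1,10,CH,A,71,10,CH,A); the sketch only shows where a built key has to be created for the sort to see it.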

Simple Example: Order of Operations

Suppose you have a fixed-length 80-byte input. You want to: keep only records where position 1 is 'A'; sort by positions 10–19 (character); write only positions 10–19 and 40–49 to the output with a space between them. You could use:

  INCLUDE COND=(1,1,CH,EQ,C'A')
  SORT FIELDS=(10,10,CH,A)
  OUTREC FIELDS=(10,10,X,40,10)

Processing order: (1) Input phase: read each record; INCLUDE keeps only records with byte 1 = 'A'; no INREC, so each kept record stays 80 bytes. (2) Sort phase: sort the kept records by positions 10–19, ascending. (3) Output phase: for each sorted record, OUTREC picks bytes 10–19, adds one blank (the X), then picks bytes 40–49 (21 bytes total) and writes the result to SORTOUT. So the phases clearly separate "filter and keep layout" → "order" → "final layout."
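To run this example, the control statements go into the SYSIN of a step that executes DFSORT. The step below is a minimal sketch: the dataset names are hypothetical, the SPACE values are placeholders, allocation parameters are site-specific, and the job card is omitted.

```text
//STEP01   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=MY.INPUT.DATA,DISP=SHR
//SORTOUT  DD DSN=MY.OUTPUT.DATA,DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(5,5),RLSE)
//SYSIN    DD *
  INCLUDE COND=(1,1,CH,EQ,C'A')
  SORT FIELDS=(10,10,CH,A)
  OUTREC FIELDS=(10,10,X,40,10)
/*
```

The messages written to SYSOUT are the usual place to check how many records were read, selected, and written.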

When SUM Is Involved

When you use the SUM control statement, DFSORT collapses records that have the same key into one record. SUM FIELDS=(...) adds up the named numeric fields across those records, and SUM FIELDS=NONE simply drops the duplicates without adding anything. Conceptually this happens after the sort phase has put records in key order: records with the same key are adjacent, so DFSORT can combine them. So the flow is: input phase (read, INCLUDE/OMIT, INREC) → sort phase (order by key) → SUM (collapse and total) → output phase (OUTREC, write). SUM does not run in the input phase; it needs the sorted order to know which records to combine.
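A minimal sketch of that flow, assuming a hypothetical layout with a customer id in columns 10 to 19 and a zoned-decimal amount in columns 40 to 49:

```text
* Assumed layout: cols 10-19 = customer id (sort key),
* cols 40-49 = amount, zoned decimal.
  SORT FIELDS=(10,10,CH,A)
* Keep one record per customer id; add the amounts of the
* records that share that id into the record that is kept.
  SUM FIELDS=(40,10,ZD)
```

Replacing the last statement with SUM FIELDS=NONE would keep one record per key without adding any fields.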

When MERGE Is Used

For a MERGE job, the input phase still applies: you can have INREC and INCLUDE/OMIT on the merge inputs. The "sort" phase is replaced by a merge phase: two or more pre-sorted streams are merged into one ordered stream. The output phase is the same: OUTREC and OUTFIL apply when writing. So the three-phase model still holds; only the middle phase is "merge" instead of "sort."
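A merge step might look like the sketch below. The dataset names are hypothetical, the SPACE values are placeholders, and both inputs are assumed to be already in ascending order on columns 10 to 19.

```text
//STEP01   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN01 DD DSN=MY.SORTED.FILE1,DISP=SHR
//SORTIN02 DD DSN=MY.SORTED.FILE2,DISP=SHR
//SORTOUT  DD DSN=MY.MERGED.DATA,DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(5,5),RLSE)
//SYSIN    DD *
* Merge on the key the inputs are already sorted by.
  MERGE FIELDS=(10,10,CH,A)
* Output phase is unchanged: reformat while writing.
  OUTREC BUILD=(10,10,X,40,10)
/*
```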

Explain It Like I'm Five

Imagine you have two piles of cards: one pile is the raw cards (input), and you want to end up with one neat pile in order (output). Step 1 (input): You look at each card. First you throw away the cards you don't need (INCLUDE/OMIT). Then, for the cards you keep, maybe you fix the card or make a shorter copy (INREC). The cards you keep go into a "to be sorted" pile. Step 2 (sort): You take that pile and put the cards in order by the rule you chose (e.g. by name). Now you have one pile in the right order. Step 3 (output): Before you write the answer on a new sheet, you might copy only some parts of each card or add numbers (OUTREC). Then you write that to the output. So: filter and get the cards ready → put them in order → format and write. That order never changes.

Exercises

  1. You want to sort by a key that is the first 10 bytes of the record plus the last 10 bytes. Should you build that key in INREC or OUTREC? Why?
  2. If you use both INCLUDE and INREC, which is applied first to each record?
  3. What is the difference between what the sort phase does in a SORT job and in a MERGE job?
  4. Where in the phase order does SUM run? Why must it run there?

Quiz

Test Your Knowledge

1. In which phase does DFSORT apply INREC?

  • Output phase
  • Sort phase
  • Input phase
  • None of these

2. When is OUTREC applied?

  • Before the sort
  • During the sort
  • After the sort, when writing output
  • Only when OUTFIL is used

3. What happens in the sort phase?

  • Records are read from SORTIN
  • Records are reordered by sort keys
  • Records are written to SORTOUT
  • INCLUDE/OMIT is applied

4. When are INCLUDE and OMIT applied?

  • During the output phase
  • During the input phase, before sort
  • After OUTREC
  • Only when SUM is used

5. How many main phases does DFSORT use for a typical SORT job?

  • One
  • Two
  • Three
  • Four