MainframeMaster

Sorting vs Copying vs Merging

DFSORT can do three different kinds of "move" from input to output: sort (reorder records by keys), copy (keep the same order, no reordering), and merge (combine two or more already-sorted streams into one). Choosing the right one affects correctness, performance, and the control statements you use. This page explains what each does, when to use it, and how they differ.

Fundamentals

What Is Sorting?

Sorting means reordering records so they appear in ascending or descending order by one or more sort keys. The input can be in any order—random, reverse, or partially sorted. DFSORT reads all the records, compares them by the keys you specify in SORT FIELDS=, and writes the output in the new order. So the output order is determined entirely by the sort keys and the direction (A or D), not by the input order.

You use sorting when the application or report needs data in a specific sequence: for example, by customer ID, by date, or by name then account number. Sorting is the most common DFSORT operation. It uses the most resources (CPU and often sortwork datasets) because DFSORT must compare and move records until the full dataset is ordered. The larger the file and the more keys, the more work sorting does.

In control statements, you request a sort by coding SORT FIELDS=(position,length,format,direction,...). You do not use OPTION COPY or MERGE. Example: to sort an 80-byte fixed file by positions 1–10 (character, ascending) then 11–14 (4-byte packed decimal, descending), you would code:

```text
  SORT FIELDS=(1,10,CH,A,11,4,PD,D)
```

The input order is irrelevant; the output will always be in that key order (subject to INCLUDE/OMIT and any SUM).
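For context, a SORT statement like this usually sits in the SYSIN of a job step. The following is a minimal sketch rather than a definitive job: the step name, the dataset names (MY.INPUT.DATA, MY.SORTED.DATA), the UNIT value, and the space figures are all hypothetical placeholders, and your site may invoke DFSORT under a different program name or require additional DD statements.

```text
//SORTSTEP EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//* Input in any order (hypothetical dataset name)
//SORTIN   DD DSN=MY.INPUT.DATA,DISP=SHR
//* Output in key order (hypothetical dataset name)
//SORTOUT  DD DSN=MY.SORTED.DATA,
//            DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(10,5),RLSE)
//SYSIN    DD *
* Key 1: bytes 1-10, character, ascending
* Key 2: bytes 11-14, packed decimal, descending
  SORT FIELDS=(1,10,CH,A,11,4,PD,D)
/*
```

Lines in SYSIN that begin with an asterisk in column 1 are comments, and the control statement itself starts after column 1.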

What Is Copying?

Copying means passing records from input to output without changing their order. The first input record is the first output record, the second is the second, and so on (after any filtering). You use copy when you do not need to reorder: for example, you only want to filter records (INCLUDE/OMIT), reformat them (INREC, OUTREC), or write to multiple outputs (OUTFIL). Because there is no sort phase, copy is usually faster and uses less memory and sortwork than a full sort.

To request a copy in DFSORT, you use one of these:

  • OPTION COPY — The most common. You code OPTION COPY and do not specify SORT FIELDS= or MERGE. Records are read, optionally filtered and reformatted (INREC, INCLUDE/OMIT), and written in the same order (with OUTREC/OUTFIL if specified). The sort/merge phase is skipped.
  • SORT FIELDS=COPY — Same effect as OPTION COPY. Some shops use this form to make it explicit that "sort" is being used in copy mode. There is no performance difference.
  • MERGE FIELDS=COPY — Also requests a copy rather than a merge; used when the control statement set is MERGE-based but you want no reordering. Functionally the same as OPTION COPY for the copy result.
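Written out, the three forms in the list above look like this. Each statement is an alternative; you would code only one of them in a given SYSIN.

```text
* Alternative 1: copy requested through OPTION
  OPTION COPY
* Alternative 2: copy requested on the SORT statement
  SORT FIELDS=COPY
* Alternative 3: copy requested on the MERGE statement
  MERGE FIELDS=COPY
```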

Important: copy does not mean "no other processing." You can still use INCLUDE, OMIT, INREC, OUTREC, and OUTFIL. So you might run OPTION COPY with INCLUDE to keep only certain records and OUTREC to reformat them: the records stay in input order, but you have filtered and reformatted. SUM is the exception. It collapses records that have equal sort or merge control fields, and a copy has no control fields, so SUM is not supported for copy applications. To collapse duplicates or total fields, code a SORT (or MERGE) with the key in FIELDS= instead.

```text
  OPTION COPY
  INCLUDE COND=(1,1,CH,EQ,C'A')
  OUTREC FIELDS=(1,80)
```

This keeps only records with 'A' in position 1 and writes them in the same order they appeared in the input; no sort is performed.
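Because copy still allows OUTFIL, a single pass over the input can also split records across several outputs. The sketch below is an illustration, not part of the original example: OUTA and OUTREST are hypothetical DD names that would have to be defined in the JCL for the step.

```text
* No reordering: records flow through in input order
  OPTION COPY
* Records with 'A' in position 1 go to the OUTA DD
  OUTFIL FNAMES=OUTA,INCLUDE=(1,1,CH,EQ,C'A')
* SAVE selects every record not written by another OUTFIL
  OUTFIL FNAMES=OUTREST,SAVE
```

Both outputs keep the relative order the records had in the input, since no sort phase runs.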

What Is Merging?

Merging means combining two or more input streams that are already sorted by the same key into one output stream that is also sorted by that key. DFSORT does not re-sort the data; it does a merge pass: it looks at the next record from each input, compares the keys, and writes the "smallest" (or "largest" for descending) to the output. So the output is the combined, sorted result without running a full sort algorithm. Merging is efficient when your inputs are already in order. For example, you might have two daily files, each sorted by transaction ID, that you want to combine into one file in transaction ID order.

You request a merge by coding MERGE FIELDS=(position,length,format,direction,...), using the same key positions, formats, and directions that your input datasets are already sorted by. You must allocate the inputs with numbered DD names (SORTIN01, SORTIN02, and so on), not with a single SORTIN. Each input must be pre-sorted by that key. If any input is not in the correct order, the merged output will be wrong: records from different files can be interleaved incorrectly. So MERGE is only correct when you can guarantee the sort order of each input.
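A merge step might look like the following sketch. It assumes, purely for illustration, that the transaction ID is a 10-byte character field in positions 1–10; the step name, dataset names, UNIT value, and space figures are hypothetical placeholders.

```text
//MRGSTEP  EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//* Each input must already be in ascending order on bytes 1-10
//SORTIN01 DD DSN=MY.DAILY.FILE1,DISP=SHR
//SORTIN02 DD DSN=MY.DAILY.FILE2,DISP=SHR
//SORTOUT  DD DSN=MY.MERGED.DATA,
//            DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(10,5),RLSE)
//SYSIN    DD *
* Same key and direction that the inputs are sorted by
  MERGE FIELDS=(1,10,CH,A)
/*
```

Note that there is no SORTIN DD here; each pre-sorted stream gets its own numbered input DD.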

When to choose MERGE over SORT:

  • You have two or more datasets (not just one).
  • Each dataset is already sorted by the same key (and direction) you want for the combined output.
  • You want to combine them into one sorted dataset.

If you have one unsorted file, you use SORT. If you have one file already sorted and only need to filter or reformat, you might use OPTION COPY. If you have two or more sorted files to combine, MERGE is the right and usually fastest choice.

Comparing the Three: Order and Control Statements

Output order

  • Sort — Output order is determined by SORT FIELDS=. Input order is ignored. Result is ascending/descending by the key(s) you specify.
  • Copy — Output order is the same as input order (after filtering). First record in is first record out.
  • Merge — Output order is the same key order as the inputs. The merge algorithm preserves that order when combining the streams.

Control statements

  • Sort — SORT FIELDS=(...). Do not use OPTION COPY. Input comes from the SORTIN DD, which can be a concatenation of several datasets; the concatenated records are sorted as one input. The numbered SORTIN01, SORTIN02, etc. DDs are for MERGE, not SORT.
  • Copy — OPTION COPY (or SORT FIELDS=COPY). No SORT FIELDS= with actual keys, no MERGE. Same DD names as sort (SORTIN, SORTOUT, SYSIN).
  • Merge — MERGE FIELDS=(...). Multiple input DDs (e.g. SORTIN01, SORTIN02). Each must be pre-sorted by MERGE FIELDS key.

Performance

  • Sort — Most expensive: full sort algorithm, sortwork datasets if data does not fit in memory. Use when you need to reorder.
  • Copy — Least expensive: no sort phase, linear read and write. Use when order does not matter or you want to preserve input order.
  • Merge — Typically much cheaper than sort for the same data volume when inputs are pre-sorted: single merge pass, no re-sort. Use when combining already-sorted files.

When to Use Which?

Use SORT when the input is unsorted (or sorted by a different key) and you need output in a specific key order. Use OPTION COPY when you do not need to change order and only need to filter, reformat, or split records (e.g. INCLUDE, OUTREC, OUTFIL). Use MERGE when you have two or more datasets already sorted by the same key and you want one combined sorted dataset. Picking the wrong one costs either time or correctness: using SORT when the inputs are already sorted and only need merging works but is wasteful, while using MERGE when an input is not actually sorted produces wrong output.

Example: Same Data, Three Approaches

Suppose you have one file of 80-byte records, sorted by position 1–10 (customer ID). You want only records with position 11 = 'Y' and output in the same order (customer ID order).

Option 1 — OPTION COPY: Use OPTION COPY, INCLUDE COND=(11,1,CH,EQ,C'Y'). No SORT FIELDS. Records stay in customer ID order; filtering is applied. Fast and correct.

Option 2 — SORT: Use SORT FIELDS=(1,10,CH,A), INCLUDE same as above. Result is also in customer ID order, but DFSORT runs a full sort. Slower and unnecessary if the file is already in that order.
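Options 1 and 2 differ by a single statement. As a sketch, the SYSIN for Option 1 would be:

```text
* Option 1: keep input order, filter only
  OPTION COPY
  INCLUDE COND=(11,1,CH,EQ,C'Y')
```

and the SYSIN for Option 2 would be:

```text
* Option 2: same filter, followed by a full sort on customer ID
  INCLUDE COND=(11,1,CH,EQ,C'Y')
  SORT FIELDS=(1,10,CH,A)
```

In both cases INCLUDE is applied as records are read, so Option 2 sorts only the records that pass the filter.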

Option 3 — MERGE: Not applicable—you have only one file. MERGE is for combining two or more pre-sorted files.

So for this requirement, OPTION COPY is the right choice. If you had two files already sorted by customer ID and wanted one combined file in customer ID order, you would use MERGE with two input DDs and MERGE FIELDS=(1,10,CH,A).
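For the two-file case, the control statements could be sketched as follows, assuming the two inputs are allocated as SORTIN01 and SORTIN02 and that you still want only the records with 'Y' in position 11.

```text
* Inputs: SORTIN01 and SORTIN02, each already in customer ID order
  MERGE FIELDS=(1,10,CH,A)
* Optional: keep only records with 'Y' in position 11
  INCLUDE COND=(11,1,CH,EQ,C'Y')
```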

Explain It Like I'm Five

Imagine you have cards with names and numbers. Sorting is like taking a shuffled pile and putting the cards in order by name (or number): you look at every card and rearrange them. Copying is like taking the pile as it is and just moving it to another table, maybe throwing away some cards or copying only part of each card, but you don't change the order they're in. Merging is when you have two piles that are already in order (e.g. both by name), and you combine them into one pile that's still in order by always taking the "next" card from whichever pile has the smaller one on top. So: sort = put in order; copy = keep the order you have; merge = mix two already-ordered piles into one ordered pile.

Exercises

  1. You have one unsorted file and need output in ascending order by bytes 20–25. Do you use SORT, COPY, or MERGE? What control statement do you code?
  2. You have two files, both already sorted by employee ID. You want one file with all records in employee ID order. Which operation do you use? What must be true about the two inputs?
  3. You need to drop all records where byte 1 is 'Z' and write the rest in the same order. Do you need SORT FIELDS? What option do you use?
  4. Why is MERGE usually faster than SORT when combining two sorted files?

Quiz

Test Your Knowledge

1. When should you use MERGE instead of SORT?

  • When you have one unsorted file
  • When you have two or more files already sorted by the same key
  • When you want to filter records only
  • When output must be variable length

2. What does OPTION COPY do?

  • Copies the SYSIN dataset
  • Passes records through without reordering
  • Merges two files
  • Sorts by the first column

3. Can you use INCLUDE and OUTREC with OPTION COPY?

  • No, COPY only copies as-is
  • Yes, filtering and reformatting still apply
  • Only INCLUDE
  • Only OUTREC

4. What is the main difference between SORT and MERGE in terms of input?

  • SORT has one input, MERGE has two
  • SORT expects unsorted input; MERGE expects pre-sorted input
  • MERGE requires SORTIN01 only
  • There is no difference

5. Which operation is usually fastest for combining two already-sorted files?

  • SORT with two SORTIN DD names
  • MERGE
  • OPTION COPY
  • Copy is always slowest