Merge Performance

DFSORT MERGE is usually faster and uses fewer resources than SORT when you are combining data that is already sorted. MERGE does a linear pass over the input streams, while SORT reorders all records. This page explains why MERGE performs better for pre-sorted data, how sortwork and CPU usage compare, when sort-then-merge beats one big SORT, and what can hurt merge performance.

Why MERGE Is Faster Than SORT for Pre-Sorted Data

SORT takes one input and reorders it. To produce sorted output, DFSORT must compare records and rearrange them, which typically requires work proportional to n log n (where n is the number of records): many comparisons plus intermediate storage (sortwork). MERGE takes two or more inputs that are already in key order and only combines them. At each step it looks at the current record from each stream and writes whichever comes next in key order, so the work is proportional to n: a single pass over the data. That is why, for the same total number of records, MERGE uses less CPU and often less sortwork when the inputs are already sorted.
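
As a concrete sketch, a minimal DFSORT MERGE step looks like this (dataset names and key positions are hypothetical); both inputs must already be sorted on the key named in MERGE FIELDS=:

//MERGE1   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN01 DD DSN=PROD.SORTED.FILE1,DISP=SHR
//SORTIN02 DD DSN=PROD.SORTED.FILE2,DISP=SHR
//SORTOUT  DD DSN=PROD.MERGED.OUT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(100,50),RLSE)
//SYSIN    DD *
* COMBINE TWO PRE-SORTED STREAMS IN ONE LINEAR PASS
  MERGE FIELDS=(1,10,CH,A)
/*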

SORT vs MERGE: performance aspects
AspectSORTMERGE
AlgorithmFull sort (reorder)Linear merge (combine streams)
Typical time complexityO(n log n)O(n)
SortworkOften substantialTypically less
When optimalSingle unsorted inputMultiple pre-sorted inputs

Sortwork and MERGE

Sortwork (SORTWKnn datasets) is temporary storage DFSORT uses during the sort phase. A full SORT often needs sortwork space proportional to the input size because it must hold and reorder records. MERGE, in contrast, only reads the next record from each input and writes whichever comes first in key order; it never needs to hold the entire dataset in sortwork to reorder it. So MERGE typically requires less sortwork than a SORT of the same total data (in some configurations, essentially none). The exact amount depends on your product version and options; when in doubt, allocate SORTWK DDs for a MERGE step as you would for a SORT, and the product will use only what it needs.
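
To make the contrast concrete, here is a sketch of a full SORT step with explicit work-space DDs (names and space figures are hypothetical and illustrative only); the MERGE step shown earlier moves the same volume of data without them:

//SORT1    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.UNSORTED.DATA,DISP=SHR
//* WORK SPACE FOR THE SORT PHASE, SIZED RELATIVE TO THE INPUT
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(200,100))
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(200,100))
//SORTOUT  DD DSN=PROD.SORTED.DATA,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(300,100),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*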

When Sort-Then-Merge Beats One Big SORT

Suppose you have a very large unsorted file. You could run one SORT over the whole file. Alternatively, you could split the file into several smaller files (e.g. by key range or by partition), run a SORT on each (in separate jobs, possibly in parallel), and then MERGE the sorted results. The second approach can be faster because: (1) each SORT operates on less data, so each step runs faster and needs fewer resources; (2) if the sort steps run in parallel, total elapsed time drops further; (3) the final MERGE is a single linear pass. So for very large data that can be split cleanly, sort-then-merge can reduce both elapsed time and peak resource use compared with one large SORT.
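
A sketch of the pattern in one job (dataset names are hypothetical; in practice the two SORT steps could run as separate jobs in parallel):

//* STEP 1: SORT THE FIRST PARTITION
//SORTA    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.BIG.PART1,DISP=SHR
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(100,50))
//SORTOUT  DD DSN=&&S1,DISP=(NEW,PASS),UNIT=SYSDA,
//            SPACE=(CYL,(100,50),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//* STEP 2: SORT THE SECOND PARTITION
//SORTB    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.BIG.PART2,DISP=SHR
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(100,50))
//SORTOUT  DD DSN=&&S2,DISP=(NEW,PASS),UNIT=SYSDA,
//            SPACE=(CYL,(100,50),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//* STEP 3: MERGE THE SORTED PARTITIONS IN ONE LINEAR PASS
//MERGEAB  EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN01 DD DSN=&&S1,DISP=(OLD,DELETE)
//SORTIN02 DD DSN=&&S2,DISP=(OLD,DELETE)
//SORTOUT  DD DSN=PROD.BIG.SORTED,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(200,100),RLSE)
//SYSIN    DD *
  MERGE FIELDS=(1,10,CH,A)
/*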

Number of Merge Inputs

With more merge inputs (e.g. 16 instead of 2), the merge still does one linear pass over the total records. The only extra cost is that for each output record, DFSORT must determine which input stream has the next record in key order. With 2 streams that is one comparison; with 16 streams it is more comparisons (or a small heap/priority structure). So there can be a modest increase in CPU when you have many inputs, but the overall work remains linear and much cheaper than a full re-sort. You should not avoid multi-file merge just because there are many inputs; the benefit of not re-sorting usually outweighs the cost of more comparisons per record.
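
For example, a four-input merge (hypothetical names) is still a single pass; for each output record DFSORT just picks the lowest key among four candidate records instead of two:

//MERGE4   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN01 DD DSN=PROD.SORTED.Q1,DISP=SHR
//SORTIN02 DD DSN=PROD.SORTED.Q2,DISP=SHR
//SORTIN03 DD DSN=PROD.SORTED.Q3,DISP=SHR
//SORTIN04 DD DSN=PROD.SORTED.Q4,DISP=SHR
//SORTOUT  DD DSN=PROD.SORTED.YEAR,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(200,100),RLSE)
//SYSIN    DD *
  MERGE FIELDS=(1,10,CH,A)
/*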

What Can Hurt MERGE Performance

  • Inputs not pre-sorted: If any input is out of sequence, the merge cannot produce correct output; DFSORT typically detects the out-of-sequence record and fails the step. Fixing it (e.g. by re-sorting the inputs and rerunning) wastes the benefit of merge. Always ensure each input is sorted by the same key as MERGE FIELDS=.
  • Heavy INREC/OUTREC: Complex reformatting adds CPU for every record. Use only what you need; see the sketch after this list.
  • Very large records or many inputs: Large records mean more data movement; many inputs mean more comparisons per output record. Both are usually still far cheaper than one full SORT.
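
If you do need reformatting, keep it lean. Reusing the DD setup from the two-input MERGE example above, this control-statement sketch (key and field positions are hypothetical) merges on a 10-byte key and trims each output record to its first 20 bytes, reducing data movement on output:

//SYSIN    DD *
* ASSUMES FIXED-LENGTH RECORDS; KEEP ONLY THE BYTES THE NEXT STEP NEEDS
  MERGE FIELDS=(1,10,CH,A)
  OUTREC FIELDS=(1,20)
/*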

Concatenate-and-SORT vs MERGE

If you concatenate two sorted files into one SORTIN and run SORT, DFSORT treats the combined stream as unsorted and re-sorts everything. That uses more CPU and sortwork than feeding the two files as SORTIN01 and SORTIN02 and running MERGE. Whenever the data is already in the correct key order, use MERGE rather than SORT to combine it; the performance difference can be large for big files.
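
The two approaches side by side (hypothetical names; run one or the other, not both):

//* APPROACH 1: CONCATENATE AND SORT - RE-SORTS EVERYTHING
//STEP1    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.SORTED.FILE1,DISP=SHR
//         DD DSN=PROD.SORTED.FILE2,DISP=SHR
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(200,100))
//SORTOUT  DD DSN=PROD.COMBINED,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(300,100),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//* APPROACH 2: MERGE - ONE LINEAR PASS, NO RE-SORT
//STEP2    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN01 DD DSN=PROD.SORTED.FILE1,DISP=SHR
//SORTIN02 DD DSN=PROD.SORTED.FILE2,DISP=SHR
//SORTOUT  DD DSN=PROD.COMBINED,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(300,100),RLSE)
//SYSIN    DD *
  MERGE FIELDS=(1,10,CH,A)
/*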

Explain It Like I'm Five

Imagine you have two stacks of cards already in A-to-Z order. To get one big A-to-Z stack, you don’t shuffle everything. You just keep picking the next card from the top of either stack (whichever comes first in the alphabet) and put it on the result. That’s fast—you only look at the top of each stack. That’s MERGE. If you mixed the two stacks into one big messy pile and then sorted the whole pile, you’d have to look at lots of cards and move them around. That’s SORT. So when your piles are already in order, merging is the fast way to combine them.

Exercises

  1. You have 4 sorted files of 1 million records each. Would you prefer to concatenate them and run SORT, or run MERGE with SORTIN01–SORTIN04? Why?
  2. What is the main reason MERGE typically needs less sortwork than SORT?
  3. When might splitting one large unsorted file into 8 parts, sorting each part, and then merging the 8 be faster than sorting the large file once?
  4. If you discover after a MERGE that one input was out of order, what are the implications for performance and correctness?

Quiz

Test Your Knowledge

1. Why is MERGE typically faster than SORT when combining pre-sorted data?

  • MERGE uses more memory
  • MERGE does a linear pass over the streams; SORT does a full reorder (e.g. O(n log n))
  • SORT always reads twice
  • MERGE skips INCLUDE

2. If you concatenate two sorted files into SORTIN and run SORT, vs feeding them as SORTIN01 and SORTIN02 and running MERGE, which usually uses more sortwork?

  • MERGE uses more
  • SORT usually uses more; it must reorder the combined data
  • They use the same
  • It depends only on record length

3. When might sort-then-merge be faster than one big SORT?

  • Never
  • When you can sort smaller partitions in parallel or in sequence and then merge; the merge is linear and the per-partition sorts are on smaller data
  • Only when you have two inputs
  • Only when using OPTION COPY

4. Does the number of merge inputs (e.g. 4 vs 16) significantly affect MERGE performance?

  • No; merge is always the same speed
  • More inputs mean more comparisons per output record (e.g. pick smallest of 16 vs 4); it can have some impact but merge is still linear
  • More inputs always make it faster
  • MERGE only allows 2 inputs

5. What can hurt MERGE performance?

  • Using OUTREC
  • Inputs that are not actually sorted (wrong order forces incorrect output; if you then fix by re-sorting, you lose the benefit)
  • Using SORTIN01
  • Using more than two inputs