DFSORT MERGE is usually faster and uses fewer resources than SORT when you are combining data that is already sorted. MERGE does a linear pass over the input streams; SORT reorders all records. This page explains why MERGE performs better for pre-sorted data, how sortwork and CPU usage compare, when to use sort-then-merge instead of one big SORT, and what can hurt MERGE performance.
SORT takes one input and reorders it. To produce sorted output, DFSORT must compare records and rearrange them, which typically requires work proportional to n log n (where n is the number of records): many comparisons plus intermediate storage (sortwork). MERGE takes two or more inputs that are already in key order and only combines them: at each step it looks at the current record from each stream and writes whichever comes first in key order. The work is proportional to n, a single pass over the data. That is why, for the same total number of records, MERGE uses less CPU and often less sortwork when the inputs are already sorted.
| Aspect | SORT | MERGE |
|---|---|---|
| Algorithm | Full sort (reorder) | Linear merge (combine streams) |
| Typical time complexity | O(n log n) | O(n) |
| Sortwork | Often substantial | Typically less |
| When optimal | Single unsorted input | Multiple pre-sorted inputs |
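As a concrete sketch of the MERGE side of this table, here is a minimal DFSORT merge step that combines two pre-sorted inputs on a 10-byte character key starting in column 1. The dataset names and key layout are placeholders for illustration:

```
//* Minimal MERGE: both inputs must already be in key order.
//MERGE1   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN01 DD DSN=MY.SORTED.FILE1,DISP=SHR
//SORTIN02 DD DSN=MY.SORTED.FILE2,DISP=SHR
//SORTOUT  DD DSN=MY.MERGED.OUT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(100,20),RLSE)
//SYSIN    DD *
  MERGE FIELDS=(1,10,CH,A)
/*
```

DFSORT reads the current record from SORTIN01 and SORTIN02, writes whichever comes first on the key, and advances that stream: one linear pass, exactly as described above.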
Sortwork (SORTWK datasets) is temporary storage DFSORT uses during the sort phase. A full SORT often needs sortwork space proportional to the input size because it must hold and reorder records. MERGE, in contrast, only reads the next record from each input and writes the one that comes first in key order; it never needs to hold the whole dataset in order to rearrange it, so a merge application typically needs little or no sortwork at all. The exact behavior depends on your product version and options; when in doubt, allocate SORTWK DDs for a MERGE step as you would for a SORT, and the product will use only what it needs.
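For contrast, here is a sketch of a full SORT step with explicitly allocated work datasets; the space figures are hypothetical and would be sized to your input volume:

```
//SORT1    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=MY.UNSORTED.FILE,DISP=SHR
//* Work space for the sort phase; a merge step of the same
//* data would typically need little or none of this.
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(200,50))
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(200,50))
//SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(200,50))
//SORTOUT  DD DSN=MY.SORTED.OUT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(250,50),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
```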
Suppose you have a very large unsorted file. You could run one SORT over the whole file. Alternatively, you could split the file into several smaller files, run a SORT on each (e.g. in separate jobs or in parallel), and then MERGE the sorted results, as in the sketch below. The second approach can be faster because: (1) each SORT operates on less data, so each sort step runs in less time and uses fewer resources; (2) if the sort steps run in parallel, total elapsed time can drop further; (3) the final MERGE is one linear pass. So for very large data that can be split (e.g. by key range or by partition), sort-then-merge can reduce both elapsed time and peak resource use compared with one large SORT.
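Here is the pattern as one job with two SORT steps feeding a final MERGE (in practice the SORT steps could run as separate, parallel jobs). All names, the key layout, and the space figures are placeholders:

```
//* Step 1: sort the first partition into a temporary dataset.
//SORTA    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=MY.BIG.FILE.PART1,DISP=SHR
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(100,20))
//SORTOUT  DD DSN=&&SORTED1,DISP=(NEW,PASS),
//            UNIT=SYSDA,SPACE=(CYL,(120,20),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//* Step 2: sort the second partition.
//SORTB    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=MY.BIG.FILE.PART2,DISP=SHR
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(100,20))
//SORTOUT  DD DSN=&&SORTED2,DISP=(NEW,PASS),
//            UNIT=SYSDA,SPACE=(CYL,(120,20),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//* Step 3: one linear pass combines the sorted pieces.
//MERGEAB  EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN01 DD DSN=&&SORTED1,DISP=(OLD,DELETE)
//SORTIN02 DD DSN=&&SORTED2,DISP=(OLD,DELETE)
//SORTOUT  DD DSN=MY.BIG.SORTED,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(250,50),RLSE)
//SYSIN    DD *
  MERGE FIELDS=(1,10,CH,A)
/*
```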
With more merge inputs (e.g. 16 instead of 2), the merge still makes one linear pass over the total records. The only extra cost is that for each output record, DFSORT must determine which input stream holds the next record in key order. With 2 streams that is one comparison; with 16 streams it is up to 15 comparisons (or fewer with a small heap or tournament structure). So there can be a modest increase in CPU when you have many inputs, but the overall work remains linear and far cheaper than a full re-sort. Do not avoid multi-file merge just because there are many inputs; the benefit of not re-sorting usually outweighs the cost of extra comparisons per record.
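Scaling the earlier merge sketch to more inputs only means adding SORTINnn DD statements; the control statement does not change. A hypothetical four-input version:

```
//MERGE4   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//* Each SORTINnn stream must already be in key order.
//SORTIN01 DD DSN=MY.SORTED.PART1,DISP=SHR
//SORTIN02 DD DSN=MY.SORTED.PART2,DISP=SHR
//SORTIN03 DD DSN=MY.SORTED.PART3,DISP=SHR
//SORTIN04 DD DSN=MY.SORTED.PART4,DISP=SHR
//SORTOUT  DD DSN=MY.MERGED.ALL,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(200,40),RLSE)
//SYSIN    DD *
  MERGE FIELDS=(1,10,CH,A)
/*
```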
If you have two sorted files and you concatenate them into one dataset and run SORT, DFSORT treats the combined file as unsorted. It will re-sort everything. That uses more CPU and sortwork than feeding the two files as SORTIN01 and SORTIN02 and running MERGE. So whenever the data is already in the correct key order, use MERGE instead of SORT to combine it. The performance difference can be large for big files.
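To make the contrast concrete, here is the anti-pattern this paragraph warns against: the two already-sorted files concatenated under one SORTIN and re-sorted from scratch. Compare it with the two-input MERGE sketch earlier on this page; the names are again placeholders:

```
//* Anti-pattern: concatenation discards the existing order,
//* forcing a full re-sort with its CPU and sortwork costs.
//RESORT   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=MY.SORTED.FILE1,DISP=SHR
//         DD DSN=MY.SORTED.FILE2,DISP=SHR
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(200,50))
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(200,50))
//SORTOUT  DD DSN=MY.COMBINED.OUT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(250,50),RLSE)
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
```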
Imagine you have two stacks of cards already in A-to-Z order. To get one big A-to-Z stack, you don’t shuffle everything. You just keep picking the next card from the top of either stack (whichever comes first in the alphabet) and put it on the result. That’s fast—you only look at the top of each stack. That’s MERGE. If you mixed the two stacks into one big messy pile and then sorted the whole pile, you’d have to look at lots of cards and move them around. That’s SORT. So when your piles are already in order, merging is the fast way to combine them.
1. Why is MERGE typically faster than SORT when combining pre-sorted data?
2. If you concatenate two sorted files into SORTIN and run SORT, vs feeding them as SORTIN01 and SORTIN02 and running MERGE, which usually uses more sortwork?
3. When might sort-then-merge be faster than one big SORT?
4. Does the number of merge inputs (e.g. 4 vs 16) significantly affect MERGE performance?
5. What can hurt MERGE performance?