When you use DFSORT JOINKEYS to join two large files, performance and resource use depend on whether the inputs are already sorted, how much sortwork (SORTWK) is available, and how large the joined output can grow. JOINKEYS internally uses parallel processing: separate tasks read file 1 (F1) and file 2 (F2) and can run at the same time, which helps elapsed time. You can avoid redundant sorting by telling DFSORT when a file is already in join-key order using the SORTED option. This page covers how to join large datasets efficiently: SORTED and NOSEQCK, sortwork sizing, avoiding double sort, output size planning, and practical tips for production jobs.
If either or both of your join inputs are already sorted by the same key you use for the join (same position, length, format, and order), you can tell DFSORT so it does not sort them again. That saves CPU and I/O and is one of the most effective ways to speed up large joins.
SORTED is specified on the JOINKEYS statement for that file. It means: this file is in the same order as the join key (ascending or descending as specified in FIELDS=). DFSORT will not sort this file; it will use it as-is for the join.
NOSEQCK (no sequence check) tells DFSORT not to verify that the file is in order. Use NOSEQCK only when you are certain the data is correctly sorted: with the check suppressed, out-of-order records go undetected and can produce wrong join results. If you are not sure, omit NOSEQCK so DFSORT validates the order (at some CPU cost) and raises a sequence error instead of silently joining bad data.
```
  JOINKEYS F1=MASTER,FIELDS=(1,10,CH,A),SORTED,NOSEQCK
  JOINKEYS F2=DETAIL,FIELDS=(5,10,CH,A),SORTED,NOSEQCK
  REFORMAT FIELDS=(F1:1,80,F2:1,100)
  SORT FIELDS=COPY
```
Here both MASTER and DETAIL are already in ascending order by the 10-byte key (F1 at 1–10, F2 at 5–14). SORTED avoids sorting either file. NOSEQCK skips sequence checking. The join then proceeds without the sort phase for F1 and F2, which greatly reduces work for large files.
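For context, a complete job step built around these control statements might look like the sketch below. The dataset names, space values, and the LRECL of 180 (80 + 100 from the REFORMAT layout) are placeholder assumptions, not values from the original example.

```
//JOINSTEP EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//MASTER   DD DSN=PROD.MASTER.SORTED,DISP=SHR
//DETAIL   DD DSN=PROD.DETAIL.SORTED,DISP=SHR
//SORTOUT  DD DSN=PROD.JOINED.OUT,DISP=(NEW,CATLG),UNIT=SYSDA,
//            SPACE=(CYL,(500,100),RLSE),
//            DCB=(RECFM=FB,LRECL=180)
//SYSIN    DD *
  JOINKEYS F1=MASTER,FIELDS=(1,10,CH,A),SORTED,NOSEQCK
  JOINKEYS F2=DETAIL,FIELDS=(5,10,CH,A),SORTED,NOSEQCK
  REFORMAT FIELDS=(F1:1,80,F2:1,100)
  SORT FIELDS=COPY
/*
```

SORT FIELDS=COPY passes the joined records through in key order without a final sort; if the output had to be resequenced on some other field, the main task would need sort work of its own as well.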
JOINKEYS runs three logical tasks: one for F1, one for F2, and a main join task. If a file is not declared SORTED, DFSORT must sort (or merge) it by the join key; that sort uses SORTWK (or SORTWK01, SORTWK02, …) work datasets. The main join task also produces a stream of joined records that may need to be sorted or written to work files. So a single JOINKEYS step can use many SORTWK datasets—dozens in real-world large jobs—and each must be large enough to hold its share of the data.
If you do not allocate enough sortwork, the step may fail with out-of-space or resource errors. IBM provides tuning recommendations for large DFSORT tasks: estimate the volume of data (F1 size, F2 size, and approximate join output size), then allocate enough SORTWK space. Using multiple SORTWKnn DDs (e.g. SORTWK01 through SORTWK22) spreads I/O and can improve performance.
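One way to provide that space is with explicit SORTWKnn DDs; another is to let DFSORT allocate work datasets dynamically. The unit names and sizes below are placeholders to be replaced with values from your own capacity estimate, not recommendations:

```
//* Option 1: explicit work datasets on separate volumes where possible
//SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(300),,CONTIG)
//SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(300),,CONTIG)
//SORTWK03 DD UNIT=SYSDA,SPACE=(CYL,(300),,CONTIG)
//*
//* Option 2: no SORTWKnn DDs; request dynamic allocation in SYSIN
//SYSIN    DD *
  OPTION DYNALLOC=(SYSDA,8)
```

With DYNALLOC, DFSORT allocates and deletes its own work datasets, which is often simpler for production jobs whose input sizes vary from run to run.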
A common mistake is to run a SORT step that sorts a file by the join key and writes it to a dataset, then use that dataset as SORTJNF1 (or F1) in a JOINKEYS step without specifying SORTED. DFSORT then does not know the file is already in order and may sort it again. That doubles the sort work for that file.
Whenever the input to JOINKEYS is the output of a previous step that sorted by the same key (same position, length, format, order), specify SORTED on that JOINKEYS. The same applies if the dataset was created by another job or tool that guarantees order by the join key.
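As a sketch (dataset names, space values, and key positions are carried over from the earlier example and are placeholders), the two-step pattern looks like this: STEP1 sorts the detail file by the join key, and STEP2 declares both inputs SORTED so neither is re-sorted.

```
//STEP1    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.DETAIL.RAW,DISP=SHR
//SORTOUT  DD DSN=&&DETSORT,DISP=(NEW,PASS),UNIT=SYSDA,
//            SPACE=(CYL,(200,50),RLSE)
//SYSIN    DD *
  SORT FIELDS=(5,10,CH,A)
/*
//STEP2    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//MASTER   DD DSN=PROD.MASTER.SORTED,DISP=SHR
//DETAIL   DD DSN=&&DETSORT,DISP=(OLD,DELETE)
//SORTOUT  DD DSN=PROD.JOINED.OUT,DISP=(NEW,CATLG),UNIT=SYSDA,
//            SPACE=(CYL,(500,100),RLSE)
//SYSIN    DD *
  JOINKEYS F1=MASTER,FIELDS=(1,10,CH,A),SORTED
  JOINKEYS F2=DETAIL,FIELDS=(5,10,CH,A),SORTED
  REFORMAT FIELDS=(F1:1,80,F2:1,100)
  SORT FIELDS=COPY
/*
```

Here SORTED is specified without NOSEQCK, so DFSORT still sequence-checks both inputs; add NOSEQCK only once you are confident the upstream ordering is guaranteed.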
JOINKEYS is designed so that the F1 and F2 input tasks can run in parallel. While one task is reading and (if needed) sorting F1, the other can read and sort F2. That overlap reduces elapsed time when both files are large. To get the best benefit, place the two input datasets on different volumes or channels so I/O does not contend; also ensure sufficient memory and sortwork so neither task is starved.
The join output can be much larger than either input when there are duplicate keys (one-to-many or many-to-many). For each key, output rows = (F1 records with that key) × (F2 records with that key), so the total can be huge: a key that appears 1,000 times in F1 and 500 times in F2 contributes 500,000 joined records by itself. Estimate the output volume before you run, and size SORTOUT, the sortwork for any final sort, and downstream datasets accordingly.
Use a blocksize that is a multiple of the record length and within the device maximum (for disk, up to 32,760 bytes; a half-track block, around 27,998 bytes on 3390, is usually the sweet spot). Large blocks reduce the number of I/O operations. For very large files, dynamic allocation (DYNALLOC) and a file-size estimate (SIZE/FILSZ) help DFSORT allocate appropriately sized work space.
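The fragment below sketches both ideas; the BLKSIZE, space values, and the FILSZ record-count estimate are illustrative assumptions, not tuning recommendations:

```
//* BLKSIZE 27900 = 155 x LRECL 180, within the 32,760-byte DASD limit
//SORTOUT  DD DSN=PROD.JOINED.OUT,DISP=(NEW,CATLG),UNIT=SYSDA,
//            SPACE=(CYL,(2000,500),RLSE),
//            DCB=(RECFM=FB,LRECL=180,BLKSIZE=27900)
//SYSIN    DD *
* Dynamically allocate up to 8 work datasets; FILSZ=En supplies an
* estimated record count so DFSORT can size them sensibly
  OPTION DYNALLOC=(SYSDA,8),FILSZ=E50000000
```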
| Area | Action |
|---|---|
| Pre-sorted inputs | Specify SORTED (and NOSEQCK if safe) on JOINKEYS for each file already in join-key order. |
| Sortwork | Allocate enough SORTWK (or SORTWK01, SORTWK02, ...) for F1 sort, F2 sort, and join output; use IBM tuning guidelines for size. |
| Output size | Estimate output rows (e.g. sum of F1×F2 per key for inner join) and size SORTOUT and downstream datasets accordingly. |
| Blocksize | Use efficient blocksize for input and output (e.g. multiple of record length, within device limits) to reduce I/O. |
| Double sort | Do not feed a sorted dataset into JOINKEYS without SORTED; otherwise DFSORT will sort it again. |
Imagine matching two long lists of names. If both lists are already in alphabetical order, you can walk through them together without sorting either list again—that’s SORTED. If you have lots of papers (sortwork), you can spread the work on the table and not run out of space. And if two people read the two lists at the same time (one reads list 1, one reads list 2), you finish faster—that’s JOINKEYS reading F1 and F2 in parallel.
1. Your two join inputs are already sorted by the join key. What can you specify to avoid DFSORT re-sorting them?
2. Why might a JOINKEYS step use many SORTWK datasets?
3. You run a SORT step that writes SORTOUT, then a second step that uses that SORTOUT as SORTJNF1 for JOINKEYS. What risk is there?
4. How does JOINKEYS use the two input files internally?
5. What should you plan for when the join output is much larger than either input (e.g. many-to-many)?