How do I join large datasets efficiently with DFSORT JOINKEYS?

Use SORTED on each JOINKEYS when the corresponding file is already in join-key order so DFSORT does not re-sort. Size SORTWK (and optionally multiple SORTWKnn) and SORTOUT for the expected record count and length. Use appropriate blocksize and avoid double sorting by declaring pre-sorted inputs when they are.

What is SORTED on JOINKEYS?

SORTED is an option on JOINKEYS that tells DFSORT the file is already in the same order as the join key. DFSORT then skips sorting that file, which saves CPU and I/O on large datasets. Use it when the input was previously sorted by the join key (e.g. from a prior SORT step).

Why does JOINKEYS need so much sortwork?

JOINKEYS internally sorts or merges each file by the join key (unless SORTED is specified) and then merges the joined stream. Large F1, large F2, and a large joined output all need intermediate work space. Multiple SORTWK datasets are used to hold this data; for large joins, many SORTWKnn DD statements may be required.

Can JOINKEYS read both files in parallel?

Yes. JOINKEYS uses separate tasks for file 1 and file 2 that can run in parallel. So F1 and F2 are read concurrently, which improves elapsed time when both files are large and on separate volumes or channels.

How do I avoid sorting the same data twice in a JOINKEYS job?

If one or both inputs are already sorted by the join key (e.g. from a previous SORT step), specify SORTED on the corresponding JOINKEYS. For example JOINKEYS F1=MASTER,FIELDS=(1,10,CH,A),SORTED. That tells DFSORT not to sort that file again.

DFSORT Joining Large Datasets Efficiently - JOINKEYS Performance and Sortwork

Joining Large Datasets Efficiently

When you use DFSORT JOINKEYS to join two large files, performance and resource use depend on whether the inputs are already sorted, how much sortwork (SORTWK) is available, and how large the joined output can grow. JOINKEYS internally uses parallel processing: separate tasks read file 1 (F1) and file 2 (F2) and can run at the same time, which helps elapsed time. You can avoid redundant sorting by telling DFSORT when a file is already in join-key order using the SORTED option. This page covers how to join large datasets efficiently: SORTED and NOSEQCK, sortwork sizing, avoiding double sort, output size planning, and practical tips for production jobs.

JOINKEYS / performance

Declare Pre-Sorted Inputs: SORTED and NOSEQCK

If either or both of your join inputs are already sorted by the same key you use for the join (same position, length, format, and order), you can tell DFSORT so it does not sort them again. That saves CPU and I/O and is one of the most effective ways to speed up large joins.

SORTED is specified on the JOINKEYS statement for that file. It means: this file is in the same order as the join key (ascending or descending as specified in FIELDS=). DFSORT will not sort this file; it will use it as-is for the join.

NOSEQCK (no sequence check) tells DFSORT not to verify that the file is in order. Use NOSEQCK only when you are certain the data is correctly sorted; otherwise a sequence error can catch bad data. If you are not sure, omit NOSEQCK so DFSORT validates the order (at some CPU cost).

Example: both files already sorted by join key

text

1
2
3
4
  JOINKEYS F1=MASTER,FIELDS=(1,10,CH,A),SORTED,NOSEQCK
  JOINKEYS F2=DETAIL,FIELDS=(5,10,CH,A),SORTED,NOSEQCK
  REFORMAT FIELDS=(F1:1,80,F2:1,100)
  SORT FIELDS=COPY

Here both MASTER and DETAIL are already in ascending order by the 10-byte key (F1 at 1–10, F2 at 5–14). SORTED avoids sorting either file. NOSEQCK skips sequence checking. The join then proceeds without the sort phase for F1 and F2, which greatly reduces work for large files.

How JOINKEYS Uses Sortwork

JOINKEYS runs three logical tasks: one for F1, one for F2, and a main join task. If a file is not declared SORTED, DFSORT must sort (or merge) it by the join key; that sort uses SORTWK (or SORTWK01, SORTWK02, …) work datasets. The main join task also produces a stream of joined records that may need to be sorted or written to work files. So a single JOINKEYS step can use many SORTWK datasets—dozens in real-world large jobs—and each must be large enough to hold its share of the data.

If you do not allocate enough sortwork, the step may fail with out-of-space or resource errors. IBM provides tuning recommendations for large DFSORT tasks: estimate the volume of data (F1 size, F2 size, and approximate join output size), then allocate enough SORTWK space. Using multiple SORTWKnn DDs (e.g. SORTWK01 through SORTWK22) spreads I/O and can improve performance.

Avoiding Double Sort

A common mistake is to run a SORT step that sorts a file by the join key and writes it to a dataset, then use that dataset as SORTJNF1 (or F1) in a JOINKEYS step without specifying SORTED. DFSORT then does not know the file is already in order and may sort it again. That doubles the sort work for that file.

Whenever the input to JOINKEYS is the output of a previous step that sorted by the same key (same position, length, format, order), specify SORTED on that JOINKEYS. The same applies if the dataset was created by another job or tool that guarantees order by the join key.

Parallel Reading of F1 and F2

JOINKEYS is designed so that the F1 and F2 input tasks can run in parallel. While one task is reading and (if needed) sorting F1, the other can read and sort F2. That overlap reduces elapsed time when both files are large. To get the best benefit, place the two input datasets on different volumes or channels so I/O does not contend; also ensure sufficient memory and sortwork so neither task is starved.

Planning Output and Sortwork Size

The join output can be much larger than either input when there are duplicate keys (one-to-many or many-to-many). For each key, output rows = (F1 records with that key) × (F2 records with that key). So total output record count can be huge. You must:

Estimate the output record count (e.g. sum over keys of F1 count × F2 count for inner join).
Set the output record length from REFORMAT (sum of the lengths you specify for F1 and F2 segments).
Size SORTOUT and any downstream datasets for that many records and total bytes.
Allocate enough SORTWK for the join phase to hold and process the joined stream (refer to IBM tuning guides for formulas and examples).

Blocksize and I/O Efficiency

Use a blocksize that is a multiple of record length and within the device maximum (e.g. 32K or 64K for disk). Large blocks reduce the number of I/O operations. For very large files, dynamic allocation (DYNALLOC) and appropriate SIZE hints can help DFSORT allocate optimal work and output space.

Quick Reference: What to Do for Large Joins

Actions for joining large datasets
Area	Action
Pre-sorted inputs	Specify SORTED (and NOSEQCK if safe) on JOINKEYS for each file already in join-key order.
Sortwork	Allocate enough SORTWK (or SORTWK01, SORTWK02, ...) for F1 sort, F2 sort, and join output; use IBM tuning guidelines for size.
Output size	Estimate output rows (e.g. sum of F1×F2 per key for inner join) and size SORTOUT and downstream datasets accordingly.
Blocksize	Use efficient blocksize for input and output (e.g. multiple of record length, within device limits) to reduce I/O.
Double sort	Do not feed a sorted dataset into JOINKEYS without SORTED; otherwise DFSORT will sort it again.

Explain It Like I'm Five

Imagine matching two long lists of names. If both lists are already in alphabetical order, you can walk through them together without sorting either list again—that’s SORTED. If you have lots of papers (sortwork), you can spread the work on the table and not run out of space. And if two people read the two lists at the same time (one reads list 1, one reads list 2), you finish faster—that’s JOINKEYS reading F1 and F2 in parallel.

Exercises

Your job has step 1: SORT by customer ID, output to TEMP. Step 2: JOINKEYS with TEMP as F1 and another file as F2, join key = customer ID. What should you add to the JOINKEYS for F1 and why?
Why might a JOINKEYS step need more SORTWK space than a simple SORT of one file of the same size?
F1 has 1 million records, F2 has 2 million; many-to-many can produce up to 10 million joined rows. What must you plan for SORTOUT and sortwork?
What is the trade-off of using NOSEQCK with SORTED?

Quiz

Test Your Knowledge

1. Your two join inputs are already sorted by the join key. What can you specify to avoid DFSORT re-sorting them?

OPTION COPY
SORTED on each JOINKEYS (e.g. JOINKEYS F1=...,SORTED) and optionally NOSEQCK to skip sequence check
JOIN FAST
PRE SORTED in REFORMAT

2. Why might a JOINKEYS step use many SORTWK datasets?

Only for full outer join
JOINKEYS uses separate sort/merge work for F1, F2, and the main join; large files need multiple SORTWKnn (e.g. SORTWK01–SORTWK22) for intermediate data
Only when using REFORMAT
SORTWK is only for single-file SORT

3. You run a SORT step that writes SORTOUT, then a second step that uses that SORTOUT as SORTJNF1 for JOINKEYS. What risk is there?

None
The data may be sorted twice—once in the first SORT and again inside JOINKEYS if you do not tell JOINKEYS that F1 is already sorted; use SORTED on the JOINKEYS for F1 to avoid redundant work
JOINKEYS cannot read from SORTOUT
You must use MERGE instead

4. How does JOINKEYS use the two input files internally?

It reads F1 then F2 sequentially
It uses separate tasks for F1 and F2 that can run in parallel; F1 and F2 are read concurrently, which helps I/O and elapsed time on large datasets
It merges them first then joins
Only F1 is read; F2 is used as lookup

5. What should you plan for when the join output is much larger than either input (e.g. many-to-many)?

Only SORTOUT size
SORTOUT record count and length, sortwork (SORTWK) space for the join output stream, and downstream dataset sizes; output can exceed either input
Only REFORMAT length
Nothing special