MainframeMaster

Joining Large Datasets Efficiently

When you use DFSORT JOINKEYS to join two large files, performance and resource use depend on whether the inputs are already sorted, how much sortwork (SORTWK) is available, and how large the joined output can grow. JOINKEYS internally uses parallel processing: separate tasks read file 1 (F1) and file 2 (F2) and can run at the same time, which helps elapsed time. You can avoid redundant sorting by telling DFSORT when a file is already in join-key order using the SORTED option. This page covers how to join large datasets efficiently: SORTED and NOSEQCK, sortwork sizing, avoiding double sort, output size planning, and practical tips for production jobs.

JOINKEYS / performance
Progress0 of 0 lessons

Declare Pre-Sorted Inputs: SORTED and NOSEQCK

If either or both of your join inputs are already sorted by the same key you use for the join (same position, length, format, and order), you can tell DFSORT so it does not sort them again. That saves CPU and I/O and is one of the most effective ways to speed up large joins.

SORTED is specified on the JOINKEYS statement for that file. It means: this file is in the same order as the join key (ascending or descending as specified in FIELDS=). DFSORT will not sort this file; it will use it as-is for the join.

NOSEQCK (no sequence check) tells DFSORT not to verify that the file is in order. Use NOSEQCK only when you are certain the data is correctly sorted; otherwise a sequence error can catch bad data. If you are not sure, omit NOSEQCK so DFSORT validates the order (at some CPU cost).

Example: both files already sorted by join key

text
1
2
3
4
JOINKEYS F1=MASTER,FIELDS=(1,10,CH,A),SORTED,NOSEQCK JOINKEYS F2=DETAIL,FIELDS=(5,10,CH,A),SORTED,NOSEQCK REFORMAT FIELDS=(F1:1,80,F2:1,100) SORT FIELDS=COPY

Here both MASTER and DETAIL are already in ascending order by the 10-byte key (F1 at 1–10, F2 at 5–14). SORTED avoids sorting either file. NOSEQCK skips sequence checking. The join then proceeds without the sort phase for F1 and F2, which greatly reduces work for large files.

How JOINKEYS Uses Sortwork

JOINKEYS runs three logical tasks: one for F1, one for F2, and a main join task. If a file is not declared SORTED, DFSORT must sort (or merge) it by the join key; that sort uses SORTWK (or SORTWK01, SORTWK02, …) work datasets. The main join task also produces a stream of joined records that may need to be sorted or written to work files. So a single JOINKEYS step can use many SORTWK datasets—dozens in real-world large jobs—and each must be large enough to hold its share of the data.

If you do not allocate enough sortwork, the step may fail with out-of-space or resource errors. IBM provides tuning recommendations for large DFSORT tasks: estimate the volume of data (F1 size, F2 size, and approximate join output size), then allocate enough SORTWK space. Using multiple SORTWKnn DDs (e.g. SORTWK01 through SORTWK22) spreads I/O and can improve performance.

Avoiding Double Sort

A common mistake is to run a SORT step that sorts a file by the join key and writes it to a dataset, then use that dataset as SORTJNF1 (or F1) in a JOINKEYS step without specifying SORTED. DFSORT then does not know the file is already in order and may sort it again. That doubles the sort work for that file.

Whenever the input to JOINKEYS is the output of a previous step that sorted by the same key (same position, length, format, order), specify SORTED on that JOINKEYS. The same applies if the dataset was created by another job or tool that guarantees order by the join key.

Parallel Reading of F1 and F2

JOINKEYS is designed so that the F1 and F2 input tasks can run in parallel. While one task is reading and (if needed) sorting F1, the other can read and sort F2. That overlap reduces elapsed time when both files are large. To get the best benefit, place the two input datasets on different volumes or channels so I/O does not contend; also ensure sufficient memory and sortwork so neither task is starved.

Planning Output and Sortwork Size

The join output can be much larger than either input when there are duplicate keys (one-to-many or many-to-many). For each key, output rows = (F1 records with that key) × (F2 records with that key). So total output record count can be huge. You must:

  • Estimate the output record count (e.g. sum over keys of F1 count × F2 count for inner join).
  • Set the output record length from REFORMAT (sum of the lengths you specify for F1 and F2 segments).
  • Size SORTOUT and any downstream datasets for that many records and total bytes.
  • Allocate enough SORTWK for the join phase to hold and process the joined stream (refer to IBM tuning guides for formulas and examples).

Blocksize and I/O Efficiency

Use a blocksize that is a multiple of record length and within the device maximum (e.g. 32K or 64K for disk). Large blocks reduce the number of I/O operations. For very large files, dynamic allocation (DYNALLOC) and appropriate SIZE hints can help DFSORT allocate optimal work and output space.

Quick Reference: What to Do for Large Joins

Actions for joining large datasets
AreaAction
Pre-sorted inputsSpecify SORTED (and NOSEQCK if safe) on JOINKEYS for each file already in join-key order.
SortworkAllocate enough SORTWK (or SORTWK01, SORTWK02, ...) for F1 sort, F2 sort, and join output; use IBM tuning guidelines for size.
Output sizeEstimate output rows (e.g. sum of F1×F2 per key for inner join) and size SORTOUT and downstream datasets accordingly.
BlocksizeUse efficient blocksize for input and output (e.g. multiple of record length, within device limits) to reduce I/O.
Double sortDo not feed a sorted dataset into JOINKEYS without SORTED; otherwise DFSORT will sort it again.

Explain It Like I'm Five

Imagine matching two long lists of names. If both lists are already in alphabetical order, you can walk through them together without sorting either list again—that’s SORTED. If you have lots of papers (sortwork), you can spread the work on the table and not run out of space. And if two people read the two lists at the same time (one reads list 1, one reads list 2), you finish faster—that’s JOINKEYS reading F1 and F2 in parallel.

Exercises

  1. Your job has step 1: SORT by customer ID, output to TEMP. Step 2: JOINKEYS with TEMP as F1 and another file as F2, join key = customer ID. What should you add to the JOINKEYS for F1 and why?
  2. Why might a JOINKEYS step need more SORTWK space than a simple SORT of one file of the same size?
  3. F1 has 1 million records, F2 has 2 million; many-to-many can produce up to 10 million joined rows. What must you plan for SORTOUT and sortwork?
  4. What is the trade-off of using NOSEQCK with SORTED?

Quiz

Test Your Knowledge

1. Your two join inputs are already sorted by the join key. What can you specify to avoid DFSORT re-sorting them?

  • OPTION COPY
  • SORTED on each JOINKEYS (e.g. JOINKEYS F1=...,SORTED) and optionally NOSEQCK to skip sequence check
  • JOIN FAST
  • PRE SORTED in REFORMAT

2. Why might a JOINKEYS step use many SORTWK datasets?

  • Only for full outer join
  • JOINKEYS uses separate sort/merge work for F1, F2, and the main join; large files need multiple SORTWKnn (e.g. SORTWK01–SORTWK22) for intermediate data
  • Only when using REFORMAT
  • SORTWK is only for single-file SORT

3. You run a SORT step that writes SORTOUT, then a second step that uses that SORTOUT as SORTJNF1 for JOINKEYS. What risk is there?

  • None
  • The data may be sorted twice—once in the first SORT and again inside JOINKEYS if you do not tell JOINKEYS that F1 is already sorted; use SORTED on the JOINKEYS for F1 to avoid redundant work
  • JOINKEYS cannot read from SORTOUT
  • You must use MERGE instead

4. How does JOINKEYS use the two input files internally?

  • It reads F1 then F2 sequentially
  • It uses separate tasks for F1 and F2 that can run in parallel; F1 and F2 are read concurrently, which helps I/O and elapsed time on large datasets
  • It merges them first then joins
  • Only F1 is read; F2 is used as lookup

5. What should you plan for when the join output is much larger than either input (e.g. many-to-many)?

  • Only SORTOUT size
  • SORTOUT record count and length, sortwork (SORTWK) space for the join output stream, and downstream dataset sizes; output can exceed either input
  • Only REFORMAT length
  • Nothing special