De-duplication Strategies

De-duplication—keeping one record per unique key and discarding or handling the rest—is a common requirement in mainframe batch. DFSORT offers several ways to do it: SUM FIELDS=NONE for a single output of unique records, and ICETOOL SELECT with NODUPS and ALLDUPS when you need both a unique file and a duplicate file. The strategy you choose depends on whether you need to keep the first or last record in each group, whether the key is single or composite, and whether you must output the duplicate records somewhere. This page brings together expert strategies: sort order and EQUALS for predictable “keep first” or “keep last,” composite keys, filtering before dedup, and when to use SUM versus ICETOOL.

Strategy 1: SUM FIELDS=NONE for One Record per Key

The standard DFSORT way to remove duplicates is to sort by the key that defines “duplicate,” then use SUM FIELDS=NONE. SUM runs after the sort and collapses every group of consecutive records with the same sort key into one record. With FIELDS=NONE no numeric summing is done—only duplicate elimination. The result is one record per unique key in SORTOUT. The sort key must match the duplicate key: if two records are duplicates when they have the same customer ID, then SORT FIELDS= must include (at least) customer ID so that all records with the same customer ID are adjacent. Then SUM can collapse them. If you sort by a different key, records with the same customer ID may be scattered and SUM will not deduplicate correctly.
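
A minimal sketch of this pattern, assuming (as in the example further down) that the customer ID is a character field in columns 1–10; adjust positions and formats to your record layout:

* sort by the duplicate key so equal keys are adjacent
  SORT FIELDS=(1,10,CH,A)
* no summing; just keep one record per unique key
  SUM FIELDS=NONE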

Strategy 2: Keep First vs Keep Last—Sort Order and EQUALS

SUM FIELDS=NONE keeps one record per key, but which record? With OPTION EQUALS in effect, the relative order of records that compare equal on the sort key is preserved, so the record that appears first in each group after the sort is the one SUM keeps. (Specify EQUALS explicitly rather than relying on the installation default, which may be NOEQUALS.) That means you control which record is kept by choosing the sort order. To keep the "first" by time (e.g. earliest date), sort by the key and then the date in ascending order: the earliest date is first in each group. To keep the "last" (e.g. most recent date), sort by the key and then the date in descending order: the most recent date is first in each group, and SUM keeps it. With NOEQUALS, the order among equal keys is not guaranteed, so which record is kept is not predictable. For predictable deduplication, use EQUALS and choose the sort order so that the record you want comes first.

Example: Keep Latest Record per Customer

Input has customer ID in 1–10 and transaction date in 20–27 (YYYYMMDD). You want one record per customer, keeping the one with the latest date:

  OPTION EQUALS
  SORT FIELDS=(1,10,CH,A,20,8,CH,D)
  SUM FIELDS=NONE

The primary sort key is customer ID (ascending). The secondary key is date descending (D). So within each customer ID, the record with the latest date is first. SUM FIELDS=NONE keeps the first of each group, so you get the latest record per customer. If you wanted the earliest date instead, you would use 20,8,CH,A (ascending) so the earliest date is first.
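
As a sketch, the keep-earliest variant of the same job (same assumed layout: customer ID in 1–10, date in 20–27) only flips the date order:

* customer ID ascending, date ascending: earliest date is first in each group
  OPTION EQUALS
  SORT FIELDS=(1,10,CH,A,20,8,CH,A)
  SUM FIELDS=NONE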

Strategy 3: Composite Keys—Multiple Fields Define Duplicate

Sometimes a duplicate is defined by more than one field. For example, one record per employee per day: the key is (employee ID, date). In that case, put both fields in SORT FIELDS= in the order you need (e.g. employee ID first, then date). After the sort, all records with the same (employee ID, date) are adjacent. SUM FIELDS=NONE then keeps one record per unique (employee ID, date). So the composite key defines what “duplicate” means. You can have three or more key fields; the same rule applies: records that match on every key field are in the same group and are collapsed to one.
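
A sketch for the employee-per-day case, assuming hypothetical positions of employee ID in columns 1–8 and date in columns 9–16, both character:

* composite key: (employee ID, date) defines "duplicate"
  OPTION EQUALS
  SORT FIELDS=(1,8,CH,A,9,8,CH,A)
* one record per unique (employee ID, date) combination
  SUM FIELDS=NONE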

Strategy 4: Two Outputs—Uniques and Duplicates (ICETOOL SELECT)

SUM FIELDS=NONE writes only the surviving record of each key group to SORTOUT; the discarded duplicates are not written anywhere. If you need both a file of unique records and a file of all duplicate records (e.g. for auditing or reprocessing), note that DFSORT does not support XSUM (writing discarded duplicates to a second DD). Instead, use ICETOOL with two SELECT operators. One SELECT writes records whose key appears exactly once to the "unique" file (NODUPS). The other SELECT writes every record whose key appears more than once to the "duplicates" file (ALLDUPS). SELECT sorts the input by the ON() fields itself, so a separate pre-sort step is not required. See the "Removing duplicates into separate files" tutorial for full JCL and control statement examples.
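
A sketch of the two TOOLIN operators, assuming a character key in columns 1–10 and hypothetical DD names IN, UNIQ and DUPS defined in the step's JCL:

* keys that occur exactly once go to UNIQ
  SELECT FROM(IN) TO(UNIQ) ON(1,10,CH) NODUPS
* every record of a key that occurs more than once goes to DUPS
  SELECT FROM(IN) TO(DUPS) ON(1,10,CH) ALLDUPS

Together the two outputs account for every input record, so nothing is silently dropped.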

Strategy 5: Deduplicate Within a Subset (Filter Then Sort Then SUM)

Often you want to remove duplicates only among records that meet a condition—for example, one record per account among “active” accounts only. You can do this in one step: use INCLUDE (or OMIT) to filter first, then SORT FIELDS= by the duplicate key, then SUM FIELDS=NONE. Only records that pass the filter are sorted and summed. So the output has one record per key within the filtered set. This is usually more efficient than running a separate step to write a temp file and then sort/sum it, because you avoid extra I/O and a second pass.
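
A sketch for the "one record per account among active accounts" case, assuming hypothetical positions of account number in columns 1–10 and a one-byte status code in column 15 where 'A' means active:

* filter first: only active accounts reach the sort
  INCLUDE COND=(15,1,CH,EQ,C'A')
* then dedup by account number within the filtered set
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE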

Strategy Summary

Choosing a de-duplication strategy
Goal | Approach
One record per key; which record does not matter | SORT FIELDS=(key), then SUM FIELDS=NONE; EQUALS or NOEQUALS.
Keep first (e.g. earliest date) | SORT by key, then date ascending; EQUALS; SUM FIELDS=NONE.
Keep last (e.g. latest date) | SORT by key, then date descending; EQUALS; SUM FIELDS=NONE.
Unique records in one file, duplicates in another | ICETOOL SELECT with NODUPS to one file and ALLDUPS to the other (SELECT sorts by the ON() key itself).
Dedup only within a subset | INCLUDE/OMIT first, then SORT and SUM FIELDS=NONE.

Overflow and Variable-Length Records

When SUM is used with numeric summing (e.g. SUM FIELDS=(30,8,PD)), duplicates are still collapsed to one record per key, but overflow can occur if a summed total is too large for the output field. OPTION VLSHRT and NOVLSHRT control whether DFSORT tolerates variable-length records that are too short to contain all of the specified sort or SUM fields. For pure deduplication with SUM FIELDS=NONE, overflow is not an issue because no numeric fields are summed. If you sum numeric fields instead of using NONE, see the "Handling overflow" tutorial for overflow behavior.
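
For contrast, a sketch that sums a numeric field while collapsing duplicates (assumed layout: key in columns 1–10, packed-decimal amount in columns 30–37); this is the case where overflow can occur if a total no longer fits in the 8-byte PD field:

* one record per key, with the PD amounts of each group added together
  OPTION EQUALS
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=(30,8,PD)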

Explain It Like I'm Five

Imagine you have a stack of cards with names and dates. You want one card per person. If you want the newest card for each person, you first sort the stack so that for each name, the card with the latest date is on top. Then you go through the stack and, for each name, keep only the top card and throw the rest away. That’s SUM: sort so the one you want is “first,” then keep one per name. If you also want to put the thrown-away cards in a separate pile (the duplicates), you need a different rule: first sort by name, then use one pile for “only one card per name” and another pile for “all the extra cards.” That’s what NODUPS and ALLDUPS do. So: sort so the right card is first, then either keep one per key (SUM) or split into two piles (SELECT NODUPS and ALLDUPS).

Exercises

  1. You have records with (dept, emp_id, hire_date). You want one record per (dept, emp_id), keeping the one with the latest hire_date. Write the SORT and SUM control statements.
  2. When would you use ICETOOL SELECT with NODUPS and ALLDUPS instead of SUM FIELDS=NONE? What does SUM not give you that SELECT does?
  3. You need one record per order_id among records where status = 'S'. How do you combine INCLUDE and SUM in one step?

Quiz

Test Your Knowledge

1. You need one record per customer ID and want to keep the most recent (by date in cols 20–27). What strategy do you use?

  • SORT by customer ID only, then SUM FIELDS=NONE
  • SORT by customer ID (primary) and date descending (secondary) so the latest date is first in each group, then SUM FIELDS=NONE with EQUALS to keep the first (most recent)
  • Use INCLUDE only
  • Use two SUM statements

2. When should you use ICETOOL SELECT NODUPS/ALLDUPS instead of SUM FIELDS=NONE?

  • Never; SUM is always better
  • When you need both a file of unique records and a file of duplicate records; SUM keeps only uniques in SORTOUT and does not write duplicates anywhere
  • Only for MERGE
  • Only when the key is numeric

3. What defines a "duplicate" when using a composite sort key?

  • Only the first field
  • The combination of all fields in SORT FIELDS=—records with the same values in all key fields are in the same group and SUM collapses them to one
  • Only the last field
  • SUM ignores composite keys

4. Why use OPTION EQUALS when deduplicating with SUM?

  • EQUALS is required for SUM
  • EQUALS preserves the relative order of records with equal sort keys, so the “first” record in each group is predictable—you control which record is kept by sort order
  • EQUALS removes more duplicates
  • EQUALS is faster than NOEQUALS

5. You want to deduplicate by account number but only among records that passed INCLUDE (e.g. status = active). How do you structure the step?

  • Use two steps: first INCLUDE to a temp file, then SORT/SUM in a second step
  • Use one step: INCLUDE to filter first, then SORT FIELDS= by account number, then SUM FIELDS=NONE—filtering reduces the set, then sort and SUM collapse by key
  • OMIT after SUM
  • Use JOINKEYS