De-duplication—keeping one record per unique key and discarding or handling the rest—is a common requirement in mainframe batch. DFSORT offers several ways to do it: SUM FIELDS=NONE for a single output of unique records, and ICETOOL SELECT with NODUPS and ALLDUPS when you need both a unique file and a duplicate file. The strategy you choose depends on whether you need to keep the first or last record in each group, whether the key is single or composite, and whether you must output the duplicate records somewhere. This page brings together expert strategies: sort order and EQUALS for predictable “keep first” or “keep last,” composite keys, filtering before dedup, and when to use SUM versus ICETOOL.
The standard DFSORT way to remove duplicates is to sort by the key that defines “duplicate,” then use SUM FIELDS=NONE. SUM runs after the sort and collapses every group of consecutive records with the same sort key into one record. With FIELDS=NONE no numeric summing is done—only duplicate elimination. The result is one record per unique key in SORTOUT. The sort key must match the duplicate key: if two records are duplicates when they have the same customer ID, then SORT FIELDS= must include (at least) customer ID so that all records with the same customer ID are adjacent. Then SUM can collapse them. If you sort by a different key, records with the same customer ID may be scattered and SUM will not deduplicate correctly.
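For example, assuming the duplicate key is a customer ID in columns 1–10 in character format (the positions are illustrative), a minimal sketch of the control statements is:

```
* Sort by the key, then keep one record per key value
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE
```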
SUM FIELDS=NONE keeps one record per key, but which record? With OPTION EQUALS in effect (code it explicitly rather than relying on the installation default), the relative order of records that compare equal on the sort key is preserved. So the first record in each group—the one that appears first after the sort—is the one SUM keeps. That means you control which record is kept by choosing the sort order. To keep the “first” by time (e.g. earliest date), sort by the key and then the date in ascending order: the earliest date is first in its group. To keep the “last” (e.g. most recent date), sort by the key and then the date in descending order: the most recent date is first in its group, and SUM keeps it. With OPTION NOEQUALS, the order among equal keys is not guaranteed, so which record is kept is not predictable. For predictable deduplication, use EQUALS and set the sort order so the record you want is first in each group.
Input has customer ID in 1–10 and transaction date in 20–27 (YYYYMMDD). You want one record per customer, keeping the one with the latest date:
```
  OPTION EQUALS
  SORT FIELDS=(1,10,CH,A,20,8,CH,D)
  SUM FIELDS=NONE
```
The primary sort key is customer ID (ascending). The secondary key is date descending (D). So within each customer ID, the record with the latest date is first. SUM FIELDS=NONE keeps the first of each group, so you get the latest record per customer. If you wanted the earliest date instead, you would use 20,8,CH,A (ascending) so the earliest date is first.
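For completeness, here is the keep-earliest variant of the same example, with the date sorted ascending so the earliest record is first in each group:

```
  OPTION EQUALS
  SORT FIELDS=(1,10,CH,A,20,8,CH,A)
  SUM FIELDS=NONE
```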
Sometimes a duplicate is defined by more than one field. For example, one record per employee per day: the key is (employee ID, date). In that case, put both fields in SORT FIELDS= in the order you need (e.g. employee ID first, then date). After the sort, all records with the same (employee ID, date) are adjacent. SUM FIELDS=NONE then keeps one record per unique (employee ID, date). So the composite key defines what “duplicate” means. You can have three or more key fields; the same rule applies: records that match on every key field are in the same group and are collapsed to one.
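As a sketch of the (employee ID, date) case, assuming the employee ID is in columns 1–6 and the date in columns 11–18, both character (the positions are illustrative, not taken from the earlier example):

```
* One record per (employee ID, date) combination
  OPTION EQUALS
  SORT FIELDS=(1,6,CH,A,11,8,CH,A)
  SUM FIELDS=NONE
```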
SUM FIELDS=NONE writes only the unique records to SORTOUT; the duplicate records are discarded. If you need both a file of unique records and a file of all duplicate records (e.g. for auditing or reprocessing), DFSORT does not support XSUM (the Syncsort option that writes discarded duplicates to a second DD). Instead, use ICETOOL with two SELECT operators. One SELECT writes records whose key appears exactly once to the “unique” file (NODUPS). The other SELECT writes every record whose key appears more than once to the “duplicates” file (ALLDUPS). SELECT sorts the input by the ON() fields as part of the operation, so a separate pre-sort step is not required. See the “Removing duplicates into separate files” tutorial for full JCL and control statement examples.
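As a sketch of the TOOLIN statements, assuming a 10-byte character key in columns 1–10 and DDnames IN, UNIQUE, and DUPS defined in the JCL (the DDnames and key position are illustrative):

```
* Keys occurring exactly once go to UNIQUE; keys occurring more than once go to DUPS
  SELECT FROM(IN) TO(UNIQUE) ON(1,10,CH) NODUPS
  SELECT FROM(IN) TO(DUPS) ON(1,10,CH) ALLDUPS
```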
Often you want to remove duplicates only among records that meet a condition—for example, one record per account among “active” accounts only. You can do this in one step: use INCLUDE (or OMIT) to filter first, then SORT FIELDS= by the duplicate key, then SUM FIELDS=NONE. Only records that pass the filter are sorted and summed. So the output has one record per key within the filtered set. This is usually more efficient than running a separate step to write a temp file and then sort/sum it, because you avoid extra I/O and a second pass.
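A sketch of the one-step approach, assuming the account number is in columns 1–10 and a one-byte status flag in column 15 where C'A' means active (positions and values are illustrative):

```
* Keep only active records, then one record per account number
  INCLUDE COND=(15,1,CH,EQ,C'A')
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE
```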
| Goal | Approach |
|---|---|
| One record per key; which record does not matter | SORT FIELDS=(key) SUM FIELDS=NONE; OPTION EQUALS or NOEQUALS. |
| Keep first (e.g. earliest date) | SORT by key and date ascending; EQUALS; SUM FIELDS=NONE. |
| Keep last (e.g. latest date) | SORT by key and date descending; EQUALS; SUM FIELDS=NONE. |
| Unique records in one file, duplicates in another | Sort by key; ICETOOL SELECT NODUPS to one file, ALLDUPS to the other. |
| Dedup only within a subset | INCLUDE/OMIT first, then SORT and SUM FIELDS=NONE. |
When SUM performs numeric summing (e.g. SUM FIELDS=(30,8,PD)) in addition to duplicate elimination, overflow can occur if the summed value is too large for the output field. OPTION VLSHRT and NOVLSHRT control how DFSORT treats variable-length records that are too short to contain all of the specified sort or sum fields. For pure deduplication with SUM FIELDS=NONE, overflow is not an issue because no numeric fields are summed. If you sum numeric fields instead of using FIELDS=NONE, see the “Handling overflow” tutorial for overflow behavior.
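For contrast, a sketch of SUM with numeric summing, assuming a packed-decimal amount in columns 30–37 and the key in columns 1–10 (both illustrative): duplicates on the key are collapsed into one record and the amounts are added, so the total can overflow the 8-byte PD field if it grows too large.

```
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=(30,8,PD)
```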
Imagine you have a stack of cards with names and dates. You want one card per person. If you want the newest card for each person, you first sort the stack so that for each name, the card with the latest date is on top. Then you go through the stack and, for each name, keep only the top card and throw the rest away. That’s SUM: sort so the one you want is “first,” then keep one per name. If you also want to put the thrown-away cards in a separate pile (the duplicates), you need a different rule: first sort by name, then use one pile for “only one card per name” and another pile for “all the extra cards.” That’s what NODUPS and ALLDUPS do. So: sort so the right card is first, then either keep one per key (SUM) or split into two piles (SELECT NODUPS and ALLDUPS).
1. You need one record per customer ID and want to keep the most recent (by date in cols 20–27). What strategy do you use?
2. When should you use ICETOOL SELECT NODUPS/ALLDUPS instead of SUM FIELDS=NONE?
3. What defines a "duplicate" when using a composite sort key?
4. Why use OPTION EQUALS when deduplicating with SUM?
5. You want to deduplicate by account number but only among records that passed INCLUDE (e.g. status = active). How do you structure the step?