Sometimes you need to split your data into two outputs: one file that contains only unique records (one per key), and another that contains only the duplicate records (every extra copy of a key that appeared more than once). For example, you might send uniques to a master file and duplicates to an exception file for review. In Syncsort you can do this in one step with SUM FIELDS=NONE and XSUM, which writes the eliminated duplicates to a SORTXSUM dataset. DFSORT does not support XSUM. In DFSORT you use ICETOOL with the SELECT operator: one SELECT with NODUPS sends unique records to one file, and another SELECT with ALLDUPS sends all duplicate occurrences to another file. This page explains the goal, the Syncsort XSUM approach, and the DFSORT ICETOOL SELECT approach with NODUPS and ALLDUPS.
Assume you have a key (e.g. employee ID in positions 1–10). A record is unique if no other record has the same key. A record is a duplicate if at least one other record has the same key. When we say “unique file” we mean a file that contains only records whose key appears exactly once in the input—so one record per key and no key repeated. When we say “duplicate file” we mean a file that contains every record whose key appears more than once—so all the “extra” copies. The first occurrence of a key can be sent to the unique file (or you might send only keys that never duplicate; that depends on the operator). NODUPS and ALLDUPS in ICETOOL implement these two sets: NODUPS = keys that appear only once; ALLDUPS = every record that is part of a duplicate group.
In Syncsort, SUM FIELDS=NONE removes duplicates and keeps the first occurrence of each key in SORTOUT. Adding XSUM writes all the records that were eliminated (the duplicates) to the dataset pointed to by the SORTXSUM DD. So in one step you get: SORTOUT = one record per key; SORTXSUM = all the duplicate records. DFSORT does not support XSUM. If you code XSUM in DFSORT you get message ICE172A (XSUM IS NOT SUPPORTED - USE ICETOOL SELECT IF APPROPRIATE). So on DFSORT you use ICETOOL SELECT to achieve the same result.
12SORT FIELDS=(1,10,CH,A) SUM FIELDS=NONE,XSUM
SORTOUT receives the first of each key; SORTXSUM receives all eliminated duplicates. In DFSORT you omit XSUM and use ICETOOL SELECT instead.
ICETOOL SELECT copies records from an input to an output based on criteria. The ON() clause defines the key (position, length, format). NODUPS means “select only records whose key appears exactly once in the input.” ALLDUPS means “select every record whose key appears more than once.” So one SELECT with NODUPS fills the “unique” file, and another SELECT with ALLDUPS fills the “duplicates” file. Both read the same input; the input must be sorted by the key so that duplicate detection works correctly.
| Option | Meaning | Output |
|---|---|---|
| NODUPS | Key appears exactly once | Only those records (unique keys) |
| ALLDUPS | Key appears more than once | Every occurrence of such keys (all duplicates) |
| FIRST | First of each key group | One record per key (like SUM FIELDS=NONE) |
NODUPS and ALLDUPS determine “same key” by comparing adjacent records (or by building a view of key counts, depending on implementation). For correct results, the input must be sorted by the same key you specify in ON(). If the input is not sorted, two records with the same key might be far apart and the tool might treat them as unique. So before running ICETOOL SELECT, run a SORT (or use a SORT control statement inside the same ICETOOL run if supported) so that all records with the same key are adjacent.
To produce both a unique file and a duplicate file, use two SELECT operators in the same ICETOOL run. The first SELECT writes to the unique file with NODUPS; the second writes to the duplicate file with ALLDUPS. Each SELECT specifies FROM(input_dd) TO(output_dd) ON(position,length,format) and the appropriate option. The USING() clause can point to a control statement file if you need SORT or INCLUDE/OMIT; for a simple SELECT, you may only need ON() and the option.
12SELECT FROM(IN) TO(UNIQUE) ON(1,10,CH) NODUPS SELECT FROM(IN) TO(DUPS) ON(1,10,CH) ALLDUPS
IN is the input DD (sorted by positions 1–10). UNIQUE receives only records whose key (1–10) appears exactly once. DUPS receives every record whose key appears more than once. So you get the same logical split as XSUM: one file with “first occurrence only” style uniques, one file with all the duplicates. Exact ICETOOL syntax (e.g. whether ON takes position,length,format or a different form) is product-dependent; check your ICETOOL manual.
If your input is not yet sorted, you can have ICETOOL run a SORT first using a control statement file. You define a CTL dd (e.g. CTL1) with SORT FIELDS=(1,10,CH,A) and then reference it in SELECT with USING(CTL1). The order of operations is typically: SORT (if USING is present), then SELECT. So the data is sorted by the key before NODUPS/ALLDUPS is applied. This way you can do “sort and split into unique/duplicate” in one job step.
12345//CTL1 DD * SORT FIELDS=(1,10,CH,A) /* SELECT FROM(IN) TO(UNIQUE) ON(1,10,CH) USING(CTL1) NODUPS SELECT FROM(IN) TO(DUPS) ON(1,10,CH) USING(CTL1) ALLDUPS
The USING(CTL1) supplies the SORT fields so that IN is processed in sorted order by the key before the SELECT logic runs. Syntax for combining SORT and SELECT in one ICETOOL run varies by product; see your documentation.
If you only need one record per key (like SUM FIELDS=NONE) and do not need a separate file of duplicates, you can use SELECT with FIRST: it keeps the first occurrence of each key and discards the rest. That gives you a single “deduplicated” file. Use FIRST when you do not need to keep or inspect the duplicate records.
So the unique file has no repeated keys; the duplicate file has only keys that were repeated. Together they partition the input (each input record goes to one or the other, depending on whether its key was unique or duplicated).
Imagine a pile of name tags. Some names appear only once (unique), and some names appear two or more times (duplicates). You want two boxes: one box for “names that appear only once,” and one box for “every copy of a name that had duplicates.” NODUPS is the rule for the first box: put in only the name tags for names that have no twin. ALLDUPS is the rule for the second box: put in every name tag that has at least one twin. So you end up with one box of “lonely” names and one box of “duplicate” names. In the computer we do the same thing: one file gets the unique records, the other gets all the duplicate records. XSUM in Syncsort does this in one step; in DFSORT we use ICETOOL SELECT with NODUPS and ALLDUPS.
1. What does NODUPS do in ICETOOL SELECT?
2. What does ALLDUPS do in ICETOOL SELECT?
3. Does DFSORT support XSUM?
4. For ICETOOL SELECT with ON(1,10,CH), what must be true about the input?
5. How do you get both a "unique only" file and a "duplicates only" file in one DFSORT/ICETOOL run?