MainframeMaster

Removing Duplicates Into Separate Files

Sometimes you need to split your data into two outputs: one file that contains only unique records (one per key), and another that contains only the duplicate records (every extra copy of a key that appeared more than once). For example, you might send uniques to a master file and duplicates to an exception file for review. In Syncsort you can do this in one step with SUM FIELDS=NONE and XSUM, which writes the eliminated duplicates to a SORTXSUM dataset. DFSORT does not support XSUM. In DFSORT you use ICETOOL with the SELECT operator: one SELECT with NODUPS sends unique records to one file, and another SELECT with ALLDUPS sends all duplicate occurrences to another file. This page explains the goal, the Syncsort XSUM approach, and the DFSORT ICETOOL SELECT approach with NODUPS and ALLDUPS.

OUTFIL Advanced / ICETOOL
Progress0 of 0 lessons

What “Unique” and “Duplicate” Files Mean

Assume you have a key (e.g. employee ID in positions 1–10). A record is unique if no other record has the same key. A record is a duplicate if at least one other record has the same key. When we say “unique file” we mean a file that contains only records whose key appears exactly once in the input—so one record per key and no key repeated. When we say “duplicate file” we mean a file that contains every record whose key appears more than once—so all the “extra” copies. The first occurrence of a key can be sent to the unique file (or you might send only keys that never duplicate; that depends on the operator). NODUPS and ALLDUPS in ICETOOL implement these two sets: NODUPS = keys that appear only once; ALLDUPS = every record that is part of a duplicate group.

Syncsort XSUM (Not in DFSORT)

In Syncsort, SUM FIELDS=NONE removes duplicates and keeps the first occurrence of each key in SORTOUT. Adding XSUM writes all the records that were eliminated (the duplicates) to the dataset pointed to by the SORTXSUM DD. So in one step you get: SORTOUT = one record per key; SORTXSUM = all the duplicate records. DFSORT does not support XSUM. If you code XSUM in DFSORT you get message ICE172A (XSUM IS NOT SUPPORTED - USE ICETOOL SELECT IF APPROPRIATE). So on DFSORT you use ICETOOL SELECT to achieve the same result.

Syncsort example (for comparison)

text
1
2
SORT FIELDS=(1,10,CH,A) SUM FIELDS=NONE,XSUM

SORTOUT receives the first of each key; SORTXSUM receives all eliminated duplicates. In DFSORT you omit XSUM and use ICETOOL SELECT instead.

ICETOOL SELECT: NODUPS and ALLDUPS

ICETOOL SELECT copies records from an input to an output based on criteria. The ON() clause defines the key (position, length, format). NODUPS means “select only records whose key appears exactly once in the input.” ALLDUPS means “select every record whose key appears more than once.” So one SELECT with NODUPS fills the “unique” file, and another SELECT with ALLDUPS fills the “duplicates” file. Both read the same input; the input must be sorted by the key so that duplicate detection works correctly.

SELECT options for duplicate handling
OptionMeaningOutput
NODUPSKey appears exactly onceOnly those records (unique keys)
ALLDUPSKey appears more than onceEvery occurrence of such keys (all duplicates)
FIRSTFirst of each key groupOne record per key (like SUM FIELDS=NONE)

Input Must Be Sorted by the Key

NODUPS and ALLDUPS determine “same key” by comparing adjacent records (or by building a view of key counts, depending on implementation). For correct results, the input must be sorted by the same key you specify in ON(). If the input is not sorted, two records with the same key might be far apart and the tool might treat them as unique. So before running ICETOOL SELECT, run a SORT (or use a SORT control statement inside the same ICETOOL run if supported) so that all records with the same key are adjacent.

Two SELECT Operators for Two Files

To produce both a unique file and a duplicate file, use two SELECT operators in the same ICETOOL run. The first SELECT writes to the unique file with NODUPS; the second writes to the duplicate file with ALLDUPS. Each SELECT specifies FROM(input_dd) TO(output_dd) ON(position,length,format) and the appropriate option. The USING() clause can point to a control statement file if you need SORT or INCLUDE/OMIT; for a simple SELECT, you may only need ON() and the option.

Example: unique and duplicate files with ICETOOL

text
1
2
SELECT FROM(IN) TO(UNIQUE) ON(1,10,CH) NODUPS SELECT FROM(IN) TO(DUPS) ON(1,10,CH) ALLDUPS

IN is the input DD (sorted by positions 1–10). UNIQUE receives only records whose key (1–10) appears exactly once. DUPS receives every record whose key appears more than once. So you get the same logical split as XSUM: one file with “first occurrence only” style uniques, one file with all the duplicates. Exact ICETOOL syntax (e.g. whether ON takes position,length,format or a different form) is product-dependent; check your ICETOOL manual.

Using a SORT Control Statement with ICETOOL

If your input is not yet sorted, you can have ICETOOL run a SORT first using a control statement file. You define a CTL dd (e.g. CTL1) with SORT FIELDS=(1,10,CH,A) and then reference it in SELECT with USING(CTL1). The order of operations is typically: SORT (if USING is present), then SELECT. So the data is sorted by the key before NODUPS/ALLDUPS is applied. This way you can do “sort and split into unique/duplicate” in one job step.

Example: sort then SELECT (conceptual)

text
1
2
3
4
5
//CTL1 DD * SORT FIELDS=(1,10,CH,A) /* SELECT FROM(IN) TO(UNIQUE) ON(1,10,CH) USING(CTL1) NODUPS SELECT FROM(IN) TO(DUPS) ON(1,10,CH) USING(CTL1) ALLDUPS

The USING(CTL1) supplies the SORT fields so that IN is processed in sorted order by the key before the SELECT logic runs. Syntax for combining SORT and SELECT in one ICETOOL run varies by product; see your documentation.

FIRST: One Record per Key (No Separate Duplicate File)

If you only need one record per key (like SUM FIELDS=NONE) and do not need a separate file of duplicates, you can use SELECT with FIRST: it keeps the first occurrence of each key and discards the rest. That gives you a single “deduplicated” file. Use FIRST when you do not need to keep or inspect the duplicate records.

Comparing the Two Outputs

  • Unique file (NODUPS): Every key in this file appears exactly once in the input. Record count ≤ input count; if there were no duplicates, it equals the input count.
  • Duplicate file (ALLDUPS): Every key in this file appears at least twice in the input. So every record in this file has at least one “twin.” Record count = total input records minus the number of unique keys (if we count first occurrence as unique) or product-specific; check how your product counts.

So the unique file has no repeated keys; the duplicate file has only keys that were repeated. Together they partition the input (each input record goes to one or the other, depending on whether its key was unique or duplicated).

Explain It Like I'm Five

Imagine a pile of name tags. Some names appear only once (unique), and some names appear two or more times (duplicates). You want two boxes: one box for “names that appear only once,” and one box for “every copy of a name that had duplicates.” NODUPS is the rule for the first box: put in only the name tags for names that have no twin. ALLDUPS is the rule for the second box: put in every name tag that has at least one twin. So you end up with one box of “lonely” names and one box of “duplicate” names. In the computer we do the same thing: one file gets the unique records, the other gets all the duplicate records. XSUM in Syncsort does this in one step; in DFSORT we use ICETOOL SELECT with NODUPS and ALLDUPS.

Exercises

  1. Your input is sorted by employee ID (1–8, CH). Write the two ICETOOL SELECT statements to send unique IDs to DD UNIQUE and all duplicate occurrences to DD DUPS.
  2. Why must the input be sorted by the key before using NODUPS or ALLDUPS?
  3. If you get ICE172A when using SUM with XSUM, what does it mean and what should you use instead?
  4. What is the difference between FIRST and NODUPS in SELECT? When would you use each?

Quiz

Test Your Knowledge

1. What does NODUPS do in ICETOOL SELECT?

  • Selects all duplicate records
  • Selects only records that have no duplicate—each key appears exactly once in the output
  • Removes the first record
  • Sorts by key only

2. What does ALLDUPS do in ICETOOL SELECT?

  • Selects only the first of each duplicate group
  • Selects every record that has at least one other record with the same key—all occurrences of duplicate keys go to the output
  • Removes all duplicates
  • Selects only the last duplicate

3. Does DFSORT support XSUM?

  • Yes, same as Syncsort
  • No—XSUM is a Syncsort feature; DFSORT gives ICE172A and you use ICETOOL SELECT with NODUPS/ALLDUPS instead
  • Only in ICETOOL
  • Yes, but only with SUM FIELDS=NONE

4. For ICETOOL SELECT with ON(1,10,CH), what must be true about the input?

  • Input can be unsorted
  • Input should be sorted by the same key (positions 1–10) so that duplicate keys are adjacent; otherwise duplicate detection may be wrong
  • Only one ON is allowed
  • ON is optional

5. How do you get both a "unique only" file and a "duplicates only" file in one DFSORT/ICETOOL run?

  • Use XSUM
  • Use two SELECT operators: one FROM(IN) TO(UNIQUE) ON(key) NODUPS and one FROM(IN) TO(DUPS) ON(key) ALLDUPS, with IN sorted by key
  • Use one SELECT with two TO()
  • Use OUTFIL SPLIT only