Removing duplicates means keeping one record per unique key and discarding the rest. In DFSORT the standard way to do this is to sort by the key that defines “duplicate,” then use SUM FIELDS=NONE. SUM runs after the sort and collapses every group of records with the same key into one record. With SUM FIELDS=NONE no numeric fields are summed—only duplicate elimination happens. Which record is kept (first or last in the group) depends on sort order and on OPTION EQUALS or NOEQUALS. This page focuses on using SUM specifically for deduplication: choosing the key, single vs composite key, EQUALS vs NOEQUALS, and when to use SUM versus ICETOOL SELECT (e.g. when you need the duplicate records in a separate file).
SUM works on the sorted stream. It looks at consecutive records and treats records with the same sort key as one group. So the “duplicate key” must be exactly what you sort by. If you sort by employee ID, then two records with the same employee ID are adjacent and SUM collapses them to one. If you sort by something else (e.g. date), then two records with the same employee ID might not be adjacent, and SUM would not treat them as duplicates. So always set SORT FIELDS= to the field or combination of fields that define a duplicate.
For a single key (e.g. employee ID in positions 1–10), sort by that key in ascending or descending order, then add SUM FIELDS=NONE. Ascending (A) keeps the “smallest” first in the group; descending (D) keeps the “largest” first. Often we want “first occurrence” in the original file order—that usually means sort the key A and use EQUALS so the first record in the group is the one that was first in the input (after the sort).
12SORT FIELDS=(1,10,CH,A) SUM FIELDS=NONE
Records are sorted by the first 10 bytes. Every group of records with the same 10-byte value is reduced to one record. That one record is typically the first of the group when OPTION EQUALS is in effect.
When “duplicate” is defined by more than one field (e.g. employee ID and date), list all of them in SORT FIELDS=. The order of fields in SORT FIELDS= determines the major and minor sort order; for deduplication what matters is that records with the same combination are adjacent. So SORT FIELDS=(1,8,CH,A,9,8,CH,A) sorts by (positions 1–8, then 9–16). Records with the same (field1, field2) are adjacent; SUM FIELDS=NONE keeps one per (field1, field2).
12SORT FIELDS=(1,5,CH,A,10,3,CH,A) SUM FIELDS=NONE
Duplicates are defined by the 5-byte field at 1 and the 3-byte field at 10. One record per unique (ID, date) is kept.
When two records have the same sort key, EQUALS tells DFSORT to preserve their relative order (so the “first” in the group is well-defined). NOEQUALS allows the sort to reorder equal keys for performance, so you cannot rely on first or last. For removing duplicates, if you care which record is kept (e.g. “keep the first by timestamp”), use OPTION EQUALS and order the sort so the one you want is first—e.g. sort by (key, timestamp,A) so the earliest timestamp is first, then SUM FIELDS=NONE keeps that record. If you use NOEQUALS, which record survives is implementation-dependent.
| Topic | Detail |
|---|---|
| Sort key | Must match the duplicate key so equal keys are adjacent |
| SUM FIELDS=NONE | One record per key; no numeric summing |
| EQUALS | Keeps first of each group (predictable); use for dedup |
| NOEQUALS | Which record is kept may be unspecified |
| Two files | Use ICETOOL SELECT NODUPS/ALLDUPS, not SUM alone |
By default (with EQUALS and ascending sort on the key), the first record in each key group is kept. To keep the last, either (1) sort the key in descending order so the “last” becomes first in the group, or (2) add a tie-breaker field (e.g. sequence or timestamp) and sort by (key, tie-breaker D) so the latest or highest is first, then SUM FIELDS=NONE. Or use ICETOOL SELECT with LAST if your product supports it. The idea is to arrange the sort order so the record you want is the first in the group when SUM runs.
SUM FIELDS=NONE keeps one record per key in SORTOUT and discards the rest. It does not write the discarded records anywhere. In Syncsort you can use XSUM to send those records to a SORTXSUM dataset; DFSORT does not support XSUM. So if you need both a “unique” file and a “duplicates” file, use ICETOOL SELECT with NODUPS (unique records to one file) and ALLDUPS (all duplicate occurrences to another file). See the “Removing duplicates into separate files” tutorial.
Imagine many cards with names. You want one card per name. First you put all the cards in order by name (that’s the sort). Then you look at the pile: whenever you see two or more cards with the same name in a row, you keep only the first and throw the rest away. That’s SUM FIELDS=NONE. The “name” is the key. You have to sort by name first so the same names are next to each other. If you don’t sort by name, the same name might be in different places and you won’t know which to throw away. EQUALS means “when two names are the same, don’t mix their order”—so the one you keep is the one that was first in line. If you also want a separate pile of the cards you threw away, SUM can’t do that; you need a different tool (ICETOOL SELECT) that puts “extra” names in another box.
1. What must you do before using SUM FIELDS=NONE to remove duplicates?
2. Which record does SUM FIELDS=NONE keep when there are duplicates?
3. How do you remove duplicates based on more than one field (e.g. employee ID and date)?
4. If you need both a file of unique records and a file of the duplicate records, can you use only SUM FIELDS=NONE?
5. What is the difference between EQUALS and NOEQUALS when removing duplicates?