MainframeMaster

Removing Duplicates with SUM

Removing duplicates means keeping one record per unique key and discarding the rest. In DFSORT the standard way to do this is to sort by the key that defines “duplicate,” then use SUM FIELDS=NONE. SUM runs after the sort and collapses every group of records with the same key into one record. With SUM FIELDS=NONE no numeric fields are summed—only duplicate elimination happens. Which record is kept (first or last in the group) depends on sort order and on OPTION EQUALS or NOEQUALS. This page focuses on using SUM specifically for deduplication: choosing the key, single vs composite key, EQUALS vs NOEQUALS, and when to use SUM versus ICETOOL SELECT (e.g. when you need the duplicate records in a separate file).

SUM Statement
Progress0 of 0 lessons

Why the Sort Key Must Be the Duplicate Key

SUM works on the sorted stream. It looks at consecutive records and treats records with the same sort key as one group. So the “duplicate key” must be exactly what you sort by. If you sort by employee ID, then two records with the same employee ID are adjacent and SUM collapses them to one. If you sort by something else (e.g. date), then two records with the same employee ID might not be adjacent, and SUM would not treat them as duplicates. So always set SORT FIELDS= to the field or combination of fields that define a duplicate.

Basic Syntax: One Key

For a single key (e.g. employee ID in positions 1–10), sort by that key in ascending or descending order, then add SUM FIELDS=NONE. Ascending (A) keeps the “smallest” first in the group; descending (D) keeps the “largest” first. Often we want “first occurrence” in the original file order—that usually means sort the key A and use EQUALS so the first record in the group is the one that was first in the input (after the sort).

Example: duplicate key in positions 1–10

text
1
2
SORT FIELDS=(1,10,CH,A) SUM FIELDS=NONE

Records are sorted by the first 10 bytes. Every group of records with the same 10-byte value is reduced to one record. That one record is typically the first of the group when OPTION EQUALS is in effect.

Composite Key: Multiple Fields Define a Duplicate

When “duplicate” is defined by more than one field (e.g. employee ID and date), list all of them in SORT FIELDS=. The order of fields in SORT FIELDS= determines the major and minor sort order; for deduplication what matters is that records with the same combination are adjacent. So SORT FIELDS=(1,8,CH,A,9,8,CH,A) sorts by (positions 1–8, then 9–16). Records with the same (field1, field2) are adjacent; SUM FIELDS=NONE keeps one per (field1, field2).

Example: duplicate = ID + date

text
1
2
SORT FIELDS=(1,5,CH,A,10,3,CH,A) SUM FIELDS=NONE

Duplicates are defined by the 5-byte field at 1 and the 3-byte field at 10. One record per unique (ID, date) is kept.

EQUALS vs NOEQUALS: Which Record Is Kept

When two records have the same sort key, EQUALS tells DFSORT to preserve their relative order (so the “first” in the group is well-defined). NOEQUALS allows the sort to reorder equal keys for performance, so you cannot rely on first or last. For removing duplicates, if you care which record is kept (e.g. “keep the first by timestamp”), use OPTION EQUALS and order the sort so the one you want is first—e.g. sort by (key, timestamp,A) so the earliest timestamp is first, then SUM FIELDS=NONE keeps that record. If you use NOEQUALS, which record survives is implementation-dependent.

Key points for SUM deduplication
TopicDetail
Sort keyMust match the duplicate key so equal keys are adjacent
SUM FIELDS=NONEOne record per key; no numeric summing
EQUALSKeeps first of each group (predictable); use for dedup
NOEQUALSWhich record is kept may be unspecified
Two filesUse ICETOOL SELECT NODUPS/ALLDUPS, not SUM alone

Keeping the “Last” Record Instead of the First

By default (with EQUALS and ascending sort on the key), the first record in each key group is kept. To keep the last, either (1) sort the key in descending order so the “last” becomes first in the group, or (2) add a tie-breaker field (e.g. sequence or timestamp) and sort by (key, tie-breaker D) so the latest or highest is first, then SUM FIELDS=NONE. Or use ICETOOL SELECT with LAST if your product supports it. The idea is to arrange the sort order so the record you want is the first in the group when SUM runs.

You Do Not Get a Second File of Duplicates

SUM FIELDS=NONE keeps one record per key in SORTOUT and discards the rest. It does not write the discarded records anywhere. In Syncsort you can use XSUM to send those records to a SORTXSUM dataset; DFSORT does not support XSUM. So if you need both a “unique” file and a “duplicates” file, use ICETOOL SELECT with NODUPS (unique records to one file) and ALLDUPS (all duplicate occurrences to another file). See the “Removing duplicates into separate files” tutorial.

When to Use SUM vs ICETOOL SELECT

  • Use SUM FIELDS=NONE when you only need one output file with one record per key and do not need to keep or inspect the duplicate records.
  • Use ICETOOL SELECT with NODUPS/ALLDUPS when you need two outputs: one with unique records and one with the duplicate records. Use SELECT with FIRST when you want one record per key with more control (e.g. FIRST/LAST) in an ICETOOL flow.

Explain It Like I'm Five

Imagine many cards with names. You want one card per name. First you put all the cards in order by name (that’s the sort). Then you look at the pile: whenever you see two or more cards with the same name in a row, you keep only the first and throw the rest away. That’s SUM FIELDS=NONE. The “name” is the key. You have to sort by name first so the same names are next to each other. If you don’t sort by name, the same name might be in different places and you won’t know which to throw away. EQUALS means “when two names are the same, don’t mix their order”—so the one you keep is the one that was first in line. If you also want a separate pile of the cards you threw away, SUM can’t do that; you need a different tool (ICETOOL SELECT) that puts “extra” names in another box.

Exercises

  1. Your duplicate key is in positions 20–25 (character). Write SORT and SUM control statements to keep one record per key.
  2. You want one record per (customer ID 1–8, order date 9–16). Write SORT FIELDS= and SUM FIELDS=NONE.
  3. Why does OPTION NOEQUALS make “which record is kept” unpredictable? When might you still use NOEQUALS?
  4. You need unique records in one file and duplicate records in another. What do you use instead of SUM FIELDS=NONE alone?

Quiz

Test Your Knowledge

1. What must you do before using SUM FIELDS=NONE to remove duplicates?

  • Use INCLUDE only
  • Sort by the key that defines a duplicate—SORT FIELDS= must match the key so that records with the same key are adjacent
  • Use OPTION COPY
  • Nothing

2. Which record does SUM FIELDS=NONE keep when there are duplicates?

  • Always the last
  • Typically the first of each key group when EQUALS is in effect; with NOEQUALS the choice may not be guaranteed
  • A random record
  • The one with the highest value

3. How do you remove duplicates based on more than one field (e.g. employee ID and date)?

  • Use two SUM statements
  • Use a composite sort key: SORT FIELDS=(pos1,len1,format1,order, pos2,len2,format2,order) then SUM FIELDS=NONE—duplicates are defined by the combination of all key fields
  • Only one field is allowed
  • Use INCLUDE twice

4. If you need both a file of unique records and a file of the duplicate records, can you use only SUM FIELDS=NONE?

  • Yes, SUM writes duplicates to SORTXSUM
  • No—DFSORT does not support XSUM; use ICETOOL SELECT with NODUPS and ALLDUPS to get two files
  • Yes, with two OUTFILs
  • SUM always writes both

5. What is the difference between EQUALS and NOEQUALS when removing duplicates?

  • They are the same for SUM
  • EQUALS preserves the order of records with equal keys, so the “first” in the sort order is predictable; NOEQUALS does not guarantee order among equal keys, so which record SUM keeps may not be predictable
  • NOEQUALS keeps the first
  • EQUALS removes all records