How do you remove duplicates in DFSORT?

Sort by the key that defines a duplicate, then use SUM FIELDS=NONE. For example SORT FIELDS=(1,10,CH,A) SUM FIELDS=NONE. Records with the same value in the first 10 bytes are collapsed to one record per key. The key in SORT FIELDS= is the “duplicate key”; all records with the same key become one record in the output.

Does SUM FIELDS=NONE keep the first or last duplicate?

With OPTION EQUALS (common default), the first record of each key group is kept—the one that appears first in the sorted order. With OPTION NOEQUALS, which record is kept is not guaranteed. To keep the “last” record, sort in descending order on the key (or on a tie-breaker) so the one you want is first, or use ICETOOL SELECT with LAST.

Can you remove duplicates based on multiple fields?

Yes. Put all fields that define a duplicate in SORT FIELDS=. For example SORT FIELDS=(1,8,CH,A,9,8,CH,A) SUM FIELDS=NONE keeps one record per unique (field1, field2). Records with the same combination are adjacent after the sort and are collapsed to one by SUM.

How do you get the duplicate records into a separate file in DFSORT?

SUM FIELDS=NONE does not write duplicates to a second file. To get unique records in one file and duplicate records in another, use ICETOOL SELECT: one SELECT with NODUPS to the unique file and one with ALLDUPS to the duplicate file. See the “Removing duplicates into separate files” tutorial.

Why must the sort key match the duplicate key?

SUM groups records by the sort key. It processes the sorted stream and collapses consecutive records that have the same key. If the sort key is different from the field(s) you consider “duplicate,” records with the same duplicate key may not be adjacent, and SUM will not collapse them correctly. So the sort key must be the duplicate key (or include it).

DFSORT Removing Duplicates with SUM - SUM FIELDS=NONE Deduplication

Removing Duplicates with SUM

Removing duplicates means keeping one record per unique key and discarding the rest. In DFSORT the standard way to do this is to sort by the key that defines “duplicate,” then use SUM FIELDS=NONE. SUM runs after the sort and collapses every group of records with the same key into one record. With SUM FIELDS=NONE no numeric fields are summed—only duplicate elimination happens. Which record is kept (first or last in the group) depends on sort order and on OPTION EQUALS or NOEQUALS. This page focuses on using SUM specifically for deduplication: choosing the key, single vs composite key, EQUALS vs NOEQUALS, and when to use SUM versus ICETOOL SELECT (e.g. when you need the duplicate records in a separate file).

SUM Statement

Why the Sort Key Must Be the Duplicate Key

SUM works on the sorted stream. It looks at consecutive records and treats records with the same sort key as one group. So the “duplicate key” must be exactly what you sort by. If you sort by employee ID, then two records with the same employee ID are adjacent and SUM collapses them to one. If you sort by something else (e.g. date), then two records with the same employee ID might not be adjacent, and SUM would not treat them as duplicates. So always set SORT FIELDS= to the field or combination of fields that define a duplicate.

Basic Syntax: One Key

For a single key (e.g. employee ID in positions 1–10), sort by that key in ascending or descending order, then add SUM FIELDS=NONE. Ascending (A) keeps the “smallest” first in the group; descending (D) keeps the “largest” first. Often we want “first occurrence” in the original file order—that usually means sort the key A and use EQUALS so the first record in the group is the one that was first in the input (after the sort).

Example: duplicate key in positions 1–10

text

1
2
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE

Records are sorted by the first 10 bytes. Every group of records with the same 10-byte value is reduced to one record. That one record is typically the first of the group when OPTION EQUALS is in effect.

Composite Key: Multiple Fields Define a Duplicate

When “duplicate” is defined by more than one field (e.g. employee ID and date), list all of them in SORT FIELDS=. The order of fields in SORT FIELDS= determines the major and minor sort order; for deduplication what matters is that records with the same combination are adjacent. So SORT FIELDS=(1,8,CH,A,9,8,CH,A) sorts by (positions 1–8, then 9–16). Records with the same (field1, field2) are adjacent; SUM FIELDS=NONE keeps one per (field1, field2).

Example: duplicate = ID + date

text

1
2
  SORT FIELDS=(1,5,CH,A,10,3,CH,A)
  SUM FIELDS=NONE

Duplicates are defined by the 5-byte field at 1 and the 3-byte field at 10. One record per unique (ID, date) is kept.

EQUALS vs NOEQUALS: Which Record Is Kept

When two records have the same sort key, EQUALS tells DFSORT to preserve their relative order (so the “first” in the group is well-defined). NOEQUALS allows the sort to reorder equal keys for performance, so you cannot rely on first or last. For removing duplicates, if you care which record is kept (e.g. “keep the first by timestamp”), use OPTION EQUALS and order the sort so the one you want is first—e.g. sort by (key, timestamp,A) so the earliest timestamp is first, then SUM FIELDS=NONE keeps that record. If you use NOEQUALS, which record survives is implementation-dependent.

Key points for SUM deduplication
Topic	Detail
Sort key	Must match the duplicate key so equal keys are adjacent
SUM FIELDS=NONE	One record per key; no numeric summing
EQUALS	Keeps first of each group (predictable); use for dedup
NOEQUALS	Which record is kept may be unspecified
Two files	Use ICETOOL SELECT NODUPS/ALLDUPS, not SUM alone

Keeping the “Last” Record Instead of the First

By default (with EQUALS and ascending sort on the key), the first record in each key group is kept. To keep the last, either (1) sort the key in descending order so the “last” becomes first in the group, or (2) add a tie-breaker field (e.g. sequence or timestamp) and sort by (key, tie-breaker D) so the latest or highest is first, then SUM FIELDS=NONE. Or use ICETOOL SELECT with LAST if your product supports it. The idea is to arrange the sort order so the record you want is the first in the group when SUM runs.

You Do Not Get a Second File of Duplicates

SUM FIELDS=NONE keeps one record per key in SORTOUT and discards the rest. It does not write the discarded records anywhere. In Syncsort you can use XSUM to send those records to a SORTXSUM dataset; DFSORT does not support XSUM. So if you need both a “unique” file and a “duplicates” file, use ICETOOL SELECT with NODUPS (unique records to one file) and ALLDUPS (all duplicate occurrences to another file). See the “Removing duplicates into separate files” tutorial.

When to Use SUM vs ICETOOL SELECT

Use SUM FIELDS=NONE when you only need one output file with one record per key and do not need to keep or inspect the duplicate records.
Use ICETOOL SELECT with NODUPS/ALLDUPS when you need two outputs: one with unique records and one with the duplicate records. Use SELECT with FIRST when you want one record per key with more control (e.g. FIRST/LAST) in an ICETOOL flow.

Explain It Like I'm Five

Imagine many cards with names. You want one card per name. First you put all the cards in order by name (that’s the sort). Then you look at the pile: whenever you see two or more cards with the same name in a row, you keep only the first and throw the rest away. That’s SUM FIELDS=NONE. The “name” is the key. You have to sort by name first so the same names are next to each other. If you don’t sort by name, the same name might be in different places and you won’t know which to throw away. EQUALS means “when two names are the same, don’t mix their order”—so the one you keep is the one that was first in line. If you also want a separate pile of the cards you threw away, SUM can’t do that; you need a different tool (ICETOOL SELECT) that puts “extra” names in another box.

Exercises

Your duplicate key is in positions 20–25 (character). Write SORT and SUM control statements to keep one record per key.
You want one record per (customer ID 1–8, order date 9–16). Write SORT FIELDS= and SUM FIELDS=NONE.
Why does OPTION NOEQUALS make “which record is kept” unpredictable? When might you still use NOEQUALS?
You need unique records in one file and duplicate records in another. What do you use instead of SUM FIELDS=NONE alone?

Quiz

Test Your Knowledge

1. What must you do before using SUM FIELDS=NONE to remove duplicates?

Use INCLUDE only
Sort by the key that defines a duplicate—SORT FIELDS= must match the key so that records with the same key are adjacent
Use OPTION COPY
Nothing

2. Which record does SUM FIELDS=NONE keep when there are duplicates?

Always the last
Typically the first of each key group when EQUALS is in effect; with NOEQUALS the choice may not be guaranteed
A random record
The one with the highest value

3. How do you remove duplicates based on more than one field (e.g. employee ID and date)?

Use two SUM statements
Use a composite sort key: SORT FIELDS=(pos1,len1,format1,order, pos2,len2,format2,order) then SUM FIELDS=NONE—duplicates are defined by the combination of all key fields
Only one field is allowed
Use INCLUDE twice

4. If you need both a file of unique records and a file of the duplicate records, can you use only SUM FIELDS=NONE?

Yes, SUM writes duplicates to SORTXSUM
No—DFSORT does not support XSUM; use ICETOOL SELECT with NODUPS and ALLDUPS to get two files
Yes, with two OUTFILs
SUM always writes both

5. What is the difference between EQUALS and NOEQUALS when removing duplicates?

They are the same for SUM
EQUALS preserves the order of records with equal keys, so the “first” in the sort order is predictable; NOEQUALS does not guarantee order among equal keys, so which record SUM keeps may not be predictable
NOEQUALS keeps the first
EQUALS removes all records