MainframeMaster

Handling Duplicates

When you sort a file, many records can have the same sort key (or the same combination of keys). How DFSORT handles those duplicates depends on what you ask for: (1) Keep all—by default, every record is written to the output; duplicates stay, and their order can be preserved with OPTION EQUALS (stable sort) or left unspecified with NOEQUALS. (2) Remove duplicates—use the SUM statement to collapse records with the same control (sort) fields; with SUM FIELDS=NONE you keep one record per unique key and drop the rest. (3) Aggregate—SUM can also add up numeric fields for each key and output one record per key with totals. So "handling duplicates" means either keeping them in a defined order (EQUALS) or reducing them (SUM). This page explains when to keep duplicates, when to remove them, and how to use EQUALS and SUM.

SORT Statement Deep Dive
Progress0 of 0 lessons

Keeping All Records (Default)

A plain SORT (no SUM) keeps every record. Records with the same sort key all appear in the output. So if you have 100 records and 20 have the same key, you still get 100 records out. The order of those 20 among themselves depends on OPTION EQUALS (preserve input order) or NOEQUALS (order not guaranteed). Use this when you need every record—e.g. all transactions per customer, or a report that lists every line—and you only care that they are sorted (and optionally in stable order within each key).

Removing Duplicates with SUM

To remove duplicates and keep one record per key, use the SUM statement. You specify the same fields as the sort key as the "control" fields, and with SUM FIELDS=NONE you do not sum any numeric fields—you just collapse the group to one record. The record that is kept is typically the first in sort order for that key. So after the sort, each unique key appears once. SUM has more options (e.g. summing amounts, overflow handling); see the SUM statement tutorial. For simple de-duplication, SUM FIELDS=NONE with the appropriate control fields is the usual approach.

Order Among Duplicates: EQUALS

When you keep all duplicates, their order in the output may matter. OPTION EQUALS makes the sort stable: records with equal keys keep their input order. So if the input was in time order and you sort by department, with EQUALS the records within each department stay in time order. If you do not specify EQUALS (NOEQUALS default), the order of records with the same key is not guaranteed. So use EQUALS when downstream processing depends on the order within duplicate keys.

When to Keep vs Remove

Keep all when: you need every record (e.g. detail report, all transactions); you are only sorting for order; or you will process duplicates in a later step. Remove duplicates when: you need one row per key (e.g. unique customer list, or one record per key for a join); or you want to aggregate (SUM with numeric fields). Choose EQUALS when you keep all and care about order within keys.

Example: Keep All, Stable Order

text
1
2
SORT FIELDS=(1,10,CH,A) OPTION EQUALS

Sort by bytes 1–10; keep all records; preserve input order for records with the same key.

Example: One Record per Key

Sort by bytes 1–10 and keep one record per unique key (first in sort order):

text
1
2
SORT FIELDS=(1,10,CH,A) SUM FIELDS=NONE

Control fields for SUM must match the sort key; see SUM statement for full syntax (e.g. control field positions and lengths).

Explain It Like I'm Five

Duplicates are when two people have the same name. We can either keep everyone in line (and maybe keep the order they arrived—EQUALS) or say "one person per name" and only keep the first for each name (SUM). Keeping everyone is like a full list; keeping one per name is like a name-only list.

Exercises

  1. You need a unique list of customer IDs (one record per ID). Do you use SUM or just SORT? What SUM syntax?
  2. You need all transactions sorted by customer, and within customer by date. Do you want EQUALS? Why?
  3. What is the difference between OPTION EQUALS and SUM FIELDS=NONE for duplicate keys?

Quiz

Test Your Knowledge

1. If you only sort (no SUM), what happens to records with duplicate sort keys?

  • They are dropped
  • All are kept; they appear in the output in some order (input order if OPTION EQUALS, else implementation-dependent)
  • Only the first is kept
  • DFSORT abends

2. How do you remove duplicate keys and keep only one record per key in DFSORT?

  • OPTION NODUPS
  • SUM FIELDS=NONE—collapse records with the same control fields and output one record per group (the first or with summary fields)
  • SORT FIELDS=NONE
  • INCLUDE DUPS=1

3. What does OPTION EQUALS do for duplicate keys?

  • Removes them
  • Preserves the input order of records with equal keys in the output (stable sort)
  • Keeps only the first
  • Sorts duplicates last

4. When might you want to keep all duplicates (not use SUM)?

  • Never
  • When every record is needed—e.g. all transactions for each customer, or a report that lists every line; you just want them sorted (and maybe in stable order with EQUALS)
  • Only for character keys
  • Only when EQUALS is on

5. SUM FIELDS=NONE keeps one record per unique key. Which record is kept?

  • Random
  • Typically the first record in sort order for that key (after the sort); product behavior is defined in the SUM documentation
  • The last
  • The one with the largest second key