Aggregation means grouping records by a key and computing a single value per group—for example, the sum of a numeric field, or a count. In DFSORT the main tool for this is the SUM statement with SUM FIELDS=(position, length, format,...). You first sort by the key that defines the group (SORT FIELDS=). Records with the same key are then adjacent; SUM adds the specified numeric fields across each group and outputs one record per group with those totals. Non-summed fields in the output usually come from the first record of the group. This page covers the idea of group-by and sum, the role of the sort key, which data types can be summed (PD, ZD, BI, FI, FL), summing multiple fields, overflow (VLSHRT vs NOVLSHRT), and using INREC to prepare data for aggregation.
You have many detail records (e.g. one per transaction) and you want one record per group (e.g. per customer or per department) with a total (e.g. sum of amount). That is aggregation: group by a key, then compute a value (sum, count, and in some products min/max/avg) per group. In DFSORT, the group is defined by the SORT FIELDS= key. SUM FIELDS= then specifies which numeric fields to add up. So you get one output record per unique key value, with the summed fields containing the total and other fields typically copied from the first record of the group.
There is no separate “group by” key. The key that defines the group is exactly the SORT FIELDS= (or MERGE FIELDS=) key. So you must sort by the field(s) you want to aggregate on. If you sort by customer ID, SUM groups by customer ID. If you sort by (region, department), SUM groups by (region, department). After the sort, all records with the same key are adjacent; SUM reads that stream and, for each run of identical keys, produces one record with the summed values. So the order of operations is: sort by key → SUM (group by same key, sum specified fields) → output.
Each summed field is specified as (position, length, format). Position and length are the starting column and length of the field in the record (after INREC if used). Format must be a numeric type. Allowed formats typically include:
| Format | Name | Typical use |
|---|---|---|
| PD | Packed decimal | Common for amounts; compact |
| ZD | Zoned decimal | One digit per byte; sign in last byte |
| BI | Binary | Fullword/halfword integers |
| FI | Fixed-point | Fixed-point numeric |
| FL | Floating-point | Floating-point numeric |
If you specify the wrong format (e.g. ZD for a packed field), the bytes are interpreted incorrectly and the sum will be wrong or you may get a S0C7. So the format must match how the data is actually stored.
Input: records with department at positions 30–39 and a 3-byte zoned decimal “marks” field at positions 45–47. Requirement: one record per department with the sum of marks.
12SORT FIELDS=(30,10,CH,A) SUM FIELDS=(45,3,ZD)
Records are sorted by department (30–39). For each department, the 3-byte ZD field at 45–47 is summed. The output has one record per department; the record is based on the first record of the group, with positions 45–47 replaced by the sum. So you get department-level totals in the summed field.
You can list several fields in SUM FIELDS=. Each is (position, length, format). For example SUM FIELDS=(11,5,ZD,21,4,PD) sums the 5-byte zoned decimal at 11–15 and the 4-byte packed decimal at 21–24. Both sums are computed for each group; the output record has the first group’s key and non-summed data (from the first record) and both totals in positions 11–15 and 21–24. So you can aggregate multiple numeric fields in one SUM statement.
12SORT FIELDS=(1,10,CH,A) SUM FIELDS=(11,5,ZD,21,4,PD)
One record per unique (1–10). The 5-byte ZD at 11–15 and the 4-byte PD at 21–24 are summed. Other positions (e.g. 1–10, 16–20, 25–80) come from the first record of the group.
When you sum many large numbers, the total can exceed the size of the output field (e.g. the sum does not fit in 5 bytes). Two behaviors are available. VLSHRT (variable-length short): if the sum overflows, truncate it to fit in the field so the job continues; you may lose high-order digits. NOVLSHRT: do not truncate; if the sum overflows, DFSORT terminates with an error (e.g. ICE or abend) so you can increase the field length or fix the data. Use NOVLSHRT when you need to guarantee that no overflow is silently lost. The option can be specified as OPTION NOVLSHRT or on the SUM statement (e.g. SUM FIELDS=(...),NOVLSHRT); see your manual.
123OPTION NOVLSHRT SORT FIELDS=(1,10,CH,A) SUM FIELDS=(11,5,ZD),NOVLSHRT
If the sum of the 5-byte ZD field overflows 5 bytes, the step will fail instead of truncating. Use this when the result must be correct and you prefer to fix the layout or data rather than accept a truncated total.
If the input record is not in a form that SUM can use (e.g. the “amount” is in character or signed form), use INREC to reformat or convert before the sort and SUM. INREC runs before the sort; the record that SORT and SUM see is the INREC output. So the positions and lengths in SORT FIELDS= and SUM FIELDS= refer to the record after INREC. For example, if the input has a signed character amount at 20–25, you might use INREC to convert it to ZD in a fixed position, then SORT and SUM on that layout. That way SUM receives a proper numeric field.
SUM produces aggregated records: one record per group with summed fields in the record. That is record-level aggregation. OUTFIL TRAILER1 (and TRAILER3 with SECTIONS) can add summary lines at the end of a report (or after each section) with COUNT= or TOTAL= for grand total or section total. So SUM = aggregated dataset (many groups, one row per group); TRAILER = extra line(s) with totals. You can use both: e.g. SUM to build department-level records, then OUTFIL to write a report with TRAILER1 for grand total.
Imagine you have a pile of receipts, each with a store name and an amount. You sort them so all “Store A” receipts are together and all “Store B” receipts are together. Aggregation is like having a helper take each pile: for Store A they add up all the amounts and write one line “Store A total: $500.” For Store B they do the same. So you end up with one line per store and the total for that store. In the computer, the “store” is the sort key, and SUM FIELDS= is the “add up the amounts” step. The helper has to know what kind of number is on the receipt (packed, zoned, etc.) so they add correctly—that’s the format. If the total is too big to fit in the box (overflow), VLSHRT means “chop it to fit”; NOVLSHRT means “stop and tell me so I can get a bigger box.”
1. What defines the "group" when using SUM FIELDS=(...)?
2. Which data types can you sum with SUM FIELDS=?
3. What happens to non-summed fields in the output record?
4. Why would you use INREC before SUM?
5. What is the difference between VLSHRT and NOVLSHRT when a sum overflows the field?