MainframeMaster

Aggregation Basics

Aggregation means grouping records by a key and computing a single value per group—for example, the sum of a numeric field, or a count. In DFSORT the main tool for this is the SUM statement with SUM FIELDS=(position, length, format,...). You first sort by the key that defines the group (SORT FIELDS=). Records with the same key are then adjacent; SUM adds the specified numeric fields across each group and outputs one record per group with those totals. Non-summed fields in the output usually come from the first record of the group. This page covers the idea of group-by and sum, the role of the sort key, which data types can be summed (PD, ZD, BI, FI, FL), summing multiple fields, overflow (VLSHRT vs NOVLSHRT), and using INREC to prepare data for aggregation.

SUM Statement
Progress0 of 0 lessons

What Aggregation Means

You have many detail records (e.g. one per transaction) and you want one record per group (e.g. per customer or per department) with a total (e.g. sum of amount). That is aggregation: group by a key, then compute a value (sum, count, and in some products min/max/avg) per group. In DFSORT, the group is defined by the SORT FIELDS= key. SUM FIELDS= then specifies which numeric fields to add up. So you get one output record per unique key value, with the summed fields containing the total and other fields typically copied from the first record of the group.

The Sort Key Is the Group Key

There is no separate “group by” key. The key that defines the group is exactly the SORT FIELDS= (or MERGE FIELDS=) key. So you must sort by the field(s) you want to aggregate on. If you sort by customer ID, SUM groups by customer ID. If you sort by (region, department), SUM groups by (region, department). After the sort, all records with the same key are adjacent; SUM reads that stream and, for each run of identical keys, produces one record with the summed values. So the order of operations is: sort by key → SUM (group by same key, sum specified fields) → output.

SUM FIELDS= Syntax and Numeric Formats

Each summed field is specified as (position, length, format). Position and length are the starting column and length of the field in the record (after INREC if used). Format must be a numeric type. Allowed formats typically include:

Numeric formats for SUM FIELDS=
FormatNameTypical use
PDPacked decimalCommon for amounts; compact
ZDZoned decimalOne digit per byte; sign in last byte
BIBinaryFullword/halfword integers
FIFixed-pointFixed-point numeric
FLFloating-pointFloating-point numeric

If you specify the wrong format (e.g. ZD for a packed field), the bytes are interpreted incorrectly and the sum will be wrong or you may get a S0C7. So the format must match how the data is actually stored.

Example: Sum One Field by Department

Input: records with department at positions 30–39 and a 3-byte zoned decimal “marks” field at positions 45–47. Requirement: one record per department with the sum of marks.

text
1
2
SORT FIELDS=(30,10,CH,A) SUM FIELDS=(45,3,ZD)

Records are sorted by department (30–39). For each department, the 3-byte ZD field at 45–47 is summed. The output has one record per department; the record is based on the first record of the group, with positions 45–47 replaced by the sum. So you get department-level totals in the summed field.

Summing Multiple Fields

You can list several fields in SUM FIELDS=. Each is (position, length, format). For example SUM FIELDS=(11,5,ZD,21,4,PD) sums the 5-byte zoned decimal at 11–15 and the 4-byte packed decimal at 21–24. Both sums are computed for each group; the output record has the first group’s key and non-summed data (from the first record) and both totals in positions 11–15 and 21–24. So you can aggregate multiple numeric fields in one SUM statement.

Example: sum two fields

text
1
2
SORT FIELDS=(1,10,CH,A) SUM FIELDS=(11,5,ZD,21,4,PD)

One record per unique (1–10). The 5-byte ZD at 11–15 and the 4-byte PD at 21–24 are summed. Other positions (e.g. 1–10, 16–20, 25–80) come from the first record of the group.

Overflow: VLSHRT and NOVLSHRT

When you sum many large numbers, the total can exceed the size of the output field (e.g. the sum does not fit in 5 bytes). Two behaviors are available. VLSHRT (variable-length short): if the sum overflows, truncate it to fit in the field so the job continues; you may lose high-order digits. NOVLSHRT: do not truncate; if the sum overflows, DFSORT terminates with an error (e.g. ICE or abend) so you can increase the field length or fix the data. Use NOVLSHRT when you need to guarantee that no overflow is silently lost. The option can be specified as OPTION NOVLSHRT or on the SUM statement (e.g. SUM FIELDS=(...),NOVLSHRT); see your manual.

Example: sum with overflow check

text
1
2
3
OPTION NOVLSHRT SORT FIELDS=(1,10,CH,A) SUM FIELDS=(11,5,ZD),NOVLSHRT

If the sum of the 5-byte ZD field overflows 5 bytes, the step will fail instead of truncating. Use this when the result must be correct and you prefer to fix the layout or data rather than accept a truncated total.

INREC Before SUM: Reformat and Convert

If the input record is not in a form that SUM can use (e.g. the “amount” is in character or signed form), use INREC to reformat or convert before the sort and SUM. INREC runs before the sort; the record that SORT and SUM see is the INREC output. So the positions and lengths in SORT FIELDS= and SUM FIELDS= refer to the record after INREC. For example, if the input has a signed character amount at 20–25, you might use INREC to convert it to ZD in a fixed position, then SORT and SUM on that layout. That way SUM receives a proper numeric field.

Aggregation vs Report Totals

SUM produces aggregated records: one record per group with summed fields in the record. That is record-level aggregation. OUTFIL TRAILER1 (and TRAILER3 with SECTIONS) can add summary lines at the end of a report (or after each section) with COUNT= or TOTAL= for grand total or section total. So SUM = aggregated dataset (many groups, one row per group); TRAILER = extra line(s) with totals. You can use both: e.g. SUM to build department-level records, then OUTFIL to write a report with TRAILER1 for grand total.

Explain It Like I'm Five

Imagine you have a pile of receipts, each with a store name and an amount. You sort them so all “Store A” receipts are together and all “Store B” receipts are together. Aggregation is like having a helper take each pile: for Store A they add up all the amounts and write one line “Store A total: $500.” For Store B they do the same. So you end up with one line per store and the total for that store. In the computer, the “store” is the sort key, and SUM FIELDS= is the “add up the amounts” step. The helper has to know what kind of number is on the receipt (packed, zoned, etc.) so they add correctly—that’s the format. If the total is too big to fit in the box (overflow), VLSHRT means “chop it to fit”; NOVLSHRT means “stop and tell me so I can get a bigger box.”

Exercises

  1. You have records with customer ID at 1–8 and amount (4-byte PD) at 21–24. Write SORT and SUM to get one record per customer with the sum of amount.
  2. Sum two fields: a 5-byte ZD at 30 and a 4-byte PD at 40. Sort by bytes 1–10. Write the control statements.
  3. Why would you specify NOVLSHRT when summing? What could go wrong if you use VLSHRT and the sum overflows?
  4. If the input has a 6-byte character field that looks like a signed number (e.g. -02963), can you sum it directly with SUM FIELDS=(...,6,ZD)? What might you do instead?

Quiz

Test Your Knowledge

1. What defines the "group" when using SUM FIELDS=(...)?

  • The SUM FIELDS positions
  • The SORT FIELDS= key—records with the same sort key are in the same group and are summed together
  • INREC only
  • The first record

2. Which data types can you sum with SUM FIELDS=?

  • Only CH (character)
  • Numeric types: PD (packed decimal), ZD (zoned decimal), BI (binary), FI (fixed-point), FL (floating-point)
  • Only PD
  • Any type

3. What happens to non-summed fields in the output record?

  • They are set to zero
  • They usually come from the first record of the group (e.g. key and descriptive fields from the first record, summed fields from the total)
  • They are dropped
  • They are concatenated

4. Why would you use INREC before SUM?

  • To sort only
  • To reformat or convert data before the sort and SUM—e.g. convert character/signed data to ZD so SUM can add it correctly; positions in SUM FIELDS= then refer to the INREC output
  • To add headers
  • INREC is after SUM

5. What is the difference between VLSHRT and NOVLSHRT when a sum overflows the field?

  • They are the same
  • VLSHRT truncates the result to fit in the field (you may lose digits); NOVLSHRT causes an error/abend so you can fix the field size or data
  • NOVLSHRT truncates
  • Only VLSHRT is valid