Data normalization means making your data consistent so that sorting, filtering, and downstream programs see a single format. In DFSORT you can normalize case (uppercase or lowercase), field lengths (padding with spaces or leading zeros), date formats (e.g. all to YYYYMMDD), and numeric formats. You can also treat deduplication as a form of normalization: one row per key. This page explains why normalization matters, how to do it with INREC and OUTREC, and when to normalize before the sort (INREC) versus only in the output (OUTREC).
Raw data often has mixed formats: "Smith" and "SMITH", dates as MM/DD/YYYY in one file and YYYYMMDD in another, or numeric codes as "7" in one record and "007" in another. If you sort or compare such data without normalizing, "Smith" and "SMITH" sort in different places, dates sort incorrectly, and "7" and "007" may not match in INCLUDE/OMIT or in a join. Normalizing means converting everything to one convention so that the sort order is correct and comparisons and downstream logic work as intended.
| Type | Goal | How |
|---|---|---|
| Case | Same case (all upper or all lower). | Translation (TR=) or UPPERCASE/LOWERCASE in INREC/OUTREC. |
| Padding | Fixed-length fields with consistent leading zeros or trailing spaces. | BUILD with edit masks, overlay, or conversion to fixed length. |
| Date format | Single date format (e.g. YYYYMMDD). | Date conversion (DATE1, JFY, etc.) in INREC/OUTREC. |
| Numeric format | Consistent numeric representation (e.g. zoned, packed, length). | Edit masks, conversion, BUILD with proper format and length. |
| Deduplication | One row per key or one copy of each record. | SUM FIELDS=NONE or SUM with keys; often after normalizing key fields. |
Converting all letters to uppercase (or all to lowercase) ensures that "Alice", "ALICE", and "alice" are treated the same for sorting and matching. In DFSORT you use translation: a translate table that maps lowercase to uppercase (or the reverse), or a built-in option such as UPPERCASE. Apply it to the relevant field(s) in INREC or OUTREC. If you sort by name or use INCLUDE COND= on name, do the conversion in INREC so the sort and filter see the normalized value. If you only need the output file to have consistent case, you can do it in OUTREC.
Numeric codes (e.g. 7, 12, 123) sort incorrectly as character if one is "7" and another is "007"—character sort puts "7" after "123". Normalize by converting to a fixed length with leading zeros: 007, 012, 123. Use BUILD with an edit mask or numeric conversion that produces a fixed-length field with leading zeros. For character fields that should be a fixed length (e.g. 10 bytes), use BUILD to place the field and pad with spaces so that short values are trailing-space padded. That way every record has the same layout and sort order is consistent.
Dates in different formats (MMDDYYYY, DD-MM-YYYY, YYYYMMDD, Julian) do not sort or compare correctly when mixed. Convert all date fields to one format (typically YYYYMMDD) in INREC or OUTREC using DFSORT date conversion. Then SORT FIELDS= on the date position and INCLUDE/OMIT conditions (e.g. date range) work correctly. See the date conversion, Julian dates, and century windowing tutorials for the exact control statement syntax.
Normalizing numeric format means representing numbers in a consistent way: same length (e.g. 5 digits), same type (e.g. zoned decimal), and possibly same edit (e.g. with leading zeros). Use edit masks or conversion in BUILD so that numeric fields have a fixed length and format. That avoids mismatches when comparing or when joining with another file that expects a standard format.
Removing duplicates (SUM FIELDS=NONE or SUM with keys) makes the dataset “normal” in the sense that you have at most one record per key (or one copy of each record). It is often done after normalizing the key fields: for example, convert names to uppercase and dates to YYYYMMDD in INREC, then sort and use SUM so that duplicates are identified on the normalized key. That way "Alice" and "ALICE" are treated as the same and only one is kept.
INREC runs before the sort and before INCLUDE/OMIT. Use INREC when the normalized value must be used for sorting (SORT FIELDS=) or filtering (INCLUDE/OMIT). For example, normalize dates to YYYYMMDD in INREC so that the sort key is the normalized date. OUTREC runs after the sort when building the output record. Use OUTREC when you only need the written output to have consistent format and the sort/filter logic use the original data. You can also use both: normalize in INREC for sort/filter and refine or reformat again in OUTREC for output.
Input has a 10-byte name at 1–10 and a 2-byte code at 11–12. You want name in uppercase and code as 3 bytes with leading zeros (e.g. 7 → 007). Do both in INREC so the sort sees normalized data.
12INREC BUILD=(1,10,UPPERCASE,13,3,11,2,ZDF) SORT FIELDS=(1,10,CH,A,13,3,CH,A)
Syntax is illustrative; exact keywords (UPPERCASE, ZDF for zoned with leading zeros) depend on your DFSORT release. The idea: BUILD builds a record with the name converted to uppercase and the code converted to 3-byte zoned with leading zeros, then SORT FIELDS sorts by that normalized record.
Imagine you have a list of names and some are in big letters and some in small letters. If you sort the list, the big and small letter versions of the same name end up in different places. Normalizing is like deciding: we’ll write every name in big letters first. Then when we sort, all the same names are together. Same for numbers: if sometimes we write "7" and sometimes "007", we decide to always use three digits (007, 012, 123) so they line up and sort right. DFSORT does that: it fixes the way the data looks before sorting or writing it out.
1. What is data normalization in a DFSORT context?
2. How can you normalize text to uppercase in DFSORT?
3. Why normalize dates to one format (e.g. YYYYMMDD)?
4. What is padding normalization?
5. When should you normalize in INREC vs OUTREC?