What is data normalization in DFSORT?

Data normalization in DFSORT means making data consistent: same character case (uppercase/lowercase), same field lengths (padding with spaces or zeros), same date format (e.g. YYYYMMDD), or same numeric format. You use INREC or OUTREC with translation, BUILD, edit masks, and date conversion so that downstream processing sees a uniform format.

How do I convert a field to uppercase in DFSORT?

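Use INREC or OUTREC with the translate function: p,m,TRAN=LTOU converts the lowercase letters in the field at position p with length m to uppercase (TRAN=UTOL does the reverse). Confirm that your DFSORT release supports it. Apply it in INREC if the sort should use the uppercase value, or in OUTREC if only the output needs it. The INCLUDE and OMIT statements are processed before INREC, so to filter on the uppercase value use OUTFIL INCLUDE= or OMIT= instead.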

How do I normalize date formats in DFSORT?

Use date conversion in INREC or OUTREC to convert various input date formats to one output format (e.g. YYYYMMDD). DFSORT supports formats such as Julian, Gregorian, and various separators. Convert all date fields to the same format so that sorting and comparison work correctly. See the date conversion and century windowing tutorials for details.

How do I pad a numeric field with leading zeros in DFSORT?

Use BUILD or OVERLAY with an edit mask or conversion that produces fixed-length numeric output with leading zeros. For example, convert a 2-byte numeric field to a 5-byte field with leading zeros so that 7 becomes 00007; one way is an item such as p,2,UFF,EDIT=(TTTTT), where p is the field position. Check the exact keywords against the documentation for your release. This ensures consistent sort order and field length.

Is removing duplicates a form of normalization?

Yes. Deduplication with SUM FIELDS=NONE makes the dataset “normal” in the sense that records with duplicate sort keys are reduced to one. So the output has a consistent rule: one row per key or, if the whole record is the sort key, one row per distinct record. It is often combined with other normalization (case, date, padding) so that duplicates are detected on normalized data.

Data Normalization in DFSORT - Standardize Case, Padding, Dates

Data normalization means making your data consistent so that sorting, filtering, and downstream programs see a single format. In DFSORT you can normalize case (uppercase or lowercase), field lengths (padding with spaces or leading zeros), date formats (e.g. all to YYYYMMDD), and numeric formats. You can also treat deduplication as a form of normalization: one row per key. This page explains why normalization matters, how to do it with INREC and OUTREC, and when to normalize before the sort (INREC) versus only in the output (OUTREC).

Why Normalize?

Raw data often has mixed formats: "Smith" and "SMITH", dates as MM/DD/YYYY in one file and YYYYMMDD in another, or numeric codes as "7" in one record and "007" in another. If you sort or compare such data without normalizing, "Smith" and "SMITH" sort in different places, dates sort incorrectly, and "7" and "007" may not match in INCLUDE/OMIT or in a join. Normalizing means converting everything to one convention so that the sort order is correct and comparisons and downstream logic work as intended.

Types of Normalization

Common normalization goals and how to achieve them in DFSORT
Type	Goal	How
Case	Same case (all upper or all lower).	Translation (TRAN=LTOU or TRAN=UTOL) in INREC/OUTREC.
Padding	Fixed-length fields with consistent leading zeros or trailing spaces.	BUILD with edit masks, overlay, or conversion to fixed length.
Date format	Single date format (e.g. YYYYMMDD).	Date conversion (e.g. Y4W input with TOGREG=Y4T) or field rearrangement in INREC/OUTREC.
Numeric format	Consistent numeric representation (e.g. zoned, packed, length).	Edit masks, conversion, BUILD with proper format and length.
Deduplication	One row per key or one copy of each record.	SUM FIELDS=NONE on the sort key; often after normalizing key fields.

Case Normalization

Converting all letters to uppercase (or all to lowercase) ensures that "Alice", "ALICE", and "alice" are treated the same for sorting and matching. In DFSORT you use translation on the relevant field(s) in INREC or OUTREC: TRAN=LTOU maps lowercase to uppercase, TRAN=UTOL maps uppercase to lowercase, and TRAN=ALTSEQ applies your own translate table. If you sort by name, do the conversion in INREC so the sort sees the normalized value. Note that the INCLUDE and OMIT statements test the input record before INREC runs; to filter on the normalized name, use OUTFIL INCLUDE= or OMIT=. If you only need the output file to have consistent case, you can do it in OUTREC.
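A minimal sketch of case normalization ahead of a sort, assuming a fixed-length record with the name in columns 1-10 (the layout is an assumption for illustration) and a DFSORT level that supports the translate functions:

```text
* Assumed layout: name in columns 1-10.
* OVERLAY rewrites the name in place and leaves other columns as-is.
  INREC OVERLAY=(1:1,10,TRAN=LTOU)
  SORT FIELDS=(1,10,CH,A)
```

With these statements "Alice", "ALICE", and "alice" all reach the sort as ALICE and collate together.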

Padding and Field Length

Numeric codes stored as character data (e.g. 7, 12, 123) sort incorrectly when their lengths differ: a left-justified "7" sorts after "123" because character comparison works byte by byte from the left, and "7" does not match "007" in a comparison. Normalize by converting to a fixed length with leading zeros: 007, 012, 123. Use BUILD with an edit mask or numeric conversion that produces a fixed-length field with leading zeros. For character fields that should be a fixed length (e.g. 10 bytes), use BUILD to place the field and pad it with trailing spaces. That way every record has the same layout and sort order is consistent.
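As a sketch of both kinds of padding, assume a 4-byte name in columns 1-4 and a 2-byte code in columns 5-6 that holds digits with or without blanks (a hypothetical layout chosen for illustration). The statements below pad the name to 10 bytes and expand the code to three digits:

```text
* Assumed input: name in 1-4, numeric code in 5-6.
* 1,4 copies the name; 6X appends six blanks (name is now 1-10).
* UFF reads the digits of the code; EDIT=(TTT) writes 3 digits.
  INREC BUILD=(1,4,6X,5,2,UFF,EDIT=(TTT))
  SORT FIELDS=(11,3,CH,A)
```

A code of 7 comes out as 007 and 12 as 012, so a character sort on columns 11-13 of the reformatted record gives numeric order. Note that SORT FIELDS refers to positions in the record that INREC built, not the original input.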

Date Format Normalization

Dates in different formats (MMDDYYYY, DD-MM-YYYY, YYYYMMDD, Julian) do not sort or compare correctly when mixed. Convert all date fields to one format (typically YYYYMMDD) in INREC or OUTREC using DFSORT date conversion or by rearranging the year, month, and day pieces. Then SORT FIELDS= on the date position sorts chronologically, and date-range conditions in OUTFIL INCLUDE= or OMIT= work correctly. See the date conversion, Julian dates, and century windowing tutorials for the exact control statement syntax.
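A hedged sketch of one such conversion, assuming the date is stored as MMDDYYYY in columns 21-28 and that your DFSORT level supports the TOGREG date conversion function. The column positions are assumptions for illustration:

```text
* Assumed layout: date as MMDDYYYY in columns 21-28.
* Y4W describes the mmddccyy input; TOGREG=Y4T requests ccyymmdd.
  INREC OVERLAY=(21:21,8,Y4W,TOGREG=Y4T)
  SORT FIELDS=(21,8,CH,A)
```

Where the conversion functions are not available, the same result for this layout can be had by rearranging the pieces, for example with the items 25,4,21,2,23,2 (year, month, day) in place of the conversion item.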

Numeric Format

Normalizing numeric format means representing numbers in a consistent way: same length (e.g. 5 digits), same type (e.g. zoned decimal), and possibly same edit (e.g. with leading zeros). Use edit masks or conversion in BUILD so that numeric fields have a fixed length and format. That avoids mismatches when comparing or when joining with another file that expects a standard format.
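One possible way to bring a free-form amount into a single numeric format, assuming an 8-byte field in columns 31-38 that may contain blanks, a sign, or punctuation (the layout is assumed for illustration):

```text
* Assumed layout: free-form signed amount in columns 31-38.
* SFF extracts the sign and digits; TO=ZD,LENGTH=8 writes them
* back as an 8-byte zoned decimal value in the same columns.
  INREC OVERLAY=(31:31,8,SFF,TO=ZD,LENGTH=8)
  SORT FIELDS=(31,8,ZD,A)
```

Because SFF keeps only the sign and digits, a decimal point is dropped rather than aligned, so this approach suits fields whose values all carry the same number of decimal places.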

Deduplication as Normalization

Removing duplicates with SUM FIELDS=NONE makes the dataset “normal” in the sense that you have at most one record per sort key (or, if the whole record is the key, one copy of each record). It is often done after normalizing the key fields: for example, convert names to uppercase and dates to YYYYMMDD in INREC, then sort and use SUM FIELDS=NONE so that duplicates are identified on the normalized key. That way "Alice" and "ALICE" are treated as the same and only one is kept.
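The sequence described above might look like the following sketch, assuming the name in columns 1-10 is the whole key (an assumed layout):

```text
* Assumed layout: name in columns 1-10 is the whole key.
* EQUALS keeps equal-keyed records in their input order.
  OPTION EQUALS
  INREC OVERLAY=(1:1,10,TRAN=LTOU)
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE
```

SUM FIELDS=NONE keeps one record for each distinct sort key. With EQUALS in effect the record kept should be the first one read for that key; without it, which duplicate survives is not predictable.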

INREC vs OUTREC

DFSORT processes the INCLUDE and OMIT statements first, then INREC, then the sort, then OUTREC, and finally OUTFIL. Use INREC when the normalized value must be used for sorting (SORT FIELDS=). For example, normalize dates to YYYYMMDD in INREC so that the sort key is the normalized date. Because the INCLUDE and OMIT statements see the original input record, a filter that depends on the normalized value belongs in OUTFIL INCLUDE= or OMIT=, which run after INREC and OUTREC. OUTREC runs after the sort when building the output record. Use OUTREC when you only need the written output to have a consistent format and the sort can use the original data. You can also use both: normalize in INREC for the sort and refine or reformat again in OUTREC for output.
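Because DFSORT applies the INCLUDE and OMIT statements to the input records before INREC reformats them, one way to filter on a normalized value is to move the condition to OUTFIL. The sketch below assumes the name is in columns 1-10 and that SORTOUT is the output DD name; both are assumptions for illustration:

```text
* Assumed layout: name in columns 1-10; output DD is SORTOUT.
* INREC uppercases the name so the sort sees normalized data.
  INREC OVERLAY=(1:1,10,TRAN=LTOU)
  SORT FIELDS=(1,10,CH,A)
* OUTFIL runs last, so its INCLUDE tests the uppercased name.
  OUTFIL FNAMES=SORTOUT,INCLUDE=(1,5,CH,EQ,C'SMITH')
```

The same condition written as an INCLUDE COND= statement would test the original mixed-case name and would miss "Smith" and "smith".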

Example: Uppercase and Fixed-Length Code

Input has a 10-byte name at 1–10 and a 2-byte code at 11–12. You want name in uppercase and code as 3 bytes with leading zeros (e.g. 7 → 007). Do both in INREC so the sort sees normalized data.

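* Uppercase the name (1-10); expand the code (11-12) to 3 digits.
  INREC BUILD=(1,10,TRAN=LTOU,11,2,UFF,EDIT=(TTT))
  SORT FIELDS=(1,10,CH,A,11,3,CH,A)

The syntax is a sketch; confirm the keywords against the documentation for your DFSORT release. TRAN=LTOU translates the lowercase letters in the name to uppercase. UFF reads the digits in the 2-byte code and EDIT=(TTT) writes them as three digits with leading zeros, so the reformatted record is 13 bytes long with the code in columns 11-13. SORT FIELDS then sorts on the normalized name and code in that reformatted record.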

Explain It Like I'm Five

Imagine you have a list of names and some are in big letters and some in small letters. If you sort the list, the big and small letter versions of the same name end up in different places. Normalizing is like deciding: we’ll write every name in big letters first. Then when we sort, all the same names are together. Same for numbers: if sometimes we write "7" and sometimes "007", we decide to always use three digits (007, 012, 123) so they line up and sort right. DFSORT does that: it fixes the way the data looks before sorting or writing it out.

Exercises

Why might sorting by a date field give wrong order if dates are in mixed formats? What normalization would you apply?
You need to match records from two files on a 3-byte code. One file has "7" (1 byte) and the other has "007". What normalization step would you use in DFSORT?
If you normalize a field in OUTREC only (not INREC), can INCLUDE COND= use that normalized value? Why or why not?

Quiz

Test Your Knowledge

1. What is data normalization in a DFSORT context?

Only removing duplicates
Making data consistent: same case, padding, date format, or field lengths so that downstream programs or sorts see a uniform format
Only sorting
Only INCLUDE/OMIT

2. How can you normalize text to uppercase in DFSORT?

Only with ICETOOL
Use translation (TRAN=LTOU) in INREC/OUTREC to convert a field to uppercase before or after the sort
Only in OUTFIL
FINDREP only

3. Why normalize dates to one format (e.g. YYYYMMDD)?

Sorting does not use dates
So that sorting, comparison, and downstream programs see a single format; mixed formats (MM/DD/YYYY, DD-MM-YYYY, YYYYMMDD) sort incorrectly and are hard to compare
Only for reports
Dates cannot be converted in DFSORT

4. What is padding normalization?

Removing all spaces
Making fields a fixed length with consistent padding: e.g. leading zeros for numbers (so 7 becomes 007) or trailing spaces for text (so "AB" becomes "AB ") so that sort order and comparisons are correct
Only for VB
Only in SUM

5. When should you normalize in INREC vs OUTREC?

Always OUTREC
Use INREC when the normalized value must be used for sorting, so the sort key sees the normalized data. Use OUTREC when only the written output needs to be normalized. The INCLUDE/OMIT statements run before INREC, so filter on normalized values with OUTFIL INCLUDE=/OMIT=
Always INREC
Normalization is only in OUTFIL