Data cleansing means improving data quality so that downstream systems and reports get consistent, valid data. A data cleansing pipeline is a set of operations—often one or more DFSORT steps—that filter out or isolate bad records, normalize formats (case, padding, dates), fix known bad values, and optionally remove duplicates. This page explains how to design and implement such pipelines with DFSORT: filtering with INCLUDE/OMIT, normalizing with INREC and OUTREC, using FINDREP for replacements, validating dates and ranges, deduplicating with SUM, and splitting output into valid and reject files with OUTFIL.
Data cleansing (or cleaning) is the process of detecting and correcting (or removing) invalid, inconsistent, or duplicate data. On the mainframe, source data often has mixed case, inconsistent padding, invalid or out-of-range dates, wrong codes, or duplicate keys. A cleansing pipeline applies rules to: (1) filter—drop or isolate records that fail validation; (2) normalize—convert to a single format (e.g. uppercase, leading zeros, YYYYMMDD); (3) fix—replace known bad values with correct ones; (4) deduplicate—keep one record per key. DFSORT can do all of this in one step or across several steps, and can write both a "clean" and a "reject" file so you can audit what was dropped.
| Step | DFSORT tool | Purpose |
|---|---|---|
| Filter | INCLUDE/OMIT | Drop or isolate invalid records (bad codes, blank keys, out-of-range dates) |
| Normalize | INREC/OUTREC | Standardize case, padding, date format, field lengths |
| Fix values | FINDREP | Replace known bad strings or codes with correct values |
| Deduplicate | SUM | Keep one record per key (optional aggregation) |
| Split output | OUTFIL | Write valid to one file, reject to another |
The first layer of cleansing is to exclude records that should not be processed. Use INCLUDE to keep only records that meet your rules, or OMIT to drop records that fail. Examples: OMIT records with blank key fields; INCLUDE only records where a code is in a valid list (or a valid range); OMIT records where a date field is out of range (e.g. before 1900 or after 2099). You can combine conditions with AND/OR. Be aware that records removed by a top-level INCLUDE or OMIT statement are discarded before OUTFIL runs, so they cannot be written anywhere. If you want the "bad" records kept for review or correction, do the filtering in OUTFIL instead: one OUTFIL writes the "good" records and another with the inverse condition writes the "bad" ones, as described below under splitting output.
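As a minimal sketch of this filter layer, assuming a layout like the one in the worked example later on this page (20-byte character key at position 1, 8-byte YYYYMMDD date at 21, 10-byte status code at 29; all positions and values are illustrative):

```
* Keep records with a non-blank key and a recognized status code;
* anything else is discarded at this point.
* (A step can have one INCLUDE statement or one OMIT statement,
*  but not both.)
  INCLUDE COND=(1,20,CH,NE,C' ',AND,
                (29,10,CH,EQ,C'ACTIVE',OR,29,10,CH,EQ,C'PENDING'))
* Equivalent negative form for a single rule: drop blank-key records.
*  OMIT COND=(1,20,CH,EQ,C' ')
```

Records removed here cannot be recovered later in the same step; use the OUTFIL split technique below if you need a reject file.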
After (or while) filtering, normalize the data so that every field has a consistent format. Use INREC if the sort key, SUM, or an OUTFIL test must see the normalized data; use OUTREC if only the final output needs it. (The top-level INCLUDE/OMIT statement is processed before INREC, so it always sees the original values.) Common normalizations: (1) Case: convert all text to uppercase (or lowercase) with TRAN=LTOU (or TRAN=UTOL) so sorts and comparisons are consistent. (2) Padding: ensure numeric fields have leading zeros (e.g. 007 not 7) and character fields have trailing spaces so lengths and sort order are correct. (3) Dates: convert all date fields to one format (e.g. YYYYMMDD) using the INREC or OUTREC date conversion functions so range checks and sorts work. The data normalization tutorial covers these in detail.
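A sketch of in-place normalization under the same illustrative layout (the 5-byte numeric field at position 40 is also an assumption); INREC OVERLAY rewrites only the named columns and leaves the rest of the record alone:

```
* Uppercase the status code and re-edit the numeric field with
* leading zeros (e.g. '    7' becomes '00007') before the sort.
* Positions are illustrative.
  INREC OVERLAY=(29:29,10,TRAN=LTOU,
                 40:40,5,UFF,M11,LENGTH=5)
```

UFF extracts just the digits of the field and the M11 edit mask writes them back with leading zeros; TRAN=LTOU translates lowercase letters to uppercase. Code the same operands on OUTREC instead if only the written output needs them.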
When you know specific wrong values (e.g. a code that was mistyped in the source system, or a string that should be standardized), use FINDREP. In INREC or OUTREC specify FINDREP=(IN=value,OUT=value) for a single replacement, or FINDREP=(INOUT=(in1,out1,in2,out2,...)) for a list of pairs. DFSORT replaces every occurrence of each IN constant by default; suboperands such as DO, STARTPOS, and ENDPOS limit how many replacements are made and which columns are searched. Use INREC when the corrected value must feed the sort, SUM, or an OUTFIL INCLUDE/OMIT test; use OUTREC when only the written output needs the fix. FINDREP is especially useful when legacy data has a fixed set of bad codes that you want to map to the correct ones without changing the source system.
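A hedged sketch with made-up IN/OUT values; using replacement strings of the same length as the originals keeps the record from shifting:

```
* Map two known bad codes to standardized values anywhere in
* columns 29-38 of the output record (values are illustrative).
  OUTREC FINDREP=(INOUT=(C'N.Y.',C'NY  ',
                         C'U.K.',C'GB  '),
                  STARTPOS=29,ENDPOS=38)
```

Move the same FINDREP to INREC if the corrected values must feed the sort key or an OUTFIL test.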
Date validation means ensuring date fields are in a valid range and format. Use INCLUDE with COND= to keep only records where the date is between a low and high value (e.g. GE low and LE high for YYYYMMDD). If the source date is in another format, convert it to a standard format at a fixed position first: either in an earlier step, or in INREC with the range check moved to an OUTFIL INCLUDE (the top-level INCLUDE/OMIT statement runs before INREC, so it would see the unconverted value). Invalid dates (e.g. month 13) may cause conversion to produce wrong values or abends; filtering to a valid range reduces that risk. The date validation tutorial has more on range conditions and conversion order.
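A minimal range check, again assuming an 8-byte YYYYMMDD date at position 21 (both the position and the bounds are illustrative); the FS,EQ,NUM test, DFSORT's numeric-data test, guards against non-numeric junk before the comparison:

```
* Keep only records whose date field is all digits and falls
* between 1900-01-01 and 2099-12-31 inclusive.
  INCLUDE COND=(21,8,FS,EQ,NUM,AND,
                21,8,ZD,GE,19000101,AND,
                21,8,ZD,LE,20991231)
```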
After filtering and normalizing, you often want one record per key. Use SUM together with your SORT (or MERGE) key: records whose keys are equal are combined into a single record. SUM FIELDS=NONE keeps one record per key and discards the rest; SUM FIELDS=(p,m,f,...) additionally adds the named numeric fields (formats such as ZD, PD, FI, BI) across the duplicates. So the pipeline can be: INCLUDE/OMIT and INREC (normalize), then SORT FIELDS=(key), then SUM FIELDS=NONE (or a list of numeric fields to total). The result is a clean, deduplicated file. See the SUM and deduplication strategies tutorials for syntax and options (e.g. handling overflow).
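A sketch of the dedup stage under the same assumed layout, with a hypothetical 5-byte ZD amount at position 40:

```
* One record per 20-byte key.  SUM FIELDS=NONE just drops the
* duplicates; listing numeric fields instead would total them
* across the duplicates, e.g. SUM FIELDS=(40,5,ZD).
  SORT FIELDS=(1,20,CH,A)
  SUM FIELDS=NONE
```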
To keep both clean and reject records, use OUTFIL with two (or more) outputs. One OUTFIL has INCLUDE=(your valid condition) and writes to the clean file DD; another has OMIT=(same condition), or simply SAVE, and writes to the reject file DD (SAVE catches every record not written by any other OUTFIL). So in one pass you produce a clean dataset for downstream use and a reject dataset for audit or reprocessing. You can also use a different BUILD= (or OUTREC=) on each OUTFIL to format the two outputs differently (e.g. include reason codes only on the reject file).
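A sketch of the two-way split under the same illustrative layout (the DD names, the condition, and the reject tag are all placeholders):

```
* Records with a non-blank key and an all-numeric date go to CLEAN;
* everything else goes to REJECT with a short reason tag appended.
  OUTFIL FNAMES=CLEAN,
    INCLUDE=(1,20,CH,NE,C' ',AND,21,8,FS,EQ,NUM)
  OUTFIL FNAMES=REJECT,
    OMIT=(1,20,CH,NE,C' ',AND,21,8,FS,EQ,NUM),
    BUILD=(1,80,C' BAD KEY OR DATE')
```

The BUILD on the reject side assumes an 80-byte input record and a reject dataset whose LRECL is large enough for the appended text.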
Read the input (SORTIN). Normalize in INREC (convert the date at 21,8 to YYYYMMDD if needed, uppercase the field at 29,10 with TRAN=LTOU). SORT FIELDS=(1,20,CH,A). SUM FIELDS=NONE to deduplicate on the key. OUTFIL FNAMES=CLEAN with INCLUDE=(your valid condition, e.g. non-blank key) for the main output, and a second OUTFIL with the inverse condition (OMIT, or SAVE) and FNAMES=REJECT to capture rejects. Do the validity test in the OUTFILs rather than in a standalone OMIT statement: records dropped by a top-level OMIT never reach OUTFIL, so they could not appear in the reject file. Control statements and positions depend on your layout; a sketch:

```
* Uppercase the 10-byte field at 29 so the sort and dedup see
* consistent values (add a date conversion here if the source
* date at 21 is not already YYYYMMDD)
  INREC OVERLAY=(29:29,10,TRAN=LTOU)
  SORT FIELDS=(1,20,CH,A)
* Keep one record per key; no numeric fields are summed
  SUM FIELDS=NONE
* Non-blank keys go to CLEAN; everything else goes to REJECT
  OUTFIL FNAMES=CLEAN,
    INCLUDE=(1,20,CH,NE,C' ')
  OUTFIL FNAMES=REJECT,
    OMIT=(1,20,CH,NE,C' ')
```
When one step is too complex or you want to separate concerns, use multiple steps. Step 1: filter and normalize, write to &&CLEAN1. Step 2: read &&CLEAN1, apply FINDREP or additional logic, sort and SUM, write clean output and optionally reject. Each step has its own SYSIN and can be tested independently. Use temporary datasets (&&) between steps to avoid permanent intermediates unless you need them for restart or audit.
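A skeletal two-step job along those lines; PGM=SORT (ICEMAN at some sites), the dataset names, space figures, and the sample control statements are all placeholders:

```
//* (program name, dataset names and space figures are placeholders)
//* Step 1: filter and normalize into a passed temporary dataset
//CLEANSE1 EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PROD.RAW.INPUT,DISP=SHR
//SORTOUT  DD DSN=&&CLEAN1,DISP=(NEW,PASS),
//            UNIT=SYSDA,SPACE=(CYL,(10,5))
//SYSIN    DD *
  OMIT COND=(1,20,CH,EQ,C' ')
  INREC OVERLAY=(29:29,10,TRAN=LTOU)
  SORT FIELDS=COPY
/*
//* Step 2: fix values, sort, deduplicate, write the final output
//CLEANSE2 EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=&&CLEAN1,DISP=(OLD,DELETE)
//SORTOUT  DD DSN=PROD.CUSTOMER.CLEAN,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(10,5))
//SYSIN    DD *
  SORT FIELDS=(1,20,CH,A)
  SUM FIELDS=NONE
  OUTREC FINDREP=(IN=C'N/A ',OUT=C'    ')
/*
```

Each step's SYSIN can be changed and rerun on its own, and the intermediate data only ever lives in the passed temporary dataset.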
Cleansing is like cleaning your toys before putting them away: you throw away broken ones (filter), make sure each toy looks the same (normalize—same label, same box), fix the ones with the wrong sticker (FINDREP), and put only one of each kind in the drawer (dedup). DFSORT does all of that. You can even have two boxes: one for good toys (clean file) and one for broken ones (reject file) so Mom can see what you threw out.
1. What is a data cleansing pipeline in a DFSORT context?
2. How can you produce both a "clean" and a "reject" file in one DFSORT step?
3. Why normalize data (e.g. uppercase, fixed padding) during cleansing?
4. When would you use FINDREP in a cleansing pipeline?
5. How does SUM help in a cleansing pipeline?