MainframeMaster

Data Cleansing Pipelines

Data cleansing means improving data quality so that downstream systems and reports get consistent, valid data. A data cleansing pipeline is a set of operations—often one or more DFSORT steps—that filter out or isolate bad records, normalize formats (case, padding, dates), fix known bad values, and optionally remove duplicates. This page explains how to design and implement such pipelines with DFSORT: filtering with INCLUDE/OMIT, normalizing with INREC and OUTREC, using FINDREP for replacements, validating dates and ranges, deduplicating with SUM, and splitting output into valid and reject files with OUTFIL.


What Is Data Cleansing?

Data cleansing (or cleaning) is the process of detecting and correcting (or removing) invalid, inconsistent, or duplicate data. On the mainframe, source data often has mixed case, inconsistent padding, invalid or out-of-range dates, wrong codes, or duplicate keys. A cleansing pipeline applies rules to: (1) filter—drop or isolate records that fail validation; (2) normalize—convert to a single format (e.g. uppercase, leading zeros, YYYYMMDD); (3) fix—replace known bad values with correct ones; (4) deduplicate—keep one record per key. DFSORT can do all of this in one step or across several steps, and can write both a "clean" and a "reject" file so you can audit what was dropped.

Cleansing steps and DFSORT tools
Step         | DFSORT tool  | Purpose
Filter       | INCLUDE/OMIT | Drop or isolate invalid records (bad codes, blank keys, out-of-range dates)
Normalize    | INREC/OUTREC | Standardize case, padding, date format, field lengths
Fix values   | FINDREP      | Replace known bad strings or codes with correct values
Deduplicate  | SUM          | Keep one record per key (optional aggregation)
Split output | OUTFIL       | Write valid to one file, reject to another

Filtering Invalid Records

The first layer of cleansing is to exclude records that should not be processed. Use INCLUDE to keep only records that meet your rules, or OMIT to drop records that fail. Examples: OMIT records with blank key fields; INCLUDE only records where a code is in a valid list (or in a valid range); OMIT records where a date field is out of range (e.g. before 1900 or after 2099). You can combine conditions with AND/OR. The filtered-out records are discarded unless you use a separate OUTFIL with the inverse condition to write them to a reject file. So you can keep "good" records in one output and "bad" in another for review or correction.
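A minimal sketch of the filtering layer. The positions are assumptions for illustration only: a 20-byte key at position 1 and a 3-byte status code at position 25 with hypothetical valid values ACT and PND; adjust to your layout.

* Keep only records whose status code is one of the valid values
  INCLUDE COND=(25,3,CH,EQ,C'ACT',OR,25,3,CH,EQ,C'PND')
* Alternatively, drop blank-key records (a step takes INCLUDE or OMIT,
* not both)
*  OMIT COND=(1,20,CH,EQ,C' ')
  SORT FIELDS=COPY

Records that fail the INCLUDE simply never reach the rest of the step; the OUTFIL split later on this page shows how to keep them in a reject file instead.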

Normalizing Format

After (or while) filtering, normalize the data so that every field has a consistent format. Use INREC if the sort key, SUM, or an OUTFIL INCLUDE/OMIT must see the normalized data (the step-level INCLUDE/OMIT runs before INREC, so it still sees the original values); use OUTREC if only the final output needs it. Common normalizations: (1) Case—convert all text to uppercase (or lowercase) with TRAN=LTOU (or TRAN=UTOL) so sorts and comparisons are consistent. (2) Padding—ensure numeric fields have leading zeros (e.g. 007 not 7) and character fields have trailing spaces so lengths and sort order are correct. (3) Dates—convert all date fields to one format (e.g. YYYYMMDD) using INREC or OUTREC date conversion so range checks and sorts work. The data normalization tutorial covers these in detail.
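A minimal normalization sketch using INREC OVERLAY, assuming a 10-byte name field at position 29 and a 3-byte numeric quantity at position 39; the positions and lengths are illustrative only.

* Uppercase the name in place and rewrite the quantity with leading zeros
  INREC OVERLAY=(29:29,10,TRAN=LTOU,
                 39:39,3,UFF,EDIT=(TTT))
  SORT FIELDS=COPY

OVERLAY changes only the named columns and leaves the rest of the record as-is, which keeps the statement short when most fields are already correct.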

Fixing Known Bad Values with FINDREP

When you know specific wrong values (e.g. a code that was mistyped in the source system, or a string that should be standardized), use FINDREP. In INREC or OUTREC specify FINDREP=(IN=C'old',OUT=C'new') for a single find/replace pair, or FINDREP=(INOUT=(old1,new1,old2,new2,...)) to list several pairs. Each occurrence of an IN value in the record is replaced with its OUT value. Use INREC when the corrected value must be used for sorting, SUM, or an OUTFIL INCLUDE/OMIT; use OUTREC when only the written output needs the fix. FINDREP is especially useful when legacy data has a fixed set of bad codes that you want to map to the correct ones without changing the source system.
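A sketch of FINDREP with two pairs; the code values are hypothetical, and each bad value is the same length as its replacement so the record length does not change.

* Map two known bad codes to their correct values, anywhere in the record
  INREC FINDREP=(INOUT=(C'OLDCODE',C'NEWCODE',
                        C'N/A',C'UNK'))
  SORT FIELDS=COPY

If the replacement must apply only to part of the record, STARTPOS and ENDPOS can restrict the columns that FINDREP scans.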

Validating Dates and Ranges

Date validation means ensuring date fields are in a valid range and format. Use INCLUDE with COND= to keep only records where the date is between a low and high value (e.g. GE low and LE high for YYYYMMDD). If the source date is in another format, convert it in INREC to a standard format at a fixed position and apply the range check afterwards, for example on an OUTFIL INCLUDE (the step-level INCLUDE runs before INREC, so it would still see the original format). Invalid dates (e.g. month 13) may cause conversion to produce wrong values or abends; filtering to a valid range reduces that risk. The date validation tutorial has more on range conditions and conversion order.
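A sketch of the range check, assuming the date is already a character YYYYMMDD value at positions 21-28; the bounds are the example range from the filtering section.

* Keep only dates from 1 Jan 1900 through 31 Dec 2099
  INCLUDE COND=(21,8,CH,GE,C'19000101',AND,
                21,8,CH,LE,C'20991231')
  SORT FIELDS=COPY

Because YYYYMMDD compares correctly as character data, a plain CH comparison is enough once the format is consistent.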

Deduplicating with SUM

After filtering and normalizing, you often want one record per key (or per key and selected fields). SUM works together with SORT: records with equal sort keys are collapsed to one. SUM FIELDS=NONE keeps one record per key and discards the other duplicates; SUM FIELDS=(position,length,format,...) also totals the named numeric fields across the duplicates. So the pipeline can be: INCLUDE/OMIT and INREC (normalize), then SORT FIELDS= on the key, then SUM FIELDS=NONE (or a list of numeric fields to total). The result is a clean, deduplicated file. See the SUM and deduplication strategies tutorials for syntax and options (e.g. handling overflow).
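A sketch of the dedup stage, assuming a 20-byte key at position 1 and a 5-byte packed-decimal amount at position 45; both positions are illustrative.

* One output record per key; amounts of duplicate keys are added together
  SORT FIELDS=(1,20,CH,A)
  SUM FIELDS=(45,5,PD)
* To drop duplicates without totalling anything, use SUM FIELDS=NONE;
* add OPTION EQUALS if the retained record must be the first one read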

Splitting Valid and Reject Output

To keep both clean and reject records, use OUTFIL with two (or more) outputs. One OUTFIL has INCLUDE=(your valid condition) and writes to the clean file DD; another has OMIT=(same condition) and writes to the reject file DD. So in one pass you produce a clean dataset for downstream use and a reject dataset for audit or reprocessing. You can also use different BUILD= or FIELDS= on each OUTFIL to format the two outputs differently (e.g. include reason codes only on the reject file).
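A sketch of the one-pass split, reusing the date range from the validation section. CLEAN and REJECT are DD names you would define in the JCL, and the 80-byte input record length assumed by the reject BUILD is an assumption.

  SORT FIELDS=(1,20,CH,A)
* Records that pass the date check go to the CLEAN DD
  OUTFIL FNAMES=CLEAN,
    INCLUDE=(21,8,CH,GE,C'19000101',AND,21,8,CH,LE,C'20991231')
* The same condition with OMIT sends everything else to REJECT,
* with a short reason literal added after the original 80 bytes
  OUTFIL FNAMES=REJECT,
    OMIT=(21,8,CH,GE,C'19000101',AND,21,8,CH,LE,C'20991231'),
    BUILD=(1,80,C' BAD DATE')

In one pass you get a clean dataset at its original length and a reject dataset carrying a reason tag; the record counts reported for each OUTFIL in the run messages help with the count check described under best practices.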

Example: single-step cleansing

Read the input (SORTIN). OMIT records with a blank key (e.g. OMIT COND=(1,20,CH,EQ,C' ')). INREC: normalize the date at 21,8 to YYYYMMDD and uppercase the field at 29,10. SORT FIELDS=(1,20,CH,A). SUM FIELDS=NONE to deduplicate on the key. OUTFIL FNAMES=CLEAN for the main output. Note that records dropped by the step-level OMIT never reach OUTFIL; to capture them in a reject file, move the blank-key test to the OUTFIL level instead (OMIT on the CLEAN OUTFIL, INCLUDE with the same condition on a REJECT OUTFIL). Control statements and positions depend on your layout.

  OMIT COND=(1,20,CH,EQ,C' ')
  INREC BUILD=(1,20,21,8,29,10,TRAN=LTOU,...)
  SORT FIELDS=(1,20,CH,A)
  SUM FIELDS=NONE
  OUTFIL FNAMES=CLEAN,BUILD=(...)
* To also get a REJECT file, remove the OMIT above and code instead:
*   OUTFIL FNAMES=CLEAN,OMIT=(1,20,CH,EQ,C' '),BUILD=(...)
*   OUTFIL FNAMES=REJECT,INCLUDE=(1,20,CH,EQ,C' '),BUILD=(...)

Multi-Step Cleansing

When one step is too complex or you want to separate concerns, use multiple steps. Step 1: filter and normalize, write to &&CLEAN1. Step 2: read &&CLEAN1, apply FINDREP or additional logic, sort and SUM, write clean output and optionally reject. Each step has its own SYSIN and can be tested independently. Use temporary datasets (&&) between steps to avoid permanent intermediates unless you need them for restart or audit.
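A minimal two-step JCL sketch of that flow. The dataset names, field positions, and space figures are placeholders, and your site may invoke the sort program under a different name or procedure.

//CLEANSE1 EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=MY.RAW.INPUT,DISP=SHR
//SORTOUT  DD DSN=&&CLEAN1,DISP=(NEW,PASS),
//            UNIT=SYSDA,SPACE=(CYL,(10,5),RLSE)
//SYSIN    DD *
  OMIT COND=(1,20,CH,EQ,C' ')
  INREC OVERLAY=(29:29,10,TRAN=LTOU)
  SORT FIELDS=COPY
/*
//CLEANSE2 EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=&&CLEAN1,DISP=(OLD,DELETE)
//SORTOUT  DD DSN=MY.CLEAN.OUTPUT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(10,5),RLSE)
//SYSIN    DD *
  INREC FINDREP=(INOUT=(C'OLDCODE',C'NEWCODE'))
  SORT FIELDS=(1,20,CH,A)
  SUM FIELDS=NONE
/*

Step 1 only filters and normalizes (SORT FIELDS=COPY) because the real sort happens in step 2; the &&CLEAN1 temporary dataset is passed between the steps and deleted when step 2 ends.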

Best Practices

  • Filter first when possible—drop invalid records early so you do not waste work normalizing or sorting them (unless you need to write them to a reject file).
  • Normalize before dedup—so that "ABC" and "abc" become the same and SUM can collapse them.
  • Document your rules—which INCLUDE/OMIT conditions and valid ranges you use, so others can maintain the pipeline.
  • Capture rejects when needed—use OUTFIL to write reject records so you can count and correct them.
  • Validate record counts—compare input, clean, and reject counts to ensure nothing is lost or duplicated unexpectedly.

Explain It Like I'm Five

Cleansing is like cleaning your toys before putting them away: you throw away broken ones (filter), make sure each toy looks the same (normalize—same label, same box), fix the ones with the wrong sticker (FINDREP), and put only one of each kind in the drawer (dedup). DFSORT does all of that. You can even have two boxes: one for good toys (clean file) and one for broken ones (reject file) so Mom can see what you threw out.

Exercises

  1. Design a one-step cleansing: OMIT blank key, INREC uppercase a 10-byte name field, SORT by key, SUM to dedup. List the control statements in order.
  2. Add a reject file: same step but OUTFIL1 = records that passed (e.g. non-blank key), OUTFIL2 = records that were omitted. What INCLUDE/OMIT do you use on each OUTFIL?
  3. When would you use FINDREP in INREC instead of OUTREC? Give an example.
  4. Your cleansing step must keep only records with a date (positions 21–28) between 20200101 and 20301231. Write the INCLUDE COND= (assume CH format).

Quiz

Test Your Knowledge

1. What is a data cleansing pipeline in a DFSORT context?

  • Only sorting
  • A sequence of operations (often one or more DFSORT steps) that filter invalid records, normalize formats (case, padding, dates), fix known bad values (e.g. FINDREP), and optionally deduplicate—so the output is clean for downstream use
  • Only INCLUDE
  • Only OUTFIL

2. How can you produce both a "clean" and a "reject" file in one DFSORT step?

  • Only two steps
  • Use OUTFIL with two (or more) outputs: one OUTFIL with INCLUDE for records that pass your rules (clean), another OUTFIL with OMIT (or the inverse condition) for records that fail (reject). Both write to different DDs
  • Only SUM
  • FINDREP only

3. Why normalize data (e.g. uppercase, fixed padding) during cleansing?

  • Only for display
  • So that downstream sorts, comparisons, and programs see a consistent format; mixed case or inconsistent padding causes wrong sort order, failed matches, and duplicates that go undetected because the values differ only in case or padding
  • Only for reports
  • DFSORT cannot normalize

4. When would you use FINDREP in a cleansing pipeline?

  • Only for dates
  • When you have known bad values to fix: e.g. a code that was mistyped (replace OLDCODE with NEWCODE), or a string that should be standardized (replace multiple spaces with one). Use INREC if the fix must affect the sort, SUM, or OUTFIL filtering; OUTREC if only the written output needs the fix
  • Only in OUTREC
  • Only for dedup

5. How does SUM help in a cleansing pipeline?

  • Only for totals
  • SUM can remove duplicates: keep one record per key (or per key and selected fields) and optionally sum numeric fields. So after filtering and normalizing, a SUM step (or SUM in the same step) can deduplicate the cleansed data
  • Only for JOINKEYS
  • SUM does not help cleansing