ETL—Extract, Transform, Load—is a standard pattern for moving and preparing data: you extract from a source, transform it (filter, reformat, aggregate, join), and load it into targets. On the mainframe, DFSORT is often the workhorse for ETL: it can do the extract (read and filter), the transform (reformat, sort, sum, join), and the load (write to one or more outputs). This page explains how DFSORT maps to ETL, how to design single-step and multi-step ETL jobs, when to use DFSORT versus an application program, and how to chain steps with temporary or permanent datasets.
In ETL, extract means reading data from one or more sources. In a DFSORT step, the source is SORTIN (or, for a JOINKEYS job, the SORTJNF1 and SORTJNF2 input files). So the extract phase is "read SORTIN." You can narrow the extract with INCLUDE or OMIT: only records that meet the condition are processed; the rest are dropped. That is still part of extract: you are extracting a subset. Transform means changing the data: reordering or building new fields (INREC, OUTREC), aggregating (SUM), joining (JOINKEYS), converting dates, or translating characters. Load means writing the result somewhere. In DFSORT, that is SORTOUT and any OUTFIL datasets. So one DFSORT step can implement a full ETL flow for a single source and one or more targets.
| Phase | DFSORT mechanism |
|---|---|
| Extract | SORTIN (read); INCLUDE/OMIT (filter) |
| Transform | INREC, OUTREC, SUM, JOINKEYS, date conversion, translation |
| Load | SORTOUT, OUTFIL (write to output datasets) |
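To make the table concrete, here is a minimal one-step sketch. The dataset names, field positions, and the status-code test are illustrative assumptions, not taken from a real system; the point is only which statement plays which ETL role.

```
//ETLSTEP  EXEC PGM=SORT
//* Extract: read the source file
//SORTIN   DD DSN=PROD.SALES.DAILY,DISP=SHR
//* Load: write the filtered, sorted, reformatted result
//SORTOUT  DD DSN=PROD.SALES.CLEAN,DISP=(,CATLG),SPACE=...
//SYSIN    DD *
* Extract (filter): keep only records with status 'A' in column 21
  INCLUDE COND=(21,1,CH,EQ,C'A')
* Transform: sort by the 10-byte key, then keep only the needed fields
  SORT FIELDS=(1,10,CH,A)
  OUTREC BUILD=(1,10,21,1,31,8)
/*
```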
Many ETL flows can be done in one step. Example: read a transaction file (SORTIN), omit invalid or test records (OMIT), normalize dates and case (INREC or OUTREC), sort by key (SORT FIELDS=), and write to a single output (SORTOUT) or to multiple outputs (OUTFIL with different INCLUDE/OMIT or BUILD). The control statements in SYSIN define the whole pipeline. Advantages: one program (DFSORT), one step, no intermediate datasets. Use this when the logic fits in SORT, INCLUDE/OMIT, INREC, OUTREC, OUTFIL, and optionally SUM or JOINKEYS.
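A sketch of that single-step pattern, splitting to two regional outputs. The dataset names, column positions, and region codes are assumptions for illustration, not a real layout.

```
//ONESTEP  EXEC PGM=SORT
//SORTIN   DD DSN=PROD.TRANS.DAILY,DISP=SHR
//EAST     DD DSN=PROD.TRANS.EAST,DISP=(,CATLG),SPACE=...
//WEST     DD DSN=PROD.TRANS.WEST,DISP=(,CATLG),SPACE=...
//SYSIN    DD *
* Extract: drop test records flagged 'T' in column 15
  OMIT COND=(15,1,CH,EQ,C'T')
* Transform: upper-case the 20-byte name field, then order by account key
  INREC OVERLAY=(31:31,20,TRAN=LTOU)
  SORT FIELDS=(1,10,CH,A)
* Load: split by the region code in column 12
  OUTFIL FNAMES=EAST,INCLUDE=(12,1,CH,EQ,C'E')
  OUTFIL FNAMES=WEST,INCLUDE=(12,1,CH,EQ,C'W')
/*
```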
When the logic is too complex for one step, or when you want to separate concerns (e.g. first step cleans, second step sorts and splits), use multiple steps. Step 1 runs DFSORT (or another program) and writes to a dataset. Step 2 uses that dataset as input (e.g. SORTIN). In JCL, you allocate a temporary dataset in step 1 with DISP=(,PASS) and reference it in step 2 with the same name (e.g. &&TEMP). The system passes the dataset from step 1 to step 2; at job end, temporaries are normally deleted. So you get a pipeline: extract and clean in step 1, transform and load in step 2, without leaving permanent intermediates unless you choose to.
Step 1: DFSORT reads raw input (SORTIN), OMITs bad records, uses INREC to normalize dates and pad fields, writes to &&CLEAN. Step 2: DFSORT reads &&CLEAN as SORTIN, SORT FIELDS= by key, OUTFIL splits by region or type, writes to final datasets. Each step has its own SYSIN. The data flow is clear and each step can be tested or changed independently.
```
//STEP1   EXEC PGM=SORT
//SORTIN   DD DSN=RAW.DATA,DISP=SHR
//SORTOUT  DD DSN=&&CLEAN,DISP=(,PASS),SPACE=...
//SYSIN    DD *
  OMIT COND=(...)
  INREC BUILD=(...)
  OPTION COPY
/*
//STEP2   EXEC PGM=SORT
//SORTIN   DD DSN=&&CLEAN,DISP=(OLD,PASS)
//SORTOUT  DD DSN=FINAL.MAIN,DISP=(,CATLG),...
//OUTFIL1  DD DSN=FINAL.SPLIT1,...
//SYSIN    DD *
  SORT FIELDS=(1,20,CH,A)
  OUTFIL FNAMES=OUTFIL1,INCLUDE=(...),BUILD=(...)
  ...
/*
```
DFSORT is strong at: filtering (INCLUDE/OMIT), sorting and merging, reformatting (INREC/OUTREC), aggregation (SUM), joining (JOINKEYS), date conversion, and simple translation. Use it when your ETL transform is mostly these operations. Use an application program (e.g. COBOL) when you need: complex business rules that vary by record type, many conditional branches, calls to databases or other systems, or logic that is easier to express in code than in control statements. Often the best design is a mix: DFSORT does the heavy bulk filter and reformat; a program does the record-by-record logic or enrichment. That keeps the pipeline simple and maintainable.
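As a sketch of that mixed design: DFSORT does the bulk filter and sort, then a program applies the record-level rules. The program name ACCTCALC and its DD names INFILE/OUTFILE are hypothetical; a real program defines its own files and would normally also need a STEPLIB and message DDs.

```
//FILTER   EXEC PGM=SORT
//SORTIN   DD DSN=PROD.ACCT.RAW,DISP=SHR
//SORTOUT  DD DSN=&&SLIM,DISP=(,PASS),SPACE=...
//SYSIN    DD *
* DFSORT does the heavy lifting: drop closed accounts, sort by account key
  OMIT COND=(25,1,CH,EQ,C'C')
  SORT FIELDS=(1,8,CH,A)
/*
//* Hypothetical COBOL program applies the per-record business rules
//ENRICH   EXEC PGM=ACCTCALC
//INFILE   DD DSN=&&SLIM,DISP=(OLD,PASS)
//OUTFILE  DD DSN=PROD.ACCT.FINAL,DISP=(,CATLG),SPACE=...
```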
Between steps, you can write to a temporary dataset (e.g. DSN=&&TEMP) or a permanent one (e.g. DSN=MY.ETL.WORK). Temporary datasets live only for the duration of the job: they can be passed from step to step, but the system deletes them at job end, so they avoid catalog clutter and suit one-time ETL runs. Permanent intermediates are useful when you need to restart from the middle, audit the intermediate, or reuse it in another job. Choose based on your restart, audit, and retention requirements.
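The choice shows up only in the DD statement for the intermediate dataset. Both forms below are sketches with placeholder names.

```
//* Temporary: passed between steps, deleted automatically at job end
//SORTOUT  DD DSN=&&CLEAN,DISP=(NEW,PASS),SPACE=...
//*
//* Permanent: cataloged if the step succeeds, deleted if it fails,
//* so you can restart from it, audit it, or reuse it in another job
//SORTOUT  DD DSN=MY.ETL.WORK,DISP=(NEW,CATLG,DELETE),SPACE=...
```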
ETL is like moving toys: you take them out of the box (extract), clean them and put them in order (transform), and put them in the right drawers (load). DFSORT can do all three: it takes data from one place (SORTIN), changes it (filter, sort, fix dates, join with another file), and writes it to new places (SORTOUT and OUTFIL). Sometimes you need two moves: first step cleans the toys, second step sorts them into drawers. Each step is one "move" and you can pass the result from the first step to the second with a temporary box (&&TEMP) that you throw away at the end.
1. What does ETL stand for and how does DFSORT fit?
2. How can you chain multiple DFSORT steps in one job for ETL?
3. When is it better to use DFSORT for transform instead of a COBOL program?
4. What is the "load" part of ETL in a DFSORT step?
5. Why use a temporary dataset between two DFSORT steps in an ETL job?