MainframeMaster

ETL Workflows

ETL—Extract, Transform, Load—is a standard pattern for moving and preparing data: you extract from a source, transform it (filter, reformat, aggregate, join), and load it into targets. On the mainframe, DFSORT is often the workhorse for ETL: it can do the extract (read and filter), the transform (reformat, sort, sum, join), and the load (write to one or more outputs). This page explains how DFSORT maps to ETL, how to design single-step and multi-step ETL jobs, when to use DFSORT versus an application program, and how to chain steps with temporary or permanent datasets.


ETL and DFSORT: Extract, Transform, Load

In ETL, extract means reading data from one or more sources. In a DFSORT step, the source is SORTIN (a JOINKEYS job reads its two inputs from SORTJNF1 and SORTJNF2 instead), so the extract phase is "read SORTIN." You can narrow the extract with INCLUDE or OMIT—only records that meet the condition are processed; the rest are dropped. That is still part of extract: you are extracting a subset. Transform means changing the data: reordering or building new fields (INREC, OUTREC), aggregating (SUM), joining (JOINKEYS), converting dates, or doing character translation. Load means writing the result somewhere; in DFSORT, that is SORTOUT and any OUTFIL datasets. So one DFSORT step can implement a full ETL flow for a single source and one or more targets.

ETL phases and DFSORT equivalents
Phase       DFSORT mechanism
Extract     SORTIN (read); INCLUDE/OMIT (filter)
Transform   INREC, OUTREC, SUM, JOINKEYS, date conversion, translation
Load        SORTOUT, OUTFIL (write to output datasets)
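
To make the mapping concrete, here is a minimal SYSIN that touches all three phases in one step (the column positions and the test value are made up for illustration):

* Extract: keep only active records ('A' in column 10)
  INCLUDE COND=(10,1,CH,EQ,C'A')
* Transform: sort by the 8-byte key in columns 1-8
  SORT FIELDS=(1,8,CH,A)
* Load is implicit: the result is written to the SORTOUT DD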

Single-Step ETL with DFSORT

Many ETL flows can be done in one step. Example: read a transaction file (SORTIN), omit invalid or test records (OMIT), normalize dates and case (INREC or OUTREC), sort by key (SORT FIELDS=), and write to a single output (SORTOUT) or to multiple outputs (OUTFIL with different INCLUDE/OMIT or BUILD). The control statements in SYSIN define the whole pipeline. Advantages: one program (DFSORT), one step, no intermediate datasets. Use this when the logic fits in SORT, INCLUDE/OMIT, INREC, OUTREC, OUTFIL, and optionally SUM or JOINKEYS.
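
A sketch of such a single-step job follows; the dataset names, field positions, and test values are hypothetical, and SPACE parameters are elided as in the two-step example further down:

//ONESTEP EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=TRANS.DAILY,DISP=SHR
//SORTOUT  DD DSN=TRANS.CLEAN,DISP=(,CATLG),SPACE=...
//EAST     DD DSN=TRANS.EAST,DISP=(,CATLG),SPACE=...
//SYSIN    DD *
* Extract: drop test records flagged with 'T' in column 20
  OMIT COND=(20,1,CH,EQ,C'T')
* Transform: upper-case the 10-byte name field at column 21, then sort by key
  INREC OVERLAY=(21:21,10,TRAN=LTOU)
  SORT FIELDS=(1,10,CH,A)
* Load: east-region records ('E' in column 40) go to the EAST DD;
* SAVE routes everything else to SORTOUT
  OUTFIL FNAMES=EAST,INCLUDE=(40,1,CH,EQ,C'E')
  OUTFIL FNAMES=SORTOUT,SAVE
/*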

Multi-Step ETL: Chaining DFSORT Steps

When the logic is too complex for one step, or when you want to separate concerns (e.g. first step cleans, second step sorts and splits), use multiple steps. Step 1 runs DFSORT (or another program) and writes to a dataset. Step 2 uses that dataset as input (e.g. SORTIN). In JCL, you allocate a temporary dataset in step 1 with DISP=(,PASS) and reference it in step 2 with the same name (e.g. &&TEMP). The system passes the dataset from step 1 to step 2; at job end, temporaries are normally deleted. So you get a pipeline: extract and clean in step 1, transform and load in step 2, without leaving permanent intermediates unless you choose to.

Example: two-step ETL

Step 1: DFSORT reads raw input (SORTIN), OMITs bad records, uses INREC to normalize dates and pad fields, writes to &&CLEAN. Step 2: DFSORT reads &&CLEAN as SORTIN, SORT FIELDS= by key, OUTFIL splits by region or type, writes to final datasets. Each step has its own SYSIN. The data flow is clear and each step can be tested or changed independently.

//STEP1   EXEC PGM=SORT
//SORTIN   DD DSN=RAW.DATA,DISP=SHR
//SORTOUT  DD DSN=&&CLEAN,DISP=(,PASS),SPACE=...
//SYSIN    DD *
  OMIT COND=(...)
  INREC BUILD=(...)
  OPTION COPY
/*
//STEP2   EXEC PGM=SORT
//SORTIN   DD DSN=&&CLEAN,DISP=(OLD,PASS)
//SORTOUT  DD DSN=FINAL.MAIN,DISP=(,CATLG),...
//OUTFIL1  DD DSN=FINAL.SPLIT1,...
//SYSIN    DD *
  SORT FIELDS=(1,20,CH,A)
  OUTFIL FNAMES=OUTFIL1,INCLUDE=(...),BUILD=(...)
  ...
/*

When to Use DFSORT vs an Application Program

DFSORT is strong at: filtering (INCLUDE/OMIT), sorting and merging, reformatting (INREC/OUTREC), aggregation (SUM), joining (JOINKEYS), date conversion, and simple translation. Use it when your ETL transform is mostly these operations. Use an application program (e.g. COBOL) when you need: complex business rules that vary by record type, many conditional branches, calls to databases or other systems, or logic that is easier to express in code than in control statements. Often the best design is a mix: DFSORT does the heavy bulk filter and reformat; a program does the record-by-record logic or enrichment. That keeps the pipeline simple and maintainable.
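
One sketch of that mixed shape, under the assumption of a hypothetical COBOL program ENRICH01 (all program, DD, and dataset names here are invented for illustration): DFSORT cuts the volume first, so the program sees only the records it needs.

//FILTER  EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=RAW.MASTER,DISP=SHR
//SORTOUT  DD DSN=&&SUBSET,DISP=(,PASS),SPACE=...
//SYSIN    DD *
* Bulk extract; no sort needed here, so copy
  OPTION COPY
  INCLUDE COND=(15,2,CH,EQ,C'OK')
/*
//* ENRICH01 stands in for a COBOL program with the record-by-record logic
//ENRICH  EXEC PGM=ENRICH01
//INFILE   DD DSN=&&SUBSET,DISP=(OLD,PASS)
//OUTFILE  DD DSN=FINAL.ENRICHED,DISP=(,CATLG),SPACE=...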

Temporary vs Permanent Intermediate Datasets

Between steps, you can write to a temporary dataset (e.g. DSN=&&TEMP) or a permanent one (e.g. DSN=MY.ETL.WORK). Temporaries can be passed from step to step with DISP=(,PASS), but they cannot be cataloged and are deleted automatically when the job ends; they avoid catalog clutter and are good for one-time ETL runs. Permanent intermediates are useful when you need to restart from the middle, audit the intermediate data, or reuse it in another job. Choose based on your restart, audit, and retention requirements.
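
The JCL difference is a single DD statement (names hypothetical, SPACE elided as elsewhere on this page); the operational difference is what survives the job:

//* Temporary: passable between steps, deleted automatically at end of job
//SORTOUT  DD DSN=&&WORK,DISP=(,PASS),SPACE=...
//* Permanent: cataloged and kept until explicitly deleted
//SORTOUT  DD DSN=MY.ETL.WORK,DISP=(,CATLG),SPACE=...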

Best Practices for ETL with DFSORT

  • Keep SYSIN in a PDS member or dataset so you can version and reuse control statements; avoid large in-stream data if the same logic is used in many jobs.
  • Use meaningful DD and dataset names so the data flow (SORTIN → SORTOUT, OUTFIL names) is clear in the JCL.
  • Document the record layout and any assumptions (e.g. fixed length, key positions) so future changes do not break the step.
  • For multi-step jobs, use temporaries (&&) for intermediates unless you need to keep them; that keeps the job self-contained.
  • Validate record counts: compare input and output counts (and OUTFIL counts) to catch unexpected filtering or duplication. Use ICE messages or a follow-on step to check (see the ICETOOL sketch after this list).
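
One way to automate the count check is an ICETOOL follow-on step. A minimal sketch, assuming the final output is FINAL.MAIN (hypothetical name):

//CHECK   EXEC PGM=ICETOOL
//TOOLMSG  DD SYSOUT=*
//DFSMSG   DD SYSOUT=*
//IN       DD DSN=FINAL.MAIN,DISP=SHR
//TOOLIN   DD *
* Set RC=12 if the ETL output is unexpectedly empty
  COUNT FROM(IN) EMPTY
/*

COUNT can also be checked against an expected volume with options such as HIGHER(x) or NOTEQUAL(x) when you know roughly how many records should come out.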

Explain It Like I'm Five

ETL is like moving toys: you take them out of the box (extract), clean them and put them in order (transform), and put them in the right drawers (load). DFSORT can do all three: it takes data from one place (SORTIN), changes it (filter, sort, fix dates, join with another file), and writes it to new places (SORTOUT and OUTFIL). Sometimes you need two moves: first step cleans the toys, second step sorts them into drawers. Each step is one "move" and you can pass the result from the first step to the second with a temporary box (&&TEMP) that you throw away at the end.

Exercises

  1. Design a single-step ETL: read a file, OMIT records with a bad flag, convert a date field with INREC, sort by key, write to SORTOUT. List the control statements in order.
  2. Design a two-step ETL: step 1 filters and normalizes to &&WORK; step 2 reads &&WORK, sorts, and writes two OUTFILs. What JCL do you need for the temporary dataset?
  3. When would you choose a COBOL program instead of DFSORT for the transform phase? Give two concrete examples.
  4. Your ETL job has three DFSORT steps. Step 2 fails. How do you use ICE000I and SYSOUT to see what control statements step 2 used and why it failed?

Quiz

Test Your Knowledge

1. What does ETL stand for and how does DFSORT fit?

  • Execute, Test, Load—DFSORT only tests
  • Extract, Transform, Load. DFSORT can do all three: extract (read and filter with INCLUDE/OMIT), transform (INREC, OUTREC, SUM, JOINKEYS, dates), and load (write to SORTOUT/OUTFIL)
  • Only Transform
  • DFSORT does not fit ETL

2. How can you chain multiple DFSORT steps in one job for ETL?

  • Only one step per job
  • Use the output of one step as the input of the next: step 1 writes to a dataset that step 2 reads as SORTIN. So step 1 might filter and normalize; step 2 might sort and split
  • Only with ICETOOL
  • DFSORT cannot chain

3. When is it better to use DFSORT for transform instead of a COBOL program?

  • Never
  • When the transform is expressible as sort, filter, reformat, join, or aggregate—DFSORT is fast, uses no application code, and is easy to change via SYSIN. Use a program when you need complex business logic or many conditional branches
  • Only for sorting
  • Only for JOINKEYS

4. What is the "load" part of ETL in a DFSORT step?

  • Only SORTIN
  • Writing the result to SORTOUT and any OUTFIL datasets—the transformed data is loaded into those output datasets for downstream steps or systems
  • Only INREC
  • Only SYSIN

5. Why use a temporary dataset between two DFSORT steps in an ETL job?

  • You never do
  • To pass the output of the first step to the second without keeping a permanent copy; use a temporary (e.g. &&TEMP) so the job cleans it up and you avoid catalog clutter
  • Only for MERGE
  • Only for JOINKEYS