ETL—Extract, Transform, Load—is a standard pattern for moving and preparing data: you extract from a source, transform it (filter, reformat, aggregate, join), and load it into targets. On the mainframe, DFSORT is often the workhorse for ETL: it can do the extract (read and filter), the transform (reformat, sort, sum, join), and the load (write to one or more outputs). This page explains how DFSORT maps to ETL, how to design single-step and multi-step ETL jobs, when to use DFSORT versus an application program, and how to chain steps with temporary or permanent datasets.
In ETL, extract means reading data from one or more sources. In a DFSORT step, the source is SORTIN (or, for a JOINKEYS job, the SORTJNF1 and SORTJNF2 input files). So the extract phase is "read SORTIN." You can narrow the extract with INCLUDE or OMIT: only records that meet the condition are processed; the rest are dropped. That is still part of extract: you are extracting a subset. Transform means changing the data: reordering or building new fields (INREC, OUTREC), aggregating (SUM), joining (JOINKEYS), converting dates, or translating characters. Load means writing the result somewhere. In DFSORT, that is SORTOUT and any OUTFIL datasets. So one DFSORT step can implement a full ETL flow for a single source and one or more targets.
| Phase | DFSORT mechanism |
|---|---|
| Extract | SORTIN (read); INCLUDE/OMIT (filter) |
| Transform | INREC, OUTREC, SUM, JOINKEYS, date conversion, translation |
| Load | SORTOUT, OUTFIL (write to output datasets) |
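To make the table concrete, here is a minimal one-step sketch. The dataset names, field positions, and the status-code test are illustrative assumptions, not taken from a real system; the point is only which statement plays which ETL role.

```
//ETLSTEP  EXEC PGM=SORT
//* Extract: read the source file
//SORTIN   DD DSN=PROD.SALES.DAILY,DISP=SHR
//* Load: write the filtered, sorted, reformatted result
//SORTOUT  DD DSN=PROD.SALES.CLEAN,DISP=(,CATLG),SPACE=...
//SYSIN    DD *
* Extract (filter): keep only records with status 'A' in column 21
  INCLUDE COND=(21,1,CH,EQ,C'A')
* Transform: sort by the 10-byte key, then keep only the needed fields
  SORT FIELDS=(1,10,CH,A)
  OUTREC BUILD=(1,10,21,1,31,8)
/*
```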
Many ETL flows can be done in one step. Example: read a transaction file (SORTIN), omit invalid or test records (OMIT), normalize dates and case (INREC or OUTREC), sort by key (SORT FIELDS=), and write to a single output (SORTOUT) or to multiple outputs (OUTFIL with different INCLUDE/OMIT or BUILD). The control statements in SYSIN define the whole pipeline. Advantages: one program (DFSORT), one step, no intermediate datasets. Use this when the logic fits in SORT, INCLUDE/OMIT, INREC, OUTREC, OUTFIL, and optionally SUM or JOINKEYS.
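A sketch of that single-step pattern, splitting to two regional outputs. The dataset names, column positions, and region codes are assumptions for illustration, not a real layout.

```
//ONESTEP  EXEC PGM=SORT
//SORTIN   DD DSN=PROD.TRANS.DAILY,DISP=SHR
//EAST     DD DSN=PROD.TRANS.EAST,DISP=(,CATLG),SPACE=...
//WEST     DD DSN=PROD.TRANS.WEST,DISP=(,CATLG),SPACE=...
//SYSIN    DD *
* Extract: drop test records flagged 'T' in column 15
  OMIT COND=(15,1,CH,EQ,C'T')
* Transform: upper-case the 20-byte name field, then order by account key
  INREC OVERLAY=(31:31,20,TRAN=LTOU)
  SORT FIELDS=(1,10,CH,A)
* Load: split by the region code in column 12
  OUTFIL FNAMES=EAST,INCLUDE=(12,1,CH,EQ,C'E')
  OUTFIL FNAMES=WEST,INCLUDE=(12,1,CH,EQ,C'W')
/*
```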
When the logic is too complex for one step, or when you want to separate concerns (e.g. first step cleans, second step sorts and splits), use multiple steps. Step 1 runs DFSORT (or another program) and writes to a dataset. Step 2 uses that dataset as input (e.g. SORTIN). In JCL, you allocate a temporary dataset in step 1 with DISP=(,PASS) and reference it in step 2 with the same name (e.g. &&TEMP). The system passes the dataset from step 1 to step 2; at job end, temporaries are normally deleted. So you get a pipeline: extract and clean in step 1, transform and load in step 2, without leaving permanent intermediates unless you choose to.
Step 1: DFSORT reads raw input (SORTIN), OMITs bad records, uses INREC to normalize dates and pad fields, writes to &&CLEAN. Step 2: DFSORT reads &&CLEAN as SORTIN, SORT FIELDS= by key, OUTFIL splits by region or type, writes to final datasets. Each step has its own SYSIN. The data flow is clear and each step can be tested or changed independently.
```
//STEP1   EXEC PGM=SORT
//SORTIN   DD DSN=RAW.DATA,DISP=SHR
//SORTOUT  DD DSN=&&CLEAN,DISP=(,PASS),SPACE=...
//SYSIN    DD *
  OMIT COND=(...)
  INREC BUILD=(...)
  OPTION COPY
/*
//STEP2   EXEC PGM=SORT
//SORTIN   DD DSN=&&CLEAN,DISP=(OLD,PASS)
//SORTOUT  DD DSN=FINAL.MAIN,DISP=(,CATLG),...
//OUTFIL1  DD DSN=FINAL.SPLIT1,...
//SYSIN    DD *
  SORT FIELDS=(1,20,CH,A)
  OUTFIL FNAMES=OUTFIL1,INCLUDE=(...),BUILD=(...)
  ...
/*
```
DFSORT is strong at: filtering (INCLUDE/OMIT), sorting and merging, reformatting (INREC/OUTREC), aggregation (SUM), joining (JOINKEYS), date conversion, and simple translation. Use it when your ETL transform is mostly these operations. Use an application program (e.g. COBOL) when you need: complex business rules that vary by record type, many conditional branches, calls to databases or other systems, or logic that is easier to express in code than in control statements. Often the best design is a mix: DFSORT does the heavy bulk filter and reformat; a program does the record-by-record logic or enrichment. That keeps the pipeline simple and maintainable.
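As a sketch of that mixed design: DFSORT does the bulk filter and sort, then a program applies the record-level rules. The program name ACCTCALC and its DD names INFILE/OUTFILE are hypothetical; a real program defines its own files and would normally also need a STEPLIB and message DDs.

```
//FILTER   EXEC PGM=SORT
//SORTIN   DD DSN=PROD.ACCT.RAW,DISP=SHR
//SORTOUT  DD DSN=&&SLIM,DISP=(,PASS),SPACE=...
//SYSIN    DD *
* DFSORT does the heavy lifting: drop closed accounts, sort by account key
  OMIT COND=(25,1,CH,EQ,C'C')
  SORT FIELDS=(1,8,CH,A)
/*
//* Hypothetical COBOL program applies the per-record business rules
//ENRICH   EXEC PGM=ACCTCALC
//INFILE   DD DSN=&&SLIM,DISP=(OLD,PASS)
//OUTFILE  DD DSN=PROD.ACCT.FINAL,DISP=(,CATLG),SPACE=...
```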
Between steps, you can write to a temporary dataset (e.g. DSN=&&TEMP) or a permanent one (e.g. DSN=MY.ETL.WORK). Temporary datasets live only for the duration of the job: they can be passed from step to step, but the system deletes them at job end, so they avoid catalog clutter and suit one-time ETL runs. Permanent intermediates are useful when you need to restart from the middle, audit the intermediate, or reuse it in another job. Choose based on your restart, audit, and retention requirements.
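The choice shows up only in the DD statement for the intermediate dataset. Both forms below are sketches with placeholder names.

```
//* Temporary: passed between steps, deleted automatically at job end
//SORTOUT  DD DSN=&&CLEAN,DISP=(NEW,PASS),SPACE=...
//*
//* Permanent: cataloged if the step succeeds, deleted if it fails,
//* so you can restart from it, audit it, or reuse it in another job
//SORTOUT  DD DSN=MY.ETL.WORK,DISP=(NEW,CATLG,DELETE),SPACE=...
```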
ETL is like moving toys: you take them out of the box (extract), clean them and put them in order (transform), and put them in the right drawers (load). DFSORT can do all three: it takes data from one place (SORTIN), changes it (filter, sort, fix dates, join with another file), and writes it to new places (SORTOUT and OUTFIL). Sometimes you need two moves: first step cleans the toys, second step sorts them into drawers. Each step is one "move" and you can pass the result from the first step to the second with a temporary box (&&TEMP) that you throw away at the end.
1. What does ETL stand for and how does DFSORT fit?
2. How can you chain multiple DFSORT steps in one job for ETL?
3. When is it better to use DFSORT for transform instead of a COBOL program?
4. What is the "load" part of ETL in a DFSORT step?
5. Why use a temporary dataset between two DFSORT steps in an ETL job?