MainframeMaster

Sampling Records in OUTFIL

Sampling means selecting a subset of records by their position in the file rather than by their content. For example, you might want every 10th record for an audit, or records 100 through 200 for a test extract. In DFSORT OUTFIL you do this using the Relative Record Number (RRN)—the position of each record in the input (1 for the first, 2 for the second, and so on). The main parameters are STARTREC= and ENDREC= to define a range of record positions, and SAMPLE= to select every nth record (or SAMPLE=(n,m) for a take/skip pattern). This page explains RRN, STARTREC, ENDREC, SAMPLE=n, and SAMPLE=(n,m), and when to use sampling instead of INCLUDE/OMIT.

Relative Record Number (RRN)

The Relative Record Number is the ordinal position of a record in the stream that OUTFIL receives. The first record has RRN 1, the second has RRN 2, and so on. RRN is not stored in the record; DFSORT assigns it from the order in which records arrive. With SORT FIELDS=COPY and no filtering, that order is the order of the input dataset. After a SORT, RRN 1 is the record that sorted first and RRN n is the record that sorted last. All OUTFIL sampling and range selection is based on this implicit RRN.

STARTREC= and ENDREC=: Selecting a Range

STARTREC=n means “start including records from the nth record onward.” ENDREC=m means “stop after the mth record.” So together they restrict the output to records whose RRN is between n and m inclusive. If you specify only STARTREC=10, records 1–9 are skipped and 10 through the end are written. If you specify only ENDREC=100, records 1–100 are written and the rest are skipped. If you specify both STARTREC=10 and ENDREC=50, only records 10–50 are considered for the output. This is useful for extracting a slice of the file (e.g. for testing) or for limiting the scope before applying SAMPLE=.

Sampling and range parameters
| Parameter | Meaning | Effect |
|---|---|---|
| STARTREC=n | First record (by RRN) to include | Skip records before position n |
| ENDREC=m | Last record (by RRN) to include | Stop after position m |
| SAMPLE=n | Every nth record | Reduce output to 1/n of the (range) records |
| SAMPLE=(n,m) | Take/skip pattern (product-dependent) | Blocks of m with gaps (see manual) |

Example: records 5 through 10 only

```text
  SORT FIELDS=COPY
  OUTFIL FNAMES=SORTOUT,STARTREC=5,ENDREC=10
```

Only the 5th, 6th, 7th, 8th, 9th, and 10th records (by position) are written to SORTOUT. Records 1–4 and 11 onward are not written.
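The one-sided forms described earlier (STARTREC alone, ENDREC alone) can be sketched in the same way. This is a minimal illustration, not taken from a manual: the DD names TAIL and HEAD are hypothetical and would need matching DD statements in the JCL.

```text
  SORT FIELDS=COPY
* Records 10 through the end of the file (records 1-9 skipped)
  OUTFIL FNAMES=TAIL,STARTREC=10
* Records 1 through 100 (everything after 100 skipped)
  OUTFIL FNAMES=HEAD,ENDREC=100
```

Each OUTFIL applies its own range to the same copied input, so the two outputs overlap on records 10 through 100.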

SAMPLE=n: Every nth Record

SAMPLE=n selects every nth record. The exact rule can vary by product: some implementations include the first record and then every nth (so SAMPLE=3 gives RRN 1, 4, 7, 10, …), others include records at positions n, 2n, 3n (so SAMPLE=3 gives RRN 3, 6, 9, …). The MainframeTechHelp example shows SAMPLE=3 producing records 1, 4, 7, 10—i.e. first record then every third. Check your DFSORT manual for the exact behavior. Either way, the output has roughly 1/n of the records (over the range that is being considered).

Example: every 3rd record

```text
  SORT FIELDS=COPY
  OUTFIL FNAMES=SORTOUT,SAMPLE=3
```

Only every 3rd record (by the product’s rule) is written. If the input has 1000 records, the output has on the order of 333 records. Use this for statistical sampling or to reduce volume for testing.
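For context, a complete job step around such a control statement might look like the sketch below. The step name, the dataset names (YOUR.INPUT.DATA, YOUR.SAMPLE.DATA), and the space allocation are placeholders chosen for illustration; adjust them to your installation's standards.

```text
//SAMPLE   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=YOUR.INPUT.DATA,DISP=SHR
//SORTOUT  DD DSN=YOUR.SAMPLE.DATA,
//            DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(5,5),RLSE)
//SYSIN    DD *
* Copy the input and keep one record out of every 100
  SORT FIELDS=COPY
  OUTFIL FNAMES=SORTOUT,SAMPLE=100
/*
```

With a 1000-record input this would write 10 records, whichever of the two starting rules the product uses.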

Combining Range and Sampling

You can use STARTREC= and ENDREC= to limit the range, then SAMPLE= to take every nth record within that range. For example STARTREC=10,ENDREC=50,SAMPLE=5 considers only records 10–50 (41 records), then selects every 5th from that set (e.g. 10, 15, 20, 25, 30, 35, 40, 45, 50—9 records). So you get a window of the file and a sample within that window.

Example: range then every 5th

```text
  SORT FIELDS=COPY
  OUTFIL FNAMES=SORTOUT,STARTREC=10,ENDREC=50,SAMPLE=5
```

Records 10–50 are in scope (41 records); from those, every 5th is written. If sampling is anchored at the STARTREC record, the output is records 10, 15, 20, 25, 30, 35, 40, 45, and 50. A product that anchors the count differently would shift that list, so confirm the rule in your manual.
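The record count in this example can be checked with a small formula. It is a sketch that assumes sampling is anchored at STARTREC, so that the STARTREC record is the first one written; the symbols are introduced here for illustration, with s = STARTREC, e = ENDREC, n = the SAMPLE interval, and N = the number of records written.

```latex
N = \left\lfloor \frac{e - s}{n} \right\rfloor + 1
\qquad\Rightarrow\qquad
N = \left\lfloor \frac{50 - 10}{5} \right\rfloor + 1 = 8 + 1 = 9
```

The window itself holds e - s + 1 = 41 records, of which 9 are written under this assumption.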

SAMPLE=(n,m): Take/Skip Pattern

Some products support SAMPLE=(n,m) to define a repeating pattern: copy m records, skip the rest of the interval, then repeat. The exact meaning of n and m is product-dependent. In the example below, SAMPLE=(3,2) with STARTREC=2 produces records 2, 3, 5, 6, 8, 9: from record 2 onward, copy 2 records (2 and 3), skip 1 (4), copy 2 (5 and 6), skip 1 (7), copy 2 (8 and 9). So (3,2) can be read as “in each group of 3, take the first 2,” that is, take 2, skip 1. Use SAMPLE=(n,m) when you want blocks of consecutive records with gaps between blocks. Check your manual for the exact (n,m) semantics.

Example: SAMPLE=(3,2) from record 2

```text
  SORT FIELDS=(1,5,ZD,A)
  OUTFIL FNAMES=OUTPUT3,STARTREC=2,SAMPLE=(3,2)
```

Starting at RRN 2, the (3,2) pattern copies 2 records then skips 1, so the output gets RRNs 2, 3, 5, 6, 8, and 9; records 4 and 7 are skipped. Because this job sorts on the 5-byte ZD key first, the RRNs refer to positions in the sorted order, not the original input order. This is useful when you want pairs of records with one skipped between each pair.
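The same pattern scales to larger intervals. The sketch below assumes the “first m of every n” reading used in the example above; the DD name BLOCKS is hypothetical, and the RRNs listed in the comments hold only under that assumption.

```text
  SORT FIELDS=COPY
* Assumed reading: first 10 records of every 100-record interval
* Expected RRNs under that reading: 1-10, 101-110, 201-210, ...
  OUTFIL FNAMES=BLOCKS,SAMPLE=(100,10)
```

Blocks of consecutive records like these can be handy when a test needs runs of adjacent records rather than isolated ones.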

Multiple OUTFILs with Different Sampling

You can have several OUTFIL statements with different STARTREC/ENDREC/SAMPLE settings so that different subsets go to different files. For example: one OUTFIL with SAMPLE=3 for a 1-in-3 sample, another with STARTREC=4,SAMPLE=4,ENDREC=10 for records 4 and 8 only. Each OUTFIL is independent; the same input is read once and each OUTFIL applies its own selection. So you can produce multiple sample extracts in one pass.
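The two selections just described can be written as one set of control statements. The DD names OUT1 and OUT2 are placeholders for illustration; the parameters are the ones named in the paragraph above.

```text
  SORT FIELDS=COPY
* 1-in-3 sample of the whole file
  OUTFIL FNAMES=OUT1,SAMPLE=3
* Window of records 4-10, every 4th from record 4: records 4 and 8
  OUTFIL FNAMES=OUT2,STARTREC=4,ENDREC=10,SAMPLE=4
```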

Sampling vs INCLUDE/OMIT

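Sampling (STARTREC, ENDREC, SAMPLE) selects by position: “which record number” or “every nth record.” INCLUDE and OMIT select by content: “records where this field equals this value” or “records where this condition is true.” Use sampling when the criterion is positional (e.g. audit 1% by taking every 100th record, or test on records 1–500). Use INCLUDE/OMIT when the criterion is data-driven (e.g. department = 5, or amount > 1000). You can combine both: for example, an INCLUDE statement to filter by department, then SAMPLE= on OUTFIL to take every 10th of the records that remain. Order matters here: the INCLUDE statement is applied before records reach OUTFIL, whereas an INCLUDE= coded on the same OUTFIL as SAMPLE= is, in DFSORT's documented processing order, applied after the sampling. Check your manual if you depend on this.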
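A combined content-and-position selection might be sketched as follows. The field layout is an assumption made for illustration: a 2-byte character department code at column 11 with the value '05'. The sketch relies on the INCLUDE statement running before OUTFIL, so the sample is taken from the filtered records.

```text
  SORT FIELDS=COPY
* Content filter: keep only department '05' (hypothetical field)
  INCLUDE COND=(11,2,CH,EQ,C'05')
* Positional filter: every 10th of the records that passed INCLUDE
  OUTFIL FNAMES=SORTOUT,SAMPLE=10
```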

Explain It Like I'm Five

Imagine a long line of people. RRN is their place in line: first person is 1, second is 2, and so on. STARTREC=5 means “start from the 5th person,” and ENDREC=10 means “stop after the 10th.” So you only look at people 5 through 10. SAMPLE=2 means “pick every 2nd one”—so from those six people you might pick the 5th, 7th, and 9th. So we are not choosing by name or age; we are choosing by where they stand in line. That’s sampling by position. SAMPLE=(3,2) is like “take 2 people, skip 1, take 2, skip 1”—so you get little groups of 2 with a gap between.

Exercises

  1. Write an OUTFIL that writes only records 20 through 30 (by RRN). Use STARTREC and ENDREC.
  2. Write an OUTFIL that writes every 10th record from the full file. Use SAMPLE=10.
  3. Combine STARTREC=1, ENDREC=100 and SAMPLE=5. How many records do you expect in the output (approximately)?
  4. When would you use SAMPLE= instead of INCLUDE? Give an example of each.

Quiz

Test Your Knowledge

1. What is RRN in DFSORT OUTFIL?

  • A field in the record
  • Relative Record Number—the position of the record in the input (1 for first, 2 for second, etc.)
  • Random record number
  • Only for VB files

2. What does SAMPLE=5 do?

  • Writes 5 records only
  • Selects every 5th record (e.g. RRN 1, 6, 11, ... or product-defined pattern like 5, 10, 15, ...)
  • Samples 5 percent
  • Skips the first 5

3. How do you limit sampling to a range of records (e.g. records 10 through 50)?

  • Use INCLUDE only
  • Use STARTREC=10 and ENDREC=50 to define the range; you can combine with SAMPLE= to take every nth within that range
  • SAMPLE has no range
  • Use OMIT with position

4. What does SAMPLE=(3,2) mean?

  • Every 3rd and every 2nd record
  • A pattern: copy 2 records, then skip until the next cycle (e.g. copy 2, skip 1, copy 2, skip 1), so you get blocks of 2 with a gap—exact behavior is product-dependent
  • Only 2 or 3 records
  • Records 2 and 3 only

5. When would you use sampling instead of INCLUDE/OMIT?

  • Only when INCLUDE is not available
  • When you want to select by position (every nth record or a record range) rather than by field value; INCLUDE/OMIT select by content
  • Never
  • Only for reports