Sampling means selecting a subset of records by their position in the file rather than by their content. For example, you might want every 10th record for an audit, or records 100 through 200 for a test extract. In DFSORT OUTFIL you do this using the Relative Record Number (RRN)—the position of each record in the input (1 for the first, 2 for the second, and so on). The main parameters are STARTREC= and ENDREC= to define a range of record positions, and SAMPLE= to select every nth record (or SAMPLE=(n,m) for a take/skip pattern). This page explains RRN, STARTREC, ENDREC, SAMPLE=n, and SAMPLE=(n,m), and when to use sampling instead of INCLUDE/OMIT.
The Relative Record Number is the ordinal position of a record in the stream of records that reaches OUTFIL. The first record has RRN 1, the second has RRN 2, and so on. RRN is not stored in the record; DFSORT assigns it from the order in which records arrive. For a COPY that is the input order. After a SORT, the “first” record is the one that sorted first (RRN 1), and the “last” is the one that sorted last (RRN n). Records removed earlier by an INCLUDE or OMIT statement never reach OUTFIL and are not counted. All OUTFIL sampling and range selection is based on this implicit RRN.
STARTREC=n means “start including records from the nth record onward.” ENDREC=m means “stop after the mth record.” So together they restrict the output to records whose RRN is between n and m inclusive. If you specify only STARTREC=10, records 1–9 are skipped and 10 through the end are written. If you specify only ENDREC=100, records 1–100 are written and the rest are skipped. If you specify both STARTREC=10 and ENDREC=50, only records 10–50 are considered for the output. This is useful for extracting a slice of the file (e.g. for testing) or for limiting the scope before applying SAMPLE=.
| Parameter | Meaning | Effect |
|---|---|---|
| STARTREC=n | First record (by RRN) to include | Skip records before position n |
| ENDREC=m | Last record (by RRN) to include | Stop after position m |
| SAMPLE=n | Every nth record | Reduce output to 1/n of the (range) records |
| SAMPLE=(n,m) | Take/skip pattern (product-dependent) | Blocks of m with gaps (see manual) |
    SORT FIELDS=COPY
    OUTFIL FNAMES=SORTOUT,STARTREC=5,ENDREC=10
Only the 5th, 6th, 7th, 8th, 9th, and 10th records (by position) are written to SORTOUT. Records 1–4 and 11 onward are not written.
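Because RRN is counted on the records as they reach OUTFIL, a range applied after a real sort selects by sorted rank rather than by input order. The sketch below is only an illustration: it assumes a hypothetical 5-byte zoned-decimal key in columns 1-5 and uses ENDREC=10 to keep the ten records with the highest key values. Adjust the field definition and DD name to your own data.

```jcl
* Sort descending on a hypothetical 5-byte ZD key in columns 1-5
  SORT FIELDS=(1,5,ZD,D)
* After the sort, RRNs 1-10 are the ten highest keys
  OUTFIL FNAMES=SORTOUT,ENDREC=10
```

With SORT FIELDS=COPY instead, the same ENDREC=10 would simply keep the first ten records in input order.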
SAMPLE=n selects every nth record. The exact rule can vary by product: some implementations include the first record and then every nth (so SAMPLE=3 gives RRN 1, 4, 7, 10, …), others include records at positions n, 2n, 3n (so SAMPLE=3 gives RRN 3, 6, 9, …). The MainframeTechHelp example shows SAMPLE=3 producing records 1, 4, 7, 10—i.e. first record then every third. Check your DFSORT manual for the exact behavior. Either way, the output has roughly 1/n of the records (over the range that is being considered).
    SORT FIELDS=COPY
    OUTFIL FNAMES=SORTOUT,SAMPLE=3
Only every 3rd record (by the product’s rule) is written. If the input has 1000 records, the output has on the order of 333 records. Use this for statistical sampling or to reduce volume for testing.
You can use STARTREC= and ENDREC= to limit the range, then SAMPLE= to take every nth record within that range. For example STARTREC=10,ENDREC=50,SAMPLE=5 considers only records 10–50 (41 records), then selects every 5th from that set (e.g. 10, 15, 20, 25, 30, 35, 40, 45, 50—9 records). So you get a window of the file and a sample within that window.
    SORT FIELDS=COPY
    OUTFIL FNAMES=SORTOUT,STARTREC=10,ENDREC=50,SAMPLE=5
Records 10–50 are in scope; from those, every 5th is written. If sampling starts at the first in-scope record, the output is RRNs 10, 15, 20, 25, 30, 35, 40, 45, and 50 (nine records). A product that starts counting elsewhere would shift the list, so confirm the rule in your manual.
Some products support SAMPLE=(n,m) to define a pattern of “copy m records, then skip some, then repeat.” The exact meaning of n and m is product-dependent. In one example, SAMPLE=(3,2) with STARTREC=2 produces records 2, 3, 5, 6, 8, 9: that is, from record 2 onward, copy 2 records (2 and 3), skip 1 (4), copy 2 (5 and 6), skip 1 (7), copy 2 (8 and 9). So (3,2) can mean “in groups of 3, take 2”—i.e. take 2, skip 1. Use SAMPLE=(n,m) when you want blocks of consecutive records with gaps between blocks. Check your manual for the exact (n,m) semantics.
    SORT FIELDS=(1,5,ZD,A)
    OUTFIL FNAMES=OUTPUT3,STARTREC=2,SAMPLE=(3,2)
Starting at RRN 2, the (3,2) pattern copies 2 records and then skips 1, so the output contains RRNs 2, 3, 5, 6, 8, and 9. Records 4 and 7 are skipped. Because the statement sorts first, these are positions in the sorted order. This is useful when you want “pairs” of records with one record skipped between each pair.
You can have several OUTFIL statements with different STARTREC/ENDREC/SAMPLE settings so that different subsets go to different files. For example: one OUTFIL with SAMPLE=3 for a 1-in-3 sample, another with STARTREC=4,SAMPLE=4,ENDREC=10 for records 4 and 8 only. Each OUTFIL is independent; the same input is read once and each OUTFIL applies its own selection. So you can produce multiple sample extracts in one pass.
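A minimal sketch of the two-extract case described above, written as a complete job step. The step name, the dataset names, the space allocations, and the DD names OUT1 and OUT2 are all hypothetical placeholders; the only fixed rule is that each FNAMES= value must match a DD statement in the step.

```jcl
//SAMPLE   EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//* Hypothetical input and output datasets
//SORTIN   DD DSN=MY.INPUT.DATA,DISP=SHR
//OUT1     DD DSN=MY.SAMPLE.ONEIN3,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(5,5),RLSE)
//OUT2     DD DSN=MY.SAMPLE.RANGE,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(TRK,(1,1),RLSE)
//SYSIN    DD *
* Read the input once, in its original order
  SORT FIELDS=COPY
* 1-in-3 sample of the whole file
  OUTFIL FNAMES=OUT1,SAMPLE=3
* Every 4th record within RRNs 4-10 (records 4 and 8)
  OUTFIL FNAMES=OUT2,STARTREC=4,SAMPLE=4,ENDREC=10
/*
```

Each OUTFIL counts positions over the same record stream, so the selection in OUT1 has no effect on what OUT2 receives.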
Sampling (STARTREC, ENDREC, SAMPLE) selects by position: “which record number” or “every nth record.” INCLUDE and OMIT select by content: “records where this field equals this value” or “records where this condition is true.” Use sampling when the criterion is positional (e.g. audit 1% by taking every 100th record, or test on records 1–500). Use INCLUDE/OMIT when the criterion is data-driven (e.g. department = 5, or amount > 1000). You can combine both: for example INCLUDE to filter by department, then SAMPLE= to take every 10th of those.
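One way to sketch the combined case, assuming the department is held as a 2-byte character field in columns 1-2 (a hypothetical layout chosen only for illustration). The INCLUDE statement filters records before they reach OUTFIL, so SAMPLE=10 counts only the records that survived the filter.

```jcl
* Keep only department 05 (hypothetical 2-byte CH field in columns 1-2)
  INCLUDE COND=(1,2,CH,EQ,C'05')
  SORT FIELDS=COPY
* Every 10th of the records that passed the INCLUDE
  OUTFIL FNAMES=SORTOUT,SAMPLE=10
```

An INCLUDE= operand coded on the OUTFIL statement itself may be evaluated at a different point relative to SAMPLE=, so check your manual before relying on that form for "filter, then sample" logic.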
Imagine a long line of people. RRN is their place in line: first person is 1, second is 2, and so on. STARTREC=5 means “start from the 5th person,” and ENDREC=10 means “stop after the 10th.” So you only look at people 5 through 10. SAMPLE=2 means “pick every 2nd one”—so from those six people you might pick the 5th, 7th, and 9th. So we are not choosing by name or age; we are choosing by where they stand in line. That’s sampling by position. SAMPLE=(3,2) is like “take 2 people, skip 1, take 2, skip 1”—so you get little groups of 2 with a gap between.
1. What is RRN in DFSORT OUTFIL?
2. What does SAMPLE=5 do?
3. How do you limit sampling to a range of records (e.g. records 10 through 50)?
4. What does SAMPLE=(3,2) mean?
5. When would you use sampling instead of INCLUDE/OMIT?