MainframeMaster

Unicode Handling

DFSORT is built for EBCDIC and fixed-byte record processing. It does not natively support Unicode or UTF-8: it does not interpret multi-byte character sequences, and its sort comparison is byte-oriented, not Unicode collation. When you need to sort or process data that is in UTF-8 or another Unicode encoding, you must either convert the data to a form DFSORT can handle (e.g. EBCDIC) before the sort, use product features such as LOCALE if your sort product supports them, or use a user exit or external program. This page explains why Unicode is different, what can go wrong if you sort UTF-8 data directly, and what workarounds are commonly used on the mainframe.


Why DFSORT Is EBCDIC-Oriented

On z/OS, the native character encoding is EBCDIC. DFSORT (and many mainframe utilities) assume that record data is stored in EBCDIC and that sort keys are specified by fixed byte position and length. Comparison is done byte by byte using the EBCDIC (or ALTSEQ) collating sequence. That model works well for single-byte character sets. Unicode, and in particular UTF-8, uses variable-length encoding: one character can be 1, 2, 3, or 4 bytes. DFSORT does not parse UTF-8 sequences or apply Unicode collation rules, so it is not designed to "understand" Unicode data natively.
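To make the fixed-width versus variable-width difference concrete, here is a small Python sketch (Python is used only for illustration; it is not part of DFSORT). It assumes code page 037 (`cp037`) as the EBCDIC code page; your site may use a different one. The helper name `byte_lengths` is made up for this example.

```python
def byte_lengths(ch: str):
    """Return (UTF-8 length, EBCDIC length or None) for one character."""
    utf8_len = len(ch.encode("utf-8"))
    try:
        ebcdic_len = len(ch.encode("cp037"))  # assumption: code page 037
    except UnicodeEncodeError:
        ebcdic_len = None  # the character has no equivalent in code page 037
    return utf8_len, ebcdic_len

# A, e-acute, euro sign, and an emoji, written as escapes
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

Every character that exists in the EBCDIC code page occupies exactly one byte there, while the same characters take between one and four bytes in UTF-8, and some have no single-byte EBCDIC equivalent at all.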

What Goes Wrong If You Sort UTF-8 Directly

If you feed UTF-8 data to DFSORT without conversion, DFSORT treats it as a stream of bytes. When you specify a sort key (e.g. bytes 20–40), that range is compared byte by byte. If a multi-byte character starts at byte 19 and continues into byte 21, the key "cuts" through the character: the sort does not know that bytes 19–21 form one character, so the key begins with the trailing bytes of a character that started outside it. The resulting order is by raw byte value. For keys aligned on character boundaries this happens to match Unicode code point order, but it is not locale-specific collation, so the order may not match what users expect for text (e.g. accented characters, CJK, or symbols). In addition, if the data is mixed or the encoding is wrong, you can get garbage or incorrect ordering. Direct sorting of UTF-8 with DFSORT is therefore not recommended unless you have a product feature that explicitly supports it.
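Both problems can be reproduced in a few lines of Python. This is only an illustration of byte-wise behavior, not of DFSORT itself; the helper `key_slice` and the sample data are invented for the example.

```python
def key_slice(record: bytes, start: int, length: int) -> bytes:
    """Return the bytes a positional key covers (start is 1-based)."""
    return record[start - 1 : start - 1 + length]

# "cafe au lait" with e-acute, which is the two bytes C3 A9 in UTF-8
record = "caf\u00e9 au lait".encode("utf-8")
key = key_slice(record, 1, 4)  # the key ends in the middle of the e-acute

try:
    key.decode("utf-8")
    key_is_whole = True
except UnicodeDecodeError:
    key_is_whole = False  # the key holds only half of a character
print(key.hex(), key_is_whole)

# Byte-wise ordering of UTF-8 text: accented capitals sort after all ASCII letters
names = ["Zebra", "\u00c9mile", "apple", "eclair"]
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))
print(by_bytes)
```

The four-byte key contains an incomplete character, and the byte-wise sort places the name starting with a capital E-acute last, after every unaccented name, which is rarely what a user reading the list expects.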

Workaround: Convert Before Sort

A common and reliable approach is to convert the data before the sort. If the data arrives in UTF-8 (e.g. from a file or message), run a conversion step or program that translates UTF-8 to EBCDIC (or to another single-byte encoding that DFSORT can sort). Then run DFSORT on the EBCDIC data. If the downstream system needs UTF-8 again, run a second conversion step after the sort to convert EBCDIC back to UTF-8. The conversion program or utility must handle multi-byte UTF-8 correctly (e.g. using z/OS Unicode Services or a licensed conversion tool). That way DFSORT only sees fixed-byte EBCDIC and behaves as designed.
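The three steps (convert, sort, convert back) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production conversion utility: code page 037 stands in for the site's EBCDIC code page, records are padded to a hypothetical fixed width of 20 bytes, the key is assumed to be positions 1–10, and `sorted`-style byte comparison stands in for the DFSORT step. All function names are invented for the example.

```python
EBCDIC = "cp037"  # assumption: your code page may differ (e.g. cp500, cp1140)
WIDTH = 20        # hypothetical fixed record length

def to_ebcdic_fixed(utf8_record: bytes, width: int = WIDTH) -> bytes:
    """Step 1: UTF-8 to fixed-length EBCDIC, padded with EBCDIC spaces (0x40)."""
    # encode() raises UnicodeEncodeError if a character has no EBCDIC equivalent
    ebcdic = utf8_record.decode("utf-8").encode(EBCDIC)
    if len(ebcdic) > width:
        raise ValueError("record is longer than the fixed width")
    return ebcdic.ljust(width, b"\x40")

def to_utf8(ebcdic_record: bytes) -> bytes:
    """Step 3: EBCDIC back to UTF-8, dropping the trailing pad spaces."""
    return ebcdic_record.decode(EBCDIC).rstrip(" ").encode("utf-8")

def convert_sort_convert(utf8_records):
    fixed = [to_ebcdic_fixed(r) for r in utf8_records]
    fixed.sort(key=lambda r: r[0:10])  # step 2: byte-wise sort on positions 1-10
    return [to_utf8(r) for r in fixed]

records = [s.encode("utf-8") for s in ["zoe", "Adam", "\u00c9mile", "bob"]]
for r in convert_sort_convert(records):
    print(r.decode("utf-8"))
```

Note that the result follows the EBCDIC collating sequence (lowercase letters before uppercase), and that the conversion fails loudly on characters the target code page cannot represent. Deciding what to do with such characters is part of designing the conversion step.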

LOCALE and Product Features

Some sort products (or newer DFSORT/Beyond Sort levels) may offer LOCALE or similar features that affect sort order or character handling. The exact behavior—for example, whether Unicode or a specific locale is supported—depends on the product and release. If your site has such a feature, the documentation will describe how to specify the locale and what character sets or encodings are supported. Use that when available and when it meets your requirement; otherwise, the convert-then-sort approach remains the safest.

ALTSEQ and Single-Byte Conversion

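ALTSEQ defines a one-byte-to-one-byte mapping: each of the 256 possible byte values maps to exactly one replacement byte. That makes it suitable for adjusting the collating sequence or for mapping between single-byte character sets (e.g. EBCDIC to ASCII). It cannot represent UTF-8 properly because UTF-8 uses multi-byte sequences: one character occupies 1–4 bytes, not one byte, and a per-byte table can neither combine several input bytes into one output byte nor change the length of the data. So do not try to use ALTSEQ alone for UTF-8 to EBCDIC conversion; use a proper conversion program or service that understands UTF-8 encoding.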
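The limitation is easy to see with a 256-entry translate table in Python, which behaves like a one-byte-to-one-byte mapping. The table below is an invented example that folds EBCDIC lowercase letters to uppercase (assuming code page 037); it is not an actual ALTSEQ statement.

```python
def build_fold_table() -> bytes:
    """256-entry table: index = input byte, value = output byte."""
    table = bytearray(range(256))  # start with the identity mapping
    # EBCDIC lowercase ranges a-i, j-r, s-z sit 0x40 below their uppercase forms
    for lo, hi in ((0x81, 0x89), (0x91, 0x99), (0xA2, 0xA9)):
        for b in range(lo, hi + 1):
            table[b] = b + 0x40
    return bytes(table)

FOLD = build_fold_table()

# Works for single-byte data: one byte in, one byte out
ebcdic = "Smith smith".encode("cp037")
folded = ebcdic.translate(FOLD)
print(folded.decode("cp037"))

# Cannot work for UTF-8: e-acute is two bytes in UTF-8 but one byte in EBCDIC
utf8_e_acute = "\u00e9".encode("utf-8")
print(len(utf8_e_acute.translate(FOLD)), len("\u00e9".encode("cp037")))
```

A per-byte table always returns as many bytes as it was given, so it has no way to turn the two UTF-8 bytes of one character into the single EBCDIC byte for that character.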

User Exits and External Programs

If you need to normalize or convert Unicode data as part of the sort flow, you can use a user exit (E15 or E35) or a pre/post step. For example, an E15 exit could call a conversion routine that translates UTF-8 to EBCDIC for each record (or for key fields) before the record is sorted. The exit must correctly handle multi-byte UTF-8; that logic is non-trivial and is usually implemented with a library or API that supports UTF-8. Alternatively, keep conversion in a separate step and use DFSORT only on already-converted data.
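The key-field variant can be sketched as a pair of per-record hooks. The Python below is only an analogy for what an E15/E35 pair might do; real exits are written in a language such as Assembler or COBOL and follow the DFSORT exit interface, which is not shown here. The function names, the comma-delimited record layout, the 8-byte key width, and code page 037 are all assumptions made for the example.

```python
KEY_WIDTH = 8  # hypothetical width of the generated sort key

def add_sort_key(utf8_record: bytes) -> bytes:
    """E15-style hook (hypothetical): prepend an EBCDIC key built from the first field."""
    name = utf8_record.decode("utf-8").split(",", 1)[0]
    # errors="replace" substitutes '?' for characters code page 037 cannot represent
    key = name.encode("cp037", errors="replace")[:KEY_WIDTH]
    return key.ljust(KEY_WIDTH, b"\x40") + utf8_record

def drop_sort_key(keyed_record: bytes) -> bytes:
    """E35-style hook (hypothetical): remove the key prefix after the sort."""
    return keyed_record[KEY_WIDTH:]

records = [s.encode("utf-8") for s in ["zoe,1", "Adam,2", "bob,3"]]
keyed = sorted((add_sort_key(r) for r in records), key=lambda r: r[:KEY_WIDTH])
result = [drop_sort_key(r) for r in keyed]
print([r.decode("utf-8") for r in result])
```

With this design the record itself stays in UTF-8 and only the fixed-width prefix is sorted, so no conversion back is needed. The trade-off is that characters replaced by `?` all sort together, which may or may not be acceptable for your data.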

Explain It Like I'm Five

The sorter was built for one kind of alphabet where each letter is one box. Unicode is like an alphabet where some letters take two or three boxes. The sorter does not know how to read those multi-box letters; it just looks at boxes one by one. So if you give it that kind of alphabet, the order can get mixed up. Grown-ups fix it by changing the alphabet into the one-box kind before the sorter sees it, then the sorter does its job, and sometimes they change it back after.

Exercises

  1. Why is byte-by-byte comparison unsuitable for UTF-8 text sort order?
  2. Describe the "convert before sort" workaround in one or two sentences.
  3. Why can't ALTSEQ alone be used for UTF-8 to EBCDIC conversion?
  4. What should you check in your sort product documentation regarding Unicode or LOCALE?

Quiz

Test Your Knowledge

1. Does DFSORT natively support UTF-8 or Unicode data?

  • Yes, natively
  • DFSORT is designed primarily for EBCDIC; it does not have built-in Unicode/UTF-8 support. You may need to convert data first or use workarounds (e.g. LOCALE, conversion step, or user exit)
  • Only with MODS
  • Only for INREC

2. Why is UTF-8 difficult for a traditional sort utility to handle directly?

  • It is not
  • UTF-8 is variable-length (1–4 bytes per character). Sort utilities often assume fixed positions and lengths for sort keys; multi-byte characters break that assumption and sort order may be wrong
  • Only EBCDIC is variable
  • DFSORT requires UTF-16

3. What is a common workaround for sorting data that arrives in UTF-8?

  • Use INCLUDE only
  • Convert the data from UTF-8 to EBCDIC (or a single-byte encoding) before the sort step, then sort with DFSORT; optionally convert back after the sort if the target system needs UTF-8
  • Use OPTION COPY
  • Use MERGE only

4. What is LOCALE in the context of DFSORT/sort products?

  • A DD name
  • A product feature (where supported) that can affect sort order or character handling according to a locale; check your product manual for Unicode or locale support
  • A type of INREC
  • Same as ALTSEQ

5. If you sort UTF-8 data with DFSORT without conversion, what can happen?

  • Nothing
  • Byte-wise comparison is used; sort order may not match Unicode collation, and key positions may split multi-byte characters, leading to incorrect or unexpected order and possible data issues
  • It always works
  • Only the first byte is used