DFSORT is built for EBCDIC and fixed-byte record processing. It does not natively support Unicode or UTF-8: it does not interpret multi-byte character sequences, and its sort comparison is byte-oriented, not Unicode collation. When you need to sort or process data that is in UTF-8 or another Unicode encoding, you must either convert the data to a form DFSORT can handle (e.g. EBCDIC) before the sort, use product features such as LOCALE if your sort product supports them, or use a user exit or external program. This page explains why Unicode is different, what can go wrong if you sort UTF-8 data directly, and what workarounds are commonly used on the mainframe.
On z/OS, the native character encoding is EBCDIC. DFSORT (and many mainframe utilities) assume that record data is stored in EBCDIC and that sort keys are specified by fixed byte position and length. Comparison is done byte by byte using the EBCDIC (or ALTSEQ) collating sequence. That model works well for single-byte character sets. Unicode, and in particular UTF-8, uses variable-length encoding: one character can be 1, 2, 3, or 4 bytes. DFSORT does not parse UTF-8 sequences or apply Unicode collation rules, so it is not designed to "understand" Unicode data natively.
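The difference in byte length can be seen in a short Python sketch. Python is used here only for illustration and is not part of DFSORT. The sample characters are arbitrary choices, and `cp037` is Python's name for one common EBCDIC code page.

```python
# Sample characters written as escapes: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_lengths(chars):
    """Return a mapping of character -> number of bytes in its UTF-8 encoding."""
    return {ch: len(ch.encode("utf-8")) for ch in chars}

lengths = utf8_lengths(samples)
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")

# In a single-byte EBCDIC code page, every character it supports is exactly one byte.
ebcdic_a = "A".encode("cp037")
ebcdic_e_acute = "\u00e9".encode("cp037")
print(len(ebcdic_a), len(ebcdic_e_acute))
```

A key defined as "10 bytes starting at position 20" therefore always holds 10 characters in EBCDIC, but anywhere from about 3 to 10 characters in UTF-8.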
If you feed UTF-8 data to DFSORT without conversion, DFSORT treats it as a stream of bytes. When you specify a sort key (e.g. bytes 20–40), that range is compared byte by byte. If a three-byte character starts at byte 19 and ends at byte 21, the key "cuts" through the character: the sort does not know that bytes 19–21 form one character, so the key begins with two continuation bytes. Because characters vary in length, a fixed byte range also holds a different number of characters from record to record. The resulting order is by raw byte value, not by locale-specific collation. For well-formed UTF-8, a byte-wise comparison happens to follow code point order, but only when the key starts and ends on character boundaries, and code point order is still not the order users expect for text (e.g. accented characters, CJK, or symbols). In addition, if the data is mixed or the encoding is wrong, you can get garbage or incorrect ordering. Direct sorting of UTF-8 with DFSORT is therefore not recommended unless you have a product feature that explicitly supports it.
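Both problems can be reproduced outside DFSORT. The following Python sketch is an illustration only: the record layout, the key offsets, and the word list are made up for the example.

```python
# A record with a 4-byte ID followed by UTF-8 text ("cafe" with an acute accent, then more text).
record = "ID01".encode("utf-8") + "caf\u00e9 au lait".encode("utf-8")

# Hypothetical key: 4 bytes starting at offset 4 (0-based).
# The accented letter needs 2 bytes, so the key ends in the middle of it.
key = record[4:8]

def whole_characters(data: bytes) -> bool:
    """Return True if the byte string decodes as complete UTF-8 characters."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(key, whole_characters(key))

# Byte-wise ordering of UTF-8 text: the accented word sorts after "zebra",
# because its first byte (0xC3) is larger than any ASCII letter.
words = ["zebra", "\u00e9clair", "apple"]
byte_order = sorted(words, key=lambda w: w.encode("utf-8"))
print(byte_order)
```

A reader of French text would usually expect the accented word between "apple" and "zebra"; a byte comparison cannot produce that order.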
A common and reliable approach is to convert the data before the sort. If the data arrives in UTF-8 (e.g. from a file or message), run a conversion step or program that translates UTF-8 to EBCDIC (or to another single-byte encoding that DFSORT can sort). Then run DFSORT on the EBCDIC data. If the downstream system needs UTF-8 again, run a second conversion step after the sort to convert EBCDIC back to UTF-8. The conversion program or utility must handle multi-byte UTF-8 correctly (e.g. using z/OS Unicode Services or a licensed conversion tool). Note that a single-byte EBCDIC code page can represent only a limited set of characters, so the round trip is lossless only when every character in the data exists in the target code page; decide up front how the conversion should treat characters that do not. That way DFSORT only sees fixed-byte EBCDIC and behaves as designed.
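The convert, sort, convert-back flow can be sketched in Python. This is a model of the three steps, not a replacement for a real conversion utility or for DFSORT. The function name, the record length, and the use of code page `cp037` are assumptions made for the example.

```python
EBCDIC_SPACE = b"\x40"  # the space character in code page 037

def sort_via_ebcdic(utf8_lines, key_start, key_len, record_len=20):
    """Convert UTF-8 lines to fixed-length cp037 records, sort on a byte key, convert back."""
    records = []
    for raw in utf8_lines:
        text = raw.decode("utf-8")       # step 1a: decode whole UTF-8 characters
        rec = text.encode("cp037")       # step 1b: raises UnicodeEncodeError if not representable
        if len(rec) > record_len:
            raise ValueError("record longer than record_len")
        records.append(rec.ljust(record_len, EBCDIC_SPACE))
    # step 2: byte-wise sort on a fixed position and length, as a sort utility would do
    records.sort(key=lambda r: r[key_start:key_start + key_len])
    # step 3: convert back to UTF-8, dropping the padding
    return [r.decode("cp037").rstrip(" ").encode("utf-8") for r in records]

data = [s.encode("utf-8") for s in ["zebra", "\u00e9clair", "apple", "Apple"]]
result = sort_via_ebcdic(data, key_start=0, key_len=10)
print([r.decode("utf-8") for r in result])
```

The output order follows EBCDIC byte values, where lowercase letters sort before uppercase ones. A character outside the code page (for example a CJK character) makes the strict encode step fail rather than silently corrupting the record.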
Some sort products, including newer DFSORT levels, may offer LOCALE or similar features that affect sort order or character handling. The exact behavior, such as whether Unicode or a specific locale is supported, depends on the product and release. If your site has such a feature, the product documentation will describe how to specify the locale and which character sets or encodings are supported. Use it when it is available and meets your requirement; otherwise, the convert-then-sort approach remains the safest.
ALTSEQ defines a one-byte-to-one-byte mapping: each of the 256 possible byte values is assigned one replacement value, which changes how that byte collates and can also drive a byte-for-byte translation. That makes it suitable for single-byte character sets (e.g. mapping EBCDIC values to ASCII values). It cannot represent UTF-8 properly because UTF-8 uses multi-byte sequences: one character maps to 1–4 bytes, not one byte. So do not try to use ALTSEQ alone for UTF-8 to EBCDIC conversion; use a proper conversion program or service that understands UTF-8 encoding.
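The limitation of a one-byte table can be shown with an analogy in Python. This is not ALTSEQ syntax. The 256-entry table below is a hypothetical stand-in built with `bytes.translate`, mapping ASCII byte values to their `cp037` equivalents and leaving all other bytes unchanged.

```python
# Build a 256-entry table: ASCII bytes map to cp037 bytes, everything else maps to itself.
table = bytearray(range(256))
for b in range(128):
    table[b] = bytes([b]).decode("ascii").encode("cp037")[0]
table = bytes(table)

def translate_bytewise(data: bytes) -> bytes:
    """Apply the one-byte-to-one-byte table; output length always equals input length."""
    return data.translate(table)

# Pure single-byte text converts correctly.
plain = translate_bytewise(b"cafe")
print(plain == "cafe".encode("cp037"))

# UTF-8 text with a 2-byte character does not: the table sees two unrelated bytes.
utf8_text = "caf\u00e9".encode("utf-8")        # 5 bytes
bytewise = translate_bytewise(utf8_text)       # still 5 bytes
correct = "caf\u00e9".encode("cp037")          # 4 bytes
print(len(bytewise), len(correct))
```

No choice of table entries fixes this, because the table has no way to turn two input bytes into one output byte.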
If you need to normalize or convert Unicode data as part of the sort flow, you can use a user exit (E15 or E35) or a pre/post step. For example, an E15 exit could call a conversion routine that translates UTF-8 to EBCDIC for each record (or for key fields) before the record is sorted. The exit must correctly handle multi-byte UTF-8; that logic is non-trivial and is usually implemented with a library or API that supports UTF-8. Alternatively, keep conversion in a separate step and use DFSORT only on already-converted data.
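The per-record logic such an exit would need can be modeled in Python. A real E15 exit is written against the DFSORT exit interface, typically in assembler, COBOL, or a similar language; the sketch below only illustrates the conversion logic. The function name, the comma-separated `id,name` layout, the key width, and the choice to replace unconvertible characters are all assumptions.

```python
KEY_WIDTH = 10          # hypothetical fixed key width in bytes
EBCDIC_SPACE = b"\x40"  # the space character in code page 037

def build_ebcdic_key(record: bytes) -> bytes:
    """Hypothetical per-record hook: derive a fixed-width cp037 key from a UTF-8 record."""
    text = record.decode("utf-8")   # decode whole characters; raises on malformed UTF-8
    name = text.split(",")[1]       # assumed layout: id,name
    # Truncate by characters, not bytes, so no character is split.
    # errors="replace" substitutes a placeholder for characters cp037 cannot represent.
    key = name[:KEY_WIDTH].encode("cp037", errors="replace")
    return key.ljust(KEY_WIDTH, EBCDIC_SPACE)

records = [r.encode("utf-8") for r in ["1,zebra", "2,\u00e9clair", "3,apple"]]

# The original UTF-8 records are left untouched; only the derived key is compared.
ordered = sorted(records, key=build_ebcdic_key)
print([r.decode("utf-8") for r in ordered])
```

Converting only the key keeps the record content in UTF-8, at the cost of computing a key for every record. Converting the whole record in a separate step, as described above, keeps the sort step simpler.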
The sorter was built for one kind of alphabet where each letter is one box. Unicode is like an alphabet where some letters take two or three boxes. The sorter does not know how to read those multi-box letters; it just looks at boxes one by one. So if you give it that kind of alphabet, the order can get mixed up. Grown-ups fix it by changing the alphabet into the one-box kind before the sorter sees it, then the sorter does its job, and sometimes they change it back after.
1. Does DFSORT natively support UTF-8 or Unicode data?
2. Why is UTF-8 difficult for a traditional sort utility to handle directly?
3. What is a common workaround for sorting data that arrives in UTF-8?
4. What is LOCALE in the context of DFSORT/sort products?
5. If you sort UTF-8 data with DFSORT without conversion, what can happen?