Text processing in COBOL involves parsing, normalizing, validating, and transforming text data to prepare it for use in your programs. Unlike basic string manipulation, text processing focuses on working with structured text formats like CSV files, delimited records, user input, and formatted data. Understanding text processing is essential for handling real-world data that comes in various formats, cleaning input data, validating user entries, and converting between different text representations in mainframe COBOL applications.
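Text processing encompasses operations that work with formatted or structured text data. Key text processing operations include parsing delimited data, normalizing case and whitespace, validating content, cleaning unwanted characters, and converting between formats.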
These operations are fundamental for processing user input, parsing file records, handling data imports, validating forms, and preparing data for storage or display in business applications.
One of the most common text processing tasks is parsing delimited data, where fields are separated by specific characters like commas, pipes, or tabs. This is essential for processing CSV files, log files, and formatted input records.
CSV (Comma-Separated Values) is a common format where fields are separated by commas. Here's how to parse CSV data:
```cobol
WORKING-STORAGE SECTION.
01 CSV-LINE      PIC X(200).
01 CUSTOMER-ID   PIC X(10).
01 CUSTOMER-NAME PIC X(30).
01 CITY          PIC X(20).
01 STATE         PIC X(2).
01 ZIP-CODE      PIC X(10).
01 FIELD-COUNT   PIC 9(4) VALUE ZERO.
01 CSV-POINTER   PIC 9(4) VALUE 1.

PROCEDURE DIVISION.
MAIN-PARA.
    MOVE '12345,JOHN SMITH,AUSTIN,TX,78701' TO CSV-LINE

    UNSTRING CSV-LINE
        DELIMITED BY ','
        INTO CUSTOMER-ID
             CUSTOMER-NAME
             CITY
             STATE
             ZIP-CODE
        WITH POINTER CSV-POINTER
        TALLYING IN FIELD-COUNT
    END-UNSTRING

    DISPLAY 'Customer ID: ' CUSTOMER-ID
    DISPLAY 'Name: ' CUSTOMER-NAME
    DISPLAY 'City: ' CITY
    DISPLAY 'State: ' STATE
    DISPLAY 'Zip: ' ZIP-CODE
    DISPLAY 'Fields parsed: ' FIELD-COUNT
    STOP RUN.
```
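In this example, DELIMITED BY ',' splits the line at each comma, INTO lists the receiving fields in order, WITH POINTER tracks the current position in the source, and TALLYING IN adds the number of fields filled to FIELD-COUNT.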
After execution, CUSTOMER-ID contains "12345", CUSTOMER-NAME contains "JOHN SMITH", CITY contains "AUSTIN", STATE contains "TX", ZIP-CODE contains "78701", and FIELD-COUNT contains 5.
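UNSTRING can also report how many characters each source field held, through the COUNT IN phrase, which helps spot values that were longer than the receiving item. The following is a minimal sketch, assuming free-format source and a compiler such as GnuCOBOL (`cobc -x -free`); the program name COUNTDEMO, the shortened CUSTOMER-NAME, and the ID-LEN and NAME-LEN counters are illustrative assumptions, not part of the example above.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. COUNTDEMO.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 CSV-LINE      PIC X(40) VALUE '12345,JOHN SMITH,TX'.
01 CUSTOMER-ID   PIC X(10).
*> Deliberately too short, to show the truncation check
01 CUSTOMER-NAME PIC X(5).
01 STATE-CODE    PIC X(2).
01 ID-LEN        PIC 9(4) VALUE ZERO.
01 NAME-LEN      PIC 9(4) VALUE ZERO.
PROCEDURE DIVISION.
MAIN-PARA.
    *> COUNT IN receives the number of source characters examined
    *> for that field, not counting the delimiter
    UNSTRING CSV-LINE DELIMITED BY ','
        INTO CUSTOMER-ID   COUNT IN ID-LEN
             CUSTOMER-NAME COUNT IN NAME-LEN
             STATE-CODE
    END-UNSTRING
    IF ID-LEN > FUNCTION LENGTH(CUSTOMER-ID)
        DISPLAY 'ID was truncated'
    END-IF
    IF NAME-LEN > FUNCTION LENGTH(CUSTOMER-NAME)
        DISPLAY 'Name was truncated'
    END-IF
    STOP RUN.
```

Here the name occupies 10 characters in the source but the receiving item holds only 5, so the second check fires. The count for the last field on a line is less useful, because it includes the trailing padding of the source field.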
CSV data often contains empty fields (consecutive commas). Use DELIMITED BY without ALL so that each comma counts as its own delimiter:

```cobol
WORKING-STORAGE SECTION.
01 CSV-LINE    PIC X(200) VALUE '12345,,AUSTIN,TX,78701'.
01 CUSTOMER-ID PIC X(10).
01 MIDDLE-NAME PIC X(20).
01 CITY        PIC X(20).
01 STATE       PIC X(2).
01 ZIP-CODE    PIC X(10).

PROCEDURE DIVISION.
MAIN-PARA.
    *> No ALL: every comma ends a field, even an empty one
    UNSTRING CSV-LINE
        DELIMITED BY ','
        INTO CUSTOMER-ID
             MIDDLE-NAME
             CITY
             STATE
             ZIP-CODE
    END-UNSTRING

    *> MIDDLE-NAME will be empty (spaces) because of the empty field
    DISPLAY 'ID: ' CUSTOMER-ID
    DISPLAY 'Middle: [' MIDDLE-NAME ']'
    DISPLAY 'City: ' CITY
    STOP RUN.
```

DELIMITED BY ',' treats each comma as a separate delimiter, so two consecutive commas produce an empty field and the receiving item is filled with spaces. DELIMITED BY ALL ',' would instead treat the run of commas as a single delimiter, skip the empty field, and shift every later value one position to the left (AUSTIN would land in MIDDLE-NAME). Keep ALL for cases where repeated delimiters are only padding. Getting this right is crucial for correctly parsing CSV data with missing values.
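Where repeated delimiters are only padding, such as runs of spaces between words, collapsing them with ALL is the behavior you want. A minimal sketch, assuming free-format source and a compiler such as GnuCOBOL; the program name SPLITWORDS and all field names are illustrative assumptions.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. SPLITWORDS.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 INPUT-LINE PIC X(40) VALUE 'ALPHA    BETA  GAMMA'.
01 WORD-1     PIC X(10) VALUE SPACES.
01 WORD-2     PIC X(10) VALUE SPACES.
01 WORD-3     PIC X(10) VALUE SPACES.
01 WORD-COUNT PIC 9(4)  VALUE ZERO.
PROCEDURE DIVISION.
MAIN-PARA.
    *> ALL SPACE treats each run of spaces as one delimiter,
    *> so the gaps do not produce empty words
    UNSTRING INPUT-LINE DELIMITED BY ALL SPACE
        INTO WORD-1 WORD-2 WORD-3
        TALLYING IN WORD-COUNT
    END-UNSTRING
    DISPLAY '[' WORD-1 '] [' WORD-2 '] [' WORD-3 ']'
    STOP RUN.
```

Without ALL, the four spaces after ALPHA would be read as one delimiter followed by three empty fields, and BETA and GAMMA would never reach WORD-2 and WORD-3.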
Pipe-delimited format uses the pipe character (|) as a separator. It's common in data exchange formats:
```cobol
WORKING-STORAGE SECTION.
01 PIPE-LINE     PIC X(200) VALUE '12345|JOHN SMITH|AUSTIN|TX|78701'.
01 CUSTOMER-ID   PIC X(10).
01 CUSTOMER-NAME PIC X(30).
01 CITY          PIC X(20).
01 STATE         PIC X(2).
01 ZIP-CODE      PIC X(10).

PROCEDURE DIVISION.
MAIN-PARA.
    UNSTRING PIPE-LINE
        DELIMITED BY '|'
        INTO CUSTOMER-ID
             CUSTOMER-NAME
             CITY
             STATE
             ZIP-CODE
    END-UNSTRING

    DISPLAY 'Parsed pipe-delimited data'
    STOP RUN.
```
You can parse data that uses multiple possible delimiters:
```cobol
WORKING-STORAGE SECTION.
01 INPUT-LINE   PIC X(200) VALUE 'ID=12345|NAME=JOHN|CITY=AUSTIN'.
01 ID-FIELD     PIC X(10).
01 NAME-FIELD   PIC X(30).
01 CITY-FIELD   PIC X(20).
01 FILLER-FIELD PIC X(20).

PROCEDURE DIVISION.
MAIN-PARA.
    *> Parse alternating label=value pairs
    UNSTRING INPUT-LINE
        DELIMITED BY '=' OR '|'
        INTO FILLER-FIELD   *> Skip "ID"
             ID-FIELD       *> Get "12345"
             FILLER-FIELD   *> Skip "NAME"
             NAME-FIELD     *> Get "JOHN"
             FILLER-FIELD   *> Skip "CITY"
             CITY-FIELD     *> Get "AUSTIN"
    END-UNSTRING

    DISPLAY 'ID: ' ID-FIELD
    DISPLAY 'Name: ' NAME-FIELD
    DISPLAY 'City: ' CITY-FIELD
    STOP RUN.
```
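This pattern alternates between labels and values. By unstringing labels into a throwaway field (FILLER-FIELD here) and values into named fields, you can extract just the data you need from formatted input like "KEY=VALUE|KEY=VALUE". A true FILLER item cannot be used for this, because it has no name to reference in the INTO list.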
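When the number of pairs is not known in advance, the same idea can be applied one pair at a time by carrying the position forward with WITH POINTER. The following is a rough sketch, assuming free-format source and a compiler that provides FUNCTION TRIM (GnuCOBOL, for example); the program name PAIRLOOP and the LINE-PTR and LINE-LEN items are illustrative assumptions.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. PAIRLOOP.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 INPUT-LINE  PIC X(60) VALUE 'ID=12345|NAME=JOHN|CITY=AUSTIN'.
01 KEY-FIELD   PIC X(10).
01 VALUE-FIELD PIC X(20).
01 LINE-PTR    PIC 9(4) VALUE 1.
01 LINE-LEN    PIC 9(4) VALUE ZERO.
PROCEDURE DIVISION.
MAIN-PARA.
    *> Length of the data without the trailing padding
    COMPUTE LINE-LEN =
        FUNCTION LENGTH(FUNCTION TRIM(INPUT-LINE TRAILING))

    *> Each pass extracts one key and one value, and the pointer
    *> is left just past the delimiter that ended the value
    PERFORM UNTIL LINE-PTR > LINE-LEN
        MOVE SPACES TO KEY-FIELD VALUE-FIELD
        UNSTRING INPUT-LINE(1:LINE-LEN)
            DELIMITED BY '=' OR '|'
            INTO KEY-FIELD VALUE-FIELD
            WITH POINTER LINE-PTR
        END-UNSTRING
        DISPLAY KEY-FIELD ' -> ' VALUE-FIELD
    END-PERFORM
    STOP RUN.
```

Limiting the source to its first LINE-LEN characters keeps the trailing padding from being parsed as an extra, empty pair. The sketch does no error handling: a key with no '=' after it would simply pick up the next token as its value.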
Text normalization converts text data to a consistent format, making it easier to compare, search, and process. Common normalization operations include case conversion, whitespace handling, and character standardization.
Converting text to consistent case (uppercase or lowercase) ensures comparisons work correctly:
```cobol
WORKING-STORAGE SECTION.
01 USER-INPUT  PIC X(30) VALUE 'John Smith'.
01 UPPER-NAME  PIC X(30).
01 LOWER-NAME  PIC X(30).
01 SEARCH-NAME PIC X(30) VALUE 'JOHN SMITH'.

PROCEDURE DIVISION.
MAIN-PARA.
    *> Convert to uppercase
    MOVE FUNCTION UPPER-CASE(USER-INPUT) TO UPPER-NAME

    *> Convert to lowercase
    MOVE FUNCTION LOWER-CASE(USER-INPUT) TO LOWER-NAME

    DISPLAY 'Original: ' USER-INPUT
    DISPLAY 'Uppercase: ' UPPER-NAME
    DISPLAY 'Lowercase: ' LOWER-NAME

    *> Now comparisons work correctly
    IF UPPER-NAME = SEARCH-NAME
        DISPLAY 'Match found!'
    END-IF

    STOP RUN.
```
FUNCTION UPPER-CASE converts all alphabetic characters to uppercase while preserving numbers, spaces, and special characters. FUNCTION LOWER-CASE does the opposite. These functions are essential for case-insensitive comparisons and data standardization.
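Removing leading and trailing spaces normalizes text fields. Fixed-length COBOL fields are always padded with trailing spaces, so moving a value to a smaller field only cuts off whatever does not fit:

```cobol
WORKING-STORAGE SECTION.
01 SOURCE-FIELD  PIC X(30) VALUE 'JOHN SMITH'.
01 TRIMMED-FIELD PIC X(15).

PROCEDURE DIVISION.
MAIN-PARA.
    *> Moving to a smaller field drops the rightmost characters.
    *> No trimming logic runs; the target is still space-padded.
    MOVE SOURCE-FIELD TO TRIMMED-FIELD
    DISPLAY 'Source: [' SOURCE-FIELD ']'
    DISPLAY 'Trimmed: [' TRIMMED-FIELD ']'
    STOP RUN.
```

When you move a field to a smaller PIC X field, COBOL truncates on the right. TRIMMED-FIELD (PIC X(15)) will contain "JOHN SMITH" followed by five spaces, not the 20 trailing spaces of the 30-character source field. The MOVE does not check what it drops, so this is only safe when the content is known to fit; a longer value would be silently truncated. To work with the content alone, take its length (for example with FUNCTION TRIM, where available) and use reference modification.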
Leading spaces have to be skipped explicitly:

```cobol
WORKING-STORAGE SECTION.
01 SOURCE-FIELD  PIC X(30) VALUE '     JOHN SMITH'.
01 TRIMMED-FIELD PIC X(30).
01 SPACE-COUNT   PIC 9(4) VALUE ZERO.
01 START-POS     PIC 9(4).

PROCEDURE DIVISION.
MAIN-PARA.
    *> Count leading spaces
    MOVE ZERO TO SPACE-COUNT
    INSPECT SOURCE-FIELD
        TALLYING SPACE-COUNT FOR LEADING SPACE

    IF SPACE-COUNT < FUNCTION LENGTH(SOURCE-FIELD)
        *> Calculate starting position (1-based)
        COMPUTE START-POS = SPACE-COUNT + 1
        *> Copy everything from the first non-space character on;
        *> the MOVE pads the result with trailing spaces
        MOVE SOURCE-FIELD(START-POS:) TO TRIMMED-FIELD
    ELSE
        *> The field is all spaces, so there is nothing to extract
        MOVE SPACES TO TRIMMED-FIELD
    END-IF

    DISPLAY 'Original: [' SOURCE-FIELD ']'
    DISPLAY 'Trimmed: [' TRIMMED-FIELD ']'
    STOP RUN.
```

This approach uses INSPECT to count leading spaces, then reference modification to copy the content that starts after them. UNSTRING with DELIMITED BY ALL SPACE is not suitable for this extraction, because it stops at the first embedded space and would return only "JOHN".
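Where the compiler provides it, FUNCTION TRIM removes leading and trailing spaces in one step. It is a newer intrinsic function that not every compiler supports, so treat the following as a sketch under that assumption; it uses free-format source, and the program name TRIMDEMO is illustrative.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. TRIMDEMO.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 SOURCE-FIELD  PIC X(30) VALUE '     JOHN SMITH'.
01 TRIMMED-FIELD PIC X(30).
PROCEDURE DIVISION.
MAIN-PARA.
    *> TRIM with no keyword removes spaces from both ends;
    *> the MOVE then pads the receiving field on the right
    MOVE FUNCTION TRIM(SOURCE-FIELD) TO TRIMMED-FIELD
    DISPLAY 'Stored: [' TRIMMED-FIELD ']'

    *> Using the function result directly avoids the padding
    DISPLAY 'Direct: [' FUNCTION TRIM(SOURCE-FIELD) ']'
    STOP RUN.
```

TRIM also accepts LEADING or TRAILING as a second argument when only one side should be trimmed.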
Normalize multiple spaces to single spaces:

```cobol
WORKING-STORAGE SECTION.
01 TEXT-FIELD       PIC X(50) VALUE 'JOHN   SMITH    DOE'.
01 NORMALIZED-FIELD PIC X(50) VALUE SPACES.
01 CHAR-PTR         PIC 9(4).
01 OUT-PTR          PIC 9(4) VALUE ZERO.
01 PREV-CHAR        PIC X VALUE SPACE.

PROCEDURE DIVISION.
MAIN-PARA.
    *> Copy one character at a time, skipping any space that
    *> directly follows another space. PREV-CHAR starts as a
    *> space, so leading spaces are dropped as well.
    PERFORM VARYING CHAR-PTR FROM 1 BY 1
            UNTIL CHAR-PTR > FUNCTION LENGTH(TEXT-FIELD)
        IF TEXT-FIELD(CHAR-PTR:1) NOT = SPACE
           OR PREV-CHAR NOT = SPACE
            ADD 1 TO OUT-PTR
            MOVE TEXT-FIELD(CHAR-PTR:1)
              TO NORMALIZED-FIELD(OUT-PTR:1)
        END-IF
        MOVE TEXT-FIELD(CHAR-PTR:1) TO PREV-CHAR
    END-PERFORM

    DISPLAY 'Original: [' TEXT-FIELD ']'
    DISPLAY 'Normalized: [' NORMALIZED-FIELD ']'
    STOP RUN.
```

This copies the text one character at a time and skips any space that directly follows another space, so each run of spaces collapses to one. The result is text with normalized spacing: "JOHN SMITH DOE" instead of "JOHN   SMITH    DOE". INSPECT REPLACING cannot do this, because the replacement must be the same length as the text it replaces.
Validating text data ensures it meets expected formats, lengths, and content requirements before processing or storage.
When parsing delimited data, verify you received the expected number of fields:
```cobol
WORKING-STORAGE SECTION.
01 CSV-LINE       PIC X(200).
01 FIELD-1        PIC X(20).
01 FIELD-2        PIC X(20).
01 FIELD-3        PIC X(20).
01 FIELD-COUNT    PIC 9(4) VALUE ZERO.
01 EXPECTED-COUNT PIC 9(4) VALUE 3.

PROCEDURE DIVISION.
MAIN-PARA.
    MOVE 'VALUE1,VALUE2,VALUE3' TO CSV-LINE

    UNSTRING CSV-LINE
        DELIMITED BY ','
        INTO FIELD-1
             FIELD-2
             FIELD-3
        TALLYING IN FIELD-COUNT
    END-UNSTRING

    IF FIELD-COUNT NOT = EXPECTED-COUNT
        DISPLAY 'ERROR: Expected ' EXPECTED-COUNT
                ' fields, got ' FIELD-COUNT
        STOP RUN
    END-IF

    DISPLAY 'Validation passed: ' FIELD-COUNT ' fields'
    STOP RUN.
```
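TALLYING IN adds the number of receiving fields that were filled to the counter, so the counter must start at zero before each UNSTRING. If the source holds fewer fields than expected, the remaining receiving fields are left unchanged, so clear them before parsing if stale values would be a problem. Always validate that the field count matches expectations to catch malformed input.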
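The field count alone cannot reveal a line that holds more fields than there are receiving items, because the count stops at the number of receivers. The ON OVERFLOW phrase covers that case: it is triggered when every receiving field has been filled and unexamined characters remain in the source. A minimal sketch, assuming free-format source; the program name OVERFLOWDEMO and the sample data are illustrative assumptions.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. OVERFLOWDEMO.
DATA DIVISION.
WORKING-STORAGE SECTION.
*> Four values, but only three receiving fields below
01 CSV-LINE PIC X(40) VALUE 'A,B,C,D'.
01 FIELD-1  PIC X(5).
01 FIELD-2  PIC X(5).
01 FIELD-3  PIC X(5).
PROCEDURE DIVISION.
MAIN-PARA.
    UNSTRING CSV-LINE DELIMITED BY ','
        INTO FIELD-1 FIELD-2 FIELD-3
        ON OVERFLOW
            DISPLAY 'ERROR: more fields than expected'
        NOT ON OVERFLOW
            DISPLAY 'Line accepted'
    END-UNSTRING
    STOP RUN.
```

A line with exactly three values passes, since the last receiving field consumes the rest of the source, padding included. A line that ends with a trailing comma is reported as overflow, because the padding after that comma is left unexamined.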
Check that parsed fields don't exceed maximum lengths:
```cobol
WORKING-STORAGE SECTION.
01 INPUT-FIELD   PIC X(50).
01 MAX-LENGTH    PIC 9(4) VALUE 20.
01 ACTUAL-LENGTH PIC 9(4).

PROCEDURE DIVISION.
MAIN-PARA.
    MOVE 'THIS IS A VERY LONG FIELD THAT EXCEEDS LIMIT'
      TO INPUT-FIELD

    *> Get actual length (trimmed)
    COMPUTE ACTUAL-LENGTH = FUNCTION LENGTH(
        FUNCTION TRIM(INPUT-FIELD)
    )

    IF ACTUAL-LENGTH > MAX-LENGTH
        DISPLAY 'ERROR: Field length ' ACTUAL-LENGTH
                ' exceeds maximum ' MAX-LENGTH
    ELSE
        DISPLAY 'Field length valid: ' ACTUAL-LENGTH
    END-IF

    STOP RUN.
```
FUNCTION LENGTH returns the length of a string. Combined with FUNCTION TRIM (if available), you can validate that trimmed field lengths are within acceptable ranges. This prevents data truncation and ensures data integrity.
Verify that data matches expected formats (numeric, alphabetic, date format, etc.):
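```cobol
WORKING-STORAGE SECTION.
01 ZIP-CODE    PIC X(10) VALUE '78701'.
01 ZIP-NUMERIC PIC 9(5).
01 IS-VALID    PIC X VALUE 'N'.

PROCEDURE DIVISION.
MAIN-PARA.
    *> Validate that the ZIP code is five digits. Testing the whole
    *> PIC X(10) field would fail because of its trailing spaces.
    IF ZIP-CODE(1:5) IS NUMERIC AND ZIP-CODE(6:5) = SPACES
        MOVE ZIP-CODE(1:5) TO ZIP-NUMERIC
        MOVE 'Y' TO IS-VALID
        DISPLAY 'ZIP code is valid: ' ZIP-CODE
    ELSE
        DISPLAY 'ERROR: ZIP code must be numeric: ' ZIP-CODE
    END-IF

    STOP RUN.
```

COBOL provides class tests such as IS NUMERIC, IS ALPHABETIC, IS ALPHABETIC-UPPER, and IS ALPHABETIC-LOWER to validate field contents. IS NUMERIC on an alphanumeric field is true only when every character is a digit, so padding spaces make it fail; test just the portion that should hold digits. Use these tests to ensure data matches expected formats before processing.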
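The built-in alphabetic tests also accept spaces, which is not always wanted. A user-defined class, declared in the SPECIAL-NAMES paragraph, allows a stricter check. The sketch below assumes free-format source; the class name UPPER-LETTER, the program name CLASSDEMO, and the STATE-CODE field are illustrative assumptions.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. CLASSDEMO.
ENVIRONMENT DIVISION.
CONFIGURATION SECTION.
SPECIAL-NAMES.
    *> A class containing only the uppercase letters (no space)
    CLASS UPPER-LETTER IS 'A' THRU 'Z'.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 STATE-CODE PIC X(2) VALUE 'TX'.
PROCEDURE DIVISION.
MAIN-PARA.
    *> True only if every character of the field is in the class
    IF STATE-CODE IS UPPER-LETTER
        DISPLAY 'State code is valid'
    ELSE
        DISPLAY 'ERROR: state code must be two uppercase letters'
    END-IF
    STOP RUN.
```

A value such as 'T ' passes IS ALPHABETIC-UPPER but fails this test. On EBCDIC systems the range 'A' THRU 'Z' also spans a few non-letter code points, so listing the three letter ranges separately is safer there.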
Ensure required fields are not empty:
```cobol
WORKING-STORAGE SECTION.
01 CUSTOMER-NAME PIC X(30) VALUE SPACES.
01 IS-EMPTY      PIC X VALUE 'N'.

PROCEDURE DIVISION.
MAIN-PARA.
    *> Check if field is empty (all spaces)
    IF CUSTOMER-NAME = SPACES
        MOVE 'Y' TO IS-EMPTY
        DISPLAY 'ERROR: Customer name is required'
    ELSE
        DISPLAY 'Customer name is valid: ' CUSTOMER-NAME
    END-IF

    STOP RUN.
```

Comparing the field with SPACES is true only when every character is a space, which is how an empty fixed-length field looks. Counting with INSPECT ... TALLYING ... FOR CHARACTERS does not work for this check: it counts every character, space or not, so the count always equals the field length. This validation ensures required fields have actual data.
Data cleaning removes unwanted characters, fixes formatting issues, and prepares data for processing:
```cobol
WORKING-STORAGE SECTION.
01 PHONE-NUMBER PIC X(20) VALUE '(555) 123-4567'.
01 CLEAN-PHONE  PIC X(20) VALUE SPACES.
01 IN-PTR       PIC 9(4).
01 OUT-PTR      PIC 9(4) VALUE ZERO.

PROCEDURE DIVISION.
MAIN-PARA.
    *> INSPECT REPLACING can only swap characters one for one,
    *> so copy the digits and skip everything else
    PERFORM VARYING IN-PTR FROM 1 BY 1
            UNTIL IN-PTR > FUNCTION LENGTH(PHONE-NUMBER)
        IF PHONE-NUMBER(IN-PTR:1) IS NUMERIC
            ADD 1 TO OUT-PTR
            MOVE PHONE-NUMBER(IN-PTR:1)
              TO CLEAN-PHONE(OUT-PTR:1)
        END-IF
    END-PERFORM

    DISPLAY 'Original: ' PHONE-NUMBER
    DISPLAY 'Cleaned: ' CLEAN-PHONE
    STOP RUN.
```

This copies only the digits of the phone number, dropping the formatting characters (parentheses, spaces, dashes), so CLEAN-PHONE holds "5551234567" followed by padding. INSPECT REPLACING alone is not enough here: it substitutes one character for another and cannot remove characters, so replacing the formatting with zeros or spaces would corrupt the number.
```cobol
WORKING-STORAGE SECTION.
*> A VALUE clause takes a single literal, so the tab is placed
*> in the text through its own elementary item
01 TEXT-LINE.
   05 FILLER PIC X(9)  VALUE 'TEXT WITH'.
   05 FILLER PIC X     VALUE X'09'.
   05 FILLER PIC X(90) VALUE 'TABS'.
01 CLEAN-LINE PIC X(100).

PROCEDURE DIVISION.
MAIN-PARA.
    MOVE TEXT-LINE TO CLEAN-LINE

    *> Replace tab character (X'09') with space
    INSPECT CLEAN-LINE
        REPLACING ALL X'09' BY SPACE

    *> Replace other control characters
    INSPECT CLEAN-LINE
        REPLACING ALL X'0D' BY SPACE   *> Carriage return
    INSPECT CLEAN-LINE
        REPLACING ALL X'0A' BY SPACE   *> Line feed

    DISPLAY 'Cleaned text: ' CLEAN-LINE
    STOP RUN.
```

Control characters like tabs (X'09'), carriage returns (X'0D'), and line feeds (X'0A') can cause issues, so replace them with spaces to clean input data. The hexadecimal values shown are ASCII codes. On an EBCDIC mainframe the tab is X'05' and the line feed is X'25', while the carriage return is still X'0D'.
Many text processing tasks involve converting between different text formats:
```cobol
WORKING-STORAGE SECTION.
01 DATE-INPUT  PIC X(10) VALUE '12/25/2023'.
*> DAY is a reserved word, so the parts carry a DATE- prefix
01 DATE-MONTH  PIC X(2).
01 DATE-DAY    PIC X(2).
01 DATE-YEAR   PIC X(4).
01 DATE-OUTPUT PIC X(10) VALUE SPACES.

PROCEDURE DIVISION.
MAIN-PARA.
    *> Parse MM/DD/YYYY format
    UNSTRING DATE-INPUT
        DELIMITED BY '/'
        INTO DATE-MONTH
             DATE-DAY
             DATE-YEAR
    END-UNSTRING

    *> Convert to YYYY-MM-DD format
    STRING DATE-YEAR  DELIMITED BY SIZE
           '-'        DELIMITED BY SIZE
           DATE-MONTH DELIMITED BY SIZE
           '-'        DELIMITED BY SIZE
           DATE-DAY   DELIMITED BY SIZE
        INTO DATE-OUTPUT
    END-STRING

    DISPLAY 'Input: ' DATE-INPUT
    DISPLAY 'Output: ' DATE-OUTPUT
    STOP RUN.
```
This parses a date from MM/DD/YYYY format and converts it to YYYY-MM-DD format. UNSTRING extracts the components, then STRING rebuilds them in the new format.
```cobol
WORKING-STORAGE SECTION.
01 FIRST-NAME     PIC X(20) VALUE 'JOHN'.
01 LAST-NAME      PIC X(20) VALUE 'SMITH'.
*> STRING does not clear or pad the receiving field, so both
*> targets start out as spaces
01 FULL-NAME      PIC X(42) VALUE SPACES.
01 FORMATTED-NAME PIC X(50) VALUE SPACES.

PROCEDURE DIVISION.
MAIN-PARA.
    *> Build full name
    STRING FIRST-NAME DELIMITED BY SPACE
           ' '        DELIMITED BY SIZE
           LAST-NAME  DELIMITED BY SPACE
        INTO FULL-NAME
    END-STRING

    *> Format as "Last, First"
    STRING LAST-NAME  DELIMITED BY SPACE
           ', '       DELIMITED BY SIZE
           FIRST-NAME DELIMITED BY SPACE
        INTO FORMATTED-NAME
    END-STRING

    DISPLAY 'Full Name: ' FULL-NAME
    DISPLAY 'Formatted: ' FORMATTED-NAME
    STOP RUN.
```
This demonstrates building formatted names from components. STRING with DELIMITED BY SPACE avoids copying trailing spaces, keeping the output clean.
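STRING stops silently when the receiving field is full, so it is worth checking for overflow when the combined length can vary. A minimal sketch using the ON OVERFLOW phrase, assuming free-format source; the program name STRINGOVF and the deliberately short SHORT-NAME field are illustrative assumptions.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. STRINGOVF.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 FIRST-NAME PIC X(20) VALUE 'JOHN'.
01 LAST-NAME  PIC X(20) VALUE 'SMITH'.
*> Too short for "SMITH, JOHN" (11 characters)
01 SHORT-NAME PIC X(8)  VALUE SPACES.
PROCEDURE DIVISION.
MAIN-PARA.
    STRING LAST-NAME  DELIMITED BY SPACE
           ', '       DELIMITED BY SIZE
           FIRST-NAME DELIMITED BY SPACE
        INTO SHORT-NAME
        ON OVERFLOW
            DISPLAY 'ERROR: name does not fit'
        NOT ON OVERFLOW
            DISPLAY 'Name: ' SHORT-NAME
    END-STRING
    STOP RUN.
```

When overflow occurs, the receiving field still holds the part that fit, so the program should treat its content as incomplete.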
Follow these best practices for effective text processing:
```cobol
WORKING-STORAGE SECTION.
01 CSV-LINE        PIC X(200).
01 FIELD-1         PIC X(20).
01 FIELD-2         PIC X(20).
01 FIELD-3         PIC X(20).
01 FIELD-COUNT     PIC 9(4) VALUE ZERO.
01 EXPECTED-FIELDS PIC 9(4) VALUE 3.

PROCEDURE DIVISION.
PARSE-CSV.
    *> Reset state so the paragraph can be performed repeatedly
    MOVE ZERO TO FIELD-COUNT
    MOVE SPACES TO FIELD-1 FIELD-2 FIELD-3

    *> No ALL, so an empty field between two commas is kept
    UNSTRING CSV-LINE
        DELIMITED BY ','
        INTO FIELD-1
             FIELD-2
             FIELD-3
        TALLYING IN FIELD-COUNT
    END-UNSTRING

    IF FIELD-COUNT NOT = EXPECTED-FIELDS
        DISPLAY 'ERROR: Invalid CSV format'
        STOP RUN
    END-IF

    *> Process fields...
    EXIT.
```
```cobol
WORKING-STORAGE SECTION.
01 USER-INPUT       PIC X(30).
01 NORMALIZED-INPUT PIC X(30).
01 IS-VALID         PIC X VALUE 'N'.

PROCEDURE DIVISION.
NORMALIZE-INPUT.
    MOVE 'N' TO IS-VALID
    MOVE SPACES TO NORMALIZED-INPUT

    *> Validate not empty before trimming
    IF USER-INPUT NOT = SPACES
        *> Trim both ends, then convert to uppercase; the MOVE
        *> pads the result with trailing spaces again
        MOVE FUNCTION UPPER-CASE(FUNCTION TRIM(USER-INPUT))
          TO NORMALIZED-INPUT
        MOVE 'Y' TO IS-VALID
    END-IF
    EXIT.
```
```cobol
WORKING-STORAGE SECTION.
01 INPUT-LINE PIC X(200) VALUE 'ID=123|NAME=JOHN|CITY=AUSTIN'.
*> One table entry per pair, so earlier pairs are not overwritten
01 PAIR-TABLE.
   05 PAIR-ENTRY OCCURS 3 TIMES.
      10 KEY-FIELD   PIC X(10).
      10 VALUE-FIELD PIC X(20).
01 FIELD-COUNT PIC 9(4) VALUE ZERO.
01 PAIR-COUNT  PIC 9(4) VALUE ZERO.

PROCEDURE DIVISION.
PARSE-KEY-VALUE.
    MOVE SPACES TO PAIR-TABLE
    MOVE ZERO TO FIELD-COUNT

    *> Parse alternating key=value pairs
    UNSTRING INPUT-LINE
        DELIMITED BY '=' OR '|'
        INTO KEY-FIELD(1) VALUE-FIELD(1)
             KEY-FIELD(2) VALUE-FIELD(2)
             KEY-FIELD(3) VALUE-FIELD(3)
        TALLYING IN FIELD-COUNT
    END-UNSTRING

    *> TALLYING counts fields, and each pair uses two of them
    DIVIDE FIELD-COUNT BY 2 GIVING PAIR-COUNT

    *> Process key-value pairs...
    EXIT.
```
Think of text processing like organizing a messy toy box:
So text processing is all about taking messy, unorganized text data and making it neat, consistent, and ready to use, just like organizing and cleaning your toys!
Complete these exercises to reinforce your understanding of text processing:
Create a program that parses a CSV line with 5 fields (ID, First Name, Last Name, Email, Phone). Validate that all 5 fields are present, and display each field. Handle empty fields correctly.
Create a program that normalizes user input: converts to uppercase, trims leading and trailing spaces, and normalizes multiple spaces to single spaces. Display the original and normalized versions.
Create a program that validates phone numbers. Accept input in various formats (with/without dashes, parentheses, spaces), clean the input to remove formatting, and validate that it contains exactly 10 digits.
Create a program that parses a date in MM/DD/YYYY format, validates the components (month 1-12, day 1-31, year reasonable), and converts it to YYYY-MM-DD format. Handle invalid dates with error messages.
Create a program that parses a string in the format "KEY1=VALUE1|KEY2=VALUE2|KEY3=VALUE3". Extract each key-value pair, normalize the keys to uppercase, and display them in a formatted list.
1. What is the primary purpose of text processing in COBOL?
2. How do you parse a comma-separated value (CSV) line in COBOL?
3. How do you convert text to uppercase in COBOL?
4. What does DELIMITED BY ALL do in UNSTRING?
5. How do you trim trailing spaces from a text field?
6. What does TALLYING IN do in UNSTRING?