Introduction

  • Lexical Analysis is the first layer/phase of a compiler.
  • The Lexical Analysis phase is also called Scanning.
  • A Pattern in this phase is a rule (often expressed with regular expressions) that defines how lexemes are recognized as tokens.
  • A Lexeme in this phase is the actual string of characters that forms a token.

Definition

  • Lexical Analysis reads/scans the source program character by character from left to right and converts sequences of characters into groups of meaningful units, called tokens.

Characteristics

  • In this phase, source code is scanned character by character and grouped into meaningful units called tokens (the smallest meaningful units in a program, e.g., keywords, identifiers, operators, literals).
  • Lexical Analysis is performed by the Lexical Analyser (also called the Scanner or Lexer).
  • The Lexical Analysis phase acts as a bridge between the raw source code and the parser, ensuring the input the parser receives is clean and structured.
  • Lexical Analysis transforms messy text into a structured stream of tokens, making it the foundation of compiler design.
  • The Lexical Analyzer (or Scanner) is thus the first processing unit of the compiler's front end.
  • In this phase, the Lexical Analyser acts like a service: whenever the parser (the Syntax Analysis phase) requests the next token, the lexical analyzer supplies it.

Categories of Tokens

A token may exist in one of the following forms –

  • Keywords: Reserved words like if, else, while, return, etc.
  • Identifiers: Names for variables, functions, arrays (e.g., x, sum).
  • Operators: Symbols like +, -, *, /.
  • Literals: Constants such as numbers (10, 3.14) or strings (“Hello”).
  • Punctuation: Delimiters such as the semicolon (;), comma (,), and braces ({ and }).

Importance/Advantages

  • This phase simplifies the parsing process by abstracting raw characters into structured tokens.
  • This phase detects lexical errors early (e.g., illegal characters).
  • This phase provides a clean interface between source code and syntax analysis.

Role/Function of the Lexical Analyzer (Lexer/Scanner)

  • The following major roles are performed by the Lexical Analyser phase –
    • Tokenization
      • In this step, the Lexical Analyser converts the stream of characters from source code into a group of tokens (keywords, identifiers, operators, literals, punctuation).
      • For example, in C programming, the statement
int x = 10;
produces the group of tokens: int (keyword token), x (identifier token), = (operator token), 10 (literal token), ; (punctuation token).
    • Removing Noise
      • In this step, the Lexical Analyser eliminates whitespace, tabs, and comments that are irrelevant for syntax analysis. 
      • This step ensures the parser receives only meaningful tokens.
    • Error Detection
      • In this step, the Lexical Analyser detects lexical errors such as illegal characters or malformed identifiers.
      • For example: int 9x ; → Error: identifier cannot start with a digit.
    • Symbol Table Management
      • In this step, the Lexical Analyser helps to build and maintain the symbol table, which 
        • Stores identifiers (variables, functions, constants).
        • Records attributes like type, scope, and memory location.
    • Interface with Parser
      • In this step, the Lexical Analyser provides tokens on demand to the next phase of compiler design – the syntax analyzer.
    • Efficiency
      • The Lexical Analyser improves compiler performance by buffering input and handling lookahead efficiently.
      • It uses finite automata and regular expressions to recognize token patterns quickly.
  • In brief, the function of the Lexical Analyzer is –
    • To take the input: raw source code.
    • To scan the input character by character.
    • To remove whitespace and comments.
    • To group characters into lexemes and classify them as tokens – keywords, identifiers, operators, etc.
    • To detect invalid tokens and report errors, if any.
    • To update the symbol table for identifiers.
    • To pass the resulting tokens to the parser as structured input.

Finally, the Lexical Analyzer cleans and structures the raw source code into tokens, manages identifiers, and ensures the parser receives error-free, meaningful input.
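
To make these steps concrete, below is a minimal hand-written scanner sketch in C. It is an illustration only, not the implementation of any particular compiler; the token names and the is_keyword() helper are assumptions made for this example. It tokenizes a statement such as int x = 10; exactly as described above.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative token categories, mirroring the list above. */
typedef enum { KEYWORD, IDENTIFIER, OPERATOR, LITERAL, PUNCTUATION } TokenType;
static const char *names[] =
    { "KEYWORD", "IDENTIFIER", "OPERATOR", "LITERAL", "PUNCTUATION" };

/* Hypothetical helper: a tiny fixed keyword list. */
static int is_keyword(const char *s) {
    return strcmp(s, "int") == 0 || strcmp(s, "float") == 0 ||
           strcmp(s, "if") == 0  || strcmp(s, "while") == 0;
}

static void emit(TokenType t, const char *lexeme) {
    printf("%-11s %s\n", names[t], lexeme);
}

/* Scan left to right, skipping whitespace and grouping characters into tokens. */
static void scan(const char *p) {
    char buf[64];
    int n;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* remove noise */
        if (isalpha((unsigned char)*p)) {                    /* keyword or identifier */
            for (n = 0; isalnum((unsigned char)*p); ) buf[n++] = *p++;
            buf[n] = '\0';
            emit(is_keyword(buf) ? KEYWORD : IDENTIFIER, buf);
        } else if (isdigit((unsigned char)*p)) {             /* numeric literal */
            for (n = 0; isdigit((unsigned char)*p); ) buf[n++] = *p++;
            buf[n] = '\0';
            emit(LITERAL, buf);
        } else {                                             /* single-character token */
            buf[0] = *p++; buf[1] = '\0';
            emit(strchr("+-*/=<>", buf[0]) ? OPERATOR : PUNCTUATION, buf);
        }
    }
}

int main(void) {
    scan("int x = 10;");   /* prints: KEYWORD int, IDENTIFIER x, ... */
    return 0;
}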

                  Working Mechanism of the Lexical Analyzer (Lexer/Scanner)

                  The Lexical Analyzer has several architectural parts that perform unique operations. These are –

                  (a.) Input Buffering 
                    • Input buffering is a technique used in the lexical analysis phase of a compiler to efficiently read the source program. Since source code is processed character by character, buffering reduces the overhead of frequent disk or system calls by storing chunks of the input in memory.
                    • Why Input Buffering is Needed
                      • Reading one character at a time directly from the disk is slow.
                      • Lexical analyzers often need lookahead (examining upcoming characters) to recognize tokens.
                      • Buffering ensures smooth scanning and avoids repeated I/O operations.
                    • Structure of Input Buffering
                      • Typically, input buffering uses a two-buffer scheme: Buffer 1 and Buffer 2. Here,
                        • Source code is loaded into Buffer 1 first.
                        • When the forward pointer reaches the end of Buffer 1, Buffer 2 is loaded automatically.
                        • This alternation continues until the entire source program is read.
                        • If the forward pointer encounters EOF, the analyzer knows the input has ended.
                      • Buffer Size: Each buffer holds a fixed number of characters (often equal to one disk block).
                      • Two Pointers:
                        • Lexeme Begin – This pointer marks the start of the current lexeme.
                        • Forward Pointer – This pointer moves ahead to find the end of the lexeme.
                      • Sentinels: Each buffer half ends with a special marker character (commonly eof), so the scanner can detect the end of a buffer, and of the input, with a single test.
                    • For example
                  Suppose the source code is:
                  int x = 10;
                        • Lexeme Begin points to i of int.
                        • Forward Pointer scans ahead until the character after int (the blank) is reached.
                        • The substring int is recognized as a keyword token.
                        • Buffering ensures that scanning continues seamlessly without repeated disk reads.
                    •  Advantages
                      • Input buffering helps in faster input reading.
                      • Input buffering supports lookahead for token recognition.
                      • Input buffering reduces system call overhead.
                      • Input buffering makes lexical analysis more efficient and reliable.

                  Thus, Input buffering is a performance optimization technique that allows the lexical analyzer to read source code efficiently using two buffers and pointer management.
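
A minimal C sketch of this two-buffer scheme follows. It is an illustration under assumptions (the buffer size N, the file name program.c, and the use of '\0' as the sentinel, which presumes NUL-free text input), not production scanner code.

#include <stdio.h>

#define N 4096                 /* buffer size, e.g. one disk block */
#define SENTINEL '\0'          /* marks the end of each buffer half */

static char buf[2 * (N + 1)];  /* two halves, each ending in a sentinel */
static char *forward;          /* scans ahead to find the lexeme's end */
static char *lexeme_begin;     /* marks the start of the current lexeme */
static FILE *src;

/* Read up to N characters into one half and terminate it with a sentinel. */
static void load(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;        /* a sentinel mid-half signals real end of input */
}

/* Return the next character, reloading the other half when a sentinel is hit. */
static int next_char(void) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf + N + 1) {              /* end of first half */
            load(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * (N + 1)) { /* end of second half */
            load(buf);
            forward = buf;
        } else {
            return EOF;                            /* sentinel inside a half: EOF */
        }
        c = *forward++;
        if (c == SENTINEL) return EOF;             /* empty refill: EOF */
    }
    return (unsigned char)c;
}

int main(void) {
    src = fopen("program.c", "r");  /* hypothetical input file */
    if (!src) return 1;
    load(buf);
    forward = lexeme_begin = buf;   /* both pointers start at the first character */
    long count = 0;
    int c;
    while ((c = next_char()) != EOF) count++;
    printf("%ld characters scanned\n", count);
    fclose(src);
    return 0;
}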

                   (b.) Specification of Tokens 
                    • The specification of tokens is the formal description of how tokens (the smallest meaningful units of a program) are defined and recognized during lexical analysis.
                    • Tokens are specified using patterns, most commonly expressed in regular expressions, which describe the structure of valid lexemes in the source code.
                    • A token is represented as a pair:
                      • Token name – the symbolic category (e.g., identifier, keyword, number).
                      • Attribute value – optional information about the token (e.g., the actual lexeme string, or a pointer to the symbol-table entry).
                    • For example:
                  int x = 10;
                        • int – Token name: keyword, attribute: none
                        • x – Token name: identifier, attribute: pointer to symbol table entry
                        • 10 – Token name: number, attribute: numeric value
                    • Tokens are formally specified using regular expressions because they can describe sets of strings concisely. For example:
                      • Identifier: [A-Za-z][A-Za-z0-9]*
                      • Number: [0-9]+
                      • Keyword: if | else | while | return
                      • Operator: + | - | * | /
                    • Tokens may be categorized into –
                      • Keywords → Reserved words (int, float, while)
                      • Identifiers → User-defined names (sum, price)
                      • Constants/Literals → Numeric or string values (42, "Hello")
                      • Operators → Arithmetic, relational, logical (+, ==, &&)
                      • Punctuation/Delimiters → Symbols like ;, {}, ()
                    • Roles in Lexical Analysis
                      • The lexical analyzer uses token specifications to match input strings against patterns.
                      • Each recognized lexeme is classified into its token category.
                      • Tokens are then passed to the syntax analyzer for grammar checking.
                      • Example:
                    Suppose the C program contains
                    float price = 99.5;
                    Then the token specification is:
                          • float – Keyword
                          • price – Identifier
                          • = – Operator
                          • 99.5 – Literal
                          • ; – Punctuation

                    Thus, the specification of tokens defines the rules for identifying valid lexemes in a programming language. Tokens are described using regular expressions, categorized into keywords, identifiers, literals, operators, and punctuation. This specification ensures that the lexical analyzer can systematically recognize and classify input into meaningful units for further compilation.
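
The (token name, attribute value) pair maps naturally onto code. The following small C sketch, with type names and the sample symbol-table entry invented for this illustration, models the token stream for int x = 10;.

#include <stdio.h>

/* Symbolic token categories, as listed above. */
typedef enum { TK_KEYWORD, TK_IDENTIFIER, TK_NUMBER, TK_OPERATOR, TK_PUNCT } TokenName;

/* A symbol-table entry holding an identifier's attributes. */
typedef struct {
    const char *lexeme;   /* e.g. "x" */
    const char *type;     /* e.g. "int"; filled in by later phases */
} SymbolEntry;

/* A token is a pair: a token name plus an optional attribute value. */
typedef struct {
    TokenName name;
    union {
        double       number;  /* attribute of TK_NUMBER */
        SymbolEntry *entry;   /* attribute of TK_IDENTIFIER: symbol-table pointer */
    } attr;
} Token;

int main(void) {
    SymbolEntry x = { "x", "int" };
    Token stream[] = {
        { TK_KEYWORD,    { 0 } },            /* int : no attribute */
        { TK_IDENTIFIER, { .entry = &x } },  /* x   : symbol-table pointer */
        { TK_OPERATOR,   { 0 } },            /* =                          */
        { TK_NUMBER,     { .number = 10 } }, /* 10  : numeric value        */
        { TK_PUNCT,      { 0 } },            /* ;                          */
    };
    printf("%zu tokens; identifier '%s' has type %s\n",
           sizeof stream / sizeof stream[0],
           stream[1].attr.entry->lexeme, stream[1].attr.entry->type);
    return 0;
}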

                    (c) Recognition of Tokens 
                    • Recognition of tokens is the process by which the lexical analyzer identifies substrings of the source code (called lexemes) and classifies them into valid tokens according to the language’s rules. It ensures that the input stream is broken into meaningful units for the parser.
                    • Process of Token Recognition
                      • The lexical analyzer scans the input character stream.
                      • It matches substrings against token specifications (usually defined by regular expressions).
                      • Once a match is found, the substring is recognized as a lexeme belonging to a particular token category.
                      • The token is then passed to the syntax analyzer.
                    • Techniques Used in the Token Recognition Process
                      • Regular Expressions – They define the patterns for tokens (e.g., identifiers, numbers).
                      • Finite Automata (NFA/DFA) – They implement the recognition process:
                        • Regular expressions are converted into an NFA.
                        • NFA is transformed into a DFA for efficient recognition.
                        • DFA scans input character by character to find the longest match.
                          • Steps in Token Recognition Process
                            • Start at Lexeme Begin Pointer – Marks the start of a potential token.
                            • Advance Forward Pointer – Reads characters until a match is found.
                            • Match Against Patterns – Use DFA to check if the substring matches a token specification.
                            • Classify Lexeme – Assign token name and attributes.
                            • Return Token – Send token to parser for syntax analysis.
                          • For example
                          Suppose a C program contains
                          int x = 10;
                          The recognition process yields:
                            • int – Matches keyword pattern – Token: keyword
                            • x – Matches identifier pattern – Token: identifier
                            • = – Matches operator pattern – Token: operator
                            • 10 – Matches number pattern – Token: literal
                            • ; – Matches punctuation pattern – Token: delimiter
                          • Error Handling
                            • If no pattern matches a substring, the lexical analyzer reports a lexical error.
                            • Example: int 9x; → Error, because identifiers cannot start with a digit.

                          Thus, Recognition of tokens is the core function of the lexical analyzer, where input characters are grouped into lexemes and classified into tokens using regular expressions and finite automata. This process ensures that the parser receives a clean, structured stream of tokens for further compilation.
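
To illustrate DFA-based recognition together with the longest-match rule, here is a minimal C sketch. The three states and the numeric token codes are assumptions made for this example; a real scanner's DFA would have many more states.

#include <ctype.h>
#include <stdio.h>

/* Illustrative DFA: state 0 = start, 1 = in identifier, 2 = in number.
   States 1 and 2 are accepting; -1 is the dead state. */
static int step(int state, char c) {
    switch (state) {
    case 0: if (isalpha((unsigned char)c)) return 1;
            if (isdigit((unsigned char)c)) return 2;
            return -1;
    case 1: return isalnum((unsigned char)c) ? 1 : -1;  /* [A-Za-z][A-Za-z0-9]* */
    case 2: return isdigit((unsigned char)c) ? 2 : -1;  /* [0-9]+ */
    default: return -1;
    }
}

/* Advance from lexeme-begin as far as the DFA allows, remembering the
   last accepting position: this implements the longest-match rule. */
static const char *longest_match(const char *begin, int *token) {
    int state = 0;
    const char *p = begin, *last_accept = NULL;
    while (*p && (state = step(state, *p)) != -1) {
        p++;
        if (state == 1 || state == 2) {  /* accepting: remember this position */
            last_accept = p;
            *token = state;              /* 1 = identifier, 2 = number */
        }
    }
    return last_accept;                  /* NULL: no pattern matched */
}

int main(void) {
    const char *src = "sum 42 x9";
    while (*src) {
        if (isspace((unsigned char)*src)) { src++; continue; }
        int token = 0;
        const char *end = longest_match(src, &token);
        if (!end) { printf("lexical error at '%c'\n", *src); return 1; }
        printf("%s: %.*s\n", token == 1 ? "identifier" : "number",
               (int)(end - src), src);
        src = end;                       /* lexeme-begin jumps to the forward pointer */
    }
    return 0;
}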

                          Lex

                          • Lex is a common and widely used lexical analyzer generator tool.
                          • It automatically creates a lexical analyzer (scanner) from a set of token specifications written as regular expressions with associated actions. Lex translates these specifications into a C program built around the function yylex(), which reads source code, recognizes tokens such as identifiers, numbers, and operators, and passes them to the next phase, the parser.
                          • It is widely used in compiler construction because it reduces manual effort and ensures correctness.
                          • Lex rules use regular expressions to match patterns.
                          • How Lex Works
                            • Input to Lex
                              • A Lex specification file is written by the programmer.
                              • This file contains rules that define how tokens are recognized.
                            • Lex Specification File Structure – A Lex program is divided into three sections:
                            Definitions
                            %%
                            Rules
                            %%
                            User Subroutines
                            Here, the Definitions section contains declarations, header files, global variables, etc.
                            The Rules section contains regular expressions with associated actions.
                            The User Subroutines section contains helper functions written in C.
                              • Output of Lex
                                • Lex translates the specification into a C program.
                                • This program contains a function yylex() which performs lexical analysis.
                                • When compiled, it produces an executable lexical analyzer.
                              • A Simple Example of a Lex Program
                            %{
                            #include <stdio.h>
                            %}
                            %%
                            [0-9]+      { printf("NUMBER\n"); }
                            [a-zA-Z]+   { printf("IDENTIFIER\n"); }
                            "+"         { printf("PLUS OPERATOR\n"); }
                            .           { /* ignore other characters */ }
                            %%
                            int main()
                            {
                                yylex();
                                return 0;
                            }

                            Here in the above program,

                                  • [0-9]+ → matches numbers → prints "NUMBER".
                                  • [a-zA-Z]+ → matches identifiers → prints "IDENTIFIER".
                                  • "+" → matches the plus operator → prints "PLUS OPERATOR".
                                  • . → matches any other character → ignored.

                            Now the created Lex program can be used for a C program statement as below –

                            x + 25
                            Output:
                            IDENTIFIER
                            PLUS OPERATOR
                            NUMBER
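
On a typical Unix system, a specification like the one above (saved, say, in a file named scanner.l; the name is arbitrary) is turned into a working scanner in two steps: running lex scanner.l (or flex scanner.l) generates the C file lex.yy.c containing yylex(), and compiling it with cc lex.yy.c -ll (with flex, -lfl) links the library that supplies the default yywrap(). The resulting executable reads standard input and prints the token names, as shown above. The exact library flag varies between lex implementations.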

                             Example 1: Lex program for Recognizing Numbers and Identifiers

                            A Lex Program File (saved as – example1.l):
                            %{
                            #include <stdio.h>
                            %}
                            %%
                            [0-9]+        { printf("NUMBER: %s\n", yytext); }
                            [a-zA-Z][a-zA-Z0-9]*   { printf("IDENTIFIER: %s\n", yytext); }
                            .              { /* ignore other characters */ }
                            %%
                            int main()
                            {
                                yylex();    //scans input code and applies Lex rules
                                return 0;
                            }
                            A C Program Statement Input:
                            x = 123;
                            Output:
                            IDENTIFIER: x
                            NUMBER: 123

                            Example 2: Lex program for Recognizing Arithmetic Operators

                            A Lex Program File (saved as – example2.l):

                            %{
                            #include <stdio.h>
                            %}
                            %%
                            "+"     { printf("PLUS OPERATOR\n"); }
                            "-"     { printf("MINUS OPERATOR\n"); }
                            "*"     { printf("MULTIPLY OPERATOR\n"); }
                            "/"     { printf("DIVIDE OPERATOR\n"); }
                            [0-9]+  { printf("NUMBER: %s\n", yytext); }
                            [a-zA-Z]+ { printf("IDENTIFIER: %s\n", yytext); }
                            .       { /* ignore */ }
                            %%
                            int main()
                            {
                                yylex();
                                return 0;
                            }
                            Input code of the C program statement
                            a + b - 25 / c
                            Output:
                            IDENTIFIER: a
                            PLUS OPERATOR
                            IDENTIFIER: b
                            MINUS OPERATOR
                            NUMBER: 25
                            DIVIDE OPERATOR
                            IDENTIFIER: c
                            Example 3: Lex program for recognizing Keywords vs Identifiers

                            A Lex Program File (saved as – example3.l):

                            %{
                            #include <stdio.h>
                            %}
                            %%
                            "int"    { printf("KEYWORD: int\n"); }
                            "float"  { printf("KEYWORD: float\n"); }
                            [a-zA-Z][a-zA-Z0-9]* { printf("IDENTIFIER: %s\n", yytext); }
                            [0-9]+   { printf("NUMBER: %s\n", yytext); }
                            .        { /* ignore */ }
                            %%
                            int main()
                            {
                                yylex();
                                return 0;
                            }
                            Input code of the C program statement
                            int x = 50;
                            float price = 99;
                            Output:
                            KEYWORD: int
                            IDENTIFIER: x
                            NUMBER: 50
                            KEYWORD: float
                            IDENTIFIER: price
                            NUMBER: 99
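
Note that Lex resolves conflicts between rules in two ways: it always prefers the longest match, and when two patterns match the same longest lexeme – as int matches both the "int" rule and the identifier rule here – it picks the rule listed first. This is why the keyword rules in example3.l must appear before the identifier rule; listed after it, int would be reported as an identifier.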
                            Example 4: Lex Program for recognizing Whitespace and Comments
                            A Lex Program File (saved as – example4.l):
                            %{
                            #include <stdio.h>
                            %}
                            %%
                            [ \t\n]+        { /* ignore whitespace */ }
                            "//".*          { /* ignore single-line comments */ }
                            [a-zA-Z][a-zA-Z0-9]* { printf("IDENTIFIER: %s\n", yytext); }
                            [0-9]+          { printf("NUMBER: %s\n", yytext); }
                            .               { printf("SYMBOL: %s\n", yytext); }
                            %%
                            int main()
                            {
                                yylex();
                                return 0;
                            }
                            Input code of the C program statement:
                            // This is a comment
                            int value = 42;
                            Output:
                            IDENTIFIER: int
                            IDENTIFIER: value
                            SYMBOL: =
                            NUMBER: 42
                            SYMBOL: ;
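
Note that int is reported here as an IDENTIFIER because example4.l defines no keyword rules; adding rules like those in Example 3 ahead of the identifier rule would make it print KEYWORD: int instead.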

                            Example

                            Consider the C statement
                            int x = 10;
                            In the Lexical Analysis phase, the Lexer produces the following tokens:
                            • int → Keyword token
                            • x → Identifier token
                            • = → Operator token
                            • 10 → Literal token
                            • ; → Punctuation token

                             
