Blog.

Syntax Analysis in CuriousX

Cover Image for Syntax Analysis in CuriousX
Jennifer
Jennifer

In our last post, we talked about how the lexical analysis stage works like a librarian, sorting through the jumbled mess of letters in our source code and putting them into neat little piles (tokens) so that our compiler can start to understand what we're trying to say. But the journey doesn't end there

Next up is the syntax analysis stage, where the compiler starts to make sense of the code by organizing the tokens into a structured format. This is where the fun really begins, because the syntax analysis stage is like a detective trying to solve a mystery.

The mystery in this case is our source code, and the detective is trying to figure out the meaning of the code by piecing together clues.

The detective in this case is our trusty parser, which is responsible for determining whether the sequence of tokens generated by the lexer is a valid program according to the language's grammar.

These rules specify the allowed sequence of tokens (such as operators, operands, and parentheses) in an expression, as well as the precedence and associativity of operators.

Grammar rules used in a CuriousX in natural language :

  • Expressions can consist of operands (such as numbers or variables) and operators (such as +, -, *, /).
  • Parentheses can be used to group sub-expressions and indicate precedence.
  • Operators have a specific order of precedence (such as multiplication and division before addition and subtraction).
  • Operators with the same precedence have a specific associativity (such as left-to-right or right-to-left).

in a formal notation

<expr> ::= <term> {<addop> <term>}

<term> ::= <factor> {<mulop> <factor>}

<factor> ::= '(' <expr> ')' | <number> | <identifier>

<addop> ::= '+' | '-'

<mulop> ::= '*' | '/'

where <expr> ,<term>, <factor>, <addop> and <mulop> are non-terminal symbols, and +, -, *, /, (, ), <number> and <identifier> are terminal symbols.

  • expression is made up of the sum of terms
  • Terms is a product of factors
  • Factor is a number or parenthesized sub expression

The compiler uses these grammar rules to check if the input expression is a valid one and also generates the parse tree of the input expression. It's like solving a crossword puzzle; the lexer breaks the source code into individual pieces (tokens), and the parser puts those pieces back together to form a coherent whole (a parse tree).

A parse tree is a hierarchical representation of the source code, like a family tree, where each node in the tree corresponds to a construct in the language, such as a statement or an expression.

The leaves of the tree are the tokens generated by the lexer, and the branches of the tree represent the structure of the code, such as the grouping of sub-expressions and the order of operations.

For example, consider the following expression in our CuriousX:

x = 5 + 2 * 3

To explain this using the formal notation <factor> = <expr>

where

<factor> is x and <expr> = 5 + 2 * 3

then

<expr> = <term1> + <term2>

where

<term1> = <factor> = 5

<term2> = <factor> * <factor> = 2 * 3

The lexer output for this expression as stated in the previous post would be:

[x]    ->   <line:1, col:1>;	 VarToken
[=]    ->   <line:1, col:3>;	 AssignToken
[5]    ->   <line:1, col:5>;	 IntToken
[+]    ->   <line:1, col:7>;	 PlusToken
[2]    ->   <line:1, col:9>;	 IntToken
[*]    ->   <line:1, col:10>;	 MultiplyToken
[3]    ->   <line:1, col:12>;	 IntToken

A parse tree for this expression would have the following structure:

                         =
                        / \
                       x   +
                          / \
                         5   *
                            / \
                           2   3

In this parse tree, the '=' is like the father, x is like his son, + is like his older daughter, and the * is like the youngest daughter. Each one of them have their own children, 5, 2 and 3 respectively, these kids represent the operands that go into making the final value.

The parse tree represents the order of operations of the expression 5 + 2 * 3, first the multiplication will happen and then addition resulting in the final value of 11 which is then stored in the variable x (note the operator precedence)

The syntax analysis stage also plays a vital role in catching errors in the source code. The parser is able to detect a variety of errors, such as missing or extra tokens, invalid token sequences, and incorrect use of language constructs.

By generating a parse tree, the compiler now has a clear and structured representation of the source code, making it easier to analyze and generate code for the next stages of the compiler.

In summary, syntax analysis takes the token generated by the lexer and outputs a parse tree, code on github