Blog.

Semantic Analysis in CuriousX

Cover Image for Semantic Analysis in CuriousX
Jennifer
Jennifer

Semantic analysis is the process of checking the source code for meaning and correctness. It verifies that the code adheres to the language's rules and that all the elements of the code are used correctly. This includes checking for things like variable and function declarations, type compatibility, and proper usage of operators.

Think of it like a spell checker for your code. Just like how a spell checker checks for spelling mistakes in your document, a semantic analyzer checks for errors in your code.

You might think that the lexical and syntax analyzers already have error checking capabilities, but semantic analysis serves as a final check to ensure that all errors have been caught. Just like how your mom double-checks your homework for any mistakes even after you've said you've completed it 😏, semantic analysis looks for any errors that the syntax and lexical analysis may have missed, such as using a variable that hasn't been defined or attempting to add a string to an integer.

Before we dive into the specifics of error checking in CuriousX, it's important to understand the type of programming language it compiles. CuriousX supports both implicitly and strongly typed languages.

An implicitly typed language means that the programmer does not explicitly specify the type of data being used, such as integers or floats (well those are the only types CuriousX actually supports 😂). For example, in Python, you can simply write x = 10 without specifying that it is an integer, no need for the int keyword. The compiler automatically figures it out by looking at the context in which it's used.

On the other hand, a strongly typed language means that a variable can only hold data of the type it was declared to hold and an error will be generated if an attempt is made to assign a value of a different type to that variable, like trying to put a square peg in a round hole. For example, in the following code snippet:

x = 9.8
y = x * 2

This would result in an error because x is a float and 2 is an integer, and binary operations can't be performed on them, so what type will the variable y be intfloat? No, not happening, sorry 😒. To correct this, you would need to do something like this:

x = 9.8
y = x * 2.0

This way, both x and 2.0 are floats and the operation can be performed. defining this features will help us understand the types of errors that need to be caught in CuriousX.

so what errors can CuriousX semantic Analyzer catch that the lexer and syntax cant

  • Type Mismatches: like I explained earlier CuriousX supports strongly typed PL, type mismatches occur when a value or variable of one type is used where a value or variable of a different type is expected. In our case, performing binary operations with different types is a mismatch e.g adding a float to an int

  • Division by zero: is an error that occurs when a program tries to divide a number by zero. This is undefined mathematically, and can lead to unexpected results or errors in the program.

  • Undeclared variables: when a variable is used in a program without being first defined or assigned to any value, since CuriousX supports implicit PL, you cant just declare a variable without defining it.

  • duplicate variable declaration: In the case of CuriousX, this occurs when a variable is defined more than once to a different type for example:

x = 5
x = 5.9

Syntax analyzer and Lexer cant catch this errors because they only check the structure and organization of the code, while the lexer can only catch invalid tokens, the syntax analyzer on the other hand ensure that the program conform to the language's grammar rules.


So How Does it Work ??

The semantic analysis takes the AST tree generated by the syntax analysis as input, and traverses it

As the semantic analysis traverses the tree, it also checks and adds variables to the symbol table. The symbol table is like a dictionary for your code, it keeps track of all the variables and their types.

But the real magic happens when the semantic analysis performs type inference on each expression in the code since CuriousX supports implicitly typed PL. Type inference is like detective work, it figures out the type of each expression based on the context.

For example, if an expression is the result of adding two integers, the type inferred would be an integer. If its a binary operation between an integer and float thats an error, just think of it like an AND logic 🙂

The inferred types are then stored in the AST tree and symbol table for later use. This way, if an error is detected, the semantic analysis knows exactly where to find it and how to fix it.

Once the semantic analyzer has completed traversing the AST tree and performing type inference, the resulting AST tree is annotated with the inferred types, this means semantic analysis takes in an AST tree from the syntax analyzer as input and its output is an error free AST tree with it types inferred.

For example, consider the following expression in our CuriousX:

x = 5 + 2 * 3

The lexer output for this expression as stated in the previous Lexical Analysis post would be:

[x]    ->   <line:1, col:1>;	 VarToken
[=]    ->   <line:1, col:3>;	 AssignToken
[5]    ->   <line:1, col:5>;	 IntToken
[+]    ->   <line:1, col:7>;	 PlusToken
[2]    ->   <line:1, col:9>;	 IntToken
[*]    ->   <line:1, col:10>;	 MultiplyToken
[3]    ->   <line:1, col:12>;	 IntToken

A parse tree for this expression also stated in the previous Syntax Analysis post would have the following structure:

                         =
                        / \
                       x   +
                          / \
                         5   *
                            / \
                           2   3

the output of the semantic analysis would be an annotated AST that looks like this:

                         =
                        / \
             (INTEGER) x   +
                          / \
                         5   *
                            / \
                           2   3

it is basically error free AST, where all the variables are inferred to their types

Well we are done with the frontend of CuriousX!, Lexical analysis, syntax analysis and semantic analysis are called frontend because they serve as the initial stages of the compilation process. code on github

They are the gatekeepers that examine and structure the source code before it is passed on for further processing. The frontend acts like a quality control inspector, checking for errors and making sure that the code is written in a way that can be understood by the compiler.

Once the frontend has done its job, the real work of compilation can begin. The backend will then take the structured and error-free code and turn it into machine language or bytecode that can be executed by a computer. Without the diligent work of the frontend, the compilation process would be like trying to build a house without a blueprint, it will be chaotic and will not work.

Special shout out to my friends Ogaga and Tolu for keeping up with my unending questions 😅