You can now write programs in Malayalam, the native language of Kerala, using the programming language ‘Malluscript’, which has been developed using Rust.
The concept of using a language that is more human-like when communicating with machines has always fascinated me. My introduction to programming began with the use of an embedded language called PAWN, also known as ‘small C’, to create servers and mods for video games. The PAWN compiler, true to its name, was small and straightforward to comprehend. It gave me a strong foundation in understanding the inner workings of a compiler and of programming languages in general. This experience sparked my desire to create my own language, one that is written in my native language.
Why Rust?
Rust has been around since 2010, but it gained significant popularity after its 1.0 release in 2015. My journey with Rust began in late 2019, while I was writing plugins and libraries for PAWN. After running into the limitations of C++, which I had been using to write these plugins, Rust was a refreshing change. It provided everything that C++ did, but in a better way (at least in my opinion).
Aside from being another system-level language, Rust caught my interest with its ownership and borrowing system, which made me appreciate its approach to memory management and its emphasis on safety and security. This feature set distinguishes it from other system-level languages and made it an attractive choice for me to continue developing in. The reasons include:
- Memory safety: Rust has a strong emphasis on preventing null or dangling pointer references, which can help prevent memory safety issues such as segmentation faults and buffer overflows.
- Concurrency: Rust’s ownership and borrowing system makes it easier to write concurrent code, because data races in safe code are rejected at compile time. The standard library provides operating system threads, and lightweight tasks are available through async runtimes.
- Performance: Rust code can be as fast as C or C++ code, making it a good choice for performance-critical applications.
- Ecosystem: Rust has a growing and supportive community, with a wealth of libraries and tools available through its package manager, Cargo.
- Safety-critical systems: Rust is being used in software where safety matters, such as the Firefox web browser and the Servo browser engine, due to its focus on memory safety and concurrency.
- Easy to learn: Although it’s debatable, in my opinion, Rust has a friendly and approachable syntax, making it easy for new users to pick up and learn.
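The memory safety and concurrency points above can be illustrated with a minimal sketch. It uses only the Rust standard library; the function names `parallel_sum` and `first_even` are made up for this example and have nothing to do with Malluscript itself.

```rust
use std::thread;

/// Sums each chunk on its own OS thread.
/// (Hypothetical helper, written only to illustrate ownership.)
fn parallel_sum(chunks: Vec<Vec<u64>>) -> u64 {
    let handles: Vec<thread::JoinHandle<u64>> = chunks
        .into_iter()
        // `move` transfers ownership of `chunk` into the new thread, so no
        // other thread can modify it and no reference to it can dangle.
        .map(|chunk| thread::spawn(move || chunk.iter().sum::<u64>()))
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .sum()
}

/// `Option` takes the place of a null pointer: the caller is forced
/// to handle the "no value" case before using the result.
fn first_even(values: &[u64]) -> Option<u64> {
    values.iter().copied().find(|v| v % 2 == 0)
}
```

If the closure tried to borrow `chunk` instead of taking ownership of it, the compiler would reject the program, which is the kind of check that makes concurrent Rust code easier to get right.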
Ezhil
In the past, there have been several efforts to create programming languages that incorporate Indian languages. Ezhil is one such programming language. It is an open source, interpreted programming language designed for native Tamil speakers, particularly students. Its goal is to simplify the process of learning programming and numeracy for Tamil speakers by incorporating Tamil keywords and grammar, while also providing logical constructs that are similar to those found in English-based programming languages. Ezhil has been in development since 2007 and was officially announced in 2009 as the first freely available programming language in Tamil. The language is intended to make computer programming and numeracy more accessible for individuals who may not be fluent in English.
I am a native speaker of Malayalam, and I noticed the lack of a programming language like Ezhil that is tailored to the Malayalam language. This led me to the idea of creating a language that would allow us to write computer programs in Malayalam. Because of the benefits discussed earlier, I opted to utilise Rust for the development of my new language, Malluscript.
Malluscript
As people from the Indian state of Kerala primarily speak Malayalam, they are often referred to as ‘Mallus’ across India. Malluscript felt like an appropriate choice of name for a programming language for Mallus that enables them to write programs in Malayalam. One of the main aims of this language was to use it in an instant coding competition, in which students would be provided with a brand new language to solve certain problems in. Since the main objective was to attract younger students for a competition, I designed Malluscript in a comic tone, involving trendy memetic keywords as tokens.
Like the implementation of any standard programming language, this esoteric scripting language consists of a lexer, a parser and an executor. The idea was to make it easy to understand and accessible for younger students while still adhering to the fundamental concepts of programming.
Writing a lexer
The lexer’s main job is to identify the individual tokens in the input stream and to classify them based on their type. For example, a lexer might identify a sequence of characters like ‘if’ as a keyword token, or a sequence of digits as a numerical value token. Malluscript’s lexer does the same; it converts the text representation of the language into an iterator of tokens, which will be used by the parser in the next stage. Figure 1 shows the list of tokens in Malluscript.
Malluscript has 31 different tokens, with some of the keywords having multiple aliases. When I was writing Malluscript initially, I only intended it to support Manglish (Malayalam written in the Latin script), since that is the most common way Malayalam-speaking people communicate on the internet. But after the first release, I decided to support keywords written in the Malayalam script (as Unicode text) too. Therefore, a Malluscript keyword can have multiple aliases, both in English and in Malayalam. I wrote a specialised function in the lexer to detect these aliases as well. Figure 2 shows the keyword mapping in Malluscript.
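As a rough illustration of such an alias lookup (not the actual function shown in Figure 2), a single `match` can map every spelling of a keyword to one token kind. The Manglish spellings are the ones used in this article's factorial program; the English and Malayalam aliases, the `Keyword` enum and the function name are hypothetical placeholders chosen for this sketch.

```rust
/// A small, illustrative subset of keyword kinds (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Keyword {
    Declare,
    Print,
    InputNumber,
    Loop,
}

/// Maps every alias of a keyword to the same kind.
/// Only the Manglish spellings come from the article; the English and
/// Malayalam aliases below are assumptions made for illustration.
fn keyword_from_alias(word: &str) -> Option<Keyword> {
    match word {
        "pwoli_sadhanam" | "declare" | "ചരം" => Some(Keyword::Declare),
        "dhe_pidicho" | "print" | "എഴുതുക" => Some(Keyword::Print),
        "number_thada" | "input_number" | "സംഖ്യ" => Some(Keyword::InputNumber),
        "repeat_adi" | "loop" | "ആവർത്തിക്കുക" => Some(Keyword::Loop),
        // Anything else is not a keyword; a lexer would treat it as an identifier.
        _ => None,
    }
}
```

Because Rust strings are UTF-8, Malayalam literals can sit in the same `match` arm as their Latin-script aliases without any special handling.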
The lexer then reads through the input (which is our program) and converts it to an iterator of tokens, which will be used by the parser.
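A minimal sketch of this idea is a struct that implements Rust's `Iterator` trait and yields one token per call. The `Token` variants, the `Lexer` type and its helpers are assumptions made for this example; Malluscript's real token set (Figure 1) is larger and its lexer may be organised differently.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Illustrative token set, far smaller than Malluscript's 31 tokens.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Declare, // pwoli_sadhanam
    Print,   // dhe_pidicho
    Identifier(String),
    Number(i64),
    Assign,
    Plus,
    Minus,
    Star,
    SemiColon,
    Unknown(char),
}

/// Anything that is not whitespace or punctuation counts as part of a word,
/// so Malayalam text (including combining signs) stays in one token.
fn is_word_char(c: char) -> bool {
    !c.is_whitespace() && !"=+-*;{}\"".contains(c)
}

fn keyword(word: &str) -> Option<Token> {
    match word {
        "pwoli_sadhanam" => Some(Token::Declare),
        "dhe_pidicho" => Some(Token::Print),
        _ => None,
    }
}

struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    fn new(source: &'a str) -> Self {
        Lexer { chars: source.chars().peekable() }
    }

    /// Consumes characters while `keep` holds and returns them as a string.
    fn consume_while(&mut self, keep: impl Fn(char) -> bool) -> String {
        let mut out = String::new();
        while let Some(&c) = self.chars.peek() {
            if !keep(c) {
                break;
            }
            out.push(c);
            self.chars.next();
        }
        out
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        self.consume_while(char::is_whitespace); // skip blanks between tokens
        let c = *self.chars.peek()?; // end of input ends the iterator
        if c.is_ascii_digit() {
            let digits = self.consume_while(|c| c.is_ascii_digit());
            // Saturate on overflow to keep the sketch short.
            return Some(Token::Number(digits.parse().unwrap_or(i64::MAX)));
        }
        if is_word_char(c) {
            let word = self.consume_while(is_word_char);
            return Some(keyword(&word).unwrap_or_else(|| Token::Identifier(word)));
        }
        self.chars.next(); // single-character token
        Some(match c {
            '=' => Token::Assign,
            '+' => Token::Plus,
            '-' => Token::Minus,
            '*' => Token::Star,
            ';' => Token::SemiColon,
            other => Token::Unknown(other),
        })
    }
}
```

Since `Lexer` is an ordinary iterator, a parser can pull tokens lazily, or the caller can gather them all with `Lexer::new(source).collect::<Vec<Token>>()`.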
Writing a parser
The parser’s main job is to take the input provided by the lexer (which is a stream of tokens) and arrange it into a tree-like structure called a parse tree. The parse tree represents the hierarchical structure of the input and shows how the different tokens fit together to form a complete program.
The parser uses a set of rules known as a grammar. Using this grammar, the parser determines how to arrange the tokens. Malluscript’s parser is written using a parser generator called LALRPOP.
LALRPOP, as described by its authors, is a Rust parser generator framework with usability as its primary goal. LALRPOP lets us write clean and readable context-free grammars (CFGs), in which we define what Malluscript’s code should look like and how it should be interpreted. For example, consider the statement parser of Malluscript in LALRPOP shown in Figure 3. It closely resembles the context-free grammar we usually write for theoretical compiler design problems. When the interpreter is compiled, LALRPOP converts these CFG rules to Rust code. Using the resulting parser, we parse the tokens generated in the previous step and build an abstract syntax tree (AST).
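To make the token-to-AST step concrete without a build-time code generator, the sketch below uses a small hand-written recursive descent parser. This is a deliberately swapped-in technique, not LALRPOP and not Malluscript's grammar; the `Token`, `Expr`, `Statement` and `Parser` types are all assumptions made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Declare,
    Identifier(String),
    Number(i64),
    Assign,
    Plus,
    Star,
    SemiColon,
}

/// Expression nodes of a tiny, hypothetical AST.
#[derive(Debug, PartialEq)]
enum Expr {
    Number(i64),
    Symbol(String),
    Binary(Box<Expr>, char, Box<Expr>),
}

#[derive(Debug, PartialEq)]
enum Statement {
    Declaration(String),
    Assignment(String, Expr),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn new(tokens: Vec<Token>) -> Self {
        Parser { tokens, pos: 0 }
    }

    /// Returns the current token (if any) and moves past it.
    fn advance(&mut self) -> Option<Token> {
        let token = self.tokens.get(self.pos).cloned();
        self.pos += 1;
        token
    }

    fn expect(&mut self, wanted: Token) -> Result<(), String> {
        let found = self.advance();
        if found.as_ref() == Some(&wanted) {
            Ok(())
        } else {
            Err(format!("expected {:?}, found {:?}", wanted, found))
        }
    }

    // statement := Declare Identifier ';' | Identifier '=' expr ';'
    fn statement(&mut self) -> Result<Statement, String> {
        let stmt = match self.advance() {
            Some(Token::Declare) => match self.advance() {
                Some(Token::Identifier(name)) => Statement::Declaration(name),
                other => return Err(format!("expected identifier, found {:?}", other)),
            },
            Some(Token::Identifier(name)) => {
                self.expect(Token::Assign)?;
                Statement::Assignment(name, self.expr()?)
            }
            other => return Err(format!("unexpected token {:?}", other)),
        };
        self.expect(Token::SemiColon)?;
        Ok(stmt)
    }

    // expr := term ('+' term)*
    fn expr(&mut self) -> Result<Expr, String> {
        let mut left = self.term()?;
        while self.tokens.get(self.pos) == Some(&Token::Plus) {
            self.pos += 1;
            let right = self.term()?;
            left = Expr::Binary(Box::new(left), '+', Box::new(right));
        }
        Ok(left)
    }

    // term := atom ('*' atom)*  (binds tighter than '+')
    fn term(&mut self) -> Result<Expr, String> {
        let mut left = self.atom()?;
        while self.tokens.get(self.pos) == Some(&Token::Star) {
            self.pos += 1;
            let right = self.atom()?;
            left = Expr::Binary(Box::new(left), '*', Box::new(right));
        }
        Ok(left)
    }

    // atom := Number | Identifier
    fn atom(&mut self) -> Result<Expr, String> {
        match self.advance() {
            Some(Token::Number(n)) => Ok(Expr::Number(n)),
            Some(Token::Identifier(name)) => Ok(Expr::Symbol(name)),
            other => Err(format!("expected a value, found {:?}", other)),
        }
    }
}
```

Each method corresponds to one grammar rule, which is roughly the code a parser generator writes for us; with a generator, only the rules in the comments would need to be written by hand.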
Writing an executor
At this point, the tokens produced by the lexer have been built into an AST, which represents how the code should be executed. The job of our executor is simply to walk the AST nodes and perform the operations they describe. For example, Figure 4 shows executor code for a declaration statement in Malluscript.
The code in Figure 4 handles a declaration statement. The executor takes a symbol and checks whether that symbol already exists in the symbol table. If it does, we throw a ‘Symbol already defined’ error; otherwise, we add the symbol to the symbol table. A simple program to compute a factorial in Malluscript will look like what’s shown below:
pwoli_sadhanam num;
pwoli_sadhanam factorial;
dhe_pidicho "Input number:";
num = number_thada;
factorial = 1;
repeat_adi 0 um num um same_alle {
    factorial = factorial * num;
    num = num - 1;
}
dhe_pidicho "Factorial is : " + factorial + "\n";
The same in pure Malayalam Unicode is shown in Figure 5.
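Returning to the executor, the declaration check described earlier (look the symbol up, fail if it exists, otherwise add it) can be sketched with a `HashMap` as the symbol table. The `Executor`, `Statement` and `Value` types here are hypothetical and are not taken from Figure 4; the 'Symbol not defined' error for assignments is an extra assumption of this sketch.

```rust
use std::collections::HashMap;

/// Runtime values in this sketch (hypothetical; integers only).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Unset,
    Integer(i64),
}

/// Two illustrative AST statements.
#[derive(Debug)]
enum Statement {
    Declaration(String),
    Assignment(String, i64),
}

#[derive(Default)]
struct Executor {
    /// The symbol table: variable name -> current value.
    symbols: HashMap<String, Value>,
}

impl Executor {
    fn execute(&mut self, stmt: &Statement) -> Result<(), String> {
        match stmt {
            Statement::Declaration(name) => {
                // Declaring the same symbol twice is an error.
                if self.symbols.contains_key(name) {
                    return Err(format!("Symbol already defined: {}", name));
                }
                self.symbols.insert(name.clone(), Value::Unset);
                Ok(())
            }
            // Assumption of this sketch: assigning to an undeclared symbol fails.
            Statement::Assignment(name, value) => match self.symbols.get_mut(name) {
                Some(slot) => {
                    *slot = Value::Integer(*value);
                    Ok(())
                }
                None => Err(format!("Symbol not defined: {}", name)),
            },
        }
    }
}
```

Running a whole program then amounts to calling `execute` on each statement in order and stopping at the first `Err`.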
Malluscript’s code is open source and available on GitHub at https://github.com/Sreyas-Sreelal/malluscript. Contributions to its ongoing development are welcome.