Assembly Language Programming in Linux: An Overview


One often comes across ads encouraging children to learn programming, which is a very useful skill to have these days. This article focuses on programming in Linux using assembly language, which figures at number 17 on the TIOBE index.

If you plan to make programming your career, it is a good idea to learn an assembly language at some stage. There are a number of reasons for this. First, it is still popular and ranks among the top 20 in the TIOBE index. The Stack Overflow Developer Survey 2020 ranked assembly language 17th among the most wanted programming languages. (But I do not want to misrepresent facts: assembly language also features in the list of the most dreaded languages, where it is ranked fourth. It is also notorious for its steep learning curve.)

Second, it is widely used in the industry. Assembly language is used in building operating systems and compilers. Of course, nowadays, a lot of this code is written in high-level languages like C and C++, but certain parts of an operating system kernel are still written in assembly language. For example, programmers writing device drivers for a machine often rely on assembly coding. Assembly language also matters in certain phases of compilation; for example, the code optimisation phase of a compiler often involves working with assembly code. Essentially, when you are programming on ‘bare metal’ (accessing the hardware directly), you still need some assembly coding, whatever the application may be. Assembly language code can also run very fast, because the programmer chooses the machine instructions directly and nothing extra is added by a compiler. In the earlier days, when resources like memory and processor speed were limited, this was a great advantage. Nowadays, even laptops have gigabytes of RAM and processors run at gigahertz speeds. Does this make assembly language programming less attractive? Maybe a bit. But don’t assume all the systems in the world have huge amounts of resources available. Consider embedded systems, for example. The microchips in our washing machines and cars do not have huge memories or powerful processors, so embedded system programming still involves a lot of assembly coding.

Third, learning assembly language programming will enhance your knowledge of computers in general. A high-level language programmer can write thousands of lines of code without caring in the least about the underlying architecture of the hardware. Of course, this is a really nice feature of high-level languages. At the same time, if you are a serious programmer, it is a good idea to know what is happening inside the machine when the code you write is executed. Assembly language programming forces you to learn this: you need a clear understanding of registers, primary memory, processors, etc, to write a decent assembly language program.

According to Wikipedia, “A program written in assembly language consists of a series of mnemonic processor instructions and meta-statements (known variously as directives, pseudo-instructions, and pseudo-ops), comments and data.” Now, before we proceed any further, I would like to clarify an important point. The right term is not ‘assembly language’ but ‘assembly languages’, as there are a number of programming languages that fall under this category. Every unique computer architecture has its own assembly language associated with it. An easy classification would be based on the CISC vs RISC classification of architectures. A CISC (complex instruction set computer) processor tries to complete a task in as few assembly instructions as possible, whereas a RISC (reduced instruction set computer) processor generally uses simple instructions, each of which can typically be executed within one clock cycle. Assembly languages like the MIPS assembly language, ARM assembly language, etc, are associated with the RISC architecture. The x86 assembly language is generally considered a CISC based assembly language. But this classification of assembly languages is becoming less meaningful, as more and more architectures tend to be a hybrid of both CISC and RISC.

With most laptops having Intel or AMD processors (both of which use the x86 assembly language), how can we test a RISC based assembly language? I urge you to use a Raspberry Pi, a small single-board computer that uses the ARM assembly language. But I didn’t have a Raspberry Pi with me while writing this article, so I had to depend on either an ARM emulator or an online ARM assembler. There is an ARM emulator called VisUAL, which can be installed on your laptop to test ARM assembly language programs intended for your Raspberry Pi. But I chose to use an online ARM assembler called OakSim. The URL https://wunkolo.github.io/OakSim/ will take you to the OakSim online assembler. The ARM assembly code shown below was executed using this assembler.

MOV r0, #15 ;
MOV r1, #10 ;
ADD r2, r0, r1 ;

The program stores the decimal value 15 (f in hexadecimal) in Register 0 and the decimal value 10 (a in hexadecimal) in Register 1; it adds the two numbers and stores the result in Register 2. Figure 1 shows the execution window of OakSim. Figure 2 shows the values in the registers after the execution of the program. It can be seen that Register 2 has hexadecimal value 19 (25 in decimal), the sum of the values in Registers 0 and 1.
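If you want to check the hexadecimal values by hand, the three instructions can be mirrored with plain integers. This is only a minimal Python sketch of the arithmetic, not of how OakSim works internally; the dictionary regs is a stand-in introduced here for illustration.

```python
# Hypothetical model: each ARM register is represented by a plain Python int.
regs = {"r0": 0, "r1": 0, "r2": 0}

regs["r0"] = 15                        # MOV r0, #15
regs["r1"] = 10                        # MOV r1, #10
regs["r2"] = regs["r0"] + regs["r1"]   # ADD r2, r0, r1

# Print each register in hexadecimal and in decimal.
for name, value in regs.items():
    print(f"{name} = {value:#x} ({value})")
```

The last line printed should read r2 = 0x19 (25), which matches the value reported for Register 2.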

Figure 1: Execution window of OakSim

Now that we have discussed an example of a RISC based assembly language, let us discuss a CISC based assembly language: the x86 assembly language. There are two popular syntax branches for the x86 assembly language, namely the AT&T syntax popular with Linux, and the Intel syntax popular with the Microsoft Windows operating system. Two assemblers commonly used on Microsoft Windows, MASM (Microsoft Macro Assembler) and TASM (Turbo Assembler), both use the Intel syntax.

Figure 2: Values in the Registers after program execution

First, let us discuss the AT&T syntax of the x86 assembly language. This syntax was popularised by GAS (the GNU Assembler). This is an important assembler because, by default, it is used as the backend for the GNU Compiler Collection (GCC). Now let us see an example of an assembly language program written in the AT&T syntax. We will use the GNU Assembler, invoked with the command as, to assemble this program. Consider the assembly language program atandt.asm in AT&T syntax shown below:

//This is a comment in AT&T Syntax
movl $111, %eax
movl $222, %ebx
addl %ebx, %eax
addl $333, %ebx

Notice that the above assembly language program is minimal and will not produce any visible output. The constant value 111 is stored in the register %eax and the constant value 222 is stored in the register %ebx. The program then adds the two numbers in the registers %eax and %ebx and stores the result in the register %eax. Finally, it adds the constant value 333 to the content of the register %ebx and stores the result in the register %ebx itself. GAS will assemble this program without any errors or warnings.

Now let us discuss the Intel syntax of the x86 assembly language. Although it is possible to use GAS itself to process assembly code in the Intel syntax, it is easier to use another assembler called the Netwide Assembler (NASM). NASM is very popular with assembly language programmers who work with both the Linux and Microsoft Windows operating systems. Now let us see a program called intel.asm, which is similar to atandt.asm but written in the Intel syntax, so that we can compare the two syntax branches. The program intel.asm is shown below:

;This is a comment in Intel Syntax
mov eax, 111
mov ebx, 222
add eax, ebx
add ebx, 333

This program works similarly to the program atandt.asm. The constant value 111 is stored in the register eax and the constant value 222 is stored in the register ebx. The values in the registers eax and ebx are added, and the result is stored in the register eax. The constant value 333 is then added to the register ebx and the result is stored in the register ebx itself. Figure 3 shows the two programs being assembled with the assemblers GAS and NASM.

Figure 3: Assembly of the two x86 programs
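Since neither x86 program prints anything, it helps to trace the register values by hand. The following Python sketch is my own illustration of that arithmetic, under the assumption that eax and ebx behave as 32-bit registers; the name MASK32 is introduced here only for the example.

```python
MASK32 = 0xFFFFFFFF  # assumption: eax and ebx are 32 bits wide, so results wrap

eax = 111                     # movl $111, %eax
ebx = 222                     # movl $222, %ebx
eax = (eax + ebx) & MASK32    # addl %ebx, %eax
ebx = (ebx + 333) & MASK32    # addl $333, %ebx

print(f"eax = {eax}, ebx = {ebx}")
```

After the last instruction, eax holds 333 (111 + 222) and ebx holds 555 (222 + 333). The masking has no effect on numbers this small, but it is a reminder that a 32-bit register cannot hold arbitrarily large results.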

As you can observe, there are a number of differences between the AT&T and Intel syntaxes, the most annoying being the order of the source and destination operands. The two programs above show this difference: in the AT&T syntax, the destination register comes at the end, after all the source operands, whereas in the Intel syntax, the destination register comes at the beginning, before any source operands. The programs also show how comments are written in a different style in each syntax, and how register names and immediate values are written. In the AT&T syntax, register names are preceded by a percentage symbol (%) and immediate operands by a dollar symbol ($); the Intel syntax uses neither prefix. The instructions addl and add tell us that the two syntax branches use different techniques to identify the size of operands: the AT&T syntax appends a size suffix to the mnemonic (l for a 32-bit ‘long’), whereas the Intel syntax infers the size from the register names. Of course, there are many other differences between the two syntax branches, but the aim of the article is not to list all such subtle differences. Rather, the article tries to point out the variety available when choosing an assembly language to learn.
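The differences listed above are mechanical enough to express as a tiny translator. The Python sketch below is only a toy: the function att_to_intel is a hypothetical helper written for this article, and it handles just the simple two-operand forms with register and immediate operands used in atandt.asm. It is not a general converter.

```python
def att_to_intel(line: str) -> str:
    """Convert one simple AT&T 'op src, dst' line to Intel 'op dst, src'.

    Toy assumption: the mnemonic carries an 'l' size suffix and both
    operands are registers or immediates (no memory operands).
    """
    mnemonic, operands = line.split(None, 1)
    src, dst = [part.strip() for part in operands.split(",")]
    # Drop the AT&T size suffix: movl -> mov, addl -> add.
    # (This would mangle mnemonics such as 'call', hence "toy".)
    if mnemonic.endswith("l"):
        mnemonic = mnemonic[:-1]
    # Drop the '%' register prefix and the '$' immediate prefix.
    src = src.lstrip("%$")
    dst = dst.lstrip("%$")
    # The Intel syntax puts the destination operand first.
    return f"{mnemonic} {dst}, {src}"


att_program = [
    "movl $111, %eax",
    "movl $222, %ebx",
    "addl %ebx, %eax",
    "addl $333, %ebx",
]
intel_program = [att_to_intel(line) for line in att_program]

for att, intel in zip(att_program, intel_program):
    print(f"{att:<18} -> {intel}")
```

Each step in the function corresponds to one difference: the size suffix, the two operand prefixes, and finally the operand order.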

Depending on the architecture or the operating system you choose, there is a wide variety of assembly languages to choose from, each with a different syntax. So before proceeding any further with assembly languages, I suggest you decide which architecture and operating system you plan to work with in the long term.
