#+title: x86 Assembly from my understanding
#+OPTIONS: ^:{}
#+AUTHOR: Crystal
#+OPTIONS: num:nil
#+EXPORT_FILE_NAME: ../../../../blog/asm/1.html
#+HTML_HEAD:
#+HTML_HEAD:
#+OPTIONS: html-style:nil
#+OPTIONS: toc:nil
#+HTML_HEAD:
#+HTML_LINK_HOME: https://crystal.tilde.institute/
Soooo this article (or maybe even a series of articles, who knows?) will be about x86 assembly, or rather, what I understood from it and my road from the bottom up, hopefully reaching a good level of understanding.
* Memory :
Memory is a sequence of octets (aka bytes, 8 bits each), and each one has a unique integer assigned to it called *the Effective Address (EA)*. In this particular CPU architecture (the Intel 8086), an octet is designated by a pair: a segment number and an offset within that segment.
- A segment is a block of 64 consecutive Koctets (1 Koctet = 1024 octets).
- The offset selects one particular octet inside that segment.
The segment and the offset are each encoded on 16 bits, so each takes a value between 0 and 65535.
*** Important :
The relation between the Effective Address and the Segment & Offset is as follows:
*Effective address = 16 x segment + offset*. Keep in mind that the 16 here is written in decimal; from now on we use hexadecimal by convention, where the same factor is written 10h.
**** Example :
Take the physical address *12345h* (the h stands for hexadecimal, which can also be written *0x12345*). I use "physical address" and "effective address" interchangeably here, although Intel's manuals reserve "effective address" for the offset part alone. With the register *DS = 1230h* and the register *SI = 0045h*, the CPU calculates the physical address by multiplying the content of the segment register *DS* by 10h (or 16) and adding the content of the register *SI*. So we get: *1230h x 10h + 0045h = 12300h + 0045h = 12345h*
Now if you are a clever one (I know you are, since you are reading this <3) you may say that the physical address *12345h* can be written in more than one way... and you are right, more precisely: *2^{12} = 4096* different ways !!! Moving the segment up by 1 and the offset down by 10h lands on the same octet.
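To see that aliasing in action, here is a minimal sketch (my own illustration, assuming an emu8086-style assembler and a plain real-mode 8086). It reads the same octet through two different segment:offset pairs; the second pair, 1234h:0005h, is just one of the 4096 possibilities.

#+BEGIN_SRC asm
org 100h
MOV AX, 1230h
MOV DS, AX      ; DS = 1230h
MOV SI, 0045h
MOV AL, [SI]    ; reads DS:SI = 1230h x 10h + 0045h = 12345h

MOV AX, 1234h   ; note: this overwrites AL, so we keep the second read in BL
MOV DS, AX      ; DS = 1234h
MOV SI, 0005h
MOV BL, [SI]    ; reads DS:SI = 1234h x 10h + 0005h = 12345h, the same octet
RET
#+END_SRC

Whatever value happens to live at 12345h, both reads fetch it, because both pairs resolve to the same physical address.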
** Registers
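The 8086 CPU has 14 registers, each 16 bits wide. From the POV of the user, that's 3 groups of 4 registers of 16 bits (general registers, pointer and index registers, segment registers), one status register in which 9 bits are used as flags, and a 16-bit program counter (*IP*, the instruction pointer). IP is "inaccessible to the user" in the sense that you can't read or write it with MOV; only jumps, calls and returns change it.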
*** General Registers
General registers take part in arithmetic, logic and addressing.
Each half of a general register is accessible as an 8-bit register of its own, which made it easier to port code from the 8080 (which had 8-bit registers).
Now here are the Registers we can find in this section:
*AX*: This is the accumulator. It is of 16 bits and is divided into two 8-bit registers AH and AL to also perform 8-bit instructions. It is generally used for arithmetical and logical instructions but in 8086 microprocessor it is not mandatory to have an accumulator as the destination operand. Example:
#+BEGIN_SRC asm
ADD AX, AX ;(AX = AX + AX)
#+END_SRC
*BX*: This is the base register. It is of 16 bits and is divided into two 8-bit registers BH and BL to also perform 8-bit instructions. It is used to hold the offset of a memory location. Example:
#+BEGIN_SRC asm
MOV BX, 0500h ; BX = offset 0500h
MOV AL, [BX]  ; AL = the byte stored at DS:0500h
#+END_SRC
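*CX*: This is the counter register. It is of 16 bits and is divided into two 8-bit registers CH and CL to also perform 8-bit instructions. It is used as the counter for loops, shifts and rotations. Example:
#+BEGIN_SRC asm
MOV CX, 0005 ; repeat 5 times
again:
; loop body goes here
LOOP again   ; CX = CX - 1, jump back to again while CX is not 0
#+END_SRC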
*DX*: This is the data register. It is of 16 bits and is divided into two 8-bit registers DH and DL to also perform 8-bit instructions. It is used in multiplication, division and input/output port addressing. Example:
#+BEGIN_SRC asm
MUL BX ; DX:AX = AX * BX (high word in DX, low word in AX)
#+END_SRC
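Here is a small sketch of my own (the values are picked just for illustration) showing how the two halves of AX behave as independent 8-bit registers, and how MUL spreads a result that doesn't fit in 16 bits across DX and AX.

#+BEGIN_SRC asm
MOV AX, 1234h ; AH = 12h, AL = 34h
MOV AL, 0FFh  ; only the low half changes:  AX = 12FFh
MOV AH, 00h   ; only the high half changes: AX = 00FFh

MOV AX, 1000h
MOV BX, 0010h
MUL BX        ; 1000h x 10h = 10000h, so DX = 0001h and AX = 0000h
#+END_SRC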
** Addressing and registers...again
*** I realized what I wrote here before was almost gibberish, sooo here we go again I guess ?
Well, let's take a step back to the notion of effective addresses vs relative ones.
*** Effective = 10h x Segment + Offset . Part1
When trying to access a specific memory location, we use the notation *[Segment:Offset]*. For example, assume *DS = 0100h* and say we want to write the value *0x0005* to the memory location at physical address *1234h*. What do we do?
**** Answer :
#+BEGIN_SRC asm
MOV [DS:0234h], 0x0005
#+END_SRC
Why ? Let's break it down :
[[../../src/gifs/lain-dance.gif]]
We already know that *Effective = 10h x Segment + Offset*, so here we have *1234h = 10h x DS + Offset*. Since *DS = 0100h*, we end up with this simple equation: *1234h = 1000h + Offset*, therefore the Offset is *0234h*.
Simple, right? Now for another example.
*** Another example :
What if we now have this instruction ?
#+BEGIN_SRC asm
MOV [0234h], 0x0005
#+END_SRC
What does it do? You might or might not be surprised that it does the exact same thing as the other snippet of code. Why though? Because the CPU implicitly assumes that the segment is *DS* for ordinary data accesses, so there is no need to spell it out. So if you don't specify a register (we will get to this later) or a segment, the offset is taken relative to the DS segment.
*** Segment + Register <3
Consider *DS = 0100h* and *BX = BP = 0234h* and this code snippet:
#+BEGIN_SRC asm
MOV [BX], 0x0005 ; NOTE : ITS NOT THE SAME AS MOV BX, 0x0005. Refer to earlier paragraphs
#+END_SRC
Well you guessed it right, it also does the same thing, but now consider this :
#+BEGIN_SRC asm
MOV [BP], 0x0005
#+END_SRC
If you answered that it's the same one, you are wrong. That's because the segment used implicitly depends on which register holds the offset. Here is the explicit equivalent of the two instructions above:
#+BEGIN_SRC asm
MOV [DS:BX], 0x0005
MOV [SS:BP], 0x0005
#+END_SRC
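The general rule of thumb is as follows:
- If the offset is in DI, SI or BX, the segment used is DS.
- If it's in BP, the segment is SS. (SP always works with SS too, but on the 8086 it can't be written inside brackets; PUSH and POP use it for you.)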
**** Note
The values of the registers CS, DS and SS are automatically initialized by the OS when launching the program, so these segments are implicit. AKA: if we want to access a specific piece of data in memory, we just need to specify its offset. Also, you can't load an immediate value straight into a segment register; it has to go through a general register (and CS can't be the destination of a MOV at all). So:
#+BEGIN_SRC asm
MOV DS, 0x0005 ; Is INVALID
MOV DS, AX ; This one is VALID
#+END_SRC
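Putting the default-segment rules together, here is a hedged sketch of my own. It assumes the assembler accepts the same *[Segment:Offset]* notation used above, and it only reads memory so nothing gets clobbered.

#+BEGIN_SRC asm
MOV AX, 0100h
MOV DS, AX      ; segment registers are loaded through a general register
MOV BX, 0234h
MOV BP, 0234h

MOV AX, [BX]    ; default segment DS: reads DS:0234h = physical 01234h
MOV CX, [BP]    ; default segment SS: reads SS:0234h, wherever SS points
MOV DX, [DS:BP] ; explicit DS: reads physical 01234h again, so DX = AX
#+END_SRC

The same offset register value gives two different physical addresses unless SS and DS happen to hold the same segment.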
* The ACTUAL thing :
Enough technical rambling, and now we shall go to the fun part, the ACTUAL CODE. But first, some names you should be familiar with :
- *Mnemonics* : Or *Instructions*, are the...well...instructions executed by the CPU like *MOV*, *ADD*, *MUL*...etc. They are case *insensitive* but I like them better in UPPERCASE.
- *Operands* : These are the arguments passed to the instructions, like *MOV dst, src*, and they can be anything from a memory location, to a variable, to an immediate value.
** Structure of an assembly program :
While there is no "standard" structure, I prefer to go with this one:
#+BEGIN_SRC asm
org 100h
.data
; variables and constants
.code
; instructions
#+END_SRC
** MOV dst, src
The MOV instruction copies the second operand (src) to the first operand (dst). The source can be a memory location, an immediate value, or a general-purpose register (AX, BX, CX, DX). The destination can be a general-purpose register or a memory location. Both operands must be the same size (two bytes or two words), and they can't both be memory locations.
These types of operands are supported:
#+BEGIN_SRC asm
MOV REG, memory
MOV memory, REG
MOV REG, REG
MOV memory, immediate
MOV REG, immediate
#+END_SRC
- *REG*: AX, BX, CX, DX, AH, AL, BL, BH, CH, CL, DH, DL, DI, SI, BP, SP.
- *memory*: [BX], [BX+SI+7], variable
- *immediate*: 5, -24, 3Fh, 10001101b

For segment registers, only these types of MOV are supported:
#+BEGIN_SRC asm
MOV SREG, memory
MOV memory, SREG
MOV REG, SREG
MOV SREG, REG
#+END_SRC
- *SREG*: DS, ES, SS, and only as second operand: CS.
- *REG*: AX, BX, CX, DX, DI, SI, BP, SP (16-bit registers only, since segment registers are 16 bits wide).
- *memory*: [BX], [BX+SI+7], variable
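To make the operand combinations concrete, here is a short sketch of my own that follows the program layout shown earlier. The variable name *total* is made up for the example, and the commented-out lines are the ones an assembler should reject.

#+BEGIN_SRC asm
org 100h
.data
total DW 0
.code
MOV AX, 5        ; REG, immediate
MOV BX, AX       ; REG, REG
MOV total, AX    ; memory, REG
MOV CX, total    ; REG, memory
MOV total, 7     ; memory, immediate (the size is known from the DW)
MOV ES, AX       ; SREG, REG
MOV DX, ES       ; REG, SREG
; MOV total, total ; not allowed: memory to memory
; MOV AL, BX       ; not allowed: 8-bit destination, 16-bit source
; MOV ES, 5        ; not allowed: SREG, immediate
RET
#+END_SRC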
*** Note : The MOV instruction *cannot* be used to set the value of the CS and IP registers
** Variables :
Let's say you want to use a specific value multiple times in your code, do you prefer to call it using something like *var1* or *E4F9:0011* ? If your answer is the second option, you can gladly skip this section, or even better, seek therapy.
Anyways, we have two types of variables, *bytes* and *words* (which are two bytes), and to define a variable, we use the following syntax:
#+BEGIN_SRC asm
name DB value ; To Define a Byte
name DW value ; To Define a Word
#+END_SRC
- *name* - can be any letter or digit combination, though it should start with a letter. It's possible to declare unnamed variables by not specifying the name (this variable will have an address but no name).
- *value* - can be any numeric value in any supported numbering system (hexadecimal, binary, or decimal), or the "?" symbol for variables that are not initialized.
*** Example code :
#+BEGIN_SRC asm
org 100h
.data
x db 33
y dw 1350h
.code
MOV AL, x
MOV BX, y
#+END_SRC
*** Arrays :
We can also define arrays instead of single values by using comma-separated values, like this for example:
#+BEGIN_SRC asm
a db 48h, 65h, 6Ch, 6Fh, 00H
b db 'Hello', 0
#+END_SRC
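Surprise surprise, the arrays a and b are identical. The reason is that characters are first converted to their ASCII values and then stored in memory!!! Wonderful, right? And guess what, accessing values in assembly LOOKS THE SAME AS IN C !!! One catch though: the number in the brackets is an offset in bytes, not an element index, so for a DW array the second element is at [2], not [1].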
#+BEGIN_SRC asm
MOV AL, a[0] ; Copies 48h to AL
MOV BL, b[0] ; Also Copies 48h to BL
#+END_SRC
You can also use any of the memory index registers BX, SI, DI, BP, for example:
#+BEGIN_SRC asm
MOV SI, 3
MOV AL, a[SI]
#+END_SRC
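As a slightly bigger sketch (my own, with a made-up array called *nums* and a made-up label), indexing with SI combines nicely with the CX counter and LOOP to walk over a byte array:

#+BEGIN_SRC asm
org 100h
.data
nums DB 1, 2, 3, 4, 5
.code
MOV CX, 5        ; number of elements left to visit
MOV SI, 0        ; byte offset of the current element
MOV AL, 0        ; running sum
sum_loop:
ADD AL, nums[SI] ; add the byte stored at nums + SI
INC SI           ; move on to the next byte
LOOP sum_loop    ; CX = CX - 1, repeat while CX is not 0
; at this point AL = 0Fh (1 + 2 + 3 + 4 + 5 = 15)
RET
#+END_SRC

For a DW array the idea is the same, except SI has to advance by 2 each time since every element takes two bytes.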
If you need to declare a large array you can use the *DUP* operator.
The syntax for *DUP* is *number DUP ( value(s) )*, where:
- *number* - number of duplicates to make (any constant value).
- *value* - expression that DUP will duplicate.

For example:
#+BEGIN_SRC asm
c DB 5 DUP(9)
;is an alternative way of declaring:
c DB 9, 9, 9, 9, 9
#+END_SRC
one more example:
#+BEGIN_SRC asm
d DB 5 DUP(1, 2)
;is an alternative way of declaring:
d DB 1, 2, 1, 2, 1, 2, 1, 2, 1, 2
#+END_SRC
Of course, you can use DW instead of DB if it's required to keep values larger than 255, or smaller than -128. DW cannot be used to declare strings.
*** LEA
*LEA* (Load Effective Address) is an instruction used to get the offset of a specific variable. We will see how it's used in a moment, but first, here is something we will need.
In order to tell the assembler which data type a memory operand has, these prefixes should be used:
- *BYTE PTR* - for a byte.
- *WORD PTR* - for a word (two bytes).

For example, *BYTE PTR [BX]* is a byte access and *WORD PTR [BX]* is a word access.
The assembler (emu8086, at least) supports shorter prefixes as well:
- b. - for BYTE PTR
- w. - for WORD PTR

In certain cases the assembler can work out the data type automatically, for instance when the other operand is a register.
**** Example :
#+BEGIN_SRC asm
org 100h
.data
VAR1 db 50h
VAR2 dw 1234h
.code
MOV AL, VAR1 ; We check the value of VAR1 by putting it in AL
MOV AX, VAR2 ; Same here
LEA BX, VAR1 ; BX receives the Address of VAR1
MOV b.[BX], 44h
MOV AL, VAR1 ; We effectively changed the content of the VAR1 variable
LEA BX, VAR2
MOV w.[BX], 5678h
MOV AX, VAR2
#+END_SRC
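To show when the size prefix is actually needed, here is a minimal sketch of my own, reusing the VAR1 name from the example above. With a register operand the assembler can infer the size; with an immediate it can't, so the prefix has to be there.

#+BEGIN_SRC asm
org 100h
.data
VAR1 db 50h
.code
LEA BX, VAR1
MOV AL, 44h
MOV [BX], AL           ; no prefix needed: AL is 8 bits, so one byte is written
MOV BYTE PTR [BX], 55h ; prefix needed: 55h alone could be a byte or a word
MOV AL, VAR1           ; AL = 55h
RET
#+END_SRC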
*** Constants :
Constants in Assembly only exist until the code is assembled, meaning that if you disassemble your code later, you won't see your constant definitions.
Defining constants is pretty straightforward:
#+BEGIN_SRC asm
name EQU value
#+END_SRC
Of course constants can't be changed, and they aren't stored in memory. So they are like little macros that live in your code.
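A tiny sketch of my own (the name *ROUNDS* is made up) showing that the assembler simply substitutes the value wherever the constant's name appears:

#+BEGIN_SRC asm
org 100h
ROUNDS EQU 5
MOV CX, ROUNDS ; assembled exactly like MOV CX, 5
MOV AL, ROUNDS ; assembled exactly like MOV AL, 5
RET
#+END_SRC

Since a constant has no storage of its own, it works with both 8-bit and 16-bit destinations as long as the value fits.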
** ⚐ :
Now comes the notion of *Flags*, which are bits in the *Status register*. They are updated by arithmetic and logic instructions (or control how the CPU behaves), and each can take a value of 1 or 0. Here are the 8 flags covered in this article (the 8086 has a ninth one, the Trap Flag (TF), used for single-step debugging):
- *Carry Flag(CF):* Set to 1 when there is an *unsigned overflow*, for example when you add the bytes 255 + 1 (the result is not in the range [0, 255]). By default it's set to 0.
- *Overflow Flag(OF):* Set to 1 when there is a *signed overflow*, for example when you add the bytes 100 + 50 (the result is not in the range [-128, 127]). By default it's set to 0.
- *Zero Flag(ZF):* Set to 1 when the result is 0, and to 0 otherwise.
- *Auxiliary Flag(AF):* Set to 1 when there is an *unsigned overflow* for the low nibble (4 bits), or in human words: when there is a carry inside the number. For example when you add 29h + 4Ch: 9 + Ch = 15h, so we keep the 5, carry the 1 over to 2 + 4, and AF is set to 1.
- *Parity Flag(PF):* Set to 1 when the result has an even number of one bits, and 0 if it has an odd number of one bits. Even if the result is a word, only the low 8 bits are analyzed.
- *Sign Flag(SF):* Self explanatory, set to 1 if the result is negative (its most significant bit is 1) and 0 if it's positive.
- *Interrupt Enable Flag(IF):* When it's set to 1, the CPU reacts to interrupts from external devices.
- *Direction Flag(DF):* Used by the string instructions. When this flag is set to 0, the processing is done forward (towards higher addresses); if it's set to 1, it's done backward.
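
The examples from the list can be tried directly. This is a sketch of my own using 8-bit additions, with the flags I expect after each ADD written in the comments (worked out by hand, so do step through it in an emulator to double check).

#+BEGIN_SRC asm
MOV AL, 255
ADD AL, 1   ; AL = 00h: CF = 1 (unsigned overflow), ZF = 1, OF = 0

MOV AL, 100
ADD AL, 50  ; AL = 96h: OF = 1 (signed overflow), SF = 1, CF = 0

MOV AL, 29h
ADD AL, 4Ch ; AL = 75h: AF = 1 (carry out of the low nibble), CF = 0, PF = 0
#+END_SRC

In the last case 75h is 01110101b, which has five one bits, hence the parity flag at 0.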