Catch up on the first part here.

The purpose of the compiler is to convert a string of commands (content) into tokens, then compile those tokens into instructions that are executed on a virtual machine via vm.execute(). The process consists of two main stages: tokenization and compilation. The tokenize function breaks the input into manageable pieces, while compile_to_instrs processes these pieces into something executable. This system is a form of parsing: turning raw data into a structured form that the machine can understand and execute.
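To make the input format concrete, here is a small program in the VM's command language, using only operations defined later in this article (the exact program is illustrative, not taken from the original source):

push 5
push 3
add
print
halt

Each whitespace-separated word becomes one token, and each recognized operation becomes one instruction.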

Difference between compilers and interpreters:

A compiler translates the entire source code of a high-level programming language into machine code in one go. Thereafter, the system stores and executes the machine code. This approach is efficient for execution, as the translation happens before the program runs. Examples of compiled languages include C, C++, and Rust.

Meanwhile, interpreters translate high-level code line by line, executing each instruction immediately after translation. This allows for dynamic execution but can be slower than compiled code. Examples of interpreted languages include Python, JavaScript, and PHP.

Tokenization

The tokenize function takes in a string of instructions (e.g., push 5, add). It processes the string line by line, splitting the content into individual words (tokens). Each token falls into one of four categories (a sketch of the Token type follows this list):

  1. Operation (e.g., push, add, etc.): it's converted into an Op (Operation) enum.
  2. Number: it's treated as an integer value and converted into a Token::Value(i64).
  3. Label (e.g., :start): any token that starts with a colon (:) is converted into a Token::Label.
  4. Unknown: if the token is not recognized, it's labeled as Token::Unknown.
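The Token type itself isn't shown in this part of the series; a minimal definition consistent with the code below might look like this (an assumed sketch, not necessarily the author's exact definition):

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    /// A recognized operation such as push or add.
    Op(Op),
    /// An integer operand, e.g. 42.
    Value(i64),
    /// A jump target such as :start, stored without the colon.
    Label(String),
    /// Anything the tokenizer could not classify.
    Unknown(String),
}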

Tokenization Flow

The final output is a list of tokens representing the instructions. To write the tokenization code, you'd make use of the Op file:

Op file

The Op enum defines a set of operations that form the building blocks of the virtual machine's instruction set. Each operation has a corresponding numeric value (u8), specified using the #[repr(u8)] attribute. The to_u64 and from_u64 methods handle serialization (conversion to bytes) and deserialization (conversion back to instructions), providing the transition between human-readable instructions and machine-friendly representations.

This makes the enum compact and easily convertible to and from bytecode. Using Option in from_u64 ensures invalid numeric values return None instead of causing runtime errors. The #[repr(u8)] attribute guarantees the enum variants match their numeric representation.

Example Variants:

Push: Adds a value to the stack.

Pop: Removes the top value from the stack.

Add: Pops two values, adds them, and pushes the result.

Complete code

use strum::FromRepr;

#[derive(Debug, FromRepr, Clone, Copy, PartialEq, PartialOrd)]
#[repr(u8)]
pub enum Op {
    Push = 0x01, Pop = 0x02, Print = 0x03, Add, Inc, Dec, Sub, Mul, Div, Mod, 
    Halt, Dup, Dup2, Swap, Clear, Over, Je, Jn, Jg, Jl, Jge, Jle, Jmp, Jz, Jnz,
}

impl Op {
    /// Converts an `Op` variant into a `u64`.
    pub fn to_u64(self) -> u64 {
        self as u64
    }

    /// Converts a `u64` back into an `Op` variant, if valid.
    pub fn from_u64(value: u64) -> Option<Self> {
        Op::from_repr(value as u8)
    }
}
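A quick round-trip check shows how these helpers behave (a minimal usage sketch, assuming the enum above):

fn main() {
    let op = Op::Push;
    let encoded = op.to_u64();            // 0x01
    let decoded = Op::from_u64(encoded);  // Some(Op::Push)
    assert_eq!(decoded, Some(Op::Push));

    // An out-of-range value decodes to None instead of panicking.
    assert_eq!(Op::from_u64(0xFF), None);
}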

Here is the tokenization process expressed in code:


use anyhow::{anyhow, Result};

pub fn tokenize(content: &str) -> Result<Vec<Token>> {
    let tokens: Vec<Token> = content
        .lines()
        // split_whitespace never yields empty strings, so no extra filtering is needed
        .flat_map(|line| line.split_whitespace())
        .map(|raw_token| match raw_token {
            "push" => Token::Op(Op::Push),
            "pop" => Token::Op(Op::Pop),
            "print" => Token::Op(Op::Print),
            "add" => Token::Op(Op::Add),
            "inc" => Token::Op(Op::Inc),
            "dec" => Token::Op(Op::Dec),
            "sub" => Token::Op(Op::Sub),
            "mul" => Token::Op(Op::Mul),
            "div" => Token::Op(Op::Div),
            "mod" => Token::Op(Op::Mod),
            "halt" => Token::Op(Op::Halt),
            "dup" => Token::Op(Op::Dup),
            "dup2" => Token::Op(Op::Dup2),
            "swap" => Token::Op(Op::Swap),
            "clear" => Token::Op(Op::Clear),
            "over" => Token::Op(Op::Over),
            "je" => Token::Op(Op::Je),
            "jn" => Token::Op(Op::Jn),
            "jg" => Token::Op(Op::Jg),
            "jl" => Token::Op(Op::Jl),
            "jge" => Token::Op(Op::Jge),
            "jle" => Token::Op(Op::Jle),
            "jmp" => Token::Op(Op::Jmp),
            "jz" => Token::Op(Op::Jz),
            "jnz" => Token::Op(Op::Jnz),
            val => {
                if let Ok(int) = val.parse::<i64>() {
                    Token::Value(int)
                } else if val.starts_with(':') {
                    Token::Label(val.strip_prefix(':').unwrap().to_string())
                } else {
                    Token::Unknown(val.to_string())
                }
            }
        })
        .collect();

    // Reject the whole input if any token could not be classified.
    if let Some(Token::Unknown(val)) = tokens.iter().find(|t| matches!(t, Token::Unknown(_))) {
        return Err(anyhow!("Unexpected token: '{}'", val));
    }

    Ok(tokens)
}
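Calling it on a short program shows the shape of the output (a usage sketch, assuming the Token definition above):

fn main() -> anyhow::Result<()> {
    let tokens = tokenize("push 5\npush 3\nadd\nprint")?;
    // [Op(Push), Value(5), Op(Push), Value(3), Op(Add), Op(Print)]
    println!("{:?}", tokens);
    Ok(())
}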

The compile_to_instrs Function

Once the content has been tokenized, it’s ready for compilation into instructions. This function takes the tokens and processes them to generate machine-readable instructions (Instr).
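The function relies on a few supporting types that are not shown in this part; minimal definitions consistent with the code might look like this (an assumed sketch):

use std::collections::HashMap;
use anyhow::{anyhow, Result};

/// An operand before label resolution.
#[derive(Debug)]
enum AbstractValue {
    None,
    Integer(i64),
    Label(String),
}

/// An instruction whose label operands are not yet resolved.
#[derive(Debug)]
struct AbstractInstr {
    op: Op,
    value: AbstractValue,
}

/// A fully resolved, machine-ready instruction.
#[derive(Debug)]
pub struct Instr {
    pub op: Op,
    pub value: u64,
}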

fn compile_to_instrs(tokens: &[Token]) -> Result<Vec<Instr>> {
    let mut abstract_instrs: Vec<AbstractInstr> = Vec::new();
    let mut labels: HashMap<String, usize> = HashMap::new();
    let mut tail = tokens;

    while !tail.is_empty() {
        match tail {
            // A label definition records the address of the next instruction.
            [Token::Label(name), rest @ ..] => {
                if labels.contains_key(name) {
                    return Err(anyhow!("Label '{}' defined more than once", name));
                }
                labels.insert(name.clone(), abstract_instrs.len());
                tail = rest;
            }
            // `push` and the jump family carry a numeric operand. This relies on
            // the enum ordering: the jumps (Je through Jnz) sit at the end.
            [Token::Op(op), Token::Value(value), rest @ ..] if *op == Op::Push || *op >= Op::Je => {
                abstract_instrs.push(AbstractInstr { op: *op, value: AbstractValue::Integer(*value) });
                tail = rest;
            }
            // Jumps may instead target a named label, resolved in the second pass below.
            [Token::Op(op), Token::Label(label), rest @ ..] if *op >= Op::Je => {
                abstract_instrs.push(AbstractInstr { op: *op, value: AbstractValue::Label(label.clone()) });
                tail = rest;
            }
            // Every other operation stands alone, with no operand.
            [Token::Op(op), rest @ ..] if *op != Op::Push && *op < Op::Je => {
                abstract_instrs.push(AbstractInstr { op: *op, value: AbstractValue::None });
                tail = rest;
            }
            _ => return Err(anyhow!("Invalid token sequence in compilation: {:?}", tail)),
        }
    }

    // Second pass: replace label operands with the instruction addresses they refer to.
    for instr in &mut abstract_instrs {
        if let AbstractInstr { value: AbstractValue::Label(label), .. } = instr {
            if let Some(&addr) = labels.get(label) {
                instr.value = AbstractValue::Integer(addr as i64);
            } else {
                return Err(anyhow!("Label '{}' not defined", label));
            }
        }
    }

    // Lower the abstract instructions into the final machine-ready form.
    let instrs = abstract_instrs
        .into_iter()
        .map(|abstract_instr| Instr {
            op: abstract_instr.op,
            value: match abstract_instr.value {
                AbstractValue::Integer(value) => value as u64,
                AbstractValue::None => 0,
                // Every label was resolved in the pass above.
                AbstractValue::Label(_) => unreachable!(),
            },
        })
        .collect();

    Ok(instrs)
}

Here's what happens in the code above:

The input is a slice of Token enums. The function creates an empty list of AbstractInstr to hold the intermediate instructions and a HashMap called labels to map label names to their positions in the instruction list. It then iterates over the tokens:

If it encounters a label (Token::Label), it adds it to the labels map and continues.

If it encounters an operation (Token::Op), it processes it by checking the associated value type: whether it's a bare operation like pop, or one that carries additional data, like a number or a label. Recall from tokenization that "push" becomes Token::Op(Op::Push), "42" becomes Token::Value(42), and ":start" becomes Token::Label("start").

If the sequence of tokens doesn't match an expected pattern (an invalid token sequence), it returns an error. A worked example follows.

The compile Function

The compile function is a high-level function that combines tokenize and compile_to_instrs. It tokenizes the input and then compiles the tokens into instructions. It prints the tokens and instructions for debugging purposes.

pub fn compile(content: &str) -> Result<Vec<Instr>> {
    let tokens = tokenize(content)?;
    println!("Tokens: {:#?}", tokens);

    let instrs = compile_to_instrs(&tokens)?;
    println!("Instructions: {:#?}", instrs);

    Ok(instrs)
}

Here’s the flow:

The input is a string of content containing instructions. compile calls tokenize to get the tokens, then compile_to_instrs to convert them into instructions, and finally returns a vector of compiled instructions (Instr).
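Putting it all together, a minimal driver might look like this (a sketch; the virtual machine itself is not defined in this part, so the execute step is only hinted at):

fn main() -> anyhow::Result<()> {
    let program = "push 5\npush 3\nadd\nprint\nhalt";
    let instrs = compile(program)?;

    // Hand the compiled instructions to the virtual machine from the first
    // part of the series, e.g. via vm.execute(); elided here.
    println!("Compiled {} instructions", instrs.len());
    Ok(())
}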

Conclusion

The compiler transforms human-readable instructions into machine-executable code through tokenization and compilation. The tokenize function breaks input into tokens (e.g., operations, values, labels), while compile_to_instrs converts these tokens into instructions for the virtual machine. The Op enum, with its compact numeric representation (u8), ensures efficient execution and serialization. Error handling via Option and Result ensures robustness. Together, these components create a reliable system for parsing, compiling, and executing programs on a virtual machine.

Author of the article: Pluri45