A 2-post collection

Crabs All the Way Down: Running Rust on Logic Gates

This article will discuss many topics, from CPU architecture design to historical shenanigans. Take a drink, it's downhill from there.

Even though the number has steadily decreased since the 90s, there are still many different and incompatible CPU architectures in use nowadays. Most computers use x86_64 and pretty much all mobile devices and recent Macs use some kind of ARM64-based ISA (instruction set architecture).

In specific fields, though, there are more exotic ones: most routers still use MIPS (for historical reasons), a roomful of developers use RISC-V, the PS3 used PowerPC, some servers 20 years ago used Itanium, and of course IBM still sells their S/390-based mainframes (now rebranded as z/Architecture). The embedded world has even more: AVR (used in Arduino), SuperH (Saturn, Dreamcast, Casio 9860 calculators), and the venerable 8051, an Intel chip from 1980 which is still being produced, sold and even extended by third parties.

All these architectures differ on their defining characteristics, the main ones being:

  • word size: 8, 16, 31, 32, 64 bits, sometimes more
  • design style: RISC (few instructions, simple operations), CISC (many instructions, performing complex operations, VLIW (long instructions, doing many things at once in parallel)
  • memory architecture: Harvard (separate code memory and data memory), von Neumann (shared)
  • licensing costs: RISC-V is open and free to use, whereas x86 and ARM, for example, require licensing fees
  • broadly, their feature set: floating-point numbers (x87), encryption (AES-NI), support for native high-level bytecode execution (Jazelle, AVR32B), vectorized computation (SSE, AVX, AltiVec)
That's not even counting DSP architectures, which are, to put it lightly, the ISA counterpart to the Twilight Zone (supporting weird arithmetic operations, peculiar data sizes, etc).

A lot of people have built homemade CPUs, either on real breadboards or in software, for emulators or circuit synthesis. It's a very interesting project to do, even for beginners (really, check out Ben Eater's video series), because it really helps grasp how code translates to the electrical signals that power every device we use, and how complex language features can really be implemented on top of simple operations.

Making a CPU

A number of circumstances have led me to design a simple ARM-ish CPU in a digital circuit simulator. I originally used logisim-evolution (of which I have since become a member of the development team), and recently migrated the circuit to Digital, for performance reasons (Logisim couldn't simulate my circuit at more than 50 or 60 Hz, whereas Digital reaches 20 kHz).

ARM, because it supports a subset of the ARM Thumb instruction set, which itself is one of the multiple instruction sets supported by ARM CPUs. It uses 32-bit words, but the instruction are 16 bits wide.

-ish, because, well, it only supports a subset of it (big, but nowhere near complete), and is deliberately limited in some aspects. Weird instructions, such as the PUSH / POP / LDM / STM family (one of the big CISC ink blots in the RISC ARM ISA), are not supported and are implemented as manual load/stores by the assembler. Interrupts are not supported either.

Pedantically, I didn't design only a CPU but what one could call a computer; it has a ROM, a RAM, and various devices that serve as the "front panel".

Quick sidenote: devices

To be really useful, a computer will not only have a CPU and a memory chip. It'll also have peripherals and other items plugged into it: a keyboard, a screen, a disk drive, speakers, a network card; pretty much anything you can (or can't) imagine has already been made into a computer device.

At the end of the day, the only thing you need is to be able to transmit data to and from the device. There are two opposite ways to do this: either devices are ✨special✨, either they aren't.

Basically, some architectures (x86, I'm looking at you) have, in addition to the memory, a special, separate address space for I/O, with its own special, different instructions: on an 8086, you'd use MOV to read and write main memory, and IN / OUT to read and write to a device. Some devices (the most important ones: PS/2 controller, floppy disk, serial port, ...) have a fixed port number, other have a port number assigned at boot by the BIOS. In the olden days, it was common practice to require setting environment variables or writing config files to inform software of which devices were plugged were (e.g. the famous BLASTER config line). This is called PMIO (port-mapped input/output).

The other option, the one used by pretty much everybody else, including modern x86 computers, is to have a single unified address space, but to make it virtual.

I'm using the word virtual here to differentiate this unified address space from the real physical memory address space (which, in itself, only really means anything on machines with a single memory unit). There's another concept called virtual memory that designates a completely unrelated (though similar in fashion) thing: providing programs with an address space larger than the computer's RAM through the use of strategies like swapping that allow moving RAM pages to the disk storage to free space in the working memory.

Imagine how IP addresses are supposed to map the entire Internet, but in reality an address does not have to exactly map to a single machine somewhere. For example, (::1 in IPv6) is the local loopback, and maps to the machine you're using. This is not required to be known by software communicating over the network, since the mapping is done by the network stack of the OS.

It's the same here: areas of the (virtual) address space are mapped to physical components. To give you a real-world example, here's the NES's address space:

address space diagram, showing WRAM, PPU & APU registers, cartridge RAM and ROM
image from

How to read: addresses from 0 to 800 (hexadecimal) are mapped to WRAM (work RAM), from 2000 to 2008 to the PPU (graphics card) control registers, from 4000 to 4018 to the APU (sound card), I trust you can take it from there. This is called MMIO (memory-mapped input/output).

The big advantage of this approach, for me, is really its simplicity, CPU-wise: it's just memory! Take the address of the device's area, read, write, it's really simple. It also makes things easier software-wise: you don't have to write inline assembly to call special instructions; as long as you can read and write from pointers, you're good to go.

In addition to this, memory mapping can also be used to provide access to different memory chips (e.g. ROM and RAM). Here's what it looks like for my circuit:

FFFFFF00 - FFFFFFFFMMIO controller00100000 - 001FFFFFRAM00000000 - 0000FFFFROM01000000 - 010FFFFFVideo displayMemory mapperWriteRead

Notice the arrows on each side of the edges between the components and the mapper; they indicate whether the component is read-only, read/write or write-only.


It's simple, really. I mean, compared to real CPUs.

As in real Thumb, there are sixteen 32-bit registers, numbered r0 through r15. The last three have nicknames: r13 is sp (stack pointer), r14 is lr (link register) and r15 is pc (program counter).

Quick sidenote: stack pointer

Memory is hard.

16 is a lot; so in reality they're divided into the low (r0-r7) Β and high (r8-r15) registers. High registers can only be manipulated using specific instructions, so the low ones are the ones you're gonna use in everyday life.

Instructions are grouped into several categories, each containing instructions with a common header. I won't list them all here, but the most used ones are the ALU (arithmetic and logic) operations, the load/store instructions (relative to pc, to sp, or to a general register), the stack manipulation instructions and the branch instructions (conditional and unconditional).

Instruction groups are handled by separate, independent subcircuits, that all write into shared buses.

Quick sidenote: buses

Bus is a surprisingly polysemic word. It has definitions in more domains than I can count, and even in electronics and hardware it's used for a variety of things. The common factor between all those definitions is "thing that links other things".
As in previous parts, I will be simplifying the explanation anyway, because electronics is a vast field of study.

In circuit design parlance, a bus is a group of wires connected together. Specifically, in this case, it's a group of wires where only one wire emits a signal at a given instant. Electrically, this is permitted by the use of what's called tri-state logic: a signal is either 0, 1 or Z (pronounced "high-impedance"). Z is "weak", in the sense that if you connect a Z signal with a 0 or 1 signal, the output will be that of the second signal. This is useful: you can have lots of independent components, each with an "enable" input, that only output a signal if they are enabled, and otherwise output Z. The basic component used for this is called, fittingly, a tri-state buffer. It's a simple logic gate that, if enabled, outputs its input unchanged, and otherwise outputs Z.

All these components can then be plugged together, and you can enable one and get its output easily.

Component example: load/store with register offset

This is the component that handles instructions of the form {direction}R{sign}{mode} {destination}, [{base}, {offset}, with:

  • {direction}: either LD (load) or ST (store)
  • {sign}: either nothing (do not extend), or S (sign-extend the value to fill 32 bits)
  • {mode}: either nothing (full word, 32 bits), H (halfword, 16 bits) or B (byte, 8 bits)
  • {destination}: the target register, to read from/write to
  • {base}, {offset}: the address in memory (which will be the sum of the values of both)

As an example, ldrh r1, [r2, r3] is roughly equivalent to r1 = *(short*)(r2 + r3) in C code.

The instructions for this group are encoded as follows:


opcode is a 3-bit value encoding both {direction}, {sign} and {mode}.

{direction} has two possible values (load, store); {sign} has two (raw, sign-extended) and {mode} has three (word, halfword, byte). That's 2β‹…2β‹…3 = 12 possible combinations, more than the 23 = 8 possible values for opcode, which means that some combinations are not possible.
This is because only load operations for incomplete (halfword, byte) values can be sign-extended, so the invalid combinations (strsh, strsb, strs, ldrs) are not given an encoding.

Here's the circuit:

image/svg+xml Instr 0-15 0-2 3-5 6-8 9-11 RegW RegA RegB Instr0708 0 Instr0708 StoreReq 0 1 LoadReq 0 Instr 0-15 9-11 0 1 2 2 0 2 1 1 Instr0708 MemMode Instr0708 MemSigned Instr 0-15 9 10

Here are the different exotic logic components used here:

  • the small triangles are tunnels: named wires that are accessible anywhere in the circuit;
  • the big trapezoids are multiplexers: they output the nth input, where n is also an input;
  • the three-wired triangles are buffers: they output their input if the side wire is high, otherwise they output Z (high-impedance);
  • the yellow boxes convert a 3-bit low register number into a 4-bit register number (by adding a zero in front of it);
  • the large rectangles with integer ranges next to them are splitters: they split a multi-bit value into multiple smaller values, to access individual bits or bit ranges

From top to bottom:

  • the three registers (Rd, Rb, Ro) are read from the instruction at their respective positions (0-2, 3-5, 6-8) and sent to the corresponding global tunnels (RW, RA, RB)
  • opcode is decoded to check whether it's a store (000, 001, 010) or a load (remaining values)
  • opcode is decoded to find the value of mode: 0 for word, 1 for halfword, 2 for byte
  • opcode is yet again decoded to find the value of sign (true only for opcodes 011 and 111 so we can check that the last two bits are high)

The Instr0708 tunnel is the activation pin for this component; it's high if the current instruction belongs to this instruction group.

Pretty much all other components look like this one, and when you plug them all together, you get a circuit that can execute instructions.

Quick sidenote: memory is hard

The seemingly simple problem of manipulating data and storing it somewhere so that you can get it back later is actually... not simple. Giving your CPU access to a big linear array of memory cells is not enough, you have to decide what you are going to do with it. Look at this Python program:

print("Hello, World!")

Where should the string be stored? It's gotta be somewhere. What about print? It's not an instruction, it's just a global variable that happens to be set to an object of type builtin_function_or_method, that you can call with the () operator. It's gotta be stored somewhere too. Remember, the only thing you really have at the moment is a big array of numbers. On top of that, the only set of operations you really have is {"load value from address", "store value at address"}. Or is it?

The CPU speaks in assembly instructions. These instructions have a fixed, defined encoding, and on the ARM Thumb instruction set they always (i.e. almost always) have the same size: 16 bits. Ignoring the header of the instruction (that tells you which one it is), that will take up a few bits, we quickly see that if we were to give address as immediates (constant values, in the instruction), we couldn't address more than 216 bytes of memory.

Hence: addressing modes and memory alignment.

If we look at everyday life programs, we can observe that there are two main use cases for memory: storing local variables (variables in functions, or parameters), and storing global variables (global configuration, memory that will be shared between programs).

Use case Allocation size Maximum lifetime Time of allocation Time of free
Local Generally small Current function call When entering function When leaving function
Global Any Static (lifetime of the program) Any Any

There's a clear difference: on one hand, "local memory", which is used for small, deterministic allocations, and "global memory", which is used for anything, at any time, with very few constraints.

How does that map to our "big block of cells"? We'll start with the "global" memory. We don't really know anything about how it'll be used, so we can't make too many assumptions. You can ask for any amount of bytes at any moment, and give it back to the OS any time you want, and you usually expect "given-back" space to be useable by subsequent allocations. This is hard. The real-world equivalent would be a big pile of stuff, laying on the ground, waiting for some program to pick it up, use it, or throw it in the trash. For this reason, it's called the heap.

Next, we can see that the "local memory" evolves in a specific way: it grows when we enter a function, shrinks when we exit it, and function calls follow a stack-like pattern (when you've entered a function, you can do anything you want, but you always end up exiting it at some point). In fact, it really is a stack (in the algorithmic data structure meaning); it's got two operations: push (grow) and pop (shrink). This "local memory" is called the stack.

Since it grows and shrinks that way, we don't really have to do any bookkeeping of the blocks of allocated memory, where they are, what strategy to use to choose where to allocate new blocks, etc. The only piece of data we need is the "depth" (i.e. how deep we are in that stack, or in other words, the length of the stack). The way this is usually done is that we set some place in memory to be the beginning of the stack, and we keep a global variable somewhere (for example, in a register) that contains the position in memory of the topmost item of the stack: the stack pointer (on ARM, sp, or its full name r13).

There's something I haven't specified yet: which direction the stack grows. Some architectures make it grow upward (push = increment stack pointer, pop = decrement), but most do it the opposite way and make it grow downward. Growing it downward means that you can easily make the heap start at address 0 and make the stack start at whatever the maximum address is, and you're assured that they won't collide until the heap grows too far up or the stack grows too far down.

In memory parlance, upward means increasing and downward means decreasing. "The stack grows downward" means that growing the stack decreases the stack pointer, and vice versa. Many diagrams online illustrate memory with the address 0 at the top, suggesting that downward means increasing, but this is misleading.

Now that we know how memory works, how do we access it? We've seen that we can't address it all, since instructions are too small, so how can we counter that?

Well, the answer is to use different addressing modes. For example, if you want to access memory on the stack, you'll usually be accessing things that are on top of the stack (e.g. your local variables), so instead of giving the full memory address (big), you only have to give the distance to the data relative to the stack pointer (small). This is the sp-relative addressing mode, and looks like ldr r1, [sp, #8].

Additionally, we can assume you will mostly store things that are 4 bytes or bigger, so we'll say that the stack is word-aligned: everything will be moved around so that addresses are multiples of 4. This means that we can fit even bigger numbers in the instruction: we only have to store the offset, divided by 4. The instruction above would store 2, for example.

Sometimes, you'll want to store data that is only useful to a single function. For example, switch / match instructions are usually implemented using jump tables: a list of offsets is stored in the program, and the correct offset is loaded and jumped to. Since this will be stored in the code of the function itself, it becomes useful to perform memory operations relative to the current position in the code, and that's how get get pc-relative addressing: ldr r2, [pc, #16]. As with sp, memory is word-aligned, so the offset has to be a multiple of 4.

Quick sidenode: function calls

The simplest way to call functions, in assembly, is through the use of jumps. You put a label somewhere, you jump to it. There's a problem though: how do you go back? A function can be called from multiple places, so you need to be able to "remember" where the function was called from, and you need to be able to jump to an address, rather than to a known label.

The simple approach, like with the stack pointer before, is to use a global variable (i.e. register) to store the address of the caller, and to have a special jump instruction that sets the register to the current position (linking), so we can go back to it later (branching). On ARM, this is the bl (branch-link) family of instructions, and that register is called the link register (abbreviated lr, nickname of r14).

But there's another problem: it doesn't work for nested calls! If you call a function from inside another called function, the value of the link register gets overwritten.

This is actually not a new problem: other registers can get overwritten as well when you call a function, and you can't expect the programmer to read the code of every function they're calling to see which registers are safe and which aren't. Here comes the calling convention: on every architecture (x86, ARM, ...) there's a set of rules (the ABI) that tell you how everything works, what a function is allowed to do, and specifically what registers should be preserved by the callee. A preserved register is not read-only: the callee can do whatever it wants with it, as long as when the control is given back to the caller, the old value is back.

The way to solve this is through register saving. When entering a function, space is allocated on the stack for local variables but also for registers that have to be preserved, and when exiting, the original values are pulled back from the stack into the registers.

Among those registers, on ARM, the link register is saved too. One cool aspect of ARM special registers being useable as general-purpose registers, is that you don't have to use a branch instruction to jump somewhere: you can just write into pc!

The usual pattern for functions in ARM assembly is thus:

    push {r4, r5, lr} ; save r4, r5 and lr
    movs r4, #123 ; do stuff
    movs r5, #42
    pop {r4, r5, pc} ; restore the values to r4, r5 and *pc*!


Not much is needed to make a circuit look like a computer. For starters, you may want to start with:

  • a keyboard (reading raw character input);
  • a terminal display (displaying characters, like a terminal emulator);
  • a video display (displaying raw pixel data);
  • a random number generator;
  • a decimal 7-segment display;
  • a network card (that can receive and transmit data via TCP).

All of these are seen as addresses in memory by the CPU and programs running on it. For example, writing a byte to address 0xFFFFFF00 will display a character in the terminal display. Reading a byte from address 0xFFFFFF18 will tell whether the keyboard buffer is empty or not.

Running code

The simplest way to run code on this thing is to simply write machine code and load it into the ROM.

Here's a simple program:

movs r0, #255	; r0 = 255 (0x000000FF)
mvns r0, r0		; r0 = ~r0 (0xFFFFFF00, address of terminal)
movs r1, #65    ; r1 = 65  (ASCII code of 'A')
str r1, [r0]    ; *r0 = r1

It's assembled as 20ff 43c0 2141 6001 (8 bytes), and when loaded and run, it shows this after 4 cycles:

terminal display showing the character A

Of course, writing programs in assembly isn't exactly practical. We invented macro assemblers and high-level (compared to assembly) programming languages for this a long time ago, so let's do that here. I originally went with C, but quickly switched to Rust for the ease of use and powerful macro support (useful for a constrained environment like this one).

Rust (technically, the reference compiler, rustc) uses LLVM as a backend for compilation, so any target LLVM supports, Rust supports it to some extent. Here, I'm using the builtin target thumbv6m-none-eabi (ARM v6-M Thumb, no vendor or OS, embedded ABI), but there's a big constraint: my CPU is not a full ARM CPU.

Since not all instructions are supported (some are emulated by my homemade assembler), I can't just build ARM binaries and load them. I need to use my own assembler, so I'm calling the compiler directly and telling it to emit raw assembly code, which is then sent to my assembler that finally generates a loadable binary file.

Additionally, since I'm running code without an OS, without any external code, I can't use Rust's standard library. This is a perfectly supported use case (called no_std), and doesn't mean I can't use anything at all: it only means that instead of using the std crate (the usual standard library), I use the core crate that contains only the bare necessities and specifically does not depend on an operating system running underneath. The core crate however does not include anything that relies on heap allocations (such as String or Vec), these are found in the alloc crate that I don't use either for a number of complex reasons related to my build system.

Basically, I wrote my own standard library. I'm now able to write programs like:

fn main() {
    println!("Hello, world!");

    screen::circle(10, 10, 20, ColorSimple::Red);

    let x = fp32::from(0.75);
    let mut video = screen::tty::blank().offset(50, 50);
    println!("sin(", Blue.fg(), x, Black.fg(), ") = ", Green.fg(), x.sin(), => &mut video);

and get:


Using rustc's raw assembly output means that I can't rely on code from other crates than the one I'm building (that would require using the linker, which I am not using here). I can't even use the compiler's intrinsics: functions such as memcpy or memclr are often used to perform block copies, but they aren't present in the generated assembly, so I had to implement them myself (I borrowed some code from Redox here).

Another problem is that since I am emulating some instructions (by translating them into sequence of other, supported instructions), branch offsets can get bigger than what the compiler expected. Problem: conditional branches on Thumb take an 8-bit signed immediate, so if you try to jump more than 128 instructions ahead or behind, you can't encode that instruction.

In practice, this means that I often have to extract code blocks from functions to make them smaller, and that the whole codebase is sprinkled with #[inline(never)] to force the compiler to keep these blocks in separate functions.

Implementing a useable standard library is not the easiest task; ensuring that the whole thing is ergonomic and pleasant to use is even harder. I had to use many unstable (nightly-only) features, such as GATs, associated type defaults, and specializations, among others.

This whole project comforts my choice to use Rust for future low-level/embedded development projects. Doing a hundredth of this in C would have been orders of magnitude harder and the code wouldn't have been anywhere as readable as this one is.




This plotter uses the fixed-point (16.16) numeric library and the video display.

Trigonometric functions were implemented using Taylor series (I know, CORDIC, but I like pain).

BASIC interpreter

This is a simple BASIC interpreter / REPL, similar to what was found on home computers of the 80s (e.g. C64). You can input programs line by line, display them, and run them. The supported instructions are PRINT, INPUT, CLS, GOTO and LET. The prompt supports LIST, RUN, LOAD, ASM and ASMRUN.

"Hello World" program

Programs can also be loaded over the network (similar to cassette loading on the C64), with a program such as:

cat $1 <(echo) |	# read file, add newline
dos2unix |			# convert line ends to Unix (LF)
nc -N ::1 4567		# send to port

Here, LOAD starts listening and the received lines are displayed with a # prefix and read as if they were being typed by the user:

Fibonacci sequence, loaded over network

Programs can also be compiled to Thumb assembly for higher performance:


In case you're wondering, here's how that part works:

  • each BASIC instruction and expression is converted (compiled) to a sequence of assembly instructions, for example CLS simply stores 12 (\f) in the address of the terminal output, while an expression simply sets up a little stack machine that evaluates the formula and stores the result in r1, and a LET instruction evaluates its expression and stores r1 into the variable's memory cell
  • once the program has been compiled, it's executed like a function, like this:
let ptr = instructions.as_ptr();
let as_fn: extern "C" fn() -> () = unsafe { core::mem::transmute(ptr) };
  • a bx lr instruction is appended at the end of the instruction list, so when the program finishes, it gives the control back to the interpreter

Web server

This program listens for HTTP requests, parses them, and processes them.

screenshot of circuit, terminal display showing request contents, and Firefox showing the web page


This is an REPL for a small but useable enough subset of R6RS. It supports most important primitive forms, many builtins and macros.

Supported datatypes are symbols, integers, booleans, strings, lists (internally stored as vectors), void and procedures (either builtin functions or user-defined closures).

addition and Fibonacci sequence

As with the BASIC interpreter, programs can be loaded over the network:

macro creating the named-let form, allowing for easier loop definition
function that creates a mutable accumulator, using closures

Terminal emulator

This is a simple terminal emulator that supports a subset of the ANSI (VT) escape codes (enough to display pretty colors and move the cursor around).

It uses a 5x7 font borrowed from here, and the ANSI decoding logic was written by hand (there are crates out there that do just that, but they support all the ANSI codes, whereas I only needed a very small subset for this program).

A script such as this one can be used to start a shell and pipe it to the circuit:

exec 3<>/dev/tcp/
cd ~
unbuffer -p sh -c 'stty echo -onlcr cols 80 rows 30 erase ^H;sh' <&3 1>&3 2>&3

MIDI player

Digital provides a MIDI output component, that supports pressing or releasing a key for a given instrument, so I wrote a simple program that uses midly to parse a MIDI file sent over network and then decodes useful messages to play the song.

Since channels are stored sequentially in a MIDI file, and since I only have one output anyway, I wrote an algorithm that merges channels together into a single message list. Events are already stored chronologically, so this is simply a "merge k sorted arrays" problem, that can be solved by recursively merging halves of the array (traditional divide-and-conquer approach) in \(O(n k \log k)\) (\(k\) arrays of \(n\) items).

Here's the result:


Parting words

All in all, this was fun. ARM/Thumb is a good architecture to implement as a side project, since it's well-supported by compilers and a small enough subset of it can suffice to run interesting code. Logisim and Digital are both excellent tools for digital circuit simulation. Rust is nice for doing things and making stuff.

Repository available here. Don't look at the commit messages.

How I Learned to Stop Worrying and Love Macros

Rust macros are powerful, that's a fact. I mean, they allow running any code at compile-time, of course they're powerful.

C macros, which are at the end of the day nothing more than glorified text substitution rules, allow you to implement new, innovative, modern language constructs, such as:

#define ever (;;)
for ever { 

or even:

#include <iostream>
#define System S s;s
#define public
#define static
#define void int
#define main(x) main()
struct F{void println(char* s){std::cout << s << std::endl;}};
struct S{F out;};

public static void main(String[] args) {
	System.out.println("Hello World!");

But these are just silly examples written for fun. Nobody would ever commit such macro abuse in real-world, production code. Nobody...

/*	mac.h	4.3	87/10/26	*/

 *	UNIX shell
 *	S. R. Bourne
 *	Bell Telephone Laboratories

#define IF		if(
#define THEN	){
#define ELSE	} else {
#define ELIF	} else if (
#define FI		;}

#define BEGIN	{
#define END		}
#define SWITCH	switch(
#define IN		){
#define ENDSW	}
#define FOR		for(
#define WHILE	while(
#define DO		){
#define OD		;}
#define REP		do{
#define PER		}while(
#undef DONE
#define DONE	);
#define LOOP	for(;;){
#define POOL	}


ADDRESS alloc(nbytes)
    POS     nbytes;
    REG POS rbytes = round(nbytes+BYTESPERWORD,BYTESPERWORD);

    LOOP    INT     c=0;
    REG BLKPTR  p = blokp;
    REG BLKPTR  q;
    REP IF !busy(p)
        THEN    WHILE !busy(q = p->word) DO p->word = q->word OD
        IF ADR(q)-ADR(p) >= rbytes
        THEN    blokp = BLK(ADR(p)+rbytes);
            IF q > blokp
            THEN    blokp->word = p->word;
        q = p; p = BLK(Rcheat(p->word)&~BUSY);
    PER p>q ORF (c++)==0 DONE
I'm sorry.

This bit of code is taken directly from the original Bourne shell code, from a BSD Tahoe 4.3 source code archive. Steve Bourne was an Algol 68 fan, so he tried to make C look more like it.

But Rust macros ain't like that.

What are Rust macros?

That's a good question. I've got a better one for you:

What are macros?

Macros, at the simplest level, are just things that do other things quicker (for the user) than by manually doing said things.

This definition encompasses quite a lot of features you've probably already encountered. Of course, if you're a developer, hearing "macro" often triggers a fight-or-flight response, deeply rooted in a bad experience with C preprocessor macros. If you're a Microsoft Office power user, "macros" are probably the first thing you're taught to be afraid of (second only to "moving a picture in a Word document"), thanks in part to the large number of malware that used VBA macros to propagate in the early 2000s. This definition even includes the simple, "key-sequence" macros that a lot of programs allowed you to record and bind to keys.

I use the past tense here, because this trend wore off quite long ago, around the same time we stopped seeing MDI interfaces.

These have pretty much disappeared, replaced by tabs and docking UIs

Back on topic. This is about programming, so the macros we're talking about are the ones we find in programming languages. Even though the C ones are the most famous (because there's no other language where there are so few built-in constructs that you have to write macros at one point or another), they weren't the first ones.

A Brief History of Macros

In the early 1950s, if you wanted to write a program for your company's mainframe computer, your choices were more limited than today, language-wise. "Portable" languages (Fortran, COBOL, eventually Algol) were a new concept, so basically everything serious was written in whatever machine language your computer understood. Of course, you didn't write the machine language directly, you used what was and is still called an assembler to translate some kind of textual representation into the raw numeric code data you'd then pass on to the mainframe using punch cards or whatnot.

After some time, assemblers started providing ways to declare "shortcuts" for other bits of code. I'll make a bit of an anachronism here by writing x86 assembly targeting Linux. Let's say you want to make a simple "Hello World" program:

section .text						; text segment (code)
	mov eax, 4 						; syscall (4 = write)
	mov ebx, 1 						; file number (1 = stdout)
	mov ecx, message				; string to write
	mov edx, length					; length of string
	int 80h							; call kernel

section .data
	message db 'Hello World', 0xA	; d(efine) b(yte) for the string, with newline
    length equ $ - message			; length is (current position) - (start)

Don't worry if you don't fully understand the code above, it's not the point. Here, writing a string takes 5 whole lines (filling up the 4 parameters and then calling the kernel). Compare this to Python:

print("Hello World")

If only there were a way to tell the assembler that those five lines are really one single operation, that we may want to do often...

%macro print 2		; define macro "print" with 2 parameters
	mov eax, 4
	mov ebx, 1
	mov ecx, %1		; message is first parameter
	mov edx, %2		; length is second parameter
	int 80h	

section .text
	print message1, length1
	print message2, length2
section .data
	message1 db 'Hello World', 0xA
	length1 equ $ - message1
	message2 db 'Simple macros', 0xA
	length2 equ $ - message2

Here, the "macro" is just performing simple text substitution. When you write print foo, bar, the assembler replaces the line with everything between %macro and %endmacro, all while also replacing every occurrence of %N with the value of the corresponding parameter. This is the most common kind of macro, and is pretty much what you get in C:

#define PRINT(message, length) write(STDOUT_FILENO, message, length)
Don't do this in C.

Too good to be true

Obviously, it has limitations. What if, in assembly, you did this:

mov eax, 42		; just storing something in eax
print foo, bar	; just printing my message
mov ebx, eax	; where's my 42 at?

Unbeknownst to you, the print macro modified the value of the eax register, so the value you wrote isn't there anymore!

What if, in C, you did this:

#define SWAP(a, b) int tmp = a;	\
                   a = b;		\
                   b = tmp;

int main() {
	int tmp = 123;
    // ...
	int x = 5, y = 6;
    SWAP(x, y); // ERROR: a variable named 'tmp' already exists

Here, there's a conflict between the lexical scope of the function (main) and the macro (SWAP). Short of using names like __macro_variable_dont_touch_tmp in your macros, there's not much you can do to entirely prevent problems like this. What about this:

int main() {
	int x = 5, y = 6, z = 7;
    int test = 100;

	if (test > 50)
    	SWAP(x, y);
    	SWAP(y, z);

The above code does not compile. It walks like correct code and quacks like correct code, but here's what it looks like after macro-expansion:

int main() {
	int x = 5, y = 6, z = 7;
    int test = 100;

	if (test > 50)
    	int tmp = x;
        x = y;
        y = tmp;
    	int tmp = y;
        y = z;
        z = tmp;

Braceless ifs must contain exactly one statement, but here there are 3 of them! Let's fix it:

#define SWAP(a, b) {int tmp = a;	\
                   a = b;			\
                   b = tmp;}

Now, it should work, shouldn't it? Nope, still broken!

if (test > 50)
   	{int tmp = x;
    x = y;
    y = tmp;};
   	{int tmp = y;
    y = z;
    z = tmp;};

Not seeing it? Let me reformat it for you:

if (test > 50) {
   	int tmp = x;
   	x = y;
  	y = tmp;

Since we're writing SWAP(x, y); there's a semicolon hanging right there, after the code block, so the else is not connected to the if anymore. The solution, obviously, is to do:

#define SWAP(a, b) do{int tmp = a;		\
                   a = b;				\
                   b = tmp;}while(0)

Here, the expanded code is equivalent to the one we had before, but requires a semicolon afterwards, so the compiler is happy.

Another simple example is

#define MUL(expr1, expr2) expr1 * expr2

int res = MUL(2 + 3, 4 + 5);

This gets expanded to

int res = 2 + 3 * 4 + 5; // bang!

Macros have no knowledge of concepts such as "expressions" or "operator precedence", so you have to resort to tricks like adding parentheses everywhere:

#define MUL(expr1, expr2) ((expr1) * (expr2))
But... it's broken? There's no reason anyone should have to do this sort of syntax wizardry just to get a multiline macro or a macro processing expressions to work!

A few years ago, in the 1960s to be precise, some smart guys in a lab realized that "just replace bits of text by other bits of text" was not, bear with me, the best way to do macros. What if, instead of performing modifications on the textual form of the code, the macros could work on an abstract representation of the code, and likewise produce an output in a similar way.

SaaS (Software as an S-expression)

This is a Lisp program (and its output):

> (print (+ 1 2))

Lisp (for LISt Processing) has a funny syntax. In Lisp, things are either atoms or lists. An atom, as its name implies, is something "not made of other things". Examples include numbers, strings, booleans and symbols. A list, well, it's a list of things. A list being itself a thing, you can nest lists. This is a list containing various things (don't try to run it, it's not a full program):

(one ("thing" or) 2 "inside" a list)

You may notice that this looks awfully like the program I wrote earlier; it's not luck: one of Lisp's basic tenets is that programs are just data. A function call, that most languages would write f(x, y) can simply be encoded as a list: (f x y).

The technical term for a "thing" (something that is either an atom or a list, with lists written with parentheses and stuff) is s-expression.

When you give Lisp an expression, it tries to evaluate it. A symbol evaluates to the value of the variable with the corresponding name. Other atoms evaluates to themselves. A list is evaluated by looking at its first element, which must be a function, and calling it with the rest of the list as its parameters. You can tell Lisp to not evaluate something, using a function (technically...) called quote.

> (+ 4 5)
> (quote (+ 4 5))
(+ 4 5)

In the end, you get code like this:

> (print (length (quote (a b c d))))

(print... and (length... are evaluated, but (a... is kept as it, because it's really a list, not a bit of code.

The opposite of quote is called eval:

> (quote (+ 4 5))
(+ 4 5)
> (eval (quote (+ 4 5))

Through this simple mechanic, Lisp allows you to modify programs dynamically as if they were any other kind of data you can manipulate – because they really are any other kind of data you can manipulate.

Let's rewrite our MUL macro from before. I'll define a function which takes two parameters, and returns code that multiply them.

> (define (mul expr1 expr2)
  	(list (quote *) expr1 expr2)) ; * is quoted so it appears verbatim in the output
> (mul (+ 2 3) (+ 4 5))
  	(* 5 9)

That's not exactly what I want, since I don't want the operands to be evaluated right at the beginning, so I'll quote them:

> (mul (quote (+ 2 3)) (quote (+ 4 5)))
  	(* (+ 2 3) (+ 4 5))
> (eval (mul (quote (+ 2 3)) (quote (+ 4 5))))

You'll notice right away that we don't have any operator precedence problem like we had in C. But we do have problems: we have to put (quote ...) around every operand to prevent it from being evaluated, and we have to (eval ...) the result to really run the code that was produced. Since these steps are quite common, they were abstracted away in a language builtin called define-macro:

> (define-macro (mul expr1 expr2)
  	(list (quote *) expr1 expr2))
> (mul (+ 2 3) (+ 4 5))

Here's what the SWAP macro would look like:

(define-macro (swap var1 var2)
    	(let ((tmp (unquote var1)))
    		(set! (unquote var1) (unquote var2))
        	(set! (unquote var2) tmp)))

I'm using functions I haven't talked about yet. quasiquote does the same thing as quote, that is, return its argument without evaluating it, except that if you write (unquote ...) somewhere in it, the argument of unquote is inserted evaluated. You don't have to understand this, only that all of these tools are, in the end, nothing more than syntactic sugar for manipulating lists. I could've written swap using only list manipulation functions:

(define-macro (swap var1 var2)
	(list 'let (list (list 'tmp var1))
    	(list 'set! var1 var2)
        (list 'set! var2 'tmp)))

We could argue that even list is syntactic sugar, and it's true, the real primitive used to build lists is cons: (list 1 2 3) can be constructed by doing (cons 1 (cons 2 (cons 3 ()))). () is a special value that corresponds to the empty list, it's often used as an equivalent of null.

set! is just what you use to change a variable's value.

Quick sidenote, since quote, unquote and quasiquote are very common, there's even more syntactic sugar in the language to write them more concisely: ' for quote , , for unquote and ` for quasiquote. The code above, in a real codebase, would look like this:

(define-macro (swap var1 var2)
  	`(let ((tmp ,var1))
		(set! ,var1 ,var2)
		(set! ,var2 tmp)))

That's cheating, Lisp isn't a real language anyway

I mean, obviously. Real languages have syntaxes way more complex than lists of things. When you look at a real program written in a real language, for example C, you don't see a list. You see blocks, declarations, statements, expressions.

int factorial(int x) // function signature
{ // code block
	if (x == 0) // conditional statement
    	return 1; // return statement
    	return x * factorial(x - 1); // expression, function call


(define (factorial x)
	(if (zero? x)
        (* x (factorial (- x 1)))))

Or, linearly:

(define (factorial x) (if (zero? x) 1 (* x (factorial (- x 1)))))
That's a list of lists if I've ever seen one

When a compiler or interpreter reads a program, it does something called parsing. It reads a sequence of characters (letters, digits, punctuation, ...) and converts it (this is the non-trivial part) into something it can process more easily.

Think about it, when you read the block of C code above, you don't read a sequence of letters and symbols. You see that it's a function declaration, with a return type, a name, a list of parameters and a body. Each parameter has a name and a type, and the body is a code block containing an if-statement, itself containing more code.

A data structure that stores things that are either atomic or made of other things is called a tree. Lisp lists (try saying that out loud quickly) can contain atoms or other lists, they're just one way of encoding trees.

Here, we're using a tree to store a program's source code, which is text, but we know that the code respects a set of rules, called the syntax. Oh, and we abstract away non-essential details like whitespace, parentheses and whatnot.

We might as well call that an Abstract Syntax Tree! (a.k.a. AST)

G fdecl Function declaration 'factorial' id1 Type 'int' fdecl->id1 Return type plist Parameter list fdecl->plist Parameters body 'If' statement fdecl->body Body p1 Parameter 'x' plist->p1 id4 Type 'int' p1->id4 Type cond Binary operation '==' body->cond Condition body_true [code if true] body->body_true If true body_false [code if false] body->body_false If false cop1 Variable 'x' cond->cop1 Operand 1 cop2 Constant 0 cond->cop2 Operand 2

I've omitted some details from the diagram above for the sake of brevity, but you get the idea. Code (text) becomes code (tree), and code (tree) matches more closely the mental idea we have of what code (text) means.

We reach the same conclusion we had we Lisp: code is data. The only difference is that in Lisp, you can really take code and turn it into data, it's "built-in", whereas there's nothing in C for that. The main reason is that code is a weird kind of data. Numbers are simple; arrays, a bit more convoluted but still simple; code is hard to reason about. Trees and stuff. Lisp is built around dynamic lists, so it's easy. C is built around human suffering, it's definitely not made for manipulating code. I mean, imagine writing a parser, or even a compiler in C.

Being able to manipulate code from code is called metaprogramming, and few languages have it built-in. Lisp does it, because it's Lisp. C# does it too, albeit only for a (large enough) subset of the language, with what they call "expression trees":

void Swap<T>(Expression<Func<T>> a, Expression<Func<T>> b)
	var tmp = Expression.Parameter(typeof(T));
	var code = Expression.Lambda(
		Expression.Block(new [] { tmp }, 		// T tmp;
			Expression.Assign(tmp, a.Body),		// tmp = [a];
			Expression.Assign(a.Body, b.Body),	// [a] = [b];
			Expression.Assign(b.Body, tmp)));	// [b] = tmp;
	var compiled = (Func<T>) code.Compile();

class Foo
	public int A;
	public int B;

var obj = new Foo { A = 123, B = 456 };
Swap(() => obj.A, () => obj.B);

It's a bit more complicated than in Lisp, because here, the way to achieve what we did with quote (i.e., pass an unevaluated expression to a function) involves declaring a parameter with the Expression<T> type, with T being a function type. This means that you can't pass any expression directly, you must pass a function returning that expression (hence the () =>). In other words, () => x is the closest C# provides to (quote x).

We don't have eval either, instead we can compile an expression we built into a real function we can then call (and it's that call that does what eval would do in Lisp in that context).

Building code is also more complicated: since C# code is not made of lists, you can't just create a sequence of things and call it a day, code here is stored as objects ("expression trees") that you build using functions such as Expression.Assign or Expression.Block.

All of this also means that only a subset of the language is available through this feature – you can't have classes in functions for examples. At the end of the day, it's not really a problem, most problems solved by macros are solved through other means in C#, and this metaprogramming-like expression tree wizardry is almost only ever used in contexts where only simple expressions will be used.

Long Recaps Considered Harmful

3,000! If you're still there, you've just read 3,000 words of me rambling about old languages and weird compiler theory terminology. This post was supposed to be about Rust macro abuse.

Rust has macros. Twice.

Rust supports two kinds of macros: declarative macros and procedural macros.

Declarative macros are a bit like C macros, in that they can be quite easy to write, although in Rust they are much less error-prone. See for yourself:

macro_rules! swap {
    ($a:expr, $b:expr) => { 
        let tmp = $a;
        $a = $b;
        $b = tmp;

fn main() {
    let (mut a, mut b) = (123, 456);
    swap!(a, b);
    println!("a={} b={}", a, b);
macro_rules! mul {
	($a:expr, $b:expr) => {
    	$a * $b

fn main() {
	println!("{}", mul!(2 + 3, 4 + 5)); // no operator precedence issue

Declarative macros can operate on various parts of a Rust program, such as expressions, statements, blocks, or even specific token types, such as identifiers or literals. They can also operate on a raw token trees, which allow for interesting code manipulation techniques. But even though they work in a cleaner way and support advanced patterns, with repetitions and optional parameters, they're just a more advanced form of substitution. So, closer to C macros.

Procedural macros, on the other hand, are more like Lisp macros. They're written in Rust, get passed a token stream (a list of tokens from the source code) and are expected to give one back to the compiler. Apart from that, they can do basically anything.

They look like this:

pub fn macro_name(input: TokenStream) -> TokenStream {

They're often used for code generation, for example for generating methods from a struct definition, for example:

struct Person {
	name: String,
    age: u8

This generates an implementation of the Debug trait for the Person type, which will contain code allowing to get a pretty-printed, human-readable version of any Person object when needed. derive(PartialEq) and derive(PartialOrd) generate equality and ordering methods, etc.

But there are less... orthodox uses for procedural macros.

Mara Bos famously wrote some interesting crates – the first one, inline_python, allows running Python code from Rust seamlessly, with bidirectional interaction (for variables):

use inline_python::python;

fn main() {
    let who = "world";
    let n = 5;
    python! {
        for i in range('n):
            print(i, "Hello", 'who)

I highly recommend reading her blogpost series on the subject where she goes deep in detail on how to implement such a macro. It involves lot of subtle tricks needed to deal with how the Rust compiler reads and tokenizes code, how errors can be mapped from Python to Rust, etc.

She also wrote whichever_compiles, which runs multiple instances of the compiler to find... whichever bit of code compiles, among a list you provide:

use whichever_compiles::whichever_compiles;

fn main() {
    whichever_compiles! {
        try { thisfunctiondoesntexist(); }
        try { invalid syntax 1 2 3 }
        try { println!("missing arg: {}"); }
        try { println!("hello {}", world()); }
        try { 1 + 2 }

whichever_compiles! {
    try { }
    try { fn world() {} }
    try { fn world() -> &'static str { "world" } }

The macros forks the compiler process, at compile-time, and each child process tries one of the branches. The first one to compile wins, and the winning branch is used for the rest of the build process.

After a series of unfortunate events, I was informed of the existence of procedural macros, and decided that I had to make one. After some nights of work, I brought to the world embed-c, the first procedural macro to allow anyone to write pure, unadulterated C code in the middle of a Rust code file. Complete with full interoperability with the Rust code, obviously. This has made a lot of people very angry and been widely regarded as a bad move.

use embed_c::embed_c;

embed_c! {
    int add(int x, int y) {
        return x + y;

fn main() {
    let x = unsafe { add(1, 2) };
    println!("{}", x);

It uses a library called C2Rust which, really, is what it sounds like. It's a toolset that relies on Clang to parse and analyze C code, and generates equivalent (in behavior) Rust code. Obviously, the generated code is not idiomatic, and quickly becomes unreadable if you use enterprise-grade C control flow features such as goto. But can Rust really replace C in the industry without a proper implementation of Duff's device?

embed_c! {
    void send(to, from, count)
        register short *to, *from;
        register count;
        register n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to++ = *from++;
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n > 0);

fn main() {
    let mut source = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let mut dest = [0; 10];
    unsafe { send(dest.as_mut_ptr(), source.as_mut_ptr(), 10); };
    assert_eq!(source, dest);

After seeing Mara's inline_python crate, I was taken aback by her choice of such an outdated language – Python was created in 1991!

VBScript, first released in 1996, is a much more modern language than Python. It provides transparent COM interoperability, is supported out-of-the-box on every desktop version of Windows since 98 – even Windows CE on ARM is supported; and it has been since 2000, whereas Python won't run on Windows ARM until 3.11 (2022).

As such, I had no other choice but to create inline_vbs, for all your daily VBS needs.

use inline_vbs::*;

fn main() {
    vbs![On Error Resume Next]; // tired of handling errors?
    vbs![MsgBox "Hello, world!"];
    if let Ok(Variant::String(str)) = vbs_!["VBScript" & " Rocks!"] {
        println!("{}", str);

It relies on the Active Scripting APIs, that were originally designed to allow vendors to add scripting support to their software, and it's actually a nice idea. You can have multiple languages providers, and a program relying on the AS APIs would automatically support all installed languages. The most common were JScript and VBScript, because they were installed by default on Windows, but you could add support for Perl, REXX or even Haskell. Haskell! Think about it. This means that on a computer with the Haskell provider installed, this bit of code would be valid and would kinda work in Internet Explorer:

        <TITLE>Active Scripting demo</TITLE>
            main :: IO ()
            main = putStrLn "Hello World"

One major pain point is that VBScript, like Python, is a dynamic language, where values can change type, something that statically-typed languages like Rust are proud to say they do not like at all, thank you very much.

Since VBScript is handled through COM APIs, values are transferred using the VARIANT COM type, which is pretty much a giant union of every COM type under the sun. Luckily, this matches up perfectly with Rust's discriminated unions – I take it as a sign from the universe that Rust and VBScript were made to work together.

That's pretty much it for today.