Programming

A 4-post collection

Six handshakes away

Have you ever heard of "six degrees of separation"? It's the famous idea that any two individuals chosen at random in a population are connected by a chain of no more than about six people. Given enough people, you'll always find someone whose uncle's colleague has a friend who knows your next-door neighbour.

Fun fact: it's where the name of the long-forgotten social network sixdegrees.com came from.

Mathematically, it checks out. If you have 10 friends and each of those friends has 10 friends, that's in theory a total of 1 + 10 + 9×10 = 101 individuals (each of your 10 friends brings 9 people who aren't you). In practice, when you have 10 friends, they probably know each other as well, and their friends most probably do too. You end up with far fewer than 101 people, and no two people in your "social graph" ever end up more than one or two handshakes away from each other.

In graph theory, those kinds of graphs where you have densely connected communities, linked together by "hubs", i.e. high-degree nodes, are called "small-world networks".

Oh you know Bob? Isn't it a small world!

I learned about it a few weeks ago in a very nice (French) video on the subject, and immediately thought "I wonder what the graph of everyone I know looks like". Obviously, I can't exhaustively list every single person I've met in my life and put them on a graph.

Or can I?


One of the few good things™ Facebook gave us is really fast access to petabytes of data about people we know, and especially about our relationships with them. I can open up my childhood best friend's profile page and see everyone he's "friends" with, click on a random person and see who they're friends with, et cætera. So I started looking for the documentation of Facebook's public API which, obviously, exists and allows looking up this kind of information. I quickly learned that the exact API I was looking for didn't exist anymore, and that all of the "alternative" options (Web scrapers) I found were either partially or completely broken.

So I opened up PyCharm and started working on my own scraper, which would simply open Facebook in a Chromium WebDriver instance and fetch data using ugly XPath queries.

def query(tab):
    return "//span[text() = '" + tab + "']/ancestor::div[contains(@style, 'border-radius: max(0px, min(8px, ((100vw')]/div[1]/div[3]/div"
Truly horrible.

After 180 lines and some testing, I had something that worked.

Basically, the script loads a Facebook account's friends list page, scrolls to the bottom and waits for the list to finish loading dynamically, then fetches all the links in a specific <div>, each of which conveniently contains the ID of a friend. It then adds all of those IDs to the stored graph, iterates through them and repeats the whole process. It's a BFS (breadth-first search) over webpages.
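
The crawl loop itself is nothing fancy; here's a minimal sketch of it, where fetch_friend_ids is a hypothetical stand-in for the WebDriver/XPath scraping described above (load the friends page, scroll until the list stops growing, extract the IDs from the links):

from collections import deque

def crawl(start_id, fetch_friend_ids, max_accounts=100_000):
    # BFS over friends-list pages, starting from a single account ID
    graph = {}                                 # adjacency lists: account ID -> friend IDs
    queue = deque([start_id])
    seen = {start_id}
    while queue and len(graph) < max_accounts:
        account = queue.popleft()
        friends = fetch_friend_ids(account)    # empty if the friends list is private
        graph[account] = friends
        for friend in friends:
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return graph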

In the past few years, a lot of people started realizing just how much stuff they were giving away publicly on their Facebook profile, and consequently made great use of the privacy settings that allow, for example, restricting who can see your friends list. A small step for man, but a giant leap in breaking my scraper. People with a private friends list appear on the graph as leaves, i.e. nodes that only have one neighbour; I ignore those nodes while processing the graph.

The script stores the relationships as adjacency lists in a huge JSON file (74 MiB as I'm writing this), which is then converted to GEXF using NetworkX.
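
The conversion itself looks something like this (a sketch: the file names are placeholders, and the JSON is assumed to map each account ID to the list of its friends' IDs):

import json
import networkx as nx

# Load the scraped adjacency lists and build an undirected friendship graph
with open("friends.json", encoding="utf8") as f:
    adjacency = json.load(f)

g = nx.Graph()
for account, friends in adjacency.items():
    g.add_edges_from((account, friend) for friend in friends)

# Write it out in a format Gephi can open directly
nx.write_gexf(g, "friends.gexf")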

Now in possession of a real graph, I can fire up Gephi and start analyzing stuff.


The graph you're seeing contains around 1 million nodes, each node corresponding to a Facebook account and each edge meaning two accounts are friends. The nodes and edges are colored according to their modularity class (fancy name for the virtual "community" or "cluster" they belong to), which was computed automatically using equally fancy graph-theoretical algorithms.
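
Gephi does this with a couple of clicks; for the curious, here's a rough NetworkX equivalent of that step (modularity-based community detection on the graph g built above; not necessarily the exact algorithm Gephi uses, and noticeably slower at this scale):

from networkx.algorithms.community import greedy_modularity_communities

# Group nodes into communities by greedily maximizing modularity,
# then assign each node the index of its community ("modularity class")
communities = greedy_modularity_communities(g)
modularity_class = {node: i for i, community in enumerate(communities) for node in community}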

At 1 million nodes, laying out the graph and computing the useful measurements takes about 60 hours (most of which is spent calculating the centrality of each node) on my 4th-gen i7 machine.
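
Assuming the expensive part is betweenness centrality (a plausible culprit, since computing it exactly requires a shortest-path pass from every single node), one way to cut that time outside of Gephi is to approximate it by sampling source nodes, which NetworkX supports directly:

import networkx as nx

# Approximate betweenness centrality using 1,000 randomly sampled
# source nodes instead of all 1 million of them
betweenness = nx.betweenness_centrality(g, k=1000, seed=42)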

About those small-world networks: one of their most remarkable properties is that the average length of the shortest path between two randomly chosen nodes grows proportionally to the logarithm of the total number of nodes. In other words, even in huge graphs, you'll usually get surprisingly short paths between nodes.
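
You can see that logarithmic growth with a toy experiment on synthetic graphs (a sketch using NetworkX's Watts-Strogatz small-world model, not my scraped data, so the numbers are only illustrative):

import networkx as nx

for n in (100, 1_000, 5_000):
    # each node starts with 10 neighbours, and 10% of the edges are rewired randomly
    toy = nx.connected_watts_strogatz_graph(n, k=10, p=0.1, seed=42)
    print(n, round(nx.average_shortest_path_length(toy), 2))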

But what does that mean in practice? On this graph, there are people from dozens of different places where I've lived, studied or worked. Despite that, my dad, who lives near Switzerland, is only three handshakes away from my colleagues on the other side of the country.

More formally, the above graph has a diameter of 7: even the two most distant accounts are linked by a chain of 7 friendships, which puts at most 6 people between any two accounts on the graph.

In the figure above, we can see the cumulative distribution of degrees on the graph: for a given number N, the curve shows how many individuals have N or more friends. Intuitively, the curve is monotonically decreasing, because as N gets bigger, fewer and fewer people have that many friends. On the other hand, almost everyone has at least 1 friend.

You may notice a steep hill at the end, around N = 5000. This is because 5000 is the maximum number of friends you can have on Facebook, so you get many people with a friend count very close to it simply because they've "filled up" their friends list.
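
Computing that curve from the graph is straightforward; here's a sketch (reusing the NetworkX graph g from earlier, with matplotlib only used for the plot):

import numpy as np
import matplotlib.pyplot as plt

degrees = np.array([d for _, d in g.degree()])
counts = np.bincount(degrees)              # counts[d] = accounts with exactly d friends
at_least = counts[::-1].cumsum()[::-1]     # at_least[d] = accounts with d or more friends

plt.loglog(np.arange(1, len(at_least)), at_least[1:])
plt.xlabel("N (number of friends)")
plt.ylabel("accounts with at least N friends")
plt.show()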

We can enumerate all pairs of individuals on the graph and compute the length of the shortest path between the two, which gives the following figure:

In this graph, the average distance between individuals is 3.3, slightly lower than the one found in the Facebook paper (4.7). This can be explained by the fact that the researchers had access to the entire Facebook database, whereas I only have the graph I obtained through scraping, which is by construction centred on my own social circle and therefore has shorter distances.
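
If you want to reproduce this kind of figure without Gephi, one cheap approximation is to sample random pairs of nodes and measure their distance (a sketch on the NetworkX graph g, with an arbitrary sample size):

import random
from collections import Counter

import networkx as nx

nodes = list(g.nodes())
lengths = Counter()
for _ in range(10_000):
    a, b = random.sample(nodes, 2)
    try:
        lengths[nx.shortest_path_length(g, a, b)] += 1
    except nx.NetworkXNoPath:
        pass  # the two accounts are in different connected components

total = sum(lengths.values())
print("average distance:", sum(d * c for d, c in lengths.items()) / total)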

The Facebook paper: J. Ugander, B. Karrer, L. Backstrom and C. Marlow, "The Anatomy of the Facebook Social Graph", 2011.

Fix for the Psy-Q Saturn SDK

If you ever want to write code for the Sega Saturn using the Psy-Q SDK (available here), you may encounter a small problem with the toolset when using #include directives.

Example:

#include "abc.h"

int main()
{
    int b = a + 43;
    return 0;
}
main.c
C:\Psyq\bin>ccsh -ITHING/ -S main.c
build.bat
int a = 98;
abc.h

This will fail with the following error: main.c:1: abc.h: No such file or directory, which is quite strange given that we explicitly told the compiler to look in the THING folder.

What we have:

  • CCSH.EXE: main compiler executable (C Compiler Super-H)
  • CPPSH.EXE: preprocessor (C PreProcessor Super-H)

CCSH calls CPPSH on the source file first to get a preprocessed file to compile, and then actually compiles it. Running CPPSH alone still triggers the error, which means the problem indeed comes from CPPSH. After a thorough analysis in IDA, it seems that even though the code that parses the command-line parameters for include directories is there, the paths it reads are never actually added to the program's internal directory array, and thus never used. I could have decompiled it and fixed it myself, but I found a faster and simpler way: use the PSX one.

Though CCSH and CCPSX are very different in nature (one compiles for Super-H and the other for MIPS), their preprocessors are actually almost identical. When you think about it, it makes sense: the C language doesn't depend on the underlying architecture (most of the time), so why would its preprocessor?

So here's the fix: rename CCSH to something else and copy CCPSX to CCSH. This solves all the problems and finally allows compiling C code for the Sega Saturn on Windows (the only other working SDK on the Internet is for DOS, which requires DOSBox and 8.3 filenames and makes big projects complicated to organize).

That's nice and all, but can we compile actual code? It seems the answer is no. Here is a basic file:

#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
	printf("%d\n", 42);

	return 0;
}

Compiling this will give the following error:

In file included from bin/main.c:2:
D:\SATURN\INCLUDE\stdlib.h:7: conflicting types for 'size_t'
D:\SATURN\INCLUDE\stddef.h:166: previous declaration of 'size_t'

Weird, eh?

It seems that the STDLIB.H file in the SDK is somehow wrong, in that it has the following at the top:

#ifndef	__SIZE_TYPE__DEF
#define	__SIZE_TYPE__DEF	unsigned int
typedef	__SIZE_TYPE__DEF	size_t;
#endif
STDLIB.H

Whereas its friend STDDEF.H looks like this:

#ifndef __SIZE_TYPE__
#define __SIZE_TYPE__ long unsigned int
#endif
#if !(defined (__GNUG__) && defined (size_t))
typedef __SIZE_TYPE__ size_t;
#endif /* !(defined (__GNUG__) && defined (size_t)) */
STDDEF.H

Two incompatible declarations, and the compiler gives up. The simple fix is to remove the DEF at the end of the names in STDLIB.H: its guard then tests the same __SIZE_TYPE__ macro that STDDEF.H defines, so the conflicting typedef is skipped when both headers are included. You end up with something like this:

#ifndef	__SIZE_TYPE__
#define	__SIZE_TYPE__	unsigned int
typedef	__SIZE_TYPE__	size_t;
#endif
STDLIB.H

Paella, or how to emulate PSX games in Windows CE userland

Lately, I've been decompiling Tomb Raider 5 with some friends, and while researching potential sources of debug information that could help the process, I stumbled upon the Pocket PC version of Tomb Raider 1. It was ported by Ideaworks3D, a London-based game development company specialized in porting.

It's supposed to run on low-performance handheld devices running Windows Mobile/CE 5.0, so one would imagine that they simply took the Windows code and tweaked it a little to make it run on CE. Well, as I discovered, it's more complicated than that. First, there is no Windows version of TR1: it was only released for DOS and was never ported to either Win16 or Win32. Second, they didn't take the PC version as a base, but the PSX version.

It may seem weird: why take the PSX version if your product is going to run on Windows CE? As it turns out, Ideaworks3D seems to have developed an in-house userland syscall JIT translator for PSX, and uses it when porting games to CE. In other words, they compile the PSX codebase to ARM code and link that binary against a DLL file called iepaella.dll, which contains implementations of the PSX syscalls that call into the WinCE API. Apparently, they also made a version of that DLL that runs on standard Win32, which they used to make an ActiveX port of their Pocket PC port of the PSX version. It allowed playing TR1 in Internet Explorer. Not sure why anyone would ever do that, though.

I find this quite interesting because the main game executable seems to contain code near-identical to the original PSX version, which means that "IEPaella" is effectively a full-featured userland PSX emulator for WinCE and Win32, capable of mapping the PSX system routines to DirectX API calls. I haven't been able to find any other similar product on the internet. The only software that could be considered similar is Usercorn, a userland emulator based on Unicorn which implements most of the Linux, BSD and Darwin syscalls and even some DOS interrupts. It's very basic though, nowhere near what Paella does.

Currently, the codebase of the decompilation project is divided into 3 main folders:

  • GAME contains the shared game code
  • SPEC_PSX contains the PSX platform code
  • SPEC_PC contains the PC platform code

Debugging on PSX is much harder than on PC, because the binary runs in an emulator and you can't just step through the code to see where it crashes. Using Paella would allow doing exactly that, because the binary is effectively being run on the computer and behaves like any other C++ program. We're currently searching for ways to implement Paella support in TOMB5, but it may take time because not all PSX syscalls are implemented. It will eventually work, though.

PyQt, or how to break Unicode in 2018

For my latest project (Turing), I chose to use PyQt 5 for the GUI because it seemed like the best way to make the whole thing cross-platform without much of a hassle. It has done its job quite well for just about everything.

After some months of development though, I ran into an issue I was unable to fix: there is a bug somewhere between pylupdate5 (which scans code files for strings to translate) and lrelease (which compiles .ts files into .qm files) that basically prevents using non-ASCII characters in source strings. You can use them in the translated strings, but not in the code files.

Quite strange, since both the code files and the .ts files (which are in XML format) are encoded in UTF-8. Or so I thought.

It seems that lrelease assumes that everything is ASCII (well, to be precise, Latin-1) unless you specify otherwise on each <message> element in the file, even though the very first line of the file (the XML header) specifies the encoding, in this case utf-8. pylupdate5 has no problem with that and assumes UTF-8 by default.

The workaround is to add an attribute to each <message> element in the .ts file to force lrelease to read it as UTF-8: basically, replacing <message> with <message encoding="UTF-8"> everywhere. The problem is that when pylupdate5 re-saves the file after adding the new strings from the code, it discards those attributes (maybe it assumes they aren't needed, which would be a fair assumption given that it's a freaking XML file with an encoding declaration). So I needed to write a script that runs between the two calls, to make sure lrelease always gets fed a file with the attributes and always parses it correctly.

import glob

# Re-add the encoding attribute that pylupdate5 stripped from every
# <message> element, so that lrelease reads the file as UTF-8.
for ts in glob.iglob('../**/*.ts', recursive=True):
    with open(ts, "r", encoding="utf8") as f:
        orig = f.read()

    orig = orig.replace('<message>', '<message encoding="UTF-8">')

    with open(ts, "w", encoding="utf8") as f:
        f.write(orig)

I don't know whether the bug is on the PyQt side or on the Qt side, but it still seems quite weird to me that we run into this kind of problem in 2018, when every sane piece of software uses UTF-8 by default.

This article by Joel Spolsky is from 2003, and it seems that the basic knowledge of how to handle encodings correctly was already well established back then, so why are so many programs still unable to handle text correctly?