Does this count as a self-hosting compiler?

Today's goal [edit: this took a couple of days actually] is to write a text-to-binary converter, in x86-64 assembly, as an ELF executable file.

Input and output format

Following the sage advice of Leslie Lamport, let's State the Problem Before Describing the Solution, by describing the input format and the expected output of this program, before we write any code.

input   = line*
line    = (empty | comment | numbers) [\n]
empty   =
comment = [#] [^\n]*
numbers = [ ]* number ([ ]+ number)*
number  = digit digit
digit   = [0-9A-Fa-f]

The output is simply a sequence of bytes, one for each number which appears on a non-comment line, in order.

The input is taken from stdin, and output is sent to stdout. The program exits with status 0 on success, and a nonzero status on failure (in which case some bytes may have been written to stdout).

Finite state machine

The input format is very simple, so we can parse it with a (very) finite state machine. If we encode the states as powers of two (so, 1, 2, 4, 8, 16), then we can test whether the current state is one of an arbitrary set of states with a single instruction, so let's do that. The possible states are:

  1. Beginning of line. This happens at the start of input, and also after every newline. This state accepts the characters [\n# 0-9A-Fa-f] and end-of-file.
  2. Inside a comment. This state accepts all characters, but [\n] is treated specially.
  3. Waiting for a number. This happens after a space. This state accepts the characters [ 0-9A-Fa-f].
  4. Inside a number. This happens after the first digit. This state accepts the characters [0-9A-Fa-f].
  5. After a number. This happens after the second digit. This state accepts the characters [\n ].

Main loop

The main loop of the program is very simple:

  1. Read some input characters into a buffer. If there are none, exit.
  2. Feed each input character into the finite state machine, potentially producing an output character.
  3. Write out the output buffer, then go back to step 1.

Memory layout

We'll be using a fixed size buffer (one page, 4KB) to read the input, to avoid doing a syscall for every single character of input.

Since every byte of input produces at most one byte of output, and no backtracking is necessary, we can just use the same buffer to gather output bytes before doing a syscall to write them out. This means that our memory allocation state is always captured by just a few pointers, which keep the same relative ordering throughout program execution:

  1. Start of the 4KB buffer (constant).
  2. End of buffered output bytes (variable).
  3. Start of buffered input bytes (variable).
  4. End of buffered input bytes (variable).
  5. End of the 4KB buffer (constant).

And if we keep all addresses within the first 2GB of virtual memory (those addresses which can be represented as positive signed 32-bit pointers), we can manipulate our 64-bit pointers using 32-bit registers, which will save us from using a bunch of REX instruction prefixes.

How to read

The simplest way to read bytes from stdin is with the read syscall, but it's important to pay attention to the return value:

In particular, the existence of "try again" means that we need a loop, in addition to all the conditionals for dealing with end-of-file and IO errors.

rdi = 0(stdin)
rsi = (buffer)
rdx = (count)
do:
  rax = 0(read)
  syscall
  cmp rax, -4(eintr)
  while equal
test rax, rax
if negative goto panic
if zero goto end-of-file
(count) in rax
(garbage) in rcx
(garbage) in r11

How to write

Writing is similar to reading, with one extra complication: we want to make sure all output bytes are written before moving on, but the write syscall may not all consume all the bytes we give it before returning. So, we need to do it in a nested loop:

rdi = 1(stdout)
rsi = (buffer)
rdx = (count)
test rdx, rdx
if nonzero:
  do:
    do:
      rax = 1(write)
      syscall
      cmp rax, -4(eintr)
      while equal
    test rax, rax
    if negative goto panic
    rsi += rax
    rdx -= rax
    while nonzero
(garbage) in rcx
(garbage) in r11

Register assignments

Our entire program state fits within a few bytes, so we have an embarassment of riches when it comes to register assignments. To cut down the choices, let's make up some arbitrary constraints:

An unexpected abstraction

While writing the finite state machine code, I found that I was repeating the same sequence of instructions a few times, just with different constants. This sequence already contained a bunch of conditional jumps, so I started wondering whether I could move the code around to de-duplicate it. The answer was yes!

The reason is that for three of the possible input characters, namely newline, '#', and space, the handling is the same:

We can put the set of accepting states in dl and the following state in dh unconditionally, and branch to the "check transition" code if the character matches.

What I like about this abstraction is that I noticed it because of the code repetition, but then after thinking about it I found a conceptual reason.

Code layout

For no particular reason, I initially wrote the code in the following order (with jump offsets missing, to be filled in later):

  1. read input
  2. finite state machine
  3. found digit
  4. second digit
  5. check transition
  6. next iteration
  7. write output
  8. panic
  9. exit

I optimistically wrote all my branches using the short form of the jump instructions, which only accept a jump distance in the range [-128, 127], and of course this didn't work right away. Surprisingly, though, only a few jump targets were too far away, so I was able to reorder the labels to accomodate this:

  1. read input
  2. finite state machine
  3. found digit
  4. panic
  5. exit
  6. check transition
  7. next iteration
  8. write output
  9. second digit

Basically, "panic" and "exit" moved up closer to the center, to be reachable from all the code, and "second digit" moved out of the way, to the end. Almost all of the code blocks actually fall through to the next block, so this was basically the only way to move the blocks around without introducing extra jump instructions. Thankfully all of the jump distances were fine afterwards (including, just barely, the jump from "write output" to "read input", with a distance of -127 bytes!).

Debugging

I "compiled" the source code below using a tiny python3 script, and ran in step by step in gdb on a few small input files. While doing this, I found the following bugs:

Version 1 success!

The program has one known bug, but it's able to "compile" its own source code into an identical copy of itself, yay!

$ ./byter.py <byter.bytes >byter && chmod +x byter
$ ./byter <byter.bytes >self_byter && \
>     cmp byter self_byter && \
>     echo 'success!'
success!

The source code

# -------------------------------------------------
# byter.bytes -- a text-to-binary converter
# -------------------------------------------------

# -------------------------------------------------
# ELF header
# -------------------------------------------------
# magic number <DEL>"ELF"
7f 45 4c 46
# 64-bit architecture
             02
# little-endian
                01
# ELF version, the one and only
                   01
# no ELF extensions
                      00
# reserved
                           00 00 00 00  00 00 00 00
# executable file
02 00
# x86-64 architecture
      3e 00
# file version, the one and only
             01 00 00 00
# address of first instruction to execute
                           b0 00 40 00  00 00 00 00
# file offset of the program header table
40 00 00 00  00 00 00 00
# file offset of the section header table (absent)
                           00 00 00 00  00 00 00 00
# architecture-specific flags
00 00 00 00
# size of this ELF header
             40 00
# size of a program header
                   38 00
# count of program headers
                           02 00
# size of a section header
                                 40 00
# count of section headers
                                        00 00
# index of the section name section header (absent)
                                              00 00

# -------------------------------------------------
# 1st program header (code)
# -------------------------------------------------

# loadable segment
01 00 00 00
# permissions: read+execute
             05 00 00 00
# file offset
                           00 00 00 00  00 00 00 00
# memory address
00 00 40 00  00 00 00 00
# reserved
                           00 00 00 00  00 00 00 00
# size in file
5a 01 00 00  00 00 00 00
# size in memory
                           5a 01 00 00  00 00 00 00
# 4KB alignment
00 10 00 00  00 00 00 00

# -------------------------------------------------
# 2nd program header (read/write buffer)
# -------------------------------------------------

# loadable segment
                           01 00 00 00
# permissions: read+write
                                        06 00 00 00
# file offset
00 00 00 00  00 00 00 00
# memory address
                           00 00 41 00  00 00 00 00
# reserved
00 00 00 00  00 00 00 00
# size in file
                           00 00 00 00  00 00 00 00
# size in memory
00 10 00 00  00 00 00 00
# 4KB alignment
                           00 10 00 00  00 00 00 00

# -------------------------------------------------
# code
# -------------------------------------------------

# lea rbp, [rip-06] (start of code, end of headers)
8d 2d fa ff  ff ff
# mov bl, 01 (state: start of line)
                   b3 01

## read input
# xor edi, edi (stdin)
                           33 ff
# mov esi, [rbp-28] (start of buffer)
                                 8b 75  d8
# mov edx, [rbp-10] (size of buffer)
                                           8b 55 f0
# do: mov eax, edi (read syscall)
8b c7
# syscall
      0f 05
# cmp eax, -4 (eintr)
             3d fc ff ff   ff
# je [rip-0b] (retry)
                              74 f5
# test eax, eax
                                    85  c0
# js [rip+46] (panic)
                                           78 46
# jz [rip+49] (exit)
                                                 74
49
# mov ecx, eax (size of input)
   8b c8
# mov edi, esi (start of buffer)
         8b  fe

## finite state machine
# lodsb
                ac

# cmp al, 0a (newline)
                   3c 0a
# mov dl, 13 (states which accept newline)
                           b2 13
# mov dh, 01 (state: start of line)
                                 b6 01
# je [rip+43] (check transition)
                                        74 43

# test bl, 02 (state: inside comment)
                                              f6 c3
02
# jnz [rip+44] (next iteration)
   75 44

# cmp al, 20 (space)
         3c  20
# mov dl, 15 (states which accept space)
                b2 15
# mov dh, 04 (state: waiting for number)
                      b6   04
# je [rip+36] (check transition)
                              74 36

# cmp al, 23 ('#')
                                    3c  23
# mov dl, 01 (states which accept '#')
                                           b2 01
# mov dh, 02 (state: inside comment)
                                                 b6
02
# je [rip+2e] (check transition)
   74 2e

# cmp al, 30 ('0')
         3c  30
# jb [rip+1e] (panic)
                72 1e
# cmp al, 39 ('9')
                      3c   39
# jbe [rip+0c] (found digit)
                              76 0c
# or al, 20 (smash case)
                                    0c  20
# cmp al, 61 ('a')
                                           3c 61
# jb [rip+14] (panic)
                                                 72
14
# cmp al, 66 ('f')
   3c 66
# ja [rip+10] (panic)
         77  10
# add al, 09
                04 09

## found digit
# test bl, 05 (states which accept a first digit)
                      f6   c3 05
# jz [rip+40] (second digit)
                                 74 40
# mov bl, 08 (state: middle of number)
                                        b3 08
# shl al, 4
                                              c0 e0
04
# mov bh, al
   8a f8
# jmp [rip+12] (next iteration)
         eb  12

## panic
# mov edi, 1 (nonzero exit status)
                bf 01 00   00 00
## exit
# mov eax, 3c (exit syscall)
                                 b8 3c  00 00 00
# syscall
                                                 0f
05

## check transition
# test bl, dl (acceptable states)
   84 da
# jz [rip-10] (panic)
         74  f0
# mov bl, dh
                8a de
## next iteration
# loop [rip-54] (finite state machine)
                      e2   ac

## write output
# mov edx, edi (end of output buffer)
                              8b d7
# mov edi, 1 (stdout)
                                    bf  01 00 00 00
# mov esi, [rbp-28] (start of buffer)
8b 75 d8
# sub edx, esi (get size of output buffer)
         2b  d6
# jz [rip-7f] (read input)
                74 81
# do: mov eax, edi (write syscall)
                      8b   c7
# syscall
                              0f 05
# cmp eax, -4 (eintr)
                                    3d  fc ff ff ff
# je [rip-0b] (retry)
74 f5
# test eax, eax
      85 c0
# js [rip-31] (panic)
             78 cf
# add esi, eax
                   03 f0
# sub edx, eax
                           2b d0
# jmp [rip-17] (maybe write more)
                                 eb e9

## second digit
# test bl, 08 (states which accept a second digit)
                                       f6 c3 08
# jz [rip-3c] (panic)
                                                74
c4
# mov bl, 10 (state: end of number)
   b3 10
# and al, 0f
         24  0f
# or al, bh
                0a c7
# stosb
                      aa
# jmp [rip-33] (next iteration)
                           eb cd