XlogicX Blog

sed-regex_Based_BrainFuck_Compiler

BrainFuck is an 'esoteric' programming language with only 8 one-character instructions. I've used it here-and-there for well over a decade. I love minimalist languages, so RISCy. A brainfuck environment operates on a large array of data. There's an instruction to move the pointer in this array forwards and backwards and to increment or decrement it's value...that's already half the language. There's also an instruction for input or output of 1 character. Finally, there's an instruction to start and stop a loop. That's the language. There are many interpreters out there, I'm sure there are some compilers too, but I wanted to write a one-liner-esque command with common simple *nix tools. I actually ended up with 4 separate commands, but we will get to all of that.

 

Technical notes:

sed is the main tool I used to compile the BrainFuck code into x86. dd, printf, tr, and cat are helpers with formatting and staging the resulting 512-byte boot image output.

This does mean that the output file is limited to 512 bytes (more like 470-ish bytes after some of the overhead).

Each BrainFuck command takes different amounts of bytes to encode:
>: 1 byte
<: 1 byte
+: 2 bytes
-: 2 bytes
,: 12 bytes
.: 5 bytes
[: 3 bytes
]: 10 bytes

The looping stack (controlled by '[', and ']') has 4k allocated to it. It is unlikely you will come to the nested loop limit with this.

The data stack that you control with <, >, +, and - is about 21.5k (in theory)

 

The Compiler command:

dd bs=512 count=1 if=/dev/zero of=out.bin; printf '\x31\xc0\x8e\xd8\x8e\xd0\xbc\x00\x9c\xbb\x00\xac\xb4\xb8\x8e\xc0\xb0\x03\xcd\x10\xb4\x01\xb5\x26\xcd\x10\xb9\xd0\x07\xb8\x20\x00\x31\xff\xf3\xab\x31\xff' | dd of=out.bin bs=1 count=38 conv=notrunc ;printf '\x55\xaa' | dd of=out.bin bs=1 seek=510 count=2 conv=notrunc; cat source.bf | tr '\n' 'x' | sed -e 's/x//' -e 's/>/\x43/g' -e 's/</\x4b/g' -e 's/\+/\xfe\x07/g' -e 's/\-/\xfe\x0f/g' -e 's/,/\xb4\x01\xcd\x16\x74\xfa\x30\xe4\xcd\x16\x88\x07/g' -e 's/\./\x8a\x07\xb4\x0f\xab/g' -e 's/\[/\xe8\x00\x00/g' -e 's/\]/\x5a\x52\x80\x3f\x00\x74\x02\xff\xe2\x5a/g' | dd of=out.bin bs=1 seek=38 conv=notrunc

Let's go through these 4 separate commands in detail:

dd bs=512 count=1 if=/dev/zero of=out.bin
We are using an output file of out.bin. We want to start with a blank 512 byte file. So we use dd, set the block size to 512 bytes and use /dev/zero as the source of nulls to pad this file with.

printf '\x31\xc0\x8e\xd8\x8e\xd0\xbc\x00\x9c\xbb\x00\xac\xb4\xb8\x8e\xc0\xb0\x03\xcd\x10\xb4\x01\xb5\x26\xcd\x10\xb9\xd0\x07\xb8\x20\x00\x31\xff\xf3\xab\x31\xff' | dd of=out.bin bs=1 count=38 conv=notrunc
In this command we are sending machine code to dd. It's 38 bytes of machine code (count=38). At a high level, this machine code sets up the environment so we have our stacks, video memory (for output), a routine to clear the screen, etc... Below is what a disassembly of this would look like (as it shows the machine code too). Even though this is partially an objdump output, I have included comments too
31 c0        xor    ax,ax        ;set ax to 0
8e d8        mov    ds,ax        ;DS=0
8e d0        mov    ss,ax        ;stack starts at 0
bc 00 9c    mov    sp,0x9c00    ;200h past code start, 4k of stack
bb 00 ac    mov    bx,0xac00    ;brainfuck stack location
b4 b8        mov    ah,0xb8        ;text video memory
8e c0        mov    es,ax        ;ES=0xB800
b0 03        mov    al,0x3        ;Set Video Mode
cd 10        int    0x10            ;syscall
b4 01        mov    ah,0x1        ;Set Cursor Shape
b5 26        mov    ch,0x26
cd 10        int    0x10            ;syscall
b9 d0 07    mov    cx,0x7d0        ;whole screens worth
b8 20 00    mov    ax,0x20        ;empty black background
31 ff        xor    di,di        ;upper left corner
f3 ab        rep stos WORD PTR es:[di],ax    ;clear screen
31 ff        xor    di,di        ;set cursor back to upper left corner

printf '\x55\xaa' | dd of=out.bin bs=1 seek=510 count=2 conv=notrunc
The image file needs to end with 0x55aa to be considered a bootable image. So we use printf and dd to inject these bytes into the image

cat source.bf | tr '\n' 'x' | sed -e 's/x//' -e 's/>/\x43/g' -e 's/</\x4b/g' -e 's/\+/\xfe\x07/g' -e 's/\-/\xfe\x0f/g' -e 's/,/\xb4\x01\xcd\x16\x74\xfa\x30\xe4\xcd\x16\x88\x07/g' -e 's/\./\x8a\x07\xb4\x0f\xab/g' -e 's/\[/\xe8\x00\x00/g' -e 's/\]/\x5a\x52\x80\x3f\x00\x74\x02\xff\xe2\x5a/g' | dd of=out.bin bs=1 seek=38 conv=notrunc
This is the compound command that does the real work!

First we feed the source file to our chain of commands with 'cat'
We then use tr (translate) to replace the ending newline with the x character (because we can't just make it blank / remove it with tr)
Then we have a long line of find/replace commands with sed. The first one is simple, we now can just remove that 'x' that we put in there, so there are no longer any newlines.
The rest of the find/replace (regex) sed arguments are to literally find BrainFuck instructions and replace it with equivalent machine code. Before we explore each instruction, let's make some notes about our environment. I mentioned that we are using a large array in our environment, we will be using the 16 bit 'bx' register to track this; we use it as a pointer that we increment and decrement. I considered not even using a real stack for this resulting program, because it is mostly not needed as a concept, but I ended up needing it for the looping instructions, which I will get to

Instructions:

Instruction: > (increment array pointer)
43            inc    bx    ;bx is our pointer, increment it

Instruction: < (decrement array pointer)
4b            dec    bx    ;decrement pointer

Instruction: + (increment value in array)
fe 07        inc    BYTE PTR [bx]    ;increment the value

Instruction: - (decrement value in array)
fe 0f        dec    BYTE PTR [bx]    ;decrement the value

Instruction: , (Take keyboard input)
b4 01        mov    ah,0x1                ;check if keyboard input
cd 16        int    0x16                    ;interupt for it
74 fa        je     0x2c                    ;keep checking if no input
30 e4        xor    ah,ah                ;get keypress
cd 16        int    0x16                    ;interupt for it
88 07        mov    BYTE PTR [bx],al        ;put keypress in array

Instruction: . (Print character)
8a 07        mov    al,BYTE PTR [bx]        ;get character from array
b4 0f        mov    ah,0xf                ;make sure video background is still black (possible this isn't needed)
ab            stos   WORD PTR es:[di],ax    ;display the character

Instruction: [ (Start the loop)
e8 00 00    call   0x40        ;pushes address of next instruction onto the stack

Instruction: ] (End the loop if array at pointer is zero)
5a            pop    dx                    ;get loop start
52            push   dx                    ;keep it on the stack
80 3f 00    cmp    BYTE PTR [bx],0x0    ;is the value zero yet
74 02        je     0x49                    ;if so, the loop is over, jump past it
ff e2        jmp    dx                    ;otherwise, go back to the beginning of the loop
5a            pop    dx                    ;return stack to pristine state

 

PoC||GTFO:

Just to show this works, I used the first two examples from Wikipedia's article on BrainFuck

Adding two values:

 

 

Hello World:

 

...because reasons