Tiny Binaries: Assembly Optimization

Here’s how I reduced the assembly binary size in Tiny Binaries from 456 bytes to 114 bytes.

Shrinking the Code

Below is the original assembly code (asm-naive in the results):

;
; hi.s: unoptimized linux x86-64 assembly implementation.
;

bits 64

global _start

section .rodata
  ; "hi!\n"
  hi  db  "hi!", 10
  len equ $ - hi

section .text

_start:
  mov rax, 1 ; write
  mov rdi, 1 ; fd
  mov rsi, hi ; msg
  mov rdx, len ; len
  syscall ; call write()

  mov rax, 60 ; exit
  mov rdi, 0 ; exit code
  syscall ; call exit()

This produces a 456 byte binary with 39 bytes of code and 4 bytes of data:

$ make
nasm -f elf64 -o hi.o hi.s
ld -s -static -nostdinc -o hi hi.o
$ wc -c ./hi
456 ./hi
$ objdump -hd -Mintel ./hi
...
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000027  0000000000400080  0000000000400080  00000080  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .rodata       00000004  00000000004000a8  00000000004000a8  000000a8  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA

Disassembly of section .text:

0000000000400080 <.text>:
  400080:	b8 01 00 00 00       	mov    eax,0x1
  400085:	bf 01 00 00 00       	mov    edi,0x1
  40008a:	48 be a8 00 40 00 00 	movabs rsi,0x4000a8
  400091:	00 00 00
  400094:	ba 04 00 00 00       	mov    edx,0x4
  400099:	0f 05                	syscall
  40009b:	b8 3c 00 00 00       	mov    eax,0x3c
  4000a0:	bf 00 00 00 00       	mov    edi,0x0
  4000a5:	0f 05                	syscall

First, we replace the unnecessary 5 byte instructions with smaller equivalents:

diff --git a/src/asm-naive/hi.s b/src/asm-naive/hi.s
index 9d17cab..3694091 100644
--- a/src/asm-naive/hi.s
+++ b/src/asm-naive/hi.s
@@ -14,12 +14,12 @@ section .rodata
 section .text

 _start:
-  mov rax, 1 ; write
-  mov rdi, 1 ; fd
+  inc al ; write
+  inc edi; fd
   mov rsi, hi ; msg
-  mov rdx, len ; len
+  mov dl, len ; len
   syscall ; call write()

-  mov rax, 60 ; exit
-  mov rdi, 0 ; exit code
+  mov al, 60 ; exit
+  xor edi, edi ; exit code
   syscall ; call exit()

Notes:

inc al works because Linux zeros registers on process init.
inc edi is 2 bytes. Another 2 byte option is mov edi, eax. The other candidates (inc dil, inc di, mov dil, al, and mov di, ax) are all 3 bytes.
xor edi, edi is 2 bytes. The other candidates (mov dil, 0, mov di, 0, mov edi, 0, xor dil, dil, xor di, di, and xor rdi, rdi) are all 3-5 bytes.

These changes shrink the binary size to 440 bytes, with 24 bytes of code and 4 bytes of data:

$ make
nasm -f elf64 -o hi.o hi.s
ld -s -static -nostdinc -o hi hi.o
$ wc -c ./hi
440 ./hi
$ objdump -hd -Mintel ./hi
...
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000018  0000000000400080  0000000000400080  00000080  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .rodata       00000004  0000000000400098  0000000000400098  00000098  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA

Disassembly of section .text:

0000000000400080 <.text>:
  400080:	fe c0                	inc    al
  400082:	ff c7                	inc    edi
  400084:	48 be 98 00 40 00 00 	movabs rsi,0x400098
  40008b:	00 00 00
  40008e:	b2 04                	mov    dl,0x4
  400090:	0f 05                	syscall
  400092:	b0 3c                	mov    al,0x3c
  400094:	31 ff                	xor    edi,edi
  400096:	0f 05                	syscall

The code is now 24 bytes, of which 10 are one large mov instruction.

We can drop 2 bytes of code, 4 bytes of data, and the .rodata section by doing the following:

Remove mov rsi, str (-10 bytes, good riddance).
Drop the .rodata section (-4 bytes of data plus .rodata section overhead).
Encode "hi!\n" as a 32-bit integer and push it to the stack (+5 bytes). Hint: "hi!\n" = 68 69 21 0a, encoded as 0x0a216968 plus one byte for push.
Copy rsp to rsi (+3 bytes). This gives write a valid pointer.

Here’s the result:

bits 64

; "hi!\n", encoded as 32-bit little-endian int
str: equ 0x0a216968

section .text
global _start
_start:
  push dword str  ; push str (68 68 69 21 0a)

  inc al          ; write() (fe c0)
  inc edi         ; fd (ff c7)
  mov rsi, rsp    ; msg (48 89 e6)
  mov dl, 4       ; len (b2 04)
  syscall         ; call write() (0f 05)

  mov al, 60      ; exit() (b0 3c)
  xor edi, edi    ; exit code (31 ff)
  syscall         ; call exit() (0f 05)

This produces a 360 byte binary with 22 bytes of code and no data section:

$ make
nasm -f elf64 -o hi.o hi.s
ld -s -static -nostdinc -o hi hi.o
$ ./hi
hi!
$ wc -c ./hi
360 ./hi
$ objdump -h ./hi
...
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000016  0000000000400080  0000000000400080  00000080  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE

This is the smallest legitimate assembly implementation that I could cook up. It’s available in the companion GitHub repository and shown in the results as asm-opt.

Dirty Tricks

In Tiny ELF Files: Revisited in 2021, Nathan Otterness created a 114 byte static x86-64 Linux binary by overlapping portions of the ELF header with the program header, then embedding the code in unverified* gaps of the ELF header.

(* Unverified by Linux, that is. Junk in these fields causes readelf and objdump give these binaries the stink eye, as we’ll see shortly).

Nathan also created a handy table showing which ELF header bytes are unverified by Linux. In particular, there are two unverified 12 byte regions at offsets 4 and 40 which could store our 22 bytes of code.

I reordered the code and divided it into into two chunks:

code_0: First 10 bytes, plus a two byte jump to code_1
code_1: Remaining 12 bytes.

Here’s the assembly for the two chunks:

; entry point
; (10 bytes of code plus a 2 byte jump to code_1)
code_0:
  push dword str  ; push string onto stack (68 68 69 21 0a)
  inc al          ; write() (fe c0)
  mov rsi, rsp    ; str (48 89 e6)
  jmp code_1      ; jump to next chunk (eb 18)

; ...

; second code chunk
; (12 bytes)
code_1:
  inc edi         ; fd (89 c7)
  mov dl, 4       ; len (b2 04)
  syscall         ; call write() (0f 05)

  mov al, 60      ; exit() (b0 3c)
  xor edi, edi    ; 0 exit code (31 ff)
  syscall         ; call exit() (0f 05)

These changes shrink the binary to 114 bytes. Linux will still happily execute the binary, but readelf, objdump, and file can’t make sense of it:

$ make
nasm -f bin -o hi hi.s
chmod a+x hi
$ ./hi
hi!
$ wc -c ./hi
114 ./hi
$ objdump -hd -Mintel ./hi
objdump: ./hi: file format not recognized
$ readelf -SW ./hi
There are 65329 section headers, starting at offset 0x3a:
readelf: Warning: The e_shentsize field in the ELF header is larger than the size of an ELF section header
readelf: Error: Reading 1014951344 bytes extends past end of file for section headers
readelf: Error: Too many program headers - 0x50f - the file is not that big
$ file ./hi
./hi: ELF, unknown class 104
$ hd ./hi
00000000  7f 45 4c 46 68 68 69 21  0a fe c0 48 89 e6 eb 18  |.ELFhhi!...H....|
00000010  02 00 3e 00 01 00 00 00  04 80 02 00 00 00 00 00  |..>.............|
00000020  3a 00 00 00 00 00 00 00  ff c7 b2 04 0f 05 b0 3c  |:..............<|
00000030  31 ff 0f 05 40 00 38 00  01 00 01 00 00 00 05 00  |1...@.8.........|
00000040  00 00 00 00 00 00 00 00  00 00 00 80 02 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 72 00 00 00 00 00  |..........r.....|
00000060  00 00 72 00 00 00 00 00  00 00 00 00 00 00 00 00  |..r.............|
00000070  00 00                                             |..|
00000072

It’ll even run as a Docker image:

$ cat Dockerfile
FROM scratch
COPY ./hi /hi
ENTRYPOINT ["/hi"]
$ docker build -t zoinks .
Sending build context to Docker daemon  8.704kB
Step 1/3 : FROM scratch
 --->
Step 2/3 : COPY ./hi /hi
 ---> 336acd7e2d94
Step 3/3 : ENTRYPOINT ["/hi"]
 ---> Running in e20eca61de44
Removing intermediate container e20eca61de44
 ---> 297e5d7db5f8
Successfully built 297e5d7db5f8
Successfully tagged zoinks:latest
$ docker images zoinks --format '{{.Size}}'
114B
$ docker run --rm -it zoinks
hi!
$

This glorious monstrosity is included in the companion repository and shown in the results as asm-elf.

Tiny Binaries: Assembly Optimization

January 1, 2022

Shrinking the Code

Dirty Tricks

Links