Tutorial part 4: Adding JIT-compilation to a toy interpreter¶
In this example we construct a “toy” interpreter, and add JIT-compilation to it.
Our toy interpreter¶
It’s a stack-based interpreter, and is intended as a (very simple) example of the kind of bytecode interpreter seen in dynamic languages such as Python, Ruby etc.
For the sake of simplicity, our toy virtual machine is very limited:
The only data type is int
It can only work on one function at a time (so that the only function call that can be made is to recurse).
Functions can only take one parameter.
Functions have a stack of int values.
We’ll implement function call within the interpreter by calling a function in our implementation, rather than implementing our own frame stack.
The parser is only good enough to get the examples to work.
Naturally, a real interpreter would be much more complicated that this.
The following operations are supported:
Operation |
Meaning |
Old Stack |
New Stack |
---|---|---|---|
DUP |
Duplicate top of stack. |
|
|
ROT |
Swap top two elements of stack. |
|
|
BINARY_ADD |
Add the top two elements on the stack. |
|
|
BINARY_SUBTRACT |
Likewise, but subtract. |
|
|
BINARY_MULT |
Likewise, but multiply. |
|
|
BINARY_COMPARE_LT |
Compare the top two elements on the stack and push a nonzero/zero if (x<y). |
|
|
RECURSE |
Recurse, passing the top of the stack, and popping the result. |
|
|
RETURN |
Return the top of the stack. |
|
|
PUSH_CONST arg |
Push an int const. |
|
|
JUMP_ABS_IF_TRUE arg |
Pop; if top of stack was
nonzero, jump to
|
|
|
Programs can be interpreted, disassembled, and compiled to machine code.
The interpreter reads .toy
scripts. Here’s what a simple recursive
factorial program looks like, the script factorial.toy
.
The parser ignores lines beginning with a #.
# Simple recursive factorial implementation, roughly equivalent to: # # int factorial (int arg) # { # if (arg < 2) # return arg # return arg * factorial (arg - 1) # } # Initial state: # stack: [arg] # 0: DUP # stack: [arg, arg] # 1: PUSH_CONST 2 # stack: [arg, arg, 2] # 2: BINARY_COMPARE_LT # stack: [arg, (arg < 2)] # 3: JUMP_ABS_IF_TRUE 9 # stack: [arg] # 4: DUP # stack: [arg, arg] # 5: PUSH_CONST 1 # stack: [arg, arg, 1] # 6: BINARY_SUBTRACT # stack: [arg, (arg - 1) # 7: RECURSE # stack: [arg, factorial(arg - 1)] # 8: BINARY_MULT # stack: [arg * factorial(arg - 1)] # 9: RETURN
The interpreter is a simple infinite loop with a big switch
statement
based on what the next opcode is:
int toyvm_function::interpret (int arg, FILE *trace) { toyvm_frame frame; #define PUSH(ARG) (frame.push (ARG)) #define POP(ARG) (frame.pop ()) frame.frm_function = this; frame.frm_pc = 0; frame.frm_cur_depth = 0; PUSH (arg); while (1) { toyvm_op *op; int x, y; assert (frame.frm_pc < fn_num_ops); op = &fn_ops[frame.frm_pc++]; if (trace) { frame.dump_stack (trace); disassemble_op (op, frame.frm_pc, trace); } switch (op->op_opcode) { /* Ops taking no operand. */ case DUP: x = POP (); PUSH (x); PUSH (x); break; case ROT: y = POP (); x = POP (); PUSH (y); PUSH (x); break; case BINARY_ADD: y = POP (); x = POP (); PUSH (x + y); break; case BINARY_SUBTRACT: y = POP (); x = POP (); PUSH (x - y); break; case BINARY_MULT: y = POP (); x = POP (); PUSH (x * y); break; case BINARY_COMPARE_LT: y = POP (); x = POP (); PUSH (x < y); break; case RECURSE: x = POP (); x = interpret (x, trace); PUSH (x); break; case RETURN: return POP (); /* Ops taking an operand. */ case PUSH_CONST: PUSH (op->op_operand); break; case JUMP_ABS_IF_TRUE: x = POP (); if (x) frame.frm_pc = op->op_operand; break; default: assert (0); /* unknown opcode */ } /* end of switch on opcode */ } /* end of while loop */ #undef PUSH #undef POP }
Compiling to machine code¶
We want to generate machine code that can be cast to this type and then directly executed in-process:
typedef int (*toyvm_compiled_func) (int);
Our compiler isn’t very sophisticated; it takes the implementation of each opcode above, and maps it directly to the operations supported by the libgccjit API.
How should we handle the stack? In theory we could calculate what the
stack depth will be at each opcode, and optimize away the stack
manipulation “by hand”. We’ll see below that libgccjit is able to do
this for us, so we’ll implement stack manipulation
in a direct way, by creating a stack
array and stack_depth
variables, local within the generated function, equivalent to this C code:
int stack_depth;
int stack[MAX_STACK_DEPTH];
We’ll also have local variables x
and y
for use when implementing
the opcodes, equivalent to this:
int x;
int y;
This means our compiler has the following state:
toyvm_function &toyvmfn; gccjit::context ctxt; gccjit::type int_type; gccjit::type bool_type; gccjit::type stack_type; /* int[MAX_STACK_DEPTH] */ gccjit::rvalue const_one; gccjit::function fn; gccjit::param param_arg; gccjit::lvalue stack; gccjit::lvalue stack_depth; gccjit::lvalue x; gccjit::lvalue y; gccjit::location op_locs[MAX_OPS]; gccjit::block initial_block; gccjit::block op_blocks[MAX_OPS];
Setting things up¶
First we create our types:
void compilation_state::create_types () { /* Create types. */ int_type = ctxt.get_type (GCC_JIT_TYPE_INT); bool_type = ctxt.get_type (GCC_JIT_TYPE_BOOL); stack_type = ctxt.new_array_type (int_type, MAX_STACK_DEPTH);
along with extracting a useful int constant:
const_one = ctxt.one (int_type); }
We’ll implement push and pop in terms of the stack
array and
stack_depth
. Here are helper functions for adding statements to
a block, implementing pushing and popping values:
void compilation_state::add_push (gccjit::block block, gccjit::rvalue rvalue, gccjit::location loc) { /* stack[stack_depth] = RVALUE */ block.add_assignment ( /* stack[stack_depth] */ ctxt.new_array_access ( stack, stack_depth, loc), rvalue, loc); /* "stack_depth++;". */ block.add_assignment_op ( stack_depth, GCC_JIT_BINARY_OP_PLUS, const_one, loc); } void compilation_state::add_pop (gccjit::block block, gccjit::lvalue lvalue, gccjit::location loc) { /* "--stack_depth;". */ block.add_assignment_op ( stack_depth, GCC_JIT_BINARY_OP_MINUS, const_one, loc); /* "LVALUE = stack[stack_depth];". */ block.add_assignment ( lvalue, /* stack[stack_depth] */ ctxt.new_array_access (stack, stack_depth, loc), loc); }
We will support single-stepping through the generated code in the
debugger, so we need to create gccjit::location
instances, one
per operation in the source code. These will reference the lines of
e.g. factorial.toy
.
void compilation_state::create_locations () { for (int pc = 0; pc < toyvmfn.fn_num_ops; pc++) { toyvm_op *op = &toyvmfn.fn_ops[pc]; op_locs[pc] = ctxt.new_location (toyvmfn.fn_filename, op->op_linenum, 0); /* column */ } }
Let’s create the function itself. As usual, we create its parameter first, then use the parameter to create the function:
void compilation_state::create_function (const char *funcname) { std::vector <gccjit::param> params; param_arg = ctxt.new_param (int_type, "arg", op_locs[0]); params.push_back (param_arg); fn = ctxt.new_function (GCC_JIT_FUNCTION_EXPORTED, int_type, funcname, params, 0, op_locs[0]);
We create the locals within the function.
stack = fn.new_local (stack_type, "stack"); stack_depth = fn.new_local (int_type, "stack_depth"); x = fn.new_local (int_type, "x"); y = fn.new_local (int_type, "y");
Populating the function¶
There’s some one-time initialization, and the API treats the first block you create as the entrypoint of the function, so we need to create that block first:
initial_block = fn.new_block ("initial");
We can now create blocks for each of the operations. Most of these will be consolidated into larger blocks when the optimizer runs.
for (int pc = 0; pc < toyvmfn.fn_num_ops; pc++) { char buf[16]; sprintf (buf, "instr%i", pc); op_blocks[pc] = fn.new_block (buf); }
Now that we have a block it can jump to when it’s done, we can populate the initial block:
/* "stack_depth = 0;". */ initial_block.add_assignment (stack_depth, ctxt.zero (int_type), op_locs[0]); /* "PUSH (arg);". */ add_push (initial_block, param_arg, op_locs[0]); /* ...and jump to insn 0. */ initial_block.end_with_jump (op_blocks[0], op_locs[0]);
We can now populate the blocks for the individual operations. We loop through them, adding instructions to their blocks:
for (int pc = 0; pc < toyvmfn.fn_num_ops; pc++) { gccjit::location loc = op_locs[pc]; gccjit::block block = op_blocks[pc]; gccjit::block next_block = (pc < toyvmfn.fn_num_ops ? op_blocks[pc + 1] : NULL); toyvm_op *op; op = &toyvmfn.fn_ops[pc];
We’re going to have another big switch
statement for implementing
the opcodes, this time for compiling them, rather than interpreting
them. It’s helpful to have macros for implementing push and pop, so that
we can make the switch
statement that’s coming up look as much as
possible like the one above within the interpreter:
#define X_EQUALS_POP()\
add_pop (block, x, loc)
#define Y_EQUALS_POP()\
add_pop (block, y, loc)
#define PUSH_RVALUE(RVALUE)\
add_push (block, (RVALUE), loc)
#define PUSH_X()\
PUSH_RVALUE (x)
#define PUSH_Y() \
PUSH_RVALUE (y)
Note
A particularly clever implementation would have an identical
switch
statement shared by the interpreter and the compiler, with
some preprocessor “magic”. We’re not doing that here, for the sake
of simplicity.
When I first implemented this compiler, I accidentally missed an edit
when copying and pasting the Y_EQUALS_POP
macro, so that popping the
stack into y
instead erroneously assigned it to x
, leaving y
uninitialized.
To track this kind of thing down, we can use
gccjit::block::add_comment()
to add descriptive comments
to the internal representation. This is invaluable when looking through
the generated IR for, say factorial
:
block.add_comment (opcode_names[op->op_opcode], loc);
We can now write the big switch
statement that implements the
individual opcodes, populating the relevant block with statements:
switch (op->op_opcode) { case DUP: X_EQUALS_POP (); PUSH_X (); PUSH_X (); break; case ROT: Y_EQUALS_POP (); X_EQUALS_POP (); PUSH_Y (); PUSH_X (); break; case BINARY_ADD: Y_EQUALS_POP (); X_EQUALS_POP (); PUSH_RVALUE ( ctxt.new_binary_op ( GCC_JIT_BINARY_OP_PLUS, int_type, x, y, loc)); break; case BINARY_SUBTRACT: Y_EQUALS_POP (); X_EQUALS_POP (); PUSH_RVALUE ( ctxt.new_binary_op ( GCC_JIT_BINARY_OP_MINUS, int_type, x, y, loc)); break; case BINARY_MULT: Y_EQUALS_POP (); X_EQUALS_POP (); PUSH_RVALUE ( ctxt.new_binary_op ( GCC_JIT_BINARY_OP_MULT, int_type, x, y, loc)); break; case BINARY_COMPARE_LT: Y_EQUALS_POP (); X_EQUALS_POP (); PUSH_RVALUE ( /* cast of bool to int */ ctxt.new_cast ( /* (x < y) as a bool */ ctxt.new_comparison ( GCC_JIT_COMPARISON_LT, x, y, loc), int_type, loc)); break; case RECURSE: { X_EQUALS_POP (); PUSH_RVALUE ( ctxt.new_call ( fn, x, loc)); break; } case RETURN: X_EQUALS_POP (); block.end_with_return (x, loc); break; /* Ops taking an operand. */ case PUSH_CONST: PUSH_RVALUE ( ctxt.new_rvalue (int_type, op->op_operand)); break; case JUMP_ABS_IF_TRUE: X_EQUALS_POP (); block.end_with_conditional ( /* "(bool)x". */ ctxt.new_cast (x, bool_type, loc), op_blocks[op->op_operand], /* on_true */ next_block, /* on_false */ loc); break; default: assert(0); } /* end of switch on opcode */
Every block must be terminated, via a call to one of the
gccjit::block::end_with_
entrypoints. This has been done for two
of the opcodes, but we need to do it for the other ones, by jumping
to the next block.
if (op->op_opcode != JUMP_ABS_IF_TRUE && op->op_opcode != RETURN) block.end_with_jump (next_block, loc);
This is analogous to simply incrementing the program counter.
Verifying the control flow graph¶
Having finished looping over the blocks, the context is complete.
As before, we can verify that the control flow and statements are sane by
using gccjit::function::dump_to_dot()
:
fn.dump_to_dot ("/tmp/factorial.dot");
and viewing the result. Note how the label names, comments, and variable names show up in the dump, to make it easier to spot errors in our compiler.
Compiling the context¶
Having finished looping over the blocks and populating them with statements, the context is complete.
We can now compile it, extract machine code from the result, and run it:
class compilation_result { public: compilation_result (gcc_jit_result *result) : m_result (result) { } ~compilation_result () { gcc_jit_result_release (m_result); } void *get_code (const char *funcname) { return gcc_jit_result_get_code (m_result, funcname); } private: gcc_jit_result *m_result; };compilation_result compiler_result = fn->compile (); const char *funcname = fn->get_function_name (); toyvm_compiled_func code = (toyvm_compiled_func)compiler_result.get_code (funcname); printf ("compiler result: %d\n", code (atoi (argv[2])));
Single-stepping through the generated code¶
It’s possible to debug the generated code. To do this we need to both:
Set up source code locations for our statements, so that we can meaningfully step through the code. We did this above by calling
gccjit::context::new_location()
and using the results.Enable the generation of debugging information, by setting
GCC_JIT_BOOL_OPTION_DEBUGINFO
on thegccjit::context
viagccjit::context::set_bool_option()
:ctxt.set_bool_option (GCC_JIT_BOOL_OPTION_DEBUGINFO, 1);
Having done this, we can put a breakpoint on the generated function:
$ gdb --args ./toyvm factorial.toy 10
(gdb) break factorial
Function "factorial" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (factorial) pending.
(gdb) run
Breakpoint 1, factorial (arg=10) at factorial.toy:14
14 DUP
We’ve set up location information, which references factorial.toy
.
This allows us to use e.g. list
to see where we are in the script:
(gdb) list
9
10 # Initial state:
11 # stack: [arg]
12
13 # 0:
14 DUP
15 # stack: [arg, arg]
16
17 # 1:
18 PUSH_CONST 2
and to step through the function, examining the data:
(gdb) n
18 PUSH_CONST 2
(gdb) n
22 BINARY_COMPARE_LT
(gdb) print stack
$5 = {10, 10, 2, 0, -7152, 32767, 0, 0}
(gdb) print stack_depth
$6 = 3
You’ll see that the parts of the stack
array that haven’t been
touched yet are uninitialized.
Note
Turning on optimizations may lead to unpredictable results when stepping through the generated code: the execution may appear to “jump around” the source code. This is analogous to turning up the optimization level in a regular compiler.
Examining the generated code¶
How good is the optimized code?
We can turn up optimizations, by calling
gccjit::context::set_int_option()
with
GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL
:
ctxt.set_int_option (GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL, 3);
One of GCC’s internal representations is called “gimple”. A dump of the initial gimple representation of the code can be seen by setting:
ctxt.set_bool_option (GCC_JIT_BOOL_OPTION_DUMP_INITIAL_GIMPLE, 1);
With optimization on and source locations displayed, this gives:
factorial (signed int arg)
{
<unnamed type> D.80;
signed int D.81;
signed int D.82;
signed int D.83;
signed int D.84;
signed int D.85;
signed int y;
signed int x;
signed int stack_depth;
signed int stack[8];
try
{
initial:
stack_depth = 0;
stack[stack_depth] = arg;
stack_depth = stack_depth + 1;
goto instr0;
instr0:
/* DUP */:
stack_depth = stack_depth + -1;
x = stack[stack_depth];
stack[stack_depth] = x;
stack_depth = stack_depth + 1;
stack[stack_depth] = x;
stack_depth = stack_depth + 1;
goto instr1;
instr1:
/* PUSH_CONST */:
stack[stack_depth] = 2;
stack_depth = stack_depth + 1;
goto instr2;
/* etc */
You can see the generated machine code in assembly form via:
ctxt.set_bool_option (GCC_JIT_BOOL_OPTION_DUMP_GENERATED_CODE, 1);
result = ctxt.compile ();
which shows that (on this x86_64 box) the compiler has unrolled the loop and is using MMX instructions to perform several multiplications simultaneously:
.file "fake.c"
.text
.Ltext0:
.p2align 4,,15
.globl factorial
.type factorial, @function
factorial:
.LFB0:
.file 1 "factorial.toy"
.loc 1 14 0
.cfi_startproc
.LVL0:
.L2:
.loc 1 26 0
cmpl $1, %edi
jle .L13
leal -1(%rdi), %edx
movl %edx, %ecx
shrl $2, %ecx
leal 0(,%rcx,4), %esi
testl %esi, %esi
je .L14
cmpl $9, %edx
jbe .L14
leal -2(%rdi), %eax
movl %eax, -16(%rsp)
leal -3(%rdi), %eax
movd -16(%rsp), %xmm0
movl %edi, -16(%rsp)
movl %eax, -12(%rsp)
movd -16(%rsp), %xmm1
xorl %eax, %eax
movl %edx, -16(%rsp)
movd -12(%rsp), %xmm4
movd -16(%rsp), %xmm6
punpckldq %xmm4, %xmm0
movdqa .LC1(%rip), %xmm4
punpckldq %xmm6, %xmm1
punpcklqdq %xmm0, %xmm1
movdqa .LC0(%rip), %xmm0
jmp .L5
# etc - edited for brevity
This is clearly overkill for a function that will likely overflow the
int
type before the vectorization is worthwhile - but then again, this
is a toy example.
Turning down the optimization level to 2:
ctxt.set_int_option (GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL, 2);
yields this code, which is simple enough to quote in its entirety:
.file "fake.c"
.text
.p2align 4,,15
.globl factorial
.type factorial, @function
factorial:
.LFB0:
.cfi_startproc
.L2:
cmpl $1, %edi
jle .L8
movl $1, %edx
jmp .L4
.p2align 4,,10
.p2align 3
.L6:
movl %eax, %edi
.L4:
.L5:
leal -1(%rdi), %eax
imull %edi, %edx
cmpl $1, %eax
jne .L6
.L3:
.L7:
imull %edx, %eax
ret
.L8:
movl %edi, %eax
movl $1, %edx
jmp .L7
.cfi_endproc
.LFE0:
.size factorial, .-factorial
.ident "GCC: (GNU) 4.9.0 20131023 (Red Hat 0.2-%{gcc_release})"
.section .note.GNU-stack,"",@progbits
Note that the stack pushing and popping have been eliminated, as has the recursive call (in favor of an iteration).
Putting it all together¶
The complete example can be seen in the source tree at
gcc/jit/docs/examples/tut04-toyvm/toyvm.cc
along with a Makefile and a couple of sample .toy scripts:
$ ls -al
drwxrwxr-x. 2 david david 4096 Sep 19 17:46 .
drwxrwxr-x. 3 david david 4096 Sep 19 15:26 ..
-rw-rw-r--. 1 david david 615 Sep 19 12:43 factorial.toy
-rw-rw-r--. 1 david david 834 Sep 19 13:08 fibonacci.toy
-rw-rw-r--. 1 david david 238 Sep 19 14:22 Makefile
-rw-rw-r--. 1 david david 16457 Sep 19 17:07 toyvm.cc
$ make toyvm
g++ -Wall -g -o toyvm toyvm.cc -lgccjit
$ ./toyvm factorial.toy 10
interpreter result: 3628800
compiler result: 3628800
$ ./toyvm fibonacci.toy 10
interpreter result: 55
compiler result: 55
Behind the curtain: How does our code get optimized?¶
Our example is done, but you may be wondering about exactly how the compiler turned what we gave it into the machine code seen above.
We can examine what the compiler is doing in detail by setting:
state.ctxt.set_bool_option (GCC_JIT_BOOL_OPTION_DUMP_EVERYTHING, 1);
state.ctxt.set_bool_option (GCC_JIT_BOOL_OPTION_KEEP_INTERMEDIATES, 1);
This will dump detailed information about the compiler’s state to a
directory under /tmp
, and keep it from being cleaned up.
The precise names and their formats of these files is subject to change. Higher optimization levels lead to more files. Here’s what I saw (edited for brevity; there were almost 200 files):
intermediate files written to /tmp/libgccjit-KPQbGw
$ ls /tmp/libgccjit-KPQbGw/
fake.c.000i.cgraph
fake.c.000i.type-inheritance
fake.c.004t.gimple
fake.c.007t.omplower
fake.c.008t.lower
fake.c.011t.eh
fake.c.012t.cfg
fake.c.014i.visibility
fake.c.015i.early_local_cleanups
fake.c.016t.ssa
# etc
The gimple code is converted into Static Single Assignment form, with annotations for use when generating the debuginfo:
$ less /tmp/libgccjit-KPQbGw/fake.c.016t.ssa
;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0)
factorial (signed int arg)
{
signed int stack[8];
signed int stack_depth;
signed int x;
signed int y;
<unnamed type> _20;
signed int _21;
signed int _38;
signed int _44;
signed int _51;
signed int _56;
initial:
stack_depth_3 = 0;
# DEBUG stack_depth => stack_depth_3
stack[stack_depth_3] = arg_5(D);
stack_depth_7 = stack_depth_3 + 1;
# DEBUG stack_depth => stack_depth_7
# DEBUG instr0 => NULL
# DEBUG /* DUP */ => NULL
stack_depth_8 = stack_depth_7 + -1;
# DEBUG stack_depth => stack_depth_8
x_9 = stack[stack_depth_8];
# DEBUG x => x_9
stack[stack_depth_8] = x_9;
stack_depth_11 = stack_depth_8 + 1;
# DEBUG stack_depth => stack_depth_11
stack[stack_depth_11] = x_9;
stack_depth_13 = stack_depth_11 + 1;
# DEBUG stack_depth => stack_depth_13
# DEBUG instr1 => NULL
# DEBUG /* PUSH_CONST */ => NULL
stack[stack_depth_13] = 2;
/* etc; edited for brevity */
We can perhaps better see the code by turning off
GCC_JIT_BOOL_OPTION_DEBUGINFO
to suppress all those DEBUG
statements, giving:
$ less /tmp/libgccjit-1Hywc0/fake.c.016t.ssa
;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0)
factorial (signed int arg)
{
signed int stack[8];
signed int stack_depth;
signed int x;
signed int y;
<unnamed type> _20;
signed int _21;
signed int _38;
signed int _44;
signed int _51;
signed int _56;
initial:
stack_depth_3 = 0;
stack[stack_depth_3] = arg_5(D);
stack_depth_7 = stack_depth_3 + 1;
stack_depth_8 = stack_depth_7 + -1;
x_9 = stack[stack_depth_8];
stack[stack_depth_8] = x_9;
stack_depth_11 = stack_depth_8 + 1;
stack[stack_depth_11] = x_9;
stack_depth_13 = stack_depth_11 + 1;
stack[stack_depth_13] = 2;
stack_depth_15 = stack_depth_13 + 1;
stack_depth_16 = stack_depth_15 + -1;
y_17 = stack[stack_depth_16];
stack_depth_18 = stack_depth_16 + -1;
x_19 = stack[stack_depth_18];
_20 = x_19 < y_17;
_21 = (signed int) _20;
stack[stack_depth_18] = _21;
stack_depth_23 = stack_depth_18 + 1;
stack_depth_24 = stack_depth_23 + -1;
x_25 = stack[stack_depth_24];
if (x_25 != 0)
goto <bb 4> (instr9);
else
goto <bb 3> (instr4);
instr4:
/* DUP */:
stack_depth_26 = stack_depth_24 + -1;
x_27 = stack[stack_depth_26];
stack[stack_depth_26] = x_27;
stack_depth_29 = stack_depth_26 + 1;
stack[stack_depth_29] = x_27;
stack_depth_31 = stack_depth_29 + 1;
stack[stack_depth_31] = 1;
stack_depth_33 = stack_depth_31 + 1;
stack_depth_34 = stack_depth_33 + -1;
y_35 = stack[stack_depth_34];
stack_depth_36 = stack_depth_34 + -1;
x_37 = stack[stack_depth_36];
_38 = x_37 - y_35;
stack[stack_depth_36] = _38;
stack_depth_40 = stack_depth_36 + 1;
stack_depth_41 = stack_depth_40 + -1;
x_42 = stack[stack_depth_41];
_44 = factorial (x_42);
stack[stack_depth_41] = _44;
stack_depth_46 = stack_depth_41 + 1;
stack_depth_47 = stack_depth_46 + -1;
y_48 = stack[stack_depth_47];
stack_depth_49 = stack_depth_47 + -1;
x_50 = stack[stack_depth_49];
_51 = x_50 * y_48;
stack[stack_depth_49] = _51;
stack_depth_53 = stack_depth_49 + 1;
# stack_depth_1 = PHI <stack_depth_24(2), stack_depth_53(3)>
instr9:
/* RETURN */:
stack_depth_54 = stack_depth_1 + -1;
x_55 = stack[stack_depth_54];
_56 = x_55;
stack ={v} {CLOBBER};
return _56;
}
Note in the above how all the gccjit::block
instances we
created have been consolidated into just 3 blocks in GCC’s internal
representation: initial
, instr4
and instr9
.
Optimizing away stack manipulation¶
Recall our simple implementation of stack operations. Let’s examine how the stack operations are optimized away.
After a pass of constant-propagation, the depth of the stack at each opcode can be determined at compile-time:
$ less /tmp/libgccjit-1Hywc0/fake.c.021t.ccp1
;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0)
factorial (signed int arg)
{
signed int stack[8];
signed int stack_depth;
signed int x;
signed int y;
<unnamed type> _20;
signed int _21;
signed int _38;
signed int _44;
signed int _51;
initial:
stack[0] = arg_5(D);
x_9 = stack[0];
stack[0] = x_9;
stack[1] = x_9;
stack[2] = 2;
y_17 = stack[2];
x_19 = stack[1];
_20 = x_19 < y_17;
_21 = (signed int) _20;
stack[1] = _21;
x_25 = stack[1];
if (x_25 != 0)
goto <bb 4> (instr9);
else
goto <bb 3> (instr4);
instr4:
/* DUP */:
x_27 = stack[0];
stack[0] = x_27;
stack[1] = x_27;
stack[2] = 1;
y_35 = stack[2];
x_37 = stack[1];
_38 = x_37 - y_35;
stack[1] = _38;
x_42 = stack[1];
_44 = factorial (x_42);
stack[1] = _44;
y_48 = stack[1];
x_50 = stack[0];
_51 = x_50 * y_48;
stack[0] = _51;
instr9:
/* RETURN */:
x_55 = stack[0];
x_56 = x_55;
stack ={v} {CLOBBER};
return x_56;
}
Note how, in the above, all those stack_depth
values are now just
constants: we’re accessing specific stack locations at each opcode.
The “esra” pass (“Early Scalar Replacement of Aggregates”) breaks out our “stack” array into individual elements:
$ less /tmp/libgccjit-1Hywc0/fake.c.024t.esra
;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0)
Created a replacement for stack offset: 0, size: 32: stack$0
Created a replacement for stack offset: 32, size: 32: stack$1
Created a replacement for stack offset: 64, size: 32: stack$2
Symbols to be put in SSA form
{ D.89 D.90 D.91 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 5
Number of blocks to update: 4 ( 80%)
factorial (signed int arg)
{
signed int stack$2;
signed int stack$1;
signed int stack$0;
signed int stack[8];
signed int stack_depth;
signed int x;
signed int y;
<unnamed type> _20;
signed int _21;
signed int _38;
signed int _44;
signed int _51;
initial:
stack$0_45 = arg_5(D);
x_9 = stack$0_45;
stack$0_39 = x_9;
stack$1_32 = x_9;
stack$2_30 = 2;
y_17 = stack$2_30;
x_19 = stack$1_32;
_20 = x_19 < y_17;
_21 = (signed int) _20;
stack$1_28 = _21;
x_25 = stack$1_28;
if (x_25 != 0)
goto <bb 4> (instr9);
else
goto <bb 3> (instr4);
instr4:
/* DUP */:
x_27 = stack$0_39;
stack$0_22 = x_27;
stack$1_14 = x_27;
stack$2_12 = 1;
y_35 = stack$2_12;
x_37 = stack$1_14;
_38 = x_37 - y_35;
stack$1_10 = _38;
x_42 = stack$1_10;
_44 = factorial (x_42);
stack$1_6 = _44;
y_48 = stack$1_6;
x_50 = stack$0_22;
_51 = x_50 * y_48;
stack$0_1 = _51;
# stack$0_52 = PHI <stack$0_39(2), stack$0_1(3)>
instr9:
/* RETURN */:
x_55 = stack$0_52;
x_56 = x_55;
stack ={v} {CLOBBER};
return x_56;
}
Hence at this point, all those pushes and pops of the stack are now simply assignments to specific temporary variables.
After some copy propagation, the stack manipulation has been completely optimized away:
$ less /tmp/libgccjit-1Hywc0/fake.c.026t.copyprop1
;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0)
factorial (signed int arg)
{
signed int stack$2;
signed int stack$1;
signed int stack$0;
signed int stack[8];
signed int stack_depth;
signed int x;
signed int y;
<unnamed type> _20;
signed int _21;
signed int _38;
signed int _44;
signed int _51;
initial:
stack$0_39 = arg_5(D);
_20 = arg_5(D) <= 1;
_21 = (signed int) _20;
if (_21 != 0)
goto <bb 4> (instr9);
else
goto <bb 3> (instr4);
instr4:
/* DUP */:
_38 = arg_5(D) + -1;
_44 = factorial (_38);
_51 = arg_5(D) * _44;
stack$0_1 = _51;
# stack$0_52 = PHI <arg_5(D)(2), _51(3)>
instr9:
/* RETURN */:
stack ={v} {CLOBBER};
return stack$0_52;
}
Later on, another pass finally eliminated stack_depth
local and the
unused parts of the stack` array altogether:
$ less /tmp/libgccjit-1Hywc0/fake.c.036t.release_ssa
;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0)
Released 44 names, 314.29%, removed 44 holes
factorial (signed int arg)
{
signed int stack$0;
signed int mult_acc_1;
<unnamed type> _5;
signed int _6;
signed int _7;
signed int mul_tmp_10;
signed int mult_acc_11;
signed int mult_acc_13;
# arg_9 = PHI <arg_8(D)(0)>
# mult_acc_13 = PHI <1(0)>
initial:
<bb 5>:
# arg_4 = PHI <arg_9(2), _7(3)>
# mult_acc_1 = PHI <mult_acc_13(2), mult_acc_11(3)>
_5 = arg_4 <= 1;
_6 = (signed int) _5;
if (_6 != 0)
goto <bb 4> (instr9);
else
goto <bb 3> (instr4);
instr4:
/* DUP */:
_7 = arg_4 + -1;
mult_acc_11 = mult_acc_1 * arg_4;
goto <bb 5>;
# stack$0_12 = PHI <arg_4(5)>
instr9:
/* RETURN */:
mul_tmp_10 = mult_acc_1 * stack$0_12;
return mul_tmp_10;
}
Elimination of tail recursion¶
Another significant optimization is the detection that the call to
factorial
is tail recursion, which can be eliminated in favor of
an iteration:
$ less /tmp/libgccjit-1Hywc0/fake.c.030t.tailr1
;; Function factorial (factorial, funcdef_no=0, decl_uid=53, symbol_order=0)
Symbols to be put in SSA form
{ D.88 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 5
Number of blocks to update: 4 ( 80%)
factorial (signed int arg)
{
signed int stack$2;
signed int stack$1;
signed int stack$0;
signed int stack[8];
signed int stack_depth;
signed int x;
signed int y;
signed int mult_acc_1;
<unnamed type> _20;
signed int _21;
signed int _38;
signed int mul_tmp_44;
signed int mult_acc_51;
# arg_5 = PHI <arg_39(D)(0), _38(3)>
# mult_acc_1 = PHI <1(0), mult_acc_51(3)>
initial:
_20 = arg_5 <= 1;
_21 = (signed int) _20;
if (_21 != 0)
goto <bb 4> (instr9);
else
goto <bb 3> (instr4);
instr4:
/* DUP */:
_38 = arg_5 + -1;
mult_acc_51 = mult_acc_1 * arg_5;
goto <bb 2> (initial);
# stack$0_52 = PHI <arg_5(2)>
instr9:
/* RETURN */:
stack ={v} {CLOBBER};
mul_tmp_44 = mult_acc_1 * stack$0_52;
return mul_tmp_44;
}