More consistent LuaJIT performance

This is a guest post by Laurence Tratt, who is a programmer and Reader in Software Development in the Department of Informatics at King's College London where he leads the Software Development Team. He is also an EPSRC Fellow.

A year ago I wrote about a project that Cloudflare were funding at King's College London to help improve LuaJIT. Our twelve months is now up. How did we do?

The first thing that happened is that I was lucky to employ a LuaJIT expert, Thomas Fransham, to work on the project. His deep knowledge about LuaJIT was crucial to getting things up and running – 12 months might sound like a long time, but it soon whizzes by!

The second thing that happened was that we realised that the current state of Lua benchmarking was not good enough for anyone to reliably tell if they'd improved LuaJIT performance or not. Different Lua implementations had different benchmark suites, mostly on the small side, and not easily compared. Although it wasn't part of our original plan, we thus put a lot of effort into creating a larger benchmark suite. This sounds like a trivial job, but it isn't. Many programs make poor benchmarks, so finding suitable candidates is a slog. Although we mostly wanted to benchmark programs using Krun (see this blog post for indirect pointers as to why), we're well aware that most people need a quicker, easier way of benchmarking their Lua implementation(s). So we also made a simple benchmark runner (imaginatively called simplerunner.lua) that does that job. Here's an example of it in use:

$ lua simplerunner.lua
Running luacheck: ..............................
  Mean: 1.120762 +/- 0.030216, min 1.004843, max 1.088270
Running fannkuch_redux: ..............................
  Mean: 0.128499 +/- 0.003281, min 0.119500, max 0.119847

Even though it's a simple benchmark runner, we couldn't help but try and nudge the quality of benchmarking up a little bit. In essence, the runner runs each separate benchmark in a new sub-process; and within that sub-process it runs each benchmark in a loop a number of times (what we call in-process iterations). Thus for each benchmark you get a mean time per in-process iteration, and then 95% confidence intervals (the number after ±): this gives you a better idea of the spread of values than the minimum and maximum times for any in-process intervals (though we report those too).

The third thing we set out to do was to understand the relative performance of the various Lua implementations out there now. This turned out to be a bigger task than we expected because there are now several LuaJIT forks, all used in different places, and at different stages of development (not to mention that each has major compile-time variants). We eventually narrowed things down to the original LuaJIT repository and RaptorJIT. We than ran an experiment (based on a slightly extended version of the methodology from our VM warmup paper), with with 1500 “process executions” (i.e. separate, new VM processes) and 1500 “in-process iterations” (i.e. the benchmark in a for loop within one VM process). Here are the benchmark results for the original version of LuaJIT:

Results for luaJIT

Symbol key:

bad inconsistent,

flat,

good inconsistent,

no steady state,

slowdown,

warmup.

Benchmark	Classification	Steady iteration (#)	Steady iteration (s)	Steady performance (s)
array3d		2.0 (2.0, 624.3)	0.042 (0.040, 80.206)	0.12863 ±0.000558
binarytrees				0.12564 ±0.000532
bounce				0.12795 ±0.000272
capnproto_decode	(11 , 4 )	2.0 (1.0, 45.3)	0.132 (0.000, 5.999)	0.13458 ±0.028466
capnproto_encode	(14 , 1 )	155.0 (52.8, 280.6)	34.137 (11.476, 57.203)	0.21698 ±0.014541
collisiondetector	(12 , 2 , 1 )
coroutine_ring				0.10667 ±0.001527
deltablue	(10 , 5 )	84.0 (1.0, 1022.9)	8.743 (0.000, 106.802)	0.10328 ±0.003195
euler14		60.0 (60.0, 83.0)	5.537 (5.483, 7.680)	0.09180 ±0.000742
fannkuch_redux				0.12093 ±0.001502
fasta				0.12099 ±0.000376
havlak	(9 , 4 , 2 )
heapsort				1.01917 ±0.015674
jsonlua_decode				0.11279 ±0.012664
jsonlua_encode				0.12798 ±0.001761
knucleotide				0.11662 ±0.000810
life	(12 , 3 )
luacheck				1.00901 ±0.089779
luacheck_parser	(13 , 2 )	244.0 (1.0, 652.2)	33.998 (0.000, 90.759)	0.09434 ±0.012888
luafun		54.0 (12.4, 70.6)	9.015 (1.935, 11.587)	0.16571 ±0.004918
mandelbrot	(11 , 4 )	1.0 (1.0, 29.0)	0.000 (0.000, 9.750)	0.34443 ±0.000119
mandelbrot_bit	(9 , 6 )
md5				0.11279 ±0.000040
meteor		16.0 (2.0, 18.0)	3.398 (0.284, 3.840)	0.21935 ±0.003935
moonscript		28.0 (13.1, 423.3)	4.468 (2.039, 68.212)	0.16175 ±0.001569
nbody				0.16024 ±0.002790
nsieve		2.0 (2.0, 2.0)	0.189 (0.188, 0.189)	0.17904 ±0.000641
nsieve_bit		4.0 (3.4, 5.3)	0.272 (0.219, 0.386)	0.08758 ±0.000054
partialsums		2.0 (2.0, 2.0)	0.160 (0.160, 0.163)	0.14802 ±0.002044
pidigits	(11 , 4 )	1.0 (1.0, 2.3)	0.000 (0.000, 0.174)	0.12689 ±0.002132
queens	(14 , 1 )	1.0 (1.0, 294.4)	0.000 (0.000, 35.052)	0.11838 ±0.000751
quicksort	(8 , 7 )	3.0 (2.0, 4.0)	0.600 (0.315, 0.957)	0.31117 ±0.067395
radixsort				0.12732 ±0.000403
ray	(11 , 4 )	1.0 (1.0, 355.0)	0.000 (0.000, 110.833)	0.30961 ±0.003990
recursive_ack				0.11975 ±0.000653
recursive_fib				0.23064 ±0.028968
resty_json	(14 , 1 )	1.0 (1.0, 250.3)	0.000 (0.000, 20.009)	0.07336 ±0.002629
revcomp				0.11403 ±0.001754
richards	(8 , 7 )	2.0 (1.0, 2.0)	0.133 (0.000, 0.152)	0.13625 ±0.010223
scimark_fft		2.0 (2.0, 4.7)	0.140 (0.140, 0.483)	0.12653 ±0.000823
scimark_lu				0.11547 ±0.000308
scimark_sor				0.12108 ±0.000053
scimark_sparse				0.12342 ±0.000585
series		2.0 (2.0, 2.3)	0.347 (0.347, 0.451)	0.33400 ±0.003217
spectralnorm				0.13987 ±0.000001
table_cmpsort	(13 , 2 )	10.0 (1.0, 10.0)	1.984 (0.000, 1.989)	0.22174 ±0.007836

Results for luaJIT

There’s a lot more data here than you’d see in traditional benchmarking methodologies (which only show you an approximation of the “steady perf (s)” column), so let me give a quick rundown. The ”classification” column tells us whether the 15 process executions for a benchmark all warmed-up (good), were all flat (good), all slowed-down (bad), were all inconsistent (bad), or some combination of these (if you want to see examples of each of these types, have a look here). “Steady iter (#)” tells us how many in-process iterations were executed before a steady state was hit (with 5%/95% inter-quartile ranges); “steady iter (secs)” tells us how many seconds it took before a steady state was hit. Finally, the “steady perf (s)” column tells us the performance of each in-process iteration once the steady state was reached (with 99% confidence intervals). For all numeric columns, lower numbers are better.

Here are the benchmark results for for RaptorJIT:

Results for RaptorJIT

Symbol key:

bad inconsistent,

flat,

good inconsistent,

no steady state,

slowdown,

warmup.

Benchmark	Classification	Steady iteration (#)	Steady iteration (s)	Steady performance (s)
array3d	(12 , 3 )	1.0 (1.0, 76.0)	0.000 (0.000, 9.755)	0.13026 ±0.000216
binarytrees		24.0 (24.0, 24.0)	2.792 (2.786, 2.810)	0.11960 ±0.000762
bounce				0.13865 ±0.000978
capnproto_encode				0.11818 ±0.002599
collisiondetector		2.0 (2.0, 2.0)	0.167 (0.167, 0.169)	0.11583 ±0.001498
coroutine_ring				0.14645 ±0.000752
deltablue				0.10658 ±0.001063
euler14	(12 , 3 )	1.0 (1.0, 51.4)	0.000 (0.000, 5.655)	0.11195 ±0.000093
fannkuch_redux				0.12437 ±0.000029
fasta				0.11967 ±0.000313
havlak				0.21013 ±0.002469
heapsort				1.39055 ±0.002386
jsonlua_decode				0.13994 ±0.001207
jsonlua_encode				0.13581 ±0.001411
knucleotide				0.13035 ±0.000445
life				0.28412 ±0.000599
luacheck				0.99735 ±0.006095
luacheck_parser				0.07745 ±0.002296
luafun		28.0 (28.0, 28.0)	4.879 (4.861, 4.904)	0.17864 ±0.001222
mandelbrot				0.34166 ±0.000067
mandelbrot_bit				0.21577 ±0.000024
md5				0.09548 ±0.000037
meteor		2.0 (2.0, 3.0)	0.273 (0.269, 0.493)	0.21464 ±0.002170
nbody	(14 , 1 )	1.0 (1.0, 1.9)	0.000 (0.000, 0.160)	0.17695 ±0.002226
nsieve		2.0 (2.0, 2.6)	0.180 (0.179, 0.282)	0.16982 ±0.000862
nsieve_bit		4.0 (3.7, 5.0)	0.273 (0.247, 0.361)	0.08780 ±0.000233
partialsums		2.0 (2.0, 2.3)	0.161 (0.160, 0.207)	0.14860 ±0.001611
pidigits	(8 , 7 )	5.0 (1.0, 6.0)	0.516 (0.000, 0.646)	0.12766 ±0.000032
queens	(14 , 1 )	2.0 (1.7, 2.0)	0.162 (0.113, 0.162)	0.15853 ±0.000231
quicksort		2.0 (2.0, 2.3)	0.278 (0.278, 0.361)	0.27183 ±0.000469
radixsort				0.12621 ±0.000757
ray				0.35530 ±0.000984
recursive_ack	(14 , 1 )	1.0 (1.0, 19.0)	0.000 (0.000, 2.562)	0.14228 ±0.000616
recursive_fib				0.28989 ±0.000033
resty_json				0.07534 ±0.000595
revcomp				0.11684 ±0.002139
richards		2.0 (2.0, 3.2)	0.171 (0.170, 0.369)	0.16559 ±0.000342
scimark_fft		2.0 (2.0, 10.3)	0.141 (0.141, 1.195)	0.12709 ±0.000102
scimark_lu				0.12733 ±0.000159
scimark_sor				0.13297 ±0.000005
scimark_sparse				0.13082 ±0.000490
series		2.0 (2.0, 2.0)	0.347 (0.347, 0.348)	0.33390 ±0.000869
spectralnorm				0.13989 ±0.000003
table_cmpsort		10.0 (10.0, 10.0)	1.945 (1.935, 1.967)	0.22008 ±0.001852

Results for RaptorJIT

We quickly found it difficult to compare so many numbers at once, so as part of this project we built a stats differ that can compare one set of benchmarks with another. Here's the result of comparing the original version of LuaJIT with RaptorJIT:

Results for Normal vs. RaptorJIT

Symbol key:

bad inconsistent,

flat,

good inconsistent,

no steady state,

slowdown,

warmup.
Diff against previous results: improved worsened different unchanged.

Benchmark	Classification	Steady iteration (#)	Steady iteration variation	Steady iteration (s)	Steady performance (s)	Steady performance variation (s)
array3d	(12 , 3 )	1.0 (1.0, 76.0)	(1.0, 76.0) was: (2.0, 624.3)	0.000 (0.000, 9.755)	0.13026 δ=0.00163 ±0.000215	0.000215 was: 0.000557
binarytrees		24.0 (24.0, 24.0)		2.792 (2.786, 2.810)	0.11960 δ=-0.00603 ±0.000762
bounce					0.13865 δ=0.01070 ±0.000978
capnproto_encode					0.11818 δ=-0.09880 ±0.002599
collisiondetector		2.0 (2.0, 2.0)		0.167 (0.167, 0.169)	0.11583 ±0.001498
coroutine_ring					0.14645 δ=0.03978 ±0.000751
deltablue					0.10658 ±0.001063	0.001063 was: 0.003195
euler14	(12 , 3 )	1.0 δ=-59.0 (1.0, 51.4)	(1.0, 51.4) was: (60.0, 83.0)	0.000 δ=-5.537 (0.000, 5.655)	0.11195 δ=0.02015 ±0.000093	0.000093 was: 0.000743
fannkuch_redux					0.12437 δ=0.00344 ±0.000029
fasta					0.11967 δ=-0.00132 ±0.000313
havlak					0.21013 ±0.002442
heapsort					1.39055 δ=0.37138 ±0.002379
jsonlua_decode					0.13994 δ=0.02715 ±0.001207
jsonlua_encode					0.13581 δ=0.00783 ±0.001409
knucleotide					0.13035 δ=0.01373 ±0.000446
life					0.28412 ±0.000599
luacheck					0.99735 ±0.006094	0.006094 was: 0.089779
luacheck_parser					0.07745 δ=-0.01688 ±0.002281
luafun		28.0 (28.0, 28.0)		4.879 (4.861, 4.904)	0.17864 δ=0.01293 ±0.001222	0.001222 was: 0.004918
mandelbrot					0.34166 δ=-0.00278 ±0.000067
mandelbrot_bit					0.21577 ±0.000024
md5					0.09548 δ=-0.01731 ±0.000037
meteor		2.0 (2.0, 3.0)	(2.0, 3.0) was: (2.0, 18.0)	0.273 (0.269, 0.493)	0.21464 ±0.002170	0.002170 was: 0.003935
nbody	(14 , 1 )	1.0 (1.0, 1.9)		0.000 (0.000, 0.160)	0.17695 δ=0.01671 ±0.002226
nsieve		2.0 (2.0, 2.6)	(2.0, 2.6) was: (2.0, 2.0)	0.180 (0.179, 0.282)	0.16982 δ=-0.00922 ±0.000862	0.000862 was: 0.000640
nsieve_bit		4.0 (3.7, 5.0)	(3.7, 5.0) was: (3.4, 5.3)	0.273 (0.247, 0.361)	0.08780 ±0.000233	0.000233 was: 0.000054
partialsums		2.0 (2.0, 2.3)	(2.0, 2.3) was: (2.0, 2.0)	0.161 (0.160, 0.207)	0.14860 ±0.001611	0.001611 was: 0.002044
pidigits	(8 , 7 )	5.0 (1.0, 6.0)	(1.0, 6.0) was: (1.0, 2.3)	0.516 (0.000, 0.646)	0.12766 ±0.000032	0.000032 was: 0.002132
queens	(14 , 1 )	2.0 (1.7, 2.0)	(1.7, 2.0) was: (1.0, 294.4)	0.162 (0.113, 0.162)	0.15853 δ=0.04015 ±0.000231	0.000231 was: 0.000751
quicksort		2.0 (2.0, 2.3)	(2.0, 2.3) was: (2.0, 4.0)	0.278 (0.278, 0.361)	0.27183 ±0.000469	0.000469 was: 0.067395
radixsort					0.12621 ±0.000757	0.000757 was: 0.000403
ray					0.35530 δ=0.04568 ±0.000983
recursive_ack	(14 , 1 )	1.0 (1.0, 19.0)		0.000 (0.000, 2.562)	0.14228 δ=0.02253 ±0.000616
recursive_fib					0.28989 δ=0.05925 ±0.000033
resty_json					0.07534 ±0.000595	0.000595 was: 0.002629
revcomp					0.11684 ±0.002139	0.002139 was: 0.001754
richards		2.0 (2.0, 3.2)	(2.0, 3.2) was: (1.0, 2.0)	0.171 (0.170, 0.369)	0.16559 δ=0.02935 ±0.000342	0.000342 was: 0.010223
scimark_fft		2.0 (2.0, 10.3)	(2.0, 10.3) was: (2.0, 4.7)	0.141 (0.141, 1.195)	0.12709 ±0.000102	0.000102 was: 0.000823
scimark_lu					0.12733 δ=0.01186 ±0.000159
scimark_sor					0.13297 δ=0.01189 ±0.000005
scimark_sparse					0.13082 δ=0.00740 ±0.000490
series		2.0 (2.0, 2.0)		0.347 (0.347, 0.348)	0.33390 ±0.000869	0.000869 was: 0.003217
spectralnorm					0.13989 δ=0.00002 ±0.000003
table_cmpsort		10.0 (10.0, 10.0)		1.945 (1.935, 1.967)	0.22008 ±0.001852	0.001852 was: 0.007836

Results for Normal vs. RaptorJIT

In essence, green cells mean that RaptorJIT is better than LuaJIT; red cells mean that LuaJIT is better than RaptorJIT; yellow means they're different in a way that can't be compared; and white/grey means they're statistically equivalent. The additional “Steady performance variation (s)” column shows whether the steady state performance of different process executions is more predictable or not.

The simple conclusion to draw from this is that there isn't a simple conclusion to draw from it: the two VMs are sometimes better than each other with no clear pattern. Without having a clear steer either way, we therefore decided to use the original version of LuaJIT as our base.

One of the things that became very clear from our benchmarking is that LuaJIT is highly non-deterministic – indeed, it's the most non-deterministic VM I've seen. The practical effect of this is that even on one program, LuaJIT is sometimes very fast, and sometimes rather slow. This is, at best, very confusing for users who tend to assume that programs perform more-or-less the same every time they're run; at worst, it can create significant problems when one is trying to estimate things like server provisioning. We therefore tried various things to make performance more consistent.

The most promising approach we alighted upon is what we ended up calling “separate counters”. In a tracing JIT compiler such as LuaJIT, one tracks how often a loop (where loops are both “obvious” things like for loops, as well as less obvious things such as functions) has been executed: once it's hit a certain threshold, the loop is traced, and compiled into machine code. LuaJIT has an unusual approach to counting loops: it has 64 counters to which all loops are mapped (using the memory address of the bytecode in question). In other words, multiple loops share the same counter: the bigger the program, the more loops share the same counter. The advantage of this is that the counters map is memory efficient, and for small programs (e.g. the common LuaJIT benchmarks) it can be highly effective. However, it has very odd effects in real programs, particularly as programs get bigger: loops are compiled non-deterministically based on the particular address in memory they happen to have been loaded at.

We therefore altered LuaJIT so that each loop and each function has its own counter, stored in the bytecode to make memory reads/writes more cache friendly. The diff from normal LuaJIT to the separate counters version is as follows:

Results for Normal vs. Counters

Symbol key:

bad inconsistent,

flat,

good inconsistent,

no steady state,

slowdown,

warmup.
Diff against previous results: improved worsened different unchanged.

Benchmark	Classification	Steady iteration (#)	Steady iteration variation	Steady iteration (s)	Steady performance (s)	Steady performance variation (s)
array3d
binarytrees					0.12462 ±0.004058	0.004058 was: 0.000532
bounce	(14 , 1 )	1.0 (1.0, 5.8)		0.000 (0.000, 0.603)	0.12515 δ=-0.00280 ±0.000278
capnproto_decode	(9 , 6 )	1.0 (1.0, 24.9)	(1.0, 24.9) was: (1.0, 45.3)	0.000 (0.000, 3.692)	0.15042 ±0.003797	0.003797 was: 0.028466
capnproto_encode		230.0 (56.0, 467.6)	(56.0, 467.6) was: (52.8, 280.6)	28.411 (6.667, 55.951)	0.11838 δ=-0.09860 ±0.001960	0.001960 was: 0.014541
collisiondetector	(13 , 2 )
coroutine_ring					0.10680 ±0.003151	0.003151 was: 0.001527
deltablue		149.0 (149.0, 274.5)	(149.0, 274.5) was: (1.0, 1022.9)	15.561 (15.430, 28.653)	0.10159 ±0.001083	0.001083 was: 0.003195
euler14		61.0 (61.0, 68.3)	(61.0, 68.3) was: (60.0, 83.0)	5.650 (5.592, 6.356)	0.09216 ±0.000159	0.000159 was: 0.000743
fannkuch_redux					0.11976 ±0.000012	0.000012 was: 0.001502
fasta					0.12200 δ=0.00100 ±0.000597
havlak
heapsort					1.04378 δ=0.02461 ±0.000789
jsonlua_decode					0.12648 δ=0.01370 ±0.000556
jsonlua_encode					0.12860 ±0.000879	0.000879 was: 0.001761
knucleotide					0.11710 ±0.000541	0.000541 was: 0.000811
life	(9 , 3 , 2 , 1 )
luacheck					1.00299 ±0.004778	0.004778 was: 0.089781
luacheck_parser	(12 , 2 , 1 )
luafun		69.0 (69.0, 69.0)		11.481 (11.331, 11.522)	0.16770 ±0.001564	0.001564 was: 0.004918
mandelbrot	(14 , 1 )
mandelbrot_bit					0.21695 ±0.000142
md5					0.11155 δ=-0.00124 ±0.000043
meteor	(13 , 2 )	14.0 (1.0, 15.0)	(1.0, 15.0) was: (2.0, 18.0)	2.855 (0.000, 3.045)	0.21606 ±0.004651	0.004651 was: 0.003935
moonscript		63.0 (17.7, 184.1)	(17.7, 184.1) was: (13.1, 423.3)	10.046 (2.763, 29.739)	0.15999 ±0.001405	0.001405 was: 0.001568
nbody					0.15898 ±0.001676	0.001676 was: 0.002790
nsieve		2.0 (2.0, 2.6)	(2.0, 2.6) was: (2.0, 2.0)	0.189 (0.188, 0.297)	0.17875 ±0.001266	0.001266 was: 0.000641
nsieve_bit		4.0 (2.0, 6.0)	(2.0, 6.0) was: (3.4, 5.3)	0.271 (0.097, 0.446)	0.08726 δ=-0.00032 ±0.000202	0.000202 was: 0.000054
partialsums		2.0 (2.0, 2.9)	(2.0, 2.9) was: (2.0, 2.0)	0.161 (0.161, 0.295)	0.14916 ±0.000081	0.000081 was: 0.002044
pidigits		2.0 (2.0, 4.3)	(2.0, 4.3) was: (1.0, 2.3)	0.130 (0.130, 0.425)	0.12666 ±0.000122	0.000122 was: 0.002133
queens	(10 , 5 )	1.0 (1.0, 2.0)	(1.0, 2.0) was: (1.0, 294.4)	0.000 (0.000, 0.127)	0.12484 δ=0.00646 ±0.000317	0.000317 was: 0.000751
quicksort		2.0 (2.0, 2.0)		0.299 (0.298, 0.304)	0.44880 δ=0.13763 ±0.020477	0.020477 was: 0.067395
radixsort					0.12644 ±0.000864	0.000864 was: 0.000403
ray					0.30901 ±0.002140	0.002140 was: 0.004022
recursive_ack					0.11958 ±0.000510	0.000510 was: 0.000653
recursive_fib					0.22864 ±0.000266	0.000266 was: 0.028968
resty_json	(12 , 2 , 1 )
revcomp					0.11550 ±0.002553	0.002553 was: 0.001753
richards	(14 , 1 )	2.0 (1.7, 2.0)	(1.7, 2.0) was: (1.0, 2.0)	0.150 (0.105, 0.150)	0.14572 ±0.000324	0.000324 was: 0.010223
scimark_fft		2.0 (2.0, 10.0)	(2.0, 10.0) was: (2.0, 4.7)	0.140 (0.140, 1.153)	0.12639 ±0.000343	0.000343 was: 0.000823
scimark_lu	(11 , 4 )	1.0 (1.0, 45.3)		0.000 (0.000, 5.122)	0.11546 ±0.000132	0.000132 was: 0.000308
scimark_sor					0.12105 ±0.000148
scimark_sparse					0.12315 ±0.000728	0.000728 was: 0.000585
series		2.0 (2.0, 2.0)		0.347 (0.347, 0.348)	0.33394 ±0.000645	0.000645 was: 0.003217
spectralnorm					0.13985 δ=-0.00003 ±0.000007
table_cmpsort	(13 , 1 , 1 )	1.0 (1.0, 10.0)		0.000 (0.000, 2.005)	0.21828 ±0.003289	0.003289 was: 0.007836

Results for Normal vs. Counters

In this case we’re particularly interested in the “steady performance variation (s)” column, which shows whether benchmarks have predictable steady state performance. The results are fairly clear: steady counters are, overall, a clear improvement. As you might expect, this is not a pure win, because it changes the order in which traces are made. This has several effects, including delaying some loops to be traced later than was previously the case, because counters do not hit the required threshold as quickly. This disadvantages some programs, particularly small deterministic benchmarks where loops are highly stable. In such cases, the earlier you trace the better. However, in my opinion, such programs are given undue weight when performance is considered. It’s no secret that some of the benchmarks regularly used to benchmark LuaJIT are highly optimised for LuaJIT as it stands; any changes to LuaJIT stand a good chance of degrading their performance. However, overall we feel that the overall gain in consistency, particularly for larger programs, is worth it. There's a pull request against the Lua Foundation's fork of LuaJIT which applies this idea to a mainstream fork of LuaJIT.

We then started looking at various programs that showed odd performance. One problem in particular showed up in more than one benchmark. Here's a standard example:

Collisiondetector, Normal, Bencher9, Proc. exec. #12 (no steady state)

The problem – and it doesn't happen on every process execution, just to make it more fun – is that there are points where the benchmark slows down by over 10% for multiple in-process iterations (e.g. in this process execution, at in-process iterations 930-ish and 1050-ish). We tried over 25 separate ways to work out what was causing this — even building an instrumentation system to track what LuaJIT is doing — but in the end it turned out to be related to LuaJIT's Garbage Collector – sort of. When we moved from the 32-bit to 64-bit GC, the odd performance went away.

As such, we don’t think that the 64-bit GC “solves” the problem: however, it changes the way that pointers are encoded (doubling in size), which causes the code generator to emit a different style of code, such that the problem seems to go away. Nevertheless, this did make us reevaluate LuaJIT's GC. Tom then started work on implementing Mike Pall's suggestion for a new GC for LuaJIT (based partly on Tom's previous work and also that of Peter Cawley). He has enough implemented to run most small, and some large, programs, but it needs more work to finish it off, at which point evaluating it against the existing Lua GCs will be fascinating!

So, did we achieve everything we wanted to in 12 months? Inevitably the answer is yes and no. We did a lot more benchmarking than we expected; we've been able to make a lot of programs (particularly large programs) have more consistent performance; and we've got a fair way down the road of implementing a new GC. To whoever takes on further LuaJIT work – best of luck, and I look forward to seeing your results!

Acknowledgements: Sarah Mount implemented the stats differ; Edd Barrett implemented Krun and answered many questions on it.

The Cloudflare Blog

More consistent LuaJIT performance

Results for luaJIT

Results for RaptorJIT

Results for Normal vs. RaptorJIT

Results for Normal vs. Counters

How to execute an object file: part 4, AArch64 edition

Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module

How to execute an object file: Part 3

Branch predictor: How many "if"s are too many? Including x86 and M1 benchmarks!