ABCL - The Fastest Common Lisp Implementation

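Many people consider SBCL the "fastest" Common Lisp implementation. That changed with the release of OpenJDK 11. ABCL is a Common Lisp implementation that runs on the JVM. Probably for lack of investment, it does very little compile-time optimization, and the code it generates is ugly, bloated, and slow. Even so, the JVM's own optimizer can turn that bloated code into the fastest Common Lisp code around. The example below may come as a surprise!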

Liutos (nicknamed "233 之母") wrote a number-crunching program. Run under the latest SBCL, 2.0.4, it takes 131 seconds in total^[1], as shown below.

$ ~/sbcl/bin/sbcl
This is SBCL 2.0.4, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.
* (compile-file "liutos.lisp")
; compiling file "/home/xps13/tmp/liutos.lisp" (written 17 MAY 2020 10:46:46 PM):
; compiling (DEFUN GENERATE-PRIME-NUMBERS ...)
; compiling (DEFVAR *PRIME-NUMBERS* ...)
; compiling (DEFUN FACTORING ...)
; compiling (DEFUN COUNT-COPRIME-NUMBERS ...)
; compiling (DEFUN COUNT-COPRIME-NUMBERS-BY-FORMULA ...)
; compiling (DEFUN COUNT-TOTAL-REDUCED-PROPER-FRACTIONS ...)

; wrote /home/xps13/tmp/liutos.fasl
; compilation finished in 0:00:00.018
#P"/home/xps13/tmp/liutos.fasl"
NIL
NIL
* (load "liutos")
T
* (time (count-total-reduced-proper-fractions 1000000))
Evaluation took:
  131.693 seconds of real time
  131.682014 seconds of total run time (131.602740 user, 0.079274 system)
  [ Run times consist of 0.011 seconds GC time, and 131.672 seconds non-GC time. ]
  99.99% CPU
  357,127,568,135 processor cycles
  319,617,360 bytes consed
  
303963552391
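
Liutos's source is not reproduced in this post. Judging only by the function names in the compile log and by the printed result, the benchmark appears to count the reduced proper fractions with denominator up to n, which equals the sum of Euler's totient phi(d) for d from 2 to n. The sketch below is an independent illustration of that computation using a sieve. The name `totient-sum` is hypothetical, and this is not the benchmarked code, which by its function names factors each number separately and recurses heavily.

```lisp
(defun totient-sum (limit)
  "Sum of Euler's totient phi(d) for 2 <= d <= LIMIT (hypothetical helper)."
  (let ((phi (make-array (1+ limit))))
    ;; Start with phi[i] = i for every index.
    (dotimes (i (1+ limit))
      (setf (aref phi i) i))
    (loop for p from 2 to limit
          ;; phi[p] is still p only if no smaller prime divided p, so p is prime.
          when (= (aref phi p) p)
            do (loop for m from p to limit by p
                     ;; Multiply phi[m] by (1 - 1/p). The division is exact
                     ;; because p still divides phi[m] at this point.
                     do (decf (aref phi m) (/ (aref phi m) p))))
    ;; Each denominator d contributes phi(d) reduced proper fractions.
    (loop for d from 2 to limit sum (aref phi d))))
```

For a small case, `(totient-sum 8)` counts the 21 reduced proper fractions with denominators up to 8. If this reading of the benchmark is right, `(totient-sum 1000000)` should reproduce the 303963552391 printed in the transcripts.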

With the latest CCL, 1.12, it takes 136 seconds^[2], as shown below.

$ ~/ccl/lx86cl64
Clozure Common Lisp Version 1.12 (v1.12-2-g5d13fc7d) LinuxX8664

For more information about CCL, please see http://ccl.clozure.com.

CCL is free software.  It is distributed under the terms of the Apache
Licence, Version 2.0.
? (compile-file "liutos.lisp")
#P"/home/xps13/tmp/liutos.lx64fsl"
NIL
NIL
? (load "liutos")
#P"/home/xps13/tmp/liutos.lx64fsl"
? (time (count-total-reduced-proper-fractions 1000000))
(COUNT-TOTAL-REDUCED-PROPER-FRACTIONS 1000000)
took 136,853,744 microseconds (136.853740 seconds) to run.
          10,086 microseconds (  0.010086 seconds, 0.01%) of which was spent in GC.
During that period, and with 4 available CPU cores,
     136,768,017 microseconds (136.768020 seconds) were spent in user mode
          90,333 microseconds (  0.090333 seconds) were spent in system mode
 319,615,296 bytes of memory allocated.
 533 minor page faults, 0 major page faults, 0 swaps.
303963552391

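Next, let's run the program with ABCL 1.6.2 on AdoptOpenJDK 14 with the OpenJ9^[3] JVM. Note that the program recurses heavily: under ABCL it needs 2 to 3 GB of stack space, and OpenJ9 is the only JVM implementation that can run it^[4]. To get enough stack space, we pass the JVM the options -Xmx6g -Xss3g -Xssi512m. To start with, we also add -Xjit:optlevel=scorching to turn JIT optimization all the way up and see what happens, as shown below.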

$ ~/abcl/abcl
VM settings:
    Stack Size: 3.00G
    Max. Heap Size: 6.00G
    Using VM: Eclipse OpenJ9 VM

Armed Bear Common Lisp 1.6.2-dev
Java 14 AdoptOpenJDK
Eclipse OpenJ9 VM
Low-level initialization completed in 0.757 seconds.
Startup completed in 1.765 seconds.

;; ...a long list of loading messages omitted here...

Type ":help" for a list of available commands.
CL-USER(1): (load "liutos")
; Loading /home/xps13/tmp/liutos.abcl ...
; Loaded /home/xps13/tmp/liutos.abcl (2.747 seconds)
T

;; First run: let the JIT compiler compile and optimize the code. Note that the JIT does not compile all of the code during this run.
CL-USER(2): (time (count-total-reduced-proper-fractions 1000000))
332.083 seconds real time
4853708 cons cells
303963552391

;; Second run: execute the compiled code that is already in memory.
CL-USER(3): (time (count-total-reduced-proper-fractions 1000000))
87.066 seconds real time
4853708 cons cells
303963552391

Just 87 seconds, more than 44 seconds faster than SBCL!

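Some may say this comparison is unfair because the program ran twice, and the first run is the real time. But this is in fact the fair way to run it. First, both SBCL and ABCL use maximum optimization: SBCL is compiled with (safety 0) (speed 3), while for ABCL the CL-level optimization settings make little difference, so the JVM's JIT optimization level is simply turned all the way up. Second, SBCL runs ahead-of-time compiled native code, with no compilation overhead at run time. ABCL's first run corresponds to SBCL's COMPILE-FILE step; only the second run directly executes compiled native code. If anything, ABCL's second run is still giving SBCL a head start, because part of the code it executes is interpreted^[5].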
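
The warm-up-then-measure procedure described above can be sketched as a small portable helper. This is only an illustration: the name `time-runs` is hypothetical, and the transcripts in this post used the standard TIME macro rather than this function.

```lisp
(defun time-runs (thunk &optional (runs 2))
  "Call THUNK RUNS times and return a list of wall-clock seconds, one per run.
On a JIT-based implementation the first element includes warm-up work."
  (loop repeat runs
        collect (let ((start (get-internal-real-time)))
                  (funcall thunk)
                  ;; Convert internal time units to seconds.
                  (/ (- (get-internal-real-time) start)
                     (float internal-time-units-per-second 1d0)))))
```

With the benchmark loaded, `(time-runs (lambda () (count-total-reduced-proper-fractions 1000000)))` would return two timings. On ABCL the second reflects the JIT-compiled code; on SBCL the two should be nearly identical, since the code is compiled ahead of time.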

In fact, even with only the default optimization (that is, without -Xjit:optlevel=scorching), ABCL is still clearly faster than SBCL, as shown below.

CL-USER(1): (load "liutos")
; Loading /home/xps13/tmp/liutos.abcl ...
; Loaded /home/xps13/tmp/liutos.abcl (0.606 seconds)
T
CL-USER(2): (time (count-total-reduced-proper-fractions 1000000))
120.97 seconds real time
4853708 cons cells
303963552391
CL-USER(3): (time (count-total-reduced-proper-fractions 1000000))
118.442 seconds real time
4853708 cons cells
303963552391

That is 10 to 13 seconds faster than SBCL, and that includes the JIT compilation time!^[6]
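
The gaps between implementations can be recomputed from the wall-clock times in the transcripts. The snippet below is a minimal sketch that stores those times as exact rationals so the differences carry no rounding error; the variable and function names are made up for this illustration.

```lisp
;; Wall-clock seconds copied from the transcripts, as exact rationals.
(defparameter *sbcl* 131693/1000)
(defparameter *abcl-scorching-second-run* 87066/1000)
(defparameter *abcl-default-first-run* 12097/100)
(defparameter *abcl-default-second-run* 118442/1000)

(defun seconds-saved (baseline candidate)
  "How many seconds CANDIDATE saves relative to BASELINE."
  (float (- baseline candidate)))

(defun speedup (baseline candidate)
  "How many times faster CANDIDATE is than BASELINE."
  (float (/ baseline candidate)))
```

For example, `(seconds-saved *sbcl* *abcl-scorching-second-run*)` is about 44.6, and the matching `speedup` is about 1.5. With the default JIT settings the savings are about 10.7 seconds for the first run and about 13.3 seconds for the second.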

Summary

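Since OpenJDK 11, and thanks to the powerful JVM, ABCL has leapt to become the fastest Common Lisp implementation available today, even though it does almost no optimization of its own. From now on, SBCL will probably not catch up with ABCL on performance.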

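[1] Only the default optimization settings were used here. With maximum optimization, SBCL actually slows down, to 132 seconds.
[2] This is the same time CCL took back when SBCL 1.5.3 was current. Since Gary Byers left, today's CCL is essentially no different from the CCL of that period.
[3] ABCL has been faster than SBCL ever since OpenJDK 11, which I discovered last year. The program I used back then ran fine on an ordinary HotSpot JVM, but this one does not. If a HotSpot JVM could run this program, I expect it would be even faster. Also, Azul's top-tier JDK is a HotSpot JVM as well.
[4] The other mainstream JDKs are all HotSpot JVMs, which can raise the stack size to 1 GB at most, and the stack overflows quickly.
[5] I conclude that part of the code is interpreted because forcing all code to be compiled before it runs changes the speed noticeably. When everything must be compiled first, execution takes longer, perhaps because that compilation itself gets in the way of the optimizer's analysis.
[6] This time the second run was only about 2.5 seconds faster, which is not much. The reason is that a JIT compiler has to compile quickly enough not to slow the program down noticeably. So, to a first approximation, the JIT spent roughly two seconds compiling.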