A non-semantics-preserving LLVM pass for KLEE symbolic execution

Introduction

Ishaan Gandhi
Jul 13, 2021

Symbolic execution is a program verification technique that analyzes a program by exploring every execution path. A symbolic execution engine assigns symbolic values to every input, as opposed to the concrete values a normal CPU execution would use.

Compiler optimizations are typically designed for concrete execution on a CPU, since most of the time, that is how a program will be run.

However, the kind of intermediate code optimized to run well on CPUs will not necessarily be the easiest to symbolically execute. A well-known problem with symbolic execution is path explosion: the symbolic execution engine slows down because it has to consider so many paths through a given program. Rewriting code to make it easier to verify, whether by KLEE or any other verification tool, is called a testability transformation.

This post investigates a testability transformation first proposed by Cristian Cadar in his paper Targeted Program Transformations for Symbolic Execution: polynomial interpolation.

Motivating Example

Consider the following code.

const int values[6] = {0, 1, 8, 27, 64, 125};

int foo(unsigned k) {
  return values[k] * values[k - 1];
}

When KLEE symbolically executes foo, it has to consider every possible value of k that can be passed as input. It essentially branches over each of the values in the array. foo could be rewritten* as follows:

int values(unsigned x) {
  return x * x * x;
}

int foo(unsigned k) {
  return values(k) * values(k - 1);
}

which could then be simplified to

int foo(unsigned k) {
  return (k * k * k) * ((k - 1) * (k - 1) * (k - 1));
}

Every call to foo with argument k can thus be replaced with the single symbolic expression (k * k * k) * ((k - 1) * (k - 1) * (k - 1)).
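As a quick sanity check (my addition, not part of the original pass), the rewritten polynomial expression agrees with the table lookup for every in-bounds index k ≥ 1; k = 0 would read values[-1], which is undefined behavior in the original code anyway.

```cpp
#include <cassert>

// Original lookup-based foo (k = 0 reads values[-1]: undefined behavior).
int foo_lookup(unsigned k) {
    static const int values[6] = {0, 1, 8, 27, 64, 125};
    return values[k] * values[k - 1];
}

// Rewritten, branch-free version: one symbolic expression, one path.
int foo_poly(unsigned k) {
    int a = static_cast<int>(k);
    return (a * a * a) * ((a - 1) * (a - 1) * (a - 1));
}
```

Checking k = 1..5 confirms the two versions coincide wherever the lookup is defined.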

This reduces the number of paths KLEE has to explore. The goal of the polynomial interpolation testability transformation is to automatically generate such polynomials and replace array lookups with them. We make this precise in the next section.

Design of the LLVM Pass

The polynomial interpolation transformation is implemented as an LLVM pass. LLVM is a compiler infrastructure built around an intermediate representation (LLVM IR) that a variety of frontends target, including Clang for C, which we need for the above examples. An LLVM pass is a C++ module that transforms given LLVM IR. A pass typically makes the code faster to run without changing the semantics of the program (an optimization), but in principle, a pass can make any change to the code. A variety of passes ship with LLVM, including dead code elimination and global variable optimization, but LLVM also lets developers define their own passes.

The polynomial interpolation LLVM pass, written in the file Interp.cpp, works as follows:

  1. All constant literal arrays in the program are identified for interpolation. For example, const int values[6] = {0, 1, 8, 27, 64, 125}; would be identified for extraction, but int values[6]; would not. For every identified array:
  2. A polynomial of degree at most 5 is fit to the array’s literal values using ALGLIB 3.16.0’s polynomialfit function.
  3. An LLVM function is created that takes an integer as input and returns an integer, approximating the polynomial reported by ALGLIB with its coefficients rounded to integers. For example, `f(x) = 0.99x + 0.49x⁴` would be approximated as `f(x) = x`, since 0.99 rounds to 1 and 0.49 rounds to 0.
  4. Every lookup into an identified array is replaced with a call into the generated function. (LLVM uses GEP instructions for array lookups.)
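Steps 2 and 3 can be sketched in ordinary C++ (a hypothetical standalone sketch, not the actual Interp.cpp, which emits LLVM IR rather than evaluating anything): round the floating-point coefficients returned by the fitting library to the nearest integers, then evaluate the resulting integer polynomial with Horner's rule.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Round each fitted coefficient to the nearest integer (step 3 in miniature).
std::vector<int64_t> roundCoeffs(const std::vector<double> &c) {
    std::vector<int64_t> out;
    for (double v : c)
        out.push_back(static_cast<int64_t>(std::llround(v)));
    return out;
}

// Evaluate the integer polynomial with Horner's rule.
// Coefficients are stored lowest degree first.
int64_t evalPoly(const std::vector<int64_t> &c, int64_t x) {
    int64_t acc = 0;
    for (auto it = c.rbegin(); it != c.rend(); ++it)
        acc = acc * x + *it;
    return acc;
}
```

For the cubic array above, a fit like {0.01, -0.02, 0.03, 0.99} rounds to {0, 0, 0, 1}, i.e. f(x) = x³, and evalPoly reproduces the array entries exactly.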

Experiment

I ran KLEE (version 2.2-pre) on my 2017 MacBook Pro with 8 GB of 2133 MHz memory and a 2.3 GHz dual-core Intel Core i5. Four programs were tested using KLEE. This, for example, is test_6_cubic.c:

#include <stdio.h>
#include <klee/klee.h>

const int values[6] = {0, 1, 8, 27, 64, 125};
const int array2[5] = {27, 64, 125, 216, 343};
const int array_non_literal[5];
float const_float = 3.14;

int foo(unsigned k) {
  return values[k] * values[k - 1];
}

int main(int argc, char *argv[]) {
  int a;
  klee_make_symbolic(&a, sizeof(a), "a");
  int f = foo(a);
  return (f * a) / f;
}

The values array has 6 entries, hence test_6_cubic.c. The other 3 programs are identical to test_6_cubic.c, but with 8, 10, and 12 entries in the values array (all encoding the function f(x) = x³).

Each program was compiled once with just the dead-code-elimination (DCE) pass, and once with both DCE and the interpolation passes. In total, then, there are 8 bitcode outputs: 2 levels of optimization for each of the 4 C programs.

I ran KLEE on each of the 8 bitcode outputs and timed each execution.

Results

KLEE was able to symbolically execute bitcode with the interpolation pass faster, with fewer paths, and with fewer tests generated than bitcode without the interpolation pass.

Table 1. Instructions executed by KLEE

In table 1, we see relatively fewer instructions executed on the longer arrays in the bitcode generated without the interpolation pass. Perhaps KLEE limits the amount of exploration in longer arrays.

The first row measures the number of instructions symbolically executed by KLEE when run on bitcode optimized with just dead code elimination for each of the 4 array lengths. (If you run KLEE, this metric is reported in a line of output similar to this one: KLEE: done: total instructions = 116.)

Table 2. Paths explored by KLEE

In table 2, we see our optimization does exactly what we wanted: collapsing the many branches of symbolic array indexing into a single** path.

Table 3. Time spent in symbolic execution

Lastly, in table 3, we see a significant speedup in symbolic execution by using interpolation, as measured by the sum of user and system times spent within the KLEE process.

Semantics preservation?

In Cadar’s original paper, this polynomial interpolation is listed as a semantics-preserving transformation. The code, as written, however, can change the semantics of the program. Consider this snippet:

// snippet a
int array[5] = {1, 4, 9, 16, 24};

The pass might generate an interpolating function as follows:

// snippet b
int array(int x) {
  return x*x + 2*x + 1;
}

This is an almost-quadratic array being treated as exactly quadratic: the generated function returns 25 at x = 4, where the array holds 24.

Second, an out-of-bounds array access such as array[-1] is a bug, but an “out-of-bounds” interpolated function call such as array(-1) will always successfully return.

Does that mean this pass is useless? First, note that it is fairly easy to modify the pass to preserve the semantics. We could fix the first problem by checking that the generated polynomial passes through every value of the array before substituting it. We could fix the second by adding a bounds check to the generated function. Even without these fixes, the pass might still be useful for approximate analyses.
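The first fix amounts to a post-hoc exactness check before substitution. A minimal sketch (hypothetical code, not the actual Interp.cpp, which would perform this check on the fitted coefficients before emitting IR):

```cpp
#include <vector>

// Returns true iff poly(i) == table[i] for every index i, i.e. the
// polynomial substitution is exact and therefore safe for in-bounds reads.
bool exactFit(const std::vector<int> &table, int (*poly)(int)) {
    for (int i = 0; i < static_cast<int>(table.size()); ++i)
        if (poly(i) != table[i]) return false;
    return true;
}

int cube(int x) { return x * x * x; }                  // exact for {0,1,8,27,64,125}
int almostSquare(int x) { return (x + 1) * (x + 1); }  // misses {1,4,9,16,24} at x = 4
```

Under this check, the cubic array from the motivating example would be substituted, while the almost-quadratic array from snippet a would be left alone.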

Upshot and future work

There are two areas I see as interesting continuations of this project:

Firstly, the canonical benchmark for KLEE and KLEE-related projects is GNU's coreutils, a collection of popular free and open-source programs, including ls and cat.

Unfortunately, it seems that this kind of optimization would not be useful in symbolically executing those programs, since a quick inspection of them shows they don’t have a single “polynomial-ish” array. In this post, I made a somewhat contrived benchmark of programs with polynomial arrays in them. This motivates one line of future questioning: how common are polynomial literal arrays in “real-world” programs?

Secondly, I made several design decisions for this LLVM pass, including interpolating with polynomials rather than any other family of functions, and bounding the degree of those polynomials at five, since the constraint solvers powering KLEE perform poorly on high-degree polynomials. Are these assumptions reasonable? Could we determine, from an array's length, a degree bound that still improves KLEE's performance?

The implementation of the polynomial interpolation pass shows that we can significantly speed up symbolic execution by simplifying arrays to polynomial functions, solving the path-explosion problem for these arrays, at least for certain (contrived) programs. Without knowing how common these “polynomial-ish” arrays are in real-world programs or how reasonable degree assumptions are, this project is more an interesting exercise in LLVM pass-writing than verification-enabling software 🙃

Footnotes

* Well, not exactly, since we aren’t preserving semantics.

** In reality, there are two paths explored, one where the argument to the generated function is 0, and another where the argument is -2124761607. This second argument is perhaps related to a path involving arithmetic overflow.
