Tut10: Hybrid Fuzzing

In this tutorial, we will learn about hybrid fuzzing, which combines fuzzing and symbolic execution to overcome their limitations. Moreover, we will have an exercise of using QSYM, a state-of-the-art hybrid fuzzer.

1. Limitations of Fuzzing and Symbolic Execution

To understand the limitations of fuzzing and symbolic execution, let's take a look at an example: (https://github.com/sslab-gatech/qsym/blob/master/vagrant/example.c)

int main(int argc, char** argv) {
  if (argc < 2) {
    printf("Usage: %s [input]\n", argv[0]);
    exit(-1);
  }

  FILE* fp = fopen(argv[1], "rb");

  if (fp == NULL) {
    printf("[-] Failed to open\n");
    exit(-1);
  }

  int x, y;
  char buf[32];

  ck_fread(&x, sizeof(x), 1, fp);
  ck_fread(buf, 1, sizeof(buf), fp);
  ck_fread(&y, sizeof(y), 1, fp);

First of all, the program opens a file whose name is given by the first argument of this program. Then, it fills three variables, x, buf, and y with the contents of the file.

  // Challenge for fuzzing
  if (x == 0xdeadbeef) {
    printf("Step 1 passed\n");

Then, it checks if x is equivalent to a magic number 0xdeadbeef. As we have seen before in the symbolic execution tutorial, such a check is troublesome for fuzzing, because the constraint is not likely to be met through a randomly-generated input (the chance is 1 our of 2^32, which is extremely low). However, through a symbolic execution, we can easily find the input that satisfies this condition, beause it is fairly simple to solve such a simple path constraint, i.e., x == 0xdeadbeef.

    // Challenge for symbolic execution
    int count = 0;
    for (int i = 0; i < 32; i++) {
      if (buf[i] >= 'a')
        count++;
    }

    if (count >= 8) {
      printf("Step 2 passed\n");

Unfortunately, symbolic execution is not a panacea. The example program also contains a simple, yet challenging routine for symbolic execution as shown in the above code. This introduces the most famous and notorious limitation of symbolic execution, known as path explosion. Path explosion refers to a situation where the number of paths exponentially grow during symbolic execution, making it difficult for symbolic execution to scale.

For example, in the example above, the number of feasible paths at the last if-statement is 2^32, which is extremely large to be handled through symbolic execution, because each element (total 32) of the buf array would diverge a current execution.

      // Challenge for fuzzing, again
      if ((x ^ y) == 0xbadf00d) {
        printf("Step 3 passed\n");
        ((void(*)())0)();
      }

Finally, the program has the third branch that is challenging for fuzzing to solve, followed by a buggy statement, i.e., ((void(*)())0)();. In order to reach the bug, we need to handle challenges for both symbolic execution and fuzzing, simultaneously. For exmaple, we won't find a bug if we only run symbolic execution because of the second challenge. If we only run fuzzing, it won't even pass the first challenge.

To resolve these issues, researchers have proposed hybrid fuzzing, which combines symbolic execution and fuzzing to complement their drawbacks. The idea is simple, yet effective; hybrid fuzzing selectively utilizes symbolic execution to when it faces branches that are challenging to get through with fuzzing (e.g., the first challenge above). One of the recent work in hybrid fuzzing is QSYM. In the remaining part of this tutorial, we will learn how to use QSYM for finding previously mentioned bugs, buried deep in programs.

2. Getting started with QSYM

For easier installation, QSYM provides a pre-built VM image using vagrant. Installing QSYM through vagrant is fairly straightfoward.

$ vagrant init jakkdu/qsym
$ vagrant up
$ vagrant ssh

Then, you have a SSH session in the QSYM's VM image. Let's use QSYM to find a bug in the previous example.

Before running QSYM, we need to set up several environments. First, we need to load some kernel configurations that QSYM depends on using sysctl command.

$ sudo sysctl --system

Then, we need to compile the example program into two versions; one for fuzzing and the other for concolic execution. We use a compiler from afl (afl-gcc) to instrument the binary for fuzzing, so that the code coverage could be collected while fuzzing. For compiling the binary for concolic execution, we use stock gcc.

# for fuzzing
$ ./afl-2.52b/afl-gcc -o example-afl example.c

# for concolic execution
$ gcc -o example example.c

We also need to prepare initial seed to fuzz. In our experiment, we will use a dumb test case that contains many 'A's. As you can imagine, a program being executed with this seed file as an input will not print anything, because it won't be able to pass any of the three steps that we have discussed above.

# make a seed
$ mkdir input
$ python -c'print"A"*4096' > input/seed

# nothing will print out
$ ./example ./input/seed

As we discussed before, fuzzing cannot find this bug because of constraints that are hard to be met in a random manner. To verify this, let's run two AFL instances to find the bug. It is worth noting that AFL supports fuzzing with multiple instances to utilize multiple cores in a modern computer. To enable it, you can have one master and multiple slaves. In this example, we will use one master and one slave instance for AFL.

# terminal 1 (using vagrant ssh)
$ ./afl-2.52b/afl-fuzz -M afl-master -i input -o output -- ./example-afl @@

# terminal 2
$ ./afl-2.52b/afl-fuzz -S afl-slave -i input -o output -- ./example-afl @@

Even after a few minutes (even hours), AFL will fail to find this bug. Let's add QSYM to overcome this issue.

# terminal 3
$ ./qsym/bin/run_qsym_afl.py -a afl-slave -o output -n qsym -- ./example @@

Then, after a few seconds, QSYM will find the bug magically. Let's understand how QSYM can find this bug. Note that the following file names and contents could be slightly different from yours because of randomness in fuzzing.

$ xxd output/qsym/queue/id\:000000\,src\:id\:000000
00000000: efbe adde 4141 4141 4141 4141 4141 4141  ....AAAAAAAAAAAA
00000010: 4141 4141 4141 4141 4141 4141 4141 4141  AAAAAAAAAAAAAAAA
00000020: 4141 4141 4141 4141                      AAAAAAAA

$ ./example ./output/qsym/queue/id\:000000*
Step 1 passed

If we see the first input generated by QSYM, we will see that QSYM's concolic execution sucessfully discovers an input that passes the first branch (i.e., the first four bytes are 0xdeadbeef). Unfortunately, this input only satisfies the first one, not others. After getting this input, AFL can pass the second branch using fuzzing. Let's take a look at the final input.

$ xxd output/afl-slave/crashes/id\:000000\,sig\:11\,sync\:qsym\,src\:000003
00000000: efbe adde 4141 4141 6161 6161 6161 6161  ....AAAAaaaaaaaa
00000010: 6161 6161 6161 6161 6161 6161 6161 4141  aaaaaaaaaaaaaaAA
00000020: 4141 4141 e24e 00d5                      AAAA.N..

$ ls output/qsym/queue/id\:000003*
output/qsym/queue/id:000003,src:id:000005

$ xxd output/afl-slave/queue/id\:000005\,src\:000004\,op\:havoc\,rep\:2\,+cov
00000000: efbe adde 4141 4141 6161 6161 6161 6161  ....AAAAaaaaaaaa
00000010: 6161 6161 6161 6161 6161 6161 6161 4141  aaaaaaaaaaaaaaAA
00000020: 4141 4141 4141 4141                      AAAAAAAA

As we can see, the final output has several non-'A' characters to satisfy the second branch, which requires at least eight characters that are greater than 'a'. Where does it come from? To figure out this, let's look at its file name. sync:qsym in the name of crash file represents that it is generated by QSYM's concolic execution. Moreover, the last six digits (i.e., 000003) means its identifier for QSYM. Let's find the input in the QSYM's directory. The file from QSYM also has the six-digit id (i.e., 000005) at the last, which represents its source file. If we find the source file in the afl-slave's directory, we will see that this is generated by fuzzer; the file name does not contain sync:qsym but other strategy in AFL (i.e., havoc).

[Task] Check LAVA/README.md and find at least ten bugs in base64 of LAVA-M dataset