Author: Mark Griffin

How Much Testing is Enough? Understanding Test Results with bncov and Coverage Analysis.

A frequently asked question in software testing is “Is that enough testing, or should we do more?” Whether you’re writing unit tests for your programs or finding bugs in closed-source third-party software, knowing what code you have and have not covered is an important piece of information. In this article, we’ll introduce bncov, an open source tool developed by ForAllSecure, and demonstrate how it can be used to answer common questions that arise in software testing.

At its core, bncov is a code coverage analysis tool. While there are several well-known tools that offer visibility into code coverage, we wanted to build a solution that enhanced and/or extended functionality in the following areas:

  1. Easily scriptable. Scriptability is a key feature to align with larger analysis efforts and for combining with other tools.
  2. Strong data presentation. Good visualizations quicken and enhance understanding.
  3. Fuzzing/testing workflow compatible. Tools that exactly fit your needs increase productivity and speed.
  4. Supports binary targets. Sometimes you don’t have the original source code.

While existing code coverage tools handle some of these areas very well, our main focus was scriptability. Our driving purpose is to answer common questions in software testing, which often require combining information from static and dynamic analysis, so the tool has to be flexible enough to cover a wide variety of questions. We found that a plugin for Binary Ninja fits this need well, because it lets users easily leverage the information from Binary Ninja in a Python scripting environment.

The workflow for using bncov is a three-step process. While the first step is up to you, we’ve made the other steps easy to pipeline:

  1. Generate test cases. Test cases can be generated by any approach, from fuzzing solutions to manual test case development.
  2. Generate coverage data from those test cases.
  3. Run analysis and display output with bncov.

After running the normal install process for Binary Ninja plugins (instructions available at https://docs.binary.ninja/guide/plugins/index.html#using-plugins), the first step is to collect coverage information. This is done by running your target program with your inputs, also known as input files or seeds, and collecting coverage in the drcov format (from DynamoRIO’s built-in drcov tool, http://dynamorio.org/docs/page_drcov.html). We’ve packaged a script to make this easier, but it’s nothing a simple bash loop couldn’t accomplish. It’s important to note what data is collected, because this is the data that ends up in the plugin and forms the basis of our analyses. The coverage files generated by drcov record which basic blocks were executed, but not the order or the number of times the blocks were executed.
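As a rough illustration of the "simple loop" idea, the sketch below builds one `drrun -t drcov -- <target> <input>` command per input file. It is not the script packaged with bncov. The function names are hypothetical, and it assumes that the target takes the input path as its only argument, that `drrun` is on the PATH, and that drcov writes its log into the current working directory.

```python
import subprocess
from pathlib import Path


def build_drcov_commands(target, input_dir, drrun="drrun"):
    """Return one drrun command line (as an argv list) per input file.

    Assumption: the target program takes the input file path as its
    only command-line argument.
    """
    target_path = str(Path(target).resolve())
    commands = []
    for input_file in sorted(Path(input_dir).iterdir()):
        if input_file.is_file():
            commands.append(
                [drrun, "-t", "drcov", "--", target_path, str(input_file.resolve())]
            )
    return commands


def collect_coverage(target, input_dir, log_dir):
    """Run every command from inside log_dir.

    Assumption: drcov drops its log file into the working directory.
    """
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    for command in build_drcov_commands(target, input_dir):
        # A crashing input should not stop the loop, so check=False.
        subprocess.run(command, cwd=log_dir, check=False)
```

Calling `collect_coverage("./xml_parser", "inputs", "coverage")` (all three names are placeholders) would then leave one coverage log per input in the `coverage` directory.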

With the information from the coverage files, we can now visualize block coverage using bncov. Import the whole directory of coverage files, and you’ll see blocks colored in a heatmap fashion, with blocks painted from blue to purple to red. Redder hues indicate that a block was covered in a smaller percentage of input files (i.e. the block is “rare” among the inputs), while bluer hues show blocks with a higher percentage, indicating more common code paths. Blocks that have not been covered at all are not recolored. This color scheme allows users to instantly visualize which blocks have been tested and what the common code paths are as they review functions.
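To make the heatmap idea concrete, here is a minimal sketch of one way to turn "fraction of input files that cover a block" into a blue-to-purple-to-red color. It is an illustration only: the linear interpolation and the function name are assumptions, not bncov's actual coloring code, and block addresses are plain integers rather than Binary Ninja objects.

```python
def block_rarity_colors(coverage_by_file):
    """Map each covered block address to an (r, g, b) heatmap color.

    coverage_by_file: dict of coverage file name -> set of block addresses.
    Blocks covered by every file come out pure blue; rarer blocks shift
    through purple toward red. Uncovered blocks get no entry at all.
    """
    total_files = len(coverage_by_file)

    # Count how many coverage files contain each block.
    counts = {}
    for blocks in coverage_by_file.values():
        for addr in blocks:
            counts[addr] = counts.get(addr, 0) + 1

    colors = {}
    for addr, count in counts.items():
        fraction = count / total_files  # 1.0 means every input hit the block
        red = round(255 * (1.0 - fraction))
        blue = round(255 * fraction)
        colors[addr] = (red, 0, blue)
    return colors
```

With four coverage files, a block hit by all four maps to `(0, 0, 255)`, while a block hit by only one maps to a mostly red `(191, 0, 64)`.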

Figure 1: The smaller the relative percentage of test cases that cover the block (“the rarer it is”), the more reddish it is.

Coverage visualization is very helpful for manual analysis, but what sets bncov apart is its scripting flexibility and ability to automate analysis. The code coverage data behind the visualization can be used within Binary Ninja’s built-in scripting console or in a normal Python environment (the latter requires a Binary Ninja commercial license), allowing for additional analyses that build on Binary Ninja’s existing knowledge of the binary. The ability to programmatically reason about code coverage with a set of input files is extremely powerful, and we’ve provided some built-in examples as starting points, such as the GUI commands “Highlight Rare Blocks” and “Highlight Coverage Frontier.” These examples highlight and log blocks that are only covered by a single coverage file and blocks that have an outgoing edge to an uncovered block, respectively. Users can build various interesting analyses on top of these building blocks to answer challenging questions, such as the one we started with: “Should we do more testing?”
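The two definitions above are simple enough to sketch with plain Python sets. This is not bncov's implementation or its API: the function names are hypothetical, and the control-flow graph is assumed to be a dictionary mapping each block address to the addresses of its successors, which in practice would come from Binary Ninja.

```python
def rare_blocks(coverage_by_file):
    """Return blocks that appear in exactly one coverage file."""
    counts = {}
    for blocks in coverage_by_file.values():
        for addr in blocks:
            counts[addr] = counts.get(addr, 0) + 1
    return {addr for addr, n in counts.items() if n == 1}


def coverage_frontier(successors, covered):
    """Return covered blocks with at least one edge to an uncovered block.

    successors: dict of block address -> iterable of successor addresses.
    covered: set of block addresses seen in any coverage file.
    """
    return {
        addr
        for addr in covered
        if any(succ not in covered for succ in successors.get(addr, ()))
    }


# Toy data: two coverage files and a five-block control-flow graph.
coverage = {"a.cov": {0x1000, 0x1010}, "b.cov": {0x1000, 0x1020}}
cfg = {
    0x1000: [0x1010, 0x1020],
    0x1010: [0x1030],   # 0x1030 is never executed
    0x1020: [0x1000],   # loops back to a covered block
    0x1030: [],
}
all_covered = set().union(*coverage.values())
rare = rare_blocks(coverage)
frontier = coverage_frontier(cfg, all_covered)
```

In this toy graph both `0x1010` and `0x1020` are rare, but only `0x1010` is on the frontier, because it is the only covered block with an edge into code that no input reached.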

Figure 2: Blocks highlighted in green are in the “Coverage Frontier” — meaning they have an outgoing edge that isn’t covered.

As a demonstration, let’s walk through an open source project that has built-in test resources. The open source XML library “TinyXML-2” (https://github.com/leethomason/tinyxml2) is an excellent example because it is a compact library that includes a test program, test inputs, and a Google OSS-fuzz harness. If users choose to conduct additional testing (like fuzzing) it’s helpful to understand what code the built-in test cases cover and compare how much more coverage fuzzing yields. This process is simplified by using a bncov script to compare coverage between the set of coverage files from before and after fuzzing. The code below is the heart of the coverage comparison process from the script:

Figure 3: Comparing coverage between sets of inputs with bncov
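For readers who want a feel for what such a comparison involves, the following is a plain-Python sketch of the underlying set arithmetic. It is not the script that ships with bncov and does not use bncov's API; the function name and the in-memory dictionaries of block addresses are assumptions made for illustration.

```python
def compare_coverage(before, after):
    """Compare two groups of coverage files at basic-block granularity.

    before, after: dicts of coverage file name -> set of block addresses.
    Returns the blocks unique to each group and the blocks they share.
    """
    before_blocks = set().union(*before.values())
    after_blocks = set().union(*after.values())
    return {
        "only_before": before_blocks - after_blocks,
        "only_after": after_blocks - before_blocks,
        "shared": before_blocks & after_blocks,
    }


# Toy example: the second group reaches one block the first does not.
seeds = {"seed1": {0x1000, 0x1010}, "seed2": {0x1000, 0x1020}}
fuzzed = {"case1": {0x1000, 0x1010, 0x1020}, "case2": {0x1000, 0x1040}}
result = compare_coverage(seeds, fuzzed)
```

Here `result["only_after"]` contains the single newly reached block `0x1040`, which is the kind of per-block difference the comparison is meant to surface.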

We’ll start the analysis with three sets of starting inputs:

  1. The test XML files included in the resources directory.
  2. XML inputs extracted from TinyXML-2’s test binary.
  3. A set of XML files gathered from multiple test suites on the Internet.

First, we collect coverage using bncov’s drcov automation script on each input set to understand the baseline level of coverage we get from the different inputs. We wrote a simple program that uses TinyXML-2 to parse and print input files, which we used as our target for collecting coverage (and later for fuzzing). The baseline results show that the extracted test cases offered significantly more coverage than the test cases from the resources directory, which makes sense because the test binary includes all the tests from the resources. Also, as you might expect, the combination of multiple external test suites had the most coverage among the initial input sets.

By fuzzing our target program with each of the input sets, we explore new code paths in TinyXML-2 by generating test cases that cover basic blocks the initial sets do not. The results of fuzzing vary greatly depending on multiple factors: how long the fuzzer is run and how fast the target program is, the kind of input processing the target does, the quality of the starting input set, the capabilities of the fuzzer, etc. In our case, though, we’re looking to compare coverage and look for relative increases in block coverage across the input sets, so we simply fuzzed each input set for the same period of time with AFL. Once the fuzzing finished, we compared the results using one of the scripts included with bncov.

Figure 4: Coverage comparison script output.

As expected, we saw increased coverage for each input set after a short fuzzing run. Although the gap in the number of blocks covered between each input set narrows after fuzzing, there are certain blocks that were only found by the external suite. This result makes sense, as certain input constructs are harder for a fuzzer like AFL to synthesize. This is where a technique known as symbolic execution, a technology within our Mayhem solution, can often help by solving for inputs that are unlikely to be discovered by random permutations from a fuzzer.

Using the script output, we can now start to answer “how much is enough testing?” Using bncov, users now have data points that show which functions have been exercised and which basic blocks are not covered by the existing test cases. With the included coverage frontier analysis, we can also see the boundary between existing test inputs and untouched code, allowing users to automatically identify functions that could benefit from further exploration. This type of analysis quickly increases the amount of understanding a user has of the target code, and this is the kind of information needed to answer “how much is enough.”
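One simple way to turn these data points into a prioritized list is to rank functions by how much of each one the inputs have exercised. The sketch below assumes the function-to-block mapping is already available as a dictionary of sets (in practice it would come from Binary Ninja); the function names in the toy data are hypothetical and the code does not use bncov's API.

```python
def function_coverage(function_blocks, covered):
    """Summarize block coverage per function, least covered first.

    function_blocks: dict of function name -> set of its block addresses.
    covered: set of block addresses seen in any coverage file.
    Returns a list of (name, covered_count, total_count, percent) tuples.
    """
    rows = []
    for name, blocks in function_blocks.items():
        total = len(blocks)
        hit = len(blocks & covered)
        percent = 100.0 * hit / total if total else 0.0
        rows.append((name, hit, total, percent))
    # Sort by percentage, then by name so ties come out in a stable order.
    rows.sort(key=lambda row: (row[3], row[0]))
    return rows


# Toy data with hypothetical function names.
functions = {
    "parse_node": {1, 2, 3, 4},
    "print_node": {5, 6},
    "error_path": {7},
}
summary = function_coverage(functions, covered={1, 2, 5, 6})
for name, hit, total, percent in summary:
    print(f"{name}: {hit}/{total} blocks ({percent:.0f}%)")
```

Functions at the top of the list, such as the completely untouched `error_path` here, are natural candidates for new test cases or a longer fuzzing run.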

Figure 6: Enumerate frontier blocks for each function.

Coverage analysis, and using coverage information to enhance fuzzing, is an active and developing research area. Using bncov to reason about coverage is a step forward because it enables automated analysis and the flexible reasoning required for targeted application of techniques that augment fuzzing, such as directed symbolic execution. We’ll share more on these advanced topics in a future installment, but in the meantime you can fork bncov on GitHub and experiment for yourself! We hope it helps you get a better understanding of your testing coverage and discover code paths you might be missing.

ForAllSecure offers Mayhem, a dynamic testing solution that brings together the tried-and-true techniques of coverage-guided fuzzing with the advantages of symbolic execution, including patented technology from over a decade of research at Carnegie Mellon University. You can learn more at https://forallsecure.com/.