How we compared code analyzers: CppCat, Cppcheck, PVS-Studio and Visual Studio

We have carried out a thorough comparison of four analyzers for C/C++ code: CppCat, Cppcheck, PVS-Studio and Visual Studio's built-in analyzer. In this article, we will in every detail discuss the comparison methodology and the results.

Picture 2

Introduction

We have mentioned a number of times [1, 2] that comparing static code analyzers is a very complicated task. It can't be reduced to simply counting the number of diagnostic rules from the documentation. It also can't be reduced to counting the number of diagnostic messages generated by analyzers on one test project. The only reliable method is to check several different projects with all the analyzers and compare the number of bugs found by each. It is a huge and very labor-intensive task, but this technique alone allows one to carry out a reliable comparison of different analyzers.

We should say that we have done pretty much work. There were 4 of us working on this task, and we spent total 170 man-hours. It's more than a month of one programmer's work!

Labor-intensiveness

Why did it take us 170 hours to accomplish the comparison? There were quite many nuances about it, invisible and boring yet time-consuming. For example, we didn't include into our report the Steamworks SDK project's analysis results. The PVS-Studio and CppCat analyzers caught one single bug in it, while the other analyzers didn't find anything at all. It turned to be pointless to include this project's check results into the report, but we had had to spend some time on downloading and checking that project before we knew that. Here's another example. Visual Studio's built-in analyzer caused the IDE to crash a number of times, so we had to relaunch the analysis. There were numerous trifles like these that made up a huge amount of work.

Here are the labor-intensiveness metrics of various stages of our investigation:

  • 40 hours. Selecting test projects; setting up the environment to ensure correct compilation and analysis of the projects (downloading third-party libraries and so on).
  • 10 hours. Checking the projects with the analyzers.
  • 70 hours. Examining diagnostic messages and picking out interesting samples.
  • 10 hours. Working out the format of data representation, designing Excel tables' structures.
  • 15 hours. Entering the results in Excel tables.
  • 5 hours. Drawing up summary tables and designing pictures for articles.
  • 20 hours. Writing, editing and publishing the articles.

Analyzers taking part in the comparison

The following analyzers took part in the comparison:

  • CppCat 1.1;
  • Cppcheck 1.63.1;
  • PVS-Studio 5.13;
  • Microsoft Visual Studio Professional 2013's built-in analyzer.

Each analyzer was used with the recommended settings:

  • CppCat: all diagnostics enabled (there are no severity levels of messages);
  • Cppcheck: only crucial diagnostic messages enabled (Errors and Warnings);
  • PVS-Studio: recommended settings (the 1-st and 2-nd severity levels for general diagnostics);
  • Visual Studio 2013: Microsoft Native Recommended Rules.

Test Projects

We checked the following open-source projects:

  • Eigen is a high-level C++ library of template headers for linear algebra, matrix and vector operations, numerical solvers and related algorithms;
  • PostgreSQL, often simply "Postgres", is an open-source object-relational database management system (ORDBMS) with an emphasis on extensibility and standards-compliance;
  • Source SDK. The Source SDK is a software development kit compiled by Valve Corporation that is used to create games or mods for the Source engine;
  • Synergy is a software application for sharing a keyboard and mouse between multiple computers.
  • Coin3D is a free and open-source implementation (library) of the Open Inventor API;
  • OpenDDS. The Data Distribution Service for Real-Time Systems (DDS) is an Object Management Group (OMG) machine to machine middleware "m2m" standard that aims to enable scalable, real-time, dependable, high performance and interoperable data exchanges between publishers and subscribers;
  • Firebird is an open source SQL relational database management system that runs on Linux, Windows, and a variety of Unix;
  • CRYENGINE SDK, originally called Sandbox Editor, is the current version of the level editor used to create levels for CryEngine by Crytek;
  • FlightGear Flight Simulator (often shortened to FlightGear or FGFS) is a free, open source multi-platform flight simulator developed by the FlightGear project since 1997;
  • V8 implements ECMAScript as specified in ECMA-262, 5th edition, and runs on Windows (XP or newer), Mac OS X (10.5 or newer), and Linux systems that use IA-32, x64, or ARM processors;
  • Wild Magic 5. C++ library for real-time computer graphics and physics, mathematics, geometry, numerical analysis, and image analysis;
  • Rhino is an open source JavaScript engine. It is developed entirely in Java and managed by the Mozilla Foundation. It is separate from the SpiderMonkey engine, which is also developed by Mozilla, but is written in C and is used in Firefox;
  • Push Framework is an open source C++ Toolkit for high-performance server development.

We don't specify the project versions consciously. The objective of our comparison was the comparison itself, not catching bugs in the latest versions of the projects. Complete analysis of the projects took quite a while, so their source code may have well changed by now. At the very least, the number of lines of code is almost sure to have changed.

That's why we suggest that projects' authors check the latest version of their source code themselves.

Notice also that we didn't distinguish between bugs found in the projects themselves and bugs found in third-party libraries utilized by the projects. Such a distinction is irrelevant for our task.

Criteria of picking out diagnostic messages

This subject is not easy to explain. Analyzer-generated warnings suggest that the user should thoroughly investigate certain code fragments. Although they very often turn to be false positives, they are still worth examining as some refactoring can be done over them to make the code clear not only to the analyzer, but to other programmers too.

However, what we actually wanted to campare is the number of genuine or potential bugs found by each analyzer.

It's difficult to draw a distinctive line between crucial and non-crucial defects. We can't give a formal definition of what to consider a really dangerous code fragment. We had to rely on our intuition and we hope it had't let us down.

Here is an example to illustrate this idea.

There are a few diagnostics warning that a certain function uses too much stack memory, say, 16 Kbytes (programs use a 2 Mbyte stack size by default). It can happen when a local array "double x[2000]" is declared in the function. Is it a bug or is it not? Programmers use such functions pretty often and they usually don't cause any troubles. So we decided to ignore such warnings.

And here's another example: a warning about a call to the alloca() function inside a loop. The alloca() function allocates memory on the stack. What is convenient about this, the allocated memory is automatically freed after use. But when called from a loop, it may consume the whole stack pretty quick. Whether or not it will occur is unknown. We find such constructs more dangerous than one-time stack allocation when entering a function, so we took note of messages like that.

We tried to stick to the following rule:

A diagnostic message is considered crucial if it points to an evident error or a code fragment that may easily turn into a bug after refactoring.

Let's take a look at a few examples to understand this point.

An example of an evident error:

if (memcmp(&(beentry->st_clientaddr), &zero_clientaddr,
           sizeof(zero_clientaddr) == 0)) 

A parenthesis is written in a wrong place, which results in the memcmp() function comparing 0 bytes.

An example of a potential error:

const ULONG END_BUCKET = (~0) << 1;

Shifting negative values causes undefined behavior. This code will work as the programmer expects with certain compilers, but it is actually sheer luck. At any moment, this code may fail. That's why we find such warnings crucial and worth mentioning too.

An example of a code fragment of we decided to skip:

size_t length = ....;
fb_assert(ptr && length >= 0 && length < 9);

The 'lenght' variable is unsigned. Formally, the "length >= 0" comparison is pointless and therefore redundant. We didn't take account of such warnings.

False positives

At first we wanted to include the information about false positives into our report. But if we had just mentioned the total number of false warnings generated by each analyzer, the readers might have got it wrong. You see, for some projects certain analyzers generated a huge number of false positives which were actually easy to eliminate.

No, we are not trying to show PVS-Studio or CppCat in a good light - they did well anyway. It was Visual Studio that showed poor results.

All the analyzers employ a certain mechanism to cut the number of warnings. If a suspicious code fragment is located in an *.h file, the diagnostic message is generated only once regardless of how many times this header files is included into *.cpp files. There's no sense in showing one and the same warning again and again. Visual Studio's built-in analyzer uses this mechanism as well.

The trouble about it is that in Visual Studio this works only within the scope of a project, not the entire solution under analysis. As a result, you may sometimes face a problem like the one described below.

The Eigen solution contains over 500 projects. There are several header files included into all these projects. These header files contain suspicious code fragments that trigger Visual Studio's warnings. Since message filtering is performed only for an individual project, each of these warnings is repeated about 500 times.

As a result, instead of 3 more or less useful warnings, you get about 1900 warnings on those ill-fated header files.

We don't feel like sharing such negative results with you. The problem can be easily solved by modifying the header files. But we will face a dilemma then: it's not fair to pay more attention to one tool without helping the others.

After all, the characteristics of CppCat, Cppcheck and PVS-Studio can also be significantly enhanced through minimum configuration. For example, PVS-Studio may sometimes generate a warning on each code line where a suspicious macro is used. The article "PVS-Studio vs Chromium. 3-rd Check" describes an example when turning off warnings for the DVLOG macro helped us eliminate 2300 false positives at one sweep.

The same is with the Cppcheck analyzer. So it turns out that minimum customization may help significantly improve the tools' capabilities.

So, here is a useful conclusion:

One can't discuss the number of false positives produced by different tools without carrying out a prior minimum customization of them. Some tiny and annoying nuances may show the picture in a much worse light than it actually is.

So, as we have concluded, we needed some minimum customization of the analyzers. But we didn't know how to improve all the analyzers' results equally. One could easily blame us for having paid more attention to one tool over the others. That's why we decided to ignore the "number of false positives" metric altogether.

There was also another reason why we found it difficult to reliably estimate this metric. As we have already mentioned, we had to distinguish between crucial and non-crucial messages. And it was not clear whether to refer non-crucial messages to false positives. They would just have messed it all up even more.

Biased estimate

We tried to be unbiased. But you can't be completely unbiased when comparing a tool that you have developed on your own to other tools.

But trust us, we honestly tried to make the comparison as objective as possible. Besides, CppCat and PVS-Studio really found much more errors than the other analyzers, so there was no sense in artificially boosting their rating.

The aspects where our bias could show up

We are more familiar with our own analyzers, CppCat and PVS-Studio, than with any other tool. Because of that, we can tell easily and quickly when they catch genuine bugs. It's not that easy and quick with other analyzers, so we may have missed something important.

Compensation

Being aware of that possibility, we tried to compensate for this effect. For example, we didn't think the following defect to be crucial:

unsigned int x = ....;
printf("%d", x);

"%u" should be used instead of "%d" here. This code will become harmful if the algorithm is handling large numbers (larger than INT_MAX). But in a real-life application, this code is unlikely to cause troubles. That's why PVS-Studio generates a 3rd level warning, and we didn't include warnings of this level into the report.

In our opinion, similar messages generated by the other analyzers were non-crucial as well. However, Cppcheck found this bug to be of high importance, so we took note of such messages by this analyzer, though ignored them for PVS-Studio.

We hope that the comparison was fair enough due to such compensation.

Analysis speed

Quite unexpectedly we discovered that the "analysis speed" metric was totally useless in isolation from a particular project. Depending on the project, the performance of analyzers may vary greatly.

The following examples confirm this very well:

  • it takes PVS-Studio 8 minutes and Cppcheck only 1 minute to complete the analysis of the PostgreSQL project;
  • it takes PVS-Studio 1 minute and Cppcheck 7 minutes to complete the analysis of the Synergy project.

It's a very huge dispersion. On one project, the Cppcheck analyzer is 8 times faster than PVS-Studio, while PVS-Studio is 7 times faster than Cppcheck on another project.

We could calculate some average value, of course, and we did it.

But keep in mind that a programmer works with a concrete real-life project, not just some abstract "average" code. You cannot predict which of the analyzers will show the best performance on your project. Remember that.

Below is a table with the analysis speed figures for different projects:

Table 1 - Analysis time for different projects (in minutes)

Table 1 - Analysis time for different projects (in minutes)

On average, Cppcheck proved to be the fastest, while Visual Studio proved to be the slowest. CppCat is actually a bit faster than PVS-Studio as it has fewer diagnostics to check. But the difference was so insignificant (a few dozens of seconds) that we found it pointless to reflect it in the table.

The process of verifying and analyzing the results

We didn't face any serious troubles when checking the test projects. It was rather the task of examining all the diagnostic messages and picking out interesting ones that was most difficult.

Unfortunately, we had to skip certain groups of messages for certain projects: when we found a few dozens of false positives of one type, we gave up further investigation. Of course, we could have missed some genuine bugs because of that. But we could not afford wasting time closely studying each and every warning. Since all the analyzers took on the role of "a loser" at different times, the reliability of the investigation is unlikely to have been affected due to mutual compensation of the results.

Here's an example for you to understand in which cases we stopped further examination of warnings.

On one of the projects, Visual Studio generated over 500 warnings of the following type:

C6011 Dereferencing NULL pointer 'obj'. test_cppbind  
    ghost.hxx  714
    'obj' may be NULL (Enter this branch)      705
    'obj' is dereferenced, but may still be NULL      714

Let's pick some simple function and find out the reason for the warning:

static raw_type& requireObject(naContext c, naRef me)
{
  raw_type* obj = getRawPtr( fromNasal(c, me) );
  naGhostType* ghost_type = naGhost_type(me);
  if( !obj )
    naRuntimeError
    (
      c,
      "method called on object of wrong type: "
      "is '%s' expected '%s'",
      ghost_type ? ghost_type->name : "unknown",
      _ghost_type.name
    );
  return *obj;
}

There is actually no issues in here. Let's peek into the naRuntimeError() function:

void naRuntimeError(naContext c, const char* fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(c->error, sizeof(c->error), fmt, ap);
    va_end(ap);
    longjmp(c->jumpHandle, 1);
}

This function must print an error message and call the longjmp function. This is an alternative to the exception throwing mechanism.

What matters to us is that the naRuntimeError() function doesn't pass the control back, so the null pointer will never be dereferenced. Visual Studio's analyzer just cannot figure it all out and generates a pile of false positives.

We checked the first 20 messages and they all turned to be false positives of one type related to the call of the naRuntimeError() function. So we dropped further examination of diagnostic messages with the code C6011 (Dereferencing null pointer).

Is it possible that we missed a few genuine bugs among the rest 500 messages of that type? Yes, it is. But on the other hand, we didn't feel like spending several hours just to find that all the 500 warnings were false positives. In the example above, you can quickly tell there's no bug there, but large functions take much more time and effort to figure them out.

So, once again, it doesn't seem to be a serious problem to us. We had to skip certain groups of messages for each of the analyzers. Sometimes it was Visual Studio that generated a pile of false positives, and sometimes it was Cppcheck, or PVS-Studio, or CppCat. Due to this, the final results were mutually compensated for.

Summary table

Finally we can take a look at the comparison results. We arranged them in a summary table:

Table 2 - Comparing diagnostic capabilities of different analyzers. Click on the picture to view in full size.

Table 2 - Comparing diagnostic capabilities of different analyzers. Click on the picture to view in full size.

This Excel table was built in the following way. We had four Excel sheets (one sheet per analyzer) and filled in the corresponding messages by each analyzer to get four tables. Then we combined them manually into one table titled "CppCat + PVS-Studio + CppCheck + Visual Studio (including duplicate messages)". Of course, simply joining the tables results in duplicate messages which we eliminated with Excel tools. Thus we obtained a fifth table titled "CppCat + PVS-Studio + CppCheck + Visual Studio (without duplicate messages)". It was this table that we used as a template when building the final tables with the results for each of the projects by all the analyzers. The final tables on all of the checked projects were combined into one summary table shown above (Table 2).

The summary table (Table 2) contains the following information:

  • the list of analyzed projects;
  • the total number of genuine bugs found in the project by each analyzer;
  • the number of genuine bugs found by each analyzer in the numerical and percentage form;
  • the analyzer that showed the best results on the project is marked green;
  • the 'Total' column contains the summarized information on all the projects.

You can download this Excel file with the list of all the detected bugs here: SCA_Comparison.xlsx.

The difference between PVS-Studio and CppCat

The article doesn't show any differences between PVS-Studio and CppCat because PVS-Studio's operation under recommended settings is in no way different from that of CppCat.

To learn about the differences between the two analyzers, see the article "An Alternative to PVS-Studio at $250".

Were there any bugs that PVS-Studio caught and CppCat missed? Yes, there were. For example, the third-level diagnostics caught a few errors. Here is the list of all those errors, just for the sake of reference: PVS_Studio_Level3.txt. We noted down total 52 crucial warnings of the third level.

We hadn't studied the third-level warnings with much attention, so we could well have missed something. Speaking of diagnostic messages related to 64-bit errors and microoptimizations, we hadn't studied them at all.

Subjective opinion on the analyzers

It would be impolite to praise our own analyzers CppCat and PVS-Studio, so let us only mention that they have found the largest number of defects at a moderate ratio of false positives.

Among Cppcheck and Visual Studio, we would surely prefer Cppcheck. Visual Studio's analyzer is pretty raw, at the very least. For a number of times, we witnessed crashes of Visual Studio when examining diagnostic messages. You just click on the next message, and the IDE crashes. Considering that an analysis report can't be saved in Visual Studio, those crashes annoyed us very much as we had to recheck the whole project.

Besides, we found Cppcheck's diagnostic messages much more comprehensible than those of Visual Studio. What's interesting, Visual Studio's analyzer employs the code coloring mechanism when selecting a message in the list. We find this option useless for medium and large functions, while the messages themselves are quite vague to grasp their meaning.

Conclusion

As you can see, comparing static analyzers is a very difficult and complex task. We believe that we have done an objective investigation and got reliable results. We hope they will be of use and interest to all who are interested in static analysis.

References to our older articles about comparison of static code analyzers