Static analysis and regular expressions

09.12.2010 Andrey Karpov

I develop the PVS-Studio static code analyzer intended for analyzing C/C++ software. After we implemented general analysis in PVS-Studio 4.00, we received a lot of responses, both positive and negative. By the way, you are welcome to download a new version of PVS-Studio where we have fixed a lot of errors and defects thanks to users who told us about them.

While discussing PVS-Studio 4.00, the question was again raised if we could implement most checks using regular expressions and if we actually complicate the matter suggesting that we must necessarily build and handle a parse tree during analysis. This question arises not for the first time, so I decided to write an article to explain why it is a very bad idea to try to use regular expressions for C/C++ code analysis.

Those familiar with the compilation theory certainly understand that the C++ language can be parsed only relying on grammatics and not regular expressions. But most programmers are not familiar with this theory and they continue to tell us about using regular expressions to search for errors in software code over and over again.

Let me say right away that we can find some issues using regular expressions. There are even several static analyzers that use this principle. But their capabilities are very restricted and mostly come to messages like "There is the "strcpy" function being used, you'd better replace it with a safer one".

Having thought it over how to tell the community about lameness of the regular expression method, I decided to do the following simple thing. I will take the first ten diagnostic messages of general analysis implemented in PVS-Studio and show by the example of each of them what restrictions the regular expression method involves.

Diagnosis 0

Once I started describing V501, I recalled that none of the analysis types would provide me with sufficient information until #define's remain unexpanded. The error might hide inside the macro but it will remain an error all the same. It is rather simple to create a preprocessed file, so assume we already have i-files. Now we encounter the first trouble - we must determine which code fragments refer to system files and which refer to user code. If we analyze system library functions, it will significantly reduce the speed of analysis and cause a lot of unnecessary diagnostic messages. Thus, if we use regular expressions, we must parse the following lines:

#line 27 "C:\\Program Files (x86)\\Microsoft Visual Studio 8\\VC\\atlmfc\\include\\afx.h"

#line 1008 ".\\mytestfile.cpp"

and understand which of them refer to our program and which refer to Visual Studio. But that's not the half of it: we must also implement relative reading of lines inside i-files since we must generate not the absolute number of the line with the error in the preprocessed i-file but the number of the line in our native c/cpp-file we are analyzing.

So, we have not even started but already get a whole lot of difficulties.

Diagnosis 1

V501. There are identical sub-expressions to the left and to the right of the 'foo' operator.

In order not to overload the text, I suggest that the readers go by the link and read the description of this error and samples. The point of this rule is to detect constructs of this type:

if (X > 0 && X > 0)

At first sight, we could easily find such constructs using a regular expression when identical expressions stand to the left and to the right of operators &&, ||, ==, etc. For example: we search for the && operator. If there is something looking identical in parentheses to the right and to the left of &&, we certainly have an error. But it won't work because one could write it this way:

if (A == A && B)

The error is still here but there are different expressions to the left and to the right of '=='. It means that we must introduce the notion of precedence of operators. Then we must cut off boundaries on lower-priority operators such as '&&' if we have '=='; and vice versa: if it is '&&', then we must capture operators '==' to find the error for this case on approaching the limiting parentheses:

if (A == 0 && A == 0)

In the same way, we must provide for logic for all the versions of operators with different priorities. Yes, by the way - you cannot fully rely on parentheses too because you may encounter cases like this:

if ( '(' == A && '(' == B )
b = X > 0 && X > 0;

It is very difficult to provide for all the possible ways using regular expressions. We will have too many of them with a lot of exceptions. And still it won't be safe since we will not be sure that all the possible constructs have been taken into account.

Now compare this whole stuff with the elegance with which I can find this error having a syntax tree. If I have found operators &&, ==, ||, etc., I only have to compare the left and the right branches of the tree to each other. I will do this in the following way:

if (Equal(left, right))
{
  // Error!
}

That is all. You don't have to think of operators' priorities, you don't have to fear that you will encounter a bracket in this text: b = '(' == x && x == ')';. You can simply compare the left and the right tree branches.

Diagnosis 2

V502. Perhaps the '?:' operator works in a different way than it was expected. The '?:' operator has a lower priority than the 'foo' operator.

This rule searches for confusion concerning operators' priorities (see the error description for details). We must detect a text like this:

int a;
bool b;
int c = a + b ? 0 : 1;

Let's leave the question about operator's priorities aside for now: regular expressions appear too poor when used for this purpose. But what is worse, you must know the VARIABLE'S TYPE for this and many other rules.

You must derive the type of each variable. You must force your way through the maze of typedef. You must look into classes to understand what vector<int>::size_type is. You must take scopes into consideration as well as different using namespace std;. You must even derive the type of the X variable from the expression: "auto X = 1 + 2;" in C++0x.

The question is how can we do all that using regular expressions? The answer is no way. Regular expressions are perpendicular to this task. You must either write a complicated mechanism of type derivation, i.e. create a syntactical code analyzer, or have regular expressions without knowing types of variables and expressions.

The conclusion is: if we use regular expressions to handle a C/C++ application, we do not know types of variables and expressions. Note this great limitation.

Diagnosis 3

V503. This is a nonsensical comparison: pointer < 0.

This rule is very simple. Comparison of a pointer with zero using < and > looks suspicious. For example:

CMeshBase *pMeshBase = getCutMesh(Idx);
if (pMeshBase < 0)
  return NULL;

Refer to the error description to learn how we got this code.

To implement this diagnosis, we must only know the type of the pMeshBase variable. It was explained above why it is impossible.

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 4

V504. It is highly probable that the semicolon ';' is missing after 'return' keyword.

void Foo();
void Foo2(int *ptr)
{
  if (ptr == NULL)
    return
  Foo();
  ...
}

We could well diagnose constructs of this type using regular expressions. But we would have too many false alarms. We are interested only in those cases when the function returns void. Well, we could find it out using regular expressions either. But it will not be very clear where the function starts and ends. Try yourself to invent a regular expression to find the function's start. Trust me, you will like this task, especially if you understand that one could write a stuff like this:

int Foo()
{
   ...
  char c[] = 
  "void MyFoo(int x) {"
  ;
  ...
}

If we have a complete syntax tree with diverse information, everything becomes much simpler. You may find out the type of the returned function this way (the sample is taken right out of PVS-Studio):

SimpleType funcReturnType;
EFunctionReturnType fType;
if (!env->LookupFunctionReturnType(fType, funcReturnType))
  return;
if (funcReturnType != ST_VOID)
  return;

Diagnosis 5

V505. The 'alloca' function is used inside the loop. This can quickly overflow stack.

Yes, we could try to implement this rule relying on regular expressions.

But I wouldn't try to find out where the loop starts and ends for one could think up so many funny situations with curly brackets in comments and strings.

{
  for (int i = 0; i < 10; i++)
  {
    //A cool comment. There you are { - try to solve it. :)
    char *x = "You must be careful here too {";
  }
  p = _alloca(10); // Are we inside the loop or not?
}

Diagnosis 6

V506. Pointer to local variable 'X' is stored outside the scope of this variable. Such a pointer will become invalid.

We must handle variables' scope to detect these errors. We must also know types of variables.

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 7

V507. Pointer to local array 'X' is stored outside the scope of this array. Such a pointer will become invalid.

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 8

V508. The use of 'new type(n)' pattern was detected. Probably meant: 'new type[n]'.

It is good to detect misprints of this kind:

float *p = new float(10);

Everything looks simple and it seems we could implement this diagnosis using regular expressions if we knew the type of the object being created. No way. Once you change the text a bit, regular expressions become useless:

typedef float MyReal;
...
MyReal *p = new MyReal(10);

This diagnosis cannot be implemented relying on regular expressions.

Diagnosis 9

V509. The 'throw' operator inside the destructor should be placed within the try..catch block. Raising exception inside the destructor is illegal.

Yes, we could try to make this check using regular expressions. Destructors are usually small functions and we will hardly meet any troubles with curly brackets there.

But you will have to sweat over regular expressions to find the destructor function, its beginning and end and find out if it contains throw which is caught in catch. Do you imagine the whole amount of work? Can you do a thing like that?

Well, I can. This is how I made it in a very smart way in PVS-Studio (the rule is given in full):

void ApplyRuleG_509(VivaWalker &walker, Environment *env,
  const Ptree *srcPtree)
{
  SimpleType returnType;
  EFunctionReturnType fType;
  bool res = env->LookupFunctionReturnType(fType, returnType);
  if (res == false || returnType != ST_UNKNOWN)
    return;
  if (fType != DESTRUCTOR)
    return;

  ptrdiff_t tryLevel = OmpUtil::GetLevel_TRY(env);
  if (tryLevel != -1)
    return;
  string error = VivaErrors::V509();
  walker.AddError(error, srcPtree, 509, DATE_1_SEP_2010(), Level_1);
}

Diagnosis 10

V510. The 'Foo' function is not expected to receive class-type variable as 'N' actual argument.

This rule concerns passing classes of std::string type and the like as arguments into functions of printf type. We need types. That is, this diagnosis cannot be implemented relying on regular expressions as well.

Summary

I hope I have made the situation with regular expressions, syntax trees and static code analysis clearer to you. Thank you for your attention. Once again I ask you to download and try PVS-Studio. I would also appreciate if you ask questions but I am not intending to get into debates about what regular expressions can give us and what they cannot. It is not interesting. They do allow us to get much, but they do not allow us to get even more. C++ can be successfully parsed only using the grammatics mathematical apparatus.