George Gribkov

Oct 19 2020

Tags:

#Cpp

Checking the code of DeepSpeech, or why you shouldn't write in namespace std

Oct 19 2020

Author: George Gribkov

Introduction
10 interesting warnings by PVS-Studio
Conclusion

DeepSpeech is an open-source speech recognition engine developed by Mozilla. It's pretty fast and ranks high, which makes its source code an interesting target for static analysis. In this article, I'll show you some of the bugs found in DeepSpeech's C++ code.

Introduction

We have already scanned a few projects based on machine learning before, so there was nothing new about DeepSpeech to us in this respect. No wonder since the project is quite popular: as of this writing, it has 15k stars on GitHub.

As usual, the bugs discussed in this article have been found using the static code analyzer PVS-Studio.

DeepSpeech employs the TensorFlow library. I excluded the library's code from analysis because we already made a separate article about it, but I left analysis enabled for all the other libraries used by DeepSpeech. Why? Because any bugs that sit in any of the libraries included into your project become bugs in your project. That's why it makes sense to scan not only your own code but also any third-party code that you include. We gave a detailed argument for this approach in our recent article.

That's all for the introduction – let's move on to the bugs. By the way, if you are here to find out the answer to the question mentioned in the title (why you shouldn't write in namespace std), you can skip straight over to the end of the article. There you'll find an especially tasty example!

10 interesting warnings by PVS-Studio

Warning 1

V773 The function was exited without releasing the 'data' pointer. A memory leak is possible. edit-fst.h 311

// EditFstData method implementations: just the Read method.
template <typename A, typename WrappedFstT, typename MutableFstT>
EditFstData<A, WrappedFstT, MutableFstT> *
EditFstData<A, WrappedFstT, MutableFstT>::Read(std::istream &strm,
                                               const FstReadOptions &opts)
{
  auto *data = new EditFstData<A, WrappedFstT, MutableFstT>();
  // next read in MutabelFstT machine that stores edits
  FstReadOptions edits_opts(opts);

  ....
  
  std::unique_ptr<MutableFstT> edits(MutableFstT::Read(strm, edits_opts));
  if (!edits) return nullptr; // <=

  ....
}

This snippet is a classic example of a memory leak: the Read function calls 'return nullptr' without having first freed the memory allocated earlier using the 'new EditFstData' expression. When the function returns in a way like that (without calling delete data), only the pointer itself will be deleted, without calling the destructor of the object that it points at. Thus, the object will stay in memory and you won't be able to delete or use it.

Besides the bug, this snippet also uses another poor practice: one function handles both smart and regular pointers at the same time. If, for example, data were a smart pointer too, no such error would have occurred: when necessary, smart pointers will automatically call the destructor of the allocated object upon leaving the scope.

Warning 2

V1062 The 'DfsState' class defines a custom 'new' operator. The 'delete' operator must also be defined. dfs-visit.h 62

// An FST state's DFS stack state.
template <class FST>
struct DfsState {
public:
  ....
  void *operator new(size_t size, 
                     MemoryPool<DfsState<FST>> *pool) {
    return pool->Allocate();
  }
  ....
}

PVS-Studio never stops evolving and acquiring new diagnostics. The code above is a perfect example for showcasing one such new diagnostic, V1062.

The rule behind it is simple: if you define your own 'new' operator, you must also define your own 'delete' operator. Likewise, if you define your own 'delete' operator, you must also define your own 'new'.

This rule was broken in this example: an object is created using a user-defined 'new' operator but deleted using the standard 'delete'. Let's see what the Allocate function of the class MemoryPool does when it's called by the user-defined 'new':

void *Allocate() {
  if (free_list_ == nullptr) {
    auto *link = static_cast<Link *>(mem_arena_.Allocate(1));
    link->next = nullptr;
    return link;
  } else {
    auto *link = free_list_;
    free_list_ = link->next;
    return link;
  }
}

This function creates an element and adds it to a linked list. Implementing such allocation in your own 'new' makes sense.

But wait! Just a few lines later, you see the following function:

void Free(void *ptr) {
  if (ptr) {
    auto *link = static_cast<Link *>(ptr);
    link->next = free_list_;
    free_list_ = link;
  }
}

So, they already have ready-made functions both for allocation and deallocation. The programmer probably intended to write their own 'delete' operator using this Free() function for deallocation.

The analyzer found at least three more errors of this type:

V1062 The 'VectorState' class defines a custom 'new' operator. The 'delete' operator must also be defined. vector-fst.h 31
V1062 The 'CacheState' class defines a custom 'new' operator. The 'delete' operator must also be defined. cache.h 65

Warning 3

V703 It is odd that the 'first_path' field in derived class 'ShortestPathOptions' overwrites field in base class 'ShortestDistanceOptions'. Check lines: shortest-path.h:35, shortest-distance.h:34. shortest-path.h 35

// Base class
template <class Arc, class Queue, class ArcFilter>
struct ShortestDistanceOptions {
  Queue *state_queue;    // Queue discipline used; owned by caller.
  ArcFilter arc_filter;  // Arc filter (e.g., limit to only epsilon graph).
  StateId source;        // If kNoStateId, use the FST's initial state.
  float delta;           // Determines the degree of convergence required
  bool first_path;       // For a semiring with the path property (o.w.
                         // undefined), compute the shortest-distances along
                         // along the first path to a final state found
                         // by the algorithm. That path is the shortest-path
                         // only if the FST has a unique final state (or all
                         // the final states have the same final weight), the
                         // queue discipline is shortest-first and all the
                         // weights in the FST are between One() and Zero()
                         // according to NaturalLess.

  ShortestDistanceOptions(Queue *state_queue, ArcFilter arc_filter,
                          StateId source = kNoStateId,
                          float delta = kShortestDelta)
      : state_queue(state_queue),
        arc_filter(arc_filter),
        source(source),
        delta(delta),
        first_path(false) {}
};
// Derived class
template <class Arc, class Queue, class ArcFilter>
struct ShortestPathOptions
    : public ShortestDistanceOptions<Arc, Queue, ArcFilter> {
  using StateId = typename Arc::StateId;
  using Weight = typename Arc::Weight;

  int32 nshortest;    // Returns n-shortest paths.
  bool unique;        // Only returns paths with distinct input strings.
  bool has_distance;  // Distance vector already contains the
                      // shortest distance from the initial state.
  bool first_path;    // Single shortest path stops after finding the first
                      // path to a final state; that path is the shortest path
                      // only when:
                      // (1) using the ShortestFirstQueue with all the weights
                      // in the FST being between One() and Zero() according to
                      // NaturalLess or when
                      // (2) using the NaturalAStarQueue with an admissible
                      // and consistent estimate.
  Weight weight_threshold;  // Pruning weight threshold.
  StateId state_threshold;  // Pruning state threshold.

  ShortestPathOptions(Queue *queue, ArcFilter filter, int32 nshortest = 1,
                      bool unique = false, bool has_distance = false,
                      float delta = kShortestDelta, bool first_path = false,
                      Weight weight_threshold = Weight::Zero(),
                      StateId state_threshold = kNoStateId)
      : ShortestDistanceOptions<Arc, Queue, ArcFilter>(queue, filter,
                                                       kNoStateId, delta),
        nshortest(nshortest),
        unique(unique),
        has_distance(has_distance),
        first_path(first_path),
        weight_threshold(std::move(weight_threshold)),
        state_threshold(state_threshold) {}
};

It would be a tough job to try to find a bug here on your own, wouldn't it?

The problem here is that both the base and derived classes contain fields of the same name: first_path. Because of that, the derived class will have its own unique field overlapping the base class's field. Errors like that may be a source of great confusion.

To better understand what I'm talking about, take a look at a small synthetic example from our documentation. Suppose we have the following code:

class U {
public:
  int x;
};

class V : public U {
public:
  int x;  // <= V703 here
  int z;
};

Here, the name x is overlapped inside the derived class. The question is, what will the following code output?

int main() {
  V vClass;
  vClass.x = 1;
  U *uClassPtr = &vClass;
  std::cout << uClassPtr->x << std::endl;
  ....
}

If you believe it will output an undefined value, you are right. In this example, the value 1 will be written to the field of the derived class but the reading will be done from the base class's field, which by the moment of outputting the value is still undefined.

Name overlapping in class hierarchy is a potential error, which you don't want to have in your code :)

Warning 4

V1004 The 'aiter' pointer was used unsafely after it was verified against nullptr. Check lines: 107, 119. visit.h 119

template <....>
void Visit(....)
{
  ....
  // Deletes arc iterator if done.
  auto *aiter = arc_iterator[state];
  if ((aiter && aiter->Done()) || !visit) {
    Destroy(aiter, &aiter_pool);
    arc_iterator[state] = nullptr;
    state_status[state] |= kArcIterDone;
  }
  // Dequeues state and marks black if done.
  if (state_status[state] & kArcIterDone) {
    queue->Dequeue();
    visitor->FinishState(state);
    state_status[state] = kBlackState;
    continue;
  }
  const auto &arc = aiter->Value();       // <=
  ....
}

The aiter pointer is used after it has been checked for nullptr. The analyzer assumes that the presence of such a check indicates that the pointer may have the nullptr value during the check.

So, let's track the aiter pointer assuming that it's equal to null. It will first be checked in the 'if ((aiter && aiter->Done()) || !visit)' expression. This condition will evaluate to false, so we'll skip the then branch of that if statement. And then, in the way of classic errors, the null pointer will get dereferenced: 'aiter->Value();'. The result is undefined behavior.

Warning 5

This snippet has triggered two warnings at once:

V595 The 'istrm' pointer was utilized before it was verified against nullptr. Check lines: 60, 61. mapped-file.cc 60
V595 The 'istrm' pointer was utilized before it was verified against nullptr. Check lines: 39, 61. mapped-file.cc 39

MappedFile *MappedFile::Map(std::istream *istrm, bool memorymap,
                            const string &source, size_t size) {
  const auto spos = istrm->tellg();        // <=
  ....
  istrm->seekg(pos + size, std::ios::beg); // <=
  if (istrm) {                             // <=
    VLOG(1) << "mmap'ed region of " << size
            << " at offset " << pos
            << " from " << source
            << " to addr " << map;
  return mmf.release();
  }
  ....
}

This bug is clearer than the previous one. The istrm pointer is first dereferenced (twice), and only then do the check and error logging take place. This obviously means that if a null pointer is passed to this function as istrm, undefined behavior (or a crash, which is more likely) will occur without any logging. Too bad... don't let bugs like that into your code.

Warning 6

V730 Not all members of a class are initialized inside the constructor. Consider inspecting: stones_written_. ersatz_progress.cc 14

ErsatzProgress::ErsatzProgress()
  : current_(0)
  , next_(std::numeric_limits<uint64_t>::max())
  , complete_(next_)
  , out_(NULL)
{}

The warning says the constructor does not initialize all the fields of the ErzatzProgress structure. Let's compare the constructor with the list of the structure's fields:

class ErsatzProgress {
  ....
private:
    void Milestone();

    uint64_t current_, next_, complete_;
    unsigned char stones_written_;
    std::ostream *out_;
};

Indeed, as you can see, the constructor initializes all the fields except stones_written_.

Note: this snippet is not necessarily defective in itself. The real error will occur only when the program attempts to use the value of the uninitialized field.

That said, the V730 diagnostic still helps debug cases of such unsafe use in a good time. After all, it's just natural to wonder why the programmer should leave one of the class's fields uninitialized while explicitly initializing all the rest.

My suspicion that the stones_written_ field was left out by mistake proved right when I came across another constructor a few lines later:

ErsatzProgress::ErsatzProgress(uint64_t complete,
                               std::ostream *to,
                               const std::string &message)
  : current_(0)
  , next_(complete / kWidth)
  , complete_(complete)
  , stones_written_(0)
  , out_(to)
{
  ....
}

This constructor initializes all the fields, which proves the previous one was meant to do the same but the programmer overlooked one of the fields.

Warning 7

V780 The object '& params' of a non-passive (non-PDS) type cannot be initialized using the memset function. binary_format.cc 261

/* Not the best numbering system,
   but it grew this way for historical reasons
 * and I want to preserve existing binary files. */
typedef enum
{
  PROBING=0,
  REST_PROBING=1,
  TRIE=2,
  QUANT_TRIE=3,
  ARRAY_TRIE=4,
  QUANT_ARRAY_TRIE=5
}
ModelType;

....

struct FixedWidthParameters {
  unsigned char order;
  float probing_multiplier;
  // What type of model is this?
  ModelType model_type;
  // Does the end of the file 
  // have the actual strings in the vocabulary?
  bool has_vocabulary;
  unsigned int search_version;
};

....

// Parameters stored in the header of a binary file.
struct Parameters {
  FixedWidthParameters fixed;
  std::vector<uint64_t> counts;
};

....

void BinaryFormat::FinishFile(....)
{
  ....
  // header and vocab share the same mmap.
  Parameters params = Parameters();
  memset(&params, 0, sizeof(Parameters)); // <=
  ....
}

To understand this warning, let's first figure out what a PDS type is. "PDS" stands for "Passive Data Structure". Instead of "PDS", you may sometimes see "POD" – "Plain Old Data". Put simply, a PDS type is a data type that is characterized by strictly defined layout of fields and does not require access limitation and automatic management. Put even simpler, it's a data type consisting only of built-in types.

The special feature of POD types is that you can change and process variables of these types using the primitive memory management functions (memset, memcpy, and so on). But you can't say the same about "non-PDS" types: in their case, such low-level handling of values may lead to critical errors, such as memory leak, double deallocation of a resource, or undefined behavior.

As for the snippet above, the warning says you can't work with a structure of type Parameters in the way it's done there. If you look into the implementation of this structure, you'll see that its second member is of type std::vector. This type heavily relies on automatic memory management and, in addition to its contents, stores additional, service variables. Setting such a field to zero using memset may break the class's logic and is considered a serious error.

Warning 8

V575 The potential null pointer is passed into 'memcpy' function. Inspect the first argument. Check lines: 73, 68. modelstate.cc 73

Metadata*
ModelState::decode_metadata(const DecoderState& state, 
                            size_t num_results)
{
  ....
  Metadata* ret = (Metadata*)malloc(sizeof(Metadata));
  ....
  memcpy(ret, &metadata, sizeof(Metadata));
  return ret;
}

This warning says that a null pointer is passed to the memcpy function. Indeed, if the malloc function fails to allocate storage, it will return NULL. This pointer will then be passed to the memset function, where it will be dereferenced – followed by an epic crash.

This may arouse indignation in you: if memory has run out or become fragmented to the point that malloc is not able to allocate storage, why should it matter what happens next? The program will crash anyway because it won't be able to run normally under memory shortage conditions.

We've heard this opinion more than once, and we believe it's wrong. I'd elaborate on this point, but this subject calls for a separate article – so much that we already posted one a few years ago :) If you want to know why you must always check pointers returned by functions like malloc, take a look at this post: Why it is important to check what the malloc function returned.

Warning 9

This warning was issued for the same reasons as the previous one, only this one points at a somewhat different kind of error.

V769 The 'middle_begin_' pointer in the 'middle_begin_ + (counts.size() - 2)' expression could be nullptr. In such case, resulting value will be senseless and it should not be used. Check lines: 553, 552. search_trie.cc 553

template <class Quant, class Bhiksha> class TrieSearch {
....
private:
  ....
  Middle *middle_begin_, *middle_end_;
  ....
};

template <class Quant, class Bhiksha>
uint8_t *TrieSearch<Quant, Bhiksha>::SetupMemory(....)
{
  ....
  middle_begin_
    = static_cast<Middle*>(malloc(sizeof(Middle) * (counts.size() - 2)));
  middle_end_ = middle_begin_ + (counts.size() - 2);
  ....
}

Like in the previous example, memory is allocated here using the malloc function. The pointer it returns is then used in an arithmetic expression without any prior check for nullptr. This expression will evaluate to some rubbish, meaningless value, which will be stored in the middle_end_ field.

Warning 10

Finally, we've reached what in my opinion is the most interesting case. This bug was found in the kenlm library included into DeepSpeech:

V1061 Extending the 'std' namespace may result in undefined behavior. sized_iterator.hh 210

// Dirty hack because g++ 4.6 at least wants
// to do a bunch of copy operations.
namespace std {
inline void iter_swap(util::SizedIterator first,
                      util::SizedIterator second)
{
  util::swap(*first, *second);
}
} // namespace std

The hack, which is called "dirty" in the comment, is indeed a dirty one. You see, extending namespace std in a way like that may lead to undefined behavior.

Why? Because the contents of namespace std are determined solely by the Committee. That's why the international C++ standard explicitly forbids extending std in a way like it's done here.

C++03 is the latest standard supported by g++ 4.6. Here's a quote from the C++03 final working draft (see 17.6.4.2.1): "The behavior of a C++ program is undefined if it adds declarations or definitions to namespace std or to a namespace within namespace std unless otherwise specified." This statement applies to all subsequent standards (C++11, C++14, C++17, and C++20).

Now, how can we fix the code above? The first question that naturally arises is, what are those "unless otherwise specified" cases? There are several situations when extending namespace std does not lead to undefined behavior. They are all listed on the V1061 diagnostic documentation page, but we are now interested in one particular case: adding function template specializations.

Since namespace std already has a function called iter_swap (a template one, mind you), it's just logical to assume that the programmer wanted to extend its functionality so that it could work with the util::SizedIterator type. But, unfortunately, instead of adding a template function specialization, they simply wrote an ordinary overload. What they should have written is the following:

namespace std {
template <>
inline void iter_swap(util::SizedIterator first,
                      util::SizedIterator second)
{
  util::swap(*first, *second);
}
} // namespace std

Yet this code is not perfect either. The problem is that it will be correct only until C++20. Yes, starting with this version, the Standard defines template function specializations as causing undefined behavior too (see the C++20 final working draft, 16.5.4.2.1). And since the snippet under analysis comes from a library, it will sooner or later be compiled with the -std=C++20 flag. By the way, PVS-Studio distinguishes between the Standard's versions and decides whether it should issue a warning depending on what version is used in the code. Just take a look for yourself: example for C++17, example for C++20.

Actually, there's a much easier fix. You simply need to move the user definition of iter_swap to the same namespace in which the SizedIterator class is defined. You also need to add "using std::iter_swap;" before the calls to iter_swap. This is what you get (the definitions of the SizedIterator class and util::swap() function have been changed for simplicity):

namespace util
{
  class SizedIterator
  {
  public:
    SizedIterator(int i) : m_data(i) {}

    int& operator*()
    {
      return m_data;
    }
  private:
    int m_data;
  };

  ....

  inline void iter_swap(SizedIterator first,
                        SizedIterator second)
  {
    std::cout << "we are inside util::iter_swap" << std::endl;
    swap(*first, *second);
  }
}


int main()
{
  double d1 = 1.1, d2 = 2.2;
  double *pd1 = &d1, *pd2 = &d2;
  util::SizedIterator si1(42), si2(43);

  using std::iter_swap;

  iter_swap(pd1, pd2);
  iter_swap(si1, si2); // "we are inside util::iter_swap"

  return 0;
}

The compiler will now automatically choose the appropriate overload of the iter_swap function based on argument-dependent lookup (ADL). For the SizedIterator class, it will call the version from namespace util, and for all other types, it will call the version from namespace std. Here's the proof. More than that, you don't need to add any using statements inside the library functions: since their code is already inside std, the compiler will still be able to choose the appropriate overload.

And then – presto! – you get a normally working user-defined iter_swap function without any "dirty hacks" or other witchcraft :)

Conclusion

That's all for DeepSpeech. I hope you've liked the bugs discussed here and have learned something new. If you've read this far, I sincerely wish you clean and neat code. May bugs stay away from your projects!

If you write in C, C++, C#, or Java and if you are, as I am, interested in static analysis, don't hesitate to try PVS-Studio on your own projects. You can download it here.