25.5 Standard Template Library

The real strength of C++ is revealed when you need to implement more complex algorithms. The standard template library (STL) provides a set of extremely useful data structures and algorithms. This section will explain some of the most important algorithms and data structures and point you in the right direction to learn more. I can’t teach you everything you need to know about the STL, but hopefully the examples will show you the power of the STL, and persuade you that it’s useful to learn more.

If you need an algorithm or data structure that isn’t implemented in STL, a good place to look is boost. Installing boost on your computer is beyond the scope of this chapter, but once you have it installed, you can use boost data structures and algorithms by including the appropriate header file with (e.g.) #include <boost/array.hpp>.

25.5.1 Using iterators

Iterators are used extensively in the STL: many functions either accept or return iterators. They are the next step up from basic loops, abstracting away the details of the underlying data structure. Iterators have three main operators:

  1. Advance with ++.
  2. Get the value they refer to, or dereference, with *.
  3. Compare with ==.

For example we could re-write our sum function using iterators:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sum3(NumericVector x) {
  double total = 0;
  
  NumericVector::iterator it;
  for(it = x.begin(); it != x.end(); ++it) {
    total += *it;
  }
  return total;
}

The main changes are in the for loop:

  • We start at x.begin() and loop until we get to x.end(). A small optimization is to store the value of the end iterator so we don’t need to look it up each time. This only saves about 2 ns per iteration, so it’s only important when the calculations in the loop are very simple.

  • Instead of indexing into x, we use the dereference operator to get its current value: *it.

  • Notice the type of the iterator: NumericVector::iterator. Each vector type has its own iterator type: LogicalVector::iterator, CharacterVector::iterator, etc.

This code can be simplified still further through the use of a C++11 feature: range-based for loops. C++11 is widely available, and can easily be activated for use with Rcpp by adding [[Rcpp::plugins(cpp11)]].

// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sum4(NumericVector xs) {
  double total = 0;
  
  for(const auto &x : xs) {
    total += x;
  }
  return total;
}

Iterators also allow us to use the C++ equivalents of the apply family of functions. For example, we could again rewrite sum() to use the accumulate() function, which takes a starting and an ending iterator, and adds up all the values in the vector. The third argument to accumulate gives the initial value: it’s particularly important because this also determines the data type that accumulate uses (so we use 0.0 and not 0 so that accumulate uses a double, not an int.). To use accumulate() we need to include the <numeric> header.

#include <numeric>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sum5(NumericVector x) {
  return std::accumulate(x.begin(), x.end(), 0.0);
}

25.5.2 Algorithms

The <algorithm> header provides a large number of algorithms that work with iterators. A good reference is available at https://en.cppreference.com/w/cpp/algorithm. For example, we could write a basic Rcpp version of findInterval() that takes two arguments a vector of values and a vector of breaks, and locates the bin that each x falls into. This shows off a few more advanced iterator features. Read the code below and see if you can figure out how it works.

#include <algorithm>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector findInterval2(NumericVector x, NumericVector breaks) {
  IntegerVector out(x.size());

  NumericVector::iterator it, pos;
  IntegerVector::iterator out_it;

  for(it = x.begin(), out_it = out.begin(); it != x.end(); 
      ++it, ++out_it) {
    pos = std::upper_bound(breaks.begin(), breaks.end(), *it);
    *out_it = std::distance(breaks.begin(), pos);
  }

  return out;
}

The key points are:

  • We step through two iterators (input and output) simultaneously.

  • We can assign into an dereferenced iterator (out_it) to change the values in out.

  • upper_bound() returns an iterator. If we wanted the value of the upper_bound() we could dereference it; to figure out its location, we use the distance() function.

  • Small note: if we want this function to be as fast as findInterval() in R (which uses handwritten C code), we need to compute the calls to .begin() and .end() once and save the results. This is easy, but it distracts from this example so it has been omitted. Making this change yields a function that’s slightly faster than R’s findInterval() function, but is about 1/10 of the code.

It’s generally better to use algorithms from the STL than hand rolled loops. In Effective STL, Scott Meyers gives three reasons: efficiency, correctness, and maintainability. Algorithms from the STL are written by C++ experts to be extremely efficient, and they have been around for a long time so they are well tested. Using standard algorithms also makes the intent of your code more clear, helping to make it more readable and more maintainable.

25.5.3 Data structures

The STL provides a large set of data structures: array, bitset, list, forward_list, map, multimap, multiset, priority_queue, queue, deque, set, stack, unordered_map, unordered_set, unordered_multimap, unordered_multiset, and vector. The most important of these data structures are the vector, the unordered_set, and the unordered_map. We’ll focus on these three in this section, but using the others is similar: they just have different performance trade-offs. For example, the deque (pronounced “deck”) has a very similar interface to vectors but a different underlying implementation that has different performance trade-offs. You may want to try it for your problem. A good reference for STL data structures is https://en.cppreference.com/w/cpp/container — I recommend you keep it open while working with the STL.

Rcpp knows how to convert from many STL data structures to their R equivalents, so you can return them from your functions without explicitly converting to R data structures.

25.5.4 Vectors

An STL vector is very similar to an R vector, except that it grows efficiently. This makes vectors appropriate to use when you don’t know in advance how big the output will be. Vectors are templated, which means that you need to specify the type of object the vector will contain when you create it: vector<int>, vector<bool>, vector<double>, vector<String>. You can access individual elements of a vector using the standard [] notation, and you can add a new element to the end of the vector using .push_back(). If you have some idea in advance how big the vector will be, you can use .reserve() to allocate sufficient storage.

The following code implements run length encoding (rle()). It produces two vectors of output: a vector of values, and a vector lengths giving how many times each element is repeated. It works by looping through the input vector x comparing each value to the previous: if it’s the same, then it increments the last value in lengths; if it’s different, it adds the value to the end of values, and sets the corresponding length to 1.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List rleC(NumericVector x) {
  std::vector<int> lengths;
  std::vector<double> values;

  // Initialise first value
  int i = 0;
  double prev = x[0];
  values.push_back(prev);
  lengths.push_back(1);

  NumericVector::iterator it;
  for(it = x.begin() + 1; it != x.end(); ++it) {
    if (prev == *it) {
      lengths[i]++;
    } else {
      values.push_back(*it);
      lengths.push_back(1);

      i++;
      prev = *it;
    }
  }

  return List::create(
    _["lengths"] = lengths, 
    _["values"] = values
  );
}

(An alternative implementation would be to replace i with the iterator lengths.rbegin() which always points to the last element of the vector. You might want to try implementing that.)

Other methods of a vector are described at https://en.cppreference.com/w/cpp/container/vector.

25.5.5 Sets

Sets maintain a unique set of values, and can efficiently tell if you’ve seen a value before. They are useful for problems that involve duplicates or unique values (like unique, duplicated, or in). C++ provides both ordered (std::set) and unordered sets (std::unordered_set), depending on whether or not order matters for you. Unordered sets tend to be much faster (because they use a hash table internally rather than a tree), so even if you need an ordered set, you should consider using an unordered set and then sorting the output. Like vectors, sets are templated, so you need to request the appropriate type of set for your purpose: unordered_set<int>, unordered_set<bool>, etc. More details are available at https://en.cppreference.com/w/cpp/container/set and https://en.cppreference.com/w/cpp/container/unordered_set.

The following function uses an unordered set to implement an equivalent to duplicated() for integer vectors. Note the use of seen.insert(x[i]).second. insert() returns a pair, the .first value is an iterator that points to element and the .second value is a Boolean that’s true if the value was a new addition to the set.

// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalVector duplicatedC(IntegerVector x) {
  std::unordered_set<int> seen;
  int n = x.size();
  LogicalVector out(n);

  for (int i = 0; i < n; ++i) {
    out[i] = !seen.insert(x[i]).second;
  }

  return out;
}

25.5.6 Map

A map is similar to a set, but instead of storing presence or absence, it can store additional data. It’s useful for functions like table() or match() that need to look up a value. As with sets, there are ordered (std::map) and unordered (std::unordered_map) versions. Since maps have a value and a key, you need to specify both types when initialising a map: map<double, int>, unordered_map<int, double>, and so on. The following example shows how you could use a map to implement table() for numeric vectors:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::map<double, int> tableC(NumericVector x) {
  std::map<double, int> counts;

  int n = x.size();
  for (int i = 0; i < n; i++) {
    counts[x[i]]++;
  }

  return counts;
}

25.5.7 Exercises

To practice using the STL algorithms and data structures, implement the following using R functions in C++, using the hints provided:

  1. median.default() using partial_sort.

  2. %in% using unordered_set and the find() or count() methods.

  3. unique() using an unordered_set (challenge: do it in one line!).

  4. min() using std::min(), or max() using std::max().

  5. which.min() using min_element, or which.max() using max_element.

  6. setdiff(), union(), and intersect() for integers using sorted ranges and set_union, set_intersection and set_difference.