25.2 Getting started with C++

cppFunction() allows you to write C++ functions in R:

cppFunction('int add(int x, int y, int z) {
  int sum = x + y + z;
  return sum;
}')
# add works like a regular R function
add
#> function (x, y, z) 
#> .Call(<pointer: 0x7f480525a150>, x, y, z)
add(1, 2, 3)
#> [1] 6

When you run this code, Rcpp will compile the C++ code and construct an R function that connects to the compiled C++ function. There’s a lot going on under the hood, but Rcpp takes care of all the details so you don’t need to worry about them.

The following sections will teach you the basics by translating simple R functions to their C++ equivalents. We’ll start simple with a function that has no inputs and a scalar output, and then make it progressively more complicated:

  • Scalar input and scalar output
  • Vector input and scalar output
  • Vector input and vector output
  • Matrix input and vector output

25.2.1 No inputs, scalar output

Let’s start with a very simple function. It has no arguments and always returns the integer 1:

one <- function() 1L

The equivalent C++ function is:

int one() {
  return 1;
}

We can compile and use this from R with cppFunction():

cppFunction('int one() {
  return 1;
}')

This small function illustrates a number of important differences between R and C++:

  • The syntax to create a function looks like the syntax to call a function; you don’t use assignment to create functions as you do in R.

  • You must declare the type of output the function returns. This function returns an int (a scalar integer). The classes for the most common types of R vectors are: NumericVector, IntegerVector, CharacterVector, and LogicalVector.

  • Scalars and vectors are different. The scalar equivalents of numeric, integer, character, and logical vectors are: double, int, String, and bool.

  • You must use an explicit return statement to return a value from a function.

  • Every statement is terminated by a ;.
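To make the point about declared types concrete, here is a minimal sketch in plain C++ (no Rcpp, so it compiles on its own). The function names halveInt() and halveDouble() are hypothetical and not part of the original example. The two bodies are identical; only the declared types differ, and that alone changes the result:

```cpp
// Hypothetical helper names, plain C++ without Rcpp.

// Input and output are declared as int, so x / 2 is integer division
// and any fractional part is discarded.
int halveInt(int x) {
  return x / 2;
}

// Input and output are declared as double, so x / 2 keeps the fraction.
double halveDouble(double x) {
  return x / 2;
}
```

Here halveInt(7) returns 3 while halveDouble(7) returns 3.5. In R, 7L / 2L gives 3.5, so this is worth keeping in mind when choosing between int and double.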

25.2.2 Scalar input, scalar output

The next example function implements a scalar version of the sign() function, which returns 1 if the input is positive, 0 if it’s zero, and -1 if it’s negative:

signR <- function(x) {
  if (x > 0) {
    1
  } else if (x == 0) {
    0
  } else {
    -1
  }
}

cppFunction('int signC(int x) {
  if (x > 0) {
    return 1;
  } else if (x == 0) {
    return 0;
  } else {
    return -1;
  }
}')

In the C++ version:

  • We declare the type of each input in the same way we declare the type of the output. While this makes the code a little more verbose, it also makes clear the type of input the function needs.

  • The if syntax is identical — while there are some big differences between R and C++, there are also lots of similarities! C++ also has a while statement that works the same way as R’s. As in R you can use break to exit the loop, but to skip one iteration you need to use continue instead of next.
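As a small illustration of while, break, and continue, the following sketch sums the odd integers from 1 to n. The function name sumOddUpTo() is hypothetical, and the code is plain C++, so it does not need Rcpp to compile:

```cpp
// Hypothetical example: sums the odd integers from 1 up to and including n.
int sumOddUpTo(int n) {
  int total = 0;
  int i = 0;
  while (true) {
    ++i;
    if (i > n) break;          // leave the loop, as break does in R
    if (i % 2 == 0) continue;  // skip even i; R would use next here
    total += i;
  }
  return total;
}
```

For example, sumOddUpTo(5) adds 1, 3, and 5 and returns 9.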

25.2.3 Vector input, scalar output

One big difference between R and C++ is that the cost of loops is much lower in C++. For example, we could implement the sum function in R using a loop. If you’ve been programming in R a while, you’ll probably have a visceral reaction to this function!

sumR <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}

In C++, loops have very little overhead, so it’s fine to use them. In Section 25.5, you’ll see alternatives to for loops that more clearly express your intent; they’re not faster, but they can make your code easier to understand.

cppFunction('double sumC(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')

The C++ version is similar, but:

  • To find the length of the vector, we use the .size() method, which returns an integer. C++ methods are called with . (i.e., a full stop).

  • The for statement has a different syntax: for(init; check; increment). This loop is initialised by creating a new variable called i with value 0. Before each iteration we check that i < n, and terminate the loop if it’s not. After each iteration, we increase the value of i by one using the prefix operator ++.

  • In C++, vector indices start at 0, which means that the last element is at position n - 1. I’ll say this again because it’s so important: IN C++, VECTOR INDICES START AT 0! This is a very common source of bugs when converting R functions to C++.

  • Use = for assignment, not <-.

  • C++ provides operators that modify in-place: total += x[i] is equivalent to total = total + x[i]. Similar in-place operators are -=, *=, and /=.
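The indexing and in-place operator points can be tried out without R. The sketch below uses std::vector<double> as a stand-in for NumericVector (an assumption made so that the code compiles without Rcpp), and the function names are hypothetical:

```cpp
#include <vector>

// Hypothetical helper: returns the last element of a non-empty vector.
// Valid indices run from 0 to n - 1, so x[n] would be out of bounds.
double lastElement(const std::vector<double>& x) {
  int n = static_cast<int>(x.size());
  return x[n - 1];
}

// Hypothetical helper: multiplies all elements together using *=,
// which is shorthand for total = total * x[i].
double productC(const std::vector<double>& x) {
  int n = static_cast<int>(x.size());
  double total = 1;
  for (int i = 0; i < n; ++i) {
    total *= x[i];
  }
  return total;
}
```

Note that plain C++ does not check bounds for [], so an off-by-one index can silently read the wrong memory rather than raise an error.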

This is a good example of where C++ is much more efficient than R. As shown by the following microbenchmark, sumC() is competitive with the built-in (and highly optimised) sum(), while sumR() is more than an order of magnitude slower.

x <- runif(1e3)
bench::mark(
  sum(x),
  sumC(x),
  sumR(x)
)[1:6]
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sum(x)       1.12µs   1.19µs   741800.        0B     0   
#> 2 sumC(x)      2.57µs   4.37µs   248688.    2.49KB     0   
#> 3 sumR(x)     48.93µs  49.41µs    19931.  182.86KB     2.03

25.2.4 Vector input, vector output

Next we’ll create a function that computes the Euclidean distance between a value and a vector of values:

pdistR <- function(x, ys) {
  sqrt((x - ys) ^ 2)
}

In R, it’s not obvious from the function definition that we want x to be a scalar, and we’d need to make that clear in the documentation. That’s not a problem in the C++ version because we have to be explicit about types:

cppFunction('NumericVector pdistC(double x, NumericVector ys) {
  int n = ys.size();
  NumericVector out(n);

  for(int i = 0; i < n; ++i) {
    out[i] = sqrt(pow(ys[i] - x, 2.0));
  }
  return out;
}')

This function introduces only a few new concepts:

  • We create a new numeric vector of length n with a constructor: NumericVector out(n). Another useful way of making a vector is to copy an existing one: NumericVector zs = clone(ys).

  • C++ uses pow(), not ^, for exponentiation.
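The ^ operator deserves extra care because it still compiles in C++ when applied to integers, where it means bitwise exclusive OR. A mistranslated ^ can therefore give a wrong answer instead of a compile error. A minimal sketch in plain C++, with hypothetical function names:

```cpp
#include <cmath>

// In C++, ^ on integers is bitwise XOR, not exponentiation.
int caret(int a, int b) {
  return a ^ b;  // for 2 and 3: binary 10 XOR binary 11 is binary 01
}

// std::pow() from <cmath> performs exponentiation.
double power(double base, double exponent) {
  return std::pow(base, exponent);
}
```

Here caret(2, 3) returns 1, whereas power(2.0, 3.0) returns 8.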

Note that because the R version is fully vectorised, it’s already going to be fast.

y <- runif(1e6)
bench::mark(
  pdistR(0.5, y),
  pdistC(0.5, y)
)[1:6]
#> # A tibble: 2 x 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pdistR(0.5, y)   6.49ms   7.12ms      139.    7.63MB     59.7
#> 2 pdistC(0.5, y)    5.9ms   6.22ms      161.    7.63MB     80.4

In the benchmark above, the R version takes around 7 ms with a 1 million element y vector. The C++ function is only about 15% faster, at ~6 ms, so assuming it took you 10 minutes to write the C++ function, you’d need to run it more than half a million times to make rewriting worthwhile. The reason why the C++ function is faster is subtle, and relates to memory management. The R version needs to create an intermediate vector the same length as y (x - ys), and allocating memory is an expensive operation. The C++ function avoids this overhead because it uses an intermediate scalar.
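The difference between an intermediate vector and an intermediate scalar can be sketched in plain C++. This is only an illustration of the idea, not a description of how R evaluates the expression internally; std::vector<double> stands in for NumericVector, and both function names are hypothetical:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mimics the vectorised approach: the differences are first stored
// in a full-length temporary vector, then transformed.
std::vector<double> pdistWithTemporary(double x, const std::vector<double>& ys) {
  std::size_t n = ys.size();
  std::vector<double> diff(n);  // intermediate vector, same length as ys
  for (std::size_t i = 0; i < n; ++i) {
    diff[i] = x - ys[i];
  }
  std::vector<double> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = std::sqrt(std::pow(diff[i], 2.0));
  }
  return out;
}

// Mirrors the loop in pdistC(): each difference lives in a scalar,
// so only the output vector is allocated.
std::vector<double> pdistWithScalar(double x, const std::vector<double>& ys) {
  std::size_t n = ys.size();
  std::vector<double> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    double diff = x - ys[i];  // intermediate scalar, no extra allocation
    out[i] = std::sqrt(std::pow(diff, 2.0));
  }
  return out;
}
```

Both functions return the same values; the second simply does one allocation and one pass over the data instead of two.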

25.2.5 Using sourceCpp

So far, we’ve used inline C++ with cppFunction(). This makes presentation simpler, but for real problems, it’s usually easier to use stand-alone C++ files and then source them into R using sourceCpp(). This lets you take advantage of text editor support for C++ files (e.g., syntax highlighting) as well as making it easier to identify the line numbers in compilation errors.

Your stand-alone C++ file should have extension .cpp, and needs to start with:

#include <Rcpp.h>
using namespace Rcpp;

And for each function that you want available within R, you need to prefix it with:

// [[Rcpp::export]]

You can embed R code in special C++ comment blocks. This is really convenient if you want to run some test code:

/*** R
# This is R code
*/

The R code is run with source(echo = TRUE) so you don’t need to explicitly print output.

To compile the C++ code, use sourceCpp("path/to/file.cpp"). This will create the matching R functions and add them to your current session. Note that these functions cannot be saved in a .Rdata file and reloaded in a later session; they must be recreated each time you restart R.

For example, running sourceCpp() on the following file implements mean in C++ and then compares it to the built-in mean():

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double meanC(NumericVector x) {
  int n = x.size();
  double total = 0;

  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total / n;
}

/*** R
x <- runif(1e5)
bench::mark(
  mean(x),
  meanC(x)
)
*/

NB: if you run this code, you’ll notice that meanC() is much faster than the built-in mean(). This is because meanC() trades numerical accuracy for speed.
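To see what kind of accuracy a simple running total can give up, the sketch below compares naive accumulation with Kahan compensated summation. Kahan summation is used here only as a well-known example of spending extra arithmetic to gain accuracy; it is not claimed to be what mean() does internally. The code is plain C++ with hypothetical function names, and it assumes IEEE 754 doubles compiled without aggressive floating-point optimisations such as -ffast-math:

```cpp
#include <cstddef>
#include <vector>

// Naive accumulation, the same pattern as the loop in meanC().
double naiveSum(const std::vector<double>& x) {
  double total = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    total += x[i];
  }
  return total;
}

// Kahan compensated summation: tracks the rounding error of each
// addition and feeds it back into the next one.
double kahanSum(const std::vector<double>& x) {
  double total = 0;
  double compensation = 0;  // running estimate of the lost low-order part
  for (std::size_t i = 0; i < x.size(); ++i) {
    double adjusted = x[i] - compensation;
    double updated = total + adjusted;
    compensation = (updated - total) - adjusted;
    total = updated;
  }
  return total;
}
```

For the input {1e16, 1, 1, 1, 1}, naiveSum() returns 1e16 because each added 1 is rounded away, while kahanSum() returns 1e16 + 4. The price is several extra floating-point operations per element.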

For the remainder of this chapter, C++ code will be presented stand-alone rather than wrapped in a call to cppFunction(). If you want to try compiling and/or modifying the examples, you should paste them into a C++ source file that includes the elements described above. This is easy to do in R Markdown: all you need to do is specify engine = "Rcpp".

25.2.6 Exercises

  1. With the basics of C++ in hand, it’s now a great time to practice by reading and writing some simple C++ functions. For each of the following functions, read the code and figure out what the corresponding base R function is. You might not understand every part of the code yet, but you should be able to figure out the basics of what the function does.

    double f1(NumericVector x) {
      int n = x.size();
      double y = 0;
    
      for(int i = 0; i < n; ++i) {
        y += x[i] / n;
      }
      return y;
    }
    
    NumericVector f2(NumericVector x) {
      int n = x.size();
      NumericVector out(n);
    
      out[0] = x[0];
      for(int i = 1; i < n; ++i) {
        out[i] = out[i - 1] + x[i];
      }
      return out;
    }
    
    bool f3(LogicalVector x) {
      int n = x.size();
    
      for(int i = 0; i < n; ++i) {
        if (x[i]) return true;
      }
      return false;
    }
    
    int f4(Function pred, List x) {
      int n = x.size();
    
      for(int i = 0; i < n; ++i) {
        LogicalVector res = pred(x[i]);
        if (res[0]) return i + 1;
      }
      return 0;
    }
    
    NumericVector f5(NumericVector x, NumericVector y) {
      int n = std::max(x.size(), y.size());
      NumericVector x1 = rep_len(x, n);
      NumericVector y1 = rep_len(y, n);
    
      NumericVector out(n);
    
      for (int i = 0; i < n; ++i) {
        out[i] = std::min(x1[i], y1[i]);
      }
    
      return out;
    }

  2. To practice your function writing skills, convert the following functions into C++. For now, assume the inputs have no missing values.

    1. all().

    2. cumprod(), cummin(), cummax().

    3. diff(). Start by assuming lag 1, and then generalise for lag n.

    4. range().

    5. var(). Read about the approaches you can take on Wikipedia. Whenever implementing a numerical algorithm, it’s always good to check what is already known about the problem.