24.2 Code organisation

There are two traps that are easy to fall into when trying to make your code faster:

  1. Writing faster but incorrect code.
  2. Writing code that you think is faster, but is actually no better.

The strategy outlined below will help you avoid these pitfalls.

When tackling a bottleneck, you’re likely to come up with multiple approaches. Write a function for each approach, encapsulating all relevant behaviour. This makes it easier to check that each approach returns the correct result and to time how long it takes to run. To demonstrate the strategy, I’ll compare two approaches for computing the mean:

mean1 <- function(x) mean(x)
mean2 <- function(x) sum(x) / length(x)

I recommend that you keep a record of everything you try, even the failures. If a similar problem occurs in the future, it’ll be useful to see everything you’ve tried. R Markdown works well for this because it makes it easy to intermingle code with detailed comments and notes.

Next, generate a representative test case. The case should be big enough to capture the essence of your problem but small enough that it takes only a few seconds at most to run. You don’t want it to take too long because you’ll need to run the test case many times to compare approaches. On the other hand, you don’t want the case to be too small because then the results might not scale up to the real problem. Here I’m going to use 100,000 numbers:

x <- runif(1e5)
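One rough way to check that a test case is in the right range is to time a single run before benchmarking. The sketch below uses base R's system.time(). The set.seed() call and the one-second threshold are my own assumptions, added for reproducibility and illustration:

```r
# Fix the seed so the random test case is reproducible
# (an assumption added here; the seed value is arbitrary).
set.seed(1014)
x <- runif(1e5)

mean1 <- function(x) mean(x)

# Time one run; "elapsed" is wall-clock time in seconds.
elapsed <- system.time(mean1(x))[["elapsed"]]

# Arbitrary illustrative threshold: a single run should finish
# well within a second so that many repetitions stay cheap.
elapsed < 1
```

For a function this cheap, a single run is too fast for system.time() to measure reliably. The check is more useful for a real bottleneck, where one run might take seconds.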

Now use bench::mark() to precisely compare the variations. bench::mark() automatically checks that all calls return the same values. This doesn’t guarantee that the function behaves the same for all inputs, so in an ideal world you’ll also have unit tests to make sure you don’t accidentally change the behaviour of the function.

bench::mark(
  mean1(x),
  mean2(x)
)[c("expression", "min", "median", "itr/sec")]
#> # A tibble: 2 x 4
#>   expression      min   median `itr/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl>
#> 1 mean1(x)      182µs  183.6µs     5383.
#> 2 mean2(x)       92µs   96.5µs     9995.

(You might be surprised by the results: mean(x) is considerably slower than sum(x) / length(x). This is because, among other reasons, mean(x) makes two passes over the vector to be more numerically accurate.)
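Because the benchmark only compares results on a single input, it can help to check agreement on a few more cases. The following is a minimal sketch in base R, not a full unit test suite. The inputs list and the use of all.equal() are illustrative choices of mine:

```r
mean1 <- function(x) mean(x)
mean2 <- function(x) sum(x) / length(x)

# Hypothetical inputs chosen for illustration; a real test suite
# would cover the cases that matter for your problem.
inputs <- list(
  doubles  = c(1.5, 2.5, 10),
  integers = 1:10,
  with_na  = c(1, NA, 3),
  empty    = numeric(0)
)

# all.equal() compares with a small numeric tolerance; isTRUE()
# turns its result into a single TRUE or FALSE.
agree <- vapply(
  inputs,
  function(x) isTRUE(all.equal(mean1(x), mean2(x))),
  logical(1)
)
agree

# Stop with an error if any pair of results disagrees.
stopifnot(all(agree))
```

Comparing with a tolerance rather than identical() is deliberate: the two approaches accumulate floating point error differently, so results can differ in the last few bits while still being equivalent for practical purposes.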

If you’d like to see this strategy in action, I’ve used it a few times on Stack Overflow: