24.4 Doing as little as possible
The easiest way to make a function faster is to let it do less work. One way to do that is to use a function tailored to a more specific type of input or output, or to a more specific problem. For example:
rowSums(), colSums(), rowMeans(), and colMeans() are faster than equivalent invocations that use apply() because they are vectorised (Section 24.5).
vapply() is faster than sapply() because it pre-specifies the output type.
If you want to see if a vector contains a single value, any(x == 10) is much faster than 10 %in% x because testing equality is simpler than testing set inclusion.
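As a quick check that these specialised functions really are drop-in replacements, the sketch below compares each pair on small made-up inputs (the matrix m, the list lst, and the vector x are illustrative, not from the text). It only confirms that the results agree; timings will depend on your machine and data.

```r
# Illustrative inputs (assumptions, chosen only for this sketch)
m <- matrix(1:6, nrow = 2)
lst <- list(a = 1:3, b = 4:6)
x <- c(3, 10, 7)

# rowSums() gives the same totals as apply() over rows
row_totals_fast <- rowSums(m)
row_totals_slow <- apply(m, 1, sum)

# vapply() declares the output type up front; sapply() has to work it out
lens_fast <- vapply(lst, length, integer(1))
lens_slow <- sapply(lst, length)

# Equality test versus set inclusion for a single value
has_ten_fast <- any(x == 10)
has_ten_slow <- 10 %in% x
```

To compare speed rather than results, wrap each pair in bench::mark() in the same way as the benchmarks later in this section.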
Having this knowledge at your fingertips requires knowing that alternative functions exist: you need to have a good vocabulary. Expand your vocab by regularly reading R code. Good places to read code are the R-help mailing list and StackOverflow.
Some functions coerce their inputs into a specific type. If your input is not the right type, the function has to do extra work. Instead, look for a function that works with your data as it is, or consider changing the way you store your data. The most common example of this problem is using apply() on a data frame. apply() always turns its input into a matrix. Not only is this error-prone (because a data frame is more general than a matrix), it is also slower.
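To see why this coercion is error-prone, consider a small data frame with mixed column types. The data frame df below is a made-up example; the sketch only shows what happens to the column types.

```r
# A hypothetical data frame with one integer and one character column
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# apply() first converts df with as.matrix(); a matrix holds a single type,
# so every column is seen as character
types_via_apply <- apply(df, 2, class)

# vapply() iterates over the columns of the data frame directly,
# so each column keeps its own type
types_via_vapply <- vapply(df, class, character(1))
```

Here types_via_apply reports "character" for both columns, while types_via_vapply still reports "integer" for x.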
Other functions will do less work if you give them more information about the problem. It’s always worthwhile to carefully read the documentation and experiment with different arguments. Some examples that I’ve discovered in the past include:
read.csv(): specify known column types with colClasses. (Also consider switching to readr::read_csv() or data.table::fread(), which are considerably faster than read.csv().)
factor(): specify known levels with levels.
cut(): if you don’t need labels, skip generating them with labels = FALSE, or, even better, use findInterval() as mentioned in the “see also” section of the documentation.
unlist(x, use.names = FALSE) is much faster than unlist(x).
interaction(): if you only need combinations that exist in the data, use drop = TRUE.
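The following sketch shows a few of these arguments in use on small made-up inputs (x, breaks, lst, and the inline CSV text are all illustrative). Note one assumption: cut() and findInterval() close their intervals on different sides by default, so right = FALSE is needed to make the two agree.

```r
# Made-up inputs for illustration
x <- c(0.5, 1.5, 2.5, 1)
breaks <- 0:3

# cut() without labels returns integer bin codes. With right = FALSE the
# bins are [a, b), which matches findInterval()'s default behaviour.
codes_cut <- cut(x, breaks, labels = FALSE, right = FALSE)
codes_fi <- findInterval(x, breaks)

# unlist() does not build element names when use.names = FALSE
lst <- list(a = 1:2, b = 3:4)
flat <- unlist(lst, use.names = FALSE)

# factor() does not need to compute sort(unique(x)) if levels are supplied
f <- factor(c("b", "a", "c", "a"), levels = c("a", "b", "c"))

# read.csv() does not have to guess column types when colClasses is given
csv <- read.csv(text = "id,name\n1,x\n2,y",
                colClasses = c("integer", "character"))
```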
Below, I explore how you might apply this strategy to improve the performance of mean() and as.data.frame().
24.4.1 mean()
Sometimes you can make a function faster by avoiding method dispatch. If you’re calling a method in a tight loop, you can avoid some of the costs by doing the method lookup only once:
For S3, you can do this by calling generic.class() instead of generic().
For S4, you can do this by using selectMethod() to find the method, saving it to a variable, and then calling that function.
For example, calling mean.default() is quite a bit faster than calling mean() for small vectors:
x <- runif(1e2)

bench::mark(
  mean(x),
  mean.default(x)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
#> # A tibble: 2 x 4
#>   expression           min   median `itr/sec`
#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl>
#> 1 mean(x)           2.39µs   2.64µs   348061.
#> 2 mean.default(x)   1.26µs   1.41µs   665234.
This optimisation is a little risky. While mean.default() is almost twice as fast for 100 values, it will fail in surprising ways if x is not a numeric vector.
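One way this can go wrong, sketched with a Date vector (an illustrative choice, not taken from the text): mean() dispatches to the Date method and returns a Date, whereas mean.default() does not treat the input as numeric, issues a warning, and returns NA.

```r
# An illustrative non-numeric input: a vector of dates
d <- as.Date(c("2020-01-01", "2020-01-03"))

# The generic dispatches to the Date method and returns the midpoint as a Date
via_generic <- mean(d)

# Bypassing dispatch skips the Date method; mean.default() sees an object
# that is not numeric, warns, and returns NA (warning suppressed here)
via_default <- suppressWarnings(mean.default(d))
```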
An even riskier optimisation is to directly call the underlying .Internal function. This is faster because it doesn’t do any input checking or handle NAs, so you are buying speed at the cost of safety.
x <- runif(1e2)

bench::mark(
  mean(x),
  mean.default(x),
  .Internal(mean(x))
)[c("expression", "min", "median", "itr/sec", "n_gc")]
#> # A tibble: 3 x 4
#>   expression              min   median `itr/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl>
#> 1 mean(x)              2.39µs   2.61µs   344807.
#> 2 mean.default(x)      1.26µs   1.47µs   603153.
#> 3 .Internal(mean(x)) 298.95ns 337.02ns  2828816.
NB: most of these differences arise because x is small. If you increase the size, the differences basically disappear, because most of the time is now spent computing the mean, not finding the underlying implementation. This is a good reminder that the size of the input matters, and you should motivate your optimisations based on realistic data.
x <- runif(1e4)

bench::mark(
  mean(x),
  mean.default(x),
  .Internal(mean(x))
)[c("expression", "min", "median", "itr/sec", "n_gc")]
#> # A tibble: 3 x 4
#>   expression              min   median `itr/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl>
#> 1 mean(x)              19.9µs   20.2µs    48664.
#> 2 mean.default(x)      18.8µs     19µs    52023.
#> 3 .Internal(mean(x))   17.8µs   17.8µs    55603.
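The benchmarks above cover the S3 case. For S4, the same idea of looking the method up once with selectMethod() can be sketched as follows. The generic area() and the class Square are hypothetical names invented for this illustration.

```r
library(methods)

# Hypothetical S4 class and generic, defined only for this sketch
setClass("Square", representation(side = "numeric"))
setGeneric("area", function(shape) standardGeneric("area"))
setMethod("area", "Square", function(shape) shape@side^2)

sq <- new("Square", side = 3)

# Look the method up once and keep it in a variable ...
area_square <- selectMethod("area", "Square")

# ... then call it directly inside the loop, bypassing dispatch
total <- 0
for (i in 1:10) {
  total <- total + area_square(sq)
}
```

As with mean.default(), calling the stored method skips the class check that dispatch would otherwise perform, so it is only safe when you know every input really is a Square.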
24.4.2 as.data.frame()
Knowing that you’re dealing with a specific type of input can be another way to write faster code. For example, as.data.frame() is quite slow because it coerces each element into a data frame and then combines the results column by column. If you have a named list with vectors of equal length, you can directly transform it into a data frame. In this case, if you can make strong assumptions about your input, you can write a method that’s considerably faster than the default.
quickdf <- function(l) {
  class(l) <- "data.frame"
  attr(l, "row.names") <- .set_row_names(length(l[[1]]))
  l
}

l <- lapply(1:26, function(i) runif(1e3))
names(l) <- letters

bench::mark(
  as.data.frame = as.data.frame(l),
  quick_df      = quickdf(l)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
#> # A tibble: 2 x 4
#>   expression         min   median `itr/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl>
#> 1 as.data.frame 999.74µs   1.05ms      949.
#> 2 quick_df        6.63µs   7.35µs   124258.
Again, note the trade-off. This method is fast because it’s dangerous. If you give it bad inputs, you’ll get a corrupt data frame:
quickdf(list(x = 1, y = 1:2))
#> Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
#> corrupt data frame: columns will be truncated or padded with NAs
#> x y
#> 1 1 1
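If a corrupt result like this is a concern, one middle ground is to keep the fast path but validate its assumptions first. The sketch below is one possible guard, not part of the method described in the text: quickdf_checked() is a hypothetical name, it does not verify that the elements are atomic vectors, and the checks give back some of the speed gained.

```r
quickdf_checked <- function(l) {
  # Assumptions of the fast path: a non-empty, named list ...
  stopifnot(is.list(l), length(l) > 0, !is.null(names(l)))
  # ... whose elements all have the same length
  n <- lengths(l)
  if (any(n != n[[1]])) {
    stop("all elements of `l` must have the same length")
  }
  class(l) <- "data.frame"
  attr(l, "row.names") <- .set_row_names(n[[1]])
  l
}

# Valid input takes the fast path
ok <- quickdf_checked(list(x = 1:2, y = c("a", "b")))

# Unequal lengths now raise an error instead of producing a corrupt data frame
bad <- tryCatch(quickdf_checked(list(x = 1, y = 1:2)), error = function(e) e)
```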
To come up with this minimal method, I carefully read through and then rewrote the source code for as.data.frame.list() and data.frame(). I made many small changes, each time checking that I hadn’t broken existing behaviour. After several hours’ work, I was able to isolate the minimal code shown above. This is a very useful technique. Most base R functions are written for flexibility and functionality, not performance. Thus, rewriting for your specific need can often yield substantial improvements. To do this, you’ll need to read the source code. It can be complex and confusing, but don’t give up!
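A few base R tools can help with reading source code. The sketch below shows some common entry points, using as.data.frame.list() as the example; which tool is most useful depends on whether the function is exported and whether it is implemented in R or in C.

```r
# Typing a function's name without parentheses prints its R source.
# Here the source is captured as text with deparse() instead of printed.
src <- deparse(as.data.frame.list)
n_lines <- length(src)

# getS3method() retrieves a method by generic and class,
# even when the method is not exported
m <- utils::getS3method("as.data.frame", "list")

# getAnywhere() searches attached packages and loaded namespaces for a name
found <- utils::getAnywhere("as.data.frame.list")
```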
24.4.3 Exercises
What’s the difference between rowSums() and .rowSums()?
Make a faster version of chisq.test() that only computes the chi-square test statistic when the input is two numeric vectors with no missing values. You can try simplifying chisq.test() or coding from the mathematical definition.
Can you make a faster version of table() for the case of an input of two integer vectors with no missing values? Can you use it to speed up your chi-square test?