2.5 Modify-in-place

As we’ve seen above, modifying an R object usually creates a copy. There are two exceptions:

Objects with a single binding get a special performance optimisation.
Environments, a special type of object, are always modified in place.

2.5.1 Objects with a single binding

If an object has a single name bound to it, R will modify it in place:

v <- c(1, 2, 3)

v[[3]] <- 4

(Note the object IDs here: v continues to bind to the same object, 0x207.)

Two complications make predicting exactly when R applies this optimisation challenging:

When it comes to bindings, R can currently⁷ only count 0, 1, or many. That means that if an object has two bindings, and one goes away, the reference count does not go back to 1: one less than many is still many. In turn, this means that R will make copies when it sometimes doesn’t need to.
Whenever you call the vast majority of functions, it makes a reference to the object. The only exception are specially written “primitive” C functions. These can only be written by R-core and occur mostly in the base package.

Together, these two complications make it hard to predict whether or not a copy will occur. Instead, it’s better to determine it empirically with tracemem().

Let’s explore the subtleties with a case study using for loops. For loops have a reputation for being slow in R, but often that slowness is caused by every iteration of the loop creating a copy. Consider the following code. It subtracts the median from each column of a large data frame:

x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))

for (i in seq_along(medians)) {
  x[[i]] <- x[[i]] - medians[[i]]
}

This loop is surprisingly slow because each iteration of the loop copies the data frame. You can see this by using tracemem():

cat(tracemem(x), "\n")
#> <0x7f80c429e020> 

for (i in 1:5) {
  x[[i]] <- x[[i]] - medians[[i]]
}
#> tracemem[0x7f80c429e020 -> 0x7f80c0c144d8]: 
#> tracemem[0x7f80c0c144d8 -> 0x7f80c0c14540]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c14540 -> 0x7f80c0c145a8]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c145a8 -> 0x7f80c0c14610]: 
#> tracemem[0x7f80c0c14610 -> 0x7f80c0c14678]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c14678 -> 0x7f80c0c146e0]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c146e0 -> 0x7f80c0c14748]: 
#> tracemem[0x7f80c0c14748 -> 0x7f80c0c147b0]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c147b0 -> 0x7f80c0c14818]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c14818 -> 0x7f80c0c14880]: 
#> tracemem[0x7f80c0c14880 -> 0x7f80c0c148e8]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c148e8 -> 0x7f80c0c14950]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c14950 -> 0x7f80c0c149b8]: 
#> tracemem[0x7f80c0c149b8 -> 0x7f80c0c14a20]: [[<-.data.frame [[<- 
#> tracemem[0x7f80c0c14a20 -> 0x7f80c0c14a88]: [[<-.data.frame [[<- 

untracemem(x)

In fact, each iteration copies the data frame not once, not twice, but three times! Two copies are made by [[.data.frame, and a further copy⁸ is made because [[.data.frame is a regular function that increments the reference count of x.

We can reduce the number of copies by using a list instead of a data frame. Modifying a list uses internal C code, so the references are not incremented and only a single copy is made:

y <- as.list(x)
cat(tracemem(y), "\n")
#> <0x7f80c5c3de20>
  
for (i in 1:5) {
  y[[i]] <- y[[i]] - medians[[i]]
}
#> tracemem[0x7f80c5c3de20 -> 0x7f80c48de210]:

While it’s not hard to determine when a copy is made, it is hard to prevent it. If you find yourself resorting to exotic tricks to avoid copies, it may be time to rewrite your function in C++, as described in Chapter 25.

2.5.2 Environments

You’ll learn more about environments in Chapter 7, but it’s important to mention them here because their behaviour is different from that of other objects: environments are always modified in place. This property is sometimes described as reference semantics because when you modify an environment all existing bindings to that environment continue to have the same reference.

Take this environment, which we bind to e1 and e2:

e1 <- rlang::env(a = 1, b = 2, c = 3)
e2 <- e1

If we change a binding, the environment is modified in place:

e1$c <- 4
e2$c
#> [1] 4

This basic idea can be used to create functions that “remember” their previous state. See Section 10.2.4 for more details. This property is also used to implement the R6 object-oriented programming system, the topic of Chapter 14.

One consequence of this is that environments can contain themselves:

e <- rlang::env()
e$self <- e

ref(e)
#> █ [1:0x55d53e204520] <env> 
#> └─self = [1:0x55d53e204520]

This is a unique property of environments!

2.5.3 Exercises

Explain why the following code doesn’t create a circular list.
```
x <- list()
x[[1]] <- x
```
Wrap the two methods for subtracting medians into two functions, then use the ‘bench’ package (Hester 2018) to carefully compare their speeds. How does performance change as the number of columns increase?
What happens if you attempt to use tracemem() on an environment?

References

Hester, Jim. 2018. Bench: High Precision Timing of R Expressions. http://bench.r-lib.org/.

By the time you read this, this may have changed, as plans are afoot to improve reference counting: https://developer.r-project.org/Refcnt.html ↩︎
These copies are shallow: they only copy the reference to each individual column, not the contents of the columns. This means the performance isn’t terrible, but it’s obviously not as good as it could be.↩︎