2.3 Copy-on-modify

Consider the following code. It binds x and y to the same underlying value, then modifies y (see footnote 1).

x <- c(1, 2, 3)
y <- x

y[[3]] <- 4
x
#> [1] 1 2 3

Modifying y clearly didn’t modify x. So what happened to the shared binding? While the value associated with y changed, the original object did not. Instead, R created a new object, 0xcd2, a copy of 0x74b with one value changed, then rebound y to that object.

This behaviour is called copy-on-modify. Understanding it will radically improve your intuition about the performance of R code. A related way to describe this behaviour is to say that R objects are unchangeable, or immutable. However, I’ll generally avoid that term because there are a couple of important exceptions to copy-on-modify that you’ll learn about in Section 2.5.
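
One way to check this account is to compare object addresses before and after the modification. The following is a minimal sketch that assumes the lobstr package is installed; obj_addr() comes from lobstr and is not part of the code above, and the actual addresses will differ on every run, so only the comparisons are shown.

```r
# Assumes the lobstr package is installed; obj_addr() returns an object's
# memory address as a string.
x <- c(1, 2, 3)
y <- x

# Before the modification, both names are bound to one object.
same_before <- lobstr::obj_addr(x) == lobstr::obj_addr(y)

y[[3]] <- 4

# After the modification, y is bound to a new object and x is untouched.
same_after <- lobstr::obj_addr(x) == lobstr::obj_addr(y)

same_before
#> [1] TRUE
same_after
#> [1] FALSE
x
#> [1] 1 2 3
```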

When exploring copy-on-modify behaviour interactively, be aware that you’ll get different results inside RStudio. That’s because the environment pane must make a reference to each object in order to display information about it. This distorts your interactive exploration but doesn’t affect code inside functions, and so doesn’t affect performance during data analysis. For experimentation, I recommend either running R directly from the terminal, or using R Markdown (like this book).

2.3.1 tracemem()

You can see when an object gets copied with the help of base::tracemem(). Once you call that function with an object, you’ll get the object’s current address:

x <- c(1, 2, 3)
cat(tracemem(x), "\n")
#> <0x7f80c0e0ffc8> 

From then on, whenever that object is copied, tracemem() will print a message telling you which object was copied, its new address, and the sequence of calls that led to the copy:

y <- x
y[[3]] <- 4L
#> tracemem[0x7f80c0e0ffc8 -> 0x7f80c4427f40]: 

If you modify y again, it won’t get copied. That’s because the new object now only has a single name bound to it, so R applies modify-in-place optimisation. We’ll come back to this in Section 2.5.

y[[3]] <- 5L

untracemem(x)

untracemem() is the opposite of tracemem(); it turns tracing off.

2.3.2 Function calls

The same rules for copying also apply to function calls. Take this code:

f <- function(a) {
  a
}

x <- c(1, 2, 3)
cat(tracemem(x), "\n")
#> <0x55d5378b2bd8>

z <- f(x)
# there's no copy here!

untracemem(x)

While f() is running, the a inside the function points to the same value as the x does outside the function:

You’ll learn more about the conventions used in this diagram in Section 7.4.4. In brief: the function f() is depicted by the yellow object on the right. It has a formal argument, a, which becomes a binding (indicated by dotted black line) in the execution environment (the gray box) when the function is run.

Once f() completes, x and z will point to the same object. 0x74b never gets copied because it never gets modified. If f() did modify x, R would create a new copy, and then z would bind that object.
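
To illustrate the last point, here is a small sketch with a function that does modify its argument. The function g() is hypothetical and introduced only for this example, and obj_addr() assumes the lobstr package is installed. The caller’s x is left alone, and z is bound to the modified copy.

```r
# Hypothetical function for illustration: unlike f(), it modifies its argument.
g <- function(a) {
  a[[1]] <- 10  # triggers a copy, because x is still bound to the same object
  a
}

x <- c(1, 2, 3)
z <- g(x)

x
#> [1] 1 2 3
z
#> [1] 10  2  3

# x and z are now different objects.
lobstr::obj_addr(x) == lobstr::obj_addr(z)
#> [1] FALSE
```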

2.3.3 Lists

It’s not just names (i.e. variables) that point to values; elements of lists do too. Consider this list, which is superficially very similar to the numeric vector above:

l1 <- list(1, 2, 3)

This list is more complex because instead of storing the values itself, it stores references to them:

This is particularly important when we modify a list:

l2 <- l1

l2[[3]] <- 4

Like vectors, lists use copy-on-modify behaviour; the original list is left unchanged, and R creates a modified copy. This, however, is a shallow copy: the list object and its bindings are copied, but the values pointed to by the bindings are not. The opposite of a shallow copy is a deep copy where the contents of every reference are copied. Prior to R 3.1.0, copies were always deep copies.

To see values that are shared across lists, use lobstr::ref(). ref() prints the memory address of each object, along with a local ID so that you can easily cross-reference shared components.

ref(l1, l2)
#> █ [1:0x55d53bd85848] <list> 
#> ├─[2:0x55d53b19f7d8] <dbl> 
#> ├─[3:0x55d53b19f7a0] <dbl> 
#> └─[4:0x55d53b19f768] <dbl> 
#>  
#> █ [5:0x55d53799ac38] <list> 
#> ├─[2:0x55d53b19f7d8] 
#> ├─[3:0x55d53b19f7a0] 
#> └─[6:0x55d53ba82d98] <dbl>
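
If you only want a yes/no answer for each element rather than the full tree, a possible shortcut is to compare element addresses directly. This sketch assumes the lobstr package is installed and uses its obj_addrs() function, which returns the address of each component of a list.

```r
l1 <- list(1, 2, 3)
l2 <- l1
l2[[3]] <- 4

# One logical per element: TRUE where l1 and l2 point to the same value.
shared <- lobstr::obj_addrs(l1) == lobstr::obj_addrs(l2)
shared
#> [1]  TRUE  TRUE FALSE
```

The first two elements are shared because the copy is shallow; only the third binding points to a new value.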

2.3.4 Data frames

Data frames are lists of vectors, so copy-on-modify has important consequences when you modify a data frame. Take this data frame as an example:

d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))

If you modify a column, only that column needs to be modified; the others will still point to their original references:

d2 <- d1
d2[, 2] <- d2[, 2] * 2

However, if you modify a row, every column is modified, which means every column must be copied:

d3 <- d1
d3[1, ] <- d3[1, ] * 3
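
The same address comparison can be used to check which columns were copied. This is a sketch that assumes the lobstr package is installed; obj_addrs() is applied here to a data frame, treating it as a list of columns. Following the description above, you would expect only column x of d2 to be shared with d1, and no column of d3 to be shared.

```r
d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))

d2 <- d1
d2[, 2] <- d2[, 2] * 2   # modify one column

d3 <- d1
d3[1, ] <- d3[1, ] * 3   # modify one row

# One logical per column (x, then y): TRUE where the column is still the
# same object as the corresponding column of d1.
cols_shared_d2 <- lobstr::obj_addrs(d1) == lobstr::obj_addrs(d2)
cols_shared_d3 <- lobstr::obj_addrs(d1) == lobstr::obj_addrs(d3)

cols_shared_d2
cols_shared_d3
```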

2.3.5 Character vectors

The final place that R uses references is with character vectors (see footnote 2). I usually draw character vectors like this:

x <- c("a", "a", "abc", "d")

But this is a polite fiction. R actually uses a global string pool where each element of a character vector is a pointer to a unique string in the pool:

You can request that ref() show these references by setting the character argument to TRUE:

ref(x, character = TRUE)
#> █ [1:0x55d53ae40f18] <chr> 
#> ├─[2:0x55d534ba83e0] <string: "a"> 
#> ├─[2:0x55d534ba83e0] 
#> ├─[3:0x55d537e48c60] <string: "abc"> 
#> └─[4:0x55d534d361d8] <string: "d">

This has a profound impact on the amount of memory a character vector uses but is otherwise generally unimportant, so elsewhere in the book I’ll draw character vectors as if the strings lived inside a vector.
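
The string pool can also be checked programmatically. The sketch below assumes the lobstr package is installed; for a character vector, obj_addrs() returns the address of each string, so repeated strings show up as repeated addresses.

```r
x <- c("a", "a", "abc", "d")

# Address of each string in the global string pool.
addrs <- lobstr::obj_addrs(x)

# Both "a" elements point to the same string.
addrs[[1]] == addrs[[2]]
#> [1] TRUE

# Four elements, but only three distinct strings.
length(unique(addrs))
#> [1] 3
```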

2.3.6 Exercises

  1. Why is tracemem(1:10) not useful?

  2. Explain why tracemem() shows two copies when you run this code. Hint: carefully look at the difference between this code and the code shown earlier in the section.

    x <- c(1L, 2L, 3L)
    tracemem(x)
    
    x[[3]] <- 4

  3. Sketch out the relationship between the following objects:

    a <- 1:10
    b <- list(a, a)
    c <- list(b, a, 1:10)

  4. What happens when you run this code?

    x <- list(1:10)
    x[[2]] <- x

    Draw a picture.


  1. You may be surprised to see [[ used to subset a numeric vector. We’ll come back to this in Section 4.3, but in brief, I think you should always use [[ when you are getting or setting a single element.

  2. Confusingly, a character vector is a vector of strings, not individual characters.