20.4 Data masks

In this section, you’ll learn about the data mask, a data frame where the evaluated code will look first for variable definitions. The data mask is the key idea that powers base functions like with(), subset() and transform(), and is used throughout the tidyverse in packages like dplyr and ggplot2.

20.4.1 Basics

The data mask allows you to mingle variables from an environment and a data frame in a single expression. You supply the data mask as the second argument to eval_tidy():

q1 <- new_quosure(expr(x * y), env(x = 100))
df <- data.frame(y = 1:10)

eval_tidy(q1, df)
#>  [1]  100  200  300  400  500  600  700  800  900 1000

This code is a little hard to follow because we’re creating every object from scratch, which takes a lot of syntax. It’s easier to see what’s going on if we make a little wrapper. I call this with2() because it’s equivalent to base::with().

with2 <- function(data, expr) {
  expr <- enquo(expr)
  eval_tidy(expr, data)
}

We can now rewrite the code above as below:

x <- 100
with2(df, x * y)
#>  [1]  100  200  300  400  500  600  700  800  900 1000

base::eval() has similar functionality, although it doesn’t call it a data mask. Instead, you supply a data frame as the second argument and an environment as the third. That gives the following implementation of with():

with3 <- function(data, expr) {
  expr <- substitute(expr)
  eval(expr, data, caller_env())
}
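
As a quick check, the two wrappers can be compared side by side. This is a minimal sketch that assumes rlang is attached and reuses the df and x from above. It only shows that both give the same answer in this simple case, not that they are interchangeable in general.

```r
library(rlang)

with2 <- function(data, expr) {
  expr <- enquo(expr)
  eval_tidy(expr, data)
}

with3 <- function(data, expr) {
  expr <- substitute(expr)
  eval(expr, data, caller_env())
}

df <- data.frame(y = 1:10)
x <- 100

# y comes from the data frame; x comes from the calling environment
res2 <- with2(df, x * y)
res3 <- with3(df, x * y)

identical(res2, res3)
#> [1] TRUE
```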

20.4.2 Pronouns

Using a data mask introduces ambiguity. For example, in the following code you can’t know whether x will come from the data mask or the environment, unless you know what variables are found in df.

with2(df, x)

That makes code harder to reason about (because you need to know more context), which can introduce bugs. To resolve that issue, the data mask provides two pronouns: .data and .env.

.data$x always refers to x in the data mask.
.env$x always refers to x in the environment.

x <- 1
df <- data.frame(x = 2)

with2(df, .data$x)
#> [1] 2
with2(df, .env$x)
#> [1] 1

You can also subset .data and .env using [[, e.g. .data[["x"]]. Otherwise the pronouns are special objects and you shouldn’t expect them to behave like data frames or environments. In particular, they throw an error if the object isn’t found:

with2(df, .data$y)
#> Error: Column `y` not found in `.data`
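
To make the ambiguity concrete, the sketch below (assuming rlang is attached and with2() is defined as above) shows that a bare x is taken from the data mask when both the mask and the environment define it, and that the two pronouns can be combined in one expression.

```r
library(rlang)

with2 <- function(data, expr) {
  expr <- enquo(expr)
  eval_tidy(expr, data)
}

x <- 1
df <- data.frame(x = 2)

# A bare name is looked up in the data mask first
bare <- with2(df, x)
bare
#> [1] 2

# [[ works like $ on the pronoun
via_brackets <- with2(df, .data[["x"]])
via_brackets
#> [1] 2

# Both sources can be used explicitly in the same expression
both <- with2(df, .data$x + .env$x)
both
#> [1] 3

# A missing column is an error rather than a silent fallback to the environment
missing_col <- tryCatch(with2(df, .data$y), error = function(e) "error")
missing_col
#> [1] "error"
```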

20.4.3 Application: `subset()`

We’ll explore tidy evaluation in the context of base::subset(), because it’s a simple yet powerful function that makes a common data manipulation challenge easier. If you haven’t used it before, subset(), like dplyr::filter(), provides a convenient way of selecting rows of a data frame. You give it some data, along with an expression that is evaluated in the context of that data. This considerably reduces the number of times you need to type the name of the data frame:

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# Shorthand for sample_df[sample_df$a >= 4, ]
subset(sample_df, a >= 4)
#>   a b c
#> 4 4 2 4
#> 5 5 1 1

# Shorthand for sample_df[sample_df$b == sample_df$c, ]
subset(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1

The core of our version of subset(), subset2(), is quite simple. It takes two arguments: a data frame, data, and an expression, rows. We evaluate rows using data as a data mask, then use the results to subset the data frame with [. I’ve included a very simple check to ensure the result is a logical vector; real code would do more to create an informative error.

subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))

  data[rows_val, , drop = FALSE]
}

subset2(sample_df, b == c)
#>   a b c
#> 1 1 5 5
#> 5 5 1 1
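
Two further behaviours follow from this definition, sketched below under the assumption that rlang is attached. First, names that are not columns are looked up in the environment where rows was written. Second, a non-logical result is rejected by the stopifnot() check. The variable threshold is introduced here purely for illustration.

```r
library(rlang)

subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))

  data[rows_val, , drop = FALSE]
}

sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# threshold is not a column, so it is found in the calling environment
threshold <- 4
res <- subset2(sample_df, a >= threshold)
res
#>   a b c
#> 4 4 2 4
#> 5 5 1 1

# An integer result is not a logical vector, so the check fails
err <- tryCatch(subset2(sample_df, a), error = function(e) conditionMessage(e))
is.character(err)
#> [1] TRUE
```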

20.4.4 Application: `transform()`

A more complicated situation is base::transform() which allows you to add new variables to a data frame, evaluating their expressions in the context of the existing variables:

df <- data.frame(x = c(2, 3, 1), y = runif(3))
transform(df, x = -x, y2 = 2 * y)
#>    x      y    y2
#> 1 -2 0.0808 0.162
#> 2 -3 0.8343 1.669
#> 3 -1 0.6008 1.202

Again, our own transform2() requires little code. We capture the unevaluated ... with enquos(...), and then evaluate each expression using a for loop. Real code would do more error checking to ensure that each input is named and evaluates to a vector with one element per row of .data.

transform2 <- function(.data, ...) {
  dots <- enquos(...)

  for (i in seq_along(dots)) {
    name <- names(dots)[[i]]
    dot <- dots[[i]]

    .data[[name]] <- eval_tidy(dot, .data)
  }

  .data
}

transform2(df, x2 = x * 2, y = -y)
#>   x       y x2
#> 1 2 -0.0808  4
#> 2 3 -0.8343  6
#> 3 1 -0.6008  2

NB: I named the first argument .data to avoid problems if users tried to create a variable called data. They will still have problems if they attempt to create a variable called .data, but this is much less likely. This is the same reasoning that leads to the .x and .f arguments to map() (Section 9.2.4).
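
The error checking mentioned above for transform2() could look something like the following. This is only a sketch: transform2_checked() is a hypothetical name, the error messages are made up, and the length check is stricter than base::transform() because it does not recycle shorter vectors.

```r
library(rlang)

# Hypothetical variant of transform2() with the two checks described above
transform2_checked <- function(.data, ...) {
  dots <- enquos(...)

  # enquos() gives unnamed inputs the name "", so this catches missing names
  if (any(names(dots) == "")) {
    stop("All inputs to `...` must be named.", call. = FALSE)
  }

  for (i in seq_along(dots)) {
    name <- names(dots)[[i]]
    val <- eval_tidy(dots[[i]], .data)

    # Each result must be a vector with one element per row (no recycling)
    if (!is.atomic(val) || length(val) != nrow(.data)) {
      stop("`", name, "` must be a vector of length ", nrow(.data), ".", call. = FALSE)
    }

    .data[[name]] <- val
  }

  .data
}

df <- data.frame(x = c(2, 3, 1))

ok <- transform2_checked(df, x2 = x * 2)
unnamed <- tryCatch(transform2_checked(df, x * 2), error = conditionMessage)
too_short <- tryCatch(transform2_checked(df, z = 1:2), error = conditionMessage)
```

Here ok gains a column x2, while unnamed and too_short hold the messages from the two failed checks.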

20.4.5 Application: `select()`

A data mask will typically be a data frame, but it’s sometimes useful to provide a list filled with more exotic contents. This is basically how the select argument in base::subset() works. It allows you to refer to variables as if they were numbers:

df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)
subset(df, select = b:d)
#>   b c d
#> 1 2 3 4

The key idea is to create a named list where each component gives the position of the corresponding variable:

vars <- as.list(set_names(seq_along(df), names(df)))
str(vars)
#> List of 5
#>  $ a: int 1
#>  $ b: int 2
#>  $ c: int 3
#>  $ d: int 4
#>  $ e: int 5
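
To see why this list is useful, it helps to evaluate a few expressions with it as the data mask. In the sketch below (assuming rlang is attached), the column names stand in for their integer positions, so ordinary arithmetic and sequence operators work on them.

```r
library(rlang)

df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5)
vars <- as.list(set_names(seq_along(df), names(df)))

# b and d are replaced by their positions, so b:d is 2:4
range_cols <- eval_tidy(expr(b:d), vars)
range_cols
#> [1] 2 3 4

# c() combines individual positions
picked <- eval_tidy(expr(c(a, e)), vars)
picked
#> [1] 1 5

# Negative positions drop columns when used with [
dropped <- df[, eval_tidy(expr(-(b:d)), vars), drop = FALSE]
names(dropped)
#> [1] "a" "e"
```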

Then the implementation is again only a few lines of code:

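select2 <- function(data, ...) {
  dots <- enquos(...)

  # Map each column name to its position, then evaluate each input
  # with that list as the data mask (map() comes from purrr)
  vars <- as.list(set_names(seq_along(data), names(data)))
  cols <- unlist(map(dots, eval_tidy, vars))

  data[, cols, drop = FALSE]
}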
select2(df, b:d)
#>   b c d
#> 1 2 3 4

dplyr::select() takes this idea and runs with it, providing a number of helpers that allow you to select variables based on their names (e.g. starts_with("x") or ends_with("_a")).
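
One way to imagine a name-based helper is to place a function in the mask alongside the column positions. The sketch below is an illustration only, not how dplyr implements its helpers: select3() and starts_with2() are hypothetical names, and lapply() is used in place of purrr::map() so that only rlang is needed.

```r
library(rlang)

select3 <- function(data, ...) {
  dots <- enquos(...)

  vars <- as.list(set_names(seq_along(data), names(data)))
  # Hypothetical helper stored in the mask: returns the positions of
  # columns whose names begin with `prefix`
  vars$starts_with2 <- function(prefix) which(startsWith(names(data), prefix))

  cols <- unlist(lapply(dots, eval_tidy, data = vars))
  data[, cols, drop = FALSE]
}

df2 <- data.frame(x1 = 1, y = 2, x2 = 3)

by_prefix <- select3(df2, starts_with2("x"))
names(by_prefix)
#> [1] "x1" "x2"

# Helpers and positional selections can be mixed
mixed <- select3(df2, y, starts_with2("x"))
names(mixed)
#> [1] "y"  "x1" "x2"
```

A drawback of this approach is that a column called starts_with2 would collide with the helper, which is one reason a real implementation would need more care.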

20.4.6 Exercises

Why did I use a for loop in transform2() instead of map()? Consider transform2(df, x = x * 2, x = x * 2).

Here’s an alternative implementation of subset2():

subset3 <- function(data, rows) {
  rows <- enquo(rows)
  eval_tidy(expr(data[!!rows, , drop = FALSE]), data = data)
}

df <- data.frame(x = 1:3)
subset3(df, x == 1)

Compare and contrast subset3() to subset2(). What are its advantages and disadvantages?

The following function implements the basics of dplyr::arrange(). Annotate each line with a comment explaining what it does. Can you explain why !!.na.last is strictly correct, but omitting the !! is unlikely to cause problems?

arrange2 <- function(.df, ..., .na.last = TRUE) {
  args <- enquos(...)

  order_call <- expr(order(!!!args, na.last = !!.na.last))

  ord <- eval_tidy(order_call, .df)
  stopifnot(length(ord) == nrow(.df))

  .df[ord, , drop = FALSE]
}