2 Why and how

Tidy evaluation is a framework for metaprogramming in R, used throughout the tidyverse to implement data masking. Metaprogramming is about using a programming language to manipulate or modify its own code. The tidyverse relies on this idea to change the context in which certain pieces of R code are computed.

Changing the context of evaluation is useful for four main purposes:

To promote data frames to full-blown scopes, where columns are exposed as named objects.
To execute your R code in a foreign environment. For instance, dbplyr translates ordinary dplyr pipelines to SQL queries.
To execute your R code with a more performant compiled language. For instance, the dplyr package uses C++ implementations for a certain set of mathematical expressions to avoid executing slower R code when possible¹.
To implement special rules for ordinary R operators. For instance, selection functions such as dplyr::select() or tidyr::gather() implement specific behaviours for c(), : and -.
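As a rough illustration of the last point, the sketch below assumes the starwars data that ships with dplyr. Inside select(), `:` and `-` act on columns rather than on numbers; the particular columns are picked only for the example.

```r
library("dplyr")

# Inside select(), `:` denotes a range of adjacent columns and `-` drops
# a column. Neither operator performs arithmetic here.
selected <- select(starwars, name:mass, -height)
names(selected)
#> [1] "name" "mass"
```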

2.1 Data masking

Of these goals, the promotion of data frames is the most important because data is often the most relevant context for data analysts. We believe that R and the tidyverse are human-centered in large part because the data frame is available for direct use in computations, without syntax and boilerplate getting in the way. Formulas for statistical models are a prime example of human-centered syntax in R. Data masking and special operator rules make model formulas an intuitive interface for model specification.
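A small base R sketch of the model formula case, using the mtcars data that ships with R (chosen here only for convenience). The formula names columns directly, and `+` means "add a term to the model" rather than arithmetic addition.

```r
# `mpg`, `wt` and `hp` are columns of mtcars, not objects in the workspace.
# lm() looks them up in the data frame supplied through `data`.
fit <- lm(mpg ~ wt + hp, data = mtcars)
names(coef(fit))
#> [1] "(Intercept)" "wt"          "hp"
```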

When the contents of the data frame are temporarily promoted to first-class objects, we say the data masks the workspace:

library("dplyr")

starwars %>% filter(
  height < 200,
  gender == "male"
)

Data masking is natural in R because it reduces boilerplate and results in code that maps more directly to how users think about data manipulation problems. Compare this with the roughly equivalent subsetting code, where it is necessary to be explicit about where the columns come from:

starwars[starwars$height < 200 & starwars$gender == "male", ]

Data masking is only possible because R allows suspending the normal flow of evaluation. If code were evaluated in the normal way, R would not be able to find the relevant columns for the computation. For instance, a normal function like list(), which has no concept of data masking, fails with an "object not found" error:

list(
  height < 200,
  gender == "male"
)
#> Error in eval(expr, envir, enclos): object 'height' not found
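For contrast, base R's with() is one of the simplest functions that does suspend normal evaluation: it evaluates its second argument with the columns of a data frame in scope. The data frame `people` below is made up for this sketch and merely stands in for starwars.

```r
# Hypothetical toy data, standing in for starwars.
people <- data.frame(
  height = c(172, 202, 96),
  gender = c("male", "male", "female")
)

# with() evaluates the expression using the columns of `people`,
# so `height` and `gender` are found even though they are not
# objects in the workspace.
res <- with(people, list(height < 200, gender == "male"))
res[[1]]
#> [1]  TRUE FALSE  TRUE
```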

2.2 Quoting code

To change the context, evaluation must first be suspended and then resumed in a different environment. The technical term for delaying code in this way is quoting. Tidyverse grammars quote the code that users supply as arguments: they do not receive the results of that code but the quoted code itself, whose evaluation can be resumed later in a data context. In a way, quoted code is like a blueprint for R computations.

One important quoting function in dplyr is vars(). This function does nothing but return its arguments as blueprints, to be interpreted later by verbs like summarise_at():

starwars %>% summarise_at(vars(ends_with("color")), n_distinct)
#> # A tibble: 1 x 3
#>   hair_color skin_color eye_color
#>        <int>      <int>     <int>
#> 1         13         31        15

If you call vars() alone, you get to see the blueprints!²

vars(
  ends_with("color"),
  height:mass
)
#> <list_of<quosure>>
#> 
#> [[1]]
#> <quosure>
#> expr: ^ends_with("color")
#> env:  global
#> 
#> [[2]]
#> <quosure>
#> expr: ^height:mass
#> env:  global
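The two fields in this printout can also be extracted programmatically. The sketch below assumes rlang's quo_get_expr() and quo_get_env() accessors, and builds a single quosure with quo() instead of going through vars(). The variable names `code` and `ctx` are arbitrary.

```r
# quo() captures one expression together with the environment
# in which it was written.
q <- rlang::quo(height / 100)

rlang::is_quosure(q)            # the same kind of object vars() returns

code <- rlang::quo_get_expr(q)  # the quoted code, `height / 100`
ctx <- rlang::quo_get_env(q)    # the environment recorded at capture time

# The expression is ordinary quoted R code.
identical(code, quote(height / 100))
#> [1] TRUE
```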

The evaluation of an expression captured as a blueprint can be resumed at any time, possibly in a different context:

exprs <- vars(height / 100, mass + 50)

rlang::eval_tidy(exprs[[1]])
#> Error in rlang::eval_tidy(exprs[[1]]): object 'height' not found

rlang::eval_tidy(exprs[[1]], data = starwars)
#>  [1] 1.72 1.67 0.96 2.02 1.50 1.78 1.65 0.97 1.83 1.82 1.88 1.80 2.28 1.80 1.73
#> [16] 1.75 1.70 1.80 0.66 1.70 1.83 2.00 1.90 1.77 1.75 1.80 1.50   NA 0.88 1.60
#> [31] 1.93 1.91 1.70 1.96 2.24 2.06 1.83 1.37 1.12 1.83 1.63 1.75 1.80 1.78 0.94
#> [46] 1.22 1.63 1.88 1.98 1.96 1.71 1.84 1.88 2.64 1.88 1.96 1.85 1.57 1.83 1.83
#> [61] 1.70 1.66 1.65 1.93 1.91 1.83 1.68 1.98 2.29 2.13 1.67 0.79 0.96 1.93 1.91
#> [76] 1.78 2.16 2.34 1.88 1.78 2.06   NA   NA   NA   NA   NA 1.65
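To make "possibly in a different context" concrete, this sketch evaluates one quosure against two made-up data frames. As an aside, it also shows that a name not found in the data (here the hypothetical variable `divisor`) is looked up in the environment recorded by the quosure.

```r
divisor <- 100
q <- rlang::quo(height / divisor)

# Hypothetical toy data frames that both have a `height` column.
humans <- data.frame(height = c(172, 188))
droids <- data.frame(height = c(96, 167))

# `height` comes from the data, `divisor` from the quosure's environment.
rlang::eval_tidy(q, data = humans)
#> [1] 1.72 1.88
rlang::eval_tidy(q, data = droids)
#> [1] 0.96 1.67
```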

To sum up, the distinctive look and feel of data masking UIs requires suspending the normal evaluation of R code. Once code has been captured in quoted form, its evaluation can be resumed in a different context. Unfortunately, the delaying of code makes it harder to program with data masking functions, and requires learning a bit of new theory and some new tools.

2.3 Unquoting code

Data masking functions prevent the normal evaluation of their arguments by quoting them. Once they hold the blueprints of their arguments, they create a data mask and resume evaluation in this new context. Unfortunately, delaying code in this way has a flip side. While it is natural to substitute values when you’re programming with normal functions using regular evaluation, it is harder to substitute column names in data masking functions that delay evaluation of your code. To make indirect references to columns, it is necessary to modify the quoted code before it gets evaluated. This is exactly what the !! operator, pronounced bang bang, is all about. It is a surgery operator for blueprints of R code.

In the world of normal functions, making indirect references to values is easy. Expressions that yield the same values can be freely interchanged, a property that is sometimes called referential transparency. The following calls to my_function() all yield the same results because they were given the same value as inputs:

my_function <- function(x) x * 100

my_function(6)
#> [1] 600

my_function(2 * 3)
#> [1] 600

a <- 2
b <- 3
my_function(a * b)
#> [1] 600

Because data masking functions evaluate their quoted arguments in a different context, they do not have this property:

starwars %>% summarise(avg = mean(height, na.rm = TRUE))
#> # A tibble: 1 x 1
#>     avg
#>   <dbl>
#> 1  174.

value <- mean(height, na.rm = TRUE)
#> Error in mean(height, na.rm = TRUE): object 'height' not found
starwars %>% summarise(avg = value)
#> Error: Problem with `summarise()` input `avg`.
#> ✖ object 'value' not found
#> ℹ Input `avg` is `value`.

Storing a column name in a variable or passing one as a function argument requires the tidy eval operator !!. This special operator, only available in quoting functions, acts like a surgical operator for modifying blueprints. To understand what it does, it is best to see it in action. The qq_show() helper from rlang processes !! and prints the resulting blueprint of the computation. As you can observe, !! modifies the quoted code by inlining the value of its operand right into the blueprint:

x <- 1

rlang::qq_show(
  starwars %>% summarise(out = x)
)
#> starwars %>% summarise(out = x)

rlang::qq_show(
  starwars %>% summarise(out = !!x)
)
#> starwars %>% summarise(out = 1)
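qq_show() only prints. When the modified blueprint is needed as an object, rlang::expr() applies the same !! processing and returns the quoted code. In this small sketch summarise() is never actually called; it only appears inside quoted code.

```r
x <- 1

# expr() quotes its argument, inlining the value of anything marked with !!.
without_bang <- rlang::expr(summarise(out = x))
with_bang <- rlang::expr(summarise(out = !!x))

# The first blueprint still refers to the symbol `x`;
# the second contains the number 1 itself.
is.symbol(without_bang[["out"]])
#> [1] TRUE
is.numeric(with_bang[["out"]])
#> [1] TRUE
```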

What would it take to create an indirect reference to a column name? Inlining the name as a string in the blueprint will not produce what you expect:

col <- "height"

rlang::qq_show(
  starwars %>% summarise(out = sum(!!col, na.rm = TRUE))
)
#> starwars %>% summarise(out = sum("height", na.rm = TRUE))

This code amounts to taking the sum of a string, something that R will not be happy about:

starwars %>% summarise(out = sum("height", na.rm = TRUE))
#> Error: Problem with `summarise()` input `out`.
#> ✖ invalid 'type' (character) of argument
#> ℹ Input `out` is `sum("height", na.rm = TRUE)`.

To refer to column names inside a blueprint, we need to inline blueprint material. We need symbols:

sym(col)
#> height
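A quick check with base R predicates suggests how the result differs from the string it was built from: the string is data, while the symbol is a piece of code, the same object you would get by quoting the bare name.

```r
col <- "height"
s <- rlang::sym(col)

is.character(col)            # the original string
#> [1] TRUE
is.symbol(s)                 # the result of sym() is a symbol, not a character vector
#> [1] TRUE
identical(s, quote(height))  # same as quoting the bare name `height`
#> [1] TRUE
```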

Symbols are a special type of string that represents other objects. When a piece of R code is evaluated, every bare variable name is actually a symbol that represents some value, as defined in the current context. Let’s see what the modified blueprint looks like when we inline a symbol:

rlang::qq_show(
  starwars %>% summarise(out = sum(!!sym(col), na.rm = TRUE))
)
#> starwars %>% summarise(out = sum(height, na.rm = TRUE))

Looks good! We’re now ready to actually run the dplyr pipeline with an indirect reference:

starwars %>% summarise(out = sum(!!sym(col), na.rm = TRUE))
#> # A tibble: 1 x 1
#>     out
#>   <int>
#> 1 14123

There were two necessary steps to create an indirect reference and properly modify the summarising code:

We first created a piece of blueprint (a symbol) with sym().
We used !! to insert it in the blueprint captured by summarise().

We call the combination of these two steps the quote and unquote pattern. This pattern is the heart of programming with tidy eval functions. We quote an expression and unquote it in another quoted expression. In other words, we create or capture a piece of blueprint, and insert it in another blueprint just before it’s captured by a data masking function. This process is also called interpolation.
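The pattern is easiest to reuse when wrapped in a function. The helper below, sum_column(), is a hypothetical name introduced for this sketch: it takes a column name as a string, turns it into a symbol, and unquotes that symbol inside summarise().

```r
library("dplyr")

# Hypothetical helper: `col` is a string naming a column of `data`.
sum_column <- function(data, col) {
  col_sym <- sym(col)  # step 1: create a piece of blueprint
  # step 2: insert it in the blueprint captured by summarise()
  data %>% summarise(out = sum(!!col_sym, na.rm = TRUE))
}

res <- sum_column(starwars, "height")
```

Under these assumptions, `res$out` should match `sum(starwars$height, na.rm = TRUE)`.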

Most of the time though, we don’t need to create blueprints manually. We’ll get them by quoting the arguments supplied by users. This gives your functions the same usage and feel as tidyverse verbs.
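A minimal sketch of that idea, assuming the enquo() function that dplyr re-exports from rlang. The function name mean_of() is hypothetical: it quotes its own argument and unquotes it inside summarise(), so callers can pass a bare column name or an expression involving columns.

```r
library("dplyr")

# Hypothetical wrapper around summarise().
mean_of <- function(data, var) {
  var <- enquo(var)  # quote the user's argument instead of evaluating it
  data %>% summarise(avg = mean(!!var, na.rm = TRUE))
}

by_name <- mean_of(starwars, height)
by_expr <- mean_of(starwars, height / 100)
```

Both calls read like ordinary dplyr code, which is the point of the design: the quoting happens inside the function rather than at the call site.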

¹ The data.table package uses different metaprogramming tools than tidy eval for the same purpose. Certain expressions are executed in C to perform efficient data transformations.
² As you can see, these blueprints are also called quosures. These are special types of expressions that keep track of the current context, or environment.