20.5 Using tidy evaluation

While it’s important to understand how eval_tidy() works, most of the time you won’t call it directly. Instead, you’ll usually use it indirectly by calling a function that uses eval_tidy(). This section will give a few practical examples of wrapping functions that use tidy evaluation.

20.5.1 Quoting and unquoting

Imagine we have written a function that resamples a dataset:

resample <- function(df, n) {
  idx <- sample(nrow(df), n, replace = TRUE)
  df[idx, , drop = FALSE]
}

We want to create a new function that allows us to resample and subset in a single step. Our naive approach doesn’t work:

subsample <- function(df, cond, n = nrow(df)) {
  df <- subset2(df, cond)
  resample(df, n)
}

df <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)
subsample(df, x == 1)
#> Error in eval_tidy(rows, data): object 'x' not found

subsample() doesn’t quote any arguments, so cond is evaluated normally (not in a data mask), and we get an error when it tries to find a binding for x. To fix this problem we need to quote cond, and then unquote it when we pass it on to subset2():

subsample <- function(df, cond, n = nrow(df)) {
  cond <- enquo(cond)

  df <- subset2(df, !!cond)
  resample(df, n)
}

subsample(df, x == 1)
#>   x y
#> 3 1 3
#> 1 1 1
#> 2 1 2

This is a very common pattern; whenever you call a quoting function with arguments from the user, you need to quote them and then unquote them when you pass them on.
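As a minimal sketch of the same quote-then-unquote pattern in a different wrapper, the block below assumes a definition of subset2() like the one given earlier in the chapter (reproduced here so the example runs on its own) and introduces a hypothetical helper, subset_head(), that is not part of the original text:

```r
library(rlang)

# Assumed definition of subset2(), standing in for the version
# defined earlier in the chapter.
subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}

# Hypothetical wrapper: filter with subset2(), then keep the first n rows.
subset_head <- function(df, cond, n = 2) {
  cond <- enquo(cond)          # quote the user's expression
  out <- subset2(df, !!cond)   # unquote it into the quoting function
  head(out, n)
}

df <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)
res <- subset_head(df, x == 1)
res
```

The wrapper never evaluates cond itself; it only captures it and hands it on, so x is still looked up in the data mask created inside subset2().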

20.5.2 Handling ambiguity

In the case above, we needed to think about tidy evaluation because of quasiquotation. But we also need to think about tidy evaluation when the wrapper doesn’t need to quote any arguments. Take this wrapper around subset2():

threshold_x <- function(df, val) {
  subset2(df, x >= val)
}

This function can silently return an incorrect result in two situations:

  • When x exists in the calling environment, but not in df:

    x <- 10
    no_x <- data.frame(y = 1:3)
    threshold_x(no_x, 2)
    #>   y
    #> 1 1
    #> 2 2
    #> 3 3

  • When val exists in df:

    has_val <- data.frame(x = 1:3, val = 9:11)
    threshold_x(has_val, 2)
    #> [1] x   val
    #> <0 rows> (or 0-length row.names)

These failure modes arise because tidy evaluation is ambiguous: each variable can be found in either the data mask or the environment. To make this function safe we need to remove the ambiguity using the .data and .env pronouns:

threshold_x <- function(df, val) {
  subset2(df, .data$x >= .env$val)
}

x <- 10
threshold_x(no_x, 2)
#> Error: Column `x` not found in `.data`
threshold_x(has_val, 2)
#>   x val
#> 2 2  10
#> 3 3  11

Generally, whenever you use the .env pronoun, you can use unquoting instead:

threshold_x <- function(df, val) {
  subset2(df, .data$x >= !!val)
}

There are subtle differences in when val is evaluated. If you unquote, val is evaluated early, at the point where enquo() (inside subset2()) captures the expression; if you use a pronoun, val is evaluated lazily, when eval_tidy() evaluates the expression. These differences are usually unimportant, so pick the form that looks most natural.
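To make the timing difference concrete, here is a small sketch that uses quo() and eval_tidy() directly rather than going through subset2(). The names early, late, and df are illustrative only, and the example assumes a version of rlang that provides the .data and .env pronouns:

```r
library(rlang)

df <- data.frame(x = 1:3)
val <- 2

# Unquoting: the current value of val (2) is inlined when the quosure is created.
early <- quo(.data$x >= !!val)

# Pronoun: val is only looked up in the environment when the quosure is evaluated.
late <- quo(.data$x >= .env$val)

# Change val after capturing both expressions.
val <- 3

res_early <- eval_tidy(early, df)  # compares against 2
res_late <- eval_tidy(late, df)    # compares against 3
```

The unquoted version keeps the value that val had at capture time, while the pronoun version sees the later assignment. In a wrapper like threshold_x() the value rarely changes between capture and evaluation, which is why the two forms usually behave the same.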

20.5.3 Quoting and ambiguity

To finish our discussion, let’s consider the case where we have both quoting and potential ambiguity. I’ll generalise threshold_x() slightly so that the user can pick the variable used for thresholding. Here I use .data[[var]] because it makes the code a little simpler; in the exercises you’ll have a chance to explore how you might use $ instead.

threshold_var <- function(df, var, val) {
  var <- as_string(ensym(var))
  subset2(df, .data[[var]] >= !!val)
}

df <- data.frame(x = 1:10)
threshold_var(df, x, 8)
#>     x
#> 8   8
#> 9   9
#> 10 10

It is not always the responsibility of the function author to avoid ambiguity. Imagine we generalise further to allow thresholding based on any expression:

threshold_expr <- function(df, expr, val) {
  expr <- enquo(expr)
  subset2(df, !!expr >= !!val)
}

It’s not possible to evaluate expr only in the data mask, because the data mask doesn’t include any functions like + or ==. Here, it’s the user’s responsibility to avoid ambiguity. As a general rule of thumb, as a function author it’s your responsibility to avoid ambiguity with any expressions that you create; it’s the user’s responsibility to avoid ambiguity in expressions that they create.
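As a sketch of what this looks like from the user’s side, a caller can use the .data pronoun inside the expression they supply. The block assumes a definition of subset2() like the one given earlier in the chapter (reproduced here so the example is self-contained); the data frame is made up for illustration:

```r
library(rlang)

# Assumed definition of subset2(), standing in for the version
# defined earlier in the chapter.
subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}

threshold_expr <- function(df, expr, val) {
  expr <- enquo(expr)
  subset2(df, !!expr >= !!val)
}

df <- data.frame(x = 1:5, y = 1:5)

# The user disambiguates their own expression: x and y must come from df,
# while + is still found in the environment as usual.
res <- threshold_expr(df, .data$x + .data$y, 8)
res
```

With the pronouns in place, the call should fail loudly if df lacks an x or y column, rather than silently picking up variables of the same name from the calling environment.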

20.5.4 Exercises

  1. I’ve included an alternative implementation of threshold_var() below. What makes it different to the approach I used above? What makes it harder?

    threshold_var <- function(df, var, val) {
      var <- ensym(var)
      subset2(df, `$`(.data, !!var) >= !!val)
    }