20.6 Base evaluation

Now that you understand tidy evaluation, it’s time to come back to the alternative approaches taken by base R. Here I’ll explore the two most common uses in base R:

  • substitute() and evaluation in the caller environment, as used by subset(). I’ll use this technique to demonstrate why it is not programming friendly, as warned about in the subset() documentation.

  • match.call(), call manipulation, and evaluation in the caller environment, as used by write.csv() and lm(). I’ll use this technique to demonstrate how quasiquotation and (regular) evaluation can help you write wrappers around such functions.

These two approaches are common forms of non-standard evaluation (NSE).

20.6.1 substitute()

The most common form of NSE in base R is substitute() + eval(). The following code shows how you might write the core of subset() in this style using substitute() and eval() rather than enquo() and eval_tidy(). I repeat the code introduced in Section 20.4.3 so you can compare easily. The main difference is the evaluation environment: in subset_base() the argument is evaluated in the caller environment, while in subset_tidy(), it’s evaluated in the environment where it was defined.

subset_base <- function(data, rows) {
  rows <- substitute(rows)
  rows_val <- eval(rows, data, caller_env())
  stopifnot(is.logical(rows_val))

  data[rows_val, , drop = FALSE]
}

subset_tidy <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))

  data[rows_val, , drop = FALSE]
}
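
Used interactively, the two versions behave identically; here's a quick illustration with a throwaway data frame:

df <- data.frame(x = 1:3, y = 3:1)
subset_base(df, x == 2)
#>   x y
#> 2 2 2
subset_tidy(df, x == 2)
#>   x y
#> 2 2 2

The differences only appear once you call these functions from other functions, as the next section shows.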

20.6.1.1 Programming with subset()

The documentation of subset() includes the following warning:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

There are three main problems:

  • base::subset() always evaluates rows in the calling environment, but if ... has been used, then the expression might need to be evaluated elsewhere:

    f1 <- function(df, ...) {
      xval <- 3
      subset_base(df, ...)
    }
    
    my_df <- data.frame(x = 1:3, y = 3:1)
    xval <- 1
    f1(my_df, x == xval)
    #>   x y
    #> 3 3 1

    This may seem like an esoteric concern, but it means that subset_base() cannot reliably be used with functionals like map() or lapply():

    local({
      zzz <- 2
      dfs <- list(data.frame(x = 1:3), data.frame(x = 4:6))
      lapply(dfs, subset_base, x == zzz)
    })
    #> Error in eval(rows, data, caller_env()): object 'zzz' not found
  • Calling subset() from another function requires some care: you have to use substitute() to capture the complete call to subset(), including its arguments, and then evaluate it. I think this code is hard to understand because substitute() doesn’t use a syntactic marker for unquoting. Here I print the generated call to make it a little easier to see what’s happening.

    f2 <- function(df1, expr) {
      call <- substitute(subset_base(df1, expr))
      expr_print(call)
      eval(call, caller_env())
    }
    
    my_df <- data.frame(x = 1:3, y = 3:1)
    f2(my_df, x == 1)
    #> subset_base(my_df, x == 1)
    #>   x y
    #> 1 1 3
  • eval() doesn’t provide any pronouns, so there’s no way to require that part of the expression comes from the data. As far as I can tell, there’s no way to make the following function safe except by manually checking for the presence of the z variable in df (for a contrast, see the sketch after this list).

    f3 <- function(df) {
      call <- substitute(subset_base(df, z > 0))
      expr_print(call)
      eval(call, caller_env())
    }
    
    my_df <- data.frame(x = 1:3, y = 3:1)
    z <- -1
    f3(my_df)
    #> subset_base(my_df, z > 0)
    #> [1] x y
    #> <0 rows> (or 0-length row.names)
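
For contrast, here’s a sketch of how the tidy evaluation equivalent avoids this last problem: inside a wrapper you can use the .data pronoun so that z must come from the data frame (this reuses subset_tidy() from above):

f3_tidy <- function(df) {
  subset_tidy(df, .data$z > 0)
}

f3_tidy(my_df)
#> Error: Column `z` not found in `.data`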

20.6.1.2 What about [?

Given that tidy evaluation is quite complex, why not simply use [ as ?subset recommends? Primarily, it seems unappealing to me to have functions that can only be used interactively, and never inside another function.

Additionally, even the simple subset() function provides two useful features compared to [:

  • It sets drop = FALSE by default, so it’s guaranteed to return a data frame.

  • It drops rows where the condition evaluates to NA.

That means subset(df, x == y) is not equivalent to df[x == y,] as you might expect. Instead, it is equivalent to df[x == y & !is.na(x == y), , drop = FALSE]: that’s a lot more typing! Real-life alternatives to subset(), like dplyr::filter(), do even more. For example, dplyr::filter() can translate R expressions to SQL so that they can be executed in a database. This makes programming with filter() relatively more important.
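
Here’s a quick illustration of the NA behaviour, using a throwaway data frame with a missing value:

df <- data.frame(x = c(1, NA, 3), y = 1:3)

df[df$x > 1, ]
#>     x  y
#> NA NA NA
#> 3   3  3

subset(df, x > 1)
#>   x y
#> 3 3 3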

20.6.2 match.call()

Another common form of NSE is to capture the complete call with match.call(), modify it, and evaluate the result. match.call() is similar to substitute(), but instead of capturing a single argument, it captures the complete call. It doesn’t have an equivalent in rlang.

g <- function(x, y, z) {
  match.call()
}
g(1, 2, z = 3)
#> g(x = 1, y = 2, z = 3)
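
The captured call is an ordinary call object, so you can modify it before evaluating it. A contrived sketch (the modifications here are purely illustrative):

call <- g(1, 2, z = 3)
call$z <- 30
call[[1]] <- quote(sum)
eval(call)
#> [1] 33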

One prominent user of match.call() is write.csv(), which basically works by transforming the call into a call to write.table() with the appropriate arguments set. The following code shows the heart of write.csv():

write.csv <- function(...) {
  call <- match.call(write.table, expand.dots = TRUE)

  call[[1]] <- quote(write.table)
  call$sep <- ","
  call$dec <- "."

  eval(call, parent.frame())
}

I don’t think this technique is a good idea because you can achieve the same result without NSE:

write.csv <- function(...) {
  write.table(..., sep = ",", dec = ".")
}

Nevertheless, it’s important to understand this technique because it’s commonly used in modelling functions. These functions also prominently print the captured call, which poses some special challenges, as you’ll see next.

20.6.2.1 Wrapping modelling functions

To begin, consider the simplest possible wrapper around lm():

lm2 <- function(formula, data) {
  lm(formula, data)
}

This wrapper works, but is suboptimal because lm() captures its call and displays it when printing.

lm2(mpg ~ disp, mtcars)
#> 
#> Call:
#> lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>     29.5999      -0.0412

Fixing this is important because this call is the chief way that you see the model specification when printing the model. To overcome this problem, we need to capture the arguments, create the call to lm() using unquoting, then evaluate that call. To make it easier to see what’s going on, I’ll also print the expression we generate. This will become more useful as the calls get more complicated.

lm3 <- function(formula, data, env = caller_env()) {
  formula <- enexpr(formula)
  data <- enexpr(data)

  lm_call <- expr(lm(!!formula, data = !!data))
  expr_print(lm_call)
  eval(lm_call, env)
}

lm3(mpg ~ disp, mtcars)
#> lm(mpg ~ disp, data = mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ disp, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)         disp  
#>     29.5999      -0.0412

There are three pieces that you’ll use whenever wrapping a base NSE function in this way:

  • You capture the unevaluated arguments using enexpr(), and capture the caller environment using caller_env().

  • You generate a new expression using expr() and unquoting.

  • You evaluate that expression in the caller environment. You have to accept that the function will not work correctly if the arguments are not defined in the caller environment. Providing the env argument at least provides a hook that experts can use if the default environment isn’t correct.
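
For example, a caller could use the env hook to point evaluation at an environment they construct themselves (mtcars2 and e are made-up names for this sketch):

e <- env(mtcars2 = head(mtcars, 20))
mod <- lm3(mpg ~ disp, mtcars2, env = e)
#> lm(mpg ~ disp, data = mtcars2)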

The use of enexpr() has a nice side-effect: we can use unquoting to generate formulas dynamically:

resp <- expr(mpg)
disp1 <- expr(vs)
disp2 <- expr(wt)
lm3(!!resp ~ !!disp1 + !!disp2, mtcars)
#> lm(mpg ~ vs + wt, data = mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ vs + wt, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)           vs           wt  
#>       33.00         3.15        -4.44

20.6.2.2 Evaluation environment

What if you want to mingle objects supplied by the user with objects that you create in the function? For example, imagine you want to make an auto-resampling version of lm().
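
The code below relies on a resample() helper that samples rows of a data frame with replacement. Its exact definition doesn’t matter; a minimal sketch would be:

resample <- function(df, n) {
  # draw n row indices with replacement, then subset the data frame
  idx <- sample(nrow(df), n, replace = TRUE)
  df[idx, , drop = FALSE]
}

With that helper in place, you might write the auto-resampling wrapper like this: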

resample_lm0 <- function(formula, data, env = caller_env()) {
  formula <- enexpr(formula)
  resample_data <- resample(data, n = nrow(data))

  lm_call <- expr(lm(!!formula, data = resample_data))
  expr_print(lm_call)
  eval(lm_call, env)
}

df <- data.frame(x = 1:10, y = 5 + 3 * (1:10) + round(rnorm(10), 2))
resample_lm0(y ~ x, data = df)
#> lm(y ~ x, data = resample_data)
#> Error in is.data.frame(data): object 'resample_data' not found

Why doesn’t this code work? We’re evaluating lm_call in the caller environment, but resample_data exists in the execution environment. We could instead evaluate in the execution environment of resample_lm0(), but there’s no guarantee that formula could be evaluated in that environment.

There are two basic ways to overcome this challenge:

  1. Unquote the data frame into the call. This means that no lookup has to occur, but it has all the problems of inlining expressions (Section 19.4.7). For modelling functions this means that the captured call is suboptimal:

    resample_lm1 <- function(formula, data, env = caller_env()) {
      formula <- enexpr(formula)
      resample_data <- resample(data, n = nrow(data))
    
      lm_call <- expr(lm(!!formula, data = !!resample_data))
      expr_print(lm_call)
      eval(lm_call, env)
    }
    resample_lm1(y ~ x, data = df)$call
    #> lm(y ~ x, data = <data.frame>)
    #> lm(formula = y ~ x, data = list(x = c(3L, 7L, 4L, 4L, 
    #> 2L, 7L, 2L, 1L, 8L, 9L), y = c(13.21, 27.04, 18.63, 
    #> 18.63, 10.99, 27.04, 10.99, 7.83, 28.14, 32.72)))
  2. Alternatively you can create a new environment that inherits from the caller, and bind variables that you’ve created inside the function to that environment.

    resample_lm2 <- function(formula, data, env = caller_env()) {
      formula <- enexpr(formula)
      resample_data <- resample(data, n = nrow(data))
    
      lm_env <- env(env, resample_data = resample_data)
      lm_call <- expr(lm(!!formula, data = resample_data))
      expr_print(lm_call)
      eval(lm_call, lm_env)
    }
    resample_lm2(y ~ x, data = df)
    #> lm(y ~ x, data = resample_data)
    #> 
    #> Call:
    #> lm(formula = y ~ x, data = resample_data)
    #> 
    #> Coefficients:
    #> (Intercept)            x  
    #>        4.42         3.11

    This is more work, but gives the cleanest specification.

20.6.3 Exercises

  1. Why does this function fail?

    lm3a <- function(formula, data) {
      formula <- enexpr(formula)
    
      lm_call <- expr(lm(!!formula, data = data))
      eval(lm_call, caller_env())
    }
    lm3a(mpg ~ disp, mtcars)$call
    #> Error in as.data.frame.default(data, optional = TRUE): 
    #> cannot coerce class ‘"function"’ to a data.frame
  2. When model building, typically the response and data are relatively constant while you rapidly experiment with different predictors. Write a small wrapper that allows you to reduce duplication in the code below.

    lm(mpg ~ disp, data = mtcars)
    lm(mpg ~ I(1 / disp), data = mtcars)
    lm(mpg ~ disp * cyl, data = mtcars)
  3. Another way to write resample_lm() would be to include the resample expression (data[sample(nrow(data), replace = TRUE), , drop = FALSE]) in the data argument. Implement that approach. What are the advantages? What are the disadvantages?