4 Getting up to speed

While tidyverse grammars are easy to write in scripts and at the console, they make it a bit harder to reduce code duplication. Writing functions around dplyr pipelines and other tidyeval APIs requires a bit of special knowledge because these APIs use a special type of functions called quoting functions in order to make data first class.

If one-off code is often reasonable for common data analysis tasks, it is good practice to write reusable functions to reduce code duplication. In this introduction, you will learn about quoting functions, what challenges they pose for programming, and the solutions that tidy evaluation provides to solve those problems.

4.1 Writing functions

4.1.1 Reducing duplication

Writing functions is essential for the clarity and robustness of your code. Functions have several advantages:

  1. They prevent inconsistencies because they force multiple computations to follow a single recipe.

  2. They emphasise what varies (the arguments) and what is constant (every other component of the computation).

  3. They make change easier because you only need to modify one place.

  4. They make your code clearer if you give the function and its arguments informative names.

The process for creating a function is straightforward. First, recognise duplication in your code. A good rule of thumb is to create a function when you have copy-pasted a piece of code three times. Can you spot the copy-paste mistake in this duplicated code?

(df$a - min(df$a)) / (max(df$a) - min(df$a))
(df$b - min(df$b)) / (max(df$b) - min(df$b))
(df$c - min(df$c)) / (max(df$c) - min(df$c))
(df$d - min(df$d)) / (max(df$d) - min(df$c))

Now identify the varying parts of the expression and give each a name. x is an easy choice, but it is often a good idea to reflect the type of argument expected in the name. In our case we expect a numeric vector:

(num - min(num)) / (max(num) - min(num))
(num - min(num)) / (max(num) - min(num))
(num - min(num)) / (max(num) - min(num))
(num - min(num)) / (max(num) - min(num))

We can now create a function with a relevant name:

rescale01 <- function(num) {

}

Fill it with our deduplicated code:

rescale01 <- function(num) {
  (num - min(num)) / (max(num) - min(num))
}

And refactor a little to reduce duplication further and handle more cases:

rescale01 <- function(num) {
  rng <- range(num, na.rm = TRUE, finite = TRUE)
  (num - rng[[1]]) / (rng[[2]] - rng[[1]])
}

Now you can reuse your function any place you need it:

rescale01(df$a)
rescale01(df$b)
rescale01(df$c)
rescale01(df$d)

Reducing code duplication is as much needed with tidyverse grammars as with ordinary computations. Unfortunately, the straightforward process to create functions breaks down with grammars like dplyr, which we attach now.

library("dplyr")

To see the problem, let’s use the same function-writing process with a duplicated dplyr pipeline:

df1 %>% group_by(x1) %>% summarise(mean = mean(y1))
df2 %>% group_by(x2) %>% summarise(mean = mean(y2))
df3 %>% group_by(x3) %>% summarise(mean = mean(y3))
df4 %>% group_by(x4) %>% summarise(mean = mean(y4))

We first abstract out the varying parts by giving them informative names:

data %>% group_by(group_var) %>% summarise(mean = mean(summary_var))

And wrap the pipeline with a function taking these argument names:

grouped_mean <- function(data, group_var, summary_var) {
  data %>%
    group_by(group_var) %>%
    summarise(mean = mean(summary_var))
}

Unfortunately this function doesn’t actually work. When you call it dplyr complains that the variable group_var is unknown:

grouped_mean(mtcars, cyl, mpg)
#> Error: Must group by variables found in `.data`.
#> * Column `group_var` is not found.

Here is the proper way of defining this function:

grouped_mean <- function(data, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}
grouped_mean(mtcars, cyl, mpg)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1     4  26.7
#> 2     6  19.7
#> 3     8  15.1

To understand how that works, we need to learn about quoting functions and what special steps are needed to be effective at programming with them. Really we only need two new concepts forming together a single pattern: quoting and unquoting. This introduction will get you up to speed with this pattern.

4.1.2 What’s special about quoting functions?

R functions can be categorised in two broad categories: evaluating functions and quoting functions 4. These functions differ in the way they get their arguments. Evaluating functions take arguments as values. It does not matter what the expression supplied as argument is or which objects it contains. R computes the argument value following the standard rules of evaluation which the function receives passively 5.

The simplest regular function is identity(). It evaluates its single argument and returns the value. Because only the final value of the argument matters, all of these statements are completely equivalent:

identity(6)
#> [1] 6

identity(2 * 3)
#> [1] 6

a <- 2
b <- 3
identity(a * b)
#> [1] 6

On the other hand, a quoting function is not passed the value of an expression, it is passed the expression itself. We say the argument has been automatically quoted. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is quote(). It automatically quotes its argument and returns the quoted expression without any evaluation. Because only the expression passed as argument matters, none of these statements are equivalent:

quote(6)
#> [1] 6

quote(2 * 3)
#> 2 * 3

quote(a * b)
#> a * b

Other familiar quoting operators are "" and ~. The "" operator quotes a piece of text at parsing time and returns a string. This prevents the text from being interpreted as some R code to evaluate. The tilde operator is similar to the quote() function in that it prevents R code from being automatically evaluated and returns a quoted expression in the form of a formula. The expression is then used to define a statistical model in modelling functions. The three following expressions are doing something similar, they are quoting their input:

"a * b"
#> [1] "a * b"

~ a * b
#> ~a * b

quote(a * b)
#> a * b

The first statement returns a quoted string and the other two return quoted code in a formula or as a bare expression.

4.1.2.1 Quoting and evaluating in mundane R code

As an R programmer, you are probably already familiar with the distinction between quoting and evaluating functions. Take the case of subsetting a data frame column by name. The [[ and $ operators are both standard for this task but they are used in very different situations. The former supports indirect references like variables or expressions that represent a column name while the latter takes a column name directly:

df <- data.frame(
  y = 1,
  var = 2
)

var <- "y"
df[[var]]
#> [1] 1

df$y
#> [1] 1

Technically, [[ is an evaluating function while $ is a quoting function. You can indirectly refer to columns with [[ because the subsetting index is evaluated, allowing indirect references. The following expressions are completely equivalent:

df[[var]] # Indirect
#> [1] 1

df[["y"]] # Direct
#> [1] 1

But these are not:

df$var    # Direct
#> [1] 2

df$y      # Direct
#> [1] 1

The following table summarises the fundamental asymmetry between the two subsetting methods:

Quoted Evaluated
Direct df$y df[["y"]]
Indirect ??? df[[var]]

4.1.2.2 Detecting quoting functions

Because they work so differently to standard R code, it is important to recognise auto-quoted arguments. The documentation of the quoting function should normally tell you if an argument is quoted and evaluated in a special way. You can also detect quoted arguments by yourself with some experimentation. Let’s take the following expressions involving a mix of quoting and evaluating functions:

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select

mtcars2 <- subset(mtcars, cyl == 4)

sum(mtcars2$am)
#> [1] 8

rm(mtcars2)

A good indication that an argument is auto-quoted and evaluated in a special way is that the argument will not work correctly outside of its original context. Let’s try to break down each of these expressions in two steps by storing the arguments in an intermediary variable:

  1. library(MASS)

    temp <- MASS
    #> Error in eval(expr, envir, enclos): object 'MASS' not found
    
    temp <- "MASS"
    library(temp)
    #> Error in library(temp): there is no package called 'temp'

    We get these errors because there is no MASS object for R to find, and temp is interpreted by library() directly as a package name rather than as an indirect reference. Let’s try to break down the subset() expression:

  2. mtcars2 <- subset(mtcars, cyl == 4)

    temp <- cyl == 4
    #> Error in eval(expr, envir, enclos): object 'cyl' not found

    R cannot find cyl because we haven’t specified where to find it. This object exists only inside the mtcars data frame.

  3. sum(mtcars2$am)

    temp <- mtcars$am
    sum(temp)
    #> [1] 13

    It worked! sum() is an evaluating function and the indirect reference was resolved in the ordinary way.

  4. rm(mtcars2)

    mtcars2 <- mtcars
    temp <- "mtcars2"
    rm(temp)
    
    exists("mtcars2")
    #> [1] TRUE
    exists("temp")
    #> [1] FALSE

    This time there was no error, but we have accidentally removed the variable temp instead of the variable it was referring to. This is because rm() auto-quotes its arguments.

4.1.3 Unquotation

In practice, functions that evaluate their arguments are easier to program with because they support both direct and indirect references. For quoting functions, a piece of syntax is missing. We need the ability to unquote arguments.

4.1.3.1 Unquoting in base R

Base R provides three different ways of allowing direct references:

  • An extra function that evaluates its arguments. For instance the evaluating variant of the $ operator is [[.

  • An extra parameter that switches off auto-quoting. For instance library() evaluates its first argument if you set character.only to TRUE:

    temp <- "MASS"
    library(temp, character.only = TRUE)
  • An extra parameter that evaluates its argument. If you have a list of object names to pass to rm(), use the list argument:

    temp <- "mtcars2"
    rm(list = temp)
    
    exists("mtcars2")
    #> [1] FALSE

There is no general unquoting convention in base R so you have to read the documentation to figure out how to unquote an argument. Many functions like subset() or transform() do not provide any unquoting option at all.

4.1.3.2 Unquoting in the tidyverse!!

All quoting functions in the tidyverse support a single unquotation mechanism, the !! operator (pronounced bang-bang). You can use !! to cancel the automatic quotation and supply indirect references everywhere an argument is automatically quoted. In other words, unquoting lets you open a variable and use what’s inside instead.

First let’s create a couple of variables that hold references to columns from the mtcars data frame. A simple way of creating these references is to use the fundamental quoting function quote():

# Variables referring to columns `cyl` and `mpg`
x_var <- quote(cyl)
y_var <- quote(mpg)

x_var
#> cyl

y_var
#> mpg

Here are a few examples of how !! can be used in tidyverse functions to unquote these variables, i.e. open them and use their contents.

  • In dplyr most verbs quote their arguments:

    library("dplyr")
    
    by_cyl <- mtcars %>%
      group_by(!!x_var) %>%            # Open x_var
      summarise(mean = mean(!!y_var))  # Open y_var
    #> `summarise()` ungrouping output (override with `.groups` argument)
  • In ggplot2 aes() is the main quoting function:

    library("ggplot2")
    
    ggplot(mtcars, aes(!!x_var, !!y_var)) +  # Open x_var and y_var
      geom_point()

    ggplot2 also features vars() which is useful for facetting:

    ggplot(mtcars, aes(disp, drat)) +
      geom_point() +
      facet_grid(vars(!!x_var))  # Open x_var

Being able to make indirect references by opening variables with !! is rarely useful in scripts but is invaluable for writing functions. With !! we can now easily fix our wrapper function, as we’ll see in the following section.

4.1.4 Understanding !! with qq_show()

At this point it is normal if the concept of unquoting still feels nebulous. A good way of practicing this operation is to see for yourself what it is really doing. To that end the qq_show() function from the rlang package performs unquoting and prints the result to the screen. Here is what !! is really doing in the dplyr example (I’ve broken the pipeline into two steps for readability):

rlang::qq_show(mtcars %>% group_by(!!x_var))
#> mtcars %>% group_by(cyl)

rlang::qq_show(data %>% summarise(mean = mean(!!y_var)))
#> data %>% summarise(mean = mean(mpg))

Similarly for the ggplot2 pipeline:

rlang::qq_show(ggplot(mtcars, aes(!!x_var, !!y_var)))
#> ggplot(mtcars, aes(cyl, mpg))

rlang::qq_show(facet_grid(vars(!!x_var)))
#> facet_grid(vars(cyl))

As you can see, unquoting a variable that contains a reference to the column cyl is equivalent to directly supplying cyl to the dplyr function.

4.2 Quote and unquote

The basic process for creating tidyeval functions requires thinking a bit differently but is straightforward: quote and unquote.

  1. Use enquo() to make a function automatically quote its argument.
  2. Use !! to unquote the argument.

Apart from these additional two steps, the process is the same.

4.2.1 The abstraction step

We start as usual by identifying the varying parts of a computation and giving them informative names. These names become the arguments to the function.

grouped_mean <- function(data, group_var, summary_var) {
  data %>%
    group_by(group_var) %>%
    summarise(mean = mean(summary_var))
}

As we have seen earlier this function does not quite work yet so let’s fix it by applying the two new steps.

4.2.2 The quoting step

The quoting step is about making our ordinary function a quoting function. Not all parameters should be automatically quoted though. For instance the data argument refers to a real data frame that is passed around in the ordinary way. It is crucial to identify which parameters of your function should be automatically quoted: the parameters for which it is allowed to refer to columns in the data frames. In the example, group_var and summary_var are the parameters that refer to the data.

We know that the fundamental quoting function is quote() but how do we go about creating other quoting functions? This is the job of enquo(). While quote() quotes what you typed, enquo() quotes what your user typed. In other words it makes an argument automatically quote its input. This is exactly how dplyr verbs are created! Here is how to apply enquo() to the group_var and summary_var arguments:

group_var <- enquo(group_var)
summary_var <- enquo(summary_var)

4.2.3 The unquoting step

Finally we identify any place where these variables are passed to other quoting functions. That’s where we need to unquote with !!. In this case we pass group_var to group_by() and summary_var to summarise():

data %>%
  group_by(!!group_var) %>%
  summarise(mean = mean(!!summary_var))

4.2.4 Result

The finished function looks like this:

grouped_mean <- function(data, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}

And voilà!

grouped_mean(mtcars, cyl, mpg)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1     4  26.7
#> 2     6  19.7
#> 3     8  15.1

grouped_mean(mtcars, cyl, disp)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1     4  105.
#> 2     6  183.
#> 3     8  353.

grouped_mean(mtcars, am, disp)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>      am  mean
#>   <dbl> <dbl>
#> 1     0  290.
#> 2     1  144.

This simple quote-and-unquote pattern will get you a long way. It makes it possible to abstract complex combinations of quoting functions into a new quoting function. However this gets us in a sort of loop: quoting functions unquote inside other quoting functions and so on. At the start of the loop is the user typing expressions that are automatically quoted. But what if we can’t or don’t want to start with expressions typed by the user? What if we’d like to start with a character vector of column names?

4.3 Strings instead of quotes

So far we have created a quoting function that wraps around other quoting functions. How can we break this chain of quoting? How can we go from the evaluating world to the quoting universe? The most common way this transition occurs is when you start with a character vector of column names and somehow need to pass the corresponding columns to quoting functions like dplyr::mutate(), dplyr::select(), or ggplot2::aes(). We need a way of bridging evaluating and quoting functions.

First let’s see why simply unquoting strings does not work:

var <- "height"
mutate(starwars, rescaled = !!var * 100)
#> Error: Problem with `mutate()` input `rescaled`.
#> ✖ non-numeric argument to binary operator
#> ℹ Input `rescaled` is `"height" * 100`.

We get a type error. Observing the result of unquoting with qq_show() will shed some light on this:

rlang::qq_show(mutate(starwars, rescaled = !!var * 100))
#> mutate(starwars, rescaled = "height" * 100)

We have unquoted a string, and now dplyr tried to multiply that string by 100!

4.3.1 Strings

There is a fundamental difference between these two objects:

"height"
#> [1] "height"

quote(height)
#> height

"height" is a string and quote(height) is a symbol, or variable name. A symbol is much more than a string, it is a reference to an R object. That’s why you have to use symbols to refer to data frame columns. Fortunately transforming strings to symbols is straightforward with the tidy eval sym() function:

sym("height")
#> height

If you use sym() instead of enquo(), you end up with an evaluating function that transforms its inputs into symbols that can suitably be unquoted:

grouped_mean2 <- function(data, group_var, summary_var) {
  group_var <- sym(group_var)
  summary_var <- sym(summary_var)

  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}

With this simple change we now have an evaluating wrapper which can be used in the same way as [[. You can call grouped_mean2() with direct references:

grouped_mean2(starwars, "gender", "mass")
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   gender     mean
#>   <chr>     <dbl>
#> 1 feminine     NA
#> 2 masculine    NA
#> 3 <NA>         NA

Or indirect references:

grp_var <- "gender"
sum_var <- "mass"
grouped_mean2(starwars, grp_var, sum_var)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   gender     mean
#>   <chr>     <dbl>
#> 1 feminine     NA
#> 2 masculine    NA
#> 3 <NA>         NA

4.3.2 Character vectors of column names

What if you have a whole character vector of column names? You can transform vectors to a list of symbols with the plural variant syms():

cols <- syms(c("species", "gender"))

cols
#> [[1]]
#> species
#> 
#> [[2]]
#> gender

But now we have a list. Can we just unquote a list of symbols with !!?

group_by(starwars, !!cols)
#> Error: Problem with `mutate()` input `..1`.
#> ✖ Input `..1` can't be recycled to size 87.
#> ℹ Input `..1` is `<list>`.
#> ℹ Input `..1` must be size 87 or 1, not 2.

Something’s wrong. Using qq_show(), we see that group_by() gets a list instead of the individual symbols:

rlang::qq_show(group_by(starwars, !!cols))
#> group_by(starwars, <list: species, gender>)

We should unquote each symbol in the list as a separate argument. The big bang operator !!! makes this easy:

rlang::qq_show(group_by(starwars, !!cols[[1]], !!cols[[2]]))
#> group_by(starwars, species, gender)

rlang::qq_show(group_by(starwars, !!!cols))
#> group_by(starwars, species, gender)

Working with multiple arguments and lists of expressions requires specific techniques such as using !!!. These techniques are covered in the next chapter.


  1. In practice this is a bit more complex because most quoting functions evaluate at least one argument, usually the data argument.↩︎

  2. This is why regular functions are said to use standard evaluation unlike quoting functions which use non-standard evaluation (NSE). Note that the function is not entirely passive. Because arguments are lazily evaluated, the function gets to decide when an argument is evaluated, if at all.↩︎