8 dplyr

In the introductory vignette we learned that creating tidy eval functions boils down to a single pattern: quote and unquote. In this vignette we’ll apply this pattern in a series of recipes for dplyr.

This vignette is organised so that you can quickly find your way to a copy-paste solution when you face an immediate problem.

8.1 Patterns for single arguments

8.1.1 enquo() and !! - Quote and unquote arguments

We start with a quick recap of the introductory vignette. Creating a function around dplyr pipelines involves three steps: abstraction, quoting, and unquoting.

  • Abstraction step

    First identify the varying parts:

    df1 %>% group_by(x1) %>% summarise(mean = mean(y1))
    df2 %>% group_by(x2) %>% summarise(mean = mean(y2))
    df3 %>% group_by(x3) %>% summarise(mean = mean(y3))
    df4 %>% group_by(x4) %>% summarise(mean = mean(y4))

    And abstract those away with informative argument names:

    data %>% group_by(group_var) %>% summarise(mean = mean(summary_var))

    And wrap in a function:

    grouped_mean <- function(data, group_var, summary_var) {
      data %>%
        group_by(group_var) %>%
        summarise(mean = mean(summary_var))
    }
  • Quoting step

    Identify all the arguments where the user is allowed to refer to data frame columns directly. The function can’t evaluate these arguments right away. Instead they should be automatically quoted. Apply enquo() to these arguments:

    group_var <- enquo(group_var)
    summary_var <- enquo(summary_var)
  • Unquoting step

    Identify where these variables are passed to other quoting functions and unquote with !!. In this case we pass group_var to group_by() and summary_var to summarise():

    data %>%
      group_by(!!group_var) %>%
      summarise(mean = mean(!!summary_var))

We end up with a function that automatically quotes its arguments group_var and summary_var and unquotes them when they are passed to other quoting functions:

grouped_mean <- function(data, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}

grouped_mean(mtcars, cyl, mpg)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>     cyl  mean
#>   <dbl> <dbl>
#> 1     4  26.7
#> 2     6  19.7
#> 3     8  15.1
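
To see why the quoting and unquoting steps matter, the sketch below contrasts the naive wrapper from the abstraction step with the finished function. The names naive_mean, naive_result and quoted_result are hypothetical. The exact error message depends on the dplyr version, so the sketch only records that an error occurs.

```r
library(dplyr)

# Naive wrapper from the abstraction step: no quoting, no unquoting.
naive_mean <- function(data, group_var, summary_var) {
  data %>%
    group_by(group_var) %>%
    summarise(mean = mean(summary_var))
}

# `cyl` is looked up as an ordinary object in the calling environment,
# where it does not exist, so the call fails. Capture the error
# instead of stopping the script.
naive_result <- tryCatch(
  naive_mean(mtcars, cyl, mpg),
  error = function(e) e
)
inherits(naive_result, "error")

# Quote-and-unquote version from this section.
grouped_mean <- function(data, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}

quoted_result <- grouped_mean(mtcars, cyl, mpg)
names(quoted_result)
```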

8.1.2 as_label() - Create default column names

Use as_label() to transform a quoted expression to a column name:

simple_var <- quote(height)
as_label(simple_var)
#> [1] "height"

These names are only a default stopgap. For more complex uses, you’ll probably want to let the user override the default. Here is a case where the default name is clearly suboptimal:

complex_var <- quote(mean(height, na.rm = TRUE))
as_label(complex_var)
#> [1] "mean(height, na.rm = TRUE)"
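
Inside a function, as_label() is typically applied to the result of enquo(). The helper below is a minimal sketch of that combination; default_name is a hypothetical name used only for illustration.

```r
library(rlang)

# Hypothetical helper: quote the argument, then turn it into a default label.
default_name <- function(var) {
  as_label(enquo(var))
}

simple_label <- default_name(height)
complex_label <- default_name(mean(height, na.rm = TRUE))

simple_label
complex_label
```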

8.1.3 := and !! - Unquote column names

In expressions like c(name = NA), the argument name is quoted. Because of the quoting it’s not possible to make an indirect reference to a variable that contains a name:

name <- "the real name"
c(name = NA)
#> name 
#>   NA

In tidy eval functions it is possible to unquote argument names with !!. However you need the special := operator:

rlang::qq_show(c(!!name := NA))
#> c("the real name" := NA)

This unusual operator is needed because using ! on the left-hand side of = is not valid R code:

rlang::qq_show(c(!!name = NA))
#> Error: <text>:1:25: unexpected '='
#> 1: rlang::qq_show(c(!!name =
#>                             ^

Let’s use this !! technique to pass custom column names to group_by() and summarise():

grouped_mean <- function(data, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  # Create default column names
  group_nm <- as_label(group_var)
  summary_nm <- as_label(summary_var)

  # Prepend with an informative prefix
  group_nm <- paste0("group_", group_nm)
  summary_nm <- paste0("mean_", summary_nm)

  data %>%
    group_by(!!group_nm := !!group_var) %>%
    summarise(!!summary_nm := mean(!!summary_var))
}

grouped_mean(mtcars, cyl, mpg)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   group_cyl mean_mpg
#>       <dbl>    <dbl>
#> 1         4     26.7
#> 2         6     19.7
#> 3         8     15.1
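
Since default names are only a stopgap, one possible refinement is to let the caller override them. The sketch below assumes an optional summary_nm argument, which is not part of the original function; the default is computed with as_label() only when no name is supplied.

```r
library(dplyr)
library(rlang)

# `summary_nm` is a hypothetical optional argument added for illustration.
grouped_mean_named <- function(data, group_var, summary_var, summary_nm = NULL) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  # Fall back to a prefixed default name when the caller gives none
  if (is.null(summary_nm)) {
    summary_nm <- paste0("mean_", as_label(summary_var))
  }

  data %>%
    group_by(!!group_var) %>%
    summarise(!!summary_nm := mean(!!summary_var))
}

default_names <- names(grouped_mean_named(mtcars, cyl, mpg))
custom_names <- names(grouped_mean_named(mtcars, cyl, mpg, summary_nm = "avg"))

default_names
custom_names
```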

8.2 Patterns for multiple arguments

8.2.1 ... - Forward multiple arguments

We have created a function that takes one grouping variable and one summary variable. It would make sense to take multiple grouping variables instead of just one. Let’s adjust our function with a ... argument.

  1. Replace group_var by ...:

    function(data, ..., summary_var)
  2. Swap ... and summary_var, because arguments on the right-hand side of ... are harder to pass. They can only be passed with their full name explicitly specified, while arguments on the left-hand side can be passed without a name:

    function(data, summary_var, ...)
  3. It’s good practice to prefix named arguments with a . to reduce the risk of conflicts between your arguments and the arguments passed to ...:

    function(.data, .summary_var, ...)
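
The matching rule behind step 2 can be checked with base R alone. In this sketch, after_dots and before_dots are hypothetical functions that only report whether summary_var received a value.

```r
# Hypothetical signatures: report whether `summary_var` was matched.
after_dots <- function(data, ..., summary_var) !missing(summary_var)
before_dots <- function(data, summary_var, ...) !missing(summary_var)

# Positional values after `data` are all captured by `...`
after_positional <- after_dots(1, 2, 3)

# An argument placed after `...` is only matched by its full name
after_named <- after_dots(1, 2, summary_var = 3)

# An argument placed before `...` can be matched by position
before_positional <- before_dots(1, 2, 3)

c(after_positional, after_named, before_positional)
```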

Because of the magic of dots forwarding we don’t have to use the quote-and-unquote pattern. We can just pass ... to other quoting functions like group_by():

grouped_mean <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)

  .data %>%
    group_by(...) %>%  # Forward `...`
    summarise(mean = mean(!!summary_var))
}

grouped_mean(mtcars, disp, cyl, am)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 6 x 3
#> # Groups:   cyl [3]
#>     cyl    am  mean
#>   <dbl> <dbl> <dbl>
#> 1     4     0 136. 
#> 2     4     1  93.6
#> 3     6     0 205. 
#> 4     6     1 155  
#> 5     8     0 358. 
#> # … with 1 more row

Forwarding ... is straightforward but has the downside that you can’t modify the arguments or their names.
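
A short sketch of what this means in practice: names supplied by the caller pass through ... untouched, but the wrapper itself has no opportunity to change them. The function definition is repeated so the block runs on its own; forwarded_names is a hypothetical variable.

```r
library(dplyr)

grouped_mean <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)

  .data %>%
    group_by(...) %>%  # Forward `...` as is, names included
    summarise(mean = mean(!!summary_var))
}

# The caller renames one grouping column; the wrapper just forwards it
forwarded_names <- names(grouped_mean(mtcars, disp, cylinders = cyl, am))
forwarded_names
```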

8.2.2 enquos() and !!! - Quote and unquote multiple arguments

Quoting and unquoting multiple variables with ... is pretty much the same process as for single arguments:

  • Quoting multiple arguments can be done in two ways: internal quoting with the plural variant enquos() and external quoting with vars(). Use internal quoting when your function takes expressions with ... and external quoting when your function takes a list of expressions.

  • Unquoting multiple arguments requires a variant of !!, the unquote-splice operator !!! which unquotes each element of a list as an independent argument in the surrounding function call.

Quote the dots with enquos() and unquote-splice them with !!!:

grouped_mean2 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  group_vars <- enquos(...)  # Get a list of quoted dots

  .data %>%
    group_by(!!!group_vars) %>%  # Unquote-splice the list
    summarise(mean = mean(!!summary_var))
}

grouped_mean2(mtcars, disp, cyl, am)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 6 x 3
#> # Groups:   cyl [3]
#>     cyl    am  mean
#>   <dbl> <dbl> <dbl>
#> 1     4     0 136. 
#> 2     4     1  93.6
#> 3     6     0 205. 
#> 4     6     1 155  
#> 5     8     0 358. 
#> # … with 1 more row

The quote-and-unquote pattern does more work than simply forwarding ..., yet here it gives exactly the same result. Don’t do this extra work unless you need to modify the arguments or their names.
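
To make the intermediate list visible, the sketch below captures the dots outside of any dplyr verb and inspects them. capture_dots is a hypothetical helper, and expr() is used here only to show the call that !!! builds.

```r
library(rlang)

# Hypothetical helper: return the quoted dots as a list of quosures
capture_dots <- function(...) enquos(...)

quoted <- capture_dots(cyl, am)

# One list element per argument passed through `...`
n_quoted <- length(quoted)

# Default labels of the quoted arguments
quoted_labels <- unname(vapply(quoted, as_label, character(1)))

# `!!!` splices each element as a separate argument of the call
spliced <- expr(group_by(!!!quoted))

n_quoted
quoted_labels
```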

8.2.3 expr() - Modify quoted arguments

Modifying quoted expressions is often necessary when dealing with multiple arguments. Say we’d like a grouped_mean() variant that takes multiple summary variables rather than multiple grouping variables. We need to somehow take the mean() of each summary variable.

One easy way is to use the quote-and-unquote pattern with expr(). This function is just like quote() from base R. It plainly returns your argument, quoted:

quote(height)
#> height

expr(height)
#> height


quote(mean(height))
#> mean(height)

expr(mean(height))
#> mean(height)

But expr() has a twist: it has full unquoting support:

vars <- list(quote(height), quote(mass))

expr(mean(!!vars[[1]]))
#> mean(height)

expr(group_by(!!!vars))
#> group_by(height, mass)

You can loop over a list of arguments and modify each of them:

purrr::map(vars, function(var) expr(mean(!!var, na.rm = TRUE)))
#> [[1]]
#> mean(height, na.rm = TRUE)
#> 
#> [[2]]
#> mean(mass, na.rm = TRUE)

This makes it easy to take multiple summary variables, wrap each of them in a call to mean(), and then unquote-splice them within summarise():

grouped_mean3 <- function(.data, .group_var, ...) {
  group_var <- enquo(.group_var)
  summary_vars <- enquos(...)  # Get a list of quoted summary variables

  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  .data %>%
    group_by(!!group_var) %>%
    summarise(!!!summary_vars)  # Unquote-splice the list
}
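
A usage sketch for this variant, with the definition repeated so the block runs on its own. The choice of mtcars and the variable names multi_means and expected_disp are illustrative assumptions. The summary columns get default names derived from the expressions, so the check below compares values rather than column names.

```r
library(dplyr)
library(rlang)

grouped_mean3 <- function(.data, .group_var, ...) {
  group_var <- enquo(.group_var)
  summary_vars <- enquos(...)  # Get a list of quoted summary variables

  # Wrap each quoted variable in a call to mean()
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  .data %>%
    group_by(!!group_var) %>%
    summarise(!!!summary_vars)  # Unquote-splice the list
}

# One grouping variable, two summary variables
multi_means <- grouped_mean3(mtcars, cyl, disp, hp)

# Reference value computed with base R for comparison
expected_disp <- as.numeric(tapply(mtcars$disp, mtcars$cyl, mean))
all.equal(multi_means[[2]], expected_disp)
```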

8.2.4 vars() - Quote multiple arguments externally

How could we take multiple summary variables in addition to multiple grouping variables? Internal quoting with ... has a major disadvantage: the arguments in ... can only have one purpose. If you need to quote multiple sets of variables you have to delegate the quoting to another function. That’s the purpose of vars() which quotes its arguments and returns a list:

vars(species, gender)
#> <list_of<quosure>>
#> 
#> [[1]]
#> <quosure>
#> expr: ^species
#> env:  global
#> 
#> [[2]]
#> <quosure>
#> expr: ^gender
#> env:  global

The arguments can be complex expressions and have names:

vars(h = height, m = mass / 100)
#> <list_of<quosure>>
#> 
#> $h
#> <quosure>
#> expr: ^height
#> env:  global
#> 
#> $m
#> <quosure>
#> expr: ^mass / 100
#> env:  global

When the quoting is external you don’t use enquos(). Simply take lists of expressions in your function and forward the lists to other quoting functions with !!!:

grouped_mean3 <- function(data, group_vars, summary_vars) {
  stopifnot(
    is.list(group_vars),
    is.list(summary_vars)
  )

  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  data %>%
    group_by(!!!group_vars) %>%
    summarise(n = n(), !!!summary_vars)
}

grouped_mean3(starwars, vars(species, gender), vars(height))
#> `summarise()` regrouping output by 'species' (override with `.groups` argument)
#> # A tibble: 42 x 4
#> # Groups:   species [38]
#>   species  gender        n `mean(height, na.rm = TRUE)`
#>   <chr>    <chr>     <int>                        <dbl>
#> 1 Aleena   masculine     1                           79
#> 2 Besalisk masculine     1                          198
#> 3 Cerean   masculine     1                          198
#> 4 Chagrian masculine     1                          196
#> 5 Clawdite feminine      1                          168
#> # … with 37 more rows

grouped_mean3(starwars, vars(gender), vars(height, mass))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 4
#>   gender        n `mean(height, na.rm = TRUE)` `mean(mass, na.rm = TRUE)`
#>   <chr>     <int>                        <dbl>                      <dbl>
#> 1 feminine     17                         165.                       54.7
#> 2 masculine    66                         177.                      106. 
#> 3 <NA>          4                         181.                       48

One advantage of vars() is that it lets users specify their own names:

grouped_mean3(starwars, vars(gender), vars(h = height, m = mass))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 4
#>   gender        n     h     m
#>   <chr>     <int> <dbl> <dbl>
#> 1 feminine     17  165.  54.7
#> 2 masculine    66  177. 106. 
#> 3 <NA>          4  181.  48

8.2.5 enquos(.named = TRUE) - Automatically add default names

If you pass .named = TRUE to enquos() the unnamed expressions are automatically given default names:

f <- function(...) names(enquos(..., .named = TRUE))

f(height, mean(mass))
#> [1] "height"     "mean(mass)"

User-supplied names are never overridden:

f(height, m = mean(mass))
#> [1] "height" "m"

This is handy when you need to modify the names of quoted expressions. In this example we’ll ensure the list is named before adding a prefix:

grouped_mean2 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  group_vars <- enquos(..., .named = TRUE)  # Ensure quoted dots are named

  # Prefix the names of the list of quoted dots
  names(group_vars) <- paste0("group_", names(group_vars))

  .data %>%
    group_by(!!!group_vars) %>%  # Unquote-splice the list
    summarise(mean = mean(!!summary_var))
}

grouped_mean2(mtcars, disp, cyl, am)
#> `summarise()` regrouping output by 'group_cyl' (override with `.groups` argument)
#> # A tibble: 6 x 3
#> # Groups:   group_cyl [3]
#>   group_cyl group_am  mean
#>       <dbl>    <dbl> <dbl>
#> 1         4        0 136. 
#> 2         4        1  93.6
#> 3         6        0 205. 
#> 4         6        1 155  
#> 5         8        0 358. 
#> # … with 1 more row

One big downside of this technique is that all arguments get a prefix, including the arguments that were given specific names by the user:

grouped_mean2(mtcars, disp, c = cyl, a = am)
#> `summarise()` regrouping output by 'group_c' (override with `.groups` argument)
#> # A tibble: 6 x 3
#> # Groups:   group_c [3]
#>   group_c group_a  mean
#>     <dbl>   <dbl> <dbl>
#> 1       4       0 136. 
#> 2       4       1  93.6
#> 3       6       0 205. 
#> 4       6       1 155  
#> 5       8       0 358. 
#> # … with 1 more row

In general it’s better to preserve the names explicitly passed by the user. To do that we can’t automatically add default names with enquos(), because once the list is fully named we don’t have any way of detecting which arguments were passed with explicit names. We’ll have to add default names manually with quos_auto_name().

8.2.6 quos_auto_name() - Manually add default names

It can be helpful to add default names to the list of quoted dots manually:

  • We can detect which arguments were explicitly named by the user.
  • The default names can be applied to lists returned by vars().

Let’s add default names manually with quos_auto_name() to lists of externally quoted variables. We’ll detect unnamed arguments and only add a prefix to this subset of arguments. This way we preserve user-supplied names:

grouped_mean3 <- function(data, group_vars, summary_vars) {
  stopifnot(
    is.list(group_vars),
    is.list(summary_vars)
  )

  # Detect unnamed arguments before any default names are added:
  unnamed <- names(summary_vars) == ""

  # Add the default names:
  summary_vars <- rlang::quos_auto_name(summary_vars)

  prefixed_nms <- paste0("mean_", names(summary_vars)[unnamed])
  names(summary_vars)[unnamed] <- prefixed_nms

  # Expand the argument _after_ giving the list its default names
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  data %>%
    group_by(!!!group_vars) %>%
    summarise(n = n(), !!!summary_vars)  # Unquote-splice the renamed list
}

Note how we add the default names before wrapping the arguments in a mean() call. This way we avoid including mean() in the name:

as_label(quote(mass))
#> [1] "mass"

as_label(quote(mean(mass, na.rm = TRUE)))
#> [1] "mean(mass, na.rm = TRUE)"

We get nicely prefixed default names:

grouped_mean3(starwars, vars(gender), vars(height, mass))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 4
#>   gender        n mean_height mean_mass
#>   <chr>     <int>       <dbl>     <dbl>
#> 1 feminine     17        165.      54.7
#> 2 masculine    66        177.     106. 
#> 3 <NA>          4        181.      48

And the user is able to fully override the names:

grouped_mean3(starwars, vars(gender), vars(h = height, m = mass))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 4
#>   gender        n     h     m
#>   <chr>     <int> <dbl> <dbl>
#> 1 feminine     17  165.  54.7
#> 2 masculine    66  177. 106. 
#> 3 <NA>          4  181.  48
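
The same detection logic should also apply to internally quoted dots. The sketch below adapts the earlier grouped_mean2() so that only unnamed grouping variables get the prefix; grouped_mean_keep and kept_names are hypothetical names, and mtcars is used purely for illustration.

```r
library(dplyr)
library(rlang)

# Hypothetical variant of grouped_mean2() that preserves user-supplied names
grouped_mean_keep <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  group_vars <- enquos(...)  # No `.named = TRUE`: keep empty names for now

  # Record which dots the caller left unnamed, then add default names
  unnamed <- names(group_vars) == ""
  group_vars <- quos_auto_name(group_vars)

  # Prefix only the names that were generated by default
  names(group_vars)[unnamed] <- paste0("group_", names(group_vars)[unnamed])

  .data %>%
    group_by(!!!group_vars) %>%
    summarise(mean = mean(!!summary_var))
}

# `cyl` gets a prefixed default name, `a` is kept as supplied
kept_names <- names(grouped_mean_keep(mtcars, disp, cyl, a = am))
kept_names
```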

8.3 select()

TODO

8.4 filter()

TODO

8.5 case_when()

TODO