3 Do you need tidy eval?

In computer science, frameworks like tidy evaluation are known as metaprogramming. Modifying the blueprints of computations amounts to programming the program, i.e. metaprogramming. In other languages, this type of approach is often seen as a last resort because it requires new skills and might make your code harder to read. Things are different in R because of the importance of data masking functions, but it is still good advice to consider other options before turning to tidy evaluation. In this section, we review several strategies for solving programming problems with tidyverse packages.

Before diving into tidy eval, make sure you know the fundamentals of programming with the tidyverse. These are likely to give a better return on the time you invest, and they are also useful for solving problems outside the tidyverse.

  • Fixed column names. A solid function taking data frames with fixed column names is better than a brittle function that uses tidy eval.

  • Automating loops. dplyr excels at automating loops. Acquiring a good command of rowwise vectorisation and columnwise mapping may prove very useful.

Tidy evaluation is not all-or-nothing: it encompasses a wide range of features and techniques. Here are a few techniques that are easy to pick up in your workflow:

  • Passing expressions through {{ and the dots (...).
  • Passing column names to .data[[ and one_of().

All these techniques make it possible to reuse existing components of tidyverse grammars and compose them into new functions.
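As a rough illustration of the techniques listed above, here is a minimal sketch. The helpers mean_by() and mean_of_name() are hypothetical names invented for this example and are not part of any package; the sketch assumes dplyr is installed with a version of rlang that supports {{.

```r
library(dplyr)

# Hypothetical helper: `var` is an expression forwarded with {{ }},
# and `...` are grouping expressions passed on unchanged.
mean_by <- function(data, var, ...) {
  data %>%
    group_by(...) %>%
    summarise(mean = mean({{ var }}, na.rm = TRUE))
}

# Hypothetical helper: `name` is a column name supplied as a string,
# used to subset the .data pronoun.
mean_of_name <- function(data, name) {
  data %>%
    summarise(mean = mean(.data[[name]], na.rm = TRUE))
}

by_cyl  <- mean_by(mtcars, mpg, cyl)      # one row per value of cyl
overall <- mean_of_name(mtcars, "mpg")    # one row for the whole data frame
```

Both helpers reuse summarise() as is; the only new ingredient is how the column is handed over, as an expression in the first case and as a string in the second.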

3.1 Fixed column names

A simple solution is to write functions that expect data frames containing specific column names. If the computation always operates on the same columns and nothing varies, you don’t need any tidy eval. On the other hand, your users must ensure that these columns exist as part of their data cleaning process. This is why this technique primarily makes sense when you’re writing functions tailored to your own data analysis uses, or perhaps in functions that interface with a specific web API for retrieving data. In general, fixed column names are task-specific.

Say we have a simple pipeline that computes the body mass index for each observation in a tibble:

starwars %>% transmute(bmi = mass / (height / 100)^2)
#> # A tibble: 87 x 1
#>     bmi
#>   <dbl>
#> 1  26.0
#> 2  26.9
#> 3  34.7
#> 4  33.3
#> 5  21.8
#> # … with 82 more rows

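We could extract this code into a function that takes data frames with columns mass and height. Note that the function below divides by height^2 directly, so it expects height to be expressed in meters already: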

compute_bmi <- function(data) {
  data %>% transmute(bmi = mass / height^2)
}

It’s always a good idea to check the inputs of your functions and fail early with an informative error message when their assumptions are not met. In this case, we should validate the data frame and throw an error when it does not contain the expected columns:

compute_bmi <- function(data) {
  if (!all(c("mass", "height") %in% names(data))) {
    stop("`data` must contain `mass` and `height` columns")
  }

  data %>% transmute(bmi = mass / height^2)
}

iris %>% compute_bmi()
#> Error in compute_bmi(.): `data` must contain `mass` and `height` columns
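The error message can be made more actionable by naming the columns that are actually missing. The following is only a sketch: check_columns() is a hypothetical helper written for this example, using base R only.

```r
# Hypothetical helper: fail early and list the missing columns.
check_columns <- function(data, required) {
  missing <- setdiff(required, names(data))
  if (length(missing) > 0) {
    stop(
      "`data` is missing columns: ",
      paste(missing, collapse = ", "),
      call. = FALSE
    )
  }
  # Return the input invisibly so the check can sit in a pipeline
  invisible(data)
}

# Capture the message instead of aborting, for illustration only
msg <- tryCatch(
  check_columns(iris, c("mass", "height")),
  error = function(e) conditionMessage(e)
)
```

A function such as compute_bmi() could call this helper on its first line instead of spelling out the check inline.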

In fact, we could go even further and validate the contents of the columns in addition to their names:

compute_bmi <- function(data) {
  if (!all(c("mass", "height") %in% names(data))) {
    stop("`data` must contain `mass` and `height` columns")
  }

  mean_height <- round(mean(data$height, na.rm = TRUE), 1)
  if (mean_height > 3) {
    warning(glue::glue(
      "Average height is { mean_height }, is it scaled in meters?"
    ))
  }

  data %>% transmute(bmi = mass / height^2)
}

starwars %>% compute_bmi()
#> Warning in compute_bmi(.): Average height is 174.4, is it scaled in meters?
#> # A tibble: 87 x 1
#>       bmi
#>     <dbl>
#> 1 0.00260
#> 2 0.00269
#> 3 0.00347
#> 4 0.00333
#> 5 0.00218
#> # … with 82 more rows

starwars %>% mutate(height = height / 100) %>% compute_bmi()
#> # A tibble: 87 x 1
#>     bmi
#>   <dbl>
#> 1  26.0
#> 2  26.9
#> 3  34.7
#> 4  33.3
#> 5  21.8
#> # … with 82 more rows

Spending your programming time on the domain logic of your function, such as input and scale validation, may have a greater payoff than learning tidy eval just to improve its syntax. It makes your function more robust to faulty data and reduces the risks of erroneous analyses.

3.2 Automating loops

Most programming problems involve iteration because data transformations are typically achieved element by element, by applying the same recipe over and over again. There are two main ways of automating iteration in R: vectorisation and mapping. Learning how to juggle the different ways of expressing loops is not only an important step towards acquiring a good command of R and the tidyverse, it will also make you more proficient at solving programming problems.

3.2.1 Vectorisation in dplyr

dplyr is designed to optimise iteration by taking advantage of the vectorisation of many R functions. Rowwise vectorisation is achieved through normal R rules, which dplyr augments with groupwise vectorisation.

3.2.1.1 Rowwise vectorisation

Rowwise vectorisation in dplyr is a consequence of normal R rules for vectorisation. A vectorised function is a function that works the same way with vectors of 1 element as with vectors of n elements. The operation is applied elementwise (often at the machine code level, which makes it very efficient). We have already mentioned the vectorisation of toupper(), and many other functions in R are vectorised. One important class of vectorised functions is the arithmetic operators:

# Dividing 1 element
1 / 10
#> [1] 0.1

# Dividing 5 elements
1:5 / 10
#> [1] 0.1 0.2 0.3 0.4 0.5

Technically, a function is vectorised when:

  • It returns a vector as long as the input.
  • Applying the function on a single element yields the same result as applying it on the whole vector and then subsetting the element.

In other words, a vectorised function fn fulfills the following identity:

fn(x[[i]]) == fn(x)[[i]]
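The two conditions can be checked mechanically. Here is a small sketch in base R; holds_identity() is a hypothetical helper made up for this illustration, and it only tests the identity on the particular vector it is given.

```r
# Hypothetical helper: does `fn` satisfy fn(x[[i]]) == fn(x)[[i]] on `x`?
holds_identity <- function(fn, x) {
  out <- fn(x)

  # First condition: the output is as long as the input
  if (length(out) != length(x)) {
    return(FALSE)
  }

  # Second condition: each element can be computed in isolation
  all(vapply(
    seq_along(x),
    function(i) isTRUE(all.equal(fn(x[[i]]), out[[i]])),
    logical(1)
  ))
}

x <- c(1, 4, 9)
sqrt_ok   <- holds_identity(sqrt, x)    # TRUE: elementwise
mean_ok   <- holds_identity(mean, x)    # FALSE: returns a single value
cumsum_ok <- holds_identity(cumsum, x)  # FALSE: depends on earlier elements
```

The cumsum() case shows why the first condition alone is not enough: the output has the right length, but each value depends on the elements that come before it.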

When you mix vectorised and non-vectorised operations, the combined operation is itself vectorised when the last operation to run is vectorised. Here we’ll combine the vectorised / function with the summary function mean(). The result of this operation is a vector that has the same length as the LHS of /:

x <- 1:5
x / mean(x)
#> [1] 0.33 0.67 1.00 1.33 1.67

Note that the other combination of operations is not vectorised because in that case the summary operation has the last word:

mean(x / 10)
#> [1] 0.3

The dplyr verb mutate() expects vector semantics. The operations defining new columns typically return vectors as long as their inputs:

data <- tibble(x = rnorm(5, sd = 10))

data %>%
  mutate(rescaled = x / sd(x))
#> # A tibble: 5 x 2
#>          x rescaled
#>      <dbl>    <dbl>
#> 1 -14.0    -1.09   
#> 2   2.55    0.199  
#> 3 -24.4    -1.90   
#> 4  -0.0557 -0.00434
#> 5   6.22    0.484

In fact, mutate() enforces vectorisation. Returning a smaller vector is an error unless it has size 1. If the result of a mutate expression has size 1, it is automatically recycled to the tibble or group size. This ensures that all columns have the same length and fit within the tibble constraints of rectangular data:

data %>%
  mutate(sigma = sd(x))
#> # A tibble: 5 x 2
#>          x sigma
#>      <dbl> <dbl>
#> 1 -14.0     12.8
#> 2   2.55    12.8
#> 3 -24.4     12.8
#> 4  -0.0557  12.8
#> 5   6.22    12.8
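To see the size rule in action, here is a sketch on a small made-up tibble (toy is an illustrative name, not data from this chapter). The exact wording of the error depends on the dplyr version, so the example only records that an error occurred.

```r
library(dplyr)

toy <- tibble(x = c(2, 4, 6, 8, 10))

# A size-1 result is recycled to the number of rows
recycled <- toy %>% mutate(sigma = sd(x))

# A size-2 result is neither 1 nor 5 elements long, so mutate() signals
# an error; tryCatch() turns it into a flag for illustration
outcome <- tryCatch(
  {
    toy %>% mutate(bad = c(1, 2))
    "ok"
  },
  error = function(e) "error"
)
```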

In contrast to mutate(), the dplyr verb summarise() expects summary operations that return a single value:

data %>%
  summarise(sd(x))
#> # A tibble: 1 x 1
#>   `sd(x)`
#>     <dbl>
#> 1    12.8

3.2.1.2 Groupwise vectorisation

Things get interesting with grouped tibbles. dplyr augments the vectorisation of normal R functions with groupwise vectorisation. If your tibble has ngroup groups, the operations are repeated ngroup times.

my_division <- function(x, y) {
  message("I was just called")
  x / y
}

# Called 1 time
data %>%
  mutate(new = my_division(x, 10))
#> I was just called
#> # A tibble: 5 x 2
#>          x      new
#>      <dbl>    <dbl>
#> 1 -14.0    -1.40   
#> 2   2.55    0.255  
#> 3 -24.4    -2.44   
#> 4  -0.0557 -0.00557
#> 5   6.22    0.622

gdata <- data %>% group_by(g = c("a", "a", "b", "b", "c"))

# Called 3 times
gdata %>%
  mutate(new = my_division(x, 10))
#> I was just called
#> I was just called
#> I was just called
#> # A tibble: 5 x 3
#> # Groups:   g [3]
#>          x g          new
#>      <dbl> <chr>    <dbl>
#> 1 -14.0    a     -1.40   
#> 2   2.55   a      0.255  
#> 3 -24.4    b     -2.44   
#> 4  -0.0557 b     -0.00557
#> 5   6.22   c      0.622

If the operation is entirely vectorised, the result will be the same whether the tibble is grouped or not, since elementwise computations are not affected by the values of other elements. But as soon as summary operations are involved, the result depends on the grouping structure because the summaries are computed from group sections instead of whole columns.

# Marginal rescaling
data %>%
  mutate(new = x / sd(x))
#> # A tibble: 5 x 2
#>          x      new
#>      <dbl>    <dbl>
#> 1 -14.0    -1.09   
#> 2   2.55    0.199  
#> 3 -24.4    -1.90   
#> 4  -0.0557 -0.00434
#> 5   6.22    0.484

# Conditional rescaling
gdata %>%
  mutate(new = x / sd(x))
#> # A tibble: 5 x 3
#> # Groups:   g [3]
#>          x g          new
#>      <dbl> <chr>    <dbl>
#> 1 -14.0    a     -1.20   
#> 2   2.55   a      0.218  
#> 3 -24.4    b     -1.42   
#> 4  -0.0557 b     -0.00324
#> 5   6.22   c     NA

Whereas rowwise vectorisation automates loops over the elements of a column, groupwise vectorisation automates loops over the levels of a grouping specification. The combination of these is very powerful.
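A common use of this combination is centering a variable within groups: the summary mean() is computed once per group, and the subtraction is then applied elementwise within each group. A minimal sketch with mtcars, where mpg_centred is a column name chosen for this example:

```r
library(dplyr)

centred <- mtcars %>%
  group_by(cyl) %>%
  # mean(mpg) is the group mean; the subtraction is elementwise
  mutate(mpg_centred = mpg - mean(mpg)) %>%
  ungroup()

# Within each level of cyl, the centred values now average to zero
group_means <- tapply(centred$mpg_centred, centred$cyl, mean)
```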

3.2.2 Looping over columns

Rowwise and groupwise vectorisations are means of looping in the direction of rows, applying the same operation to each group and each element. What if you’d like to apply an operation in the direction of columns? This is possible in dplyr by mapping functions over columns.

Mapping functions is part of the functional programming approach. If you’re going to spend some time learning new programming concepts, acquiring functional programming skills is likely to have a higher payoff than learning about the metaprogramming concepts of tidy evaluation. Functional programming is inherent to R as it underlies the apply() family of functions in base R and the map() family from the purrr package. It is a powerful tool to add to your quiver.

3.2.2.1 Mapping functions

Everything that exists in R is an object, including functions. If you type the name of a function without parentheses, you get the function object instead of the result of calling the function:

toupper
#> function (x) 
#> {
#>     if (!is.character(x)) 
#>         x <- as.character(x)
#>     .Internal(toupper(x))
#> }
#> <bytecode: 0x5582414a9398>
#> <environment: namespace:base>

In its simplest form, functional programming is about passing a function object as an argument to another function, called a mapper function, which iterates over a vector, applies the function to each element, and returns all results in a new vector. In other words, a mapper function writes loops so you don’t have to. Here is a manual loop that applies toupper() over all elements of a character vector and returns a new vector:

new <- character(length(letters))

for (i in seq_along(letters)) {
  new[[i]] <- toupper(letters[[i]])
}

new
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#> [20] "T" "U" "V" "W" "X" "Y" "Z"

Using a mapper function results in much leaner code. Here we apply toupper() over all elements of letters and return the results as a character vector, as indicated by the suffix _chr:

new <- purrr::map_chr(letters, toupper)
new
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#> [20] "T" "U" "V" "W" "X" "Y" "Z"

In practice, functional programming is all about hiding for loops, which are abstracted away by the mapper functions that automate the iteration.
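To make the idea concrete, a mapper can be sketched in a few lines of base R by wrapping the manual loop in a function that takes the function to apply as an argument. my_map_chr() is a hypothetical, simplified stand-in written for this example; the real purrr::map_chr() does more checking.

```r
# Hypothetical simplified mapper: the loop lives here, once and for all
my_map_chr <- function(x, fn) {
  out <- character(length(x))
  for (i in seq_along(x)) {
    out[[i]] <- fn(x[[i]])
  }
  out
}

# The caller only supplies the data and the function object
upper <- my_map_chr(letters, toupper)
```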

Mapping is an elegant way of transforming data element by element, but it’s not the only one. For instance, toupper() is actually a vectorised function that already operates on whole vectors element by element. The fastest and leanest code is just:

toupper(letters)
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#> [20] "T" "U" "V" "W" "X" "Y" "Z"

Mapping functions are more useful with functions that are not vectorised or for computations over lists and data frame columns where the vectorisation occurs within the elements or columns themselves. In the following example, we apply a summarising function over all columns of a data frame:

purrr::map_int(mtcars, n_distinct)
#>  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
#>   25    3   27   22   22   29   30    2    2    3    6
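For comparison, roughly the same computation can be written with the base R mapper vapply(), which takes the expected type and size of each result as its third argument. This sketch uses length(unique()) in place of n_distinct() so that it needs no packages:

```r
# Count distinct values in each column, expecting one integer per column
counts <- vapply(
  mtcars,
  function(col) length(unique(col)),
  integer(1)
)
```

Like map_int(), vapply() returns a named integer vector with one element per column.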

3.2.2.2 Scoped dplyr variants

dplyr provides variants of the main data manipulation verbs that map functions over a selection of columns. These verbs are known as the scoped variants and are recognizable from their _at, _if and _all suffixes.

Scoped verbs support three sorts of selection:

  1. _all verbs operate on all columns of the data frame. You can summarise all columns of a data frame within groups with summarise_all():

    iris %>% group_by(Species) %>% summarise_all(mean)
    #> # A tibble: 3 x 5
    #>   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
    #>   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
    #> 1 setosa             5.01        3.43         1.46       0.246
    #> 2 versicolor         5.94        2.77         4.26       1.33 
    #> 3 virginica          6.59        2.97         5.55       2.03

  2. _if verbs operate conditionally, on all columns for which a predicate returns TRUE. If you are familiar with purrr, the idea is similar to the conditional mapper purrr::map_if(). Promoting all character columns of a data frame as grouping variables is as simple as:

    starwars %>% group_by_if(is.character)
    #> # A tibble: 87 x 14
    #> # Groups:   name, hair_color, skin_color, eye_color, sex, gender, homeworld,
    #> #   species [87]
    #>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
    #>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
    #> 1 Luke…    172    77 blond      fair       blue            19   male  mascu…
    #> 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
    #> 3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
    #> 4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
    #> 5 Leia…    150    49 brown      light      brown           19   fema… femin…
    #> # … with 82 more rows, and 5 more variables: homeworld <chr>, species <chr>,
    #> #   films <list>, vehicles <list>, starships <list>

  3. _at verbs operate on a selection of columns. You can supply integer vectors of column positions or character vectors of column names.

    mtcars %>% summarise_at(1:2, mean)
    #>   mpg cyl
    #> 1  20 6.2
    
    mtcars %>% summarise_at(c("disp", "drat"), median)
    #>   disp drat
    #> 1  196  3.7

    More interestingly, you can use vars() (see footnote 1) to supply the same sort of expressions you would pass to select()! The selection helpers make it very convenient to craft a selection of columns to map over.

    starwars %>% summarise_at(vars(height:mass), mean)
    #> # A tibble: 1 x 2
    #>   height  mass
    #>    <dbl> <dbl>
    #> 1     NA    NA
    
    starwars %>% summarise_at(vars(ends_with("_color")), n_distinct)
    #> # A tibble: 1 x 3
    #>   hair_color skin_color eye_color
    #>        <int>      <int>     <int>
    #> 1         13         31        15
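The mapped function does not have to be a named one. As a sketch, the scoped verbs also accept purrr-style formulas, where .x stands for the current column. Here every mtcars column whose name starts with "d" (disp and drat) is rescaled by its maximum:

```r
library(dplyr)

scaled <- mtcars %>%
  # `~ .x / max(.x)` is shorthand for function(.x) .x / max(.x)
  mutate_at(vars(starts_with("d")), ~ .x / max(.x))
```

Columns outside the selection, such as mpg, are left untouched.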

The scoped variants of mutate() and summarise() are the closest analogue to base::lapply() and purrr::map(). Unlike pure list mappers, the scoped verbs fully implement the dplyr semantics, such as groupwise vectorisation or the summary constraints:

# map() returns a simple list with the results
mtcars[1:5] %>% purrr::map(mean)
#> $mpg
#> [1] 20
#> 
#> $cyl
#> [1] 6.2
#> 
#> $disp
#> [1] 231
#> 
#> $hp
#> [1] 147
#> 
#> $drat
#> [1] 3.6

# `mutate_` variants recycle to group size
mtcars[1:5] %>% mutate_all(mean)
#> # A tibble: 32 x 5
#>     mpg   cyl  disp    hp  drat
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60
#> 2  20.1  6.19  231.  147.  3.60
#> 3  20.1  6.19  231.  147.  3.60
#> 4  20.1  6.19  231.  147.  3.60
#> 5  20.1  6.19  231.  147.  3.60
#> # … with 27 more rows

# `summarise_` variants enforce a size 1 constraint
mtcars[1:5] %>% summarise_all(mean)
#> # A tibble: 1 x 5
#>     mpg   cyl  disp    hp  drat
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60

# All scoped verbs know about groups
mtcars[1:5] %>% group_by(cyl) %>% summarise_all(mean)
#> # A tibble: 3 x 5
#>     cyl   mpg  disp    hp  drat
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  26.7  105.  82.6  4.07
#> 2     6  19.7  183. 122.   3.59
#> 3     8  15.1  353. 209.   3.23

The other scoped variants also accept optional functions to map over the selection of columns. For instance, you could group by a selection of variables and transform them on the fly:

iris %>% group_by_if(is.factor, as.character)
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> 5          5           3.6          1.4         0.2 setosa 
#> # … with 145 more rows

or transform the column names of selected variables:

storms %>% select_at(vars(name:hour), toupper)
#> # A tibble: 10,010 x 5
#>   NAME   YEAR MONTH   DAY  HOUR
#>   <chr> <dbl> <dbl> <int> <dbl>
#> 1 Amy    1975     6    27     0
#> 2 Amy    1975     6    27     6
#> 3 Amy    1975     6    27    12
#> 4 Amy    1975     6    27    18
#> 5 Amy    1975     6    28     0
#> # … with 10,005 more rows

The scoped variants lie at the intersection of purrr and dplyr and combine the rowwise looping mechanisms of dplyr with the columnwise mapping of purrr. This is a powerful combination.
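Note that in dplyr 1.0.0 and later, the scoped variants are superseded by across(), which is used inside the regular verbs. The scoped verbs still work, so the following is only a sketch of how one of the examples above might be written in the newer style, assuming a recent dplyr:

```r
library(dplyr)

# Scoped variant, as used in this section
scoped <- iris %>%
  group_by(Species) %>%
  summarise_all(mean)

# Rough equivalent with across(); grouping columns are excluded
# from everything() automatically
with_across <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))
```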


  1. vars() is the function that quotes your expressions and returns blueprints to its caller. This pattern of letting an external helper quote the arguments is called external quoting.