8 Avoid dependencies between arguments

8.1 What’s the problem?

Avoid creating dependencies between details arguments so that only certain combinations are permitted. Dependencies between arguments makes functions harder to use because you have to remember how arguments interact, and you when reading a call, you need to read multiple arguments before interpreting one.

8.2 What are some examples?

  • In rep() you can supply both times and each unless times is a vector:

    rep(1:3, times = 2, each = 3)
    #>  [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
    rep(1:3, times = 1:3, each = 2)
    #> Error in rep(1:3, times = 1:3, each = 2): invalid 'times' argument

    Learn more in Chapter 11.

  • In var(), na.rm is only used if use is not set. If you supply both use and na.rm, na.rm is silently ignored.

  • In rgamma() you can provide either scale or rate. If you supply both you will get an error or a warning:

    rgamma(5, shape = 1, rate = 2, scale = 1/2)
    #> Warning in rgamma(5, shape = 1, rate = 2, scale = 1/2): specify 'rate' or
    #> 'scale' but not both
    #> [1] 0.91093006 1.45306085 0.08907445 0.60093980 0.21717070
    rgamma(5, shape = 1, rate = 2, scale = 2)
    #> Error in rgamma(5, shape = 1, rate = 2, scale = 2): specify 'rate' or 'scale' but not both
  • grepl() has perl, fixed, and ignore.case arguments which can either be TRUE or FALSE. But fixed = TRUE overrides perl = TRUE, and ignore.case only works if fixed = FALSE. Both fixed and perl change how another argument, pattern, is interpreted.

  • In library() the character.only argument changes how package is intepreted:

    ggplot2 <- "dplyr"
    
    # Loads ggplot2
    library(ggplot2)
    
    # Loads dplyr
    library(ggplot2, character.only = TRUE)
  • forcats::fct_lump() decides which algorithm to use based on a combination of the n and prop arguments.

  • In ggplot2::geom_histogram(), you can specify the histogram breaks in three ways: as a number of bins, as the width of each bin (binwidth, plus center or boundary), or the exact breaks. You can only pick one of the three options, which is hard to convey in the documentation. There’s also an implied precedence so that if more than one option is supplied, one will silently win.

  • In readr::locale() there’s a complex dependency between decimal_mark and grouping_mark because they can’t be the same value, and the US and Europe use different standards.

See Sections 10.3.1 and 10.3.2 for a two exceptions where the dependency is via specific patterns of missing arguments.

8.3 Why is this important?

Having complicated interdependencies between arguments has major downsides:

  • It suggests that there are many more viable code paths than there really are and all those (unnecessary) possibilities still occupy head space. You have to memorise the set of allowed combinations, rather than them being implied by the structure of the function.

  • It increases implementation complexity. Interdependence of arguments suggests complex implementation paths which are harder to analyse and test.

  • It makes documentation harder to write. You have to use extra words to explain exactly how combinations of arguments work together, and it’s not obvious where those words should go. If there’s an interaction between arg_a and arg_b do you document with arg_a, with arg_b, or with both?

8.4 How do I remediate?

Often these problems arise because the scope of a function grows over time. When the function was initially designed, the scope was small, and it grew incrementally over time. At no point did it seem worth the additional effort to refactor to a new design, but now you have a large complex function. This makes the problem hard to avoid.

To remediate the problem, you’ll need to think holistically and reconsider the complete interface. There are two common outcomes which are illustrated in the case studies below:

  • Splitting the function into multiple functions that each do one thing.

  • Encapulsating related details arguments into a single object.

See also larger case study in Chapter 11 where this problem is tangled up with other problems.

If these changes to the interface occur to exported functions in a package, you’ll need to consider how to preserve the interface with deprecation warnings. For important functions, it is worth generating an message that includes new code to copy and paste.

8.4.1 Case study: fct_lump()

There are many different ways to decide how to lump uncommon factor levels together, and initially we attempted to encode these through arguments to fct_lump(). However, over time as the number of arguments increased, it gets harder and harder to tell what the options are. Currently there are three behaviours:

  • Both n and prop missing - merge together the least frequent levels, ensuring that other is still the smallest level. (For this case, the ties.method argument is ignored.)

  • Only n supplied: if positive, preserves n most common values.

  • Only prop supplied: if positive, preserves

  • Both n and prop supplied: due to a bug in the code, this is treated the same way as both n and prop missing! (But it really should be an error)

Would be better to break into three functions:

  • fct_lump_n(f, n)
  • fct_lump_prop(f, prop)
  • fct_lump_smallest(f)

That has three advantages:

  • The name of function helps remind you of the purpose.

  • There’s no way to supply both n and prop.

  • The ties.method argument would only appear in fct_lump_n() and _prop(), not _smallest().

8.4.2 Case study: grepl() vs stringr::str_detect()

grepl(), has three arguments that take either FALSE or TRUE: ignore.case, perl, fixed, which might suggest that there are 2 ^ 3 = 16 possible invocations. However, a number of combinations are not allowed:

x <- grepl("a", letters, fixed = TRUE, ignore.case = TRUE)
#> Warning in grepl("a", letters, fixed = TRUE, ignore.case = TRUE): argument
#> 'ignore.case = TRUE' will be ignored
x <- grepl("a", letters, fixed = TRUE, perl = TRUE)
#> Warning in grepl("a", letters, fixed = TRUE, perl = TRUE): argument 'perl =
#> TRUE' will be ignored

Part of this problem could be resolved by making it more clear that one important choice is the matching engine to use: POSIX 1003.2 extended regular expressions (the default), Perl-style regular expressions (perl = TRUE) or fixed matching (fixed = TRUE). A better approach would be to use the pattern in Chapter 12, and create a new argument called something like engine = c("POSIX", "perl", "fixed").

The other problem is that ignore.case can only affect two of the three engines: POSIX and perl. This is hard to remedy without creating a completely new matching engine. Anything to do with case is always harder than you might expect because different languages have different rules.

stringr takes a different approach, encoding the engine as an attribute of the pattern:

library(stringr)

x <- str_detect(letters, "a")
# short for:
x <- str_detect(letters, regex("a"))
x <- str_detect(letters, fixed("a"))
x <- str_detect(letters, coll("a"))

This has the advantage that each engine can take different arguments.

An alternative approach would be to have a separate engine argument:

x <- str_detect(letters, "a", engine = regex())
x <- str_detect(letters, "a", engine = fixed())
x <- str_detect(letters, "a", engine = coll())

This approach is a bit more discoverable (because there’s clearly another argument that affects the pattern), but it’s slightly less general, because of the boundary() engine, which doesn’t match patterns but boundaries:

x <- str_detect(letters, boundary("word"))
# Seems confusing: now you can omit the pattern argument?
x <- str_detect(letters, engine = boundary("word"))

It would also mean that you had an argument engine, that affected how another argument, pattern, was interpreted, so it would repeat the problem in a slightly different form.

It’s appealing to all the details of the match wrapped up into a single object.