24 Type-stability

The less you need to know about a function’s inputs to predict the type of its output, the better. Ideally, a function should either always return the same type of thing, or return something that can be trivially computed from its inputs.

If a function is type-stable it satisfies two conditions:

  • You can predict the output type based only on the input types (not their values).

  • If the function uses ..., the order of arguments in does not affect the output type.

library(vctrs)

24.1 Simple examples

  • purrr::map() and base::lapply() are trivially type-stable because they always return lists.

  • paste() is type stable because it always returns a character vector.

    vec_ptype(paste(1))
    #> character(0)
    vec_ptype(paste("x"))
    #> character(0)
  • base::mean(x) almost always returns the same type of output as x. For example, the mean of a numeric vector is a numeric vector, and the mean of a date-time is a date-time.

    vec_ptype(mean(1))
    #> numeric(0)
    vec_ptype(mean(Sys.time()))
    #> POSIXct of length 0
  • ifelse() is not type-stable because the output type depends on the value:

    vec_ptype(ifelse(NA, 1L, 2))
    #> <unspecified> [0]
    vec_ptype(ifelse(FALSE, 1L, 2))
    #> numeric(0)
    vec_ptype(ifelse(TRUE, 1L, 2))
    #> integer(0)

24.2 More complicated examples

Some functions are more complex because they take multiple input types and have to return a single output type. This includes functions like c() and ifelse(). The rules governing base R functions are idiosyncratic, and each function tends to apply it’s own slightly different set of rules. Tidy functions should use the consistent set of rules provided by the vctrs package.

24.3 Challenge: the median

A more challenging example is median(). The median of a vector is a value that (as evenly as possible) splits the vector into a lower half and an upper half. In the absence of ties, mean(x > median(x)) == mean(x <= median(x)) == 0.5. The median is straightforward to compute for odd lengths: you simply order the vector and pick the value in the middle, i.e. sort(x)[(length(x) - 1) / 2]. It’s clear that the type of the output should be the same type as x, and this algorithm can be applied to any vector that can be ordered.

But what if the vector has an even length? In this case, there’s no longer a unique median, and by convention we usually take the mean of the middle two numbers.

In R, this makes the median() not type-stable:

typeof(median(1:3))
#> [1] "integer"
typeof(median(1:4))
#> [1] "double"

Base R doesn’t appear to follow a consistent principle when computing the median of a vector of length 2. Factors throw an error, but dates do not (even though there’s no date half way between two days that differ by an odd number of days).

median(factor(1:2))
#> Error in median.default(factor(1:2)): need numeric data
median(Sys.Date() + 0:1)
#> [1] "2020-09-23"

To be clear, the problems caused by this behaviour are quite small in practice, but it makes the analysis of median() more complex, and it makes it difficult to decide what principle you should adhere to when creating median methods for new vector classes.

median("foo")
#> [1] "foo"
median(c("foo", "bar"))
#> Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
#> is not numeric or logical: returning NA
#> [1] NA

24.4 Exercises

  1. How is a date like an integer? Why is this inconsistent?

    vec_ptype(mean(Sys.Date()))
    #> Date of length 0
    vec_ptype(mean(1L))
    #> numeric(0)