3.4 S3 atomic vectors

One of the most important vector attributes is class, which underlies the S3 object system. Having a class attribute turns an object into an S3 object, which means it will behave differently from a regular vector when passed to a generic function. Every S3 object is built on top of a base type, and often stores additional information in other attributes. You’ll learn the details of the S3 object system, and how to create your own S3 classes, in Chapter 13.

In this section, we’ll discuss four important S3 vectors used in base R:

  • Categorical data, where values come from a fixed set of levels recorded in factor vectors.

  • Dates (with day resolution), which are recorded in Date vectors.

  • Date-times (with second or sub-second resolution), which are stored in POSIXct vectors.

  • Durations, which are stored in difftime vectors.

3.4.1 Factors

A factor is a vector that can contain only predefined values. It is used to store categorical data. Factors are built on top of an integer vector with two attributes: a class, “factor”, which makes it behave differently from regular integer vectors, and levels, which defines the set of allowed values.

x <- factor(c("a", "b", "b", "a"))
x
#> [1] a b b a
#> Levels: a b

typeof(x)
#> [1] "integer"
attributes(x)
#> $levels
#> [1] "a" "b"
#> $class
#> [1] "factor"
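To make this structure concrete, the following sketch assembles an equivalent factor by hand from an integer vector plus the two attributes. The names codes and by_hand are illustrative only; in practice factor() is the right tool.

```r
# Integer codes index into the levels vector.
codes <- c(1L, 2L, 2L, 1L)
by_hand <- structure(codes, levels = c("a", "b"), class = "factor")

x <- factor(c("a", "b", "b", "a"))
identical(by_hand, x)            # TRUE: same codes, same attributes

# Indexing the levels by the codes recovers the original strings.
labels <- levels(x)[unclass(x)]
labels                           # "a" "b" "b" "a"
```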

Factors are useful when you know the set of possible values but they’re not all present in a given dataset. In contrast to a character vector, when you tabulate a factor you’ll get counts of all categories, even unobserved ones:

sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))

table(sex_char)
#> sex_char
#> m 
#> 3
table(sex_factor)
#> sex_factor
#> m f 
#> 3 0

Ordered factors are a minor variation of factors. In general, they behave like regular factors, but the order of the levels is meaningful (low, medium, high), a property that some modelling and visualisation functions use automatically.

grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade
#> [1] b b a c
#> Levels: c < b < a
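One practical consequence of the meaningful order is that comparison and sorting work on level position rather than on alphabetical order. A brief sketch, reusing the grade vector from above (below_a and best are illustrative names):

```r
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))

# Comparisons use the position of each value in the levels, c < b < a.
below_a <- grade < "a"
below_a                # TRUE TRUE FALSE TRUE

# Summaries such as max() and sort() respect the same order.
best <- max(grade)
as.character(best)     # "a"
sort(grade)            # c b b a

# An ordered factor carries an extra class in front of "factor".
class(grade)           # "ordered" "factor"
```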

In base R[1] you tend to encounter factors very frequently because many base R functions (like read.csv() and data.frame()) automatically convert character vectors to factors; this was the default before R 4.0.0. This is suboptimal because there’s no way for those functions to know the set of all possible levels or their correct order: the levels are a property of theory or experimental design, not of the data. Instead, use the argument stringsAsFactors = FALSE (the default from R 4.0.0 onwards) to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the “theoretical” data. To learn about the historical context of this behaviour, I recommend “stringsAsFactors: An unauthorized biography” by Roger Peng, and “stringsAsFactors = <sigh>” by Thomas Lumley.
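The manual conversion might look like the sketch below. The data frame df and its size column are hypothetical, made up purely for illustration; the point is that the levels come from your knowledge of the design, including a level that happens not to occur in the data.

```r
# Hypothetical data, kept as plain character on input.
df <- data.frame(
  size = c("small", "large", "small"),
  stringsAsFactors = FALSE
)
is.character(df$size)   # TRUE

# Supply the full, correctly ordered set of levels yourself.
df$size <- factor(df$size, levels = c("small", "medium", "large"))

# The unobserved level "medium" is still counted.
counts <- table(df$size)
counts
```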

While factors look like (and often behave like) character vectors, they are built on top of integers, so be careful when treating them like strings. Some string methods (like gsub() and grepl()) will automatically coerce factors to strings, others (like nchar()) will throw an error, and still others (like c() in versions before R 4.1.0) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour.
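The difference between the integer codes and the labels is easiest to see with a factor whose labels look like numbers. This is a minimal sketch; f and nchar_fails are illustrative names.

```r
f <- factor(c("10", "20", "10"))

as.integer(f)                 # 1 2 1: the underlying codes, not the labels
as.character(f)               # "10" "20" "10"
as.numeric(as.character(f))   # 10 20 10: go via character to get the labels

# grepl() coerces the factor to character before matching.
grepl("1", f)                 # TRUE FALSE TRUE

# nchar() refuses a factor, so we capture the error and convert explicitly.
nchar_fails <- inherits(try(nchar(f), silent = TRUE), "try-error")
nchar_fails                   # TRUE
nchar(as.character(f))        # 2 2 2
```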

3.4.2 Dates

Date vectors are built on top of double vectors. They have class “Date” and no other attributes:

today <- Sys.Date()

typeof(today)
#> [1] "double"
attributes(today)
#> $class
#> [1] "Date"

The value of the double (which can be seen by stripping the class) represents the number of days since 1970-01-01[2]:

date <- as.Date("1970-02-01")
unclass(date)
#> [1] 31
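Because a Date is just a day count, date arithmetic is ordinary arithmetic on the underlying double. The sketch below illustrates this; before_epoch, next_week, and rebuilt are illustrative names.

```r
date <- as.Date("1970-02-01")
unclass(date)                             # 31

# Dates before the epoch are stored as negative numbers.
before_epoch <- unclass(as.Date("1969-12-31"))
before_epoch                              # -1

# Adding a number adds that many days.
next_week <- date + 7
format(next_week)                         # "1970-02-08"

# Attaching the class to a bare number gives the corresponding date.
rebuilt <- structure(31, class = "Date")
format(rebuilt)                           # "1970-02-01"
```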

3.4.3 Date-times

Base R[3] provides two ways of storing date-time information: POSIXct and POSIXlt. These are admittedly odd names: “POSIX” is short for Portable Operating System Interface, which is a family of cross-platform standards. “ct” stands for calendar time (the time_t type in C), and “lt” for local time (the struct tm type in C). Here we’ll focus on POSIXct, because it’s the simplest, is built on top of an atomic vector, and is most appropriate for use in data frames. POSIXct vectors are built on top of double vectors, where the value represents the number of seconds since 1970-01-01.

now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
now_ct
#> [1] "2018-08-01 22:00:00 UTC"

typeof(now_ct)
#> [1] "double"
attributes(now_ct)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> $tzone
#> [1] "UTC"

The tzone attribute controls only how the date-time is formatted; it does not control the instant of time represented by the vector. Note that the time is not printed if it is midnight.

structure(now_ct, tzone = "Asia/Tokyo")
#> [1] "2018-08-02 07:00:00 JST"
structure(now_ct, tzone = "America/New_York")
#> [1] "2018-08-01 18:00:00 EDT"
structure(now_ct, tzone = "Australia/Lord_Howe")
#> [1] "2018-08-02 08:30:00 +1030"
structure(now_ct, tzone = "Europe/Paris")
#> [1] "2018-08-02 CEST"
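One way to check that tzone affects only the display is to compare the underlying numbers. In this sketch, tokyo is an illustrative name, and the time zone name is assumed to be available in your system's time zone database.

```r
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
tokyo <- structure(now_ct, tzone = "Asia/Tokyo")

# The stored value is the same number of seconds since 1970-01-01 UTC.
as.numeric(now_ct)                  # 1533160800
as.numeric(tokyo)                   # 1533160800

# Only the formatted wall-clock time differs.
format(now_ct, "%Y-%m-%d %H:%M")    # "2018-08-01 22:00"
format(tokyo, "%Y-%m-%d %H:%M")     # "2018-08-02 07:00"

# The two objects therefore represent the same instant.
as.numeric(tokyo - now_ct)          # 0
```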

3.4.4 Durations

Durations, which represent the amount of time between pairs of dates or date-times, are stored in difftimes. Difftimes are built on top of doubles, and have a units attribute that determines how the value should be interpreted:

one_week_1 <- as.difftime(1, units = "weeks")
one_week_1
#> Time difference of 1 weeks

typeof(one_week_1)
#> [1] "double"
attributes(one_week_1)
#> $class
#> [1] "difftime"
#> $units
#> [1] "weeks"

one_week_2 <- as.difftime(7, units = "days")
one_week_2
#> Time difference of 7 days

typeof(one_week_2)
#> [1] "double"
attributes(one_week_2)
#> $class
#> [1] "difftime"
#> $units
#> [1] "days"
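Although the two objects above store different numbers, they describe the same duration. The sketch below shows how the units attribute is taken into account when converting or comparing; gap is an illustrative name.

```r
one_week_1 <- as.difftime(1, units = "weeks")
one_week_2 <- as.difftime(7, units = "days")

# The raw value depends on the units ...
as.numeric(one_week_1)                   # 1
# ... but you can ask for it in other units.
as.numeric(one_week_1, units = "days")   # 7

# Comparison converts both sides to a common unit first.
one_week_1 == one_week_2                 # TRUE

# Subtracting two dates yields a difftime; changing its units rescales the value.
gap <- as.Date("2018-08-08") - as.Date("2018-08-01")
units(gap)                               # "days"
units(gap) <- "weeks"
as.numeric(gap)                          # 1
```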

3.4.5 Exercises

  1. What sort of object does table() return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?

  2. What happens to a factor when you modify its levels?

    f1 <- factor(letters)
    levels(f1) <- rev(levels(f1))

  3. What does this code do? How do f2 and f3 differ from f1?

    f2 <- rev(factor(letters))
    f3 <- factor(letters, levels = rev(letters))

  1. The tidyverse never automatically coerces characters to factors, and provides the forcats (Wickham 2018) package specifically for working with factors.

  2. This special date is known as the Unix Epoch.

  3. The tidyverse provides the lubridate (Grolemund and Wickham 2011) package for working with date-times. It provides a number of convenient helpers that work with the base POSIXct type.