3.4 S3 atomic vectors
One of the most important vector attributes is class, which underlies the S3 object system. Having a class attribute turns an object into an S3 object, which means it will behave differently from a regular vector when passed to a generic function. Every S3 object is built on top of a base type, and often stores additional information in other attributes. You’ll learn the details of the S3 object system, and how to create your own S3 classes, in Chapter 13.
In this section, we’ll discuss four important S3 vectors used in base R:
Categorical data, where values come from a fixed set of levels recorded in factor vectors.
Dates (with day resolution), which are recorded in Date vectors.
Date-times (with second or sub-second resolution), which are stored in POSIXct vectors.
Durations, which are stored in difftime vectors.
A factor is a vector that can contain only predefined values. It is used to store categorical data. Factors are built on top of an integer vector with two attributes: a class, “factor”, which makes it behave differently from regular integer vectors, and levels, which defines the set of allowed values.
x <- factor(c("a", "b", "b", "a"))
x
#> [1] a b b a
#> Levels: a b

typeof(x)
#> [1] "integer"
attributes(x)
#> $levels
#> [1] "a" "b"
#>
#> $class
#> [1] "factor"
Factors are useful when you know the set of possible values but they’re not all present in a given dataset. In contrast to a character vector, when you tabulate a factor you’ll get counts of all categories, even unobserved ones:
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))

table(sex_char)
#> sex_char
#> m
#> 3
table(sex_factor)
#> sex_factor
#> m f
#> 3 0
Ordered factors are a minor variation of factors. In general, they behave like regular factors, but the order of the levels is meaningful (for example, low, medium, high), a property that some modelling and visualisation functions use automatically.
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade
#> [1] b b a c
#> Levels: c < b < a
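One practical consequence of meaningful level order is that ordered factors support comparisons. The following is a small sketch that reuses the grade vector from above; the particular comparisons are chosen only for illustration.

```r
# Levels are ordered c < b < a, so comparisons follow that order,
# not alphabetical order.
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))

grade < "a"
#> [1]  TRUE  TRUE FALSE  TRUE

max(grade)
#> [1] a
#> Levels: c < b < a

sort(grade)
#> [1] c b b a
#> Levels: c < b < a
```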
In base R you tend to encounter factors very frequently because, in versions of R before 4.0.0, many base R functions (like data.frame()) automatically convert character vectors to factors by default. This is suboptimal because there’s no way for those functions to know the set of all possible levels or their correct order: the levels are a property of theory or experimental design, not of the data. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the “theoretical” data. To learn about the historical context of this behaviour, I recommend "stringsAsFactors: An unauthorized biography" by Roger Peng, and "stringsAsFactors = <sigh>" by Thomas Lumley.
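A minimal sketch of that workflow follows. The data frame df, its size column, and the three levels are hypothetical, invented only to illustrate the idea that you supply the levels from your own knowledge of the design.

```r
# Keep the column as character on input, whatever the R version's default.
df <- data.frame(
  size = c("small", "large", "small"),
  stringsAsFactors = FALSE
)
class(df$size)
#> [1] "character"

# Convert manually, supplying the full set of levels in the intended order,
# including "medium", which does not appear in the data.
df$size <- factor(df$size, levels = c("small", "medium", "large"))
levels(df$size)
#> [1] "small"  "medium" "large"

# Counts per level, including the unobserved one.
as.vector(table(df$size))
#> [1] 2 0 1
```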
While factors look like (and often behave like) character vectors, they are built on top of integers. So be careful when treating them like strings. Some string methods (like grepl()) will automatically coerce factors to strings, others (like nchar()) will throw an error, and still others (like c(), in versions of R before 4.1.0) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour.
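The difference between the integer codes and the labels is easiest to see with a factor whose labels look like numbers. This is an illustrative sketch; the vector f is made up for the example.

```r
f <- factor(c("10", "20", "10"))

# Converting directly to integer returns the underlying codes, not the labels.
as.integer(f)
#> [1] 1 2 1

# Converting to character first recovers the labels.
as.character(f)
#> [1] "10" "20" "10"
as.integer(as.character(f))
#> [1] 10 20 10

# grepl() coerces the factor to strings for you.
grepl("1", f)
#> [1]  TRUE FALSE  TRUE

# nchar() does not, so convert explicitly.
nchar(as.character(f))
#> [1] 2 2 2
```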
Date vectors are built on top of double vectors. They have class “Date” and no other attributes:
today <- Sys.Date()

typeof(today)
#> [1] "double"
attributes(today)
#> $class
#> [1] "Date"
The value of the double (which can be seen by stripping the class) represents the number of days since 1970-01-01:
date <- as.Date("1970-02-01")
unclass(date)
#> [1] 31
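The relationship also works in the other direction. As a small sketch, you can rebuild a Date from a day count by adding the class back, and dates before 1970-01-01 are simply stored as negative numbers.

```r
date <- as.Date("1970-02-01")
days <- unclass(date)
days
#> [1] 31

# Adding the class attribute back turns the double into a Date again.
structure(days, class = "Date")
#> [1] "1970-02-01"

# Date arithmetic is arithmetic on the underlying number of days.
as.Date("1970-01-01") + 31
#> [1] "1970-02-01"

# Dates before 1970-01-01 have negative values.
unclass(as.Date("1969-12-31"))
#> [1] -1
```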
Base R provides two ways of storing date-time information, POSIXct and POSIXlt. These are admittedly odd names: “POSIX” is short for Portable Operating System Interface, which is a family of cross-platform standards. “ct” stands for calendar time (the time_t type in C), and “lt” for local time (the struct tm type in C). Here we’ll focus on POSIXct, because it’s the simpler of the two, is built on top of an atomic vector, and is most appropriate for use in data frames. POSIXct vectors are built on top of double vectors, where the value represents the number of seconds since 1970-01-01.
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
now_ct
#> [1] "2018-08-01 22:00:00 UTC"

typeof(now_ct)
#> [1] "double"
attributes(now_ct)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] "UTC"
The tzone attribute controls only how the date-time is formatted; it does not control the instant of time represented by the vector. Note that the time is not printed if it is midnight.
structure(now_ct, tzone = "Asia/Tokyo")
#> [1] "2018-08-02 07:00:00 JST"
structure(now_ct, tzone = "America/New_York")
#> [1] "2018-08-01 18:00:00 EDT"
structure(now_ct, tzone = "Australia/Lord_Howe")
#> [1] "2018-08-02 08:30:00 +1030"
structure(now_ct, tzone = "Europe/Paris")
#> [1] "2018-08-02 CEST"
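One way to convince yourself that tzone affects only the display is to compare the underlying values. In this sketch, tokyo is a name introduced just for the example, and it assumes the "Asia/Tokyo" time zone is available on your system.

```r
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
tokyo <- structure(now_ct, tzone = "Asia/Tokyo")

# The stored number of seconds is the same in both objects.
as.numeric(now_ct) == as.numeric(tokyo)
#> [1] TRUE

# So the two values compare as the same instant.
now_ct == tokyo
#> [1] TRUE

# Only the formatted clock time differs.
format(now_ct, "%H:%M")
#> [1] "22:00"
format(tokyo, "%H:%M")
#> [1] "07:00"
```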
Durations, which represent the amount of time between pairs of dates or date-times, are stored in difftimes. Difftimes are built on top of doubles, and have a units attribute that determines how the double should be interpreted:
one_week_1 <- as.difftime(1, units = "weeks")
one_week_1
#> Time difference of 1 weeks

typeof(one_week_1)
#> [1] "double"
attributes(one_week_1)
#> $class
#> [1] "difftime"
#>
#> $units
#> [1] "weeks"

one_week_2 <- as.difftime(7, units = "days")
one_week_2
#> Time difference of 7 days

typeof(one_week_2)
#> [1] "double"
attributes(one_week_2)
#> $class
#> [1] "difftime"
#>
#> $units
#> [1] "days"
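Because the units attribute determines the meaning of the stored number, changing the units rescales the value. The sketch below also shows that subtracting two dates yields a difftime; the name gap and the two dates are chosen only for illustration.

```r
one_week_1 <- as.difftime(1, units = "weeks")

# Extract the value expressed in a different unit.
as.numeric(one_week_1, units = "days")
#> [1] 7

# Changing the units converts the stored value to match.
units(one_week_1) <- "days"
one_week_1
#> Time difference of 7 days

# Subtracting two dates produces a difftime.
gap <- as.Date("2018-08-08") - as.Date("2018-08-01")
gap
#> Time difference of 7 days
units(gap)
#> [1] "days"
```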
What sort of object does table() return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
What happens to a factor when you modify its levels?
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
What does this code do? How do f2 and f3 differ from f1?
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))
Note: 1970-01-01, the reference date for Date and POSIXct values, is known as the Unix Epoch.