4.5 Applications
The principles described above have a wide variety of useful applications. Some of the most important are described below. While many of the basic principles of subsetting have already been incorporated into functions like subset(), merge(), and dplyr::arrange(), a deeper understanding of how those principles have been implemented will be valuable when you run into situations where the functions you need don’t exist.
4.5.1 Lookup tables (character subsetting)
Character matching is a powerful way to create lookup tables. Say you want to convert abbreviations:
x <- c("m", "f", "u", "f", "f", "m", "m")
lookup <- c(m = "Male", f = "Female", u = NA)
lookup[x]
#>        m        f        u        f        f        m        m
#>   "Male" "Female"       NA "Female" "Female"   "Male"   "Male"
Note that if you don’t want names in the result, use unname() to remove them.
unname(lookup[x])
#> [1] "Male" "Female" NA "Female" "Female" "Male" "Male"
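As a small sketch of what happens when a code is missing from the table, the example below adds a hypothetical abbreviation "x" that has no entry in lookup. Character subsetting returns NA for it, so a value mapped to NA on purpose (like "u") and a code that is simply absent look the same in the result. Checking the names with %in% is one way to tell them apart.

```r
lookup <- c(m = "Male", f = "Female", u = NA)
codes <- c("m", "x", "f")   # "x" is a hypothetical code that is not in the table

# Character subsetting returns NA for names that are not found
res <- unname(lookup[codes])
res
#> [1] "Male"   NA       "Female"

# Flag codes that are absent from the table (as opposed to mapped to NA)
absent <- codes[!codes %in% names(lookup)]
absent
#> [1] "x"
```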
4.5.2 Matching and merging by hand (integer subsetting)
You can also have more complicated lookup tables with multiple columns of information. For example, suppose we have a vector of integer grades, and a table that describes their properties:
grades <- c(1, 2, 2, 3, 1)

info <- data.frame(
  grade = 3:1,
  desc = c("Excellent", "Good", "Poor"),
  fail = c(FALSE, FALSE, TRUE)
)
Then, let’s say we want to duplicate the info table so that we have a row for each value in grades. An elegant way to do this is by combining match() and integer subsetting (match(needles, haystack) returns the position where each needle is found in the haystack).
id <- match(grades, info$grade)
id
#> [1] 3 2 2 1 3
info[id, ]
#> grade desc fail
#> 3 1 Poor TRUE
#> 2 2 Good FALSE
#> 2.1 2 Good FALSE
#> 1 3 Excellent FALSE
#> 3.1 1 Poor TRUE
If you’re matching on multiple columns, you’ll need to first collapse them into a single column (with e.g. interaction()). Typically, however, you’re better off switching to a function designed specifically for joining multiple tables, like merge() or dplyr::left_join().
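A minimal sketch of matching on two columns follows. The scores and info2 data frames and their values are hypothetical, invented only for illustration. Both key columns are collapsed into one key per row with interaction(), and the result is then used with match() exactly as in the single-column case. The merge() call at the end performs the same lookup, although by default it sorts the result by the key columns, so the row order can differ from the hand-built version.

```r
# Hypothetical data: rows are identified by both year and grade
scores <- data.frame(year = c(2020, 2021, 2020), grade = c(1, 1, 2))
info2 <- data.frame(
  year  = c(2020, 2020, 2021),
  grade = c(1, 2, 1),
  desc  = c("Poor", "Good", "Fair")
)

# Collapse both key columns into a single key per row, then match as before
key_scores <- as.character(interaction(scores$year, scores$grade))
key_info   <- as.character(interaction(info2$year, info2$grade))
id <- match(key_scores, key_info)
id
#> [1] 1 3 2
by_hand <- info2[id, ]

# The same lookup with a function designed for joins
joined <- merge(scores, info2, by = c("year", "grade"))
```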
4.5.3 Random samples and bootstraps (integer subsetting)
You can use integer indices to randomly sample or bootstrap a vector or data frame. Just use sample(n) to generate a random permutation of 1:n, and then use the results to subset the values:
df <- data.frame(x = c(1, 2, 3, 1, 2), y = 5:1, z = letters[1:5])

# Randomly reorder
df[sample(nrow(df)), ]
#> x y z
#> 5 2 1 e
#> 3 3 3 c
#> 4 1 2 d
#> 1 1 5 a
#> 2 2 4 b
# Select 3 random rows
df[sample(nrow(df), 3), ]
#> x y z
#> 4 1 2 d
#> 2 2 4 b
#> 1 1 5 a
# Select 6 bootstrap replicates
df[sample(nrow(df), 6, replace = TRUE), ]
#> x y z
#> 5 2 1 e
#> 5.1 2 1 e
#> 5.2 2 1 e
#> 2 2 4 b
#> 3 3 3 c
#> 3.1 3 3 c
The arguments of sample() control the number of samples to extract, and also whether sampling is done with or without replacement.
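To illustrate how resampled indices are typically used, here is a rough sketch of a bootstrap for the mean of one column. The seed, the number of replicates (1000), and the choice of statistic are arbitrary assumptions made for the example, not recommendations.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- data.frame(x = c(1, 2, 3, 1, 2), y = 5:1, z = letters[1:5])

# Draw nrow(df) row indices with replacement and compute the mean of y;
# repeating this many times approximates the sampling distribution of the mean
boot_means <- replicate(1000, {
  idx <- sample(nrow(df), replace = TRUE)
  mean(df$y[idx])
})

length(boot_means)
#> [1] 1000
sd(boot_means)  # bootstrap estimate of the standard error of mean(df$y)
```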
4.5.4 Ordering (integer subsetting)
order() takes a vector as its input and returns an integer vector describing how to order the subsetted vector [24]:
x <- c("b", "c", "a")
order(x)
#> [1] 3 1 2
x[order(x)]
#> [1] "a" "b" "c"
To break ties, you can supply additional variables to order(). You can also change the order from ascending to descending by using decreasing = TRUE. By default, any missing values will be put at the end of the vector; however, you can remove them with na.last = NA or put them at the front with na.last = FALSE.
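The options described above can be seen side by side in a short sketch. The vectors x and g are made up for the example; g plays the role of a second variable used to break ties in x.

```r
x <- c(2, NA, 1, 2)
g <- c("b", "a", "c", "a")   # hypothetical second variable used to break ties

order(x)                     # NA goes last by default
#> [1] 3 1 4 2
order(x, g)                  # the tie between x[1] and x[4] is broken by g
#> [1] 3 4 1 2
order(x, decreasing = TRUE)  # descending, NA still last
#> [1] 1 4 3 2
order(x, na.last = NA)       # NA removed
#> [1] 3 1 4
order(x, na.last = FALSE)    # NA first
#> [1] 2 3 1 4
```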
For two or more dimensions, order() and integer subsetting make it easy to order either the rows or columns of an object:
# Randomly reorder df
df2 <- df[sample(nrow(df)), 3:1]
df2
#> z y x
#> 5 e 1 2
#> 1 a 5 1
#> 4 d 2 1
#> 2 b 4 2
#> 3 c 3 3
df2[order(df2$x), ]
#> z y x
#> 1 a 5 1
#> 4 d 2 1
#> 5 e 1 2
#> 2 b 4 2
#> 3 c 3 3
df2[, order(names(df2))]
#> x y z
#> 5 2 1 e
#> 1 1 5 a
#> 4 1 2 d
#> 2 2 4 b
#> 3 3 3 c
You can sort vectors directly with sort(), or use dplyr::arrange() to sort a data frame.
4.5.5 Expanding aggregated counts (integer subsetting)
Sometimes you get a data frame where identical rows have been collapsed into one and a count column has been added. rep() and integer subsetting make it easy to uncollapse, because we can take advantage of rep()’s vectorisation: rep(x, y) repeats x[i] y[i] times.
df <- data.frame(x = c(2, 4, 1), y = c(9, 11, 6), n = c(3, 5, 1))
rep(1:nrow(df), df$n)
#> [1] 1 1 1 2 2 2 2 2 3

df[rep(1:nrow(df), df$n), ]
#> x y n
#> 1 2 9 3
#> 1.1 2 9 3
#> 1.2 2 9 3
#> 2 4 11 5
#> 2.1 4 11 5
#> 2.2 4 11 5
#> 2.3 4 11 5
#> 2.4 4 11 5
#> 3 1 6 1
4.5.6 Removing columns from data frames (character subsetting)
There are two ways to remove columns from a data frame. You can set individual columns to NULL:
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
df$z <- NULL
Or you can subset to return only the columns you want:
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
df[c("x", "y")]
#> x y
#> 1 1 3
#> 2 2 2
#> 3 3 1
If you only know the columns you don’t want, use set operations to work out which columns to keep:
df[setdiff(names(df), "z")]
#> x y
#> 1 1 3
#> 2 2 2
#> 3 3 1
4.5.7 Selecting rows based on a condition (logical subsetting)
Because logical subsetting allows you to easily combine conditions from multiple columns, it’s probably the most commonly used technique for extracting rows out of a data frame.
mtcars[mtcars$gear == 5, ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14.5 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8
mtcars[mtcars$gear == 5 & mtcars$cyl == 4, ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
Remember to use the vector Boolean operators & and |, not the short-circuiting scalar operators && and ||, which are more useful inside if statements. And don’t forget De Morgan’s laws, which can be useful to simplify negations:
!(X & Y) is the same as !X | !Y
!(X | Y) is the same as !X & !Y
For example, !(X & !(Y | Z)) simplifies to !X | !!(Y|Z), and then to !X | Y | Z.
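One way to convince yourself of these identities is to check them over every combination of inputs. The sketch below builds all eight combinations of three logical values with expand.grid() and compares both sides of each identity. The names grid, X, Y, and Z are chosen only for this example.

```r
# All eight combinations of three logical values
grid <- expand.grid(X = c(TRUE, FALSE), Y = c(TRUE, FALSE), Z = c(TRUE, FALSE))
X <- grid$X
Y <- grid$Y
Z <- grid$Z

identical(!(X & Y), !X | !Y)
#> [1] TRUE
identical(!(X | Y), !X & !Y)
#> [1] TRUE
identical(!(X & !(Y | Z)), !X | Y | Z)
#> [1] TRUE
```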
4.5.8 Boolean algebra versus sets (logical and integer subsetting)
It’s useful to be aware of the natural equivalence between set operations (integer subsetting) and Boolean algebra (logical subsetting). Using set operations is more effective when:
You want to find the first (or last) TRUE.
You have very few TRUEs and very many FALSEs; a set representation may be faster and require less storage.
which() allows you to convert a Boolean representation to an integer representation. There’s no reverse operation in base R but we can easily create one:
x <- sample(10) < 4
which(x)
#> [1] 2 3 4

unwhich <- function(x, n) {
  out <- rep_len(FALSE, n)
  out[x] <- TRUE
  out
}
unwhich(which(x), 10)
#> [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Let’s create two logical vectors and their integer equivalents, and then explore the relationship between Boolean and set operations.
(x1 <- 1:10 %% 2 == 0)
#> [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
(x2 <- which(x1))
#> [1] 2 4 6 8 10
(y1 <- 1:10 %% 5 == 0)
#> [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
(y2 <- which(y1))
#> [1] 5 10
# X & Y <-> intersect(x, y)
x1 & y1
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
intersect(x2, y2)
#> [1] 10
# X | Y <-> union(x, y)
x1 | y1
#> [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
union(x2, y2)
#> [1] 2 4 6 8 10 5
# X & !Y <-> setdiff(x, y)
x1 & !y1
#> [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
setdiff(x2, y2)
#> [1] 2 4 6 8
# xor(X, Y) <-> setdiff(union(x, y), intersect(x, y))
xor(x1, y1)
#> [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
setdiff(union(x2, y2), intersect(x2, y2))
#> [1] 2 4 6 8 5
When first learning subsetting, a common mistake is to use x[which(y)] instead of x[y]. Here the which() achieves nothing: it switches from logical to integer subsetting but the result is exactly the same. In more general cases, there are two important differences.
When the logical vector contains NA, logical subsetting replaces these values with NA while which() simply drops these values. It’s not uncommon to use which() for this side-effect, but I don’t recommend it: nothing about the name “which” implies the removal of missing values.
x[-which(y)] is not equivalent to x[!y]: if y is all FALSE, which(y) will be integer(0) and -integer(0) is still integer(0), so you’ll get no values, instead of all values.
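Both differences are easy to reproduce. The following sketch uses small made-up vectors (x, y, and none are names chosen for this example) to show the NA behaviour and the all-FALSE pitfall.

```r
x <- c(10, 20, 30)
y <- c(TRUE, NA, FALSE)

x[y]          # an NA in the index gives an NA in the result
#> [1] 10 NA
x[which(y)]   # which() silently drops the NA
#> [1] 10

none <- c(FALSE, FALSE, FALSE)
x[!none]         # all values are kept
#> [1] 10 20 30
x[-which(none)]  # -integer(0) is integer(0), so nothing is selected
#> numeric(0)
```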
In general, avoid switching from logical to integer subsetting unless you want, for example, the first or last TRUE value.
4.5.9 Exercises
1. How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
2. How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
3. How could you put the columns in a data frame in alphabetical order?
[24] These are “pull” indices, i.e., order(x)[i] is an index of where each x[i] is located. It is not an index of where x[i] should be sent.