4.2 Selecting multiple elements

Use [ to select any number of elements from a vector. To illustrate, I’ll apply [ to 1D atomic vectors, and then show how this generalises to more complex objects and more dimensions.

4.2.1 Atomic vectors

Let’s explore the different types of subsetting with a simple vector, x.

x <- c(2.1, 4.2, 3.3, 5.4)

Note that the number after the decimal point represents the original position in the vector.

There are six things that you can use to subset a vector:

  • Positive integers return elements at the specified positions:

    x[c(3, 1)]
    #> [1] 3.3 2.1
    x[order(x)]
    #> [1] 2.1 3.3 4.2 5.4
    
    # Duplicate indices will duplicate values
    x[c(1, 1)]
    #> [1] 2.1 2.1
    
    # Real numbers are silently truncated to integers
    x[c(2.1, 2.9)]
    #> [1] 4.2 4.2
  • Negative integers exclude elements at the specified positions:

    x[-c(3, 1)]
    #> [1] 4.2 5.4

    Note that you can’t mix positive and negative integers in a single subset:

    x[c(-1, 2)]
    #> Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
  • Logical vectors select elements where the corresponding logical value is TRUE. This is probably the most useful type of subsetting because you can write an expression that creates a logical vector:

    x[c(TRUE, TRUE, FALSE, FALSE)]
    #> [1] 2.1 4.2
    x[x > 3]
    #> [1] 4.2 3.3 5.4

    In x[y], what happens if x and y are different lengths? The behaviour is controlled by the recycling rules where the shorter of the two is recycled to the length of the longer. This is convenient and easy to understand when one of x and y is length one, but I recommend avoiding recycling for other lengths because the rules are inconsistently applied throughout base R.

    x[c(TRUE, FALSE)]
    #> [1] 2.1 3.3
    # Equivalent to
    x[c(TRUE, FALSE, TRUE, FALSE)]
    #> [1] 2.1 3.3

    Note that a missing value in the index always yields a missing value in the output:

    x[c(TRUE, TRUE, NA, FALSE)]
    #> [1] 2.1 4.2  NA
  • Nothing returns the original vector. This is not useful for 1D vectors, but, as you’ll see shortly, is very useful for matrices, data frames, and arrays. It can also be useful in conjunction with assignment.

    x[]
    #> [1] 2.1 4.2 3.3 5.4
  • Zero returns a zero-length vector. This is not something you usually do on purpose, but it can be helpful for generating test data.

    x[0]
    #> numeric(0)
  • If the vector is named, you can also use character vectors to return elements with matching names.

    (y <- setNames(x, letters[1:4]))
    #>   a   b   c   d 
    #> 2.1 4.2 3.3 5.4
    y[c("d", "c", "a")]
    #>   d   c   a 
    #> 5.4 3.3 2.1
    
    # Like integer indices, you can repeat indices
    y[c("a", "a", "a")]
    #>   a   a   a 
    #> 2.1 2.1 2.1
    
    # When subsetting with [, names are always matched exactly
    z <- c(abc = 1, def = 2)
    z[c("a", "d")]
    #> <NA> <NA> 
    #>   NA   NA
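The remark that subsetting with nothing is useful in conjunction with assignment can be sketched as follows. The vectors v and w are made up for illustration; the point is that v[] <- 0 overwrites the existing elements while keeping the length and names, whereas w <- 0 replaces the whole object.

```r
# Hypothetical named vectors, used only for illustration
v <- c(a = 2.1, b = 4.2, c = 3.3)
w <- c(a = 2.1, b = 4.2, c = 3.3)

# Empty subsetting with assignment: every element is overwritten
# (0 is recycled), but the length and names of v are kept
v[] <- 0
v

# Plain assignment replaces the whole object with a length-one vector
w <- 0
w
```

After this runs, v still has three named elements, while w is a bare length-one vector.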

NB: Factors are not treated specially when subsetting. This means that subsetting will use the underlying integer vector, not the character levels. This is typically unexpected, so you should avoid subsetting with factors:

y[factor("b")]
#>   a 
#> 2.1
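To see why this happens, it can help to look at the integer code behind the factor. The sketch below reuses x and y from above; the name f and the as.character() conversion are my additions, shown as one possible way to get name-based subsetting back.

```r
x <- c(2.1, 4.2, 3.3, 5.4)
y <- setNames(x, letters[1:4])
f <- factor("b")

# The factor has a single level, so its underlying integer code is 1
as.integer(f)

# Subsetting with the factor therefore uses position 1, not the name "b"
y[f]

# Converting to character first subsets by name instead
y[as.character(f)]
```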

4.2.2 Lists

Subsetting a list works in the same way as subsetting an atomic vector. Using [ always returns a list; [[ and $, as described in Section 4.3, let you pull out elements of a list.
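A brief sketch, using a made-up list lst, shows that the index types from Section 4.2.1 carry over and that the result of [ is always a list, even when only one element is selected:

```r
# Hypothetical list, used only for illustration
lst <- list(a = 1:3, b = "text", c = TRUE)

# Character, positive integer and negative integer indices all work,
# and each result is itself a list
str(lst["a"])
str(lst[c(1, 3)])
str(lst[-2])

is.list(lst["a"])
```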

4.2.3 Matrices and arrays

You can subset higher-dimensional structures in three ways:

  • With multiple vectors.
  • With a single vector.
  • With a matrix.

The most common way of subsetting matrices (2D) and arrays (>2D) is a simple generalisation of 1D subsetting: supply a 1D index for each dimension, separated by a comma. Blank subsetting is now useful because it lets you keep all rows or all columns.

a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a[1:2, ]
#>      A B C
#> [1,] 1 4 7
#> [2,] 2 5 8
a[c(TRUE, FALSE, TRUE), c("B", "A")]
#>      B A
#> [1,] 4 1
#> [2,] 6 3
a[0, -2]
#>      A C

By default, [ simplifies the results to the lowest possible dimensionality. For example, both of the following expressions return 1D vectors. You’ll learn how to avoid “dropping” dimensions in Section 4.2.5:

a[1, ]
#> A B C 
#> 1 4 7
a[1, 1]
#> A 
#> 1

Because both matrices and arrays are just vectors with special attributes, you can subset them with a single vector, as if they were a 1D vector. Note that arrays in R are stored in column-major order:

vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")
vals
#>      [,1]  [,2]  [,3]  [,4]  [,5] 
#> [1,] "1,1" "1,2" "1,3" "1,4" "1,5"
#> [2,] "2,1" "2,2" "2,3" "2,4" "2,5"
#> [3,] "3,1" "3,2" "3,3" "3,4" "3,5"
#> [4,] "4,1" "4,2" "4,3" "4,4" "4,5"
#> [5,] "5,1" "5,2" "5,3" "5,4" "5,5"

vals[c(4, 15)]
#> [1] "4,1" "5,3"
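Column-major order means that element (i, j) of a matrix sits at linear position i + (j - 1) * nrow. The sketch below checks this for the "5,3" entry above; the names i, j and pos are mine, and arrayInd() is the base R function that converts a linear position back to a row and column.

```r
vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")

# Linear position of row 5, column 3 in column-major order
i <- 5
j <- 3
pos <- i + (j - 1) * nrow(vals)
pos

# Both forms select the same element
vals[pos]
vals[i, j]

# arrayInd() goes the other way: linear position -> (row, column)
arrayInd(15, dim(vals))
```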

You can also subset higher-dimensional data structures with an integer matrix (or, if named, a character matrix). Each row in the matrix specifies the location of one value, and each column corresponds to a dimension in the array. This means that you can use a two-column matrix to subset a matrix, a three-column matrix to subset a 3D array, and so on. The result is a vector of values:

select <- matrix(ncol = 2, byrow = TRUE, c(
  1, 1,
  3, 1,
  2, 4
))
vals[select]
#> [1] "1,1" "3,1" "2,4"
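The same idea extends to the other cases mentioned above. The following sketch uses a made-up 3D array arr with a three-column integer matrix, and a made-up named matrix m with a character matrix; all object names here are illustrative.

```r
# Hypothetical 2 x 3 x 4 array filled with 1:24 in column-major order
arr <- array(1:24, dim = c(2, 3, 4))

# One row per value to extract, one column per dimension
idx <- rbind(
  c(1, 1, 1),
  c(2, 3, 4),
  c(1, 2, 3)
)
arr[idx]

# With dimnames, a character matrix can be used in the same way
m <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
chr_idx <- cbind(c("r2", "r1"), c("c1", "c2"))
m[chr_idx]
```

Each row of idx picks one value, so arr[idx] has three elements, and m[chr_idx] picks the ("r2", "c1") and ("r1", "c2") entries.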

4.2.4 Data frames and tibbles

Data frames have the characteristics of both lists and matrices:

  • When subsetting with a single index, they behave like lists and index the columns, so df[1:2] selects the first two columns.

  • When subsetting with two indices, they behave like matrices, so df[1:3, ] selects the first three rows (and all the columns) [1].

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

df[df$x == 2, ]
#>   x y z
#> 2 2 2 b
df[c(1, 3), ]
#>   x y z
#> 1 1 3 a
#> 3 3 1 c

# There are two ways to select columns from a data frame
# Like a list
df[c("x", "z")]
#>   x z
#> 1 1 a
#> 2 2 b
#> 3 3 c
# Like a matrix
df[, c("x", "z")]
#>   x z
#> 1 1 a
#> 2 2 b
#> 3 3 c

# There's an important difference if you select a single 
# column: matrix subsetting simplifies by default, list 
# subsetting does not.
str(df["x"])
#> 'data.frame':    3 obs. of  1 variable:
#>  $ x: int  1 2 3
str(df[, "x"])
#>  int [1:3] 1 2 3
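One quick way to see the list-like versus matrix-like behaviour is to compare the dimensions of the results. This is a small sketch with a made-up data frame, scores, that has the same shape as df above:

```r
# Hypothetical data frame, used only for illustration
scores <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

# One index: list-style, selects columns and keeps all rows
dim(scores[1:2])

# Two indices: matrix-style, rows first and then columns
dim(scores[1:2, ])
dim(scores[1:2, c("x", "z")])
```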

Subsetting a tibble with [ always returns a tibble:

df <- tibble::tibble(x = 1:3, y = 3:1, z = letters[1:3])

str(df["x"])
#> tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ x: int [1:3] 1 2 3
str(df[, "x"])
#> tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ x: int [1:3] 1 2 3

4.2.5 Preserving dimensionality

By default, subsetting a matrix or data frame with a single number, a single name, or a logical vector containing a single TRUE, will simplify the returned output, i.e. it will return an object with lower dimensionality. To preserve the original dimensionality, you must use drop = FALSE.

  • For matrices and arrays, any dimensions with length 1 will be dropped:

    a <- matrix(1:4, nrow = 2)
    str(a[1, ])
    #>  int [1:2] 1 3
    
    str(a[1, , drop = FALSE])
    #>  int [1, 1:2] 1 3
  • Selecting a single column from a data frame will return just the content of that column:

    df <- data.frame(a = 1:2, b = 1:2)
    str(df[, "a"])
    #>  int [1:2] 1 2
    
    str(df[, "a", drop = FALSE])
    #> 'data.frame':    2 obs. of  1 variable:
    #>  $ a: int  1 2

The default drop = TRUE behaviour is a common source of bugs in functions: you check your code with a data frame or matrix with multiple columns, and it works. Six months later, you (or someone else) use it with a single column data frame and it fails with a mystifying error. When writing functions, get in the habit of always using drop = FALSE when subsetting a 2D object. For this reason, tibbles default to drop = FALSE, and [ always returns another tibble.
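A minimal sketch of this kind of bug, using two hypothetical helpers (count_cols_unsafe() and count_cols_safe() are made up for illustration and are not part of any package):

```r
# Hypothetical helpers: count the columns selected from a data frame
count_cols_unsafe <- function(df, cols) {
  ncol(df[, cols])                # result may be simplified to a vector
}
count_cols_safe <- function(df, cols) {
  ncol(df[, cols, drop = FALSE])  # result is always a data frame
}

df <- data.frame(a = 1:2, b = 1:2)

# Works as expected with two columns
count_cols_unsafe(df, c("a", "b"))

# With one column the subset is a plain vector, so ncol() gives NULL
count_cols_unsafe(df, "a")

# drop = FALSE keeps the data frame, so the answer is 1
count_cols_safe(df, "a")
```

Here the unsafe version does not even throw an error for a single column; it quietly returns NULL, which is likely to cause a failure somewhere further downstream.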

Factor subsetting also has a drop argument, but its meaning is rather different. It controls whether or not levels (rather than dimensions) are preserved, and it defaults to FALSE. If you find you’re using drop = TRUE a lot it’s often a sign that you should be using a character vector instead of a factor.

z <- factor(c("a", "b"))
z[1]
#> [1] a
#> Levels: a b
z[1, drop = TRUE]
#> [1] a
#> Levels: a
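The difference is easiest to see by inspecting the levels directly. The sketch below also shows the character vector alternative, which has no levels to carry around:

```r
z <- factor(c("a", "b"))

# Default (drop = FALSE): the unused level "b" is kept
levels(z[1])

# drop = TRUE: only the levels still in use are kept
levels(z[1, drop = TRUE])

# A character vector has no levels at all
as.character(z)[1]
```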

4.2.6 Exercises

  1. Fix each of the following common data frame subsetting errors:

    mtcars[mtcars$cyl = 4, ]
    mtcars[-1:4, ]
    mtcars[mtcars$cyl <= 5]
    mtcars[mtcars$cyl == 4 | 6, ]
  2. Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

    x <- 1:5
    x[NA]
    #> [1] NA NA NA NA NA
  3. What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

    x <- outer(1:5, 1:5, FUN = "*")
    x[upper.tri(x)]
  4. Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

  5. Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

  6. What does df[is.na(df)] <- 0 do? How does it work?


  [1] If you’re coming from Python this is likely to be confusing, as you’d probably expect df[1:3, 1:2] to select three columns and two rows. Generally, R “thinks” about dimensions in terms of rows and columns while Python does so in terms of columns and rows.