Subsetting in R

Jesse Yang
Jan 27th, 2017

Things You Already Knew

  • Vectors
x <- c(2, 4, 6, 8)
x[c(3, 1)]
[1] 6 2
x[c(2:4, 1, 3)]
[1] 4 6 8 2 6

Things You Might Not Know

  • Vectors can be named
x <- c(first = 2, second = 4,
       third = 6, forth = 8)
x[c(3, 1)]
third first 
    6     2 
x["second"]
second 
     4 

Things You Might Not Know

  • Negative indices to exclude elements
x
 first second  third  forth 
     2      4      6      8 
x[-c(3, 2)]
first forth 
    2     8 

Things You Might Not Know

x[c(-1, -2)]
third forth 
    6     8 
x[c(-1, 2)]
Error in x[c(-1, 2)] : only 0's may be mixed with negative subscripts
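
Negative indices work on positions only, so x[-"third"] is an error. A minimal sketch of two common ways to exclude elements by name instead, using the named vector from the slides (drop_names is an illustrative variable, not part of the original example):

```r
x <- c(first = 2, second = 4, third = 6, forth = 8)

# Names to exclude (illustrative values)
drop_names <- c("third", "second")

# Option 1: build a logical mask from the names
kept_by_mask <- x[!(names(x) %in% drop_names)]

# Option 2: compute the names to keep, then subset by name
kept_by_name <- x[setdiff(names(x), drop_names)]

# A zero index is silently ignored, which is why 0 may be mixed with negatives
kept_with_zero <- x[c(0, -1)]
```

Both options return the elements named first and forth, the same result as x[-c(3, 2)].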

Things You Already Knew - Matrices

x <- matrix(c(2, 4, 6, 8, 10, 12), ncol = 3)
x
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    4    8   12
x[2, 1]  # row, column
[1] 4

Things You Might Not Know

  • Additional Arguments for []
x[1, ]
[1]  2  6 10
x[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    2    6   10
  • For more, see the help page: ?"["
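
To see what drop actually changes, it can help to inspect the dimensions of each result. A small sketch using the same matrix (row_vec and row_mat are illustrative names):

```r
x <- matrix(c(2, 4, 6, 8, 10, 12), ncol = 3)

row_vec <- x[1, ]                # dimensions are dropped: a plain numeric vector
row_mat <- x[1, , drop = FALSE]  # dimensions are kept: a 1 x 3 matrix

is.null(dim(row_vec))  # TRUE
dim(row_mat)           # 1 3
```

Keeping the dimensions matters when later code expects a matrix, for example when it calls nrow() or ncol() on the result.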

Bonus!

  • Structure
str(x)
 num [1:2, 1:3] 2 4 6 8 10 12
  • Dimension
dim(x)
[1] 2 3

Things You Already Knew - Data Frames

dat <- data.frame(x = c(1, 2),
                  y = c("A", "B"),
                  z = c("a", "b"))
dat
  x y z
1 1 A a
2 2 B b
dat[1, 2]  # access first row, second column
[1] A
Levels: A B

Alternatives

These are all the same!

dat[1, 2]
dat[1, "y"]  # Recommended!
dat[1, 'y']
dat[1, 2:2]

dat$y[1]
dat[["y"]][1]
  • dat[1, 2:2] is redundant (2:2 is just 2); it is here only to show that sequences work in subsetting, too.
  • We recommend always using double quotes (" rather than ') in R, and accessing columns by name whenever possible.
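
One way to check the claim that these forms are equivalent is to compare them with identical(). A minimal sketch (results and all_same are illustrative names):

```r
dat <- data.frame(x = c(1, 2),
                  y = c("A", "B"),
                  z = c("a", "b"))

# Collect the result of each form in a list
results <- list(
  dat[1, 2],
  dat[1, "y"],
  dat[1, 2:2],
  dat$y[1],
  dat[["y"]][1]
)

# Compare every result against the first one
all_same <- all(vapply(results, identical, logical(1), results[[1]]))
all_same
```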

Things You Might Not Know

  • Subsetting by row names works just like subsetting by column names
dat
  x y z
1 1 A a
2 2 B b
row.names(dat) <- c("first", "second")
dat["first", "y", drop = FALSE]
      y
first A

Get whole columns or rows

  • When you omit the “,”, a single index picks columns
dat["y"]
       y
first  A
second B
dat[c(2, 3)]
       y z
first  A a
second B b

Try not to do that

  • This is much better.
dat[, c(2, 3)]  # better
       y z
first  A a
second B b
dat[, c("y", "z")]  # even better
       y z
first  A a
second B b
  • Because the explicit comma makes it immediately clear that the code selects columns and returns a new data frame.
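
One caveat when selecting a single column: the form with a comma drops the result to a vector unless drop = FALSE is given, while the form without a comma always returns a data frame. A short sketch of the difference (the one_col_* names are illustrative):

```r
dat <- data.frame(x = c(1, 2),
                  y = c("A", "B"),
                  z = c("a", "b"))

one_col_df   <- dat["y"]                  # list-style: always a data frame
one_col_vec  <- dat[, "y"]                # matrix-style: a single column becomes a vector
one_col_kept <- dat[, "y", drop = FALSE]  # matrix-style, but stays a data frame

is.data.frame(one_col_df)    # TRUE
is.data.frame(one_col_vec)   # FALSE
is.data.frame(one_col_kept)  # TRUE
```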

Things You Might Not Know

  • Data Frames Are Special Lists
x <- list(x = c(1, 2, 3, 3, 4), y = c(4, 8),
          z = c("burp", "blah"))
str(x)
List of 3
 $ x: num [1:5] 1 2 3 3 4
 $ y: num [1:2] 4 8
 $ z: chr [1:2] "burp" "blah"
x[["z"]][1]
[1] "burp"
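
The list example above has elements of different lengths; a data frame is a list whose elements (the columns) all have the same length. A small sketch that checks this on the earlier data frame (col_lengths is an illustrative name):

```r
dat <- data.frame(x = c(1, 2),
                  y = c("A", "B"),
                  z = c("a", "b"))

is.list(dat)  # TRUE: a data frame is a list of columns
length(dat)   # 3: one list element per column

# Every column has as many elements as the data frame has rows
col_lengths <- vapply(dat, length, integer(1))
all(col_lengths == nrow(dat))
```

This is why list-style access such as dat[["y"]] and dat$y works on data frames.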

Booleans for Subsetting

What is behind dat[dat$x > 5, ]?

Booleans for Subsetting

  • Boolean data type: TRUE, FALSE
  • Logical expressions return a logical vector (Booleans)
# construct a vector of random numbers
x <- round(rnorm(n = 6, mean = 5, sd = 3))
x
[1]  9 -2  8  3  7  7
x < 5  # a logical expression
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE

Booleans for Subsetting

  • Remember our x is:
[1]  9 -2  8  3  7  7
  • Booleans can be used to subset vectors
# if we just pass in a vector of booleans
x[c(FALSE, TRUE)]
[1] -2  3  7
  • The shorter logical vector is recycled (repeated) until it is as long as x.
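
The recycling can be written out explicitly with rep(). A minimal sketch, with x fixed to the values shown above instead of drawn at random (short_mask and full_mask are illustrative names):

```r
x <- c(9, -2, 8, 3, 7, 7)

short_mask <- c(FALSE, TRUE)
# Repeat the short mask until it has one entry per element of x
full_mask <- rep(short_mask, length.out = length(x))

identical(x[short_mask], x[full_mask])  # TRUE

# A single TRUE is recycled too, so it selects every element
x[TRUE]
```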

Booleans for Subsetting

[1]  9 -2  8  3  7  7
y <- x < 5
y
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE
x[y] == x[x < 5]
[1] TRUE TRUE
  • Make sure you understand why!

Boolean Algebra (Multi Logical Conditions)

[1]  9 -2  8  3  7  7
# X & Y <-> intersect(x, y)
x[x < 5 & x > 2]
[1] 3
# X | Y <-> union(x, y)
x[x < 5 | x > 7]
[1]  9 -2  8  3

Boolean Algebra

[1]  9 -2  8  3  7  7
# X & !Y <-> setdiff(x, y)
y3 <- x < 5 & !(x < 3)   # !(x < 3) == x >= 3
y3
[1] FALSE FALSE FALSE  TRUE FALSE FALSE
x[y3]
[1] 3
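
The set analogy in the comments can be made concrete by turning each condition into the set of positions where it holds, using which(). A sketch under that reading, with x fixed to the values above (all variable names are illustrative):

```r
x <- c(9, -2, 8, 3, 7, 7)

below_5 <- which(x < 5)  # positions where x < 5
above_2 <- which(x > 2)  # positions where x > 2
above_7 <- which(x > 7)  # positions where x > 7
below_3 <- which(x < 3)  # positions where x < 3

# & corresponds to the intersection of the position sets
and_matches <- identical(which(x < 5 & x > 2), intersect(below_5, above_2))

# | corresponds to the union (sorted, because which() returns ascending positions)
or_matches <- identical(which(x < 5 | x > 7), sort(union(below_5, above_7)))

# & ! corresponds to the set difference
diff_matches <- identical(which(x < 5 & !(x < 3)), setdiff(below_5, below_3))

c(and_matches, or_matches, diff_matches)
```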

NAs in Subsetting

dat <- data.frame(x = c(1, NA, 3),
                  y = c("A", "B", "B"),
                  z = c("a", "b", "c"))
dat
   x y z
1  1 A a
2 NA B b
3  3 B c
dat[dat$x < 2, ]
    x    y    z
1   1    A    a
NA NA <NA> <NA>

NAs in Subsetting

  • Why does this happen? Break it into steps.
dat$x < 2
[1]  TRUE    NA FALSE
dat[c(TRUE, NA, FALSE), ]
    x    y    z
1   1    A    a
NA NA <NA> <NA>

NAs in Subsetting

  • Solution: use which()
dat[which(dat$x < 2), ]
  x y z
1 1 A a
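
which() is not the only option. A sketch of an alternative that stays with logical vectors and rules out the NAs explicitly (keep, via_mask and via_which are illustrative names):

```r
dat <- data.frame(x = c(1, NA, 3),
                  y = c("A", "B", "B"),
                  z = c("a", "b", "c"))

# FALSE & NA is FALSE, so the NA row is excluded instead of propagated
keep <- !is.na(dat$x) & dat$x < 2
keep

via_mask  <- dat[keep, ]
via_which <- dat[which(dat$x < 2), ]

# Both approaches select the same single row
all.equal(via_mask, via_which)
```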

The `subset` function

   x y z
1  1 A a
2 NA B b
3  3 B c
subset(dat, x < 2)
  x y z
1 1 A a

The `subset` function

  • But, as you remember…
dat[dat$x < 2, ]
    x    y    z
1   1    A    a
NA NA <NA> <NA>
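  • So subset() drops the rows where the condition is NA for you. It is still not recommended for production code: its help page (?subset) describes it as a convenience function for interactive use, and its non-standard evaluation can have unanticipated consequences inside functions.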
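
For code that has to run inside functions, one possible approach is a small wrapper around standard subsetting that still drops the NA rows. A minimal sketch; rows_below is a hypothetical helper written for this example, not a base R function:

```r
# Hypothetical helper: keep rows of df where column `col` is below `threshold`
rows_below <- function(df, col, threshold) {
  keep <- which(df[[col]] < threshold)  # which() ignores NA comparisons
  df[keep, , drop = FALSE]              # always return a data frame
}

dat <- data.frame(x = c(1, NA, 3),
                  y = c("A", "B", "B"),
                  z = c("a", "b", "c"))

res <- rows_below(dat, "x", 2)
res
```

Passing the column name as a string keeps the function free of non-standard evaluation, so it behaves the same wherever it is called.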

That's it!

That's all you need to know about subsetting in R.

Questions? Email me: yang.jianc@husky.neu.edu