Subsetting in R

Jesse Yang
Jan 27th, 2017

Things You Already Knew

Vectors

x <- c(2, 4, 6, 8)
x[c(3, 1)]

[1] 6 2

x[c(2:4, 1, 3)]

[1] 4 6 8 2 6

Things You Might Not Know

Vectors can be named

x <- c(first = 2, second = 4,
       third = 6, fourth = 8)
x[c(3, 1)]

third first 
    6     2

x["second"]

second 
     4

Things You Might Not Know

Negative indices to exclude elements

 first second  third fourth 
     2      4      6      8

x[-c(3, 2)]

 first fourth 
     2      8

Things You Might Not Know

x[c(-1, -2)]

 third fourth 
     6      8

x[c(-1, 2)]

Error in x[c(-1, 2)] : only 0's may be mixed with negative subscripts
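
The rules for negative indices can be checked directly. The following is a minimal sketch; the vector v and its names are made up for illustration and are not part of the slides.

```r
# Illustrative named vector (not from the slides)
v <- c(a = 10, b = 20, c = 30)

# Negative indices drop elements; a zero mixed in is silently ignored
drop_first <- v[c(-1, 0)]

# Mixing positive and negative indices raises an error, caught here
mixed <- tryCatch(v[c(-1, 2)], error = function(e) "error")

# Names cannot be negated; select the complement of the names instead
without_b <- v[setdiff(names(v), "b")]
```

Here drop_first keeps b and c, mixed holds the string "error", and without_b keeps a and c.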

Things You Already Knew - Matrices

x <- matrix(c(2, 4, 6, 8, 10, 12), ncol = 3)
x

     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    4    8   12

x[2, 1]  # row, column

[1] 4

Things You Might Not Know

Additional Arguments for []

x[1, ]

[1]  2  6 10

x[1, , drop = FALSE]

     [,1] [,2] [,3]
[1,]    2    6   10

For more, search in help: ?"["
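
To see what drop = FALSE changes, it can help to compare the dimensions of the two results. This is a small sketch; the names m, row_vec and row_mat are chosen here for illustration.

```r
# Same matrix as on the slide, under a different name
m <- matrix(c(2, 4, 6, 8, 10, 12), ncol = 3)

row_vec <- m[1, ]                # dimensions are dropped: a plain numeric vector
row_mat <- m[1, , drop = FALSE]  # dimensions are kept: a 1 x 3 matrix

is.null(dim(row_vec))  # TRUE, a vector has no dim attribute
dim(row_mat)           # 1 3
```

Keeping the dimensions matters mostly in code that expects a matrix regardless of how many rows are selected.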

Bonus!

Structure

str(x)

 num [1:2, 1:3] 2 4 6 8 10 12

Dimension

dim(x)

[1] 2 3

Things You Already Knew - Data Frames

dat <- data.frame(x = c(1, 2),
                  y = c("A", "B"),
                  z = c("a", "b"),
                  stringsAsFactors = TRUE)  # the default before R 4.0; needed for the factor output below
dat

  x y z
1 1 A a
2 2 B b

dat[1, 2]  # access first row, second column

[1] A
Levels: A B

Alternatives

These are all the same!

dat[1, 2]
dat[1, "y"]  # Recommended!
dat[1, 'y']
dat[1, 2:2]

dat$y[1]
dat[["y"]][1]

dat[1, 2:2] is redundant (2:2 is just 2); it is only here to show that sequences work in subsetting, too.
We recommend always using double quotes (" rather than ') in R and, where possible, accessing columns by name.
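
One way to convince yourself that the alternatives agree, and to see why names are the safer choice, is the sketch below. The data frame d mirrors the slide's dat; d2 is a hypothetical reordered copy added for illustration.

```r
# Illustrative copy of the slide's data frame
d <- data.frame(x = c(1, 2), y = c("A", "B"), z = c("a", "b"))

# All four spellings return the same value
all_same <- identical(d[1, 2], d[1, "y"]) &&
  identical(d[1, "y"], d$y[1]) &&
  identical(d$y[1], d[["y"]][1])

# After reordering the columns, position 2 is no longer y, but the name still works
d2 <- d[c("y", "x", "z")]
by_name <- identical(d2[1, "y"], d[1, "y"])   # TRUE
by_position <- identical(d2[1, 2], d[1, 2])   # FALSE: position 2 is now x
```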

Things You Might Not Know

Subset by row names, just as with column names

dat

  x y z
1 1 A a
2 2 B b

row.names(dat) <- c("first", "second")
dat["first", "y", drop = FALSE]

      y
first A

Get whole columns or rows

When you omit the comma, [ selects columns

dat["y"]

       y
first  A
second B

dat[c(2, 3)]

       y z
first  A a
second B b

Try not to do that

This is much better.

dat[, c(2, 3)]  # better

       y z
first  A a
second B b

dat[, c("y", "z")]  # even better

       y z
first  A a
second B b

This is better because the comma makes it immediately clear that you are selecting columns, and the result is a new data frame.
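
The different forms do not always return the same kind of object, which is worth checking once. A minimal sketch, using an illustrative data frame d rather than the slide's dat:

```r
# Illustrative data frame (same shape as the slide's dat)
d <- data.frame(x = c(1, 2), y = c("A", "B"), z = c("a", "b"))

no_comma  <- is.data.frame(d["y"])                  # TRUE: list-style selection keeps the data frame
two_cols  <- is.data.frame(d[, c("y", "z")])        # TRUE: several columns stay a data frame
one_col   <- is.data.frame(d[, "y"])                # FALSE: a single column is simplified to a vector
kept_dims <- is.data.frame(d[, "y", drop = FALSE])  # TRUE: drop = FALSE prevents the simplification
```

So with the comma form, a single column needs drop = FALSE if the result must remain a data frame.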

Things You Might Not Know

Data Frames Are Special Lists

x <- list(x = c(1, 2, 3, 3, 4), y = c(4, 8),
          z = c("burp", "blah"))
str(x)

List of 3
 $ x: num [1:5] 1 2 3 3 4
 $ y: num [1:2] 4 8
 $ z: chr [1:2] "burp" "blah"

x[["z"]][1]

[1] "burp"
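
The claim that a data frame is a special list can be verified in a few lines. This is a sketch with an illustrative data frame d; the difference from a plain list shown here is that columns must have compatible lengths.

```r
# Illustrative data frame
d <- data.frame(x = c(1, 2), y = c("A", "B"))

df_is_list <- is.list(d)   # TRUE: a data frame is a list of columns
n_cols <- length(d)        # 2: length() counts columns, as it counts list elements
first_col <- d[["x"]]      # list-style extraction returns the column vector

# Unlike the plain list above, columns of incompatible lengths are rejected
bad <- tryCatch(data.frame(x = 1:3, y = 1:2), error = function(e) "error")
```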

Booleans for Subsetting

What is behind dat[dat$x > 5, ]?

Booleans for Subsetting

Boolean data type: TRUE, FALSE
Logical expressions return a logical vector (Booleans)

# construct a vector of random numbers
x <- round(rnorm(n = 6, mean = 5, sd = 3))
x

[1]  9 -2  8  3  7  7

x < 5  # a logical expression

[1] FALSE  TRUE FALSE  TRUE FALSE FALSE

Booleans for Subsetting

Remember our x is:

[1]  9 -2  8  3  7  7

Booleans can be used to subset vectors

# if we just pass in a vector of booleans
x[c(FALSE, TRUE)]

[1] -2  3  7

The (short) vector repeats itself.
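
Recycling can be made explicit by writing out the full-length logical vector. A small sketch, with v standing in for the slide's random x:

```r
# The values shown on the slide, typed in by hand so the example is reproducible
v <- c(9, -2, 8, 3, 7, 7)

# The short logical index is recycled to the length of v
short <- v[c(FALSE, TRUE)]

# Equivalent index, written out in full
full_index <- rep(c(FALSE, TRUE), length.out = length(v))
full <- v[full_index]

identical(short, full)  # TRUE
```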

Booleans for Subsetting

[1]  9 -2  8  3  7  7

y <- x < 5
y

[1] FALSE  TRUE FALSE  TRUE FALSE FALSE

x[y] == x[x < 5]

[1] TRUE TRUE

Make sure you understand why!

Boolean Algebra (Multi Logical Conditions)

[1]  9 -2  8  3  7  7

# X & Y <-> intersect(x, y)
x[x < 5 & x > 2]

[1] 3

# X | Y <-> union(x, y)
x[x < 5 | x > 7]

[1]  9 -2  8  3

Boolean Algebra

[1]  9 -2  8  3  7  7

# X & !Y <-> setdiff(x, y)
y3 <- x < 5 & !(x < 3)   # !(x < 3) == x >= 3
y3

[1] FALSE FALSE FALSE  TRUE FALSE FALSE

x[y3]

[1] 3
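
The correspondence with set operations in the comments above holds for the positions that each condition selects. The sketch below checks this, again with v standing in for the slide's x; a and b are illustrative names for the two conditions.

```r
# The values shown on the slide
v <- c(9, -2, 8, 3, 7, 7)

a <- v < 5   # condition X
b <- v < 3   # condition Y

# Logical operators on conditions match set operations on selected positions
and_ok  <- setequal(which(a & b),  intersect(which(a), which(b)))
or_ok   <- setequal(which(a | b),  union(which(a), which(b)))
diff_ok <- setequal(which(a & !b), setdiff(which(a), which(b)))

# De Morgan's law: negating an OR is the AND of the negations
de_morgan <- identical(!(a | b), !a & !b)
```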

NAs in Subsetting

dat <- data.frame(x = c(1, NA, 3),
                  y = c("A", "B", "B"),
                  z = c("a", "b", "c"))
dat

   x y z
1  1 A a
2 NA B b
3  3 B c

dat[dat$x < 2, ]

    x    y    z
1   1    A    a
NA NA <NA> <NA>

NAs in Subsetting

Why is this? Break it into steps.

dat$x < 2

[1]  TRUE    NA FALSE

dat[c(TRUE, NA, FALSE), ]

    x    y    z
1   1    A    a
NA NA <NA> <NA>

NAs in Subsetting

Solution: use which()

dat[which(dat$x < 2), ]

  x y z
1 1 A a
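
which() works because it returns only the positions that are TRUE, so NA never reaches the index. An alternative that stays with logical vectors is to test for NA explicitly. A minimal sketch, with an illustrative data frame d:

```r
# Illustrative data frame with a missing value
d <- data.frame(x = c(1, NA, 3), y = c("A", "B", "B"))

cond <- d$x < 2            # TRUE NA FALSE
positions <- which(cond)   # 1: NA and FALSE are both left out

by_which <- d[which(cond), ]
by_isna  <- d[!is.na(cond) & cond, ]  # NA is turned into FALSE before indexing
```

Both by_which and by_isna contain only the first row.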

The `subset` function

   x y z
1  1 A a
2 NA B b
3  3 B c

subset(dat, x < 2)

  x y z
1 1 A a

The `subset` function

But, as you remember…

dat[dat$x < 2, ]

    x    y    z
1   1    A    a
NA NA <NA> <NA>

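So subset() drops the rows where the condition is NA for you. It is still not recommended for production code: its help page (?subset) warns that it is a convenience function intended for interactive use, and that its non-standard evaluation can have unanticipated consequences.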
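
One commonly cited pitfall, offered here as an illustration and not necessarily the reason the author had in mind, is that subset() looks up names in the data frame first. The data frame d2 and the variable limit below are hypothetical.

```r
# Hypothetical data frame that happens to have a column called limit
d2 <- data.frame(x = c(1, 5, 3), limit = c(0, 0, 0))

# A variable with the same name in the calling environment
limit <- 4

# subset() finds the column limit first, so x is compared with 0 and no row matches
n_subset <- nrow(subset(d2, x < limit))

# Bracket subsetting uses exactly the objects that are written out
n_bracket <- nrow(d2[which(d2$x < limit), ])
```

Here n_subset is 0 while n_bracket is 2, which is the kind of silent difference that is hard to spot inside a function.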

That's it!

That's all you need to know about subsetting in R.

Questions? Email me: yang.jianc@husky.neu.edu