Learning Data Science with R:

R’s bracket notation, or bracket operators, is a frequent source of confusion for new users. Here, I provide an introduction to the operators and repeat examples using the equivalent dplyr functions.

Introduction

Brackets let you select, or subset, data from a vector, matrix, array, list or data frame. Exactly how this works and the results you get depend, in part, on the data type.

You can read the official help for bracket operators by typing ?'[.data.frame', including the single-quotes. There’s also a great little article at http://www.r-bloggers.com/r-accessors-explained/, and some more examples here: http://www.ats.ucla.edu/stat/r/faq/subset_R.htm

As explained in the r-bloggers post, R’s operators can be summarized as:

[ for subsets,
[[ for extracting items, and
$ for extracting by name.

Using dplyr, the equivalents are:

filter() and slice() for subsets by rows,
select() for subsetting by columns, and
magrittr::extract2() for extracting by column name or index.
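
The difference between the three base operators is easiest to see on a plain list. The following is a minimal sketch; the list x and its contents are made up for illustration and are not part of the examples that follow.

```r
# A small named list (hypothetical example data)
x <- list(a = 1:3, b = "z")

sub     <- x["a"]    # `[` keeps the container: a list of length 1
item    <- x[["a"]]  # `[[` extracts the element itself: an integer vector
by_name <- x$a       # `$` extracts by name, like x[["a"]]

class(sub)                # "list"
class(item)               # "integer"
identical(item, by_name)  # TRUE
```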

Subsetting rows and columns

The best way to learn is to play around with a “toy” data set.

Start with a data frame:

my_df <- data.frame(a = runif(10), 
                    b = rnorm(10, 10), 
                    c = letters[1:10], 
                    d = LETTERS[1:2])

my_df

##             a         b c d
## 1  0.10709793 11.529173 a A
## 2  0.92978064  9.894295 b B
## 3  0.33239152  8.760886 c A
## 4  0.15785918  9.618424 d B
## 5  0.31492607 10.912449 e A
## 6  0.76551955 10.986973 f B
## 7  0.02208064 11.007789 g A
## 8  0.28574694 11.336040 h B
## 9  0.58989873  9.285833 i A
## 10 0.24728872 11.803678 j B

my_df[1:3] (no comma) will subset my_df, returning the first three columns as a data frame.
my_df[1:3, ] (with comma, numbers to left of the comma) will subset my_df and return the first three rows as a data frame.
my_df[, 1:3] (with comma, numbers to right of the comma) will subset my_df and return the first three columns as a data frame, the same as my_df[1:3].

my_df[1:3]

##             a         b c
## 1  0.10709793 11.529173 a
## 2  0.92978064  9.894295 b
## 3  0.33239152  8.760886 c
## 4  0.15785918  9.618424 d
## 5  0.31492607 10.912449 e
## 6  0.76551955 10.986973 f
## 7  0.02208064 11.007789 g
## 8  0.28574694 11.336040 h
## 9  0.58989873  9.285833 i
## 10 0.24728872 11.803678 j

my_df[1:3, ]

##           a         b c d
## 1 0.1070979 11.529173 a A
## 2 0.9297806  9.894295 b B
## 3 0.3323915  8.760886 c A

my_df[, 1:3]

##             a         b c
## 1  0.10709793 11.529173 a
## 2  0.92978064  9.894295 b
## 3  0.33239152  8.760886 c
## 4  0.15785918  9.618424 d
## 5  0.31492607 10.912449 e
## 6  0.76551955 10.986973 f
## 7  0.02208064 11.007789 g
## 8  0.28574694 11.336040 h
## 9  0.58989873  9.285833 i
## 10 0.24728872 11.803678 j

In dplyr, we would use

library(dplyr)
my_df %>% select(a:c)

##             a         b c
## 1  0.10709793 11.529173 a
## 2  0.92978064  9.894295 b
## 3  0.33239152  8.760886 c
## 4  0.15785918  9.618424 d
## 5  0.31492607 10.912449 e
## 6  0.76551955 10.986973 f
## 7  0.02208064 11.007789 g
## 8  0.28574694 11.336040 h
## 9  0.58989873  9.285833 i
## 10 0.24728872 11.803678 j

my_df %>% slice(1:3)

## # A tibble: 3 x 4
##       a     b c     d    
##   <dbl> <dbl> <fct> <fct>
## 1 0.107 11.5  a     A    
## 2 0.930  9.89 b     B    
## 3 0.332  8.76 c     A

Subsetting and extracting

We can extract specific rows or columns, for example extracting column a as a vector:

my_df[, 1]

##  [1] 0.10709793 0.92978064 0.33239152 0.15785918 0.31492607 0.76551955
##  [7] 0.02208064 0.28574694 0.58989873 0.24728872

my_df$a

##  [1] 0.10709793 0.92978064 0.33239152 0.15785918 0.31492607 0.76551955
##  [7] 0.02208064 0.28574694 0.58989873 0.24728872

my_df[[1]]

##  [1] 0.10709793 0.92978064 0.33239152 0.15785918 0.31492607 0.76551955
##  [7] 0.02208064 0.28574694 0.58989873 0.24728872

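In this case, the single bracket, usually used for subsetting, also works as an extractor: when a row-and-column subset of a data.frame leaves only one column, R simplifies the result to a vector by default.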
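
This simplification is controlled by the drop argument of `[.data.frame`. As a small sketch (the data frame toy is a stand-in invented for this example), setting drop = FALSE keeps the result as a data frame:

```r
# Hypothetical two-column data frame
toy <- data.frame(a = c(0.1, 0.9, 0.3), b = c(11.5, 9.9, 8.8))

as_vector <- toy[, 1]                # default drop = TRUE: numeric vector
as_frame  <- toy[, 1, drop = FALSE]  # drop = FALSE: one-column data frame
no_comma  <- toy[1]                  # no comma: always a data frame

is.numeric(as_vector)    # TRUE
is.data.frame(as_frame)  # TRUE
is.data.frame(no_comma)  # TRUE
```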

The dplyr verbs used here, such as select(), return a data frame, but we can extract a column as a vector using either the base function unlist() or the magrittr function extract2().

my_df %>% select(a) %>% unlist(use.names = FALSE)

##  [1] 0.10709793 0.92978064 0.33239152 0.15785918 0.31492607 0.76551955
##  [7] 0.02208064 0.28574694 0.58989873 0.24728872

library(magrittr)  # extract2() is exported by magrittr, not by dplyr
my_df %>% extract2('a')

##  [1] 0.10709793 0.92978064 0.33239152 0.15785918 0.31492607 0.76551955
##  [7] 0.02208064 0.28574694 0.58989873 0.24728872
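
Another option, assuming a version of dplyr that provides pull() (0.7.0 or later), is to stay within dplyr. The sketch below uses a made-up data frame toy rather than my_df so that it runs on its own:

```r
library(dplyr)

# Hypothetical stand-in for my_df
toy <- data.frame(a = c(0.1, 0.9, 0.3),
                  c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

by_name  <- toy %>% pull(a)  # extract column a as a vector
by_index <- toy %>% pull(1)  # extract the first column by position

identical(by_name, toy$a)    # TRUE
identical(by_index, toy$a)   # TRUE
```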

Combining subsetting and extraction allows us to return a vector (rather than a data frame) containing specific rows of column a.

# subset rows and columns of my_df
my_df[1:3, 1]

## [1] 0.1070979 0.9297806 0.3323915

# subset rows of my_df, then extract column a
my_df[1:3, ]$a

## [1] 0.1070979 0.9297806 0.3323915

# extract column a, then subset by rows
my_df$a[1:3]

## [1] 0.1070979 0.9297806 0.3323915

Or, with dplyr:

my_df %>% slice(1:3) %>% extract2('a')

## [1] 0.1070979 0.9297806 0.3323915

We can subset based on an arbitrary list of rows:

index <- c(1, 3, 5, 10)

my_df[index,]

##            a         b c d
## 1  0.1070979 11.529173 a A
## 3  0.3323915  8.760886 c A
## 5  0.3149261 10.912449 e A
## 10 0.2472887 11.803678 j B

my_df %>% slice(index)

## # A tibble: 4 x 4
##       a     b c     d    
##   <dbl> <dbl> <fct> <fct>
## 1 0.107 11.5  a     A    
## 2 0.332  8.76 c     A    
## 3 0.315 10.9  e     A    
## 4 0.247 11.8  j     B

Alternatively, we can use a boolean (logical) vector to extract rows. In this case, the vector should normally have one element per row of the data frame (i.e. length(vector) == nrow(data.frame)), and R will return the rows matching TRUE values in the vector.

index <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
my_df[index, ]

##            a         b c d
## 1  0.1070979 11.529173 a A
## 3  0.3323915  8.760886 c A
## 5  0.3149261 10.912449 e A
## 10 0.2472887 11.803678 j B

This returns exactly the same result as the previous example. Functions that return either row numbers or boolean values for each row can be used to subset.
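
A few base R functions of this kind are sketched below. The data frame toy is invented for the example; order() returns row numbers, while %in% and duplicated() return one TRUE or FALSE per row.

```r
# Hypothetical data frame with a repeated value in column a
toy <- data.frame(a = c(0.4, 0.1, 0.8, 0.1),
                  c = c("w", "x", "y", "z"),
                  stringsAsFactors = FALSE)

# order() returns row numbers, here sorting the rows by column a
sorted <- toy[order(toy$a), ]

# %in% returns a logical vector: TRUE where c is "w" or "y"
picked <- toy[toy$c %in% c("w", "y"), ]

# duplicated() is also logical; negate it to keep first occurrences only
unique_a <- toy[!duplicated(toy$a), ]

sorted$c    # "x" "z" "w" "y"
picked$c    # "w" "y"
unique_a$c  # "w" "x" "y"
```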

The equivalent in dplyr is to filter() based on a condition; you cannot simply pass a boolean vector to slice(). We cover filter() in the next section.
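
If you already have a logical vector in hand, one workaround is sketched below. It assumes a reasonably recent dplyr; toy and keep are hypothetical names. The vector can be given to filter() directly, or converted to row numbers with which() before being passed to slice().

```r
library(dplyr)

# Hypothetical data and a logical vector with one element per row
toy <- data.frame(a = c(0.1, 0.9, 0.3, 0.7),
                  c = c("w", "x", "y", "z"),
                  stringsAsFactors = FALSE)
keep <- c(TRUE, FALSE, TRUE, FALSE)

via_filter <- toy %>% filter(keep)        # filter() accepts logical values
via_slice  <- toy %>% slice(which(keep))  # slice() needs row numbers

via_filter$c  # "w" "y"
via_slice$c   # "w" "y"
```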

Conditionals

R also allows us to subset based on a condition. For instance, the two examples below return the rows of my_df where the values in column a are greater than 0.5:

my_df[my_df$a > 0.5, ]

##           a         b c d
## 2 0.9297806  9.894295 b B
## 6 0.7655195 10.986973 f B
## 9 0.5898987  9.285833 i A

my_df[which(my_df$a > 0.5), ]

##           a         b c d
## 2 0.9297806  9.894295 b B
## 6 0.7655195 10.986973 f B
## 9 0.5898987  9.285833 i A

In the second example, which() returns a vector of the row numbers where the condition is true, while in the first example the condition returns a boolean value, TRUE or FALSE, for each row.
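
The two forms give the same rows for my_df, but they can differ when the column contains missing values. The sketch below uses a made-up data frame with an NA to show this: a logical index that contains NA produces a row of NAs, while which() silently drops it.

```r
# Hypothetical data frame with a missing value in column a
toy <- data.frame(a = c(0.2, NA, 0.8),
                  c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

with_logical <- toy[toy$a > 0.5, ]         # index is FALSE, NA, TRUE
with_which   <- toy[which(toy$a > 0.5), ]  # index is 3

nrow(with_logical)  # 2: an all-NA row plus row 3
nrow(with_which)    # 1: row 3 only
```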

With dplyr, we would use filter():

my_df %>% filter(a > 0.5)

##           a         b c d
## 1 0.9297806  9.894295 b B
## 2 0.7655195 10.986973 f B
## 3 0.5898987  9.285833 i A

R will also recycle shorter vectors for indexing, so if you set index <- c(TRUE, FALSE) (just two elements), and used my_df[index, ], you’d get every other row of my_df.
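
As a quick sketch of recycling with a base data.frame (the six-row toy below is invented for the example):

```r
# Hypothetical six-row data frame
toy <- data.frame(id = 1:6,
                  c = letters[1:6],
                  stringsAsFactors = FALSE)

every_other <- toy[c(TRUE, FALSE), ]         # pattern recycled over 6 rows
every_third <- toy[c(FALSE, FALSE, TRUE), ]  # keeps rows 3 and 6

every_other$id  # 1 3 5
every_third$id  # 3 6
```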

We can do more complicated subsetting, e.g. with more than one condition:

my_df[which(my_df$d == "A" & my_df$a > 0.5),]

##           a        b c d
## 9 0.5898987 9.285833 i A

my_df %>% filter(d == "A", a > 0.5)

##           a        b c d
## 1 0.5898987 9.285833 i A

In the first example, the ampersand (&) means “and,” so that the above condition reads “identify the rows of my_df where the value in column d equals ‘A’ and the value in column a is greater than 0.5.” Using dplyr we can either separate filter arguments with a comma or use the ampersand.
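
The equivalence of the comma and the ampersand in filter() can be checked directly. This is a minimal sketch on a made-up data frame:

```r
library(dplyr)

# Hypothetical data: only the third row has d == "A" and a > 0.5
toy <- data.frame(a = c(0.1, 0.9, 0.6, 0.7),
                  d = c("A", "B", "A", "B"),
                  stringsAsFactors = FALSE)

with_comma <- toy %>% filter(d == "A", a > 0.5)
with_amp   <- toy %>% filter(d == "A" & a > 0.5)

all.equal(with_comma, with_amp)  # TRUE
with_comma$a                     # 0.6
```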

If we wanted all rows where either d == "A" or a > 0.5, we would use the OR operator (|):

my_df[which(my_df$d == "A" | my_df$a > 0.5),]

##            a         b c d
## 1 0.10709793 11.529173 a A
## 2 0.92978064  9.894295 b B
## 3 0.33239152  8.760886 c A
## 5 0.31492607 10.912449 e A
## 6 0.76551955 10.986973 f B
## 7 0.02208064 11.007789 g A
## 9 0.58989873  9.285833 i A

my_df %>% filter(d == "A" | a > 0.5)

##            a         b c d
## 1 0.10709793 11.529173 a A
## 2 0.92978064  9.894295 b B
## 3 0.33239152  8.760886 c A
## 4 0.31492607 10.912449 e A
## 5 0.76551955 10.986973 f B
## 6 0.02208064 11.007789 g A
## 7 0.58989873  9.285833 i A

If we just wanted to subset and extract the first column, we could do any of the following using the bracket operators:

# first subset my_df, then extract a
my_df[which(my_df$d == "A" & my_df$a > 0.5),]$a

## [1] 0.5898987

# subset by rows, and extract column 1
my_df[which(my_df$d == "A" & my_df$a > 0.5), 1]

## [1] 0.5898987

# first extract column a, then subset by rows
my_df$a[which(my_df$d == "A" & my_df$a > 0.5)]

## [1] 0.5898987

The equivalent using dplyr is either of the following, depending on whether you want a data frame or a vector:

library(dplyr)
my_df %>% 
  filter(d == "A" & a > 0.5) %>% 
  select(a)

##           a
## 1 0.5898987

my_df %>% filter(d == "A", a > 0.5) %>% 
  extract2('a')

## [1] 0.5898987

Extracting with `[[`

The double-bracket operator, [[, also works as an extractor, similar to the dollar sign ($), so

my_df[[1]]

##  [1] 0.10709793 0.92978064 0.33239152 0.15785918 0.31492607 0.76551955
##  [7] 0.02208064 0.28574694 0.58989873 0.24728872

is the same as my_df$a. Just as with $, you can combine the single-bracket operator ([) with the double-bracket operator ([[) to subset and extract in one line:

my_df[1:3,][[1]]

## [1] 0.1070979 0.9297806 0.3323915

This subsets my_df to its first three rows, then extracts the first column as a vector, which is equivalent to either of the following:

my_df[1:3,]$a

## [1] 0.1070979 0.9297806 0.3323915

my_df$a[1:3]

## [1] 0.1070979 0.9297806 0.3323915

The advantage of the double bracket operator is that you can use it in a function or program where the user (or some conditional) selects the column to extract, and you as the programmer don’t have to figure out the column name.
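
A sketch of that idea follows. The helper first_values() and the data frame toy are hypothetical names invented for this example; the point is that the column arrives in a variable, which `[[` accepts and $ does not.

```r
# Hypothetical helper: return the first n values of a column chosen at run time
first_values <- function(df, column, n = 3) {
  df[[column]][seq_len(min(n, nrow(df)))]
}

toy <- data.frame(a = c(0.1, 0.9, 0.3, 0.7),
                  b = c(11.5, 9.9, 8.8, 9.6))

chosen <- "b"                # e.g. supplied by the user
first_values(toy, chosen)    # column picked by name: 11.5 9.9 8.8
first_values(toy, 1, n = 2)  # column picked by position: 0.1 0.9

# `$` looks for a column literally named "chosen" and finds none
is.null(toy$chosen)          # TRUE
```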

The equivalent with dplyr would be

my_df %>% slice(1:3) %>% 
  extract2(1)

## [1] 0.1070979 0.9297806 0.3323915

`[` with data.frame vs. dplyr’s tibble

Lastly, there is a difference between base R and dplyr that might trip you up if you ever write code using either a data.frame or a tibble and then later change your code to use the other object class. As the help at ?'[.data.frame' states, coercion is complicated in R, so somewhat surprisingly, the two subsetting operations below, which differ only in the number of columns being subsetted, return a vector in the first case and a data frame in the second.

my_df[1:3, 1]

## [1] 0.1070979 0.9297806 0.3323915

my_df[1:3, 1:2]

##           a         b
## 1 0.1070979 11.529173
## 2 0.9297806  9.894295
## 3 0.3323915  8.760886

dplyr’s tibble corrects this inconsistency by always returning a data frame when using single brackets.

library(dplyr)
my2_df <- as_tibble(my_df)  # as_data_frame() is deprecated in favor of as_tibble()

class(my_df)

## [1] "data.frame"

class(my2_df)

## [1] "tbl_df"     "tbl"        "data.frame"

my_df[1:3, 1]

## [1] 0.1070979 0.9297806 0.3323915

my2_df[1:3, 1]

## # A tibble: 3 x 1
##       a
##   <dbl>
## 1 0.107
## 2 0.930
## 3 0.332

Even though the two data frames are subsetted the same way, they don’t return the same type of object: the base R data.frame returns a vector, while the tibble returns a data frame. With the single-bracket operator, a tibble therefore always returns the same data type regardless of the number of columns, while a base R data.frame returns a vector when only one column is selected and a data frame when more than one column is selected.
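
If code has to work with either class and needs a vector, one approach is to extract with `[[`, which behaves the same for both. The sketch below assumes dplyr is installed (it re-exports as_tibble()); toy_df and toy_tbl are made-up names.

```r
library(dplyr)

# The same hypothetical data as a base data.frame and as a tibble
toy_df  <- data.frame(a = c(0.1, 0.9, 0.3), b = c(11.5, 9.9, 8.8))
toy_tbl <- as_tibble(toy_df)

# Single brackets with one column: the result type depends on the class
is.data.frame(toy_df[1:2, 1])   # FALSE, a numeric vector
is.data.frame(toy_tbl[1:2, 1])  # TRUE, a 2 x 1 tibble

# Double brackets: a vector in both cases
vec_df  <- toy_df[["a"]][1:2]
vec_tbl <- toy_tbl[["a"]][1:2]
identical(vec_df, vec_tbl)      # TRUE
```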

The two data types behave the same when more than one column is given:

my_df[, 1:2]

##             a         b
## 1  0.10709793 11.529173
## 2  0.92978064  9.894295
## 3  0.33239152  8.760886
## 4  0.15785918  9.618424
## 5  0.31492607 10.912449
## 6  0.76551955 10.986973
## 7  0.02208064 11.007789
## 8  0.28574694 11.336040
## 9  0.58989873  9.285833
## 10 0.24728872 11.803678

my2_df[, 1:2]

## # A tibble: 10 x 2
##         a     b
##     <dbl> <dbl>
##  1 0.107  11.5 
##  2 0.930   9.89
##  3 0.332   8.76
##  4 0.158   9.62
##  5 0.315  10.9 
##  6 0.766  11.0 
##  7 0.0221 11.0 
##  8 0.286  11.3 
##  9 0.590   9.29
## 10 0.247  11.8

Learning Data Science with R:

Subsetting, Extracting and Bracket Notation

Thomas Hopper

January 31, 2018
