Some notes on subsetting in R

This is an annotated demonstration of subsetting in R, using the built-in iris dataset.

data(iris)
#for those not familiar with the data, 
#let's run str() and summary()
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Now let's make a graph of the whole dataset, in a slightly roundabout way that is better suited to demonstrating subsetting: first draw empty axes with type="n", then add the points

plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length, iris$Petal.Width, pch=19, col=iris$Species)


Now we can make a subset using the common [] operator

plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[iris$Species=="setosa"], iris$Petal.Width[iris$Species=="setosa"], pch=19, col=iris$Species[iris$Species=="setosa"])


This basically says: graph the Petal Length (of those cases where the Species is setosa) against the Petal Width (of those same cases), and colour the points by Species (again for those same cases).

There is actually a really common error people make at this point. If you put in the code

plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[iris$Species=="setosa"], iris$Petal.Width, pch=19, col=iris$Species[iris$Species=="setosa"])

you will get the error message

Error in xy.coords(x, y, xlabel, ylabel, log) :
  ‘x’ and ‘y’ lengths differ

This happens because you are trying to plot something that has been subset (Petal Length, now 50 values) against something that has not (Petal Width, still 150 values), and R has no way to pair up vectors of different lengths.
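A quick check of the vector lengths makes the mismatch visible before any plotting is attempted. This is only a small sketch; the names is_setosa, x and y are illustrative and not part of the original code.

```r
# Logical test: TRUE for setosa rows, FALSE for every other row
is_setosa <- iris$Species == "setosa"

x <- iris$Petal.Length[is_setosa]  # subset: setosa rows only
y <- iris$Petal.Width              # not subset: every row

length(x)  # 50
length(y)  # 150

# Applying the same test to both sides keeps the lengths equal
length(x) == length(iris$Petal.Width[is_setosa])  # TRUE
```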

The expression inside the [] is the criterion for choosing the entries, so we can see what is going on in a slightly less crowded way by storing the criterion in a variable.

criteria <- iris$Species=="setosa"
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])

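It can help to look at what the criteria variable actually holds. As a rough illustration, it is a logical vector with one TRUE or FALSE per row of iris, and the [] operator keeps the entries where it is TRUE.

```r
criteria <- iris$Species == "setosa"

class(criteria)    # "logical"
length(criteria)   # 150, one value per row of iris
sum(criteria)      # 50, because each TRUE counts as 1
head(criteria, 3)  # TRUE TRUE TRUE (the first rows of iris are setosa)
```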

The double equals sign == tests whether two things are equal, and because we are comparing against text we put the value in quotes. Other common comparison operators are <, <=, >, >= and !=, and ! negates a condition (it means “it is not the case that”)

criteria <-  iris$Sepal.Width < 3.1
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])

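As a small sketch of the negating operators, the two ways of saying "not setosa" below give the same logical vector. The variable names are made up for this example.

```r
# "Not equal to" and "not (equal to)" select the same rows
not_setosa_a <- iris$Species != "setosa"
not_setosa_b <- !(iris$Species == "setosa")

identical(not_setosa_a, not_setosa_b)  # TRUE
sum(not_setosa_a)                      # 100 (versicolor plus virginica)

# The same idea with numbers: >= is the negation of <
# (this holds here because Sepal.Width has no missing values)
identical(iris$Sepal.Width >= 3.1, !(iris$Sepal.Width < 3.1))  # TRUE
```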

If we want a narrower result we can use the AND symbol & to combine different criteria

criteria <- iris$Species=="setosa" & iris$Sepal.Width < 3.1
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])


If we want a broader result we can use the OR symbol | to accept any one of several criteria. By this point it is also a good idea to use parentheses () to make the grouping clear (as when learning maths, the stuff inside the parentheses is sorted out first)

criteria <- iris$Species=="setosa" & (iris$Sepal.Width < 3.1 | iris$Sepal.Width > 3.5)
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])


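Because of the parentheses, we get a very different set of results from what we would get without them. R evaluates & before |, so the version below means (setosa AND width < 3.1) OR width > 3.5, and that last condition lets in flowers of any species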

criteria <- iris$Species=="setosa" & iris$Sepal.Width < 3.1 | iris$Sepal.Width > 3.5
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])

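To check how R groups the two versions, this sketch builds both and looks at which species each one lets through. The intermediate names (setosa, narrow, wide) are only for illustration.

```r
setosa <- iris$Species == "setosa"
narrow <- iris$Sepal.Width < 3.1
wide   <- iris$Sepal.Width > 3.5

with_parens    <- setosa & (narrow | wide)
without_parens <- setosa & narrow | wide

# Without parentheses, R groups the & part first
identical(without_parens, (setosa & narrow) | wide)  # TRUE

# Only setosa survives when the parentheses are in place...
all(iris$Species[with_parens] == "setosa")      # TRUE
# ...but wide-sepalled flowers of other species slip in without them
any(iris$Species[without_parens] != "setosa")   # TRUE
```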

The %in% operator is useful for testing against a set of values. Values that never occur in the data (here "elephant" and "dingo") simply match nothing

criteria <- iris$Species %in% c("setosa","elephant","dingo")
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])

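One way to think about %in% is as shorthand for a chain of == tests joined with |. The sketch below, with an illustrative wanted vector, compares the two forms.

```r
wanted <- c("setosa", "virginica", "elephant")

by_in <- iris$Species %in% wanted
by_or <- iris$Species == "setosa" | iris$Species == "virginica"

# "elephant" matches no rows, so both forms agree
identical(by_in, by_or)  # TRUE
sum(by_in)               # 100
```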

You can also subset numerically, picking out a range of row numbers

criteria <- 5:10
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])


Or you can use c() to provide a vector of row numbers that includes several ranges

criteria <- c(2, 5:10, 12, 18:32)
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])

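Logical and numeric criteria are two views of the same selection. As a hedged sketch, base R's which() turns a logical test into the matching row numbers, and either form picks out the same values.

```r
is_setosa <- iris$Species == "setosa"

# which() returns the positions where the test is TRUE
rows <- which(is_setosa)
range(rows)  # 1 50 (the setosa rows come first in iris)

# Subsetting by the logical vector or by the row numbers is equivalent
identical(iris$Petal.Length[is_setosa], iris$Petal.Length[rows])  # TRUE

# c() flattens single numbers and ranges into one vector of row numbers
length(c(2, 5:10, 12, 18:32))  # 23
```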

Up until now we have been subsetting a particular variable, but you can also subset an entire data frame by using the form dataframe[rowcriteria, columncriteria]

rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[rowcriteria,columncriteria])

##   Sepal.Length   Sepal.Width  
##  Min.   :4.30   Min.   :2.30  
##  1st Qu.:4.80   1st Qu.:3.20  
##  Median :5.00   Median :3.40  
##  Mean   :5.01   Mean   :3.43  
##  3rd Qu.:5.20   3rd Qu.:3.67  
##  Max.   :5.80   Max.   :4.40

The comma separates the row criteria from the column criteria. As long as the comma is there, you can leave either side empty to get all the rows or all the columns.

rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[,columncriteria])

##   Sepal.Length   Sepal.Width  
##  Min.   :4.30   Min.   :2.00  
##  1st Qu.:5.10   1st Qu.:2.80  
##  Median :5.80   Median :3.00  
##  Mean   :5.84   Mean   :3.06  
##  3rd Qu.:6.40   3rd Qu.:3.30  
##  Max.   :7.90   Max.   :4.40

rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[rowcriteria,])

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width   
##  Min.   :4.30   Min.   :2.30   Min.   :1.00   Min.   :0.100  
##  1st Qu.:4.80   1st Qu.:3.20   1st Qu.:1.40   1st Qu.:0.200  
##  Median :5.00   Median :3.40   Median :1.50   Median :0.200  
##  Mean   :5.01   Mean   :3.43   Mean   :1.46   Mean   :0.246  
##  3rd Qu.:5.20   3rd Qu.:3.67   3rd Qu.:1.57   3rd Qu.:0.300  
##  Max.   :5.80   Max.   :4.40   Max.   :1.90   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
##
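A compact way to compare the three forms is to look at the dimensions of each result with dim(), which reports rows first and columns second. This is just a sketch reusing the same criteria as above.

```r
rowcriteria <- iris$Species == "setosa"
columncriteria <- 1:2

dim(iris[rowcriteria, columncriteria])  # 50 2   (some rows, some columns)
dim(iris[, columncriteria])             # 150 2  (all rows, some columns)
dim(iris[rowcriteria, ])                # 50 5   (some rows, all columns)
```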

There is another really common mistake to watch for at this point. If you are only selecting rows, it is really easy to leave off the comma

rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[rowcriteria])

you will get the error message

Error in ‘[.data.frame’(iris, rowcriteria) : undefined columns selected

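Because we are dealing with a block of data, without the comma R treats the data frame as a list of columns and applies the criteria to columns rather than rows. Here that means 150 TRUE/FALSE values are matched against only 5 columns, so R complains about undefined columns.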
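To see what a single index without a comma does to a data frame, the sketch below first uses a short numeric index and then catches the error from the row criteria with tryCatch(). The msg variable is only there for illustration, and the exact wording of the message may vary with your R version or locale.

```r
# One index, no comma: the numbers pick columns, not rows
names(iris[1:2])  # "Sepal.Length" "Sepal.Width"

rowcriteria <- iris$Species == "setosa"

# Catch the error instead of stopping the script
msg <- tryCatch(
  {
    iris[rowcriteria]
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg  # typically "undefined columns selected"

# With the comma, the same criteria select rows as intended
nrow(iris[rowcriteria, ])  # 50
```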

You can also use a vector of the column numbers you want

rowcriteria <- iris$Species=="setosa"
columncriteria <- c(1:2,4)
summary(iris[rowcriteria,columncriteria])

##   Sepal.Length   Sepal.Width    Petal.Width   
##  Min.   :4.30   Min.   :2.30   Min.   :0.100  
##  1st Qu.:4.80   1st Qu.:3.20   1st Qu.:0.200  
##  Median :5.00   Median :3.40   Median :0.200  
##  Mean   :5.01   Mean   :3.43   Mean   :0.246  
##  3rd Qu.:5.20   3rd Qu.:3.67   3rd Qu.:0.300  
##  Max.   :5.80   Max.   :4.40   Max.   :0.600

A vector of column names is fine as well

rowcriteria <- iris$Species=="setosa"
columncriteria <- c("Petal.Length","Petal.Width")
summary(iris[rowcriteria,columncriteria])

##   Petal.Length   Petal.Width   
##  Min.   :1.00   Min.   :0.100  
##  1st Qu.:1.40   1st Qu.:0.200  
##  Median :1.50   Median :0.200  
##  Mean   :1.46   Mean   :0.246  
##  3rd Qu.:1.57   3rd Qu.:0.300  
##  Max.   :1.90   Max.   :0.600

If you are getting your data into a particular form (for instance for use in a statistical test) you will most likely want to store it in a new object

rowcriteria <- iris$Species=="setosa"
columncriteria <- c("Petal.Length","Petal.Width")
mysubset <- iris[rowcriteria,columncriteria]
summary(mysubset)

##   Petal.Length   Petal.Width   
##  Min.   :1.00   Min.   :0.100  
##  1st Qu.:1.40   1st Qu.:0.200  
##  Median :1.50   Median :0.200  
##  Mean   :1.46   Mean   :0.246  
##  3rd Qu.:1.57   3rd Qu.:0.300  
##  Max.   :1.90   Max.   :0.600

You can also subset data by removing columns (rather than specifying the columns you want)

rowcriteria <- iris$Species=="setosa"
columncriteria <- names(iris) %in% c("Petal.Length","Petal.Width")
mysubset <- iris[rowcriteria,!(columncriteria)]
summary(mysubset)

##   Sepal.Length   Sepal.Width         Species  
##  Min.   :4.30   Min.   :2.30   setosa    :50  
##  1st Qu.:4.80   1st Qu.:3.20   versicolor: 0  
##  Median :5.00   Median :3.40   virginica : 0  
##  Mean   :5.01   Mean   :3.43                  
##  3rd Qu.:5.20   3rd Qu.:3.67                  
##  Max.   :5.80   Max.   :4.40

This basically says “Find the columns with those names. You see them. Those are the ones we don’t want”
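Printing the intermediate values shows how the negation works. As a sketch, columncriteria marks the unwanted columns with TRUE, and ! flips that so the remaining columns are kept. The setdiff() alternative and the keep variable are additions for comparison, not part of the original code.

```r
columncriteria <- names(iris) %in% c("Petal.Length", "Petal.Width")
columncriteria                 # FALSE FALSE TRUE TRUE FALSE

# Negating the test leaves the columns we do want
names(iris)[!columncriteria]   # "Sepal.Length" "Sepal.Width" "Species"

# An alternative: build the list of names to keep with setdiff()
keep <- setdiff(names(iris), c("Petal.Length", "Petal.Width"))
identical(names(iris[, keep]), names(iris)[!columncriteria])  # TRUE
```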

Subsetting by removal is also possible by using negative numbers

rowcriteria <- iris$Species=="setosa"
columncriteria <- c(-1,-3:-5)
mysubset <- iris[rowcriteria,columncriteria]
summary(mysubset)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.30    3.20    3.40    3.43    3.68    4.40
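The summary above is a single row of numbers rather than a table because only one column (Sepal.Width) is left, and by default R simplifies a one-column result to a plain vector. If you would rather keep a data frame, the drop = FALSE argument of [ prevents that simplification. The names as_vector and as_frame below are illustrative.

```r
rowcriteria <- iris$Species == "setosa"
columncriteria <- c(-1, -3:-5)   # removes columns 1, 3, 4 and 5

as_vector <- iris[rowcriteria, columncriteria]
as_frame  <- iris[rowcriteria, columncriteria, drop = FALSE]

class(as_vector)  # "numeric"
class(as_frame)   # "data.frame"
dim(as_frame)     # 50 1
names(as_frame)   # "Sepal.Width"
```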

You can remove rows by a similar use of negative numbers

rowcriteria <- c(-1:-25)
columncriteria <- c(-1,-3:-5)
mysubset <- iris[rowcriteria,columncriteria]
summary(mysubset)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.70    3.00    2.97    3.20    4.20
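A common alternative spelling for "drop rows 1 to 25" is -(1:25), which gives the same result as c(-1:-25). The sketch below also shows a pitfall worth knowing: -1:25 without parentheses means the sequence from -1 up to 25, which mixes negative and positive numbers and is rejected as an index. A plain vector is used for that part to keep the example small.

```r
a <- iris[c(-1:-25), ]
b <- iris[-(1:25), ]

identical(a, b)  # TRUE
nrow(a)          # 125 (150 rows minus the 25 removed)

# -1:25 is (-1):25, so negative and positive indices get mixed
result <- tryCatch((1:30)[-1:25], error = function(e) "error")
result  # "error"
```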

Subsetting is not quite the same thing as deleting a variable. Subsetting returns a new object and leaves the original data untouched, so the parts you left out are still available later; deleting removes a variable from the object itself. You can sometimes use deleting in the same way, though, by first making a copy of the data

mysubset <- iris
mysubset$Species <- NULL
summary(mysubset)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5
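As a small check of the difference, the sketch below confirms that iris itself is unchanged by subsetting, and that deleting a column only affects the object it is deleted from. The name mycopy is illustrative.

```r
# Subsetting creates a new object; iris keeps all rows and columns
mysubset <- iris[iris$Species == "setosa", 1:2]
dim(mysubset)  # 50 2
dim(iris)      # 150 5

# Deleting works on the copy, so the original still has Species
mycopy <- iris
mycopy$Species <- NULL
"Species" %in% names(mycopy)  # FALSE
"Species" %in% names(iris)    # TRUE
```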


David Hood

3 April 2014