This is an annotated demonstration of subsetting in R, using the built-in iris dataset.
data(iris)
#for those not familiar with the data,
#let's run str() and summary()
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Now let's make a graph of the whole dataset, in a slightly roundabout way that is better suited to demonstrating subsetting: first draw an empty plot with type="n", then add the points separately
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length, iris$Petal.Width, pch=19, col=iris$Species)
Now we can make a subset using the common [] operator
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[iris$Species=="setosa"], iris$Petal.Width[iris$Species=="setosa"], pch=19, col=iris$Species[iris$Species=="setosa"])
This is basically saying: graph the Petal Length of those cases where the Species is setosa against the Petal Width of those same cases. The same criteria have to be applied to everything passed to points(), including the colours.
There is actually a really common error people make at this point. If you put in the code
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[iris$Species=="setosa"], iris$Petal.Width, pch=19, col=iris$Species[iris$Species=="setosa"])
you will get the error message
Error in xy.coords(x, y, xlabel, ylabel, log) :
‘x’ and ‘y’ lengths differ
This happens because you are trying to plot something which has a subset on it (the 50 Petal Lengths for setosa) against something which does not (all 150 Petal Widths), and R has no way to pair up two vectors of different lengths.
The expression inside the [] is the criteria for choosing the entries (a vector of TRUE/FALSE values, one per row), so we can see what is going on in a slightly less crowded way by storing the criteria in a variable.
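A quick way to diagnose this error is to compare the lengths of the two vectors before plotting. This is only a sketch using the built-in iris data; the names x_subset, y_full and y_subset are made up for illustration.

```r
# Illustrative names; not part of the original walkthrough
x_subset <- iris$Petal.Length[iris$Species == "setosa"]  # only the setosa rows
y_full   <- iris$Petal.Width                             # every row

length(x_subset)  # 50
length(y_full)    # 150

# Applying the same criteria to both vectors makes the lengths agree
y_subset <- iris$Petal.Width[iris$Species == "setosa"]
length(x_subset) == length(y_subset)  # TRUE
```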
criteria <- iris$Species=="setosa"
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
The double equals sign == tests whether two values are equal, and because Species is text, the value we compare it against goes in quotes. Other common operators are <, <=, >, >=, != (not equal), and ! (which means “it is not the case that”)
criteria <- iris$Sepal.Width < 3.1
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
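To see what != and ! do without drawing a plot, you can count how many rows meet the criteria. The following is a small sketch; is_setosa and not_setosa are illustrative names, and sum() works here because TRUE counts as 1 and FALSE as 0.

```r
# Illustrative names; not part of the original walkthrough
is_setosa  <- iris$Species == "setosa"
not_setosa <- iris$Species != "setosa"

# ! flips every TRUE to FALSE and vice versa, so these two agree
identical(!is_setosa, not_setosa)  # TRUE

# sum() counts the TRUE values, i.e. how many rows meet the criteria
sum(is_setosa)   # 50
sum(not_setosa)  # 100
```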
If we want a narrower result we can use the AND symbol & to combine different criteria
criteria <- iris$Species=="setosa" & iris$Sepal.Width < 3.1
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
If we want a broader result we can use the OR symbol | to accept any one of several criteria. Once criteria are being mixed like this it is a good idea to use parentheses () to make the grouping clear (as in maths, the stuff inside the parentheses is worked out first)
criteria <- iris$Species=="setosa" & (iris$Sepal.Width < 3.1 | iris$Sepal.Width > 3.5)
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
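The parentheses matter. R evaluates & before |, so without them the criteria below are read as (setosa AND width < 3.1) OR width > 3.5, and the second width condition applies to every species, giving a very different set of results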
criteria <- iris$Species=="setosa" & iris$Sepal.Width < 3.1 | iris$Sepal.Width > 3.5
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
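The difference between the two versions can also be checked directly rather than by eye. This sketch breaks the criteria into named pieces (setosa, narrow and wide are illustrative names) and looks at which rows only the version without parentheses picks up.

```r
# Illustrative names; not part of the original walkthrough
setosa <- iris$Species == "setosa"
narrow <- iris$Sepal.Width < 3.1
wide   <- iris$Sepal.Width > 3.5

with_parens    <- setosa & (narrow | wide)
without_parens <- setosa & narrow | wide

# & binds more tightly than |, so the version without parentheses is read as:
identical(without_parens, (setosa & narrow) | wide)  # TRUE

# Rows selected only by the version without parentheses
extra <- without_parens & !with_parens

# Tabulating their species shows that none of them are setosa
table(iris$Species[extra])
```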
The %in% operator can be useful for providing a list to look in
criteria <- iris$Species %in% c("setosa","elephant","dingo")
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
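Two things are worth checking about %in%: entries in the list that never occur in the data (the elephant and the dingo) simply match nothing, and %in% gives the same answer as chaining == comparisons with |. A small sketch, with wanted and two_species as illustrative names:

```r
# Illustrative names; not part of the original walkthrough
wanted <- c("setosa", "elephant", "dingo")

# Values in the list that never occur in the data match nothing
identical(iris$Species %in% wanted, iris$Species == "setosa")  # TRUE

# %in% is shorthand for chaining == comparisons with |
two_species <- iris$Species %in% c("setosa", "virginica")
identical(two_species,
          iris$Species == "setosa" | iris$Species == "virginica")  # TRUE
```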
You can also subset numerically, picking out a range of row numbers
criteria <- 5:10
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
Or you can provide a list of row numbers including several ranges
criteria <- c(2, 5:10, 12, 18:32)
plot(iris$Petal.Length, iris$Petal.Width, type="n")
points(iris$Petal.Length[criteria], iris$Petal.Width[criteria], pch=19, col=iris$Species[criteria])
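TRUE/FALSE criteria and row numbers are two ways of saying the same thing, and which() converts the first into the second. This is a sketch with illustrative names (logical_criteria, row_numbers); the cut-off of 4 is arbitrary.

```r
# Illustrative names and an arbitrary threshold; not part of the original walkthrough
logical_criteria <- iris$Sepal.Width > 4

# which() returns the positions of the TRUE values
row_numbers <- which(logical_criteria)

# Either form picks out the same entries
identical(iris$Petal.Length[logical_criteria],
          iris$Petal.Length[row_numbers])  # TRUE
```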
Up until now we have been subsetting a particular variable, but you can also subset an entire data frame by using the form dataframe[rowcriteria, columncriteria]
rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[rowcriteria,columncriteria])
## Sepal.Length Sepal.Width
## Min. :4.30 Min. :2.30
## 1st Qu.:4.80 1st Qu.:3.20
## Median :5.00 Median :3.40
## Mean :5.01 Mean :3.43
## 3rd Qu.:5.20 3rd Qu.:3.67
## Max. :5.80 Max. :4.40
The comma separates the row criteria from the column criteria. As long as the comma is there, you can leave either side empty to get all the rows or all the columns.
rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[,columncriteria])
## Sepal.Length Sepal.Width
## Min. :4.30 Min. :2.00
## 1st Qu.:5.10 1st Qu.:2.80
## Median :5.80 Median :3.00
## Mean :5.84 Mean :3.06
## 3rd Qu.:6.40 3rd Qu.:3.30
## Max. :7.90 Max. :4.40
or
rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[rowcriteria,])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.30 Min. :1.00 Min. :0.100
## 1st Qu.:4.80 1st Qu.:3.20 1st Qu.:1.40 1st Qu.:0.200
## Median :5.00 Median :3.40 Median :1.50 Median :0.200
## Mean :5.01 Mean :3.43 Mean :1.46 Mean :0.246
## 3rd Qu.:5.20 3rd Qu.:3.67 3rd Qu.:1.57 3rd Qu.:0.300
## Max. :5.80 Max. :4.40 Max. :1.90 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
There is another really common mistake to see at this point. If you are only selecting rows it is really easy to leave off the comma
rowcriteria <- iris$Species=="setosa"
columncriteria <- 1:2
summary(iris[rowcriteria])
you will get the error message
Error in `[.data.frame`(iris, rowcriteria) : undefined columns selected
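Because a data frame is a list of columns, a single set of criteria without a comma is treated as a column selection. R then tries to use the 150 TRUE/FALSE values to pick from only 5 columns, hence the complaint about undefined columns.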
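You can see the column-selecting behaviour for yourself with a short experiment. This is a sketch; tryCatch() is used only so that the error is captured as text instead of stopping the script, and result is an illustrative name.

```r
rowcriteria <- iris$Species == "setosa"

# With one index and no comma, a data frame is indexed like a list of columns
names(iris[1:2])  # "Sepal.Length" "Sepal.Width"

# So the 150-element logical vector is read as a column selector and fails;
# tryCatch() captures the error message instead of halting
result <- tryCatch(iris[rowcriteria], error = function(e) conditionMessage(e))
result

# Adding the comma makes it a row selector again
nrow(iris[rowcriteria, ])  # 50
```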
You can also use a numbered list of the columns you want
rowcriteria <- iris$Species=="setosa"
columncriteria <- c(1:2,4)
summary(iris[rowcriteria,columncriteria])
## Sepal.Length Sepal.Width Petal.Width
## Min. :4.30 Min. :2.30 Min. :0.100
## 1st Qu.:4.80 1st Qu.:3.20 1st Qu.:0.200
## Median :5.00 Median :3.40 Median :0.200
## Mean :5.01 Mean :3.43 Mean :0.246
## 3rd Qu.:5.20 3rd Qu.:3.67 3rd Qu.:0.300
## Max. :5.80 Max. :4.40 Max. :0.600
A list of names is fine as well
rowcriteria <- iris$Species=="setosa"
columncriteria <- c("Petal.Length","Petal.Width")
summary(iris[rowcriteria,columncriteria])
## Petal.Length Petal.Width
## Min. :1.00 Min. :0.100
## 1st Qu.:1.40 1st Qu.:0.200
## Median :1.50 Median :0.200
## Mean :1.46 Mean :0.246
## 3rd Qu.:1.57 3rd Qu.:0.300
## Max. :1.90 Max. :0.600
If you are getting your data into a particular form (for instance for use in a statistical test) you will most likely want to store it in a new object
rowcriteria <- iris$Species=="setosa"
columncriteria <- c("Petal.Length","Petal.Width")
mysubset <- iris[rowcriteria,columncriteria]
summary(mysubset)
## Petal.Length Petal.Width
## Min. :1.00 Min. :0.100
## 1st Qu.:1.40 1st Qu.:0.200
## Median :1.50 Median :0.200
## Mean :1.46 Mean :0.246
## 3rd Qu.:1.57 3rd Qu.:0.300
## Max. :1.90 Max. :0.600
You can also subset data by removing columns (rather than specifying the columns you want)
rowcriteria <- iris$Species=="setosa"
columncriteria <- names(iris) %in% c("Petal.Length","Petal.Width")
mysubset <- iris[rowcriteria,!(columncriteria)]
summary(mysubset)
## Sepal.Length Sepal.Width Species
## Min. :4.30 Min. :2.30 setosa :50
## 1st Qu.:4.80 1st Qu.:3.20 versicolor: 0
## Median :5.00 Median :3.40 virginica : 0
## Mean :5.01 Mean :3.43
## 3rd Qu.:5.20 3rd Qu.:3.67
## Max. :5.80 Max. :4.40
This basically says “Find the columns with those names. You see them. Those are the ones we don’t want”
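Subsetting by removal is also possible by using negative numbers. Note that here only one column (Sepal.Width) is left, so R hands back a plain vector rather than a data frame, which is why summary() prints a single row of numbers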
rowcriteria <- iris$Species=="setosa"
columncriteria <- c(-1,-3:-5)
mysubset <- iris[rowcriteria,columncriteria]
summary(mysubset)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.30 3.20 3.40 3.43 3.68 4.40
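If you would rather keep a one-column result as a data frame, [] accepts a drop = FALSE argument. The sketch below compares the two; as_vector and as_frame are illustrative names.

```r
rowcriteria <- iris$Species == "setosa"
columncriteria <- c(-1, -3:-5)

# Illustrative names; not part of the original walkthrough
as_vector <- iris[rowcriteria, columncriteria]                # collapses to a vector
as_frame  <- iris[rowcriteria, columncriteria, drop = FALSE]  # stays a data frame

class(as_vector)  # "numeric"
class(as_frame)   # "data.frame"
names(as_frame)   # "Sepal.Width"
```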
You can remove rows by a similar use of negative numbers
rowcriteria <- c(-1:-25)
columncriteria <- c(-1,-3:-5)
mysubset <- iris[rowcriteria,columncriteria]
summary(mysubset)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.70 3.00 2.97 3.20 4.20
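One thing to watch when writing negative ranges: the minus sign is applied before the colon, so -1:25 is not the same as -1:-25. A quick sketch of the difference, with remaining as an illustrative name:

```r
length(-1:25)    # 27: the sequence -1, 0, 1, ..., 25
length(-(1:25))  # 25: the values -1, -2, ..., -25

# -(1:25) is another way of writing -1:-25
identical(-(1:25), -1:-25)  # TRUE

# Either form drops the first 25 rows, leaving 125
remaining <- iris[-(1:25), ]
nrow(remaining)  # 125
```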
Subsetting is not quite the same thing as deleting a variable. Subsetting returns a new object containing only the parts you asked for and leaves the original data untouched, so everything is still available later; deleting removes the variable from the object itself. You can sometimes use them in the same way, though, by first making a copy of the data and deleting from the copy
mysubset <- iris
mysubset$Species <- NULL
summary(mysubset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
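To convince yourself that neither approach touches the original data, you can check which columns each object has afterwards. This is a sketch; without_species and mycopy are illustrative names.

```r
# Subsetting: build a new object that leaves out the Species column
without_species <- iris[, names(iris) != "Species"]

# Deleting: remove the column from a copy
mycopy <- iris
mycopy$Species <- NULL

# The original still has all five columns either way
"Species" %in% names(iris)             # TRUE
"Species" %in% names(without_species)  # FALSE
"Species" %in% names(mycopy)           # FALSE
ncol(iris)                             # 5
```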