10 - Subsetting Data

Department of Environmental Science, AUT

Subsetting Data: Prerequisites

Subsetting Data

Content you should have understood before watching this video:

Number 1, ‘Files and Folder Basics’
Number 2, ‘Variables’
Number 4, ‘Basic Statistical Metrics’
Number 5, ‘Standard Deviation and Standard Error’

Creating a data frame (a table) in R

We can combine variables into a data frame:

name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3)
d1 = data.frame(name, alcohol, income, sex)
d1
     name alcohol income sex
1     Ben    0.75  58000   1
2  Martin    1.20  38000   1
3    Andy    2.40  28000   1
4 Pauline    0.23  63000   2
5     Eva    0.90  90500   2
6  Carina    1.36  17000   2

Why call it ‘d1’? Can we call it something else…?
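
To the question above: the name is arbitrary. A valid R name starts with a letter (or a dot not followed by a digit) and contains only letters, digits, dots and underscores. Below is a minimal sketch that rebuilds the same data frame under a different name; 'drinks' is a hypothetical name chosen only for illustration:

```r
#rebuild the example vectors so this block runs on its own
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3)

#'drinks' is a hypothetical name; any valid R name works
drinks = data.frame(name, alcohol, income, sex)

dim(drinks)   #number of rows and columns
names(drinks) #the column (variable) names
str(drinks)   #compact overview of each column's type
```

A descriptive name is usually easier to recognise later than a short one like 'd1', especially once several data frames are in use.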

Working with data frames in R

As soon as our data sets (called data frames in R) become larger (> 20 rows), we will want to be able to:

Quickly look at the content without showing every single line/column
Only show a certain part of the data frame
Aggregate a data frame according to a grouping variable
Add a variable (a column) to a data frame
…and eventually much much more, e.g. search and replace certain string patterns, apply algorithms to every row, …
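
The last point in the list can be sketched with base R as well. This is only a rough illustration, assuming the built-in 'iris' data set; the object names 'species', 'species_new' and 'row_max' are made up for this example:

```r
#search and replace a string pattern in the species names
species = as.character(iris$Species) #convert the factor to plain text
species_new = sub("setosa", "SETOSA", species)
unique(species_new)

#apply a function to every row:
#the largest of the four measurements for each flower
row_max = apply(iris[, 1:4], MARGIN = 1, FUN = max)
head(row_max)
```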

Simple exploratory commands using the data set ‘iris’

The commands below are not evaluated here (you won’t see R’s output). Try them on your own device!

#the iris object (data set comes with R, it's already there for you!)
summary(iris) #summary() is very generic, try it on anything!
head(iris) #shows the first few lines of your data
tail(iris) #shows the last few lines of your data
plot(iris) #try to interpret this plot!
iris$Sepal.Length #access one variable in the data frame
head(iris$Sepal.Length) #access the first few values of one variable in a data frame

The ‘$’ symbol gives access to the variables contained inside a data frame. Here we extract the variable ‘Sepal.Length’ from ‘iris’.
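
As a small sketch of what ‘$’ returns: the extracted column is an ordinary vector, and the same column can also be selected with double square brackets and the column name as a string. The object names 'sl_dollar' and 'sl_bracket' are hypothetical:

```r
sl_dollar = iris$Sepal.Length       #the '$' way
sl_bracket = iris[["Sepal.Length"]] #same column, name given as a string

identical(sl_dollar, sl_bracket) #checks that both give the same vector
mean(iris$Sepal.Length) #a column is a vector, so vector functions work on it
```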

Subsetting a data frame using the ‘iris’ example

Extracting part of a data frame is called ‘subsetting’ and can be done in many ways. Here is one using [] (row selection goes before the comma, column selection after it):

iris[4, 2] #show the fourth value (row) of the second column
iris[4, ] #show the fourth row of all columns
iris[, 'Species'] #show all rows for column 'Species'
iris[c(3, 16), c('Species', 'Petal.Length')]
#rows 3 and 16 of columns 'Species' and 'Petal.Length'
iris[iris$Species == 'virginica', ] #all rows of species 'virginica'
iris[iris$Sepal.Length > 6, ] #all rows where Sepal.Length > 6
iris[iris$Sepal.Length > 6 & iris$Species == 'virginica', ]
#all rows where Sepal.Length > 6 AND species is 'virginica'
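
Two variations on the commands above, as a sketch only: the base R function subset() expresses the same row selection without repeating 'iris$', and the | operator combines conditions with OR instead of AND. The object names 'big_virginica' and 'setosa_or_long' are made up for this example:

```r
#same selection as the last example above, written with subset()
big_virginica = subset(iris, Sepal.Length > 6 & Species == 'virginica')

#OR instead of AND uses the | operator
setosa_or_long = iris[iris$Species == 'setosa' | iris$Sepal.Length > 6, ]

#nrow() counts how many rows a subset contains
nrow(big_virginica)
nrow(setosa_or_long)
```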

Aggregating a data frame

You can do this in many ways! Here is one way:

#calculate the mean petal length per species:
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = mean)

    setosa versicolor  virginica 
     1.462      4.260      5.552

#extract the maximum value of petal length per species:
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = max)

    setosa versicolor  virginica 
       1.9        5.1        6.9
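
One of the other ways, sketched under the assumption that a data frame is preferred as the result: the base R function aggregate() with a formula ('value ~ group'). The object names 'agg' and 'tap' are hypothetical:

```r
#mean petal length per species, returned as a data frame
agg = aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
agg

#compare with the tapply() result from above
tap = tapply(iris$Petal.Length, INDEX = iris$Species, FUN = mean)
all.equal(agg$Petal.Length, as.vector(tap))
```

tapply() returns a named vector, while aggregate() returns a data frame, which can be handier if the result is to be subset or plotted later.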

Deleting/reordering variables

#removing variables
iris1 = iris[, -c(1, 2, 4)]
head(iris1, 2)

  Petal.Length Species
1          1.4  setosa
2          1.4  setosa

#reordering variables
iris2 = iris[, c(5, 1, 2, 3, 4)]
head(iris2, 2)

  Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1  setosa          5.1         3.5          1.4         0.2
2  setosa          4.9         3.0          1.4         0.2
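
Column positions are easy to mix up, so the same operations can also be written with column names. A minimal sketch; 'iris_small' and 'iris_reordered' are hypothetical names:

```r
#keep only the columns we want, by name
#(same result as iris[, -c(1, 2, 4)])
iris_small = iris[, c('Petal.Length', 'Species')]

#move 'Species' to the front without typing the other column positions
iris_reordered = iris[, c('Species', setdiff(names(iris), 'Species'))]
names(iris_reordered)
```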

Adding variables to a data frame

#adding a variable
iris$newVariable = 1
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species newVariable
1          5.1         3.5          1.4         0.2  setosa           1
2          4.9         3.0          1.4         0.2  setosa           1
3          4.7         3.2          1.3         0.2  setosa           1
4          4.6         3.1          1.5         0.2  setosa           1
5          5.0         3.6          1.4         0.2  setosa           1
6          5.4         3.9          1.7         0.4  setosa           1
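
A new variable does not have to be a constant; it can be calculated from existing columns. The sketch below works on a copy so that 'iris' itself stays unchanged. The names 'iris3' and 'Petal.Ratio' are made up for this example:

```r
iris3 = iris #work on a copy

#new column calculated from two existing columns
iris3$Petal.Ratio = iris3$Petal.Length / iris3$Petal.Width

#add a constant column, then remove it again by assigning NULL
iris3$newVariable = 1
iris3$newVariable = NULL

head(iris3, 3)
```

Note that a command like iris$newVariable = 1 changes a copy of 'iris' in your workspace; rm(iris) removes that copy and gives you back the original data set that comes with R.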

The most important points in a nutshell

This type of ‘data crunching’ can be done in many other ways, but it pays to be familiar with the basics!
It’s a good idea to remember how to select certain rows/columns and how to remove, add, and reorder variables in a data frame.
Remember one way to aggregate a continuous variable by a group; ‘tapply’ is the simplest!