Subsetting Data
Content you should have understood before watching this video:
- Number 1, ‘Files and Folder Basics’
- Number 2, ‘Variables’
- Number 4, ‘Basic Statistical Metrics’
- Number 5, ‘Standard Deviation and Standard Error’
Creating a data frame (a table) in R
- We can combine variables into a data frame:
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3)
d1 = data.frame(name, alcohol, income, sex)
d1
     name alcohol income sex
1     Ben    0.75  58000   1
2  Martin    1.20  38000   1
3    Andy    2.40  28000   1
4 Pauline    0.23  63000   2
5     Eva    0.90  90500   2
6  Carina    1.36  17000   2
- Why call it ‘d1’? Can we call it something else…?
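The name ‘d1’ is arbitrary: any valid R name works. As a minimal sketch, the same data frame is built below under the hypothetical name ‘drinkers’ and then inspected with a few base R functions:

```r
# Same variables as on the slide
name = c("Ben", "Martin", "Andy", "Pauline", "Eva", "Carina")
alcohol = c(0.75, 1.2, 2.4, 0.23, 0.9, 1.36) #standard drinks/day
income = c(58000, 38000, 28000, 63000, 90500, 17000)
sex = rep(1:2, each = 3)

# 'drinkers' is a hypothetical name chosen for illustration
drinkers = data.frame(name, alcohol, income, sex)

str(drinkers)   #type of each column plus the first few values
dim(drinkers)   #number of rows, then number of columns
names(drinkers) #the column (variable) names
```

A descriptive name usually makes later code easier to read than a generic one like ‘d1’.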
Working with data frames in R
As soon as our data sets (called data frames in R) become larger (> 20 rows), we will want to be able to:
- Quickly look at the content without showing every single line/column
- Only show a certain part of the data frame
- Aggregate a data frame according to a grouping variable
- Add a variable (a column) to a data frame
- …and eventually much much more, e.g. search and replace certain string patterns, apply algorithms to every row, …
Simple exploratory commands using the data set ‘iris’
The commands below are not evaluated (you don’t see what R does). Try them on your own device!
#the iris object (data set comes with R, it's already there for you!)
summary(iris) #summary() is very generic, try it on anything!
head(iris) #shows the first few lines of your data
tail(iris) #shows the last few lines of your data
plot(iris) #try to interpret this plot!
iris$Sepal.Length #access one variable in the data frame
head(iris$Sepal.Length) #access the first few values of one variable in a data frame
The ‘$’ symbol accesses a variable (a column) inside a data frame; here we extract the variable ‘Sepal.Length’ from ‘iris’.
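As a small sketch of what ‘$’ returns: the extracted variable is an ordinary vector, so any vector function can be applied to it. The object names ‘sl’ and ‘sl_alt’ are hypothetical, and the double-bracket form is shown only as an equivalent alternative.

```r
# Extract one column with '$' (names 'sl' and 'sl_alt' are made up for this example)
sl = iris$Sepal.Length

# Double brackets with the column name give the same vector
sl_alt = iris[["Sepal.Length"]]
identical(sl, sl_alt)

length(sl) #one value per row of the data frame
mean(sl)   #vector functions work directly on the extracted column
names(iris) #lists all variables that can follow the '$'
```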
Subsetting a data frame using the ‘iris’ example
Extracting part of a data frame is called ‘subsetting’ and can be done in many ways. Here is one using [] (row selection before, column selection after the comma):
iris[4, 2] #show the fourth value (row) of the second column
iris[4, ] #show the fourth row of all columns
iris[, 'Species'] #show all rows for column 'Species'
iris[c(3, 16), c('Species', 'Petal.Length')]
#rows 3 and 16 for columns 'Species' and 'Petal.Length'
iris[iris$Species == 'virginica', ] #all rows of species 'virginica'
iris[iris$Sepal.Length > 6, ] #all rows where Sepal.Length > 6
iris[iris$Sepal.Length > 6 & iris$Species == 'virginica', ]
#all rows where Sepal.Length > 6 AND species is 'virginica'
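The bracket conditions above can also be written with the base R function subset(), which some find easier to read. This is a minimal sketch, not the only way to do it; the object names are hypothetical and the cut-off values in the last two commands are arbitrary.

```r
# Same filter written two ways (object names are made up for this example)
big_virginica_brackets = iris[iris$Sepal.Length > 6 & iris$Species == 'virginica', ]
big_virginica_subset = subset(iris, Sepal.Length > 6 & Species == 'virginica')

# Both approaches should return the same rows
all.equal(big_virginica_brackets, big_virginica_subset)

# subset() can select columns at the same time
subset(iris, Sepal.Length > 7.5, select = c(Species, Sepal.Length))

# '|' means OR: rows with a very short or a very long sepal
iris[iris$Sepal.Length < 4.5 | iris$Sepal.Length > 7.5, ]
```

Inside subset() the column names can be used directly, without repeating ‘iris$’.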
Aggregating a data frame
You can do this in many ways! Here is one way:
#calculate the mean petal length per species:
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = mean)
    setosa versicolor  virginica
     1.462      4.260      5.552
#extract the maximum value of petal length per species:
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = max)
    setosa versicolor  virginica
       1.9        5.1        6.9
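As one of the other ways mentioned above, the base R function aggregate() gives the same group means but returns them as a data frame instead of a named vector. A minimal sketch, with hypothetical object names:

```r
# tapply() returns the group means as a named vector
means_vec = tapply(iris$Petal.Length, INDEX = iris$Species, FUN = mean)

# aggregate() returns a data frame with one row per group
# (read the formula as "Petal.Length grouped by Species")
means_df = aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
means_df

# The numbers should agree between the two approaches
all.equal(as.vector(means_vec), means_df$Petal.Length)
```

The data frame form can be handy when the result is to be subset or plotted further.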
Deleting/reordering variables
#removing variables
iris1 = iris[, -c(1, 2, 4)]
head(iris1, 2)
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
#reordering variables
iris2 = iris[, c(5, 1, 2, 3, 4)]
head(iris2, 2)
  Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1  setosa          5.1         3.5          1.4         0.2
2  setosa          4.9         3.0          1.4         0.2
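Column positions are easy to get wrong, so the same operations can also be sketched with column names. This assumes ‘iris’ is still in its original five-column form; all object names are hypothetical.

```r
# Keep columns by name instead of dropping them by position
iris1_by_name = iris[, c('Petal.Length', 'Species')]

# Drop columns by name: setdiff() keeps every name not listed in drop_cols
drop_cols = c('Sepal.Length', 'Sepal.Width', 'Petal.Width')
iris1_dropped = iris[, setdiff(names(iris), drop_cols)]

# Reorder by name: 'Species' first, then all remaining columns
iris2_by_name = iris[, c('Species', setdiff(names(iris), 'Species'))]
head(iris2_by_name, 2)
```

Selecting by name keeps working even if the column order of the data frame changes later.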
Adding variables to a data frame
#adding a variable
iris$newVariable = 1
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species newVariable
1          5.1         3.5          1.4         0.2  setosa           1
2          4.9         3.0          1.4         0.2  setosa           1
3          4.7         3.2          1.3         0.2  setosa           1
4          4.6         3.1          1.5         0.2  setosa           1
5          5.0         3.6          1.4         0.2  setosa           1
6          5.4         3.9          1.7         0.4  setosa           1
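A new variable does not have to be a constant. As a minimal sketch, it can be computed from existing columns, and a column can be removed again by assigning NULL. The names ‘iris_copy’, ‘Petal.Ratio’ and ‘LongSepal’ are hypothetical and chosen only for illustration.

```r
# Work on a copy so the built-in iris data stays as it is
iris_copy = iris

# New column computed from two existing columns
iris_copy$Petal.Ratio = iris_copy$Petal.Length / iris_copy$Petal.Width

# New column based on a condition (TRUE/FALSE for every row)
iris_copy$LongSepal = iris_copy$Sepal.Length > 6
head(iris_copy, 3)

# Assigning NULL removes a column again
iris_copy$LongSepal = NULL
names(iris_copy)
```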
The most important points in a nutshell
- This type of ‘data crunching’ can be done in many other ways, but it pays to be familiar with the basics!
- It’s a good idea to remember how to select certain rows/columns, remove, add, and order variables in a data frame
- Remember one way to aggregate a continuous variable by group: ‘tapply’ is the simplest!