Lesson adapted from Tobi Gerstenberg’s PSY252 course: https://psych252.github.io/psych252book/data-wrangling-1.html
This book is also a great resource for all things statistics! Check it out.
In this lesson, we will take a look at how to wrangle data using the dplyr package. Getting our data into shape is something you’ll need for the rest of your psychology / data analysis career, so it’s worth spending some time getting a good sense for how this works. The nice thing about R is that (thanks to the tidyverse
), both visualization and data wrangling are particularly simple.
%>%
worksdplyr
package (incl. filter()
, arrange()
, rename()
, relocate()
, select()
, mutate()
)Let’s first load the our packages
library("knitr") # for rendering the RMarkdown file
library("skimr") # for visualizing data
library("visdat") # for visualizing data
library("DT") # for visualizing data
library("tidyverse") # for data wrangling
$set(comment = "",
opts_chunkfig.show = "hold")
Variables in R can have different modes. Table 1.1 shows the most common ones.
name | example |
---|---|
numeric | 1 , 3 , 48 |
character | 'Steve' , 'a' , '78' |
logical | TRUE , FALSE |
not available | NA |
For characters you can either use "
or '
. R has a number of functions to convert a variable from one mode to another. NA
is used for missing values.
= "1" # we start with a character
tmp1 str(tmp1)
chr "1"
= as.numeric(tmp1) # turn it into a numeric
tmp2 str(tmp2)
num 1
= as.factor(tmp2) # turn that into a factor
tmp3 str(tmp3)
Factor w/ 1 level "1": 1
= as.character(tmp3) # and go full cycle by turning it back into a character
tmp4 str(tmp4)
chr "1"
identical(tmp1, tmp4) # checks whether tmp1 and tmp4 are the same
[1] TRUE
The str()
function displays the structure of an R object. Here, it shows us what mode the variable is.
R has a number of different data types. Table 1.2 shows the ones you’re most likely to come across (taken from this source):
name | description |
---|---|
vector | list of values with of the same variable mode |
factor | for ordinal variables |
matrix | 2D data structure |
array | same as matrix for higher dimensional data |
data frame | similar to matrix but with column names |
list | flexible type that can contain different other variable types |
We build vectors using the concatenate function c()
, and we use []
to access one or more elements of a vector.
= c(1, 4, 5) # make a vector
numbers 2] # access the second element numbers[
[1] 4
1:2] # access the first two elements numbers[
[1] 1 4
c(1, 3)] # access the first and last element numbers[
[1] 1 5
In R (unlike in Python for example), 1 refers to the first element of a vector (or list).
We build a matrix using the matrix()
function, and we use []
to access its elements.
= matrix(data = c(1, 2, 3, 4, 5, 6),
matrix nrow = 3,
ncol = 2)
# the full matrix matrix
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
1, 2] # element in row 1, column 2 matrix[
[1] 4
1, ] # all elements in the first row matrix[
[1] 1 4
1] # all elements in the first column matrix[ ,
[1] 1 2 3
-1, ] # a matrix which excludes the first row matrix[
[,1] [,2]
[1,] 2 5
[2,] 3 6
Note how we use an empty placeholder to indicate that we want to select all the values in a row or column, and -
to indicate that we want to remove something.
Arrays work the same was as matrices with data of more than two dimensions.
Won’t be covering this really. Check out this resource for more information if you’re curious: https://www.tutorialspoint.com/r/r_arrays.htm
= tibble(participant_id = c(1, 2, 3),
df participant_name = c("Leia", "Luke", "Darth")) # make the data frame
# the complete data frame df
# A tibble: 3 × 2
participant_id participant_name
<dbl> <chr>
1 1 Leia
2 2 Luke
3 3 Darth
1, 2] # a single element using numbers df[
# A tibble: 1 × 1
participant_name
<chr>
1 Leia
$participant_id # all participants df
[1] 1 2 3
"participant_id"]] # same as before but using [[]] instead of $ df[[
[1] 1 2 3
$participant_name[2] # name of the second participant df
[1] "Luke"
"participant_name"]][2] # same as above df[[
[1] "Luke"
We’ll use data frames a lot. Data frames are like a matrix with column names. Data frames are also more general than matrices in that different columns can have different modes. For example, one column might be a character, another one numeric, and another one a factor.
Here we used the tibble()
function to create the data frame. A tibble
is almost the same as a data frame but it has better defaults for formatting output in the console (more information on tibbles is here).
= list(number = 1,
l.mixed character = "2",
factor = factor(3),
matrix = matrix(1:4, ncol = 2),
df = tibble(x = c(1, 2), y = c(3, 4)))
l.mixed
$number
[1] 1
$character
[1] "2"
$factor
[1] 3
Levels: 3
$matrix
[,1] [,2]
[1,] 1 3
[2,] 2 4
$df
# A tibble: 2 × 2
x y
<dbl> <dbl>
1 1 3
2 2 4
# three different ways of accessing a list
$character l.mixed
[1] "2"
"character"]] l.mixed[[
[1] "2"
2]] l.mixed[[
[1] "2"
Lists are a very flexible data format. You can put almost anything in a list.
Table 1.3 shows the comparison operators that result in logical outputs.
symbol | name |
---|---|
== |
equal to |
!= |
not equal to |
> , < |
greater/less than |
>= , <= |
greater/less than or equal |
& , | , ! |
logical operators: and, or, not |
%in% |
checks whether an element is in an object |
The %in%
operator is very useful, and we can use it like so:
= c(1, 2, 3)
x 2 %in% x
[1] TRUE
c(3, 4) %in% x
[1] TRUE FALSE
It’s particularly useful for filtering data! More to come on this.
= 3
number
if (number == 1) {
print("The number is 1.")
else if (number == 2) {
} print("The number is 2.")
else {
} print("The number is neither 1 nor 2.")
}
[1] "The number is neither 1 nor 2."
As a shorthand version, we can also use the ifelse()
function like so:
= 3
number ifelse(test = number == 1, yes = "correct", no = "false")
[1] "false"
The way to read this: if ______ is true, then put ______, otherwise put ______.
I use this command a lot! Worth getting comfortable with.
= 1:10
sequence
for(i in 1:length(sequence)){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
For more on for loops: https://www.datamentor.io/r-programming/for-loop/
= 1
number
while(number <= 10){
print(number)
= number + 1
number }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
For more on while loop: https://www.datamentor.io/r-programming/while-loop/
= function(a, b){
fun.add_two_numbers = a + b
x return(str_c("The result is ", x))
}
fun.add_two_numbers(1, 2)
[1] "The result is 3"
I’ve used the str_c()
function here to concatenate the string with the number. (R converts the number x
into a string for us.) Note, R functions can only return a single object. However, this object can be a list (which can contain anything).
For more on functions: https://swcarpentry.github.io/r-novice-inflammation/02-func-R/
name | description |
---|---|
length() |
length of an object |
dim() |
dimensions of an object (e.g. number of rows and columns) |
rm() |
remove an object |
seq() |
generate a sequence of numbers |
rep() |
repeat something n times |
max() |
maximum |
min() |
minimum |
which.max() |
index of the maximum |
which.min() |
index of the maximum |
mean() |
mean |
median() |
median |
sum() |
sum |
var() |
variance |
sd() |
standard deviation |
%>%
The pipe operator %>%
is a special operator that used heavily in the tidyverse. The basic idea is simple: this operator allows us to “pipe” several functions into one long chain that matches the order in which we want to do stuff.
Let’s consider the following example of making and eating a cake (thanks to https://twitter.com/dmi3k/status/1191824875842879489?s=09). This would be the traditional way of writing some code:
eat(
slice(
bake(
put(
pour(
mix(ingredients),
into = baking_form),
into = oven),
time = 30),
pieces = 6),
1)
To see what’s going on here, we need to read the code inside out. That is, we have to start in the innermost bracket, and then work our way outward. However, there is a natural causal ordering to these steps and wouldn’t it be nice if we could just write code in that order? Thanks to the pipe operator %>%
we can! Here is the same example using the pipe:
%>%
ingredients %>%
mix pour(into = baking_form) %>%
put(into = oven) %>%
bake(time = 30) %>%
slice(pieces = 6) %>%
eat(1)
This code is much easier to read and write, since it represents the order in which we want to do things!
Abstractly, the pipe operator does the following:
f(x)
can be rewritten asx %>% f()
For example, in standard R, we would write:
= 1:3
x
# standard R
sum(x)
[1] 6
With the pipe, we can rewrite this as:
= 1:3
x
# with the pipe
%>% sum() x
[1] 6
This doesn’t seem super useful yet, but it will be!
f(x, y)
can be rewritten asx %>% f(y)
So, we could rewrite the following standard R code …
# rounding pi to 6 digits, standard R
round(pi, digits = 6)
[1] 3.141593
… by using the pipe:
# rounding pi to 6 digits, standard R
%>% round(digits = 6) pi
[1] 3.141593
Here is another example:
= 3
a = 4
b sum(a, b) # standard way
[1] 7
%>% sum(b) # the pipe way a
[1] 7
The pipe operator inserts the result of the previous computation as a first element into the next computation. So, a %>% sum(b)
is equivalent to sum(a, b)
. We can also specify to insert the result at a different position via the .
operator. For example:
= 1
a = 10
b %>% seq(from = a, to = .) b
[1] 1 2 3 4 5 6 7 8 9 10
Here, I used the .
operator to specify that I woud like to insert the result of b
where I’ve put the .
in the seq()
function.
f(x, y)
can be rewritten asy %>% f(x, .)
h(g(f(x)))
can be rewritten asx %>% f() %>% g() %>% h()
Look how much simpler!
One more example to show you why the pipe is very helpful:
h(g(f(x)))
can be rewritten asx %>% f() %>% g() %>% h()
For example, consider that we want to calculate the root mean squared error (RMSE) between prediction and data.
Here is how the RMSE is defined:
\[ \text{RMSE} = \sqrt\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n} \] where \(\hat{y}_i\) denotes the prediction, and \(y_i\) the actually observed value.
In base R, we would do the following.
= c(1, 3, 4, 2, 5)
data = c(1, 2, 2, 1, 4)
prediction
# calculate root mean squared error
= sqrt(mean((prediction-data)^2))
rmse print(rmse)
[1] 1.183216
Using the pipe operator makes the operation more intuitive:
= c(1, 3, 4, 2, 5)
data = c(1, 2, 2, 1, 4)
prediction
# calculate root mean squared error the pipe way
= (prediction-data)^2 %>%
rmse mean() %>%
sqrt() %>%
print()
[1] 1.183216
First, we calculate the squared error, then we take the mean, then the square root, and then print the result.
The pipe operator %>%
is similar to the +
used in ggplot2
. It allows us to take step-by-step actions in a way that fits the causal ordering of how we want to do things.
Tip: The keyboard shortcut for the pipe operator is:
cmd/ctrl + shift + m
Definitely learn this one – we’ll use the pipe a lot!!
Tip: Code is generally easier to read when the pipe
%>%
is at the end of a line (just like the+
inggplot2
).
A key advantage of using the pipe is that you don’t have to save intermediate computations as new variables and this helps to keep your environment nice and clean!
The package dplyr
which we loaded as part of the tidyverse, includes a data set with information about starwars characters. Let’s store this as df.starwars
.
= starwars df.starwars
There are several ways of taking a look at data in R. Personally, I like to look at the data within RStudio’s data viewer. To do so, you can:
df.starwars
variable in the “Environment” tabView(df.starwars)
in the consoleF2
Sometimes it’s also helpful to look at data in the console instead of the data viewer. Particularly when the data is very large, the data viewer can be sluggish.
Here are some useful functions:
head()
Without any extra arguments specified, head()
shows the top six rows of the data.
head(df.starwars)
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, grey light blue 52 male mascu…
# … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
glimpse()
glimpse()
is helpful when the data frame has many columns. The data is shown in a transposed way with columns as rows.
glimpse(df.starwars)
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
distinct()
distinct()
shows all the distinct values for a character or factor column.
%>%
df.starwars distinct(species)
# A tibble: 38 × 1
species
<chr>
1 Human
2 Droid
3 Wookiee
4 Rodian
5 Hutt
6 Yoda's species
7 Trandoshan
8 Mon Calamari
9 Ewok
10 Sullustan
# … with 28 more rows
count()
count()
shows a count of all the different distinct values in a column.
%>%
df.starwars count(eye_color)
# A tibble: 15 × 2
eye_color n
<chr> <int>
1 black 10
2 blue 19
3 blue-gray 1
4 brown 21
5 dark 1
6 gold 1
7 green, yellow 1
8 hazel 3
9 orange 8
10 pink 1
11 red 5
12 red, blue 1
13 unknown 3
14 white 1
15 yellow 11
It’s possible to do grouped counts by combining several variables.
%>%
df.starwars count(eye_color, gender) %>%
head(n = 10)
# A tibble: 10 × 3
eye_color gender n
<chr> <chr> <int>
1 black feminine 2
2 black masculine 8
3 blue feminine 6
4 blue masculine 12
5 blue <NA> 1
6 blue-gray masculine 1
7 brown feminine 5
8 brown masculine 15
9 brown <NA> 1
10 dark masculine 1
We use the functions in the package dplyr
to manipulate our data.
filter()
filter()
lets us apply logical (and other) operators (see Table 1.3) to subset the data. Here, I’ve filtered out the masculine characters.
%>%
df.starwars filter(gender == "masculine")
# A tibble: 66 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke S… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth … 202 136 none white yellow 41.9 male mascu…
5 Owen L… 178 120 brown, grey light blue 52 male mascu…
6 R5-D4 97 32 <NA> white, red red NA none mascu…
7 Biggs … 183 84 black light brown 24 male mascu…
8 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
9 Anakin… 188 84 blond fair blue 41.9 male mascu…
10 Wilhuf… 180 NA auburn, gr… fair blue 64 male mascu…
# … with 56 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
We can combine multiple conditions in the same call. Here, I’ve filtered out masculine characters, whose height is greater than the median height (i.e. they are in the top 50 percentile), and whose mass was not NA
.
%>%
df.starwars filter(gender == "masculine",
> median(height, na.rm = T),
height !is.na(mass))
# A tibble: 27 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Darth … 202 136 none white yellow 41.9 male mascu…
2 Biggs … 183 84 black light brown 24 male mascu…
3 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
4 Anakin… 188 84 blond fair blue 41.9 male mascu…
5 Chewba… 228 112 brown unknown blue 200 male mascu…
6 Boba F… 183 78.2 black fair brown 31.5 male mascu…
7 IG-88 200 140 none metal red 15 none mascu…
8 Bossk 190 113 none green red 53 male mascu…
9 Qui-Go… 193 89 brown fair blue 92 male mascu…
10 Nute G… 191 90 none mottled g… red NA male mascu…
# … with 17 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
Many functions like mean()
, median()
, var()
, sd()
, sum()
have the argument na.rm
which is set to FALSE
by default. I set the argument to TRUE
here (or T
for short), which means that the NA
values are ignored, and the median()
is calculated based on the remaining values.
You can use ,
and &
interchangeably in filter()
. Make sure to use parentheses when combining several logical operators to indicate which logical operation should be performed first:
%>%
df.starwars filter((skin_color %in% c("dark", "pale") | gender == "hermaphrodite") & height > 170)
# A tibble: 9 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Lando C… 177 79 black dark brown 31 male mascu…
2 Quarsh … 183 NA black dark brown 62 <NA> <NA>
3 Bib For… 180 NA none pale pink NA male mascu…
4 Mace Wi… 188 84 none dark brown 72 male mascu…
5 Ki-Adi-… 198 82 white pale yellow 92 male mascu…
6 Adi Gal… 184 50 none dark blue NA female femin…
7 Saesee … 188 NA none pale orange NA male mascu…
8 Gregar … 185 85 black dark brown NA male mascu…
9 Sly Moo… 178 48 none pale white NA <NA> <NA>
# … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
The starwars characters that have either a "dark"
or a "pale"
skin tone, or whose gender is "hermaphrodite"
, and whose height is at least 170
cm. The %in%
operator is useful when there are multiple options. Instead of skin_color %in% c("dark", "pale")
, I could have also written skin_color == "dark" | skin_color == "pale"
but this gets cumbersome as the number of options increases.
arrange()
arrange()
allows us to sort the values in a data frame by one or more column entries.
%>%
df.starwars arrange(hair_color, desc(height))
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Mon Mo… 150 NA auburn fair blue 48 fema… femin…
2 Wilhuf… 180 NA auburn, gr… fair blue 64 male mascu…
3 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
4 Bail P… 191 NA black tan brown 67 male mascu…
5 Gregar… 185 85 black dark brown NA male mascu…
6 Biggs … 183 84 black light brown 24 male mascu…
7 Boba F… 183 78.2 black fair brown 31.5 male mascu…
8 Quarsh… 183 NA black dark brown 62 <NA> <NA>
9 Jango … 183 79 black tan brown 66 male mascu…
10 Lando … 177 79 black dark brown 31 male mascu…
# … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
Here, I’ve sorted the data frame first by hair_color
, and then by height
. I’ve used the desc()
function to sort height
in descending order.
rename()
rename()
renames column names.
%>%
df.starwars rename(person = name,
mass_kg = mass)
# A tibble: 87 × 14
person height mass_kg hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke … 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth… 202 136 none white yellow 41.9 male mascu…
5 Leia … 150 49 brown light brown 19 fema… femin…
6 Owen … 178 120 brown, gr… light blue 52 male mascu…
7 Beru … 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs… 183 84 black light brown 24 male mascu…
10 Obi-W… 182 77 auburn, w… fair blue-gray 57 male mascu…
# … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
The new variable names goes on the LHS of the=
sign, and the old name on the RHS.
To rename all variables at the same time use rename_with()
:
%>%
df.starwars rename_with(.fn = ~ toupper(.))
# A tibble: 87 × 14
NAME HEIGHT MASS HAIR_COLOR SKIN_COLOR EYE_COLOR BIRTH_YEAR SEX GENDER
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke S… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth … 202 136 none white yellow 41.9 male mascu…
5 Leia O… 150 49 brown light brown 19 fema… femin…
6 Owen L… 178 120 brown, grey light blue 52 male mascu…
7 Beru W… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs … 183 84 black light brown 24 male mascu…
10 Obi-Wa… 182 77 auburn, wh… fair blue-gray 57 male mascu…
# … with 77 more rows, and 5 more variables: HOMEWORLD <chr>, SPECIES <chr>,
# FILMS <list>, VEHICLES <list>, STARSHIPS <list>
relocate()
relocate()
moves columns. For example, the following piece of code moves the species
column to the front of the data frame:
%>%
df.starwars relocate(species)
# A tibble: 87 × 14
species name height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Human Luke S… 172 77 blond fair blue 19 male
2 Droid C-3PO 167 75 <NA> gold yellow 112 none
3 Droid R2-D2 96 32 <NA> white, bl… red 33 none
4 Human Darth … 202 136 none white yellow 41.9 male
5 Human Leia O… 150 49 brown light brown 19 fema…
6 Human Owen L… 178 120 brown, gr… light blue 52 male
7 Human Beru W… 165 75 brown light blue 47 fema…
8 Droid R5-D4 97 32 <NA> white, red red NA none
9 Human Biggs … 183 84 black light brown 24 male
10 Human Obi-Wa… 182 77 auburn, w… fair blue-gray 57 male
# … with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
# films <list>, vehicles <list>, starships <list>
We could also move the species
column after the name column like so:
%>%
df.starwars relocate(species, .after = name)
# A tibble: 87 × 14
name species height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Luke S… Human 172 77 blond fair blue 19 male
2 C-3PO Droid 167 75 <NA> gold yellow 112 none
3 R2-D2 Droid 96 32 <NA> white, bl… red 33 none
4 Darth … Human 202 136 none white yellow 41.9 male
5 Leia O… Human 150 49 brown light brown 19 fema…
6 Owen L… Human 178 120 brown, gr… light blue 52 male
7 Beru W… Human 165 75 brown light blue 47 fema…
8 R5-D4 Droid 97 32 <NA> white, red red NA none
9 Biggs … Human 183 84 black light brown 24 male
10 Obi-Wa… Human 182 77 auburn, w… fair blue-gray 57 male
# … with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
# films <list>, vehicles <list>, starships <list>
select()
select()
allows us to select a subset of the columns in the data frame.
%>%
df.starwars select(name, height, mass)
# A tibble: 87 × 3
name height mass
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 C-3PO 167 75
3 R2-D2 96 32
4 Darth Vader 202 136
5 Leia Organa 150 49
6 Owen Lars 178 120
7 Beru Whitesun lars 165 75
8 R5-D4 97 32
9 Biggs Darklighter 183 84
10 Obi-Wan Kenobi 182 77
# … with 77 more rows
We can select multiple columns using the (from:to)
syntax:
%>%
df.starwars select(name:birth_year) # from name to birth_year
# A tibble: 87 × 7
name height mass hair_color skin_color eye_color birth_year
<chr> <int> <dbl> <chr> <chr> <chr> <dbl>
1 Luke Skywalker 172 77 blond fair blue 19
2 C-3PO 167 75 <NA> gold yellow 112
3 R2-D2 96 32 <NA> white, bl… red 33
4 Darth Vader 202 136 none white yellow 41.9
5 Leia Organa 150 49 brown light brown 19
6 Owen Lars 178 120 brown, grey light blue 52
7 Beru Whitesun lars 165 75 brown light blue 47
8 R5-D4 97 32 <NA> white, red red NA
9 Biggs Darklighter 183 84 black light brown 24
10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57
# … with 77 more rows
Or use a variable for column selection:
= c("name", "height", "species")
columns
%>%
df.starwars select(one_of(columns)) # useful when using a variable for column selection
# A tibble: 87 × 3
name height species
<chr> <int> <chr>
1 Luke Skywalker 172 Human
2 C-3PO 167 Droid
3 R2-D2 96 Droid
4 Darth Vader 202 Human
5 Leia Organa 150 Human
6 Owen Lars 178 Human
7 Beru Whitesun lars 165 Human
8 R5-D4 97 Droid
9 Biggs Darklighter 183 Human
10 Obi-Wan Kenobi 182 Human
# … with 77 more rows
We can also deselect (multiple) columns:
%>%
df.starwars select(-name, -(birth_year:vehicles))
# A tibble: 87 × 6
height mass hair_color skin_color eye_color starships
<int> <dbl> <chr> <chr> <chr> <list>
1 172 77 blond fair blue <chr [2]>
2 167 75 <NA> gold yellow <chr [0]>
3 96 32 <NA> white, blue red <chr [0]>
4 202 136 none white yellow <chr [1]>
5 150 49 brown light brown <chr [0]>
6 178 120 brown, grey light blue <chr [0]>
7 165 75 brown light blue <chr [0]>
8 97 32 <NA> white, red red <chr [0]>
9 183 84 black light brown <chr [1]>
10 182 77 auburn, white fair blue-gray <chr [5]>
# … with 77 more rows
And select columns by partially matching the column name:
%>%
df.starwars select(contains("_")) # every column that contains the character "_"
# A tibble: 87 × 4
hair_color skin_color eye_color birth_year
<chr> <chr> <chr> <dbl>
1 blond fair blue 19
2 <NA> gold yellow 112
3 <NA> white, blue red 33
4 none white yellow 41.9
5 brown light brown 19
6 brown, grey light blue 52
7 brown light blue 47
8 <NA> white, red red NA
9 black light brown 24
10 auburn, white fair blue-gray 57
# … with 77 more rows
%>%
df.starwars select(starts_with("h")) # every column that starts with an "h"
# A tibble: 87 × 3
height hair_color homeworld
<int> <chr> <chr>
1 172 blond Tatooine
2 167 <NA> Tatooine
3 96 <NA> Naboo
4 202 none Tatooine
5 150 brown Alderaan
6 178 brown, grey Tatooine
7 165 brown Tatooine
8 97 <NA> Tatooine
9 183 black Tatooine
10 182 auburn, white Stewjon
# … with 77 more rows
We can rename some of the columns using select()
like so:
%>%
df.starwars select(person = name, height, mass_kg = mass)
# A tibble: 87 × 3
person height mass_kg
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 C-3PO 167 75
3 R2-D2 96 32
4 Darth Vader 202 136
5 Leia Organa 150 49
6 Owen Lars 178 120
7 Beru Whitesun lars 165 75
8 R5-D4 97 32
9 Biggs Darklighter 183 84
10 Obi-Wan Kenobi 182 77
# … with 77 more rows
where()
where()
is a useful helper function that comes in handy, for example, when we want to select columns based on their data type.
%>%
df.starwars select(where(fn = ~is.numeric(.))) # just select numeric columns
# A tibble: 87 × 3
height mass birth_year
<int> <dbl> <dbl>
1 172 77 19
2 167 75 112
3 96 32 33
4 202 136 41.9
5 150 49 19
6 178 120 52
7 165 75 47
8 97 32 NA
9 183 84 24
10 182 77 57
# … with 77 more rows
The following selects all columns that are not numeric:
%>%
df.starwars select(where(fn = ~ !is.numeric(.))) # selects all columns that are not numeric
# A tibble: 87 × 11
name hair_color skin_color eye_color sex gender homeworld species films
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lis>
1 Luke S… blond fair blue male mascu… Tatooine Human <chr…
2 C-3PO <NA> gold yellow none mascu… Tatooine Droid <chr…
3 R2-D2 <NA> white, bl… red none mascu… Naboo Droid <chr…
4 Darth … none white yellow male mascu… Tatooine Human <chr…
5 Leia O… brown light brown fema… femin… Alderaan Human <chr…
6 Owen L… brown, grey light blue male mascu… Tatooine Human <chr…
7 Beru W… brown light blue fema… femin… Tatooine Human <chr…
8 R5-D4 <NA> white, red red none mascu… Tatooine Droid <chr…
9 Biggs … black light brown male mascu… Tatooine Human <chr…
10 Obi-Wa… auburn, wh… fair blue-gray male mascu… Stewjon Human <chr…
# … with 77 more rows, and 2 more variables: vehicles <list>, starships <list>
Note that I used ~
here to indicate that I’m creating an anonymous function to check whether column type is numeric. A one-sided formula (expression beginning with ~
) is interpreted as function(x)
, and wherever x
would go in the function is represented by .
.
So, I could write the same code like so:
%>%
df.starwars select(where(function(x) !is.numeric(x))) # selects all columns that are not numeric
# A tibble: 87 × 11
name hair_color skin_color eye_color sex gender homeworld species films
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lis>
1 Luke S… blond fair blue male mascu… Tatooine Human <chr…
2 C-3PO <NA> gold yellow none mascu… Tatooine Droid <chr…
3 R2-D2 <NA> white, bl… red none mascu… Naboo Droid <chr…
4 Darth … none white yellow male mascu… Tatooine Human <chr…
5 Leia O… brown light brown fema… femin… Alderaan Human <chr…
6 Owen L… brown, grey light blue male mascu… Tatooine Human <chr…
7 Beru W… brown light blue fema… femin… Tatooine Human <chr…
8 R5-D4 <NA> white, red red none mascu… Tatooine Droid <chr…
9 Biggs … black light brown male mascu… Tatooine Human <chr…
10 Obi-Wa… auburn, wh… fair blue-gray male mascu… Stewjon Human <chr…
# … with 77 more rows, and 2 more variables: vehicles <list>, starships <list>
For more details, take a look at the help file for select()
, and this this great tutorial in which I learned about some of the more advanced ways of using select()
.
mutate()
mutate()
is used to change existing columns or make new ones.
%>%
df.starwars mutate(height = height / 100, # to get height in meters
bmi = mass / (height^2)) %>% # bmi = kg / (m^2)
select(name, height, mass, bmi)
# A tibble: 87 × 4
name height mass bmi
<chr> <dbl> <dbl> <dbl>
1 Luke Skywalker 1.72 77 26.0
2 C-3PO 1.67 75 26.9
3 R2-D2 0.96 32 34.7
4 Darth Vader 2.02 136 33.3
5 Leia Organa 1.5 49 21.8
6 Owen Lars 1.78 120 37.9
7 Beru Whitesun lars 1.65 75 27.5
8 R5-D4 0.97 32 34.0
9 Biggs Darklighter 1.83 84 25.1
10 Obi-Wan Kenobi 1.82 77 23.2
# … with 77 more rows
Here, I’ve calculated the bmi for the different starwars characters. I first mutated the height variable by going from cm to m, and then created the new column “bmi”.
A useful helper function for mutate()
is ifelse()
which is a shorthand for the if-else control flow (Section 1.3.4.1). Here is an example:
%>%
df.starwars mutate(height_categorical = ifelse(height > median(height, na.rm = T),
"tall",
"short")) %>%
select(name, contains("height"))
# A tibble: 87 × 3
name height height_categorical
<chr> <int> <chr>
1 Luke Skywalker 172 short
2 C-3PO 167 short
3 R2-D2 96 short
4 Darth Vader 202 tall
5 Leia Organa 150 short
6 Owen Lars 178 short
7 Beru Whitesun lars 165 short
8 R5-D4 97 short
9 Biggs Darklighter 183 tall
10 Obi-Wan Kenobi 182 tall
# … with 77 more rows
ifelse()
works in the following way: we first specify the condition, then what should be returned if the condition is true, and finally what should be returned otherwise. The more verbose version of the statement above would be: ifelse(test = height > median(height, na.rm = T), yes = "tall", no = "short")
In previous versions of dplyr
(the package we use for data wrangling), there were a variety of additional mutate functions such as mutate_at()
, mutate_if()
, and mutate_all()
. In the most recent version of dplyr
, these additional functions have been deprecated, and replaced with the flexible across()
helper function.
across()
across()
allows us to use the syntax that we’ve learned for select()
to select particular variables and apply a function to each of the selected variables.
For example, let’s imagine that we want to z-score a number of variables in our data frame. We can do this like so:
%>%
df.starwars mutate(across(.cols = c(height, mass, birth_year),
.fns = scale))
# A tibble: 87 × 14
name height[,1] mass[,1] hair_color skin_color eye_color birth_year[,1]
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 Luke Sky… -0.0678 -0.120 blond fair blue -0.443
2 C-3PO -0.212 -0.132 <NA> gold yellow 0.158
3 R2-D2 -2.25 -0.385 <NA> white, bl… red -0.353
4 Darth Va… 0.795 0.228 none white yellow -0.295
5 Leia Org… -0.701 -0.285 brown light brown -0.443
6 Owen Lars 0.105 0.134 brown, grey light blue -0.230
7 Beru Whi… -0.269 -0.132 brown light blue -0.262
8 R5-D4 -2.22 -0.385 <NA> white, red red NA
9 Biggs Da… 0.249 -0.0786 black light brown -0.411
10 Obi-Wan … 0.220 -0.120 auburn, wh… fair blue-gray -0.198
# … with 77 more rows, and 7 more variables: sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, films <list>, vehicles <list>,
# starships <list>
In the .cols =
argument of across()
, I’ve specified what variables to mutate. In the .fns =
argument, I’ve specified that I want to use the function scale
. Note that I wrote the function without ()
. The .fns
argument expects allows these possible values:
mean
.
as a dummy argument, ~ mean(.)
(note the ~
before the function call)list(mean = mean, median = ~ median(.))
(where I’ve mixed both of the other ways)Let’s use this same starwars dataset, and run some basic statistical tests.
Let’s look at our dataframe again by clicking on ‘df.starwars’ in our environment
Just looking at the data, I hypothesize that there is a linear relationship between height and mass, such that as height increases, as does mass. Sounds plausible, right?
Let’s start by visualizing these two variables in a scatterplot
ggplot(data = df.starwars,
mapping = aes(x = height,
y=mass))+
geom_point()
Warning: Removed 28 rows containing missing values (geom_point).
Woah! We have a massive outlier that is really going to skew our results. Let’s get rid of that just using some tidyverse data wrangling.
<- df.starwars %>%
df.starwars_noOutliers filter(mass < 200)
Now let’s plot it again and see what our relationship looks like.
ggplot(data = df.starwars_noOutliers,
mapping = aes(x = height,
y=mass))+
geom_point()
Much better! And yes, it does appear that there is a strong positive linear relationship between these two variables. Can statistics confirm our suspected correlation?
Let’s use a correlation test. The function cor.test comes with base R and outputs an correlation coefficient and a p-value.
cor.test(df.starwars_noOutliers$height, df.starwars_noOutliers$mass)
Pearson's product-moment correlation
data: df.starwars_noOutliers$height and df.starwars_noOutliers$mass
t = 8.7853, df = 56, p-value = 4.018e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6260700 0.8520232
sample estimates:
cor
0.7612612
What does this output mean? Let’s look at the helper window for more information about ‘cor.test’
?cor.test
In APA, we usually report the estimate and the p-value. So, in this case we might say something like: We observed a strong positive correlation between height and mass in the starwars dataset, r(56) = 0.76, p < .001.
For more information about reporting statistics in APA format: https://www.scribbr.com/apa-style/numbers-and-statistics/
Lastly, let’s visualize this relationship in our ggplot
ggplot(data = df.starwars_noOutliers,
mapping = aes(x = height,
y=mass))+
geom_point() +
geom_smooth(method=lm)
`geom_smooth()` using formula 'y ~ x'
Nice!
What about one more advanced, very commonly used statistical test: linear regression
First, check out the differences between correlation and linear regressions: http://sites.utexas.edu/sos/guided/inferential/numeric/bivariate/cor/#:~:text=A%20correlation%20analysis%20provides%20information,variable%20based%20on%20the%20other.
Honestly, I’m not an expert on when to use what statistical test, but Luiza will help you all navigate this with your own data.
Ok, let’s say we want to know whether there mass can be predicted from height:
= df.starwars_noOutliers %>%
model lm(mass ~ height,
data = .)
%>%
model summary()
Call:
lm(formula = mass ~ height, data = .)
Residuals:
Min 1Q Median 3Q Max
-39.382 -8.212 0.211 3.846 57.327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.54076 12.56053 -2.591 0.0122 *
height 0.62136 0.07073 8.785 4.02e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared: 0.5795, Adjusted R-squared: 0.572
F-statistic: 77.18 on 1 and 56 DF, p-value: 4.018e-12
Do we think that gender might moderate this relationship?
= df.starwars_noOutliers %>%
model lm(mass ~ height + gender,
data = .)
%>%
model summary()
Call:
lm(formula = mass ~ height + gender, data = .)
Residuals:
Min 1Q Median 3Q Max
-43.184 -6.878 -2.184 0.041 53.773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.70133 12.18945 -3.995 0.000197 ***
height 0.61057 0.06356 9.607 2.78e-13 ***
gendermasculine 22.04507 6.24608 3.529 0.000860 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.17 on 54 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.6684, Adjusted R-squared: 0.6561
F-statistic: 54.42 on 2 and 54 DF, p-value: 1.14e-13
As you can see, linear regressions are very flexible! This is what you will use most of the time. This syntax will help you as you start to navigate Luiza’s data.
For more information of linear regression and modeling (highly recommend learning more), check out these datacamp courses: