Lesson adapted from Tobi Gerstenberg’s PSY252 course: https://psych252.github.io/psych252book/data-wrangling-1.html

This book is also a great resource for all things statistics! Check it out.

1 Data wrangling

In this lesson, we will take a look at how to wrangle data using the dplyr package. Getting our data into shape is something you’ll need for the rest of your psychology / data analysis career, so it’s worth spending some time getting a good sense for how this works. The nice thing about R is that (thanks to the tidyverse), both visualization and data wrangling are particularly simple.

1.1 Goals for this Lesson

  • Review R basics
  • Learn how the pipe operator %>% works
  • See different ways for getting a sense of one’s data.
  • Learn data manipulation verbs from the dplyr package (incl. filter(), arrange(), rename(), relocate(), select(), mutate())

1.2 Load packages

Let’s first load the our packages

library("knitr")        # for rendering the RMarkdown file
library("skimr")        # for visualizing data 
library("visdat")       # for visualizing data 
library("DT")           # for visualizing data 
library("tidyverse")    # for data wrangling

opts_chunk$set(comment = "",
               fig.show = "hold")

1.3 Quick overview of R basics

1.3.1 Modes

Variables in R can have different modes. Table 1.1 shows the most common ones.

Table 1.1: Most commonly used variable modes in R.
name example
numeric 1, 3, 48
character 'Steve', 'a', '78'
logical TRUE, FALSE
not available NA

For characters you can either use " or '. R has a number of functions to convert a variable from one mode to another. NA is used for missing values.

tmp1 = "1" # we start with a character
str(tmp1) 
 chr "1"
tmp2 = as.numeric(tmp1) # turn it into a numeric
str(tmp2) 
 num 1
tmp3 = as.factor(tmp2) # turn that into a factor
str(tmp3)
 Factor w/ 1 level "1": 1
tmp4 = as.character(tmp3) # and go full cycle by turning it back into a character
str(tmp4)
 chr "1"
identical(tmp1, tmp4) # checks whether tmp1 and tmp4 are the same
[1] TRUE

The str() function displays the structure of an R object. Here, it shows us what mode the variable is.

1.3.2 Data types

R has a number of different data types. Table 1.2 shows the ones you’re most likely to come across (taken from this source):

Table 1.2: Most commonly used data types in R.
name description
vector list of values with of the same variable mode
factor for ordinal variables
matrix 2D data structure
array same as matrix for higher dimensional data
data frame similar to matrix but with column names
list flexible type that can contain different other variable types

1.3.2.1 Vectors

We build vectors using the concatenate function c(), and we use [] to access one or more elements of a vector.

numbers = c(1, 4, 5) # make a vector
numbers[2] # access the second element 
[1] 4
numbers[1:2] # access the first two elements
[1] 1 4
numbers[c(1, 3)] # access the first and last element
[1] 1 5

In R (unlike in Python for example), 1 refers to the first element of a vector (or list).

1.3.2.2 Matrix

We build a matrix using the matrix() function, and we use [] to access its elements.

matrix = matrix(data = c(1, 2, 3, 4, 5, 6),
                nrow = 3,
                ncol = 2)
matrix # the full matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
matrix[1, 2] # element in row 1, column 2
[1] 4
matrix[1, ] # all elements in the first row 
[1] 1 4
matrix[ , 1] # all elements in the first column 
[1] 1 2 3
matrix[-1, ] # a matrix which excludes the first row
     [,1] [,2]
[1,]    2    5
[2,]    3    6

Note how we use an empty placeholder to indicate that we want to select all the values in a row or column, and - to indicate that we want to remove something.

1.3.2.3 Array

Arrays work the same was as matrices with data of more than two dimensions.

Won’t be covering this really. Check out this resource for more information if you’re curious: https://www.tutorialspoint.com/r/r_arrays.htm

1.3.2.4 Data frame

df = tibble(participant_id = c(1, 2, 3),
            participant_name = c("Leia", "Luke", "Darth")) # make the data frame 

df # the complete data frame
# A tibble: 3 × 2
  participant_id participant_name
           <dbl> <chr>           
1              1 Leia            
2              2 Luke            
3              3 Darth           
df[1, 2] # a single element using numbers 
# A tibble: 1 × 1
  participant_name
  <chr>           
1 Leia            
df$participant_id # all participants 
[1] 1 2 3
df[["participant_id"]] # same as before but using [[]] instead of $
[1] 1 2 3
df$participant_name[2] # name of the second participant
[1] "Luke"
df[["participant_name"]][2] # same as above
[1] "Luke"

We’ll use data frames a lot. Data frames are like a matrix with column names. Data frames are also more general than matrices in that different columns can have different modes. For example, one column might be a character, another one numeric, and another one a factor.

Here we used the tibble() function to create the data frame. A tibble is almost the same as a data frame but it has better defaults for formatting output in the console (more information on tibbles is here).

1.3.2.5 Lists

l.mixed = list(number = 1, 
               character = "2", 
               factor = factor(3), 
               matrix = matrix(1:4, ncol = 2),
               df = tibble(x = c(1, 2), y = c(3, 4)))
l.mixed
$number
[1] 1

$character
[1] "2"

$factor
[1] 3
Levels: 3

$matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$df
# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1     1     3
2     2     4
# three different ways of accessing a list
l.mixed$character
[1] "2"
l.mixed[["character"]]
[1] "2"
l.mixed[[2]] 
[1] "2"

Lists are a very flexible data format. You can put almost anything in a list.

1.3.3 Operators

Table 1.3 shows the comparison operators that result in logical outputs.

Table 1.3: Table of comparison operators that result in boolean (TRUE/FALSE) outputs.
symbol name
== equal to
!= not equal to
>, < greater/less than
>=, <= greater/less than or equal
&, &#124;, ! logical operators: and, or, not
%in% checks whether an element is in an object

The %in% operator is very useful, and we can use it like so:

x = c(1, 2, 3)
2 %in% x 
[1] TRUE
c(3, 4) %in% x
[1]  TRUE FALSE

It’s particularly useful for filtering data! More to come on this.

1.3.4 Control flow

1.3.4.1 if-then

number = 3

if (number == 1) {
  print("The number is 1.")
} else if (number == 2) {
  print("The number is 2.")
} else {
  print("The number is neither 1 nor 2.")
}
[1] "The number is neither 1 nor 2."

As a shorthand version, we can also use the ifelse() function like so:

number = 3
ifelse(test = number == 1, yes = "correct", no = "false")
[1] "false"

The way to read this: if ______ is true, then put ______, otherwise put ______.

I use this command a lot! Worth getting comfortable with.

1.3.4.2 for loop

sequence = 1:10

for(i in 1:length(sequence)){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

For more on for loops: https://www.datamentor.io/r-programming/for-loop/

1.3.4.3 while loop

number = 1 

while(number <= 10){
  print(number)
  number = number + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

For more on while loop: https://www.datamentor.io/r-programming/while-loop/

1.3.5 Functions

fun.add_two_numbers = function(a, b){
  x = a + b
  return(str_c("The result is ", x))
}

fun.add_two_numbers(1, 2)
[1] "The result is 3"

I’ve used the str_c() function here to concatenate the string with the number. (R converts the number x into a string for us.) Note, R functions can only return a single object. However, this object can be a list (which can contain anything).

For more on functions: https://swcarpentry.github.io/r-novice-inflammation/02-func-R/

1.3.5.1 Some often used functions

Table 1.4: Some frequently used functions.
name description
length() length of an object
dim() dimensions of an object (e.g. number of rows and columns)
rm() remove an object
seq() generate a sequence of numbers
rep() repeat something n times
max() maximum
min() minimum
which.max() index of the maximum
which.min() index of the maximum
mean() mean
median() median
sum() sum
var() variance
sd() standard deviation

1.3.6 The pipe operator %>%

The pipe operator %>% is a special operator that used heavily in the tidyverse. The basic idea is simple: this operator allows us to “pipe” several functions into one long chain that matches the order in which we want to do stuff.

Let’s consider the following example of making and eating a cake (thanks to https://twitter.com/dmi3k/status/1191824875842879489?s=09). This would be the traditional way of writing some code:

eat(
  slice(
    bake(
      put(
        pour(
          mix(ingredients),
          into = baking_form),
        into = oven),
      time = 30),
    pieces = 6),
  1)

To see what’s going on here, we need to read the code inside out. That is, we have to start in the innermost bracket, and then work our way outward. However, there is a natural causal ordering to these steps and wouldn’t it be nice if we could just write code in that order? Thanks to the pipe operator %>% we can! Here is the same example using the pipe:

ingredients %>% 
  mix %>% 
  pour(into = baking_form) %>% 
  put(into = oven) %>% 
  bake(time = 30) %>% 
  slice(pieces = 6) %>% 
  eat(1)

This code is much easier to read and write, since it represents the order in which we want to do things!

Abstractly, the pipe operator does the following:

f(x) can be rewritten as x %>% f()

For example, in standard R, we would write:

x = 1:3

# standard R 
sum(x)
[1] 6

With the pipe, we can rewrite this as:

x = 1:3

# with the pipe  
x %>% sum()
[1] 6

This doesn’t seem super useful yet, but it will be!

f(x, y) can be rewritten as x %>% f(y)

So, we could rewrite the following standard R code …

# rounding pi to 6 digits, standard R 
round(pi, digits = 6)
[1] 3.141593

… by using the pipe:

# rounding pi to 6 digits, standard R 
pi %>% round(digits = 6)
[1] 3.141593

Here is another example:

a = 3
b = 4
sum(a, b) # standard way 
[1] 7
a %>% sum(b) # the pipe way 
[1] 7

The pipe operator inserts the result of the previous computation as a first element into the next computation. So, a %>% sum(b) is equivalent to sum(a, b). We can also specify to insert the result at a different position via the . operator. For example:

a = 1
b = 10 
b %>% seq(from = a, to = .)
 [1]  1  2  3  4  5  6  7  8  9 10

Here, I used the . operator to specify that I woud like to insert the result of b where I’ve put the . in the seq() function.

f(x, y) can be rewritten as y %>% f(x, .)

h(g(f(x))) can be rewritten as x %>% f() %>% g() %>% h()

Look how much simpler!

One more example to show you why the pipe is very helpful:

h(g(f(x))) can be rewritten as x %>% f() %>% g() %>% h()

For example, consider that we want to calculate the root mean squared error (RMSE) between prediction and data.

Here is how the RMSE is defined:

\[ \text{RMSE} = \sqrt\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n} \] where \(\hat{y}_i\) denotes the prediction, and \(y_i\) the actually observed value.

In base R, we would do the following.

data = c(1, 3, 4, 2, 5)
prediction = c(1, 2, 2, 1, 4)

# calculate root mean squared error
rmse = sqrt(mean((prediction-data)^2))
print(rmse)
[1] 1.183216

Using the pipe operator makes the operation more intuitive:

data = c(1, 3, 4, 2, 5)
prediction = c(1, 2, 2, 1, 4)

# calculate root mean squared error the pipe way 
rmse = (prediction-data)^2 %>% 
  mean() %>% 
  sqrt() %>% 
  print() 
[1] 1.183216

First, we calculate the squared error, then we take the mean, then the square root, and then print the result.

The pipe operator %>% is similar to the + used in ggplot2. It allows us to take step-by-step actions in a way that fits the causal ordering of how we want to do things.

Tip: The keyboard shortcut for the pipe operator is:
cmd/ctrl + shift + m
Definitely learn this one – we’ll use the pipe a lot!!

Tip: Code is generally easier to read when the pipe %>% is at the end of a line (just like the + in ggplot2).

A key advantage of using the pipe is that you don’t have to save intermediate computations as new variables and this helps to keep your environment nice and clean!

2 Using tidyverse and the pipe to manipulate data

The package dplyr which we loaded as part of the tidyverse, includes a data set with information about starwars characters. Let’s store this as df.starwars.

df.starwars = starwars

There are several ways of taking a look at data in R. Personally, I like to look at the data within RStudio’s data viewer. To do so, you can:

Sometimes it’s also helpful to look at data in the console instead of the data viewer. Particularly when the data is very large, the data viewer can be sluggish.

Here are some useful functions:

2.1 View Data

2.1.2 glimpse()

glimpse() is helpful when the data frame has many columns. The data is shown in a transposed way with columns as rows.

glimpse(df.starwars)
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

2.1.3 distinct()

distinct() shows all the distinct values for a character or factor column.

df.starwars %>% 
  distinct(species)
# A tibble: 38 × 1
   species       
   <chr>         
 1 Human         
 2 Droid         
 3 Wookiee       
 4 Rodian        
 5 Hutt          
 6 Yoda's species
 7 Trandoshan    
 8 Mon Calamari  
 9 Ewok          
10 Sullustan     
# … with 28 more rows

2.1.4 count()

count() shows a count of all the different distinct values in a column.

df.starwars %>% 
  count(eye_color)
# A tibble: 15 × 2
   eye_color         n
   <chr>         <int>
 1 black            10
 2 blue             19
 3 blue-gray         1
 4 brown            21
 5 dark              1
 6 gold              1
 7 green, yellow     1
 8 hazel             3
 9 orange            8
10 pink              1
11 red               5
12 red, blue         1
13 unknown           3
14 white             1
15 yellow           11

It’s possible to do grouped counts by combining several variables.

df.starwars %>% 
  count(eye_color, gender) %>% 
  head(n = 10)
# A tibble: 10 × 3
   eye_color gender        n
   <chr>     <chr>     <int>
 1 black     feminine      2
 2 black     masculine     8
 3 blue      feminine      6
 4 blue      masculine    12
 5 blue      <NA>          1
 6 blue-gray masculine     1
 7 brown     feminine      5
 8 brown     masculine    15
 9 brown     <NA>          1
10 dark      masculine     1

2.2 Wrangling data

We use the functions in the package dplyr to manipulate our data.

2.2.1 filter()

filter() lets us apply logical (and other) operators (see Table 1.3) to subset the data. Here, I’ve filtered out the masculine characters.

df.starwars %>% 
  filter(gender == "masculine")
# A tibble: 66 × 14
   name    height  mass hair_color  skin_color eye_color birth_year sex   gender
   <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke S…    172    77 blond       fair       blue            19   male  mascu…
 2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu…
 3 R2-D2       96    32 <NA>        white, bl… red             33   none  mascu…
 4 Darth …    202   136 none        white      yellow          41.9 male  mascu…
 5 Owen L…    178   120 brown, grey light      blue            52   male  mascu…
 6 R5-D4       97    32 <NA>        white, red red             NA   none  mascu…
 7 Biggs …    183    84 black       light      brown           24   male  mascu…
 8 Obi-Wa…    182    77 auburn, wh… fair       blue-gray       57   male  mascu…
 9 Anakin…    188    84 blond       fair       blue            41.9 male  mascu…
10 Wilhuf…    180    NA auburn, gr… fair       blue            64   male  mascu…
# … with 56 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

We can combine multiple conditions in the same call. Here, I’ve filtered out masculine characters, whose height is greater than the median height (i.e. they are in the top 50 percentile), and whose mass was not NA.

df.starwars %>% 
  filter(gender == "masculine",
         height > median(height, na.rm = T),
         !is.na(mass))
# A tibble: 27 × 14
   name    height  mass hair_color  skin_color eye_color birth_year sex   gender
   <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
 1 Darth …    202 136   none        white      yellow          41.9 male  mascu…
 2 Biggs …    183  84   black       light      brown           24   male  mascu…
 3 Obi-Wa…    182  77   auburn, wh… fair       blue-gray       57   male  mascu…
 4 Anakin…    188  84   blond       fair       blue            41.9 male  mascu…
 5 Chewba…    228 112   brown       unknown    blue           200   male  mascu…
 6 Boba F…    183  78.2 black       fair       brown           31.5 male  mascu…
 7 IG-88      200 140   none        metal      red             15   none  mascu…
 8 Bossk      190 113   none        green      red             53   male  mascu…
 9 Qui-Go…    193  89   brown       fair       blue            92   male  mascu…
10 Nute G…    191  90   none        mottled g… red             NA   male  mascu…
# … with 17 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Many functions like mean(), median(), var(), sd(), sum() have the argument na.rm which is set to FALSE by default. I set the argument to TRUE here (or T for short), which means that the NA values are ignored, and the median() is calculated based on the remaining values.

You can use , and & interchangeably in filter(). Make sure to use parentheses when combining several logical operators to indicate which logical operation should be performed first:

df.starwars %>% 
  filter((skin_color %in% c("dark", "pale") | gender == "hermaphrodite") & height > 170)
# A tibble: 9 × 14
  name     height  mass hair_color skin_color eye_color birth_year sex    gender
  <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
1 Lando C…    177    79 black      dark       brown             31 male   mascu…
2 Quarsh …    183    NA black      dark       brown             62 <NA>   <NA>  
3 Bib For…    180    NA none       pale       pink              NA male   mascu…
4 Mace Wi…    188    84 none       dark       brown             72 male   mascu…
5 Ki-Adi-…    198    82 white      pale       yellow            92 male   mascu…
6 Adi Gal…    184    50 none       dark       blue              NA female femin…
7 Saesee …    188    NA none       pale       orange            NA male   mascu…
8 Gregar …    185    85 black      dark       brown             NA male   mascu…
9 Sly Moo…    178    48 none       pale       white             NA <NA>   <NA>  
# … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

The starwars characters that have either a "dark" or a "pale" skin tone, or whose gender is "hermaphrodite", and whose height is at least 170 cm. The %in% operator is useful when there are multiple options. Instead of skin_color %in% c("dark", "pale"), I could have also written skin_color == "dark" | skin_color == "pale" but this gets cumbersome as the number of options increases.

2.2.2 arrange()

arrange() allows us to sort the values in a data frame by one or more column entries.

df.starwars %>% 
  arrange(hair_color, desc(height))
# A tibble: 87 × 14
   name    height  mass hair_color  skin_color eye_color birth_year sex   gender
   <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
 1 Mon Mo…    150  NA   auburn      fair       blue            48   fema… femin…
 2 Wilhuf…    180  NA   auburn, gr… fair       blue            64   male  mascu…
 3 Obi-Wa…    182  77   auburn, wh… fair       blue-gray       57   male  mascu…
 4 Bail P…    191  NA   black       tan        brown           67   male  mascu…
 5 Gregar…    185  85   black       dark       brown           NA   male  mascu…
 6 Biggs …    183  84   black       light      brown           24   male  mascu…
 7 Boba F…    183  78.2 black       fair       brown           31.5 male  mascu…
 8 Quarsh…    183  NA   black       dark       brown           62   <NA>  <NA>  
 9 Jango …    183  79   black       tan        brown           66   male  mascu…
10 Lando …    177  79   black       dark       brown           31   male  mascu…
# … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Here, I’ve sorted the data frame first by hair_color, and then by height. I’ve used the desc() function to sort height in descending order.

2.2.3 rename()

rename() renames column names.

df.starwars %>% 
  rename(person = name,
         mass_kg = mass)
# A tibble: 87 × 14
   person height mass_kg hair_color skin_color eye_color birth_year sex   gender
   <chr>   <int>   <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke …    172      77 blond      fair       blue            19   male  mascu…
 2 C-3PO     167      75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2      96      32 <NA>       white, bl… red             33   none  mascu…
 4 Darth…    202     136 none       white      yellow          41.9 male  mascu…
 5 Leia …    150      49 brown      light      brown           19   fema… femin…
 6 Owen …    178     120 brown, gr… light      blue            52   male  mascu…
 7 Beru …    165      75 brown      light      blue            47   fema… femin…
 8 R5-D4      97      32 <NA>       white, red red             NA   none  mascu…
 9 Biggs…    183      84 black      light      brown           24   male  mascu…
10 Obi-W…    182      77 auburn, w… fair       blue-gray       57   male  mascu…
# … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

The new variable names goes on the LHS of the= sign, and the old name on the RHS.

To rename all variables at the same time use rename_with():

df.starwars %>%
  rename_with(.fn = ~ toupper(.))
# A tibble: 87 × 14
   NAME    HEIGHT  MASS HAIR_COLOR  SKIN_COLOR EYE_COLOR BIRTH_YEAR SEX   GENDER
   <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke S…    172    77 blond       fair       blue            19   male  mascu…
 2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu…
 3 R2-D2       96    32 <NA>        white, bl… red             33   none  mascu…
 4 Darth …    202   136 none        white      yellow          41.9 male  mascu…
 5 Leia O…    150    49 brown       light      brown           19   fema… femin…
 6 Owen L…    178   120 brown, grey light      blue            52   male  mascu…
 7 Beru W…    165    75 brown       light      blue            47   fema… femin…
 8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu…
 9 Biggs …    183    84 black       light      brown           24   male  mascu…
10 Obi-Wa…    182    77 auburn, wh… fair       blue-gray       57   male  mascu…
# … with 77 more rows, and 5 more variables: HOMEWORLD <chr>, SPECIES <chr>,
#   FILMS <list>, VEHICLES <list>, STARSHIPS <list>

2.2.4 relocate()

relocate() moves columns. For example, the following piece of code moves the species column to the front of the data frame:

df.starwars %>% 
  relocate(species)
# A tibble: 87 × 14
   species name    height  mass hair_color skin_color eye_color birth_year sex  
   <chr>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Human   Luke S…    172    77 blond      fair       blue            19   male 
 2 Droid   C-3PO      167    75 <NA>       gold       yellow         112   none 
 3 Droid   R2-D2       96    32 <NA>       white, bl… red             33   none 
 4 Human   Darth …    202   136 none       white      yellow          41.9 male 
 5 Human   Leia O…    150    49 brown      light      brown           19   fema…
 6 Human   Owen L…    178   120 brown, gr… light      blue            52   male 
 7 Human   Beru W…    165    75 brown      light      blue            47   fema…
 8 Droid   R5-D4       97    32 <NA>       white, red red             NA   none 
 9 Human   Biggs …    183    84 black      light      brown           24   male 
10 Human   Obi-Wa…    182    77 auburn, w… fair       blue-gray       57   male 
# … with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
#   films <list>, vehicles <list>, starships <list>

We could also move the species column after the name column like so:

df.starwars %>% 
  relocate(species, .after = name)
# A tibble: 87 × 14
   name    species height  mass hair_color skin_color eye_color birth_year sex  
   <chr>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke S… Human      172    77 blond      fair       blue            19   male 
 2 C-3PO   Droid      167    75 <NA>       gold       yellow         112   none 
 3 R2-D2   Droid       96    32 <NA>       white, bl… red             33   none 
 4 Darth … Human      202   136 none       white      yellow          41.9 male 
 5 Leia O… Human      150    49 brown      light      brown           19   fema…
 6 Owen L… Human      178   120 brown, gr… light      blue            52   male 
 7 Beru W… Human      165    75 brown      light      blue            47   fema…
 8 R5-D4   Droid       97    32 <NA>       white, red red             NA   none 
 9 Biggs … Human      183    84 black      light      brown           24   male 
10 Obi-Wa… Human      182    77 auburn, w… fair       blue-gray       57   male 
# … with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
#   films <list>, vehicles <list>, starships <list>

2.2.5 select()

select() allows us to select a subset of the columns in the data frame.

df.starwars %>% 
  select(name, height, mass)
# A tibble: 87 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 C-3PO                 167    75
 3 R2-D2                  96    32
 4 Darth Vader           202   136
 5 Leia Organa           150    49
 6 Owen Lars             178   120
 7 Beru Whitesun lars    165    75
 8 R5-D4                  97    32
 9 Biggs Darklighter     183    84
10 Obi-Wan Kenobi        182    77
# … with 77 more rows

We can select multiple columns using the (from:to) syntax:

df.starwars %>%  
  select(name:birth_year) # from name to birth_year
# A tibble: 87 × 7
   name               height  mass hair_color    skin_color eye_color birth_year
   <chr>               <int> <dbl> <chr>         <chr>      <chr>          <dbl>
 1 Luke Skywalker        172    77 blond         fair       blue            19  
 2 C-3PO                 167    75 <NA>          gold       yellow         112  
 3 R2-D2                  96    32 <NA>          white, bl… red             33  
 4 Darth Vader           202   136 none          white      yellow          41.9
 5 Leia Organa           150    49 brown         light      brown           19  
 6 Owen Lars             178   120 brown, grey   light      blue            52  
 7 Beru Whitesun lars    165    75 brown         light      blue            47  
 8 R5-D4                  97    32 <NA>          white, red red             NA  
 9 Biggs Darklighter     183    84 black         light      brown           24  
10 Obi-Wan Kenobi        182    77 auburn, white fair       blue-gray       57  
# … with 77 more rows

Or use a variable for column selection:

columns = c("name", "height", "species")

df.starwars %>% 
  select(one_of(columns)) # useful when using a variable for column selection
# A tibble: 87 × 3
   name               height species
   <chr>               <int> <chr>  
 1 Luke Skywalker        172 Human  
 2 C-3PO                 167 Droid  
 3 R2-D2                  96 Droid  
 4 Darth Vader           202 Human  
 5 Leia Organa           150 Human  
 6 Owen Lars             178 Human  
 7 Beru Whitesun lars    165 Human  
 8 R5-D4                  97 Droid  
 9 Biggs Darklighter     183 Human  
10 Obi-Wan Kenobi        182 Human  
# … with 77 more rows

We can also deselect (multiple) columns:

df.starwars %>% 
  select(-name, -(birth_year:vehicles))
# A tibble: 87 × 6
   height  mass hair_color    skin_color  eye_color starships
    <int> <dbl> <chr>         <chr>       <chr>     <list>   
 1    172    77 blond         fair        blue      <chr [2]>
 2    167    75 <NA>          gold        yellow    <chr [0]>
 3     96    32 <NA>          white, blue red       <chr [0]>
 4    202   136 none          white       yellow    <chr [1]>
 5    150    49 brown         light       brown     <chr [0]>
 6    178   120 brown, grey   light       blue      <chr [0]>
 7    165    75 brown         light       blue      <chr [0]>
 8     97    32 <NA>          white, red  red       <chr [0]>
 9    183    84 black         light       brown     <chr [1]>
10    182    77 auburn, white fair        blue-gray <chr [5]>
# … with 77 more rows

And select columns by partially matching the column name:

df.starwars %>% 
  select(contains("_")) # every column that contains the character "_"
# A tibble: 87 × 4
   hair_color    skin_color  eye_color birth_year
   <chr>         <chr>       <chr>          <dbl>
 1 blond         fair        blue            19  
 2 <NA>          gold        yellow         112  
 3 <NA>          white, blue red             33  
 4 none          white       yellow          41.9
 5 brown         light       brown           19  
 6 brown, grey   light       blue            52  
 7 brown         light       blue            47  
 8 <NA>          white, red  red             NA  
 9 black         light       brown           24  
10 auburn, white fair        blue-gray       57  
# … with 77 more rows
df.starwars %>% 
  select(starts_with("h")) # every column that starts with an "h"
# A tibble: 87 × 3
   height hair_color    homeworld
    <int> <chr>         <chr>    
 1    172 blond         Tatooine 
 2    167 <NA>          Tatooine 
 3     96 <NA>          Naboo    
 4    202 none          Tatooine 
 5    150 brown         Alderaan 
 6    178 brown, grey   Tatooine 
 7    165 brown         Tatooine 
 8     97 <NA>          Tatooine 
 9    183 black         Tatooine 
10    182 auburn, white Stewjon  
# … with 77 more rows

We can rename some of the columns using select() like so:

df.starwars %>% 
  select(person = name, height, mass_kg = mass)
# A tibble: 87 × 3
   person             height mass_kg
   <chr>               <int>   <dbl>
 1 Luke Skywalker        172      77
 2 C-3PO                 167      75
 3 R2-D2                  96      32
 4 Darth Vader           202     136
 5 Leia Organa           150      49
 6 Owen Lars             178     120
 7 Beru Whitesun lars    165      75
 8 R5-D4                  97      32
 9 Biggs Darklighter     183      84
10 Obi-Wan Kenobi        182      77
# … with 77 more rows

2.2.5.1 where()

where() is a useful helper function that comes in handy, for example, when we want to select columns based on their data type.

df.starwars %>% 
  select(where(fn = ~is.numeric(.))) # just select numeric columns
# A tibble: 87 × 3
   height  mass birth_year
    <int> <dbl>      <dbl>
 1    172    77       19  
 2    167    75      112  
 3     96    32       33  
 4    202   136       41.9
 5    150    49       19  
 6    178   120       52  
 7    165    75       47  
 8     97    32       NA  
 9    183    84       24  
10    182    77       57  
# … with 77 more rows

The following selects all columns that are not numeric:

df.starwars %>% 
  select(where(fn = ~ !is.numeric(.))) # selects all columns that are not numeric
# A tibble: 87 × 11
   name    hair_color  skin_color eye_color sex   gender homeworld species films
   <chr>   <chr>       <chr>      <chr>     <chr> <chr>  <chr>     <chr>   <lis>
 1 Luke S… blond       fair       blue      male  mascu… Tatooine  Human   <chr…
 2 C-3PO   <NA>        gold       yellow    none  mascu… Tatooine  Droid   <chr…
 3 R2-D2   <NA>        white, bl… red       none  mascu… Naboo     Droid   <chr…
 4 Darth … none        white      yellow    male  mascu… Tatooine  Human   <chr…
 5 Leia O… brown       light      brown     fema… femin… Alderaan  Human   <chr…
 6 Owen L… brown, grey light      blue      male  mascu… Tatooine  Human   <chr…
 7 Beru W… brown       light      blue      fema… femin… Tatooine  Human   <chr…
 8 R5-D4   <NA>        white, red red       none  mascu… Tatooine  Droid   <chr…
 9 Biggs … black       light      brown     male  mascu… Tatooine  Human   <chr…
10 Obi-Wa… auburn, wh… fair       blue-gray male  mascu… Stewjon   Human   <chr…
# … with 77 more rows, and 2 more variables: vehicles <list>, starships <list>

Note that I used ~ here to indicate that I’m creating an anonymous function to check whether column type is numeric. A one-sided formula (expression beginning with ~) is interpreted as function(x), and wherever x would go in the function is represented by ..

So, I could write the same code like so:

df.starwars %>% 
  select(where(function(x) !is.numeric(x))) # selects all columns that are not numeric
# A tibble: 87 × 11
   name    hair_color  skin_color eye_color sex   gender homeworld species films
   <chr>   <chr>       <chr>      <chr>     <chr> <chr>  <chr>     <chr>   <lis>
 1 Luke S… blond       fair       blue      male  mascu… Tatooine  Human   <chr…
 2 C-3PO   <NA>        gold       yellow    none  mascu… Tatooine  Droid   <chr…
 3 R2-D2   <NA>        white, bl… red       none  mascu… Naboo     Droid   <chr…
 4 Darth … none        white      yellow    male  mascu… Tatooine  Human   <chr…
 5 Leia O… brown       light      brown     fema… femin… Alderaan  Human   <chr…
 6 Owen L… brown, grey light      blue      male  mascu… Tatooine  Human   <chr…
 7 Beru W… brown       light      blue      fema… femin… Tatooine  Human   <chr…
 8 R5-D4   <NA>        white, red red       none  mascu… Tatooine  Droid   <chr…
 9 Biggs … black       light      brown     male  mascu… Tatooine  Human   <chr…
10 Obi-Wa… auburn, wh… fair       blue-gray male  mascu… Stewjon   Human   <chr…
# … with 77 more rows, and 2 more variables: vehicles <list>, starships <list>

For more details, take a look at the help file for select(), and this this great tutorial in which I learned about some of the more advanced ways of using select().

2.2.6 mutate()

mutate() is used to change existing columns or make new ones.

df.starwars %>% 
  mutate(height = height / 100, # to get height in meters
         bmi = mass / (height^2)) %>% # bmi = kg / (m^2)
  select(name, height, mass, bmi)
# A tibble: 87 × 4
   name               height  mass   bmi
   <chr>               <dbl> <dbl> <dbl>
 1 Luke Skywalker       1.72    77  26.0
 2 C-3PO                1.67    75  26.9
 3 R2-D2                0.96    32  34.7
 4 Darth Vader          2.02   136  33.3
 5 Leia Organa          1.5     49  21.8
 6 Owen Lars            1.78   120  37.9
 7 Beru Whitesun lars   1.65    75  27.5
 8 R5-D4                0.97    32  34.0
 9 Biggs Darklighter    1.83    84  25.1
10 Obi-Wan Kenobi       1.82    77  23.2
# … with 77 more rows

Here, I’ve calculated the bmi for the different starwars characters. I first mutated the height variable by going from cm to m, and then created the new column “bmi”.

A useful helper function for mutate() is ifelse() which is a shorthand for the if-else control flow (Section 1.3.4.1). Here is an example:

df.starwars %>% 
  mutate(height_categorical = ifelse(height > median(height, na.rm = T),
                                     "tall",
                                     "short")) %>% 
  select(name, contains("height"))
# A tibble: 87 × 3
   name               height height_categorical
   <chr>               <int> <chr>             
 1 Luke Skywalker        172 short             
 2 C-3PO                 167 short             
 3 R2-D2                  96 short             
 4 Darth Vader           202 tall              
 5 Leia Organa           150 short             
 6 Owen Lars             178 short             
 7 Beru Whitesun lars    165 short             
 8 R5-D4                  97 short             
 9 Biggs Darklighter     183 tall              
10 Obi-Wan Kenobi        182 tall              
# … with 77 more rows

ifelse() works in the following way: we first specify the condition, then what should be returned if the condition is true, and finally what should be returned otherwise. The more verbose version of the statement above would be: ifelse(test = height > median(height, na.rm = T), yes = "tall", no = "short")

In previous versions of dplyr (the package we use for data wrangling), there were a variety of additional mutate functions such as mutate_at(), mutate_if(), and mutate_all(). In the most recent version of dplyr, these additional functions have been deprecated, and replaced with the flexible across() helper function.

2.2.6.1 across()

across() allows us to use the syntax that we’ve learned for select() to select particular variables and apply a function to each of the selected variables.

For example, let’s imagine that we want to z-score a number of variables in our data frame. We can do this like so:

df.starwars %>%  
  mutate(across(.cols = c(height, mass, birth_year),
                .fns = scale))
# A tibble: 87 × 14
   name      height[,1] mass[,1] hair_color  skin_color eye_color birth_year[,1]
   <chr>          <dbl>    <dbl> <chr>       <chr>      <chr>              <dbl>
 1 Luke Sky…    -0.0678  -0.120  blond       fair       blue              -0.443
 2 C-3PO        -0.212   -0.132  <NA>        gold       yellow             0.158
 3 R2-D2        -2.25    -0.385  <NA>        white, bl… red               -0.353
 4 Darth Va…     0.795    0.228  none        white      yellow            -0.295
 5 Leia Org…    -0.701   -0.285  brown       light      brown             -0.443
 6 Owen Lars     0.105    0.134  brown, grey light      blue              -0.230
 7 Beru Whi…    -0.269   -0.132  brown       light      blue              -0.262
 8 R5-D4        -2.22    -0.385  <NA>        white, red red               NA    
 9 Biggs Da…     0.249   -0.0786 black       light      brown             -0.411
10 Obi-Wan …     0.220   -0.120  auburn, wh… fair       blue-gray         -0.198
# … with 77 more rows, and 7 more variables: sex <chr>, gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

In the .cols = argument of across(), I’ve specified what variables to mutate. In the .fns = argument, I’ve specified that I want to use the function scale. Note that I wrote the function without (). The .fns argument expects allows these possible values:

  • the function itself, e.g. mean
  • a call to the function with . as a dummy argument, ~ mean(.) (note the ~ before the function call)
  • a list of functions list(mean = mean, median = ~ median(.)) (where I’ve mixed both of the other ways)

3 Simple Statistical Tests

Let’s use this same starwars dataset, and run some basic statistical tests.

Let’s look at our dataframe again by clicking on ‘df.starwars’ in our environment

Just looking at the data, I hypothesize that there is a linear relationship between height and mass, such that as height increases, as does mass. Sounds plausible, right?

3.1 Correlation

Let’s start by visualizing these two variables in a scatterplot

ggplot(data = df.starwars,
       mapping = aes(x = height,
                     y=mass))+
  geom_point()
Warning: Removed 28 rows containing missing values (geom_point).

Woah! We have a massive outlier that is really going to skew our results. Let’s get rid of that just using some tidyverse data wrangling.

df.starwars_noOutliers <- df.starwars %>% 
  filter(mass < 200)

Now let’s plot it again and see what our relationship looks like.

ggplot(data = df.starwars_noOutliers,
       mapping = aes(x = height,
                     y=mass))+
  geom_point()

Much better! And yes, it does appear that there is a strong positive linear relationship between these two variables. Can statistics confirm our suspected correlation?

Let’s use a correlation test. The function cor.test comes with base R and outputs an correlation coefficient and a p-value.

cor.test(df.starwars_noOutliers$height, df.starwars_noOutliers$mass)

    Pearson's product-moment correlation

data:  df.starwars_noOutliers$height and df.starwars_noOutliers$mass
t = 8.7853, df = 56, p-value = 4.018e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6260700 0.8520232
sample estimates:
      cor 
0.7612612 

What does this output mean? Let’s look at the helper window for more information about ‘cor.test’

?cor.test

In APA, we usually report the estimate and the p-value. So, in this case we might say something like: We observed a strong positive correlation between height and mass in the starwars dataset, r(56) = 0.76, p < .001.

For more information about reporting statistics in APA format: https://www.scribbr.com/apa-style/numbers-and-statistics/

Lastly, let’s visualize this relationship in our ggplot

ggplot(data = df.starwars_noOutliers,
       mapping = aes(x = height,
                     y=mass))+
  geom_point() +
  geom_smooth(method=lm)
`geom_smooth()` using formula 'y ~ x'

Nice!

3.2 Linear Regression

What about one more advanced, very commonly used statistical test: linear regression

First, check out the differences between correlation and linear regressions: http://sites.utexas.edu/sos/guided/inferential/numeric/bivariate/cor/#:~:text=A%20correlation%20analysis%20provides%20information,variable%20based%20on%20the%20other.

Honestly, I’m not an expert on when to use what statistical test, but Luiza will help you all navigate this with your own data.

Ok, let’s say we want to know whether there mass can be predicted from height:

model = df.starwars_noOutliers %>% 
  lm(mass ~ height,
       data = .)
model %>% 
  summary()

Call:
lm(formula = mass ~ height, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.382  -8.212   0.211   3.846  57.327 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.54076   12.56053  -2.591   0.0122 *  
height        0.62136    0.07073   8.785 4.02e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared:  0.5795,    Adjusted R-squared:  0.572 
F-statistic: 77.18 on 1 and 56 DF,  p-value: 4.018e-12

Do we think that gender might moderate this relationship?

model = df.starwars_noOutliers %>% 
  lm(mass ~ height + gender,
       data = .)
model %>% 
  summary()

Call:
lm(formula = mass ~ height + gender, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.184  -6.878  -2.184   0.041  53.773 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -48.70133   12.18945  -3.995 0.000197 ***
height            0.61057    0.06356   9.607 2.78e-13 ***
gendermasculine  22.04507    6.24608   3.529 0.000860 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.17 on 54 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6684,    Adjusted R-squared:  0.6561 
F-statistic: 54.42 on 2 and 54 DF,  p-value: 1.14e-13

As you can see, linear regressions are very flexible! This is what you will use most of the time. This syntax will help you as you start to navigate Luiza’s data.

For more information of linear regression and modeling (highly recommend learning more), check out these datacamp courses: