The purpose of this document is to provide an introduction to using the tidyverse in R. We will review the “core” packages within the tidyverse and give examples of each.
package_names <- c("tidyverse", "here", "janitor", "vtable")
for(x in package_names){
if (!x %in% rownames(installed.packages())) install.packages(x)
}
library(tidyverse)
library(here) # helps define file paths
library(janitor) # helps clean data
library(vtable) # helps summarize data

We’ll be using the Star Wars data that comes loaded with the tidyverse.
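The chunk that creates the data object is not shown here. A minimal sketch, assuming the `starwars_data` object used throughout is simply a copy of the `starwars` tibble that ships with dplyr:

```r
library(tidyverse)

# Assumption: starwars_data is a plain copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Printing a tibble shows the first 10 rows and as many columns as fit on screen
starwars_data
```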
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
| Name | Class | Values |
|---|---|---|
| name | character | |
| height | integer | Num: 66 to 264 |
| mass | numeric | Num: 15 to 1358 |
| hair_color | character | |
| skin_color | character | |
| eye_color | character | |
| birth_year | numeric | Num: 8 to 896 |
| sex | character | |
| gender | character | |
| homeworld | character | |
| species | character | |
| films | list | |
| vehicles | list | |
| starships | list | |
The tidyverse was created to give R packages a more consistent and
intuitive syntax. There are several “core” packages we’ll discuss
here, including:
- readr: read and write delimited files
- dplyr: manipulate data
- tidyr: tidy and clean data
- stringr: find, extract, and replace strings
- purrr: functional programming (iterating without for loops)
- forcats: working with factors
ggplot2 is another core package in the tidyverse, but it is not discussed here.
The pipe operator “%>%” strings together sequences of commands.
It takes whatever is on the left side of the operator and passes it as the first argument to the function on the right, so it can be read as “and then”.
Example: take starwars_data and then rename a column, saving the result as starwars_data2.
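The code for this step is not shown. A sketch that would produce output like the table that follows, assuming `starwars_data` is a copy of dplyr's `starwars` and that the renamed column is `name`:

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Read as: take starwars_data, and then rename `name` to `character_name`
starwars_data2 <- starwars_data %>%
  rename(character_name = name)

starwars_data2
```

Note that rename() uses the form `new_name = old_name`.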
## # A tibble: 87 × 14
## character_name height mass hair_color skin_color eye_color birth_year sex
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke Skywalker 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 none
## 3 R2-D2 96 32 <NA> white, bl… red 33 none
## 4 Darth Vader 202 136 none white yellow 41.9 male
## 5 Leia Organa 150 49 brown light brown 19 fema…
## 6 Owen Lars 178 120 brown, gr… light blue 52 male
## 7 Beru Whitesun … 165 75 brown light blue 47 fema…
## 8 R5-D4 97 32 <NA> white, red red NA none
## 9 Biggs Darkligh… 183 84 black light brown 24 male
## 10 Obi-Wan Kenobi 182 77 auburn, w… fair blue-gray 57 male
## # ℹ 77 more rows
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Tidyverse website: https://www.tidyverse.org/
Learn the Tidyverse with the book R for Data Science: https://r4ds.hadley.nz/
See the different packages here: https://www.tidyverse.org/packages/
Readr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf
Dplyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf
Tidyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf
Stringr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
Purrr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf
Forcats cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/main/factors.pdf
The readr package reads rectangular data from files such as CSVs.
As an example, we’ll write starwars_data to a CSV file and read it back into R. We’ll use the package “here”, which makes it easier to define file paths.
Because some of our columns are lists, we can’t export them to CSV without losing that data, so this is just to show how to use these functions.
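The write and read calls are not shown. One possible sketch, under a few assumptions: the file name `starwars_data.csv` and the object name `starwars_csv` are hypothetical, and the list columns are dropped explicitly before writing so the example behaves the same regardless of how a given readr version treats them.

```r
library(tidyverse)
library(here)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Hypothetical file name, resolved relative to the project root by here()
csv_path <- here("starwars_data.csv")

# Drop the list columns (films, vehicles, starships) before writing,
# because a flat CSV has no way to store them
starwars_data %>%
  select(!where(is.list)) %>%
  write_csv(csv_path)

# Read the file back in; readr guesses the column types
starwars_csv <- read_csv(csv_path)
starwars_csv
```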
The package dplyr has functions to facilitate data manipulation using consistent verbs.
select() chooses the columns we want to keep in our data, either by naming the columns, giving a range of columns, or describing them. It can also be used to rearrange the order of the columns.
For example, we can name the specific columns or range of columns that we want in the dataframe:
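A sketch of the two calls that would give the outputs that follow, assuming `starwars_data` is a copy of dplyr's `starwars`:

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Name the specific columns to keep
by_name <- starwars_data %>%
  select(name, hair_color)

# Keep a range of adjacent columns, from name through mass
by_range <- starwars_data %>%
  select(name:mass)

by_name
by_range
```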
## # A tibble: 87 × 2
## name hair_color
## <chr> <chr>
## 1 Luke Skywalker blond
## 2 C-3PO <NA>
## 3 R2-D2 <NA>
## 4 Darth Vader none
## 5 Leia Organa brown
## 6 Owen Lars brown, grey
## 7 Beru Whitesun lars brown
## 8 R5-D4 <NA>
## 9 Biggs Darklighter black
## 10 Obi-Wan Kenobi auburn, white
## # ℹ 77 more rows
## # A tibble: 87 × 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # ℹ 77 more rows
We can select which columns we want by describing the columns:
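The selection helpers used for the outputs that follow are not shown. A sketch that gives the same columns, assuming `ends_with()` and `where()` were the helpers used:

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Keep name plus every column whose name ends in "color"
color_cols <- starwars_data %>%
  select(name, ends_with("color"))

# Keep name plus every numeric column
numeric_cols <- starwars_data %>%
  select(name, where(is.numeric))

color_cols
numeric_cols
```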
## # A tibble: 87 × 4
## name hair_color skin_color eye_color
## <chr> <chr> <chr> <chr>
## 1 Luke Skywalker blond fair blue
## 2 C-3PO <NA> gold yellow
## 3 R2-D2 <NA> white, blue red
## 4 Darth Vader none white yellow
## 5 Leia Organa brown light brown
## 6 Owen Lars brown, grey light blue
## 7 Beru Whitesun lars brown light blue
## 8 R5-D4 <NA> white, red red
## 9 Biggs Darklighter black light brown
## 10 Obi-Wan Kenobi auburn, white fair blue-gray
## # ℹ 77 more rows
## # A tibble: 87 × 4
## name height mass birth_year
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 19
## 2 C-3PO 167 75 112
## 3 R2-D2 96 32 33
## 4 Darth Vader 202 136 41.9
## 5 Leia Organa 150 49 19
## 6 Owen Lars 178 120 52
## 7 Beru Whitesun lars 165 75 47
## 8 R5-D4 97 32 NA
## 9 Biggs Darklighter 183 84 24
## 10 Obi-Wan Kenobi 182 77 57
## # ℹ 77 more rows
We can also select the columns we don’t want using the same methods:
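A sketch of negative selection that matches the shape of the outputs that follow (12 columns, then 6 columns); the exact calls in the original are an assumption:

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Drop two columns by name
no_films_vehicles <- starwars_data %>%
  select(-films, -vehicles)

# Drop every character column, keeping numeric and list columns
no_character <- starwars_data %>%
  select(-where(is.character))

no_films_vehicles
no_character
```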
## # A tibble: 87 × 12
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 3 more variables: homeworld <chr>, species <chr>, starships <list>
## # A tibble: 87 × 6
## height mass birth_year films vehicles starships
## <int> <dbl> <dbl> <list> <list> <list>
## 1 172 77 19 <chr [5]> <chr [2]> <chr [2]>
## 2 167 75 112 <chr [6]> <chr [0]> <chr [0]>
## 3 96 32 33 <chr [7]> <chr [0]> <chr [0]>
## 4 202 136 41.9 <chr [4]> <chr [0]> <chr [1]>
## 5 150 49 19 <chr [5]> <chr [1]> <chr [0]>
## 6 178 120 52 <chr [3]> <chr [0]> <chr [0]>
## 7 165 75 47 <chr [3]> <chr [0]> <chr [0]>
## 8 97 32 NA <chr [1]> <chr [0]> <chr [0]>
## 9 183 84 24 <chr [1]> <chr [0]> <chr [1]>
## 10 182 77 57 <chr [6]> <chr [1]> <chr [5]>
## # ℹ 77 more rows
And we can rearrange the order of the columns:
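A sketch of reordering, assuming the original used `everything()` to keep the remaining columns after the ones moved to the front:

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# List the columns that should come first; everything() appends the rest
reordered <- starwars_data %>%
  select(name, homeworld, species, everything())

reordered
```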
## # A tibble: 87 × 14
## name homeworld species height mass hair_color skin_color eye_color
## <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skywalker Tatooine Human 172 77 blond fair blue
## 2 C-3PO Tatooine Droid 167 75 <NA> gold yellow
## 3 R2-D2 Naboo Droid 96 32 <NA> white, bl… red
## 4 Darth Vader Tatooine Human 202 136 none white yellow
## 5 Leia Organa Alderaan Human 150 49 brown light brown
## 6 Owen Lars Tatooine Human 178 120 brown, gr… light blue
## 7 Beru Whitesun… Tatooine Human 165 75 brown light blue
## 8 R5-D4 Tatooine Droid 97 32 <NA> white, red red
## 9 Biggs Darklig… Tatooine Human 183 84 black light brown
## 10 Obi-Wan Kenobi Stewjon Human 182 77 auburn, w… fair blue-gray
## # ℹ 77 more rows
## # ℹ 6 more variables: birth_year <dbl>, sex <chr>, gender <chr>, films <list>,
## # vehicles <list>, starships <list>
filter() keeps only the rows that meet the conditions we specify:
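The filter conditions are not shown. The three calls below are a sketch inferred from the printed results that follow (droids only, names containing "Skywalker", and humans from Tatooine), so treat the exact conditions as assumptions:

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Exact match on one column
droids_only <- starwars_data %>%
  filter(species == "Droid")

# Partial match on a string, using stringr's str_detect()
skywalkers <- starwars_data %>%
  filter(str_detect(name, "Skywalker"))

# Several conditions separated by commas must all be TRUE
tatooine_humans <- starwars_data %>%
  filter(homeworld == "Tatooine", species == "Human")

droids_only
skywalkers
tatooine_humans
```

Rows where a condition evaluates to NA are dropped, just like rows where it is FALSE.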
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi…
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi…
## 3 R5-D4 97 32 <NA> white, red red NA none masculi…
## 4 IG-88 200 140 none metal red 15 none masculi…
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine
## 6 BB8 NA NA none none black NA none masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 Anakin S… 188 84 blond fair blue 41.9 male mascu…
## 3 Shmi Sky… 163 NA black fair brown 72 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 8 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 3 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 4 Beru Whi… 165 75 brown light blue 47 fema… femin…
## 5 Biggs Da… 183 84 black light brown 24 male mascu…
## 6 Anakin S… 188 84 blond fair blue 41.9 male mascu…
## 7 Shmi Sky… 163 NA black fair brown 72 fema… femin…
## 8 Cliegg L… 183 NA brown fair blue 82 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
mutate() creates new columns based on the given commands or calculations.
For example, we can create a new column, “bmi”, using the mass and height data:
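A sketch of the calculation, assuming the standard BMI formula (mass in kilograms divided by height in metres squared) and that height is stored in centimetres:

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# height is in centimetres, so divide by 100 to get metres before squaring
starwars_bmi <- starwars_data %>%
  mutate(bmi = mass / (height / 100)^2) %>%
  select(name, height, mass, bmi)

starwars_bmi
```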
## # A tibble: 87 × 4
## name height mass bmi
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 26.0
## 2 C-3PO 167 75 26.9
## 3 R2-D2 96 32 34.7
## 4 Darth Vader 202 136 33.3
## 5 Leia Organa 150 49 21.8
## 6 Owen Lars 178 120 37.9
## 7 Beru Whitesun lars 165 75 27.5
## 8 R5-D4 97 32 34.0
## 9 Biggs Darklighter 183 84 25.1
## 10 Obi-Wan Kenobi 182 77 23.2
## # ℹ 77 more rows
Or we could create categories of heights:
starwars_data %>%
mutate(height_bin = case_when(height < 150 ~ "Less than 150",
height >= 150 & height < 175 ~ "150-174",
height >= 175 & height < 200 ~ "175-199",
height >= 200 ~ "200+")) %>%
select(name, height, height_bin)

## # A tibble: 87 × 3
## name height height_bin
## <chr> <int> <chr>
## 1 Luke Skywalker 172 150-174
## 2 C-3PO 167 150-174
## 3 R2-D2 96 Less than 150
## 4 Darth Vader 202 200+
## 5 Leia Organa 150 150-174
## 6 Owen Lars 178 175-199
## 7 Beru Whitesun lars 165 150-174
## 8 R5-D4 97 Less than 150
## 9 Biggs Darklighter 183 175-199
## 10 Obi-Wan Kenobi 182 175-199
## # ℹ 77 more rows
summarise() returns a summarized dataframe based on the groups we define and the functions we use to summarise.
For example, let’s find out how many characters there are from each species.
We can first use the group_by() function to group by species.
Then, we can use summarise(). This creates a new column (similarly to mutate) with the commands used within the arguments (in this case, it creates the column “count” using the function n()).
While mutate maintains all rows, summarise only maintains rows unique to the groups (in this case, species) and aggregates to the group level.
The arrange() function sorts the result so that we see the most common species first.
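The steps above can be sketched as a single pipeline, assuming `starwars_data` is a copy of dplyr's `starwars` (the object name `species_counts` is hypothetical):

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

species_counts <- starwars_data %>%
  group_by(species) %>%          # one group per species
  summarise(count = n()) %>%     # n() counts the rows in each group
  arrange(desc(count))           # largest groups first

species_counts
```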
## # A tibble: 38 × 2
## species count
## <chr> <int>
## 1 Human 35
## 2 Droid 6
## 3 <NA> 4
## 4 Gungan 3
## 5 Kaminoan 2
## 6 Mirialan 2
## 7 Twi'lek 2
## 8 Wookiee 2
## 9 Zabrak 2
## 10 Aleena 1
## # ℹ 28 more rows
We can join data a few different ways:
An inner_join() keeps only observations that appear in both x and y.
A left_join() keeps all observations in x.
A right_join() keeps all observations in y.
A full_join() keeps all observations in x or y.
Here are some example data that we can join to our starwars_data. “droids” shows the number of films each droid was present in. “droids_example” is the same as “droids”, but has an additional “fake” droid added for example purposes.
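The code that builds these two tables is not shown. One way they could be constructed is sketched below; the approach is an assumption, and `no_films` is converted to character in `droids_example` only because it prints as `<chr>` in the second table that follows.

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# One row per droid, counting the films each appears in
droids <- starwars_data %>%
  filter(species == "Droid") %>%
  unnest_longer(films) %>%
  group_by(name) %>%
  summarise(no_films = n())

# Same table plus one made-up droid that has no match in starwars_data
droids_example <- droids %>%
  mutate(no_films = as.character(no_films)) %>%
  bind_rows(tibble(name = "Fake Droid Example", no_films = "0"))

droids
droids_example
```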
## # A tibble: 6 × 2
## name no_films
## <chr> <int>
## 1 BB8 1
## 2 C-3PO 6
## 3 IG-88 1
## 4 R2-D2 7
## 5 R4-P17 2
## 6 R5-D4 1
## # A tibble: 7 × 2
## name no_films
## <chr> <chr>
## 1 BB8 1
## 2 C-3PO 6
## 3 IG-88 1
## 4 R2-D2 7
## 5 R4-P17 2
## 6 R5-D4 1
## 7 Fake Droid Example 0
Let’s start with an inner_join() between the starwars_data (x) and droids_example (y). As you can see, it only keeps rows in both dataframes.
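A sketch of the inner join. To keep the block self-contained, `droids_example` is rebuilt here from the values in the printed table; the object name `innerjoin` is an assumption that mirrors the names used for the other joins.

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Values copied from the printed droids_example table
droids_example <- tibble(
  name = c("BB8", "C-3PO", "IG-88", "R2-D2", "R4-P17", "R5-D4",
           "Fake Droid Example"),
  no_films = c("1", "6", "1", "7", "2", "1", "0")
)

# Only names present in BOTH tables survive, so the fake droid
# and all non-droid characters are dropped
innerjoin <- inner_join(starwars_data, droids_example, by = "name")
innerjoin
```

Passing `by = "name"` makes the join key explicit; without it, dplyr joins on all shared column names and prints a message saying which ones it used.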
## # A tibble: 6 × 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi…
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi…
## 3 R5-D4 97 32 <NA> white, red red NA none masculi…
## 4 IG-88 200 140 none metal red 15 none masculi…
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine
## 6 BB8 NA NA none none black NA none masculi…
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, no_films <chr>
Let’s do a left_join() next with the starwars_data (x) and droids_example (y). Since starwars_data is on the left (x), we keep all 87 rows, and the new column “no_films” from droids_example (y) is filled in for the droids; every other character gets NA. The fake droid in droids_example is dropped because it has no match in starwars_data.
leftjoin <- left_join(starwars_data, droids_example) %>%
select(name, no_films, everything())
leftjoin

## # A tibble: 87 × 15
## name no_films height mass hair_color skin_color eye_color birth_year sex
## <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke … <NA> 172 77 blond fair blue 19 male
## 2 C-3PO 6 167 75 <NA> gold yellow 112 none
## 3 R2-D2 7 96 32 <NA> white, bl… red 33 none
## 4 Darth… <NA> 202 136 none white yellow 41.9 male
## 5 Leia … <NA> 150 49 brown light brown 19 fema…
## 6 Owen … <NA> 178 120 brown, gr… light blue 52 male
## 7 Beru … <NA> 165 75 brown light blue 47 fema…
## 8 R5-D4 1 97 32 <NA> white, red red NA none
## 9 Biggs… <NA> 183 84 black light brown 24 male
## 10 Obi-W… <NA> 182 77 auburn, w… fair blue-gray 57 male
## # ℹ 77 more rows
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Next, a right_join(). This keeps all observations in droids_example, including the fake droid. All the columns from starwars_data for those droids get added to the data.
rightjoin <- right_join(starwars_data, droids_example) %>%
select(name, no_films, everything())
rightjoin

## # A tibble: 7 × 15
## name no_films height mass hair_color skin_color eye_color birth_year sex
## <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 C-3PO 6 167 75 <NA> gold yellow 112 none
## 2 R2-D2 7 96 32 <NA> white, bl… red 33 none
## 3 R5-D4 1 97 32 <NA> white, red red NA none
## 4 IG-88 1 200 140 none metal red 15 none
## 5 R4-P17 2 96 NA none silver, r… red, blue NA none
## 6 BB8 1 NA NA none none black NA none
## 7 Fake D… 0 NA NA <NA> <NA> <NA> NA <NA>
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Last, we’ll do a full_join(), where all observations present in either dataframe will be joined.
This includes the fake droid data along with the observations for all non-droids.
fulljoin <- full_join(starwars_data, droids_example) %>%
select(name, no_films, everything()) %>%
arrange(no_films)
fulljoin

## # A tibble: 88 × 15
## name no_films height mass hair_color skin_color eye_color birth_year sex
## <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Fake … 0 NA NA <NA> <NA> <NA> NA <NA>
## 2 R5-D4 1 97 32 <NA> white, red red NA none
## 3 IG-88 1 200 140 none metal red 15 none
## 4 BB8 1 NA NA none none black NA none
## 5 R4-P17 2 96 NA none silver, r… red, blue NA none
## 6 C-3PO 6 167 75 <NA> gold yellow 112 none
## 7 R2-D2 7 96 32 <NA> white, bl… red 33 none
## 8 Luke … <NA> 172 77 blond fair blue 19 male
## 9 Darth… <NA> 202 136 none white yellow 41.9 male
## 10 Leia … <NA> 150 49 brown light brown 19 fema…
## # ℹ 78 more rows
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
The package tidyr has functions that help achieve “tidy” data:
- Each variable is a column; each column is a variable
- Each observation is a row; each row is an observation
- Each value is a cell; each cell is a single value
We can nest or unnest data using nest() and the unnest functions.
For example, we could nest all the data columns by a group:
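A sketch that matches the grouped output that follows, assuming the data were grouped by gender before nesting (the object name `starwars_nested` is hypothetical):

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# After group_by(), nest() collapses every non-grouping column
# into one tibble per group, stored in a list column called `data`
starwars_nested <- starwars_data %>%
  group_by(gender) %>%
  nest()

starwars_nested
```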
## # A tibble: 3 × 2
## # Groups: gender [3]
## gender data
## <chr> <list>
## 1 masculine <tibble [66 × 13]>
## 2 feminine <tibble [17 × 13]>
## 3 <NA> <tibble [4 × 13]>
And then we could use pluck() to grab the data separately for each of the groups.
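A sketch of pulling out the nested tibbles with purrr's pluck(); extracting the whole `data` column returns a list with one tibble per group (the object name `group_tables` is hypothetical):

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

starwars_nested <- starwars_data %>%
  group_by(gender) %>%
  nest()

# pluck("data") extracts the list column: one tibble per gender group
group_tables <- starwars_nested %>%
  pluck("data")

group_tables

# A single group's tibble can be reached by adding a position
first_group <- starwars_nested %>%
  pluck("data", 1)
```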
## [[1]]
## # A tibble: 66 × 13
## name height mass hair_color skin_color eye_color birth_year sex homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 172 77 blond fair blue 19 male Tatooine
## 2 C-3PO 167 75 <NA> gold yellow 112 none Tatooine
## 3 R2-D2 96 32 <NA> white, bl… red 33 none Naboo
## 4 Dart… 202 136 none white yellow 41.9 male Tatooine
## 5 Owen… 178 120 brown, gr… light blue 52 male Tatooine
## 6 R5-D4 97 32 <NA> white, red red NA none Tatooine
## 7 Bigg… 183 84 black light brown 24 male Tatooine
## 8 Obi-… 182 77 auburn, w… fair blue-gray 57 male Stewjon
## 9 Anak… 188 84 blond fair blue 41.9 male Tatooine
## 10 Wilh… 180 NA auburn, g… fair blue 64 male Eriadu
## # ℹ 56 more rows
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
##
## [[2]]
## # A tibble: 17 × 13
## name height mass hair_color skin_color eye_color birth_year sex homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia… 150 49 brown light brown 19 fema… Alderaan
## 2 Beru… 165 75 brown light blue 47 fema… Tatooine
## 3 Mon … 150 NA auburn fair blue 48 fema… Chandrila
## 4 Shmi… 163 NA black fair brown 72 fema… Tatooine
## 5 Ayla… 178 55 none blue hazel 48 fema… Ryloth
## 6 Adi … 184 50 none dark blue NA fema… Coruscant
## 7 Cordé 157 NA brown light brown NA fema… Naboo
## 8 Lumi… 170 56.2 black yellow blue 58 fema… Mirial
## 9 Barr… 166 50 black yellow blue 40 fema… Mirial
## 10 Dormé 165 NA brown light brown NA fema… Naboo
## 11 Zam … 168 55 blonde fair, gre… yellow NA fema… Zolan
## 12 Taun… 213 NA none grey black NA fema… Kamino
## 13 Joca… 167 NA white fair blue NA fema… Coruscant
## 14 R4-P… 96 NA none silver, r… red, blue NA none <NA>
## 15 Shaa… 178 57 none red, blue… black NA fema… Shili
## 16 Rey NA NA brown light hazel NA fema… <NA>
## 17 Padm… 165 45 brown light brown 46 fema… Naboo
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
##
## [[3]]
## # A tibble: 4 × 13
## name height mass hair_color skin_color eye_color birth_year sex homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Ric O… 183 NA brown fair blue NA <NA> Naboo
## 2 Quars… 183 NA black dark brown 62 <NA> Naboo
## 3 Sly M… 178 48 none pale white NA <NA> Umbara
## 4 Capta… NA NA unknown unknown unknown NA <NA> <NA>
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
Next, we will unnest some data. As a reminder, the Star Wars data has 3 columns stored as lists:
## # A tibble: 87 × 3
## films vehicles starships
## <list> <list> <list>
## 1 <chr [5]> <chr [2]> <chr [2]>
## 2 <chr [6]> <chr [0]> <chr [0]>
## 3 <chr [7]> <chr [0]> <chr [0]>
## 4 <chr [4]> <chr [0]> <chr [1]>
## 5 <chr [5]> <chr [1]> <chr [0]>
## 6 <chr [3]> <chr [0]> <chr [0]>
## 7 <chr [3]> <chr [0]> <chr [0]>
## 8 <chr [1]> <chr [0]> <chr [0]>
## 9 <chr [1]> <chr [0]> <chr [1]>
## 10 <chr [6]> <chr [1]> <chr [5]>
## # ℹ 77 more rows
Let’s unnest the data in the films field (stored as a list).
We can do this using unnest_longer(), which gives each film its own row:
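A sketch of the call. The result is stored as `starwars_data2` on the assumption that this is the object the pivot example further down reads from; that name also appears in the earlier rename example, so it would be overwritten here.

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Each element of the films list becomes its own row,
# repeating the character's name alongside it
starwars_data2 <- starwars_data %>%
  unnest_longer(films) %>%
  select(films, name)

starwars_data2
```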
## # A tibble: 173 × 2
## films name
## <chr> <chr>
## 1 The Empire Strikes Back Luke Skywalker
## 2 Revenge of the Sith Luke Skywalker
## 3 Return of the Jedi Luke Skywalker
## 4 A New Hope Luke Skywalker
## 5 The Force Awakens Luke Skywalker
## 6 The Empire Strikes Back C-3PO
## 7 Attack of the Clones C-3PO
## 8 The Phantom Menace C-3PO
## 9 Revenge of the Sith C-3PO
## 10 Return of the Jedi C-3PO
## # ℹ 163 more rows
Or using unnest_wider()
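A sketch of the wide version. Because the film vectors are unnamed, `names_sep` is supplied so tidyr can build column names such as `films_1`; the object name `starwars_films_wide` is hypothetical.

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# Each element of the films list becomes its own column;
# characters with fewer films get NA in the trailing columns
starwars_films_wide <- starwars_data %>%
  select(name, films) %>%
  unnest_wider(films, names_sep = "_")

starwars_films_wide
```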
## # A tibble: 87 × 8
## name films_1 films_2 films_3 films_4 films_5 films_6 films_7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Luke Skywalker The Empir… Reveng… Return… A New … The Fo… <NA> <NA>
## 2 C-3PO The Empir… Attack… The Ph… Reveng… Return… A New … <NA>
## 3 R2-D2 The Empir… Attack… The Ph… Reveng… Return… A New … The Fo…
## 4 Darth Vader The Empir… Reveng… Return… A New … <NA> <NA> <NA>
## 5 Leia Organa The Empir… Reveng… Return… A New … The Fo… <NA> <NA>
## 6 Owen Lars Attack of… Reveng… A New … <NA> <NA> <NA> <NA>
## 7 Beru Whitesun lars Attack of… Reveng… A New … <NA> <NA> <NA> <NA>
## 8 R5-D4 A New Hope <NA> <NA> <NA> <NA> <NA> <NA>
## 9 Biggs Darklighter A New Hope <NA> <NA> <NA> <NA> <NA> <NA>
## 10 Obi-Wan Kenobi The Empir… Attack… The Ph… Reveng… Return… A New … <NA>
## # ℹ 77 more rows
Let’s play around some more with transforming data. First, we’ll do something similar to what we did above but now using the pivot functions.
Let’s work with the newly unnested (using unnest_longer) field “films” to pivot our data to see which characters are in which films.
First, we’ll start by pivoting wider.
starwars_data_wide <- starwars_data2 %>%
mutate(in_movie = 1) %>%
pivot_wider(names_from = films,
values_from = in_movie)
starwars_data_wide

## # A tibble: 87 × 8
## name The Empire Strikes B…¹ `Revenge of the Sith` `Return of the Jedi`
## <chr> <dbl> <dbl> <dbl>
## 1 Luke Skywa… 1 1 1
## 2 C-3PO 1 1 1
## 3 R2-D2 1 1 1
## 4 Darth Vader 1 1 1
## 5 Leia Organa 1 1 1
## 6 Owen Lars NA 1 NA
## 7 Beru White… NA 1 NA
## 8 R5-D4 NA NA NA
## 9 Biggs Dark… NA NA NA
## 10 Obi-Wan Ke… 1 1 1
## # ℹ 77 more rows
## # ℹ abbreviated name: ¹`The Empire Strikes Back`
## # ℹ 4 more variables: `A New Hope` <dbl>, `The Force Awakens` <dbl>,
## # `Attack of the Clones` <dbl>, `The Phantom Menace` <dbl>
We can use clean_names() from the janitor package to clean up the new column names in one line of code:
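A sketch of the cleaning step. The wide table is rebuilt first so the block runs on its own, and the result is stored as `starwars_data_wide2`, the name the next chunk reads from.

```r
library(tidyverse)
library(janitor)

# Rebuild the wide table (assumes starwars is dplyr's built-in data)
starwars_data_wide <- starwars %>%
  unnest_longer(films) %>%
  select(films, name) %>%
  mutate(in_movie = 1) %>%
  pivot_wider(names_from = films, values_from = in_movie)

# clean_names() converts column names to lower snake_case
starwars_data_wide2 <- starwars_data_wide %>%
  clean_names()

starwars_data_wide2
```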
## # A tibble: 87 × 8
## name the_empire_strikes_b…¹ revenge_of_the_sith return_of_the_jedi
## <chr> <dbl> <dbl> <dbl>
## 1 Luke Skywalker 1 1 1
## 2 C-3PO 1 1 1
## 3 R2-D2 1 1 1
## 4 Darth Vader 1 1 1
## 5 Leia Organa 1 1 1
## 6 Owen Lars NA 1 NA
## 7 Beru Whitesun … NA 1 NA
## 8 R5-D4 NA NA NA
## 9 Biggs Darkligh… NA NA NA
## 10 Obi-Wan Kenobi 1 1 1
## # ℹ 77 more rows
## # ℹ abbreviated name: ¹the_empire_strikes_back
## # ℹ 4 more variables: a_new_hope <dbl>, the_force_awakens <dbl>,
## # attack_of_the_clones <dbl>, the_phantom_menace <dbl>
And we can replace those NAs with 0s where the character is not in the movie by using the mutate() function.
starwars_data_wide2.2 <- starwars_data_wide2 %>%
mutate(across(the_empire_strikes_back:the_phantom_menace, ~ replace_na(.x, 0)))
starwars_data_wide2.2

## # A tibble: 87 × 8
## name the_empire_strikes_b…¹ revenge_of_the_sith return_of_the_jedi
## <chr> <dbl> <dbl> <dbl>
## 1 Luke Skywalker 1 1 1
## 2 C-3PO 1 1 1
## 3 R2-D2 1 1 1
## 4 Darth Vader 1 1 1
## 5 Leia Organa 1 1 1
## 6 Owen Lars 0 1 0
## 7 Beru Whitesun … 0 1 0
## 8 R5-D4 0 0 0
## 9 Biggs Darkligh… 0 0 0
## 10 Obi-Wan Kenobi 1 1 1
## # ℹ 77 more rows
## # ℹ abbreviated name: ¹the_empire_strikes_back
## # ℹ 4 more variables: a_new_hope <dbl>, the_force_awakens <dbl>,
## # attack_of_the_clones <dbl>, the_phantom_menace <dbl>
Then we can pivot this longer again…
starwars_data_long <- starwars_data_wide2.2 %>%
pivot_longer(cols = the_empire_strikes_back:the_phantom_menace,
names_to = "movie",
values_to = "character_in_movie")
starwars_data_long

## # A tibble: 609 × 3
## name movie character_in_movie
## <chr> <chr> <dbl>
## 1 Luke Skywalker the_empire_strikes_back 1
## 2 Luke Skywalker revenge_of_the_sith 1
## 3 Luke Skywalker return_of_the_jedi 1
## 4 Luke Skywalker a_new_hope 1
## 5 Luke Skywalker the_force_awakens 1
## 6 Luke Skywalker attack_of_the_clones 0
## 7 Luke Skywalker the_phantom_menace 0
## 8 C-3PO the_empire_strikes_back 1
## 9 C-3PO revenge_of_the_sith 1
## 10 C-3PO return_of_the_jedi 1
## # ℹ 599 more rows
We did this in separate chunks, but with the pipe we could have done it all in one chunk of code:
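A sketch of the combined pipeline. It selects columns with `-name` rather than the `the_empire_strikes_back:the_phantom_menace` range so that it does not depend on column order; that substitution is a choice made for this example.

```r
library(tidyverse)
library(janitor)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

starwars_data_long <- starwars_data %>%
  unnest_longer(films) %>%                       # one row per character-film pair
  select(films, name) %>%
  mutate(in_movie = 1) %>%
  pivot_wider(names_from = films,                # one column per film
              values_from = in_movie) %>%
  clean_names() %>%                              # snake_case column names
  mutate(across(-name, ~ replace_na(.x, 0))) %>% # NA means "not in movie"
  pivot_longer(cols = -name,                     # back to one row per pair
               names_to = "movie",
               values_to = "character_in_movie")

starwars_data_long
```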
Maybe we want to combine the film name with the episode number. We could do this using unite().
First, we’ll set up our data for this. We’ll unnest the films column, select only the columns we need, and create a column (using mutate) with the episode numbers. We can use case_when() to do this.
starwars_data_unite <- starwars_data %>%
unnest(films) %>%
select(name, films) %>%
mutate(episode_no = case_when(films == "The Phantom Menace" ~ 1,
films == "Attack of the Clones" ~ 2,
films == "Revenge of the Sith" ~ 3,
films == "A New Hope" ~ 4,
films == "The Empire Strikes Back" ~ 5,
films == "Return of the Jedi" ~ 6,
films == "The Force Awakens" ~ 7))
starwars_data_unite

## # A tibble: 173 × 3
## name films episode_no
## <chr> <chr> <dbl>
## 1 Luke Skywalker The Empire Strikes Back 5
## 2 Luke Skywalker Revenge of the Sith 3
## 3 Luke Skywalker Return of the Jedi 6
## 4 Luke Skywalker A New Hope 4
## 5 Luke Skywalker The Force Awakens 7
## 6 C-3PO The Empire Strikes Back 5
## 7 C-3PO Attack of the Clones 2
## 8 C-3PO The Phantom Menace 1
## 9 C-3PO Revenge of the Sith 3
## 10 C-3PO Return of the Jedi 6
## # ℹ 163 more rows
Now we can unite those two columns using unite().
The arguments in unite are:
- The new column name you are creating. In this case, we’ll call it
“movie”
- The columns you are uniting (episode_no and films)
- The string that will separate the two combined fields (“: ”, a colon followed by a space)
- You can also include remove = TRUE or remove = FALSE to remove or keep the
original columns (the default, remove = TRUE, drops them)
starwars_data_unite2 <- starwars_data_unite %>%
unite("movie", c(episode_no, films), sep = ": ")
starwars_data_unite2

## # A tibble: 173 × 2
## name movie
## <chr> <chr>
## 1 Luke Skywalker 5: The Empire Strikes Back
## 2 Luke Skywalker 3: Revenge of the Sith
## 3 Luke Skywalker 6: Return of the Jedi
## 4 Luke Skywalker 4: A New Hope
## 5 Luke Skywalker 7: The Force Awakens
## 6 C-3PO 5: The Empire Strikes Back
## 7 C-3PO 2: Attack of the Clones
## 8 C-3PO 1: The Phantom Menace
## 9 C-3PO 3: Revenge of the Sith
## 10 C-3PO 6: Return of the Jedi
## # ℹ 163 more rows
We can separate these columns again using separate():
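A sketch of the call, run on a two-row stand-in for `starwars_data_unite2` (values copied from the printed output above) so the block is self-contained:

```r
library(tidyverse)

# Small stand-in for starwars_data_unite2, using rows from the printed output
starwars_data_unite2 <- tibble(
  name  = c("Luke Skywalker", "C-3PO"),
  movie = c("5: The Empire Strikes Back", "2: Attack of the Clones")
)

# Split `movie` at ": " into two new columns; the original column is dropped
starwars_data_separate <- starwars_data_unite2 %>%
  separate(movie, into = c("episode_no", "films"), sep = ": ")

starwars_data_separate
```

Both new columns come back as character, which is why episode_no prints as `<chr>`; adding `convert = TRUE` would turn it back into a number.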
## # A tibble: 173 × 3
## name episode_no films
## <chr> <chr> <chr>
## 1 Luke Skywalker 5 The Empire Strikes Back
## 2 Luke Skywalker 3 Revenge of the Sith
## 3 Luke Skywalker 6 Return of the Jedi
## 4 Luke Skywalker 4 A New Hope
## 5 Luke Skywalker 7 The Force Awakens
## 6 C-3PO 5 The Empire Strikes Back
## 7 C-3PO 2 Attack of the Clones
## 8 C-3PO 1 The Phantom Menace
## 9 C-3PO 3 Revenge of the Sith
## 10 C-3PO 6 Return of the Jedi
## # ℹ 163 more rows
The stringr package makes working with strings “as easy as possible”. Its functions find, replace, and extract strings, and they can be used with regular expressions (RegEx).
str_detect() tells you whether there is a match to the specified pattern.
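The call is not shown. Judging by the output that follows, it kept characters whose hair colour contains "brown"; a sketch under that assumption (the object name `brown_hair` is hypothetical):

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# str_detect() returns TRUE/FALSE per row, so it slots straight into filter();
# "brown, grey" matches too because the pattern only needs to appear somewhere
brown_hair <- starwars_data %>%
  filter(str_detect(hair_color, "brown"))

brown_hair
```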
## # A tibble: 19 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Or… 150 49 brown light brown 19 fema… femin…
## 2 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 3 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 4 Chewbac… 228 112 brown unknown blue 200 male mascu…
## 5 Han Solo 180 80 brown fair brown 29 male mascu…
## 6 Wedge A… 170 77 brown fair hazel 21 male mascu…
## 7 Jek Ton… 180 110 brown fair blue NA male mascu…
## 8 Arvel C… NA NA brown fair brown NA male mascu…
## 9 Wicket … 88 20 brown brown brown 8 male mascu…
## 10 Qui-Gon… 193 89 brown fair blue 92 male mascu…
## 11 Ric Olié 183 NA brown fair blue NA <NA> <NA>
## 12 Cordé 157 NA brown light brown NA fema… femin…
## 13 Cliegg … 183 NA brown fair blue 82 male mascu…
## 14 Dormé 165 NA brown light brown NA fema… femin…
## 15 Tarfful 234 136 brown brown blue NA male mascu…
## 16 Raymus … 188 79 brown light brown NA male mascu…
## 17 Rey NA NA brown light hazel NA fema… femin…
## 18 Poe Dam… NA NA brown light brown NA male mascu…
## 19 Padmé A… 165 45 brown light brown 46 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
We can also use regular expressions. More information can be found in the stringr cheat sheet linked at the top of the page, but here’s a quick example. This will look at the character’s name in the row and create a new column with TRUE if there are any digits in the name and FALSE if there are no digits in the name.
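A sketch of the regular-expression version, using the `[:digit:]` class that also appears in the str_count() example (the object name `name_digits` is hypothetical):

```r
library(tidyverse)

# Assumption: starwars_data is a copy of dplyr's built-in starwars tibble
starwars_data <- starwars

# TRUE when the name contains at least one digit, FALSE otherwise
name_digits <- starwars_data %>%
  mutate(name_digit = str_detect(name, "[:digit:]")) %>%
  select(name, name_digit)

name_digits
```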
## # A tibble: 87 × 2
## name name_digit
## <chr> <lgl>
## 1 Luke Skywalker FALSE
## 2 C-3PO TRUE
## 3 R2-D2 TRUE
## 4 Darth Vader FALSE
## 5 Leia Organa FALSE
## 6 Owen Lars FALSE
## 7 Beru Whitesun lars FALSE
## 8 R5-D4 TRUE
## 9 Biggs Darklighter FALSE
## 10 Obi-Wan Kenobi FALSE
## # ℹ 77 more rows
str_count() returns a count of the matches to the specified pattern.
For example, how many digits are in each character’s name?
starwars_data %>%
mutate(name_digit_no = str_count(name, "[:digit:]")) %>%
select(name, name_digit_no)

## # A tibble: 87 × 2
## name name_digit_no
## <chr> <int>
## 1 Luke Skywalker 0
## 2 C-3PO 1
## 3 R2-D2 2
## 4 Darth Vader 0
## 5 Leia Organa 0
## 6 Owen Lars 0
## 7 Beru Whitesun lars 0
## 8 R5-D4 2
## 9 Biggs Darklighter 0
## 10 Obi-Wan Kenobi 0
## # ℹ 77 more rows
str_extract() extracts the first part of a string that matches the specified pattern.
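The first call is not shown. A sketch that pulls the episode number back out of the united `movie` column, run on a two-row stand-in for `starwars_data_unite2` (values copied from the printed output):

```r
library(tidyverse)

# Small stand-in for starwars_data_unite2, using rows from the printed output
starwars_data_unite2 <- tibble(
  name  = c("Luke Skywalker", "C-3PO"),
  movie = c("5: The Empire Strikes Back", "2: Attack of the Clones")
)

# str_extract() returns the first match as a string, or NA if nothing matches
with_episode <- starwars_data_unite2 %>%
  mutate(episode_no = str_extract(movie, "[:digit:]"))

with_episode
```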
## # A tibble: 173 × 3
## name movie episode_no
## <chr> <chr> <chr>
## 1 Luke Skywalker 5: The Empire Strikes Back 5
## 2 Luke Skywalker 3: Revenge of the Sith 3
## 3 Luke Skywalker 6: Return of the Jedi 6
## 4 Luke Skywalker 4: A New Hope 4
## 5 Luke Skywalker 7: The Force Awakens 7
## 6 C-3PO 5: The Empire Strikes Back 5
## 7 C-3PO 2: Attack of the Clones 2
## 8 C-3PO 1: The Phantom Menace 1
## 9 C-3PO 3: Revenge of the Sith 3
## 10 C-3PO 6: Return of the Jedi 6
## # ℹ 163 more rows
starwars_data %>%
mutate(green_skin = str_extract(skin_color, "green")) %>%
filter(green_skin == "green") %>%
select(name, skin_color, green_skin)

## # A tibble: 11 × 3
## name skin_color green_skin
## <chr> <chr> <chr>
## 1 Greedo green green
## 2 Jabba Desilijic Tiure green-tan, brown green
## 3 Yoda green green
## 4 Bossk green green
## 5 Nute Gunray mottled green green
## 6 Rugor Nass green green
## 7 Ben Quadinaros grey, green, yellow green
## 8 Kit Fisto green green
## 9 Poggle the Lesser green green
## 10 Zam Wesell fair, green, yellow green
## 11 Wat Tambor green, grey green
str_replace() replaces the first match of a pattern with another specified string.
Let’s say we don’t want a comma and space between the values in the hair color column and instead want “/”.
starwars_data %>%
mutate(hair_color2 = str_replace(hair_color, ", ", "/")) %>%
select(name, hair_color, hair_color2)

## # A tibble: 87 × 3
## name hair_color hair_color2
## <chr> <chr> <chr>
## 1 Luke Skywalker blond blond
## 2 C-3PO <NA> <NA>
## 3 R2-D2 <NA> <NA>
## 4 Darth Vader none none
## 5 Leia Organa brown brown
## 6 Owen Lars brown, grey brown/grey
## 7 Beru Whitesun lars brown brown
## 8 R5-D4 <NA> <NA>
## 9 Biggs Darklighter black black
## 10 Obi-Wan Kenobi auburn, white auburn/white
## # ℹ 77 more rows
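One detail worth knowing: str_replace() only changes the first match in each string, while str_replace_all() changes every match. The hair colors shown above have at most one comma, so the difference does not show up there. A minimal sketch with a made-up vector `colors` that has more than one separator:

```r
library(stringr)

# Hypothetical values; the second one contains two separators
colors <- c("brown, grey", "grey, green, yellow", "blond")

# Only the first ", " in each string is replaced
first_only <- str_replace(colors, ", ", "/")
first_only

# Every ", " in each string is replaced
all_matches <- str_replace_all(colors, ", ", "/")
all_matches
```

For a column such as skin_color, where some values list three colors, str_replace_all() is probably the safer choice.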
The package purrr has functions that can replace for loops with succinct code that is easier to read.
map() applies a function to each element of a list or vector; for a data frame, that means each column.
We can start out easy, using map() to return the number of distinct values in each of our columns.
map() returns a list
## $name
## [1] 87
##
## $height
## [1] 46
##
## $mass
## [1] 39
##
## $hair_color
## [1] 13
##
## $skin_color
## [1] 31
##
## $eye_color
## [1] 15
##
## $birth_year
## [1] 37
##
## $sex
## [1] 5
##
## $gender
## [1] 3
##
## $homeworld
## [1] 49
##
## $species
## [1] 38
##
## $films
## [1] 24
##
## $vehicles
## [1] 11
##
## $starships
## [1] 17
map_dbl() returns a numeric (double) vector
## name height mass hair_color skin_color eye_color birth_year
## 87 46 39 13 31 15 37
## sex gender homeworld species films vehicles starships
## 5 3 49 38 24 11 17
map_df() returns a data frame
## # A tibble: 1 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 87 46 39 13 31 15 37 5 3
## # ℹ 5 more variables: homeworld <int>, species <int>, films <int>,
## # vehicles <int>, starships <int>
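To make the comparison with a for loop concrete, here is a minimal sketch on a small made-up data frame (`toy` and the result names are illustrative, not part of the original analysis). It assumes n_distinct() from dplyr as the function being mapped, and shows how the suffix of the map function controls the type of the result:

```r
library(dplyr)
library(purrr)

# Small hypothetical data frame, used only for illustration
toy <- tibble(
  species = c("Human", "Droid", "Human", "Wookiee"),
  height  = c(172, 96, 180, 228)
)

# for loop version: pre-allocate a result, then fill it column by column
n_unique <- numeric(ncol(toy))
names(n_unique) <- names(toy)
for (col in names(toy)) {
  n_unique[[col]] <- n_distinct(toy[[col]])
}

# purrr versions: one line each, differing only in the return type
as_list <- map(toy, n_distinct)      # list with one element per column
as_int  <- map_int(toy, n_distinct)  # named integer vector
as_dbl  <- map_dbl(toy, n_distinct)  # named double vector

as_int
```

The typed variants also fail loudly if the function returns something of the wrong type or length, which a hand-written loop would not check.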
We can do more complicated iterative processes with map, such as fitting models to groups of the data.
Let’s do a quick example for how you could fit a model for mass as a function of height for each species.
We start by selecting the columns we need and filtering for complete cases (no NAs in any column).
starwars_data_map <- starwars_data %>%
  select(name, species, height, mass) %>%
  filter(complete.cases(.))
starwars_data_map
## # A tibble: 58 × 4
## name species height mass
## <chr> <chr> <int> <dbl>
## 1 Luke Skywalker Human 172 77
## 2 C-3PO Droid 167 75
## 3 R2-D2 Droid 96 32
## 4 Darth Vader Human 202 136
## 5 Leia Organa Human 150 49
## 6 Owen Lars Human 178 120
## 7 Beru Whitesun lars Human 165 75
## 8 R5-D4 Droid 97 32
## 9 Biggs Darklighter Human 183 84
## 10 Obi-Wan Kenobi Human 182 77
## # ℹ 48 more rows
We don’t have much data, and the sample sizes are small: only 2 of the species have more than 3 individuals. That is still enough to demonstrate the approach. We group by species, keep only the species with more than 3 individuals, and then nest the data so that each species has its own data frame in a list column.
starwars_data_map2 <- starwars_data_map %>%
  group_by(species) %>%
  filter(n() > 3) %>%
  nest()
starwars_data_map2
## # A tibble: 2 × 2
## # Groups: species [2]
## species data
## <chr> <list>
## 1 Human <tibble [22 × 3]>
## 2 Droid <tibble [4 × 3]>
We can then use mutate() to create a new list column by mapping lm() over each group’s nested data. The next line of code maps the function tidy() from the broom package over the fitted models, which extracts the coefficient table from each model. Finally, we ungroup the data.
starwars_data_map3 <- starwars_data_map2 %>%
  mutate(lm_obj = map(data, ~lm(mass ~ height, data = .))) %>%
  mutate(lm_tidy = map(lm_obj, broom::tidy)) %>%
  ungroup()
starwars_data_map3
## # A tibble: 2 × 4
## species data lm_obj lm_tidy
## <chr> <list> <list> <list>
## 1 Human <tibble [22 × 3]> <lm> <tibble [2 × 5]>
## 2 Droid <tibble [4 × 3]> <lm> <tibble [2 × 5]>
To get the data to a dataframe format, we can select the fields we want and then unnest them!
starwars_data_map4 <- starwars_data_map3 %>%
  select(species, lm_tidy) %>%
  unnest(cols = c(lm_tidy))
starwars_data_map4
## # A tibble: 4 × 6
## species term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Human (Intercept) -117. 52.0 -2.24 0.0364
## 2 Human height 1.11 0.289 3.84 0.00102
## 3 Droid (Intercept) -62.1 28.7 -2.16 0.163
## 4 Droid height 0.942 0.195 4.83 0.0403
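When only one number per model is needed, such as the slope for height, map_dbl() can pull it out directly instead of going through tidy() and unnest(). A minimal sketch of that variation, assuming the `starwars` data that ships with dplyr and using drop_na() from tidyr as a stand-in for the complete-cases filter; the name `slopes` is illustrative:

```r
library(dplyr)
library(tidyr)
library(purrr)

slopes <- starwars %>%
  select(species, height, mass) %>%
  drop_na() %>%                 # keep rows with no missing values
  group_by(species) %>%
  filter(n() > 3) %>%           # species with more than 3 individuals
  nest() %>%
  # fit one model per species and keep only the height coefficient
  mutate(slope = map_dbl(
    data,
    ~ coef(lm(mass ~ height, data = .x))[["height"]]
  )) %>%
  ungroup() %>%
  select(species, slope)

slopes
```

This trades the full coefficient table for a single numeric column, so it suits quick comparisons more than reporting.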
Forcats is a package that makes working with factors easier by providing functions for reordering and grouping factor levels.
fct_lump() collapses the least or most frequent values of a factor into “Other”.
Here, we can lump the species that have few individuals into “Other”.
starwars %>%
  filter(!is.na(species)) %>%
  mutate(species = fct_lump(species, n = 3)) %>%
  count(species)
## # A tibble: 4 × 2
## species n
## <fct> <int>
## 1 Droid 6
## 2 Gungan 3
## 3 Human 35
## 4 Other 39
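Recent versions of forcats also provide more specific variants of fct_lump(), such as fct_lump_n(), fct_lump_min(), and fct_lump_prop(), which make the lumping rule explicit. A minimal sketch with fct_lump_n() on a small made-up factor (`species` and `lumped` are illustrative names):

```r
library(forcats)

# Hypothetical species values, for illustration only
species <- factor(c("Human", "Human", "Human",
                    "Droid", "Droid",
                    "Gungan", "Wookiee"))

# Keep the 2 most frequent levels and lump everything else into "Other"
lumped <- fct_lump_n(species, n = 2)

levels(lumped)
table(lumped)
```

The `other_level` argument can be used to change the label if “Other” is not suitable.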