1 Overview

1.1 Purpose

The purpose of this document is to provide an introduction to using the Tidyverse in R. We will review the “core” packages within the Tidyverse and give examples of each.

1.2 Load packages

package_names <- c("tidyverse", "here", "janitor", "vtable")

for (x in package_names) {
  if (!x %in% rownames(installed.packages())) install.packages(x)
}

library(tidyverse)
library(here) # helps define file paths relative to the project root
library(janitor) # helps clean data, including column names
library(vtable) # helps summarize data

1.3 Load data

We’ll be using the Star Wars data (starwars), which ships with dplyr and is available once the tidyverse is loaded.

starwars_data <- starwars
starwars_data

## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

vtable(starwars_data)

starwars_data
Name	Class	Values
name	character
height	integer	Num: 66 to 264
mass	numeric	Num: 15 to 1358
hair_color	character
skin_color	character
eye_color	character
birth_year	numeric	Num: 8 to 896
sex	character
gender	character
homeworld	character
species	character
films	list
vehicles	list
starships	list

1.4 Tidyverse Basics

The Tidyverse is a collection of R packages designed to share a consistent and intuitive grammar. There are several “core” packages we’ll discuss here:
- readr: read and write delimited files
- dplyr: manipulate data
- tidyr: tidy and clean data
- stringr: find, extract, and replace strings
- purrr: functional programming (applying functions across lists and vectors)
- forcats: work with factors

ggplot2, the data visualization package, is another core package in the tidyverse, but it is not discussed here.

1.4.1 The Pipe %>%

The pipe operator “%>%” strings together sequences of commands.

It takes whatever is on the left side of the operator and passes it as the first argument to the function on the right side. You can read it as “and then”.

Example: take starwars_data and then rename a column, saving the result as starwars_data2:

starwars_data2 <- starwars_data %>%
  rename(character_name = name)

starwars_data2

## # A tibble: 87 × 14
##    character_name  height  mass hair_color skin_color eye_color birth_year sex  
##    <chr>            <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Luke Skywalker     172    77 blond      fair       blue            19   male 
##  2 C-3PO              167    75 <NA>       gold       yellow         112   none 
##  3 R2-D2               96    32 <NA>       white, bl… red             33   none 
##  4 Darth Vader        202   136 none       white      yellow          41.9 male 
##  5 Leia Organa        150    49 brown      light      brown           19   fema…
##  6 Owen Lars          178   120 brown, gr… light      blue            52   male 
##  7 Beru Whitesun …    165    75 brown      light      blue            47   fema…
##  8 R5-D4               97    32 <NA>       white, red red             NA   none 
##  9 Biggs Darkligh…    183    84 black      light      brown           24   male 
## 10 Obi-Wan Kenobi     182    77 auburn, w… fair       blue-gray       57   male 
## # ℹ 77 more rows
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
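
To make the "first argument" behaviour concrete, here is a minimal sketch comparing the piped call with the equivalent nested call. The object names piped and nested are illustrative only and do not appear elsewhere in this document.

```r
library(dplyr)

starwars_data <- starwars

# Piped form: starwars_data becomes the first argument of rename()
piped <- starwars_data %>%
  rename(character_name = name)

# Equivalent form without the pipe
nested <- rename(starwars_data, character_name = name)

# Both approaches give the same tibble
identical(piped, nested)

# Pipes read left to right, which helps once several steps are chained
starwars_data %>%
  rename(character_name = name) %>%
  select(character_name, species) %>%
  head(3)
```

The nested form works just as well for a single step; the pipe mainly pays off when several steps are chained, because the code reads in the order the steps run.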

2 Resources

Tidyverse website: https://www.tidyverse.org/

Learn the Tidyverse with the book R for Data Science: https://r4ds.hadley.nz/

See the different packages here: https://www.tidyverse.org/packages/

Readr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf

Dplyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf

Tidyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf

Stringr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf

Purrr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf

Forcats cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/main/factors.pdf

3 Core Packages

3.1 Readr

Readr reads rectangular data from delimited files such as CSVs.

As an example, we’ll write starwars_data to a CSV file and read it back into R. We’ll use the package “here”, which makes it easier to define file paths.

Since some of our columns are lists, we can’t export this to CSV without losing that data, so this only illustrates how to use these functions.

starwars_data %>%
  write_csv(here("025 - Statistics",
                 "Summer 2024 Stats Workshop",
                 "Tidyverse",
                 "starwars.csv"))

starwars_data_example <- read_csv(here("025 - Statistics",
                 "Summer 2024 Stats Workshop",
                 "Tidyverse",
                 "starwars.csv"))

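One way to sidestep the list-column issue is to drop those columns before writing. The sketch below assumes that losing films, vehicles, and starships is acceptable, and it writes to a temporary file (via tempfile()) rather than the workshop folder so that it runs anywhere. The names starwars_flat, csv_path, and starwars_roundtrip are illustrative.

```r
library(dplyr)
library(readr)

starwars_data <- starwars

# Keep only the flat (non-list) columns, which a CSV can store
starwars_flat <- starwars_data %>%
  select(!where(is.list))

# Assumption: a temporary file stands in for the real project path
csv_path <- tempfile(fileext = ".csv")

write_csv(starwars_flat, csv_path)

# show_col_types = FALSE silences the column specification message
starwars_roundtrip <- read_csv(csv_path, show_col_types = FALSE)

dim(starwars_roundtrip)
names(starwars_roundtrip)
```

Column types are guessed again on import, so a column such as height may come back as a double rather than an integer.
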
3.2 Dplyr

The package dplyr has functions to facilitate data manipulation using consistent verbs.

3.2.1 select()

Select the columns we want to keep in our data by naming the columns, giving a range of columns, or describing them. select() can also be used to rearrange the order of the columns.

For example, we can name the specific columns or range of columns that we want in the dataframe:

starwars_data %>%
  select(name, hair_color)

## # A tibble: 87 × 2
##    name               hair_color   
##    <chr>              <chr>        
##  1 Luke Skywalker     blond        
##  2 C-3PO              <NA>         
##  3 R2-D2              <NA>         
##  4 Darth Vader        none         
##  5 Leia Organa        brown        
##  6 Owen Lars          brown, grey  
##  7 Beru Whitesun lars brown        
##  8 R5-D4              <NA>         
##  9 Biggs Darklighter  black        
## 10 Obi-Wan Kenobi     auburn, white
## # ℹ 77 more rows

starwars_data %>%
  select(1:3)

## # A tibble: 87 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # ℹ 77 more rows
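
A range can also be written with column names instead of positions, which tends to be more robust if the column order changes. This is a small sketch; by_name and by_position are illustrative names.

```r
library(dplyr)

starwars_data <- starwars

# Range given by column names: every column from name through mass
by_name <- starwars_data %>%
  select(name:mass)

# Same range given by position
by_position <- starwars_data %>%
  select(1:3)

identical(by_name, by_position)

# select() can also rename a column while selecting it
starwars_data %>%
  select(character_name = name, height)
```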

We can select which columns we want by describing the columns:

starwars_data %>% 
  select(name, ends_with("color"))

## # A tibble: 87 × 4
##    name               hair_color    skin_color  eye_color
##    <chr>              <chr>         <chr>       <chr>    
##  1 Luke Skywalker     blond         fair        blue     
##  2 C-3PO              <NA>          gold        yellow   
##  3 R2-D2              <NA>          white, blue red      
##  4 Darth Vader        none          white       yellow   
##  5 Leia Organa        brown         light       brown    
##  6 Owen Lars          brown, grey   light       blue     
##  7 Beru Whitesun lars brown         light       blue     
##  8 R5-D4              <NA>          white, red  red      
##  9 Biggs Darklighter  black         light       brown    
## 10 Obi-Wan Kenobi     auburn, white fair        blue-gray
## # ℹ 77 more rows

starwars_data %>% 
  select(name, where(is.numeric))

## # A tibble: 87 × 4
##    name               height  mass birth_year
##    <chr>               <int> <dbl>      <dbl>
##  1 Luke Skywalker        172    77       19  
##  2 C-3PO                 167    75      112  
##  3 R2-D2                  96    32       33  
##  4 Darth Vader           202   136       41.9
##  5 Leia Organa           150    49       19  
##  6 Owen Lars             178   120       52  
##  7 Beru Whitesun lars    165    75       47  
##  8 R5-D4                  97    32       NA  
##  9 Biggs Darklighter     183    84       24  
## 10 Obi-Wan Kenobi        182    77       57  
## # ℹ 77 more rows

We can also select the columns we don’t want using the same methods:

starwars_data %>%
  select(!c(films, vehicles))

## # A tibble: 87 × 12
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 3 more variables: homeworld <chr>, species <chr>, starships <list>

starwars_data %>%
  select(!where(is.character))

## # A tibble: 87 × 6
##    height  mass birth_year films     vehicles  starships
##     <int> <dbl>      <dbl> <list>    <list>    <list>   
##  1    172    77       19   <chr [5]> <chr [2]> <chr [2]>
##  2    167    75      112   <chr [6]> <chr [0]> <chr [0]>
##  3     96    32       33   <chr [7]> <chr [0]> <chr [0]>
##  4    202   136       41.9 <chr [4]> <chr [0]> <chr [1]>
##  5    150    49       19   <chr [5]> <chr [1]> <chr [0]>
##  6    178   120       52   <chr [3]> <chr [0]> <chr [0]>
##  7    165    75       47   <chr [3]> <chr [0]> <chr [0]>
##  8     97    32       NA   <chr [1]> <chr [0]> <chr [0]>
##  9    183    84       24   <chr [1]> <chr [0]> <chr [1]>
## 10    182    77       57   <chr [6]> <chr [1]> <chr [5]>
## # ℹ 77 more rows

And we can rearrange the order of the columns:

starwars_data %>%
  select(name, homeworld, species, everything())

## # A tibble: 87 × 14
##    name           homeworld species height  mass hair_color skin_color eye_color
##    <chr>          <chr>     <chr>    <int> <dbl> <chr>      <chr>      <chr>    
##  1 Luke Skywalker Tatooine  Human      172    77 blond      fair       blue     
##  2 C-3PO          Tatooine  Droid      167    75 <NA>       gold       yellow   
##  3 R2-D2          Naboo     Droid       96    32 <NA>       white, bl… red      
##  4 Darth Vader    Tatooine  Human      202   136 none       white      yellow   
##  5 Leia Organa    Alderaan  Human      150    49 brown      light      brown    
##  6 Owen Lars      Tatooine  Human      178   120 brown, gr… light      blue     
##  7 Beru Whitesun… Tatooine  Human      165    75 brown      light      blue     
##  8 R5-D4          Tatooine  Droid       97    32 <NA>       white, red red      
##  9 Biggs Darklig… Tatooine  Human      183    84 black      light      brown    
## 10 Obi-Wan Kenobi Stewjon   Human      182    77 auburn, w… fair       blue-gray
## # ℹ 77 more rows
## # ℹ 6 more variables: birth_year <dbl>, sex <chr>, gender <chr>, films <list>,
## #   vehicles <list>, starships <list>

3.2.2 filter()

filter() keeps only the rows that meet one or more conditions:

starwars_data %>% 
  filter(species == "Droid")

## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

starwars_data %>% 
  filter(grepl("Skywalker", name))

## # A tibble: 3 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 Anakin S…    188    84 blond      fair       blue            41.9 male  mascu…
## 3 Shmi Sky…    163    NA black      fair       brown           72   fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

starwars_data %>%
  filter(species == "Human" & homeworld == "Tatooine")

## # A tibble: 8 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 3 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
## 4 Beru Whi…    165    75 brown      light      blue            47   fema… femin…
## 5 Biggs Da…    183    84 black      light      brown           24   male  mascu…
## 6 Anakin S…    188    84 blond      fair       blue            41.9 male  mascu…
## 7 Shmi Sky…    163    NA black      fair       brown           72   fema… femin…
## 8 Cliegg L…    183    NA brown      fair       blue            82   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
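
A few more filter() patterns, sketched under the assumption that stringr is loaded (it is part of the core tidyverse). The object names here are illustrative.

```r
library(dplyr)
library(stringr)

starwars_data <- starwars

# %in% matches any of several values
droids_and_wookiees <- starwars_data %>%
  filter(species %in% c("Droid", "Wookiee"))

# str_detect() is the stringr counterpart of grepl()
skywalkers <- starwars_data %>%
  filter(str_detect(name, "Skywalker"))

# Rows where the condition evaluates to NA are dropped,
# so characters with a missing mass never appear here
heavy <- starwars_data %>%
  filter(mass > 100)

# Use is.na() to ask for the missing values explicitly
missing_mass <- starwars_data %>%
  filter(is.na(mass))
```

The NA behaviour is worth keeping in mind: a filter on a column with missing values silently removes those rows unless they are requested with is.na().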

3.2.3 mutate()

Mutate creates new columns based on the given commands or calculations.

For example, we can create a new column, “bmi”, from the mass (in kilograms) and height (in centimeters) columns:

starwars_data %>% 
  mutate(bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)

## # A tibble: 87 × 4
##    name               height  mass   bmi
##    <chr>               <int> <dbl> <dbl>
##  1 Luke Skywalker        172    77  26.0
##  2 C-3PO                 167    75  26.9
##  3 R2-D2                  96    32  34.7
##  4 Darth Vader           202   136  33.3
##  5 Leia Organa           150    49  21.8
##  6 Owen Lars             178   120  37.9
##  7 Beru Whitesun lars    165    75  27.5
##  8 R5-D4                  97    32  34.0
##  9 Biggs Darklighter     183    84  25.1
## 10 Obi-Wan Kenobi        182    77  23.2
## # ℹ 77 more rows

Or we could create categories of heights:

starwars_data %>%
  mutate(height_bin = case_when(height < 150 ~ "Less than 150",
                                height >= 150 & height < 175 ~ "150-174",
                                height >= 175 & height < 200 ~ "175-199",
                                height >= 200 ~ "200+")) %>%
  select(name, height, height_bin)

## # A tibble: 87 × 3
##    name               height height_bin   
##    <chr>               <int> <chr>        
##  1 Luke Skywalker        172 150-174      
##  2 C-3PO                 167 150-174      
##  3 R2-D2                  96 Less than 150
##  4 Darth Vader           202 200+         
##  5 Leia Organa           150 150-174      
##  6 Owen Lars             178 175-199      
##  7 Beru Whitesun lars    165 150-174      
##  8 R5-D4                  97 Less than 150
##  9 Biggs Darklighter     183 175-199      
## 10 Obi-Wan Kenobi        182 175-199      
## # ℹ 77 more rows
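
In the example above, characters with a missing height match none of the conditions, so case_when() gives them NA. The sketch below is one possible variation that labels them explicitly; the "Unknown" label and the name height_bins are assumptions made for illustration. Because case_when() uses the first condition that matches, the lower bounds can also be left out.

```r
library(dplyr)

starwars_data <- starwars

height_bins <- starwars_data %>%
  mutate(height_bin = case_when(
    is.na(height) ~ "Unknown",        # handle missing heights first
    height < 150  ~ "Less than 150",
    height < 175  ~ "150-174",        # only reached when height >= 150
    height < 200  ~ "175-199",
    TRUE          ~ "200+"            # everything left over
  )) %>%
  select(name, height, height_bin)

# Tally how many characters fall in each bin
height_bins %>%
  count(height_bin)
```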

3.2.4 summarise() and arrange()

summarise() returns a summarized dataframe based on the groups defined and the functions used to summarise.

For example, let’s find out how many characters there are from each species.

We can first use the group_by() function to group by species.
Then, we can use summarise(). This creates a new column (similar to mutate()) from the expressions given as arguments (in this case, it creates the column “count” using the function n()).
While mutate() keeps all rows, summarise() aggregates to the group level and returns one row per group (in this case, per species).
The arrange() function with desc() sorts the result so that the most common species appear first.

starwars_data %>%
  group_by(species) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

## # A tibble: 38 × 2
##    species  count
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # ℹ 28 more rows
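
summarise() can compute several summaries at once. The sketch below adds a mean height per species alongside the count; mean_height and species_summary are illustrative names, and na.rm = TRUE is used because some heights are missing.

```r
library(dplyr)

starwars_data <- starwars

species_summary <- starwars_data %>%
  group_by(species) %>%
  summarise(
    count = n(),                               # rows per species
    mean_height = mean(height, na.rm = TRUE),  # ignore missing heights
    .groups = "drop"                           # return an ungrouped tibble
  ) %>%
  arrange(desc(count))

species_summary

# count() is a shortcut for group_by() + summarise(n = n())
starwars_data %>%
  count(species, sort = TRUE)
```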

3.2.5 joins

We can join data in a few different ways:

An inner_join() keeps only observations that appear in both x and y.

A left_join() keeps all observations in x.

A right_join() keeps all observations in y.

A full_join() keeps all observations in x and y.

Here are some example data that we can join to our starwars_data. “droids” shows the number of films each droid appeared in. “droids_example” is the same as “droids”, but has an additional “fake” droid added for example purposes (and, as the output shows, stores no_films as character rather than integer).

droids

## # A tibble: 6 × 2
##   name   no_films
##   <chr>     <int>
## 1 BB8           1
## 2 C-3PO         6
## 3 IG-88         1
## 4 R2-D2         7
## 5 R4-P17        2
## 6 R5-D4         1

droids_example

## # A tibble: 7 × 2
##   name               no_films
##   <chr>              <chr>   
## 1 BB8                1       
## 2 C-3PO              6       
## 3 IG-88              1       
## 4 R2-D2              7       
## 5 R4-P17             2       
## 6 R5-D4              1       
## 7 Fake Droid Example 0
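
The code that built these two tables is not shown. The sketch below is one plausible way to construct them from starwars_data; it is a guess rather than the original code. It assumes no_films is the length of each droid's films list and that the fake row was appended with add_row().

```r
library(dplyr)
library(tibble)

starwars_data <- starwars

# Assumed construction: count the entries in each droid's films list
droids <- starwars_data %>%
  filter(species == "Droid") %>%
  mutate(no_films = lengths(films)) %>%
  select(name, no_films) %>%
  arrange(name)

# Assumed construction: convert to character and append the fake row
droids_example <- droids %>%
  mutate(no_films = as.character(no_films)) %>%
  add_row(name = "Fake Droid Example", no_films = "0")

droids
droids_example
```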

Let’s start with an inner_join() between starwars_data (x) and droids_example (y). As you can see, it keeps only rows whose name appears in both dataframes, so the fake droid is dropped.

innerjoin <- inner_join(starwars_data, droids_example, by = "name")

innerjoin

## # A tibble: 6 × 15
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>, no_films <chr>

Let’s do a left_join() next with starwars_data (x) and droids_example (y). Since starwars_data is on the left (x), we keep all 87 rows, and the new column “no_films” from droids_example (y) is filled in for the droids and is NA for everyone else. The fake droid in droids_example is dropped because it has no match in starwars_data.

leftjoin <- left_join(starwars_data, droids_example, by = "name") %>%
  select(name, no_films, everything())

leftjoin

## # A tibble: 87 × 15
##    name   no_films height  mass hair_color skin_color eye_color birth_year sex  
##    <chr>  <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Luke … <NA>        172    77 blond      fair       blue            19   male 
##  2 C-3PO  6           167    75 <NA>       gold       yellow         112   none 
##  3 R2-D2  7            96    32 <NA>       white, bl… red             33   none 
##  4 Darth… <NA>        202   136 none       white      yellow          41.9 male 
##  5 Leia … <NA>        150    49 brown      light      brown           19   fema…
##  6 Owen … <NA>        178   120 brown, gr… light      blue            52   male 
##  7 Beru … <NA>        165    75 brown      light      blue            47   fema…
##  8 R5-D4  1            97    32 <NA>       white, red red             NA   none 
##  9 Biggs… <NA>        183    84 black      light      brown           24   male 
## 10 Obi-W… <NA>        182    77 auburn, w… fair       blue-gray       57   male 
## # ℹ 77 more rows
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Next, a right_join(). This keeps all observations in droids_example, including the fake droid. The columns from starwars_data get added for the droids that match; the fake droid has no match, so its starwars_data columns are NA.

rightjoin <- right_join(starwars_data, droids_example, by = "name") %>%
  select(name, no_films, everything())

rightjoin

## # A tibble: 7 × 15
##   name    no_films height  mass hair_color skin_color eye_color birth_year sex  
##   <chr>   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
## 1 C-3PO   6           167    75 <NA>       gold       yellow           112 none 
## 2 R2-D2   7            96    32 <NA>       white, bl… red               33 none 
## 3 R5-D4   1            97    32 <NA>       white, red red               NA none 
## 4 IG-88   1           200   140 none       metal      red               15 none 
## 5 R4-P17  2            96    NA none       silver, r… red, blue         NA none 
## 6 BB8     1            NA    NA none       none       black             NA none 
## 7 Fake D… 0            NA    NA <NA>       <NA>       <NA>              NA <NA> 
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Last, we’ll do a full_join(), where all observations present in either dataframe will be joined.

This includes the fake droid data along with the observations for all non-droids.

fulljoin <- full_join(starwars_data, droids_example, by = "name") %>%
  select(name, no_films, everything()) %>%
  arrange(no_films)
  

fulljoin

## # A tibble: 88 × 15
##    name   no_films height  mass hair_color skin_color eye_color birth_year sex  
##    <chr>  <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Fake … 0            NA    NA <NA>       <NA>       <NA>            NA   <NA> 
##  2 R5-D4  1            97    32 <NA>       white, red red             NA   none 
##  3 IG-88  1           200   140 none       metal      red             15   none 
##  4 BB8    1            NA    NA none       none       black           NA   none 
##  5 R4-P17 2            96    NA none       silver, r… red, blue       NA   none 
##  6 C-3PO  6           167    75 <NA>       gold       yellow         112   none 
##  7 R2-D2  7            96    32 <NA>       white, bl… red             33   none 
##  8 Luke … <NA>        172    77 blond      fair       blue            19   male 
##  9 Darth… <NA>        202   136 none       white      yellow          41.9 male 
## 10 Leia … <NA>        150    49 brown      light      brown           19   fema…
## # ℹ 78 more rows
## # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
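
When checking a join, it can help to see which rows fail to match. dplyr's filtering joins, anti_join() and semi_join(), do this without adding any columns. In the sketch below, droids_example is rebuilt by hand from the values printed earlier so that the block runs on its own; unmatched and matched are illustrative names.

```r
library(dplyr)
library(tibble)

starwars_data <- starwars

# Stand-in for droids_example, typed in from the table shown earlier
droids_example <- tibble(
  name = c("BB8", "C-3PO", "IG-88", "R2-D2", "R4-P17", "R5-D4",
           "Fake Droid Example"),
  no_films = c("1", "6", "1", "7", "2", "1", "0")
)

# anti_join(): rows of x with NO match in y
unmatched <- anti_join(droids_example, starwars_data, by = "name")
unmatched

# semi_join(): rows of x that DO have a match in y (columns of x only)
matched <- semi_join(starwars_data, droids_example, by = "name")
matched
```

Here the only unmatched row is the fake droid, which is the same row that inner_join() and left_join() dropped.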

3.3 Tidyr

The package tidyr has functions that help achieve “tidy” data:
- Each variable is a column; each column is a variable
- Each observation is a row; each row is an observation
- Each value is a cell; each cell is a single value

3.3.1 nest() and unnest()

We can nest or unnest data using these functions.

For example, we could nest all the data columns by a group:

starwars_data_nest <- starwars_data %>%
  group_by(gender) %>%
  nest()

starwars_data_nest

## # A tibble: 3 × 2
## # Groups:   gender [3]
##   gender    data              
##   <chr>     <list>            
## 1 masculine <tibble [66 × 13]>
## 2 feminine  <tibble [17 × 13]>
## 3 <NA>      <tibble [4 × 13]>

And then we could use pluck() from purrr to pull out the “data” column, which returns the tibble for each group as a list.

starwars_data_nest %>%
  pluck("data")

## [[1]]
## # A tibble: 66 × 13
##    name  height  mass hair_color skin_color eye_color birth_year sex   homeworld
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
##  1 Luke…    172    77 blond      fair       blue            19   male  Tatooine 
##  2 C-3PO    167    75 <NA>       gold       yellow         112   none  Tatooine 
##  3 R2-D2     96    32 <NA>       white, bl… red             33   none  Naboo    
##  4 Dart…    202   136 none       white      yellow          41.9 male  Tatooine 
##  5 Owen…    178   120 brown, gr… light      blue            52   male  Tatooine 
##  6 R5-D4     97    32 <NA>       white, red red             NA   none  Tatooine 
##  7 Bigg…    183    84 black      light      brown           24   male  Tatooine 
##  8 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  Stewjon  
##  9 Anak…    188    84 blond      fair       blue            41.9 male  Tatooine 
## 10 Wilh…    180    NA auburn, g… fair       blue            64   male  Eriadu   
## # ℹ 56 more rows
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>
## 
## [[2]]
## # A tibble: 17 × 13
##    name  height  mass hair_color skin_color eye_color birth_year sex   homeworld
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
##  1 Leia…    150  49   brown      light      brown             19 fema… Alderaan 
##  2 Beru…    165  75   brown      light      blue              47 fema… Tatooine 
##  3 Mon …    150  NA   auburn     fair       blue              48 fema… Chandrila
##  4 Shmi…    163  NA   black      fair       brown             72 fema… Tatooine 
##  5 Ayla…    178  55   none       blue       hazel             48 fema… Ryloth   
##  6 Adi …    184  50   none       dark       blue              NA fema… Coruscant
##  7 Cordé    157  NA   brown      light      brown             NA fema… Naboo    
##  8 Lumi…    170  56.2 black      yellow     blue              58 fema… Mirial   
##  9 Barr…    166  50   black      yellow     blue              40 fema… Mirial   
## 10 Dormé    165  NA   brown      light      brown             NA fema… Naboo    
## 11 Zam …    168  55   blonde     fair, gre… yellow            NA fema… Zolan    
## 12 Taun…    213  NA   none       grey       black             NA fema… Kamino   
## 13 Joca…    167  NA   white      fair       blue              NA fema… Coruscant
## 14 R4-P…     96  NA   none       silver, r… red, blue         NA none  <NA>     
## 15 Shaa…    178  57   none       red, blue… black             NA fema… Shili    
## 16 Rey       NA  NA   brown      light      hazel             NA fema… <NA>     
## 17 Padm…    165  45   brown      light      brown             46 fema… Naboo    
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>
## 
## [[3]]
## # A tibble: 4 × 13
##   name   height  mass hair_color skin_color eye_color birth_year sex   homeworld
##   <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
## 1 Ric O…    183    NA brown      fair       blue              NA <NA>  Naboo    
## 2 Quars…    183    NA black      dark       brown             62 <NA>  Naboo    
## 3 Sly M…    178    48 none       pale       white             NA <NA>  Umbara   
## 4 Capta…     NA    NA unknown    unknown    unknown           NA <NA>  <NA>     
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>
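
To reverse the nesting, unnest() expands the “data” column back into regular rows and columns. This is a minimal sketch; starwars_data_unnest is an illustrative name.

```r
library(dplyr)
library(tidyr)

starwars_data <- starwars

# Nest all columns except gender, as in the example above
starwars_data_nest <- starwars_data %>%
  group_by(gender) %>%
  nest()

# unnest() restores one row per character; ungroup() removes the grouping
starwars_data_unnest <- starwars_data_nest %>%
  unnest(data) %>%
  ungroup()

dim(starwars_data_unnest)
```

The result has the same rows and columns as starwars_data, although gender now comes first and the rows are ordered by group.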

Next, we will unnest some data. As a reminder, the Star Wars data has 3 columns stored as lists:

starwars_data %>%
  select(where(is.list))

## # A tibble: 87 × 3
##    films     vehicles  starships
##    <list>    <list>    <list>   
##  1 <chr [5]> <chr [2]> <chr [2]>
##  2 <chr [6]> <chr [0]> <chr [0]>
##  3 <chr [7]> <chr [0]> <chr [0]>
##  4 <chr [4]> <chr [0]> <chr [1]>
##  5 <chr [5]> <chr [1]> <chr [0]>
##  6 <chr [3]> <chr [0]> <chr [0]>
##  7 <chr [3]> <chr [0]> <chr [0]>
##  8 <chr [1]> <chr [0]> <chr [0]>
##  9 <chr [1]> <chr [0]> <chr [1]>
## 10 <chr [6]> <chr [1]> <chr [5]>
## # ℹ 77 more rows

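Let’s unnest the data in the films field (stored as a list).

We can do this using unnest_longer(), which creates one row per film for each character. Note that this reuses the name starwars_data2 from the pipe example, replacing that earlier object: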

starwars_data2 <- starwars_data %>%
  select(films, name) %>%
  unnest_longer(films) 

starwars_data2

## # A tibble: 173 × 2
##    films                   name          
##    <chr>                   <chr>         
##  1 The Empire Strikes Back Luke Skywalker
##  2 Revenge of the Sith     Luke Skywalker
##  3 Return of the Jedi      Luke Skywalker
##  4 A New Hope              Luke Skywalker
##  5 The Force Awakens       Luke Skywalker
##  6 The Empire Strikes Back C-3PO         
##  7 Attack of the Clones    C-3PO         
##  8 The Phantom Menace      C-3PO         
##  9 Revenge of the Sith     C-3PO         
## 10 Return of the Jedi      C-3PO         
## # ℹ 163 more rows

Or we can use unnest_wider(), which creates one column per film position instead:

starwars_data %>%
  select(name, films) %>%
  unnest_wider(films, names_sep = "_")

## # A tibble: 87 × 8
##    name               films_1    films_2 films_3 films_4 films_5 films_6 films_7
##    <chr>              <chr>      <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 Luke Skywalker     The Empir… Reveng… Return… A New … The Fo… <NA>    <NA>   
##  2 C-3PO              The Empir… Attack… The Ph… Reveng… Return… A New … <NA>   
##  3 R2-D2              The Empir… Attack… The Ph… Reveng… Return… A New … The Fo…
##  4 Darth Vader        The Empir… Reveng… Return… A New … <NA>    <NA>    <NA>   
##  5 Leia Organa        The Empir… Reveng… Return… A New … The Fo… <NA>    <NA>   
##  6 Owen Lars          Attack of… Reveng… A New … <NA>    <NA>    <NA>    <NA>   
##  7 Beru Whitesun lars Attack of… Reveng… A New … <NA>    <NA>    <NA>    <NA>   
##  8 R5-D4              A New Hope <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
##  9 Biggs Darklighter  A New Hope <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
## 10 Obi-Wan Kenobi     The Empir… Attack… The Ph… Reveng… Return… A New … <NA>   
## # ℹ 77 more rows

3.3.2 pivot_longer() & pivot_wider()

Let’s play around some more with transforming data. First, we’ll do something similar to what we did above but now using the pivot functions.

Let’s work with the newly unnested (using unnest_longer) field “films” to pivot our data to see which characters are in which films.

First, we’ll start by pivoting wider.

starwars_data_wide <- starwars_data2 %>%
  mutate(in_movie = 1) %>%
  pivot_wider(names_from = films,
              values_from = in_movie) 
starwars_data_wide

## # A tibble: 87 × 8
##    name        The Empire Strikes B…¹ `Revenge of the Sith` `Return of the Jedi`
##    <chr>                        <dbl>                 <dbl>                <dbl>
##  1 Luke Skywa…                      1                     1                    1
##  2 C-3PO                            1                     1                    1
##  3 R2-D2                            1                     1                    1
##  4 Darth Vader                      1                     1                    1
##  5 Leia Organa                      1                     1                    1
##  6 Owen Lars                       NA                     1                   NA
##  7 Beru White…                     NA                     1                   NA
##  8 R5-D4                           NA                    NA                   NA
##  9 Biggs Dark…                     NA                    NA                   NA
## 10 Obi-Wan Ke…                      1                     1                    1
## # ℹ 77 more rows
## # ℹ abbreviated name: ¹`The Empire Strikes Back`
## # ℹ 4 more variables: `A New Hope` <dbl>, `The Force Awakens` <dbl>,
## #   `Attack of the Clones` <dbl>, `The Phantom Menace` <dbl>

We can use the janitor package to clean up the new column names in one line of code.

starwars_data_wide2 <- janitor::clean_names(starwars_data_wide)
starwars_data_wide2

## # A tibble: 87 × 8
##    name            the_empire_strikes_b…¹ revenge_of_the_sith return_of_the_jedi
##    <chr>                            <dbl>               <dbl>              <dbl>
##  1 Luke Skywalker                       1                   1                  1
##  2 C-3PO                                1                   1                  1
##  3 R2-D2                                1                   1                  1
##  4 Darth Vader                          1                   1                  1
##  5 Leia Organa                          1                   1                  1
##  6 Owen Lars                           NA                   1                 NA
##  7 Beru Whitesun …                     NA                   1                 NA
##  8 R5-D4                               NA                  NA                 NA
##  9 Biggs Darkligh…                     NA                  NA                 NA
## 10 Obi-Wan Kenobi                       1                   1                  1
## # ℹ 77 more rows
## # ℹ abbreviated name: ¹the_empire_strikes_back
## # ℹ 4 more variables: a_new_hope <dbl>, the_force_awakens <dbl>,
## #   attack_of_the_clones <dbl>, the_phantom_menace <dbl>

We can also replace the NAs with 0s where the character is not in the movie, using mutate() together with across() and replace_na().

starwars_data_wide2.2 <- starwars_data_wide2 %>%
  mutate(across(the_empire_strikes_back:the_phantom_menace, ~ replace_na(.x, 0)))
starwars_data_wide2.2

## # A tibble: 87 × 8
##    name            the_empire_strikes_b…¹ revenge_of_the_sith return_of_the_jedi
##    <chr>                            <dbl>               <dbl>              <dbl>
##  1 Luke Skywalker                       1                   1                  1
##  2 C-3PO                                1                   1                  1
##  3 R2-D2                                1                   1                  1
##  4 Darth Vader                          1                   1                  1
##  5 Leia Organa                          1                   1                  1
##  6 Owen Lars                            0                   1                  0
##  7 Beru Whitesun …                      0                   1                  0
##  8 R5-D4                                0                   0                  0
##  9 Biggs Darkligh…                      0                   0                  0
## 10 Obi-Wan Kenobi                       1                   1                  1
## # ℹ 77 more rows
## # ℹ abbreviated name: ¹the_empire_strikes_back
## # ℹ 4 more variables: a_new_hope <dbl>, the_force_awakens <dbl>,
## #   attack_of_the_clones <dbl>, the_phantom_menace <dbl>

Then we can pivot this longer again…

starwars_data_long <- starwars_data_wide2.2 %>%
  pivot_longer(cols = the_empire_strikes_back:the_phantom_menace,
               names_to = "movie",
               values_to = "character_in_movie")
starwars_data_long

## # A tibble: 609 × 3
##    name           movie                   character_in_movie
##    <chr>          <chr>                                <dbl>
##  1 Luke Skywalker the_empire_strikes_back                  1
##  2 Luke Skywalker revenge_of_the_sith                      1
##  3 Luke Skywalker return_of_the_jedi                       1
##  4 Luke Skywalker a_new_hope                               1
##  5 Luke Skywalker the_force_awakens                        1
##  6 Luke Skywalker attack_of_the_clones                     0
##  7 Luke Skywalker the_phantom_menace                       0
##  8 C-3PO          the_empire_strikes_back                  1
##  9 C-3PO          revenge_of_the_sith                      1
## 10 C-3PO          return_of_the_jedi                       1
## # ℹ 599 more rows

We did this in separate chunks, but with the pipe we could have done it all in one chunk of code:

starwars_data_together <- starwars_data %>%
  select(films, name) %>%
  unnest_longer(films) %>%
  mutate(in_movie = 1) %>%
  pivot_wider(names_from = films,
              values_from = in_movie) %>%
  clean_names() %>%
  mutate(across(the_empire_strikes_back:the_phantom_menace, ~ replace_na(.x, 0)))
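
A possible variation on the pipeline above: pivot_wider() has a values_fill argument that fills in missing combinations while pivoting, so the separate replace_na() step may not be needed. The sketch below rebuilds the wide table directly from starwars; the name films_wide is only an illustration, and the column names are left uncleaned.

```r
library(dplyr)
library(tidyr)

# Build the character-by-film indicator table in one pivot.
# values_fill = 0 writes 0 (instead of NA) wherever a character
# has no row for a given film.
films_wide <- starwars %>%
  select(name, films) %>%
  unnest_longer(films) %>%
  mutate(in_movie = 1) %>%
  pivot_wider(names_from = films,
              values_from = in_movie,
              values_fill = 0)

films_wide
```

Adding clean_names() at the end would give the same snake_case column names as before.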

3.3.3 unite() and separate()

Maybe we want to combine the film name with the episode number. We could do this using unite().

First, we’ll set up our data for this. We’ll unnest the films column, select only the columns we need, and create a column (using mutate) with the episode numbers. We can use case_when() to do this.

starwars_data_unite <- starwars_data %>%
  unnest(films) %>%
  select(name, films) %>%
  mutate(episode_no = case_when(films == "The Phantom Menace" ~ 1,
                         films == "Attack of the Clones" ~ 2,
                         films == "Revenge of the Sith" ~ 3,
                         films == "A New Hope" ~ 4,
                         films == "The Empire Strikes Back" ~ 5,
                         films == "Return of the Jedi" ~ 6,
                         films == "The Force Awakens" ~ 7))

starwars_data_unite

## # A tibble: 173 × 3
##    name           films                   episode_no
##    <chr>          <chr>                        <dbl>
##  1 Luke Skywalker The Empire Strikes Back          5
##  2 Luke Skywalker Revenge of the Sith              3
##  3 Luke Skywalker Return of the Jedi               6
##  4 Luke Skywalker A New Hope                       4
##  5 Luke Skywalker The Force Awakens                7
##  6 C-3PO          The Empire Strikes Back          5
##  7 C-3PO          Attack of the Clones             2
##  8 C-3PO          The Phantom Menace               1
##  9 C-3PO          Revenge of the Sith              3
## 10 C-3PO          Return of the Jedi               6
## # ℹ 163 more rows

Now we can unite those two columns using unite().

The arguments in unite are:
- The new column name you are creating. In this case, we’ll call it “movie”
- The columns you are uniting (episode_no and films)
- The string that will separate the two combined fields (“: ”, a colon followed by a space)
- Optionally, remove = TRUE (the default) to drop the original columns, or remove = FALSE to keep them

starwars_data_unite2 <- starwars_data_unite %>%
  unite("movie", c(episode_no, films), sep = ": ")

starwars_data_unite2

## # A tibble: 173 × 2
##    name           movie                     
##    <chr>          <chr>                     
##  1 Luke Skywalker 5: The Empire Strikes Back
##  2 Luke Skywalker 3: Revenge of the Sith    
##  3 Luke Skywalker 6: Return of the Jedi     
##  4 Luke Skywalker 4: A New Hope             
##  5 Luke Skywalker 7: The Force Awakens      
##  6 C-3PO          5: The Empire Strikes Back
##  7 C-3PO          2: Attack of the Clones   
##  8 C-3PO          1: The Phantom Menace     
##  9 C-3PO          3: Revenge of the Sith    
## 10 C-3PO          6: Return of the Jedi     
## # ℹ 163 more rows
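
To see what remove = FALSE does, here is a minimal sketch on a two-row toy tibble (episodes is a made-up stand-in for starwars_data_unite, not an object from this document).

```r
library(dplyr)
library(tidyr)

# Toy rows standing in for starwars_data_unite
episodes <- tibble(
  films = c("A New Hope", "The Empire Strikes Back"),
  episode_no = c(4, 5)
)

# remove = FALSE keeps episode_no and films alongside the new movie column
episodes_united <- episodes %>%
  unite("movie", c(episode_no, films), sep = ": ", remove = FALSE)

episodes_united
```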

We can split the united column back into two columns using separate(). Note that episode_no comes back as a character column.

starwars_data_unite2 %>%
  separate(movie, into = c("episode_no", "films"), sep = ": ")

## # A tibble: 173 × 3
##    name           episode_no films                  
##    <chr>          <chr>      <chr>                  
##  1 Luke Skywalker 5          The Empire Strikes Back
##  2 Luke Skywalker 3          Revenge of the Sith    
##  3 Luke Skywalker 6          Return of the Jedi     
##  4 Luke Skywalker 4          A New Hope             
##  5 Luke Skywalker 7          The Force Awakens      
##  6 C-3PO          5          The Empire Strikes Back
##  7 C-3PO          2          Attack of the Clones   
##  8 C-3PO          1          The Phantom Menace     
##  9 C-3PO          3          Revenge of the Sith    
## 10 C-3PO          6          Return of the Jedi     
## # ℹ 163 more rows
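
If the episode number should come back as a number, one option is convert = TRUE. The sketch below also shows separate_wider_delim(), which newer versions of tidyr (1.3.0 and later, an assumption about your installed version) offer as the replacement for separate(). The toy tibble movies is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy rows standing in for starwars_data_unite2
movies <- tibble(movie = c("4: A New Hope", "5: The Empire Strikes Back"))

# convert = TRUE asks separate() to guess column types after splitting
movies_sep <- movies %>%
  separate(movie, into = c("episode_no", "films"), sep = ": ", convert = TRUE)

# separate_wider_delim() returns character columns,
# so the conversion is done explicitly here
movies_wide <- movies %>%
  separate_wider_delim(movie, delim = ": ", names = c("episode_no", "films")) %>%
  mutate(episode_no = as.integer(episode_no))

movies_sep
movies_wide
```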

3.4 Stringr

Stringr makes working with strings “as easy as possible”. Its functions let you find, replace, and extract strings, and they can be used with regular expressions (RegEx).

3.4.1 str_detect()

Tells you if there is a match to the specified pattern.

starwars_data %>%
  filter(str_detect(hair_color, "brown"))

## # A tibble: 19 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Leia Or…    150    49 brown      light      brown             19 fema… femin…
##  2 Owen La…    178   120 brown, gr… light      blue              52 male  mascu…
##  3 Beru Wh…    165    75 brown      light      blue              47 fema… femin…
##  4 Chewbac…    228   112 brown      unknown    blue             200 male  mascu…
##  5 Han Solo    180    80 brown      fair       brown             29 male  mascu…
##  6 Wedge A…    170    77 brown      fair       hazel             21 male  mascu…
##  7 Jek Ton…    180   110 brown      fair       blue              NA male  mascu…
##  8 Arvel C…     NA    NA brown      fair       brown             NA male  mascu…
##  9 Wicket …     88    20 brown      brown      brown              8 male  mascu…
## 10 Qui-Gon…    193    89 brown      fair       blue              92 male  mascu…
## 11 Ric Olié    183    NA brown      fair       blue              NA <NA>  <NA>  
## 12 Cordé       157    NA brown      light      brown             NA fema… femin…
## 13 Cliegg …    183    NA brown      fair       blue              82 male  mascu…
## 14 Dormé       165    NA brown      light      brown             NA fema… femin…
## 15 Tarfful     234   136 brown      brown      blue              NA male  mascu…
## 16 Raymus …    188    79 brown      light      brown             NA male  mascu…
## 17 Rey          NA    NA brown      light      hazel             NA fema… femin…
## 18 Poe Dam…     NA    NA brown      light      brown             NA male  mascu…
## 19 Padmé A…    165    45 brown      light      brown             46 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

We can also use regular expressions. More information can be found in the stringr cheat sheet linked at the top of the page, but here’s a quick example. The code below checks each character’s name and creates a new column that is TRUE if the name contains any digits and FALSE if it does not.

starwars_data %>%
  mutate(name_digit = str_detect(name, "[:digit:]")) %>%
  select(name, name_digit)

## # A tibble: 87 × 2
##    name               name_digit
##    <chr>              <lgl>     
##  1 Luke Skywalker     FALSE     
##  2 C-3PO              TRUE      
##  3 R2-D2              TRUE      
##  4 Darth Vader        FALSE     
##  5 Leia Organa        FALSE     
##  6 Owen Lars          FALSE     
##  7 Beru Whitesun lars FALSE     
##  8 R5-D4              TRUE      
##  9 Biggs Darklighter  FALSE     
## 10 Obi-Wan Kenobi     FALSE     
## # ℹ 77 more rows
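
Two more details about str_detect() that may be useful: the pattern matches anywhere in the string unless it is anchored, and missing values stay missing. A small sketch on a made-up vector of hair colors:

```r
library(stringr)

# Toy vector, including a multi-value entry and a missing value
hair <- c("brown", "brown, grey", "blond", NA)

# Matches "brown" anywhere in the string
has_brown <- str_detect(hair, "brown")

# ^ and $ anchor the pattern to the start and end, so only exact matches pass
only_brown <- str_detect(hair, "^brown$")

# negate = TRUE flips the result; NA stays NA
not_brown <- str_detect(hair, "brown", negate = TRUE)

has_brown
only_brown
not_brown
```

Because filter() drops rows where the condition is NA, characters with a missing hair_color do not appear in the filtered output above.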

3.4.2 str_count()

Returns a count of the matches to the specified pattern.

For example, how many digits are in each character’s name?

starwars_data %>%
  mutate(name_digit_no = str_count(name, "[:digit:]")) %>%
  select(name, name_digit_no)

## # A tibble: 87 × 2
##    name               name_digit_no
##    <chr>                      <int>
##  1 Luke Skywalker                 0
##  2 C-3PO                          1
##  3 R2-D2                          2
##  4 Darth Vader                    0
##  5 Leia Organa                    0
##  6 Owen Lars                      0
##  7 Beru Whitesun lars             0
##  8 R5-D4                          2
##  9 Biggs Darklighter              0
## 10 Obi-Wan Kenobi                 0
## # ℹ 77 more rows

3.4.3 str_extract()

Extracts the first match of the specified pattern from each string.

starwars_data_unite2 %>%
  mutate(episode_no = str_extract(movie, "[:digit:]"))

## # A tibble: 173 × 3
##    name           movie                      episode_no
##    <chr>          <chr>                      <chr>     
##  1 Luke Skywalker 5: The Empire Strikes Back 5         
##  2 Luke Skywalker 3: Revenge of the Sith     3         
##  3 Luke Skywalker 6: Return of the Jedi      6         
##  4 Luke Skywalker 4: A New Hope              4         
##  5 Luke Skywalker 7: The Force Awakens       7         
##  6 C-3PO          5: The Empire Strikes Back 5         
##  7 C-3PO          2: Attack of the Clones    2         
##  8 C-3PO          1: The Phantom Menace      1         
##  9 C-3PO          3: Revenge of the Sith     3         
## 10 C-3PO          6: Return of the Jedi      6         
## # ℹ 163 more rows
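
Since str_extract() returns only the first match, a single-digit pattern works here only because every episode number has one digit. A sketch of what would happen with a two-digit number (the second title is made up for illustration):

```r
library(stringr)

# "10: A Hypothetical Sequel" is invented to show a two-digit episode number
movies <- c("5: The Empire Strikes Back", "10: A Hypothetical Sequel")

# One digit only: the first match is a single character
first_digit <- str_extract(movies, "[:digit:]")

# "\\d+" matches a run of one or more digits
all_digits <- str_extract(movies, "\\d+")

first_digit
all_digits
```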

starwars_data %>%
  mutate(green_skin = str_extract(skin_color, "green")) %>%
  filter(green_skin == "green") %>%
  select(name, skin_color, green_skin)

## # A tibble: 11 × 3
##    name                  skin_color          green_skin
##    <chr>                 <chr>               <chr>     
##  1 Greedo                green               green     
##  2 Jabba Desilijic Tiure green-tan, brown    green     
##  3 Yoda                  green               green     
##  4 Bossk                 green               green     
##  5 Nute Gunray           mottled green       green     
##  6 Rugor Nass            green               green     
##  7 Ben Quadinaros        grey, green, yellow green     
##  8 Kit Fisto             green               green     
##  9 Poggle the Lesser     green               green     
## 10 Zam Wesell            fair, green, yellow green     
## 11 Wat Tambor            green, grey         green

3.4.4 str_replace()

Replaces the first match of a pattern with another specified string.

Let’s say we want “/” instead of a comma and a space between the values in the hair color column.

starwars_data %>%
  mutate(hair_color2 = str_replace(hair_color, ", ", "/")) %>%
  select(name, hair_color, hair_color2)

## # A tibble: 87 × 3
##    name               hair_color    hair_color2 
##    <chr>              <chr>         <chr>       
##  1 Luke Skywalker     blond         blond       
##  2 C-3PO              <NA>          <NA>        
##  3 R2-D2              <NA>          <NA>        
##  4 Darth Vader        none          none        
##  5 Leia Organa        brown         brown       
##  6 Owen Lars          brown, grey   brown/grey  
##  7 Beru Whitesun lars brown         brown       
##  8 R5-D4              <NA>          <NA>        
##  9 Biggs Darklighter  black         black       
## 10 Obi-Wan Kenobi     auburn, white auburn/white
## # ℹ 77 more rows
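
str_replace() only changes the first match in each string, which is enough here because no hair color lists more than two values. For strings with several separators, str_replace_all() is probably what you want. A small sketch on a made-up vector in the style of the skin_color column:

```r
library(stringr)

# Toy vector with zero, two, and missing separators
skin <- c("green", "grey, green, yellow", NA)

# Only the first ", " is replaced
first_only <- str_replace(skin, ", ", "/")

# Every ", " is replaced
every_match <- str_replace_all(skin, ", ", "/")

first_only
every_match
```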

3.5 Purrr

The package purrr has functions that can replace for loops with succinct code that is easier to read.

3.5.1 map()

map() applies a function to each element of a list or vector. For a data frame, that means each column.

We can start out easy, using map to return the number of distinct values across all of our columns.

map() returns a list.

starwars_data %>%
  map(n_distinct)

## $name
## [1] 87
## 
## $height
## [1] 46
## 
## $mass
## [1] 39
## 
## $hair_color
## [1] 13
## 
## $skin_color
## [1] 31
## 
## $eye_color
## [1] 15
## 
## $birth_year
## [1] 37
## 
## $sex
## [1] 5
## 
## $gender
## [1] 3
## 
## $homeworld
## [1] 49
## 
## $species
## [1] 38
## 
## $films
## [1] 24
## 
## $vehicles
## [1] 11
## 
## $starships
## [1] 17

map_dbl() returns a numeric (double) vector.

starwars_data %>%
  map_dbl(n_distinct)

##       name     height       mass hair_color skin_color  eye_color birth_year 
##         87         46         39         13         31         15         37 
##        sex     gender  homeworld    species      films   vehicles  starships 
##          5          3         49         38         24         11         17

map_df() returns a data frame. (In purrr 1.0.0 and later, map_df() is marked as superseded, but it still works.)

starwars_data %>%
  map_df(n_distinct)

## # A tibble: 1 × 14
##    name height  mass hair_color skin_color eye_color birth_year   sex gender
##   <int>  <int> <int>      <int>      <int>     <int>      <int> <int>  <int>
## 1    87     46    39         13         31        15         37     5      3
## # ℹ 5 more variables: homeworld <int>, species <int>, films <int>,
## #   vehicles <int>, starships <int>
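
To make the comparison with a for loop concrete, here is a sketch that computes the same distinct counts both ways. The object names distinct_loop and distinct_map are illustrative only; map_int() is the integer-returning sibling of map_dbl().

```r
library(dplyr)
library(purrr)

# for-loop version: pre-allocate a named result, then fill one column at a time
distinct_loop <- integer(ncol(starwars))
names(distinct_loop) <- names(starwars)
for (col in names(starwars)) {
  distinct_loop[[col]] <- n_distinct(starwars[[col]])
}

# purrr version: one call, names carried over from the columns
distinct_map <- map_int(starwars, n_distinct)

# Both approaches should agree
all.equal(distinct_loop, distinct_map)
```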

We can do more complicated iterative processes with map, such as fitting models to groups of the data.

Let’s do a quick example for how you could fit a model for mass as a function of height for each species.

We start by selecting the columns we need and filtering for complete cases (no NAs in any column).

starwars_data_map <- starwars_data %>% 
  select(name, species, height, mass) %>%
  filter(complete.cases(.)) 

starwars_data_map

## # A tibble: 58 × 4
##    name               species height  mass
##    <chr>              <chr>    <int> <dbl>
##  1 Luke Skywalker     Human      172    77
##  2 C-3PO              Droid      167    75
##  3 R2-D2              Droid       96    32
##  4 Darth Vader        Human      202   136
##  5 Leia Organa        Human      150    49
##  6 Owen Lars          Human      178   120
##  7 Beru Whitesun lars Human      165    75
##  8 R5-D4              Droid       97    32
##  9 Biggs Darklighter  Human      183    84
## 10 Obi-Wan Kenobi     Human      182    77
## # ℹ 48 more rows

We don’t have a lot of data or big sample sizes. Only 2 of the species have more than 3 individuals, but we will use those to demonstrate mapping. We group by species, filter to those with more than 3 individuals, and then nest the data.

starwars_data_map2 <- starwars_data_map %>%
  group_by(species) %>% 
  filter(n() > 3) %>%
  nest() 

starwars_data_map2

## # A tibble: 2 × 2
## # Groups:   species [2]
##   species data             
##   <chr>   <list>           
## 1 Human   <tibble [22 × 3]>
## 2 Droid   <tibble [4 × 3]>

We can then use mutate() with map() to fit the model to each group’s data and store it in a new column. A second mutate() maps the function tidy() from the broom package over the models, which extracts the coefficient table from each one. Finally, we ungroup the data.

starwars_data_map3 <- starwars_data_map2 %>%
  mutate(lm_obj = map(data, ~lm(mass ~ height, data = .))) %>%
  mutate(lm_tidy = map(lm_obj, broom::tidy)) %>%
  ungroup()

starwars_data_map3

## # A tibble: 2 × 4
##   species data              lm_obj lm_tidy         
##   <chr>   <list>            <list> <list>          
## 1 Human   <tibble [22 × 3]> <lm>   <tibble [2 × 5]>
## 2 Droid   <tibble [4 × 3]>  <lm>   <tibble [2 × 5]>

To get the results into a regular data frame, we can select the fields we want and then unnest them!

starwars_data_map4 <- starwars_data_map3 %>%
  select(species, lm_tidy) %>%
  unnest(cols = c(lm_tidy))

starwars_data_map4

## # A tibble: 4 × 6
##   species term        estimate std.error statistic p.value
##   <chr>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 Human   (Intercept) -117.       52.0       -2.24 0.0364 
## 2 Human   height         1.11      0.289      3.84 0.00102
## 3 Droid   (Intercept)  -62.1      28.7       -2.16 0.163  
## 4 Droid   height         0.942     0.195      4.83 0.0403
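
If only one number per model is needed, such as the slope for height, a typed map like map_dbl() can pull it out directly instead of unnesting the full coefficient table. This is a sketch under the same grouping rule as above; the name slopes is illustrative, and drop_na() is used here in place of complete.cases().

```r
library(dplyr)
library(tidyr)
library(purrr)

slopes <- starwars %>%
  select(species, height, mass) %>%
  drop_na() %>%                      # keep complete rows only
  group_by(species) %>%
  filter(n() > 3) %>%                # species with more than 3 individuals
  nest() %>%
  mutate(lm_obj = map(data, ~ lm(mass ~ height, data = .x)),
         # coef() returns a named vector; pick the height slope
         slope = map_dbl(lm_obj, ~ coef(.x)[["height"]])) %>%
  ungroup() %>%
  select(species, slope)

slopes
```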

3.6 Forcats

Forcats is a package that makes working with factors easier by allowing for reordering and grouping.

3.6.1 fct_lump()

Collapse the least or most frequent values of a factor into “Other”.

Here, we keep the 3 most common species and lump the rest into “Other”.

starwars %>% 
  filter(!is.na(species)) %>%
  mutate(species = fct_lump(species, n = 3)) %>%
  count(species)

## # A tibble: 4 × 2
##   species     n
##   <fct>   <int>
## 1 Droid       6
## 2 Gungan      3
## 3 Human      35
## 4 Other      39
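
Recent versions of forcats (0.5.0 and later, an assumption about your installed version) split fct_lump() into more specific functions. The sketch below uses fct_lump_n(), which should give the same result as fct_lump(species, n = 3); the object names are illustrative.

```r
library(dplyr)
library(forcats)

# Keep the 3 most common species, lump everything else into "Other"
lumped_n <- starwars %>%
  filter(!is.na(species)) %>%
  mutate(species = fct_lump_n(species, n = 3)) %>%
  count(species)

# Same call written with the older fct_lump() for comparison
lumped_old <- starwars %>%
  filter(!is.na(species)) %>%
  mutate(species = fct_lump(species, n = 3)) %>%
  count(species)

all.equal(lumped_n, lumped_old)
```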

3.6.2 fct_reorder()

Reorder a factor by another variable. Here, the levels of species are ordered by average height.

starwars_data_reorder <- starwars_data %>%
  filter(!is.na(height) & !is.na(species)) %>%
  group_by(species) %>%
  summarise(avg_height = mean(height)) %>%
  mutate(species = fct_reorder(species, avg_height))
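
The reordering is not visible when printing the tibble, since it only changes the factor levels. One way to check it, sketched below, is to look at the levels or to plot the result. The names height_by_species and p are illustrative.

```r
library(dplyr)
library(forcats)
library(ggplot2)

height_by_species <- starwars %>%
  filter(!is.na(height) & !is.na(species)) %>%
  group_by(species) %>%
  summarise(avg_height = mean(height)) %>%
  mutate(species = fct_reorder(species, avg_height))

# The first levels are the species with the smallest average height
head(levels(height_by_species$species), 3)

# Bars follow the level order, so they come out sorted by height
p <- ggplot(height_by_species, aes(x = species, y = avg_height)) +
  geom_col() +
  coord_flip()
# print(p) draws the plot
```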

3.6.3 fct_infreq()

Reorder a factor based on the frequency of the levels.

starwars %>%
  mutate(eye_color = fct_infreq(eye_color)) %>%
  ggplot(aes(x = eye_color)) + 
  geom_bar() + 
  coord_flip()
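
With coord_flip(), the first factor level is drawn at the bottom, so the most common eye color ends up at the bottom of the plot. If you would rather have it at the top, one option is to reverse the levels with fct_rev(). A sketch (the object names are illustrative):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Levels ordered from most to least frequent
eye_levels <- levels(fct_infreq(starwars$eye_color))

# Same levels in reverse order
eye_levels_rev <- levels(fct_rev(fct_infreq(starwars$eye_color)))

# Plot with the reversed order so the most common color is drawn last (on top)
p_eye <- starwars %>%
  mutate(eye_color = fct_rev(fct_infreq(eye_color))) %>%
  ggplot(aes(x = eye_color)) +
  geom_bar() +
  coord_flip()
# print(p_eye) draws the plot
```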

Intro to the Tidyverse

2024-07-29