We will explore some uses of the dplyr package by loading a data set of contestants from seasons 11-15 of The Bachelorette.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read.csv("https://raw.githubusercontent.com/pmalo46/SPRING2020TIDYVERSE/master/BacheloretteDSFinal-Dogu.csv")
head(df)
## Season Name Age Hometown State
## 1 15 Jed Wyatt 25 Sevierville, Tennessee TN
## 2 15 Tyler Cameron 26 Jupiter, Florida FL
## 3 15 Peter Weber 27 Westlake Village, California CA
## 4 15 Luke Parker 24 Gainesville, Georgia GA
## 5 15 Garrett Powell 27 Homewood, Alabama AL
## 6 15 Mike Johnson 31 San Antonio, Texas TX
## College Occupation Win_Loss Height..cm.
## 1 Belmont University Singer/Sonwriter 1 190.50
## 2 Wake Forest General Contractor 0 187.96
## 3 Baylor University Pilot 0 175.25
## 4 Faulkner University Import/Export Manager 0 175.00
## 5 Mississippi State University Golf Pro 0 NaN
## 6 NaN Portfolio Manager 0 180.00
## Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1 Yes Brown Brown
## 2 No Brown Green
## 3 No Brown Brown
## 4 No Blonde Brown
## 5 No Brown Green
## 6 No Brown Brown
One of the most useful functions in the dplyr package is the filter function, which allows us to filter down to only rows that meet a certain condition.
filter(df, Win_Loss == 1)
## Season Name Age Hometown State
## 1 15 Jed Wyatt 25 Sevierville, Tennessee TN
## 2 14 Jason Tartick 29 Buffalo, New York NY
## 3 13 Bryan Abasolo 37 Miami, Florida FL
## 4 12 Jordan Rodgers 27 Chico, California CA
## 5 11 Shawn Booth 28 Windsor Locks, Connecticut CT
## College Occupation Win_Loss Height..cm.
## 1 Belmont University Singer/Sonwriter 1 190.50
## 2 University of Rochester Senior Corporate Banker 1 175.26
## 3 University of Florida Chiropractor 1 187.96
## 4 Butte College Former Pro Quarterback 1 187.96
## 5 Keene State College Personal Trainer 1 187.96
## Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1 Yes Brown Brown
## 2 No Brown Brown
## 3 No Brown Brown
## 4 No Brown Brown
## 5 No Brown Brown
The table above shows the winners of seasons 11 through 15. Another useful function is group_by().
group_by(df, State) %>%
summarise(mean(Height..cm.))
## # A tibble: 32 x 2
## State `mean(Height..cm.)`
## <fct> <dbl>
## 1 AL NaN
## 2 AR 188.
## 3 AZ 185.
## 4 CA NaN
## 5 CO 187.
## 6 CT NaN
## 7 FL NaN
## 8 GA NaN
## 9 IA NaN
## 10 ID 188.
## # … with 22 more rows
The chunk above uses group_by() to group the contestants by home state, then summarise() to take the average height within each state. summarise() reduces multiple values down to a single value per group. Many states show NaN because mean() returns a missing result whenever any height in the group is missing. Another useful function is arrange().
as_tibble(tail(arrange(df, Occupation), 15))
## # A tibble: 15 x 12
## Season Name Age Hometown State College Occupation Win_Loss Height..cm.
## <int> <fct> <int> <fct> <fct> <fct> <fct> <int> <dbl>
## 1 12 "Nic… 26 San Fra… CA Other "Software… 0 NaN
## 2 11 "Ben… 26 Warsaw,… IN Indian… "Software… 0 198.
## 3 14 "Mic… 27 Cincinn… OH Univer… "Sports A… 0 NaN
## 4 12 "Pet… 26 Rockdal… IL Joliet… "Staffing… 0 180.
## 5 13 "Dea… 26 Aspen, … CO Univer… "Startup … 0 188
## 6 15 "Dev… 27 Sherman… CA Univer… "Talent M… 0 NaN
## 7 15 "Dyl… 24 San Die… CA Willia… "Tech Ent… 0 180.
## 8 12 "Jon… 29 Vancouv… Other Other "Technica… 0 185.
## 9 12 "Chr… 26 Los Ang… CA Califo… "Telecom … 0 180.
## 10 15 "Joe… 30 Chicago… IL North … "The Box … 0 NaN
## 11 12 "Ale… 25 Oceansi… CA Palm B… "U.S. Mar… 0 170.
## 12 13 "Bla… 29 San Fra… CA Other "U.S. Mar… 0 180.
## 13 15 "Gra… 30 San Cle… CA Saddle… "Unemploy… 0 NaN
## 14 14 "Dav… 25 Cherry … NJ Univer… "Venture … 0 NaN
## 15 12 "Luk… 31 Burnet,… TX West P… "War Vete… 0 185.
## # … with 3 more variables: Girlfriend.While.on.the.Show. <fct>,
## # Hair.Color <fct>, Eye.Color <fct>
The chunk above uses arrange() to sort the contestants alphabetically by occupation and tail() to keep the last 15 rows, while as_tibble() makes the output easier to read.
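arrange() sorts in ascending order by default. Wrapping a column in desc() reverses the order, and additional columns break ties. Here is a minimal sketch on a small hand-built data frame (`toy` is an illustrative name, with six rows copied from the head(df) output earlier rather than the full data set):

```r
library(dplyr)

# Six rows copied from the head(df) output; `toy` is an illustrative name.
toy <- data.frame(
  Name = c("Jed Wyatt", "Tyler Cameron", "Peter Weber",
           "Luke Parker", "Garrett Powell", "Mike Johnson"),
  Age = c(25, 26, 27, 24, 27, 31),
  stringsAsFactors = FALSE
)

# Oldest first; contestants with the same age are ordered by name.
oldest_first <- arrange(toy, desc(Age), Name)
oldest_first
```

The same pattern should carry over to the full data frame, for example arrange(df, desc(Age), Name).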
These demonstrate some of the many uses of the great dplyr package.
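One thing worth revisiting from the grouped summary earlier: many of the averages came back as NaN, because mean() returns a missing result when any value in the group is missing. One possible fix is to pass na.rm = TRUE. The sketch below uses six rows copied from the head(df) output and groups by hair color only so that one group has more than one row. The names `toy`, `mean_height`, and `n_known` are illustrative.

```r
library(dplyr)

# Six rows copied from the head(df) output; one height is missing (NaN).
toy <- data.frame(
  Hair.Color = c("Brown", "Brown", "Brown", "Blonde", "Brown", "Brown"),
  Height..cm. = c(190.50, 187.96, 175.25, 175.00, NaN, 180.00),
  stringsAsFactors = FALSE
)

by_hair <- toy %>%
  group_by(Hair.Color) %>%
  summarise(
    # Drop missing heights before averaging.
    mean_height = mean(Height..cm., na.rm = TRUE),
    # Count how many heights were actually available in each group.
    n_known = sum(!is.na(Height..cm.))
  )
by_hair
```

A group whose heights are all missing would still come back as NaN, so the `n_known` column helps show how much data each average rests on.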
Let’s use more dplyr functions to further analyze the Bachelorette data.
We can use the select() function to view only certain columns from the data.
Here we want to view the Season, Name, Age and whether the contestant won or not.
age <- df %>% select('Season','Name','Age','Win_Loss')
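select() also accepts unquoted column names, ranges of adjacent columns, negation, and helper functions such as ends_with(). The following is a minimal sketch on a two-row data frame copied from the head(df) output. The names `toy`, `by_range`, `without_flag`, and `colors_only` are illustrative.

```r
library(dplyr)

# Two rows copied from the head(df) output; `toy` is an illustrative name.
toy <- data.frame(
  Season = c(15, 15),
  Name = c("Jed Wyatt", "Tyler Cameron"),
  Age = c(25, 26),
  Win_Loss = c(1, 0),
  Hair.Color = c("Brown", "Brown"),
  Eye.Color = c("Brown", "Green"),
  stringsAsFactors = FALSE
)

by_range     <- select(toy, Season:Age)          # adjacent columns by range
without_flag <- select(toy, -Win_Loss)           # everything except one column
colors_only  <- select(toy, ends_with("Color"))  # columns matched by suffix
names(colors_only)
```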
Let’s use the filter function and only look at contestants from Hannah Brown’s season (Season 15).
hb_contestants <- age %>% filter(Season == 15)
hb_contestants
## Season Name Age Win_Loss
## 1 15 Jed Wyatt 25 1
## 2 15 Tyler Cameron 26 0
## 3 15 Peter Weber 27 0
## 4 15 Luke Parker 24 0
## 5 15 Garrett Powell 27 0
## 6 15 Mike Johnson 31 0
## 7 15 Connor Saeli 24 0
## 8 15 Dustin Kendrick 30 0
## 9 15 Dylan Barbour 24 0
## 10 15 Devin Harris 27 0
## 11 15 Grant Eckel 30 0
## 12 15 Kevin Fortenberry 27 0
## 13 15 John Paul Jones 24 0
## 14 15 Matteo Valles 25 0
## 15 15 Luke Stone 29 0
## 16 15 Cameron Ayala 30 0
## 17 15 Joey Jones 33 0
## 18 15 Jonathan Saunders 27 0
## 19 15 Tyler Gwozdz 28 0
## 20 15 Connor Jenkins 28 0
## 21 15 Daron Blaylock 25 0
## 22 15 Matthew Spraggins 23 0
## 23 15 Brian Bowles 30 0
## 24 15 Chasen Coscia 27 0
## 25 15 Hunter Jones 24 0
## 26 15 Joe Barsano 30 0
## 27 15 Matt Donald 26 0
## 28 15 Ryan Spirko 25 0
## 29 15 Thomas Staton 27 0
## 30 15 Scott Anderson 28 0
Hannah Brown was the Bachelorette for Season 15. According to https://www.washingtonpost.com/lifestyle/2019/05/13/bachelorette-premiere-heres-everything-you-need-know-about-hannah-b/, she was 24 years old at the time of her season. What if we wanted to know each contestant’s age in relation to Hannah’s? We can do that using mutate().
mutate() helps us create new columns from existing columns. Our new column will be the age difference between each contestant and Hannah Brown (contestant age minus 24).
hb_contestants <- hb_contestants %>% mutate(age_difference = Age - 24)
hb_contestants
## Season Name Age Win_Loss age_difference
## 1 15 Jed Wyatt 25 1 1
## 2 15 Tyler Cameron 26 0 2
## 3 15 Peter Weber 27 0 3
## 4 15 Luke Parker 24 0 0
## 5 15 Garrett Powell 27 0 3
## 6 15 Mike Johnson 31 0 7
## 7 15 Connor Saeli 24 0 0
## 8 15 Dustin Kendrick 30 0 6
## 9 15 Dylan Barbour 24 0 0
## 10 15 Devin Harris 27 0 3
## 11 15 Grant Eckel 30 0 6
## 12 15 Kevin Fortenberry 27 0 3
## 13 15 John Paul Jones 24 0 0
## 14 15 Matteo Valles 25 0 1
## 15 15 Luke Stone 29 0 5
## 16 15 Cameron Ayala 30 0 6
## 17 15 Joey Jones 33 0 9
## 18 15 Jonathan Saunders 27 0 3
## 19 15 Tyler Gwozdz 28 0 4
## 20 15 Connor Jenkins 28 0 4
## 21 15 Daron Blaylock 25 0 1
## 22 15 Matthew Spraggins 23 0 -1
## 23 15 Brian Bowles 30 0 6
## 24 15 Chasen Coscia 27 0 3
## 25 15 Hunter Jones 24 0 0
## 26 15 Joe Barsano 30 0 6
## 27 15 Matt Donald 26 0 2
## 28 15 Ryan Spirko 25 0 1
## 29 15 Thomas Staton 27 0 3
## 30 15 Scott Anderson 28 0 4
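The call above hard-codes 24. As a small variation, the sketch below stores that value in a named variable and derives a second column from the first within the same mutate() call. The names `bachelorette_age`, `toy`, `toy_diff`, and `older_than_lead` are illustrative, and the three rows are copied from the output above.

```r
library(dplyr)

# Three rows copied from the Season 15 output above.
toy <- data.frame(
  Name = c("Jed Wyatt", "Mike Johnson", "Matthew Spraggins"),
  Age = c(25, 31, 23),
  stringsAsFactors = FALSE
)

# Hannah Brown's age as stated in the text; a named value is easier to change.
bachelorette_age <- 24

toy_diff <- toy %>%
  mutate(
    age_difference = Age - bachelorette_age,
    # Columns created earlier in the same mutate() call can be reused.
    older_than_lead = age_difference > 0
  )
toy_diff
```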
Let’s order the contestants in her season from youngest to oldest using arrange(). Plotting the age difference shows that the most common gap is 3 years: 7 of the 30 contestants are 3 years older than her.
library(ggplot2)
hb_contestants %>% arrange(age_difference)
## Season Name Age Win_Loss age_difference
## 1 15 Matthew Spraggins 23 0 -1
## 2 15 Luke Parker 24 0 0
## 3 15 Connor Saeli 24 0 0
## 4 15 Dylan Barbour 24 0 0
## 5 15 John Paul Jones 24 0 0
## 6 15 Hunter Jones 24 0 0
## 7 15 Jed Wyatt 25 1 1
## 8 15 Matteo Valles 25 0 1
## 9 15 Daron Blaylock 25 0 1
## 10 15 Ryan Spirko 25 0 1
## 11 15 Tyler Cameron 26 0 2
## 12 15 Matt Donald 26 0 2
## 13 15 Peter Weber 27 0 3
## 14 15 Garrett Powell 27 0 3
## 15 15 Devin Harris 27 0 3
## 16 15 Kevin Fortenberry 27 0 3
## 17 15 Jonathan Saunders 27 0 3
## 18 15 Chasen Coscia 27 0 3
## 19 15 Thomas Staton 27 0 3
## 20 15 Tyler Gwozdz 28 0 4
## 21 15 Connor Jenkins 28 0 4
## 22 15 Scott Anderson 28 0 4
## 23 15 Luke Stone 29 0 5
## 24 15 Dustin Kendrick 30 0 6
## 25 15 Grant Eckel 30 0 6
## 26 15 Cameron Ayala 30 0 6
## 27 15 Brian Bowles 30 0 6
## 28 15 Joe Barsano 30 0 6
## 29 15 Mike Johnson 31 0 7
## 30 15 Joey Jones 33 0 9
ggplot(hb_contestants,aes(x=age_difference)) +
geom_bar()
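The bar heights can also be checked numerically with count(), which tallies the rows for each distinct value. This is a minimal sketch that rebuilds the age_difference column from the 30 ages printed above. The names `ages`, `toy`, and `diff_counts` are illustrative.

```r
library(dplyr)

# The 30 Season 15 ages, copied in order from the output above.
ages <- c(25, 26, 27, 24, 27, 31, 24, 30, 24, 27,
          30, 27, 24, 25, 29, 30, 33, 27, 28, 28,
          25, 23, 30, 27, 24, 30, 26, 25, 27, 28)

toy <- data.frame(age_difference = ages - 24)

# One row per distinct age difference, most frequent first.
diff_counts <- toy %>% count(age_difference, sort = TRUE)
head(diff_counts, 3)
```

The first row of `diff_counts` corresponds to the tallest bar in the plot.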
Using filter(), we can see that the winner, Jed Wyatt, was 1 year older than Hannah Brown.
hb_contestants %>% filter(Win_Loss == 1)
## Season Name Age Win_Loss age_difference
## 1 15 Jed Wyatt 25 1 1
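filter() can also take several conditions at once. Conditions separated by commas must all hold for a row to be kept. The sketch below uses six rows copied from the Season 15 output. The names `toy` and `older_non_winners` are illustrative.

```r
library(dplyr)

# Six rows copied from the Season 15 output; `toy` is an illustrative name.
toy <- data.frame(
  Name = c("Jed Wyatt", "Tyler Cameron", "Peter Weber",
           "Luke Parker", "Garrett Powell", "Mike Johnson"),
  Age = c(25, 26, 27, 24, 27, 31),
  Win_Loss = c(1, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Keep contestants who are at least 27 AND did not win.
older_non_winners <- filter(toy, Age >= 27, Win_Loss == 0)
older_non_winners
```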
dplyr is a great tool for viewing and manipulating data. It helped us learn more about the Bachelorette contestants vying for Hannah Brown’s heart.