Introduction

We will explore some uses of the dplyr package with a data set of contestants from seasons 11-15 of The Bachelorette.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df <- read.csv("https://raw.githubusercontent.com/pmalo46/SPRING2020TIDYVERSE/master/BacheloretteDSFinal-Dogu.csv")
head(df)
##   Season           Name Age                     Hometown State
## 1     15      Jed Wyatt  25       Sevierville, Tennessee    TN
## 2     15  Tyler Cameron  26             Jupiter, Florida    FL
## 3     15    Peter Weber  27 Westlake Village, California    CA
## 4     15    Luke Parker  24         Gainesville, Georgia    GA
## 5     15 Garrett Powell  27            Homewood, Alabama    AL
## 6     15   Mike Johnson  31           San Antonio, Texas    TX
##                        College            Occupation Win_Loss Height..cm.
## 1           Belmont University      Singer/Sonwriter        1      190.50
## 2                  Wake Forest    General Contractor        0      187.96
## 3            Baylor University                 Pilot        0      175.25
## 4          Faulkner University Import/Export Manager        0      175.00
## 5 Mississippi State University              Golf Pro        0         NaN
## 6                          NaN     Portfolio Manager        0      180.00
##   Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1                           Yes      Brown     Brown
## 2                            No      Brown     Green
## 3                            No      Brown     Brown
## 4                            No     Blonde     Brown
## 5                            No      Brown     Green
## 6                            No      Brown     Brown

One of the most useful functions in the dplyr package is the filter function, which allows us to filter down to only rows that meet a certain condition.

filter(df, Win_Loss == 1)
##   Season           Name Age                   Hometown State
## 1     15      Jed Wyatt  25     Sevierville, Tennessee    TN
## 2     14  Jason Tartick  29          Buffalo, New York    NY
## 3     13  Bryan Abasolo  37             Miami, Florida    FL
## 4     12 Jordan Rodgers  27          Chico, California    CA
## 5     11    Shawn Booth  28 Windsor Locks, Connecticut    CT
##                   College              Occupation Win_Loss Height..cm.
## 1      Belmont University        Singer/Sonwriter        1      190.50
## 2 University of Rochester Senior Corporate Banker        1      175.26
## 3   University of Florida            Chiropractor        1      187.96
## 4           Butte College  Former Pro Quarterback        1      187.96
## 5     Keene State College        Personal Trainer        1      187.96
##   Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1                           Yes      Brown     Brown
## 2                            No      Brown     Brown
## 3                            No      Brown     Brown
## 4                            No      Brown     Brown
## 5                            No      Brown     Brown

The table above shows the winners of seasons 11 through 15, the five seasons covered by the data set. Another useful function is group_by().

group_by(df, State) %>%
  summarise(mean(Height..cm.))
## # A tibble: 32 x 2
##    State `mean(Height..cm.)`
##    <fct>               <dbl>
##  1 AL                   NaN 
##  2 AR                   188.
##  3 AZ                   185.
##  4 CA                   NaN 
##  5 CO                   187.
##  6 CT                   NaN 
##  7 FL                   NaN 
##  8 GA                   NaN 
##  9 IA                   NaN 
## 10 ID                   188.
## # … with 22 more rows
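Many of the state means above are NaN because some contestants have no recorded height, and mean() returns NaN when any of its inputs is NaN. One possible way around this is to pass na.rm = TRUE. The sketch below uses a small made-up data frame; its rows and the mean_height column name are assumptions for illustration, not part of the data set.

```r
library(dplyr)

# Hypothetical rows that mimic the State and Height..cm. columns of df
heights <- data.frame(
  State = c("CA", "CA", "FL", "FL", "TX"),
  Height..cm. = c(180, NaN, 190, 186, NaN)
)

# na.rm = TRUE drops missing heights before averaging within each state
by_state <- heights %>%
  group_by(State) %>%
  summarise(mean_height = mean(Height..cm., na.rm = TRUE))

by_state
```

A state whose heights are all missing (TX in this sketch) still comes back as NaN, since nothing is left to average.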

The group_by() function groups the contestants by home state, and summarise() then reduces each group’s heights to a single value, here the mean height per state. Another useful function is arrange().

as_tibble(tail(arrange(df, Occupation), 15))
## # A tibble: 15 x 12
##    Season Name    Age Hometown State College Occupation Win_Loss Height..cm.
##     <int> <fct> <int> <fct>    <fct> <fct>   <fct>         <int>       <dbl>
##  1     12 "Nic…    26 San Fra… CA    Other   "Software…        0        NaN 
##  2     11 "Ben…    26 Warsaw,… IN    Indian… "Software…        0        198.
##  3     14 "Mic…    27 Cincinn… OH    Univer… "Sports A…        0        NaN 
##  4     12 "Pet…    26 Rockdal… IL    Joliet… "Staffing…        0        180.
##  5     13 "Dea…    26 Aspen, … CO    Univer… "Startup …        0        188 
##  6     15 "Dev…    27 Sherman… CA    Univer… "Talent M…        0        NaN 
##  7     15 "Dyl…    24 San Die… CA    Willia… "Tech Ent…        0        180.
##  8     12 "Jon…    29 Vancouv… Other Other   "Technica…        0        185.
##  9     12 "Chr…    26 Los Ang… CA    Califo… "Telecom …        0        180.
## 10     15 "Joe…    30 Chicago… IL    North … "The Box …        0        NaN 
## 11     12 "Ale…    25 Oceansi… CA    Palm B… "U.S. Mar…        0        170.
## 12     13 "Bla…    29 San Fra… CA    Other   "U.S. Mar…        0        180.
## 13     15 "Gra…    30 San Cle… CA    Saddle… "Unemploy…        0        NaN 
## 14     14 "Dav…    25 Cherry … NJ    Univer… "Venture …        0        NaN 
## 15     12 "Luk…    31 Burnet,… TX    West P… "War Vete…        0        185.
## # … with 3 more variables: Girlfriend.While.on.the.Show. <fct>,
## #   Hair.Color <fct>, Eye.Color <fct>

The chunk above uses arrange() to sort the contestants alphabetically by occupation and tail() to keep the last 15 rows, while as_tibble() makes the output easier to read.
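arrange() can also sort in descending order and by more than one column. A minimal sketch, assuming a small made-up data frame that borrows the Season and Age column names from df:

```r
library(dplyr)

# Hypothetical rows using column names from df
contestants <- data.frame(
  Name = c("A", "B", "C", "D"),
  Season = c(14, 15, 15, 14),
  Age = c(29, 25, 31, 26)
)

# Newest season first, then youngest to oldest within each season
sorted <- contestants %>% arrange(desc(Season), Age)
sorted
```

Here desc() reverses the sort for Season only; Age is still sorted in ascending order within each season.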

These demonstrate some of the many uses of the great dplyr package.

Additional dplyr functions (Devin Teran)

Let’s use more dplyr functions to further analyze the Bachelor franchise data.

We can use the select() function to view only certain columns from the data.

  1. Start with the data frame you want to select from
  2. Follow it with the pipe operator, %>%
  3. Finish with select() and the names of the columns you’d like to view

Here we want to view the Season, Name, and Age columns, along with Win_Loss, which records whether the contestant won.

age <- df %>% select('Season','Name','Age','Win_Loss')
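select() accepts unquoted column names as well, and it can drop or rename columns. The following is a sketch on a small made-up data frame; the rows and the new name Won are assumptions for illustration.

```r
library(dplyr)

# Hypothetical rows using column names from df
contestants <- data.frame(
  Season = c(15, 15),
  Name = c("A", "B"),
  Age = c(25, 26),
  Win_Loss = c(1, 0),
  State = c("TN", "FL")
)

# Unquoted column names select the same columns as quoted ones
kept <- contestants %>% select(Season, Name, Age, Win_Loss)

# A minus sign drops a column instead of keeping it
dropped <- contestants %>% select(-State)

# new_name = old_name renames a column while selecting it
renamed <- contestants %>% select(Name, Won = Win_Loss)
```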

Let’s use the filter function and only look at contestants from Hannah Brown’s season (Season 15).

hb_contestants <- age %>% filter(Season == 15)
hb_contestants
##    Season              Name Age Win_Loss
## 1      15         Jed Wyatt  25        1
## 2      15     Tyler Cameron  26        0
## 3      15       Peter Weber  27        0
## 4      15       Luke Parker  24        0
## 5      15    Garrett Powell  27        0
## 6      15      Mike Johnson  31        0
## 7      15      Connor Saeli  24        0
## 8      15   Dustin Kendrick  30        0
## 9      15     Dylan Barbour  24        0
## 10     15      Devin Harris  27        0
## 11     15       Grant Eckel  30        0
## 12     15 Kevin Fortenberry  27        0
## 13     15   John Paul Jones  24        0
## 14     15     Matteo Valles  25        0
## 15     15        Luke Stone  29        0
## 16     15     Cameron Ayala  30        0
## 17     15        Joey Jones  33        0
## 18     15 Jonathan Saunders  27        0
## 19     15      Tyler Gwozdz  28        0
## 20     15    Connor Jenkins  28        0
## 21     15    Daron Blaylock  25        0
## 22     15 Matthew Spraggins  23        0
## 23     15      Brian Bowles  30        0
## 24     15     Chasen Coscia  27        0
## 25     15      Hunter Jones  24        0
## 26     15       Joe Barsano  30        0
## 27     15       Matt Donald  26        0
## 28     15       Ryan Spirko  25        0
## 29     15     Thomas Staton  27        0
## 30     15    Scott Anderson  28        0
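filter() is not limited to a single condition. A minimal sketch of combining conditions, assuming a small made-up data frame with the Season and Age column names from df:

```r
library(dplyr)

# Hypothetical rows using column names from df
contestants <- data.frame(
  Name = c("A", "B", "C", "D"),
  Season = c(15, 15, 14, 15),
  Age = c(24, 30, 31, 33)
)

# Comma-separated conditions must all hold (logical AND)
older_s15 <- contestants %>% filter(Season == 15, Age >= 30)

# Use | for OR, and %in% to match any of several values
young_or_s14 <- contestants %>% filter(Age < 25 | Season %in% c(14))
```

In this sketch older_s15 keeps contestants B and D, while young_or_s14 keeps A and C.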

Hannah Brown was the Bachelorette for Season 15. According to https://www.washingtonpost.com/lifestyle/2019/05/13/bachelorette-premiere-heres-everything-you-need-know-about-hannah-b/, she was 24 years old at the time of her season. What if we wanted to know each contestant’s age in relation to Hannah’s? We can do that using mutate().

The mutate() function creates new columns from existing ones. Our new column will be the age difference between each contestant and Hannah Brown, so a positive value means the contestant is older.

hb_contestants <- hb_contestants %>% mutate(age_difference = Age - 24)
hb_contestants
##    Season              Name Age Win_Loss age_difference
## 1      15         Jed Wyatt  25        1              1
## 2      15     Tyler Cameron  26        0              2
## 3      15       Peter Weber  27        0              3
## 4      15       Luke Parker  24        0              0
## 5      15    Garrett Powell  27        0              3
## 6      15      Mike Johnson  31        0              7
## 7      15      Connor Saeli  24        0              0
## 8      15   Dustin Kendrick  30        0              6
## 9      15     Dylan Barbour  24        0              0
## 10     15      Devin Harris  27        0              3
## 11     15       Grant Eckel  30        0              6
## 12     15 Kevin Fortenberry  27        0              3
## 13     15   John Paul Jones  24        0              0
## 14     15     Matteo Valles  25        0              1
## 15     15        Luke Stone  29        0              5
## 16     15     Cameron Ayala  30        0              6
## 17     15        Joey Jones  33        0              9
## 18     15 Jonathan Saunders  27        0              3
## 19     15      Tyler Gwozdz  28        0              4
## 20     15    Connor Jenkins  28        0              4
## 21     15    Daron Blaylock  25        0              1
## 22     15 Matthew Spraggins  23        0             -1
## 23     15      Brian Bowles  30        0              6
## 24     15     Chasen Coscia  27        0              3
## 25     15      Hunter Jones  24        0              0
## 26     15       Joe Barsano  30        0              6
## 27     15       Matt Donald  26        0              2
## 28     15       Ryan Spirko  25        0              1
## 29     15     Thomas Staton  27        0              3
## 30     15    Scott Anderson  28        0              4

Let’s order the contestants in her season from youngest to oldest using arrange(). Plotting the age difference, we can see that the most common difference is 3 years: 7 of the 30 contestants are 3 years older than her.

library(ggplot2)
hb_contestants %>% arrange(age_difference)
##    Season              Name Age Win_Loss age_difference
## 1      15 Matthew Spraggins  23        0             -1
## 2      15       Luke Parker  24        0              0
## 3      15      Connor Saeli  24        0              0
## 4      15     Dylan Barbour  24        0              0
## 5      15   John Paul Jones  24        0              0
## 6      15      Hunter Jones  24        0              0
## 7      15         Jed Wyatt  25        1              1
## 8      15     Matteo Valles  25        0              1
## 9      15    Daron Blaylock  25        0              1
## 10     15       Ryan Spirko  25        0              1
## 11     15     Tyler Cameron  26        0              2
## 12     15       Matt Donald  26        0              2
## 13     15       Peter Weber  27        0              3
## 14     15    Garrett Powell  27        0              3
## 15     15      Devin Harris  27        0              3
## 16     15 Kevin Fortenberry  27        0              3
## 17     15 Jonathan Saunders  27        0              3
## 18     15     Chasen Coscia  27        0              3
## 19     15     Thomas Staton  27        0              3
## 20     15      Tyler Gwozdz  28        0              4
## 21     15    Connor Jenkins  28        0              4
## 22     15    Scott Anderson  28        0              4
## 23     15        Luke Stone  29        0              5
## 24     15   Dustin Kendrick  30        0              6
## 25     15       Grant Eckel  30        0              6
## 26     15     Cameron Ayala  30        0              6
## 27     15      Brian Bowles  30        0              6
## 28     15       Joe Barsano  30        0              6
## 29     15      Mike Johnson  31        0              7
## 30     15        Joey Jones  33        0              9
ggplot(hb_contestants,aes(x=age_difference)) + 
  geom_bar()
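The bar heights can also be checked numerically with count(). The sketch below re-enters the Season 15 ages by hand from the output above rather than reading them from df, and age_counts is a name introduced here for illustration.

```r
library(dplyr)

# Ages of the 30 Season 15 contestants, copied from the output above
ages <- c(25, 26, 27, 24, 27, 31, 24, 30, 24, 27, 30, 27, 24, 25, 29,
          30, 33, 27, 28, 28, 25, 23, 30, 27, 24, 30, 26, 25, 27, 28)

# Tally how many contestants share each age difference, largest tally first
age_counts <- data.frame(Age = ages) %>%
  mutate(age_difference = Age - 24) %>%
  count(age_difference, sort = TRUE)

head(age_counts, 3)
```

The first row is an age difference of 3 with a count of 7, which corresponds to the tallest bar in the plot.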

Using filter(), we can see that the winner, Jed Wyatt, was 1 year older than Hannah Brown.

hb_contestants %>% filter(Win_Loss == 1)
##   Season      Name Age Win_Loss age_difference
## 1     15 Jed Wyatt  25        1              1
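The separate steps in this section can also be chained into a single pipeline. This is a sketch on a small made-up data frame; bachelorette_age is a variable introduced here so that the 24 is not hard-coded inside mutate().

```r
library(dplyr)

# Hypothetical rows using column names from df
contestants <- data.frame(
  Season = c(15, 15, 15, 14),
  Name = c("A", "B", "C", "D"),
  Age = c(27, 23, 25, 29),
  Win_Loss = c(0, 0, 1, 1)
)

# Hannah Brown's age during her season, per the article cited earlier
bachelorette_age <- 24

result <- contestants %>%
  select(Season, Name, Age, Win_Loss) %>%          # keep four columns
  filter(Season == 15) %>%                         # Season 15 only
  mutate(age_difference = Age - bachelorette_age) %>%
  arrange(age_difference)                          # youngest first

result
```

Chaining avoids the intermediate objects, at the cost of not being able to inspect each step on its own.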

The dplyr package is a great tool for viewing and manipulating data. It helped us learn more about the Bachelorette contestants vying for Hannah Brown’s heart.