We will see some uses of the dplyr package by loading a data set of contestants on the Bachelorette season’s 11-15.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- read.csv("https://raw.githubusercontent.com/pmalo46/SPRING2020TIDYVERSE/master/BacheloretteDSFinal-Dogu.csv")
head(df)
## Season Name Age Hometown State
## 1 15 Jed Wyatt 25 Sevierville, Tennessee TN
## 2 15 Tyler Cameron 26 Jupiter, Florida FL
## 3 15 Peter Weber 27 Westlake Village, California CA
## 4 15 Luke Parker 24 Gainesville, Georgia GA
## 5 15 Garrett Powell 27 Homewood, Alabama AL
## 6 15 Mike Johnson 31 San Antonio, Texas TX
## College Occupation Win_Loss Height..cm.
## 1 Belmont University Singer/Sonwriter 1 190.50
## 2 Wake Forest General Contractor 0 187.96
## 3 Baylor University Pilot 0 175.25
## 4 Faulkner University Import/Export Manager 0 175.00
## 5 Mississippi State University Golf Pro 0 NaN
## 6 NaN Portfolio Manager 0 180.00
## Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1 Yes Brown Brown
## 2 No Brown Green
## 3 No Brown Brown
## 4 No Blonde Brown
## 5 No Brown Green
## 6 No Brown Brown
One of the most useful functions in the dplyr package is the filter function, which allows us to filter down to only rows that meet a certain condition.
filter(df, Win_Loss == 1)
## Season Name Age Hometown State
## 1 15 Jed Wyatt 25 Sevierville, Tennessee TN
## 2 14 Jason Tartick 29 Buffalo, New York NY
## 3 13 Bryan Abasolo 37 Miami, Florida FL
## 4 12 Jordan Rodgers 27 Chico, California CA
## 5 11 Shawn Booth 28 Windsor Locks, Connecticut CT
## College Occupation Win_Loss Height..cm.
## 1 Belmont University Singer/Sonwriter 1 190.50
## 2 University of Rochester Senior Corporate Banker 1 175.26
## 3 University of Florida Chiropractor 1 187.96
## 4 Butte College Former Pro Quarterback 1 187.96
## 5 Keene State College Personal Trainer 1 187.96
## Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1 Yes Brown Brown
## 2 No Brown Brown
## 3 No Brown Brown
## 4 No Brown Brown
## 5 No Brown Brown
The table above shows the winners of the last five seasons. Another useful function is the group_by function.
group_by(df, State) %>%
summarise(mean(Height..cm.))
## # A tibble: 32 x 2
## State `mean(Height..cm.)`
## <fct> <dbl>
## 1 AL NaN
## 2 AR 188.
## 3 AZ 185.
## 4 CA NaN
## 5 CO 187.
## 6 CT NaN
## 7 FL NaN
## 8 GA NaN
## 9 IA NaN
## 10 ID 188.
## # … with 22 more rows
The chunk above uses the group_by method to group the contestants by which state they are from, and then take the average height by state. Another dplyr method, ‘summarise’ is on display here, which allows us to reduce multiple values down to a single value. Another useful function is arrange()
as_tibble(tail(arrange(df, Occupation), 15))
## # A tibble: 15 x 12
## Season Name Age Hometown State College Occupation Win_Loss
## <int> <fct> <int> <fct> <fct> <fct> <fct> <int>
## 1 12 "Nic… 26 San Fra… CA Other Software … 0
## 2 11 "Ben… 26 Warsaw,… IN Indian… Software … 0
## 3 14 "Mic… 27 Cincinn… OH Univer… Sports An… 0
## 4 12 Pete… 26 Rockdal… IL Joliet… Staffing … 0
## 5 13 Dean… 26 Aspen, … CO Univer… Startup R… 0
## 6 15 Devi… 27 Sherman… CA Univer… Talent Ma… 0
## 7 15 Dyla… 24 San Die… CA Willia… "Tech Ent… 0
## 8 12 Jona… 29 Vancouv… Other Other Technical… 0
## 9 12 Chri… 26 Los Ang… CA Califo… Telecom C… 0
## 10 15 Joe … 30 Chicago… IL North … The Box K… 0
## 11 12 "Ale… 25 Oceansi… CA Palm B… U.S. Mari… 0
## 12 13 Blak… 29 San Fra… CA Other U.S. Mari… 0
## 13 15 Gran… 30 San Cle… CA Saddle… Unemployed 0
## 14 14 Davi… 25 Cherry … NJ Univer… Venture C… 0
## 15 12 Luke… 31 Burnet,… TX West P… War Veter… 0
## # … with 4 more variables: Height..cm. <dbl>,
## # Girlfriend.While.on.the.Show. <fct>, Hair.Color <fct>, Eye.Color <fct>
The chunk above uses arrange() to sort the contestants alphabetically, while the as_tibble method makes the output more easily viewable.
These demonstrate some of the many uses of the great dplyr package.
The dplyr package provides several other powerful functions including the select and mutate functions. The following examples highlight the functions’ value and ease-of-use on the same data set of contestants on the Bachelorette season’s 11-15.
Another useful function in the dplyr package is the select function, which allows us to select variables (or columns) by name.
as_tibble(select(df, ends_with("Color")))
## # A tibble: 141 x 2
## Hair.Color Eye.Color
## <fct> <fct>
## 1 Brown Brown
## 2 Brown Green
## 3 Brown Brown
## 4 Blonde Brown
## 5 Brown Green
## 6 Brown Brown
## 7 Brown Brown
## 8 Brown Brown
## 9 Brown Brown
## 10 Brown Brown
## # … with 131 more rows
The tibble above includes only the variables that end with the string ‘Color’ which results in ‘Hair.Color’ and ‘Eye.Color’.
as_tibble(select(df, -ends_with("Color")))
## # A tibble: 141 x 10
## Season Name Age Hometown State College Occupation Win_Loss
## <int> <fct> <int> <fct> <fct> <fct> <fct> <int>
## 1 15 Jed … 25 Sevierv… TN Belmon… Singer/So… 1
## 2 15 Tyle… 26 Jupiter… FL Wake F… General C… 0
## 3 15 Pete… 27 Westlak… CA Baylor… Pilot 0
## 4 15 Luke… 24 Gainesv… GA Faulkn… Import/Ex… 0
## 5 15 Garr… 27 Homewoo… AL Missis… Golf Pro 0
## 6 15 Mike… 31 San Ant… TX NaN Portfolio… 0
## 7 15 Conn… 24 Birming… MI Southe… Investmen… 0
## 8 15 Dust… 30 Chicago… IL Northe… Real Esta… 0
## 9 15 Dyla… 24 San Die… CA Willia… "Tech Ent… 0
## 10 15 Devi… 27 Sherman… CA Univer… Talent Ma… 0
## # … with 131 more rows, and 2 more variables: Height..cm. <dbl>,
## # Girlfriend.While.on.the.Show. <fct>
The tibble above show the simplicity in removing a variable by using the minus symbol (-) as the two variables ending in ‘Color’ are no longer present.
as_tibble(select(df, Occupation, everything()))
## # A tibble: 141 x 12
## Occupation Season Name Age Hometown State College Win_Loss
## <fct> <int> <fct> <int> <fct> <fct> <fct> <int>
## 1 Singer/So… 15 Jed … 25 Sevierv… TN Belmon… 1
## 2 General C… 15 Tyle… 26 Jupiter… FL Wake F… 0
## 3 Pilot 15 Pete… 27 Westlak… CA Baylor… 0
## 4 Import/Ex… 15 Luke… 24 Gainesv… GA Faulkn… 0
## 5 Golf Pro 15 Garr… 27 Homewoo… AL Missis… 0
## 6 Portfolio… 15 Mike… 31 San Ant… TX NaN 0
## 7 Investmen… 15 Conn… 24 Birming… MI Southe… 0
## 8 Real Esta… 15 Dust… 30 Chicago… IL Northe… 0
## 9 "Tech Ent… 15 Dyla… 24 San Die… CA Willia… 0
## 10 Talent Ma… 15 Devi… 27 Sherman… CA Univer… 0
## # … with 131 more rows, and 4 more variables: Height..cm. <dbl>,
## # Girlfriend.While.on.the.Show. <fct>, Hair.Color <fct>, Eye.Color <fct>
The resulting tibble above move the Occupation variable to the front of the data frame and output as the tibble.
# first select all variables except Season, then re-select Season
as_tibble(select(df, -Season, Season))
## # A tibble: 141 x 12
## Name Age Hometown State College Occupation Win_Loss Height..cm.
## <fct> <int> <fct> <fct> <fct> <fct> <int> <dbl>
## 1 Jed … 25 Sevierv… TN Belmon… Singer/So… 1 190.
## 2 Tyle… 26 Jupiter… FL Wake F… General C… 0 188.
## 3 Pete… 27 Westlak… CA Baylor… Pilot 0 175.
## 4 Luke… 24 Gainesv… GA Faulkn… Import/Ex… 0 175
## 5 Garr… 27 Homewoo… AL Missis… Golf Pro 0 NaN
## 6 Mike… 31 San Ant… TX NaN Portfolio… 0 180
## 7 Conn… 24 Birming… MI Southe… Investmen… 0 198
## 8 Dust… 30 Chicago… IL Northe… Real Esta… 0 188
## 9 Dyla… 24 San Die… CA Willia… "Tech Ent… 0 180.
## 10 Devi… 27 Sherman… CA Univer… Talent Ma… 0 NaN
## # … with 131 more rows, and 4 more variables:
## # Girlfriend.While.on.the.Show. <fct>, Hair.Color <fct>,
## # Eye.Color <fct>, Season <int>
The inverse of moving a variable to the front, the select function can easily move a variable to the end of the data frame as shown above with the Season variable.
The dplyr package provides useful tool in the mutate function, which can add or transform variables.
df %>% mutate(
Year.Of.Birth = 2020 - Age,
Height..inch = Height..cm. / 2.54 # convert to inches
) %>% head()
## Season Name Age Hometown State
## 1 15 Jed Wyatt 25 Sevierville, Tennessee TN
## 2 15 Tyler Cameron 26 Jupiter, Florida FL
## 3 15 Peter Weber 27 Westlake Village, California CA
## 4 15 Luke Parker 24 Gainesville, Georgia GA
## 5 15 Garrett Powell 27 Homewood, Alabama AL
## 6 15 Mike Johnson 31 San Antonio, Texas TX
## College Occupation Win_Loss Height..cm.
## 1 Belmont University Singer/Sonwriter 1 190.50
## 2 Wake Forest General Contractor 0 187.96
## 3 Baylor University Pilot 0 175.25
## 4 Faulkner University Import/Export Manager 0 175.00
## 5 Mississippi State University Golf Pro 0 NaN
## 6 NaN Portfolio Manager 0 180.00
## Girlfriend.While.on.the.Show. Hair.Color Eye.Color Year.Of.Birth
## 1 Yes Brown Brown 1995
## 2 No Brown Green 1994
## 3 No Brown Brown 1993
## 4 No Blonde Brown 1996
## 5 No Brown Green 1993
## 6 No Brown Brown 1989
## Height..inch
## 1 75.00000
## 2 74.00000
## 3 68.99606
## 4 68.89764
## 5 NaN
## 6 70.86614
The resulting data frame above shows the capability of the mutate function to add new variables based on existing variables. The ‘Year.Of.Birth’ is calculated from the existing variable of ‘Age’, and the ‘Height..inch’ is based on the conversion from the existing variable ‘Height..cm.’.
df %>% mutate(State = NULL) %>% head()
## Season Name Age Hometown
## 1 15 Jed Wyatt 25 Sevierville, Tennessee
## 2 15 Tyler Cameron 26 Jupiter, Florida
## 3 15 Peter Weber 27 Westlake Village, California
## 4 15 Luke Parker 24 Gainesville, Georgia
## 5 15 Garrett Powell 27 Homewood, Alabama
## 6 15 Mike Johnson 31 San Antonio, Texas
## College Occupation Win_Loss Height..cm.
## 1 Belmont University Singer/Sonwriter 1 190.50
## 2 Wake Forest General Contractor 0 187.96
## 3 Baylor University Pilot 0 175.25
## 4 Faulkner University Import/Export Manager 0 175.00
## 5 Mississippi State University Golf Pro 0 NaN
## 6 NaN Portfolio Manager 0 180.00
## Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1 Yes Brown Brown
## 2 No Brown Green
## 3 No Brown Brown
## 4 No Blonde Brown
## 5 No Brown Green
## 6 No Brown Brown
The mutate function can easily remove variable by simply setting a variable to NULL as shown above with the variable ‘State’.
The additional functions and examples outline some of the valuable uses of the powerful dplyr package.