How to use dplyr

Introduction

We will see some uses of the dplyr package by loading a data set of contestants on the Bachelorette season’s 11-15.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df <- read.csv("https://raw.githubusercontent.com/pmalo46/SPRING2020TIDYVERSE/master/BacheloretteDSFinal-Dogu.csv")
head(df)

##   Season           Name Age                     Hometown State
## 1     15      Jed Wyatt  25       Sevierville, Tennessee    TN
## 2     15  Tyler Cameron  26             Jupiter, Florida    FL
## 3     15    Peter Weber  27 Westlake Village, California    CA
## 4     15    Luke Parker  24         Gainesville, Georgia    GA
## 5     15 Garrett Powell  27            Homewood, Alabama    AL
## 6     15   Mike Johnson  31           San Antonio, Texas    TX
##                        College            Occupation Win_Loss Height..cm.
## 1           Belmont University      Singer/Sonwriter        1      190.50
## 2                  Wake Forest    General Contractor        0      187.96
## 3            Baylor University                 Pilot        0      175.25
## 4          Faulkner University Import/Export Manager        0      175.00
## 5 Mississippi State University              Golf Pro        0         NaN
## 6                          NaN     Portfolio Manager        0      180.00
##   Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1                           Yes      Brown     Brown
## 2                            No      Brown     Green
## 3                            No      Brown     Brown
## 4                            No     Blonde     Brown
## 5                            No      Brown     Green
## 6                            No      Brown     Brown

filter function

One of the most useful functions in the dplyr package is the filter function, which allows us to filter down to only rows that meet a certain condition.

filter(df, Win_Loss == 1)

##   Season           Name Age                   Hometown State
## 1     15      Jed Wyatt  25     Sevierville, Tennessee    TN
## 2     14  Jason Tartick  29          Buffalo, New York    NY
## 3     13  Bryan Abasolo  37             Miami, Florida    FL
## 4     12 Jordan Rodgers  27          Chico, California    CA
## 5     11    Shawn Booth  28 Windsor Locks, Connecticut    CT
##                   College              Occupation Win_Loss Height..cm.
## 1      Belmont University        Singer/Sonwriter        1      190.50
## 2 University of Rochester Senior Corporate Banker        1      175.26
## 3   University of Florida            Chiropractor        1      187.96
## 4           Butte College  Former Pro Quarterback        1      187.96
## 5     Keene State College        Personal Trainer        1      187.96
##   Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1                           Yes      Brown     Brown
## 2                            No      Brown     Brown
## 3                            No      Brown     Brown
## 4                            No      Brown     Brown
## 5                            No      Brown     Brown

The table above shows the winners of the last five seasons. Another useful function is the group_by function.

group_by function

group_by(df, State) %>%
  summarise(mean(Height..cm.))

## # A tibble: 32 x 2
##    State `mean(Height..cm.)`
##    <fct>               <dbl>
##  1 AL                   NaN 
##  2 AR                   188.
##  3 AZ                   185.
##  4 CA                   NaN 
##  5 CO                   187.
##  6 CT                   NaN 
##  7 FL                   NaN 
##  8 GA                   NaN 
##  9 IA                   NaN 
## 10 ID                   188.
## # … with 22 more rows

The chunk above uses the group_by method to group the contestants by which state they are from, and then take the average height by state. Another dplyr method, ‘summarise’ is on display here, which allows us to reduce multiple values down to a single value. Another useful function is arrange()

arrange function

as_tibble(tail(arrange(df, Occupation), 15))

## # A tibble: 15 x 12
##    Season Name    Age Hometown State College Occupation Win_Loss
##     <int> <fct> <int> <fct>    <fct> <fct>   <fct>         <int>
##  1     12 "Nic…    26 San Fra… CA    Other   Software …        0
##  2     11 "Ben…    26 Warsaw,… IN    Indian… Software …        0
##  3     14 "Mic…    27 Cincinn… OH    Univer… Sports An…        0
##  4     12 Pete…    26 Rockdal… IL    Joliet… Staffing …        0
##  5     13 Dean…    26 Aspen, … CO    Univer… Startup R…        0
##  6     15 Devi…    27 Sherman… CA    Univer… Talent Ma…        0
##  7     15 Dyla…    24 San Die… CA    Willia… "Tech Ent…        0
##  8     12 Jona…    29 Vancouv… Other Other   Technical…        0
##  9     12 Chri…    26 Los Ang… CA    Califo… Telecom C…        0
## 10     15 Joe …    30 Chicago… IL    North … The Box K…        0
## 11     12 "Ale…    25 Oceansi… CA    Palm B… U.S. Mari…        0
## 12     13 Blak…    29 San Fra… CA    Other   U.S. Mari…        0
## 13     15 Gran…    30 San Cle… CA    Saddle… Unemployed        0
## 14     14 Davi…    25 Cherry … NJ    Univer… Venture C…        0
## 15     12 Luke…    31 Burnet,… TX    West P… War Veter…        0
## # … with 4 more variables: Height..cm. <dbl>,
## #   Girlfriend.While.on.the.Show. <fct>, Hair.Color <fct>, Eye.Color <fct>

The chunk above uses arrange() to sort the contestants alphabetically, while the as_tibble method makes the output more easily viewable.

These demonstrate some of the many uses of the great dplyr package.

Extension

The dplyr package provides several other powerful functions including the select and mutate functions. The following examples highlight the functions’ value and ease-of-use on the same data set of contestants on the Bachelorette season’s 11-15.

select function

Another useful function in the dplyr package is the select function, which allows us to select variables (or columns) by name.

as_tibble(select(df, ends_with("Color")))

## # A tibble: 141 x 2
##    Hair.Color Eye.Color
##    <fct>      <fct>    
##  1 Brown      Brown    
##  2 Brown      Green    
##  3 Brown      Brown    
##  4 Blonde     Brown    
##  5 Brown      Green    
##  6 Brown      Brown    
##  7 Brown      Brown    
##  8 Brown      Brown    
##  9 Brown      Brown    
## 10 Brown      Brown    
## # … with 131 more rows

The tibble above includes only the variables that end with the string ‘Color’ which results in ‘Hair.Color’ and ‘Eye.Color’.

as_tibble(select(df, -ends_with("Color")))

## # A tibble: 141 x 10
##    Season Name    Age Hometown State College Occupation Win_Loss
##     <int> <fct> <int> <fct>    <fct> <fct>   <fct>         <int>
##  1     15 Jed …    25 Sevierv… TN    Belmon… Singer/So…        1
##  2     15 Tyle…    26 Jupiter… FL    Wake F… General C…        0
##  3     15 Pete…    27 Westlak… CA    Baylor… Pilot             0
##  4     15 Luke…    24 Gainesv… GA    Faulkn… Import/Ex…        0
##  5     15 Garr…    27 Homewoo… AL    Missis… Golf Pro          0
##  6     15 Mike…    31 San Ant… TX    NaN     Portfolio…        0
##  7     15 Conn…    24 Birming… MI    Southe… Investmen…        0
##  8     15 Dust…    30 Chicago… IL    Northe… Real Esta…        0
##  9     15 Dyla…    24 San Die… CA    Willia… "Tech Ent…        0
## 10     15 Devi…    27 Sherman… CA    Univer… Talent Ma…        0
## # … with 131 more rows, and 2 more variables: Height..cm. <dbl>,
## #   Girlfriend.While.on.the.Show. <fct>

The tibble above show the simplicity in removing a variable by using the minus symbol (-) as the two variables ending in ‘Color’ are no longer present.

as_tibble(select(df, Occupation, everything()))

## # A tibble: 141 x 12
##    Occupation Season Name    Age Hometown State College Win_Loss
##    <fct>       <int> <fct> <int> <fct>    <fct> <fct>      <int>
##  1 Singer/So…     15 Jed …    25 Sevierv… TN    Belmon…        1
##  2 General C…     15 Tyle…    26 Jupiter… FL    Wake F…        0
##  3 Pilot          15 Pete…    27 Westlak… CA    Baylor…        0
##  4 Import/Ex…     15 Luke…    24 Gainesv… GA    Faulkn…        0
##  5 Golf Pro       15 Garr…    27 Homewoo… AL    Missis…        0
##  6 Portfolio…     15 Mike…    31 San Ant… TX    NaN            0
##  7 Investmen…     15 Conn…    24 Birming… MI    Southe…        0
##  8 Real Esta…     15 Dust…    30 Chicago… IL    Northe…        0
##  9 "Tech Ent…     15 Dyla…    24 San Die… CA    Willia…        0
## 10 Talent Ma…     15 Devi…    27 Sherman… CA    Univer…        0
## # … with 131 more rows, and 4 more variables: Height..cm. <dbl>,
## #   Girlfriend.While.on.the.Show. <fct>, Hair.Color <fct>, Eye.Color <fct>

The resulting tibble above move the Occupation variable to the front of the data frame and output as the tibble.

# first select all variables except Season, then re-select Season
as_tibble(select(df, -Season, Season))

## # A tibble: 141 x 12
##    Name    Age Hometown State College Occupation Win_Loss Height..cm.
##    <fct> <int> <fct>    <fct> <fct>   <fct>         <int>       <dbl>
##  1 Jed …    25 Sevierv… TN    Belmon… Singer/So…        1        190.
##  2 Tyle…    26 Jupiter… FL    Wake F… General C…        0        188.
##  3 Pete…    27 Westlak… CA    Baylor… Pilot             0        175.
##  4 Luke…    24 Gainesv… GA    Faulkn… Import/Ex…        0        175 
##  5 Garr…    27 Homewoo… AL    Missis… Golf Pro          0        NaN 
##  6 Mike…    31 San Ant… TX    NaN     Portfolio…        0        180 
##  7 Conn…    24 Birming… MI    Southe… Investmen…        0        198 
##  8 Dust…    30 Chicago… IL    Northe… Real Esta…        0        188 
##  9 Dyla…    24 San Die… CA    Willia… "Tech Ent…        0        180.
## 10 Devi…    27 Sherman… CA    Univer… Talent Ma…        0        NaN 
## # … with 131 more rows, and 4 more variables:
## #   Girlfriend.While.on.the.Show. <fct>, Hair.Color <fct>,
## #   Eye.Color <fct>, Season <int>

The inverse of moving a variable to the front, the select function can easily move a variable to the end of the data frame as shown above with the Season variable.

Mutate

The dplyr package provides useful tool in the mutate function, which can add or transform variables.

df %>% mutate(
  Year.Of.Birth = 2020 - Age,
  Height..inch = Height..cm. / 2.54 # convert to inches
) %>% head()

##   Season           Name Age                     Hometown State
## 1     15      Jed Wyatt  25       Sevierville, Tennessee    TN
## 2     15  Tyler Cameron  26             Jupiter, Florida    FL
## 3     15    Peter Weber  27 Westlake Village, California    CA
## 4     15    Luke Parker  24         Gainesville, Georgia    GA
## 5     15 Garrett Powell  27            Homewood, Alabama    AL
## 6     15   Mike Johnson  31           San Antonio, Texas    TX
##                        College            Occupation Win_Loss Height..cm.
## 1           Belmont University      Singer/Sonwriter        1      190.50
## 2                  Wake Forest    General Contractor        0      187.96
## 3            Baylor University                 Pilot        0      175.25
## 4          Faulkner University Import/Export Manager        0      175.00
## 5 Mississippi State University              Golf Pro        0         NaN
## 6                          NaN     Portfolio Manager        0      180.00
##   Girlfriend.While.on.the.Show. Hair.Color Eye.Color Year.Of.Birth
## 1                           Yes      Brown     Brown          1995
## 2                            No      Brown     Green          1994
## 3                            No      Brown     Brown          1993
## 4                            No     Blonde     Brown          1996
## 5                            No      Brown     Green          1993
## 6                            No      Brown     Brown          1989
##   Height..inch
## 1     75.00000
## 2     74.00000
## 3     68.99606
## 4     68.89764
## 5          NaN
## 6     70.86614

The resulting data frame above shows the capability of the mutate function to add new variables based on existing variables. The ‘Year.Of.Birth’ is calculated from the existing variable of ‘Age’, and the ‘Height..inch’ is based on the conversion from the existing variable ‘Height..cm.’.

df %>% mutate(State = NULL) %>% head()

##   Season           Name Age                     Hometown
## 1     15      Jed Wyatt  25       Sevierville, Tennessee
## 2     15  Tyler Cameron  26             Jupiter, Florida
## 3     15    Peter Weber  27 Westlake Village, California
## 4     15    Luke Parker  24         Gainesville, Georgia
## 5     15 Garrett Powell  27            Homewood, Alabama
## 6     15   Mike Johnson  31           San Antonio, Texas
##                        College            Occupation Win_Loss Height..cm.
## 1           Belmont University      Singer/Sonwriter        1      190.50
## 2                  Wake Forest    General Contractor        0      187.96
## 3            Baylor University                 Pilot        0      175.25
## 4          Faulkner University Import/Export Manager        0      175.00
## 5 Mississippi State University              Golf Pro        0         NaN
## 6                          NaN     Portfolio Manager        0      180.00
##   Girlfriend.While.on.the.Show. Hair.Color Eye.Color
## 1                           Yes      Brown     Brown
## 2                            No      Brown     Green
## 3                            No      Brown     Brown
## 4                            No     Blonde     Brown
## 5                            No      Brown     Green
## 6                            No      Brown     Brown

The mutate function can easily remove variable by simply setting a variable to NULL as shown above with the variable ‘State’.

The additional functions and examples outline some of the valuable uses of the powerful dplyr package.