Learning Objectives

In this session students will learn the basics of working with dplyr verbs from the tidyverse as well as employ the pipe operator %>%.

  • Use the piping operator %>% in your code to improve readability
  • Employ dplyr Verbs

    • filter()
    • count()
    • arrange()
    • group_by and summarise()
    • select()
    • mutate()

You will need to start by calling the tidyverse library.

library(tidyverse)

PART I: The Data

In this lesson we will use the Lahman’s Baseball Database, 1871-2019.

1. Install the package

# Import the package 
#install.packages("Lahman")

2. Call the library

library(Lahman)

3. Load in the data

We will be using the People data within the Lahman package. There are other datasets available, such as Batting and Pitching.

# Import the data for baseball players 
data(People)

# Let's take a peek at the data
head(People)
##    playerID birthYear birthMonth birthDay birthCountry birthState  birthCity
## 1 aardsda01      1981         12       27          USA         CO     Denver
## 2 aaronha01      1934          2        5          USA         AL     Mobile
## 3 aaronto01      1939          8        5          USA         AL     Mobile
## 4  aasedo01      1954          9        8          USA         CA     Orange
## 5  abadan01      1972          8       25          USA         FL Palm Beach
## 6  abadfe01      1985         12       17         D.R.  La Romana  La Romana
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1        NA         NA       NA         <NA>       <NA>      <NA>     David
## 2      2021          1       22          USA         GA   Atlanta      Hank
## 3      1984          8       16          USA         GA   Atlanta    Tommie
## 4        NA         NA       NA         <NA>       <NA>      <NA>       Don
## 5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
## 6        NA         NA       NA         <NA>       <NA>      <NA>  Fernando
##   nameLast        nameGiven weight height bats throws      debut  finalGame
## 1  Aardsma      David Allan    215     75    R      R 2004-04-06 2015-08-23
## 2    Aaron      Henry Louis    180     72    R      R 1954-04-13 1976-10-03
## 3    Aaron       Tommie Lee    190     75    R      R 1962-04-10 1971-09-26
## 4     Aase   Donald William    190     75    R      R 1977-07-26 1990-10-03
## 5     Abad    Fausto Andres    184     73    L      L 2001-09-10 2006-04-13
## 6     Abad Fernando Antonio    235     74    L      L 2010-07-28 2021-10-01
##    retroID   bbrefID  deathDate  birthDate
## 1 aardd001 aardsda01       <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001  aasedo01       <NA> 1954-09-08
## 5 abada001  abadan01       <NA> 1972-08-25
## 6 abadf001  abadfe01       <NA> 1985-12-17

4. Learn about these data

Learn a little bit about the dataset before you start.

?People

5. Data Structure

What are the variables contained in this dataset? How would you classify each variables’ type?

str(People)
## 'data.frame':    20370 obs. of  26 variables:
##  $ playerID    : chr  "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
##  $ birthYear   : int  1981 1934 1939 1954 1972 1985 1850 1877 1869 1866 ...
##  $ birthMonth  : int  12 2 8 9 8 12 11 4 11 10 ...
##  $ birthDay    : int  27 5 5 8 25 17 4 15 11 14 ...
##  $ birthCountry: chr  "USA" "USA" "USA" "USA" ...
##  $ birthState  : chr  "CO" "AL" "AL" "CA" ...
##  $ birthCity   : chr  "Denver" "Mobile" "Mobile" "Orange" ...
##  $ deathYear   : int  NA 2021 1984 NA NA NA 1905 1957 1962 1926 ...
##  $ deathMonth  : int  NA 1 8 NA NA NA 5 1 6 4 ...
##  $ deathDay    : int  NA 22 16 NA NA NA 17 6 11 27 ...
##  $ deathCountry: chr  NA "USA" "USA" NA ...
##  $ deathState  : chr  NA "GA" "GA" NA ...
##  $ deathCity   : chr  NA "Atlanta" "Atlanta" NA ...
##  $ nameFirst   : chr  "David" "Hank" "Tommie" "Don" ...
##  $ nameLast    : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
##  $ nameGiven   : chr  "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
##  $ weight      : int  215 180 190 190 184 235 192 170 175 169 ...
##  $ height      : int  75 72 75 75 73 74 72 71 71 68 ...
##  $ bats        : Factor w/ 3 levels "B","L","R": 3 3 3 3 2 2 3 3 3 2 ...
##  $ throws      : Factor w/ 3 levels "L","R","S": 2 2 2 2 1 1 2 2 2 1 ...
##  $ debut       : chr  "2004-04-06" "1954-04-13" "1962-04-10" "1977-07-26" ...
##  $ finalGame   : chr  "2015-08-23" "1976-10-03" "1971-09-26" "1990-10-03" ...
##  $ retroID     : chr  "aardd001" "aaroh101" "aarot101" "aased001" ...
##  $ bbrefID     : chr  "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
##  $ deathDate   : Date, format: NA "2021-01-22" ...
##  $ birthDate   : Date, format: "1981-12-27" "1934-02-05" ...

PART II: Pipe %>% Operator

Ceci n’est pas une pipe

The pipe %>% operator takes a data set from the left and passes it into a function on the right. This operator can be used to help you read your code. When you are reading your code out loud try replacing %>% with saying “and then”. These operators can be used to make your code more efficient (i.e. use less lines and less typing). They can also be used sequentially refine your data.

PART III: Verbs

1. filter

We can use the filter verb to identify the observations that adhere to a specification.

Filtering with strings

Let’s look at only baseball players that are from the United States of America.

Note: You don’t want to print the whole data set, unless you want to scroll for a long time…

## FILTER USA
usa<-People %>%
  filter(birthCountry == "USA")

## How big is this? 
## How many players are from the USA?
dim(usa)
## [1] 17601    26
Try it!

What countries are represented in these data?

unique(People$birthCountry)
##  [1] "USA"            "D.R."           "Venezuela"      "Cuba"          
##  [5] "Mexico"         "Panama"         "CAN"            "P.R."          
##  [9] "Russia"         "Japan"          "Curacao"        "Colombia"      
## [13] "Nicaragua"      "Germany"        "Norway"         "Ireland"       
## [17] "Italy"          "Bahamas"        "United Kingdom" "South Korea"   
## [21] "Australia"      "Czech Republic" "V.I."           "Netherlands"   
## [25] "France"         "Aruba"          NA               "Sweden"        
## [29] "Hong Kong"      "Afghanistan"    "Spain"          "Greece"        
## [33] "Taiwan"         "Philippines"    "Jamaica"        "Poland"        
## [37] "Honduras"       "Brazil"         "Viet Nam"       "Guam"          
## [41] "Denmark"        "Switzerland"    "Austria"        "Singapore"     
## [45] "China"          "Belgium"        "Peru"           "Belize"        
## [49] "Indonesia"      "Finland"        "Lithuania"      "South Africa"  
## [53] "At Sea"         "Slovakia"       "American Samoa" "Saudi Arabia"  
## [57] "Portugal"       "Latvia"

Pick your favorite country from this list repeat the example. How many baseball players are from your chosen country?

## INSERT YOUR CODE HERE ## 
Think about it…

What do you think the code would look like to only retain the rows for baseball players from Oregon?

Hint: Use the variable birthState

oregon<-People %>%
  filter(birthCountry == "USA")%>%
  filter(birthState == "OR")

dim(oregon)
## [1] 138  26

Filtering with numbers

Create a new dataframe that only includes players less than 60 inches tall. Show the dataset. How many players are there who are less than 60 inches tall? What are their names?

short <- People %>% 
  filter(height < 60)

short
##    playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 gaedeed01      1925          6        8          USA         IL   Chicago
## 2 healeto01      1853         NA       NA          USA         RI  Cranston
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1      1961          6       18          USA         IL   Chicago     Eddie
## 2      1891          2        6          USA         ME  Lewiston       Tom
##   nameLast   nameGiven weight height bats throws      debut  finalGame  retroID
## 1   Gaedel Edward Carl     65     43    R      L 1951-08-19 1951-08-19 gaede101
## 2   Healey   Thomas F.    155     55 <NA>      R 1878-06-13 1878-09-09 healt101
##     bbrefID  deathDate  birthDate
## 1 gaedeed01 1961-06-18 1925-06-08
## 2 healeto01 1891-02-06       <NA>
## PULLING OUT JUST THE FIRST AND LAST NAME 
## AND COMBINING INTO A STRING
paste(short$nameFirst, short$nameLast)
## [1] "Eddie Gaedel" "Tom Healey"

2. count

state<-People %>%
  filter(birthCountry == "USA")%>%
  count(birthState)

head(state)
##   birthState    n
## 1         AK   12
## 2         AL  347
## 3         AR  161
## 4         AZ  130
## 5         CA 2386
## 6         CO   98
Learn by doing!

How many players bat right (R), left (L), or both (B)?

Hint: Use the variable bats.

batSide<-People %>%
  count(bats)

batSide
##   bats     n
## 1    B  1242
## 2    L  5321
## 3    R 12626
## 4 <NA>  1181

What does NA mean?

ANSWER HERE:

3. arrange

stateArr<-People %>%
  filter(birthCountry == "USA")%>%
  count(birthState)%>%
  arrange(desc(n))

head(stateArr)
##   birthState    n
## 1         CA 2386
## 2         PA 1458
## 3         NY 1260
## 4         IL 1093
## 5         OH 1068
## 6         TX  980

4. group_by and summarise

Create subgroups of your data using group_by with a categorical variable. Once the data has been grouped you can perform numerical summaries on them.

Make a dataframe that displays just the average height of players from each country. What country has the tallest players?

tallest <- People %>% 
  group_by(birthCountry) %>% 
  summarise(avg_height=mean(height,na.rm = TRUE)) %>% 
  arrange(desc(avg_height))

tallest
## # A tibble: 58 × 2
##    birthCountry avg_height
##    <chr>             <dbl>
##  1 Indonesia          78  
##  2 Belgium            77  
##  3 Hong Kong          76  
##  4 Jamaica            75.2
##  5 Afghanistan        75  
##  6 Lithuania          75  
##  7 Guam               74.5
##  8 Brazil             74.2
##  9 Singapore          74  
## 10 Australia          73.5
## # … with 48 more rows

5. select

Rather than keeping all of the variables, we can select the ones that we are interested in using.

peopleS<-People%>%
  select(c(nameFirst, nameLast, weight, height))

str(peopleS)
## 'data.frame':    20370 obs. of  4 variables:
##  $ nameFirst: chr  "David" "Hank" "Tommie" "Don" ...
##  $ nameLast : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
##  $ weight   : int  215 180 190 190 184 235 192 170 175 169 ...
##  $ height   : int  75 72 75 75 73 74 72 71 71 68 ...

6. mutate

The mutate function can be used to create a new variable (a new column) as a function of other columns.

Example BMI

Body Mass Index (BMI) is a convenient rule of thumb used to broadly categorize a person as underweight, normal weight, overweight, or obese based on tissue mass (muscle, fat, and bone) and height.

\[BMI = \frac{weight (lbs) \times 703}{height^2 (in^2)}\]

peopleBMI<-People %>% 
  select(c(nameFirst, nameLast, weight, height))%>%
  mutate(bmi=(weight*703)/(height^2))

## OBSERVE THAT THE NEW COLUMN IS ADDED ON
str(peopleBMI)
## 'data.frame':    20370 obs. of  5 variables:
##  $ nameFirst: chr  "David" "Hank" "Tommie" "Don" ...
##  $ nameLast : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
##  $ weight   : int  215 180 190 190 184 235 192 170 175 169 ...
##  $ height   : int  75 72 75 75 73 74 72 71 71 68 ...
##  $ bmi      : num  26.9 24.4 23.7 23.7 24.3 ...

The CDC defines a BMI greater than 30 “within an obese range”. Create a dataframe of only the obese players. How many players are in this dataset? Which player has the highest BMI?

bmi30<-People %>% 
  select(c(nameFirst, nameLast, weight, height))%>%
  mutate(bmi=(weight*703)/(height^2))%>%
  filter(bmi>30)%>%
  arrange(desc(bmi))

head(bmi30)
##   nameFirst nameLast weight height      bmi
## 1 Alejandro     Kirk    265     68 40.28871
## 2   Bartolo    Colon    285     71 39.74509
## 3     Pablo Sandoval    268     70 38.44980
## 4    Prince  Fielder    275     71 38.35053
## 5     Jumbo     Diaz    315     76 38.33882
## 6     Reyes  Moronta    265     70 38.01939

Example Dates

If you look back at the structure of the data set, you might notice that birthDate and deathDate are Date variables. This is a new type of variable for us! If you subtract Date variables you will get the number of data between the dates in question.

Creating a dataset for only baseball players who have deceased, add a new column to the data set for a player’s lifetime (in years). You can assume that each year has 365 days. Which baseball player lived the longest? How many years old was he?

oldest <- People %>% 
  select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
  filter(is.na(deathYear)==FALSE) %>% 
  mutate(lifetime=(deathDate-birthDate)/365) %>% 
  arrange(desc(lifetime))

head(oldest)
##   nameFirst  nameLast  birthDate  deathDate deathYear      lifetime
## 1       Red      Hoff 1891-05-08 1998-09-17      1998 107.4329 days
## 2    Connie   Marrero 1911-04-25 2014-04-23      2014 103.0658 days
## 3       Bob    Wright 1891-12-13 1993-07-30      1993 101.6959 days
## 4       Ace    Parker 1912-05-17 2013-11-06      2013 101.5425 days
## 5      Tony Malinosky 1909-10-07 2011-02-08      2011 101.4082 days
## 6      Karl   Swanson 1900-12-17 2002-04-03      2002 101.3616 days

If we want to remove this days label we can use the as.numeric wrapper function.

oldest <- People %>% 
  select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
  filter(is.na(deathYear)==FALSE) %>% 
  mutate(lifetime=as.numeric((deathDate-birthDate)/365) )%>% 
  arrange(desc(lifetime))

head(oldest)
##   nameFirst  nameLast  birthDate  deathDate deathYear lifetime
## 1       Red      Hoff 1891-05-08 1998-09-17      1998 107.4329
## 2    Connie   Marrero 1911-04-25 2014-04-23      2014 103.0658
## 3       Bob    Wright 1891-12-13 1993-07-30      1993 101.6959
## 4       Ace    Parker 1912-05-17 2013-11-06      2013 101.5425
## 5      Tony Malinosky 1909-10-07 2011-02-08      2011 101.4082
## 6      Karl   Swanson 1900-12-17 2002-04-03      2002 101.3616
Try it!

Convert the debut and finalGame variables to dates using the as.Date() function. Add a new column to the data set for a player’s career (in years). You can assume that each year has 365 days. Which baseball player had the longest career? How long was it?

longest <- People %>% 
  select(c(nameFirst, nameLast, finalGame, debut))%>%
  mutate(career=as.numeric((as.Date(finalGame)-as.Date(debut))/365)) %>% 
  arrange(desc(career))

head(longest)
##   nameFirst nameLast  finalGame      debut   career
## 1      Nick  Altrock 1933-10-01 1898-07-14 35.23836
## 2       Jim O'Rourke 1904-09-22 1872-04-26 32.42740
## 3    Minnie   Minoso 1980-10-05 1949-04-19 31.48493
## 4   Charley  O'Leary 1934-09-30 1904-04-14 30.48219
## 5     Arlie   Latham 1909-09-30 1880-07-05 29.25479
## 6    Deacon  McGuire 1912-05-18 1884-06-21 27.92329