FALL 2024

Learning Objectives

In this session students will learn the basics of working with dplyr verbs from the tidyverse as well as employ the pipe operator %>%.

  • Use the piping operator %>% in your code to improve readability

  • Employ dplyr Verbs

    • filter()
    • summarise()
    • mutate()

You will need to start by calling the tidyverse library.

library(tidyverse)

PART I: The Data

In this lesson we will use the Lahman’s Baseball Database, 1871-2019.

1. Install the package

# Import the package 
#install.packages("Lahman")

2. Call the library

library(Lahman)

3. Load in the data

We will be using the People data within the Lahman package. There are other datasets available, such as Batting and Pitching.

# Import the data for baseball players 
data(People)

# Let's take a peek at the data
head(People)
##    playerID birthYear birthMonth birthDay  birthCity birthCountry birthState
## 1 aardsda01      1981         12       27     Denver          USA         CO
## 2 aaronha01      1934          2        5     Mobile          USA         AL
## 3 aaronto01      1939          8        5     Mobile          USA         AL
## 4  aasedo01      1954          9        8     Orange          USA         CA
## 5  abadan01      1972          8       25 Palm Beach          USA         FL
## 6  abadfe01      1985         12       17  La Romana         D.R.  La Romana
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1        NA         NA       NA         <NA>       <NA>      <NA>     David
## 2      2021          1       22          USA         GA   Atlanta      Hank
## 3      1984          8       16          USA         GA   Atlanta    Tommie
## 4        NA         NA       NA         <NA>       <NA>      <NA>       Don
## 5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
## 6        NA         NA       NA         <NA>       <NA>      <NA>  Fernando
##   nameLast        nameGiven weight height bats throws      debut   bbrefID
## 1  Aardsma      David Allan    215     75    R      R 2004-04-06 aardsda01
## 2    Aaron      Henry Louis    180     72    R      R 1954-04-13 aaronha01
## 3    Aaron       Tommie Lee    190     75    R      R 1962-04-10 aaronto01
## 4     Aase   Donald William    190     75    R      R 1977-07-26  aasedo01
## 5     Abad    Fausto Andres    184     73    L      L 2001-09-10  abadan01
## 6     Abad Fernando Antonio    235     74    L      L 2010-07-28  abadfe01
##    finalGame  retroID  deathDate  birthDate
## 1 2015-08-23 aardd001       <NA> 1981-12-27
## 2 1976-10-03 aaroh101 2021-01-22 1934-02-05
## 3 1971-09-26 aarot101 1984-08-16 1939-08-05
## 4 1990-10-03 aased001       <NA> 1954-09-08
## 5 2006-04-13 abada001       <NA> 1972-08-25
## 6 2021-10-01 abadf001       <NA> 1985-12-17

4. Learn about these data

Learn a little bit about the dataset before you start.

?People

5. Data Structure

What are the variables contained in this dataset? How would you classify each variables’ type?

str(People)
## 'data.frame':    21010 obs. of  26 variables:
##  $ playerID    : chr  "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
##  $ birthYear   : int  1981 1934 1939 1954 1972 1985 1850 1877 1869 1866 ...
##  $ birthMonth  : int  12 2 8 9 8 12 11 4 11 10 ...
##  $ birthDay    : int  27 5 5 8 25 17 4 15 11 14 ...
##  $ birthCity   : chr  "Denver" "Mobile" "Mobile" "Orange" ...
##  $ birthCountry: chr  "USA" "USA" "USA" "USA" ...
##  $ birthState  : chr  "CO" "AL" "AL" "CA" ...
##  $ deathYear   : int  NA 2021 1984 NA NA NA 1905 1957 1962 1926 ...
##  $ deathMonth  : int  NA 1 8 NA NA NA 5 1 6 4 ...
##  $ deathDay    : int  NA 22 16 NA NA NA 17 6 11 27 ...
##  $ deathCountry: chr  NA "USA" "USA" NA ...
##  $ deathState  : chr  NA "GA" "GA" NA ...
##  $ deathCity   : chr  NA "Atlanta" "Atlanta" NA ...
##  $ nameFirst   : chr  "David" "Hank" "Tommie" "Don" ...
##  $ nameLast    : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
##  $ nameGiven   : chr  "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
##  $ weight      : int  215 180 190 190 184 235 192 170 175 169 ...
##  $ height      : int  75 72 75 75 73 74 72 71 71 68 ...
##  $ bats        : Factor w/ 3 levels "B","L","R": 3 3 3 3 2 2 3 3 3 2 ...
##  $ throws      : Factor w/ 4 levels "B","L","R","S": 3 3 3 3 2 2 3 3 3 2 ...
##  $ debut       : chr  "2004-04-06" "1954-04-13" "1962-04-10" "1977-07-26" ...
##  $ bbrefID     : chr  "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
##  $ finalGame   : chr  "2015-08-23" "1976-10-03" "1971-09-26" "1990-10-03" ...
##  $ retroID     : chr  "aardd001" "aaroh101" "aarot101" "aased001" ...
##  $ deathDate   : Date, format: NA "2021-01-22" ...
##  $ birthDate   : Date, format: "1981-12-27" "1934-02-05" ...

PART II: Pipe %>% Operator

Ceci n’est pas une pipe

The pipe %>% operator takes a data set from the left and passes it into a function on the right. This operator can be used to help you read your code. When you are reading your code out loud try replacing %>% with saying “and then”. These operators can be used to make your code more efficient (i.e. use less lines and less typing). They can also be used sequentially refine your data.

PART III: Verbs

1. filter

We can use the filter verb to identify the observations that adhere to a specification.

Filtering with strings

Let’s look at only baseball players that are from the United States of America.

Note: You don’t want to print the whole data set, unless you want to scroll for a long time…

## FILTER USA
usa<-People %>%
  filter(birthCountry == "USA")

## How big is this? 
## How many players are from the USA?
dim(usa)
## [1] 18098    26
Try it!

What countries are represented in these data?

unique(People$birthCountry)
##  [1] "USA"            "D.R."           "Venezuela"      "Cuba"          
##  [5] "Mexico"         "Panama"         "CAN"            "P.R."          
##  [9] "Russia"         "Japan"          "Curacao"        "Colombia"      
## [13] "Nicaragua"      "Germany"        "Norway"         "Ireland"       
## [17] "Italy"          "Bahamas"        "United Kingdom" "South Korea"   
## [21] "Australia"      "Czech Republic" "V.I."           "Netherlands"   
## [25] "France"         "Aruba"          NA               "Sweden"        
## [29] "Hong Kong"      "Afghanistan"    "Spain"          "Greece"        
## [33] "Taiwan"         "Philippines"    "Ukriane"        "Jamaica"       
## [37] "Poland"         "Honduras"       "Brazil"         "Viet Nam"      
## [41] "England"        "Guam"           "Denmark"        "Switzerland"   
## [45] "Austria"        "Singapore"      "China"          "Belgium"       
## [49] "Slovakia"       "Peru"           "Belize"         "Indonesia"     
## [53] "Bohemia"        "Finland"        "Lithuania"      "South Africa"  
## [57] "Atlantic Ocean" "American Samoa" "Saudi Arabia"   "Portugal"      
## [61] "Latvia"

Pick your favorite country from this list repeat the example. How many baseball players are from your chosen country?

## INSERT YOUR CODE HERE ## 
Think about it…

What do you think the code would look like to only retain the rows for baseball players from Oregon?

Hint: Use the variable birthState

oregon<-People %>%
  filter(birthCountry == "USA")%>%
  filter(birthState == "OR")

dim(oregon)
## [1] 142  26

Filtering with numbers

Create a new dataframe that only includes players less than 60 inches tall. Show the dataset. How many players are there who are less than 60 inches tall? What are their names?

less5 <- People %>% 
  filter(height < 60)

less5
##    playerID birthYear birthMonth birthDay birthCity birthCountry birthState
## 1 gaedeed01      1925          6        8   Chicago          USA         IL
## 2 healeto01      1853         NA       NA  Cranston          USA         RI
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1      1961          6       18          USA         IL   Chicago     Eddie
## 2      1891          2        6          USA         ME  Lewiston       Tom
##   nameLast   nameGiven weight height bats throws      debut   bbrefID
## 1   Gaedel Edward Carl     65     43    R      L 1951-08-19 gaedeed01
## 2   Healey   Thomas F.    155     55 <NA>      R 1878-06-13 healeto01
##    finalGame  retroID  deathDate  birthDate
## 1 1951-08-19 gaede101 1961-06-18 1925-06-08
## 2 1878-09-09 healt101 1891-02-06       <NA>

2. count

state<-People %>%
  filter(birthCountry == "USA")%>%
  count(birthState)

head(state)
##   birthState    n
## 1         AK   13
## 2         AL  353
## 3         AR  164
## 4         AZ  144
## 5         CA 2474
## 6         CO  101
Learn by doing!

How many players bat right (R), left (L), or both (B)?

Hint: Use the variable bats.

batSide<-People %>%
  count(bats)

batSide
##   bats     n
## 1    B  1274
## 2    L  5509
## 3    R 13009
## 4 <NA>  1218

What does NA mean?

ANSWER HERE:

3. arrange

stateArr<-People %>%
  filter(birthCountry == "USA")%>%
  count(birthState)%>%
  arrange(desc(n))

head(stateArr)
##   birthState    n
## 1         CA 2474
## 2         PA 1471
## 3         NY 1293
## 4         IL 1114
## 5         OH 1085
## 6         TX 1034

4. group_by and summarise

Create subgroups of your data using group_by with a categorical variable. Once the data has been grouped you can perform numerical summaries on them.

Make a dataframe that displays just the average height of players from each country. What country has the tallest players?

tallest <- People %>% 
  group_by(birthCountry) %>% 
  summarise(avg_height=mean(height,na.rm = TRUE)) %>% 
  arrange(desc(avg_height))

tallest
## # A tibble: 61 × 2
##    birthCountry avg_height
##    <chr>             <dbl>
##  1 Indonesia          78  
##  2 Belgium            77  
##  3 Hong Kong          76  
##  4 Jamaica            75.2
##  5 Afghanistan        75  
##  6 Lithuania          75  
##  7 Brazil             74.2
##  8 Guam               74  
##  9 Singapore          74  
## 10 Netherlands        73.5
## # … with 51 more rows

5. select

Rather than keeping all of the variables, we can select the ones that we are interested in using.

peopleS<-People%>%
  select(c(nameFirst, nameLast, weight, height))

str(peopleS)
## 'data.frame':    21010 obs. of  4 variables:
##  $ nameFirst: chr  "David" "Hank" "Tommie" "Don" ...
##  $ nameLast : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
##  $ weight   : int  215 180 190 190 184 235 192 170 175 169 ...
##  $ height   : int  75 72 75 75 73 74 72 71 71 68 ...

6. mutate

The mutate function can be used to create a new variable (a new column) as a function of other columns.

Example BMI

Body Mass Index (BMI) is a convenient rule of thumb used to broadly categorize a person as underweight, normal weight, overweight, or obese based on tissue mass (muscle, fat, and bone) and height.

\[BMI = \frac{weight (lbs) \times 703}{height^2 (in^2)}\]

peopleBMI<-People %>% 
  select(c(nameFirst, nameLast, weight, height))%>%
  mutate(bmi=(weight*703)/(height^2))

## OBSERVE THAT THE NEW COLUMN IS ADDED ON
str(peopleBMI)
## 'data.frame':    21010 obs. of  5 variables:
##  $ nameFirst: chr  "David" "Hank" "Tommie" "Don" ...
##  $ nameLast : chr  "Aardsma" "Aaron" "Aaron" "Aase" ...
##  $ weight   : int  215 180 190 190 184 235 192 170 175 169 ...
##  $ height   : int  75 72 75 75 73 74 72 71 71 68 ...
##  $ bmi      : num  26.9 24.4 23.7 23.7 24.3 ...

The CDC defines a BMI greater than 30 “within an obese range”. Create a dataframe of only the obese players. How many players are in this dataset? Which player has the highest BMI?

bmi30<-People %>% 
  select(c(nameFirst, nameLast, weight, height))%>%
  mutate(bmi=(weight*703)/(height^2))%>%
  filter(bmi>30)%>%
  arrange(desc(bmi))

head(bmi30)
##   nameFirst nameLast weight height      bmi
## 1      John McSherry    351     75 43.86720
## 2   Bartolo    Colon    285     71 39.74509
## 3     Pablo Sandoval    268     70 38.44980
## 4    Prince  Fielder    275     71 38.35053
## 5     Jumbo     Diaz    315     76 38.33882
## 6     Reyes  Moronta    265     70 38.01939

Example Dates

If you look back at the structure of the data set, you might notice that birthDate and deathDate are Date variables. This is a new type of variable for us! If you subtract Date variables you will get the number of data between the dates in question.

Creating a dataset for only baseball players who have deceased, add a new column to the data set for a player’s lifetime (in years). You can assume that each year has 365 days. Which baseball player lived the longest? How many years old was he?

oldest <- People %>% 
  select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
  filter(is.na(deathYear)==FALSE) %>% 
  mutate(lifetime=(deathDate-birthDate)/365) %>% 
  arrange(desc(lifetime))

head(oldest)
##   nameFirst  nameLast  birthDate  deathDate deathYear      lifetime
## 1       Red      Hoff 1891-05-08 1998-09-17      1998 107.4329 days
## 2    Connie   Marrero 1911-04-25 2014-04-23      2014 103.0658 days
## 3       Bob    Wright 1891-12-13 1993-07-30      1993 101.6959 days
## 4       Ace    Parker 1912-05-17 2013-11-06      2013 101.5425 days
## 5      Tony Malinosky 1909-10-07 2011-02-08      2011 101.4082 days
## 6    George     Elder 1921-03-10 2022-07-07      2022 101.3945 days

If we want to remove this days label we can use the as.numeric wrapper function.

oldest <- People %>% 
  select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
  filter(is.na(deathYear)==FALSE) %>% 
  mutate(lifetime=as.numeric((deathDate-birthDate)/365) )%>% 
  arrange(desc(lifetime))

head(oldest)
##   nameFirst  nameLast  birthDate  deathDate deathYear lifetime
## 1       Red      Hoff 1891-05-08 1998-09-17      1998 107.4329
## 2    Connie   Marrero 1911-04-25 2014-04-23      2014 103.0658
## 3       Bob    Wright 1891-12-13 1993-07-30      1993 101.6959
## 4       Ace    Parker 1912-05-17 2013-11-06      2013 101.5425
## 5      Tony Malinosky 1909-10-07 2011-02-08      2011 101.4082
## 6    George     Elder 1921-03-10 2022-07-07      2022 101.3945
Try it!

Convert the debut and finalGame variables to dates using the as.Date() function. Add a new column to the data set for a player’s career (in years). You can assume that each year has 365 days. Which baseball player had the longest career? How long was it?

longest <- People %>% 
  select(c(nameFirst, nameLast, finalGame, debut))%>%
  mutate(career=as.numeric((as.Date(finalGame)-as.Date(debut))/365)) %>% 
  arrange(desc(career))

head(longest)
##   nameFirst nameLast  finalGame      debut   career
## 1      Nick  Altrock 1933-10-01 1898-07-14 35.23836
## 2       Jim O'Rourke 1904-09-22 1872-04-26 32.42740
## 3    Minnie   Minoso 1980-10-05 1949-04-19 31.48493
## 4   Charley  O'Leary 1934-09-30 1904-04-14 30.48219
## 5     Arlie   Latham 1909-09-30 1880-07-05 29.25479
## 6    Deacon  McGuire 1912-05-18 1884-06-21 27.92329