dplyr
VerbsIn this session students will learn the basics of working with dplyr
verbs from the tidyverse
as well as employ the pipe operator %>%
.
%>%
in your code to improve readabilityEmploy dplyr
Verbs
filter()
count()
arrange()
group_by
and summarise()
select()
mutate()
You will need to start by calling the tidyverse
library.
library(tidyverse)
In this lesson we will use the Lahman’s Baseball Database, 1871-2019.
# Import the package
#install.packages("Lahman")
library(Lahman)
We will be using the People
data within the Lahman
package. There are other datasets available, such as Batting
and Pitching
.
# Import the data for baseball players
data(People)
# Let's take a peek at the data
head(People)
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 aardsda01 1981 12 27 USA CO Denver
## 2 aaronha01 1934 2 5 USA AL Mobile
## 3 aaronto01 1939 8 5 USA AL Mobile
## 4 aasedo01 1954 9 8 USA CA Orange
## 5 abadan01 1972 8 25 USA FL Palm Beach
## 6 abadfe01 1985 12 17 D.R. La Romana La Romana
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 NA NA NA <NA> <NA> <NA> David
## 2 2021 1 22 USA GA Atlanta Hank
## 3 1984 8 16 USA GA Atlanta Tommie
## 4 NA NA NA <NA> <NA> <NA> Don
## 5 NA NA NA <NA> <NA> <NA> Andy
## 6 NA NA NA <NA> <NA> <NA> Fernando
## nameLast nameGiven weight height bats throws debut finalGame
## 1 Aardsma David Allan 215 75 R R 2004-04-06 2015-08-23
## 2 Aaron Henry Louis 180 72 R R 1954-04-13 1976-10-03
## 3 Aaron Tommie Lee 190 75 R R 1962-04-10 1971-09-26
## 4 Aase Donald William 190 75 R R 1977-07-26 1990-10-03
## 5 Abad Fausto Andres 184 73 L L 2001-09-10 2006-04-13
## 6 Abad Fernando Antonio 235 74 L L 2010-07-28 2021-10-01
## retroID bbrefID deathDate birthDate
## 1 aardd001 aardsda01 <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001 aasedo01 <NA> 1954-09-08
## 5 abada001 abadan01 <NA> 1972-08-25
## 6 abadf001 abadfe01 <NA> 1985-12-17
Learn a little bit about the dataset before you start.
?People
What are the variables contained in this dataset? How would you classify each variables’ type?
str(People)
## 'data.frame': 20370 obs. of 26 variables:
## $ playerID : chr "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
## $ birthYear : int 1981 1934 1939 1954 1972 1985 1850 1877 1869 1866 ...
## $ birthMonth : int 12 2 8 9 8 12 11 4 11 10 ...
## $ birthDay : int 27 5 5 8 25 17 4 15 11 14 ...
## $ birthCountry: chr "USA" "USA" "USA" "USA" ...
## $ birthState : chr "CO" "AL" "AL" "CA" ...
## $ birthCity : chr "Denver" "Mobile" "Mobile" "Orange" ...
## $ deathYear : int NA 2021 1984 NA NA NA 1905 1957 1962 1926 ...
## $ deathMonth : int NA 1 8 NA NA NA 5 1 6 4 ...
## $ deathDay : int NA 22 16 NA NA NA 17 6 11 27 ...
## $ deathCountry: chr NA "USA" "USA" NA ...
## $ deathState : chr NA "GA" "GA" NA ...
## $ deathCity : chr NA "Atlanta" "Atlanta" NA ...
## $ nameFirst : chr "David" "Hank" "Tommie" "Don" ...
## $ nameLast : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
## $ nameGiven : chr "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
## $ weight : int 215 180 190 190 184 235 192 170 175 169 ...
## $ height : int 75 72 75 75 73 74 72 71 71 68 ...
## $ bats : Factor w/ 3 levels "B","L","R": 3 3 3 3 2 2 3 3 3 2 ...
## $ throws : Factor w/ 3 levels "L","R","S": 2 2 2 2 1 1 2 2 2 1 ...
## $ debut : chr "2004-04-06" "1954-04-13" "1962-04-10" "1977-07-26" ...
## $ finalGame : chr "2015-08-23" "1976-10-03" "1971-09-26" "1990-10-03" ...
## $ retroID : chr "aardd001" "aaroh101" "aarot101" "aased001" ...
## $ bbrefID : chr "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
## $ deathDate : Date, format: NA "2021-01-22" ...
## $ birthDate : Date, format: "1981-12-27" "1934-02-05" ...
%>%
OperatorCeci n’est pas une pipe
The pipe %>%
operator takes a data set from the left and passes it into a function on the right. This operator can be used to help you read your code. When you are reading your code out loud try replacing %>%
with saying “and then”. These operators can be used to make your code more efficient (i.e. use less lines and less typing). They can also be used sequentially refine your data.
filter
We can use the filter
verb to identify the observations that adhere to a specification.
Let’s look at only baseball players that are from the United States of America.
Note: You don’t want to print the whole data set, unless you want to scroll for a long time…
## FILTER USA
usa<-People %>%
filter(birthCountry == "USA")
## How big is this?
## How many players are from the USA?
dim(usa)
## [1] 17601 26
What countries are represented in these data?
unique(People$birthCountry)
## [1] "USA" "D.R." "Venezuela" "Cuba"
## [5] "Mexico" "Panama" "CAN" "P.R."
## [9] "Russia" "Japan" "Curacao" "Colombia"
## [13] "Nicaragua" "Germany" "Norway" "Ireland"
## [17] "Italy" "Bahamas" "United Kingdom" "South Korea"
## [21] "Australia" "Czech Republic" "V.I." "Netherlands"
## [25] "France" "Aruba" NA "Sweden"
## [29] "Hong Kong" "Afghanistan" "Spain" "Greece"
## [33] "Taiwan" "Philippines" "Jamaica" "Poland"
## [37] "Honduras" "Brazil" "Viet Nam" "Guam"
## [41] "Denmark" "Switzerland" "Austria" "Singapore"
## [45] "China" "Belgium" "Peru" "Belize"
## [49] "Indonesia" "Finland" "Lithuania" "South Africa"
## [53] "At Sea" "Slovakia" "American Samoa" "Saudi Arabia"
## [57] "Portugal" "Latvia"
Pick your favorite country from this list repeat the example. How many baseball players are from your chosen country?
## INSERT YOUR CODE HERE ##
What do you think the code would look like to only retain the rows for baseball players from Oregon?
Hint: Use the variable birthState
oregon<-People %>%
filter(birthCountry == "USA")%>%
filter(birthState == "OR")
dim(oregon)
## [1] 138 26
Create a new dataframe that only includes players less than 60 inches tall. Show the dataset. How many players are there who are less than 60 inches tall? What are their names?
short <- People %>%
filter(height < 60)
short
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 gaedeed01 1925 6 8 USA IL Chicago
## 2 healeto01 1853 NA NA USA RI Cranston
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 1961 6 18 USA IL Chicago Eddie
## 2 1891 2 6 USA ME Lewiston Tom
## nameLast nameGiven weight height bats throws debut finalGame retroID
## 1 Gaedel Edward Carl 65 43 R L 1951-08-19 1951-08-19 gaede101
## 2 Healey Thomas F. 155 55 <NA> R 1878-06-13 1878-09-09 healt101
## bbrefID deathDate birthDate
## 1 gaedeed01 1961-06-18 1925-06-08
## 2 healeto01 1891-02-06 <NA>
## PULLING OUT JUST THE FIRST AND LAST NAME
## AND COMBINING INTO A STRING
paste(short$nameFirst, short$nameLast)
## [1] "Eddie Gaedel" "Tom Healey"
count
state<-People %>%
filter(birthCountry == "USA")%>%
count(birthState)
head(state)
## birthState n
## 1 AK 12
## 2 AL 347
## 3 AR 161
## 4 AZ 130
## 5 CA 2386
## 6 CO 98
How many players bat right (R), left (L), or both (B)?
Hint: Use the variable bats
.
batSide<-People %>%
count(bats)
batSide
## bats n
## 1 B 1242
## 2 L 5321
## 3 R 12626
## 4 <NA> 1181
What does NA
mean?
ANSWER HERE:
arrange
stateArr<-People %>%
filter(birthCountry == "USA")%>%
count(birthState)%>%
arrange(desc(n))
head(stateArr)
## birthState n
## 1 CA 2386
## 2 PA 1458
## 3 NY 1260
## 4 IL 1093
## 5 OH 1068
## 6 TX 980
group_by
and summarise
Create subgroups of your data using group_by
with a categorical variable. Once the data has been grouped you can perform numerical summaries on them.
Make a dataframe that displays just the average height of players from each country. What country has the tallest players?
tallest <- People %>%
group_by(birthCountry) %>%
summarise(avg_height=mean(height,na.rm = TRUE)) %>%
arrange(desc(avg_height))
tallest
## # A tibble: 58 × 2
## birthCountry avg_height
## <chr> <dbl>
## 1 Indonesia 78
## 2 Belgium 77
## 3 Hong Kong 76
## 4 Jamaica 75.2
## 5 Afghanistan 75
## 6 Lithuania 75
## 7 Guam 74.5
## 8 Brazil 74.2
## 9 Singapore 74
## 10 Australia 73.5
## # … with 48 more rows
select
Rather than keeping all of the variables, we can select the ones that we are interested in using.
peopleS<-People%>%
select(c(nameFirst, nameLast, weight, height))
str(peopleS)
## 'data.frame': 20370 obs. of 4 variables:
## $ nameFirst: chr "David" "Hank" "Tommie" "Don" ...
## $ nameLast : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
## $ weight : int 215 180 190 190 184 235 192 170 175 169 ...
## $ height : int 75 72 75 75 73 74 72 71 71 68 ...
mutate
The mutate
function can be used to create a new variable (a new column) as a function of other columns.
Body Mass Index (BMI) is a convenient rule of thumb used to broadly categorize a person as underweight, normal weight, overweight, or obese based on tissue mass (muscle, fat, and bone) and height.
\[BMI = \frac{weight (lbs) \times 703}{height^2 (in^2)}\]
peopleBMI<-People %>%
select(c(nameFirst, nameLast, weight, height))%>%
mutate(bmi=(weight*703)/(height^2))
## OBSERVE THAT THE NEW COLUMN IS ADDED ON
str(peopleBMI)
## 'data.frame': 20370 obs. of 5 variables:
## $ nameFirst: chr "David" "Hank" "Tommie" "Don" ...
## $ nameLast : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
## $ weight : int 215 180 190 190 184 235 192 170 175 169 ...
## $ height : int 75 72 75 75 73 74 72 71 71 68 ...
## $ bmi : num 26.9 24.4 23.7 23.7 24.3 ...
The CDC defines a BMI greater than 30 “within an obese range”. Create a dataframe of only the obese players. How many players are in this dataset? Which player has the highest BMI?
bmi30<-People %>%
select(c(nameFirst, nameLast, weight, height))%>%
mutate(bmi=(weight*703)/(height^2))%>%
filter(bmi>30)%>%
arrange(desc(bmi))
head(bmi30)
## nameFirst nameLast weight height bmi
## 1 Alejandro Kirk 265 68 40.28871
## 2 Bartolo Colon 285 71 39.74509
## 3 Pablo Sandoval 268 70 38.44980
## 4 Prince Fielder 275 71 38.35053
## 5 Jumbo Diaz 315 76 38.33882
## 6 Reyes Moronta 265 70 38.01939
If you look back at the structure of the data set, you might notice that birthDate
and deathDate
are Date
variables. This is a new type of variable for us! If you subtract Date
variables you will get the number of data between the dates in question.
Creating a dataset for only baseball players who have deceased, add a new column to the data set for a player’s lifetime (in years). You can assume that each year has 365 days. Which baseball player lived the longest? How many years old was he?
oldest <- People %>%
select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
filter(is.na(deathYear)==FALSE) %>%
mutate(lifetime=(deathDate-birthDate)/365) %>%
arrange(desc(lifetime))
head(oldest)
## nameFirst nameLast birthDate deathDate deathYear lifetime
## 1 Red Hoff 1891-05-08 1998-09-17 1998 107.4329 days
## 2 Connie Marrero 1911-04-25 2014-04-23 2014 103.0658 days
## 3 Bob Wright 1891-12-13 1993-07-30 1993 101.6959 days
## 4 Ace Parker 1912-05-17 2013-11-06 2013 101.5425 days
## 5 Tony Malinosky 1909-10-07 2011-02-08 2011 101.4082 days
## 6 Karl Swanson 1900-12-17 2002-04-03 2002 101.3616 days
If we want to remove this days
label we can use the as.numeric
wrapper function.
oldest <- People %>%
select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
filter(is.na(deathYear)==FALSE) %>%
mutate(lifetime=as.numeric((deathDate-birthDate)/365) )%>%
arrange(desc(lifetime))
head(oldest)
## nameFirst nameLast birthDate deathDate deathYear lifetime
## 1 Red Hoff 1891-05-08 1998-09-17 1998 107.4329
## 2 Connie Marrero 1911-04-25 2014-04-23 2014 103.0658
## 3 Bob Wright 1891-12-13 1993-07-30 1993 101.6959
## 4 Ace Parker 1912-05-17 2013-11-06 2013 101.5425
## 5 Tony Malinosky 1909-10-07 2011-02-08 2011 101.4082
## 6 Karl Swanson 1900-12-17 2002-04-03 2002 101.3616
Convert the debut and finalGame variables to dates using the as.Date()
function. Add a new column to the data set for a player’s career (in years). You can assume that each year has 365 days. Which baseball player had the longest career? How long was it?
longest <- People %>%
select(c(nameFirst, nameLast, finalGame, debut))%>%
mutate(career=as.numeric((as.Date(finalGame)-as.Date(debut))/365)) %>%
arrange(desc(career))
head(longest)
## nameFirst nameLast finalGame debut career
## 1 Nick Altrock 1933-10-01 1898-07-14 35.23836
## 2 Jim O'Rourke 1904-09-22 1872-04-26 32.42740
## 3 Minnie Minoso 1980-10-05 1949-04-19 31.48493
## 4 Charley O'Leary 1934-09-30 1904-04-14 30.48219
## 5 Arlie Latham 1909-09-30 1880-07-05 29.25479
## 6 Deacon McGuire 1912-05-18 1884-06-21 27.92329