tidyverse
In this session students will learn the basics of working with
dplyr
verbs from the tidyverse
as well as
employ the pipe operator %>%
.
Use the piping operator %>%
in your code to
improve readability
Employ dplyr
Verbs
filter()
summarise()
mutate()
You will need to start by calling the tidyverse
library.
library(tidyverse)
In this lesson we will use the Lahman’s Baseball Database, 1871-2019.
# Import the package
#install.packages("Lahman")
library(Lahman)
We will be using the People
data within the
Lahman
package. There are other datasets available, such as
Batting
and Pitching
.
# Import the data for baseball players
data(People)
# Let's take a peek at the data
head(People)
## playerID birthYear birthMonth birthDay birthCity birthCountry birthState
## 1 aardsda01 1981 12 27 Denver USA CO
## 2 aaronha01 1934 2 5 Mobile USA AL
## 3 aaronto01 1939 8 5 Mobile USA AL
## 4 aasedo01 1954 9 8 Orange USA CA
## 5 abadan01 1972 8 25 Palm Beach USA FL
## 6 abadfe01 1985 12 17 La Romana D.R. La Romana
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 NA NA NA <NA> <NA> <NA> David
## 2 2021 1 22 USA GA Atlanta Hank
## 3 1984 8 16 USA GA Atlanta Tommie
## 4 NA NA NA <NA> <NA> <NA> Don
## 5 NA NA NA <NA> <NA> <NA> Andy
## 6 NA NA NA <NA> <NA> <NA> Fernando
## nameLast nameGiven weight height bats throws debut bbrefID
## 1 Aardsma David Allan 215 75 R R 2004-04-06 aardsda01
## 2 Aaron Henry Louis 180 72 R R 1954-04-13 aaronha01
## 3 Aaron Tommie Lee 190 75 R R 1962-04-10 aaronto01
## 4 Aase Donald William 190 75 R R 1977-07-26 aasedo01
## 5 Abad Fausto Andres 184 73 L L 2001-09-10 abadan01
## 6 Abad Fernando Antonio 235 74 L L 2010-07-28 abadfe01
## finalGame retroID deathDate birthDate
## 1 2015-08-23 aardd001 <NA> 1981-12-27
## 2 1976-10-03 aaroh101 2021-01-22 1934-02-05
## 3 1971-09-26 aarot101 1984-08-16 1939-08-05
## 4 1990-10-03 aased001 <NA> 1954-09-08
## 5 2006-04-13 abada001 <NA> 1972-08-25
## 6 2021-10-01 abadf001 <NA> 1985-12-17
Learn a little bit about the dataset before you start.
?People
What are the variables contained in this dataset? How would you classify each variables’ type?
str(People)
## 'data.frame': 21010 obs. of 26 variables:
## $ playerID : chr "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
## $ birthYear : int 1981 1934 1939 1954 1972 1985 1850 1877 1869 1866 ...
## $ birthMonth : int 12 2 8 9 8 12 11 4 11 10 ...
## $ birthDay : int 27 5 5 8 25 17 4 15 11 14 ...
## $ birthCity : chr "Denver" "Mobile" "Mobile" "Orange" ...
## $ birthCountry: chr "USA" "USA" "USA" "USA" ...
## $ birthState : chr "CO" "AL" "AL" "CA" ...
## $ deathYear : int NA 2021 1984 NA NA NA 1905 1957 1962 1926 ...
## $ deathMonth : int NA 1 8 NA NA NA 5 1 6 4 ...
## $ deathDay : int NA 22 16 NA NA NA 17 6 11 27 ...
## $ deathCountry: chr NA "USA" "USA" NA ...
## $ deathState : chr NA "GA" "GA" NA ...
## $ deathCity : chr NA "Atlanta" "Atlanta" NA ...
## $ nameFirst : chr "David" "Hank" "Tommie" "Don" ...
## $ nameLast : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
## $ nameGiven : chr "David Allan" "Henry Louis" "Tommie Lee" "Donald William" ...
## $ weight : int 215 180 190 190 184 235 192 170 175 169 ...
## $ height : int 75 72 75 75 73 74 72 71 71 68 ...
## $ bats : Factor w/ 3 levels "B","L","R": 3 3 3 3 2 2 3 3 3 2 ...
## $ throws : Factor w/ 4 levels "B","L","R","S": 3 3 3 3 2 2 3 3 3 2 ...
## $ debut : chr "2004-04-06" "1954-04-13" "1962-04-10" "1977-07-26" ...
## $ bbrefID : chr "aardsda01" "aaronha01" "aaronto01" "aasedo01" ...
## $ finalGame : chr "2015-08-23" "1976-10-03" "1971-09-26" "1990-10-03" ...
## $ retroID : chr "aardd001" "aaroh101" "aarot101" "aased001" ...
## $ deathDate : Date, format: NA "2021-01-22" ...
## $ birthDate : Date, format: "1981-12-27" "1934-02-05" ...
%>%
OperatorCeci n’est pas une pipe
The pipe %>%
operator takes a data set from the left
and passes it into a function on the right. This operator can be used to
help you read your code. When you are reading your code out loud try
replacing %>%
with saying “and then”. These
operators can be used to make your code more efficient (i.e. use less
lines and less typing). They can also be used sequentially refine your
data.
filter
We can use the filter
verb to identify the observations
that adhere to a specification.
Let’s look at only baseball players that are from the United States of America.
Note: You don’t want to print the whole data set, unless you want to scroll for a long time…
## FILTER USA
usa<-People %>%
filter(birthCountry == "USA")
## How big is this?
## How many players are from the USA?
dim(usa)
## [1] 18098 26
What countries are represented in these data?
unique(People$birthCountry)
## [1] "USA" "D.R." "Venezuela" "Cuba"
## [5] "Mexico" "Panama" "CAN" "P.R."
## [9] "Russia" "Japan" "Curacao" "Colombia"
## [13] "Nicaragua" "Germany" "Norway" "Ireland"
## [17] "Italy" "Bahamas" "United Kingdom" "South Korea"
## [21] "Australia" "Czech Republic" "V.I." "Netherlands"
## [25] "France" "Aruba" NA "Sweden"
## [29] "Hong Kong" "Afghanistan" "Spain" "Greece"
## [33] "Taiwan" "Philippines" "Ukriane" "Jamaica"
## [37] "Poland" "Honduras" "Brazil" "Viet Nam"
## [41] "England" "Guam" "Denmark" "Switzerland"
## [45] "Austria" "Singapore" "China" "Belgium"
## [49] "Slovakia" "Peru" "Belize" "Indonesia"
## [53] "Bohemia" "Finland" "Lithuania" "South Africa"
## [57] "Atlantic Ocean" "American Samoa" "Saudi Arabia" "Portugal"
## [61] "Latvia"
Pick your favorite country from this list repeat the example. How many baseball players are from your chosen country?
## INSERT YOUR CODE HERE ##
What do you think the code would look like to only retain the rows for baseball players from Oregon?
Hint: Use the variable birthState
oregon<-People %>%
filter(birthCountry == "USA")%>%
filter(birthState == "OR")
dim(oregon)
## [1] 142 26
Create a new dataframe that only includes players less than 60 inches tall. Show the dataset. How many players are there who are less than 60 inches tall? What are their names?
less5 <- People %>%
filter(height < 60)
less5
## playerID birthYear birthMonth birthDay birthCity birthCountry birthState
## 1 gaedeed01 1925 6 8 Chicago USA IL
## 2 healeto01 1853 NA NA Cranston USA RI
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 1961 6 18 USA IL Chicago Eddie
## 2 1891 2 6 USA ME Lewiston Tom
## nameLast nameGiven weight height bats throws debut bbrefID
## 1 Gaedel Edward Carl 65 43 R L 1951-08-19 gaedeed01
## 2 Healey Thomas F. 155 55 <NA> R 1878-06-13 healeto01
## finalGame retroID deathDate birthDate
## 1 1951-08-19 gaede101 1961-06-18 1925-06-08
## 2 1878-09-09 healt101 1891-02-06 <NA>
count
state<-People %>%
filter(birthCountry == "USA")%>%
count(birthState)
head(state)
## birthState n
## 1 AK 13
## 2 AL 353
## 3 AR 164
## 4 AZ 144
## 5 CA 2474
## 6 CO 101
How many players bat right (R), left (L), or both (B)?
Hint: Use the variable bats
.
batSide<-People %>%
count(bats)
batSide
## bats n
## 1 B 1274
## 2 L 5509
## 3 R 13009
## 4 <NA> 1218
What does NA
mean?
ANSWER HERE:
arrange
stateArr<-People %>%
filter(birthCountry == "USA")%>%
count(birthState)%>%
arrange(desc(n))
head(stateArr)
## birthState n
## 1 CA 2474
## 2 PA 1471
## 3 NY 1293
## 4 IL 1114
## 5 OH 1085
## 6 TX 1034
group_by
and summarise
Create subgroups of your data using group_by
with a
categorical variable. Once the data has been grouped you can perform
numerical summaries on them.
Make a dataframe that displays just the average height of players from each country. What country has the tallest players?
tallest <- People %>%
group_by(birthCountry) %>%
summarise(avg_height=mean(height,na.rm = TRUE)) %>%
arrange(desc(avg_height))
tallest
## # A tibble: 61 × 2
## birthCountry avg_height
## <chr> <dbl>
## 1 Indonesia 78
## 2 Belgium 77
## 3 Hong Kong 76
## 4 Jamaica 75.2
## 5 Afghanistan 75
## 6 Lithuania 75
## 7 Brazil 74.2
## 8 Guam 74
## 9 Singapore 74
## 10 Netherlands 73.5
## # … with 51 more rows
select
Rather than keeping all of the variables, we can select the ones that we are interested in using.
peopleS<-People%>%
select(c(nameFirst, nameLast, weight, height))
str(peopleS)
## 'data.frame': 21010 obs. of 4 variables:
## $ nameFirst: chr "David" "Hank" "Tommie" "Don" ...
## $ nameLast : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
## $ weight : int 215 180 190 190 184 235 192 170 175 169 ...
## $ height : int 75 72 75 75 73 74 72 71 71 68 ...
mutate
The mutate
function can be used to create a new variable
(a new column) as a function of other columns.
Body Mass Index (BMI) is a convenient rule of thumb used to broadly categorize a person as underweight, normal weight, overweight, or obese based on tissue mass (muscle, fat, and bone) and height.
\[BMI = \frac{weight (lbs) \times 703}{height^2 (in^2)}\]
peopleBMI<-People %>%
select(c(nameFirst, nameLast, weight, height))%>%
mutate(bmi=(weight*703)/(height^2))
## OBSERVE THAT THE NEW COLUMN IS ADDED ON
str(peopleBMI)
## 'data.frame': 21010 obs. of 5 variables:
## $ nameFirst: chr "David" "Hank" "Tommie" "Don" ...
## $ nameLast : chr "Aardsma" "Aaron" "Aaron" "Aase" ...
## $ weight : int 215 180 190 190 184 235 192 170 175 169 ...
## $ height : int 75 72 75 75 73 74 72 71 71 68 ...
## $ bmi : num 26.9 24.4 23.7 23.7 24.3 ...
The CDC defines a BMI greater than 30 “within an obese range”. Create a dataframe of only the obese players. How many players are in this dataset? Which player has the highest BMI?
bmi30<-People %>%
select(c(nameFirst, nameLast, weight, height))%>%
mutate(bmi=(weight*703)/(height^2))%>%
filter(bmi>30)%>%
arrange(desc(bmi))
head(bmi30)
## nameFirst nameLast weight height bmi
## 1 John McSherry 351 75 43.86720
## 2 Bartolo Colon 285 71 39.74509
## 3 Pablo Sandoval 268 70 38.44980
## 4 Prince Fielder 275 71 38.35053
## 5 Jumbo Diaz 315 76 38.33882
## 6 Reyes Moronta 265 70 38.01939
If you look back at the structure of the data set, you might notice
that birthDate
and deathDate
are
Date
variables. This is a new type of variable for us! If
you subtract Date
variables you will get the number of data
between the dates in question.
Creating a dataset for only baseball players who have deceased, add a new column to the data set for a player’s lifetime (in years). You can assume that each year has 365 days. Which baseball player lived the longest? How many years old was he?
oldest <- People %>%
select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
filter(is.na(deathYear)==FALSE) %>%
mutate(lifetime=(deathDate-birthDate)/365) %>%
arrange(desc(lifetime))
head(oldest)
## nameFirst nameLast birthDate deathDate deathYear lifetime
## 1 Red Hoff 1891-05-08 1998-09-17 1998 107.4329 days
## 2 Connie Marrero 1911-04-25 2014-04-23 2014 103.0658 days
## 3 Bob Wright 1891-12-13 1993-07-30 1993 101.6959 days
## 4 Ace Parker 1912-05-17 2013-11-06 2013 101.5425 days
## 5 Tony Malinosky 1909-10-07 2011-02-08 2011 101.4082 days
## 6 George Elder 1921-03-10 2022-07-07 2022 101.3945 days
If we want to remove this days
label we can use the
as.numeric
wrapper function.
oldest <- People %>%
select(c(nameFirst, nameLast, birthDate, deathDate, deathYear))%>%
filter(is.na(deathYear)==FALSE) %>%
mutate(lifetime=as.numeric((deathDate-birthDate)/365) )%>%
arrange(desc(lifetime))
head(oldest)
## nameFirst nameLast birthDate deathDate deathYear lifetime
## 1 Red Hoff 1891-05-08 1998-09-17 1998 107.4329
## 2 Connie Marrero 1911-04-25 2014-04-23 2014 103.0658
## 3 Bob Wright 1891-12-13 1993-07-30 1993 101.6959
## 4 Ace Parker 1912-05-17 2013-11-06 2013 101.5425
## 5 Tony Malinosky 1909-10-07 2011-02-08 2011 101.4082
## 6 George Elder 1921-03-10 2022-07-07 2022 101.3945
Convert the debut and finalGame variables to dates using the
as.Date()
function. Add a new column to the data set for a
player’s career (in years). You can assume that each year has 365 days.
Which baseball player had the longest career? How long was it?
longest <- People %>%
select(c(nameFirst, nameLast, finalGame, debut))%>%
mutate(career=as.numeric((as.Date(finalGame)-as.Date(debut))/365)) %>%
arrange(desc(career))
head(longest)
## nameFirst nameLast finalGame debut career
## 1 Nick Altrock 1933-10-01 1898-07-14 35.23836
## 2 Jim O'Rourke 1904-09-22 1872-04-26 32.42740
## 3 Minnie Minoso 1980-10-05 1949-04-19 31.48493
## 4 Charley O'Leary 1934-09-30 1904-04-14 30.48219
## 5 Arlie Latham 1909-09-30 1880-07-05 29.25479
## 6 Deacon McGuire 1912-05-18 1884-06-21 27.92329