This code-through demonstrates how to use the dplyr
package to explore baseball data from the Lahman package.
The goal is to show how dplyr functions help clean,
transform and summarize data.
We will use the following functions in this code-through:
select(), rename(), mutate(), filter(), group_by(),
summarise(), and arrange().
Before using a package, it must be installed on your computer. A package only needs to be installed once.
The eval=FALSE option displays the installation code
without running it when knitted.
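For reference, the setup might look like the sketch below. It assumes both packages are installed from CRAN under the names dplyr and Lahman. The install lines are commented out so the block can be run repeatedly, which mirrors what eval=FALSE does in a knitted chunk.

```r
# One-time setup (assumption: both packages are installed from CRAN).
# Uncomment and run once, or keep the calls in a chunk with eval=FALSE.
# install.packages("dplyr")
# install.packages("Lahman")

# Load the packages at the start of every R session.
library(dplyr)   # data manipulation verbs
library(Lahman)  # baseball data, including the Teams data frame
```

Installing happens once per computer, while the library() calls are needed in every new session.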
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## yearID name G SO
## 1 1871 Boston Red Stockings 31 19
## 2 1871 Chicago White Stockings 28 22
## 3 1871 Cleveland Forest Citys 29 25
## 4 1871 Fort Wayne Kekiongas 19 9
## 5 1871 New York Mutuals 33 15
## 6 1871 Philadelphia Athletics 28 23
The Teams data set includes team-level baseball
statistics by season. Each row represents one team in one year. Some
useful variables include yearID (season), name (team name),
G (games played), and SO (strikeouts).
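A quick way to confirm this structure is to check the dimensions and preview the four columns. This is only a sketch: it assumes Lahman is installed, and the base R column indexing is just one of several ways to pull these columns.

```r
library(Lahman)

# Each row is one team-season; dim() reports the number of rows and columns.
dim(Teams)

# Preview the four variables used in this code-through.
head(Teams[, c("yearID", "name", "G", "SO")])
```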
In this step, we will use the select() function to keep only the
variables needed for the rest of the tutorial: the year, team
name, games played, and strikeouts. We will also use
rename() to give the columns names that are easier to understand.
teams.baseball <- Teams %>%
  select(yearID, name, G, SO) %>%
  rename(Year = yearID, Team = name, Games_Played = G, Strikeouts = SO)
head(teams.baseball)
## Year Team Games_Played Strikeouts
## 1 1871 Boston Red Stockings 31 19
## 2 1871 Chicago White Stockings 28 22
## 3 1871 Cleveland Forest Citys 29 25
## 4 1871 Fort Wayne Kekiongas 19 9
## 5 1871 New York Mutuals 33 15
## 6 1871 Philadelphia Athletics 28 23
The mutate() function creates a new variable. We will use it to
create so_per_game, the number of strikeouts per game. Raw
strikeout totals are difficult to compare when teams play
different numbers of games, so dividing by games played standardizes
the measurement and makes comparisons across teams more
meaningful.
teams.baseball <- teams.baseball %>%
  mutate(so_per_game = Strikeouts / Games_Played)
head(teams.baseball)
## Year Team Games_Played Strikeouts so_per_game
## 1 1871 Boston Red Stockings 31 19 0.6129032
## 2 1871 Chicago White Stockings 28 22 0.7857143
## 3 1871 Cleveland Forest Citys 29 25 0.8620690
## 4 1871 Fort Wayne Kekiongas 19 9 0.4736842
## 5 1871 New York Mutuals 33 15 0.4545455
## 6 1871 Philadelphia Athletics 28 23 0.8214286
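To see why the rate is easier to compare than the raw total, the calculations below reuse numbers printed in this code-through: the 1871 Boston Red Stockings (19 strikeouts in 31 games) and the 2000 Anaheim Angels (1024 strikeouts in 162 games). The variable names are only illustrative.

```r
boston_1871  <- 19 / 31      # about 0.61 strikeouts per game
anaheim_2000 <- 1024 / 162   # about 6.32 strikeouts per game

# The raw totals differ by a factor of about 54, but the two teams also
# played very different numbers of games. The per-game rates differ by a
# factor of about 10, which is the fairer comparison.
1024 / 19
anaheim_2000 / boston_1871
```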
The filter() function keeps rows that meet the condition
we set. For this example, we will keep seasons from 2000 and later.
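A minimal sketch of this step follows. It assumes the filtered data is stored as modern.teams (the name the summary step reads from) and that the condition is Year >= 2000. The fallback data frame holds a few rows copied from the outputs printed in this code-through and is only there so the chunk runs on its own.

```r
library(dplyr)

# Fallback for running this chunk in isolation: a few rows copied from the
# printed outputs. Skipped when teams.baseball already exists.
if (!exists("teams.baseball")) {
  teams.baseball <- data.frame(
    Year = c(1871L, 1871L, 2000L, 2000L),
    Team = c("Boston Red Stockings", "Chicago White Stockings",
             "Anaheim Angels", "Arizona Diamondbacks"),
    Games_Played = c(31, 28, 162, 162),
    Strikeouts = c(19, 22, 1024, 975)
  ) %>%
    mutate(so_per_game = Strikeouts / Games_Played)
}

# Keep only the rows where the condition is TRUE.
modern.teams <- teams.baseball %>%
  filter(Year >= 2000)

head(modern.teams)
```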
## Year Team Games_Played Strikeouts so_per_game
## 1 2000 Anaheim Angels 162 1024 6.320988
## 2 2000 Arizona Diamondbacks 162 975 6.018519
## 3 2000 Atlanta Braves 162 1010 6.234568
## 4 2000 Baltimore Orioles 162 900 5.555556
## 5 2000 Boston Red Sox 162 1019 6.290123
## 6 2000 Chicago White Sox 162 960 5.925926
Next, we will summarize strikeouts by year. The
group_by() function organizes observations into groups, and
the summarise() function collapses each group into a single row of summary statistics.
Grouping by year changes the unit of analysis from an individual team in a season to an annual summary. Each row of the result represents the average team strikeout rate for one year.
As a reminder, na.rm = TRUE tells R to ignore any missing values when calculating the mean.
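The effect of na.rm is easy to see on a small vector. The values below are made up for illustration.

```r
rates <- c(6, NA, 8)   # hypothetical per-game rates with one missing value

mean(rates)                 # NA: a single missing value makes the mean missing
mean(rates, na.rm = TRUE)   # 7: the NA is dropped before averaging
```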
strike.outs.by.year <- modern.teams %>%
  group_by(Year) %>%
  summarise(avg_Strikeout_per_game = mean(so_per_game, na.rm = TRUE))
strike.outs.by.year
## # A tibble: 26 × 2
## Year avg_Strikeout_per_game
## <int> <dbl>
## 1 2000 6.45
## 2 2001 6.67
## 3 2002 6.47
## 4 2003 6.34
## 5 2004 6.55
## 6 2005 6.30
## 7 2006 6.52
## 8 2007 6.62
## 9 2008 6.77
## 10 2009 6.91
## # ℹ 16 more rows
Next, we sort the yearly averages from highest to lowest using the
arrange() and desc() functions. arrange() sorts in ascending order by
default, so wrapping the column in desc() reverses the order.
We then use head() to view the first 10 rows and confirm that
the sort worked.
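A minimal sketch of the sorting step follows. The result is stored under a new, hypothetical name (top.strikeout.years) so that strike.outs.by.year keeps its chronological order, which the plot at the end relies on. The fallback values are the 2000 to 2009 averages printed earlier and only make the chunk runnable on its own.

```r
library(dplyr)

# Fallback for running this chunk in isolation; skipped when the summary
# table from the previous step already exists.
if (!exists("strike.outs.by.year")) {
  strike.outs.by.year <- data.frame(
    Year = 2000:2009,
    avg_Strikeout_per_game = c(6.45, 6.67, 6.47, 6.34, 6.55,
                               6.30, 6.52, 6.62, 6.77, 6.91)
  )
}

# Sort from highest to lowest average, then keep the first 10 rows.
top.strikeout.years <- strike.outs.by.year %>%
  arrange(desc(avg_Strikeout_per_game)) %>%
  head(10)

top.strikeout.years
```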
## # A tibble: 10 × 2
## Year avg_Strikeout_per_game
## <int> <dbl>
## 1 2019 8.82
## 2 2020 8.68
## 3 2021 8.68
## 4 2023 8.61
## 5 2024 8.48
## 6 2018 8.47
## 7 2022 8.40
## 8 2025 8.36
## 9 2017 8.25
## 10 2016 8.03
Now that we have average strikeouts per game for each year, let's
visualize how that average changes over time by plotting the results.
We will use base R's plot() function.
plot(x = strike.outs.by.year$Year,
     y = strike.outs.by.year$avg_Strikeout_per_game,
     type = "b",          # draw both points and connecting lines
     pch = 19,            # solid circles
     col = "darkorange",
     bty = "n",           # no box around the plot
     xlab = "Year",
     ylab = "Average Strikeouts per Game",
     main = "Average MLB Strikeouts per Game Since 2000")