Introduction

This code-through demonstrates how to use the dplyr package to explore baseball data from the Lahman package.

The goal is to show how dplyr functions help clean, transform and summarize data.

We will use the following functions in this code-through: select(), rename(), mutate(), filter(), group_by(), summarise(), and arrange().



Before using a package, it must be installed on your computer. A package only needs to be installed once.

install.packages("dplyr")
install.packages("Lahman")

In the R Markdown source, the installation chunk uses the eval=FALSE option, which displays the code in the knitted document without running it.
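Because a package only needs to be installed once, a common pattern is to install it only when it is missing. The following is a minimal sketch of that idea, not part of the original code-through: the names `needed`, `installed`, and `missing_pkgs` are illustrative, and the `repos` value assumes the CRAN cloud mirror is reachable from your machine.

```r
# Packages used in this code-through
needed <- c("dplyr", "Lahman")

# requireNamespace() returns TRUE when a package is installed and can be loaded
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Install only what is missing (assumption: the CRAN cloud mirror is reachable)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```

Running this a second time should do nothing, since both packages will already be present.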

Required Packages:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Lahman)

Load and preview the data:

Teams %>%
  select(yearID, name, G, SO) %>%
  head()
##   yearID                    name  G SO
## 1   1871    Boston Red Stockings 31 19
## 2   1871 Chicago White Stockings 28 22
## 3   1871  Cleveland Forest Citys 29 25
## 4   1871    Fort Wayne Kekiongas 19  9
## 5   1871        New York Mutuals 33 15
## 6   1871  Philadelphia Athletics 28 23


Selecting variables

The Teams data set includes team-level baseball statistics by season. Each row represents one team in one year. Some useful variables include yearID, name, G, and SO.

In this step, we will use the select() function to keep only the variables needed for the rest of the tutorial: the year, team name, games played, and strikeouts. We will also use rename() to make the column names easier to understand.

teams.baseball <- Teams %>%
  select(yearID, name, G, SO) %>%
  rename(Year = yearID, Team = name, Games_Played = G, Strikeouts = SO)

head(teams.baseball)
##   Year                    Team Games_Played Strikeouts
## 1 1871    Boston Red Stockings           31         19
## 2 1871 Chicago White Stockings           28         22
## 3 1871  Cleveland Forest Citys           29         25
## 4 1871    Fort Wayne Kekiongas           19          9
## 5 1871        New York Mutuals           33         15
## 6 1871  Philadelphia Athletics           28         23
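As a side note, select() also accepts `new_name = old_name` pairs, so the keep-and-rename step can be written as a single call. The sketch below uses a small made-up data frame (`toy_teams`, with invented values and an extra `W` column) rather than the real Teams data, so it runs without Lahman.

```r
library(dplyr)

# Hypothetical stand-in for the Teams data; all values are made up
toy_teams <- tibble(
  yearID = c(2001, 2002),
  name   = c("Team A", "Team B"),
  G      = c(162, 162),
  SO     = c(1000, 1100),
  W      = c(90, 75)  # extra column that select() will drop
)

# select() keeps only the listed columns and renames them in the same call
toy_renamed <- toy_teams %>%
  select(Year = yearID, Team = name, Games_Played = G, Strikeouts = SO)

names(toy_renamed)
```

The two-step version in the tutorial is arguably easier to read for beginners, because each function does one job.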


Creating a new variable

The mutate() function creates a new variable. We will create so_per_game, which holds each team's strikeouts per game. Raw strikeout totals are difficult to compare when teams play different numbers of games, so dividing by games played standardizes the measurement and makes comparisons across teams more meaningful.

teams.baseball <- teams.baseball %>%
  mutate(so_per_game = Strikeouts / Games_Played)

head(teams.baseball)
##   Year                    Team Games_Played Strikeouts so_per_game
## 1 1871    Boston Red Stockings           31         19   0.6129032
## 2 1871 Chicago White Stockings           28         22   0.7857143
## 3 1871  Cleveland Forest Citys           29         25   0.8620690
## 4 1871    Fort Wayne Kekiongas           19          9   0.4736842
## 5 1871        New York Mutuals           33         15   0.4545455
## 6 1871  Philadelphia Athletics           28         23   0.8214286
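To see why the per-game rate matters, consider a small made-up example (the data frame `toy` and its values are hypothetical). Team A has more total strikeouts, but Team B played far fewer games and has the higher rate. The sketch also shows that a column created inside mutate() can be used later in the same call.

```r
library(dplyr)

# Made-up totals for two teams with different season lengths
toy <- tibble(
  Team         = c("Team A", "Team B"),
  Games_Played = c(162, 60),
  Strikeouts   = c(1296, 540)
)

toy <- toy %>%
  mutate(
    so_per_game         = Strikeouts / Games_Played,  # 1296/162 = 8, 540/60 = 9
    so_per_game_rounded = round(so_per_game, 1)       # uses the column created above
  )

toy
```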


Filtering by year

The filter() function keeps only the rows that meet the condition we set. For this example, we will keep seasons from 2000 onward.

modern.teams <- teams.baseball %>%
  filter(Year >= 2000)
head(modern.teams)
##   Year                 Team Games_Played Strikeouts so_per_game
## 1 2000       Anaheim Angels          162       1024    6.320988
## 2 2000 Arizona Diamondbacks          162        975    6.018519
## 3 2000       Atlanta Braves          162       1010    6.234568
## 4 2000    Baltimore Orioles          162        900    5.555556
## 5 2000       Boston Red Sox          162       1019    6.290123
## 6 2000    Chicago White Sox          162        960    5.925926
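filter() can also take several conditions at once; conditions separated by commas must all be true for a row to be kept. The sketch below uses made-up data, and the 100-game cutoff is an arbitrary threshold chosen only for illustration.

```r
library(dplyr)

# Made-up seasons, including one shortened season
toy <- tibble(
  Year         = c(1999, 2005, 2020, 2021),
  Team         = c("Team A", "Team B", "Team C", "Team D"),
  Games_Played = c(162, 162, 60, 162)
)

# Keep rows from 2000 onward AND with at least 100 games played
full_modern <- toy %>%
  filter(Year >= 2000, Games_Played >= 100)

full_modern$Team
```

Here only Team B and Team D remain: Team A fails the year condition and Team C fails the games condition.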


Strikeouts by year

Next we will summarize strikeouts by year. The group_by() function organizes observations into groups, and the summarise() function collapses each group into a single row of summary values.

Grouping by year changes the unit of analysis from an individual team in a season to an annual summary. Each row of the result holds the average team strikeout rate for one year.

The na.rm = TRUE argument tells mean() to ignore any missing values when calculating the average.

strike.outs.by.year <- modern.teams %>%
  group_by(Year) %>%
  summarise(avg_Strikeout_per_game = mean(so_per_game, na.rm = TRUE))

strike.outs.by.year
## # A tibble: 26 × 2
##     Year avg_Strikeout_per_game
##    <int>                  <dbl>
##  1  2000                   6.45
##  2  2001                   6.67
##  3  2002                   6.47
##  4  2003                   6.34
##  5  2004                   6.55
##  6  2005                   6.30
##  7  2006                   6.52
##  8  2007                   6.62
##  9  2008                   6.77
## 10  2009                   6.91
## # ℹ 16 more rows
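The effect of na.rm = TRUE is easiest to see on a tiny example. In the made-up data below (all values hypothetical), one 2001 row has a missing rate. Without na.rm = TRUE, the mean for that year is NA; with it, the mean is computed from the remaining value. The `teams` column uses n() to count rows per group.

```r
library(dplyr)

# Made-up rates; the last value is missing
toy <- tibble(
  Year        = c(2000, 2000, 2001, 2001),
  so_per_game = c(6.0, 7.0, 8.0, NA)
)

by_year <- toy %>%
  group_by(Year) %>%
  summarise(
    teams          = n(),                            # rows per year, including the NA row
    avg_with_na    = mean(so_per_game),              # NA for 2001
    avg_na_removed = mean(so_per_game, na.rm = TRUE) # 6.5 for 2000, 8 for 2001
  )

by_year
```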


Sorting data

Now we want to sort the data from highest to lowest. We will do this with the arrange() and desc() functions.

We will use head(10) to view the first 10 rows and confirm that the sort worked.

strike.outs.by.year %>% 
  arrange(desc(avg_Strikeout_per_game)) %>% 
  head(10)
## # A tibble: 10 × 2
##     Year avg_Strikeout_per_game
##    <int>                  <dbl>
##  1  2019                   8.82
##  2  2020                   8.68
##  3  2021                   8.68
##  4  2023                   8.61
##  5  2024                   8.48
##  6  2018                   8.47
##  7  2022                   8.40
##  8  2025                   8.36
##  9  2017                   8.25
## 10  2016                   8.03
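For comparison, the sketch below contrasts arrange() with and without desc() on made-up data (all values hypothetical). It also shows slice_max(), which combines the sort-and-take-the-top-rows pattern into one call; this assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up yearly averages
toy <- tibble(
  Year                   = c(2016, 2017, 2018, 2019),
  avg_Strikeout_per_game = c(8.0, 8.3, 8.5, 8.8)
)

# arrange() sorts from lowest to highest by default
lowest_first <- toy %>% arrange(avg_Strikeout_per_game)

# Wrapping the column in desc() reverses the order
highest_first <- toy %>% arrange(desc(avg_Strikeout_per_game))

# slice_max() keeps the n rows with the largest values (dplyr >= 1.0.0)
top_two <- toy %>% slice_max(avg_Strikeout_per_game, n = 2)
```

One difference to be aware of: slice_max() keeps tied rows by default, so it can return more than n rows, whereas head() always returns at most n.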


Plotting data

Now that we have average strikeouts per game, let’s visualize how it changes over time by plotting the results. We will use the plot() function.

  • type = "b" draws both points and lines
  • pch = 19 makes the points solid circles
  • col = "darkorange" sets the color of the points and lines
  • bty = "n" removes the box around the plot
  • xlab, ylab, and main set the axis labels and the title

plot(x = strike.outs.by.year$Year,
     y = strike.outs.by.year$avg_Strikeout_per_game,
     type = "b",
     pch = 19,
     col = "darkorange",
     bty = "n",
     xlab = "Year",
     ylab = "Average Strikeouts per Game",
     main = "Average MLB Strikeouts per Game Since 2000")
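One optional extension is a horizontal reference line at the overall mean, which makes it easier to see which years sit above or below average. The sketch below uses made-up values (`toy_years` and `toy_rate` are hypothetical) and base R's abline(); it is an illustration of the idea, not part of the original analysis.

```r
# Made-up yearly averages for illustration
toy_years <- 2000:2004
toy_rate  <- c(6.4, 6.6, 6.5, 6.3, 6.7)

plot(x = toy_years,
     y = toy_rate,
     type = "b",
     pch = 19,
     col = "darkorange",
     bty = "n",
     xlab = "Year",
     ylab = "Average Strikeouts per Game")

# Dashed gray line at the mean of the plotted values
overall_mean <- mean(toy_rate)
abline(h = overall_mean, lty = 2, col = "gray40")
```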

Resources

  • dplyr documentation: Reference material for the data-manipulation functions demonstrated in this tutorial.
  • Lahman package: Documentation for the historical baseball datasets used in the analysis.
  • R Markdown documentation: Guidance for creating and formatting reproducible R tutorials.