Data Wrangling Mid-Term

Introduction

Baseball data will be used to categorize batters and suggest where they should fall within the batting order. Information like this can help coaches optimize their lineups by showing where each player fits best. Important stats for evaluating batters include batting average (BA), on-base percentage (OBP), slugging percentage (SLG), number of home runs (HR), and runs batted in (RBI). These stats will be calculated from the available data. Combining exploratory data analysis with domain knowledge of baseball will then yield insights into where batters should be placed in a lineup.
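As a rough sketch of how the rate stats named above are conventionally defined, the helper below computes BA, OBP, and SLG from counting stats. The function name `batting_rates`, its argument names, and the season line passed to it are all hypothetical and are not taken from the data set.

```r
# Hypothetical helper: conventional definitions of BA, OBP, and SLG
# computed from counting stats. Argument names are illustrative only.
batting_rates <- function(singles, doubles, triples, home_runs,
                          walks, hit_by_pitch, sac_flies, at_bats) {
  hits <- singles + doubles + triples + home_runs
  total_bases <- singles + 2 * doubles + 3 * triples + 4 * home_runs
  data.frame(
    BA  = hits / at_bats,
    OBP = (hits + walks + hit_by_pitch) /
      (at_bats + walks + hit_by_pitch + sac_flies),
    SLG = total_bases / at_bats
  )
}

# Made-up season line, for illustration only
batting_rates(singles = 100, doubles = 30, triples = 5, home_runs = 25,
              walks = 60, hit_by_pitch = 5, sac_flies = 5, at_bats = 500)
```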

Packages Required

library(readr)
library(dplyr)

Both packages are part of the tidyverse: readr provides functions to load data into R, and dplyr provides functions to manipulate and join data.

Data Preparation

The MLB Pitch Data 2015-2018 collection is posted on Kaggle. It was scraped from a webpage that is part of MLB.com. The collection includes the following 4 data sets as CSV files:

games – 9,718 observations with 17 variables for every game played in the 2015-2018 seasons
atbats – 740,389 observations with 11 variables for every at bat in each of the 4 seasons
pitches – 2,867,162 observations with 39 variables for every pitch thrown in each of the 4 seasons
player_names – 2,218 observations with 3 variables for every player represented

To start, the data is loaded into R and the structure of each data set is viewed.

atbats <- read_csv("mlb_pitch/atbats.csv")
games <- read_csv("mlb_pitch/games.csv")
pitches <- read_csv("mlb_pitch/pitches.csv")
player_names <- read_csv("mlb_pitch/player_names.csv")

str(atbats)
str(games)
str(pitches)
str(player_names)

For this analysis, only 3 variables from the atbats data set are needed:

ab_id - ID number given to each at bat (first 4 digits are the year)
batter_id - ID number given to each batter (matches id in player_names)
event - the result of each at bat

atbats.sub <- select(atbats, ab_id, batter_id, event)
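Because the first 4 digits of ab_id encode the year, a season column could be derived from it if per-season stats are wanted later. The sketch below assumes every ab_id has exactly 10 digits, as in the rows shown in this report, and that the column is still numeric. The `toy` tibble and the `season` column name are made up for illustration.

```r
library(dplyr)

# Toy rows mimicking the ab_id layout (4-digit year followed by 6 digits);
# the values are made up for illustration
toy <- tibble(ab_id = c(2015000001, 2016000002, 2018000003))

# Integer division by 1e6 drops the last 6 digits and leaves the year
toy <- mutate(toy, season = as.integer(ab_id %/% 1e6))
toy
```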

Before moving on, any missing data should be identified.

sum(is.na(atbats.sub))

## [1] 0

Luckily, there is no missing data in this subset of the atbats data set.
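A single total does not show where missing values are located. If the count had been nonzero, a per-column breakdown would be one way to find them. The sketch below uses a made-up data frame with one deliberately missing value; `toy` and `na_by_column` are hypothetical names.

```r
# Toy data with one deliberately missing value (made up for illustration)
toy <- data.frame(
  ab_id = c(2015000001, 2015000002, 2015000003),
  batter_id = c(572761, NA, 407812),
  event = c("Groundout", "Double", "Single")
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_by_column <- colSums(is.na(toy))
na_by_column
```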

The ID variables are stored as numbers and read into R as doubles. They are converted to character because they are identifiers, not quantities.

atbats.sub$ab_id <- as.character(atbats.sub$ab_id)
atbats.sub$batter_id <- as.character(atbats.sub$batter_id)
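An alternative, sketched here under the assumption that the files are read with readr as above, is to declare the ID columns as character at load time through the `col_types` argument of `read_csv()`. The temporary file below is a tiny stand-in for atbats.csv, and `ids_as_text` is a hypothetical name.

```r
library(readr)

# Write a tiny stand-in file so the example runs without the Kaggle data
tmp <- tempfile(fileext = ".csv")
writeLines(c("ab_id,batter_id,event",
             "2015000001,572761,Groundout",
             "2015000002,518792,Double"), tmp)

# Declare the ID columns as character so no later conversion is needed
ids_as_text <- read_csv(
  tmp,
  col_types = cols(
    ab_id = col_character(),
    batter_id = col_character(),
    event = col_character()
  )
)
str(ids_as_text)
```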

Below is a snippet of the resulting data set that will be used for further analysis.

## # A tibble: 6 x 3
##   ab_id      batter_id event    
##   <chr>      <chr>     <chr>    
## 1 2015000001 572761    Groundout
## 2 2015000002 518792    Double   
## 3 2015000003 407812    Single   
## 4 2015000004 425509    Strikeout
## 5 2015000005 571431    Strikeout
## 6 2015000006 451594    Double

The event variable needs to be explored further to see which outcomes are possible for an at bat and how often each one occurs.

# Tabulate each at-bat outcome and convert the table to a tibble
ab.results <- as_tibble(table(atbats.sub$event))
names(ab.results) <- c("event", "count")
# Share of all at bats, as a percentage
ab.results$percent <- ab.results$count / nrow(atbats.sub) * 100

arrange(ab.results, desc(count))

## # A tibble: 30 x 3
##    event      count percent
##    <chr>      <int>   <dbl>
##  1 Strikeout 157128   21.2 
##  2 Groundout 134893   18.2 
##  3 Single    108794   14.7 
##  4 Flyout     80731   10.9 
##  5 Walk       56894    7.68
##  6 Lineout    44934    6.07
##  7 Pop Out    34455    4.65
##  8 Double     33157    4.48
##  9 Home Run   22209    3.00
## 10 Forceout   15112    2.04
## # ... with 20 more rows
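The same frequency table could also be built with dplyr alone. The sketch below uses `count()` and `mutate()` on a made-up four-row tibble; `toy` and `event_summary` are hypothetical names, and this is not the code that produced the output above.

```r
library(dplyr)

# Made-up at-bat outcomes for illustration
toy <- tibble(event = c("Strikeout", "Single", "Strikeout", "Walk"))

# count() tallies rows per event; sort = TRUE orders by descending count
event_summary <- toy %>%
  count(event, sort = TRUE, name = "count") %>%
  mutate(percent = count / sum(count) * 100)
event_summary
```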

Proposed Exploratory Data Analysis

From the data set above, I will calculate statistics (such as batting average and on-base percentage) for each batter. I will then plot these stats against each other to see how they relate and where groups of batters may form. For example, a batter with a lower batting average and a higher number of home runs could be classified as a power hitter. From there I will perform a clustering analysis to see which groups form and how they can translate into batting order slots, such as leadoff, cleanup, or the bottom of the lineup.
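The plan above could be sketched roughly as follows. Several things here are assumptions: the event labels "Triple", "Hit By Pitch", and "Sac Fly" are guesses at the data set's wording (only some labels appear in the table earlier in this report), every other event is treated as an official at bat (a simplification), and k-means is just one possible clustering method, since no method has been chosen yet. All rows and names are made up.

```r
library(dplyr)

# Assumed event labels; only some appear in the frequency table in this report
hit_events    <- c("Single", "Double", "Triple", "Home Run")
non_ab_events <- c("Walk", "Hit By Pitch", "Sac Fly")
bases <- c("Single" = 1, "Double" = 2, "Triple" = 3, "Home Run" = 4)

# Toy at-bat log with hypothetical batter ids
toy <- tibble(
  batter_id = c("A", "A", "A", "A", "B", "B", "B", "B"),
  event = c("Single", "Strikeout", "Walk", "Home Run",
            "Groundout", "Double", "Flyout", "Strikeout")
)

# Part 1: per-batter counting and rate stats from the event log
batter_stats <- toy %>%
  mutate(
    is_hit = event %in% hit_events,
    is_ab  = !(event %in% non_ab_events),      # simplification, see lead-in
    tb     = coalesce(unname(bases[event]), 0) # total bases, 0 for non-hits
  ) %>%
  group_by(batter_id) %>%
  summarise(
    AB  = sum(is_ab),
    H   = sum(is_hit),
    HR  = sum(event == "Home Run"),
    BA  = H / AB,
    SLG = sum(tb) / AB
  )
batter_stats

# Part 2: k-means on made-up stats for six hypothetical batters
made_up <- data.frame(
  BA = c(0.310, 0.300, 0.305, 0.230, 0.240, 0.235),
  HR = c(5, 8, 6, 35, 40, 38)
)
set.seed(1)
# scale() puts BA and HR on comparable scales before clustering
fit <- kmeans(scale(made_up), centers = 2, nstart = 10)
made_up$cluster <- fit$cluster
made_up
```

In this toy table the high-average, low-HR batters and the low-average, high-HR batters are far apart, so they should land in separate clusters. Real data will be far less cleanly separated.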

Brandon Lester

April 6, 2019
