MLB Analysis

1. Data Input

First things first: we need some data to analyze. The dataset we’ll be investigating is Lahman’s Baseball Database. These files contain a variety of baseball statistics dating all the way back to 1871!

There are 25+ CSV files in this set, so we’ll use a quick loop to read each one in as its own data frame. If you’re interested, the code is shown below:

# Path to the folder that holds the .csv files. Use a real path here; mine is already loaded.
# location <- 'Blah-Blah-Blah/baseballdatabank-master/core/'

# Create a list of all .csv files in the folder (pattern is a regular expression, not a glob)
list_of_files <- list.files(path = location, pattern = "\\.csv$")

# Read in each .csv file in list_of_files and create a data frame named after the file
for (i in seq_along(list_of_files)) {
  assign(list_of_files[i],
         read.csv(paste0(location, list_of_files[i])))
}

Let’s take a quick look at the output here. As a reminder, we’re expecting 25+ files.

##  [1] "AllstarFull.csv"         "Appearances.csv"        
##  [3] "AwardsManagers.csv"      "AwardsPlayers.csv"      
##  [5] "AwardsShareManagers.csv" "AwardsSharePlayers.csv" 
##  [7] "Batting.csv"             "BattingPost.csv"        
##  [9] "CollegePlaying.csv"      "Fielding.csv"           
## [11] "FieldingOF.csv"          "FieldingOFsplit.csv"    
## [13] "FieldingPost.csv"        "HallOfFame.csv"         
## [15] "HomeGames.csv"           "Managers.csv"           
## [17] "ManagersHalf.csv"        "Parks.csv"              
## [19] "People.csv"              "Pitching.csv"           
## [21] "PitchingPost.csv"        "Salaries.csv"           
## [23] "Schools.csv"             "SeriesPost.csv"         
## [25] "Teams.csv"               "TeamsFranchises.csv"    
## [27] "TeamsHalf.csv"

Great! We can see that 27 files were loaded into our environment. Let’s move on to some deeper views into the data.
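The listing above shows the file names, so it can be worth confirming that each name now points at a data frame. Here is a minimal sketch of that check, assuming a made-up temporary folder (`demo_dir`) with two toy CSVs in place of the real Lahman files:

```r
# Toy stand-in for the real folder: two tiny CSVs in a temporary directory.
# demo_dir and the file contents are made up for this sketch.
demo_dir <- file.path(tempdir(), "lahman_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(yearID = c(1999, 2000), H = c(10, 12)),
          file.path(demo_dir, "Batting.csv"), row.names = FALSE)
write.csv(data.frame(playerID = "doejo01", bats = "L"),
          file.path(demo_dir, "People.csv"), row.names = FALSE)

demo_files <- list.files(path = demo_dir, pattern = "\\.csv$")

# Same loading pattern: one data frame per file, named after the file
for (f in demo_files) {
  assign(f, read.csv(file.path(demo_dir, f)))
}

# TRUE for every file name that now refers to a data frame
loaded <- sapply(demo_files, function(f) exists(f) && is.data.frame(get(f)))
loaded
sum(loaded)
```

With the real data, running the same `sapply()` check over `list_of_files` should report one TRUE per file.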

2. Data Diving

With the relevant information loaded correctly, let’s dig in!

Let’s start with a standard scatter plot of batting average by year:

library(dplyr)
library(ggplot2)

# League-wide batting average per season, with a smoothed trend line
Batting.csv %>%
  group_by(yearID) %>%
  summarise(G = sum(G), AB = sum(AB), R = sum(R), H = sum(H)) %>%
  mutate(batting_average = H/AB) %>%
  ggplot(aes(x = yearID, y = batting_average)) +
  geom_point(color = '#0E61FB') + 
  geom_smooth(color = '#525659') + 
  theme_bw() + 
  guides(fill = "none")

Hmm. The data is a bit all over the place, so let’s try a condensed timeframe.
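A rough sketch of how the timeframe could be condensed is to filter on yearID before aggregating. The helper name `yearly_batting_average` and the 1970 cutoff are assumptions made for illustration (the exact window is not stated here), and the toy rows stand in for Batting.csv:

```r
library(dplyr)

# Hypothetical helper: league-wide batting average per season from start_year on.
# `batting` needs the columns yearID, AB, and H, as in Batting.csv.
yearly_batting_average <- function(batting, start_year) {
  batting %>%
    filter(yearID >= start_year) %>%
    group_by(yearID) %>%
    summarise(AB = sum(AB), H = sum(H)) %>%
    mutate(batting_average = H / AB)
}

# Toy rows standing in for Batting.csv (values are made up)
toy_batting <- data.frame(
  yearID = c(1965, 1970, 1970, 2000),
  AB     = c(400, 500, 300, 600),
  H      = c(100, 150, 50, 150)
)

# Assumed cutoff: keep 1970 onward, which drops the 1965 row
recent <- yearly_batting_average(toy_batting, start_year = 1970)
recent
```

With the real data, the result of `yearly_batting_average(Batting.csv, 1970)` could be piped into the same ggplot() call used for the full scatter plot.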

Interesting: there appears to be a shift around the year 2000, bucking a 30-year trend of increasing batting averages. Expected BA has shifted from nearly .270 to just over .250, a decline of almost 7.5%! Let’s start slicing the data a bit to see if anything else pops out.
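For reference, the percentage figure comes from the relative change between the two averages. A quick check, using the rounded endpoints .270 and .250 as stand-ins for the values read off the trend line:

```r
# Rounded endpoints of the trend (approximate values read off the plot)
ba_before <- 0.270
ba_after  <- 0.250

# Relative decline, expressed as a percentage of the starting average
pct_decline <- (ba_before - ba_after) / ba_before * 100
round(pct_decline, 1)  # 7.4
```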

Do handedness or league play any role? Let’s see.
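One possible way to slice by both is to join batting rows to player records and group by league and batting hand. This is only a sketch on made-up toy data; it assumes the `bats` column from People.csv and the `lgID` column from Batting.csv, joined on `playerID`:

```r
library(dplyr)

# Toy stand-ins for Batting.csv and People.csv (values are made up)
toy_batting <- data.frame(
  playerID = c("a01", "a01", "b01", "c01"),
  yearID   = c(2000, 2001, 2000, 2000),
  lgID     = c("AL", "AL", "NL", "NL"),
  AB       = c(400, 500, 300, 200),
  H        = c(100, 150, 90, 50)
)
toy_people <- data.frame(
  playerID = c("a01", "b01", "c01"),
  bats     = c("L", "R", "R")
)

# Attach each batter's handedness, then aggregate by league and handedness
by_hand_league <- toy_batting %>%
  inner_join(toy_people, by = "playerID") %>%
  group_by(lgID, bats) %>%
  summarise(AB = sum(AB), H = sum(H), .groups = "drop") %>%
  mutate(batting_average = H / AB)

by_hand_league
```

Adding yearID to the group_by() call would give one average per season for each league and handedness combination, which fits the year-by-year view used so far.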


Logan Ice

Feb 2019
