TidyVerse - A Complete History of MLB

Stephanie Roark
12/5/2018

America's Pastime

Let's take a look at the complete history of Major League Baseball from 1871 to present.

The MLB Complete History dataset contains two separate systems for rating teams; the simpler Elo ratings, used for The Complete History Of MLB, and the more involved — and confusingly named — "ratings" that are used in our MLB Predictions. This dataset contains an Elo-based rating system and predictive model for baseball that accounts for home-field advantage, margin of victory, park and era effects, travel, rest and — most importantly — starting pitchers.Elo is a simple way to rate competitors that can be tuned and customized endlessly to incorporate available data.

For baseball fans, all the descriptions for the datasets can be found here: How Our MLB Predictions Work

First, we load the data from FiveThirtyEight.com

library(tidyverse)
library(knitr)

Url <- "https://projects.fivethirtyeight.com/mlb-api/mlb_elo.csv"
MLBdata <- read.csv(file = Url, header = TRUE, sep = ",")
kable(head(MLBdata))

date	season	playoff	team1	team2	elo1_pre	elo2_pre	elo_prob1	elo_prob2	elo1_post	elo2_post	rating1_pre	rating2_pre	pitcher1	pitcher2	pitcher1_rgs	pitcher2_rgs	pitcher1_adj	pitcher2_adj	rating_prob1	rating_prob2	rating1_post	rating2_post	score1	score2
2018-10-28	2018	w	LAD	BOS	1572.027	1603.193	0.4862531	0.5137469	1568.153	1607.066	1576.245	1606.237	Clayton Kershaw	David Price	55.72020	55.90411	8.777848	11.1913925	0.4838771	0.5161229	1572.396	1610.086	1	5
2018-10-27	2018	w	LAD	BOS	1575.480	1599.739	0.4995027	0.5004973	1572.027	1603.193	1579.775	1602.707	Rich Hill	Eduardo Rodriguez	54.47421	53.63815	3.666228	0.3866077	0.5083420	0.4916580	1576.245	1606.237	6	9
2018-10-26	2018	w	LAD	BOS	1573.220	1601.999	0.4908324	0.5091676	1575.480	1599.739	1577.877	1604.605	Walker Buehler	Rick Porcello	57.86924	51.14604	20.863039	-11.1236660	0.5559070	0.4440930	1579.775	1602.707	3	2
2018-10-24	2018	w	BOS	LAD	1600.026	1575.193	0.5926204	0.4073796	1601.999	1573.220	1602.811	1579.671	David Price	Hyun-Jin Ryu	55.36010	51.91669	9.286617	-7.2494460	0.6198083	0.3801917	1604.605	1577.877	4	2
2018-10-23	2018	w	BOS	LAD	1597.035	1578.184	0.5814915	0.4185085	1600.026	1575.193	1600.469	1582.014	Chris Sale	Clayton Kershaw	64.08947	56.62859	50.165559	14.1844384	0.6489541	0.3510459	1602.811	1579.671	8	4
2018-10-20	2018	l	MIL	LAD	1561.263	1574.030	0.5215406	0.4784594	1557.109	1578.184	1565.239	1577.979	Jhoulys Chacin	Walker Buehler	54.69401	57.92168	14.107999	20.4168461	0.5095001	0.4904999	1561.205	1582.014	1	5

What makes data tidy and clean?

Tiday Data:

Data Structures : data is organized in columns and rows with labels.
Data Semantics : data is a collection of values, usually either numbers (if quantitative) or strings (if qualitative) and every value belongs to a variable and an observation.
A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

Or more simply put, tidy data means that each variable forms a column; each observation forms a row; each type of observational unit forms a table.

5 Most common problems to find when tidying data:

• Column headers are values, not variable names. • Multiple variables are stored in one column. • Variables are stored in both rows and columns. • Multiple types of observational units are stored in the same table. • A single observational unit is stored in multiple tables.

Once the data is tidy, you need to clean it (just like a house can't be cleaned until everything is tidyied up.) Check the data for missing values, labels that make sense, errors, units, outliers, etc.

Let's start by looking at the structure of the data and to see if there is any missing data:

#structure of the data - from this we can see that there are both numeric and categorical data
str(MLBdata)

## 'data.frame':    217544 obs. of  26 variables:
##  $ date        : Factor w/ 25902 levels "1871-05-04","1871-05-05",..: 25902 25901 25900 25899 25898 25897 25896 25895 25894 25894 ...
##  $ season      : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ neutral     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ playoff     : Factor w/ 5 levels "","c","d","l",..: 5 5 5 5 5 4 4 4 4 4 ...
##  $ team1       : Factor w/ 89 levels "ANA","ARI","ATL",..: 40 40 40 9 9 44 44 31 31 40 ...
##  $ team2       : Factor w/ 89 levels "ANA","ARI","ATL",..: 9 9 9 40 40 40 40 9 9 44 ...
##  $ elo1_pre    : num  1572 1575 1573 1600 1597 ...
##  $ elo2_pre    : num  1603 1600 1602 1575 1578 ...
##  $ elo_prob1   : num  0.486 0.5 0.491 0.593 0.581 ...
##  $ elo_prob2   : num  0.514 0.5 0.509 0.407 0.419 ...
##  $ elo1_post   : num  1568 1572 1575 1602 1600 ...
##  $ elo2_post   : num  1607 1603 1600 1573 1575 ...
##  $ rating1_pre : num  1576 1580 1578 1603 1600 ...
##  $ rating2_pre : num  1606 1603 1605 1580 1582 ...
##  $ pitcher1    : Factor w/ 5886 levels "","A.J. Burnett",..: 990 4473 5502 1262 949 2624 5483 2772 912 990 ...
##  $ pitcher2    : Factor w/ 5840 levels "","A.J. Burnett",..: 1271 1531 4441 2500 999 5470 2500 1271 4441 5450 ...
##  $ pitcher1_rgs: num  55.7 54.5 57.9 55.4 64.1 ...
##  $ pitcher2_rgs: num  55.9 53.6 51.1 51.9 56.6 ...
##  $ pitcher1_adj: num  8.78 3.67 20.86 9.29 50.17 ...
##  $ pitcher2_adj: num  11.191 0.387 -11.124 -7.249 14.184 ...
##  $ rating_prob1: num  0.484 0.508 0.556 0.62 0.649 ...
##  $ rating_prob2: num  0.516 0.492 0.444 0.38 0.351 ...
##  $ rating1_post: num  1572 1576 1580 1605 1603 ...
##  $ rating2_post: num  1610 1606 1603 1578 1580 ...
##  $ score1      : int  1 6 3 4 8 1 7 1 6 5 ...
##  $ score2      : int  5 9 2 2 4 5 2 4 8 2 ...

#are there NA's in the data set
sum(is.na(MLBdata))

## [1] 136325

We can see that the data consists mostly of numeric values and that there are quite a few NA's.

Let's look to see if there is any missing data or "" values.

#let's look at one column to see if there are missing entries that do not show up as NA's
MLBdata %>% 
    count(playoff)

## # A tibble: 5 x 2
##   playoff      n
##   <fct>    <int>
## 1 ""      215964
## 2 c           14
## 3 d          397
## 4 l          504
## 5 w          665

#read the data back in in a different way to automatically convert "" to NA's
#readr converts "" to NA's while reading in the data
MLBdata <- readr::read_csv("https://projects.fivethirtyeight.com/mlb-api/mlb_elo.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   date = col_date(format = ""),
##   season = col_integer(),
##   neutral = col_integer(),
##   playoff = col_character(),
##   team1 = col_character(),
##   team2 = col_character(),
##   pitcher1 = col_character(),
##   pitcher2 = col_character(),
##   score1 = col_integer(),
##   score2 = col_integer()
## )

## See spec(...) for full column specifications.

sum(is.na(MLBdata))

## [1] 352365

Now, all of the empty spaces are NA's which helps when performing analysis.

This dataset is already organized in columns and rows, but the names of the columns are confusing. So let's give the variables labels that are easier to understand.

# do the column names make sense
colnames(MLBdata)

##  [1] "date"         "season"       "neutral"      "playoff"     
##  [5] "team1"        "team2"        "elo1_pre"     "elo2_pre"    
##  [9] "elo_prob1"    "elo_prob2"    "elo1_post"    "elo2_post"   
## [13] "rating1_pre"  "rating2_pre"  "pitcher1"     "pitcher2"    
## [17] "pitcher1_rgs" "pitcher2_rgs" "pitcher1_adj" "pitcher2_adj"
## [21] "rating_prob1" "rating_prob2" "rating1_post" "rating2_post"
## [25] "score1"       "score2"

# let's rename the columns so that we can understand their meaning
newcolnames <- c("date", "season", "neutral", "playoff", "home_team", "away_team", 
                 "home_pre-elo", "away_pre-elo", "home_probability", 
                 "away_probability", "home_post-elo", "away_post-elo", 
                 "home_pre-rating", "away_pre-rating", "home_pitcher", "away_pitcher", 
                 "home_pitcher-rgs", "away_pitcher-rgs", "home_pitcher-adj",
                 "away_pitcher-adj", "home_rating-probability", "away_rating-probability",
                 "home_post-rating", "away_post-rating", "home_score", "away_score")

colnames(MLBdata) <- newcolnames
#now we have a better understanding of what each variable represents
colnames(MLBdata)

##  [1] "date"                    "season"                 
##  [3] "neutral"                 "playoff"                
##  [5] "home_team"               "away_team"              
##  [7] "home_pre-elo"            "away_pre-elo"           
##  [9] "home_probability"        "away_probability"       
## [11] "home_post-elo"           "away_post-elo"          
## [13] "home_pre-rating"         "away_pre-rating"        
## [15] "home_pitcher"            "away_pitcher"           
## [17] "home_pitcher-rgs"        "away_pitcher-rgs"       
## [19] "home_pitcher-adj"        "away_pitcher-adj"       
## [21] "home_rating-probability" "away_rating-probability"
## [23] "home_post-rating"        "away_post-rating"       
## [25] "home_score"              "away_score"

Next let's look at the numeric values using skimr.

#let's look at summaries of the numeric columns
MLBdata %>%
    select_if(is.numeric) %>%
    skimr::skim()

## Skim summary statistics
##  n obs: 217544 
##  n variables: 20 
## 
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────
##    variable missing complete      n       mean      sd   p0  p25  p50  p75
##  away_score       0   217544 217544    4.4      3.25      0    2    4    6
##  home_score       0   217544 217544    4.65     3.29      0    2    4    6
##     neutral       0   217544 217544    9.7e-05  0.0098    0    0    0    0
##      season       0   217544 217544 1961.58    39.78   1871 1929 1970 1996
##  p100     hist
##    49 ▇▂▁▁▁▁▁▁
##    38 ▇▅▁▁▁▁▁▁
##     1 ▇▁▁▁▁▁▁▁
##  2018 ▂▃▃▃▃▆▆▇
## 
## ── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────
##                 variable missing complete      n    mean     sd      p0
##         away_pitcher-adj   34421   183123 217544    6.61  8      -66.67
##         away_pitcher-rgs   33741   183803 217544   50.31  4.18    35.83
##            away_post-elo       0   217544 217544 1503.41 38.12  1333.07
##         away_post-rating       0   217544 217544 1503.42 38.34  1333.14
##             away_pre-elo       0   217544 217544 1503.39 38.01  1334.93
##          away_pre-rating       0   217544 217544 1503.39 38.23  1335   
##         away_probability       0   217544 217544    0.47  0.077    0.18
##  away_rating-probability       0   217544 217544    0.47  0.079    0.18
##         home_pitcher-adj   34422   183122 217544    6.58  8.04   -63.32
##         home_pitcher-rgs   33741   183803 217544   50.34  4.18    35.44
##            home_post-elo       0   217544 217544 1503.77 37.93  1346.22
##         home_post-rating       0   217544 217544 1503.76 38.16  1346.38
##             home_pre-elo       0   217544 217544 1503.79 37.85  1347.37
##          home_pre-rating       0   217544 217544 1503.79 38.07  1347.53
##         home_probability       0   217544 217544    0.53  0.077    0.21
##  home_rating-probability       0   217544 217544    0.53  0.079    0.2 
##      p25     p50     p75    p100     hist
##     3.82    7.54    9.82   79.4  ▁▁▁▅▇▁▁▁
##    47.4    49.73   52.8    78.05 ▁▂▇▅▁▁▁▁
##  1478.73 1505.34 1529.24 1657.99 ▁▁▂▆▇▃▁▁
##  1478.73 1505.41 1529.25 1659.34 ▁▁▂▆▇▃▁▁
##  1478.77 1505.31 1529.16 1657.99 ▁▁▂▆▇▃▁▁
##  1478.77 1505.4  1529.13 1659.34 ▁▁▂▆▇▃▁▁
##     0.41    0.46    0.52    0.79 ▁▁▅▇▆▂▁▁
##     0.41    0.46    0.52    0.8  ▁▁▅▇▆▂▁▁
##     3.82    7.54    9.82   76.3  ▁▁▁▅▇▁▁▁
##    47.4    49.75   52.8    77.67 ▁▂▇▅▁▁▁▁
##  1478.95 1505.63 1529.62 1657.62 ▁▁▂▆▇▃▁▁
##  1478.95 1505.67 1529.53 1658.98 ▁▁▃▇▇▃▁▁
##  1479.05 1505.65 1529.57 1656.64 ▁▁▂▇▇▃▁▁
##  1479.04 1505.68 1529.5  1658.01 ▁▁▃▇▇▃▁▁
##     0.48    0.54    0.59    0.82 ▁▁▂▆▇▅▁▁
##     0.48    0.54    0.59    0.82 ▁▁▂▆▇▅▁▁

#we can see that some variables are shown as numeric but are really ordinal, like season for example
#what are the variables classified as

We can see from the histogram that some of the numeric values, like season for example, are really ordinal and not continuously numeric. We could change season to be not numeric(others too maybe) if this would help in analysis, but I will leave it for now.

I want to see if the San Francisco Giants win more games home or away as I often hear announcers discussing teams struggling at one vs. another.

Let's separate out the home games and the away games to see where they win more often. We will look at the net wins and losses for each season.

#create tables for giants home and giants away to see if they play better at either
MLB_GiantsHome <- MLBdata  %>% 
    filter(home_team == "SFG") %>%
    mutate(sfg_win = ifelse(home_score > away_score, 1, 0)) %>%
    group_by(season) %>%
    summarise(home_wins = sum(sfg_win))


MLB_GiantsAway <- MLBdata  %>% 
    filter(away_team == "SFG") %>%
    mutate(sfg_win = ifelse(away_score > home_score, 1, 0)) %>%
    group_by(season) %>%
    summarise(away_wins = sum(sfg_win))


MLB_GiantsAll <- full_join(MLB_GiantsHome, MLB_GiantsAway, by="season") %>%
    mutate(net_home_wins = home_wins - away_wins)

kable(head(MLB_GiantsAll))

season	home_wins	away_wins	net_home_wins
1883	28	18	10
1884	34	28	6
1885	51	34	17
1886	47	28	19
1887	36	32	4
1888	43	40	3

Now we will plot the net wins and losses by season to see where they win most often and if it's true that they might struggle to win at home.

#create visualization for giants home and for giants away
ggplot(MLB_GiantsAll, aes(season, net_home_wins) ) +
    geom_step()

We can see from the plot of net wins per season, that there were 13 seasons where the San Francisco Giants did actually win more games away than at home. It could be interesting to investigate those seasons to understand why that may have occured.

Also, further analysis could examine the net wins by proportion of games per year and the number of home games vs. away games accounted for.