Stephanie Roark
12/5/2018
Let's take a look at the complete history of Major League Baseball from 1871 to present.
The MLB Complete History dataset contains two separate systems for rating teams; the simpler Elo ratings, used for The Complete History Of MLB, and the more involved — and confusingly named — "ratings" that are used in our MLB Predictions. This dataset contains an Elo-based rating system and predictive model for baseball that accounts for home-field advantage, margin of victory, park and era effects, travel, rest and — most importantly — starting pitchers.Elo is a simple way to rate competitors that can be tuned and customized endlessly to incorporate available data.
For baseball fans, all the descriptions for the datasets can be found here: How Our MLB Predictions Work
First, we load the data from FiveThirtyEight.com
library(tidyverse)
library(knitr)
Url <- "https://projects.fivethirtyeight.com/mlb-api/mlb_elo.csv"
MLBdata <- read.csv(file = Url, header = TRUE, sep = ",")
kable(head(MLBdata))| date | season | neutral | playoff | team1 | team2 | elo1_pre | elo2_pre | elo_prob1 | elo_prob2 | elo1_post | elo2_post | rating1_pre | rating2_pre | pitcher1 | pitcher2 | pitcher1_rgs | pitcher2_rgs | pitcher1_adj | pitcher2_adj | rating_prob1 | rating_prob2 | rating1_post | rating2_post | score1 | score2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-10-28 | 2018 | 0 | w | LAD | BOS | 1572.027 | 1603.193 | 0.4862531 | 0.5137469 | 1568.153 | 1607.066 | 1576.245 | 1606.237 | Clayton Kershaw | David Price | 55.72020 | 55.90411 | 8.777848 | 11.1913925 | 0.4838771 | 0.5161229 | 1572.396 | 1610.086 | 1 | 5 |
| 2018-10-27 | 2018 | 0 | w | LAD | BOS | 1575.480 | 1599.739 | 0.4995027 | 0.5004973 | 1572.027 | 1603.193 | 1579.775 | 1602.707 | Rich Hill | Eduardo Rodriguez | 54.47421 | 53.63815 | 3.666228 | 0.3866077 | 0.5083420 | 0.4916580 | 1576.245 | 1606.237 | 6 | 9 |
| 2018-10-26 | 2018 | 0 | w | LAD | BOS | 1573.220 | 1601.999 | 0.4908324 | 0.5091676 | 1575.480 | 1599.739 | 1577.877 | 1604.605 | Walker Buehler | Rick Porcello | 57.86924 | 51.14604 | 20.863039 | -11.1236660 | 0.5559070 | 0.4440930 | 1579.775 | 1602.707 | 3 | 2 |
| 2018-10-24 | 2018 | 0 | w | BOS | LAD | 1600.026 | 1575.193 | 0.5926204 | 0.4073796 | 1601.999 | 1573.220 | 1602.811 | 1579.671 | David Price | Hyun-Jin Ryu | 55.36010 | 51.91669 | 9.286617 | -7.2494460 | 0.6198083 | 0.3801917 | 1604.605 | 1577.877 | 4 | 2 |
| 2018-10-23 | 2018 | 0 | w | BOS | LAD | 1597.035 | 1578.184 | 0.5814915 | 0.4185085 | 1600.026 | 1575.193 | 1600.469 | 1582.014 | Chris Sale | Clayton Kershaw | 64.08947 | 56.62859 | 50.165559 | 14.1844384 | 0.6489541 | 0.3510459 | 1602.811 | 1579.671 | 8 | 4 |
| 2018-10-20 | 2018 | 0 | l | MIL | LAD | 1561.263 | 1574.030 | 0.5215406 | 0.4784594 | 1557.109 | 1578.184 | 1565.239 | 1577.979 | Jhoulys Chacin | Walker Buehler | 54.69401 | 57.92168 | 14.107999 | 20.4168461 | 0.5095001 | 0.4904999 | 1561.205 | 1582.014 | 1 | 5 |
What makes data tidy and clean?
Tiday Data:
Or more simply put, tidy data means that each variable forms a column; each observation forms a row; each type of observational unit forms a table.
5 Most common problems to find when tidying data:
• Column headers are values, not variable names. • Multiple variables are stored in one column. • Variables are stored in both rows and columns. • Multiple types of observational units are stored in the same table. • A single observational unit is stored in multiple tables.
Once the data is tidy, you need to clean it (just like a house can't be cleaned until everything is tidyied up.) Check the data for missing values, labels that make sense, errors, units, outliers, etc.
Let's start by looking at the structure of the data and to see if there is any missing data:
#structure of the data - from this we can see that there are both numeric and categorical data
str(MLBdata)## 'data.frame': 217544 obs. of 26 variables:
## $ date : Factor w/ 25902 levels "1871-05-04","1871-05-05",..: 25902 25901 25900 25899 25898 25897 25896 25895 25894 25894 ...
## $ season : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ neutral : int 0 0 0 0 0 0 0 0 0 0 ...
## $ playoff : Factor w/ 5 levels "","c","d","l",..: 5 5 5 5 5 4 4 4 4 4 ...
## $ team1 : Factor w/ 89 levels "ANA","ARI","ATL",..: 40 40 40 9 9 44 44 31 31 40 ...
## $ team2 : Factor w/ 89 levels "ANA","ARI","ATL",..: 9 9 9 40 40 40 40 9 9 44 ...
## $ elo1_pre : num 1572 1575 1573 1600 1597 ...
## $ elo2_pre : num 1603 1600 1602 1575 1578 ...
## $ elo_prob1 : num 0.486 0.5 0.491 0.593 0.581 ...
## $ elo_prob2 : num 0.514 0.5 0.509 0.407 0.419 ...
## $ elo1_post : num 1568 1572 1575 1602 1600 ...
## $ elo2_post : num 1607 1603 1600 1573 1575 ...
## $ rating1_pre : num 1576 1580 1578 1603 1600 ...
## $ rating2_pre : num 1606 1603 1605 1580 1582 ...
## $ pitcher1 : Factor w/ 5886 levels "","A.J. Burnett",..: 990 4473 5502 1262 949 2624 5483 2772 912 990 ...
## $ pitcher2 : Factor w/ 5840 levels "","A.J. Burnett",..: 1271 1531 4441 2500 999 5470 2500 1271 4441 5450 ...
## $ pitcher1_rgs: num 55.7 54.5 57.9 55.4 64.1 ...
## $ pitcher2_rgs: num 55.9 53.6 51.1 51.9 56.6 ...
## $ pitcher1_adj: num 8.78 3.67 20.86 9.29 50.17 ...
## $ pitcher2_adj: num 11.191 0.387 -11.124 -7.249 14.184 ...
## $ rating_prob1: num 0.484 0.508 0.556 0.62 0.649 ...
## $ rating_prob2: num 0.516 0.492 0.444 0.38 0.351 ...
## $ rating1_post: num 1572 1576 1580 1605 1603 ...
## $ rating2_post: num 1610 1606 1603 1578 1580 ...
## $ score1 : int 1 6 3 4 8 1 7 1 6 5 ...
## $ score2 : int 5 9 2 2 4 5 2 4 8 2 ...
#are there NA's in the data set
sum(is.na(MLBdata))## [1] 136325
We can see that the data consists mostly of numeric values and that there are quite a few NA's.
Let's look to see if there is any missing data or "" values.
#let's look at one column to see if there are missing entries that do not show up as NA's
MLBdata %>%
count(playoff)## # A tibble: 5 x 2
## playoff n
## <fct> <int>
## 1 "" 215964
## 2 c 14
## 3 d 397
## 4 l 504
## 5 w 665
#read the data back in in a different way to automatically convert "" to NA's
#readr converts "" to NA's while reading in the data
MLBdata <- readr::read_csv("https://projects.fivethirtyeight.com/mlb-api/mlb_elo.csv")## Parsed with column specification:
## cols(
## .default = col_double(),
## date = col_date(format = ""),
## season = col_integer(),
## neutral = col_integer(),
## playoff = col_character(),
## team1 = col_character(),
## team2 = col_character(),
## pitcher1 = col_character(),
## pitcher2 = col_character(),
## score1 = col_integer(),
## score2 = col_integer()
## )
## See spec(...) for full column specifications.
sum(is.na(MLBdata))## [1] 352365
Now, all of the empty spaces are NA's which helps when performing analysis.
This dataset is already organized in columns and rows, but the names of the columns are confusing. So let's give the variables labels that are easier to understand.
# do the column names make sense
colnames(MLBdata)## [1] "date" "season" "neutral" "playoff"
## [5] "team1" "team2" "elo1_pre" "elo2_pre"
## [9] "elo_prob1" "elo_prob2" "elo1_post" "elo2_post"
## [13] "rating1_pre" "rating2_pre" "pitcher1" "pitcher2"
## [17] "pitcher1_rgs" "pitcher2_rgs" "pitcher1_adj" "pitcher2_adj"
## [21] "rating_prob1" "rating_prob2" "rating1_post" "rating2_post"
## [25] "score1" "score2"
# let's rename the columns so that we can understand their meaning
newcolnames <- c("date", "season", "neutral", "playoff", "home_team", "away_team",
"home_pre-elo", "away_pre-elo", "home_probability",
"away_probability", "home_post-elo", "away_post-elo",
"home_pre-rating", "away_pre-rating", "home_pitcher", "away_pitcher",
"home_pitcher-rgs", "away_pitcher-rgs", "home_pitcher-adj",
"away_pitcher-adj", "home_rating-probability", "away_rating-probability",
"home_post-rating", "away_post-rating", "home_score", "away_score")
colnames(MLBdata) <- newcolnames
#now we have a better understanding of what each variable represents
colnames(MLBdata)## [1] "date" "season"
## [3] "neutral" "playoff"
## [5] "home_team" "away_team"
## [7] "home_pre-elo" "away_pre-elo"
## [9] "home_probability" "away_probability"
## [11] "home_post-elo" "away_post-elo"
## [13] "home_pre-rating" "away_pre-rating"
## [15] "home_pitcher" "away_pitcher"
## [17] "home_pitcher-rgs" "away_pitcher-rgs"
## [19] "home_pitcher-adj" "away_pitcher-adj"
## [21] "home_rating-probability" "away_rating-probability"
## [23] "home_post-rating" "away_post-rating"
## [25] "home_score" "away_score"
Next let's look at the numeric values using skimr.
#let's look at summaries of the numeric columns
MLBdata %>%
select_if(is.numeric) %>%
skimr::skim()## Skim summary statistics
## n obs: 217544
## n variables: 20
##
## ── Variable type:integer ───────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75
## away_score 0 217544 217544 4.4 3.25 0 2 4 6
## home_score 0 217544 217544 4.65 3.29 0 2 4 6
## neutral 0 217544 217544 9.7e-05 0.0098 0 0 0 0
## season 0 217544 217544 1961.58 39.78 1871 1929 1970 1996
## p100 hist
## 49 ▇▂▁▁▁▁▁▁
## 38 ▇▅▁▁▁▁▁▁
## 1 ▇▁▁▁▁▁▁▁
## 2018 ▂▃▃▃▃▆▆▇
##
## ── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0
## away_pitcher-adj 34421 183123 217544 6.61 8 -66.67
## away_pitcher-rgs 33741 183803 217544 50.31 4.18 35.83
## away_post-elo 0 217544 217544 1503.41 38.12 1333.07
## away_post-rating 0 217544 217544 1503.42 38.34 1333.14
## away_pre-elo 0 217544 217544 1503.39 38.01 1334.93
## away_pre-rating 0 217544 217544 1503.39 38.23 1335
## away_probability 0 217544 217544 0.47 0.077 0.18
## away_rating-probability 0 217544 217544 0.47 0.079 0.18
## home_pitcher-adj 34422 183122 217544 6.58 8.04 -63.32
## home_pitcher-rgs 33741 183803 217544 50.34 4.18 35.44
## home_post-elo 0 217544 217544 1503.77 37.93 1346.22
## home_post-rating 0 217544 217544 1503.76 38.16 1346.38
## home_pre-elo 0 217544 217544 1503.79 37.85 1347.37
## home_pre-rating 0 217544 217544 1503.79 38.07 1347.53
## home_probability 0 217544 217544 0.53 0.077 0.21
## home_rating-probability 0 217544 217544 0.53 0.079 0.2
## p25 p50 p75 p100 hist
## 3.82 7.54 9.82 79.4 ▁▁▁▅▇▁▁▁
## 47.4 49.73 52.8 78.05 ▁▂▇▅▁▁▁▁
## 1478.73 1505.34 1529.24 1657.99 ▁▁▂▆▇▃▁▁
## 1478.73 1505.41 1529.25 1659.34 ▁▁▂▆▇▃▁▁
## 1478.77 1505.31 1529.16 1657.99 ▁▁▂▆▇▃▁▁
## 1478.77 1505.4 1529.13 1659.34 ▁▁▂▆▇▃▁▁
## 0.41 0.46 0.52 0.79 ▁▁▅▇▆▂▁▁
## 0.41 0.46 0.52 0.8 ▁▁▅▇▆▂▁▁
## 3.82 7.54 9.82 76.3 ▁▁▁▅▇▁▁▁
## 47.4 49.75 52.8 77.67 ▁▂▇▅▁▁▁▁
## 1478.95 1505.63 1529.62 1657.62 ▁▁▂▆▇▃▁▁
## 1478.95 1505.67 1529.53 1658.98 ▁▁▃▇▇▃▁▁
## 1479.05 1505.65 1529.57 1656.64 ▁▁▂▇▇▃▁▁
## 1479.04 1505.68 1529.5 1658.01 ▁▁▃▇▇▃▁▁
## 0.48 0.54 0.59 0.82 ▁▁▂▆▇▅▁▁
## 0.48 0.54 0.59 0.82 ▁▁▂▆▇▅▁▁
#we can see that some variables are shown as numeric but are really ordinal, like season for example
#what are the variables classified asWe can see from the histogram that some of the numeric values, like season for example, are really ordinal and not continuously numeric. We could change season to be not numeric(others too maybe) if this would help in analysis, but I will leave it for now.
I want to see if the San Francisco Giants win more games home or away as I often hear announcers discussing teams struggling at one vs. another.
Let's separate out the home games and the away games to see where they win more often. We will look at the net wins and losses for each season.
#create tables for giants home and giants away to see if they play better at either
MLB_GiantsHome <- MLBdata %>%
filter(home_team == "SFG") %>%
mutate(sfg_win = ifelse(home_score > away_score, 1, 0)) %>%
group_by(season) %>%
summarise(home_wins = sum(sfg_win))
MLB_GiantsAway <- MLBdata %>%
filter(away_team == "SFG") %>%
mutate(sfg_win = ifelse(away_score > home_score, 1, 0)) %>%
group_by(season) %>%
summarise(away_wins = sum(sfg_win))
MLB_GiantsAll <- full_join(MLB_GiantsHome, MLB_GiantsAway, by="season") %>%
mutate(net_home_wins = home_wins - away_wins)
kable(head(MLB_GiantsAll))| season | home_wins | away_wins | net_home_wins |
|---|---|---|---|
| 1883 | 28 | 18 | 10 |
| 1884 | 34 | 28 | 6 |
| 1885 | 51 | 34 | 17 |
| 1886 | 47 | 28 | 19 |
| 1887 | 36 | 32 | 4 |
| 1888 | 43 | 40 | 3 |
Now we will plot the net wins and losses by season to see where they win most often and if it's true that they might struggle to win at home.
#create visualization for giants home and for giants away
ggplot(MLB_GiantsAll, aes(season, net_home_wins) ) +
geom_step()We can see from the plot of net wins per season, that there were 13 seasons where the San Francisco Giants did actually win more games away than at home. It could be interesting to investigate those seasons to understand why that may have occured.
Also, further analysis could examine the net wins by proportion of games per year and the number of home games vs. away games accounted for.