The data we used for this assignment can be found at the following GitHub location, and is provided via FiveThirtyEight’s publicly available databases. The data contains predictions for all Major League Baseball games from the latest season (2022), which are made via Elo ratings. The specifics of how FiveThirtyEight calculates these Elo ratings and the resulting predictions can be found in this article.
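For reference, while FiveThirtyEight’s full methodology layers on adjustments such as home-field advantage and starting pitchers (see the linked article), the core Elo win expectancy takes a standard form. The function below is a minimal sketch of that standard formula, not FiveThirtyEight’s exact implementation:

# Standard Elo win expectancy: probability that a team rated r1
# beats a team rated r2. FiveThirtyEight layers additional
# adjustments (home-field advantage, starting pitchers, etc.)
# on top of this base formula.
elo_win_prob <- function(r1, r2) {
  1 / (1 + 10^((r2 - r1) / 400))
}

elo_win_prob(1619, 1458)  # ~0.716, before any home-field adjustment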
The following chunk loads the libraries needed to run the subsequent R code for this assignment.
library("RCurl")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("ggplot2")
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.4
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The following chunk provides the code to import and take a first look at our data:
# Download the latest MLB Elo predictions CSV from FiveThirtyEight
csv_data <- getURL("https://projects.fivethirtyeight.com/mlb-api/mlb_elo_latest.csv")
# Parse the CSV text into a data frame
df <- data.frame(read.csv(text=csv_data))
glimpse(df)
## Rows: 2,430
## Columns: 26
## $ date <chr> "2022-10-05", "2022-10-05", "2022-10-05", "2022-10-05", "…
## $ season <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 202…
## $ neutral <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ playoff <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ team1 <chr> "LAD", "SEA", "SDP", "NYM", "MIL", "HOU", "FLA", "CLE", "…
## $ team2 <chr> "COL", "DET", "SFG", "WSN", "ARI", "PHI", "ATL", "KCR", "…
## $ elo1_pre <dbl> 1619.258, 1538.452, 1511.278, 1559.803, 1519.420, 1576.95…
## $ elo2_pre <dbl> 1458.059, 1449.094, 1500.692, 1427.806, 1486.195, 1529.62…
## $ elo_prob1 <dbl> 0.7438531, 0.6575839, 0.5496101, 0.7105367, 0.5816163, 0.…
## $ elo_prob2 <dbl> 0.2561469, 0.3424161, 0.4503899, 0.2894633, 0.4183837, 0.…
## $ elo1_post <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ elo2_post <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rating1_pre <dbl> 1619.903, 1525.827, 1522.537, 1563.657, 1528.464, 1568.70…
## $ rating2_pre <dbl> 1453.174, 1450.817, 1494.614, 1432.806, 1478.508, 1535.71…
## $ pitcher1 <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ pitcher2 <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ pitcher1_rgs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pitcher2_rgs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pitcher1_adj <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pitcher2_adj <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rating_prob1 <dbl> 0.7505160, 0.6547150, 0.5815474, 0.7196434, 0.6075037, 0.…
## $ rating_prob2 <dbl> 0.2494840, 0.3452850, 0.4184526, 0.2803566, 0.3924963, 0.…
## $ rating1_post <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rating2_post <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ score1 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ score2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
The glimpse() function reveals that this data set has 26 features and 2,430 observations, which in this case are all the MLB games in the 2022 regular season (30 teams × 162 games ÷ 2 = 2,430).
The following is a description of each of the column names listed above, which can also be found in the data set documentation.
| Column Name | Description |
|---|---|
| date | Date of game |
| season | Year of season |
| neutral | Whether game was on a neutral site |
| playoff | Whether game was in playoffs, and the playoff round if so |
| team1 | Abbreviation for home team |
| team2 | Abbreviation for away team |
| elo1_pre | Home team’s Elo rating before the game |
| elo2_pre | Away team’s Elo rating before the game |
| elo_prob1 | Home team’s probability of winning according to Elo ratings |
| elo_prob2 | Away team’s probability of winning according to Elo ratings |
| elo1_post | Home team’s Elo rating after the game |
| elo2_post | Away team’s Elo rating after the game |
| rating1_pre | Home team’s rating before the game |
| rating2_pre | Away team’s rating before the game |
| pitcher1 | Name of home starting pitcher |
| pitcher2 | Name of away starting pitcher |
| pitcher1_rgs | Home starting pitcher’s rolling game score before the game |
| pitcher2_rgs | Away starting pitcher’s rolling game score before the game |
| pitcher1_adj | Home starting pitcher’s adjustment to their team’s rating |
| pitcher2_adj | Away starting pitcher’s adjustment to their team’s rating |
| rating_prob1 | Home team’s probability of winning according to team ratings and starting pitchers |
| rating_prob2 | Away team’s probability of winning according to team ratings and starting pitchers |
| rating1_post | Home team’s rating after the game |
| rating2_post | Away team’s rating after the game |
| score1 | Home team’s score |
| score2 | Away team’s score |
Some of the columns in the above data set are redundant, since they can be easily derived from others. For example, we do not need both probability columns for a game: if the probability of one team winning is p, the probability of the other team winning is simply 1 - p. Thus, we can subset our data frame to remove two redundant probability columns, elo_prob2 and rating_prob2. Additionally, two columns provide no information at all and can also be removed: playoff, which contains only NA values, and season, which contains only the value 2022, so neither distinguishes any game from any other.
The following R blocks confirm this by showing the single unique value in each of the playoff and season columns (NA and 2022, respectively).
print(unique(df$playoff))
## [1] NA
print(unique(df$season))
## [1] 2022
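We can also sanity-check the probability-redundancy claim before dropping any columns. This check is an addition, not part of the original assignment; it confirms that each pair of win probabilities sums to one, up to floating-point rounding:

# Each game's two win probabilities should be complements of each other
all(abs(df$elo_prob1 + df$elo_prob2 - 1) < 1e-8)
all(abs(df$rating_prob1 + df$rating_prob2 - 1) < 1e-8)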
The following code block subsets our data frame to remove the aforementioned redundant and uninformative columns. It then uses the dim() function to confirm that 4 of our original 26 columns have been removed (22 remain):
df <- subset(df, select = -c(rating_prob2, elo_prob2, playoff, season))
dim(df)
## [1] 2430 22
Additionally, there are some columns that would be useful to add to the data set in order to evaluate the effectiveness of FiveThirtyEight’s predictions. For games that have already been played, we can compare both of their prediction metrics against the actual results. First, the following chunk subsets the data to include only games that have been played, by checking for non-NA values in the score1 and score2 columns.
df <- subset(df, !is.na(score1) & !is.na(score2))
dim(df)
## [1] 1958 22
The dim() command in the code block above shows that 1,958 rows remain after filtering, meaning that we have removed 2,430 - 1,958 = 472 unplayed (as of this point) games.
Next, the chunk below creates two additional columns, elo_correct and rating_correct, which record whether the respective elo_prob1 and rating_prob1 columns correctly predicted the result of each game.
# Flag games where the Elo-based probability picked the eventual winner
df <- df %>%
  mutate(elo_correct = ifelse((elo_prob1 > 0.5 & score1 > score2) |
                                (elo_prob1 < 0.5 & score1 < score2),
                              TRUE, FALSE))
# Flag games where the rating-based probability picked the eventual winner
df <- df %>%
  mutate(rating_correct = ifelse((rating_prob1 > 0.5 & score1 > score2) |
                                   (rating_prob1 < 0.5 & score1 < score2),
                                 TRUE, FALSE))
dim(df)
## [1] 1958 24
The dim() function above confirms that two new columns have been added to our data frame.
Now that the data has been cleaned and prepped, our final data frame can be further analyzed to answer some interesting questions. First of all, we can check how often FiveThirtyEight was able to accurately predict the result of MLB games. The chunk below prints the percentage of games in which the elo_prob1 field correctly predicted the result:
nrow(subset(df, elo_correct == TRUE)) / nrow(df) * 100
## [1] 58.78447
The chunk below does the same for the rating_prob1 results:
nrow(subset(df, rating_correct == TRUE)) / nrow(df) * 100
## [1] 59.75485
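As a side note (a stylistic alternative, not part of the original code): because these columns are logical, the same percentages can be computed directly with mean():

# mean() of a logical vector is the proportion of TRUE values
mean(df$elo_correct) * 100
mean(df$rating_correct) * 100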
As you can see from the outputs above, both prediction methodologies predict the correct outcome of MLB games better than chance. However, additional analysis is required to determine whether this improvement is statistically significant.
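One possible way to test this (a sketch of a follow-up, not part of the original analysis) is an exact binomial test: under the null hypothesis that the predictions are no better than a coin flip, the number of correct predictions follows a Binomial(n, 0.5) distribution. Note that 50% is a conservative baseline here, since simply always picking the home team historically wins somewhat more than half of MLB games.

# Exact binomial test of the rating-based accuracy against a 50% baseline
binom.test(x = sum(df$rating_correct), n = nrow(df),
           p = 0.5, alternative = "greater")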
These accuracy results can also be visualized in the following bar plots:
barplot(table(df$elo_correct), ylim=c(0,1200),
main="How Often elo_prob1 Predicted Game Outcomes")
barplot(table(df$rating_correct), ylim=c(0,1200),
main="How Often rating_prob1 Predicted Game Outcomes")
Additionally, if we separate out the correct and incorrect predictions, we can check whether there is a difference in how close the predicted probabilities were to a 50-50 toss-up in each group. For this analysis, we will use only the rating_correct column, since it was the more accurate of the two.
df_correct <- subset(df, rating_correct == TRUE)
df_incorrect <- subset(df, rating_correct == FALSE)
Next, we can add a field capturing the difference between the two teams’ win probabilities. This can be written mathematically as |p - (1 - p)| = |2p - 1|, and is computed in R in the following chunk:
df_correct <- df_correct %>%
mutate(prob_diff = abs(2 * rating_prob1 - 1))
df_incorrect <- df_incorrect %>%
mutate(prob_diff = abs(2 * rating_prob1 - 1))
Next, we can compare the distributions of win-probability differences for the correct and incorrect predictions using overlaid histograms (graphic inspired by this stack overflow post):
# Stack the two groups into one data frame for plotting
dat1 <- data.frame(x=df_correct$prob_diff, group="Correct Predictions")
dat2 <- data.frame(x=df_incorrect$prob_diff, group="Incorrect Predictions")
dat <- rbind(dat1, dat2)

ggplot(dat, aes(x, fill=group, colour=group)) +
  geom_histogram(aes(y=after_stat(density)), binwidth = 0.025,
                 alpha=0.6, position="identity", lwd=0.2) +
  ggtitle("Probability Diffs Comparing Correct vs. Incorrect Predictions") +
  xlab('Probability Difference')
The graphic above shows that the distribution of probability differences for incorrect predictions is shifted slightly to the left of the distribution for correct predictions, suggesting that FiveThirtyEight’s predicted probabilities are, on average, closer to 50-50 when the prediction turns out to be wrong than when it turns out to be right. Once again, further statistical testing would be needed to confirm this.
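A sketch of one such test (again an addition, not part of the original analysis): a two-sample Wilcoxon rank-sum test of whether the probability differences tend to be larger for correct predictions than for incorrect ones. A nonparametric test is used here as an assumption, to avoid assuming the prob_diff values are normally distributed.

# Nonparametric comparison: are win-probability gaps larger on
# correctly predicted games than on incorrectly predicted ones?
wilcox.test(df_correct$prob_diff, df_incorrect$prob_diff,
            alternative = "greater")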
The above code goes through the process of importing, cleaning, and preparing a data set, and demonstrates a few analyses that these initial steps make possible. It does seem as though FiveThirtyEight’s prediction methodology correctly predicts the outcome of MLB games more often than chance, but a natural next step would be to verify statistically whether that is indeed the case.