First we load the necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rvest)
## Loading required package: xml2
library(ggplot2)
library(stringr)
library(knitr)
Then use rvest to scrape the MLB standings from espn.com
season <- 2018
url <- paste0('http://www.espn.com/mlb/standings/_/sort/winpercent/dir/desc/season/', season, '/group/overall')
url_html <- url %>% read_html()
html_tables <- url_html %>% html_table(fill=T)
teams <- html_tables[[2]]
teams %>% head()
##
## 1 * --BOSBoston Red Sox
## 2 x --HOUHouston Astros
## 3 y --NYYNew York Yankees
## 4 y --OAKOakland Athletics
## 5 * --MILMilwaukee Brewers
## 6 y --CHCChicago Cubs
# use regex to extract team acronym - get uppercase letters followed by lowercase, then delete last two characters
teams <- teams %>% rename(team=1) %>% mutate(team=str_extract_all(team, "[A-Z]{3,4}[a-z]") %>% substr(1, nchar(.)-2))
teams %>% head()
## team
## 1 BOS
## 2 HOU
## 3 NYY
## 4 OAK
## 5 MIL
## 6 CHC
standings <- html_tables[[4]]
standings %>% head()
## W L PCT GB HOME AWAY RS RA DIFF STRK L10
## 1 108 54 0.667 - 57-24 51-30 876 647 229 W1 5-5
## 2 103 59 0.636 5 46-35 57-24 797 534 263 L1 8-2
## 3 100 62 0.617 8 53-28 47-34 851 669 182 L1 7-3
## 4 97 65 0.599 11 50-31 47-34 813 674 139 L1 6-4
## 5 96 67 0.589 12.5 51-30 45-37 754 659 95 W8 9-1
## 6 95 68 0.583 13.5 51-31 44-37 761 645 116 L1 6-4
team_standings <- bind_cols(teams, standings)
team_standings %>% head() %>% kable()
| team | W | L | PCT | GB | HOME | AWAY | RS | RA | DIFF | STRK | L10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BOS | 108 | 54 | 0.667 | - | 57-24 | 51-30 | 876 | 647 | 229 | W1 | 5-5 |
| HOU | 103 | 59 | 0.636 | 5 | 46-35 | 57-24 | 797 | 534 | 263 | L1 | 8-2 |
| NYY | 100 | 62 | 0.617 | 8 | 53-28 | 47-34 | 851 | 669 | 182 | L1 | 7-3 |
| OAK | 97 | 65 | 0.599 | 11 | 50-31 | 47-34 | 813 | 674 | 139 | L1 | 6-4 |
| MIL | 96 | 67 | 0.589 | 12.5 | 51-30 | 45-37 | 754 | 659 | 95 | W8 | 9-1 |
| CHC | 95 | 68 | 0.583 | 13.5 | 51-31 | 44-37 | 761 | 645 | 116 | L1 | 6-4 |
Scoring runs should lead to winning
team_standings %>% ggplot(aes(x=RS, y=W)) + geom_point() + geom_smooth(method="lm", se=T)
basic_lm_fit <- lm(W ~ RS, data=team_standings)
summary(basic_lm_fit)
##
## Call:
## lm(formula = W ~ RS, data = team_standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7420 -5.9544 0.7703 5.6977 15.2073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.61397 14.62196 -2.572 0.0157 *
## RS 0.16456 0.02017 8.157 7.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.151 on 28 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6932
## F-statistic: 66.53 on 1 and 28 DF, p-value: 7.031e-09
The R2 is around 0.69, which is a Pearson correlation of about r=0.83. Not bad.
Winning is also influenced by how many runs a team allows
team_standings %>% ggplot(aes(x=DIFF, y=W)) + geom_point() + geom_smooth(method="lm", se=T)
multi_lm_fit <- lm(W ~ RS + RA, data=team_standings)
summary(multi_lm_fit)
##
## Call:
## lm(formula = W ~ RS + RA, data = team_standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.009 -2.561 0.108 2.563 11.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.29310 17.88598 4.769 5.67e-05 ***
## RS 0.09443 0.01458 6.477 6.08e-07 ***
## RA -0.10034 0.01294 -7.755 2.44e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.621 on 27 degrees of freedom
## Multiple R-squared: 0.9082, Adjusted R-squared: 0.9014
## F-statistic: 133.6 on 2 and 27 DF, p-value: 9.935e-15
The adjusted R2 has gone up to 0.90. Adding in Runs Allowed made a significant improvement to the model.
We acquired our data via web scraping and cleaned it using regular expressions. We then built a basic linear model, which was a decent fit. But we made it even better by expanding it slightly to a multivariate regression.
We now have a simple but effective (and interpretable) model for handicapping how many Wins a baseball team can expect based on how many runs they score and how many runs they allow.