Libraries

First we load the necessary libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rvest)
## Loading required package: xml2
library(ggplot2)
library(stringr)
library(knitr)

Get the data

Then use rvest to scrape the MLB standings from espn.com

season <- 2018
url <- paste0('http://www.espn.com/mlb/standings/_/sort/winpercent/dir/desc/season/', season, '/group/overall')
url_html <- url %>% read_html()
html_tables <- url_html %>% html_table(fill=T)
teams <- html_tables[[2]]
teams %>% head()
##                           
## 1    * --BOSBoston Red Sox
## 2    x --HOUHouston Astros
## 3  y --NYYNew York Yankees
## 4 y --OAKOakland Athletics
## 5 * --MILMilwaukee Brewers
## 6      y --CHCChicago Cubs
# use regex to extract team acronym - get uppercase letters followed by lowercase, then delete last two characters
teams <- teams %>% rename(team=1) %>% mutate(team=str_extract_all(team, "[A-Z]{3,4}[a-z]") %>% substr(1, nchar(.)-2))
teams %>% head()
##   team
## 1  BOS
## 2  HOU
## 3  NYY
## 4  OAK
## 5  MIL
## 6  CHC
standings <- html_tables[[4]]
standings %>% head()
##     W  L   PCT   GB  HOME  AWAY  RS  RA DIFF STRK L10
## 1 108 54 0.667    - 57-24 51-30 876 647  229   W1 5-5
## 2 103 59 0.636    5 46-35 57-24 797 534  263   L1 8-2
## 3 100 62 0.617    8 53-28 47-34 851 669  182   L1 7-3
## 4  97 65 0.599   11 50-31 47-34 813 674  139   L1 6-4
## 5  96 67 0.589 12.5 51-30 45-37 754 659   95   W8 9-1
## 6  95 68 0.583 13.5 51-31 44-37 761 645  116   L1 6-4
team_standings <- bind_cols(teams, standings)
team_standings %>% head() %>% kable()
team W L PCT GB HOME AWAY RS RA DIFF STRK L10
BOS 108 54 0.667 - 57-24 51-30 876 647 229 W1 5-5
HOU 103 59 0.636 5 46-35 57-24 797 534 263 L1 8-2
NYY 100 62 0.617 8 53-28 47-34 851 669 182 L1 7-3
OAK 97 65 0.599 11 50-31 47-34 813 674 139 L1 6-4
MIL 96 67 0.589 12.5 51-30 45-37 754 659 95 W8 9-1
CHC 95 68 0.583 13.5 51-31 44-37 761 645 116 L1 6-4

Build a simple linear model

Scoring runs should lead to winning

team_standings %>% ggplot(aes(x=RS, y=W)) + geom_point() + geom_smooth(method="lm", se=T)

basic_lm_fit <- lm(W ~ RS, data=team_standings)
summary(basic_lm_fit)
## 
## Call:
## lm(formula = W ~ RS, data = team_standings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7420  -5.9544   0.7703   5.6977  15.2073 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -37.61397   14.62196  -2.572   0.0157 *  
## RS            0.16456    0.02017   8.157 7.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.151 on 28 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6932 
## F-statistic: 66.53 on 1 and 28 DF,  p-value: 7.031e-09

The R2 is around 0.69, which is a Pearson correlation of about r=0.83. Not bad.

Expand the model

Winning is also influenced by how many runs a team allows

team_standings %>% ggplot(aes(x=DIFF, y=W)) + geom_point() + geom_smooth(method="lm", se=T)

multi_lm_fit <- lm(W ~ RS + RA, data=team_standings)
summary(multi_lm_fit)
## 
## Call:
## lm(formula = W ~ RS + RA, data = team_standings)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.009 -2.561  0.108  2.563 11.118 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 85.29310   17.88598   4.769 5.67e-05 ***
## RS           0.09443    0.01458   6.477 6.08e-07 ***
## RA          -0.10034    0.01294  -7.755 2.44e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.621 on 27 degrees of freedom
## Multiple R-squared:  0.9082, Adjusted R-squared:  0.9014 
## F-statistic: 133.6 on 2 and 27 DF,  p-value: 9.935e-15

The adjusted R2 has gone up to 0.90. Adding in Runs Allowed made a significant improvement to the model.

Summary

We acquired our data via web scraping and cleaned it using regular expressions. We then built a basic linear model, which was a decent fit. But we made it even better by expanding it slightly to a multivariate regression.

We now have a simple but effective (and interpretable) model for handicapping how many Wins a baseball team can expect based on how many runs they score and how many runs they allow.