For this example we will get Major League Baseball data from espn.com and fit a simple linear regression.


Libraries

First we load the necessary libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rvest)
## Loading required package: xml2
library(broom)
## Warning: package 'broom' was built under R version 3.5.3
library(ggplot2)
library(stringr)
library(knitr)


Get the data

Then use rvest to scrape the 2018 MLB standings from espn.com and clean up the team abbreviations using regular expressions.

season <- 2018
url <- paste0('http://www.espn.com/mlb/standings/_/sort/winpercent/dir/desc/season/', season, '/group/overall')
url_html <- url %>% read_html()
html_tables <- url_html %>% html_table(fill=T)
teams <- html_tables[[2]]

# use regex to extract team acronym - get uppercase letters followed by lowercase, then delete last two characters
teams <- teams %>% rename(team=1) %>% mutate(team=str_extract_all(team, "[A-Z]{3,4}[a-z]") %>% substr(1, nchar(.)-2))

standings <- html_tables[[4]]

team_standings <- bind_cols(teams, standings)
team_standings %>% head() %>% kable()
team W L PCT GB HOME AWAY RS RA DIFF STRK L10
BOS 108 54 0.667 - 57-24 51-30 876 647 229 W1 5-5
HOU 103 59 0.636 5 46-35 57-24 797 534 263 L1 8-2
NYY 100 62 0.617 8 53-28 47-34 851 669 182 L1 7-3
OAK 97 65 0.599 11 50-31 47-34 813 674 139 L1 6-4
MIL 96 67 0.589 12.5 51-30 45-37 754 659 95 W8 9-1
CHC 95 68 0.583 13.5 51-31 44-37 761 645 116 L1 6-4


Build the model

Run differential should correlate to to winning. We examine a scatterplot and see that a simple linear model looks appropriate.

team_standings %>% 
  ggplot(aes(x=DIFF, y=W)) + 
  geom_point() + 
  geom_smooth(method="lm", se=T) + 
  labs(x="Run Differential", y="Number of Wins")

lm_fit <- lm(W ~ DIFF, data=team_standings)
summary(lm_fit)
## 
## Call:
## lm(formula = W ~ DIFF, data = team_standings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9685 -2.9065  0.0181  2.5834 11.2852 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81.03333    0.82929   97.71  < 2e-16 ***
## DIFF         0.09760    0.00587   16.63 4.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.542 on 28 degrees of freedom
## Multiple R-squared:  0.908,  Adjusted R-squared:  0.9047 
## F-statistic: 276.4 on 1 and 28 DF,  p-value: 4.845e-16

The high adjusted R-squared indicates a good fit.


Examine the residuals

We use the broom package to put the linear model into a data frame.

lm_df <- augment(lm_fit)
lm_df %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0)

The residual values appear randomly scattered, as they should in a well fit model.