For this example we will get Major League Baseball data from espn.com and fit a simple linear regression.
First we load the necessary libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rvest)
## Loading required package: xml2
library(broom)
## Warning: package 'broom' was built under R version 3.5.3
library(ggplot2)
library(stringr)
library(knitr)
Then use rvest to scrape the 2018 MLB standings from espn.com and clean up the team abbreviations using regular expressions.
season <- 2018
url <- paste0('http://www.espn.com/mlb/standings/_/sort/winpercent/dir/desc/season/', season, '/group/overall')
url_html <- url %>% read_html()
html_tables <- url_html %>% html_table(fill=T)
teams <- html_tables[[2]]
# use regex to extract team acronym - get uppercase letters followed by lowercase, then delete last two characters
teams <- teams %>% rename(team=1) %>% mutate(team=str_extract_all(team, "[A-Z]{3,4}[a-z]") %>% substr(1, nchar(.)-2))
standings <- html_tables[[4]]
team_standings <- bind_cols(teams, standings)
team_standings %>% head() %>% kable()
| team | W | L | PCT | GB | HOME | AWAY | RS | RA | DIFF | STRK | L10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BOS | 108 | 54 | 0.667 | - | 57-24 | 51-30 | 876 | 647 | 229 | W1 | 5-5 |
| HOU | 103 | 59 | 0.636 | 5 | 46-35 | 57-24 | 797 | 534 | 263 | L1 | 8-2 |
| NYY | 100 | 62 | 0.617 | 8 | 53-28 | 47-34 | 851 | 669 | 182 | L1 | 7-3 |
| OAK | 97 | 65 | 0.599 | 11 | 50-31 | 47-34 | 813 | 674 | 139 | L1 | 6-4 |
| MIL | 96 | 67 | 0.589 | 12.5 | 51-30 | 45-37 | 754 | 659 | 95 | W8 | 9-1 |
| CHC | 95 | 68 | 0.583 | 13.5 | 51-31 | 44-37 | 761 | 645 | 116 | L1 | 6-4 |
Run differential should correlate to to winning. We examine a scatterplot and see that a simple linear model looks appropriate.
team_standings %>%
ggplot(aes(x=DIFF, y=W)) +
geom_point() +
geom_smooth(method="lm", se=T) +
labs(x="Run Differential", y="Number of Wins")
lm_fit <- lm(W ~ DIFF, data=team_standings)
summary(lm_fit)
##
## Call:
## lm(formula = W ~ DIFF, data = team_standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9685 -2.9065 0.0181 2.5834 11.2852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.03333 0.82929 97.71 < 2e-16 ***
## DIFF 0.09760 0.00587 16.63 4.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.542 on 28 degrees of freedom
## Multiple R-squared: 0.908, Adjusted R-squared: 0.9047
## F-statistic: 276.4 on 1 and 28 DF, p-value: 4.845e-16
The high adjusted R-squared indicates a good fit.
We use the broom package to put the linear model into a data frame.
lm_df <- augment(lm_fit)
lm_df %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
The residual values appear randomly scattered, as they should in a well fit model.