CUNY 605

For this example we will get Major League Baseball data from espn.com and fit a simple linear regression.

Libraries

First we load the necessary libraries.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rvest)

## Loading required package: xml2

library(broom)

## Warning: package 'broom' was built under R version 3.5.3

library(ggplot2)
library(stringr)
library(knitr)

Get the data

Then use rvest to scrape the 2018 MLB standings from espn.com and clean up the team abbreviations using regular expressions.

season <- 2018
url <- paste0('http://www.espn.com/mlb/standings/_/sort/winpercent/dir/desc/season/', season, '/group/overall')
url_html <- url %>% read_html()
html_tables <- url_html %>% html_table(fill=T)
teams <- html_tables[[2]]

# use regex to extract team acronym - get uppercase letters followed by lowercase, then delete last two characters
teams <- teams %>% rename(team=1) %>% mutate(team=str_extract_all(team, "[A-Z]{3,4}[a-z]") %>% substr(1, nchar(.)-2))

standings <- html_tables[[4]]

team_standings <- bind_cols(teams, standings)
team_standings %>% head() %>% kable()

team	W	L	PCT	GB	HOME	AWAY	RS	RA	DIFF	STRK	L10
BOS	108	54	0.667	-	57-24	51-30	876	647	229	W1	5-5
HOU	103	59	0.636	5	46-35	57-24	797	534	263	L1	8-2
NYY	100	62	0.617	8	53-28	47-34	851	669	182	L1	7-3
OAK	97	65	0.599	11	50-31	47-34	813	674	139	L1	6-4
MIL	96	67	0.589	12.5	51-30	45-37	754	659	95	W8	9-1
CHC	95	68	0.583	13.5	51-31	44-37	761	645	116	L1	6-4

Build the model

Run differential should correlate to to winning. We examine a scatterplot and see that a simple linear model looks appropriate.

team_standings %>% 
  ggplot(aes(x=DIFF, y=W)) + 
  geom_point() + 
  geom_smooth(method="lm", se=T) + 
  labs(x="Run Differential", y="Number of Wins")

lm_fit <- lm(W ~ DIFF, data=team_standings)
summary(lm_fit)

## 
## Call:
## lm(formula = W ~ DIFF, data = team_standings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9685 -2.9065  0.0181  2.5834 11.2852 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81.03333    0.82929   97.71  < 2e-16 ***
## DIFF         0.09760    0.00587   16.63 4.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.542 on 28 degrees of freedom
## Multiple R-squared:  0.908,  Adjusted R-squared:  0.9047 
## F-statistic: 276.4 on 1 and 28 DF,  p-value: 4.845e-16

The high adjusted R-squared indicates a good fit.

Examine the residuals

We use the broom package to put the linear model into a data frame.

lm_df <- augment(lm_fit)
lm_df %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0)

The residual values appear randomly scattered, as they should in a well fit model.

CUNY 605

Week 11 Discussion

mehtablocker

April 09, 2019

Libraries

Get the data

Build the model

Examine the residuals