Introduction to linear regression

http://htmlpreview.github.io/?https://github.com/andrewpbray/oiLabs-base-R/blob/master/simple_regression/simple_regression.html

head(mlb11)
##                  team runs at_bats hits homeruns bat_avg strikeouts
## 1       Texas Rangers  855    5659 1599      210   0.283        930
## 2      Boston Red Sox  875    5710 1600      203   0.280       1108
## 3      Detroit Tigers  787    5563 1540      169   0.277       1143
## 4  Kansas City Royals  730    5672 1560      129   0.275       1006
## 5 St. Louis Cardinals  762    5532 1513      162   0.273        978
## 6       New York Mets  718    5600 1477      108   0.264       1085
##   stolen_bases wins new_onbase new_slug new_obs
## 1          143   96      0.340    0.460   0.800
## 2          102   90      0.349    0.461   0.810
## 3           49   95      0.340    0.434   0.773
## 4          153   71      0.329    0.415   0.744
## 5           57   90      0.341    0.425   0.766
## 6          130   77      0.335    0.391   0.725

Exercise 1 - plot runs by at-bats

Using a scatter plot shows moderate linearity. A linear model should give a reasonable estimate of runs based on at-bats.

plot(runs ~ at_bats, data = mlb11, ylab = "runs", xlab = "at bats", col='blue', pch=19)
z <- line(mlb11$at_bats, mlb11$runs)
z
## 
## Call:
## line(mlb11$at_bats, mlb11$runs)
## 
## Coefficients:
## [1]  -2661.1626      0.6043
abline(coef(z),col='red')

Exercise 2 - Describe the distribution

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627

The relationship between at-bats and runs is linear with a positive slope. A correlation of 0.61 shows moderate positive linearity. There’s one point that could be considered high leverage, where the (“wretched”, as my husband calls them) New York Yankees got 867 runs with 5518 at-bats. But it doesn’t seem that this point influences the slope of the regression line.

Exercise 3 - fit lines using plot_ss

The smallest sum of squares I got from using the interactive plots is 159179.4

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

Exercise 4 - homeruns to predict runs

Regression line equation for runs based in home runs: \(\hat{y}\) = 415.2389 + 1.8345 ∗ homeruns

The slope of the line is positive, so for every home run the number of runs increases 1.8345. Which means that there are additional players on base 1.8345 times that a home run is hit.

m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

Exercise 5 - Prediction and prediction errors

y = −2789.2429 + 0.6305 ∗ atbats = −2789.2429 + 0.6305 ∗ 5578 = 727.6861

This is an overestimate if we go by the closest point, the Phillies with 5579 at bats got 713 runs, which would give a residual of -14 runs. The range of runs for at bats between approximately 5570 and 5650 are all below the regression line.

plot(mlb11$runs ~ mlb11$at_bats,  ylab = "runs", xlab = "at bats", col='blue', pch=19)
abline(m1, col = 'red')

Exercise 6 - Model diagnostics

There is not a discernable pattern to the residuals. The relationship between runs and at-bats is linear.

plot(m1$residuals ~ mlb11$at_bats, col='orange', pch = 19,  ylab = "residuals", xlab = "at bats")
abline(h = 0, lty = 2, col='blue')  # adds a horizontal dashed line at y = 0

hist(m1$residuals, col='gold2', main = 'mb11 runs per at-bats residuals', xlab = 'residuals')

qqnorm(m1$residuals, col = 'orange', pch = 19)
qqline(m1$residuals, col = 'blue')  # adds diagonal line to the normal prob plot

Exercise 7 - Residuals QQplot

The distribution of the residuals is close to normal.

Exercise 8 - homoscedasticity

The variability of points in the scatterplot of runs by at-bats looks like is has constant variability in the data. The scale location plot has a horizontal line with randomly spread points

par(mfrow=c(2,2))
plot(m1)

On your own

1. Runs predicted by hits

The scatterplot of runs by hits looks linear.

m3 <- lm(runs ~ hits, data = mlb11)

plot(runs ~ hits, data = mlb11, ylab = "runs", xlab = "hits", col='blue', pch=19)
abline(m3, col = 'red')

2. Compare runs_by_hits model to runs_by_atbats model

The R2 value for runs_by_hits = 0.642 which is > the R2 value for runs_by_atbats = 0.373. So runs_by_hits accounts for a greater percentage of the variation in the data, and is a better model than runs_by_atbats.

summary(m3)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
par(mfrow=c(2,2))
plot(m3)

3. runs by other variables

Team batting average has the best R2 = 0.656. The scatter plot is closer to linear than hits. The distribution of the residuals in the qqplot are almost normal, there are no influencing outliers, and the homoscedasticity looks good in the scale-location plot.

s1 <- summary(m1)  # at-bats
s2 <- summary(m2)  # homeruns
s3 <- summary(m3)  # hits
s4 <- summary(lm(runs ~ bat_avg, data = mlb11))
s5 <- summary(lm(runs ~ strikeouts, data = mlb11))
s6 <- summary(lm(runs ~ stolen_bases, data = mlb11))
s7 <- summary(lm(runs ~ wins, data = mlb11))

models <- tibble(name = c(s1$terms[[3]], s2$terms[[3]], s3$terms[[3]], s4$terms[[3]], 
                           s5$terms[[3]], s6$terms[[3]], s7$terms[[3]]),
                 rsquared = c(s1$r.squared, s2$r.squared, s3$r.squared, s4$r.squared,
                              s5$r.squared, s6$r.squared, s7$r.squared),
                 fstatistic = c(s1$fstatistic[1], s2$fstatistic[1], s3$fstatistic[1], s4$fstatistic[1],
                              s5$fstatistic[1], s6$fstatistic[1], s7$fstatistic[1])) 
models <- arrange(models, desc(rsquared)) 
models$name <- as.character(models$name)
kable(models, align = c('l', 'l', 'l'), table.attr = "style = \"color: blue;\"") %>% kable_minimal()
name rsquared fstatistic
bat_avg 0.6560771 53.4136041
hits 0.6419388 50.1989153
homeruns 0.6265636 46.9792942
at_bats 0.3728654 16.6475120
wins 0.3609712 15.8164901
strikeouts 0.1693579 5.7088634
stolen_bases 0.0029140 0.0818302
m4 <- lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg,  ylab = "runs", xlab = "batting avg.", col='blue', pch=19)
abline(m4, col = 'red')

par(mfrow=c(2,2))
plot(m4)

4. runs by new variables

All of the new variables have better R2 values than the traditional variables. new_obs has the highest at 0.935 (which seems almost Too Good). The scatter plot of runs by new_obs is very tightly linear.

n1 <- summary(lm(runs ~ new_onbase, data = mlb11))
n2 <- summary(lm(runs ~ new_slug, data = mlb11))
n3 <- summary(lm(runs ~ new_obs, data = mlb11))

modeln <- tibble(name = c(n1$terms[[3]], n2$terms[[3]], n3$terms[[3]]),
                 rsquared = c(n1$r.squared, n2$r.squared, n3$r.squared),
                 fstatistic = c(n1$fstatistic[1], n2$fstatistic[1], n3$fstatistic[1])) 
modeln <- arrange(modeln, desc(rsquared)) 
modeln$name <- as.character(modeln$name)
kable(modeln, align = c('l', 'l', 'l'), table.attr = "style = \"color: blue;\"") %>% kable_minimal()
name rsquared fstatistic
new_obs 0.9349271 402.2868
new_slug 0.8968704 243.5030
new_onbase 0.8491053 157.5598
par(mfrow = c(1, 3))
plot(runs ~ new_onbase, data = mlb11, ylab = "runs", xlab = "new_onbase", col='blue', pch=19)
abline(lm(runs ~ new_onbase, data = mlb11), col = 'red')
plot(runs ~ new_slug, data = mlb11, ylab = "runs", xlab = "new_slug", col='blue', pch=19)
abline(lm(runs ~ new_slug, data = mlb11), col = 'red')
plot(runs ~ new_obs, data = mlb11, ylab = "runs", xlab = "new_obs", col='blue', pch=19)
abline(lm(runs ~ new_obs, data = mlb11), col = 'red')

5. model diagnostics for new_obs

The residuals qqplot for new_obs is very close to normal and the scale location plot has very even spread of the variance. Runs by new_obs is a very good linear model.

mbest <- lm(runs ~ new_obs, data = mlb11)
summary(mbest)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mbest)