getwd()
## [1] "C:/Users/Jerome/Documents/From_Toshiba_HD_Work_Files/0000_Montgomery_College/Math_217/Week_12"
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1

plot(mlb11$hits ~ mlb11$at_bats)

cor(mlb11$hits, mlb11$at_bats)
## [1] 0.846472

Exercise 2

The relationship is positive, roughly linear, and fairly strong (r is about 0.85). The points still show a fair amount of scatter around the trend. A few teams sit apart at the high end. Since each of the 30 observations is one team's 2011 season, these are teams with unusually high totals, not players with short or exceptionally long careers.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Exercise 3

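NB: My plot was not interactive. The 1st time I ran plot_ss, it drew a line immediately and gave me no chance to click two points; the same happened when I displayed the squares. The coefficients it reported (intercept -2789.2429, slope 0.6305) match the least-squares estimates from lm below, so 123721.9 is the sum of squares for the least-squares line, not for a line I chose.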
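Since the interactive version did not work, here is a minimal sketch of the quantity plot_ss reports: the sum of squared residuals for a candidate line. The helper name `sum_sq` and the toy data are made up for illustration (they are not the mlb11 values); the point is only that a hand-picked line cannot beat the least-squares line on this measure.

```r
# Sum of squared residuals for a candidate line y = intercept + slope * x.
# `sum_sq` is a hypothetical helper, not part of the lab's plot_ss function.
sum_sq <- function(x, y, intercept, slope) {
  residuals <- y - (intercept + slope * x)
  sum(residuals^2)
}

# Made-up toy data, NOT taken from mlb11.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares line from lm() versus a line picked by eye.
fit <- lm(y ~ x)
ss_least_squares <- sum_sq(x, y, unname(coef(fit)[1]), unname(coef(fit)[2]))
ss_hand_picked <- sum_sq(x, y, intercept = 0, slope = 2.1)

# The least-squares line never has the larger sum of squares.
ss_least_squares <= ss_hand_picked
```

With the real data, calling the same helper on mlb11$at_bats and mlb11$runs with different intercepts and slopes would stand in for clicking points on the plot.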

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

Exercise 4

plot(mlb11$runs ~ mlb11$at_bats)

plot(runs ~ at_bats, data = mlb11)
abline(m1)

Exercise 5

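If a team has 5,578 at-bats, the fitted line predicts about 728 runs (-2789.2429 + 0.6305 × 5578 is roughly 727.7). That agrees with my reading of the plot: something over 700 runs, but < 750.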
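The estimate can be pinned down from the fitted line. A minimal sketch, using the intercept and slope as printed (rounded) in the summary(m1) output rather than the fitted object itself; the variable names are just for illustration.

```r
# Coefficients copied from the summary(m1) output above (rounded as printed).
intercept <- -2789.2429
slope <- 0.6305

# Predicted runs for a team with 5,578 at-bats.
at_bats <- 5578
predicted_runs <- intercept + slope * at_bats
round(predicted_runs)  # 728
```

With m1 still in the workspace, predict(m1, newdata = data.frame(at_bats = 5578)) should give essentially the same number, differing only by the rounding of the printed coefficients.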

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

Exercise 6

I can’t see any pattern in the residuals, other than a kind of cluster (or 2) at the lower end of the abscissa, with a few outliers at the high end.

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

Exercise 7

The distribution of the residuals appears to be nearly normal, but it is skewed to the right b/c of the outliers.

Exercise 8

Based on the m1 residuals plot in Exercise 6, it would seem the variability is constant, or nearly so.

On Your Own

m2 <- lm(runs ~ strikeouts, data = mlb11)
summary(m2)
## 
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.27  -46.95  -11.92   55.14  169.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1054.7342   151.7890   6.949 1.49e-07 ***
## strikeouts    -0.3141     0.1315  -2.389   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1397 
## F-statistic: 5.709 on 1 and 28 DF,  p-value: 0.02386
plot(runs ~ strikeouts, data = mlb11)
abline(m2)

1. There is a weak linear relationship. Strikeouts do predict runs, in a sense. The more strikeouts, the fewer runs. But Babe Ruth’s record may give a different slant on this. From Wikipedia, “Ruth established many MLB batting (and some pitching) records, including career home runs (714), runs batted in (RBIs) (2,213), bases on balls (2,062), slugging percentage (.690), and on-base plus slugging (OPS) (1.164); the last two still stand as of 2019.”

But Ruth also had a high strikeout percentage. “For many years, Babe Ruth was known as the King of Strikeouts. He was known for his all or nothing batting style. He led the American League in strikeouts five times and accumulated 1,330 of them in his career.” – This is from https://howtheyplay.com/team-sports/strikeouts-have-skyrocketed-since-Babe-Ruth#:~:text=For%20many%20years%2C%20Babe%20Ruth,of%20them%20in%20his%20career.

Ruth’s data are obviously not in this dataset, which has team totals rather than individual careers; he would have been at about 1,300 on the abscissa and probably off the chart on the ordinate.

2. The R-squared for strikeouts is 0.1694; the at_bats R-squared was 0.3729. So at_bats explains about 37% of the variability in runs and strikeouts only about 17%, which makes at_bats the better predictor of the two. The relationship w/ strikeouts is negative (more strikeouts, fewer runs) but weak, so strikeouts is at best a modest predictor.
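To put the two R-squared values on the same footing as the correlation from Exercise 1, they can be converted back to r. A rough sketch, assuming the R-squared values as printed in the two summaries and taking the sign from each fitted slope (positive for at_bats, negative for strikeouts); in simple linear regression R-squared is the square of the correlation.

```r
# R-squared values as printed in summary(m1) and summary(m2).
r_squared <- c(at_bats = 0.3729, strikeouts = 0.1694)

# Sign of each fitted slope: 0.6305 for at_bats, -0.3141 for strikeouts.
slope_sign <- c(at_bats = 1, strikeouts = -1)

# Correlation = sign of slope * square root of R-squared.
r <- slope_sign * sqrt(r_squared)
round(r, 2)
```

This gives roughly 0.61 for at_bats and -0.41 for strikeouts: a moderate positive relationship versus a weak negative one.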

3. Run 5 regressions

m3 <- lm(runs ~ hits, data = mlb11)
summary(m3)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
m4 <- lm(runs ~ homeruns, data = mlb11)
summary(m4)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
m5 <- lm(runs ~ bat_avg, data = mlb11)
summary(m5)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
m6 <- lm(runs ~ stolen_bases, data = mlb11)
summary(m6)
## 
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139.94  -62.87   10.01   38.54  182.49 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  677.3074    58.9751  11.485 4.17e-12 ***
## stolen_bases   0.1491     0.5211   0.286    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared:  0.002914,   Adjusted R-squared:  -0.0327 
## F-statistic: 0.08183 on 1 and 28 DF,  p-value: 0.7769
m7 <- lm(runs ~ wins, data = mlb11)
summary(m7)
## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.450  -47.506   -7.482   47.346  142.186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  342.121     89.223   3.834 0.000654 ***
## wins           4.341      1.092   3.977 0.000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3381 
## F-statistic: 15.82 on 1 and 28 DF,  p-value: 0.0004469
plot(runs ~ bat_avg, data = mlb11)
abline(m5)

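Batting average wins (R-squared = 0.656), but not by much. Hits came a close 2nd, R-squared = 0.642; homeruns came in a close 3rd @ 0.627. Wins (0.361) is about on par with at_bats, and stolen bases (0.003) explains essentially nothing.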
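The ranking can be double-checked by putting the R-squared values side by side. A small sketch, with the numbers typed in by hand from the seven summaries above (m1 through m7) rather than extracted from the model objects.

```r
# Multiple R-squared values copied from the summary() outputs above.
r_squared <- c(
  at_bats      = 0.3729,
  strikeouts   = 0.1694,
  hits         = 0.6419,
  homeruns     = 0.6266,
  bat_avg      = 0.6561,
  stolen_bases = 0.002914,
  wins         = 0.361
)

# Sort from best to worst predictor of runs.
ranked <- sort(r_squared, decreasing = TRUE)
ranked
names(ranked)[1]  # "bat_avg"
```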

4. Use New Variables
m8 <- lm(runs ~ new_onbase, data = mlb11)
summary(m8)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
m9 <- lm(runs ~ new_slug, data = mlb11)
summary(m9)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
m10 <- lm(runs ~ new_obs, data = mlb11)
summary(m10)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
plot(runs ~ new_obs, data = mlb11)
abline(m10)

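new_obs (on-base plus slugging, i.e. OPS) wins w/ an R-squared = 0.9349. The slopes are not comparable across predictors, since the variables are on different scales; the tighter fit shows up instead in the residual standard error, 21.41 vs. 49.23 for bat_avg.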

All three new predictor variables (R-squared = 0.8491, 0.8969 and 0.9349) work better than the best traditional one, bat_avg (0.6561).
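Fitting ten models one at a time is repetitive, so here is a sketch of how the comparison could be automated. The helper `r2_by_predictor` is hypothetical (it is not part of the lab), and the demonstration uses R's built-in mtcars data as a stand-in so the block runs without downloading mlb11.

```r
# Fit one simple regression per predictor and return the R-squared values,
# sorted from best to worst. `r2_by_predictor` is a made-up helper name.
r2_by_predictor <- function(data, response, predictors) {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)
}

# Stand-in demonstration on the built-in mtcars data (not baseball data).
r2 <- r2_by_predictor(mtcars, "mpg", c("wt", "hp", "qsec"))
r2

# For this lab, assuming mlb11 is loaded, the call would look like:
# r2_by_predictor(mlb11, "runs", c("new_onbase", "new_slug", "new_obs"))
```

The first name in the result is the predictor with the highest R-squared, which for the three new variables should come out as new_obs, matching the summaries above.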