library(tidyverse)
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
plot(mlb11$runs ~ mlb11$at_bats)
fit <- lm(mlb11$runs ~ mlb11$at_bats)
abline(fit, col = "red")
Answer: Yes, I would be comfortable using a linear model. A scatterplot is the appropriate display here because runs and at-bats are both numerical, and the two show a moderately strong positive association, as expected: the more at-bats a team gets, the more chances its batters have to hit the ball and score runs.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
Answer: The relationship between runs and at-bats is roughly linear, positive, and moderately strong (r = 0.61): as the number of at-bats increases, the number of runs tends to increase as well.
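To make the correlation coefficient less of a black box, the sketch below computes Pearson's r from its definition and compares it with cor(). The vectors x and y are made-up illustration values, not rows of mlb11, and the variable names are hypothetical.

```r
# Hypothetical toy data for illustration only (not taken from mlb11)
x <- c(10, 12, 15, 17, 20, 24)
y <- c(31, 35, 38, 46, 49, 60)

n <- length(x)

# Standardize each variable, multiply pairwise, and divide the sum by n - 1
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```

The same comparison should hold for cor(mlb11$runs, mlb11$at_bats) once the data set is loaded.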
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
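Answer: No matter how many times I run this function, the sum of squares stays at 123721.9. The reported intercept and slope (-2789.2429 and 0.6305) are the same as the least squares estimates from lm(runs ~ at_bats), so this is the smallest sum of squares any line can achieve, and I presume my neighbors obtained the same value.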
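A small sketch of why the least squares line gives the smallest sum of squares: it compares the sum of squared residuals for the lm() line with two slightly shifted lines. The data and the helper sum_sq() are hypothetical and exist only for this illustration.

```r
# Hypothetical toy data, not taken from mlb11
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Hypothetical helper: sum of squared residuals for a given line
sum_sq <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares fit and its sum of squares
fit <- lm(y ~ x)
b <- unname(coef(fit))
ss_least_squares <- sum_sq(b[1], b[2])

# Two other candidate lines: shifted intercept, then flatter slope
ss_shifted <- sum_sq(b[1] + 0.5, b[2])
ss_flatter <- sum_sq(b[1], b[2] - 0.1)

c(ss_least_squares, ss_shifted, ss_flatter)
```

Any line other than the fitted one should produce a larger value, which is what plot_ss is meant to let you discover by clicking.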
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
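The columns of the coefficient table are linked by simple arithmetic, which the following sketch reproduces from the rounded values printed above. The variable names are hypothetical, and because the inputs are rounded the results only match the summary to about two or three digits.

```r
# Values copied from the at_bats row of summary(m1)
estimate  <- 0.6305
std_error <- 0.1545
df_resid  <- 28

# t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 28 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df_resid)

# With a single predictor, the F-statistic is the square of the slope's t value
f_value <- t_value^2

c(t_value = t_value, p_value = p_value, f_value = f_value)
```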
m2 <- lm(runs ~ homeruns, mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
Answer: The least squares regression line for this model is: predicted runs = 415.2389 + 1.8345 * homeruns
The slope of 1.8345 means that each additional home run is associated with about 1.8345 more runs on average. So teams with more home runs tend to score more runs.
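As a quick illustration of what the slope means, the sketch below plugs two home run totals into the fitted equation and looks at the difference. The helper predict_runs() and the totals 150 and 160 are arbitrary choices for illustration, not values taken from the data.

```r
# Coefficients copied from the summary(m2) output
b0 <- 415.2389
b1 <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs <- function(homeruns) b0 + b1 * homeruns

# Predicted difference between two teams that differ by 10 home runs
# (150 and 160 are arbitrary illustration values)
diff_10 <- predict_runs(160) - predict_runs(150)
diff_10
```

The difference is 10 times the slope, regardless of which starting value is used.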
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1, col = "red")
at.bats <- 5578
hat.runs <- -2789.2429 + 0.6305 * at.bats
hat.runs
## [1] 727.6861
Answer: Based on the regression line, the predicted value for a team with 5,578 at-bats is 727.6861 runs. The closest observation in the data is a team with 5,579 at-bats that scored 713 runs. The residual is 713 - 727.6861 = -14.6861; it is negative, so the model overestimates the runs for this team.
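The residual arithmetic can be checked with a few lines, using only the coefficients and the observed value quoted in the answer. The variable names are hypothetical. Because the observed team actually has 5,579 at-bats, the sketch also shows the residual at that exact value, which differs by one slope unit.

```r
# Coefficients copied from summary(m1)
b0 <- -2789.2429
b1 <- 0.6305

# Observed runs for the team with 5,579 at-bats, as quoted in the answer
observed_runs <- 713

# Prediction at 5,578 at-bats (the value asked about) and its residual
pred_5578 <- b0 + b1 * 5578
resid_5578 <- observed_runs - pred_5578

# Prediction at the team's own 5,579 at-bats and its residual
pred_5579 <- b0 + b1 * 5579
resid_5579 <- observed_runs - pred_5579

c(resid_5578 = resid_5578, resid_5579 = resid_5579)
```

Both residuals are negative, so the conclusion that the line overestimates this team's runs does not depend on which of the two at-bat values is used.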
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3, col = "red")
Answer: No, there is not. The residual plot shows no discernible pattern, and the points are scattered about evenly above and below the zero line. This suggests that the linearity condition is met.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals, col = "red")
sd(m1$residuals)
## [1] 65.3167
summary(m1$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -125.58 -47.05 -16.59 0.00 54.40 176.87
Answer: Based on the histogram and the normal probability plot, the nearly normal residuals condition does appear to be met.
Answer: Yes, the constant variability condition appears to be met: in the first scatterplot of runs against at-bats, the spread of the points around the line is roughly constant.
plot(mlb11$hits, mlb11$runs)
fit <- lm(mlb11$runs ~ mlb11$hits)
abline(fit, col = "red")
Answer: There appears to be a linear relationship between hits and runs.
m3 <- lm(runs ~ hits, mlb11)
# at_bats vs runs
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Answer: With at_bats as the predictor, R^2 = 0.3729, so 37.3% of the variability in runs is explained by at_bats.
# hits vs runs
summary(m3)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
Answer: With hits as the predictor, R^2 = 0.6419, so 64.2% of the variability in runs is explained by hits.
Judging by R^2, hits is the better predictor of runs.
cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
cor(mlb11$runs, mlb11$strikeouts)
## [1] -0.4115312
cor(mlb11$runs, mlb11$stolen_bases)
## [1] 0.05398141
cor(mlb11$runs, mlb11$wins)
## [1] 0.6008088
Answer: Of the standard variables, bat_avg has the strongest relationship with runs. Its correlation with runs is 0.8099859, and a linear model with bat_avg as the predictor explains 65.61% of the variability in runs.
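In a regression with a single predictor, R^2 is the square of the correlation coefficient, which ties the correlations printed above to the R^2 values in the model summaries. The sketch below checks this with the numbers reported in this document; the vector names are hypothetical.

```r
# Correlations with runs, copied from the cor() output in this document
r <- c(at_bats = 0.610627, homeruns = 0.7915577, bat_avg = 0.8099859)

# Multiple R-squared values copied from summary(m1), summary(m2), summary(m4)
reported_r2 <- c(at_bats = 0.3729, homeruns = 0.6266, bat_avg = 0.6561)

# Squaring the correlation should reproduce R^2 up to rounding
squared_r <- round(r^2, 4)

rbind(squared_r, reported_r2)
```

This is why ranking predictors by the absolute value of the correlation gives the same order as ranking them by R^2.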
m4 <- lm(runs ~ bat_avg, mlb11)
plot(mlb11$runs ~ mlb11$bat_avg)
abline(m4, col = "red")
summary(m4)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
plot_ss(x = mlb11$bat_avg, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -642.8 5242.2
##
## Sum of Squares: 67849.52
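The sum of squares reported by plot_ss connects directly to the residual standard error in the model summaries. The sketch below is a rough check using only numbers printed in this document; n = 30 is inferred from the 28 residual degrees of freedom plus two estimated coefficients, and the variable names are hypothetical.

```r
# Sums of squared residuals reported by plot_ss
ss_at_bats <- 123721.9   # runs ~ at_bats
ss_bat_avg <- 67849.52   # runs ~ bat_avg

# Sample size inferred from 28 residual degrees of freedom + 2 coefficients
n <- 30

# Residual standard error divides by n - 2 (the residual degrees of freedom)
rse_at_bats <- sqrt(ss_at_bats / (n - 2))
rse_bat_avg <- sqrt(ss_bat_avg / (n - 2))

# sd() of the residuals divides by n - 1 instead, so it is slightly smaller
sd_at_bats <- sqrt(ss_at_bats / (n - 1))

round(c(rse_at_bats, rse_bat_avg, sd_at_bats), 2)
```

These values line up with the residual standard errors of 66.47 and 49.23 in the summaries and with the 65.3167 returned by sd(m1$residuals).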
plot(m4$residuals ~ mlb11$bat_avg)
abline(h = 0, lty = 3, col = "red")
cor(mlb11$runs, mlb11$new_onbase)
## [1] 0.9214691
cor(mlb11$runs, mlb11$new_slug)
## [1] 0.9470324
cor(mlb11$runs, mlb11$new_obs)
## [1] 0.9669163
m5 <- lm(runs ~ new_onbase, mlb11)
m6 <- lm(runs ~ new_slug, mlb11)
m7 <- lm(runs ~ new_obs, mlb11)
summary(m5)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
summary(m6)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
summary(m7)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
plot(mlb11$runs ~ mlb11$new_obs)
abline(m7, col = "red")
Answer: The newer variables appear to be better predictors of runs than the older ones: all three have a higher R^2 than any of the traditional variables. Of all ten variables, new_obs has the largest R^2 (0.9349). This makes sense because a team that gets on base often and hits for extra bases will almost certainly score more runs.
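To see the comparison at a glance, the R^2 values printed in the model summaries can be collected into one table and sorted. This sketch covers only the seven predictors that were actually fitted with lm() in this document; the data frame and column names are hypothetical.

```r
# Multiple R-squared values copied from the summaries of m1 to m7
r2_table <- data.frame(
  predictor = c("at_bats", "homeruns", "hits", "bat_avg",
                "new_onbase", "new_slug", "new_obs"),
  r_squared = c(0.3729, 0.6266, 0.6419, 0.6561, 0.8491, 0.8969, 0.9349),
  stringsAsFactors = FALSE
)

# Sort from the strongest to the weakest predictor of runs
r2_sorted <- r2_table[order(r2_table$r_squared, decreasing = TRUE), ]
best_predictor <- r2_sorted$predictor[1]

r2_sorted
```

With the data loaded, the same table could be built from the fitted models by reading summary(m1)$r.squared and so on, rather than typing the values in.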
plot(m7)
Answer:
Linearity: the scatterplot shows a strong linear relationship.
Normality: the residuals appear nearly normal, based on the histogram and normal Q-Q plot.
Constant variability: the spread of the residuals is roughly constant, based on the residual plot.
Independent observations: this can reasonably be assumed.