library(tidyverse)
library(openintro)
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
I would use a scatterplot to display the relationship between runs and another numerical variable; in R, the function for this is plot(x, y). The relationship between runs and at_bats looks roughly linear, and the correlation coefficient of about 0.61 indicates a moderate positive linear association, so I would feel comfortable using a linear model to predict runs as long as the margin of error is acceptable.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
plot(mlb11$at_bats, mlb11$runs)
Form: Linear
Direction: Positive
Strength: Moderate
Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
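The sum of squares reported by plot_ss is the sum of squared residuals for the chosen line. A minimal sketch of that calculation, using made-up toy vectors (x and y here are hypothetical, not taken from mlb11), shows that the least-squares line gives a sum no larger than a hand-picked alternative:

```r
# Hypothetical toy data, not taken from mlb11
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares fit
fit <- lm(y ~ x)

# Sum of squared residuals for the least-squares line
rss_best <- sum(residuals(fit)^2)

# Sum of squared residuals for an arbitrary hand-picked line y = 0 + 2.1 * x
rss_other <- sum((y - (0 + 2.1 * x))^2)

# The least-squares line never does worse than any other line
rss_best <= rss_other
```

With the lab data loaded, the same quantity for the at_bats model would be sum(m1$residuals^2) once m1 has been fit.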
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
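For a regression with a single predictor, the Multiple R-squared equals the square of the correlation coefficient. A quick check using the correlation printed earlier (the value is copied by hand from that output, so this is only a sketch of the arithmetic):

```r
# Correlation between runs and at_bats, copied from the cor() output
r <- 0.610627

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2

round(r_squared, 4)  # 0.3729, matching the Multiple R-squared in summary(m1)
```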
equation: y_hat = 415.239 + 1.835*homeruns
The positive slope tells us that homeruns and runs are positively associated: each additional home run is associated with about 1.835 more runs on average. I don't follow baseball closely, so I can't say much more about the mechanics, but it makes sense that two measures of a team's offensive success would move together.
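To make the slope concrete, here is a small sketch that plugs the fitted equation into a function. The name predict_runs is hypothetical and used only for illustration; the coefficients are copied from the equation above.

```r
# Coefficients copied from the fitted equation for runs ~ homeruns
b0 <- 415.239
b1 <- 1.835

# predict_runs is a hypothetical helper name for this sketch
predict_runs <- function(homeruns) b0 + b1 * homeruns

# Predicted runs for two teams that differ by exactly one home run
diff_one_hr <- predict_runs(151) - predict_runs(150)

diff_one_hr  # equals the slope, up to floating-point rounding
```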
plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 415.239 1.835
##
## Sum of Squares: 73671.99
m2 <- lm(runs ~ homeruns, data = mlb11)
The data show 713 runs at 5579 at-bats. For 5578 at-bats, the model predicts about 727.69 runs, so it overestimates by roughly 14.69 runs (a residual of about -14.69).
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
713 - (-2789.2429 + 0.6305 * 5578)
## [1] -14.6861
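There appears to be no pattern in the residual plot, which supports using a linear model for runs and at_bats. The correlation between the residuals and at_bats is essentially zero (1.386089e-15), but that is true of any least-squares fit by construction, so the lack of a pattern in the plot is the more informative check. You could fit a line with slope zero to the residuals, but that wouldn't mean much.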
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
cor(m1$residuals, mlb11$at_bats)
## [1] 1.386089e-15
hist(m1$residuals)
shapiro.test(m1$residuals)
##
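A near-zero correlation between residuals and the predictor is a property of least squares itself, not evidence of linearity. The sketch below uses a deliberately curved, made-up relationship (x and y are hypothetical) and still gets a correlation of essentially zero, which is why the residual plot matters more than this number:

```r
# Hypothetical data with an obviously non-linear (U-shaped) relationship
x <- seq(0, 10, length.out = 30)
y <- (x - 5)^2

# Fit a straight line anyway
fit <- lm(y ~ x)

# Correlation between residuals and predictor is zero up to rounding error,
# even though a plot of residuals(fit) against x would show a clear curve
resid_cor <- cor(residuals(fit), x)

resid_cor
```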
## Shapiro-Wilk normality test
##
## data: m1$residuals
## W = 0.96144, p-value = 0.337
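Yes, the nearly normal residuals condition appears to be met: the Shapiro-Wilk test gives a p-value of 0.337, so there is no evidence against normality at the 0.05 level.
The residuals do not appear to cluster or fan out, at least visually, so I feel comfortable saying that the constant variability condition is met.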
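The decision rule behind that reading of the Shapiro-Wilk test can be sketched as follows. The residuals here are simulated (sim_resid is a hypothetical stand-in, with its standard deviation borrowed from the residual standard error of m1), and the 0.05 cutoff is an assumed conventional choice:

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for model residuals; not the real m1 residuals
sim_resid <- rnorm(30, mean = 0, sd = 66.47)

result <- shapiro.test(sim_resid)
alpha <- 0.05  # assumed significance level

# A large p-value means we fail to reject normality; it does not prove it
if (result$p.value > alpha) {
  message("No evidence against normality at the chosen level")
} else {
  message("Evidence that the residuals are not normally distributed")
}
```

With only 30 teams the test has limited power, so it is worth reading it alongside the histogram rather than on its own.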
At a glance, there does not appear to be a strong linear relationship between runs and strikeouts. The direction is the opposite of that for runs and at_bats: the correlation is negative (about -0.41). The R^2 value is also smaller: strikeouts explain 16.94% of the variability in runs (adjusted R^2 of 13.97%), compared with 37.3% for at_bats. My variable therefore does not predict runs better than at_bats, because both its R^2 and the magnitude of its correlation coefficient are lower.
Of the traditional variables, bat_avg appears to predict runs best, with a correlation coefficient of 0.810. Of the new variables, new_obs predicts runs best, with a correlation coefficient of 0.967, the highest of all the variables. The conditions of constant variability and normally distributed residuals appear to be met, based on the residual scatterplot and the Shapiro-Wilk test of the residuals of model 4 (m4), which uses new_obs as the predictor of runs. I therefore believe that this is the most reliable linear model for runs in the dataset.
cor(mlb11$runs, mlb11$strikeouts)
## [1] -0.4115312
plot_ss(x = mlb11$strikeouts, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 1054.7342 -0.3141
##
## Sum of Squares: 163870.1
m3 <- lm(runs ~ strikeouts, data = mlb11)
summary(m3)
##
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.27 -46.95 -11.92 55.14 169.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1054.7342 151.7890 6.949 1.49e-07 ***
## strikeouts -0.3141 0.1315 -2.389 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1397
## F-statistic: 5.709 on 1 and 28 DF, p-value: 0.02386
plot(mlb11$runs ~ mlb11$strikeouts)
abline(m3)
plot(m3$residuals ~ mlb11$strikeouts)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
cor(m3$residuals, mlb11$at_bats) # note: compares m3 residuals with at_bats, not with the model's predictor strikeouts
## [1] 0.4607368
hist(m3$residuals)
shapiro.test(m3$residuals)
##
## Shapiro-Wilk normality test
##
## data: m3$residuals
## W = 0.96885, p-value = 0.5083
cor(mlb11$runs, mlb11$hits)
## [1] 0.8012108
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
cor(mlb11$runs, mlb11$stolen_bases)
## [1] 0.05398141
cor(mlb11$runs, mlb11$wins)
## [1] 0.6008088
cor(mlb11$runs, mlb11$new_onbase)
## [1] 0.9214691
cor(mlb11$runs, mlb11$new_slug)
## [1] 0.9470324
cor(mlb11$runs, mlb11$new_obs)
## [1] 0.9669163
plot_ss(x = mlb11$new_obs, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
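The repeated cor() calls above could also be written as a single pass over a vector of column names. The sketch below uses R's built-in mtcars data as a stand-in so that it runs without the lab file; the column choices are illustrative only and have nothing to do with baseball.

```r
# Built-in mtcars is used as a stand-in for mlb11; mpg plays the role of runs
predictors <- c("hp", "wt", "qsec", "drat")

# Correlation of the response with each candidate predictor
cors <- sapply(predictors, function(v) cor(mtcars$mpg, mtcars[[v]]))

# Rank predictors by the strength (absolute value) of the correlation
ranked <- sort(abs(cors), decreasing = TRUE)

ranked
```

For the lab data, the same pattern would apply with mlb11$runs as the response and the mlb11 column names in the predictors vector.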
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -686.6 1919.4
##
## Sum of Squares: 12837.66
m4 <- lm(runs ~ new_obs, data = mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
plot(mlb11$runs ~ mlb11$new_obs)
abline(m4)
plot(m4$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
cor(m4$residuals, mlb11$at_bats) # note: compares m4 residuals with at_bats, not with the model's predictor new_obs
## [1] 0.01368748
hist(m4$residuals)
shapiro.test(m4$residuals)
##
## Shapiro-Wilk normality test
##
## data: m4$residuals
## W = 0.98068, p-value = 0.8434
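As a compact way to compare the three models whose summaries are printed in this report, the values can be gathered into one small table. The numbers below are copied by hand from the summary() outputs for m1, m3, and m4, so this is only a bookkeeping sketch and not a new analysis:

```r
# Values copied from the summary() outputs in this report
fits <- data.frame(
  predictor = c("at_bats", "strikeouts", "new_obs"),
  r_squared = c(0.3729, 0.1694, 0.9349),  # Multiple R-squared
  resid_se  = c(66.47, 76.5, 21.41)       # Residual standard error
)

# Order from best to worst by R-squared
ranked <- fits[order(-fits$r_squared), ]

ranked
```

The ordering by R-squared and by residual standard error agree here: new_obs has both the highest R-squared and the smallest residual standard error.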