Loading data
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
Plotting scatterplot of runs vs. at-bats
plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At-Bats", ylab = "Runs")
The relationship between “at-bats” and “runs” seems loosely linear. I would feel confident enough to use a linear model to predict runs based on at-bats.
Finding correlation coefficient between “at-bats” and “runs”
cor(mlb11$at_bats, mlb11$runs)
## [1] 0.610627
The relationship between “at-bats” and “runs” is positive and somewhat linear, with a correlation coefficient of about 0.61. Most teams in our scatterplot had between 5400 and 5600 at-bats, with one team having over 5700 at-bats and over 850 runs. There is one team that seems extremely efficient, scoring over 850 runs in under 5525 at-bats. There are a few that seem particularly inefficient, with one scoring around 725 runs in about 5675 at-bats and another scoring just over 500 runs in just under 5600 at-bats, among others.
Creating interactive plots
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
I wasn’t able to select my own line interactively (maybe because I’m running this in a markdown file and not an interactive session), but running the last chunk seems to have automatically found the least squares line, which minimizes the sum of squared residuals (123721.9).
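To make the sum of squares idea concrete, here is a minimal sketch on a tiny made-up data set. The x and y values and the helper name sum_sq are hypothetical and are not part of mlb11. It compares the line found by lm() with a hand-picked line of the kind plot_ss would let you draw.

```r
# Toy data: hypothetical values chosen for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for the line y = intercept + slope * x
sum_sq <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares fit
fit <- lm(y ~ x)
ss_best <- sum_sq(coef(fit)[["(Intercept)"]], coef(fit)[["x"]])

# A hand-picked line, like one drawn by clicking two points
ss_guess <- sum_sq(0, 2.1)

# The least squares line never has a larger sum of squares than any other line
ss_best <= ss_guess
```

Any other intercept and slope passed to sum_sq should give a value at least as large as ss_best.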
Using lm function to create a linear model that uses at-bats to predict runs
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
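As a quick consistency check on this output, the sketch below recomputes a few of the summary numbers by hand. It only uses values printed above, and it relies on two facts that hold for a regression with a single predictor: R-squared equals the squared correlation, and the F-statistic equals the squared t value of the slope. Small differences are expected because the inputs are rounded.

```r
# Values copied from the output above
r        <- 0.610627  # cor(mlb11$at_bats, mlb11$runs)
slope    <- 0.6305    # estimate for at_bats
slope_se <- 0.1545    # standard error for at_bats

r_squared <- r^2               # should be close to the Multiple R-squared, 0.3729
t_value   <- slope / slope_se  # should be close to the t value, 4.080
f_value   <- t_value^2         # should be close to the F-statistic, 16.65

c(r_squared = r_squared, t_value = t_value, f_value = f_value)
```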
Fitting new linear model that uses home runs to predict runs
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
Our equation for the regression line is: predicted runs = 415.2389 + 1.8345 * homeruns. The slope is positive: each additional home run is associated with about 1.83 more runs on average. Our multiple R-squared value shows 62.66% of the variability in runs is explained by home runs. This all suggests that teams that hit more home runs tend to score more runs.
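The fitted equation can be wrapped in a small helper to see what the slope means in practice. This is only a sketch: the function name predict_runs_from_hr and the input of 150 home runs are hypothetical, and the coefficients are the rounded estimates from the summary above.

```r
# Rounded coefficients from summary(m2)
intercept <- 415.2389
slope     <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs_from_hr <- function(homeruns) {
  intercept + slope * homeruns
}

# Prediction for a hypothetical team with 150 home runs
pred_150 <- predict_runs_from_hr(150)

# Ten extra home runs change the prediction by 10 * slope
diff_10 <- predict_runs_from_hr(160) - predict_runs_from_hr(150)

c(pred_150 = pred_150, diff_10 = diff_10)
```

With the model object in the session, predict(m2, newdata = data.frame(homeruns = 150)) should give nearly the same prediction using the unrounded coefficients.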
Creating scatterplot with least squares line laid on top
plot(mlb11$runs ~ mlb11$at_bats, xlab = "At-Bats", ylab = "Runs")
abline(m1)
-2789.2429 + 0.6305 * 5578
## [1] 727.6861
We would predict 727.6861 runs, so about 728 runs (to be generous). It looks like an overestimate, but because no team in our data had exactly 5578 at-bats, there is no observed value to compute a residual against. We do have a team with 5579 at-bats, though: the Phillies, who scored 713 runs. We can calculate the residual for them.
(-2789.2429 + 0.6305 * 5579) - 713
## [1] 15.3166
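The prediction for the Phillies exceeds their observed 713 runs by 15.3166. Since a residual is defined as observed minus predicted, the residual is about -15.32, which means our least squares line overestimates this data point by just over 15 runs.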
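Under the usual convention, residual = observed - predicted, so the sign is the opposite of the difference computed above. A minimal sketch using the rounded coefficients from summary(m1) and the Phillies' values quoted in the text:

```r
# Rounded coefficients from summary(m1)
b0 <- -2789.2429
b1 <- 0.6305

# Phillies: 5579 at-bats, 713 runs observed
observed  <- 713
predicted <- b0 + b1 * 5579

# Residual = observed - predicted; a negative value means the line overestimates
residual <- observed - predicted
residual
```

With the model in the session, predict(m1, newdata = data.frame(at_bats = 5579)) would give the prediction from the unrounded coefficients, so the result may differ slightly in the last digits.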
Plotting residuals
plot(m1$residuals ~ mlb11$at_bats, xlab = "At-Bats", ylab = "M1 Residuals")
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
There doesn’t seem to be any apparent pattern in our residual plot. This suggests a linear model is a good fit for this data.
Creating histogram of residuals
hist(m1$residuals)
Creating normal probability plot of residuals
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to normal probability plot
Our histogram doesn’t follow a normal distribution especially well, but it’s not drastically off. The points in our Q-Q plot follow the line more closely. It’s not super clear, but I would say the normal residuals condition appears to be met.
Our residuals plot in Exercise 6 shows no pattern, so the constant variability condition appears to be met.
Exploring runs vs. wins
plot(x = mlb11$wins, y = mlb11$runs, xlab = "Wins", ylab = "Runs")
m3 <- lm(runs ~ wins, data = mlb11)
summary(m3)
##
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.450 -47.506 -7.482 47.346 142.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.121 89.223 3.834 0.000654 ***
## wins 4.341 1.092 3.977 0.000447 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3381
## F-statistic: 15.82 on 1 and 28 DF, p-value: 0.0004469
Our scatterplot for runs vs. wins appears loosely linear, like our scatterplot for runs vs. at-bats. Our multiple R-squared value here (0.361) is slightly lower than the one for runs vs. at-bats (0.3729), so wins seems to predict runs slightly worse than at-bats.
Exploring runs vs. hits
plot(x = mlb11$hits, y = mlb11$runs, xlab = "Hits", ylab = "Runs")
m4 <- lm(runs ~ hits, data = mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
Exploring runs vs. batting average
plot(x = mlb11$bat_avg, y = mlb11$runs, xlab = "Batting Average", ylab = "Runs")
m5 <- lm(runs ~ bat_avg, data = mlb11)
summary(m5)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Exploring runs vs. strikeouts
plot(x = mlb11$strikeouts, y = mlb11$runs, xlab = "Strikeouts", ylab = "Runs")
m6 <- lm(runs ~ strikeouts, data = mlb11)
summary(m6)
##
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.27 -46.95 -11.92 55.14 169.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1054.7342 151.7890 6.949 1.49e-07 ***
## strikeouts -0.3141 0.1315 -2.389 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1397
## F-statistic: 5.709 on 1 and 28 DF, p-value: 0.02386
Exploring runs vs. stolen bases
plot(x = mlb11$stolen_bases, y = mlb11$runs, xlab = "Stolen Bases", ylab = "Runs")
m7 <- lm(runs ~ stolen_bases, data = mlb11)
summary(m7)
##
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139.94 -62.87 10.01 38.54 182.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 677.3074 58.9751 11.485 4.17e-12 ***
## stolen_bases 0.1491 0.5211 0.286 0.777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared: 0.002914, Adjusted R-squared: -0.0327
## F-statistic: 0.08183 on 1 and 28 DF, p-value: 0.7769
Further exploring runs vs. batting average with a scatterplot with a least squares line laid on top and also plotting the residuals
plot(mlb11$runs ~ mlb11$bat_avg, xlab = "Batting Average", ylab = "Runs")
abline(m5)
plot(m5$residuals ~ mlb11$bat_avg, xlab = "Batting Average", ylab = "M5 Residuals")
abline(h = 0, lty = 3)
It seems batting average is the best predictor of runs out of our remaining five variables. It has a strong multiple R-squared value of 0.6561.
Exploring runs vs. new on-base percentage
plot(x = mlb11$new_onbase, y = mlb11$runs, xlab = "New On-Base Percentage", ylab = "Runs")
m8 <- lm(runs ~ new_onbase, data = mlb11)
summary(m8)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
Exploring runs vs. new slugging percentage
plot(x = mlb11$new_slug, y = mlb11$runs, xlab = "New Slugging Percentage", ylab = "Runs")
m9 <- lm(runs ~ new_slug, data = mlb11)
summary(m9)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
Exploring runs vs. new on-base plus slugging percentage
plot(x = mlb11$new_obs, y = mlb11$runs, xlab = "New On-Base Plus Slugging Percentage", ylab = "Runs")
m10 <- lm(runs ~ new_obs, data = mlb11)
summary(m10)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
New On-Base plus Slugging percentage seems to be the best predictor of runs. This makes sense as OBS is a more complete statistic regarding batting. It accounts for on-base percentage, which measures how often a batter gets on base, as well as slugging percentage, which measures how productive a batter is while accounting for power (meaning it weighs extra base hits and home runs more than singles). The team with better batters should logically score more runs.
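To back up this comparison, the sketch below collects the multiple R-squared values printed in the summaries above into one named vector and ranks them. The values are copied by hand from the output; the vector name r_squared is just an illustrative choice.

```r
# Multiple R-squared for each single-predictor model, copied from the summaries above
r_squared <- c(
  at_bats      = 0.3729,
  homeruns     = 0.6266,
  wins         = 0.3610,
  hits         = 0.6419,
  bat_avg      = 0.6561,
  strikeouts   = 0.1694,
  stolen_bases = 0.002914,
  new_onbase   = 0.8491,
  new_slug     = 0.8969,
  new_obs      = 0.9349
)

# Rank predictors from most to least variability in runs explained
ranked <- sort(r_squared, decreasing = TRUE)
ranked

best_predictor <- names(ranked)[1]
best_predictor
```

With the fitted models still in the session, the same numbers can be pulled directly with summary(m10)$r.squared and so on, which avoids copying errors.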
Plotting residuals of runs vs. new OBS
plot(m10$residuals ~ mlb11$new_obs, xlab = "New OBS", ylab = "M10 Residuals")
abline(h = 0, lty = 3)
Plotting histogram of residuals
hist(m10$residuals)
Creating normal probability plot of residuals
qqnorm(m10$residuals)
qqline(m10$residuals)
A linear model is a good fit for this data.