Lab Introduction to Linear Regression
Batter up
The data
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
Exercise 1
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?
I would use a scatterplot. The relationship looks roughly linear, but it is not very strong. I would not be comfortable using a linear model to predict runs, since the points are scattered widely around any straight line.
plot(mlb11$runs ~ mlb11$at_bats)
cor(mlb11$runs, mlb11$at_bats)
[1] 0.610627
Sum of squared residuals
Exercise 2
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
The two variables are correlated, but not closely. The direction is positive: as at-bats increase, runs tend to increase as well. The form is roughly linear, but the strength is only moderate, since the correlation coefficient is about 0.61, below 0.7. The trend is also not consistent, because some points (for example, the one near 5600 at-bats and 718 runs) depart from it. Overall the points are widely scattered, which makes a strong linear relationship hard to see.
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
Click two points to make a line.
Call:
lm(formula = y ~ x, data = pts)
Coefficients:
(Intercept)            x
 -2789.2429       0.6305
Sum of Squares:  123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
Click two points to make a line.
Call:
lm(formula = y ~ x, data = pts)
Coefficients:
(Intercept)            x
 -2789.2429       0.6305
Sum of Squares:  123721.9
Exercise 3
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
Click two points to make a line.
Call:
lm(formula = y ~ x, data = pts)
Coefficients:
(Intercept)            x
 -2789.2429       0.6305
Sum of Squares:  123721.9
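The interactive clicking did not work for me :(. Every run returned the same line, with a sum of squares of 123721.9. Its coefficients match the least squares fit from lm(runs ~ at_bats) in the next section, so I do not expect a hand-picked line to do better than this.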
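Since I could not pick lines by hand, here is a rough sketch of the quantity that, as I understand it, plot_ss reports as the Sum of Squares. The x and y values are a made-up toy data set (not the mlb11 data), and sum_sq is a helper I define only for this illustration.

```r
# Hypothetical toy data (NOT the mlb11 values), used only to illustrate the idea.
x <- c(5400, 5450, 5500, 5550, 5600, 5650)
y <- c(620, 700, 640, 730, 690, 780)

# Sum of squared residuals for a line with the given intercept and slope.
sum_sq <- function(intercept, slope, x, y) {
  residuals <- y - (intercept + slope * x)
  sum(residuals^2)
}

# Least squares fit and its sum of squares.
fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])
ss_best <- sum_sq(b0, b1, x, y)

# A different line (here, a slightly steeper slope) gives a larger sum of squares.
ss_other <- sum_sq(b0, b1 + 0.01, x, y)

print(ss_best)
print(ss_other)
```

Any line other than the least squares line gives a larger sum of squares, which is why picking points by hand can only approach the value that lm finds.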
The linear model
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
Call:
lm(formula = runs ~ at_bats, data = mlb11)
Residuals:
    Min      1Q  Median      3Q     Max
-125.58  -47.05  -16.59   54.40  176.87
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2789.2429   853.6957  -3.267 0.002871 **
at_bats         0.6305     0.1545   4.080 0.000339 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 66.47 on 28 degrees of freedom
Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
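Reading the estimates off this output, the fitted line is ŷ = -2789.2429 + 0.6305 × at_bats. As a quick check, the sketch below uses only numbers copied from the output in this report. It assumes the standard fact that, for a regression with a single predictor, R-squared equals the square of the correlation coefficient.

```r
# Correlation between runs and at_bats, copied from the cor() output earlier.
r <- 0.610627

# For simple linear regression, R-squared is the squared correlation.
r_squared <- r^2

# Rounded to four decimals, this agrees with "Multiple R-squared: 0.3729".
print(round(r_squared, 4))
```

So about 37% of the variability in runs is explained by at-bats.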
Exercise 4
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
plot(mlb11$runs ~ mlb11$homeruns, main = "Relationship between Runs and Home runs", xlab = "Home Runs", ylab = "Runs")
baseball <- lm(runs ~ homeruns, data = mlb11)
summary(baseball)
Call:
lm(formula = runs ~ homeruns, data = mlb11)
Residuals:
    Min      1Q  Median      3Q     Max
-91.615 -33.410   3.231  24.292 104.631
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
homeruns      1.8345     0.2677   6.854 1.90e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 51.29 on 28 degrees of freedom
Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
cor(mlb11$runs, mlb11$homeruns)
[1] 0.7915577
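Equation: ŷ = 415.2389 + 1.8345 × homeruns, where ŷ is the predicted number of runs.
The slope tells us that each additional home run is associated with about 1.83 more runs, on average. After taking a look at the plot, I can say that the relationship between runs and home runs is positive, linear, and fairly strong, as the correlation coefficient is about 0.79, which is above 0.7.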
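To make the slope concrete, here is a small sketch that plugs the estimates from the output above into the equation. The function name predict_runs and the home run totals 150 and 160 are my own choices for illustration, not values from the data.

```r
# Estimates copied from summary(baseball) above.
intercept <- 415.2389
slope <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs.
predict_runs <- function(homeruns) {
  intercept + slope * homeruns
}

# Two made-up teams that differ by 10 home runs.
difference <- predict_runs(160) - predict_runs(150)

# The predicted gap is 10 times the slope, about 18.3 runs.
print(difference)
```

The gap depends only on the slope, so the same 10 home run difference gives the same predicted difference anywhere along the line.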
Prediction and prediction errors
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
Exercise 5
If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
Plugging 5,578 into the equation, the team manager would predict about 728 runs. Compared with the observed value of 713 runs, this is an overestimate of about 15 runs. The residual is observed minus predicted, so it is about 713 - 728 = -15.
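The arithmetic behind this answer can be sketched as follows, using the coefficients from summary(m1) and the observed value of 713 runs from my answer. The variable names are my own.

```r
# Estimates copied from summary(m1).
intercept <- -2789.2429
slope <- 0.6305

at_bats <- 5578
observed_runs <- 713  # observed value used in the answer above

# Prediction from the least squares line: about 727.69 runs.
predicted_runs <- intercept + slope * at_bats

# Residual = observed - predicted: about -14.69.
residual <- observed_runs - predicted_runs

print(predicted_runs)
print(residual)
```

A negative residual means the line predicts more runs than were observed, which is what makes this an overestimate.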
Model diagnostics
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
Exercise 6
Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds horizontal dashed line at y = 0
I do not see any obvious pattern in the residuals plot. The residuals are scattered above and below zero with no clear curve, which suggests that the relationship between runs and at-bats is linear.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
Exercise 7
Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
Yes, looking at the histogram and the normal probability plot, I would say that the nearly normal residuals condition appears to be met.
Exercise 8
Based on the plot in (1), does the constant variability condition appear to be met?
The spread of the points around the least squares line looks roughly steady across the range of at-bats, so the constant variability condition appears to be met.