Lab Introduction to Linear Regression

Author

George Obongo

Batter up

The data

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?

I would use a scatterplot. The relationship looks roughly linear but only moderately strong (the correlation is about 0.61). I would not be fully comfortable using a linear model to predict runs from at_bats, since the points are widely scattered around any straight line.

plot(mlb11$runs~mlb11$at_bats)

cor(mlb11$runs, mlb11$at_bats)
[1] 0.610627

Sum of squared residuals

Exercise 2

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

The two variables are positively related: as at-bats increase, runs tend to increase as well. The form is roughly linear, but the strength is only moderate, since the correlation coefficient (about 0.61) is below 0.7. The relationship is not perfectly consistent, because some points, for example the one near (5600, 718), do not follow the overall trend. The points are also widely scattered, which makes a strong linear relationship hard to see.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

Click two points to make a line.
                                
Call:
lm(formula = y ~ x, data = pts)

Coefficients:
(Intercept)            x  
 -2789.2429       0.6305  

Sum of Squares:  123721.9

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

Click two points to make a line.
                                
Call:
lm(formula = y ~ x, data = pts)

Coefficients:
(Intercept)            x  
 -2789.2429       0.6305  

Sum of Squares:  123721.9

Exercise 3

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

Click two points to make a line.
                                
Call:
lm(formula = y ~ x, data = pts)

Coefficients:
(Intercept)            x  
 -2789.2429       0.6305  

Sum of Squares:  123721.9

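The interactive click step did not work for me. The output above shows the same coefficients as the least squares fit in the next section, so 123721.9 is the sum of squares for the least squares line.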
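The quantity that plot_ss reports can also be computed directly, without clicking. The sketch below is only an illustration: it uses R's built-in cars data rather than mlb11 so that it runs without a download, and the helper name sum_sq_resid is hypothetical. It compares the sum of squared residuals for the least squares line from lm() against a line with a different slope.

```r
# Hypothetical helper: sum of squared residuals for the line y = b0 + b1 * x
sum_sq_resid <- function(x, y, b0, b1) {
  sum((y - (b0 + b1 * x))^2)
}

# Built-in data, used only for illustration (not the mlb11 data)
fit <- lm(dist ~ speed, data = cars)
b <- unname(coef(fit))  # b[1] = intercept, b[2] = slope

# Least squares line versus a line with an arbitrarily changed slope
ss_ls    <- sum_sq_resid(cars$speed, cars$dist, b[1], b[2])
ss_other <- sum_sq_resid(cars$speed, cars$dist, b[1], b[2] + 0.5)

ss_ls
ss_other  # larger than ss_ls, since lm() minimizes this quantity
```

With mlb11 loaded, the analogous call would be sum(residuals(lm(runs ~ at_bats, data = mlb11))^2), which should agree with the value printed by plot_ss above.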


The linear model

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)

Call:
lm(formula = runs ~ at_bats, data = mlb11)

Residuals:
    Min      1Q  Median      3Q     Max 
-125.58  -47.05  -16.59   54.40  176.87 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
at_bats         0.6305     0.1545   4.080 0.000339 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 66.47 on 28 degrees of freedom
Multiple R-squared:  0.3729,    Adjusted R-squared:  0.3505 
F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
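In a regression with a single predictor, Multiple R-squared equals the square of the correlation coefficient. The short check below uses the correlation printed earlier in this lab (0.610627) and then confirms the same identity on R's built-in cars data, which is used here only so the example runs without the mlb11 download.

```r
# Correlation between runs and at_bats, as printed earlier in this lab
r <- 0.610627
r_squared <- r^2
round(r_squared, 4)  # compare with Multiple R-squared in summary(m1)

# The same identity on built-in data (illustration only, not mlb11)
fit <- lm(dist ~ speed, data = cars)
same <- all.equal(summary(fit)$r.squared, cor(cars$dist, cars$speed)^2)
same
```

So at_bats explains roughly 37% of the variability in runs, which matches the moderate correlation seen in the scatterplot.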

Exercise 4

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

# Plot runs against home runs (the predictor for this exercise)
plot(mlb11$runs ~ mlb11$homeruns, main = "Relationship between Runs and Home runs", xlab = "Home Runs", ylab = "Runs")

baseball <- lm(runs ~ homeruns, data = mlb11)
summary(baseball)

Call:
lm(formula = runs ~ homeruns, data = mlb11)

Residuals:
    Min      1Q  Median      3Q     Max 
-91.615 -33.410   3.231  24.292 104.631 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
homeruns      1.8345     0.2677   6.854 1.90e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51.29 on 28 degrees of freedom
Multiple R-squared:  0.6266,    Adjusted R-squared:  0.6132 
F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

cor(mlb11$runs, mlb11$homeruns)
[1] 0.7915577

Equation: ŷ = 415.2389 + 1.8345 × homeruns, where ŷ is the predicted number of runs

After taking a look at the plot, I can say that the relationship between runs and home runs is positive, linear, and fairly strong, as the correlation coefficient (about 0.79) is above 0.7. The slope tells us that each additional home run is associated with about 1.83 more runs on average.
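To make the role of the slope concrete, the sketch below plugs the estimates from the summary output into the regression equation. The function name predict_runs and the home run totals 100 and 110 are arbitrary choices for illustration, not values taken from the data.

```r
# Estimates copied from summary(baseball) above
b0 <- 415.2389  # intercept
b1 <- 1.8345    # slope for homeruns

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs <- function(homeruns) b0 + b1 * homeruns

# Difference in predicted runs for 10 additional home runs
extra_runs <- predict_runs(110) - predict_runs(100)
extra_runs  # 10 * 1.8345, about 18.3 more predicted runs
```

With the fitted model available, coef(baseball) returns the same two estimates without copying them by hand.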

Prediction and prediction errors

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

Exercise 5

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

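Plugging 5,578 at-bats into the regression equation gives a prediction of about 728 runs (-2789.2429 + 0.6305 × 5578 ≈ 727.7). Compared with the observed value of 713 runs, this is an overestimate of about 15 runs, so the residual (observed minus predicted) is about -15.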
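The arithmetic behind this answer can be written out as a short sketch. It uses the rounded coefficients printed in summary(m1) and the observed value of 713 runs quoted in the answer above; the variable names are illustrative.

```r
# Rounded estimates copied from summary(m1)
b0 <- -2789.2429  # intercept
b1 <- 0.6305      # slope for at_bats

at_bats   <- 5578
predicted <- b0 + b1 * at_bats  # predicted runs from the regression line
observed  <- 713                # observed runs used in the answer above

# Residual = observed - predicted; a negative value means an overestimate
resid_5578 <- observed - predicted

round(predicted, 1)
round(resid_5578, 1)
```

With the model object in memory, predict(m1, newdata = data.frame(at_bats = 5578)) gives the same prediction using the unrounded coefficients, so the result may differ slightly in the decimals.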

Model diagnostics

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0


Exercise 6

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds horizontal dashed line at y = 0

I do not see any obvious pattern in the residuals plot. The residuals are scattered around zero with no clear curve, which supports treating the relationship between runs and at-bats as linear.

hist(m1$residuals)


qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

Exercise 7

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Yes. Looking at the histogram and the normal probability plot, the nearly normal residuals condition appears to be met.

Exercise 8

Based on the plot in (1), does the constant variability condition appear to be met?

The spread of the points around the least squares line looks roughly steady across the range of at-bats, so the constant variability condition appears to be met.