Chapter 7 LAB: Introduction to linear regression

load("more/mlb11.RData")

Exercise 1 : What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

Answer: A scatterplot can be used to display the relationship between two numerical variables.

library(tidyverse)

#plot(mlb11$runs ~ mlb11$at_bats, main = "Relationship between Runs and atBats", xlab = "At Bats", ylab = "Runs")
# Scatterplot of runs against at_bats, mapping the columns of mlb11 directly
ggplot(data = mlb11) + geom_point(mapping = aes(x = at_bats, y = runs), color = "red", size = 3, shape = 19) +
  labs(x = "At Bats", y = "Runs")

The relationship looks moderately linear, but it is not strong enough to comfortably use a linear model to predict the number of runs.

Since the relationship is roughly linear, we can quantify its strength with the correlation coefficient.

cor(mlb11$runs, mlb11$at_bats)

## [1] 0.610627
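
To make the correlation coefficient concrete, here is a minimal sketch that computes it from its definition and compares the result with cor(). The vectors x and y are made-up illustration values, not the mlb11 data.

```r
# Hypothetical toy data (NOT taken from mlb11)
x <- c(5400, 5450, 5500, 5550, 5600, 5650)
y <- c(640, 610, 700, 690, 760, 720)

# Pearson correlation from its definition: the sum of products of deviations
# divided by the square root of the product of the sums of squared deviations
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function should agree up to floating-point error
r_builtin <- cor(x, y)
print(c(manual = r_manual, builtin = r_builtin))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.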

Exercise 2: Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

Answer: The relationship between runs and at bats is linear and positive but only moderately strong, as the correlation coefficient of 0.610627 is well below +1. A few unusual observations also stand out in the plot, such as the teams with 5518 and 5600 at bats.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Exercise 3: Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

Answer: The smallest sum of squares that I got after running plot_ss several times is 123721.9, with a slope of 0.6305 and an intercept of -2789.2429. I got the same result on every run.
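
plot_ss reports the sum of squared residuals for a chosen line. The sketch below computes that quantity with a hypothetical helper, sum_sq(), on made-up data (not mlb11). It illustrates why the line fitted by lm() gives the smallest possible sum of squares.

```r
# Hypothetical toy data (NOT taken from mlb11)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# Hypothetical helper: sum of squared residuals for the line y = b0 + b1 * x
sum_sq <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Least squares fit and its sum of squares
fit <- lm(y ~ x)
b <- coef(fit)
ss_best <- sum_sq(b[1], b[2])

# A line with a slightly different slope has a larger sum of squares
ss_other <- sum_sq(b[1], b[2] + 0.1)
print(c(least_squares = ss_best, perturbed = ss_other))
```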

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

Exercise 4 : Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Answer :

Home_Runs <- mlb11$homeruns
Runs      <- mlb11$runs

ggplot(data = mlb11) + geom_point(mapping = aes(x = homeruns, y = runs), color = "red", size = 3, shape = 19) +
  labs(x = "Home Runs", y = "Runs", title = "Relationship between Runs and Home Runs")

Correlation coefficient

cor(Runs, Home_Runs)

## [1] 0.7915577

m2 <- lm(runs ~ homeruns, data = mlb11)

summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

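y^ = 415.2389 + 1.8345 * homeruns

slope = 1.8345. In context, each additional home run is associated with an increase of about 1.83 runs on average, so teams that hit more home runs tend to score more runs.

By looking at the plot, we can say that the relationship between runs and home runs is linear, positive and relatively strong, as the correlation coefficient of 0.7916 is close to +1.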
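
As a quick illustration of what the slope means, the sketch below plugs the coefficients from the summary(m2) output into a hypothetical helper, predict_runs(), and compares two hypothetical teams that differ by 10 home runs. The home run counts 150 and 160 are arbitrary illustration values.

```r
# Coefficients copied from the summary(m2) output above
intercept <- 415.2389
slope     <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs <- function(homeruns) intercept + slope * homeruns

# Two hypothetical teams that differ by 10 home runs
diff_10 <- predict_runs(160) - predict_runs(150)
print(diff_10)  # 10 * slope = 18.345
```

The difference does not depend on the starting value: any 10 additional home runs correspond to about 18 more predicted runs.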

Exercise 5: If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Answer:

From the model m1 fitted above, the least squares regression line for runs vs at_bats is

y^ = -2789.2429 + 0.6305 * at_bats

For at_bats = 5,578, the predicted number of runs is

y^ = -2789.2429 + 0.6305 * 5578 = 727.6861

The estimated number of runs for 5578 at bats based on the regression line above is about 728. No team in the data has exactly 5578 at bats, but the Philadelphia Phillies have 5579 at bats and 713 runs. Taking that team as the closest observation, the model overestimates the runs by about 728 - 713 = 15, so the residual (observed minus predicted) is about -15.
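
The arithmetic above can be reproduced in a few lines. This is a sketch that uses the rounded coefficients printed by summary(m1) and, as an assumption, treats the team with 5,579 at bats and 713 runs as the observed value for comparison.

```r
# Coefficients copied from the summary(m1) output above (rounded as printed)
intercept <- -2789.2429
slope     <- 0.6305

# Predicted runs for a team with 5,578 at-bats
predicted <- intercept + slope * 5578

# Observed runs for the closest team in the data (5,579 at-bats, 713 runs)
observed <- 713

# Residual = observed - predicted; a negative value means the model overestimates
residual <- observed - predicted
print(round(c(predicted = predicted, residual = residual), 4))
# predicted 727.6861, residual -14.6861
```

With the fitted model in the session, predict(m1, newdata = data.frame(at_bats = 5578)) should give the same prediction up to the rounding of the coefficients.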

Exercise 6: Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

Answer

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

There is no apparent pattern in the residuals plot: the points are scattered around the dashed zero line, although somewhat unevenly, and show no curvature. This indicates that the relationship between runs and at-bats can reasonably be treated as linear.

Exercise 7: Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Answer :

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

The histogram and the normal probability plot suggest that the nearly normal residuals condition has been met.

Exercise 8: Based on the plot in (1), does the constant variability condition appear to be met?

Answer: The variation of points around the least squares line appears to be reasonably constant, so we can infer that the constant variability condition has been met.
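
The constant variability condition can also be checked roughly by number. The sketch below uses simulated data (not mlb11) with a known linear relationship and compares the residual spread in the lower and upper halves of the predictor. The split at the median is an arbitrary choice made for illustration, not a formal test.

```r
set.seed(1)

# Simulated data with a truly linear relationship and constant error spread
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200, sd = 1.5)
fit <- lm(y ~ x)
res <- resid(fit)

# Rough numeric check: residual standard deviation in each half of x
lower <- res[x <= median(x)]
upper <- res[x >  median(x)]
spread <- c(lower = sd(lower), upper = sd(upper))
print(round(spread, 2))
```

Similar spreads in both halves are consistent with constant variability. A much larger spread in one half would suggest a fan-shaped residual plot.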

On Your Own

1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Answer: Let us take bat_avg as the predictor variable, as it might also be a good predictor of runs.

plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship between Runs and Batting Avg", xlab = "Batting Avg", ylab = "Runs")
m3 <- lm(runs ~ bat_avg, data = mlb11)
abline(m3)

cor(mlb11$runs, mlb11$bat_avg)

## [1] 0.8099859

lg<-lm(mlb11$runs~mlb11$bat_avg, data=mlb11)
summary(lg)

## 
## Call:
## lm(formula = mlb11$runs ~ mlb11$bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -642.8      183.1  -3.511  0.00153 ** 
## mlb11$bat_avg   5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

Linear regression line formula: y^ = -642.8 + 5242.2 * bat_avg

Based on the plot, the linear model statistics and the correlation coefficient for the relationship between runs and batting average, it is evident that the relationship is positive, linear and relatively strong.

How does this relationship compare to the relationship between runs and at_bats? Use the R^2 values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

Answer

R^2 is the percentage of the variance in the dependent variable that can be explained by the linear model. It always lies between 0% and 100%: the higher the value, the better the model explains the dependent variable, and the lower the value, the weaker its predictive power.

Let m1 be the model for the relationship between runs and at_bats, which produces an R^2 of 37.29% (fitted after Exercise 3). Let m3 be the model for the relationship between runs and bat_avg, which produces an R^2 of 65.61%.

Comparing the two, the R^2 of model m3 is far greater than that of model m1, so the variable bat_avg predicts runs better than at_bats.
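
In simple linear regression, R^2 equals the square of the correlation coefficient. The sketch below checks this against the cor() outputs reported earlier in this lab; the numbers are copied from those outputs, not recomputed from the data.

```r
# Correlations with runs, copied from the cor() outputs above
r <- c(at_bats = 0.610627, homeruns = 0.7915577, bat_avg = 0.8099859)

# In simple linear regression, R^2 is the squared correlation
r_squared <- round(r^2, 4)
print(r_squared)
# at_bats 0.3729, homeruns 0.6266, bat_avg 0.6561
```

These values match the "Multiple R-squared" lines in the corresponding model summaries.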

Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

Answer: After running the summary statistics for all the variables, the variable that best predicts runs based on R^2 is bat_avg.

plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship between Runs and Batting Avg", xlab = "Batting Avg", ylab = "Runs")
m4 <- lm(runs ~ bat_avg, data = mlb11)
abline(m4)

hist(m4$residuals)

qqnorm(m4$residuals)
qqline(m4$residuals)

cor(mlb11$runs, mlb11$bat_avg)

## [1] 0.8099859

summary(m4)

## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
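
Comparing several candidate predictors by R^2 can be automated. The sketch below is one possible approach, shown on a simulated stand-in data frame named toy with made-up columns a, b and c. With the lab data, the same loop would use mlb11 and its actual column names.

```r
set.seed(42)

# Hypothetical stand-in data frame (NOT mlb11); column names are made up
n <- 30
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$runs <- 700 + 40 * toy$a + 10 * toy$b + rnorm(n, sd = 20)

predictors <- c("a", "b", "c")

# Fit runs ~ predictor for each candidate and extract its R^2
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "runs"), data = toy)
  summary(fit)$r.squared
})

# Sort so the best predictor comes first
print(round(sort(r2, decreasing = TRUE), 3))
best <- names(which.max(r2))
print(best)
```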

Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

Answer

The three newer variables: new_onbase, new_slug and new_obs

cor(mlb11$runs, mlb11$new_onbase)

## [1] 0.9214691

cor(mlb11$runs, mlb11$new_slug)

## [1] 0.9470324

cor(mlb11$runs, mlb11$new_obs)

plot(mlb11$runs ~ mlb11$new_obs, main = "Relationship between Runs and new_obs", xlab = "new_obs", ylab = "Runs")
m5 <- lm(runs ~ new_obs, data = mlb11)
abline(m5)

After examining the summary statistics and correlation coefficients of all three new predictors (new_onbase, new_slug and new_obs), the relationship between runs and new_obs has the highest R^2 and correlation coefficient, so new_obs appears to be the best and most effective predictor of runs.

Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

Answer

Model diagnostics for the regression model with the best predictor of runs, new_obs

m5 <- lm(runs ~ new_obs, data = mlb11)

Linearity:

The relationship looks linear: the residual plot shows no curvature, and the residuals are scattered around zero with approximately constant variability across the range of the predictor.

plot(m5$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)

Nearly normal residuals:

If the residuals are approximately normally distributed, then the normal quantile-quantile plot of the residuals will follow an approximately straight line.

The normal quantile-quantile plot of the residuals follows a fairly straight line, so we can say that the residuals are approximately normally distributed and the model meets the nearly normal residuals condition.

hist(m5$residuals)

qqnorm(m5$residuals)
qqline(m5$residuals)
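
The straightness of a normal quantile-quantile plot can be summarised roughly by the correlation between its two coordinates. The sketch below uses simulated residuals (not the residuals of m5) and a hypothetical helper, qq_cor(), to contrast a normal sample with a strongly skewed one. It is an informal illustration, not a formal normality test.

```r
set.seed(7)

# Simulated residuals: one normal sample and one strongly right-skewed sample
normal_res <- rnorm(100)
skewed_res <- rexp(100) - 1

# Hypothetical helper: correlation between theoretical and sample quantiles.
# qqnorm(..., plot.it = FALSE) returns the plot coordinates without drawing.
qq_cor <- function(res) {
  qq <- qqnorm(res, plot.it = FALSE)
  cor(qq$x, qq$y)
}

print(round(c(normal = qq_cor(normal_res), skewed = qq_cor(skewed_res)), 3))
```

A value very close to 1 corresponds to points lying near a straight line. The skewed sample should score visibly lower.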

Constant variability:

Based on the plot, the variability of points around the least squares line remains roughly constant, so the constant variability condition has been met.

James Kuruvilla
