load("more/mlb11.RData")
Exercise 1 : What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
Answer : A scatterplot is the appropriate plot for displaying the relationship between two numerical variables.
library(tidyverse)
#plot(mlb11$runs ~ mlb11$at_bats, main = "Relationship between Runs and atBats", xlab = "At Bats", ylab = "Runs")
# Map the columns of mlb11 directly inside aes() instead of copying them into separate vectors
ggplot(data = mlb11) + geom_point(mapping = aes(x = at_bats, y = runs), color = "red", size = 3, shape = 19) + labs(x = "At_bats", y = "Runs")
The relationship looks moderately linear, but it is not strong enough to comfortably use a linear model to predict the number of runs.
Since the form of the relationship is roughly linear, we can quantify its strength with the correlation coefficient.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
Exercise 2 : Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
Answer : The relationship between runs and at-bats is positive and roughly linear, but only moderately strong: the correlation coefficient, 0.610627, is well below +1. We can also spot several positive outliers in the plot, such as the teams with 5518 and 5600 at-bats.
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Exercise 3 : Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
Answer : The smallest sum of squares that I got after running the plot_ss function several times is 123721.9, with slope 0.6305 and intercept -2789.2429. I got the same result on every run.
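To make the quantity reported by plot_ss concrete, here is a minimal sketch of how a sum of squared residuals can be computed for any candidate line and compared with the least squares fit. The data values and the helper name sum_of_squares are made up for illustration; they are not the mlb11 values.

```r
# Hypothetical toy data, NOT taken from mlb11
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for the line y = intercept + slope * x
sum_of_squares <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# The least squares line from lm() should give the smallest possible value
fit <- lm(y ~ x)
ss_least_squares <- sum_of_squares(coef(fit)[["(Intercept)"]], coef(fit)[["x"]])

# A hand-picked line for comparison
ss_guess <- sum_of_squares(0, 2)

ss_least_squares <= ss_guess
```

Any line chosen by eye can be plugged into sum_of_squares in the same way; none should beat the value obtained from the lm() coefficients.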
m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Exercise 4 : Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
Answer :
Home_Runs <- mlb11$homeruns
Runs <- mlb11$runs
ggplot(data = mlb11) + geom_point(mapping = aes(x = homeruns, y = runs), color = "red", size = 3, shape = 19) + labs(x = "Home_Runs", y = "Runs") + ggtitle("Relationship between Runs and Home runs")
Correlation coefficient
cor(Runs, Home_Runs)
## [1] 0.7915577
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
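y^ = 415.2389 + 1.8345 * homeruns
slope = 1.8345: each additional home run is associated with about 1.83 more runs on average, so teams that hit more home runs tend to score more runs.
By looking at the plot we can say that the relationship between runs and home runs is positive, linear and relatively strong, as the correlation coefficient 0.7916 is close to +1.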
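As a small illustration of what the slope means, the sketch below plugs two home run totals that differ by one into the fitted equation. The coefficients are copied from the summary(m2) output; the function name predict_runs and the inputs 150 and 151 are assumptions chosen only for the example.

```r
# Coefficients copied from the summary(m2) output
intercept <- 415.2389
slope <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs
predict_runs <- function(homeruns) intercept + slope * homeruns

# 150 and 151 home runs are made-up inputs, one home run apart
difference <- predict_runs(151) - predict_runs(150)
difference
```

The difference equals the slope, which is exactly the "runs per additional home run" interpretation.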
Exercise 5 : If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
Answer
From the model m1 fit after Exercise 3, the least squares regression line for runs vs at_bats is
y^ = -2789.2429 + 0.6305 * at_bats
For a team with 5,578 at-bats:
y^ = -2789.2429 + 0.6305 * 5578 = 727.6861
The predicted number of runs for 5578 at-bats is therefore about 728. No team in the data has exactly 5578 at-bats, but the Philadelphia Phillies have 5579 at-bats and 713 runs. Using the Phillies as the closest observation, the model overestimates the runs by about 728 - 713 = 15, so the residual (observed minus predicted) is about -15.
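The arithmetic above can be checked with a few lines of R. This is only a sketch: the coefficients are typed in from the summary(m1) output rather than read from the model object, and the Phillies' 713 runs are used as the observed value, as in the answer.

```r
# Coefficients copied from the summary(m1) output
intercept <- -2789.2429
slope <- 0.6305

# Prediction for a team with 5578 at-bats
predicted <- intercept + slope * 5578

# Closest observation in the data: Phillies, 5579 at-bats, 713 runs
observed <- 713

# Residual = observed - predicted; a negative value means an overestimate
residual <- observed - predicted

round(predicted, 4)
round(residual, 4)
```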
Exercise 6 : Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
Answer
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
Based on the plot, there is no apparent pattern in the residuals: the points are scattered around the dashed zero line, although somewhat unevenly and with some skew. With no curvature visible, the relationship between runs and at-bats can be considered linear.
Exercise 7 : Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
Answer :
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
The histogram and the normal probability plot suggest that the nearly normal residuals condition has been met.
Exercise 8 : Based on the plot in (1), does the constant variability condition appear to be met?
Answer : The variation of the points around the least squares line appears to be reasonably constant, so we can infer that the constant variability condition has been met.
1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
Answer : Let us take bat_avg as the predictor variable, as I think it might also be a good predictor of runs.
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship between Runs and Batting Avg", xlab = "Batting Avg", ylab = "Runs")
m3 <- lm(runs ~ bat_avg, data = mlb11)
abline(m3)
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
lg<-lm(mlb11$runs~mlb11$bat_avg, data=mlb11)
summary(lg)
##
## Call:
## lm(formula = mlb11$runs ~ mlb11$bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## mlb11$bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Linear Regression Line Formula: y^ = -642.8 + 5242.2 * bat_avg
Based on the plot, the linear model statistics and the correlation coefficient for the relationship between runs and batting average, the relationship is positive, linear and relatively strong.
Answer
R^2 is the percentage of the variance in the dependent variable that can be explained by a linear model. R^2 is always between 0% and 100%: the higher the value, the better the linear model explains the dependent variable, and the lower the value, the weaker the predictability of the dependent variable.
Let m1 be the model for the relationship between runs and at_bats, which produces an R^2 of 37.29%. Let m3 be the model for the relationship between runs and bat_avg, which produces an R^2 of 65.61%.
Comparing the two, the R^2 of model m3 is far greater than that of model m1, so the variable bat_avg predicts runs better than at_bats.
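For a regression with a single predictor, R^2 is the square of the correlation coefficient, so the R^2 values in the summary outputs can be cross-checked against the cor() results. The sketch below uses the three correlations printed earlier in this report; the vector names are just labels chosen for the example.

```r
# Correlations with runs, copied from the cor() outputs in this report
r <- c(at_bats = 0.610627, homeruns = 0.7915577, bat_avg = 0.8099859)

# With one predictor, R^2 equals the squared correlation
r_squared <- r^2
round(r_squared, 4)

# Name of the predictor with the largest R^2 among these three
best <- names(which.max(r_squared))
best
```

The rounded squares agree with the Multiple R-squared lines reported by summary() for the three models.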
Answer : After running the summary statistics for all the traditional variables, the variable that best predicts runs based on R^2 is bat_avg.
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship between Runs and Batting Avg", xlab = "Batting Avg", ylab = "Runs")
m4 <- lm(runs ~ bat_avg, data = mlb11)
abline(m4)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
summary(m4)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Answer
The three newer variables are new_onbase, new_slug and new_obs.
cor(mlb11$runs, mlb11$new_onbase)
## [1] 0.9214691
cor(mlb11$runs, mlb11$new_slug)
## [1] 0.9470324
cor(mlb11$runs, mlb11$new_obs)
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship between Runs and Batting Avg", xlab = "Batting Avg", ylab = "Runs")
m4 <- lm(runs ~ bat_avg, data = mlb11)
abline(m4)
After examining the summary statistics and correlation coefficients of all three new predictors (new_onbase, new_slug and new_obs), the relationship between runs and new_obs has the highest R^2 and correlation coefficient, so new_obs appears to be the best and most effective predictor of runs.
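Comparing several candidate predictors by R^2 can be automated rather than fitting each model by hand. The following is a sketch under stated assumptions: the helper r_squared_by_predictor is a hypothetical name, and the built-in mtcars data set stands in for mlb11 so that the example runs without the lab's data file.

```r
# Hypothetical helper: fit response ~ predictor for each candidate
# and return the R^2 of every single-predictor model
r_squared_by_predictor <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# mtcars is only a stand-in data set; with the lab data the call might look like
# r_squared_by_predictor(mlb11, "runs", c("new_onbase", "new_slug", "new_obs"))
r2 <- r_squared_by_predictor(mtcars, "mpg", c("wt", "hp", "qsec"))

# Candidates ordered from best to worst by R^2
sort(r2, decreasing = TRUE)
```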
Answer
Model diagnostics for the regression model with the best predictor of runs, new_obs
m5 <- lm(runs ~ new_obs, data = mlb11)
The relationship looks linear based on the residual plot: the residuals show no curvature or other systematic pattern, and their variability is approximately constant across the range of new_obs.
plot(m5$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
If the residuals are approximately normally distributed, then the normal quantile-quantile plot of the residuals will follow an approximately straight line.
The normal quantile-quantile plot of the residuals is close to a straight line, so we can say that the residuals are approximately normally distributed and the model meets the nearly normal residuals condition.
hist(m5$residuals)
qqnorm(m5$residuals)
qqline(m5$residuals)
Based on the residual plot, the variability of the points around the least squares line remains roughly constant, so the constant variability condition has been met.
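The same three diagnostic plots are drawn for every model in this report, so they could be wrapped in one small function. This is a sketch, not part of the lab: check_conditions is a hypothetical helper name, and mtcars is used as a stand-in data set so the example runs on its own.

```r
# Hypothetical helper: draw the diagnostic plots used in this report.
# `fit` is an lm object and `x` is the predictor vector it was fit on.
check_conditions <- function(fit, x) {
  res <- residuals(fit)
  # Residuals vs predictor: checks linearity and constant variability
  plot(res ~ x, ylab = "Residuals")
  abline(h = 0, lty = 3)
  # Histogram and normal Q-Q plot: check nearly normal residuals
  hist(res)
  qqnorm(res)
  qqline(res)
  # Return the residuals invisibly for further inspection
  invisible(res)
}

# mtcars is only a stand-in; with the lab data the call might be
# check_conditions(m5, mlb11$new_obs)
demo_fit <- lm(mpg ~ wt, data = mtcars)
demo_res <- check_conditions(demo_fit, mtcars$wt)
```

Passing the predictor that the model was actually fit on keeps the residual plot consistent with the model being diagnosed.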