library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(psych)
## Warning: package 'psych' was built under R version 4.4.2
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
setwd("C:\\Users\\srini\\OneDrive\\Desktop\\Regression Analysis\\Homework 5")
getwd()
## [1] "C:/Users/srini/OneDrive/Desktop/Regression Analysis/Homework 5"
baseball_data3=read.csv("Baseball-Data.csv")
View(baseball_data3)
pairs(baseball_data3[, c(2:4)], main = "Scatter Plot Matrix for MLB Dataset")
1. BA vs. W (Batting Average vs. Wins):
The scatter plot suggests a positive correlation between BA (Batting Average) and Wins (W). Teams with higher batting averages generally tend to win more games, which makes sense as better hitting often leads to more runs and victories.
2. ERA vs. W (Earned Run Average vs. Wins):
There is a negative correlation between ERA and Wins. Teams with lower ERA (better pitching performance) tend to have more wins. This aligns with baseball logic, where a lower ERA indicates effective pitching that prevents opponents from scoring.
3. BA vs. ERA (Batting Average vs. Earned Run Average):
The scatter plot does not show a strong trend between these two variables. Since BA measures offensive performance (hitting) and ERA measures pitching performance (run prevention), there is no reason to expect a direct relationship between them.
lm_model3=lm(W ~ BA + ERA, data = baseball_data3)
summary(lm_model3)
##
## Call:
## lm(formula = W ~ BA + ERA, data = baseball_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.883 -3.014 1.283 3.727 7.558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.811 25.213 0.468 0.643
## BA 605.939 91.570 6.617 4.23e-07 ***
## ERA -19.197 2.308 -8.317 6.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.392 on 27 degrees of freedom
## Multiple R-squared: 0.8259, Adjusted R-squared: 0.813
## F-statistic: 64.05 on 2 and 27 DF, p-value: 5.623e-11
anova(lm_model3)
## Analysis of Variance Table
##
## Response: W
## Df Sum Sq Mean Sq F value Pr(>F)
## BA 1 1713.3 1713.35 58.938 2.947e-08 ***
## ERA 1 2010.7 2010.71 69.167 6.320e-09 ***
## Residuals 27 784.9 29.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The linear regression model estimating Wins (W) from Batting Average (BA) and Earned Run Average (ERA) is statistically significant overall, as indicated by the F-statistic of 64.05 with a p-value of 5.623e-11. This extremely low p-value suggests that at least one of the predictors significantly contributes to explaining the variation in Wins. The model’s R-squared value of 0.8259 indicates that 82.6% of the variance in Wins is explained by BA and ERA, and the adjusted R-squared value of 0.813 confirms that the model remains strong after adjusting for the number of predictors. Both BA and ERA are statistically significant predictors of Wins: their p-values are 4.23e-07 and 6.32e-09, respectively, both far below the 0.05 significance threshold. The coefficient for BA is 605.94, meaning that for every 0.001 increase in BA, Wins increase by approximately 0.61, holding ERA constant. Conversely, the coefficient for ERA is -19.20, indicating that a 1-unit increase in ERA corresponds to approximately 19.2 fewer Wins, holding BA constant, which reinforces the expectation that better pitching (lower ERA) leads to better team performance. The ANOVA table further supports this, with sequential F-values of 58.94 for BA and 69.17 for ERA (given BA), both highly significant.
Overall, the model effectively explains how BA and ERA relate to a team’s Wins. Teams with a higher BA tend to win more games, while those with a higher ERA win fewer games, aligning with baseball strategies that emphasize strong offense and pitching. Given the high R-squared value and the low p-values for both predictors, the model is suitable for understanding the relationship between these performance metrics and team success.
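As a quick arithmetic check on the interpretation above, the sketch below recomputes R-squared from the ANOVA sums of squares and evaluates the fitted equation using the rounded coefficients from the printed summary. The variable names and the example team (BA = 0.250, ERA = 4.00) are illustrative assumptions, and small differences from R's own output are due to rounding.

```r
# Rounded values copied from the printed summary() and anova() output
b0 <- 11.811; b_BA <- 605.939; b_ERA <- -19.197
ss_BA <- 1713.3; ss_ERA <- 2010.7; ss_res <- 784.9

# R-squared = explained sum of squares / total sum of squares
r2 <- (ss_BA + ss_ERA) / (ss_BA + ss_ERA + ss_res)

# Change in Wins for a 0.001 increase in BA, holding ERA fixed
delta_W <- b_BA * 0.001

# Predicted Wins for a hypothetical team with BA = 0.250 and ERA = 4.00
w_hat <- b0 + b_BA * 0.250 + b_ERA * 4.00

print(round(c(r2 = r2, delta_W = delta_W, w_hat = w_hat), 4))
```

With the fitted object available, coef(lm_model3) would supply the unrounded coefficients instead of the hand-copied values.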
fit = lm_model3
res = residuals(fit)
plot(fit$fitted.values, res, xlab="Fitted Values", ylab="Residuals", main="")
abline(h=0, col="red")
qqnorm(res, ylab="Residuals", main="", col="blue")
hist(res, xlab="Residuals", main="", nclass=10, col="orange")
The residual analysis helps verify the validity of our regression model predicting Wins (W) using Batting Average (BA) and Earned Run Average (ERA). The residuals vs. fitted values plot shows no clear pattern, suggesting that the assumption E[ε]=0 holds and that the variance of residuals remains fairly constant, indicating no major heteroscedasticity. This means the relationship between Wins, BA, and ERA is stable across different teams.
However, the Q-Q plot and histogram suggest some deviation from normality, particularly at the extremes, where a few teams may have performed exceptionally well or poorly. While this slight skewness could indicate the presence of outliers, linear regression is generally robust to mild departures from normality. Overall, the model assumptions hold well enough to justify using BA and ERA as predictors of Wins, though further investigation into potential influential teams may improve accuracy.
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
par(mfrow=c(2,1))
res = stdres(lm_model3)
plot(baseball_data3$BA,res,xlab="BA",ylab="Standardized Residuals")
abline(0,0,col="red")
plot(baseball_data3$ERA,res,xlab="ERA",ylab="Standardized Residuals")
abline(0,0,col="red")
The plots of standardized residuals against each predictor (BA and ERA) show no clear pattern or funnel shape, suggesting that the constant-variance (homoscedasticity) assumption holds and that the linear terms in BA and ERA describe their relationship with Wins adequately.
predict(lm_model3, list(BA = 0.250, ERA = 4.00), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 86.50862 84.11272 88.90452
The 95% confidence interval for mean Wins at BA = 0.250 and ERA = 4.00 has a lower limit of 84.11272 and an upper limit of 88.90452.
predict(lm_model3, list(BA = 0.250, ERA = 4.00), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 86.50862 75.18928 97.82797
The 95% prediction interval for an individual team's Wins at BA = 0.250 and ERA = 4.00 has a lower limit of 75.18928 and an upper limit of 97.82797.
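The prediction interval is wider than the confidence interval because it adds the residual variance to the uncertainty in the fitted mean. The sketch below, using only numbers copied from the output above (variable names are illustrative), backs out the standard error of the fitted mean from the confidence interval and rebuilds the prediction interval from it. Agreement is only approximate because the residual standard error is rounded to 5.392.

```r
# Values copied from the printed output above
s         <- 5.392       # residual standard error
df_res    <- 27          # residual degrees of freedom
point_est <- 86.50862    # fitted Wins at BA = 0.250, ERA = 4.00
ci_lo     <- 84.11272
ci_hi     <- 88.90452

t_crit <- qt(0.975, df_res)                 # two-sided 95% critical value
se_fit <- (ci_hi - ci_lo) / (2 * t_crit)    # standard error of the fitted mean

# A new observation adds the residual variance s^2 to the variance of the mean
se_pred <- sqrt(s^2 + se_fit^2)
pi_lo <- point_est - t_crit * se_pred
pi_hi <- point_est + t_crit * se_pred

print(round(c(lwr = pi_lo, upr = pi_hi), 2))
```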
plot(cooks.distance(fit), type="h", lwd=3, col="red", ylab="Cook's Distance", main="")
The Cook’s Distance plot suggests that teams around indices 5, 7, and 10 have a high influence on the regression model predicting Wins (W) using Batting Average (BA) and Earned Run Average (ERA). Cook’s Distance values close to 0.8 or higher indicate that these teams may disproportionately affect the model’s coefficients. This could mean that their BA and ERA values differ significantly from other teams, making them outliers in terms of performance.
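One way to make the visual reading of a Cook's Distance plot more systematic is to flag observations above a cutoff. The sketch below uses the common 4/n rule of thumb, which is an assumption here and not a threshold taken from this analysis. Because the baseball data are not reproduced in this report, it runs on simulated data with hypothetical names (toy, toy_fit), so its flagged indices say nothing about the actual teams.

```r
# Simulated stand-in data (hypothetical; NOT the baseball dataset)
set.seed(1)
n <- 30
toy <- data.frame(BA  = rnorm(n, mean = 0.255, sd = 0.010),
                  ERA = rnorm(n, mean = 4.0,   sd = 0.5))
toy$W <- 12 + 606 * toy$BA - 19 * toy$ERA + rnorm(n, sd = 5)

toy_fit <- lm(W ~ BA + ERA, data = toy)
d <- cooks.distance(toy_fit)

# Rule-of-thumb cutoff (an assumption; other conventions, such as D > 1, exist)
cutoff  <- 4 / n
flagged <- which(d > cutoff)

print(round(cutoff, 3))
print(flagged)
```

Applied to the real model, the same lines with cooks.distance(fit) and n = nrow(baseball_data3) would list the teams worth a closer look.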
lm_model4=lm(W ~ BA + ERA + I(BA^2) + I(ERA^2) + BA:ERA , data = baseball_data3)
summary(lm_model4)
##
## Call:
## lm(formula = W ~ BA + ERA + I(BA^2) + I(ERA^2) + BA:ERA, data = baseball_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.361 -3.361 1.070 3.276 7.034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 207.281 406.552 0.510 0.61481
## BA 1795.767 2986.604 0.601 0.55329
## ERA -187.011 48.579 -3.850 0.00077 ***
## I(BA^2) -7666.990 5744.980 -1.335 0.19455
## I(ERA^2) 1.996 2.981 0.670 0.50944
## BA:ERA 625.820 192.294 3.254 0.00336 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.465 on 24 degrees of freedom
## Multiple R-squared: 0.8939, Adjusted R-squared: 0.8718
## F-statistic: 40.44 on 5 and 24 DF, p-value: 6.32e-11
From the p-values in the model summary:
1. ERA (p = 0.00077) is highly significant (p < 0.05), indicating that Earned Run Average has a strong impact on Wins.
2. The BA:ERA interaction term (p = 0.00336) is also significant, suggesting that the effect of Batting Average on Wins depends on the level of Earned Run Average, and vice versa.
3. BA (p = 0.55329), BA² (p = 0.19455), and ERA² (p = 0.50944) are not individually significant, as their p-values exceed 0.05.
This model has a high R-squared value of 0.8939, i.e., 89.39% of the variability in Wins is explained by the model.
From the first-order model:
1. Residual sum of squares (SSR1) = 5.392^2 * 27 ≈ 784.99 (the ANOVA table reports 784.9)
2. Degrees of freedom (df1) = 27
3. R-squared = 0.8259
For the second-order model:
1. Residual sum of squares (SSR2) = 4.465^2 * 24 ≈ 478.47
2. Degrees of freedom (df2) = 24
3. R-squared = 0.8939
F-statistic = ((SSR1 - SSR2)/(df1 - df2)) / (SSR2/df2)
F-statistic = ((784.99 - 478.47)/3) / (478.47/24) = 102.17/19.94 ≈ 5.12
Using an F-distribution with (3,24) degrees of freedom, we compare the computed F = 5.12 to the critical value for α = 0.05 (which is approximately 3.01). Since 5.12 > 3.01, we reject the null hypothesis and conclude that the additional quadratic and interaction terms significantly improve the model.
The second-order model provides a statistically significant improvement over the first-order model in explaining Wins (W) using Batting Average (BA) and Earned Run Average (ERA). Thus, the additional terms (BA², ERA², and BA:ERA) should be included in the model for better predictive performance.
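The partial F-test can be reproduced from the two residual standard errors and their degrees of freedom. The sketch below uses only the rounded values printed in the two summaries (variable names are illustrative), so the result agrees with the hand calculation to about two decimal places.

```r
# Residual standard errors and degrees of freedom from the two summaries
s1 <- 5.392; df1 <- 27   # first-order model
s2 <- 4.465; df2 <- 24   # second-order model

ssr1 <- s1^2 * df1       # residual sum of squares, first-order model
ssr2 <- s2^2 * df2       # residual sum of squares, second-order model

# Partial F-statistic for the 3 added terms (BA^2, ERA^2, BA:ERA)
f_stat <- ((ssr1 - ssr2) / (df1 - df2)) / (ssr2 / df2)
f_crit <- qf(0.95, df1 - df2, df2)                        # critical value at alpha = 0.05
p_val  <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)  # upper-tail p-value

print(round(c(F = f_stat, critical = f_crit), 2))
print(p_val < 0.05)
```

With both fitted objects in the session, anova(lm_model3, lm_model4) performs the same nested-model comparison directly from the unrounded sums of squares.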
coeffs = coef(lm_model4)
BA_value = 0.250
ERA_value = 4.00
effect_BA = coeffs["BA"] + 2 * coeffs["I(BA^2)"] * BA_value + coeffs["BA:ERA"] * ERA_value
change_in_W = effect_BA * 0.010
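Evaluated at BA = 0.250 and ERA = 4.00, the estimated effect on Wins of a 0.010 increase in BA, holding ERA constant, is approximately 4.656. This is a first-order estimate based on the partial derivative of the fitted second-order model with respect to BA.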
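Because the model is quadratic in BA, the derivative-based figure is a linear approximation. The sketch below compares it with the exact change in fitted Wins between BA = 0.250 and BA = 0.260 at ERA = 4.00, using the rounded coefficients from the printed summary. The helper ba_part is hypothetical and keeps only the terms involving BA, since the intercept and the pure ERA terms cancel in the difference.

```r
# Rounded coefficients copied from summary(lm_model4)
b_BA  <- 1795.767
b_BA2 <- -7666.990
b_int <- 625.820

BA0 <- 0.250; ERA0 <- 4.00; h <- 0.010

# Derivative-based (first-order) estimate: dW/dBA evaluated at (BA0, ERA0), times h
slope_BA      <- b_BA + 2 * b_BA2 * BA0 + b_int * ERA0
approx_change <- slope_BA * h

# Exact change in fitted Wins; terms without BA cancel in the difference
ba_part <- function(BA, ERA) b_BA * BA + b_BA2 * BA^2 + b_int * BA * ERA
exact_change <- ba_part(BA0 + h, ERA0) - ba_part(BA0, ERA0)

print(round(c(approx = approx_change, exact = exact_change), 3))
```

The two figures differ by the curvature term b_BA2 * h^2. With the fitted object, diff(predict(lm_model4, data.frame(BA = c(0.250, 0.260), ERA = 4.00))) gives the exact change from the unrounded coefficients.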
fit2 = lm_model4
res2 = residuals(fit2)
plot(fit2$fitted.values, res2, xlab="Fitted Values", ylab="Residuals", main="")
abline(h=0, col="red")
qqnorm(res2, ylab="Residuals", main="", col="blue")
hist(res2, xlab="Residuals", main="", nclass=10, col="orange")
To assess the assumptions of the full second-order model, we examined the residual plots to check for mean-zero residuals (E[ε]=0), normality, and homoscedasticity (constant variance). The residuals vs. fitted values plot shows that residuals are scattered around zero without a strong pattern, supporting the assumption that E[ε]=0. However, some residuals appear more spread out at higher fitted values, suggesting mild heteroscedasticity, though not severe enough to invalidate the model.
The Q-Q plot indicates some deviation from normality, particularly in the lower tail, suggesting that a few observations may not follow the normal distribution assumption perfectly. The histogram of residuals also shows slight skewness but is roughly symmetric. While normality is not perfect, it is not severely violated, meaning the regression results are still reliable.
par(mfrow=c(2,1))
res3 <- stdres(lm_model4)
plot(baseball_data3$BA,res3,xlab="BA",ylab="Standardized Residuals")
abline(0,0,col="red")
plot(baseball_data3$ERA,res3,xlab="ERA",ylab="Standardized Residuals")
abline(0,0,col="red")
The residuals vs. predictor plots for BA and ERA show no clear pattern, suggesting that the second-order model captures the form of the relationship and that the homoscedasticity (constant variance) assumption is met. The residuals are randomly scattered around zero, indicating that no strong unmodeled nonlinearity remains.