library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
# Load the Teams dataset
lahman_data = read.csv("/Users/anuragreddy/Desktop/Statistics with R/Lahmans Databse .csv")
lahman_data <- lahman_data |>
  filter(yearID != 2020)
Note: The 2020 season is removed because the COVID-19 pandemic shortened it, making that year's statistics unusual.
In this markdown file we try to predict team wins using more than one independent variable.
# Keep only the numeric columns for the correlation analysis
lahman_data_numeric <- lahman_data[, sapply(lahman_data, is.numeric)]
Corr <- cor(lahman_data_numeric, method = "pearson")
correlation_df <- as.data.frame(Corr)
correlation_df['W'] |>
arrange(desc(W))
## W
## W 1.000000000
## SV 0.655434991
## R 0.557335002
## SHO 0.508193806
## attendance 0.465332171
## IPouts 0.455629564
## BB 0.443387295
## HR 0.408562765
## FP 0.348289814
## H 0.321180149
## SOA 0.308891090
## SF 0.285833450
## X2B 0.254308261
## HBP 0.177867416
## AB 0.162925673
## CG 0.145910588
## G 0.087352633
## Ghome 0.081600058
## BPF 0.067965328
## SB 0.030771912
## yearID 0.001362651
## PPF -0.081228940
## X3B -0.107928182
## SO -0.117629799
## CS -0.130814449
## DP -0.196066290
## E -0.336076694
## HRA -0.370761572
## BBA -0.465921426
## HA -0.528944178
## ER -0.654614246
## RA -0.667560533
## ERA -0.668218859
## Rank -0.882680597
## L -0.999678467
library(ggcorrplot)
# theme_clean() is a complete theme and would override an earlier theme()
# call, so apply it first and rotate the x-axis labels afterwards
ggcorrplot(Corr, type = 'lower') +
  ggtitle("Correlation Plot") +
  theme_clean() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Interpretation: The data frame above shows which independent variables are most strongly correlated with wins.
Combined with some knowledge of baseball, I selected the following variables as independent variables:
Dependent variable: W (Wins)
Independent variables: R, RA, H, AB, ERA, FP, SO, HR, BB, attendance, HA, SHO
MLR_model <- lm(W ~ R + RA + H + AB + ERA + FP + SO + HR + BB + attendance + HA + SHO,
                data = lahman_data_numeric) |>
  step(direction = 'backward', trace = 0)
summary(MLR_model)
##
## Call:
## lm(formula = W ~ R + RA + H + AB + ERA + FP + HR + attendance +
## HA + SHO, data = lahman_data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4156 -2.6275 -0.1079 2.5269 14.1262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.270e+02 7.712e+01 -1.647 0.100057
## R 7.791e-02 4.669e-03 16.688 < 2e-16 ***
## RA -2.480e-02 1.559e-02 -1.591 0.112093
## H 2.196e-02 5.071e-03 4.330 1.74e-05 ***
## AB -2.129e-02 3.816e-03 -5.579 3.62e-08 ***
## ERA -1.232e+01 2.403e+00 -5.129 3.90e-07 ***
## FP 2.881e+02 8.157e+01 3.532 0.000443 ***
## HR 3.422e-02 7.243e-03 4.725 2.85e-06 ***
## attendance 5.460e-07 2.495e-07 2.189 0.028990 *
## HA 1.036e-02 3.161e-03 3.278 0.001104 **
## SHO 1.956e-01 5.360e-02 3.650 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.811 on 619 degrees of freedom
## Multiple R-squared: 0.8993, Adjusted R-squared: 0.8977
## F-statistic: 552.7 on 10 and 619 DF, p-value: < 2.2e-16
tidy(MLR_model, conf.int = TRUE)
## # A tibble: 11 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.27e+2 77.1 -1.65 1.00e- 1 -2.78e+2 24.4
## 2 R 7.79e-2 0.00467 16.7 6.69e-52 6.87e-2 0.0871
## 3 RA -2.48e-2 0.0156 -1.59 1.12e- 1 -5.54e-2 0.00581
## 4 H 2.20e-2 0.00507 4.33 1.74e- 5 1.20e-2 0.0319
## 5 AB -2.13e-2 0.00382 -5.58 3.62e- 8 -2.88e-2 -0.0138
## 6 ERA -1.23e+1 2.40 -5.13 3.90e- 7 -1.70e+1 -7.61
## 7 FP 2.88e+2 81.6 3.53 4.43e- 4 1.28e+2 448.
## 8 HR 3.42e-2 0.00724 4.73 2.85e- 6 2.00e-2 0.0484
## 9 attendance 5.46e-7 0.000000249 2.19 2.90e- 2 5.61e-8 0.00000104
## 10 HA 1.04e-2 0.00316 3.28 1.10e- 3 4.15e-3 0.0166
## 11 SHO 1.96e-1 0.0536 3.65 2.84e- 4 9.04e-2 0.301
Backward stepwise selection dropped SO and BB from the initial twelve predictors, leaving the ten shown in the call above.
The intercept (-127) is not statistically significant (p = 0.100), suggesting that when all predictor variables are zero, the expected value of the response variable is not significantly different from zero (a setting with no practical meaning here, since no team records zero runs or at-bats).
The coefficients for R (Runs), H (Hits), AB (At Bats), ERA (Earned Run Average), FP (Fielding Percentage), HR (Home Runs), attendance, HA (Hits Allowed), and SHO (Shutouts) are all statistically significant (p < 0.05); RA (Runs Allowed) is retained by the stepwise procedure but is not individually significant (p = 0.112). For instance, a one-unit increase in R is associated with a 0.0779-unit increase in wins, controlling for the other predictors.
The multiple R-squared value of 0.8993 indicates that approximately 89.93% of the variability in wins is explained by the predictors.
The adjusted R-squared value of 0.8977 adjusts the multiple R-squared for the number of predictors in the model.
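To see how the fitted coefficients translate into predictions, observed and fitted wins can be compared directly; a minimal sketch using augment() from the already-loaded broom package:
# Observed vs. fitted wins for the first few team-seasons (illustrative only)
augment(MLR_model) |>
  select(W, .fitted) |>
  head()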
Col_Check <- lahman_data_numeric |>
  select(R, RA, H, AB, ERA, FP, HR, attendance, HA, SHO) |>
  cor() |>
  round(2)
Col_Check <- as.data.frame(Col_Check)
Col_Check
## R RA H AB ERA FP HR attendance HA SHO
## R 1.00 0.15 0.73 0.56 0.14 0.03 0.66 0.32 0.13 -0.08
## RA 0.15 1.00 0.22 0.26 0.99 -0.42 0.08 -0.29 0.82 -0.69
## H 0.73 0.22 1.00 0.84 0.21 -0.03 0.18 0.34 0.37 -0.15
## AB 0.56 0.26 0.84 1.00 0.23 -0.01 0.17 0.24 0.39 -0.19
## ERA 0.14 0.99 0.21 0.23 1.00 -0.34 0.08 -0.28 0.81 -0.68
## FP 0.03 -0.42 -0.03 -0.01 -0.34 1.00 0.07 0.17 -0.36 0.31
## HR 0.66 0.08 0.18 0.17 0.08 0.07 1.00 0.15 -0.09 -0.06
## attendance 0.32 -0.29 0.34 0.24 -0.28 0.17 0.15 1.00 -0.15 0.21
## HA 0.13 0.82 0.37 0.39 0.81 -0.36 -0.09 -0.15 1.00 -0.60
## SHO -0.08 -0.69 -0.15 -0.19 -0.68 0.31 -0.06 0.21 -0.60 1.00
Interpretation: After checking the collinearity among the independent variables, the following explanatory variables have been selected (note that RA and ERA remain highly correlated, r = 0.99; the VIF sketch below quantifies this):
Variables selected: R, AB, RA, FP, attendance, ERA
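Pairwise correlations only catch two-variable overlap, so variance inflation factors (VIFs) are a useful complementary check, and they would flag the strong RA-ERA relationship visible above. A minimal sketch, assuming the car package is installed (it is not loaded above):
# VIFs for the candidate reduced model; values above roughly 5-10 are
# commonly read as problematic collinearity (car is an assumed dependency)
library(car)
vif(lm(W ~ R + RA + FP + attendance + ERA + AB, data = lahman_data_numeric))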
MLR_model <- lm(W ~ R + RA + FP + attendance + ERA + AB, data = lahman_data_numeric) |>
  step(direction = 'backward', trace = 0)
summary(MLR_model)
##
## Call:
## lm(formula = W ~ R + RA + FP + attendance + ERA + AB, data = lahman_data_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8339 -2.8812 0.0053 2.5835 13.0235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.764e+01 7.916e+01 -1.233 0.217874
## R 9.724e-02 2.379e-03 40.881 < 2e-16 ***
## RA -3.871e-02 1.503e-02 -2.576 0.010223 *
## FP 2.221e+02 8.149e+01 2.725 0.006613 **
## attendance 8.121e-07 2.528e-07 3.213 0.001382 **
## ERA -9.478e+00 2.357e+00 -4.021 6.5e-05 ***
## AB -8.102e-03 2.418e-03 -3.351 0.000854 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.945 on 623 degrees of freedom
## Multiple R-squared: 0.8914, Adjusted R-squared: 0.8903
## F-statistic: 852.1 on 6 and 623 DF, p-value: < 2.2e-16
tidy(MLR_model, conf.int = TRUE)
## # A tibble: 7 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -97.6 79.2 -1.23 2.18e- 1 -2.53e+2 5.78e+1
## 2 R 0.0972 0.00238 40.9 1.65e-178 9.26e-2 1.02e-1
## 3 RA -0.0387 0.0150 -2.58 1.02e- 2 -6.82e-2 -9.20e-3
## 4 FP 222. 81.5 2.72 6.61e- 3 6.20e+1 3.82e+2
## 5 attendance 0.000000812 0.000000253 3.21 1.38e- 3 3.16e-7 1.31e-6
## 6 ERA -9.48 2.36 -4.02 6.50e- 5 -1.41e+1 -4.85e+0
## 7 AB -0.00810 0.00242 -3.35 8.54e- 4 -1.28e-2 -3.35e-3
Backward stepwise selection retained all six predictors this time.
The intercept (-97.6) is not statistically significant (p = 0.218), suggesting that when all predictor variables are zero, the expected value of the response variable is not significantly different from zero.
The coefficients for R (Runs), RA (Runs Allowed), AB (At Bats), ERA (Earned Run Average), FP (Fielding Percentage), and attendance are all statistically significant (p < 0.05). For instance, a one-unit increase in R is associated with a 0.097-unit increase in wins, controlling for the other predictors.
The multiple R-squared value of 0.8914 indicates that approximately 89.14% of the variability in wins is explained by the predictors.
The adjusted R-squared value of 0.8903 adjusts the multiple R-squared for the number of predictors in the model.
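The boot package loaded at the top is not otherwise used in this section; as a rough out-of-sample check of the reduced model, one could estimate its cross-validated prediction error. A minimal sketch, assuming complete cases on the selected columns:
# 10-fold cross-validation via boot::cv.glm(), which needs a glm fit
# and a data frame free of missing values in the model columns
cv_data <- na.omit(lahman_data_numeric[, c("W", "R", "RA", "FP", "attendance", "ERA", "AB")])
glm_fit <- glm(W ~ ., data = cv_data)
set.seed(42)  # illustrative seed so the folds are reproducible
cv.glm(cv_data, glm_fit, K = 10)$delta  # raw and bias-adjusted CV estimates of MSE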
gg_diagnose(MLR_model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
gg_resfitted(MLR_model) +
  geom_smooth(se = FALSE) +
  theme_economist()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The points in the plot exhibit a random scatter around the horizontal line at y = 0, which suggests that the linearity assumption is not violated; there is no discernible pattern in the residuals as the fitted values change.
The residuals should also be independent of one another, i.e., there should be no pattern or correlation between consecutive residuals.
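That independence assumption can also be probed formally; a minimal sketch using the Durbin-Watson test, assuming the lmtest package is installed (it is not loaded above):
# Durbin-Watson test for serial correlation among residuals;
# a statistic near 2 is consistent with independence (lmtest is an assumed dependency)
library(lmtest)
dwtest(MLR_model)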
gg_qqplot(MLR_model) +
  theme_economist()
The normal Q-Q plot shows the points closely following the diagonal line, indicating that the residuals are approximately normally distributed.
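A formal test can complement the visual check; a minimal sketch using base R's Shapiro-Wilk test on the residuals:
# Shapiro-Wilk normality test; a small p-value would signal departure from normality
shapiro.test(residuals(MLR_model))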
gg_resX(MLR_model)
Interpretation: These plots help assess the linearity assumption between each predictor variable and the response. The points exhibit a random scatter around the horizontal line at y = 0, suggesting that the linearity assumption is reasonable.
gg_reshist(MLR_model) +
  theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interpretation: The histogram of residuals suggests that they are approximately normally distributed.
gg_resleverage(MLR_model) +
  theme_economist()
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: This plot helps identify influential points in the dataset: look for points that combine high leverage with a large residual.
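To list such points numerically rather than read them off the plot, base R's hatvalues() can be used; a minimal sketch with the common cutoff of twice the average leverage:
# Flag observations whose leverage exceeds twice the mean hat value
h <- hatvalues(MLR_model)
which(h > 2 * mean(h))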
gg_cooksd(MLR_model, threshold = 'matlab') +
theme_economist()
Interpretation: A Cook’s Distance plot is a diagnostic tool used in regression analysis to identify influential observations that may disproportionately affect the estimated regression coefficients.
Higher values of Cook’s Distance indicate greater influence of the corresponding data point on the fitted model.
Observations whose Cook’s Distance exceeds the threshold are highlighted in the plot; removing any of these points could lead to substantial changes in the estimated coefficients.
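To extract the flagged observations programmatically, base R's cooks.distance() can be paired with a cutoff; a minimal sketch using the common 4/n rule of thumb (lindia's 'matlab' threshold may differ):
# Row indices of observations whose Cook's distance exceeds 4/n
cd <- cooks.distance(MLR_model)
which(cd > 4 / length(cd))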