## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)


# Load the Teams dataset
lahman_data = read.csv("/Users/anuragreddy/Desktop/Statistics with R/Lahmans Databse .csv")
lahman_data <- lahman_data|>
  filter(yearID!= 2020)

Note: Year 2020 is removed due to unusaul format because of Covid pandemic.

1. Multiple Linear Regression Model

In this markdown file we are trying to predict the wins by choosing more than one independent variables.

lahman_data_numeric <- lahman_data[ ,sapply(lahman_data, is.numeric)]
Corr = cor(lahman_data_numeric,method = "pearson")

correlation_df <- as.data.frame(Corr)


correlation_df['W'] |>
  arrange(desc(W))

##                       W
## W           1.000000000
## SV          0.655434991
## R           0.557335002
## SHO         0.508193806
## attendance  0.465332171
## IPouts      0.455629564
## BB          0.443387295
## HR          0.408562765
## FP          0.348289814
## H           0.321180149
## SOA         0.308891090
## SF          0.285833450
## X2B         0.254308261
## HBP         0.177867416
## AB          0.162925673
## CG          0.145910588
## G           0.087352633
## Ghome       0.081600058
## BPF         0.067965328
## SB          0.030771912
## yearID      0.001362651
## PPF        -0.081228940
## X3B        -0.107928182
## SO         -0.117629799
## CS         -0.130814449
## DP         -0.196066290
## E          -0.336076694
## HRA        -0.370761572
## BBA        -0.465921426
## HA         -0.528944178
## ER         -0.654614246
## RA         -0.667560533
## ERA        -0.668218859
## Rank       -0.882680597
## L          -0.999678467

library(ggcorrplot)
ggcorrplot(Corr, type='lower') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Correlation Plot") +
  theme_clean()

Interpretation: The above data frame shows us which independent variables are closely correlated with wins.

With some knowledge of baseball, I have selected the following variables as independent variables:

Variables Selected: R, RA, H, AB, ERA, FP, SOA, HR, BB, attendance, HA, SHO

Linear regression using Backward elimination method

Dependent Variable <- Wins

Independent Variables <- R, RA, H, AB, ERA, FP, SO, HR, BB, attendance, HA, SHO

MLR_model <- lm(W~R+RA+H+AB+ERA+FP+SO+HR+BB+attendance+HA+SHO,data=lahman_data_numeric) |>
  step(direction = 'backward', trace =0)

summary(MLR_model)

## 
## Call:
## lm(formula = W ~ R + RA + H + AB + ERA + FP + HR + attendance + 
##     HA + SHO, data = lahman_data_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4156  -2.6275  -0.1079   2.5269  14.1262 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.270e+02  7.712e+01  -1.647 0.100057    
## R            7.791e-02  4.669e-03  16.688  < 2e-16 ***
## RA          -2.480e-02  1.559e-02  -1.591 0.112093    
## H            2.196e-02  5.071e-03   4.330 1.74e-05 ***
## AB          -2.129e-02  3.816e-03  -5.579 3.62e-08 ***
## ERA         -1.232e+01  2.403e+00  -5.129 3.90e-07 ***
## FP           2.881e+02  8.157e+01   3.532 0.000443 ***
## HR           3.422e-02  7.243e-03   4.725 2.85e-06 ***
## attendance   5.460e-07  2.495e-07   2.189 0.028990 *  
## HA           1.036e-02  3.161e-03   3.278 0.001104 ** 
## SHO          1.956e-01  5.360e-02   3.650 0.000284 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.811 on 619 degrees of freedom
## Multiple R-squared:  0.8993, Adjusted R-squared:  0.8977 
## F-statistic: 552.7 on 10 and 619 DF,  p-value: < 2.2e-16

tidy(MLR_model,conf.int = TRUE)

## # A tibble: 11 × 7
##    term        estimate    std.error statistic  p.value conf.low    conf.high
##    <chr>          <dbl>        <dbl>     <dbl>    <dbl>    <dbl>        <dbl>
##  1 (Intercept) -1.27e+2 77.1             -1.65 1.00e- 1 -2.78e+2  24.4       
##  2 R            7.79e-2  0.00467         16.7  6.69e-52  6.87e-2   0.0871    
##  3 RA          -2.48e-2  0.0156          -1.59 1.12e- 1 -5.54e-2   0.00581   
##  4 H            2.20e-2  0.00507          4.33 1.74e- 5  1.20e-2   0.0319    
##  5 AB          -2.13e-2  0.00382         -5.58 3.62e- 8 -2.88e-2  -0.0138    
##  6 ERA         -1.23e+1  2.40            -5.13 3.90e- 7 -1.70e+1  -7.61      
##  7 FP           2.88e+2 81.6              3.53 4.43e- 4  1.28e+2 448.        
##  8 HR           3.42e-2  0.00724          4.73 2.85e- 6  2.00e-2   0.0484    
##  9 attendance   5.46e-7  0.000000249      2.19 2.90e- 2  5.61e-8   0.00000104
## 10 HA           1.04e-2  0.00316          3.28 1.10e- 3  4.15e-3   0.0166    
## 11 SHO          1.96e-1  0.0536           3.65 2.84e- 4  9.04e-2   0.301

Interpretation:

The intercept (-127) is not statistically significant (p = 0.100), suggesting that when all predictor variables are zero, the expected value of the response variable is not significantly different from zero.
The coefficients for predictors R (Runs), RA (Runs Allowed), H (Hits), AB (At Bats), ERA (Earned Run Average), FP (Fielding Percentage), HR (Home Runs), attendance, HA (Hits Allowed), and SHO (Shutouts) are all statistically significant (p < 0.05). For instance, a one-unit increase in R is associated with a 0.0779 unit increase in the response variable, controlling for other predictors.
The multiple R-squared value of 0.8993 indicates that approximately 89.93% of the variability in the response variable is explained by the predictors.
The adjusted R-squared value of 0.8977 adjusts the multiple R-squared for the number of predictors in the model.

2 Model Diagnosis and Evaluation:

Co linearity Check:

Col_Check <- lahman_data_numeric |>
  select(R, RA, H, AB, ERA, FP, HR, attendance, HA, SHO) |>
  cor()|>
  round(2)

Col_Check <- as.data.frame(Col_Check)
Col_Check

##                R    RA     H    AB   ERA    FP    HR attendance    HA   SHO
## R           1.00  0.15  0.73  0.56  0.14  0.03  0.66       0.32  0.13 -0.08
## RA          0.15  1.00  0.22  0.26  0.99 -0.42  0.08      -0.29  0.82 -0.69
## H           0.73  0.22  1.00  0.84  0.21 -0.03  0.18       0.34  0.37 -0.15
## AB          0.56  0.26  0.84  1.00  0.23 -0.01  0.17       0.24  0.39 -0.19
## ERA         0.14  0.99  0.21  0.23  1.00 -0.34  0.08      -0.28  0.81 -0.68
## FP          0.03 -0.42 -0.03 -0.01 -0.34  1.00  0.07       0.17 -0.36  0.31
## HR          0.66  0.08  0.18  0.17  0.08  0.07  1.00       0.15 -0.09 -0.06
## attendance  0.32 -0.29  0.34  0.24 -0.28  0.17  0.15       1.00 -0.15  0.21
## HA          0.13  0.82  0.37  0.39  0.81 -0.36 -0.09      -0.15  1.00 -0.60
## SHO        -0.08 -0.69 -0.15 -0.19 -0.68  0.31 -0.06       0.21 -0.60  1.00

Interpretation: After finding out the co-liniearity between the independent variables the following variables have been selected as explanatory variables:

Variables Selected: R,AB, RA, FP, attendance, ERA

MLR_model <- lm(W~R+RA+FP+attendance+ERA+AB,data=lahman_data_numeric) |>
  step(direction = 'backward', trace =0)

summary(MLR_model)

## 
## Call:
## lm(formula = W ~ R + RA + FP + attendance + ERA + AB, data = lahman_data_numeric)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8339  -2.8812   0.0053   2.5835  13.0235 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.764e+01  7.916e+01  -1.233 0.217874    
## R            9.724e-02  2.379e-03  40.881  < 2e-16 ***
## RA          -3.871e-02  1.503e-02  -2.576 0.010223 *  
## FP           2.221e+02  8.149e+01   2.725 0.006613 ** 
## attendance   8.121e-07  2.528e-07   3.213 0.001382 ** 
## ERA         -9.478e+00  2.357e+00  -4.021  6.5e-05 ***
## AB          -8.102e-03  2.418e-03  -3.351 0.000854 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.945 on 623 degrees of freedom
## Multiple R-squared:  0.8914, Adjusted R-squared:  0.8903 
## F-statistic: 852.1 on 6 and 623 DF,  p-value: < 2.2e-16

tidy(MLR_model,conf.int = TRUE)

## # A tibble: 7 × 7
##   term             estimate    std.error statistic   p.value conf.low  conf.high
##   <chr>               <dbl>        <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept) -97.6         79.2             -1.23 2.18e-  1 -2.53e+2    5.78e+1
## 2 R             0.0972       0.00238         40.9  1.65e-178  9.26e-2    1.02e-1
## 3 RA           -0.0387       0.0150          -2.58 1.02e-  2 -6.82e-2   -9.20e-3
## 4 FP          222.          81.5              2.72 6.61e-  3  6.20e+1    3.82e+2
## 5 attendance    0.000000812  0.000000253      3.21 1.38e-  3  3.16e-7    1.31e-6
## 6 ERA          -9.48         2.36            -4.02 6.50e-  5 -1.41e+1   -4.85e+0
## 7 AB           -0.00810      0.00242         -3.35 8.54e-  4 -1.28e-2   -3.35e-3

Interpretation:

The intercept (-97.6) is not statistically significant (p = 0.200), suggesting that when all predictor variables are zero, the expected value of the response variable is not significantly different from zero.
The coefficients for predictors R (Runs), RA (Runs Allowed), AB (At Bats), ERA (Earned Run Average), FP (Fielding Percentage) and attendance are all statistically significant (p < 0.05). For instance, a one-unit increase in R is associated with a 0.097 unit increase in the response variable, controlling for other predictors.
The multiple R-squared value of 0.8914 indicates that approximately 89.14% of the variability in the response variable is explained by the predictors.
The adjusted R-squared value of 0.8903 adjusts the multiple R-squared for the number of predictors in the model.

Diagnostic Plots

gg_diagnose(MLR_model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Residuals vs. Fitted Values

gg_resfitted(MLR_model) +
  geom_smooth(se=FALSE)+
  theme_economist()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Interpretation:

The points in the plot exhibit a random scatter around the horizontal line at y = 0, it suggests that the assumption of linearity is not violated. In other words, there is no discernible pattern in the residuals as the fitted values change.

The residuals should be independent of each other. In other words, there is no pattern or correlation between consecutive residuals

Normality of residuals:

gg_qqplot(MLR_model)+
  theme_economist()

Interpretation:

Normal Q-Q plot shows points closely following the diagonal line, indicating that the data (residuals) is approximately normally distributed

Residuals vs. X Values:

gg_resX(MLR_model)

Interpretation: The plot helps to assess the assumption of linearity between the predictor variable (X) and the response variable. The points in the above plot exhibiting a random scatter around the horizontal line at y = 0, it suggests that the assumption of linearity is reasonable.

Residual Histogram:

gg_reshist(MLR_model)+
  theme_economist()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: The above histogram of residuals interprets that the data is almost normally distributed.

Influential Data Points:

gg_resleverage(MLR_model)+
  theme_economist()

## `geom_smooth()` using formula = 'y ~ x'

Interpretation: This plot helps us in identifying the influential points in the dataset. Look for the points which have high leverage and high residual.

  gg_cooksd(MLR_model, threshold = 'matlab') +
  theme_economist()

Interpretation: A Cook’s Distance plot is a diagnostic tool used in regression analysis to identify influential data points that may disproportionately affect the estimation of regression coefficients.

Higher values of Cook’s Distance indicate greater influence of the corresponding data point on the regression model.

Observations with Cook’s Distance values exceeding the threshold are highlighted in the plot. These points are considered potentially influential and may significantly impact the estimated regression coefficients if removed from the analysis.

High Cook’s Distance values indicate that removing the corresponding observation from the dataset could lead to substantial changes in the estimated regression coefficients.

Data Dive week 9

Anurag

2024-03-14

Contents:

1. Multiple Linear Regression Model - using backward for selecting features.

2. Model Diagnosis and Evaluation

- Diagnostic Plots

- Influential data points

1. Multiple Linear Regression Model

Linear regression using Backward elimination method

Interpretation:

2 Model Diagnosis and Evaluation:

Interpretation:

Diagnostic Plots

Residuals vs. Fitted Values

Interpretation:

Normality of residuals:

Interpretation:

Residuals vs. X Values:

Residual Histogram:

Influential Data Points: