week-11

Contents:

  1. Multiple Linear Regression Model
  2. Model Diagnosis
  3. Multicollinearity check
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(ggthemes)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
Data_set <- "/Users/ba/Documents/IUPUI/Masters/First Sem/Statistics/Dataset/PitchingPost.csv"
Pitching_Data <- read.csv(Data_set)
Regression_Data <-
  Pitching_Data |>
  filter(is.finite(ERA),
         is.finite(BAOpp))
model <- lm(W ~ ER + SO, data = Regression_Data)
model
## 
## Call:
## lm(formula = W ~ ER + SO, data = Regression_Data)
## 
## Coefficients:
## (Intercept)           ER           SO  
##     0.04328     -0.03074      0.06274

Interpretation: Holding SO constant, each one-unit increase in ER (earned runs) is associated with an estimated decrease of about 0.03074 in the response W (wins). Holding ER constant, each one-unit increase in SO (strikeouts) is associated with an estimated increase of about 0.06274 in W. The intercept, 0.04328, is the estimated value of W when both ER and SO are zero. In this model, earned runs therefore have a negative association with wins, while strikeouts have a positive association with wins.
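As a worked example of this reading, the sketch below plugs the printed coefficients into the fitted equation. The helper `predict_wins` and the input values (2 earned runs, 5 strikeouts) are hypothetical and chosen only for illustration; they are not taken from the dataset.

```r
# Coefficients copied from the lm() output above.
b0   <- 0.04328   # intercept
b_er <- -0.03074  # slope for ER
b_so <- 0.06274   # slope for SO

# Hypothetical helper: fitted value of W for given ER and SO.
predict_wins <- function(er, so) b0 + b_er * er + b_so * so

base    <- predict_wins(er = 2, so = 5)
more_er <- predict_wins(er = 3, so = 5)  # one more earned run, SO held fixed
more_so <- predict_wins(er = 2, so = 6)  # one more strikeout, ER held fixed

more_er - base  # equals the ER coefficient
more_so - base  # equals the SO coefficient
```

Each difference reproduces the corresponding slope, which is what "holding the other predictor constant" means in practice.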

gg_diagnose(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

gg_resfitted(model) +
  geom_smooth(se=FALSE)+
  theme_classic()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Interpretation:

The residuals show a discernible pattern around the horizontal line at y = 0 as the fitted values change, rather than a random scatter. This suggests that the linearity assumption is violated.
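To see why a pattern in this plot points to nonlinearity, here is a minimal base R sketch on made-up data (not the pitching data): the response is quadratic in `x`, but a straight line is fitted. The variable names and values are assumptions for illustration only.

```r
# Hypothetical data: y depends on x quadratically, but we fit a straight line.
x <- 1:10
y <- x^2
fit <- lm(y ~ x)

res <- resid(fit)
data.frame(fitted = fitted(fit), residual = res)

# Residuals are positive at both ends and negative in the middle:
# a U-shaped pattern rather than a random scatter around 0.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

A correctly specified linear model would leave residuals with no systematic trend against the fitted values.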

gg_qqplot(model)

Interpretation: The points in the normal Q-Q plot depart from the diagonal reference line, indicating that the residuals are not normally distributed.
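For readers unfamiliar with how a normal Q-Q plot is built, the following base R sketch constructs one by hand: ordered sample values are paired with the quantiles expected under a normal distribution. The `res` vector is a hypothetical, deliberately skewed set of values, not the residuals of the model above, and this is not necessarily how `gg_qqplot()` computes its coordinates internally.

```r
# Hypothetical, deliberately right-skewed "residuals" (fixed values, no RNG).
res <- c(-1.2, -0.9, -0.8, -0.5, -0.4, -0.1, 0.2, 0.6, 1.9, 4.5)
n <- length(res)

theoretical <- qnorm(ppoints(n))  # expected normal quantiles
observed    <- sort(res)          # ordered sample values

# qqnorm() computes the same coordinates; plot.it = FALSE returns them only.
qq <- qqnorm(res, plot.it = FALSE)

plot(theoretical, observed,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(res)  # reference line through the first and third quartiles
```

If the values were normal, the points would fall close to the reference line; the large upper values here pull the right tail away from it.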

gg_resX(model)

Interpretation: These plots show the residuals against each predictor (ER and SO) and help assess whether the relationship between each predictor and the response is linear. The points show a pattern around the horizontal line at y = 0 rather than a random scatter, which suggests that the linearity assumption is violated.

gg_reshist(model)+
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: The histogram of residuals above indicates that the residuals are not normally distributed.

gg_resleverage(model)+
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

gg_cooksd(model, threshold = 'matlab') +  theme_classic()

Interpretation:

A Cook’s Distance plot is a diagnostic tool used in regression analysis to identify influential data points that may disproportionately affect the estimation of regression coefficients.

Higher values of Cook’s Distance indicate greater influence of the corresponding observation on the fitted model.

Observations whose Cook’s Distance exceeds the threshold are highlighted in the plot. These observations are potentially influential: removing one of them from the dataset could substantially change the estimated regression coefficients.
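The effect of an influential observation can be sketched with base R's `cooks.distance()` on made-up data. The data, the injected outlier, and the cutoff of three times the mean Cook's distance are assumptions for illustration; the cutoff is meant to mirror what the `'matlab'` threshold option is understood to use, which should be checked against the lindia documentation.

```r
# Hypothetical data: a near-linear trend with one distorted observation.
x <- 1:20
y <- 2 * x + rep(c(0.5, -0.5), times = 10)
y[20] <- y[20] + 30   # inject an unusual point at the edge of the x range

fit <- lm(y ~ x)
d <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance.
cutoff <- 3 * mean(d)
flagged <- which(d > cutoff)
flagged

# Compare coefficients with and without the flagged observations.
coef(fit)
refit <- lm(y ~ x, subset = -flagged)
coef(refit)
```

Comparing the two sets of coefficients shows how much a single flagged observation can move the fitted line.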

vif(model)
##       ER       SO 
## 1.126932 1.126932

Interpretation: The VIF values for both predictors, ER and SO, are approximately 1.13, close to the minimum possible value of 1, indicating negligible multicollinearity between them. Each variable therefore contributes largely unique information to the model, with little redundancy from correlation with the other variable.
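With only two predictors, the VIF reduces to 1 / (1 - r^2), where r is the correlation between them, which is why both values above are identical. The sketch below backs out the implied correlation from the reported VIF and then checks the formula on hypothetical toy predictors (`x1`, `x2`) using an auxiliary regression; the toy values are assumptions, not data from the report.

```r
# VIF reported by vif(model) above (identical for ER and SO).
vif_reported <- 1.126932

# With two predictors, VIF = 1 / (1 - r^2), so r^2 can be recovered:
r_squared <- 1 - 1 / vif_reported
abs_r <- sqrt(r_squared)  # |correlation| between ER and SO, roughly 0.34

# Hypothetical toy predictors to check the formula against an auxiliary regression.
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
x2 <- c(2, 1, 4, 3, 6, 5, 8, 7)
aux_r2 <- summary(lm(x1 ~ x2))$r.squared  # R^2 from regressing x1 on x2
vif_manual <- 1 / (1 - aux_r2)
vif_manual
```

A modest correlation between ER and SO is consistent with the low VIF: the predictors overlap somewhat, but not enough to inflate the coefficient variances noticeably.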