
week-11

Contents:

Multiple Linear Regression Model
Model Diagnosis
Multicollinearity check

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
library(ggthemes)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:boot':
## 
##     logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

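# Path to the postseason pitching data (local file)
Data_set <- "/Users/ba/Documents/IUPUI/Masters/First Sem/Statistics/Dataset/PitchingPost.csv"
Pitching_Data <- read.csv(Data_set)
# Keep only rows with finite ERA and opponent batting average (BAOpp)
Regression_Data <-
  Pitching_Data |>
  filter(is.finite(ERA),
         is.finite(BAOpp))

# Regress wins (W) on earned runs (ER) and strikeouts (SO)
model <- lm(W ~ ER + SO, data = Regression_Data)
model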

## 
## Call:
## lm(formula = W ~ ER + SO, data = Regression_Data)
## 
## Coefficients:
## (Intercept)           ER           SO  
##     0.04328     -0.03074      0.06274

Interpretation: Holding SO constant, each additional earned run (ER) is associated with an estimated decrease of about 0.031 in wins (W); holding ER constant, each additional strikeout (SO) is associated with an estimated increase of about 0.063 in wins. The intercept (0.043) is the estimated value of W when both ER and SO are zero. In short, earned runs have a negative association with wins and strikeouts a positive one, each after adjusting for the other predictor.
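The point estimates alone do not show how precisely each slope is estimated. The following is a minimal sketch of how standard errors and confidence intervals could be read alongside the coefficients. It uses simulated data, not the pitching data: the names `w`, `er`, `so` and the numbers used to generate them are made up for illustration.

```r
# Simulated stand-in for the pitching data (all values hypothetical)
set.seed(1)
n  <- 200
er <- rpois(n, 3)
so <- rpois(n, 5)
w  <- 0.05 - 0.03 * er + 0.06 * so + rnorm(n, sd = 0.5)
sim <- data.frame(w, er, so)

fit <- lm(w ~ er + so, data = sim)

# Estimates, standard errors, t statistics and p-values
coef_table <- summary(fit)$coefficients
coef_table

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)
ci
```

The same two calls, `summary(model)$coefficients` and `confint(model)`, should apply to the fitted model above.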

Model Diagnosis
Diagnostic Plots

gg_diagnose(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Residual vs Fitted

gg_resfitted(model) +
  geom_smooth(se=FALSE)+
  theme_classic()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Interpretation:

The points show a discernible pattern around the horizontal line at y = 0 rather than a random scatter, which suggests that the linearity assumption is violated: the residuals change systematically as the fitted values change.
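To make "pattern in the residuals" concrete, here is a sketch on simulated data (hypothetical, with a deliberately curved true relationship) that compares a misspecified straight-line fit with one that includes the curvature. It uses only base R; the nested-model F-test is one possible follow-up check, not something the analysis above performed.

```r
set.seed(2)
x <- runif(300, 0, 10)
y <- 1 + 0.5 * x^2 + rnorm(300)    # true relationship is curved

fit_linear <- lm(y ~ x)             # misspecified straight-line fit
fit_quad   <- lm(y ~ x + I(x^2))    # adds the missing curvature

# Lowess smoother through residuals vs fitted values: for a well-specified
# model it stays near 0; for the straight-line fit it bends away from 0
smooth_linear <- lowess(fitted(fit_linear), resid(fit_linear))
smooth_quad   <- lowess(fitted(fit_quad), resid(fit_quad))
range(smooth_linear$y)
range(smooth_quad$y)

# F-test comparing the two nested models
curvature_test <- anova(fit_linear, fit_quad)
curvature_test
```

A smoother that wanders far from zero, as in the straight-line fit here, is the numeric counterpart of the curve drawn by `geom_smooth()` in the plot.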

Normality of residuals

gg_qqplot(model)

Interpretation: The points in the normal Q-Q plot do not follow the diagonal line, indicating that the residuals are not normally distributed.
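A Q-Q plot is a visual check, and a formal test can complement it. The sketch below uses simulated data with right-skewed errors (hypothetical, not the pitching data) and base R's `shapiro.test()`. Note that `shapiro.test()` accepts at most 5000 observations, so a larger data set would need a random subsample; applying it that way here is an assumption, not part of the original analysis.

```r
set.seed(3)
x <- rnorm(500)
y_skewed <- 2 + x + rexp(500)      # errors are right-skewed, not normal
fit <- lm(y_skewed ~ x)
res <- resid(fit)

# Theoretical vs sample quantiles: the numbers behind a Q-Q plot
qq <- qqnorm(res, plot.it = FALSE)
head(data.frame(theoretical = qq$x, sample = qq$y))

# Shapiro-Wilk test; a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```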

Residual vs X-values

gg_resX(model)

Interpretation: These plots help assess whether each predictor variable (X) is linearly related to the response variable. The points show a pattern around the horizontal line at y = 0 rather than a random scatter, which suggests that the linearity assumption is violated.

Residual Histogram

gg_reshist(model)+
  theme_classic()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: The histogram indicates that the residuals are not normally distributed.

Influential Data Points

gg_resleverage(model)+
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

Cook’s Distance

gg_cooksd(model, threshold = 'matlab') +  theme_classic()

Interpretation:

A Cook’s Distance plot is a diagnostic tool used in regression analysis to identify influential data points that may disproportionately affect the estimation of regression coefficients.

Higher values of Cook’s Distance indicate that the corresponding data point has greater influence on the fitted model: removing that observation from the dataset could lead to substantial changes in the estimated regression coefficients.

Observations whose Cook’s Distance exceeds the threshold are highlighted in the plot. These points are considered potentially influential.
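The idea can be sketched with base R's `cooks.distance()` on simulated data (hypothetical) containing one planted outlier. The cutoff of three times the mean Cook's distance is an assumption here: it is my understanding of what the `'matlab'` threshold option corresponds to, and should be checked against the lindia documentation.

```r
set.seed(4)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
# Plant one unusual point: far out in x and far off the true line
x[100] <- 6
y[100] <- -10
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
threshold <- 3 * mean(cd)          # assumed rule, see the note above
flagged <- which(cd > threshold)
flagged

# How much do the coefficients move when the most influential point is dropped?
worst <- which.max(cd)
fit_drop <- lm(y ~ x, subset = -worst)
rbind(all_data = coef(fit), without_worst = coef(fit_drop))
```

Comparing the two rows of coefficients shows directly what "influence" means: the planted point pulls the fitted slope away from the value the remaining data support.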

Multicollinearity between independent variables

vif(model)

##       ER       SO 
## 1.126932 1.126932

Interpretation: The VIF values for both predictors ‘ER’ and ‘SO’ are approximately 1.13, indicating that there is negligible multicollinearity between them. This suggests that each variable is providing unique information to the model without much redundancy due to correlation with the other variable.
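The VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With exactly two predictors this depends only on their correlation, which is why ER and SO share the same value; a VIF of about 1.127 corresponds to an R² of about 0.11, or a correlation of roughly 0.34 in absolute value. A minimal sketch of the calculation on simulated data (the variables `x1` and `x2` are hypothetical):

```r
set.seed(5)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)          # mildly correlated with x1

# VIF for x1: regress it on the other predictor and use that R-squared
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# With two predictors the same value follows from their correlation
vif_from_cor <- 1 / (1 - cor(x1, x2)^2)
c(vif_x1 = vif_x1, vif_from_cor = vif_from_cor)
```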