Loading Required R Packages

library(tidyverse)
library(broom)
library(MASS)
library(dgof)
library(car)
library(lmtest)
library(caret)
library(mctest)

Input Dataset

Download the dataset here.

mydata = read.table("path_to_file", header = TRUE, sep = " ")
head(mydata)

##    x  y
## 1  3  5
## 2  7 11
## 3 11 21
## 4 15 16
## 5 18 16
## 6 27 28

Building a Regression Model

# building a regression model
model <- lm(y ~ x, data = mydata)
summary(model)

## 
## Call:
## lm(formula = y ~ x, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.939 -1.783 -0.228  1.506  8.157 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.82963    1.76845   2.166   0.0382 *  
## x            0.90364    0.05012  18.030   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.23 on 31 degrees of freedom
## Multiple R-squared:  0.9129, Adjusted R-squared:  0.9101 
## F-statistic: 325.1 on 1 and 31 DF,  p-value: < 2.2e-16

model.diag.metrics <- augment(model)
head(model.diag.metrics)

## # A tibble: 6 × 8
##       y     x .fitted .resid   .hat .sigma  .cooksd .std.resid
##   <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1     5     3    6.54 -1.54  0.254    3.27 0.0518      -0.552 
## 2    11     7   10.2   0.845 0.199    3.28 0.0106       0.292 
## 3    21    11   13.8   7.23  0.152    2.95 0.528        2.43  
## 4    16    15   17.4  -1.38  0.112    3.27 0.0131      -0.455 
## 5    16    18   20.1  -4.10  0.0878   3.19 0.0849      -1.33  
## 6    28    27   28.2  -0.228 0.0403   3.28 0.000109    -0.0721

Linearity of the Data

The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (1st plot)

# Linearity of the data
plot(model, 1)

Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
In our example, there is no pattern in the residual plot. This suggests that we can assume linear relationship between the predictors and the outcome variables.

Normality of Residuals

Q-Q Plot of Residuals

The Q-Q plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.

# Normality of residuals
plot(model, 2)

In our example, all the points fall approximately along this reference line, so we can assume normality.

Perform a Shapiro-Wilk Normality Test

If the dataset size is less than 50, we can do Shapiro-Wilk normality test to test the data normality.

standardized_res = model.diag.metrics$.std.resid
shapiro.test(standardized_res)

## 
##  Shapiro-Wilk normality test
## 
## data:  standardized_res
## W = 0.9465, p-value = 0.1051

From the p-value = 0.1051 > 0.5, it is clear that the residuals are normally distributed.

Perform a Kolmogorov-Smirnov Normality Test

If the dataset size is more than 50 or equal to 50, we can do Kolmogorov-Smirnov normality test to test the data normality.

ks.test(standardized_res, "pnorm")

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  standardized_res
## D = 0.11726, p-value = 0.711
## alternative hypothesis: two-sided

From the p-value = 0.711 > 0.5, it is clear that the residuals are normally distributed.

Testing the Homoscedasticity Assumption

Scale-location plot

This assumption can be checked by examining the Scale-location plot, also known as the spread-location plot.

plot(model, 3)

This plot shows if residuals are spread equally along the ranges of predictors. It is good if you see a horizontal line with equally spread points. In our example, this is the case. we can assume Homogeneity of variance.

ncvTest() For Homoscedasticity

# non-constant error variance test
ncvTest(model)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.446157, Df = 1, p = 0.22915

p > 0.05, suggesting that our data is homoscedastic.

Breusch-Pagan Test For Homoscedasticity

bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.0853, df = 1, p-value = 0.2975

p > 0.05, suggesting that our data is homoscedastic.

Testing the Independence (Autocorrelation) Assumption

Durbin Watson Test for Autocorrelation

The Durbin Watson examines whether the errors are autocorrelated with themselves. The null states that they are not autocorrelated (what we want). This test could be especially useful when you conduct a multiple (times series) regression. For example, this test could tell you whether the residuals at time point 1 are correlated with the residuals at time point 2 (they shouldn’t be). In other words, this test is useful to verify that we haven’t violated the independence assumption.

# durbin watson test

durbinWatsonTest(model)

##  lag Autocorrelation D-W Statistic p-value
##    1      -0.1870909      2.354616   0.338
##  Alternative hypothesis: rho != 0

p-value > 0.05, suggesting that the errors are not autocorrelated.

Sources

1. Linear Regression Assumptions and Diagnostics in R
2. Kolmogorov-Smirnov Test in R (With Examples)
3. Linear Regression Assumptions and Diagnostics in R: Essentials

Simple Linear Regression Assumptions in R

Hanif

2022-12-19