library(tidyverse)
library(broom)
library(MASS)
library(dgof)
library(car)
library(lmtest)
library(caret)
library(mctest)
Download the dataset here.
mydata = read.table("path_to_file", header = TRUE, sep = " ")
head(mydata)
## x y
## 1 3 5
## 2 7 11
## 3 11 21
## 4 15 16
## 5 18 16
## 6 27 28
# building a regression model
model <- lm(y ~ x, data = mydata)
summary(model)
##
## Call:
## lm(formula = y ~ x, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.939 -1.783 -0.228 1.506 8.157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.82963 1.76845 2.166 0.0382 *
## x 0.90364 0.05012 18.030 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.23 on 31 degrees of freedom
## Multiple R-squared: 0.9129, Adjusted R-squared: 0.9101
## F-statistic: 325.1 on 1 and 31 DF, p-value: < 2.2e-16
model.diag.metrics <- augment(model)
head(model.diag.metrics)
## # A tibble: 6 × 8
## y x .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 3 6.54 -1.54 0.254 3.27 0.0518 -0.552
## 2 11 7 10.2 0.845 0.199 3.28 0.0106 0.292
## 3 21 11 13.8 7.23 0.152 2.95 0.528 2.43
## 4 16 15 17.4 -1.38 0.112 3.27 0.0131 -0.455
## 5 16 18 20.1 -4.10 0.0878 3.19 0.0849 -1.33
## 6 28 27 28.2 -0.228 0.0403 3.28 0.000109 -0.0721
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (1st plot)
# Linearity of the data
plot(model, 1)
Ideally, the residual plot will show no fitted pattern. That is, the
red line should be approximately horizontal at zero. The presence of a
pattern may indicate a problem with some aspect of the linear
model.
In our example, there is no pattern in the residual plot. This suggests
that we can assume linear relationship between the predictors and the
outcome variables.
The Q-Q plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.
# Normality of residuals
plot(model, 2)
In our example, all the points fall approximately along this reference line, so we can assume normality.
If the dataset size is less than 50, we can do Shapiro-Wilk normality test to test the data normality.
standardized_res = model.diag.metrics$.std.resid
shapiro.test(standardized_res)
##
## Shapiro-Wilk normality test
##
## data: standardized_res
## W = 0.9465, p-value = 0.1051
From the p-value = 0.1051 > 0.5, it is clear that the residuals are normally distributed.
If the dataset size is more than 50 or equal to 50, we can do Kolmogorov-Smirnov normality test to test the data normality.
ks.test(standardized_res, "pnorm")
##
## One-sample Kolmogorov-Smirnov test
##
## data: standardized_res
## D = 0.11726, p-value = 0.711
## alternative hypothesis: two-sided
From the p-value = 0.711 > 0.5, it is clear that the residuals are normally distributed.
This assumption can be checked by examining the Scale-location plot, also known as the spread-location plot.
plot(model, 3)
This plot shows if residuals are spread equally along the ranges of predictors. It is good if you see a horizontal line with equally spread points. In our example, this is the case. we can assume Homogeneity of variance.
# non-constant error variance test
ncvTest(model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.446157, Df = 1, p = 0.22915
p > 0.05, suggesting that our data is homoscedastic.
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 1.0853, df = 1, p-value = 0.2975
p > 0.05, suggesting that our data is homoscedastic.
The Durbin Watson examines whether the errors are autocorrelated with themselves. The null states that they are not autocorrelated (what we want). This test could be especially useful when you conduct a multiple (times series) regression. For example, this test could tell you whether the residuals at time point 1 are correlated with the residuals at time point 2 (they shouldn’t be). In other words, this test is useful to verify that we haven’t violated the independence assumption.
# durbin watson test
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.1870909 2.354616 0.338
## Alternative hypothesis: rho != 0
p-value > 0.05, suggesting that the errors are not autocorrelated.