Linear Regression Assumptions and Diagnostics

Introduction

Anscombe’s quartet, four small datasets with nearly identical summary statistics but very different shapes, demonstrates why data should be graphed before it is modeled.
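As a quick illustration, the quartet ships with base R as the anscombe data frame (columns x1..x4 and y1..y4). The sketch below fits the same simple regression to each of the four pairs and then plots them; the names fits and coefs are only illustrative.

```r
# Anscombe's quartet is built into R as `anscombe` (x1..x4, y1..y4).
# Fit y_i ~ x_i separately for each of the four datasets.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# One row per dataset: intercepts and slopes are nearly identical.
coefs <- t(sapply(fits, function(f) unname(coef(f))))
colnames(coefs) <- c("intercept", "slope")
print(round(coefs, 2))

# The scatterplots, however, look completely different.
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
  abline(fits[[i]], col = "red")
}
```

All four fits give an intercept of about 3 and a slope of about 0.5, yet only the plots reveal the curvature and the outliers hiding behind those numbers.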

The file “satgpa.csv” contains information regarding SAT scores and GPAs.

data <- read.csv('satgpa.csv')
head(data, n=2)

##   sex sat_v sat_m sat_sum hs_gpa fy_gpa
## 1   1    65    62     127    3.4   3.18
## 2   2    58    64     122    4.0   3.33

par(mfrow = c(1, 2))
kd_sat_sum <- density(data$sat_sum)
plot(kd_sat_sum)
polygon(kd_sat_sum, col = 'steelblue', border = 'black')

kd_fy_gpa <- density(data$fy_gpa)
plot(kd_fy_gpa)
polygon(kd_fy_gpa, col = 'steelblue', border = 'black')

Building a regression model

The following simple linear regression predicts fy_gpa from sat_sum.

model <- lm(fy_gpa ~ sat_sum, data = data)
model

## 
## Call:
## lm(formula = fy_gpa ~ sat_sum, data = data)
## 
## Coefficients:
## (Intercept)      sat_sum  
##    0.001927     0.023866
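To read these coefficients: the fitted line is fy_gpa = 0.001927 + 0.023866 * sat_sum, so each additional point of sat_sum is associated with an increase of about 0.024 in predicted fy_gpa. A small sketch of the arithmetic follows, using the printed coefficients directly. The helper predict_gpa is a hypothetical name introduced here; with the fitted object available, predict(model, newdata = ...) would normally be used instead.

```r
# Coefficients copied from the printed model output above.
intercept <- 0.001927
slope     <- 0.023866

# Hypothetical helper: predicted fy_gpa for a given sat_sum.
predict_gpa <- function(sat_sum) intercept + slope * sat_sum

# Predicted first-year GPA for sat_sum = 100 (about 2.39).
print(predict_gpa(100))

# Difference in prediction between sat_sum = 110 and sat_sum = 100
# equals 10 times the slope.
print(predict_gpa(110) - predict_gpa(100))
```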

Linear regression makes several assumptions about the data:

Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
Normality of residuals. The residual errors are assumed to be normally distributed.
Homogeneity of residual variance. The residuals are assumed to have constant variance (homoscedasticity).
Independence of residual error terms, i.e., no autocorrelation among the error terms.

Diagnostic plots

par(mfrow = c(2, 2))
plot(model)

1. Linearity of the data

plot(model, 1)

Ideally, the residuals vs. fitted plot shows no systematic pattern, and the red smoothing line is approximately horizontal at zero. A curved or trending line suggests non-linearity, which can often be addressed with nonlinear transformations of the predictor such as log(x), sqrt(x), or x^2.
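A minimal sketch of that remedy is shown below. It uses simulated data with a deliberately curved relationship (not the SAT data), fits a straight line and a model with an added squared term, and compares them. The names fit_linear and fit_quad are illustrative.

```r
set.seed(1)

# Simulated data with a quadratic relationship (an assumption for illustration).
x <- runif(200, min = 0, max = 10)
y <- 1 + 0.5 * x^2 + rnorm(200)

fit_linear <- lm(y ~ x)           # misspecified: straight line only
fit_quad   <- lm(y ~ x + I(x^2))  # adds a squared term via I()

r2_linear <- summary(fit_linear)$r.squared
r2_quad   <- summary(fit_quad)$r.squared
print(c(linear = r2_linear, quadratic = r2_quad))

# Residuals vs fitted: curved for the linear fit, flat for the quadratic fit.
par(mfrow = c(1, 2))
plot(fit_linear, which = 1)
plot(fit_quad, which = 1)
```

The I() wrapper is needed because ^ has a special meaning inside an R formula; without it, x^2 would not be treated as an arithmetic square.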

2. Normality of residuals

plot(model, 2)

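The points in the Q-Q plot should fall approximately along the straight reference line. Systematic departures, especially in the tails, suggest that the residuals are not normally distributed.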
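A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test (shapiro.test from base R's stats package) to the residuals of a model fitted to simulated data; with the model above, residuals(model) would be passed instead. Note that shapiro.test accepts between 3 and 5000 observations, and that with large samples it can flag even minor departures from normality, so it is best read alongside the Q-Q plot.

```r
set.seed(42)

# Simulated data with normal errors (a stand-in for the real model).
x <- rnorm(300)
y <- 2 + 3 * x + rnorm(300)
fit <- lm(y ~ x)

res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
print(sw)

# The same residuals on a normal Q-Q plot with a reference line.
qqnorm(res)
qqline(res, col = "red")
```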

3. Homogeneity of variance

plot(model, 3)

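Homoscedasticity, or equal variance of residuals, is indicated when the points in the scale-location plot are spread evenly across the range of fitted values. A good sign is a roughly horizontal red line with the points scattered evenly around it; a rising or falling line, or a funnel shape, points to non-constant variance.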
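A numeric counterpart to this plot is the Breusch-Pagan test. What follows is a hand-rolled sketch of its studentized (Koenker) variant using only base R, run on simulated data whose noise grows with x. It is meant to show the idea, not to replace tested implementations such as bptest in the lmtest package or ncvTest in the car package.

```r
set.seed(123)
n <- 500

# Simulated heteroscedastic data: the error sd increases with x.
x <- runif(n, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor.
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared against a chi-squared distribution with 1 df (one predictor).
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
print(c(statistic = lm_stat, p_value = p_value))

# Scale-location plot for the same fit shows an upward trend.
plot(fit, which = 3)
```

A small p-value here rejects constant variance, which matches the upward trend in the scale-location plot for these simulated data.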

4. Outliers and high leverage points

# Cook's distance
plot(model, 4)

# Residuals vs Leverage
plot(model, 5)

Observations whose standardized residuals exceed 3 in absolute value are possible outliers. Points that lie outside the dashed Cook’s distance lines in the residuals vs. leverage plot are influential: excluding them could noticeably change the regression results.
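The same quantities can be extracted numerically with rstandard, hatvalues, and cooks.distance from base R. The sketch below uses simulated data with one deliberately injected outlier (observation 100) rather than the SAT data.

```r
set.seed(7)

# Simulated data, then one injected high-leverage outlier at position 100.
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
x[100] <- 6
y[100] <- -10
fit <- lm(y ~ x)

std_res  <- rstandard(fit)       # standardized residuals
leverage <- hatvalues(fit)       # diagonal of the hat matrix
cooks    <- cooks.distance(fit)  # Cook's distance per observation

# Observations flagged by the |standardized residual| > 3 rule.
flagged <- which(abs(std_res) > 3)
print(flagged)

# The most influential observation by Cook's distance.
most_influential <- unname(which.max(cooks))
print(most_influential)

# Effect of dropping it: compare coefficients with and without.
print(rbind(all_data = coef(fit),
            dropped  = coef(lm(y ~ x, subset = -most_influential))))
```

Refitting without the flagged point, as in the last step, is a simple way to see how much a single observation moves the estimates.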

5. Autocorrelation of the error terms

Autocorrelation suggests missing information that our model should capture. In time series data, it might mean overlooking past information. In non-time series data, our model could exhibit systematic bias, either underpredicting or overpredicting in specific conditions. This may also stem from a violation of the linearity assumption.

library(car)

## Loading required package: carData

durbinWatsonTest(model)

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02003578      2.039963   0.534
##  Alternative hypothesis: rho != 0

The p-value of 0.534 is greater than the typical significance level of 0.05, so there is no evidence of significant autocorrelation in the residuals.
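The D-W statistic itself is simple to compute: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, and it is approximately 2 * (1 - r), where r is the lag-1 autocorrelation. In the output above, 2 * (1 - (-0.02003578)) is about 2.040, close to the reported 2.039963. The sketch below reproduces the calculation by hand on simulated data with independent errors; it does not compute a p-value.

```r
set.seed(2024)
n <- 500

# Simulated data with independent errors (an assumption for illustration).
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
e <- residuals(fit)

# Durbin-Watson statistic: values near 2 indicate no lag-1 autocorrelation.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals and the 2 * (1 - r) approximation.
r1 <- sum(e[-1] * e[-n]) / sum(e^2)
print(c(dw = dw, approx = 2 * (1 - r1)))
```

Values well below 2 indicate positive autocorrelation and values well above 2 indicate negative autocorrelation; the statistic always lies between 0 and 4.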