Linearity in Regression Analysis

Sameer Mathur

Linearity in Regression Analysis of the cars dataset

Regression Diagnostics

---

Linearity

In Ordinary Least Squares regression, the relationship between the outcome / response (y) and the predictor (x) is assumed to be linear.

This means that the mean of the response variable is a linear function of the parameters (regression coefficients), with the predictor variables as their multipliers.

A linear regression model is defined as

\[ y = \beta_0 + \beta_1 x + \epsilon \]

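Aside: The following model is also linear, because it is linear in the parameters even though it is quadratic in x:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \]

But the following model is non-linear, because the parameter \(\beta_2\) appears inside the exponent:

\[ y = \beta_0 + \beta_1 e^{\beta_2 x} + \epsilon \]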
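To see why the quadratic model still counts as linear, here is a minimal sketch. It assumes the built-in cars data used in the rest of this document; the names quadFit, X, betaHat and coefMatch are illustrative and not part of the original analysis. It fits the quadratic model with lm() and reproduces the coefficients from the closed-form least-squares solution, which exists only because the model is linear in the coefficients.

```r
# Illustrative sketch: a model that is quadratic in x is still linear in the betas.
data(cars)

# I(speed^2) adds the squared predictor as an extra column of the design matrix
quadFit <- lm(dist ~ speed + I(speed^2), data = cars)

# Design matrix with columns: intercept, speed, speed^2
X <- model.matrix(quadFit)

# Closed-form least-squares solution of the normal equations (X'X) b = X'y
betaHat <- solve(crossprod(X), crossprod(X, cars$dist))

# TRUE if the two sets of estimates agree up to numerical error
coefMatch <- isTRUE(all.equal(as.numeric(betaHat),
                              unname(coef(quadFit)),
                              tolerance = 1e-6))
coefMatch
```

The exponential model has no such fixed design matrix, so it would typically need an iterative fitter such as nls() instead.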

The cars Dataset

Speed and Stopping Distances of Cars

This dataset gives the speeds of cars and the distances they took to stop after the brakes were applied. It has 50 observations on two variables, speed and dist.

Data Description

  • speed numeric Speed (mph)

  • dist numeric Stopping distance (ft)

Source cars data

First few rows of the cars dataset

data(cars)   # load the built-in cars dataset (stats and graphics are attached by default)
head(cars)   # first six rows
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

Descriptive statistics

summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  
dim(cars) # rows, columns
[1] 50  2

Scatter plot of stopping distance versus speed

# scatter plot of distance and speed
plot(dist ~ speed, data = cars)

[Figure: scatter plot of dist versus speed]

Regression Model

We construct a simple linear model as follows:

\[ \text{dist} = \beta_0 + \beta_1 \, \text{speed} + \epsilon \]

Simple linear regression

# fitting simple linear model
fitCarsModel <- lm(dist ~ speed, data = cars)
# summary of the fitted model
summary(fitCarsModel)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
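As a worked reading of the coefficient table, the fitted line is approximately dist = -17.58 + 3.93 × speed. The sketch below computes a predicted stopping distance by hand and with predict(). The speed of 15 mph is an arbitrary illustration, and byHand and predAt15 are hypothetical names.

```r
# Illustrative sketch: using the fitted coefficients to predict a stopping distance.
data(cars)
fitCarsModel <- lm(dist ~ speed, data = cars)

b <- coef(fitCarsModel)                          # named vector: (Intercept), speed
byHand <- b[["(Intercept)"]] + b[["speed"]] * 15

# predict() gives the same value from a data frame of new speeds
predAt15 <- predict(fitCarsModel, newdata = data.frame(speed = 15))

c(byHand = byHand, predict = unname(predAt15))   # both are about 41.4 ft
```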

Diagnostic Plots

# residual plots of OLS model
par(mfrow=c(2,2))
plot(fitCarsModel)

The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (plot on the top-left) from the Diagnostic Plots.

[Figure: the four diagnostic plots for fitCarsModel]

Linearity of the data (Residual vs. Fitted Plot)

# residual vs. fitted plot
par(mfrow=c(1,1))
plot(fitCarsModel, 1)

[Figure: Residuals vs Fitted plot for fitCarsModel]

Ideally, the residual plot should show no pattern.

The red line, a smooth curve drawn through the residuals, should be approximately horizontal and close to zero.

The presence of a pattern indicates potential non-linearity.

In the cars data, the residuals show a systematic pattern: the red line is not flat. This suggests potential non-linearity.
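The visual check can be complemented with a rough numerical one. The sketch below recomputes a lowess smooth of the residuals against the fitted values. It is assumed here that this corresponds closely to the red line drawn by plot(); the name resSmooth is illustrative.

```r
# Illustrative sketch: how far does the smoothed residual trend stray from zero?
data(cars)
fitCarsModel <- lm(dist ~ speed, data = cars)

# Smooth the residuals as a function of the fitted values
resSmooth <- lowess(fitted(fitCarsModel), resid(fitCarsModel))

# A range tightly around zero would support linearity;
# a wide range points to curvature in the residuals
range(resSmooth$y)
```

The range should be judged against the residual standard error of the model, not in absolute terms.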

Rectifying Non-linearity using a Power Transform

A power transform is a member of a family of functions that apply a monotonic transformation to data using power functions.

Power transforms are used to stabilize variance, make the data more nearly normally distributed, and improve the validity of measures of association such as the Pearson correlation between variables, among other data stabilization purposes.
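For reference, the best-known power transform, the Box-Cox family, is usually written as follows for a strictly positive response y and a parameter \(\lambda\). The notation \(y^{(\lambda)}\) is an assumption of this sketch and is not used elsewhere in the document.

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \]

For example, \(\lambda = 1\) leaves the data unchanged apart from a shift, \(\lambda = 0.5\) is a square-root transform, and \(\lambda = 0\) is the log transform.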

Box-Cox Transformation

library(caret)
distTrans <- BoxCoxTrans(cars$dist)
distTrans
Box-Cox Transformation

50 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00   26.00   36.00   42.98   56.00  120.00 

Largest/Smallest: 60 
Sample Skewness: 0.759 

Estimated Lambda: 0.5 
distNew <- predict(distTrans, cars$dist)  # apply the estimated transformation to dist
head(distNew)
[1] 0.8284271 4.3245553 2.0000000 7.3808315 6.0000000 4.3245553
# append the transformed variable to cars
cars <- cbind(cars, distNew)
# first few rows of the dataset
head(cars)
  speed dist   distNew
1     4    2 0.8284271
2     4   10 4.3245553
3     7    4 2.0000000
4     7   22 7.3808315
5     8   16 6.0000000
6     9   10 4.3245553

Here predict() applies the estimated Box-Cox transformation (lambda = 0.5) to the values of dist. It transforms the dependent variable; it does not predict from a regression model.
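With the estimated lambda of 0.5, the transformed values can be reproduced by hand from the Box-Cox formula (y^lambda - 1) / lambda. The helper boxCox() below is a hypothetical stand-in written for illustration, not a caret function.

```r
# Hypothetical helper: Box-Cox transform for strictly positive y
boxCox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

data(cars)
byHand <- boxCox(cars$dist, lambda = 0.5)
head(byHand)
# [1] 0.8284271 4.3245553 2.0000000 7.3808315 6.0000000 4.3245553
```

These values match the first six entries of distNew shown above. For instance, dist = 2 gives (sqrt(2) - 1) / 0.5 ≈ 0.828.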

Regression model on transformed data

The new regression model will be based on the transformed data.

Simple Linear Regression

# fitting simple linear model
fitCarsTransModel <- lm(distNew ~ speed, data = cars)
# summary of the fitted model
summary(fitCarsTransModel)

Call:
lm(formula = distNew ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1369 -1.3966 -0.3598  1.1817  6.3069 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.55410    0.96888   0.572     0.57    
speed        0.64483    0.05957  10.825 1.77e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.205 on 48 degrees of freedom
Multiple R-squared:  0.7094,    Adjusted R-squared:  0.7034 
F-statistic: 117.2 on 1 and 48 DF,  p-value: 1.773e-14

Residual versus Fitted plot after transformation

# residual vs. fitted plot
plot(fitCarsTransModel, 1)

[Figure: Residuals vs Fitted plot for fitCarsTransModel]

Comparing Residual versus Fitted plots before and after Box-Cox transformation

Before Box-Cox transformation

# residual vs. fitted plot
plot(fitCarsModel, 1)

[Figure: Residuals vs Fitted plot for fitCarsModel, before transformation]

After Box-Cox transformation

# residual vs. fitted plot
plot(fitCarsTransModel, 1)

[Figure: Residuals vs Fitted plot for fitCarsTransModel, after transformation]

We can see that after the Box-Cox transformation the red line is flatter than it was before the transformation.

Hence, the transformed model is closer to satisfying the linearity assumption.
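The before/after comparison can also be drawn side by side, which makes the change in the red line easier to judge. This is a minimal sketch: it applies the Box-Cox formula with lambda = 0.5 directly rather than through caret, and the names fitRaw and fitTrans are illustrative.

```r
# Illustrative sketch: Residuals vs Fitted plots before and after the transformation.
data(cars)

# Box-Cox transform with lambda = 0.5, the value estimated earlier
cars$distNew <- (sqrt(cars$dist) - 1) / 0.5

fitRaw   <- lm(dist ~ speed, data = cars)
fitTrans <- lm(distNew ~ speed, data = cars)

op <- par(mfrow = c(1, 2))                 # two panels in one row
plot(fitRaw, which = 1, main = "Before")
plot(fitTrans, which = 1, main = "After")
par(op)                                    # restore the previous layout
```

Note that the two panels are on different vertical scales, so the shape of the red line is what should be compared, not its absolute height.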