Sameer Mathur
Linearity in Regression Analysis of the cars dataset
Regression Diagnostics
---
In Ordinary Least Squares regression, the relationship between the outcome / response (y) and the predictor (x) is assumed to be linear.
This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
A linear regression model is defined as
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Aside: The following model is also linear, because it is linear in the parameters even though it is quadratic in x:
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2+ \epsilon \]
But the following model is non-linear, because the parameter \(\beta_2\) appears inside the exponent:
\[ y = \beta_0 + \beta_1 e^{\beta_2 x} + \epsilon \]
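To make the aside concrete, here is a minimal sketch in R using the built-in cars data analyzed later in this document. The quadratic model can still be fitted with lm() because it is linear in the coefficients. The name fitQuad is illustrative only and is not used elsewhere.

```r
# Quadratic in speed, but linear in the coefficients, so OLS via lm() applies.
data(cars)
fitQuad <- lm(dist ~ speed + I(speed^2), data = cars)

# Three estimated coefficients: intercept, speed, and speed squared.
coef(fitQuad)
```

The exponential model, by contrast, cannot be written as a linear combination of its parameters, so it would typically need an iterative non-linear least squares routine such as nls().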
Speed and Stopping Distances of Cars
This dataset gives the speeds of cars and the distances taken by them to stop after brakes are applied.
The data contain 50 observations on two variables, speed and dist.
Data Description
Variable  Type     Description
speed     numeric  Speed (mph)
dist      numeric  Stopping distance (ft)
First few rows of the cars dataset
library(stats)     # attached by default; listed for completeness
library(graphics)  # attached by default; listed for completeness
data(cars)         # load the built-in cars dataset
head(cars)         # first six rows
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
Descriptive statistics
summary(cars)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
dim(cars) # rows, columns
[1] 50 2
# scatter plot of distance and speed
plot(dist ~ speed, data = cars)
We construct a simple linear model as follows:
\[ \text{dist} = \beta_0 + \beta_1 \, \text{speed} + \epsilon \]
# fitting simple linear model
fitCarsModel <- lm(dist ~ speed, data = cars)
# summary of the fitted model
summary(fitCarsModel)
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
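As a rough illustration of how to read the coefficient table, the sketch below refits the same model and computes one fitted value both by hand and with predict(). The speed of 15 mph and the names b, newSpeed, byHand, and byPredict are arbitrary choices made for this example.

```r
data(cars)
fitCarsModel <- lm(dist ~ speed, data = cars)

# Extract the estimated intercept and slope.
b <- coef(fitCarsModel)

# Fitted stopping distance at an illustrative speed of 15 mph.
newSpeed <- 15
byHand <- b[["(Intercept)"]] + b[["speed"]] * newSpeed
byPredict <- predict(fitCarsModel, newdata = data.frame(speed = newSpeed))

c(byHand = byHand, byPredict = unname(byPredict))
```

Both values should agree at roughly 41.4 ft, since -17.5791 + 3.9324 × 15 ≈ 41.41.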
# residual plots of OLS model
par(mfrow=c(2,2))
plot(fitCarsModel)
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (plot on the top-left) from the Diagnostic Plots.
# residual vs. fitted plot
par(mfrow=c(1,1))
plot(fitCarsModel, 1)
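Ideally, the residuals should scatter randomly around zero, with no systematic pattern across the fitted values.
A systematic pattern, such as a curved trend line, indicates potential non-linearity.
In the cars data, the residuals show a pattern across the fitted values, which suggests potential non-linearity.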
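To see what the diagnostic plot is built from, the sketch below draws a similar residuals vs. fitted plot by hand. It assumes a lowess smoother with default settings as a stand-in for the red trend line, so the result should resemble, but may not exactly match, the output of plot(fitCarsModel, 1). The names res, fit, and smoothed are illustrative.

```r
data(cars)
fitCarsModel <- lm(dist ~ speed, data = cars)

res <- resid(fitCarsModel)   # observed minus fitted values
fit <- fitted(fitCarsModel)  # fitted values from the model

# Scatter the residuals against the fitted values, with a zero reference line.
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Overlay a lowess smooth; a roughly flat line is consistent with linearity.
smoothed <- lowess(fit, res)
lines(smoothed, col = "red")
```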
A power transform is a member of a family of functions that apply a monotonic transformation to data using power functions. The Box-Cox transformation used below is one such family.
Power transforms are used to stabilize variance, make the data more nearly normally distributed, and improve the validity of measures of association such as the Pearson correlation between variables.
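For reference, the Box-Cox family used in the code that follows is commonly written as below for a strictly positive response y. This is the standard one-parameter form; the symbol \(y^{(\lambda)}\) for the transformed value is notation introduced here for illustration.

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \]

For example, with \(\lambda = 0.5\) the transformation reduces to \(2(\sqrt{y} - 1)\).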
library(caret)
distTrans <- BoxCoxTrans(cars$dist)
distTrans
Box-Cox Transformation
50 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 26.00 36.00 42.98 56.00 120.00
Largest/Smallest: 60
Sample Skewness: 0.759
Estimated Lambda: 0.5
# apply the estimated Box-Cox transformation to dist
distNew <- predict(distTrans, cars$dist)
head(distNew)
[1] 0.8284271 4.3245553 2.0000000 7.3808315 6.0000000 4.3245553
# append the transformed variable to cars
cars <- cbind(cars, distNew)
# first few rows of the dataset
head(cars)
speed dist distNew
1 4 2 0.8284271
2 4 10 4.3245553
3 7 4 2.0000000
4 7 22 7.3808315
5 8 16 6.0000000
6 9 10 4.3245553
Here, predict() applies the estimated Box-Cox transformation (lambda = 0.5) to the dependent variable dist; it does not generate predictions from a regression model.
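As a quick check on what predict() returns here, the sketch below computes the transformation by hand in base R, assuming the estimated lambda of 0.5 reported above and the usual Box-Cox form (y^lambda - 1) / lambda. The names lambda and distManual are illustrative.

```r
data(cars)

# Box-Cox with lambda = 0.5: (y^0.5 - 1) / 0.5, i.e. 2 * (sqrt(y) - 1).
lambda <- 0.5
distManual <- (cars$dist^lambda - 1) / lambda

head(distManual)
```

These should match the first six values of distNew shown above (0.8284271, 4.3245553, 2.0000000, ...).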
The new regression model will be based on the transformed data.
# fitting simple linear model
fitCarsTransModel <- lm(distNew ~ speed, data = cars)
# summary of the fitted model
summary(fitCarsTransModel)
Call:
lm(formula = distNew ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-4.1369 -1.3966 -0.3598 1.1817 6.3069
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.55410 0.96888 0.572 0.57
speed 0.64483 0.05957 10.825 1.77e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.205 on 48 degrees of freedom
Multiple R-squared: 0.7094, Adjusted R-squared: 0.7034
F-statistic: 117.2 on 1 and 48 DF, p-value: 1.773e-14
# residual vs. fitted plot
plot(fitCarsTransModel, 1)
Before Box-Cox transformation
# residual vs. fitted plot
plot(fitCarsModel, 1)
After Box-Cox transformation
# residual vs. fitted plot
plot(fitCarsTransModel, 1)
After the Box-Cox transformation, the red trend line in the Residuals vs Fitted plot is flatter than it was before the transformation.
Hence, the transformed model is closer to satisfying the linearity assumption.
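The before/after comparison can also be drawn side by side. The sketch below is one way to do that in base R; it applies the lambda = 0.5 transformation by hand rather than through caret, and the names cars2, fitRaw, and fitTrans are illustrative rather than taken from the analysis above.

```r
data(cars)

# Add the Box-Cox transformed response (lambda = 0.5) to a copy of the data.
cars2 <- transform(cars, distNew = (dist^0.5 - 1) / 0.5)

fitRaw   <- lm(dist ~ speed, data = cars2)     # original response
fitTrans <- lm(distNew ~ speed, data = cars2)  # transformed response

# Two panels in one row: residuals vs. fitted for each model.
op <- par(mfrow = c(1, 2))
plot(fitRaw, which = 1, main = "Before Box-Cox")
plot(fitTrans, which = 1, main = "After Box-Cox")
par(op)  # restore the previous plotting layout
```

Placing the two panels next to each other makes it easier to compare the curvature of the trend lines directly.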