In my previous post, we talked about linear regression: how to approach linear problems, how to choose the best model, and how to find the most important predictors.
Building on that, I think it’d be a good idea to work through two or three different linear regression exercises - just to drive the point home.
I’m still not sure whether all the exercises will be done and collated in a single report (like this one) or split across separate reports (Part I, Part II, Part III) - it depends on how tough or tasking they are, and how confused I am by the questions.
I will be picking the questions from the ISLR Linear Regression chapter (https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370). Solutions to these questions exist on the internet, but we are going to take a different approach. I remember struggling a lot with most of these questions when I was reading ISLR - but now? I have a wide range of mental models for solving them.
Let’s start.
This question involves the use of multiple linear regression on the Auto data set.
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: Is there a relationship between the predictors and the response? Which predictors appear to have a statistically significant relationship to the response? What does the coefficient for the year variable suggest?
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
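Before the solutions, one housekeeping note: the code below leans on a handful of packages (inferred from the functions used throughout), so load these first:
# packages used throughout this report
library(ISLR)      # the Auto data set
library(car)       # scatterplotMatrix, vif, qqPlot, crPlots, avPlots, influencePlot, ncvTest, subsets
library(corrplot)  # corrplot
library(psych)     # describe
library(dplyr)     # %>%, arrange, top_n
library(MASS)      # stepAIC
library(leaps)     # regsubsets
library(gvlma)     # gvlma
library(tree)      # tree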
# solution to question (a)
scatterplotMatrix(Auto)
Whenever you are working on a multiple linear regression task (or any task, really), it’s good practice to check the structure of your data and the correlations between your variables - that way you have an idea of the kind of data you are dealing with. I use the str and psych::describe functions below to get the structure and summary statistics of our data.
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
psych::describe(Auto)
## vars n mean sd median trimmed mad min max range
## mpg 1 392 23.45 7.81 22.75 22.99 8.60 9 46.6 37.6
## cylinders 2 392 5.47 1.71 4.00 5.35 0.00 3 8.0 5.0
## displacement 3 392 194.41 104.64 151.00 183.83 90.44 68 455.0 387.0
## horsepower 4 392 104.47 38.49 93.50 99.82 28.91 46 230.0 184.0
## weight 5 392 2977.58 849.40 2803.50 2916.94 948.12 1613 5140.0 3527.0
## acceleration 6 392 15.54 2.76 15.50 15.48 2.52 8 24.8 16.8
## year 7 392 75.98 3.68 76.00 75.97 4.45 70 82.0 12.0
## origin 8 392 1.58 0.81 1.00 1.47 0.00 1 3.0 2.0
## name* 9 392 148.35 89.53 148.50 148.03 119.35 1 304.0 303.0
## skew kurtosis se
## mpg 0.45 -0.54 0.39
## cylinders 0.50 -1.40 0.09
## displacement 0.70 -0.79 5.29
## horsepower 1.08 0.65 1.94
## weight 0.52 -0.83 42.90
## acceleration 0.29 0.41 0.14
## year 0.02 -1.18 0.19
## origin 0.91 -0.86 0.04
## name* 0.03 -1.25 4.52
#exclude non-numeric data columns
num.cols <- sapply(Auto, is.numeric)
Auto.num <- Auto[,num.cols]
# correlation plot
corrplot(cor(Auto.num), order = "hclust", method = "number")
Some basic shortcuts and time-saving heuristics for linear regression tasks (these are just from me, meaning I may well change my mind on one or two of them over time as I learn more):
If there is a positive correlation between the variables in your data set, that’s both a gift and a curse: a gift in the sense that you’ll most likely get a linear fit (though this depends on other factors), and a curse in that you will most likely have a collinearity problem;
If your response variable has a negative correlation with your predictors, then your model will most likely turn out non-linear (polynomial) and fail the linearity assumptions;
Any influential observation is most likely a bad influence on your model. Outliers, high-leverage points and influential observations should be removed;
Don’t just assume or follow your intuition (shortcuts); TEST it, and keep an “I may be wrong” mindset - always play devil’s advocate with your own shortcuts;
Having a good model is not the same as having a model that can predict unseen data accurately. An accurate model is usually also a good model, but you should not assume that a good model is an accurate one. A quick hold-out check, sketched below, is the cheapest way to test this.
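To make that last point concrete, here is a minimal sketch of estimating out-of-sample error with a hold-out split (the 80/20 split and the seed are arbitrary choices of mine, not part of the exercise):
# hold out ~20% of the observations to estimate test error
set.seed(1)
train <- sample(nrow(Auto), round(0.8 * nrow(Auto)))
fit_train <- lm(mpg ~ . - name, data = Auto[train, ])
# test MSE: mean squared prediction error on the held-out rows
mean((Auto$mpg[-train] - predict(fit_train, Auto[-train, ]))^2)
A model with a flattering summary() on the training data can still post a disappointing test MSE.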
# full model: all predictors except name
fit <- lm(mpg ~ . - name, data = Auto)
# reduced model: drop the heavily intercorrelated predictors
fit2 <- lm(mpg ~ . - name - cylinders - horsepower - displacement, data = Auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
summary(fit2)
##
## Call:
## lm(formula = mpg ~ . - name - cylinders - horsepower - displacement,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7456 -2.1149 -0.0255 1.7654 13.1961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.879e+01 4.051e+00 -4.639 4.79e-06 ***
## weight -5.893e-03 2.685e-04 -21.951 < 2e-16 ***
## acceleration 7.966e-02 6.875e-02 1.159 0.247
## year 7.465e-01 4.917e-02 15.181 < 2e-16 ***
## origin 1.163e+00 2.593e-01 4.487 9.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.346 on 387 degrees of freedom
## Multiple R-squared: 0.8181, Adjusted R-squared: 0.8162
## F-statistic: 435.1 on 4 and 387 DF, p-value: < 2.2e-16
# fit a quick regression tree on the same predictors - a second opinion
# on which variables matter most
tree_f <- tree(mpg ~ . - name, data = Auto)
plot(tree_f)
text(tree_f, col = "red")
# checking for collinearity - a common rule of thumb flags VIF values above 5 or 10
vif(fit)
## cylinders displacement horsepower weight acceleration year
## 10.737535 21.836792 9.943693 10.831260 2.625806 1.244952
## origin
## 1.772386
vif(fit2)
## weight acceleration year origin
## 1.816049 1.256380 1.145708 1.523125
# checking the linear model assumptions
par(mfrow = c(2, 2))  # show the four diagnostic plots in a 2x2 grid
plot(fit)
# check outliers and high-leverage observations by comparing them against the column averages
Average <- apply(Auto[,num.cols],2,mean)
Average <- as.data.frame(Average)
t(Average)
## mpg cylinders displacement horsepower weight acceleration
## Average 23.44592 5.471939 194.412 104.4694 2977.584 15.54133
## year origin
## Average 75.97959 1.576531
Auto[14,]
## mpg cylinders displacement horsepower weight acceleration year origin
## 14 14 8 455 225 3086 10 70 1
## name
## 14 buick estate wagon (sw)
# train MSE - note the parentheses: we want the mean of the squared residuals,
# not the square of the mean residual (which is ~0 by construction)
mean((fitted(fit2) - Auto$mpg)^2)
# find the outliers - observations that our model didn't predict well:
# build a data frame of name, actual, predicted and residual (actual - predicted) values
Resid <- cbind(Auto[, c("name", "mpg")], fitted(fit), residuals(fit))
names(Resid) <- c("Name", "Actual", "Predicted", "Residuals")
# top 10 under-estimated observations (largest positive residuals)
Resid %>%
arrange(desc(Residuals)) %>%
top_n(10)
## Selecting by Residuals
## Name Actual Predicted Residuals
## 323 mazda glc 46.6 33.53957 13.060427
## 327 vw dasher (diesel) 43.4 31.49186 11.908136
## 326 vw rabbit c (diesel) 44.3 32.94922 11.350776
## 245 volkswagen rabbit custom diesel 43.1 32.07897 11.021033
## 310 vw rabbit 41.5 31.68776 9.812243
## 330 honda civic 1500 gl 44.6 34.95804 9.641961
## 387 oldsmobile cutlass ciera (diesel) 38.0 28.43317 9.566833
## 394 vw pickup 44.0 34.46457 9.535428
## 328 audi 5000s (diesel) 36.4 27.00546 9.394544
## 325 datsun 210 40.8 33.62443 7.175574
# top 10 over-estimated observations (largest negative residuals)
Resid %>%
arrange(Residuals) %>%
top_n(-10)
## Selecting by Residuals
## Name Actual Predicted Residuals
## 112 maxda rx3 18.0 27.59026 -9.590261
## 271 toyota celica gt liftback 21.1 29.61271 -8.512711
## 167 ford mustang ii 13.0 20.84110 -7.841100
## 156 ford maverick 15.0 22.43503 -7.435027
## 109 toyota carina 20.0 27.10766 -7.107660
## 335 mazda rx-7 gs 23.7 30.67927 -6.979266
## 367 chrysler lebaron salon 17.6 24.00071 -6.400706
## 72 mazda rx2 coupe 19.0 25.38718 -6.387177
## 389 ford granada l 22.0 28.35862 -6.358621
## 276 volvo 264gl 17.0 23.12529 -6.125289
qqPlot(fit)  # QQ plot of studentized residuals; the most extreme points are flagged
## 323 327
## 321 325
crPlots(fit)  # component + residual plots to check linearity of each predictor
# hat values measure leverage; label each point with its row number
plot(hatvalues(fit))
text(hatvalues(fit), names(hatvalues(fit)), pos = 4)
plot(fit, which = 4)  # Cook's distance
influencePlot(fit)  # studentized residuals vs hat values, circle area proportional to Cook's distance
## StudRes Hat CookD
## 14 -1.6329244 0.18991289 0.077800835
## 29 0.8036817 0.08954137 0.007947716
## 323 4.0295372 0.01367411 0.027064445
## 327 3.6902459 0.02874269 0.048772234
## 394 2.9683847 0.04917182 0.055823737
There’s a difference between this plot, which includes observation 14 (an influential observation)…
avPlots(fit, ask = F, col = "red")
…and this plot, which excludes observation 14 (compare the weight | others panel in both plots).
fit_test <- lm(mpg ~ . - name, data = Auto[-14, ])
avPlots(fit_test, ask = F, col = "red")
# check for heteroscedasticity (non-constant error variance): a small p-value rejects constant variance
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 30.89489, Df = 1, p = 2.7239e-08
# global validation of the linear model assumptions
gvmodel <- gvlma(fit)
summary(gvmodel)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit)
##
## Value p-value Decision
## Global Stat 175.46 0.000e+00 Assumptions NOT satisfied!
## Skewness 18.29 1.898e-05 Assumptions NOT satisfied!
## Kurtosis 34.81 3.633e-09 Assumptions NOT satisfied!
## Link Function 96.23 0.000e+00 Assumptions NOT satisfied!
## Heteroscedasticity 26.12 3.208e-07 Assumptions NOT satisfied!
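Every assumption fails. A common first remedy - my suggestion here, not part of the original exercise - is to transform the response and re-run the tests:
# a log transform of the response often stabilizes the error variance
fit_log_mpg <- lm(log(mpg) ~ . - name, data = Auto)
ncvTest(fit_log_mpg)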
# backward stepwise selection by AIC
stepAIC(fit, direction = "backward")
## Start: AIC=950.5
## mpg ~ (cylinders + displacement + horsepower + weight + acceleration +
## year + origin + name) - name
##
## Df Sum of Sq RSS AIC
## - acceleration 1 7.36 4259.6 949.18
## - horsepower 1 16.74 4269.0 950.04
## <none> 4252.2 950.50
## - cylinders 1 25.79 4278.0 950.87
## - displacement 1 77.61 4329.8 955.59
## - origin 1 291.13 4543.3 974.46
## - weight 1 1091.63 5343.8 1038.08
## - year 1 2402.25 6654.5 1124.06
##
## Step: AIC=949.18
## mpg ~ cylinders + displacement + horsepower + weight + year +
## origin
##
## Df Sum of Sq RSS AIC
## <none> 4259.6 949.18
## - cylinders 1 27.27 4286.8 949.68
## - horsepower 1 53.80 4313.4 952.10
## - displacement 1 73.57 4333.1 953.89
## - origin 1 292.02 4551.6 973.17
## - weight 1 1310.43 5570.0 1052.32
## - year 1 2396.17 6655.7 1122.13
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## year + origin, data = Auto)
##
## Coefficients:
## (Intercept) cylinders displacement horsepower weight
## -15.563492 -0.506685 0.019269 -0.023895 -0.006218
## year origin
## 0.747516 1.428242
# best subset selection (exhaustive search over predictor subsets)
rg <- regsubsets(mpg ~ . - name, data = Auto)
summary(rg)
## Subset selection object
## Call: regsubsets.formula(mpg ~ . - name, data = Auto)
## 7 Variables (and intercept)
## Forced in Forced out
## cylinders FALSE FALSE
## displacement FALSE FALSE
## horsepower FALSE FALSE
## weight FALSE FALSE
## acceleration FALSE FALSE
## year FALSE FALSE
## origin FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## cylinders displacement horsepower weight acceleration year origin
## 1 ( 1 ) " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " "*" " " "*" " "
## 3 ( 1 ) " " " " " " "*" " " "*" "*"
## 4 ( 1 ) " " "*" " " "*" " " "*" "*"
## 5 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
par(mfrow = c(2, 2))
plot(rg, scale = "adjr2")
plot(rg, scale = "bic")
plot(rg, scale = "Cp")
plot(rg, scale = "r2")
subsets(rg, statistic = "cp",
main = "Cp Plot for All Subsets Regression",
legend = c(6.2,280))
abline(1, 1, lty = 2, col = "red")  # reference line where Cp equals the number of parameters
# standardize the numeric variables so the coefficients are directly comparable
zauto <- as.data.frame(scale(Auto[, num.cols]))
zfit <- lm(mpg ~ ., data = zauto)
coef(zfit)
## (Intercept) cylinders displacement horsepower weight
## 1.046426e-15 -1.078273e-01 2.667467e-01 -8.359623e-02 -7.045565e-01
## acceleration year origin
## 2.848143e-02 3.543429e-01 1.471853e-01
sort(coef(zfit), decreasing = F)
## weight cylinders horsepower (Intercept) acceleration
## -7.045565e-01 -1.078273e-01 -8.359623e-02 1.046426e-15 2.848143e-02
## origin displacement year
## 1.471853e-01 2.667467e-01 3.543429e-01
# relative weights: approximates each predictor's importance as a share of R-squared
relweights <- function(fit, ...){
  R <- cor(fit$model)                    # correlations among response and predictors
  nvar <- ncol(R)
  rxx <- R[2:nvar, 2:nvar]               # predictor intercorrelations
  rxy <- R[2:nvar, 1]                    # predictor-response correlations
  svd <- eigen(rxx)
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)   # orthogonal counterparts of the predictors
  lambdasq <- lambda ^ 2
  beta <- solve(lambda) %*% rxy          # regress the response on the orthogonal variables
  rsquare <- colSums(beta ^ 2)
  rawwgt <- lambdasq %*% beta ^ 2
  import <- (rawwgt / rsquare) * 100     # express each weight as a percentage of R-squared
  import <- as.data.frame(import)
  row.names(import) <- names(fit$model[2:nvar])
  names(import) <- "Weights"
  import <- import[order(import), 1, drop = F]
  dotchart(import$Weights, labels = row.names(import),
           xlab = "% of R-Square", pch = 19,
           main = "Relative importance of Predictor Variables",
           sub = paste("Total R-Square=", round(rsquare, digits = 3)),
           ...)
  return(import)
}
# relative weights for all the numeric predictors
fit_t <- lm(mpg ~ ., data = Auto[, num.cols])
relweights(fit_t)
## Weights
## acceleration 4.091976
## origin 10.538774
## horsepower 14.882644
## cylinders 14.904625
## displacement 14.906457
## year 19.109955
## weight 21.565568
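That covers parts (a) through (d). For parts (e) and (f), here is a minimal sketch of the kinds of models I would try first (the variable pairings are illustrative, not the “right” answer):
# (e) a*b expands to a + b + a:b, while a:b adds only the interaction term
fit_int <- lm(mpg ~ weight * horsepower + year:origin, data = Auto)
summary(fit_int)
# (f) common transformations: log, square root, polynomial
fit_tr <- lm(mpg ~ log(horsepower) + sqrt(weight) + I(acceleration^2), data = Auto)
summary(fit_tr)
Check the interaction and transformed-term rows of each summary for statistical significance.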