Linear Regression Exercise

In my previous post, we spoke about linear regression and how to approach linear problems, how to choose the best models and finding the most important predictors.

Building on that, i think it’d be a good idea to work on two or three different linear regression exercises - just to drive the point home.

Still not sure if all the exercises will be done and collated in a single report (like this one) or done in different reports (Part I, Part II, Part III) - it depends on how tough or tasking they are, or how confused i am with the questions.

I will be picking the questions from ISLR: Linear Regression chapter https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370 . The solution to these questions are on the internet but we are going to be taking a different approach. I remember struggling a lot with most of these questions when i was reading ISLR - but now ? I already have a wide range of mental model on how to solve them.

Let’s start.

ISLR - Quesion 10

This question involves the use of multiple linear regression on the Auto data set.

  1. Produce a scatterplot matrix which includes all of the variables in the data set.

  2. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, cor() which is qualitative.

  3. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

    1. Is there a relationship between the predictors and the response?
    2. Which predictors appear to have a statistically significant relationship to the response?
    3. What does the coefficient for the year variable suggest?
  4. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

  5. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

  6. Try a few different transformations of the variables, such as log(X), √X, X2. Comment on your findings.


#solution to question a

scatterplotMatrix(Auto) 

Whenever you are working on a multiple linear regression task (any task actually), it’s always a good practice to check the structure of your data and also the correlation between variables. That way you have an idea of the kind of data you are dealing with.

I have added the str and descibe functions to get the structure of our data.

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
psych::describe(Auto)
##              vars   n    mean     sd  median trimmed    mad  min    max  range
## mpg             1 392   23.45   7.81   22.75   22.99   8.60    9   46.6   37.6
## cylinders       2 392    5.47   1.71    4.00    5.35   0.00    3    8.0    5.0
## displacement    3 392  194.41 104.64  151.00  183.83  90.44   68  455.0  387.0
## horsepower      4 392  104.47  38.49   93.50   99.82  28.91   46  230.0  184.0
## weight          5 392 2977.58 849.40 2803.50 2916.94 948.12 1613 5140.0 3527.0
## acceleration    6 392   15.54   2.76   15.50   15.48   2.52    8   24.8   16.8
## year            7 392   75.98   3.68   76.00   75.97   4.45   70   82.0   12.0
## origin          8 392    1.58   0.81    1.00    1.47   0.00    1    3.0    2.0
## name*           9 392  148.35  89.53  148.50  148.03 119.35    1  304.0  303.0
##              skew kurtosis    se
## mpg          0.45    -0.54  0.39
## cylinders    0.50    -1.40  0.09
## displacement 0.70    -0.79  5.29
## horsepower   1.08     0.65  1.94
## weight       0.52    -0.83 42.90
## acceleration 0.29     0.41  0.14
## year         0.02    -1.18  0.19
## origin       0.91    -0.86  0.04
## name*        0.03    -1.25  4.52
#exclude non-numeric data columns
num.cols <- sapply(Auto, is.numeric)

Auto.num <- Auto[,num.cols]

#corellation plot
corrplot(cor(Auto.num), order = "hclust", method = "number")

Basic shortcuts and time saving approaches to have with Linear Regression tasks (Now this is just from me, meaning i can actually change my mind on one or two of these shortcuts overtime as i learn more)

fit <- lm(mpg ~ .-name, data = Auto)
fit2 <- lm(mpg ~.-name-cylinders-horsepower-displacement, data = Auto)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ . - name - cylinders - horsepower - displacement, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7456 -2.1149 -0.0255  1.7654 13.1961 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.879e+01  4.051e+00  -4.639 4.79e-06 ***
## weight       -5.893e-03  2.685e-04 -21.951  < 2e-16 ***
## acceleration  7.966e-02  6.875e-02   1.159    0.247    
## year          7.465e-01  4.917e-02  15.181  < 2e-16 ***
## origin        1.163e+00  2.593e-01   4.487 9.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.346 on 387 degrees of freedom
## Multiple R-squared:  0.8181, Adjusted R-squared:  0.8162 
## F-statistic: 435.1 on 4 and 387 DF,  p-value: < 2.2e-16
tree_f <- tree(mpg ~ .-name, data = Auto)
plot(tree_f)
text(tree_f, col = "red")

# checking for collinearity
vif(fit)
##    cylinders displacement   horsepower       weight acceleration         year 
##    10.737535    21.836792     9.943693    10.831260     2.625806     1.244952 
##       origin 
##     1.772386
vif(fit2)
##       weight acceleration         year       origin 
##     1.816049     1.256380     1.145708     1.523125
#checking for linear assumptions
facet(2,2)
plot(fit)

# check outliers and high leverage observations
Average <- apply(Auto[,num.cols],2,mean)

Average <- as.data.frame(Average)

t(Average)
##              mpg cylinders displacement horsepower   weight acceleration
## Average 23.44592  5.471939      194.412   104.4694 2977.584     15.54133
##             year   origin
## Average 75.97959 1.576531
Auto[14,]
##    mpg cylinders displacement horsepower weight acceleration year origin
## 14  14         8          455        225   3086           10   70      1
##                       name
## 14 buick estate wagon (sw)
#train error
mean(fitted(fit2) - Auto$mpg)^2
## [1] 6.747478e-32
#find the outliers - observations that our model didn't predict well

#create a df of actual, predicted and actual-predicted and add names
Resid <-  as.data.frame(cbind(Auto[,c("name","mpg")], fitted(fit), residuals(fit)))

#Resid
#change column names
names(Resid) <- c("Name","Actual", "Predicted", "Residuals")

#top 10 observations under-estimated
Resid %>% 
  arrange(desc(Residuals)) %>% 
  top_n(10)
## Selecting by Residuals
##                                  Name Actual Predicted Residuals
## 323                         mazda glc   46.6  33.53957 13.060427
## 327                vw dasher (diesel)   43.4  31.49186 11.908136
## 326              vw rabbit c (diesel)   44.3  32.94922 11.350776
## 245   volkswagen rabbit custom diesel   43.1  32.07897 11.021033
## 310                         vw rabbit   41.5  31.68776  9.812243
## 330               honda civic 1500 gl   44.6  34.95804  9.641961
## 387 oldsmobile cutlass ciera (diesel)   38.0  28.43317  9.566833
## 394                         vw pickup   44.0  34.46457  9.535428
## 328               audi 5000s (diesel)   36.4  27.00546  9.394544
## 325                        datsun 210   40.8  33.62443  7.175574
#top 10 observation over-estimated
Resid %>% 
  arrange(Residuals) %>% 
  top_n(-10)
## Selecting by Residuals
##                          Name Actual Predicted Residuals
## 112                 maxda rx3   18.0  27.59026 -9.590261
## 271 toyota celica gt liftback   21.1  29.61271 -8.512711
## 167           ford mustang ii   13.0  20.84110 -7.841100
## 156             ford maverick   15.0  22.43503 -7.435027
## 109             toyota carina   20.0  27.10766 -7.107660
## 335             mazda rx-7 gs   23.7  30.67927 -6.979266
## 367    chrysler lebaron salon   17.6  24.00071 -6.400706
## 72            mazda rx2 coupe   19.0  25.38718 -6.387177
## 389            ford granada l   22.0  28.35862 -6.358621
## 276               volvo 264gl   17.0  23.12529 -6.125289
qqPlot(fit)

## 323 327 
## 321 325
crPlots(fit)

plot(hatvalues(fit))
text(hatvalues(fit), names(hatvalues(fit)), pos = 4)

plot(fit, which = 4)

influencePlot(fit)

##        StudRes        Hat       CookD
## 14  -1.6329244 0.18991289 0.077800835
## 29   0.8036817 0.08954137 0.007947716
## 323  4.0295372 0.01367411 0.027064445
## 327  3.6902459 0.02874269 0.048772234
## 394  2.9683847 0.04917182 0.055823737

There’s a difference between this plot with observation 14 (influential Observation)….

avPlots(fit, ask = F, col = "red")

…. and this plot without observation 14 (look at weight | others in both plots).

fit_test <- lm(mpg~.-name, data = Auto[-14,])
avPlots(fit_test, ask = F, col = "red")

#check for heteroscedasticity or non-constant error
ncvTest(fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 30.89489, Df = 1, p = 2.7239e-08
gvmodel <- gvlma(fit)
summary(gvmodel)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = fit) 
## 
##                     Value   p-value                   Decision
## Global Stat        175.46 0.000e+00 Assumptions NOT satisfied!
## Skewness            18.29 1.898e-05 Assumptions NOT satisfied!
## Kurtosis            34.81 3.633e-09 Assumptions NOT satisfied!
## Link Function       96.23 0.000e+00 Assumptions NOT satisfied!
## Heteroscedasticity  26.12 3.208e-07 Assumptions NOT satisfied!
stepAIC(fit, direction = "backward")
## Start:  AIC=950.5
## mpg ~ (cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin + name) - name
## 
##                Df Sum of Sq    RSS     AIC
## - acceleration  1      7.36 4259.6  949.18
## - horsepower    1     16.74 4269.0  950.04
## <none>                      4252.2  950.50
## - cylinders     1     25.79 4278.0  950.87
## - displacement  1     77.61 4329.8  955.59
## - origin        1    291.13 4543.3  974.46
## - weight        1   1091.63 5343.8 1038.08
## - year          1   2402.25 6654.5 1124.06
## 
## Step:  AIC=949.18
## mpg ~ cylinders + displacement + horsepower + weight + year + 
##     origin
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      4259.6  949.18
## - cylinders     1     27.27 4286.8  949.68
## - horsepower    1     53.80 4313.4  952.10
## - displacement  1     73.57 4333.1  953.89
## - origin        1    292.02 4551.6  973.17
## - weight        1   1310.43 5570.0 1052.32
## - year          1   2396.17 6655.7 1122.13
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     year + origin, data = Auto)
## 
## Coefficients:
##  (Intercept)     cylinders  displacement    horsepower        weight  
##   -15.563492     -0.506685      0.019269     -0.023895     -0.006218  
##         year        origin  
##     0.747516      1.428242
rg <- regsubsets(mpg ~ .-name, data = Auto)
summary(rg)
## Subset selection object
## Call: regsubsets.formula(mpg ~ . - name, data = Auto)
## 7 Variables  (and intercept)
##              Forced in Forced out
## cylinders        FALSE      FALSE
## displacement     FALSE      FALSE
## horsepower       FALSE      FALSE
## weight           FALSE      FALSE
## acceleration     FALSE      FALSE
## year             FALSE      FALSE
## origin           FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          cylinders displacement horsepower weight acceleration year origin
## 1  ( 1 ) " "       " "          " "        "*"    " "          " "  " "   
## 2  ( 1 ) " "       " "          " "        "*"    " "          "*"  " "   
## 3  ( 1 ) " "       " "          " "        "*"    " "          "*"  "*"   
## 4  ( 1 ) " "       "*"          " "        "*"    " "          "*"  "*"   
## 5  ( 1 ) " "       "*"          "*"        "*"    " "          "*"  "*"   
## 6  ( 1 ) "*"       "*"          "*"        "*"    " "          "*"  "*"   
## 7  ( 1 ) "*"       "*"          "*"        "*"    "*"          "*"  "*"
facet(2,2)
plot(rg, scale = "adjr2")
plot(rg, scale = "bic")
plot(rg, scale = "Cp")
plot(rg, scale = "r2")

subsets(rg, statistic = "cp",
        main = "Cp Plot for All Subsets Regression",
        legend = c(6.2,280))

abline(1, 1, lty = 2, col = "red")

zauto <- as.data.frame(scale(Auto[,num.cols]))

zfit <- lm(mpg ~ ., data = zauto)

coef(zfit)
##   (Intercept)     cylinders  displacement    horsepower        weight 
##  1.046426e-15 -1.078273e-01  2.667467e-01 -8.359623e-02 -7.045565e-01 
##  acceleration          year        origin 
##  2.848143e-02  3.543429e-01  1.471853e-01
sort(coef(zfit), decreasing = F)
##        weight     cylinders    horsepower   (Intercept)  acceleration 
## -7.045565e-01 -1.078273e-01 -8.359623e-02  1.046426e-15  2.848143e-02 
##        origin  displacement          year 
##  1.471853e-01  2.667467e-01  3.543429e-01
relweights <- function(fit, ...){
  R <- cor(fit$model)
  nvar <- ncol(R)
  rxx <- R[2:nvar, 2:nvar]
  rxy <- R[2:nvar, 1]
  svd <- eigen(rxx)
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)
  lambdasq <- lambda ^ 2
  beta <- solve(lambda) %*% rxy
  rsquare <- colSums(beta ^ 2)
  rawwgt <- lambdasq %*% beta ^ 2
  import <- (rawwgt / rsquare) * 100
  import <- as.data.frame(import)
  row.names(import) <- names(fit$model[2:nvar])
  names(import) <- "Weights"
  import <- import[order(import),1, drop = F]
  dotchart(import$Weights, labels = row.names(import),
           xlab = "% of R-Square", pch = 19,
           main = "Relative importance of Predictor Variables",
           sub = paste("Total R-Square=", round(rsquare, digits = 3)),
           ...)
  return(import)
}


fit_t <- lm(mpg ~ .-cylinders, data = Auto[,num.cols])


relweights(fit_t)

##                Weights
## acceleration  4.091976
## origin       10.538774
## horsepower   14.882644
## cylinders    14.904625
## displacement 14.906457
## year         19.109955
## weight       21.565568