An analysis of the Auto dataset, from the ISLR package (http://www-bcf.usc.edu/~gareth/ISL/data.html) on 7 March 2015.

The dataset covers miles per gallon, cylinders, displacement, horsepower, weight, acceleration, year, origin of 397 cars.

The aim of the analysis is to answer the following question: Is miles per gallon (mpg) a function of the remaining variables in the dataset?

setwd("~/MDSI/STATS")
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
library(coefplot)
library(car)
#assumes auto.csv is in your working directory.
auto = read.csv("auto.csv") 
str(auto)
## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(auto)
##       mpg         cylinders     displacement   horsepower      weight    
##  Min.   : 9.0   Min.   :3.00   Min.   : 68   150    : 22   Min.   :1613  
##  1st Qu.:17.5   1st Qu.:4.00   1st Qu.:104   90     : 20   1st Qu.:2223  
##  Median :23.0   Median :4.00   Median :146   88     : 19   Median :2800  
##  Mean   :23.5   Mean   :5.46   Mean   :194   110    : 18   Mean   :2970  
##  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:262   100    : 17   3rd Qu.:3609  
##  Max.   :46.6   Max.   :8.00   Max.   :455   75     : 14   Max.   :5140  
##                                              (Other):287                 
##   acceleration       year        origin                 name    
##  Min.   : 8.0   Min.   :70   Min.   :1.00   ford pinto    :  6  
##  1st Qu.:13.8   1st Qu.:73   1st Qu.:1.00   amc matador   :  5  
##  Median :15.5   Median :76   Median :1.00   ford maverick :  5  
##  Mean   :15.6   Mean   :76   Mean   :1.57   toyota corolla:  5  
##  3rd Qu.:17.1   3rd Qu.:79   3rd Qu.:2.00   amc gremlin   :  4  
##  Max.   :24.8   Max.   :82   Max.   :3.00   amc hornet    :  4  
##                                             (Other)       :368
head(auto, n=10)
##    mpg cylinders displacement horsepower weight acceleration year origin
## 1   18         8          307        130   3504         12.0   70      1
## 2   15         8          350        165   3693         11.5   70      1
## 3   18         8          318        150   3436         11.0   70      1
## 4   16         8          304        150   3433         12.0   70      1
## 5   17         8          302        140   3449         10.5   70      1
## 6   15         8          429        198   4341         10.0   70      1
## 7   14         8          454        220   4354          9.0   70      1
## 8   14         8          440        215   4312          8.5   70      1
## 9   14         8          455        225   4425         10.0   70      1
## 10  15         8          390        190   3850          8.5   70      1
##                         name
## 1  chevrolet chevelle malibu
## 2          buick skylark 320
## 3         plymouth satellite
## 4              amc rebel sst
## 5                ford torino
## 6           ford galaxie 500
## 7           chevrolet impala
## 8          plymouth fury iii
## 9           pontiac catalina
## 10        amc ambassador dpl

Summaries indicate no missing data.

Horsepower was read in as a factor - convert to an integer given the number of levels (94).

auto$horsepower = as.integer(auto$horsepower)
colnames(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"
#select the columns we are interested in - I have assumed that names is not a factor that should influence mpg.
#While origin is also potentially a predictive variable, I have been unable to ascertain what the individual
#levels represent and have therefore excluded them.

auto = select(auto, mpg, cylinders, displacement, horsepower, weight, acceleration, year)
head(auto, n = 10)
##    mpg cylinders displacement horsepower weight acceleration year
## 1   18         8          307         17   3504         12.0   70
## 2   15         8          350         35   3693         11.5   70
## 3   18         8          318         29   3436         11.0   70
## 4   16         8          304         29   3433         12.0   70
## 5   17         8          302         24   3449         10.5   70
## 6   15         8          429         42   4341         10.0   70
## 7   14         8          454         47   4354          9.0   70
## 8   14         8          440         46   4312          8.5   70
## 9   14         8          455         48   4425         10.0   70
## 10  15         8          390         40   3850          8.5   70

Exploratory analysis:

plot(auto) 

plot of chunk unnamed-chunk-3

Examine the correlations between variables and then examine relationships 1 by 1.

correlations = cor(auto)
corrplot(correlations)

plot of chunk unnamed-chunk-4

Displacement and cylinders are highly correlated. Weight is highly positively correlated with cylinders and displacement.

Exploring some of the relationships between variables:

#Some plots between independent variables
ggplot(auto, aes(x = displacement, y =cylinders)) + geom_point()

plot of chunk unnamed-chunk-5

ggplot(auto, aes(x = horsepower, y =cylinders)) + geom_point()

plot of chunk unnamed-chunk-5

ggplot(auto, aes(x = weight, y =cylinders)) + geom_point()

plot of chunk unnamed-chunk-5

#Quick plots between mpg and single variables.
ggplot(auto, aes(x = mpg, y = weight)) + geom_point(aes(color = cylinders))

plot of chunk unnamed-chunk-5

ggplot(auto, aes(x = mpg, y = weight)) + geom_point(aes(color = horsepower))

plot of chunk unnamed-chunk-5

ggplot(auto, aes(x = mpg, y = cylinders)) + geom_point()

plot of chunk unnamed-chunk-5

ggplot(auto, aes(x = mpg, y = displacement)) + geom_point()

plot of chunk unnamed-chunk-5

#Check similarity of cylinders and displacement
ggplot(auto, aes(x = mpg, y = acceleration)) + geom_point(aes(color = cylinders))

plot of chunk unnamed-chunk-5

ggplot(auto, aes(x = mpg, y = acceleration)) + geom_point(aes(color = displacement))

plot of chunk unnamed-chunk-5

The similarity of the last two charts combined with the correlation of cylinders and displacement indicates that cylinders and displacement are unlikely to both be required in the final model.

The weight ~ mpg scatterplot indicates that the relationship may not be linear and the variable may require some transformation.

Creating the Linear Model:

I have used all variables in the initial model.

autoModel1 = lm(mpg ~ cylinders + horsepower + weight + displacement + year + acceleration, data = auto)
summary(autoModel1)
## 
## Call:
## lm(formula = mpg ~ cylinders + horsepower + weight + displacement + 
##     year + acceleration, data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.750 -2.421 -0.087  1.982 14.309 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.64e+01   4.27e+00   -3.84  0.00014 ***
## cylinders    -1.57e-01   3.47e-01   -0.45  0.65179    
## horsepower    8.98e-03   7.01e-03    1.28  0.20113    
## weight       -6.83e-03   5.97e-04  -11.43  < 2e-16 ***
## displacement  6.30e-03   7.23e-03    0.87  0.38435    
## year          7.64e-01   5.08e-02   15.03  < 2e-16 ***
## acceleration  8.38e-02   7.85e-02    1.07  0.28641    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 390 degrees of freedom
## Multiple R-squared:  0.811,  Adjusted R-squared:  0.808 
## F-statistic:  278 on 6 and 390 DF,  p-value: <2e-16

Overall, autoModel1 is predictive, with an adjusted r squared of 80.8% and a low p value for the model. This is interpreted to mean that 80.8% of the variance in mpg is explained by the predictors in the model. The small p value of the overall model indicates that R squared is significantly different from zero.

Weight and Year are the only two significant variables at the 5% level (both are predictive at a less than .01% level).

We can also plot the coefficients to see a visual representation of the confidence intervals for the coefficients:

coefplot(autoModel1)

plot of chunk unnamed-chunk-7

Remove the variables with high p values to see if the overall explanatory power of the model is lowered. Adding them individually to the base model to see if the coefficients are significant.

autoModel2 = lm(mpg ~ weight + year + cylinders, data = auto)
summary(autoModel2)
## 
## Call:
## lm(formula = mpg ~ weight + year + cylinders, data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.991 -2.304 -0.081  2.016 14.328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.43e+01   4.04e+00   -3.54  0.00045 ***
## weight      -6.45e-03   4.61e-04  -14.01  < 2e-16 ***
## year         7.58e-01   4.99e-02   15.20  < 2e-16 ***
## cylinders   -1.17e-01   2.33e-01   -0.50  0.61583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 393 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.807 
## F-statistic:  554 on 3 and 393 DF,  p-value: <2e-16
autoModel2 = lm(mpg ~ weight + year + horsepower, data = auto)
summary(autoModel2)
## 
## Call:
## lm(formula = mpg ~ weight + year + horsepower, data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.059 -2.382 -0.141  1.884 14.342 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.58e+01   4.04e+00   -3.91  0.00011 ***
## weight      -6.48e-03   2.42e-04  -26.83  < 2e-16 ***
## year         7.64e-01   4.90e-02   15.58  < 2e-16 ***
## horsepower   1.00e-02   6.58e-03    1.52  0.12827    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 393 degrees of freedom
## Multiple R-squared:  0.81,   Adjusted R-squared:  0.808 
## F-statistic:  558 on 3 and 393 DF,  p-value: <2e-16
autoModel2 = lm(mpg ~ weight + year + displacement, data = auto)
summary(autoModel2)
## 
## Call:
## lm(formula = mpg ~ weight + year + displacement, data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.835 -2.284 -0.135  2.056 14.307 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.47e+01   3.99e+00   -3.68  0.00026 ***
## weight       -6.75e-03   5.70e-04  -11.84  < 2e-16 ***
## year          7.64e-01   5.07e-02   15.09  < 2e-16 ***
## displacement  8.07e-04   4.74e-03    0.17  0.86482    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.44 on 393 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.807 
## F-statistic:  554 on 3 and 393 DF,  p-value: <2e-16
autoModel2 = lm(mpg ~ weight + year + acceleration, data = auto)
summary(autoModel2)
## 
## Call:
## lm(formula = mpg ~ weight + year + acceleration, data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.691 -2.357 -0.139  2.013 14.253 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -15.26252    4.03360   -3.78  0.00018 ***
## weight        -0.00658    0.00023  -28.66  < 2e-16 ***
## year           0.75415    0.04992   15.11  < 2e-16 ***
## acceleration   0.06463    0.07020    0.92  0.35786    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 393 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.808 
## F-statistic:  555 on 3 and 393 DF,  p-value: <2e-16

The coefficients of cylinders, horsepower, displacement and acceleration are not statistically significant when added to the model.

Final model:

autoModel3 = lm(mpg ~ weight + year, data = auto)
summary(autoModel3)
## 
## Call:
## lm(formula = mpg ~ weight + year, data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.865 -2.305 -0.128  2.037 14.306 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.47e+01   3.98e+00   -3.68  0.00026 ***
## weight      -6.66e-03   2.14e-04  -31.14  < 2e-16 ***
## year         7.62e-01   4.91e-02   15.52  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 394 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.808 
## F-statistic:  833 on 2 and 394 DF,  p-value: <2e-16

Removing cylinders, horsepower and acceleration has no effect on the overall explanatory power of the model. Adjusted r squared for autoModel2 is 80.8%. Again, weight and year are highly significant.

The model is interpreted to mean that mpg = -14.650237 - 0.006655 * weight + 0.762334 * year.

Plot the coefficients:

coefplot(autoModel3)

plot of chunk unnamed-chunk-10

Evaluating the appropriateness of the model:

par(mfrow=c(2,2))
plot(autoModel3)

plot of chunk unnamed-chunk-11

The normal Q-Q plot would be a 45 degree line if the residual values are normally distributed with a mean of zero. The plot indicates that this is not the case.

If the dependent variables are linearly related to the independent variables, then the Residuals vs Fitted Values chart (upper left chart) should show no systematic relationship between the fitted values and the residuals. The shape of the graph indicates that this is not the case, with the shape of the curve indicating that a transformation of one or more variables would be appropriate.

Evidence of non-linearity between the dependent variable and independent variables can be assessed using component plus residual plots:

#using the crPlots function in the "car" package:
crPlots(autoModel3)

plot of chunk unnamed-chunk-12

The above chart indicates that the weight variable may need to be transformed in order to improve the appropriateness of the model.