5100Project

Reading in the dataset

Here I’m reading in the dataset and summarizing it to see types (str) as well as get a rough idea of distributions (summary)

cars = read.csv("Cars.csv",header=TRUE)
head(cars)

##          make fuel.type aspiration num.of.doors  body.style drive.wheels
## 1 alfa-romero       gas        std          two convertible          rwd
## 2 alfa-romero       gas        std          two convertible          rwd
## 3 alfa-romero       gas        std          two   hatchback          rwd
## 4        audi       gas        std         four       sedan          fwd
## 5        audi       gas        std         four       sedan          4wd
## 6        audi       gas        std          two       sedan          fwd
##   engine.location wheel.base length width height curb.weight engine.type
## 1           front       88.6  168.8  64.1   48.8        2548        dohc
## 2           front       88.6  168.8  64.1   48.8        2548        dohc
## 3           front       94.5  171.2  65.5   52.4        2823        ohcv
## 4           front       99.8  176.6  66.2   54.3        2337         ohc
## 5           front       99.4  176.6  66.4   54.3        2824         ohc
## 6           front       99.8  177.3  66.3   53.1        2507         ohc
##   num.of.cylinders engine.size fuel.system bore stroke compression.ratio
## 1             four         130        mpfi 3.47   2.68               9.0
## 2             four         130        mpfi 3.47   2.68               9.0
## 3              six         152        mpfi 2.68   3.47               9.0
## 4             four         109        mpfi 3.19   3.40              10.0
## 5             five         136        mpfi 3.19   3.40               8.0
## 6             five         136        mpfi 3.19   3.40               8.5
##   horsepower peak.rpm city.mpg highway.mpg price
## 1        111     5000       21          27 13495
## 2        111     5000       21          27 16500
## 3        154     5000       19          26 16500
## 4        102     5500       24          30 13950
## 5        115     5500       18          22 17450
## 6        110     5500       19          25 15250

summary(cars)

##          make     fuel.type   aspiration  num.of.doors       body.style
##  toyota    :32   diesel: 19   std  :158   four:112     convertible: 6  
##  nissan    :18   gas   :174   turbo: 35   two : 81     hardtop    : 8  
##  honda     :13                                         hatchback  :63  
##  mitsubishi:13                                         sedan      :92  
##  mazda     :12                                         wagon      :24  
##  subaru    :12                                                         
##  (Other)   :93                                                         
##  drive.wheels engine.location   wheel.base         length     
##  4wd:  8      front:190       Min.   : 86.60   Min.   :141.1  
##  fwd:114      rear :  3       1st Qu.: 94.50   1st Qu.:166.3  
##  rwd: 71                      Median : 97.00   Median :173.2  
##                               Mean   : 98.92   Mean   :174.3  
##                               3rd Qu.:102.40   3rd Qu.:184.6  
##                               Max.   :120.90   Max.   :208.1  
##                                                               
##      width           height       curb.weight   engine.type
##  Min.   :60.30   Min.   :47.80   Min.   :1488   dohc: 12   
##  1st Qu.:64.10   1st Qu.:52.00   1st Qu.:2145   l   : 12   
##  Median :65.40   Median :54.10   Median :2414   ohc :141   
##  Mean   :65.89   Mean   :53.87   Mean   :2562   ohcf: 15   
##  3rd Qu.:66.90   3rd Qu.:55.70   3rd Qu.:2952   ohcv: 13   
##  Max.   :72.00   Max.   :59.80   Max.   :4066              
##                                                            
##  num.of.cylinders  engine.size    fuel.system      bore      
##  eight :  4       Min.   : 61.0   1bbl:11     Min.   :2.540  
##  five  : 10       1st Qu.: 98.0   2bbl:64     1st Qu.:3.150  
##  four  :153       Median :120.0   idi :19     Median :3.310  
##  six   : 24       Mean   :128.1   mfi : 1     Mean   :3.331  
##  three :  1       3rd Qu.:146.0   mpfi:88     3rd Qu.:3.590  
##  twelve:  1       Max.   :326.0   spdi: 9     Max.   :3.940  
##                                   spfi: 1                    
##      stroke      compression.ratio   horsepower       peak.rpm   
##  Min.   :2.070   Min.   : 7.00     Min.   : 48.0   Min.   :4150  
##  1st Qu.:3.110   1st Qu.: 8.50     1st Qu.: 70.0   1st Qu.:4800  
##  Median :3.290   Median : 9.00     Median : 95.0   Median :5100  
##  Mean   :3.249   Mean   :10.14     Mean   :103.5   Mean   :5100  
##  3rd Qu.:3.410   3rd Qu.: 9.40     3rd Qu.:116.0   3rd Qu.:5500  
##  Max.   :4.170   Max.   :23.00     Max.   :262.0   Max.   :6600  
##                                                                  
##     city.mpg      highway.mpg        price      
##  Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7738  
##  Median :25.00   Median :30.00   Median :10245  
##  Mean   :25.33   Mean   :30.79   Mean   :13285  
##  3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16515  
##  Max.   :49.00   Max.   :54.00   Max.   :45400  
##

str(cars)

## 'data.frame':    193 obs. of  24 variables:
##  $ make             : Factor w/ 21 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 3 ...
##  $ fuel.type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 1 ...
##  $ num.of.doors     : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
##  $ body.style       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 4 ...
##  $ drive.wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 3 ...
##  $ engine.location  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel.base       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length           : num  169 169 171 177 177 ...
##  $ width            : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 64.8 ...
##  $ height           : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 54.3 ...
##  $ curb.weight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 2395 ...
##  $ engine.type      : Factor w/ 5 levels "dohc","l","ohc",..: 1 1 5 3 3 3 3 3 3 3 ...
##  $ num.of.cylinders : Factor w/ 6 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 3 ...
##  $ engine.size      : int  130 130 152 109 136 136 136 136 131 108 ...
##  $ fuel.system      : Factor w/ 7 levels "1bbl","2bbl",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ bore             : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.5 ...
##  $ stroke           : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 ...
##  $ compression.ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 8.8 ...
##  $ horsepower       : int  111 111 154 102 115 110 110 110 140 101 ...
##  $ peak.rpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5800 ...
##  $ city.mpg         : int  21 21 19 24 18 19 19 19 17 23 ...
##  $ highway.mpg      : int  27 27 26 30 22 25 25 25 20 29 ...
##  $ price            : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 16430 ...

PreProcessing

Here I’m dropping factor variables to clean up the pairs plot a little bit.

drops = c("make","fuel.type","aspiration","num.of.doors","body.style","drive.wheels","engine.location","engine.type","num.of.cylinders","fuel.system")
d_droppedFactors = cars[ , !(names(cars) %in% drops)]

Visualizations

Here’s a pair plot with the factor variables removed from the dataset (Dr. Cutler made this suggestion on Monday, the pairs plot is really ugly if you don’t drop some variables from the dataset)

library(ggplot2)
library(GGally)
# ggpairs(d_droppedFactors)
pairs(d_droppedFactors)

Looking at the scatterplot for size of engine, it might be necessary to transform the data in some way to make our prediction even better. (it looks a little heteroskedastic to me)

p1 <- ggplot(cars, aes(x = engine.size, y = price, color = factor(make)))+ geom_point()
p1 + ggtitle("Price w.r.t Size of Engine")

Here’s a plot with linear regression lines drawn in (include their confidence intervals) to show that the regression lines are different between different makes. I think it’s too ugly to include in the report though

p1 <- ggplot(cars, aes(x = engine.size, y = price, color = factor(make)))+ geom_point() + geom_smooth(method = "lm", se = TRUE)
p1 + ggtitle("Price w.r.t Size of Engine")

## Warning in qt((1 - level)/2, df): NaNs produced

This visualization needs to be cleaned up if we decide to use it in the report. (The x axis is supposed to be the car make, and the boxplots prove that there is a significant difference in price between makes. One things we could do is run pair-wise t-tests to justify this statistically)

ggplot(cars, aes(y = price, x = make)) +scale_x_discrete("Make") + 
        scale_y_continuous("Price") + geom_boxplot(outlier.color="red") + theme(axis.text.x = element_text(size=20))+ggtitle("Price w.r.t Make")

All Subsets

As Sebastian noted, this is VERY slow, I usually cache large/slow computations (see the options in the r chunk)

library(leaps)
leaps<-regsubsets(price~.,data=cars,nbest=10,really.big=TRUE)

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 3 linear dependencies found

## Reordering variables and trying again:

# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="adjr2")

Lasso (with make included)

According to lasso we should be using make as our predictor.

x = model.matrix(price ~ ., cars)[, -1]
y = cars$price
library(glmnet)

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-5

grid = 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x, y, alpha = 1, lambda = grid)
# 
plot(lasso.mod)

# the coefficients are returned from the "coef" function, with variables as rows and lambda values as columns
# dim(coef(lasso.mod))

# cv.glmnet does cv to choose lambda
set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)
# plot(cv.out)

abline(h = cv.out$cvup[which.min(cv.out$cvm)])

bestlam = cv.out$lambda.min
# log(bestlam)
lambda1se = cv.out$lambda.1se
# log(lambda1se)

# The minimum CV:
# bestlam = cv.out$lambda.min

# lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
# lasso.coef
# lasso.coef[lasso.coef != 0]

# The "1 standard error rule"
# bestlam = cv.out$lambda.1se

Here are the coefficients on the lasso regularized model.

lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef

##       (Intercept)          makeaudi           makebmw     makechevrolet 
##        -24829.545          1136.986          6541.172         -4112.390 
##         makedodge         makehonda         makeisuzu        makejaguar 
##         -4787.547         -2395.836         -2939.286          -453.764 
##         makemazda makemercedes-benz       makemercury    makemitsubishi 
##         -1760.107          2781.876         -3255.479         -5473.882 
##        makenissan        makepeugot      makeplymouth       makeporsche 
##         -2407.982         -4189.438         -4942.372          3790.318 
##          makesaab        makesubaru        maketoyota    makevolkswagen 
##          1100.114         -1459.796         -2896.814         -1911.205

lasso.coef[lasso.coef != 0]

##       (Intercept)          makeaudi           makebmw     makechevrolet 
##        -24829.545          1136.986          6541.172         -4112.390 
##         makedodge         makehonda         makeisuzu        makejaguar 
##         -4787.547         -2395.836         -2939.286          -453.764 
##         makemazda makemercedes-benz       makemercury    makemitsubishi 
##         -1760.107          2781.876         -3255.479         -5473.882 
##        makenissan        makepeugot      makeplymouth       makeporsche 
##         -2407.982         -4189.438         -4942.372          3790.318 
##          makesaab        makesubaru        maketoyota    makevolkswagen 
##          1100.114         -1459.796         -2896.814         -1911.205

Lasso (with make excluded)

what happens if we remove the make variable? now what is the most significant?

x = model.matrix(price ~ . -make, cars)[, -1]
y = cars$price
library(glmnet)
grid = 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x, y, alpha = 1, lambda = grid)
# 
plot(lasso.mod)

# the coefficients are returned from the "coef" function, with variables as rows and lambda values as columns
# dim(coef(lasso.mod))

# cv.glmnet does cv to choose lambda
set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)
# plot(cv.out)

abline(h = cv.out$cvup[which.min(cv.out$cvm)])

bestlam = cv.out$lambda.min
# log(bestlam)
lambda1se = cv.out$lambda.1se
# log(lambda1se)

# The minimum CV:
# bestlam = cv.out$lambda.min

# lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
# lasso.coef
# lasso.coef[lasso.coef != 0]

# The "1 standard error rule"
# bestlam = cv.out$lambda.1se

Here are the coefficients on the lasso regularized model.

lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef

##         (Intercept)        fuel.typegas     aspirationturbo 
##        -18201.50191        -10907.65806          1936.20276 
##     num.of.doorstwo   body.stylehardtop body.stylehatchback 
##           348.47928         -3082.98971         -2968.67289 
##     body.stylesedan     body.stylewagon     drive.wheelsfwd 
##         -1870.83018         -2887.42094          -295.84249 
##     drive.wheelsrwd engine.locationrear          wheel.base 
##           511.97881          7847.14410            28.28957 
##              length               width              height 
##           -53.76169           667.68696            53.23290 
##         curb.weight        engine.typel      engine.typeohc 
##             3.79126          -772.13277          3174.40045 
##     engine.typeohcf     engine.typeohcv 
##          1177.86739         -5674.32686

lasso.coef[lasso.coef != 0]

##         (Intercept)        fuel.typegas     aspirationturbo 
##        -18201.50191        -10907.65806          1936.20276 
##     num.of.doorstwo   body.stylehardtop body.stylehatchback 
##           348.47928         -3082.98971         -2968.67289 
##     body.stylesedan     body.stylewagon     drive.wheelsfwd 
##         -1870.83018         -2887.42094          -295.84249 
##     drive.wheelsrwd engine.locationrear          wheel.base 
##           511.97881          7847.14410            28.28957 
##              length               width              height 
##           -53.76169           667.68696            53.23290 
##         curb.weight        engine.typel      engine.typeohc 
##             3.79126          -772.13277          3174.40045 
##     engine.typeohcf     engine.typeohcv 
##          1177.86739         -5674.32686

It looks to me like the model absolutely balloons when we leave the make out of the model and we’d have to include MANY more variables.

Subsetting for CV

Here I subset the data in preparation for training models and testing their accuracy.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(415)
smp_size <- floor(0.90 * nrow(cars))
train_indices = sample(seq_len(nrow(cars)),size=smp_size)
trainData = cars[train_indices,]
testData     = cars[-train_indices,]
nrow(testData)

## [1] 20

nrow(trainData)

## [1] 173

Random Forest Variable Importance

This dataset is already more interesting in that RF and regularization show different variables as being important to the model. RF states engine size as the most important, while the lasso shows make.

library(randomForest)
library(caret)
library(rpart)
modelRF =randomForest(price~., ntree = 250 ,data=trainData)

importance(modelRF)

##                   IncNodePurity
## make                 2176205069
## fuel.type               8558297
## aspiration             22387898
## num.of.doors            5387519
## body.style             46487233
## drive.wheels           92476997
## engine.location         7676605
## wheel.base            207549296
## length                250652064
## width                 449367982
## height                 49145650
## curb.weight          1323121574
## engine.type            32587673
## num.of.cylinders      604216654
## engine.size          2599143915
## fuel.system            91478267
## bore                  117282549
## stroke                 40566152
## compression.ratio      70107973
## horsepower            613176821
## peak.rpm               62641470
## city.mpg              742485900
## highway.mpg           718847547

varImpPlot = varImpPlot(modelRF)

Testing some models

Testing the variables we think are important

Here I’m testing a bunch of different models with a bunch of different variables, including:

Only Make
Only Engine Size
Both Make and Engine Size
Make, Engine Size, and Curb Weight (see the Random Forest variable importance plot)
The variables lasso flagged as important after holding out the make variable

lmOutMake = lm(price~make, data = trainData)
lmOutEngineSize = lm(price~engine.size, data=trainData)
lmOutBoth = lm(price~make+engine.size,data=trainData)
lmOutMakeEngineSizeCurbWeight = lm(price~make+engine.size + curb.weight,data=trainData)

lmOutLassoWithoutMake =lm(price~fuel.type + aspiration+num.of.doors+body.style+drive.wheels+engine.location+wheel.base+length+width+height+curb.weight+engine.type,data=trainData)

summary(lmOutMake)

## 
## Call:
## lm(formula = price ~ make, data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9025.7 -2003.3  -430.7  1519.3 15859.3 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       15498.333   2103.747   7.367 1.04e-11 ***
## makeaudi           2360.833   2576.554   0.916 0.360974    
## makebmw            9957.381   2514.459   3.960 0.000115 ***
## makechevrolet     -9491.333   2975.148  -3.190 0.001727 ** 
## makedodge         -7391.333   2514.459  -2.940 0.003801 ** 
## makehonda         -7202.833   2352.061  -3.062 0.002598 ** 
## makeisuzu         -4450.333   4207.495  -1.058 0.291863    
## makejaguar        19101.667   2975.148   6.420 1.64e-09 ***
## makemazda         -4777.333   2429.198  -1.967 0.051047 .  
## makemercedes-benz 17101.667   2576.554   6.637 5.30e-10 ***
## makemercury        1004.667   4207.495   0.239 0.811598    
## makemitsubishi    -6046.606   2373.347  -2.548 0.011836 *  
## makenissan        -4872.863   2281.834  -2.136 0.034322 *  
## makepeugot           -9.242   2373.347  -0.004 0.996898    
## makeplymouth      -7475.833   2576.554  -2.901 0.004266 ** 
## makeporsche       19029.667   2975.148   6.396 1.86e-09 ***
## makesaab           -275.000   2576.554  -0.107 0.915143    
## makesubaru        -7204.152   2373.347  -3.035 0.002827 ** 
## maketoyota        -6119.678   2209.884  -2.769 0.006319 ** 
## makevolkswagen    -5713.333   2373.347  -2.407 0.017270 *  
## makevolvo          2719.667   2398.641   1.134 0.258648    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3644 on 152 degrees of freedom
## Multiple R-squared:  0.8076, Adjusted R-squared:  0.7822 
## F-statistic: 31.89 on 20 and 152 DF,  p-value: < 2.2e-16

summary(lmOutEngineSize)

## 
## Call:
## lm(formula = price ~ engine.size, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10752.9  -2247.3   -245.8   1489.8  14359.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8407.526    971.191  -8.657 3.49e-15 ***
## engine.size   169.204      7.261  23.304  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3832 on 171 degrees of freedom
## Multiple R-squared:  0.7605, Adjusted R-squared:  0.7591 
## F-statistic: 543.1 on 1 and 171 DF,  p-value: < 2.2e-16

summary(lmOutBoth)

## 
## Call:
## lm(formula = price ~ make + engine.size, data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5120.2 -1176.7  -183.3  1051.1 10432.6 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          18.143   1714.136   0.011  0.99157    
## makeaudi           3112.299   1658.021   1.877  0.06243 .  
## makebmw            7305.781   1627.308   4.489 1.41e-05 ***
## makechevrolet     -3066.303   1962.817  -1.562  0.12034    
## makedodge         -3762.828   1635.995  -2.300  0.02282 *  
## makehonda         -2985.233   1539.767  -1.939  0.05440 .  
## makeisuzu         -2383.803   2709.896  -0.880  0.38044    
## makejaguar         2945.158   2206.373   1.335  0.18394    
## makemazda         -2259.924   1571.798  -1.438  0.15256    
## makemercedes-benz  7689.561   1776.449   4.329 2.72e-05 ***
## makemercury         704.080   2706.329   0.260  0.79509    
## makemitsubishi    -4256.752   1531.373  -2.780  0.00613 ** 
## makenissan        -4013.098   1468.834  -2.732  0.00704 ** 
## makepeugot          161.545   1526.576   0.106  0.91586    
## makeplymouth      -4131.812   1672.751  -2.470  0.01462 *  
## makeporsche       12642.210   1962.250   6.443 1.49e-09 ***
## makesaab           1566.091   1661.953   0.942  0.34753    
## makesubaru        -3784.983   1544.125  -2.451  0.01538 *  
## maketoyota        -3757.744   1430.432  -2.627  0.00950 ** 
## makevolkswagen    -2027.736   1546.954  -1.311  0.19192    
## makevolvo          2148.553   1543.289   1.392  0.16591    
## engine.size         112.720      7.662  14.711  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2344 on 151 degrees of freedom
## Multiple R-squared:  0.9209, Adjusted R-squared:  0.9099 
## F-statistic: 83.73 on 21 and 151 DF,  p-value: < 2.2e-16

summary(lmOutLassoWithoutMake)

## 
## Call:
## lm(formula = price ~ fuel.type + aspiration + num.of.doors + 
##     body.style + drive.wheels + engine.location + wheel.base + 
##     length + width + height + curb.weight + engine.type, data = trainData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5604  -1611   -165   1460  16416 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -54355.834  15453.878  -3.517 0.000574 ***
## fuel.typegas          1224.916    902.083   1.358 0.176502    
## aspirationturbo       -446.876    748.696  -0.597 0.551475    
## num.of.doorstwo       1685.896    718.427   2.347 0.020225 *  
## body.stylehardtop    -6349.274   1956.148  -3.246 0.001439 ** 
## body.stylehatchback  -4387.301   1613.220  -2.720 0.007292 ** 
## body.stylesedan      -1997.656   1703.848  -1.172 0.242844    
## body.stylewagon      -4600.237   1885.644  -2.440 0.015846 *  
## drive.wheelsfwd        502.825   1452.553   0.346 0.729693    
## drive.wheelsrwd       1438.892   1594.315   0.903 0.368202    
## engine.locationrear  21270.973   2520.357   8.440 2.26e-14 ***
## wheel.base             295.601    115.266   2.565 0.011294 *  
## length                -130.503     61.481  -2.123 0.035391 *  
## width                  589.568    302.225   1.951 0.052913 .  
## height                 -83.386    159.352  -0.523 0.601537    
## curb.weight             10.549      1.566   6.736 3.10e-10 ***
## engine.typel         -5219.940   1445.685  -3.611 0.000413 ***
## engine.typeohc         391.469   1108.690   0.353 0.724506    
## engine.typeohcf       -164.144   1532.373  -0.107 0.914836    
## engine.typeohcv        289.297   1415.069   0.204 0.838281    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2968 on 153 degrees of freedom
## Multiple R-squared:  0.8715, Adjusted R-squared:  0.8556 
## F-statistic: 54.62 on 19 and 153 DF,  p-value: < 2.2e-16

Testing with all variables included

lmOutAll = lm(price~., data = trainData)
summary(lmOutAll)

## 
## Call:
## lm(formula = price ~ ., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3262.3  -914.2   -34.1   862.7  8339.0 
## 
## Coefficients: (5 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.042e+04  1.842e+04  -0.566 0.572753    
## makeaudi                3.825e+02  2.579e+03   0.148 0.882319    
## makebmw                 4.233e+03  2.360e+03   1.794 0.075391 .  
## makechevrolet          -5.766e+03  2.300e+03  -2.507 0.013529 *  
## makedodge              -6.505e+03  1.909e+03  -3.407 0.000897 ***
## makehonda              -3.817e+03  2.215e+03  -1.724 0.087374 .  
## makeisuzu              -3.366e+03  2.884e+03  -1.167 0.245490    
## makejaguar             -2.094e+03  2.843e+03  -0.737 0.462731    
## makemazda              -3.100e+03  1.775e+03  -1.746 0.083350 .  
## makemercedes-benz       1.859e+03  2.983e+03   0.623 0.534298    
## makemercury            -5.511e+03  2.992e+03  -1.842 0.067947 .  
## makemitsubishi         -7.314e+03  1.903e+03  -3.843 0.000197 ***
## makenissan             -3.673e+03  1.723e+03  -2.131 0.035130 *  
## makepeugot             -9.382e+03  4.805e+03  -1.953 0.053216 .  
## makeplymouth           -6.561e+03  1.894e+03  -3.465 0.000739 ***
## makeporsche             1.436e+04  2.871e+03   5.000    2e-06 ***
## makesaab               -1.298e+02  2.070e+03  -0.063 0.950111    
## makesubaru             -2.857e+03  2.084e+03  -1.371 0.172960    
## maketoyota             -4.319e+03  1.618e+03  -2.670 0.008637 ** 
## makevolkswagen         -3.452e+03  1.804e+03  -1.913 0.058088 .  
## makevolvo              -3.492e+03  2.255e+03  -1.548 0.124174    
## fuel.typegas           -1.569e+04  7.131e+03  -2.200 0.029749 *  
## aspirationturbo         1.772e+03  8.587e+02   2.063 0.041245 *  
## num.of.doorstwo        -9.438e+01  4.986e+02  -0.189 0.850195    
## body.stylehardtop      -3.176e+03  1.435e+03  -2.213 0.028799 *  
## body.stylehatchback    -2.569e+03  1.295e+03  -1.984 0.049540 *  
## body.stylesedan        -1.841e+03  1.381e+03  -1.333 0.185056    
## body.stylewagon        -2.248e+03  1.508e+03  -1.491 0.138575    
## drive.wheelsfwd        -6.172e+02  9.662e+02  -0.639 0.524208    
## drive.wheelsrwd        -4.099e+02  1.279e+03  -0.320 0.749240    
## engine.locationrear            NA         NA      NA       NA    
## wheel.base              3.142e+02  9.413e+01   3.338 0.001128 ** 
## length                 -1.336e+02  5.295e+01  -2.524 0.012931 *  
## width                   5.945e+02  2.416e+02   2.460 0.015328 *  
## height                 -3.912e+02  1.460e+02  -2.680 0.008409 ** 
## curb.weight             7.165e+00  1.781e+00   4.023 0.000102 ***
## engine.typel            2.995e+03  4.437e+03   0.675 0.500990    
## engine.typeohc          1.360e+03  1.268e+03   1.073 0.285393    
## engine.typeohcf                NA         NA      NA       NA    
## engine.typeohcv        -2.796e+03  1.329e+03  -2.103 0.037538 *  
## num.of.cylindersfive   -6.963e+03  2.949e+03  -2.362 0.019823 *  
## num.of.cylindersfour   -3.485e+03  3.898e+03  -0.894 0.373038    
## num.of.cylinderssix    -2.604e+03  3.177e+03  -0.820 0.414093    
## num.of.cylindersthree          NA         NA      NA       NA    
## num.of.cylinderstwelve  5.810e+01  5.874e+03   0.010 0.992125    
## engine.size             5.905e+01  2.623e+01   2.251 0.026242 *  
## fuel.system2bbl         1.968e+03  1.486e+03   1.325 0.187869    
## fuel.systemidi                 NA         NA      NA       NA    
## fuel.systemmfi         -1.465e+03  2.657e+03  -0.551 0.582450    
## fuel.systemmpfi         9.461e+02  1.576e+03   0.600 0.549494    
## fuel.systemspdi        -5.165e+02  1.884e+03  -0.274 0.784412    
## fuel.systemspfi                NA         NA      NA       NA    
## bore                   -2.492e+03  1.889e+03  -1.319 0.189580    
## stroke                 -1.042e+03  1.012e+03  -1.029 0.305568    
## compression.ratio      -1.105e+03  5.327e+02  -2.074 0.040258 *  
## horsepower              8.347e+00  2.563e+01   0.326 0.745249    
## peak.rpm                2.253e+00  6.648e-01   3.389 0.000953 ***
## city.mpg               -3.217e+01  1.433e+02  -0.225 0.822692    
## highway.mpg             1.443e+02  1.211e+02   1.191 0.235911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1723 on 119 degrees of freedom
## Multiple R-squared:  0.9663, Adjusted R-squared:  0.9513 
## F-statistic: 64.43 on 53 and 119 DF,  p-value: < 2.2e-16

Testing each on the testSet

Looks like the model only including engine size generalizes the best. I used RMSE for my metric of success, but we could decide on another one as a group and use that to guide our analysis.

pMake = predict(lmOutMake,data=testData)
pEngineSize = predict(lmOutEngineSize,data=testData)
pBoth = predict(lmOutBoth,data=testData)
pAll = predict(lmOutAll,data=testData)
pMakeEngineCurb = predict(lmOutMakeEngineSizeCurbWeight,data=testData)
pLassoWithoutMake = predict(lmOutLassoWithoutMake,data=testData)


rmseMake <- sqrt(mean((pMake - testData$price)^2))

## Warning in pMake - testData$price: longer object length is not a multiple
## of shorter object length

rmseMake

## [1] 12975.97

rmseEngineSize <- sqrt(mean((pEngineSize - testData$price)^2))

## Warning in pEngineSize - testData$price: longer object length is not a
## multiple of shorter object length

rmseEngineSize

## [1] 12879.55

rmseBoth <- sqrt(mean((pBoth - testData$price)^2))

## Warning in pBoth - testData$price: longer object length is not a multiple
## of shorter object length

rmseBoth

## [1] 13269.54

rmseAll <- sqrt(mean((pAll - testData$price)^2))

## Warning in pAll - testData$price: longer object length is not a multiple of
## shorter object length

rmseAll

## [1] 13352.2

rmseMEC <- sqrt(mean((pMakeEngineCurb - testData$price)^2))

## Warning in pMakeEngineCurb - testData$price: longer object length is not a
## multiple of shorter object length

rmseMEC

## [1] 13344.66

rmseLWOM = sqrt(mean((pLassoWithoutMake - testData$price)^2))

## Warning in pLassoWithoutMake - testData$price: longer object length is not
## a multiple of shorter object length

rmseLWOM

## [1] 13191.68

How does random forests fare?

How does Random Forests compare with the best lienar regression model?

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)
library(rpart)
modelRFAll =randomForest(price~., ntree = 250 ,data=trainData)
modelRFEngineSize = randomForest(price~engine.size,ntree=250,data=trainData)
modelRFMake = randomForest(price~make,ntree=250,data=trainData)
modelRFBoth = randomForest(price~make+engine.size,ntree=250,data=trainData)

pMakeRF = predict(modelRFMake,data=testData)
pEngineSizeRF = predict(modelRFEngineSize,data=testData)
pBothRF = predict(modelRFBoth,data=testData)
pAllRF = predict(modelRFAll,data=testData)

rmseMakeRF <- sqrt(mean((pMakeRF - testData$price)^2))

## Warning in pMakeRF - testData$price: longer object length is not a multiple
## of shorter object length

rmseMakeRF

## [1] 12788.48

rmseEngineSizeRF <- sqrt(mean((pEngineSizeRF - testData$price)^2))

## Warning in pEngineSizeRF - testData$price: longer object length is not a
## multiple of shorter object length

rmseEngineSizeRF

## [1] 13219.01

rmseBothRF <- sqrt(mean((pBothRF - testData$price)^2))

## Warning in pBothRF - testData$price: longer object length is not a multiple
## of shorter object length

rmseBothRF

## [1] 13092.04

rmseAllRF <- sqrt(mean((pAllRF - testData$price)^2))

## Warning in pAllRF - testData$price: longer object length is not a multiple
## of shorter object length

rmseAllRF

## [1] 12979.18

Transformation

It doesn’t look like a log transformation really helps the engine size variable in regard to prediction.

loggedTrainData = trainData
loggedTrainData$engine.size = log(loggedTrainData$engine.size)
loggedTestData = testData
loggedTestData$engine.size = log(loggedTestData$engine.size)

lmLogged = lm(price ~ engine.size, data = loggedTrainData)
pLogged = predict(lmLogged,data=loggedTestData)
rmseLogged <- sqrt(mean((pLogged - loggedTestData$price)^2))

## Warning in pLogged - loggedTestData$price: longer object length is not a
## multiple of shorter object length

rmseLogged

## [1] 12880.03

Summary

Looks like RF using make as the sole predictor does the best on the test set, and the linear regression model using engine size as the sole predictor does the best on the test set.

I think these are the correct answers, but we still need to motivate these decisions using various selection methods, visualizations, and analyses.

Things from her rubric I’m going to try and address this weekend:

Model Assumptions (show normalized residuals, QQplots, etc. This will need to be done after we select a model as a group)
More graphs (again after we decide on the best model)
Dealing with outliers (I usually use R’s boxplot code for this, but I asked her about it earlier in the semester and she didn’t like it. Suggested just taking off the top and bottom 5% of variables). Could include Cook’s D plots here.
Multicollinearity. Thinking about it now, this is kind of addressed in models like the lmOutBoth, lmOutLassoWithoutMake, and lmOutAll models. We can use them to explain why they’re performing worse on the test set than the models with just engine size and make.
Interactions (see multicollinearity section)

I think everything else is kind of done, but obviously needs polished and combed over by everyone else.