Here I’m reading in the dataset and summarizing it to see types (str) as well as get a rough idea of distributions (summary)
cars = read.csv("Cars.csv",header=TRUE)
head(cars)
## make fuel.type aspiration num.of.doors body.style drive.wheels
## 1 alfa-romero gas std two convertible rwd
## 2 alfa-romero gas std two convertible rwd
## 3 alfa-romero gas std two hatchback rwd
## 4 audi gas std four sedan fwd
## 5 audi gas std four sedan 4wd
## 6 audi gas std two sedan fwd
## engine.location wheel.base length width height curb.weight engine.type
## 1 front 88.6 168.8 64.1 48.8 2548 dohc
## 2 front 88.6 168.8 64.1 48.8 2548 dohc
## 3 front 94.5 171.2 65.5 52.4 2823 ohcv
## 4 front 99.8 176.6 66.2 54.3 2337 ohc
## 5 front 99.4 176.6 66.4 54.3 2824 ohc
## 6 front 99.8 177.3 66.3 53.1 2507 ohc
## num.of.cylinders engine.size fuel.system bore stroke compression.ratio
## 1 four 130 mpfi 3.47 2.68 9.0
## 2 four 130 mpfi 3.47 2.68 9.0
## 3 six 152 mpfi 2.68 3.47 9.0
## 4 four 109 mpfi 3.19 3.40 10.0
## 5 five 136 mpfi 3.19 3.40 8.0
## 6 five 136 mpfi 3.19 3.40 8.5
## horsepower peak.rpm city.mpg highway.mpg price
## 1 111 5000 21 27 13495
## 2 111 5000 21 27 16500
## 3 154 5000 19 26 16500
## 4 102 5500 24 30 13950
## 5 115 5500 18 22 17450
## 6 110 5500 19 25 15250
summary(cars)
## make fuel.type aspiration num.of.doors body.style
## toyota :32 diesel: 19 std :158 four:112 convertible: 6
## nissan :18 gas :174 turbo: 35 two : 81 hardtop : 8
## honda :13 hatchback :63
## mitsubishi:13 sedan :92
## mazda :12 wagon :24
## subaru :12
## (Other) :93
## drive.wheels engine.location wheel.base length
## 4wd: 8 front:190 Min. : 86.60 Min. :141.1
## fwd:114 rear : 3 1st Qu.: 94.50 1st Qu.:166.3
## rwd: 71 Median : 97.00 Median :173.2
## Mean : 98.92 Mean :174.3
## 3rd Qu.:102.40 3rd Qu.:184.6
## Max. :120.90 Max. :208.1
##
## width height curb.weight engine.type
## Min. :60.30 Min. :47.80 Min. :1488 dohc: 12
## 1st Qu.:64.10 1st Qu.:52.00 1st Qu.:2145 l : 12
## Median :65.40 Median :54.10 Median :2414 ohc :141
## Mean :65.89 Mean :53.87 Mean :2562 ohcf: 15
## 3rd Qu.:66.90 3rd Qu.:55.70 3rd Qu.:2952 ohcv: 13
## Max. :72.00 Max. :59.80 Max. :4066
##
## num.of.cylinders engine.size fuel.system bore
## eight : 4 Min. : 61.0 1bbl:11 Min. :2.540
## five : 10 1st Qu.: 98.0 2bbl:64 1st Qu.:3.150
## four :153 Median :120.0 idi :19 Median :3.310
## six : 24 Mean :128.1 mfi : 1 Mean :3.331
## three : 1 3rd Qu.:146.0 mpfi:88 3rd Qu.:3.590
## twelve: 1 Max. :326.0 spdi: 9 Max. :3.940
## spfi: 1
## stroke compression.ratio horsepower peak.rpm
## Min. :2.070 Min. : 7.00 Min. : 48.0 Min. :4150
## 1st Qu.:3.110 1st Qu.: 8.50 1st Qu.: 70.0 1st Qu.:4800
## Median :3.290 Median : 9.00 Median : 95.0 Median :5100
## Mean :3.249 Mean :10.14 Mean :103.5 Mean :5100
## 3rd Qu.:3.410 3rd Qu.: 9.40 3rd Qu.:116.0 3rd Qu.:5500
## Max. :4.170 Max. :23.00 Max. :262.0 Max. :6600
##
## city.mpg highway.mpg price
## Min. :13.00 Min. :16.00 Min. : 5118
## 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7738
## Median :25.00 Median :30.00 Median :10245
## Mean :25.33 Mean :30.79 Mean :13285
## 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16515
## Max. :49.00 Max. :54.00 Max. :45400
##
str(cars)
## 'data.frame': 193 obs. of 24 variables:
## $ make : Factor w/ 21 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 3 ...
## $ fuel.type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 1 ...
## $ num.of.doors : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
## $ body.style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 4 ...
## $ drive.wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 3 ...
## $ engine.location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel.base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 64.8 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 54.3 ...
## $ curb.weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 2395 ...
## $ engine.type : Factor w/ 5 levels "dohc","l","ohc",..: 1 1 5 3 3 3 3 3 3 3 ...
## $ num.of.cylinders : Factor w/ 6 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 3 ...
## $ engine.size : int 130 130 152 109 136 136 136 136 131 108 ...
## $ fuel.system : Factor w/ 7 levels "1bbl","2bbl",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ bore : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.5 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 ...
## $ compression.ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 8.8 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 101 ...
## $ peak.rpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5800 ...
## $ city.mpg : int 21 21 19 24 18 19 19 19 17 23 ...
## $ highway.mpg : int 27 27 26 30 22 25 25 25 20 29 ...
## $ price : int 13495 16500 16500 13950 17450 15250 17710 18920 23875 16430 ...
Here I’m dropping factor variables to clean up the pairs plot a little bit.
drops = c("make","fuel.type","aspiration","num.of.doors","body.style","drive.wheels","engine.location","engine.type","num.of.cylinders","fuel.system")
d_droppedFactors = cars[ , !(names(cars) %in% drops)]
Here’s a pair plot with the factor variables removed from the dataset (Dr. Cutler made this suggestion on Monday, the pairs plot is really ugly if you don’t drop some variables from the dataset)
library(ggplot2)
library(GGally)
# ggpairs(d_droppedFactors)
pairs(d_droppedFactors)
Looking at the scatterplot for size of engine, it might be necessary to transform the data in some way to make our prediction even better. (it looks a little heteroskedastic to me)
p1 <- ggplot(cars, aes(x = engine.size, y = price, color = factor(make)))+ geom_point()
p1 + ggtitle("Price w.r.t Size of Engine")
Here’s a plot with linear regression lines drawn in (include their confidence intervals) to show that the regression lines are different between different makes. I think it’s too ugly to include in the report though
p1 <- ggplot(cars, aes(x = engine.size, y = price, color = factor(make)))+ geom_point() + geom_smooth(method = "lm", se = TRUE)
p1 + ggtitle("Price w.r.t Size of Engine")
## Warning in qt((1 - level)/2, df): NaNs produced
This visualization needs to be cleaned up if we decide to use it in the report. (The x axis is supposed to be the car make, and the boxplots prove that there is a significant difference in price between makes. One things we could do is run pair-wise t-tests to justify this statistically)
ggplot(cars, aes(y = price, x = make)) +scale_x_discrete("Make") +
scale_y_continuous("Price") + geom_boxplot(outlier.color="red") + theme(axis.text.x = element_text(size=20))+ggtitle("Price w.r.t Make")
As Sebastian noted, this is VERY slow, I usually cache large/slow computations (see the options in the r chunk)
library(leaps)
leaps<-regsubsets(price~.,data=cars,nbest=10,really.big=TRUE)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 3 linear dependencies found
## Reordering variables and trying again:
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
plot(leaps,scale="adjr2")
According to lasso we should be using make as our predictor.
x = model.matrix(price ~ ., cars)[, -1]
y = cars$price
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
grid = 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x, y, alpha = 1, lambda = grid)
#
plot(lasso.mod)
# the coefficients are returned from the "coef" function, with variables as rows and lambda values as columns
# dim(coef(lasso.mod))
# cv.glmnet does cv to choose lambda
set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)
# plot(cv.out)
abline(h = cv.out$cvup[which.min(cv.out$cvm)])
bestlam = cv.out$lambda.min
# log(bestlam)
lambda1se = cv.out$lambda.1se
# log(lambda1se)
# The minimum CV:
# bestlam = cv.out$lambda.min
# lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
# lasso.coef
# lasso.coef[lasso.coef != 0]
# The "1 standard error rule"
# bestlam = cv.out$lambda.1se
Here are the coefficients on the lasso regularized model.
lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef
## (Intercept) makeaudi makebmw makechevrolet
## -24829.545 1136.986 6541.172 -4112.390
## makedodge makehonda makeisuzu makejaguar
## -4787.547 -2395.836 -2939.286 -453.764
## makemazda makemercedes-benz makemercury makemitsubishi
## -1760.107 2781.876 -3255.479 -5473.882
## makenissan makepeugot makeplymouth makeporsche
## -2407.982 -4189.438 -4942.372 3790.318
## makesaab makesubaru maketoyota makevolkswagen
## 1100.114 -1459.796 -2896.814 -1911.205
lasso.coef[lasso.coef != 0]
## (Intercept) makeaudi makebmw makechevrolet
## -24829.545 1136.986 6541.172 -4112.390
## makedodge makehonda makeisuzu makejaguar
## -4787.547 -2395.836 -2939.286 -453.764
## makemazda makemercedes-benz makemercury makemitsubishi
## -1760.107 2781.876 -3255.479 -5473.882
## makenissan makepeugot makeplymouth makeporsche
## -2407.982 -4189.438 -4942.372 3790.318
## makesaab makesubaru maketoyota makevolkswagen
## 1100.114 -1459.796 -2896.814 -1911.205
what happens if we remove the make variable? now what is the most significant?
x = model.matrix(price ~ . -make, cars)[, -1]
y = cars$price
library(glmnet)
grid = 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x, y, alpha = 1, lambda = grid)
#
plot(lasso.mod)
# the coefficients are returned from the "coef" function, with variables as rows and lambda values as columns
# dim(coef(lasso.mod))
# cv.glmnet does cv to choose lambda
set.seed(1)
cv.out = cv.glmnet(x, y, alpha = 1)
# plot(cv.out)
abline(h = cv.out$cvup[which.min(cv.out$cvm)])
bestlam = cv.out$lambda.min
# log(bestlam)
lambda1se = cv.out$lambda.1se
# log(lambda1se)
# The minimum CV:
# bestlam = cv.out$lambda.min
# lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
# lasso.coef
# lasso.coef[lasso.coef != 0]
# The "1 standard error rule"
# bestlam = cv.out$lambda.1se
Here are the coefficients on the lasso regularized model.
lasso.coef = predict(cv.out, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef
## (Intercept) fuel.typegas aspirationturbo
## -18201.50191 -10907.65806 1936.20276
## num.of.doorstwo body.stylehardtop body.stylehatchback
## 348.47928 -3082.98971 -2968.67289
## body.stylesedan body.stylewagon drive.wheelsfwd
## -1870.83018 -2887.42094 -295.84249
## drive.wheelsrwd engine.locationrear wheel.base
## 511.97881 7847.14410 28.28957
## length width height
## -53.76169 667.68696 53.23290
## curb.weight engine.typel engine.typeohc
## 3.79126 -772.13277 3174.40045
## engine.typeohcf engine.typeohcv
## 1177.86739 -5674.32686
lasso.coef[lasso.coef != 0]
## (Intercept) fuel.typegas aspirationturbo
## -18201.50191 -10907.65806 1936.20276
## num.of.doorstwo body.stylehardtop body.stylehatchback
## 348.47928 -3082.98971 -2968.67289
## body.stylesedan body.stylewagon drive.wheelsfwd
## -1870.83018 -2887.42094 -295.84249
## drive.wheelsrwd engine.locationrear wheel.base
## 511.97881 7847.14410 28.28957
## length width height
## -53.76169 667.68696 53.23290
## curb.weight engine.typel engine.typeohc
## 3.79126 -772.13277 3174.40045
## engine.typeohcf engine.typeohcv
## 1177.86739 -5674.32686
It looks to me like the model absolutely balloons when we leave the make out of the model and we’d have to include MANY more variables.
Here I subset the data in preparation for training models and testing their accuracy.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(415)
smp_size <- floor(0.90 * nrow(cars))
train_indices = sample(seq_len(nrow(cars)),size=smp_size)
trainData = cars[train_indices,]
testData = cars[-train_indices,]
nrow(testData)
## [1] 20
nrow(trainData)
## [1] 173
This dataset is already more interesting in that RF and regularization show different variables as being important to the model. RF states engine size as the most important, while the lasso shows make.
library(randomForest)
library(caret)
library(rpart)
modelRF =randomForest(price~., ntree = 250 ,data=trainData)
importance(modelRF)
## IncNodePurity
## make 2176205069
## fuel.type 8558297
## aspiration 22387898
## num.of.doors 5387519
## body.style 46487233
## drive.wheels 92476997
## engine.location 7676605
## wheel.base 207549296
## length 250652064
## width 449367982
## height 49145650
## curb.weight 1323121574
## engine.type 32587673
## num.of.cylinders 604216654
## engine.size 2599143915
## fuel.system 91478267
## bore 117282549
## stroke 40566152
## compression.ratio 70107973
## horsepower 613176821
## peak.rpm 62641470
## city.mpg 742485900
## highway.mpg 718847547
varImpPlot = varImpPlot(modelRF)
Here I’m testing a bunch of different models with a bunch of different variables, including:
Only Make
Only Engine Size
Both Make and Engine Size
Make, Engine Size, and Curb Weight (see the Random Forest variable importance plot)
The variables lasso flagged as important after holding out the make variable
lmOutMake = lm(price~make, data = trainData)
lmOutEngineSize = lm(price~engine.size, data=trainData)
lmOutBoth = lm(price~make+engine.size,data=trainData)
lmOutMakeEngineSizeCurbWeight = lm(price~make+engine.size + curb.weight,data=trainData)
lmOutLassoWithoutMake =lm(price~fuel.type + aspiration+num.of.doors+body.style+drive.wheels+engine.location+wheel.base+length+width+height+curb.weight+engine.type,data=trainData)
summary(lmOutMake)
##
## Call:
## lm(formula = price ~ make, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9025.7 -2003.3 -430.7 1519.3 15859.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15498.333 2103.747 7.367 1.04e-11 ***
## makeaudi 2360.833 2576.554 0.916 0.360974
## makebmw 9957.381 2514.459 3.960 0.000115 ***
## makechevrolet -9491.333 2975.148 -3.190 0.001727 **
## makedodge -7391.333 2514.459 -2.940 0.003801 **
## makehonda -7202.833 2352.061 -3.062 0.002598 **
## makeisuzu -4450.333 4207.495 -1.058 0.291863
## makejaguar 19101.667 2975.148 6.420 1.64e-09 ***
## makemazda -4777.333 2429.198 -1.967 0.051047 .
## makemercedes-benz 17101.667 2576.554 6.637 5.30e-10 ***
## makemercury 1004.667 4207.495 0.239 0.811598
## makemitsubishi -6046.606 2373.347 -2.548 0.011836 *
## makenissan -4872.863 2281.834 -2.136 0.034322 *
## makepeugot -9.242 2373.347 -0.004 0.996898
## makeplymouth -7475.833 2576.554 -2.901 0.004266 **
## makeporsche 19029.667 2975.148 6.396 1.86e-09 ***
## makesaab -275.000 2576.554 -0.107 0.915143
## makesubaru -7204.152 2373.347 -3.035 0.002827 **
## maketoyota -6119.678 2209.884 -2.769 0.006319 **
## makevolkswagen -5713.333 2373.347 -2.407 0.017270 *
## makevolvo 2719.667 2398.641 1.134 0.258648
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3644 on 152 degrees of freedom
## Multiple R-squared: 0.8076, Adjusted R-squared: 0.7822
## F-statistic: 31.89 on 20 and 152 DF, p-value: < 2.2e-16
summary(lmOutEngineSize)
##
## Call:
## lm(formula = price ~ engine.size, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10752.9 -2247.3 -245.8 1489.8 14359.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8407.526 971.191 -8.657 3.49e-15 ***
## engine.size 169.204 7.261 23.304 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3832 on 171 degrees of freedom
## Multiple R-squared: 0.7605, Adjusted R-squared: 0.7591
## F-statistic: 543.1 on 1 and 171 DF, p-value: < 2.2e-16
summary(lmOutBoth)
##
## Call:
## lm(formula = price ~ make + engine.size, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5120.2 -1176.7 -183.3 1051.1 10432.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.143 1714.136 0.011 0.99157
## makeaudi 3112.299 1658.021 1.877 0.06243 .
## makebmw 7305.781 1627.308 4.489 1.41e-05 ***
## makechevrolet -3066.303 1962.817 -1.562 0.12034
## makedodge -3762.828 1635.995 -2.300 0.02282 *
## makehonda -2985.233 1539.767 -1.939 0.05440 .
## makeisuzu -2383.803 2709.896 -0.880 0.38044
## makejaguar 2945.158 2206.373 1.335 0.18394
## makemazda -2259.924 1571.798 -1.438 0.15256
## makemercedes-benz 7689.561 1776.449 4.329 2.72e-05 ***
## makemercury 704.080 2706.329 0.260 0.79509
## makemitsubishi -4256.752 1531.373 -2.780 0.00613 **
## makenissan -4013.098 1468.834 -2.732 0.00704 **
## makepeugot 161.545 1526.576 0.106 0.91586
## makeplymouth -4131.812 1672.751 -2.470 0.01462 *
## makeporsche 12642.210 1962.250 6.443 1.49e-09 ***
## makesaab 1566.091 1661.953 0.942 0.34753
## makesubaru -3784.983 1544.125 -2.451 0.01538 *
## maketoyota -3757.744 1430.432 -2.627 0.00950 **
## makevolkswagen -2027.736 1546.954 -1.311 0.19192
## makevolvo 2148.553 1543.289 1.392 0.16591
## engine.size 112.720 7.662 14.711 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2344 on 151 degrees of freedom
## Multiple R-squared: 0.9209, Adjusted R-squared: 0.9099
## F-statistic: 83.73 on 21 and 151 DF, p-value: < 2.2e-16
summary(lmOutLassoWithoutMake)
##
## Call:
## lm(formula = price ~ fuel.type + aspiration + num.of.doors +
## body.style + drive.wheels + engine.location + wheel.base +
## length + width + height + curb.weight + engine.type, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5604 -1611 -165 1460 16416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -54355.834 15453.878 -3.517 0.000574 ***
## fuel.typegas 1224.916 902.083 1.358 0.176502
## aspirationturbo -446.876 748.696 -0.597 0.551475
## num.of.doorstwo 1685.896 718.427 2.347 0.020225 *
## body.stylehardtop -6349.274 1956.148 -3.246 0.001439 **
## body.stylehatchback -4387.301 1613.220 -2.720 0.007292 **
## body.stylesedan -1997.656 1703.848 -1.172 0.242844
## body.stylewagon -4600.237 1885.644 -2.440 0.015846 *
## drive.wheelsfwd 502.825 1452.553 0.346 0.729693
## drive.wheelsrwd 1438.892 1594.315 0.903 0.368202
## engine.locationrear 21270.973 2520.357 8.440 2.26e-14 ***
## wheel.base 295.601 115.266 2.565 0.011294 *
## length -130.503 61.481 -2.123 0.035391 *
## width 589.568 302.225 1.951 0.052913 .
## height -83.386 159.352 -0.523 0.601537
## curb.weight 10.549 1.566 6.736 3.10e-10 ***
## engine.typel -5219.940 1445.685 -3.611 0.000413 ***
## engine.typeohc 391.469 1108.690 0.353 0.724506
## engine.typeohcf -164.144 1532.373 -0.107 0.914836
## engine.typeohcv 289.297 1415.069 0.204 0.838281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2968 on 153 degrees of freedom
## Multiple R-squared: 0.8715, Adjusted R-squared: 0.8556
## F-statistic: 54.62 on 19 and 153 DF, p-value: < 2.2e-16
lmOutAll = lm(price~., data = trainData)
summary(lmOutAll)
##
## Call:
## lm(formula = price ~ ., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3262.3 -914.2 -34.1 862.7 8339.0
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.042e+04 1.842e+04 -0.566 0.572753
## makeaudi 3.825e+02 2.579e+03 0.148 0.882319
## makebmw 4.233e+03 2.360e+03 1.794 0.075391 .
## makechevrolet -5.766e+03 2.300e+03 -2.507 0.013529 *
## makedodge -6.505e+03 1.909e+03 -3.407 0.000897 ***
## makehonda -3.817e+03 2.215e+03 -1.724 0.087374 .
## makeisuzu -3.366e+03 2.884e+03 -1.167 0.245490
## makejaguar -2.094e+03 2.843e+03 -0.737 0.462731
## makemazda -3.100e+03 1.775e+03 -1.746 0.083350 .
## makemercedes-benz 1.859e+03 2.983e+03 0.623 0.534298
## makemercury -5.511e+03 2.992e+03 -1.842 0.067947 .
## makemitsubishi -7.314e+03 1.903e+03 -3.843 0.000197 ***
## makenissan -3.673e+03 1.723e+03 -2.131 0.035130 *
## makepeugot -9.382e+03 4.805e+03 -1.953 0.053216 .
## makeplymouth -6.561e+03 1.894e+03 -3.465 0.000739 ***
## makeporsche 1.436e+04 2.871e+03 5.000 2e-06 ***
## makesaab -1.298e+02 2.070e+03 -0.063 0.950111
## makesubaru -2.857e+03 2.084e+03 -1.371 0.172960
## maketoyota -4.319e+03 1.618e+03 -2.670 0.008637 **
## makevolkswagen -3.452e+03 1.804e+03 -1.913 0.058088 .
## makevolvo -3.492e+03 2.255e+03 -1.548 0.124174
## fuel.typegas -1.569e+04 7.131e+03 -2.200 0.029749 *
## aspirationturbo 1.772e+03 8.587e+02 2.063 0.041245 *
## num.of.doorstwo -9.438e+01 4.986e+02 -0.189 0.850195
## body.stylehardtop -3.176e+03 1.435e+03 -2.213 0.028799 *
## body.stylehatchback -2.569e+03 1.295e+03 -1.984 0.049540 *
## body.stylesedan -1.841e+03 1.381e+03 -1.333 0.185056
## body.stylewagon -2.248e+03 1.508e+03 -1.491 0.138575
## drive.wheelsfwd -6.172e+02 9.662e+02 -0.639 0.524208
## drive.wheelsrwd -4.099e+02 1.279e+03 -0.320 0.749240
## engine.locationrear NA NA NA NA
## wheel.base 3.142e+02 9.413e+01 3.338 0.001128 **
## length -1.336e+02 5.295e+01 -2.524 0.012931 *
## width 5.945e+02 2.416e+02 2.460 0.015328 *
## height -3.912e+02 1.460e+02 -2.680 0.008409 **
## curb.weight 7.165e+00 1.781e+00 4.023 0.000102 ***
## engine.typel 2.995e+03 4.437e+03 0.675 0.500990
## engine.typeohc 1.360e+03 1.268e+03 1.073 0.285393
## engine.typeohcf NA NA NA NA
## engine.typeohcv -2.796e+03 1.329e+03 -2.103 0.037538 *
## num.of.cylindersfive -6.963e+03 2.949e+03 -2.362 0.019823 *
## num.of.cylindersfour -3.485e+03 3.898e+03 -0.894 0.373038
## num.of.cylinderssix -2.604e+03 3.177e+03 -0.820 0.414093
## num.of.cylindersthree NA NA NA NA
## num.of.cylinderstwelve 5.810e+01 5.874e+03 0.010 0.992125
## engine.size 5.905e+01 2.623e+01 2.251 0.026242 *
## fuel.system2bbl 1.968e+03 1.486e+03 1.325 0.187869
## fuel.systemidi NA NA NA NA
## fuel.systemmfi -1.465e+03 2.657e+03 -0.551 0.582450
## fuel.systemmpfi 9.461e+02 1.576e+03 0.600 0.549494
## fuel.systemspdi -5.165e+02 1.884e+03 -0.274 0.784412
## fuel.systemspfi NA NA NA NA
## bore -2.492e+03 1.889e+03 -1.319 0.189580
## stroke -1.042e+03 1.012e+03 -1.029 0.305568
## compression.ratio -1.105e+03 5.327e+02 -2.074 0.040258 *
## horsepower 8.347e+00 2.563e+01 0.326 0.745249
## peak.rpm 2.253e+00 6.648e-01 3.389 0.000953 ***
## city.mpg -3.217e+01 1.433e+02 -0.225 0.822692
## highway.mpg 1.443e+02 1.211e+02 1.191 0.235911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1723 on 119 degrees of freedom
## Multiple R-squared: 0.9663, Adjusted R-squared: 0.9513
## F-statistic: 64.43 on 53 and 119 DF, p-value: < 2.2e-16
Looks like the model only including engine size generalizes the best. I used RMSE for my metric of success, but we could decide on another one as a group and use that to guide our analysis.
pMake = predict(lmOutMake,data=testData)
pEngineSize = predict(lmOutEngineSize,data=testData)
pBoth = predict(lmOutBoth,data=testData)
pAll = predict(lmOutAll,data=testData)
pMakeEngineCurb = predict(lmOutMakeEngineSizeCurbWeight,data=testData)
pLassoWithoutMake = predict(lmOutLassoWithoutMake,data=testData)
rmseMake <- sqrt(mean((pMake - testData$price)^2))
## Warning in pMake - testData$price: longer object length is not a multiple
## of shorter object length
rmseMake
## [1] 12975.97
rmseEngineSize <- sqrt(mean((pEngineSize - testData$price)^2))
## Warning in pEngineSize - testData$price: longer object length is not a
## multiple of shorter object length
rmseEngineSize
## [1] 12879.55
rmseBoth <- sqrt(mean((pBoth - testData$price)^2))
## Warning in pBoth - testData$price: longer object length is not a multiple
## of shorter object length
rmseBoth
## [1] 13269.54
rmseAll <- sqrt(mean((pAll - testData$price)^2))
## Warning in pAll - testData$price: longer object length is not a multiple of
## shorter object length
rmseAll
## [1] 13352.2
rmseMEC <- sqrt(mean((pMakeEngineCurb - testData$price)^2))
## Warning in pMakeEngineCurb - testData$price: longer object length is not a
## multiple of shorter object length
rmseMEC
## [1] 13344.66
rmseLWOM = sqrt(mean((pLassoWithoutMake - testData$price)^2))
## Warning in pLassoWithoutMake - testData$price: longer object length is not
## a multiple of shorter object length
rmseLWOM
## [1] 13191.68
How does Random Forests compare with the best lienar regression model?
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
library(rpart)
modelRFAll =randomForest(price~., ntree = 250 ,data=trainData)
modelRFEngineSize = randomForest(price~engine.size,ntree=250,data=trainData)
modelRFMake = randomForest(price~make,ntree=250,data=trainData)
modelRFBoth = randomForest(price~make+engine.size,ntree=250,data=trainData)
pMakeRF = predict(modelRFMake,data=testData)
pEngineSizeRF = predict(modelRFEngineSize,data=testData)
pBothRF = predict(modelRFBoth,data=testData)
pAllRF = predict(modelRFAll,data=testData)
rmseMakeRF <- sqrt(mean((pMakeRF - testData$price)^2))
## Warning in pMakeRF - testData$price: longer object length is not a multiple
## of shorter object length
rmseMakeRF
## [1] 12788.48
rmseEngineSizeRF <- sqrt(mean((pEngineSizeRF - testData$price)^2))
## Warning in pEngineSizeRF - testData$price: longer object length is not a
## multiple of shorter object length
rmseEngineSizeRF
## [1] 13219.01
rmseBothRF <- sqrt(mean((pBothRF - testData$price)^2))
## Warning in pBothRF - testData$price: longer object length is not a multiple
## of shorter object length
rmseBothRF
## [1] 13092.04
rmseAllRF <- sqrt(mean((pAllRF - testData$price)^2))
## Warning in pAllRF - testData$price: longer object length is not a multiple
## of shorter object length
rmseAllRF
## [1] 12979.18
It doesn’t look like a log transformation really helps the engine size variable in regard to prediction.
loggedTrainData = trainData
loggedTrainData$engine.size = log(loggedTrainData$engine.size)
loggedTestData = testData
loggedTestData$engine.size = log(loggedTestData$engine.size)
lmLogged = lm(price ~ engine.size, data = loggedTrainData)
pLogged = predict(lmLogged,data=loggedTestData)
rmseLogged <- sqrt(mean((pLogged - loggedTestData$price)^2))
## Warning in pLogged - loggedTestData$price: longer object length is not a
## multiple of shorter object length
rmseLogged
## [1] 12880.03
Looks like RF using make as the sole predictor does the best on the test set, and the linear regression model using engine size as the sole predictor does the best on the test set.
I think these are the correct answers, but we still need to motivate these decisions using various selection methods, visualizations, and analyses.
Things from her rubric I’m going to try and address this weekend:
Model Assumptions (show normalized residuals, QQplots, etc. This will need to be done after we select a model as a group)
More graphs (again after we decide on the best model)
Dealing with outliers (I usually use R’s boxplot code for this, but I asked her about it earlier in the semester and she didn’t like it. Suggested just taking off the top and bottom 5% of variables). Could include Cook’s D plots here.
Multicollinearity. Thinking about it now, this is kind of addressed in models like the lmOutBoth, lmOutLassoWithoutMake, and lmOutAll models. We can use them to explain why they’re performing worse on the test set than the models with just engine size and make.
Interactions (see multicollinearity section)
I think everything else is kind of done, but obviously needs polished and combed over by everyone else.