| Dataset | Method | Focus | n (observations) | p (predictors) |
|---|---|---|---|---|
| (a) Salary | Regression | Inference | 350 | 3 |
| (b) Product | Classification | Prediction | 31 | 13 |
| (c) Stock | Regression | Prediction | 52 | 3 |
(a) This is a regression problem because salary is a continuous variable. The focus is inference: we only want to identify the influential factors, not to predict salaries.
(b) This is a classification problem because the response variable (Success/Fail) is categorical. The focus is prediction: the aim is to forecast the market reaction to a new product, not to explain which factors determined the fate of past products. With only 31 observations and 13 predictors, this problem needs feature selection or regularization, or it will suffer from overfitting.
(c) This is a regression problem because the % change in the dollar is a continuous variable. The focus is prediction: the target is to forecast the change in the dollar given stock market information, not to analyze which stock market affects the dollar most. A shrinkage method such as the elastic net could be employed for this problem.
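To make the distinction concrete, here is a minimal sketch of how each problem might be set up in R. All data frames and column names below (salary_df, product_df, stock_df, experience, success, nikkei, and so on) are hypothetical placeholders, not part of the original data.

# (a) Regression for inference: read coefficients and p-values; no held-out prediction needed
fit_a = lm(salary ~ experience + education + industry, data = salary_df)
summary(fit_a)                                     # which factors are influential?
# (b) Classification for prediction: with n = 31 and p = 13 a full model would overfit,
#     so some form of feature selection or shrinkage is needed on top of this
fit_b = glm(success ~ ., data = product_df, family = binomial)
predict(fit_b, newdata = new_product, type = "response")
# (c) Regression for prediction: forecast the % change in the dollar from market indices
fit_c = lm(usd_change ~ nikkei + ftse + dax, data = stock_df)
predict(fit_c, newdata = this_week)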
library(dplyr)
library(plotly)
library(Hmisc)
library(psych)
library(MASS)
library(caret)
library(pROC)
library(xgboost)
library(AER)
setwd("C:/Users/Administrator/Desktop/test/data-analyst-data-test-master/data-analyst-data-test-master")
RawData = read.csv("Cars_mileage.csv",stringsAsFactors = F)
head(RawData,n=6)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
MedianMPG = median(RawData$mpg)
mpg_binary = as.integer(RawData$mpg > MedianMPG) # Question 2.(a): 1 if mpg is above the median, 0 otherwise
CarData = RawData %>%
  mutate(horsepower = as.integer(horsepower)) %>%   # "?" entries are coerced to NA
  filter(!is.na(horsepower)) %>%
  mutate(mpg_binary = as.integer(mpg > MedianMPG)) %>%
  # brand = first word of the car name (or the whole name if it contains no space)
  mutate(brand =
           (regexpr(" ", name) - 1) %>%
           ifelse((. == -2), nchar(name), .) %>%
           substr(x = name, start = 1, stop = .)) %>%
  mutate(gpm = 1/mpg) %>%                           # gallons per mile
  mutate(sw = grepl(pattern = "\\(sw\\)", name))    # station wagon flag
sumTable=CarData %>%
group_by(as.factor(brand)) %>%
summarise(n=n())%>%
arrange(n)
head(CarData)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name mpg_binary brand gpm sw
## 1 chevrolet chevelle malibu 0 chevrolet 0.05555556 FALSE
## 2 buick skylark 320 0 buick 0.06666667 FALSE
## 3 plymouth satellite 0 plymouth 0.05555556 FALSE
## 4 amc rebel sst 0 amc 0.06250000 FALSE
## 5 ford torino 0 ford 0.05882353 FALSE
## 6 ford galaxie 500 0 ford 0.06666667 FALSE
print(sumTable,n=38)
## # A tibble: 37 × 2
## `as.factor(brand)` n
## <fctr> <int>
## 1 capri 1
## 2 chevroelt 1
## 3 hi 1
## 4 mercedes 1
## 5 nissan 1
## 6 toyouta 1
## 7 triumph 1
## 8 vokswagen 1
## 9 bmw 2
## 10 cadillac 2
## 11 maxda 2
## 12 mercedes-benz 2
## 13 chevy 3
## 14 renault 3
## 15 opel 4
## 16 saab 4
## 17 subaru 4
## 18 chrysler 6
## 19 volvo 6
## 20 vw 6
## 21 audi 7
## 22 fiat 8
## 23 peugeot 8
## 24 mazda 10
## 25 oldsmobile 10
## 26 mercury 11
## 27 honda 13
## 28 volkswagen 15
## 29 pontiac 16
## 30 buick 17
## 31 datsun 23
## 32 toyota 25
## 33 amc 27
## 34 dodge 28
## 35 plymouth 31
## 36 chevrolet 43
## 37 ford 48
From the table above I found some typos in the brand names. In addition, some car names contain only a model name instead of starting with a brand name. I searched the internet for their brands and inserted them manually. The following code fixes these problems.
attach(CarData)
brand[brand=="capri"]="ford"
brand[brand=="chevroelt"]="chevrolet"
brand[brand=="vokswagen"]="volkswagen"
brand[brand=="mercedes-benz"]="mercedes"
brand[brand=="chevroelt"]="chevrolet"
brand[brand=="hi"]="unknown"
brand[brand=="triumph"]="unknown"
brand[brand=="bmw"]="unknown"
brand[brand=="toyouta"]="toyota"
brand[brand=="cadillac"]="unknown"
brand[brand=="maxda"]="mazda"
brand[brand=="chevy"]="chevrolet"
brand[brand=="opel"]="unknown"
brand[brand=="saab"]="unknown"
brand[brand=="subaru"]="unknown"
brand[brand=="chrysler"]="unknown"
brand[brand=="volvo"]="unknown"
brand[brand=="vw"]="unknown"
brand[brand=="audi"]="unknown"
brand[brand=="fiat"]="unknown"
brand[brand=="peugeot"]="unknown"
brand[brand=="mercury"]="ford"
brand[brand=="mercedes"]="unknown"
brand[brand=="renault"]="unknown"
brand[brand=="datsun"]="nissan"
origin[origin==1]="American"
origin[origin==2]="European"
origin[origin==3]="Japan"
CarData$brand=capitalize(brand)
CarData$origin=factor(origin)
detach(CarData)
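As a side note, the same cleaning could be written more compactly with a named lookup vector. The sketch below is an equivalent formulation of the block above (it would replace it, operating on the CarData produced by the first pipeline), not additional code to run at this point; dplyr::recode is written explicitly because car (attached via AER) also exports a recode function.

brand_fix = c(capri = "ford", chevroelt = "chevrolet", vokswagen = "volkswagen",
              "mercedes-benz" = "mercedes", toyouta = "toyota", maxda = "mazda",
              chevy = "chevrolet", mercury = "ford", datsun = "nissan")
rare = c("hi", "triumph", "bmw", "cadillac", "opel", "saab", "subaru", "chrysler",
         "volvo", "vw", "audi", "fiat", "peugeot", "mercedes", "renault")
CarData = CarData %>%
  mutate(brand = dplyr::recode(brand, !!!brand_fix),          # fix typos / merge brands
         brand = ifelse(brand %in% rare, "unknown", brand),   # pool rare brands
         brand = capitalize(brand),
         origin = factor(origin, levels = 1:3,
                         labels = c("American", "European", "Japan")))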
sumTable=CarData %>%
group_by(brand) %>%
summarise(n=n(),avg_mpg=mean(mpg))%>%
arrange(avg_mpg)
CarData$brand=factor(CarData$brand,levels=sumTable$`brand`,ordered = T)
print(sumTable,n=14)
## # A tibble: 14 × 3
## brand n avg_mpg
## <chr> <int> <dbl>
## 1 Amc 27 18.07037
## 2 Buick 17 19.18235
## 3 Ford 60 19.50167
## 4 Pontiac 16 20.01250
## 5 Chevrolet 47 20.21915
## 6 Oldsmobile 10 21.10000
## 7 Plymouth 31 21.70323
## 8 Dodge 28 22.06071
## 9 Unknown 65 25.92000
## 10 Toyota 26 28.16538
## 11 Volkswagen 16 29.15000
## 12 Mazda 12 30.05833
## 13 Nissan 24 31.31667
## 14 Honda 13 33.76154
head(CarData)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 American
## 2 15 8 350 165 3693 11.5 70 American
## 3 18 8 318 150 3436 11.0 70 American
## 4 16 8 304 150 3433 12.0 70 American
## 5 17 8 302 140 3449 10.5 70 American
## 6 15 8 429 198 4341 10.0 70 American
## name mpg_binary brand gpm sw
## 1 chevrolet chevelle malibu 0 Chevrolet 0.05555556 FALSE
## 2 buick skylark 320 0 Buick 0.06666667 FALSE
## 3 plymouth satellite 0 Plymouth 0.05555556 FALSE
## 4 amc rebel sst 0 Amc 0.06250000 FALSE
## 5 ford torino 0 Ford 0.05882353 FALSE
## 6 ford galaxie 500 0 Ford 0.06666667 FALSE
Now the data looks much tidier.
First, of course, we want to know the correlation between different variables.
pairs.panels(CarData[,-c(9,10)],density=F,ellipses=F,
main="Feature Correlation")
From the correlation plot I found:

- cylinders is correlated with mpg, so it can be a useful variable for prediction.
- displacement is also correlated with mpg, but it is worth noting that it is highly correlated with cylinders. Multicollinearity is usually not an issue for prediction, but it is a problem for interpretation.
- horsepower and weight show the same pattern as displacement.
- acceleration has a "purer" but weaker correlation with mpg; it is still a potentially useful variable.
- year is also a useful variable, since fuel efficiency seems to improve over these years. The pattern is even clearer against gpm.
- origin is definitely a useful variable, because Japanese cars have a clear advantage in fuel efficiency.
- brand seems correlated with mpg, but this is dubious since the relation may be driven by origin.
- sw looks like a factor influencing mpg, but because the data are unbalanced this needs further testing.
Next, I inspected some of these dubious correlations.
First, I want to check whether mpg really varies by brand, or merely by origin.
BrandPlot <- plot_ly(data=CarData, y = ~mpg, x=~brand,
color = ~origin, type = "box")%>%
layout(title="MPG by Brands",
xaxis = list(title=""), yaxis = list(title="MPG"))
BrandPlot
All right, brand seems to contribute no information beyond origin, so it can be set aside for now.
t.test(CarData$mpg[CarData$sw==0],CarData$mpg[CarData$sw==1],
var.equal = F)
##
## Welch Two Sample t-test
##
## data: CarData$mpg[CarData$sw == 0] and CarData$mpg[CarData$sw == 1]
## t = 5.1431, df = 36.963, p-value = 9.079e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.346797 7.698258
## sample estimates:
## mean of x mean of y
## 23.84038 18.31786
The small p-value shows that sw makes a difference in mpg, indicating that it could be a useful variable.
Next, I want to check some interaction terms.
Different origins may have followed different evolution paths; perhaps American cars caught up in fuel efficiency faster than European ones.
YearPlot<- plot_ly(data=CarData, y = ~mpg, x=~jitter(year),
hoverinfo = 'text',
text=~paste('Brand: ',brand,
'</br> Name: ', capitalize(name),
'</br> MPG: ', mpg,
'</br> Displacement: ', displacement),
color = ~origin,type="scatter",mode="markers")%>%
layout(title="MPG by Year",
xaxis = list(title="Year"),
yaxis = list(title="MPG"))
YearPlot
The plot seems to support this hypothesis: American cars have clearly lower mpg in 1970, but later their mpg becomes more comparable with that of Japanese cars.
Maybe station wagons from different origins have different fuel efficiency?
aov.fit=aov(mpg~origin*sw,data=CarData)
summary(aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 7904 3952 98.859 < 2e-16 ***
## sw 1 464 464 11.616 0.000723 ***
## origin:sw 2 19 10 0.238 0.788332
## Residuals 386 15431 40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction term has a high p-value, so it is not useful.
Some car designs focus on the luxury experience, while others focus on economy. The displacement-to-horsepower ratio could therefore be useful for distinguishing luxury cars from economy cars.
luxPlot=plot_ly(data=CarData, y = ~mpg, x=~I(displacement/horsepower),
hoverinfo = 'text',
text=~paste('Brand: ',brand,
'</br> Name: ', capitalize(name),
'</br> MPG: ', mpg,
'</br> Displacement: ', displacement),
color=(~I(displacement/horsepower)>1.9),
type="scatter",mode="markers")%>%
layout(title="Luxury vs Economics",
xaxis = list(title="Displacement/Horsepower ratio"),
yaxis = list(title="MPG"))
luxPlot
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
The pattern shows that cars with displacement/horsepower > 1.9 tend to be luxury cars, while the rest are more likely economy cars.
CarData=mutate(CarData,luxury=(displacement/horsepower>1.9))
CarData=mutate(CarData,dhratio=displacement/horsepower)
Finally, let us look at a big picture that includes all the continuous variables.
ConPlot <- plot_ly(CarData, x = ~weight, y = ~horsepower, z = ~acceleration,size=~displacement, color = ~mpg) %>%
add_markers() %>%
layout(title="Global View",
scene = list(xaxis = list(title = 'Weight'),
yaxis = list(title = 'Horsepower'),
zaxis = list(title = 'Acceleration')))
ConPlot
No further interaction terms are obvious from this graph.
To conclude, the seemingly useful variables include: cylinders, displacement, horsepower, weight, acceleration, year, origin, sw, year*origin, sw*origin, and displacement/horsepower.
Separate the data into a training set and a testing set. The partition ratio is 0.7, so that neither side has too few observations.
Train.ind = createDataPartition(CarData$mpg, p=0.7, list=FALSE)
training = CarData[ Train.ind, ]
testing = CarData[ -Train.ind, ]
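Note that createDataPartition draws a random split, so for a fully reproducible report one would normally fix the random seed first; a minimal sketch (the seed value is arbitrary and not part of the original run):

set.seed(2017)   # any fixed value works; it only pins down the random split
Train.ind = createDataPartition(CarData$mpg, p = 0.7, list = FALSE)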
I choose logistic regression and gradient boosted trees (GBTree). The reasons are:
| Method | Selected | Reason |
|---|---|---|
| Discriminant Analysis | No | Poor fit for categorical variables |
| KNN | No | Curse of dimensionality |
| Decision Tree | No | Inferior to GBTree |
| Random Forests | No | Too few features |
| LASSO Regression | No | Inferior to manual variable selection |
| Ridge Regression | No | Inferior to manual variable selection |
| Elastic Net | No | Inferior to manual variable selection |
| Logistic Regression | Yes | Best traditional model for binary prediction |
| GBTree | Yes | Good for nonlinear modeling |
Before fitting a logistic regression, I want to "cheat" a little by first fitting a linear regression to mpg with all plausible variables and interaction terms, then using stepwise regression to perform variable selection, and finally using the selected variables to fit a logistic regression.
m1 = lm(data = training,
        mpg ~ cylinders + displacement + horsepower + weight + acceleration +
          year + origin + sw + year*origin + sw*origin +
          I(displacement/horsepower) + luxury*displacement + luxury*sw +
          luxury*weight + luxury*cylinders)
summary(m1)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin + sw + year * origin + sw *
## origin + I(displacement/horsepower) + luxury * displacement +
## luxury * sw + luxury * weight + luxury * cylinders, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2817 -1.6708 -0.1533 1.4998 13.0941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.225925 7.054354 1.591 0.112766
## cylinders -0.015656 0.547719 -0.029 0.977218
## displacement 0.114839 0.033038 3.476 0.000598 ***
## horsepower -0.164868 0.053941 -3.056 0.002477 **
## weight -0.008567 0.001159 -7.394 2.02e-12 ***
## acceleration 0.143612 0.104124 1.379 0.169022
## year 0.560911 0.072400 7.747 2.20e-13 ***
## originEuropean -29.912285 11.566468 -2.586 0.010259 *
## originJapan -25.061109 10.892654 -2.301 0.022211 *
## swTRUE -0.354445 2.045163 -0.173 0.862546
## I(displacement/horsepower) -6.055865 2.813095 -2.153 0.032274 *
## luxuryTRUE -8.859993 4.053856 -2.186 0.029753 *
## year:originEuropean 0.410963 0.150723 2.727 0.006841 **
## year:originJapan 0.353290 0.139881 2.526 0.012153 *
## originEuropean:swTRUE -0.797858 2.955743 -0.270 0.787428
## originJapan:swTRUE 1.872383 2.728643 0.686 0.493211
## displacement:luxuryTRUE -0.029097 0.017350 -1.677 0.094750 .
## swTRUE:luxuryTRUE 0.676152 2.220178 0.305 0.760957
## weight:luxuryTRUE 0.004731 0.001329 3.561 0.000441 ***
## cylinders:luxuryTRUE -0.670472 0.732243 -0.916 0.360717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.828 on 256 degrees of freedom
## Multiple R-squared: 0.8782, Adjusted R-squared: 0.8692
## F-statistic: 97.18 on 19 and 256 DF, p-value: < 2.2e-16
Most terms have significant p-values; those that do not are mostly affected by collinearity, which is not an issue in a prediction setting.
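Collinearity among the raw features can be quantified with variance inflation factors; a quick sketch on the additive numeric terms, assuming vif() from the car package is available (it normally comes in via AER):

vif(lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year,
       data = training))
# VIFs well above 10 for cylinders/displacement/weight would confirm the collinearity noted above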
m2=stepAIC(m1,direction = "both")
summary(m2)
##
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration +
## year + origin + I(displacement/horsepower) + luxury + year:origin +
## displacement:luxury + weight:luxury, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3372 -1.6858 -0.1269 1.5357 13.4216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.877963 6.808056 1.598 0.111291
## displacement 0.110136 0.031486 3.498 0.000551 ***
## horsepower -0.155686 0.053333 -2.919 0.003815 **
## weight -0.008651 0.001143 -7.569 6.43e-13 ***
## acceleration 0.162574 0.101237 1.606 0.109507
## year 0.551266 0.071146 7.748 2.05e-13 ***
## originEuropean -32.376108 11.076248 -2.923 0.003769 **
## originJapan -22.977678 10.267203 -2.238 0.026064 *
## I(displacement/horsepower) -5.580863 2.779133 -2.008 0.045656 *
## luxuryTRUE -11.360172 3.345793 -3.395 0.000792 ***
## year:originEuropean 0.443522 0.144351 3.073 0.002346 **
## year:originJapan 0.327901 0.132088 2.482 0.013675 *
## displacement:luxuryTRUE -0.035873 0.013450 -2.667 0.008128 **
## weight:luxuryTRUE 0.004695 0.001317 3.564 0.000435 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.814 on 262 degrees of freedom
## Multiple R-squared: 0.8766, Adjusted R-squared: 0.8705
## F-statistic: 143.2 on 13 and 262 DF, p-value: < 2.2e-16
Use the selected terms to fit a logistic model. Also fit some reduced models for comparison.
m3=glm(mpg_binary~cylinders+displacement+horsepower+weight+
acceleration+year+origin,
family=binomial(logit),data=training)
summary(m3)
##
## Call:
## glm(formula = mpg_binary ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin, family = binomial(logit),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.43352 -0.07932 -0.00021 0.09822 2.45825
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -29.690230 8.232227 -3.607 0.00031 ***
## cylinders 0.063695 0.581088 0.110 0.91272
## displacement 0.020775 0.017371 1.196 0.23171
## horsepower -0.048122 0.034826 -1.382 0.16704
## weight -0.007579 0.001852 -4.093 4.26e-05 ***
## acceleration 0.010591 0.193802 0.055 0.95642
## year 0.664988 0.129520 5.134 2.83e-07 ***
## originEuropean 1.881182 0.922393 2.039 0.04140 *
## originJapan 2.123111 0.952347 2.229 0.02579 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 381.689 on 275 degrees of freedom
## Residual deviance: 96.213 on 267 degrees of freedom
## AIC: 114.21
##
## Number of Fisher Scoring iterations: 8
m4=glm(mpg_binary~cylinders+displacement+horsepower+weight+
acceleration+year+origin+I(displacement/horsepower)+
luxury+year*origin+displacement*luxury+
weight*luxury+cylinders*luxury,
family=binomial(logit),data=training)
summary(m4)
##
## Call:
## glm(formula = mpg_binary ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin + I(displacement/horsepower) +
## luxury + year * origin + displacement * luxury + weight *
## luxury + cylinders * luxury, family = binomial(logit), data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.05839 -0.08966 -0.00063 0.05367 2.27936
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.154889 14.386434 -2.096 0.036077 *
## cylinders -0.045725 0.813253 -0.056 0.955163
## displacement 0.112460 0.080852 1.391 0.164245
## horsepower -0.152756 0.124357 -1.228 0.219309
## weight -0.010416 0.002762 -3.772 0.000162 ***
## acceleration -0.011565 0.222130 -0.052 0.958476
## year 0.827426 0.191296 4.325 1.52e-05 ***
## originEuropean 39.658324 17.224071 2.302 0.021307 *
## originJapan -0.899112 25.880306 -0.035 0.972286
## I(displacement/horsepower) -4.330616 6.391604 -0.678 0.498059
## luxuryTRUE -18.742915 9.539497 -1.965 0.049441 *
## year:originEuropean -0.491099 0.228049 -2.153 0.031281 *
## year:originJapan 0.058898 0.354919 0.166 0.868198
## displacement:luxuryTRUE -0.074142 0.042330 -1.752 0.079852 .
## weight:luxuryTRUE 0.008750 0.003312 2.642 0.008235 **
## cylinders:luxuryTRUE 0.731786 1.389095 0.527 0.598327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 381.689 on 275 degrees of freedom
## Residual deviance: 78.859 on 260 degrees of freedom
## AIC: 110.86
##
## Number of Fisher Scoring iterations: 8
m5=glm(mpg_binary~displacement+weight+
year+origin+
luxury+year*origin+displacement*luxury+
weight*luxury,
family=binomial(logit),data=training)
summary(m5)
##
## Call:
## glm(formula = mpg_binary ~ displacement + weight + year + origin +
## luxury + year * origin + displacement * luxury + weight *
## luxury, family = binomial(logit), data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.35502 -0.09331 -0.00352 0.08623 2.35440
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -43.553328 10.965686 -3.972 7.13e-05 ***
## displacement 0.045038 0.023929 1.882 0.05982 .
## weight -0.011592 0.002576 -4.499 6.81e-06 ***
## year 0.892882 0.184922 4.828 1.38e-06 ***
## originEuropean 40.189119 16.314555 2.463 0.01376 *
## originJapan 8.761303 23.201566 0.378 0.70572
## luxuryTRUE -18.225629 6.492766 -2.807 0.00500 **
## year:originEuropean -0.506318 0.215231 -2.352 0.01865 *
## year:originJapan -0.089839 0.315608 -0.285 0.77591
## displacement:luxuryTRUE -0.052864 0.026292 -2.011 0.04436 *
## weight:luxuryTRUE 0.009194 0.002968 3.098 0.00195 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 381.689 on 275 degrees of freedom
## Residual deviance: 85.862 on 265 degrees of freedom
## AIC: 107.86
##
## Number of Fisher Scoring iterations: 8
anova(m3,m4,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: mpg_binary ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin
## Model 2: mpg_binary ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin + I(displacement/horsepower) +
## luxury + year * origin + displacement * luxury + weight *
## luxury + cylinders * luxury
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 267 96.213
## 2 260 78.859 7 17.355 0.01525 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m3,m5,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: mpg_binary ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin
## Model 2: mpg_binary ~ displacement + weight + year + origin + luxury +
## year * origin + displacement * luxury + weight * luxury
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 267 96.213
## 2 265 85.862 2 10.351 0.005653 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Many terms appear insignificant here, but that is acceptable: the OLS model gives us confidence that they should be included. The chi-squared tests also indicate that the reduced model m3, which uses only the raw features, is insufficient. Excluding the insignificant variables yields model m5, which improves on m3 but still has a somewhat higher deviance than the full model m4. Therefore, I decided to use model m4.
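A compact complementary check is to compare the three candidate logistic models by AIC directly (lower is better; the values match the summaries printed above):

AIC(m3, m4, m5)
# from the summaries above: m3 114.21, m4 110.86, m5 107.86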
mpg_fit=predict(m4,newdata = testing,type="response")
mpg_test=as.logical(testing$mpg_binary)
confusionMatrix(data=(mpg_fit>0.5),reference = mpg_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 52 4
## TRUE 7 53
##
## Accuracy : 0.9052
## 95% CI : (0.8367, 0.9517)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8105
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.8814
## Specificity : 0.9298
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.8833
## Prevalence : 0.5086
## Detection Rate : 0.4483
## Detection Prevalence : 0.4828
## Balanced Accuracy : 0.9056
##
## 'Positive' Class : FALSE
##
roc.fit=roc(mpg_test~mpg_fit);roc.fit
##
## Call:
## roc.formula(formula = mpg_test ~ mpg_fit)
##
## Data: mpg_fit in 59 controls (mpg_test FALSE) < 57 cases (mpg_test TRUE).
## Area under the curve: 0.9655
plot.roc(roc.fit)
The test error rate for logistic regression is 0.0948 (one minus the accuracy above), with 95% CI (0.0483, 0.1633).
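For reference, this error rate and its interval can be read straight off the confusionMatrix object, using the same call as above; cm_logit is a hypothetical name, and newer caret versions may require both arguments to be factors.

cm_logit = confusionMatrix(data = (mpg_fit > 0.5), reference = mpg_test)
1 - cm_logit$overall["Accuracy"]                           # test error rate
1 - cm_logit$overall[c("AccuracyUpper", "AccuracyLower")]  # 95% CI for the error rate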
Use caret to tune the parameters.
n.trees=c(120,140,160)
interaction.depth=c(3,4,5,6)
shrinkage=c(0.01,0.02,0.05,0.1)
n.minobsinnode=c(5,10,15)
para=expand.grid(n.trees=n.trees,
interaction.depth=interaction.depth,
shrinkage=shrinkage,
n.minobsinnode=n.minobsinnode)
m6=train(factor(mpg_binary)~cylinders+displacement+horsepower+weight+
acceleration+year+origin+sw+dhratio,
data=training,method="gbm",tuneGrid=para)
In fact, I had run the train() function several times beforehand to narrow down the region for the parameters; the grid shown above is just the last step. Also, since trees are good at modeling nonlinear relationships, I do not need to include interaction terms.
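To see which grid point caret actually selected, and how the resampled accuracy varied across the grid, one can inspect the fitted train object directly:

m6$bestTune   # winning combination of n.trees, interaction.depth, shrinkage, n.minobsinnode
plot(m6)      # resampled accuracy across the tuning grid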
m6.fit=m6$finalModel
mpg_fit=predict(m6.fit,newdata =
testing[,c(2,3,4,5,6,7,8,13,15)],n.trees=80)
mpg_test=as.logical(testing$mpg_binary)
confusionMatrix(data=(mpg_fit<0.5),reference = mpg_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 52 1
## TRUE 7 56
##
## Accuracy : 0.931
## 95% CI : (0.8686, 0.9698)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8623
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.8814
## Specificity : 0.9825
## Pos Pred Value : 0.9811
## Neg Pred Value : 0.8889
## Prevalence : 0.5086
## Detection Rate : 0.4483
## Detection Prevalence : 0.4569
## Balanced Accuracy : 0.9319
##
## 'Positive' Class : FALSE
##
roc.fit=roc(mpg_test~mpg_fit);roc.fit
##
## Call:
## roc.formula(formula = mpg_test ~ mpg_fit)
##
## Data: mpg_fit in 59 controls (mpg_test FALSE) > 57 cases (mpg_test TRUE).
## Area under the curve: 0.9822
plot.roc(roc.fit)
The error rate of the gradient boosted trees is 0.069, with 95% CI (0.0302, 0.1314).
- American cars seem less fuel efficient, but they were improving and catching up with Japanese and European cars.
- Cars with a lower displacement/horsepower ratio tend to achieve better mpg.
- Different brands have different fuel strategies, but brands from the same origin tend to show similar patterns.
- Acceleration does not seem to have as much influence on mpg as expected.
- Logistic regression performs comparably to the gradient boosted trees; their accuracy confidence intervals overlap.
- No significant outliers are obvious from the graphs.
- The AUC of the logistic regression is 0.9655 and that of the GBM is 0.9822 (see the sketch below for a formal comparison).
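As a final sketch, the two AUCs could be compared formally with pROC, assuming the two ROC objects had been kept under separate (hypothetical) names instead of both overwriting roc.fit:

roc.logit = roc(mpg_test ~ mpg_fit_logit)   # hypothetical: test-set predictions from m4
roc.gbm   = roc(mpg_test ~ mpg_fit_gbm)     # hypothetical: test-set predictions from m6
roc.test(roc.logit, roc.gbm)                # DeLong's test for a difference in AUC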
Guanzhong You Apr 2, 2017