Question 1

Dataset      Method          Focus       n    p
(a) Salary   Regression      Inference   350  3
(b) Product  Classification  Prediction  31   13
(c) Stock    Regression      Prediction  52   3
  1. This is a regression problem because salary is a continuous variable. It focuses on inference because we only want to identify the influential factors, not to predict.

  2. This is a classification problem because the response variable (Success/Fail) is categorical. It focuses on prediction because the aim is to forecast the market reaction to a new product, not to explain which factors influenced the fate of past products. This problem may need feature selection (n = 31 with 13 predictors), or it will suffer from overfitting.

  3. This is a regression problem because the % change in the dollar is a continuous variable. It focuses on prediction because the target is to forecast the change in the dollar given stock market information, not to analyze which stock market affects the dollar most. This problem may employ a log (elasticity) style model.

Question 2

1. Read Data

  • Import packages
  • Set working directory
  • Read in data
library(dplyr)
library(plotly)
library(Hmisc)
library(psych)
library(MASS)
library(caret)
library(pROC)
library(xgboost)
library(AER)
setwd("C:/Users/Administrator/Desktop/test/data-analyst-data-test-master/data-analyst-data-test-master")
RawData = read.csv("Cars_mileage.csv",stringsAsFactors = F)
head(RawData,n=6)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

2. Data Manipulation (Question 2.(a) Included)

  • Create the variable mpg_binary as required in Question 2.(a)
  • Create more potentially useful variables:
    • brand: Different brands may take different design approaches to fuel efficiency. Most car names start with the brand name, so it is easy to split that part off (a simpler one-line alternative is sketched after this list).
    • gpm: gallons per mile, which looks more like a variable that has a linear relation with the others. E.g., a one-ton increase in weight could increase gpm by a roughly constant amount.
    • sw: whether the car is a station wagon. This is marked with (sw) after the car’s name. Station wagons may have a different fuel-efficiency pattern than sedans, since their shape and structure differ.
  • Omit some NA data. Some rows have a “?” mark in horsepower. The number of those rows is very small, so it is safe to simply delete those entries rather than impute, which could introduce bias.
  • Recode origin. Origin is coded as a number; for easier interpretation, I changed it to the area name.
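
As an aside, the brand prefix can also be pulled out with a single base-R substitution (a minimal sketch, equivalent to the regexpr/substr pipeline used below):

# Everything from the first space onward is dropped; names without a space
# are left untouched, matching the fallback in the pipeline below.
brand_simple = sub(" .*", "", RawData$name)
head(brand_simple)
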
MedianMPG = median(RawData$mpg)
mpg_binary=as.integer(RawData$mpg>MedianMPG) # Question 2.(a) Here
CarData = RawData %>%
          mutate(horsepower=as.integer(horsepower)) %>%    # "?" entries become NA
          filter(!is.na(horsepower)) %>%                   # drop rows with missing horsepower
          mutate(mpg_binary=as.integer(mpg>MedianMPG)) %>% # Question 2.(a) inside the data frame
          mutate(brand=                                    # brand = first word of name (whole name if no space)
                   (regexpr(" ",name)-1)%>%
                   ifelse((.==-2),nchar(name),.)%>%
                   substr(x=name,start=1,stop=.))%>%
          mutate(gpm=1/mpg)%>%                             # gallons per mile
          mutate(sw=grepl(pattern="\\(sw\\)",name))        # station wagon flag from "(sw)" in name
sumTable=CarData %>% 
         group_by(as.factor(brand)) %>%
         summarise(n=n())%>%
         arrange(n)
head(CarData)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name mpg_binary     brand        gpm    sw
## 1 chevrolet chevelle malibu          0 chevrolet 0.05555556 FALSE
## 2         buick skylark 320          0     buick 0.06666667 FALSE
## 3        plymouth satellite          0  plymouth 0.05555556 FALSE
## 4             amc rebel sst          0       amc 0.06250000 FALSE
## 5               ford torino          0      ford 0.05882353 FALSE
## 6          ford galaxie 500          0      ford 0.06666667 FALSE
print(sumTable,n=38)
## # A tibble: 37 × 2
##    `as.factor(brand)`     n
##                <fctr> <int>
## 1               capri     1
## 2           chevroelt     1
## 3                  hi     1
## 4            mercedes     1
## 5              nissan     1
## 6             toyouta     1
## 7             triumph     1
## 8           vokswagen     1
## 9                 bmw     2
## 10           cadillac     2
## 11              maxda     2
## 12      mercedes-benz     2
## 13              chevy     3
## 14            renault     3
## 15               opel     4
## 16               saab     4
## 17             subaru     4
## 18           chrysler     6
## 19              volvo     6
## 20                 vw     6
## 21               audi     7
## 22               fiat     8
## 23            peugeot     8
## 24              mazda    10
## 25         oldsmobile    10
## 26            mercury    11
## 27              honda    13
## 28         volkswagen    15
## 29            pontiac    16
## 30              buick    17
## 31             datsun    23
## 32             toyota    25
## 33                amc    27
## 34              dodge    28
## 35           plymouth    31
## 36          chevrolet    43
## 37               ford    48

From the table above I found some typos in the brand names. In addition, some car names contain only a model name instead of starting with a brand. I searched the internet for their brands and inserted them manually; brands with only a handful of cars are grouped into an "unknown" category so that no factor level is too small. The following code fixes these issues.

attach(CarData)
  brand[brand=="capri"]="ford"
  brand[brand=="chevroelt"]="chevrolet"
  brand[brand=="vokswagen"]="volkswagen"
  brand[brand=="mercedes-benz"]="mercedes"
  brand[brand=="chevroelt"]="chevrolet"
  brand[brand=="hi"]="unknown"
  brand[brand=="triumph"]="unknown"
  brand[brand=="bmw"]="unknown"
  brand[brand=="toyouta"]="toyota"
  brand[brand=="cadillac"]="unknown"
  brand[brand=="maxda"]="mazda"
  brand[brand=="chevy"]="chevrolet"
  brand[brand=="opel"]="unknown"
  brand[brand=="saab"]="unknown"
  brand[brand=="subaru"]="unknown"
  brand[brand=="chrysler"]="unknown"
  brand[brand=="volvo"]="unknown"
  brand[brand=="vw"]="unknown"
  brand[brand=="audi"]="unknown"
  brand[brand=="fiat"]="unknown"
  brand[brand=="peugeot"]="unknown"
  brand[brand=="mercury"]="ford"
  brand[brand=="mercedes"]="unknown"
  brand[brand=="renault"]="unknown"
  brand[brand=="datsun"]="nissan"
  origin[origin==1]="American"
  origin[origin==2]="European"
  origin[origin==3]="Japan"
  CarData$brand=capitalize(brand)
  CarData$origin=factor(origin)
detach(CarData)
sumTable=CarData %>% 
         group_by(brand) %>%
         summarise(n=n(),avg_mpg=mean(mpg))%>%
         arrange(avg_mpg)
CarData$brand=factor(CarData$brand,levels=sumTable$`brand`,ordered = T)
print(sumTable,n=14)
## # A tibble: 14 × 3
##         brand     n  avg_mpg
##         <chr> <int>    <dbl>
## 1         Amc    27 18.07037
## 2       Buick    17 19.18235
## 3        Ford    60 19.50167
## 4     Pontiac    16 20.01250
## 5   Chevrolet    47 20.21915
## 6  Oldsmobile    10 21.10000
## 7    Plymouth    31 21.70323
## 8       Dodge    28 22.06071
## 9     Unknown    65 25.92000
## 10     Toyota    26 28.16538
## 11 Volkswagen    16 29.15000
## 12      Mazda    12 30.05833
## 13     Nissan    24 31.31667
## 14      Honda    13 33.76154
head(CarData)
##   mpg cylinders displacement horsepower weight acceleration year   origin
## 1  18         8          307        130   3504         12.0   70 American
## 2  15         8          350        165   3693         11.5   70 American
## 3  18         8          318        150   3436         11.0   70 American
## 4  16         8          304        150   3433         12.0   70 American
## 5  17         8          302        140   3449         10.5   70 American
## 6  15         8          429        198   4341         10.0   70 American
##                        name mpg_binary     brand        gpm    sw
## 1 chevrolet chevelle malibu          0 Chevrolet 0.05555556 FALSE
## 2         buick skylark 320          0     Buick 0.06666667 FALSE
## 3        plymouth satellite          0  Plymouth 0.05555556 FALSE
## 4             amc rebel sst          0       Amc 0.06250000 FALSE
## 5               ford torino          0      Ford 0.05882353 FALSE
## 6          ford galaxie 500          0      Ford 0.06666667 FALSE

Now the data looks much tidier.
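
As a side note, the brand fixes above could also have been written without attach()/detach(), using dplyr::recode inside mutate before the capitalization step (a sketch; only part of the mapping is shown, and unmatched values are left unchanged):

# Sketch of an attach()-free version of the brand cleanup (partial mapping)
CarData = CarData %>%
          mutate(brand = recode(brand,
                                "capri"     = "ford",
                                "chevroelt" = "chevrolet",
                                "chevy"     = "chevrolet",
                                "maxda"     = "mazda",
                                "toyouta"   = "toyota",
                                "datsun"    = "nissan",
                                "mercury"   = "ford"))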

3. Feature Exploration (Question 2.(b) Included)

First, of course, we want to know the correlation between different variables.

pairs.panels(CarData[,-c(9,10)],density=F,ellipses=F,
             main="Feature Correlation")

From the correlation plot I found:

  1. cylinders is correlated with mpg, so it can be a useful variable for prediction.

  2. displacement is also correlated with mpg, but it is worth noting that it is highly correlated with cylinders. Multicollinearity might not be an issue for prediction, but it is trouble for interpretation.

  3. horsepower and weight show a similar pattern to displacement.

  4. acceleration has a “purer” but weaker correlation with mpg; it is still a potentially useful variable.

  5. year is also a useful variable, since fuel efficiency seems to improve over those years. The pattern is more obvious when compared against gpm.

  6. origin is definitely a useful variable, because Japanese cars have a clear advantage in fuel efficiency.

  7. brand seems to be correlated with mpg, but this is dubious, since the relation may simply be driven by origin.

  8. sw looks like a factor influencing mpg, but as the data is unbalanced, this needs further testing.

Next, I inspected some of the more dubious correlations:

A. brand

I want to check whether mpg indeed varies by brands, or it just varies by origins.

BrandPlot <- plot_ly(data=CarData, y = ~mpg, x=~brand,
                     color = ~origin, type = "box")%>%
             layout(title="MPG by Brands", 
                    xaxis = list(title=""), yaxis =  list(title="MPG"))
BrandPlot

All right, brand seems to contribute no more information than origin, so it can take a rest now.

B. sw

t.test(CarData$mpg[CarData$sw==0],CarData$mpg[CarData$sw==1],
       var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  CarData$mpg[CarData$sw == 0] and CarData$mpg[CarData$sw == 1]
## t = 5.1431, df = 36.963, p-value = 9.079e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.346797 7.698258
## sample estimates:
## mean of x mean of y 
##  23.84038  18.31786

The small p-value shows that sw makes a difference in mpg, indicating that it could be a useful variable.

Further, I want to check some interaction terms.

C. year x origin

Different origins may have different evolution paths; maybe American cars caught up in fuel efficiency faster than European ones.

YearPlot<- plot_ly(data=CarData, y = ~mpg, x=~jitter(year),
                   hoverinfo = 'text',
                   text=~paste('Brand: ',brand,
                               '</br> Name: ', capitalize(name), 
                               '</br> MPG: ', mpg,
                               '</br> Displacement: ', displacement),
                   color = ~origin,type="scatter",mode="markers")%>%
           layout(title="MPG by Year", 
                  xaxis = list(title="Year"), 
                  yaxis =  list(title="MPG"))
YearPlot

The plot seems to support this hypothesis: American cars had clearly lower mpg in 1970, but in later years their mpg became more comparable with Japanese cars.

D. sw x origin

Maybe station wagons from different origins have different fuel efficiency?

aov.fit=aov(mpg~origin*sw,data=CarData)
summary(aov.fit)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## origin        2   7904    3952  98.859  < 2e-16 ***
## sw            1    464     464  11.616 0.000723 ***
## origin:sw     2     19      10   0.238 0.788332    
## Residuals   386  15431      40                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction term has a high p-value, so we fail to reject the null hypothesis: it is not useful.

E. displacement / horsepower

Some car designs focus on a luxury experience, while others focus on economy, so this ratio could be useful for distinguishing whether a car is a luxury car or not.

luxPlot=plot_ly(data=CarData, y = ~mpg, x=~I(displacement/horsepower),
                   hoverinfo = 'text',
                   text=~paste('Brand: ',brand,
                               '</br> Name: ', capitalize(name), 
                               '</br> MPG: ', mpg,
                               '</br> Displacement: ', displacement),
                   color=(~I(displacement/horsepower)>1.9),
                   type="scatter",mode="markers")%>%
        layout(title="Luxury vs Economics", 
                  xaxis = list(title="Displacement/Horsepower ratio"), 
                  yaxis =  list(title="MPG"))
luxPlot
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

The pattern shows that cars with displacement/horsepower > 1.9 seem to be luxury ones, while the rest are more likely economy cars.

CarData=mutate(CarData,luxury=(displacement/horsepower>1.9))
CarData=mutate(CarData,dhratio=displacement/horsepower)

F. General View

Finally, let's look at the big picture with all the continuous variables.

ConPlot <- plot_ly(CarData, x = ~weight, y = ~horsepower, z = ~acceleration,size=~displacement, color = ~mpg) %>%
  add_markers() %>%
  layout(title="Global View",
         scene = list(xaxis = list(title = 'Weight'),
                      yaxis = list(title = 'Horsepower'),
                      zaxis = list(title = 'Acceleration')))
ConPlot

It looks like no further interaction terms are obvious from this graph.

To conclude, the seemingly useful variables include: cylinders, displacement, horsepower, weight, acceleration, year, origin, sw, year*origin, sw*origin, displacement/horsepower.

4. Train/Test Partition (Question 2.(c))

Separate the data into a training set and a testing set. The partition ratio is 0.7, so there will not be too few observations on either side.

Train.ind = createDataPartition(CarData$mpg, p=0.7, list=FALSE)
training = CarData[ Train.ind, ]
testing = CarData[ -Train.ind, ]
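
A quick sanity check on the partition (a minimal sketch; with the roughly 392 complete rows here, about 70% should land in training):

# Partition sizes and the share of high-mpg cars in each split
nrow(training); nrow(testing)
prop.table(table(training$mpg_binary))
prop.table(table(testing$mpg_binary))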

5. Model Fitting (Question 2.(d))

I chose logistic regression and gradient boosting trees (GBTree). The reasons are:

Method                 Selected  Reason
Discriminant Analysis  No        Bad for categorical variables
KNN                    No        Curse of dimensionality
Decision Tree          No        Inferior to GBTree
Random Forests         No        Insufficient features
LASSO regression       No        Inferior to manual variable selection
Ridge regression       No        Inferior to manual variable selection
Elastic Net            No        Inferior to manual variable selection
Logistic regression    Yes       Best traditional model for binary prediction
GBTree                 Yes       Good for nonlinear modeling

A. Logistic Regression

Before fitting a logistic regression, I want to “cheat” a little bit here by fitting a linear regression to mpg, trying all plausible variables and interaction terms there. Then I use stepwise regression to perform variable selection, and use the selected variables to fit a logistic regression.

m1=lm(data=training,mpg~cylinders+displacement+horsepower+weight+acceleration+year+origin+sw+year*origin+sw*origin+I(displacement/horsepower)+luxury*displacement+luxury*sw+luxury*weight+luxury*cylinders)
summary(m1)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin + sw + year * origin + sw * 
##     origin + I(displacement/horsepower) + luxury * displacement + 
##     luxury * sw + luxury * weight + luxury * cylinders, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2817 -1.6708 -0.1533  1.4998 13.0941 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 11.225925   7.054354   1.591 0.112766    
## cylinders                   -0.015656   0.547719  -0.029 0.977218    
## displacement                 0.114839   0.033038   3.476 0.000598 ***
## horsepower                  -0.164868   0.053941  -3.056 0.002477 ** 
## weight                      -0.008567   0.001159  -7.394 2.02e-12 ***
## acceleration                 0.143612   0.104124   1.379 0.169022    
## year                         0.560911   0.072400   7.747 2.20e-13 ***
## originEuropean             -29.912285  11.566468  -2.586 0.010259 *  
## originJapan                -25.061109  10.892654  -2.301 0.022211 *  
## swTRUE                      -0.354445   2.045163  -0.173 0.862546    
## I(displacement/horsepower)  -6.055865   2.813095  -2.153 0.032274 *  
## luxuryTRUE                  -8.859993   4.053856  -2.186 0.029753 *  
## year:originEuropean          0.410963   0.150723   2.727 0.006841 ** 
## year:originJapan             0.353290   0.139881   2.526 0.012153 *  
## originEuropean:swTRUE       -0.797858   2.955743  -0.270 0.787428    
## originJapan:swTRUE           1.872383   2.728643   0.686 0.493211    
## displacement:luxuryTRUE     -0.029097   0.017350  -1.677 0.094750 .  
## swTRUE:luxuryTRUE            0.676152   2.220178   0.305 0.760957    
## weight:luxuryTRUE            0.004731   0.001329   3.561 0.000441 ***
## cylinders:luxuryTRUE        -0.670472   0.732243  -0.916 0.360717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.828 on 256 degrees of freedom
## Multiple R-squared:  0.8782, Adjusted R-squared:  0.8692 
## F-statistic: 97.18 on 19 and 256 DF,  p-value: < 2.2e-16

Most terms have significant p-values; those that do not are affected by collinearity, which is not an issue in a prediction setting.

m2=stepAIC(m1,direction = "both")
summary(m2)
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration + 
##     year + origin + I(displacement/horsepower) + luxury + year:origin + 
##     displacement:luxury + weight:luxury, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3372 -1.6858 -0.1269  1.5357 13.4216 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 10.877963   6.808056   1.598 0.111291    
## displacement                 0.110136   0.031486   3.498 0.000551 ***
## horsepower                  -0.155686   0.053333  -2.919 0.003815 ** 
## weight                      -0.008651   0.001143  -7.569 6.43e-13 ***
## acceleration                 0.162574   0.101237   1.606 0.109507    
## year                         0.551266   0.071146   7.748 2.05e-13 ***
## originEuropean             -32.376108  11.076248  -2.923 0.003769 ** 
## originJapan                -22.977678  10.267203  -2.238 0.026064 *  
## I(displacement/horsepower)  -5.580863   2.779133  -2.008 0.045656 *  
## luxuryTRUE                 -11.360172   3.345793  -3.395 0.000792 ***
## year:originEuropean          0.443522   0.144351   3.073 0.002346 ** 
## year:originJapan             0.327901   0.132088   2.482 0.013675 *  
## displacement:luxuryTRUE     -0.035873   0.013450  -2.667 0.008128 ** 
## weight:luxuryTRUE            0.004695   0.001317   3.564 0.000435 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.814 on 262 degrees of freedom
## Multiple R-squared:  0.8766, Adjusted R-squared:  0.8705 
## F-statistic: 143.2 on 13 and 262 DF,  p-value: < 2.2e-16

Use the significant terms to fit a logistic model. Also fit some reduced models for comparison.

m3=glm(mpg_binary~cylinders+displacement+horsepower+weight+
         acceleration+year+origin,
       family=binomial(logit),data=training)

summary(m3)
## 
## Call:
## glm(formula = mpg_binary ~ cylinders + displacement + horsepower + 
##     weight + acceleration + year + origin, family = binomial(logit), 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.43352  -0.07932  -0.00021   0.09822   2.45825  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -29.690230   8.232227  -3.607  0.00031 ***
## cylinders        0.063695   0.581088   0.110  0.91272    
## displacement     0.020775   0.017371   1.196  0.23171    
## horsepower      -0.048122   0.034826  -1.382  0.16704    
## weight          -0.007579   0.001852  -4.093 4.26e-05 ***
## acceleration     0.010591   0.193802   0.055  0.95642    
## year             0.664988   0.129520   5.134 2.83e-07 ***
## originEuropean   1.881182   0.922393   2.039  0.04140 *  
## originJapan      2.123111   0.952347   2.229  0.02579 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 381.689  on 275  degrees of freedom
## Residual deviance:  96.213  on 267  degrees of freedom
## AIC: 114.21
## 
## Number of Fisher Scoring iterations: 8
m4=glm(mpg_binary~cylinders+displacement+horsepower+weight+
         acceleration+year+origin+I(displacement/horsepower)+
         luxury+year*origin+displacement*luxury+
         weight*luxury+cylinders*luxury,
       family=binomial(logit),data=training)
summary(m4)
## 
## Call:
## glm(formula = mpg_binary ~ cylinders + displacement + horsepower + 
##     weight + acceleration + year + origin + I(displacement/horsepower) + 
##     luxury + year * origin + displacement * luxury + weight * 
##     luxury + cylinders * luxury, family = binomial(logit), data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.05839  -0.08966  -0.00063   0.05367   2.27936  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -30.154889  14.386434  -2.096 0.036077 *  
## cylinders                   -0.045725   0.813253  -0.056 0.955163    
## displacement                 0.112460   0.080852   1.391 0.164245    
## horsepower                  -0.152756   0.124357  -1.228 0.219309    
## weight                      -0.010416   0.002762  -3.772 0.000162 ***
## acceleration                -0.011565   0.222130  -0.052 0.958476    
## year                         0.827426   0.191296   4.325 1.52e-05 ***
## originEuropean              39.658324  17.224071   2.302 0.021307 *  
## originJapan                 -0.899112  25.880306  -0.035 0.972286    
## I(displacement/horsepower)  -4.330616   6.391604  -0.678 0.498059    
## luxuryTRUE                 -18.742915   9.539497  -1.965 0.049441 *  
## year:originEuropean         -0.491099   0.228049  -2.153 0.031281 *  
## year:originJapan             0.058898   0.354919   0.166 0.868198    
## displacement:luxuryTRUE     -0.074142   0.042330  -1.752 0.079852 .  
## weight:luxuryTRUE            0.008750   0.003312   2.642 0.008235 ** 
## cylinders:luxuryTRUE         0.731786   1.389095   0.527 0.598327    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 381.689  on 275  degrees of freedom
## Residual deviance:  78.859  on 260  degrees of freedom
## AIC: 110.86
## 
## Number of Fisher Scoring iterations: 8
m5=glm(mpg_binary~displacement+weight+
         year+origin+
         luxury+year*origin+displacement*luxury+
         weight*luxury,
       family=binomial(logit),data=training)
summary(m5)
## 
## Call:
## glm(formula = mpg_binary ~ displacement + weight + year + origin + 
##     luxury + year * origin + displacement * luxury + weight * 
##     luxury, family = binomial(logit), data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.35502  -0.09331  -0.00352   0.08623   2.35440  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -43.553328  10.965686  -3.972 7.13e-05 ***
## displacement              0.045038   0.023929   1.882  0.05982 .  
## weight                   -0.011592   0.002576  -4.499 6.81e-06 ***
## year                      0.892882   0.184922   4.828 1.38e-06 ***
## originEuropean           40.189119  16.314555   2.463  0.01376 *  
## originJapan               8.761303  23.201566   0.378  0.70572    
## luxuryTRUE              -18.225629   6.492766  -2.807  0.00500 ** 
## year:originEuropean      -0.506318   0.215231  -2.352  0.01865 *  
## year:originJapan         -0.089839   0.315608  -0.285  0.77591    
## displacement:luxuryTRUE  -0.052864   0.026292  -2.011  0.04436 *  
## weight:luxuryTRUE         0.009194   0.002968   3.098  0.00195 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 381.689  on 275  degrees of freedom
## Residual deviance:  85.862  on 265  degrees of freedom
## AIC: 107.86
## 
## Number of Fisher Scoring iterations: 8
anova(m3,m4,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: mpg_binary ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin
## Model 2: mpg_binary ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin + I(displacement/horsepower) + 
##     luxury + year * origin + displacement * luxury + weight * 
##     luxury + cylinders * luxury
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       267     96.213                       
## 2       260     78.859  7   17.355  0.01525 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m3,m5,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: mpg_binary ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin
## Model 2: mpg_binary ~ displacement + weight + year + origin + luxury + 
##     year * origin + displacement * luxury + weight * luxury
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       267     96.213                        
## 2       265     85.862  2   10.351 0.005653 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Many terms appear insignificant here, but that is acceptable, since the OLS fit gives us confidence that they belong in the model. The chi-squared tests also indicate that the reduced model m3, which uses only the raw features, is insufficient: both the full model m4 and the trimmed model m5 fit significantly better. Therefore, I decided to choose model m4.
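
As a compact side-by-side check, the AICs of the three candidates can be printed together (a minimal sketch; lower is better, though the choice above also weighs the deviance tests):

# Compare the three logistic models by AIC
AIC(m3, m4, m5)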

mpg_fit=predict(m4,newdata = testing,type="response")
mpg_test=as.logical(testing$mpg_binary)
confusionMatrix(data=(mpg_fit>0.5),reference = mpg_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    52    4
##      TRUE      7   53
##                                           
##                Accuracy : 0.9052          
##                  95% CI : (0.8367, 0.9517)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8105          
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.8814          
##             Specificity : 0.9298          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.8833          
##              Prevalence : 0.5086          
##          Detection Rate : 0.4483          
##    Detection Prevalence : 0.4828          
##       Balanced Accuracy : 0.9056          
##                                           
##        'Positive' Class : FALSE           
## 
roc.fit=roc(mpg_test~mpg_fit);roc.fit
## 
## Call:
## roc.formula(formula = mpg_test ~ mpg_fit)
## 
## Data: mpg_fit in 59 controls (mpg_test FALSE) < 57 cases (mpg_test TRUE).
## Area under the curve: 0.9655
plot.roc(roc.fit)

The test error rate for logistic regression is 0.0948 (1 - accuracy), with a 95% CI of (0.0483, 0.1633).
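
For reference, the error rate and its interval can be pulled directly from the confusionMatrix object (a sketch; cm is a hypothetical name for the object printed above):

cm = confusionMatrix(data = (mpg_fit > 0.5), reference = mpg_test)
1 - cm$overall["Accuracy"]                           # test error rate
1 - cm$overall[c("AccuracyUpper", "AccuracyLower")]  # 95% CI for the error rate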

B. Gradient Boosting Trees

Use caret to tune the parameters.

n.trees=c(120,140,160)
interaction.depth=c(3,4,5,6)
shrinkage=c(0.01,0.02,0.05,0.1)
n.minobsinnode=c(5,10,15)

para=expand.grid(n.trees=n.trees,
                 interaction.depth=interaction.depth,
                 shrinkage=shrinkage,
                 n.minobsinnode=n.minobsinnode)

m6=train(factor(mpg_binary)~cylinders+displacement+horsepower+weight+
         acceleration+year+origin+sw+dhratio,
         data=training,method="gbm",tuneGrid=para)

I actually ran the train() function several times beforehand to narrow down the region for the parameters; the grid shown above is just the final step. Also, since trees are good at modeling nonlinear relations, I did not need to include interaction terms.
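
For reference, class predictions can also be taken from the caret train object itself, which applies the selected tuning parameters automatically rather than a hand-picked n.trees (a sketch, assuming m6 as fitted above):

m6$bestTune                                  # tuning parameters selected by caret
gbm_class = predict(m6, newdata = testing)   # factor class predictions from the tuned model
table(gbm_class, testing$mpg_binary)         # quick cross-tabulation against the truth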

m6.fit=m6$finalModel

mpg_fit=predict(m6.fit,newdata = 
                  testing[,c(2,3,4,5,6,7,8,13,15)],n.trees=80)
mpg_test=as.logical(testing$mpg_binary)
confusionMatrix(data=(mpg_fit<0.5),reference = mpg_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    52    1
##      TRUE      7   56
##                                           
##                Accuracy : 0.931           
##                  95% CI : (0.8686, 0.9698)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8623          
##  Mcnemar's Test P-Value : 0.0771          
##                                           
##             Sensitivity : 0.8814          
##             Specificity : 0.9825          
##          Pos Pred Value : 0.9811          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.5086          
##          Detection Rate : 0.4483          
##    Detection Prevalence : 0.4569          
##       Balanced Accuracy : 0.9319          
##                                           
##        'Positive' Class : FALSE           
## 
roc.fit=roc(mpg_test~mpg_fit);roc.fit
## 
## Call:
## roc.formula(formula = mpg_test ~ mpg_fit)
## 
## Data: mpg_fit in 59 controls (mpg_test FALSE) > 57 cases (mpg_test TRUE).
## Area under the curve: 0.9822
plot.roc(roc.fit)

The error rate of the gradient boosting trees is 0.0690 (1 - accuracy), with a 95% CI of (0.0302, 0.1314).
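
Since both models score the same test set, the two ROC curves can also be compared formally with pROC's DeLong test (a sketch; roc.logit is a hypothetical name, recomputed here because both ROC objects above were called roc.fit):

# Recompute the logistic-regression ROC, then test it against the GBM ROC (roc.fit)
pred.logit = predict(m4, newdata = testing, type = "response")
roc.logit  = roc(mpg_test, pred.logit)
roc.test(roc.logit, roc.fit)   # DeLong's test for two paired ROC curves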

C. Other Findings

  1. American cars seem less fuel efficient, but they were improving and catching up with Japanese and European cars.

  2. Cars with a lower displacement/horsepower ratio tend to perform better in mpg.

  3. Different brands have different fuel-efficiency strategies, but brands of the same origin tend to have similar patterns.

  4. Acceleration does not seem to have as much influence on mpg as one might expect.

  5. Logistic regression performs comparably with gradient boosting trees; their accuracy confidence intervals overlap.

  6. No significant outliers are obvious from the graphs.

  7. The test-set AUC of the logistic regression is 0.9655 and of the GBM 0.9822.

Guanzhong You Apr 2, 2017