data_621_hw

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. The objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target.

packages <- c('tidyverse', 'broom', 'caret', 'kableExtra', 'janitor', 'Hmisc', 'MASS', 'corrplot', 'Metrics', 'purrr', 'janitor', 'ggmosaic', 'pscl')
pacman::p_load(char = packages)

Data Exploration

# read data
dfTrain <- read.csv("wine-training-data.csv", header = TRUE)
dfEval <- read.csv("wine-evaluation-data.csv", header = TRUE)

# dim
lapply(list(`wine-training` = dfTrain, `wine-evaluation` = dfEval), function(x) dim(x))
## $`wine-training`
## [1] 12795    16
## 
## $`wine-evaluation`
## [1] 3335   16

# head
head(dfTrain)
##   ï..INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1        1      3          3.2           1.160      -0.98          54.2
## 2        2      3          4.5           0.160      -0.81          26.1
## 3        4      5          7.1           2.640      -0.88          14.8
## 4        5      3          5.7           0.385       0.04          18.8
## 5        6      4          8.0           0.330      -1.26           9.4
## 6        7      0         11.3           0.320       0.59           2.2
##   Chlorides FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1    -0.567                NA                268 0.99280 3.33     -0.59     9.9
## 2    -0.425                15               -327 1.02792 3.38      0.70      NA
## 3     0.037               214                142 0.99518 3.12      0.48    22.0
## 4    -0.425                22                115 0.99640 2.24      1.83     6.2
## 5        NA              -167                108 0.99457 3.12      1.77    13.7
## 6     0.556               -37                 15 0.99940 3.20      1.29    15.4
##   LabelAppeal AcidIndex STARS
## 1           0         8     2
## 2          -1         7     3
## 3          -1         8     3
## 4          -1         6     1
## 5           0         9     2
## 6           0        11    NA

# let's clean up variable names and remove the index column
dfTrain <- janitor::clean_names(dfTrain) %>% dplyr::select(-i_index)
dfEval <- janitor::clean_names(dfEval) %>% dplyr::select(-`in`)

# meta data
metaDataTrain <- data.frame(fields = names(dfTrain), 
                            missing = colSums(is.na(dfTrain)),
                            unique = sapply(dfTrain, function(x) length(unique(x))),
                            type = sapply(dfTrain, class))

metaDataEval <- data.frame(fields = names(dfEval), 
                           missing = colSums(is.na(dfEval)),
                           unique = sapply(dfEval, function(x) length(unique(x))),
                           type = sapply(dfEval, class))

metaDataTrain
##                                    fields missing unique    type
## target                             target       0      9 integer
## fixed_acidity               fixed_acidity       0    470 numeric
## volatile_acidity         volatile_acidity       0    815 numeric
## citric_acid                   citric_acid       0    602 numeric
## residual_sugar             residual_sugar     616   2078 numeric
## chlorides                       chlorides     638   1664 numeric
## free_sulfur_dioxide   free_sulfur_dioxide     647   1000 numeric
## total_sulfur_dioxide total_sulfur_dioxide     682   1371 numeric
## density                           density       0   5933 numeric
## p_h                                   p_h     395    498 numeric
## sulphates                       sulphates    1210    631 numeric
## alcohol                           alcohol     653    402 numeric
## label_appeal                 label_appeal       0      5 integer
## acid_index                     acid_index       0     14 integer
## stars                               stars    3359      5 integer
metaDataEval
##                                    fields missing unique    type
## target                             target    3335      1 logical
## fixed_acidity               fixed_acidity       0    366 numeric
## volatile_acidity         volatile_acidity       0    533 numeric
## citric_acid                   citric_acid       0    448 numeric
## residual_sugar             residual_sugar     168   1224 numeric
## chlorides                       chlorides     138   1051 numeric
## free_sulfur_dioxide   free_sulfur_dioxide     152    700 numeric
## total_sulfur_dioxide total_sulfur_dioxide     157    940 numeric
## density                           density       0   2179 numeric
## p_h                                   p_h     104    379 numeric
## sulphates                       sulphates     310    466 numeric
## alcohol                           alcohol     185    288 numeric
## label_appeal                 label_appeal       0      5 integer
## acid_index                     acid_index       0     13 integer
## stars                               stars     841      5 integer

# barplot
barplot(table(dfTrain$target), main = "Train Set : Target")


# corrplot
cor(dfTrain, method = "pearson", use = "complete.obs") %>% 
        corrplot::corrplot(., method = "shade", type = "upper")


# density curve
dfGather <- dfTrain %>%
        dplyr::select(-label_appeal, -acid_index, -stars) %>%
        dplyr::mutate(target = as.factor(target)) %>%
        tidyr::gather(key, value, -target)

ggplot(dfGather, aes(value, color = target)) +
        geom_density() +
        geom_vline(data = aggregate(value ~ target + key, dfGather, median), 
                   aes(xintercept = value,
                       color = target),
                   linetype = "dashed") +
        facet_wrap(~ key, nrow = 5, scales = "free")


# mosaic plot
dfTrain %>%
        ggplot(.) +
        geom_mosaic(aes(x = product(target, label_appeal), fill = as.factor(target)), na.rm = TRUE) +
        labs(x = "label_appeal", fill = "Target") + 
        ggtitle("Mosaic Plot : Label Appeal")


dfTrain %>%
        ggplot(.) +
        geom_mosaic(aes(x = product(target, acid_index), fill = as.factor(target)), na.rm = TRUE) +
        labs(x = "acid_index", fill = "Target") + 
        ggtitle("Mosaic Plot : Acid Index")


dfTrain %>%
        ggplot(.) +
        geom_mosaic(aes(x = product(target, stars), fill = as.factor(target)), na.rm = TRUE) +
        labs(x = "stars", fill = "Target") + 
        ggtitle("Mosaic Plot : Stars")

Most of the data is normally distributed. We do not seem to have any variable that strongly correlates with our target variable. In fact, majority of these variables do not seem to corrlelate with one another, except stars and label_appeal (not surprising). We have many missing values that we need to take care of first.

Data Preparation

# impute - dfTrain
preProcValuesTrain <- preProcess(dfTrain, method = c("medianImpute"))
dfTrain <- predict(preProcValuesTrain, dfTrain)

# impute - dfEval
preProcValuesEval <- preProcess(dfEval, method = c("medianImpute"))
dfEval <- predict(preProcValuesEval, dfEval)

# check again
sapply(list(dfTrain, dfEval), function(x) colSums(is.na(x)))
##                      [,1] [,2]
## target                  0 3335
## fixed_acidity           0    0
## volatile_acidity        0    0
## citric_acid             0    0
## residual_sugar          0    0
## chlorides               0    0
## free_sulfur_dioxide     0    0
## total_sulfur_dioxide    0    0
## density                 0    0
## p_h                     0    0
## sulphates               0    0
## alcohol                 0    0
## label_appeal            0    0
## acid_index              0    0
## stars                   0    0

# split data
set.seed(1234)
index = sample(1:nrow(dfTrain), size = nrow(dfTrain) * .7, replace = FALSE)
trainSet <- dfTrain[index, ]
testSet <- dfTrain[-index, ]

Build Models

# poisson models
poisson_model_1 <- glm(target ~ ., family = "poisson", data = trainSet)
poisson_model_1 <- step(poisson_model_1, direction = "backward")
## Start:  AIC=35197.29
## target ~ fixed_acidity + volatile_acidity + citric_acid + residual_sugar + 
##     chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     density + p_h + sulphates + alcohol + label_appeal + acid_index + 
##     stars
## 
##                        Df Deviance   AIC
## - fixed_acidity         1    12808 35196
## - residual_sugar        1    12809 35196
## <none>                       12808 35197
## - density               1    12811 35199
## - p_h                   1    12813 35200
## - citric_acid           1    12814 35201
## - sulphates             1    12814 35202
## - alcohol               1    12816 35203
## - total_sulfur_dioxide  1    12816 35204
## - free_sulfur_dioxide   1    12820 35207
## - chlorides             1    12821 35208
## - volatile_acidity      1    12852 35239
## - acid_index            1    13393 35781
## - label_appeal          1    13573 35960
## - stars                 1    13593 35980
## 
## Step:  AIC=35195.58
## target ~ volatile_acidity + citric_acid + residual_sugar + chlorides + 
##     free_sulfur_dioxide + total_sulfur_dioxide + density + p_h + 
##     sulphates + alcohol + label_appeal + acid_index + stars
## 
##                        Df Deviance   AIC
## - residual_sugar        1    12809 35195
## <none>                       12808 35196
## - density               1    12812 35197
## - p_h                   1    12813 35199
## - citric_acid           1    12814 35199
## - sulphates             1    12815 35200
## - alcohol               1    12816 35201
## - total_sulfur_dioxide  1    12817 35202
## - free_sulfur_dioxide   1    12820 35206
## - chlorides             1    12821 35207
## - volatile_acidity      1    12852 35237
## - acid_index            1    13413 35799
## - label_appeal          1    13574 35960
## - stars                 1    13593 35979
## 
## Step:  AIC=35194.59
## target ~ volatile_acidity + citric_acid + chlorides + free_sulfur_dioxide + 
##     total_sulfur_dioxide + density + p_h + sulphates + alcohol + 
##     label_appeal + acid_index + stars
## 
##                        Df Deviance   AIC
## <none>                       12809 35195
## - density               1    12813 35196
## - p_h                   1    12814 35198
## - citric_acid           1    12815 35198
## - sulphates             1    12816 35199
## - alcohol               1    12817 35200
## - total_sulfur_dioxide  1    12818 35201
## - free_sulfur_dioxide   1    12821 35205
## - chlorides             1    12822 35206
## - volatile_acidity      1    12853 35236
## - acid_index            1    13414 35798
## - label_appeal          1    13575 35959
## - stars                 1    13596 35979
summary(poisson_model_1)
## 
## Call:
## glm(formula = target ~ volatile_acidity + citric_acid + chlorides + 
##     free_sulfur_dioxide + total_sulfur_dioxide + density + p_h + 
##     sulphates + alcohol + label_appeal + acid_index + stars, 
##     family = "poisson", data = trainSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4689  -0.5213   0.2005   0.6340   2.5391  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.025e+00  2.335e-01   8.674  < 2e-16 ***
## volatile_acidity     -5.181e-02  7.831e-03  -6.616 3.68e-11 ***
## citric_acid           1.656e-02  6.989e-03   2.370 0.017812 *  
## chlorides            -7.126e-02  1.968e-02  -3.621 0.000293 ***
## free_sulfur_dioxide   1.452e-04  4.150e-05   3.497 0.000470 ***
## total_sulfur_dioxide  7.928e-05  2.698e-05   2.939 0.003297 ** 
## density              -4.238e-01  2.284e-01  -1.855 0.063561 .  
## p_h                  -2.068e-02  9.172e-03  -2.255 0.024140 *  
## sulphates            -1.759e-02  6.871e-03  -2.560 0.010467 *  
## alcohol               4.627e-03  1.696e-03   2.729 0.006356 ** 
## label_appeal          1.996e-01  7.213e-03  27.677  < 2e-16 ***
## acid_index           -1.249e-01  5.266e-03 -23.711  < 2e-16 ***
## stars                 2.204e-01  7.791e-03  28.285  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15963  on 8955  degrees of freedom
## Residual deviance: 12809  on 8943  degrees of freedom
## AIC: 35195
## 
## Number of Fisher Scoring iterations: 5

poisson_model_2 <- pscl::zeroinfl(target ~ ., data = trainSet)
summary(poisson_model_2)
## 
## Call:
## pscl::zeroinfl(formula = target ~ ., data = trainSet)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0030 -0.3625  0.1578  0.5023  4.2839 
## 
## Count model coefficients (poisson with log link):
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.551e+00  2.474e-01   6.271 3.60e-10 ***
## fixed_acidity         4.890e-04  1.027e-03   0.476  0.63391    
## volatile_acidity     -1.526e-02  8.326e-03  -1.833  0.06679 .  
## citric_acid          -5.891e-05  7.294e-03  -0.008  0.99356    
## residual_sugar       -9.120e-05  1.953e-04  -0.467  0.64043    
## chlorides            -1.701e-02  2.071e-02  -0.822  0.41133    
## free_sulfur_dioxide   4.438e-05  4.238e-05   1.047  0.29496    
## total_sulfur_dioxide -3.596e-05  2.741e-05  -1.312  0.18961    
## density              -4.440e-01  2.423e-01  -1.832  0.06693 .  
## p_h                   6.828e-03  9.697e-03   0.704  0.48138    
## sulphates             2.226e-03  7.265e-03   0.306  0.75930    
## alcohol               7.806e-03  1.765e-03   4.422 9.77e-06 ***
## label_appeal          2.507e-01  7.618e-03  32.911  < 2e-16 ***
## acid_index           -1.799e-02  5.965e-03  -3.017  0.00256 ** 
## stars                 1.010e-01  7.778e-03  12.979  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -4.7934085  1.2454084  -3.849 0.000119 ***
## fixed_acidity         0.0064796  0.0052037   1.245 0.213060    
## volatile_acidity      0.2376106  0.0419756   5.661 1.51e-08 ***
## citric_acid          -0.0993046  0.0378625  -2.623 0.008722 ** 
## residual_sugar       -0.0015738  0.0009954  -1.581 0.113839    
## chlorides             0.3491774  0.1049534   3.327 0.000878 ***
## free_sulfur_dioxide  -0.0006455  0.0002220  -2.907 0.003647 ** 
## total_sulfur_dioxide -0.0007660  0.0001437  -5.330 9.85e-08 ***
## density              -0.1038197  1.2239272  -0.085 0.932400    
## p_h                   0.1707674  0.0490124   3.484 0.000494 ***
## sulphates             0.1254033  0.0367406   3.413 0.000642 ***
## alcohol               0.0156930  0.0090488   1.734 0.082872 .  
## label_appeal          0.3154615  0.0397371   7.939 2.04e-15 ***
## acid_index            0.4904284  0.0237354  20.662  < 2e-16 ***
## stars                -0.6951840  0.0463883 -14.986  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 35 
## Log-likelihood: -1.575e+04 on 30 Df
poisson_model_2$coefficients
## $count
##          (Intercept)        fixed_acidity     volatile_acidity 
##         1.551071e+00         4.890169e-04        -1.526259e-02 
##          citric_acid       residual_sugar            chlorides 
##        -5.890782e-05        -9.120066e-05        -1.701147e-02 
##  free_sulfur_dioxide total_sulfur_dioxide              density 
##         4.438266e-05        -3.595520e-05        -4.440066e-01 
##                  p_h            sulphates              alcohol 
##         6.827573e-03         2.226029e-03         7.806270e-03 
##         label_appeal           acid_index                stars 
##         2.507153e-01        -1.799343e-02         1.009544e-01 
## 
## $zero
##          (Intercept)        fixed_acidity     volatile_acidity 
##        -4.7934084628         0.0064795831         0.2376106142 
##          citric_acid       residual_sugar            chlorides 
##        -0.0993046359        -0.0015738373         0.3491773765 
##  free_sulfur_dioxide total_sulfur_dioxide              density 
##        -0.0006454633        -0.0007660401        -0.1038197157 
##                  p_h            sulphates              alcohol 
##         0.1707674358         0.1254033497         0.0156929524 
##         label_appeal           acid_index                stars 
##         0.3154615094         0.4904283874        -0.6951839602

# Vuong comparison
pscl::vuong(poisson_model_1, poisson_model_2)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -32.10931 model2 > model1 < 2.22e-16
## AIC-corrected         -31.81212 model2 > model1 < 2.22e-16
## BIC-corrected         -30.75709 model2 > model1 < 2.22e-16

Overall, it seems to be the case that stronger sales is associated with more citric, sulfur dioxide containing, alcohol, label appeal, and stars. On the other hand, the more dense, more sulphates, more chlorides and less acid is more likely to contribute to less sales. We conduct the Vuong test to compare the first model with the zero-inflated model and it turns out that the zero-inflated model outperforms the first model as we can see from the output.

# negative binomial models
negative_binomial_3 <- glm.nb(target ~ ., data = trainSet)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(negative_binomial_3)
## 
## Call:
## glm.nb(formula = target ~ ., data = trainSet, init.theta = 40280.1003, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4671  -0.5236   0.1992   0.6334   2.5376  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.025e+00  2.335e-01   8.675  < 2e-16 ***
## fixed_acidity        -5.329e-04  9.818e-04  -0.543 0.587304    
## volatile_acidity     -5.178e-02  7.832e-03  -6.611 3.82e-11 ***
## citric_acid           1.667e-02  6.990e-03   2.384 0.017107 *  
## residual_sugar        1.846e-04  1.857e-04   0.994 0.320274    
## chlorides            -7.117e-02  1.968e-02  -3.617 0.000298 ***
## free_sulfur_dioxide   1.448e-04  4.151e-05   3.488 0.000486 ***
## total_sulfur_dioxide  7.863e-05  2.699e-05   2.914 0.003574 ** 
## density              -4.242e-01  2.284e-01  -1.857 0.063340 .  
## p_h                  -2.090e-02  9.176e-03  -2.278 0.022728 *  
## sulphates            -1.744e-02  6.873e-03  -2.538 0.011164 *  
## alcohol               4.668e-03  1.696e-03   2.752 0.005928 ** 
## label_appeal          1.995e-01  7.214e-03  27.661  < 2e-16 ***
## acid_index           -1.244e-01  5.325e-03 -23.366  < 2e-16 ***
## stars                 2.202e-01  7.793e-03  28.261  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40280.1) family taken to be 1)
## 
##     Null deviance: 15962  on 8955  degrees of freedom
## Residual deviance: 12807  on 8941  degrees of freedom
## AIC: 35199
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40280 
##           Std. Err.:  71902 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -35167.38

# let's remove two variables from above model that are not significant and try again
negative_binomial_4 <- glm.nb(target ~ ., data = trainSet %>% dplyr::select(-fixed_acidity, -residual_sugar))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(negative_binomial_4)
## 
## Call:
## glm.nb(formula = target ~ ., data = trainSet %>% dplyr::select(-fixed_acidity, 
##     -residual_sugar), init.theta = 40292.42361, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4688  -0.5212   0.2005   0.6339   2.5390  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.025e+00  2.335e-01   8.674  < 2e-16 ***
## volatile_acidity     -5.181e-02  7.831e-03  -6.616 3.68e-11 ***
## citric_acid           1.656e-02  6.989e-03   2.369 0.017817 *  
## chlorides            -7.126e-02  1.968e-02  -3.621 0.000293 ***
## free_sulfur_dioxide   1.452e-04  4.151e-05   3.497 0.000470 ***
## total_sulfur_dioxide  7.928e-05  2.698e-05   2.939 0.003298 ** 
## density              -4.238e-01  2.285e-01  -1.855 0.063570 .  
## p_h                  -2.068e-02  9.172e-03  -2.255 0.024143 *  
## sulphates            -1.759e-02  6.871e-03  -2.560 0.010468 *  
## alcohol               4.627e-03  1.696e-03   2.729 0.006358 ** 
## label_appeal          1.996e-01  7.213e-03  27.676  < 2e-16 ***
## acid_index           -1.249e-01  5.266e-03 -23.711  < 2e-16 ***
## stars                 2.204e-01  7.791e-03  28.284  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40292.42) family taken to be 1)
## 
##     Null deviance: 15962  on 8955  degrees of freedom
## Residual deviance: 12809  on 8943  degrees of freedom
## AIC: 35197
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40292 
##           Std. Err.:  71929 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -35168.68

The above two models perform pretty much the same, so in this case we can go after the one that has less variables (thus more precise).

# multi-linear model
# multi_linear_5 <- lm(target ~ ., trainSet)
# multi_linear_5 <- step(multi_linear_5, direction = 'backward')
summary(multi_linear_5)
## 
## Call:
## lm(formula = target ~ volatile_acidity + citric_acid + chlorides + 
##     free_sulfur_dioxide + total_sulfur_dioxide + density + p_h + 
##     sulphates + alcohol + label_appeal + acid_index + stars, 
##     data = trainSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1942 -0.7512  0.3593  1.1162  4.3042 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.347e+00  6.533e-01   8.184 3.11e-16 ***
## volatile_acidity     -1.548e-01  2.202e-02  -7.030 2.22e-12 ***
## citric_acid           5.099e-02  1.975e-02   2.582 0.009833 ** 
## chlorides            -2.260e-01  5.508e-02  -4.104 4.10e-05 ***
## free_sulfur_dioxide   4.422e-04  1.170e-04   3.780 0.000158 ***
## total_sulfur_dioxide  2.305e-04  7.561e-05   3.049 0.002306 ** 
## density              -1.240e+00  6.411e-01  -1.935 0.053080 .  
## p_h                  -5.085e-02  2.569e-02  -1.980 0.047764 *  
## sulphates            -4.925e-02  1.926e-02  -2.557 0.010564 *  
## alcohol               1.692e-02  4.750e-03   3.561 0.000371 ***
## label_appeal          6.034e-01  2.015e-02  29.952  < 2e-16 ***
## acid_index           -3.322e-01  1.304e-02 -25.469  < 2e-16 ***
## stars                 7.481e-01  2.333e-02  32.067  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.62 on 8943 degrees of freedom
## Multiple R-squared:  0.292,  Adjusted R-squared:  0.2911 
## F-statistic: 307.4 on 12 and 8943 DF,  p-value: < 2.2e-16

# try two degree of interaction
# multi_linear_6 <- lm(target ~.^2, trainSet)
# multi_linear_6 <- step(multi_linear_6, direction = 'backward')
summary(multi_linear_6)
## 
## Call:
## lm(formula = target ~ fixed_acidity + volatile_acidity + citric_acid + 
##     chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     density + p_h + sulphates + alcohol + label_appeal + acid_index + 
##     stars + fixed_acidity:free_sulfur_dioxide + fixed_acidity:density + 
##     fixed_acidity:acid_index + volatile_acidity:density + volatile_acidity:p_h + 
##     volatile_acidity:stars + citric_acid:free_sulfur_dioxide + 
##     citric_acid:total_sulfur_dioxide + citric_acid:sulphates + 
##     citric_acid:label_appeal + chlorides:acid_index + chlorides:stars + 
##     free_sulfur_dioxide:density + free_sulfur_dioxide:p_h + free_sulfur_dioxide:acid_index + 
##     total_sulfur_dioxide:density + total_sulfur_dioxide:acid_index + 
##     density:alcohol + sulphates:stars + alcohol:label_appeal + 
##     alcohol:acid_index + alcohol:stars + label_appeal:acid_index + 
##     label_appeal:stars + acid_index:stars, data = trainSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2526 -0.7560  0.3177  1.1023  5.0959 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.162e+01  2.189e+00   5.309 1.13e-07 ***
## fixed_acidity                     -2.708e-01  9.910e-02  -2.733 0.006290 ** 
## volatile_acidity                  -1.485e+00  7.972e-01  -1.862 0.062610 .  
## citric_acid                        3.550e-02  2.574e-02   1.379 0.167843    
## chlorides                          6.291e-01  3.522e-01   1.786 0.074087 .  
## free_sulfur_dioxide               -8.892e-03  4.350e-03  -2.044 0.040948 *  
## total_sulfur_dioxide               3.769e-03  2.770e-03   1.361 0.173612    
## density                           -6.630e+00  2.156e+00  -3.075 0.002114 ** 
## p_h                               -3.500e-02  2.766e-02  -1.265 0.205821    
## sulphates                         -1.642e-01  5.458e-02  -3.009 0.002625 ** 
## alcohol                           -2.432e-01  1.806e-01  -1.347 0.178136    
## label_appeal                       6.317e-01  1.442e-01   4.381 1.19e-05 ***
## acid_index                        -3.767e-01  5.827e-02  -6.465 1.07e-10 ***
## stars                             -2.632e-01  1.696e-01  -1.552 0.120620    
## fixed_acidity:free_sulfur_dioxide -4.458e-05  1.872e-05  -2.381 0.017292 *  
## fixed_acidity:density              3.043e-01  9.886e-02   3.079 0.002086 ** 
## fixed_acidity:acid_index          -4.135e-03  1.918e-03  -2.155 0.031153 *  
## volatile_acidity:density           1.381e+00  7.928e-01   1.742 0.081478 .  
## volatile_acidity:p_h              -4.838e-02  3.232e-02  -1.497 0.134390    
## volatile_acidity:stars             5.831e-02  2.836e-02   2.056 0.039788 *  
## citric_acid:free_sulfur_dioxide    3.082e-04  1.327e-04   2.322 0.020254 *  
## citric_acid:total_sulfur_dioxide  -2.010e-04  8.593e-05  -2.339 0.019339 *  
## citric_acid:sulphates              5.731e-02  2.171e-02   2.640 0.008309 ** 
## citric_acid:label_appeal           4.534e-02  2.147e-02   2.112 0.034718 *  
## chlorides:acid_index              -1.354e-01  3.953e-02  -3.425 0.000617 ***
## chlorides:stars                    1.023e-01  7.139e-02   1.433 0.151797    
## free_sulfur_dioxide:density        7.817e-03  4.296e-03   1.820 0.068811 .  
## free_sulfur_dioxide:p_h           -2.620e-04  1.673e-04  -1.566 0.117342    
## free_sulfur_dioxide:acid_index     3.347e-04  9.271e-05   3.610 0.000308 ***
## total_sulfur_dioxide:density      -4.461e-03  2.752e-03  -1.621 0.105034    
## total_sulfur_dioxide:acid_index    1.235e-04  5.775e-05   2.139 0.032478 *  
## density:alcohol                    2.894e-01  1.793e-01   1.613 0.106678    
## sulphates:stars                    4.727e-02  2.516e-02   1.879 0.060266 .  
## alcohol:label_appeal              -8.775e-03  5.677e-03  -1.546 0.122238    
## alcohol:acid_index                -9.119e-03  3.492e-03  -2.611 0.009032 ** 
## alcohol:stars                      2.064e-02  6.357e-03   3.247 0.001172 ** 
## label_appeal:acid_index           -7.326e-02  1.502e-02  -4.879 1.09e-06 ***
## label_appeal:stars                 3.099e-01  2.468e-02  12.556  < 2e-16 ***
## acid_index:stars                   8.903e-02  1.953e-02   4.559 5.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.595 on 8917 degrees of freedom
## Multiple R-squared:  0.3162, Adjusted R-squared:  0.3133 
## F-statistic: 108.5 on 38 and 8917 DF,  p-value: < 2.2e-16

The second multi-linear regression model (multi_linear_6) is better than the first one (multi_linear_5), e.g. adjsuted R-squared 0.3133 versus 0.2911; however, it is on two degree of interaction and therefore it is more complicated and difficult to explain to stakeholders. It may result in overfitting or simply too complicated to be applicable.

Select Models

# poisson model
targetMean <- round(mean(trainSet$target), 3)
targetVar <- round(var(trainSet$target), 3)

print(paste0("the mean of the targt is approximately ", targetMean))
## [1] "the mean of the targt is approximately 3.025"
print(paste0("the variance of the targt is approximately ", targetVar))
## [1] "the variance of the targt is approximately 3.703"

Since the mean and variance of the target variable is not identical (not even close), we suspect that there is a case of overdispersion and therefore it is less than ideal to apply the poisson models to fit our data in this case.

# prediction accuracy
testSet <- testSet %>% 
        dplyr::mutate(pred_nb_3 = round(predict(negative_binomial_3, newdata = testSet, type = "response"), 0),
                      pred_nb_4 = round(predict(negative_binomial_4, newdata = testSet, type = "response"), 0),
                      pred_lm_5 = round(predict(multi_linear_5, newdata = testSet, type = "response"), 0),
                      pred_lm_6 = round(predict(multi_linear_6, newdata = testSet, type = "response"), 0),
                      pred_nb_3_accuracy = dplyr::case_when(pred_nb_3 == target ~ 1, TRUE ~ 0),
                      pred_nb_4_accuracy = dplyr::case_when(pred_nb_4 == target ~ 1, TRUE ~ 0),
                      pred_lm_5_accuracy = dplyr::case_when(pred_lm_5 == target ~ 1, TRUE ~ 0),
                      pred_lm_6_accuracy = dplyr::case_when(pred_lm_6 == target ~ 1, TRUE ~ 0))

accuracy <- lapply(testSet %>% dplyr::select(pred_nb_3_accuracy, pred_nb_4_accuracy, pred_lm_5_accuracy, pred_lm_6_accuracy),
       sum) %>% unlist 

print(round(accuracy / nrow(testSet), 3)) %>% kable()
## pred_nb_3_accuracy pred_nb_4_accuracy pred_lm_5_accuracy pred_lm_6_accuracy 
##              0.238              0.238              0.258              0.255

	x
pred_nb_3_accuracy	0.238
pred_nb_4_accuracy	0.238
pred_lm_5_accuracy	0.258
pred_lm_6_accuracy	0.255


# mean squared error
mse <- lapply(list(negative_binomial_3, negative_binomial_4, multi_linear_5, multi_linear_6),
              function(x) mean(x$residuals ^2))

names(mse) <- c("negative_binomial_3: mse", "negative_binomial_4: mse", "multi_linear_5: mse", "multi_linear_6: mse")

print(sapply(mse, function(x) round(x, 3))) %>% kable()
## negative_binomial_3: mse negative_binomial_4: mse      multi_linear_5: mse 
##                    0.382                    0.382                    2.621 
##      multi_linear_6: mse 
##                    2.532

	x
negative_binomial_3: mse	0.382
negative_binomial_4: mse	0.382
multi_linear_5: mse	2.621
multi_linear_6: mse	2.532


# evaluation set
dfEval <- dfEval %>%
        dplyr::mutate(pred = round(predict(negative_binomial_4, newdata = dfEval, type = "response"), 0))

# barplot for evaluation set (side by side with the train set)
par(mfrow = c(1, 2))
barplot(table(dfTrain$target), main = "Train Set : Target")
barplot(table(dfEval$pred), main = "Evaluation Set : Target Prediction")

Since the negative binomial models report its result with an AIC and log likelihood, whereas the multi-linear regression approach reports with adjusted R-squared, we cannot directly compare their outputs. Instead, we look at the accuracy of their predictions and calculate the mean squared error. The overall accuracy among the models actually do not differ so much. The linear models actually do slightly better than the negative binomial models (26% vs 24%); however, these two approaches differ significantly in the mean squared error. As a result, we would choose negative_binomial_4, which is highly similar to model 3 but we remove two variables that have no predictive power, i.e. fixed_acidity, and residual_sugar.

data_621_hw_5

Jimmy Ng

5/8/2020

Overview

Data Exploration

Data Preparation

Build Models

Select Models