Assignment 5

DATA 621 – Business Analytics and Data Mining

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

VARIABLE NAME DEFINITION THEORETICAL EFFECT INDEX Identification Variable (do not use) None TARGET Number of Cases Purchased None AcidIndex Proprietary method of testing total acidity of wine by using a weighted average Alcohol Alcohol Content Chlorides Chloride content of wine CitricAcid Citric Acid Content Density Density of Wine FixedAcidity Fixed Acidity of Wine FreeSulfurDioxide Sulfur Dioxide content of wine LabelAppeal Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales. ResidualSugar Residual Sugar of wine STARS Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales Sulphates Sulfate content of wine TotalSulfurDioxide Total Sulfur Dioxide of Wine VolatileAcidity Volatile Acid content of wine pH pH of wine

Deliverables:

A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details. Assigned predictions (number of cases of wine sold) for the evaluation data set. Include your R statistical programming code in an Appendix.

Write Up:

1. DATA EXPLORATION (25 Points)

Describe the size and the variables in the wine training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

  1. Mean / Standard Deviation / Median
  2. Bar Chart or Box Plot of the data
  3. Is the data correlated to the target variable (or to other variables?)
  4. Are any of the variables missing and need to be imputed “fixed”?

DATA EXPLORATION SOLUTION

There are 12,795 records in the file and 16 variables. All the variables are numeric in nature. Per guidance in the problem, I’ve included below both a numerical summary (mean, median, standard deviation) as well as histogram plots for those variables.

We notice there are a few variables with NA’s - ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol and STARS which may need to be fixed. Upon investigating the records in the file, the TARGET column which has zero values appears to be legitimate and not entered in error. Similarly, looking at the maximum values doesn’t cause any alarm. Finally, a correlation plot indicates the numeric correlation values (-1 to +1) of the quantitative variables in the dataset.

A visual plot of the variables indicates most have a skewed distribution. There appears to be only one normally distributed variables - Alcohol Content.

From the correlation matrix plot, we identify only a few variables that are correlated - LabelAppeal & TARGET (0.50), STARS & TARGET (0.55), LabelAppeal & STARS (0.32). This knowledge helps us narrow the size of the predictive model by including fewer variables. For example, of all the predictive variables, Label appeal and wine rating by a team of experts seem to be factors that have a higher predictive value than other variables in predicting the number of cases purchased. Furthermore, AcidIndex and the target variable are slightly negatively correlated (-0.17) while FixedAcidity and Acidindex are slightly positively correlated (0.15).

library(rmarkdown)
library(corrplot)
## corrplot 0.92 loaded
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Reading in the training and evaluation data files

training <- read.csv("/Users/tponnada/Downloads/wine-training-data.csv")
eval <- read.csv("/Users/tponnada/Downloads/wine-evaluation-data.csv")

#Checking the first 6 rows of the training data set, the dimensions of the data set and the usual univariate summary information.

head(training)
##   INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1     1      3          3.2           1.160      -0.98          54.2    -0.567
## 2     2      3          4.5           0.160      -0.81          26.1    -0.425
## 3     4      5          7.1           2.640      -0.88          14.8     0.037
## 4     5      3          5.7           0.385       0.04          18.8    -0.425
## 5     6      4          8.0           0.330      -1.26           9.4        NA
## 6     7      0         11.3           0.320       0.59           2.2     0.556
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                NA                268 0.99280 3.33     -0.59     9.9
## 2                15               -327 1.02792 3.38      0.70      NA
## 3               214                142 0.99518 3.12      0.48    22.0
## 4                22                115 0.99640 2.24      1.83     6.2
## 5              -167                108 0.99457 3.12      1.77    13.7
## 6               -37                 15 0.99940 3.20      1.29    15.4
##   LabelAppeal AcidIndex STARS
## 1           0         8     2
## 2          -1         7     3
## 3          -1         8     3
## 4          -1         6     1
## 5           0         9     2
## 6           0        11    NA
dim(training)
## [1] 12795    16
summary(training)
##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359
#Standard deviations of variables 

sapply(training[,c(1:16)], sd)
##              INDEX             TARGET       FixedAcidity    VolatileAcidity 
##       4.656905e+03       1.926368e+00       6.317643e+00       7.840142e-01 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##       8.620798e-01                 NA                 NA                 NA 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##                 NA       2.653765e-02                 NA                 NA 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##                 NA       8.910892e-01       1.323926e+00                 NA
#Univariate plots using histograms, kernel density estimates and sorted data plotted against its index for the 15 numeric variables (excluding index variable which doesn't have any explanatory value).

#Number of Cases Purchased

hist(training$TARGET, xlab = "Number of Cases Purchased", main = "")

plot(density(training$TARGET, na.rm = TRUE), main = "")

plot(sort(training$TARGET), ylab = "Number of Cases Purchased")

#AcidIndex

hist(training$AcidIndex, xlab = "Proprietary method of testing total acidity of wine by using a weighted average", main = "")

plot(density(training$AcidIndex, na.rm = TRUE), main = "")

plot(sort(training$AcidIndex), ylab = "Proprietary method of testing total acidity of wine by using a weighted average")

boxplot(AcidIndex ~ TARGET, training)

#Alcohol Content

hist(training$Alcohol, xlab = "Alcohol Content", main = "")

plot(density(training$Alcohol, na.rm = TRUE), main = "")

plot(sort(training$Alcohol), ylab = "Alcohol Content")

boxplot(Alcohol ~ TARGET, training)

#Chloride content of wine

hist(training$Chlorides, xlab = "Chloride content of wine", main = "")

plot(density(training$Chlorides, na.rm = TRUE), main = "")

plot(sort(training$Chlorides), ylab = "Chloride content of wine")

boxplot(Chlorides ~ TARGET, training)

#Citric Acid Content

hist(training$CitricAcid, xlab = "Citric Acid Content", main = "")

plot(density(training$CitricAcid, na.rm = TRUE), main = "")

plot(sort(training$CitricAcid), ylab = "Citric Acid Content")

boxplot(CitricAcid ~ TARGET, training)

#Density of Wine

hist(training$Density, xlab = "Density of Wine", main = "")

plot(density(training$Density, na.rm = TRUE), main = "")

plot(sort(training$Density), ylab = "Density of Wine")

boxplot(Density ~ TARGET, training)

#Fixed Acidity of Wine

hist(training$FixedAcidity, xlab = "Fixed Acidity of Wine", main = "")

plot(density(training$FixedAcidity, na.rm = TRUE), main = "")

plot(sort(training$FixedAcidity), ylab = "Fixed Acidity of Wine")

boxplot(FixedAcidity ~ TARGET, training)

#Sulfur Dioxide content of wine

hist(training$FreeSulfurDioxide, xlab = "Sulfur Dioxide content of wine", main = "")

plot(density(training$FreeSulfurDioxide, na.rm = TRUE), main = "")

plot(sort(training$FreeSulfurDioxide), ylab = "Sulfur Dioxide content of wine")

boxplot(FreeSulfurDioxide ~ TARGET, training)

#Marketing Score indicating the appeal of label design for consumers

hist(training$LabelAppeal, xlab = "Marketing Score indicating the appeal of label design for consumers", main = "")

plot(density(training$LabelAppeal, na.rm = TRUE), main = "")

plot(sort(training$LabelAppeal), ylab = "Marketing Score indicating the appeal of label design for consumers")

boxplot(LabelAppeal ~ TARGET, training)

#Residual Sugar of wine

hist(training$ResidualSugar, xlab = "Residual Sugar of wine", main = "")

plot(density(training$ResidualSugar, na.rm = TRUE), main = "")

plot(sort(training$ResidualSugar), ylab = "Residual Sugar of wine")

boxplot(ResidualSugar ~ TARGET, training)

#Wine rating by a team of experts

hist(training$STARS, xlab = "Wine rating by a team of experts", main = "")

plot(density(training$STARS, na.rm = TRUE), main = "")

plot(sort(training$STARS), ylab = "Wine rating by a team of experts")

boxplot(STARS ~ TARGET, training)

#Sulfate content of wine

hist(training$Sulphates, xlab = "Sulfate content of wine", main = "")

plot(density(training$Sulphates, na.rm = TRUE), main = "")

plot(sort(training$Sulphates), ylab = "Sulfate content of wine")

boxplot(Sulphates ~ TARGET, training)

#Total Sulfur Dioxide of Wine

hist(training$TotalSulfurDioxide, xlab = "Total Sulfur Dioxide of Wine", main = "")

plot(density(training$TotalSulfurDioxide, na.rm = TRUE), main = "")

plot(sort(training$TotalSulfurDioxide), ylab = "Total Sulfur Dioxide of Wine")

boxplot(TotalSulfurDioxide ~ TARGET, training)

#Volatile Acid content of wine

hist(training$VolatileAcidity, xlab = "Volatile Acid content of wine", main = "")

plot(density(training$VolatileAcidity, na.rm = TRUE), main = "")

plot(sort(training$VolatileAcidity), ylab = "Volatile Acid content of wine")

boxplot(VolatileAcidity ~ TARGET, training)

#pH of wine

hist(training$pH, xlab = "pH of wine", main = "")

plot(density(training$pH, na.rm = TRUE), main = "")

plot(sort(training$pH), ylab = "pH of wine")

boxplot(pH ~ TARGET, training)

#Instead of using scatterplots for each of the 15 variables against each other, I used the correlation matrix.

p.mat <- cor(training, use = "na.or.complete")
corrplot(p.mat, method = 'number', type = 'lower', diag = FALSE, number.cex = 0.6, tl.cex = 0.6, cl.cex = 0.6)

2. DATA PREPARATION (25 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

  1. Fix missing values (maybe with a Mean or Median value)
  2. Create flags to suggest if a variable was missing
  3. Transform data by putting it into buckets
  4. Mathematical transforms such as log or square root (or use Box-Cox)
  5. Combine variables (such as ratios or adding or multiplying) to create new variables

DATA PREPARATION SOLUTION

As observed in the data exploration section, there are some variables with missing NA values. ResidualSugar has 616 (4.8%) records that have NA values, Chlorides has 638 (5.0%) records that have NA values, FreeSulfurDioxide has 647 (5.1%) records that have NA values, TotalSulfurDioxide has 682 (5.3%) records that have NA values, pH has 395 (3.1%) records that have NA values, Sulphates has 1,210 (9.5%) records that have NA values, Alcohol that has 653 (5.1%) records with NA values and finally STARS that has 3,359 (26.2%) records that have NA values. We fill the missing values for these variables with the median values.

We also calculate the variance inflation factor (vif) of each variable, which measures the strength and correlation between the predictor variables in a regression model. A value > 5 indicates potentially severe correlation between a given predictor variable and other predictors in the model. Based on this threshold and the calculated vif scores below, we don’t see any variables that are severely correlated with each other.

require(MASS)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
require(car)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
require(kableExtra)
## Loading required package: kableExtra
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
require(dplyr)

sapply(training, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
INDEX 0
TARGET 0
FixedAcidity 0
VolatileAcidity 0
CitricAcid 0
ResidualSugar 616
Chlorides 638
FreeSulfurDioxide 647
TotalSulfurDioxide 682
Density 0
pH 395
Sulphates 1210
Alcohol 653
LabelAppeal 0
AcidIndex 0
STARS 3359
##Fill missing records with missing values

trainingdata <- training %>% mutate_at(vars(c("ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "pH", "Sulphates", "Alcohol", "STARS")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))

lmod <- lm(TARGET ~., training)
summary(lmod)
## 
## Call:
## lm(formula = TARGET ~ ., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0514 -0.5137  0.1275  0.7216  3.2670 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.515e+00  5.537e-01   8.155 4.16e-16 ***
## INDEX               5.124e-06  3.110e-06   1.647   0.0995 .  
## FixedAcidity        1.698e-03  2.319e-03   0.732   0.4640    
## VolatileAcidity    -9.466e-02  1.846e-02  -5.129 2.99e-07 ***
## CitricAcid         -5.597e-03  1.676e-02  -0.334   0.7384    
## ResidualSugar      -2.668e-04  4.276e-04  -0.624   0.5328    
## Chlorides          -1.136e-01  4.545e-02  -2.500   0.0124 *  
## FreeSulfurDioxide   2.259e-04  9.710e-05   2.327   0.0200 *  
## TotalSulfurDioxide  7.739e-05  6.287e-05   1.231   0.2184    
## Density            -1.278e+00  5.434e-01  -2.352   0.0187 *  
## pH                 -8.473e-03  2.122e-02  -0.399   0.6897    
## Sulphates          -1.714e-02  1.558e-02  -1.100   0.2713    
## Alcohol             1.654e-02  3.887e-03   4.255 2.12e-05 ***
## LabelAppeal         6.432e-01  1.744e-02  36.872  < 2e-16 ***
## AcidIndex          -1.649e-01  1.235e-02 -13.351  < 2e-16 ***
## STARS               7.283e-01  1.710e-02  42.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.153 on 6420 degrees of freedom
##   (6359 observations deleted due to missingness)
## Multiple R-squared:  0.4453, Adjusted R-squared:  0.444 
## F-statistic: 343.5 on 15 and 6420 DF,  p-value: < 2.2e-16
vif(lmod)
##              INDEX       FixedAcidity    VolatileAcidity         CitricAcid 
##           1.003401           1.027307           1.003524           1.007046 
##      ResidualSugar          Chlorides  FreeSulfurDioxide TotalSulfurDioxide 
##           1.003079           1.003094           1.004032           1.002977 
##            Density                 pH          Sulphates            Alcohol 
##           1.004786           1.004886           1.004511           1.009936 
##        LabelAppeal          AcidIndex              STARS 
##           1.118432           1.048567           1.134539

3. BUILD MODELS (25 Points)

Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations). Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models. Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? In this case, about the only thing you can comment on is the number of stars and the wine label appeal. However, you might comment on the coefficient and magnitude of variables and how they are similar or different from model to model. For example, you might say “pH seems to have a major positive impact in my poisson regression model, but a negative effect in my multiple linear regression model”. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

BUILD MODELS SOLUTION

For Poisson model 1, I use the model with raw data first without any transformations, i.e. use the dataset with NA values. As we can see from the summary results, the deviance residuals are symmetrical which implies that the predicted points are close to actual observed points. We also see that the p-values are very small for VolatileAcidity, Alcohol, LabelAppeal, AcidIndex and STARS. The last three of these variables were identified as having predictive value from the correlation matrix calculated earlier.

For Poisson model 2, we use the dataset where NA values were substituted with median values. Although the deviance residuals are similar, we notice that more variables have small p-values or become significant in model 2 such as CitricAcid, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH and Sulphates besides VolatileAcidity, Alcohol, LabelAppeal, AcidIndex and STARS which were all identified as having significance per model 1. The AIC score has also improved for model 2 from model 1’s 23,172 to 50,393. However, on the flip side, the goodness of fit has a smaller p-value for model 2 and the residual deviance for model 2 is greater than the degrees of freedom which indicates that this model might not be a good fit.

For model 3, we use a negative binomial regression models with untransformed data and notice that the accuracy is similar to that of model 1. The residual deviance, AIC values are similar and the p-value is 1 as with model 1. Model 4 is similarly another negative binomial regression model but with transformed data. The results of this model are very similar to those of model 2.

For model 5, we deploy a multiple regression model with raw untransformed data. We notice that several variables become significant as was the case when using the Poisson model with transformed data - VolatileAcidity, Chlorides, FreeSulfurDioxide, Density, Alcohol, LabelAppeal, AcidIndex and STARS. The adjusted R-squared of this model ends up being 44.4%. Lastly, model 6 uses transformed data and we notice that several variables end with being significant - VolatileAcidity, CitricAcid, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex and STARS.

require(MASS)
require(car)
require(kableExtra)
require(dplyr)

##Model 1: Poisson model with raw data

model1 <- glm(TARGET ~ ., family = poisson, training)
summary(model1)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2107  -0.2736   0.0628   0.3748   1.6983  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.578e+00  2.509e-01   6.290 3.18e-10 ***
## INDEX               1.610e-06  1.407e-06   1.144  0.25266    
## FixedAcidity        3.348e-04  1.053e-03   0.318  0.75050    
## VolatileAcidity    -2.563e-02  8.354e-03  -3.067  0.00216 ** 
## CitricAcid         -9.521e-04  7.578e-03  -0.126  0.90002    
## ResidualSugar      -6.579e-05  1.941e-04  -0.339  0.73462    
## Chlorides          -3.016e-02  2.056e-02  -1.467  0.14239    
## FreeSulfurDioxide   6.706e-05  4.404e-05   1.523  0.12778    
## TotalSulfurDioxide  2.071e-05  2.855e-05   0.725  0.46829    
## Density            -3.712e-01  2.462e-01  -1.508  0.13160    
## pH                 -4.402e-03  9.601e-03  -0.459  0.64657    
## Sulphates          -5.102e-03  7.052e-03  -0.724  0.46937    
## Alcohol             3.932e-03  1.771e-03   2.221  0.02638 *  
## LabelAppeal         1.769e-01  7.958e-03  22.223  < 2e-16 ***
## AcidIndex          -4.872e-02  5.903e-03  -8.254  < 2e-16 ***
## STARS               1.873e-01  7.490e-03  25.010  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4007.8  on 6420  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
## [1] "Goodness of Fit Test:"
with(model1, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance   df p
## [1,]     4007.834 6420 1
##Model 2: Poisson model with transformed data

model2 <- glm(TARGET ~ ., family = poisson, trainingdata)
summary(model2)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4841  -0.5278   0.2047   0.6364   2.5486  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.021e+00  1.960e-01  10.311  < 2e-16 ***
## INDEX              -4.273e-07  1.091e-06  -0.392 0.695178    
## FixedAcidity       -4.508e-04  8.195e-04  -0.550 0.582287    
## VolatileAcidity    -5.046e-02  6.493e-03  -7.773 7.68e-15 ***
## CitricAcid          1.332e-02  5.892e-03   2.261 0.023765 *  
## ResidualSugar       1.444e-04  1.545e-04   0.935 0.350022    
## Chlorides          -6.003e-02  1.645e-02  -3.649 0.000263 ***
## FreeSulfurDioxide   1.426e-04  3.513e-05   4.057 4.96e-05 ***
## TotalSulfurDioxide  1.065e-04  2.268e-05   4.697 2.64e-06 ***
## Density            -4.323e-01  1.921e-01  -2.251 0.024405 *  
## pH                 -2.390e-02  7.639e-03  -3.129 0.001755 ** 
## Sulphates          -1.877e-02  5.738e-03  -3.270 0.001074 ** 
## Alcohol             5.367e-03  1.410e-03   3.806 0.000141 ***
## LabelAppeal         1.965e-01  6.021e-03  32.636  < 2e-16 ***
## AcidIndex          -1.222e-01  4.463e-03 -27.382  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.980  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18419  on 12779  degrees of freedom
## AIC: 50393
## 
## Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
## [1] "Goodness of Fit Test:"
with(model2, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance    df             p
## [1,]     18419.32 12779 5.985146e-213
##Model 3: Negative binomial regression model with raw data

model3 <- glm.nb(TARGET ~ ., data = training)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(model3)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = training, init.theta = 140158.8704, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2107  -0.2736   0.0628   0.3748   1.6983  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.578e+00  2.509e-01   6.290 3.18e-10 ***
## INDEX               1.610e-06  1.407e-06   1.144  0.25267    
## FixedAcidity        3.348e-04  1.053e-03   0.318  0.75050    
## VolatileAcidity    -2.563e-02  8.354e-03  -3.067  0.00216 ** 
## CitricAcid         -9.522e-04  7.578e-03  -0.126  0.90002    
## ResidualSugar      -6.579e-05  1.941e-04  -0.339  0.73463    
## Chlorides          -3.016e-02  2.056e-02  -1.467  0.14240    
## FreeSulfurDioxide   6.706e-05  4.404e-05   1.523  0.12779    
## TotalSulfurDioxide  2.071e-05  2.855e-05   0.725  0.46829    
## Density            -3.712e-01  2.462e-01  -1.508  0.13160    
## pH                 -4.402e-03  9.601e-03  -0.459  0.64657    
## Sulphates          -5.102e-03  7.052e-03  -0.724  0.46937    
## Alcohol             3.932e-03  1.771e-03   2.221  0.02638 *  
## LabelAppeal         1.769e-01  7.958e-03  22.223  < 2e-16 ***
## AcidIndex          -4.872e-02  5.903e-03  -8.254  < 2e-16 ***
## STARS               1.873e-01  7.490e-03  25.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(140158.9) family taken to be 1)
## 
##     Null deviance: 5843.9  on 6435  degrees of freedom
## Residual deviance: 4007.8  on 6420  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23175
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  140159 
##           Std. Err.:  234850 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -23140.54
print('Goodness of Fit Test:')
## [1] "Goodness of Fit Test:"
with(model3, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance   df p
## [1,]     4007.771 6420 1
##Model 4: Negative binomial regression model with transformed data

model4 <- glm.nb(TARGET ~ ., data = trainingdata)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(model4)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingdata, init.theta = 39421.98179, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4840  -0.5278   0.2047   0.6364   2.5485  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.021e+00  1.960e-01  10.311  < 2e-16 ***
## INDEX              -4.274e-07  1.091e-06  -0.392 0.695177    
## FixedAcidity       -4.508e-04  8.196e-04  -0.550 0.582294    
## VolatileAcidity    -5.047e-02  6.493e-03  -7.773 7.69e-15 ***
## CitricAcid          1.332e-02  5.892e-03   2.261 0.023771 *  
## ResidualSugar       1.444e-04  1.545e-04   0.935 0.350021    
## Chlorides          -6.003e-02  1.645e-02  -3.649 0.000263 ***
## FreeSulfurDioxide   1.426e-04  3.514e-05   4.057 4.97e-05 ***
## TotalSulfurDioxide  1.065e-04  2.269e-05   4.697 2.65e-06 ***
## Density            -4.323e-01  1.921e-01  -2.251 0.024409 *  
## pH                 -2.390e-02  7.639e-03  -3.129 0.001755 ** 
## Sulphates          -1.877e-02  5.739e-03  -3.270 0.001075 ** 
## Alcohol             5.367e-03  1.410e-03   3.806 0.000141 ***
## LabelAppeal         1.965e-01  6.022e-03  32.635  < 2e-16 ***
## AcidIndex          -1.222e-01  4.463e-03 -27.382  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.978  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39421.98) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18418  on 12779  degrees of freedom
## AIC: 50395
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39422 
##           Std. Err.:  59709 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50361.47
print('Goodness of Fit Test:')
## [1] "Goodness of Fit Test:"
with(model4, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance    df             p
## [1,]     18418.46 12779 6.822121e-213
##Model 5: Multiple linear regression model with raw data

model5 <- lm(TARGET ~ ., data = training)
summary(model5)
## 
## Call:
## lm(formula = TARGET ~ ., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0514 -0.5137  0.1275  0.7216  3.2670 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.515e+00  5.537e-01   8.155 4.16e-16 ***
## INDEX               5.124e-06  3.110e-06   1.647   0.0995 .  
## FixedAcidity        1.698e-03  2.319e-03   0.732   0.4640    
## VolatileAcidity    -9.466e-02  1.846e-02  -5.129 2.99e-07 ***
## CitricAcid         -5.597e-03  1.676e-02  -0.334   0.7384    
## ResidualSugar      -2.668e-04  4.276e-04  -0.624   0.5328    
## Chlorides          -1.136e-01  4.545e-02  -2.500   0.0124 *  
## FreeSulfurDioxide   2.259e-04  9.710e-05   2.327   0.0200 *  
## TotalSulfurDioxide  7.739e-05  6.287e-05   1.231   0.2184    
## Density            -1.278e+00  5.434e-01  -2.352   0.0187 *  
## pH                 -8.473e-03  2.122e-02  -0.399   0.6897    
## Sulphates          -1.714e-02  1.558e-02  -1.100   0.2713    
## Alcohol             1.654e-02  3.887e-03   4.255 2.12e-05 ***
## LabelAppeal         6.432e-01  1.744e-02  36.872  < 2e-16 ***
## AcidIndex          -1.649e-01  1.235e-02 -13.351  < 2e-16 ***
## STARS               7.283e-01  1.710e-02  42.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.153 on 6420 degrees of freedom
##   (6359 observations deleted due to missingness)
## Multiple R-squared:  0.4453, Adjusted R-squared:  0.444 
## F-statistic: 343.5 on 15 and 6420 DF,  p-value: < 2.2e-16
##Model 6: Multiple linear regression model with transformed data

model6 <- lm(TARGET ~ ., data = trainingdata)
summary(model6)
## 
## Call:
## lm(formula = TARGET ~ ., data = trainingdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2269 -0.7534  0.3615  1.1226  4.3612 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.375e+00  5.526e-01   9.727  < 2e-16 ***
## INDEX              -1.951e-06  3.089e-06  -0.632 0.527573    
## FixedAcidity       -1.164e-03  2.315e-03  -0.503 0.615047    
## VolatileAcidity    -1.550e-01  1.838e-02  -8.434  < 2e-16 ***
## CitricAcid          3.982e-02  1.673e-02   2.380 0.017329 *  
## ResidualSugar       4.726e-04  4.371e-04   1.081 0.279676    
## Chlorides          -1.930e-01  4.638e-02  -4.162 3.18e-05 ***
## FreeSulfurDioxide   4.291e-04  9.941e-05   4.317 1.59e-05 ***
## TotalSulfurDioxide  3.102e-04  6.387e-05   4.857 1.21e-06 ***
## Density            -1.277e+00  5.428e-01  -2.353 0.018636 *  
## pH                 -6.395e-02  2.154e-02  -2.969 0.002992 ** 
## Sulphates          -5.479e-02  1.623e-02  -3.376 0.000739 ***
## Alcohol             1.881e-02  3.973e-03   4.734 2.22e-06 ***
## LabelAppeal         5.946e-01  1.687e-02  35.254  < 2e-16 ***
## AcidIndex          -3.260e-01  1.117e-02 -29.172  < 2e-16 ***
## STARS               7.478e-01  1.946e-02  38.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.627 on 12779 degrees of freedom
## Multiple R-squared:  0.2879, Adjusted R-squared:  0.2871 
## F-statistic: 344.5 on 15 and 12779 DF,  p-value: < 2.2e-16

4. SELECT MODELS (25 Points)

Decide on the criteria for selecting the best count regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the count regression model, will you use a metric such as AIC, average squared error, etc.? Be sure to explain how you can make inferences from the model, and discuss other relevant model output. If you like the multiple linear regression model the best, please say why. However, you must select a count regression model for model deployment. Using the training data set, evaluate the performance of the count regression model. Make predictions using the evaluation data set.

SELECT MODELS SOLUTION:

We run a couple of diagnostics on the selected models.

Residual vs Fitted Plots

The residual vs fitted plots for none of the 6 models are homoscedastic.

Normality using Q-Q plots

Next, we can test the residuals for normality using the Q-Q plot. Residuals for none of the Q-Q plots follow the line approximately and hence the residuals are not normal. The skew in the Q-Q plot persists for all models.

Identify leverage points using half-normal plots

Two index values 1351 and 6283 are identified as leverage points in model 1 & 3. 1630, 12,513 are identified as leverage points in model 2 & 4. 1351, 6009 are identified as leverage points in model 5.

Identify influential points using Cook’s distance

We identify influential points in all models and removal of the largest of the influential points in each model doesn’t change the summary stats meaningfully.

Finally, looking at the AIC and MSE values, we select model 4 with the highest AIC value and smallest MSE.

This model can be written as:

TARGET = 2.021e+00 + (-4.274e-07 * INDEX) + (-4.508e-04 * FixedAcidity) + (-5.047e-02 * VolatileAcidity) + (1.332e-02 * CitricAcid) + (1.444e-04 * ResidualSugar) + (-6.003e-02 * Chlorides) + (1.426e-04 * FreeSulfurDioxide) + (1.065e-04 * TotalSulfurDioxide) + (-4.323e-01 * Density) + (-2.390e-02 * pH) + (-1.877e-02 * Sulphates) + (5.367e-03 * Alcohol) + (1.965e-01 * LabelAppeal) + (-1.222e-01 * AcidIndex) + (2.198e-01 * STARS)

require(faraway)
## Loading required package: faraway
## 
## Attaching package: 'faraway'
## The following objects are masked from 'package:car':
## 
##     logit, vif
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ readr   1.3.1
## ✓ tibble  3.0.3     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()          masks stats::filter()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag()             masks stats::lag()
## x car::recode()            masks dplyr::recode()
## x MASS::select()           masks dplyr::select()
## x purrr::some()            masks car::some()
require(kableExtra)

# Model 1:
par(mfrow = c(2, 2))
plot(model1)

#Plot of successive pairs of residuals to check for serial correlation
n1 <- length(residuals(model1))
plot(tail(residuals(model1), n1-1) ~ head(residuals(model1), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv1 <- hatvalues(model1)
sum(hatv1)
## [1] 16
index <- row.names(training)
halfnorm(hatv1, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf1 <- cooks.distance(model1)
halfnorm(cookf1, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
modelf1 <- glm(TARGET ~ ., family = poisson, training, subset = (cookf1 < max(cookf1)))
summary(modelf1)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = training, 
##     subset = (cookf1 < max(cookf1)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2107  -0.2736   0.0628   0.3748   1.6983  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.578e+00  2.509e-01   6.290 3.18e-10 ***
## INDEX               1.610e-06  1.407e-06   1.144  0.25266    
## FixedAcidity        3.348e-04  1.053e-03   0.318  0.75050    
## VolatileAcidity    -2.563e-02  8.354e-03  -3.067  0.00216 ** 
## CitricAcid         -9.521e-04  7.578e-03  -0.126  0.90002    
## ResidualSugar      -6.579e-05  1.941e-04  -0.339  0.73462    
## Chlorides          -3.016e-02  2.056e-02  -1.467  0.14239    
## FreeSulfurDioxide   6.706e-05  4.404e-05   1.523  0.12778    
## TotalSulfurDioxide  2.071e-05  2.855e-05   0.725  0.46829    
## Density            -3.712e-01  2.462e-01  -1.508  0.13160    
## pH                 -4.402e-03  9.601e-03  -0.459  0.64657    
## Sulphates          -5.102e-03  7.052e-03  -0.724  0.46937    
## Alcohol             3.932e-03  1.771e-03   2.221  0.02638 *  
## LabelAppeal         1.769e-01  7.958e-03  22.223  < 2e-16 ***
## AcidIndex          -4.872e-02  5.903e-03  -8.254  < 2e-16 ***
## STARS               1.873e-01  7.490e-03  25.010  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4007.8  on 6420  degrees of freedom
##   (6357 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5
summary(model1)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2107  -0.2736   0.0628   0.3748   1.6983  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.578e+00  2.509e-01   6.290 3.18e-10 ***
## INDEX               1.610e-06  1.407e-06   1.144  0.25266    
## FixedAcidity        3.348e-04  1.053e-03   0.318  0.75050    
## VolatileAcidity    -2.563e-02  8.354e-03  -3.067  0.00216 ** 
## CitricAcid         -9.521e-04  7.578e-03  -0.126  0.90002    
## ResidualSugar      -6.579e-05  1.941e-04  -0.339  0.73462    
## Chlorides          -3.016e-02  2.056e-02  -1.467  0.14239    
## FreeSulfurDioxide   6.706e-05  4.404e-05   1.523  0.12778    
## TotalSulfurDioxide  2.071e-05  2.855e-05   0.725  0.46829    
## Density            -3.712e-01  2.462e-01  -1.508  0.13160    
## pH                 -4.402e-03  9.601e-03  -0.459  0.64657    
## Sulphates          -5.102e-03  7.052e-03  -0.724  0.46937    
## Alcohol             3.932e-03  1.771e-03   2.221  0.02638 *  
## LabelAppeal         1.769e-01  7.958e-03  22.223  < 2e-16 ***
## AcidIndex          -4.872e-02  5.903e-03  -8.254  < 2e-16 ***
## STARS               1.873e-01  7.490e-03  25.010  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4007.8  on 6420  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5
# Model 2:
par(mfrow = c(2, 2))

plot(model2)

#Plot of successive pairs of residuals to check for serial correlation
n2 <- length(residuals(model2))
plot(tail(residuals(model2), n1-1) ~ head(residuals(model2), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv2 <- hatvalues(model2)
sum(hatv2)
## [1] 16
index <- row.names(training)
halfnorm(hatv2, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf2 <- cooks.distance(model2)
halfnorm(cookf2, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
modelf2 <- glm(TARGET ~ ., family = poisson, trainingdata, subset = (cookf2 < max(cookf2)))
summary(modelf2)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingdata, 
##     subset = (cookf2 < max(cookf2)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4882  -0.5275   0.2045   0.6363   2.5559  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.026e+00  1.960e-01  10.336  < 2e-16 ***
## INDEX              -4.598e-07  1.091e-06  -0.422 0.673346    
## FixedAcidity       -4.816e-04  8.196e-04  -0.588 0.556842    
## VolatileAcidity    -5.043e-02  6.493e-03  -7.767 8.05e-15 ***
## CitricAcid          1.335e-02  5.892e-03   2.266 0.023453 *  
## ResidualSugar       1.397e-04  1.545e-04   0.904 0.365960    
## Chlorides          -5.907e-02  1.645e-02  -3.590 0.000330 ***
## FreeSulfurDioxide   1.438e-04  3.514e-05   4.093 4.26e-05 ***
## TotalSulfurDioxide  1.071e-04  2.269e-05   4.720 2.36e-06 ***
## Density            -4.325e-01  1.921e-01  -2.252 0.024332 *  
## pH                 -2.391e-02  7.639e-03  -3.130 0.001746 ** 
## Sulphates          -1.879e-02  5.739e-03  -3.274 0.001060 ** 
## Alcohol             5.406e-03  1.410e-03   3.834 0.000126 ***
## LabelAppeal         1.966e-01  6.022e-03  32.650  < 2e-16 ***
## AcidIndex          -1.228e-01  4.469e-03 -27.479  < 2e-16 ***
## STARS               2.196e-01  6.468e-03  33.944  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22860  on 12793  degrees of freedom
## Residual deviance: 18413  on 12778  degrees of freedom
## AIC: 50383
## 
## Number of Fisher Scoring iterations: 5
summary(model2)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4841  -0.5278   0.2047   0.6364   2.5486  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.021e+00  1.960e-01  10.311  < 2e-16 ***
## INDEX              -4.273e-07  1.091e-06  -0.392 0.695178    
## FixedAcidity       -4.508e-04  8.195e-04  -0.550 0.582287    
## VolatileAcidity    -5.046e-02  6.493e-03  -7.773 7.68e-15 ***
## CitricAcid          1.332e-02  5.892e-03   2.261 0.023765 *  
## ResidualSugar       1.444e-04  1.545e-04   0.935 0.350022    
## Chlorides          -6.003e-02  1.645e-02  -3.649 0.000263 ***
## FreeSulfurDioxide   1.426e-04  3.513e-05   4.057 4.96e-05 ***
## TotalSulfurDioxide  1.065e-04  2.268e-05   4.697 2.64e-06 ***
## Density            -4.323e-01  1.921e-01  -2.251 0.024405 *  
## pH                 -2.390e-02  7.639e-03  -3.129 0.001755 ** 
## Sulphates          -1.877e-02  5.738e-03  -3.270 0.001074 ** 
## Alcohol             5.367e-03  1.410e-03   3.806 0.000141 ***
## LabelAppeal         1.965e-01  6.021e-03  32.636  < 2e-16 ***
## AcidIndex          -1.222e-01  4.463e-03 -27.382  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.980  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18419  on 12779  degrees of freedom
## AIC: 50393
## 
## Number of Fisher Scoring iterations: 5
# Model 3:
par(mfrow = c(2, 2))

plot(model3)

#Plot of successive pairs of residuals to check for serial correlation
n3 <- length(residuals(model3))
plot(tail(residuals(model3), n1-1) ~ head(residuals(model3), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv3 <- hatvalues(model3)
sum(hatv3)
## [1] 16
index <- row.names(training)
halfnorm(hatv3, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf3 <- cooks.distance(model3)
halfnorm(cookf3, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
modelf3 <- glm.nb(TARGET ~ ., data = training, subset = (cookf3 < max(cookf3)))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(modelf3)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = training, subset = (cookf3 < 
##     max(cookf3)), init.theta = 140158.8704, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2107  -0.2736   0.0628   0.3748   1.6983  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.578e+00  2.509e-01   6.290 3.18e-10 ***
## INDEX               1.610e-06  1.407e-06   1.144  0.25267    
## FixedAcidity        3.348e-04  1.053e-03   0.318  0.75050    
## VolatileAcidity    -2.563e-02  8.354e-03  -3.067  0.00216 ** 
## CitricAcid         -9.522e-04  7.578e-03  -0.126  0.90002    
## ResidualSugar      -6.579e-05  1.941e-04  -0.339  0.73463    
## Chlorides          -3.016e-02  2.056e-02  -1.467  0.14240    
## FreeSulfurDioxide   6.706e-05  4.404e-05   1.523  0.12779    
## TotalSulfurDioxide  2.071e-05  2.855e-05   0.725  0.46829    
## Density            -3.712e-01  2.462e-01  -1.508  0.13160    
## pH                 -4.402e-03  9.601e-03  -0.459  0.64657    
## Sulphates          -5.102e-03  7.052e-03  -0.724  0.46937    
## Alcohol             3.932e-03  1.771e-03   2.221  0.02638 *  
## LabelAppeal         1.769e-01  7.958e-03  22.223  < 2e-16 ***
## AcidIndex          -4.872e-02  5.903e-03  -8.254  < 2e-16 ***
## STARS               1.873e-01  7.490e-03  25.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(140158.9) family taken to be 1)
## 
##     Null deviance: 5843.9  on 6435  degrees of freedom
## Residual deviance: 4007.8  on 6420  degrees of freedom
##   (6357 observations deleted due to missingness)
## AIC: 23175
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  140159 
##           Std. Err.:  234850 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -23140.54
summary(model3)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = training, init.theta = 140158.8704, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2107  -0.2736   0.0628   0.3748   1.6983  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.578e+00  2.509e-01   6.290 3.18e-10 ***
## INDEX               1.610e-06  1.407e-06   1.144  0.25267    
## FixedAcidity        3.348e-04  1.053e-03   0.318  0.75050    
## VolatileAcidity    -2.563e-02  8.354e-03  -3.067  0.00216 ** 
## CitricAcid         -9.522e-04  7.578e-03  -0.126  0.90002    
## ResidualSugar      -6.579e-05  1.941e-04  -0.339  0.73463    
## Chlorides          -3.016e-02  2.056e-02  -1.467  0.14240    
## FreeSulfurDioxide   6.706e-05  4.404e-05   1.523  0.12779    
## TotalSulfurDioxide  2.071e-05  2.855e-05   0.725  0.46829    
## Density            -3.712e-01  2.462e-01  -1.508  0.13160    
## pH                 -4.402e-03  9.601e-03  -0.459  0.64657    
## Sulphates          -5.102e-03  7.052e-03  -0.724  0.46937    
## Alcohol             3.932e-03  1.771e-03   2.221  0.02638 *  
## LabelAppeal         1.769e-01  7.958e-03  22.223  < 2e-16 ***
## AcidIndex          -4.872e-02  5.903e-03  -8.254  < 2e-16 ***
## STARS               1.873e-01  7.490e-03  25.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(140158.9) family taken to be 1)
## 
##     Null deviance: 5843.9  on 6435  degrees of freedom
## Residual deviance: 4007.8  on 6420  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23175
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  140159 
##           Std. Err.:  234850 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -23140.54
# Model 4:
par(mfrow = c(2, 2))

plot(model4)

#Plot of successive pairs of residuals to check for serial correlation
n4 <- length(residuals(model4))
plot(tail(residuals(model4), n1-1) ~ head(residuals(model4), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv4 <- hatvalues(model4)
sum(hatv4)
## [1] 16
index <- row.names(training)
halfnorm(hatv4, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf4 <- cooks.distance(model4)
halfnorm(cookf4, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
modelf4 <- glm.nb(TARGET ~ ., data = trainingdata, subset = (cookf4 < max(cookf4)))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(modelf4)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingdata, subset = (cookf4 < 
##     max(cookf4)), init.theta = 39457.82795, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4881  -0.5275   0.2045   0.6362   2.5559  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.026e+00  1.960e-01  10.335  < 2e-16 ***
## INDEX              -4.598e-07  1.091e-06  -0.422 0.673345    
## FixedAcidity       -4.816e-04  8.197e-04  -0.588 0.556849    
## VolatileAcidity    -5.043e-02  6.493e-03  -7.767 8.07e-15 ***
## CitricAcid          1.335e-02  5.892e-03   2.266 0.023459 *  
## ResidualSugar       1.397e-04  1.545e-04   0.904 0.365959    
## Chlorides          -5.907e-02  1.645e-02  -3.590 0.000331 ***
## FreeSulfurDioxide   1.438e-04  3.514e-05   4.093 4.26e-05 ***
## TotalSulfurDioxide  1.071e-04  2.269e-05   4.720 2.36e-06 ***
## Density            -4.325e-01  1.921e-01  -2.252 0.024336 *  
## pH                 -2.391e-02  7.639e-03  -3.130 0.001747 ** 
## Sulphates          -1.879e-02  5.739e-03  -3.274 0.001060 ** 
## Alcohol             5.406e-03  1.410e-03   3.833 0.000126 ***
## LabelAppeal         1.966e-01  6.022e-03  32.648  < 2e-16 ***
## AcidIndex          -1.228e-01  4.469e-03 -27.478  < 2e-16 ***
## STARS               2.196e-01  6.469e-03  33.942  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39457.83) family taken to be 1)
## 
##     Null deviance: 22859  on 12793  degrees of freedom
## Residual deviance: 18412  on 12778  degrees of freedom
## AIC: 50385
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39458 
##           Std. Err.:  59764 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50351.48
summary(model4)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingdata, init.theta = 39421.98179, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4840  -0.5278   0.2047   0.6364   2.5485  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.021e+00  1.960e-01  10.311  < 2e-16 ***
## INDEX              -4.274e-07  1.091e-06  -0.392 0.695177    
## FixedAcidity       -4.508e-04  8.196e-04  -0.550 0.582294    
## VolatileAcidity    -5.047e-02  6.493e-03  -7.773 7.69e-15 ***
## CitricAcid          1.332e-02  5.892e-03   2.261 0.023771 *  
## ResidualSugar       1.444e-04  1.545e-04   0.935 0.350021    
## Chlorides          -6.003e-02  1.645e-02  -3.649 0.000263 ***
## FreeSulfurDioxide   1.426e-04  3.514e-05   4.057 4.97e-05 ***
## TotalSulfurDioxide  1.065e-04  2.269e-05   4.697 2.65e-06 ***
## Density            -4.323e-01  1.921e-01  -2.251 0.024409 *  
## pH                 -2.390e-02  7.639e-03  -3.129 0.001755 ** 
## Sulphates          -1.877e-02  5.739e-03  -3.270 0.001075 ** 
## Alcohol             5.367e-03  1.410e-03   3.806 0.000141 ***
## LabelAppeal         1.965e-01  6.022e-03  32.635  < 2e-16 ***
## AcidIndex          -1.222e-01  4.463e-03 -27.382  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.978  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39421.98) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18418  on 12779  degrees of freedom
## AIC: 50395
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39422 
##           Std. Err.:  59709 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50361.47
# Model 5:
par(mfrow = c(2, 2))

plot(model5)

#Plot of successive pairs of residuals to check for serial correlation
n5 <- length(residuals(model5))
plot(tail(residuals(model5), n1-1) ~ head(residuals(model5), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv5 <- hatvalues(model5)
sum(hatv5)
## [1] 16
index <- row.names(training)
halfnorm(hatv5, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf5 <- cooks.distance(model5)
halfnorm(cookf5, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
modelf5 <- lm(TARGET ~ ., data = training, subset = (cookf5 < max(cookf5)))
summary(modelf5)
## 
## Call:
## lm(formula = TARGET ~ ., data = training, subset = (cookf5 < 
##     max(cookf5)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0514 -0.5137  0.1275  0.7216  3.2670 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.515e+00  5.537e-01   8.155 4.16e-16 ***
## INDEX               5.124e-06  3.110e-06   1.647   0.0995 .  
## FixedAcidity        1.698e-03  2.319e-03   0.732   0.4640    
## VolatileAcidity    -9.466e-02  1.846e-02  -5.129 2.99e-07 ***
## CitricAcid         -5.597e-03  1.676e-02  -0.334   0.7384    
## ResidualSugar      -2.668e-04  4.276e-04  -0.624   0.5328    
## Chlorides          -1.136e-01  4.545e-02  -2.500   0.0124 *  
## FreeSulfurDioxide   2.259e-04  9.710e-05   2.327   0.0200 *  
## TotalSulfurDioxide  7.739e-05  6.287e-05   1.231   0.2184    
## Density            -1.278e+00  5.434e-01  -2.352   0.0187 *  
## pH                 -8.473e-03  2.122e-02  -0.399   0.6897    
## Sulphates          -1.714e-02  1.558e-02  -1.100   0.2713    
## Alcohol             1.654e-02  3.887e-03   4.255 2.12e-05 ***
## LabelAppeal         6.432e-01  1.744e-02  36.872  < 2e-16 ***
## AcidIndex          -1.649e-01  1.235e-02 -13.351  < 2e-16 ***
## STARS               7.283e-01  1.710e-02  42.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.153 on 6420 degrees of freedom
##   (6357 observations deleted due to missingness)
## Multiple R-squared:  0.4453, Adjusted R-squared:  0.444 
## F-statistic: 343.5 on 15 and 6420 DF,  p-value: < 2.2e-16
summary(model5)
## 
## Call:
## lm(formula = TARGET ~ ., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0514 -0.5137  0.1275  0.7216  3.2670 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.515e+00  5.537e-01   8.155 4.16e-16 ***
## INDEX               5.124e-06  3.110e-06   1.647   0.0995 .  
## FixedAcidity        1.698e-03  2.319e-03   0.732   0.4640    
## VolatileAcidity    -9.466e-02  1.846e-02  -5.129 2.99e-07 ***
## CitricAcid         -5.597e-03  1.676e-02  -0.334   0.7384    
## ResidualSugar      -2.668e-04  4.276e-04  -0.624   0.5328    
## Chlorides          -1.136e-01  4.545e-02  -2.500   0.0124 *  
## FreeSulfurDioxide   2.259e-04  9.710e-05   2.327   0.0200 *  
## TotalSulfurDioxide  7.739e-05  6.287e-05   1.231   0.2184    
## Density            -1.278e+00  5.434e-01  -2.352   0.0187 *  
## pH                 -8.473e-03  2.122e-02  -0.399   0.6897    
## Sulphates          -1.714e-02  1.558e-02  -1.100   0.2713    
## Alcohol             1.654e-02  3.887e-03   4.255 2.12e-05 ***
## LabelAppeal         6.432e-01  1.744e-02  36.872  < 2e-16 ***
## AcidIndex          -1.649e-01  1.235e-02 -13.351  < 2e-16 ***
## STARS               7.283e-01  1.710e-02  42.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.153 on 6420 degrees of freedom
##   (6359 observations deleted due to missingness)
## Multiple R-squared:  0.4453, Adjusted R-squared:  0.444 
## F-statistic: 343.5 on 15 and 6420 DF,  p-value: < 2.2e-16
# Model 6:
par(mfrow = c(2, 2))

plot(model6)

#Plot of successive pairs of residuals to check for serial correlation
n6 <- length(residuals(model6))
plot(tail(residuals(model6), n1-1) ~ head(residuals(model6), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv6 <- hatvalues(model6)
sum(hatv6)
## [1] 16
index <- row.names(training)
halfnorm(hatv6, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf6 <- cooks.distance(model6)
halfnorm(cookf6, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
modelf6 <- lm(TARGET ~ ., data = trainingdata, subset = (cookf6 < max(cookf6)))
summary(modelf6)
## 
## Call:
## lm(formula = TARGET ~ ., data = trainingdata, subset = (cookf6 < 
##     max(cookf6)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2338 -0.7517  0.3618  1.1229  4.3610 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.389e+00  5.526e-01   9.752  < 2e-16 ***
## INDEX              -2.059e-06  3.089e-06  -0.667 0.505000    
## FixedAcidity       -1.263e-03  2.315e-03  -0.546 0.585375    
## VolatileAcidity    -1.549e-01  1.838e-02  -8.427  < 2e-16 ***
## CitricAcid          3.991e-02  1.673e-02   2.386 0.017053 *  
## ResidualSugar       4.571e-04  4.371e-04   1.046 0.295710    
## Chlorides          -1.899e-01  4.639e-02  -4.094 4.27e-05 ***
## FreeSulfurDioxide   4.333e-04  9.941e-05   4.359 1.32e-05 ***
## TotalSulfurDioxide  3.118e-04  6.386e-05   4.883 1.06e-06 ***
## Density            -1.278e+00  5.426e-01  -2.355 0.018521 *  
## pH                 -6.395e-02  2.153e-02  -2.970 0.002988 ** 
## Sulphates          -5.484e-02  1.623e-02  -3.379 0.000729 ***
## Alcohol             1.894e-02  3.972e-03   4.768 1.88e-06 ***
## LabelAppeal         5.949e-01  1.686e-02  35.275  < 2e-16 ***
## AcidIndex          -3.274e-01  1.119e-02 -29.269  < 2e-16 ***
## STARS               7.470e-01  1.946e-02  38.389  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.626 on 12778 degrees of freedom
## Multiple R-squared:  0.2882, Adjusted R-squared:  0.2874 
## F-statistic: 344.9 on 15 and 12778 DF,  p-value: < 2.2e-16
summary(model6)
## 
## Call:
## lm(formula = TARGET ~ ., data = trainingdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2269 -0.7534  0.3615  1.1226  4.3612 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.375e+00  5.526e-01   9.727  < 2e-16 ***
## INDEX              -1.951e-06  3.089e-06  -0.632 0.527573    
## FixedAcidity       -1.164e-03  2.315e-03  -0.503 0.615047    
## VolatileAcidity    -1.550e-01  1.838e-02  -8.434  < 2e-16 ***
## CitricAcid          3.982e-02  1.673e-02   2.380 0.017329 *  
## ResidualSugar       4.726e-04  4.371e-04   1.081 0.279676    
## Chlorides          -1.930e-01  4.638e-02  -4.162 3.18e-05 ***
## FreeSulfurDioxide   4.291e-04  9.941e-05   4.317 1.59e-05 ***
## TotalSulfurDioxide  3.102e-04  6.387e-05   4.857 1.21e-06 ***
## Density            -1.277e+00  5.428e-01  -2.353 0.018636 *  
## pH                 -6.395e-02  2.154e-02  -2.969 0.002992 ** 
## Sulphates          -5.479e-02  1.623e-02  -3.376 0.000739 ***
## Alcohol             1.881e-02  3.973e-03   4.734 2.22e-06 ***
## LabelAppeal         5.946e-01  1.687e-02  35.254  < 2e-16 ***
## AcidIndex          -3.260e-01  1.117e-02 -29.172  < 2e-16 ***
## STARS               7.478e-01  1.946e-02  38.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.627 on 12779 degrees of freedom
## Multiple R-squared:  0.2879, Adjusted R-squared:  0.2871 
## F-statistic: 344.5 on 15 and 12779 DF,  p-value: < 2.2e-16
## Comparing models using AIC and MSE

modelValidation1 <- function(mod){
  preds1 = predict(mod, training)
  diffMat1 = as.numeric(preds1) - as.numeric(training$TARGET)
  diffMat1 = diffMat1^2
  loss1 <- mean(diffMat1)
  return(loss1)
}

modelValidation2 <- function(mod){
  preds2 = predict(mod, trainingdata)
  diffMat2 = as.numeric(preds2) - as.numeric(trainingdata$TARGET)
  diffMat2 = diffMat2^2
  loss2 <- mean(diffMat2)
  return(loss2)
}

compare_models <- matrix(c(modelValidation1(model1), modelValidation2(model2), modelValidation1(model3), modelValidation2(model4),
                           modelValidation1(model5), modelValidation2(model6)), nrow = 6,ncol = 1,byrow = TRUE)


aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic

mse1 <- mean((training$TARGET - predict(model1))^2)
## Warning in training$TARGET - predict(model1): longer object length is not a
## multiple of shorter object length
mse2 <- mean((trainingdata$TARGET - predict(model2))^2)
mse3 <- mean((training$TARGET - predict(model3))^2)
## Warning in training$TARGET - predict(model3): longer object length is not a
## multiple of shorter object length
mse4 <- mean((trainingdata$TARGET - predict(model4))^2)
mse5 <- mean((training$TARGET - predict(model5))^2)
## Warning in training$TARGET - predict(model5): longer object length is not a
## multiple of shorter object length
mse6 <- mean((trainingdata$TARGET - predict(model6))^2)

compare_aic_mse <- matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, 
                            aic1, aic2, aic3, aic4, 0, 0),nrow = 6,ncol = 2,byrow = FALSE)

rownames(compare_aic_mse) <- c("Model1","Model2","Model3","Model4","Model5","Model6")
colnames(compare_aic_mse) <- c("MSE","AIC")
compare_models <- as.data.frame(compare_models)

kable(compare_aic_mse)  %>% 
  kable_styling(full_width = T)
MSE AIC
Model1 6.902681 23172.44
Model2 7.042875 50393.34
Model3 6.902682 23174.54
Model4 7.042875 50395.47
Model5 5.154331 0.00
Model6 2.642231 0.00

Predicting using the eval dataset

We use the model 4 to predict TARGET.

sapply(eval, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
INDEX 0
TARGET 3335
FixedAcidity 0
VolatileAcidity 0
CitricAcid 0
ResidualSugar 168
Chlorides 138
FreeSulfurDioxide 152
TotalSulfurDioxide 157
Density 0
pH 104
Sulphates 310
Alcohol 185
LabelAppeal 0
AcidIndex 0
STARS 841
##Fill missing records with missing values

evaldata <- eval %>% mutate_at(vars(c("ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "pH", "Sulphates", "Alcohol", "STARS")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))

##Model 4: Negative binomial regression model with transformed data

eval_probs <- predict(model4, newdata = evaldata, type='response')

write.csv(eval, file = '/Users/tponnada/Downloads/predictHW5.csv')