Source code: https://github.com/djlofland/DS621_F2020_Group3/tree/master/Homework_5
The wine dataset is a highly popular one in the data science community, as it models some of the challenges of real world datasets and can be modeled by a variety of different model types.
We will first explore the data looking for issues or challenges (i.e. missing data, outliers, possible coding errors, multicollinearlity, etc). Once we have a handle on the data, we will apply any necessary cleaning steps. Once we have a reasonable dataset to work with, we will build and evaluate three different linear models that predict sales. Our dataset includes both training data and evaluation data - we will train using the main training data, then evaluate models based on how well they perform against the holdout evaluation data. Finally we will select a final model that offers the best compromise between accuracy and simplicity.
The wine training set contains 16 columns - including the target variable TARGET - and 12795 rows, covering a variety of different brands of wine. The data-set is entirely numerical variables, but also contains some variables that are highly discrete and have a limited number of possible values. We believe it is still reasonable to treat these as numerical variables since the different values follow a natural numerical order.
Below, we created a chart that describes each variable in the dataset and the theoretical effect it will have on the number of wins projected for a team.
Variables of Interest
Given that the Index column had no impact on the target variable, number of wines, it was dropped.
We compiled summary statistics on our data set to better understand the data before modeling.
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
The first observation is the prevalance of NA’s throughout the dataset. Of the 14 feature columns, 8 of them contain at least some NA values. We also see that the TARGET value is always between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold” (we would not expect partial cases).
We also note that many of the numerical features measuring the quantity of a chemical in the wine have a negative minimum value. We are assuming the original chemical measurements were normalized (possible a log transform) allowing for negative values, since technically negative concentrations shouldn’t be physically possible. As such, we chose to leave those values as-is.
Next, we wanted to get an idea of the distribution profiles for each of the variables.
We see that most variables have a somewhat normal (although steep) distribution.
The distribution profiles show right skew in variables AcidIndex, and STARS.
More interesting is the shape of many features where they are centered with most values clustered at the center and somewhat uniform shape above and below. It almost appears like a tri-modal distribution with a low, middle and high normal distributions overlapping. We are not going to do extensive feature engineering, or we might consider breaking these features up into 3 separate features each. Two approaches include:
mixTools to separate the multi-modal curves into 3 distinct (and separate) features each capturing just the low, middle or high values and retaining a numerical value.low, middle or high value.In addition to creating histogram distributions, we also elected to use box-plots to get an idea of the spread of each variable.
The box-plots do not reveal any enormous outliers in any of the features, meaning it is unlikely we will need to perform outlier detection and removal. AcidIndex, LabelAppeal, and STARS are essentially categorical in nature (ordinal), so we explore how each value of those features relates with TARGET. We see a clear relationship - as LabelAppeal increases, so does TARGET.
We see the same relationship between STARS and TARGET - we especially note that STARS=NA highly correlates with lower TARGET. In the original project instructions, attention was drawn to the fact that missing data might be informative. Based on this, we will impute STARS=NA into STARS=0 which fits with the other values we see for STARS and the pattern that as stars increase, cases sold increase.
Finally, we wanted to plot scatter plots of each variable versus the target variable, TARGET, to get an idea of the relationship between them.
Due to the discrete nature of the target, it is somewhat difficult to see clear linear relationships in the data. However, it does appear that both STARS and LabelAppeal have a significant positive relationship with the TARGET, and many of the chemical features have at least some negative relationship with the TARGET as lower values led to more values of 8 and 7 in the target variable.
Overall, although our plots indicate some interesting relationships between our variables, they also reveal some significant issues with the data.
For instance, there are many data points that contain missing data that will need to be either imputed or discarded. There also was the issue of nonsensical negative values for variables measuring a concentration; we have chosen to assume that these variables have been log transformed and these values are not in error. However, there is no evidence to back up this decision, and we would need to reevaluate if given more information on the data collection/transformation process.
When we initially viewed the first few rows of the raw data, we already noticed missing data. Let’s assess which fields have missing data.
## values ind
## 1 26.25 STARS
## 2 9.46 Sulphates
## 3 5.33 TotalSulfurDioxide
## 4 5.10 Alcohol
## 5 5.06 FreeSulfurDioxide
## 6 4.99 Chlorides
## 7 4.81 ResidualSugar
## 8 3.09 pH
## 9 0.00 TARGET
## 10 0.00 FixedAcidity
## 11 0.00 VolatileAcidity
## 12 0.00 CitricAcid
## 13 0.00 Density
## 14 0.00 LabelAppeal
## 15 0.00 AcidIndex
In the project description, it was noted that the fact that a certain variable is missing may be predictive. We will impute STARS=NA to STARS=0. We will impute the remaining missing data using caret::preProcess and method=knnImpute. Note that preProcess will also center, scale and BoxCox our features at the same time.
With our missing data imputed correctly, we can now build off the scatter plots from above to quantify the correlations between our target variable and predictor variable. We will want to choose those with stronger positive or negative correlations. Features with correlations closer to zero will probably not provide any meaningful information on explaining wins by a team.
## values ind
## 1 0.685381473 STARS
## 2 0.356500469 LabelAppeal
## 3 0.062030498 Alcohol
## 4 0.051730323 TotalSulfurDioxide
## 5 0.043996542 FreeSulfurDioxide
## 6 0.016187709 ResidualSugar
## 7 0.008684633 CitricAcid
## 8 -0.009081197 pH
## 9 -0.035589560 Density
## 10 -0.039072231 Chlorides
## 11 -0.039917146 Sulphates
## 12 -0.049010939 FixedAcidity
## 13 -0.088793212 VolatileAcidity
## 14 -0.221991949 AcidIndex
STARS, LabelAppeal, and AcidIndex have the highest correlation with TARGET, which matches what we saw in the variable plots above. Recall that we imputed NA values for STARS as 0.
One problem that can occur with multi-variable regression is correlation between variables, or multicolinearity. A quick check is to run correlations between variables.
We see that the features have very low correlations with each other, meaning that there is not much multicollinearity present in the dataset. This means that the assumptions of linear regression are more likely to be met.
To summarize our data preparation and exploration, we can distinguish our findings into a few categories below:
We removed the INDEX field as it offers no information for a model.
For the STARS field, we imputed NAs as the value of 0, since missing values were highly correlated with the target variable. For other fields with missing values, we imputed their value using caret’s knnimpute.
Many of the numerical features have somewhat unreasonable negative values, but we are choosing to believe that these are log-transformed variables and that the values are legitimate.
Finally, as mentioned earlier in our data exploration, during the impute step, caret::preProcess() automatically centered, scaled and BoxCox transformed our data. Here are some plots to demonstrate the changes in distributions and final values after the transformations:
We see that after the transformations, the variables are more centered and more closely resemble a normal distribution, although clearly they are still not perfect normal distributions.
With our transformations complete, we can now add these into our clean_df dataframe and continue on to build our models. To better measure each model performance, we split our data into a training and testing data set. We will train using the first, then measure model performance again the testing hold out set.
## [1] "Number of Training Samples: 10238"
## [1] "Number of Testing Samples: 2557"
In this first model, we include all available features. Features include:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2322 -0.6405 -0.0033 0.4393 3.7034
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.4764210 0.4500102 -1.059
## FixedAcidity 0.0005923 0.0057834 0.102
## VolatileAcidity -0.0231063 0.0057719 -4.003
## CitricAcid 0.0016765 0.0056782 0.295
## ResidualSugar -0.0028065 0.0058368 -0.481
## Chlorides -0.0109994 0.0058997 -1.864
## FreeSulfurDioxide 0.0124254 0.0057848 2.148
## TotalSulfurDioxide 0.0148079 0.0058740 2.521
## Density -0.0081961 0.0056717 -1.445
## pH -0.0100224 0.0057717 -1.736
## Sulphates -0.0105234 0.0059669 -1.764
## Alcohol 0.0138843 0.0058505 2.373
## as.factor(LabelAppeal)-1.11204793733397 0.2557234 0.0436304 5.861
## as.factor(LabelAppeal)0.0101741115806247 0.4410408 0.0426601 10.338
## as.factor(LabelAppeal)1.13239616049522 0.5765125 0.0433930 13.286
## as.factor(LabelAppeal)2.25461820940981 0.7229030 0.0485605 14.887
## as.factor(AcidIndex)-3.59682937695875 0.2966385 0.4532300 0.654
## as.factor(AcidIndex)-1.79176983045029 0.3657691 0.4481834 0.816
## as.factor(AcidIndex)-0.545318540973785 0.3285393 0.4479696 0.733
## as.factor(AcidIndex)0.362910765511677 0.2916780 0.4480181 0.651
## as.factor(AcidIndex)1.05172974217783 0.1849498 0.4483313 0.413
## as.factor(AcidIndex)1.59059728918163 0.0266897 0.4492902 0.059
## as.factor(AcidIndex)2.02271372429848 -0.2994324 0.4522662 -0.662
## as.factor(AcidIndex)2.37629509167962 -0.3838050 0.4584058 -0.837
## as.factor(AcidIndex)2.67051656830802 -0.2533947 0.4617542 -0.549
## as.factor(AcidIndex)2.9188445277671 -0.2353397 0.4699609 -0.501
## as.factor(AcidIndex)3.13100139587667 0.2361758 0.5863534 0.403
## as.factor(AcidIndex)3.31417429494859 -0.3706191 0.6334044 -0.585
## as.factor(AcidIndex)3.47378568897179 -0.3479974 0.6336536 -0.549
## as.factor(STARS)-0.42623524866846 0.7752057 0.0218663 35.452
## as.factor(STARS)0.416552574962037 1.0815575 0.0205812 52.551
## as.factor(STARS)1.25934039859254 1.1994529 0.0216582 55.381
## as.factor(STARS)2.10212822222303 1.3224400 0.0273678 48.321
## Pr(>|z|)
## (Intercept) 0.2897
## FixedAcidity 0.9184
## VolatileAcidity 0.0000624785 ***
## CitricAcid 0.7678
## ResidualSugar 0.6306
## Chlorides 0.0623 .
## FreeSulfurDioxide 0.0317 *
## TotalSulfurDioxide 0.0117 *
## Density 0.1484
## pH 0.0825 .
## Sulphates 0.0778 .
## Alcohol 0.0176 *
## as.factor(LabelAppeal)-1.11204793733397 0.0000000046 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.5128
## as.factor(AcidIndex)-1.79176983045029 0.4144
## as.factor(AcidIndex)-0.545318540973785 0.4633
## as.factor(AcidIndex)0.362910765511677 0.5150
## as.factor(AcidIndex)1.05172974217783 0.6800
## as.factor(AcidIndex)1.59059728918163 0.9526
## as.factor(AcidIndex)2.02271372429848 0.5079
## as.factor(AcidIndex)2.37629509167962 0.4024
## as.factor(AcidIndex)2.67051656830802 0.5832
## as.factor(AcidIndex)2.9188445277671 0.6165
## as.factor(AcidIndex)3.13100139587667 0.6871
## as.factor(AcidIndex)3.31417429494859 0.5585
## as.factor(AcidIndex)3.47378568897179 0.5829
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18260 on 10237 degrees of freedom
## Residual deviance: 10785 on 10205 degrees of freedom
## AIC: 36434
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.5905365 0.5213364 2.2265753 36433.9569815 36672.6744132
In this second model, we only include the most predictive features based on our first Poisson Model. The predictors for the following model are:
VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), family = poisson, data = clean_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2335 -0.6501 -0.0035 0.4410 3.6951
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.001201 0.318675 0.004
## VolatileAcidity -0.023533 0.005122 -4.595
## TotalSulfurDioxide 0.016952 0.005244 3.232
## Alcohol 0.016009 0.005231 3.060
## as.factor(LabelAppeal)-1.11204793733397 0.240289 0.037999 6.324
## as.factor(LabelAppeal)0.0101741115806247 0.430806 0.037064 11.623
## as.factor(LabelAppeal)1.13239616049522 0.564104 0.037710 14.959
## as.factor(LabelAppeal)2.25461820940981 0.699216 0.042446 16.473
## as.factor(AcidIndex)-3.59682937695875 -0.140143 0.322337 -0.435
## as.factor(AcidIndex)-1.79176983045029 -0.097789 0.316887 -0.309
## as.factor(AcidIndex)-0.545318540973785 -0.131342 0.316605 -0.415
## as.factor(AcidIndex)0.362910765511677 -0.162635 0.316637 -0.514
## as.factor(AcidIndex)1.05172974217783 -0.272917 0.316940 -0.861
## as.factor(AcidIndex)1.59059728918163 -0.432064 0.318025 -1.359
## as.factor(AcidIndex)2.02271372429848 -0.795786 0.321596 -2.474
## as.factor(AcidIndex)2.37629509167962 -0.810861 0.327262 -2.478
## as.factor(AcidIndex)2.67051656830802 -0.646459 0.330156 -1.958
## as.factor(AcidIndex)2.9188445277671 -0.738613 0.342710 -2.155
## as.factor(AcidIndex)3.13100139587667 -0.286638 0.403433 -0.710
## as.factor(AcidIndex)3.31417429494859 -0.951704 0.547993 -1.737
## as.factor(AcidIndex)3.47378568897179 -1.196936 0.548071 -2.184
## as.factor(STARS)-0.42623524866846 0.757245 0.019564 38.705
## as.factor(STARS)0.416552574962037 1.075518 0.018261 58.896
## as.factor(STARS)1.25934039859254 1.194174 0.019232 62.093
## as.factor(STARS)2.10212822222303 1.313789 0.024330 53.999
## Pr(>|z|)
## (Intercept) 0.99699
## VolatileAcidity 0.000004335408 ***
## TotalSulfurDioxide 0.00123 **
## Alcohol 0.00221 **
## as.factor(LabelAppeal)-1.11204793733397 0.000000000256 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.66373
## as.factor(AcidIndex)-1.79176983045029 0.75763
## as.factor(AcidIndex)-0.545318540973785 0.67825
## as.factor(AcidIndex)0.362910765511677 0.60751
## as.factor(AcidIndex)1.05172974217783 0.38918
## as.factor(AcidIndex)1.59059728918163 0.17428
## as.factor(AcidIndex)2.02271372429848 0.01334 *
## as.factor(AcidIndex)2.37629509167962 0.01322 *
## as.factor(AcidIndex)2.67051656830802 0.05023 .
## as.factor(AcidIndex)2.9188445277671 0.03114 *
## as.factor(AcidIndex)3.13100139587667 0.47740
## as.factor(AcidIndex)3.31417429494859 0.08244 .
## as.factor(AcidIndex)3.47378568897179 0.02897 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 13550 on 12770 degrees of freedom
## AIC: 45542
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.591897 0.521064 2.228483 45542.318359 45728.738603
Similar to Poisson Model 1, the predictors for the following model are:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = clean_df,
## init.theta = 40957.00204, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2219 -0.6496 -0.0055 0.4446 3.6790
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.0275198 0.3190721 0.086
## FixedAcidity 0.0010176 0.0051829 0.196
## VolatileAcidity -0.0231944 0.0051231 -4.527
## CitricAcid 0.0039724 0.0050838 0.781
## ResidualSugar 0.0005767 0.0052060 0.111
## Chlorides -0.0122552 0.0052206 -2.347
## FreeSulfurDioxide 0.0132531 0.0051832 2.557
## TotalSulfurDioxide 0.0170166 0.0052484 3.242
## Density -0.0074549 0.0050938 -1.464
## pH -0.0066655 0.0051875 -1.285
## Sulphates -0.0106282 0.0053149 -2.000
## Alcohol 0.0157805 0.0052355 3.014
## as.factor(LabelAppeal)-1.11204793733397 0.2398534 0.0380017 6.312
## as.factor(LabelAppeal)0.0101741115806247 0.4300463 0.0370666 11.602
## as.factor(LabelAppeal)1.13239616049522 0.5633460 0.0377142 14.937
## as.factor(LabelAppeal)2.25461820940981 0.6992610 0.0424548 16.471
## as.factor(AcidIndex)-3.59682937695875 -0.1651538 0.3226593 -0.512
## as.factor(AcidIndex)-1.79176983045029 -0.1214774 0.3172467 -0.383
## as.factor(AcidIndex)-0.545318540973785 -0.1558113 0.3169954 -0.492
## as.factor(AcidIndex)0.362910765511677 -0.1871144 0.3170413 -0.590
## as.factor(AcidIndex)1.05172974217783 -0.2972766 0.3173791 -0.937
## as.factor(AcidIndex)1.59059728918163 -0.4542592 0.3184811 -1.426
## as.factor(AcidIndex)2.02271372429848 -0.8158352 0.3220684 -2.533
## as.factor(AcidIndex)2.37629509167962 -0.8303961 0.3277299 -2.534
## as.factor(AcidIndex)2.67051656830802 -0.6688133 0.3306330 -2.023
## as.factor(AcidIndex)2.9188445277671 -0.7687131 0.3432641 -2.239
## as.factor(AcidIndex)3.13100139587667 -0.3297889 0.4038365 -0.817
## as.factor(AcidIndex)3.31417429494859 -0.9814037 0.5484760 -1.789
## as.factor(AcidIndex)3.47378568897179 -1.2022430 0.5486104 -2.191
## as.factor(STARS)-0.42623524866846 0.7548590 0.0195728 38.567
## as.factor(STARS)0.416552574962037 1.0732229 0.0182738 58.730
## as.factor(STARS)1.25934039859254 1.1910222 0.0192473 61.880
## as.factor(STARS)2.10212822222303 1.3117031 0.0243440 53.882
## Pr(>|z|)
## (Intercept) 0.93127
## FixedAcidity 0.84435
## VolatileAcidity 0.000005971523 ***
## CitricAcid 0.43458
## ResidualSugar 0.91179
## Chlorides 0.01890 *
## FreeSulfurDioxide 0.01056 *
## TotalSulfurDioxide 0.00119 **
## Density 0.14332
## pH 0.19882
## Sulphates 0.04554 *
## Alcohol 0.00258 **
## as.factor(LabelAppeal)-1.11204793733397 0.000000000276 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.60875
## as.factor(AcidIndex)-1.79176983045029 0.70179
## as.factor(AcidIndex)-0.545318540973785 0.62305
## as.factor(AcidIndex)0.362910765511677 0.55506
## as.factor(AcidIndex)1.05172974217783 0.34893
## as.factor(AcidIndex)1.59059728918163 0.15377
## as.factor(AcidIndex)2.02271372429848 0.01131 *
## as.factor(AcidIndex)2.37629509167962 0.01128 *
## as.factor(AcidIndex)2.67051656830802 0.04309 *
## as.factor(AcidIndex)2.9188445277671 0.02513 *
## as.factor(AcidIndex)3.13100139587667 0.41413
## as.factor(AcidIndex)3.31417429494859 0.07356 .
## as.factor(AcidIndex)3.47378568897179 0.02842 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40957) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13529 on 12762 degrees of freedom
## AIC: 45540
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40957
## Std. Err.: 34344
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45472.13
## RMSE Rsquared MAE aic bic
## 2.5916189 0.5222719 2.2281550 45540.1321552 45793.6636866
Similar to Poisson Model 2, the predictors for the following model are:
VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), data = clean_df, init.theta = 40912.67371,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2456 -0.6517 -0.0038 0.4399 3.6952
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.014317 0.318731 0.045
## VolatileAcidity -0.023451 0.005122 -4.578
## FreeSulfurDioxide 0.013266 0.005179 2.562
## TotalSulfurDioxide 0.016879 0.005244 3.219
## Alcohol 0.016237 0.005231 3.104
## as.factor(LabelAppeal)-1.11204793733397 0.240039 0.038000 6.317
## as.factor(LabelAppeal)0.0101741115806247 0.430314 0.037065 11.610
## as.factor(LabelAppeal)1.13239616049522 0.563287 0.037712 14.936
## as.factor(LabelAppeal)2.25461820940981 0.698162 0.042449 16.447
## as.factor(AcidIndex)-3.59682937695875 -0.154135 0.322401 -0.478
## as.factor(AcidIndex)-1.79176983045029 -0.110580 0.316945 -0.349
## as.factor(AcidIndex)-0.545318540973785 -0.143771 0.316660 -0.454
## as.factor(AcidIndex)0.362910765511677 -0.174856 0.316690 -0.552
## as.factor(AcidIndex)1.05172974217783 -0.285228 0.316994 -0.900
## as.factor(AcidIndex)1.59059728918163 -0.443142 0.318072 -1.393
## as.factor(AcidIndex)2.02271372429848 -0.806122 0.321638 -2.506
## as.factor(AcidIndex)2.37629509167962 -0.819576 0.327296 -2.504
## as.factor(AcidIndex)2.67051656830802 -0.655682 0.330193 -1.986
## as.factor(AcidIndex)2.9188445277671 -0.753474 0.342775 -2.198
## as.factor(AcidIndex)3.13100139587667 -0.299583 0.403483 -0.742
## as.factor(AcidIndex)3.31417429494859 -0.953685 0.548008 -1.740
## as.factor(AcidIndex)3.47378568897179 -1.205542 0.548094 -2.200
## as.factor(STARS)-0.42623524866846 0.756497 0.019567 38.662
## as.factor(STARS)0.416552574962037 1.074730 0.018264 58.844
## as.factor(STARS)1.25934039859254 1.193602 0.019234 62.057
## as.factor(STARS)2.10212822222303 1.314420 0.024331 54.023
## Pr(>|z|)
## (Intercept) 0.96417
## VolatileAcidity 0.000004685011 ***
## FreeSulfurDioxide 0.01042 *
## TotalSulfurDioxide 0.00129 **
## Alcohol 0.00191 **
## as.factor(LabelAppeal)-1.11204793733397 0.000000000267 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.63259
## as.factor(AcidIndex)-1.79176983045029 0.72717
## as.factor(AcidIndex)-0.545318540973785 0.64981
## as.factor(AcidIndex)0.362910765511677 0.58086
## as.factor(AcidIndex)1.05172974217783 0.36823
## as.factor(AcidIndex)1.59059728918163 0.16356
## as.factor(AcidIndex)2.02271372429848 0.01220 *
## as.factor(AcidIndex)2.37629509167962 0.01228 *
## as.factor(AcidIndex)2.67051656830802 0.04706 *
## as.factor(AcidIndex)2.9188445277671 0.02794 *
## as.factor(AcidIndex)3.13100139587667 0.45779
## as.factor(AcidIndex)3.31417429494859 0.08181 .
## as.factor(AcidIndex)3.47378568897179 0.02784 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40912.67) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13543 on 12769 degrees of freedom
## AIC: 45540
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40913
## Std. Err.: 34297
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45486.18
## RMSE Rsquared MAE aic bic
## 2.5918248 0.5214531 2.2284242 45540.1774667 45741.5113299
The predictors for the following model are:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9635 -0.8591 0.0325 0.8384 6.0750
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.995095 0.755491 1.317
## FixedAcidity 0.004930 0.011720 0.421
## VolatileAcidity -0.073667 0.011565 -6.370
## CitricAcid 0.014455 0.011558 1.251
## ResidualSugar 0.004027 0.011782 0.342
## Chlorides -0.038485 0.011776 -3.268
## FreeSulfurDioxide 0.039825 0.011786 3.379
## TotalSulfurDioxide 0.049447 0.011816 4.185
## Density -0.022475 0.011545 -1.947
## pH -0.018619 0.011704 -1.591
## Sulphates -0.028248 0.012015 -2.351
## Alcohol 0.050812 0.011830 4.295
## as.factor(LabelAppeal)-1.11204793733397 0.367639 0.062729 5.861
## as.factor(LabelAppeal)0.0101741115806247 0.835185 0.061168 13.654
## as.factor(LabelAppeal)1.13239616049522 1.302062 0.063917 20.371
## as.factor(LabelAppeal)2.25461820940981 1.889951 0.084169 22.454
## as.factor(AcidIndex)-3.59682937695875 -0.334854 0.767938 -0.436
## as.factor(AcidIndex)-1.79176983045029 -0.220221 0.754101 -0.292
## as.factor(AcidIndex)-0.545318540973785 -0.322349 0.753472 -0.428
## as.factor(AcidIndex)0.362910765511677 -0.429363 0.753539 -0.570
## as.factor(AcidIndex)1.05172974217783 -0.732560 0.754117 -0.971
## as.factor(AcidIndex)1.59059728918163 -1.041297 0.755385 -1.378
## as.factor(AcidIndex)2.02271372429848 -1.513113 0.757752 -1.997
## as.factor(AcidIndex)2.37629509167962 -1.533481 0.762156 -2.012
## as.factor(AcidIndex)2.67051656830802 -1.552014 0.769606 -2.017
## as.factor(AcidIndex)2.9188445277671 -1.400154 0.777299 -1.801
## as.factor(AcidIndex)3.13100139587667 -0.692206 0.883131 -0.784
## as.factor(AcidIndex)3.31417429494859 -1.772148 0.952843 -1.860
## as.factor(AcidIndex)3.47378568897179 -1.920432 0.900840 -2.132
## as.factor(STARS)-0.42623524866846 1.346560 0.032920 40.904
## as.factor(STARS)0.416552574962037 2.381720 0.032021 74.381
## as.factor(STARS)1.25934039859254 2.942287 0.037079 79.352
## as.factor(STARS)2.10212822222303 3.629958 0.059150 61.368
## Pr(>|t|)
## (Intercept) 0.18781
## FixedAcidity 0.67401
## VolatileAcidity 0.000000000196 ***
## CitricAcid 0.21109
## ResidualSugar 0.73251
## Chlorides 0.00109 **
## FreeSulfurDioxide 0.00073 ***
## TotalSulfurDioxide 0.000028712970 ***
## Density 0.05160 .
## pH 0.11167
## Sulphates 0.01873 *
## Alcohol 0.000017590098 ***
## as.factor(LabelAppeal)-1.11204793733397 0.000000004722 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.66281
## as.factor(AcidIndex)-1.79176983045029 0.77027
## as.factor(AcidIndex)-0.545318540973785 0.66879
## as.factor(AcidIndex)0.362910765511677 0.56883
## as.factor(AcidIndex)1.05172974217783 0.33136
## as.factor(AcidIndex)1.59059728918163 0.16807
## as.factor(AcidIndex)2.02271372429848 0.04586 *
## as.factor(AcidIndex)2.37629509167962 0.04424 *
## as.factor(AcidIndex)2.67051656830802 0.04375 *
## as.factor(AcidIndex)2.9188445277671 0.07168 .
## as.factor(AcidIndex)3.13100139587667 0.43317
## as.factor(AcidIndex)3.31417429494859 0.06293 .
## as.factor(AcidIndex)3.47378568897179 0.03304 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 12762 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5429
## F-statistic: 475.9 on 32 and 12762 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.2999972 0.5444765 1.0168751 43106.3022739 43359.8338054
For the final Linear Model, we leverage stepAIC on our Linear Model #5 to choose the most important features.
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## as.factor(LabelAppeal) + as.factor(AcidIndex) + as.factor(STARS),
## data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9616 -0.8590 0.0352 0.8399 6.0675
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.99019 0.75488 1.312
## VolatileAcidity -0.07393 0.01156 -6.394
## Chlorides -0.03864 0.01177 -3.282
## FreeSulfurDioxide 0.04009 0.01178 3.403
## TotalSulfurDioxide 0.04960 0.01181 4.200
## Density -0.02270 0.01154 -1.967
## pH -0.01862 0.01170 -1.591
## Sulphates -0.02837 0.01201 -2.362
## Alcohol 0.05100 0.01183 4.313
## as.factor(LabelAppeal)-1.11204793733397 0.36722 0.06272 5.854
## as.factor(LabelAppeal)0.0101741115806247 0.83483 0.06116 13.649
## as.factor(LabelAppeal)1.13239616049522 1.30161 0.06391 20.367
## as.factor(LabelAppeal)2.25461820940981 1.88998 0.08416 22.456
## as.factor(AcidIndex)-3.59682937695875 -0.33511 0.76744 -0.437
## as.factor(AcidIndex)-1.79176983045029 -0.21824 0.75353 -0.290
## as.factor(AcidIndex)-0.545318540973785 -0.31837 0.75287 -0.423
## as.factor(AcidIndex)0.362910765511677 -0.42442 0.75293 -0.564
## as.factor(AcidIndex)1.05172974217783 -0.72561 0.75343 -0.963
## as.factor(AcidIndex)1.59059728918163 -1.03390 0.75463 -1.370
## as.factor(AcidIndex)2.02271372429848 -1.50407 0.75695 -1.987
## as.factor(AcidIndex)2.37629509167962 -1.52217 0.76131 -1.999
## as.factor(AcidIndex)2.67051656830802 -1.54018 0.76867 -2.004
## as.factor(AcidIndex)2.9188445277671 -1.38512 0.77635 -1.784
## as.factor(AcidIndex)3.13100139587667 -0.67937 0.88243 -0.770
## as.factor(AcidIndex)3.31417429494859 -1.75343 0.95178 -1.842
## as.factor(AcidIndex)3.47378568897179 -1.89498 0.89978 -2.106
## as.factor(STARS)-0.42623524866846 1.34682 0.03291 40.918
## as.factor(STARS)0.416552574962037 2.38256 0.03201 74.442
## as.factor(STARS)1.25934039859254 2.94276 0.03707 79.374
## as.factor(STARS)2.10212822222303 3.63105 0.05914 61.397
## Pr(>|t|)
## (Intercept) 0.189639
## VolatileAcidity 0.000000000167 ***
## Chlorides 0.001034 **
## FreeSulfurDioxide 0.000669 ***
## TotalSulfurDioxide 0.000026926678 ***
## Density 0.049224 *
## pH 0.111585
## Sulphates 0.018191 *
## Alcohol 0.000016245384 ***
## as.factor(LabelAppeal)-1.11204793733397 0.000000004904 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.662366
## as.factor(AcidIndex)-1.79176983045029 0.772104
## as.factor(AcidIndex)-0.545318540973785 0.672396
## as.factor(AcidIndex)0.362910765511677 0.572976
## as.factor(AcidIndex)1.05172974217783 0.335525
## as.factor(AcidIndex)1.59059728918163 0.170687
## as.factor(AcidIndex)2.02271372429848 0.046941 *
## as.factor(AcidIndex)2.37629509167962 0.045584 *
## as.factor(AcidIndex)2.67051656830802 0.045124 *
## as.factor(AcidIndex)2.9188445277671 0.074426 .
## as.factor(AcidIndex)3.13100139587667 0.441383
## as.factor(AcidIndex)3.31417429494859 0.065461 .
## as.factor(AcidIndex)3.47378568897179 0.035220 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 12765 degrees of freedom
## Multiple R-squared: 0.544, Adjusted R-squared: 0.543
## F-statistic: 525.1 on 29 and 12765 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.2999843 0.5444859 1.0169024 43102.1597330 43333.3208352
This table summarizes the RMSE, \(R^2\), MAE, AIC and BIC for all 6 models. In terms of raw metrics, The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and \(R^2\); however, Poisson Model 1 had the best performance based on AIC and BIC.
Overall, RMSE an \(R^2\) were not largely different across the 6 models, given this we chose Poisson Model 1 as our final model since it had a far lower AIC.
| RMSE | Rsquared | MAE | aic | bic | |
|---|---|---|---|---|---|
| poiss1_eval | 2.590536 | 0.5213364 | 2.226575 | 36433.96 | 36672.67 |
| poiss2_eval | 2.591897 | 0.5210640 | 2.228483 | 45542.32 | 45728.74 |
| nb3_eval | 2.591619 | 0.5222719 | 2.228155 | 45540.13 | 45793.66 |
| nb4_eval | 2.591825 | 0.5214531 | 2.228424 | 45540.18 | 45741.51 |
| lm5_eval | 1.299997 | 0.5444765 | 1.016875 | 43106.30 | 43359.83 |
| lm6_eval | 1.299984 | 0.5444859 | 1.016902 | 43102.16 | 43333.32 |
The final thing we will take a look at is our variable importance for each model.
The final figure presented below shows the feature importance in each model. As we see, all 6 models identified the same top 10 features.
We apply Poisson Model #1 to the holdout evaluation set to predict the TARGET for these instances. We have saved these predictions as csv in the file eval_predictions.csv.
Source code: https://github.com/djlofland/DS621_F2020_Group3/tree/master/Homework_5/eval_predictions.csv
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1 0.17837679 -0.26524401 -1.51030924 -0.04455813 -0.4776009
## 2 1.37610328 0.84276407 0.07767213 -1.23934313 -0.7442725
## 3 0.94153812 0.01967235 1.81871194 -0.16055667 -1.1383538
## 4 0.85883598 -0.13861451 -0.28584168 1.73021959 -0.1309278
## 5 -0.02047049 0.68447720 -0.14553811 -0.03295827 -0.1250018
## 6 1.70750410 1.66585579 -0.36237090 -1.69173745 -0.1190758
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
## 1 0.11673887 -0.05275591 1.19564455 -0.3402128 2.66647966
## 2 3.49856180 -0.45621338 -0.22730155 -0.1442311 0.23889195
## 3 0.03195779 -0.14689598 -0.19280589 1.9789246 2.06326090
## 4 -0.73421194 0.49191169 -0.13675044 -0.2085894 -0.01122314
## 5 -0.05282329 0.26328578 -0.29198092 1.3139028 -0.98225823
## 6 1.50777652 -1.88848742 0.08315942 -1.6483055 -0.21720028
## Sulphates Alcohol LabelAppeal AcidIndex STARS
## 1 0.1211079 0.4857435 -1.11204794 -1.7917698 -1.2690231
## 2 0.6038736 1.4782809 0.01017411 -1.7917698 0.4165526
## 3 0.1640204 -0.5202067 0.01017411 0.3629108 -0.4262352
## 4 1.6981424 0.4857435 -1.11204794 0.3629108 -0.4262352
## 5 -0.6405890 -1.5261568 0.01017411 1.5905973 -1.2690231
## 6 -0.5869484 0.2443154 1.13239616 0.3629108 2.1021282
# =====================================================================================
# Load Libraries and Define Helper functions
# =====================================================================================
library(MASS)
library(rpart.plot)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(GGally)
library(corrplot)
library(RColorBrewer)
library(tibble)
library(tidyr)
library(tidyverse)
library(dplyr)
library(reshape2)
library(mixtools)
library(tidymodels)
library(ggpmisc)
library(regclass)
library(skimr)
#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
# Make sure a model was passed
if (is.null(model)) {
return
}
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
plot(residuals(model))
hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
xlab = "Residuals",col="lightgreen")
lines(density(model[["residuals"]], kernel = "ep"),col="blue", lwd=3)
curve(dnorm(x,mean=mean(model[["residuals"]]), sd=sd(model[["residuals"]])), col="red", lwd=3, lty="dotted", add=T)
qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")
par(mfrow = c(1, 1))
}
#' Print a Variable Importance Plot for the provided model
#'
#' @param model The model
#' @param chart_title The Title to show on the plot
#' @examples
#' variableImportancePlot(myLinearModel, 'My Title)
#' @return null
#' @export
variableImportancePlot <- function(model=NULL, chart_title='Variable Importance Plot') {
# Make sure a model was passed
if (is.null(model)) {
return
}
# use caret and gglot to print a variable importance plot
varImp(model) %>% as.data.frame() %>%
ggplot(aes(x = reorder(rownames(.), desc(Overall)), y = Overall)) +
geom_col(aes(fill = Overall)) +
theme(panel.background = element_blank(),
panel.grid = element_blank(),
axis.text.x = element_text(angle = 90)) +
scale_fill_gradient() +
labs(title = chart_title,
x = "Parameter",
y = "Relative Importance")
}
#' Print a Facet Chart of histograms
#'
#' @param df Dataset
#' @param box Facet size (rows)
#' @examples
#' histbox(my_df, 3)
#' @return null
#' @export
histbox <- function(df, box) {
par(mfrow = box)
ndf <- dimnames(df)[[2]]
for (i in seq_along(ndf)) {
data <- na.omit(unlist(df[, i]))
hist(data, breaks = "fd", main = paste("Histogram of", ndf[i]),
xlab = ndf[i], freq = FALSE)
lines(density(data, kernel = "ep"), col = 'red')
}
par(mfrow = c(1, 1))
}
#' Extract key performance results from a model
#'
#' @param model A linear model of interest
#' @examples
#' model_performance_extraction(my_model)
#' @return data.frame
#' @export
model_performance_extraction <- function(model=NULL) {
# Make sure a model was passed
if (is.null(model)) {
return
}
data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
}
# =====================================================================================
# Load Data set
# =====================================================================================
# Load Wine dataset
df <- read.csv('https://raw.githubusercontent.com/djlofland/DS621_F2020_Group3/master/Homework_5/datasets/wine-training-data.csv', fileEncoding="UTF-8-BOM")
df_eval <- read.csv('https://raw.githubusercontent.com/djlofland/DS621_F2020_Group3/master/Homework_5/datasets/wine-evaluation-data.csv')
# =====================================================================================
# Summary Stats
# =====================================================================================
# Display summary statistics
summary(df)
# =====================================================================================
# Distributions
# =====================================================================================
# Prepare data for ggplot
gather_df <- df %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)
# =====================================================================================
# Boxplots
# =====================================================================================
# Prepare data for ggplot
gather_df <- df %>%
gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(gather_df, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=6)
df_character_wide <- df %>%
select(TARGET, STARS, LabelAppeal, AcidIndex) %>%
pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
arrange(variable, value)
df_character_wide %>%
ggplot(mapping = aes(x = factor(value), y = TARGET)) +
geom_boxplot() +
facet_wrap(.~variable, scales="free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
# =====================================================================================
# Variable Plots
# =====================================================================================
# Plot scatter plots of each variable versus the target variable
featurePlot(df[,2:ncol(df)], df[,1], pch = 20)
# =====================================================================================
# Missing Data
# =====================================================================================
# Identify missing data by Feature and display percent breakout
missing <- colSums(df %>% sapply(is.na))
missing_pct <- round(missing / nrow(df) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))
# separate our features from target so we don't inadvertently transform the target
training_x <- df %>% select(-TARGET)
training_y <- df$TARGET
# separate our features from target so we don't inadvertently transform the target
eval_x <- df_eval %>% select(-TARGET)
eval_y <- df_eval$TARGET
create_na_dummy <- function(vector) {
as.integer(vector %>% is.na())
}
impute_missing <- function(data) {
# Replace missing STARS with 0
data$STARS <- data$STARS %>%
replace_na(0)
return(data)
}
# Replace missing STARS with 'unknown' and convert STASR to a factor
training_x <- impute_missing(training_x)
eval_x <- impute_missing(eval_x)
imputation <- preProcess(training_x, method = c("knnImpute", 'BoxCox'))
# summary(imputation)
training_x_imp <- predict(imputation, training_x)
eval_x_imp <- predict(imputation, eval_x)
clean_df <- cbind(training_y, training_x_imp) %>%
as.data.frame() %>%
rename(TARGET = training_y)
clean_eval_df <- cbind(eval_y, eval_x_imp) %>%
as.data.frame() %>%
rename(TARGET = eval_y)
# =====================================================================================
# Feature-target correlations
# =====================================================================================
# Show feature correlations/target by decreasing correlation
stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
# =====================================================================================
# Multicolinearity
# =====================================================================================
# Calculate and plot the Multicolinearity
correlation = cor(clean_df, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu"))
# =====================================================================================
# Transform non-normal variables
# =====================================================================================
# created empty data frame to store transformed variables
#df_temp <- data.frame(matrix(ncol = 1, nrow = length(clean_df$TARGET)))
# performed log transformation
#df_temp$Chlorides <- clean_df$Chlorides
#df_temp$Chlorides_transform <- log(clean_df$Chlorides)
# performed log transformation
#df_temp$CitricAcid <- clean_df$CitricAcid
#df_temp$CitricAcid_transform <- log(clean_df$CitricAcid)
# performed a log transformation
#df_temp$FixedAcidity <- clean_df$FixedAcidity
#df_temp$FixedAcidity_transform <- log(clean_df$FixedAcidity)
# performed a log transformation
#df_temp$FreeSulfurDioxide <- clean_df$FreeSulfurDioxide
#df_temp$FreeSulfurDioxide_transform <- log(clean_df$FreeSulfurDioxide)
# performed a log transformation
#df_temp$ResidualSugar <- clean_df$ResidualSugar
#df_temp$ResidualSugar_transform <- log(clean_df$ResidualSugar)
# performed a log transformation
#df_temp$Sulphates <- clean_df$Sulphates
#df_temp$Sulphates_transform <- log(clean_df$Sulphates)
# performed a log transformation
#df_temp$TotalSulfurDioxide <- clean_df$TotalSulfurDioxide
#df_temp$TotalSulfurDioxide_transform <- log(clean_df$TotalSulfurDioxide)
# performed a log transformation
#df_temp$VolatileAcidity <- clean_df$VolatileAcidity
#df_temp$VolatileAcidity_transform <- log(clean_df$VolatileAcidity)
# =====================================================================================
# Finalizing dataset for model building
# =====================================================================================
options(scipen = 999)
#75% data test training split
# get training/test split
y_raw <- as.matrix(clean_df$TARGET)
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)
# Build training data sets
trainX <- clean_df[trainingRows,] %>% select(-TARGET)
trainY <- clean_df[trainingRows,] %>% select(TARGET)
# put remaining rows into the test sets
testX <- clean_df[-trainingRows,] %>% select(-TARGET)
testY <- clean_df[-trainingRows,] %>% select(TARGET)
# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$TARGET <- trainY$TARGET
print(paste('Number of Training Samples: ', dim(trainingData)[1]))
testingData <- as.data.frame(testX)
testingData$TARGET <- testY$TARGET
print(paste('Number of Testing Samples: ', dim(testingData)[1]))
model_test_perf <- function(model, trainX, trainY, testX, testY) {
# Evaluate Model 1 with testing data set
predictedY <- predict(model, newdata=trainX)
model_results <- data.frame(obs = trainY, pred=predictedY)
colnames(model_results) = c('obs', 'pred')
# This grabs RMSE, Rsquaredand MAE by default
model_eval <- defaultSummary(model_results)
# Add AIC score to the results
if ('aic' %in% model) {
model_eval[4] <- model$aic
} else {
model_eval[4] <- AIC(model)
}
names(model_eval)[4] <- 'aic'
# Add BIC score to the results
model_eval[5] <- BIC(model)
names(model_eval)[5] <- 'bic'
return(model_eval)
# =====================================================================================
# Poisson Model 1
# =====================================================================================
options(scipen = 999)
poiss1 <- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=trainingData,
family=poisson)
summary(poiss1)
# Evaluate Model 1 with testing data set
(poiss1_eval <- model_test_perf(poiss1, trainX, trainY, testX, testY))
po1VIP <- variableImportancePlot(poiss1, "Poisson Model 1 Variable Importance")
# =====================================================================================
# Poisson Model 2
# =====================================================================================
poiss2 <- glm(TARGET ~ VolatileAcidity + TotalSulfurDioxide + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df,
family=poisson)
summary(poiss2)
# Evaluate Model 1 with testing data set
(poiss2_eval <- model_test_perf(poiss2, trainX, trainY, testX, testY))
po2VIP <- variableImportancePlot(poiss2, "Poisson Model 2 Variable Importance")
# =====================================================================================
# Negative Binomial Model 3
# =====================================================================================
nb3 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df)
summary(nb3)
# Evaluate Model 1 with testing data set
(nb3_eval <- model_test_perf(nb3, trainX, trainY, testX, testY))
nb3VIP <- variableImportancePlot(nb3, "Negative Binomial 3 Variable Importance")(model3$fit, "Linear Model 3")
# =====================================================================================
# Negative Binomial Model 4
# =====================================================================================
nb4 <- glm.nb(TARGET~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df)
summary (nb4)
# Evaluate Model 1 with testing data set
(nb4_eval <- model_test_perf(nb4, trainX, trainY, testX, testY))
nb4VIP <- variableImportancePlot(nb4, "Negative Binomial 4 Variable Importance")
# =====================================================================================
# Linear Model 5
# =====================================================================================
lm5 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df)
summary(lm5)
# Evaluate Model 1 with testing data set
(lm5_eval <- model_test_perf(lm5, trainX, trainY, testX, testY))
lm5VIP <- variableImportancePlot(lm5, "Linear Model 5 Variable Importance")
# =====================================================================================
# Linear Model 6
# =====================================================================================
lm6 <- stepAIC(lm5, direction = "both",
scope = list(upper = lm5, lower = ~ 1),
scale = 0, trace = FALSE)
summary(lm6)
# Evaluate Model 1 with testing data set
(lm6_eval <- model_test_perf(lm6, trainX, trainY, testX, testY))
lm6VIP <- variableImportancePlot(lm6, "Linear Model 6 Variable Importance")
# =====================================================================================
Model selection and analysis
# =====================================================================================
models_summary <- rbind(poiss1_eval, poiss2_eval, nb3_eval, nb4_eval, lm5_eval, lm6_eval)
kable(models_summary) %>%
kable_styling(bootstrap_options = "basic", position = "center")
grid.arrange(po1VIP, po2VIP, nb3VIP, nb4VIP, lm5VIP, lm6VIP, ncol = 2)
# =====================================================================================
Predictions
# =====================================================================================
eval_data <- clean_eval_df %>% select(-TARGET)
predictions <- predict(poiss1, eval_data)
clean_eval_df$TARGET <- predictions
write.csv(clean_eval_df, 'eval_predictions.csv', row.names=F)
head(clean_eval_df)
# =====================================================================================
End of R Code
# =====================================================================================