Introduction

This project aims to estimate the number of cases of wine that a wine seller can expect to sell, based on the variables observed in the data provided. The dataset is fairly large (12,795 observations in the training set), and most variables are chemical properties of the wine, such as sugar and acidity. The premise is that these factors affect how restaurants, or their buyers and sommeliers, evaluate the wines. In other words, the seller wants to know which properties matter most for maximizing potential sales.

Data Exploration 1.0

We loaded the data, converted it into a data frame, and removed the index column.

#get rid of index
train <- train[,-1]
eval <- eval[,-1]

head(train)
TARGET  FixedAcidity  VolatileAcidity  CitricAcid  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  Density    pH  Sulphates  Alcohol  LabelAppeal  AcidIndex  STARS
     3           3.2             1.16       -0.98           54.2     -0.567                 NA                 268    0.993  3.33      -0.59      9.9            0          8      2
     3           4.5             0.16       -0.81           26.1     -0.425                 15                -327     1.03  3.38        0.7       NA           -1          7      3
     5           7.1             2.64       -0.88           14.8      0.037                214                 142    0.995  3.12       0.48       22           -1          8      3
     3           5.7            0.385        0.04           18.8     -0.425                 22                 115    0.996  2.24       1.83      6.2           -1          6      1
     4             8             0.33       -1.26            9.4         NA               -167                 108    0.995  3.12       1.77     13.7            0          9      2
     0          11.3             0.32        0.59            2.2      0.556                -37                  15    0.999   3.2       1.29     15.4            0         11     NA
head(eval)
TARGETFixedAcidityVolatileAcidityCitricAcidResidualSugarChloridesFreeSulfurDioxideTotalSulfurDioxideDensitypHSulphatesAlcoholLabelAppealAcidIndexSTARS
5.4-0.86 0.27-10.70.092233980.9855.020.6412.3 -16
12.40.385-0.76-19.71.17 -37680.99 3.371.0916   062
7.21.75 0.17-33  0.0659761.05 4.610.688.55081
6.20.1  1.8 1  -0.179104890.9893.2 2.1112.3 -181
11.40.21 0.281.20.03870531.03 2.54-0.074.8 010
17.60.04 -1.151.40.535-2501400.95 3.06-0.0211.4 184

Data Exploration 2.0

We first evaluated the training dataset. We ran some basic commands and visuals to look at the data structure, descriptive statistics, and missing values.

We initially found that the number of NAs in the dataset is significant (STARS alone is missing in about 26% of rows), so we knew we would need to impute at some point. In addition, we noticed that several variables contain negative values and that, before modeling, we would need to transform these series onto a scale that starts at zero or above.
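A quick way to quantify both problems at once is to count missing and negative values per column. The sketch below uses a small toy data frame (illustrative values, not the real `train` data) and base R only; the names `toy`, `na_count`, `neg_count`, and `data_issues` are assumptions made for this example.

```r
# Toy data frame standing in for `train` (values are illustrative only)
toy <- data.frame(
  TARGET    = c(3, 0, 5, 4),
  Chlorides = c(-0.567, NA, 0.037, 0.556),
  Alcohol   = c(9.9, NA, 22, -4.7)
)

# Count missing values per column
na_count <- colSums(is.na(toy))

# Count negative values per column, ignoring NAs
neg_count <- colSums(toy < 0, na.rm = TRUE)

# Side-by-side view of both data-quality issues
data_issues <- data.frame(missing = na_count, negative = neg_count)
print(data_issues)
```

Running the same two `colSums()` calls on `train` should give a per-variable view of which columns need imputation and which need a sign adjustment.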

First, however, we looked at the distribution of all the variables. We standardized them (centered at zero, scaled to unit variance) to view them together in boxplots. We initially thought to convert categorical variables like STARS and LabelAppeal into factors, but computing the correlations required numeric values first. From the histograms and the correlation plot, we found that most variables, including STARS and LabelAppeal, look approximately normally distributed, and that these two are closely correlated with our target variable, cases sold.
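To read the relationship with the target as numbers rather than from a plot, the TARGET column of the correlation matrix can be sorted. This is a minimal sketch on simulated data, where TARGET is deliberately built to depend on STARS; the simulated data frame and the names `cor_matrix` and `target_cor` are assumptions for illustration, and the values say nothing about the real dataset.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: TARGET depends on STARS, Density is pure noise
toy <- data.frame(
  STARS   = sample(1:4, n, replace = TRUE),
  Density = rnorm(n, mean = 0.994, sd = 0.027)
)
toy$TARGET <- toy$STARS + rpois(n, lambda = 1)

# Inject missing ratings, as in the real data
toy$STARS[sample(n, 40)] <- NA

# Pairwise-complete Pearson correlations, then rank predictors by correlation with TARGET
cor_matrix <- cor(toy, method = "pearson", use = "pairwise.complete.obs")
target_cor <- sort(cor_matrix[, "TARGET"], decreasing = TRUE)
target_cor <- target_cor[names(target_cor) != "TARGET"]
print(round(target_cor, 2))
```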

#Data exploration

#summary tables and data structure

summary(train)
##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359
str(train)
## tibble [12,795 x 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
st(train)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 7.076 6.318 -18.1 5.2 9.5 34.4
VolatileAcidity 12795 0.324 0.784 -2.79 0.13 0.64 3.68
CitricAcid 12795 0.308 0.862 -3.24 0.03 0.58 3.86
ResidualSugar 12179 5.419 33.749 -127.8 -2 15.9 141.15
Chlorides 12157 0.055 0.318 -1.171 -0.031 0.153 1.351
FreeSulfurDioxide 12148 30.846 148.715 -555 0 70 623
TotalSulfurDioxide 12113 120.714 231.913 -823 27 208 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12400 3.208 0.68 0.48 2.96 3.47 6.13
Sulphates 11585 0.527 0.932 -3.13 0.28 0.86 4.24
Alcohol 12142 10.489 3.728 -4.7 9 12.4 26.5
LabelAppeal 12795 -0.009 0.891 -2 -1 1 2
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 9436 2.042 0.903 1 1 3 4
describe(train)
## train 
## 
##  15  Variables      12795  Observations
## --------------------------------------------------------------------------------
## TARGET 
##        n  missing distinct     Info     Mean      Gmd 
##    12795        0        9    0.962    3.029    2.141 
## 
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency   2734   244  1091  2611  3177  2014   765   142    17
## Proportion 0.214 0.019 0.085 0.204 0.248 0.157 0.060 0.011 0.001
## --------------------------------------------------------------------------------
## FixedAcidity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      470        1    7.076    6.688     -3.6     -1.2 
##      .25      .50      .75      .90      .95 
##      5.2      6.9      9.5     15.6     17.8 
## 
## lowest : -18.1 -18.0 -17.7 -17.5 -17.4, highest:  32.4  32.5  32.6  34.1  34.4
## --------------------------------------------------------------------------------
## VolatileAcidity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      815        1   0.3241   0.8262   -1.023   -0.720 
##      .25      .50      .75      .90      .95 
##    0.130    0.280    0.640    1.350    1.640 
## 
## lowest : -2.790 -2.750 -2.745 -2.730 -2.720, highest:  3.500  3.550  3.565  3.590  3.680
## --------------------------------------------------------------------------------
## CitricAcid 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      602        1   0.3084   0.9057    -1.16    -0.84 
##      .25      .50      .75      .90      .95 
##     0.03     0.31     0.58     1.43     1.79 
## 
## lowest : -3.24 -3.16 -3.10 -3.08 -3.06, highest:  3.63  3.68  3.70  3.77  3.86
## --------------------------------------------------------------------------------
## ResidualSugar 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12179      616     2077        1    5.419    35.31   -52.70   -39.66 
##      .25      .50      .75      .90      .95 
##    -2.00     3.90    15.90    49.72    62.70 
## 
## lowest : -127.80 -127.10 -126.20 -126.10 -125.70
## highest:  136.50  137.60  138.00  140.65  141.15
## --------------------------------------------------------------------------------
## Chlorides 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12157      638     1663        1  0.05482   0.3311   -0.489   -0.372 
##      .25      .50      .75      .90      .95 
##   -0.031    0.046    0.153    0.481    0.598 
## 
## lowest : -1.171 -1.170 -1.158 -1.156 -1.155, highest:  1.260  1.261  1.270  1.275  1.351
## --------------------------------------------------------------------------------
## FreeSulfurDioxide 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12148      647      999        1    30.85    155.2     -224     -171 
##      .25      .50      .75      .90      .95 
##        0       30       70      230      284 
## 
## lowest : -555 -546 -536 -535 -532, highest:  613  617  618  622  623
## --------------------------------------------------------------------------------
## TotalSulfurDioxide 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12113      682     1370        1    120.7    246.9   -273.0   -185.0 
##      .25      .50      .75      .90      .95 
##     27.0    123.0    208.0    421.8    513.4 
## 
## lowest : -823 -816 -793 -781 -779, highest: 1032 1041 1048 1054 1057
## --------------------------------------------------------------------------------
## Density 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0     5933        1   0.9942  0.02769   0.9488   0.9587 
##      .25      .50      .75      .90      .95 
##   0.9877   0.9945   1.0005   1.0295   1.0398 
## 
## lowest : 0.88809 0.88949 0.88978 0.88983 0.89167
## highest: 1.09658 1.09679 1.09695 1.09791 1.09924
## --------------------------------------------------------------------------------
## pH 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12400      395      497        1    3.208   0.7242     2.06     2.31 
##      .25      .50      .75      .90      .95 
##     2.96     3.20     3.47     4.10     4.37 
## 
## lowest : 0.48 0.53 0.54 0.58 0.59, highest: 5.91 5.94 6.02 6.05 6.13
## --------------------------------------------------------------------------------
## Sulphates 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    11585     1210      630        1   0.5271   0.9827    -1.05    -0.70 
##      .25      .50      .75      .90      .95 
##     0.28     0.50     0.86     1.77     2.09 
## 
## lowest : -3.13 -3.12 -3.10 -3.07 -3.03, highest:  4.11  4.16  4.19  4.21  4.24
## --------------------------------------------------------------------------------
## Alcohol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12142      653      401        1    10.49    4.015      4.1      5.7 
##      .25      .50      .75      .90      .95 
##      9.0     10.4     12.4     15.2     16.7 
## 
## lowest : -4.7 -4.5 -4.4 -4.3 -4.1, highest: 25.4 25.6 26.0 26.1 26.5
## --------------------------------------------------------------------------------
## LabelAppeal 
##         n   missing  distinct      Info      Mean       Gmd 
##     12795         0         5     0.887 -0.009066    0.9566 
## 
## lowest : -2 -1  0  1  2, highest: -2 -1  0  1  2
##                                         
## Value         -2    -1     0     1     2
## Frequency    504  3136  5617  3048   490
## Proportion 0.039 0.245 0.439 0.238 0.038
## --------------------------------------------------------------------------------
## AcidIndex 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0       14    0.908    7.773    1.316        6        7 
##      .25      .50      .75      .90      .95 
##        7        8        8        9       10 
## 
## lowest :  4  5  6  7  8, highest: 13 14 15 16 17
##                                                                             
## Value          4     5     6     7     8     9    10    11    12    13    14
## Frequency      3    75  1197  4878  4142  1427   551   258   128    69    47
## Proportion 0.000 0.006 0.094 0.381 0.324 0.112 0.043 0.020 0.010 0.005 0.004
##                             
## Value         15    16    17
## Frequency      8     5     7
## Proportion 0.001 0.000 0.001
## --------------------------------------------------------------------------------
## STARS 
##        n  missing distinct     Info     Mean      Gmd 
##     9436     3359        4    0.899    2.042   0.9777 
##                                   
## Value          1     2     3     4
## Frequency   3042  3570  2212   612
## Proportion 0.322 0.378 0.234 0.065
## --------------------------------------------------------------------------------
dim(train)
## [1] 12795    15
#some variables have multiple NAs
#assess missing values
missing <- colSums(train %>% sapply(is.na))
missing_pct <- round(missing / nrow(train) * 100, 2)
na_table <- stack(sort(missing_pct, decreasing = TRUE))
na_table
values  ind
  26.2  STARS
  9.46  Sulphates
  5.33  TotalSulfurDioxide
   5.1  Alcohol
  5.06  FreeSulfurDioxide
  4.99  Chlorides
  4.81  ResidualSugar
  3.09  pH
     0  TARGET
     0  FixedAcidity
     0  VolatileAcidity
     0  CitricAcid
     0  Density
     0  LabelAppeal
     0  AcidIndex
plot_missing(train)

Visualization

# Some visuals: histograms and correlation

# Histograms to check how the variables are distributed
plot_num(train)

# Correlation plot -- all variables as numeric
cor <-cor(train, method="pearson", use = "pairwise.complete.obs")
corrplot(cor, method="circle")

#Scale variables to create boxplot chart with all variables in it

scaled.train <- as.data.table(scale(train[, c(
  'TARGET',
  'FixedAcidity',
  'VolatileAcidity',
  'CitricAcid',
  'ResidualSugar',
  'Chlorides',
  'FreeSulfurDioxide',
  'TotalSulfurDioxide',
  'pH',
  'Sulphates',
  'Density',
  'Alcohol',
  'AcidIndex',
  'STARS',
  'LabelAppeal'
  )]))

#Show boxplots

#boxplot(scaled.train)

melt.train <- melt(scaled.train)

scaled_boxplots <- ggplot(melt.train, aes(variable, value)) +
  geom_boxplot(width=.5, fill="navyblue", outlier.color="magenta", outlier.size = 1) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(aes(color="mean"), fun=mean, geom="point",
               size=2, show.legend=TRUE) +
  stat_summary(aes(color="median"), fun=median, geom="point",
               size=2, show.legend=TRUE) +
  coord_flip() +
  labs(x="", y="") +
  scale_color_manual(values=c("blue", "purple")) + 
  theme(legend.position="top")


scaled_boxplots

Data Preparation

As mentioned earlier, inspecting the data revealed quite a large number of NAs. I have worked with NAs in the past but in a much simpler way. Aside from the “hint” in the assignment paper not to always discard NAs, I knew that the missing STARS values would have to be recoded as zeros, which was straightforward since I have learned several ways to impute NAs in this class.

For the other variables that needed imputation, I read about several approaches, including the imputeTS, Hmisc, missForest, and mice packages. Although mice was a tad over my head, I read that it could be the best avenue. I have imputed using means and medians in the past, but after learning a bit more, I went with mice to try something new, using its tree-based 'cart' method (classification and regression trees).
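For comparison, the simpler median imputation mentioned above could look roughly like this. It is a baseline sketch only and is not what this analysis uses (the analysis uses mice); `impute_median` is a hypothetical helper and the toy values are illustrative.

```r
# Hypothetical helper: replace NAs in a numeric vector with the vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data with one missing value per column
toy <- data.frame(
  pH      = c(3.33, NA, 3.12, 2.24, 3.20),
  Alcohol = c(9.9, NA, 22, 6.2, 13.7)
)

# Apply the helper column by column
toy_imputed <- as.data.frame(lapply(toy, impute_median))
print(toy_imputed)
```

Because every missing entry in a column receives the same value, median imputation shrinks the variance of that column, which is one reason to look at model-based approaches such as mice.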

After this, it made sense to return to the initial idea of converting STARS and LabelAppeal to factors before running the models. Many available examples also converted “AcidIndex” to a factor, so I tried that as well. It also made sense to consider moving the variables with negative values onto a positive scale, since this exercise aims to run Poisson and Negative Binomial models. (Strictly, only the count response in these models must be non-negative; predictors may take any sign.) After looking at several techniques, I questioned whether I should log the data, given that most independent variables had a relatively normal distribution. Then, I realized that Poisson and Negative Binomial regressions can adjust for that once the data is transformed. The data transformation progressed as follows.

Step 1. Initial imputed data: the data as given, with NAs in STARS recoded as zero and mice used for the remaining variables.

Step 2. Shifted data, x + abs(min(x)) + C: starts from the imputed data, shifted so that negative numbers become positive; LabelAppeal was adjusted by + 2. The other variables were shifted by the absolute value of their minimum. At this point, all the variables still showed a relatively normal distribution.

Step 3. Log transformation: we took absolute values of the continuous variables and converted them to logs, log(x + 1), to see whether the distribution would take the shape of the negative binomial. It appeared that it did.

Step 4. Absolute values: I used absolute values without log transformation or scaling and tested how it would perform. At first glance, it looked like just using absolute values would have been enough.
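The difference between Steps 2 to 4 is easier to see on a tiny vector. The sketch below applies the three transformations to made-up values; the vector `x` and the variable names are assumptions for illustration only.

```r
x <- c(-3, -1, 0, 2, 5)

shifted <- x + abs(min(x)) + 1   # Step 2: shift so the minimum becomes 1
abs_val <- abs(x)                # Step 4: absolute values
abs_log <- log(abs(x) + 1)       # Step 3: log of absolute values, plus 1

print(data.frame(x, shifted, abs_val, abs_log = round(abs_log, 3)))

# The shift keeps the original ordering of the observations; abs() does not,
# because it folds negative values onto their positive counterparts
order_kept_shift <- identical(order(shifted), order(x))
order_kept_abs   <- identical(order(abs_val), order(x))
print(c(shift = order_kept_shift, abs = order_kept_abs))
```

The shift only moves the scale, so the shape of each distribution is unchanged, while the absolute value treats -3 and 3 as the same measurement. That is worth keeping in mind when comparing models fitted on the different versions of the data.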

# Impute STARS first: recode NAs as zero, then convert STARS and LabelAppeal to factors

train$STARS[is.na(train$STARS)] <- 0
train$STARS <-as.factor(train$STARS)
train$LabelAppeal <- as.factor(train$LabelAppeal)
#train$AcidIndex <- as.factor(train$AcidIndex)

# head(train)

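# Impute the other variables with mice, based on this:
#https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

# method = 'cart' imputes with classification and regression trees.
# No seed is set, so the imputed values will differ between runs.
train_mice <- mice::mice(train, m = 2, method='cart', maxit = 2, printFlag = FALSE)
# complete() returns the first of the m imputed datasets by default
train_imputed <- mice::complete(train_mice)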

density.plot <-densityplot(train_mice)
density.plot

head(train_imputed)
TARGET  FixedAcidity  VolatileAcidity  CitricAcid  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  Density    pH  Sulphates  Alcohol  LabelAppeal  AcidIndex  STARS
     3           3.2             1.16       -0.98           54.2     -0.567               -170                 268    0.993  3.33      -0.59      9.9            0          8      2
     3           4.5             0.16       -0.81           26.1     -0.425                 15                -327     1.03  3.38        0.7     13.5           -1          7      3
     5           7.1             2.64       -0.88           14.8      0.037                214                 142    0.995  3.12       0.48       22           -1          8      3
     3           5.7            0.385        0.04           18.8     -0.425                 22                 115    0.996  2.24       1.83      6.2           -1          6      1
     4             8             0.33       -1.26            9.4     -0.446               -167                 108    0.995  3.12       1.77     13.7            0          9      2
     0          11.3             0.32        0.59            2.2      0.556                -37                  15    0.999   3.2       1.29     15.4            0         11      0
st(train_imputed)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 7.076 6.318 -18.1 5.2 9.5 34.4
VolatileAcidity 12795 0.324 0.784 -2.79 0.13 0.64 3.68
CitricAcid 12795 0.308 0.862 -3.24 0.03 0.58 3.86
ResidualSugar 12795 5.44 33.737 -127.8 -1.8 15.85 141.15
Chlorides 12795 0.056 0.319 -1.171 -0.029 0.156 1.351
FreeSulfurDioxide 12795 31.198 148.59 -555 0 70 623
TotalSulfurDioxide 12795 121.383 231.196 -823 27.5 208 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.525 0.93 -3.13 0.28 0.86 4.24
Alcohol 12795 10.493 3.738 -4.7 9 12.4 26.5
LabelAppeal 12795
… -2 504 3.9%
… -1 3136 24.5%
… 0 5617 43.9%
… 1 3048 23.8%
… 2 490 3.8%
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
plot_num(train_imputed)

Further insights.

I looked at different techniques in R to adjust the negative values, and the following is the one I understood best. What made sense to me was to add the absolute value of each variable's minimum, plus 1, to every observation so that the scale starts at 1. I have done this manually in the past with models in Excel. I saw many analysts do this in several examples and started with this approach. There were also analysts who treated “AcidIndex” as a categorical variable. I did not go this route.

I learned there are other techniques, such as simply taking absolute values, or taking the log of those absolute values. Even without some of these transformations I was still getting normal-looking distributions for the imputed values, and from what I read, the negative binomial model (and possibly the Poisson model, if there is no overdispersion) can work with that.
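Whether overdispersion is a concern can be given a first, rough look by comparing the mean and variance of TARGET, since a Poisson variable has equal mean and variance. The sketch below rebuilds TARGET from the frequency table in the `describe(train)` output above, so it runs without the data file. It is only an unconditional check; a proper assessment would look at dispersion after fitting the model, and the variable names here are assumptions.

```r
# TARGET frequency table copied from the describe(train) output
target_values <- 0:8
target_freq   <- c(2734, 244, 1091, 2611, 3177, 2014, 765, 142, 17)

# Rebuild the TARGET vector from its frequencies
target <- rep(target_values, times = target_freq)

target_mean <- mean(target)
target_var  <- var(target)

# A ratio well above 1 hints at overdispersion relative to a Poisson distribution
dispersion_ratio <- target_var / target_mean
cat("mean:", round(target_mean, 3),
    " variance:", round(target_var, 3),
    " ratio:", round(dispersion_ratio, 3), "\n")
```

With a standard deviation of 1.926, the variance is about 3.71 against a mean of 3.029, so the raw counts are somewhat more spread out than a Poisson distribution with the same mean would be.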

train_imp_plusminconst <- train_imputed

# Shift the columns that contain negative values: x + abs(min(x)) + 1,
# so that each shifted column has a minimum of 1.
# The +1 is a choice; any non-negative constant (or none) would also keep the values non-negative.
# Also --> should these be logged despite having a normal distribution? I decided not to.
#https://www.listendata.com/2015/09/regression-transform-negative-values.html

train_imp_plusminconst$FixedAcidity <- train_imp_plusminconst$FixedAcidity + abs(min(train_imp_plusminconst$FixedAcidity))+1
train_imp_plusminconst$VolatileAcidity <- train_imp_plusminconst$VolatileAcidity + abs(min(train_imp_plusminconst$VolatileAcidity))+1
train_imp_plusminconst$CitricAcid <- train_imp_plusminconst$CitricAcid + abs(min(train_imp_plusminconst$CitricAcid))+1
train_imp_plusminconst$ResidualSugar <- train_imp_plusminconst$ResidualSugar + abs(min(train_imp_plusminconst$ResidualSugar))+1
train_imp_plusminconst$Chlorides <- train_imp_plusminconst$Chlorides + abs(min(train_imp_plusminconst$Chlorides))+1
train_imp_plusminconst$FreeSulfurDioxide <- train_imp_plusminconst$FreeSulfurDioxide + abs(min(train_imp_plusminconst$FreeSulfurDioxide))+1
train_imp_plusminconst$TotalSulfurDioxide <- train_imp_plusminconst$TotalSulfurDioxide +abs(min(train_imp_plusminconst$TotalSulfurDioxide ))+1
train_imp_plusminconst$Sulphates <-train_imp_plusminconst$Sulphates +abs(min(train_imp_plusminconst$Sulphates))+1
train_imp_plusminconst$Alcohol <-train_imp_plusminconst$Alcohol +abs(min(train_imp_plusminconst$Alcohol))+1

#this seems out of scale for the other variables, so decided to scale the other positive variables too
# Note: Density and pH have positive minima, so the new minimum is 2 * min(x) + 1, not 1
train_imp_plusminconst$Density <- train_imp_plusminconst$Density + abs(min(train_imp_plusminconst$Density))+1
train_imp_plusminconst$pH <- train_imp_plusminconst$pH + abs(min(train_imp_plusminconst$pH))+1

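# Transform LabelAppeal too. It is a factor here, so as.numeric() returns the
# level codes 1..5 (not the labels -2..2); adding abs(min) = 1 and subtracting 2
# yields 0..4.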
train_imp_plusminconst$LabelAppeal <- as.numeric(train_imp_plusminconst$LabelAppeal)
train_imp_plusminconst$LabelAppeal <-train_imp_plusminconst$LabelAppeal + abs(min(train_imp_plusminconst$LabelAppeal))-2

train_imp_plusminconst$LabelAppeal <- as.factor(train_imp_plusminconst$LabelAppeal)

st(train_imp_plusminconst) #run this to make sure each variable worked after
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 26.176 6.318 1 24.3 28.6 53.5
VolatileAcidity 12795 4.114 0.784 1 3.92 4.43 7.47
CitricAcid 12795 4.548 0.862 1 4.27 4.82 8.1
ResidualSugar 12795 134.24 33.737 1 127 144.65 269.95
Chlorides 12795 2.227 0.319 1 2.142 2.327 3.522
FreeSulfurDioxide 12795 587.198 148.59 1 556 626 1179
TotalSulfurDioxide 12795 945.383 231.196 1 851.5 1032 1881
Density 12795 2.882 0.027 2.776 2.876 2.889 2.987
pH 12795 4.687 0.681 1.96 4.43 4.95 7.61
Sulphates 12795 4.655 0.93 1 4.41 4.99 8.37
Alcohol 12795 16.193 3.738 1 14.7 18.1 32.2
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
summary(train_imp_plusminconst) #make sure nothing broke
##      TARGET       FixedAcidity   VolatileAcidity   CitricAcid   
##  Min.   :0.000   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:24.30   1st Qu.:3.920   1st Qu.:4.270  
##  Median :3.000   Median :26.00   Median :4.070   Median :4.550  
##  Mean   :3.029   Mean   :26.18   Mean   :4.114   Mean   :4.548  
##  3rd Qu.:4.000   3rd Qu.:28.60   3rd Qu.:4.430   3rd Qu.:4.820  
##  Max.   :8.000   Max.   :53.50   Max.   :7.470   Max.   :8.100  
##  ResidualSugar     Chlorides     FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :  1.0   Min.   :1.000   Min.   :   1.0    Min.   :   1.0    
##  1st Qu.:127.0   1st Qu.:2.142   1st Qu.: 556.0    1st Qu.: 851.5    
##  Median :132.7   Median :2.217   Median : 586.0    Median : 947.0    
##  Mean   :134.2   Mean   :2.227   Mean   : 587.2    Mean   : 945.4    
##  3rd Qu.:144.7   3rd Qu.:2.326   3rd Qu.: 626.0    3rd Qu.:1032.0    
##  Max.   :269.9   Max.   :3.522   Max.   :1179.0    Max.   :1881.0    
##     Density            pH          Sulphates        Alcohol      LabelAppeal
##  Min.   :2.776   Min.   :1.960   Min.   :1.000   Min.   : 1.00   0: 504     
##  1st Qu.:2.876   1st Qu.:4.430   1st Qu.:4.410   1st Qu.:14.70   1:3136     
##  Median :2.883   Median :4.680   Median :4.630   Median :16.10   2:5617     
##  Mean   :2.882   Mean   :4.687   Mean   :4.655   Mean   :16.19   3:3048     
##  3rd Qu.:2.889   3rd Qu.:4.950   3rd Qu.:4.990   3rd Qu.:18.10   4: 490     
##  Max.   :2.987   Max.   :7.610   Max.   :8.370   Max.   :32.20              
##    AcidIndex      STARS   
##  Min.   : 4.000   0:3359  
##  1st Qu.: 7.000   1:3042  
##  Median : 8.000   2:3570  
##  Mean   : 7.773   3:2212  
##  3rd Qu.: 8.000   4: 612  
##  Max.   :17.000
plot_num(train_imp_plusminconst)

# I liked the for-loop version better, but it was a bit over my head too. Review it later; stick to the method above for now.
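For later review, the loop version could be sketched as follows. This is an assumption about what such a version might look like, not the code used in this analysis: `shift_positive` is a hypothetical helper, the toy data only reuses the minimum, median, and maximum of three variables from the summary above, and, unlike the explicit code, it leaves columns without negative values (such as Density) untouched.

```r
# Hypothetical helper: x + abs(min(x)) + const.
# When the minimum of x is negative, the shifted minimum equals `const`.
shift_positive <- function(x, const = 1) {
  x + abs(min(x, na.rm = TRUE)) + const
}

# Toy data built from the min, median, and max reported in the summary output
toy <- data.frame(
  FixedAcidity = c(-18.1, 6.9, 34.4),
  Density      = c(0.8881, 0.9945, 1.0992),
  Alcohol      = c(-4.7, 10.4, 26.5)
)

shifted <- toy
for (col in names(shifted)) {
  # Only shift numeric columns that actually contain negative values
  if (is.numeric(shifted[[col]]) && min(shifted[[col]], na.rm = TRUE) < 0) {
    shifted[[col]] <- shift_positive(shifted[[col]])
  }
}
print(shifted)
```

The shifted maxima for FixedAcidity and Alcohol agree with the 53.5 and 32.2 shown in the summary of `train_imp_plusminconst`.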

######################################################################
# Transform the imputed data: absolute values first, then logs

train_scaling_subset2 <- train_imputed %>%
  dplyr::select(FixedAcidity,
                VolatileAcidity,
                CitricAcid,
                ResidualSugar,
                Chlorides,
                FreeSulfurDioxide,
                TotalSulfurDioxide,
                Sulphates,
                Alcohol)

train_absscaled_subset <- lapply(train_scaling_subset2,
                                 FUN = function(x) sapply(x, FUN = abs)) %>%
  as.data.frame()

# Join absolute value-scaled subset back to other continuous variables
train_abs <- train_imputed %>%
  dplyr::select(Density,
                AcidIndex,
                pH
                ) %>%
  cbind(train_absscaled_subset)

# Log-scale all continuous variables, adding constant of 1
train_abslog <- lapply(train_abs, FUN = function(x)
  sapply(x, FUN = function(x) log(x+1))) %>%
  as.data.frame()

st(train_abs)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
AcidIndex 12795 7.773 1.324 4 7 8 17
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
FixedAcidity 12795 8.063 4.996 0 5.6 9.8 34.4
VolatileAcidity 12795 0.641 0.556 0 0.25 0.91 3.68
CitricAcid 12795 0.686 0.606 0 0.28 0.97 3.86
ResidualSugar 12795 23.326 24.972 0 3.6 38.6 141.15
Chlorides 12795 0.223 0.234 0 0.046 0.368 1.351
FreeSulfurDioxide 12795 106.642 108.069 0 28 171 623
TotalSulfurDioxide 12795 204.015 162.976 0 99 262 1057
Sulphates 12795 0.845 0.652 0 0.43 1.09 4.24
Alcohol 12795 10.526 3.642 0 9 12.4 26.5
#bring in scaled LabelAppeal from the abs plus min data frame
train_abslog$LabelAppeal <- train_imp_plusminconst$LabelAppeal


# Map remaining variables to dataframe
#train_abslog$INDEX <- train_imputed$INDEX
train_abslog$TARGET <- train_imputed$TARGET
train_abslog$STARS <- train_imputed$STARS
train_abslog$STARS <- as.factor(train_abslog$STARS)

head(train_abslog)
Density  AcidIndex    pH  FixedAcidity  VolatileAcidity  CitricAcid  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  Sulphates  Alcohol  LabelAppeal  TARGET  STARS
   0.69        2.2  1.47          1.44             0.77       0.683           4.01      0.449               5.14                5.59      0.464     2.39            2       3      2
  0.707       2.08  1.48           1.7            0.148       0.593            3.3      0.354               2.77                5.79      0.531     2.67            1       3      3
  0.691        2.2  1.42          2.09             1.29       0.631           2.76     0.0363               5.37                4.96      0.392     3.14            1       5      3
  0.691       1.95  1.18           1.9            0.326      0.0392           2.99      0.354               3.14                4.75       1.04     1.97            1       3      1
   0.69        2.3  1.42           2.2            0.285       0.815           2.34      0.369               5.12                4.69       1.02     2.69            2       4      2
  0.693       2.48  1.44          2.51            0.278       0.464           1.16      0.442               3.64                2.77      0.829      2.8            2       0      0
st(train_abslog)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
Density 12795 0.69 0.013 0.636 0.687 0.693 0.742
AcidIndex 12795 2.161 0.14 1.609 2.079 2.197 2.89
pH 12795 1.423 0.171 0.392 1.374 1.497 1.964
FixedAcidity 12795 2.04 0.617 0 1.887 2.38 3.567
VolatileAcidity 12795 0.449 0.293 0 0.223 0.647 1.543
CitricAcid 12795 0.47 0.31 0 0.247 0.678 1.581
ResidualSugar 12795 2.597 1.17 0 1.526 3.679 4.957
Chlorides 12795 0.185 0.174 0 0.045 0.313 0.855
FreeSulfurDioxide 12795 4.14 1.115 0 3.367 5.147 6.436
TotalSulfurDioxide 12795 4.993 0.903 0 4.605 5.572 6.964
Sulphates 12795 0.562 0.304 0 0.358 0.737 1.656
Alcohol 12795 2.382 0.388 0 2.303 2.595 3.314
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
TARGET 12795 3.029 1.926 0 2 4 8
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
plot_num(train_abslog)
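As a side note on the log step, `log()` is already vectorized in R, so the nested `lapply()`/`sapply()` calls are not strictly needed; base R's `log1p(x)` computes log(1 + x) directly. The sketch below checks on toy values that both forms agree. The toy data frame and variable names are assumptions for illustration.

```r
toy <- data.frame(
  FixedAcidity = c(3.2, 4.5, 7.1),
  Chlorides    = c(0.567, 0.425, 0.037)
)

# Nested form, mirroring the code above: element-by-element log(x + 1)
nested <- as.data.frame(
  lapply(toy, function(col) sapply(col, function(v) log(v + 1)))
)

# Vectorized form: log1p() works on whole columns at once
vectorized <- as.data.frame(lapply(toy, log1p))

# Both approaches give the same result
same <- isTRUE(all.equal(nested, vectorized))
print(same)
```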

train_imp_abs <- train_imputed

# Only do it for columns with negative values: take the absolute value,
# with no shift and no log.
#https://www.listendata.com/2015/09/regression-transform-negative-values.html

train_imp_abs$FixedAcidity <- abs(train_imp_abs$FixedAcidity)
train_imp_abs$VolatileAcidity <- abs(train_imp_abs$VolatileAcidity)
train_imp_abs$CitricAcid <- abs(train_imp_abs$CitricAcid)
train_imp_abs$ResidualSugar <-abs(train_imp_abs$ResidualSugar)
train_imp_abs$Chlorides <-abs(train_imp_abs$Chlorides)
train_imp_abs$FreeSulfurDioxide <-abs(train_imp_abs$FreeSulfurDioxide)
train_imp_abs$TotalSulfurDioxide <-abs(train_imp_abs$TotalSulfurDioxide)
train_imp_abs$Sulphates <- abs(train_imp_abs$Sulphates)
train_imp_abs$Alcohol <-abs(train_imp_abs$Alcohol)

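# Transform LabelAppeal too. It is a factor here, so as.numeric() returns the
# level codes 1..5; abs() leaves them unchanged and the levels become 1..5.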
train_imp_abs$LabelAppeal <- as.numeric(train_imp_abs$LabelAppeal)
train_imp_abs$LabelAppeal <- abs(train_imp_abs$LabelAppeal) 
#train_imp_abs$LabelAppeal + abs(min(train_imp_abs$LabelAppeal))-2

train_imp_abs$LabelAppeal <- as.factor(train_imp_abs$LabelAppeal)

st(train_imp_abs) #run this to make sure each variable worked after
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 8.063 4.996 0 5.6 9.8 34.4
VolatileAcidity 12795 0.641 0.556 0 0.25 0.91 3.68
CitricAcid 12795 0.686 0.606 0 0.28 0.97 3.86
ResidualSugar 12795 23.326 24.972 0 3.6 38.6 141.15
Chlorides 12795 0.223 0.234 0 0.046 0.368 1.351
FreeSulfurDioxide 12795 106.642 108.069 0 28 171 623
TotalSulfurDioxide 12795 204.015 162.976 0 99 262 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.845 0.652 0 0.43 1.09 4.24
Alcohol 12795 10.526 3.642 0 9 12.4 26.5
LabelAppeal 12795
… 1 504 3.9%
… 2 3136 24.5%
… 3 5617 43.9%
… 4 3048 23.8%
… 5 490 3.8%
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
summary(train_imp_abs)#make sure nothing broke
##      TARGET       FixedAcidity    VolatileAcidity    CitricAcid    
##  Min.   :0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.: 5.600   1st Qu.:0.2500   1st Qu.:0.2800  
##  Median :3.000   Median : 7.000   Median :0.4100   Median :0.4400  
##  Mean   :3.029   Mean   : 8.063   Mean   :0.6411   Mean   :0.6863  
##  3rd Qu.:4.000   3rd Qu.: 9.800   3rd Qu.:0.9100   3rd Qu.:0.9700  
##  Max.   :8.000   Max.   :34.400   Max.   :3.6800   Max.   :3.8600  
##  ResidualSugar      Chlorides      FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :  0.00   Min.   :0.0000   Min.   :  0.0     Min.   :   0      
##  1st Qu.:  3.60   1st Qu.:0.0460   1st Qu.: 28.0     1st Qu.:  99      
##  Median : 12.90   Median :0.0980   Median : 56.0     Median : 154      
##  Mean   : 23.33   Mean   :0.2227   Mean   :106.6     Mean   : 204      
##  3rd Qu.: 38.60   3rd Qu.:0.3680   3rd Qu.:171.0     3rd Qu.: 262      
##  Max.   :141.15   Max.   :1.3510   Max.   :623.0     Max.   :1057      
##     Density             pH          Sulphates         Alcohol      LabelAppeal
##  Min.   :0.8881   Min.   :0.480   Min.   :0.0000   Min.   : 0.00   1: 504     
##  1st Qu.:0.9877   1st Qu.:2.950   1st Qu.:0.4300   1st Qu.: 9.00   2:3136     
##  Median :0.9945   Median :3.200   Median :0.5900   Median :10.40   3:5617     
##  Mean   :0.9942   Mean   :3.207   Mean   :0.8454   Mean   :10.53   4:3048     
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.:1.0900   3rd Qu.:12.40   5: 490     
##  Max.   :1.0992   Max.   :6.130   Max.   :4.2400   Max.   :26.50              
##    AcidIndex      STARS   
##  Min.   : 4.000   0:3359  
##  1st Qu.: 7.000   1:3042  
##  Median : 8.000   2:3570  
##  Mean   : 7.773   3:2212  
##  3rd Qu.: 8.000   4: 612  
##  Max.   :17.000
plot_num(train_imp_abs)

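# Start the Box-Cox dataset from the min-plus-constant data
train_abs_bc <- train_imp_plusminconst
# Convert the factors to their level codes (1-5) so they can be shifted and modeled as numeric
train_abs_bc$LabelAppeal <- as.numeric(train_abs_bc$LabelAppeal)
train_abs_bc$STARS <- as.numeric(train_abs_bc$STARS)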
st(train_abs_bc)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 26.176 6.318 1 24.3 28.6 53.5
VolatileAcidity 12795 4.114 0.784 1 3.92 4.43 7.47
CitricAcid 12795 4.548 0.862 1 4.27 4.82 8.1
ResidualSugar 12795 134.24 33.737 1 127 144.65 269.95
Chlorides 12795 2.227 0.319 1 2.142 2.327 3.522
FreeSulfurDioxide 12795 587.198 148.59 1 556 626 1179
TotalSulfurDioxide 12795 945.383 231.196 1 851.5 1032 1881
Density 12795 2.882 0.027 2.776 2.876 2.889 2.987
pH 12795 4.687 0.681 1.96 4.43 4.95 7.61
Sulphates 12795 4.655 0.93 1 4.41 4.99 8.37
Alcohol 12795 16.193 3.738 1 14.7 18.1 32.2
LabelAppeal 12795 2.991 0.891 1 2 4 5
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 12795 2.506 1.187 1 1 3 5
# Shift every column, including TARGET, up by 1 so the response is strictly
# positive, which boxcox() requires
bc_cols <- c("TARGET", "FixedAcidity", "VolatileAcidity", "CitricAcid",
             "ResidualSugar", "Chlorides", "FreeSulfurDioxide",
             "TotalSulfurDioxide", "Density", "pH", "Sulphates", "Alcohol",
             "LabelAppeal", "AcidIndex", "STARS")
train_abs_bc[, bc_cols] <- train_abs_bc[, bc_cols] + 1

b = boxcox(TARGET ~
                  FixedAcidity
                  +VolatileAcidity
                  +CitricAcid
                  +ResidualSugar
                  +Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  +Density
                  +pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS,
           data=train_abs_bc)

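# b is the list returned by boxcox() (MASS package): x holds the lambda grid,
# y the profile log-likelihood at each lambda
lambda <- b$x
lik <- b$y
bc <- cbind(lambda, lik)

# Sort by log-likelihood (descending) and keep the best lambda
hold <- bc[order(-lik), ]
bcVal <- hold[1, 1]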

# Raise the continuous predictors to the power bcVal (LabelAppeal and STARS are
# left as-is). boxcox() estimates lambda for the response, so applying it to the
# predictors is a modeling choice rather than part of the Box-Cox method itself.
bc_pred_cols <- c("FixedAcidity", "VolatileAcidity", "CitricAcid",
                  "ResidualSugar", "Chlorides", "FreeSulfurDioxide",
                  "TotalSulfurDioxide", "Density", "pH", "Sulphates", "Alcohol",
                  #"LabelAppeal",
                  "AcidIndex"
                  #"STARS"
                  )
train_abs_bc[, bc_pred_cols] <- train_abs_bc[, bc_pred_cols]^(bcVal)
# Before running the models, let's look at the datasets we will be working with.

st(train_imputed)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 7.076 6.318 -18.1 5.2 9.5 34.4
VolatileAcidity 12795 0.324 0.784 -2.79 0.13 0.64 3.68
CitricAcid 12795 0.308 0.862 -3.24 0.03 0.58 3.86
ResidualSugar 12795 5.44 33.737 -127.8 -1.8 15.85 141.15
Chlorides 12795 0.056 0.319 -1.171 -0.029 0.156 1.351
FreeSulfurDioxide 12795 31.198 148.59 -555 0 70 623
TotalSulfurDioxide 12795 121.383 231.196 -823 27.5 208 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.525 0.93 -3.13 0.28 0.86 4.24
Alcohol 12795 10.493 3.738 -4.7 9 12.4 26.5
LabelAppeal 12795
… -2 504 3.9%
… -1 3136 24.5%
… 0 5617 43.9%
… 1 3048 23.8%
… 2 490 3.8%
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
st(train_imp_plusminconst)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 26.176 6.318 1 24.3 28.6 53.5
VolatileAcidity 12795 4.114 0.784 1 3.92 4.43 7.47
CitricAcid 12795 4.548 0.862 1 4.27 4.82 8.1
ResidualSugar 12795 134.24 33.737 1 127 144.65 269.95
Chlorides 12795 2.227 0.319 1 2.142 2.327 3.522
FreeSulfurDioxide 12795 587.198 148.59 1 556 626 1179
TotalSulfurDioxide 12795 945.383 231.196 1 851.5 1032 1881
Density 12795 2.882 0.027 2.776 2.876 2.889 2.987
pH 12795 4.687 0.681 1.96 4.43 4.95 7.61
Sulphates 12795 4.655 0.93 1 4.41 4.99 8.37
Alcohol 12795 16.193 3.738 1 14.7 18.1 32.2
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
st(train_abslog)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
Density 12795 0.69 0.013 0.636 0.687 0.693 0.742
AcidIndex 12795 2.161 0.14 1.609 2.079 2.197 2.89
pH 12795 1.423 0.171 0.392 1.374 1.497 1.964
FixedAcidity 12795 2.04 0.617 0 1.887 2.38 3.567
VolatileAcidity 12795 0.449 0.293 0 0.223 0.647 1.543
CitricAcid 12795 0.47 0.31 0 0.247 0.678 1.581
ResidualSugar 12795 2.597 1.17 0 1.526 3.679 4.957
Chlorides 12795 0.185 0.174 0 0.045 0.313 0.855
FreeSulfurDioxide 12795 4.14 1.115 0 3.367 5.147 6.436
TotalSulfurDioxide 12795 4.993 0.903 0 4.605 5.572 6.964
Sulphates 12795 0.562 0.304 0 0.358 0.737 1.656
Alcohol 12795 2.382 0.388 0 2.303 2.595 3.314
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
TARGET 12795 3.029 1.926 0 2 4 8
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
st(train_imp_abs)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 8.063 4.996 0 5.6 9.8 34.4
VolatileAcidity 12795 0.641 0.556 0 0.25 0.91 3.68
CitricAcid 12795 0.686 0.606 0 0.28 0.97 3.86
ResidualSugar 12795 23.326 24.972 0 3.6 38.6 141.15
Chlorides 12795 0.223 0.234 0 0.046 0.368 1.351
FreeSulfurDioxide 12795 106.642 108.069 0 28 171 623
TotalSulfurDioxide 12795 204.015 162.976 0 99 262 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.845 0.652 0 0.43 1.09 4.24
Alcohol 12795 10.526 3.642 0 9 12.4 26.5
LabelAppeal 12795
… 1 504 3.9%
… 2 3136 24.5%
… 3 5617 43.9%
… 4 3048 23.8%
… 5 490 3.8%
AcidIndex 12795 7.773 1.324 4 7 8 17
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
st(train_abs_bc)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 4.029 1.926 1 3 5 9
FixedAcidity 12795 51.548 14.092 2.285 47.033 56.71 117.394
VolatileAcidity 12795 7.014 1.275 2.285 6.68 7.513 12.763
CitricAcid 12795 7.731 1.423 2.285 7.25 8.161 13.903
ResidualSugar 12795 349.423 102.119 2.285 324.806 378.871 793.976
Chlorides 12795 4.045 0.475 2.285 3.914 4.19 6.041
FreeSulfurDioxide 12795 2015.385 597.457 2.285 1874.293 2158.325 4586.011
TotalSulfurDioxide 12795 3550.688 1019.146 2.285 3112.798 3913.488 7999.852
Density 12795 5.037 0.041 4.873 5.027 5.046 5.2
pH 12795 7.952 1.132 3.645 7.513 8.378 13.015
Sulphates 12795 7.91 1.541 2.285 7.48 8.446 14.396
Alcohol 12795 29.843 7.646 2.285 26.633 33.642 65.024
LabelAppeal 12795 3.991 0.891 2 3 5 6
AcidIndex 12795 13.342 2.443 6.81 11.924 13.721 31.346
STARS 12795 3.506 1.187 2 2 4 6
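
As a sanity check on the Box-Cox step, the exponent can be recovered from the summary tables alone. The sketch below assumes that a predictor minimum of 1 in train_imp_plusminconst was shifted to 2 by the +1 step and then raised to bcVal, giving the 2.285 minimum above; the variable names are illustrative only.

```r
# Values copied from the summary tables above
min_before <- 2       # predictor minimum of 1, plus the +1 shift
min_after  <- 2.285   # predictor minimum reported by st(train_abs_bc)

# Solve min_before^lambda = min_after for lambda
lambda_hat <- log(min_after) / log(min_before)
print(round(lambda_hat, 2))

# Cross-check with AcidIndex: its minimum of 4, shifted to 5, should land
# close to the 6.81 reported in the table
acid_check <- (4 + 1)^lambda_hat
print(round(acid_check, 2))
```

Under these assumptions the exponent comes out at roughly 1.19, and the AcidIndex cross-check agrees with the table, which suggests the power was applied as intended.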

Building the models

Linear regression #1

The first model used all variables from the imputed data, with “AcidIndex” entered once as a factor and once as numeric. Somewhat surprisingly, the fit was reasonable, with an R-squared of about 0.54. As a numeric variable, “AcidIndex” is statistically significant; as a factor, none of its 14 levels was. We therefore kept “AcidIndex” numeric.

After removing the statistically insignificant variables, the best R-squared was 0.5405.

We checked the model with residual plots, which were consistent with a linear relationship.

Some of the coefficients indicated the following:

• Naturally, the more STARS a wine was given, the more cases one would expect to sell.
• AcidIndex had a negative coefficient: the more acidic a wine is, the fewer cases it is expected to sell.
• LabelAppeal was statistically significant, with positive coefficients that grow with the score. Relative to the lowest rating (-2), a rating of zero is associated with about 0.8 more cases sold, that is, roughly one more case.
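
One way to formalize the numeric-versus-factor comparison described above is a nested-model F test. The sketch below uses synthetic data, not the wine data, and the names `idx`, `fit_num`, and `fit_fac` are illustrative assumptions.

```r
set.seed(42)

# Synthetic stand-in for an integer-valued index and a response
idx <- sample(4:10, 500, replace = TRUE)
y <- 5 - 0.2 * idx + rnorm(500)

fit_num <- lm(y ~ idx)           # a single slope for the index
fit_fac <- lm(y ~ factor(idx))   # one coefficient per level beyond the baseline

# The numeric model is nested in the factor model, so anova() gives an
# F test of whether the extra level-specific terms improve the fit
cmp <- anova(fit_num, fit_fac)
print(cmp)
```

A large p-value in the second row would suggest that the single slope is sufficient, which is in the same spirit as keeping AcidIndex numeric.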

#start with linear regression
# available dataframes:

# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values
###############################################################################

# Use for checking if "AcidIndex" as numeric or factor makes any difference

###############################################################################

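# raw --> train_imputed
# Note: as.numeric() on a factor returns level codes, so AcidIndex 4-17
# appears as 1-14 in the summary below.
train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)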
st(train_imputed)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 7.076 6.318 -18.1 5.2 9.5 34.4
VolatileAcidity 12795 0.324 0.784 -2.79 0.13 0.64 3.68
CitricAcid 12795 0.308 0.862 -3.24 0.03 0.58 3.86
ResidualSugar 12795 5.44 33.737 -127.8 -1.8 15.85 141.15
Chlorides 12795 0.056 0.319 -1.171 -0.029 0.156 1.351
FreeSulfurDioxide 12795 31.198 148.59 -555 0 70 623
TotalSulfurDioxide 12795 121.383 231.196 -823 27.5 208 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.525 0.93 -3.13 0.28 0.86 4.24
Alcohol 12795 10.493 3.738 -4.7 9 12.4 26.5
LabelAppeal 12795
… -2 504 3.9%
… -1 3136 24.5%
… 0 5617 43.9%
… 1 3048 23.8%
… 2 490 3.8%
AcidIndex 12795 4.773 1.324 1 4 5 14
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
# # scaled through minplusconstant --> train_imp_plusminconst
train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
st(train_imp_plusminconst)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 26.176 6.318 1 24.3 28.6 53.5
VolatileAcidity 12795 4.114 0.784 1 3.92 4.43 7.47
CitricAcid 12795 4.548 0.862 1 4.27 4.82 8.1
ResidualSugar 12795 134.24 33.737 1 127 144.65 269.95
Chlorides 12795 2.227 0.319 1 2.142 2.327 3.522
FreeSulfurDioxide 12795 587.198 148.59 1 556 626 1179
TotalSulfurDioxide 12795 945.383 231.196 1 851.5 1032 1881
Density 12795 2.882 0.027 2.776 2.876 2.889 2.987
pH 12795 4.687 0.681 1.96 4.43 4.95 7.61
Sulphates 12795 4.655 0.93 1 4.41 4.99 8.37
Alcohol 12795 16.193 3.738 1 14.7 18.1 32.2
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
AcidIndex 12795 4.773 1.324 1 4 5 14
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
#
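# # scaled (not all) and logged (continuous variables) --> train_abslog
# Note: the right-hand side reads train_abs, not train_abslog, so the logged
# AcidIndex is overwritten with the unlogged 4-17 values shown in the summary below.
train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)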
st(train_abslog)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
Density 12795 0.69 0.013 0.636 0.687 0.693 0.742
AcidIndex 12795 7.773 1.324 4 7 8 17
pH 12795 1.423 0.171 0.392 1.374 1.497 1.964
FixedAcidity 12795 2.04 0.617 0 1.887 2.38 3.567
VolatileAcidity 12795 0.449 0.293 0 0.223 0.647 1.543
CitricAcid 12795 0.47 0.31 0 0.247 0.678 1.581
ResidualSugar 12795 2.597 1.17 0 1.526 3.679 4.957
Chlorides 12795 0.185 0.174 0 0.045 0.313 0.855
FreeSulfurDioxide 12795 4.14 1.115 0 3.367 5.147 6.436
TotalSulfurDioxide 12795 4.993 0.903 0 4.605 5.572 6.964
Sulphates 12795 0.562 0.304 0 0.358 0.737 1.656
Alcohol 12795 2.382 0.388 0 2.303 2.595 3.314
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
TARGET 12795 3.029 1.926 0 2 4 8
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
#
# # scaled through absolute values
train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
st(train_imp_abs)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 8.063 4.996 0 5.6 9.8 34.4
VolatileAcidity 12795 0.641 0.556 0 0.25 0.91 3.68
CitricAcid 12795 0.686 0.606 0 0.28 0.97 3.86
ResidualSugar 12795 23.326 24.972 0 3.6 38.6 141.15
Chlorides 12795 0.223 0.234 0 0.046 0.368 1.351
FreeSulfurDioxide 12795 106.642 108.069 0 28 171 623
TotalSulfurDioxide 12795 204.015 162.976 0 99 262 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.845 0.652 0 0.43 1.09 4.24
Alcohol 12795 10.526 3.642 0 9 12.4 26.5
LabelAppeal 12795
… 1 504 3.9%
… 2 3136 24.5%
… 3 5617 43.9%
… 4 3048 23.8%
… 5 490 3.8%
AcidIndex 12795 4.773 1.324 1 4 5 14
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
#Model 1 Linear Regression - 
linear_r1 <- lm(formula=TARGET ~
                   #FixedAcidity
                  +VolatileAcidity
                  #+CitricAcid
                  #+ResidualSugar
                  +Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  +Density
                  +pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, data=train_imputed)

summary(linear_r1)
## 
## Call:
## lm(formula = TARGET ~ +VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9533 -0.8596  0.0234  0.8432  6.1683 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.293e+00  4.433e-01   5.174 2.32e-07 ***
## VolatileAcidity    -9.466e-02  1.477e-02  -6.409 1.52e-10 ***
## Chlorides          -1.295e-01  3.630e-02  -3.569  0.00036 ***
## FreeSulfurDioxide   2.503e-04  7.782e-05   3.216  0.00130 ** 
## TotalSulfurDioxide  2.327e-04  5.003e-05   4.651 3.34e-06 ***
## Density            -8.135e-01  4.357e-01  -1.867  0.06188 .  
## pH                 -3.954e-02  1.698e-02  -2.329  0.01990 *  
## Sulphates          -3.428e-02  1.243e-02  -2.759  0.00581 ** 
## Alcohol             1.244e-02  3.100e-03   4.014 6.00e-05 ***
## LabelAppeal-1       3.600e-01  6.283e-02   5.730 1.03e-08 ***
## LabelAppeal0        8.268e-01  6.126e-02  13.495  < 2e-16 ***
## LabelAppeal1        1.291e+00  6.399e-02  20.169  < 2e-16 ***
## LabelAppeal2        1.882e+00  8.431e-02  22.317  < 2e-16 ***
## AcidIndex          -1.991e-01  8.933e-03 -22.282  < 2e-16 ***
## STARS1              1.363e+00  3.291e-02  41.415  < 2e-16 ***
## STARS2              2.398e+00  3.199e-02  74.979  < 2e-16 ***
## STARS3              2.963e+00  3.705e-02  79.971  < 2e-16 ***
## STARS4              3.647e+00  5.923e-02  61.574  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.305 on 12777 degrees of freedom
## Multiple R-squared:  0.5414, Adjusted R-squared:  0.5408 
## F-statistic: 887.3 on 17 and 12777 DF,  p-value: < 2.2e-16
# plot() on an lm object uses base graphics, so a ggplot2/cowplot theme cannot be added to it
plot(linear_r1)

vif(linear_r1)
##                        GVIF Df GVIF^(1/(2*Df))
## VolatileAcidity    1.006766  1        1.003377
## Chlorides          1.003662  1        1.001829
## FreeSulfurDioxide  1.003978  1        1.001987
## TotalSulfurDioxide 1.004660  1        1.002327
## Density            1.003654  1        1.001825
## pH                 1.005040  1        1.002517
## Sulphates          1.002371  1        1.001185
## Alcohol            1.007712  1        1.003849
## LabelAppeal        1.118877  4        1.014140
## AcidIndex          1.050180  1        1.024783
## STARS              1.167548  4        1.019552
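
For terms with more than one degree of freedom, such as the LabelAppeal and STARS factors, vif() reports a generalized VIF, and the last column rescales it so that terms with different Df can be compared. As a small worked check, the last column can be reproduced from the first two; the numbers are copied from the output above, and the vector names are illustrative.

```r
# GVIF and Df for the two factor terms, copied from the vif() output above
gvif <- c(LabelAppeal = 1.118877, STARS = 1.167548)
df   <- c(LabelAppeal = 4, STARS = 4)

# Rescale so terms with different degrees of freedom are comparable
adjusted <- gvif^(1 / (2 * df))
print(round(adjusted, 6))
```

All adjusted values here are close to 1, which points to very little collinearity among the predictors.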

Linear regression #2

For the second model I tried the other three transformations (scaled, scaled and logged, and absolute values), switching between them to pick the best one. All of them gave about the same R-squared as model #1. I also tried “AcidIndex” as a factor: all of its coefficients were negative, as expected, and levels 10-13 appeared borderline statistically significant. The R-squared was 0.5433. The fit printed below uses the Box-Cox dataset (train_abs_bc) with TARGET^(bcVal) as the response and reports an R-squared of 0.5371.
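
Since this model relies on a Box-Cox exponent, here is a minimal sketch of how such an exponent can be chosen and used. It assumes the MASS package is installed and runs on synthetic data rather than the wine data; `x`, `y`, and `best_lambda` are illustrative names.

```r
library(MASS)  # provides boxcox()

set.seed(1)

# Synthetic, strictly positive response whose square root is linear in x
x <- runif(200, 1, 10)
y <- (2 + 0.5 * x + rnorm(200, sd = 0.3))^2

# Profile the log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(y ~ x, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Keep the lambda with the highest profile log-likelihood
best_lambda <- bc$x[which.max(bc$y)]
print(best_lambda)

# Refit with the power-transformed response
fit <- lm(I(y^best_lambda) ~ x)
print(summary(fit)$r.squared)
```

Because the response is transformed, the R-squared of such a fit is measured on the transformed scale and is not directly comparable with one from an untransformed model.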

#Model#2 Linear Regression
# available dataframes:

# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values
###############################################################################

# Use for checking if "AcidIndex" as numeric or factor makes any difference

###############################################################################

#raw --> train_imputed
train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)
st(train_imputed)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 7.076 6.318 -18.1 5.2 9.5 34.4
VolatileAcidity 12795 0.324 0.784 -2.79 0.13 0.64 3.68
CitricAcid 12795 0.308 0.862 -3.24 0.03 0.58 3.86
ResidualSugar 12795 5.44 33.737 -127.8 -1.8 15.85 141.15
Chlorides 12795 0.056 0.319 -1.171 -0.029 0.156 1.351
FreeSulfurDioxide 12795 31.198 148.59 -555 0 70 623
TotalSulfurDioxide 12795 121.383 231.196 -823 27.5 208 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.525 0.93 -3.13 0.28 0.86 4.24
Alcohol 12795 10.493 3.738 -4.7 9 12.4 26.5
LabelAppeal 12795
… -2 504 3.9%
… -1 3136 24.5%
… 0 5617 43.9%
… 1 3048 23.8%
… 2 490 3.8%
AcidIndex 12795 4.773 1.324 1 4 5 14
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
# # scaled through minplusconstant --> train_imp_plusminconst
train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
st(train_imp_plusminconst)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 26.176 6.318 1 24.3 28.6 53.5
VolatileAcidity 12795 4.114 0.784 1 3.92 4.43 7.47
CitricAcid 12795 4.548 0.862 1 4.27 4.82 8.1
ResidualSugar 12795 134.24 33.737 1 127 144.65 269.95
Chlorides 12795 2.227 0.319 1 2.142 2.327 3.522
FreeSulfurDioxide 12795 587.198 148.59 1 556 626 1179
TotalSulfurDioxide 12795 945.383 231.196 1 851.5 1032 1881
Density 12795 2.882 0.027 2.776 2.876 2.889 2.987
pH 12795 4.687 0.681 1.96 4.43 4.95 7.61
Sulphates 12795 4.655 0.93 1 4.41 4.99 8.37
Alcohol 12795 16.193 3.738 1 14.7 18.1 32.2
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
AcidIndex 12795 4.773 1.324 1 4 5 14
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
#
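# # scaled (not all) and logged (continuous variables) --> train_abslog
# Note: the right-hand side reads train_abs, not train_abslog, so AcidIndex
# keeps the unlogged 4-17 values shown in the summary below.
train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)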
st(train_abslog)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
Density 12795 0.69 0.013 0.636 0.687 0.693 0.742
AcidIndex 12795 7.773 1.324 4 7 8 17
pH 12795 1.423 0.171 0.392 1.374 1.497 1.964
FixedAcidity 12795 2.04 0.617 0 1.887 2.38 3.567
VolatileAcidity 12795 0.449 0.293 0 0.223 0.647 1.543
CitricAcid 12795 0.47 0.31 0 0.247 0.678 1.581
ResidualSugar 12795 2.597 1.17 0 1.526 3.679 4.957
Chlorides 12795 0.185 0.174 0 0.045 0.313 0.855
FreeSulfurDioxide 12795 4.14 1.115 0 3.367 5.147 6.436
TotalSulfurDioxide 12795 4.993 0.903 0 4.605 5.572 6.964
Sulphates 12795 0.562 0.304 0 0.358 0.737 1.656
Alcohol 12795 2.382 0.388 0 2.303 2.595 3.314
LabelAppeal 12795
… 0 504 3.9%
… 1 3136 24.5%
… 2 5617 43.9%
… 3 3048 23.8%
… 4 490 3.8%
TARGET 12795 3.029 1.926 0 2 4 8
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
#
# scaled through absolute values
train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
st(train_imp_abs)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3.029 1.926 0 2 4 8
FixedAcidity 12795 8.063 4.996 0 5.6 9.8 34.4
VolatileAcidity 12795 0.641 0.556 0 0.25 0.91 3.68
CitricAcid 12795 0.686 0.606 0 0.28 0.97 3.86
ResidualSugar 12795 23.326 24.972 0 3.6 38.6 141.15
Chlorides 12795 0.223 0.234 0 0.046 0.368 1.351
FreeSulfurDioxide 12795 106.642 108.069 0 28 171 623
TotalSulfurDioxide 12795 204.015 162.976 0 99 262 1057
Density 12795 0.994 0.027 0.888 0.988 1.001 1.099
pH 12795 3.207 0.681 0.48 2.95 3.47 6.13
Sulphates 12795 0.845 0.652 0 0.43 1.09 4.24
Alcohol 12795 10.526 3.642 0 9 12.4 26.5
LabelAppeal 12795
… 1 504 3.9%
… 2 3136 24.5%
… 3 5617 43.9%
… 4 3048 23.8%
… 5 490 3.8%
AcidIndex 12795 4.773 1.324 1 4 5 14
STARS 12795
… 0 3359 26.3%
… 1 3042 23.8%
… 2 3570 27.9%
… 3 2212 17.3%
… 4 612 4.8%
#Model 2 Linear Regression - Box-Cox data, response TARGET^(bcVal)
linear_r2 <- lm(formula=TARGET^(bcVal) ~
                  #FixedAcidity
                  +VolatileAcidity
                  +CitricAcid
                  #+ResidualSugar
                  #+Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, data=train_abs_bc)


summary(linear_r2)
## 
## Call:
## lm(formula = TARGET^(bcVal) ~ +VolatileAcidity + CitricAcid + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abs_bc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8554 -1.4118  0.0652  1.3512  9.7439 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.815e+00  2.163e+00   1.764 0.077774 .  
## VolatileAcidity    -9.129e-02  1.377e-02  -6.628 3.55e-11 ***
## CitricAcid          1.873e-02  1.235e-02   1.517 0.129183    
## FreeSulfurDioxide   1.076e-04  2.935e-05   3.668 0.000246 ***
## TotalSulfurDioxide  7.433e-05  1.722e-05   4.317 1.60e-05 ***
## Density            -8.526e-01  4.273e-01  -1.995 0.046042 *  
## Sulphates          -2.797e-02  1.138e-02  -2.459 0.013952 *  
## Alcohol             9.638e-03  2.297e-03   4.196 2.74e-05 ***
## LabelAppeal         7.311e-01  2.044e-02  35.771  < 2e-16 ***
## AcidIndex          -1.654e-01  7.340e-03 -22.529  < 2e-16 ***
## STARS               1.465e+00  1.563e-02  93.706  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.981 on 12784 degrees of freedom
## Multiple R-squared:  0.5371, Adjusted R-squared:  0.5368 
## F-statistic:  1484 on 10 and 12784 DF,  p-value: < 2.2e-16
plot(linear_r2)

vif(linear_r2)
##    VolatileAcidity         CitricAcid  FreeSulfurDioxide TotalSulfurDioxide 
##           1.006155           1.006099           1.002760           1.004454 
##            Density          Sulphates            Alcohol        LabelAppeal 
##           1.002896           1.002084           1.006064           1.081947 
##          AcidIndex              STARS 
##           1.048475           1.122253

Poisson Model #1

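For the first Poisson model I used the absolute-value transformation of the independent variables, based on the distributions examined earlier. Several indicators suggested a reasonably adequate model. The estimated dispersion was about 0.88, below 1, so the overdispersion test found no evidence of overdispersion (p-value = 1), although the two-sided DHARMa test indicates that the dispersion differs significantly from 1, that is, mild underdispersion. I then checked the simulated residuals, and the line in the residual plot was flat. Finally, I removed the variables with high p-values, but the fit did not change much.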
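
To make the dispersion figure concrete, the sketch below computes a common dispersion estimate, the Pearson chi-square divided by the residual degrees of freedom, for a Poisson fit on synthetic data. It is not necessarily the exact statistic that dispersiontest() reports, and the data and names are illustrative.

```r
set.seed(7)

# Synthetic counts that really are Poisson, so dispersion should be near 1
x <- rnorm(300)
y <- rpois(300, lambda = exp(0.5 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

# Pearson chi-square statistic divided by the residual degrees of freedom
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

Values well above 1 indicate overdispersion and values well below 1 indicate underdispersion; in either case the Poisson standard errors are not quite right.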

# Poisson #1

# available dataframes:

# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values
###############################################################################

# Use for checking if "AcidIndex" as numeric or factor makes any difference

###############################################################################

# raw --> train_imputed
# train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
# train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)
# st(train_imputed)
# 
# # # scaled through minplusconstant --> train_imp_plusminconst
# train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
# train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
# st(train_imp_plusminconst)
# #
# # # scaled (not all) and logged (continuous variables) --> train_abslog
# train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
# train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)
# st(train_abslog)
# #
# # # scaled through absolute values
# train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
# train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
# st(train_imp_abs)

#Model 1 Poisson Regression - 
poisson_1 <- glm(formula=TARGET ~
                  # FixedAcidity
                  +VolatileAcidity
                  #+CitricAcid
                  #+ResidualSugar
                  #+Chlorides
                  #+FreeSulfurDioxide
                  +TotalSulfurDioxide
                  #+Density
                  #+pH
                  #+Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 family = poisson,
                 data=train_imp_abs)

summary(poisson_1)
## 
## Call:
## glm(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
##     data = train_imp_abs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2543  -0.6402  -0.0075   0.4515   3.7857  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.538e-01  4.797e-02   3.206  0.00135 ** 
## VolatileAcidity    -3.705e-02  9.396e-03  -3.944 8.03e-05 ***
## TotalSulfurDioxide  8.467e-05  3.116e-05   2.718  0.00658 ** 
## Alcohol             3.983e-03  1.403e-03   2.839  0.00453 ** 
## LabelAppeal2        2.357e-01  3.798e-02   6.205 5.45e-10 ***
## LabelAppeal3        4.254e-01  3.705e-02  11.480  < 2e-16 ***
## LabelAppeal4        5.582e-01  3.769e-02  14.813  < 2e-16 ***
## LabelAppeal5        6.946e-01  4.243e-02  16.372  < 2e-16 ***
## AcidIndex          -8.031e-02  4.497e-03 -17.861  < 2e-16 ***
## STARS1              7.702e-01  1.953e-02  39.439  < 2e-16 ***
## STARS2              1.089e+00  1.823e-02  59.749  < 2e-16 ***
## STARS3              1.210e+00  1.918e-02  63.090  < 2e-16 ***
## STARS4              1.330e+00  2.428e-02  54.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 13669  on 12782  degrees of freedom
## AIC: 45637
## 
## Number of Fisher Scoring iterations: 6
dispersiontest(poisson_1)
## 
##  Overdispersion test
## 
## data:  poisson_1
## z = -8.9098, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.8848692
sim_p1 <- simulateResiduals(poisson_1, refit = TRUE)
testDispersion(sim_p1)

## 
##  DHARMa nonparametric dispersion test via mean deviance residual fitted
##  vs. simulated-refitted
## 
## data:  simulationOutput
## dispersion = 0.88397, p-value < 2.2e-16
## alternative hypothesis: two.sided
plot(sim_p1)

plot(poisson_1)

knitr::kable(vif(poisson_1), "html")
                       GVIF Df GVIF^(1/(2*Df))
VolatileAcidity    1.004014  1        1.002005
TotalSulfurDioxide 1.003350  1        1.001673
Alcohol            1.010637  1        1.005304
LabelAppeal        1.133631  4        1.015802
AcidIndex          1.025613  1        1.012726
STARS              1.165661  4        1.019346
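
Poisson coefficients are on the log scale, so exponentiating them gives multiplicative effects on the expected number of cases. The worked example below uses three coefficients copied from summary(poisson_1) above; the vector names are illustrative.

```r
# Coefficients copied from summary(poisson_1)
coefs <- c(AcidIndex = -8.031e-02, STARS4 = 1.330, LabelAppeal5 = 6.946e-01)

# exp(coefficient) is the rate ratio for a one-unit increase (numeric term)
# or for that level versus the baseline level (factor term)
rate_ratios <- exp(coefs)
print(round(rate_ratios, 3))
```

Read this way, each additional AcidIndex point multiplies the expected count by about 0.92, while a four-star wine is expected to sell roughly 3.8 times as many cases as a wine in the STARS baseline level, holding the other variables fixed.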

Poisson Model #2

I tested the remaining transformed datasets on this model, and the results were not much different. For this one, I kept the absolute + min + constant (+1) transformed dataset (train_abs_bc) for x. Note that TARGET in this dataset is also shifted up by 1 (range 1-9), so the deviance, the AIC, and the dispersion estimate (about 0.57 here versus 0.88 for the first model) are not directly comparable with Poisson model #1. The overdispersion test again shows no evidence of overdispersion. As in the other models, the coefficients behave similarly; AcidIndex, for example, has a negative coefficient for expected cases sold.
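
Because TARGET in train_abs_bc is shifted up by 1, it helps to see what such a shift does to a Poisson fit. The sketch below uses synthetic counts, not the wine data, and the helper `pearson_dispersion` is a hypothetical name introduced for illustration.

```r
set.seed(11)

# Synthetic Poisson counts with a mean of about 3
x <- rnorm(500)
y <- rpois(500, lambda = exp(1 + 0.3 * x))

# Pearson chi-square divided by residual degrees of freedom
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

fit_raw   <- glm(y ~ x, family = poisson)          # original counts
fit_shift <- glm(I(y + 1) ~ x, family = poisson)   # counts shifted up by 1

disp_raw   <- pearson_dispersion(fit_raw)
disp_shift <- pearson_dispersion(fit_shift)
print(c(raw = disp_raw, shifted = disp_shift))
```

Adding a constant raises the mean without changing the variance, so the shifted fit looks more underdispersed than the original. That is one plausible reason the dispersion estimate drops when the shifted response is used.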

#Model 2 Poisson Regression - 
poisson_2 <- glm(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  +Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  +pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 family = poisson,
                 data=train_abs_bc)

summary(poisson_2)
## 
## Call:
## glm(formula = TARGET ~ +VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + pH + Sulphates + Alcohol + LabelAppeal + 
##     AcidIndex + STARS, family = poisson, data = train_abs_bc)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.35957  -0.60161   0.04921   0.48184   2.80743  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.502e-01  7.581e-02   9.896  < 2e-16 ***
## VolatileAcidity    -1.558e-02  3.481e-03  -4.475 7.65e-06 ***
## Chlorides          -2.337e-02  9.321e-03  -2.508  0.01215 *  
## FreeSulfurDioxide   2.057e-05  7.357e-06   2.795  0.00518 ** 
## TotalSulfurDioxide  1.395e-05  4.337e-06   3.217  0.00130 ** 
## pH                 -7.838e-03  3.909e-03  -2.005  0.04495 *  
## Sulphates          -5.375e-03  2.868e-03  -1.874  0.06094 .  
## Alcohol             9.068e-04  5.787e-04   1.567  0.11710    
## LabelAppeal         1.019e-01  5.225e-03  19.507  < 2e-16 ***
## AcidIndex          -3.397e-02  2.056e-03 -16.524  < 2e-16 ***
## STARS               2.343e-01  3.900e-03  60.062  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 13818  on 12794  degrees of freedom
## Residual deviance:  7692  on 12784  degrees of freedom
## AIC: 47576
## 
## Number of Fisher Scoring iterations: 5
dispersiontest(poisson_2)
## 
##  Overdispersion test
## 
## data:  poisson_2
## z = -60.603, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.5650852
sim_p2 <- simulateResiduals(poisson_2, refit = TRUE)
testDispersion(sim_p2)
## 
##  DHARMa nonparametric dispersion test via mean deviance residual fitted
##  vs. simulated-refitted
## 
## data:  simulationOutput
## dispersion = 0.56014, p-value < 2.2e-16
## alternative hypothesis: two.sided
plot(sim_p2)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

plot (poisson_2)

knitr::kable(vif(poisson_2),"html")
Variable            VIF
VolatileAcidity     1.005062
Chlorides           1.002222
FreeSulfurDioxide   1.002172
TotalSulfurDioxide  1.002419
pH                  1.004460
Sulphates           1.001323
Alcohol             1.008729
LabelAppeal         1.108977
AcidIndex           1.036240
STARS               1.144527
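
The Poisson coefficients reported for this model are on the log scale, so exponentiating them gives multiplicative effects on the expected number of cases. A minimal sketch using three estimates copied from the `summary(poisson_2)` output above (the object names `beta`, `rate_ratio`, and `pct_change` are illustrative):

```r
# Estimates copied from summary(poisson_2) above (log scale)
beta <- c(LabelAppeal = 1.019e-01, AcidIndex = -3.397e-02, STARS = 2.343e-01)

# Exponentiate to get multiplicative effects on the expected count
rate_ratio <- exp(beta)

# Percentage change in expected cases per one-unit increase
pct_change <- 100 * (rate_ratio - 1)

round(cbind(beta, rate_ratio, pct_change), 3)
```

Read this way, each additional STARS point multiplies expected cases by roughly 1.26, while each additional AcidIndex point lowers them by about 3%, holding the other predictors fixed.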

Negative Binomial Model #1 and #2

For these models, I did not see much difference compared to the Poisson models. The estimated theta is very large (about 109,000, with the iteration limit reached while fitting it), which means the negative binomial fit has essentially collapsed to a Poisson fit. At this point, I considered a zero-inflated model because TARGET contains a fair number of zeros, and several predictors contain zeros as well (from NAs imputed as zeros). I used the log-transformed and the absolute-value datasets for the x variables; again, there were no significant differences between the two. The zero-inflated model did not change much either, except that in its zero-inflation component the STARS coefficients for 3 and 4 stars were not significant. This result made me question the previous models, and it appears that deviance begins to increase after a certain point.

#Model 1 Negative Binomial Model - 
nb_1 <- glm.nb(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  #+Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 data=train_abs_bc
               #link=log
               )

summary(nb_1)
## 
## Call:
## glm.nb(formula = TARGET ~ +VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + LabelAppeal + 
##     AcidIndex + STARS, data = train_abs_bc, init.theta = 109478.1766, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.36553  -0.59808   0.04932   0.48309   2.80871  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.901e-01  5.671e-02  10.407  < 2e-16 ***
## VolatileAcidity    -1.572e-02  3.481e-03  -4.517 6.27e-06 ***
## FreeSulfurDioxide   2.087e-05  7.356e-06   2.838  0.00454 ** 
## TotalSulfurDioxide  1.401e-05  4.337e-06   3.231  0.00123 ** 
## Sulphates          -5.352e-03  2.868e-03  -1.866  0.06203 .  
## Alcohol             9.515e-04  5.784e-04   1.645  0.09996 .  
## LabelAppeal         1.016e-01  5.224e-03  19.451  < 2e-16 ***
## AcidIndex          -3.381e-02  2.051e-03 -16.482  < 2e-16 ***
## STARS               2.346e-01  3.898e-03  60.180  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(109478.2) family taken to be 1)
## 
##     Null deviance: 13817.9  on 12794  degrees of freedom
## Residual deviance:  7701.9  on 12786  degrees of freedom
## AIC: 47584
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  109478 
##           Std. Err.:  119125 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -47563.93
plot (nb_1)

knitr::kable(vif(nb_1),"html")
Variable            VIF
VolatileAcidity     1.004856
FreeSulfurDioxide   1.001905
TotalSulfurDioxide  1.002394
Sulphates           1.001305
Alcohol             1.008101
LabelAppeal         1.108525
AcidIndex           1.032101
STARS               1.143609
#Model 2 Negative Binomial Model --
nb_2 <- glm.nb(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  #+Chlorides
                  #+FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 data=train_abs_bc 
               #link=log
               )

summary(nb_2)
## 
## Call:
## glm.nb(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide + 
##     Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abs_bc, 
##     init.theta = 109393.0652, link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.35227  -0.59700   0.05004   0.48066   2.80722  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         6.342e-01  5.454e-02  11.629  < 2e-16 ***
## VolatileAcidity    -1.576e-02  3.481e-03  -4.528 5.95e-06 ***
## TotalSulfurDioxide  1.412e-05  4.337e-06   3.255  0.00113 ** 
## Sulphates          -5.287e-03  2.868e-03  -1.844  0.06524 .  
## Alcohol             9.209e-04  5.784e-04   1.592  0.11136    
## LabelAppeal         1.019e-01  5.224e-03  19.500  < 2e-16 ***
## AcidIndex          -3.401e-02  2.050e-03 -16.587  < 2e-16 ***
## STARS               2.346e-01  3.898e-03  60.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(109393.1) family taken to be 1)
## 
##     Null deviance: 13818  on 12794  degrees of freedom
## Residual deviance:  7710  on 12787  degrees of freedom
## AIC: 47590
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  109393 
##           Std. Err.:  119037 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -47571.97
plot (nb_2)

knitr::kable(vif(nb_2),"html")
Variable            VIF
VolatileAcidity     1.004839
TotalSulfurDioxide  1.002319
Sulphates           1.001239
Alcohol             1.007742
LabelAppeal         1.108272
AcidIndex           1.031013
STARS               1.143736
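
The very large theta in both negative binomial fits explains why they look so much like the Poisson fits. With the variance function mu + mu^2 / theta used by `glm.nb()`, the extra term vanishes as theta grows. A small sketch using the theta printed for `nb_1`; the means in `mu` are assumed, illustrative values rather than fitted values from the model.

```r
# theta copied from the nb_1 output above
theta <- 109478.1766

# Illustrative expected counts (assumed values, not fitted values)
mu <- c(1, 3, 8)

# Negative binomial variance vs. Poisson variance at the same mean
nb_var <- mu + mu^2 / theta
poisson_var <- mu

data.frame(mu, poisson_var, nb_var, ratio = nb_var / poisson_var)
```

At these means the two variances differ by less than one part in ten thousand, so the negative binomial adds essentially nothing over the Poisson for this data.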

Negative Binomial Model #3

#Model 3 Zero-Inflated Model (many NAs were imputed as zeros, and y has quite a few zeros as well)
#Note: zeroinfl() is called without a dist argument, so the count component is Poisson (see the summary output)
nb_3 <- zeroinfl(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  #+Chlorides
                  #+FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 data=train_abslog 
               #link=log
               )

summary(nb_3)
## 
## Call:
## zeroinfl(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abslog)
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.281455 -0.427760  0.004961  0.383624  5.883963 
## 
## Count model coefficients (poisson with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.460517   0.074704   6.165 7.07e-10 ***
## VolatileAcidity    -0.028028   0.017982  -1.559  0.11909    
## TotalSulfurDioxide -0.001421   0.006141  -0.231  0.81694    
## Sulphates           0.011055   0.017086   0.647  0.51760    
## Alcohol             0.060271   0.013780   4.374 1.22e-05 ***
## LabelAppeal1        0.437858   0.041171  10.635  < 2e-16 ***
## LabelAppeal2        0.726474   0.040239  18.054  < 2e-16 ***
## LabelAppeal3        0.916378   0.040909  22.400  < 2e-16 ***
## LabelAppeal4        1.073404   0.045439  23.623  < 2e-16 ***
## AcidIndex          -0.020142   0.004822  -4.177 2.95e-05 ***
## STARS1              0.061189   0.021127   2.896  0.00378 ** 
## STARS2              0.183247   0.019726   9.289  < 2e-16 ***
## STARS3              0.281102   0.020657  13.608  < 2e-16 ***
## STARS4              0.380375   0.025581  14.869  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.72883    0.48751  -9.700  < 2e-16 ***
## VolatileAcidity      0.44829    0.11488   3.902 9.53e-05 ***
## TotalSulfurDioxide  -0.27086    0.03554  -7.622 2.50e-14 ***
## Sulphates            0.44717    0.11119   4.022 5.78e-05 ***
## Alcohol              0.23974    0.09008   2.661  0.00778 ** 
## LabelAppeal1         1.45212    0.31942   4.546 5.47e-06 ***
## LabelAppeal2         2.19766    0.31662   6.941 3.89e-12 ***
## LabelAppeal3         2.90407    0.32204   9.018  < 2e-16 ***
## LabelAppeal4         3.33806    0.37494   8.903  < 2e-16 ***
## AcidIndex            0.41565    0.02565  16.202  < 2e-16 ***
## STARS1              -2.09320    0.07659 -27.330  < 2e-16 ***
## STARS2              -5.71200    0.32653 -17.493  < 2e-16 ***
## STARS3             -20.23807  341.00123  -0.059  0.95267    
## STARS4             -20.37694  644.74712  -0.032  0.97479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 33 
## Log-likelihood: -2.035e+04 on 28 Df
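
The zero-inflation coefficients are on the logit scale, so the inverse logit turns them into the implied probability of a structural zero. The sketch below uses the estimates printed above and sets every other term to zero or its reference level, which is an illustrative simplification and not a realistic wine. It also assumes that STARS = 0 is the reference level, since only STARS1 to STARS4 appear in the output.

```r
# Zero-inflation estimates copied from summary(nb_3) above (logit scale)
intercept <- -4.72883
stars <- c(STARS0 = 0, STARS1 = -2.09320, STARS2 = -5.71200,
           STARS3 = -20.23807, STARS4 = -20.37694)

# Linear predictor with all other terms at zero / reference level
# (assumption made purely for illustration)
eta <- intercept + stars

# Inverse logit gives the implied probability of a structural zero
p_zero <- plogis(eta)
signif(p_zero, 3)
```

The implied probabilities for 3 and 4 stars are numerically zero. Together with the very large standard errors (about 341 and 645), a plausible reading is that almost no 3 or 4 star wines fall in the structural-zero group, so those two coefficients are poorly identified rather than unimportant.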

Model Selection:

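Based on the information above, I selected Poisson model #2. Its overdispersion test returned a p-value of 1 with an estimated dispersion of about 0.57, so there is no evidence of overdispersion. Its AIC (47,576) looks high in absolute terms, but AIC is only meaningful relative to other models fitted to the same data; here it is slightly lower than the AICs of the two negative binomial models (47,584 and 47,590).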
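
AIC is easiest to read as a difference from the best model in a set fitted to the same data. A short sketch with the AIC values printed in the summaries above for the three models fitted to `train_abs_bc`; the zero-inflated model is left out because it was fitted to a different transformed dataset (`train_abslog`). The names `aic` and `delta_aic` are illustrative.

```r
# AIC values copied from the model summaries above
aic <- c(poisson_2 = 47576, nb_1 = 47584, nb_2 = 47590)

# Difference from the lowest AIC in the set
delta_aic <- aic - min(aic)

sort(delta_aic)
```

The differences are small relative to the AIC values themselves, which matches the observation that the negative binomial models behave almost identically to the Poisson model.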

#Prepare Evaluation dataset

eval$STARS[is.na(eval$STARS)] <- 0
eval$STARS <-as.factor(eval$STARS)
eval$LabelAppeal <- as.factor(eval$LabelAppeal)
st(eval)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 0
… No 0 NaN%
… Yes 0 NaN%
FixedAcidity 3335 6.864 6.318 -18.2 5.2 9 33.5
VolatileAcidity 3335 0.31 0.807 -2.83 0.08 0.63 3.61
CitricAcid 3335 0.312 0.871 -3.12 0 0.605 3.76
ResidualSugar 3167 5.319 34.371 -128.3 -2.6 17.2 145.4
Chlorides 3197 0.061 0.314 -1.15 0.016 0.171 1.263
FreeSulfurDioxide 3183 34.947 149.633 -563 3 79.25 617
TotalSulfurDioxide 3178 123.41 225.8 -769 27.25 210 1004
Density 3335 0.995 0.026 0.89 0.988 1.001 1.1
pH 3231 3.237 0.676 0.6 2.98 3.49 6.21
Sulphates 3025 0.535 0.905 -3.07 0.33 0.82 4.18
Alcohol 3150 10.584 3.759 -4.2 9 12.5 25.6
LabelAppeal 3335
… -2 114 3.4%
… -1 810 24.3%
… 0 1470 44.1%
… 1 799 24%
… 2 142 4.3%
AcidIndex 3335 7.748 1.315 5 7 8 17
STARS 3335
… 0 841 25.2%
… 1 828 24.8%
… 2 902 27%
… 3 600 18%
… 4 164 4.9%
eval_abs <-eval

eval_abs$FixedAcidity <- abs(eval_abs$FixedAcidity)
eval_abs$VolatileAcidity <- abs(eval_abs$VolatileAcidity)
eval_abs$CitricAcid <- abs(eval_abs$CitricAcid)
eval_abs$ResidualSugar <-abs(eval_abs$ResidualSugar)
eval_abs$Chlorides <-abs(eval_abs$Chlorides)
eval_abs$FreeSulfurDioxide <-abs(eval_abs$FreeSulfurDioxide)
eval_abs$TotalSulfurDioxide <-abs(eval_abs$TotalSulfurDioxide)
eval_abs$Sulphates <- abs(eval_abs$Sulphates)
eval_abs$Alcohol <-abs(eval_abs$Alcohol)
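
The column-by-column `abs()` calls above could also be written as one assignment over a vector of column names, which makes it harder to miss a column when the same transform is applied to both the training and evaluation data. A minimal sketch on a made-up toy data frame; `toy`, `toy_abs`, and `abs_cols` are hypothetical names, and the values are not from the wine data.

```r
# Toy data frame standing in for eval (values are made up)
toy <- data.frame(FixedAcidity = c(5.4, -12.4, NA),
                  Alcohol      = c(12.3, -4.2, 9.0),
                  AcidIndex    = c(6, 6, 8))

# Columns to transform (a subset here; extend to match the columns above)
abs_cols <- c("FixedAcidity", "Alcohol")

# Apply abs() to the selected columns only; NAs stay NA
toy_abs <- toy
toy_abs[abs_cols] <- lapply(toy_abs[abs_cols], abs)

toy_abs
```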

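#transform Label Appeal too.
#Note: LabelAppeal is a factor at this point, so as.numeric() returns the
#integer level codes (1 to 5), not the original labels (-2 to 2). abs() then
#changes nothing, and LabelAppeal ends up coded 1 to 5 (see the summary output).
eval_abs$LabelAppeal <- as.numeric(eval_abs$LabelAppeal)
eval_abs$LabelAppeal <- abs(eval_abs$LabelAppeal) 
#eval_abs$LabelAppeal + abs(min(eval_abs$LabelAppeal))-2

eval_abs$LabelAppeal <- as.factor(eval_abs$LabelAppeal)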
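
The LabelAppeal conversion above passes a factor to `as.numeric()`, which is a common source of surprises in R. The sketch below shows the difference between level codes and the original values on a small factor with the same labels; `f`, `codes`, and `values` are illustrative names.

```r
# Factor with the same labels LabelAppeal has in eval
f <- factor(c(-2, -1, 0, 1, 2, 0))

# Level codes: positions in levels(f), not the labels themselves
codes <- as.numeric(f)

# Original numeric values: convert through character first
values <- as.numeric(as.character(f))

codes
values
```

Whichever coding is used, the evaluation data needs the same LabelAppeal levels as the data the model was trained on, otherwise `predict()` cannot match them.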

st(eval_abs) #run this to make sure each variable worked after
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 0
… No 0 NaN%
… Yes 0 NaN%
FixedAcidity 3335 7.967 4.854 0 5.7 9.4 33.5
VolatileAcidity 3335 0.654 0.565 0 0.25 0.93 3.61
CitricAcid 3335 0.697 0.609 0 0.29 1 3.76
ResidualSugar 3167 23.775 25.381 0.1 3.5 38.5 145.4
Chlorides 3197 0.221 0.231 0 0.045 0.369 1.263
FreeSulfurDioxide 3183 107.2 110.075 0 27 174 617
TotalSulfurDioxide 3178 201.511 160.003 0 97 261 1004
Density 3335 0.995 0.026 0.89 0.988 1.001 1.1
pH 3231 3.237 0.676 0.6 2.98 3.49 6.21
Sulphates 3025 0.833 0.641 0 0.43 1.06 4.18
Alcohol 3150 10.614 3.672 0 9 12.5 25.6
LabelAppeal 3335
… 1 114 3.4%
… 2 810 24.3%
… 3 1470 44.1%
… 4 799 24%
… 5 142 4.3%
AcidIndex 3335 7.748 1.315 5 7 8 17
STARS 3335
… 0 841 25.2%
… 1 828 24.8%
… 2 902 27%
… 3 600 18%
… 4 164 4.9%
summary(eval_abs)#make sure nothing broke
##   TARGET         FixedAcidity    VolatileAcidity    CitricAcid    
##  Mode:logical   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  NA's:3335      1st Qu.: 5.700   1st Qu.:0.2500   1st Qu.:0.2900  
##                 Median : 7.000   Median :0.4200   Median :0.4400  
##                 Mean   : 7.967   Mean   :0.6542   Mean   :0.6969  
##                 3rd Qu.: 9.400   3rd Qu.:0.9300   3rd Qu.:1.0000  
##                 Max.   :33.500   Max.   :3.6100   Max.   :3.7600  
##                                                                   
##  ResidualSugar      Chlorides      FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :  0.10   Min.   :0.0000   Min.   :  0.0     Min.   :   0.0    
##  1st Qu.:  3.50   1st Qu.:0.0450   1st Qu.: 27.0     1st Qu.:  97.0    
##  Median : 13.50   Median :0.1000   Median : 54.0     Median : 153.0    
##  Mean   : 23.77   Mean   :0.2213   Mean   :107.2     Mean   : 201.5    
##  3rd Qu.: 38.50   3rd Qu.:0.3690   3rd Qu.:174.0     3rd Qu.: 261.0    
##  Max.   :145.40   Max.   :1.2630   Max.   :617.0     Max.   :1004.0    
##  NA's   :168      NA's   :138      NA's   :152       NA's   :157       
##     Density             pH          Sulphates         Alcohol      LabelAppeal
##  Min.   :0.8898   Min.   :0.600   Min.   :0.0000   Min.   : 0.00   1: 114     
##  1st Qu.:0.9883   1st Qu.:2.980   1st Qu.:0.4300   1st Qu.: 9.00   2: 810     
##  Median :0.9946   Median :3.210   Median :0.5900   Median :10.40   3:1470     
##  Mean   :0.9947   Mean   :3.237   Mean   :0.8331   Mean   :10.61   4: 799     
##  3rd Qu.:1.0005   3rd Qu.:3.490   3rd Qu.:1.0600   3rd Qu.:12.50   5: 142     
##  Max.   :1.0998   Max.   :6.210   Max.   :4.1800   Max.   :25.60              
##                   NA's   :104     NA's   :310      NA's   :185                
##    AcidIndex      STARS  
##  Min.   : 5.000   0:841  
##  1st Qu.: 7.000   1:828  
##  Median : 8.000   2:902  
##  Mean   : 7.748   3:600  
##  3rd Qu.: 8.000   4:164  
##  Max.   :17.000          
## 
head(eval_abs)
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
5.40.86 0.2710.70.092233980.9855.020.6412.3 260
12.40.3850.7619.71.17 37680.99 3.371.0916   362
7.21.75 0.1733  0.0659761.05 4.610.688.55381
6.20.1  1.8 1  0.179104890.9893.2 2.1112.3 281
11.40.21 0.281.20.03870531.03 2.540.074.8 3100
17.60.04 1.151.40.5352501400.95 3.060.0211.4 484
# Predict

eval_abs$TARGET <- predict(poisson_1, newdata = eval_abs, type = "response")
eval_abs$TARGET <- as.numeric(round(eval_abs$TARGET,0))

#inspect projections

head(eval_abs)
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
15.40.86 0.2710.70.092233980.9855.020.6412.3 260
312.40.3850.7619.71.17 37680.99 3.371.0916   362
27.21.75 0.1733  0.0659761.05 4.610.688.55381
26.20.1  1.8 1  0.179104890.9893.2 2.1112.3 281
111.40.21 0.281.20.03870531.03 2.540.074.8 3100
417.60.04 1.151.40.5352501400.95 3.060.0211.4 484
predictions <- eval_abs
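
Two properties of `predict()` on a Poisson GLM are worth checking before using the projections. First, `type = "response"` is the exponential of the linear predictor. Second, rows with a missing predictor get an NA prediction. The `summary(eval_abs)` output above still lists NAs for several predictors, so some predicted TARGET values may be NA, depending on which predictors the model uses and on any imputation done elsewhere. The sketch below illustrates both points on simulated stand-in data, not the wine data; all object names are hypothetical.

```r
set.seed(42)

# Simulated stand-in data (NOT the wine data)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rpois(n, lambda = exp(0.4 + 0.3 * d$x1 - 0.2 * d$x2))

fit <- glm(y ~ x1 + x2, family = poisson, data = d)

# New rows, one of them with a missing predictor
new_rows <- data.frame(x1 = c(0.5, NA, -1.0), x2 = c(0.1, 0.2, 0.3))

pred_link <- predict(fit, newdata = new_rows, type = "link")      # log scale
pred_resp <- predict(fit, newdata = new_rows, type = "response")  # expected count

# For a log link, the response-scale prediction is exp() of the link-scale one
all.equal(unname(pred_resp), unname(exp(pred_link)))

# Rows with a missing predictor get an NA prediction
sum(is.na(pred_resp))
```

Running `sum(is.na(predictions$TARGET))` on the real projections would be a quick way to see whether any evaluation rows were left without a prediction.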