Modeling Wine

Introduction

This project aims to predict the number of cases of wine that a particular wine seller can expect to sell, based on the variables observed in the data provided. The dataset is fairly large, and most variables are chemical properties of the wine, such as sugar and acidity. The premise is that these factors affect how restaurants, or their buyers and sommeliers, evaluate the wines. In other words, the seller wants to know which properties matter most in order to maximize potential sales.

Data Exploration 1.0

We loaded the data, converted it into a data frame, and removed the index column.

#get rid of index
train <- train[,-1]
eval <- eval[,-1]

head(train)

TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
3	3.2	1.16	-0.98	54.2	-0.567		268	0.993	3.33	-0.59	9.9	0	8	2
3	4.5	0.16	-0.81	26.1	-0.425	15	-327	1.03	3.38	0.7		-1	7	3
5	7.1	2.64	-0.88	14.8	0.037	214	142	0.995	3.12	0.48	22	-1	8	3
3	5.7	0.385	0.04	18.8	-0.425	22	115	0.996	2.24	1.83	6.2	-1	6	1
4	8	0.33	-1.26	9.4		-167	108	0.995	3.12	1.77	13.7	0	9	2
0	11.3	0.32	0.59	2.2	0.556	-37	15	0.999	3.2	1.29	15.4	0	11

head(eval)

FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
5.4	-0.86	0.27	-10.7	0.092	23	398	0.985	5.02	0.64	12.3	-1	6
12.4	0.385	-0.76	-19.7	1.17	-37	68	0.99	3.37	1.09	16	0	6	2
7.2	1.75	0.17	-33	0.065	9	76	1.05	4.61	0.68	8.55	0	8	1
6.2	0.1	1.8	1	-0.179	104	89	0.989	3.2	2.11	12.3	-1	8	1
11.4	0.21	0.28	1.2	0.038	70	53	1.03	2.54	-0.07	4.8	0	10
17.6	0.04	-1.15	1.4	0.535	-250	140	0.95	3.06	-0.02	11.4	1	8	4

Data Exploration 2.0

We first evaluated the training dataset. We ran some basic commands and visuals to look at the data structure, the meaning of the variables, descriptive statistics, and missing values.

We initially found that the number of NAs in the dataset is significant, to the point of being a headache, so we knew we would need to impute values at some point. In addition, we noticed that several variables contain negative values, and that before modeling we would need to transform these series onto a scale that starts at zero or above.

First, however, we looked at the distribution of all the variables. We scaled them, centered around zero, so that they could be shown together in boxplots. We initially thought of converting categorical variables like STARS and LabelAppeal into factors, but the correlation matrix requires numeric values, so we kept them numeric at this stage. From the histograms and the correlation plot, we found that most variables, including STARS and LabelAppeal, look roughly normally distributed, and that these two are closely correlated with our target variable, cases sold.
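To put a number on the negative values mentioned above, a count per column is a quick check. The following is a minimal sketch on a made-up data frame; `toy` and `count_negative` are illustrative names, not objects from the analysis.

```r
# Toy data frame standing in for a few columns of `train` (values are made up).
toy <- data.frame(
  FixedAcidity  = c(3.2, -4.5, 7.1, NA),
  ResidualSugar = c(54.2, 26.1, -14.8, -2.0),
  AcidIndex     = c(8, 7, 8, 6)
)

# Hypothetical helper: count negative entries in a vector, ignoring NAs.
count_negative <- function(x) sum(x < 0, na.rm = TRUE)
neg_counts <- vapply(toy, count_negative, integer(1))

# Share of negative entries among the non-missing values, in percent.
neg_pct <- round(100 * neg_counts / colSums(!is.na(toy)), 1)

print(data.frame(negatives = neg_counts, percent = neg_pct))
```

Running the same two lines on `train` instead of `toy` would show which columns actually need a transformation.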

#Data exploration

#summary tables and data structure

summary(train)

##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

str(train)

## tibble [12,795 x 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...

st(train)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	7.076	6.318	-18.1	5.2	9.5	34.4
VolatileAcidity	12795	0.324	0.784	-2.79	0.13	0.64	3.68
CitricAcid	12795	0.308	0.862	-3.24	0.03	0.58	3.86
ResidualSugar	12179	5.419	33.749	-127.8	-2	15.9	141.15
Chlorides	12157	0.055	0.318	-1.171	-0.031	0.153	1.351
FreeSulfurDioxide	12148	30.846	148.715	-555	0	70	623
TotalSulfurDioxide	12113	120.714	231.913	-823	27	208	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12400	3.208	0.68	0.48	2.96	3.47	6.13
Sulphates	11585	0.527	0.932	-3.13	0.28	0.86	4.24
Alcohol	12142	10.489	3.728	-4.7	9	12.4	26.5
LabelAppeal	12795	-0.009	0.891	-2	-1	1	2
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	9436	2.042	0.903	1	1	3	4

describe(train)

## train 
## 
##  15  Variables      12795  Observations
## --------------------------------------------------------------------------------
## TARGET 
##        n  missing distinct     Info     Mean      Gmd 
##    12795        0        9    0.962    3.029    2.141 
## 
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency   2734   244  1091  2611  3177  2014   765   142    17
## Proportion 0.214 0.019 0.085 0.204 0.248 0.157 0.060 0.011 0.001
## --------------------------------------------------------------------------------
## FixedAcidity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      470        1    7.076    6.688     -3.6     -1.2 
##      .25      .50      .75      .90      .95 
##      5.2      6.9      9.5     15.6     17.8 
## 
## lowest : -18.1 -18.0 -17.7 -17.5 -17.4, highest:  32.4  32.5  32.6  34.1  34.4
## --------------------------------------------------------------------------------
## VolatileAcidity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      815        1   0.3241   0.8262   -1.023   -0.720 
##      .25      .50      .75      .90      .95 
##    0.130    0.280    0.640    1.350    1.640 
## 
## lowest : -2.790 -2.750 -2.745 -2.730 -2.720, highest:  3.500  3.550  3.565  3.590  3.680
## --------------------------------------------------------------------------------
## CitricAcid 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0      602        1   0.3084   0.9057    -1.16    -0.84 
##      .25      .50      .75      .90      .95 
##     0.03     0.31     0.58     1.43     1.79 
## 
## lowest : -3.24 -3.16 -3.10 -3.08 -3.06, highest:  3.63  3.68  3.70  3.77  3.86
## --------------------------------------------------------------------------------
## ResidualSugar 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12179      616     2077        1    5.419    35.31   -52.70   -39.66 
##      .25      .50      .75      .90      .95 
##    -2.00     3.90    15.90    49.72    62.70 
## 
## lowest : -127.80 -127.10 -126.20 -126.10 -125.70
## highest:  136.50  137.60  138.00  140.65  141.15
## --------------------------------------------------------------------------------
## Chlorides 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12157      638     1663        1  0.05482   0.3311   -0.489   -0.372 
##      .25      .50      .75      .90      .95 
##   -0.031    0.046    0.153    0.481    0.598 
## 
## lowest : -1.171 -1.170 -1.158 -1.156 -1.155, highest:  1.260  1.261  1.270  1.275  1.351
## --------------------------------------------------------------------------------
## FreeSulfurDioxide 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12148      647      999        1    30.85    155.2     -224     -171 
##      .25      .50      .75      .90      .95 
##        0       30       70      230      284 
## 
## lowest : -555 -546 -536 -535 -532, highest:  613  617  618  622  623
## --------------------------------------------------------------------------------
## TotalSulfurDioxide 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12113      682     1370        1    120.7    246.9   -273.0   -185.0 
##      .25      .50      .75      .90      .95 
##     27.0    123.0    208.0    421.8    513.4 
## 
## lowest : -823 -816 -793 -781 -779, highest: 1032 1041 1048 1054 1057
## --------------------------------------------------------------------------------
## Density 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0     5933        1   0.9942  0.02769   0.9488   0.9587 
##      .25      .50      .75      .90      .95 
##   0.9877   0.9945   1.0005   1.0295   1.0398 
## 
## lowest : 0.88809 0.88949 0.88978 0.88983 0.89167
## highest: 1.09658 1.09679 1.09695 1.09791 1.09924
## --------------------------------------------------------------------------------
## pH 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12400      395      497        1    3.208   0.7242     2.06     2.31 
##      .25      .50      .75      .90      .95 
##     2.96     3.20     3.47     4.10     4.37 
## 
## lowest : 0.48 0.53 0.54 0.58 0.59, highest: 5.91 5.94 6.02 6.05 6.13
## --------------------------------------------------------------------------------
## Sulphates 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    11585     1210      630        1   0.5271   0.9827    -1.05    -0.70 
##      .25      .50      .75      .90      .95 
##     0.28     0.50     0.86     1.77     2.09 
## 
## lowest : -3.13 -3.12 -3.10 -3.07 -3.03, highest:  4.11  4.16  4.19  4.21  4.24
## --------------------------------------------------------------------------------
## Alcohol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12142      653      401        1    10.49    4.015      4.1      5.7 
##      .25      .50      .75      .90      .95 
##      9.0     10.4     12.4     15.2     16.7 
## 
## lowest : -4.7 -4.5 -4.4 -4.3 -4.1, highest: 25.4 25.6 26.0 26.1 26.5
## --------------------------------------------------------------------------------
## LabelAppeal 
##         n   missing  distinct      Info      Mean       Gmd 
##     12795         0         5     0.887 -0.009066    0.9566 
## 
## lowest : -2 -1  0  1  2, highest: -2 -1  0  1  2
##                                         
## Value         -2    -1     0     1     2
## Frequency    504  3136  5617  3048   490
## Proportion 0.039 0.245 0.439 0.238 0.038
## --------------------------------------------------------------------------------
## AcidIndex 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    12795        0       14    0.908    7.773    1.316        6        7 
##      .25      .50      .75      .90      .95 
##        7        8        8        9       10 
## 
## lowest :  4  5  6  7  8, highest: 13 14 15 16 17
##                                                                             
## Value          4     5     6     7     8     9    10    11    12    13    14
## Frequency      3    75  1197  4878  4142  1427   551   258   128    69    47
## Proportion 0.000 0.006 0.094 0.381 0.324 0.112 0.043 0.020 0.010 0.005 0.004
##                             
## Value         15    16    17
## Frequency      8     5     7
## Proportion 0.001 0.000 0.001
## --------------------------------------------------------------------------------
## STARS 
##        n  missing distinct     Info     Mean      Gmd 
##     9436     3359        4    0.899    2.042   0.9777 
##                                   
## Value          1     2     3     4
## Frequency   3042  3570  2212   612
## Proportion 0.322 0.378 0.234 0.065
## --------------------------------------------------------------------------------

dim(train)

## [1] 12795    15

#several variables have many NAs
#assess missing values: count NAs per column
missing <- colSums(is.na(train))
missing_pct <- round(missing / nrow(train) * 100, 2)
na_table <- stack(sort(missing_pct, decreasing = TRUE))
na_table

values	ind
26.2	STARS
9.46	Sulphates
5.33	TotalSulfurDioxide
5.1	Alcohol
5.06	FreeSulfurDioxide
4.99	Chlorides
4.81	ResidualSugar
3.09	pH
0	TARGET
0	FixedAcidity
0	VolatileAcidity
0	CitricAcid
0	Density
0	LabelAppeal
0	AcidIndex

plot_missing(train)

Visualization

# Some visuals: histograms and correlation

# Histograms to check how variables are distributed
plot_num(train)

# Correlation plot -- all variables as numeric
cor <- cor(train, method="pearson", use = "pairwise.complete.obs")
corrplot(cor, method="circle")

#Scale variables to create boxplot chart with all variables in it

scaled.train <- as.data.table(scale(train[, c(
  'TARGET',
  'FixedAcidity',
  'VolatileAcidity',
  'CitricAcid',
  'ResidualSugar',
  'Chlorides',
  'FreeSulfurDioxide',
  'TotalSulfurDioxide',
  'pH',
  'Sulphates',
  'Density',
  'Alcohol',
  'AcidIndex',
  'STARS',
  'LabelAppeal'
  )]))

#Show boxplots

#boxplot(scaled.train)

melt.train <- melt(scaled.train)

# fun = requires ggplot2 >= 3.3.0; older versions use the now-deprecated fun.y =
scaled_boxplots <- ggplot(melt.train, aes(variable, value)) +
  geom_boxplot(width=.5, fill="navyblue", outlier.color="magenta", outlier.size = 1) +
  stat_summary(aes(color="mean"), fun=mean, geom="point",
               size=2, show.legend=TRUE) +
  stat_summary(aes(color="median"), fun=median, geom="point",
               size=2, show.legend=TRUE) +
  coord_flip() +
  labs(x="", y="") +
  scale_color_manual(values=c("blue", "purple")) + 
  theme(legend.position="top")


scaled_boxplots

Data Preparation

As mentioned earlier, when we inspected the data we noticed a large number of NAs. I have worked with NAs in the past, but in a much simpler way. Aside from the “hint” in the assignment paper not to always discard NAs, I knew that the missing STARS values would have to be converted to zeros. That was not a significant accomplishment, since I have learned several other ways to impute NAs in this class.

For the other variables that needed imputation, I read about several approaches, including the imputeTS, Hmisc, missForest, and mice packages. Although mice was a tad over my head, I read that it could be the best avenue. I have imputed using means and medians in the past, but after learning a bit more, I went with mice (multivariate imputation by chained equations), using its classification and regression tree method (cart), to try something new.
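For comparison, the simpler median imputation mentioned above can be sketched in a few lines of base R. This is only an illustration on made-up values; `impute_median` and `alcohol_toy` are hypothetical names and are not part of the analysis.

```r
# Hypothetical helper: replace NAs in a numeric vector with the median
# of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

alcohol_toy    <- c(9.9, NA, 22, 6.2, 13.7)   # made-up values
alcohol_filled <- impute_median(alcohol_toy)

print(alcohol_filled)

# Spread before and after: filling every gap with one central value
# tends to shrink the standard deviation.
print(sd(alcohol_toy, na.rm = TRUE))
print(sd(alcohol_filled))
```

Filling every gap with the same value understates the spread of the variable, which is one common argument for model-based approaches such as mice.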

After this, it made sense to return to the initial idea of converting STARS and LabelAppeal to factors before running the models. Many available examples also converted “AcidIndex” to a factor, so I tried that as well. It also made sense to consider moving the variables with negative values onto a positive scale, since this exercise aims to run Poisson and Negative Binomial models. Strictly speaking, these models require only the response to be a non-negative count, and negative predictor values are allowed; a log transformation, however, does require positive inputs. After looking at several techniques, I questioned whether I should log the data when most independent variables had a relatively normal distribution. Then, I realized that Poisson and Negative Binomial regressions can adjust for that once the data is transformed. The data transformation progressed as follows.
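One thing to watch when moving between factors and numbers in R: `as.numeric()` applied to a factor returns the internal level codes, not the original labels. A small sketch with a toy factor that uses the same labels as LabelAppeal (the vector itself is made up):

```r
# Toy factor with the same labels as LabelAppeal (-2 to 2).
label_appeal <- factor(c(-2, -1, 0, 1, 2, 0))

# as.numeric() on a factor returns the internal level codes (1, 2, 3, ...).
codes <- as.numeric(label_appeal)

# Going through as.character() first recovers the original labels.
labels_numeric <- as.numeric(as.character(label_appeal))

print(codes)
print(labels_numeric)
```

If the original values are needed after a factor conversion, the `as.character()` route is the safer one.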

Step 1. Initial imputed data: the data as given, with NAs in STARS set to zero and mice used to impute the remaining variables.

Step 2. Shifted data, x + abs(min(x)) + C: starts from the imputed data and shifts the variables so that negative numbers become positive; LabelAppeal was adjusted by + 2. For the other variables, the absolute value of the minimum was added. At this point, all the variables still showed a relatively normal distribution.

Step 3. Log transformation: we took logs of the absolute values of the continuous variables to see whether the distribution would take the shape of the negative binomial, and it appeared that it did.

Step 4. Absolute values: I used absolute values without log transformation or shifting and tested how that would perform. At first glance, it looked like just using absolute values would have been enough.
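The three transformations in Steps 2 to 4 behave differently, which a toy vector makes visible. This is a sketch with made-up values; the variable names are illustrative only.

```r
x <- c(-3, -1, 0, 2, 5)   # toy values, not from the wine data

shifted <- x + abs(min(x)) + 1   # Step 2: shift by |min| plus a constant of 1
abslog  <- log1p(abs(x))         # Step 3: log(|x| + 1)
absval  <- abs(x)                # Step 4: fold negatives onto the positive side

print(rbind(x, shifted, abslog, absval))

# Shifting keeps the ordering of the observations; abs() does not,
# because -3 ends up larger than 2.
same_order_shift <- identical(order(x), order(shifted))
same_order_abs   <- identical(order(x), order(absval))
print(c(shift = same_order_shift, abs = same_order_abs))
```

Note that `x + abs(min(x)) + 1` only maps the minimum to 1 when the minimum is negative or zero. For an all-positive column it adds the minimum rather than subtracting it, which is why pH and Density show minimums of 1.96 and 2.776 in the shifted summary further down.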

#impute some of the variables, in particular STARS: set NAs to zero

train$STARS[is.na(train$STARS)] <- 0
train$STARS <-as.factor(train$STARS)
train$LabelAppeal <- as.factor(train$LabelAppeal)
#train$AcidIndex <- as.factor(train$AcidIndex)

# head(train)

# impute the other variables--used mice based on this:
#https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

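# m = 2 imputed datasets, 2 iterations each, using CART (classification and regression trees)
train_mice <- mice::mice(train, m = 2, method='cart', maxit = 2, printFlag = FALSE)
# complete() returns the first imputed dataset by default
train_imputed <- mice::complete(train_mice)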

density.plot <-densityplot(train_mice)
density.plot

head(train_imputed)

TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
3	3.2	1.16	-0.98	54.2	-0.567	-170	268	0.993	3.33	-0.59	9.9	0	8	2
3	4.5	0.16	-0.81	26.1	-0.425	15	-327	1.03	3.38	0.7	13.5	-1	7	3
5	7.1	2.64	-0.88	14.8	0.037	214	142	0.995	3.12	0.48	22	-1	8	3
3	5.7	0.385	0.04	18.8	-0.425	22	115	0.996	2.24	1.83	6.2	-1	6	1
4	8	0.33	-1.26	9.4	-0.446	-167	108	0.995	3.12	1.77	13.7	0	9	2
0	11.3	0.32	0.59	2.2	0.556	-37	15	0.999	3.2	1.29	15.4	0	11	0

st(train_imputed)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	7.076	6.318	-18.1	5.2	9.5	34.4
VolatileAcidity	12795	0.324	0.784	-2.79	0.13	0.64	3.68
CitricAcid	12795	0.308	0.862	-3.24	0.03	0.58	3.86
ResidualSugar	12795	5.44	33.737	-127.8	-1.8	15.85	141.15
Chlorides	12795	0.056	0.319	-1.171	-0.029	0.156	1.351
FreeSulfurDioxide	12795	31.198	148.59	-555	0	70	623
TotalSulfurDioxide	12795	121.383	231.196	-823	27.5	208	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.525	0.93	-3.13	0.28	0.86	4.24
Alcohol	12795	10.493	3.738	-4.7	9	12.4	26.5
LabelAppeal	12795
… -2	504	3.9%
… -1	3136	24.5%
… 0	5617	43.9%
… 1	3048	23.8%
… 2	490	3.8%
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

plot_num(train_imputed)

Further Insights

I looked at different techniques in R to adjust the negative values, and the following is the one I understood best. What made sense to me was to add the absolute value of the minimum, plus 1 (to change the scale), to each observation. I have done this manually in the past with models in Excel. I saw many analysts do this in several examples, so I started with this approach. There were also analysts who treated “AcidIndex” as a categorical variable. I did not go this route.

I learned there are other techniques, such as simply taking absolute values, or taking the log of those absolute values. Even without some of these transformations, I was still getting normal-looking distributions for the imputed values, and from what I understand, that should be workable for the negative binomial model, and possibly for the Poisson model if there is no overdispersion.
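A rough first look at overdispersion is to compare the variance of the response with its mean, since a Poisson variable has the two equal. The sketch below uses the TARGET mean and standard deviation reported in the summary tables above; it is only a marginal check, because the Poisson assumption really concerns the variance conditional on the predictors.

```r
# Summary statistics for TARGET as reported in the summary tables.
target_mean <- 3.029
target_sd   <- 1.926

target_var       <- target_sd^2
dispersion_ratio <- target_var / target_mean

cat("variance:", round(target_var, 3), "\n")
cat("variance / mean:", round(dispersion_ratio, 3), "\n")
```

A ratio somewhat above 1 hints at mild overdispersion, but a proper check would look at the fitted model rather than the raw response.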

train_imp_plusminconst <- train_imputed

#only needed for columns with negative values: y + abs(min(y)) + 1
#(not sure what the +1 does; I guess it could be any constant, or none?)
#Also --> should these be logged despite having a normal distribution? --I decided not to
#https://www.listendata.com/2015/09/regression-transform-negative-values.html

train_imp_plusminconst$FixedAcidity <- train_imp_plusminconst$FixedAcidity + abs(min(train_imp_plusminconst$FixedAcidity))+1
train_imp_plusminconst$VolatileAcidity <- train_imp_plusminconst$VolatileAcidity + abs(min(train_imp_plusminconst$VolatileAcidity))+1
train_imp_plusminconst$CitricAcid <- train_imp_plusminconst$CitricAcid + abs(min(train_imp_plusminconst$CitricAcid))+1
train_imp_plusminconst$ResidualSugar <- train_imp_plusminconst$ResidualSugar + abs(min(train_imp_plusminconst$ResidualSugar))+1
train_imp_plusminconst$Chlorides <- train_imp_plusminconst$Chlorides + abs(min(train_imp_plusminconst$Chlorides))+1
train_imp_plusminconst$FreeSulfurDioxide <- train_imp_plusminconst$FreeSulfurDioxide + abs(min(train_imp_plusminconst$FreeSulfurDioxide))+1
train_imp_plusminconst$TotalSulfurDioxide <- train_imp_plusminconst$TotalSulfurDioxide +abs(min(train_imp_plusminconst$TotalSulfurDioxide ))+1
train_imp_plusminconst$Sulphates <-train_imp_plusminconst$Sulphates +abs(min(train_imp_plusminconst$Sulphates))+1
train_imp_plusminconst$Alcohol <-train_imp_plusminconst$Alcohol +abs(min(train_imp_plusminconst$Alcohol))+1

#this seems out of scale for the other variables, so decided to scale the other positive variables too
train_imp_plusminconst$Density <- train_imp_plusminconst$Density + abs(min(train_imp_plusminconst$Density))+1
train_imp_plusminconst$pH <- train_imp_plusminconst$pH + abs(min(train_imp_plusminconst$pH))+1

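#transform Label Appeal too.
#note: as.numeric() on a factor returns the level codes 1..5, not the labels -2..2,
#so abs(min()) is 1 here and the second line maps the codes to 0..4
train_imp_plusminconst$LabelAppeal <- as.numeric(train_imp_plusminconst$LabelAppeal)
train_imp_plusminconst$LabelAppeal <-train_imp_plusminconst$LabelAppeal + abs(min(train_imp_plusminconst$LabelAppeal))-2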

train_imp_plusminconst$LabelAppeal <- as.factor(train_imp_plusminconst$LabelAppeal)

st(train_imp_plusminconst) #run this to make sure each variable worked after

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	26.176	6.318	1	24.3	28.6	53.5
VolatileAcidity	12795	4.114	0.784	1	3.92	4.43	7.47
CitricAcid	12795	4.548	0.862	1	4.27	4.82	8.1
ResidualSugar	12795	134.24	33.737	1	127	144.65	269.95
Chlorides	12795	2.227	0.319	1	2.142	2.327	3.522
FreeSulfurDioxide	12795	587.198	148.59	1	556	626	1179
TotalSulfurDioxide	12795	945.383	231.196	1	851.5	1032	1881
Density	12795	2.882	0.027	2.776	2.876	2.889	2.987
pH	12795	4.687	0.681	1.96	4.43	4.95	7.61
Sulphates	12795	4.655	0.93	1	4.41	4.99	8.37
Alcohol	12795	16.193	3.738	1	14.7	18.1	32.2
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

summary(train_imp_plusminconst) #make sure nothing broke

##      TARGET       FixedAcidity   VolatileAcidity   CitricAcid   
##  Min.   :0.000   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:24.30   1st Qu.:3.920   1st Qu.:4.270  
##  Median :3.000   Median :26.00   Median :4.070   Median :4.550  
##  Mean   :3.029   Mean   :26.18   Mean   :4.114   Mean   :4.548  
##  3rd Qu.:4.000   3rd Qu.:28.60   3rd Qu.:4.430   3rd Qu.:4.820  
##  Max.   :8.000   Max.   :53.50   Max.   :7.470   Max.   :8.100  
##  ResidualSugar     Chlorides     FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :  1.0   Min.   :1.000   Min.   :   1.0    Min.   :   1.0    
##  1st Qu.:127.0   1st Qu.:2.142   1st Qu.: 556.0    1st Qu.: 851.5    
##  Median :132.7   Median :2.217   Median : 586.0    Median : 947.0    
##  Mean   :134.2   Mean   :2.227   Mean   : 587.2    Mean   : 945.4    
##  3rd Qu.:144.7   3rd Qu.:2.326   3rd Qu.: 626.0    3rd Qu.:1032.0    
##  Max.   :269.9   Max.   :3.522   Max.   :1179.0    Max.   :1881.0    
##     Density            pH          Sulphates        Alcohol      LabelAppeal
##  Min.   :2.776   Min.   :1.960   Min.   :1.000   Min.   : 1.00   0: 504     
##  1st Qu.:2.876   1st Qu.:4.430   1st Qu.:4.410   1st Qu.:14.70   1:3136     
##  Median :2.883   Median :4.680   Median :4.630   Median :16.10   2:5617     
##  Mean   :2.882   Mean   :4.687   Mean   :4.655   Mean   :16.19   3:3048     
##  3rd Qu.:2.889   3rd Qu.:4.950   3rd Qu.:4.990   3rd Qu.:18.10   4: 490     
##  Max.   :2.987   Max.   :7.610   Max.   :8.370   Max.   :32.20              
##    AcidIndex      STARS   
##  Min.   : 4.000   0:3359  
##  1st Qu.: 7.000   1:3042  
##  Median : 8.000   2:3570  
##  Mean   : 7.773   3:2212  
##  3rd Qu.: 8.000   4: 612  
##  Max.   :17.000

plot_num(train_imp_plusminconst)

# I liked the for loop version better, but it was a bit over my head too. Review for later, stick to the above method.
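For later review, the column-by-column shift above could be written as a loop along these lines. This is a minimal sketch on a made-up data frame; `shift_positive`, `toy`, and `cols_to_shift` are hypothetical names, and the real column list would be the variables shifted above.

```r
# Hypothetical helper: shift a numeric vector by |min| plus a constant.
shift_positive <- function(x, constant = 1) {
  x + abs(min(x, na.rm = TRUE)) + constant
}

# Toy stand-in for train_imputed (values are made up).
toy <- data.frame(
  FixedAcidity = c(3.2, -4.5, 7.1),
  Alcohol      = c(9.9, -1.0, 22),
  AcidIndex    = c(8, 7, 8)
)

cols_to_shift <- c("FixedAcidity", "Alcohol")

# Loop version: overwrite each listed column with its shifted values.
shifted <- toy
for (col in cols_to_shift) {
  shifted[[col]] <- shift_positive(shifted[[col]])
}

# Equivalent version without an explicit loop.
shifted2 <- toy
shifted2[cols_to_shift] <- lapply(toy[cols_to_shift], shift_positive)

print(shifted)
```

Either form keeps the list of columns in one place, so adding or dropping a variable is a one-word change.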

######################################################################
# Transform the data from imputed to logs

train_scaling_subset2 <- train_imputed %>%
  dplyr::select(FixedAcidity,
                VolatileAcidity,
                CitricAcid,
                ResidualSugar,
                Chlorides,
                FreeSulfurDioxide,
                TotalSulfurDioxide,
                Sulphates,
                Alcohol)

# abs() is vectorized, so it can be applied to each column directly
train_absscaled_subset <- as.data.frame(lapply(train_scaling_subset2, abs))

# Join absolute value-scaled subset back to other continuous variables
train_abs <- train_imputed %>%
  dplyr::select(Density,
                AcidIndex,
                pH
                ) %>%
  cbind(train_absscaled_subset)

# Log-scale all continuous variables, adding constant of 1 (log1p(x) is log(x + 1))
train_abslog <- as.data.frame(lapply(train_abs, log1p))

st(train_abs)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
AcidIndex	12795	7.773	1.324	4	7	8	17
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
FixedAcidity	12795	8.063	4.996	0	5.6	9.8	34.4
VolatileAcidity	12795	0.641	0.556	0	0.25	0.91	3.68
CitricAcid	12795	0.686	0.606	0	0.28	0.97	3.86
ResidualSugar	12795	23.326	24.972	0	3.6	38.6	141.15
Chlorides	12795	0.223	0.234	0	0.046	0.368	1.351
FreeSulfurDioxide	12795	106.642	108.069	0	28	171	623
TotalSulfurDioxide	12795	204.015	162.976	0	99	262	1057
Sulphates	12795	0.845	0.652	0	0.43	1.09	4.24
Alcohol	12795	10.526	3.642	0	9	12.4	26.5

#bring in the shifted Label Appeal from the plus-min-constant dataframe
train_abslog$LabelAppeal <- train_imp_plusminconst$LabelAppeal


# Map remaining variables to dataframe
#train_abslog$INDEX <- train_imputed$INDEX
train_abslog$TARGET <- train_imputed$TARGET
train_abslog$STARS <- train_imputed$STARS
train_abslog$STARS <- as.factor(train_abslog$STARS)

head(train_abslog)

Density	AcidIndex	pH	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Sulphates	Alcohol	LabelAppeal	TARGET	STARS
0.69	2.2	1.47	1.44	0.77	0.683	4.01	0.449	5.14	5.59	0.464	2.39	2	3	2
0.707	2.08	1.48	1.7	0.148	0.593	3.3	0.354	2.77	5.79	0.531	2.67	1	3	3
0.691	2.2	1.42	2.09	1.29	0.631	2.76	0.0363	5.37	4.96	0.392	3.14	1	5	3
0.691	1.95	1.18	1.9	0.326	0.0392	2.99	0.354	3.14	4.75	1.04	1.97	1	3	1
0.69	2.3	1.42	2.2	0.285	0.815	2.34	0.369	5.12	4.69	1.02	2.69	2	4	2
0.693	2.48	1.44	2.51	0.278	0.464	1.16	0.442	3.64	2.77	0.829	2.8	2	0	0

st(train_abslog)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
Density	12795	0.69	0.013	0.636	0.687	0.693	0.742
AcidIndex	12795	2.161	0.14	1.609	2.079	2.197	2.89
pH	12795	1.423	0.171	0.392	1.374	1.497	1.964
FixedAcidity	12795	2.04	0.617	0	1.887	2.38	3.567
VolatileAcidity	12795	0.449	0.293	0	0.223	0.647	1.543
CitricAcid	12795	0.47	0.31	0	0.247	0.678	1.581
ResidualSugar	12795	2.597	1.17	0	1.526	3.679	4.957
Chlorides	12795	0.185	0.174	0	0.045	0.313	0.855
FreeSulfurDioxide	12795	4.14	1.115	0	3.367	5.147	6.436
TotalSulfurDioxide	12795	4.993	0.903	0	4.605	5.572	6.964
Sulphates	12795	0.562	0.304	0	0.358	0.737	1.656
Alcohol	12795	2.382	0.388	0	2.303	2.595	3.314
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
TARGET	12795	3.029	1.926	0	2	4	8
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

plot_num(train_abslog)

train_imp_abs <- train_imputed

#take absolute values, only for the columns that contain negative values
#(no shift constant and no log in this version)
#https://www.listendata.com/2015/09/regression-transform-negative-values.html

train_imp_abs$FixedAcidity <- abs(train_imp_abs$FixedAcidity)
train_imp_abs$VolatileAcidity <- abs(train_imp_abs$VolatileAcidity)
train_imp_abs$CitricAcid <- abs(train_imp_abs$CitricAcid)
train_imp_abs$ResidualSugar <-abs(train_imp_abs$ResidualSugar)
train_imp_abs$Chlorides <-abs(train_imp_abs$Chlorides)
train_imp_abs$FreeSulfurDioxide <-abs(train_imp_abs$FreeSulfurDioxide)
train_imp_abs$TotalSulfurDioxide <-abs(train_imp_abs$TotalSulfurDioxide)
train_imp_abs$Sulphates <- abs(train_imp_abs$Sulphates)
train_imp_abs$Alcohol <-abs(train_imp_abs$Alcohol)

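#transform Label Appeal too.
#note: as.numeric() on a factor returns the level codes 1..5, so abs() leaves them
#unchanged and the resulting factor has levels 1..5
train_imp_abs$LabelAppeal <- as.numeric(train_imp_abs$LabelAppeal)
train_imp_abs$LabelAppeal <- abs(train_imp_abs$LabelAppeal) 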
#train_imp_abs$LabelAppeal + abs(min(train_imp_abs$LabelAppeal))-2

train_imp_abs$LabelAppeal <- as.factor(train_imp_abs$LabelAppeal)

st(train_imp_abs) #run this to make sure each variable worked after

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	8.063	4.996	0	5.6	9.8	34.4
VolatileAcidity	12795	0.641	0.556	0	0.25	0.91	3.68
CitricAcid	12795	0.686	0.606	0	0.28	0.97	3.86
ResidualSugar	12795	23.326	24.972	0	3.6	38.6	141.15
Chlorides	12795	0.223	0.234	0	0.046	0.368	1.351
FreeSulfurDioxide	12795	106.642	108.069	0	28	171	623
TotalSulfurDioxide	12795	204.015	162.976	0	99	262	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.845	0.652	0	0.43	1.09	4.24
Alcohol	12795	10.526	3.642	0	9	12.4	26.5
LabelAppeal	12795
… 1	504	3.9%
… 2	3136	24.5%
… 3	5617	43.9%
… 4	3048	23.8%
… 5	490	3.8%
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

summary(train_imp_abs)#make sure nothing broke

##      TARGET       FixedAcidity    VolatileAcidity    CitricAcid    
##  Min.   :0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.: 5.600   1st Qu.:0.2500   1st Qu.:0.2800  
##  Median :3.000   Median : 7.000   Median :0.4100   Median :0.4400  
##  Mean   :3.029   Mean   : 8.063   Mean   :0.6411   Mean   :0.6863  
##  3rd Qu.:4.000   3rd Qu.: 9.800   3rd Qu.:0.9100   3rd Qu.:0.9700  
##  Max.   :8.000   Max.   :34.400   Max.   :3.6800   Max.   :3.8600  
##  ResidualSugar      Chlorides      FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :  0.00   Min.   :0.0000   Min.   :  0.0     Min.   :   0      
##  1st Qu.:  3.60   1st Qu.:0.0460   1st Qu.: 28.0     1st Qu.:  99      
##  Median : 12.90   Median :0.0980   Median : 56.0     Median : 154      
##  Mean   : 23.33   Mean   :0.2227   Mean   :106.6     Mean   : 204      
##  3rd Qu.: 38.60   3rd Qu.:0.3680   3rd Qu.:171.0     3rd Qu.: 262      
##  Max.   :141.15   Max.   :1.3510   Max.   :623.0     Max.   :1057      
##     Density             pH          Sulphates         Alcohol      LabelAppeal
##  Min.   :0.8881   Min.   :0.480   Min.   :0.0000   Min.   : 0.00   1: 504     
##  1st Qu.:0.9877   1st Qu.:2.950   1st Qu.:0.4300   1st Qu.: 9.00   2:3136     
##  Median :0.9945   Median :3.200   Median :0.5900   Median :10.40   3:5617     
##  Mean   :0.9942   Mean   :3.207   Mean   :0.8454   Mean   :10.53   4:3048     
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.:1.0900   3rd Qu.:12.40   5: 490     
##  Max.   :1.0992   Max.   :6.130   Max.   :4.2400   Max.   :26.50              
##    AcidIndex      STARS   
##  Min.   : 4.000   0:3359  
##  1st Qu.: 7.000   1:3042  
##  Median : 8.000   2:3570  
##  Mean   : 7.773   3:2212  
##  3rd Qu.: 8.000   4: 612  
##  Max.   :17.000

plot_num(train_imp_abs)

# Start the Box-Cox dataset from the min-plus-constant data.
train_abs_bc <- train_imp_plusminconst
# as.numeric() on these factors returns the level codes, so 0..4 becomes 1..5.
train_abs_bc$LabelAppeal <- as.numeric(train_abs_bc$LabelAppeal)
train_abs_bc$STARS <- as.numeric(train_abs_bc$STARS)
st(train_abs_bc)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	26.176	6.318	1	24.3	28.6	53.5
VolatileAcidity	12795	4.114	0.784	1	3.92	4.43	7.47
CitricAcid	12795	4.548	0.862	1	4.27	4.82	8.1
ResidualSugar	12795	134.24	33.737	1	127	144.65	269.95
Chlorides	12795	2.227	0.319	1	2.142	2.327	3.522
FreeSulfurDioxide	12795	587.198	148.59	1	556	626	1179
TotalSulfurDioxide	12795	945.383	231.196	1	851.5	1032	1881
Density	12795	2.882	0.027	2.776	2.876	2.889	2.987
pH	12795	4.687	0.681	1.96	4.43	4.95	7.61
Sulphates	12795	4.655	0.93	1	4.41	4.99	8.37
Alcohol	12795	16.193	3.738	1	14.7	18.1	32.2
LabelAppeal	12795	2.991	0.891	1	2	4	5
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	12795	2.506	1.187	1	1	3	5

# Add 1 to every column so that TARGET, which contains zeros, is strictly positive for boxcox().
shift_cols <- c("TARGET",
                "FixedAcidity",
                "VolatileAcidity",
                "CitricAcid",
                "ResidualSugar",
                "Chlorides",
                "FreeSulfurDioxide",
                "TotalSulfurDioxide",
                "Density",
                "pH",
                "Sulphates",
                "Alcohol",
                "LabelAppeal",
                "AcidIndex",
                "STARS")
train_abs_bc[, shift_cols] <- train_abs_bc[, shift_cols] + 1

b = boxcox(TARGET ~
                  FixedAcidity
                  +VolatileAcidity
                  +CitricAcid
                  +ResidualSugar
                  +Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  +Density
                  +pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS,
           data=train_abs_bc)

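# Pair each candidate lambda with its profile log-likelihood,
# sort by likelihood (highest first), and keep the best lambda.
lambda = b$x
lik = b$y
bc = cbind(lambda,lik)

hold=bc[order(-lik),]
bcVal=hold[1,1]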
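
The lambda search above can be reproduced by hand on toy data. The sketch below is only an illustration under stated assumptions: the data are simulated, `boxcox_loglik` is a hypothetical helper that is not part of the original analysis, and the grid mirrors the default -2 to 2 range (step 0.1) of `boxcox()`. It evaluates the standard Box-Cox profile log-likelihood for a transformation of the response and keeps the lambda with the highest value.

```r
# Simulated data (assumption): the response is the square of a linear signal,
# so the "true" Box-Cox lambda is 0.5.
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- (2 + 0.5 * x + rnorm(n, sd = 0.2))^2  # strictly positive by construction

# Hypothetical helper: profile log-likelihood of the Box-Cox model for one lambda.
boxcox_loglik <- function(lambda, y, x) {
  # Box-Cox transform of the response; the log is the limiting case lambda = 0
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(z ~ x))^2)
  # Gaussian log-likelihood (up to a constant) plus the Jacobian term
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.1)
loglik <- sapply(lambdas, boxcox_loglik, y = y, x = x)

# Same idea as sorting by likelihood and taking the first row
lambda_hat <- lambdas[which.max(loglik)]
print(lambda_hat)
```

The likelihood here describes a transformation of the response only, which is worth keeping in mind when the selected lambda is reused as a power for other columns.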

# Raise the continuous predictors to the power bcVal (LabelAppeal and STARS are left as they are).
# Note that boxcox() estimated this lambda for the response, TARGET.
power_cols <- c("FixedAcidity",
                "VolatileAcidity",
                "CitricAcid",
                "ResidualSugar",
                "Chlorides",
                "FreeSulfurDioxide",
                "TotalSulfurDioxide",
                "Density",
                "pH",
                "Sulphates",
                "Alcohol",
                "AcidIndex")
train_abs_bc[, power_cols] <- train_abs_bc[, power_cols]^(bcVal)

#Before running the models, let's look at the datasets we will be working with.

st(train_imputed)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	7.076	6.318	-18.1	5.2	9.5	34.4
VolatileAcidity	12795	0.324	0.784	-2.79	0.13	0.64	3.68
CitricAcid	12795	0.308	0.862	-3.24	0.03	0.58	3.86
ResidualSugar	12795	5.44	33.737	-127.8	-1.8	15.85	141.15
Chlorides	12795	0.056	0.319	-1.171	-0.029	0.156	1.351
FreeSulfurDioxide	12795	31.198	148.59	-555	0	70	623
TotalSulfurDioxide	12795	121.383	231.196	-823	27.5	208	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.525	0.93	-3.13	0.28	0.86	4.24
Alcohol	12795	10.493	3.738	-4.7	9	12.4	26.5
LabelAppeal	12795
… -2	504	3.9%
… -1	3136	24.5%
… 0	5617	43.9%
… 1	3048	23.8%
… 2	490	3.8%
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

st(train_imp_plusminconst)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	26.176	6.318	1	24.3	28.6	53.5
VolatileAcidity	12795	4.114	0.784	1	3.92	4.43	7.47
CitricAcid	12795	4.548	0.862	1	4.27	4.82	8.1
ResidualSugar	12795	134.24	33.737	1	127	144.65	269.95
Chlorides	12795	2.227	0.319	1	2.142	2.327	3.522
FreeSulfurDioxide	12795	587.198	148.59	1	556	626	1179
TotalSulfurDioxide	12795	945.383	231.196	1	851.5	1032	1881
Density	12795	2.882	0.027	2.776	2.876	2.889	2.987
pH	12795	4.687	0.681	1.96	4.43	4.95	7.61
Sulphates	12795	4.655	0.93	1	4.41	4.99	8.37
Alcohol	12795	16.193	3.738	1	14.7	18.1	32.2
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

st(train_abslog)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
Density	12795	0.69	0.013	0.636	0.687	0.693	0.742
AcidIndex	12795	2.161	0.14	1.609	2.079	2.197	2.89
pH	12795	1.423	0.171	0.392	1.374	1.497	1.964
FixedAcidity	12795	2.04	0.617	0	1.887	2.38	3.567
VolatileAcidity	12795	0.449	0.293	0	0.223	0.647	1.543
CitricAcid	12795	0.47	0.31	0	0.247	0.678	1.581
ResidualSugar	12795	2.597	1.17	0	1.526	3.679	4.957
Chlorides	12795	0.185	0.174	0	0.045	0.313	0.855
FreeSulfurDioxide	12795	4.14	1.115	0	3.367	5.147	6.436
TotalSulfurDioxide	12795	4.993	0.903	0	4.605	5.572	6.964
Sulphates	12795	0.562	0.304	0	0.358	0.737	1.656
Alcohol	12795	2.382	0.388	0	2.303	2.595	3.314
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
TARGET	12795	3.029	1.926	0	2	4	8
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

st(train_imp_abs)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	8.063	4.996	0	5.6	9.8	34.4
VolatileAcidity	12795	0.641	0.556	0	0.25	0.91	3.68
CitricAcid	12795	0.686	0.606	0	0.28	0.97	3.86
ResidualSugar	12795	23.326	24.972	0	3.6	38.6	141.15
Chlorides	12795	0.223	0.234	0	0.046	0.368	1.351
FreeSulfurDioxide	12795	106.642	108.069	0	28	171	623
TotalSulfurDioxide	12795	204.015	162.976	0	99	262	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.845	0.652	0	0.43	1.09	4.24
Alcohol	12795	10.526	3.642	0	9	12.4	26.5
LabelAppeal	12795
… 1	504	3.9%
… 2	3136	24.5%
… 3	5617	43.9%
… 4	3048	23.8%
… 5	490	3.8%
AcidIndex	12795	7.773	1.324	4	7	8	17
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

st(train_abs_bc)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	4.029	1.926	1	3	5	9
FixedAcidity	12795	51.548	14.092	2.285	47.033	56.71	117.394
VolatileAcidity	12795	7.014	1.275	2.285	6.68	7.513	12.763
CitricAcid	12795	7.731	1.423	2.285	7.25	8.161	13.903
ResidualSugar	12795	349.423	102.119	2.285	324.806	378.871	793.976
Chlorides	12795	4.045	0.475	2.285	3.914	4.19	6.041
FreeSulfurDioxide	12795	2015.385	597.457	2.285	1874.293	2158.325	4586.011
TotalSulfurDioxide	12795	3550.688	1019.146	2.285	3112.798	3913.488	7999.852
Density	12795	5.037	0.041	4.873	5.027	5.046	5.2
pH	12795	7.952	1.132	3.645	7.513	8.378	13.015
Sulphates	12795	7.91	1.541	2.285	7.48	8.446	14.396
Alcohol	12795	29.843	7.646	2.285	26.633	33.642	65.024
LabelAppeal	12795	3.991	0.891	2	3	5	6
AcidIndex	12795	13.342	2.443	6.81	11.924	13.721	31.346
STARS	12795	3.506	1.187	2	2	4	6

Building the models

Linear regression #1

The first model used all variables from the imputed data, with “AcidIndex” treated both as a factor and as a numeric variable. Surprisingly, the result was relatively decent, with an R-squared of about 0.54. As a numeric variable, “AcidIndex” is statistically significant; as a factor, none of its 14 categories (coded 1-14) was significant. We therefore kept “AcidIndex” numeric.
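
Switching a variable between factor and numeric has a side effect worth seeing on a small example. The sketch below uses a made-up vector (`acid_index` is hypothetical, not the real column) to show that `as.numeric()` on a factor returns the level codes rather than the original values. This is consistent with the summary tables in this section, where AcidIndex runs from 1 to 14 instead of 4 to 17 after the round trip. For a linear model the shift mostly moves the intercept, but it changes how a one-unit difference should be read against the original scale.

```r
# Hypothetical values on the original AcidIndex scale
acid_index <- c(4, 7, 8, 8, 11, 17)

as_factor <- as.factor(acid_index)

# Level codes: position of each value among the sorted unique levels
codes <- as.numeric(as_factor)

# Original values: go through the level labels instead
values <- as.numeric(as.character(as_factor))

print(codes)
print(values)
```

If the original scale matters downstream, the `as.character()` route is one way to keep it.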

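We then removed the statistically insignificant variables. The best R-squared we reached was 0.5405; the fit printed below reports a multiple R-squared of 0.5414 and an adjusted R-squared of 0.5408.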

We checked the model by plotting the residuals, and the plots suggested a strong linear relationship.

Some of the coefficients indicated the following:

• Naturally, the more STARS a wine was given, the more cases one would expect to sell.
• AcidIndex was negatively correlated with sales: the more acidic a wine is, the fewer cases it is expected to sell.
• LabelAppeal was statistically significant and positively correlated with sales at every level: relative to the lowest rating (-2), a wine with a rating of zero is expected to sell about 0.8 more cases.
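
Coefficients for factor variables such as STARS and LabelAppeal are differences from the omitted baseline level, not absolute effects. The following sketch illustrates this on simulated data (the names `stars`, `target`, and `fit` are hypothetical and the numbers are not from the wine data): in a model with a single factor, each dummy coefficient equals the difference between that group's mean and the baseline group's mean.

```r
# Simulated ratings and sales (assumption: purely illustrative numbers)
set.seed(42)
stars <- factor(sample(0:4, 500, replace = TRUE))
target <- 1 + 0.8 * as.numeric(as.character(stars)) + rnorm(500, sd = 0.5)

# R uses treatment contrasts by default, so level "0" is the baseline
fit <- lm(target ~ stars)
print(coef(fit))

# The coefficient for stars2 is the mean of group 2 minus the mean of group 0
diff_means <- mean(target[stars == "2"]) - mean(target[stars == "0"])
print(diff_means)
```

With several predictors in the model the same reading applies, holding the other variables fixed.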

#start with linear regression
# available dataframes:

# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values --> train_imp_abs
###############################################################################

# Use for checking if "AcidIndex" as numeric or factor makes any difference

###############################################################################

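# raw --> train_imputed
# NOTE: as.numeric() on a factor returns the level codes, so AcidIndex 4-17 is recoded to 1-14 (see the table below).
train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)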
st(train_imputed)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	7.076	6.318	-18.1	5.2	9.5	34.4
VolatileAcidity	12795	0.324	0.784	-2.79	0.13	0.64	3.68
CitricAcid	12795	0.308	0.862	-3.24	0.03	0.58	3.86
ResidualSugar	12795	5.44	33.737	-127.8	-1.8	15.85	141.15
Chlorides	12795	0.056	0.319	-1.171	-0.029	0.156	1.351
FreeSulfurDioxide	12795	31.198	148.59	-555	0	70	623
TotalSulfurDioxide	12795	121.383	231.196	-823	27.5	208	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.525	0.93	-3.13	0.28	0.86	4.24
Alcohol	12795	10.493	3.738	-4.7	9	12.4	26.5
LabelAppeal	12795
… -2	504	3.9%
… -1	3136	24.5%
… 0	5617	43.9%
… 1	3048	23.8%
… 2	490	3.8%
AcidIndex	12795	4.773	1.324	1	4	5	14
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

# # scaled through minplusconstant --> train_imp_plusminconst
train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
st(train_imp_plusminconst)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	26.176	6.318	1	24.3	28.6	53.5
VolatileAcidity	12795	4.114	0.784	1	3.92	4.43	7.47
CitricAcid	12795	4.548	0.862	1	4.27	4.82	8.1
ResidualSugar	12795	134.24	33.737	1	127	144.65	269.95
Chlorides	12795	2.227	0.319	1	2.142	2.327	3.522
FreeSulfurDioxide	12795	587.198	148.59	1	556	626	1179
TotalSulfurDioxide	12795	945.383	231.196	1	851.5	1032	1881
Density	12795	2.882	0.027	2.776	2.876	2.889	2.987
pH	12795	4.687	0.681	1.96	4.43	4.95	7.61
Sulphates	12795	4.655	0.93	1	4.41	4.99	8.37
Alcohol	12795	16.193	3.738	1	14.7	18.1	32.2
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
AcidIndex	12795	4.773	1.324	1	4	5	14
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

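#
# # scaled (not all) and logged (continuous variables) --> train_abslog
# NOTE: the right-hand side reads train_abs, not train_abslog, so AcidIndex here
# ends up on the raw 4-17 scale instead of the logged scale (see the table below).
train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)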
st(train_abslog)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
Density	12795	0.69	0.013	0.636	0.687	0.693	0.742
AcidIndex	12795	7.773	1.324	4	7	8	17
pH	12795	1.423	0.171	0.392	1.374	1.497	1.964
FixedAcidity	12795	2.04	0.617	0	1.887	2.38	3.567
VolatileAcidity	12795	0.449	0.293	0	0.223	0.647	1.543
CitricAcid	12795	0.47	0.31	0	0.247	0.678	1.581
ResidualSugar	12795	2.597	1.17	0	1.526	3.679	4.957
Chlorides	12795	0.185	0.174	0	0.045	0.313	0.855
FreeSulfurDioxide	12795	4.14	1.115	0	3.367	5.147	6.436
TotalSulfurDioxide	12795	4.993	0.903	0	4.605	5.572	6.964
Sulphates	12795	0.562	0.304	0	0.358	0.737	1.656
Alcohol	12795	2.382	0.388	0	2.303	2.595	3.314
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
TARGET	12795	3.029	1.926	0	2	4	8
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

#
# # scaled through absolute values
train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
st(train_imp_abs)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	8.063	4.996	0	5.6	9.8	34.4
VolatileAcidity	12795	0.641	0.556	0	0.25	0.91	3.68
CitricAcid	12795	0.686	0.606	0	0.28	0.97	3.86
ResidualSugar	12795	23.326	24.972	0	3.6	38.6	141.15
Chlorides	12795	0.223	0.234	0	0.046	0.368	1.351
FreeSulfurDioxide	12795	106.642	108.069	0	28	171	623
TotalSulfurDioxide	12795	204.015	162.976	0	99	262	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.845	0.652	0	0.43	1.09	4.24
Alcohol	12795	10.526	3.642	0	9	12.4	26.5
LabelAppeal	12795
… 1	504	3.9%
… 2	3136	24.5%
… 3	5617	43.9%
… 4	3048	23.8%
… 5	490	3.8%
AcidIndex	12795	4.773	1.324	1	4	5	14
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

#Model 1 Linear Regression - 
linear_r1 <- lm(formula=TARGET ~
                   #FixedAcidity
                  +VolatileAcidity
                  #+CitricAcid
                  #+ResidualSugar
                  +Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  +Density
                  +pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, data=train_imputed)

summary(linear_r1)

## 
## Call:
## lm(formula = TARGET ~ +VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9533 -0.8596  0.0234  0.8432  6.1683 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.293e+00  4.433e-01   5.174 2.32e-07 ***
## VolatileAcidity    -9.466e-02  1.477e-02  -6.409 1.52e-10 ***
## Chlorides          -1.295e-01  3.630e-02  -3.569  0.00036 ***
## FreeSulfurDioxide   2.503e-04  7.782e-05   3.216  0.00130 ** 
## TotalSulfurDioxide  2.327e-04  5.003e-05   4.651 3.34e-06 ***
## Density            -8.135e-01  4.357e-01  -1.867  0.06188 .  
## pH                 -3.954e-02  1.698e-02  -2.329  0.01990 *  
## Sulphates          -3.428e-02  1.243e-02  -2.759  0.00581 ** 
## Alcohol             1.244e-02  3.100e-03   4.014 6.00e-05 ***
## LabelAppeal-1       3.600e-01  6.283e-02   5.730 1.03e-08 ***
## LabelAppeal0        8.268e-01  6.126e-02  13.495  < 2e-16 ***
## LabelAppeal1        1.291e+00  6.399e-02  20.169  < 2e-16 ***
## LabelAppeal2        1.882e+00  8.431e-02  22.317  < 2e-16 ***
## AcidIndex          -1.991e-01  8.933e-03 -22.282  < 2e-16 ***
## STARS1              1.363e+00  3.291e-02  41.415  < 2e-16 ***
## STARS2              2.398e+00  3.199e-02  74.979  < 2e-16 ***
## STARS3              2.963e+00  3.705e-02  79.971  < 2e-16 ***
## STARS4              3.647e+00  5.923e-02  61.574  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.305 on 12777 degrees of freedom
## Multiple R-squared:  0.5414, Adjusted R-squared:  0.5408 
## F-statistic: 887.3 on 17 and 12777 DF,  p-value: < 2.2e-16

# plot() on an lm object draws base-graphics diagnostics, so a ggplot2 theme cannot be added to it
plot(linear_r1)

vif(linear_r1)

##                        GVIF Df GVIF^(1/(2*Df))
## VolatileAcidity    1.006766  1        1.003377
## Chlorides          1.003662  1        1.001829
## FreeSulfurDioxide  1.003978  1        1.001987
## TotalSulfurDioxide 1.004660  1        1.002327
## Density            1.003654  1        1.001825
## pH                 1.005040  1        1.002517
## Sulphates          1.002371  1        1.001185
## Alcohol            1.007712  1        1.003849
## LabelAppeal        1.118877  4        1.014140
## AcidIndex          1.050180  1        1.024783
## STARS              1.167548  4        1.019552

Linear regression #2

I then tried the other three transformations (scaled, scaled and logged, and absolute values), switching between them to find the best one. All of them gave about the same R-squared as model #1. I also tried converting “AcidIndex” into a factor: all of its coefficients were negative, as expected, and levels 10 to 13 appeared borderline statistically significant. The R-squared was 0.5433; the fit printed below, which uses the Box-Cox dataset (train_abs_bc), reports 0.5371.
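
One way to organize this kind of comparison is to fit the same formula to each candidate dataset and collect the scores in one place. The sketch below does this on simulated data; the data frames in `candidates` are hypothetical stand-ins, not the wine datasets. It also hints at why some variants score almost the same: shifting a predictor by a constant is a linear change and leaves the fit of a linear model untouched, while taking absolute values discards the sign and can change the fit.

```r
# Simulated predictor and response (assumption: illustrative only)
set.seed(7)
n <- 300
raw <- data.frame(x = rnorm(n, sd = 2))
raw$y <- 3 + 1.5 * raw$x + rnorm(n)

# Hypothetical stand-ins for the transformed datasets
candidates <- list(
  raw     = raw,
  abs     = transform(raw, x = abs(x)),           # sign information is lost
  shifted = transform(raw, x = x - min(x) + 1)    # min-plus-constant shift
)

# Fit the same model to each candidate and collect the adjusted R-squared
adj_r2 <- sapply(candidates, function(d) {
  summary(lm(y ~ x, data = d))$adj.r.squared
})
print(round(adj_r2, 4))
```

Keeping the candidates in a named list makes it easy to add a dataset or swap the formula without copying the model call.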

#Model#2 Linear Regression
# available dataframes:

# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values --> train_imp_abs
###############################################################################

# Use for checking if "AcidIndex" as numeric or factor makes any difference

###############################################################################

#raw --> train_imputed
train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)
st(train_imputed)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	7.076	6.318	-18.1	5.2	9.5	34.4
VolatileAcidity	12795	0.324	0.784	-2.79	0.13	0.64	3.68
CitricAcid	12795	0.308	0.862	-3.24	0.03	0.58	3.86
ResidualSugar	12795	5.44	33.737	-127.8	-1.8	15.85	141.15
Chlorides	12795	0.056	0.319	-1.171	-0.029	0.156	1.351
FreeSulfurDioxide	12795	31.198	148.59	-555	0	70	623
TotalSulfurDioxide	12795	121.383	231.196	-823	27.5	208	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.525	0.93	-3.13	0.28	0.86	4.24
Alcohol	12795	10.493	3.738	-4.7	9	12.4	26.5
LabelAppeal	12795
… -2	504	3.9%
… -1	3136	24.5%
… 0	5617	43.9%
… 1	3048	23.8%
… 2	490	3.8%
AcidIndex	12795	4.773	1.324	1	4	5	14
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

# # scaled through minplusconstant --> train_imp_plusminconst
train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
st(train_imp_plusminconst)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	26.176	6.318	1	24.3	28.6	53.5
VolatileAcidity	12795	4.114	0.784	1	3.92	4.43	7.47
CitricAcid	12795	4.548	0.862	1	4.27	4.82	8.1
ResidualSugar	12795	134.24	33.737	1	127	144.65	269.95
Chlorides	12795	2.227	0.319	1	2.142	2.327	3.522
FreeSulfurDioxide	12795	587.198	148.59	1	556	626	1179
TotalSulfurDioxide	12795	945.383	231.196	1	851.5	1032	1881
Density	12795	2.882	0.027	2.776	2.876	2.889	2.987
pH	12795	4.687	0.681	1.96	4.43	4.95	7.61
Sulphates	12795	4.655	0.93	1	4.41	4.99	8.37
Alcohol	12795	16.193	3.738	1	14.7	18.1	32.2
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
AcidIndex	12795	4.773	1.324	1	4	5	14
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

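#
# # scaled (not all) and logged (continuous variables) --> train_abslog
# NOTE: the right-hand side reads train_abs, not train_abslog, so AcidIndex here
# stays on the raw 4-17 scale (see the table below).
train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)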
st(train_abslog)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
Density	12795	0.69	0.013	0.636	0.687	0.693	0.742
AcidIndex	12795	7.773	1.324	4	7	8	17
pH	12795	1.423	0.171	0.392	1.374	1.497	1.964
FixedAcidity	12795	2.04	0.617	0	1.887	2.38	3.567
VolatileAcidity	12795	0.449	0.293	0	0.223	0.647	1.543
CitricAcid	12795	0.47	0.31	0	0.247	0.678	1.581
ResidualSugar	12795	2.597	1.17	0	1.526	3.679	4.957
Chlorides	12795	0.185	0.174	0	0.045	0.313	0.855
FreeSulfurDioxide	12795	4.14	1.115	0	3.367	5.147	6.436
TotalSulfurDioxide	12795	4.993	0.903	0	4.605	5.572	6.964
Sulphates	12795	0.562	0.304	0	0.358	0.737	1.656
Alcohol	12795	2.382	0.388	0	2.303	2.595	3.314
LabelAppeal	12795
… 0	504	3.9%
… 1	3136	24.5%
… 2	5617	43.9%
… 3	3048	23.8%
… 4	490	3.8%
TARGET	12795	3.029	1.926	0	2	4	8
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

#
# scaled through absolute values
train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
st(train_imp_abs)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3.029	1.926	0	2	4	8
FixedAcidity	12795	8.063	4.996	0	5.6	9.8	34.4
VolatileAcidity	12795	0.641	0.556	0	0.25	0.91	3.68
CitricAcid	12795	0.686	0.606	0	0.28	0.97	3.86
ResidualSugar	12795	23.326	24.972	0	3.6	38.6	141.15
Chlorides	12795	0.223	0.234	0	0.046	0.368	1.351
FreeSulfurDioxide	12795	106.642	108.069	0	28	171	623
TotalSulfurDioxide	12795	204.015	162.976	0	99	262	1057
Density	12795	0.994	0.027	0.888	0.988	1.001	1.099
pH	12795	3.207	0.681	0.48	2.95	3.47	6.13
Sulphates	12795	0.845	0.652	0	0.43	1.09	4.24
Alcohol	12795	10.526	3.642	0	9	12.4	26.5
LabelAppeal	12795
… 1	504	3.9%
… 2	3136	24.5%
… 3	5617	43.9%
… 4	3048	23.8%
… 5	490	3.8%
AcidIndex	12795	4.773	1.324	1	4	5	14
STARS	12795
… 0	3359	26.3%
… 1	3042	23.8%
… 2	3570	27.9%
… 3	2212	17.3%
… 4	612	4.8%

#Model 2 Linear Regression - 
linear_r2 <- lm(formula=TARGET^(bcVal) ~
                  #FixedAcidity
                  +VolatileAcidity
                  +CitricAcid
                  #+ResidualSugar
                  #+Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, data=train_abs_bc)


summary(linear_r2)

## 
## Call:
## lm(formula = TARGET^(bcVal) ~ +VolatileAcidity + CitricAcid + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abs_bc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8554 -1.4118  0.0652  1.3512  9.7439 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.815e+00  2.163e+00   1.764 0.077774 .  
## VolatileAcidity    -9.129e-02  1.377e-02  -6.628 3.55e-11 ***
## CitricAcid          1.873e-02  1.235e-02   1.517 0.129183    
## FreeSulfurDioxide   1.076e-04  2.935e-05   3.668 0.000246 ***
## TotalSulfurDioxide  7.433e-05  1.722e-05   4.317 1.60e-05 ***
## Density            -8.526e-01  4.273e-01  -1.995 0.046042 *  
## Sulphates          -2.797e-02  1.138e-02  -2.459 0.013952 *  
## Alcohol             9.638e-03  2.297e-03   4.196 2.74e-05 ***
## LabelAppeal         7.311e-01  2.044e-02  35.771  < 2e-16 ***
## AcidIndex          -1.654e-01  7.340e-03 -22.529  < 2e-16 ***
## STARS               1.465e+00  1.563e-02  93.706  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.981 on 12784 degrees of freedom
## Multiple R-squared:  0.5371, Adjusted R-squared:  0.5368 
## F-statistic:  1484 on 10 and 12784 DF,  p-value: < 2.2e-16

plot(linear_r2)

vif(linear_r2)

##    VolatileAcidity         CitricAcid  FreeSulfurDioxide TotalSulfurDioxide 
##           1.006155           1.006099           1.002760           1.004454 
##            Density          Sulphates            Alcohol        LabelAppeal 
##           1.002896           1.002084           1.006064           1.081947 
##          AcidIndex              STARS 
##           1.048475           1.122253

Poisson Model #1

For the first Poisson model, I used the absolute-value transformation of the independent variables, based on the distributions examined earlier. Several indicators suggested a reasonably good model. The estimated dispersion was below 1 (about 0.88), so there was no sign of overdispersion, although the two-sided DHARMa test does flag this underdispersion as significant. A second check on the simulated residuals produced a flat line. I also removed the variables with high p-values, but the fit did not change much.
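
As a rough cross-check on the dispersion, the Pearson statistic can be computed directly from a fitted model. The sketch below uses simulated counts (the names `x`, `y`, and `fit` are hypothetical); it is a related estimate, not necessarily identical to what `dispersiontest()` or DHARMa report. Values near 1 are what a Poisson model assumes, values well above 1 point to overdispersion, and values below 1 to underdispersion.

```r
# Simulated counts that really are Poisson (assumption: illustrative only)
set.seed(123)
n <- 2000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion: sum of squared Pearson residuals over residual degrees of freedom
pearson_dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(pearson_dispersion)
```

Applied to a fitted model such as `poisson_1`, the same two lines would give a quick number to set beside the formal tests.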

# Poisson #1

# available dataframes:

# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values --> train_imp_abs
###############################################################################

# Use for checking if "AcidIndex" as numeric or factor makes any difference

###############################################################################

# raw --> train_imputed
# train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
# train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)
# st(train_imputed)
# 
# # # scaled through minplusconstant --> train_imp_plusminconst
# train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
# train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
# st(train_imp_plusminconst)
# #
# # # scaled (not all) and logged (continuous variables) --> train_abslog
# train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
# train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)
# st(train_abslog)
# #
# # # scaled through absolute values
# train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
# train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
# st(train_imp_abs)

#Model 1 Poisson Regression - 
poisson_1 <- glm(formula=TARGET ~
                  # FixedAcidity
                  +VolatileAcidity
                  #+CitricAcid
                  #+ResidualSugar
                  #+Chlorides
                  #+FreeSulfurDioxide
                  +TotalSulfurDioxide
                  #+Density
                  #+pH
                  #+Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 family = poisson,
                 data=train_imp_abs)

summary(poisson_1)

## 
## Call:
## glm(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
##     data = train_imp_abs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2543  -0.6402  -0.0075   0.4515   3.7857  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.538e-01  4.797e-02   3.206  0.00135 ** 
## VolatileAcidity    -3.705e-02  9.396e-03  -3.944 8.03e-05 ***
## TotalSulfurDioxide  8.467e-05  3.116e-05   2.718  0.00658 ** 
## Alcohol             3.983e-03  1.403e-03   2.839  0.00453 ** 
## LabelAppeal2        2.357e-01  3.798e-02   6.205 5.45e-10 ***
## LabelAppeal3        4.254e-01  3.705e-02  11.480  < 2e-16 ***
## LabelAppeal4        5.582e-01  3.769e-02  14.813  < 2e-16 ***
## LabelAppeal5        6.946e-01  4.243e-02  16.372  < 2e-16 ***
## AcidIndex          -8.031e-02  4.497e-03 -17.861  < 2e-16 ***
## STARS1              7.702e-01  1.953e-02  39.439  < 2e-16 ***
## STARS2              1.089e+00  1.823e-02  59.749  < 2e-16 ***
## STARS3              1.210e+00  1.918e-02  63.090  < 2e-16 ***
## STARS4              1.330e+00  2.428e-02  54.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 13669  on 12782  degrees of freedom
## AIC: 45637
## 
## Number of Fisher Scoring iterations: 6

dispersiontest(poisson_1)

## 
##  Overdispersion test
## 
## data:  poisson_1
## z = -8.9098, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.8848692

sim_p1 <- simulateResiduals(poisson_1, refit = TRUE)
testDispersion(sim_p1)

## 
##  DHARMa nonparametric dispersion test via mean deviance residual fitted
##  vs. simulated-refitted
## 
## data:  simulationOutput
## dispersion = 0.88397, p-value < 2.2e-16
## alternative hypothesis: two.sided

plot(sim_p1)

plot(poisson_1)

knitr::kable(vif(poisson_1), "html")

	GVIF	Df	GVIF^(1/(2*Df))
VolatileAcidity	1.004014	1	1.002005
TotalSulfurDioxide	1.003350	1	1.001673
Alcohol	1.010637	1	1.005304
LabelAppeal	1.133631	4	1.015802
AcidIndex	1.025613	1	1.012726
STARS	1.165661	4	1.019346
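
Poisson coefficients are on the log scale, so exponentiating them gives multiplicative effects on the expected count. The short sketch below copies three estimates from the `summary(poisson_1)` output above into a hypothetical vector `coefs` and converts them to rate ratios and percentage changes. For the STARS1 term the comparison is against the baseline level, STARS 0.

```r
# Estimates copied from the summary(poisson_1) output above
coefs <- c(VolatileAcidity = -3.705e-02,
           AcidIndex       = -8.031e-02,
           STARS1          =  7.702e-01)

# Rate ratio: multiplicative change in expected cases per one-unit increase
rate_ratio <- exp(coefs)

# The same effect expressed as a percentage change
pct_change <- 100 * (rate_ratio - 1)

print(round(rate_ratio, 3))
print(round(pct_change, 1))
```

Read this way, each extra AcidIndex unit goes with roughly 8 percent fewer expected cases, and a one-star wine is expected to sell a little more than twice as many cases as a wine in the STARS 0 group.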

Poisson Model #2

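I tested the remaining transformed datasets with this model, but the results were not much different. For this one, I kept the Box-Cox dataset (train_abs_bc), which builds on the min-plus-constant (+1) transformation of the predictors. The residual deviance changed (13669 versus 7692), but TARGET is shifted by +1 in this dataset, so the deviance and AIC are not directly comparable with Poisson model #1. The dispersion estimate was again below 1 (about 0.57), so there was no sign of overdispersion and the model appears to fit acceptably. As in the other models, the coefficients behave similarly: AcidIndex, for example, is negatively associated with the expected number of cases sold.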

#Model 2 Poisson Regression - 
poisson_2 <- glm(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  +Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  +pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 family = poisson,
                 data=train_abs_bc)

summary(poisson_2)

## 
## Call:
## glm(formula = TARGET ~ +VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + pH + Sulphates + Alcohol + LabelAppeal + 
##     AcidIndex + STARS, family = poisson, data = train_abs_bc)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.35957  -0.60161   0.04921   0.48184   2.80743  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.502e-01  7.581e-02   9.896  < 2e-16 ***
## VolatileAcidity    -1.558e-02  3.481e-03  -4.475 7.65e-06 ***
## Chlorides          -2.337e-02  9.321e-03  -2.508  0.01215 *  
## FreeSulfurDioxide   2.057e-05  7.357e-06   2.795  0.00518 ** 
## TotalSulfurDioxide  1.395e-05  4.337e-06   3.217  0.00130 ** 
## pH                 -7.838e-03  3.909e-03  -2.005  0.04495 *  
## Sulphates          -5.375e-03  2.868e-03  -1.874  0.06094 .  
## Alcohol             9.068e-04  5.787e-04   1.567  0.11710    
## LabelAppeal         1.019e-01  5.225e-03  19.507  < 2e-16 ***
## AcidIndex          -3.397e-02  2.056e-03 -16.524  < 2e-16 ***
## STARS               2.343e-01  3.900e-03  60.062  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 13818  on 12794  degrees of freedom
## Residual deviance:  7692  on 12784  degrees of freedom
## AIC: 47576
## 
## Number of Fisher Scoring iterations: 5
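
Because the Poisson model uses a log link, exponentiating a coefficient gives the multiplicative change in expected cases sold for a one-unit increase in that predictor. The following is a minimal sketch using three estimates copied from the summary above; the vector name `coefs` is only illustrative.

```r
# Estimates copied from summary(poisson_2); the vector name is illustrative.
coefs <- c(AcidIndex = -3.397e-02, LabelAppeal = 1.019e-01, STARS = 2.343e-01)

# exp(beta) is the rate ratio for a one-unit increase in the predictor.
rate_ratio <- exp(coefs)

# Express the rate ratio as a percent change in expected cases sold.
pct_change <- 100 * (rate_ratio - 1)
print(round(pct_change, 1))
```

Under these estimates, each additional STARS point corresponds to roughly 26% more expected cases, and each additional AcidIndex point to roughly 3% fewer.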

dispersiontest(poisson_2)

## 
##  Overdispersion test
## 
## data:  poisson_2
## z = -60.603, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.5650852
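
A dispersion estimate below 1 points toward underdispersion, which is why the one-sided overdispersion test returns a p-value of 1. A common quick check is the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below runs that check on simulated data (not the wine data); `pearson_dispersion` is a hypothetical helper, and its estimator is related to, but not identical to, the one used by `dispersiontest`.

```r
# Hypothetical helper: Pearson chi-square divided by residual degrees of freedom.
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

set.seed(1)
n <- 2000
x <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)

# Equidispersed counts: variance equals the mean.
y_pois <- rpois(n, mu)
# Overdispersed counts: negative binomial with the same mean.
y_nb <- rnbinom(n, size = 1, mu = mu)

fit_pois <- glm(y_pois ~ x, family = poisson)
fit_nb   <- glm(y_nb ~ x, family = poisson)

disp_pois <- pearson_dispersion(fit_pois)  # expected to be near 1
disp_nb   <- pearson_dispersion(fit_nb)    # expected to be well above 1
print(c(poisson = disp_pois, overdispersed = disp_nb))
```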

sim_p2 <- simulateResiduals(poisson_2, refit = TRUE)
testDispersion(sim_p2)

## 
##  DHARMa nonparametric dispersion test via mean deviance residual fitted
##  vs. simulated-refitted
## 
## data:  simulationOutput
## dispersion = 0.56014, p-value < 2.2e-16
## alternative hypothesis: two.sided

plot(sim_p2)

## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

plot(poisson_2)

knitr::kable(vif(poisson_2), "html")

	x
VolatileAcidity	1.005062
Chlorides	1.002222
FreeSulfurDioxide	1.002172
TotalSulfurDioxide	1.002419
pH	1.004460
Sulphates	1.001323
Alcohol	1.008729
LabelAppeal	1.108977
AcidIndex	1.036240
STARS	1.144527
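
All variance inflation factors are close to 1, so multicollinearity does not appear to be a concern. For intuition, the VIF of a predictor in a linear model is 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below computes this by hand on simulated data; `vif_manual` is a hypothetical helper, and `vif()` applied to a GLM works from the fitted coefficient covariance, so its values can differ slightly from this linear-model version.

```r
# Hypothetical helper: VIF for each column of a numeric data frame.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(2)
n <- 500
a <- rnorm(n)
b <- rnorm(n)                      # independent of a
c_corr <- a + rnorm(n, sd = 0.3)   # strongly correlated with a

vifs <- vif_manual(data.frame(a = a, b = b, c_corr = c_corr))
print(round(vifs, 2))
```

In this simulated example, `b` stays near 1 while `a` and `c_corr` are inflated because they carry nearly the same information.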

Negative Binomial Model #1 and #2

These models differed little from the Poisson fits. The estimated theta is very large (about 109,000, with the iteration limit reached while fitting it), which means the negative binomial fit is effectively collapsing to a Poisson model, consistent with the absence of overdispersion. At this point, I considered a zero-inflated model because TARGET and several other variables contain a sizable share of zeros. I used the log-transformed and the absolute-value datasets for the predictors; again, there were no significant differences between the two. The zero-inflated fit was broadly similar, except that in the zero-inflation component the STARS levels for 3 and 4 stars were not significant. This result made me question the previous models, and it appears that deviance begins to increase after a certain point.

#Model 1 Negative Binomial Model - 
nb_1 <- glm.nb(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  #+Chlorides
                  +FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 data=train_abs_bc
               #link=log
               )

summary(nb_1)

## 
## Call:
## glm.nb(formula = TARGET ~ +VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + LabelAppeal + 
##     AcidIndex + STARS, data = train_abs_bc, init.theta = 109478.1766, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.36553  -0.59808   0.04932   0.48309   2.80871  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         5.901e-01  5.671e-02  10.407  < 2e-16 ***
## VolatileAcidity    -1.572e-02  3.481e-03  -4.517 6.27e-06 ***
## FreeSulfurDioxide   2.087e-05  7.356e-06   2.838  0.00454 ** 
## TotalSulfurDioxide  1.401e-05  4.337e-06   3.231  0.00123 ** 
## Sulphates          -5.352e-03  2.868e-03  -1.866  0.06203 .  
## Alcohol             9.515e-04  5.784e-04   1.645  0.09996 .  
## LabelAppeal         1.016e-01  5.224e-03  19.451  < 2e-16 ***
## AcidIndex          -3.381e-02  2.051e-03 -16.482  < 2e-16 ***
## STARS               2.346e-01  3.898e-03  60.180  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(109478.2) family taken to be 1)
## 
##     Null deviance: 13817.9  on 12794  degrees of freedom
## Residual deviance:  7701.9  on 12786  degrees of freedom
## AIC: 47584
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  109478 
##           Std. Err.:  119125 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -47563.93
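
The very large theta explains why this fit is nearly identical to the Poisson model: the negative binomial variance is mu + mu² / theta, so the extra term vanishes as theta grows. A minimal sketch, using the theta reported above and an arbitrary illustrative mean of 3 cases:

```r
mu <- 3          # illustrative mean, not estimated from the wine data
theta <- 109478  # theta reported for nb_1

# Negative binomial variance: mu + mu^2 / theta.
nb_var <- mu + mu^2 / theta

# Compare the two probability mass functions over 0..15 cases.
k <- 0:15
max_diff <- max(abs(dnbinom(k, size = theta, mu = mu) - dpois(k, lambda = mu)))
print(c(nb_variance = nb_var, max_pmf_difference = max_diff))
```

With theta this large, the two distributions are practically indistinguishable, so the negative binomial adds a parameter without improving the fit.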

plot(nb_1)

knitr::kable(vif(nb_1), "html")

	x
VolatileAcidity	1.004856
FreeSulfurDioxide	1.001905
TotalSulfurDioxide	1.002394
Sulphates	1.001305
Alcohol	1.008101
LabelAppeal	1.108525
AcidIndex	1.032101
STARS	1.143609

#Model 2 Negative Binomial Model --
nb_2 <- glm.nb(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  #+Chlorides
                  #+FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 data=train_abs_bc 
               #link=log
               )

summary(nb_2)

## 
## Call:
## glm.nb(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide + 
##     Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abs_bc, 
##     init.theta = 109393.0652, link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.35227  -0.59700   0.05004   0.48066   2.80722  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         6.342e-01  5.454e-02  11.629  < 2e-16 ***
## VolatileAcidity    -1.576e-02  3.481e-03  -4.528 5.95e-06 ***
## TotalSulfurDioxide  1.412e-05  4.337e-06   3.255  0.00113 ** 
## Sulphates          -5.287e-03  2.868e-03  -1.844  0.06524 .  
## Alcohol             9.209e-04  5.784e-04   1.592  0.11136    
## LabelAppeal         1.019e-01  5.224e-03  19.500  < 2e-16 ***
## AcidIndex          -3.401e-02  2.050e-03 -16.587  < 2e-16 ***
## STARS               2.346e-01  3.898e-03  60.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(109393.1) family taken to be 1)
## 
##     Null deviance: 13818  on 12794  degrees of freedom
## Residual deviance:  7710  on 12787  degrees of freedom
## AIC: 47590
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  109393 
##           Std. Err.:  119037 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -47571.97

plot(nb_2)

knitr::kable(vif(nb_2), "html")

	x
VolatileAcidity	1.004839
TotalSulfurDioxide	1.002319
Sulphates	1.001239
Alcohol	1.007742
LabelAppeal	1.108272
AcidIndex	1.031013
STARS	1.143736

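Negative Binomial Model #3 (Zero-Inflated)

# Model 3: zero-inflated model (the NAs imputed as zeros add many zeros, and TARGET has quite a few as well)
# Note: zeroinfl() defaults to dist = "poisson", so the count component below is Poisson, not negative binomial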
nb_3 <- zeroinfl(formula=TARGET ~
                  #FixedAcidity
                  +VolatileAcidity
                  # +CitricAcid
                  # +ResidualSugar
                  #+Chlorides
                  #+FreeSulfurDioxide
                  +TotalSulfurDioxide
                  # +Density
                  #+pH
                  +Sulphates
                  +Alcohol
                  +LabelAppeal
                  +AcidIndex
                  +STARS, 
                 data=train_abslog 
               #link=log
               )

summary(nb_3)

## 
## Call:
## zeroinfl(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abslog)
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.281455 -0.427760  0.004961  0.383624  5.883963 
## 
## Count model coefficients (poisson with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.460517   0.074704   6.165 7.07e-10 ***
## VolatileAcidity    -0.028028   0.017982  -1.559  0.11909    
## TotalSulfurDioxide -0.001421   0.006141  -0.231  0.81694    
## Sulphates           0.011055   0.017086   0.647  0.51760    
## Alcohol             0.060271   0.013780   4.374 1.22e-05 ***
## LabelAppeal1        0.437858   0.041171  10.635  < 2e-16 ***
## LabelAppeal2        0.726474   0.040239  18.054  < 2e-16 ***
## LabelAppeal3        0.916378   0.040909  22.400  < 2e-16 ***
## LabelAppeal4        1.073404   0.045439  23.623  < 2e-16 ***
## AcidIndex          -0.020142   0.004822  -4.177 2.95e-05 ***
## STARS1              0.061189   0.021127   2.896  0.00378 ** 
## STARS2              0.183247   0.019726   9.289  < 2e-16 ***
## STARS3              0.281102   0.020657  13.608  < 2e-16 ***
## STARS4              0.380375   0.025581  14.869  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.72883    0.48751  -9.700  < 2e-16 ***
## VolatileAcidity      0.44829    0.11488   3.902 9.53e-05 ***
## TotalSulfurDioxide  -0.27086    0.03554  -7.622 2.50e-14 ***
## Sulphates            0.44717    0.11119   4.022 5.78e-05 ***
## Alcohol              0.23974    0.09008   2.661  0.00778 ** 
## LabelAppeal1         1.45212    0.31942   4.546 5.47e-06 ***
## LabelAppeal2         2.19766    0.31662   6.941 3.89e-12 ***
## LabelAppeal3         2.90407    0.32204   9.018  < 2e-16 ***
## LabelAppeal4         3.33806    0.37494   8.903  < 2e-16 ***
## AcidIndex            0.41565    0.02565  16.202  < 2e-16 ***
## STARS1              -2.09320    0.07659 -27.330  < 2e-16 ***
## STARS2              -5.71200    0.32653 -17.493  < 2e-16 ***
## STARS3             -20.23807  341.00123  -0.059  0.95267    
## STARS4             -20.37694  644.74712  -0.032  0.97479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 33 
## Log-likelihood: -2.035e+04 on 28 Df
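
A zero-inflated Poisson model mixes a point mass at zero (with probability pi from the logit component) with an ordinary Poisson count. The sketch below writes out that probability mass function with illustrative parameter values (not fitted from the wine data); `dzip` is a hypothetical helper. It also shows why the STARS3 and STARS4 zero-inflation terms come with enormous standard errors: a shift of about -20 on the logit scale pushes the structural-zero probability to essentially 0, which likely reflects near-complete separation rather than the absence of an effect.

```r
# Zero-inflated Poisson:
#   P(Y = 0) = pi + (1 - pi) * exp(-lambda)
#   P(Y = k) = (1 - pi) * dpois(k, lambda) for k >= 1
dzip <- function(k, lambda, pi) {
  ifelse(k == 0,
         pi + (1 - pi) * dpois(0, lambda),
         (1 - pi) * dpois(k, lambda))
}

# Illustrative values only.
lambda <- 3
pi_zero <- 0.2
probs <- dzip(0:50, lambda, pi_zero)

# A logit near -20 maps to a probability that is effectively 0.
p_structural <- plogis(-20)
print(c(total_mass = sum(probs), p_zero = probs[1], p_structural = p_structural))
```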

Model Selection:

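Based on the information above, I selected Poisson model #2. Its dispersion estimate is below 1 (0.57), and the overdispersion test returns a p-value of 1, so there is no evidence of overdispersion. The AIC (47576) looks large in absolute terms, but AIC is only meaningful relative to other models fit to the same data; here it is slightly lower than the AIC of the two negative binomial fits (47584 and 47590).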
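
Since AIC is read in relative terms, it can help to line up the values reported in the summaries above and look at the differences from the best one. A minimal sketch; the vector names are only labels for the models fit on `train_abs_bc`:

```r
# AIC values copied from the model summaries above.
aic <- c(poisson_2 = 47576, nb_1 = 47584, nb_2 = 47590)

# Difference from the lowest AIC; smaller is better.
delta_aic <- aic - min(aic)
print(sort(delta_aic))
```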

#Prepare Evaluation dataset

# Treat missing STARS as its own level (0), then convert both ratings to factors
eval$STARS[is.na(eval$STARS)] <- 0
eval$STARS <- as.factor(eval$STARS)
eval$LabelAppeal <- as.factor(eval$LabelAppeal)
st(eval)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	0
… No	0	NaN%
… Yes	0	NaN%
FixedAcidity	3335	6.864	6.318	-18.2	5.2	9	33.5
VolatileAcidity	3335	0.31	0.807	-2.83	0.08	0.63	3.61
CitricAcid	3335	0.312	0.871	-3.12	0	0.605	3.76
ResidualSugar	3167	5.319	34.371	-128.3	-2.6	17.2	145.4
Chlorides	3197	0.061	0.314	-1.15	0.016	0.171	1.263
FreeSulfurDioxide	3183	34.947	149.633	-563	3	79.25	617
TotalSulfurDioxide	3178	123.41	225.8	-769	27.25	210	1004
Density	3335	0.995	0.026	0.89	0.988	1.001	1.1
pH	3231	3.237	0.676	0.6	2.98	3.49	6.21
Sulphates	3025	0.535	0.905	-3.07	0.33	0.82	4.18
Alcohol	3150	10.584	3.759	-4.2	9	12.5	25.6
LabelAppeal	3335
… -2	114	3.4%
… -1	810	24.3%
… 0	1470	44.1%
… 1	799	24%
… 2	142	4.3%
AcidIndex	3335	7.748	1.315	5	7	8	17
STARS	3335
… 0	841	25.2%
… 1	828	24.8%
… 2	902	27%
… 3	600	18%
… 4	164	4.9%

eval_abs <-eval

# Take absolute values of the measurement columns that contain negative readings
abs_cols <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
              "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide",
              "Sulphates", "Alcohol")
eval_abs[abs_cols] <- lapply(eval_abs[abs_cols], abs)

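# Transform LabelAppeal too.
# Note: LabelAppeal is already a factor here, so as.numeric() returns the
# level codes 1 to 5 (not the original -2 to 2 scores), and abs() leaves them unchanged.
eval_abs$LabelAppeal <- as.numeric(eval_abs$LabelAppeal)
eval_abs$LabelAppeal <- abs(eval_abs$LabelAppeal)
#eval_abs$LabelAppeal + abs(min(eval_abs$LabelAppeal))-2

eval_abs$LabelAppeal <- as.factor(eval_abs$LabelAppeal)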

st(eval_abs) #run this to make sure each variable worked after

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	0
… No	0	NaN%
… Yes	0	NaN%
FixedAcidity	3335	7.967	4.854	0	5.7	9.4	33.5
VolatileAcidity	3335	0.654	0.565	0	0.25	0.93	3.61
CitricAcid	3335	0.697	0.609	0	0.29	1	3.76
ResidualSugar	3167	23.775	25.381	0.1	3.5	38.5	145.4
Chlorides	3197	0.221	0.231	0	0.045	0.369	1.263
FreeSulfurDioxide	3183	107.2	110.075	0	27	174	617
TotalSulfurDioxide	3178	201.511	160.003	0	97	261	1004
Density	3335	0.995	0.026	0.89	0.988	1.001	1.1
pH	3231	3.237	0.676	0.6	2.98	3.49	6.21
Sulphates	3025	0.833	0.641	0	0.43	1.06	4.18
Alcohol	3150	10.614	3.672	0	9	12.5	25.6
LabelAppeal	3335
… 1	114	3.4%
… 2	810	24.3%
… 3	1470	44.1%
… 4	799	24%
… 5	142	4.3%
AcidIndex	3335	7.748	1.315	5	7	8	17
STARS	3335
… 0	841	25.2%
… 1	828	24.8%
… 2	902	27%
… 3	600	18%
… 4	164	4.9%
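
The LabelAppeal levels in the table above run from 1 to 5 because `as.numeric()` applied to a factor returns the internal level codes rather than the labels. The sketch below shows the difference on a small stand-alone factor and one way to shift the scores onto a non-negative scale while keeping their order. Which coding is appropriate is an assumption here: the levels must match those used when the model was trained, otherwise `predict()` will reject the unseen levels.

```r
# A factor holding the original LabelAppeal scores.
label <- factor(c(-2, -1, 0, 1, 2))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(label)                  # 1 2 3 4 5

# Converting through character recovers the original scores.
scores <- as.numeric(as.character(label))   # -2 -1 0 1 2

# Shifting by the minimum gives a non-negative scale that keeps the ordering.
shifted <- scores - min(scores)             # 0 1 2 3 4
print(rbind(codes, scores, shifted))
```

Taking `abs()` of the recovered scores instead would merge -1 with 1 and -2 with 2, which is why the shift is used in this sketch.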

summary(eval_abs)  # make sure nothing broke

##   TARGET         FixedAcidity    VolatileAcidity    CitricAcid    
##  Mode:logical   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  NA's:3335      1st Qu.: 5.700   1st Qu.:0.2500   1st Qu.:0.2900  
##                 Median : 7.000   Median :0.4200   Median :0.4400  
##                 Mean   : 7.967   Mean   :0.6542   Mean   :0.6969  
##                 3rd Qu.: 9.400   3rd Qu.:0.9300   3rd Qu.:1.0000  
##                 Max.   :33.500   Max.   :3.6100   Max.   :3.7600  
##                                                                   
##  ResidualSugar      Chlorides      FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :  0.10   Min.   :0.0000   Min.   :  0.0     Min.   :   0.0    
##  1st Qu.:  3.50   1st Qu.:0.0450   1st Qu.: 27.0     1st Qu.:  97.0    
##  Median : 13.50   Median :0.1000   Median : 54.0     Median : 153.0    
##  Mean   : 23.77   Mean   :0.2213   Mean   :107.2     Mean   : 201.5    
##  3rd Qu.: 38.50   3rd Qu.:0.3690   3rd Qu.:174.0     3rd Qu.: 261.0    
##  Max.   :145.40   Max.   :1.2630   Max.   :617.0     Max.   :1004.0    
##  NA's   :168      NA's   :138      NA's   :152       NA's   :157       
##     Density             pH          Sulphates         Alcohol      LabelAppeal
##  Min.   :0.8898   Min.   :0.600   Min.   :0.0000   Min.   : 0.00   1: 114     
##  1st Qu.:0.9883   1st Qu.:2.980   1st Qu.:0.4300   1st Qu.: 9.00   2: 810     
##  Median :0.9946   Median :3.210   Median :0.5900   Median :10.40   3:1470     
##  Mean   :0.9947   Mean   :3.237   Mean   :0.8331   Mean   :10.61   4: 799     
##  3rd Qu.:1.0005   3rd Qu.:3.490   3rd Qu.:1.0600   3rd Qu.:12.50   5: 142     
##  Max.   :1.0998   Max.   :6.210   Max.   :4.1800   Max.   :25.60              
##                   NA's   :104     NA's   :310      NA's   :185                
##    AcidIndex      STARS  
##  Min.   : 5.000   0:841  
##  1st Qu.: 7.000   1:828  
##  Median : 8.000   2:902  
##  Mean   : 7.748   3:600  
##  3rd Qu.: 8.000   4:164  
##  Max.   :17.000          
##

head(eval_abs)

FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
5.4	0.86	0.27	10.7	0.092	23	398	0.985	5.02	0.64	12.3	2	6	0
12.4	0.385	0.76	19.7	1.17	37	68	0.99	3.37	1.09	16	3	6	2
7.2	1.75	0.17	33	0.065	9	76	1.05	4.61	0.68	8.55	3	8	1
6.2	0.1	1.8	1	0.179	104	89	0.989	3.2	2.11	12.3	2	8	1
11.4	0.21	0.28	1.2	0.038	70	53	1.03	2.54	0.07	4.8	3	10	0
17.6	0.04	1.15	1.4	0.535	250	140	0.95	3.06	0.02	11.4	4	8	4

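# Predict with poisson_1, which treats LabelAppeal and STARS as factors, as in the evaluation data prepared above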

eval_abs$TARGET <- predict(poisson_1, newdata = eval_abs, type = "response")
eval_abs$TARGET <- as.numeric(round(eval_abs$TARGET,0))

#inspect projections

head(eval_abs)

TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
1	5.4	0.86	0.27	10.7	0.092	23	398	0.985	5.02	0.64	12.3	2	6	0
3	12.4	0.385	0.76	19.7	1.17	37	68	0.99	3.37	1.09	16	3	6	2
2	7.2	1.75	0.17	33	0.065	9	76	1.05	4.61	0.68	8.55	3	8	1
2	6.2	0.1	1.8	1	0.179	104	89	0.989	3.2	2.11	12.3	2	8	1
1	11.4	0.21	0.28	1.2	0.038	70	53	1.03	2.54	0.07	4.8	3	10	0
4	17.6	0.04	1.15	1.4	0.535	250	140	0.95	3.06	0.02	11.4	4	8	4

predictions <- eval_abs
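
Two details of `predict()` are worth checking before using these projections. First, `type = "response"` returns expected counts, which under the log link are the exponentiated linear predictor. Second, the evaluation data still contains missing values in some predictors (for example Alcohol and TotalSulfurDioxide in the summary above), and rows with a missing predictor receive an NA prediction. The sketch below demonstrates both on simulated data; all object names in it are illustrative.

```r
set.seed(3)
train_demo <- data.frame(x = runif(200, 0, 2))
train_demo$y <- rpois(200, exp(0.3 + 0.5 * train_demo$x))
fit_demo <- glm(y ~ x, family = poisson, data = train_demo)

# New data with one missing predictor value.
new_demo <- data.frame(x = c(0.5, NA, 1.5))

pred_link <- predict(fit_demo, newdata = new_demo, type = "link")
pred_resp <- predict(fit_demo, newdata = new_demo, type = "response")

# type = "response" is exp() of the linear predictor under the log link.
same_scale <- all.equal(exp(pred_link), pred_resp)

# Rows with a missing predictor yield NA predictions.
n_missing <- sum(is.na(pred_resp))
print(list(same_scale = same_scale, n_missing = n_missing))
```

Counting the NA predictions, for instance with `sum(is.na(predictions$TARGET))`, shows how many evaluation rows would need imputation before every wine has a projected case count.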