Introduction
This project aims to estimate the number of cases of wine that a wine seller can expect to sell, based on the variables in the data provided. The dataset is fairly large (12,795 training observations), and most variables are chemical properties of the wine, such as residual sugar and acidity. The premise is that these properties affect how restaurants, or their buyers and sommeliers, evaluate the wines. In other words, the seller wants to know which factors matter most for maximizing potential sales.
Data Exploration 1.0
We loaded the data, converted it into a data frame, and removed the index column.
#get rid of index
train <- train[,-1]
eval <- eval[,-1]
head(train)
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
| 3 | 3.2 | 1.16 | -0.98 | 54.2 | -0.567 | | 268 | 0.993 | 3.33 | -0.59 | 9.9 | 0 | 8 | 2 |
| 3 | 4.5 | 0.16 | -0.81 | 26.1 | -0.425 | 15 | -327 | 1.03 | 3.38 | 0.7 | | -1 | 7 | 3 |
| 5 | 7.1 | 2.64 | -0.88 | 14.8 | 0.037 | 214 | 142 | 0.995 | 3.12 | 0.48 | 22 | -1 | 8 | 3 |
| 3 | 5.7 | 0.385 | 0.04 | 18.8 | -0.425 | 22 | 115 | 0.996 | 2.24 | 1.83 | 6.2 | -1 | 6 | 1 |
| 4 | 8 | 0.33 | -1.26 | 9.4 | | -167 | 108 | 0.995 | 3.12 | 1.77 | 13.7 | 0 | 9 | 2 |
| 0 | 11.3 | 0.32 | 0.59 | 2.2 | 0.556 | -37 | 15 | 0.999 | 3.2 | 1.29 | 15.4 | 0 | 11 | |
head(eval)
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
| | 5.4 | -0.86 | 0.27 | -10.7 | 0.092 | 23 | 398 | 0.985 | 5.02 | 0.64 | 12.3 | -1 | 6 | |
| | 12.4 | 0.385 | -0.76 | -19.7 | 1.17 | -37 | 68 | 0.99 | 3.37 | 1.09 | 16 | 0 | 6 | 2 |
| | 7.2 | 1.75 | 0.17 | -33 | 0.065 | 9 | 76 | 1.05 | 4.61 | 0.68 | 8.55 | 0 | 8 | 1 |
| | 6.2 | 0.1 | 1.8 | 1 | -0.179 | 104 | 89 | 0.989 | 3.2 | 2.11 | 12.3 | -1 | 8 | 1 |
| | 11.4 | 0.21 | 0.28 | 1.2 | 0.038 | 70 | 53 | 1.03 | 2.54 | -0.07 | 4.8 | 0 | 10 | |
| | 17.6 | 0.04 | -1.15 | 1.4 | 0.535 | -250 | 140 | 0.95 | 3.06 | -0.02 | 11.4 | 1 | 8 | 4 |
Data Exploration 2.0
We first evaluated the training dataset. We ran some basic commands and visuals to look at the data structure, descriptive statistics, and missing values.
We initially found that the number of NAs in the dataset is significant (STARS alone is missing in about 26% of rows), so we knew that we would need to impute at some point. In addition, several variables contain negative values, so before modeling we would also need to transform these series onto a non-negative scale.
First, however, we looked at the distribution of all the variables. We standardized them (centered at zero) so that they could be shown together in boxplots. We initially thought to convert categorical variables like STARS and LabelAppeal into factors, but the correlation matrix requires numeric values, so we kept them numeric at this stage. From the histograms, most variables, including STARS and LabelAppeal, look roughly normally distributed; from the correlation plot, these two are closely correlated with our target variable, cases sold.
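The check for negative and missing values described above can be sketched in a few lines of base R. This is a minimal sketch on a made-up toy data frame (`toy` and its values are assumptions for illustration, not the wine data); the same two `sapply()` calls would apply to `train`.

```r
# Toy stand-in for `train`; the values are made up for illustration only.
toy <- data.frame(
  FixedAcidity  = c(3.2, -4.5, 7.1, 5.7),
  ResidualSugar = c(54.2, NA, -14.8, 18.8),
  STARS         = c(2, 3, NA, NA)
)

# Per-column count of negative values (NAs ignored) and of missing values.
n_negative <- sapply(toy, function(x) sum(x < 0, na.rm = TRUE))
n_missing  <- sapply(toy, function(x) sum(is.na(x)))

# One row per variable, so both problems can be read side by side.
data.frame(n_negative, n_missing)
```

Seeing both counts in one table makes it easier to decide which columns need imputation, which need a transformation, and which need both.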
#Data exploration
#summary tables and data structure
summary(train)
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
str(train)
## tibble [12,795 x 15] (S3: tbl_df/tbl/data.frame)
## $ TARGET : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
st(train)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 7.076 | 6.318 | -18.1 | 5.2 | 9.5 | 34.4 |
| VolatileAcidity | 12795 | 0.324 | 0.784 | -2.79 | 0.13 | 0.64 | 3.68 |
| CitricAcid | 12795 | 0.308 | 0.862 | -3.24 | 0.03 | 0.58 | 3.86 |
| ResidualSugar | 12179 | 5.419 | 33.749 | -127.8 | -2 | 15.9 | 141.15 |
| Chlorides | 12157 | 0.055 | 0.318 | -1.171 | -0.031 | 0.153 | 1.351 |
| FreeSulfurDioxide | 12148 | 30.846 | 148.715 | -555 | 0 | 70 | 623 |
| TotalSulfurDioxide | 12113 | 120.714 | 231.913 | -823 | 27 | 208 | 1057 |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| pH | 12400 | 3.208 | 0.68 | 0.48 | 2.96 | 3.47 | 6.13 |
| Sulphates | 11585 | 0.527 | 0.932 | -3.13 | 0.28 | 0.86 | 4.24 |
| Alcohol | 12142 | 10.489 | 3.728 | -4.7 | 9 | 12.4 | 26.5 |
| LabelAppeal | 12795 | -0.009 | 0.891 | -2 | -1 | 1 | 2 |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 9436 | 2.042 | 0.903 | 1 | 1 | 3 | 4 |
describe(train)
## train
##
## 15 Variables 12795 Observations
## --------------------------------------------------------------------------------
## TARGET
## n missing distinct Info Mean Gmd
## 12795 0 9 0.962 3.029 2.141
##
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 2734 244 1091 2611 3177 2014 765 142 17
## Proportion 0.214 0.019 0.085 0.204 0.248 0.157 0.060 0.011 0.001
## --------------------------------------------------------------------------------
## FixedAcidity
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 470 1 7.076 6.688 -3.6 -1.2
## .25 .50 .75 .90 .95
## 5.2 6.9 9.5 15.6 17.8
##
## lowest : -18.1 -18.0 -17.7 -17.5 -17.4, highest: 32.4 32.5 32.6 34.1 34.4
## --------------------------------------------------------------------------------
## VolatileAcidity
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 815 1 0.3241 0.8262 -1.023 -0.720
## .25 .50 .75 .90 .95
## 0.130 0.280 0.640 1.350 1.640
##
## lowest : -2.790 -2.750 -2.745 -2.730 -2.720, highest: 3.500 3.550 3.565 3.590 3.680
## --------------------------------------------------------------------------------
## CitricAcid
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 602 1 0.3084 0.9057 -1.16 -0.84
## .25 .50 .75 .90 .95
## 0.03 0.31 0.58 1.43 1.79
##
## lowest : -3.24 -3.16 -3.10 -3.08 -3.06, highest: 3.63 3.68 3.70 3.77 3.86
## --------------------------------------------------------------------------------
## ResidualSugar
## n missing distinct Info Mean Gmd .05 .10
## 12179 616 2077 1 5.419 35.31 -52.70 -39.66
## .25 .50 .75 .90 .95
## -2.00 3.90 15.90 49.72 62.70
##
## lowest : -127.80 -127.10 -126.20 -126.10 -125.70
## highest: 136.50 137.60 138.00 140.65 141.15
## --------------------------------------------------------------------------------
## Chlorides
## n missing distinct Info Mean Gmd .05 .10
## 12157 638 1663 1 0.05482 0.3311 -0.489 -0.372
## .25 .50 .75 .90 .95
## -0.031 0.046 0.153 0.481 0.598
##
## lowest : -1.171 -1.170 -1.158 -1.156 -1.155, highest: 1.260 1.261 1.270 1.275 1.351
## --------------------------------------------------------------------------------
## FreeSulfurDioxide
## n missing distinct Info Mean Gmd .05 .10
## 12148 647 999 1 30.85 155.2 -224 -171
## .25 .50 .75 .90 .95
## 0 30 70 230 284
##
## lowest : -555 -546 -536 -535 -532, highest: 613 617 618 622 623
## --------------------------------------------------------------------------------
## TotalSulfurDioxide
## n missing distinct Info Mean Gmd .05 .10
## 12113 682 1370 1 120.7 246.9 -273.0 -185.0
## .25 .50 .75 .90 .95
## 27.0 123.0 208.0 421.8 513.4
##
## lowest : -823 -816 -793 -781 -779, highest: 1032 1041 1048 1054 1057
## --------------------------------------------------------------------------------
## Density
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 5933 1 0.9942 0.02769 0.9488 0.9587
## .25 .50 .75 .90 .95
## 0.9877 0.9945 1.0005 1.0295 1.0398
##
## lowest : 0.88809 0.88949 0.88978 0.88983 0.89167
## highest: 1.09658 1.09679 1.09695 1.09791 1.09924
## --------------------------------------------------------------------------------
## pH
## n missing distinct Info Mean Gmd .05 .10
## 12400 395 497 1 3.208 0.7242 2.06 2.31
## .25 .50 .75 .90 .95
## 2.96 3.20 3.47 4.10 4.37
##
## lowest : 0.48 0.53 0.54 0.58 0.59, highest: 5.91 5.94 6.02 6.05 6.13
## --------------------------------------------------------------------------------
## Sulphates
## n missing distinct Info Mean Gmd .05 .10
## 11585 1210 630 1 0.5271 0.9827 -1.05 -0.70
## .25 .50 .75 .90 .95
## 0.28 0.50 0.86 1.77 2.09
##
## lowest : -3.13 -3.12 -3.10 -3.07 -3.03, highest: 4.11 4.16 4.19 4.21 4.24
## --------------------------------------------------------------------------------
## Alcohol
## n missing distinct Info Mean Gmd .05 .10
## 12142 653 401 1 10.49 4.015 4.1 5.7
## .25 .50 .75 .90 .95
## 9.0 10.4 12.4 15.2 16.7
##
## lowest : -4.7 -4.5 -4.4 -4.3 -4.1, highest: 25.4 25.6 26.0 26.1 26.5
## --------------------------------------------------------------------------------
## LabelAppeal
## n missing distinct Info Mean Gmd
## 12795 0 5 0.887 -0.009066 0.9566
##
## lowest : -2 -1 0 1 2, highest: -2 -1 0 1 2
##
## Value -2 -1 0 1 2
## Frequency 504 3136 5617 3048 490
## Proportion 0.039 0.245 0.439 0.238 0.038
## --------------------------------------------------------------------------------
## AcidIndex
## n missing distinct Info Mean Gmd .05 .10
## 12795 0 14 0.908 7.773 1.316 6 7
## .25 .50 .75 .90 .95
## 7 8 8 9 10
##
## lowest : 4 5 6 7 8, highest: 13 14 15 16 17
##
## Value 4 5 6 7 8 9 10 11 12 13 14
## Frequency 3 75 1197 4878 4142 1427 551 258 128 69 47
## Proportion 0.000 0.006 0.094 0.381 0.324 0.112 0.043 0.020 0.010 0.005 0.004
##
## Value 15 16 17
## Frequency 8 5 7
## Proportion 0.001 0.000 0.001
## --------------------------------------------------------------------------------
## STARS
## n missing distinct Info Mean Gmd
## 9436 3359 4 0.899 2.042 0.9777
##
## Value 1 2 3 4
## Frequency 3042 3570 2212 612
## Proportion 0.322 0.378 0.234 0.065
## --------------------------------------------------------------------------------
dim(train)
## [1] 12795 15
# several variables have missing values (NAs)
#assess missing values
missing <- colSums(train %>% sapply(is.na))
missing_pct <- round(missing / nrow(train) * 100, 2)
na_table <- stack(sort(missing_pct, decreasing = TRUE))
na_table
| values | ind |
| 26.2 | STARS |
| 9.46 | Sulphates |
| 5.33 | TotalSulfurDioxide |
| 5.1 | Alcohol |
| 5.06 | FreeSulfurDioxide |
| 4.99 | Chlorides |
| 4.81 | ResidualSugar |
| 3.09 | pH |
| 0 | TARGET |
| 0 | FixedAcidity |
| 0 | VolatileAcidity |
| 0 | CitricAcid |
| 0 | Density |
| 0 | LabelAppeal |
| 0 | AcidIndex |
plot_missing(train)

Visualization
# Some visuals: histograms and correlation
# Histograms to check how variables are distributed
plot_num(train)

# Correlation plot -- all variables as numeric
# (stored as cor_mat so the result does not share a name with the cor() function)
cor_mat <- cor(train, method="pearson", use = "pairwise.complete.obs")
corrplot(cor_mat, method="circle")
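To read the correlation plot numerically, the row of the correlation matrix for TARGET can be pulled out and sorted. The sketch below is a minimal illustration that uses only the first eight observations shown in the `str(train)` output above, so the resulting numbers are not the full-data correlations; the object names `toy`, `cor_mat`, and `target_cor` are assumptions made for this example.

```r
# First eight observations of four columns, copied from the str(train) output.
toy <- data.frame(
  TARGET      = c(3, 3, 5, 3, 4, 0, 0, 4),
  STARS       = c(2, 3, 3, 1, 2, NA, NA, 3),
  LabelAppeal = c(0, -1, -1, -1, 0, 0, 0, 1),
  AcidIndex   = c(8, 7, 8, 6, 9, 11, 8, 7)
)

# Pairwise-complete Pearson correlations, as in the corrplot() call.
cor_mat <- cor(toy, method = "pearson", use = "pairwise.complete.obs")

# Correlation of every predictor with TARGET, strongest positive first.
target_cor <- sort(cor_mat["TARGET", colnames(cor_mat) != "TARGET"],
                   decreasing = TRUE)
target_cor
```

Applied to the full `train` data frame, the same two lines would give a ranked list to set beside the plot.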

#Scale variables to create boxplot chart with all variables in it
scaled.train <- as.data.table(scale(train[, c(
'TARGET',
'FixedAcidity',
'VolatileAcidity',
'CitricAcid',
'ResidualSugar',
'Chlorides',
'FreeSulfurDioxide',
'TotalSulfurDioxide',
'pH',
'Sulphates',
'Density',
'Alcohol',
'AcidIndex',
'STARS',
'LabelAppeal'
)]))
#Show boxplots
#boxplot(scaled.train)
melt.train <- melt(scaled.train)
scaled_boxplots <- ggplot(melt.train, aes(variable, value)) +
geom_boxplot(width=.5, fill="navyblue", outlier.color="magenta", outlier.size = 1) +
# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
stat_summary(aes(color="mean"), fun=mean, geom="point",
size=2, show.legend=TRUE) +
stat_summary(aes(color="median"), fun=median, geom="point",
size=2, show.legend=TRUE) +
coord_flip() +
labs(x="", y="") +
scale_color_manual(values=c("blue", "purple")) +
theme(legend.position="top")
scaled_boxplots

Data Preparation
As mentioned earlier, when we inspected the data we noticed a large number of NAs. I have worked with NAs before, but in much simpler ways. Following the hint in the assignment not to discard NAs automatically, I decided to recode missing STARS values as zero, so that an unrated wine becomes its own category once STARS is converted to a factor.
For the other variables that needed imputation, I read about several options, including the imputeTS, Hmisc, missForest, and mice packages. Although mice was a bit over my head, I read that it could be the best avenue. I have imputed using means and medians in the past, but to try something new I went with mice (multivariate imputation by chained equations), using its tree-based cart method.
After this, it made sense to return to the initial idea of converting STARS and LabelAppeal to factors before running the models. Many available examples also convert AcidIndex to a factor, so I considered that as well (the line is left commented out in the code). It also made sense to consider moving the variables with negative values onto a positive scale, since this exercise fits Poisson and negative binomial models, which are built for non-negative counts. Strictly, that restriction applies to the response (TARGET) rather than to the predictors, so the transformations below are a modeling choice rather than a requirement. After looking at several techniques, I questioned whether to log the data, given that most independent variables had a relatively normal distribution. I then concluded that the Poisson and negative binomial regressions could handle this once the data were transformed. The data transformation progressed as follows.
Step 1. Initial imputed data: the data as given, with NAs in STARS recoded as zero and mice used for the remaining variables.
Step 2. Shifted data, x + abs(min(x)) + C with C = 1: starts from the imputed data and shifts each variable so that negative numbers become positive; LabelAppeal was shifted by +2. At this point, all the variables still showed a relatively normal distribution.
Step 3. Log-transformed data: we took absolute values of the continuous variables and then applied log(x + 1), to see whether the distributions would take a shape closer to the negative binomial. It appeared that they did.
Step 4. Absolute values only: I used absolute values without the log transformation or the shift and tested how this would perform. At first glance, it looked as though absolute values alone would have been enough.
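The three transformations in Steps 2 to 4 can be compared side by side on a small vector. This is a minimal sketch, not the full pipeline: `x` holds the minimum, quartiles, median, and maximum of Sulphates from the `summary(train)` output, and the variable names are assumptions made for the example.

```r
# Sulphates: min, 1st quartile, median, 3rd quartile, max from summary(train).
x <- c(-3.13, 0.28, 0.50, 0.86, 4.24)

shifted  <- x + abs(min(x)) + 1   # Step 2: shift so the minimum becomes 1
absolute <- abs(x)                # Step 4: absolute values only
abs_log  <- log1p(absolute)       # Step 3: log(|x| + 1)

data.frame(x, shifted, absolute, abs_log)

# Shifting keeps the spacing between observations ...
all.equal(diff(shifted), diff(x))       # TRUE
# ... while abs() folds negatives onto positives and changes the ordering.
identical(order(absolute), order(x))    # FALSE
```

The last two lines show the main design difference: the shift is a pure change of origin, whereas taking absolute values alters which observations count as small or large.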
# recode missing STARS (NA) as zero, then treat STARS and LabelAppeal as factors
train$STARS[is.na(train$STARS)] <- 0
train$STARS <-as.factor(train$STARS)
train$LabelAppeal <- as.factor(train$LabelAppeal)
#train$AcidIndex <- as.factor(train$AcidIndex)
# head(train)
# impute the other variables with mice (method 'cart' = classification and regression trees), based on this:
#https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
train_mice <- mice::mice(train, m = 2, method='cart', maxit = 2, printFlag = FALSE)
train_imputed <- mice::complete(train_mice)
density.plot <-densityplot(train_mice)
density.plot

head(train_imputed)
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
| 3 | 3.2 | 1.16 | -0.98 | 54.2 | -0.567 | -170 | 268 | 0.993 | 3.33 | -0.59 | 9.9 | 0 | 8 | 2 |
| 3 | 4.5 | 0.16 | -0.81 | 26.1 | -0.425 | 15 | -327 | 1.03 | 3.38 | 0.7 | 13.5 | -1 | 7 | 3 |
| 5 | 7.1 | 2.64 | -0.88 | 14.8 | 0.037 | 214 | 142 | 0.995 | 3.12 | 0.48 | 22 | -1 | 8 | 3 |
| 3 | 5.7 | 0.385 | 0.04 | 18.8 | -0.425 | 22 | 115 | 0.996 | 2.24 | 1.83 | 6.2 | -1 | 6 | 1 |
| 4 | 8 | 0.33 | -1.26 | 9.4 | -0.446 | -167 | 108 | 0.995 | 3.12 | 1.77 | 13.7 | 0 | 9 | 2 |
| 0 | 11.3 | 0.32 | 0.59 | 2.2 | 0.556 | -37 | 15 | 0.999 | 3.2 | 1.29 | 15.4 | 0 | 11 | 0 |
st(train_imputed)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 7.076 | 6.318 | -18.1 | 5.2 | 9.5 | 34.4 |
| VolatileAcidity | 12795 | 0.324 | 0.784 | -2.79 | 0.13 | 0.64 | 3.68 |
| CitricAcid | 12795 | 0.308 | 0.862 | -3.24 | 0.03 | 0.58 | 3.86 |
| ResidualSugar | 12795 | 5.44 | 33.737 | -127.8 | -1.8 | 15.85 | 141.15 |
| Chlorides | 12795 | 0.056 | 0.319 | -1.171 | -0.029 | 0.156 | 1.351 |
| FreeSulfurDioxide | 12795 | 31.198 | 148.59 | -555 | 0 | 70 | 623 |
| TotalSulfurDioxide | 12795 | 121.383 | 231.196 | -823 | 27.5 | 208 | 1057 |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| pH | 12795 | 3.207 | 0.681 | 0.48 | 2.95 | 3.47 | 6.13 |
| Sulphates | 12795 | 0.525 | 0.93 | -3.13 | 0.28 | 0.86 | 4.24 |
| Alcohol | 12795 | 10.493 | 3.738 | -4.7 | 9 | 12.4 | 26.5 |
| LabelAppeal | 12795 | | | | | | |
| … -2 | 504 | 3.9% | | | | | |
| … -1 | 3136 | 24.5% | | | | | |
| … 0 | 5617 | 43.9% | | | | | |
| … 1 | 3048 | 23.8% | | | | | |
| … 2 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
plot_num(train_imputed)

Further insights.
I looked at different techniques in R to adjust the negative values, and the following is the one I understood best. What made sense to me was to add the absolute value of the minimum, plus 1, to every observation of a variable, which shifts the scale so that the minimum becomes 1. I have done this manually in the past with models in Excel. I saw many analysts do this in several examples and started with this approach. There were also analysts who treated AcidIndex as a categorical variable; I did not go this route.
I learned there are other techniques, such as simply taking absolute values, or taking the log of those absolute values. Even without some of these transformations, the imputed values still looked normally distributed, which should be workable for the negative binomial model, and possibly for the Poisson model if there is no overdispersion.
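Whether a Poisson model is adequate depends on overdispersion, which can be checked before choosing between the two models. As a first hint, the summary table reports a TARGET mean of 3.029 and a standard deviation of 1.926, that is, a raw variance of about 3.71. A more direct check is the Pearson dispersion statistic of a fitted Poisson model. The sketch below is a minimal illustration on simulated counts (the data, the coefficients 0.3 and 0.25, and the names `stars`, `y`, and `fit` are all assumptions, not the wine data).

```r
set.seed(42)

# Simulated stand-in data: a 0-4 rating and Poisson counts that depend on it.
n     <- 500
stars <- sample(0:4, n, replace = TRUE)
y     <- rpois(n, lambda = exp(0.3 + 0.25 * stars))

# Fit a Poisson regression with base R.
fit <- glm(y ~ stars, family = poisson)

# Pearson dispersion statistic: sum of squared Pearson residuals divided by
# the residual degrees of freedom. Values well above 1 suggest overdispersion,
# in which case a negative binomial model is the usual alternative.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

c(mean = mean(y), variance = var(y), dispersion = dispersion)
```

On the real data, the same statistic computed from the fitted Poisson model would indicate whether the negative binomial model is needed.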
train_imp_plusminconst <- train_imputed
# shift the columns that contain negative values: x + abs(min(x)) + 1
# (the +1 sets the new minimum to 1 rather than 0, which keeps a later log() defined)
# Should these also be logged despite looking normally distributed? I decided not to.
#https://www.listendata.com/2015/09/regression-transform-negative-values.html
train_imp_plusminconst$FixedAcidity <- train_imp_plusminconst$FixedAcidity + abs(min(train_imp_plusminconst$FixedAcidity))+1
train_imp_plusminconst$VolatileAcidity <- train_imp_plusminconst$VolatileAcidity + abs(min(train_imp_plusminconst$VolatileAcidity))+1
train_imp_plusminconst$CitricAcid <- train_imp_plusminconst$CitricAcid + abs(min(train_imp_plusminconst$CitricAcid))+1
train_imp_plusminconst$ResidualSugar <- train_imp_plusminconst$ResidualSugar + abs(min(train_imp_plusminconst$ResidualSugar))+1
train_imp_plusminconst$Chlorides <- train_imp_plusminconst$Chlorides + abs(min(train_imp_plusminconst$Chlorides))+1
train_imp_plusminconst$FreeSulfurDioxide <- train_imp_plusminconst$FreeSulfurDioxide + abs(min(train_imp_plusminconst$FreeSulfurDioxide))+1
train_imp_plusminconst$TotalSulfurDioxide <- train_imp_plusminconst$TotalSulfurDioxide +abs(min(train_imp_plusminconst$TotalSulfurDioxide ))+1
train_imp_plusminconst$Sulphates <-train_imp_plusminconst$Sulphates +abs(min(train_imp_plusminconst$Sulphates))+1
train_imp_plusminconst$Alcohol <-train_imp_plusminconst$Alcohol +abs(min(train_imp_plusminconst$Alcohol))+1
# the shifted columns are now on a different scale from the all-positive ones, so the same
# shift is applied to Density and pH too (their new minimum is 2*min + 1, not 1)
train_imp_plusminconst$Density <- train_imp_plusminconst$Density + abs(min(train_imp_plusminconst$Density))+1
train_imp_plusminconst$pH <- train_imp_plusminconst$pH + abs(min(train_imp_plusminconst$pH))+1
# transform LabelAppeal too: as.numeric() on a factor returns the level codes 1..5
# (not the original values -2..2), so adding abs(min) = 1 and subtracting 2 yields 0..4
train_imp_plusminconst$LabelAppeal <- as.numeric(train_imp_plusminconst$LabelAppeal)
train_imp_plusminconst$LabelAppeal <-train_imp_plusminconst$LabelAppeal + abs(min(train_imp_plusminconst$LabelAppeal))-2
train_imp_plusminconst$LabelAppeal <- as.factor(train_imp_plusminconst$LabelAppeal)
st(train_imp_plusminconst) #run this to make sure each variable worked after
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 26.176 | 6.318 | 1 | 24.3 | 28.6 | 53.5 |
| VolatileAcidity | 12795 | 4.114 | 0.784 | 1 | 3.92 | 4.43 | 7.47 |
| CitricAcid | 12795 | 4.548 | 0.862 | 1 | 4.27 | 4.82 | 8.1 |
| ResidualSugar | 12795 | 134.24 | 33.737 | 1 | 127 | 144.65 | 269.95 |
| Chlorides | 12795 | 2.227 | 0.319 | 1 | 2.142 | 2.327 | 3.522 |
| FreeSulfurDioxide | 12795 | 587.198 | 148.59 | 1 | 556 | 626 | 1179 |
| TotalSulfurDioxide | 12795 | 945.383 | 231.196 | 1 | 851.5 | 1032 | 1881 |
| Density | 12795 | 2.882 | 0.027 | 2.776 | 2.876 | 2.889 | 2.987 |
| pH | 12795 | 4.687 | 0.681 | 1.96 | 4.43 | 4.95 | 7.61 |
| Sulphates | 12795 | 4.655 | 0.93 | 1 | 4.41 | 4.99 | 8.37 |
| Alcohol | 12795 | 16.193 | 3.738 | 1 | 14.7 | 18.1 | 32.2 |
| LabelAppeal | 12795 | | | | | | |
| … 0 | 504 | 3.9% | | | | | |
| … 1 | 3136 | 24.5% | | | | | |
| … 2 | 5617 | 43.9% | | | | | |
| … 3 | 3048 | 23.8% | | | | | |
| … 4 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
summary(train_imp_plusminconst) #make sure nothing broke
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. : 1.00 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:24.30 1st Qu.:3.920 1st Qu.:4.270
## Median :3.000 Median :26.00 Median :4.070 Median :4.550
## Mean :3.029 Mean :26.18 Mean :4.114 Mean :4.548
## 3rd Qu.:4.000 3rd Qu.:28.60 3rd Qu.:4.430 3rd Qu.:4.820
## Max. :8.000 Max. :53.50 Max. :7.470 Max. :8.100
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. : 1.0 Min. :1.000 Min. : 1.0 Min. : 1.0
## 1st Qu.:127.0 1st Qu.:2.142 1st Qu.: 556.0 1st Qu.: 851.5
## Median :132.7 Median :2.217 Median : 586.0 Median : 947.0
## Mean :134.2 Mean :2.227 Mean : 587.2 Mean : 945.4
## 3rd Qu.:144.7 3rd Qu.:2.326 3rd Qu.: 626.0 3rd Qu.:1032.0
## Max. :269.9 Max. :3.522 Max. :1179.0 Max. :1881.0
## Density pH Sulphates Alcohol LabelAppeal
## Min. :2.776 Min. :1.960 Min. :1.000 Min. : 1.00 0: 504
## 1st Qu.:2.876 1st Qu.:4.430 1st Qu.:4.410 1st Qu.:14.70 1:3136
## Median :2.883 Median :4.680 Median :4.630 Median :16.10 2:5617
## Mean :2.882 Mean :4.687 Mean :4.655 Mean :16.19 3:3048
## 3rd Qu.:2.889 3rd Qu.:4.950 3rd Qu.:4.990 3rd Qu.:18.10 4: 490
## Max. :2.987 Max. :7.610 Max. :8.370 Max. :32.20
## AcidIndex STARS
## Min. : 4.000 0:3359
## 1st Qu.: 7.000 1:3042
## Median : 8.000 2:3570
## Mean : 7.773 3:2212
## 3rd Qu.: 8.000 4: 612
## Max. :17.000
plot_num(train_imp_plusminconst)
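The column-by-column shift can also be written as a loop. This is a minimal sketch under stated assumptions: `toy` is a made-up three-column stand-in built from the minimum, quartile, and maximum values in the summary tables, and `shift_min_to_one()` is a hypothetical helper, not a function from any package. Unlike the code above, this version leaves all-positive columns such as Density untouched.

```r
# Toy stand-in using min, quartiles, and max from the summary tables.
toy <- data.frame(
  FixedAcidity = c(-18.1, 5.2, 9.5, 34.4),
  Chlorides    = c(-1.171, -0.031, 0.153, 1.351),
  Density      = c(0.888, 0.988, 1.001, 1.099)
)

# Hypothetical helper: for a column whose minimum is negative,
# x - min(x) + 1 is the same as x + abs(min(x)) + 1.
shift_min_to_one <- function(x) x - min(x) + 1

shifted <- toy
for (col in names(shifted)) {
  # Only shift columns that actually contain negative values.
  if (min(shifted[[col]]) < 0) {
    shifted[[col]] <- shift_min_to_one(shifted[[col]])
  }
}

sapply(shifted, min)
```

The loop avoids repeating one assignment per variable, and the condition makes the rule for which columns get shifted explicit.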

# I liked the for-loop version better, but it was a bit over my head too. Review later; sticking with the method above.
######################################################################
# Transform the imputed data: absolute values first, then log(x + 1)
train_scaling_subset2 <- train_imputed %>%
dplyr::select(FixedAcidity,
VolatileAcidity,
CitricAcid,
ResidualSugar,
Chlorides,
FreeSulfurDioxide,
TotalSulfurDioxide,
Sulphates,
Alcohol)
train_absscaled_subset <- lapply(train_scaling_subset2,
FUN = function(x) sapply(x, FUN = abs)) %>%
as.data.frame()
# Join absolute value-scaled subset back to other continuous variables
train_abs <- train_imputed %>%
dplyr::select(Density,
AcidIndex,
pH
) %>%
cbind(train_absscaled_subset)
# Log-scale all continuous variables, adding constant of 1
train_abslog <- lapply(train_abs, FUN = function(x)
sapply(x, FUN = function(x) log(x+1))) %>%
as.data.frame()
st(train_abs)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| pH | 12795 | 3.207 | 0.681 | 0.48 | 2.95 | 3.47 | 6.13 |
| FixedAcidity | 12795 | 8.063 | 4.996 | 0 | 5.6 | 9.8 | 34.4 |
| VolatileAcidity | 12795 | 0.641 | 0.556 | 0 | 0.25 | 0.91 | 3.68 |
| CitricAcid | 12795 | 0.686 | 0.606 | 0 | 0.28 | 0.97 | 3.86 |
| ResidualSugar | 12795 | 23.326 | 24.972 | 0 | 3.6 | 38.6 | 141.15 |
| Chlorides | 12795 | 0.223 | 0.234 | 0 | 0.046 | 0.368 | 1.351 |
| FreeSulfurDioxide | 12795 | 106.642 | 108.069 | 0 | 28 | 171 | 623 |
| TotalSulfurDioxide | 12795 | 204.015 | 162.976 | 0 | 99 | 262 | 1057 |
| Sulphates | 12795 | 0.845 | 0.652 | 0 | 0.43 | 1.09 | 4.24 |
| Alcohol | 12795 | 10.526 | 3.642 | 0 | 9 | 12.4 | 26.5 |
# bring in the shifted LabelAppeal from the plus-min-constant data frame
train_abslog$LabelAppeal <- train_imp_plusminconst$LabelAppeal
# Map remaining variables to dataframe
#train_abslog$INDEX <- train_imputed$INDEX
train_abslog$TARGET <- train_imputed$TARGET
train_abslog$STARS <- train_imputed$STARS
train_abslog$STARS <- as.factor(train_abslog$STARS)
head(train_abslog)
| Density | AcidIndex | pH | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Sulphates | Alcohol | LabelAppeal | TARGET | STARS |
| 0.69 | 2.2 | 1.47 | 1.44 | 0.77 | 0.683 | 4.01 | 0.449 | 5.14 | 5.59 | 0.464 | 2.39 | 2 | 3 | 2 |
| 0.707 | 2.08 | 1.48 | 1.7 | 0.148 | 0.593 | 3.3 | 0.354 | 2.77 | 5.79 | 0.531 | 2.67 | 1 | 3 | 3 |
| 0.691 | 2.2 | 1.42 | 2.09 | 1.29 | 0.631 | 2.76 | 0.0363 | 5.37 | 4.96 | 0.392 | 3.14 | 1 | 5 | 3 |
| 0.691 | 1.95 | 1.18 | 1.9 | 0.326 | 0.0392 | 2.99 | 0.354 | 3.14 | 4.75 | 1.04 | 1.97 | 1 | 3 | 1 |
| 0.69 | 2.3 | 1.42 | 2.2 | 0.285 | 0.815 | 2.34 | 0.369 | 5.12 | 4.69 | 1.02 | 2.69 | 2 | 4 | 2 |
| 0.693 | 2.48 | 1.44 | 2.51 | 0.278 | 0.464 | 1.16 | 0.442 | 3.64 | 2.77 | 0.829 | 2.8 | 2 | 0 | 0 |
st(train_abslog)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| Density | 12795 | 0.69 | 0.013 | 0.636 | 0.687 | 0.693 | 0.742 |
| AcidIndex | 12795 | 2.161 | 0.14 | 1.609 | 2.079 | 2.197 | 2.89 |
| pH | 12795 | 1.423 | 0.171 | 0.392 | 1.374 | 1.497 | 1.964 |
| FixedAcidity | 12795 | 2.04 | 0.617 | 0 | 1.887 | 2.38 | 3.567 |
| VolatileAcidity | 12795 | 0.449 | 0.293 | 0 | 0.223 | 0.647 | 1.543 |
| CitricAcid | 12795 | 0.47 | 0.31 | 0 | 0.247 | 0.678 | 1.581 |
| ResidualSugar | 12795 | 2.597 | 1.17 | 0 | 1.526 | 3.679 | 4.957 |
| Chlorides | 12795 | 0.185 | 0.174 | 0 | 0.045 | 0.313 | 0.855 |
| FreeSulfurDioxide | 12795 | 4.14 | 1.115 | 0 | 3.367 | 5.147 | 6.436 |
| TotalSulfurDioxide | 12795 | 4.993 | 0.903 | 0 | 4.605 | 5.572 | 6.964 |
| Sulphates | 12795 | 0.562 | 0.304 | 0 | 0.358 | 0.737 | 1.656 |
| Alcohol | 12795 | 2.382 | 0.388 | 0 | 2.303 | 2.595 | 3.314 |
| LabelAppeal | 12795 | | | | | | |
| … 0 | 504 | 3.9% | | | | | |
| … 1 | 3136 | 24.5% | | | | | |
| … 2 | 5617 | 43.9% | | | | | |
| … 3 | 3048 | 23.8% | | | | | |
| … 4 | 490 | 3.8% | | | | | |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
plot_num(train_abslog)

train_imp_abs <- train_imputed
# take absolute values of the columns that contain negative values (no shift, no log)
#https://www.listendata.com/2015/09/regression-transform-negative-values.html
train_imp_abs$FixedAcidity <- abs(train_imp_abs$FixedAcidity)
train_imp_abs$VolatileAcidity <- abs(train_imp_abs$VolatileAcidity)
train_imp_abs$CitricAcid <- abs(train_imp_abs$CitricAcid)
train_imp_abs$ResidualSugar <-abs(train_imp_abs$ResidualSugar)
train_imp_abs$Chlorides <-abs(train_imp_abs$Chlorides)
train_imp_abs$FreeSulfurDioxide <-abs(train_imp_abs$FreeSulfurDioxide)
train_imp_abs$TotalSulfurDioxide <-abs(train_imp_abs$TotalSulfurDioxide)
train_imp_abs$Sulphates <- abs(train_imp_abs$Sulphates)
train_imp_abs$Alcohol <-abs(train_imp_abs$Alcohol)
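# transform LabelAppeal too. Note: as.numeric() on a factor returns the level codes 1..5,
# so abs() has no effect here and the levels end up as 1..5 rather than 0..2
train_imp_abs$LabelAppeal <- as.numeric(train_imp_abs$LabelAppeal)
train_imp_abs$LabelAppeal <- abs(train_imp_abs$LabelAppeal)
#train_imp_abs$LabelAppeal + abs(min(train_imp_abs$LabelAppeal))-2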
train_imp_abs$LabelAppeal <- as.factor(train_imp_abs$LabelAppeal)
st(train_imp_abs) #run this to make sure each variable worked after
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 8.063 | 4.996 | 0 | 5.6 | 9.8 | 34.4 |
| VolatileAcidity | 12795 | 0.641 | 0.556 | 0 | 0.25 | 0.91 | 3.68 |
| CitricAcid | 12795 | 0.686 | 0.606 | 0 | 0.28 | 0.97 | 3.86 |
| ResidualSugar | 12795 | 23.326 | 24.972 | 0 | 3.6 | 38.6 | 141.15 |
| Chlorides | 12795 | 0.223 | 0.234 | 0 | 0.046 | 0.368 | 1.351 |
| FreeSulfurDioxide | 12795 | 106.642 | 108.069 | 0 | 28 | 171 | 623 |
| TotalSulfurDioxide | 12795 | 204.015 | 162.976 | 0 | 99 | 262 | 1057 |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| pH | 12795 | 3.207 | 0.681 | 0.48 | 2.95 | 3.47 | 6.13 |
| Sulphates | 12795 | 0.845 | 0.652 | 0 | 0.43 | 1.09 | 4.24 |
| Alcohol | 12795 | 10.526 | 3.642 | 0 | 9 | 12.4 | 26.5 |
| LabelAppeal | 12795 | | | | | | |
| … 1 | 504 | 3.9% | | | | | |
| … 2 | 3136 | 24.5% | | | | | |
| … 3 | 5617 | 43.9% | | | | | |
| … 4 | 3048 | 23.8% | | | | | |
| … 5 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
summary(train_imp_abs)#make sure nothing broke
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.: 5.600 1st Qu.:0.2500 1st Qu.:0.2800
## Median :3.000 Median : 7.000 Median :0.4100 Median :0.4400
## Mean :3.029 Mean : 8.063 Mean :0.6411 Mean :0.6863
## 3rd Qu.:4.000 3rd Qu.: 9.800 3rd Qu.:0.9100 3rd Qu.:0.9700
## Max. :8.000 Max. :34.400 Max. :3.6800 Max. :3.8600
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. : 0.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.: 3.60 1st Qu.:0.0460 1st Qu.: 28.0 1st Qu.: 99
## Median : 12.90 Median :0.0980 Median : 56.0 Median : 154
## Mean : 23.33 Mean :0.2227 Mean :106.6 Mean : 204
## 3rd Qu.: 38.60 3rd Qu.:0.3680 3rd Qu.:171.0 3rd Qu.: 262
## Max. :141.15 Max. :1.3510 Max. :623.0 Max. :1057
## Density pH Sulphates Alcohol LabelAppeal
## Min. :0.8881 Min. :0.480 Min. :0.0000 Min. : 0.00 1: 504
## 1st Qu.:0.9877 1st Qu.:2.950 1st Qu.:0.4300 1st Qu.: 9.00 2:3136
## Median :0.9945 Median :3.200 Median :0.5900 Median :10.40 3:5617
## Mean :0.9942 Mean :3.207 Mean :0.8454 Mean :10.53 4:3048
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.:1.0900 3rd Qu.:12.40 5: 490
## Max. :1.0992 Max. :6.130 Max. :4.2400 Max. :26.50
## AcidIndex STARS
## Min. : 4.000 0:3359
## 1st Qu.: 7.000 1:3042
## Median : 8.000 2:3570
## Mean : 7.773 3:2212
## 3rd Qu.: 8.000 4: 612
## Max. :17.000
plot_num(train_imp_abs)

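# Start the Box-Cox dataset from the min + constant version.
train_abs_bc <- train_imp_plusminconst
# as.numeric() on a factor returns the level codes, so the 0-4 levels of
# LabelAppeal and STARS become 1-5 (see the summary below).
train_abs_bc$LabelAppeal <- as.numeric(train_abs_bc$LabelAppeal)
train_abs_bc$STARS <- as.numeric(train_abs_bc$STARS)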
st(train_abs_bc)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 26.176 | 6.318 | 1 | 24.3 | 28.6 | 53.5 |
| VolatileAcidity | 12795 | 4.114 | 0.784 | 1 | 3.92 | 4.43 | 7.47 |
| CitricAcid | 12795 | 4.548 | 0.862 | 1 | 4.27 | 4.82 | 8.1 |
| ResidualSugar | 12795 | 134.24 | 33.737 | 1 | 127 | 144.65 | 269.95 |
| Chlorides | 12795 | 2.227 | 0.319 | 1 | 2.142 | 2.327 | 3.522 |
| FreeSulfurDioxide | 12795 | 587.198 | 148.59 | 1 | 556 | 626 | 1179 |
| TotalSulfurDioxide | 12795 | 945.383 | 231.196 | 1 | 851.5 | 1032 | 1881 |
| Density | 12795 | 2.882 | 0.027 | 2.776 | 2.876 | 2.889 | 2.987 |
| pH | 12795 | 4.687 | 0.681 | 1.96 | 4.43 | 4.95 | 7.61 |
| Sulphates | 12795 | 4.655 | 0.93 | 1 | 4.41 | 4.99 | 8.37 |
| Alcohol | 12795 | 16.193 | 3.738 | 1 | 14.7 | 18.1 | 32.2 |
| LabelAppeal | 12795 | 2.991 | 0.891 | 1 | 2 | 4 | 5 |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 12795 | 2.506 | 1.187 | 1 | 1 | 3 | 5 |
# Shift every column, including TARGET, by 1 so that all values are strictly positive for Box-Cox.
all_cols <- c("TARGET", "FixedAcidity", "VolatileAcidity", "CitricAcid",
"ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide",
"Density", "pH", "Sulphates", "Alcohol", "LabelAppeal", "AcidIndex", "STARS")
train_abs_bc[, all_cols] <- train_abs_bc[, all_cols] + 1
b = boxcox(TARGET ~
FixedAcidity
+VolatileAcidity
+CitricAcid
+ResidualSugar
+Chlorides
+FreeSulfurDioxide
+TotalSulfurDioxide
+Density
+pH
+Sulphates
+Alcohol
+LabelAppeal
+AcidIndex
+STARS,
data=train_abs_bc)

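# boxcox() returns the lambda grid (x) and the profile log-likelihood (y).
lambda = b$x
lik = b$y
bc = cbind(lambda,lik)
# Sort by decreasing log-likelihood and keep the best lambda.
hold=bc[order(-lik),]
bcVal=hold[1,1]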
# Raise the continuous predictors to the Box-Cox power (LabelAppeal and STARS are left out).
bc_cols <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
"Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density",
"pH", "Sulphates", "Alcohol", "AcidIndex")
train_abs_bc[, bc_cols] <- train_abs_bc[, bc_cols]^(bcVal)
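The lambda selection above can be illustrated on simulated data. This is only a minimal sketch, not the report's own pipeline: the variables `x`, `y`, `bc_profile`, and `best_lambda` are hypothetical, and the data are generated so that a log transform (lambda near 0) is appropriate. Note that `boxcox()` profiles the likelihood for a power transform of the response.

```r
library(MASS)  # provides boxcox(); MASS ships with standard R installations

set.seed(1)
# Hypothetical data: a strictly positive response whose log is linear in x.
x <- runif(200, 1, 10)
y <- exp(0.3 * x + rnorm(200, sd = 0.2))

# Profile the log-likelihood over a grid of lambda values, without plotting.
bc_profile <- boxcox(y ~ x, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# Keep the lambda with the highest profile log-likelihood.
best_lambda <- bc_profile$x[which.max(bc_profile$y)]
best_lambda
```

Using `which.max()` gives the same result as sorting the grid by decreasing likelihood and taking the first row.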
# Before running the models, let's look at the datasets we will be working with.
st(train_imputed)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 7.076 | 6.318 | -18.1 | 5.2 | 9.5 | 34.4 |
| VolatileAcidity | 12795 | 0.324 | 0.784 | -2.79 | 0.13 | 0.64 | 3.68 |
| CitricAcid | 12795 | 0.308 | 0.862 | -3.24 | 0.03 | 0.58 | 3.86 |
| ResidualSugar | 12795 | 5.44 | 33.737 | -127.8 | -1.8 | 15.85 | 141.15 |
| Chlorides | 12795 | 0.056 | 0.319 | -1.171 | -0.029 | 0.156 | 1.351 |
| FreeSulfurDioxide | 12795 | 31.198 | 148.59 | -555 | 0 | 70 | 623 |
| TotalSulfurDioxide | 12795 | 121.383 | 231.196 | -823 | 27.5 | 208 | 1057 |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| pH | 12795 | 3.207 | 0.681 | 0.48 | 2.95 | 3.47 | 6.13 |
| Sulphates | 12795 | 0.525 | 0.93 | -3.13 | 0.28 | 0.86 | 4.24 |
| Alcohol | 12795 | 10.493 | 3.738 | -4.7 | 9 | 12.4 | 26.5 |
| LabelAppeal | 12795 | | | | | | |
| … -2 | 504 | 3.9% | | | | | |
| … -1 | 3136 | 24.5% | | | | | |
| … 0 | 5617 | 43.9% | | | | | |
| … 1 | 3048 | 23.8% | | | | | |
| … 2 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
st(train_imp_plusminconst)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 26.176 | 6.318 | 1 | 24.3 | 28.6 | 53.5 |
| VolatileAcidity | 12795 | 4.114 | 0.784 | 1 | 3.92 | 4.43 | 7.47 |
| CitricAcid | 12795 | 4.548 | 0.862 | 1 | 4.27 | 4.82 | 8.1 |
| ResidualSugar | 12795 | 134.24 | 33.737 | 1 | 127 | 144.65 | 269.95 |
| Chlorides | 12795 | 2.227 | 0.319 | 1 | 2.142 | 2.327 | 3.522 |
| FreeSulfurDioxide | 12795 | 587.198 | 148.59 | 1 | 556 | 626 | 1179 |
| TotalSulfurDioxide | 12795 | 945.383 | 231.196 | 1 | 851.5 | 1032 | 1881 |
| Density | 12795 | 2.882 | 0.027 | 2.776 | 2.876 | 2.889 | 2.987 |
| pH | 12795 | 4.687 | 0.681 | 1.96 | 4.43 | 4.95 | 7.61 |
| Sulphates | 12795 | 4.655 | 0.93 | 1 | 4.41 | 4.99 | 8.37 |
| Alcohol | 12795 | 16.193 | 3.738 | 1 | 14.7 | 18.1 | 32.2 |
| LabelAppeal | 12795 | | | | | | |
| … 0 | 504 | 3.9% | | | | | |
| … 1 | 3136 | 24.5% | | | | | |
| … 2 | 5617 | 43.9% | | | | | |
| … 3 | 3048 | 23.8% | | | | | |
| … 4 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
st(train_abslog)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| Density | 12795 | 0.69 | 0.013 | 0.636 | 0.687 | 0.693 | 0.742 |
| AcidIndex | 12795 | 2.161 | 0.14 | 1.609 | 2.079 | 2.197 | 2.89 |
| pH | 12795 | 1.423 | 0.171 | 0.392 | 1.374 | 1.497 | 1.964 |
| FixedAcidity | 12795 | 2.04 | 0.617 | 0 | 1.887 | 2.38 | 3.567 |
| VolatileAcidity | 12795 | 0.449 | 0.293 | 0 | 0.223 | 0.647 | 1.543 |
| CitricAcid | 12795 | 0.47 | 0.31 | 0 | 0.247 | 0.678 | 1.581 |
| ResidualSugar | 12795 | 2.597 | 1.17 | 0 | 1.526 | 3.679 | 4.957 |
| Chlorides | 12795 | 0.185 | 0.174 | 0 | 0.045 | 0.313 | 0.855 |
| FreeSulfurDioxide | 12795 | 4.14 | 1.115 | 0 | 3.367 | 5.147 | 6.436 |
| TotalSulfurDioxide | 12795 | 4.993 | 0.903 | 0 | 4.605 | 5.572 | 6.964 |
| Sulphates | 12795 | 0.562 | 0.304 | 0 | 0.358 | 0.737 | 1.656 |
| Alcohol | 12795 | 2.382 | 0.388 | 0 | 2.303 | 2.595 | 3.314 |
| LabelAppeal | 12795 | | | | | | |
| … 0 | 504 | 3.9% | | | | | |
| … 1 | 3136 | 24.5% | | | | | |
| … 2 | 5617 | 43.9% | | | | | |
| … 3 | 3048 | 23.8% | | | | | |
| … 4 | 490 | 3.8% | | | | | |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
st(train_imp_abs)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 8.063 | 4.996 | 0 | 5.6 | 9.8 | 34.4 |
| VolatileAcidity | 12795 | 0.641 | 0.556 | 0 | 0.25 | 0.91 | 3.68 |
| CitricAcid | 12795 | 0.686 | 0.606 | 0 | 0.28 | 0.97 | 3.86 |
| ResidualSugar | 12795 | 23.326 | 24.972 | 0 | 3.6 | 38.6 | 141.15 |
| Chlorides | 12795 | 0.223 | 0.234 | 0 | 0.046 | 0.368 | 1.351 |
| FreeSulfurDioxide | 12795 | 106.642 | 108.069 | 0 | 28 | 171 | 623 |
| TotalSulfurDioxide | 12795 | 204.015 | 162.976 | 0 | 99 | 262 | 1057 |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| pH | 12795 | 3.207 | 0.681 | 0.48 | 2.95 | 3.47 | 6.13 |
| Sulphates | 12795 | 0.845 | 0.652 | 0 | 0.43 | 1.09 | 4.24 |
| Alcohol | 12795 | 10.526 | 3.642 | 0 | 9 | 12.4 | 26.5 |
| LabelAppeal | 12795 | | | | | | |
| … 1 | 504 | 3.9% | | | | | |
| … 2 | 3136 | 24.5% | | | | | |
| … 3 | 5617 | 43.9% | | | | | |
| … 4 | 3048 | 23.8% | | | | | |
| … 5 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
st(train_abs_bc)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 4.029 | 1.926 | 1 | 3 | 5 | 9 |
| FixedAcidity | 12795 | 51.548 | 14.092 | 2.285 | 47.033 | 56.71 | 117.394 |
| VolatileAcidity | 12795 | 7.014 | 1.275 | 2.285 | 6.68 | 7.513 | 12.763 |
| CitricAcid | 12795 | 7.731 | 1.423 | 2.285 | 7.25 | 8.161 | 13.903 |
| ResidualSugar | 12795 | 349.423 | 102.119 | 2.285 | 324.806 | 378.871 | 793.976 |
| Chlorides | 12795 | 4.045 | 0.475 | 2.285 | 3.914 | 4.19 | 6.041 |
| FreeSulfurDioxide | 12795 | 2015.385 | 597.457 | 2.285 | 1874.293 | 2158.325 | 4586.011 |
| TotalSulfurDioxide | 12795 | 3550.688 | 1019.146 | 2.285 | 3112.798 | 3913.488 | 7999.852 |
| Density | 12795 | 5.037 | 0.041 | 4.873 | 5.027 | 5.046 | 5.2 |
| pH | 12795 | 7.952 | 1.132 | 3.645 | 7.513 | 8.378 | 13.015 |
| Sulphates | 12795 | 7.91 | 1.541 | 2.285 | 7.48 | 8.446 | 14.396 |
| Alcohol | 12795 | 29.843 | 7.646 | 2.285 | 26.633 | 33.642 | 65.024 |
| LabelAppeal | 12795 | 3.991 | 0.891 | 2 | 3 | 5 | 6 |
| AcidIndex | 12795 | 13.342 | 2.443 | 6.81 | 11.924 | 13.721 | 31.346 |
| STARS | 12795 | 3.506 | 1.187 | 2 | 2 | 4 | 6 |
Building the models
Linear regression #1
This first model used all variables from the imputed data, with “AcidIndex” treated first as a factor and then as numeric. Surprisingly, we got a relatively decent result, with an R-squared of around 0.54. As a numeric variable, “AcidIndex” is statistically significant, whereas as a factor none of its 14 levels was significant. Therefore, we kept “AcidIndex” as numeric.
After removing the statistically insignificant variables, the R-squared stayed at about 0.54 (adjusted R-squared of 0.5408 in the summary below).
Plotting the residuals to validate the model suggested a strong linear relationship.
Some of the coefficients indicated the following:
• Naturally, the more STARS a wine was given, the more cases one would expect to sell.
• AcidIndex was negatively correlated with sales: the more acidic a wine is, the fewer cases it is expected to sell.
• LabelAppeal was statistically significant and positively associated with sales at every level; for example, a rating of zero corresponds to about 0.83 more cases than the lowest rating (-2).
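Switching “AcidIndex” between factor and numeric has a side effect worth keeping in mind: `as.numeric()` applied to a factor returns the internal level codes rather than the original values. The following is a small sketch with hypothetical values (`acid_index`, `codes`, and `values` are illustrative names, not objects from this report).

```r
# Hypothetical AcidIndex-like values, stored as a factor.
acid_index <- factor(c(4, 7, 8, 17))

# as.numeric() on a factor returns the internal level codes (1, 2, 3, 4) ...
codes <- as.numeric(acid_index)

# ... while going through as.character() first recovers the original values.
values <- as.numeric(as.character(acid_index))
```

In this report the observed AcidIndex values appear to run contiguously from 4 to 17, so the level codes are simply the original values shifted by 3. A shift changes the intercept of a linear model but not the slope.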
# Start with linear regression.
# Available data frames:
# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values --> train_imp_abs
###############################################################################
# Use for checking if "AcidIndex" as numeric or factor makes any difference
###############################################################################
# raw --> train_imputed
# Note: as.numeric() on a factor returns the level codes, so AcidIndex 4..17
# becomes 1..14 (see the summary below).
train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)
st(train_imputed)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 7.076 | 6.318 | -18.1 | 5.2 | 9.5 | 34.4 |
| VolatileAcidity | 12795 | 0.324 | 0.784 | -2.79 | 0.13 | 0.64 | 3.68 |
| CitricAcid | 12795 | 0.308 | 0.862 | -3.24 | 0.03 | 0.58 | 3.86 |
| ResidualSugar | 12795 | 5.44 | 33.737 | -127.8 | -1.8 | 15.85 | 141.15 |
| Chlorides | 12795 | 0.056 | 0.319 | -1.171 | -0.029 | 0.156 | 1.351 |
| FreeSulfurDioxide | 12795 | 31.198 | 148.59 | -555 | 0 | 70 | 623 |
| TotalSulfurDioxide | 12795 | 121.383 | 231.196 | -823 | 27.5 | 208 | 1057 |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| pH | 12795 | 3.207 | 0.681 | 0.48 | 2.95 | 3.47 | 6.13 |
| Sulphates | 12795 | 0.525 | 0.93 | -3.13 | 0.28 | 0.86 | 4.24 |
| Alcohol | 12795 | 10.493 | 3.738 | -4.7 | 9 | 12.4 | 26.5 |
| LabelAppeal | 12795 | | | | | | |
| … -2 | 504 | 3.9% | | | | | |
| … -1 | 3136 | 24.5% | | | | | |
| … 0 | 5617 | 43.9% | | | | | |
| … 1 | 3048 | 23.8% | | | | | |
| … 2 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 4.773 | 1.324 | 1 | 4 | 5 | 14 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
# # scaled through minplusconstant --> train_imp_plusminconst
train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
st(train_imp_plusminconst)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 26.176 | 6.318 | 1 | 24.3 | 28.6 | 53.5 |
| VolatileAcidity | 12795 | 4.114 | 0.784 | 1 | 3.92 | 4.43 | 7.47 |
| CitricAcid | 12795 | 4.548 | 0.862 | 1 | 4.27 | 4.82 | 8.1 |
| ResidualSugar | 12795 | 134.24 | 33.737 | 1 | 127 | 144.65 | 269.95 |
| Chlorides | 12795 | 2.227 | 0.319 | 1 | 2.142 | 2.327 | 3.522 |
| FreeSulfurDioxide | 12795 | 587.198 | 148.59 | 1 | 556 | 626 | 1179 |
| TotalSulfurDioxide | 12795 | 945.383 | 231.196 | 1 | 851.5 | 1032 | 1881 |
| Density | 12795 | 2.882 | 0.027 | 2.776 | 2.876 | 2.889 | 2.987 |
| pH | 12795 | 4.687 | 0.681 | 1.96 | 4.43 | 4.95 | 7.61 |
| Sulphates | 12795 | 4.655 | 0.93 | 1 | 4.41 | 4.99 | 8.37 |
| Alcohol | 12795 | 16.193 | 3.738 | 1 | 14.7 | 18.1 | 32.2 |
| LabelAppeal | 12795 | | | | | | |
| … 0 | 504 | 3.9% | | | | | |
| … 1 | 3136 | 24.5% | | | | | |
| … 2 | 5617 | 43.9% | | | | | |
| … 3 | 3048 | 23.8% | | | | | |
| … 4 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 4.773 | 1.324 | 1 | 4 | 5 | 14 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
#
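# # scaled (not all) and logged (continuous variables) --> train_abslog
# Note: the right-hand side reads from train_abs, not train_abslog, so the logged
# AcidIndex is overwritten (the summary below shows the 4..17 scale again).
train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)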
st(train_abslog)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| Density | 12795 | 0.69 | 0.013 | 0.636 | 0.687 | 0.693 | 0.742 |
| AcidIndex | 12795 | 7.773 | 1.324 | 4 | 7 | 8 | 17 |
| pH | 12795 | 1.423 | 0.171 | 0.392 | 1.374 | 1.497 | 1.964 |
| FixedAcidity | 12795 | 2.04 | 0.617 | 0 | 1.887 | 2.38 | 3.567 |
| VolatileAcidity | 12795 | 0.449 | 0.293 | 0 | 0.223 | 0.647 | 1.543 |
| CitricAcid | 12795 | 0.47 | 0.31 | 0 | 0.247 | 0.678 | 1.581 |
| ResidualSugar | 12795 | 2.597 | 1.17 | 0 | 1.526 | 3.679 | 4.957 |
| Chlorides | 12795 | 0.185 | 0.174 | 0 | 0.045 | 0.313 | 0.855 |
| FreeSulfurDioxide | 12795 | 4.14 | 1.115 | 0 | 3.367 | 5.147 | 6.436 |
| TotalSulfurDioxide | 12795 | 4.993 | 0.903 | 0 | 4.605 | 5.572 | 6.964 |
| Sulphates | 12795 | 0.562 | 0.304 | 0 | 0.358 | 0.737 | 1.656 |
| Alcohol | 12795 | 2.382 | 0.388 | 0 | 2.303 | 2.595 | 3.314 |
| LabelAppeal | 12795 | | | | | | |
| … 0 | 504 | 3.9% | | | | | |
| … 1 | 3136 | 24.5% | | | | | |
| … 2 | 5617 | 43.9% | | | | | |
| … 3 | 3048 | 23.8% | | | | | |
| … 4 | 490 | 3.8% | | | | | |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
#
# # scaled through absolute values
train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
st(train_imp_abs)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 12795 | 3.029 | 1.926 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 8.063 | 4.996 | 0 | 5.6 | 9.8 | 34.4 |
| VolatileAcidity | 12795 | 0.641 | 0.556 | 0 | 0.25 | 0.91 | 3.68 |
| CitricAcid | 12795 | 0.686 | 0.606 | 0 | 0.28 | 0.97 | 3.86 |
| ResidualSugar | 12795 | 23.326 | 24.972 | 0 | 3.6 | 38.6 | 141.15 |
| Chlorides | 12795 | 0.223 | 0.234 | 0 | 0.046 | 0.368 | 1.351 |
| FreeSulfurDioxide | 12795 | 106.642 | 108.069 | 0 | 28 | 171 | 623 |
| TotalSulfurDioxide | 12795 | 204.015 | 162.976 | 0 | 99 | 262 | 1057 |
| Density | 12795 | 0.994 | 0.027 | 0.888 | 0.988 | 1.001 | 1.099 |
| pH | 12795 | 3.207 | 0.681 | 0.48 | 2.95 | 3.47 | 6.13 |
| Sulphates | 12795 | 0.845 | 0.652 | 0 | 0.43 | 1.09 | 4.24 |
| Alcohol | 12795 | 10.526 | 3.642 | 0 | 9 | 12.4 | 26.5 |
| LabelAppeal | 12795 | | | | | | |
| … 1 | 504 | 3.9% | | | | | |
| … 2 | 3136 | 24.5% | | | | | |
| … 3 | 5617 | 43.9% | | | | | |
| … 4 | 3048 | 23.8% | | | | | |
| … 5 | 490 | 3.8% | | | | | |
| AcidIndex | 12795 | 4.773 | 1.324 | 1 | 4 | 5 | 14 |
| STARS | 12795 | | | | | | |
| … 0 | 3359 | 26.3% | | | | | |
| … 1 | 3042 | 23.8% | | | | | |
| … 2 | 3570 | 27.9% | | | | | |
| … 3 | 2212 | 17.3% | | | | | |
| … 4 | 612 | 4.8% | | | | | |
#Model 1 Linear Regression -
linear_r1 <- lm(formula=TARGET ~
#FixedAcidity
+VolatileAcidity
#+CitricAcid
#+ResidualSugar
+Chlorides
+FreeSulfurDioxide
+TotalSulfurDioxide
+Density
+pH
+Sulphates
+Alcohol
+LabelAppeal
+AcidIndex
+STARS, data=train_imputed)
summary(linear_r1)
##
## Call:
## lm(formula = TARGET ~ +VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9533 -0.8596 0.0234 0.8432 6.1683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.293e+00 4.433e-01 5.174 2.32e-07 ***
## VolatileAcidity -9.466e-02 1.477e-02 -6.409 1.52e-10 ***
## Chlorides -1.295e-01 3.630e-02 -3.569 0.00036 ***
## FreeSulfurDioxide 2.503e-04 7.782e-05 3.216 0.00130 **
## TotalSulfurDioxide 2.327e-04 5.003e-05 4.651 3.34e-06 ***
## Density -8.135e-01 4.357e-01 -1.867 0.06188 .
## pH -3.954e-02 1.698e-02 -2.329 0.01990 *
## Sulphates -3.428e-02 1.243e-02 -2.759 0.00581 **
## Alcohol 1.244e-02 3.100e-03 4.014 6.00e-05 ***
## LabelAppeal-1 3.600e-01 6.283e-02 5.730 1.03e-08 ***
## LabelAppeal0 8.268e-01 6.126e-02 13.495 < 2e-16 ***
## LabelAppeal1 1.291e+00 6.399e-02 20.169 < 2e-16 ***
## LabelAppeal2 1.882e+00 8.431e-02 22.317 < 2e-16 ***
## AcidIndex -1.991e-01 8.933e-03 -22.282 < 2e-16 ***
## STARS1 1.363e+00 3.291e-02 41.415 < 2e-16 ***
## STARS2 2.398e+00 3.199e-02 74.979 < 2e-16 ***
## STARS3 2.963e+00 3.705e-02 79.971 < 2e-16 ***
## STARS4 3.647e+00 5.923e-02 61.574 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.305 on 12777 degrees of freedom
## Multiple R-squared: 0.5414, Adjusted R-squared: 0.5408
## F-statistic: 887.3 on 17 and 12777 DF, p-value: < 2.2e-16
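# plot() on an lm object draws base-graphics diagnostics, so a ggplot2 theme
# such as theme_cowplot() cannot be added to it.
plot(linear_r1)
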
vif(linear_r1)
## GVIF Df GVIF^(1/(2*Df))
## VolatileAcidity 1.006766 1 1.003377
## Chlorides 1.003662 1 1.001829
## FreeSulfurDioxide 1.003978 1 1.001987
## TotalSulfurDioxide 1.004660 1 1.002327
## Density 1.003654 1 1.001825
## pH 1.005040 1 1.002517
## Sulphates 1.002371 1 1.001185
## Alcohol 1.007712 1 1.003849
## LabelAppeal 1.118877 4 1.014140
## AcidIndex 1.050180 1 1.024783
## STARS 1.167548 4 1.019552
Poisson Model #1
For the first Poisson model, I used the absolute-value transformation of the independent variables, based on the early distributions. Several indicators suggested a reasonably good model. The estimated dispersion was under 1 (about 0.88), so the one-sided test shows no evidence of overdispersion, although the two-sided DHARMa test flags the underdispersion as significant. I then ran another test on the residuals, and the fitted line was flat. Finally, I removed the variables with high p-values, but the score did not change much.
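The dispersion figure mentioned above can also be approximated by hand as the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below uses simulated data and hypothetical names (`x`, `y`, `fit`, `pearson_disp`); it is not the exact estimator used by the dispersion tests in this report, only a quick check in the same spirit.

```r
set.seed(42)
# Hypothetical data: counts that really are Poisson given x.
n <- 2000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion: sum of squared Pearson residuals over residual degrees of freedom.
pearson_disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
pearson_disp
```

Values close to 1 are consistent with the Poisson assumption, values well above 1 point to overdispersion, and values well below 1 point to underdispersion.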
# Poisson #1
# available dataframes:
# raw --> train_imputed
# scaled through minplusconstant --> train_imp_plusminconst
# scaled (not all) and logged (continuous variables) --> train_abslog
# scaled through absolute values --> train_imp_abs
###############################################################################
# Use for checking if "AcidIndex" as numeric or factor makes any difference
###############################################################################
# raw --> train_imputed
# train_imputed$AcidIndex <- as.factor(train_imputed$AcidIndex)
# train_imputed$AcidIndex <- as.numeric(train_imputed$AcidIndex)
# st(train_imputed)
#
# # # scaled through minplusconstant --> train_imp_plusminconst
# train_imp_plusminconst$AcidIndex <- as.factor(train_imp_plusminconst$AcidIndex)
# train_imp_plusminconst$AcidIndex <- as.numeric(train_imp_plusminconst$AcidIndex)
# st(train_imp_plusminconst)
# #
# # # scaled (not all) and logged (continuous variables) --> train_abslog
# train_abslog$AcidIndex <- as.factor(train_abs$AcidIndex)
# train_abslog$AcidIndex <- as.numeric(train_abs$AcidIndex)
# st(train_abslog)
# #
# # # scaled through absolute values
# train_imp_abs$AcidIndex <-as.factor(train_imp_abs$AcidIndex)
# train_imp_abs$AcidIndex <-as.numeric(train_imp_abs$AcidIndex)
# st(train_imp_abs)
#Model 1 Poisson Regression -
poisson_1 <- glm(formula=TARGET ~
# FixedAcidity
+VolatileAcidity
#+CitricAcid
#+ResidualSugar
#+Chlorides
#+FreeSulfurDioxide
+TotalSulfurDioxide
#+Density
#+pH
#+Sulphates
+Alcohol
+LabelAppeal
+AcidIndex
+STARS,
family = poisson,
data=train_imp_abs)
summary(poisson_1)
##
## Call:
## glm(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide +
## Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson,
## data = train_imp_abs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2543 -0.6402 -0.0075 0.4515 3.7857
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.538e-01 4.797e-02 3.206 0.00135 **
## VolatileAcidity -3.705e-02 9.396e-03 -3.944 8.03e-05 ***
## TotalSulfurDioxide 8.467e-05 3.116e-05 2.718 0.00658 **
## Alcohol 3.983e-03 1.403e-03 2.839 0.00453 **
## LabelAppeal2 2.357e-01 3.798e-02 6.205 5.45e-10 ***
## LabelAppeal3 4.254e-01 3.705e-02 11.480 < 2e-16 ***
## LabelAppeal4 5.582e-01 3.769e-02 14.813 < 2e-16 ***
## LabelAppeal5 6.946e-01 4.243e-02 16.372 < 2e-16 ***
## AcidIndex -8.031e-02 4.497e-03 -17.861 < 2e-16 ***
## STARS1 7.702e-01 1.953e-02 39.439 < 2e-16 ***
## STARS2 1.089e+00 1.823e-02 59.749 < 2e-16 ***
## STARS3 1.210e+00 1.918e-02 63.090 < 2e-16 ***
## STARS4 1.330e+00 2.428e-02 54.763 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 13669 on 12782 degrees of freedom
## AIC: 45637
##
## Number of Fisher Scoring iterations: 6
dispersiontest(poisson_1)
##
## Overdispersion test
##
## data: poisson_1
## z = -8.9098, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 0.8848692
sim_p1 <- simulateResiduals(poisson_1, refit = TRUE)
# testDispersion() replaces the deprecated testOverdispersion().
testDispersion(sim_p1)

##
## DHARMa nonparametric dispersion test via mean deviance residual fitted
## vs. simulated-refitted
##
## data: simulationOutput
## dispersion = 0.88397, p-value < 2.2e-16
## alternative hypothesis: two.sided
# plot() replaces the deprecated plotSimulatedResiduals().
plot(sim_p1)

plot(poisson_1)




knitr::kable(vif(poisson_1), "html")
| | GVIF | Df | GVIF^(1/(2*Df)) |
| VolatileAcidity | 1.004014 | 1 | 1.002005 |
| TotalSulfurDioxide | 1.003350 | 1 | 1.001673 |
| Alcohol | 1.010637 | 1 | 1.005304 |
| LabelAppeal | 1.133631 | 4 | 1.015802 |
| AcidIndex | 1.025613 | 1 | 1.012726 |
| STARS | 1.165661 | 4 | 1.019346 |
Poisson Model #2
I tested the remaining transformed datasets on this model, but the results were not much different. For this one, I kept the min + constant (+1) dataset with the Box-Cox power applied (train_abs_bc) for the x variables. Note that TARGET in this dataset was also shifted by 1, so the deviance and AIC are not directly comparable with Poisson Model #1. Although the residual deviance changed, the dispersion test again showed no overdispersion (estimate of about 0.57), indicating the model fits acceptably. As in the other models, the coefficients point in the same direction; AcidIndex, for example, is negatively associated with the expected number of cases sold.
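Because a Poisson model uses a log link, its coefficients are easier to read after exponentiating, which turns them into multiplicative effects on the expected count. The sketch below does this for three coefficients copied from the summary output of this model; the names `coefs`, `rate_ratios`, and `pct_change` are illustrative. Keep in mind that AcidIndex was power-transformed in this dataset, so "one unit" refers to the transformed scale.

```r
# Coefficients copied from the Poisson Model #2 summary (log scale).
coefs <- c(AcidIndex = -3.397e-02, LabelAppeal = 1.019e-01, STARS = 2.343e-01)

# exp() turns a log-scale coefficient into a multiplicative effect on the expected count.
rate_ratios <- exp(coefs)

# Percentage change in the expected count per one-unit increase in each predictor.
pct_change <- 100 * (rate_ratios - 1)
round(pct_change, 1)
```

A rate ratio below 1 (AcidIndex) means the expected count falls as the predictor rises, while a ratio above 1 (LabelAppeal, STARS) means it grows.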
#Model 2 Poisson Regression -
poisson_2 <- glm(formula=TARGET ~
#FixedAcidity
+VolatileAcidity
# +CitricAcid
# +ResidualSugar
+Chlorides
+FreeSulfurDioxide
+TotalSulfurDioxide
# +Density
+pH
+Sulphates
+Alcohol
+LabelAppeal
+AcidIndex
+STARS,
family = poisson,
data=train_abs_bc)
summary(poisson_2)
##
## Call:
## glm(formula = TARGET ~ +VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + pH + Sulphates + Alcohol + LabelAppeal +
## AcidIndex + STARS, family = poisson, data = train_abs_bc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.35957 -0.60161 0.04921 0.48184 2.80743
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.502e-01 7.581e-02 9.896 < 2e-16 ***
## VolatileAcidity -1.558e-02 3.481e-03 -4.475 7.65e-06 ***
## Chlorides -2.337e-02 9.321e-03 -2.508 0.01215 *
## FreeSulfurDioxide 2.057e-05 7.357e-06 2.795 0.00518 **
## TotalSulfurDioxide 1.395e-05 4.337e-06 3.217 0.00130 **
## pH -7.838e-03 3.909e-03 -2.005 0.04495 *
## Sulphates -5.375e-03 2.868e-03 -1.874 0.06094 .
## Alcohol 9.068e-04 5.787e-04 1.567 0.11710
## LabelAppeal 1.019e-01 5.225e-03 19.507 < 2e-16 ***
## AcidIndex -3.397e-02 2.056e-03 -16.524 < 2e-16 ***
## STARS 2.343e-01 3.900e-03 60.062 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 13818 on 12794 degrees of freedom
## Residual deviance: 7692 on 12784 degrees of freedom
## AIC: 47576
##
## Number of Fisher Scoring iterations: 5
dispersiontest(poisson_2)
##
## Overdispersion test
##
## data: poisson_2
## z = -60.603, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 0.5650852
sim_p2 <- simulateResiduals(poisson_2, refit = TRUE)
# testDispersion() replaces the deprecated testOverdispersion().
testDispersion(sim_p2)

##
## DHARMa nonparametric dispersion test via mean deviance residual fitted
## vs. simulated-refitted
##
## data: simulationOutput
## dispersion = 0.56014, p-value < 2.2e-16
## alternative hypothesis: two.sided
# plot() replaces the deprecated plotSimulatedResiduals().
plot(sim_p2)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

plot (poisson_2)




knitr::kable(vif(poisson_2),"html")
| | x |
| VolatileAcidity | 1.005062 |
| Chlorides | 1.002222 |
| FreeSulfurDioxide | 1.002172 |
| TotalSulfurDioxide | 1.002419 |
| pH | 1.004460 |
| Sulphates | 1.001323 |
| Alcohol | 1.008729 |
| LabelAppeal | 1.108977 |
| AcidIndex | 1.036240 |
| STARS | 1.144527 |
Negative Binomial Model #1 and #2
For these models, we did not see much difference compared to Poisson. The estimated theta is very large (about 109,000, with the iteration limit reached), so the negative binomial fit is effectively a Poisson fit. At this point, I considered a zero-inflated model because TARGET and other variables contain a fair number of zeros. I used the log-transformed and the absolute-values datasets for my x variables. Again, there were no significant differences between the two. I tried the zero-inflated models, but found few differences, other than the STARS levels proving insignificant when 3 or 4 stars were given. This result made me question the previous models, and it appears that deviance begins to increase after a certain point.
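One way to see why the negative binomial results track the Poisson ones is to look at the size of theta. The negative binomial variance is mu + mu^2 / theta, so a very large theta makes the extra term negligible. The sketch below uses the theta reported by `glm.nb` in this section and a hypothetical mean count `mu`; `nb_variance` and `max_gap` are illustrative names.

```r
mu <- 3           # hypothetical mean count, roughly the scale of TARGET
theta <- 109478   # theta reported by glm.nb in this section
k <- 0:20

# Negative binomial variance is mu + mu^2 / theta; the extra term vanishes for large theta.
nb_variance <- mu + mu^2 / theta

# Largest gap between the two probability mass functions over k = 0, ..., 20.
max_gap <- max(abs(dnbinom(k, size = theta, mu = mu) - dpois(k, lambda = mu)))
```

With a theta this large the two distributions are practically indistinguishable, which is consistent with the near-identical coefficients and deviances in the outputs.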
#Model 1 Negative Binomial Model -
nb_1 <- glm.nb(formula=TARGET ~
#FixedAcidity
+VolatileAcidity
# +CitricAcid
# +ResidualSugar
#+Chlorides
+FreeSulfurDioxide
+TotalSulfurDioxide
# +Density
#+pH
+Sulphates
+Alcohol
+LabelAppeal
+AcidIndex
+STARS,
data=train_abs_bc
#link=log
)
summary(nb_1)
##
## Call:
## glm.nb(formula = TARGET ~ +VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + LabelAppeal +
## AcidIndex + STARS, data = train_abs_bc, init.theta = 109478.1766,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.36553 -0.59808 0.04932 0.48309 2.80871
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.901e-01 5.671e-02 10.407 < 2e-16 ***
## VolatileAcidity -1.572e-02 3.481e-03 -4.517 6.27e-06 ***
## FreeSulfurDioxide 2.087e-05 7.356e-06 2.838 0.00454 **
## TotalSulfurDioxide 1.401e-05 4.337e-06 3.231 0.00123 **
## Sulphates -5.352e-03 2.868e-03 -1.866 0.06203 .
## Alcohol 9.515e-04 5.784e-04 1.645 0.09996 .
## LabelAppeal 1.016e-01 5.224e-03 19.451 < 2e-16 ***
## AcidIndex -3.381e-02 2.051e-03 -16.482 < 2e-16 ***
## STARS 2.346e-01 3.898e-03 60.180 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(109478.2) family taken to be 1)
##
## Null deviance: 13817.9 on 12794 degrees of freedom
## Residual deviance: 7701.9 on 12786 degrees of freedom
## AIC: 47584
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 109478
## Std. Err.: 119125
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -47563.93
plot (nb_1)




knitr::kable(vif(nb_1),"html")
| | x |
| VolatileAcidity | 1.004856 |
| FreeSulfurDioxide | 1.001905 |
| TotalSulfurDioxide | 1.002394 |
| Sulphates | 1.001305 |
| Alcohol | 1.008101 |
| LabelAppeal | 1.108525 |
| AcidIndex | 1.032101 |
| STARS | 1.143609 |
#Model 2 Negative Binomial Model --
nb_2 <- glm.nb(formula=TARGET ~
#FixedAcidity
+VolatileAcidity
# +CitricAcid
# +ResidualSugar
#+Chlorides
#+FreeSulfurDioxide
+TotalSulfurDioxide
# +Density
#+pH
+Sulphates
+Alcohol
+LabelAppeal
+AcidIndex
+STARS,
data=train_abs_bc
#link=log
)
summary(nb_2)
##
## Call:
## glm.nb(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide +
## Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abs_bc,
## init.theta = 109393.0652, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.35227 -0.59700 0.05004 0.48066 2.80722
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.342e-01 5.454e-02 11.629 < 2e-16 ***
## VolatileAcidity -1.576e-02 3.481e-03 -4.528 5.95e-06 ***
## TotalSulfurDioxide 1.412e-05 4.337e-06 3.255 0.00113 **
## Sulphates -5.287e-03 2.868e-03 -1.844 0.06524 .
## Alcohol 9.209e-04 5.784e-04 1.592 0.11136
## LabelAppeal 1.019e-01 5.224e-03 19.500 < 2e-16 ***
## AcidIndex -3.401e-02 2.050e-03 -16.587 < 2e-16 ***
## STARS 2.346e-01 3.898e-03 60.189 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(109393.1) family taken to be 1)
##
## Null deviance: 13818 on 12794 degrees of freedom
## Residual deviance: 7710 on 12787 degrees of freedom
## AIC: 47590
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 109393
## Std. Err.: 119037
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -47571.97
plot (nb_2)




knitr::kable(vif(nb_2),"html")
| | x |
| VolatileAcidity | 1.004839 |
| TotalSulfurDioxide | 1.002319 |
| Sulphates | 1.001239 |
| Alcohol | 1.007742 |
| LabelAppeal | 1.108272 |
| AcidIndex | 1.031013 |
| STARS | 1.143736 |
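The very large `Theta` in the summary above can be read alongside the negative binomial variance function used by `MASS::glm.nb()`, Var(Y) = mu + mu^2 / theta. As a rough sketch (the mean `mu` of 3 cases is an assumed illustrative value, not taken from the data), the extra-Poisson term is negligible at this theta. That suggests the fit is effectively a Poisson model, and the "iteration limit reached" warning is consistent with theta drifting toward infinity.

```r
theta <- 109393  # Theta reported in summary(nb_2)
mu <- 3          # assumed illustrative mean count, not estimated from the data

var_poisson <- mu                 # Poisson variance equals the mean
var_negbin <- mu + mu^2 / theta   # negative binomial variance (MASS parameterization)

extra <- var_negbin - var_poisson
print(extra)
```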
Negative Binomial Model #3
# Model 3: zero-inflated model (the NAs were imputed as zeros, and TARGET has quite a few zeros as well)
# Note: zeroinfl() defaults to dist = "poisson", so the count part below is Poisson, not negative binomial
nb_3 <- zeroinfl(formula=TARGET ~
#FixedAcidity
+VolatileAcidity
# +CitricAcid
# +ResidualSugar
#+Chlorides
#+FreeSulfurDioxide
+TotalSulfurDioxide
# +Density
#+pH
+Sulphates
+Alcohol
+LabelAppeal
+AcidIndex
+STARS,
data=train_abslog
#link=log
)
summary(nb_3)
##
## Call:
## zeroinfl(formula = TARGET ~ +VolatileAcidity + TotalSulfurDioxide + Sulphates +
## Alcohol + LabelAppeal + AcidIndex + STARS, data = train_abslog)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.281455 -0.427760 0.004961 0.383624 5.883963
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.460517 0.074704 6.165 7.07e-10 ***
## VolatileAcidity -0.028028 0.017982 -1.559 0.11909
## TotalSulfurDioxide -0.001421 0.006141 -0.231 0.81694
## Sulphates 0.011055 0.017086 0.647 0.51760
## Alcohol 0.060271 0.013780 4.374 1.22e-05 ***
## LabelAppeal1 0.437858 0.041171 10.635 < 2e-16 ***
## LabelAppeal2 0.726474 0.040239 18.054 < 2e-16 ***
## LabelAppeal3 0.916378 0.040909 22.400 < 2e-16 ***
## LabelAppeal4 1.073404 0.045439 23.623 < 2e-16 ***
## AcidIndex -0.020142 0.004822 -4.177 2.95e-05 ***
## STARS1 0.061189 0.021127 2.896 0.00378 **
## STARS2 0.183247 0.019726 9.289 < 2e-16 ***
## STARS3 0.281102 0.020657 13.608 < 2e-16 ***
## STARS4 0.380375 0.025581 14.869 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.72883 0.48751 -9.700 < 2e-16 ***
## VolatileAcidity 0.44829 0.11488 3.902 9.53e-05 ***
## TotalSulfurDioxide -0.27086 0.03554 -7.622 2.50e-14 ***
## Sulphates 0.44717 0.11119 4.022 5.78e-05 ***
## Alcohol 0.23974 0.09008 2.661 0.00778 **
## LabelAppeal1 1.45212 0.31942 4.546 5.47e-06 ***
## LabelAppeal2 2.19766 0.31662 6.941 3.89e-12 ***
## LabelAppeal3 2.90407 0.32204 9.018 < 2e-16 ***
## LabelAppeal4 3.33806 0.37494 8.903 < 2e-16 ***
## AcidIndex 0.41565 0.02565 16.202 < 2e-16 ***
## STARS1 -2.09320 0.07659 -27.330 < 2e-16 ***
## STARS2 -5.71200 0.32653 -17.493 < 2e-16 ***
## STARS3 -20.23807 341.00123 -0.059 0.95267
## STARS4 -20.37694 644.74712 -0.032 0.97479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 33
## Log-likelihood: -2.035e+04 on 28 Df
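To read the two coefficient blocks together: a zero-inflated model's expected count is (1 - pi) * mu, where mu comes from the count part (log link) and pi, the probability of a structural zero, comes from the zero-inflation part (logit link). A minimal sketch using only the two intercepts above, that is, a wine at the baseline factor levels with every numeric predictor set to zero (an illustrative assumption, not a realistic wine):

```r
b0_count <- 0.460517  # count-model intercept from summary(nb_3)
b0_zero <- -4.72883   # zero-inflation intercept from summary(nb_3)

mu <- exp(b0_count)          # mean of the count component (log link)
pi_zero <- plogis(b0_zero)   # probability of a structural zero (logit link)
expected <- (1 - pi_zero) * mu

print(c(mu = mu, pi_zero = pi_zero, expected = expected))
```

As far as I understand, `predict(nb_3, type = "response")` returns this same quantity for each row, using the full set of coefficients.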
Model Selection:
Based on the information above, I selected Poisson model #2, which showed low dispersion, with a p-value of 1. Although its AIC was relatively high, that value is hard to interpret on its own.
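The dispersion and AIC checks mentioned above can be sketched on simulated data as follows. The data frame `sim` and the models `m_small` and `m_full` are hypothetical stand-ins, not the models fitted in this report. The Pearson statistic divided by the residual degrees of freedom should sit near 1 for a well-specified Poisson model, and AIC is only meaningful as a comparison between models fitted to the same response.

```r
set.seed(42)
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rpois(n, lambda = exp(0.5 + 0.3 * sim$x1))  # x2 has no real effect

m_small <- glm(y ~ x1, family = poisson, data = sim)
m_full <- glm(y ~ x1 + x2, family = poisson, data = sim)

# Pearson dispersion: values well above 1 indicate overdispersion
dispersion <- sum(residuals(m_small, type = "pearson")^2) / df.residual(m_small)

print(dispersion)
print(AIC(m_small, m_full))  # lower AIC is preferred among these two
```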
#Prepare Evaluation dataset
eval$STARS[is.na(eval$STARS)] <- 0
eval$STARS <-as.factor(eval$STARS)
eval$LabelAppeal <- as.factor(eval$LabelAppeal)
st(eval)
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 0 | | | | | | |
| … No | 0 | NaN% | | | | | |
| … Yes | 0 | NaN% | | | | | |
| FixedAcidity | 3335 | 6.864 | 6.318 | -18.2 | 5.2 | 9 | 33.5 |
| VolatileAcidity | 3335 | 0.31 | 0.807 | -2.83 | 0.08 | 0.63 | 3.61 |
| CitricAcid | 3335 | 0.312 | 0.871 | -3.12 | 0 | 0.605 | 3.76 |
| ResidualSugar | 3167 | 5.319 | 34.371 | -128.3 | -2.6 | 17.2 | 145.4 |
| Chlorides | 3197 | 0.061 | 0.314 | -1.15 | 0.016 | 0.171 | 1.263 |
| FreeSulfurDioxide | 3183 | 34.947 | 149.633 | -563 | 3 | 79.25 | 617 |
| TotalSulfurDioxide | 3178 | 123.41 | 225.8 | -769 | 27.25 | 210 | 1004 |
| Density | 3335 | 0.995 | 0.026 | 0.89 | 0.988 | 1.001 | 1.1 |
| pH | 3231 | 3.237 | 0.676 | 0.6 | 2.98 | 3.49 | 6.21 |
| Sulphates | 3025 | 0.535 | 0.905 | -3.07 | 0.33 | 0.82 | 4.18 |
| Alcohol | 3150 | 10.584 | 3.759 | -4.2 | 9 | 12.5 | 25.6 |
| LabelAppeal | 3335 | | | | | | |
| … -2 | 114 | 3.4% | | | | | |
| … -1 | 810 | 24.3% | | | | | |
| … 0 | 1470 | 44.1% | | | | | |
| … 1 | 799 | 24% | | | | | |
| … 2 | 142 | 4.3% | | | | | |
| AcidIndex | 3335 | 7.748 | 1.315 | 5 | 7 | 8 | 17 |
| STARS | 3335 | | | | | | |
| … 0 | 841 | 25.2% | | | | | |
| … 1 | 828 | 24.8% | | | | | |
| … 2 | 902 | 27% | | | | | |
| … 3 | 600 | 18% | | | | | |
| … 4 | 164 | 4.9% | | | | | |
# Work on a copy so the prepared evaluation data stays intact
eval_abs <- eval
# Take absolute values of the numeric predictors that contain negative readings
eval_abs$FixedAcidity <- abs(eval_abs$FixedAcidity)
eval_abs$VolatileAcidity <- abs(eval_abs$VolatileAcidity)
eval_abs$CitricAcid <- abs(eval_abs$CitricAcid)
eval_abs$ResidualSugar <- abs(eval_abs$ResidualSugar)
eval_abs$Chlorides <- abs(eval_abs$Chlorides)
eval_abs$FreeSulfurDioxide <- abs(eval_abs$FreeSulfurDioxide)
eval_abs$TotalSulfurDioxide <- abs(eval_abs$TotalSulfurDioxide)
eval_abs$Sulphates <- abs(eval_abs$Sulphates)
eval_abs$Alcohol <- abs(eval_abs$Alcohol)
# Transform LabelAppeal too. LabelAppeal is a factor here, so as.numeric()
# returns its level codes (1 to 5) and abs() leaves them unchanged.
eval_abs$LabelAppeal <- as.numeric(eval_abs$LabelAppeal)
eval_abs$LabelAppeal <- abs(eval_abs$LabelAppeal)
#eval_abs$LabelAppeal + abs(min(eval_abs$LabelAppeal))-2
eval_abs$LabelAppeal <- as.factor(eval_abs$LabelAppeal)
st(eval_abs) # check that each variable was transformed as expected
Summary Statistics
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
| TARGET | 0 | | | | | | |
| … No | 0 | NaN% | | | | | |
| … Yes | 0 | NaN% | | | | | |
| FixedAcidity | 3335 | 7.967 | 4.854 | 0 | 5.7 | 9.4 | 33.5 |
| VolatileAcidity | 3335 | 0.654 | 0.565 | 0 | 0.25 | 0.93 | 3.61 |
| CitricAcid | 3335 | 0.697 | 0.609 | 0 | 0.29 | 1 | 3.76 |
| ResidualSugar | 3167 | 23.775 | 25.381 | 0.1 | 3.5 | 38.5 | 145.4 |
| Chlorides | 3197 | 0.221 | 0.231 | 0 | 0.045 | 0.369 | 1.263 |
| FreeSulfurDioxide | 3183 | 107.2 | 110.075 | 0 | 27 | 174 | 617 |
| TotalSulfurDioxide | 3178 | 201.511 | 160.003 | 0 | 97 | 261 | 1004 |
| Density | 3335 | 0.995 | 0.026 | 0.89 | 0.988 | 1.001 | 1.1 |
| pH | 3231 | 3.237 | 0.676 | 0.6 | 2.98 | 3.49 | 6.21 |
| Sulphates | 3025 | 0.833 | 0.641 | 0 | 0.43 | 1.06 | 4.18 |
| Alcohol | 3150 | 10.614 | 3.672 | 0 | 9 | 12.5 | 25.6 |
| LabelAppeal | 3335 | | | | | | |
| … 1 | 114 | 3.4% | | | | | |
| … 2 | 810 | 24.3% | | | | | |
| … 3 | 1470 | 44.1% | | | | | |
| … 4 | 799 | 24% | | | | | |
| … 5 | 142 | 4.3% | | | | | |
| AcidIndex | 3335 | 7.748 | 1.315 | 5 | 7 | 8 | 17 |
| STARS | 3335 | | | | | | |
| … 0 | 841 | 25.2% | | | | | |
| … 1 | 828 | 24.8% | | | | | |
| … 2 | 902 | 27% | | | | | |
| … 3 | 600 | 18% | | | | | |
| … 4 | 164 | 4.9% | | | | | |
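One detail in the LabelAppeal conversion is worth checking: `as.numeric()` on a factor returns the internal level codes rather than the labels, which is why the values -2 to 2 appear as 1 to 5 in the table above. A small sketch of the difference, using a made-up vector `f`, shows that going through `as.character()` first recovers the original values.

```r
f <- factor(c(-2, -1, 0, 1, 2, 0))  # made-up example, levels "-2" ... "2"

codes <- as.numeric(f)                 # internal level codes
values <- as.numeric(as.character(f))  # original labels converted back to numbers

print(codes)
print(values)
```

Whichever coding is used, the evaluation data needs the same LabelAppeal levels as the data the model was trained on, otherwise `predict()` may fail or score the wrong level.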
summary(eval_abs)#make sure nothing broke
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Mode:logical Min. : 0.000 Min. :0.0000 Min. :0.0000
## NA's:3335 1st Qu.: 5.700 1st Qu.:0.2500 1st Qu.:0.2900
## Median : 7.000 Median :0.4200 Median :0.4400
## Mean : 7.967 Mean :0.6542 Mean :0.6969
## 3rd Qu.: 9.400 3rd Qu.:0.9300 3rd Qu.:1.0000
## Max. :33.500 Max. :3.6100 Max. :3.7600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. : 0.10 Min. :0.0000 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.50 1st Qu.:0.0450 1st Qu.: 27.0 1st Qu.: 97.0
## Median : 13.50 Median :0.1000 Median : 54.0 Median : 153.0
## Mean : 23.77 Mean :0.2213 Mean :107.2 Mean : 201.5
## 3rd Qu.: 38.50 3rd Qu.:0.3690 3rd Qu.:174.0 3rd Qu.: 261.0
## Max. :145.40 Max. :1.2630 Max. :617.0 Max. :1004.0
## NA's :168 NA's :138 NA's :152 NA's :157
## Density pH Sulphates Alcohol LabelAppeal
## Min. :0.8898 Min. :0.600 Min. :0.0000 Min. : 0.00 1: 114
## 1st Qu.:0.9883 1st Qu.:2.980 1st Qu.:0.4300 1st Qu.: 9.00 2: 810
## Median :0.9946 Median :3.210 Median :0.5900 Median :10.40 3:1470
## Mean :0.9947 Mean :3.237 Mean :0.8331 Mean :10.61 4: 799
## 3rd Qu.:1.0005 3rd Qu.:3.490 3rd Qu.:1.0600 3rd Qu.:12.50 5: 142
## Max. :1.0998 Max. :6.210 Max. :4.1800 Max. :25.60
## NA's :104 NA's :310 NA's :185
## AcidIndex STARS
## Min. : 5.000 0:841
## 1st Qu.: 7.000 1:828
## Median : 8.000 2:902
## Mean : 7.748 3:600
## 3rd Qu.: 8.000 4:164
## Max. :17.000
##
head(eval_abs)
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
| NA | 5.4 | 0.86 | 0.27 | 10.7 | 0.092 | 23 | 398 | 0.985 | 5.02 | 0.64 | 12.3 | 2 | 6 | 0 |
| NA | 12.4 | 0.385 | 0.76 | 19.7 | 1.17 | 37 | 68 | 0.99 | 3.37 | 1.09 | 16 | 3 | 6 | 2 |
| NA | 7.2 | 1.75 | 0.17 | 33 | 0.065 | 9 | 76 | 1.05 | 4.61 | 0.68 | 8.55 | 3 | 8 | 1 |
| NA | 6.2 | 0.1 | 1.8 | 1 | 0.179 | 104 | 89 | 0.989 | 3.2 | 2.11 | 12.3 | 2 | 8 | 1 |
| NA | 11.4 | 0.21 | 0.28 | 1.2 | 0.038 | 70 | 53 | 1.03 | 2.54 | 0.07 | 4.8 | 3 | 10 | 0 |
| NA | 17.6 | 0.04 | 1.15 | 1.4 | 0.535 | 250 | 140 | 0.95 | 3.06 | 0.02 | 11.4 | 4 | 8 | 4 |
# Predict expected case counts on the response scale (not the log scale)
eval_abs$TARGET <- predict(poisson_1, newdata = eval_abs, type = "response")
# Round to whole cases
eval_abs$TARGET <- as.numeric(round(eval_abs$TARGET, 0))
# Inspect projections
head(eval_abs)
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
| 1 | 5.4 | 0.86 | 0.27 | 10.7 | 0.092 | 23 | 398 | 0.985 | 5.02 | 0.64 | 12.3 | 2 | 6 | 0 |
| 3 | 12.4 | 0.385 | 0.76 | 19.7 | 1.17 | 37 | 68 | 0.99 | 3.37 | 1.09 | 16 | 3 | 6 | 2 |
| 2 | 7.2 | 1.75 | 0.17 | 33 | 0.065 | 9 | 76 | 1.05 | 4.61 | 0.68 | 8.55 | 3 | 8 | 1 |
| 2 | 6.2 | 0.1 | 1.8 | 1 | 0.179 | 104 | 89 | 0.989 | 3.2 | 2.11 | 12.3 | 2 | 8 | 1 |
| 1 | 11.4 | 0.21 | 0.28 | 1.2 | 0.038 | 70 | 53 | 1.03 | 2.54 | 0.07 | 4.8 | 3 | 10 | 0 |
| 4 | 17.6 | 0.04 | 1.15 | 1.4 | 0.535 | 250 | 140 | 0.95 | 3.06 | 0.02 | 11.4 | 4 | 8 | 4 |
predictions <- eval_abs
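As a hedged illustration of the prediction step, the sketch below fits a small Poisson GLM on simulated data and scores new rows with `type = "response"`, which returns expected counts on the original scale rather than the log scale. The objects `toy_train`, `toy_new`, and `toy_fit` are hypothetical. Note that a row with a missing predictor yields an `NA` prediction, which may matter here because several evaluation columns still contain NAs.

```r
set.seed(7)
toy_train <- data.frame(x = runif(300, 0, 2))
toy_train$y <- rpois(300, lambda = exp(0.2 + 0.5 * toy_train$x))

toy_fit <- glm(y ~ x, family = poisson, data = toy_train)

toy_new <- data.frame(x = c(0.5, 1.5, NA))
pred_link <- predict(toy_fit, newdata = toy_new)                     # log scale
pred_resp <- predict(toy_fit, newdata = toy_new, type = "response")  # expected counts

toy_new$TARGET <- round(pred_resp)  # whole cases, as in the report
print(toy_new)
```

A quick `sum(is.na(predictions$TARGET))` on the real output would show how many evaluation rows, if any, were left without a prediction.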