In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
-INDEX: Identification Variable (do not use) None
-TARGET: Number of Cases Purchased None
-AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average
-Alcohol: Alcohol Content
-Chlorides: Chloride content of wine
-CitricAcid: Citric Acid Content
-Density: Density of Wine
-FixedAcidity: Fixed Acidity of Wine
-FreeSulfurDioxide: Sulfur Dioxide content of wine
-LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
-ResidualSugar: Residual Sugar of wine
-STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales
-Sulphates: Sulfate content of wine
-TotalSulfurDioxide: Total Sulfur Dioxide of Wine
-VolatileAcidity: Volatile Acid content of wine
-pH: pH of wine
#importing the train an eval data
<- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-training-data.csv")
wine_train_df<- wine_train_df[,2:16]
wine_train_df <- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-evaluation-data.csv")
wine_eval_df<- wine_eval_df[,2:16]
wine_eval_df
#per assignment instructions, we don't use first column 'ID', so we remove it, we performed in above manner to keep all rows
We start off with 3359 N/A’s for STARS. We interpret this as not being rated. We are told from the instruction not to use Index. Every row is a different wine. Also we noticed that in our instructions it states, “the fact that a variable is missing is actually predictive of target” It may be the case that N/A’s may be predictive of the number of sales of wine. Those that are “un-rated” me be indicative of a “cheaper” wine or low sales. We shouldn’t impute the N/A’s since it may have predictive power. We will save these N/A’s in an acceptable form that our models will be able to interpret it. Since We see STARS has a large number of N/A’s it leads us to believe that it may be indicative as it relates to predictions.
We see that all our variables are numeric. This makes it easier since our focus is on quantitative analysis. When we review our variables we notice ‘TotalSulfurDioxide’ has largest range, starts -823 and ends 1057. Based on summaries, we need to further check the variation in the variables with high range such as ‘FreeSulfurDioxide’, ‘TotalSulfurDioxide’, to a lower extent ‘ResidualSugar’. The rest are normal. Secondly we will flag missing values such as with variables STARS and possibly ‘SULPHATES’, ‘CHlorides’, ‘PH’, ‘ResidualSugar’.
We believe it is a good idea to check if ‘TARGET’ variable is very different between entries with N/A’s and complete observations
STARS can be dealt as a factor variable, because it is not continuous, it may be thought of as categorized
Notice in this assignment there is no need to cleanup character/symbols which causes issues when performing mathematical computations
summary(wine_train_df)
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
#we needed to check the variances of the variables with high ranges
#note Standard Deviation is huge for 'FreesulfureDioxide' and 'TotalSulfureDioxide' and to lower extent 'ResidualSugar'
#notice that 'FreesulfureDioxide' and 'TotalSulfureDioxide' appear to include negative values
#'STARS' has exceptionally high number of missing values, as well as multiple chemical properties
#important to note the LOG cannot take negative values
kable(format(sapply(wine_train_df, function(wine_train_df) c( "Stand dev" = round(sd(wine_train_df, na.rm = T),2),
"Mean"= mean(wine_train_df,na.rm=TRUE),
"n" = length(wine_train_df),
"Median" = median(wine_train_df,na.rm = TRUE),
"CoeffofVariation" = sd(wine_train_df)/mean(wine_train_df,na.rm=TRUE),
"Minimum" = min(wine_train_df),
"Maximum" = max(wine_train_df),
"Upper Quantile" = quantile(wine_train_df,1,na.rm = TRUE),
"LowerQuartile" = quantile(wine_train_df,0,na.rm = TRUE)
)scientific = FALSE)
), )
TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Stand dev | 1.930000000 | 6.320000000 | 0.780000000 | 0.860000000 | 33.750000000 | 0.320000000 | 148.710000000 | 231.910000000 | 0.030000000 | 0.680000000 | 0.930000000 | 3.730000000 | 0.890000000 | 1.320000000 | 0.900000000 |
Mean | 3.029073857 | 7.075717077 | 0.324103947 | 0.308412661 | 5.418733065 | 0.054822489 | 30.845571287 | 120.714232643 | 0.994202718 | 3.207628226 | 0.527111782 | 10.489236260 | -0.009066041 | 7.772723720 | 2.041754981 |
n | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 |
Median | 3.000000000 | 6.900000000 | 0.280000000 | 0.310000000 | 3.900000000 | 0.046000000 | 30.000000000 | 123.000000000 | 0.994490000 | 3.200000000 | 0.500000000 | 10.400000000 | 0.000000000 | 8.000000000 | 2.000000000 |
CoeffofVariation | 0.635959475 | 0.892862644 | 2.419020945 | 2.795215292 | NA | NA | NA | NA | 0.026692389 | NA | NA | NA | -98.288680252 | 0.170329786 | NA |
Minimum | 0.000000000 | -18.100000000 | -2.790000000 | -3.240000000 | NA | NA | NA | NA | 0.888090000 | NA | NA | NA | -2.000000000 | 4.000000000 | NA |
Maximum | 8.000000000 | 34.400000000 | 3.680000000 | 3.860000000 | NA | NA | NA | NA | 1.099240000 | NA | NA | NA | 2.000000000 | 17.000000000 | NA |
Upper Quantile.100% | 8.000000000 | 34.400000000 | 3.680000000 | 3.860000000 | 141.150000000 | 1.351000000 | 623.000000000 | 1057.000000000 | 1.099240000 | 6.130000000 | 4.240000000 | 26.500000000 | 2.000000000 | 17.000000000 | 4.000000000 |
LowerQuartile.0% | 0.000000000 | -18.100000000 | -2.790000000 | -3.240000000 | -127.800000000 | -1.171000000 | -555.000000000 | -823.000000000 | 0.888090000 | 0.480000000 | -3.130000000 | -4.700000000 | -2.000000000 | 4.000000000 | 1.000000000 |
#We might need to transform the 'FreesulfureDioxide' and 'TotalSulfureDioxide'
#NA's Flags: Explore if they have impact on 'TARGET', if yes, then impute?
#equal to the mean or significantly different from those with without N/A's in terms of 'TARGET' variable
#VISUAL EXPLORATION
ggplot(gather(wine_train_df), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
hist(wine_train_df$TARGET)
table(wine_train_df$TARGET)
##
## 0 1 2 3 4 5 6 7 8
## 2734 244 1091 2611 3177 2014 765 142 17
#Corr matrix and the scatterplot matrix
##correlation matrix
= rcorr(as.matrix(wine_train_df))
wine_train_df.rcorr wine_train_df.rcorr
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET 1.00 -0.05 -0.09 0.01 0.02
## FixedAcidity -0.05 1.00 0.01 0.01 -0.02
## VolatileAcidity -0.09 0.01 1.00 -0.02 -0.01
## CitricAcid 0.01 0.01 -0.02 1.00 -0.01
## ResidualSugar 0.02 -0.02 -0.01 -0.01 1.00
## Chlorides -0.04 0.00 0.00 -0.01 -0.01
## FreeSulfurDioxide 0.04 0.00 -0.01 0.01 0.02
## TotalSulfurDioxide 0.05 -0.02 -0.02 0.01 0.02
## Density -0.04 0.01 0.01 -0.01 0.00
## pH -0.01 -0.01 0.01 -0.01 0.01
## Sulphates -0.04 0.03 0.00 -0.01 -0.01
## Alcohol 0.06 -0.01 0.00 0.02 -0.02
## LabelAppeal 0.36 0.00 -0.02 0.01 0.00
## AcidIndex -0.25 0.18 0.04 0.07 -0.01
## STARS 0.56 -0.01 -0.03 0.00 0.02
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
## TARGET -0.04 0.04 0.05 -0.04 -0.01
## FixedAcidity 0.00 0.00 -0.02 0.01 -0.01
## VolatileAcidity 0.00 -0.01 -0.02 0.01 0.01
## CitricAcid -0.01 0.01 0.01 -0.01 -0.01
## ResidualSugar -0.01 0.02 0.02 0.00 0.01
## Chlorides 1.00 -0.02 -0.01 0.02 -0.02
## FreeSulfurDioxide -0.02 1.00 0.01 0.00 0.01
## TotalSulfurDioxide -0.01 0.01 1.00 0.01 0.00
## Density 0.02 0.00 0.01 1.00 0.01
## pH -0.02 0.01 0.00 0.01 1.00
## Sulphates 0.00 0.01 -0.01 -0.01 0.01
## Alcohol -0.02 -0.02 -0.02 -0.01 -0.01
## LabelAppeal 0.01 0.01 -0.01 -0.01 0.00
## AcidIndex 0.03 -0.04 -0.05 0.04 -0.06
## STARS 0.00 -0.01 0.01 -0.02 0.00
## Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET -0.04 0.06 0.36 -0.25 0.56
## FixedAcidity 0.03 -0.01 0.00 0.18 -0.01
## VolatileAcidity 0.00 0.00 -0.02 0.04 -0.03
## CitricAcid -0.01 0.02 0.01 0.07 0.00
## ResidualSugar -0.01 -0.02 0.00 -0.01 0.02
## Chlorides 0.00 -0.02 0.01 0.03 0.00
## FreeSulfurDioxide 0.01 -0.02 0.01 -0.04 -0.01
## TotalSulfurDioxide -0.01 -0.02 -0.01 -0.05 0.01
## Density -0.01 -0.01 -0.01 0.04 -0.02
## pH 0.01 -0.01 0.00 -0.06 0.00
## Sulphates 1.00 0.00 0.00 0.03 -0.01
## Alcohol 0.00 1.00 0.00 -0.04 0.07
## LabelAppeal 0.00 0.00 1.00 0.02 0.33
## AcidIndex 0.03 -0.04 0.02 1.00 -0.09
## STARS -0.01 0.07 0.33 -0.09 1.00
##
## n
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET 12795 12795 12795 12795 12179
## FixedAcidity 12795 12795 12795 12795 12179
## VolatileAcidity 12795 12795 12795 12795 12179
## CitricAcid 12795 12795 12795 12795 12179
## ResidualSugar 12179 12179 12179 12179 12179
## Chlorides 12157 12157 12157 12157 11585
## FreeSulfurDioxide 12148 12148 12148 12148 11563
## TotalSulfurDioxide 12113 12113 12113 12113 11532
## Density 12795 12795 12795 12795 12179
## pH 12400 12400 12400 12400 11802
## Sulphates 11585 11585 11585 11585 11030
## Alcohol 12142 12142 12142 12142 11563
## LabelAppeal 12795 12795 12795 12795 12179
## AcidIndex 12795 12795 12795 12795 12179
## STARS 9436 9436 9436 9436 8984
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
## TARGET 12157 12148 12113 12795 12400
## FixedAcidity 12157 12148 12113 12795 12400
## VolatileAcidity 12157 12148 12113 12795 12400
## CitricAcid 12157 12148 12113 12795 12400
## ResidualSugar 11585 11563 11532 12179 11802
## Chlorides 12157 11544 11510 12157 11773
## FreeSulfurDioxide 11544 12148 11512 12148 11771
## TotalSulfurDioxide 11510 11512 12113 12113 11739
## Density 12157 12148 12113 12795 12400
## pH 11773 11771 11739 12400 12400
## Sulphates 10991 10995 10973 11585 11228
## Alcohol 11538 11527 11497 12142 11771
## LabelAppeal 12157 12148 12113 12795 12400
## AcidIndex 12157 12148 12113 12795 12400
## STARS 8969 8979 8942 9436 9154
## Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET 11585 12142 12795 12795 9436
## FixedAcidity 11585 12142 12795 12795 9436
## VolatileAcidity 11585 12142 12795 12795 9436
## CitricAcid 11585 12142 12795 12795 9436
## ResidualSugar 11030 11563 12179 12179 8984
## Chlorides 10991 11538 12157 12157 8969
## FreeSulfurDioxide 10995 11527 12148 12148 8979
## TotalSulfurDioxide 10973 11497 12113 12113 8942
## Density 11585 12142 12795 12795 9436
## pH 11228 11771 12400 12400 9154
## Sulphates 11585 10989 11585 11585 8564
## Alcohol 10989 12142 12142 12142 8963
## LabelAppeal 11585 12142 12795 12795 9436
## AcidIndex 11585 12142 12795 12795 9436
## STARS 8564 8963 9436 9436 9436
##
## P
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET 0.0000 0.0000 0.3260 0.0688
## FixedAcidity 0.0000 0.1616 0.1072 0.0375
## VolatileAcidity 0.0000 0.1616 0.0552 0.4744
## CitricAcid 0.3260 0.1072 0.0552 0.4438
## ResidualSugar 0.0688 0.0375 0.4744 0.4438
## Chlorides 0.0000 0.9598 0.9134 0.3449 0.5471
## FreeSulfurDioxide 0.0000 0.5837 0.4354 0.4787 0.0600
## TotalSulfurDioxide 0.0000 0.0133 0.0203 0.4867 0.0158
## Density 0.0000 0.4638 0.0956 0.1145 0.6509
## pH 0.2930 0.3172 0.1302 0.3322 0.1880
## Sulphates 0.0000 0.0009 0.9889 0.1621 0.4173
## Alcohol 0.0000 0.3018 0.6536 0.0603 0.0315
## LabelAppeal 0.0000 0.7034 0.0547 0.3279 0.7979
## AcidIndex 0.0000 0.0000 0.0000 0.0000 0.2989
## STARS 0.0000 0.5197 0.0008 0.9485 0.1126
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## TARGET 0.0000 0.0000 0.0000 0.0000
## FixedAcidity 0.9598 0.5837 0.0133 0.4638
## VolatileAcidity 0.9134 0.4354 0.0203 0.0956
## CitricAcid 0.3449 0.4787 0.4867 0.1145
## ResidualSugar 0.5471 0.0600 0.0158 0.6509
## Chlorides 0.0264 0.1333 0.0125
## FreeSulfurDioxide 0.0264 0.1410 0.7263
## TotalSulfurDioxide 0.1333 0.1410 0.1584
## Density 0.0125 0.7263 0.1584
## pH 0.0561 0.5117 0.6380 0.5207
## Sulphates 0.7302 0.2242 0.4550 0.3296
## Alcohol 0.0344 0.0460 0.0871 0.4267
## LabelAppeal 0.2466 0.2566 0.2834 0.2892
## AcidIndex 0.0054 0.0000 0.0000 0.0000
## STARS 0.6405 0.3895 0.1878 0.0757
## pH Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET 0.2930 0.0000 0.0000 0.0000 0.0000 0.0000
## FixedAcidity 0.3172 0.0009 0.3018 0.7034 0.0000 0.5197
## VolatileAcidity 0.1302 0.9889 0.6536 0.0547 0.0000 0.0008
## CitricAcid 0.3322 0.1621 0.0603 0.3279 0.0000 0.9485
## ResidualSugar 0.1880 0.4173 0.0315 0.7979 0.2989 0.1126
## Chlorides 0.0561 0.7302 0.0344 0.2466 0.0054 0.6405
## FreeSulfurDioxide 0.5117 0.2242 0.0460 0.2566 0.0000 0.3895
## TotalSulfurDioxide 0.6380 0.4550 0.0871 0.2834 0.0000 0.1878
## Density 0.5207 0.3296 0.4267 0.2892 0.0000 0.0757
## pH 0.5618 0.2103 0.6450 0.0000 0.9627
## Sulphates 0.5618 0.6192 0.6757 0.0002 0.2548
## Alcohol 0.2103 0.6192 0.9099 0.0000 0.0000
## LabelAppeal 0.6450 0.6757 0.9099 0.0051 0.0000
## AcidIndex 0.0000 0.0002 0.0000 0.0051 0.0000
## STARS 0.9627 0.2548 0.0000 0.0000 0.0000
#notice when running above correlation matrix that the correlation between STARS and the target variable is high, the higher of a rating the more samples were requested by distribution companies for a specific brand
#very important to note, also when viewing our correlation matrix, in terms of our 'TARGET' variable and the complete variables, we also have 'LabelAppeal' appears to be a variable which is highly correlated with 'TARGET', 'AcidIndex' can be considered similarly so.
= cor(wine_train_df, use = "pairwise.complete.obs")
wine_train_df.cor corrplot(wine_train_df.cor)
#notice that when initially first using "complete.obs", it gives error because of various N/A's throughout our data
#"complete.obs" above initially failed because it only calculates correlations based on rows that don't have N/A's (every row has an N/A somewhere apparently)
#if use has the value "pairwise.complete.obs" then the correlation between each pair of variables is computed using all complete pairs of observations on those variables.
# 'STARS', 'AcidIndex', 'LabelAppeal' are the ones with the highest correlation with the 'TARGET' variable
#Basically, for every value of 'STARS', ie 'STARS=1 or =2 or =3 or =4, it shows the distribution the 'TARGET' variable in each case
#You'll notice, for example when 'STARS'=4, the target variable generally has more high value than low values
histogram(~ TARGET | STARS, data = wine_train_df)
#'LabelAppeal' has range from min -2 and max 2, this can possibly be thought of as categorical
#using 'LabelAppeal' we can think of -2 as "very Bad", 1 as "Bad", 0 as neutral, 1 as "good", 2 as "very good"
#IMPORTANT to note the higher the label appeal, then the higher the demand for samples for specific brand
histogram(~ TARGET | LabelAppeal, data = wine_train_df)
#Note 'AcidIndex' is NOT categorical, we cannot put one graph per every value of 'AcidIndex'. Instead we can switch and use 'TARGET' and have 'AcidIndex' as the histogram
#viewing the histogram, we can say that the ones with high 'TARGET' value should have lower 'AcidIndex'. As we go lower and lower according to 'TARGET' variable then the 'AcidIndex' should get higher. This makes sense since 'AcidIndex' appeared to be negatively correlated (-0.25)
histogram(~ AcidIndex | TARGET, data = wine_train_df)
<- cor.test(wine_train_df$STARS, wine_train_df$TARGET)
cor_stars_tgt cor_stars_tgt
##
## Pearson's product-moment correlation
##
## data: wine_train_df$STARS and wine_train_df$TARGET
## t = 65.446, df = 9434, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5447586 0.5725160
## sample estimates:
## cor
## 0.5587938
<- cor.test(wine_train_df$LabelAppeal, wine_train_df$TARGET)
cor_lbl_tgr cor_lbl_tgr
##
## Pearson's product-moment correlation
##
## data: wine_train_df$LabelAppeal and wine_train_df$TARGET
## t = 43.158, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3412812 0.3715329
## sample estimates:
## cor
## 0.3565005
<- cor.test(wine_train_df$AcidIndex, wine_train_df$TARGET)
cor_acid_tgt cor_acid_tgt
##
## Pearson's product-moment correlation
##
## data: wine_train_df$AcidIndex and wine_train_df$TARGET
## t = -28.712, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2622588 -0.2297013
## sample estimates:
## cor
## -0.2460494
#we tested the significance of the three correlation, with 5% significant level, they are all significant
#T-test (DIDNT WORK)
#because t-test only works if the grouping variable, only has two groups
#t_AcidIndextarget <- t.test(wine_train_df$AcidIndex~ wine_train_df$TARGET)
#t_AcidIndextarget
# Compute the analysis of variance, when has more than two groups perform ANOVA
<- aov(AcidIndex ~ TARGET, data = wine_train_df)
res.aov # Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## TARGET 1 1358 1357.6 824.4 <2e-16 ***
## Residuals 12793 21067 1.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#potentially acid index has a potential effect on TARGET variable
# #For these ones, they seem more random, the 'missingness' may not have any indications, we impute using mean or conditional mean on target.
# ResidualSugar
# Chlorides
# FreeSulfurDioxide
# TotalSulfurDioxide
# pH
# Alcohol
#we are creating groups based on TARGET, replace missing value
<- wine_train_df %>%
wine_train_df mutate(ResidualSugar= ifelse(is.na(ResidualSugar),
mean(ResidualSugar, na.rm=TRUE),ResidualSugar),
Chlorides= ifelse(is.na(Chlorides),
mean(Chlorides, na.rm=TRUE),Chlorides),
FreeSulfurDioxide= ifelse(is.na(FreeSulfurDioxide),
mean(FreeSulfurDioxide, na.rm=TRUE),FreeSulfurDioxide),
TotalSulfurDioxide= ifelse(is.na(TotalSulfurDioxide),
mean(TotalSulfurDioxide, na.rm=TRUE),TotalSulfurDioxide),
pH= ifelse(is.na(pH),
mean(pH, na.rm=TRUE),pH),
Alcohol= ifelse(is.na(Alcohol),
mean(Alcohol, na.rm=TRUE),Alcohol),
)
#LOG TRANSFORMATION
#Log transformation for the variables with high variance, the variables are translated first so that we get rid of the negative values so that the log function can handle them
#we transform 'y' like in last assignment, instead of log(y), we can do y + 1 - min(y)
#for example, the minimum of 'FreeSulfurDioxide' is -555, this will transform all the values of 'FreeSulfurDioxide by 555, so the -555 is not in the negative range(will be positive) and log() will be able to handle
#A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1.
$FreeSulfurDioxide_log <- log(wine_train_df$FreeSulfurDioxide + 1 - min(wine_train_df$FreeSulfurDioxide))
wine_train_df$TotalSulfurDioxide_log <- log(wine_train_df$TotalSulfurDioxide + 1 - min(wine_train_df$TotalSulfurDioxide))
wine_train_df
# #Flags for N/A's:
# #These ones need careful consideration as the percentage of N/A's is significant
# Sulphates(*)
# STARS(***)
#OBSERVATION: we notice some variable with "low-ish" N/A's, when we explored the data we saw that every single brand has an N/A in these properties. It may be that these N/A's are random. It may not be the case that specific brands are 'bad' or 'low price' since they have N/A's in certain properties. The N/A's we see dispersed among the different brands
<- wine_train_df %>%
wine_train_df mutate(Sulphates_flag= ifelse(is.na(Sulphates),1,0),
STARS_flag= ifelse(is.na(STARS),1,0)
)#flags = will create 1 if NA
histogram(~ TARGET | STARS_flag, data = wine_train_df)
histogram(~ TARGET | Sulphates_flag, data = wine_train_df)
#left, stars_flag=0 and right side graph=1 missing data
#histogram on the left STARS_flag has no missing data, the average is higher
##histogram on the left STARS_flag has missing data, the average is generally lower
###We might assume that those with missing data generally are less purchases or of lower quality
#sulphates_flag with 0 and 1 are similar(symmetrical)
#corrective actions
<- wine_train_df %>%
wine_train_df mutate(Sulphates= ifelse(is.na(Sulphates),
mean(Sulphates, na.rm=TRUE),Sulphates),
STARS_merged=ifelse(is.na(STARS),0,STARS))
#if it is missing it will equal to 0, if has a value it will stay same
#you will see includes 0 for missing values
table(wine_train_df$STARS_merged)
##
## 0 1 2 3 4
## 3359 3042 3570 2212 612
##you will see includes no 0 columns
table(wine_train_df$STARS)
##
## 1 2 3 4
## 3042 3570 2212 612
#creating clusters for acid index
<- kmeans(wine_train_df$AcidIndex, centers = 5)
kmeans.re table(kmeans.re$cluster)
##
## 1 2 3 4 5
## 1978 12 124 386 10295
$AcidIndex_clusters <- kmeans.re$cluster
wine_train_df
histogram(~ TARGET | AcidIndex_clusters, data = wine_train_df)
#the higher the acidindex the lower the target variable
#other clusters are almost identical in shape, generally even distributed
#better to switch to original acidindex variable