In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
-INDEX: Identification Variable (do not use) None
-TARGET: Number of Cases Purchased None
-AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average
-Alcohol: Alcohol Content
-Chlorides: Chloride content of wine
-CitricAcid: Citric Acid Content
-Density: Density of Wine
-FixedAcidity: Fixed Acidity of Wine
-FreeSulfurDioxide: Sulfur Dioxide content of wine
-LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
-ResidualSugar: Residual Sugar of wine
-STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales
-Sulphates: Sulfate content of wine
-TotalSulfurDioxide: Total Sulfur Dioxide of Wine
-VolatileAcidity: Volatile Acid content of wine
-pH: pH of wine
#importing the train an eval data
<- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-training-data.csv")
wine_train_df<- wine_train_df[,2:16]
wine_train_df <- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-evaluation-data.csv")
wine_eval_df<- wine_eval_df[,2:16]
wine_eval_df
#per assignment instructions, we don't use first column 'ID', so we remove it, we performed in above manner to keep all rows
We are provided with two datasets on commercially available wine, one for the purpose of training our model with 12,796 observations of 16 variables and an evaluation dataset with 3,335 observations. The training dataset includes one response variable, TARGET, the number of cases purchased. TARGET is a continuous variable, with values between 0 and 8 in the training data. Because we are analyzing this data with the intention of maximizing sales for this wine manufacturer, we’ll be creating models to better understand how much wine is ordered based on the wine’s characteristics.
To begin our exploration of the data, we start with a broad look at the variables.
There are missing observations in ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS. STARS has the greatest rate of missingness with 26% null observations, while Sulphates has just under 10% null observations, followed by ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, and Alcohol all with about 5% null observations. Do observations with one null have multiple nulls across variables? Or are the nulls scattered throughout observations? Look at just the null values to understand them better. We will have to decide how to handle data missingness in the subsequent section.
Now, we can dig deeper into each variables’ distribution to see if there are any integrity red flags or challenges with skewness. Strangely enough, we appear to have a dataset filled with variables that are normally distributed. Looking at these histograms, we might potentially be interested in a few transformations. AcidIndex, a proprietary method of testing total acidity of wine by using a weighted average, appears to be ever so slightly right-skewed, but it’s a scaled composite variable so may be best left as is. STARS, with all it’s missing observations, is also less than normally distributed. Depending on how we handle those missing observations, we’ll be able to transform and/or handle the variable appropriately. It’s also interesting to notice that our TARGET variable is the least normally distributed, with a large number of 0s and a conspicuous gap of observations between 4 and 5. Are there any other variables with suspicious distributions /minimums /maximums/0s/means?
With a better understanding of individual variables, we can begin to look at how the variables are correlated. When we create a correlation table on complete observations, we see that there’s not a lot of correlation between predictor variables. It will be interesting to look at correlation again once we’ve addressed nulls. There is the strongest positive correlation between STARS and TARGET, which theoretically makes sense given that STARS is the “wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor”. LabelAppeal has the second strongest positive correlation with TARGET, and is another theoretically direct measurement of a customer’s willingness to buy a specific wine. The strongest negative correlation is between AcidIndex and Target. Depending on how the index is set up, this may indicate that consumers have a strong preference for or against more acidic wines.
What else would we like to call out in our data exploration?
summary(wine_train_df)
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
#we needed to check the variances of the variables with high ranges
#note Standard Deviation is huge for 'FreesulfureDioxide' and 'TotalSulfureDioxide' and to lower extent 'ResidualSugar'
#notice that 'FreesulfureDioxide' and 'TotalSulfureDioxide' appear to include negative values
#'STARS' has exceptionally high number of missing values, as well as multiple chemical properties
#important to note the LOG cannot take negative values
kable(format(sapply(wine_train_df, function(wine_train_df) c( "Stand dev" = round(sd(wine_train_df, na.rm = T),2),
"Mean"= mean(wine_train_df,na.rm=TRUE),
"n" = length(wine_train_df),
"Median" = median(wine_train_df,na.rm = TRUE),
"CoeffofVariation" = sd(wine_train_df)/mean(wine_train_df,na.rm=TRUE),
"Minimum" = min(wine_train_df),
"Maximum" = max(wine_train_df),
"Upper Quantile" = quantile(wine_train_df,1,na.rm = TRUE),
"LowerQuartile" = quantile(wine_train_df,0,na.rm = TRUE)
)scientific = FALSE)
), )
TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Stand dev | 1.930000000 | 6.320000000 | 0.780000000 | 0.860000000 | 33.750000000 | 0.320000000 | 148.710000000 | 231.910000000 | 0.030000000 | 0.680000000 | 0.930000000 | 3.730000000 | 0.890000000 | 1.320000000 | 0.900000000 |
Mean | 3.029073857 | 7.075717077 | 0.324103947 | 0.308412661 | 5.418733065 | 0.054822489 | 30.845571287 | 120.714232643 | 0.994202718 | 3.207628226 | 0.527111782 | 10.489236260 | -0.009066041 | 7.772723720 | 2.041754981 |
n | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 | 12795.000000000 |
Median | 3.000000000 | 6.900000000 | 0.280000000 | 0.310000000 | 3.900000000 | 0.046000000 | 30.000000000 | 123.000000000 | 0.994490000 | 3.200000000 | 0.500000000 | 10.400000000 | 0.000000000 | 8.000000000 | 2.000000000 |
CoeffofVariation | 0.635959475 | 0.892862644 | 2.419020945 | 2.795215292 | NA | NA | NA | NA | 0.026692389 | NA | NA | NA | -98.288680252 | 0.170329786 | NA |
Minimum | 0.000000000 | -18.100000000 | -2.790000000 | -3.240000000 | NA | NA | NA | NA | 0.888090000 | NA | NA | NA | -2.000000000 | 4.000000000 | NA |
Maximum | 8.000000000 | 34.400000000 | 3.680000000 | 3.860000000 | NA | NA | NA | NA | 1.099240000 | NA | NA | NA | 2.000000000 | 17.000000000 | NA |
Upper Quantile.100% | 8.000000000 | 34.400000000 | 3.680000000 | 3.860000000 | 141.150000000 | 1.351000000 | 623.000000000 | 1057.000000000 | 1.099240000 | 6.130000000 | 4.240000000 | 26.500000000 | 2.000000000 | 17.000000000 | 4.000000000 |
LowerQuartile.0% | 0.000000000 | -18.100000000 | -2.790000000 | -3.240000000 | -127.800000000 | -1.171000000 | -555.000000000 | -823.000000000 | 0.888090000 | 0.480000000 | -3.130000000 | -4.700000000 | -2.000000000 | 4.000000000 | 1.000000000 |
#We might need to transform the 'FreesulfureDioxide' and 'TotalSulfureDioxide'
#NA's Flags: Explore if they have impact on 'TARGET', if yes, then impute?
#equal to the mean or significantly different from those with without N/A's in terms of 'TARGET' variable
#VISUAL EXPLORATION
ggplot(gather(wine_train_df), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
hist(wine_train_df$TARGET)
table(wine_train_df$TARGET)
##
## 0 1 2 3 4 5 6 7 8
## 2734 244 1091 2611 3177 2014 765 142 17
#Corr matrix and the scatterplot matrix
##correlation matrix
= rcorr(as.matrix(wine_train_df))
wine_train_df.rcorr wine_train_df.rcorr
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET 1.00 -0.05 -0.09 0.01 0.02
## FixedAcidity -0.05 1.00 0.01 0.01 -0.02
## VolatileAcidity -0.09 0.01 1.00 -0.02 -0.01
## CitricAcid 0.01 0.01 -0.02 1.00 -0.01
## ResidualSugar 0.02 -0.02 -0.01 -0.01 1.00
## Chlorides -0.04 0.00 0.00 -0.01 -0.01
## FreeSulfurDioxide 0.04 0.00 -0.01 0.01 0.02
## TotalSulfurDioxide 0.05 -0.02 -0.02 0.01 0.02
## Density -0.04 0.01 0.01 -0.01 0.00
## pH -0.01 -0.01 0.01 -0.01 0.01
## Sulphates -0.04 0.03 0.00 -0.01 -0.01
## Alcohol 0.06 -0.01 0.00 0.02 -0.02
## LabelAppeal 0.36 0.00 -0.02 0.01 0.00
## AcidIndex -0.25 0.18 0.04 0.07 -0.01
## STARS 0.56 -0.01 -0.03 0.00 0.02
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
## TARGET -0.04 0.04 0.05 -0.04 -0.01
## FixedAcidity 0.00 0.00 -0.02 0.01 -0.01
## VolatileAcidity 0.00 -0.01 -0.02 0.01 0.01
## CitricAcid -0.01 0.01 0.01 -0.01 -0.01
## ResidualSugar -0.01 0.02 0.02 0.00 0.01
## Chlorides 1.00 -0.02 -0.01 0.02 -0.02
## FreeSulfurDioxide -0.02 1.00 0.01 0.00 0.01
## TotalSulfurDioxide -0.01 0.01 1.00 0.01 0.00
## Density 0.02 0.00 0.01 1.00 0.01
## pH -0.02 0.01 0.00 0.01 1.00
## Sulphates 0.00 0.01 -0.01 -0.01 0.01
## Alcohol -0.02 -0.02 -0.02 -0.01 -0.01
## LabelAppeal 0.01 0.01 -0.01 -0.01 0.00
## AcidIndex 0.03 -0.04 -0.05 0.04 -0.06
## STARS 0.00 -0.01 0.01 -0.02 0.00
## Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET -0.04 0.06 0.36 -0.25 0.56
## FixedAcidity 0.03 -0.01 0.00 0.18 -0.01
## VolatileAcidity 0.00 0.00 -0.02 0.04 -0.03
## CitricAcid -0.01 0.02 0.01 0.07 0.00
## ResidualSugar -0.01 -0.02 0.00 -0.01 0.02
## Chlorides 0.00 -0.02 0.01 0.03 0.00
## FreeSulfurDioxide 0.01 -0.02 0.01 -0.04 -0.01
## TotalSulfurDioxide -0.01 -0.02 -0.01 -0.05 0.01
## Density -0.01 -0.01 -0.01 0.04 -0.02
## pH 0.01 -0.01 0.00 -0.06 0.00
## Sulphates 1.00 0.00 0.00 0.03 -0.01
## Alcohol 0.00 1.00 0.00 -0.04 0.07
## LabelAppeal 0.00 0.00 1.00 0.02 0.33
## AcidIndex 0.03 -0.04 0.02 1.00 -0.09
## STARS -0.01 0.07 0.33 -0.09 1.00
##
## n
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET 12795 12795 12795 12795 12179
## FixedAcidity 12795 12795 12795 12795 12179
## VolatileAcidity 12795 12795 12795 12795 12179
## CitricAcid 12795 12795 12795 12795 12179
## ResidualSugar 12179 12179 12179 12179 12179
## Chlorides 12157 12157 12157 12157 11585
## FreeSulfurDioxide 12148 12148 12148 12148 11563
## TotalSulfurDioxide 12113 12113 12113 12113 11532
## Density 12795 12795 12795 12795 12179
## pH 12400 12400 12400 12400 11802
## Sulphates 11585 11585 11585 11585 11030
## Alcohol 12142 12142 12142 12142 11563
## LabelAppeal 12795 12795 12795 12795 12179
## AcidIndex 12795 12795 12795 12795 12179
## STARS 9436 9436 9436 9436 8984
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH
## TARGET 12157 12148 12113 12795 12400
## FixedAcidity 12157 12148 12113 12795 12400
## VolatileAcidity 12157 12148 12113 12795 12400
## CitricAcid 12157 12148 12113 12795 12400
## ResidualSugar 11585 11563 11532 12179 11802
## Chlorides 12157 11544 11510 12157 11773
## FreeSulfurDioxide 11544 12148 11512 12148 11771
## TotalSulfurDioxide 11510 11512 12113 12113 11739
## Density 12157 12148 12113 12795 12400
## pH 11773 11771 11739 12400 12400
## Sulphates 10991 10995 10973 11585 11228
## Alcohol 11538 11527 11497 12142 11771
## LabelAppeal 12157 12148 12113 12795 12400
## AcidIndex 12157 12148 12113 12795 12400
## STARS 8969 8979 8942 9436 9154
## Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET 11585 12142 12795 12795 9436
## FixedAcidity 11585 12142 12795 12795 9436
## VolatileAcidity 11585 12142 12795 12795 9436
## CitricAcid 11585 12142 12795 12795 9436
## ResidualSugar 11030 11563 12179 12179 8984
## Chlorides 10991 11538 12157 12157 8969
## FreeSulfurDioxide 10995 11527 12148 12148 8979
## TotalSulfurDioxide 10973 11497 12113 12113 8942
## Density 11585 12142 12795 12795 9436
## pH 11228 11771 12400 12400 9154
## Sulphates 11585 10989 11585 11585 8564
## Alcohol 10989 12142 12142 12142 8963
## LabelAppeal 11585 12142 12795 12795 9436
## AcidIndex 11585 12142 12795 12795 9436
## STARS 8564 8963 9436 9436 9436
##
## P
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET 0.0000 0.0000 0.3260 0.0688
## FixedAcidity 0.0000 0.1616 0.1072 0.0375
## VolatileAcidity 0.0000 0.1616 0.0552 0.4744
## CitricAcid 0.3260 0.1072 0.0552 0.4438
## ResidualSugar 0.0688 0.0375 0.4744 0.4438
## Chlorides 0.0000 0.9598 0.9134 0.3449 0.5471
## FreeSulfurDioxide 0.0000 0.5837 0.4354 0.4787 0.0600
## TotalSulfurDioxide 0.0000 0.0133 0.0203 0.4867 0.0158
## Density 0.0000 0.4638 0.0956 0.1145 0.6509
## pH 0.2930 0.3172 0.1302 0.3322 0.1880
## Sulphates 0.0000 0.0009 0.9889 0.1621 0.4173
## Alcohol 0.0000 0.3018 0.6536 0.0603 0.0315
## LabelAppeal 0.0000 0.7034 0.0547 0.3279 0.7979
## AcidIndex 0.0000 0.0000 0.0000 0.0000 0.2989
## STARS 0.0000 0.5197 0.0008 0.9485 0.1126
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## TARGET 0.0000 0.0000 0.0000 0.0000
## FixedAcidity 0.9598 0.5837 0.0133 0.4638
## VolatileAcidity 0.9134 0.4354 0.0203 0.0956
## CitricAcid 0.3449 0.4787 0.4867 0.1145
## ResidualSugar 0.5471 0.0600 0.0158 0.6509
## Chlorides 0.0264 0.1333 0.0125
## FreeSulfurDioxide 0.0264 0.1410 0.7263
## TotalSulfurDioxide 0.1333 0.1410 0.1584
## Density 0.0125 0.7263 0.1584
## pH 0.0561 0.5117 0.6380 0.5207
## Sulphates 0.7302 0.2242 0.4550 0.3296
## Alcohol 0.0344 0.0460 0.0871 0.4267
## LabelAppeal 0.2466 0.2566 0.2834 0.2892
## AcidIndex 0.0054 0.0000 0.0000 0.0000
## STARS 0.6405 0.3895 0.1878 0.0757
## pH Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET 0.2930 0.0000 0.0000 0.0000 0.0000 0.0000
## FixedAcidity 0.3172 0.0009 0.3018 0.7034 0.0000 0.5197
## VolatileAcidity 0.1302 0.9889 0.6536 0.0547 0.0000 0.0008
## CitricAcid 0.3322 0.1621 0.0603 0.3279 0.0000 0.9485
## ResidualSugar 0.1880 0.4173 0.0315 0.7979 0.2989 0.1126
## Chlorides 0.0561 0.7302 0.0344 0.2466 0.0054 0.6405
## FreeSulfurDioxide 0.5117 0.2242 0.0460 0.2566 0.0000 0.3895
## TotalSulfurDioxide 0.6380 0.4550 0.0871 0.2834 0.0000 0.1878
## Density 0.5207 0.3296 0.4267 0.2892 0.0000 0.0757
## pH 0.5618 0.2103 0.6450 0.0000 0.9627
## Sulphates 0.5618 0.6192 0.6757 0.0002 0.2548
## Alcohol 0.2103 0.6192 0.9099 0.0000 0.0000
## LabelAppeal 0.6450 0.6757 0.9099 0.0051 0.0000
## AcidIndex 0.0000 0.0002 0.0000 0.0051 0.0000
## STARS 0.9627 0.2548 0.0000 0.0000 0.0000
#notice when running above correlation matrix that the correlation between STARS and the target variable is high, the higher of a rating the more samples were requested by distribution companies for a specific brand
#very important to note, also when viewing our correlation matrix, in terms of our 'TARGET' variable and the complete variables, we also have 'LabelAppeal' appears to be a variable which is highly correlated with 'TARGET', 'AcidIndex' can be considered similarly so.
= cor(wine_train_df, use = "pairwise.complete.obs")
wine_train_df.cor corrplot(wine_train_df.cor)
#notice that when initially first using "complete.obs", it gives error because of various N/A's throughout our data
#"complete.obs" above initially failed because it only calculates correlations based on rows that don't have N/A's (every row has an N/A somewhere apparently)
#if use has the value "pairwise.complete.obs" then the correlation between each pair of variables is computed using all complete pairs of observations on those variables.
# 'STARS', 'AcidIndex', 'LabelAppeal' are the ones with the highest correlation with the 'TARGET' variable
#Basically, for every value of 'STARS', ie 'STARS=1 or =2 or =3 or =4, it shows the distribution the 'TARGET' variable in each case
#You'll notice, for example when 'STARS'=4, the target variable generally has more high value than low values
histogram(~ TARGET | STARS, data = wine_train_df)
#'LabelAppeal' has range from min -2 and max 2, this can possibly be thought of as categorical
#using 'LabelAppeal' we can think of -2 as "very Bad", 1 as "Bad", 0 as neutral, 1 as "good", 2 as "very good"
#IMPORTANT to note the higher the label appeal, then the higher the demand for samples for specific brand
histogram(~ TARGET | LabelAppeal, data = wine_train_df)
#Note 'AcidIndex' is NOT categorical, we cannot put one graph per every value of 'AcidIndex'. Instead we can switch and use 'TARGET' and have 'AcidIndex' as the histogram
#viewing the histogram, we can say that the ones with high 'TARGET' value should have lower 'AcidIndex'. As we go lower and lower according to 'TARGET' variable then the 'AcidIndex' should get higher. This makes sense since 'AcidIndex' appeared to be negatively correlated (-0.25)
histogram(~ AcidIndex | TARGET, data = wine_train_df)
<- cor.test(wine_train_df$STARS, wine_train_df$TARGET)
cor_stars_tgt cor_stars_tgt
##
## Pearson's product-moment correlation
##
## data: wine_train_df$STARS and wine_train_df$TARGET
## t = 65.446, df = 9434, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5447586 0.5725160
## sample estimates:
## cor
## 0.5587938
<- cor.test(wine_train_df$LabelAppeal, wine_train_df$TARGET)
cor_lbl_tgr cor_lbl_tgr
##
## Pearson's product-moment correlation
##
## data: wine_train_df$LabelAppeal and wine_train_df$TARGET
## t = 43.158, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3412812 0.3715329
## sample estimates:
## cor
## 0.3565005
<- cor.test(wine_train_df$AcidIndex, wine_train_df$TARGET)
cor_acid_tgt cor_acid_tgt
##
## Pearson's product-moment correlation
##
## data: wine_train_df$AcidIndex and wine_train_df$TARGET
## t = -28.712, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2622588 -0.2297013
## sample estimates:
## cor
## -0.2460494
#we tested the significance of the three correlation, with 5% significant level, they are all significant
#T-test (DIDNT WORK)
#because t-test only works if the grouping variable, only has two groups
#t_AcidIndextarget <- t.test(wine_train_df$AcidIndex~ wine_train_df$TARGET)
#t_AcidIndextarget
# Compute the analysis of variance, when has more than two groups perform ANOVA
<- aov(AcidIndex ~ TARGET, data = wine_train_df)
res.aov # Summary of the analysis
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## TARGET 1 1358 1357.6 824.4 <2e-16 ***
## Residuals 12793 21067 1.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#potentially acid index has a potential effect on TARGET variable
Dropping missing data has important implications on a model’s ability to predict on an evaluation dataset. When all nulls and missing data are removed, the means and medians of the variables change. With no nulls or missing data, there are 6,436 observations total. While this isn’t a bad amount of sample, it does reduce the observations by half of the original dataset. Below, you can see how the removal has affected the medians and means of each variable.
Because data is rarely perfect and simply removing missing data introduces a certain type of bias, we can also try imputing missing data based on XXX.
We can also explore the transformation of variables, either through log transformation or interaction variables. These transformations will be discussed on a model-specific basis in the Build Models selection, as well as highlighted in the R codes included in the Appendix.
# #For these ones, they seem more random, the 'missingness' may not have any indications, we impute using mean or conditional mean on target.
# ResidualSugar
# Chlorides
# FreeSulfurDioxide
# TotalSulfurDioxide
# pH
# Alcohol
#we are creating groups based on TARGET, replace missing value
<- wine_train_df %>%
wine_train_df mutate(ResidualSugar= ifelse(is.na(ResidualSugar),
mean(ResidualSugar, na.rm=TRUE),ResidualSugar),
Chlorides= ifelse(is.na(Chlorides),
mean(Chlorides, na.rm=TRUE),Chlorides),
FreeSulfurDioxide= ifelse(is.na(FreeSulfurDioxide),
mean(FreeSulfurDioxide, na.rm=TRUE),FreeSulfurDioxide),
TotalSulfurDioxide= ifelse(is.na(TotalSulfurDioxide),
mean(TotalSulfurDioxide, na.rm=TRUE),TotalSulfurDioxide),
pH= ifelse(is.na(pH),
mean(pH, na.rm=TRUE),pH),
Alcohol= ifelse(is.na(Alcohol),
mean(Alcohol, na.rm=TRUE),Alcohol),
)
#LOG TRANSFORMATION
#Log transformation for the variables with high variance, the variables are translated first so that we get rid of the negative values so that the log function can handle them
#we transform 'y' like in last assignment, instead of log(y), we can do y + 1 - min(y)
#for example, the minimum of 'FreeSulfurDioxide' is -555, this will transform all the values of 'FreeSulfurDioxide by 555, so the -555 is not in the negative range(will be positive) and log() will be able to handle
#A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1.
$FreeSulfurDioxide_log <- log(wine_train_df$FreeSulfurDioxide + 1 - min(wine_train_df$FreeSulfurDioxide))
wine_train_df$TotalSulfurDioxide_log <- log(wine_train_df$TotalSulfurDioxide + 1 - min(wine_train_df$TotalSulfurDioxide))
wine_train_df
# #Flags for N/A's:
# #These ones need careful consideration as the percentage of N/A's is significant
# Sulphates(*)
# STARS(***)
#OBSERVATION: we notice some variable with "low-ish" N/A's, when we explored the data we saw that every single brand has an N/A in these properties. It may be that these N/A's are random. It may not be the case that specific brands are 'bad' or 'low price' since they have N/A's in certain properties. The N/A's we see dispersed among the different brands
<- wine_train_df %>%
wine_train_df mutate(Sulphates_flag= ifelse(is.na(Sulphates),1,0),
STARS_flag= ifelse(is.na(STARS),1,0)
)#flags = will create 1 if NA
histogram(~ TARGET | STARS_flag, data = wine_train_df)
histogram(~ TARGET | Sulphates_flag, data = wine_train_df)
#left, stars_flag=0 and right side graph=1 missing data
#histogram on the left STARS_flag has no missing data, the average is higher
##histogram on the left STARS_flag has missing data, the average is generally lower
###We might assume that those with missing data generally are less purchases or of lower quality
#sulphates_flag with 0 and 1 are similar(symmetrical)
#corrective actions
<- wine_train_df %>%
wine_train_df mutate(Sulphates= ifelse(is.na(Sulphates),
mean(Sulphates, na.rm=TRUE),Sulphates),
STARS_merged=ifelse(is.na(STARS),0,STARS))
#if it is missing it will equal to 0, if has a value it will stay same
#you will see includes 0 for missing values
table(wine_train_df$STARS_merged)
##
## 0 1 2 3 4
## 3359 3042 3570 2212 612
##you will see includes no 0 columns
table(wine_train_df$STARS)
##
## 1 2 3 4
## 3042 3570 2212 612
#creating clusters for acid index
<- kmeans(wine_train_df$AcidIndex, centers = 5)
kmeans.re table(kmeans.re$cluster)
##
## 1 2 3 4 5
## 1197 2236 78 9020 264
$AcidIndex_clusters <- kmeans.re$cluster
wine_train_df
histogram(~ TARGET | AcidIndex_clusters, data = wine_train_df)
#the higher the acidindex the lower the target variable
#other clusters are almost identical in shape, generally even distributed
#better to switch to original acidindex variable
Poisson Regression Models (2)
Negative Binomial Regression Models (2)
Multiple Linear Regression Models (2)
From assignment: Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations). Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models. Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Discuss the coefficients in the models, do they make sense? In this case, about the only thing you can comment on is the number of stars and the wine label appeal. However, you might comment on the coefficient and magnitude of variables and how they are similar or different from model to model. For example, you might say “pH seems to have a major positive impact in my poisson regression model, but a negative effect in my multiple linear regression model”. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.