OVERVIEW

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

-INDEX: Identification Variable (do not use) None

-TARGET: Number of Cases Purchased None

-AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average

-Alcohol: Alcohol Content

-Chlorides: Chloride content of wine

-CitricAcid: Citric Acid Content

-Density: Density of Wine

-FixedAcidity: Fixed Acidity of Wine

-FreeSulfurDioxide: Sulfur Dioxide content of wine

-LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.

-ResidualSugar: Residual Sugar of wine

-STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales

-Sulphates: Sulfate content of wine

-TotalSulfurDioxide: Total Sulfur Dioxide of Wine

-VolatileAcidity: Volatile Acid content of wine

-pH: pH of wine

#importing the train an eval data
wine_train_df<- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-training-data.csv")
wine_train_df <- wine_train_df[,2:16]
wine_eval_df<- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-evaluation-data.csv")
wine_eval_df <- wine_eval_df[,2:16]

#per assignment instructions, we don't use first column 'ID', so we remove it, we performed in above manner to keep all rows

DATA EXPLORATION

We are provided with two datasets on commercially available wine, one for the purpose of training our model with 12,796 observations of 16 variables and an evaluation dataset with 3,335 observations. The training dataset includes one response variable, TARGET, the number of cases purchased. TARGET is a continuous variable, with values between 0 and 8 in the training data. Because we are analyzing this data with the intention of maximizing sales for this wine manufacturer, we’ll be creating models to better understand how much wine is ordered based on the wine’s characteristics.

To begin our exploration of the data, we start with a broad look at the variables.

There are missing observations in ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS. STARS has the greatest rate of missingness with 26% null observations, while Sulphates has just under 10% null observations, followed by ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, and Alcohol all with about 5% null observations. Do observations with one null have multiple nulls across variables? Or are the nulls scattered throughout observations? Look at just the null values to understand them better. We will have to decide how to handle data missingness in the subsequent section.

Now, we can dig deeper into each variables’ distribution to see if there are any integrity red flags or challenges with skewness. Strangely enough, we appear to have a dataset filled with variables that are normally distributed. Looking at these histograms, we might potentially be interested in a few transformations. AcidIndex, a proprietary method of testing total acidity of wine by using a weighted average, appears to be ever so slightly right-skewed, but it’s a scaled composite variable so may be best left as is. STARS, with all it’s missing observations, is also less than normally distributed. Depending on how we handle those missing observations, we’ll be able to transform and/or handle the variable appropriately. It’s also interesting to notice that our TARGET variable is the least normally distributed, with a large number of 0s and a conspicuous gap of observations between 4 and 5. Are there any other variables with suspicious distributions /minimums /maximums/0s/means?

With a better understanding of individual variables, we can begin to look at how the variables are correlated. When we create a correlation table on complete observations, we see that there’s not a lot of correlation between predictor variables. It will be interesting to look at correlation again once we’ve addressed nulls. There is the strongest positive correlation between STARS and TARGET, which theoretically makes sense given that STARS is the “wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor”. LabelAppeal has the second strongest positive correlation with TARGET, and is another theoretically direct measurement of a customer’s willingness to buy a specific wine. The strongest negative correlation is between AcidIndex and Target. Depending on how the index is set up, this may indicate that consumers have a strong preference for or against more acidic wines.

What else would we like to call out in our data exploration?

summary(wine_train_df)

##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

#we needed to check the variances of the variables with high ranges
#note Standard Deviation is huge for 'FreesulfureDioxide' and 'TotalSulfureDioxide' and to lower extent 'ResidualSugar'
#notice that 'FreesulfureDioxide' and 'TotalSulfureDioxide' appear to include negative values
#'STARS' has exceptionally high number of missing values, as well as multiple chemical properties
#important to note the LOG cannot take negative values

kable(format(sapply(wine_train_df, function(wine_train_df) c( "Stand dev" = round(sd(wine_train_df, na.rm = T),2), 
                         "Mean"= mean(wine_train_df,na.rm=TRUE),
                         "n" = length(wine_train_df),
                         "Median" = median(wine_train_df,na.rm = TRUE),
                         "CoeffofVariation" = sd(wine_train_df)/mean(wine_train_df,na.rm=TRUE),
                         "Minimum" = min(wine_train_df),
                         "Maximum" = max(wine_train_df),
                         "Upper Quantile" = quantile(wine_train_df,1,na.rm = TRUE),
                         "LowerQuartile" = quantile(wine_train_df,0,na.rm = TRUE)
                    )
), scientific = FALSE)
)

	TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
Stand dev	1.930000000	6.320000000	0.780000000	0.860000000	33.750000000	0.320000000	148.710000000	231.910000000	0.030000000	0.680000000	0.930000000	3.730000000	0.890000000	1.320000000	0.900000000
Mean	3.029073857	7.075717077	0.324103947	0.308412661	5.418733065	0.054822489	30.845571287	120.714232643	0.994202718	3.207628226	0.527111782	10.489236260	-0.009066041	7.772723720	2.041754981
n	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000	12795.000000000
Median	3.000000000	6.900000000	0.280000000	0.310000000	3.900000000	0.046000000	30.000000000	123.000000000	0.994490000	3.200000000	0.500000000	10.400000000	0.000000000	8.000000000	2.000000000
CoeffofVariation	0.635959475	0.892862644	2.419020945	2.795215292	NA	NA	NA	NA	0.026692389	NA	NA	NA	-98.288680252	0.170329786	NA
Minimum	0.000000000	-18.100000000	-2.790000000	-3.240000000	NA	NA	NA	NA	0.888090000	NA	NA	NA	-2.000000000	4.000000000	NA
Maximum	8.000000000	34.400000000	3.680000000	3.860000000	NA	NA	NA	NA	1.099240000	NA	NA	NA	2.000000000	17.000000000	NA
Upper Quantile.100%	8.000000000	34.400000000	3.680000000	3.860000000	141.150000000	1.351000000	623.000000000	1057.000000000	1.099240000	6.130000000	4.240000000	26.500000000	2.000000000	17.000000000	4.000000000
LowerQuartile.0%	0.000000000	-18.100000000	-2.790000000	-3.240000000	-127.800000000	-1.171000000	-555.000000000	-823.000000000	0.888090000	0.480000000	-3.130000000	-4.700000000	-2.000000000	4.000000000	1.000000000

#We might need to transform the 'FreesulfureDioxide' and 'TotalSulfureDioxide'
#NA's Flags: Explore if they have impact on 'TARGET', if yes, then impute?
#equal to the mean or significantly different from those with without N/A's in terms of 'TARGET' variable

#VISUAL EXPLORATION
ggplot(gather(wine_train_df), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

hist(wine_train_df$TARGET)

table(wine_train_df$TARGET)

## 
##    0    1    2    3    4    5    6    7    8 
## 2734  244 1091 2611 3177 2014  765  142   17

#Corr matrix and the scatterplot matrix

##correlation matrix
wine_train_df.rcorr = rcorr(as.matrix(wine_train_df))
wine_train_df.rcorr

##                    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET               1.00        -0.05           -0.09       0.01          0.02
## FixedAcidity        -0.05         1.00            0.01       0.01         -0.02
## VolatileAcidity     -0.09         0.01            1.00      -0.02         -0.01
## CitricAcid           0.01         0.01           -0.02       1.00         -0.01
## ResidualSugar        0.02        -0.02           -0.01      -0.01          1.00
## Chlorides           -0.04         0.00            0.00      -0.01         -0.01
## FreeSulfurDioxide    0.04         0.00           -0.01       0.01          0.02
## TotalSulfurDioxide   0.05        -0.02           -0.02       0.01          0.02
## Density             -0.04         0.01            0.01      -0.01          0.00
## pH                  -0.01        -0.01            0.01      -0.01          0.01
## Sulphates           -0.04         0.03            0.00      -0.01         -0.01
## Alcohol              0.06        -0.01            0.00       0.02         -0.02
## LabelAppeal          0.36         0.00           -0.02       0.01          0.00
## AcidIndex           -0.25         0.18            0.04       0.07         -0.01
## STARS                0.56        -0.01           -0.03       0.00          0.02
##                    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density    pH
## TARGET                 -0.04              0.04               0.05   -0.04 -0.01
## FixedAcidity            0.00              0.00              -0.02    0.01 -0.01
## VolatileAcidity         0.00             -0.01              -0.02    0.01  0.01
## CitricAcid             -0.01              0.01               0.01   -0.01 -0.01
## ResidualSugar          -0.01              0.02               0.02    0.00  0.01
## Chlorides               1.00             -0.02              -0.01    0.02 -0.02
## FreeSulfurDioxide      -0.02              1.00               0.01    0.00  0.01
## TotalSulfurDioxide     -0.01              0.01               1.00    0.01  0.00
## Density                 0.02              0.00               0.01    1.00  0.01
## pH                     -0.02              0.01               0.00    0.01  1.00
## Sulphates               0.00              0.01              -0.01   -0.01  0.01
## Alcohol                -0.02             -0.02              -0.02   -0.01 -0.01
## LabelAppeal             0.01              0.01              -0.01   -0.01  0.00
## AcidIndex               0.03             -0.04              -0.05    0.04 -0.06
## STARS                   0.00             -0.01               0.01   -0.02  0.00
##                    Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET                 -0.04    0.06        0.36     -0.25  0.56
## FixedAcidity            0.03   -0.01        0.00      0.18 -0.01
## VolatileAcidity         0.00    0.00       -0.02      0.04 -0.03
## CitricAcid             -0.01    0.02        0.01      0.07  0.00
## ResidualSugar          -0.01   -0.02        0.00     -0.01  0.02
## Chlorides               0.00   -0.02        0.01      0.03  0.00
## FreeSulfurDioxide       0.01   -0.02        0.01     -0.04 -0.01
## TotalSulfurDioxide     -0.01   -0.02       -0.01     -0.05  0.01
## Density                -0.01   -0.01       -0.01      0.04 -0.02
## pH                      0.01   -0.01        0.00     -0.06  0.00
## Sulphates               1.00    0.00        0.00      0.03 -0.01
## Alcohol                 0.00    1.00        0.00     -0.04  0.07
## LabelAppeal             0.00    0.00        1.00      0.02  0.33
## AcidIndex               0.03   -0.04        0.02      1.00 -0.09
## STARS                  -0.01    0.07        0.33     -0.09  1.00
## 
## n
##                    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET              12795        12795           12795      12795         12179
## FixedAcidity        12795        12795           12795      12795         12179
## VolatileAcidity     12795        12795           12795      12795         12179
## CitricAcid          12795        12795           12795      12795         12179
## ResidualSugar       12179        12179           12179      12179         12179
## Chlorides           12157        12157           12157      12157         11585
## FreeSulfurDioxide   12148        12148           12148      12148         11563
## TotalSulfurDioxide  12113        12113           12113      12113         11532
## Density             12795        12795           12795      12795         12179
## pH                  12400        12400           12400      12400         11802
## Sulphates           11585        11585           11585      11585         11030
## Alcohol             12142        12142           12142      12142         11563
## LabelAppeal         12795        12795           12795      12795         12179
## AcidIndex           12795        12795           12795      12795         12179
## STARS                9436         9436            9436       9436          8984
##                    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density    pH
## TARGET                 12157             12148              12113   12795 12400
## FixedAcidity           12157             12148              12113   12795 12400
## VolatileAcidity        12157             12148              12113   12795 12400
## CitricAcid             12157             12148              12113   12795 12400
## ResidualSugar          11585             11563              11532   12179 11802
## Chlorides              12157             11544              11510   12157 11773
## FreeSulfurDioxide      11544             12148              11512   12148 11771
## TotalSulfurDioxide     11510             11512              12113   12113 11739
## Density                12157             12148              12113   12795 12400
## pH                     11773             11771              11739   12400 12400
## Sulphates              10991             10995              10973   11585 11228
## Alcohol                11538             11527              11497   12142 11771
## LabelAppeal            12157             12148              12113   12795 12400
## AcidIndex              12157             12148              12113   12795 12400
## STARS                   8969              8979               8942    9436  9154
##                    Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET                 11585   12142       12795     12795  9436
## FixedAcidity           11585   12142       12795     12795  9436
## VolatileAcidity        11585   12142       12795     12795  9436
## CitricAcid             11585   12142       12795     12795  9436
## ResidualSugar          11030   11563       12179     12179  8984
## Chlorides              10991   11538       12157     12157  8969
## FreeSulfurDioxide      10995   11527       12148     12148  8979
## TotalSulfurDioxide     10973   11497       12113     12113  8942
## Density                11585   12142       12795     12795  9436
## pH                     11228   11771       12400     12400  9154
## Sulphates              11585   10989       11585     11585  8564
## Alcohol                10989   12142       12142     12142  8963
## LabelAppeal            11585   12142       12795     12795  9436
## AcidIndex              11585   12142       12795     12795  9436
## STARS                   8564    8963        9436      9436  9436
## 
## P
##                    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET                    0.0000       0.0000          0.3260     0.0688       
## FixedAcidity       0.0000              0.1616          0.1072     0.0375       
## VolatileAcidity    0.0000 0.1616                       0.0552     0.4744       
## CitricAcid         0.3260 0.1072       0.0552                     0.4438       
## ResidualSugar      0.0688 0.0375       0.4744          0.4438                  
## Chlorides          0.0000 0.9598       0.9134          0.3449     0.5471       
## FreeSulfurDioxide  0.0000 0.5837       0.4354          0.4787     0.0600       
## TotalSulfurDioxide 0.0000 0.0133       0.0203          0.4867     0.0158       
## Density            0.0000 0.4638       0.0956          0.1145     0.6509       
## pH                 0.2930 0.3172       0.1302          0.3322     0.1880       
## Sulphates          0.0000 0.0009       0.9889          0.1621     0.4173       
## Alcohol            0.0000 0.3018       0.6536          0.0603     0.0315       
## LabelAppeal        0.0000 0.7034       0.0547          0.3279     0.7979       
## AcidIndex          0.0000 0.0000       0.0000          0.0000     0.2989       
## STARS              0.0000 0.5197       0.0008          0.9485     0.1126       
##                    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## TARGET             0.0000    0.0000            0.0000             0.0000 
## FixedAcidity       0.9598    0.5837            0.0133             0.4638 
## VolatileAcidity    0.9134    0.4354            0.0203             0.0956 
## CitricAcid         0.3449    0.4787            0.4867             0.1145 
## ResidualSugar      0.5471    0.0600            0.0158             0.6509 
## Chlorides                    0.0264            0.1333             0.0125 
## FreeSulfurDioxide  0.0264                      0.1410             0.7263 
## TotalSulfurDioxide 0.1333    0.1410                               0.1584 
## Density            0.0125    0.7263            0.1584                    
## pH                 0.0561    0.5117            0.6380             0.5207 
## Sulphates          0.7302    0.2242            0.4550             0.3296 
## Alcohol            0.0344    0.0460            0.0871             0.4267 
## LabelAppeal        0.2466    0.2566            0.2834             0.2892 
## AcidIndex          0.0054    0.0000            0.0000             0.0000 
## STARS              0.6405    0.3895            0.1878             0.0757 
##                    pH     Sulphates Alcohol LabelAppeal AcidIndex STARS 
## TARGET             0.2930 0.0000    0.0000  0.0000      0.0000    0.0000
## FixedAcidity       0.3172 0.0009    0.3018  0.7034      0.0000    0.5197
## VolatileAcidity    0.1302 0.9889    0.6536  0.0547      0.0000    0.0008
## CitricAcid         0.3322 0.1621    0.0603  0.3279      0.0000    0.9485
## ResidualSugar      0.1880 0.4173    0.0315  0.7979      0.2989    0.1126
## Chlorides          0.0561 0.7302    0.0344  0.2466      0.0054    0.6405
## FreeSulfurDioxide  0.5117 0.2242    0.0460  0.2566      0.0000    0.3895
## TotalSulfurDioxide 0.6380 0.4550    0.0871  0.2834      0.0000    0.1878
## Density            0.5207 0.3296    0.4267  0.2892      0.0000    0.0757
## pH                        0.5618    0.2103  0.6450      0.0000    0.9627
## Sulphates          0.5618           0.6192  0.6757      0.0002    0.2548
## Alcohol            0.2103 0.6192            0.9099      0.0000    0.0000
## LabelAppeal        0.6450 0.6757    0.9099              0.0051    0.0000
## AcidIndex          0.0000 0.0002    0.0000  0.0051                0.0000
## STARS              0.9627 0.2548    0.0000  0.0000      0.0000

#notice when running above correlation matrix that the correlation between STARS and the target variable is high,  the higher of a rating the more samples were requested by distribution companies for a specific brand

#very important to note, also when viewing our correlation matrix, in terms of our 'TARGET' variable and the complete variables, we also have 'LabelAppeal' appears to be a variable which is highly correlated with 'TARGET', 'AcidIndex' can be considered similarly so. 


wine_train_df.cor = cor(wine_train_df, use = "pairwise.complete.obs")
corrplot(wine_train_df.cor)

#notice that when initially first using "complete.obs", it gives error because of various N/A's throughout our data
#"complete.obs" above initially failed because it only calculates correlations based on rows that don't have N/A's (every row has an N/A somewhere apparently)
#if use has the value "pairwise.complete.obs" then the correlation between each pair of variables is computed using all complete pairs of observations on those variables.


# 'STARS', 'AcidIndex', 'LabelAppeal' are the ones with the highest correlation with the 'TARGET' variable
#Basically, for every value of 'STARS', ie 'STARS=1 or =2 or =3 or =4, it shows the distribution the 'TARGET' variable in each case
#You'll notice, for example when 'STARS'=4, the target variable generally has more high value than low values
histogram(~ TARGET | STARS, data = wine_train_df)

#'LabelAppeal'  has range from min -2 and max 2, this can possibly be thought of as categorical
#using 'LabelAppeal' we can think of -2 as "very Bad", 1 as "Bad", 0 as neutral, 1 as "good", 2 as "very good"
#IMPORTANT to note the higher the label appeal, then the higher the demand for samples for specific brand  
histogram(~ TARGET | LabelAppeal, data = wine_train_df)

#Note 'AcidIndex' is NOT categorical, we cannot put one graph per every value of 'AcidIndex'. Instead we can switch and use 'TARGET' and have 'AcidIndex' as the histogram
#viewing the histogram, we can say that the ones with high 'TARGET' value should have lower 'AcidIndex'. As we go lower and lower according to 'TARGET' variable then the 'AcidIndex' should get higher. This makes sense since 'AcidIndex' appeared to be negatively correlated  (-0.25)
histogram(~ AcidIndex | TARGET, data = wine_train_df)

cor_stars_tgt <- cor.test(wine_train_df$STARS, wine_train_df$TARGET)
cor_stars_tgt

## 
##  Pearson's product-moment correlation
## 
## data:  wine_train_df$STARS and wine_train_df$TARGET
## t = 65.446, df = 9434, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5447586 0.5725160
## sample estimates:
##       cor 
## 0.5587938

cor_lbl_tgr <- cor.test(wine_train_df$LabelAppeal, wine_train_df$TARGET)
cor_lbl_tgr

## 
##  Pearson's product-moment correlation
## 
## data:  wine_train_df$LabelAppeal and wine_train_df$TARGET
## t = 43.158, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3412812 0.3715329
## sample estimates:
##       cor 
## 0.3565005

cor_acid_tgt <- cor.test(wine_train_df$AcidIndex, wine_train_df$TARGET)
cor_acid_tgt

## 
##  Pearson's product-moment correlation
## 
## data:  wine_train_df$AcidIndex and wine_train_df$TARGET
## t = -28.712, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2622588 -0.2297013
## sample estimates:
##        cor 
## -0.2460494

#we tested the significance of the three correlation, with 5% significant level, they are all significant


#T-test (DIDNT WORK)
#because t-test only works if the grouping variable, only has two groups
#t_AcidIndextarget <- t.test(wine_train_df$AcidIndex~ wine_train_df$TARGET)
#t_AcidIndextarget


# Compute the analysis of variance, when has more than two groups perform ANOVA
res.aov <- aov(AcidIndex ~ TARGET, data = wine_train_df)
# Summary of the analysis
summary(res.aov)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## TARGET          1   1358  1357.6   824.4 <2e-16 ***
## Residuals   12793  21067     1.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#potentially acid index has a potential effect on TARGET variable

DATA PREPARATION

Dropping missing data has important implications on a model’s ability to predict on an evaluation dataset. When all nulls and missing data are removed, the means and medians of the variables change. With no nulls or missing data, there are 6,436 observations total. While this isn’t a bad amount of sample, it does reduce the observations by half of the original dataset. Below, you can see how the removal has affected the medians and means of each variable.

Because data is rarely perfect and simply removing missing data introduces a certain type of bias, we can also try imputing missing data based on XXX.

We can also explore the transformation of variables, either through log transformation or interaction variables. These transformations will be discussed on a model-specific basis in the Build Models selection, as well as highlighted in the R codes included in the Appendix.

# #For these ones, they seem more random, the 'missingness' may not have any indications, we impute using mean or conditional mean on target. 
# ResidualSugar
# Chlorides
# FreeSulfurDioxide
# TotalSulfurDioxide
# pH
# Alcohol

#we are creating groups based on TARGET, replace missing value
wine_train_df <- wine_train_df %>% 
            mutate(ResidualSugar= ifelse(is.na(ResidualSugar), 
                                         mean(ResidualSugar, na.rm=TRUE),ResidualSugar),
                   Chlorides= ifelse(is.na(Chlorides), 
                                         mean(Chlorides, na.rm=TRUE),Chlorides),
                   FreeSulfurDioxide= ifelse(is.na(FreeSulfurDioxide), 
                                         mean(FreeSulfurDioxide, na.rm=TRUE),FreeSulfurDioxide),
                   TotalSulfurDioxide= ifelse(is.na(TotalSulfurDioxide), 
                                         mean(TotalSulfurDioxide, na.rm=TRUE),TotalSulfurDioxide),
                   pH= ifelse(is.na(pH), 
                                         mean(pH, na.rm=TRUE),pH),
                   Alcohol= ifelse(is.na(Alcohol), 
                                         mean(Alcohol, na.rm=TRUE),Alcohol),
                   )

#LOG TRANSFORMATION
#Log transformation for the variables with high variance, the variables are translated first so that we get rid of the negative values so that the log function can handle them
#we transform 'y' like in last assignment, instead of log(y), we can do y + 1 - min(y)
#for example, the minimum of 'FreeSulfurDioxide' is -555,  this will transform all the values of 'FreeSulfurDioxide by 555, so the -555 is not in the negative range(will be positive) and log() will be able to handle

#A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1. 

wine_train_df$FreeSulfurDioxide_log <- log(wine_train_df$FreeSulfurDioxide + 1 - min(wine_train_df$FreeSulfurDioxide))
wine_train_df$TotalSulfurDioxide_log <- log(wine_train_df$TotalSulfurDioxide + 1 - min(wine_train_df$TotalSulfurDioxide))                   
                   
# #Flags for N/A's:
# #These ones need careful consideration as the percentage of N/A's is significant
# Sulphates(*)
# STARS(***)

#OBSERVATION: we notice some variable with "low-ish" N/A's, when we explored the data we saw that every single brand has an N/A in these properties. It may be that these N/A's are random. It may not be the case that specific brands are 'bad' or 'low price' since they have N/A's in certain properties. The N/A's we see dispersed among the different brands     

wine_train_df <- wine_train_df %>% 
            mutate(Sulphates_flag= ifelse(is.na(Sulphates),1,0),
                   STARS_flag= ifelse(is.na(STARS),1,0)
                   )
#flags = will create 1 if NA
histogram(~ TARGET | STARS_flag, data = wine_train_df)

histogram(~ TARGET | Sulphates_flag, data = wine_train_df)

#left, stars_flag=0 and right side graph=1 missing data

#histogram on the left STARS_flag has no missing data, the average is higher
##histogram on the left STARS_flag has missing data, the average is generally lower
###We might assume that those with missing data generally are less purchases or of lower quality

#sulphates_flag with 0 and 1 are similar(symmetrical)


#corrective actions
wine_train_df <- wine_train_df %>% 
            mutate(Sulphates= ifelse(is.na(Sulphates), 
                                         mean(Sulphates, na.rm=TRUE),Sulphates),
                   STARS_merged=ifelse(is.na(STARS),0,STARS))
#if it is missing it will equal to 0, if has a value it will stay same

#you will see includes 0 for missing values
table(wine_train_df$STARS_merged)

## 
##    0    1    2    3    4 
## 3359 3042 3570 2212  612

##you will see includes no 0 columns
table(wine_train_df$STARS)

## 
##    1    2    3    4 
## 3042 3570 2212  612

#creating clusters for acid index
kmeans.re <- kmeans(wine_train_df$AcidIndex, centers = 5)
table(kmeans.re$cluster)

## 
##    1    2    3    4    5 
## 1197 2236   78 9020  264

wine_train_df$AcidIndex_clusters <- kmeans.re$cluster

histogram(~ TARGET | AcidIndex_clusters, data = wine_train_df)

#the higher the acidindex the lower the target variable
#other clusters are almost identical in shape, generally even distributed
#better to switch to original acidindex variable

BUILD THE MODELS

Poisson Regression Models (2)

Negative Binomial Regression Models (2)

Multiple Linear Regression Models (2)

From assignment: Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations). Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models. Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done. Discuss the coefficients in the models, do they make sense? In this case, about the only thing you can comment on is the number of stars and the wine label appeal. However, you might comment on the coefficient and magnitude of variables and how they are similar or different from model to model. For example, you might say “pH seems to have a major positive impact in my poisson regression model, but a negative effect in my multiple linear regression model”. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

Homework #5: Count Regression Models

Douglas Barley, Ethan Haley, Isabel Magnus, John Mazon, Vinayak Kamath, Arushi Arora

11/28/2021

OVERVIEW

DATA EXPLORATION

DATA PREPARATION

BUILD THE MODELS

SELECT THE MODELS