OVERVIEW

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

-INDEX: Identification Variable (do not use) None

-TARGET: Number of Cases Purchased None

-AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average

-Alcohol: Alcohol Content

-Chlorides: Chloride content of wine

-CitricAcid: Citric Acid Content

-Density: Density of Wine

-FixedAcidity: Fixed Acidity of Wine

-FreeSulfurDioxide: Sulfur Dioxide content of wine

-LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.

-ResidualSugar: Residual Sugar of wine

-STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales

-Sulphates: Sulfate content of wine

-TotalSulfurDioxide: Total Sulfur Dioxide of Wine

-VolatileAcidity: Volatile Acid content of wine

-pH: pH of wine

#importing the train an eval data
wine_train_df<- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-training-data.csv")
wine_train_df <- wine_train_df[,2:16]
wine_eval_df<- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/wine-evaluation-data.csv")
wine_eval_df <- wine_eval_df[,2:16]

#per assignment instructions, we don't use first column 'ID', so we remove it, we performed in above manner to keep all rows

DATA EXPLORATION

We start off with 3359 N/A’s for STARS. We interpret this as not being rated. We are told from the instruction not to use Index. Every row is a different wine. Also we noticed that in our instructions it states, “the fact that a variable is missing is actually predictive of target” It may be the case that N/A’s may be predictive of the number of sales of wine. Those that are “un-rated” me be indicative of a “cheaper” wine or low sales. We shouldn’t impute the N/A’s since it may have predictive power. We will save these N/A’s in an acceptable form that our models will be able to interpret it. Since We see STARS has a large number of N/A’s it leads us to believe that it may be indicative as it relates to predictions.

We see that all our variables are numeric. This makes it easier since our focus is on quantitative analysis. When we review our variables we notice ‘TotalSulfurDioxide’ has largest range, starts -823 and ends 1057. Based on summaries, we need to further check the variation in the variables with high range such as ‘FreeSulfurDioxide’, ‘TotalSulfurDioxide’, to a lower extent ‘ResidualSugar’. The rest are normal. Secondly we will flag missing values such as with variables STARS and possibly ‘SULPHATES’, ‘CHlorides’, ‘PH’, ‘ResidualSugar’.

We believe it is a good idea to check if ‘TARGET’ variable is very different between entries with N/A’s and complete observations

STARS can be dealt as a factor variable, because it is not continuous, it may be thought of as categorized

Notice in this assignment there is no need to cleanup character/symbols which causes issues when performing mathematical computations

summary(wine_train_df)
##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359
#we needed to check the variances of the variables with high ranges
#note Standard Deviation is huge for 'FreesulfureDioxide' and 'TotalSulfureDioxide' and to lower extent 'ResidualSugar'
#notice that 'FreesulfureDioxide' and 'TotalSulfureDioxide' appear to include negative values
#'STARS' has exceptionally high number of missing values, as well as multiple chemical properties
#important to note the LOG cannot take negative values

kable(format(sapply(wine_train_df, function(wine_train_df) c( "Stand dev" = round(sd(wine_train_df, na.rm = T),2), 
                         "Mean"= mean(wine_train_df,na.rm=TRUE),
                         "n" = length(wine_train_df),
                         "Median" = median(wine_train_df,na.rm = TRUE),
                         "CoeffofVariation" = sd(wine_train_df)/mean(wine_train_df,na.rm=TRUE),
                         "Minimum" = min(wine_train_df),
                         "Maximum" = max(wine_train_df),
                         "Upper Quantile" = quantile(wine_train_df,1,na.rm = TRUE),
                         "LowerQuartile" = quantile(wine_train_df,0,na.rm = TRUE)
                    )
), scientific = FALSE)
)
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
Stand dev 1.930000000 6.320000000 0.780000000 0.860000000 33.750000000 0.320000000 148.710000000 231.910000000 0.030000000 0.680000000 0.930000000 3.730000000 0.890000000 1.320000000 0.900000000
Mean 3.029073857 7.075717077 0.324103947 0.308412661 5.418733065 0.054822489 30.845571287 120.714232643 0.994202718 3.207628226 0.527111782 10.489236260 -0.009066041 7.772723720 2.041754981
n 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000 12795.000000000
Median 3.000000000 6.900000000 0.280000000 0.310000000 3.900000000 0.046000000 30.000000000 123.000000000 0.994490000 3.200000000 0.500000000 10.400000000 0.000000000 8.000000000 2.000000000
CoeffofVariation 0.635959475 0.892862644 2.419020945 2.795215292 NA NA NA NA 0.026692389 NA NA NA -98.288680252 0.170329786 NA
Minimum 0.000000000 -18.100000000 -2.790000000 -3.240000000 NA NA NA NA 0.888090000 NA NA NA -2.000000000 4.000000000 NA
Maximum 8.000000000 34.400000000 3.680000000 3.860000000 NA NA NA NA 1.099240000 NA NA NA 2.000000000 17.000000000 NA
Upper Quantile.100% 8.000000000 34.400000000 3.680000000 3.860000000 141.150000000 1.351000000 623.000000000 1057.000000000 1.099240000 6.130000000 4.240000000 26.500000000 2.000000000 17.000000000 4.000000000
LowerQuartile.0% 0.000000000 -18.100000000 -2.790000000 -3.240000000 -127.800000000 -1.171000000 -555.000000000 -823.000000000 0.888090000 0.480000000 -3.130000000 -4.700000000 -2.000000000 4.000000000 1.000000000
#We might need to transform the 'FreesulfureDioxide' and 'TotalSulfureDioxide'
#NA's Flags: Explore if they have impact on 'TARGET', if yes, then impute?
#equal to the mean or significantly different from those with without N/A's in terms of 'TARGET' variable

#VISUAL EXPLORATION
ggplot(gather(wine_train_df), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

hist(wine_train_df$TARGET)

table(wine_train_df$TARGET)
## 
##    0    1    2    3    4    5    6    7    8 
## 2734  244 1091 2611 3177 2014  765  142   17
#Corr matrix and the scatterplot matrix

##correlation matrix
wine_train_df.rcorr = rcorr(as.matrix(wine_train_df))
wine_train_df.rcorr
##                    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET               1.00        -0.05           -0.09       0.01          0.02
## FixedAcidity        -0.05         1.00            0.01       0.01         -0.02
## VolatileAcidity     -0.09         0.01            1.00      -0.02         -0.01
## CitricAcid           0.01         0.01           -0.02       1.00         -0.01
## ResidualSugar        0.02        -0.02           -0.01      -0.01          1.00
## Chlorides           -0.04         0.00            0.00      -0.01         -0.01
## FreeSulfurDioxide    0.04         0.00           -0.01       0.01          0.02
## TotalSulfurDioxide   0.05        -0.02           -0.02       0.01          0.02
## Density             -0.04         0.01            0.01      -0.01          0.00
## pH                  -0.01        -0.01            0.01      -0.01          0.01
## Sulphates           -0.04         0.03            0.00      -0.01         -0.01
## Alcohol              0.06        -0.01            0.00       0.02         -0.02
## LabelAppeal          0.36         0.00           -0.02       0.01          0.00
## AcidIndex           -0.25         0.18            0.04       0.07         -0.01
## STARS                0.56        -0.01           -0.03       0.00          0.02
##                    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density    pH
## TARGET                 -0.04              0.04               0.05   -0.04 -0.01
## FixedAcidity            0.00              0.00              -0.02    0.01 -0.01
## VolatileAcidity         0.00             -0.01              -0.02    0.01  0.01
## CitricAcid             -0.01              0.01               0.01   -0.01 -0.01
## ResidualSugar          -0.01              0.02               0.02    0.00  0.01
## Chlorides               1.00             -0.02              -0.01    0.02 -0.02
## FreeSulfurDioxide      -0.02              1.00               0.01    0.00  0.01
## TotalSulfurDioxide     -0.01              0.01               1.00    0.01  0.00
## Density                 0.02              0.00               0.01    1.00  0.01
## pH                     -0.02              0.01               0.00    0.01  1.00
## Sulphates               0.00              0.01              -0.01   -0.01  0.01
## Alcohol                -0.02             -0.02              -0.02   -0.01 -0.01
## LabelAppeal             0.01              0.01              -0.01   -0.01  0.00
## AcidIndex               0.03             -0.04              -0.05    0.04 -0.06
## STARS                   0.00             -0.01               0.01   -0.02  0.00
##                    Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET                 -0.04    0.06        0.36     -0.25  0.56
## FixedAcidity            0.03   -0.01        0.00      0.18 -0.01
## VolatileAcidity         0.00    0.00       -0.02      0.04 -0.03
## CitricAcid             -0.01    0.02        0.01      0.07  0.00
## ResidualSugar          -0.01   -0.02        0.00     -0.01  0.02
## Chlorides               0.00   -0.02        0.01      0.03  0.00
## FreeSulfurDioxide       0.01   -0.02        0.01     -0.04 -0.01
## TotalSulfurDioxide     -0.01   -0.02       -0.01     -0.05  0.01
## Density                -0.01   -0.01       -0.01      0.04 -0.02
## pH                      0.01   -0.01        0.00     -0.06  0.00
## Sulphates               1.00    0.00        0.00      0.03 -0.01
## Alcohol                 0.00    1.00        0.00     -0.04  0.07
## LabelAppeal             0.00    0.00        1.00      0.02  0.33
## AcidIndex               0.03   -0.04        0.02      1.00 -0.09
## STARS                  -0.01    0.07        0.33     -0.09  1.00
## 
## n
##                    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET              12795        12795           12795      12795         12179
## FixedAcidity        12795        12795           12795      12795         12179
## VolatileAcidity     12795        12795           12795      12795         12179
## CitricAcid          12795        12795           12795      12795         12179
## ResidualSugar       12179        12179           12179      12179         12179
## Chlorides           12157        12157           12157      12157         11585
## FreeSulfurDioxide   12148        12148           12148      12148         11563
## TotalSulfurDioxide  12113        12113           12113      12113         11532
## Density             12795        12795           12795      12795         12179
## pH                  12400        12400           12400      12400         11802
## Sulphates           11585        11585           11585      11585         11030
## Alcohol             12142        12142           12142      12142         11563
## LabelAppeal         12795        12795           12795      12795         12179
## AcidIndex           12795        12795           12795      12795         12179
## STARS                9436         9436            9436       9436          8984
##                    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density    pH
## TARGET                 12157             12148              12113   12795 12400
## FixedAcidity           12157             12148              12113   12795 12400
## VolatileAcidity        12157             12148              12113   12795 12400
## CitricAcid             12157             12148              12113   12795 12400
## ResidualSugar          11585             11563              11532   12179 11802
## Chlorides              12157             11544              11510   12157 11773
## FreeSulfurDioxide      11544             12148              11512   12148 11771
## TotalSulfurDioxide     11510             11512              12113   12113 11739
## Density                12157             12148              12113   12795 12400
## pH                     11773             11771              11739   12400 12400
## Sulphates              10991             10995              10973   11585 11228
## Alcohol                11538             11527              11497   12142 11771
## LabelAppeal            12157             12148              12113   12795 12400
## AcidIndex              12157             12148              12113   12795 12400
## STARS                   8969              8979               8942    9436  9154
##                    Sulphates Alcohol LabelAppeal AcidIndex STARS
## TARGET                 11585   12142       12795     12795  9436
## FixedAcidity           11585   12142       12795     12795  9436
## VolatileAcidity        11585   12142       12795     12795  9436
## CitricAcid             11585   12142       12795     12795  9436
## ResidualSugar          11030   11563       12179     12179  8984
## Chlorides              10991   11538       12157     12157  8969
## FreeSulfurDioxide      10995   11527       12148     12148  8979
## TotalSulfurDioxide     10973   11497       12113     12113  8942
## Density                11585   12142       12795     12795  9436
## pH                     11228   11771       12400     12400  9154
## Sulphates              11585   10989       11585     11585  8564
## Alcohol                10989   12142       12142     12142  8963
## LabelAppeal            11585   12142       12795     12795  9436
## AcidIndex              11585   12142       12795     12795  9436
## STARS                   8564    8963        9436      9436  9436
## 
## P
##                    TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## TARGET                    0.0000       0.0000          0.3260     0.0688       
## FixedAcidity       0.0000              0.1616          0.1072     0.0375       
## VolatileAcidity    0.0000 0.1616                       0.0552     0.4744       
## CitricAcid         0.3260 0.1072       0.0552                     0.4438       
## ResidualSugar      0.0688 0.0375       0.4744          0.4438                  
## Chlorides          0.0000 0.9598       0.9134          0.3449     0.5471       
## FreeSulfurDioxide  0.0000 0.5837       0.4354          0.4787     0.0600       
## TotalSulfurDioxide 0.0000 0.0133       0.0203          0.4867     0.0158       
## Density            0.0000 0.4638       0.0956          0.1145     0.6509       
## pH                 0.2930 0.3172       0.1302          0.3322     0.1880       
## Sulphates          0.0000 0.0009       0.9889          0.1621     0.4173       
## Alcohol            0.0000 0.3018       0.6536          0.0603     0.0315       
## LabelAppeal        0.0000 0.7034       0.0547          0.3279     0.7979       
## AcidIndex          0.0000 0.0000       0.0000          0.0000     0.2989       
## STARS              0.0000 0.5197       0.0008          0.9485     0.1126       
##                    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## TARGET             0.0000    0.0000            0.0000             0.0000 
## FixedAcidity       0.9598    0.5837            0.0133             0.4638 
## VolatileAcidity    0.9134    0.4354            0.0203             0.0956 
## CitricAcid         0.3449    0.4787            0.4867             0.1145 
## ResidualSugar      0.5471    0.0600            0.0158             0.6509 
## Chlorides                    0.0264            0.1333             0.0125 
## FreeSulfurDioxide  0.0264                      0.1410             0.7263 
## TotalSulfurDioxide 0.1333    0.1410                               0.1584 
## Density            0.0125    0.7263            0.1584                    
## pH                 0.0561    0.5117            0.6380             0.5207 
## Sulphates          0.7302    0.2242            0.4550             0.3296 
## Alcohol            0.0344    0.0460            0.0871             0.4267 
## LabelAppeal        0.2466    0.2566            0.2834             0.2892 
## AcidIndex          0.0054    0.0000            0.0000             0.0000 
## STARS              0.6405    0.3895            0.1878             0.0757 
##                    pH     Sulphates Alcohol LabelAppeal AcidIndex STARS 
## TARGET             0.2930 0.0000    0.0000  0.0000      0.0000    0.0000
## FixedAcidity       0.3172 0.0009    0.3018  0.7034      0.0000    0.5197
## VolatileAcidity    0.1302 0.9889    0.6536  0.0547      0.0000    0.0008
## CitricAcid         0.3322 0.1621    0.0603  0.3279      0.0000    0.9485
## ResidualSugar      0.1880 0.4173    0.0315  0.7979      0.2989    0.1126
## Chlorides          0.0561 0.7302    0.0344  0.2466      0.0054    0.6405
## FreeSulfurDioxide  0.5117 0.2242    0.0460  0.2566      0.0000    0.3895
## TotalSulfurDioxide 0.6380 0.4550    0.0871  0.2834      0.0000    0.1878
## Density            0.5207 0.3296    0.4267  0.2892      0.0000    0.0757
## pH                        0.5618    0.2103  0.6450      0.0000    0.9627
## Sulphates          0.5618           0.6192  0.6757      0.0002    0.2548
## Alcohol            0.2103 0.6192            0.9099      0.0000    0.0000
## LabelAppeal        0.6450 0.6757    0.9099              0.0051    0.0000
## AcidIndex          0.0000 0.0002    0.0000  0.0051                0.0000
## STARS              0.9627 0.2548    0.0000  0.0000      0.0000
#notice when running above correlation matrix that the correlation between STARS and the target variable is high,  the higher of a rating the more samples were requested by distribution companies for a specific brand

#very important to note, also when viewing our correlation matrix, in terms of our 'TARGET' variable and the complete variables, we also have 'LabelAppeal' appears to be a variable which is highly correlated with 'TARGET', 'AcidIndex' can be considered similarly so. 


wine_train_df.cor = cor(wine_train_df, use = "pairwise.complete.obs")
corrplot(wine_train_df.cor)

#notice that when initially first using "complete.obs", it gives error because of various N/A's throughout our data
#"complete.obs" above initially failed because it only calculates correlations based on rows that don't have N/A's (every row has an N/A somewhere apparently)
#if use has the value "pairwise.complete.obs" then the correlation between each pair of variables is computed using all complete pairs of observations on those variables.


# 'STARS', 'AcidIndex', 'LabelAppeal' are the ones with the highest correlation with the 'TARGET' variable
#Basically, for every value of 'STARS', ie 'STARS=1 or =2 or =3 or =4, it shows the distribution the 'TARGET' variable in each case
#You'll notice, for example when 'STARS'=4, the target variable generally has more high value than low values
histogram(~ TARGET | STARS, data = wine_train_df)

#'LabelAppeal'  has range from min -2 and max 2, this can possibly be thought of as categorical
#using 'LabelAppeal' we can think of -2 as "very Bad", 1 as "Bad", 0 as neutral, 1 as "good", 2 as "very good"
#IMPORTANT to note the higher the label appeal, then the higher the demand for samples for specific brand  
histogram(~ TARGET | LabelAppeal, data = wine_train_df)

#Note 'AcidIndex' is NOT categorical, we cannot put one graph per every value of 'AcidIndex'. Instead we can switch and use 'TARGET' and have 'AcidIndex' as the histogram
#viewing the histogram, we can say that the ones with high 'TARGET' value should have lower 'AcidIndex'. As we go lower and lower according to 'TARGET' variable then the 'AcidIndex' should get higher. This makes sense since 'AcidIndex' appeared to be negatively correlated  (-0.25)
histogram(~ AcidIndex | TARGET, data = wine_train_df)

cor_stars_tgt <- cor.test(wine_train_df$STARS, wine_train_df$TARGET)
cor_stars_tgt
## 
##  Pearson's product-moment correlation
## 
## data:  wine_train_df$STARS and wine_train_df$TARGET
## t = 65.446, df = 9434, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5447586 0.5725160
## sample estimates:
##       cor 
## 0.5587938
cor_lbl_tgr <- cor.test(wine_train_df$LabelAppeal, wine_train_df$TARGET)
cor_lbl_tgr
## 
##  Pearson's product-moment correlation
## 
## data:  wine_train_df$LabelAppeal and wine_train_df$TARGET
## t = 43.158, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3412812 0.3715329
## sample estimates:
##       cor 
## 0.3565005
cor_acid_tgt <- cor.test(wine_train_df$AcidIndex, wine_train_df$TARGET)
cor_acid_tgt
## 
##  Pearson's product-moment correlation
## 
## data:  wine_train_df$AcidIndex and wine_train_df$TARGET
## t = -28.712, df = 12793, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2622588 -0.2297013
## sample estimates:
##        cor 
## -0.2460494
#we tested the significance of the three correlation, with 5% significant level, they are all significant


#T-test (DIDNT WORK)
#because t-test only works if the grouping variable, only has two groups
#t_AcidIndextarget <- t.test(wine_train_df$AcidIndex~ wine_train_df$TARGET)
#t_AcidIndextarget


# Compute the analysis of variance, when has more than two groups perform ANOVA
res.aov <- aov(AcidIndex ~ TARGET, data = wine_train_df)
# Summary of the analysis
summary(res.aov)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## TARGET          1   1358  1357.6   824.4 <2e-16 ***
## Residuals   12793  21067     1.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#potentially acid index has a potential effect on TARGET variable

DATA PREPARATION

# #For these ones, they seem more random, the 'missingness' may not have any indications, we impute using mean or conditional mean on target. 
# ResidualSugar
# Chlorides
# FreeSulfurDioxide
# TotalSulfurDioxide
# pH
# Alcohol

#we are creating groups based on TARGET, replace missing value
wine_train_df <- wine_train_df %>% 
            mutate(ResidualSugar= ifelse(is.na(ResidualSugar), 
                                         mean(ResidualSugar, na.rm=TRUE),ResidualSugar),
                   Chlorides= ifelse(is.na(Chlorides), 
                                         mean(Chlorides, na.rm=TRUE),Chlorides),
                   FreeSulfurDioxide= ifelse(is.na(FreeSulfurDioxide), 
                                         mean(FreeSulfurDioxide, na.rm=TRUE),FreeSulfurDioxide),
                   TotalSulfurDioxide= ifelse(is.na(TotalSulfurDioxide), 
                                         mean(TotalSulfurDioxide, na.rm=TRUE),TotalSulfurDioxide),
                   pH= ifelse(is.na(pH), 
                                         mean(pH, na.rm=TRUE),pH),
                   Alcohol= ifelse(is.na(Alcohol), 
                                         mean(Alcohol, na.rm=TRUE),Alcohol),
                   )

#LOG TRANSFORMATION
#Log transformation for the variables with high variance, the variables are translated first so that we get rid of the negative values so that the log function can handle them
#we transform 'y' like in last assignment, instead of log(y), we can do y + 1 - min(y)
#for example, the minimum of 'FreeSulfurDioxide' is -555,  this will transform all the values of 'FreeSulfurDioxide by 555, so the -555 is not in the negative range(will be positive) and log() will be able to handle

#A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1. For the latter choice, you can show that a = b – min(Y), where b is either a small number or is 1. 

wine_train_df$FreeSulfurDioxide_log <- log(wine_train_df$FreeSulfurDioxide + 1 - min(wine_train_df$FreeSulfurDioxide))
wine_train_df$TotalSulfurDioxide_log <- log(wine_train_df$TotalSulfurDioxide + 1 - min(wine_train_df$TotalSulfurDioxide))                   
                   
# #Flags for N/A's:
# #These ones need careful consideration as the percentage of N/A's is significant
# Sulphates(*)
# STARS(***)

#OBSERVATION: we notice some variable with "low-ish" N/A's, when we explored the data we saw that every single brand has an N/A in these properties. It may be that these N/A's are random. It may not be the case that specific brands are 'bad' or 'low price' since they have N/A's in certain properties. The N/A's we see dispersed among the different brands     

wine_train_df <- wine_train_df %>% 
            mutate(Sulphates_flag= ifelse(is.na(Sulphates),1,0),
                   STARS_flag= ifelse(is.na(STARS),1,0)
                   )
#flags = will create 1 if NA
histogram(~ TARGET | STARS_flag, data = wine_train_df)

histogram(~ TARGET | Sulphates_flag, data = wine_train_df)

#left, stars_flag=0 and right side graph=1 missing data

#histogram on the left STARS_flag has no missing data, the average is higher
##histogram on the left STARS_flag has missing data, the average is generally lower
###We might assume that those with missing data generally are less purchases or of lower quality

#sulphates_flag with 0 and 1 are similar(symmetrical)


#corrective actions
wine_train_df <- wine_train_df %>% 
            mutate(Sulphates= ifelse(is.na(Sulphates), 
                                         mean(Sulphates, na.rm=TRUE),Sulphates),
                   STARS_merged=ifelse(is.na(STARS),0,STARS))
#if it is missing it will equal to 0, if has a value it will stay same

#you will see includes 0 for missing values
table(wine_train_df$STARS_merged) 
## 
##    0    1    2    3    4 
## 3359 3042 3570 2212  612
##you will see includes no 0 columns
table(wine_train_df$STARS)
## 
##    1    2    3    4 
## 3042 3570 2212  612
#creating clusters for acid index
kmeans.re <- kmeans(wine_train_df$AcidIndex, centers = 5)
table(kmeans.re$cluster)
## 
##     1     2     3     4     5 
##  1978    12   124   386 10295
wine_train_df$AcidIndex_clusters <- kmeans.re$cluster

histogram(~ TARGET | AcidIndex_clusters, data = wine_train_df)

#the higher the acidindex the lower the target variable
#other clusters are almost identical in shape, generally even distributed
#better to switch to original acidindex variable

BUILD THE MODELS

SELECT THE MODELS