Introduction

Here is a description of the assignment and data as provided by Marcus Ellis.

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant.

A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

  1. INDEX - Identification variable (do not use)
  2. TARGET - Number of cases purchased
  3. AcidIndex - Proprietary method of testing total acidity of wine by using a weighted average
  4. Alcohol
  5. Chlorides
  6. CitricAcid
  7. Density
  8. FixedAcidity
  9. FreeSulfurDioxide
  10. LabelAppeal - Marketing score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design.
  11. ResidualSugar
  12. STARS - Wine rating by a team of experts. 4 stars = excellent, 1 star = poor
  13. Sulphates
  14. TotalSulfurDioxide
  15. VolatileAcidity
  16. pH
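
Regarding the hint that missingness can itself be predictive: a minimal sketch of deriving an explicit missingness indicator (assuming the data have been read into a data frame called training, as below; the variable name STARS.missing is hypothetical, and later we instead fold missing STARS ratings into a 0 level):

# Sketch: flag wines with no expert rating, since the absence of a rating
# may itself predict TARGET
training$STARS.missing <- as.integer(is.na(training$STARS))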

Libraries

We will use tidyverse libraries including ggplot2, tidyr, dplyr, and stringr to process this data.

We will also use gridExtra to be able to place ggplot2 plots side-by-side.

We use the pscl package for its hurdle function.

We use the MASS package to run negative binomial models.

Data exploration

Basic data exploration

Reading in the data

Start by reading in the data.

Also, let’s immediately remove the INDEX column.

Read in both the training and evaluation data now, so that as we format the training data we can also change the evaluation data in the same way.

We read in the training data from here:

https://raw.githubusercontent.com/heathergeiger/Data621_hw5/master/wine-training-data.csv

And the evaluation data from here:

https://raw.githubusercontent.com/heathergeiger/Data621_hw5/master/wine-evaluation-data.csv

Check for variable type of each column and for missing values.

We start by looking at the number of non-NA values per column, which lets us look for missing values.

## [1] "Total number of records in data:"
## [1] 12795
## [1] "Number of non-NA values per variable in data:"
##              Variable     n
## 1               STARS  9436
## 2           Sulphates 11585
## 3  TotalSulfurDioxide 12113
## 4             Alcohol 12142
## 5   FreeSulfurDioxide 12148
## 6           Chlorides 12157
## 7       ResidualSugar 12179
## 8                  pH 12400
## 9           AcidIndex 12795
## 10         CitricAcid 12795
## 11            Density 12795
## 12       FixedAcidity 12795
## 13        LabelAppeal 12795
## 14             TARGET 12795
## 15    VolatileAcidity 12795

We find that over 25% of wines (3,359 of 12,795) were not rated by experts at all.

Some other variables also contain missing values, but not nearly as many.
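
The same check can also be written more directly as a count of missing values per column (a quick sketch, equivalent to the output above):

# Sketch: missing values per column, most-missing first
sort(colSums(is.na(training)), decreasing = TRUE)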

Now, let’s check the number of unique values per variable.

##              Variable Num.uniques
## 8         LabelAppeal           5
## 11              STARS           5
## 13             TARGET           9
## 1           AcidIndex          14
## 2             Alcohol         402
## 6        FixedAcidity         470
## 9                  pH         498
## 4          CitricAcid         602
## 12          Sulphates         631
## 15    VolatileAcidity         815
## 7   FreeSulfurDioxide        1000
## 14 TotalSulfurDioxide        1371
## 3           Chlorides        1664
## 10      ResidualSugar        2078
## 5             Density        5933
## [1] "Unique values in AcidIndex:"
##  [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17

In addition to the variables we expected, we find that AcidIndex also has relatively few unique values, namely all integers 4-17 inclusive.

Distribution of variables

Distribution of variables with <15 unique values

## [1] "Table of number of records per value in AcidIndex:"
##    value    n
## 1      4    3
## 2      5   75
## 3      6 1197
## 4      7 4878
## 5      8 4142
## 6      9 1427
## 7     10  551
## 8     11  258
## 9     12  128
## 10    13   69
## 11    14   47
## 12    15    8
## 13    16    5
## 14    17    7

We find that very few wines that were rated by experts are rated 4 stars. We also find that 2-star ratings are more common than 3-star ratings. This suggests that either these experts are fairly critical, or perhaps with only 4 stars instead of 5 they are using 2 as their rating for “just OK”. Notably, a fair number of wines (over 20%) were also rated as poor by experts.

Most label appeal ratings are either completely neutral (0) or mild (+/- 1). Very few are strongly positive or negative (+/- 2).

We find that over 20% of wines were not purchased at all, not even a single case. Within wines where at least one case was purchased, the distribution of cases purchased appears roughly normal, with 4 cases being the most common and relatively few wines having only one case or 7+ cases purchased.

Finally, we find that most wines have an acid index between 6 and 10. There are very few wines with an acid index of 4 (3 wines) or 15+ (20 wines).

Distribution of variables with many unique values

Many variables contain values that do not make sense.

For example, I am assuming that alcohol is a percentage by volume, in which case most of the common values we see make sense. However, none of these values should be negative.

In other variables, there are quite a few negative values where it seems like there shouldn't be. For example, residual sugar is typically measured as either a percentage (ranging from 0.1-20%) or in grams per liter (ranging from 1 to 200) (https://www.winecurmudgeon.com/residual-sugar-in-wine-with-charts-and-graphs/). So there really should not be negative values. A logarithmic transformation of the percentages could explain some negative values, but only if the negative values bottomed out around -1 or -2 rather than -100.
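
As a quick check of how widespread the negative values are, we can count them per chemistry-related variable (a sketch; output not shown):

# Sketch: number of negative values in each chemistry-related variable
chem_vars <- c("FixedAcidity","VolatileAcidity","CitricAcid","ResidualSugar",
    "Chlorides","FreeSulfurDioxide","TotalSulfurDioxide","Sulphates","Alcohol")
sapply(training[, chem_vars], function(x) sum(x < 0, na.rm = TRUE))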

Comparison of variables to plausible values

I did a bit of research on some of the continuous numeric variables.

  1. Fixed acidity - typical unit would be g/L. According to this site (http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity), typical values are:
       1. 1-4 g/L tartaric acid
       2. 0-8 g/L malic acid
       3. 0-0.5 g/L citric acid
       4. 0.5-2 g/L succinic acid

We find a fair number of fixed acidity values in the 15+ range. Even assuming that fixed acidity is the sum of the four main fixed acids, this is way too high, even before considering the negative values.

In a different data set that had a lot of similar variables, also about wine, fixed acidity was mostly around 6-7, with a range of around 4-16 (https://rstudio-pubs-static.s3.amazonaws.com/57835_c4ace81da9dc45438ad0c286bcbb4224.html).

  2. Volatile acidity - US legal limit is at most 1.4 g/L (https://www.law.cornell.edu/cfr/text/27/4.21), or else the wine is considered “substandard”. Other countries probably have similar limits. We find nearly 1,000 wines with values higher than this, which seems like a lot.

In the same data set I used to look at fixed acidity, volatile acidity maxed out at 1.58.

  3. pH - We should not be seeing many values with a pH below 2, as a pH of 2 is about that of vinegar (https://www.sciencebuddies.org/science-fair-projects/references/acids-bases-the-ph-scale). Yet, we see hundreds of wines with very low pHs, including over 50 with pH < 1 (more like battery or stomach acid).

  4. Alcohol - While the positive values seemed mostly plausible at first, the number of wines with very low alcohol content (less than 4-5%) still seems quite high. Even a very low alcohol wine like a Moscato will still have at least 4-5% alcohol.

Conclusions from looking at distributions of variables and comparison to plausible values

In conclusion, some quick research shows that the data contains many implausible values for the continuous numeric variables representing chemical properties of wine, even setting aside the issue of the negative values.

Going forward, I will focus my exploration and model building on the three variables that are not direct chemical properties of wine (STARS, LabelAppeal, and AcidIndex).

Correlation of predictor variables with target

We find that within wines we assume were not reviewed by experts (STARS not stated), the proportion of wines that had no cases purchased at all is substantially higher (~61% of these wines had TARGET = 0, versus at most 20% among wines with a star rating, even a poor one). Higher star ratings are also associated with more cases purchased among wines that were purchased at all.

We find that label appeal is not correlated with whether or not at least one case was purchased. However, it does appear that wines with negative ratings for label appeal are substantially more likely to have only 1 or 2 cases purchased instead of 3+.

Finally, we find that higher acid index values (mainly once you go to 9+) are associated with a substantially higher proportion of wines having zero cases purchased. However, there does not appear to be a clear trend in terms of the number of cases purchased for wines where at least one case was purchased.
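
These proportions can be computed directly. For example, for STARS (a sketch using dplyr; output not shown):

# Sketch: percent of wines with zero cases purchased, per star rating
# (NA treated as "Not stated")
training %>%
    mutate(STARS.grp = ifelse(is.na(STARS), "Not stated", as.character(STARS))) %>%
    group_by(STARS.grp) %>%
    summarize(pct_zero = mean(TARGET == 0) * 100)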

Logit(p) vs. STARS

For each level of STARS below 3 (treating not reviewed = 0, so 0, 1, and 2), get the logit of the probability that at least one case will be purchased, where logit(p) = log(p/(1 - p)).

What if we take the last probability as that for all wines with STARS >= 2, i.e. (number of wines with STARS >= 2 and TARGET >= 1)/(number of wines with STARS >= 2)?

If we were to run a binary logistic regression to predict whether or not a wine was purchased at all, converting STARS to a dummy variable where 0 = not reviewed, 1 = 1-star review, and 2 = 2+ star review, we would get a pretty good linear relationship between the dummy variable and logit(p).
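
A minimal sketch of that regression, using the TARGET.binary and STARS.factor variables created in the data transformation step below (we describe this model here but do not actually fit it in this report):

# Sketch: binary logistic regression of purchased-at-all on the 3-level STARS dummy
# (0 = not reviewed, 1 = 1-star, 2 = 2+ star)
logit_stars <- glm(TARGET.binary ~ STARS.factor, data = training, family = "binomial")
summary(logit_stars)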

Logit(p) vs. AcidIndex

For each AcidIndex level, get logit of the probability that at least one case will be purchased.

If we were to run a binary logistic regression to predict whether or not a wine was purchased at all, it seems we would get the best linear relationship if we converted to a dummy variable where 1 = AcidIndex <= 8, 2 = AcidIndex of 9, 3 = AcidIndex of 10, and 4 = AcidIndex of 11+.

Let’s plot this.

Looks great!

Data transformations

We will be removing a lot of variables from consideration due to concerns about data integrity. So, there are not many variables left to transform.

One step that seems reasonable is to treat 8 cases the same as 7 cases, i.e. subtract one from TARGET when TARGET = 8.

We should also ensure that records with STARS = NA are included by treating these records as having STARS = 0.

We will also create a dummy version of STARS with only 3 levels, where 0 = not given, 1 = 1-star review, and 2 = 2+ star review.

For AcidIndex, create an additional variable that is the simplified version of this, with levels 1 for <= 8, 2 for 9, 3 for 10, and 4 for 11+.

Also treat 4 the same as 5, and >12 the same as 12.
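
A compact sketch of these AcidIndex recodings, equivalent to the step-by-step version in the code appendix:

# Sketch: cap AcidIndex at 5-12, then collapse to 4 levels
# (1 = <= 8, 2 = 9, 3 = 10, 4 = 11+)
training$AcidIndex[training$AcidIndex == 4] <- 5
training$AcidIndex[training$AcidIndex > 12] <- 12
training$AcidIndex.simplified <- cut(training$AcidIndex,
    breaks = c(-Inf, 8, 9, 10, Inf), labels = FALSE)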

Data modeling

Poisson regression models

Model 1 - Non-zero-inflated, STARS + LabelAppeal + AcidIndex

## 
## Call:
## glm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex, family = "poisson", 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.98534  -0.72012   0.05626   0.54962   3.00033  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.246331   0.037600   33.15   <2e-16 ***
## STARS        0.313841   0.004509   69.61   <2e-16 ***
## LabelAppeal  0.131994   0.006061   21.78   <2e-16 ***
## AcidIndex   -0.091960   0.004615  -19.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22830  on 12794  degrees of freedom
## Residual deviance: 14804  on 12791  degrees of freedom
## AIC: 46751
## 
## Number of Fisher Scoring iterations: 5

STARS and LabelAppeal have positive coefficients, while AcidIndex's is negative. This is consistent with what we saw in the exploration.
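
Since the model uses a log link, exponentiating the coefficients puts them on a multiplicative scale. For example, exp(0.3138) ≈ 1.37, so each additional star multiplies the expected number of cases by roughly 1.37, holding the other predictors fixed:

# Multiplicative effects on the expected number of cases
exp(coef(poisson1))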

Model 2 - Hurdle, STARS + LabelAppeal for count model, STARS.factor + AcidIndex.simplified for zero hurdle model

## 
## Call:
## hurdle(formula = TARGET ~ STARS + LabelAppeal | STARS.factor + AcidIndex.simplified, 
##     data = training)
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.320418 -0.388881 -0.006511  0.456221  4.609426 
## 
## Count model coefficients (truncated poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 1.104051   0.011729   94.13   <2e-16 ***
## STARS       0.097617   0.005258   18.57   <2e-16 ***
## LabelAppeal 0.240735   0.006534   36.84   <2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.52791    0.05818   9.074   <2e-16 ***
## STARS.factor          2.11855    0.04157  50.969   <2e-16 ***
## AcidIndex.simplified -0.71509    0.03470 -20.608   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 9 
## Log-likelihood: -2.059e+04 on 6 Df

Directions of coefficients for both count and zero hurdle models make sense.
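
The two parts are interpreted separately: the zero hurdle coefficients act on the log-odds that at least one case is purchased, while the count coefficients act on the log of the mean of the positive counts. For example, exp(2.1186) ≈ 8.3, so each step up in the STARS dummy multiplies the odds of any purchase by roughly 8.3:

# Odds-ratio scale for the hurdle part, rate-ratio scale for the count part
exp(coef(poisson2, model = "zero"))
exp(coef(poisson2, model = "count"))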

Model 1 vs. model 2

Let’s look at the distribution of fitted values in each model, and compare to the real data.

Round each fitted value to the nearest integer for now.

We find both models have a distribution of fitted values that is quite different from the real one.

Notably, while the hurdle model does lead to some records having a fitted value = 0, there are still a lot fewer zeroes than in the real data.

Let’s look at the distribution of model2 fitted values when real = 0, and the distribution of real values when fitted = 1.

## [1] "TARGET actual values when Poisson model 2 fitted = 1:"
## 
##    0    1    2    3    4    5    6    7 
## 1409  121  324  442  201   55    7    3

We find that for many of the records where the real value is 0, the model 2 fitted value is around 0.5-1.5.
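
This makes sense given how the fitted values are computed: a hurdle model's fitted value is the expected count, i.e. the probability of clearing the hurdle times the mean of the truncated count distribution, so it rarely rounds to zero even when the predicted probability of zero is high. The zero probabilities themselves can be inspected directly (a sketch using pscl's predict method; output not shown):

# Sketch: predicted probability that TARGET = 0 under the hurdle model
p_zero <- predict(poisson2, type = "prob")[, 1]
summary(p_zero[training$TARGET == 0])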

Negative binomial models

Model 1 - Non-zero-inflated, STARS + LabelAppeal + AcidIndex

The same as Poisson model 1, but now using the negative binomial.

## 
## Call:
## glm.nb(formula = TARGET ~ STARS + AcidIndex + LabelAppeal, data = training, 
##     init.theta = 49019.8924, link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.98528  -0.72011   0.05625   0.54959   3.00026  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.246340   0.037601   33.15   <2e-16 ***
## STARS        0.313845   0.004509   69.61   <2e-16 ***
## AcidIndex   -0.091962   0.004616  -19.92   <2e-16 ***
## LabelAppeal  0.131993   0.006061   21.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(49019.89) family taken to be 1)
## 
##     Null deviance: 22829  on 12794  degrees of freedom
## Residual deviance: 14803  on 12791  degrees of freedom
## AIC: 46754
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  49020 
##           Std. Err.:  50931 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -46743.59

Model 2 - Hurdle, STARS + LabelAppeal for count model, STARS.factor + AcidIndex.simplified for zero hurdle model

The same as Poisson model 2, but now using the negative binomial.

## 
## Call:
## hurdle(formula = TARGET ~ STARS + LabelAppeal | STARS.factor + AcidIndex.simplified, 
##     data = training, dist = "negbin")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.320409 -0.388863 -0.006512  0.456232  4.609421 
## 
## Count model coefficients (truncated negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.104052   0.011729  94.129   <2e-16 ***
## STARS        0.097617   0.005258  18.567   <2e-16 ***
## LabelAppeal  0.240731   0.006534  36.840   <2e-16 ***
## Log(theta)  14.555204  10.952950   1.329    0.184    
## Zero hurdle model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.52791    0.05818   9.074   <2e-16 ***
## STARS.factor          2.11855    0.04157  50.969   <2e-16 ***
## AcidIndex.simplified -0.71509    0.03470 -20.608   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta: count = 2095292.824
## Number of iterations in BFGS optimization: 29 
## Log-likelihood: -2.059e+04 on 7 Df

Both look very similar to the Poisson models. This makes sense: the estimated theta values are enormous (about 49,020 for model 1 and over 2 million for the count component of model 2), and as theta grows the negative binomial converges to the Poisson, so there is essentially no overdispersion for the negative binomial to capture.

Multiple linear regression models

Model 1 - STARS + LabelAppeal + AcidIndex

## 
## Call:
## lm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5470 -1.0017  0.0789  0.9120  5.3090 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.33090    0.07984   41.72   <2e-16 ***
## STARS        0.98437    0.01044   94.31   <2e-16 ***
## AcidIndex   -0.22975    0.00961  -23.91   <2e-16 ***
## LabelAppeal  0.42781    0.01369   31.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.327 on 12791 degrees of freedom
## Multiple R-squared:  0.5237, Adjusted R-squared:  0.5236 
## F-statistic:  4689 on 3 and 12791 DF,  p-value: < 2.2e-16
## [1] "Table of fitted values:"
## 
##    0    1    2    3    4    5    6    7 
##  118 2092 3031 2914 2524 1545  529   42

The directions of the coefficients make sense, though again far too few zeroes are fitted.

Model 2 - STARS + LabelAppeal + AcidIndex.simplified

## 
## Call:
## lm(formula = TARGET ~ STARS + AcidIndex.simplified + LabelAppeal, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5022 -0.9080 -0.0359  0.9264  5.2898 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.08368    0.03107   67.06   <2e-16 ***
## STARS                 0.98116    0.01043   94.07   <2e-16 ***
## AcidIndex.simplified -0.40107    0.01612  -24.89   <2e-16 ***
## LabelAppeal           0.42865    0.01367   31.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.325 on 12791 degrees of freedom
## Multiple R-squared:  0.5254, Adjusted R-squared:  0.5253 
## F-statistic:  4721 on 3 and 12791 DF,  p-value: < 2.2e-16
## [1] "Table of fitted values:"
## 
##    0    1    2    3    4    5    6 
##  307 1622 2987 2683 2729 1921  546

Very similar to model 1.

Model selection

Considering that even the zero hurdle models fit very few zeroes, we might as well choose one of the simpler non-hurdle models for parsimony.

Let's choose the slightly simpler Poisson over the negative binomial: Poisson model 1.

AIC of this model is 46,751.
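
With a model selected, predictions for the evaluation wines follow directly (a sketch; the evaluation data were transformed alongside the training data, so the same column names apply):

# Sketch: predicted number of cases for the evaluation data, rounded to whole cases
evaluation_predicted <- round(predict(poisson1, newdata = evaluation, type = "response"))
table(evaluation_predicted)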

Code appendix

Code for the outputs above is shown below.

## ----load-libs, echo=FALSE, eval=TRUE,message=FALSE,warning=FALSE--------
library(ggplot2)
library(stringr)
library(tidyr)
library(dplyr)
library(gridExtra)
library(pscl)
library(MASS)

## ----read-in-data, echo=FALSE, eval=TRUE---------------------------------
training <- read.csv("https://raw.githubusercontent.com/heathergeiger/Data621_hw5/master/wine-training-data.csv",header=TRUE,stringsAsFactors=FALSE)
evaluation <- read.csv("https://raw.githubusercontent.com/heathergeiger/Data621_hw5/master/wine-evaluation-data.csv",header=TRUE,stringsAsFactors=FALSE)

training <- training[,setdiff(colnames(training),"INDEX")]
evaluation <- evaluation[,setdiff(colnames(evaluation),"INDEX")]

## ----num-non-NA-per-column, echo=FALSE, eval=TRUE------------------------
print("Total number of records in data:")
nrow(training)

print("Number of non-NA values per variable in data:")

# Reshape to long format, drop missing values, then count remaining rows per variable
non_NA_per_column <- training %>%
    gather() %>%
    filter(!is.na(value)) %>%
    count(key)

non_NA_per_column <- data.frame(non_NA_per_column)

non_NA_per_column <- non_NA_per_column[order(non_NA_per_column[,2]),]

data.frame(Variable = non_NA_per_column[,1],n = non_NA_per_column[,2])

## ----num-unique-per-column, echo=FALSE, eval=TRUE,message=FALSE,warning=FALSE----
# Count distinct values per variable (NA counts as one distinct value, e.g. for STARS)
num_unique_values_per_variable <- training %>%
    gather() %>%
    group_by(key) %>%
    summarize(uniques = n_distinct(value))

num_unique_values_per_variable <- data.frame(num_unique_values_per_variable,stringsAsFactors=FALSE)

num_unique_values_per_variable <- data.frame(Variable = num_unique_values_per_variable[,1],Num.uniques = num_unique_values_per_variable[,2],stringsAsFactors=FALSE)

num_unique_values_per_variable$Variable <- as.vector(num_unique_values_per_variable$Variable)

num_unique_values_per_variable[order(num_unique_values_per_variable[,2]),]

print("Unique values in AcidIndex:")
sort(unique(training$AcidIndex))

## ----barplot-integer-values, echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE----
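# For each low-cardinality variable, bar plot of the percent of records per level,
# with missing STARS shown as a separate "Not stated" bar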
for(var in c("STARS","LabelAppeal","TARGET","AcidIndex"))
{
frequency_per_level <- data.frame(table(training[,var]))
colnames(frequency_per_level) <- c("value","n")
frequency_per_level$value <- as.vector(frequency_per_level$value)
if(var == "STARS")
{
frequency_per_level <- rbind(frequency_per_level,data.frame(value = "Not stated",n = sum(is.na(training[,var])),stringsAsFactors=FALSE))
frequency_per_level$value <- factor(frequency_per_level$value,levels=c(unique(training[,var])[order(unique(training[,var]))],"Not stated"))
}
if(var != "STARS")
{
frequency_per_level$value <- factor(frequency_per_level$value,levels=unique(training[,var])[order(unique(training[,var]))])
}

print(ggplot(frequency_per_level,
aes(value,n*100/12795)) +
geom_bar(stat = "identity",col="black",fill="darkgrey") +
xlab("") +
ylab("Percent of records") +
ggtitle(var) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)))
}

print("Table of number of records per value in AcidIndex:")
frequency_per_level <- data.frame(table(training[,"AcidIndex"]))
colnames(frequency_per_level) <- c("value","n")
frequency_per_level$value <- as.numeric(as.vector(frequency_per_level$value))
frequency_per_level[order(frequency_per_level$value),]

## ----hist-continuous-values, echo=FALSE, eval=TRUE,fig.width=12,fig.height=8----
par(mfrow=c(4,3))

for(var in setdiff(colnames(training),c("STARS","LabelAppeal","TARGET","AcidIndex")))
{
hist(training[,var],xlab="Values",ylab="Number of records",labels=TRUE,main=var)
}

## ----density-continuous-values, echo=FALSE, eval=TRUE,fig.width=12,fig.height=8----
training[,setdiff(colnames(training),c("STARS","LabelAppeal","TARGET","AcidIndex"))] %>%
gather() %>%
filter(!is.na(value)) %>%
ggplot(.,
aes(x=value)) +
geom_density() +
facet_wrap(~key,scales="free")

## ----correlation-integer-predictors-with-target, echo=FALSE, eval=TRUE----
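# For each integer-valued predictor, compute the percentage distribution of TARGET
# within each predictor level and draw a stacked bar chart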
integer_predictors <- training[,c("STARS","LabelAppeal","TARGET","AcidIndex")]

integer_predictors$TARGET <- plyr::mapvalues(integer_predictors$TARGET,
    from=0:8,
    to=c(as.character(0:6),"7+","7+"))

integer_predictors$AcidIndex <- plyr::mapvalues(integer_predictors$AcidIndex,
    from=4:17,
    to=c("4-5","4-5",as.character(6:11),rep("12+",times=6)))

integer_predictors$STARS <- as.character(integer_predictors$STARS)

integer_predictors$STARS[is.na(integer_predictors$STARS)] <- "Not stated"

mycol <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

for(var in c("STARS","LabelAppeal","AcidIndex"))
{
table_cases_purchased_vs_level <- table(integer_predictors$TARGET,integer_predictors[,var])
table_cases_purchased_vs_level <- spread(data.frame(table_cases_purchased_vs_level),"Var2","Freq")
rownames(table_cases_purchased_vs_level) <- as.vector(table_cases_purchased_vs_level$Var1)
table_cases_purchased_vs_level <- table_cases_purchased_vs_level[,2:ncol(table_cases_purchased_vs_level)]
for(i in 1:ncol(table_cases_purchased_vs_level)){table_cases_purchased_vs_level[,i] <- table_cases_purchased_vs_level[,i]/sum(table_cases_purchased_vs_level[,i])}
table_cases_purchased_vs_level <- table_cases_purchased_vs_level*100
table_cases_purchased_vs_level <- data.frame(gather(table_cases_purchased_vs_level),
    TARGET = rep(rownames(table_cases_purchased_vs_level),times=ncol(table_cases_purchased_vs_level)),
    check.names=FALSE,stringsAsFactors=FALSE)
table_cases_purchased_vs_level$TARGET <- factor(table_cases_purchased_vs_level$TARGET,levels=rev(c("0","1","2","3","4","5","6","7+")))
if(var == "AcidIndex"){table_cases_purchased_vs_level$key <- factor(table_cases_purchased_vs_level$key,levels=c("4-5","6","7","8","9","10","11","12+"))}
if(var == "LabelAppeal"){table_cases_purchased_vs_level$key <- factor(table_cases_purchased_vs_level$key,levels=c("-2","-1","0","1","2"))}
print(ggplot(table_cases_purchased_vs_level,
aes(key,value,fill=TARGET)) +
geom_bar(stat = "identity") +
xlab(var) +
ylab("Percent of records") +
scale_fill_manual(values=c(mycol[2:8],"grey")))
}

# Repeat for AcidIndex only, excluding TARGET = 0, to look for trends
# among wines with at least one case purchased
var <- "AcidIndex"
table_cases_purchased_vs_level <- table(integer_predictors$TARGET,integer_predictors[,var])
table_cases_purchased_vs_level <- spread(data.frame(table_cases_purchased_vs_level),"Var2","Freq")
rownames(table_cases_purchased_vs_level) <- as.vector(table_cases_purchased_vs_level$Var1)
table_cases_purchased_vs_level <- table_cases_purchased_vs_level[2:nrow(table_cases_purchased_vs_level),2:ncol(table_cases_purchased_vs_level)]
for(i in 1:ncol(table_cases_purchased_vs_level)){table_cases_purchased_vs_level[,i] <- table_cases_purchased_vs_level[,i]/sum(table_cases_purchased_vs_level[,i])}
table_cases_purchased_vs_level <- table_cases_purchased_vs_level*100
table_cases_purchased_vs_level <- data.frame(gather(table_cases_purchased_vs_level),
        TARGET = rep(rownames(table_cases_purchased_vs_level),times=ncol(table_cases_purchased_vs_level)),
        check.names=FALSE,stringsAsFactors=FALSE)
table_cases_purchased_vs_level$TARGET <- factor(table_cases_purchased_vs_level$TARGET,levels=rev(c("0","1","2","3","4","5","6","7+")))
table_cases_purchased_vs_level$key <- factor(table_cases_purchased_vs_level$key,levels=c("4-5","6","7","8","9","10","11","12+"))
ggplot(table_cases_purchased_vs_level,
aes(key,value,fill=TARGET)) +
geom_bar(stat = "identity") +
xlab(var) +
ylab("Percent of records with TARGET >= 1") +
scale_fill_manual(values=c(mycol[2:8],"grey"))

## ----logitp-stars, echo=FALSE, eval=TRUE---------------------------------
stars_vs_target <- training[,c("STARS","TARGET")]
stars_vs_target$TARGET <- ifelse(stars_vs_target$TARGET == 0,0,1)

stars_vs_target$STARS[is.na(stars_vs_target$STARS)] <- 0

probabilities_wine_purchased <- c()

for(i in 0:4)
{
prob <- length(which(stars_vs_target$STARS == i & stars_vs_target$TARGET == 1))/length(which(stars_vs_target$STARS == i))
probabilities_wine_purchased <- c(probabilities_wine_purchased,prob)
}

logit_probabilities_wine_purchased <- c()

for(prob in probabilities_wine_purchased)
{
logit_p <- log(prob/(1 - prob))
logit_probabilities_wine_purchased <- c(logit_probabilities_wine_purchased,logit_p)
}

plot(0:2,logit_probabilities_wine_purchased[1:3],
xlab="STARS (0 = not reviewed)",
ylab="logit(p) for probability of 1+ cases purchased",
type="o")

## ----logitp-stars-last-level-2plus, echo=FALSE, eval=TRUE----------------
probabilities_wine_purchased[3] <- length(which(stars_vs_target$STARS >= 2 & stars_vs_target$TARGET == 1))/length(which(stars_vs_target$STARS >= 2))

logit_probabilities_wine_purchased[3] <- log(probabilities_wine_purchased[3]/(1 - probabilities_wine_purchased[3]))

plot(0:2,logit_probabilities_wine_purchased[1:3],
xlab="STARS (0 = not reviewed, 2 = 2+)",
ylab="logit(p) for probability of 1+ cases purchased",
type="o")

## ----logitp-acid-index, echo=FALSE, eval=TRUE----------------------------
acid_index_vs_target <- integer_predictors[,c("AcidIndex","TARGET")]
acid_index_vs_target$TARGET <- ifelse(acid_index_vs_target$TARGET == 0,0,1)

probabilities_wine_purchased <- c()

for(index in c("4-5","6","7","8","9","10","11","12+"))
{
prob <- length(which(acid_index_vs_target$AcidIndex == index & acid_index_vs_target$TARGET == 1))/length(which(acid_index_vs_target$AcidIndex == index))
probabilities_wine_purchased <- c(probabilities_wine_purchased,prob)
}

logit_probabilities_wine_purchased <- c()

for(prob in probabilities_wine_purchased)
{
logit_p <- log(prob/(1 - prob))
logit_probabilities_wine_purchased <- c(logit_probabilities_wine_purchased,logit_p)
}

plot(5:12,logit_probabilities_wine_purchased,
xlab="AcidIndex (5 = 4-5, 12 = 12+)",
ylab="logit(p) for probability of 1+ cases purchased",
type="o")

## ----logitp-acid-index-transformed, echo=FALSE, eval=TRUE----------------
acid_index_vs_target <- integer_predictors[,c("AcidIndex","TARGET")]
acid_index_vs_target$TARGET <- ifelse(acid_index_vs_target$TARGET == 0,0,1)
acid_index_vs_target$AcidIndex[acid_index_vs_target$AcidIndex == "4-5" | acid_index_vs_target$AcidIndex == "6" | acid_index_vs_target$AcidIndex == "7" | acid_index_vs_target$AcidIndex == "8"] <- 1
acid_index_vs_target$AcidIndex[acid_index_vs_target$AcidIndex == "9"] <- 2
acid_index_vs_target$AcidIndex[acid_index_vs_target$AcidIndex == "10"] <- 3
acid_index_vs_target$AcidIndex[acid_index_vs_target$AcidIndex == "11" | acid_index_vs_target$AcidIndex == "12+"] <- 4

probabilities_wine_purchased <- c()

for(index in 1:4)
{
prob <- length(which(acid_index_vs_target$AcidIndex == index & acid_index_vs_target$TARGET == 1))/length(which(acid_index_vs_target$AcidIndex == index))
probabilities_wine_purchased <- c(probabilities_wine_purchased,prob)
}

logit_probabilities_wine_purchased <- c()

for(prob in probabilities_wine_purchased)
{
logit_p <- log(prob/(1 - prob))
logit_probabilities_wine_purchased <- c(logit_probabilities_wine_purchased,logit_p)
}

plot(1:4,logit_probabilities_wine_purchased,
xlab="AcidIndex (simplified)",
ylab="logit(p) for probability of 1+ cases purchased",
type="o")

## ----data-transformation, echo=FALSE, eval=TRUE--------------------------
training <- training[,c("TARGET","STARS","LabelAppeal","AcidIndex")]
evaluation <- evaluation[,c("TARGET","STARS","LabelAppeal","AcidIndex")]

training$TARGET <- ifelse(training$TARGET == 8,7,training$TARGET)

training$STARS[is.na(training$STARS)] <- 0
evaluation$STARS[is.na(evaluation$STARS)] <- 0

training$AcidIndex[training$AcidIndex == 4] <- 5
training$AcidIndex[training$AcidIndex > 12] <- 12

evaluation$AcidIndex[evaluation$AcidIndex == 4] <- 5
evaluation$AcidIndex[evaluation$AcidIndex > 12] <- 12

training_acid_index_simplified <- training$AcidIndex
evaluation_acid_index_simplified <- evaluation$AcidIndex

training_acid_index_simplified[training_acid_index_simplified <= 8] <- 8
training_acid_index_simplified[training_acid_index_simplified >= 11] <- 11

evaluation_acid_index_simplified[evaluation_acid_index_simplified <= 8] <- 8
evaluation_acid_index_simplified[evaluation_acid_index_simplified >= 11] <- 11

training <- data.frame(training,
    STARS.factor = ifelse(training$STARS >= 2,2,training$STARS),
    AcidIndex.simplified = training_acid_index_simplified - 7,
    TARGET.binary = ifelse(training$TARGET == 0,0,1),
    stringsAsFactors=FALSE)

evaluation <- data.frame(evaluation,
    STARS.factor = ifelse(evaluation$STARS >= 2,2,evaluation$STARS),
    AcidIndex.simplified = evaluation_acid_index_simplified - 7,
    TARGET.binary = ifelse(evaluation$TARGET == 0,0,1),
    stringsAsFactors=FALSE)


#print("Table per variable in training:")
#apply(training,2,table)
#print("Table per variable in evaluation:")
#apply(evaluation,2,table)

## ----poisson1, echo=FALSE, eval=TRUE-------------------------------------
poisson1 <- glm(TARGET ~ STARS + LabelAppeal + AcidIndex,data=training,family="poisson")

summary(poisson1)

## ----poisson2, echo=FALSE, eval=TRUE-------------------------------------
poisson2 <- hurdle(TARGET ~ STARS + LabelAppeal | STARS.factor + AcidIndex.simplified,data=training)

summary(poisson2)

## ----dist-poisson-fitted, echo=FALSE, eval=TRUE--------------------------
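# Compare the distribution of rounded fitted values from each Poisson model
# with the distribution of actual TARGET values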
poisson_fitted_values <- data.frame(Model1 = poisson1$fitted.values,
    Model2 = poisson2$fitted.values,
    Real = training$TARGET)

poisson_fitted_values <- gather(round(poisson_fitted_values))

poisson_fitted_values <- poisson_fitted_values %>% group_by(key) %>% count(value)

poisson_fitted_values <- data.frame(poisson_fitted_values)

poisson_fitted_values$value <- factor(poisson_fitted_values$value,levels=rev(0:9))

ggplot(poisson_fitted_values,
aes(key,n,fill=value)) +
geom_bar(stat="identity") +
scale_fill_manual(values = rev(c("grey",mycol[2:8],"black","black"))) +
xlab("") +
ylab("Number of records with fitted value") +
ggtitle("Poisson models")

## ----poisson2-examine-fitted, echo=FALSE, eval=TRUE----------------------
hist(poisson2$fitted.values[training$TARGET == 0],
xlab="Poisson model 2 fitted values",
ylab="Number of records",
main="When TARGET = 0")

print("TARGET actual values when Poisson model 2 fitted = 1:")
table(training$TARGET[round(poisson2$fitted.values) == 1])

## ----negbinom1, echo=FALSE, eval=TRUE,message=FALSE,warning=FALSE--------
negbinom1 <- glm.nb(TARGET ~ STARS + AcidIndex + LabelAppeal,data=training)

summary(negbinom1)

## ----negbinom2, echo=FALSE, eval=TRUE------------------------------------
negbinom2 <- hurdle(TARGET ~ STARS + LabelAppeal | STARS.factor + AcidIndex.simplified,data=training,dist = "negbin")

summary(negbinom2)

## ----regular-linear, echo=FALSE, eval=TRUE-------------------------------
regular_linear1 <- lm(TARGET ~ STARS + AcidIndex + LabelAppeal,data=training)

summary(regular_linear1)

print("Table of fitted values:")
table(round(regular_linear1$fitted.values))

## ----regular-linear-2, echo=FALSE, eval=TRUE-----------------------------
regular_linear2 <- lm(TARGET ~ STARS + AcidIndex.simplified + LabelAppeal,data=training)

summary(regular_linear2)

print("Table of fitted values:")
table(round(regular_linear2$fitted.values))