df <- read.csv("/Users/teddykelly/Downloads/wine-data-1-1.csv")Predicting the Count of Sample Wine Cases Purchased from Distributors
Motivation
Wine tasting is a very competitive industry with lots of money involved. Knowing which factors and chemical properties contribute the most to wine being purchased is very valuable information for wine companies to understand. In this report, I will construct 6 different regression models that will each predict how many sample cases of wine are purchased based on their chemical properties, judge’s scores of how they taste, and public perception of wine labels. Specifically, I will build two linear regressions, 2 Poisson count regressions, and two negative binomial count regressions and select the best one to make these predictions.
Setup
Load in the total data set for the wine data.
Below is a table of all of the variables included in the wine dataset with an explanation of what each of them mean. Note that I have renamed the variable TARGET to cases_purchased and the variable STARS to wine_rating in the dataset for more intuitive interpretations of them.
| Variable Names | Meaning |
|---|---|
| Cases Purchased | Number of cases purchased |
| Fixed Acidity | Fixed acidity of wine |
| Volatile Acidity | Volatile acid content of wine |
| Citric Acid | Citric acid Content |
| Residuals Sugar | Residual sugar of wine |
| Chlorides | Chloride content of wine |
| Free Sulfur Dioxide | Sulfur dioxide content of wine |
| Density | Density of wine |
| pH | pH level of wine |
| Sulphates | Sulphate content of wine |
| Alcohol | Alcohol content of the wine |
| Label Appeal | Marketing score that indicates the appeal of the label design for consumers. |
| Acid Index | Total acidity of wine using a weighted average |
| Wine Rating | Wine rating by a team of experts
|
I will first rename all of the variables to make it easier to deduce what each variable means. These new variable names will match exactly what the names are in the variable name and meaning table above.
df <- df |> transmute(
cases_purchased = TARGET,
fixed_acidity = FixedAcidity,
volatile_acidity = VolatileAcidity,
citric_acid = CitricAcid,
residual_sugar = ResidualSugar,
chlorides = Chlorides,
free_sulfur_dioxide = FreeSulfurDioxide,
density = Density,
ph = pH,
sulphates = Sulphates,
alcohol = Alcohol,
label_appeal = LabelAppeal,
acid_index = AcidIndex,
wine_rating = STARS
)Below, I have split the wine dataset into 80% of the data for training the model and 20% for testing the model.
# Creating the evaluation data set using random seeds
set.seed(123)
# Generate a vector of indices for the training set
train_index <- sample(x = nrow(df), size = round(0.8*nrow(df)))
# Create the training and testing sets
train_df <- df[train_index,]
test_df <- df[-train_index,]Now that we have our training and testing datasets, it’s time to do some explanatory data analysis on the training data.
1 Data Exploration
1.1 Summary Statistics
labels <- c(
"Cases Purchased",
"Fixed Acidity",
"volatile Acidity",
"Citric Acid",
"Residual Sugar",
"Chlorides",
"Free Sulfur Dioxide",
"Density",
"pH Level",
"Sulphates",
"Alcohol Content",
"Label Appeal",
"Acid Index",
"Wine Rating"
)
stargazer(train_df, type = "text", covariate.labels = labels, median = T,
title = "Initial Summary Statistics")
Initial Summary Statistics
==================================================================
Statistic N Mean St. Dev. Min Median Max
------------------------------------------------------------------
Cases Purchased 10,236 3.026 1.927 0 3 8
Fixed Acidity 10,236 7.090 6.333 -18.000 6.900 34.400
volatile Acidity 10,236 0.327 0.784 -2.790 0.280 3.680
Citric Acid 10,236 0.309 0.863 -3.240 0.310 3.860
Residual Sugar 9,742 5.275 33.941 -127.800 3.900 141.150
Chlorides 9,711 0.055 0.319 -1.170 0.046 1.351
Free Sulfur Dioxide 9,697 30.899 147.309 -546.000 30.000 622.000
Density 10,236 0.994 0.027 0.889 0.995 1.099
pH Level 9,915 3.212 0.675 0.480 3.200 6.050
Sulphates 9,276 0.529 0.940 -3.130 0.500 4.240
Alcohol Content 9,717 10.497 3.717 -4.700 10.400 26.100
Label Appeal 10,236 -0.011 0.895 -2 0 2
Acid Index 10,236 7.781 1.339 4 8 17
Wine Rating 7,548 2.045 0.905 1 2 4
------------------------------------------------------------------
Summary Statistics Observations
Missing Values
There are several different variables with missing values.
residual sugar
chlorides
free sulfur dioxide
pH level
sulphates
alcohol
wine rating
We will address these missing values in the data preparation section either through median imputation or deleting the missing observations.
Dependent Variable
Cases Purchased
This is the main response variable that represents the number of sample cases that were purchased by the wine distribution companies.
The mean value of 3 represents that on average, the wine distribution companies purchased about 3 sample cases of wine for each commercially available wine in the training dataset.
Predictor Variables
It is unclear as to which of the chemical properties of the wine will be most important in predicting how many cases of the wine will be purchased after testing. For this reason, I have summarized the results from the summary statistics for all of the results.
Fixed Acidity
- The average fixed acidity is about 7 with a relatively high standard deviation compared to the rest of the chemical properties.
Volatile Acidity
- The mean volatile acidity is roughly 0.327 and the median is about 0.28 which indicates a slight skewness to the right considering the small scale.
Citric Acid
- The mean and median citric acid content are almost identical at about 0.31 with a considerably low spread.
Residual Sugar
The mean and median are 5.28 and 3.9 respectively and the standard deviation is about 34 which is very high compared to the rest of the chemical properties.
The sugar content in the wines varies considerably more than for the other chemical properties.
Chlorides
- The mean and median values are very low around zero with a small standard deviation. The chloride levels generally do not change much between the wines which are tested.
Free Sulfur Dioxide
- The mean and median values are very close at around 30 which is much higher than the rest and this variable has by far the most dispersed data with a standard deviation of 147.
Density
- The mean and median density levels are almost identical at 0.99 and density appears to be very clustered together.
pH Level
- Mean and median pH levels are the same at about 3.2 which indicates a symmetric distribution.
Sulphates
- The mean is slightly greater than the median, indicating slight right skewness.
Alcohol Content
- The mean and median alcohol levels are practically identical at about 10.4, but the max and min values are fairly far from those average values, indicating the data points are generally dispersed around the mean.
Label Appeal
- Negative Scores suggest customers don’t like the label design and positive scores suggest customers do like the label design. The mean and median values close to zero represent that people equally liked and disliked the wine labels.
Acid Index
- The average acid index is around 8 with a low value of 4 and a high value of 17.
Wine Rating
- Perhaps the most important predictor of the sample wines bought, the mean and median wine ratings are around 2. The low and high values of 1 and 4 confirm the lowest and highest possible wine ratings.
1.2 Histograms and Bar Graphs
After building a quantitative understanding of the training dataset using the summary statistics, the next step is to understand the data visually. Our first goal is to visualize the distribution of the data using histograms and bar graphs for continuous and discrete variables respectively.
First, let’s visualize the distribution of the continuous variables. I have first created a separate data frame only containing those variables and melted them to plot their histograms all together.
train_continuous <- train_df |>
dplyr::select(fixed_acidity, volatile_acidity, citric_acid,
residual_sugar, chlorides, free_sulfur_dioxide,
density, ph, sulphates, alcohol)
continuous_melted <- reshape2::melt(data = train_continuous)
continuous_labels <- c(
fixed_acidity = "Fixed Acidity",
volatile_acidity = "Volatile Acidity",
citric_acid = "Citric Acid",
residual_sugar = "Residual Sugar",
chlorides = "Chlorides",
free_sulfur_dioxide = "Free Sulfur Dioxide",
density = "Density",
ph = "pH",
sulphates = "Sulphates",
alcohol = "Alcohol"
)
ggplot(data = continuous_melted,
mapping = aes(x = value)) +
geom_histogram(fill = "blue") +
facet_wrap(~variable, scales = "free_x",
labeller = labeller(variable = continuous_labels)) Histogram Analysis
It appears that the distributions of all of the continuous variables are symmetric
Because of the large number of observations, the distribution appears strange because most of the observations assume the mean/median value of the corresponding variable measure.
Now, it is time to evaluate the distribution of the discrete numerical variables.
train_discrete <- train_df |> dplyr::select(cases_purchased,
label_appeal,acid_index,
wine_rating)
discrete_melted <- reshape2::melt(data = train_discrete)
discrete_labels <- c(
cases_purchased = "Cases Purchased",
label_appeal = "Label Appeal",
acid_index = "Acid Index",
wine_rating = "Wine Rating"
)
ggplot(data = discrete_melted,
mapping = aes(x = value)) +
geom_bar(fill = "blue") +
facet_wrap(~variable, scales = "free_x",
labeller = labeller(variable = discrete_labels))Bar Chart Analysis
cases_purchased- The frequency with the most amount of sample wine cases purchased is 4. However, there were lots of sample wines that did not have any cases purchased which is why the mean and median sample cases purchased are around 3.label_appeal- Most customers appear to be indifferent to the wine labels with the highest frequency being a marketing score of zero. It also appears that an equal number of customers gave positive label ratings and negative label ratings.acid_idex- The two highest frequencies of the acid index are 7 and 8 with 7 being the highest.wine_rating- This is the rating system used by the team of experts with scores ranging from 1 to 4. The rating that receives the most votes for the judges is 2 stars with the majority of the experts rating the wines in the training data as either 1 or 2 stars.
1.3 Correlation Matrix
We will now conduct a preliminary investigation into whether the dependent variable, cases_purchased is correlated to any of the independent variables before running any official regressions. We will use a correlation matrix to give a visual representation of how the variables are correlated with each other.
Since there are missing observations for several of the variables, creating a correlation table without addressing those missing values will result in lots of NA values in the table.
Therefore, I have temporarily imputed the NA values with the median values of the corresponding variable names.
train_temp <- train_dftrain_temp$residual_sugar[is.na(train_temp$residual_sugar)] <-
median(train_temp$residual_sugar, na.rm = T)
train_temp$chlorides[is.na(train_temp$chlorides)] <-
median(train_temp$chlorides, na.rm = T)
train_temp$free_sulfur_dioxide[is.na(train_temp$free_sulfur_dioxide)] <-
median(train_temp$free_sulfur_dioxide, na.rm = T)
train_temp$ph[is.na(train_temp$ph)] <-
median(train_temp$ph, na.rm = T)
train_temp$sulphates[is.na(train_temp$sulphates)] <-
median(train_temp$sulphates, na.rm = T)
train_temp$alcohol[is.na(train_temp$alcohol)] <-
median(train_temp$alcohol, na.rm = T)
train_temp$wine_rating[is.na(train_temp$wine_rating)] <-
median(train_temp$wine_rating, na.rm = T)Now with no missing values, we can go ahead and create our correlation matrix to evaluate the initial relationships between the variables.
corr_table <- cor(x = train_temp)
colnames(corr_table) <- labels
rownames(corr_table) <- labels
corr_mat <- ggcorrplot(corr = corr_table,
method = c("square"),
type = c("upper"),
title = "Correlation Matrix of Wine Variables",
colors = c("red", "white", "green"),
lab = TRUE,
lab_size = 1,
insig = "pch",
tl.cex = 8
)
corr_matCorrelation Matrix Analysis
There is not much correlation between most of the variables except for a select few. Below I have analyzed the relationships between positively and negatively correlated variables.
Positive Correlation
Wine Rating and Cases Purchased
Positively correlated with a correlation of 0.4. This is the strongest positive correlation between any two variables.
Indication that high wine ratings are associated with more purchases of sample cases and low wine ratings with less purchases.
Label Appeal and Cases Purchased
Positively correlated with a value of 0.36. The 2nd strongest correlation with how many cases were sold.
Indicates that customers being more receptive to wine labels results in more sample cases being purchased.
Acid Index and Fixed Acidity
Positively correlated with a value of 0.18.
This makes sense because we would expect that wines with a high wine index would also have a high fixed acidity index.
Including both predictors in the final count regression may lead to strong multicollinearity.
Wine Rating and Label Appeal
Positively correlated with a correlation of 0.28. This is the highest positive correlation among the independent variables which makes sense because we’d expect that wines with better labels also taste better.
There will likely be some multicollinearity if both are included in the final count regression, however, since both predictors are strongly correlated with the response variable, it may be necessary to include both.
Negative Correlation
Acid Index and Cases Purchased
Negatively correlated with a correlation of -0.24 which is by far the strongest negative correlation between any of the variables.
Indication that wines with a high acid index are correlated with less sample cases of those wines being purchased.
2 Data Preparation
The next step in our data analysis is to clean the training data so it’s suitable to build our count regressions which will be used to predict how many sample cases will be purchased. This data cleaning involves addressing any missing values and/or outlier values that may cause bias in our analysis.
2.1 Missing Values
Below, I have visualized all of the missing values in our data.
plot_missing(train_df, missing_only = T)About half of the variables contain missing values as we noted previously when analyzing the summary statistics of the training data.
The variables indicated in yellow have missing values in at least 5% of the observations in the training data which may be problematic. What’s most concerning is that wine_rating has over 26% missing values and it is the most correlated with cases_purchased according to the correlation matrix.
Imputing about 26% of the observations with the median wine rating score will likely not be the best course of action because wine_rating is a discrete variable with a small range of possible values from 1 to 4. Therefore, imputing the missing values with the median wine rating will result in just over 26% of the wine ratings automatically receiving a rating of 2 stars (the median wine rating) which will certainly skew the data.
For this reason, I will delete all observations that do not have a wine rating. This will still result in having over 7,000 observations which is plenty of data to analyze. However, because omitting all the observations with missing values would reduce the training data by 50%, I will impute the variables that have less missing values as the bias resulting from these imputations will be minimal and none of them had any significant correlation with the response variable cases_purchased. This will ensure that no observations that do have a wine rating do not get removed from the data since wine rating appears to be the most important predictor of sample cases purchased.
train_df <- train_df[!is.na(train_df$wine_rating),]Now that we have removed all of the missing observations for the wine rating column in the data frame, it’s time to impute the remaining missing values for the other variables.
train_df$residual_sugar[is.na(train_df$residual_sugar)] <-
median(train_df$residual_sugar, na.rm = T)
train_df$chlorides[is.na(train_df$chlorides)] <-
median(train_df$chlorides, na.rm = T)
train_df$free_sulfur_dioxide[is.na(train_df$free_sulfur_dioxide)] <-
median(train_df$free_sulfur_dioxide, na.rm = T)
train_df$ph[is.na(train_df$ph)] <-
median(train_df$ph, na.rm = T)
train_df$sulphates[is.na(train_df$sulphates)] <-
median(train_df$sulphates, na.rm = T)
train_df$alcohol[is.na(train_df$alcohol)] <-
median(train_df$alcohol, na.rm = T)There are no more missing observations left in the data which will enable us to perform count regressions on the training data without any issues. However, there may still be outliers present in the data which could potentially bias the data.
2.2 Outliers
One method I normally use to reduce the influence of outliers is to employ a log transformation on continuous variables. Normally, this will reduce the level of skeweness caused by outliers and result in a distribution that more closely resembles a normal distribution.
However, as we saw with the histograms, none of the continuous variables appear to have any skewness, and more importantly, they take on negative values. This means that applying a log transformation will result in undefined values in the log transformed variables which will cause major problems when running our regressions.
Therefore, I did not perform any variable transformations, as I did not want to disturb the data too much since there are many negative values for the chemical properties of the wines. Also, I did not combine any of the chemical properties together because I do not know enough about wine properties to accurately combine the correct properties to create meaningful variables for the regressions.
# Box plot
ggplot(data = continuous_melted,
mapping = aes(x = factor(variable), y = value))+
geom_boxplot()+
facet_wrap(facets = ~variable, scale = "free",
labeller = labeller(variable = continuous_labels))# Fixed Acidity
# Detect & impute outliers for all numeric variables in train_df
train_df_imputed <- train_df # copy so you keep the original
numeric_vars <- names(train_df)[sapply(train_df, is.numeric)]
outlier_summary <- data.frame(
variable = numeric_vars,
n_outliers = NA_integer_, row.names = labels
)
for (v in numeric_vars) {
x <- train_df[[v]]
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
# Boolean vector of which values are outliers
outliers_bool <- x < lower_bound | x > upper_bound
# Store count in summary table
outlier_summary[outlier_summary$variable == v, "n_outliers"] <-
sum(outliers_bool, na.rm = TRUE)
}
# Show outlier counts by variable
outlier_summary$variable <- rownames(outlier_summary)
rownames(outlier_summary) <- NULL
outlier_summary variable n_outliers
1 Cases Purchased 0
2 Fixed Acidity 1536
3 volatile Acidity 1643
4 Citric Acid 1656
5 Residual Sugar 2348
6 Chlorides 2646
7 Free Sulfur Dioxide 2398
8 Density 2330
9 pH Level 1437
10 Sulphates 2198
11 Alcohol Content 650
12 Label Appeal 0
13 Acid Index 487
14 Wine Rating 0
Looking at the table above, there are some outliers in all of the continuous numerical variables. However, there are none for any of the discrete numerical variables, CasesPurchased, LabelAppeal, and WineRating which are the most important variables in the dataset. Therefore, I have decided not to delete or impute any of the continuous variables to avoid bias.
2.3 Updated Summary Statistics
Below is an updated version of the summary statistics of the training data after addressing all of the missing values. As we can see, all of the variables in the data have exactly 7,548 observations.
stargazer(train_df, type = "text", title = "Update Summary Statistics",
omit.summary.stat = "N", covariate.labels = labels,
median = T, notes = "N = 7,548")
Update Summary Statistics
===========================================================
Statistic Mean St. Dev. Min Median Max
-----------------------------------------------------------
Cases Purchased 3.685 1.556 0 4 8
Fixed Acidity 6.923 6.261 -18.000 6.900 32.500
volatile Acidity 0.301 0.781 -2.745 0.270 3.680
Citric Acid 0.309 0.860 -3.160 0.310 3.770
Residual Sugar 5.374 32.838 -127.800 4.500 141.150
Chlorides 0.048 0.308 -1.170 0.045 1.270
Free Sulfur Dioxide 34.064 143.892 -546.000 33.000 622.000
Density 0.994 0.027 0.889 0.994 1.099
pH Level 3.208 0.663 0.480 3.190 6.050
Sulphates 0.508 0.889 -3.130 0.490 4.210
Alcohol Content 10.553 3.644 -4.500 10.500 26.100
Label Appeal 0.044 0.884 -2 0 2
Acid Index 7.647 1.198 5 7 17
Wine Rating 2.045 0.905 1 2 4
-----------------------------------------------------------
N = 7,548
3 Build Models
The next step in our data analysis is to build regression models that will accurately make predictions for how many sample cases of wine were purchased using the variables in the training data.
Since the dependent variable we are attempting to predict, cases_purchased, is a discrete variable, we will use count regressions to make these predictions. Specifically, there are two types of count regressions, negative binomial and passion regressions, that we will use. We will construct two regressions for each variation and then compare the results of the regressions to each other. Furthermore, I will also build two multiple linear regression models to compare to the more accurate count regressions.
3.1 Multiple Linear Regressions
We will first construct two multivariate linear regressions that will attempt to predict the number of sample cases of wine purchased for each observation. Note that this is not the ideal model to use since the dependent variable, CasesPurchased, is not a continuous variable, but rather a discrete variable. The goal of these regressions is to simply establish a base-line for our regression results which we will then use to compare to the more suitable count regressions.
3.1.1 Multiple Linear Regression 1
The first linear regression I ran contains every variable in the dataset. Below is the estimating equation:
\[ \begin{aligned} CasesPurchased_i =\;& \beta_0+\beta_1\cdot FixedAcidity_i+\beta_2\cdot VolatileAcidity+\beta_3\cdot CitricAcid_i+ \\ &\beta_4\cdot ResidualSugar_i+\beta_5\cdot Chlorides_i+\beta_6\cdot FreeSulfurDioxide_i+\\&\beta_7\cdot Density_i+\beta_8\cdot ph_i+\beta_9\cdot Sulphates_i+\beta_{10}\cdot Alcohol_i+\\&\beta_{11}\cdot LabelAppeal_i+\beta_{12}\cdot AcidIndex_i+\beta_{13}\cdot WineRating_i +u_i \end{aligned} \]
lin_reg1 <- lm(data = train_df,
formula = cases_purchased ~ .)Variable Selection Process
I do not have any prior knowledge about which chemical properties most significantly determine the quality of wine, so I thought it would be best to run a preliminary regression containing every variable in the dataset to get an idea of which chemical properties most influence the number of sample cases purchased.
From our explanatory data analysis, we found that the following three discrete variables have the strongest correlation with
CasesPurchased:LabelAppeal: Positive correlation, so there should be a strong positive effect on the number of cases purchased in the regression.AcidIndex: Negative correlation, so acid index should have a negative effect on cases purchased in the regression.WineRating: Similar to label appeal, this also has a strong positive correlation and should have a positive effect on the number of sample cases purchased in the regression.
We will observe the regression results for linear regression 1 in the coefficient interpretation model.
To obtain the second multivariate linear regression, I will use backward selection using the
stepAICcommand which will select the best model based on the lowest Akaike Information Criterion (AIC).We should expect to see
LabelAppeal,AcidIndex, andWineRatingall remain in the second model since they have the highest correlation with the dependent variable.
Below, I have run the stepAIC command on the first linear regression to obtain the second regression using backward variable selection.
stepAIC(object = lin_reg1,
direction = "backward")3.1.2 Multiple Linear Regression 2
Based on the results from performing backward selection, the estimating equation with the lowest AIC is the following:
\[ \begin{aligned} CasesPurchased_i = \;& \beta_0 +\beta_1\cdot VolatileAcidity_i+\beta_2\cdot chlorides_i+ \\& \beta_3\cdot FreeSulfurDioxide_i+\beta_4\cdot density_i+\\ & \beta_5\cdot alcohol_i+\beta_6\cdot LabelAppeal_i+\beta_7\cdot AcidIndex_i+\\&\beta_8\cdot WineRating_i + u_i \end{aligned} \]
lin_reg2 <- lm(formula = cases_purchased ~ volatile_acidity + chlorides +
free_sulfur_dioxide + density + alcohol + label_appeal +
acid_index + wine_rating, data = train_df)We can see that the second model is shorter than the first and that all three of the main regressors from the first regression have remained. (label appeal, acid index, and wine rating)
We will compare the regression results of these two linear models in the linear coefficient interpretations section and evaluate the reliability of their models by performing residual analysis, heteroskedasticity, and multicollinearity tests on them both.
We will now shift our attention to constructing the count regression models.
3.2 Poisson Count Regressions
Now that we have our preliminary multiple linear regressions, we can now begin to build our count regressions to accurately predict the number of sample cases purchased for each observation. We will start by building two Poisson regressions.
For Poisson regressions, the key assumption is that the conditional mean and variance of the dependent variable (number of sample cases purchased are the same.
3.2.1 Poisson Regression 1
The first Poisson count regression will contain all of the variables in the dataset, and its estimating equation is seen below:
\[ \begin{aligned} log(\mu_i)=\;& \beta_0+\beta_1\cdot FixedAcidity_i+\beta_2\cdot VolatileAcidity+\beta_3\cdot CitricAcid_i+ \\ &\beta_4\cdot ResidualSugar_i+\beta_5\cdot Chlorides_i+\beta_6\cdot FreeSulfurDioxide_i+\\&\beta_7\cdot Density_i+\beta_8\cdot ph_i+\beta_9\cdot Sulphates_i+\beta_{10}\cdot Alcohol_i+\\&\beta_{11}\cdot LabelAppeal_i+\beta_{12}\cdot AcidIndex_i+\beta_{13}\cdot WineRating_i \end{aligned} \]
where \(\mu_i\) is the mean count of the number of sample cases purchased.
poisson_reg1 <- glm(data = train_df,
formula = cases_purchased ~.,
family = poisson(link = "log"))Variable Selection Process
Similar to the process used for the linear models, the first Poisson regression incorporates all of the chemical property variables.
Despite the fact that the same exact variables are used here as in the first linear model, this regression model will yield different results because of the inherently different structure.
Notice the \(log(\mu_i)\) in the estimating equation. This log transformation ensures that the mean count of the number of cases purchased is always positive. This then influences how we interpret this model’s coefficient estimates because Poisson regressions follow a log-linear structure.
We will also incorporate backward selection to select the best possible Poisson model based on the AIC value. The second Poisson model will likely use different variables than were used in the second linear model because the first linear and Poisson models are structurally different even though they use the same variables.
StepAIC
stepAIC(object = poisson_reg1,
direction = "backward")3.2.2 Poisson Regression 2
\[ \begin{aligned} log(\mu_i)=\;& \beta_0+\beta_1\cdot VolatileAcidity+\beta_2\cdot Chlorides_i+\beta_3\cdot FreeSulfurDioxide_i+\\&\beta_4\cdot Alcohol_i+\beta_5\cdot LabelAppeal_i+\beta_6\cdot AcidIndex_i+\\&\beta_7\cdot WineRating_i \end{aligned} \]
poisson_reg2 <- glm(formula = cases_purchased ~ volatile_acidity + chlorides +
free_sulfur_dioxide + alcohol + label_appeal + acid_index +
wine_rating, family = poisson(link = "log"), data = train_df)The second Poisson model contains 7 of the original 13 variables from the first model.
Notice that label appeal, acid index, and wine rating are all still included with the second model even after backward selection. Just as with the linear model, these three variables should have the greatest effect on the count of sample cases purchased.
We will compare the results of these Poisson models with another type of count model, the negative binomial count model.
3.3 Negative Binomial Count Regressions
The next step is to build two negative binomial regressions which will handle any over-dispersion (variance is greater than the mean) from the data and compare the results with the Poisson regressions.
3.3.1 Negative Binomial Regression 1
Regression 1
The first negative binomial regression will include all of the variables that are included in the wine dataset. The estimating equation is exactly as seen with the Poisson model 1, expect the negative binomial model takes into account for overdispersion where the variance of the count of sample wines purchased may be greater than the mean count:
\[ \begin{aligned} Var(CasesPurchased_i) = \mu_i+\alpha\cdot \mu_i^2 \end{aligned} \]
where \(\alpha\) is a constant. If \(\alpha>0\), then the variance will be greater than the mean and there is overdispersion. However, if \(\alpha=0\), then the \(\alpha\cdot \mu_i^2\) term will cancel out and the variance will be precisely the mean, resulting in the original Poisson model.
nb_reg1 <- glm.nb(data = train_df,
formula = cases_purchased ~.)Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
control$trace > : iteration limit reached
Variable Selection Process
The variables included in the first negative binomial model are precisely the same as the variables for the first linear and Poisson models.
If there is no overdispersion in the data (the conditional mean and variance of the number of sample cases purchased is the same), then the results for the negative binomial model will be exactly the same as the Poisson model.
If the negative binomial model yields different results than the Poisson regression, then the variance is greater than the mean in the data, and using the Poisson regression is not ideal since it assumes equal mean and variance of the dependent variable.
I have also used backward selection to select the variables for the second negative binomial model. Note that if the conditional mean is equal to the variance of the dependent variable, then the second negative binomial model will contain exactly the same variables as the second Poisson model.
StepAIC
stepAIC(object = nb_reg1,
direction = "backward")3.3.2 Negative Binomial Regression 2
After applying backward selection on the preliminary negative binomial regression, I have run a second negative binomial regression.
StepAIC actually outputs the same exact remaining variables as used for the second Poisson model. Therefore, the estimating equation will be precisely the same as the second Poisson model with the addition of the variance condition to account for possible overdispersion:
\[ Var(CasesPurchased_i)=\mu_i + \alpha\cdot \mu_i^2 \]
As specified for the first negative binomial model, the variance of the number of sample cases purchased will become equal to the mean if the constant \(\alpha=0\).
nb_reg2 <- glm.nb(formula = cases_purchased ~ volatile_acidity + chlorides +
free_sulfur_dioxide + alcohol + label_appeal + acid_index +
wine_rating, data = train_df, init.theta = 141259.0943, link = log)As we can see, the model above contains precisely the same variables as the second Poisson model, indicating that there is no overdispersion in the data and that using a Poisson model in this scenario would be appropriate.
We will investigate this more thoroughly in the count regression interpretations section and analyze if the coefficient results are indeed the same.
I will create a third negative binomial model only containing the three most important regressors so we have at least one negative binomial model that yields different coefficient estimates than the Poisson models.
3.3.3 Negative Binomial Regression 3
As we will see in section 3.5, the coefficient estimates for the second Poisson and negative binomial regressions are identical, indicating that the mean and the variance are precisely the same. Therefore, I have also included a third negative binomial regression using different input variables to obtain different results.
In our explanatory data analysis, we determined that LabelAppeal, WineRating, and AcidIndex have the strongest correlation with how many sample cases of wine were purchased. Therefore, I have decided to remove any irrelevant variables that may reduce the precision of the coefficient estimates of the key variables.
Therefore, the estimating equation for the third estimating equation is the following:
\[ \begin{aligned} log(\mu_i)=\;& \beta_0+\beta_1\cdot LabelAppeal_i+\beta_2\cdot AcidIndex_i+\beta_3\cdot WineRating_i \end{aligned} \]
where \[Var(CasesPurchased_i)=\mu_i+\alpha\cdot \mu_i^2\]Since the first two estimating equations yield the same exact results as the two Poisson models, then for this equation, it’s likely that \(\alpha\) will also equal zero, meaning no overdispersion, and thus a Poisson model of the same form would yield the same results produced from this estimating equation.
nb_reg3 <- glm.nb(formula = cases_purchased ~ label_appeal + acid_index +
wine_rating,
data = train_df)3.4 Linear Regressions Coefficient Interpretations
labels1 <- c(
"Fixed Acidity",
"Volatile Acidity",
"Citric Acid",
"Residual Sugar",
"Chlorides",
"Free Sulfur Dioxide",
"Density",
"pH Level",
"Sulphates",
"Alcohol Content",
"Label Appeal",
"Acid Index",
"Wine Rating"
)
stargazer(lin_reg1, lin_reg2, type = "text",
title = "Linear Regression Regression Comparison",
covariate.labels = labels1)
Linear Regression Regression Comparison
========================================================================
Dependent variable:
----------------------------------------------------
cases_purchased
(1) (2)
------------------------------------------------------------------------
Fixed Acidity 0.0004
(0.002)
Volatile Acidity -0.088*** -0.089***
(0.017) (0.017)
Citric Acid 0.021
(0.015)
Residual Sugar -0.0001
(0.0004)
Chlorides -0.131*** -0.131***
(0.043) (0.043)
Free Sulfur Dioxide 0.0003*** 0.0003***
(0.0001) (0.0001)
Density -0.742 -0.746
(0.496) (0.496)
pH Level -0.015
(0.020)
Sulphates -0.009
(0.015)
Alcohol Content 0.016*** 0.016***
(0.004) (0.004)
Label Appeal 0.657*** 0.657***
(0.016) (0.016)
Acid Index -0.162*** -0.161***
(0.011) (0.011)
Wine Rating 0.726*** 0.726***
(0.016) (0.016)
Constant 4.047*** 3.990***
(0.506) (0.501)
------------------------------------------------------------------------
Observations 7,548 7,548
R2 0.455 0.455
Adjusted R2 0.454 0.454
Residual Std. Error 1.149 (df = 7534) 1.149 (df = 7539)
F Statistic 484.329*** (df = 13; 7534) 786.874*** (df = 8; 7539)
========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Coefficient Interpretations
I will use the second linear model derived from performing backward selection to discuss the sign, economic magnitude, and statistical significance of the important variables.
Label Appeal
Sign
Positive (+)
This is the expected coefficient sign because it makes sense that wines that receive a more positive reaction to their labels will be more likely to be purchased.
Economic magnitude
The coefficient is 0.657.
Since this is linear regression, this means that for a one unit increase in a wine’s label appeal, the number of sample cases purchased will increase by about 0.657 cases, holding all else constant.
This is a fairly significant result in terms of magnitude because an increase in label appeal leads to almost a whole other sample case of wine purchased.
Statistical Significance
- Statistically significant below the \(\alpha=0.01\) level, implying that we are confident that this estimate is not a result of random chance,
Wine Rating
Sign
positive (+)
This is the expected coefficient sign because wine that is more highly rated by experts will be more likely to have sample cases purchased.
Economic Magnitude
The coefficient is 0.726.
For a one unit increase in wine rating (1 additional star), the number of sample cases will increase on average by about 0.726 cases, holding all else constant.
This result is even more economically significant than the result for label appeal.
Statistical Significance
Statistically significant below the \(\alpha=0.01\) level
We are confident that this is the true estimate and that it likely did not occur by random chance.
Acid Index
Sign
Negative (-)
Suggests that wine with a low acid index tends to be purchased more.
Economic Magnitude
The coefficient estimate is -0.161
For every additional unit increase in a wine’s acid index, on average, the number of sample cases purchased by distributors decreases by about 0.161 cases, holding all else constant.
This is not as economically significant of a result as for label appeal and wine rating which makes sense because its negative correlation with sample cases purchased was not as strong in magnitude as label appeal and wine rating.
Statistical Significance
Statistically significant at the \(\alpha=0.01\) level
This coefficient estimate is likely to not have come from random chance.
3.5 Poisson and Negative Binomial Count Regression Interpretations
This section will compare the coefficient estimates of the second models for the Poisson and negative binomial count regressions. These models both have lower AIC levels than the first regressions, so I have not included the full models in this section to make the coefficient analysis more straightforward.
labels2 <- c(
"Volatile Acidity",
"Chlorides",
"Free Sulfur Dioxide",
"Alcohol Content",
"Label Appeal",
"Acid Index",
"Wine Rating"
)
stargazer(poisson_reg2, nb_reg2, nb_reg3, type = "text",
title = "Count Regression Comparison",
covariate.labels = labels2)
Count Regression Comparison
===================================================================================
Dependent variable:
---------------------------------------------------------------
cases_purchased
Poisson negative
binomial
(1) (2) (3)
-----------------------------------------------------------------------------------
Volatile Acidity -0.024*** -0.024***
(0.008) (0.008)
Chlorides -0.036* -0.036*
(0.019) (0.019)
Free Sulfur Dioxide 0.0001* 0.0001*
(0.00004) (0.00004)
Alcohol Content 0.004** 0.004**
(0.002) (0.002)
Label Appeal 0.180*** 0.180*** 0.180***
(0.007) (0.007) (0.007)
Acid Index -0.047*** -0.047*** -0.048***
(0.005) (0.005) (0.005)
Wine Rating 0.186*** 0.186*** 0.188***
(0.007) (0.007) (0.007)
Constant 1.199*** 1.199*** 1.243***
(0.048) (0.048) (0.044)
-----------------------------------------------------------------------------------
Observations 7,548 7,548 7,548
Log Likelihood -13,573.050 -13,574.110 -13,585.180
theta 141,259.400 (218,247.900) 140,639.200 (217,227.300)
Akaike Inf. Crit. 27,162.100 27,164.220 27,178.360
===================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
As we can see, the coefficient estimates for the Poisson and negative binomial count regressions are identical and the log likelihood and AIC values are almost the same as well. This indicates that there is no sign of overdispersion (the variance is greater than the mean), and therefore the Poisson count regression is sufficient to use in this case.
As noted in section 3.3.3, I included another negative binomial regression which only includes the three most correlated predictors of the number of sample cases purchased. The coefficient estimates of the included predictors did not change nearly at all, however, the AIC value did increase which suggests that it is not as well-balanced of a model as the Poisson or negative binomial regressions also in the regression table.
I will still provide a multiclass confusion matrix for the shortened negative binomial model and evaluate its prediction accuracy in comparison to the Poisson model 2.
Coefficient Interpretations (Poisson Regression)
Since there is no evidence of overdispersion, I will interpret the coefficient for the Poisson count regression. Also, I will create a new model for the negative binomial distribution to obtain a different model.
Label Appeal
Sign
Positive (+)
Similar to the linear regression, this is the expected coefficient sign because if consumers are more receptive to the label design of a wine bottle, we can expect that more sample cases will be purchased by a wine distribution company.
Economic magnitude
The coefficient estimate is 0.18.
This is a count regression, so this means that an additional unit increase in a wine’s label appeal score leads to an increase in the expected count by factor of \(e^{0.18}\) or about 1.19.
It’s more difficult to determine if this is an economically significant result than it was for the linear regression because of the coefficient interpretation.
Statistical Significance
Statistically significant under the \(\alpha=0.01\) significance level
We are confident that this estimate likely did not occur by random chance alone.
Wine Rating
Sign
Positive(+)
This is the expected coefficient sign because it makes sense that the number of sample cases purchased will increase if its wine rating increases.
Economic Magnitude
The coefficient estimate is 0.186.
Every additional star a wine receives results in the expected count of sample wines purchased to increase by a factor of \(e^{0.186}=1.204\) holding all else constant. This means that the expected count of sample wines purchased to increase by about 20.4%.
The economic magnitude is greater than for label appeal which makes sense if we refer back to our correlation matrix in section 1 where win rating had the highest positive correlation with cases purchased.
Statistical Significance
Statistically significant under the \(\alpha=0.01\)
We are confident that the coefficient estimate did not occur by random chance alone.
Acid Index
Sign
- Negative (-)
Economic Magnitude
The coefficient estimate is -0.047.
A one unit increase in the acid index of wine results in the expected count of sample cases purchased by wine distribution companies to decrease by a factor of about 0.95 holding all else constant.
Alternatively, this means that the expected count of sample cases purchased resulting from a one unit increase in the acid index decreases by about 4.59%.
This doesn’t seem like a lot, but if the acid index of a particular wine increases by a lot, then we can deduce that the number of sample cases purchased will decrease by a fairly significant amount.
Statistical Significance
Statistically significant at the \(\alpha=0.01\) significance level.
The coefficient result likely did not occur due to random chance and we can be confident with the estimate.
4 Select Models
In this section, we will be evaluating the model performance of the linear and count regressions we have. For the linear regressions, I will use standard residual analysis and will compare each of their F-statistics, MSE, and R-squared values.
For the count regressions, I will make classification predictions on the training data and build a multiclass confusion matrix to test the accuracy of both the Poisson and negative binomial regressions. As we saw in the previous section, the AIC value for the Poisson regression was lower than the negative binomial regression, indicating a more sound model, however I will still check their prediction accuracy before making a final selection.
4.1 Linear Regression Diagnostic Tests
diagnostic_table <- data.frame(
model = c("Model 1", "Model 2"),
MSE = c(mean(lin_reg1$residuals**2), mean(lin_reg2$residuals**2)),
R_squared = c((summary(lin_reg1))$adj.r.squared,
(summary(lin_reg2))$adj.r.squared),
f_stat = c((summary(lin_reg1))$fstatistic[1],
(summary(lin_reg2))$fstatistic[1])
)
diagnostic_table |> kable(digits = 3,
caption = "Comparison of Diagnostic Results",
col.names = c("Model", "MSE",
"R Squared", "F-Statistic"))| Model | MSE | R Squared | F-Statistic |
|---|---|---|---|
| Model 1 | 1.318 | 0.454 | 484.329 |
| Model 2 | 1.319 | 0.454 | 786.874 |
Mean Squared Error
- The mean squared error values are almost identical with the full linear model actually having a slightly lower MSE value. This indicates that the predictions for the full model may be slightly closer to the actual values, although not by very much.
R Squared
The adjusted r-squared values are exactly the same in both of the linear models.
For both models, approximately 45.5% of the variation in the number of sample wines purchased can be explained by the models which is a fairly strong r-squared value.
F-statistic
The f-statistic is much higher for the second linear model which suggests that model 2 is generally more statistically significant than model 1.
Both f-static values are relatively high and statistically significant which means that there is likely a relationship between the predictors and the number of sample wine cases purchased.
Diagnostic Testing Conclusion
Based on the large disparity in the f-statistic value between the models, model 2 appears to be the strongest and most reliable linear regression model. I will have to further investigate how these models uphold the Gauss markov assumptions through residual analysis to make a clear decision of which linear model is the best.
Also, I will still have to evaluate the prediction accuracy of the count regression models to determine the best overall model for prediction on the evaluation dataset.
4.1.1 Residual Analysis Plots
Below I have plotted the four residual plots for the second linear model. The residual plots for model 1 are identical so I only included model 2 here for simplicity.
plot(lin_reg2)Residuals VS Fitted Values
- Clearly, we can see that the residual values are not randomly distributed around zero and that the variance of the residuals follow a clear downward sloping trend as the fitted values increase. Clear sign of heteroskedasticity.
Normal Q-Q Plot
- The residuals follow a relatively normal distribution for positive theoretical quantiles, however the residuals near the lower tail shift off of the normal distribution fairly significantly.
Scale Location Plot
The residuals are clearly not spread equally among the predicted values which is a clear violation of homoskedasticity.
I have also run a White’s Test below to confirm the presence of heteroskedasticity in both linear models.
Residuals vs Leverage Plot
- There is no indication of any leverage points in the data.
White’s Test
Below is a comparison of the White’s test results for both linear model 1 and 2.
library(skedastic)
wt1 <- white(mainlm = lin_reg1)
wt2 <- white(mainlm = lin_reg2)
white_tests <- rbind(wt1, wt2)
white_tests <- add_column(white_tests, Model = c("Model 1", "Model 2"),
.before = 1)
kable(white_tests)| Model | statistic | p.value | parameter | method | alternative |
|---|---|---|---|---|---|
| Model 1 | 894.4523 | 0 | 26 | White’s Test | greater |
| Model 2 | 882.3044 | 0 | 16 | White’s Test | greater |
The null hypothesis for a White’s Test is that there is homoskedasticity in a regression model. Since the p-values for both model 1 and 2 are exactly zero, then we can reject the null hypothesis which confirms what we found from the residuals plots that there is strong evidence to suggest heteroskedasticity.
Since there is such strong heteroskedasticity present, the statistically significant results that we found for the linear models may be unreliable since not having constant variance among residuals affects the standard errors. Therefore, these models may not be the best for making reliable predictions on the evaluation data, however we will also analyze the homoskedasticity assumption for the count regressions.
Multicollinearity Testing with Variance inflation Factor
vif1 <- data.frame(Variable = c("Volatile Acidity", "Chlorides",
"Free Sulfur Dioxide", "Density", "Alcohol",
"Label Appeal", "Acid Index", "Wine Rating"),
VIF = vif(lin_reg2),
row.names = NULL)
kable(vif1)| Variable | VIF |
|---|---|
| Volatile Acidity | 1.003559 |
| Chlorides | 1.001043 |
| Free Sulfur Dioxide | 1.001287 |
| Density | 1.002089 |
| Alcohol | 1.008457 |
| Label Appeal | 1.129241 |
| Acid Index | 1.014251 |
| Wine Rating | 1.141363 |
Below displays that none of the variables in the linear model 2 have high VIF values, indicating that there is no multicollinearity present in the regression model.
4.1.2 Predictions with Linear Model
Below, I have used the second linear model that I created to generate predictions for the number of sample cases purchased in the training dataset. I will compare these results with the predictions of the count models in the next section.
train_df$lin_initial <- predict(lin_reg2)
train_df$lin_pred <- round(train_df$lin_initial)Since I have rounded the linear model predictions to the nearest whole number, I will also create a multiclass confusion matrix to evaluate its prediction accuracy and compare this with the count models.
Since the minimum number of sample cases predicted to be purchased is 1 and the maximum amount is only 7 for the linear model, I will have to specify the levels 0 through 8 for the linear model so the confusion matrix will work properly.
library(caret)
com_levels <- 0:8
confusionMatrix(factor(train_df$cases_purchased, levels = com_levels),
factor(train_df$lin_pred, levels = com_levels),
positive = "1")Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8
0 0 20 175 258 100 3 0 0 0
1 0 19 65 20 0 0 0 0 0
2 0 15 318 228 40 0 0 0 0
3 0 11 349 841 481 32 0 0 0
4 0 2 77 690 1229 331 18 0 0
5 0 0 15 159 688 528 121 3 0
6 0 0 0 13 137 268 163 9 0
7 0 0 0 0 9 37 54 9 0
8 0 0 0 0 0 1 9 3 0
Overall Statistics
Accuracy : 0.4116
95% CI : (0.4005, 0.4228)
No Information Rate : 0.3556
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.2424
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity NA 0.283582 0.31832 0.3807 0.4579 0.44000
Specificity 0.92634 0.988638 0.95679 0.8365 0.7701 0.84468
Pos Pred Value NA 0.182692 0.52912 0.4907 0.5236 0.34875
Neg Pred Value NA 0.993552 0.90197 0.7655 0.7202 0.88863
Prevalence 0.00000 0.008877 0.13235 0.2927 0.3556 0.15898
Detection Rate 0.00000 0.002517 0.04213 0.1114 0.1628 0.06995
Detection Prevalence 0.07366 0.013778 0.07962 0.2271 0.3109 0.20058
Balanced Accuracy NA 0.636110 0.63755 0.6086 0.6140 0.64234
Class: 6 Class: 7 Class: 8
Sensitivity 0.44658 0.375000 NA
Specificity 0.94055 0.986709 0.998278
Pos Pred Value 0.27627 0.082569 NA
Neg Pred Value 0.97097 0.997984 NA
Prevalence 0.04836 0.003180 0.000000
Detection Rate 0.02160 0.001192 0.000000
Detection Prevalence 0.07817 0.014441 0.001722
Balanced Accuracy 0.69356 0.680855 NA
Linear Model Confusion Matrix Analysis
IMPORTANT NOTE:
The confusion matrix that R creates when using the
confusionMatrix()command under thecaretpackage lays out the matrix so that Reference refers to the rows and Prediction refers to the columns.However, I noticed that, for whatever reason, the confusion matrix metrics are calculated as if Reference refers to the columns and Prediction refers to the rows.
Therefore, the values for sensitivity and positive predicted value are swapped, and the values for specificity and negative predicted value are also swapped.
Note that this does not affect the total accuracy of the models, so accuracy and classification error rate will be the same as they are seen for all of the models.
For this first linear model, I will use logic to deduce the sensitivity, specificity, and precision for the class with the highest percentage, but then for the count models, I will just swap the corresponding values to obtain the correct percentages for each metric to reduce clutter.
Accuracy
The accuracy of the linear model is 41.16%.
It accurately predicts the correct number of sample cases purchased for only 41.16% of the wines.
This does not seem like a very good prediction accuracy, but considering that it is a linear regression model and not intrinsically a classification model, this is not too bad.
Classification Error Rate
1-41.18 = 58.82%.
The linear model inaccurately predicts the number of sample cases purchased for 58.84% of the wines.
Sensitivity (True Positive Rate)
As noted earlier, the sensitivity, specificity, and positive predicted values as seen above are not accurate. In particular, sensitivity and positive predicted value are swapped. So, the class with the highest true positive rate should be class 4 with a value of 52.36% since that is the highest “positive predicted value” as depicted from the confusion matrix metrics.
I will solve for the true positive rate for class 4 manually to confirm this.
# number of observations where cases purchased equals 4
length(which(train_df$cases_purchased==4))[1] 2347
- We can see that there are 2,347 observations in the training dataset where the number wines with 4 sample cases purchased is exactly 4. If we add up the number of times Reference is 4 in the confusion matrix above, we also get 2,347.
2 + 77 + 690 + 1229 + 331 + 18[1] 2347
- Also, according to the confusion matrix, the linear model accurately predicted 4 sample cases of wine to be purchased 1,229 times. Therefore, calculating the true positive rate, we get \(1229/2347=0.5236\) as seen below:
1229/2347[1] 0.5236472
- This confirms the sensitivity value we found above from swapping the positive predicted value for sensitivity. Therefore, the linear correctly classified 52.36% of the wine that actually had 4 sample cases purchased.
Specificity (True negative Rate)
Using similar logic as for sensitivity, interpreting the negative predicted values for class 4 as the specificity values, the class with the lowest specificity I Class 4 with a true negative rate of 72.02%.
I will solve for the true negative rate manually to confirm this. The total number of observations that did not have 4 sample cases purchased is 5,201 as seen below:
# Number of observations where cases purchased does not equal 4
length(which(train_df$cases_purchased!=4))[1] 5201
- Therefore, the number of true negatives that the linear model predicted for class 4 will be 5,201 minus the number of times the model incorrectly predicted 4 sample cases purchased. According to the confusion matrix, the model inaccurately predicted 4 cases purchased 1,455 times as seen below.
100+40+481+688+137+9[1] 1455
- Subtracting 1,455 from 5,201 gives us 3,746 which represents the true negatives. Then the true negative rate is the following \(3746/5201=0.7202\) as seen below:
3746/5201[1] 0.7202461
This confirms the specificity value we found for class 4 earlier from interpreting the negative predicted value as the specificity.
The true negative rate of 72.02% for class 4 means that for the wines where there were not 4 sample cases purchased, the Poisson model only classifies 72% of those observations in another class other than 4 cases purchased.
Precision (Positive Predicted Value)
The class with the highest positive predicted value should be class 4 with a value of 45.79% since that is the highest “sensitivity” as depicted from the confusion matrix metrics.
I will solve for the positive predicted value to confirm this finding.
length(which(train_df$lin_pred == 4))[1] 2684
- We can see that there are 2,684 observations in the training data where the linear model predicted the number of sample cases of wine purchased to be 4. Also, adding up the number of times where Prediction is 4 in the confusion matrix we also get 2,684:
100+40+481+1229+688+137+9[1] 2684
- Finally, since the confusion matrix shows that the linear correctly predicted 4 sample cases of wine being purchased 1,229 times, calculating the positive predicted value, we get \(1229/2684=0.4579\) as seen below:
1229/2684[1] 0.4578987
This confirms the same precision value we found earlier from swapping the sensitivity with the precision.
This means that out of the wines that the linear model predicted to have 4 sample cases purchased, almost 46% of those classifications were accurate, meaning 46% of those wines actually did have 4 sample cases purchased.
4.2 Count Regression Multiclass Confusion Matrix
Now, I will evaluate the accuracy and reliability of the count regressions.
The first step is to make classification predictions for how many sample cases were purchased by wine distribution companies for each observation. Here, I have used the second Poisson count regression to make predictions on the number of sample wine cases bought.
train_df$pois_log <- predict(poisson_reg2, type = "response")
# Round these predictions to the nearest whole number
train_df$pois_pred <- round(train_df$pois_log)The Poisson predicted counts range from 2 through 9, meaning that the model does not predict any of the observations to have only 1 or zero sample cases purchased, and that it actually predicts some observations to have 9 cases purchased. This does not match the levels of 0 through 8 for the actual cases purchased, so I have to set common factor levels to get the ConfusionMatrix command to work.
4.2.1 Poisson Regression Multiclass Confusion Matrix
Below is the multiclass confusion matrix for the second Poisson count regression.
common_levels <- 0:9
confusionMatrix(factor(train_df$cases_purchased, levels = common_levels),
factor(train_df$pois_pred, levels = common_levels),
positive = "1")Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 0 0 163 311 79 3 0 0 0 0
1 0 0 79 25 0 0 0 0 0 0
2 0 0 274 301 26 0 0 0 0 0
3 0 0 289 1045 362 18 0 0 0 0
4 0 0 55 921 1094 250 27 0 0 0
5 0 0 10 230 674 434 147 16 3 0
6 0 0 0 24 141 230 147 39 9 0
7 0 0 0 0 9 29 48 18 4 1
8 0 0 0 0 0 0 3 7 3 0
9 0 0 0 0 0 0 0 0 0 0
Overall Statistics
Accuracy : 0.3994
95% CI : (0.3884, 0.4106)
No Information Rate : 0.3785
P-Value [Acc > NIR] : 9.731e-05
Kappa : 0.2271
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity NA NA 0.31494 0.3658 0.4587 0.4502
Specificity 0.92634 0.98622 0.95103 0.8574 0.7573 0.8360
Pos Pred Value NA NA 0.45591 0.6097 0.4661 0.2867
Neg Pred Value NA NA 0.91421 0.6894 0.7518 0.9122
Prevalence 0.00000 0.00000 0.11526 0.3785 0.3160 0.1277
Detection Rate 0.00000 0.00000 0.03630 0.1384 0.1449 0.0575
Detection Prevalence 0.07366 0.01378 0.07962 0.2271 0.3109 0.2006
Balanced Accuracy NA NA 0.63299 0.6116 0.6080 0.6431
Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity 0.39516 0.225000 0.1578947 0.0000000
Specificity 0.93827 0.987815 0.9986718 1.0000000
Pos Pred Value 0.24915 0.165138 0.2307692 NaN
Neg Pred Value 0.96766 0.991666 0.9978766 0.9998675
Prevalence 0.04928 0.010599 0.0025172 0.0001325
Detection Rate 0.01948 0.002385 0.0003975 0.0000000
Detection Prevalence 0.07817 0.014441 0.0017223 0.0000000
Balanced Accuracy 0.66671 0.606407 0.5782833 0.5000000
4.2.2 Negative Binomial Multiclass Confusion Matrix (Shortened Model)
I have followed the same procedure as highlighted above with the Poisson regression prediction, but this time using the third negative binomial regression.
train_df$nb_log <- predict(nb_reg3, type = "response")
train_df$nb_pred <- round(train_df$nb_log)
confusionMatrix(factor(train_df$cases_purchased, levels = common_levels),
factor(train_df$nb_pred, levels = common_levels),
positive = "1")Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8 9
0 0 0 171 299 83 3 0 0 0 0
1 0 0 82 22 0 0 0 0 0 0
2 0 0 316 261 24 0 0 0 0 0
3 0 0 328 1000 368 18 0 0 0 0
4 0 0 79 905 1078 260 25 0 0 0
5 0 0 16 227 675 453 122 18 3 0
6 0 0 0 24 143 230 147 29 17 0
7 0 0 0 0 7 34 46 14 8 0
8 0 0 0 0 0 0 3 7 3 0
9 0 0 0 0 0 0 0 0 0 0
Overall Statistics
Accuracy : 0.3989
95% CI : (0.3878, 0.4101)
No Information Rate : 0.3627
P-Value [Acc > NIR] : 4.53e-11
Kappa : 0.2284
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity NA NA 0.31855 0.3652 0.4533 0.45391
Specificity 0.92634 0.98622 0.95653 0.8516 0.7545 0.83802
Pos Pred Value NA NA 0.52579 0.5834 0.4593 0.29921
Neg Pred Value NA NA 0.90269 0.7021 0.7500 0.90968
Prevalence 0.00000 0.00000 0.13143 0.3627 0.3151 0.13222
Detection Rate 0.00000 0.00000 0.04187 0.1325 0.1428 0.06002
Detection Prevalence 0.07366 0.01378 0.07962 0.2271 0.3109 0.20058
Balanced Accuracy NA NA 0.63754 0.6084 0.6039 0.64596
Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity 0.42857 0.205882 0.0967742 NA
Specificity 0.93851 0.987299 0.9986697 1
Pos Pred Value 0.24915 0.128440 0.2307692 NA
Neg Pred Value 0.97183 0.992741 0.9962840 NA
Prevalence 0.04544 0.009009 0.0041070 0
Detection Rate 0.01948 0.001855 0.0003975 0
Detection Prevalence 0.07817 0.014441 0.0017223 0
Balanced Accuracy 0.68354 0.596591 0.5477219 NA
4.2.3 Confusion Matrix Analysis
As discussed in the confusion matrix analysis for the linear model, the values seen for the positive predicted values are to be interpreted as the sensitivities and the values seen as the negative predicted values are to be interpreted as the specificity values.
To save space, I will not outline the calculations necessary to obtain these values, but will instead just swap the corresponding metrics.
Accuracy
Poisson (39.94%)
Negative Binomial (39.89%)
Both of these count regression models have low prediction accuracy, but the Poisson Model has a slightly higher prediction accuracy by about 0.05 percentage points.
Classification Error Rate
Poisson: (1-39.94) = 60.06%
Negative Binomial: (1-39.49) = 60.11%
Both classification error rates are high, but the Poisson regression has a slightly lower classification error rate.
Sensitivity (True Positive Rate)
Poisson
- Highest detection of true positives is for observations where 3 sample cases were purchased (60.97%)
Negative Binomial
- Highest detection of true positives if for observations where also 5 sample cases were purchased (58.34%)
Both models do a poor job of detecting true positives and neither model was able to accurately predict observations where only 1 or no sample cases were purchased.
Specificity (True Negative Rate)
Poisson
The lowest accuracy for detecting true negatives is for class 3 with an accuracy of 68.94%
Therefore, for observations where there were not 3 sample cases purchased, the Poisson model only classifies 69% of those observations in another class other than 3 cases purchased.
Negative Binomial
The lowest accuracy for detecting true negatives is also for class 3 with a true positive rate of 70.21%.
A similar interpretation as seen for the Poisson model follows here.
Both models were very successful in detecting true negatives, that is, for the observations that are not in a certain class for the number of sample cases purchased, the models do a good job of predicting that the number of cases purchased is not in the specific class.
Precision (Positive Predicted Value)
Poisson
Does a decent job for classes 4 and 5 with a maximum precision of 45.87% for class 4.
Out of the observations that the Poisson model predicted to have 4 sample cases purchased, about 46% of those wines actually did have 3 sample cases purchased.
Negative Binomial
Similar interpretation as for the Poisson model but the highest positive predicted value is for class 5 at 45.39%.
Out of the observations in which the negative binomial model predicted to have 5 sample cases purchased, about 45.39% of those wines actually did have 3 sample cases purchased.
On average, the sensitivity for the two models is generally better than their precision.
Note that both of these models did generate at least one instance of predicting 9 sample cases to be purchased even though the maximum number of sample cases purchased in the training data is only 8.
4.2.4 Count Regression Heteroskedasticity White’s Test
Below, I ran a White’s test for the second poisson and third negative binomial model, and stored their results in a table.
pois_wt <- white(mainlm = poisson_reg2)
nb_wt <- white(mainlm = nb_reg3)
count_white_tests <- rbind(pois_wt, nb_wt)
count_white_tests <- add_column(count_white_tests,
Model = c("Poisson Model 2",
"Negative Binomial Model 3"),
.before = 1)
kable(count_white_tests)| Model | statistic | p.value | parameter | method | alternative |
|---|---|---|---|---|---|
| Poisson Model 2 | 1135.047 | 0 | 14 | White’s Test | greater |
| Negative Binomial Model 3 | 1105.249 | 0 | 6 | White’s Test | greater |
Similarly to the linear regression models, the table above shows that both the Poisson and negative binomial count models have a p-value of zero, meaning we can reject the null hypothesis of homoskedasticty. Therefore, there is a strong indication of heteroskedasticty which indicates unreliable conclusions of the statistical significance of our estimates, potentially leading to misleading predictions on the testing data.
4.2.5 Selecting the Best Model
After implementing diagnostic testing for the linear and count models, I have decided to select the second linear model to make predictions on the testing data. This may seem counterintuitive since the dependent variable is the number of sample cases purchased which is a discrete variable. However, after rounding the predictions of the linear model to the nearest whole number and building a multiclass confusion matrix, it ended up having the highest prediction accuracy at 41.18%, as well as generally having higher sensitivity, specificity, and positive predicated values across the classes than the count models. Also, even though the linear model showed clear signs of heteroskedasticity in the residual plots, the count models were shown to also have heteroskedasticity in the White’s Test. Furthermore, the linear model had a very high F-statistic value which suggests that the entire model is significantly significant in predicting the number of sample wine cases purchased.
4.3 Predictions on Testing Data
Before making predictions on the testing wine data, I must address the NA values that are present so it is compatible to receive predictions from the linear model.
Since I deleted any observation that had a missing value for the wine rating, I will also delete any observation with a missing wine rating for the testing data. Furthermore, since there were a large number of other variables that had missing values, I imputed those values with their corresponding median values to prevent too much loss of data. To be consistent, I will also impute any missing values not in the WineRating column with their corresponding median value.
plot_missing(test_df, missing_only = T)# Delete missing wine rating values
test_df <- test_df[!is.na(test_df$wine_rating),]
# Imputing other missing values with median values of corresponding variables
test_df$residual_sugar[is.na(test_df$residual_sugar)] <-
median(test_df$residual_sugar, na.rm = T)
test_df$chlorides[is.na(test_df$chlorides)] <-
median(test_df$chlorides, na.rm = T)
test_df$free_sulfur_dioxide[is.na(test_df$free_sulfur_dioxide)] <-
median(test_df$free_sulfur_dioxide, na.rm = T)
test_df$ph[is.na(test_df$ph)] <-
median(test_df$ph, na.rm = T)
test_df$sulphates[is.na(test_df$sulphates)] <-
median(test_df$sulphates, na.rm = T)
test_df$alcohol[is.na(test_df$alcohol)] <-
median(test_df$alcohol, na.rm = T)Now that the data is cleaned and all of the missing values have been addressed, we can finally make predictions using the linear regression model on the testing data.
I first made normal predictions using the linear regression model and then I rounded the predictions to the nearest whole number. I will use these rounded values to construct the multiclass confusion matrix for the evaluation data.
# initial predictions using the linear model
test_df$lin_initial <- predict(lin_reg2, newdata = test_df)
# Rounding the predictions to the nearst whole number
test_df$lin_pred <- round(test_df$lin_initial)
min(test_df$lin_pred)[1] 1
max(test_df$lin_pred)[1] 7
The minimum and maximum number of sample cases predicted by the linear model are again 1 and 7 respectively which means we must identify the common levels that must be used for the confusion matrix to be accurate.
confusionMatrix(factor(test_df$cases_purchased, levels = com_levels,
ordered = TRUE),
factor(test_df$lin_pred, levels = com_levels, ordered = TRUE))Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8
0 0 4 41 78 16 1 0 0 0
1 0 1 11 2 0 0 0 0 0
2 0 3 84 54 13 1 0 0 0
3 0 0 85 205 140 10 0 0 0
4 0 0 20 180 299 66 5 0 0
5 0 0 2 39 187 130 41 0 0
6 0 0 0 3 44 52 42 2 0
7 0 0 0 0 0 7 15 3 0
8 0 0 0 0 0 0 2 0 0
Overall Statistics
Accuracy : 0.4047
95% CI : (0.3824, 0.4272)
No Information Rate : 0.3702
P-Value [Acc > NIR] : 0.001113
Kappa : 0.2311
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity NA 0.1250000 0.34568 0.3654 0.4278 0.48689
Specificity 0.92585 0.9930851 0.95684 0.8229 0.7721 0.83405
Pos Pred Value NA 0.0714286 0.54194 0.4659 0.5246 0.32581
Neg Pred Value NA 0.9962647 0.90825 0.7541 0.6965 0.90799
Prevalence 0.00000 0.0042373 0.12871 0.2971 0.3702 0.14142
Detection Rate 0.00000 0.0005297 0.04449 0.1086 0.1584 0.06886
Detection Prevalence 0.07415 0.0074153 0.08210 0.2331 0.3019 0.21133
Balanced Accuracy NA 0.5590426 0.65126 0.5942 0.5999 0.66047
Class: 6 Class: 7 Class: 8
Sensitivity 0.40000 0.600000 NA
Specificity 0.94335 0.988317 0.998941
Pos Pred Value 0.29371 0.120000 NA
Neg Pred Value 0.96390 0.998926 NA
Prevalence 0.05561 0.002648 0.000000
Detection Rate 0.02225 0.001589 0.000000
Detection Prevalence 0.07574 0.013242 0.001059
Balanced Accuracy 0.67168 0.794158 NA
Confusion Matrix Interpretations
The confusion matrix for the testing data has the same problems as I mentioned with the linear model on the training data.
Therefore, I will again interpret the values that are displayed as the positive predicted values as the true positive rates and vice versa, and also the values seen as the negative predicted values as the true negative rates.
Accuracy
The accuracy of the linear model is 40.47%.
It accurately predicts the correct number of sample cases purchased for only 40.47% of the wines.
This classification accuracy is slightly less than the linear model achieved for the training data, but it is still higher than the prediction accuracy of any of the count regression models.
Classification Error Rate
1-41.18 = 59.53%.
The linear model inaccurately predicts the number of sample cases purchased for 59.53% of the wines.
Sensitivity (True Positive Rate)
Interpreting the positive predicted values as the true positive rates, the class with the highest sensitivity is class 2 with a true positive rate of 54.19%.
This means that out of the wines who actually had 2 sample cases purchased, the linear model correctly detected 54%% of those wines having 2 sample cases purchased.
Specificity (True negative Rate)
- The specificity values are all very high across all of the classes but the lowest specificity value occurs in class 4 with a true negative rate of about 69.65%.
- This means that out of the wines that did not have 4 sample cases purchased, the linear model correctly predicted almost 70% of those wines as not having 4 sample cases purchased.
Precision ( Positive Predicted Value)
The class in which the linear model had the highest precision is class 7 with a positive predicted value of 60%.
This means that out of the wines that the linear model predicted to have 7 sample cases purchased, exactly 60% of those classifications were accurate, meaning 60% of those wines actually did have 7 sample cases purchased.
4.4 Conclusion
In this project, we looked at a dataset that contained about 12,000 different commercially available wines along with various chemical properties of each of the wines and how they were rated in wine sampling.
I trained 6 different regression models on the training data.
2 multivariate linear regression models
2 Poisson Count Regressions
2 Negative Binomial Count Regressions
After comparing the linear and count models using confusion matrix metrics, I determined that linear model 2 was the best model for prediction on the evaluation data set. This model contained 8 of the 13 original variables and the following are 7 of those variables that were found to be statistically significantly associated with determining the number of sample cases of wine purchased after sampling them at a significance level of \(\alpha=0.01\):
Volatile Acidity (\(\beta=-0.089\))
Chlorides (\(\beta=-0.131\))
Free Sulfur Dioxide (\(\beta=0.0003\))
Alcohol Content (\(\beta=0.016\))
Label Appeal (\(\beta=0.657\))
Acid Index (\(\beta=-0.161\))
Wine Rating (\(\beta=0.726\))
Specifically, our explanatory data analysis showed label appeal and wine rating having the two strongest positive correlations and acid index having the strongest negative correlation with the number of sample cases purchased. These findings are corroborated with the signs and economic magnitude of their respective coefficient estimates from the linear regression as seen above.
This linear model had higher prediction accuracies than both of the count models and specifically, it had a prediction accuracy of about 40.47%. This means that it correctly classified the number of sample cases of wine purchased for 40.47% of the wines in the testing data.
The linear model used for prediction in the testing data did have strong signs of heteroskedasticty which may have negatively affected the prediction accuracy of the model.