Data 621 Homework 5

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant.

A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.3      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.0 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(knitr)
library(readr)
library(kableExtra)

## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(caTools)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

#Data Exploration

wine_train <- read_csv("https://raw.githubusercontent.com/Wilchau/Data_621_Homework_5/main/wine-training-data.csv")

## Rows: 12795 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): INDEX, TARGET, FixedAcidity, VolatileAcidity, CitricAcid, Residual...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

wine_eval <- read_csv("https://raw.githubusercontent.com/Wilchau/Data_621_Homework_5/main/wine-evaluation-data.csv")

## Rows: 3335 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): IN, FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlo...
## lgl  (1): TARGET
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The summary below shows missing values in most of the variables in the wine dataset. The TARGET variable appears to be discrete rather than continuous: it is the number of cases of wine sold.
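One way to quantify the missingness described above is to count NA values per column. The following is a minimal sketch on a small hypothetical data frame (`toy`); the column names mirror the wine data, but the values are invented for illustration.

```r
# Hypothetical toy data; values are invented, only the column names
# mirror the wine data set.
toy <- data.frame(
  TARGET = c(3, 0, 4, 2, 0),
  STARS  = c(2, NA, 3, NA, 1),
  pH     = c(3.3, 3.1, NA, 3.5, 3.2)
)

# Number and share of missing values in each column
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))
print(data.frame(na_count, na_share))

# TARGET takes only whole-number values, which is what marks it as a count
target_is_count <- all(toy$TARGET == round(toy$TARGET))
```

On the real data the same calls would be applied to `wine_train` instead of `toy`.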

##Visual Exploration

###Boxplots

The boxplots below show every variable in the dataset. This visualization helps show how the data is spread for each variable.

The boxplots show

The target variable, number of cases, is shown below. The data shows a large number of zero values.

The distribution resembles a Poisson distribution, but with a large excess of zero values.
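A rough way to check for excess zeros is to compare the observed share of zeros with the share a Poisson distribution with the same mean would imply, which is exp(-mean). The sketch below uses invented counts (`cases`) in place of TARGET, so the numbers are illustrative only.

```r
# Hypothetical counts standing in for TARGET (invented values)
cases <- c(0, 0, 0, 3, 4, 4, 5, 3, 2, 0, 4, 6)

zero_share    <- mean(cases == 0)    # observed share of zeros
expected_zero <- exp(-mean(cases))   # share of zeros implied by a Poisson with the same mean

# Excess zeros show up as an observed share well above the Poisson-implied share
print(c(observed = zero_share, poisson = expected_zero))
```

A large gap between the two shares is one informal signal that a zero-inflated model may be worth trying.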

AcidIndex is shaped more like a Poisson distribution, with a slight right skew. LabelAppeal and STARS seem to be more categorical.

## Warning: Removed 4841 rows containing non-finite values (`stat_bin()`).

The other variables seem to be more normally distributed with high kurtosis.

###Correlation

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by target - since that’s what we’re trying to predict.

Variable	Correlation with LabelAppeal
INDEX	0.0314911
TARGET	0.4979465
FixedAcidity	0.0113760
VolatileAcidity	-0.0202420
CitricAcid	0.0153316
ResidualSugar	-0.0045793
Chlorides	-0.0063870
FreeSulfurDioxide	0.0149601
TotalSulfurDioxide	-0.0027237
Density	-0.0180944
pH	0.0002182
Sulphates	0.0037687
Alcohol	-0.0006449
LabelAppeal	1.0000000
AcidIndex	0.0103010
STARS	0.3188970

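Looking at the correlations, very few variables look correlated at all. The table has LabelAppeal at 1.0, so it reports correlations with LabelAppeal: TARGET (0.50) and STARS (0.32) are positively correlated with it, and every other value is within about 0.03 of zero. AcidIndex and TARGET have a small negative correlation.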
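Correlations against a single column can be pulled out of the full correlation matrix. The sketch below is an assumption about how such a table might be produced, not the code behind the table above; it uses simulated data and `use = "pairwise.complete.obs"` so that rows with NA values are dropped pair by pair.

```r
set.seed(621)
n <- 200

# Simulated stand-in for the wine data (all values invented)
toy <- data.frame(
  STARS     = sample(1:4, n, replace = TRUE),
  AcidIndex = sample(6:11, n, replace = TRUE)
)
toy$TARGET <- rpois(n, lambda = exp(0.2 + 0.3 * toy$STARS))
toy$STARS[sample(n, 40)] <- NA   # introduce some missing ratings

# Correlation of every column with TARGET, using the complete pairs only
cors <- cor(toy, use = "pairwise.complete.obs")[, "TARGET"]
print(sort(cors, decreasing = TRUE))
```

Selecting a different column name in the last index gives the correlations with that variable instead.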

###Missing Values

According to the graph, multiple variables have missing values. The STARS variable has the most NA values. These missing values will be imputed later, during data preparation, using the MICE package.
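The assignment hint notes that missingness itself can be predictive of the target. One possible way to capture that, sketched here on invented data, is to record a 0/1 flag before imputing; the `STARS_missing` name is hypothetical and not part of the original data set.

```r
# Hypothetical toy data (invented values)
toy <- data.frame(
  TARGET = c(0, 0, 3, 4, 0, 5, 2, 0),
  STARS  = c(NA, NA, 2, 3, NA, 4, 1, 1)
)

# 1 when STARS was missing, 0 otherwise; create this before any imputation,
# because imputation erases the information about which values were missing
toy$STARS_missing <- as.integer(is.na(toy$STARS))

# Mean TARGET for rows with and without a STARS rating
mean_by_flag <- tapply(toy$TARGET, toy$STARS_missing, mean)
print(mean_by_flag)
```

If the group means differ noticeably on the real data, the flag may be a useful derived predictor.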

#Data Prep

###Imputation of Missing (NA) values

The data exploration revealed multiple variables with numerous NA values. There are several ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation with the mean/median/mode is an easy way to fill in the missing NA’s; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.
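The variance reduction can be seen directly on simulated data. In this sketch (values invented), 30 of 100 observations are removed and then filled with the observed mean; the sum of squared deviations is unchanged while the sample size grows, so the variance shrinks.

```r
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)
x[sample(100, 30)] <- NA   # knock out 30 values at random

# Fill every NA with the mean of the observed values
x_mean_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

var_observed <- var(x, na.rm = TRUE)   # variance of the 70 observed values
var_imputed  <- var(x_mean_imputed)    # variance after mean imputation

print(c(observed = var_observed, mean_imputed = var_imputed))
```

Here the imputed variance is exactly 69/99 of the observed variance, since the denominator moves from 70 - 1 to 100 - 1.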

In this case, data will be imputed via prediction using the MICE (Multivariate Imputation by Chained Equations) library with a random forest prediction method.

Since the data has many missing values across multiple variables, the MICE algorithm takes some computing time.
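A minimal sketch of random forest imputation with `mice` follows. It assumes the `mice` package is installed along with a random forest backend (`ranger` or `randomForest`, depending on the `mice` version). The data, the choice of `m = 1` and the seed are illustrative assumptions, not the settings used for this report.

```r
# Assumes mice and a random forest backend are installed
set.seed(621)

# Simulated stand-in for the wine data (all values invented)
toy <- data.frame(
  Alcohol = rnorm(60, mean = 10, sd = 2),
  pH      = rnorm(60, mean = 3.2, sd = 0.3),
  STARS   = sample(1:4, 60, replace = TRUE)
)
toy$Alcohol[sample(60, 10)] <- NA
toy$STARS[sample(60, 15)]   <- NA

# method = "rf" requests random forest imputation for every incomplete column;
# m = 1 keeps a single imputed data set to limit computing time
imp <- mice::mice(toy, method = "rf", m = 1, maxit = 5,
                  seed = 621, printFlag = FALSE)

# Extract the first (and only) completed data set
toy_complete <- mice::complete(imp, 1)
```

Calling `mice::complete()` with the namespace prefix avoids a clash with `tidyr::complete()` when the tidyverse is attached.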

##Absolute value of variables

Some of the discussion among classmates has been about taking the absolute value of the variables in the dataset, since several variables contain negative numbers whose validity is debated.

In this case I will take an ABS transformation and apply it to the top-performing model.

It seems, however, that taking the ABS of the values in the dataset introduces a right skew where the variable would have been approximately normal.

If this data is transformed using the log transformation, it seems to become ‘more’ normal, but this might be introducing overfitting into the data?
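The skew introduced by the absolute value, and the effect of a log transform afterwards, can be illustrated on simulated data. This sketch uses a standard normal variable as a stand-in for a roughly symmetric wine variable; the `skewness` helper is defined here for illustration and `log1p` is one possible choice of log transform that stays defined at zero.

```r
set.seed(2)
x <- rnorm(1000, mean = 0, sd = 1)   # roughly symmetric, includes negative values

# Simple moment-based sample skewness (helper defined here, not from a package)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

x_abs <- abs(x)          # folds the negative half onto the positive half
x_log <- log1p(x_abs)    # log(1 + |x|), defined when |x| is zero

print(round(c(raw = skewness(x), abs = skewness(x_abs), log_abs = skewness(x_log)), 2))
```

The comparison only describes the shape of the transformed variable; it does not settle whether the transformation helps or hurts out-of-sample prediction.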

#Build Models

Throughout this section, various models will be created to determine which gives the best “fit” for predicting the number of wine cases sold, as given by the dataset. In this assignment, I will try various models such as linear models, Negative Binomial, and Poisson, as suggested by the homework instructions.

##Model 1 - Poisson with imputed data

As per the homework videos, the Poisson distribution works well with count data.
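As a minimal sketch of a Poisson regression, the example below fits `glm()` with `family = poisson` to simulated counts. The data and the coefficients used to generate it are invented; only the column names echo the wine data.

```r
set.seed(621)
n <- 500

# Simulated predictors (invented values)
sim <- data.frame(
  STARS       = sample(1:4, n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE)
)

# Simulate counts from a known log-linear model (coefficients are invented)
sim$TARGET <- rpois(n, lambda = exp(0.5 + 0.2 * sim$STARS + 0.15 * sim$LabelAppeal))

# Poisson GLM with the default log link
fit_pois <- glm(TARGET ~ STARS + LabelAppeal, family = poisson, data = sim)

# Coefficients are on the log scale; exponentiate to read them as rate ratios
print(round(exp(coef(fit_pois)), 3))
```

Because the data are simulated from the same model family, the fitted coefficients should land close to the generating values of 0.2 and 0.15.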

##Model 2 - Poisson without imputed data

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2158  -0.2734   0.0616   0.3732   1.6830  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
## FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
## VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
## CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
## ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
## Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
## FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
## TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
## Density            -3.725e-01  2.462e-01  -1.513  0.13026    
## pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
## Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
## Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
## LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
## AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
## STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4009.1  on 6421  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5
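A few quantities can be read off the summary above with simple arithmetic. The sketch below copies values from the printed output; the interpretation in the comments is a rough guide rather than a formal test.

```r
# Estimates copied from the printed summary (log scale)
coefs <- c(LabelAppeal = 0.1771, STARS = 0.1871, AcidIndex = -0.04870)

# exp(coefficient) is the multiplicative change in expected cases
# for a one-unit increase in the predictor
rate_ratios <- exp(coefs)
print(round(rate_ratios, 3))

# Deviances and degrees of freedom copied from the printed summary
null_dev  <- 5844.1
resid_dev <- 4009.1
resid_df  <- 6421

# Residual deviance per degree of freedom; values far above 1 would
# point to overdispersion relative to the Poisson assumption
dispersion_ratio <- resid_dev / resid_df

# Share of the null deviance accounted for by the predictors
deviance_explained <- 1 - resid_dev / null_dev

# Rows used in the fit: null degrees of freedom plus one
n_used <- 6435 + 1
```

The ratio comes out at about 0.62 and the deviance explained at about 0.31. The 6,436 rows used plus the 6,359 dropped for missingness add up to the 12,795 rows read in, so roughly half of the training data is discarded when NA values are not imputed.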

##Model 3 - Negative Binomial

##Linear Model

##Zero inflation

##Model- glmulti Package

The glmulti package is an “automated model selection and model averaging” tool. The package automatically generates all possible models “with the specified response and explanatory variables”. The tool is basically used to find the “best” model.
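To illustrate the idea of fitting every candidate model and ranking the results, here is a base R sketch of an exhaustive search by AIC. It does not use glmulti and is not its implementation; the data, variable names and the choice of AIC as the criterion are assumptions for illustration.

```r
set.seed(621)
n <- 300

# Simulated data (invented): only x1 actually drives the response
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rpois(n, lambda = exp(0.3 + 0.5 * sim$x1))

predictors <- c("x1", "x2", "x3")

# Every non-empty subset of the predictors
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one Poisson GLM per subset and record its AIC
aics <- sapply(subsets, function(vars) {
  AIC(glm(reformulate(vars, response = "y"), family = poisson, data = sim))
})
names(aics) <- sapply(subsets, paste, collapse = " + ")

# The lowest-AIC candidates
print(sort(aics)[1:3])
```

With p predictors this fits 2^p - 1 models, which is why exhaustive search becomes slow as the number of variables grows.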

#Select Models

####Predictions

Like the training data, the evaluation data also needs some prep work. As was done for the training data, columns have been removed from the eval data and NA values imputed using the MICE random forest method to predict what the NA values could be.

##Evaluating the model

The model will be evaluated by looking at the MSE.
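For reference, the MSE is the mean of the squared differences between observed and predicted values. The sketch below defines a small helper and applies it to invented numbers; `actual` and `predicted` are hypothetical vectors, not results from the models in this report.

```r
# Mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

actual    <- c(3, 0, 4, 2, 5)       # hypothetical observed case counts
predicted <- c(2.5, 0.5, 3, 2, 4)   # hypothetical model predictions

mse_value  <- mse(actual, predicted)
rmse_value <- sqrt(mse_value)       # RMSE, on the same scale as TARGET

print(c(MSE = mse_value, RMSE = rmse_value))
```

Computing the same metric for each model on the same held-out rows makes the values directly comparable.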

The linear model and the glmulti model have very close error values. Both models’ predictions are shown below:

Wilson Chau

2023-12-17