This report explores and analyzes a data set containing information on approximately 12,000 commercially available wines, with variables relating mostly to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine purchased by wine distribution companies after sampling a wine, which will be used by restaurants and wine stores. The higher the number of sample cases purchased, the more likely the wine is to be sold.
The objective of the analysis is to predict the number of cases of wine that will be sold, given certain properties of the wine, using the following variables:
| Variable Name | Definition |
|---|---|
| TARGET | Number of cases purchased |
| Alcohol | Alcohol content |
| AcidIndex | Proprietary method of testing total acidity of wine by using a weighted average |
| Chlorides | Chloride content of wine |
| CitricAcid | Citric acid content |
| Density | Density of wine |
| FixedAcidity | Fixed acidity of wine |
| FreeSulfurDioxide | Sulfur dioxide content of wine |
| LabelAppeal | Marketing score indicating the appeal of the label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales. |
| ResidualSugar | Residual sugar of wine |
| STARS | Wine rating by a team of experts; 4 Stars = Excellent, 1 Star = Poor |
| TotalSulfurDioxide | Total sulfur dioxide of wine |
| VolatileAcidity | Volatile acid content of wine |
| pH | pH of wine |
stargazer(train_data1, type ="text", title="Descriptive statistics")
We first note that nearly all the variables include negative values, which we take to indicate that the data have been transformed (for example, log transformed) to approximate a normal distribution. In addition, there are a large number of missing values, with the most NAs coming from the STARS variable (2688). Given the positive correlation between Target and Stars, we can impute missing values for Stars as zero, treating an unrated wine as one level below a one-star rating.
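The zero imputation described above can be sketched in base R. This is a minimal illustration on a small made-up data frame (`toy` and its values are hypothetical stand-ins, not the report's `train_data1`), assuming the rating column is named `STARS` as in the variable table.

```r
# Toy stand-in for train_data1 (hypothetical values, not the report's data)
toy <- data.frame(
  TARGET = c(3, 0, 4, 2, 0),
  STARS  = c(2, NA, 4, 1, NA)
)

# Count the missing ratings before imputing
n_missing <- sum(is.na(toy$STARS))

# Treat a missing rating as "unrated" and code it as zero
toy$STARS[is.na(toy$STARS)] <- 0

n_missing
toy$STARS
```

Coding the missing ratings as zero keeps those rows available for modeling instead of dropping them, at the cost of assuming that "unrated" behaves like a rating below one star.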
Regarding skewness, the data appear to be roughly normally distributed and centered, with slightly negative (left) skew in Target, FixedAcidity, CitricAcid, ResidualSugar, Density, and Alcohol. AcidIndex is the most skewed. Kurtosis is greater than 1 for all variables except Target, LabelAppeal, and Stars, so these distributions are leptokurtic, which may indicate several outlier values in the data in its original form. The graph below shows these distributions:
gather_df <- train_data1 %>%
  gather(key = 'variable', value = 'value')

# Histogram plots of each variable
ggplot(gather_df) +
  geom_histogram(aes(x = value, y = after_stat(density)), bins = 30) +
  geom_density(aes(x = value), color = 'red') +
  facet_wrap(. ~ variable, scales = 'free', ncol = 4)
Warning: Removed 6578 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 6578 rows containing non-finite outside the scale range
(`stat_density()`).
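The skewness and kurtosis readings discussed before the plot can be reproduced with simple moment-based formulas. The sketch below is an assumption about how such statistics might be computed, not the method used for the report's summary table: the helper names `skewness` and `excess_kurtosis` and the `toy` data are hypothetical, and packaged estimators may apply small-sample corrections that give slightly different numbers.

```r
# Moment-based sample skewness: third central moment over variance^1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Toy data (hypothetical): a symmetric column and one with a long right tail
toy <- data.frame(
  symmetric  = c(-2, -1, 0, 1, 2),
  right_tail = c(1, 1, 1, 2, 10)
)

sapply(toy, skewness)
sapply(toy, excess_kurtosis)
```

A negative skewness value means the longer tail is on the left, a positive value means it is on the right, and excess kurtosis above zero indicates heavier tails than a normal distribution.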
We see that just over 26% of the values for Stars are missing, with the full breakdown of the percentage of missing values in the data set below:
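One way a per-variable breakdown of missing values could be produced is shown here as a minimal base R sketch. The `toy` data frame is a hypothetical stand-in for `train_data1`, so the percentages it prints are illustrative only.

```r
# Toy stand-in for train_data1 (hypothetical values, not the report's data)
toy <- data.frame(
  STARS   = c(2, NA, 4, NA),
  pH      = c(3.2, 3.4, NA, 3.1),
  Alcohol = c(9.9, 10.5, 11.2, 12.0)
)

# Share of NA values per column, as a percentage, largest first
pct_missing <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
round(pct_missing, 1)
```

Because `is.na()` returns a logical matrix, `colMeans()` gives the fraction of missing entries in each column directly.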
Looking at the correlations, with the full breakdown below, we see that Stars and LabelAppeal are the most strongly correlated with Target, while AcidIndex, VolatileAcidity, Density, Chlorides, FixedAcidity, Sulphates, and pH all show a negative correlation with Target.
cor_train_data1 <- cor(train_data1, use = "complete.obs")
corrplot(cor_train_data1, method = 'square', type = 'lower',
         tl.col = 'darkblue', addgrid.col = 'black', order = 'original',
         addshade = 'all', tl.cex = 0.75, number.cex = 0.75, tl.srt = 45,
         mar = c(0, 0, 0, 0), diag = FALSE)

# Full breakdown of correlation coefficients
correlations <- round(cor(train_data1, use = "complete.obs"), digits = 3)
correlations
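To read the relationship with the response straight off a correlation matrix, the row for the target can be extracted and sorted. The sketch below is a minimal example on made-up data (`toy` and its values are hypothetical), assuming the response column is named `TARGET` as in the variable table.

```r
# Toy stand-in for train_data1 (hypothetical values, not the report's data)
toy <- data.frame(
  TARGET    = c(1, 2, 3, 4, 5, 6),
  STARS     = c(1, 1, 2, 3, 3, 4),
  AcidIndex = c(9, 8, 8, 7, 6, 6),
  Chlorides = c(0.2, NA, 0.1, 0.3, 0.2, 0.1)
)

# Pairwise correlations on rows with no missing values, as in the report
cors <- round(cor(toy, use = "complete.obs"), 3)

# Keep the TARGET row, drop its self-correlation, sort from most positive
target_cors <- sort(cors["TARGET", colnames(cors) != "TARGET"],
                    decreasing = TRUE)
target_cors
```

Note that `use = "complete.obs"` discards every row containing an NA in any column, so the coefficients describe only the fully observed subset of the data.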