Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
set.seed(200)
wine_train <- read.csv("C:/Users/Banu/Documents/RScriptfiles/DATA621/Homework 5/HOMEWORK 5/wine-training-data.csv",stringsAsFactors = F,header=TRUE)
wine_test <- read.csv("C:/Users/Banu/Documents/RScriptfiles/DATA621/Homework 5/HOMEWORK 5/wine-evaluation-data.csv",stringsAsFactors = F,header=TRUE)
variables <- c("INDEX", "TARGET", " ", " ", "ACID INDEX", "ALCOHOL", "CHLORIDES", "CITRIC ACID", "DENSITY", "FIXED ACIDITY", "FREE SULFUR DIOXIDE", "LABEL APPEAL", "RESIDUAL SUGAR", "STARS", "SULPHATES", "TOTAL SULFUR DIOXIDE", "VOLATILE ACIDITY", "pH")
define <- c("Identification Variable (do not use)", "Number of Cases Purchased", " ", " ", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate content of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine.", "pH of wine")
theoreffect <- c("None", "None", " ", " ", " ", " ", " ", " ", " ", " ", " ", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.", " ", "A high number of stars suggests high sales", " ", " ", " ", " ")
kable(cbind(variables, define, theoreffect), col.names = c("VARIABLE NAME", "DEFINITION", "THEORETICAL EFFECT")) %>%
kable_paper(full_width = T)| VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET | Number of Cases Purchased | None |
| ACID INDEX | Proprietary method of testing total acidity of wine by using a weighted average | |
| ALCOHOL | Alcohol Content | |
| CHLORIDES | Chloride content of wine | |
| CITRIC ACID | Citric Acid Content | |
| DENSITY | Density of Wine | |
| FIXED ACIDITY | Fixed Acidity of Wine | |
| FREE SULFUR DIOXIDE | Sulfur Dioxide content of wine | |
| LABEL APPEAL | Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design. | Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales. |
| RESIDUAL SUGAR | Residual Sugar of wine | |
| STARS | Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor | A high number of stars suggests high sales |
| SULPHATES | Sulfate content of wine | |
| TOTAL SULFUR DIOXIDE | Total Sulfur Dioxide of Wine | |
| VOLATILE ACIDITY | Volatile Acid content of wine. | |
| pH | pH of wine |
# remove index column as it is not needed
wine_train <- wine_train %>%
dplyr::select(-"INDEX")
wine_test <- wine_test %>%
dplyr::select(-"IN")
head(wine_train) TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1 3 3.2 1.160 -0.98 54.2 -0.567
2 3 4.5 0.160 -0.81 26.1 -0.425
3 5 7.1 2.640 -0.88 14.8 0.037
4 3 5.7 0.385 0.04 18.8 -0.425
5 4 8.0 0.330 -1.26 9.4 NA
6 0 11.3 0.320 0.59 2.2 0.556
FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
1 NA 268 0.99280 3.33 -0.59 9.9
2 15 -327 1.02792 3.38 0.70 NA
3 214 142 0.99518 3.12 0.48 22.0
4 22 115 0.99640 2.24 1.83 6.2
5 -167 108 0.99457 3.12 1.77 13.7
6 -37 15 0.99940 3.20 1.29 15.4
LabelAppeal AcidIndex STARS
1 0 8 2
2 -1 7 3
3 -1 8 3
4 -1 6 1
5 0 9 2
6 0 11 NA
head(wine_test) TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1 NA 5.4 -0.860 0.27 -10.7 0.092
2 NA 12.4 0.385 -0.76 -19.7 1.169
3 NA 7.2 1.750 0.17 -33.0 0.065
4 NA 6.2 0.100 1.80 1.0 -0.179
5 NA 11.4 0.210 0.28 1.2 0.038
6 NA 17.6 0.040 -1.15 1.4 0.535
FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
1 23 398 0.98527 5.02 0.64 12.30
2 -37 68 0.99048 3.37 1.09 16.00
3 9 76 1.04641 4.61 0.68 8.55
4 104 89 0.98877 3.20 2.11 12.30
5 70 53 1.02899 2.54 -0.07 4.80
6 -250 140 0.95028 3.06 -0.02 11.40
LabelAppeal AcidIndex STARS
1 -1 6 NA
2 0 6 2
3 0 8 1
4 -1 8 1
5 0 10 NA
6 1 8 4
skim(wine_test)| Name | wine_test |
| Number of rows | 3335 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| logical | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| TARGET | 3335 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FixedAcidity | 0 | 1.00 | 6.86 | 6.32 | -18.20 | 5.20 | 6.90 | 9.00 | 33.50 | ▁▂▇▂▁ |
| VolatileAcidity | 0 | 1.00 | 0.31 | 0.81 | -2.83 | 0.08 | 0.28 | 0.63 | 3.61 | ▁▂▇▂▁ |
| CitricAcid | 0 | 1.00 | 0.31 | 0.87 | -3.12 | 0.00 | 0.31 | 0.60 | 3.76 | ▁▂▇▂▁ |
| ResidualSugar | 168 | 0.95 | 5.32 | 34.37 | -128.30 | -2.60 | 3.60 | 17.20 | 145.40 | ▁▂▇▂▁ |
| Chlorides | 138 | 0.96 | 0.06 | 0.31 | -1.15 | 0.02 | 0.05 | 0.17 | 1.26 | ▁▂▇▂▁ |
| FreeSulfurDioxide | 152 | 0.95 | 34.95 | 149.63 | -563.00 | 3.00 | 30.00 | 79.25 | 617.00 | ▁▂▇▂▁ |
| TotalSulfurDioxide | 157 | 0.95 | 123.41 | 225.80 | -769.00 | 27.25 | 124.00 | 210.00 | 1004.00 | ▁▂▇▂▁ |
| Density | 0 | 1.00 | 0.99 | 0.03 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
| pH | 104 | 0.97 | 3.24 | 0.68 | 0.60 | 2.98 | 3.21 | 3.49 | 6.21 | ▁▂▇▂▁ |
| Sulphates | 310 | 0.91 | 0.53 | 0.91 | -3.07 | 0.33 | 0.50 | 0.82 | 4.18 | ▁▂▇▂▁ |
| Alcohol | 185 | 0.94 | 10.58 | 3.76 | -4.20 | 9.00 | 10.40 | 12.50 | 25.60 | ▁▂▇▂▁ |
| LabelAppeal | 0 | 1.00 | 0.01 | 0.89 | -2.00 | -1.00 | 0.00 | 1.00 | 2.00 | ▁▅▇▅▁ |
| AcidIndex | 0 | 1.00 | 7.75 | 1.32 | 5.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▇▇▁▁▁ |
| STARS | 841 | 0.75 | 2.04 | 0.91 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 | ▇▇▁▅▂ |
Missing/Summary Review
skim(wine_train)| Name | wine_train |
| Number of rows | 12795 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 0 | 1.00 | 3.03 | 1.93 | 0.00 | 2.00 | 3.00 | 4.00 | 8.00 | ▆▇▇▆▁ |
| FixedAcidity | 0 | 1.00 | 7.08 | 6.32 | -18.10 | 5.20 | 6.90 | 9.50 | 34.40 | ▁▂▇▂▁ |
| VolatileAcidity | 0 | 1.00 | 0.32 | 0.78 | -2.79 | 0.13 | 0.28 | 0.64 | 3.68 | ▁▂▇▂▁ |
| CitricAcid | 0 | 1.00 | 0.31 | 0.86 | -3.24 | 0.03 | 0.31 | 0.58 | 3.86 | ▁▂▇▂▁ |
| ResidualSugar | 616 | 0.95 | 5.42 | 33.75 | -127.80 | -2.00 | 3.90 | 15.90 | 141.15 | ▁▂▇▂▁ |
| Chlorides | 638 | 0.95 | 0.05 | 0.32 | -1.17 | -0.03 | 0.05 | 0.15 | 1.35 | ▁▂▇▂▁ |
| FreeSulfurDioxide | 647 | 0.95 | 30.85 | 148.71 | -555.00 | 0.00 | 30.00 | 70.00 | 623.00 | ▁▂▇▂▁ |
| TotalSulfurDioxide | 682 | 0.95 | 120.71 | 231.91 | -823.00 | 27.00 | 123.00 | 208.00 | 1057.00 | ▁▂▇▂▁ |
| Density | 0 | 1.00 | 0.99 | 0.03 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
| pH | 395 | 0.97 | 3.21 | 0.68 | 0.48 | 2.96 | 3.20 | 3.47 | 6.13 | ▁▂▇▂▁ |
| Sulphates | 1210 | 0.91 | 0.53 | 0.93 | -3.13 | 0.28 | 0.50 | 0.86 | 4.24 | ▁▂▇▂▁ |
| Alcohol | 653 | 0.95 | 10.49 | 3.73 | -4.70 | 9.00 | 10.40 | 12.40 | 26.50 | ▁▂▇▂▁ |
| LabelAppeal | 0 | 1.00 | -0.01 | 0.89 | -2.00 | -1.00 | 0.00 | 1.00 | 2.00 | ▁▅▇▅▁ |
| AcidIndex | 0 | 1.00 | 7.77 | 1.32 | 4.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▁▇▁▁▁ |
| STARS | 3359 | 0.74 | 2.04 | 0.90 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 | ▇▇▁▅▂ |
skim(wine_test)| Name | wine_test |
| Number of rows | 3335 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| logical | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| TARGET | 3335 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FixedAcidity | 0 | 1.00 | 6.86 | 6.32 | -18.20 | 5.20 | 6.90 | 9.00 | 33.50 | ▁▂▇▂▁ |
| VolatileAcidity | 0 | 1.00 | 0.31 | 0.81 | -2.83 | 0.08 | 0.28 | 0.63 | 3.61 | ▁▂▇▂▁ |
| CitricAcid | 0 | 1.00 | 0.31 | 0.87 | -3.12 | 0.00 | 0.31 | 0.60 | 3.76 | ▁▂▇▂▁ |
| ResidualSugar | 168 | 0.95 | 5.32 | 34.37 | -128.30 | -2.60 | 3.60 | 17.20 | 145.40 | ▁▂▇▂▁ |
| Chlorides | 138 | 0.96 | 0.06 | 0.31 | -1.15 | 0.02 | 0.05 | 0.17 | 1.26 | ▁▂▇▂▁ |
| FreeSulfurDioxide | 152 | 0.95 | 34.95 | 149.63 | -563.00 | 3.00 | 30.00 | 79.25 | 617.00 | ▁▂▇▂▁ |
| TotalSulfurDioxide | 157 | 0.95 | 123.41 | 225.80 | -769.00 | 27.25 | 124.00 | 210.00 | 1004.00 | ▁▂▇▂▁ |
| Density | 0 | 1.00 | 0.99 | 0.03 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
| pH | 104 | 0.97 | 3.24 | 0.68 | 0.60 | 2.98 | 3.21 | 3.49 | 6.21 | ▁▂▇▂▁ |
| Sulphates | 310 | 0.91 | 0.53 | 0.91 | -3.07 | 0.33 | 0.50 | 0.82 | 4.18 | ▁▂▇▂▁ |
| Alcohol | 185 | 0.94 | 10.58 | 3.76 | -4.20 | 9.00 | 10.40 | 12.50 | 25.60 | ▁▂▇▂▁ |
| LabelAppeal | 0 | 1.00 | 0.01 | 0.89 | -2.00 | -1.00 | 0.00 | 1.00 | 2.00 | ▁▅▇▅▁ |
| AcidIndex | 0 | 1.00 | 7.75 | 1.32 | 5.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▇▇▁▁▁ |
| STARS | 841 | 0.75 | 2.04 | 0.91 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 | ▇▇▁▅▂ |
Visualizations
# histogram
wine_train %>%
dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 3) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="light blue") +
theme_minimal()wine_train %>%
ggplot(aes(x=TARGET)) +
geom_histogram(fill='light blue')# Prepare data for ggplot
wine_df <- wine_train %>%
gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(wine_df, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=6)wine_factor <- wine_train %>%
dplyr::select(TARGET, STARS, LabelAppeal, AcidIndex) %>%
pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
arrange(variable, value)
wine_factor %>%
ggplot(mapping = aes(x = factor(value), y = TARGET)) +
geom_boxplot() +
facet_wrap(.~variable, scales="free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("kassambara/ggpubr")
library(ggpubr)
# barchart
w3 <- wine_train %>%
dplyr::select(TARGET, STARS) %>%
mutate(STARS = as.factor(STARS),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(STARS)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
w4 <- wine_train %>%
dplyr::select(TARGET, LabelAppeal) %>%
mutate(STARS = as.factor(LabelAppeal),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(LabelAppeal)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
w5 <- wine_train %>%
dplyr::select(TARGET, AcidIndex) %>%
mutate(STARS = as.factor(AcidIndex),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(AcidIndex)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
ggarrange(w5, ggarrange(w3, w4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)# top correlation
wtrain_corr <- wine_train %>%
drop_na() %>%
cor()
kable(sort(wtrain_corr[,1], decreasing = T), col.names = c("Correlation")) %>%
kable_styling(full_width = F)| Correlation | |
|---|---|
| TARGET | 1.0000000 |
| STARS | 0.5546857 |
| LabelAppeal | 0.4979465 |
| Alcohol | 0.0737771 |
| FreeSulfurDioxide | 0.0226398 |
| TotalSulfurDioxide | 0.0216021 |
| ResidualSugar | 0.0035196 |
| CitricAcid | 0.0023450 |
| pH | 0.0002199 |
| FixedAcidity | -0.0125381 |
| Sulphates | -0.0212204 |
| Chlorides | -0.0304301 |
| Density | -0.0475989 |
| VolatileAcidity | -0.0759979 |
| AcidIndex | -0.1676431 |
library(VIM)
aggr(wine_train,
sortVars=TRUE,
labels=names(wine_train),
cex.axis=.5,
bars = FALSE,
col = c("white", "#E46726"),
combined = TRUE,
#border = NA,
ylab = "Missing Values")
Variables sorted by number of missings:
Variable Count
STARS 3359
Sulphates 1210
TotalSulfurDioxide 682
Alcohol 653
FreeSulfurDioxide 647
Chlorides 638
ResidualSugar 616
pH 395
TARGET 0
FixedAcidity 0
VolatileAcidity 0
CitricAcid 0
Density 0
LabelAppeal 0
AcidIndex 0
library(corrplot)
library(RColorBrewer)
# correlation plot
corrplot(wtrain_corr,
method = "number",
type = "lower",
col = brewer.pal(n = 15, name = "Reds"),
number.cex = .7, tl.cex = .7,
tl.col = "black", tl.srt = 45)corrplot(wtrain_corr, tl.cex = .9, tl.col="black", method = 'color', addCoef.col = "black",
type="upper", order="hclust", number.cex=0.6, diag=FALSE)library(VIM)
# missing value columns
aggr(wine_train,
sortVars=TRUE,
labels=names(wine_train),
cex.axis=.5,
bars = FALSE,
col = c("white", "light blue"),
combined = TRUE,
#border = NA,
ylab = "Missing Values")
Variables sorted by number of missings:
Variable Count
STARS 3359
Sulphates 1210
TotalSulfurDioxide 682
Alcohol 653
FreeSulfurDioxide 647
Chlorides 638
ResidualSugar 616
pH 395
TARGET 0
FixedAcidity 0
VolatileAcidity 0
CitricAcid 0
Density 0
LabelAppeal 0
AcidIndex 0
library(VIM)
# missing value columns
aggr(wine_test,
sortVars=TRUE,
labels=names(wine_test),
cex.axis=.5,
bars = FALSE,
col = c("white", "light blue"),
combined = TRUE,
#border = NA,
ylab = "Missing Values")
Variables sorted by number of missings:
Variable Count
TARGET 3335
STARS 841
Sulphates 310
Alcohol 185
ResidualSugar 168
TotalSulfurDioxide 157
FreeSulfurDioxide 152
Chlorides 138
pH 104
FixedAcidity 0
VolatileAcidity 0
CitricAcid 0
Density 0
LabelAppeal 0
AcidIndex 0
library(mice)
# imputing train data
wimputetrain <- mice(wine_train)
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
meth <- wimputetrain$method
predM <- wimputetrain$predictorMatrix
predM[, c("TARGET")] <- 0
final_impute <- mice(wine_train, method = 'rf', predictorMatrix=predM)
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
final_impute1 <- complete(final_impute)
print(paste0("Missing value after imputation: ", sum(is.na(final_impute1))))[1] "Missing value after imputation: 0"
aggr(final_impute1,
sortVars=TRUE,
labels=names(wine_train),
cex.axis=.5,
bars = FALSE,
col = c("white", "light blue"),
combined = TRUE,
#border = NA,
ylab = "Missing Values")
Variables sorted by number of missings:
Variable Count
TARGET 0
FixedAcidity 0
VolatileAcidity 0
CitricAcid 0
ResidualSugar 0
Chlorides 0
FreeSulfurDioxide 0
TotalSulfurDioxide 0
Density 0
pH 0
Sulphates 0
Alcohol 0
LabelAppeal 0
AcidIndex 0
STARS 0
# histogram
final_impute1 %>%
dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 3) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="light blue") +
theme_minimal()# wine_train$ResidualSugar[is.na(wine_train$ResidualSugar)] <- median(wine_train$ResidualSugar, na.rm=TRUE)
# wine_train$Chlorides[is.na(wine_train$Chlorides)] <- median(wine_train$Chlorides, na.rm=TRUE)
# wine_train$FreeSulfurDioxide[is.na(wine_train$FreeSulfurDioxide)] <- median(wine_train$FreeSulfurDioxide, na.rm=TRUE)
# wine_train$TotalSulfurDioxide[is.na(wine_train$TotalSulfurDioxide)] <- median(wine_train$TotalSulfurDioxide, na.rm=TRUE)
# wine_train$pH[is.na(wine_train$pH)] <- median(wine_train$pH, na.rm=TRUE)
# wine_train$Sulphates[is.na(wine_train$Sulphates)] <- median(wine_train$Sulphates, na.rm=TRUE)
# wine_train$Alcohol[is.na(wine_train$Alcohol)] <- median(wine_train$Alcohol, na.rm=TRUE)
# wine_train$STARS[is.na(wine_train$STARS)] <- median(wine_train$STARS, na.rm=TRUE)Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations).
Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models.
Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques.
Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? In this case, about the only thing you can comment on is the number of stars and the wine label appeal.
However, you might comment on the coefficient and magnitude of variables and how they are similar or different from model to model. For example, you might say “pH seems to have a major positive impact in my poisson regression model, but a negative effect in my multiple linear regression model”.
Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
r2glm <- function(model) {
summaryLog <- summary(model)
1 - summaryLog$deviance / summaryLog$null.deviance
}set.seed(100)
model_metrics <- function(model) {
data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
performance <- data.frame("RSE" = model$sigma,
"Adj R2" = model$adj.r.squared,
"F-Statistic" = model$fstatistic[1])
return(performance)
}set.seed(100)
#poisson models with all variables
model1 <- glm(TARGET ~ ., family = poisson, wine_train)
summary(model1)
Call:
glm(formula = TARGET ~ ., family = poisson, data = wine_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.593e+00 2.506e-01 6.359 2.03e-10 ***
FixedAcidity 3.293e-04 1.053e-03 0.313 0.75447
VolatileAcidity -2.560e-02 8.353e-03 -3.065 0.00218 **
CitricAcid -7.259e-04 7.575e-03 -0.096 0.92365
ResidualSugar -6.141e-05 1.941e-04 -0.316 0.75165
Chlorides -3.007e-02 2.056e-02 -1.463 0.14346
FreeSulfurDioxide 6.734e-05 4.404e-05 1.529 0.12620
TotalSulfurDioxide 2.081e-05 2.855e-05 0.729 0.46618
Density -3.725e-01 2.462e-01 -1.513 0.13026
pH -4.661e-03 9.598e-03 -0.486 0.62722
Sulphates -5.164e-03 7.051e-03 -0.732 0.46398
Alcohol 3.948e-03 1.771e-03 2.229 0.02579 *
LabelAppeal 1.771e-01 7.954e-03 22.271 < 2e-16 ***
AcidIndex -4.870e-02 5.903e-03 -8.251 < 2e-16 ***
STARS 1.871e-01 7.487e-03 24.993 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 5844.1 on 6435 degrees of freedom
Residual deviance: 4009.1 on 6421 degrees of freedom
(6359 observations deleted due to missingness)
AIC: 23172
Number of Fisher Scoring iterations: 5
r2glm(model1)[1] 0.3139796
anova(model1, test = "Chisq")Analysis of Deviance Table
Model: poisson, link: log
Response: TARGET
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 6435 5844.1
FixedAcidity 1 0.66 6434 5843.4 0.416923
VolatileAcidity 1 24.07 6433 5819.3 9.305e-07 ***
CitricAcid 1 0.00 6432 5819.3 0.961864
ResidualSugar 1 0.05 6431 5819.3 0.825953
Chlorides 1 3.62 6430 5815.7 0.057171 .
FreeSulfurDioxide 1 1.90 6429 5813.8 0.167654
TotalSulfurDioxide 1 1.85 6428 5811.9 0.173597
Density 1 8.95 6427 5803.0 0.002776 **
pH 1 0.00 6426 5803.0 0.992117
Sulphates 1 1.97 6425 5801.0 0.160678
Alcohol 1 22.97 6424 5778.0 1.641e-06 ***
LabelAppeal 1 1031.50 6423 4746.5 < 2.2e-16 ***
AcidIndex 1 118.65 6422 4627.9 < 2.2e-16 ***
STARS 1 618.72 6421 4009.1 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(model1)#install.packages("faraway")
library("faraway")set.seed(100)
print('Goodness of Fit Test:')[1] "Goodness of Fit Test:"
with(model1, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE))) res.deviance df p
[1,] 4009.142 6421 1
model_metrics(model1)data frame with 0 columns and 0 rows
set.seed(100)
# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, final_impute1)
summary(model2)
Call:
glm(formula = TARGET ~ ., family = poisson, data = final_impute1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.153e+00 1.955e-01 11.012 < 2e-16 ***
FixedAcidity -5.151e-04 8.197e-04 -0.628 0.529739
VolatileAcidity -5.110e-02 6.488e-03 -7.876 3.37e-15 ***
CitricAcid 1.366e-02 5.896e-03 2.318 0.020467 *
ResidualSugar 1.757e-04 1.507e-04 1.166 0.243545
Chlorides -6.050e-02 1.610e-02 -3.759 0.000171 ***
FreeSulfurDioxide 1.395e-04 3.419e-05 4.080 4.51e-05 ***
TotalSulfurDioxide 1.058e-04 2.202e-05 4.806 1.54e-06 ***
Density -4.637e-01 1.920e-01 -2.415 0.015744 *
pH -2.507e-02 7.519e-03 -3.335 0.000854 ***
Sulphates -1.266e-02 5.470e-03 -2.315 0.020593 *
Alcohol 5.042e-03 1.371e-03 3.677 0.000236 ***
LabelAppeal 1.956e-01 6.108e-03 32.018 < 2e-16 ***
AcidIndex -1.227e-01 4.465e-03 -27.488 < 2e-16 ***
STARS 1.764e-01 5.809e-03 30.378 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 22861 on 12794 degrees of freedom
Residual deviance: 18649 on 12780 degrees of freedom
AIC: 50621
Number of Fisher Scoring iterations: 5
plot(model2)halfnorm(residuals(model2))plot(log(fitted(model2)),log((final_impute1$TARGET-fitted(model2))^2),
xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)(dp<- sum(residuals(model2,type="pearson")^2)/model2$df.res)[1] 0.9634397
summary(model2,dispersion=dp)
Call:
glm(formula = TARGET ~ ., family = poisson, data = final_impute1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.153e+00 1.919e-01 11.219 < 2e-16 ***
FixedAcidity -5.151e-04 8.046e-04 -0.640 0.522032
VolatileAcidity -5.110e-02 6.368e-03 -8.024 1.02e-15 ***
CitricAcid 1.366e-02 5.787e-03 2.361 0.018214 *
ResidualSugar 1.757e-04 1.479e-04 1.188 0.234797
Chlorides -6.050e-02 1.580e-02 -3.829 0.000129 ***
FreeSulfurDioxide 1.395e-04 3.356e-05 4.156 3.23e-05 ***
TotalSulfurDioxide 1.058e-04 2.161e-05 4.896 9.77e-07 ***
Density -4.637e-01 1.885e-01 -2.460 0.013886 *
pH -2.507e-02 7.380e-03 -3.397 0.000680 ***
Sulphates -1.266e-02 5.369e-03 -2.359 0.018330 *
Alcohol 5.042e-03 1.346e-03 3.746 0.000180 ***
LabelAppeal 1.956e-01 5.995e-03 32.620 < 2e-16 ***
AcidIndex -1.227e-01 4.382e-03 -28.005 < 2e-16 ***
STARS 1.764e-01 5.701e-03 30.949 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 0.9634397)
Null deviance: 22861 on 12794 degrees of freedom
Residual deviance: 18649 on 12780 degrees of freedom
AIC: 50621
Number of Fisher Scoring iterations: 5
set.seed(100)
print('Goodness of Fit Test:')[1] "Goodness of Fit Test:"
with(model2, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE))) res.deviance df p
[1,] 18648.82 12780 2.330118e-228
model_metrics(model2)data frame with 0 columns and 0 rows
r2glm(model2)[1] 0.1842479
anova(model2, test = "Chisq")Analysis of Deviance Table
Model: poisson, link: log
Response: TARGET
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 12794 22861
FixedAcidity 1 37.64 12793 22823 8.527e-10 ***
VolatileAcidity 1 121.94 12792 22701 < 2.2e-16 ***
CitricAcid 1 0.96 12791 22700 0.3275400
ResidualSugar 1 3.94 12790 22696 0.0470345 *
Chlorides 1 20.12 12789 22676 7.255e-06 ***
FreeSulfurDioxide 1 23.83 12788 22653 1.053e-06 ***
TotalSulfurDioxide 1 33.27 12787 22619 8.018e-09 ***
Density 1 18.07 12786 22601 2.124e-05 ***
pH 1 1.47 12785 22600 0.2253360
Sulphates 1 13.36 12784 22586 0.0002573 ***
Alcohol 1 55.95 12783 22530 7.442e-14 ***
LabelAppeal 1 1978.17 12782 20552 < 2.2e-16 ***
AcidIndex 1 991.12 12781 19561 < 2.2e-16 ***
STARS 1 912.23 12780 18649 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
set.seed(100)
model3 <- glm(TARGET ~ ., family = quasipoisson(link='log'), final_impute1)
summary(model3)
Call:
glm(formula = TARGET ~ ., family = quasipoisson(link = "log"),
data = final_impute1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.153e+00 1.919e-01 11.219 < 2e-16 ***
FixedAcidity -5.151e-04 8.046e-04 -0.640 0.522044
VolatileAcidity -5.110e-02 6.368e-03 -8.024 1.11e-15 ***
CitricAcid 1.366e-02 5.787e-03 2.361 0.018229 *
ResidualSugar 1.757e-04 1.479e-04 1.188 0.234820
Chlorides -6.050e-02 1.580e-02 -3.829 0.000129 ***
FreeSulfurDioxide 1.395e-04 3.356e-05 4.156 3.25e-05 ***
TotalSulfurDioxide 1.058e-04 2.161e-05 4.896 9.89e-07 ***
Density -4.637e-01 1.885e-01 -2.460 0.013900 *
pH -2.507e-02 7.380e-03 -3.397 0.000682 ***
Sulphates -1.266e-02 5.369e-03 -2.359 0.018345 *
Alcohol 5.042e-03 1.346e-03 3.746 0.000180 ***
LabelAppeal 1.956e-01 5.995e-03 32.620 < 2e-16 ***
AcidIndex -1.227e-01 4.382e-03 -28.005 < 2e-16 ***
STARS 1.764e-01 5.701e-03 30.949 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 0.9634428)
Null deviance: 22861 on 12794 degrees of freedom
Residual deviance: 18649 on 12780 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
plot(model3)drop1(model3,test="F")Single term deletions
Model:
TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
Df Deviance F value Pr(>F)
<none> 18649
FixedAcidity 1 18649 0.2706 0.6029343
VolatileAcidity 1 18711 42.4910 7.368e-11 ***
CitricAcid 1 18654 3.6814 0.0550449 .
ResidualSugar 1 18650 0.9321 0.3343401
Chlorides 1 18663 9.6817 0.0018652 **
FreeSulfurDioxide 1 18666 11.4049 0.0007347 ***
TotalSulfurDioxide 1 18672 15.8225 6.995e-05 ***
Density 1 18655 3.9953 0.0456492 *
pH 1 18660 7.6213 0.0057765 **
Sulphates 1 18654 3.6737 0.0553011 .
Alcohol 1 18662 9.2670 0.0023379 **
LabelAppeal 1 19674 702.6412 < 2.2e-16 ***
AcidIndex 1 19457 553.6452 < 2.2e-16 ***
STARS 1 19561 625.1480 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_metrics(model3)data frame with 0 columns and 0 rows
r2glm(model3)[1] 0.1842479
anova(model3, test = "Chisq")Analysis of Deviance Table
Model: quasipoisson, link: log
Response: TARGET
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 12794 22861
FixedAcidity 1 37.64 12793 22823 4.102e-10 ***
VolatileAcidity 1 121.94 12792 22701 < 2.2e-16 ***
CitricAcid 1 0.96 12791 22700 0.3185300
ResidualSugar 1 3.94 12790 22696 0.0430405 *
Chlorides 1 20.12 12789 22676 4.868e-06 ***
FreeSulfurDioxide 1 23.83 12788 22653 6.588e-07 ***
TotalSulfurDioxide 1 33.27 12787 22619 4.190e-09 ***
Density 1 18.07 12786 22601 1.482e-05 ***
pH 1 1.47 12785 22600 0.2167376
Sulphates 1 13.36 12784 22586 0.0001964 ***
Alcohol 1 55.95 12783 22530 2.529e-14 ***
LabelAppeal 1 1978.17 12782 20552 < 2.2e-16 ***
AcidIndex 1 991.12 12781 19561 < 2.2e-16 ***
STARS 1 912.23 12780 18649 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
set.seed(100)
print('Goodness of Fit Test:')[1] "Goodness of Fit Test:"
with(model3, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE))) res.deviance df p
[1,] 18648.82 12780 2.330118e-228
set.seed(100)
model4 <- glm(TARGET ~ ., family = negative.binomial(1), final_impute1)
summary(model4)
Call:
glm(formula = TARGET ~ ., family = negative.binomial(1), data = final_impute1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.267e+00 2.066e-01 10.976 < 2e-16 ***
FixedAcidity -4.834e-04 8.664e-04 -0.558 0.576935
VolatileAcidity -5.736e-02 6.880e-03 -8.336 < 2e-16 ***
CitricAcid 1.257e-02 6.255e-03 2.010 0.044504 *
ResidualSugar 2.333e-04 1.596e-04 1.462 0.143803
Chlorides -6.870e-02 1.700e-02 -4.041 5.35e-05 ***
FreeSulfurDioxide 1.555e-04 3.620e-05 4.297 1.75e-05 ***
TotalSulfurDioxide 1.203e-04 2.323e-05 5.179 2.26e-07 ***
Density -4.729e-01 2.031e-01 -2.328 0.019924 *
pH -3.062e-02 7.942e-03 -3.855 0.000116 ***
Sulphates -1.634e-02 5.784e-03 -2.826 0.004726 **
Alcohol 4.712e-03 1.450e-03 3.250 0.001158 **
LabelAppeal 1.991e-01 6.440e-03 30.919 < 2e-16 ***
AcidIndex -1.373e-01 4.398e-03 -31.223 < 2e-16 ***
STARS 1.904e-01 6.321e-03 30.120 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1) family taken to be 0.2722089)
Null deviance: 9042.5 on 12794 degrees of freedom
Residual deviance: 7931.5 on 12780 degrees of freedom
AIC: 56693
Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')[1] "Goodness of Fit Test:"
with(model4, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE))) res.deviance df p
[1,] 7931.463 12780 1
model_metrics(model4)data frame with 0 columns and 0 rows
model4
Call: glm(formula = TARGET ~ ., family = negative.binomial(1), data = final_impute1)
Coefficients:
(Intercept) FixedAcidity VolatileAcidity CitricAcid
2.2670936 -0.0004834 -0.0573578 0.0125690
ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
0.0002333 -0.0686994 0.0001555 0.0001203
Density pH Sulphates Alcohol
-0.4729458 -0.0306159 -0.0163440 0.0047118
LabelAppeal AcidIndex STARS
0.1991308 -0.1373164 0.1904056
Degrees of Freedom: 12794 Total (i.e. Null); 12780 Residual
Null Deviance: 9043
Residual Deviance: 7931 AIC: 56690
#khat = 35593 and standard error = 59638
model4nb <- glm.nb(TARGET ~ .,final_impute1)
summary(model4nb)
Call:
glm.nb(formula = TARGET ~ ., data = final_impute1, init.theta = 34870.30494,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.153e+00 1.955e-01 11.011 < 2e-16 ***
FixedAcidity -5.151e-04 8.197e-04 -0.628 0.529762
VolatileAcidity -5.110e-02 6.488e-03 -7.876 3.38e-15 ***
CitricAcid 1.366e-02 5.896e-03 2.318 0.020475 *
ResidualSugar 1.757e-04 1.507e-04 1.166 0.243545
Chlorides -6.050e-02 1.610e-02 -3.758 0.000171 ***
FreeSulfurDioxide 1.395e-04 3.419e-05 4.080 4.51e-05 ***
TotalSulfurDioxide 1.058e-04 2.202e-05 4.806 1.54e-06 ***
Density -4.637e-01 1.920e-01 -2.415 0.015748 *
pH -2.507e-02 7.519e-03 -3.335 0.000854 ***
Sulphates -1.266e-02 5.470e-03 -2.315 0.020595 *
Alcohol 5.042e-03 1.371e-03 3.677 0.000236 ***
LabelAppeal 1.956e-01 6.108e-03 32.017 < 2e-16 ***
AcidIndex -1.227e-01 4.465e-03 -27.487 < 2e-16 ***
STARS 1.765e-01 5.809e-03 30.376 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(34870.3) family taken to be 1)
Null deviance: 22860 on 12794 degrees of freedom
Residual deviance: 18648 on 12780 degrees of freedom
AIC: 50623
Number of Fisher Scoring iterations: 1
Theta: 34870
Std. Err.: 59563
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -50590.95
halfnorm(residuals(model4nb))plot(log(fitted(model4nb)),log((final_impute1$TARGET-fitted(model4nb))^2),
xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)(dp<- sum(residuals(model4nb,type="pearson")^2)/model4nb$df.res)[1] 0.963362
summary(model4nb,dispersion=dp)
Call:
glm.nb(formula = TARGET ~ ., data = final_impute1, init.theta = 34870.30494,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.153e+00 1.919e-01 11.219 < 2e-16 ***
FixedAcidity -5.151e-04 8.046e-04 -0.640 0.522038
VolatileAcidity -5.110e-02 6.368e-03 -8.024 1.02e-15 ***
CitricAcid 1.366e-02 5.787e-03 2.361 0.018216 *
ResidualSugar 1.757e-04 1.479e-04 1.188 0.234778
Chlorides -6.050e-02 1.580e-02 -3.829 0.000129 ***
FreeSulfurDioxide 1.395e-04 3.356e-05 4.156 3.23e-05 ***
TotalSulfurDioxide 1.058e-04 2.161e-05 4.896 9.77e-07 ***
Density -4.637e-01 1.885e-01 -2.460 0.013887 *
pH -2.507e-02 7.380e-03 -3.397 0.000680 ***
Sulphates -1.266e-02 5.369e-03 -2.359 0.018327 *
Alcohol 5.042e-03 1.346e-03 3.746 0.000180 ***
LabelAppeal 1.956e-01 5.995e-03 32.620 < 2e-16 ***
AcidIndex -1.227e-01 4.382e-03 -28.005 < 2e-16 ***
STARS 1.765e-01 5.701e-03 30.949 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(34870.3) family taken to be 0.963362)
Null deviance: 22860 on 12794 degrees of freedom
Residual deviance: 18648 on 12780 degrees of freedom
AIC: 50623
Number of Fisher Scoring iterations: 1
Theta: 34870
Std. Err.: 59563
Warning while fitting theta: iteration limit reached
2 x log-likelihood: -50590.95
Negative Binomial reduced (remove insignificant variables first)
set.seed(100)
model5 <- glm(TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - Density , family = negative.binomial(1), final_impute1)
summary(model5)
Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar -
Density, family = negative.binomial(1), data = final_impute1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.800e+00 4.913e-02 36.628 < 2e-16 ***
VolatileAcidity -5.805e-02 6.876e-03 -8.443 < 2e-16 ***
Chlorides -7.013e-02 1.699e-02 -4.129 3.67e-05 ***
FreeSulfurDioxide 1.559e-04 3.617e-05 4.311 1.64e-05 ***
TotalSulfurDioxide 1.204e-04 2.321e-05 5.189 2.14e-07 ***
pH -3.037e-02 7.938e-03 -3.826 0.000131 ***
Sulphates -1.646e-02 5.779e-03 -2.849 0.004390 **
Alcohol 4.750e-03 1.449e-03 3.279 0.001044 **
LabelAppeal 1.994e-01 6.437e-03 30.975 < 2e-16 ***
AcidIndex -1.376e-01 4.323e-03 -31.823 < 2e-16 ***
STARS 1.906e-01 6.318e-03 30.164 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1) family taken to be 0.2719912)
Null deviance: 9042.5 on 12794 degrees of freedom
Residual deviance: 7934.7 on 12784 degrees of freedom
AIC: 56688
Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')[1] "Goodness of Fit Test:"
with(model5, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE))) res.deviance df p
[1,] 7934.724 12784 1
model_metrics(model5)data frame with 0 columns and 0 rows
halfnorm(residuals(model5))plot(log(fitted(model5)),log((final_impute1$TARGET-fitted(model5))^2),
xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)(dp<- sum(residuals(model5,type="pearson")^2)/model5$df.res)[1] 0.2719901
summary(model5,dispersion=dp)
Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar -
Density, family = negative.binomial(1), data = final_impute1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.800e+00 4.913e-02 36.628 < 2e-16 ***
VolatileAcidity -5.805e-02 6.876e-03 -8.443 < 2e-16 ***
Chlorides -7.013e-02 1.699e-02 -4.129 3.65e-05 ***
FreeSulfurDioxide 1.559e-04 3.617e-05 4.311 1.63e-05 ***
TotalSulfurDioxide 1.204e-04 2.321e-05 5.189 2.11e-07 ***
pH -3.037e-02 7.938e-03 -3.826 0.00013 ***
Sulphates -1.646e-02 5.779e-03 -2.849 0.00438 **
Alcohol 4.750e-03 1.449e-03 3.279 0.00104 **
LabelAppeal 1.994e-01 6.437e-03 30.975 < 2e-16 ***
AcidIndex -1.376e-01 4.323e-03 -31.823 < 2e-16 ***
STARS 1.906e-01 6.318e-03 30.164 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1) family taken to be 0.2719901)
Null deviance: 9042.5 on 12794 degrees of freedom
Residual deviance: 7934.7 on 12784 degrees of freedom
AIC: 56688
Number of Fisher Scoring iterations: 5
Negative Binomial reduced (remove additional significant variables)
set.seed(100)
model6 <- glm(TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - Density - VolatileAcidity - Chlorides - FreeSulfurDioxide - TotalSulfurDioxide , family = negative.binomial(1), final_impute1)
summary(model6)
Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar -
Density - VolatileAcidity - Chlorides - FreeSulfurDioxide -
TotalSulfurDioxide, family = negative.binomial(1), data = final_impute1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.818737 0.048649 37.385 < 2e-16 ***
pH -0.030350 0.007909 -3.837 0.000125 ***
Sulphates -0.015844 0.005760 -2.751 0.005952 **
Alcohol 0.004461 0.001443 3.091 0.002001 **
LabelAppeal 0.199783 0.006413 31.152 < 2e-16 ***
AcidIndex -0.139989 0.004294 -32.597 < 2e-16 ***
STARS 0.191107 0.006291 30.376 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1) family taken to be 0.2704147)
Null deviance: 9042.5 on 12794 degrees of freedom
Residual deviance: 7971.7 on 12788 degrees of freedom
AIC: 56717
Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')[1] "Goodness of Fit Test:"
with(model6, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE))) res.deviance df p
[1,] 7971.736 12788 1
model_metrics(model6)data frame with 0 columns and 0 rows
plot(model6)set.seed(100)
model7 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
as.factor(STARS),data=final_impute1)
summary(model7)
Call:
lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
as.factor(AcidIndex) + as.factor(STARS), data = final_impute1)
Residuals:
Min 1Q Median 3Q Max
-5.6149 -0.6851 0.3624 1.0981 4.8820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.716e+00 1.094e+00 1.569 0.11666
FixedAcidity -6.841e-04 2.334e-03 -0.293 0.76943
VolatileAcidity -1.493e-01 1.854e-02 -8.053 8.78e-16 ***
CitricAcid 3.785e-02 1.686e-02 2.244 0.02484 *
ResidualSugar 4.296e-04 4.307e-04 0.997 0.31863
Chlorides -1.839e-01 4.576e-02 -4.019 5.89e-05 ***
FreeSulfurDioxide 3.895e-04 9.757e-05 3.991 6.61e-05 ***
TotalSulfurDioxide 2.792e-04 6.255e-05 4.465 8.09e-06 ***
Density -1.402e+00 5.472e-01 -2.562 0.01041 *
pH -6.141e-02 2.139e-02 -2.872 0.00409 **
Sulphates -3.158e-02 1.557e-02 -2.028 0.04258 *
Alcohol 1.976e-02 3.909e-03 5.056 4.34e-07 ***
as.factor(LabelAppeal)-1 5.641e-01 7.894e-02 7.146 9.40e-13 ***
as.factor(LabelAppeal)0 1.160e+00 7.711e-02 15.043 < 2e-16 ***
as.factor(LabelAppeal)1 1.730e+00 8.077e-02 21.419 < 2e-16 ***
as.factor(LabelAppeal)2 2.396e+00 1.062e-01 22.565 < 2e-16 ***
as.factor(AcidIndex)5 1.199e+00 9.665e-01 1.241 0.21476
as.factor(AcidIndex)6 1.293e+00 9.491e-01 1.362 0.17311
as.factor(AcidIndex)7 1.180e+00 9.483e-01 1.245 0.21328
as.factor(AcidIndex)8 1.002e+00 9.485e-01 1.056 0.29078
as.factor(AcidIndex)9 5.423e-01 9.492e-01 0.571 0.56783
as.factor(AcidIndex)10 -9.396e-02 9.509e-01 -0.099 0.92129
as.factor(AcidIndex)11 -7.273e-01 9.539e-01 -0.762 0.44583
as.factor(AcidIndex)12 -9.939e-01 9.595e-01 -1.036 0.30028
as.factor(AcidIndex)13 -3.783e-01 9.688e-01 -0.390 0.69619
as.factor(AcidIndex)14 -4.739e-01 9.785e-01 -0.484 0.62820
as.factor(AcidIndex)15 2.691e-01 1.112e+00 0.242 0.80875
as.factor(AcidIndex)16 -1.093e+00 1.199e+00 -0.911 0.36232
as.factor(AcidIndex)17 -1.538e+00 1.134e+00 -1.356 0.17509
as.factor(STARS)2 7.323e-01 3.488e-02 20.996 < 2e-16 ***
as.factor(STARS)3 1.182e+00 4.137e-02 28.560 < 2e-16 ***
as.factor(STARS)4 1.657e+00 6.656e-02 24.897 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.638 on 12763 degrees of freedom
Multiple R-squared: 0.2784, Adjusted R-squared: 0.2766
F-statistic: 158.8 on 31 and 12763 DF, p-value: < 2.2e-16
sqrt(sum(model7$residuals^2) / model7$df.residual)[1] 1.63838
set.seed(100)
lm8 <- stepAIC(model7, direction = "both",
scope = list(upper = model7, lower = ~ 1),
scale = 0, trace = FALSE)
summary(lm8)
Call:
lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
as.factor(STARS), data = final_impute1)
Residuals:
Min 1Q Median 3Q Max
-5.5993 -0.6856 0.3608 1.0993 4.8883
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.746e+00 1.093e+00 1.597 0.11020
VolatileAcidity -1.495e-01 1.854e-02 -8.060 8.28e-16 ***
CitricAcid 3.776e-02 1.686e-02 2.239 0.02517 *
Chlorides -1.841e-01 4.576e-02 -4.022 5.79e-05 ***
FreeSulfurDioxide 3.908e-04 9.755e-05 4.006 6.20e-05 ***
TotalSulfurDioxide 2.809e-04 6.252e-05 4.492 7.10e-06 ***
Density -1.400e+00 5.471e-01 -2.559 0.01051 *
pH -6.119e-02 2.138e-02 -2.862 0.00422 **
Sulphates -3.176e-02 1.557e-02 -2.040 0.04134 *
Alcohol 1.969e-02 3.908e-03 5.039 4.74e-07 ***
as.factor(LabelAppeal)-1 5.647e-01 7.893e-02 7.154 8.87e-13 ***
as.factor(LabelAppeal)0 1.160e+00 7.710e-02 15.043 < 2e-16 ***
as.factor(LabelAppeal)1 1.731e+00 8.076e-02 21.428 < 2e-16 ***
as.factor(LabelAppeal)2 2.397e+00 1.062e-01 22.576 < 2e-16 ***
as.factor(AcidIndex)5 1.166e+00 9.659e-01 1.207 0.22736
as.factor(AcidIndex)6 1.258e+00 9.484e-01 1.327 0.18464
as.factor(AcidIndex)7 1.145e+00 9.476e-01 1.209 0.22685
as.factor(AcidIndex)8 9.674e-01 9.477e-01 1.021 0.30738
as.factor(AcidIndex)9 5.060e-01 9.484e-01 0.534 0.59367
as.factor(AcidIndex)10 -1.314e-01 9.500e-01 -0.138 0.89001
as.factor(AcidIndex)11 -7.653e-01 9.530e-01 -0.803 0.42193
as.factor(AcidIndex)12 -1.033e+00 9.585e-01 -1.078 0.28107
as.factor(AcidIndex)13 -4.182e-01 9.677e-01 -0.432 0.66565
as.factor(AcidIndex)14 -5.129e-01 9.774e-01 -0.525 0.59980
as.factor(AcidIndex)15 2.338e-01 1.111e+00 0.210 0.83330
as.factor(AcidIndex)16 -1.122e+00 1.198e+00 -0.936 0.34916
as.factor(AcidIndex)17 -1.578e+00 1.133e+00 -1.393 0.16358
as.factor(STARS)2 7.328e-01 3.487e-02 21.015 < 2e-16 ***
as.factor(STARS)3 1.182e+00 4.136e-02 28.578 < 2e-16 ***
as.factor(STARS)4 1.657e+00 6.655e-02 24.902 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.638 on 12765 degrees of freedom
Multiple R-squared: 0.2783, Adjusted R-squared: 0.2767
F-statistic: 169.8 on 29 and 12765 DF, p-value: < 2.2e-16
plot(lm8)set.seed(100)
# model9 <- zeroinfl(TARGET ~. | STARS, data=final_impute1)
# summary(model9)
model9 <- zeroinfl(TARGET ~., final_impute1, dist = 'poisson')
summary(model9)
Call:
zeroinfl(formula = TARGET ~ ., data = final_impute1, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-2.2428 -0.3653 0.1667 0.5012 4.3875
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.449e+00 2.065e-01 7.017 2.26e-12 ***
FixedAcidity 2.377e-04 8.575e-04 0.277 0.78165
VolatileAcidity -1.340e-02 6.872e-03 -1.950 0.05120 .
CitricAcid -2.172e-05 6.143e-03 -0.004 0.99718
ResidualSugar -6.728e-05 1.581e-04 -0.426 0.67037
Chlorides -1.801e-02 1.693e-02 -1.064 0.28743
FreeSulfurDioxide 3.035e-05 3.496e-05 0.868 0.38530
TotalSulfurDioxide -2.622e-05 2.228e-05 -1.177 0.23917
Density -3.341e-01 2.026e-01 -1.649 0.09917 .
pH 6.368e-03 7.930e-03 0.803 0.42195
Sulphates 1.363e-03 5.772e-03 0.236 0.81328
Alcohol 7.214e-03 1.427e-03 5.055 4.30e-07 ***
LabelAppeal 2.466e-01 6.428e-03 38.369 < 2e-16 ***
AcidIndex -1.589e-02 4.983e-03 -3.188 0.00143 **
STARS 9.578e-02 6.294e-03 15.217 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9321816 1.0390266 -5.709 1.13e-08 ***
FixedAcidity 0.0026326 0.0043097 0.611 0.541298
VolatileAcidity 0.2346703 0.0347026 6.762 1.36e-11 ***
CitricAcid -0.0803708 0.0314027 -2.559 0.010486 *
ResidualSugar -0.0013743 0.0008004 -1.717 0.085974 .
Chlorides 0.2554050 0.0850914 3.002 0.002686 **
FreeSulfurDioxide -0.0006658 0.0001796 -3.708 0.000209 ***
TotalSulfurDioxide -0.0008123 0.0001156 -7.026 2.12e-12 ***
Density 0.8564562 1.0224154 0.838 0.402211
pH 0.1938946 0.0400659 4.839 1.30e-06 ***
Sulphates 0.0911598 0.0290604 3.137 0.001707 **
Alcohol 0.0109721 0.0072459 1.514 0.129960
LabelAppeal 0.3333973 0.0331303 10.063 < 2e-16 ***
AcidIndex 0.4764164 0.0197806 24.085 < 2e-16 ***
STARS -0.5334391 0.0347708 -15.342 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 37
Log-likelihood: -2.254e+04 on 30 Df
set.seed(100)
library(randomForest)
model10 <- randomForest(TARGET ~., final_impute1)
model10
Call:
randomForest(formula = TARGET ~ ., data = final_impute1)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 2.554585
% Var explained: 31.15
set.seed(100)
hurd.mod2 <- hurdle(TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS |
VolatileAcidity + FreeSulfurDioxide +
TotalSulfurDioxide + pH + Sulphates + Alcohol +
LabelAppeal + AcidIndex + STARS,
data = final_impute1)
hurd.summ <- summary(hurd.mod2)
hurd.summ
Call:
hurdle(formula = TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS |
VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates +
Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)
Pearson residuals:
Min 1Q Median 3Q Max
-2.2746 -0.3759 0.1674 0.5034 3.6135
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.138239 0.043803 25.986 < 2e-16 ***
AcidIndex -0.017645 0.004924 -3.584 0.000339 ***
Alcohol 0.007507 0.001439 5.217 1.81e-07 ***
LabelAppeal 0.249872 0.006512 38.371 < 2e-16 ***
STARS 0.097081 0.006264 15.499 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.370e+00 2.030e-01 21.525 < 2e-16 ***
VolatileAcidity -2.145e-01 2.938e-02 -7.299 2.90e-13 ***
FreeSulfurDioxide 6.149e-04 1.540e-04 3.993 6.52e-05 ***
TotalSulfurDioxide 6.756e-04 9.873e-05 6.843 7.76e-12 ***
pH -1.645e-01 3.381e-02 -4.866 1.14e-06 ***
Sulphates -7.999e-02 2.453e-02 -3.261 0.00111 **
Alcohol -6.325e-03 6.163e-03 -1.026 0.30476
LabelAppeal -1.570e-01 2.710e-02 -5.793 6.91e-09 ***
AcidIndex -4.270e-01 1.688e-02 -25.305 < 2e-16 ***
STARS 5.063e-01 2.931e-02 17.272 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 11
Log-likelihood: -2.256e+04 on 15 Df
plot(fitted(hurd.mod2), fitted(model9),xlab="Hurdle pred",ylab="Zero infl poisson")
abline(0,1)set.seed(100)
model11 <- zeroinfl(TARGET ~ AcidIndex+ Alcohol + LabelAppeal + STARS | VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)
summary(model11)
Call:
zeroinfl(formula = TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS |
VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates +
Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)
Pearson residuals:
Min 1Q Median 3Q Max
-2.2910 -0.3650 0.1668 0.4996 4.1425
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.138204 0.043655 26.073 < 2e-16 ***
AcidIndex -0.016738 0.004915 -3.406 0.00066 ***
Alcohol 0.007265 0.001426 5.096 3.47e-07 ***
LabelAppeal 0.247012 0.006423 38.460 < 2e-16 ***
STARS 0.095769 0.006290 15.226 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.0456291 0.2381556 -21.186 < 2e-16 ***
VolatileAcidity 0.2468154 0.0343463 7.186 6.67e-13 ***
FreeSulfurDioxide -0.0007116 0.0001780 -3.997 6.42e-05 ***
TotalSulfurDioxide -0.0007974 0.0001146 -6.955 3.52e-12 ***
pH 0.1876248 0.0396506 4.732 2.22e-06 ***
Sulphates 0.0923097 0.0287160 3.215 0.00131 **
Alcohol 0.0104456 0.0072259 1.446 0.14829
LabelAppeal 0.3323864 0.0330600 10.054 < 2e-16 ***
AcidIndex 0.4755215 0.0193439 24.583 < 2e-16 ***
STARS -0.5340782 0.0346984 -15.392 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 22
Log-likelihood: -2.256e+04 on 15 Df
(lrt <- 2* (model9$loglik-model11$loglik))[1] 31.94336
1-pchisq(31.82583,15)[1] 0.006798843
exp(coef(model11)) count_(Intercept) count_AcidIndex count_Alcohol
3.121157914 0.983401152 1.007291473
count_LabelAppeal count_STARS zero_(Intercept)
1.280194333 1.100504780 0.006437409
zero_VolatileAcidity zero_FreeSulfurDioxide zero_TotalSulfurDioxide
1.279942812 0.999288666 0.999202910
zero_pH zero_Sulphates zero_Alcohol
1.206380833 1.096704402 1.010500378
zero_LabelAppeal zero_AcidIndex zero_STARS
1.394291527 1.608852970 0.586209398
For the count regression model, will you use a metric such as AIC, average squared error, etc.?
Be sure to explain how you can make inferences from the model, and discuss other relevant model output.
If you like the multiple linear regression model the best, please say why.
However, you must select a count regression model for model deployment.
Using the training data set, evaluate the performance of the count regression model. Make predictions using the evaluation data set.
set.seed(100)
predtrain <- data.frame(TARGET=final_impute1$TARGET,
model2=model2$fitted.values,
model3=model3$fitted.values,
model4=model4$fitted.values,
model5=model5$fitted.values,
model6=model6$fitted.values,
model7=model7$fitted.values,
model8=lm8$fitted.values,
model9=model9$fitted.values,
#model10=model10$fitted.values)
model11=model11$fitted.values,
model12=hurd.mod2$fitted.values)
predtrain <- round(predtrain, 0)
colnames(predtrain) <- c("TARGET","Poisson-Imputed" ,"Quasipoisson", "negative.binom", "negative.binom remove insig", "negative.binom reduce insig/sig v","linear", "linear step AIC", "Zero Inflated","zeroinfl with reduced", "hurdle model")
predtrain %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 5) +
geom_bar(fill="pink") +
theme_minimal() + labs(x="Cases Bought", y = "Count", title = "Prediction Histogram")hist(hurd.mod2$fitted.values)set.seed(100)
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
aic8 <- lm8$aic
bic2 <- model2$bic
bic3 <- model3$bic
bic4 <- model4$bic
bic5 <- model5$bic
bic6 <- model6$bic
bic7 <- model7$bic
bic8 <- lm8$bic
#aic9 <- model9@aic
#mse1 <- mean((final_impute1$TARGET - predict(model1))^2)
mse2 <- mean((final_impute1$TARGET - predict(model2))^2)
mse3 <- mean((final_impute1$TARGET - predict(model3))^2)
mse4 <- mean((final_impute1$TARGET - predict(model4))^2)
mse5 <- mean((final_impute1$TARGET - predict(model5))^2)
mse6 <- mean((final_impute1$TARGET - predict(model6))^2)
mse7 <- mean((final_impute1$TARGET - predict(model7))^2)
mse8 <- mean((final_impute1$TARGET - predict(lm8))^2)
#mse9 <- mean((final_impute1$TARGET - predict(model9))^2)
MSE <- c(mse2, mse3, mse4, mse5, mse6, mse7, mse8)
AIC <- c(aic2, aic3, aic4, aic5, aic6, aic7, aic8)
compare_df <- cbind(MSE,AIC)
rownames(compare_df) <- c("M2Poisson-Imputed" ,"M3Quasipoisson", "M4negative.binom", "M5negative.binom remove insig", "M6negative.binom reduce insig/sig v","M7linear", "M8linear step AIC")
DT::datatable(compare_df)final_impute1$PoissonImputed = predict(model2,final_impute1,type = "response")
final_impute1$Quasipoisson= predict(model3,final_impute1,type = "response")
final_impute1$NB_M1= predict(model4,final_impute1,type = "response")
final_impute1$NB_RI= predict(model5,final_impute1,type = "response")
final_impute1$NB_SIS= predict(model6,final_impute1,type = "response")
final_impute1$Linear_M1= predict(model7,final_impute1)
final_impute1$Linear_STEP= predict(lm8,final_impute1)
final_impute1$ZeroInf_M1= predict(model9,final_impute1,type = "response")
final_impute1$RF= predict(model10,final_impute1,type = "response")
final_impute1$ZeroInf_M2= predict(model11,final_impute1,type = "response")
final_impute1$Hurdle= predict(hurd.mod2,final_impute1,type = "response")
Final_test_df = data.frame(
Models =c("M2Poisson-Imputed" ,"M3Quasipoisson", "M4negative.binom", "M5negative.binom remove insig", "M6negative.binom reduce insig/sig v","M7linear", "M8linear step AIC", "Zero Inflated","Random Forest","zeroinfl with reduced", "hurdle model"),
RMSE=c( sqrt(mean((final_impute1$TARGET - final_impute1$PoissonImputed)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$Quasipoisson)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$NB_M1)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$NB_RI)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$NB_SIS)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$Linear_M1)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$Linear_STEP)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$ZeroInf_M1)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$RF)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$ZeroInf_M2)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$Hurdle)^2))))
( sqrt(mean((final_impute1$TARGET - final_impute1$PoissonImputed)^2)))[1] 1.659053
Final_test_df Models RMSE
1 M2Poisson-Imputed 1.6590528
2 M3Quasipoisson 1.6590528
3 M4negative.binom 1.6659795
4 M5negative.binom remove insig 1.6668970
5 M6negative.binom reduce insig/sig v 1.6726298
6 M7linear 1.6363296
7 M8linear step AIC 1.6363996
8 Zero Inflated 1.6399992
9 Random Forest 0.6974797
10 zeroinfl with reduced 1.6429352
11 hurdle model 1.6452958
set.seed(100)
model2_fitted.values <- factor(round(model2$fitted.values),levels=rev(0:9))
lm8_fitted.values <- factor(round(lm8$fitted.values),levels=rev(0:9))
model9_fitted.values <- factor(round(model9$fitted.values),levels=rev(0:9))
model11_fitted.values <- factor(round(model11$fitted.values),levels=rev(0:9))
model12_fitted.values <- factor(round(model9$fitted.values),levels=rev(0:9))
m2_cfm <- confusionMatrix(model2_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m8_cfm <- confusionMatrix(lm8_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m9_cfm <- confusionMatrix(model9_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m11_cfm <- confusionMatrix(model11_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m12_cfm <- confusionMatrix(model12_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
models_sum <- data.frame(m2_cfm$overall, m8_cfm$overall, m9_cfm$overall,m11_cfm$overall,m12_cfm$overall)
colnames(models_sum) <- c("Poisson (Imputed)" ,"Linear (stepaic)", "Zero Inflated","zeroinfl with reduced", "hurdle model")
round(models_sum, 2) Poisson (Imputed) Linear (stepaic) Zero Inflated
Accuracy 0.25 0.27 0.27
Kappa 0.10 0.12 0.12
AccuracyLower 0.24 0.27 0.26
AccuracyUpper 0.25 0.28 0.28
AccuracyNull 0.25 0.25 0.25
AccuracyPValue 0.62 0.00 0.00
McnemarPValue NaN NaN NaN
zeroinfl with reduced hurdle model
Accuracy 0.27 0.27
Kappa 0.12 0.12
AccuracyLower 0.26 0.26
AccuracyUpper 0.27 0.28
AccuracyNull 0.25 0.25
AccuracyPValue 0.00 0.00
McnemarPValue NaN NaN
imputetest <- mice(wine_test)
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
meth <- imputetest$method
predM <- imputetest$predictorMatrix
predM[, c("TARGET")] <- 0
data_test_impute <- mice(wine_test, method = 'rf', predictorMatrix=predM)
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
final_testimpute <- complete(data_test_impute)
print(paste0("Missing value after imputation: ", sum(is.na(final_testimpute))))[1] "Missing value after imputation: 3335"
test_predict <- predict(model9, newdata=final_testimpute)
test_predict <- round(test_predict,0)
data_pred <- data.frame(TARGET=test_predict)
ggplot(data_pred, aes(x=TARGET)) + geom_bar(fill="light pink") + theme_minimal() +
labs(y="Count", title = "Prediction: Zero Inflated Model (Model9)") +
scale_x_discrete(name = "Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))ggplot(wine_train, aes(x=TARGET)) + geom_bar(fill="light blue") + theme_minimal() +
labs(y="Count", title = "Actual") +
scale_x_discrete(name = "Training dataset Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))write.csv(test_predict, "WinePredictionsFinal.csv")