In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
## Warning: package 'ggplot2' was built under R version 4.3.3
Loading datasets:
Train data
#Loading dataset from the URL
wine_train <- read_csv("https://raw.githubusercontent.com/lburenkov/Wine/main/wine-training-data.csv", show_col_types = FALSE)
#View the structure of the dataset
str(wine_train)## spc_tbl_ [12,795 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ INDEX : num [1:12795] 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : num [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : num [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : num [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : num [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
## - attr(*, "spec")=
## .. cols(
## .. INDEX = col_double(),
## .. TARGET = col_double(),
## .. FixedAcidity = col_double(),
## .. VolatileAcidity = col_double(),
## .. CitricAcid = col_double(),
## .. ResidualSugar = col_double(),
## .. Chlorides = col_double(),
## .. FreeSulfurDioxide = col_double(),
## .. TotalSulfurDioxide = col_double(),
## .. Density = col_double(),
## .. pH = col_double(),
## .. Sulphates = col_double(),
## .. Alcohol = col_double(),
## .. LabelAppeal = col_double(),
## .. AcidIndex = col_double(),
## .. STARS = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Evaluation data
wine_evaluation <- read_csv("https://raw.githubusercontent.com/lburenkov/Wine/main/wine-evaluation-data.csv", show_col_types = FALSE)
#View the structure of the dataset
str(wine_train)## spc_tbl_ [12,795 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ INDEX : num [1:12795] 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : num [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : num [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : num [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : num [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
## - attr(*, "spec")=
## .. cols(
## .. INDEX = col_double(),
## .. TARGET = col_double(),
## .. FixedAcidity = col_double(),
## .. VolatileAcidity = col_double(),
## .. CitricAcid = col_double(),
## .. ResidualSugar = col_double(),
## .. Chlorides = col_double(),
## .. FreeSulfurDioxide = col_double(),
## .. TotalSulfurDioxide = col_double(),
## .. Density = col_double(),
## .. pH = col_double(),
## .. Sulphates = col_double(),
## .. Alcohol = col_double(),
## .. LabelAppeal = col_double(),
## .. AcidIndex = col_double(),
## .. STARS = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## Warning: package 'kableExtra' was built under R version 4.3.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
## Warning: package 'skimr' was built under R version 4.3.3
#Summary statistics for numeric variables
numeric_summary <- sapply(wine_train, function(x) {
if (is.numeric(x)) {
return(c(min = min(x), max = max(x), mean = mean(x), median = median(x), sd = sd(x)))
} else {
return(rep(NA, 5)) # For non-numeric variables, return NA
}
})
#Combining summary statistics for numeric variables
train_df <- data.frame(
Variable = names(wine_train),
Min = numeric_summary[1, ],
Max = numeric_summary[2, ],
Mean = numeric_summary[3, ],
Median = numeric_summary[4, ],
SD = numeric_summary[5, ],
stringsAsFactors = FALSE
)
#Printing table
train_df %>%
kable("html", escape = FALSE, align = "c") %>%
kable_styling("striped", full_width = TRUE) %>%
column_spec(1, bold = TRUE)| Variable | Min | Max | Mean | Median | SD | |
|---|---|---|---|---|---|---|
| INDEX | INDEX | 1.00000 | 16129.00000 | 8069.9803048 | 8110.00000 | 4656.9051071 |
| TARGET | TARGET | 0.00000 | 8.00000 | 3.0290739 | 3.00000 | 1.9263682 |
| FixedAcidity | FixedAcidity | -18.10000 | 34.40000 | 7.0757171 | 6.90000 | 6.3176435 |
| VolatileAcidity | VolatileAcidity | -2.79000 | 3.68000 | 0.3241039 | 0.28000 | 0.7840142 |
| CitricAcid | CitricAcid | -3.24000 | 3.86000 | 0.3084127 | 0.31000 | 0.8620798 |
| ResidualSugar | ResidualSugar | NA | NA | NA | NA | NA |
| Chlorides | Chlorides | NA | NA | NA | NA | NA |
| FreeSulfurDioxide | FreeSulfurDioxide | NA | NA | NA | NA | NA |
| TotalSulfurDioxide | TotalSulfurDioxide | NA | NA | NA | NA | NA |
| Density | Density | 0.88809 | 1.09924 | 0.9942027 | 0.99449 | 0.0265376 |
| pH | pH | NA | NA | NA | NA | NA |
| Sulphates | Sulphates | NA | NA | NA | NA | NA |
| Alcohol | Alcohol | NA | NA | NA | NA | NA |
| LabelAppeal | LabelAppeal | -2.00000 | 2.00000 | -0.0090660 | 0.00000 | 0.8910892 |
| AcidIndex | AcidIndex | 4.00000 | 17.00000 | 7.7727237 | 8.00000 | 1.3239264 |
| STARS | STARS | NA | NA | NA | NA | NA |
The data shows a large number of zero values in the TARGET variable, it indicates that there are instances where no wine boxes were sold.
“AcidIndex” appears to follow a Poisson distribution with a slight right skew, suggesting clustering around a central value with a tail towards higher values.
“LabelAppearl” and “STARS” seem more categorical, indicating they likely represent categorical or ordinal data with distinct levels or rankings.
wine_hist <- wine_train[,-c(1, 2, 14:16)]
wine_hist %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 35) ## Warning: Removed 4841 rows containing non-finite outside the scale range
## (`stat_bin()`).
It suggests that they have peaked distributions with heavy tails
kable(cor(drop_na(wine_train))[,14], "html", escape = F) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1, bold = T) %>%
scroll_box(height = "500px")| x | |
|---|---|
| INDEX | 0.0314911 |
| TARGET | 0.4979465 |
| FixedAcidity | 0.0113760 |
| VolatileAcidity | -0.0202420 |
| CitricAcid | 0.0153316 |
| ResidualSugar | -0.0045793 |
| Chlorides | -0.0063870 |
| FreeSulfurDioxide | 0.0149601 |
| TotalSulfurDioxide | -0.0027237 |
| Density | -0.0180944 |
| pH | 0.0002182 |
| Sulphates | 0.0037687 |
| Alcohol | -0.0006449 |
| LabelAppeal | 1.0000000 |
| AcidIndex | 0.0103010 |
| STARS | 0.3188970 |
library(ggplot2)
#Calculating correlation matrix
correlation_matrix <- cor(drop_na(wine_train))
#Selecting correlations for the 14th column (assuming it's numeric)
correlation_14 <- correlation_matrix[, 14]
#Converting to dataframe for visualization
correlation_df <- data.frame(variable = names(correlation_14), correlation = correlation_14)
#Plotting heatmap
ggplot(correlation_df, aes(x = variable, y = "", fill = correlation)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
labs(title = "Correlation with 14th Column",
x = "Variable",
y = NULL) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))The correlation visualization provided below illustrates the relationships among variables in the dataset. Upon inspection, it’s evident that certain variables exhibit stronger associations compared to others.
#Summary of missing values
missing_summary <- summary(wine_train)
#Printing the summary
print(missing_summary)## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
The data shows multiple variables with missing variables. The STARS variable has the most NA values.
For numeric variables
#Replacing missing values in numeric variables with mean
numeric_vars <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
"Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density",
"pH", "Sulphates", "Alcohol", "AcidIndex")
for (var in numeric_vars) {
wine_train[[var]][is.na(wine_train[[var]])] <- mean(wine_train[[var]], na.rm = TRUE)
}
#Checking if missing values have been imputed
sum(is.na(wine_train[numeric_vars]))## [1] 0
#Replacing missing values in categorical variables with mode
categorical_vars <- c("LabelAppeal", "STARS")
for (var in categorical_vars) {
mode_val <- names(sort(table(wine_train[[var]], useNA = "ifany"), decreasing = TRUE)[1])
wine_train[[var]][is.na(wine_train[[var]])] <- mode_val
}
#Checking if missing values have been imputed
sum(is.na(wine_train[categorical_vars]))## [1] 0
## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.17100 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: 0.900 1st Qu.: 0.00000 1st Qu.: 5.00
## Median : 0.3100 Median : 4.900 Median : 0.04800 Median : 30.85
## Mean : 0.3084 Mean : 5.419 Mean : 0.05482 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 14.900 3rd Qu.: 0.12800 3rd Qu.: 64.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.35100 Max. : 623.00
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 34.0 1st Qu.:0.9877 1st Qu.:2.970 1st Qu.: 0.3400
## Median : 120.7 Median :0.9945 Median :3.208 Median : 0.5271
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 198.0 3rd Qu.:1.0005 3rd Qu.:3.450 3rd Qu.: 0.7700
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Length:12795 Min. : 4.000 Length:12795
## 1st Qu.: 9.10 Class :character 1st Qu.: 7.000 Class :character
## Median :10.49 Mode :character Median : 8.000 Mode :character
## Mean :10.49 Mean : 7.773
## 3rd Qu.:12.20 3rd Qu.: 8.000
## Max. :26.50 Max. :17.000
#Replacing missing values in numeric variables with mean
numeric_vars <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
"Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density",
"pH", "Sulphates", "Alcohol", "AcidIndex")
for (var in numeric_vars) {
wine_evaluation[[var]][is.na(wine_evaluation[[var]])] <- mean(wine_evaluation[[var]], na.rm = TRUE)
}
#Checking if missing values have been imputed
sum(is.na(wine_evaluation[numeric_vars]))## [1] 0
#Replacing missing values in categorical variables with mode
categorical_vars <- c("LabelAppeal", "STARS")
for (var in categorical_vars) {
mode_val <- names(sort(table(wine_evaluation[[var]], useNA = "ifany"), decreasing = TRUE)[1])
wine_evaluation[[var]][is.na(wine_evaluation[[var]])] <- mode_val
}
#Checking if missing values have been imputed
sum(is.na(wine_evaluation[categorical_vars]))## [1] 0
Model 1
For the first Poisson regression model, we are using the FixedAcidity, VolatileAcidity, and CitricAcid variables as predictors for predicting the count of wine boxes sold (TARGET).
##
## Attaching package: 'MASS'
## The following objects are masked from 'package:openintro':
##
## housing, mammals
## The following object is masked from 'package:dplyr':
##
## select
#Model 1: Poisson regression with FixedAcidity, VolatileAcidity, and CitricAcid as predictors
poisson_model_1 <- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid,
family = poisson, data = wine_train)
#Summary of the model
summary(poisson_model_1)##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid,
## family = poisson, data = wine_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1617475 0.0079134 146.807 < 2e-16 ***
## FixedAcidity -0.0048317 0.0008032 -6.016 1.79e-09 ***
## VolatileAcidity -0.0714368 0.0064769 -11.029 < 2e-16 ***
## CitricAcid 0.0057702 0.0058937 0.979 0.328
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 22700 on 12791 degrees of freedom
## AIC: 54650
##
## Number of Fisher Scoring iterations: 5
Model 2:
For the second Poisson regression model, we are using different predictor variables or transformations, such as ResidualSugar, Chlorides, and pH.
#Model 2: Poisson regression with ResidualSugar, Chlorides, and pH as predictors
poisson_model_2 <- glm(TARGET ~ ResidualSugar + Chlorides + pH,
family = poisson, data = wine_train)
#Summary of the model
summary(poisson_model_2)##
## Call:
## glm(formula = TARGET ~ ResidualSugar + Chlorides + pH, family = poisson,
## data = wine_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1412293 0.0248905 45.850 < 2e-16 ***
## ResidualSugar 0.0003088 0.0001543 2.001 0.0454 *
## Chlorides -0.0764960 0.0163693 -4.673 2.97e-06 ***
## pH -0.0096034 0.0075958 -1.264 0.2061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 22834 on 12791 degrees of freedom
## AIC: 54784
##
## Number of Fisher Scoring iterations: 5
Model 3
For the first negative binomial regression model, we are using the same predictor variables as in Model 1.
#Model 3: Negative binomial regression with FixedAcidity, VolatileAcidity, and CitricAcid as predictors
negbin_model_1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid, data = wine_train)
#Summary of the model
summary(negbin_model_1)##
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid,
## data = wine_train, init.theta = 7.750395767, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1621987 0.0093941 123.715 < 2e-16 ***
## FixedAcidity -0.0048628 0.0009482 -5.129 2.92e-07 ***
## VolatileAcidity -0.0719691 0.0076496 -9.408 < 2e-16 ***
## CitricAcid 0.0055331 0.0069537 0.796 0.426
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(7.7504) family taken to be 1)
##
## Null deviance: 18317 on 12794 degrees of freedom
## Residual deviance: 18201 on 12791 degrees of freedom
## AIC: 54224
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 7.750
## Std. Err.: 0.458
##
## 2 x log-likelihood: -54214.339
Model 4:
For the second negative binomial regression model, we are using the same predictor variables as in Model 2.
#Model 4: Negative binomial regression with ResidualSugar, Chlorides, and pH as predictors
negbin_model_2 <- glm.nb(TARGET ~ ResidualSugar + Chlorides + pH, data = wine_train)
#Summary of the model
summary(negbin_model_2)##
## Call:
## glm.nb(formula = TARGET ~ ResidualSugar + Chlorides + pH, data = wine_train,
## init.theta = 7.406172002, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1413530 0.0295564 38.616 < 2e-16 ***
## ResidualSugar 0.0003097 0.0001832 1.691 0.0909 .
## Chlorides -0.0769216 0.0194343 -3.958 7.56e-05 ***
## pH -0.0096367 0.0090162 -1.069 0.2852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(7.4062) family taken to be 1)
##
## Null deviance: 18161 on 12794 degrees of freedom
## Residual deviance: 18141 on 12791 degrees of freedom
## AIC: 54320
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 7.406
## Std. Err.: 0.423
##
## 2 x log-likelihood: -54310.142
Model 5:
For the first multiple linear regression model, we use Density, Sulphates, and Alcohol as predictors.
#Model 5: Multiple linear regression with Density, Sulphates, and Alcohol as predictors
linear_model_1 <- lm(TARGET ~ Density + Sulphates + Alcohol, data = wine_train)
#Summary of the model
summary(linear_model_1)##
## Call:
## lm(formula = TARGET ~ Density + Sulphates + Alcohol, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6073 -1.1254 0.1536 1.1787 5.2562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.292012 0.638765 8.285 < 2e-16 ***
## Density -2.570721 0.639855 -4.018 5.91e-05 ***
## Sulphates -0.081394 0.019144 -4.252 2.14e-05 ***
## Alcohol 0.032012 0.004676 6.846 7.92e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.921 on 12791 degrees of freedom
## Multiple R-squared: 0.006288, Adjusted R-squared: 0.006055
## F-statistic: 26.98 on 3 and 12791 DF, p-value: < 2.2e-16
Model 6:
For the second multiple linear regression model, we use LabelAppeal and STARS as categorical predictors.
#Converting LabelAppeal and STARS to factors
wine_train$LabelAppeal <- as.factor(wine_train$LabelAppeal)
wine_train$STARS <- as.factor(wine_train$STARS)
#Model 6: Multiple linear regression with LabelAppeal and STARS as predictors
linear_model_2 <- lm(TARGET ~ LabelAppeal + STARS, data = wine_train)
#Summary of the model
summary(linear_model_2)##
## Call:
## lm(formula = TARGET ~ LabelAppeal + STARS, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7550 -0.8727 0.4058 1.2375 4.9206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.19176 0.03736 58.664 < 2e-16 ***
## LabelAppeal-2 -0.63227 0.07816 -8.089 6.53e-16 ***
## LabelAppeal0 0.57074 0.03658 15.601 < 2e-16 ***
## LabelAppeal1 1.05589 0.04270 24.730 < 2e-16 ***
## LabelAppeal2 1.56325 0.08056 19.405 < 2e-16 ***
## STARS2 -0.16826 0.03560 -4.727 2.30e-06 ***
## STARS3 1.62507 0.04672 34.783 < 2e-16 ***
## STARS4 2.30900 0.07422 31.109 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.627 on 12787 degrees of freedom
## Multiple R-squared: 0.2867, Adjusted R-squared: 0.2863
## F-statistic: 734.1 on 7 and 12787 DF, p-value: < 2.2e-16
# Summary of the models
summary_list <- list(
poisson_model_1 = summary(poisson_model_1),
poisson_model_2 = summary(poisson_model_2),
negbin_model_1 = summary(negbin_model_1),
negbin_model_2 = summary(negbin_model_2),
linear_model_1 = summary(linear_model_1),
linear_model_2 = summary(linear_model_2)
)
# Extracting coefficient information
coefficients_list <- lapply(summary_list, function(model_summary) {
if (inherits(model_summary, "summary.glm")) {
coefficients <- model_summary$coefficients
} else if (inherits(model_summary, "summary.glm.nb")) {
coefficients <- model_summary$coefficients
} else {
coefficients <- model_summary$coefficients
}
return(coefficients)
})
# Printing coefficients for each model
for (i in seq_along(coefficients_list)) {
cat("Model", i, "Coefficients:\n")
print(coefficients_list[[i]])
cat("\n")
}## Model 1 Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.161747485 0.0079134438 146.8068160 0.000000e+00
## FixedAcidity -0.004831658 0.0008031771 -6.0156824 1.791304e-09
## VolatileAcidity -0.071436847 0.0064769027 -11.0294765 2.754637e-28
## CitricAcid 0.005770176 0.0058936631 0.9790475 3.275565e-01
##
## Model 2 Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1412292811 0.024890461 45.850066 0.000000e+00
## ResidualSugar 0.0003088131 0.000154305 2.001317 4.535826e-02
## Chlorides -0.0764960214 0.016369307 -4.673137 2.966332e-06
## pH -0.0096034085 0.007595844 -1.264298 2.061232e-01
##
## Model 3 Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.162198684 0.0093941373 123.7153185 0.000000e+00
## FixedAcidity -0.004862769 0.0009481591 -5.1286420 2.918397e-07
## VolatileAcidity -0.071969065 0.0076496382 -9.4081658 5.048697e-21
## CitricAcid 0.005533134 0.0069537117 0.7957094 4.262009e-01
##
## Model 4 Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1413530338 0.0295564465 38.616044 0.000000e+00
## ResidualSugar 0.0003096661 0.0001831736 1.690560 9.092084e-02
## Chlorides -0.0769216411 0.0194343431 -3.958026 7.557164e-05
## pH -0.0096367078 0.0090162415 -1.068817 2.851524e-01
##
## Model 5 Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.29201176 0.638764971 8.284756 1.300695e-16
## Density -2.57072105 0.639855245 -4.017660 5.911518e-05
## Sulphates -0.08139384 0.019144113 -4.251638 2.137192e-05
## Alcohol 0.03201221 0.004675772 6.846400 7.918962e-12
##
## Model 6 Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1917643 0.03736160 58.663560 0.000000e+00
## LabelAppeal-2 -0.6322650 0.07815883 -8.089490 6.529438e-16
## LabelAppeal0 0.5707390 0.03658350 15.600996 2.270452e-54
## LabelAppeal1 1.0558870 0.04269695 24.729800 6.264121e-132
## LabelAppeal2 1.5632546 0.08056078 19.404660 1.084250e-82
## STARS2 -0.1682647 0.03559572 -4.727103 2.301750e-06
## STARS3 1.6250694 0.04672043 34.782845 2.297525e-253
## STARS4 2.3090028 0.07422290 31.109034 7.048768e-205
Model 1 (Poisson Regression):
The intercept is highly significant, indicating the expected count of wine boxes sold when all predictor variables are zero. FixedAcidity and VolatileAcidity have significant negative coefficients, suggesting that as these variables increase, the count of wine boxes sold decreases. CitricAcid’s coefficient is not statistically significant at conventional levels (p-value > 0.05).
Model 2 (Poisson Regression):
The intercept is highly significant, similar to Model 1. ResidualSugar and Chlorides have significant coefficients, indicating their impact on the count of wine boxes sold. pH’s coefficient is not statistically significant at conventional levels.
Model 3 (Negative Binomial Regression):
Similar to Model 1, FixedAcidity and VolatileAcidity have significant negative coefficients, while CitricAcid’s coefficient is not significant. The intercept is highly significant, indicating the expected count when all predictors are zero.
Model 4 (Negative Binomial Regression):
Similar to Model 2, ResidualSugar and Chlorides have significant coefficients, while pH’s coefficient is not significant. The intercept is highly significant, similar to other models.
Model 5 (Multiple Linear Regression):
Density, Sulphates, and Alcohol have significant coefficients. The intercept is highly significant, indicating the expected wine box sales when all predictors are zero. Unlike the Poisson and Negative Binomial models, this is a continuous outcome model, and the coefficients represent the change in wine box sales for a one-unit change in the predictor variable.
Model 6 (Multiple Linear Regression):
LabelAppeal and STARS are categorical predictors, and each level has its coefficient. All coefficients for different levels of LabelAppeal and STARS are highly significant. The intercept represents the expected wine box sales for the reference levels of LabelAppeal and STARS. Overall, the coefficients in each model seem to align with our expectations. For example, in the Poisson and Negative Binomial models, variables like FixedAcidity and VolatileAcidity have negative coefficients, indicating a decrease in wine box sales with an increase in these variables. Similarly, in the Multiple Linear Regression models, significant predictors like Density, Sulphates, and Alcohol also show a coherent relationship with wine box sales.
We will evaluate the models by calculating the Mean Squared Error (MSE). The MSE measures the average squared difference between the predicted values and the actual values. It provides a single numerical value that indicates how well the model is performing in terms of prediction accuracy.
By comparing the MSE values of different models, we can identify which model performs better in terms of minimizing prediction errors and providing more accurate predictions. This helps in selecting the best model for the given task.
#Functions MRE for models
modelValidation <- function(mod, test) {
preds <- predict(mod, test)
diffMat <- as.numeric(preds) - as.numeric(test$TARGET)
diffMat <- diffMat^2
loss <- mean(diffMat)
return(loss)
}df1 <- modelValidation(poisson_model_1, as.data.frame(wine_train))
df2 <- modelValidation(poisson_model_2, as.data.frame(wine_train))
df3 <- modelValidation(negbin_model_1, as.data.frame(wine_train))
df4 <- modelValidation(negbin_model_2, as.data.frame(wine_train))
df5 <- modelValidation(linear_model_1, as.data.frame(wine_train))
df6 <- modelValidation(linear_model_2, as.data.frame(wine_train))We can visualize a comparison table with our models
compare_model1 <- c(df1)
compare_model2 <- c(df2)
compare_model3 <- c(df3)
compare_model4 <- c(df4)
compare_model5 <- c(df5)
compare_model6 <- c(df6)
compare <- data.frame(compare_model1, compare_model2, compare_model3, compare_model4, compare_model5, compare_model6)
colnames(compare) <- c("Poisson Model 2", "Poisson Model 1", "Negative BinomMod1", "Negative BinomMod1", "Multiple linear1", "Multiple linear2")
kable(compare)| Poisson Model 2 | Poisson Model 1 | Negative BinomMod1 | Negative BinomMod1 | Multiple linear1 | Multiple linear2 |
|---|---|---|---|---|---|
| 7.387152 | 7.397933 | 7.387092 | 7.397926 | 3.687273 | 2.64693 |
The loss values represent mean squared error (MSE), then lower MSE values indicate better performance.
Given the provided loss values:
Poisson Model 2: 7.387152 Poisson Model 1: 7.397933 Negative Binomial Model 1: 7.387092 Negative Binomial Model 2: 7.397926 Multiple Linear Model 1: 3.687273 Multiple Linear Model 2: 2.64693 We can see that Multiple Linear Model 2 (with a loss value of 2.64693) has the lowest loss among all the models, indicating better performance according to the evaluation metric used.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.336 2.939 3.029 3.030 3.121 3.719
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.391 2.594 2.763 3.059 3.248 6.064
By evaluating the performance of multiple models on the evaluation dataset, we can identify which model performs better in terms of prediction accuracy and generalization, aiding in the selection of the best model for the task.
Linear Models with R: Julian Faraway.