In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
The training dataset has 14 predictors, 1 response variable and 12795 observations.
From the 14 predictor variables, 3 are categorical (LabelAppeal
,AcidIndex
,STARS
), the other 11 are continuous numerical. The response variable TARGET
is categorical.
## Rows: 12,795
## Columns: 16
## $ TARGET <dbl> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4, 0, 0, ~
## $ INDEX <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19~
## $ FixedAcidity <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14.8, 5.5,~
## $ VolatileAcidity <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0.290, -1~
## $ CitricAcid <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0.40, 0.34~
## $ ResidualSugar <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21.50, 1.40~
## $ Chlorides <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0.060, 0.~
## $ FreeSulfurDioxide <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213, 62, 551~
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA, 180, 65~
## $ Density <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457, 0.9994~
## $ pH <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3.20, 4.9~
## $ Sulphates <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, NA, 0.26~
## $ Alcohol <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0, 1~
## $ LabelAppeal <dbl> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, ~
## $ AcidIndex <dbl> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8, 9, 8, 9~
## $ STARS <dbl> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3, NA, NA~
## Rows: 3,335
## Columns: 16
## $ IN <dbl> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 6~
## $ TARGET <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ FixedAcidity <dbl> 5.4, 12.4, 7.2, 6.2, 11.4, 17.6, 15.5, 15.9, 11.6, ~
## $ VolatileAcidity <dbl> -0.860, 0.385, 1.750, 0.100, 0.210, 0.040, 0.530, 1~
## $ CitricAcid <dbl> 0.27, -0.76, 0.17, 1.80, 0.28, -1.15, -0.53, 1.14, ~
## $ ResidualSugar <dbl> -10.70, -19.70, -33.00, 1.00, 1.20, 1.40, 4.60, 31.~
## $ Chlorides <dbl> 0.092, 1.169, 0.065, -0.179, 0.038, 0.535, 1.263, -~
## $ FreeSulfurDioxide <dbl> 23, -37, 9, 104, 70, -250, 10, 115, 35, 40, NA, -21~
## $ TotalSulfurDioxide <dbl> 398, 68, 76, 89, 53, 140, 17, 381, 83, 129, 583, -5~
## $ Density <dbl> 0.98527, 0.99048, 1.04641, 0.98877, 1.02899, 0.9502~
## $ pH <dbl> 5.02, 3.37, 4.61, 3.20, 2.54, 3.06, 3.07, 2.99, 3.3~
## $ Sulphates <dbl> 0.64, 1.09, 0.68, 2.11, -0.07, -0.02, 0.75, 0.31, 2~
## $ Alcohol <dbl> 12.30, 16.00, 8.55, 12.30, 4.80, 11.40, 8.50, 11.40~
## $ LabelAppeal <dbl> -1, 0, 0, -1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, -1, -1,~
## $ AcidIndex <dbl> 6, 6, 8, 8, 10, 8, 12, 7, 12, 7, 8, 10, 9, 8, 9, 11~
## $ STARS <dbl> NA, 2, 1, 1, NA, 4, 3, NA, NA, NA, 1, NA, 2, NA, NA~
Missing Values & Data Type Check
Missing Values & Data Type Check
Non_NAs | NAs | NA_Percent | |
---|---|---|---|
TARGET | 12795 | 0 | 0.0000000 |
INDEX | 12795 | 0 | 0.0000000 |
FixedAcidity | 12795 | 0 | 0.0000000 |
VolatileAcidity | 12795 | 0 | 0.0000000 |
CitricAcid | 12795 | 0 | 0.0000000 |
ResidualSugar | 12179 | 616 | 0.0481438 |
Chlorides | 12157 | 638 | 0.0498632 |
FreeSulfurDioxide | 12148 | 647 | 0.0505666 |
TotalSulfurDioxide | 12113 | 682 | 0.0533021 |
Density | 12795 | 0 | 0.0000000 |
pH | 12400 | 395 | 0.0308714 |
Sulphates | 11585 | 1210 | 0.0945682 |
Alcohol | 12142 | 653 | 0.0510356 |
LabelAppeal | 12795 | 0 | 0.0000000 |
AcidIndex | 12795 | 0 | 0.0000000 |
STARS | 9436 | 3359 | 0.2625244 |
A binary logistic regression model is built using the training set
, hence the training set
is used for data exploration.
training set
summary(training_modified)
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
training_modified %>%
scale() %>%
as.data.frame() %>%
stack() %>%
ggplot(aes(x = ind, y = values)) +
geom_boxplot(fill = 'deeppink4') +
labs(title = 'Boxplot: Scaled Training Set',
x = 'Variables',
y = 'Normalized_Values')+
theme(panel.background = element_rect(fill = 'grey'),axis.text.x=element_text(size=10, angle=90))
Boxplot: Scaled Training Set
From the above Boxplot outliers occur in: FixedAcidity
, VolatileAcidity
, CitricAcid
, ResidualSugar
, Chlorides
, FreeSulfurDioxide
, TotalSulfurDioxide
, Density
, pH
, Sulphates
, Alcohol
, LabelAppeal
andAcidIndex
.
library(qqplotr)
with(training_modified, c(summary(FixedAcidity), SD=sd(FixedAcidity), Skew=skewness(FixedAcidity), Kurt=kurtosis(FixedAcidity)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -18.10000000 5.20000000 6.90000000 7.07571708 9.50000000 34.40000000
## SD Skew Kurt
## 6.31764346 -0.02258596 1.67499867
hist <- ggplot(training_modified, aes(FixedAcidity)) + geom_histogram(fill = 'dodgerblue', binwidth = 4, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of FixedAcidity') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=FixedAcidity)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of FixedAcidity") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", FixedAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of FixedAcidity', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), FixedAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of FixedAcidity by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(VolatileAcidity), SD=sd(VolatileAcidity), Skew=skewness(VolatileAcidity), Kurt=kurtosis(VolatileAcidity)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.79000000 0.13000000 0.28000000 0.32410395 0.64000000 3.68000000
## SD Skew Kurt
## 0.78401424 0.02037997 1.83221064
hist <- ggplot(training_modified, aes(VolatileAcidity)) + geom_histogram(fill = 'dodgerblue', binwidth = .5, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of VolatileAcidity') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=VolatileAcidity)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of VolatileAcidity") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", VolatileAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of VolatileAcidity', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), VolatileAcidity)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of VolatileAcidity by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(CitricAcid), SD=sd(CitricAcid), Skew=skewness(CitricAcid), Kurt=kurtosis(CitricAcid)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.24000000 0.03000000 0.31000000 0.30841266 0.58000000 3.86000000
## SD Skew Kurt
## 0.86207979 -0.05030704 1.83794007
hist <- ggplot(training_modified, aes(CitricAcid)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of CitricAcid') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=CitricAcid)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of CitricAcid") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", CitricAcid)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of CitricAcid', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), CitricAcid)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of CitricAcid by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(ResidualSugar), SD=sd(ResidualSugar), Skew=skewness(ResidualSugar), Kurt=kurtosis(ResidualSugar)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -127.800000 -2.000000 3.900000 5.418733 15.900000 141.150000
## NA's SD Skew Kurt
## 616.000000 NA NA NA
hist <- ggplot(training_modified, aes(ResidualSugar)) + geom_histogram(fill = 'dodgerblue', binwidth = 20, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of ResidualSugar') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=ResidualSugar)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of ResidualSugar") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", ResidualSugar)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of ResidualSugar', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), ResidualSugar)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of ResidualSugar by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(Chlorides), SD=sd(Chlorides), Skew=skewness(Chlorides), Kurt=kurtosis(Chlorides)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.17100000 -0.03100000 0.04600000 0.05482249 0.15300000 1.35100000
## NA's SD Skew Kurt
## 638.00000000 NA NA NA
hist <- ggplot(training_modified, aes(Chlorides)) + geom_histogram(fill = 'dodgerblue', binwidth = .2, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of Chlorides') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=Chlorides)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Chlorides") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", Chlorides)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of Chlorides', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Chlorides)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of Chlorides by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(FreeSulfurDioxide), SD=sd(FreeSulfurDioxide), Skew=skewness(FreeSulfurDioxide), Kurt=kurtosis(FreeSulfurDioxide)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -555.00000 0.00000 30.00000 30.84557 70.00000 623.00000 647.00000
## SD Skew Kurt
## NA NA NA
hist <- ggplot(training_modified, aes(FreeSulfurDioxide)) + geom_histogram(fill = 'dodgerblue', binwidth = 50, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of FreeSulfurDioxide') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=FreeSulfurDioxide)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of FreeSulfurDioxide") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", FreeSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of FreeSulfurDioxide', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), FreeSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of FreeSulfurDioxide by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(TotalSulfurDioxide), SD=sd(TotalSulfurDioxide), Skew=skewness(TotalSulfurDioxide), Kurt=kurtosis(TotalSulfurDioxide)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's SD
## -823.0000 27.0000 123.0000 120.7142 208.0000 1057.0000 682.0000 NA
## Skew Kurt
## NA NA
hist <- ggplot(training_modified, aes(TotalSulfurDioxide)) + geom_histogram(fill = 'dodgerblue', binwidth = 200, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of TotalSulfurDioxide') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=TotalSulfurDioxide)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of TotalSulfurDioxide") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", TotalSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of TotalSulfurDioxide', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), TotalSulfurDioxide)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of TotalSulfurDioxide by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(Density), SD=sd(Density), Skew=skewness(Density), Kurt=kurtosis(Density)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.88809000 0.98772000 0.99449000 0.99420272 1.00051500 1.09924000
## SD Skew Kurt
## 0.02653765 -0.01869376 1.89995921
hist <- ggplot(training_modified, aes(Density)) + geom_histogram(fill = 'dodgerblue', binwidth = .05, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of Density') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=Density)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Density") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", Density)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of Density', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Density)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of Density by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(Sulphates), SD=sd(Sulphates), Skew=skewness(Sulphates), Kurt=kurtosis(Sulphates)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.1300000 0.2800000 0.5000000 0.5271118 0.8600000 4.2400000
## NA's SD Skew Kurt
## 1210.0000000 NA NA NA
hist <- ggplot(training_modified, aes(Sulphates)) + geom_histogram(fill = 'dodgerblue', binwidth = .5, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of Sulphates') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=Sulphates)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Sulphates") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", Sulphates)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of Sulphates', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Sulphates)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of Sulphates by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(Alcohol), SD=sd(Alcohol), Skew=skewness(Alcohol), Kurt=kurtosis(Alcohol)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's SD
## -4.70000 9.00000 10.40000 10.48924 12.40000 26.50000 653.00000 NA
## Skew Kurt
## NA NA
hist <- ggplot(training_modified, aes(Alcohol)) + geom_histogram(fill = 'dodgerblue', binwidth = 2, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of Alcohol') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=Alcohol)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of Alcohol") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", Alcohol)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of Alcohol', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), Alcohol)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of Alcohol by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(LabelAppeal), SD=sd(LabelAppeal), Skew=skewness(LabelAppeal), Kurt=kurtosis(LabelAppeal)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.000000000 -1.000000000 0.000000000 -0.009066041 1.000000000 2.000000000
## SD Skew Kurt
## 0.891089247 0.008429457 -0.262291551
hist <- ggplot(training_modified, aes(LabelAppeal)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of LabelAppeal') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=LabelAppeal)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of LabelAppeal") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", LabelAppeal)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of LabelAppeal', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), LabelAppeal)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of LabelAppeal by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(AcidIndex), SD=sd(AcidIndex), Skew=skewness(AcidIndex), Kurt=kurtosis(AcidIndex)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. SD Skew
## 4.000000 7.000000 8.000000 7.772724 8.000000 17.000000 1.323926 1.648496
## Kurt
## 5.190092
hist <- ggplot(training_modified, aes(AcidIndex)) + geom_histogram(fill = 'dodgerblue', binwidth = 2, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of AcidIndex') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=AcidIndex)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of AcidIndex") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", AcidIndex)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of AcidIndex', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), AcidIndex)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of AcidIndex by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
with(training_modified, c(summary(STARS), SD=sd(STARS), Skew=skewness(STARS), Kurt=kurtosis(STARS)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000000 1.000000 2.000000 2.041755 3.000000 4.000000
## NA's SD Skew Kurt
## 3359.000000 NA NA NA
hist <- ggplot(training_modified, aes(STARS)) + geom_histogram(fill = 'dodgerblue', binwidth = 1, color = 'darkgray' ) +
theme_classic() + labs(title = 'Histogram of STARS') + theme(plot.title = element_text(hjust = 0.5))
qq_plot <- ggplot(training_modified, aes(sample=STARS)) + stat_qq_point(color='dodgerblue') + stat_qq_line(color='darkgray') +
labs(x="Thoretical Quantiles", y="Sample Quantiles", title = "QQ Plot of STARS") + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
box_plot <- ggplot(training_modified, aes(x="", STARS)) + geom_boxplot(fill='dodgerblue', color='darkgray')+ theme_classic() +
labs(title = 'Boxplot of STARS', x="") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
box_TARGET <- ggplot(training_modified, aes(x=factor(TARGET), STARS)) + geom_boxplot(fill='dodgerblue', color='darkgrey') +
labs(x='TARGET', title = 'Boxplot of STARS by TARGET') + theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
training_modified %>%
select_if(is.numeric) %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(x=value)) + # Plot the values
facet_wrap(~key, scales = "free") + # In separate panels
geom_density()
Density Plot: Training Set
From the scaled histogram and density plots, AcidIndex
is skewed right; AcidIndex
, STARS
, LabelAppeal
and TARGET
have multimodal distribution; while other variables seem to be normally distrbuted.
Correlation between variables in the dataset.
drop_na(training_modified) %>%
select_if(is.numeric) %>%
cor() %>%
#corrplot(method = "square", type = "upper", order = 'hclust', tl.col = "black", diag = FALSE, bg= 'white', col = colorRampPalette(c('deeppink4','white','steelblue1'))(100))
corrplot.mixed(upper = 'pie', lower = 'number', order = 'hclust', tl.col = "black")
Correlation Pie Chart: Training Set
From the correlation matrix below, the response variable TARGET
has strong positive relationship with variables FixedAcidity
,CitricAcid
,ResidualSugar
,Density
,Alcohol
.
PerformanceAnalytics::chart.Correlation(training_dataset, histogram=TRUE, pch=19)
Correlation Chart: Training Set
training_dataset %>%
gather(-TARGET, key = "key", value = "ResponseVariables") %>%
ggplot(aes(x = ResponseVariables, y = TARGET)) +
geom_point(size = .5) +
geom_smooth(method='lm',formula=y~x, color = 'dark grey')+
facet_wrap(~ key, scales = "free")+
ggthemes::theme_tufte()+
ylab('Cases Bought')
## Warning: Removed 8200 rows containing non-finite values (stat_smooth).
## Warning: Removed 8200 rows containing missing values (geom_point).
There are no unusual patterns. STARS
and LableAppleal
seem to have the greatest correlation.
training_dataset %>%
dplyr::select(-(INDEX)) %>%
cor() %>%
as.data.frame() %>%
rownames_to_column('Variable') %>%
dplyr::rename(Correlation_vs_Response = TARGET)
## Variable Correlation_vs_Response FixedAcidity VolatileAcidity
## 1 TARGET 1.000000000 -0.049010939 -0.08879321
## 2 FixedAcidity -0.049010939 1.000000000 0.01237500
## 3 VolatileAcidity -0.088793212 0.012375004 1.00000000
## 4 CitricAcid 0.008684633 0.014240478 -0.01695337
## 5 ResidualSugar NA NA NA
## 6 Chlorides NA NA NA
## 7 FreeSulfurDioxide NA NA NA
## 8 TotalSulfurDioxide NA NA NA
## 9 Density -0.035517502 0.006476528 0.01473492
## 10 pH NA NA NA
## 11 Sulphates NA NA NA
## 12 Alcohol NA NA NA
## 13 LabelAppeal 0.356500469 -0.003366431 -0.01698703
## 14 AcidIndex -0.246049449 0.178437010 0.04464154
## 15 STARS NA NA NA
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## 1 0.008684633 NA NA NA NA
## 2 0.014240478 NA NA NA NA
## 3 -0.016953371 NA NA NA NA
## 4 1.000000000 NA NA NA NA
## 5 NA 1 NA NA NA
## 6 NA NA 1 NA NA
## 7 NA NA NA 1 NA
## 8 NA NA NA NA 1
## 9 -0.013952174 NA NA NA NA
## 10 NA NA NA NA NA
## 11 NA NA NA NA NA
## 12 NA NA NA NA NA
## 13 0.008650177 NA NA NA NA
## 14 0.065697323 NA NA NA NA
## 15 NA NA NA NA NA
## Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
## 1 -0.035517502 NA NA NA 0.356500469 -0.24604945 NA
## 2 0.006476528 NA NA NA -0.003366431 0.17843701 NA
## 3 0.014734923 NA NA NA -0.016987027 0.04464154 NA
## 4 -0.013952174 NA NA NA 0.008650177 0.06569732 NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## 7 NA NA NA NA NA NA NA
## 8 NA NA NA NA NA NA NA
## 9 1.000000000 NA NA NA -0.009370077 0.04041295 NA
## 10 NA 1 NA NA NA NA NA
## 11 NA NA 1 NA NA NA NA
## 12 NA NA NA 1 NA NA NA
## 13 -0.009370077 NA NA NA 1.000000000 0.02475468 NA
## 14 0.040412947 NA NA NA 0.024754678 1.00000000 NA
## 15 NA NA NA NA NA NA 1
The data exploration process can be summarized in the data dictionary below:
data_stat <- training_dataset %>%
dplyr::select(-TARGET,-INDEX) %>%
gather() %>%
group_by(key) %>%
summarise(Mean = mean(value),
Median = median(value),
Max = max(value),
Min = min(value),
SD = sd(value))
data_cor <- training_dataset %>%
cor() %>%
as.data.frame() %>%
dplyr::select(TARGET) %>%
rownames_to_column('Variable') %>%
dplyr::rename(Correlation_vs_Response = TARGET)
training_dataset %>%
gather() %>%
dplyr::select(key) %>%
unique() %>%
dplyr::rename(Variable = key) %>%
mutate(
Missing_Value = 'No') %>%
left_join(data_stat, by = c('Variable'='key')) %>%
left_join(data_cor, by = 'Variable') %>%
mutate_if(is.numeric,round,2) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
Variable | Missing_Value | Mean | Median | Max | Min | SD | Correlation_vs_Response |
---|---|---|---|---|---|---|---|
TARGET | No | NA | NA | NA | NA | NA | 1.00 |
INDEX | No | NA | NA | NA | NA | NA | 0.00 |
FixedAcidity | No | 7.08 | 6.90 | 34.40 | -18.10 | 6.32 | -0.05 |
VolatileAcidity | No | 0.32 | 0.28 | 3.68 | -2.79 | 0.78 | -0.09 |
CitricAcid | No | 0.31 | 0.31 | 3.86 | -3.24 | 0.86 | 0.01 |
ResidualSugar | No | NA | NA | NA | NA | NA | NA |
Chlorides | No | NA | NA | NA | NA | NA | NA |
FreeSulfurDioxide | No | NA | NA | NA | NA | NA | NA |
TotalSulfurDioxide | No | NA | NA | NA | NA | NA | NA |
Density | No | 0.99 | 0.99 | 1.10 | 0.89 | 0.03 | -0.04 |
pH | No | NA | NA | NA | NA | NA | NA |
Sulphates | No | NA | NA | NA | NA | NA | NA |
Alcohol | No | NA | NA | NA | NA | NA | NA |
LabelAppeal | No | -0.01 | 0.00 | 2.00 | -2.00 | 0.89 | 0.36 |
AcidIndex | No | 7.77 | 8.00 | 17.00 | 4.00 | 1.32 | -0.25 |
STARS | No | NA | NA | NA | NA | NA | NA |
I split the data into training and test.
The MICE package comes in handy.
set.seed(999)
sampl = caTools::sample.split(training_modified$TARGET, SplitRatio = .80)
train_dataset_1 <- subset(training_modified, sampl == TRUE)
test_dataset_1 <- subset(training_modified, sampl == FALSE)
train_dataset_2 <- as.data.frame(tidyr::complete(mice(train_dataset_1, m=1, maxit = 5, seed = 42)))
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
test_dataset_2 <- as.data.frame(tidyr::complete(mice(test_dataset_1, m=1, maxit = 5, seed = 42)))
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
Perform log transform to the data.
train_dataset_2$AcidIndex <- log(train_dataset_2$AcidIndex)
test_dataset_2$AcidIndex <- log(test_dataset_2$AcidIndex)
require(ggplot2)
require(gridExtra)
model1 = glm(TARGET ~ ., data=train_dataset_1, family=poisson)
summary(model1)
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train_dataset_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2128 -0.2757 0.0647 0.3766 1.6981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.608e+00 2.796e-01 5.750 8.90e-09 ***
## FixedAcidity 6.705e-04 1.177e-03 0.570 0.56901
## VolatileAcidity -2.750e-02 9.283e-03 -2.963 0.00305 **
## CitricAcid -3.835e-03 8.519e-03 -0.450 0.65259
## ResidualSugar 1.828e-05 2.152e-04 0.085 0.93232
## Chlorides -3.764e-02 2.314e-02 -1.627 0.10377
## FreeSulfurDioxide 5.671e-05 4.892e-05 1.159 0.24630
## TotalSulfurDioxide 2.230e-05 3.177e-05 0.702 0.48274
## Density -4.025e-01 2.749e-01 -1.464 0.14326
## pH 2.307e-04 1.085e-02 0.021 0.98303
## Sulphates -5.984e-03 7.973e-03 -0.751 0.45293
## Alcohol 3.262e-03 2.004e-03 1.628 0.10360
## LabelAppeal 1.730e-01 8.858e-03 19.530 < 2e-16 ***
## AcidIndex -4.967e-02 6.666e-03 -7.451 9.28e-14 ***
## STARS 1.929e-01 8.328e-03 23.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4720.5 on 5143 degrees of freedom
## Residual deviance: 3242.8 on 5129 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18545
##
## Number of Fisher Scoring iterations: 5
plot(model1)
grid.arrange(hist, qq_plot, box_plot, box_TARGET, ncol=2)
model2 = glm(TARGET ~ ., data=train_dataset_2, family=poisson)
summary(model2)
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train_dataset_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9896 -0.6853 0.1295 0.6351 2.4378
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.350e+00 2.280e-01 10.308 < 2e-16 ***
## FixedAcidity 2.238e-04 9.202e-04 0.243 0.807855
## VolatileAcidity -4.578e-02 7.286e-03 -6.282 3.33e-10 ***
## CitricAcid 6.770e-03 6.566e-03 1.031 0.302500
## ResidualSugar 1.586e-04 1.674e-04 0.947 0.343467
## Chlorides -6.268e-02 1.794e-02 -3.494 0.000476 ***
## FreeSulfurDioxide 1.249e-04 3.801e-05 3.285 0.001019 **
## TotalSulfurDioxide 8.362e-05 2.457e-05 3.403 0.000666 ***
## Density -3.887e-01 2.145e-01 -1.813 0.069894 .
## pH -1.990e-02 8.406e-03 -2.367 0.017923 *
## Sulphates -1.035e-02 6.173e-03 -1.677 0.093602 .
## Alcohol 2.527e-03 1.557e-03 1.623 0.104531
## LabelAppeal 1.429e-01 6.779e-03 21.084 < 2e-16 ***
## AcidIndex -7.543e-01 4.005e-02 -18.833 < 2e-16 ***
## STARS 3.432e-01 6.241e-03 54.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18291 on 10236 degrees of freedom
## Residual deviance: 12767 on 10222 degrees of freedom
## AIC: 38355
##
## Number of Fisher Scoring iterations: 5
plot(model2)
model3 <- glm.nb(TARGET ~ ., data = train_dataset_1)
summary(model3)
##
## Call:
## glm.nb(formula = TARGET ~ ., data = train_dataset_1, init.theta = 138898.9107,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2127 -0.2757 0.0647 0.3766 1.6981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.608e+00 2.796e-01 5.750 8.91e-09 ***
## FixedAcidity 6.705e-04 1.177e-03 0.570 0.56900
## VolatileAcidity -2.750e-02 9.283e-03 -2.963 0.00305 **
## CitricAcid -3.835e-03 8.519e-03 -0.450 0.65259
## ResidualSugar 1.828e-05 2.152e-04 0.085 0.93231
## Chlorides -3.764e-02 2.314e-02 -1.627 0.10378
## FreeSulfurDioxide 5.671e-05 4.892e-05 1.159 0.24630
## TotalSulfurDioxide 2.230e-05 3.177e-05 0.702 0.48275
## Density -4.025e-01 2.750e-01 -1.464 0.14326
## pH 2.307e-04 1.085e-02 0.021 0.98303
## Sulphates -5.984e-03 7.973e-03 -0.751 0.45293
## Alcohol 3.262e-03 2.004e-03 1.628 0.10360
## LabelAppeal 1.730e-01 8.858e-03 19.529 < 2e-16 ***
## AcidIndex -4.967e-02 6.666e-03 -7.451 9.28e-14 ***
## STARS 1.929e-01 8.328e-03 23.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(138898.9) family taken to be 1)
##
## Null deviance: 4720.4 on 5143 degrees of freedom
## Residual deviance: 3242.7 on 5129 degrees of freedom
## (5093 observations deleted due to missingness)
## AIC: 18547
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 138899
## Std. Err.: 259921
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -18515.07
plot(model3)
model4 <- glm.nb(TARGET ~ ., data = train_dataset_2)
summary(model4)
##
## Call:
## glm.nb(formula = TARGET ~ ., data = train_dataset_2, init.theta = 49067.84702,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9896 -0.6852 0.1295 0.6351 2.4377
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.350e+00 2.280e-01 10.307 < 2e-16 ***
## FixedAcidity 2.238e-04 9.202e-04 0.243 0.807848
## VolatileAcidity -4.578e-02 7.287e-03 -6.282 3.34e-10 ***
## CitricAcid 6.770e-03 6.566e-03 1.031 0.302510
## ResidualSugar 1.586e-04 1.674e-04 0.947 0.343463
## Chlorides -6.268e-02 1.794e-02 -3.494 0.000476 ***
## FreeSulfurDioxide 1.249e-04 3.801e-05 3.285 0.001020 **
## TotalSulfurDioxide 8.362e-05 2.457e-05 3.403 0.000666 ***
## Density -3.887e-01 2.145e-01 -1.813 0.069902 .
## pH -1.990e-02 8.406e-03 -2.367 0.017923 *
## Sulphates -1.035e-02 6.173e-03 -1.677 0.093603 .
## Alcohol 2.527e-03 1.557e-03 1.623 0.104549
## LabelAppeal 1.429e-01 6.779e-03 21.084 < 2e-16 ***
## AcidIndex -7.543e-01 4.005e-02 -18.832 < 2e-16 ***
## STARS 3.433e-01 6.241e-03 54.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(49067.85) family taken to be 1)
##
## Null deviance: 18290 on 10236 degrees of freedom
## Residual deviance: 12766 on 10222 degrees of freedom
## AIC: 38357
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 49068
## Std. Err.: 63120
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38324.77
plot(model4)
Using Linear Regression Model on imputed training data.
model5 <- lm(TARGET ~ ., data = train_dataset_2)
summary(model5)
##
## Call:
## lm(formula = TARGET ~ ., data = train_dataset_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3708 -1.0318 0.1602 1.0263 4.3131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.001e+00 5.535e-01 10.842 < 2e-16 ***
## FixedAcidity 1.363e-03 2.241e-03 0.608 0.543026
## VolatileAcidity -1.326e-01 1.781e-02 -7.445 1.05e-13 ***
## CitricAcid 1.950e-02 1.621e-02 1.203 0.229052
## ResidualSugar 4.527e-04 4.108e-04 1.102 0.270444
## Chlorides -1.916e-01 4.383e-02 -4.371 1.25e-05 ***
## FreeSulfurDioxide 3.229e-04 9.354e-05 3.452 0.000560 ***
## TotalSulfurDioxide 2.237e-04 5.984e-05 3.738 0.000187 ***
## Density -9.958e-01 5.233e-01 -1.903 0.057047 .
## pH -4.893e-02 2.060e-02 -2.376 0.017542 *
## Sulphates -2.701e-02 1.510e-02 -1.789 0.073692 .
## Alcohol 1.092e-02 3.777e-03 2.892 0.003831 **
## LabelAppeal 4.385e-01 1.633e-02 26.862 < 2e-16 ***
## AcidIndex -2.020e+00 9.201e-02 -21.954 < 2e-16 ***
## STARS 1.175e+00 1.655e-02 71.034 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.409 on 10222 degrees of freedom
## Multiple R-squared: 0.4661, Adjusted R-squared: 0.4654
## F-statistic: 637.5 on 14 and 10222 DF, p-value: < 2.2e-16
plot(model5)
This regression uses ordered factors. I would expect this to be one of the top performers.
polrDF <- train_dataset_2
polrDF$TARGET <- as.factor(polrDF$TARGET)
model6 <- polr(TARGET ~ ., data = polrDF, Hess=TRUE)
summary(model6)
## Call:
## polr(formula = TARGET ~ ., data = polrDF, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## FixedAcidity 0.0025541 2.909e-03 0.8780
## VolatileAcidity -0.1635870 2.328e-02 -7.0256
## CitricAcid 0.0195114 2.111e-02 0.9242
## ResidualSugar 0.0003666 5.310e-04 0.6904
## Chlorides -0.2395616 5.681e-02 -4.2170
## FreeSulfurDioxide 0.0004071 1.216e-04 3.3467
## TotalSulfurDioxide 0.0002398 7.814e-05 3.0682
## Density -1.5138603 1.493e-01 -10.1393
## pH -0.0385963 2.685e-02 -1.4375
## Sulphates -0.0207772 1.982e-02 -1.0483
## Alcohol 0.0253462 4.883e-03 5.1902
## LabelAppeal 0.8329795 2.381e-02 34.9811
## AcidIndex -2.5978181 1.251e-01 -20.7696
## STARS 1.4733679 2.563e-02 57.4787
##
## Intercepts:
## Value Std. Error t value
## 0|1 -6.0275 0.1360 -44.3081
## 1|2 -5.8898 0.1359 -43.3357
## 2|3 -5.2832 0.1355 -38.9891
## 3|4 -3.9123 0.1354 -28.8884
## 4|5 -2.0643 0.1377 -14.9885
## 5|6 -0.0952 0.1442 -0.6605
## 6|7 2.1034 0.1679 12.5310
## 7|8 4.4397 0.3036 14.6243
##
## Residual Deviance: 29978.30
## AIC: 30022.30
Zero inflation understands that some Poisson distrobutions are dominated by many zeros. As such it corrects for this. This is one of the most promissing ones because as we saw in our data exploration, there were more zeros, and then normally distributed data after that.
library(pscl)
## Warning: package 'pscl' was built under R version 4.0.5
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
model7 <- zeroinfl(TARGET ~ . | STARS, data = train_dataset_2, dist = 'negbin')
summary(model7)
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = train_dataset_2, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.17822 -0.50776 0.06056 0.48757 2.09505
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.793e+00 2.380e-01 7.532 4.99e-14 ***
## FixedAcidity 5.166e-04 9.446e-04 0.547 0.58447
## VolatileAcidity -2.095e-02 7.556e-03 -2.772 0.00556 **
## CitricAcid 7.361e-04 6.706e-03 0.110 0.91260
## ResidualSugar -3.494e-05 1.721e-04 -0.203 0.83913
## Chlorides -2.832e-02 1.851e-02 -1.530 0.12608
## FreeSulfurDioxide 3.380e-05 3.849e-05 0.878 0.37983
## TotalSulfurDioxide 4.004e-07 2.457e-05 0.016 0.98700
## Density -2.918e-01 2.222e-01 -1.313 0.18915
## pH 8.443e-04 8.732e-03 0.097 0.92297
## Sulphates -2.372e-03 6.392e-03 -0.371 0.71058
## Alcohol 6.463e-03 1.587e-03 4.073 4.63e-05 ***
## LabelAppeal 2.280e-01 7.088e-03 32.169 < 2e-16 ***
## AcidIndex -2.614e-01 4.464e-02 -5.856 4.73e-09 ***
## STARS 1.200e-01 6.995e-03 17.156 < 2e-16 ***
## Log(theta) 1.749e+01 4.449e+00 3.932 8.44e-05 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.7093 0.1347 20.11 <2e-16 ***
## STARS -3.0279 0.1235 -24.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 39544917.0696
## Number of iterations in BFGS optimization: 23
## Log-likelihood: -1.715e+04 on 18 Df
scatterPreds <- predict(model7, train_dataset_2)
qplot(train_dataset_2$TARGET, scatterPreds, main = 'Predicted vs Actual') + ggthemes::theme_tufte()
residPlot <- scatterPreds - train_dataset_2$TARGET
qplot(train_dataset_2$TARGET, residPlot, main = 'Residuals') + ggthemes::theme_tufte()
aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
mse1 <- mean((train_dataset_2$TARGET - predict(model1))^2)
mse2 <- mean((train_dataset_2$TARGET - predict(model2))^2)
mse3 <- mean((train_dataset_2$TARGET - predict(model3))^2)
mse4 <- mean((train_dataset_2$TARGET - predict(model4))^2)
mse5 <- mean((train_dataset_2$TARGET - predict(model5))^2)
mse6 <- mean((train_dataset_2$TARGET - predict(model6))^2)
mse7 <- mean((train_dataset_2$TARGET - predict(model7))^2)
compare_aic_mse <- matrix(c(mse1, mse2, mse3, mse4, mse5, mse6, mse7,
aic1, aic2, aic3, aic4, aic5, aic6,
aic7),nrow=7,ncol=2,byrow=TRUE)
rownames(compare_aic_mse) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7")
colnames(compare_aic_mse) <- c("MSE","AIC")
compare_models <- as.data.frame(compare_models)
kable(compare_aic_mse) %>%
kable_styling(full_width = T)
MSE | AIC | |
---|---|---|
Model1 | 6.929787 | 6.842661 |
Model2 | 6.929788 | 6.842657 |
Model3 | 1.981498 | NA |
Model4 | 1.942825 | 18544.983192 |
Model5 | 38354.589216 | 18547.067808 |
Model6 | 38356.768268 | 6.929787 |
Model7 | 6.842661 | 6.929788 |
Now lets see the output of the Models using test data
We will use the squared loss to validate the model. We will use the squared difference to select a model (MSE) from predictions on the training sets. (Lower numbers are better.)
modelValidation <- function(mod){
preds = predict(mod, test_dataset_2)
diffMat = as.numeric(preds) - as.numeric(test_dataset_2$TARGET)
diffMat = diffMat^2
loss <- mean(diffMat)
return(loss)
}
compare_models <- matrix(c(modelValidation(model1),modelValidation(model2),
modelValidation(model3),modelValidation(model4),
modelValidation(model5),modelValidation(model6),
modelValidation(model7)),
nrow=7,ncol=1,byrow=TRUE)
rownames(compare_models) <- c("Model1","Model2","Model3","Model4","Model5","Model6","Model7")
colnames(compare_models) <- c("Loss:")
compare_models <- as.data.frame(compare_models)
compare_models
## Loss:
## Model1 5.479565
## Model2 6.844632
## Model3 5.479559
## Model4 6.844628
## Model5 2.038659
## Model6 3.667318
## Model7 2.003325
The squared loss metric tells how accurate our model is without caring about confidence intervals among others. Based on this parameter, the Zero Poisson Inflation model is the most accurate.
model7 (Zero Poisson Inflations) is the best choice overally due to the following reasons: - least loss - good MSE score - good AIC score
In the same manner as before, lets impute and use log transformation for AcidIndex.
evaluation_data_modified <- evaluation_dataset %>% dplyr::select(-(IN)) %>% mice(m=1, maxit = 5, seed = 42)
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## Warning: Number of logged events: 1
imputed_eval_data <- as.data.frame(complete(evaluation_data_modified))
imputed_eval_data$AcidIndex <- log(imputed_eval_data$AcidIndex)
imputed_eval_data$TARGET <- predict(model7, newdata=imputed_eval_data)
write.csv(imputed_eval_data,"predicted_data.csv", row.names=FALSE)
data_predicted_eval <- read_csv("predicted_data.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## TARGET = col_double(),
## FixedAcidity = col_double(),
## VolatileAcidity = col_double(),
## CitricAcid = col_double(),
## ResidualSugar = col_double(),
## Chlorides = col_double(),
## FreeSulfurDioxide = col_double(),
## TotalSulfurDioxide = col_double(),
## Density = col_double(),
## pH = col_double(),
## Sulphates = col_double(),
## Alcohol = col_double(),
## LabelAppeal = col_double(),
## AcidIndex = col_double(),
## STARS = col_double()
## )
DT::datatable(data_predicted_eval)
options(width=100)
round(with(data_predicted_eval, c(summary(TARGET), StdD=sd(TARGET), Skew=skewness(TARGET), Kurt=kurtosis(TARGET))),2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
## 1.01 1.91 3.32 3.25 4.20 8.28 1.38 0.47 -0.36
```