In this project, we try to explore wine quality to assess which factors are more important and influence wine quality to most.
After acknowledged our data, and familiar with our variables. We have executed logistics regression, random forest and decision tree as well as GBM.
Logistic regression predicts a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.
Random forests is a learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time
Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems. Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines.
In our methodology quality is our response variable.
White and Red data sets are related to red and white variants of the Portuguese Vinho Verde Wine and are available at UCI ML repository. Our goal is to characterize the relationship between wine quality and its analytical characteristics. The white wine dataset has 6497 observations, 12 predictors and 1 outcome (quality). All of the predictors are numeric values, outcomes are integer.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(readxl)
library(caTools)
library(corrgram)
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readxl)
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(clValid)
## Zorunlu paket yükleniyor: cluster
library(cluster)
library (caret)
## Zorunlu paket yükleniyor: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
library(lattice)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(nnet)
library(party)
## Zorunlu paket yükleniyor: grid
## Zorunlu paket yükleniyor: mvtnorm
## Zorunlu paket yükleniyor: modeltools
## Zorunlu paket yükleniyor: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:clValid':
##
## clusters
## Zorunlu paket yükleniyor: strucchange
## Zorunlu paket yükleniyor: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Zorunlu paket yükleniyor: sandwich
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(grid)
library(mvtnorm)
library(modeltools)
library(stats4)
library(strucchange)
library(zoo)
library(sandwich)
library(tree)
library(ISLR)
library(rpart)
library(rpart.plot)
First thing first, we will install all the packages we need and the we call them. Since our data has 2 different wine color, and these data sets were extracted, we have to combine the data for starting to prepossessing part.
library(readxl)
wine_white <- read_excel("C:/Users/PC/Desktop/devamke/wine-white2.xlsx")
View(wine_white)
library(readxl)
wine_red <- read_excel("C:/Users/PC/Desktop/devamke/wine-red2.xlsx")
View(wine_red)
red <- wine_red
white <- wine_white
red['color'] <- 'red'
white['color'] <- 'white'
data <- rbind(red, white)
Adding categorical variables to both sets. And then we will merge the red and white wine to data sets.
head(data)
dim(data)
## [1] 6497 13
names(data)
## [1] "fixed acidity" "volatile acidity" "citric acid"
## [4] "residual sugar" "chlorides" "free sulfur dioxide"
## [7] "total sulfur dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "color"
names(data) <- c('fixed_acidity','volatile_acidity','citric_acid','residual_sugar','chlorides', 'free_sulfurdioxide', 'total_sulfurdioxide', 'density', 'pH','sulphates','alcohol','quality','color')
In this section, we will prepare our data for analyzing.
summary(data)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free_sulfurdioxide total_sulfurdioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. : 0.99
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.: 0.99
## Median :0.04700 Median : 29.00 Median :118.0 Median : 0.99
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean : 96.81
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.: 1.00
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :100196.00
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.0 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.5 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median : 10.3 Median :6.000
## Mean :3.219 Mean :0.5313 Mean : 412.1 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.: 11.3 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :1490646.0 Max. :9.000
## color
## Length:6497
## Class :character
## Mode :character
##
##
##
As we can see in the summary, the massive number appears in the variables such as alcohol and density. So we thought that maybe the data has some outliers. To extract the values of the potential outliers based on the IQR criterion. We will start with the densitiy.
boxplot.stats(data$density)$out
## [1] 1.00100e+03 1.00100e+03 1.00020e+04 1.00040e+04 1.01030e+00 1.01030e+00
## [7] 1.00182e+05 1.03898e+00 1.00014e+05 1.00196e+05 1.00037e+05 1.00038e+05
## [13] 1.00038e+05
Thanks to the which() function it is possible to extract the row number corresponding to these outliers:
out <- boxplot.stats(data$density)$out
out_ind <- which(data$density %in% c(out))
out_ind
## [1] 1791 2643 2665 3180 3253 3263 3850 4381 4862 5020 5145 5614 5618
Q <- quantile(data$density, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(data$density)
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
eliminated<- subset(data,
data$density > (Q[1] - 1.5*iqr) & data$density < (Q[2]+1.5*iqr))
eliminated
summary(eliminated)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.214 Mean :0.3397 Mean :0.3184 Mean : 5.406
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :26.050
## chlorides free_sulfurdioxide total_sulfurdioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05604 Mean : 30.51 Mean :115.6 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9969
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.0 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.5 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median : 10.3 Median :6.000
## Mean :3.219 Mean :0.5313 Mean : 412.9 Mean :5.819
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.: 11.3 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :1490646.0 Max. :9.000
## color
## Length:6484
## Class :character
## Mode :character
##
##
##
Removing the outliers in the density variables made huge difference between the max and min value. We can see the changes more clearly in the distribution from histogram graphs.
par(mfrow = c(1, 2))
hist(data$density)
hist(eliminated$density)
We will repeat the process with alcohol variable.
boxplot(eliminated$alcohol, plot = FALSE)$out
## [1] 1490646 1006404 102411 10050
out_al <- boxplot.stats(data$alcohol)$out
out_ind_al <- which(data$alcohol %in% c(out))
outliers1 <- boxplot(eliminated$alcohol, plot=FALSE)$out
df <- eliminated
df <- df[-which(df$alcohol %in% outliers1),]
par(mfrow = c(1, 2))
hist(data$alcohol)
hist(df$alcohol)
From here, we will be searching for the NA values.Although we did not
see and NA values in the summary, we wanted to verify our findings.
which(is.na(df))
## integer(0)
sum(is.na(df))
## [1] 0
##^Exploratory Data Analysis (EDA)
summary(df)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.213 Mean :0.3396 Mean :0.3185 Mean : 5.406
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.600 Max. :1.5800 Max. :1.6600 Max. :26.050
## chlorides free_sulfurdioxide total_sulfurdioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.51 Mean :115.6 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9969
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.49 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.00 Max. :9.000
## color
## Length:6480
## Class :character
## Mode :character
##
##
##
str(df)
## tibble [6,480 × 13] (S3: tbl_df/tbl/data.frame)
## $ fixed_acidity : num [1:6480] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile_acidity : num [1:6480] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric_acid : num [1:6480] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual_sugar : num [1:6480] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:6480] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free_sulfurdioxide : num [1:6480] 11 25 15 17 11 13 15 15 9 17 ...
## $ total_sulfurdioxide: num [1:6480] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:6480] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:6480] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:6480] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:6480] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:6480] 5 5 5 6 5 5 5 7 7 5 ...
## $ color : chr [1:6480] "red" "red" "red" "red" ...
Observations from the Summary Mean residual sugar level is 5.4 g/l. Mean free sulfur dioxide is 30.5 ppm. Max value is 289 which is quite high as 75% is 41 ppm. PH of wine is within range from 2.7 till 4, mean 3.2.
Alcohol: lightest wine is 8%. the highest amount is 14%
Minimum quality mark is 3, mean 5.8, highest is 9.
library(corrgram)
corrgram(df, order = TRUE , lower.panel=panel.conf)
cor(x=df[,1:12], y=df$quality)
## [,1]
## fixed_acidity -0.07549878
## volatile_acidity -0.26696446
## citric_acid 0.08717157
## residual_sugar -0.03748329
## chlorides -0.20059999
## free_sulfurdioxide 0.05553655
## total_sulfurdioxide -0.04154195
## density -0.31314424
## pH 0.01895031
## sulphates 0.03844359
## alcohol 0.44541285
## quality 1.00000000
Quality is mostly correlated with amount of alcohol. It looks like the biggest positive correlation with quality has free sulfur dioxide (0.23), sulphates (0.25) and alcohol(0.48). Negative correlation exist for the volatile acidity(-0.39), chlorides(-0.27), total.sulfur.dioxide(-0.19) and density(-0.17).
library(ggplot2)
qplot(quality, data = df, fill = color, binwidth = 1) +
scale_x_continuous(breaks = seq(3,10,1), lim = c(3,10))
## Warning: Removed 4 rows containing missing values (geom_bar).
qplot(alcohol, data=df, fill =color, binwidth = 0.5)+ scale_x_continuous(breaks = seq(8,15,0.5), lim = c(8,15))
## Warning: Removed 4 rows containing missing values (geom_bar).
qplot(quality, data =df, binwidth = 1, color = color, geom = "density") +
scale_x_continuous(breaks = seq(3, 9, 1))
## Warning: Ignoring unknown parameters: binwidth
After drawing these visuals we are more confident that we can combine our data and bring red and white data together as we see very similar and parallel to response variable of both data-sets.
quality_min <- min(df$quality)
quality_max <- max(df$quality)
quality_mean <- mean(df$quality)
quality_median <- median(df$quality)
quality_iqr <- IQR(df$quality)
quality_q1 <- quality_median - quality_iqr
quality_q3 <- quality_median + quality_iqr
ggplot(df, aes(x=quality, fill = quality)) +
geom_bar(stat="count") +
geom_text(position = "stack", stat='count',aes(label=..count..), vjust = -0.5)+
labs(y="Num of Observations", x="Wine Quality") +
labs(title= "Distribution of Wine Quality Ratings")
We can see quality is not balanced across its entire range of 0-10. Most of the numbers are around 5 or 6. In other words, there are much more medium wines than very excellent or poor ones.
PCA is a way to deal with highly correlated variables. Since the color variable is irrelevant in our analysis, so we remove it.
drop <- c("color")
df1 = df[,!(names(df) %in% drop)]
head(df1)
wine2 = scale(df1 [,1:11])
wine_scaled= cbind.data.frame(wine2 , df1$quality)
colnames(wine_scaled) = colnames(df1)
head(wine_scaled)
As our data is pre-prepossessed,let’s perform PCA on the data, excluding the quality variable.
library(FactoMineR)
library(factoextra)
library(ggplot2)
set.seed(150)
res = PCA( wine_scaled , quali.sup=12 ,graph = FALSE )
fviz_pca_biplot(res ,geom.ind = c("point"),habillage = 12 , col.var = "black" , repel=TRUE)
In this scree plot we indicate that we can divide our variable into 3 component. At the left hand side of the graphic we see free radicals, at right upper side, we see the density of components, and right down we see wine acidity.
Now we will compute and determine the number of PCs
results <- prcomp(df1[,-1], center=FALSE, scale.=FALSE)
summary(results)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 133.5050 12.0788 5.76241 4.0204 0.69886 0.36182 0.1585
## Proportion of Variance 0.9891 0.0081 0.00184 0.0009 0.00003 0.00001 0.0000
## Cumulative Proportion 0.9891 0.9972 0.99907 1.0000 0.99999 1.00000 1.0000
## PC8 PC9 PC10 PC11
## Standard deviation 0.1411 0.1113 0.03863 0.02729
## Proportion of Variance 0.0000 0.0000 0.00000 0.00000
## Cumulative Proportion 1.0000 1.0000 1.00000 1.00000
Individuals-PCA: We can conclude from this graph, that going from the top to the bottom the quality of the wine increase. In other words, the upper part of dim 1 has a lower quality than the bottom part.
Accordingly to the result, total_sulfurdioxide,free_sulfurdioxide, volatile_acidity, residual_sugar have contributed the most in the 1st dim However, density, alcohol, fixed_acidity, residual_sugar contributed the most in 2nd
The eigen values measure the variance retained by each PC and can be used to determine the number of PC. Eigen value > 1 indicate us that the PCs account for more variance than accounted by one of the original variables in standardized data, and is used to determine cutoff point for which we keep PCs.
eig.val <- get_eigenvalue(results)
eig.val
The proportion of variation explained by each eigenvalue is given in the second column. 99.9 % of the variation is explained by the first three eigenvalues.For the first three PCs the eigen values are larger than 1, thus we should retain 3 PCs.
Another way to determine the number of PC is to look at the Scree Plot.
fviz_screeplot(res, addlabels = TRUE)
The above is called a scree plot. It shows the variances explained by each latent variable. The first component explains approx. 28% of the variance in the whole dataset.
Ideally, we would like to see an elbow shape in order to decide which PCs to keep and which ones to disregard. In practice, this rarely happens. Most of the time, we use enough PCs so that they explain 95% or 99% of the variation in the data.
By examining the above figure, we can conclude that first 6 variables contain most of the information inside the data.
Fİrst two components are accounting for almost 50% of the total explained variance.We represent the data projected in the plane of the two principal components.
Both free sulfurdioxide and total sulfurdioxide are almost aligned with 2nd PCA component(PC2) accounting for 22,7% of the feature explained variation whilst pH is aligned with 1st PCA component (PC1), accounting for 27,6% of the feature explained variation.
The plot of Red wine quality distribution shows that majority of the data set has quality between 5 and 6. We will divide the categories into groups labeling as 1,2 and 3.
wine2 = scale(df1 [,1:11])
wine_scaled= cbind.data.frame(wine2 , df1$quality)
colnames(wine_scaled) = colnames(df1)
head(wine_scaled)
wine_scaled$quality=as.integer(as.character( wine_scaled$quality))
wine_scaled$quality = lapply(wine_scaled[,12], function (x)
{
if(x >6) { "3"}
else if(x > 4) {"2"}
else { "1"}
})
wine_scaled$quality = unlist(wine_scaled$quality)
wine_scaled$quality = as.factor(wine_scaled$quality)
table(wine_scaled$quality)
##
## 1 2 3
## 246 4960 1274
Spliting the data into training set and test set. We will devide the data into train and test set with 80% for train data and 20% for test data.
set.seed(1580)
x<-sample(c(1:nrow(wine_scaled)) , 0.8*nrow(wine_scaled))
train_data=wine_scaled[x,]
test_data=wine_scaled[-x,]
table(train_data$quality)
##
## 1 2 3
## 189 3974 1021
table(test_data$quality)
##
## 1 2 3
## 57 986 253
We split the data into 3 quality level; bad(1), medium(2) and good(3). Bad indicates wines have quality level under 4, medium 4 (including) to 6, while good indicates wines above 6 (including). To be able to continue with classification step, we have trained the data and tested it. Our findings have the right portion to go to the next step.
glm_wine <- glm(quality ~ . , data = train_data, family = binomial)
summary(glm_wine)
##
## Call:
## glm(formula = quality ~ ., family = binomial, data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0725 0.1670 0.2155 0.2782 1.4161
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.63168 0.09544 38.053 < 2e-16 ***
## fixed_acidity -0.37341 0.18562 -2.012 0.044250 *
## volatile_acidity -0.64884 0.08119 -7.992 1.33e-15 ***
## citric_acid 0.05410 0.09442 0.573 0.566623
## residual_sugar -0.03288 0.21633 -0.152 0.879213
## chlorides -0.02018 0.10367 -0.195 0.845669
## free_sulfurdioxide 0.48786 0.12629 3.863 0.000112 ***
## total_sulfurdioxide -0.46075 0.12156 -3.790 0.000150 ***
## density 0.69180 0.32841 2.107 0.035158 *
## pH -0.21677 0.13312 -1.628 0.103443
## sulphates 0.31111 0.11810 2.634 0.008431 **
## alcohol 0.63000 0.18517 3.402 0.000668 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1622.8 on 5183 degrees of freedom
## Residual deviance: 1457.1 on 5172 degrees of freedom
## AIC: 1481.1
##
## Number of Fisher Scoring iterations: 6
In the logistic regression model, p values that have * means these variables are important for our model. Which is ; fixed_acidity,free_sulfurdioxide, volatile_acidity, alcohol, total_sulfurdioxide, density,sulphates.
pred_wine <- predict(glm_wine, newdata = test_data,
type = "response")
head(pred_wine)
## 2 9 13 14 23 27
## 0.8726233 0.9233697 0.8859923 0.9889551 0.9801286 0.9617344
Convert the values into binary response based on the threshold values.
pred_wine <- lapply(pred_wine, function (x)
{
if (x > 0.90) {"1"}
else if (x > 0.40) {"2"}
else {"3"}} )
pred_wine1 <- unlist(pred_wine)
head(pred_wine1)
## 2 9 13 14 23 27
## "2" "1" "2" "1" "1" "1"
This is the misclassification error rate on the testing data set.
Simple creation for the Confusion matrix;
test_score_table_wine <- table(actual = test_data$quality,
prediction = pred_wine1)
test_score_table_wine
## prediction
## actual 1 2 3
## 1 42 15 0
## 2 906 79 1
## 3 252 1 0
calc_class_err <- function(actual, predicted) {
mean(actual != predicted)}
calc_class_err(actual = test_data$quality, predicted = pred_wine1)
## [1] 0.9066358
rpTree <- rpart(quality ~ fixed_acidity + volatile_acidity + residual_sugar +
chlorides + total_sulfurdioxide + density + pH + sulphates +
alcohol, data=train_data )
rpTree$cptable
## CP nsplit rel error xerror xstd
## 1 0.02045455 0 1.0000000 1.0000000 0.02517032
## 2 0.01000000 5 0.8966942 0.9173554 0.02440923
rpTree <- rpart(quality ~ fixed_acidity + volatile_acidity + residual_sugar +
chlorides + total_sulfurdioxide + density + pH + sulphates +
alcohol, data=train_data , cp=0.01 , control = rpart.control((minsplit = 11)))
rpart.plot(rpTree, box.palette = "RdBu", shadow.col = "gray" ,nn=TRUE )
## Predicting and comparison with the true observations
library(party)
tree = ctree(quality ~fixed_acidity + volatile_acidity + residual_sugar +
chlorides + total_sulfurdioxide + density + pH + sulphates +
alcohol , data = train_data)
p_tree= predict(tree , test_data[,-12] , method = "class")
tab_tree = table(p_tree , test_data$quality)
tab_tree
##
## p_tree 1 2 3
## 1 0 3 0
## 2 54 947 193
## 3 3 36 60
sum(diag(tab_tree))/sum(tab_tree)
## [1] 0.7770062
library(gbm)
## Loaded gbm 2.1.8
wine.boost <- gbm(quality ~ ., data=train_data, distribution = "gaussian", n.trees = 5000)
summary(wine.boost)
control2 <- trainControl(method="repeatedcv", number=10, repeats=5)
gbm_model <- train(quality ~ fixed_acidity + volatile_acidity + residual_sugar +
chlorides + total_sulfurdioxide + density + pH + sulphates +
alcohol , data=train_data, method="gbm", metric="Accuracy", trControl=control2 , verbose = FALSE)
predictTest = predict(gbm_model, newdata = test_data[,-12])
tab_GBM = table( predictTest ,test_data$quality)
tab_GBM
##
## predictTest 1 2 3
## 1 4 4 0
## 2 51 926 157
## 3 2 56 96
sum(diag(tab_GBM))/sum(tab_GBM)
## [1] 0.7916667
Fitting Random Forest Classification to the Training set
library(randomForest)
classifier1 = randomForest(x = train_data[-12],
y = train_data$quality,
ntree = 500)
y_pred1 = predict(classifier1, newdata = test_data[-12])
cm1 = table(actual=test_data[, 12],predicted= y_pred1)
tab_random = table(y_pred1 , test_data$quality)
tab_random
##
## y_pred1 1 2 3
## 1 4 0 0
## 2 50 954 109
## 3 3 32 144
sum(diag(tab_random))/sum(tab_random)
## [1] 0.8503086
Although, it is not always right to state a conclusion with only accuracy, it would not be wrong to express that the logistic regression model output fits the best. Further analysis can be done.
Accuracy of the classification methods: 0.9066358 logistic regression 0.7770062 decision tree 0.7924383 gbm 0.8510802 random forest