Introduction

In this project, we try to explore wine quality to assess which factors are more important and influence wine quality to most.

Methodology

After acknowledged our data, and familiar with our variables. We have executed logistics regression, random forest and decision tree as well as GBM.

Logistic regression predicts a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.

Random forests is a learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems. Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines.

In our methodology quality is our response variable.

Data Description

White and Red data sets are related to red and white variants of the Portuguese Vinho Verde Wine and are available at UCI ML repository. Our goal is to characterize the relationship between wine quality and its analytical characteristics. The white wine dataset has 6497 observations, 12 predictors and 1 outcome (quality). All of the predictors are numeric values, outcomes are integer.

library(ggplot2) 
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(readxl)
library(caTools)
library(corrgram)
library(knitr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readxl)
library(FactoMineR)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(clValid)

## Zorunlu paket yükleniyor: cluster

library(cluster)
library (caret)

## Zorunlu paket yükleniyor: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:corrgram':
## 
##     panel.fill

library(lattice)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(nnet)
library(party)

## Zorunlu paket yükleniyor: grid

## Zorunlu paket yükleniyor: mvtnorm

## Zorunlu paket yükleniyor: modeltools

## Zorunlu paket yükleniyor: stats4

## 
## Attaching package: 'modeltools'

## The following object is masked from 'package:clValid':
## 
##     clusters

## Zorunlu paket yükleniyor: strucchange

## Zorunlu paket yükleniyor: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Zorunlu paket yükleniyor: sandwich

library(rpart)
library(rpart.plot)
library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(grid)
library(mvtnorm)
library(modeltools)
library(stats4)
library(strucchange)
library(zoo)
library(sandwich)
library(tree)
library(ISLR)
library(rpart)
library(rpart.plot)

Preparing Data

First thing first, we will install all the packages we need and the we call them. Since our data has 2 different wine color, and these data sets were extracted, we have to combine the data for starting to prepossessing part.

library(readxl)
wine_white <- read_excel("C:/Users/PC/Desktop/devamke/wine-white2.xlsx")
View(wine_white)

library(readxl)
wine_red <- read_excel("C:/Users/PC/Desktop/devamke/wine-red2.xlsx")
View(wine_red)

red <- wine_red
white <- wine_white

red['color'] <- 'red'
white['color'] <- 'white'
data <- rbind(red, white)

Adding categorical variables to both sets. And then we will merge the red and white wine to data sets.

head(data)

dim(data)

## [1] 6497   13

names(data)

##  [1] "fixed acidity"        "volatile acidity"     "citric acid"         
##  [4] "residual sugar"       "chlorides"            "free sulfur dioxide" 
##  [7] "total sulfur dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "color"

names(data) <- c('fixed_acidity','volatile_acidity','citric_acid','residual_sugar','chlorides', 'free_sulfurdioxide', 'total_sulfurdioxide', 'density', 'pH','sulphates','alcohol','quality','color')

Explaratory Data Aanalysis (EDA) and Pre-Processing

Pre-Proccessing

In this section, we will prepare our data for analyzing.

summary(data)

##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free_sulfurdioxide total_sulfurdioxide    density         
##  Min.   :0.00900   Min.   :  1.00     Min.   :  6.0       Min.   :     0.99  
##  1st Qu.:0.03800   1st Qu.: 17.00     1st Qu.: 77.0       1st Qu.:     0.99  
##  Median :0.04700   Median : 29.00     Median :118.0       Median :     0.99  
##  Mean   :0.05603   Mean   : 30.53     Mean   :115.7       Mean   :    96.81  
##  3rd Qu.:0.06500   3rd Qu.: 41.00     3rd Qu.:156.0       3rd Qu.:     1.00  
##  Max.   :0.61100   Max.   :289.00     Max.   :440.0       Max.   :100196.00  
##        pH          sulphates         alcohol             quality     
##  Min.   :2.720   Min.   :0.2200   Min.   :      8.0   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.:      9.5   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :     10.3   Median :6.000  
##  Mean   :3.219   Mean   :0.5313   Mean   :    412.1   Mean   :5.818  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:     11.3   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :1490646.0   Max.   :9.000  
##     color          
##  Length:6497       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

As we can see in the summary, the massive number appears in the variables such as alcohol and density. So we thought that maybe the data has some outliers. To extract the values of the potential outliers based on the IQR criterion. We will start with the densitiy.

boxplot.stats(data$density)$out

##  [1] 1.00100e+03 1.00100e+03 1.00020e+04 1.00040e+04 1.01030e+00 1.01030e+00
##  [7] 1.00182e+05 1.03898e+00 1.00014e+05 1.00196e+05 1.00037e+05 1.00038e+05
## [13] 1.00038e+05

Thanks to the which() function it is possible to extract the row number corresponding to these outliers:

out <- boxplot.stats(data$density)$out
out_ind <- which(data$density %in% c(out))
out_ind

##  [1] 1791 2643 2665 3180 3253 3263 3850 4381 4862 5020 5145 5614 5618

Q <- quantile(data$density, probs=c(.25, .75), na.rm = FALSE)

iqr <- IQR(data$density)

up <-  Q[2]+1.5*iqr # Upper Range  
low<- Q[1]-1.5*iqr # Lower Range

eliminated<- subset(data, 
                    data$density > (Q[1] - 1.5*iqr) & data$density < (Q[2]+1.5*iqr))


eliminated

summary(eliminated)

##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.214   Mean   :0.3397   Mean   :0.3184   Mean   : 5.406  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :26.050  
##    chlorides       free_sulfurdioxide total_sulfurdioxide    density      
##  Min.   :0.00900   Min.   :  1.00     Min.   :  6.0       Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00     1st Qu.: 77.0       1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00     Median :118.0       Median :0.9949  
##  Mean   :0.05604   Mean   : 30.51     Mean   :115.6       Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00     3rd Qu.:156.0       3rd Qu.:0.9969  
##  Max.   :0.61100   Max.   :289.00     Max.   :440.0       Max.   :1.0037  
##        pH          sulphates         alcohol             quality     
##  Min.   :2.720   Min.   :0.2200   Min.   :      8.0   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.:      9.5   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :     10.3   Median :6.000  
##  Mean   :3.219   Mean   :0.5313   Mean   :    412.9   Mean   :5.819  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:     11.3   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :1490646.0   Max.   :9.000  
##     color          
##  Length:6484       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Removing the outliers in the density variables made huge difference between the max and min value. We can see the changes more clearly in the distribution from histogram graphs.

par(mfrow = c(1, 2))
hist(data$density)
hist(eliminated$density)

We will repeat the process with alcohol variable.

boxplot(eliminated$alcohol, plot = FALSE)$out

## [1] 1490646 1006404  102411   10050

out_al <- boxplot.stats(data$alcohol)$out
out_ind_al <- which(data$alcohol %in% c(out))

outliers1 <- boxplot(eliminated$alcohol, plot=FALSE)$out
df <- eliminated
df <- df[-which(df$alcohol %in% outliers1),]

par(mfrow = c(1, 2))
hist(data$alcohol)
hist(df$alcohol)

From here, we will be searching for the NA values.Although we did not see and NA values in the summary, we wanted to verify our findings.

which(is.na(df))

## integer(0)

sum(is.na(df))

## [1] 0

##^Exploratory Data Analysis (EDA)

summary(df)

##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.213   Mean   :0.3396   Mean   :0.3185   Mean   : 5.406  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.600   Max.   :1.5800   Max.   :1.6600   Max.   :26.050  
##    chlorides       free_sulfurdioxide total_sulfurdioxide    density      
##  Min.   :0.00900   Min.   :  1.00     Min.   :  6.0       Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00     1st Qu.: 77.0       1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00     Median :118.0       Median :0.9949  
##  Mean   :0.05603   Mean   : 30.51     Mean   :115.6       Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00     3rd Qu.:156.0       3rd Qu.:0.9969  
##  Max.   :0.61100   Max.   :289.00     Max.   :440.0       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.30   Median :6.000  
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49   Mean   :5.818  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.00   Max.   :9.000  
##     color          
##  Length:6480       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

str(df)

## tibble [6,480 × 13] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity      : num [1:6480] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile_acidity   : num [1:6480] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric_acid        : num [1:6480] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual_sugar     : num [1:6480] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides          : num [1:6480] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free_sulfurdioxide : num [1:6480] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total_sulfurdioxide: num [1:6480] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density            : num [1:6480] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                 : num [1:6480] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates          : num [1:6480] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol            : num [1:6480] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality            : num [1:6480] 5 5 5 6 5 5 5 7 7 5 ...
##  $ color              : chr [1:6480] "red" "red" "red" "red" ...

Observations from the Summary Mean residual sugar level is 5.4 g/l. Mean free sulfur dioxide is 30.5 ppm. Max value is 289 which is quite high as 75% is 41 ppm. PH of wine is within range from 2.7 till 4, mean 3.2.

Alcohol: lightest wine is 8%. the highest amount is 14%

Minimum quality mark is 3, mean 5.8, highest is 9.

library(corrgram)
corrgram(df, order = TRUE , lower.panel=panel.conf)

cor(x=df[,1:12], y=df$quality)

##                            [,1]
## fixed_acidity       -0.07549878
## volatile_acidity    -0.26696446
## citric_acid          0.08717157
## residual_sugar      -0.03748329
## chlorides           -0.20059999
## free_sulfurdioxide   0.05553655
## total_sulfurdioxide -0.04154195
## density             -0.31314424
## pH                   0.01895031
## sulphates            0.03844359
## alcohol              0.44541285
## quality              1.00000000

Quality is mostly correlated with amount of alcohol. It looks like the biggest positive correlation with quality has free sulfur dioxide (0.23), sulphates (0.25) and alcohol(0.48). Negative correlation exist for the volatile acidity(-0.39), chlorides(-0.27), total.sulfur.dioxide(-0.19) and density(-0.17).

library(ggplot2)
qplot(quality, data = df, fill = color, binwidth = 1) +
  scale_x_continuous(breaks = seq(3,10,1), lim = c(3,10))

## Warning: Removed 4 rows containing missing values (geom_bar).

qplot(alcohol, data=df, fill =color, binwidth = 0.5)+ scale_x_continuous(breaks = seq(8,15,0.5), lim = c(8,15))

## Warning: Removed 4 rows containing missing values (geom_bar).

qplot(quality, data =df, binwidth = 1, color = color, geom = "density") + 
    scale_x_continuous(breaks = seq(3, 9, 1))

## Warning: Ignoring unknown parameters: binwidth

After drawing these visuals we are more confident that we can combine our data and bring red and white data together as we see very similar and parallel to response variable of both data-sets.

quality_min <- min(df$quality)
quality_max <- max(df$quality)
quality_mean <- mean(df$quality)
quality_median <- median(df$quality)
quality_iqr <- IQR(df$quality)
quality_q1 <- quality_median - quality_iqr
quality_q3 <- quality_median + quality_iqr

Visualization of Wine Quality

ggplot(df, aes(x=quality, fill = quality)) +
  geom_bar(stat="count") +
  geom_text(position = "stack", stat='count',aes(label=..count..), vjust = -0.5)+
  labs(y="Num of Observations", x="Wine Quality") +
  labs(title= "Distribution of Wine Quality Ratings")

We can see quality is not balanced across its entire range of 0-10. Most of the numbers are around 5 or 6. In other words, there are much more medium wines than very excellent or poor ones.

Principal Component Analysis PCA

PCA is a way to deal with highly correlated variables. Since the color variable is irrelevant in our analysis, so we remove it.

drop <- c("color")
df1 = df[,!(names(df) %in% drop)]
head(df1)

wine2 = scale(df1 [,1:11])
wine_scaled= cbind.data.frame(wine2 , df1$quality)
colnames(wine_scaled) = colnames(df1)
head(wine_scaled)

As our data is pre-prepossessed,let’s perform PCA on the data, excluding the quality variable.

library(FactoMineR)
library(factoextra)
library(ggplot2)
set.seed(150)

res = PCA( wine_scaled , quali.sup=12 ,graph = FALSE )

fviz_pca_biplot(res ,geom.ind = c("point"),habillage = 12 , col.var = "black" , repel=TRUE)

In this scree plot we indicate that we can divide our variable into 3 component. At the left hand side of the graphic we see free radicals, at right upper side, we see the density of components, and right down we see wine acidity.

Now we will compute and determine the number of PCs

results <- prcomp(df1[,-1], center=FALSE, scale.=FALSE)

summary(results)

## Importance of components:
##                             PC1     PC2     PC3    PC4     PC5     PC6    PC7
## Standard deviation     133.5050 12.0788 5.76241 4.0204 0.69886 0.36182 0.1585
## Proportion of Variance   0.9891  0.0081 0.00184 0.0009 0.00003 0.00001 0.0000
## Cumulative Proportion    0.9891  0.9972 0.99907 1.0000 0.99999 1.00000 1.0000
##                           PC8    PC9    PC10    PC11
## Standard deviation     0.1411 0.1113 0.03863 0.02729
## Proportion of Variance 0.0000 0.0000 0.00000 0.00000
## Cumulative Proportion  1.0000 1.0000 1.00000 1.00000

Individuals-PCA: We can conclude from this graph, that going from the top to the bottom the quality of the wine increase. In other words, the upper part of dim 1 has a lower quality than the bottom part.

Accordingly to the result, total_sulfurdioxide,free_sulfurdioxide, volatile_acidity, residual_sugar have contributed the most in the 1st dim However, density, alcohol, fixed_acidity, residual_sugar contributed the most in 2nd

Eigen values

The eigen values measure the variance retained by each PC and can be used to determine the number of PC. Eigen value > 1 indicate us that the PCs account for more variance than accounted by one of the original variables in standardized data, and is used to determine cutoff point for which we keep PCs.

Eigen value PCA summary

eig.val <-  get_eigenvalue(results)
eig.val

The proportion of variation explained by each eigenvalue is given in the second column. 99.9 % of the variation is explained by the first three eigenvalues.For the first three PCs the eigen values are larger than 1, thus we should retain 3 PCs.

Scree of eigenvalues

Another way to determine the number of PC is to look at the Scree Plot.

fviz_screeplot(res, addlabels = TRUE)

The above is called a scree plot. It shows the variances explained by each latent variable. The first component explains approx. 28% of the variance in the whole dataset.

Ideally, we would like to see an elbow shape in order to decide which PCs to keep and which ones to disregard. In practice, this rarely happens. Most of the time, we use enough PCs so that they explain 95% or 99% of the variation in the data.

By examining the above figure, we can conclude that first 6 variables contain most of the information inside the data.

Fİrst two components are accounting for almost 50% of the total explained variance.We represent the data projected in the plane of the two principal components.

Both free sulfurdioxide and total sulfurdioxide are almost aligned with 2nd PCA component(PC2) accounting for 22,7% of the feature explained variation whilst pH is aligned with 1st PCA component (PC1), accounting for 27,6% of the feature explained variation.

Predictive analysis

Preparing data

The plot of Red wine quality distribution shows that majority of the data set has quality between 5 and 6. We will divide the categories into groups labeling as 1,2 and 3.

wine2 = scale(df1 [,1:11])
wine_scaled= cbind.data.frame(wine2 , df1$quality)
colnames(wine_scaled) = colnames(df1)
head(wine_scaled)

wine_scaled$quality=as.integer(as.character( wine_scaled$quality))

wine_scaled$quality = lapply(wine_scaled[,12], function (x)
{
  if(x >6)  { "3"}
  else if(x > 4)  {"2"}
  else { "1"}   
})
wine_scaled$quality = unlist(wine_scaled$quality)
wine_scaled$quality = as.factor(wine_scaled$quality)
table(wine_scaled$quality)

## 
##    1    2    3 
##  246 4960 1274

Spliting the data into training set and test set. We will devide the data into train and test set with 80% for train data and 20% for test data.

set.seed(1580)
x<-sample(c(1:nrow(wine_scaled)) , 0.8*nrow(wine_scaled))

train_data=wine_scaled[x,]
test_data=wine_scaled[-x,]

table(train_data$quality)

## 
##    1    2    3 
##  189 3974 1021

table(test_data$quality)

## 
##   1   2   3 
##  57 986 253

We split the data into 3 quality level; bad(1), medium(2) and good(3). Bad indicates wines have quality level under 4, medium 4 (including) to 6, while good indicates wines above 6 (including). To be able to continue with classification step, we have trained the data and tested it. Our findings have the right portion to go to the next step.

Logistic Regression

glm_wine <- glm(quality ~ . , data = train_data, family = binomial)
summary(glm_wine)

## 
## Call:
## glm(formula = quality ~ ., family = binomial, data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0725   0.1670   0.2155   0.2782   1.4161  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.63168    0.09544  38.053  < 2e-16 ***
## fixed_acidity       -0.37341    0.18562  -2.012 0.044250 *  
## volatile_acidity    -0.64884    0.08119  -7.992 1.33e-15 ***
## citric_acid          0.05410    0.09442   0.573 0.566623    
## residual_sugar      -0.03288    0.21633  -0.152 0.879213    
## chlorides           -0.02018    0.10367  -0.195 0.845669    
## free_sulfurdioxide   0.48786    0.12629   3.863 0.000112 ***
## total_sulfurdioxide -0.46075    0.12156  -3.790 0.000150 ***
## density              0.69180    0.32841   2.107 0.035158 *  
## pH                  -0.21677    0.13312  -1.628 0.103443    
## sulphates            0.31111    0.11810   2.634 0.008431 ** 
## alcohol              0.63000    0.18517   3.402 0.000668 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1622.8  on 5183  degrees of freedom
## Residual deviance: 1457.1  on 5172  degrees of freedom
## AIC: 1481.1
## 
## Number of Fisher Scoring iterations: 6

In the logistic regression model, p values that have * means these variables are important for our model. Which is ; fixed_acidity,free_sulfurdioxide, volatile_acidity, alcohol, total_sulfurdioxide, density,sulphates.

Predicting data

pred_wine <- predict(glm_wine, newdata = test_data,
type = "response")
head(pred_wine)

##         2         9        13        14        23        27 
## 0.8726233 0.9233697 0.8859923 0.9889551 0.9801286 0.9617344

Convert the values into binary response based on the threshold values.

pred_wine <- lapply(pred_wine, function (x)
  {
  if (x > 0.90) {"1"}
  else if (x > 0.40) {"2"}
  else {"3"}} )

pred_wine1 <- unlist(pred_wine)
head(pred_wine1)

##   2   9  13  14  23  27 
## "2" "1" "2" "1" "1" "1"

This is the misclassification error rate on the testing data set.

Simple creation for the Confusion matrix;

test_score_table_wine <- table(actual = test_data$quality,
prediction = pred_wine1)
test_score_table_wine

##       prediction
## actual   1   2   3
##      1  42  15   0
##      2 906  79   1
##      3 252   1   0

Comparison with the true observations;

calc_class_err <- function(actual, predicted) {
mean(actual != predicted)}
calc_class_err(actual = test_data$quality, predicted = pred_wine1)

## [1] 0.9066358

Desicion Tree

rpTree <- rpart(quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
    chlorides + total_sulfurdioxide + density + pH + sulphates + 
    alcohol, data=train_data )
rpTree$cptable

##           CP nsplit rel error    xerror       xstd
## 1 0.02045455      0 1.0000000 1.0000000 0.02517032
## 2 0.01000000      5 0.8966942 0.9173554 0.02440923

Decision Tree Regression

rpTree <- rpart(quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
    chlorides + total_sulfurdioxide + density + pH + sulphates + 
    alcohol, data=train_data , cp=0.01 , control = rpart.control((minsplit = 11)))

rpart.plot(rpTree, box.palette = "RdBu", shadow.col = "gray" ,nn=TRUE  )

## Predicting and comparison with the true observations

library(party)
tree = ctree(quality ~fixed_acidity + volatile_acidity + residual_sugar + 
    chlorides + total_sulfurdioxide + density + pH + sulphates + 
    alcohol , data = train_data)

p_tree= predict(tree  , test_data[,-12] , method = "class")

tab_tree = table(p_tree , test_data$quality)
tab_tree

##       
## p_tree   1   2   3
##      1   0   3   0
##      2  54 947 193
##      3   3  36  60

sum(diag(tab_tree))/sum(tab_tree)

## [1] 0.7770062

Gradient Boosting Machines GBM

library(gbm)

## Loaded gbm 2.1.8

wine.boost <- gbm(quality ~ ., data=train_data, distribution = "gaussian", n.trees = 5000)
summary(wine.boost)

control2 <- trainControl(method="repeatedcv", number=10, repeats=5)

gbm_model <- train(quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
  chlorides + total_sulfurdioxide + density + pH + sulphates + 
  alcohol  , data=train_data, method="gbm", metric="Accuracy", trControl=control2 , verbose = FALSE)

predictTest = predict(gbm_model, newdata = test_data[,-12])

Comparison with the true observations

tab_GBM = table(  predictTest ,test_data$quality)
tab_GBM

##            
## predictTest   1   2   3
##           1   4   4   0
##           2  51 926 157
##           3   2  56  96

sum(diag(tab_GBM))/sum(tab_GBM)

## [1] 0.7916667

Random Forest

Fitting Random Forest Classification to the Training set

library(randomForest)



classifier1 = randomForest(x = train_data[-12],
                          y = train_data$quality,
                          ntree = 500)

Predicting the Test set results

y_pred1 = predict(classifier1, newdata = test_data[-12])

Comparison with the true observations

cm1 = table(actual=test_data[, 12],predicted= y_pred1)
tab_random = table(y_pred1 , test_data$quality)
tab_random

##        
## y_pred1   1   2   3
##       1   4   0   0
##       2  50 954 109
##       3   3  32 144

sum(diag(tab_random))/sum(tab_random)

## [1] 0.8503086

Conclusions

Although, it is not always right to state a conclusion with only accuracy, it would not be wrong to express that the logistic regression model output fits the best. Further analysis can be done.

Accuracy of the classification methods: 0.9066358 logistic regression 0.7770062 decision tree 0.7924383 gbm 0.8510802 random forest

Group 1 - HW 3 - Final Submission

Group1 - Sabahat Şengezer - A Özge Akar - Büşra Gülen

2022-06-06

Introduction

Methodology

Data Description

Preparing Data

Explaratory Data Aanalysis (EDA) and Pre-Processing

Pre-Proccessing

Visualization of Wine Quality

Principal Component Analysis PCA

Eigen values

Eigen value PCA summary

Scree of eigenvalues

Predictive analysis

Preparing data

Logistic Regression

Predicting data

Comparison with the true observations;

Desicion Tree

Decision Tree Regression

Gradient Boosting Machines GBM

Comparison with the true observations

Random Forest

Predicting the Test set results

Comparison with the true observations

Conclusions