Determining wine quality by modelling classifications in R language.

Abstract: This paper explores what makes a good wine through an anonymized dataset of chemical-based properties in red Vinho Verde wines from northern Portugal to determine correlations between wine elements and quality by using decision trees and random forest. We find that the random forest technique produces the desired results with the highest accuracy and demonstrates that the quality of wine is most influenced by its alcohol, sulphate and volatile acidity levels found in red wines.

License: MIT License. \(~\)


Get Data from UCI Machine Learning Repository

Purpose: This script gets data from UCI Machine Learning Repository and read online dataset

Workstation Setup

\(~\)

library(rvest)
## Loading required package: xml2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()
## x purrr::pluck()          masks rvest::pluck()
library(dplyr)
library(gapminder)
library(skimr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
library(magrittr) 
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(devtools)
## Loading required package: usethis
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(corrplot)
## corrplot 0.84 loaded
library(corrgram)
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
## 
##     baseball
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
## 
##     panel.fill
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ROCR)
library(ggplot2)
library(ggthemes)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Import the dataset Red Wine Quality (1599 rows x12 columns) available in the following website http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv, note that the type of separator in this dataset = ;
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

Quick information about the package

\(~\)

RedWineQuality%>%
  glimpse()
## Rows: 1,599
## Columns: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.…
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600,…
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, …
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,…
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 1…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, …
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, …
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, …
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10…
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, …
# Idea of database - Details in EDA # 
summary(RedWineQuality)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Clean Data for Package/Database - RedWineQuality

Purpose: This script cleans data that was downloaded in 01-get_data.R.

Clean Data

Check for Missing Value

\(~\)

str(RedWineQuality)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
which(is.na(RedWineQuality))
## integer(0)
sum(is.na(RedWineQuality))
## [1] 0
#### There is no missing value from the dataset, the database is clean.#### 

EDA

Purpose: This script conducts exploratory data analysis of data that was downloaded in 01-get_data.R and cleaned in 02_clean_data.R

Study Data

General Idea

\(~\)

summary(RedWineQuality$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
table(RedWineQuality$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
#Use Skim to understand the databse#
library(skimr)
RedWineQuality %>% 
  skim() %>% 
  kable()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric fixed.acidity 0 1 8.3196373 1.7410963 4.60000 7.1000 7.90000 9.200000 15.90000 ▂▇▂▁▁
numeric volatile.acidity 0 1 0.5278205 0.1790597 0.12000 0.3900 0.52000 0.640000 1.58000 ▅▇▂▁▁
numeric citric.acid 0 1 0.2709756 0.1948011 0.00000 0.0900 0.26000 0.420000 1.00000 ▇▆▅▁▁
numeric residual.sugar 0 1 2.5388055 1.4099281 0.90000 1.9000 2.20000 2.600000 15.50000 ▇▁▁▁▁
numeric chlorides 0 1 0.0874665 0.0470653 0.01200 0.0700 0.07900 0.090000 0.61100 ▇▁▁▁▁
numeric free.sulfur.dioxide 0 1 15.8749218 10.4601570 1.00000 7.0000 14.00000 21.000000 72.00000 ▇▅▁▁▁
numeric total.sulfur.dioxide 0 1 46.4677924 32.8953245 6.00000 22.0000 38.00000 62.000000 289.00000 ▇▂▁▁▁
numeric density 0 1 0.9967467 0.0018873 0.99007 0.9956 0.99675 0.997835 1.00369 ▁▃▇▂▁
numeric pH 0 1 3.3111132 0.1543865 2.74000 3.2100 3.31000 3.400000 4.01000 ▁▅▇▂▁
numeric sulphates 0 1 0.6581488 0.1695070 0.33000 0.5500 0.62000 0.730000 2.00000 ▇▅▁▁▁
numeric alcohol 0 1 10.4229831 1.0656676 8.40000 9.5000 10.20000 11.100000 14.90000 ▇▇▃▁▁
numeric quality 0 1 5.6360225 0.8075694 3.00000 5.0000 6.00000 6.000000 8.00000 ▁▇▇▂▁

Data Transformation### - In Appendix Section

find_skewness() calculates the skewness and finds the skewed data.

\(~\)

library(dlookr)
## Loading required package: mice
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
## 
##     transform
# find index of skewed variables
find_skewness(RedWineQuality)
## [1]  1  2  4  5  6  7 10 11
# find names of skewed variables
find_skewness(RedWineQuality, index = FALSE)
## [1] "fixed.acidity"        "volatile.acidity"     "residual.sugar"      
## [4] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide"
## [7] "sulphates"            "alcohol"
# compute the skewness
find_skewness(RedWineQuality, value = TRUE)
##        fixed.acidity     volatile.acidity          citric.acid 
##                0.982                0.671                0.318 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                4.536                5.675                1.249 
## total.sulfur.dioxide              density                   pH 
##                1.514                0.071                0.194 
##            sulphates              alcohol 
##                2.426                0.860
# compute the skewness & filtering with threshold
find_skewness(RedWineQuality, value = TRUE, thres = 0.1)
##        fixed.acidity     volatile.acidity          citric.acid 
##                0.982                0.671                0.318 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                4.536                5.675                1.249 
## total.sulfur.dioxide                   pH            sulphates 
##                1.514                0.194                2.426 
##              alcohol 
##                0.860
#Visualization of Numeric/Continuous Variables in RedWineQuality Dataset
RedWineQuality %>%
  gather() %>% 
  ggplot(aes(value,fill=key)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins=sqrt(nrow(RedWineQuality))) +
  theme(legend.position="none") 

#### Mostly skewed distribution

Visualization of Numeric/Continuous Variables in RedWineQuality Dataset (X Axis Log Scale)

Log transform the variable to make the x axis on a log scale.

\(~\)

RedWineQuality %>%
  gather() %>% 
  ggplot(aes(value,fill=key)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins=sqrt(nrow(RedWineQuality))) +
  theme(legend.position="none") +
  scale_x_continuous(trans='log10')
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 132 rows containing non-finite values (stat_bin).

###1: Transformation introduced infinite values in continuous x-axis 
###2: Removed 132 rows containing non-finite values (stat_bin). 
###3: The skewed distributions from before visually look closer to normal. 
###4: The effect of transforming a (right tail) skewed distribution with a log transformation.

Study Overall Quality

\(~\)

# Calculate the percentage of each Quality Score # 
Percentage_Quality <- table(RedWineQuality$quality)
prop_table <- prop.table(Percentage_Quality)

# Make the prop_table to dataframe
prop_table.df <- as.data.frame(prop_table )
prop_table.df 
##   Var1        Freq
## 1    3 0.006253909
## 2    4 0.033145716
## 3    5 0.425891182
## 4    6 0.398999375
## 5    7 0.124452783
## 6    8 0.011257036
#Quality of Portuguese Vinho Verde Red Wine#
# Make pie chart#
library(ggplot2)
library(ggrepel) #Avoid label overlap in pie chart:Replace geom_text by geom_text_repel
ggplot(prop_table.df, aes(x = 2, y = Freq, fill =  Var1 )) +
  geom_bar(stat = "identity", color = "white") +
  geom_text_repel(aes(label = paste(round(Freq * 100, 1), "%")),size=3, position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y", start = 0)+
  theme_void()+
  xlim(0.5, 2.5)+
  labs(
    title = "Quality of Portuguese Vinho Verde Red Wine",
    caption = "Source: UCI Machine Learning Repository",
    fill = "Score Between 0 and 10")

#Use this Graph to talk about data weakness#
summary(RedWineQuality$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
ggplot(aes(x=quality), data=RedWineQuality)+
  geom_histogram(binwidth = 1, fill='Red', color='Yellow')+
  labs(title="Quality of Portuguese Vinho Verde Red Wine",subtitle= "Score Between 0 and 10", y="Counts", x = "Quality Score ",caption = "Source: UCI Machine Learning Repository")

#### There is more 5-6 Score Level red wine than l3-4 and 7-8 Score Level red wine. 
#### Use dplyr to sort non-normal distribution variables by p_value
#### Above plot shows what we have inferred previously, that good wines were outnumbered by bad wines by a large margin. Most wines were mediocre (rated 5 or 6), but we could also see that there are some poor wines (3 or 4). A vast majority of good wines has a quality rating of 7.

Data Exploration

Classification Technique: Each Quality Score is Considered A Class

Low:3-4, Medium:5-6, High:7-8

\(~\)

RedWineQuality.orig = RedWineQuality # Save off original copy before making any alterations
## Change quality to categorical
RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
  if(x<= 4)  {"Low"}
  else if(x>= 7)  {"High"}
  else { "Medium"}   
})

RedWineQuality$quality = unlist(RedWineQuality$quality)
RedWineQuality$quality = as.factor(RedWineQuality$quality)

Pairs of Data

  1. Set up the target value as quality,using GGally package to plot pairs of data.
  2. This allows us to show on the same graph:
  3. The pairs of each features;
  4. The boxplot for each feature,
  5. The density plots of each feature;
  6. The histograms for each feature. \(~\)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(ggplot2)
library(caret)
RedWineQuality$quality <- as.factor(RedWineQuality$quality)
ggpairs(RedWineQuality, aes(colour = quality, alpha = 0.4))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(RedWineQuality, 
        aes(alpha=0.6, color = quality),
        upper = list(continuous = wrap("cor", size = 2)),
        diag = list(continuous = "barDiag"),
        lower = list(continuous = "smooth"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Box Plots - Quality#
#Box plots for each feature, grouped by quality#
featurePlot(x = RedWineQuality[, 1:11], 
            y = RedWineQuality$quality, plot = "box", 
            scales = list(x = list(relation="free"), y = list(relation="free")), 
            adjust = 1.5, pch = ".", 
            layout = c(4, 3), auto.key = list(columns = 3))

#Density Plots Quality#
#Densityplots for each feature, grouped by quality#
featurePlot(x = RedWineQuality[, 1:11], 
            y = RedWineQuality$quality, plot = "density", 
            scales = list(x = list(relation="free"), y = list(relation="free")), 
            adjust = 1.5, pch = ".", 
            layout = c(4, 3), auto.key = list(columns = 3))

*** ### Correlation ### Correlation between the attributes other than wine quality# \(~\)

#Import the dataset WineQuality -Read Database again in each script#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

winecor = cor(RedWineQuality[-12])
winecor
##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity           1.00000000     -0.256130895  0.67170343    0.114776724
## volatile.acidity       -0.25613089      1.000000000 -0.55249568    0.001917882
## citric.acid             0.67170343     -0.552495685  1.00000000    0.143577162
## residual.sugar          0.11477672      0.001917882  0.14357716    1.000000000
## chlorides               0.09370519      0.061297772  0.20382291    0.055609535
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813    0.187048995
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302    0.203027882
## density                 0.66804729      0.022026232  0.36494718    0.355283371
## pH                     -0.68297819      0.234937294 -0.54190414   -0.085652422
## sulphates               0.18300566     -0.260986685  0.31277004    0.005527121
## alcohol                -0.06166827     -0.202288027  0.10990325    0.042075437
##                         chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity         0.093705186        -0.153794193          -0.11318144
## volatile.acidity      0.061297772        -0.010503827           0.07647000
## citric.acid           0.203822914        -0.060978129           0.03553302
## residual.sugar        0.055609535         0.187048995           0.20302788
## chlorides             1.000000000         0.005562147           0.04740047
## free.sulfur.dioxide   0.005562147         1.000000000           0.66766645
## total.sulfur.dioxide  0.047400468         0.667666450           1.00000000
## density               0.200632327        -0.021945831           0.07126948
## pH                   -0.265026131         0.070377499          -0.06649456
## sulphates             0.371260481         0.051657572           0.04294684
## alcohol              -0.221140545        -0.069408354          -0.20565394
##                          density          pH    sulphates     alcohol
## fixed.acidity         0.66804729 -0.68297819  0.183005664 -0.06166827
## volatile.acidity      0.02202623  0.23493729 -0.260986685 -0.20228803
## citric.acid           0.36494718 -0.54190414  0.312770044  0.10990325
## residual.sugar        0.35528337 -0.08565242  0.005527121  0.04207544
## chlorides             0.20063233 -0.26502613  0.371260481 -0.22114054
## free.sulfur.dioxide  -0.02194583  0.07037750  0.051657572 -0.06940835
## total.sulfur.dioxide  0.07126948 -0.06649456  0.042946836 -0.20565394
## density               1.00000000 -0.34169933  0.148506412 -0.49617977
## pH                   -0.34169933  1.00000000 -0.196647602  0.20563251
## sulphates             0.14850641 -0.19664760  1.000000000  0.09359475
## alcohol              -0.49617977  0.20563251  0.093594750  1.00000000
#Corrgram#
#Use Corrplot to explore correlation between all the variables#
RedWineQuality %>% corrgram(lower.panel=panel.shade, upper.panel=panel.ellipse)

#Pearson correlation#
#Correlation between the features using cor function for Pearson correlation#
correlations <- cor(RedWineQuality,method="pearson")
corrplot(correlations, number.cex = 1, method = "square", 
         hclust.method = "complete", order = "AOE",
         type = "full", tl.cex=0.8,tl.col = "Black")

Classification for RedWineQuality

Purpose: This script conducts exploratory data analysis of data that was downloaded in 01-get_data.R and cleaned in 02_clean_data.R.


Divide the data to training and testing groups

\(~\)

#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]

table(TrainData$quality)
## 
##   3   4   5   6   7   8 
##  10  41 542 511 163  13
table(TestData$quality)
## 
##   4   5   6   7   8 
##  12 139 127  36   5
#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)
##  Named int [1:1281] 2 3 4 5 6 7 8 9 10 11 ...
##  - attr(*, "names")= chr [1:1281] "Resample11" "Resample12" "Resample13" "Resample14" ...
TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]

table(TrainData$quality)
## 
##   3   4   5   6   7   8 
##   8  44 544 511 157  17
table(TestData$quality)
## 
##   3   4   5   6   7   8 
##   2   9 137 127  42   1

Create a variable indicating if a wine is good or bad

\(~\)

TrainData$Good_Quality_Wine<-ifelse(TrainData$quality>5,"Good Quality Wine","Bad Quality Wine")

RF_Model<-randomForest(factor(Good_Quality_Wine)~.-quality,TrainData,ntree=150)
RF_Model
## 
## Call:
##  randomForest(formula = factor(Good_Quality_Wine) ~ . - quality,      data = TrainData, ntree = 150) 
##                Type of random forest: classification
##                      Number of trees: 150
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.74%
## Confusion matrix:
##                   Bad Quality Wine Good Quality Wine class.error
## Bad Quality Wine               474               122   0.2046980
## Good Quality Wine              118               567   0.1722628
library(randomForest)
importance    <- importance(RF_Model)

varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, importance), 
                           y = importance, fill = importance)) +
  geom_bar(stat="identity", color="steelblue", position=position_dodge())+
  theme_minimal() + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 3, colour = 'yellow') +
  ggtitle("Bar Chart with Most Discriminating Factor of Wine Quality")+
  labs(x = 'Discriminating Factor of Wine Quality',y="Importance") +
  coord_flip() + 
  theme_classic()

*** ### Classification Technique: Each Quality Score is Considered A Class \(~\)

#Import the dataset WineQuality#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

#Low:3-4, Medium:5-6, High:7-8 
RedWineQuality.orig = RedWineQuality # Save off original copy before making any alterations
## Change quality to categorical
RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
  if(x<= 4)  {"Low"}
  else if(x>= 7)  {"High"}
  else { "Medium"}   
})

RedWineQuality$quality = unlist(RedWineQuality$quality)
RedWineQuality$quality = as.factor(RedWineQuality$quality)

### Divide the data to training and testing groups###
#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]

table(TrainData$quality)
## 
##   High    Low Medium 
##    168     54   1071
table(TestData$quality)
## 
##   High    Low Medium 
##     49      9    248
#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)
##  Named int [1:1281] 2 3 4 5 6 8 9 10 11 14 ...
##  - attr(*, "names")= chr [1:1281] "Resample11" "Resample12" "Resample13" "Resample14" ...
TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]

table(TrainData$quality)
## 
##   High    Low Medium 
##    174     51   1056
table(TestData$quality)
## 
##   High    Low Medium 
##     43     12    263
###Classification Approaches to Quality Score###
#Decsion Three#
library(rpart)
Model_All <- quality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density +
  pH + sulphates + alcohol 
Model_Alchool_VA<- quality~ alcohol + volatile.acidity+sulphates

#Decsion Three#
Tree_All <- rpart(Model_All , method = "class", data=TrainData)
Tree_Alchool_VA <- rpart(Model_Alchool_VA, method = "class", data=TrainData)

#Tree_All 
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_All , box.palette = "RdBu", nn=TRUE)

#Test for Prediction 
Prediction_Train = predict(Tree_All,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High     91    0     83
##     Low       0    0     51
##     Medium   19    0   1037
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8806          
##                  95% CI : (0.8615, 0.8978)
##     No Information Rate : 0.9141          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4913          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.82727         NA        0.8856
## Specificity              0.92912    0.96019        0.8273
## Pos Pred Value           0.52299         NA        0.9820
## Neg Pred Value           0.98284         NA        0.4044
## Prevalence               0.08587    0.00000        0.9141
## Detection Rate           0.07104    0.00000        0.8095
## Detection Prevalence     0.13583    0.03981        0.8244
## Balanced Accuracy        0.87820         NA        0.8564
Prediction_Test = predict(Tree_All,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low Medium
##     High     16   0     27
##     Low       0   0     12
##     Medium    9   0    254
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8491          
##                  95% CI : (0.8049, 0.8866)
##     No Information Rate : 0.9214          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3361          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.64000         NA        0.8669
## Specificity              0.90785    0.96226        0.6400
## Pos Pred Value           0.37209         NA        0.9658
## Neg Pred Value           0.96727         NA        0.2909
## Prevalence               0.07862    0.00000        0.9214
## Detection Rate           0.05031    0.00000        0.7987
## Detection Prevalence     0.13522    0.03774        0.8270
## Balanced Accuracy        0.77392         NA        0.7534
library(rpart) #for trees
library(rpart.plot)

#Tree_Alchool_VA
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_Alchool_VA, box.palette = "RdBu",  nn=TRUE)

#Test for Prediction 
Prediction_Train_Alchool_VA= predict(Tree_Alchool_VA,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train_Alchool_VA)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High     66    0    108
##     Low       0    0     51
##     Medium   21    0   1035
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8595          
##                  95% CI : (0.8392, 0.8781)
##     No Information Rate : 0.9321          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3682          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.75862         NA        0.8668
## Specificity              0.90955    0.96019        0.7586
## Pos Pred Value           0.37931         NA        0.9801
## Neg Pred Value           0.98103         NA        0.2933
## Prevalence               0.06792    0.00000        0.9321
## Detection Rate           0.05152    0.00000        0.8080
## Detection Prevalence     0.13583    0.03981        0.8244
## Balanced Accuracy        0.83408         NA        0.8127
Prediction_Test_Alchool_VA= predict(Tree_Alchool_VA,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test_Alchool_VA)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low Medium
##     High      8   0     35
##     Low       0   0     12
##     Medium    7   0    256
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8302          
##                  95% CI : (0.7843, 0.8698)
##     No Information Rate : 0.9528          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.174           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.53333         NA        0.8449
## Specificity              0.88449    0.96226        0.5333
## Pos Pred Value           0.18605         NA        0.9734
## Neg Pred Value           0.97455         NA        0.1455
## Prevalence               0.04717    0.00000        0.9528
## Detection Rate           0.02516    0.00000        0.8050
## Detection Prevalence     0.13522    0.03774        0.8270
## Balanced Accuracy        0.70891         NA        0.6891

Devide into Two Groups

\(~\)

#Good Wine:x > 5, Bad Wine:Others

#Import the dataset WineQuality#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
  if(x>5)  {"Good Wine"}
  else {"Bad Wine"}   
})

RedWineQuality$quality = unlist(RedWineQuality$quality);
RedWineQuality$quality = as.factor(RedWineQuality$quality)

### Divide the data to training and testing groups###
#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]

#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)
##  Named int [1:1280] 1 3 4 5 8 9 10 11 12 14 ...
##  - attr(*, "names")= chr [1:1280] "Resample11" "Resample12" "Resample13" "Resample14" ...
TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]

table(TrainData$quality)
## 
##  Bad Wine Good Wine 
##       596       684
table(TestData$quality)
## 
##  Bad Wine Good Wine 
##       148       171
#Decsion Tree#
Model_All <- quality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density +
  pH + sulphates + alcohol 
Model_Alchool_VA<- quality~ alcohol + volatile.acidity+sulphates

#Decsion Tree#
Tree_All <- rpart(Model_All , method = "class", data=TrainData)
Tree_Alchool_VA <- rpart(Model_Alchool_VA, method = "class", data=TrainData)

#Tree_All 
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_All, box.palette = "RdBu", nn=TRUE)

#Test for Prediction 
Prediction_Train = predict(Tree_All,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       469       127
##   Good Wine      170       514
##                                           
##                Accuracy : 0.768           
##                  95% CI : (0.7439, 0.7908)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5359          
##                                           
##  Mcnemar's Test P-Value : 0.01481         
##                                           
##             Sensitivity : 0.7340          
##             Specificity : 0.8019          
##          Pos Pred Value : 0.7869          
##          Neg Pred Value : 0.7515          
##              Prevalence : 0.4992          
##          Detection Rate : 0.3664          
##    Detection Prevalence : 0.4656          
##       Balanced Accuracy : 0.7679          
##                                           
##        'Positive' Class : Bad Wine        
## 
Prediction_Test = predict(Tree_All,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       115        33
##   Good Wine       54       117
##                                           
##                Accuracy : 0.7273          
##                  95% CI : (0.6749, 0.7754)
##     No Information Rate : 0.5298          
##     P-Value [Acc > NIR] : 3.906e-13       
##                                           
##                   Kappa : 0.4569          
##                                           
##  Mcnemar's Test P-Value : 0.03201         
##                                           
##             Sensitivity : 0.6805          
##             Specificity : 0.7800          
##          Pos Pred Value : 0.7770          
##          Neg Pred Value : 0.6842          
##              Prevalence : 0.5298          
##          Detection Rate : 0.3605          
##    Detection Prevalence : 0.4639          
##       Balanced Accuracy : 0.7302          
##                                           
##        'Positive' Class : Bad Wine        
## 
#Tree_Alchool_VA 
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_Alchool_VA, box.palette = "RdBu", nn=TRUE)

#Test for Prediction 
Prediction_Train = predict(Tree_Alchool_VA,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       480       116
##   Good Wine      226       458
##                                           
##                Accuracy : 0.7328          
##                  95% CI : (0.7077, 0.7569)
##     No Information Rate : 0.5516          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4694          
##                                           
##  Mcnemar's Test P-Value : 3.769e-09       
##                                           
##             Sensitivity : 0.6799          
##             Specificity : 0.7979          
##          Pos Pred Value : 0.8054          
##          Neg Pred Value : 0.6696          
##              Prevalence : 0.5516          
##          Detection Rate : 0.3750          
##    Detection Prevalence : 0.4656          
##       Balanced Accuracy : 0.7389          
##                                           
##        'Positive' Class : Bad Wine        
## 
Prediction_Test = predict(Tree_Alchool_VA,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       120        28
##   Good Wine       71       100
##                                         
##                Accuracy : 0.6897        
##                  95% CI : (0.6357, 0.74)
##     No Information Rate : 0.5987        
##     P-Value [Acc > NIR] : 0.0004804     
##                                         
##                   Kappa : 0.388         
##                                         
##  Mcnemar's Test P-Value : 2.43e-05      
##                                         
##             Sensitivity : 0.6283        
##             Specificity : 0.7812        
##          Pos Pred Value : 0.8108        
##          Neg Pred Value : 0.5848        
##              Prevalence : 0.5987        
##          Detection Rate : 0.3762        
##    Detection Prevalence : 0.4639        
##       Balanced Accuracy : 0.7048        
##                                         
##        'Positive' Class : Bad Wine      
## 

References

Adrian Dragulescu and Cole Arendt (2020). xlsx: Read, Write, Format Excel 2007 and Excel 97/2000/XP/2003 Files. R package version 0.6.2. https://CRAN.R-project.org/package=xlsx

A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18–22.

Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon Ellis (2020). skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr (website), https://github.com/ropensci/skimr.

Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. URL http://www.jstatsoft.org/v40/i01/.

Hadley Wickham (2019). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.5. https://CRAN.R-project.org/package=rvest

Hadley Wickham, Jim Hester and Winston Chang (2020). devtools: Tools to Make Developing R Packages Easier. R package version 2.2.2. https://CRAN.R-project.org/package=devtools

Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.4. https://CRAN.R-project.org/package=dplyr

Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes

Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder

Kevin Wright (2018). corrgram: Plot a Correlogram. R package version 1.13. https://CRAN.R-project.org/package=corrgram

Max Kuhn (2020). caret: Classification and Regression Training. R package version 6.0-85. https://CRAN.R-project.org/package=caret

Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman

Sing T, Sander O, Beerenwinkel N, Lengauer T (2005). “ROCR: visualizing classifier performance in R.” Bioinformatics, 21(20), 7881. <URL: http://rocr.bioinf.mpi-sb.mpg.de>.

Stefan Milton Bache and Hadley Wickham (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5. https://CRAN.R-project.org/package=magrittr

Stephen Milborrow (2019). rpart.plot: Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. R package version 3.0.8. https://CRAN.R-project.org/package=rpart.plot

Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot

Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686