More Alcohol in Red Wine Please

Determining wine quality by modelling classifications in R language.

Abstract: This paper explores what makes a good wine through an anonymized dataset of chemical-based properties in red Vinho Verde wines from northern Portugal to determine correlations between wine elements and quality by using decision trees and random forest. We find that the random forest technique produces the desired results with the highest accuracy and demonstrates that the quality of wine is most influenced by its alcohol, sulphate and volatile acidity levels found in red wines.

License: MIT License. \(~\)

Get Data from UCI Machine Learning Repository

Purpose: This script gets data from UCI Machine Learning Repository and read online dataset

Workstation Setup

\(~\)

library(rvest)

## Loading required package: xml2

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()
## x purrr::pluck()          masks rvest::pluck()

library(dplyr)
library(gapminder)
library(skimr)
library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(devtools)

## Loading required package: usethis

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(corrplot)

## corrplot 0.84 loaded

library(corrgram)

## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus

## 
## Attaching package: 'corrgram'

## The following object is masked from 'package:plyr':
## 
##     baseball

library(rpart)
library(rpart.plot)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:corrgram':
## 
##     panel.fill

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(ROCR)
library(ggplot2)
library(ggthemes)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

# Import the dataset Red Wine Quality (1599 rows x12 columns) available in the following website http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv, note that the type of separator in this dataset = ;
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

Quick information about the package

\(~\)

RedWineQuality%>%
  glimpse()

## Rows: 1,599
## Columns: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.…
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600,…
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, …
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,…
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 1…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, …
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, …
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, …
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10…
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, …

# Idea of database - Details in EDA # 
summary(RedWineQuality)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Clean Data for Package/Database - RedWineQuality

Purpose: This script cleans data that was downloaded in 01-get_data.R.

Clean Data

Check for Missing Value

\(~\)

str(RedWineQuality)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

which(is.na(RedWineQuality))

## integer(0)

sum(is.na(RedWineQuality))

## [1] 0

#### There is no missing value from the dataset, the database is clean.####

EDA

Purpose: This script conducts exploratory data analysis of data that was downloaded in 01-get_data.R and cleaned in 02_clean_data.R

Study Data

General Idea

\(~\)

summary(RedWineQuality$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

table(RedWineQuality$quality)

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

#Use Skim to understand the databse#
library(skimr)
RedWineQuality %>% 
  skim() %>% 
  kable()

skim_type	skim_variable	complete_rate	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
numeric	fixed.acidity	1	8.3196373	1.7410963	4.60000	7.1000	7.90000	9.200000	15.90000	▂▇▂▁▁
numeric	volatile.acidity	1	0.5278205	0.1790597	0.12000	0.3900	0.52000	0.640000	1.58000	▅▇▂▁▁
numeric	citric.acid	1	0.2709756	0.1948011	0.00000	0.0900	0.26000	0.420000	1.00000	▇▆▅▁▁
numeric	residual.sugar	1	2.5388055	1.4099281	0.90000	1.9000	2.20000	2.600000	15.50000	▇▁▁▁▁
numeric	chlorides	1	0.0874665	0.0470653	0.01200	0.0700	0.07900	0.090000	0.61100	▇▁▁▁▁
numeric	free.sulfur.dioxide	1	15.8749218	10.4601570	1.00000	7.0000	14.00000	21.000000	72.00000	▇▅▁▁▁
numeric	total.sulfur.dioxide	1	46.4677924	32.8953245	6.00000	22.0000	38.00000	62.000000	289.00000	▇▂▁▁▁
numeric	density	1	0.9967467	0.0018873	0.99007	0.9956	0.99675	0.997835	1.00369	▁▃▇▂▁
numeric	pH	1	3.3111132	0.1543865	2.74000	3.2100	3.31000	3.400000	4.01000	▁▅▇▂▁
numeric	sulphates	1	0.6581488	0.1695070	0.33000	0.5500	0.62000	0.730000	2.00000	▇▅▁▁▁
numeric	alcohol	1	10.4229831	1.0656676	8.40000	9.5000	10.20000	11.100000	14.90000	▇▇▃▁▁
numeric	quality	1	5.6360225	0.8075694	3.00000	5.0000	6.00000	6.000000	8.00000	▁▇▇▂▁

Data Transformation### - In Appendix Section

find_skewness() calculates the skewness and finds the skewed data.

\(~\)

library(dlookr)

## Loading required package: mice

## 
## Attaching package: 'mice'

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## 
## Attaching package: 'dlookr'

## The following object is masked from 'package:base':
## 
##     transform

# find index of skewed variables
find_skewness(RedWineQuality)

## [1]  1  2  4  5  6  7 10 11

# find names of skewed variables
find_skewness(RedWineQuality, index = FALSE)

## [1] "fixed.acidity"        "volatile.acidity"     "residual.sugar"      
## [4] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide"
## [7] "sulphates"            "alcohol"

# compute the skewness
find_skewness(RedWineQuality, value = TRUE)

##        fixed.acidity     volatile.acidity          citric.acid 
##                0.982                0.671                0.318 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                4.536                5.675                1.249 
## total.sulfur.dioxide              density                   pH 
##                1.514                0.071                0.194 
##            sulphates              alcohol 
##                2.426                0.860

# compute the skewness & filtering with threshold
find_skewness(RedWineQuality, value = TRUE, thres = 0.1)

##        fixed.acidity     volatile.acidity          citric.acid 
##                0.982                0.671                0.318 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                4.536                5.675                1.249 
## total.sulfur.dioxide                   pH            sulphates 
##                1.514                0.194                2.426 
##              alcohol 
##                0.860

#Visualization of Numeric/Continuous Variables in RedWineQuality Dataset
RedWineQuality %>%
  gather() %>% 
  ggplot(aes(value,fill=key)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins=sqrt(nrow(RedWineQuality))) +
  theme(legend.position="none")

#### Mostly skewed distribution

Visualization of Numeric/Continuous Variables in RedWineQuality Dataset (X Axis Log Scale)

Log transform the variable to make the x axis on a log scale.

\(~\)

RedWineQuality %>%
  gather() %>% 
  ggplot(aes(value,fill=key)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins=sqrt(nrow(RedWineQuality))) +
  theme(legend.position="none") +
  scale_x_continuous(trans='log10')

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 132 rows containing non-finite values (stat_bin).

###1: Transformation introduced infinite values in continuous x-axis 
###2: Removed 132 rows containing non-finite values (stat_bin). 
###3: The skewed distributions from before visually look closer to normal. 
###4: The effect of transforming a (right tail) skewed distribution with a log transformation.

Study Overall Quality

\(~\)

# Calculate the percentage of each Quality Score # 
Percentage_Quality <- table(RedWineQuality$quality)
prop_table <- prop.table(Percentage_Quality)

# Make the prop_table to dataframe
prop_table.df <- as.data.frame(prop_table )
prop_table.df

##   Var1        Freq
## 1    3 0.006253909
## 2    4 0.033145716
## 3    5 0.425891182
## 4    6 0.398999375
## 5    7 0.124452783
## 6    8 0.011257036

#Quality of Portuguese Vinho Verde Red Wine#
# Make pie chart#
library(ggplot2)
library(ggrepel) #Avoid label overlap in pie chart:Replace geom_text by geom_text_repel
ggplot(prop_table.df, aes(x = 2, y = Freq, fill =  Var1 )) +
  geom_bar(stat = "identity", color = "white") +
  geom_text_repel(aes(label = paste(round(Freq * 100, 1), "%")),size=3, position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y", start = 0)+
  theme_void()+
  xlim(0.5, 2.5)+
  labs(
    title = "Quality of Portuguese Vinho Verde Red Wine",
    caption = "Source: UCI Machine Learning Repository",
    fill = "Score Between 0 and 10")

#Use this Graph to talk about data weakness#
summary(RedWineQuality$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

ggplot(aes(x=quality), data=RedWineQuality)+
  geom_histogram(binwidth = 1, fill='Red', color='Yellow')+
  labs(title="Quality of Portuguese Vinho Verde Red Wine",subtitle= "Score Between 0 and 10", y="Counts", x = "Quality Score ",caption = "Source: UCI Machine Learning Repository")

#### There is more 5-6 Score Level red wine than l3-4 and 7-8 Score Level red wine. 
#### Use dplyr to sort non-normal distribution variables by p_value
#### Above plot shows what we have inferred previously, that good wines were outnumbered by bad wines by a large margin. Most wines were mediocre (rated 5 or 6), but we could also see that there are some poor wines (3 or 4). A vast majority of good wines has a quality rating of 7.

Data Exploration

Classification Technique: Each Quality Score is Considered A Class

Low:3-4, Medium:5-6, High:7-8

\(~\)

RedWineQuality.orig = RedWineQuality # Save off original copy before making any alterations
## Change quality to categorical
RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
  if(x<= 4)  {"Low"}
  else if(x>= 7)  {"High"}
  else { "Medium"}   
})

RedWineQuality$quality = unlist(RedWineQuality$quality)
RedWineQuality$quality = as.factor(RedWineQuality$quality)

Pairs of Data

Set up the target value as quality,using GGally package to plot pairs of data.
This allows us to show on the same graph:
The pairs of each features;
The boxplot for each feature,
The density plots of each feature;
The histograms for each feature. \(~\)

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(ggplot2)
library(caret)
RedWineQuality$quality <- as.factor(RedWineQuality$quality)
ggpairs(RedWineQuality, aes(colour = quality, alpha = 0.4))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(RedWineQuality, 
        aes(alpha=0.6, color = quality),
        upper = list(continuous = wrap("cor", size = 2)),
        diag = list(continuous = "barDiag"),
        lower = list(continuous = "smooth"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Box Plots - Quality#
#Box plots for each feature, grouped by quality#
featurePlot(x = RedWineQuality[, 1:11], 
            y = RedWineQuality$quality, plot = "box", 
            scales = list(x = list(relation="free"), y = list(relation="free")), 
            adjust = 1.5, pch = ".", 
            layout = c(4, 3), auto.key = list(columns = 3))

#Density Plots Quality#
#Densityplots for each feature, grouped by quality#
featurePlot(x = RedWineQuality[, 1:11], 
            y = RedWineQuality$quality, plot = "density", 
            scales = list(x = list(relation="free"), y = list(relation="free")), 
            adjust = 1.5, pch = ".", 
            layout = c(4, 3), auto.key = list(columns = 3))

*** ### Correlation ### Correlation between the attributes other than wine quality# \(~\)

#Import the dataset WineQuality -Read Database again in each script#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

winecor = cor(RedWineQuality[-12])
winecor

##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity           1.00000000     -0.256130895  0.67170343    0.114776724
## volatile.acidity       -0.25613089      1.000000000 -0.55249568    0.001917882
## citric.acid             0.67170343     -0.552495685  1.00000000    0.143577162
## residual.sugar          0.11477672      0.001917882  0.14357716    1.000000000
## chlorides               0.09370519      0.061297772  0.20382291    0.055609535
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813    0.187048995
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302    0.203027882
## density                 0.66804729      0.022026232  0.36494718    0.355283371
## pH                     -0.68297819      0.234937294 -0.54190414   -0.085652422
## sulphates               0.18300566     -0.260986685  0.31277004    0.005527121
## alcohol                -0.06166827     -0.202288027  0.10990325    0.042075437
##                         chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity         0.093705186        -0.153794193          -0.11318144
## volatile.acidity      0.061297772        -0.010503827           0.07647000
## citric.acid           0.203822914        -0.060978129           0.03553302
## residual.sugar        0.055609535         0.187048995           0.20302788
## chlorides             1.000000000         0.005562147           0.04740047
## free.sulfur.dioxide   0.005562147         1.000000000           0.66766645
## total.sulfur.dioxide  0.047400468         0.667666450           1.00000000
## density               0.200632327        -0.021945831           0.07126948
## pH                   -0.265026131         0.070377499          -0.06649456
## sulphates             0.371260481         0.051657572           0.04294684
## alcohol              -0.221140545        -0.069408354          -0.20565394
##                          density          pH    sulphates     alcohol
## fixed.acidity         0.66804729 -0.68297819  0.183005664 -0.06166827
## volatile.acidity      0.02202623  0.23493729 -0.260986685 -0.20228803
## citric.acid           0.36494718 -0.54190414  0.312770044  0.10990325
## residual.sugar        0.35528337 -0.08565242  0.005527121  0.04207544
## chlorides             0.20063233 -0.26502613  0.371260481 -0.22114054
## free.sulfur.dioxide  -0.02194583  0.07037750  0.051657572 -0.06940835
## total.sulfur.dioxide  0.07126948 -0.06649456  0.042946836 -0.20565394
## density               1.00000000 -0.34169933  0.148506412 -0.49617977
## pH                   -0.34169933  1.00000000 -0.196647602  0.20563251
## sulphates             0.14850641 -0.19664760  1.000000000  0.09359475
## alcohol              -0.49617977  0.20563251  0.093594750  1.00000000

#Corrgram#
#Use Corrplot to explore correlation between all the variables#
RedWineQuality %>% corrgram(lower.panel=panel.shade, upper.panel=panel.ellipse)

#Pearson correlation#
#Correlation between the features using cor function for Pearson correlation#
correlations <- cor(RedWineQuality,method="pearson")
corrplot(correlations, number.cex = 1, method = "square", 
         hclust.method = "complete", order = "AOE",
         type = "full", tl.cex=0.8,tl.col = "Black")

Classification for RedWineQuality

Purpose: This script conducts exploratory data analysis of data that was downloaded in 01-get_data.R and cleaned in 02_clean_data.R.

Divide the data to training and testing groups

\(~\)

#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]

table(TrainData$quality)

## 
##   3   4   5   6   7   8 
##  10  41 542 511 163  13

table(TestData$quality)

## 
##   4   5   6   7   8 
##  12 139 127  36   5

#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)

##  Named int [1:1281] 2 3 4 5 6 7 8 9 10 11 ...
##  - attr(*, "names")= chr [1:1281] "Resample11" "Resample12" "Resample13" "Resample14" ...

TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]

table(TrainData$quality)

## 
##   3   4   5   6   7   8 
##   8  44 544 511 157  17

table(TestData$quality)

## 
##   3   4   5   6   7   8 
##   2   9 137 127  42   1

Create a variable indicating if a wine is good or bad

\(~\)

TrainData$Good_Quality_Wine<-ifelse(TrainData$quality>5,"Good Quality Wine","Bad Quality Wine")

RF_Model<-randomForest(factor(Good_Quality_Wine)~.-quality,TrainData,ntree=150)
RF_Model

## 
## Call:
##  randomForest(formula = factor(Good_Quality_Wine) ~ . - quality,      data = TrainData, ntree = 150) 
##                Type of random forest: classification
##                      Number of trees: 150
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.74%
## Confusion matrix:
##                   Bad Quality Wine Good Quality Wine class.error
## Bad Quality Wine               474               122   0.2046980
## Good Quality Wine              118               567   0.1722628

library(randomForest)
importance    <- importance(RF_Model)

varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, importance), 
                           y = importance, fill = importance)) +
  geom_bar(stat="identity", color="steelblue", position=position_dodge())+
  theme_minimal() + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 3, colour = 'yellow') +
  ggtitle("Bar Chart with Most Discriminating Factor of Wine Quality")+
  labs(x = 'Discriminating Factor of Wine Quality',y="Importance") +
  coord_flip() + 
  theme_classic()

*** ### Classification Technique: Each Quality Score is Considered A Class \(~\)

#Import the dataset WineQuality#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

#Low:3-4, Medium:5-6, High:7-8 
RedWineQuality.orig = RedWineQuality # Save off original copy before making any alterations
## Change quality to categorical
RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
  if(x<= 4)  {"Low"}
  else if(x>= 7)  {"High"}
  else { "Medium"}   
})

RedWineQuality$quality = unlist(RedWineQuality$quality)
RedWineQuality$quality = as.factor(RedWineQuality$quality)

### Divide the data to training and testing groups###
#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]

table(TrainData$quality)

## 
##   High    Low Medium 
##    168     54   1071

table(TestData$quality)

## 
##   High    Low Medium 
##     49      9    248

#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)

##  Named int [1:1281] 2 3 4 5 6 8 9 10 11 14 ...
##  - attr(*, "names")= chr [1:1281] "Resample11" "Resample12" "Resample13" "Resample14" ...

TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]

table(TrainData$quality)

## 
##   High    Low Medium 
##    174     51   1056

table(TestData$quality)

## 
##   High    Low Medium 
##     43     12    263

###Classification Approaches to Quality Score###
#Decsion Three#
library(rpart)
Model_All <- quality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density +
  pH + sulphates + alcohol 
Model_Alchool_VA<- quality~ alcohol + volatile.acidity+sulphates

#Decsion Three#
Tree_All <- rpart(Model_All , method = "class", data=TrainData)
Tree_Alchool_VA <- rpart(Model_Alchool_VA, method = "class", data=TrainData)

#Tree_All 
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_All , box.palette = "RdBu", nn=TRUE)

#Test for Prediction 
Prediction_Train = predict(Tree_All,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High     91    0     83
##     Low       0    0     51
##     Medium   19    0   1037
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8806          
##                  95% CI : (0.8615, 0.8978)
##     No Information Rate : 0.9141          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4913          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.82727         NA        0.8856
## Specificity              0.92912    0.96019        0.8273
## Pos Pred Value           0.52299         NA        0.9820
## Neg Pred Value           0.98284         NA        0.4044
## Prevalence               0.08587    0.00000        0.9141
## Detection Rate           0.07104    0.00000        0.8095
## Detection Prevalence     0.13583    0.03981        0.8244
## Balanced Accuracy        0.87820         NA        0.8564

Prediction_Test = predict(Tree_All,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test )

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low Medium
##     High     16   0     27
##     Low       0   0     12
##     Medium    9   0    254
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8491          
##                  95% CI : (0.8049, 0.8866)
##     No Information Rate : 0.9214          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3361          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.64000         NA        0.8669
## Specificity              0.90785    0.96226        0.6400
## Pos Pred Value           0.37209         NA        0.9658
## Neg Pred Value           0.96727         NA        0.2909
## Prevalence               0.07862    0.00000        0.9214
## Detection Rate           0.05031    0.00000        0.7987
## Detection Prevalence     0.13522    0.03774        0.8270
## Balanced Accuracy        0.77392         NA        0.7534

library(rpart) #for trees
library(rpart.plot)

#Tree_Alchool_VA
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_Alchool_VA, box.palette = "RdBu",  nn=TRUE)

#Test for Prediction 
Prediction_Train_Alchool_VA= predict(Tree_Alchool_VA,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train_Alchool_VA)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High     66    0    108
##     Low       0    0     51
##     Medium   21    0   1035
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8595          
##                  95% CI : (0.8392, 0.8781)
##     No Information Rate : 0.9321          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3682          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.75862         NA        0.8668
## Specificity              0.90955    0.96019        0.7586
## Pos Pred Value           0.37931         NA        0.9801
## Neg Pred Value           0.98103         NA        0.2933
## Prevalence               0.06792    0.00000        0.9321
## Detection Rate           0.05152    0.00000        0.8080
## Detection Prevalence     0.13583    0.03981        0.8244
## Balanced Accuracy        0.83408         NA        0.8127

Prediction_Test_Alchool_VA= predict(Tree_Alchool_VA,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test_Alchool_VA)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low Medium
##     High      8   0     35
##     Low       0   0     12
##     Medium    7   0    256
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8302          
##                  95% CI : (0.7843, 0.8698)
##     No Information Rate : 0.9528          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.174           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity              0.53333         NA        0.8449
## Specificity              0.88449    0.96226        0.5333
## Pos Pred Value           0.18605         NA        0.9734
## Neg Pred Value           0.97455         NA        0.1455
## Prevalence               0.04717    0.00000        0.9528
## Detection Rate           0.02516    0.00000        0.8050
## Detection Prevalence     0.13522    0.03774        0.8270
## Balanced Accuracy        0.70891         NA        0.6891

Devide into Two Groups

\(~\)

#Good Wine:x > 5, Bad Wine:Others

#Import the dataset WineQuality#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )

RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
  if(x>5)  {"Good Wine"}
  else {"Bad Wine"}   
})

RedWineQuality$quality = unlist(RedWineQuality$quality);
RedWineQuality$quality = as.factor(RedWineQuality$quality)

### Divide the data to training and testing groups###
#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]

#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)

##  Named int [1:1280] 1 3 4 5 8 9 10 11 12 14 ...
##  - attr(*, "names")= chr [1:1280] "Resample11" "Resample12" "Resample13" "Resample14" ...

TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]

table(TrainData$quality)

## 
##  Bad Wine Good Wine 
##       596       684

table(TestData$quality)

## 
##  Bad Wine Good Wine 
##       148       171

#Decsion Tree#
Model_All <- quality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density +
  pH + sulphates + alcohol 
Model_Alchool_VA<- quality~ alcohol + volatile.acidity+sulphates

#Decsion Tree#
Tree_All <- rpart(Model_All , method = "class", data=TrainData)
Tree_Alchool_VA <- rpart(Model_Alchool_VA, method = "class", data=TrainData)

#Tree_All 
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_All, box.palette = "RdBu", nn=TRUE)

#Test for Prediction 
Prediction_Train = predict(Tree_All,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       469       127
##   Good Wine      170       514
##                                           
##                Accuracy : 0.768           
##                  95% CI : (0.7439, 0.7908)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5359          
##                                           
##  Mcnemar's Test P-Value : 0.01481         
##                                           
##             Sensitivity : 0.7340          
##             Specificity : 0.8019          
##          Pos Pred Value : 0.7869          
##          Neg Pred Value : 0.7515          
##              Prevalence : 0.4992          
##          Detection Rate : 0.3664          
##    Detection Prevalence : 0.4656          
##       Balanced Accuracy : 0.7679          
##                                           
##        'Positive' Class : Bad Wine        
##

Prediction_Test = predict(Tree_All,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       115        33
##   Good Wine       54       117
##                                           
##                Accuracy : 0.7273          
##                  95% CI : (0.6749, 0.7754)
##     No Information Rate : 0.5298          
##     P-Value [Acc > NIR] : 3.906e-13       
##                                           
##                   Kappa : 0.4569          
##                                           
##  Mcnemar's Test P-Value : 0.03201         
##                                           
##             Sensitivity : 0.6805          
##             Specificity : 0.7800          
##          Pos Pred Value : 0.7770          
##          Neg Pred Value : 0.6842          
##              Prevalence : 0.5298          
##          Detection Rate : 0.3605          
##    Detection Prevalence : 0.4639          
##       Balanced Accuracy : 0.7302          
##                                           
##        'Positive' Class : Bad Wine        
##

#Tree_Alchool_VA 
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_Alchool_VA, box.palette = "RdBu", nn=TRUE)

#Test for Prediction 
Prediction_Train = predict(Tree_Alchool_VA,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       480       116
##   Good Wine      226       458
##                                           
##                Accuracy : 0.7328          
##                  95% CI : (0.7077, 0.7569)
##     No Information Rate : 0.5516          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4694          
##                                           
##  Mcnemar's Test P-Value : 3.769e-09       
##                                           
##             Sensitivity : 0.6799          
##             Specificity : 0.7979          
##          Pos Pred Value : 0.8054          
##          Neg Pred Value : 0.6696          
##              Prevalence : 0.5516          
##          Detection Rate : 0.3750          
##    Detection Prevalence : 0.4656          
##       Balanced Accuracy : 0.7389          
##                                           
##        'Positive' Class : Bad Wine        
##

Prediction_Test = predict(Tree_Alchool_VA,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Bad Wine Good Wine
##   Bad Wine       120        28
##   Good Wine       71       100
##                                         
##                Accuracy : 0.6897        
##                  95% CI : (0.6357, 0.74)
##     No Information Rate : 0.5987        
##     P-Value [Acc > NIR] : 0.0004804     
##                                         
##                   Kappa : 0.388         
##                                         
##  Mcnemar's Test P-Value : 2.43e-05      
##                                         
##             Sensitivity : 0.6283        
##             Specificity : 0.7812        
##          Pos Pred Value : 0.8108        
##          Neg Pred Value : 0.5848        
##              Prevalence : 0.5987        
##          Detection Rate : 0.3762        
##    Detection Prevalence : 0.4639        
##       Balanced Accuracy : 0.7048        
##                                         
##        'Positive' Class : Bad Wine      
##

References

Adrian Dragulescu and Cole Arendt (2020). xlsx: Read, Write, Format Excel 2007 and Excel 97/2000/XP/2003 Files. R package version 0.6.2. https://CRAN.R-project.org/package=xlsx

A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18–22.

Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon Ellis (2020). skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr (website), https://github.com/ropensci/skimr.

Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. URL http://www.jstatsoft.org/v40/i01/.

Hadley Wickham (2019). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.5. https://CRAN.R-project.org/package=rvest

Hadley Wickham, Jim Hester and Winston Chang (2020). devtools: Tools to Make Developing R Packages Easier. R package version 2.2.2. https://CRAN.R-project.org/package=devtools

Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.4. https://CRAN.R-project.org/package=dplyr

Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes

Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder

Kevin Wright (2018). corrgram: Plot a Correlogram. R package version 1.13. https://CRAN.R-project.org/package=corrgram

Max Kuhn (2020). caret: Classification and Regression Training. R package version 6.0-85. https://CRAN.R-project.org/package=caret

Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman

Sing T, Sander O, Beerenwinkel N, Lengauer T (2005). “ROCR: visualizing classifier performance in R.” Bioinformatics, 21(20), 7881. <URL: http://rocr.bioinf.mpi-sb.mpg.de>.

Stefan Milton Bache and Hadley Wickham (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5. https://CRAN.R-project.org/package=magrittr

Stephen Milborrow (2019). rpart.plot: Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. R package version 3.0.8. https://CRAN.R-project.org/package=rpart.plot

Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot

Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

More Alcohol in Red Wine Please

Angela Cao

22/02/2020

Determining wine quality by modelling classifications in R language.

Get Data from UCI Machine Learning Repository

Workstation Setup

Quick information about the package

Clean Data for Package/Database - RedWineQuality

Clean Data

Check for Missing Value

EDA

Study Data

General Idea

Data Transformation### - In Appendix Section

find_skewness() calculates the skewness and finds the skewed data.

Visualization of Numeric/Continuous Variables in RedWineQuality Dataset (X Axis Log Scale)

Log transform the variable to make the x axis on a log scale.

Study Overall Quality

Data Exploration

Classification Technique: Each Quality Score is Considered A Class

Low:3-4, Medium:5-6, High:7-8

Pairs of Data

Classification for RedWineQuality

Divide the data to training and testing groups

Create a variable indicating if a wine is good or bad

Devide into Two Groups

References