Abstract: This paper explores what makes a good wine through an anonymized dataset of chemical-based properties in red Vinho Verde wines from northern Portugal to determine correlations between wine elements and quality by using decision trees and random forest. We find that the random forest technique produces the desired results with the highest accuracy and demonstrates that the quality of wine is most influenced by its alcohol, sulphate and volatile acidity levels found in red wines.
License: MIT License. \(~\)
Purpose: This script gets data from UCI Machine Learning Repository and read online dataset
\(~\)
library(rvest)
## Loading required package: xml2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.3 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x purrr::pluck() masks rvest::pluck()
library(dplyr)
library(gapminder)
library(skimr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(devtools)
## Loading required package: usethis
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(corrplot)
## corrplot 0.84 loaded
library(corrgram)
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
##
## baseball
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR)
library(ggplot2)
library(ggthemes)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Import the dataset Red Wine Quality (1599 rows x12 columns) available in the following website http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv, note that the type of separator in this dataset = ;
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )
\(~\)
RedWineQuality%>%
glimpse()
## Rows: 1,599
## Columns: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.…
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600,…
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, …
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,…
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 1…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, …
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, …
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, …
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10…
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, …
# Idea of database - Details in EDA #
summary(RedWineQuality)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Purpose: This script cleans data that was downloaded in 01-get_data.R.
\(~\)
str(RedWineQuality)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
which(is.na(RedWineQuality))
## integer(0)
sum(is.na(RedWineQuality))
## [1] 0
#### There is no missing value from the dataset, the database is clean.####
Purpose: This script conducts exploratory data analysis of data that was downloaded in 01-get_data.R and cleaned in 02_clean_data.R
\(~\)
summary(RedWineQuality$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
table(RedWineQuality$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
#Use Skim to understand the databse#
library(skimr)
RedWineQuality %>%
skim() %>%
kable()
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | fixed.acidity | 0 | 1 | 8.3196373 | 1.7410963 | 4.60000 | 7.1000 | 7.90000 | 9.200000 | 15.90000 | ▂▇▂▁▁ |
| numeric | volatile.acidity | 0 | 1 | 0.5278205 | 0.1790597 | 0.12000 | 0.3900 | 0.52000 | 0.640000 | 1.58000 | ▅▇▂▁▁ |
| numeric | citric.acid | 0 | 1 | 0.2709756 | 0.1948011 | 0.00000 | 0.0900 | 0.26000 | 0.420000 | 1.00000 | ▇▆▅▁▁ |
| numeric | residual.sugar | 0 | 1 | 2.5388055 | 1.4099281 | 0.90000 | 1.9000 | 2.20000 | 2.600000 | 15.50000 | ▇▁▁▁▁ |
| numeric | chlorides | 0 | 1 | 0.0874665 | 0.0470653 | 0.01200 | 0.0700 | 0.07900 | 0.090000 | 0.61100 | ▇▁▁▁▁ |
| numeric | free.sulfur.dioxide | 0 | 1 | 15.8749218 | 10.4601570 | 1.00000 | 7.0000 | 14.00000 | 21.000000 | 72.00000 | ▇▅▁▁▁ |
| numeric | total.sulfur.dioxide | 0 | 1 | 46.4677924 | 32.8953245 | 6.00000 | 22.0000 | 38.00000 | 62.000000 | 289.00000 | ▇▂▁▁▁ |
| numeric | density | 0 | 1 | 0.9967467 | 0.0018873 | 0.99007 | 0.9956 | 0.99675 | 0.997835 | 1.00369 | ▁▃▇▂▁ |
| numeric | pH | 0 | 1 | 3.3111132 | 0.1543865 | 2.74000 | 3.2100 | 3.31000 | 3.400000 | 4.01000 | ▁▅▇▂▁ |
| numeric | sulphates | 0 | 1 | 0.6581488 | 0.1695070 | 0.33000 | 0.5500 | 0.62000 | 0.730000 | 2.00000 | ▇▅▁▁▁ |
| numeric | alcohol | 0 | 1 | 10.4229831 | 1.0656676 | 8.40000 | 9.5000 | 10.20000 | 11.100000 | 14.90000 | ▇▇▃▁▁ |
| numeric | quality | 0 | 1 | 5.6360225 | 0.8075694 | 3.00000 | 5.0000 | 6.00000 | 6.000000 | 8.00000 | ▁▇▇▂▁ |
\(~\)
library(dlookr)
## Loading required package: mice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
##
## transform
# find index of skewed variables
find_skewness(RedWineQuality)
## [1] 1 2 4 5 6 7 10 11
# find names of skewed variables
find_skewness(RedWineQuality, index = FALSE)
## [1] "fixed.acidity" "volatile.acidity" "residual.sugar"
## [4] "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide"
## [7] "sulphates" "alcohol"
# compute the skewness
find_skewness(RedWineQuality, value = TRUE)
## fixed.acidity volatile.acidity citric.acid
## 0.982 0.671 0.318
## residual.sugar chlorides free.sulfur.dioxide
## 4.536 5.675 1.249
## total.sulfur.dioxide density pH
## 1.514 0.071 0.194
## sulphates alcohol
## 2.426 0.860
# compute the skewness & filtering with threshold
find_skewness(RedWineQuality, value = TRUE, thres = 0.1)
## fixed.acidity volatile.acidity citric.acid
## 0.982 0.671 0.318
## residual.sugar chlorides free.sulfur.dioxide
## 4.536 5.675 1.249
## total.sulfur.dioxide pH sulphates
## 1.514 0.194 2.426
## alcohol
## 0.860
#Visualization of Numeric/Continuous Variables in RedWineQuality Dataset
RedWineQuality %>%
gather() %>%
ggplot(aes(value,fill=key)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins=sqrt(nrow(RedWineQuality))) +
theme(legend.position="none")
#### Mostly skewed distribution
\(~\)
RedWineQuality %>%
gather() %>%
ggplot(aes(value,fill=key)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins=sqrt(nrow(RedWineQuality))) +
theme(legend.position="none") +
scale_x_continuous(trans='log10')
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 132 rows containing non-finite values (stat_bin).
###1: Transformation introduced infinite values in continuous x-axis
###2: Removed 132 rows containing non-finite values (stat_bin).
###3: The skewed distributions from before visually look closer to normal.
###4: The effect of transforming a (right tail) skewed distribution with a log transformation.
\(~\)
# Calculate the percentage of each Quality Score #
Percentage_Quality <- table(RedWineQuality$quality)
prop_table <- prop.table(Percentage_Quality)
# Make the prop_table to dataframe
prop_table.df <- as.data.frame(prop_table )
prop_table.df
## Var1 Freq
## 1 3 0.006253909
## 2 4 0.033145716
## 3 5 0.425891182
## 4 6 0.398999375
## 5 7 0.124452783
## 6 8 0.011257036
#Quality of Portuguese Vinho Verde Red Wine#
# Make pie chart#
library(ggplot2)
library(ggrepel) #Avoid label overlap in pie chart:Replace geom_text by geom_text_repel
ggplot(prop_table.df, aes(x = 2, y = Freq, fill = Var1 )) +
geom_bar(stat = "identity", color = "white") +
geom_text_repel(aes(label = paste(round(Freq * 100, 1), "%")),size=3, position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y", start = 0)+
theme_void()+
xlim(0.5, 2.5)+
labs(
title = "Quality of Portuguese Vinho Verde Red Wine",
caption = "Source: UCI Machine Learning Repository",
fill = "Score Between 0 and 10")
#Use this Graph to talk about data weakness#
summary(RedWineQuality$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
ggplot(aes(x=quality), data=RedWineQuality)+
geom_histogram(binwidth = 1, fill='Red', color='Yellow')+
labs(title="Quality of Portuguese Vinho Verde Red Wine",subtitle= "Score Between 0 and 10", y="Counts", x = "Quality Score ",caption = "Source: UCI Machine Learning Repository")
#### There is more 5-6 Score Level red wine than l3-4 and 7-8 Score Level red wine.
#### Use dplyr to sort non-normal distribution variables by p_value
#### Above plot shows what we have inferred previously, that good wines were outnumbered by bad wines by a large margin. Most wines were mediocre (rated 5 or 6), but we could also see that there are some poor wines (3 or 4). A vast majority of good wines has a quality rating of 7.
\(~\)
RedWineQuality.orig = RedWineQuality # Save off original copy before making any alterations
## Change quality to categorical
RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
if(x<= 4) {"Low"}
else if(x>= 7) {"High"}
else { "Medium"}
})
RedWineQuality$quality = unlist(RedWineQuality$quality)
RedWineQuality$quality = as.factor(RedWineQuality$quality)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggplot2)
library(caret)
RedWineQuality$quality <- as.factor(RedWineQuality$quality)
ggpairs(RedWineQuality, aes(colour = quality, alpha = 0.4))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(RedWineQuality,
aes(alpha=0.6, color = quality),
upper = list(continuous = wrap("cor", size = 2)),
diag = list(continuous = "barDiag"),
lower = list(continuous = "smooth"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Box Plots - Quality#
#Box plots for each feature, grouped by quality#
featurePlot(x = RedWineQuality[, 1:11],
y = RedWineQuality$quality, plot = "box",
scales = list(x = list(relation="free"), y = list(relation="free")),
adjust = 1.5, pch = ".",
layout = c(4, 3), auto.key = list(columns = 3))
#Density Plots Quality#
#Densityplots for each feature, grouped by quality#
featurePlot(x = RedWineQuality[, 1:11],
y = RedWineQuality$quality, plot = "density",
scales = list(x = list(relation="free"), y = list(relation="free")),
adjust = 1.5, pch = ".",
layout = c(4, 3), auto.key = list(columns = 3))
*** ### Correlation ### Correlation between the attributes other than wine quality# \(~\)
#Import the dataset WineQuality -Read Database again in each script#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )
winecor = cor(RedWineQuality[-12])
winecor
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00000000 -0.256130895 0.67170343 0.114776724
## volatile.acidity -0.25613089 1.000000000 -0.55249568 0.001917882
## citric.acid 0.67170343 -0.552495685 1.00000000 0.143577162
## residual.sugar 0.11477672 0.001917882 0.14357716 1.000000000
## chlorides 0.09370519 0.061297772 0.20382291 0.055609535
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813 0.187048995
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302 0.203027882
## density 0.66804729 0.022026232 0.36494718 0.355283371
## pH -0.68297819 0.234937294 -0.54190414 -0.085652422
## sulphates 0.18300566 -0.260986685 0.31277004 0.005527121
## alcohol -0.06166827 -0.202288027 0.10990325 0.042075437
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity 0.093705186 -0.153794193 -0.11318144
## volatile.acidity 0.061297772 -0.010503827 0.07647000
## citric.acid 0.203822914 -0.060978129 0.03553302
## residual.sugar 0.055609535 0.187048995 0.20302788
## chlorides 1.000000000 0.005562147 0.04740047
## free.sulfur.dioxide 0.005562147 1.000000000 0.66766645
## total.sulfur.dioxide 0.047400468 0.667666450 1.00000000
## density 0.200632327 -0.021945831 0.07126948
## pH -0.265026131 0.070377499 -0.06649456
## sulphates 0.371260481 0.051657572 0.04294684
## alcohol -0.221140545 -0.069408354 -0.20565394
## density pH sulphates alcohol
## fixed.acidity 0.66804729 -0.68297819 0.183005664 -0.06166827
## volatile.acidity 0.02202623 0.23493729 -0.260986685 -0.20228803
## citric.acid 0.36494718 -0.54190414 0.312770044 0.10990325
## residual.sugar 0.35528337 -0.08565242 0.005527121 0.04207544
## chlorides 0.20063233 -0.26502613 0.371260481 -0.22114054
## free.sulfur.dioxide -0.02194583 0.07037750 0.051657572 -0.06940835
## total.sulfur.dioxide 0.07126948 -0.06649456 0.042946836 -0.20565394
## density 1.00000000 -0.34169933 0.148506412 -0.49617977
## pH -0.34169933 1.00000000 -0.196647602 0.20563251
## sulphates 0.14850641 -0.19664760 1.000000000 0.09359475
## alcohol -0.49617977 0.20563251 0.093594750 1.00000000
#Corrgram#
#Use Corrplot to explore correlation between all the variables#
RedWineQuality %>% corrgram(lower.panel=panel.shade, upper.panel=panel.ellipse)
#Pearson correlation#
#Correlation between the features using cor function for Pearson correlation#
correlations <- cor(RedWineQuality,method="pearson")
corrplot(correlations, number.cex = 1, method = "square",
hclust.method = "complete", order = "AOE",
type = "full", tl.cex=0.8,tl.col = "Black")
Purpose: This script conducts exploratory data analysis of data that was downloaded in 01-get_data.R and cleaned in 02_clean_data.R.
\(~\)
#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]
table(TrainData$quality)
##
## 3 4 5 6 7 8
## 10 41 542 511 163 13
table(TestData$quality)
##
## 4 5 6 7 8
## 12 139 127 36 5
#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)
## Named int [1:1281] 2 3 4 5 6 7 8 9 10 11 ...
## - attr(*, "names")= chr [1:1281] "Resample11" "Resample12" "Resample13" "Resample14" ...
TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]
table(TrainData$quality)
##
## 3 4 5 6 7 8
## 8 44 544 511 157 17
table(TestData$quality)
##
## 3 4 5 6 7 8
## 2 9 137 127 42 1
\(~\)
TrainData$Good_Quality_Wine<-ifelse(TrainData$quality>5,"Good Quality Wine","Bad Quality Wine")
RF_Model<-randomForest(factor(Good_Quality_Wine)~.-quality,TrainData,ntree=150)
RF_Model
##
## Call:
## randomForest(formula = factor(Good_Quality_Wine) ~ . - quality, data = TrainData, ntree = 150)
## Type of random forest: classification
## Number of trees: 150
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.74%
## Confusion matrix:
## Bad Quality Wine Good Quality Wine class.error
## Bad Quality Wine 474 122 0.2046980
## Good Quality Wine 118 567 0.1722628
library(randomForest)
importance <- importance(RF_Model)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
# Create a rank variable based on importance
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(importance))))
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, importance),
y = importance, fill = importance)) +
geom_bar(stat="identity", color="steelblue", position=position_dodge())+
theme_minimal() +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 3, colour = 'yellow') +
ggtitle("Bar Chart with Most Discriminating Factor of Wine Quality")+
labs(x = 'Discriminating Factor of Wine Quality',y="Importance") +
coord_flip() +
theme_classic()
*** ### Classification Technique: Each Quality Score is Considered A Class \(~\)
#Import the dataset WineQuality#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )
#Low:3-4, Medium:5-6, High:7-8
RedWineQuality.orig = RedWineQuality # Save off original copy before making any alterations
## Change quality to categorical
RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
if(x<= 4) {"Low"}
else if(x>= 7) {"High"}
else { "Medium"}
})
RedWineQuality$quality = unlist(RedWineQuality$quality)
RedWineQuality$quality = as.factor(RedWineQuality$quality)
### Divide the data to training and testing groups###
#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]
table(TrainData$quality)
##
## High Low Medium
## 168 54 1071
table(TestData$quality)
##
## High Low Medium
## 49 9 248
#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)
## Named int [1:1281] 2 3 4 5 6 8 9 10 11 14 ...
## - attr(*, "names")= chr [1:1281] "Resample11" "Resample12" "Resample13" "Resample14" ...
TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]
table(TrainData$quality)
##
## High Low Medium
## 174 51 1056
table(TestData$quality)
##
## High Low Medium
## 43 12 263
###Classification Approaches to Quality Score###
#Decsion Three#
library(rpart)
Model_All <- quality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density +
pH + sulphates + alcohol
Model_Alchool_VA<- quality~ alcohol + volatile.acidity+sulphates
#Decsion Three#
Tree_All <- rpart(Model_All , method = "class", data=TrainData)
Tree_Alchool_VA <- rpart(Model_Alchool_VA, method = "class", data=TrainData)
#Tree_All
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_All , box.palette = "RdBu", nn=TRUE)
#Test for Prediction
Prediction_Train = predict(Tree_All,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 91 0 83
## Low 0 0 51
## Medium 19 0 1037
##
## Overall Statistics
##
## Accuracy : 0.8806
## 95% CI : (0.8615, 0.8978)
## No Information Rate : 0.9141
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4913
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.82727 NA 0.8856
## Specificity 0.92912 0.96019 0.8273
## Pos Pred Value 0.52299 NA 0.9820
## Neg Pred Value 0.98284 NA 0.4044
## Prevalence 0.08587 0.00000 0.9141
## Detection Rate 0.07104 0.00000 0.8095
## Detection Prevalence 0.13583 0.03981 0.8244
## Balanced Accuracy 0.87820 NA 0.8564
Prediction_Test = predict(Tree_All,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test )
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 16 0 27
## Low 0 0 12
## Medium 9 0 254
##
## Overall Statistics
##
## Accuracy : 0.8491
## 95% CI : (0.8049, 0.8866)
## No Information Rate : 0.9214
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3361
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.64000 NA 0.8669
## Specificity 0.90785 0.96226 0.6400
## Pos Pred Value 0.37209 NA 0.9658
## Neg Pred Value 0.96727 NA 0.2909
## Prevalence 0.07862 0.00000 0.9214
## Detection Rate 0.05031 0.00000 0.7987
## Detection Prevalence 0.13522 0.03774 0.8270
## Balanced Accuracy 0.77392 NA 0.7534
library(rpart) #for trees
library(rpart.plot)
#Tree_Alchool_VA
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_Alchool_VA, box.palette = "RdBu", nn=TRUE)
#Test for Prediction
Prediction_Train_Alchool_VA= predict(Tree_Alchool_VA,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train_Alchool_VA)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 66 0 108
## Low 0 0 51
## Medium 21 0 1035
##
## Overall Statistics
##
## Accuracy : 0.8595
## 95% CI : (0.8392, 0.8781)
## No Information Rate : 0.9321
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3682
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.75862 NA 0.8668
## Specificity 0.90955 0.96019 0.7586
## Pos Pred Value 0.37931 NA 0.9801
## Neg Pred Value 0.98103 NA 0.2933
## Prevalence 0.06792 0.00000 0.9321
## Detection Rate 0.05152 0.00000 0.8080
## Detection Prevalence 0.13583 0.03981 0.8244
## Balanced Accuracy 0.83408 NA 0.8127
Prediction_Test_Alchool_VA= predict(Tree_Alchool_VA,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test_Alchool_VA)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 8 0 35
## Low 0 0 12
## Medium 7 0 256
##
## Overall Statistics
##
## Accuracy : 0.8302
## 95% CI : (0.7843, 0.8698)
## No Information Rate : 0.9528
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.174
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.53333 NA 0.8449
## Specificity 0.88449 0.96226 0.5333
## Pos Pred Value 0.18605 NA 0.9734
## Neg Pred Value 0.97455 NA 0.1455
## Prevalence 0.04717 0.00000 0.9528
## Detection Rate 0.02516 0.00000 0.8050
## Detection Prevalence 0.13522 0.03774 0.8270
## Balanced Accuracy 0.70891 NA 0.6891
\(~\)
#Good Wine:x > 5, Bad Wine:Others
#Import the dataset WineQuality#
RedWineQuality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE, stringsAsFactors = FALSE )
RedWineQuality$quality = lapply(RedWineQuality[,12], function (x)
{
if(x>5) {"Good Wine"}
else {"Bad Wine"}
})
RedWineQuality$quality = unlist(RedWineQuality$quality);
RedWineQuality$quality = as.factor(RedWineQuality$quality)
### Divide the data to training and testing groups###
#80% for train data and 20% for test data.
Data_Divide <- sample(2, nrow(RedWineQuality), replace=TRUE, prob=c(0.8, 0.2))
TrainData <-RedWineQuality[Data_Divide==1,]
TestData <- RedWineQuality[Data_Divide==2,]
#Data Partition#
idTrainData <- unlist(createDataPartition(RedWineQuality$quality,p=0.8))
str(idTrainData)
## Named int [1:1280] 1 3 4 5 8 9 10 11 12 14 ...
## - attr(*, "names")= chr [1:1280] "Resample11" "Resample12" "Resample13" "Resample14" ...
TrainData <-RedWineQuality[idTrainData,]
TestData <-RedWineQuality[-idTrainData,]
table(TrainData$quality)
##
## Bad Wine Good Wine
## 596 684
table(TestData$quality)
##
## Bad Wine Good Wine
## 148 171
#Decsion Tree#
Model_All <- quality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density +
pH + sulphates + alcohol
Model_Alchool_VA<- quality~ alcohol + volatile.acidity+sulphates
#Decsion Tree#
Tree_All <- rpart(Model_All , method = "class", data=TrainData)
Tree_Alchool_VA <- rpart(Model_Alchool_VA, method = "class", data=TrainData)
#Tree_All
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_All, box.palette = "RdBu", nn=TRUE)
#Test for Prediction
Prediction_Train = predict(Tree_All,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Wine Good Wine
## Bad Wine 469 127
## Good Wine 170 514
##
## Accuracy : 0.768
## 95% CI : (0.7439, 0.7908)
## No Information Rate : 0.5008
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5359
##
## Mcnemar's Test P-Value : 0.01481
##
## Sensitivity : 0.7340
## Specificity : 0.8019
## Pos Pred Value : 0.7869
## Neg Pred Value : 0.7515
## Prevalence : 0.4992
## Detection Rate : 0.3664
## Detection Prevalence : 0.4656
## Balanced Accuracy : 0.7679
##
## 'Positive' Class : Bad Wine
##
Prediction_Test = predict(Tree_All,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Wine Good Wine
## Bad Wine 115 33
## Good Wine 54 117
##
## Accuracy : 0.7273
## 95% CI : (0.6749, 0.7754)
## No Information Rate : 0.5298
## P-Value [Acc > NIR] : 3.906e-13
##
## Kappa : 0.4569
##
## Mcnemar's Test P-Value : 0.03201
##
## Sensitivity : 0.6805
## Specificity : 0.7800
## Pos Pred Value : 0.7770
## Neg Pred Value : 0.6842
## Prevalence : 0.5298
## Detection Rate : 0.3605
## Detection Prevalence : 0.4639
## Balanced Accuracy : 0.7302
##
## 'Positive' Class : Bad Wine
##
#Tree_Alchool_VA
library(rpart) #for trees
library(rpart.plot)
rpart.plot(Tree_Alchool_VA, box.palette = "RdBu", nn=TRUE)
#Test for Prediction
Prediction_Train = predict(Tree_Alchool_VA,TrainData,type = "class")
confusionMatrix(TrainData$quality,Prediction_Train)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Wine Good Wine
## Bad Wine 480 116
## Good Wine 226 458
##
## Accuracy : 0.7328
## 95% CI : (0.7077, 0.7569)
## No Information Rate : 0.5516
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4694
##
## Mcnemar's Test P-Value : 3.769e-09
##
## Sensitivity : 0.6799
## Specificity : 0.7979
## Pos Pred Value : 0.8054
## Neg Pred Value : 0.6696
## Prevalence : 0.5516
## Detection Rate : 0.3750
## Detection Prevalence : 0.4656
## Balanced Accuracy : 0.7389
##
## 'Positive' Class : Bad Wine
##
Prediction_Test = predict(Tree_Alchool_VA,TestData,type = "class")
confusionMatrix(TestData$quality,Prediction_Test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Wine Good Wine
## Bad Wine 120 28
## Good Wine 71 100
##
## Accuracy : 0.6897
## 95% CI : (0.6357, 0.74)
## No Information Rate : 0.5987
## P-Value [Acc > NIR] : 0.0004804
##
## Kappa : 0.388
##
## Mcnemar's Test P-Value : 2.43e-05
##
## Sensitivity : 0.6283
## Specificity : 0.7812
## Pos Pred Value : 0.8108
## Neg Pred Value : 0.5848
## Prevalence : 0.5987
## Detection Rate : 0.3762
## Detection Prevalence : 0.4639
## Balanced Accuracy : 0.7048
##
## 'Positive' Class : Bad Wine
##
Adrian Dragulescu and Cole Arendt (2020). xlsx: Read, Write, Format Excel 2007 and Excel 97/2000/XP/2003 Files. R package version 0.6.2. https://CRAN.R-project.org/package=xlsx
A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
Elin Waring, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu and Shannon Ellis (2020). skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr (website), https://github.com/ropensci/skimr.
Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. URL http://www.jstatsoft.org/v40/i01/.
Hadley Wickham (2019). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.5. https://CRAN.R-project.org/package=rvest
Hadley Wickham, Jim Hester and Winston Chang (2020). devtools: Tools to Make Developing R Packages Easier. R package version 2.2.2. https://CRAN.R-project.org/package=devtools
Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.4. https://CRAN.R-project.org/package=dplyr
Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes
Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder
Kevin Wright (2018). corrgram: Plot a Correlogram. R package version 1.13. https://CRAN.R-project.org/package=corrgram
Max Kuhn (2020). caret: Classification and Regression Training. R package version 6.0-85. https://CRAN.R-project.org/package=caret
Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
Sing T, Sander O, Beerenwinkel N, Lengauer T (2005). “ROCR: visualizing classifier performance in R.” Bioinformatics, 21(20), 7881. <URL: http://rocr.bioinf.mpi-sb.mpg.de>.
Stefan Milton Bache and Hadley Wickham (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5. https://CRAN.R-project.org/package=magrittr
Stephen Milborrow (2019). rpart.plot: Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. R package version 3.0.8. https://CRAN.R-project.org/package=rpart.plot
Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot
Terry Therneau and Beth Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686