Decision Trees and Random Forests with R

The objective of this assignment is to predict wine quality from the chemical properties of the wine. An accurate model could save vineyards the money and time currently spent on taste testers.

First we'll load the data and explore its structure.

setwd("~/Desktop/R")
wine <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), header = TRUE, sep = ";")
str(wine)
'data.frame':   1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
table(wine$quality)

  3   4   5   6   7   8 
 10  53 681 638 199  18 

The data set contains 1599 observations of 12 variables: 11 numeric chemical measurements plus an integer quality rating. All wines are rated on an integer scale, and the observed ratings range from 3 to 8.
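A quick summary() also helps sanity-check the range of each measurement (optional, but a useful habit before modeling):

summary(wine)   # min, quartiles, median, mean and max for every column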

names(wine)
 [1] "fixed.acidity"        "volatile.acidity"    
 [3] "citric.acid"          "residual.sugar"      
 [5] "chlorides"            "free.sulfur.dioxide" 
 [7] "total.sulfur.dioxide" "density"             
 [9] "pH"                   "sulphates"           
[11] "alcohol"              "quality"             
sum(is.na(wine))
[1] 0

The column names look sufficient, and there are no NA values. Let's convert the response variable quality to a factor.

wine$quality <- as.factor(wine$quality)
str(wine$quality)
 Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

Before we split the data, let's first look at a histogram of the frequency of wine quality ratings. Note that the x-axis shows the factor's level indices (1 through 6) rather than the original ratings, because as.numeric() applied to a factor returns level positions, not the underlying labels.

sauce <- as.numeric(wine$quality)
hist(sauce)

The majority of ratings fall at levels 3 and 4, which correspond to ratings 5 and 6 in the data frame.
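Alternatively, plotting the factor itself produces a bar chart labeled with the original ratings, which avoids the level-index confusion entirely (base R's plot() dispatches to a barplot for factors):

plot(wine$quality, main = "Distribution of quality ratings")   # bars labeled 3 through 8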

For this project, we'll first use a decision tree to classify wine into the six quality levels based on its chemical properties, using the rpart package. The first step is to split the data into training and testing sets. We'll draw the rows at random, using 80% of the data for training and 20% for testing.

.8 * 1599
[1] 1279.2
s <- sample(1599, 1279)
wine_train <- wine[s, ]
wine_test <- wine[-s, ]
dim(wine_train)
[1] 1279   12
dim(wine_test)
[1] 320  12
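Note that sample() draws a different split on every run, so the exact numbers that follow may vary slightly. For a reproducible split, one could fix the random seed first - a minimal sketch (the seed value 123 is an arbitrary choice):

set.seed(123)                                      # fix the RNG state for reproducibility
s <- sample(nrow(wine), round(0.8 * nrow(wine)))   # 80% of row indices
wine_train <- wine[s, ]
wine_test  <- wine[-s, ]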

We now have randomized training and testing sets. Let's fit the decision tree model with rpart().

install.packages("rpart")
library(rpart)
tm <- rpart(quality~., wine_train, method = "class")

Now let's inspect the result using rpart.plot(), with the tweak argument to enlarge the text. Be sure to expand the graph so it can be viewed clearly.

install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(tm, tweak = 1.6)

In this graph, "yes" always branches left and "no" always branches right; each node splits the data on a single condition. Note that the tree assigns wines to only 3 of the 6 available classes: 5, 6 and 7. Let's create another graph with more detail.

rpart.plot(tm, type = 4, extra = 101, tweak = 1.6)

The leaf nodes show that this model makes a considerable number of misclassifications. Let's test its predictions on the unseen data.

pred <- predict(tm, wine_test, type = "class")
table(wine_test$quality, pred)
   pred
     3  4  5  6  7  8
  3  0  0  0  0  0  0
  4  0  0  6  3  0  0
  5  0  0 81 37  7  0
  6  0  0 44 83  6  0
  7  0  0  2 36 12  0
  8  0  0  0  3  0  0
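The overall accuracy can be read straight off this table by summing the diagonal (the correct predictions) and dividing by the total:

cm <- table(wine_test$quality, pred)   # rows = actual, columns = predicted
sum(diag(cm)) / sum(cm)                # fraction correct; 176/320 = 0.55 here

The caret package computes the same figure, along with per-class statistics, via confusionMatrix().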
install.packages("caret")
library(caret)
confusionMatrix(table(pred, wine_test$quality))
Confusion Matrix and Statistics

    
pred  3  4  5  6  7  8
   3  0  0  0  0  0  0
   4  0  0  0  0  0  0
   5  0  6 81 44  2  0
   6  0  3 37 83 36  3
   7  0  0  7  6 12  0
   8  0  0  0  0  0  0

Overall Statistics
                                          
               Accuracy : 0.55            
                 95% CI : (0.4937, 0.6054)
    No Information Rate : 0.4156          
    P-Value [Acc > NIR] : 8.828e-07       
                                          
                  Kappa : 0.2683          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
Sensitivity                NA  0.00000   0.6480   0.6241  0.24000
Specificity                 1  1.00000   0.7333   0.5775  0.95185
Pos Pred Value             NA      NaN   0.6090   0.5123  0.48000
Neg Pred Value             NA  0.97188   0.7647   0.6835  0.87119
Prevalence                  0  0.02813   0.3906   0.4156  0.15625
Detection Rate              0  0.00000   0.2531   0.2594  0.03750
Detection Prevalence        0  0.00000   0.4156   0.5062  0.07812
Balanced Accuracy          NA  0.50000   0.6907   0.6008  0.59593
                     Class: 8
Sensitivity          0.000000
Specificity          1.000000
Pos Pred Value            NaN
Neg Pred Value       0.990625
Prevalence           0.009375
Detection Rate       0.000000
Detection Prevalence 0.000000
Balanced Accuracy    0.500000

As shown above, the predictions were only 55% accurate, which isn't very good. Before abandoning the tree, one option would be to check rpart's complexity-parameter table to see whether pruning might help - a quick diagnostic:
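printcp(tm)   # cross-validated error (xerror) at each complexity-parameter value
plotcp(tm)    # the same information as a plot; prune where xerror levels off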

Instead, let's try a random forest approach, guided by this blog post: https://www.r-bloggers.com/predicting-wine-quality-using-random-forests/, with some changes as I see fit. For the random forest model, let's also redefine the quality ranking to reduce the number of levels. For this I'll reload the data in its original form.

setwd("~/Desktop/R")
wine2 <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), header = TRUE, sep = ";")

Let’s look again at the distribution of wine rankings, this time with a bar plot.

barplot(table(wine2$quality))

I'd like to classify wines rated 5 or 6 as "normal", lower-rated wines as "bad", and higher-rated wines as "good".

wine2$taste <- ifelse(wine2$quality < 5, "bad", "good")
wine2$taste[wine2$quality == 5] <- "normal"
wine2$taste[wine2$quality == 6] <- "normal"
wine2$taste <- as.factor(wine2$taste)
str(wine2$taste)
 Factor w/ 3 levels "bad","good","normal": 3 3 3 3 3 3 3 2 2 3 ...
barplot(table(wine2$taste))

table(wine2$taste)

   bad   good normal 
    63    217   1319 

As seen above, there are far more normal wines in the dataset than bad or good ones. In a real-world setting, a company might well care more about these broad categories than about exact integer ratings.
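As an aside, the same three-way recoding can be done in one step with cut() - an equivalent sketch (unlike as.factor(), cut() keeps the levels in the order given rather than sorting them alphabetically):

taste2 <- cut(wine2$quality,
              breaks = c(0, 4, 6, 10),              # bins (0,4], (4,6], (6,10]
              labels = c("bad", "normal", "good"))
table(taste2)                                       # same counts as above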

We can now split the data into training and testing sets, again using 80% of the data for training.

samp <- sample(1599, 1279)
wine_train2 <- wine2[samp, ]
wine_test2 <- wine2[-samp, ]
dim(wine_train2)
[1] 1279   13
dim(wine_test2)
[1] 320  13
library(randomForest)
model <- randomForest(taste ~ . - quality, data = wine_train2)
model

Call:
 randomForest(formula = taste ~ . - quality, data = wine_train2) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 12.9%
Confusion matrix:
       bad good normal class.error
bad      1    0     46   0.9787234
good     0  102     86   0.4574468
normal   2   31   1011   0.0316092
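Before testing, it's worth a glance at which chemical properties the forest relies on most; randomForest exposes this via importance() and varImpPlot():

varImpPlot(model)   # mean decrease in Gini impurity per predictor
importance(model)   # the same values as a matrix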

We can now test our model on the remaining data.

prediction <- predict(model, newdata = wine_test2)
table(prediction, wine_test2$taste)
          
prediction bad good normal
    bad      0    0      0
    good     1   13      4
    normal  15   16    271
(0 + 13 + 271) / nrow(wine_test2)
[1] 0.8875
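The same figure can be computed without reading numbers off the table by hand:

mean(prediction == wine_test2$taste)   # fraction of test wines classified correctly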

As seen above, our model was roughly 89% accurate - a major improvement over the decision tree.

I'd like to try one more random forest and make the task harder for the algorithm: this time the integer rating 5 counts as "bad" rather than "normal", and only 60% of the data is used for training.

wine3 <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), header = TRUE, sep = ";")
wine3$taste <- ifelse(wine3$quality < 6, "bad", "good")
wine3$taste[wine3$quality == 6] <- "normal"
wine3$taste <- as.factor(wine3$taste)
str(wine3$taste)
 Factor w/ 3 levels "bad","good","normal": 1 1 1 3 1 1 1 2 2 1 ...
barplot(table(wine3$taste))

table(wine3$taste)

   bad   good normal 
   744    217    638 

This changes the distribution drastically.
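prop.table() expresses the same counts as proportions, which makes the shift explicit:

prop.table(table(wine3$taste))   # roughly 0.47 bad, 0.14 good, 0.40 normal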

.6 * 1599
[1] 959.4
samp2 <- sample(1599, 960)
wine_train3 <- wine3[samp2, ]
wine_test3 <- wine3[-samp2, ]
dim(wine_train3)
[1] 960  13
dim(wine_test3)
[1] 639  13
library(randomForest)
model2 <- randomForest(taste ~ . - quality, data = wine_train3)
model2

Call:
 randomForest(formula = taste ~ . - quality, data = wine_train3) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 13.33%
Confusion matrix:
       bad good normal class.error
bad      0    1     37  1.00000000
good     0   77     65  0.45774648
normal   0   25    755  0.03205128
prediction2 <- predict(model2, newdata = wine_test3)
table(prediction2, wine_test3$taste)
           
prediction2 bad good normal
     bad      0    0      0
     good     0   38     19
     normal  25   37    520
(0 + 38 + 520) / nrow(wine_test3)
[1] 0.8732394

This model classified roughly 87% of the test wines correctly - comparable to the previous model despite the harder class definitions and the smaller training set.

Overall, the random forest approach proved far more effective than a single decision tree here, and collapsing the six integer ratings into three broader classes made the classification task considerably more tractable.
