1 Introduction
2 Exploring some of the most important variables
3 Creating the Machine Learning Model (via a 1:68 ratio)
4 Creating the Machine Learning Model (via a 1:225 ratio)
5 Creating the Machine Learning Model (via a 1:404 ratio)

1 Introduction

Greetings! I would like to analyze the Wine Quality dataset. I got the link from one of the DataCamp courses I took, although this analysis deals more with wine quality. I would like to apply machine learning principles here.

The machine learning principles applied here I learned from Rick Scavetta’s analysis of the Boston Housing Prices dataset, from Julia Silge’s Machine Learning Case Study course on DataCamp, from Kaggle’s Machine Learning tutorial, and from the winner of the Iowa housing dataset competition on Kaggle.

1.1 Necessary variables

We would like to have a look at the relationship between the alcohol content and everything else.

1.2 Importing necessary libraries

library(caret)

## Warning: package 'caret' was built under R version 3.5.2

## Loading required package: lattice

## Loading required package: ggplot2

library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.1     v dplyr   0.7.8
## v readr   1.1.1     v stringr 1.3.1
## v tibble  1.4.2     v forcats 0.3.0

## Warning: package 'dplyr' was built under R version 3.5.2

## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()

library(corrplot)

## corrplot 0.84 loaded

library(keras)

## Warning: package 'keras' was built under R version 3.5.2

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.5.2

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 3.5.2

library(modelr)
library(psych)

## Warning: package 'psych' was built under R version 3.5.2

## 
## Attaching package: 'psych'

## The following object is masked from 'package:modelr':
## 
##     heights

## The following object is masked from 'package:randomForest':
## 
##     outlier

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(Rmisc)

## Warning: package 'Rmisc' was built under R version 3.5.2

## Loading required package: plyr

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following object is masked from 'package:dplyr':
## 
##     combine

library(scales)

## Warning: package 'scales' was built under R version 3.5.2

## 
## Attaching package: 'scales'

## The following objects are masked from 'package:psych':
## 
##     alpha, rescale

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(rpart)
library(yardstick)

## Loading required package: broom

## Warning: package 'broom' was built under R version 3.5.2

## 
## Attaching package: 'broom'

## The following object is masked from 'package:modelr':
## 
##     bootstrap

## 
## Attaching package: 'yardstick'

## The following objects are masked from 'package:modelr':
## 
##     mae, rmse

## The following object is masked from 'package:readr':
## 
##     spec

## The following objects are masked from 'package:caret':
## 
##     mnLogLoss, precision, recall

1.3 Importing the Dataset

winequality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep = ";")

head(winequality)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

summary(winequality)

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Thankfully, there are no NAs in winequality.
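One way to confirm the claim about missing values is to count the NAs in each column. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in, not the wine data) so that it runs on its own; the same two calls should work on `winequality`.

```r
# Toy data frame standing in for winequality (values are made up)
toy <- data.frame(
  alcohol = c(8.8, 9.5, NA, 9.9),
  pH      = c(3.00, 3.30, 3.26, 3.19),
  quality = c(6L, 6L, 6L, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE if the data frame contains at least one NA anywhere
has_na <- anyNA(toy)
print(has_na)
```

On a data frame without missing values, every count is zero and `anyNA()` returns FALSE.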

1.4 Description of variables

I’m not a chemist, but I will give the best description I can for each of the variables.

fixed.acidity - the non-volatile acids in wine (for example tartaric acid), which do not evaporate readily

volatile.acidity - a measure of the low-molecular-weight fatty acids in wine, generally perceived as the odour of vinegar

citric.acid - a weak organic acid that occurs naturally in citrus fruits

residual.sugar - any natural grape sugars that are left over after fermentation stops, whether they were left on purpose or not

chlorides - self explanatory

free.sulfur.dioxide - SO2 available to react and thus exhibits both germicidal and antioxidant properties

total.sulfur.dioxide - self explanatory

density - self explanatory

pH - from a winemaker’s POV, it is a way to measure ripeness in relation to acidity

sulphates - self explanatory (amount of sulphates in wine)

alcohol - defined as amount of alcohol by volume

quality - self explanatory

Apologies if my explanations aren’t good enough, but from what I researched this is the best I can give.

2 Exploring some of the most important variables

We would like to see the relationship between the alcohol content and each of the other variables.

2.1 Response variable - alcohol

I would like to see, via a histogram, how the wines are distributed by alcohol content.

ggplot(winequality, aes(x = alcohol)) + geom_histogram(fill = "blue", binwidth = 1)

2.2 Most important numeric predictors

I want to work out each variable’s correlation with alcohol and find out which variables correlate most strongly with it.

numericVars <- which(sapply(winequality, is.numeric))
numericVarNames <- names(numericVars)
cat("There are", length(numericVars), "numeric variables")

## There are 12 numeric variables

2.2.1 Creating the corrplot

all_numVar <- winequality[, numericVars]
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")

#Sort on decreasing correlations with alcohol
cor_sorted <- as.matrix(sort(cor_numVar[,"alcohol"], decreasing = TRUE))

#Selecting high correlations 
Cor_High <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.175)))
cor_numVar <- cor_numVar[Cor_High, Cor_High]

corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")

Since quality is correlated with alcohol content, let’s plot the two against each other to see the relationship.

ggplot(winequality, aes(x = quality, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Quality \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
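The red line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit, so its slope is tied directly to the correlation coefficient: slope = r * sd(y) / sd(x). The sketch below checks this identity in base R on made-up values (`quality_toy` and `alcohol_toy` are hypothetical stand-ins for the two wine columns).

```r
# Toy values standing in for winequality$quality and winequality$alcohol
quality_toy <- c(5, 6, 6, 7, 5, 8, 6, 7)
alcohol_toy <- c(9.4, 10.1, 9.9, 11.2, 9.0, 12.5, 10.4, 11.0)

# Least-squares line, the same kind of fit geom_smooth(method = "lm") draws
fit_toy  <- lm(alcohol_toy ~ quality_toy)
slope_lm <- unname(coef(fit_toy)["quality_toy"])

# The slope equals the correlation rescaled by the two standard deviations
slope_from_cor <- cor(quality_toy, alcohol_toy) * sd(alcohol_toy) / sd(quality_toy)

print(c(lm = slope_lm, from_cor = slope_from_cor))
```

Both numbers agree, which is why a steeper line on plots with comparable axis spread corresponds to a stronger correlation.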

2.3 Determining full correlation between alcohol and the other variables

cor(winequality$alcohol, winequality$fixed.acidity, use = "pairwise.complete.obs")

## [1] -0.1208811

cor(winequality$alcohol, winequality$volatile.acidity, use = "pairwise.complete.obs")

## [1] 0.06771794

cor(winequality$alcohol, winequality$citric.acid, use = "pairwise.complete.obs")

## [1] -0.07572873

cor(winequality$alcohol, winequality$residual.sugar, use = "pairwise.complete.obs")

## [1] -0.4506312

cor(winequality$alcohol, winequality$chlorides, use = "pairwise.complete.obs")

## [1] -0.3601887

cor(winequality$alcohol, winequality$free.sulfur.dioxide, use = "pairwise.complete.obs")

## [1] -0.2501039

cor(winequality$alcohol, winequality$total.sulfur.dioxide, use = "pairwise.complete.obs")

## [1] -0.4488921

cor(winequality$alcohol, winequality$density, use = "pairwise.complete.obs")

## [1] -0.7801376

cor(winequality$alcohol, winequality$pH, use = "pairwise.complete.obs")

## [1] 0.1214321

cor(winequality$alcohol, winequality$sulphates, use = "pairwise.complete.obs")

## [1] -0.01743277

cor(winequality$alcohol, winequality$quality, use = "pairwise.complete.obs")

## [1] 0.4355747
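The eleven separate `cor()` calls can also be collapsed into one: the `alcohol` column of the full correlation matrix holds the same values. The sketch below shows the idea on a small made-up data frame (`toy` is a hypothetical stand-in for `winequality`), and then applies the same 0.175 threshold used for the corrplot.

```r
# Toy data frame standing in for winequality (values are made up)
toy <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 11.4, 12.8, 10.5, 9.2),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9920, 0.9900, 0.9935, 0.9980),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 2.1, 1.2, 5.0, 14.3),
  pH             = c(3.00, 3.30, 3.26, 3.19, 3.10, 3.35, 3.22, 3.05)
)

# Correlation of every column with alcohol, in a single call
cor_with_alcohol <- cor(toy, use = "pairwise.complete.obs")[, "alcohol"]

# Drop the trivial self-correlation and sort by absolute strength
cor_with_alcohol <- cor_with_alcohol[names(cor_with_alcohol) != "alcohol"]
cor_sorted <- cor_with_alcohol[order(abs(cor_with_alcohol), decreasing = TRUE)]
print(round(cor_sorted, 3))

# Keep only the variables above an absolute-correlation threshold
strong <- names(cor_sorted)[abs(cor_sorted) > 0.175]
print(strong)
```

Sorting by absolute value keeps strong negative correlations, such as the one with density, near the top.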

2.4 Visualizing with some of the other variables

2.4.1 alcohol with pH

ggplot(winequality, aes(x = pH, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n pH content \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.2 alcohol with citric acid

ggplot(winequality, aes(x = citric.acid, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Citric Acid \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.3 alcohol with fixed.acidity

ggplot(winequality, aes(x = fixed.acidity, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Fixed Acidity in the wine \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.4 alcohol with volatile.acidity

ggplot(winequality, aes(x = volatile.acidity, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Volatile Acidity \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.5 alcohol with residual.sugar

ggplot(winequality, aes(x = residual.sugar, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Residual Sugar Content \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.6 alcohol with chlorides

ggplot(winequality, aes(x = chlorides, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Chlorides in wine \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.7 alcohol with free.sulfur.dioxide

ggplot(winequality, aes(x = free.sulfur.dioxide, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Free Sulfur Dioxide \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.8 alcohol with total sulfur dioxide

ggplot(winequality, aes(x = total.sulfur.dioxide, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Total Sulfur Dioxide \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.9 alcohol with density

ggplot(winequality, aes(x = density, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Density \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

2.4.10 alcohol with sulphates

ggplot(winequality, aes(x = sulphates, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Sulphates \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))

3 Creating the Machine Learning Model (via a 1:68 ratio)

# Use the first 68 rows for training and the remaining rows for testing

index <- 1:68

training <- winequality[index,]
testing <- winequality[-index,]
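Taking the first 68 rows works when the row order of the file is arbitrary. If the rows were ordered in some way, a random sample of the same size would be a common alternative. The sketch below is only an illustration of that alternative on simulated data (`toy`, `index_random` and the seed are assumptions, not part of the original analysis).

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy data frame standing in for winequality (values are simulated)
toy <- data.frame(alcohol = rnorm(200, mean = 10.5, sd = 1.2),
                  pH      = rnorm(200, mean = 3.2, sd = 0.15))

n_train <- 68
index_random <- sample(nrow(toy), size = n_train)  # row numbers drawn without replacement

training_toy <- toy[index_random, ]
testing_toy  <- toy[-index_random, ]

cat("training rows:", nrow(training_toy), " testing rows:", nrow(testing_toy), "\n")
```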

3.1 Conducting Z-score scaling

This converts all the predictors to a common scale with a mean of zero and a standard deviation (SD) of 1.

# using dplyr to keep the data frame structure

training %>%
  mutate_at(vars(-alcohol), scale) -> training

testing %>%
  mutate_at(vars(-alcohol), scale) -> testing
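For a single column, the z-score is z = (x - mean(x)) / sd(x), which is what `scale()` computes. The code above scales the testing set with its own mean and SD; a frequently used variant reuses the training mean and SD so that both sets sit on exactly the same scale. The sketch below shows both steps in base R on made-up values (`train_x` and `test_x` are hypothetical stand-ins for one predictor column).

```r
# Toy predictor values standing in for one column of training and testing
train_x <- c(3.00, 3.30, 3.26, 3.19, 3.10)
test_x  <- c(3.35, 3.22, 3.05)

# Z-score: subtract the mean, divide by the standard deviation
train_mean <- mean(train_x)
train_sd   <- sd(train_x)
train_z    <- (train_x - train_mean) / train_sd

# scale() performs the same computation (it returns a one-column matrix)
train_z_scale <- as.numeric(scale(train_x))

# Variant: scale the test values with the training mean and SD
test_z <- (test_x - train_mean) / train_sd

print(round(train_z, 3))
print(round(test_z, 3))
```

With this variant the test values generally do not have a mean of exactly zero, which is expected.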

3.2 GLM and LM

We fit a linear model, which assumes normally distributed errors.

fit_lm <- lm(alcohol~., data = training)

summary(fit_lm)

## 
## Call:
## lm(formula = alcohol ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59689 -0.24361 -0.03415  0.17444  0.92859 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.238235   0.043027 237.950  < 2e-16 ***
## fixed.acidity         0.252765   0.062516   4.043 0.000163 ***
## volatile.acidity      0.086418   0.048215   1.792 0.078479 .  
## citric.acid           0.149721   0.058733   2.549 0.013565 *  
## residual.sugar        1.129491   0.120762   9.353 4.87e-13 ***
## chlorides            -0.001271   0.060959  -0.021 0.983445    
## free.sulfur.dioxide  -0.103949   0.057158  -1.819 0.074316 .  
## total.sulfur.dioxide  0.102732   0.058450   1.758 0.084282 .  
## density              -1.778473   0.117720 -15.108  < 2e-16 ***
## pH                    0.244542   0.058097   4.209 9.37e-05 ***
## sulphates             0.130461   0.049824   2.618 0.011342 *  
## quality               0.064368   0.055449   1.161 0.250625    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3548 on 56 degrees of freedom
## Multiple R-squared:  0.9088, Adjusted R-squared:  0.8909 
## F-statistic: 50.72 on 11 and 56 DF,  p-value: < 2.2e-16

3.2.1 Making predictions

pred_lm <- predict(fit_lm, newdata = testing)
print(head(pred_lm))

##         1         2         3         4         5         6 
## 10.493639  9.537425  9.598242  9.080628  9.317752  9.249997
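Predictions like these can be compared with the observed alcohol values of the test set. Two common summaries are the root mean squared error (RMSE) and the mean absolute error (MAE). The sketch below computes both in base R on made-up vectors; `observed` stands in for `testing$alcohol` and `predicted` for `pred_lm`, and both names are assumptions made for illustration.

```r
# Toy vectors: observed stands in for testing$alcohol, predicted for pred_lm
observed  <- c(10.4, 9.6, 9.5, 9.2, 9.4, 9.1)
predicted <- c(10.5, 9.5, 9.6, 9.1, 9.3, 9.2)

errors <- observed - predicted

rmse_value <- sqrt(mean(errors^2))  # root mean squared error
mae_value  <- mean(abs(errors))     # mean absolute error

cat("RMSE:", rmse_value, " MAE:", mae_value, "\n")
```

Both measures are in the units of the response, here percent alcohol by volume, so lower values mean predictions closer to the observed content.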

3.3 Using Random Forests

fit_rf <- randomForest(alcohol~., data = training)
summary(fit_rf)

##                 Length Class  Mode     
## call              3    -none- call     
## type              1    -none- character
## predicted        68    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times        68    -none- numeric  
## importance       11    -none- numeric  
## importanceSD      0    -none- NULL     
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y                68    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call

3.3.1 Making predictions

pred_rf <- predict(fit_rf, newdata = testing)
print(head(pred_rf))

##         1         2         3         4         5         6 
## 10.341243  9.972557  9.881607  9.827643  9.856643 10.263370

3.4 Using rpart

fit_rpart <- rpart(alcohol~., data = training)
summary(fit_rpart)

## Call:
## rpart(formula = alcohol ~ ., data = training)
##   n= 68 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.65643234      0 1.0000000 1.0348123 0.18148704
## 2 0.05740445      1 0.3435677 0.4134829 0.06184590
## 3 0.03478234      2 0.2861632 0.3501176 0.05194178
## 4 0.02230450      3 0.2513809 0.3446927 0.05086082
## 5 0.01276974      4 0.2290764 0.3206472 0.04735077
## 6 0.01000000      5 0.2163066 0.3184275 0.04731005
## 
## Variable importance
##              density            chlorides              quality 
##                   46                   17                   16 
##       residual.sugar     volatile.acidity          citric.acid 
##                    8                    3                    3 
##        fixed.acidity                   pH  free.sulfur.dioxide 
##                    2                    2                    1 
## total.sulfur.dioxide 
##                    1 
## 
## Node number 1: 68 observations,    complexity param=0.6564323
##   mean=10.23824, MSE=1.136479 
##   left son=2 (50 obs) right son=3 (18 obs)
##   Primary splits:
##       density        < -0.7316956  to the right, improve=0.6564323, (0 missing)
##       quality        < 0.725376    to the left,  improve=0.3191095, (0 missing)
##       chlorides      < -0.5260461  to the right, improve=0.2335563, (0 missing)
##       residual.sugar < -0.5031297  to the right, improve=0.1816994, (0 missing)
##       citric.acid    < 0.7301943   to the left,  improve=0.1691722, (0 missing)
##   Surrogate splits:
##       quality          < 0.725376    to the left,  agree=0.838, adj=0.389, (0 split)
##       chlorides        < -0.7692599  to the right, agree=0.824, adj=0.333, (0 split)
##       residual.sugar   < -0.714262   to the right, agree=0.765, adj=0.111, (0 split)
##       volatile.acidity < -0.9739436  to the right, agree=0.750, adj=0.056, (0 split)
##       citric.acid      < 0.7301943   to the left,  agree=0.750, adj=0.056, (0 split)
## 
## Node number 2: 50 observations,    complexity param=0.05740445
##   mean=9.72, MSE=0.3296 
##   left son=4 (8 obs) right son=5 (42 obs)
##   Primary splits:
##       density             < 1.07319     to the right, improve=0.2691899, (0 missing)
##       residual.sugar      < 1.350143    to the right, improve=0.2498302, (0 missing)
##       pH                  < -0.9040212  to the left,  improve=0.1796153, (0 missing)
##       free.sulfur.dioxide < 0.5871171   to the right, improve=0.1761516, (0 missing)
##       fixed.acidity       < 0.6195383   to the left,  improve=0.1636859, (0 missing)
##   Surrogate splits:
##       residual.sugar < 0.8903435   to the right, agree=0.96, adj=0.750, (0 split)
##       pH             < -1.62161    to the left,  agree=0.90, adj=0.375, (0 split)
##       fixed.acidity  < -1.295398   to the left,  agree=0.86, adj=0.125, (0 split)
## 
## Node number 3: 18 observations
##   mean=11.67778, MSE=0.5595062 
## 
## Node number 4: 8 observations
##   mean=9.0375, MSE=0.2373437 
## 
## Node number 5: 42 observations,    complexity param=0.03478234
##   mean=9.85, MSE=0.2415476 
##   left son=10 (12 obs) right son=11 (30 obs)
##   Primary splits:
##       chlorides            < 0.1661777   to the right, improve=0.2649581, (0 missing)
##       fixed.acidity        < -0.1464363  to the left,  improve=0.2259107, (0 missing)
##       total.sulfur.dioxide < -0.8614635  to the right, improve=0.1558577, (0 missing)
##       pH                   < -0.7844231  to the left,  improve=0.1369843, (0 missing)
##       citric.acid          < 0.2876523   to the left,  improve=0.1159906, (0 missing)
##   Surrogate splits:
##       pH                   < -0.6648249  to the left,  agree=0.810, adj=0.333, (0 split)
##       free.sulfur.dioxide  < 1.788369    to the right, agree=0.786, adj=0.250, (0 split)
##       volatile.acidity     < 1.20111     to the right, agree=0.762, adj=0.167, (0 split)
##       total.sulfur.dioxide < 0.3206344   to the right, agree=0.738, adj=0.083, (0 split)
## 
## Node number 10: 12 observations
##   mean=9.45, MSE=0.1341667 
## 
## Node number 11: 30 observations,    complexity param=0.0223045
##   mean=10.01, MSE=0.1949 
##   left son=22 (8 obs) right son=23 (22 obs)
##   Primary splits:
##       fixed.acidity        < -0.6060211  to the left,  improve=0.2948015, (0 missing)
##       citric.acid          < 0.2876523   to the left,  improve=0.2793740, (0 missing)
##       density              < 0.01258714  to the right, improve=0.1601391, (0 missing)
##       total.sulfur.dioxide < -0.1784736  to the right, improve=0.1169922, (0 missing)
##       volatile.acidity     < 0.04699996  to the right, improve=0.1142894, (0 missing)
##   Surrogate splits:
##       volatile.acidity    < 0.04699996  to the right, agree=0.867, adj=0.500, (0 split)
##       citric.acid         < -0.6416859  to the left,  agree=0.800, adj=0.250, (0 split)
##       sulphates           < -0.9908584  to the left,  agree=0.800, adj=0.250, (0 split)
##       free.sulfur.dioxide < -1.320754   to the left,  agree=0.767, adj=0.125, (0 split)
## 
## Node number 22: 8 observations
##   mean=9.6125, MSE=0.1135937 
## 
## Node number 23: 22 observations,    complexity param=0.01276974
##   mean=10.15455, MSE=0.1461157 
##   left son=46 (9 obs) right son=47 (13 obs)
##   Primary splits:
##       total.sulfur.dioxide < 0.1104837   to the right, improve=0.3069962, (0 missing)
##       free.sulfur.dioxide  < -0.08417077 to the right, improve=0.2386171, (0 missing)
##       citric.acid          < 0.2876523   to the left,  improve=0.1961538, (0 missing)
##       pH                   < -0.3857626  to the left,  improve=0.1070944, (0 missing)
##       chlorides            < -0.151871   to the right, improve=0.1070944, (0 missing)
##   Surrogate splits:
##       chlorides           < -0.1144535  to the right, agree=0.864, adj=0.667, (0 split)
##       free.sulfur.dioxide < -0.08417077 to the right, agree=0.864, adj=0.667, (0 split)
##       citric.acid         < 0.2876523   to the left,  agree=0.773, adj=0.444, (0 split)
##       fixed.acidity       < 0.5429409   to the left,  agree=0.727, adj=0.333, (0 split)
##       residual.sugar      < 0.5149972   to the right, agree=0.727, adj=0.333, (0 split)
## 
## Node number 46: 9 observations
##   mean=9.9, MSE=0.06 
## 
## Node number 47: 13 observations
##   mean=10.33077, MSE=0.1298225
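The CP table at the top of the summary lists, for each tree size, the cross-validated error (`xerror`). One common way to use it is to pick the CP value with the lowest `xerror` and prune the tree to that size. The sketch below shows the idea on the built-in `mtcars` data, which is only a stand-in so the example runs on its own; the seed is arbitrary.

```r
library(rpart)

set.seed(1)  # xerror comes from random cross-validation folds

# mtcars is a built-in dataset used here only as a stand-in for the wine data
fit_toy <- rpart(mpg ~ ., data = mtcars)

# The complexity table is the same table printed at the top of summary()
cp_table <- fit_toy$cptable
print(cp_table)

# Choose the CP value with the lowest cross-validated error and prune to it
best_row   <- which.min(cp_table[, "xerror"])
best_cp    <- cp_table[best_row, "CP"]
pruned_toy <- prune(fit_toy, cp = best_cp)

cat("leaves before:", sum(fit_toy$frame$var == "<leaf>"),
    " leaves after:", sum(pruned_toy$frame$var == "<leaf>"), "\n")
```

Because the folds are random, `xerror` changes slightly from run to run unless a seed is set.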

3.4.1 Making predictions

# predict() for rpart takes newdata; an argument named data is silently ignored
# and the fitted values for the training rows are returned instead
pred_rpart <- predict(fit_rpart, newdata = testing)
print(head(pred_rpart))

4 Creating the Machine Learning Model (via a 1:225 ratio)

# Use the first 225 rows for training and the remaining rows for testing

index <- 1:225

training <- winequality[index,]
testing <- winequality[-index,]
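Since the same split is repeated for 68, 225 and 404 training rows, it can be wrapped in a small function and the model compared across sizes in one loop. The sketch below is one possible way to do that on simulated data; `split_first_n()`, `toy` and the seed are hypothetical names and values introduced for illustration only.

```r
# Hypothetical helper: first n rows for training, the rest for testing
split_first_n <- function(df, n) {
  index <- seq_len(n)
  list(training = df[index, , drop = FALSE],
       testing  = df[-index, , drop = FALSE])
}

set.seed(7)  # arbitrary seed for the simulated stand-in data
toy <- data.frame(density = rnorm(600))
toy$alcohol <- 10.5 - 0.9 * toy$density + rnorm(600, sd = 0.4)

# Fit the same linear model for each training size and record the test RMSE
sizes <- c(68, 225, 404)
test_rmse <- sapply(sizes, function(n) {
  parts <- split_first_n(toy, n)
  fit   <- lm(alcohol ~ ., data = parts$training)
  pred  <- predict(fit, newdata = parts$testing)
  sqrt(mean((parts$testing$alcohol - pred)^2))
})
names(test_rmse) <- sizes
print(round(test_rmse, 3))
```

The same loop could hold the random forest and rpart fits as well, so that all three training sizes are evaluated with identical code.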

4.1 Conducting Z-score scaling

This converts all the predictors to a common scale with a mean of zero and a standard deviation (SD) of 1.

# using dplyr to keep the data frame structure

training %>%
  mutate_at(vars(-alcohol), scale) -> training

testing %>%
  mutate_at(vars(-alcohol), scale) -> testing

4.2 GLM and LM

We fit a linear model, which assumes normally distributed errors.

fit_lm <- lm(alcohol~., data = training)

summary(fit_lm)

## 
## Call:
## lm(formula = alcohol ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67579 -0.20055 -0.03262  0.18538  1.32780 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.04044    0.02160 464.762  < 2e-16 ***
## fixed.acidity         0.34104    0.03009  11.335  < 2e-16 ***
## volatile.acidity      0.02345    0.02318   1.012   0.3127    
## citric.acid           0.03489    0.02532   1.378   0.1696    
## residual.sugar        1.35874    0.06314  21.519  < 2e-16 ***
## chlorides             0.04464    0.02690   1.659   0.0985 .  
## free.sulfur.dioxide  -0.04373    0.03412  -1.281   0.2014    
## total.sulfur.dioxide  0.06717    0.03524   1.906   0.0580 .  
## density              -2.13757    0.06656 -32.116  < 2e-16 ***
## pH                    0.28243    0.02956   9.553  < 2e-16 ***
## sulphates             0.15293    0.02403   6.363 1.19e-09 ***
## quality               0.02577    0.02665   0.967   0.3345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3241 on 213 degrees of freedom
## Multiple R-squared:  0.9192, Adjusted R-squared:  0.915 
## F-statistic: 220.2 on 11 and 213 DF,  p-value: < 2.2e-16

4.2.1 Making predictions

pred_lm <- predict(fit_lm, newdata = testing)
print(head(pred_lm))

##        1        2        3        4        5        6 
## 9.177984 9.589678 9.249720 9.589678 9.177984 8.950655

4.3 Using Random Forests

fit_rf <- randomForest(alcohol~., data = training)
summary(fit_rf)

##                 Length Class  Mode     
## call              3    -none- call     
## type              1    -none- character
## predicted       225    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times       225    -none- numeric  
## importance       11    -none- numeric  
## importanceSD      0    -none- NULL     
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y               225    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call

4.3.1 Making predictions

pred_rf <- predict(fit_rf, newdata = testing)
print(head(pred_rf))

##        1        2        3        4        5        6 
## 9.096147 9.664323 9.515713 9.664323 9.096147 9.405313

4.4 Using rpart

fit_rpart <- rpart(alcohol~., data = training)
summary(fit_rpart)

## Call:
## rpart(formula = alcohol ~ ., data = training)
##   n= 225 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.61124095      0 1.0000000 1.0081151 0.09090015
## 2 0.07347520      1 0.3887590 0.4390047 0.04353133
## 3 0.05549413      2 0.3152838 0.3837303 0.04033121
## 4 0.03561969      3 0.2597897 0.3290749 0.03483100
## 5 0.01909043      4 0.2241700 0.2993195 0.03101116
## 6 0.01467852      5 0.2050796 0.2943712 0.03049512
## 7 0.01294534      6 0.1904011 0.3052157 0.03115360
## 8 0.01164583      7 0.1774557 0.3021325 0.03270031
## 9 0.01000000      8 0.1658099 0.2827364 0.03175247
## 
## Variable importance
##              density       residual.sugar              quality 
##                   39                   18                   13 
##            chlorides total.sulfur.dioxide        fixed.acidity 
##                   12                    6                    3 
##  free.sulfur.dioxide     volatile.acidity                   pH 
##                    3                    2                    1 
##          citric.acid            sulphates 
##                    1                    1 
## 
## Node number 1: 225 observations,    complexity param=0.611241
##   mean=10.04044, MSE=1.229875 
##   left son=2 (161 obs) right son=3 (64 obs)
##   Primary splits:
##       density              < -0.6752237  to the right, improve=0.6112410, (0 missing)
##       quality              < 0.9023869   to the left,  improve=0.3603031, (0 missing)
##       residual.sugar       < 0.06196372  to the right, improve=0.2923846, (0 missing)
##       total.sulfur.dioxide < -0.05262977 to the right, improve=0.2911999, (0 missing)
##       chlorides            < -0.3529252  to the right, improve=0.2829519, (0 missing)
##   Surrogate splits:
##       residual.sugar       < -0.6411415  to the right, agree=0.849, adj=0.469, (0 split)
##       quality              < 0.9023869   to the left,  agree=0.827, adj=0.391, (0 split)
##       chlorides            < -0.3939736  to the right, agree=0.800, adj=0.297, (0 split)
##       total.sulfur.dioxide < -0.490358   to the right, agree=0.756, adj=0.141, (0 split)
##       fixed.acidity        < -2.113016   to the right, agree=0.724, adj=0.031, (0 split)
## 
## Node number 2: 161 observations,    complexity param=0.05549413
##   mean=9.493789, MSE=0.3978496 
##   left son=4 (88 obs) right son=5 (73 obs)
##   Primary splits:
##       density              < 0.3721636   to the right, improve=0.2397429, (0 missing)
##       free.sulfur.dioxide  < 0.5643408   to the right, improve=0.1603777, (0 missing)
##       residual.sugar       < 0.6633038   to the right, improve=0.1525241, (0 missing)
##       chlorides            < -0.5786915  to the right, improve=0.1380782, (0 missing)
##       total.sulfur.dioxide < -0.06414894 to the right, improve=0.1288413, (0 missing)
##   Surrogate splits:
##       residual.sugar       < 0.1082206   to the right, agree=0.907, adj=0.795, (0 split)
##       free.sulfur.dioxide  < 0.3198022   to the right, agree=0.733, adj=0.411, (0 split)
##       total.sulfur.dioxide < 0.0164852   to the right, agree=0.733, adj=0.411, (0 split)
##       pH                   < 0.1962378   to the left,  agree=0.640, adj=0.205, (0 split)
##       sulphates            < -0.6277445  to the right, agree=0.640, adj=0.205, (0 split)
## 
## Node number 3: 64 observations,    complexity param=0.0734752
##   mean=11.41562, MSE=0.6800684 
##   left son=6 (46 obs) right son=7 (18 obs)
##   Primary splits:
##       density              < -1.415618   to the right, improve=0.4671452, (0 missing)
##       quality              < 0.9023869   to the left,  improve=0.2017941, (0 missing)
##       volatile.acidity     < 0.2604736   to the left,  improve=0.1638049, (0 missing)
##       chlorides            < -0.5992157  to the right, improve=0.1632888, (0 missing)
##       total.sulfur.dioxide < 0.08560018  to the right, improve=0.1065815, (0 missing)
##   Surrogate splits:
##       volatile.acidity < 0.9682824   to the left,  agree=0.812, adj=0.333, (0 split)
##       quality          < 2.059293    to the left,  agree=0.766, adj=0.167, (0 split)
##       fixed.acidity    < -2.045281   to the right, agree=0.750, adj=0.111, (0 split)
##       citric.acid      < -2.304522   to the right, agree=0.750, adj=0.111, (0 split)
##       chlorides        < 0.5296158   to the left,  agree=0.734, adj=0.056, (0 split)
## 
## Node number 4: 88 observations,    complexity param=0.01164583
##   mean=9.2125, MSE=0.1774574 
##   left son=8 (64 obs) right son=9 (24 obs)
##   Primary splits:
##       free.sulfur.dioxide  < 0.1975329   to the right, improve=0.2063656, (0 missing)
##       pH                   < 1.179769    to the left,  improve=0.1935050, (0 missing)
##       density              < 0.9861492   to the right, improve=0.1793659, (0 missing)
##       fixed.acidity        < 0.7318338   to the left,  improve=0.1509011, (0 missing)
##       total.sulfur.dioxide < -0.02959145 to the right, improve=0.1409015, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < -0.352128   to the right, agree=0.830, adj=0.375, (0 split)
##       fixed.acidity        < 1.476913    to the left,  agree=0.795, adj=0.250, (0 split)
##       sulphates            < -1.138937   to the right, agree=0.795, adj=0.250, (0 split)
##       citric.acid          < -1.042826   to the right, agree=0.750, adj=0.083, (0 split)
##       volatile.acidity     < -1.155144   to the right, agree=0.739, adj=0.042, (0 split)
## 
## Node number 5: 73 observations,    complexity param=0.03561969
##   mean=9.832877, MSE=0.4531657 
##   left son=10 (47 obs) right son=11 (26 obs)
##   Primary splits:
##       chlorides            < -0.1682073  to the right, improve=0.29795720, (0 missing)
##       residual.sugar       < 0.001829718 to the left,  improve=0.24941140, (0 missing)
##       fixed.acidity        < -0.2164494  to the left,  improve=0.13288350, (0 missing)
##       sulphates            < -1.241176   to the right, improve=0.13010280, (0 missing)
##       total.sulfur.dioxide < 0.50029     to the right, improve=0.06840864, (0 missing)
##   Surrogate splits:
##       residual.sugar       < 0.0157068   to the left,  agree=0.767, adj=0.346, (0 split)
##       total.sulfur.dioxide < -1.42341    to the right, agree=0.726, adj=0.231, (0 split)
##       fixed.acidity        < 0.7318338   to the left,  agree=0.712, adj=0.192, (0 split)
##       sulphates            < -1.343414   to the right, agree=0.712, adj=0.192, (0 split)
##       free.sulfur.dioxide  < -1.881045   to the right, agree=0.685, adj=0.115, (0 split)
## 
## Node number 6: 46 observations,    complexity param=0.01909043
##   mean=11.06304, MSE=0.3810255 
##   left son=12 (22 obs) right son=13 (24 obs)
##   Primary splits:
##       density              < -1.036392   to the right, improve=0.3014030, (0 missing)
##       total.sulfur.dioxide < -1.377333   to the right, improve=0.1728717, (0 missing)
##       free.sulfur.dioxide  < -0.5972176  to the right, improve=0.1671277, (0 missing)
##       citric.acid          < 0.2188691   to the left,  improve=0.1663449, (0 missing)
##       residual.sugar       < -1.006571   to the left,  improve=0.1389388, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide < -0.536083   to the right, agree=0.717, adj=0.409, (0 split)
##       residual.sugar      < -0.9371859  to the right, agree=0.674, adj=0.318, (0 split)
##       chlorides           < -0.6197399  to the right, agree=0.674, adj=0.318, (0 split)
##       quality             < 0.9023869   to the left,  agree=0.630, adj=0.227, (0 split)
##       fixed.acidity       < -1.029264   to the left,  agree=0.609, adj=0.182, (0 split)
## 
## Node number 7: 18 observations
##   mean=12.31667, MSE=0.3147222 
## 
## Node number 8: 64 observations
##   mean=9.095312, MSE=0.1163843 
## 
## Node number 9: 24 observations
##   mean=9.525, MSE=0.2060417 
## 
## Node number 10: 47 observations,    complexity param=0.01294534
##   mean=9.559574, MSE=0.2092168 
##   left son=20 (14 obs) right son=21 (33 obs)
##   Primary splits:
##       pH          < -0.4711582  to the left,  improve=0.36430300, (0 missing)
##       chlorides   < 0.09860742  to the right, improve=0.19334220, (0 missing)
##       citric.acid < 0.613149    to the right, improve=0.11691220, (0 missing)
##       density     < -0.4765813  to the right, improve=0.10675790, (0 missing)
##       sulphates   < -0.8322215  to the left,  improve=0.09523678, (0 missing)
##   Surrogate splits:
##       chlorides           < 0.8374789   to the right, agree=0.851, adj=0.500, (0 split)
##       citric.acid         < 0.613149    to the right, agree=0.809, adj=0.357, (0 split)
##       free.sulfur.dioxide < 0.5643408   to the right, agree=0.809, adj=0.357, (0 split)
##       volatile.acidity    < 3.256864    to the right, agree=0.745, adj=0.143, (0 split)
##       residual.sugar      < -1.038951   to the left,  agree=0.723, adj=0.071, (0 split)
## 
## Node number 11: 26 observations,    complexity param=0.01467852
##   mean=10.32692, MSE=0.5150444 
##   left son=22 (12 obs) right son=23 (14 obs)
##   Primary splits:
##       fixed.acidity        < -0.2164494  to the left,  improve=0.3033247, (0 missing)
##       residual.sugar       < -0.09530983 to the left,  improve=0.2260971, (0 missing)
##       sulphates            < -0.9344601  to the right, improve=0.1791869, (0 missing)
##       total.sulfur.dioxide < -0.5940304  to the left,  improve=0.1601954, (0 missing)
##       pH                   < -0.4009059  to the right, improve=0.1430691, (0 missing)
##   Surrogate splits:
##       citric.acid      < -0.6091186  to the left,  agree=0.846, adj=0.667, (0 split)
##       chlorides        < -0.3529252  to the right, agree=0.731, adj=0.417, (0 split)
##       volatile.acidity < 0.07172462  to the right, agree=0.692, adj=0.333, (0 split)
##       density          < -0.4946397  to the left,  agree=0.654, adj=0.250, (0 split)
##       pH               < -0.7872931  to the right, agree=0.654, adj=0.250, (0 split)
## 
## Node number 12: 22 observations
##   mean=10.70909, MSE=0.2617355 
## 
## Node number 13: 24 observations
##   mean=11.3875, MSE=0.2702604 
## 
## Node number 20: 14 observations
##   mean=9.135714, MSE=0.0580102 
## 
## Node number 21: 33 observations
##   mean=9.739394, MSE=0.1648118 
## 
## Node number 22: 12 observations
##   mean=9.9, MSE=0.2633333 
## 
## Node number 23: 14 observations
##   mean=10.69286, MSE=0.4406633

4.4.1 Making predictions

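# predict.rpart() takes `newdata`; with `data =` the argument is ignored and
# the fitted values for the training rows (as printed below) are returned
pred_rpart <- predict(fit_rpart, newdata = testing)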
print(head(pred_rpart))

##        1        2        3        4        5        6 
## 9.095312 9.739394 9.739394 9.739394 9.739394 9.739394
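
A point worth checking when calling `predict()` on an rpart fit: new rows are supplied through `newdata`. A `data` argument is not part of `predict.rpart()`; it is absorbed by `...`, so the call falls back to the fitted values for the training rows. The sketch below illustrates the difference on the built-in `mtcars` data, which is used purely for illustration and is not part of this analysis.

```r
library(rpart)

# Built-in data used purely for illustration
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

fit <- rpart(mpg ~ wt + hp, data = train)

# `data` is not an argument of predict.rpart(); it is swallowed by `...`,
# so this returns the fitted values for the 24 training rows
p_fitted <- predict(fit, data = test)

# `newdata` is the argument that supplies new rows to score
p_test <- predict(fit, newdata = test)

length(p_fitted)  # 24
length(p_test)    # 8
```

A quick length check like this is an easy way to confirm that predictions really refer to the held-out rows.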

5 Creating the Machine Learning Model (via a 1:404 ratio)

# Train on the first 404 rows; test on the remaining rows

index <- 1:404

training <- winequality[index,]
testing <- winequality[-index,]
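
The split above takes the first 404 rows in file order. If the rows happen to be ordered in some systematic way, a random draw of the same size may give a more representative training set. The following is a minimal sketch of that alternative, not the method used in this report; the toy data frame `wine` and the size `n_train` are hypothetical stand-ins.

```r
# Toy data frame standing in for winequality (hypothetical values)
set.seed(42)
wine <- data.frame(alcohol = rnorm(100, mean = 10.5),
                   pH      = rnorm(100, mean = 3.2, sd = 0.15))

n_train <- 80
index <- sample(nrow(wine), n_train)  # random row numbers, without replacement

training <- wine[index, ]
testing  <- wine[-index, ]
```

Setting the seed first keeps the split reproducible between runs.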

5.1 Conducting Z-score scaling

Z-score scaling converts all the predictors to a common scale with a mean of zero and a standard deviation (SD) of 1. The response, alcohol, is left unscaled.

# using dplyr to keep the data frame structure

training %>%
  mutate_at(vars(-alcohol), scale) -> training

testing %>%
  mutate_at(vars(-alcohol), scale) -> testing
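
The chunk above standardizes `testing` with its own means and standard deviations. A common alternative, sketched below on the assumption that held-out rows should be placed on the training scale, is to reuse the training means and SDs so that z = (x - mean_train) / sd_train for both sets. The toy data frames and the helper name `scale_with` are hypothetical.

```r
# Toy stand-ins for the training/testing frames (hypothetical values)
train <- data.frame(alcohol = c(9.0, 10.5, 12.0, 11.0),
                    density = c(0.998, 0.994, 0.990, 0.992),
                    pH      = c(3.0, 3.2, 3.4, 3.1))
test  <- data.frame(alcohol = c(10.0, 11.5),
                    density = c(0.996, 0.991),
                    pH      = c(3.3, 3.1))

predictors <- setdiff(names(train), "alcohol")

# Means and SDs are estimated on the training rows only
mu    <- sapply(train[predictors], mean)
sigma <- sapply(train[predictors], sd)

# Hypothetical helper: z = (x - center) / spread, column by column
scale_with <- function(df, cols, center, spread) {
  df[cols] <- scale(df[cols], center = center, scale = spread)
  df
}

train_z <- scale_with(train, predictors, mu, sigma)
test_z  <- scale_with(test,  predictors, mu, sigma)
```

With this approach the scaled test columns will generally not have mean 0 and SD 1 exactly, which is expected: they are measured against the training distribution.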

5.2 GLM and LM

We fit an ordinary linear model of alcohol on all other variables, which assumes normally distributed errors.

fit_lm <- lm(alcohol~., data = training)

summary(fit_lm)

## 
## Call:
## lm(formula = alcohol ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65607 -0.20673 -0.04291  0.19469  1.45034 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.089604   0.015595 646.978  < 2e-16 ***
## fixed.acidity         0.388110   0.022015  17.630  < 2e-16 ***
## volatile.acidity      0.027934   0.016967   1.646  0.10050    
## citric.acid           0.052451   0.017711   2.961  0.00325 ** 
## residual.sugar        1.324580   0.045297  29.242  < 2e-16 ***
## chlorides             0.029724   0.018929   1.570  0.11716    
## free.sulfur.dioxide  -0.029254   0.023679  -1.235  0.21741    
## total.sulfur.dioxide  0.063907   0.025215   2.534  0.01165 *  
## density              -2.097863   0.048762 -43.022  < 2e-16 ***
## pH                    0.338720   0.021875  15.484  < 2e-16 ***
## sulphates             0.136301   0.017340   7.860 3.74e-14 ***
## quality              -0.005372   0.018510  -0.290  0.77182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3135 on 392 degrees of freedom
## Multiple R-squared:  0.9168, Adjusted R-squared:  0.9144 
## F-statistic: 392.5 on 11 and 392 DF,  p-value: < 2.2e-16

5.2.1 Making predictions

pred_lm <- predict(fit_lm, newdata = testing)
print(head(pred_lm))

##         1         2         3         4         5         6 
## 11.902581  9.868841 12.292013  9.335277 10.042195 10.728154

5.3 Using Random Forests

fit_rf <- randomForest(alcohol~., data = training)
summary(fit_rf)

##                 Length Class  Mode     
## call              3    -none- call     
## type              1    -none- character
## predicted       404    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times       404    -none- numeric  
## importance       11    -none- numeric  
## importanceSD      0    -none- NULL     
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y               404    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call
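
As the output above shows, `summary()` on a randomForest object only lists the components of the fitted object. A more informative view usually comes from `print()` and `importance()`. The sketch below uses the built-in `mtcars` data purely for illustration; the seed and `ntree = 200` are arbitrary choices, not settings from this report.

```r
library(randomForest)

set.seed(1)  # forests are random; fix the seed for repeatable output
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

print(fit)        # forest type, ntree, mtry, OOB MSE, % variance explained
importance(fit)   # one row per predictor (IncNodePurity for regression)

# Out-of-bag MSE after the last tree
oob_mse <- tail(fit$mse, 1)
```

The same calls can be applied to `fit_rf` to see which predictors the forest relies on most.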

5.3.1 Making predictions

pred_rf <- predict(fit_rf, newdata = testing)
print(head(pred_rf))

##         1         2         3         4         5         6 
## 10.593747  9.993910 11.850000  9.720163 10.207677 10.832270

5.4 Using rpart

fit_rpart <- rpart(alcohol~., data = training)
summary(fit_rpart)

## Call:
## rpart(formula = alcohol ~ ., data = training)
##   n= 404 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.58090511      0 1.0000000 1.0043466 0.06605389
## 2 0.09530956      1 0.4190949 0.4236376 0.02846941
## 3 0.04157645      2 0.3237853 0.3450813 0.02605592
## 4 0.04021294      3 0.2822089 0.3541419 0.02672443
## 5 0.02404081      4 0.2419959 0.3103928 0.02481352
## 6 0.01254340      5 0.2179551 0.2828711 0.02430734
## 7 0.01000000      6 0.2054117 0.2821220 0.02347507
## 
## Variable importance
##              density       residual.sugar            chlorides 
##                   43                   17                   12 
##              quality total.sulfur.dioxide        fixed.acidity 
##                    9                    9                    4 
##  free.sulfur.dioxide                   pH     volatile.acidity 
##                    2                    1                    1 
## 
## Node number 1: 404 observations,    complexity param=0.5809051
##   mean=10.0896, MSE=1.145337 
##   left son=2 (302 obs) right son=3 (102 obs)
##   Primary splits:
##       density              < -0.7856747  to the right, improve=0.5809051, (0 missing)
##       chlorides            < -0.4367856  to the right, improve=0.3400786, (0 missing)
##       total.sulfur.dioxide < -0.1599907  to the right, improve=0.2954995, (0 missing)
##       residual.sugar       < 0.05098712  to the right, improve=0.2889399, (0 missing)
##       quality              < 0.7876364   to the left,  improve=0.2722077, (0 missing)
##   Surrogate splits:
##       residual.sugar       < -0.8665956  to the right, agree=0.827, adj=0.314, (0 split)
##       chlorides            < -0.5356158  to the right, agree=0.822, adj=0.294, (0 split)
##       quality              < 0.7876364   to the left,  agree=0.809, adj=0.245, (0 split)
##       total.sulfur.dioxide < -0.7026774  to the right, agree=0.795, adj=0.186, (0 split)
##       fixed.acidity        < -2.04522    to the right, agree=0.762, adj=0.059, (0 split)
## 
## Node number 2: 302 observations,    complexity param=0.09530956
##   mean=9.615563, MSE=0.4469432 
##   left son=4 (144 obs) right son=5 (158 obs)
##   Primary splits:
##       density              < 0.4529375   to the right, improve=0.3267322, (0 missing)
##       residual.sugar       < 0.6876772   to the right, improve=0.1925828, (0 missing)
##       chlorides            < -0.3379554  to the right, improve=0.1634240, (0 missing)
##       total.sulfur.dioxide < -0.1147668  to the right, improve=0.1288189, (0 missing)
##       free.sulfur.dioxide  < 0.6054841   to the right, improve=0.1182795, (0 missing)
##   Surrogate splits:
##       residual.sugar       < 0.144618    to the right, agree=0.904, adj=0.799, (0 split)
##       free.sulfur.dioxide  < 0.6054841   to the right, agree=0.728, adj=0.431, (0 split)
##       total.sulfur.dioxide < 0.06612874  to the right, agree=0.702, adj=0.375, (0 split)
##       fixed.acidity        < 0.1727764   to the right, agree=0.659, adj=0.285, (0 split)
##       pH                   < -0.747385   to the left,  agree=0.639, adj=0.243, (0 split)
## 
## Node number 3: 102 observations,    complexity param=0.04157645
##   mean=11.49314, MSE=0.5778941 
##   left son=6 (86 obs) right son=7 (16 obs)
##   Primary splits:
##       density          < -1.605345   to the right, improve=0.3263727, (0 missing)
##       quality          < 0.7876364   to the left,  improve=0.1934016, (0 missing)
##       residual.sugar   < -1.00236    to the left,  improve=0.1836943, (0 missing)
##       chlorides        < -0.585031   to the right, improve=0.1294672, (0 missing)
##       volatile.acidity < -0.1407947  to the left,  improve=0.1088130, (0 missing)
##   Surrogate splits:
##       fixed.acidity    < -1.849514   to the right, agree=0.873, adj=0.188, (0 split)
##       volatile.acidity < 2.93134     to the left,  agree=0.863, adj=0.125, (0 split)
##       quality          < 1.912036    to the left,  agree=0.853, adj=0.062, (0 split)
## 
## Node number 4: 144 observations
##   mean=9.215278, MSE=0.1786555 
## 
## Node number 5: 158 observations,    complexity param=0.04021294
##   mean=9.98038, MSE=0.4123366 
##   left son=10 (85 obs) right son=11 (73 obs)
##   Primary splits:
##       chlorides      < -0.09087978 to the right, improve=0.28560890, (0 missing)
##       sulphates      < -1.421424   to the right, improve=0.12795620, (0 missing)
##       density        < -0.421377   to the right, improve=0.10952380, (0 missing)
##       residual.sugar < -1.035131   to the left,  improve=0.09701653, (0 missing)
##       quality        < 0.7876364   to the left,  improve=0.09389713, (0 missing)
##   Surrogate splits:
##       residual.sugar       < -0.2205425  to the left,  agree=0.690, adj=0.329, (0 split)
##       total.sulfur.dioxide < -1.030551   to the right, agree=0.639, adj=0.219, (0 split)
##       density              < -0.6035259  to the right, agree=0.620, adj=0.178, (0 split)
##       quality              < 0.7876364   to the left,  agree=0.608, adj=0.151, (0 split)
##       fixed.acidity        < -0.6100459  to the right, agree=0.589, adj=0.110, (0 split)
## 
## Node number 6: 86 observations,    complexity param=0.02404081
##   mean=11.30581, MSE=0.4535708 
##   left son=12 (72 obs) right son=13 (14 obs)
##   Primary splits:
##       residual.sugar   < -0.6325184  to the left,  improve=0.28518090, (0 missing)
##       density          < -0.9678236  to the right, improve=0.17544140, (0 missing)
##       volatile.acidity < -0.3037109  to the left,  improve=0.10697700, (0 missing)
##       citric.acid      < -0.3861785  to the left,  improve=0.09882531, (0 missing)
##       sulphates        < 0.3506651   to the left,  improve=0.09403990, (0 missing)
##   Surrogate splits:
##       quality          < 1.912036    to the left,  agree=0.884, adj=0.286, (0 split)
##       volatile.acidity < 2.00039     to the left,  agree=0.860, adj=0.143, (0 split)
##       sulphates        < -1.421424   to the right, agree=0.860, adj=0.143, (0 split)
##       pH               < 2.050261    to the left,  agree=0.849, adj=0.071, (0 split)
## 
## Node number 7: 16 observations
##   mean=12.5, MSE=0.04375 
## 
## Node number 10: 85 observations
##   mean=9.662353, MSE=0.2279945 
## 
## Node number 11: 73 observations
##   mean=10.35068, MSE=0.3720886 
## 
## Node number 12: 72 observations,    complexity param=0.0125434
##   mean=11.14722, MSE=0.336659 
##   left son=24 (21 obs) right son=25 (51 obs)
##   Primary splits:
##       density              < -1.004253   to the right, improve=0.23944600, (0 missing)
##       residual.sugar       < -1.00236    to the left,  improve=0.19215240, (0 missing)
##       citric.acid          < -0.3861785  to the left,  improve=0.14760910, (0 missing)
##       total.sulfur.dioxide < 0.06612874  to the right, improve=0.07944766, (0 missing)
##       sulphates            < 0.3506651   to the left,  improve=0.07523465, (0 missing)
##   Surrogate splits:
##       fixed.acidity        < 1.281775    to the right, agree=0.736, adj=0.095, (0 split)
##       free.sulfur.dioxide  < 0.8724668   to the right, agree=0.736, adj=0.095, (0 split)
##       citric.acid          < 1.213356    to the right, agree=0.722, adj=0.048, (0 split)
##       total.sulfur.dioxide < 0.6088154   to the right, agree=0.722, adj=0.048, (0 split)
##       pH                   < 1.881728    to the right, agree=0.722, adj=0.048, (0 split)
## 
## Node number 13: 14 observations
##   mean=12.12143, MSE=0.2602551 
## 
## Node number 24: 21 observations
##   mean=10.70476, MSE=0.2375964 
## 
## Node number 25: 51 observations
##   mean=11.32941, MSE=0.2636448

5.4.1 Making predictions

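# predict.rpart() takes `newdata`; with `data =` the argument is ignored and
# the fitted values for the training rows (as printed below) are returned
pred_rpart <- predict(fit_rpart, newdata = testing)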
print(head(pred_rpart))

##        1        2        3        4        5        6 
## 9.215278 9.662353 9.662353 9.662353 9.662353 9.662353
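
The predictions from the three models can be compared on a common footing with an error measure such as the root mean squared error (RMSE). The following is a minimal sketch with toy numbers; the helper `rmse` and the example vectors are hypothetical and not results from this analysis.

```r
# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy vectors standing in for testing$alcohol and a model's predictions
actual    <- c(10.0, 11.5, 9.5, 12.0)
predicted <- c(10.4, 11.0, 9.5, 12.3)

rmse(actual, predicted)  # sqrt(0.125), roughly 0.354
```

With the objects in this section, the equivalent calls would be of the form `rmse(testing$alcohol, pred_lm)`, assuming each prediction vector has one entry per row of `testing`.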

White Wine Quality Dataset (from the University of California Irvine)

Joel Jr Rudinas

January 27, 2019
