Greetings! I would like to analyze the Wine Quality Dataset. I got the link from one of the DataCamp courses I took, although this analysis deals more with wine quality. I would like to apply machine learning principles to it.
The machine learning principles applied here I learned from Rick Scavetta’s analysis of the Boston Housing Prices dataset, from Julia Silge’s Machine Learning Case Study course on DataCamp, from Kaggle’s Machine Learning tutorial, and from the guy who won that Iowa housing dataset competition on Kaggle.
We would like to have a look at the relationship between the alcohol content and everything else.
library(caret)
## Warning: package 'caret' was built under R version 3.5.2
## Loading required package: lattice
## Loading required package: ggplot2
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.1 v dplyr 0.7.8
## v readr 1.1.1 v stringr 1.3.1
## v tibble 1.4.2 v forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.2
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(corrplot)
## corrplot 0.84 loaded
library(keras)
## Warning: package 'keras' was built under R version 3.5.2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.2
library(modelr)
library(psych)
## Warning: package 'psych' was built under R version 3.5.2
##
## Attaching package: 'psych'
## The following object is masked from 'package:modelr':
##
## heights
## The following object is masked from 'package:randomForest':
##
## outlier
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(Rmisc)
## Warning: package 'Rmisc' was built under R version 3.5.2
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
library(scales)
## Warning: package 'scales' was built under R version 3.5.2
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(rpart)
library(yardstick)
## Loading required package: broom
## Warning: package 'broom' was built under R version 3.5.2
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:modelr':
##
## mae, rmse
## The following object is masked from 'package:readr':
##
## spec
## The following objects are masked from 'package:caret':
##
## mnLogLoss, precision, recall
winequality <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep = ";")
head(winequality)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
summary(winequality)
##  fixed.acidity     volatile.acidity  citric.acid       residual.sugar
##  Min.   : 3.800    Min.   :0.0800    Min.   :0.0000    Min.   : 0.600
##  1st Qu.: 6.300    1st Qu.:0.2100    1st Qu.:0.2700    1st Qu.: 1.700
##  Median : 6.800    Median :0.2600    Median :0.3200    Median : 5.200
##  Mean   : 6.855    Mean   :0.2782    Mean   :0.3342    Mean   : 6.391
##  3rd Qu.: 7.300    3rd Qu.:0.3200    3rd Qu.:0.3900    3rd Qu.: 9.900
##  Max.   :14.200    Max.   :1.1000    Max.   :1.6600    Max.   :65.800
##  chlorides         free.sulfur.dioxide  total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00       Min.   :  9.0
##  1st Qu.:0.03600   1st Qu.: 23.00       1st Qu.:108.0
##  Median :0.04300   Median : 34.00       Median :134.0
##  Mean   :0.04577   Mean   : 35.31       Mean   :138.4
##  3rd Qu.:0.05000   3rd Qu.: 46.00       3rd Qu.:167.0
##  Max.   :0.34600   Max.   :289.00       Max.   :440.0
##  density           pH                sulphates         alcohol
##  Min.   :0.9871    Min.   :2.720     Min.   :0.2200    Min.   : 8.00
##  1st Qu.:0.9917    1st Qu.:3.090     1st Qu.:0.4100    1st Qu.: 9.50
##  Median :0.9937    Median :3.180     Median :0.4700    Median :10.40
##  Mean   :0.9940    Mean   :3.188     Mean   :0.4898    Mean   :10.51
##  3rd Qu.:0.9961    3rd Qu.:3.280     3rd Qu.:0.5500    3rd Qu.:11.40
##  Max.   :1.0390    Max.   :3.820     Max.   :1.0800    Max.   :14.20
##  quality
##  Min.   :3.000
##  1st Qu.:5.000
##  Median :6.000
##  Mean   :5.878
##  3rd Qu.:6.000
##  Max.   :9.000
Thankfully, there are no NAs in the wine data: the summary above reports no NA counts for any column.
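The absence of NAs can also be checked directly instead of being read off the summary. This is a minimal sketch: `count_missing` is a helper name made up here, and the small `demo` frame is invented stand-in data (not the wine data) so the snippet runs without a download.

```r
# Count missing values per column; all zeros means there are no NAs.
# `count_missing` is a hypothetical helper name used only in this sketch.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Made-up stand-in data (NOT the wine data), with one NA planted on purpose.
demo <- data.frame(alcohol = c(8.8, 9.5, NA), pH = c(3.00, 3.30, 3.26))
na_counts <- count_missing(demo)
print(na_counts)

# With the real data the same call would be count_missing(winequality),
# and anyNA(winequality) gives a single TRUE/FALSE answer.
```

On the wine data, every count being zero would confirm what the summary suggests.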
I’m not a chemist, but I will give the best description I can for each of the variables.
fixed.acidity - the non-volatile acids in the wine (such as tartaric acid), which do not evaporate readily
volatile.acidity - a measure of the low-molecular-weight fatty acids in wine (mainly acetic acid), generally perceived as the odour of vinegar
citric.acid - a weak organic acid that occurs naturally in citrus fruits
residual.sugar - any natural grape sugars left over after fermentation stops, whether on purpose or not
chlorides - self-explanatory (amount of chloride salts in the wine)
free.sulfur.dioxide - the SO2 available to react, which exhibits both germicidal and antioxidant properties
total.sulfur.dioxide - self-explanatory (free plus bound SO2)
density - self-explanatory
pH - from a winemaker’s point of view, a way to measure ripeness in relation to acidity
sulphates - self-explanatory (amount of sulphates in the wine)
alcohol - the amount of alcohol by volume
quality - self-explanatory (the quality score; in this data it ranges from 3 to 9)
Apologies if my explanations aren’t good enough, but from what I researched this is the best I can give.
We would like to see the relationship between alcohol content and each of the other variables.
First, a histogram of how the assessed wines are distributed by alcohol content.
ggplot(winequality, aes(x = alcohol)) + geom_histogram(fill = "blue", binwidth = 1)
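The bin width for the histogram above was picked by hand. One common rule of thumb for choosing it from the data is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). The sketch below is only an illustration: `fd_binwidth` is a made-up helper name, and the values are simulated stand-ins in roughly the alcohol range, not the real data.

```r
# Freedman-Diaconis bin width: 2 * IQR(x) / n^(1/3).
# `fd_binwidth` is a hypothetical helper name used only in this sketch.
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1/3)
}

# Simulated stand-in values (NOT the wine data).
set.seed(1)
x_demo <- runif(500, min = 8, max = 14)
bw <- fd_binwidth(x_demo)
print(bw)

# With the real data, the result could be passed straight to ggplot2:
# ggplot(winequality, aes(x = alcohol)) +
#   geom_histogram(fill = "blue", binwidth = fd_binwidth(winequality$alcohol))
```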
I want to find each variable’s correlation with alcohol, to see which variables correlate most strongly with it.
numericVars <- which(sapply(winequality, is.numeric))
numericVarNames <- names(numericVars)
cat("There are", length(numericVars), "numeric variables")
## There are 12 numeric variables
all_numVar <- winequality[, numericVars]
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")
#Sort on decreasing correlations with alcohol
cor_sorted <- as.matrix(sort(cor_numVar[,"alcohol"], decreasing = TRUE))
#Selecting high correlations
Cor_High <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.175)))
cor_numVar <- cor_numVar[Cor_High, Cor_High]
corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")
Since quality is correlated with alcohol content, let’s plot the two against each other, then compute the correlation of alcohol with every other variable.
ggplot(winequality, aes(x = quality, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Quality \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
cor(winequality$alcohol, winequality$fixed.acidity, use = "pairwise.complete.obs")
## [1] -0.1208811
cor(winequality$alcohol, winequality$volatile.acidity, use = "pairwise.complete.obs")
## [1] 0.06771794
cor(winequality$alcohol, winequality$citric.acid, use = "pairwise.complete.obs")
## [1] -0.07572873
cor(winequality$alcohol, winequality$residual.sugar, use = "pairwise.complete.obs")
## [1] -0.4506312
cor(winequality$alcohol, winequality$chlorides, use = "pairwise.complete.obs")
## [1] -0.3601887
cor(winequality$alcohol, winequality$free.sulfur.dioxide, use = "pairwise.complete.obs")
## [1] -0.2501039
cor(winequality$alcohol, winequality$total.sulfur.dioxide, use = "pairwise.complete.obs")
## [1] -0.4488921
cor(winequality$alcohol, winequality$density, use = "pairwise.complete.obs")
## [1] -0.7801376
cor(winequality$alcohol, winequality$pH, use = "pairwise.complete.obs")
## [1] 0.1214321
cor(winequality$alcohol, winequality$sulphates, use = "pairwise.complete.obs")
## [1] -0.01743277
cor(winequality$alcohol, winequality$quality, use = "pairwise.complete.obs")
## [1] 0.4355747
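The eleven separate `cor()` calls above can be collapsed into one call that returns every correlation with the target, ranked by absolute strength. From the values printed above, density (-0.78) is the strongest, followed by residual sugar (-0.45), total sulfur dioxide (-0.45) and quality (0.44). The sketch below assumes a helper named `cor_with` (made up for this example) and uses simulated stand-in data rather than the wine data.

```r
# Correlation of every numeric column with a target column, sorted by
# absolute strength. `cor_with` is a hypothetical helper name.
cor_with <- function(df, target) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  r <- cor(num, use = "pairwise.complete.obs")[, target]
  r <- r[names(r) != target]  # drop the target's correlation with itself
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated stand-in data (NOT the wine data): y rises with a, falls with b.
set.seed(42)
a <- rnorm(200)
b <- rnorm(200)
demo <- data.frame(a = a, b = b, noise = rnorm(200),
                   y = 2 * a - b + rnorm(200))
r_demo <- cor_with(demo, "y")
print(round(r_demo, 2))

# With the real data: cor_with(winequality, "alcohol")
```

Sorting by absolute value keeps strong negative correlations, such as the one with density, at the top of the list.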
ggplot(winequality, aes(x = pH, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n pH content \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = citric.acid, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Citric Acid \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = fixed.acidity, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Fixed Acidity in the wine \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = volatile.acidity, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Volatile Acidity \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = residual.sugar, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Residual Sugar Content \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = chlorides, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Chlorides in wine \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = free.sulfur.dioxide, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Free Sulfur Dioxide \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = total.sulfur.dioxide, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Total Sulfur Dioxide \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = density, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Density \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
ggplot(winequality, aes(x = sulphates, y = alcohol)) + geom_point(col = "blue") + labs(x = "\n Sulphates \n") + labs(y = "\n Alcohol by Volume \n") + geom_smooth(method = "lm", se = FALSE, color = "red", aes(group = 1))
# Via keras
index <- 1:68
training <- winequality[index,]
testing <- winequality[-index,]
This converts all the predictors to a common scale with a mean of zero and a standard deviation (SD) of 1.
# using dplyr to keep the data frame structure
training %>%
mutate_at(vars(-alcohol), scale) -> training
testing %>%
mutate_at(vars(-alcohol), scale) -> testing
Linear modelling, assuming normally distributed errors:
fit_lm <- lm(alcohol~., data = training)
summary(fit_lm)
##
## Call:
## lm(formula = alcohol ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59689 -0.24361 -0.03415 0.17444 0.92859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.238235 0.043027 237.950 < 2e-16 ***
## fixed.acidity 0.252765 0.062516 4.043 0.000163 ***
## volatile.acidity 0.086418 0.048215 1.792 0.078479 .
## citric.acid 0.149721 0.058733 2.549 0.013565 *
## residual.sugar 1.129491 0.120762 9.353 4.87e-13 ***
## chlorides -0.001271 0.060959 -0.021 0.983445
## free.sulfur.dioxide -0.103949 0.057158 -1.819 0.074316 .
## total.sulfur.dioxide 0.102732 0.058450 1.758 0.084282 .
## density -1.778473 0.117720 -15.108 < 2e-16 ***
## pH 0.244542 0.058097 4.209 9.37e-05 ***
## sulphates 0.130461 0.049824 2.618 0.011342 *
## quality 0.064368 0.055449 1.161 0.250625
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3548 on 56 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.8909
## F-statistic: 50.72 on 11 and 56 DF, p-value: < 2.2e-16
pred_lm <- predict(fit_lm, newdata = testing)
print(head(pred_lm))
##         1         2         3         4         5         6
## 10.493639  9.537425  9.598242  9.080628  9.317752  9.249997
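The predictions above are printed but never compared with the observed alcohol values in the test set. A simple way to score them is the root mean squared error (RMSE) and the mean absolute error (MAE). The helper names below are made up for this sketch (yardstick, which is loaded above, offers its own versions), and the numbers are a tiny invented example so the arithmetic can be checked by hand.

```r
# RMSE and MAE between observed and predicted values.
# `rmse_manual` and `mae_manual` are hypothetical helper names.
rmse_manual <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae_manual  <- function(obs, pred) mean(abs(obs - pred))

# Tiny made-up example (NOT model output).
obs  <- c(10, 9, 11, 12)
pred <- c(10, 10, 10, 10)
print(rmse_manual(obs, pred))  # sqrt((0 + 1 + 1 + 4) / 4) = sqrt(1.5)
print(mae_manual(obs, pred))   # (0 + 1 + 1 + 2) / 4 = 1

# With the objects from this analysis:
# rmse_manual(testing$alcohol, pred_lm)
```

Both errors are in the units of the target, so for this analysis they would read as percentage points of alcohol by volume.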
fit_rf <- randomForest(alcohol~., data = training)
summary(fit_rf)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 68 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 68 -none- numeric
## importance 11 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 68 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
pred_rf <- predict(fit_rf, newdata = testing)
print(head(pred_rf))
##         1         2         3         4         5         6
## 10.341243  9.972557  9.881607  9.827643  9.856643 10.263370
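As the output above shows, `summary()` on a randomForest object only lists the components stored in the fit. Printing the fit and calling `importance()` usually say more about the model. The sketch below uses simulated stand-in data (not the wine data), and the `ntree = 200` setting is an arbitrary choice for the example.

```r
library(randomForest)

# Simulated stand-in data (NOT the wine data): y depends strongly on x1,
# weakly on x2, and not at all on x3.
set.seed(123)
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$y <- 3 * demo$x1 + 0.5 * demo$x2 + rnorm(n, sd = 0.5)

fit_demo <- randomForest(y ~ ., data = demo, ntree = 200, importance = TRUE)

# Printing the fit reports the out-of-bag MSE and the % of variance explained.
print(fit_demo)

# importance() returns a matrix with one row per predictor; for regression
# with importance = TRUE the columns are %IncMSE and IncNodePurity.
imp <- importance(fit_demo)
print(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), ])
```

For the models in this analysis the equivalent calls would be `print(fit_rf)` and `importance(fit_rf)`.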
fit_rpart <- rpart(alcohol~., data = training)
summary(fit_rpart)
## Call:
## rpart(formula = alcohol ~ ., data = training)
## n= 68
##
## CP nsplit rel error xerror xstd
## 1 0.65643234 0 1.0000000 1.0348123 0.18148704
## 2 0.05740445 1 0.3435677 0.4134829 0.06184590
## 3 0.03478234 2 0.2861632 0.3501176 0.05194178
## 4 0.02230450 3 0.2513809 0.3446927 0.05086082
## 5 0.01276974 4 0.2290764 0.3206472 0.04735077
## 6 0.01000000 5 0.2163066 0.3184275 0.04731005
##
## Variable importance
## density chlorides quality
## 46 17 16
## residual.sugar volatile.acidity citric.acid
## 8 3 3
## fixed.acidity pH free.sulfur.dioxide
## 2 2 1
## total.sulfur.dioxide
## 1
##
## Node number 1: 68 observations, complexity param=0.6564323
## mean=10.23824, MSE=1.136479
## left son=2 (50 obs) right son=3 (18 obs)
## Primary splits:
## density < -0.7316956 to the right, improve=0.6564323, (0 missing)
## quality < 0.725376 to the left, improve=0.3191095, (0 missing)
## chlorides < -0.5260461 to the right, improve=0.2335563, (0 missing)
## residual.sugar < -0.5031297 to the right, improve=0.1816994, (0 missing)
## citric.acid < 0.7301943 to the left, improve=0.1691722, (0 missing)
## Surrogate splits:
## quality < 0.725376 to the left, agree=0.838, adj=0.389, (0 split)
## chlorides < -0.7692599 to the right, agree=0.824, adj=0.333, (0 split)
## residual.sugar < -0.714262 to the right, agree=0.765, adj=0.111, (0 split)
## volatile.acidity < -0.9739436 to the right, agree=0.750, adj=0.056, (0 split)
## citric.acid < 0.7301943 to the left, agree=0.750, adj=0.056, (0 split)
##
## Node number 2: 50 observations, complexity param=0.05740445
## mean=9.72, MSE=0.3296
## left son=4 (8 obs) right son=5 (42 obs)
## Primary splits:
## density < 1.07319 to the right, improve=0.2691899, (0 missing)
## residual.sugar < 1.350143 to the right, improve=0.2498302, (0 missing)
## pH < -0.9040212 to the left, improve=0.1796153, (0 missing)
## free.sulfur.dioxide < 0.5871171 to the right, improve=0.1761516, (0 missing)
## fixed.acidity < 0.6195383 to the left, improve=0.1636859, (0 missing)
## Surrogate splits:
## residual.sugar < 0.8903435 to the right, agree=0.96, adj=0.750, (0 split)
## pH < -1.62161 to the left, agree=0.90, adj=0.375, (0 split)
## fixed.acidity < -1.295398 to the left, agree=0.86, adj=0.125, (0 split)
##
## Node number 3: 18 observations
## mean=11.67778, MSE=0.5595062
##
## Node number 4: 8 observations
## mean=9.0375, MSE=0.2373437
##
## Node number 5: 42 observations, complexity param=0.03478234
## mean=9.85, MSE=0.2415476
## left son=10 (12 obs) right son=11 (30 obs)
## Primary splits:
## chlorides < 0.1661777 to the right, improve=0.2649581, (0 missing)
## fixed.acidity < -0.1464363 to the left, improve=0.2259107, (0 missing)
## total.sulfur.dioxide < -0.8614635 to the right, improve=0.1558577, (0 missing)
## pH < -0.7844231 to the left, improve=0.1369843, (0 missing)
## citric.acid < 0.2876523 to the left, improve=0.1159906, (0 missing)
## Surrogate splits:
## pH < -0.6648249 to the left, agree=0.810, adj=0.333, (0 split)
## free.sulfur.dioxide < 1.788369 to the right, agree=0.786, adj=0.250, (0 split)
## volatile.acidity < 1.20111 to the right, agree=0.762, adj=0.167, (0 split)
## total.sulfur.dioxide < 0.3206344 to the right, agree=0.738, adj=0.083, (0 split)
##
## Node number 10: 12 observations
## mean=9.45, MSE=0.1341667
##
## Node number 11: 30 observations, complexity param=0.0223045
## mean=10.01, MSE=0.1949
## left son=22 (8 obs) right son=23 (22 obs)
## Primary splits:
## fixed.acidity < -0.6060211 to the left, improve=0.2948015, (0 missing)
## citric.acid < 0.2876523 to the left, improve=0.2793740, (0 missing)
## density < 0.01258714 to the right, improve=0.1601391, (0 missing)
## total.sulfur.dioxide < -0.1784736 to the right, improve=0.1169922, (0 missing)
## volatile.acidity < 0.04699996 to the right, improve=0.1142894, (0 missing)
## Surrogate splits:
## volatile.acidity < 0.04699996 to the right, agree=0.867, adj=0.500, (0 split)
## citric.acid < -0.6416859 to the left, agree=0.800, adj=0.250, (0 split)
## sulphates < -0.9908584 to the left, agree=0.800, adj=0.250, (0 split)
## free.sulfur.dioxide < -1.320754 to the left, agree=0.767, adj=0.125, (0 split)
##
## Node number 22: 8 observations
## mean=9.6125, MSE=0.1135937
##
## Node number 23: 22 observations, complexity param=0.01276974
## mean=10.15455, MSE=0.1461157
## left son=46 (9 obs) right son=47 (13 obs)
## Primary splits:
## total.sulfur.dioxide < 0.1104837 to the right, improve=0.3069962, (0 missing)
## free.sulfur.dioxide < -0.08417077 to the right, improve=0.2386171, (0 missing)
## citric.acid < 0.2876523 to the left, improve=0.1961538, (0 missing)
## pH < -0.3857626 to the left, improve=0.1070944, (0 missing)
## chlorides < -0.151871 to the right, improve=0.1070944, (0 missing)
## Surrogate splits:
## chlorides < -0.1144535 to the right, agree=0.864, adj=0.667, (0 split)
## free.sulfur.dioxide < -0.08417077 to the right, agree=0.864, adj=0.667, (0 split)
## citric.acid < 0.2876523 to the left, agree=0.773, adj=0.444, (0 split)
## fixed.acidity < 0.5429409 to the left, agree=0.727, adj=0.333, (0 split)
## residual.sugar < 0.5149972 to the right, agree=0.727, adj=0.333, (0 split)
##
## Node number 46: 9 observations
## mean=9.9, MSE=0.06
##
## Node number 47: 13 observations
## mean=10.33077, MSE=0.1298225
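# predict() for rpart takes `newdata`. The original call used `data = testing`, which is ignored,
# so it returned the fitted values for the training rows; the output below is from that original call.
pred_rpart <- predict(fit_rpart, newdata = testing)
print(head(pred_rpart))
##        1        2        3        4        5        6
##  9.03750  9.61250 10.33077  9.90000  9.90000 10.33077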
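The split used above takes the first rows of the file as training data and scales the test set with its own mean and SD. A common alternative is to sample the training rows at random and to scale both sets with the training means and SDs, so that the test set is transformed exactly as the model saw the training data. The following is a minimal sketch of that idea, not the method used in this analysis: `split_and_scale` is a made-up helper, and the data are simulated stand-ins.

```r
# Random train/test split, then scale the predictors of BOTH sets using the
# TRAINING means and SDs. `split_and_scale` is a hypothetical helper name.
split_and_scale <- function(df, target, n_train, seed = 1) {
  set.seed(seed)  # note: this resets the global RNG state
  idx   <- sample(nrow(df), n_train)
  train <- df[idx, , drop = FALSE]
  test  <- df[-idx, , drop = FALSE]
  preds <- setdiff(names(df), target)

  mu    <- sapply(train[preds], mean)
  sigma <- sapply(train[preds], sd)

  train[preds] <- as.data.frame(scale(train[preds], center = mu, scale = sigma))
  test[preds]  <- as.data.frame(scale(test[preds],  center = mu, scale = sigma))
  list(train = train, test = test)
}

# Simulated stand-in data (NOT the wine data).
set.seed(7)
demo <- data.frame(y = rnorm(100), a = rnorm(100, 50, 10), b = runif(100))
parts <- split_and_scale(demo, target = "y", n_train = 70)

# Training predictors have mean 0 and SD 1 by construction; the test
# predictors are close to that but not exactly, which is expected.
print(round(colMeans(parts$train[c("a", "b")]), 10))
print(round(colMeans(parts$test[c("a", "b")]), 3))
```

With the wine data, the call might look like `split_and_scale(winequality, target = "alcohol", n_train = 68)`.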
# Via keras
index <- 1:225
training <- winequality[index,]
testing <- winequality[-index,]
This converts all the predictors to a common scale with a mean of zero and a standard deviation (SD) of 1.
# using dplyr to keep the data frame structure
training %>%
mutate_at(vars(-alcohol), scale) -> training
testing %>%
mutate_at(vars(-alcohol), scale) -> testing
Linear modelling, assuming normally distributed errors:
fit_lm <- lm(alcohol~., data = training)
summary(fit_lm)
##
## Call:
## lm(formula = alcohol ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67579 -0.20055 -0.03262 0.18538 1.32780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.04044 0.02160 464.762 < 2e-16 ***
## fixed.acidity 0.34104 0.03009 11.335 < 2e-16 ***
## volatile.acidity 0.02345 0.02318 1.012 0.3127
## citric.acid 0.03489 0.02532 1.378 0.1696
## residual.sugar 1.35874 0.06314 21.519 < 2e-16 ***
## chlorides 0.04464 0.02690 1.659 0.0985 .
## free.sulfur.dioxide -0.04373 0.03412 -1.281 0.2014
## total.sulfur.dioxide 0.06717 0.03524 1.906 0.0580 .
## density -2.13757 0.06656 -32.116 < 2e-16 ***
## pH 0.28243 0.02956 9.553 < 2e-16 ***
## sulphates 0.15293 0.02403 6.363 1.19e-09 ***
## quality 0.02577 0.02665 0.967 0.3345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3241 on 213 degrees of freedom
## Multiple R-squared: 0.9192, Adjusted R-squared: 0.915
## F-statistic: 220.2 on 11 and 213 DF, p-value: < 2.2e-16
pred_lm <- predict(fit_lm, newdata = testing)
print(head(pred_lm))
##        1        2        3        4        5        6
## 9.177984 9.589678 9.249720 9.589678 9.177984 8.950655
fit_rf <- randomForest(alcohol~., data = training)
summary(fit_rf)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 225 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 225 -none- numeric
## importance 11 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 225 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
pred_rf <- predict(fit_rf, newdata = testing)
print(head(pred_rf))
##        1        2        3        4        5        6
## 9.096147 9.664323 9.515713 9.664323 9.096147 9.405313
fit_rpart <- rpart(alcohol~., data = training)
summary(fit_rpart)
## Call:
## rpart(formula = alcohol ~ ., data = training)
## n= 225
##
## CP nsplit rel error xerror xstd
## 1 0.61124095 0 1.0000000 1.0081151 0.09090015
## 2 0.07347520 1 0.3887590 0.4390047 0.04353133
## 3 0.05549413 2 0.3152838 0.3837303 0.04033121
## 4 0.03561969 3 0.2597897 0.3290749 0.03483100
## 5 0.01909043 4 0.2241700 0.2993195 0.03101116
## 6 0.01467852 5 0.2050796 0.2943712 0.03049512
## 7 0.01294534 6 0.1904011 0.3052157 0.03115360
## 8 0.01164583 7 0.1774557 0.3021325 0.03270031
## 9 0.01000000 8 0.1658099 0.2827364 0.03175247
##
## Variable importance
## density residual.sugar quality
## 39 18 13
## chlorides total.sulfur.dioxide fixed.acidity
## 12 6 3
## free.sulfur.dioxide volatile.acidity pH
## 3 2 1
## citric.acid sulphates
## 1 1
##
## Node number 1: 225 observations, complexity param=0.611241
## mean=10.04044, MSE=1.229875
## left son=2 (161 obs) right son=3 (64 obs)
## Primary splits:
## density < -0.6752237 to the right, improve=0.6112410, (0 missing)
## quality < 0.9023869 to the left, improve=0.3603031, (0 missing)
## residual.sugar < 0.06196372 to the right, improve=0.2923846, (0 missing)
## total.sulfur.dioxide < -0.05262977 to the right, improve=0.2911999, (0 missing)
## chlorides < -0.3529252 to the right, improve=0.2829519, (0 missing)
## Surrogate splits:
## residual.sugar < -0.6411415 to the right, agree=0.849, adj=0.469, (0 split)
## quality < 0.9023869 to the left, agree=0.827, adj=0.391, (0 split)
## chlorides < -0.3939736 to the right, agree=0.800, adj=0.297, (0 split)
## total.sulfur.dioxide < -0.490358 to the right, agree=0.756, adj=0.141, (0 split)
## fixed.acidity < -2.113016 to the right, agree=0.724, adj=0.031, (0 split)
##
## Node number 2: 161 observations, complexity param=0.05549413
## mean=9.493789, MSE=0.3978496
## left son=4 (88 obs) right son=5 (73 obs)
## Primary splits:
## density < 0.3721636 to the right, improve=0.2397429, (0 missing)
## free.sulfur.dioxide < 0.5643408 to the right, improve=0.1603777, (0 missing)
## residual.sugar < 0.6633038 to the right, improve=0.1525241, (0 missing)
## chlorides < -0.5786915 to the right, improve=0.1380782, (0 missing)
## total.sulfur.dioxide < -0.06414894 to the right, improve=0.1288413, (0 missing)
## Surrogate splits:
## residual.sugar < 0.1082206 to the right, agree=0.907, adj=0.795, (0 split)
## free.sulfur.dioxide < 0.3198022 to the right, agree=0.733, adj=0.411, (0 split)
## total.sulfur.dioxide < 0.0164852 to the right, agree=0.733, adj=0.411, (0 split)
## pH < 0.1962378 to the left, agree=0.640, adj=0.205, (0 split)
## sulphates < -0.6277445 to the right, agree=0.640, adj=0.205, (0 split)
##
## Node number 3: 64 observations, complexity param=0.0734752
## mean=11.41562, MSE=0.6800684
## left son=6 (46 obs) right son=7 (18 obs)
## Primary splits:
## density < -1.415618 to the right, improve=0.4671452, (0 missing)
## quality < 0.9023869 to the left, improve=0.2017941, (0 missing)
## volatile.acidity < 0.2604736 to the left, improve=0.1638049, (0 missing)
## chlorides < -0.5992157 to the right, improve=0.1632888, (0 missing)
## total.sulfur.dioxide < 0.08560018 to the right, improve=0.1065815, (0 missing)
## Surrogate splits:
## volatile.acidity < 0.9682824 to the left, agree=0.812, adj=0.333, (0 split)
## quality < 2.059293 to the left, agree=0.766, adj=0.167, (0 split)
## fixed.acidity < -2.045281 to the right, agree=0.750, adj=0.111, (0 split)
## citric.acid < -2.304522 to the right, agree=0.750, adj=0.111, (0 split)
## chlorides < 0.5296158 to the left, agree=0.734, adj=0.056, (0 split)
##
## Node number 4: 88 observations, complexity param=0.01164583
## mean=9.2125, MSE=0.1774574
## left son=8 (64 obs) right son=9 (24 obs)
## Primary splits:
## free.sulfur.dioxide < 0.1975329 to the right, improve=0.2063656, (0 missing)
## pH < 1.179769 to the left, improve=0.1935050, (0 missing)
## density < 0.9861492 to the right, improve=0.1793659, (0 missing)
## fixed.acidity < 0.7318338 to the left, improve=0.1509011, (0 missing)
## total.sulfur.dioxide < -0.02959145 to the right, improve=0.1409015, (0 missing)
## Surrogate splits:
## total.sulfur.dioxide < -0.352128 to the right, agree=0.830, adj=0.375, (0 split)
## fixed.acidity < 1.476913 to the left, agree=0.795, adj=0.250, (0 split)
## sulphates < -1.138937 to the right, agree=0.795, adj=0.250, (0 split)
## citric.acid < -1.042826 to the right, agree=0.750, adj=0.083, (0 split)
## volatile.acidity < -1.155144 to the right, agree=0.739, adj=0.042, (0 split)
##
## Node number 5: 73 observations, complexity param=0.03561969
## mean=9.832877, MSE=0.4531657
## left son=10 (47 obs) right son=11 (26 obs)
## Primary splits:
## chlorides < -0.1682073 to the right, improve=0.29795720, (0 missing)
## residual.sugar < 0.001829718 to the left, improve=0.24941140, (0 missing)
## fixed.acidity < -0.2164494 to the left, improve=0.13288350, (0 missing)
## sulphates < -1.241176 to the right, improve=0.13010280, (0 missing)
## total.sulfur.dioxide < 0.50029 to the right, improve=0.06840864, (0 missing)
## Surrogate splits:
## residual.sugar < 0.0157068 to the left, agree=0.767, adj=0.346, (0 split)
## total.sulfur.dioxide < -1.42341 to the right, agree=0.726, adj=0.231, (0 split)
## fixed.acidity < 0.7318338 to the left, agree=0.712, adj=0.192, (0 split)
## sulphates < -1.343414 to the right, agree=0.712, adj=0.192, (0 split)
## free.sulfur.dioxide < -1.881045 to the right, agree=0.685, adj=0.115, (0 split)
##
## Node number 6: 46 observations, complexity param=0.01909043
## mean=11.06304, MSE=0.3810255
## left son=12 (22 obs) right son=13 (24 obs)
## Primary splits:
## density < -1.036392 to the right, improve=0.3014030, (0 missing)
## total.sulfur.dioxide < -1.377333 to the right, improve=0.1728717, (0 missing)
## free.sulfur.dioxide < -0.5972176 to the right, improve=0.1671277, (0 missing)
## citric.acid < 0.2188691 to the left, improve=0.1663449, (0 missing)
## residual.sugar < -1.006571 to the left, improve=0.1389388, (0 missing)
## Surrogate splits:
## free.sulfur.dioxide < -0.536083 to the right, agree=0.717, adj=0.409, (0 split)
## residual.sugar < -0.9371859 to the right, agree=0.674, adj=0.318, (0 split)
## chlorides < -0.6197399 to the right, agree=0.674, adj=0.318, (0 split)
## quality < 0.9023869 to the left, agree=0.630, adj=0.227, (0 split)
## fixed.acidity < -1.029264 to the left, agree=0.609, adj=0.182, (0 split)
##
## Node number 7: 18 observations
## mean=12.31667, MSE=0.3147222
##
## Node number 8: 64 observations
## mean=9.095312, MSE=0.1163843
##
## Node number 9: 24 observations
## mean=9.525, MSE=0.2060417
##
## Node number 10: 47 observations, complexity param=0.01294534
## mean=9.559574, MSE=0.2092168
## left son=20 (14 obs) right son=21 (33 obs)
## Primary splits:
## pH < -0.4711582 to the left, improve=0.36430300, (0 missing)
## chlorides < 0.09860742 to the right, improve=0.19334220, (0 missing)
## citric.acid < 0.613149 to the right, improve=0.11691220, (0 missing)
## density < -0.4765813 to the right, improve=0.10675790, (0 missing)
## sulphates < -0.8322215 to the left, improve=0.09523678, (0 missing)
## Surrogate splits:
## chlorides < 0.8374789 to the right, agree=0.851, adj=0.500, (0 split)
## citric.acid < 0.613149 to the right, agree=0.809, adj=0.357, (0 split)
## free.sulfur.dioxide < 0.5643408 to the right, agree=0.809, adj=0.357, (0 split)
## volatile.acidity < 3.256864 to the right, agree=0.745, adj=0.143, (0 split)
## residual.sugar < -1.038951 to the left, agree=0.723, adj=0.071, (0 split)
##
## Node number 11: 26 observations, complexity param=0.01467852
## mean=10.32692, MSE=0.5150444
## left son=22 (12 obs) right son=23 (14 obs)
## Primary splits:
## fixed.acidity < -0.2164494 to the left, improve=0.3033247, (0 missing)
## residual.sugar < -0.09530983 to the left, improve=0.2260971, (0 missing)
## sulphates < -0.9344601 to the right, improve=0.1791869, (0 missing)
## total.sulfur.dioxide < -0.5940304 to the left, improve=0.1601954, (0 missing)
## pH < -0.4009059 to the right, improve=0.1430691, (0 missing)
## Surrogate splits:
## citric.acid < -0.6091186 to the left, agree=0.846, adj=0.667, (0 split)
## chlorides < -0.3529252 to the right, agree=0.731, adj=0.417, (0 split)
## volatile.acidity < 0.07172462 to the right, agree=0.692, adj=0.333, (0 split)
## density < -0.4946397 to the left, agree=0.654, adj=0.250, (0 split)
## pH < -0.7872931 to the right, agree=0.654, adj=0.250, (0 split)
##
## Node number 12: 22 observations
## mean=10.70909, MSE=0.2617355
##
## Node number 13: 24 observations
## mean=11.3875, MSE=0.2702604
##
## Node number 20: 14 observations
## mean=9.135714, MSE=0.0580102
##
## Node number 21: 33 observations
## mean=9.739394, MSE=0.1648118
##
## Node number 22: 12 observations
## mean=9.9, MSE=0.2633333
##
## Node number 23: 14 observations
## mean=10.69286, MSE=0.4406633
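# predict() for rpart takes `newdata`. The original call used `data = testing`, which is ignored,
# so it returned the fitted values for the training rows; the output below is from that original call.
pred_rpart <- predict(fit_rpart, newdata = testing)
print(head(pred_rpart))
##        1        2        3        4        5        6
## 9.095312 9.739394 9.739394 9.739394 9.739394 9.739394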
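The CP table printed by `summary(fit_rpart)` can be used to prune the tree: one common rule is to keep the complexity parameter with the lowest cross-validated error (`xerror`). In the 225-row table above, the lowest `xerror` (0.2827) is in the last row, so this rule would leave that tree as it is. The sketch below shows the mechanics on simulated stand-in data, not the wine data.

```r
library(rpart)

# Simulated stand-in data (NOT the wine data).
set.seed(99)
n <- 300
demo <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
demo$y <- ifelse(demo$x1 > 0.5, 2, 0) + demo$x2 + rnorm(n, sd = 0.3)

fit_demo <- rpart(y ~ ., data = demo)

# The CP table is stored on the fit as a matrix with the columns
# CP, nsplit, rel error, xerror and xstd.
cp_tab  <- fit_demo$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# prune() returns the subtree that corresponds to the chosen cp.
fit_pruned <- prune(fit_demo, cp = best_cp)
print(fit_pruned$cptable)
```

Because `xerror` comes from cross-validation, the chosen row can change from run to run unless a seed is set before fitting.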
# Via keras
index <- 1:404
training <- winequality[index,]
testing <- winequality[-index,]
This converts all the predictors to a common scale with a mean of zero and a standard deviation (SD) of 1.
# using dplyr to keep the data frame structure
training %>%
mutate_at(vars(-alcohol), scale) -> training
testing %>%
mutate_at(vars(-alcohol), scale) -> testing
Linear modelling, assuming normally distributed errors:
fit_lm <- lm(alcohol~., data = training)
summary(fit_lm)
##
## Call:
## lm(formula = alcohol ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65607 -0.20673 -0.04291 0.19469 1.45034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.089604 0.015595 646.978 < 2e-16 ***
## fixed.acidity 0.388110 0.022015 17.630 < 2e-16 ***
## volatile.acidity 0.027934 0.016967 1.646 0.10050
## citric.acid 0.052451 0.017711 2.961 0.00325 **
## residual.sugar 1.324580 0.045297 29.242 < 2e-16 ***
## chlorides 0.029724 0.018929 1.570 0.11716
## free.sulfur.dioxide -0.029254 0.023679 -1.235 0.21741
## total.sulfur.dioxide 0.063907 0.025215 2.534 0.01165 *
## density -2.097863 0.048762 -43.022 < 2e-16 ***
## pH 0.338720 0.021875 15.484 < 2e-16 ***
## sulphates 0.136301 0.017340 7.860 3.74e-14 ***
## quality -0.005372 0.018510 -0.290 0.77182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3135 on 392 degrees of freedom
## Multiple R-squared: 0.9168, Adjusted R-squared: 0.9144
## F-statistic: 392.5 on 11 and 392 DF, p-value: < 2.2e-16
pred_lm <- predict(fit_lm, newdata = testing)
print(head(pred_lm))
##         1         2         3         4         5         6
## 11.902581  9.868841 12.292013  9.335277 10.042195 10.728154
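Predictions like these only become informative once they are compared with the observed values in the held-out rows. A minimal sketch of that comparison, again on the built-in `mtcars` data with hypothetical `*_demo` names (an assumption for illustration, not the wine data), computes the root mean squared error and the mean absolute error on the test rows:

```r
# Illustrative split of the built-in mtcars data (not the wine data)
train_demo <- mtcars[1:24, ]
test_demo  <- mtcars[25:32, ]

fit_demo  <- lm(mpg ~ wt + hp, data = train_demo)
pred_demo <- predict(fit_demo, newdata = test_demo)

# Root mean squared error and mean absolute error on the held-out rows
rmse_demo <- sqrt(mean((test_demo$mpg - pred_demo)^2))
mae_demo  <- mean(abs(test_demo$mpg - pred_demo))

c(RMSE = rmse_demo, MAE = mae_demo)
```

For the wine models the same two lines would apply with `testing$alcohol` in place of `test_demo$mpg`, which also makes the linear model, random forest and tree directly comparable.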
fit_rf <- randomForest(alcohol~., data = training)
summary(fit_rf)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 404 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 404 -none- numeric
## importance 11 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 404 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
pred_rf <- predict(fit_rf, newdata = testing)
print(head(pred_rf))
##         1         2         3         4         5         6
## 10.593747  9.993910 11.850000  9.720163 10.207677 10.832270
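`summary()` on a randomForest object only lists the components stored inside it, as the table above shows. Printing the object and calling `importance()` are usually more telling. The sketch below assumes the randomForest package is installed and uses the built-in `mtcars` data with hypothetical names (`rf_demo`, `rf_imp`), not the wine data; the seed and `ntree = 200` are arbitrary choices for the illustration.

```r
# Assumes the randomForest package (loaded at the top of this post) is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed so the illustration is repeatable
  rf_demo <- randomForest::randomForest(mpg ~ ., data = mtcars, ntree = 200)

  # Out-of-bag mean of squared residuals and % variance explained
  print(rf_demo)

  # Node-purity importance per predictor, largest first
  rf_imp <- randomForest::importance(rf_demo)
  print(rf_imp[order(rf_imp[, 1], decreasing = TRUE), , drop = FALSE])

  # Out-of-bag pseudo R-squared after the last tree
  tail(rf_demo$rsq, 1)
}
```

The `mse` and `rsq` components listed in the summary above are the per-tree versions of these out-of-bag figures, one value per tree grown.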
fit_rpart <- rpart(alcohol~., data = training)
summary(fit_rpart)
## Call:
## rpart(formula = alcohol ~ ., data = training)
## n= 404
##
## CP nsplit rel error xerror xstd
## 1 0.58090511 0 1.0000000 1.0043466 0.06605389
## 2 0.09530956 1 0.4190949 0.4236376 0.02846941
## 3 0.04157645 2 0.3237853 0.3450813 0.02605592
## 4 0.04021294 3 0.2822089 0.3541419 0.02672443
## 5 0.02404081 4 0.2419959 0.3103928 0.02481352
## 6 0.01254340 5 0.2179551 0.2828711 0.02430734
## 7 0.01000000 6 0.2054117 0.2821220 0.02347507
##
## Variable importance
## density residual.sugar chlorides
## 43 17 12
## quality total.sulfur.dioxide fixed.acidity
## 9 9 4
## free.sulfur.dioxide pH volatile.acidity
## 2 1 1
##
## Node number 1: 404 observations, complexity param=0.5809051
## mean=10.0896, MSE=1.145337
## left son=2 (302 obs) right son=3 (102 obs)
## Primary splits:
## density < -0.7856747 to the right, improve=0.5809051, (0 missing)
## chlorides < -0.4367856 to the right, improve=0.3400786, (0 missing)
## total.sulfur.dioxide < -0.1599907 to the right, improve=0.2954995, (0 missing)
## residual.sugar < 0.05098712 to the right, improve=0.2889399, (0 missing)
## quality < 0.7876364 to the left, improve=0.2722077, (0 missing)
## Surrogate splits:
## residual.sugar < -0.8665956 to the right, agree=0.827, adj=0.314, (0 split)
## chlorides < -0.5356158 to the right, agree=0.822, adj=0.294, (0 split)
## quality < 0.7876364 to the left, agree=0.809, adj=0.245, (0 split)
## total.sulfur.dioxide < -0.7026774 to the right, agree=0.795, adj=0.186, (0 split)
## fixed.acidity < -2.04522 to the right, agree=0.762, adj=0.059, (0 split)
##
## Node number 2: 302 observations, complexity param=0.09530956
## mean=9.615563, MSE=0.4469432
## left son=4 (144 obs) right son=5 (158 obs)
## Primary splits:
## density < 0.4529375 to the right, improve=0.3267322, (0 missing)
## residual.sugar < 0.6876772 to the right, improve=0.1925828, (0 missing)
## chlorides < -0.3379554 to the right, improve=0.1634240, (0 missing)
## total.sulfur.dioxide < -0.1147668 to the right, improve=0.1288189, (0 missing)
## free.sulfur.dioxide < 0.6054841 to the right, improve=0.1182795, (0 missing)
## Surrogate splits:
## residual.sugar < 0.144618 to the right, agree=0.904, adj=0.799, (0 split)
## free.sulfur.dioxide < 0.6054841 to the right, agree=0.728, adj=0.431, (0 split)
## total.sulfur.dioxide < 0.06612874 to the right, agree=0.702, adj=0.375, (0 split)
## fixed.acidity < 0.1727764 to the right, agree=0.659, adj=0.285, (0 split)
## pH < -0.747385 to the left, agree=0.639, adj=0.243, (0 split)
##
## Node number 3: 102 observations, complexity param=0.04157645
## mean=11.49314, MSE=0.5778941
## left son=6 (86 obs) right son=7 (16 obs)
## Primary splits:
## density < -1.605345 to the right, improve=0.3263727, (0 missing)
## quality < 0.7876364 to the left, improve=0.1934016, (0 missing)
## residual.sugar < -1.00236 to the left, improve=0.1836943, (0 missing)
## chlorides < -0.585031 to the right, improve=0.1294672, (0 missing)
## volatile.acidity < -0.1407947 to the left, improve=0.1088130, (0 missing)
## Surrogate splits:
## fixed.acidity < -1.849514 to the right, agree=0.873, adj=0.188, (0 split)
## volatile.acidity < 2.93134 to the left, agree=0.863, adj=0.125, (0 split)
## quality < 1.912036 to the left, agree=0.853, adj=0.062, (0 split)
##
## Node number 4: 144 observations
## mean=9.215278, MSE=0.1786555
##
## Node number 5: 158 observations, complexity param=0.04021294
## mean=9.98038, MSE=0.4123366
## left son=10 (85 obs) right son=11 (73 obs)
## Primary splits:
## chlorides < -0.09087978 to the right, improve=0.28560890, (0 missing)
## sulphates < -1.421424 to the right, improve=0.12795620, (0 missing)
## density < -0.421377 to the right, improve=0.10952380, (0 missing)
## residual.sugar < -1.035131 to the left, improve=0.09701653, (0 missing)
## quality < 0.7876364 to the left, improve=0.09389713, (0 missing)
## Surrogate splits:
## residual.sugar < -0.2205425 to the left, agree=0.690, adj=0.329, (0 split)
## total.sulfur.dioxide < -1.030551 to the right, agree=0.639, adj=0.219, (0 split)
## density < -0.6035259 to the right, agree=0.620, adj=0.178, (0 split)
## quality < 0.7876364 to the left, agree=0.608, adj=0.151, (0 split)
## fixed.acidity < -0.6100459 to the right, agree=0.589, adj=0.110, (0 split)
##
## Node number 6: 86 observations, complexity param=0.02404081
## mean=11.30581, MSE=0.4535708
## left son=12 (72 obs) right son=13 (14 obs)
## Primary splits:
## residual.sugar < -0.6325184 to the left, improve=0.28518090, (0 missing)
## density < -0.9678236 to the right, improve=0.17544140, (0 missing)
## volatile.acidity < -0.3037109 to the left, improve=0.10697700, (0 missing)
## citric.acid < -0.3861785 to the left, improve=0.09882531, (0 missing)
## sulphates < 0.3506651 to the left, improve=0.09403990, (0 missing)
## Surrogate splits:
## quality < 1.912036 to the left, agree=0.884, adj=0.286, (0 split)
## volatile.acidity < 2.00039 to the left, agree=0.860, adj=0.143, (0 split)
## sulphates < -1.421424 to the right, agree=0.860, adj=0.143, (0 split)
## pH < 2.050261 to the left, agree=0.849, adj=0.071, (0 split)
##
## Node number 7: 16 observations
## mean=12.5, MSE=0.04375
##
## Node number 10: 85 observations
## mean=9.662353, MSE=0.2279945
##
## Node number 11: 73 observations
## mean=10.35068, MSE=0.3720886
##
## Node number 12: 72 observations, complexity param=0.0125434
## mean=11.14722, MSE=0.336659
## left son=24 (21 obs) right son=25 (51 obs)
## Primary splits:
## density < -1.004253 to the right, improve=0.23944600, (0 missing)
## residual.sugar < -1.00236 to the left, improve=0.19215240, (0 missing)
## citric.acid < -0.3861785 to the left, improve=0.14760910, (0 missing)
## total.sulfur.dioxide < 0.06612874 to the right, improve=0.07944766, (0 missing)
## sulphates < 0.3506651 to the left, improve=0.07523465, (0 missing)
## Surrogate splits:
## fixed.acidity < 1.281775 to the right, agree=0.736, adj=0.095, (0 split)
## free.sulfur.dioxide < 0.8724668 to the right, agree=0.736, adj=0.095, (0 split)
## citric.acid < 1.213356 to the right, agree=0.722, adj=0.048, (0 split)
## total.sulfur.dioxide < 0.6088154 to the right, agree=0.722, adj=0.048, (0 split)
## pH < 1.881728 to the right, agree=0.722, adj=0.048, (0 split)
##
## Node number 13: 14 observations
## mean=12.12143, MSE=0.2602551
##
## Node number 24: 21 observations
## mean=10.70476, MSE=0.2375964
##
## Node number 25: 51 observations
## mean=11.32941, MSE=0.2636448
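The CP table at the top of this summary can also be used to decide how far to prune the tree. One common rule is to pick the row with the lowest cross-validated error (`xerror`) and prune at that row's CP. In the table above the lowest `xerror` (0.2821220) is in the last row, so this rule would keep the full tree here. The sketch below shows the mechanics on the built-in `mtcars` data with hypothetical names (`tree_demo`, `pruned_demo`); the `minsplit` and `cp` settings are arbitrary choices to grow a deliberately large tree.

```r
library(rpart)

set.seed(1)  # cross-validated errors in the CP table depend on the random folds

# Grow a deliberately large tree on illustrative data (not the wine data)
tree_demo <- rpart(mpg ~ ., data = mtcars,
                   control = rpart.control(minsplit = 10, cp = 0.001))

# Pick the complexity parameter with the lowest cross-validated error
cp_table <- tree_demo$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune back to that complexity and show the resulting CP table
pruned_demo <- prune(tree_demo, cp = best_cp)
printcp(pruned_demo)
```

Because `xerror` comes from random cross-validation folds, the chosen row can change between runs unless a seed is set.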
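# predict() on an rpart model takes `newdata`; an argument named `data` is silently ignored
# and the fitted values for the training rows are returned, which is what the values printed here are.
pred_rpart <- predict(fit_rpart, newdata = testing)
print(head(pred_rpart))
##        1        2        3        4        5        6
## 9.215278 9.662353 9.662353 9.662353 9.662353 9.662353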