Los datos que analizamos son datos de vinos blancos de Portugal. El conjunto de datos se puede acceder en Repositorio de Datos de Aprendizaje Automático de UCI http://archive.ics.uci.edu/ml
url <-"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
wine <- read.table(file=url, header = T, sep=";")
knitr::kable(head(wine),caption = "Resumen de datos")
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
library(plotly)
plot_ly(data = wine, x =~quality, type = "histogram")
Queremos predecir la calidad del vino, Quality según otras características del mismo:
aciditysugar contentchloridessulfuralcoholpHdensityQualitywine2 <- wine
wine2$qualitychar <- ifelse( wine2$quality == 3, "a_tres"
, ifelse(wine2$quality == 4, "b_Cuatro"
, ifelse(wine2$quality == 5, "c_cinco"
, ifelse(wine2$quality == 6, "d_Seis"
, ifelse(wine2$quality == 7, "e_Siete"
, ifelse(wine2$quality == 8, "f_Ocho"
, "g_Nueve"))) )))
plot_ly(data = wine2, x = ~qualitychar, y = ~alcohol
, color = ~qualitychar
, type = "box"
, colors = "Dark2"
)
Quality sigue una distribución más o menos Normal.plot_ly(data = wine2, x = ~qualitychar, y = ~density
, color = ~qualitychar, type = "box", colors = "Set1")
No es necesario el procesamiento previo de datos para emplear un modelo de árbol de decisión. Para los datos de entrenamiento (training) y prueba (test) tomar una muestra al azar con una proporción del 75% y del 25% respectivamente.
# training set
set.seed(pi)
itrain <- sample( 1:4898, size=3750, replace = FALSE)
wine_train <- wine[itrain, ]
nrow(wine_train)
## [1] 3750
# test set
wine_test <- wine[-itrain, ]
El paquete rpart (partición recursiva) ofrece una muy buena implementación árboles de regresión CART.
library(rpart)
m.rpart <- rpart(quality ~.
, data = wine_train)
m.rpart
## n= 3750
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3750 2933.71800 5.880800
## 2) alcohol< 10.85 2374 1431.89400 5.607835
## 4) volatile.acidity>=0.235 1488 738.64250 5.412634
## 8) free.sulfur.dioxide< 13.5 132 69.51515 4.939394 *
## 9) free.sulfur.dioxide>=13.5 1356 636.68730 5.458702
## 18) alcohol< 9.85 882 335.83670 5.346939 *
## 19) alcohol>=9.85 474 269.33330 5.666667 *
## 5) volatile.acidity< 0.235 886 541.33300 5.935666 *
## 3) alcohol>=10.85 1376 1019.75600 6.351744
## 6) alcohol< 11.74167 664 501.30120 6.114458
## 12) free.sulfur.dioxide< 10.5 36 38.00000 5.000000 *
## 13) free.sulfur.dioxide>=10.5 628 416.02550 6.178344 *
## 7) alcohol>=11.74167 712 446.20220 6.573034 *
library(rpart.plot)
rpart.plot(m.rpart)
p.rpart <- predict( m.rpart, wine_test )
summary(p.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.939 5.347 5.936 5.888 6.178 6.573
summary( wine_test$quality )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.868 6.000 9.000
Otra forma de analizar el rendimiento del modelo es considerar cómo de diferente, en promedio, ha caído la predicción del valor real: Error absoluto medio (MAE).
MAE <- function(actual, predicted){
mean(abs (actual - predicted))
}
MAE(wine_test$quality, p.rpart)
## [1] 0.5971976
Un Error Absoluto Medio del 59% es aceptable.
Para mejorar el modelo pensamos en implementar otro algoritmo, un model tree, este ejecuta múltiples modelos de regresión en cada nodo.
M5P (where the P stands for ‘prime’) generates M5 model trees using the M5’ algorithm, which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992).
# ajuste
m.m5p <- M5P(quality ~. , data = wine_train)
# Construimos un predictor
p.m5p <- predict(m.m5p, wine_test)
m.m5p
## M5 pruned model tree:
## (using smoothed linear models)
##
## alcohol <= 10.85 :
## | volatile.acidity <= 0.275 :
## | | volatile.acidity <= 0.207 :
## | | | residual.sugar <= 12.575 : LM1 (452/80.189%)
## | | | residual.sugar > 12.575 :
## | | | | alcohol <= 9.05 :
## | | | | | free.sulfur.dioxide <= 30.5 : LM2 (18/0%)
## | | | | | free.sulfur.dioxide > 30.5 :
## | | | | | | fixed.acidity <= 7.15 : LM3 (8/37.391%)
## | | | | | | fixed.acidity > 7.15 :
## | | | | | | | citric.acid <= 0.275 : LM4 (6/0%)
## | | | | | | | citric.acid > 0.275 :
## | | | | | | | | pH <= 2.995 : LM5 (8/0%)
## | | | | | | | | pH > 2.995 :
## | | | | | | | | | citric.acid <= 0.305 : LM6 (5/0%)
## | | | | | | | | | citric.acid > 0.305 :
## | | | | | | | | | | total.sulfur.dioxide <= 152.5 : LM7 (3/0%)
## | | | | | | | | | | total.sulfur.dioxide > 152.5 : LM8 (4/48.956%)
## | | | | alcohol > 9.05 :
## | | | | | citric.acid <= 0.295 : LM9 (15/0%)
## | | | | | citric.acid > 0.295 :
## | | | | | | chlorides <= 0.052 :
## | | | | | | | density <= 0.997 :
## | | | | | | | | density <= 0.997 : LM10 (3/53.297%)
## | | | | | | | | density > 0.997 : LM11 (6/0%)
## | | | | | | | density > 0.997 :
## | | | | | | | | density <= 0.999 : LM12 (20/0%)
## | | | | | | | | density > 0.999 :
## | | | | | | | | | citric.acid <= 0.355 : LM13 (3/0%)
## | | | | | | | | | citric.acid > 0.355 : LM14 (11/72.677%)
## | | | | | | chlorides > 0.052 :
## | | | | | | | citric.acid <= 0.33 : LM15 (7/0%)
## | | | | | | | citric.acid > 0.33 :
## | | | | | | | | fixed.acidity <= 7.35 : LM16 (2/0%)
## | | | | | | | | fixed.acidity > 7.35 : LM17 (2/0%)
## | | volatile.acidity > 0.207 : LM18 (794/80.012%)
## | volatile.acidity > 0.275 : LM19 (1007/70.185%)
## alcohol > 10.85 :
## | alcohol <= 11.742 :
## | | free.sulfur.dioxide <= 19.5 : LM20 (139/99.551%)
## | | free.sulfur.dioxide > 19.5 : LM21 (525/85.946%)
## | alcohol > 11.742 : LM22 (712/82.921%)
##
## LM num: 1
## quality =
## 0.224 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.0252 * citric.acid
## + 0.1675 * residual.sugar
## + 3.6926 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 344.0395 * density
## + 1.4353 * pH
## + 1.1799 * sulphates
## - 0.1373 * alcohol
## + 341.6078
##
## LM num: 2
## quality =
## 0.2851 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 1.3837 * citric.acid
## + 0.0009 * residual.sugar
## + 6.5209 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.1324 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 39.1688
##
## LM num: 3
## quality =
## 0.4128 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 2.1858 * citric.acid
## + 0.0009 * residual.sugar
## + 6.5209 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.1324 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 38.353
##
## LM num: 4
## quality =
## 0.0945 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 3.2342 * citric.acid
## + 0.0009 * residual.sugar
## + 6.5209 * chlorides
## + 0 * free.sulfur.dioxide
## - 0.0028 * total.sulfur.dioxide
## - 33.7487 * density
## + 1.1023 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 38.1374
##
## LM num: 5
## quality =
## 0.0786 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 3.1743 * citric.acid
## + 0.0009 * residual.sugar
## + 6.5209 * chlorides
## + 0 * free.sulfur.dioxide
## - 0.0017 * total.sulfur.dioxide
## - 33.7487 * density
## + 1.1553 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 37.7903
##
## LM num: 6
## quality =
## 0.095 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 3.6989 * citric.acid
## + 0.0009 * residual.sugar
## + 6.5209 * chlorides
## + 0 * free.sulfur.dioxide
## - 0.0031 * total.sulfur.dioxide
## - 33.7487 * density
## + 1.0743 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 38.4185
##
## LM num: 7
## quality =
## 0.095 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 3.6436 * citric.acid
## + 0.0009 * residual.sugar
## + 6.5209 * chlorides
## + 0 * free.sulfur.dioxide
## - 0.0037 * total.sulfur.dioxide
## - 33.7487 * density
## + 1.0743 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 38.4799
##
## LM num: 8
## quality =
## 0.095 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 3.6436 * citric.acid
## + 0.0009 * residual.sugar
## + 6.5209 * chlorides
## + 0 * free.sulfur.dioxide
## - 0.0037 * total.sulfur.dioxide
## - 33.7487 * density
## + 1.0743 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 38.4633
##
## LM num: 9
## quality =
## 0.0914 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.261 * citric.acid
## - 0.0158 * residual.sugar
## + 5.2945 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 39.7253
##
## LM num: 10
## quality =
## 0.0914 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.4809 * citric.acid
## - 0.0043 * residual.sugar
## + 8.5031 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## + 59.3298 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## - 53.175
##
## LM num: 11
## quality =
## 0.0914 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.4809 * citric.acid
## - 0.0043 * residual.sugar
## + 8.5031 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## + 46.0329 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## - 39.8825
##
## LM num: 12
## quality =
## 0.0914 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.6526 * citric.acid
## + 0.0069 * residual.sugar
## + 8.5031 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 39.405
##
## LM num: 13
## quality =
## 0.0914 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.6881 * citric.acid
## + 0.0314 * residual.sugar
## + 8.5031 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 39.0305
##
## LM num: 14
## quality =
## 0.0914 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.6881 * citric.acid
## + 0.0246 * residual.sugar
## + 8.5031 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 39.1027
##
## LM num: 15
## quality =
## -0.0366 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.7517 * citric.acid
## - 0.0043 * residual.sugar
## + 12.4522 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 40.6239
##
## LM num: 16
## quality =
## -0.0568 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.7517 * citric.acid
## - 0.0043 * residual.sugar
## + 12.4522 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 40.749
##
## LM num: 17
## quality =
## -0.0568 * fixed.acidity
## - 0.1521 * volatile.acidity
## - 0.7517 * citric.acid
## - 0.0043 * residual.sugar
## + 12.4522 * chlorides
## + 0 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 33.7487 * density
## - 0.0831 * pH
## + 0.1277 * sulphates
## - 0.0234 * alcohol
## + 40.7432
##
## LM num: 18
## quality =
## 0.0768 * fixed.acidity
## - 2.9002 * volatile.acidity
## - 0.0059 * citric.acid
## + 0.0546 * residual.sugar
## + 0.0335 * chlorides
## + 0.0023 * free.sulfur.dioxide
## + 0 * total.sulfur.dioxide
## - 130.8842 * density
## + 0.3813 * pH
## + 0.7226 * sulphates
## + 0.1358 * alcohol
## + 132.7272
##
## LM num: 19
## quality =
## 0.0016 * fixed.acidity
## - 1.1871 * volatile.acidity
## + 0.0548 * residual.sugar
## + 0.0025 * free.sulfur.dioxide
## - 99.3856 * density
## + 0.6052 * pH
## + 0.4919 * sulphates
## + 0.1314 * alcohol
## + 100.7473
##
## LM num: 20
## quality =
## 0.0039 * fixed.acidity
## - 3.445 * volatile.acidity
## + 0.0698 * citric.acid
## + 0.1353 * residual.sugar
## - 0.587 * chlorides
## + 0.0705 * free.sulfur.dioxide
## - 321.2688 * density
## + 0.1402 * pH
## + 0.0758 * sulphates
## + 0.0036 * alcohol
## + 323.5085
##
## LM num: 21
## quality =
## 0.0039 * fixed.acidity
## - 1.3212 * volatile.acidity
## + 0.0199 * citric.acid
## + 0.0802 * residual.sugar
## - 5.9846 * chlorides
## + 0.0003 * free.sulfur.dioxide
## - 0.0016 * total.sulfur.dioxide
## - 96.9542 * density
## + 0.9579 * pH
## + 0.6961 * sulphates
## + 0.0036 * alcohol
## + 99.3517
##
## LM num: 22
## quality =
## 0.1596 * fixed.acidity
## - 0.0393 * volatile.acidity
## + 0.1054 * residual.sugar
## - 6.2377 * chlorides
## + 0.0148 * free.sulfur.dioxide
## - 258.2448 * density
## + 1.3379 * pH
## + 0.6299 * sulphates
## + 0.0034 * alcohol
## + 256.083
##
## Number of Rules : 22
MAE( wine_test$quality, p.m5p )
## [1] 0.5765669
de 59% a 57% hemos mejorado la estimación.
Podemos estimar la calidad de un vino concreto que tenga los siguientes valores en las variables:
test <- data.frame(fixed.acidity = 8.5, volatile.acidity = 0.33
, citric.acid = 0.42, residual.sugar = 10.5
, chlorides = 0.065, free.sulfur.dioxide = 47
, total.sulfur.dioxide = 186, density = 0.9955
, pH = 3.10, sulphates = 0.40, alcohol = 9.9)
test_pred <- predict(m.m5p, test)
test_pred
## [1] 5.496149
Hornik K, Buchta C and Zeileis A (2009). “Open-Source Machine Learning: R Meets Weka.” Computational Statistics, 24(2), pp. 225-232. doi:10.1007/s00180-008-0119-7 (URL:http://doi.org/10.1007/s00180-008-0119-7).
Terry Therneau and Beth Atkinson (2018). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-12. https://CRAN.R-project.org/package=rpart
Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec and Pedro Despouy (2017). plotly: Create Interactive Web Graphics via ‘plotly.js’. R package version 4.7.1. https://CRAN.R-project.org/package=plotly
Jason Chan. Estimating Wine Quality with Decision Tree (2016). http://www.rpubs.com/jasonchanhku/wine
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=es_ES.UTF-8
## [9] LC_ADDRESS=es_ES.UTF-8 LC_TELEPHONE=es_ES.UTF-8
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=es_ES.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 RWeka_0.4-37 RColorBrewer_1.1-2
## [4] rpart.plot_2.1.2 plotly_4.7.1 ggplot2_2.2.1
## [7] rpart_4.1-12
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 compiler_3.4.3 plyr_1.8.4
## [4] bindr_0.1 RWekajars_3.9.2-1 tools_3.4.3
## [7] digest_0.6.12 jsonlite_1.5 evaluate_0.10.1
## [10] tibble_1.3.4 gtable_0.2.0 viridisLite_0.2.0
## [13] pkgconfig_2.0.1 rlang_0.1.2 shiny_1.0.5
## [16] crosstalk_1.0.0 yaml_2.1.14 rJava_0.9-8
## [19] stringr_1.2.0 dplyr_0.7.4 httr_1.3.1
## [22] knitr_1.17 htmlwidgets_0.9 rprojroot_1.2
## [25] grid_3.4.3 glue_1.1.1 data.table_1.10.4
## [28] R6_2.2.2 rmarkdown_1.6 purrr_0.2.4
## [31] tidyr_0.8.0 magrittr_1.5 backports_1.1.0
## [34] scales_0.5.0 htmltools_0.3.6 assertthat_0.2.0
## [37] xtable_1.8-2 mime_0.5 colorspace_1.3-2
## [40] httpuv_1.3.5 stringi_1.1.5 lazyeval_0.2.0
## [43] munsell_0.4.3