Developing regression models to assess wine quality
Analyze and apply linear regression and regression trees to the wine data set; interpret both the linear model and the regression tree in order to make predictions, compare the two techniques, and draw conclusions about wine quality.
library(rpart) # Trees
library(rpart.plot) # Visualize and plot trees
## Warning: package 'rpart.plot' was built under R version 3.6.3
library(caret) # To partition data sets, in case of...
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr) # For select, filter, mutate, arrange ....
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr) # To read data
library(ggplot2) # For more attractive plots
library(reshape) # To rename columns
## Warning: package 'reshape' was built under R version 3.6.3
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
library(corrplot)
## corrplot 0.84 loaded
datosRed <- read_csv("C:/Users/Sergio/Documents/DIPLOMADO IOT DATA CIENCE/MODULO 5 MACHINE LEARNING/Proyecto/MachineLearning/datos/winequality-red.csv")
## Parsed with column specification:
## cols(
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_double()
## )
datosWhite <- read_csv("C:/Users/Sergio/Documents/DIPLOMADO IOT DATA CIENCE/MODULO 5 MACHINE LEARNING/Proyecto/MachineLearning/datos/winequality-white.csv")
## Parsed with column specification:
## cols(
## `fixed acidity` = col_double(),
## `volatile acidity` = col_double(),
## `citric acid` = col_double(),
## `residual sugar` = col_double(),
## chlorides = col_double(),
## `free sulfur dioxide` = col_double(),
## `total sulfur dioxide` = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_double()
## )
head(datosRed)
## # A tibble: 6 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
head(datosWhite)
## # A tibble: 6 x 12
## `fixed acidity` `volatile acidi… `citric acid` `residual sugar` chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 0.27 0.36 20.7 0.045
## 2 6.3 0.3 0.34 1.6 0.049
## 3 8.1 0.28 0.4 6.9 0.05
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.4 6.9 0.05
## # … with 7 more variables: `free sulfur dioxide` <dbl>, `total sulfur
## # dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>, alcohol <dbl>,
## # quality <dbl>
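The two files do not share column names: the red-wine file uses underscores (`fixed_acidity`) while the white-wine file uses spaces (`fixed acidity`). One possible way to align them, sketched here on a small made-up data frame (`ejemploWhite` is a hypothetical stand-in, not part of the original analysis), is to replace the spaces with underscores in base R:

```r
# Hypothetical stand-in for datosWhite: column names that contain spaces
ejemploWhite <- data.frame(
  `fixed acidity` = c(7.0, 6.3),
  `volatile acidity` = c(0.27, 0.30),
  quality = c(6, 6),
  check.names = FALSE  # keep the spaces in the column names
)

# Replace every space in the column names with an underscore
names(ejemploWhite) <- gsub(" ", "_", names(ejemploWhite), fixed = TRUE)

names(ejemploWhite)
```

Applied to the real data, the same `gsub()` call on `names(datosWhite)` would presumably let a formula such as `quality ~ .` be reused across both data sets.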
View(datosRed)
summary(datosRed)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
summary(datosWhite)
## fixed acidity volatile acidity citric acid residual sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free sulfur dioxide total sulfur dioxide density
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9940
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.180 Median :0.4700 Median :10.40 Median :6.000
## Mean :3.188 Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :3.820 Max. :1.0800 Max. :14.20 Max. :9.000
str(datosRed)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1599 obs. of 12 variables:
## $ fixed_acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile_acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric_acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual_sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free_sulfur_dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total_sulfur_dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, "spec")=
## .. cols(
## .. fixed_acidity = col_double(),
## .. volatile_acidity = col_double(),
## .. citric_acid = col_double(),
## .. residual_sugar = col_double(),
## .. chlorides = col_double(),
## .. free_sulfur_dioxide = col_double(),
## .. total_sulfur_dioxide = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double()
## .. )
datosRed
## # A tibble: 1,599 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## 7 7.9 0.6 0.06 1.6 0.069
## 8 7.3 0.65 0 1.2 0.065
## 9 7.8 0.580 0.02 2 0.073
## 10 7.5 0.5 0.36 6.1 0.071
## # … with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
set.seed(2020) # Seed
entrena <- createDataPartition(datosRed$quality, p=0.7, list = FALSE)
head(entrena)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
## [6,] 6
# number of records in entrena
nrow(entrena)
## [1] 1120
# records that are not in entrena
head(datosRed[-entrena,])
## # A tibble: 6 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.9 0.6 0.06 1.6 0.069
## 2 7.3 0.65 0 1.2 0.065
## 3 7.5 0.5 0.36 6.1 0.071
## 4 8.5 0.28 0.56 1.8 0.092
## 5 8.1 0.56 0.28 1.7 0.368
## 6 7.9 0.43 0.21 1.6 0.106
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
nrow(datosRed[-entrena,])
## [1] 479
head(datosRed)
## # A tibble: 6 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
# Now build the training data set, then inspect it with head()
datos.Entrena <- datosRed[entrena,]
head(datos.Entrena)
## # A tibble: 6 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.4 0.7 0 1.9 0.076
## 2 7.8 0.88 0 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.7 0 1.9 0.076
## 6 7.4 0.66 0 1.8 0.075
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
summary(datos.Entrena)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.70 Min. :0.1200 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.4000 1st Qu.:0.0975 1st Qu.: 1.900
## Median : 7.90 Median :0.5300 Median :0.2500 Median : 2.200
## Mean : 8.34 Mean :0.5326 Mean :0.2691 Mean : 2.554
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.4300 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :0.7900 Max. :15.500
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.03400 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07100 1st Qu.: 8.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.08000 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08693 Mean :16.13 Mean : 46.82 Mean :0.9968
## 3rd Qu.:0.09025 3rd Qu.:22.00 3rd Qu.: 62.00 3rd Qu.:0.9979
## Max. :0.46700 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.860 Min. :0.3700 Min. : 8.4 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.5 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.1 Median :6.000
## Mean :3.311 Mean :0.6588 Mean :10.4 Mean :5.635
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.1 3rd Qu.:6.000
## Max. :4.010 Max. :1.9800 Max. :14.9 Max. :8.000
# and the validation data set, then inspect it with head()
datos.Valida <- datosRed[-entrena,]
head(datos.Valida)
## # A tibble: 6 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.9 0.6 0.06 1.6 0.069
## 2 7.3 0.65 0 1.2 0.065
## 3 7.5 0.5 0.36 6.1 0.071
## 4 8.5 0.28 0.56 1.8 0.092
## 5 8.1 0.56 0.28 1.7 0.368
## 6 7.9 0.43 0.21 1.6 0.106
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
summary(datos.Valida)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :1.200
## 1st Qu.: 7.100 1st Qu.:0.3900 1st Qu.:0.0900 1st Qu.:1.900
## Median : 7.900 Median :0.5000 Median :0.2800 Median :2.200
## Mean : 8.272 Mean :0.5166 Mean :0.2753 Mean :2.503
## 3rd Qu.: 9.200 3rd Qu.:0.6300 3rd Qu.:0.4200 3rd Qu.:2.600
## Max. :15.600 Max. :1.1850 Max. :1.0000 Max. :9.000
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9902
## 1st Qu.:0.06800 1st Qu.: 7.00 1st Qu.: 21.50 1st Qu.:0.9956
## Median :0.07800 Median :12.00 Median : 36.00 Median :0.9967
## Mean :0.08872 Mean :15.29 Mean : 45.64 Mean :0.9966
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 63.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :68.00 Max. :155.00 Max. :1.0031
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 9.00 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.30 Median :6.000
## Mean :3.312 Mean :0.6566 Mean :10.47 Mean :5.639
## 3rd Qu.:3.410 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :3.900 Max. :2.0000 Max. :14.00 Max. :8.000
modelo <- lm(quality ~ ., datos.Entrena)
modelo
##
## Call:
## lm(formula = quality ~ ., data = datos.Entrena)
##
## Coefficients:
## (Intercept) fixed_acidity volatile_acidity
## 20.747534 0.008274 -0.922244
## citric_acid residual_sugar chlorides
## -0.019610 0.011741 -1.804597
## free_sulfur_dioxide total_sulfur_dioxide density
## 0.006650 -0.003765 -16.312818
## pH sulphates alcohol
## -0.573518 0.928338 0.294048
summary(modelo)
##
## Call:
## lm(formula = quality ~ ., data = datos.Entrena)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.70643 -0.36046 -0.04914 0.45944 1.98343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.075e+01 2.562e+01 0.810 0.418266
## fixed_acidity 8.274e-03 3.111e-02 0.266 0.790320
## volatile_acidity -9.222e-01 1.446e-01 -6.378 2.63e-10 ***
## citric_acid -1.961e-02 1.793e-01 -0.109 0.912913
## residual_sugar 1.174e-02 1.716e-02 0.684 0.493948
## chlorides -1.805e+00 5.417e-01 -3.331 0.000893 ***
## free_sulfur_dioxide 6.650e-03 2.652e-03 2.508 0.012300 *
## total_sulfur_dioxide -3.765e-03 8.681e-04 -4.337 1.57e-05 ***
## density -1.631e+01 2.613e+01 -0.624 0.532544
## pH -5.735e-01 2.276e-01 -2.520 0.011878 *
## sulphates 9.283e-01 1.360e-01 6.824 1.46e-11 ***
## alcohol 2.940e-01 3.192e-02 9.212 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6611 on 1108 degrees of freedom
## Multiple R-squared: 0.3558, Adjusted R-squared: 0.3494
## F-statistic: 55.64 on 11 and 1108 DF, p-value: < 2.2e-16
# Prediction
predecir <- predict(modelo, newdata = datos.Valida )
datos.Valida[1,]
## # A tibble: 1 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.9 0.6 0.06 1.6 0.069
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
datos.Valida[nrow(datos.Valida),]
## # A tibble: 1 x 12
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 0.31 0.47 3.6 0.067
## # … with 7 more variables: free_sulfur_dioxide <dbl>,
## # total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## # alcohol <dbl>, quality <dbl>
predecir[1]
## 1
## 5.074639
predecir[nrow(datos.Valida)]
## 479
## 6.048752
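Looking at the first and last predictions gives only a spot check. One common way to summarize the error over the whole validation set is the root mean squared error (RMSE). The sketch below defines a small helper, `rmse`, which is an illustrative name and not part of the original script, and runs it on toy values rather than on the wine data:

```r
# Root mean squared error between observed and predicted values
rmse <- function(observado, predicho) {
  sqrt(mean((observado - predicho)^2))
}

# Toy values for illustration only (not taken from the wine data)
observado <- c(5, 6, 7, 5)
predicho  <- c(5.5, 6.0, 6.0, 5.5)
rmse(observado, predicho)

# With the objects defined above, the call would be:
# rmse(datos.Valida$quality, predecir)
```

Because `quality` is scored on an integer scale, an RMSE expressed in the same units is fairly easy to interpret.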
Using the model fitted on the training data, make predictions on the validation data set. Add a new record with arbitrary values supplied by the analyst and find its prediction.
Interpret the results:
What were the predictions for the training set?
What was the prediction for the new record?
What are the confidence ranges for the prediction?
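The question about confidence ranges can be approached with the `interval` argument of `predict()` for `lm` objects. The sketch below uses simulated data (`ejemplo`, `modeloEjemplo` and `nuevo` are hypothetical names, and the coefficients used to simulate are arbitrary) so that it runs on its own; it is an illustration, not the result for the wine model:

```r
set.seed(2020)

# Simulated stand-in for the training data (not the real wine data)
n <- 200
ejemplo <- data.frame(
  alcohol   = runif(n, 8, 15),
  sulphates = runif(n, 0.3, 2)
)
ejemplo$quality <- 2 + 0.3 * ejemplo$alcohol + 0.9 * ejemplo$sulphates +
  rnorm(n, sd = 0.6)

modeloEjemplo <- lm(quality ~ ., data = ejemplo)

# One new observation to predict
nuevo <- data.frame(alcohol = 12, sulphates = 0.6)

# Confidence interval for the mean response at these predictor values
ic <- predict(modeloEjemplo, newdata = nuevo,
              interval = "confidence", level = 0.95)

# Prediction interval for a single new wine (wider: includes residual noise)
ip <- predict(modeloEjemplo, newdata = nuevo,
              interval = "prediction", level = 0.95)

ic
ip
```

Both calls return a matrix with columns `fit`, `lwr` and `upr`. For the model in this document the equivalent call would be `predict(modelo, newdata = nuevo.datos, interval = "prediction")`.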
Apply the regression tree technique to the same training data set. Generate predictions for the validation data with the regression tree. Predict the new record with the regression tree. Compare the techniques and their predictions.
fixed_acidity=7
volatile_acidity=.600
citric_acid=.50
residual_sugar=2.0
chlorides=.090
free_sulfur_dioxide=20
total_sulfur_dioxide=100
density=.9950
pH=3.15
sulphates=.60
alcohol=15
quality=7
nuevo.datos <- data.frame(fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide,density, pH, sulphates, alcohol, quality)
predecir <- predict(modelo, newdata = nuevo.datos )
nuevo.datos[1,]
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 7 0.6 0.5 2 0.09
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 1 20 100 0.995 3.15 0.6 15
## quality
## 1 7
predecir[1]
## 1
## 6.789744
predecir[nrow(nuevo.datos)]
## 1
## 6.789744
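The regression tree step listed in the tasks is not carried out above. A minimal sketch of how it might look with `rpart` follows, using simulated data so that it runs on its own; `ejemplo`, `entrenaEj`, `validaEj`, `arbol` and `modeloLm` are hypothetical names, and the numbers it produces say nothing about the real wine data:

```r
library(rpart)

set.seed(2020)

# Simulated stand-in for the wine data (not the real measurements)
n <- 300
ejemplo <- data.frame(
  alcohol          = runif(n, 8, 15),
  volatile_acidity = runif(n, 0.1, 1.6),
  sulphates        = runif(n, 0.3, 2)
)
ejemplo$quality <- 3 + 0.3 * ejemplo$alcohol - 1.0 * ejemplo$volatile_acidity +
  rnorm(n, sd = 0.6)

# Split rows 70/30 into training and validation sets
idx <- sample(seq_len(n), size = round(0.7 * n))
entrenaEj <- ejemplo[idx, ]
validaEj  <- ejemplo[-idx, ]

# method = "anova" requests a regression tree
arbol    <- rpart(quality ~ ., data = entrenaEj, method = "anova")
modeloLm <- lm(quality ~ ., data = entrenaEj)

# Predictions of both models on the validation rows
predArbol <- predict(arbol, newdata = validaEj)
predLm    <- predict(modeloLm, newdata = validaEj)

# Compare both techniques with the same error measure
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(arbol = rmse(validaEj$quality, predArbol),
  lm    = rmse(validaEj$quality, predLm))
```

With the objects from this document the tree would be fitted as `rpart(quality ~ ., data = datos.Entrena, method = "anova")`, predicted on `datos.Valida` and `nuevo.datos`, and could be drawn with `rpart.plot()` from the `rpart.plot` package loaded at the top.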