Comparar modelos de supervisados a través de la aplicación de algoritmos de predicción de precios de automóviles determinando el estadístico del error cuadrático medio (rmse).
Se cargan los datos previamente preparados de la dirección https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment_Numericas_Preparado.csv
Se crean datos de entrenamiento al 80%
Se crean datos de validación al 20%
Se crea el modelo regresión múltiple con datos de entrenamiento
Con este modelo se responde a preguntas tales como:
¿cuáles son variables que están por encima del 90% de confianza como predictores?,
¿Cuál es el valor de R Square Adjusted o que tanto representan las variables dependientes al precio del vehículo?
Se generan predicciones con datos de validación
Se determina el estadístico RMSE para efectos de comparación
Se crea el modelo árboles de regresión con los datos de entrenamiento
Se identifica la importancia de las variables sobre el precio
Se visualiza el árbol de regresión y sus reglas de asociación
Se hacen predicciones con datos de validación
Se determinar el estadístico RMSE para efectos de comparación
Se construye el modelo bosques aleatorios con datos de entrenamiento y con 20 árboles simulados
Se identifica la importancia de las variables sobre el precio
Se generan predicciones con datos de validación
Se determina el estadístico RMSE para efectos de comparación
Al final del caso, se describe una interpretación personal
# Librerías
library(readr)
library(PerformanceAnalytics) # Para correlaciones gráficas
library(dplyr)
library(knitr) # Para datos tabulares
library(kableExtra) # Para datos tabulares amigables
library(ggplot2) # Para visualizar
library(plotly) # Para visualizar
library(caret) # Para particionar
library(Metrics) # Para determinar rmse
library(rpart) # Para árbol
library(rpart.plot) # Para árbol
library(randomForest) # Para random forest
library(caret) # Para hacer divisiones o particiones
library(reshape) # Para renombrar columnas
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment_Numericas_Preparado.csv")
str(datos)
## 'data.frame': 205 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ wheelbase : num 88.6 88.6 94.5 99.8 99.4 ...
## $ carlength : num 169 169 171 177 177 ...
## $ carwidth : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ carheight : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curbweight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ enginesize : int 130 130 152 109 136 136 136 136 131 131 ...
## $ boreratio : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compressionratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peakrpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ citympg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highwaympg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
| Col | Nombre | Descripción |
|---|---|---|
| 1 | Symboling | Its assigned insurance risk rating, A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.(Categorical) |
| 2 | wheelbase | Weelbase of car (Numeric). Distancia de ejes en pulgadas. |
| 3 | carlength | Length of car (Numeric). Longitud |
| 4 | carwidth | Width of car (Numeric). Amplitud |
| 5 | carheight | height of car (Numeric). Altura |
| 6 | curbweight | The weight of a car without occupants or baggage. (Numeric). Peso del auto |
| 7 | enginesize | Size of car (Numeric). Tamaño del carro en … |
| 8 | boreratio | Boreratio of car (Numeric). Eficiencia de motor |
| 9 | stroke | Stroke or volume inside the engine (Numeric). Pistones, tiempos, combustión |
| 10 | compressionratio | compression ratio of car (Numeric). Comprensión o medición de presión en motor |
| 11 | horsepower | Horsepower (Numeric). Poder del carro |
| 12 | peakrpm | car peak rpm (Numeric). Picos de revoluciones por minuto |
| 13 | citympg | Mileage in city (Numeric). Consumo de gasolina |
| 14 | highwaympg | Mileage on highway (Numeric). Consumo de gasolina |
| 16 | price (Dependent variable) |
Price of car (Numeric). Precio del carro en dólares |
kable(head(datos, 10), caption = "Datos de precios de carros") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| X | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.00 |
| 2 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 1 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 2 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 5 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.00 |
| 6 | 2 | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 8 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 9 | 1 | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 10 | 0 | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
Datos de entrenamiento al 80% de los datos y 20% los datos de validación.
n <- nrow(datos)
set.seed(1280) # Semilla
entrena <- createDataPartition(y = datos$price, p = 0.80, list = FALSE, times = 1)
# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ] # [renglones, columna]
# Datos validación
datos.validacion <- datos[-entrena, ]
kable(head(datos.entrenamiento, 10), caption = "Datos de Entrenamient. Precios de carros") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| X | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 3 | 1 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 4 | 2 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 6 | 6 | 2 | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 7 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 8 | 8 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 9 | 9 | 1 | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 10 | 10 | 0 | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
| 11 | 11 | 2 | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | 108 | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16430.00 |
| 12 | 12 | 0 | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | 108 | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16925.00 |
kable(head(datos.validacion, 10), caption = "Datos de Validación. Precios de carros") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| X | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 5 | 5 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 17 | 17 | 0 | 103.5 | 193.8 | 67.9 | 53.7 | 3380 | 209 | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 |
| 29 | 29 | -1 | 103.3 | 174.6 | 64.6 | 59.8 | 2535 | 122 | 3.34 | 3.46 | 8.5 | 88 | 5000 | 24 | 30 | 8921 |
| 32 | 32 | 2 | 86.6 | 144.6 | 63.9 | 50.8 | 1819 | 92 | 2.91 | 3.41 | 9.2 | 76 | 6000 | 31 | 38 | 6855 |
| 42 | 42 | 0 | 96.5 | 175.4 | 65.2 | 54.1 | 2465 | 110 | 3.15 | 3.58 | 9.0 | 101 | 5800 | 24 | 28 | 12945 |
| 44 | 44 | 0 | 94.3 | 170.7 | 61.8 | 53.5 | 2337 | 111 | 3.31 | 3.23 | 8.5 | 78 | 4800 | 24 | 29 | 6785 |
| 49 | 49 | 0 | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | 258 | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 |
| 52 | 52 | 1 | 93.1 | 159.1 | 64.2 | 54.1 | 1900 | 91 | 3.03 | 3.15 | 9.0 | 68 | 5000 | 31 | 38 | 6095 |
| 62 | 62 | 1 | 98.8 | 177.8 | 66.5 | 53.7 | 2385 | 122 | 3.39 | 3.39 | 8.6 | 84 | 4800 | 26 | 32 | 10595 |
Se construye el modelo de regresión lineal múltiple (rm)
# Modelo de regresión lineal múltiple para observar variables de importancia
modelo_rm <- lm(formula = price ~ symboling + wheelbase + carlength + carwidth + carheight + curbweight + enginesize + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg ,
data = datos.entrenamiento)
summary(modelo_rm)
##
## Call:
## lm(formula = price ~ symboling + wheelbase + carlength + carwidth +
## carheight + curbweight + enginesize + boreratio + stroke +
## compressionratio + horsepower + peakrpm + citympg + highwaympg,
## data = datos.entrenamiento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8090.7 -1651.4 -249.5 1572.2 10886.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.898e+04 1.642e+04 -3.593 0.000442 ***
## symboling 6.244e+02 2.730e+02 2.287 0.023604 *
## wheelbase 2.557e+02 1.171e+02 2.183 0.030600 *
## carlength -1.224e+02 6.350e+01 -1.927 0.055864 .
## carwidth 6.020e+02 2.621e+02 2.296 0.023041 *
## carheight 2.406e+02 1.524e+02 1.579 0.116549
## curbweight 1.735e+00 1.846e+00 0.940 0.348961
## enginesize 1.060e+02 1.501e+01 7.059 5.79e-11 ***
## boreratio -1.433e+03 1.349e+03 -1.062 0.289820
## stroke -3.813e+03 8.863e+02 -4.302 3.03e-05 ***
## compressionratio 2.910e+02 8.792e+01 3.309 0.001171 **
## horsepower 3.174e+01 1.754e+01 1.809 0.072377 .
## peakrpm 1.911e+00 7.555e-01 2.529 0.012476 *
## citympg -3.381e+02 1.833e+02 -1.845 0.067036 .
## highwaympg 2.254e+02 1.623e+02 1.389 0.166915
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3072 on 150 degrees of freedom
## Multiple R-squared: 0.8499, Adjusted R-squared: 0.8359
## F-statistic: 60.69 on 14 and 150 DF, p-value: < 2.2e-16
¿cuáles son variables que están por encima del 90% de confianza como predictores?
El coeficiente de intersección tiene un nivel de confianza del 99%.
Las variables wheelbase, carwidth, horsepower y citympg tienen un nivel de confianza del 95% (.)
La variable symboling tiene un nivel de confianza por encima del 95%. Por otro lado, la variable carlenght si posee un valor del 95%.
Las variables stroke, compressionratio y peakrpm tienen un nivel de confianza como predictores del 99% (**)
La variable enginesize tiene un nivel de confianza como predictor del 99.9% (***)
¿Cuál es el valor de R Square Adjusted o que tanto representan las variables dependientes al precio del vehículo?
En modelos lineales múltiples el estadístico Adjusted R-squared: 0.8359 significa que las variables independientes explican aproximadamente el 83.59% de la variable dependiente precio.
predicciones_rm <- predict(object = modelo_rm, newdata = datos.validacion)
predicciones_rm
## 1 5 17 29 32 42 44 49
## 12901.228 16240.823 25427.601 11528.568 7771.359 9363.523 5613.568 29606.046
## 52 62 68 69 74 86 92 94
## 6640.663 10140.805 24218.544 25155.599 40090.991 9568.592 6361.486 5705.138
## 98 109 110 113 122 128 134 135
## 5727.689 18531.625 16435.337 18627.034 5656.783 26064.891 14457.253 20348.617
## 137 141 148 149 155 157 158 167
## 17480.503 10209.006 11230.640 10669.640 6864.923 6733.793 6734.237 10976.094
## 169 174 177 178 180 184 188 202
## 13408.654 8383.966 8762.033 8597.723 22640.503 10693.576 10393.880 19704.251
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rm)
kable(head(comparaciones, 10), caption = "Regresión Lineal Múltiple. Comparación precios reales VS predicción de precios. 10 primeras predicciones") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| precio_real | precio_predicciones | |
|---|---|---|
| 1 | 13495 | 12901.228 |
| 5 | 17450 | 16240.823 |
| 17 | 41315 | 25427.601 |
| 29 | 8921 | 11528.568 |
| 32 | 6855 | 7771.359 |
| 42 | 12945 | 9363.523 |
| 44 | 6785 | 5613.568 |
| 49 | 35550 | 29606.046 |
| 52 | 6095 | 6640.663 |
| 62 | 10595 | 10140.805 |
rmse_rm <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rm
## [1] 3812.759
Se construye el modelo de árbol de regresión (ar)
modelo_ar <- rpart(formula = price ~ symboling + wheelbase + carlength + carwidth + carheight + curbweight + enginesize + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg ,
data = datos.entrenamiento )
modelo_ar
## n= 165
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 165 9434738000 13083.16
## 2) enginesize< 182 153 3410799000 11421.30
## 4) curbweight< 2659.5 101 666542000 8745.00
## 8) curbweight< 2291.5 62 97558280 7289.29 *
## 9) curbweight>=2291.5 39 228733800 11059.21
## 18) highwaympg>=29.5 27 55460610 10003.26 *
## 19) highwaympg< 29.5 12 75430060 13435.08 *
## 5) curbweight>=2659.5 52 615727700 16619.50
## 10) carwidth< 68.65 45 445665800 16006.31 *
## 11) carwidth>=68.65 7 44370490 20561.43 *
## 3) enginesize>=182 12 213848400 34271.88 *
summary(modelo_ar)
## Call:
## rpart(formula = price ~ symboling + wheelbase + carlength + carwidth +
## carheight + curbweight + enginesize + boreratio + stroke +
## compressionratio + horsepower + peakrpm + citympg + highwaympg,
## data = datos.entrenamiento)
## n= 165
##
## CP nsplit rel error xerror xstd
## 1 0.61581898 0 1.00000000 1.0188984 0.17837555
## 2 0.22560554 1 0.38418102 0.5040613 0.08670527
## 3 0.03606352 2 0.15857548 0.2106205 0.02892480
## 4 0.01332220 3 0.12251196 0.1768833 0.02578643
## 5 0.01037052 4 0.10918976 0.1831988 0.02652567
## 6 0.01000000 5 0.09881924 0.1806921 0.02679749
##
## Variable importance
## enginesize curbweight horsepower carwidth highwaympg citympg carlength
## 23 20 15 14 13 8 5
## wheelbase
## 1
##
## Node number 1: 165 observations, complexity param=0.615819
## mean=13083.16, MSE=5.718023e+07
## left son=2 (153 obs) right son=3 (12 obs)
## Primary splits:
## enginesize < 182 to the left, improve=0.6158190, (0 missing)
## curbweight < 2665.5 to the left, improve=0.5219724, (0 missing)
## highwaympg < 28.5 to the right, improve=0.4996421, (0 missing)
## horsepower < 118 to the left, improve=0.4972609, (0 missing)
## citympg < 22.5 to the right, improve=0.4883601, (0 missing)
## Surrogate splits:
## curbweight < 3490 to the left, agree=0.976, adj=0.667, (0 split)
## horsepower < 175.5 to the left, agree=0.970, adj=0.583, (0 split)
## carwidth < 69.25 to the left, agree=0.964, adj=0.500, (0 split)
## citympg < 16.5 to the right, agree=0.958, adj=0.417, (0 split)
## highwaympg < 19.5 to the right, agree=0.958, adj=0.417, (0 split)
##
## Node number 2: 153 observations, complexity param=0.2256055
## mean=11421.3, MSE=2.22928e+07
## left son=4 (101 obs) right son=5 (52 obs)
## Primary splits:
## curbweight < 2659.5 to the left, improve=0.6240559, (0 missing)
## highwaympg < 29.5 to the right, improve=0.6006180, (0 missing)
## enginesize < 126 to the left, improve=0.5347020, (0 missing)
## horsepower < 94.5 to the left, improve=0.5289197, (0 missing)
## wheelbase < 98.95 to the left, improve=0.5245366, (0 missing)
## Surrogate splits:
## enginesize < 126 to the left, agree=0.908, adj=0.731, (0 split)
## highwaympg < 28.5 to the right, agree=0.895, adj=0.692, (0 split)
## horsepower < 113 to the left, agree=0.863, adj=0.596, (0 split)
## carlength < 178 to the left, agree=0.856, adj=0.577, (0 split)
## carwidth < 66.05 to the left, agree=0.856, adj=0.577, (0 split)
##
## Node number 3: 12 observations
## mean=34271.88, MSE=1.78207e+07
##
## Node number 4: 101 observations, complexity param=0.03606352
## mean=8745, MSE=6599426
## left son=8 (62 obs) right son=9 (39 obs)
## Primary splits:
## curbweight < 2291.5 to the left, improve=0.5104703, (0 missing)
## highwaympg < 29.5 to the right, improve=0.4359229, (0 missing)
## carlength < 168.75 to the left, improve=0.4018012, (0 missing)
## horsepower < 83 to the left, improve=0.3947234, (0 missing)
## citympg < 23.5 to the right, improve=0.3851430, (0 missing)
## Surrogate splits:
## carlength < 168.75 to the left, agree=0.921, adj=0.795, (0 split)
## carwidth < 64.5 to the left, agree=0.891, adj=0.718, (0 split)
## horsepower < 83 to the left, agree=0.851, adj=0.615, (0 split)
## citympg < 27.5 to the right, agree=0.851, adj=0.615, (0 split)
## wheelbase < 95.9 to the left, agree=0.842, adj=0.590, (0 split)
##
## Node number 5: 52 observations, complexity param=0.0133222
## mean=16619.5, MSE=1.184092e+07
## left son=10 (45 obs) right son=11 (7 obs)
## Primary splits:
## carwidth < 68.65 to the left, improve=0.2041348, (0 missing)
## wheelbase < 100.8 to the left, improve=0.1768359, (0 missing)
## carlength < 188.3 to the left, improve=0.1262024, (0 missing)
## peakrpm < 5450 to the left, improve=0.1090728, (0 missing)
## highwaympg < 28.5 to the right, improve=0.1051026, (0 missing)
## Surrogate splits:
## wheelbase < 108.55 to the left, agree=0.885, adj=0.143, (0 split)
##
## Node number 8: 62 observations
## mean=7289.29, MSE=1573521
##
## Node number 9: 39 observations, complexity param=0.01037052
## mean=11059.21, MSE=5864970
## left son=18 (27 obs) right son=19 (12 obs)
## Primary splits:
## highwaympg < 29.5 to the right, improve=0.4277599, (0 missing)
## stroke < 3.43 to the right, improve=0.3433487, (0 missing)
## horsepower < 100.5 to the left, improve=0.2823519, (0 missing)
## enginesize < 109.5 to the right, improve=0.2539563, (0 missing)
## citympg < 22 to the right, improve=0.2539524, (0 missing)
## Surrogate splits:
## stroke < 3.3025 to the right, agree=0.949, adj=0.833, (0 split)
## enginesize < 108.5 to the right, agree=0.872, adj=0.583, (0 split)
## citympg < 22 to the right, agree=0.872, adj=0.583, (0 split)
## wheelbase < 95.7 to the right, agree=0.821, adj=0.417, (0 split)
## carlength < 169.05 to the right, agree=0.821, adj=0.417, (0 split)
##
## Node number 10: 45 observations
## mean=16006.31, MSE=9903685
##
## Node number 11: 7 observations
## mean=20561.43, MSE=6338641
##
## Node number 18: 27 observations
## mean=10003.26, MSE=2054097
##
## Node number 19: 12 observations
## mean=13435.08, MSE=6285838
rpart.plot(modelo_ar)
predicciones_ar <- predict(object = modelo_ar, newdata = datos.validacion)
predicciones_ar
## 1 5 17 29 32 42 44 49
## 13435.08 16006.31 34271.88 10003.26 7289.29 13435.08 13435.08 34271.88
## 52 62 68 69 74 86 92 94
## 7289.29 10003.26 34271.88 34271.88 34271.88 10003.26 7289.29 7289.29
## 98 109 110 113 122 128 134 135
## 7289.29 16006.31 16006.31 16006.31 7289.29 34271.88 16006.31 16006.31
## 137 141 148 149 155 157 158 167
## 16006.31 7289.29 10003.26 13435.08 7289.29 7289.29 7289.29 13435.08
## 169 174 177 178 180 184 188 202
## 10003.26 10003.26 10003.26 10003.26 16006.31 7289.29 10003.26 20561.43
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_ar)
kable(head(comparaciones, 10), caption = "Arbol de regresión. Comparación precios reales VS predicción de precios. 10 primeras predicciones") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| precio_real | precio_predicciones | |
|---|---|---|
| 1 | 13495 | 13435.08 |
| 5 | 17450 | 16006.31 |
| 17 | 41315 | 34271.88 |
| 29 | 8921 | 10003.26 |
| 32 | 6855 | 7289.29 |
| 42 | 12945 | 13435.08 |
| 44 | 6785 | 13435.08 |
| 49 | 35550 | 34271.88 |
| 52 | 6095 | 7289.29 |
| 62 | 10595 | 10003.26 |
rmse_ar <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_ar
## [1] 3007.861
Se construye el modelo de árbol de regresión (ar)
modelo_rf <- randomForest(x = datos.entrenamiento[,c("symboling", "wheelbase",
"carlength", "carwidth", "carheight", "curbweight",
"enginesize", "boreratio", "stroke",
"compressionratio", "horsepower", "peakrpm",
"citympg", "highwaympg" )],
y = datos.entrenamiento[,'price'],
importance = TRUE,
keep.forest = TRUE,
ntree=20)
modelo_rf
##
## Call:
## randomForest(x = datos.entrenamiento[, c("symboling", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginesize", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")], y = datos.entrenamiento[, "price"], ntree = 20, importance = TRUE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 20
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 5410968
## % Var explained: 90.54
as.data.frame(modelo_rf$importance) %>%
arrange(desc(IncNodePurity))
## %IncMSE IncNodePurity
## curbweight 14477866.4 1820539066
## citympg 11828643.5 1805852182
## carwidth 6040519.6 1593727280
## horsepower 9179697.2 1521396316
## enginesize 13773942.5 924356681
## carlength 2040652.1 437971411
## wheelbase 3869201.3 277556990
## highwaympg 2264907.8 225604520
## boreratio 635980.9 165964315
## compressionratio 579149.5 133257452
## peakrpm 319454.6 108361784
## stroke 1818618.1 75943968
## carheight 448648.6 48620485
## symboling -151969.0 16603118
predicciones_rf <- predict(object = modelo_rf, newdata = datos.validacion)
predicciones_rf
## 1 5 17 29 32 42 44 49
## 15385.863 15047.423 32107.879 10076.485 6756.509 10483.219 9850.090 33882.302
## 52 62 68 69 74 86 92 94
## 6116.381 9729.441 29050.739 29460.322 37213.357 8698.035 6875.108 7518.999
## 98 109 110 113 122 128 134 135
## 7518.999 17045.555 17152.983 17045.555 7275.078 32507.840 13813.106 15214.170
## 137 141 148 149 155 157 158 167
## 17930.594 7158.610 10011.651 10366.817 8047.132 7704.135 7704.135 10605.692
## 169 174 177 178 180 184 188 202
## 10005.901 10250.160 11157.527 11101.078 17235.312 8312.611 8399.892 20138.345
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rf)
kable(head(comparaciones, 10), caption = "Random Forest. Comparación precios reales VS predicción de precios. 10 primeras predicciones") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| precio_real | precio_predicciones | |
|---|---|---|
| 1 | 13495 | 15385.863 |
| 5 | 17450 | 15047.423 |
| 17 | 41315 | 32107.879 |
| 29 | 8921 | 10076.485 |
| 32 | 6855 | 6756.509 |
| 42 | 12945 | 10483.219 |
| 44 | 6785 | 9850.090 |
| 49 | 35550 | 33882.302 |
| 52 | 6095 | 6116.381 |
| 62 | 10595 | 9729.441 |
rmse_rf <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rf
## [1] 2245.088
Se comparan las predicciones
comparaciones <- data.frame(cbind(datos.validacion[,-1], predicciones_rm, predicciones_ar, predicciones_rf))
Se visualizan las predicciones de cada modelo
kable(comparaciones, caption = "Predicciones de los modelos") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price | predicciones_rm | predicciones_ar | predicciones_rf | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 | 12901.228 | 13435.08 | 15385.863 |
| 5 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 | 16240.823 | 16006.31 | 15047.423 |
| 17 | 0 | 103.5 | 193.8 | 67.9 | 53.7 | 3380 | 209 | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 | 25427.601 | 34271.88 | 32107.879 |
| 29 | -1 | 103.3 | 174.6 | 64.6 | 59.8 | 2535 | 122 | 3.34 | 3.46 | 8.5 | 88 | 5000 | 24 | 30 | 8921 | 11528.568 | 10003.26 | 10076.485 |
| 32 | 2 | 86.6 | 144.6 | 63.9 | 50.8 | 1819 | 92 | 2.91 | 3.41 | 9.2 | 76 | 6000 | 31 | 38 | 6855 | 7771.359 | 7289.29 | 6756.509 |
| 42 | 0 | 96.5 | 175.4 | 65.2 | 54.1 | 2465 | 110 | 3.15 | 3.58 | 9.0 | 101 | 5800 | 24 | 28 | 12945 | 9363.523 | 13435.08 | 10483.219 |
| 44 | 0 | 94.3 | 170.7 | 61.8 | 53.5 | 2337 | 111 | 3.31 | 3.23 | 8.5 | 78 | 4800 | 24 | 29 | 6785 | 5613.568 | 13435.08 | 9850.090 |
| 49 | 0 | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | 258 | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 | 29606.046 | 34271.88 | 33882.302 |
| 52 | 1 | 93.1 | 159.1 | 64.2 | 54.1 | 1900 | 91 | 3.03 | 3.15 | 9.0 | 68 | 5000 | 31 | 38 | 6095 | 6640.663 | 7289.29 | 6116.381 |
| 62 | 1 | 98.8 | 177.8 | 66.5 | 53.7 | 2385 | 122 | 3.39 | 3.39 | 8.6 | 84 | 4800 | 26 | 32 | 10595 | 10140.805 | 10003.26 | 9729.441 |
| 68 | -1 | 110.0 | 190.9 | 70.3 | 56.5 | 3515 | 183 | 3.58 | 3.64 | 21.5 | 123 | 4350 | 22 | 25 | 25552 | 24218.544 | 34271.88 | 29050.739 |
| 69 | -1 | 110.0 | 190.9 | 70.3 | 58.7 | 3750 | 183 | 3.58 | 3.64 | 21.5 | 123 | 4350 | 22 | 25 | 28248 | 25155.599 | 34271.88 | 29460.322 |
| 74 | 0 | 120.9 | 208.1 | 71.7 | 56.7 | 3900 | 308 | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 40960 | 40090.991 | 34271.88 | 37213.357 |
| 86 | 1 | 96.3 | 172.4 | 65.4 | 51.6 | 2365 | 122 | 3.35 | 3.46 | 8.5 | 88 | 5000 | 25 | 32 | 6989 | 9568.592 | 10003.26 | 8698.035 |
| 92 | 1 | 94.5 | 165.3 | 63.8 | 54.5 | 1918 | 97 | 3.15 | 3.29 | 9.4 | 69 | 5200 | 31 | 37 | 6649 | 6361.486 | 7289.29 | 6875.108 |
| 94 | 1 | 94.5 | 170.2 | 63.8 | 53.5 | 2024 | 97 | 3.15 | 3.29 | 9.4 | 69 | 5200 | 31 | 37 | 7349 | 5705.138 | 7289.29 | 7518.999 |
| 98 | 1 | 94.5 | 170.2 | 63.8 | 53.5 | 2037 | 97 | 3.15 | 3.29 | 9.4 | 69 | 5200 | 31 | 37 | 7999 | 5727.689 | 7289.29 | 7518.999 |
| 109 | 0 | 107.9 | 186.7 | 68.4 | 56.7 | 3197 | 152 | 3.70 | 3.52 | 21.0 | 95 | 4150 | 28 | 33 | 13200 | 18531.625 | 16006.31 | 17045.555 |
| 110 | 0 | 114.2 | 198.9 | 68.4 | 58.7 | 3230 | 120 | 3.46 | 3.19 | 8.4 | 97 | 5000 | 19 | 24 | 12440 | 16435.337 | 16006.31 | 17152.983 |
| 113 | 0 | 107.9 | 186.7 | 68.4 | 56.7 | 3252 | 152 | 3.70 | 3.52 | 21.0 | 95 | 4150 | 28 | 33 | 16900 | 18627.034 | 16006.31 | 17045.555 |
| 122 | 1 | 93.7 | 167.3 | 63.8 | 50.8 | 1989 | 90 | 2.97 | 3.23 | 9.4 | 68 | 5500 | 31 | 38 | 6692 | 5656.783 | 7289.29 | 7275.078 |
| 128 | 3 | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | 194 | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 34028 | 26064.891 | 34271.88 | 32507.840 |
| 134 | 2 | 99.1 | 186.6 | 66.5 | 56.1 | 2695 | 121 | 3.54 | 3.07 | 9.3 | 110 | 5250 | 21 | 28 | 12170 | 14457.253 | 16006.31 | 13813.106 |
| 135 | 3 | 99.1 | 186.6 | 66.5 | 56.1 | 2707 | 121 | 2.54 | 2.07 | 9.3 | 110 | 5250 | 21 | 28 | 15040 | 20348.617 | 16006.31 | 15214.170 |
| 137 | 3 | 99.1 | 186.6 | 66.5 | 56.1 | 2808 | 121 | 3.54 | 3.07 | 9.0 | 160 | 5500 | 19 | 26 | 18150 | 17480.503 | 16006.31 | 17930.594 |
| 141 | 2 | 93.3 | 157.3 | 63.8 | 55.7 | 2240 | 108 | 3.62 | 2.64 | 8.7 | 73 | 4400 | 26 | 31 | 7603 | 10209.006 | 7289.29 | 7158.610 |
| 148 | 0 | 97.0 | 173.5 | 65.4 | 53.0 | 2455 | 108 | 3.62 | 2.64 | 9.0 | 94 | 5200 | 25 | 31 | 10198 | 11230.640 | 10003.26 | 10011.651 |
| 149 | 0 | 96.9 | 173.6 | 65.4 | 54.9 | 2420 | 108 | 3.62 | 2.64 | 9.0 | 82 | 4800 | 23 | 29 | 8013 | 10669.640 | 13435.08 | 10366.817 |
| 155 | 0 | 95.7 | 169.7 | 63.6 | 59.1 | 2290 | 92 | 3.05 | 3.03 | 9.0 | 62 | 4800 | 27 | 32 | 7898 | 6864.923 | 7289.29 | 8047.132 |
| 157 | 0 | 95.7 | 166.3 | 64.4 | 53.0 | 2081 | 98 | 3.19 | 3.03 | 9.0 | 70 | 4800 | 30 | 37 | 6938 | 6733.793 | 7289.29 | 7704.135 |
| 158 | 0 | 95.7 | 166.3 | 64.4 | 52.8 | 2109 | 98 | 3.19 | 3.03 | 9.0 | 70 | 4800 | 30 | 37 | 7198 | 6734.237 | 7289.29 | 7704.135 |
| 167 | 1 | 94.5 | 168.7 | 64.0 | 52.6 | 2300 | 98 | 3.24 | 3.08 | 9.4 | 112 | 6600 | 26 | 29 | 9538 | 10976.094 | 13435.08 | 10605.692 |
| 169 | 2 | 98.4 | 176.2 | 65.6 | 52.0 | 2536 | 146 | 3.62 | 3.50 | 9.3 | 116 | 4800 | 24 | 30 | 9639 | 13408.654 | 10003.26 | 10005.901 |
| 174 | -1 | 102.4 | 175.6 | 66.5 | 54.9 | 2326 | 122 | 3.31 | 3.54 | 8.7 | 92 | 4200 | 29 | 34 | 8948 | 8383.966 | 10003.26 | 10250.160 |
| 177 | -1 | 102.4 | 175.6 | 66.5 | 54.9 | 2414 | 122 | 3.31 | 3.54 | 8.7 | 92 | 4200 | 27 | 32 | 10898 | 8762.033 | 10003.26 | 11157.527 |
| 178 | -1 | 102.4 | 175.6 | 66.5 | 53.9 | 2458 | 122 | 3.31 | 3.54 | 8.7 | 92 | 4200 | 27 | 32 | 11248 | 8597.723 | 10003.26 | 11101.078 |
| 180 | 3 | 102.9 | 183.5 | 67.7 | 52.0 | 3016 | 171 | 3.27 | 3.35 | 9.3 | 161 | 5200 | 19 | 24 | 15998 | 22640.503 | 16006.31 | 17235.312 |
| 184 | 2 | 97.3 | 171.7 | 65.5 | 55.7 | 2209 | 109 | 3.19 | 3.40 | 9.0 | 85 | 5250 | 27 | 34 | 7975 | 10693.576 | 7289.29 | 8312.611 |
| 188 | 2 | 97.3 | 171.7 | 65.5 | 55.7 | 2319 | 97 | 3.01 | 3.40 | 23.0 | 68 | 4500 | 37 | 42 | 9495 | 10393.880 | 10003.26 | 8399.892 |
| 202 | -1 | 109.1 | 188.8 | 68.8 | 55.5 | 3049 | 141 | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 | 19704.251 | 20561.43 | 20138.345 |
Se compara el RMSE
rmse <- data.frame(rm = rmse_rm, ar = rmse_ar, rf = rmse_rf)
kable(rmse, caption = "Estadístico RMSE de cada modelo") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| rm | ar | rf |
|---|---|---|
| 3812.759 | 3007.861 | 2245.088 |
El ejercicio consistió en cargar un conjunto de datos numéricos de precios de automóviles con respecto a algunas variables numéricas.
El modelo de regresión linea múltiple destaca variables estadísticamente significativas: Las variables stroke, compressionratio y peakrpm tienen un nivel de confianza como predictores del 99%. La variable enginesize también posee un nivel de confianza del 99.9%. Por otro lado, la mayoría de las variables están por lo menos muy cerca del 90% y en algunos casos del 95%.
En el modelode árbol de regresión la variable con mayor importancia es enginesize con un valor de 23 y luego le siguen algunas otras como curbweight, horsepower, carwidth, highwaympg, citympg, carlength y wheelbase, también con ese orden de importancia.
El modelo de bosque aleatorio considera variables de importancia tales como: enginesize, curbweight, horsepower, citympg y carwidth.
En este caso, la variable enginesize continua estando presente en todos los modelos como importante y significativa. Otro dato interesante es que Las variables enginesize, curbweight y horsepower figuran como importantes en los modelos árbol de regresión y bosque aleatorio.
El mejor modelo conforme al estadístico raiz del error cuadrático medio (rmse) fue el de bosques aleatorios con estos datos de entrenamiento y validación y con el porcentaje de datos de entrenamiento y validación de 80% y 20%. El valor que arrojó fue de 2245.088, siendo el más bajo de los 3 modelos de regresión.
Comparando los resultados en R con los resultados arrojados en Python, el modelo que proporcionó el menor valor del estádistico RMSE fue el de random forest en ambos casos. No obstante, en R tuvo una cantidad de 2245.088 y en Python tuvo otra de 2631.95698, por lo tanto se puede concluir en que el modelo más óptimo, especificamente con estos datos, es efectivamente el random forest pero haciendo uso de la programación en R.