Compare supervised models by applying several algorithms to predict automobile prices, using the root mean squared error (RMSE) statistic as the basis for comparison.
The previously prepared data are loaded from https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment_Numericas_Preparado.csv
Training data are created with 80% of the observations
Validation data are created with the remaining 20%
The multiple regression model is built with the training data
This model answers questions such as:
Which variables are above the 90% confidence level as predictors?
What is the value of Adjusted R-squared, that is, how much of the vehicle's price do the independent variables explain?
Predictions are generated with the validation data
The RMSE statistic is computed for comparison purposes
The regression tree model is built with the training data
The importance of the variables for the price is identified
The regression tree and its decision rules are visualized
Predictions are made with the validation data
The RMSE statistic is computed for comparison purposes
The random forest model is built with the training data and 20 trees
The importance of the variables for the price is identified
Predictions are generated with the validation data
The RMSE statistic is computed for comparison purposes
At the end of the case, a personal interpretation is given
# Libraries
library(readr)
library(PerformanceAnalytics) # For graphical correlations
library(dplyr)
library(knitr) # For tabular data
library(kableExtra) # For friendlier tables
library(ggplot2) # For plotting
library(plotly) # For plotting
library(caret) # For partitioning the data
library(Metrics) # For computing RMSE
library(rpart) # For the regression tree
library(rpart.plot) # For plotting the tree
library(randomForest) # For random forest
library(reshape) # For renaming columns
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment_Numericas_Preparado.csv")
str(datos)
## 'data.frame': 205 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ wheelbase : num 88.6 88.6 94.5 99.8 99.4 ...
## $ carlength : num 169 169 171 177 177 ...
## $ carwidth : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ carheight : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curbweight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ enginesize : int 130 130 152 109 136 136 136 136 131 131 ...
## $ boreratio : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compressionratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peakrpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ citympg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highwaympg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
| Col | Name | Description |
|---|---|---|
| 1 | symboling | Assigned insurance risk rating: a value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. (Categorical) |
| 2 | wheelbase | Wheelbase of the car (Numeric). Distance between axles, in inches. |
| 3 | carlength | Length of the car (Numeric). |
| 4 | carwidth | Width of the car (Numeric). |
| 5 | carheight | Height of the car (Numeric). |
| 6 | curbweight | Weight of the car without occupants or baggage (Numeric). |
| 7 | enginesize | Engine size (Numeric). Size in … |
| 8 | boreratio | Bore ratio of the car (Numeric). Engine efficiency. |
| 9 | stroke | Stroke or volume inside the engine (Numeric). Pistons, strokes, combustion. |
| 10 | compressionratio | Compression ratio of the car (Numeric). Compression, a measure of pressure in the engine. |
| 11 | horsepower | Horsepower (Numeric). Power of the car. |
| 12 | peakrpm | Peak rpm of the car (Numeric). Peak revolutions per minute. |
| 13 | citympg | Mileage in the city (Numeric). Fuel consumption. |
| 14 | highwaympg | Mileage on the highway (Numeric). Fuel consumption. |
| 16 | price (Dependent variable) | Price of the car (Numeric). Price of the car in dollars. |
~Source: https://archive.ics.uci.edu/ml/datasets/Automobile~
kable(head(datos, 10), caption = "Car price data") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| X | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.00 |
| 2 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 1 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 2 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 5 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.00 |
| 6 | 2 | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 8 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 9 | 1 | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 10 | 0 | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
The training data take 80% of the observations and the validation data the remaining 20%.
n <- nrow(datos) # Number of observations
set.seed(1321) # Seed for a reproducible partition
entrena <- createDataPartition(y = datos$price, p = 0.80, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ] # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]
kable(head(datos.entrenamiento, 10), caption = "Training data. Car prices") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | X | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.00 |
| 2 | 2 | 3 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 3 | 1 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 4 | 2 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 6 | 6 | 2 | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 7 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 10 | 10 | 0 | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
| 11 | 11 | 2 | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | 108 | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16430.00 |
| 12 | 12 | 0 | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | 108 | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16925.00 |
| 13 | 13 | 0 | 101.2 | 176.8 | 64.8 | 54.3 | 2710 | 164 | 3.31 | 3.19 | 9.0 | 121 | 4250 | 21 | 28 | 20970.00 |
kable(head(datos.validacion, 10), caption = "Validation data. Car prices") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | X | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 8 | 8 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920 |
| 9 | 9 | 1 | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875 |
| 23 | 23 | 1 | 93.7 | 157.3 | 63.8 | 50.8 | 1876 | 90 | 2.97 | 3.23 | 9.4 | 68 | 5500 | 31 | 38 | 6377 |
| 35 | 35 | 1 | 93.7 | 150.0 | 64.0 | 52.6 | 1956 | 92 | 2.91 | 3.41 | 9.2 | 76 | 6000 | 30 | 34 | 7129 |
| 36 | 36 | 0 | 96.5 | 163.4 | 64.0 | 54.5 | 2010 | 92 | 2.91 | 3.41 | 9.2 | 76 | 6000 | 30 | 34 | 7295 |
| 37 | 37 | 0 | 96.5 | 157.1 | 63.9 | 58.3 | 2024 | 92 | 2.92 | 3.41 | 9.2 | 76 | 6000 | 30 | 34 | 7295 |
| 42 | 42 | 0 | 96.5 | 175.4 | 65.2 | 54.1 | 2465 | 110 | 3.15 | 3.58 | 9.0 | 101 | 5800 | 24 | 28 | 12945 |
| 48 | 48 | 0 | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | 258 | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 32250 |
| 49 | 49 | 0 | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | 258 | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 |
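As a rough illustration of what an 80/20 split does, the sketch below performs a simple random split in base R on a made-up data frame (`datos_demo` and its columns are hypothetical stand-ins, not the car data). It is not what `createDataPartition` does internally: caret samples within groups formed from percentiles of `price`, which is presumably why the split in this document yields 165 training rows rather than exactly 164.

```r
# Hypothetical stand-in data with the same number of rows as the car data
set.seed(1321)
datos_demo <- data.frame(id = 1:205, price = runif(205, 5000, 45000))

# Simple random 80/20 split (assumption: no stratification by price)
n_entrena <- floor(0.80 * nrow(datos_demo))
filas_entrena <- sample(nrow(datos_demo), size = n_entrena)

entrenamiento_demo <- datos_demo[filas_entrena, ]
validacion_demo <- datos_demo[-filas_entrena, ]

# Sanity checks: sizes add up and no row appears in both sets
nrow(entrenamiento_demo) + nrow(validacion_demo) == nrow(datos_demo)
length(intersect(entrenamiento_demo$id, validacion_demo$id)) == 0
```

The same two checks can be run on `datos.entrenamiento` and `datos.validacion`, using the `X` column as the row identifier.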
The multiple linear regression model (rm) is built
# Multiple linear regression model, used to see which variables matter
modelo_rm <- lm(formula = price ~ symboling + wheelbase + carlength + carwidth + carheight + curbweight + enginesize + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg ,
data = datos.entrenamiento)
summary(modelo_rm)
##
## Call:
## lm(formula = price ~ symboling + wheelbase + carlength + carwidth +
## carheight + curbweight + enginesize + boreratio + stroke +
## compressionratio + horsepower + peakrpm + citympg + highwaympg,
## data = datos.entrenamiento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9626.5 -1863.2 -123.3 1654.9 14661.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.719e+04 1.776e+04 -2.657 0.008727 **
## symboling 9.565e+01 2.820e+02 0.339 0.734944
## wheelbase 1.386e+02 1.280e+02 1.083 0.280609
## carlength -7.859e+01 6.088e+01 -1.291 0.198709
## carwidth 4.094e+02 2.922e+02 1.401 0.163267
## carheight 1.777e+02 1.584e+02 1.121 0.263885
## curbweight 1.199e+00 1.997e+00 0.600 0.549182
## enginesize 1.171e+02 1.589e+01 7.368 1.08e-11 ***
## boreratio -3.049e+02 1.333e+03 -0.229 0.819367
## stroke -2.832e+03 8.807e+02 -3.216 0.001591 **
## compressionratio 3.624e+02 1.023e+02 3.542 0.000529 ***
## horsepower 2.874e+01 1.773e+01 1.621 0.107221
## peakrpm 2.362e+00 7.806e-01 3.026 0.002914 **
## citympg -2.334e+02 2.196e+02 -1.063 0.289597
## highwaympg 1.103e+02 2.081e+02 0.530 0.596978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3214 on 150 degrees of freedom
## Multiple R-squared: 0.8437, Adjusted R-squared: 0.8291
## F-statistic: 57.84 on 14 and 150 DF, p-value: < 2.2e-16
Which variables are above the 90% confidence level as predictors?
The intercept is significant at the 99% confidence level (**).
No variable falls in the 90% (.) or 95% (*) bands; wheelbase, carwidth, citympg and the remaining predictors have p-values above 0.10.
The variables stroke and peakrpm are significant as predictors at the 99% confidence level (**).
The variables enginesize and compressionratio are significant as predictors at the 99.9% confidence level (***).
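The significance codes in the summary can be read off mechanically from the p-values. The sketch below copies a few p-values from the `summary(modelo_rm)` output shown in this document and maps them to the legend's codes with `cut()`; the names `p_valores` and `tabla_significancia` are illustrative, not part of the original analysis.

```r
# p-values copied from the summary(modelo_rm) output shown in this document
p_valores <- c(intercept = 0.008727, enginesize = 1.08e-11, stroke = 0.001591,
               compressionratio = 0.000529, horsepower = 0.107221, peakrpm = 0.002914)

# Cut points taken from the legend: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
codigos <- cut(p_valores,
               breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
               labels = c("***", "**", "*", ".", " "),
               include.lowest = TRUE)

tabla_significancia <- data.frame(
  variable = names(p_valores),
  p_valor = unname(p_valores),
  codigo = as.character(codigos),
  sobre_90 = unname(p_valores) < 0.10,  # TRUE when above the 90% confidence level
  stringsAsFactors = FALSE
)
print(tabla_significancia)
```

With the fitted model available, the full p-value column could be taken from `summary(modelo_rm)$coefficients[, "Pr(>|t|)"]` instead of being typed by hand.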
What is the value of Adjusted R-squared, that is, how much of the vehicle's price do the independent variables explain?
In multiple linear models, the statistic Adjusted R-squared: 0.8291 means that the independent variables explain approximately 82.91% of the variance of the dependent variable price.
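As a quick check, the adjusted value can be recomputed from the figures printed in the summary: Multiple R-squared 0.8437, 150 residual degrees of freedom and 14 predictors. The sketch below applies the usual adjustment formula; the variable names are illustrative.

```r
r2 <- 0.8437   # Multiple R-squared from the summary output
n <- 165       # training rows: 150 residual df + 15 estimated coefficients
p <- 14        # predictors, not counting the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
r2_ajustado <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(r2_ajustado, 4)  # 0.8291
```

The result agrees with the Adjusted R-squared reported by `summary(modelo_rm)`.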
predicciones_rm <- predict(object = modelo_rm, newdata = datos.validacion)
predicciones_rm
## 5 8 9 23 35 36 37 42
## 15855.603 17968.309 18300.141 6714.623 8659.412 8301.068 9444.239 10788.265
## 48 49 52 60 64 74 77 83
## 30626.999 30626.999 6268.162 10227.530 13471.053 40036.974 6270.753 15199.592
## 87 89 100 102 111 113 121 129
## 10104.383 10430.461 10655.204 22423.067 19312.495 19011.284 6788.190 26836.792
## 131 143 145 146 150 155 157 159
## 11177.942 8851.242 10192.292 11145.456 10852.063 6888.182 6889.106 10333.430
## 160 167 170 173 175 195 196 205
## 10577.299 12156.071 13769.149 14455.204 12970.014 16850.766 17323.694 18966.007
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rm)
kable(head(comparaciones, 10), caption = "Multiple linear regression. Actual prices vs. predicted prices. First 10 predictions") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 5 | 17450 | 15855.603 |
| 8 | 18920 | 17968.309 |
| 9 | 23875 | 18300.141 |
| 23 | 6377 | 6714.623 |
| 35 | 7129 | 8659.412 |
| 36 | 7295 | 8301.068 |
| 37 | 7295 | 9444.239 |
| 42 | 12945 | 10788.265 |
| 48 | 32250 | 30626.999 |
| 49 | 35550 | 30626.999 |
rmse_rm <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rm
## [1] 3236.724
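For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below writes that formula out in base R; `rmse_manual` is a hypothetical helper, and the two vectors hold only the first five rows of the comparison table above, so the result is not the validation RMSE reported by `rmse()`.

```r
# Root mean squared error written out by hand
rmse_manual <- function(real, prediccion) {
  sqrt(mean((real - prediccion)^2))
}

# First five rows of the comparison table (not the full validation set)
precio_real <- c(17450, 18920, 23875, 6377, 7129)
precio_pred <- c(15855.603, 17968.309, 18300.141, 6714.623, 8659.412)

rmse_manual(precio_real, precio_pred)
```

Applied to the full `comparaciones` data frame, this should match the value returned by `Metrics::rmse()`.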
The regression tree model (ar) is built
modelo_ar <- rpart(formula = price ~ symboling + wheelbase + carlength + carwidth + carheight + curbweight + enginesize + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg ,
data = datos.entrenamiento )
modelo_ar
## n= 165
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 165 9914069000 13196.010
## 2) enginesize< 182 151 3153903000 11299.430
## 4) curbweight< 2544 96 539037400 8514.677
## 8) horsepower< 83 52 57008810 7051.635 *
## 9) horsepower>=83 44 239179900 10243.730
## 18) highwaympg>=29.5 34 59677900 9387.235 *
## 19) highwaympg< 29.5 10 69758620 13155.800 *
## 5) curbweight>=2544 55 570966900 16160.090
## 10) horsepower< 118 27 231195100 14663.260 *
## 11) horsepower>=118 28 220944500 17603.470
## 22) stroke>=3.24 15 56366990 15775.740 *
## 23) stroke< 3.24 13 56651040 19712.380 *
## 3) enginesize>=182 14 358772600 33651.960 *
Pending
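Assuming the pending item is the variable-importance step listed in the outline, one possible way to obtain it is the `variable.importance` component that `rpart` stores in the fitted object. The sketch below uses `mtcars`, a data set that ships with R, as a stand-in because the car data are not loaded here; `arbol_demo` and `importancia_pct` are illustrative names.

```r
library(rpart)

# Stand-in data: mtcars ships with R; datos.entrenamiento is not loaded here
arbol_demo <- rpart(mpg ~ ., data = mtcars)

# rpart stores importance as a named numeric vector (NULL if the tree has no splits)
importancia <- arbol_demo$variable.importance
importancia <- sort(importancia, decreasing = TRUE)

# Express each variable's importance as a share of the total
importancia_pct <- round(100 * importancia / sum(importancia), 1)
print(importancia_pct)
```

With the objects of this document, the analogous expression would be `modelo_ar$variable.importance`.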
rpart.plot(modelo_ar)
predicciones_ar <- predict(object = modelo_ar, newdata = datos.validacion)
predicciones_ar
## 5 8 9 23 35 36 37 42
## 14663.259 14663.259 15775.744 7051.635 7051.635 7051.635 7051.635 13155.800
## 48 49 52 60 64 74 77 83
## 33651.964 33651.964 7051.635 9387.235 7051.635 33651.964 7051.635 15775.744
## 87 89 100 102 111 113 121 129
## 9387.235 9387.235 9387.235 15775.744 14663.259 14663.259 7051.635 33651.964
## 131 143 145 146 150 155 157 159
## 14663.259 7051.635 7051.635 13155.800 14663.259 7051.635 7051.635 7051.635
## 160 167 170 173 175 195 196 205
## 7051.635 13155.800 14663.259 14663.259 7051.635 14663.259 14663.259 14663.259
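Each prediction above is the mean price of the terminal node that the car falls into. To make that concrete, the sketch below transcribes the printed tree into nested conditions; `predecir_arbol` is a hypothetical helper written only for illustration, and its leaf values are the rounded `yval` figures from the printout, so they can differ from `predict()` in the last decimal.

```r
# Hand transcription of the printed rpart tree (thresholds and yval from the output)
predecir_arbol <- function(enginesize, curbweight, horsepower, highwaympg, stroke) {
  if (enginesize >= 182) return(33651.960)        # node 3
  if (curbweight < 2544) {
    if (horsepower < 83) return(7051.635)         # node 8
    if (highwaympg >= 29.5) return(9387.235)      # node 18
    return(13155.800)                             # node 19
  }
  if (horsepower < 118) return(14663.260)         # node 10
  if (stroke >= 3.24) return(15775.740)           # node 22
  19712.380                                       # node 23
}

# Validation row 5: enginesize 136, curbweight 2824, horsepower 115
predecir_arbol(enginesize = 136, curbweight = 2824, horsepower = 115,
               highwaympg = 22, stroke = 3.40)    # lands in node 10

# Validation row 42: curbweight 2465, horsepower 101, highwaympg 28
predecir_arbol(enginesize = 110, curbweight = 2465, horsepower = 101,
               highwaympg = 28, stroke = 3.58)    # lands in node 19
```

Because the tree has only eight leaves, it can produce only eight distinct predicted prices, which explains the repeated values in the output.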
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_ar)
kable(head(comparaciones, 10), caption = "Regression tree. Actual prices vs. predicted prices. First 10 predictions") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 5 | 17450 | 14663.259 |
| 8 | 18920 | 14663.259 |
| 9 | 23875 | 15775.744 |
| 23 | 6377 | 7051.635 |
| 35 | 7129 | 7051.635 |
| 36 | 7295 | 7051.635 |
| 37 | 7295 | 7051.635 |
| 42 | 12945 | 13155.800 |
| 48 | 32250 | 33651.964 |
| 49 | 35550 | 33651.964 |
rmse_ar <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_ar
## [1] 3070.846
The random forest model (rf) is built
modelo_rf <- randomForest(x = datos.entrenamiento[,c("symboling", "wheelbase",
"carlength", "carwidth", "carheight", "curbweight",
"enginesize", "boreratio", "stroke",
"compressionratio", "horsepower", "peakrpm",
"citympg", "highwaympg" )],
y = datos.entrenamiento[,'price'],
importance = TRUE,
keep.forest = TRUE,
ntree=20)
modelo_rf
##
## Call:
## randomForest(x = datos.entrenamiento[, c("symboling", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginesize", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")], y = datos.entrenamiento[, "price"], ntree = 20, importance = TRUE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 20
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 6330438
## % Var explained: 89.46
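The forest above averages the predictions of 20 trees. As a simplified illustration of that averaging idea only, the sketch below bags 20 `rpart` trees fitted on bootstrap samples of `mtcars` (a stand-in data set that ships with R). This is plain bagging, not the `randomForest` algorithm, which additionally tries a random subset of variables at each split (4 in the output above); the names `datos_demo`, `arboles` and `pred_bagging` are hypothetical.

```r
library(rpart)

set.seed(1321)
datos_demo <- mtcars   # stand-in data; the car data are not loaded here
n_arboles <- 20

# Fit one regression tree per bootstrap sample (rows drawn with replacement)
arboles <- lapply(seq_len(n_arboles), function(i) {
  filas <- sample(nrow(datos_demo), replace = TRUE)
  rpart(mpg ~ ., data = datos_demo[filas, ])
})

# One column of predictions per tree, then average across trees
pred_por_arbol <- sapply(arboles, predict, newdata = datos_demo)
pred_bagging <- rowMeans(pred_por_arbol)

head(pred_bagging)
```

Averaging over many trees smooths out the step-like predictions of a single tree, which is one reason an ensemble often reaches a lower RMSE than one tree on the same data.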
as.data.frame(modelo_rf$importance) %>%
arrange(desc(IncNodePurity))
## %IncMSE IncNodePurity
## curbweight 18132525.24 1999706206
## enginesize 14320617.41 1852259145
## highwaympg 8589788.37 1795945748
## horsepower 14836365.13 1561559058
## carwidth 3273893.54 914812774
## carlength 3843971.04 568845702
## citympg 2921449.74 393901245
## compressionratio 401521.89 168128046
## wheelbase 1839287.96 155307688
## boreratio 1389648.34 153154804
## peakrpm 1149158.47 104514270
## stroke 832625.36 59772606
## carheight 44791.57 38876814
## symboling -77368.56 32158782
predicciones_rf <- predict(object = modelo_rf, newdata = datos.validacion)
predicciones_rf
## 5 8 9 23 35 36 37 42
## 15798.828 17342.450 27052.028 6202.515 6408.399 6861.312 6790.689 12337.843
## 48 49 52 60 64 74 77 83
## 35554.229 35554.229 6648.324 10280.199 10789.761 39960.790 5968.551 13778.485
## 87 89 100 102 111 113 121 129
## 8256.301 9104.692 9206.513 16026.099 18824.635 16549.072 6713.682 31125.460
## 131 143 145 146 150 155 157 159
## 10892.901 8302.891 11057.935 11600.466 13527.883 7333.159 7678.558 7521.930
## 160 167 170 173 175 195 196 205
## 7785.957 10176.152 9364.388 12793.742 10886.969 16176.910 16297.085 18552.843
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rf)
kable(head(comparaciones, 10), caption = "Random forest. Actual prices vs. predicted prices. First 10 predictions") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 5 | 17450 | 15798.828 |
| 8 | 18920 | 17342.450 |
| 9 | 23875 | 27052.028 |
| 23 | 6377 | 6202.515 |
| 35 | 7129 | 6408.399 |
| 36 | 7295 | 6861.312 |
| 37 | 7295 | 6790.689 |
| 42 | 12945 | 12337.843 |
| 48 | 32250 | 35554.229 |
| 49 | 35550 | 35554.229 |
rmse_rf <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rf
## [1] 2058.887
The predictions are compared
comparaciones <- data.frame(cbind(datos.validacion[,-1], predicciones_rm, predicciones_ar, predicciones_rf))
The predictions of each model are displayed
kable(comparaciones, caption = "Predictions of the models") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | symboling | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price | predicciones_rm | predicciones_ar | predicciones_rf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 | 15855.603 | 14663.259 | 15798.828 |
| 8 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920 | 17968.309 | 14663.259 | 17342.450 |
| 9 | 1 | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875 | 18300.141 | 15775.744 | 27052.028 |
| 23 | 1 | 93.7 | 157.3 | 63.8 | 50.8 | 1876 | 90 | 2.97 | 3.23 | 9.4 | 68 | 5500 | 31 | 38 | 6377 | 6714.623 | 7051.635 | 6202.515 |
| 35 | 1 | 93.7 | 150.0 | 64.0 | 52.6 | 1956 | 92 | 2.91 | 3.41 | 9.2 | 76 | 6000 | 30 | 34 | 7129 | 8659.412 | 7051.635 | 6408.399 |
| 36 | 0 | 96.5 | 163.4 | 64.0 | 54.5 | 2010 | 92 | 2.91 | 3.41 | 9.2 | 76 | 6000 | 30 | 34 | 7295 | 8301.068 | 7051.635 | 6861.312 |
| 37 | 0 | 96.5 | 157.1 | 63.9 | 58.3 | 2024 | 92 | 2.92 | 3.41 | 9.2 | 76 | 6000 | 30 | 34 | 7295 | 9444.239 | 7051.635 | 6790.689 |
| 42 | 0 | 96.5 | 175.4 | 65.2 | 54.1 | 2465 | 110 | 3.15 | 3.58 | 9.0 | 101 | 5800 | 24 | 28 | 12945 | 10788.265 | 13155.800 | 12337.843 |
| 48 | 0 | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | 258 | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 32250 | 30626.999 | 33651.964 | 35554.229 |
| 49 | 0 | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | 258 | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 | 30626.999 | 33651.964 | 35554.229 |
| 52 | 1 | 93.1 | 159.1 | 64.2 | 54.1 | 1900 | 91 | 3.03 | 3.15 | 9.0 | 68 | 5000 | 31 | 38 | 6095 | 6268.162 | 7051.635 | 6648.324 |
| 60 | 1 | 98.8 | 177.8 | 66.5 | 53.7 | 2385 | 122 | 3.39 | 3.39 | 8.6 | 84 | 4800 | 26 | 32 | 8845 | 10227.530 | 9387.235 | 10280.199 |
| 64 | 0 | 98.8 | 177.8 | 66.5 | 55.5 | 2443 | 122 | 3.39 | 3.39 | 22.7 | 64 | 4650 | 36 | 42 | 10795 | 13471.053 | 7051.635 | 10789.761 |
| 74 | 0 | 120.9 | 208.1 | 71.7 | 56.7 | 3900 | 308 | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 40960 | 40036.974 | 33651.964 | 39960.790 |
| 77 | 2 | 93.7 | 157.3 | 64.4 | 50.8 | 1918 | 92 | 2.97 | 3.23 | 9.4 | 68 | 5500 | 37 | 41 | 5389 | 6270.753 | 7051.635 | 5968.551 |
| 83 | 3 | 95.9 | 173.2 | 66.3 | 50.2 | 2833 | 156 | 3.58 | 3.86 | 7.0 | 145 | 5000 | 19 | 24 | 12629 | 15199.592 | 15775.744 | 13778.485 |
| 87 | 1 | 96.3 | 172.4 | 65.4 | 51.6 | 2405 | 122 | 3.35 | 3.46 | 8.5 | 88 | 5000 | 25 | 32 | 8189 | 10104.383 | 9387.235 | 8256.301 |
| 89 | -1 | 96.3 | 172.4 | 65.4 | 51.6 | 2403 | 110 | 3.17 | 3.46 | 7.5 | 116 | 5500 | 23 | 30 | 9279 | 10430.461 | 9387.235 | 9104.692 |
| 100 | 0 | 97.2 | 173.4 | 65.2 | 54.7 | 2324 | 120 | 3.33 | 3.47 | 8.5 | 97 | 5200 | 27 | 34 | 8949 | 10655.204 | 9387.235 | 9206.513 |
| 102 | 0 | 100.4 | 181.7 | 66.5 | 55.1 | 3095 | 181 | 3.43 | 3.27 | 9.0 | 152 | 5200 | 17 | 22 | 13499 | 22423.067 | 15775.744 | 16026.099 |
| 111 | 0 | 114.2 | 198.9 | 68.4 | 58.7 | 3430 | 152 | 3.70 | 3.52 | 21.0 | 95 | 4150 | 25 | 25 | 13860 | 19312.495 | 14663.259 | 18824.635 |
| 113 | 0 | 107.9 | 186.7 | 68.4 | 56.7 | 3252 | 152 | 3.70 | 3.52 | 21.0 | 95 | 4150 | 28 | 33 | 16900 | 19011.284 | 14663.259 | 16549.072 |
| 121 | 1 | 93.7 | 157.3 | 63.8 | 50.6 | 1967 | 90 | 2.97 | 3.23 | 9.4 | 68 | 5500 | 31 | 38 | 6229 | 6788.190 | 7051.635 | 6713.682 |
| 129 | 3 | 89.5 | 168.9 | 65.0 | 51.6 | 2800 | 194 | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 37028 | 26836.792 | 33651.964 | 31125.460 |
| 131 | 0 | 96.1 | 181.5 | 66.5 | 55.2 | 2579 | 132 | 3.46 | 3.90 | 8.7 | 90 | 5100 | 23 | 31 | 9295 | 11177.942 | 14663.259 | 10892.901 |
| 143 | 0 | 97.2 | 172.0 | 65.4 | 52.5 | 2190 | 108 | 3.62 | 2.64 | 9.5 | 82 | 4400 | 28 | 33 | 7775 | 8851.242 | 7051.635 | 8302.891 |
| 145 | 0 | 97.0 | 172.0 | 65.4 | 54.3 | 2385 | 108 | 3.62 | 2.64 | 9.0 | 82 | 4800 | 24 | 25 | 9233 | 10192.292 | 7051.635 | 11057.935 |
| 146 | 0 | 97.0 | 172.0 | 65.4 | 54.3 | 2510 | 108 | 3.62 | 2.64 | 7.7 | 111 | 4800 | 24 | 29 | 11259 | 11145.456 | 13155.800 | 11600.466 |
| 150 | 0 | 96.9 | 173.6 | 65.4 | 54.9 | 2650 | 108 | 3.62 | 2.64 | 7.7 | 111 | 4800 | 23 | 23 | 11694 | 10852.063 | 14663.259 | 13527.883 |
| 155 | 0 | 95.7 | 169.7 | 63.6 | 59.1 | 2290 | 92 | 3.05 | 3.03 | 9.0 | 62 | 4800 | 27 | 32 | 7898 | 6888.182 | 7051.635 | 7333.159 |
| 157 | 0 | 95.7 | 166.3 | 64.4 | 53.0 | 2081 | 98 | 3.19 | 3.03 | 9.0 | 70 | 4800 | 30 | 37 | 6938 | 6889.106 | 7051.635 | 7678.558 |
| 159 | 0 | 95.7 | 166.3 | 64.4 | 53.0 | 2275 | 110 | 3.27 | 3.35 | 22.5 | 56 | 4500 | 34 | 36 | 7898 | 10333.430 | 7051.635 | 7521.930 |
| 160 | 0 | 95.7 | 166.3 | 64.4 | 52.8 | 2275 | 110 | 3.27 | 3.35 | 22.5 | 56 | 4500 | 38 | 47 | 7788 | 10577.299 | 7051.635 | 7785.958 |
| 167 | 1 | 94.5 | 168.7 | 64.0 | 52.6 | 2300 | 98 | 3.24 | 3.08 | 9.4 | 112 | 6600 | 26 | 29 | 9538 | 12156.071 | 13155.800 | 10176.152 |
| 170 | 2 | 98.4 | 176.2 | 65.6 | 52.0 | 2551 | 146 | 3.62 | 3.50 | 9.3 | 116 | 4800 | 24 | 30 | 9989 | 13769.149 | 14663.259 | 9364.388 |
| 173 | 2 | 98.4 | 176.2 | 65.6 | 53.0 | 2975 | 146 | 3.62 | 3.50 | 9.3 | 116 | 4800 | 24 | 30 | 17669 | 14455.204 | 14663.259 | 12793.742 |
| 175 | -1 | 102.4 | 175.6 | 66.5 | 54.9 | 2480 | 110 | 3.27 | 3.35 | 22.5 | 73 | 4500 | 30 | 33 | 10698 | 12970.014 | 7051.635 | 10886.969 |
| 195 | -2 | 104.3 | 188.8 | 67.2 | 56.2 | 2912 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 12940 | 16850.766 | 14663.259 | 16176.910 |
| 196 | -1 | 104.3 | 188.8 | 67.2 | 57.5 | 3034 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 13415 | 17323.694 | 14663.259 | 16297.085 |
| 205 | -1 | 109.1 | 188.8 | 68.9 | 55.5 | 3062 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 | 18966.007 | 14663.259 | 18552.843 |
The RMSE values are compared
tabla_rmse <- data.frame(rm = rmse_rm, ar = rmse_ar, rf = rmse_rf) # Named tabla_rmse so it does not mask Metrics::rmse()
kable(tabla_rmse, caption = "RMSE statistic of each model") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| rm | ar | rf |
|---|---|---|
| 3236.724 | 3070.846 | 2058.887 |
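To put the three values side by side, the sketch below takes the RMSE figures from the table above, picks the smallest, and expresses the random forest's error as a percentage reduction relative to the other two models. The names `rmse_modelos` and `mejora_pct` are illustrative.

```r
# RMSE values copied from the table above
rmse_modelos <- c(rm = 3236.724, ar = 3070.846, rf = 2058.887)

# Model with the lowest validation error
mejor <- names(which.min(rmse_modelos))
mejor

# Percentage reduction in RMSE of the random forest relative to the other models
mejora_pct <- 100 * (1 - rmse_modelos[["rf"]] / rmse_modelos[c("rm", "ar")])
round(mejora_pct, 2)
```

These figures describe a single 80/20 split with one seed, so the ranking may change with a different partition.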
Numeric data on automobile prices were loaded, with a set of numeric variables as predictors.
The multiple linear regression model highlights the statistically significant variables: stroke and peakrpm are significant as predictors at the 99% confidence level, and enginesize and compressionratio at the 99.9% level.
In the regression tree model, the important variables were enginesize, highwaympg, curbweight and horsepower, all of which appear as split variables in the printed tree.
The random forest model ranks curbweight, enginesize, highwaympg, horsepower and carwidth as its most important variables by IncNodePurity.
The variable enginesize stands out as important and significant in all three models, and enginesize, curbweight and horsepower are important in both the regression tree and the random forest.
However, a high confidence level for a predictor does not necessarily mean that the relationship between the variables is well understood. Overall, the best model according to the root mean squared error (RMSE) was the random forest (2058.887, against 3070.846 for the regression tree and 3236.724 for multiple linear regression), for this particular training and validation data and an 80%/20% split between them.