Objective

Compare supervised models by applying algorithms that predict automobile prices, using the root mean squared error (RMSE) statistic as the basis for comparison.

Description

The previously prepared data are loaded from https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv

All variables in the data set are used, except the identifiers car_ID and CarName, which are removed during preparation.

Training data are created from 80% of the observations.

Validation data are created from the remaining 20%.

The multiple regression model is built with the training data.

This model answers questions such as:

Which variables are significant predictors at the 90% confidence level or above?

What is the value of Adjusted R-squared, that is, how much of the variation in vehicle price do the independent variables explain?

Predictions are generated with the validation data.

The RMSE statistic is computed for comparison purposes.

The regression tree model is built with the training data.

The importance of the variables for price is identified.

The regression tree and its decision rules are visualized.

Predictions are made with the validation data.

The RMSE statistic is computed for comparison purposes.

The random forest model is built with the training data and 20 trees.

The importance of the variables for price is identified.

Predictions are generated with the validation data.

The RMSE statistic is computed for comparison purposes.

A personal interpretation is given at the end of the case.

Development

Load libraries

# Libraries
library(readr)
library(PerformanceAnalytics) # For graphical correlations
library(dplyr)
library(knitr) # For tabular data
library(kableExtra) # For friendlier tables
library(ggplot2) # For visualization
library(plotly) # For visualization
library(caret)  # For partitioning the data
library(Metrics) # For computing RMSE
library(rpart) # For the regression tree
library(rpart.plot) # For plotting the tree
library(randomForest) # For random forest
library(reshape)    # For renaming columns

Load data

datos <-  read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv", 
                   fileEncoding = "UTF-8", 
                   stringsAsFactors = TRUE)

Data exploration

There are 205 observations and 26 variables. Both the numeric and the categorical variables are used; only car_ID and CarName are dropped later on.

str(datos)
## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : Factor w/ 147 levels "alfa-romero giulia",..: 1 3 2 4 5 9 5 7 6 8 ...
##  $ fueltype        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration      : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ doornumber      : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
##  $ carbody         : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drivewheel      : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ enginelocation  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ cylindernumber  : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...
kable(head(datos, 10), caption = "Car price data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Car price data
car_ID symboling CarName fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
1 3 alfa-romero giulia gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.00
2 3 alfa-romero stelvio gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.00
3 1 alfa-romero Quadrifoglio gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.00
4 2 audi 100 ls gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.00
5 2 audi 100ls gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.00
6 2 audi fox gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250.00
7 1 audi 100ls gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710.00
8 1 audi 5000 gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920.00
9 1 audi 4000 gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875.00
10 0 audi 5000s (diesel) gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.0 160 5500 16 22 17859.17

Data dictionary

Col Name Description
1 car_ID Unique ID of each observation (Integer)
2 symboling Its assigned insurance risk rating. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe (Categorical)
3 CarName Name of the car company and model (Categorical)
4 fueltype Car fuel type, i.e. gas or diesel (Categorical)
5 aspiration Aspiration used in a car (Categorical): std or turbo
6 doornumber Number of doors in a car (Categorical)
7 carbody Body of car (Categorical): convertible, sedan, wagon …
8 drivewheel Type of drive wheel (Categorical): 4wd, fwd, rwd
9 enginelocation Location of car engine (Categorical): front or rear
10 wheelbase Wheelbase of car (Numeric). Distance between axles, in inches
11 carlength Length of car (Numeric)
12 carwidth Width of car (Numeric)
13 carheight Height of car (Numeric)
14 curbweight The weight of a car without occupants or baggage (Numeric)
15 enginetype Type of engine (Categorical)
16 cylindernumber Number of cylinders in the car (Categorical)
17 enginesize Size of the car's engine (Numeric); units: …
18 fuelsystem Fuel system of car (Categorical)
19 boreratio Bore ratio of car (Numeric); a cylinder bore measurement
20 stroke Stroke or volume inside the engine (Numeric). Pistons, strokes, combustion
21 compressionratio Compression ratio of car (Numeric). A measure of compression in the engine
22 horsepower Horsepower (Numeric). Power of the car
23 peakrpm Car peak rpm (Numeric). Peak revolutions per minute
24 citympg Mileage in city (Numeric). Fuel economy in miles per gallon
25 highwaympg Mileage on highway (Numeric). Fuel economy in miles per gallon
26 price Price of car (Numeric); dependent variable. Car price in dollars

Source: https://archive.ics.uci.edu/ml/datasets/Automobile

Prepare the data

Remove the variables that carry no statistical interest, that is, columns 1 and 3 (car_ID and CarName).

datos <- datos[, c(2,4:26)]
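
Selecting by position gives the wrong columns without warning if the column order of the file ever changes. As a hedged alternative, the same two columns could be dropped by name. The data frame `demo` and the names `drop_cols` and `demo_clean` are made up for illustration; only the column names car_ID and CarName come from the data set.

```r
# Toy stand-in for `datos` (hypothetical values; only the column names are real)
demo <- data.frame(
  car_ID    = 1:3,
  symboling = c(3L, 3L, 1L),
  CarName   = c("a", "b", "c"),
  price     = c(13495, 16500, 16500)
)

# Drop the identifier columns by name instead of by position
drop_cols  <- c("car_ID", "CarName")
demo_clean <- demo[, !(names(demo) %in% drop_cols), drop = FALSE]

names(demo_clean)
```

Applied to the real data, the same pattern would be `datos[, !(names(datos) %in% c("car_ID", "CarName"))]`.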

The first records again

kable(head(datos, 10), caption = "Car price data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Car price data
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.00
3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.00
1 gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.00
2 gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.00
2 gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.00
2 gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250.00
1 gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710.00
1 gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920.00
1 gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875.00
0 gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.0 160 5500 16 22 17859.17

Training and validation data

The training data comprise 80% of the observations and the validation data the remaining 20%.

n <- nrow(datos)
set.seed(1264) # Seed
entrena <- createDataPartition(y = datos$price, p = 0.80, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ]  # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]

Training data

kable(head(datos.entrenamiento, 10), caption = "Training data. Car prices") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Training data. Car prices
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
2 3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.00
4 2 gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.00
5 2 gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.00
6 2 gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250.00
8 1 gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920.00
9 1 gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875.00
10 0 gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.0 160 5500 16 22 17859.17
11 2 gas std two sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.80 8.8 101 5800 23 29 16430.00
13 0 gas std two sedan rwd front 101.2 176.8 64.8 54.3 2710 ohc six 164 mpfi 3.31 3.19 9.0 121 4250 21 28 20970.00
15 1 gas std four sedan rwd front 103.5 189.0 66.9 55.7 3055 ohc six 164 mpfi 3.31 3.19 9.0 121 4250 20 25 24565.00

Validation data

kable(head(datos.validacion, 10), caption = "Validation data. Car prices") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Validation data. Car prices
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
1 3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.680 9.0 111 5000 21 27 13495
3 1 gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.470 9.0 154 5000 19 26 16500
7 1 gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.400 8.5 110 5500 19 25 17710
12 0 gas std four sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.800 8.8 101 5800 23 29 16925
14 0 gas std four sedan rwd front 101.2 176.8 64.8 54.3 2765 ohc six 164 mpfi 3.31 3.190 9.0 121 4250 21 28 21105
27 1 gas std four sedan fwd front 93.7 157.3 63.8 50.6 1989 ohc four 90 2bbl 2.97 3.230 9.4 68 5500 31 38 7609
38 0 gas std two hatchback fwd front 96.5 167.5 65.2 53.3 2236 ohc four 110 1bbl 3.15 3.580 9.0 86 5800 27 33 7895
52 1 gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1900 ohc four 91 2bbl 3.03 3.150 9.0 68 5000 31 38 6095
53 1 gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1905 ohc four 91 2bbl 3.03 3.150 9.0 68 5000 31 38 6795
56 3 gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2380 rotor two 70 4bbl 3.33 3.255 9.4 101 6000 17 23 10945

Supervised models

Multiple linear regression model (RM)

The multiple linear regression model (rm) is built with price as a function of all the independent variables, both numeric and non-numeric.

The expression price ~ . means price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg
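
The expansion of the dot can be checked directly. The sketch below uses a made-up three-column data frame (`demo` and `expanded` are illustrative names) and asks base R which terms the formula contains once `.` is resolved against the data.

```r
# Toy data frame (hypothetical values); `price` plays the role of the response
demo <- data.frame(
  price      = c(10, 12, 15, 11),
  enginesize = c(90, 110, 150, 100),
  fueltype   = factor(c("gas", "gas", "diesel", "gas"))
)

# terms() resolves the dot into every column other than the response
expanded <- attr(terms(price ~ ., data = demo), "term.labels")
expanded
```

With the real training data, `attr(terms(price ~ ., data = datos.entrenamiento), "term.labels")` should list the 23 predictors written out above.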

# Multiple linear regression model, used to see which variables matter
#modelo_rm <- lm(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento)
modelo_rm <- lm(formula = price ~ . , 
                data = datos.entrenamiento)
summary(modelo_rm)
## 
## Call:
## lm(formula = price ~ ., data = datos.entrenamiento)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5738.0  -890.4     0.0   974.8  7628.2 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.258e+04  1.688e+04  -2.522 0.012941 *  
## symboling            -7.622e+01  2.580e+02  -0.295 0.768135    
## fueltypegas          -7.843e+03  7.150e+03  -1.097 0.274847    
## aspirationturbo       2.819e+03  8.928e+02   3.157 0.002005 ** 
## doornumbertwo         6.283e+02  5.988e+02   1.049 0.296120    
## carbodyhardtop       -3.229e+03  1.427e+03  -2.263 0.025377 *  
## carbodyhatchback     -4.009e+03  1.384e+03  -2.897 0.004455 ** 
## carbodysedan         -2.716e+03  1.491e+03  -1.822 0.070874 .  
## carbodywagon         -4.146e+03  1.612e+03  -2.573 0.011283 *  
## drivewheelfwd         1.688e+02  1.107e+03   0.152 0.879064    
## drivewheelrwd        -7.977e+01  1.280e+03  -0.062 0.950425    
## enginelocationrear    5.970e+03  2.545e+03   2.345 0.020608 *  
## wheelbase            -5.815e+01  1.021e+02  -0.570 0.569944    
## carlength            -2.957e+01  4.997e+01  -0.592 0.555192    
## carwidth              8.633e+02  2.367e+02   3.647 0.000390 ***
## carheight             1.163e+02  1.387e+02   0.839 0.403078    
## curbweight            3.853e+00  1.725e+00   2.234 0.027290 *  
## enginetypedohcv      -1.059e+04  4.529e+03  -2.338 0.020998 *  
## enginetypel          -1.505e+03  1.691e+03  -0.890 0.375380    
## enginetypeohc         2.381e+03  9.875e+02   2.411 0.017389 *  
## enginetypeohcf       -1.389e+03  1.744e+03  -0.796 0.427423    
## enginetypeohcv       -9.558e+03  1.358e+03  -7.041 1.19e-10 ***
## enginetyperotor      -3.382e+03  4.428e+03  -0.764 0.446441    
## cylindernumberfive   -1.305e+04  2.777e+03  -4.702 6.80e-06 ***
## cylindernumberfour   -1.432e+04  3.092e+03  -4.631 9.13e-06 ***
## cylindernumbersix    -7.387e+03  2.123e+03  -3.479 0.000696 ***
## cylindernumberthree  -5.135e+03  4.446e+03  -1.155 0.250382    
## cylindernumbertwelve -1.010e+04  4.335e+03  -2.330 0.021431 *  
## cylindernumbertwo            NA         NA      NA       NA    
## enginesize            1.230e+02  2.610e+01   4.713 6.48e-06 ***
## fuelsystem2bbl       -9.847e+01  8.958e+02  -0.110 0.912654    
## fuelsystem4bbl       -2.070e+03  2.727e+03  -0.759 0.449147    
## fuelsystemidi                NA         NA      NA       NA    
## fuelsystemmfi        -3.280e+03  2.473e+03  -1.326 0.187166    
## fuelsystemmpfi       -3.971e+02  1.012e+03  -0.392 0.695415    
## fuelsystemspdi       -2.578e+03  1.379e+03  -1.870 0.063892 .  
## fuelsystemspfi        1.312e+01  2.362e+03   0.006 0.995575    
## boreratio             1.462e+03  1.742e+03   0.839 0.402948    
## stroke               -5.589e+03  9.581e+02  -5.834 4.49e-08 ***
## compressionratio     -5.648e+02  5.360e+02  -1.054 0.294101    
## horsepower           -1.308e+01  2.272e+01  -0.576 0.565922    
## peakrpm               3.415e+00  6.754e-01   5.056 1.51e-06 ***
## citympg               1.153e+01  1.517e+02   0.076 0.939512    
## highwaympg            7.951e+01  1.402e+02   0.567 0.571543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2017 on 123 degrees of freedom
## Multiple R-squared:  0.9574, Adjusted R-squared:  0.9432 
## F-statistic: 67.42 on 41 and 123 DF,  p-value: < 2.2e-16
  • Which variables are significant predictors at the 90% confidence level or above?

  • The intercept is significant at the 95% confidence level (p = 0.0129).

  • Several coefficients are significant at the 90% level or above (p < 0.10): aspirationturbo; the carbody levels hardtop, hatchback, sedan and wagon; enginelocationrear; carwidth; curbweight; the enginetype levels dohcv, ohc and ohcv; the cylindernumber levels five, four, six and twelve; enginesize; fuelsystemspdi; stroke; and peakrpm.

  • Because some predictors do not reach the 90% level, one could build a model with only the predictors that do. That is left for future work and is not done in this case.

  • In this model the Adjusted R-squared of 0.9432 means that the independent variables explain approximately 94.32% of the variation in the dependent variable price.
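
The answers above can also be read from the fitted object rather than from the printed summary. The sketch below is only an illustration: it fits a stand-in model on the built-in mtcars data (not the car price data), and `fit`, `tab`, `signif_90` and `adj_manual` are names chosen here. The last line repeats the Adjusted R-squared arithmetic with the figures printed in the summary (R-squared 0.9574, 165 training rows, 41 estimated slopes).

```r
# Stand-in fit on the built-in mtcars data (assumption: not the car price data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
tab <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)

# Coefficients significant at the 90% level or above (p < 0.10)
signif_90 <- rownames(tab)[tab[, "Pr(>|t|)"] < 0.10]
signif_90

# Adjusted R-squared as reported, and recomputed from R-squared
s <- summary(fit)
n <- nrow(mtcars)
p <- length(coef(fit)) - 1  # number of slopes, excluding the intercept
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
c(reported = s$adj.r.squared, manual = adj_manual)

# Same arithmetic with the values printed in the summary of modelo_rm
adj_rm <- 1 - (1 - 0.9574) * (165 - 1) / (165 - 41 - 1)
round(adj_rm, 4)
```

For the model in this case the equivalent calls would be `coef(summary(modelo_rm))` and `summary(modelo_rm)$adj.r.squared`.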

Predictions from the rm model

predicciones_rm <- predict(object = modelo_rm, newdata = datos.validacion)
## Warning in predict.lm(object = modelo_rm, newdata = datos.validacion):
## prediction from a rank-deficient fit may be misleading
predicciones_rm
##         1         3         7        12        14        27        38        52 
## 14894.135  7155.108 20258.961 13669.073 20687.997  6777.815  9636.339  5680.662 
##        53        56        60        67        71        90        97        98 
##  5699.929 12735.366 10401.997 10804.809 27861.474  6863.488  6551.147  5113.741 
##       113       115       116       121       132       136       141       148 
## 16459.548 14761.806 11449.746  5399.805  9474.092 14441.362  6104.461  8839.126 
##       152       154       155       161       163       165       166       176 
##  5827.832  6193.676  5619.714  8703.080  7708.080  6606.759 10188.249  6464.469 
##       178       180       181       190       195       197       199       204 
##  6634.020 22438.604 22973.060 12646.596 18073.803 18173.967 18748.983 26624.392

Comparison table

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_rm)

After running tests with seed 2023, the conclusion is that the training data must cover every value of each categorical variable that appears in the validation data; that is, the validation data must not contain values the model was not trained on. Note that the partition above uses seed 1264.
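
This matters because `predict()` on an `lm` fit stops with an error when the new data contain a factor level that was absent during fitting. A minimal sketch of a pre-check follows; `unseen_levels()` is a hypothetical helper written for this illustration, and the two small data frames hold made-up rows.

```r
# Hypothetical helper: for each factor column, list the levels that occur in
# `valid` but never in `train`
unseen_levels <- function(train, valid) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(unique(as.character(valid[[col]])),
            unique(as.character(train[[col]])))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]   # keep only the columns with a problem
}

# Toy example (made-up rows): "rotor" appears only in the validation part
train <- data.frame(enginetype = factor(c("ohc", "ohc", "dohc")),
                    fueltype   = factor(c("gas", "diesel", "gas")))
valid <- data.frame(enginetype = factor(c("ohc", "rotor")),
                    fueltype   = factor(c("gas", "gas")))

res <- unseen_levels(train, valid)
res
```

With the objects of this case the call would be `unseen_levels(datos.entrenamiento, datos.validacion)`; an empty list means every validation level was seen in training.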

kable(head(comparaciones, 10), caption = "Multiple linear regression. Actual prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Multiple linear regression. Actual prices vs. predicted prices. First 10 predictions
precio_real precio_predicciones
1 13495 14894.135
3 16500 7155.108
7 17710 20258.961
12 16925 13669.073
14 21105 20687.997
27 7609 6777.815
38 7895 9636.339
52 6095 5680.662
53 6795 5699.929
56 10945 12735.366

RMSE of the rm model

rmse_rm <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rm
## [1] 3272.124
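
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values, which is what `rmse()` from the Metrics package computes. A minimal base R sketch, with `rmse_manual` as a hypothetical helper name and made-up numbers for the check:

```r
# Hand-written RMSE (hypothetical helper), same argument order as
# rmse(actual, predicted)
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Small check with made-up numbers: the errors are 0, 0 and 2,
# so the result is sqrt(4 / 3)
rmse_manual(c(1, 2, 3), c(1, 2, 5))
```

Because the errors are squared before averaging, a single large miss such as the second row of the comparison table (16500 actual vs. 7155.108 predicted) weighs heavily on the statistic.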

Regression tree model (AR)

The regression tree model (ar) is built.

modelo_ar <- rpart(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento )
modelo_ar
## n= 165 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 165 11742610000 13495.790  
##    2) enginesize< 182 148  3021864000 11091.150  
##      4) highwaympg>=28.5 99   585475300  8595.414  
##        8) carlength< 175.5 78   158367900  7738.115 *
##        9) carlength>=175.5 21   156851600 11779.670 *
##      5) highwaympg< 28.5 49   573881700 16133.550  
##       10) horsepower< 112.5 17    90563670 13697.760 *
##       11) horsepower>=112.5 32   328872800 17427.570 *
##    3) enginesize>=182 17   414635700 34430.320 *

Variable importance

Pending
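
One possible way to fill this step, offered only as a sketch: a fitted rpart object stores its importance scores in the `variable.importance` component. The example fits a stand-in tree on the built-in mtcars data, not on the car price data, and the names `tree` and `importance` are chosen here.

```r
library(rpart)

# Stand-in tree on the built-in mtcars data (assumption: not the car price data)
tree <- rpart(mpg ~ ., data = mtcars)

# Named numeric vector of importance scores, one entry per variable
importance <- tree$variable.importance
importance

# The same scores as percentages of their total
round(100 * importance / sum(importance), 1)
```

For the tree in this case the equivalent expression would be `modelo_ar$variable.importance`.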

Visualization of the regression tree

rpart.plot(modelo_ar)

Predictions from the model (ar)

predicciones_ar <- predict(object = modelo_ar, newdata = datos.validacion)
predicciones_ar
##         1         3         7        12        14        27        38        52 
## 13697.765 17427.568 13697.765 11779.667 17427.568  7738.115  7738.115  7738.115 
##        53        56        60        67        71        90        97        98 
##  7738.115 13697.765 11779.667  7738.115 34430.324  7738.115  7738.115  7738.115 
##       113       115       116       121       132       136       141       148 
## 11779.667 13697.765 13697.765  7738.115 11779.667 13697.765  7738.115  7738.115 
##       152       154       155       161       163       165       166       176 
##  7738.115  7738.115  7738.115  7738.115  7738.115  7738.115  7738.115 11779.667 
##       178       180       181       190       195       197       199       204 
## 11779.667 17427.568 17427.568  7738.115 17427.568 17427.568 17427.568 13697.765

Comparison table

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_ar)
kable(head(comparaciones, 10), caption = "Regression tree. Actual prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Regression tree. Actual prices vs. predicted prices. First 10 predictions
precio_real precio_predicciones
1 13495 13697.765
3 16500 17427.568
7 17710 13697.765
12 16925 11779.667
14 21105 17427.568
27 7609 7738.115
38 7895 7738.115
52 6095 7738.115
53 6795 7738.115
56 10945 13697.765

RMSE of the ar model

rmse_ar <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_ar
## [1] 3142.833

Random forest model (RF)

The random forest model (rf) is built with 20 trees.

modelo_rf <- randomForest(x = datos.entrenamiento[,c("symboling", "fueltype", "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginetype", "cylindernumber", "enginesize", "fuelsystem", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")], 
                          y = datos.entrenamiento[,'price'], 
                          importance = TRUE, 
                          keep.forest = TRUE, 
                          ntree=20)
modelo_rf
## 
## Call:
##  randomForest(x = datos.entrenamiento[, c("symboling", "fueltype",      "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation",      "wheelbase", "carlength", "carwidth", "carheight", "curbweight",      "enginetype", "cylindernumber", "enginesize", "fuelsystem",      "boreratio", "stroke", "compressionratio", "horsepower",      "peakrpm", "citympg", "highwaympg")], y = datos.entrenamiento[,      "price"], ntree = 20, importance = TRUE, keep.forest = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 20
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 5228164
##                     % Var explained: 92.65

Variable importance

as.data.frame(modelo_rf$importance) %>%
    arrange(desc(IncNodePurity))
##                      %IncMSE IncNodePurity
## enginesize       26700428.60  2.936059e+09
## horsepower       14135156.67  2.384172e+09
## citympg          10901979.44  1.382432e+09
## curbweight        8492078.98  1.277358e+09
## highwaympg        6873665.93  1.163907e+09
## cylindernumber    4231464.58  9.070652e+08
## drivewheel        2475227.06  3.555691e+08
## carwidth          1259453.96  2.962653e+08
## carlength         1606190.67  2.412287e+08
## boreratio         1562423.12  2.089318e+08
## wheelbase         1092363.05  1.225738e+08
## compressionratio   488167.56  1.119685e+08
## fuelsystem        1361864.39  1.054098e+08
## peakrpm            724722.47  9.191926e+07
## carbody           1306911.88  8.823256e+07
## stroke             501644.98  6.101246e+07
## enginelocation     487090.93  4.706763e+07
## carheight          310523.45  2.383873e+07
## aspiration         337177.91  2.047677e+07
## symboling          171429.30  1.571178e+07
## enginetype         258279.67  8.917956e+06
## doornumber          80491.18  8.294080e+06
## fueltype                0.00  2.709375e+04

Predictions from the model (rf)

predicciones_rf <- predict(object = modelo_rf, newdata = datos.validacion)
predicciones_rf
##         1         3         7        12        14        27        38        52 
## 14876.004 15712.824 19685.660 12917.123 19295.866  6811.736  8816.698  5931.224 
##        53        56        60        67        71        90        97        98 
##  5931.224 12843.357 10728.432 11522.212 28694.987  7302.545  7118.512  7419.763 
##       113       115       116       121       132       136       141       148 
## 15928.975 16856.017 13488.805  6492.623 10650.431 14363.554  7271.007  9956.282 
##       152       154       155       161       163       165       166       176 
##  6218.555  7696.061  8112.408  7228.502  7562.012  7880.615  9758.815 10145.972 
##       178       180       181       190       195       197       199       204 
## 10265.382 17558.471 16946.797  9936.677 16187.671 16526.521 19134.203 19221.157

Comparison table

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_rf)
kable(head(comparaciones, 10), caption = "Random forest. Actual prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Random forest. Actual prices vs. predicted prices. First 10 predictions
precio_real precio_predicciones
1 13495 14876.004
3 16500 15712.824
7 17710 19685.660
12 16925 12917.123
14 21105 19295.866
27 7609 6811.736
38 7895 8816.698
52 6095 5931.224
53 6795 5931.224
56 10945 12843.357

RMSE of the rf model

rmse_rf <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rf
## [1] 1875.285

Model evaluation

The predictions are compared

comparaciones <- data.frame(cbind(datos.validacion[,-1], predicciones_rm, predicciones_ar, predicciones_rf))

The predictions of each model are displayed

kable(comparaciones, caption = "Predictions of the models") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
Predictions of the models
fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price predicciones_rm predicciones_ar predicciones_rf
1 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.680 9.0 111 5000 21 27 13495 14894.135 13697.765 14876.004
3 gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.470 9.0 154 5000 19 26 16500 7155.108 17427.568 15712.824
7 gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.400 8.5 110 5500 19 25 17710 20258.961 13697.765 19685.660
12 gas std four sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.800 8.8 101 5800 23 29 16925 13669.073 11779.667 12917.123
14 gas std four sedan rwd front 101.2 176.8 64.8 54.3 2765 ohc six 164 mpfi 3.31 3.190 9.0 121 4250 21 28 21105 20687.997 17427.568 19295.866
27 gas std four sedan fwd front 93.7 157.3 63.8 50.6 1989 ohc four 90 2bbl 2.97 3.230 9.4 68 5500 31 38 7609 6777.815 7738.115 6811.736
38 gas std two hatchback fwd front 96.5 167.5 65.2 53.3 2236 ohc four 110 1bbl 3.15 3.580 9.0 86 5800 27 33 7895 9636.339 7738.115 8816.698
52 gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1900 ohc four 91 2bbl 3.03 3.150 9.0 68 5000 31 38 6095 5680.662 7738.115 5931.224
53 gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1905 ohc four 91 2bbl 3.03 3.150 9.0 68 5000 31 38 6795 5699.929 7738.115 5931.224
56 gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2380 rotor two 70 4bbl 3.33 3.255 9.4 101 6000 17 23 10945 12735.366 13697.765 12843.357
60 gas std two hatchback fwd front 98.8 177.8 66.5 53.7 2385 ohc four 122 2bbl 3.39 3.390 8.6 84 4800 26 32 8845 10401.997 11779.667 10728.433
67 diesel std four sedan rwd front 104.9 175.0 66.1 54.4 2700 ohc four 134 idi 3.43 3.640 22.0 72 4200 31 39 18344 10804.809 7738.115 11522.212
71 diesel turbo four sedan rwd front 115.6 202.6 71.7 56.3 3770 ohc five 183 idi 3.58 3.640 21.5 123 4350 22 25 31600 27861.474 34430.324 28694.987
90 gas std two sedan fwd front 94.5 165.3 63.8 54.5 1889 ohc four 97 2bbl 3.15 3.290 9.4 69 5200 31 37 5499 6863.488 7738.115 7302.545
97 gas std four sedan fwd front 94.5 165.3 63.8 54.5 1971 ohc four 97 2bbl 3.15 3.290 9.4 69 5200 31 37 7499 6551.147 7738.115 7118.512
98 gas std four wagon fwd front 94.5 170.2 63.8 53.5 2037 ohc four 97 2bbl 3.15 3.290 9.4 69 5200 31 37 7999 5113.741 7738.115 7419.763
113 diesel turbo four sedan rwd front 107.9 186.7 68.4 56.7 3252 l four 152 idi 3.70 3.520 21.0 95 4150 28 33 16900 16459.548 11779.667 15928.975
115 diesel turbo four wagon rwd front 114.2 198.9 68.4 58.7 3485 l four 152 idi 3.70 3.520 21.0 95 4150 25 25 17075 14761.806 13697.765 16856.017
116 gas std four sedan rwd front 107.9 186.7 68.4 56.7 3075 l four 120 mpfi 3.46 3.190 8.4 97 5000 19 24 16630 11449.746 13697.765 13488.805
121 gas std four hatchback fwd front 93.7 157.3 63.8 50.6 1967 ohc four 90 2bbl 2.97 3.230 9.4 68 5500 31 38 6229 5399.805 7738.115 6492.623
132 gas std two hatchback fwd front 96.1 176.8 66.6 50.5 2460 ohc four 132 mpfi 3.46 3.900 8.7 90 5100 23 31 9895 9474.092 11779.667 10650.431
136 gas std four sedan fwd front 99.1 186.6 66.5 56.1 2758 ohc four 121 mpfi 3.54 3.070 9.3 110 5250 21 28 15510 14441.362 13697.765 14363.554
141 gas std two hatchback 4wd front 93.3 157.3 63.8 55.7 2240 ohcf four 108 2bbl 3.62 2.640 8.7 73 4400 26 31 7603 6104.461 7738.115 7271.007
148 gas std four wagon fwd front 97.0 173.5 65.4 53.0 2455 ohcf four 108 mpfi 3.62 2.640 9.0 94 5200 25 31 10198 8839.126 7738.115 9956.282
152 gas std two hatchback fwd front 95.7 158.7 63.6 54.5 2040 ohc four 92 2bbl 3.05 3.030 9.0 62 4800 31 38 6338 5827.832 7738.115 6218.555
154 gas std four wagon fwd front 95.7 169.7 63.6 59.1 2280 ohc four 92 2bbl 3.05 3.030 9.0 62 4800 31 37 6918 6193.676 7738.115 7696.061
155 gas std four wagon 4wd front 95.7 169.7 63.6 59.1 2290 ohc four 92 2bbl 3.05 3.030 9.0 62 4800 27 32 7898 5619.714 7738.115 8112.408
161 gas std four sedan fwd front 95.7 166.3 64.4 53.0 2094 ohc four 98 2bbl 3.19 3.030 9.0 70 4800 38 47 7738 8703.080 7738.115 7228.502
163 gas std four sedan fwd front 95.7 166.3 64.4 52.8 2140 ohc four 98 2bbl 3.19 3.030 9.0 70 4800 28 34 9258 7708.080 7738.115 7562.012
165 gas std two hatchback rwd front 94.5 168.7 64.0 52.6 2204 ohc four 98 2bbl 3.19 3.030 9.0 70 4800 29 34 8238 6606.759 7738.115 7880.615
166 gas std two sedan rwd front 94.5 168.7 64.0 52.6 2265 dohc four 98 mpfi 3.24 3.080 9.4 112 6600 26 29 9298 10188.249 7738.115 9758.815
176 gas std four hatchback fwd front 102.4 175.6 66.5 53.9 2414 ohc four 122 mpfi 3.31 3.540 8.7 92 4200 27 32 9988 6464.469 11779.667 10145.972
178 gas std four hatchback fwd front 102.4 175.6 66.5 53.9 2458 ohc four 122 mpfi 3.31 3.540 8.7 92 4200 27 32 11248 6634.020 11779.667 10265.382
180 gas std two hatchback rwd front 102.9 183.5 67.7 52.0 3016 dohc six 171 mpfi 3.27 3.350 9.3 161 5200 19 24 15998 22438.604 17427.568 17558.471
181 gas std four sedan rwd front 104.5 187.8 66.5 54.1 3131 dohc six 171 mpfi 3.27 3.350 9.2 156 5200 20 24 15690 22973.060 17427.568 16946.797
190 gas std two convertible fwd front 94.5 159.3 64.2 55.6 2254 ohc four 109 mpfi 3.19 3.400 8.5 90 5500 24 29 11595 12646.596 7738.115 9936.677
195 gas std four sedan rwd front 104.3 188.8 67.2 56.2 2912 ohc four 141 mpfi 3.78 3.150 9.5 114 5400 23 28 12940 18073.803 17427.568 16187.671
197 gas std four sedan rwd front 104.3 188.8 67.2 56.2 2935 ohc four 141 mpfi 3.78 3.150 9.5 114 5400 24 28 15985 18173.967 17427.568 16526.521
199 gas turbo four sedan rwd front 104.3 188.8 67.2 56.2 3045 ohc four 130 mpfi 3.62 3.150 7.5 162 5100 17 22 18420 18748.983 17427.568 19134.203
204 diesel turbo four sedan rwd front 109.1 188.8 68.9 55.5 3217 ohc six 145 idi 3.01 3.400 23.0 106 4800 26 27 22470 26624.392 13697.765 19221.158

The RMSE values are compared

# Named rmse_modelos so the data frame does not share a name with the rmse() function
rmse_modelos <- data.frame(rm = rmse_rm, ar = rmse_ar, rf = rmse_rf)
kable(rmse_modelos, caption = "RMSE statistic of each model") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
RMSE statistic of each model
rm ar rf
3272.124 3142.833 1875.285
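
The comparison can also be made programmatically. The sketch below copies the three RMSE values from the table above into a named vector (`rmse_values` and `best` are illustrative names) and picks the model with the smallest error.

```r
# RMSE values copied from the table above
rmse_values <- c(rm = 3272.124, ar = 3142.833, rf = 1875.285)

# Model with the smallest validation error
best <- names(which.min(rmse_values))
best

# Each model's RMSE relative to the best one (1 marks the best model)
round(rmse_values / min(rmse_values), 2)
```

The ratios give a quick sense of how far the other two models are from the best one on this particular validation split.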

Interpretation

All the numeric and categorical variables in the car price data, except car_ID and CarName, were used as predictors.

The multiple linear regression model highlights several statistically significant predictors and reaches an Adjusted R-squared of 0.9432 on the training data.

Measured by RMSE, and for this particular 80%/20% split into training and validation data, the best model was the random forest (RMSE 1875.285), ahead of the regression tree (3142.833) and the multiple linear regression (3272.124).