Objective

Compare supervised models by applying algorithms that predict automobile prices, using the root mean squared error (RMSE) statistic as the basis for comparison.

Description

The previously prepared data are loaded from https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv

All variables in the data set take part, except the identifier columns removed during data preparation.

Training data are created with 80% of the observations.

Validation data are created with the remaining 20%.

A multiple regression model is built with the training data.

This model is used to answer questions such as:

Which variables are significant predictors at a confidence level above 90%?

What is the value of Adjusted R-squared, that is, how much of the variation in vehicle price do the independent variables explain?

Predictions are generated with the validation data.

The RMSE statistic is computed for comparison purposes.

A regression tree model is built with the training data.

The importance of the variables for price is identified.

The regression tree and its decision rules are visualized.

Predictions are made with the validation data.

The RMSE statistic is computed for comparison purposes.

A random forest model is built with the training data and 20 trees.

The importance of the variables for price is identified.

Predictions are generated with the validation data.

The RMSE statistic is computed for comparison purposes.

At the end of the case, a personal interpretation is given.

Development

Load libraries

# Libraries
library(readr)
library(PerformanceAnalytics) # For correlation plots
library(dplyr)
library(knitr) # For tabular data
library(kableExtra) # For friendlier tables
library(ggplot2) # For plotting
library(plotly) # For plotting
library(caret)  # For partitioning the data
library(Metrics) # For computing rmse
library(rpart) # For the regression tree
library(rpart.plot) # For plotting the regression tree
library(randomForest) # For random forest
library(reshape)    # For renaming columns

Load data

datos <-  read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv", 
                   fileEncoding = "UTF-8", 
                   stringsAsFactors = TRUE)

Data exploration

There are 205 observations and 26 variables. Both the numeric and the categorical variables are used as predictors.

str(datos)
## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : Factor w/ 147 levels "alfa-romero giulia",..: 1 3 2 4 5 9 5 7 6 8 ...
##  $ fueltype        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration      : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ doornumber      : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
##  $ carbody         : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drivewheel      : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ enginelocation  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ cylindernumber  : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...
kable(head(datos, 10), caption = "Car price data") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Car price data
car_ID symboling CarName fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
1 3 alfa-romero giulia gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.00
2 3 alfa-romero stelvio gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.00
3 1 alfa-romero Quadrifoglio gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.00
4 2 audi 100 ls gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.00
5 2 audi 100ls gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.00
6 2 audi fox gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250.00
7 1 audi 100ls gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710.00
8 1 audi 5000 gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920.00
9 1 audi 4000 gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875.00
10 0 audi 5000s (diesel) gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.0 160 5500 16 22 17859.17

Data dictionary

Col Name Description
1 car_ID Unique id of each observation (Integer)
2 symboling Assigned insurance risk rating; a value of +3 indicates that the auto is risky, -3 that it is probably pretty safe (Categorical)
3 CarName Name of the car company and model (Categorical)
4 fueltype Car fuel type, i.e. gas or diesel (Categorical)
5 aspiration Aspiration used in a car (Categorical): std or turbo
6 doornumber Number of doors in a car (Categorical)
7 carbody Body of car (Categorical): convertible, sedan, wagon …
8 drivewheel Type of drive wheel (Categorical): 4wd, fwd, rwd
9 enginelocation Location of car engine (Categorical): front or rear
10 wheelbase Wheelbase of car (Numeric). Distance between axles, in inches
11 carlength Length of car (Numeric)
12 carwidth Width of car (Numeric)
13 carheight Height of car (Numeric)
14 curbweight Weight of a car without occupants or baggage (Numeric)
15 enginetype Type of engine (Categorical)
16 cylindernumber Number of cylinders in the car (Categorical)
17 enginesize Size of the engine (Numeric), in …
18 fuelsystem Fuel system of car (Categorical)
19 boreratio Bore ratio of car (Numeric)
20 stroke Stroke or volume inside the engine (Numeric)
21 compressionratio Compression ratio of car (Numeric)
22 horsepower Horsepower (Numeric). Engine power
23 peakrpm Car peak rpm (Numeric). Peak revolutions per minute
24 citympg Mileage in city (Numeric), in miles per gallon
25 highwaympg Mileage on highway (Numeric), in miles per gallon
26 price Price of car (Numeric), in dollars. Dependent variable

Source: https://archive.ics.uci.edu/ml/datasets/Automobile

Prepare the data

Remove the variables of no statistical interest, that is, columns 1 and 3, car_ID and CarName.

datos <- datos[, c(2,4:26)]

The first records again

kable(head(datos, 10), caption = "Car price data") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Car price data
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.00
3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.00
1 gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.00
2 gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.00
2 gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.00
2 gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250.00
1 gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710.00
1 gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920.00
1 gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875.00
0 gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.0 160 5500 16 22 17859.17

Training and validation data

80% of the data are used for training and the remaining 20% for validation.

n <- nrow(datos)
set.seed(1306) # Seed for reproducibility
entrena <- createDataPartition(y = datos$price, p = 0.80, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ]  # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]
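As a rough illustration of what an 80/20 partition does, the sketch below performs a simple random split in base R. It is not the method used above: createDataPartition() samples within groups of the outcome, so its partition sizes can differ slightly from a plain random split. The data frame toy and its columns are hypothetical stand-ins for the car price data.

```r
# Simple random 80/20 split in base R (a sketch, not caret's stratified split)
set.seed(1306)

# Hypothetical stand-in for the car data: 205 rows, a price and one predictor
toy <- data.frame(price = runif(205, 5000, 45000), x = rnorm(205))

# Row indices for the training set: 80% of the rows, sampled without replacement
idx <- sample(seq_len(nrow(toy)), size = round(0.80 * nrow(toy)))

train <- toy[idx, ]   # rows selected for training
valid <- toy[-idx, ]  # all remaining rows go to validation

# Sizes of both partitions
c(train = nrow(train), valid = nrow(valid))
```

Every row ends up in exactly one of the two sets, which is the property the validation RMSE relies on.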

Training data

kable(head(datos.entrenamiento, 10), caption = "Training data. Car prices") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Training data. Car prices
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
1 3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.00
2 3 gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.00
4 2 gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.00
6 2 gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250.00
7 1 gas std four sedan fwd front 105.8 192.7 71.4 55.7 2844 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710.00
8 1 gas std four wagon fwd front 105.8 192.7 71.4 55.7 2954 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920.00
9 1 gas turbo four sedan fwd front 105.8 192.7 71.4 55.9 3086 ohc five 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875.00
10 0 gas turbo two hatchback 4wd front 99.5 178.2 67.9 52.0 3053 ohc five 131 mpfi 3.13 3.40 7.0 160 5500 16 22 17859.17
12 0 gas std four sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.80 8.8 101 5800 23 29 16925.00
13 0 gas std two sedan rwd front 101.2 176.8 64.8 54.3 2710 ohc six 164 mpfi 3.31 3.19 9.0 121 4250 21 28 20970.00

Validation data

kable(head(datos.validacion, 10), caption = "Validation data. Car prices") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Validation data. Car prices
symboling fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
3 1 gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.470 9.0 154 5000 19 26 16500
5 2 gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.400 8.0 115 5500 18 22 17450
11 2 gas std two sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.800 8.8 101 5800 23 29 16430
14 0 gas std four sedan rwd front 101.2 176.8 64.8 54.3 2765 ohc six 164 mpfi 3.31 3.190 9.0 121 4250 21 28 21105
39 0 gas std two hatchback fwd front 96.5 167.5 65.2 53.3 2289 ohc four 110 1bbl 3.15 3.580 9.0 86 5800 27 33 9095
41 0 gas std four sedan fwd front 96.5 175.4 62.5 54.1 2372 ohc four 110 1bbl 3.15 3.580 9.0 86 5800 27 33 10295
53 1 gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1905 ohc four 91 2bbl 3.03 3.150 9.0 68 5000 31 38 6795
57 3 gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2380 rotor two 70 4bbl 3.33 3.255 9.4 101 6000 17 23 11845
65 0 gas std four hatchback fwd front 98.8 177.8 66.5 55.5 2425 ohc four 122 2bbl 3.39 3.390 8.6 84 4800 26 32 11245
69 -1 diesel turbo four wagon rwd front 110.0 190.9 70.3 58.7 3750 ohc five 183 idi 3.58 3.640 21.5 123 4350 22 25 28248

Supervised models

Multiple linear regression model (RM)

The multiple linear regression model (rm) is built. The variable price is modeled as a function of all the independent variables, both numeric and non-numeric.

The expression price ~ . means price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg

# Multiple linear regression model to observe the important variables
#modelo_rm <- lm(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento)
modelo_rm <- lm(formula = price ~ . , 
                data = datos.entrenamiento)
summary(modelo_rm)
## 
## Call:
## lm(formula = price ~ ., data = datos.entrenamiento)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4580.9 -1100.1   -30.2   852.7  8617.7 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -3.141e+04  1.914e+04  -1.641 0.103261    
## symboling             3.251e+00  2.824e+02   0.012 0.990835    
## fueltypegas          -1.232e+04  8.168e+03  -1.508 0.134143    
## aspirationturbo       1.946e+03  9.920e+02   1.962 0.052052 .  
## doornumbertwo         2.793e+02  6.331e+02   0.441 0.659890    
## carbodyhardtop       -3.503e+03  1.485e+03  -2.359 0.019889 *  
## carbodyhatchback     -3.562e+03  1.334e+03  -2.670 0.008604 ** 
## carbodysedan         -2.310e+03  1.447e+03  -1.596 0.112988    
## carbodywagon         -3.209e+03  1.614e+03  -1.989 0.048927 *  
## drivewheelfwd         6.980e+01  1.329e+03   0.053 0.958189    
## drivewheelrwd         4.682e+02  1.541e+03   0.304 0.761772    
## enginelocationrear    8.030e+03  2.979e+03   2.696 0.008010 ** 
## wheelbase             3.154e+01  1.148e+02   0.275 0.784023    
## carlength            -8.723e+01  5.612e+01  -1.554 0.122678    
## carwidth              9.375e+02  2.829e+02   3.313 0.001211 ** 
## carheight            -1.117e+00  1.558e+02  -0.007 0.994290    
## curbweight            3.665e+00  2.307e+00   1.589 0.114667    
## enginetypedohcv      -8.153e+03  5.226e+03  -1.560 0.121358    
## enginetypel          -6.848e+02  1.838e+03  -0.372 0.710169    
## enginetypeohc         4.326e+03  1.043e+03   4.149 6.17e-05 ***
## enginetypeohcf        9.251e+02  1.968e+03   0.470 0.639086    
## enginetypeohcv       -6.976e+03  1.475e+03  -4.728 6.09e-06 ***
## enginetyperotor       1.008e+03  4.767e+03   0.211 0.832947    
## cylindernumberfive   -1.092e+04  3.031e+03  -3.601 0.000458 ***
## cylindernumberfour   -1.077e+04  3.319e+03  -3.246 0.001507 ** 
## cylindernumbersix    -6.449e+03  2.443e+03  -2.640 0.009372 ** 
## cylindernumberthree   4.231e+01  4.757e+03   0.009 0.992918    
## cylindernumbertwelve -1.081e+04  4.902e+03  -2.206 0.029241 *  
## cylindernumbertwo            NA         NA      NA       NA    
## enginesize            1.291e+02  2.905e+01   4.443 1.95e-05 ***
## fuelsystem2bbl        1.694e+02  1.025e+03   0.165 0.868982    
## fuelsystem4bbl       -1.574e+03  2.967e+03  -0.530 0.596775    
## fuelsystemidi                NA         NA      NA       NA    
## fuelsystemmfi        -4.136e+03  2.719e+03  -1.521 0.130814    
## fuelsystemmpfi       -1.450e+01  1.167e+03  -0.012 0.990109    
## fuelsystemspdi       -2.894e+03  1.586e+03  -1.825 0.070456 .  
## fuelsystemspfi       -3.275e+02  2.593e+03  -0.126 0.899721    
## boreratio            -4.256e+01  1.920e+03  -0.022 0.982353    
## stroke               -5.102e+03  1.085e+03  -4.702 6.78e-06 ***
## compressionratio     -8.133e+02  6.161e+02  -1.320 0.189257    
## horsepower            7.932e-01  2.716e+01   0.029 0.976752    
## peakrpm               2.443e+00  7.345e-01   3.326 0.001163 ** 
## citympg              -4.223e+01  1.593e+02  -0.265 0.791330    
## highwaympg            9.591e+01  1.467e+02   0.654 0.514475    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2171 on 123 degrees of freedom
## Multiple R-squared:  0.9494, Adjusted R-squared:  0.9325 
## F-statistic: 56.24 on 41 and 123 DF,  p-value: < 2.2e-16
  • Which variables are significant predictors at a confidence level above 90%?

  • The intercept does not reach that level: its p-value is 0.103.

  • The coefficients with a p-value below 0.10 are aspirationturbo, carbodyhardtop, carbodyhatchback, carbodywagon, enginelocationrear, carwidth, enginetypeohc, enginetypeohcv, cylindernumberfive, cylindernumberfour, cylindernumbersix, cylindernumbertwelve, enginesize, fuelsystemspdi, stroke and peakrpm.

  • Since some predictors do not reach the 90% confidence level, one may want to build a model with only the predictors that do. This is left for future work and is not done in this case.

  • In multiple linear models, the statistic Adjusted R-squared: 0.9325 means that the independent variables explain approximately 93.25% of the variation in the dependent variable price, after adjusting for the number of predictors.
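The answers above can also be read programmatically from the model summary. The following is a minimal sketch that uses the built-in mtcars data as a stand-in, so that it runs on its own; fit plays the role of modelo_rm, and the names coefs, significant and adj_r2 are illustrative only.

```r
# Stand-in model: mpg as a function of every other column of mtcars
fit <- lm(mpg ~ ., data = mtcars)

# Coefficient table from summary(): estimate, standard error, t value, p-value
# (coefficients that are NA because of singularities are not listed here)
coefs <- as.data.frame(summary(fit)$coefficients)
names(coefs) <- c("estimate", "std_error", "t_value", "p_value")

# Keep the terms that are significant at the 90% confidence level (p < 0.10)
significant <- coefs[coefs$p_value < 0.10, ]
significant[order(significant$p_value), ]

# Adjusted R-squared as a single number
adj_r2 <- summary(fit)$adj.r.squared
adj_r2
```

Applied to modelo_rm, the same filter would be expected to return the rows marked with '.', '*', '**' or '***' in the summary output.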

Predictions of the rm model

predicciones_rm <- predict(object = modelo_rm, newdata = datos.validacion)
## Warning in predict.lm(object = modelo_rm, newdata = datos.validacion):
## prediction from a rank-deficient fit may be misleading
predicciones_rm
##         3         5        11        14        39        41        53        57 
##  8264.586 16068.198 13891.389 20587.000  9233.280  7288.972  5774.748 12285.837 
##        65        69        76        78        86        90        94       114 
## 10347.583 27197.948 20306.878  6907.492 11008.040  6219.037  5109.150 15530.930 
##       116       122       125       127       131       138       140       143 
## 11425.499  6348.924 14834.549 33695.690 10046.023 12492.372  5820.163  7075.984 
##       152       154       155       156       163       167       172       175 
##  6070.678  5960.038  5616.276  8621.581  8013.357  6349.528 13069.459 12737.973 
##       185       186       191       193       195       200       203       204 
##  8589.882  9392.483  8158.094 11034.849 17622.319 18765.849 18888.422 25689.243

Comparison table

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_rm)

With seed 1306 and after the tests carried out, it is concluded that the training data must cover every value that the categorical variables take in the validation data; that is, the validation data must not contain any category level that the model was not trained on.
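The coverage condition described above can be checked before fitting. The sketch below is one possible way to do it in base R; the helper levels_covered() and the two small data frames are hypothetical and only illustrate the idea.

```r
# For each factor column, TRUE when every level present in the validation
# data also appears in the training data (hypothetical helper)
levels_covered <- function(train, valid) {
  factor_cols <- names(train)[sapply(train, is.factor)]
  sapply(factor_cols, function(col) {
    all(unique(as.character(valid[[col]])) %in%
          unique(as.character(train[[col]])))
  })
}

# Toy data: "hardtop" appears only in the validation set
train <- data.frame(fuel = factor(c("gas", "gas", "diesel")),
                    body = factor(c("sedan", "wagon", "sedan")))
valid <- data.frame(fuel = factor("gas"),
                    body = factor("hardtop"))

res <- levels_covered(train, valid)
res
```

A FALSE entry flags a variable for which predict() on a linear model would fail, because the validation data contain a level with no fitted coefficient.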

kable(head(comparaciones, 10), caption = "Multiple linear regression. Actual prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Multiple linear regression. Actual prices vs. predicted prices. First 10 predictions
precio_real precio_predicciones
3 16500 8264.586
5 17450 16068.198
11 16430 13891.389
14 21105 20587.000
39 9095 9233.280
41 10295 7288.972
53 6795 5774.748
57 11845 12285.837
65 11245 10347.583
69 28248 27197.948

RMSE of the rm model

rmse_rm <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rm
## [1] 2627.374
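For reference, the RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below writes that definition out in base R; rmse_manual() is a hypothetical name, and the six numbers are the first three rows of the comparison table above.

```r
# Root mean squared error written out from its definition
rmse_manual <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# First three validation rows from the comparison table (rows 3, 5 and 11)
actual    <- c(16500, 17450, 16430)
predicted <- c(8264.586, 16068.198, 13891.389)

rmse_manual(actual, predicted)
```

Because the errors are squared before averaging, a single large miss such as row 3 weighs heavily on the result.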

Regression tree model (AR)

The regression tree model (ar) is built.

modelo_ar <- rpart(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento )
modelo_ar
## n= 165 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 165 11445220000 13426.070  
##   2) enginesize< 182 149  3002948000 11135.320  
##     4) curbweight< 2544 96   484534700  8449.750  
##       8) curbweight< 2291.5 58    78729770  7209.621 *
##       9) curbweight>=2291.5 38   180459100 10342.580 *
##     5) curbweight>=2544 53   571917500 15999.740 *
##   3) enginesize>=182 16   379081000 34758.720 *

Variable importance

Pending
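One way to fill in this step is to read the variable.importance component that rpart stores in a fitted tree. The sketch below uses the built-in mtcars data as a stand-in so that it runs on its own; fit_tree plays the role of modelo_ar, and the result would of course differ for the car price data.

```r
library(rpart)

# Stand-in regression tree: mpg as a function of every other column of mtcars
fit_tree <- rpart(mpg ~ ., data = mtcars)

# Named numeric vector, sorted from most to least important variable
importance <- fit_tree$variable.importance

# Express each variable's importance as a percentage of the total
round(100 * importance / sum(importance), 1)
```

The vector also credits variables that only act as surrogate splits, so it can list variables that do not appear in the printed tree.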

Regression tree visualization

rpart.plot(modelo_ar)
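Besides the plot, the decision rules that lead to each terminal node can be listed as text. The following sketch uses path.rpart() from the rpart package on a stand-in tree fitted to the built-in mtcars data; fit_tree plays the role of modelo_ar, and leaf_ids and rules are illustrative names.

```r
library(rpart)

# Stand-in regression tree fitted to mtcars
fit_tree <- rpart(mpg ~ ., data = mtcars)

# Node numbers of the terminal nodes (leaves) of the tree
leaf_ids <- as.numeric(rownames(fit_tree$frame)[fit_tree$frame$var == "<leaf>"])

# For each leaf, the sequence of split conditions from the root down to it
rules <- path.rpart(fit_tree, nodes = leaf_ids, print.it = FALSE)

# Print one rule per leaf, dropping the leading "root" element
for (id in names(rules)) {
  cat("node", id, ":", paste(rules[[id]][-1], collapse = " & "), "\n")
}
```

Each printed line corresponds to one of the terminal nodes marked with * in the printed tree.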

Predictions of the (ar) model

predicciones_ar <- predict(object = modelo_ar, newdata = datos.validacion)
predicciones_ar
##         3         5        11        14        39        41        53        57 
## 15999.739 15999.739 10342.579 15999.739  7209.621 10342.579  7209.621 10342.579 
##        65        69        76        78        86        90        94       114 
## 10342.579 34758.719 15999.739  7209.621 10342.579  7209.621  7209.621 15999.739 
##       116       122       125       127       131       138       140       143 
## 15999.739  7209.621 15999.739 34758.719 15999.739 15999.739  7209.621  7209.621 
##       152       154       155       156       163       167       172       175 
##  7209.621  7209.621  7209.621 15999.739  7209.621 10342.579 15999.739 10342.579 
##       185       186       191       193       195       200       203       204 
##  7209.621  7209.621  7209.621 15999.739 15999.739 15999.739 15999.739 15999.739

Comparison table

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_ar)
kable(head(comparaciones, 10), caption = "Regression tree. Actual prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Regression tree. Actual prices vs. predicted prices. First 10 predictions
precio_real precio_predicciones
3 16500 15999.739
5 17450 15999.739
11 16430 10342.579
14 21105 15999.739
39 9095 7209.621
41 10295 10342.579
53 6795 7209.621
57 11845 10342.579
65 11245 10342.579
69 28248 34758.719

RMSE of the ar model

rmse_ar <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_ar
## [1] 3086.6

Random forest model (RF)

The random forest model (rf) is built with 20 trees.

modelo_rf <- randomForest(x = datos.entrenamiento[,c("symboling", "fueltype", "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginetype", "cylindernumber", "enginesize", "fuelsystem", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")], 
                          y = datos.entrenamiento[,'price'], 
                          importance = TRUE, 
                          keep.forest = TRUE, 
                          ntree=20)
modelo_rf
## 
## Call:
##  randomForest(x = datos.entrenamiento[, c("symboling", "fueltype",      "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation",      "wheelbase", "carlength", "carwidth", "carheight", "curbweight",      "enginetype", "cylindernumber", "enginesize", "fuelsystem",      "boreratio", "stroke", "compressionratio", "horsepower",      "peakrpm", "citympg", "highwaympg")], y = datos.entrenamiento[,      "price"], ntree = 20, importance = TRUE, keep.forest = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 20
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 5719360
##                     % Var explained: 91.75

Variable importance

as.data.frame(modelo_rf$importance) %>%
    arrange(desc(IncNodePurity))
##                      %IncMSE IncNodePurity
## enginesize       32157957.45    4115689614
## curbweight       20226244.27    2030138954
## horsepower        7495306.57    1094402848
## cylindernumber    4035909.92     966031591
## citympg           9064117.64     945011459
## wheelbase         5465720.46     649163086
## carwidth          1886327.81     320465716
## highwaympg        1504290.36     279479618
## carlength         1788421.32     215404663
## fuelsystem         915233.56     204301305
## boreratio         1222285.71     144444666
## peakrpm           1467348.31     119236341
## carbody            452459.76      71647651
## compressionratio   907198.35      56385233
## enginetype          13463.02      53595516
## carheight          360821.35      41240011
## stroke             257166.56      33308953
## fueltype           563705.50      32479852
## drivewheel         292191.37      25127909
## symboling          392003.47      15338886
## doornumber        -312113.83      12982346
## aspiration          35519.03       2880439
## enginelocation          0.00             0

Predictions of the (rf) model

predicciones_rf <- predict(object = modelo_rf, newdata = datos.validacion)
predicciones_rf
##         3         5        11        14        39        41        53        57 
## 16396.446 15723.201 14202.965 19411.422  8717.329  9531.782  6084.350 12219.111 
##        65        69        76        78        86        90        94       114 
##  9655.298 26821.192 19078.894  6394.962  8504.639  6974.060  7823.352 14331.072 
##       116       122       125       127       131       138       140       143 
## 14049.112  7045.413 14021.855 32443.419 11416.847 16015.829  7689.069  8068.413 
##       152       154       155       156       163       167       172       175 
##  6634.752  7743.184  7824.865 10026.958  7798.067 10150.562 11980.241 13305.775 
##       185       186       191       193       195       200       203       204 
##  8672.855  8303.257  9084.548 15362.950 15463.375 16245.008 20104.245 19340.888

Comparison table

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_rf)
kable(head(comparaciones, 10), caption = "Random forest. Actual prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Random forest. Actual prices vs. predicted prices. First 10 predictions
precio_real precio_predicciones
3 16500 16396.446
5 17450 15723.201
11 16430 14202.965
14 21105 19411.423
39 9095 8717.329
41 10295 9531.782
53 6795 6084.350
57 11845 12219.111
65 11245 9655.298
69 28248 26821.192

RMSE of the rf model

rmse_rf <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rf
## [1] 1538.294

Model evaluation

The predictions are compared

comparaciones <- data.frame(cbind(datos.validacion[,-1], predicciones_rm, predicciones_ar, predicciones_rf))

The predictions of each model are displayed

kable(comparaciones, caption = "Predictions of the models") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
Predictions of the models
fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase carlength carwidth carheight curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price predicciones_rm predicciones_ar predicciones_rf
3 gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.470 9.0 154 5000 19 26 16500 8264.586 15999.739 16396.446
5 gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.400 8.0 115 5500 18 22 17450 16068.198 15999.739 15723.201
11 gas std two sedan rwd front 101.2 176.8 64.8 54.3 2395 ohc four 108 mpfi 3.50 2.800 8.8 101 5800 23 29 16430 13891.389 10342.579 14202.965
14 gas std four sedan rwd front 101.2 176.8 64.8 54.3 2765 ohc six 164 mpfi 3.31 3.190 9.0 121 4250 21 28 21105 20587.000 15999.739 19411.423
39 gas std two hatchback fwd front 96.5 167.5 65.2 53.3 2289 ohc four 110 1bbl 3.15 3.580 9.0 86 5800 27 33 9095 9233.280 7209.621 8717.329
41 gas std four sedan fwd front 96.5 175.4 62.5 54.1 2372 ohc four 110 1bbl 3.15 3.580 9.0 86 5800 27 33 10295 7288.972 10342.579 9531.782
53 gas std two hatchback fwd front 93.1 159.1 64.2 54.1 1905 ohc four 91 2bbl 3.03 3.150 9.0 68 5000 31 38 6795 5774.748 7209.621 6084.350
57 gas std two hatchback rwd front 95.3 169.0 65.7 49.6 2380 rotor two 70 4bbl 3.33 3.255 9.4 101 6000 17 23 11845 12285.837 10342.579 12219.111
65 gas std four hatchback fwd front 98.8 177.8 66.5 55.5 2425 ohc four 122 2bbl 3.39 3.390 8.6 84 4800 26 32 11245 10347.583 10342.579 9655.298
69 diesel turbo four wagon rwd front 110.0 190.9 70.3 58.7 3750 ohc five 183 idi 3.58 3.640 21.5 123 4350 22 25 28248 27197.948 34758.719 26821.192
76 gas turbo two hatchback rwd front 102.7 178.4 68.0 54.8 2910 ohc four 140 mpfi 3.78 3.120 8.0 175 5000 19 24 16503 20306.878 15999.739 19078.894
78 gas std two hatchback fwd front 93.7 157.3 64.4 50.8 1944 ohc four 92 2bbl 2.97 3.230 9.4 68 5500 31 38 6189 6907.492 7209.621 6394.962
86 gas std four sedan fwd front 96.3 172.4 65.4 51.6 2365 ohc four 122 2bbl 3.35 3.460 8.5 88 5000 25 32 6989 11008.040 10342.579 8504.639
90 gas std two sedan fwd front 94.5 165.3 63.8 54.5 1889 ohc four 97 2bbl 3.15 3.290 9.4 69 5200 31 37 5499 6219.037 7209.621 6974.060
94 gas std four wagon fwd front 94.5 170.2 63.8 53.5 2024 ohc four 97 2bbl 3.15 3.290 9.4 69 5200 31 37 7349 5109.150 7209.621 7823.352
114 gas std four wagon rwd front 114.2 198.9 68.4 56.7 3285 l four 120 mpfi 3.46 2.190 8.4 95 5000 19 24 16695 15530.930 15999.739 14331.072
116 gas std four sedan rwd front 107.9 186.7 68.4 56.7 3075 l four 120 mpfi 3.46 3.190 8.4 97 5000 19 24 16630 11425.499 15999.739 14049.112
122 gas std four sedan fwd front 93.7 167.3 63.8 50.8 1989 ohc four 90 2bbl 2.97 3.230 9.4 68 5500 31 38 6692 6348.924 7209.621 7045.413
125 gas turbo two hatchback rwd front 95.9 173.2 66.3 50.2 2818 ohc four 156 spdi 3.59 3.860 7.0 145 5000 19 24 12764 14834.549 15999.739 14021.855
127 gas std two hardtop rwd rear 89.5 168.9 65.0 51.6 2756 ohcf six 194 mpfi 3.74 2.900 9.5 207 5900 17 25 32528 33695.690 34758.719 32443.419
131 gas std four wagon fwd front 96.1 181.5 66.5 55.2 2579 ohc four 132 mpfi 3.46 3.900 8.7 90 5100 23 31 9295 10046.023 15999.739 11416.847
138 gas turbo four sedan fwd front 99.1 186.6 66.5 56.1 2847 dohc four 121 mpfi 3.54 3.070 9.0 160 5500 19 26 18620 12492.372 15999.739 16015.829
140 gas std two hatchback fwd front 93.7 157.9 63.6 53.7 2120 ohcf four 108 2bbl 3.62 2.640 8.7 73 4400 26 31 7053 5820.163 7209.621 7689.069
143 gas std four sedan fwd front 97.2 172.0 65.4 52.5 2190 ohcf four 108 2bbl 3.62 2.640 9.5 82 4400 28 33 7775 7075.984 7209.621 8068.413
152 gas std two hatchback fwd front 95.7 158.7 63.6 54.5 2040 ohc four 92 2bbl 3.05 3.030 9.0 62 4800 31 38 6338 6070.678 7209.621 6634.752
154 gas std four wagon fwd front 95.7 169.7 63.6 59.1 2280 ohc four 92 2bbl 3.05 3.030 9.0 62 4800 31 37 6918 5960.038 7209.621 7743.184
155 gas std four wagon 4wd front 95.7 169.7 63.6 59.1 2290 ohc four 92 2bbl 3.05 3.030 9.0 62 4800 27 32 7898 5616.276 7209.621 7824.865
156 gas std four wagon 4wd front 95.7 169.7 63.6 59.1 3110 ohc four 92 2bbl 3.05 3.030 9.0 62 4800 27 32 8778 8621.581 15999.739 10026.958
163 gas std four sedan fwd front 95.7 166.3 64.4 52.8 2140 ohc four 98 2bbl 3.19 3.030 9.0 70 4800 28 34 9258 8013.357 7209.621 7798.067
167 gas std two hatchback rwd front 94.5 168.7 64.0 52.6 2300 dohc four 98 mpfi 3.24 3.080 9.4 112 6600 26 29 9538 6349.528 10342.579 10150.562
172 gas std two hatchback rwd front 98.4 176.2 65.6 52.0 2714 ohc four 146 mpfi 3.62 3.500 9.3 116 4800 24 30 11549 13069.459 15999.739 11980.241
175 diesel turbo four sedan fwd front 102.4 175.6 66.5 54.9 2480 ohc four 110 idi 3.27 3.350 22.5 73 4500 30 33 10698 12737.973 10342.579 13305.775
185 diesel std four sedan fwd front 97.3 171.7 65.5 55.7 2264 ohc four 97 idi 3.01 3.400 23.0 52 4800 37 46 7995 8589.882 7209.621 8672.855
186 gas std four sedan fwd front 97.3 171.7 65.5 55.7 2212 ohc four 109 mpfi 3.19 3.400 9.0 85 5250 27 34 8195 9392.483 7209.621 8303.257
191 gas std two hatchback fwd front 94.5 165.7 64.0 51.4 2221 ohc four 109 mpfi 3.19 3.400 8.5 90 5500 24 29 9980 8158.094 7209.621 9084.548
193 diesel turbo four sedan fwd front 100.4 180.2 66.9 55.1 2579 ohc four 97 idi 3.01 3.400 23.0 68 4500 33 38 13845 11034.849 15999.739 15362.950
195 gas std four sedan rwd front 104.3 188.8 67.2 56.2 2912 ohc four 141 mpfi 3.78 3.150 9.5 114 5400 23 28 12940 17622.319 15999.739 15463.375
200 gas turbo four wagon rwd front 104.3 188.8 67.2 57.5 3157 ohc four 130 mpfi 3.62 3.150 7.5 162 5100 17 22 18950 18765.849 15999.739 16245.008
203 gas std four sedan rwd front 109.1 188.8 68.9 55.5 3012 ohcv six 173 mpfi 3.58 2.870 8.8 134 5500 18 23 21485 18888.422 15999.739 20104.245
204 diesel turbo four sedan rwd front 109.1 188.8 68.9 55.5 3217 ohc six 145 idi 3.01 3.400 23.0 106 4800 26 27 22470 25689.243 15999.739 19340.888

The RMSE values are compared

# Named rmse_modelos so that the data frame does not share a name with the rmse() function from Metrics
rmse_modelos <- data.frame(rm = rmse_rm, ar = rmse_ar, rf = rmse_rf)
kable(rmse_modelos, caption = "RMSE statistic of each model") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")
RMSE statistic of each model
rm ar rf
2627.374 3086.6 1538.294
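The three RMSE values in the table above can be put on a relative scale to see how far each model is from the best one. The sketch below copies the values from the table; the names rmse_values, best and pct_above_best are illustrative only.

```r
# RMSE values copied from the table above
rmse_values <- c(rm = 2627.374, ar = 3086.6, rf = 1538.294)

# Name of the model with the lowest RMSE
best <- names(which.min(rmse_values))
best

# How much larger each RMSE is than the best one, in percent
pct_above_best <- round(100 * (rmse_values / min(rmse_values) - 1), 1)
pct_above_best
```

Since RMSE is expressed in the units of price, the raw values can also be read directly as a typical prediction error in dollars.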

Interpretation

Automobile price data were loaded, and all variables, both numeric and categorical, were used as predictors.

The multiple linear regression model highlights some statistically significant variables.

According to the root mean squared error (RMSE) statistic, the best model was the random forest, for this particular training and validation data and with an 80%/20% split between training and validation.

In the exercises carried out, the random forest model gave a lower RMSE (1538.29) than the other two models (2627.37 for multiple regression and 3086.60 for the regression tree). This result depends on the seed and the partition used, so it should not be taken as a general rule.
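A conclusion drawn from one partition can be checked by repeating the split with several seeds and counting how often each model wins. The following is a minimal sketch of that idea, under clear assumptions: it uses the built-in mtcars data, a simple random split, and only a linear model and a regression tree; one_split() and rmse_manual() are hypothetical helpers, not part of the analysis above.

```r
library(rpart)

# RMSE from its definition
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit both models on one random 80/20 split and return their validation RMSE
one_split <- function(seed, data, p = 0.80) {
  set.seed(seed)
  idx <- sample(seq_len(nrow(data)), size = round(p * nrow(data)))
  train <- data[idx, ]
  valid <- data[-idx, ]
  fit_lm   <- lm(mpg ~ wt + hp, data = train)
  fit_tree <- rpart(mpg ~ wt + hp, data = train)
  c(lm   = rmse_manual(valid$mpg, predict(fit_lm, valid)),
    tree = rmse_manual(valid$mpg, predict(fit_tree, valid)))
}

# One row per seed, one column per model
results <- t(sapply(1:20, one_split, data = mtcars))

# Average RMSE per model and number of splits won by each
colMeans(results)
table(winner = colnames(results)[apply(results, 1, which.min)])
```

If one model wins on most seeds, the ranking is less likely to be an artifact of a single partition.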