Objetivo

Comparar modelos de supervisados a través de la aplicación de algoritmos de predicción de precios de automóviles determinando el estadístico del error cuadrático medio (rmse).

Descripción

Se cargan los datos previamente preparados de la dirección https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment_Numericas_Preparado.csv
Se crean datos de entrenamiento al 80%
Se crean datos de validación al 20%
Se crea el modelo regresión múltiple con datos de entrenamiento
- Con este modelo se responde a preguntas tales como:
  - ¿cuáles son variables que están por encima del 90% de confianza como predictores?,
  - ¿Cuál es el valor de R Square Adjusted o que tanto representan las variables dependientes al precio del vehículo?
- Se generan predicciones con datos de validación
- Se determina el estadístico RMSE para efectos de comparación
Se crea el modelo árboles de regresión con los datos de entrenamiento
- Se identifica la importancia de las variables sobre el precio
- Se visualiza el árbol de regresión y sus reglas de asociación
Se hacen predicciones con datos de validación
Se determinar el estadístico RMSE para efectos de comparación
Se construye el modelo bosques aleatorios con datos de entrenamiento y con 20 árboles simulados
- Se identifica la importancia de las variables sobre el precio
- Se generan predicciones con datos de validación
- Se determina el estadístico RMSE para efectos de comparación
Al final del caso, se describe una interpretación personal

Desarrollo

Cargar librerías

# Librerías
library(readr)
library(PerformanceAnalytics) # Para correlaciones gráficas
library(dplyr)
library(knitr) # Para datos tabulares
library(kableExtra) # Para datos tabulares amigables
library(ggplot2) # Para visualizar
library(plotly) # Para visualizar
library(caret)  # Para particionar
library(Metrics) # Para determinar rmse

library(rpart) # Para árbol
library(rpart.plot) # Para árbol

library(randomForest) # Para random forest
library(caret) # Para hacer divisiones o particiones
library(reshape)    # Para renombrar columnas

Cargar datos

datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment_Numericas_Preparado.csv")

Exploración de datos

str(datos)

## 'data.frame':    205 obs. of  16 variables:
##  $ X               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

Diccionario de datos

Col	Nombre	Descripción
1	Symboling	Its assigned insurance risk rating, A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.(Categorical)
2	wheelbase	Weelbase of car (Numeric). Distancia de ejes en pulgadas.
3	carlength	Length of car (Numeric). Longitud
4	carwidth	Width of car (Numeric). Amplitud
5	carheight	height of car (Numeric). Altura
6	curbweight	The weight of a car without occupants or baggage. (Numeric). Peso del auto
7	enginesize	Size of car (Numeric). Tamaño del carro en …
8	boreratio	Boreratio of car (Numeric). Eficiencia de motor
9	stroke	Stroke or volume inside the engine (Numeric). Pistones, tiempos, combustión
10	compressionratio	compression ratio of car (Numeric). Comprensión o medición de presión en motor
11	horsepower	Horsepower (Numeric). Poder del carro
12	peakrpm	car peak rpm (Numeric). Picos de revoluciones por minuto
13	citympg	Mileage in city (Numeric). Consumo de gasolina
14	highwaympg	Mileage on highway (Numeric). Consumo de gasolina
16	price (Dependent variable)	Price of car (Numeric). Precio del carro en dólares

_{Fuente: https://archive.ics.uci.edu/ml/datasets/Automobile}

Primeros (10) registros

kable(head(datos, 10), caption = "Datos de precios de carros") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Datos de precios de carros
X	symboling	wheelbase	carlength	carwidth	carheight	curbweight	enginesize	boreratio	stroke	compressionratio	horsepower	peakrpm	citympg	highwaympg	price
1	3	88.6	168.8	64.1	48.8	2548	130	3.47	2.68	9.0	111	5000	21	27	13495.00
2	3	88.6	168.8	64.1	48.8	2548	130	3.47	2.68	9.0	111	5000	21	27	16500.00
3	1	94.5	171.2	65.5	52.4	2823	152	2.68	3.47	9.0	154	5000	19	26	16500.00
4	2	99.8	176.6	66.2	54.3	2337	109	3.19	3.40	10.0	102	5500	24	30	13950.00
5	2	99.4	176.6	66.4	54.3	2824	136	3.19	3.40	8.0	115	5500	18	22	17450.00
6	2	99.8	177.3	66.3	53.1	2507	136	3.19	3.40	8.5	110	5500	19	25	15250.00
7	1	105.8	192.7	71.4	55.7	2844	136	3.19	3.40	8.5	110	5500	19	25	17710.00
8	1	105.8	192.7	71.4	55.7	2954	136	3.19	3.40	8.5	110	5500	19	25	18920.00
9	1	105.8	192.7	71.4	55.9	3086	131	3.13	3.40	8.3	140	5500	17	20	23875.00
10	0	99.5	178.2	67.9	52.0	3053	131	3.13	3.40	7.0	160	5500	16	22	17859.17

Datos de entrenamiento y validación

Datos de entrenamiento al 80% de los datos y 20% los datos de validación.

n <- nrow(datos)
set.seed(1280) # Semilla
entrena <- createDataPartition(y = datos$price, p = 0.80, list = FALSE, times = 1)
# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ]  # [renglones, columna]

# Datos validación
datos.validacion <- datos[-entrena, ]

Datos de entrenamiento

kable(head(datos.entrenamiento, 10), caption = "Datos de Entrenamient. Precios de carros") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Datos de Entrenamient. Precios de carros
	X	symboling	wheelbase	carlength	carwidth	carheight	curbweight	enginesize	boreratio	stroke	compressionratio	horsepower	peakrpm	citympg	highwaympg	price
2	2	3	88.6	168.8	64.1	48.8	2548	130	3.47	2.68	9.0	111	5000	21	27	16500.00
3	3	1	94.5	171.2	65.5	52.4	2823	152	2.68	3.47	9.0	154	5000	19	26	16500.00
4	4	2	99.8	176.6	66.2	54.3	2337	109	3.19	3.40	10.0	102	5500	24	30	13950.00
6	6	2	99.8	177.3	66.3	53.1	2507	136	3.19	3.40	8.5	110	5500	19	25	15250.00
7	7	1	105.8	192.7	71.4	55.7	2844	136	3.19	3.40	8.5	110	5500	19	25	17710.00
8	8	1	105.8	192.7	71.4	55.7	2954	136	3.19	3.40	8.5	110	5500	19	25	18920.00
9	9	1	105.8	192.7	71.4	55.9	3086	131	3.13	3.40	8.3	140	5500	17	20	23875.00
10	10	0	99.5	178.2	67.9	52.0	3053	131	3.13	3.40	7.0	160	5500	16	22	17859.17
11	11	2	101.2	176.8	64.8	54.3	2395	108	3.50	2.80	8.8	101	5800	23	29	16430.00
12	12	0	101.2	176.8	64.8	54.3	2395	108	3.50	2.80	8.8	101	5800	23	29	16925.00

Datos de validación

kable(head(datos.validacion, 10), caption = "Datos de Validación. Precios de carros") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Datos de Validación. Precios de carros
	X	symboling	wheelbase	carlength	carwidth	carheight	curbweight	enginesize	boreratio	stroke	compressionratio	horsepower	peakrpm	citympg	highwaympg	price
1	1	3	88.6	168.8	64.1	48.8	2548	130	3.47	2.68	9.0	111	5000	21	27	13495
5	5	2	99.4	176.6	66.4	54.3	2824	136	3.19	3.40	8.0	115	5500	18	22	17450
17	17	0	103.5	193.8	67.9	53.7	3380	209	3.62	3.39	8.0	182	5400	16	22	41315
29	29	-1	103.3	174.6	64.6	59.8	2535	122	3.34	3.46	8.5	88	5000	24	30	8921
32	32	2	86.6	144.6	63.9	50.8	1819	92	2.91	3.41	9.2	76	6000	31	38	6855
42	42	0	96.5	175.4	65.2	54.1	2465	110	3.15	3.58	9.0	101	5800	24	28	12945
44	44	0	94.3	170.7	61.8	53.5	2337	111	3.31	3.23	8.5	78	4800	24	29	6785
49	49	0	113.0	199.6	69.6	52.8	4066	258	3.63	4.17	8.1	176	4750	15	19	35550
52	52	1	93.1	159.1	64.2	54.1	1900	91	3.03	3.15	9.0	68	5000	31	38	6095
62	62	1	98.8	177.8	66.5	53.7	2385	122	3.39	3.39	8.6	84	4800	26	32	10595

Modelos Supervisados

Modelo de regresión lineal múltiple. (RM)

Se construye el modelo de regresión lineal múltiple (rm)

# Modelo de regresión lineal múltiple para observar variables de importancia
modelo_rm <- lm(formula = price ~ symboling +  wheelbase + carlength + carwidth + carheight + curbweight + enginesize + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg , 
                data = datos.entrenamiento)

summary(modelo_rm)

## 
## Call:
## lm(formula = price ~ symboling + wheelbase + carlength + carwidth + 
##     carheight + curbweight + enginesize + boreratio + stroke + 
##     compressionratio + horsepower + peakrpm + citympg + highwaympg, 
##     data = datos.entrenamiento)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8090.7 -1651.4  -249.5  1572.2 10886.8 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -5.898e+04  1.642e+04  -3.593 0.000442 ***
## symboling         6.244e+02  2.730e+02   2.287 0.023604 *  
## wheelbase         2.557e+02  1.171e+02   2.183 0.030600 *  
## carlength        -1.224e+02  6.350e+01  -1.927 0.055864 .  
## carwidth          6.020e+02  2.621e+02   2.296 0.023041 *  
## carheight         2.406e+02  1.524e+02   1.579 0.116549    
## curbweight        1.735e+00  1.846e+00   0.940 0.348961    
## enginesize        1.060e+02  1.501e+01   7.059 5.79e-11 ***
## boreratio        -1.433e+03  1.349e+03  -1.062 0.289820    
## stroke           -3.813e+03  8.863e+02  -4.302 3.03e-05 ***
## compressionratio  2.910e+02  8.792e+01   3.309 0.001171 ** 
## horsepower        3.174e+01  1.754e+01   1.809 0.072377 .  
## peakrpm           1.911e+00  7.555e-01   2.529 0.012476 *  
## citympg          -3.381e+02  1.833e+02  -1.845 0.067036 .  
## highwaympg        2.254e+02  1.623e+02   1.389 0.166915    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3072 on 150 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.8359 
## F-statistic: 60.69 on 14 and 150 DF,  p-value: < 2.2e-16

¿cuáles son variables que están por encima del 90% de confianza como predictores?
El coeficiente de intersección tiene un nivel de confianza del 99%.
Las variables wheelbase, carwidth, horsepower y citympg tienen un nivel de confianza del 95% (.)
La variable symboling tiene un nivel de confianza por encima del 95%. Por otro lado, la variable carlenght si posee un valor del 95%.
Las variables stroke, compressionratio y peakrpm tienen un nivel de confianza como predictores del 99% (**)
La variable enginesize tiene un nivel de confianza como predictor del 99.9% (***)
¿Cuál es el valor de R Square Adjusted o que tanto representan las variables dependientes al precio del vehículo?
En modelos lineales múltiples el estadístico Adjusted R-squared: 0.8359 significa que las variables independientes explican aproximadamente el 83.59% de la variable dependiente precio.

Predicciones del modelo rm

predicciones_rm <- predict(object = modelo_rm, newdata = datos.validacion)
predicciones_rm

##         1         5        17        29        32        42        44        49 
## 12901.228 16240.823 25427.601 11528.568  7771.359  9363.523  5613.568 29606.046 
##        52        62        68        69        74        86        92        94 
##  6640.663 10140.805 24218.544 25155.599 40090.991  9568.592  6361.486  5705.138 
##        98       109       110       113       122       128       134       135 
##  5727.689 18531.625 16435.337 18627.034  5656.783 26064.891 14457.253 20348.617 
##       137       141       148       149       155       157       158       167 
## 17480.503 10209.006 11230.640 10669.640  6864.923  6733.793  6734.237 10976.094 
##       169       174       177       178       180       184       188       202 
## 13408.654  8383.966  8762.033  8597.723 22640.503 10693.576 10393.880 19704.251

Tabla comparativa

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_rm)

kable(head(comparaciones, 10), caption = "Regresión Lineal Múltiple. Comparación precios reales VS predicción de precios. 10 primeras predicciones") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Regresión Lineal Múltiple. Comparación precios reales VS predicción de precios. 10 primeras predicciones
	precio_real	precio_predicciones
1	13495	12901.228
5	17450	16240.823
17	41315	25427.601
29	8921	11528.568
32	6855	7771.359
42	12945	9363.523
44	6785	5613.568
49	35550	29606.046
52	6095	6640.663
62	10595	10140.805

RMSE modelo de rm

rmse_rm <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rm

## [1] 3812.759

Modelo de árbol de regresión (AR)

Se construye el modelo de árbol de regresión (ar)

modelo_ar <- rpart(formula = price ~ symboling +  wheelbase + carlength + carwidth + carheight + curbweight + enginesize + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg , 
                data = datos.entrenamiento )


modelo_ar

## n= 165 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 165 9434738000 13083.16  
##    2) enginesize< 182 153 3410799000 11421.30  
##      4) curbweight< 2659.5 101  666542000  8745.00  
##        8) curbweight< 2291.5 62   97558280  7289.29 *
##        9) curbweight>=2291.5 39  228733800 11059.21  
##         18) highwaympg>=29.5 27   55460610 10003.26 *
##         19) highwaympg< 29.5 12   75430060 13435.08 *
##      5) curbweight>=2659.5 52  615727700 16619.50  
##       10) carwidth< 68.65 45  445665800 16006.31 *
##       11) carwidth>=68.65 7   44370490 20561.43 *
##    3) enginesize>=182 12  213848400 34271.88 *

Variables de importancia

summary(modelo_ar)

## Call:
## rpart(formula = price ~ symboling + wheelbase + carlength + carwidth + 
##     carheight + curbweight + enginesize + boreratio + stroke + 
##     compressionratio + horsepower + peakrpm + citympg + highwaympg, 
##     data = datos.entrenamiento)
##   n= 165 
## 
##           CP nsplit  rel error    xerror       xstd
## 1 0.61581898      0 1.00000000 1.0188984 0.17837555
## 2 0.22560554      1 0.38418102 0.5040613 0.08670527
## 3 0.03606352      2 0.15857548 0.2106205 0.02892480
## 4 0.01332220      3 0.12251196 0.1768833 0.02578643
## 5 0.01037052      4 0.10918976 0.1831988 0.02652567
## 6 0.01000000      5 0.09881924 0.1806921 0.02679749
## 
## Variable importance
## enginesize curbweight horsepower   carwidth highwaympg    citympg  carlength 
##         23         20         15         14         13          8          5 
##  wheelbase 
##          1 
## 
## Node number 1: 165 observations,    complexity param=0.615819
##   mean=13083.16, MSE=5.718023e+07 
##   left son=2 (153 obs) right son=3 (12 obs)
##   Primary splits:
##       enginesize < 182    to the left,  improve=0.6158190, (0 missing)
##       curbweight < 2665.5 to the left,  improve=0.5219724, (0 missing)
##       highwaympg < 28.5   to the right, improve=0.4996421, (0 missing)
##       horsepower < 118    to the left,  improve=0.4972609, (0 missing)
##       citympg    < 22.5   to the right, improve=0.4883601, (0 missing)
##   Surrogate splits:
##       curbweight < 3490   to the left,  agree=0.976, adj=0.667, (0 split)
##       horsepower < 175.5  to the left,  agree=0.970, adj=0.583, (0 split)
##       carwidth   < 69.25  to the left,  agree=0.964, adj=0.500, (0 split)
##       citympg    < 16.5   to the right, agree=0.958, adj=0.417, (0 split)
##       highwaympg < 19.5   to the right, agree=0.958, adj=0.417, (0 split)
## 
## Node number 2: 153 observations,    complexity param=0.2256055
##   mean=11421.3, MSE=2.22928e+07 
##   left son=4 (101 obs) right son=5 (52 obs)
##   Primary splits:
##       curbweight < 2659.5 to the left,  improve=0.6240559, (0 missing)
##       highwaympg < 29.5   to the right, improve=0.6006180, (0 missing)
##       enginesize < 126    to the left,  improve=0.5347020, (0 missing)
##       horsepower < 94.5   to the left,  improve=0.5289197, (0 missing)
##       wheelbase  < 98.95  to the left,  improve=0.5245366, (0 missing)
##   Surrogate splits:
##       enginesize < 126    to the left,  agree=0.908, adj=0.731, (0 split)
##       highwaympg < 28.5   to the right, agree=0.895, adj=0.692, (0 split)
##       horsepower < 113    to the left,  agree=0.863, adj=0.596, (0 split)
##       carlength  < 178    to the left,  agree=0.856, adj=0.577, (0 split)
##       carwidth   < 66.05  to the left,  agree=0.856, adj=0.577, (0 split)
## 
## Node number 3: 12 observations
##   mean=34271.88, MSE=1.78207e+07 
## 
## Node number 4: 101 observations,    complexity param=0.03606352
##   mean=8745, MSE=6599426 
##   left son=8 (62 obs) right son=9 (39 obs)
##   Primary splits:
##       curbweight < 2291.5 to the left,  improve=0.5104703, (0 missing)
##       highwaympg < 29.5   to the right, improve=0.4359229, (0 missing)
##       carlength  < 168.75 to the left,  improve=0.4018012, (0 missing)
##       horsepower < 83     to the left,  improve=0.3947234, (0 missing)
##       citympg    < 23.5   to the right, improve=0.3851430, (0 missing)
##   Surrogate splits:
##       carlength  < 168.75 to the left,  agree=0.921, adj=0.795, (0 split)
##       carwidth   < 64.5   to the left,  agree=0.891, adj=0.718, (0 split)
##       horsepower < 83     to the left,  agree=0.851, adj=0.615, (0 split)
##       citympg    < 27.5   to the right, agree=0.851, adj=0.615, (0 split)
##       wheelbase  < 95.9   to the left,  agree=0.842, adj=0.590, (0 split)
## 
## Node number 5: 52 observations,    complexity param=0.0133222
##   mean=16619.5, MSE=1.184092e+07 
##   left son=10 (45 obs) right son=11 (7 obs)
##   Primary splits:
##       carwidth   < 68.65  to the left,  improve=0.2041348, (0 missing)
##       wheelbase  < 100.8  to the left,  improve=0.1768359, (0 missing)
##       carlength  < 188.3  to the left,  improve=0.1262024, (0 missing)
##       peakrpm    < 5450   to the left,  improve=0.1090728, (0 missing)
##       highwaympg < 28.5   to the right, improve=0.1051026, (0 missing)
##   Surrogate splits:
##       wheelbase < 108.55 to the left,  agree=0.885, adj=0.143, (0 split)
## 
## Node number 8: 62 observations
##   mean=7289.29, MSE=1573521 
## 
## Node number 9: 39 observations,    complexity param=0.01037052
##   mean=11059.21, MSE=5864970 
##   left son=18 (27 obs) right son=19 (12 obs)
##   Primary splits:
##       highwaympg < 29.5   to the right, improve=0.4277599, (0 missing)
##       stroke     < 3.43   to the right, improve=0.3433487, (0 missing)
##       horsepower < 100.5  to the left,  improve=0.2823519, (0 missing)
##       enginesize < 109.5  to the right, improve=0.2539563, (0 missing)
##       citympg    < 22     to the right, improve=0.2539524, (0 missing)
##   Surrogate splits:
##       stroke     < 3.3025 to the right, agree=0.949, adj=0.833, (0 split)
##       enginesize < 108.5  to the right, agree=0.872, adj=0.583, (0 split)
##       citympg    < 22     to the right, agree=0.872, adj=0.583, (0 split)
##       wheelbase  < 95.7   to the right, agree=0.821, adj=0.417, (0 split)
##       carlength  < 169.05 to the right, agree=0.821, adj=0.417, (0 split)
## 
## Node number 10: 45 observations
##   mean=16006.31, MSE=9903685 
## 
## Node number 11: 7 observations
##   mean=20561.43, MSE=6338641 
## 
## Node number 18: 27 observations
##   mean=10003.26, MSE=2054097 
## 
## Node number 19: 12 observations
##   mean=13435.08, MSE=6285838

Visualización de árbol de regresión

rpart.plot(modelo_ar)

Predicciones del modelo (ar)

predicciones_ar <- predict(object = modelo_ar, newdata = datos.validacion)
predicciones_ar

##        1        5       17       29       32       42       44       49 
## 13435.08 16006.31 34271.88 10003.26  7289.29 13435.08 13435.08 34271.88 
##       52       62       68       69       74       86       92       94 
##  7289.29 10003.26 34271.88 34271.88 34271.88 10003.26  7289.29  7289.29 
##       98      109      110      113      122      128      134      135 
##  7289.29 16006.31 16006.31 16006.31  7289.29 34271.88 16006.31 16006.31 
##      137      141      148      149      155      157      158      167 
## 16006.31  7289.29 10003.26 13435.08  7289.29  7289.29  7289.29 13435.08 
##      169      174      177      178      180      184      188      202 
## 10003.26 10003.26 10003.26 10003.26 16006.31  7289.29 10003.26 20561.43

Tabla comparativa

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_ar)

kable(head(comparaciones, 10), caption = "Arbol de regresión. Comparación precios reales VS predicción de precios. 10 primeras predicciones") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Arbol de regresión. Comparación precios reales VS predicción de precios. 10 primeras predicciones
	precio_real	precio_predicciones
1	13495	13435.08
5	17450	16006.31
17	41315	34271.88
29	8921	10003.26
32	6855	7289.29
42	12945	13435.08
44	6785	13435.08
49	35550	34271.88
52	6095	7289.29
62	10595	10003.26

RMSE modelo de ar

rmse_ar <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_ar

## [1] 3007.861

Modelo de bosques aleatorios (RF)

Se construye el modelo de árbol de regresión (ar)

modelo_rf <- randomForest(x = datos.entrenamiento[,c("symboling", "wheelbase",
                                "carlength", "carwidth", "carheight", "curbweight",
                                "enginesize", "boreratio", "stroke",
                                "compressionratio", "horsepower", "peakrpm",
                                "citympg", "highwaympg" )], 
                          y = datos.entrenamiento[,'price'], 
                          importance = TRUE, 
                          keep.forest = TRUE, 
                          ntree=20)
modelo_rf

## 
## Call:
##  randomForest(x = datos.entrenamiento[, c("symboling", "wheelbase",      "carlength", "carwidth", "carheight", "curbweight", "enginesize",      "boreratio", "stroke", "compressionratio", "horsepower",      "peakrpm", "citympg", "highwaympg")], y = datos.entrenamiento[,      "price"], ntree = 20, importance = TRUE, keep.forest = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 20
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 5410968
##                     % Var explained: 90.54

Variables de importancia

as.data.frame(modelo_rf$importance) %>%
    arrange(desc(IncNodePurity))

##                     %IncMSE IncNodePurity
## curbweight       14477866.4    1820539066
## citympg          11828643.5    1805852182
## carwidth          6040519.6    1593727280
## horsepower        9179697.2    1521396316
## enginesize       13773942.5     924356681
## carlength         2040652.1     437971411
## wheelbase         3869201.3     277556990
## highwaympg        2264907.8     225604520
## boreratio          635980.9     165964315
## compressionratio   579149.5     133257452
## peakrpm            319454.6     108361784
## stroke            1818618.1      75943968
## carheight          448648.6      48620485
## symboling         -151969.0      16603118

Predicciones del modelo (rf)

predicciones_rf <- predict(object = modelo_rf, newdata = datos.validacion)
predicciones_rf

##         1         5        17        29        32        42        44        49 
## 15385.863 15047.423 32107.879 10076.485  6756.509 10483.219  9850.090 33882.302 
##        52        62        68        69        74        86        92        94 
##  6116.381  9729.441 29050.739 29460.322 37213.357  8698.035  6875.108  7518.999 
##        98       109       110       113       122       128       134       135 
##  7518.999 17045.555 17152.983 17045.555  7275.078 32507.840 13813.106 15214.170 
##       137       141       148       149       155       157       158       167 
## 17930.594  7158.610 10011.651 10366.817  8047.132  7704.135  7704.135 10605.692 
##       169       174       177       178       180       184       188       202 
## 10005.901 10250.160 11157.527 11101.078 17235.312  8312.611  8399.892 20138.345

Tabla comparativa

comparaciones <- data.frame(precio_real = datos.validacion$price,  precio_predicciones = predicciones_rf)

kable(head(comparaciones, 10), caption = "Random Forest. Comparación precios reales VS predicción de precios. 10 primeras predicciones") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Random Forest. Comparación precios reales VS predicción de precios. 10 primeras predicciones
	precio_real	precio_predicciones
1	13495	15385.863
5	17450	15047.423
17	41315	32107.879
29	8921	10076.485
32	6855	6756.509
42	12945	10483.219
44	6785	9850.090
49	35550	33882.302
52	6095	6116.381
62	10595	9729.441

RMSE modelo de ar

rmse_rf <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rf

## [1] 2245.088

Evaluación de modelos

Se comparan las predicciones

comparaciones <- data.frame(cbind(datos.validacion[,-1], predicciones_rm, predicciones_ar, predicciones_rf))

Se visualizan las predicciones de cada modelo

kable(comparaciones, caption = "Predicciones de los modelos") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Predicciones de los modelos
	symboling	wheelbase	carlength	carwidth	carheight	curbweight	enginesize	boreratio	stroke	compressionratio	horsepower	peakrpm	citympg	highwaympg	price	predicciones_rm	predicciones_ar	predicciones_rf
1	3	88.6	168.8	64.1	48.8	2548	130	3.47	2.68	9.0	111	5000	21	27	13495	12901.228	13435.08	15385.863
5	2	99.4	176.6	66.4	54.3	2824	136	3.19	3.40	8.0	115	5500	18	22	17450	16240.823	16006.31	15047.423
17	0	103.5	193.8	67.9	53.7	3380	209	3.62	3.39	8.0	182	5400	16	22	41315	25427.601	34271.88	32107.879
29	-1	103.3	174.6	64.6	59.8	2535	122	3.34	3.46	8.5	88	5000	24	30	8921	11528.568	10003.26	10076.485
32	2	86.6	144.6	63.9	50.8	1819	92	2.91	3.41	9.2	76	6000	31	38	6855	7771.359	7289.29	6756.509
42	0	96.5	175.4	65.2	54.1	2465	110	3.15	3.58	9.0	101	5800	24	28	12945	9363.523	13435.08	10483.219
44	0	94.3	170.7	61.8	53.5	2337	111	3.31	3.23	8.5	78	4800	24	29	6785	5613.568	13435.08	9850.090
49	0	113.0	199.6	69.6	52.8	4066	258	3.63	4.17	8.1	176	4750	15	19	35550	29606.046	34271.88	33882.302
52	1	93.1	159.1	64.2	54.1	1900	91	3.03	3.15	9.0	68	5000	31	38	6095	6640.663	7289.29	6116.381
62	1	98.8	177.8	66.5	53.7	2385	122	3.39	3.39	8.6	84	4800	26	32	10595	10140.805	10003.26	9729.441
68	-1	110.0	190.9	70.3	56.5	3515	183	3.58	3.64	21.5	123	4350	22	25	25552	24218.544	34271.88	29050.739
69	-1	110.0	190.9	70.3	58.7	3750	183	3.58	3.64	21.5	123	4350	22	25	28248	25155.599	34271.88	29460.322
74	0	120.9	208.1	71.7	56.7	3900	308	3.80	3.35	8.0	184	4500	14	16	40960	40090.991	34271.88	37213.357
86	1	96.3	172.4	65.4	51.6	2365	122	3.35	3.46	8.5	88	5000	25	32	6989	9568.592	10003.26	8698.035
92	1	94.5	165.3	63.8	54.5	1918	97	3.15	3.29	9.4	69	5200	31	37	6649	6361.486	7289.29	6875.108
94	1	94.5	170.2	63.8	53.5	2024	97	3.15	3.29	9.4	69	5200	31	37	7349	5705.138	7289.29	7518.999
98	1	94.5	170.2	63.8	53.5	2037	97	3.15	3.29	9.4	69	5200	31	37	7999	5727.689	7289.29	7518.999
109	0	107.9	186.7	68.4	56.7	3197	152	3.70	3.52	21.0	95	4150	28	33	13200	18531.625	16006.31	17045.555
110	0	114.2	198.9	68.4	58.7	3230	120	3.46	3.19	8.4	97	5000	19	24	12440	16435.337	16006.31	17152.983
113	0	107.9	186.7	68.4	56.7	3252	152	3.70	3.52	21.0	95	4150	28	33	16900	18627.034	16006.31	17045.555
122	1	93.7	167.3	63.8	50.8	1989	90	2.97	3.23	9.4	68	5500	31	38	6692	5656.783	7289.29	7275.078
128	3	89.5	168.9	65.0	51.6	2756	194	3.74	2.90	9.5	207	5900	17	25	34028	26064.891	34271.88	32507.840
134	2	99.1	186.6	66.5	56.1	2695	121	3.54	3.07	9.3	110	5250	21	28	12170	14457.253	16006.31	13813.106
135	3	99.1	186.6	66.5	56.1	2707	121	2.54	2.07	9.3	110	5250	21	28	15040	20348.617	16006.31	15214.170
137	3	99.1	186.6	66.5	56.1	2808	121	3.54	3.07	9.0	160	5500	19	26	18150	17480.503	16006.31	17930.594
141	2	93.3	157.3	63.8	55.7	2240	108	3.62	2.64	8.7	73	4400	26	31	7603	10209.006	7289.29	7158.610
148	0	97.0	173.5	65.4	53.0	2455	108	3.62	2.64	9.0	94	5200	25	31	10198	11230.640	10003.26	10011.651
149	0	96.9	173.6	65.4	54.9	2420	108	3.62	2.64	9.0	82	4800	23	29	8013	10669.640	13435.08	10366.817
155	0	95.7	169.7	63.6	59.1	2290	92	3.05	3.03	9.0	62	4800	27	32	7898	6864.923	7289.29	8047.132
157	0	95.7	166.3	64.4	53.0	2081	98	3.19	3.03	9.0	70	4800	30	37	6938	6733.793	7289.29	7704.135
158	0	95.7	166.3	64.4	52.8	2109	98	3.19	3.03	9.0	70	4800	30	37	7198	6734.237	7289.29	7704.135
167	1	94.5	168.7	64.0	52.6	2300	98	3.24	3.08	9.4	112	6600	26	29	9538	10976.094	13435.08	10605.692
169	2	98.4	176.2	65.6	52.0	2536	146	3.62	3.50	9.3	116	4800	24	30	9639	13408.654	10003.26	10005.901
174	-1	102.4	175.6	66.5	54.9	2326	122	3.31	3.54	8.7	92	4200	29	34	8948	8383.966	10003.26	10250.160
177	-1	102.4	175.6	66.5	54.9	2414	122	3.31	3.54	8.7	92	4200	27	32	10898	8762.033	10003.26	11157.527
178	-1	102.4	175.6	66.5	53.9	2458	122	3.31	3.54	8.7	92	4200	27	32	11248	8597.723	10003.26	11101.078
180	3	102.9	183.5	67.7	52.0	3016	171	3.27	3.35	9.3	161	5200	19	24	15998	22640.503	16006.31	17235.312
184	2	97.3	171.7	65.5	55.7	2209	109	3.19	3.40	9.0	85	5250	27	34	7975	10693.576	7289.29	8312.611
188	2	97.3	171.7	65.5	55.7	2319	97	3.01	3.40	23.0	68	4500	37	42	9495	10393.880	10003.26	8399.892
202	-1	109.1	188.8	68.8	55.5	3049	141	3.78	3.15	8.7	160	5300	19	25	19045	19704.251	20561.43	20138.345

Se compara el RMSE

rmse <- data.frame(rm = rmse_rm, ar = rmse_ar, rf = rmse_rf)

kable(rmse, caption = "Estadístico RMSE de cada modelo") %>%
  kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>% 
 kable_paper("hover")

Estadístico RMSE de cada modelo
rm	ar	rf
3812.759	3007.861	2245.088

Interpretación

El ejercicio consistió en cargar un conjunto de datos numéricos de precios de automóviles con respecto a algunas variables numéricas.

El modelo de regresión linea múltiple destaca variables estadísticamente significativas: Las variables stroke, compressionratio y peakrpm tienen un nivel de confianza como predictores del 99%. La variable enginesize también posee un nivel de confianza del 99.9%. Por otro lado, la mayoría de las variables están por lo menos muy cerca del 90% y en algunos casos del 95%.

En el modelode árbol de regresión la variable con mayor importancia es enginesize con un valor de 23 y luego le siguen algunas otras como curbweight, horsepower, carwidth, highwaympg, citympg, carlength y wheelbase, también con ese orden de importancia.

El modelo de bosque aleatorio considera variables de importancia tales como: enginesize, curbweight, horsepower, citympg y carwidth.

En este caso, la variable enginesize continua estando presente en todos los modelos como importante y significativa. Otro dato interesante es que Las variables enginesize, curbweight y horsepower figuran como importantes en los modelos árbol de regresión y bosque aleatorio.

El mejor modelo conforme al estadístico raiz del error cuadrático medio (rmse) fue el de bosques aleatorios con estos datos de entrenamiento y validación y con el porcentaje de datos de entrenamiento y validación de 80% y 20%. El valor que arrojó fue de 2245.088, siendo el más bajo de los 3 modelos de regresión.

Comparando los resultados en R con los resultados arrojados en Python, el modelo que proporcionó el menor valor del estádistico RMSE fue el de random forest en ambos casos. No obstante, en R tuvo una cantidad de 2245.088 y en Python tuvo otra de 2631.95698, por lo tanto se puede concluir en que el modelo más óptimo, especificamente con estos datos, es efectivamente el random forest pero haciendo uso de la programación en R.

Caso 6. Comparación modelos de regresión. Datos Precios autos. Programación R

Emilio Martínez Aguilar

2022-10-15

Objetivo

Descripción

Desarrollo

Cargar librerías

Cargar datos

Exploración de datos

Diccionario de datos

Primeros (10) registros

Datos de entrenamiento y validación

Datos de entrenamiento

Datos de validación

Modelos Supervisados

Modelo de regresión lineal múltiple. (RM)

Predicciones del modelo rm

Tabla comparativa

RMSE modelo de rm

Modelo de árbol de regresión (AR)

Variables de importancia

Visualización de árbol de regresión

Predicciones del modelo (ar)

Tabla comparativa

RMSE modelo de ar

Modelo de bosques aleatorios (RF)

Variables de importancia

Predicciones del modelo (rf)

Tabla comparativa

RMSE modelo de ar

Evaluación de modelos

Interpretación