Compare supervised models by applying algorithms that predict automobile prices, using the root mean squared error (RMSE) as the comparison statistic.
The previously prepared data are loaded from https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv
All variables in the data set take part, except the identifier car_ID and CarName, which are dropped.
Training data are created with 80% of the observations.
Validation data are created with the remaining 20%.
The multiple regression model is built with the training data.
This model answers questions such as:
Which variables are significant predictors at a confidence level above 90%?
What is the value of the adjusted R-squared, that is, how much of the vehicle price do the independent variables explain?
Predictions are generated with the validation data.
The RMSE statistic is computed for comparison purposes.
The regression tree model is built with the training data.
The importance of the variables for price is identified.
The regression tree and its decision rules are visualized.
Predictions are made with the validation data.
The RMSE statistic is computed for comparison purposes.
The random forest model is built with the training data and 20 trees.
The importance of the variables for price is identified.
Predictions are generated with the validation data.
The RMSE statistic is computed for comparison purposes.
At the end of the case, a personal interpretation is given.
# Libraries
library(readr)
library(PerformanceAnalytics) # For correlation plots
library(dplyr)
library(knitr) # For tabular data
library(kableExtra) # For friendlier tables
library(ggplot2) # For visualization
library(plotly) # For visualization
library(caret) # For partitioning the data
library(Metrics) # For computing RMSE
library(rpart) # For regression trees
library(rpart.plot) # For plotting trees
library(randomForest) # For random forests
library(reshape) # For renaming columns
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv",
fileEncoding = "UTF-8",
stringsAsFactors = TRUE)
There are 205 observations and 26 variables; both the numeric and the categorical variables are used in the models.
str(datos)
## 'data.frame': 205 obs. of 26 variables:
## $ car_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ CarName : Factor w/ 147 levels "alfa-romero giulia",..: 1 3 2 4 5 9 5 7 6 8 ...
## $ fueltype : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ doornumber : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
## $ carbody : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drivewheel : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ enginelocation : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheelbase : num 88.6 88.6 94.5 99.8 99.4 ...
## $ carlength : num 169 169 171 177 177 ...
## $ carwidth : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ carheight : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curbweight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ enginetype : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ cylindernumber : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ enginesize : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuelsystem : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ boreratio : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compressionratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peakrpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ citympg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highwaympg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
kable(head(datos, 10), caption = "Car price data") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.00 |
| 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.00 |
| 6 | 2 | audi fox | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 1 | audi 100ls | gas | std | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 8 | 1 | audi 5000 | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 9 | 1 | audi 4000 | gas | turbo | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 10 | 0 | audi 5000s (diesel) | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
| Col | Name | Description |
|---|---|---|
| 1 | car_ID | Unique ID of each observation (integer) |
| 2 | symboling | Assigned insurance risk rating; a value of +3 indicates that the car is risky, -3 that it is probably quite safe (categorical) |
| 3 | CarName | Name of the car company and model (categorical) |
| 4 | fueltype | Car fuel type, i.e. gas or diesel (categorical) |
| 5 | aspiration | Aspiration used in the car (categorical): std or turbo |
| 6 | doornumber | Number of doors (categorical) |
| 7 | carbody | Body of the car (categorical): convertible, sedan, wagon, ... |
| 8 | drivewheel | Type of drive wheel (categorical): 4wd, fwd, rwd |
| 9 | enginelocation | Location of the engine (categorical): front or rear |
| 10 | wheelbase | Wheelbase of the car (numeric); distance between axles, in inches |
| 11 | carlength | Length of the car (numeric) |
| 12 | carwidth | Width of the car (numeric) |
| 13 | carheight | Height of the car (numeric) |
| 14 | curbweight | Weight of the car without occupants or baggage (numeric) |
| 15 | enginetype | Type of engine (categorical) |
| 16 | cylindernumber | Number of cylinders in the engine (categorical) |
| 17 | enginesize | Size of the engine (numeric) |
| 18 | fuelsystem | Fuel system of the car (categorical) |
| 19 | boreratio | Bore ratio of the engine (numeric) |
| 20 | stroke | Stroke or volume inside the engine (numeric) |
| 21 | compressionratio | Compression ratio of the engine (numeric) |
| 22 | horsepower | Horsepower (numeric); power of the car |
| 23 | peakrpm | Peak revolutions per minute (numeric) |
| 24 | citympg | Mileage in the city (numeric); fuel consumption |
| 25 | highwaympg | Mileage on the highway (numeric); fuel consumption |
| 26 | price | Price of the car in dollars (numeric); dependent variable |
Remove the variables of no statistical interest, that is, columns 1 and 3: car_ID and CarName.
datos <- datos[, c(2,4:26)]
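Selecting columns by position works here, but it depends on the column order staying the same. A minimal sketch of the same step done by name, on a tiny made-up data frame (the object `toy` and its values are assumptions for illustration only):

```r
# Toy data frame standing in for `datos` (values are illustrative only)
toy <- data.frame(
  car_ID    = 1:3,
  symboling = c(3, 1, 2),
  CarName   = c("a", "b", "c"),
  price     = c(13495, 16500, 13950)
)

# Drop the identifier columns by name instead of by position
drop_cols <- c("car_ID", "CarName")
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(names(toy_clean))
```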
The first records again:
kable(head(datos, 10), caption = "Car price data") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.00 |
| 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 1 | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 2 | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 2 | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.00 |
| 2 | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 1 | gas | std | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 1 | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 1 | gas | turbo | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 0 | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
The training data are 80% of the observations and the validation data are the remaining 20%.
n <- nrow(datos)
set.seed(1280) # Seed
entrena <- createDataPartition(y = datos$price, p = 0.80, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ] # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]
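`createDataPartition()` samples within groups of `price`, so that both parts have a similar price distribution. As a rough illustration of the 80/20 idea only, here is a base R sketch that uses plain random sampling on the built-in `mtcars` data. It's a substitute technique with stand-in data, not what caret does internally, and the names `train` and `valid` are assumptions.

```r
set.seed(1280)                        # same seed value as above, different sampler
n_toy <- nrow(mtcars)                 # mtcars is a stand-in data set
idx   <- sample(seq_len(n_toy), size = floor(0.80 * n_toy))
train <- mtcars[idx, ]                # 80% of the rows
valid <- mtcars[-idx, ]               # the remaining rows

# The two parts are disjoint and together cover every row
print(c(train = nrow(train), valid = nrow(valid)))
```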
kable(head(datos.entrenamiento, 10), caption = "Training data. Car prices") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 1 | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 2 | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 6 | 2 | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 1 | gas | std | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 8 | 1 | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 9 | 1 | gas | turbo | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 10 | 0 | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
| 11 | 2 | gas | std | two | sedan | rwd | front | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | ohc | four | 108 | mpfi | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16430.00 |
| 12 | 0 | gas | std | four | sedan | rwd | front | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | ohc | four | 108 | mpfi | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16925.00 |
kable(head(datos.validacion, 10), caption = "Validation data. Car prices") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 5 | 2 | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 17 | 0 | gas | std | two | sedan | rwd | front | 103.5 | 193.8 | 67.9 | 53.7 | 3380 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 |
| 29 | -1 | gas | std | four | wagon | fwd | front | 103.3 | 174.6 | 64.6 | 59.8 | 2535 | ohc | four | 122 | 2bbl | 3.34 | 3.46 | 8.5 | 88 | 5000 | 24 | 30 | 8921 |
| 32 | 2 | gas | std | two | hatchback | fwd | front | 86.6 | 144.6 | 63.9 | 50.8 | 1819 | ohc | four | 92 | 1bbl | 2.91 | 3.41 | 9.2 | 76 | 6000 | 31 | 38 | 6855 |
| 42 | 0 | gas | std | four | sedan | fwd | front | 96.5 | 175.4 | 65.2 | 54.1 | 2465 | ohc | four | 110 | mpfi | 3.15 | 3.58 | 9.0 | 101 | 5800 | 24 | 28 | 12945 |
| 44 | 0 | gas | std | four | sedan | rwd | front | 94.3 | 170.7 | 61.8 | 53.5 | 2337 | ohc | four | 111 | 2bbl | 3.31 | 3.23 | 8.5 | 78 | 4800 | 24 | 29 | 6785 |
| 49 | 0 | gas | std | four | sedan | rwd | front | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | dohc | six | 258 | mpfi | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 |
| 52 | 1 | gas | std | two | hatchback | fwd | front | 93.1 | 159.1 | 64.2 | 54.1 | 1900 | ohc | four | 91 | 2bbl | 3.03 | 3.15 | 9.0 | 68 | 5000 | 31 | 38 | 6095 |
| 62 | 1 | gas | std | two | hatchback | fwd | front | 98.8 | 177.8 | 66.5 | 53.7 | 2385 | ohc | four | 122 | 2bbl | 3.39 | 3.39 | 8.6 | 84 | 4800 | 26 | 32 | 10595 |
The multiple linear regression model (rm) is built: price as a function of all the independent variables, both numeric and non-numeric.
The expression price ~ . means price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg
# Multiple linear regression model to observe the important variables
#modelo_rm <- lm(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento)
modelo_rm <- lm(formula = price ~ . ,
data = datos.entrenamiento)
summary(modelo_rm)
##
## Call:
## lm(formula = price ~ ., data = datos.entrenamiento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4490 -1037 -4 1112 4790
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.500e+04 1.632e+04 -1.532 0.128117
## symboling 3.967e+02 2.324e+02 1.707 0.090311 .
## fueltypegas -6.181e+03 6.605e+03 -0.936 0.351254
## aspirationturbo 2.202e+03 8.495e+02 2.592 0.010688 *
## doornumbertwo -4.140e+02 5.818e+02 -0.712 0.478031
## carbodyhardtop -3.403e+03 1.344e+03 -2.532 0.012587 *
## carbodyhatchback -3.371e+03 1.183e+03 -2.849 0.005137 **
## carbodysedan -2.156e+03 1.263e+03 -1.707 0.090280 .
## carbodywagon -3.435e+03 1.367e+03 -2.513 0.013276 *
## drivewheelfwd -2.524e+02 1.255e+03 -0.201 0.840879
## drivewheelrwd 1.576e+03 1.352e+03 1.166 0.245867
## enginelocationrear 8.234e+03 2.628e+03 3.134 0.002159 **
## wheelbase 1.292e+02 9.905e+01 1.305 0.194424
## carlength -7.600e+01 4.916e+01 -1.546 0.124692
## carwidth 7.744e+02 2.338e+02 3.313 0.001214 **
## carheight 1.150e+02 1.290e+02 0.892 0.374385
## curbweight 2.222e+00 1.754e+00 1.267 0.207584
## enginetypedohcv -1.952e+03 4.603e+03 -0.424 0.672209
## enginetypel -1.346e+03 1.622e+03 -0.830 0.408153
## enginetypeohc 3.385e+03 1.001e+03 3.381 0.000967 ***
## enginetypeohcf 1.815e+03 1.645e+03 1.103 0.271984
## enginetypeohcv -3.950e+03 1.217e+03 -3.246 0.001510 **
## enginetyperotor 8.357e+03 5.260e+03 1.589 0.114691
## cylindernumberfive -3.647e+03 3.100e+03 -1.177 0.241592
## cylindernumberfour -2.719e+03 3.746e+03 -0.726 0.469311
## cylindernumbersix -3.570e+03 2.367e+03 -1.508 0.134124
## cylindernumberthree 8.532e+03 5.171e+03 1.650 0.101516
## cylindernumbertwelve -1.531e+04 4.664e+03 -3.282 0.001343 **
## cylindernumbertwo NA NA NA NA
## enginesize 1.699e+02 3.318e+01 5.122 1.13e-06 ***
## fuelsystem2bbl -6.991e+02 8.307e+02 -0.842 0.401672
## fuelsystem4bbl -2.007e+03 2.457e+03 -0.817 0.415626
## fuelsystemidi NA NA NA NA
## fuelsystemmfi -3.467e+03 2.305e+03 -1.504 0.135099
## fuelsystemmpfi -1.109e+03 9.526e+02 -1.164 0.246636
## fuelsystemspdi -3.337e+03 1.267e+03 -2.634 0.009526 **
## fuelsystemspfi -1.243e+03 2.217e+03 -0.561 0.576007
## boreratio -6.474e+03 2.245e+03 -2.884 0.004643 **
## stroke -6.410e+03 1.121e+03 -5.716 7.77e-08 ***
## compressionratio -4.047e+02 4.967e+02 -0.815 0.416837
## horsepower 1.065e+01 2.098e+01 0.508 0.612489
## peakrpm 1.508e+00 6.617e-01 2.278 0.024440 *
## citympg -2.776e+02 1.440e+02 -1.927 0.056266 .
## highwaympg 2.425e+02 1.301e+02 1.865 0.064616 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1910 on 123 degrees of freedom
## Multiple R-squared: 0.9525, Adjusted R-squared: 0.9366
## F-statistic: 60.1 on 41 and 123 DF, p-value: < 2.2e-16
Which variables are significant predictors at a confidence level above 90%?
The intercept has a confidence level of approximately 87% (p = 0.128).
The following coefficients reach or exceed the 90% confidence level (p < 0.10): symboling, aspirationturbo, carbodyhardtop, carbodyhatchback, carbodysedan, carbodywagon, enginelocationrear, carwidth, enginetypeohc, enginetypeohcv, cylindernumbertwelve, enginesize, fuelsystemspdi, boreratio, stroke, peakrpm, citympg and highwaympg.
Because some predictors do not reach the established confidence level (90%), it may be necessary to build a model with only the predictors whose confidence level is 90% or higher.
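The predictors that meet the 90% threshold can also be read from the coefficient table programmatically rather than by eye. A minimal sketch, fitted on the built-in `mtcars` data as a stand-in (the model and the object names `fit` and `signif_90` are assumptions for illustration):

```r
fit   <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model, not the car price model
coefs <- summary(fit)$coefficients          # matrix with a "Pr(>|t|)" column

# Keep the rows whose p-value is below 0.10 (confidence level above 90%)
signif_90 <- coefs[coefs[, "Pr(>|t|)"] < 0.10, , drop = FALSE]
print(round(signif_90, 4))
```

In the notebook, the same filter could be applied to `summary(modelo_rm)$coefficients`.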
In multiple linear models, the statistic Adjusted R-squared: 0.9366 means that the independent variables explain approximately 93.66% of the variance of the dependent variable price, after adjusting for the number of predictors.
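The adjusted value can be checked against the other figures printed in the summary. Adjusted R-squared penalizes R-squared for the number of estimated coefficients. Here is a quick sketch of that arithmetic, using the rounded values from the output above (the variable names are assumptions):

```r
r2     <- 0.9525   # Multiple R-squared, as printed (rounded)
n_obs  <- 165      # training observations: 123 residual DF + 41 predictors + intercept
df_res <- 123      # residual degrees of freedom

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (residual degrees of freedom)
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / df_res
print(adj_r2)
```

The result agrees with the reported 0.9366 up to the rounding of the inputs.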
predicciones_rm <- predict(object = modelo_rm, newdata = datos.validacion)
## Warning in predict.lm(object = modelo_rm, newdata = datos.validacion):
## prediction from a rank-deficient fit may be misleading
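The warning stems from the two coefficients reported as `NA` in the summary (`cylindernumbertwo` and `fuelsystemidi`), which `lm()` could not estimate because of singularities among the predictors. A toy sketch of how such terms can be listed; the data frame below is made up so that `x2` is exactly twice `x1`:

```r
toy <- data.frame(x1 = c(1, 2, 3, 4, 5, 6))
toy$x2 <- 2 * toy$x1                        # perfectly collinear with x1
toy$y  <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

fit_toy <- lm(y ~ x1 + x2, data = toy)

# Coefficients that lm() could not estimate come back as NA
aliased <- names(coef(fit_toy))[is.na(coef(fit_toy))]
print(aliased)
```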
predicciones_rm
## 1 5 17 29 32 42 44 49
## 15122.219 16732.201 29532.768 9897.322 7489.560 8955.674 8350.434 31814.160
## 52 62 68 69 74 86 92 94
## 6131.592 9622.071 28131.207 27627.620 45466.625 10673.585 6086.089 4969.453
## 98 109 110 113 122 128 134 135
## 4998.341 17219.506 11587.102 17341.722 7021.569 33027.500 13573.404 25252.031
## 137 141 148 149 155 157 158 167
## 12510.072 7302.054 8436.235 8557.954 6944.816 8261.571 7085.245 6380.881
## 169 174 177 178 180 184 188 202
## 12930.578 9036.720 9302.328 8069.550 20334.621 9915.024 11379.105 22048.106
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rm)
With seed 1280, and after testing, the conclusion is that the training data must cover every value of the categorical variables that appears in the validation data; that is, the validation data must not contain levels that the model was not trained on.
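One way to verify this condition before calling `predict()` is to compare the factor values actually present in each part. A minimal sketch with a hypothetical helper, `unseen_levels()`, and two made-up data frames:

```r
# Hypothetical helper: for each factor column, list the values that occur
# in `valid` but never in `train`
unseen_levels <- function(train, valid) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(unique(as.character(valid[[col]])),
            unique(as.character(train[[col]])))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]          # keep only the columns with a problem
}

# Made-up example data
train <- data.frame(fuelsystem = factor(c("mpfi", "2bbl", "mpfi")),
                    price      = c(13495, 8921, 16500))
valid <- data.frame(fuelsystem = factor(c("mpfi", "spfi")),
                    price      = c(17450, 11048))

print(unseen_levels(train, valid))
```

An empty list means every level in the validation data was also seen during training.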
kable(head(comparaciones, 10), caption = "Multiple linear regression. Actual prices vs. predicted prices. First 10 predictions") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 1 | 13495 | 15122.219 |
| 5 | 17450 | 16732.201 |
| 17 | 41315 | 29532.768 |
| 29 | 8921 | 9897.322 |
| 32 | 6855 | 7489.560 |
| 42 | 12945 | 8955.674 |
| 44 | 6785 | 8350.434 |
| 49 | 35550 | 31814.160 |
| 52 | 6095 | 6131.592 |
| 62 | 10595 | 9622.071 |
rmse_rm <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rm
## [1] 3404.151
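`rmse()` from Metrics takes the square root of the mean squared difference between actual and predicted values. A short sketch of the same computation in base R, applied only to the ten pairs shown in the table above, so the result won't match the value for all 40 validation rows (`rmse_manual()` is a hypothetical name):

```r
precio_real <- c(13495, 17450, 41315, 8921, 6855,
                 12945, 6785, 35550, 6095, 10595)
precio_pred <- c(15122.219, 16732.201, 29532.768, 9897.322, 7489.560,
                 8955.674, 8350.434, 31814.160, 6131.592, 9622.071)

# Root mean squared error: sqrt(mean((actual - predicted)^2))
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

print(rmse_manual(precio_real, precio_pred))
```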
The regression tree model (ar) is built.
modelo_ar <- rpart(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento )
modelo_ar
## n= 165
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 165 9434738000 13083.16
## 2) enginesize< 182 153 3410799000 11421.30
## 4) curbweight< 2659.5 101 666542000 8745.00
## 8) curbweight< 2291.5 62 97558280 7289.29 *
## 9) curbweight>=2291.5 39 228733800 11059.21
## 18) highwaympg>=29.5 27 55460610 10003.26 *
## 19) highwaympg< 29.5 12 75430060 13435.08 *
## 5) curbweight>=2659.5 52 615727700 16619.50
## 10) fuelsystem=2bbl,mfi,spdi,spfi 7 25831260 12505.86 *
## 11) fuelsystem=idi,mpfi 45 453015600 17259.40 *
## 3) enginesize>=182 12 213848400 34271.88 *
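Each terminal node (marked `*`) predicts the mean price of the training cars that fall into it. The printed splits can be transcribed as plain if/else rules. The sketch below does that by hand with a hypothetical function, `predict_from_rules()`; it's only meant to show how the listing is read and doesn't replace `predict()`. For fuel-system values that the printed split doesn't mention, it simply returns `NA`.

```r
# Hand transcription of the printed tree (thresholds and means copied from the listing)
predict_from_rules <- function(enginesize, curbweight, highwaympg, fuelsystem) {
  if (enginesize >= 182) return(34271.88)                                 # node 3
  if (curbweight < 2659.5) {
    if (curbweight < 2291.5) return(7289.29)                              # node 8
    if (highwaympg >= 29.5)  return(10003.26)                             # node 18
    return(13435.08)                                                      # node 19
  }
  if (fuelsystem %in% c("2bbl", "mfi", "spdi", "spfi")) return(12505.86)  # node 10
  if (fuelsystem %in% c("idi", "mpfi"))                 return(17259.40)  # node 11
  NA_real_   # fuel-system value not covered by the printed split
}

# First validation row: enginesize 130, curbweight 2548, highwaympg 27, mpfi
print(predict_from_rules(130, 2548, 27, "mpfi"))
```

The value returned for that row is the mean of node 19, the same figure the tree reports for observation 1 in the predictions further below.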
summary(modelo_ar)
## Call:
## rpart(formula = price ~ symboling + fueltype + aspiration + doornumber +
## carbody + drivewheel + enginelocation + wheelbase + carlength +
## carwidth + carheight + curbweight + enginetype + cylindernumber +
## enginesize + fuelsystem + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento)
## n= 165
##
## CP nsplit rel error xerror xstd
## 1 0.61581898 0 1.00000000 1.0188984 0.17837555
## 2 0.22560554 1 0.38418102 0.5040613 0.08670527
## 3 0.03606352 2 0.15857548 0.2106205 0.02892480
## 4 0.01450818 3 0.12251196 0.1768833 0.02578643
## 5 0.01037052 4 0.10800378 0.1748235 0.02638678
## 6 0.01000000 5 0.09763326 0.1727943 0.02666065
##
## Variable importance
## enginesize curbweight horsepower carwidth citympg
## 23 20 15 14 8
## cylindernumber highwaympg carlength wheelbase stroke
## 8 5 5 1 1
##
## Node number 1: 165 observations, complexity param=0.615819
## mean=13083.16, MSE=5.718023e+07
## left son=2 (153 obs) right son=3 (12 obs)
## Primary splits:
## enginesize < 182 to the left, improve=0.6158190, (0 missing)
## curbweight < 2665.5 to the left, improve=0.5219724, (0 missing)
## cylindernumber splits as RRLRLRL, improve=0.5151719, (0 missing)
## highwaympg < 28.5 to the right, improve=0.4996421, (0 missing)
## horsepower < 118 to the left, improve=0.4972609, (0 missing)
## Surrogate splits:
## curbweight < 3490 to the left, agree=0.976, adj=0.667, (0 split)
## horsepower < 175.5 to the left, agree=0.970, adj=0.583, (0 split)
## carwidth < 69.25 to the left, agree=0.964, adj=0.500, (0 split)
## cylindernumber splits as RLLLLRL, agree=0.958, adj=0.417, (0 split)
## citympg < 16.5 to the right, agree=0.958, adj=0.417, (0 split)
##
## Node number 2: 153 observations, complexity param=0.2256055
## mean=11421.3, MSE=2.22928e+07
## left son=4 (101 obs) right son=5 (52 obs)
## Primary splits:
## curbweight < 2659.5 to the left, improve=0.6240559, (0 missing)
## highwaympg < 29.5 to the right, improve=0.6006180, (0 missing)
## enginesize < 126 to the left, improve=0.5347020, (0 missing)
## horsepower < 94.5 to the left, improve=0.5289197, (0 missing)
## wheelbase < 98.95 to the left, improve=0.5245366, (0 missing)
## Surrogate splits:
## enginesize < 126 to the left, agree=0.908, adj=0.731, (0 split)
## highwaympg < 28.5 to the right, agree=0.895, adj=0.692, (0 split)
## horsepower < 113 to the left, agree=0.863, adj=0.596, (0 split)
## carlength < 178 to the left, agree=0.856, adj=0.577, (0 split)
## carwidth < 66.05 to the left, agree=0.856, adj=0.577, (0 split)
##
## Node number 3: 12 observations
## mean=34271.88, MSE=1.78207e+07
##
## Node number 4: 101 observations, complexity param=0.03606352
## mean=8745, MSE=6599426
## left son=8 (62 obs) right son=9 (39 obs)
## Primary splits:
## curbweight < 2291.5 to the left, improve=0.5104703, (0 missing)
## highwaympg < 29.5 to the right, improve=0.4359229, (0 missing)
## fuelsystem splits as LLRL-RL-, improve=0.4126328, (0 missing)
## carlength < 168.75 to the left, improve=0.4018012, (0 missing)
## horsepower < 83 to the left, improve=0.3947234, (0 missing)
## Surrogate splits:
## carlength < 168.75 to the left, agree=0.921, adj=0.795, (0 split)
## carwidth < 64.5 to the left, agree=0.891, adj=0.718, (0 split)
## horsepower < 83 to the left, agree=0.851, adj=0.615, (0 split)
## citympg < 27.5 to the right, agree=0.851, adj=0.615, (0 split)
## wheelbase < 95.9 to the left, agree=0.842, adj=0.590, (0 split)
##
## Node number 5: 52 observations, complexity param=0.01450818
## mean=16619.5, MSE=1.184092e+07
## left son=10 (7 obs) right son=11 (45 obs)
## Primary splits:
## fuelsystem splits as -L-RLRLL, improve=0.2223074, (0 missing)
## carwidth < 68.65 to the left, improve=0.2041348, (0 missing)
## wheelbase < 100.8 to the left, improve=0.1768359, (0 missing)
## cylindernumber splits as -RLR---, improve=0.1272707, (0 missing)
## carlength < 188.3 to the left, improve=0.1262024, (0 missing)
## Surrogate splits:
## stroke < 3.75 to the right, agree=0.962, adj=0.714, (0 split)
## wheelbase < 97.2 to the left, agree=0.923, adj=0.429, (0 split)
## carlength < 174.1 to the left, agree=0.923, adj=0.429, (0 split)
## compressionratio < 7.25 to the left, agree=0.923, adj=0.429, (0 split)
## carheight < 51.7 to the left, agree=0.904, adj=0.286, (0 split)
##
## Node number 8: 62 observations
## mean=7289.29, MSE=1573521
##
## Node number 9: 39 observations, complexity param=0.01037052
## mean=11059.21, MSE=5864970
## left son=18 (27 obs) right son=19 (12 obs)
## Primary splits:
## highwaympg < 29.5 to the right, improve=0.4277599, (0 missing)
## stroke < 3.43 to the right, improve=0.3433487, (0 missing)
## fuelsystem splits as LLRR-RL-, improve=0.3321655, (0 missing)
## horsepower < 100.5 to the left, improve=0.2823519, (0 missing)
## drivewheel splits as LLR, improve=0.2742662, (0 missing)
## Surrogate splits:
## stroke < 3.3025 to the right, agree=0.949, adj=0.833, (0 split)
## drivewheel splits as RLR, agree=0.897, adj=0.667, (0 split)
## enginetype splits as R--LR-R, agree=0.872, adj=0.583, (0 split)
## enginesize < 108.5 to the right, agree=0.872, adj=0.583, (0 split)
## citympg < 22 to the right, agree=0.872, adj=0.583, (0 split)
##
## Node number 10: 7 observations
## mean=12505.86, MSE=3690180
##
## Node number 11: 45 observations
## mean=17259.4, MSE=1.006701e+07
##
## Node number 18: 27 observations
## mean=10003.26, MSE=2054097
##
## Node number 19: 12 observations
## mean=13435.08, MSE=6285838
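The `improve` value of a split is the share of the parent node's deviance that the split removes. For the root split on `enginesize`, it can be recomputed from the deviances printed for nodes 1, 2 and 3 in the tree listing. A quick sketch of that arithmetic (variable names are assumptions):

```r
dev_root  <- 9434738000   # node 1 deviance
dev_left  <- 3410799000   # node 2 (enginesize < 182)
dev_right <-  213848400   # node 3 (enginesize >= 182)

# Fraction of the root deviance removed by the first split
improve_root <- (dev_root - dev_left - dev_right) / dev_root
print(round(improve_root, 4))
```

The same number appears as the CP of the first row of the table, and one minus it gives the relative error of about 0.384 after one split.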
rpart.plot(modelo_ar)
predicciones_ar <- predict(object = modelo_ar, newdata = datos.validacion)
predicciones_ar
## 1 5 17 29 32 42 44 49
## 13435.08 17259.40 34271.88 10003.26 7289.29 13435.08 13435.08 34271.88
## 52 62 68 69 74 86 92 94
## 7289.29 10003.26 34271.88 34271.88 34271.88 10003.26 7289.29 7289.29
## 98 109 110 113 122 128 134 135
## 7289.29 17259.40 17259.40 17259.40 7289.29 34271.88 17259.40 17259.40
## 137 141 148 149 155 157 158 167
## 17259.40 7289.29 10003.26 13435.08 7289.29 7289.29 7289.29 13435.08
## 169 174 177 178 180 184 188 202
## 10003.26 10003.26 10003.26 10003.26 17259.40 7289.29 10003.26 17259.40
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_ar)
kable(head(comparaciones, 10), caption = "Regression tree. Actual prices vs. predicted prices. First 10 predictions") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 1 | 13495 | 13435.08 |
| 5 | 17450 | 17259.40 |
| 17 | 41315 | 34271.88 |
| 29 | 8921 | 10003.26 |
| 32 | 6855 | 7289.29 |
| 42 | 12945 | 13435.08 |
| 44 | 6785 | 13435.08 |
| 49 | 35550 | 34271.88 |
| 52 | 6095 | 7289.29 |
| 62 | 10595 | 10003.26 |
rmse_ar <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_ar
## [1] 3131.044
The random forest model (rf) is built with 20 trees.
modelo_rf <- randomForest(x = datos.entrenamiento[,c("symboling", "fueltype", "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginetype", "cylindernumber", "enginesize", "fuelsystem", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")],
y = datos.entrenamiento[,'price'],
importance = TRUE,
keep.forest = TRUE,
ntree=20)
modelo_rf
##
## Call:
## randomForest(x = datos.entrenamiento[, c("symboling", "fueltype", "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginetype", "cylindernumber", "enginesize", "fuelsystem", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")], y = datos.entrenamiento[, "price"], ntree = 20, importance = TRUE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 20
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 5453401
## % Var explained: 90.46
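`% Var explained` relates the out-of-bag mean of squared residuals to the variance of `price` in the training data. Taking the root-node MSE from the regression tree summary (5.718023e+07) as that variance, the figure can be reproduced approximately. This is a sketch of the arithmetic with hand-entered numbers, not a call into randomForest:

```r
mse_oob   <- 5453401        # "Mean of squared residuals" printed by the forest
var_price <- 5.718023e+07   # MSE at the root node of the tree (variance of price)

# Share of the price variance accounted for by the forest, in percent
pct_var_explained <- 100 * (1 - mse_oob / var_price)
print(round(pct_var_explained, 2))
```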
as.data.frame(modelo_rf$importance) %>%
arrange(desc(IncNodePurity))
## %IncMSE IncNodePurity
## enginesize 15425958.66 2618007145
## curbweight 14346107.19 2275617447
## highwaympg 9696518.16 1290000916
## horsepower 8276005.14 814844622
## citympg 3087149.42 474250575
## cylindernumber 1547594.13 390141502
## wheelbase 2888406.72 383468069
## drivewheel 1008225.55 265676215
## carwidth 3236454.35 204207463
## fuelsystem 3357060.79 187376119
## peakrpm 1361368.80 78522989
## enginelocation 220336.18 74174715
## carlength 509880.53 68562622
## boreratio 493611.26 63939419
## compressionratio 830447.64 53484864
## carheight 254178.38 38015529
## stroke 235252.00 32696779
## symboling 437886.51 25088853
## aspiration 152380.62 23531290
## enginetype 231672.48 23259214
## carbody 143974.29 13493888
## doornumber 60724.69 10023092
## fueltype 25672.55 5065489
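The raw IncNodePurity figures are hard to compare at a glance. For regression forests, randomForest reports this measure as the total decrease in residual sum of squares from splits on each variable. As a minimal sketch, and assuming only the values printed in the table above, they can be turned into percentage shares in base R. The names `inc_node_purity` and `participacion` are introduced here for illustration and do not appear in the original report.

```r
# IncNodePurity values copied from the importance table above.
inc_node_purity <- c(
  enginesize = 2618007145, curbweight = 2275617447, highwaympg = 1290000916,
  horsepower = 814844622, citympg = 474250575, cylindernumber = 390141502,
  wheelbase = 383468069, drivewheel = 265676215, carwidth = 204207463,
  fuelsystem = 187376119, peakrpm = 78522989, enginelocation = 74174715,
  carlength = 68562622, boreratio = 63939419, compressionratio = 53484864,
  carheight = 38015529, stroke = 32696779, symboling = 25088853,
  aspiration = 23531290, enginetype = 23259214, carbody = 13493888,
  doornumber = 10023092, fueltype = 5065489
)

# Share of the total node-purity gain attributed to each variable, in percent.
participacion <- round(100 * inc_node_purity / sum(inc_node_purity), 1)

# Five variables with the largest share.
head(sort(participacion, decreasing = TRUE), 5)
```

Expressed this way, the ranking is the same as in the table, but the gap between the leading variables and the rest is easier to read.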
predicciones_rf <- predict(object = modelo_rf, newdata = datos.validacion)
predicciones_rf
## 1 5 17 29 32 42 44 49
## 14419.900 15758.284 31601.349 10711.521 6483.122 12618.451 9796.148 33104.990
## 52 62 68 69 74 86 92 94
## 6093.105 9711.777 29256.205 29256.205 36663.522 8722.587 6702.817 7661.853
## 98 109 110 113 122 128 134 135
## 7581.470 16669.461 16314.744 16669.461 7219.612 32559.316 14387.577 14209.732
## 137 141 148 149 155 157 158 167
## 16922.191 7735.783 10225.388 9778.801 7711.015 7866.334 7853.569 11207.391
## 169 174 177 178 180 184 188 202
## 10079.042 10667.892 10528.080 10610.692 16328.483 8564.105 8946.374 21110.107
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rf)
kable(head(comparaciones, 10), caption = "Random Forest. Comparación precios reales VS predicción de precios. 10 primeras predicciones") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 1 | 13495 | 14419.900 |
| 5 | 17450 | 15758.284 |
| 17 | 41315 | 31601.349 |
| 29 | 8921 | 10711.521 |
| 32 | 6855 | 6483.122 |
| 42 | 12945 | 12618.451 |
| 44 | 6785 | 9796.148 |
| 49 | 35550 | 33104.990 |
| 52 | 6095 | 6093.105 |
| 62 | 10595 | 9711.777 |
rmse_rf <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rf
## [1] 2271.819
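To make the statistic concrete, the following sketch recomputes RMSE by hand in base R. The helper name `rmse_manual` is an assumption introduced for illustration (the report itself uses `rmse()` from Metrics). The vectors hold only the ten rows shown in the random forest table above, so the result is not the 2271.819 obtained on the full validation set.

```r
# Hypothetical helper (not part of the original report): RMSE in base R.
rmse_manual <- function(real, prediccion) {
  sqrt(mean((real - prediccion)^2))
}

# First ten validation rows copied from the random forest table above.
precio_real <- c(13495, 17450, 41315, 8921, 6855, 12945, 6785, 35550, 6095, 10595)
precio_rf <- c(14419.900, 15758.284, 31601.349, 10711.521, 6483.122,
               12618.451, 9796.148, 33104.990, 6093.105, 9711.777)

# RMSE restricted to these ten rows.
rmse_10 <- rmse_manual(precio_real, precio_rf)
rmse_10
```

Because errors are squared before averaging, the large miss on row 17 (41315 real versus 31601.349 predicted) dominates the value for this subset.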
The predictions of the three models are combined into a single data frame
comparaciones <- data.frame(cbind(datos.validacion[,-1], predicciones_rm, predicciones_ar, predicciones_rf))
The predictions of each model are displayed
kable(comparaciones, caption = "Predicciones de los modelos") %>%
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price | predicciones_rm | predicciones_ar | predicciones_rf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 | 15122.219 | 13435.08 | 14419.900 |
| 5 | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 | 16732.201 | 17259.40 | 15758.284 |
| 17 | gas | std | two | sedan | rwd | front | 103.5 | 193.8 | 67.9 | 53.7 | 3380 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 | 29532.768 | 34271.88 | 31601.349 |
| 29 | gas | std | four | wagon | fwd | front | 103.3 | 174.6 | 64.6 | 59.8 | 2535 | ohc | four | 122 | 2bbl | 3.34 | 3.46 | 8.5 | 88 | 5000 | 24 | 30 | 8921 | 9897.322 | 10003.26 | 10711.521 |
| 32 | gas | std | two | hatchback | fwd | front | 86.6 | 144.6 | 63.9 | 50.8 | 1819 | ohc | four | 92 | 1bbl | 2.91 | 3.41 | 9.2 | 76 | 6000 | 31 | 38 | 6855 | 7489.560 | 7289.29 | 6483.122 |
| 42 | gas | std | four | sedan | fwd | front | 96.5 | 175.4 | 65.2 | 54.1 | 2465 | ohc | four | 110 | mpfi | 3.15 | 3.58 | 9.0 | 101 | 5800 | 24 | 28 | 12945 | 8955.674 | 13435.08 | 12618.451 |
| 44 | gas | std | four | sedan | rwd | front | 94.3 | 170.7 | 61.8 | 53.5 | 2337 | ohc | four | 111 | 2bbl | 3.31 | 3.23 | 8.5 | 78 | 4800 | 24 | 29 | 6785 | 8350.434 | 13435.08 | 9796.148 |
| 49 | gas | std | four | sedan | rwd | front | 113.0 | 199.6 | 69.6 | 52.8 | 4066 | dohc | six | 258 | mpfi | 3.63 | 4.17 | 8.1 | 176 | 4750 | 15 | 19 | 35550 | 31814.160 | 34271.88 | 33104.990 |
| 52 | gas | std | two | hatchback | fwd | front | 93.1 | 159.1 | 64.2 | 54.1 | 1900 | ohc | four | 91 | 2bbl | 3.03 | 3.15 | 9.0 | 68 | 5000 | 31 | 38 | 6095 | 6131.592 | 7289.29 | 6093.105 |
| 62 | gas | std | two | hatchback | fwd | front | 98.8 | 177.8 | 66.5 | 53.7 | 2385 | ohc | four | 122 | 2bbl | 3.39 | 3.39 | 8.6 | 84 | 4800 | 26 | 32 | 10595 | 9622.071 | 10003.26 | 9711.777 |
| 68 | diesel | turbo | four | sedan | rwd | front | 110.0 | 190.9 | 70.3 | 56.5 | 3515 | ohc | five | 183 | idi | 3.58 | 3.64 | 21.5 | 123 | 4350 | 22 | 25 | 25552 | 28131.207 | 34271.88 | 29256.205 |
| 69 | diesel | turbo | four | wagon | rwd | front | 110.0 | 190.9 | 70.3 | 58.7 | 3750 | ohc | five | 183 | idi | 3.58 | 3.64 | 21.5 | 123 | 4350 | 22 | 25 | 28248 | 27627.620 | 34271.88 | 29256.205 |
| 74 | gas | std | four | sedan | rwd | front | 120.9 | 208.1 | 71.7 | 56.7 | 3900 | ohcv | eight | 308 | mpfi | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 40960 | 45466.625 | 34271.88 | 36663.522 |
| 86 | gas | std | four | sedan | fwd | front | 96.3 | 172.4 | 65.4 | 51.6 | 2365 | ohc | four | 122 | 2bbl | 3.35 | 3.46 | 8.5 | 88 | 5000 | 25 | 32 | 6989 | 10673.585 | 10003.26 | 8722.587 |
| 92 | gas | std | two | sedan | fwd | front | 94.5 | 165.3 | 63.8 | 54.5 | 1918 | ohc | four | 97 | 2bbl | 3.15 | 3.29 | 9.4 | 69 | 5200 | 31 | 37 | 6649 | 6086.089 | 7289.29 | 6702.817 |
| 94 | gas | std | four | wagon | fwd | front | 94.5 | 170.2 | 63.8 | 53.5 | 2024 | ohc | four | 97 | 2bbl | 3.15 | 3.29 | 9.4 | 69 | 5200 | 31 | 37 | 7349 | 4969.453 | 7289.29 | 7661.853 |
| 98 | gas | std | four | wagon | fwd | front | 94.5 | 170.2 | 63.8 | 53.5 | 2037 | ohc | four | 97 | 2bbl | 3.15 | 3.29 | 9.4 | 69 | 5200 | 31 | 37 | 7999 | 4998.341 | 7289.29 | 7581.470 |
| 109 | diesel | turbo | four | sedan | rwd | front | 107.9 | 186.7 | 68.4 | 56.7 | 3197 | l | four | 152 | idi | 3.70 | 3.52 | 21.0 | 95 | 4150 | 28 | 33 | 13200 | 17219.506 | 17259.40 | 16669.461 |
| 110 | gas | std | four | wagon | rwd | front | 114.2 | 198.9 | 68.4 | 58.7 | 3230 | l | four | 120 | mpfi | 3.46 | 3.19 | 8.4 | 97 | 5000 | 19 | 24 | 12440 | 11587.102 | 17259.40 | 16314.744 |
| 113 | diesel | turbo | four | sedan | rwd | front | 107.9 | 186.7 | 68.4 | 56.7 | 3252 | l | four | 152 | idi | 3.70 | 3.52 | 21.0 | 95 | 4150 | 28 | 33 | 16900 | 17341.722 | 17259.40 | 16669.461 |
| 122 | gas | std | four | sedan | fwd | front | 93.7 | 167.3 | 63.8 | 50.8 | 1989 | ohc | four | 90 | 2bbl | 2.97 | 3.23 | 9.4 | 68 | 5500 | 31 | 38 | 6692 | 7021.569 | 7289.29 | 7219.612 |
| 128 | gas | std | two | hardtop | rwd | rear | 89.5 | 168.9 | 65.0 | 51.6 | 2756 | ohcf | six | 194 | mpfi | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 34028 | 33027.500 | 34271.88 | 32559.316 |
| 134 | gas | std | four | sedan | fwd | front | 99.1 | 186.6 | 66.5 | 56.1 | 2695 | ohc | four | 121 | mpfi | 3.54 | 3.07 | 9.3 | 110 | 5250 | 21 | 28 | 12170 | 13573.404 | 17259.40 | 14387.577 |
| 135 | gas | std | two | hatchback | fwd | front | 99.1 | 186.6 | 66.5 | 56.1 | 2707 | ohc | four | 121 | mpfi | 2.54 | 2.07 | 9.3 | 110 | 5250 | 21 | 28 | 15040 | 25252.031 | 17259.40 | 14209.732 |
| 137 | gas | turbo | two | hatchback | fwd | front | 99.1 | 186.6 | 66.5 | 56.1 | 2808 | dohc | four | 121 | mpfi | 3.54 | 3.07 | 9.0 | 160 | 5500 | 19 | 26 | 18150 | 12510.072 | 17259.40 | 16922.191 |
| 141 | gas | std | two | hatchback | 4wd | front | 93.3 | 157.3 | 63.8 | 55.7 | 2240 | ohcf | four | 108 | 2bbl | 3.62 | 2.64 | 8.7 | 73 | 4400 | 26 | 31 | 7603 | 7302.054 | 7289.29 | 7735.783 |
| 148 | gas | std | four | wagon | fwd | front | 97.0 | 173.5 | 65.4 | 53.0 | 2455 | ohcf | four | 108 | mpfi | 3.62 | 2.64 | 9.0 | 94 | 5200 | 25 | 31 | 10198 | 8436.235 | 10003.26 | 10225.388 |
| 149 | gas | std | four | wagon | 4wd | front | 96.9 | 173.6 | 65.4 | 54.9 | 2420 | ohcf | four | 108 | 2bbl | 3.62 | 2.64 | 9.0 | 82 | 4800 | 23 | 29 | 8013 | 8557.954 | 13435.08 | 9778.801 |
| 155 | gas | std | four | wagon | 4wd | front | 95.7 | 169.7 | 63.6 | 59.1 | 2290 | ohc | four | 92 | 2bbl | 3.05 | 3.03 | 9.0 | 62 | 4800 | 27 | 32 | 7898 | 6944.816 | 7289.29 | 7711.015 |
| 157 | gas | std | four | sedan | fwd | front | 95.7 | 166.3 | 64.4 | 53.0 | 2081 | ohc | four | 98 | 2bbl | 3.19 | 3.03 | 9.0 | 70 | 4800 | 30 | 37 | 6938 | 8261.571 | 7289.29 | 7866.334 |
| 158 | gas | std | four | hatchback | fwd | front | 95.7 | 166.3 | 64.4 | 52.8 | 2109 | ohc | four | 98 | 2bbl | 3.19 | 3.03 | 9.0 | 70 | 4800 | 30 | 37 | 7198 | 7085.245 | 7289.29 | 7853.569 |
| 167 | gas | std | two | hatchback | rwd | front | 94.5 | 168.7 | 64.0 | 52.6 | 2300 | dohc | four | 98 | mpfi | 3.24 | 3.08 | 9.4 | 112 | 6600 | 26 | 29 | 9538 | 6380.881 | 13435.08 | 11207.391 |
| 169 | gas | std | two | hardtop | rwd | front | 98.4 | 176.2 | 65.6 | 52.0 | 2536 | ohc | four | 146 | mpfi | 3.62 | 3.50 | 9.3 | 116 | 4800 | 24 | 30 | 9639 | 12930.578 | 10003.26 | 10079.042 |
| 174 | gas | std | four | sedan | fwd | front | 102.4 | 175.6 | 66.5 | 54.9 | 2326 | ohc | four | 122 | mpfi | 3.31 | 3.54 | 8.7 | 92 | 4200 | 29 | 34 | 8948 | 9036.720 | 10003.26 | 10667.892 |
| 177 | gas | std | four | sedan | fwd | front | 102.4 | 175.6 | 66.5 | 54.9 | 2414 | ohc | four | 122 | mpfi | 3.31 | 3.54 | 8.7 | 92 | 4200 | 27 | 32 | 10898 | 9302.328 | 10003.26 | 10528.080 |
| 178 | gas | std | four | hatchback | fwd | front | 102.4 | 175.6 | 66.5 | 53.9 | 2458 | ohc | four | 122 | mpfi | 3.31 | 3.54 | 8.7 | 92 | 4200 | 27 | 32 | 11248 | 8069.550 | 10003.26 | 10610.692 |
| 180 | gas | std | two | hatchback | rwd | front | 102.9 | 183.5 | 67.7 | 52.0 | 3016 | dohc | six | 171 | mpfi | 3.27 | 3.35 | 9.3 | 161 | 5200 | 19 | 24 | 15998 | 20334.621 | 17259.40 | 16328.483 |
| 184 | gas | std | two | sedan | fwd | front | 97.3 | 171.7 | 65.5 | 55.7 | 2209 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 9.0 | 85 | 5250 | 27 | 34 | 7975 | 9915.024 | 7289.29 | 8564.105 |
| 188 | diesel | turbo | four | sedan | fwd | front | 97.3 | 171.7 | 65.5 | 55.7 | 2319 | ohc | four | 97 | idi | 3.01 | 3.40 | 23.0 | 68 | 4500 | 37 | 42 | 9495 | 11379.105 | 10003.26 | 8946.374 |
| 202 | gas | turbo | four | sedan | rwd | front | 109.1 | 188.8 | 68.8 | 55.5 | 3049 | ohc | four | 141 | mpfi | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 | 22048.106 | 17259.40 | 21110.107 |
The RMSE values are compared
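# Named rmse_modelos so the table does not share a name with the rmse() function from Metrics
rmse_modelos <- data.frame(rm = rmse_rm, ar = rmse_ar, rf = rmse_rf)
kable(rmse_modelos, caption = "Estadístico RMSE de cada modelo") %>%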
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| rm | ar | rf |
|---|---|---|
| 3404.151 | 3131.044 | 2271.819 |
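The comparison in the table can also be expressed numerically. The sketch below assumes only the three RMSE values printed above. The names `rmse_valores`, `mejor` and `reduccion_vs_rm` are introduced for illustration and are not part of the original report.

```r
# RMSE values copied from the table above.
rmse_valores <- c(rm = 3404.151, ar = 3131.044, rf = 2271.819)

# Model with the lowest validation RMSE.
mejor <- names(which.min(rmse_valores))

# Relative RMSE reduction of each model versus multiple linear regression, in percent.
reduccion_vs_rm <- round(100 * (rmse_valores[["rm"]] - rmse_valores) / rmse_valores[["rm"]], 1)

mejor
reduccion_vs_rm
```

Reading the reduction as a percentage puts the three models on a common scale, which is handy when the same comparison is repeated with a different split or variable set.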
The exercise consisted of loading a data set of automobile prices and modeling price against all the variables, both numeric and categorical.
The multiple linear regression model highlights statistically significant variables: symboling, aspirationturbo, carbodyhardtop, carbodysedan, peakrpm, citympg, highwaympg and some others are significant predictors at a confidence level above 90%, which makes them reasonable candidates for building another model based on them.
In the regression tree model, the most important variable is again enginesize, with a value of 23, followed in order of importance by curbweight, horsepower, carwidth, highwaympg, citympg, carlength and wheelbase.
The random forest model ranks enginesize, curbweight, highwaympg, horsepower and citympg as its most important variables (by IncNodePurity).
In this case, enginesize is again important and significant in all the models, as it was in the other version of this case, which excludes the categorical variables from the analysis. It is also worth noting that enginesize, curbweight and horsepower all rank as important in both the regression tree and the random forest.
The best model according to the root mean squared error (RMSE) was the random forest, for this particular split of 80% training data and 20% validation data. Its RMSE was 2271.819, the lowest of the three regression models. Compared with the previous case, which did not involve the categorical variables, the value unexpectedly increased slightly.
Note that with seed 1280 the models ran without errors caused by validation values that a model does not recognize because they were absent from the training data. For this split, therefore, the training data cover every value of the categorical variables that appears in the validation data.
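Whether a given split has this property can be checked before fitting any model. The following is a minimal sketch in base R. The helper `niveles_no_vistos` and the two demo data frames are hypothetical and introduced only for illustration; they are not part of the original report or of the car data set.

```r
# Hypothetical helper (not in the original report): for every categorical
# column, list the values present in validation but absent from training.
niveles_no_vistos <- function(entrenamiento, validacion) {
  es_categorica <- vapply(
    entrenamiento,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  )
  columnas <- names(entrenamiento)[es_categorica]
  faltantes <- lapply(columnas, function(nombre) {
    setdiff(unique(as.character(validacion[[nombre]])),
            unique(as.character(entrenamiento[[nombre]])))
  })
  names(faltantes) <- columnas
  # Keep only the columns that actually have unseen values.
  faltantes[lengths(faltantes) > 0]
}

# Toy data for illustration only; these are not rows from the car data set.
entrenamiento_demo <- data.frame(fueltype = c("gas", "gas", "diesel"),
                                 carbody = c("sedan", "wagon", "sedan"),
                                 price = c(1, 2, 3))
validacion_demo <- data.frame(fueltype = c("gas", "diesel"),
                              carbody = c("sedan", "convertible"),
                              price = c(4, 5))

# Reports carbody = "convertible", which never appears in the training demo.
resultado_demo <- niveles_no_vistos(entrenamiento_demo, validacion_demo)
resultado_demo
```

In the report's own setting the call would be `niveles_no_vistos(datos.entrenamiento, datos.validacion)`. An empty list would confirm what the error-free run with seed 1280 already suggests.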
Finally, comparing the results in R with those obtained in Python, random forest gave the lowest RMSE in both cases. In R the value was 2271.819 and in Python it was 2616.357814, so for this specific case, using all the numeric and categorical variables, the best-performing model is again random forest, with the R implementation giving the lower error.