Compare supervised models by applying prediction algorithms to car prices and measuring each one with the root mean squared error (RMSE).
The previously prepared data are loaded from https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv
All variables in the data set take part.
Training data are created with 80% of the observations.
Validation data are created with the remaining 20%.
The multiple linear regression model is built with the training data.
This model answers questions such as:
Which variables are predictors above the 90% confidence level?
What is the adjusted R-squared, that is, how much of the vehicle price do the independent variables explain?
Predictions are generated with the validation data.
The RMSE statistic is computed for comparison.
The regression tree model is built with the training data.
The importance of the variables for the price is identified.
The regression tree and its decision rules are visualized.
Predictions are made with the validation data.
The RMSE statistic is computed for comparison.
The random forest model is built with the training data and 20 trees.
The importance of the variables for the price is identified.
Predictions are generated with the validation data.
The RMSE statistic is computed for comparison.
At the end of the case, a personal interpretation is given.
# Libraries
library(readr)
library(PerformanceAnalytics) # For graphical correlations
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:xts':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr) # For tabular data
library(kableExtra) # For friendlier tables
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(ggplot2) # For plotting
library(plotly) # For plotting
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(caret) # For partitioning
## Loading required package: lattice
library(Metrics) # For computing RMSE
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(rpart) # For the tree
library(rpart.plot) # For plotting the tree
library(randomForest) # For random forest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret) # For splits or partitions
library(reshape) # For renaming columns
##
## Attaching package: 'reshape'
## The following object is masked from 'package:plotly':
##
## rename
## The following object is masked from 'package:dplyr':
##
## rename
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/CarPrice_Assignment.csv",
fileEncoding = "UTF-8",
stringsAsFactors = TRUE)
There are 205 observations and 26 variables, both numeric and categorical.
str(datos)
## 'data.frame': 205 obs. of 26 variables:
## $ car_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ CarName : Factor w/ 147 levels "alfa-romero giulia",..: 1 3 2 4 5 9 5 7 6 8 ...
## $ fueltype : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ doornumber : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
## $ carbody : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drivewheel : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ enginelocation : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheelbase : num 88.6 88.6 94.5 99.8 99.4 ...
## $ carlength : num 169 169 171 177 177 ...
## $ carwidth : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ carheight : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curbweight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ enginetype : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ cylindernumber : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ enginesize : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuelsystem : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ boreratio : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compressionratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peakrpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ citympg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highwaympg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
kable(head(datos, 10), caption = "Car price data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.00 |
| 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.00 |
| 6 | 2 | audi fox | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 1 | audi 100ls | gas | std | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 8 | 1 | audi 5000 | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 9 | 1 | audi 4000 | gas | turbo | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 10 | 0 | audi 5000s (diesel) | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
| Col | Name | Description |
|---|---|---|
| 1 | car_ID | Unique id of each observation (Integer) |
| 2 | symboling | Assigned insurance risk rating: a value of +3 indicates that the car is risky, -3 that it is probably quite safe (Categorical) |
| 3 | CarName | Name of the car company and model (Categorical) |
| 4 | fueltype | Car fuel type, i.e. gas or diesel (Categorical) |
| 5 | aspiration | Aspiration used in the car: std or turbo (Categorical) |
| 6 | doornumber | Number of doors in the car (Categorical) |
| 7 | carbody | Body of the car: convertible, sedan, wagon … (Categorical) |
| 8 | drivewheel | Type of drive wheel: 4wd, fwd or rwd (Categorical) |
| 9 | enginelocation | Location of the car engine (Categorical) |
| 10 | wheelbase | Wheelbase of the car: distance between axles, in inches (Numeric) |
| 11 | carlength | Length of the car (Numeric) |
| 12 | carwidth | Width of the car (Numeric) |
| 13 | carheight | Height of the car (Numeric) |
| 14 | curbweight | Weight of the car without occupants or baggage (Numeric) |
| 15 | enginetype | Type of engine (Categorical) |
| 16 | cylindernumber | Number of cylinders in the car (Categorical) |
| 17 | enginesize | Size of the engine, in … (Numeric) |
| 18 | fuelsystem | Fuel system of the car (Categorical) |
| 19 | boreratio | Bore ratio of the engine (Numeric) |
| 20 | stroke | Stroke or volume inside the engine (Numeric) |
| 21 | compressionratio | Compression ratio of the engine (Numeric) |
| 22 | horsepower | Horsepower of the car (Numeric) |
| 23 | peakrpm | Peak revolutions per minute of the engine (Numeric) |
| 24 | citympg | Mileage in the city, in miles per gallon (Numeric) |
| 25 | highwaympg | Mileage on the highway, in miles per gallon (Numeric) |
| 26 | price (dependent variable) | Price of the car in dollars (Numeric) |
Variables of no statistical interest are removed, namely columns 1 and 3, car_ID and CarName.
datos <- datos[, c(2,4:26)]
The first records again:
kable(head(datos, 10), caption = "Car price data") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.00 |
| 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 1 | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 2 | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 2 | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.00 |
| 2 | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 1 | gas | std | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 1 | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 1 | gas | turbo | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 0 | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
The training data are 80% of the observations and the validation data the remaining 20%.
n <- nrow(datos)
set.seed(1551) # Seed
entrena <- createDataPartition(y = datos$price, p = 0.80, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ] # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]
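As a rough illustration of what the split does, the following base R sketch partitions a data frame 80/20 with `sample()`. It is a simplification: for a numeric outcome, `createDataPartition()` also samples within percentile groups of the outcome, which plain `sample()` does not. The data frame `datos_demo` and its columns are made up for the example.

```r
# Hypothetical stand-in data; not the car-price data set
set.seed(1551)
datos_demo <- data.frame(x = rnorm(205), price = runif(205, 5000, 45000))

# Draw 80% of the row indices at random for training
n_demo <- nrow(datos_demo)
indices <- sample(seq_len(n_demo), size = floor(0.80 * n_demo))

# Rows selected go to training, the rest to validation
entrenamiento_demo <- datos_demo[indices, ]
validacion_demo <- datos_demo[-indices, ]

# The two parts are disjoint and together cover every row
nrow(entrenamiento_demo) + nrow(validacion_demo) == n_demo
```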
kable(head(datos.entrenamiento, 10), caption = "Training data. Car prices") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| | symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.00 |
| 3 | 1 | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.00 |
| 4 | 2 | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950.00 |
| 5 | 2 | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450.00 |
| 6 | 2 | gas | std | two | sedan | fwd | front | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250.00 |
| 7 | 1 | gas | std | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710.00 |
| 8 | 1 | gas | std | four | wagon | fwd | front | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 18920.00 |
| 9 | 1 | gas | turbo | four | sedan | fwd | front | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 8.3 | 140 | 5500 | 17 | 20 | 23875.00 |
| 10 | 0 | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | ohc | five | 131 | mpfi | 3.13 | 3.40 | 7.0 | 160 | 5500 | 16 | 22 | 17859.17 |
| 12 | 0 | gas | std | four | sedan | rwd | front | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | ohc | four | 108 | mpfi | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16925.00 |
kable(head(datos.validacion, 10), caption = "Validation data. Car prices") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| | symboling | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | carlength | carwidth | carheight | curbweight | enginetype | cylindernumber | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 11 | 2 | gas | std | two | sedan | rwd | front | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | ohc | four | 108 | mpfi | 3.50 | 2.80 | 8.8 | 101 | 5800 | 23 | 29 | 16430 |
| 13 | 0 | gas | std | two | sedan | rwd | front | 101.2 | 176.8 | 64.8 | 54.3 | 2710 | ohc | six | 164 | mpfi | 3.31 | 3.19 | 9.0 | 121 | 4250 | 21 | 28 | 20970 |
| 15 | 1 | gas | std | four | sedan | rwd | front | 103.5 | 189.0 | 66.9 | 55.7 | 3055 | ohc | six | 164 | mpfi | 3.31 | 3.19 | 9.0 | 121 | 4250 | 20 | 25 | 24565 |
| 18 | 0 | gas | std | four | sedan | rwd | front | 110.0 | 197.0 | 70.9 | 56.3 | 3505 | ohc | six | 209 | mpfi | 3.62 | 3.39 | 8.0 | 182 | 5400 | 15 | 20 | 36880 |
| 23 | 1 | gas | std | two | hatchback | fwd | front | 93.7 | 157.3 | 63.8 | 50.8 | 1876 | ohc | four | 90 | 2bbl | 2.97 | 3.23 | 9.4 | 68 | 5500 | 31 | 38 | 6377 |
| 28 | 1 | gas | turbo | two | sedan | fwd | front | 93.7 | 157.3 | 63.8 | 50.6 | 2191 | ohc | four | 98 | mpfi | 3.03 | 3.39 | 7.6 | 102 | 5500 | 24 | 30 | 8558 |
| 32 | 2 | gas | std | two | hatchback | fwd | front | 86.6 | 144.6 | 63.9 | 50.8 | 1819 | ohc | four | 92 | 1bbl | 2.91 | 3.41 | 9.2 | 76 | 6000 | 31 | 38 | 6855 |
| 36 | 0 | gas | std | four | sedan | fwd | front | 96.5 | 163.4 | 64.0 | 54.5 | 2010 | ohc | four | 92 | 1bbl | 2.91 | 3.41 | 9.2 | 76 | 6000 | 30 | 34 | 7295 |
| 42 | 0 | gas | std | four | sedan | fwd | front | 96.5 | 175.4 | 65.2 | 54.1 | 2465 | ohc | four | 110 | mpfi | 3.15 | 3.58 | 9.0 | 101 | 5800 | 24 | 28 | 12945 |
The multiple linear regression model (rm) is built: price as a function of all the independent variables, both numeric and non-numeric.
The expression price ~ . means price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg
# Multiple linear regression model, used to see which variables matter
#modelo_rm <- lm(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento)
modelo_rm <- lm(formula = price ~ . ,
data = datos.entrenamiento)
summary(modelo_rm)
##
## Call:
## lm(formula = price ~ ., data = datos.entrenamiento)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4959.6 -1040.0 -37.7 916.5 10233.2
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.148e+04 2.180e+04 -0.527 0.599385
## symboling 3.200e+01 2.881e+02 0.111 0.911738
## fueltypegas -1.211e+04 8.527e+03 -1.420 0.158153
## aspirationturbo 1.915e+03 1.072e+03 1.788 0.076286 .
## doornumbertwo 4.653e+02 6.810e+02 0.683 0.495731
## carbodyhardtop -3.145e+03 1.561e+03 -2.015 0.046083 *
## carbodyhatchback -3.445e+03 1.420e+03 -2.425 0.016729 *
## carbodysedan -2.227e+03 1.557e+03 -1.430 0.155236
## carbodywagon -3.397e+03 1.691e+03 -2.009 0.046715 *
## drivewheelfwd 1.483e+03 1.449e+03 1.023 0.308095
## drivewheelrwd 2.220e+03 1.608e+03 1.381 0.169838
## enginelocationrear 7.568e+03 2.900e+03 2.610 0.010177 *
## wheelbase 8.782e+00 1.215e+02 0.072 0.942492
## carlength -8.740e+01 5.728e+01 -1.526 0.129575
## carwidth 6.572e+02 2.945e+02 2.231 0.027455 *
## carheight 4.278e+01 1.521e+02 0.281 0.779007
## curbweight 5.768e+00 2.276e+00 2.534 0.012513 *
## enginetypedohcv -1.143e+04 5.830e+03 -1.961 0.052152 .
## enginetypel -8.761e+02 1.858e+03 -0.471 0.638124
## enginetypeohc 3.010e+03 1.046e+03 2.879 0.004706 **
## enginetypeohcf 1.184e+03 1.810e+03 0.654 0.514240
## enginetypeohcv -5.501e+03 1.434e+03 -3.835 0.000198 ***
## enginetyperotor -5.592e+03 6.179e+03 -0.905 0.367249
## cylindernumberfive -1.195e+04 4.282e+03 -2.791 0.006088 **
## cylindernumberfour -1.309e+04 4.676e+03 -2.798 0.005958 **
## cylindernumbersix -1.000e+04 3.498e+03 -2.859 0.004986 **
## cylindernumberthree -4.713e+03 5.850e+03 -0.806 0.421944
## cylindernumbertwelve -1.266e+04 4.884e+03 -2.591 0.010715 *
## cylindernumbertwo NA NA NA NA
## enginesize 9.945e+01 3.436e+01 2.894 0.004493 **
## fuelsystem2bbl -6.639e+00 1.034e+03 -0.006 0.994885
## fuelsystem4bbl NA NA NA NA
## fuelsystemidi NA NA NA NA
## fuelsystemmfi -4.602e+03 2.819e+03 -1.632 0.105120
## fuelsystemmpfi -1.329e+02 1.164e+03 -0.114 0.909328
## fuelsystemspdi -3.683e+03 1.632e+03 -2.256 0.025798 *
## fuelsystemspfi -1.144e+03 2.720e+03 -0.421 0.674804
## boreratio -1.313e+03 1.811e+03 -0.725 0.469777
## stroke -4.198e+03 1.033e+03 -4.066 8.45e-05 ***
## compressionratio -8.906e+02 6.355e+02 -1.401 0.163584
## horsepower 1.554e+01 2.619e+01 0.593 0.553948
## peakrpm 2.078e+00 7.494e-01 2.773 0.006408 **
## citympg -1.108e+02 1.908e+02 -0.581 0.562496
## highwaympg 1.819e+02 1.785e+02 1.019 0.310252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2278 on 124 degrees of freedom
## Multiple R-squared: 0.9392, Adjusted R-squared: 0.9197
## F-statistic: 47.93 on 40 and 124 DF, p-value: < 2.2e-16
Which variables are predictors above the 90% confidence level?
Some of them are:
carbodyhardtop
carbodyhatchback
enginetypeohcv
cylindernumbersix
enginesize
stroke
The intercept has a confidence level of approximately 40% (p-value 0.599).
Several coefficients are at or above the 90% confidence level, that is, their p-value is at most 0.10.
Because some predictors fall below the 90% confidence level, one may want to build a model with only the predictors at or above it.
In multiple linear models, the statistic Adjusted R-squared: 0.9197 means that the independent variables explain approximately 91.97% of the variance of the dependent variable, price.
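The confidence levels quoted above are read as 1 minus the p-value in the coefficient table. The sketch below shows one way to extract that table and filter it programmatically. It uses the built-in mtcars data as a stand-in so that it runs on its own; `modelo_demo` and the column names are hypothetical, and the same steps could be applied to `modelo_rm`.

```r
# Stand-in data: built-in mtcars, not the car-price data
modelo_demo <- lm(mpg ~ wt + hp, data = mtcars)
resumen <- summary(modelo_demo)

# Coefficient table: estimate, standard error, t value and p-value
coeficientes <- as.data.frame(resumen$coefficients)
names(coeficientes) <- c("estimado", "error_std", "valor_t", "valor_p")

# "Confidence" as used in this report: 1 minus the p-value
coeficientes$confianza <- 1 - coeficientes$valor_p

# Keep the terms at or above 90% confidence (p-value of at most 0.10)
significativos <- coeficientes[coeficientes$valor_p <= 0.10, ]
print(significativos)

# Adjusted R-squared of the fit
print(resumen$adj.r.squared)
```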
# datos.validacion.new <- datos.validacion[c(-11),]
# datos.validacion.new$cylindernumber[which(!(datos.validacion.new$cylindernumber %in% unique(datos.entrenamiento$cylindernumber)))] <- NA
# datos.validacion.new
# datos.validacion
predicciones_rm <- predict(modelo_rm, datos.validacion)
## Warning in predict.lm(modelo_rm, datos.validacion): prediction from a rank-
## deficient fit may be misleading
predicciones_rm
## 1 11 13 15 18 23 28 32
## 14710.711 13782.665 19751.986 21267.491 32944.687 5939.432 12252.774 7626.281
## 36 42 44 52 55 59 64 65
## 7468.358 10167.990 7975.010 5992.939 6296.180 14328.324 10732.285 9764.992
## 67 72 73 76 77 79 82 94
## 12852.464 37254.103 40506.504 20620.041 6687.694 7302.900 9616.899 4789.292
## 101 105 111 115 125 145 149 155
## 10675.790 17940.528 14811.987 15129.207 14842.697 7176.917 6931.931 4646.491
## 159 160 163 169 174 193 198 200
## 7258.606 7588.986 7778.102 12595.720 8747.760 10541.164 16374.353 19657.370
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rm)
comparaciones
## precio_real precio_predicciones
## 1 13495 14710.711
## 11 16430 13782.665
## 13 20970 19751.986
## 15 24565 21267.491
## 18 36880 32944.687
## 23 6377 5939.432
## 28 8558 12252.774
## 32 6855 7626.281
## 36 7295 7468.358
## 42 12945 10167.990
## 44 6785 7975.010
## 52 6095 5992.939
## 55 7395 6296.180
## 59 15645 14328.324
## 64 10795 10732.285
## 65 11245 9764.992
## 67 18344 12852.464
## 72 34184 37254.103
## 73 35056 40506.504
## 76 16503 20620.041
## 77 5389 6687.694
## 79 6669 7302.900
## 82 8499 9616.899
## 94 7349 4789.292
## 101 9549 10675.790
## 105 17199 17940.528
## 111 13860 14811.987
## 115 17075 15129.207
## 125 12764 14842.697
## 145 9233 7176.917
## 149 8013 6931.931
## 155 7898 4646.491
## 159 7898 7258.606
## 160 7788 7588.986
## 163 9258 7778.102
## 169 9639 12595.720
## 174 8948 8747.760
## 193 13845 10541.164
## 198 16515 16374.353
## 200 18950 19657.370
Having used seed 2023 and run the tests, the conclusion is that the training data must cover every value that the categorical variables take in the validation data; that is, the validation data must not contain values the model was not trained on.
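One way to check this condition before predicting is to compare, for each categorical column, the values present in each partition. The following is a minimal sketch under that assumption; `niveles_no_vistos` and the toy data frames are hypothetical names invented for the example.

```r
# Returns, per factor column, the values that appear in validation
# but never in training (empty list when the split is safe)
niveles_no_vistos <- function(entrenamiento, validacion) {
  columnas <- names(entrenamiento)[sapply(entrenamiento, is.factor)]
  faltantes <- lapply(columnas, function(col) {
    setdiff(unique(as.character(validacion[[col]])),
            unique(as.character(entrenamiento[[col]])))
  })
  names(faltantes) <- columnas
  faltantes[lengths(faltantes) > 0]
}

# Toy data: "twelve" occurs only in the validation part
entrena_demo <- data.frame(cylindernumber = factor(c("four", "six", "four")),
                           price = c(1, 2, 3))
valida_demo <- data.frame(cylindernumber = factor(c("four", "twelve")),
                          price = c(4, 5))

resultado <- niveles_no_vistos(entrena_demo, valida_demo)
print(resultado)
```

If the result is not empty, a different seed or a split that keeps rare levels in the training data would be needed.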
kable(head(comparaciones, 10), caption = "Multiple linear regression. Real prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 1 | 13495 | 14710.711 |
| 11 | 16430 | 13782.665 |
| 13 | 20970 | 19751.986 |
| 15 | 24565 | 21267.491 |
| 18 | 36880 | 32944.687 |
| 23 | 6377 | 5939.432 |
| 28 | 8558 | 12252.774 |
| 32 | 6855 | 7626.281 |
| 36 | 7295 | 7468.358 |
| 42 | 12945 | 10167.990 |
rmse_rm <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rm
## [1] 2295.975
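The RMSE reported here is the square root of the mean squared difference between real and predicted prices. A minimal base R sketch of that formula follows; `rmse_manual` is a name chosen for this illustration, and on the same inputs it should agree with `rmse()` from Metrics. The three value pairs are the first three rows of the comparison table above.

```r
# Root mean squared error written out in base R
rmse_manual <- function(real, prediccion) {
  sqrt(mean((real - prediccion)^2))
}

# First three validation rows from the comparison table
precio_real <- c(13495, 16430, 20970)
precio_prediccion <- c(14710.711, 13782.665, 19751.986)

rmse_manual(precio_real, precio_prediccion)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.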
The regression tree model (ar) is built.
modelo_ar <- rpart(formula = price ~ symboling + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + wheelbase + carlength + carwidth + carheight + curbweight + enginetype + cylindernumber + enginesize + fuelsystem + boreratio + stroke + compressionratio + horsepower + peakrpm + citympg + highwaympg, data = datos.entrenamiento )
modelo_ar
## n= 165
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 165 10595840000 13266.490
## 2) enginesize< 182 150 3052455000 11187.830
## 4) curbweight< 2659.5 99 530480300 8589.020
## 8) curbweight< 2291.5 60 88568540 7272.717 *
## 9) curbweight>=2291.5 39 178015000 10614.100 *
## 5) curbweight>=2659.5 51 555416900 16232.590
## 10) carwidth< 68.6 43 346463700 15462.490 *
## 11) carwidth>=68.6 8 46382600 20371.880 *
## 3) enginesize>=182 15 414058200 34053.030 *
rpart.plot(modelo_ar)
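The tree printed above splits only on enginesize, curbweight and carwidth. An rpart fit also stores an importance score per variable in its `variable.importance` element, which could be inspected for `modelo_ar` in the same way. The sketch below uses the built-in mtcars data as a stand-in so that it runs on its own; `arbol_demo` is a hypothetical name.

```r
library(rpart)

# Stand-in data: built-in mtcars, predicting mpg instead of price
arbol_demo <- rpart(mpg ~ ., data = mtcars)

# Named numeric vector with one importance score per variable used
importancia <- arbol_demo$variable.importance
print(importancia)
```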
predicciones_ar <- predict(object = modelo_ar, newdata = datos.validacion)
predicciones_ar
## 1 11 13 15 18 23 28 32
## 10614.103 10614.103 15462.492 15462.492 34053.033 7272.717 7272.717 7272.717
## 36 42 44 52 55 59 64 65
## 7272.717 10614.103 10614.103 7272.717 7272.717 10614.103 10614.103 10614.103
## 67 72 73 76 77 79 82 94
## 15462.492 34053.033 34053.033 15462.492 7272.717 7272.717 10614.103 7272.717
## 101 105 111 115 125 145 149 155
## 10614.103 15462.492 15462.492 15462.492 15462.492 10614.103 10614.103 7272.717
## 159 160 163 169 174 193 198 200
## 7272.717 7272.717 7272.717 10614.103 10614.103 10614.103 15462.492 15462.492
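Each prediction above is simply the mean price of the leaf that the car falls into, which is why only five distinct values can occur. The sketch below transcribes the printed tree by hand into a plain function to make that explicit; `predecir_arbol` is a hypothetical helper, and its leaf values are copied from the printout, so they carry its rounding.

```r
# Hand-written transcription of the splits printed by modelo_ar
predecir_arbol <- function(enginesize, curbweight, carwidth) {
  if (enginesize >= 182) return(34053.03)  # node 3
  if (curbweight < 2291.5) return(7272.717) # node 8
  if (curbweight < 2659.5) return(10614.10) # node 9
  if (carwidth < 68.6) return(15462.49)     # node 10
  20371.88                                  # node 11
}

# Validation row 1: enginesize 130, curbweight 2548, carwidth 64.1
predecir_arbol(130, 2548, 64.1)

# Validation row 18: enginesize 209, curbweight 3505, carwidth 70.9
predecir_arbol(209, 3505, 70.9)
```

Both calls land on the same leaves as the predictions listed for rows 1 and 18.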
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_ar)
kable(head(comparaciones, 10), caption = "Regression tree. Real prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 1 | 13495 | 10614.103 |
| 11 | 16430 | 10614.103 |
| 13 | 20970 | 15462.492 |
| 15 | 24565 | 15462.492 |
| 18 | 36880 | 34053.033 |
| 23 | 6377 | 7272.717 |
| 28 | 8558 | 7272.717 |
| 32 | 6855 | 7272.717 |
| 36 | 7295 | 7272.717 |
| 42 | 12945 | 10614.103 |
rmse_ar <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_ar
## [1] 2691.051
The random forest model (rf) is built with 20 trees.
modelo_rf <- randomForest(x = datos.entrenamiento[,c("symboling", "fueltype", "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginetype", "cylindernumber", "enginesize", "fuelsystem", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")],
y = datos.entrenamiento[,'price'],
importance = TRUE,
keep.forest = TRUE,
ntree=20)
modelo_rf
##
## Call:
## randomForest(x = datos.entrenamiento[, c("symboling", "fueltype", "aspiration", "doornumber", "carbody", "drivewheel", "enginelocation", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginetype", "cylindernumber", "enginesize", "fuelsystem", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg")], y = datos.entrenamiento[, "price"], ntree = 20, importance = TRUE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 20
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 5371094
## % Var explained: 91.64
as.data.frame(modelo_rf$importance) %>%
arrange(desc(IncNodePurity))
## %IncMSE IncNodePurity
## enginesize 26910008.629 3702488791
## curbweight 14360829.295 1599583921
## horsepower 8462472.900 1026091239
## citympg 8038553.835 957014842
## cylindernumber 4807837.119 888881639
## highwaympg 2806540.481 559155423
## carlength 3904724.342 332871131
## carwidth 3681928.229 325020958
## drivewheel 1715607.684 217508388
## fuelsystem 1222223.750 127887710
## compressionratio 1207626.985 97623841
## carbody 40172.029 92535600
## wheelbase 954497.400 67318014
## stroke 607817.535 66351466
## enginetype -104401.560 63265000
## peakrpm 452339.352 60232721
## boreratio 428467.730 56290145
## carheight 131513.279 21311426
## symboling -114239.035 14123969
## aspiration 46348.829 10602677
## fueltype -2311.465 3225197
## doornumber 16557.385 2051540
## enginelocation 0.000 0
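To give an intuition for what the forest does, the sketch below fits 20 rpart trees on bootstrap samples and averages their predictions. This is plain bagging, a simplification: randomForest also tries only a random subset of variables at each split (7 here, as the printout shows), which this sketch omits. It uses the built-in mtcars data as a stand-in, and all object names are hypothetical.

```r
library(rpart)

set.seed(1551)
n_arboles <- 20

# Fit each tree on a bootstrap sample (rows drawn with replacement)
arboles <- lapply(seq_len(n_arboles), function(i) {
  filas <- sample(nrow(mtcars), replace = TRUE)
  rpart(mpg ~ ., data = mtcars[filas, ])
})

# One column of predictions per tree, one row per observation
predicciones_arboles <- sapply(arboles, predict, newdata = mtcars)

# Forest-style prediction: average across the trees
prediccion_promedio <- rowMeans(predicciones_arboles)
head(prediccion_promedio)
```

Averaging many trees smooths out the step-like predictions of a single tree, which is consistent with the forest producing more varied predicted prices than the five leaf values of the regression tree.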
predicciones_rf <- predict(object = modelo_rf, newdata = datos.validacion)
predicciones_rf
## 1 11 13 15 18 23 28 32
## 15810.722 14940.053 18235.344 16487.647 36844.230 5976.323 9514.695 6624.212
## 36 42 44 52 55 59 64 65
## 7342.855 12290.252 8652.845 6110.136 6659.485 14799.688 9763.007 9462.717
## 67 72 73 76 77 79 82 94
## 11777.299 33160.950 31006.858 18712.623 5869.481 6371.626 8470.255 7671.823
## 101 105 111 115 125 145 149 155
## 9249.823 17719.641 18112.927 18112.927 13797.178 10883.801 9521.793 8282.809
## 159 160 163 169 174 193 198 200
## 8157.163 8023.116 7920.311 10123.508 10548.757 10317.142 14710.165 16643.917
comparaciones <- data.frame(precio_real = datos.validacion$price, precio_predicciones = predicciones_rf)
kable(head(comparaciones, 10), caption = "Random forest. Real prices vs. predicted prices. First 10 predictions") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| | precio_real | precio_predicciones |
|---|---|---|
| 1 | 13495 | 15810.722 |
| 11 | 16430 | 14940.053 |
| 13 | 20970 | 18235.344 |
| 15 | 24565 | 16487.647 |
| 18 | 36880 | 36844.230 |
| 23 | 6377 | 5976.323 |
| 28 | 8558 | 9514.695 |
| 32 | 6855 | 6624.212 |
| 36 | 7295 | 7342.855 |
| 42 | 12945 | 12290.252 |
rmse_rf <- rmse(comparaciones$precio_real, comparaciones$precio_predicciones)
rmse_rf
## [1] 2281.42
The predictions are compared.
comparaciones <- data.frame(cbind(datos.validacion[c("price")], predicciones_rm, predicciones_ar, predicciones_rf))
The predictions of each model are displayed.
kable(comparaciones, caption = "Real price and the predictions of each model") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered", "condensed")) %>%
  kable_paper("hover")
| | price | predicciones_rm | predicciones_ar | predicciones_rf |
|---|---|---|---|---|
| 1 | 13495 | 14710.711 | 10614.103 | 15810.722 |
| 11 | 16430 | 13782.665 | 10614.103 | 14940.053 |
| 13 | 20970 | 19751.986 | 15462.492 | 18235.344 |
| 15 | 24565 | 21267.491 | 15462.492 | 16487.647 |
| 18 | 36880 | 32944.687 | 34053.033 | 36844.230 |
| 23 | 6377 | 5939.432 | 7272.717 | 5976.323 |
| 28 | 8558 | 12252.774 | 7272.717 | 9514.695 |
| 32 | 6855 | 7626.281 | 7272.717 | 6624.212 |
| 36 | 7295 | 7468.358 | 7272.717 | 7342.855 |
| 42 | 12945 | 10167.990 | 10614.103 | 12290.252 |
| 44 | 6785 | 7975.010 | 10614.103 | 8652.845 |
| 52 | 6095 | 5992.939 | 7272.717 | 6110.136 |
| 55 | 7395 | 6296.180 | 7272.717 | 6659.485 |
| 59 | 15645 | 14328.324 | 10614.103 | 14799.688 |
| 64 | 10795 | 10732.285 | 10614.103 | 9763.007 |
| 65 | 11245 | 9764.992 | 10614.103 | 9462.717 |
| 67 | 18344 | 12852.464 | 15462.492 | 11777.299 |
| 72 | 34184 | 37254.103 | 34053.033 | 33160.950 |
| 73 | 35056 | 40506.504 | 34053.033 | 31006.858 |
| 76 | 16503 | 20620.041 | 15462.492 | 18712.623 |
| 77 | 5389 | 6687.694 | 7272.717 | 5869.481 |
| 79 | 6669 | 7302.900 | 7272.717 | 6371.626 |
| 82 | 8499 | 9616.899 | 10614.103 | 8470.255 |
| 94 | 7349 | 4789.292 | 7272.717 | 7671.823 |
| 101 | 9549 | 10675.790 | 10614.103 | 9249.823 |
| 105 | 17199 | 17940.528 | 15462.492 | 17719.641 |
| 111 | 13860 | 14811.987 | 15462.492 | 18112.928 |
| 115 | 17075 | 15129.207 | 15462.492 | 18112.928 |
| 125 | 12764 | 14842.697 | 15462.492 | 13797.177 |
| 145 | 9233 | 7176.917 | 10614.103 | 10883.801 |
| 149 | 8013 | 6931.931 | 10614.103 | 9521.793 |
| 155 | 7898 | 4646.491 | 7272.717 | 8282.809 |
| 159 | 7898 | 7258.606 | 7272.717 | 8157.163 |
| 160 | 7788 | 7588.986 | 7272.717 | 8023.116 |
| 163 | 9258 | 7778.102 | 7272.717 | 7920.311 |
| 169 | 9639 | 12595.720 | 10614.103 | 10123.508 |
| 174 | 8948 | 8747.760 | 10614.103 | 10548.757 |
| 193 | 13845 | 10541.164 | 10614.103 | 10317.142 |
| 198 | 16515 | 16374.353 | 15462.492 | 14710.165 |
| 200 | 18950 | 19657.370 | 15462.492 | 16643.917 |
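The RMSE of each model is compared
# Named rmse_modelos so the data frame does not shadow the rmse() function
rmse_modelos <- data.frame(rm = rmse_rm, ar = rmse_ar, rf = rmse_rf)
kable(rmse_modelos, caption = "RMSE statistic of each model") %>%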
kable_styling(full_width = F, bootstrap_options = c("striped", "bordered", "condensed")) %>%
kable_paper("hover")
| rm | ar | rf |
|---|---|---|
| 2295.975 | 2691.051 | 2281.42 |
First of all, I should note that I am not using the seed assigned to me (1550) but the next one up (1551). When generating predictions with the multiple linear regression model, I ran into an error caused by a value that occurs only once in the data set (cylindernumber twelve). Because that value exists in a single row, it ends up in only one of the two partitions, and the prediction step failed with an error saying that a value from one partition (training data) could not be matched against the validation data, where it did not exist. I tried to fix it in two ways. The first was to turn that value into NA: the prediction error disappeared, but the comparisons and the RMSE calculation then returned NA, presumably because the missing value propagated through the results. The second was to remove that observation from the comparison, but I decided against it, since discarding data just to make the code run seemed to me one of the worst things a data analyst could do. This happened only in RStudio; in Python I was able to run the models with seed 1550.
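One alternative to changing the seed is to make sure that any category occurring only once is always placed in the training partition. The following is a minimal sketch under stated assumptions, not the procedure used in this report: the data frame `datos` is toy data invented for illustration, and all object names are hypothetical.

```r
# Toy data (hypothetical, NOT the CarPrice data set): one category appears once
set.seed(1551)
datos <- data.frame(
  cylindernumber = c(rep("four", 10), rep("six", 6), "twelve"),
  price = c(rnorm(10, 8000, 500), rnorm(6, 15000, 800), 36000)
)

# Rows whose category occurs exactly once in the whole data set
conteo <- table(datos$cylindernumber)
unicos <- which(datos$cylindernumber %in% names(conteo[conteo == 1]))

# 80/20 split: singleton rows are forced into training,
# the rest of the training rows are sampled at random
resto <- setdiff(seq_len(nrow(datos)), unicos)
n_train <- round(0.8 * nrow(datos)) - length(unicos)
idx_train <- c(unicos, sample(resto, n_train))
entrenamiento <- datos[idx_train, ]
validacion <- datos[-idx_train, ]

# Categories present in validation but missing from training
# (empty for this toy data; singletons can no longer cause it)
niveles_nuevos <- setdiff(unique(validacion$cylindernumber),
                          unique(entrenamiento$cylindernumber))

modelo <- lm(price ~ cylindernumber, data = entrenamiento)
predicciones <- predict(modelo, newdata = validacion)
```

The trade-off is that the singleton row is never used for validation, so the model's error on that category is not measured. Categories with only a few rows could still end up entirely in validation, which this sketch does not handle.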
Car price data were loaded with all variables, both numeric and categorical, and an 80/20 split was used for the training and validation data, respectively. Seed 1551 was used to create the training and validation partitions.
As an additional note, the summary of the multiple linear regression model shows new variables that are expansions of the values of the categorical columns. For example, for the number-of-cylinders column the model creates indicator (dummy) variables for the distinct values of cylindernumber, keeping one value as the reference level. Each new variable is named by appending the value to the column name, in this case the numbers from two (cylindernumbertwo) through six (cylindernumbersix).
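This expansion can be inspected directly with `model.matrix()`, which builds the design matrix that `lm()` uses internally. A minimal sketch on a toy factor (the data frame `ejemplo` is hypothetical and contains only three of the categories):

```r
# Toy factor with three cylinder categories (hypothetical data)
ejemplo <- data.frame(cylindernumber = factor(c("two", "four", "six", "four")))

# One indicator column per level, except the reference level
matriz <- model.matrix(~ cylindernumber, data = ejemplo)
colnames(matriz)
# "(Intercept)" "cylindernumbersix" "cylindernumbertwo"
```

Levels are ordered alphabetically by default, so `four` becomes the reference level here and has no column of its own; its effect is absorbed into the intercept.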
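Using the multiple linear regression model, I obtained an Adjusted R-squared of 0.9197, which means the model explains roughly 91.97% of the variance in price on the training data, after adjusting for the number of predictors. It is a measure of fit, not of certainty about individual predictions.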
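The adjustment penalizes R-squared for the number of predictors: with n observations and p predictors, the adjusted value is 1 - (1 - R²)(n - 1)/(n - p - 1). A minimal sketch that checks this against `summary()`, using R's built-in `mtcars` data as a stand-in (an assumption for illustration; it is not the car price data set):

```r
# Illustrative fit on the built-in mtcars data (not the CarPrice data)
modelo <- lm(mpg ~ wt + hp, data = mtcars)
resumen <- summary(modelo)

n <- nrow(mtcars)               # number of observations
p <- length(coef(modelo)) - 1   # number of predictors, excluding the intercept

r2 <- resumen$r.squared
r2_ajustado <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same value that summary() reports as Adjusted R-squared
all.equal(r2_ajustado, resumen$adj.r.squared)
```

With many dummy variables, as in the model discussed here, p grows quickly, which is why the adjusted value is the more honest of the two to report.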
With seed 1551, multiple linear regression (RMSE 2295.975) and random forest (RMSE 2281.42) were practically tied, with random forest marginally lower in the table above and the regression tree clearly behind (RMSE 2691.051). With seed 1550, the best model was random forest, with an RMSE of 2184.541.
Finally, I believe that the cylindernumber twelve (12) value distorted the results considerably: with seed 1550, the RMSE values of the first two models came out above 3000, whereas with seed 1551 the RMSE values of the three models were closer together, all within a range of about 400 (2281.42 to 2691.051).
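The ranking and the spread quoted above can be checked with a few lines of base R. This is a small sketch using the three RMSE values copied from the table for seed 1551; the object names are hypothetical:

```r
# RMSE values copied from the comparison table (seed 1551)
rmse_por_modelo <- c(rm = 2295.975, ar = 2691.051, rf = 2281.42)

# Model with the lowest RMSE and spread between best and worst
mejor_modelo <- names(which.min(rmse_por_modelo))
amplitud <- diff(range(rmse_por_modelo))

sort(rmse_por_modelo)
mejor_modelo   # "rf"
amplitud       # 409.631
```

A gap of about 15 between rf and rm is small relative to the spread of roughly 410, so a different seed could plausibly reverse their order.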