Este ejercicio consiste en realizar un análisis exploratorio sobre un dataset de vehiculos Toyota Corolla con 1436 instancias y 37 atributos.
El objetivo es conseguir un modelo de regresión lineal con un resultado aceptable interpretando cada paso del razonamiento necesario para llegar al objetivo.
library(fastDummies)
library(car)
library(corrplot)
library(mctest)
library(tidyverse)
library(leaps)
library(glmnet)
library(MASS)
library(reshape)
library(caret)
library(ggrepel)
raw_data = read.csv("ToyotaCorolla.csv")
raw_data$Id = NULL
raw_data$Model = NULL
raw_data
str(raw_data)
'data.frame': 1436 obs. of 35 variables:
$ Price : int 13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
$ Age_08_04 : int 23 23 24 26 30 32 27 30 27 23 ...
$ Mfg_Month : int 10 10 9 7 3 1 6 3 6 10 ...
$ Mfg_Year : int 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
$ KM : int 46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
$ Fuel_Type : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
$ HP : int 90 90 90 90 90 90 90 90 192 69 ...
$ Met_Color : int 1 1 1 0 0 0 1 1 0 0 ...
$ Automatic : int 0 0 0 0 0 0 0 0 0 0 ...
$ cc : int 2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
$ Doors : int 3 3 3 3 3 3 3 3 3 3 ...
$ Cylinders : int 4 4 4 4 4 4 4 4 4 4 ...
$ Gears : int 5 5 5 5 5 5 5 5 5 5 ...
$ Quarterly_Tax : int 210 210 210 210 210 210 210 210 100 185 ...
$ Weight : int 1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
$ Mfr_Guarantee : int 0 0 1 1 1 0 0 1 0 0 ...
$ BOVAG_Guarantee : int 1 1 1 1 1 1 1 1 1 1 ...
$ Guarantee_Period: int 3 3 3 3 3 3 3 3 3 3 ...
$ ABS : int 1 1 1 1 1 1 1 1 1 1 ...
$ Airbag_1 : int 1 1 1 1 1 1 1 1 1 1 ...
$ Airbag_2 : int 1 1 1 1 1 1 1 1 0 1 ...
$ Airco : int 0 1 0 0 1 1 1 1 1 1 ...
$ Automatic_airco : int 0 0 0 0 0 0 0 0 0 0 ...
$ Boardcomputer : int 1 1 1 1 1 1 1 1 0 1 ...
$ CD_Player : int 0 1 0 0 0 0 0 1 0 0 ...
$ Central_Lock : int 1 1 0 0 1 1 1 1 1 0 ...
$ Powered_Windows : int 1 0 0 0 1 1 1 1 1 0 ...
$ Power_Steering : int 1 1 1 1 1 1 1 1 1 1 ...
$ Radio : int 0 0 0 0 0 0 0 0 1 0 ...
$ Mistlamps : int 0 0 0 0 1 1 0 0 0 0 ...
$ Sport_Model : int 0 0 0 0 0 0 1 0 0 0 ...
$ Backseat_Divider: int 1 1 1 1 1 1 1 1 0 1 ...
$ Metallic_Rim : int 0 0 0 0 0 0 0 0 1 0 ...
$ Radio_cassette : int 0 0 0 0 0 0 0 0 1 0 ...
$ Tow_Bar : int 0 0 0 0 0 0 0 0 0 0 ...
summary(raw_data)
Price Age_08_04 Mfg_Month Mfg_Year KM
Min. : 4350 Min. : 1.00 Min. : 1.000 Min. :1998 Min. : 1
1st Qu.: 8450 1st Qu.:44.00 1st Qu.: 3.000 1st Qu.:1998 1st Qu.: 43000
Median : 9900 Median :61.00 Median : 5.000 Median :1999 Median : 63390
Mean :10731 Mean :55.95 Mean : 5.549 Mean :2000 Mean : 68533
3rd Qu.:11950 3rd Qu.:70.00 3rd Qu.: 8.000 3rd Qu.:2001 3rd Qu.: 87021
Max. :32500 Max. :80.00 Max. :12.000 Max. :2004 Max. :243000
Fuel_Type HP Met_Color Automatic cc
CNG : 17 Min. : 69.0 Min. :0.0000 Min. :0.00000 Min. : 1300
Diesel: 155 1st Qu.: 90.0 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 1400
Petrol:1264 Median :110.0 Median :1.0000 Median :0.00000 Median : 1600
Mean :101.5 Mean :0.6748 Mean :0.05571 Mean : 1577
3rd Qu.:110.0 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 1600
Max. :192.0 Max. :1.0000 Max. :1.00000 Max. :16000
Doors Cylinders Gears Quarterly_Tax Weight
Min. :2.000 Min. :4 Min. :3.000 Min. : 19.00 Min. :1000
1st Qu.:3.000 1st Qu.:4 1st Qu.:5.000 1st Qu.: 69.00 1st Qu.:1040
Median :4.000 Median :4 Median :5.000 Median : 85.00 Median :1070
Mean :4.033 Mean :4 Mean :5.026 Mean : 87.12 Mean :1072
3rd Qu.:5.000 3rd Qu.:4 3rd Qu.:5.000 3rd Qu.: 85.00 3rd Qu.:1085
Max. :5.000 Max. :4 Max. :6.000 Max. :283.00 Max. :1615
Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS
Min. :0.0000 Min. :0.0000 Min. : 3.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.: 3.000 1st Qu.:1.0000
Median :0.0000 Median :1.0000 Median : 3.000 Median :1.0000
Mean :0.4095 Mean :0.8955 Mean : 3.815 Mean :0.8134
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :36.000 Max. :1.0000
Airbag_1 Airbag_2 Airco Automatic_airco
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :1.0000 Median :1.0000 Median :1.0000 Median :0.00000
Mean :0.9708 Mean :0.7228 Mean :0.5084 Mean :0.05641
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
Boardcomputer CD_Player Central_Lock Powered_Windows
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.0000 Median :0.0000 Median :1.0000 Median :1.000
Mean :0.2946 Mean :0.2187 Mean :0.5801 Mean :0.562
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000
Power_Steering Radio Mistlamps Sport_Model
Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
Mean :0.9777 Mean :0.1462 Mean :0.257 Mean :0.3001
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
Backseat_Divider Metallic_Rim Radio_cassette Tow_Bar
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.7702 Mean :0.2047 Mean :0.1455 Mean :0.2779
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
Observaciones
* El valor maximo de cc es de 16000, demasiado alto considerando la media.
* El atributo Fuel_Type es de tipo char, requerira un proceso de encoding.
* El valor de Cylinder es constante.
par(mfrow=c(1,2))
boxplot(raw_data$Price,main = "Precio Vehiculos Toyota Corolla",
ylab = "Precio ($)", notch = TRUE)
hist(raw_data$Price,main = "Precio Vehiculos Toyota Corolla")
par(mfrow=c(1,2))
boxplot(raw_data$Age_08_04,main = "Age_08_04")
barplot(table(as.factor(raw_data$Age_08_04)),main = "Age_08_04")
par(mfrow=c(1,2))
boxplot(raw_data$Mfg_Year,main = "Año de Mfg_Year")
barplot(table(as.factor(raw_data$Mfg_Year)),main = "Mfg_Year")
par(mfrow=c(1,2))
boxplot(raw_data$KM,main = "KM",
ylab = "KM", notch = TRUE)
hist(raw_data$KM,main = "KM")
par(mfrow=c(1,2))
boxplot(raw_data$HP,main = "HP",
ylab = "HP", notch = FALSE)
barplot(table(as.factor(raw_data$HP)),main = "HP")
par(mfrow=c(1,2))
boxplot(raw_data$cc,main = "Cilindrada",
ylab = "CC", notch = FALSE)
barplot(table(as.factor(raw_data$cc)),main = "Cilindrada")
par(mfrow=c(1,2))
boxplot(raw_data$Quarterly_Tax, main="Quarterly_Tax")
hist(raw_data$Quarterly_Tax)
par(mfrow=c(1,2))
boxplot(raw_data$Weight, main="Peso(KG)")
hist(raw_data$Weight)
lbls <- c("0: No tiene", "1: Tiene")
par(mfrow=c(1,2))
barplot(table(as.factor(raw_data$Fuel_Type)), main="Fuel_Type")
pie(x = table(raw_data$Radio_cassette), labels = lbls, main="Radio Cassete")
par(mfrow=c(1,2))
pie(x = table(raw_data$Metallic_Rim), labels = lbls, main="Metallic Rim")
pie(x = table(raw_data$Backseat_Divider) , labels = lbls, main="Backseat_Divider")
par(mfrow=c(1,3))
pie(x = table(raw_data$Mistlamps) , labels = lbls, main="Mistlamps")
pie(x = table(raw_data$Radio), labels = lbls, main="Radio")
pie(x = table(raw_data$Sport_Model), labels = lbls, main="Sport_Model")
par(mfrow=c(1,3))
pie(x = table(raw_data$Central_Lock), labels = lbls, main="Central_Lock")
pie(x = table(raw_data$CD_Player), labels = lbls, main="CD_Player")
pie(x = table(raw_data$Boardcomputer), labels = lbls, main="Boardcomputer")
par(mfrow=c(1,3))
pie(x = table(raw_data$Airco), labels = lbls, main="Airco")
pie(x = table(raw_data$Airbag_2), labels = lbls, main="Airbag_2")
pie(x = table(raw_data$Airbag_1), labels = lbls,main="Airbag_1")
par(mfrow=c(1,2))
barplot(table(as.factor(raw_data$Guarantee_Period)), main="Guarantee_Period")
pie(x = table(raw_data$Automatic_airco), labels = lbls, main="Automatic_airco")
par(mfrow=c(1,3))
pie(x = table(raw_data$Mfr_Guarantee), labels = lbls, main="Mfr_Guarantee")
barplot(table(as.factor(raw_data$Gears)), main="Gears")
pie(x = table(raw_data$BOVAG_Guarantee), labels = lbls, main="BOVAG_Guarantee")
par(mfrow=c(1,3))
barplot(table(as.factor(raw_data$Doors)), main="Doors")
pie(x = table(raw_data$Automatic),labels = lbls, main="Automatic")
pie(x = table(raw_data$ABS),labels = lbls, main="ABS")
hist(raw_data$Price, col="blue", breaks = 60, freq = F)
lines(density(raw_data$Price), col = "red", lwd=2)
rug(raw_data$Price)
plot(Price~., data=raw_data,col="blue")
corrplot(cor(dplyr::select(raw_data, -c("Fuel_Type"))), type="upper", method="pie")
the standard deviation is zero
imcdiag(dplyr::select(raw_data, -c("Price", "Fuel_Type")), raw_data$Price)
Call:
imcdiag(x = dplyr::select(raw_data, -c("Price", "Fuel_Type")),
y = raw_data$Price)
All Individual Multicollinearity Diagnostics Result
VIF TOL Wi Fi Leamer CVIF Klein
Age_08_04 Inf 0.0000 Inf Inf 0.0000 -Inf 1
Mfg_Month Inf 0.0000 Inf Inf 0.0000 -Inf 1
Mfg_Year Inf 0.0000 Inf Inf 0.0000 -Inf 1
KM 1.8647 0.5363 37.9120 39.1629 0.7323 -0.0556 0
HP 1.6023 0.6241 26.4092 27.2805 0.7900 -0.0478 0
Met_Color 1.1398 0.8773 6.1308 6.3331 0.9367 -0.0340 0
Automatic 1.0805 0.9255 3.5309 3.6474 0.9620 -0.0322 0
cc 1.2170 0.8217 9.5136 9.8275 0.9065 -0.0363 0
Doors 1.2554 0.7966 11.1979 11.5674 0.8925 -0.0374 0
Cylinders 2.0001 0.5000 43.8472 45.2939 NA -0.0596 0
Gears 1.2599 0.7937 11.3958 11.7718 0.8909 -0.0376 0
Quarterly_Tax 2.7801 0.3597 78.0447 80.6197 0.5998 -0.0829 0
Weight 3.2137 0.3112 97.0581 100.2604 0.5578 -0.0958 0
Mfr_Guarantee 1.1983 0.8345 8.6960 8.9830 0.9135 -0.0357 0
BOVAG_Guarantee 1.3712 0.7293 16.2736 16.8105 0.8540 -0.0409 0
Guarantee_Period 1.5381 0.6502 23.5907 24.3691 0.8063 -0.0458 0
ABS 2.2232 0.4498 53.6282 55.3976 0.6707 -0.0663 0
Airbag_1 1.5989 0.6254 26.2590 27.1253 0.7908 -0.0477 0
Airbag_2 3.0894 0.3237 91.6074 94.6299 0.5689 -0.0921 0
Airco 1.8361 0.5446 36.6558 37.8652 0.7380 -0.0547 0
Automatic_airco 1.7419 0.5741 32.5257 33.5988 0.7577 -0.0519 0
Boardcomputer 2.6305 0.3802 71.4869 73.8455 0.6166 -0.0784 0
CD_Player 1.5503 0.6450 24.1291 24.9253 0.8031 -0.0462 0
Central_Lock 4.5886 0.2179 157.3372 162.5283 0.4668 -0.1368 0
Powered_Windows 4.6078 0.2170 158.1800 163.3990 0.4659 -0.1373 0
Power_Steering 1.5557 0.6428 24.3626 25.1664 0.8018 -0.0464 0
Radio 62.3090 0.0160 2688.0184 2776.7064 0.1267 -1.8572 1
Mistlamps 2.0750 0.4819 47.1335 48.6886 0.6942 -0.0618 0
Sport_Model 1.4606 0.6846 20.1946 20.8609 0.8274 -0.0435 0
Backseat_Divider 2.5379 0.3940 67.4257 69.6504 0.6277 -0.0756 0
Metallic_Rim 1.3400 0.7463 14.9077 15.3995 0.8639 -0.0399 0
Radio_cassette 62.1291 0.0161 2680.1284 2768.5561 0.1269 -1.8518 1
Tow_Bar 1.1445 0.8738 6.3348 6.5438 0.9347 -0.0341 0
1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
HP , Automatic , Doors , Guarantee_Period , ABS , Boardcomputer , Central_Lock , Powered_Windows , Power_Steering , Mistlamps , Backseat_Divider , coefficient(s) are non-significant may be due to multicollinearity
R-square of y on all x: 0.9058
* use method argument to check which regressors may be the reason of collinearity
===================================
clean_data <- raw_data
clean_data = clean_data %>% mutate(cc = ifelse(cc == 16000, 1600, cc))
clean_data = clean_data %>% mutate(Guarantee_Period = ifelse(Guarantee_Period == 13, 12, Guarantee_Period))
clean_data = clean_data %>% filter((KM > 20000 & KM < 130000))
clean_data = clean_data %>% filter(Weight < 1075 & Weight > 1010)
clean_data$Cylinders = NULL
bss.model <- regsubsets(Price~., data = clean_data, nvmax = dim(clean_data)[2])
2 linear dependencies found
Reordering variables and trying again:
summary(bss.model)
Subset selection object
Call: regsubsets.formula(Price ~ ., data = clean_data, nvmax = dim(clean_data)[2])
34 Variables (and intercept)
Forced in Forced out
Age_08_04 FALSE FALSE
Mfg_Month FALSE FALSE
KM FALSE FALSE
Fuel_TypePetrol FALSE FALSE
HP FALSE FALSE
Met_Color FALSE FALSE
Automatic FALSE FALSE
cc FALSE FALSE
Doors FALSE FALSE
Gears FALSE FALSE
Quarterly_Tax FALSE FALSE
Weight FALSE FALSE
Mfr_Guarantee FALSE FALSE
BOVAG_Guarantee FALSE FALSE
Guarantee_Period FALSE FALSE
ABS FALSE FALSE
Airbag_1 FALSE FALSE
Airbag_2 FALSE FALSE
Airco FALSE FALSE
Automatic_airco FALSE FALSE
Boardcomputer FALSE FALSE
CD_Player FALSE FALSE
Central_Lock FALSE FALSE
Powered_Windows FALSE FALSE
Power_Steering FALSE FALSE
Radio FALSE FALSE
Mistlamps FALSE FALSE
Sport_Model FALSE FALSE
Backseat_Divider FALSE FALSE
Metallic_Rim FALSE FALSE
Radio_cassette FALSE FALSE
Tow_Bar FALSE FALSE
Mfg_Year FALSE FALSE
Fuel_TypeDiesel FALSE FALSE
1 subsets of each size up to 32
Selection Algorithm: exhaustive
Age_08_04 Mfg_Month Mfg_Year KM Fuel_TypeDiesel Fuel_TypePetrol HP
1 ( 1 ) " " " " "*" " " " " " " " "
2 ( 1 ) " " " " "*" " " " " " " " "
3 ( 1 ) " " " " "*" "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) " " " " "*" "*" " " " " " "
9 ( 1 ) " " " " "*" "*" " " " " " "
10 ( 1 ) " " " " "*" "*" " " "*" " "
11 ( 1 ) " " " " "*" "*" " " "*" " "
12 ( 1 ) " " " " "*" "*" " " "*" " "
13 ( 1 ) " " " " "*" "*" " " "*" " "
14 ( 1 ) " " " " "*" "*" " " "*" "*"
15 ( 1 ) " " " " "*" "*" " " "*" "*"
16 ( 1 ) " " " " "*" "*" " " "*" "*"
17 ( 1 ) " " " " "*" "*" " " "*" "*"
18 ( 1 ) " " " " "*" "*" " " "*" "*"
19 ( 1 ) " " " " "*" "*" " " "*" "*"
20 ( 1 ) " " " " "*" "*" " " "*" "*"
21 ( 1 ) " " " " "*" "*" " " "*" "*"
22 ( 1 ) " " " " "*" "*" " " "*" "*"
23 ( 1 ) " " " " "*" "*" " " "*" "*"
24 ( 1 ) " " " " "*" "*" " " "*" "*"
25 ( 1 ) " " "*" "*" "*" " " "*" "*"
26 ( 1 ) " " "*" "*" "*" " " "*" "*"
27 ( 1 ) " " "*" "*" "*" " " "*" "*"
28 ( 1 ) " " "*" "*" "*" " " "*" "*"
29 ( 1 ) "*" " " "*" "*" " " "*" "*"
Met_Color Automatic cc Doors Gears Quarterly_Tax Weight Mfr_Guarantee
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " "*" " "
3 ( 1 ) " " " " " " " " " " " " "*" " "
4 ( 1 ) " " " " " " " " " " " " "*" " "
5 ( 1 ) " " " " " " " " " " " " "*" " "
6 ( 1 ) " " " " " " " " " " " " "*" " "
7 ( 1 ) " " " " " " " " " " " " "*" "*"
8 ( 1 ) " " "*" " " " " " " " " "*" "*"
9 ( 1 ) " " "*" " " " " " " " " "*" "*"
10 ( 1 ) " " "*" " " " " " " " " "*" "*"
11 ( 1 ) " " "*" " " " " " " " " "*" "*"
12 ( 1 ) " " "*" " " " " " " " " "*" "*"
13 ( 1 ) " " "*" " " " " " " " " "*" "*"
14 ( 1 ) " " " " "*" " " "*" " " "*" "*"
15 ( 1 ) " " " " "*" " " "*" " " "*" "*"
16 ( 1 ) " " " " "*" " " "*" " " "*" "*"
17 ( 1 ) " " " " "*" " " "*" " " "*" "*"
18 ( 1 ) " " "*" "*" " " "*" " " "*" "*"
19 ( 1 ) " " "*" "*" " " "*" "*" "*" "*"
20 ( 1 ) " " "*" "*" " " "*" "*" "*" "*"
21 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
22 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
23 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
24 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
25 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
26 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
27 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
28 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
29 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2 Airco
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " "*"
5 ( 1 ) " " " " " " " " " " "*"
6 ( 1 ) " " "*" " " " " " " "*"
7 ( 1 ) " " "*" " " " " " " "*"
8 ( 1 ) " " "*" " " " " " " "*"
9 ( 1 ) " " "*" " " " " " " "*"
10 ( 1 ) " " "*" " " " " " " "*"
11 ( 1 ) " " "*" " " " " " " "*"
12 ( 1 ) "*" "*" " " " " " " "*"
13 ( 1 ) "*" "*" " " " " "*" "*"
14 ( 1 ) "*" "*" " " " " " " "*"
15 ( 1 ) "*" "*" " " " " " " "*"
16 ( 1 ) "*" "*" " " " " " " "*"
17 ( 1 ) "*" "*" " " " " "*" "*"
18 ( 1 ) "*" "*" " " " " "*" "*"
19 ( 1 ) "*" "*" " " " " "*" "*"
20 ( 1 ) "*" "*" " " " " "*" "*"
21 ( 1 ) "*" "*" " " " " "*" "*"
22 ( 1 ) "*" "*" " " " " "*" "*"
23 ( 1 ) "*" "*" " " " " "*" "*"
24 ( 1 ) "*" "*" " " " " "*" "*"
25 ( 1 ) "*" "*" " " " " "*" "*"
26 ( 1 ) "*" "*" " " " " "*" "*"
27 ( 1 ) "*" "*" "*" " " "*" "*"
28 ( 1 ) "*" "*" "*" "*" "*" "*"
29 ( 1 ) "*" "*" "*" "*" "*" "*"
Automatic_airco Boardcomputer CD_Player Central_Lock Powered_Windows
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " " "
4 ( 1 ) " " " " " " " " " "
5 ( 1 ) " " " " " " "*" " "
6 ( 1 ) " " " " " " " " " "
7 ( 1 ) " " " " " " " " " "
8 ( 1 ) " " " " " " " " " "
9 ( 1 ) " " " " " " "*" " "
10 ( 1 ) " " " " " " "*" " "
11 ( 1 ) " " " " " " "*" " "
12 ( 1 ) " " " " " " "*" " "
13 ( 1 ) " " " " " " " " "*"
14 ( 1 ) " " " " " " "*" " "
15 ( 1 ) "*" " " " " "*" " "
16 ( 1 ) "*" " " " " "*" " "
17 ( 1 ) "*" " " " " "*" " "
18 ( 1 ) "*" " " " " "*" " "
19 ( 1 ) "*" " " " " "*" " "
20 ( 1 ) "*" "*" " " "*" " "
21 ( 1 ) "*" " " "*" "*" " "
22 ( 1 ) "*" "*" "*" "*" " "
23 ( 1 ) "*" "*" "*" "*" " "
24 ( 1 ) "*" "*" "*" "*" " "
25 ( 1 ) "*" "*" "*" "*" " "
26 ( 1 ) "*" "*" "*" "*" "*"
27 ( 1 ) "*" "*" "*" "*" "*"
28 ( 1 ) "*" "*" "*" "*" "*"
29 ( 1 ) "*" "*" "*" "*" "*"
Power_Steering Radio Mistlamps Sport_Model Backseat_Divider Metallic_Rim
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " "
6 ( 1 ) " " " " "*" " " " " " "
7 ( 1 ) " " " " "*" " " " " " "
8 ( 1 ) " " " " "*" " " " " " "
9 ( 1 ) " " " " "*" " " " " " "
10 ( 1 ) " " " " "*" " " " " " "
11 ( 1 ) " " " " "*" " " " " " "
12 ( 1 ) " " " " "*" " " " " " "
13 ( 1 ) " " " " "*" " " " " " "
14 ( 1 ) " " " " " " "*" " " " "
15 ( 1 ) " " " " " " "*" " " " "
16 ( 1 ) " " " " " " "*" " " " "
17 ( 1 ) " " " " " " "*" " " " "
18 ( 1 ) " " " " " " "*" " " " "
19 ( 1 ) " " " " " " "*" " " " "
20 ( 1 ) " " " " " " "*" " " " "
21 ( 1 ) " " " " " " "*" " " " "
22 ( 1 ) " " " " " " "*" " " " "
23 ( 1 ) "*" " " " " "*" " " " "
24 ( 1 ) "*" " " "*" "*" " " " "
25 ( 1 ) "*" " " "*" "*" " " " "
26 ( 1 ) "*" " " "*" "*" " " " "
27 ( 1 ) "*" " " "*" "*" " " " "
28 ( 1 ) "*" " " "*" "*" " " " "
29 ( 1 ) "*" "*" "*" "*" " " " "
Radio_cassette Tow_Bar
1 ( 1 ) " " " "
2 ( 1 ) " " " "
3 ( 1 ) " " " "
4 ( 1 ) " " " "
5 ( 1 ) " " " "
6 ( 1 ) " " " "
7 ( 1 ) " " " "
8 ( 1 ) " " " "
9 ( 1 ) " " " "
10 ( 1 ) " " " "
11 ( 1 ) " " "*"
12 ( 1 ) " " "*"
13 ( 1 ) " " "*"
14 ( 1 ) " " "*"
15 ( 1 ) " " "*"
16 ( 1 ) "*" "*"
17 ( 1 ) "*" "*"
18 ( 1 ) "*" "*"
19 ( 1 ) "*" "*"
20 ( 1 ) "*" "*"
21 ( 1 ) "*" "*"
22 ( 1 ) "*" "*"
23 ( 1 ) "*" "*"
24 ( 1 ) "*" "*"
25 ( 1 ) "*" "*"
26 ( 1 ) "*" "*"
27 ( 1 ) "*" "*"
28 ( 1 ) "*" "*"
29 ( 1 ) "*" "*"
[ reached getOption("max.print") -- omitted 3 rows ]
which.max(summary(bss.model)$adjr2)
[1] 24
p <- ggplot(data = data.frame(n_predictores = 1:32,
R_ajustado = summary(bss.model)$adjr2),
aes(x = n_predictores, y = R_ajustado)) +
geom_line() +
geom_point()
p <- p + geom_point(aes(
x = n_predictores[which.max(summary(bss.model)$adjr2)],
y = R_ajustado[which.max(summary(bss.model)$adjr2)]),
colour = "red", size = 3)
p <- p + scale_x_continuous(breaks = c(0:34)) +
theme_bw() +
labs(title = 'R2_ajustado vs número de predictores',
x = 'número predictores')
p
coef(object = bss.model, id = which.max(summary(bss.model)$adjr2))
(Intercept) KM Fuel_TypePetrol Met_Color
-2.529947e+06 -1.179339e-02 -2.636556e+03 -1.288967e+02
Automatic cc Doors Gears
5.137662e+02 2.755593e-01 7.045670e+01 4.497871e+02
Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period
1.073410e+01 2.288838e+02 3.509444e+02 5.796841e+01
ABS Airbag_1 Automatic_airco Boardcomputer
-1.306206e+02 -1.155433e+02 9.962195e+02 1.253951e+00
CD_Player Central_Lock Powered_Windows Power_Steering
1.386529e+02 1.808711e+02 1.862831e+02 -2.503139e+02
Mistlamps Backseat_Divider Metallic_Rim Mfg_Year
3.676494e+02 -3.742877e+01 5.032722e+01 1.264728e+03
Fuel_TypeDiesel
0.000000e+00
summary(bss.model)$adjr2[24]
[1] 0.774939
summary(bss.model)$adjr2[9]
[1] 0.7579834
summary(bss.model)$adjr2[6]
[1] 0.7469467
bss.model.sum = summary(bss.model)
par(mfrow = c(2, 2))
plot(bss.model.sum$rss, xlab = "Numero de Predictores", ylab = "RSS", type = "b")
plot(bss.model.sum$adjr2, xlab = "Numero de Predictores", ylab = "R Ajustada", type = "b")
best_adj_r2 = which.max(bss.model.sum$adjr2)
points(best_adj_r2, bss.model.sum$adjr2[best_adj_r2],
col = "red",cex = 2, pch = 20)
plot(bss.model.sum$cp, xlab = "Numero de Predictores", ylab = "Cp", type = 'b')
best_cp = which.min(bss.model.sum$cp)
points(best_cp, bss.model.sum$cp[best_cp],
col = "red", cex = 2, pch = 20)
plot(bss.model.sum$bic, xlab = "Numero de Predictores", ylab = "BIC", type = 'b')
best_bic = which.min(bss.model.sum$bic)
points(best_bic, bss.model.sum$bic[best_bic],
col = "red", cex = 2, pch = 20)
En las gráficas anteriores BIC, CP, R ajustada se observan los puntos cuyo valores son mínimos y que no concordancia entre ellos para seleccionar ubivocamente la cantidad de predictores a emplear en un modelo. Sin embargo, puede observarse puntualmente en cada grafico, que existen sutiles mejoras (casi imperceptibles) entre algunas cantidades de predictores.
Se destaca que no tiene importancia destacar el punto minimo de RSS, dado que al por la naturaleza del modelo, a mayor cantidad de variable, menor será su valor.
coef(object = bss.model, id = 7)
(Intercept) KM Fuel_TypePetrol BOVAG_Guarantee
1.309274e+04 -2.326143e-02 -4.063483e+03 4.683116e+02
Guarantee_Period Airbag_1 Boardcomputer Backseat_Divider
1.697532e+02 2.091082e+02 1.959851e+03 3.807397e+02
set.seed(10)
index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)
data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
set.seed(10)
model.fwd <- regsubsets(Price ~., data = data.train, nvmax = 7)
3 linear dependencies found
Reordering variables and trying again:
summary(model.fwd)
Subset selection object
Call: regsubsets.formula(Price ~ ., data = data.train, nvmax = 7)
34 Variables (and intercept)
Forced in Forced out
Age_08_04 FALSE FALSE
Mfg_Month FALSE FALSE
KM FALSE FALSE
Fuel_TypePetrol FALSE FALSE
HP FALSE FALSE
Met_Color FALSE FALSE
Automatic FALSE FALSE
cc FALSE FALSE
Doors FALSE FALSE
Gears FALSE FALSE
Quarterly_Tax FALSE FALSE
Weight FALSE FALSE
Mfr_Guarantee FALSE FALSE
BOVAG_Guarantee FALSE FALSE
Guarantee_Period FALSE FALSE
ABS FALSE FALSE
Airbag_1 FALSE FALSE
Airbag_2 FALSE FALSE
Airco FALSE FALSE
Automatic_airco FALSE FALSE
Boardcomputer FALSE FALSE
CD_Player FALSE FALSE
Central_Lock FALSE FALSE
Powered_Windows FALSE FALSE
Power_Steering FALSE FALSE
Radio FALSE FALSE
Mistlamps FALSE FALSE
Sport_Model FALSE FALSE
Backseat_Divider FALSE FALSE
Metallic_Rim FALSE FALSE
Tow_Bar FALSE FALSE
Mfg_Year FALSE FALSE
Fuel_TypeDiesel FALSE FALSE
Radio_cassette FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
Age_08_04 Mfg_Month Mfg_Year KM Fuel_TypeDiesel Fuel_TypePetrol HP
1 ( 1 ) " " " " "*" " " " " " " " "
2 ( 1 ) " " " " "*" " " " " " " " "
3 ( 1 ) " " " " "*" "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) " " " " "*" "*" " " " " " "
Met_Color Automatic cc Doors Gears Quarterly_Tax Weight Mfr_Guarantee
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " "*" " "
3 ( 1 ) " " " " " " " " " " " " "*" " "
4 ( 1 ) " " " " " " " " " " " " "*" " "
5 ( 1 ) " " " " " " " " " " " " "*" " "
6 ( 1 ) " " " " " " " " " " " " "*" "*"
7 ( 1 ) " " " " " " " " " " " " "*" "*"
8 ( 1 ) " " " " " " " " " " " " "*" "*"
BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2 Airco
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " "*"
6 ( 1 ) " " " " " " " " " " "*"
7 ( 1 ) " " "*" " " " " " " "*"
8 ( 1 ) " " "*" " " " " " " "*"
Automatic_airco Boardcomputer CD_Player Central_Lock Powered_Windows
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " " "
4 ( 1 ) " " " " " " " " "*"
5 ( 1 ) " " " " " " " " "*"
6 ( 1 ) " " " " " " " " "*"
7 ( 1 ) " " " " " " " " "*"
8 ( 1 ) " " " " " " " " "*"
Power_Steering Radio Mistlamps Sport_Model Backseat_Divider Metallic_Rim
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " " " " "
Radio_cassette Tow_Bar
1 ( 1 ) " " " "
2 ( 1 ) " " " "
3 ( 1 ) " " " "
4 ( 1 ) " " " "
5 ( 1 ) " " " "
6 ( 1 ) " " " "
7 ( 1 ) " " " "
8 ( 1 ) " " "*"
val.errors = rep(NA,7)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:7)
{
coeficientes <- coef(model.fwd, id = i)
predictions <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictions)^2)
}
rmse <- sqrt(mean(val.errors))
rmse
[1] 1407.842
set.seed(10)
index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)
data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
x = model.matrix(Price ~ . , data.train)[, -1]
y = as.matrix(data.train$Price)
ridge.model = glmnet(x, y, alpha = 0)
beta=coef(ridge.model)
tmp <- as.data.frame(as.matrix(beta))
tmp$coef <- row.names(tmp)
tmp <- reshape::melt(tmp, id = "coef")
tmp$variable <- as.numeric(gsub("s", "", tmp$variable))
tmp$lambda <- ridge.model$lambda[tmp$variable+1]
tmp$norm <- apply(abs(beta[-1,]), 2, sum)[tmp$variable+1]
ggplot(tmp[tmp$coef != "(Intercept)",], aes(lambda, value, color = coef, group = coef, )) +
geom_line() +
scale_x_log10() +
xlab("Lambda (log scale)") +
guides(color = guide_legend(title = ""),
linetype = guide_legend(title = "")) +
theme_bw() +
theme(legend.key.width = unit(3,"lines"))
plot(ridge.model, xvar = "lambda", label = TRUE)
indices <- sample(c(TRUE,FALSE), nrow(data.train), replace = TRUE)
cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 0)
plot(cv.out)
coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -9.276178e+05
Age_08_04 -3.403398e+01
Mfg_Month -1.675292e+01
Mfg_Year 4.666735e+02
KM -8.205862e-03
Fuel_TypeDiesel .
Fuel_TypePetrol .
HP 4.510978e+00
Met_Color -4.077869e+01
Automatic 5.819668e+02
cc 3.425118e-01
Doors 3.215659e+01
Gears 1.800406e+02
Quarterly_Tax 8.492749e+00
Weight 4.057818e+00
Mfr_Guarantee 2.013134e+02
BOVAG_Guarantee 3.643001e+02
Guarantee_Period 4.918492e+01
ABS 1.136664e+02
Airbag_1 -1.506483e+02
Airbag_2 -4.032492e-01
Airco 2.930682e+02
Automatic_airco 7.676041e+02
Boardcomputer 2.717277e+02
CD_Player 1.082930e+02
Central_Lock 2.059698e+02
Powered_Windows 6.839280e+01
Power_Steering -7.009734e+02
Radio -3.066674e+01
Mistlamps 5.970154e+01
Sport_Model -2.479769e+02
Backseat_Divider 3.448104e+00
Metallic_Rim 5.100967e+01
Radio_cassette -3.071270e+01
Tow_Bar -2.591923e+02
bestlam = cv.out$lambda.min
bestlam
[1] 137.6734
ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
sqrt(mean((ridge.pred - y[-indices])^2))
[1] 807.2642
x = model.matrix(Price ~ . , data.test)[, -1]
y = as.matrix(data.test$Price)
ridge.model = glmnet(x, y, alpha = 0)
indices <- sample(c(TRUE,FALSE), nrow(data.test), replace = TRUE)
cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 0)
plot(cv.out)
coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -5.107223e+05
Age_08_04 -1.990590e+01
Mfg_Month -1.156697e+01
Mfg_Year 2.572843e+02
KM -9.166624e-03
Fuel_TypeDiesel .
Fuel_TypePetrol .
HP 1.018606e-01
Met_Color -3.050419e+01
Automatic -3.058052e+02
cc -7.618709e-02
Doors 8.384777e+01
Gears 4.417662e+02
Quarterly_Tax 4.467208e+00
Weight 4.256372e+00
Mfr_Guarantee 2.180984e+02
BOVAG_Guarantee 5.903780e+01
Guarantee_Period 2.742444e+01
ABS 1.886358e+02
Airbag_1 -2.687474e+01
Airbag_2 6.792561e+01
Airco 1.990458e+02
Automatic_airco .
Boardcomputer 4.884786e+02
CD_Player 3.181346e+02
Central_Lock 8.770576e+01
Powered_Windows 2.707806e+01
Power_Steering 9.958717e-01
Radio -7.240374e+01
Mistlamps 2.673791e+02
Sport_Model -3.403362e+02
Backseat_Divider 6.955809e+01
Metallic_Rim 3.906290e+00
Radio_cassette -7.238457e+01
Tow_Bar -8.097683e+01
bestlam = cv.out$lambda.min
bestlam
[1] 489.783
ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
sqrt(mean((ridge.pred - y[-indices])^2))
[1] 775.3981
set.seed(10)
index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)
data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
x = model.matrix(Price ~ . , data.train)[, -1]
y = as.matrix(data.train$Price)
lasso.model = glmnet(x, y, alpha = 1)
beta=coef(lasso.model)
tmp <- as.data.frame(as.matrix(beta))
tmp$coef <- row.names(tmp)
tmp <- reshape::melt(tmp, id = "coef")
tmp$variable <- as.numeric(gsub("s", "", tmp$variable))
tmp$lambda <- lasso.model$lambda[tmp$variable+1]
tmp$norm <- apply(abs(beta[-1,]), 2, sum)[tmp$variable+1]
ggplot(tmp[tmp$coef != "(Intercept)",], aes(lambda, value, color = coef, group = coef, )) +
geom_line() +
scale_x_log10() +
xlab("Lambda (log scale)") +
guides(color = guide_legend(title = ""),
linetype = guide_legend(title = "")) +
theme_bw() +
theme(legend.key.width = unit(3,"lines"))
plot(lasso.model, xvar = "lambda", label = TRUE)
indices <- sample(c(TRUE,FALSE), nrow(data.train), replace = TRUE)
cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)
plot(cv.out)
coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -2.152194e+06
Age_08_04 -9.617712e+00
Mfg_Month .
Mfg_Year 1.077728e+03
KM -6.654090e-03
Fuel_TypeDiesel .
Fuel_TypePetrol .
HP .
Met_Color .
Automatic 3.858173e+02
cc 2.490596e-01
Doors .
Gears .
Quarterly_Tax 3.041571e+00
Weight 7.261598e+00
Mfr_Guarantee 1.032277e+02
BOVAG_Guarantee 1.922574e+02
Guarantee_Period 9.119164e+00
ABS .
Airbag_1 .
Airbag_2 .
Airco 2.308151e+02
Automatic_airco .
Boardcomputer .
CD_Player .
Central_Lock 2.433077e+02
Powered_Windows .
Power_Steering -4.338181e+02
Radio .
Mistlamps .
Sport_Model -6.242317e+01
Backseat_Divider .
Metallic_Rim .
Radio_cassette .
Tow_Bar -1.144862e+02
bestlam = cv.out$lambda.min
bestlam
[1] 27.66179
lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
sqrt(mean((lasso.pred - y[-indices])^2))
[1] 812.298
x = model.matrix(Price ~ . , data.test)[, -1]
y = as.matrix(data.test$Price)
lasso.model = glmnet(x, y, alpha = 1)
indices <- sample(c(TRUE,FALSE), nrow(data.test), replace = TRUE)
cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)
plot(cv.out)
coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -2.083003e+06
Age_08_04 .
Mfg_Month .
Mfg_Year 1.045326e+03
KM -6.128349e-03
Fuel_TypeDiesel .
Fuel_TypePetrol .
HP .
Met_Color .
Automatic .
cc .
Doors 6.359957e+01
Gears 2.868344e+01
Quarterly_Tax .
Weight 2.598188e+00
Mfr_Guarantee .
BOVAG_Guarantee .
Guarantee_Period .
ABS .
Airbag_1 .
Airbag_2 .
Airco 2.077925e+02
Automatic_airco .
Boardcomputer 6.818690e+01
CD_Player .
Central_Lock .
Powered_Windows .
Power_Steering .
Radio .
Mistlamps 2.173762e+02
Sport_Model -5.946130e+01
Backseat_Divider .
Metallic_Rim .
Radio_cassette .
Tow_Bar .
bestlam = cv.out$lambda.min
bestlam
[1] 81.70072
lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
sqrt(mean((lasso.pred - y[-indices])^2))
[1] 813.2023
Dado que es un dataset con una cantidad baja de instancias (1500 aprox), y que mediante el uso de Lasso se obtuvo un resultado aceptable comparado a los rmse calculados con otros métodos (Best Subset y Ridge), se opta por usar ese metodo para seleccion de variables.