Este ejercicio consiste en realizar un análisis exploratorio sobre un dataset de vehiculos Toyota Corolla con 1436 instancias y 37 atributos.

El objetivo es conseguir un modelo de regresión lineal con un resultado aceptable interpretando cada paso del razonamiento necesario para llegar al objetivo.

  • El atributo objetivo es Price.

1: Carga de Librerias

library(fastDummies) 
library(car)
library(corrplot) 
library(mctest) 
library(tidyverse) 
library(leaps)
library(glmnet)
library(MASS)
library(reshape)
library(caret)
library(ggrepel)

2: Lectura del DataSet

raw_data = read.csv("ToyotaCorolla.csv") 
raw_data$Id = NULL
raw_data$Model = NULL
  • El atributo ID no es representativo de cada instancia, decido no considerarlo en el modelo.
  • El atributo Model no es representativo de cada instancia, decido no considerarlo en el modelo.

3: DataSet

raw_data

4: Estructura del DataSet

str(raw_data)
'data.frame':   1436 obs. of  35 variables:
 $ Price           : int  13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
 $ Age_08_04       : int  23 23 24 26 30 32 27 30 27 23 ...
 $ Mfg_Month       : int  10 10 9 7 3 1 6 3 6 10 ...
 $ Mfg_Year        : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
 $ KM              : int  46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
 $ Fuel_Type       : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
 $ HP              : int  90 90 90 90 90 90 90 90 192 69 ...
 $ Met_Color       : int  1 1 1 0 0 0 1 1 0 0 ...
 $ Automatic       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cc              : int  2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
 $ Doors           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Cylinders       : int  4 4 4 4 4 4 4 4 4 4 ...
 $ Gears           : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Quarterly_Tax   : int  210 210 210 210 210 210 210 210 100 185 ...
 $ Weight          : int  1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
 $ Mfr_Guarantee   : int  0 0 1 1 1 0 0 1 0 0 ...
 $ BOVAG_Guarantee : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Guarantee_Period: int  3 3 3 3 3 3 3 3 3 3 ...
 $ ABS             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Airbag_1        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Airbag_2        : int  1 1 1 1 1 1 1 1 0 1 ...
 $ Airco           : int  0 1 0 0 1 1 1 1 1 1 ...
 $ Automatic_airco : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Boardcomputer   : int  1 1 1 1 1 1 1 1 0 1 ...
 $ CD_Player       : int  0 1 0 0 0 0 0 1 0 0 ...
 $ Central_Lock    : int  1 1 0 0 1 1 1 1 1 0 ...
 $ Powered_Windows : int  1 0 0 0 1 1 1 1 1 0 ...
 $ Power_Steering  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Radio           : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Mistlamps       : int  0 0 0 0 1 1 0 0 0 0 ...
 $ Sport_Model     : int  0 0 0 0 0 0 1 0 0 0 ...
 $ Backseat_Divider: int  1 1 1 1 1 1 1 1 0 1 ...
 $ Metallic_Rim    : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Radio_cassette  : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Tow_Bar         : int  0 0 0 0 0 0 0 0 0 0 ...

5: Resumen del DataSet

summary(raw_data)
     Price         Age_08_04       Mfg_Month         Mfg_Year          KM        
 Min.   : 4350   Min.   : 1.00   Min.   : 1.000   Min.   :1998   Min.   :     1  
 1st Qu.: 8450   1st Qu.:44.00   1st Qu.: 3.000   1st Qu.:1998   1st Qu.: 43000  
 Median : 9900   Median :61.00   Median : 5.000   Median :1999   Median : 63390  
 Mean   :10731   Mean   :55.95   Mean   : 5.549   Mean   :2000   Mean   : 68533  
 3rd Qu.:11950   3rd Qu.:70.00   3rd Qu.: 8.000   3rd Qu.:2001   3rd Qu.: 87021  
 Max.   :32500   Max.   :80.00   Max.   :12.000   Max.   :2004   Max.   :243000  
  Fuel_Type          HP          Met_Color        Automatic             cc       
 CNG   :  17   Min.   : 69.0   Min.   :0.0000   Min.   :0.00000   Min.   : 1300  
 Diesel: 155   1st Qu.: 90.0   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: 1400  
 Petrol:1264   Median :110.0   Median :1.0000   Median :0.00000   Median : 1600  
               Mean   :101.5   Mean   :0.6748   Mean   :0.05571   Mean   : 1577  
               3rd Qu.:110.0   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.: 1600  
               Max.   :192.0   Max.   :1.0000   Max.   :1.00000   Max.   :16000  
     Doors         Cylinders     Gears       Quarterly_Tax        Weight    
 Min.   :2.000   Min.   :4   Min.   :3.000   Min.   : 19.00   Min.   :1000  
 1st Qu.:3.000   1st Qu.:4   1st Qu.:5.000   1st Qu.: 69.00   1st Qu.:1040  
 Median :4.000   Median :4   Median :5.000   Median : 85.00   Median :1070  
 Mean   :4.033   Mean   :4   Mean   :5.026   Mean   : 87.12   Mean   :1072  
 3rd Qu.:5.000   3rd Qu.:4   3rd Qu.:5.000   3rd Qu.: 85.00   3rd Qu.:1085  
 Max.   :5.000   Max.   :4   Max.   :6.000   Max.   :283.00   Max.   :1615  
 Mfr_Guarantee    BOVAG_Guarantee  Guarantee_Period      ABS        
 Min.   :0.0000   Min.   :0.0000   Min.   : 3.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.: 3.000   1st Qu.:1.0000  
 Median :0.0000   Median :1.0000   Median : 3.000   Median :1.0000  
 Mean   :0.4095   Mean   :0.8955   Mean   : 3.815   Mean   :0.8134  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :36.000   Max.   :1.0000  
    Airbag_1         Airbag_2          Airco        Automatic_airco  
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :1.0000   Median :1.0000   Median :1.0000   Median :0.00000  
 Mean   :0.9708   Mean   :0.7228   Mean   :0.5084   Mean   :0.05641  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
 Boardcomputer      CD_Player       Central_Lock    Powered_Windows
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :0.0000   Median :1.0000   Median :1.000  
 Mean   :0.2946   Mean   :0.2187   Mean   :0.5801   Mean   :0.562  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
 Power_Steering       Radio          Mistlamps      Sport_Model    
 Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
 Median :1.0000   Median :0.0000   Median :0.000   Median :0.0000  
 Mean   :0.9777   Mean   :0.1462   Mean   :0.257   Mean   :0.3001  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
 Backseat_Divider  Metallic_Rim    Radio_cassette      Tow_Bar      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.7702   Mean   :0.2047   Mean   :0.1455   Mean   :0.2779  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  

Observaciones
* El valor maximo de cc es de 16000, demasiado alto considerando la media.
* El atributo Fuel_Type es de tipo char, requerira un proceso de encoding.
* El valor de Cylinder es constante.

6: Análisis exploratorio

6a: Price

par(mfrow=c(1,2))
boxplot(raw_data$Price,main = "Precio Vehiculos Toyota Corolla",
        ylab = "Precio ($)", notch = TRUE)

hist(raw_data$Price,main = "Precio Vehiculos Toyota Corolla")

  • Se observa que la mediana del precio de los vehiculos, ronda los $10000 aproximadamente.
  • Se presentan valores atípicos con valores superiores a las $20000 y valores menores a $7000 aproximadamente.

6b: Mfg_Year

par(mfrow=c(1,2))

boxplot(raw_data$Age_08_04,main = "Age_08_04")
barplot(table(as.factor(raw_data$Age_08_04)),main = "Age_08_04")

  • El atributo “Age_08_04” presenta valores outliers correspondientes a vehiculos nuevos cuya antiguedad es 0.

6c: Mfg_Year

par(mfrow=c(1,2))

boxplot(raw_data$Mfg_Year,main = "Año de Mfg_Year")
barplot(table(as.factor(raw_data$Mfg_Year)),main = "Mfg_Year")

6d: KM

par(mfrow=c(1,2))

boxplot(raw_data$KM,main = "KM",
        ylab = "KM", notch = TRUE)

hist(raw_data$KM,main = "KM")

  • El atributo “KM” presenta valores outliers. Destaco sobretodo un conjunto de valores superiores a los 200000.

6e: HP

par(mfrow=c(1,2))

boxplot(raw_data$HP,main = "HP",
        ylab = "HP", notch = FALSE)

barplot(table(as.factor(raw_data$HP)),main = "HP")

  • El atributo “HP” presenta un valor outlier superior a 180. Según una investigacion realizada en medios externos al dataset, el valor si corresponde a un modelo de Toyota Corolla.

6f: CC

par(mfrow=c(1,2))

boxplot(raw_data$cc,main = "Cilindrada",
        ylab = "CC", notch = FALSE)

barplot(table(as.factor(raw_data$cc)),main = "Cilindrada")

  • El atributo “CC” presenta un outlier notorio superior a 16000, este valor esta fuera del contexto de un vehiculo toyota corolla, donde los valores promedio rondan el 100.

6g: Quarterly_Tax

par(mfrow=c(1,2))

boxplot(raw_data$Quarterly_Tax, main="Quarterly_Tax")

hist(raw_data$Quarterly_Tax)

  • El atributo “Quarterly_Tax” presenta outliers para valores superiores a 150 y valores menores a 50, sobre una mediana de 70 aproximadamente.

6h: Weight

par(mfrow=c(1,2))

boxplot(raw_data$Weight, main="Peso(KG)")
hist(raw_data$Weight)

  • El atributo “Weight” presenta outliers para valores superiores a 1150 sobre una mediana de 1050 aproximadamente.

6i: Fueltype y Radio Cassete

lbls <- c("0: No tiene", "1: Tiene")

par(mfrow=c(1,2)) 

barplot(table(as.factor(raw_data$Fuel_Type)), main="Fuel_Type")
pie(x = table(raw_data$Radio_cassette), labels = lbls, main="Radio Cassete")

6j: Metallic Rim y Backseat Divider


par(mfrow=c(1,2)) 

pie(x = table(raw_data$Metallic_Rim), labels = lbls,  main="Metallic Rim")
pie(x = table(raw_data$Backseat_Divider) , labels = lbls,  main="Backseat_Divider")

6k: Mistlamp, Radio y Sport Model

par(mfrow=c(1,3))

pie(x = table(raw_data$Mistlamps) , labels = lbls,  main="Mistlamps")
pie(x = table(raw_data$Radio), labels = lbls,  main="Radio")
pie(x = table(raw_data$Sport_Model), labels = lbls,  main="Sport_Model")

6l: Central Lock, CD Player y BoardComputer


par(mfrow=c(1,3))

pie(x = table(raw_data$Central_Lock), labels = lbls, main="Central_Lock")
pie(x = table(raw_data$CD_Player), labels = lbls, main="CD_Player")
pie(x = table(raw_data$Boardcomputer), labels = lbls, main="Boardcomputer")

6m: Airco, Airbag_2 y Airbag_1

par(mfrow=c(1,3))

pie(x = table(raw_data$Airco), labels = lbls,  main="Airco")
pie(x = table(raw_data$Airbag_2), labels = lbls,  main="Airbag_2")
pie(x = table(raw_data$Airbag_1), labels = lbls,main="Airbag_1")

6n: Guarantee Period y Automatic Airco


par(mfrow=c(1,2))

barplot(table(as.factor(raw_data$Guarantee_Period)), main="Guarantee_Period") 
pie(x = table(raw_data$Automatic_airco), labels = lbls,  main="Automatic_airco")

6o: MFR Guarantee, Gears y BOVAG Guarantee


par(mfrow=c(1,3))

pie(x = table(raw_data$Mfr_Guarantee), labels = lbls, main="Mfr_Guarantee")
barplot(table(as.factor(raw_data$Gears)), main="Gears")
pie(x = table(raw_data$BOVAG_Guarantee), labels = lbls, main="BOVAG_Guarantee")

6p: Doors, Automatic y ABS

par(mfrow=c(1,3))

barplot(table(as.factor(raw_data$Doors)), main="Doors")
pie(x = table(raw_data$Automatic),labels = lbls, main="Automatic")
pie(x = table(raw_data$ABS),labels = lbls, main="ABS")

7: Estudio de Variable Objetivo “Price”

7a: Distribucion de Price

hist(raw_data$Price, col="blue", breaks = 60, freq = F)
lines(density(raw_data$Price), col = "red", lwd=2)
rug(raw_data$Price)

7b: Relacion Price vs Resto de Predictores

plot(Price~., data=raw_data,col="blue")

8: Seleccion y modificacion de Variables

8a: Estudio de correlacion

corrplot(cor(dplyr::select(raw_data, -c("Fuel_Type"))), type="upper", method="pie")
the standard deviation is zero

8b: Indicadores de Colinealidad

imcdiag(dplyr::select(raw_data, -c("Price", "Fuel_Type")), raw_data$Price)

Call:
imcdiag(x = dplyr::select(raw_data, -c("Price", "Fuel_Type")), 
    y = raw_data$Price)


All Individual Multicollinearity Diagnostics Result

                     VIF    TOL        Wi        Fi Leamer    CVIF Klein
Age_08_04            Inf 0.0000       Inf       Inf 0.0000    -Inf     1
Mfg_Month            Inf 0.0000       Inf       Inf 0.0000    -Inf     1
Mfg_Year             Inf 0.0000       Inf       Inf 0.0000    -Inf     1
KM                1.8647 0.5363   37.9120   39.1629 0.7323 -0.0556     0
HP                1.6023 0.6241   26.4092   27.2805 0.7900 -0.0478     0
Met_Color         1.1398 0.8773    6.1308    6.3331 0.9367 -0.0340     0
Automatic         1.0805 0.9255    3.5309    3.6474 0.9620 -0.0322     0
cc                1.2170 0.8217    9.5136    9.8275 0.9065 -0.0363     0
Doors             1.2554 0.7966   11.1979   11.5674 0.8925 -0.0374     0
Cylinders         2.0001 0.5000   43.8472   45.2939     NA -0.0596     0
Gears             1.2599 0.7937   11.3958   11.7718 0.8909 -0.0376     0
Quarterly_Tax     2.7801 0.3597   78.0447   80.6197 0.5998 -0.0829     0
Weight            3.2137 0.3112   97.0581  100.2604 0.5578 -0.0958     0
Mfr_Guarantee     1.1983 0.8345    8.6960    8.9830 0.9135 -0.0357     0
BOVAG_Guarantee   1.3712 0.7293   16.2736   16.8105 0.8540 -0.0409     0
Guarantee_Period  1.5381 0.6502   23.5907   24.3691 0.8063 -0.0458     0
ABS               2.2232 0.4498   53.6282   55.3976 0.6707 -0.0663     0
Airbag_1          1.5989 0.6254   26.2590   27.1253 0.7908 -0.0477     0
Airbag_2          3.0894 0.3237   91.6074   94.6299 0.5689 -0.0921     0
Airco             1.8361 0.5446   36.6558   37.8652 0.7380 -0.0547     0
Automatic_airco   1.7419 0.5741   32.5257   33.5988 0.7577 -0.0519     0
Boardcomputer     2.6305 0.3802   71.4869   73.8455 0.6166 -0.0784     0
CD_Player         1.5503 0.6450   24.1291   24.9253 0.8031 -0.0462     0
Central_Lock      4.5886 0.2179  157.3372  162.5283 0.4668 -0.1368     0
Powered_Windows   4.6078 0.2170  158.1800  163.3990 0.4659 -0.1373     0
Power_Steering    1.5557 0.6428   24.3626   25.1664 0.8018 -0.0464     0
Radio            62.3090 0.0160 2688.0184 2776.7064 0.1267 -1.8572     1
Mistlamps         2.0750 0.4819   47.1335   48.6886 0.6942 -0.0618     0
Sport_Model       1.4606 0.6846   20.1946   20.8609 0.8274 -0.0435     0
Backseat_Divider  2.5379 0.3940   67.4257   69.6504 0.6277 -0.0756     0
Metallic_Rim      1.3400 0.7463   14.9077   15.3995 0.8639 -0.0399     0
Radio_cassette   62.1291 0.0161 2680.1284 2768.5561 0.1269 -1.8518     1
Tow_Bar           1.1445 0.8738    6.3348    6.5438 0.9347 -0.0341     0

1 --> COLLINEARITY is detected by the test 
0 --> COLLINEARITY is not detected by the test

HP , Automatic , Doors , Guarantee_Period , ABS , Boardcomputer , Central_Lock , Powered_Windows , Power_Steering , Mistlamps , Backseat_Divider , coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.9058 

* use method argument to check which regressors may be the reason of collinearity
===================================
  • Mediante el calculo de VIF y haciendo principal hincapíe en los atributos cuyo valor de VIF es muy superior a 5, es posible que exista colinealidad vinculado con los atributos Age_08_04,Mfg_Month, Mfg_Year, Radio y Radio_cassette

9: Limpieza de Datos

9a: Tratamiento de Outliers

  • El atributo CC presenta un outlier(valor atípico) de CC = 16000. No es un valor coherente con el contexto de un vehiculo Toyota Corolla. Considero que probablemente fue un error y supongo que se agrego un cero de más, siendo el valor correcto 1600.
  • El atributo Guarantee_Period presenta un outlier de Guarantee_Period = 13.Considero que probablemente fue un error y decido imputar el valor 12.
  • El atributo KM presenta outliers para valores superiores a 150000 y valores menos a 10000. Si bien son valores coherentes dentro del contexto de vehiculos, al estar la mayor concentracion de los vehiculos dentro del intervalo (10000,120000), decido recortar el dataSet, reduciendo su tamaño un 12%.
  • El atributo Age_08_04 presenta outliers para instancias cuyo Age_08_04 es menor a 25.
  • El atributo HP presenta outliers para valores superiores a 120 y valores menores a 80. Si bien son valores coherentes para algunos modelos de Toyota Corrolla, son casos particulares no representativos del conjunto de vehiculos.
  • Tras realizar estas operaciones, el dataset restante posee un 80% de las instancias del dataset original.
  • El atributo Price posee outliers para valores menores a 6500 y valores superiores a 15000. Por este motivo considero el intervalo (6500,14500), dado que los valores excluidos no son representativos de la mayoria del conjunto.

clean_data <- raw_data

clean_data = clean_data %>%  mutate(cc = ifelse(cc == 16000, 1600, cc))
clean_data = clean_data %>%  mutate(Guarantee_Period = ifelse(Guarantee_Period == 13, 12, Guarantee_Period))

clean_data = clean_data %>% filter((KM > 20000 & KM < 130000))
clean_data = clean_data %>% filter(Weight < 1075 & Weight > 1010)

clean_data$Cylinders = NULL
  • El dataset restante posee un 72% de las instancias del dataset original.

10: Selección de Variables

10a: Best SubSet Selection

bss.model <- regsubsets(Price~., data = clean_data, nvmax = dim(clean_data)[2])
2  linear dependencies found
Reordering variables and trying again:
summary(bss.model)
Subset selection object
Call: regsubsets.formula(Price ~ ., data = clean_data, nvmax = dim(clean_data)[2])
34 Variables  (and intercept)
                 Forced in Forced out
Age_08_04            FALSE      FALSE
Mfg_Month            FALSE      FALSE
KM                   FALSE      FALSE
Fuel_TypePetrol      FALSE      FALSE
HP                   FALSE      FALSE
Met_Color            FALSE      FALSE
Automatic            FALSE      FALSE
cc                   FALSE      FALSE
Doors                FALSE      FALSE
Gears                FALSE      FALSE
Quarterly_Tax        FALSE      FALSE
Weight               FALSE      FALSE
Mfr_Guarantee        FALSE      FALSE
BOVAG_Guarantee      FALSE      FALSE
Guarantee_Period     FALSE      FALSE
ABS                  FALSE      FALSE
Airbag_1             FALSE      FALSE
Airbag_2             FALSE      FALSE
Airco                FALSE      FALSE
Automatic_airco      FALSE      FALSE
Boardcomputer        FALSE      FALSE
CD_Player            FALSE      FALSE
Central_Lock         FALSE      FALSE
Powered_Windows      FALSE      FALSE
Power_Steering       FALSE      FALSE
Radio                FALSE      FALSE
Mistlamps            FALSE      FALSE
Sport_Model          FALSE      FALSE
Backseat_Divider     FALSE      FALSE
Metallic_Rim         FALSE      FALSE
Radio_cassette       FALSE      FALSE
Tow_Bar              FALSE      FALSE
Mfg_Year             FALSE      FALSE
Fuel_TypeDiesel      FALSE      FALSE
1 subsets of each size up to 32
Selection Algorithm: exhaustive
          Age_08_04 Mfg_Month Mfg_Year KM  Fuel_TypeDiesel Fuel_TypePetrol HP 
1  ( 1 )  " "       " "       "*"      " " " "             " "             " "
2  ( 1 )  " "       " "       "*"      " " " "             " "             " "
3  ( 1 )  " "       " "       "*"      "*" " "             " "             " "
4  ( 1 )  " "       " "       "*"      "*" " "             " "             " "
5  ( 1 )  " "       " "       "*"      "*" " "             " "             " "
6  ( 1 )  " "       " "       "*"      "*" " "             " "             " "
7  ( 1 )  " "       " "       "*"      "*" " "             " "             " "
8  ( 1 )  " "       " "       "*"      "*" " "             " "             " "
9  ( 1 )  " "       " "       "*"      "*" " "             " "             " "
10  ( 1 ) " "       " "       "*"      "*" " "             "*"             " "
11  ( 1 ) " "       " "       "*"      "*" " "             "*"             " "
12  ( 1 ) " "       " "       "*"      "*" " "             "*"             " "
13  ( 1 ) " "       " "       "*"      "*" " "             "*"             " "
14  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
15  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
16  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
17  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
18  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
19  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
20  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
21  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
22  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
23  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
24  ( 1 ) " "       " "       "*"      "*" " "             "*"             "*"
25  ( 1 ) " "       "*"       "*"      "*" " "             "*"             "*"
26  ( 1 ) " "       "*"       "*"      "*" " "             "*"             "*"
27  ( 1 ) " "       "*"       "*"      "*" " "             "*"             "*"
28  ( 1 ) " "       "*"       "*"      "*" " "             "*"             "*"
29  ( 1 ) "*"       " "       "*"      "*" " "             "*"             "*"
          Met_Color Automatic cc  Doors Gears Quarterly_Tax Weight Mfr_Guarantee
1  ( 1 )  " "       " "       " " " "   " "   " "           " "    " "          
2  ( 1 )  " "       " "       " " " "   " "   " "           "*"    " "          
3  ( 1 )  " "       " "       " " " "   " "   " "           "*"    " "          
4  ( 1 )  " "       " "       " " " "   " "   " "           "*"    " "          
5  ( 1 )  " "       " "       " " " "   " "   " "           "*"    " "          
6  ( 1 )  " "       " "       " " " "   " "   " "           "*"    " "          
7  ( 1 )  " "       " "       " " " "   " "   " "           "*"    "*"          
8  ( 1 )  " "       "*"       " " " "   " "   " "           "*"    "*"          
9  ( 1 )  " "       "*"       " " " "   " "   " "           "*"    "*"          
10  ( 1 ) " "       "*"       " " " "   " "   " "           "*"    "*"          
11  ( 1 ) " "       "*"       " " " "   " "   " "           "*"    "*"          
12  ( 1 ) " "       "*"       " " " "   " "   " "           "*"    "*"          
13  ( 1 ) " "       "*"       " " " "   " "   " "           "*"    "*"          
14  ( 1 ) " "       " "       "*" " "   "*"   " "           "*"    "*"          
15  ( 1 ) " "       " "       "*" " "   "*"   " "           "*"    "*"          
16  ( 1 ) " "       " "       "*" " "   "*"   " "           "*"    "*"          
17  ( 1 ) " "       " "       "*" " "   "*"   " "           "*"    "*"          
18  ( 1 ) " "       "*"       "*" " "   "*"   " "           "*"    "*"          
19  ( 1 ) " "       "*"       "*" " "   "*"   "*"           "*"    "*"          
20  ( 1 ) " "       "*"       "*" " "   "*"   "*"           "*"    "*"          
21  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
22  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
23  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
24  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
25  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
26  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
27  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
28  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
29  ( 1 ) "*"       "*"       "*" " "   "*"   "*"           "*"    "*"          
          BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2 Airco
1  ( 1 )  " "             " "              " " " "      " "      " "  
2  ( 1 )  " "             " "              " " " "      " "      " "  
3  ( 1 )  " "             " "              " " " "      " "      " "  
4  ( 1 )  " "             " "              " " " "      " "      "*"  
5  ( 1 )  " "             " "              " " " "      " "      "*"  
6  ( 1 )  " "             "*"              " " " "      " "      "*"  
7  ( 1 )  " "             "*"              " " " "      " "      "*"  
8  ( 1 )  " "             "*"              " " " "      " "      "*"  
9  ( 1 )  " "             "*"              " " " "      " "      "*"  
10  ( 1 ) " "             "*"              " " " "      " "      "*"  
11  ( 1 ) " "             "*"              " " " "      " "      "*"  
12  ( 1 ) "*"             "*"              " " " "      " "      "*"  
13  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
14  ( 1 ) "*"             "*"              " " " "      " "      "*"  
15  ( 1 ) "*"             "*"              " " " "      " "      "*"  
16  ( 1 ) "*"             "*"              " " " "      " "      "*"  
17  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
18  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
19  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
20  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
21  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
22  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
23  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
24  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
25  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
26  ( 1 ) "*"             "*"              " " " "      "*"      "*"  
27  ( 1 ) "*"             "*"              "*" " "      "*"      "*"  
28  ( 1 ) "*"             "*"              "*" "*"      "*"      "*"  
29  ( 1 ) "*"             "*"              "*" "*"      "*"      "*"  
          Automatic_airco Boardcomputer CD_Player Central_Lock Powered_Windows
1  ( 1 )  " "             " "           " "       " "          " "            
2  ( 1 )  " "             " "           " "       " "          " "            
3  ( 1 )  " "             " "           " "       " "          " "            
4  ( 1 )  " "             " "           " "       " "          " "            
5  ( 1 )  " "             " "           " "       "*"          " "            
6  ( 1 )  " "             " "           " "       " "          " "            
7  ( 1 )  " "             " "           " "       " "          " "            
8  ( 1 )  " "             " "           " "       " "          " "            
9  ( 1 )  " "             " "           " "       "*"          " "            
10  ( 1 ) " "             " "           " "       "*"          " "            
11  ( 1 ) " "             " "           " "       "*"          " "            
12  ( 1 ) " "             " "           " "       "*"          " "            
13  ( 1 ) " "             " "           " "       " "          "*"            
14  ( 1 ) " "             " "           " "       "*"          " "            
15  ( 1 ) "*"             " "           " "       "*"          " "            
16  ( 1 ) "*"             " "           " "       "*"          " "            
17  ( 1 ) "*"             " "           " "       "*"          " "            
18  ( 1 ) "*"             " "           " "       "*"          " "            
19  ( 1 ) "*"             " "           " "       "*"          " "            
20  ( 1 ) "*"             "*"           " "       "*"          " "            
21  ( 1 ) "*"             " "           "*"       "*"          " "            
22  ( 1 ) "*"             "*"           "*"       "*"          " "            
23  ( 1 ) "*"             "*"           "*"       "*"          " "            
24  ( 1 ) "*"             "*"           "*"       "*"          " "            
25  ( 1 ) "*"             "*"           "*"       "*"          " "            
26  ( 1 ) "*"             "*"           "*"       "*"          "*"            
27  ( 1 ) "*"             "*"           "*"       "*"          "*"            
28  ( 1 ) "*"             "*"           "*"       "*"          "*"            
29  ( 1 ) "*"             "*"           "*"       "*"          "*"            
          Power_Steering Radio Mistlamps Sport_Model Backseat_Divider Metallic_Rim
1  ( 1 )  " "            " "   " "       " "         " "              " "         
2  ( 1 )  " "            " "   " "       " "         " "              " "         
3  ( 1 )  " "            " "   " "       " "         " "              " "         
4  ( 1 )  " "            " "   " "       " "         " "              " "         
5  ( 1 )  " "            " "   " "       " "         " "              " "         
6  ( 1 )  " "            " "   "*"       " "         " "              " "         
7  ( 1 )  " "            " "   "*"       " "         " "              " "         
8  ( 1 )  " "            " "   "*"       " "         " "              " "         
9  ( 1 )  " "            " "   "*"       " "         " "              " "         
10  ( 1 ) " "            " "   "*"       " "         " "              " "         
11  ( 1 ) " "            " "   "*"       " "         " "              " "         
12  ( 1 ) " "            " "   "*"       " "         " "              " "         
13  ( 1 ) " "            " "   "*"       " "         " "              " "         
14  ( 1 ) " "            " "   " "       "*"         " "              " "         
15  ( 1 ) " "            " "   " "       "*"         " "              " "         
16  ( 1 ) " "            " "   " "       "*"         " "              " "         
17  ( 1 ) " "            " "   " "       "*"         " "              " "         
18  ( 1 ) " "            " "   " "       "*"         " "              " "         
19  ( 1 ) " "            " "   " "       "*"         " "              " "         
20  ( 1 ) " "            " "   " "       "*"         " "              " "         
21  ( 1 ) " "            " "   " "       "*"         " "              " "         
22  ( 1 ) " "            " "   " "       "*"         " "              " "         
23  ( 1 ) "*"            " "   " "       "*"         " "              " "         
24  ( 1 ) "*"            " "   "*"       "*"         " "              " "         
25  ( 1 ) "*"            " "   "*"       "*"         " "              " "         
26  ( 1 ) "*"            " "   "*"       "*"         " "              " "         
27  ( 1 ) "*"            " "   "*"       "*"         " "              " "         
28  ( 1 ) "*"            " "   "*"       "*"         " "              " "         
29  ( 1 ) "*"            "*"   "*"       "*"         " "              " "         
          Radio_cassette Tow_Bar
1  ( 1 )  " "            " "    
2  ( 1 )  " "            " "    
3  ( 1 )  " "            " "    
4  ( 1 )  " "            " "    
5  ( 1 )  " "            " "    
6  ( 1 )  " "            " "    
7  ( 1 )  " "            " "    
8  ( 1 )  " "            " "    
9  ( 1 )  " "            " "    
10  ( 1 ) " "            " "    
11  ( 1 ) " "            "*"    
12  ( 1 ) " "            "*"    
13  ( 1 ) " "            "*"    
14  ( 1 ) " "            "*"    
15  ( 1 ) " "            "*"    
16  ( 1 ) "*"            "*"    
17  ( 1 ) "*"            "*"    
18  ( 1 ) "*"            "*"    
19  ( 1 ) "*"            "*"    
20  ( 1 ) "*"            "*"    
21  ( 1 ) "*"            "*"    
22  ( 1 ) "*"            "*"    
23  ( 1 ) "*"            "*"    
24  ( 1 ) "*"            "*"    
25  ( 1 ) "*"            "*"    
26  ( 1 ) "*"            "*"    
27  ( 1 ) "*"            "*"    
28  ( 1 ) "*"            "*"    
29  ( 1 ) "*"            "*"    
 [ reached getOption("max.print") -- omitted 3 rows ]
  • Mediante Best Subset se definen todos los modelos posibles para los 34 predictores del modelo mediante combinaciones de los mismos y comparando su valor de RSS obtenido.

10a.1: Cantidad de Variables propuestas por el modelo

which.max(summary(bss.model)$adjr2)
[1] 24
  • El modelo con un mejor valor de R ajustado utiliza 24 predictores.

10a.2: R Ajustado vs Cant. Predictores

p <- ggplot(data = data.frame(n_predictores = 1:32,
                              R_ajustado = summary(bss.model)$adjr2),
            aes(x = n_predictores, y = R_ajustado)) +
    geom_line() +
    geom_point()

p <- p + geom_point(aes(
                    x = n_predictores[which.max(summary(bss.model)$adjr2)],
                    y = R_ajustado[which.max(summary(bss.model)$adjr2)]),
                    colour = "red", size = 3)
p <- p +  scale_x_continuous(breaks = c(0:34)) + 
          theme_bw() +
          labs(title = 'R2_ajustado vs número de predictores', 
               x =  'número predictores')
p

10a.3: Coeficientes: Predictores propuestos por el modelo

coef(object = bss.model, id = which.max(summary(bss.model)$adjr2))
     (Intercept)               KM  Fuel_TypePetrol        Met_Color 
   -2.529947e+06    -1.179339e-02    -2.636556e+03    -1.288967e+02 
       Automatic               cc            Doors            Gears 
    5.137662e+02     2.755593e-01     7.045670e+01     4.497871e+02 
          Weight    Mfr_Guarantee  BOVAG_Guarantee Guarantee_Period 
    1.073410e+01     2.288838e+02     3.509444e+02     5.796841e+01 
             ABS         Airbag_1  Automatic_airco    Boardcomputer 
   -1.306206e+02    -1.155433e+02     9.962195e+02     1.253951e+00 
       CD_Player     Central_Lock  Powered_Windows   Power_Steering 
    1.386529e+02     1.808711e+02     1.862831e+02    -2.503139e+02 
       Mistlamps Backseat_Divider     Metallic_Rim         Mfg_Year 
    3.676494e+02    -3.742877e+01     5.032722e+01     1.264728e+03 
 Fuel_TypeDiesel 
    0.000000e+00 
  • El modelo obtenido mediante el uso del metodo Best Subset propone usar los predictores expuestos en el recuadro superior.

10a.4: R Ajustado: Predictores propuestos por el modelo

summary(bss.model)$adjr2[24]
[1] 0.774939
  • El valor de R Ajustado obtenido para el uso de 24 predictores es de 0.77. Pero, el uso de demasiados predictores añade complejidad al modelo.

10a.5: R Ajustado: Cant. de Predictores Alternativas

summary(bss.model)$adjr2[9]
[1] 0.7579834
summary(bss.model)$adjr2[6]
[1] 0.7469467
  • Reduciendo la cantidad predictores, disminuye el valor de R ajustado, pero disminuye la complejidad del modelo lo que permite una respuesta más general frente a nuevos datos de entrada.

10a.6: CP, AIC Y BIC

bss.model.sum = summary(bss.model)

par(mfrow = c(2, 2))
plot(bss.model.sum$rss, xlab = "Numero de Predictores", ylab = "RSS", type = "b")

plot(bss.model.sum$adjr2, xlab = "Numero de Predictores", ylab = "R Ajustada", type = "b")
best_adj_r2 = which.max(bss.model.sum$adjr2)
points(best_adj_r2, bss.model.sum$adjr2[best_adj_r2],
       col = "red",cex = 2, pch = 20)

plot(bss.model.sum$cp, xlab = "Numero de Predictores", ylab = "Cp", type = 'b')
best_cp = which.min(bss.model.sum$cp)
points(best_cp, bss.model.sum$cp[best_cp], 
       col = "red", cex = 2, pch = 20)

plot(bss.model.sum$bic, xlab = "Numero de Predictores", ylab = "BIC", type = 'b')
best_bic = which.min(bss.model.sum$bic)
points(best_bic, bss.model.sum$bic[best_bic], 
       col = "red", cex = 2, pch = 20)

  • En las gráficas anteriores BIC, CP, R ajustada se observan los puntos cuyo valores son mínimos y que no concordancia entre ellos para seleccionar ubivocamente la cantidad de predictores a emplear en un modelo. Sin embargo, puede observarse puntualmente en cada grafico, que existen sutiles mejoras (casi imperceptibles) entre algunas cantidades de predictores.

  • Se destaca que no tiene importancia destacar el punto minimo de RSS, dado que al por la naturaleza del modelo, a mayor cantidad de variable, menor será su valor.

10a.7: Coeficientes: Modelo Seleccionado

coef(object = bss.model, id = 7)
     (Intercept)               KM  Fuel_TypePetrol  BOVAG_Guarantee 
    1.309274e+04    -2.326143e-02    -4.063483e+03     4.683116e+02 
Guarantee_Period         Airbag_1    Boardcomputer Backseat_Divider 
    1.697532e+02     2.091082e+02     1.959851e+03     3.807397e+02 
  • Mediante la comparación de las graficas, se opta por un modelo con 7 predictores.

10a.8: Validacion Cruzada

set.seed(10)

index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)

data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
set.seed(10)
model.fwd <- regsubsets(Price ~., data = data.train, nvmax = 7)
3  linear dependencies found
Reordering variables and trying again:
summary(model.fwd)
Subset selection object
Call: regsubsets.formula(Price ~ ., data = data.train, nvmax = 7)
34 Variables  (and intercept)
                 Forced in Forced out
Age_08_04            FALSE      FALSE
Mfg_Month            FALSE      FALSE
KM                   FALSE      FALSE
Fuel_TypePetrol      FALSE      FALSE
HP                   FALSE      FALSE
Met_Color            FALSE      FALSE
Automatic            FALSE      FALSE
cc                   FALSE      FALSE
Doors                FALSE      FALSE
Gears                FALSE      FALSE
Quarterly_Tax        FALSE      FALSE
Weight               FALSE      FALSE
Mfr_Guarantee        FALSE      FALSE
BOVAG_Guarantee      FALSE      FALSE
Guarantee_Period     FALSE      FALSE
ABS                  FALSE      FALSE
Airbag_1             FALSE      FALSE
Airbag_2             FALSE      FALSE
Airco                FALSE      FALSE
Automatic_airco      FALSE      FALSE
Boardcomputer        FALSE      FALSE
CD_Player            FALSE      FALSE
Central_Lock         FALSE      FALSE
Powered_Windows      FALSE      FALSE
Power_Steering       FALSE      FALSE
Radio                FALSE      FALSE
Mistlamps            FALSE      FALSE
Sport_Model          FALSE      FALSE
Backseat_Divider     FALSE      FALSE
Metallic_Rim         FALSE      FALSE
Tow_Bar              FALSE      FALSE
Mfg_Year             FALSE      FALSE
Fuel_TypeDiesel      FALSE      FALSE
Radio_cassette       FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         Age_08_04 Mfg_Month Mfg_Year KM  Fuel_TypeDiesel Fuel_TypePetrol HP 
1  ( 1 ) " "       " "       "*"      " " " "             " "             " "
2  ( 1 ) " "       " "       "*"      " " " "             " "             " "
3  ( 1 ) " "       " "       "*"      "*" " "             " "             " "
4  ( 1 ) " "       " "       "*"      "*" " "             " "             " "
5  ( 1 ) " "       " "       "*"      "*" " "             " "             " "
6  ( 1 ) " "       " "       "*"      "*" " "             " "             " "
7  ( 1 ) " "       " "       "*"      "*" " "             " "             " "
8  ( 1 ) " "       " "       "*"      "*" " "             " "             " "
         Met_Color Automatic cc  Doors Gears Quarterly_Tax Weight Mfr_Guarantee
1  ( 1 ) " "       " "       " " " "   " "   " "           " "    " "          
2  ( 1 ) " "       " "       " " " "   " "   " "           "*"    " "          
3  ( 1 ) " "       " "       " " " "   " "   " "           "*"    " "          
4  ( 1 ) " "       " "       " " " "   " "   " "           "*"    " "          
5  ( 1 ) " "       " "       " " " "   " "   " "           "*"    " "          
6  ( 1 ) " "       " "       " " " "   " "   " "           "*"    "*"          
7  ( 1 ) " "       " "       " " " "   " "   " "           "*"    "*"          
8  ( 1 ) " "       " "       " " " "   " "   " "           "*"    "*"          
         BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2 Airco
1  ( 1 ) " "             " "              " " " "      " "      " "  
2  ( 1 ) " "             " "              " " " "      " "      " "  
3  ( 1 ) " "             " "              " " " "      " "      " "  
4  ( 1 ) " "             " "              " " " "      " "      " "  
5  ( 1 ) " "             " "              " " " "      " "      "*"  
6  ( 1 ) " "             " "              " " " "      " "      "*"  
7  ( 1 ) " "             "*"              " " " "      " "      "*"  
8  ( 1 ) " "             "*"              " " " "      " "      "*"  
         Automatic_airco Boardcomputer CD_Player Central_Lock Powered_Windows
1  ( 1 ) " "             " "           " "       " "          " "            
2  ( 1 ) " "             " "           " "       " "          " "            
3  ( 1 ) " "             " "           " "       " "          " "            
4  ( 1 ) " "             " "           " "       " "          "*"            
5  ( 1 ) " "             " "           " "       " "          "*"            
6  ( 1 ) " "             " "           " "       " "          "*"            
7  ( 1 ) " "             " "           " "       " "          "*"            
8  ( 1 ) " "             " "           " "       " "          "*"            
         Power_Steering Radio Mistlamps Sport_Model Backseat_Divider Metallic_Rim
1  ( 1 ) " "            " "   " "       " "         " "              " "         
2  ( 1 ) " "            " "   " "       " "         " "              " "         
3  ( 1 ) " "            " "   " "       " "         " "              " "         
4  ( 1 ) " "            " "   " "       " "         " "              " "         
5  ( 1 ) " "            " "   " "       " "         " "              " "         
6  ( 1 ) " "            " "   " "       " "         " "              " "         
7  ( 1 ) " "            " "   " "       " "         " "              " "         
8  ( 1 ) " "            " "   " "       " "         " "              " "         
         Radio_cassette Tow_Bar
1  ( 1 ) " "            " "    
2  ( 1 ) " "            " "    
3  ( 1 ) " "            " "    
4  ( 1 ) " "            " "    
5  ( 1 ) " "            " "    
6  ( 1 ) " "            " "    
7  ( 1 ) " "            " "    
8  ( 1 ) " "            "*"    

10a.9: RMSE

val.errors = rep(NA,7)
x.test <- model.matrix(Price ~., data = data.test)

for(i in 1:7) 
{
  coeficientes <- coef(model.fwd, id = i)
  predictions <- x.test[,names(coeficientes)] %*% coeficientes
  val.errors[i] <- mean((data.test$Price - predictions)^2)
}

rmse <- sqrt(mean(val.errors))
rmse
[1] 1407.842
  • Tras una validación cruzada se obtiene un valor de RMSE de 1408.

10b: Ridge

  • Mediante Ridge se obtiene un modelo a partir de los 34 predictores iniciales, ponderando la influencia de cada predictor pero sin eliminar ninguno de los mismos del modelo.
set.seed(10)

index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)

data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
  • Se produce en conjunto de entrenamiento y de prueba.

10b.1: Modelo


x = model.matrix(Price ~ . , data.train)[, -1]
y = as.matrix(data.train$Price)

ridge.model = glmnet(x, y, alpha = 0)
beta=coef(ridge.model)

tmp <- as.data.frame(as.matrix(beta))
tmp$coef <- row.names(tmp)
tmp <- reshape::melt(tmp, id = "coef")
tmp$variable <- as.numeric(gsub("s", "", tmp$variable))
tmp$lambda <- ridge.model$lambda[tmp$variable+1]
tmp$norm <- apply(abs(beta[-1,]), 2, sum)[tmp$variable+1]

ggplot(tmp[tmp$coef != "(Intercept)",], aes(lambda, value, color = coef, group = coef, )) + 
  geom_line() + 
    scale_x_log10() + 
    xlab("Lambda (log scale)") + 
    guides(color = guide_legend(title = ""), 
           linetype = guide_legend(title = "")) +
    theme_bw() + 
    theme(legend.key.width = unit(3,"lines")) 

plot(ridge.model, xvar = "lambda", label = TRUE)

  • Mediante los dos graficos anteriores se puede observar que los predictores mas significativos son aquellos que, durante el procesamiento del modelo, demoran mas en acercarse a valores cercanos a cero.

indices <- sample(c(TRUE,FALSE), nrow(data.train), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 0)

plot(cv.out)

coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
                             1
(Intercept)      -9.276178e+05
Age_08_04        -3.403398e+01
Mfg_Month        -1.675292e+01
Mfg_Year          4.666735e+02
KM               -8.205862e-03
Fuel_TypeDiesel   .           
Fuel_TypePetrol   .           
HP                4.510978e+00
Met_Color        -4.077869e+01
Automatic         5.819668e+02
cc                3.425118e-01
Doors             3.215659e+01
Gears             1.800406e+02
Quarterly_Tax     8.492749e+00
Weight            4.057818e+00
Mfr_Guarantee     2.013134e+02
BOVAG_Guarantee   3.643001e+02
Guarantee_Period  4.918492e+01
ABS               1.136664e+02
Airbag_1         -1.506483e+02
Airbag_2         -4.032492e-01
Airco             2.930682e+02
Automatic_airco   7.676041e+02
Boardcomputer     2.717277e+02
CD_Player         1.082930e+02
Central_Lock      2.059698e+02
Powered_Windows   6.839280e+01
Power_Steering   -7.009734e+02
Radio            -3.066674e+01
Mistlamps         5.970154e+01
Sport_Model      -2.479769e+02
Backseat_Divider  3.448104e+00
Metallic_Rim      5.100967e+01
Radio_cassette   -3.071270e+01
Tow_Bar          -2.591923e+02
  • La grafica anterior representa como el error incrementa a mayores valores de log(lambda).
  • Las lineas punteadas indican, el valor minimo de mse obtenido en la validacion cruzada y el valor de la derecha representa el error estandar.

bestlam = cv.out$lambda.min
bestlam
[1] 137.6734
  • Se emplea el menor lambda obtenido de la validacion cruzada propia del modelo.

ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
sqrt(mean((ridge.pred - y[-indices])^2))
[1] 807.2642
  • Se obtiene un valor de rmse de 807 obtenido mediante los datos de entrenamiento.

10b.2: Validacion Cruzada


x = model.matrix(Price ~ . , data.test)[, -1]
y = as.matrix(data.test$Price)

ridge.model = glmnet(x, y, alpha = 0)

indices <- sample(c(TRUE,FALSE), nrow(data.test), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 0)

plot(cv.out)

coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
                             1
(Intercept)      -5.107223e+05
Age_08_04        -1.990590e+01
Mfg_Month        -1.156697e+01
Mfg_Year          2.572843e+02
KM               -9.166624e-03
Fuel_TypeDiesel   .           
Fuel_TypePetrol   .           
HP                1.018606e-01
Met_Color        -3.050419e+01
Automatic        -3.058052e+02
cc               -7.618709e-02
Doors             8.384777e+01
Gears             4.417662e+02
Quarterly_Tax     4.467208e+00
Weight            4.256372e+00
Mfr_Guarantee     2.180984e+02
BOVAG_Guarantee   5.903780e+01
Guarantee_Period  2.742444e+01
ABS               1.886358e+02
Airbag_1         -2.687474e+01
Airbag_2          6.792561e+01
Airco             1.990458e+02
Automatic_airco   .           
Boardcomputer     4.884786e+02
CD_Player         3.181346e+02
Central_Lock      8.770576e+01
Powered_Windows   2.707806e+01
Power_Steering    9.958717e-01
Radio            -7.240374e+01
Mistlamps         2.673791e+02
Sport_Model      -3.403362e+02
Backseat_Divider  6.955809e+01
Metallic_Rim      3.906290e+00
Radio_cassette   -7.238457e+01
Tow_Bar          -8.097683e+01
  • El valor minimo de MSE y el error estandar, son similares a los obtenidos mediante datos de entrenamiento.

bestlam = cv.out$lambda.min
bestlam
[1] 489.783
  • Se emplea el menor lambda obtenido de la validacion cruzada propia del modelo.

ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
sqrt(mean((ridge.pred - y[-indices])^2))
[1] 775.3981
  • Se obtiene un valor de rmse de 775 datos de prueba.

10c: Lasso

  • Mediante Lasso se obtiene un modelo a partir de los 34 predictores iniciales, ponderando la influencia de cada predictor. A diferencia de Ridge, usando Lasso si se realiza una seleccion de variables.
set.seed(10)

index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)

data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
  • Se produce en conjunto de entrenamiento y de prueba.

10c.1: Modelo


x = model.matrix(Price ~ . , data.train)[, -1]
y = as.matrix(data.train$Price)

lasso.model = glmnet(x, y, alpha = 1)
beta=coef(lasso.model)

tmp <- as.data.frame(as.matrix(beta))
tmp$coef <- row.names(tmp)
tmp <- reshape::melt(tmp, id = "coef")
tmp$variable <- as.numeric(gsub("s", "", tmp$variable))
tmp$lambda <- lasso.model$lambda[tmp$variable+1]
tmp$norm <- apply(abs(beta[-1,]), 2, sum)[tmp$variable+1]

ggplot(tmp[tmp$coef != "(Intercept)",], aes(lambda, value, color = coef, group = coef, )) + 
  geom_line() + 
    scale_x_log10() + 
    xlab("Lambda (log scale)") + 
    guides(color = guide_legend(title = ""), 
           linetype = guide_legend(title = "")) +
    theme_bw() + 
    theme(legend.key.width = unit(3,"lines")) 

plot(lasso.model, xvar = "lambda", label = TRUE)

  • Mediante los dos graficos anteriores se puede observar que los predictores mas significativos son aquellos que, durante el procesamiento del modelo, demoran mas en acercarse a valores cercanos a cero.

indices <- sample(c(TRUE,FALSE), nrow(data.train), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)

plot(cv.out)

coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
                             1
(Intercept)      -2.152194e+06
Age_08_04        -9.617712e+00
Mfg_Month         .           
Mfg_Year          1.077728e+03
KM               -6.654090e-03
Fuel_TypeDiesel   .           
Fuel_TypePetrol   .           
HP                .           
Met_Color         .           
Automatic         3.858173e+02
cc                2.490596e-01
Doors             .           
Gears             .           
Quarterly_Tax     3.041571e+00
Weight            7.261598e+00
Mfr_Guarantee     1.032277e+02
BOVAG_Guarantee   1.922574e+02
Guarantee_Period  9.119164e+00
ABS               .           
Airbag_1          .           
Airbag_2          .           
Airco             2.308151e+02
Automatic_airco   .           
Boardcomputer     .           
CD_Player         .           
Central_Lock      2.433077e+02
Powered_Windows   .           
Power_Steering   -4.338181e+02
Radio             .           
Mistlamps         .           
Sport_Model      -6.242317e+01
Backseat_Divider  .           
Metallic_Rim      .           
Radio_cassette    .           
Tow_Bar          -1.144862e+02
  • El valor minimo de MSE y el error estandar, son similares a los obtenidos mediante datos de entrenamiento.

bestlam = cv.out$lambda.min
bestlam
[1] 27.66179
  • Se emplea el menor lambda obtenido de la validacion cruzada propia del modelo.

lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
sqrt(mean((lasso.pred - y[-indices])^2))
[1] 812.298
  • Se obtiene un valor de rmse de 813 datos de entrenamiento.

10c.2: Validacion Cruzada


x = model.matrix(Price ~ . , data.test)[, -1]
y = as.matrix(data.test$Price)

lasso.model = glmnet(x, y, alpha = 1)

indices <- sample(c(TRUE,FALSE), nrow(data.test), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)

plot(cv.out)

coef(cv.out)
35 x 1 sparse Matrix of class "dgCMatrix"
                             1
(Intercept)      -2.083003e+06
Age_08_04         .           
Mfg_Month         .           
Mfg_Year          1.045326e+03
KM               -6.128349e-03
Fuel_TypeDiesel   .           
Fuel_TypePetrol   .           
HP                .           
Met_Color         .           
Automatic         .           
cc                .           
Doors             6.359957e+01
Gears             2.868344e+01
Quarterly_Tax     .           
Weight            2.598188e+00
Mfr_Guarantee     .           
BOVAG_Guarantee   .           
Guarantee_Period  .           
ABS               .           
Airbag_1          .           
Airbag_2          .           
Airco             2.077925e+02
Automatic_airco   .           
Boardcomputer     6.818690e+01
CD_Player         .           
Central_Lock      .           
Powered_Windows   .           
Power_Steering    .           
Radio             .           
Mistlamps         2.173762e+02
Sport_Model      -5.946130e+01
Backseat_Divider  .           
Metallic_Rim      .           
Radio_cassette    .           
Tow_Bar           .           

bestlam = cv.out$lambda.min
bestlam
[1] 81.70072

lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
sqrt(mean((lasso.pred - y[-indices])^2))
[1] 813.2023
  • Se obtiene un valor de rmse de 813 datos de prueba.

11: Conclusión

Dado que es un dataset con una cantidad baja de instancias (1500 aprox), y que mediante el uso de Lasso se obtuvo un resultado aceptable comparado a los rmse calculados con otros métodos (Best Subset y Ridge), se opta por usar ese metodo para seleccion de variables.

---
title: "Seleccion de Variables: Dataset 'Toyota Corolla'"
author: "Alvarez Ignacio Nicolas"
output:
  html_notebook:
    df_print: paged
    fig:height: 4
    fig:width: 6
    theme: readable
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
fig:height: 4
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Este ejercicio consiste en realizar un análisis exploratorio sobre un dataset de vehiculos Toyota Corolla con 1436 instancias y 37 atributos. 

El objetivo es conseguir un modelo de regresión lineal con un resultado aceptable interpretando cada paso del razonamiento necesario para llegar al objetivo.

* El atributo objetivo es Price.   

# 1: Carga de Librerias
```{r Carga de Librerias, eval=FALSE}
library(fastDummies) 
library(car)
library(corrplot) 
library(mctest) 
library(tidyverse) 
library(leaps)
library(glmnet)
library(MASS)
library(reshape)
library(caret)
library(ggrepel)
```

# 2: Lectura del DataSet

```{r Lectura del DataSet}
raw_data = read.csv("ToyotaCorolla.csv") 
raw_data$Id = NULL
raw_data$Model = NULL

```

* El atributo **ID** no es representativo de cada instancia, decido no considerarlo en el modelo.     
* El atributo **Model** no es representativo de cada instancia, decido no considerarlo en el modelo. 


# 3: DataSet

```{r Visualizacion del DataSet}
raw_data
```

# 4: Estructura del DataSet

```{r Estructura del DataSet}
str(raw_data)
```

# 5: Resumen del DataSet

```{r Resumen del DataSet}
summary(raw_data)
```

**Observaciones**   
* El valor maximo de cc es de 16000, demasiado alto considerando la media.    
* El atributo Fuel_Type es de tipo char, requerira un proceso de encoding.    
* El valor de Cylinder es constante. 

# 6: Análisis exploratorio    

## 6a: Price 

```{r Price}
par(mfrow=c(1,2))
boxplot(raw_data$Price,main = "Precio Vehiculos Toyota Corolla",
        ylab = "Precio ($)", notch = TRUE)

hist(raw_data$Price,main = "Precio Vehiculos Toyota Corolla")
```
* Se observa que la mediana del precio de los vehiculos, ronda los $10000 aproximadamente. 
* Se presentan valores atípicos con valores superiores a las $20000 y valores menores a $7000 aproximadamente.  

## 6b: Mfg_Year
     
```{r Age_08_04}
par(mfrow=c(1,2))

boxplot(raw_data$Age_08_04,main = "Age_08_04")
barplot(table(as.factor(raw_data$Age_08_04)),main = "Age_08_04")

```
* El atributo "Age_08_04" presenta valores outliers correspondientes a vehiculos nuevos cuya antiguedad es 0.  

## 6c: Mfg_Year

```{r Mfg_Year}
par(mfrow=c(1,2))

boxplot(raw_data$Mfg_Year,main = "Año de Mfg_Year")
barplot(table(as.factor(raw_data$Mfg_Year)),main = "Mfg_Year")
```

## 6d: KM

```{r Boxplot KM}
par(mfrow=c(1,2))

boxplot(raw_data$KM,main = "KM",
        ylab = "KM", notch = TRUE)

hist(raw_data$KM,main = "KM")

```
* El atributo "KM" presenta valores outliers. Destaco sobretodo un conjunto de valores superiores a los 200000. 


## 6e: HP
```{r HP}
par(mfrow=c(1,2))

boxplot(raw_data$HP,main = "HP",
        ylab = "HP", notch = FALSE)

barplot(table(as.factor(raw_data$HP)),main = "HP")
```

* El atributo "HP" presenta un valor outlier superior a 180. Según una investigacion realizada en medios externos al dataset, el valor si corresponde a un modelo de Toyota Corolla.    


## 6f: CC
  
```{r Boxplot CC}
par(mfrow=c(1,2))

boxplot(raw_data$cc,main = "Cilindrada",
        ylab = "CC", notch = FALSE)

barplot(table(as.factor(raw_data$cc)),main = "Cilindrada")
```

* El atributo "CC" presenta un outlier notorio superior a 16000, este valor esta fuera del contexto de un vehiculo toyota corolla, donde los valores promedio rondan el 100.    

## 6g: Quarterly_Tax

```{r Boxplot Quarterly_Tax}
par(mfrow=c(1,2))

boxplot(raw_data$Quarterly_Tax, main="Quarterly_Tax")

hist(raw_data$Quarterly_Tax)
```

* El atributo "Quarterly_Tax" presenta outliers para valores superiores a 150 y valores menores a 50, sobre una mediana de 70 aproximadamente. 

## 6h: Weight

```{r Weight}
par(mfrow=c(1,2))

boxplot(raw_data$Weight, main="Peso(KG)")
hist(raw_data$Weight)
```

* El atributo "Weight" presenta outliers para valores superiores a 1150 sobre una mediana de 1050 aproximadamente.

## 6i: Fueltype y Radio Cassete

```{r Plot Fueltype y Radio Cassete}
lbls <- c("0: No tiene", "1: Tiene")

par(mfrow=c(1,2)) 

barplot(table(as.factor(raw_data$Fuel_Type)), main="Fuel_Type")
pie(x = table(raw_data$Radio_cassette), labels = lbls, main="Radio Cassete")
```

## 6j: Metallic Rim y Backseat Divider

```{r Plot Metallic Rim y Backseat Divider}

par(mfrow=c(1,2)) 

pie(x = table(raw_data$Metallic_Rim), labels = lbls,  main="Metallic Rim")
pie(x = table(raw_data$Backseat_Divider) , labels = lbls,  main="Backseat_Divider")
```

## 6k: Mistlamp, Radio y Sport Model

```{r Plot Mistlamp, Radio y Sport Model}
par(mfrow=c(1,3))

pie(x = table(raw_data$Mistlamps) , labels = lbls,  main="Mistlamps")
pie(x = table(raw_data$Radio), labels = lbls,  main="Radio")
pie(x = table(raw_data$Sport_Model), labels = lbls,  main="Sport_Model")
```

## 6l: Central Lock, CD Player y BoardComputer

```{r Plot Central Lock, CD Player y BoardComputer}

par(mfrow=c(1,3))

pie(x = table(raw_data$Central_Lock), labels = lbls, main="Central_Lock")
pie(x = table(raw_data$CD_Player), labels = lbls, main="CD_Player")
pie(x = table(raw_data$Boardcomputer), labels = lbls, main="Boardcomputer")

```

## 6m: Airco, Airbag_2 y Airbag_1

```{r Plot Airco, Airbag_2 y Airbag_1}
par(mfrow=c(1,3))

pie(x = table(raw_data$Airco), labels = lbls,  main="Airco")
pie(x = table(raw_data$Airbag_2), labels = lbls,  main="Airbag_2")
pie(x = table(raw_data$Airbag_1), labels = lbls,main="Airbag_1")

```

## 6n: Guarantee Period y Automatic Airco

```{r Plot Guarantee Period y Automatic Airco}

par(mfrow=c(1,2))

barplot(table(as.factor(raw_data$Guarantee_Period)), main="Guarantee_Period") 
pie(x = table(raw_data$Automatic_airco), labels = lbls,  main="Automatic_airco")

```

## 6o: MFR Guarantee, Gears y BOVAG Guarantee

```{r Plot MFR Guarantee, Gears y BOVAG Guarantee}

par(mfrow=c(1,3))

pie(x = table(raw_data$Mfr_Guarantee), labels = lbls, main="Mfr_Guarantee")
barplot(table(as.factor(raw_data$Gears)), main="Gears")
pie(x = table(raw_data$BOVAG_Guarantee), labels = lbls, main="BOVAG_Guarantee")
```

## 6p: Doors, Automatic y ABS

```{r Plot Doors, Automatic y ABS}
par(mfrow=c(1,3))

barplot(table(as.factor(raw_data$Doors)), main="Doors")
pie(x = table(raw_data$Automatic),labels = lbls, main="Automatic")
pie(x = table(raw_data$ABS),labels = lbls, main="ABS")
```

# 7: Estudio de Variable Objetivo "Price"

## 7a: Distribucion de Price
```{r Histograma de Price}
hist(raw_data$Price, col="blue", breaks = 60, freq = F)
lines(density(raw_data$Price), col = "red", lwd=2)
rug(raw_data$Price)
```

## 7b: Relacion Price vs Resto de Predictores
```{r Graficos de Dispersion: Price vs Todos}
plot(Price~., data=raw_data,col="blue")
```

# 8: Seleccion y modificacion de Variables    

## 8a: Estudio de correlacion
```{r Estudio de correlacion}
corrplot(cor(dplyr::select(raw_data, -c("Fuel_Type"))), type="upper", method="pie")
```


## 8b: Indicadores de Colinealidad

```{r Calculo de VIF y TOL sobre DataSet}
imcdiag(dplyr::select(raw_data, -c("Price", "Fuel_Type")), raw_data$Price)
```
* Mediante el calculo de VIF y haciendo principal hincapíe en los atributos cuyo valor de VIF es muy superior a 5, es posible que exista colinealidad vinculado con los atributos **Age_08_04,Mfg_Month, Mfg_Year, Radio y  Radio_cassette**

# 9: Limpieza de Datos   

## 9a: Tratamiento de Outliers 

* El atributo CC presenta un outlier(valor atípico) de CC = 16000. No es un valor coherente con el contexto de un vehiculo Toyota Corolla. Considero que probablemente fue un error y supongo que se agrego un cero de más, siendo el valor correcto 1600.   
* El atributo Guarantee_Period presenta un outlier de Guarantee_Period = 13.Considero que probablemente fue un error y decido imputar el valor 12.   
* El atributo KM presenta outliers para valores superiores a 150000 y valores menos a 10000. Si bien son valores coherentes dentro del contexto de vehiculos, al estar la mayor concentracion de los vehiculos dentro del **intervalo (10000,120000)**, decido recortar el dataSet, reduciendo su tamaño un 12%.     
* El atributo Age_08_04 presenta outliers para instancias cuyo Age_08_04 es menor a 25.   
* El atributo HP presenta outliers para valores superiores a 120 y valores menores a 80. Si bien son valores coherentes para algunos modelos de Toyota Corrolla, son casos particulares no representativos del conjunto de vehiculos.    
* Tras realizar estas operaciones, el dataset restante posee un 80% de las instancias del dataset original.     
* El atributo Price posee outliers para valores menores a 6500 y valores superiores a 15000. Por este motivo considero el **intervalo (6500,14500)**, dado que los valores excluidos no son representativos de la mayoria del conjunto.    


```{r Limpieza de Datos}

clean_data <- raw_data

clean_data = clean_data %>%  mutate(cc = ifelse(cc == 16000, 1600, cc))
clean_data = clean_data %>%  mutate(Guarantee_Period = ifelse(Guarantee_Period == 13, 12, Guarantee_Period))

clean_data = clean_data %>% filter((KM > 20000 & KM < 130000))
clean_data = clean_data %>% filter(Weight < 1075 & Weight > 1010)

clean_data$Cylinders = NULL

```

* El dataset restante posee un 72% de las instancias del dataset original.   


# 10: Selección de Variables  

## 10a: Best SubSet Selection


```{r Modelo Best SubSet Selection}
bss.model <- regsubsets(Price~., data = clean_data, nvmax = dim(clean_data)[2])
summary(bss.model)
```

* Mediante Best Subset se definen todos los modelos posibles para los 34 predictores del modelo mediante combinaciones de los mismos y comparando su valor de RSS obtenido.

### 10a.1: Cantidad de Variables propuestas por el modelo

```{r Modelo Best SubSet Selection 2 }
which.max(summary(bss.model)$adjr2)
```

* El modelo con un mejor valor de R ajustado utiliza 24 predictores. 

### 10a.2: R Ajustado vs Cant. Predictores

```{r BSS: R^2 ajustado vs Numero de Predictores}
p <- ggplot(data = data.frame(n_predictores = 1:32,
                              R_ajustado = summary(bss.model)$adjr2),
            aes(x = n_predictores, y = R_ajustado)) +
    geom_line() +
    geom_point()

p <- p + geom_point(aes(
                    x = n_predictores[which.max(summary(bss.model)$adjr2)],
                    y = R_ajustado[which.max(summary(bss.model)$adjr2)]),
                    colour = "red", size = 3)
p <- p +  scale_x_continuous(breaks = c(0:34)) + 
          theme_bw() +
          labs(title = 'R2_ajustado vs número de predictores', 
               x =  'número predictores')
p
```

### 10a.3: Coeficientes: Predictores propuestos por el modelo

```{r Coeficientes Max}
coef(object = bss.model, id = which.max(summary(bss.model)$adjr2))
```

* El modelo obtenido mediante el uso del metodo Best Subset propone usar los predictores expuestos en el recuadro superior.     

### 10a.4: R Ajustado: Predictores propuestos por el modelo

```{r Valor R^2 Ajustado: Metodo}
summary(bss.model)$adjr2[24]
```

* El valor de R Ajustado obtenido para el uso de 24 predictores es de 0.77. Pero, el uso de demasiados predictores añade complejidad al modelo.     

### 10a.5: R Ajustado: Cant. de Predictores Alternativas

```{r Valor R^2 Ajustado: Alternativas}
summary(bss.model)$adjr2[9]
```

```{r Valor R^2 Ajustado: Alternativas 2}
summary(bss.model)$adjr2[6]
```

* Reduciendo la cantidad predictores, disminuye el valor de R ajustado, pero disminuye la complejidad del modelo lo que permite una respuesta más general frente a nuevos datos de entrada.

### 10a.6: CP, AIC Y BIC

```{r BSS: AIC y BIC}
bss.model.sum = summary(bss.model)

par(mfrow = c(2, 2))
plot(bss.model.sum$rss, xlab = "Numero de Predictores", ylab = "RSS", type = "b")

plot(bss.model.sum$adjr2, xlab = "Numero de Predictores", ylab = "R Ajustada", type = "b")
best_adj_r2 = which.max(bss.model.sum$adjr2)
points(best_adj_r2, bss.model.sum$adjr2[best_adj_r2],
       col = "red",cex = 2, pch = 20)

plot(bss.model.sum$cp, xlab = "Numero de Predictores", ylab = "Cp", type = 'b')
best_cp = which.min(bss.model.sum$cp)
points(best_cp, bss.model.sum$cp[best_cp], 
       col = "red", cex = 2, pch = 20)

plot(bss.model.sum$bic, xlab = "Numero de Predictores", ylab = "BIC", type = 'b')
best_bic = which.min(bss.model.sum$bic)
points(best_bic, bss.model.sum$bic[best_bic], 
       col = "red", cex = 2, pch = 20)
```

* En las gráficas anteriores *BIC, CP, R ajustada* se observan los puntos cuyo valores son mínimos y que no concordancia entre ellos para seleccionar ubivocamente la cantidad de predictores a emplear en un modelo. Sin embargo, puede observarse puntualmente en cada grafico, que existen sutiles mejoras (casi imperceptibles) entre algunas cantidades de predictores.     

* Se destaca que no tiene importancia destacar el punto minimo de RSS, dado que al por la naturaleza del modelo, a mayor cantidad de variable, menor será su valor.   

### 10a.7: Coeficientes: Modelo Seleccionado

```{r Modelo 7 predictores}
coef(object = bss.model, id = 7)
```

* Mediante la comparación de las graficas, se opta por un modelo con 7 predictores.    

### 10a.8: Validacion Cruzada

```{r CV Modelo BSS}
set.seed(10)

index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)

data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
```


```{r CV Modelo BSS 2}
set.seed(10)
model.fwd <- regsubsets(Price ~., data = data.train, nvmax = 7)
summary(model.fwd)
```

### 10a.9: RMSE

```{r Calculo de RMSE BSS}
val.errors = rep(NA,7)
x.test <- model.matrix(Price ~., data = data.test)

for(i in 1:7) 
{
  coeficientes <- coef(model.fwd, id = i)
  predictions <- x.test[,names(coeficientes)] %*% coeficientes
  val.errors[i] <- mean((data.test$Price - predictions)^2)
}

rmse <- sqrt(mean(val.errors))
rmse
```

* Tras una validación cruzada se obtiene un valor de RMSE de 1408.    


## 10b: Ridge

* Mediante Ridge se obtiene un modelo a partir de los 34 predictores iniciales, ponderando la influencia de cada predictor pero sin eliminar ninguno de los mismos del modelo.

```{r}
set.seed(10)

index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)

data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
```

* Se produce en conjunto de entrenamiento y de prueba.    

## 10b.1: Modelo

```{r}

x = model.matrix(Price ~ . , data.train)[, -1]
y = as.matrix(data.train$Price)

ridge.model = glmnet(x, y, alpha = 0)

```


```{r}
beta=coef(ridge.model)

tmp <- as.data.frame(as.matrix(beta))
tmp$coef <- row.names(tmp)
tmp <- reshape::melt(tmp, id = "coef")
tmp$variable <- as.numeric(gsub("s", "", tmp$variable))
tmp$lambda <- ridge.model$lambda[tmp$variable+1]
tmp$norm <- apply(abs(beta[-1,]), 2, sum)[tmp$variable+1]

ggplot(tmp[tmp$coef != "(Intercept)",], aes(lambda, value, color = coef, group = coef, )) + 
  geom_line() + 
    scale_x_log10() + 
    xlab("Lambda (log scale)") + 
    guides(color = guide_legend(title = ""), 
           linetype = guide_legend(title = "")) +
    theme_bw() + 
    theme(legend.key.width = unit(3,"lines")) 
```


```{r}
plot(ridge.model, xvar = "lambda", label = TRUE)
```

* Mediante los dos graficos anteriores se puede observar que los predictores mas significativos son aquellos que, durante el procesamiento del modelo, demoran mas en acercarse a valores cercanos a cero.    

```{r}

indices <- sample(c(TRUE,FALSE), nrow(data.train), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 0)

plot(cv.out)
coef(cv.out)
```


* La grafica anterior representa como el error incrementa a mayores valores de log(lambda).    
* Las lineas punteadas indican, el valor minimo de mse obtenido en la validacion cruzada y el valor de la derecha representa el error estandar.     


```{r}

bestlam = cv.out$lambda.min
bestlam

```

* Se emplea el menor lambda obtenido de la validacion cruzada propia del modelo.    

```{r}

ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
sqrt(mean((ridge.pred - y[-indices])^2))

```

* Se obtiene un valor de rmse de 807 obtenido mediante los datos de entrenamiento.    

## 10b.2: Validacion Cruzada   

```{r}

x = model.matrix(Price ~ . , data.test)[, -1]
y = as.matrix(data.test$Price)

ridge.model = glmnet(x, y, alpha = 0)

```


```{r}

indices <- sample(c(TRUE,FALSE), nrow(data.test), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 0)

plot(cv.out)
coef(cv.out)
```

* El valor minimo de MSE y el error estandar, son similares a los obtenidos mediante datos de entrenamiento.   


```{r}

bestlam = cv.out$lambda.min
bestlam

```

* Se emplea el menor lambda obtenido de la validacion cruzada propia del modelo.    

```{r}

ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
sqrt(mean((ridge.pred - y[-indices])^2))

```

* Se obtiene un valor de rmse de 775 datos de prueba.



## 10c: Lasso      

* Mediante Lasso se obtiene un modelo a partir de los 34 predictores iniciales, ponderando la influencia de cada predictor. A diferencia de Ridge, usando Lasso si se realiza una seleccion de variables.     

```{r}
set.seed(10)

index <- createDataPartition(clean_data$Price, p = 0.7, list = FALSE)

data.train <- clean_data[index, ]
data.test <- clean_data[-index, ]
```

* Se produce en conjunto de entrenamiento y de prueba.    

## 10c.1: Modelo

```{r}

x = model.matrix(Price ~ . , data.train)[, -1]
y = as.matrix(data.train$Price)

lasso.model = glmnet(x, y, alpha = 1)

```


```{r}
beta=coef(lasso.model)

tmp <- as.data.frame(as.matrix(beta))
tmp$coef <- row.names(tmp)
tmp <- reshape::melt(tmp, id = "coef")
tmp$variable <- as.numeric(gsub("s", "", tmp$variable))
tmp$lambda <- lasso.model$lambda[tmp$variable+1]
tmp$norm <- apply(abs(beta[-1,]), 2, sum)[tmp$variable+1]

ggplot(tmp[tmp$coef != "(Intercept)",], aes(lambda, value, color = coef, group = coef, )) + 
  geom_line() + 
    scale_x_log10() + 
    xlab("Lambda (log scale)") + 
    guides(color = guide_legend(title = ""), 
           linetype = guide_legend(title = "")) +
    theme_bw() + 
    theme(legend.key.width = unit(3,"lines")) 
```


```{r}
plot(lasso.model, xvar = "lambda", label = TRUE)
```


* Mediante los dos graficos anteriores se puede observar que los predictores mas significativos son aquellos que, durante el procesamiento del modelo, demoran mas en acercarse a valores cercanos a cero.    


```{r}

indices <- sample(c(TRUE,FALSE), nrow(data.train), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)

plot(cv.out)
coef(cv.out)
```

* El valor minimo de MSE y el error estandar, son similares a los obtenidos mediante datos de entrenamiento.   



```{r}

bestlam = cv.out$lambda.min
bestlam

```

* Se emplea el menor lambda obtenido de la validacion cruzada propia del modelo.    


```{r}

lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
sqrt(mean((lasso.pred - y[-indices])^2))

```

* Se obtiene un valor de rmse de 813 datos de entrenamiento.

## 10c.2: Validacion Cruzada 

```{r}

x = model.matrix(Price ~ . , data.test)[, -1]
y = as.matrix(data.test$Price)

lasso.model = glmnet(x, y, alpha = 1)

```

```{r}

indices <- sample(c(TRUE,FALSE), nrow(data.test), replace = TRUE)

cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)

plot(cv.out)
coef(cv.out)
```

```{r}

bestlam = cv.out$lambda.min
bestlam

```

```{r}

lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
sqrt(mean((lasso.pred - y[-indices])^2))

```

* Se obtiene un valor de rmse de 813 datos de prueba.   


# 11: Conclusión

Dado que es un dataset con una cantidad baja de instancias (1500 aprox), y que mediante el uso de Lasso se obtuvo un resultado aceptable comparado a los rmse calculados con otros métodos (Best Subset y Ridge), se opta por usar ese metodo para seleccion de variables.    


