1. Carga de librerias y datos.

1.1 Carga de librerias

library(leaps)
library(ggplot2)
Registered S3 method overwritten by 'dplyr':
  method           from
  print.rowwise_df     
library(readr)
library(dplyr)

Attaching package: 㤼㸱dplyr㤼㸲

The following objects are masked from 㤼㸱package:stats㤼㸲:

    filter, lag

The following objects are masked from 㤼㸱package:base㤼㸲:

    intersect, setdiff, setequal, union
library(fastDummies)
library(glmnet)
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-18
library(caret)
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     

1.2 Carga de datos


rawdata <- read_csv("ToyotaCorolla.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  Model = col_character(),
  Fuel_Type = col_character()
)
See spec(...) for full column specifications.

1.3 Sumario de datos


summary(rawdata)
       Id            Model               Price         Age_08_04       Mfg_Month         Mfg_Year          KM        
 Min.   :   1.0   Length:1436        Min.   : 4350   Min.   : 1.00   Min.   : 1.000   Min.   :1998   Min.   :     1  
 1st Qu.: 361.8   Class :character   1st Qu.: 8450   1st Qu.:44.00   1st Qu.: 3.000   1st Qu.:1998   1st Qu.: 43000  
 Median : 721.5   Mode  :character   Median : 9900   Median :61.00   Median : 5.000   Median :1999   Median : 63390  
 Mean   : 721.6                      Mean   :10731   Mean   :55.95   Mean   : 5.549   Mean   :2000   Mean   : 68533  
 3rd Qu.:1081.2                      3rd Qu.:11950   3rd Qu.:70.00   3rd Qu.: 8.000   3rd Qu.:2001   3rd Qu.: 87021  
 Max.   :1442.0                      Max.   :32500   Max.   :80.00   Max.   :12.000   Max.   :2004   Max.   :243000  
  Fuel_Type               HP          Met_Color        Automatic             cc            Doors         Cylinders
 Length:1436        Min.   : 69.0   Min.   :0.0000   Min.   :0.00000   Min.   : 1300   Min.   :2.000   Min.   :4  
 Class :character   1st Qu.: 90.0   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: 1400   1st Qu.:3.000   1st Qu.:4  
 Mode  :character   Median :110.0   Median :1.0000   Median :0.00000   Median : 1600   Median :4.000   Median :4  
                    Mean   :101.5   Mean   :0.6748   Mean   :0.05571   Mean   : 1577   Mean   :4.033   Mean   :4  
                    3rd Qu.:110.0   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.: 1600   3rd Qu.:5.000   3rd Qu.:4  
                    Max.   :192.0   Max.   :1.0000   Max.   :1.00000   Max.   :16000   Max.   :5.000   Max.   :4  
     Gears       Quarterly_Tax        Weight     Mfr_Guarantee    BOVAG_Guarantee  Guarantee_Period      ABS        
 Min.   :3.000   Min.   : 19.00   Min.   :1000   Min.   :0.0000   Min.   :0.0000   Min.   : 3.000   Min.   :0.0000  
 1st Qu.:5.000   1st Qu.: 69.00   1st Qu.:1040   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.: 3.000   1st Qu.:1.0000  
 Median :5.000   Median : 85.00   Median :1070   Median :0.0000   Median :1.0000   Median : 3.000   Median :1.0000  
 Mean   :5.026   Mean   : 87.12   Mean   :1072   Mean   :0.4095   Mean   :0.8955   Mean   : 3.815   Mean   :0.8134  
 3rd Qu.:5.000   3rd Qu.: 85.00   3rd Qu.:1085   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.:1.0000  
 Max.   :6.000   Max.   :283.00   Max.   :1615   Max.   :1.0000   Max.   :1.0000   Max.   :36.000   Max.   :1.0000  
    Airbag_1         Airbag_2          Airco        Automatic_airco   Boardcomputer      CD_Player     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :1.0000   Median :1.0000   Median :0.00000   Median :0.0000   Median :0.0000  
 Mean   :0.9708   Mean   :0.7228   Mean   :0.5084   Mean   :0.05641   Mean   :0.2946   Mean   :0.2187  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
  Central_Lock    Powered_Windows Power_Steering       Radio          Mistlamps      Sport_Model     Backseat_Divider
 Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
 Median :1.0000   Median :1.000   Median :1.0000   Median :0.0000   Median :0.000   Median :0.0000   Median :1.0000  
 Mean   :0.5801   Mean   :0.562   Mean   :0.9777   Mean   :0.1462   Mean   :0.257   Mean   :0.3001   Mean   :0.7702  
 3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
  Metallic_Rim    Radio_cassette      Tow_Bar      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.2047   Mean   :0.1455   Mean   :0.2779  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  

1.4 Estructura de datos

str(rawdata)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    1436 obs. of  37 variables:
 $ Id              : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Model           : chr  "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "?TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" ...
 $ Price           : num  13500 13750 13950 14950 13750 ...
 $ Age_08_04       : num  23 23 24 26 30 32 27 30 27 23 ...
 $ Mfg_Month       : num  10 10 9 7 3 1 6 3 6 10 ...
 $ Mfg_Year        : num  2002 2002 2002 2002 2002 ...
 $ KM              : num  46986 72937 41711 48000 38500 ...
 $ Fuel_Type       : chr  "Diesel" "Diesel" "Diesel" "Diesel" ...
 $ HP              : num  90 90 90 90 90 90 90 90 192 69 ...
 $ Met_Color       : num  1 1 1 0 0 0 1 1 0 0 ...
 $ Automatic       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cc              : num  2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
 $ Doors           : num  3 3 3 3 3 3 3 3 3 3 ...
 $ Cylinders       : num  4 4 4 4 4 4 4 4 4 4 ...
 $ Gears           : num  5 5 5 5 5 5 5 5 5 5 ...
 $ Quarterly_Tax   : num  210 210 210 210 210 210 210 210 100 185 ...
 $ Weight          : num  1165 1165 1165 1165 1170 ...
 $ Mfr_Guarantee   : num  0 0 1 1 1 0 0 1 0 0 ...
 $ BOVAG_Guarantee : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Guarantee_Period: num  3 3 3 3 3 3 3 3 3 3 ...
 $ ABS             : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Airbag_1        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Airbag_2        : num  1 1 1 1 1 1 1 1 0 1 ...
 $ Airco           : num  0 1 0 0 1 1 1 1 1 1 ...
 $ Automatic_airco : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Boardcomputer   : num  1 1 1 1 1 1 1 1 0 1 ...
 $ CD_Player       : num  0 1 0 0 0 0 0 1 0 0 ...
 $ Central_Lock    : num  1 1 0 0 1 1 1 1 1 0 ...
 $ Powered_Windows : num  1 0 0 0 1 1 1 1 1 0 ...
 $ Power_Steering  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Radio           : num  0 0 0 0 0 0 0 0 1 0 ...
 $ Mistlamps       : num  0 0 0 0 1 1 0 0 0 0 ...
 $ Sport_Model     : num  0 0 0 0 0 0 1 0 0 0 ...
 $ Backseat_Divider: num  1 1 1 1 1 1 1 1 0 1 ...
 $ Metallic_Rim    : num  0 0 0 0 0 0 0 0 1 0 ...
 $ Radio_cassette  : num  0 0 0 0 0 0 0 0 1 0 ...
 $ Tow_Bar         : num  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   Id = col_double(),
  ..   Model = col_character(),
  ..   Price = col_double(),
  ..   Age_08_04 = col_double(),
  ..   Mfg_Month = col_double(),
  ..   Mfg_Year = col_double(),
  ..   KM = col_double(),
  ..   Fuel_Type = col_character(),
  ..   HP = col_double(),
  ..   Met_Color = col_double(),
  ..   Automatic = col_double(),
  ..   cc = col_double(),
  ..   Doors = col_double(),
  ..   Cylinders = col_double(),
  ..   Gears = col_double(),
  ..   Quarterly_Tax = col_double(),
  ..   Weight = col_double(),
  ..   Mfr_Guarantee = col_double(),
  ..   BOVAG_Guarantee = col_double(),
  ..   Guarantee_Period = col_double(),
  ..   ABS = col_double(),
  ..   Airbag_1 = col_double(),
  ..   Airbag_2 = col_double(),
  ..   Airco = col_double(),
  ..   Automatic_airco = col_double(),
  ..   Boardcomputer = col_double(),
  ..   CD_Player = col_double(),
  ..   Central_Lock = col_double(),
  ..   Powered_Windows = col_double(),
  ..   Power_Steering = col_double(),
  ..   Radio = col_double(),
  ..   Mistlamps = col_double(),
  ..   Sport_Model = col_double(),
  ..   Backseat_Divider = col_double(),
  ..   Metallic_Rim = col_double(),
  ..   Radio_cassette = col_double(),
  ..   Tow_Bar = col_double()
  .. )

1.5 Buscando datos faltantes


sum(is.na(rawdata))
[1] 0

1.6 quitando las variables id y model


rawdata2 <- rawdata
rawdata2$Id <- NULL
rawdata2$Model <- NULL

2. Limpieza de dataset

2.1 Transformar a factores


rawdata2$Fuel_Type <- as.factor(rawdata2$Fuel_Type)

#rawdata2$Cylinders <- as.factor(rawdata2$Cylinders)
##(rawdata, aes(x = Price, y = Weight, colour = Fuel_Type))+ geom_point()

2.2 Filtrado


rawdata3 <- rawdata2

rawdata3$cc <- ifelse(rawdata3$cc == 16000, 1600, rawdata3$cc) 
rawdata3 <- filter(rawdata3, rawdata3$Weight < 1200)
rawdata3 <- filter(rawdata3, rawdata3$KM > 1000)
rawdata3 <- filter(rawdata3, rawdata3$KM < 150000)

2.3 Eliminando variables


rawdata3$Mfg_Month <- NULL
rawdata3$Mfg_Year <- NULL
rawdata3$Cylinders <-NULL

2.4 Estructura de nuevo dataset

str(rawdata3)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    1340 obs. of  32 variables:
 $ Price           : num  13500 13750 13950 14950 13750 ...
 $ Age_08_04       : num  23 23 24 26 30 32 27 23 25 22 ...
 $ KM              : num  46986 72937 41711 48000 38500 ...
 $ Fuel_Type       : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 3 2 3 3 ...
 $ HP              : num  90 90 90 90 90 90 192 69 192 192 ...
 $ Met_Color       : num  1 1 1 0 0 0 0 0 0 0 ...
 $ Automatic       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cc              : num  2000 2000 2000 2000 2000 2000 1800 1900 1800 1800 ...
 $ Doors           : num  3 3 3 3 3 3 3 3 3 3 ...
 $ Gears           : num  5 5 5 5 5 5 5 5 6 6 ...
 $ Quarterly_Tax   : num  210 210 210 210 210 210 100 185 100 100 ...
 $ Weight          : num  1165 1165 1165 1165 1170 ...
 $ Mfr_Guarantee   : num  0 0 1 1 1 0 0 0 1 1 ...
 $ BOVAG_Guarantee : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Guarantee_Period: num  3 3 3 3 3 3 3 3 12 3 ...
 $ ABS             : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Airbag_1        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Airbag_2        : num  1 1 1 1 1 1 0 1 1 1 ...
 $ Airco           : num  0 1 0 0 1 1 1 1 1 1 ...
 $ Automatic_airco : num  0 0 0 0 0 0 0 0 1 1 ...
 $ Boardcomputer   : num  1 1 1 1 1 1 0 1 0 1 ...
 $ CD_Player       : num  0 1 0 0 0 0 0 0 1 0 ...
 $ Central_Lock    : num  1 1 0 0 1 1 1 0 1 1 ...
 $ Powered_Windows : num  1 0 0 0 1 1 1 0 1 1 ...
 $ Power_Steering  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Radio           : num  0 0 0 0 0 0 1 0 0 0 ...
 $ Mistlamps       : num  0 0 0 0 1 1 0 0 0 1 ...
 $ Sport_Model     : num  0 0 0 0 0 0 0 0 0 1 ...
 $ Backseat_Divider: num  1 1 1 1 1 1 0 1 0 1 ...
 $ Metallic_Rim    : num  0 0 0 0 0 0 1 0 1 1 ...
 $ Radio_cassette  : num  0 0 0 0 0 0 1 0 0 0 ...
 $ Tow_Bar         : num  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   Id = col_double(),
  ..   Model = col_character(),
  ..   Price = col_double(),
  ..   Age_08_04 = col_double(),
  ..   Mfg_Month = col_double(),
  ..   Mfg_Year = col_double(),
  ..   KM = col_double(),
  ..   Fuel_Type = col_character(),
  ..   HP = col_double(),
  ..   Met_Color = col_double(),
  ..   Automatic = col_double(),
  ..   cc = col_double(),
  ..   Doors = col_double(),
  ..   Cylinders = col_double(),
  ..   Gears = col_double(),
  ..   Quarterly_Tax = col_double(),
  ..   Weight = col_double(),
  ..   Mfr_Guarantee = col_double(),
  ..   BOVAG_Guarantee = col_double(),
  ..   Guarantee_Period = col_double(),
  ..   ABS = col_double(),
  ..   Airbag_1 = col_double(),
  ..   Airbag_2 = col_double(),
  ..   Airco = col_double(),
  ..   Automatic_airco = col_double(),
  ..   Boardcomputer = col_double(),
  ..   CD_Player = col_double(),
  ..   Central_Lock = col_double(),
  ..   Powered_Windows = col_double(),
  ..   Power_Steering = col_double(),
  ..   Radio = col_double(),
  ..   Mistlamps = col_double(),
  ..   Sport_Model = col_double(),
  ..   Backseat_Divider = col_double(),
  ..   Metallic_Rim = col_double(),
  ..   Radio_cassette = col_double(),
  ..   Tow_Bar = col_double()
  .. )
  • Luego de realizar la limpieza el datset contiene 1340 observaciones y 32 variables

3. Best Subset Selection

regfit.full = regsubsets(Price ~., data = rawdata3, nvmax=31)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Price ~ ., data = rawdata3, nvmax = 31)
32 Variables  (and intercept)
                 Forced in Forced out
Age_08_04            FALSE      FALSE
KM                   FALSE      FALSE
Fuel_TypeDiesel      FALSE      FALSE
Fuel_TypePetrol      FALSE      FALSE
HP                   FALSE      FALSE
Met_Color            FALSE      FALSE
Automatic            FALSE      FALSE
cc                   FALSE      FALSE
Doors                FALSE      FALSE
Gears                FALSE      FALSE
Quarterly_Tax        FALSE      FALSE
Weight               FALSE      FALSE
Mfr_Guarantee        FALSE      FALSE
BOVAG_Guarantee      FALSE      FALSE
Guarantee_Period     FALSE      FALSE
ABS                  FALSE      FALSE
Airbag_1             FALSE      FALSE
Airbag_2             FALSE      FALSE
Airco                FALSE      FALSE
Automatic_airco      FALSE      FALSE
Boardcomputer        FALSE      FALSE
CD_Player            FALSE      FALSE
Central_Lock         FALSE      FALSE
Powered_Windows      FALSE      FALSE
Power_Steering       FALSE      FALSE
Radio                FALSE      FALSE
Mistlamps            FALSE      FALSE
Sport_Model          FALSE      FALSE
Backseat_Divider     FALSE      FALSE
Metallic_Rim         FALSE      FALSE
Radio_cassette       FALSE      FALSE
Tow_Bar              FALSE      FALSE
1 subsets of each size up to 31
Selection Algorithm: exhaustive
          Age_08_04 KM  Fuel_TypeDiesel Fuel_TypePetrol HP  Met_Color Automatic cc  Doors Gears Quarterly_Tax Weight
1  ( 1 )  "*"       " " " "             " "             " " " "       " "       " " " "   " "   " "           " "   
2  ( 1 )  "*"       " " " "             " "             " " " "       " "       " " " "   " "   " "           " "   
3  ( 1 )  "*"       " " " "             "*"             " " " "       " "       " " " "   " "   " "           "*"   
4  ( 1 )  "*"       " " " "             "*"             " " " "       " "       " " " "   " "   " "           "*"   
5  ( 1 )  "*"       "*" "*"             " "             " " " "       " "       " " " "   " "   " "           "*"   
6  ( 1 )  "*"       "*" " "             " "             "*" " "       " "       "*" " "   " "   " "           "*"   
7  ( 1 )  "*"       "*" " "             " "             "*" " "       " "       "*" " "   " "   " "           "*"   
8  ( 1 )  "*"       "*" " "             "*"             " " " "       " "       " " " "   " "   "*"           "*"   
9  ( 1 )  "*"       "*" " "             " "             "*" " "       " "       "*" " "   " "   " "           "*"   
10  ( 1 ) "*"       "*" " "             " "             "*" " "       " "       "*" " "   " "   "*"           "*"   
11  ( 1 ) "*"       "*" " "             " "             "*" " "       " "       "*" " "   " "   "*"           "*"   
12  ( 1 ) "*"       "*" " "             " "             "*" " "       " "       "*" " "   " "   "*"           "*"   
13  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" " "   " "   "*"           "*"   
14  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" " "   " "   "*"           "*"   
15  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
16  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
17  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
18  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
19  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
20  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
21  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
22  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       " "       "*" "*"   " "   "*"           "*"   
23  ( 1 ) "*"       "*" "*"             "*"             "*" " "       " "       "*" "*"   " "   "*"           "*"   
24  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       " "       "*" "*"   " "   "*"           "*"   
25  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       " "       "*" "*"   "*"   "*"           "*"   
26  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       " "       "*" "*"   "*"   "*"           "*"   
27  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       " "       "*" "*"   "*"   "*"           "*"   
28  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       "*"       "*" "*"   "*"   "*"           "*"   
29  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       "*"       "*" "*"   "*"   "*"           "*"   
30  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       "*"       "*" "*"   "*"   "*"           "*"   
31  ( 1 ) "*"       "*" "*"             "*"             "*" "*"       "*"       "*" "*"   "*"   "*"           "*"   
          Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
1  ( 1 )  " "           " "             " "              " " " "      " "      " "   " "             " "          
2  ( 1 )  " "           " "             " "              " " " "      " "      " "   "*"             " "          
3  ( 1 )  " "           " "             " "              " " " "      " "      " "   " "             " "          
4  ( 1 )  " "           " "             " "              " " " "      " "      " "   "*"             " "          
5  ( 1 )  " "           " "             " "              " " " "      " "      " "   "*"             " "          
6  ( 1 )  " "           " "             " "              " " " "      " "      " "   "*"             " "          
7  ( 1 )  " "           " "             " "              " " " "      " "      " "   "*"             " "          
8  ( 1 )  " "           " "             "*"              " " " "      " "      " "   "*"             " "          
9  ( 1 )  " "           "*"             "*"              " " " "      " "      " "   "*"             " "          
10  ( 1 ) " "           "*"             "*"              " " " "      " "      " "   "*"             " "          
11  ( 1 ) "*"           "*"             "*"              " " " "      " "      " "   "*"             " "          
12  ( 1 ) "*"           "*"             "*"              " " " "      " "      "*"   "*"             " "          
13  ( 1 ) "*"           "*"             "*"              " " " "      " "      " "   "*"             " "          
14  ( 1 ) "*"           "*"             "*"              " " " "      " "      "*"   "*"             " "          
15  ( 1 ) "*"           "*"             "*"              " " " "      " "      "*"   "*"             " "          
16  ( 1 ) "*"           "*"             "*"              " " " "      " "      "*"   "*"             " "          
17  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             " "          
18  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
19  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
20  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
21  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
22  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
23  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
24  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
25  ( 1 ) "*"           "*"             "*"              "*" " "      " "      "*"   "*"             "*"          
26  ( 1 ) "*"           "*"             "*"              "*" " "      "*"      "*"   "*"             "*"          
27  ( 1 ) "*"           "*"             "*"              "*" " "      "*"      "*"   "*"             "*"          
28  ( 1 ) "*"           "*"             "*"              "*" " "      "*"      "*"   "*"             "*"          
29  ( 1 ) "*"           "*"             "*"              "*" " "      "*"      "*"   "*"             "*"          
30  ( 1 ) "*"           "*"             "*"              "*" " "      "*"      "*"   "*"             "*"          
31  ( 1 ) "*"           "*"             "*"              "*" " "      "*"      "*"   "*"             "*"          
          CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
1  ( 1 )  " "       " "          " "             " "            " "   " "       " "         " "             
2  ( 1 )  " "       " "          " "             " "            " "   " "       " "         " "             
3  ( 1 )  " "       " "          " "             " "            " "   " "       " "         " "             
4  ( 1 )  " "       " "          " "             " "            " "   " "       " "         " "             
5  ( 1 )  " "       " "          " "             " "            " "   " "       " "         " "             
6  ( 1 )  " "       " "          " "             " "            " "   " "       " "         " "             
7  ( 1 )  " "       " "          "*"             " "            " "   " "       " "         " "             
8  ( 1 )  " "       " "          "*"             " "            " "   " "       " "         " "             
9  ( 1 )  " "       " "          "*"             " "            " "   " "       " "         " "             
10  ( 1 ) " "       " "          "*"             " "            " "   " "       " "         " "             
11  ( 1 ) " "       " "          "*"             " "            " "   " "       " "         " "             
12  ( 1 ) " "       " "          "*"             " "            " "   " "       " "         " "             
13  ( 1 ) " "       " "          "*"             " "            " "   " "       " "         " "             
14  ( 1 ) " "       " "          "*"             " "            " "   " "       " "         " "             
15  ( 1 ) " "       " "          "*"             " "            " "   " "       " "         " "             
16  ( 1 ) "*"       " "          "*"             " "            " "   " "       " "         " "             
17  ( 1 ) "*"       " "          "*"             " "            " "   " "       " "         " "             
18  ( 1 ) "*"       " "          "*"             " "            " "   " "       " "         " "             
19  ( 1 ) "*"       " "          "*"             " "            " "   " "       " "         " "             
20  ( 1 ) "*"       " "          "*"             " "            " "   " "       " "         " "             
21  ( 1 ) "*"       " "          "*"             " "            " "   " "       " "         "*"             
22  ( 1 ) "*"       " "          "*"             " "            " "   " "       " "         "*"             
23  ( 1 ) "*"       " "          "*"             " "            "*"   " "       " "         "*"             
24  ( 1 ) "*"       " "          "*"             " "            "*"   " "       " "         "*"             
25  ( 1 ) "*"       " "          "*"             " "            "*"   " "       " "         "*"             
26  ( 1 ) "*"       " "          "*"             " "            "*"   " "       " "         "*"             
27  ( 1 ) "*"       " "          "*"             " "            "*"   " "       "*"         "*"             
28  ( 1 ) "*"       " "          "*"             " "            "*"   " "       "*"         "*"             
29  ( 1 ) "*"       "*"          "*"             " "            "*"   " "       "*"         "*"             
30  ( 1 ) "*"       "*"          "*"             " "            "*"   "*"       "*"         "*"             
31  ( 1 ) "*"       "*"          "*"             "*"            "*"   "*"       "*"         "*"             
          Metallic_Rim Radio_cassette Tow_Bar
1  ( 1 )  " "          " "            " "    
2  ( 1 )  " "          " "            " "    
3  ( 1 )  " "          " "            " "    
4  ( 1 )  " "          " "            " "    
5  ( 1 )  " "          " "            " "    
6  ( 1 )  " "          " "            " "    
7  ( 1 )  " "          " "            " "    
8  ( 1 )  " "          " "            " "    
9  ( 1 )  " "          " "            " "    
10  ( 1 ) " "          " "            " "    
11  ( 1 ) " "          " "            " "    
12  ( 1 ) " "          " "            " "    
13  ( 1 ) " "          " "            " "    
14  ( 1 ) " "          " "            " "    
15  ( 1 ) " "          " "            " "    
16  ( 1 ) " "          " "            " "    
17  ( 1 ) " "          " "            " "    
18  ( 1 ) " "          " "            " "    
19  ( 1 ) " "          " "            "*"    
20  ( 1 ) "*"          " "            "*"    
21  ( 1 ) "*"          " "            "*"    
22  ( 1 ) "*"          " "            "*"    
23  ( 1 ) "*"          "*"            "*"    
24  ( 1 ) "*"          "*"            "*"    
25  ( 1 ) "*"          "*"            "*"    
26  ( 1 ) "*"          "*"            "*"    
27  ( 1 ) "*"          "*"            "*"    
28  ( 1 ) "*"          "*"            "*"    
29  ( 1 ) "*"          "*"            "*"    
30  ( 1 ) "*"          "*"            "*"    
31  ( 1 ) "*"          "*"            "*"    
  • Best Subset Selection nos proporciona 31 modelos con sus respectivos predictores
reg.summary = summary(regfit.full)
names(reg.summary)

3.1 RSS

which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "l")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
  • El modelo con RSS minimo es 31.

3.2 ADJR2


which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="Adjr2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)
  • El modelo con ADJR2 maximo es 21.

3.3 CP

which.min(reg.summary$cp)

plot(reg.summary$cp, xlab="Numero de variables", ylab = "CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col= "red", cex=2, pch=20)
  • El modelo con CP minimo es 21.

3.4 BIC

which.min(reg.summary$bic)

plot(reg.summary$bic, xlab="Numero de variables", ylab = "CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col= "red", cex=2, pch=20)
  • El modelo con bic minimo es 15.

3.5 Conclusiones

reg.summary$adjr2[6]
reg.summary$adjr2[8]
reg.summary$adjr2[10]
reg.summary$adjr2[15]
reg.summary$adjr2[21]
  • A juzgar por los graficos pareciera que el numero adecuado de predictores es 15, sin embargo vemos que de un modelo de 8 preditores a uno de 10, 15 o 21 el R2 ajustado varia muy poco. Por lo tanto se opta por un modelo de 8 predictores.

3.5 Validacion Best Subset Selection

index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
  • Se divide el dataset en dos particiones, una de entrenamiento y otra de validacion.
set.seed(7)
dim(rawdata3)
regfit.train <- regsubsets(Price ~., data = data.train, nvmax = 31, really.big = TRUE)
#summary(regfit.train)
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(regfit.train, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
d<-which.min(val.errors)

rmse <- sqrt(val.errors)
value.min <- which.min(rmse)
value.min
rmsebss_min <- rmse[value.min]
rmsebss_min
  • El modelo con menor RMSE es el de 20 predictores con un rmse de 1018

plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE Best Subset", pch = 20, type = "b")
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
options(scipen = 999)
coef(regfit.train, which.min(rmse))
  • Coeficientes del modelo con menor RMSE
set.seed(7)
rmse[8]
rmse[10]
rmse[15]
rmse[20]

3.6 Validacion Cruzada Best Subset


predict.regsubsets  <- function(object, newdata, id, ...){
    # Extraer la fórmula del modelo (variable dependiente ~ predictores)
    form <- as.formula(object$call[[2]])
    # Generar una matriz modelo con los nuevos datos y la fórmula
    mat <- model.matrix(form, newdata)
    # Extraer los coeficientes del modelo
    coefi <- coef(object , id = id)
    # Almacenar el nombre de las variables predictoras del modelo
    xvars <- names(coefi)
    # Producto matricial entre los coeficientes del modelo y los valores de
    # los predictores de las nuevas observaciones para obtener las 
    # predicciones
    mat[ , xvars] %*% coefi
}
  • Funcion para predecir
set.seed(123)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
  regfit.val <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31,  really.big = TRUE)
   for(i in 1:31) {
    test <- rawdata3[folds == k,]
    predictions <- predict.regsubsets(object = regfit.val, newdata = rawdata3[folds == k,] , id = i)
    cv.errors[k,i] <- mean((test$Price - predictions)^2)
  }
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
rmse.cv

set.seed(123)
value.min <- which.min(rmse.cv)
value.min
rmsebss.cv_min <- rmse.cv[value.min]
rmsebss.cv_min
  • El modelo con menor RMSE es el de 21 predictores con un rmse de 1064
set.seed(123)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
value.min <- which.min(rmse.cv)
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
coef(regfit.val, which.min(rmse.cv))
  • Coeficientes del modelo con menor RMSE

4. Stepwise Selection: Forward Model

regfitfwd.full = regsubsets(Price ~., data = rawdata3, nvmax=31, method="forward")
summary(regfitfwd.full)
  • Stepwise Forward , 31 modelos
reg.summary = summary(regfitfwd.full)
names(reg.summary)

4.1 RSS

which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "b")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
  • El modelo con menor rss posee 31 predictores

4.2 ADJR2

which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="ADJR2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)
  • El modelo con mayor R2 ajustado posee 21 predictores

4.3 CP

which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col="red", cex=2, pch=20)
  • El modelo con menor CP posee 21 predictores

4.4 bic

which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col="red", cex=2, pch=20)
  • El modelo con menor bic posee 15 predictores

4.5 Conclusiones

reg.summary$adjr2[6]
reg.summary$adjr2[8]
reg.summary$adjr2[10]
reg.summary$adjr2[15]
reg.summary$adjr2[21]
  • A juzgar por los graficos pareciera que el numero adecuado de predictores es 15, el R2 ajustado varia muy poco al aumentar los predictores.

4.6 Validación Stepwise Selection Forward


index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
  • Particion del dataset en entrenamiento y validacion
set.seed(10)
dim(rawdata3)
model.fwd <- regsubsets(Price ~., data = data.train, nvmax = 31, method = "forward", really.big = TRUE)
summary(model.fwd)
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(model.fwd, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
rmse <- sqrt(val.errors)
set.seed(123)
value.min <- which.min(rmse)
value.min
rmsesfw_min <- rmse[value.min]
rmsesfw_min
  • La validacion arroja que el modelo con menor RMSE es el de 20 variables con un RMSE de 1035
set.seed(123)
rmse[15]
set.seed(123)
rmse[10]
  • La diferencia de rmse del modelo con 15 predictores y el modelo con 20 predictores es minima. Se puede optar por el modelo de 15 predictores.
set.seed(123)
plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE Best Subset", pch = 20, type = "b")
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
coef(model.fwd, which.min(rmse))

4.7 Validacion cruzada Stepwise Forward

set.seed(14)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
  model.fwd <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31, method = "forward", really.big = TRUE)
   for(i in 1:31) {
    test <- rawdata3[folds == k,]
    predictions <- predict.regsubsets(object = model.fwd, newdata = rawdata3[folds == k,] , id = i)
    cv.errors[k,i] <- mean((test$Price - predictions)^2)
  }
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
set.seed(14)
value.min <- which.min(rmse.cv)
value.min
rmsesfw.cv_min <- rmse.cv[value.min]
rmsesfw.cv_min
  • El modelo con rmse minimo es el de 22 preditores con un valor de 1057
set.seed(14)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
value.min <- which.min(rmse.cv)
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
coef(model.fwd, which.min(rmse.cv))
  • coeficientes del modelo con 22 predictores.

5. Stepwise Selection: Backward Model

regfitbwd.full = regsubsets(Price ~., data = rawdata3, nvmax=31, method="backward")
summary(regfit.full)
  • Stepwise backward , 31 modelos
reg.summary = summary(regfitbwd.full)
names(reg.summary)

5.1 RSS

which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "b")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
  • El modelo con menor rss posee 31 predictores

5.2 ADJR2

which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="ADJR2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)
  • El modelo con mayor R2 ajustado posee 21 predictores

5.3 CP

which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col="red", cex=2, pch=20)
  • El modelo con menor CP posee 21 predictores

5.4 bic

which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col="red", cex=2, pch=20)
  • El modelo con menor bic posee 15 predictores

5.5 Conclusiones

reg.summary$adjr2[6]
reg.summary$adjr2[8]
reg.summary$adjr2[10]
reg.summary$adjr2[15]
reg.summary$adjr2[21]
  • A juzgar por los graficos pareciera que el numero adecuado de predictores es 15, el R2 ajustado varia muy poco al aumentar los predictores.

5.6 Validación Stepwise Selection Backward


index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
  • Particion del dataset en entrenamiento y validacion
set.seed(191)
dim(rawdata3)
model.bwd <- regsubsets(Price ~., data = data.train, nvmax = 31, method = "backward", really.big = TRUE)
summary(model.bwd)
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(model.bwd, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
rmse <- sqrt(val.errors)
set.seed(191)
value.min <- which.min(rmse)
value.min
rmsesbw_min <- rmse[value.min]
rmsesbw_min
  • La validacion arroja que el modelo con menor RMSE es el de 16 variables con un RMSE de 1042
set.seed(191)
rmse[20]
set.seed(191)
rmse[15]
set.seed(191)
rmse[10]
  • La diferencia de rmse del modelo con 16 predictores y el modelo con 20 predictores es minima.
set.seed(191)
plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE", pch = 20, type = "b")
value.min <- which.min(rmse)
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
coef(model.fwd, which.min(rmse))

5.7 Validacion cruzada Stepwise Backward

set.seed(141)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
  model.bwd <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31, method = "backward", really.big = TRUE)
   for(i in 1:31) {
    test <- rawdata3[folds == k,]
    predictions <- predict.regsubsets(object = model.bwd, newdata = rawdata3[folds == k,] , id = i)
    cv.errors[k,i] <- mean((test$Price - predictions)^2)
  }
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
set.seed(141)
value.min <- which.min(rmse.cv)
value.min
rmsesbw.cv_min <- rmse.cv[value.min]
rmsesbw.cv_min
  • El modelo con rmse minimo es el de 20 preditores con un valor de 1059
set.seed(141)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
coef(model.bwd, which.min(rmse.cv))
  • coeficientes del modelo con 20 predictores.

6. Ridge

6.1 Estandarizacion de valores


x <- model.matrix(Price ~. , rawdata3)[,-1]
y <- rawdata3$Price
grid <- 10^seq(10,-2,length = 100)
ridge.model <- glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.model))
coef(ridge.model)
plot(ridge.model, xvar ='lambda',label = TRUE)
  • A medida que lambda crece los coeficientes tienden a cero.
options(scipen=999)
predict(ridge.model, s = 10, type = "coefficients")

6.2 Validacion Cruzada

set.seed(23)
indices <- sample(c(TRUE,FALSE), nrow(rawdata3), replace = TRUE)
y.test <- y[-indices]

ridge.model.cv <- cv.glmnet(x[indices,], y[indices], alpha = 0) # Por defecto, 10-fold
coef(ridge.model.cv)
plot(ridge.model.cv)
  • Las lineas horizontales representan, la primera, el minimo de la validacion cruzada, y la segunda, el error estandar.
set.seed(23)
bestlam <- ridge.model.cv$lambda.min
bestlam
  • Calculamos el menor lambda para aplicarlo a la validacion cruzada
  • Menor lambda 285.38

6.3

set.seed(23)
ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
mse <-mean((ridge.pred - y[-indices])^2)
rmse <-sqrt(mse)
value.min <- which.min(rmse)
value.min
rmseridge <- rmse[value.min]
rmseridge
  • El modelo obtenido con Ridge Regression utilizando el lambda minimo nos da un RMSE de 1056.6
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)
  • Ninguno de los coeficientes es exactamente igual a cero, ya que Ridge no realiza seleccion de variables.
plot(out, xvar ='lambda',label = TRUE)
  • Los coeficientes se ajustan mas a cero

7. Lasso

lasso.model <- glmnet(x, y, alpha = 1)
plot(lasso.model, xvar ='lambda',label = TRUE)

7.1 Validacion Cruzada

indices <- sample(c(TRUE,FALSE), nrow(rawdata3), replace = TRUE)
cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)
plot(cv.out)
  • el valor minimo de la validacion cruzada, es de tamaño 32, mientras que el error estandar es de tamaño 15.
coef(cv.out)
  • La validacion cruzada no toma en cuenta 17 variables, el modelo queda con 15 variables.
bestlam <- cv.out$lambda.min
bestlam
  • Lambda minimo es 0.79

7.2 Validacion

lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
mse <-mean((lasso.pred - y[-indices])^2)
rmse <-sqrt(mse)
value.min <- which.min(rmse)
rmselasso <- rmse[value.min]
rmselasso

-El valor de RMSE es de 1025

options(scipen=999)
out <- glmnet(x, y, alpha = 1)
lasso.coef <- predict(out, type = "coefficients", s = bestlam)
lasso.coef
plot(out, xvar='lambda', label= TRUE)
  • Se reemplaza el valor arbitrario con el mejor lambda obtenido

8. PCA


rawdata4 <- rawdata2
rawdata4 <- dummy_cols(rawdata4, select_columns = "Fuel_Type", remove_first_dummy = TRUE)
rawdata4 <- select(rawdata4, - Fuel_Type, - Mfg_Year, - Mfg_Month, - Cylinders)
apply(X = rawdata4, MARGIN = 2, FUN = mean)
  • La media de las variables, vemos que existe mucha diferencia entre algunas medias como la de Price o KM. Es necesario una estandarizacion. LA media debe ser cercana a cero.
apply(X = rawdata4, MARGIN = 2, FUN = var)
  • La varianza es la desviacion estandar al cuadrado, analizo la varianza ya que la desviacion estandar puede valores negativos. Se puede observar que algunas variables poseen varianzas dispares. La varianza debe ser cercana a uno.

8.1 Modelo PCA

pca.model <- prcomp(rawdata4[,-1], scale = TRUE)
names(pca.model) 
pca.model$center
  • Representa la media de las variables previa a la estandarizacion
pca.model$scale
  • Representa la desviacion de las variables previa a la estandarizacion
pca.model$rotation
  • Cada variable de las componentes tienen asignado un peso, las de mayor peso son las que explican dicha componente.
dim(pca.model$x)
  • Dimension del dataset, 1436 observaciones y 32 componentes.
biplot(x = pca.model, scale = 0, cex = 0.6, col = c("blue", "brown3"))
  • Grafica Primera y segunda componente

8.2 Varianza explicada de las componentes

pca.model$sdev^2
prop.variance <- pca.model$sdev^2/sum(pca.model$sdev^2)
prop.variance
  • Proporcion respecto al total
prop.cumulative <- cumsum(prop.variance)
prop.cumulative
  • Proporcion de la varianza acumulada
plot(prop.cumulative, type = "b", pch = 20, ylab = "Proporcion de la Varianza Acumulada", 
xlab = "Componente Principal" )
points(prop.cumulative, col = "red")
  • La 1era componente explica el 15%, la 2da el 27% ,etc. Se necesitan 15 componentes principales para explicar el 81%
  • Para explicar el 98% se necesitan 26 variables.
  • Con las primeras 21 componentes principales se pueden exlicar 92% de la varianza acumulada

#9. PCR ##############

9.2 Particion del dataset

set.seed(789)
indices <- createDataPartition(rawdata4$Price, p = 0.8, list = FALSE)
set.seed(789)
pcr.model <- pcr(formula = Price ~ ., data = rawdata4[indices, ], scale = TRUE, ncomp=21)
summary(pcr.model)
  • Se reraliza el ajuste del modelo, se utilizan las primeras 21 componentes principales

9.3 Validacion

pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,], ncomp = 21)
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr <- sqrt(mse.test)
rmsepcr
  • El RMSE es de 1332, se utilizan los datos de test.

9.4 Validacion cruzada

set.seed(678)
pcr.model <- pcr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
summary(pcr.model)
pcr.model.cv <- MSEP(pcr.model, estimate = "CV")
m <- which.min(pcr.model.cv$val)
m
rmsepcr.cv <- sqrt(pcr.model.cv$val[m])
rmsepcr.cv
  • LA validacion cruzada da como resultado un modelo con 31 componentes y un RMSE de 1227
plot(pcr.model.cv$val, main = "MSE vs nro componentes", type = "l",
ylab = "MSE", col = "blue", xlab = "Componentes")
  • Observando el grafico, el menor MSE se da con 31 componentes, que es el numero de predictores, concluimos que no pude reducir la dimensionalidad del dataset original.
pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,])
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv

-Evaluacion del modelo PCR contra los datos de prueba. Calculo del MSE y RMSE correspondiente. - El modelo no es bueno dada la igualdad componentes-predictores pero los datos sirven de referencia para otros metodos.

pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,], ncomp = 26)
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
  • Con 26 componentes el Rmse es de 1324.

10 PLS

set.seed(332)
indices <- createDataPartition(rawdata4$Price, p = 0.8, list = FALSE)
set.seed(332)
pls.model <- plsr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
pls.model.cv <- MSEP(pls.model, estimate = "CV")
which.min(pls.model.cv$val)
plot(pls.model.cv$val, main = "MSE vs N Componentes", type = "l", ylab = "MSE", col = "blue", xlab = "Componentes")
pls.model.pred <- predict(pls.model, newdata = rawdata4[-indices,], ncomp = 5)
mse.test <- mean((pls.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
  • El RMSE para 5 componentes como sugiere pls, es de 1282.

10.1 Analisis con 21 predictores.

10.1.1 Creacion de modelo

set.seed(333)
pls.model <- plsr(formula = Price ~ ., data = rawdata4[indices, ], scale = TRUE, ncomp=21)
summary(pls.model)
  • Se utilizo 21 predictores, que explican el 83%, con PCR explicaba un 92%

10.1.2 Validacion

pls.model.pred <- predict(pls.model, newdata = rawdata4[-indices,], ncomp = 21)
mse.test <- mean((pls.model.pred - rawdata4[-indices,]$Price)^2)
rmsepcr <- sqrt(mse.test)
rmsepcr
which.min(pls.model.cv$val)
  • RMSE 1196 para 21 predictores.

10.1.3 Validacion Cruzada

set.seed(333)
pls.model <- plsr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
summary(pls.model)
pls.model.cv <- MSEP(pls.model, estimate = "CV")
m <- which.min(pls.model.cv$val)
m
rmseplst.cv <- sqrt(mse.test)
rmseplst.cv

11. Conclusiones

metodo <- c("BSS","FW", "BW", "Ridge", "Lasso", "PCR", "PLS")
test_RMSE <- c(rmsebss.cv_min, rmsesfw.cv_min, rmsesbw.cv_min, rmseridge,
              rmselasso, rmsepcr.cv, rmseplst.cv)
resultados <- data.frame(metodo, test_RMSE)
resultados
ggplot(data = resultados, aes(x = reorder(metodo, test_RMSE),
                              y = test_RMSE)) +
  geom_col(width = 0.5) +
  lims(y = c(0, 1600)) +
  geom_text(aes(label = round(test_RMSE, 2)), vjust = -1) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  • El mejor modelo es el resultado del metodo Lasso, ya que posee el menor RMSE.
  • Se oberva una gran diferencia entre PLS, PCR, BW(Los de mas alto RMSE) y los demás metodos.

12. Modelo con Lasso

  • Las variables seleccionadas por Lasso son:

  • Age_08_04
  • KM
  • HP
  • cc
  • Gears
  • Weight
  • Mfr_Guarantee
  • BOVAG_Guarantee
  • Guarantee_Period
  • Airco
  • Automatic_airco
  • CD_Player
  • Central_Lock
  • Powered_Windows
  • Metallic_Rim

12.1 Limpieza

datalasso <- rawdata3
datalasso <- select(datalasso, Price, Age_08_04, KM, HP, cc, Gears, Weight, Mfr_Guarantee, BOVAG_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Central_Lock, Powered_Windows, Metallic_Rim)
dim(datalasso)

12.2 Regresion Lineal

lassoLM <- (lm(formula = Price ~. , data = datalasso))
summary(lassoLM)
  • La variable Central_Lock tiene un t_value que no es significativamente distinto a cero, es probable que no agregue nada al modelo.

  • R cuadrado es de 0.89

hist(lassoLM$residuals) # histograma de los residuos estandarizados 
#qqnorm(lassoLM$residuals) # gráfico de cuantiles de los residuos estandarizados 
#qqline(lassoLM$residuals)
  • Se puede observar simetria en el histograma de residuos.
par(mfrow=c(2,2)) # divide la ventana en una fila y tres columnas 
plot(lassoLM)
  • Residuals vs Fitted: El grafico muestra una distribucion de naturaleza aleatoria. El promedio de la grafica se pierde en cero.

  • QQ: Ciertas muestras no se encuentran sobre la linea, es indicativo de que se puede mejorar el modelo

  • Scale-Location: Muestra la varianza de los residuos

  • Residuals vs Leverage: No se observa ningun punto fiera de la linea punteada, es decir no se observan puntos influyentes.

---
title: "Toyota Corolla - PCA, PCR"
author: "Bruguera Agustin"
output:
    html_notebook:
      df_print: paged
      fig:height: 4
      fig:width: 6
      toc: yes
      toc_float: yes
    html_document:
      df_print: paged
      toc: yes
---

##################
# 1. Carga de librerias y datos.
################## 


## 1.1 Carga de librerias
```{r librerias}
library(leaps)
library(ggplot2)
library(readr)
library(dplyr)
library(fastDummies)
library(glmnet)
library(caret)
library(pls)


```

## 1.2 Carga de datos
```{r datos}

rawdata <- read_csv("ToyotaCorolla.csv")

```

## 1.3 Sumario de datos
```{r sumario de datos}

summary(rawdata)

```

## 1.4 Estructura de datos
```{r estructura de datos}
str(rawdata)
```

## 1.5 Buscando datos faltantes
```{r datos faltantes}

sum(is.na(rawdata))
```

## 1.6 quitando las variables id y model
```{r}

rawdata2 <- rawdata
rawdata2$Id <- NULL
rawdata2$Model <- NULL
```



#######################
# 2. Limpieza de dataset
#######################


## 2.1 Transformar a factores
```{r}

rawdata2$Fuel_Type <- as.factor(rawdata2$Fuel_Type)

#rawdata2$Cylinders <- as.factor(rawdata2$Cylinders)
##(rawdata, aes(x = Price, y = Weight, colour = Fuel_Type))+ geom_point()



```

## 2.2 Filtrado
```{r}

rawdata3 <- rawdata2

rawdata3$cc <- ifelse(rawdata3$cc == 16000, 1600, rawdata3$cc) 
rawdata3 <- filter(rawdata3, rawdata3$Weight < 1200)
rawdata3 <- filter(rawdata3, rawdata3$KM > 1000)
rawdata3 <- filter(rawdata3, rawdata3$KM < 150000)


```

## 2.3 Eliminando variables
```{r}

rawdata3$Mfg_Month <- NULL
rawdata3$Mfg_Year <- NULL
rawdata3$Cylinders <-NULL

```

## 2.4 Estructura de nuevo dataset
```{r}

str(rawdata3)

```

- Luego de realizar la limpieza el datset contiene 1340 observaciones y 32 variables

#################
# 3. Best Subset Selection
#################

```{r}
regfit.full = regsubsets(Price ~., data = rawdata3, nvmax=31)
summary(regfit.full)
```

- Best Subset Selection nos proporciona 31 modelos con sus respectivos predictores

```{r}
reg.summary = summary(regfit.full)
names(reg.summary)

```

## 3.1 RSS
```{r}
which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "l")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
```

- El modelo con RSS minimo es 31.


## 3.2 ADJR2
```{r}

which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="Adjr2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)
```

- El modelo con ADJR2 maximo es 21.


## 3.3 CP
```{r}
which.min(reg.summary$cp)

plot(reg.summary$cp, xlab="Numero de variables", ylab = "CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col= "red", cex=2, pch=20)

```

- El modelo con CP minimo es 21.

## 3.4 BIC
```{r}
which.min(reg.summary$bic)

plot(reg.summary$bic, xlab="Numero de variables", ylab = "CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col= "red", cex=2, pch=20)

```

- El modelo con bic minimo es 15.

## 3.5 Conclusiones
```{r}
reg.summary$adjr2[6]
```
```{r}
reg.summary$adjr2[8]
```
```{r}
reg.summary$adjr2[10]
```
```{r}
reg.summary$adjr2[15]
```
```{r}
reg.summary$adjr2[21]
```

- A juzgar por los graficos pareciera que el numero adecuado de predictores es 15, sin embargo vemos que de un modelo de 8 preditores a uno de 10, 15 o 21 el R2 ajustado varia muy poco. Por lo tanto se opta por un modelo de 8 predictores.


## 3.5 Validacion Best Subset Selection

```{r}
index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
```

- Se divide el dataset en dos particiones, una de entrenamiento y otra de validacion.

```{r}
set.seed(7)
dim(rawdata3)
```

```{r}
regfit.train <- regsubsets(Price ~., data = data.train, nvmax = 31, really.big = TRUE)
#summary(regfit.train)
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(regfit.train, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
d<-which.min(val.errors)

rmse <- sqrt(val.errors)

```

```{r}
value.min <- which.min(rmse)
value.min
```

```{r}
rmsebss_min <- rmse[value.min]
rmsebss_min
```

- El modelo con menor RMSE es el de 20 predictores con un rmse de 1018

```{r}

plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE Best Subset", pch = 20, type = "b")
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
```

```{r}
options(scipen = 999)
coef(regfit.train, which.min(rmse))
```
- Coeficientes del modelo con menor RMSE

```{r}
set.seed(7)
rmse[8]
```

```{r}
rmse[10]
```
```{r}
rmse[15]
```
```{r}
rmse[20]
```


## 3.6 Validacion Cruzada Best Subset

```{r}

predict.regsubsets  <- function(object, newdata, id, ...){
    # Extraer la fórmula del modelo (variable dependiente ~ predictores)
    form <- as.formula(object$call[[2]])
    # Generar una matriz modelo con los nuevos datos y la fórmula
    mat <- model.matrix(form, newdata)
    # Extraer los coeficientes del modelo
    coefi <- coef(object , id = id)
    # Almacenar el nombre de las variables predictoras del modelo
    xvars <- names(coefi)
    # Producto matricial entre los coeficientes del modelo y los valores de
    # los predictores de las nuevas observaciones para obtener las 
    # predicciones
    mat[ , xvars] %*% coefi
}
```
- Funcion para predecir

```{r}
set.seed(123)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)

```

```{r}
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
  regfit.val <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31,  really.big = TRUE)
   for(i in 1:31) {
    test <- rawdata3[folds == k,]
    predictions <- predict.regsubsets(object = regfit.val, newdata = rawdata3[folds == k,] , id = i)
    cv.errors[k,i] <- mean((test$Price - predictions)^2)
  }
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
rmse.cv
```

```{r}

set.seed(123)
value.min <- which.min(rmse.cv)
value.min
```

```{r}
rmsebss.cv_min <- rmse.cv[value.min]
rmsebss.cv_min
```

- El modelo con menor RMSE es el de 21 predictores con un rmse de 1064
```{r}
set.seed(123)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
value.min <- which.min(rmse.cv)
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
```

```{r}
coef(regfit.val, which.min(rmse.cv))
```

- Coeficientes del modelo con menor RMSE

############
# 4. Stepwise Selection: Forward Model
############

```{r}
regfitfwd.full = regsubsets(Price ~., data = rawdata3, nvmax=31, method="forward")
summary(regfitfwd.full)
```

- Stepwise Forward , 31 modelos

```{r}
reg.summary = summary(regfitfwd.full)
names(reg.summary)
```

## 4.1 RSS
```{r}
which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "b")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
```

- El modelo con menor rss posee 31 predictores

## 4.2 ADJR2
```{r}
which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="ADJR2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)

```

- El modelo con mayor R2 ajustado posee 21 predictores

## 4.3 CP

```{r}
which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col="red", cex=2, pch=20)


```

- El modelo con menor CP posee 21 predictores


## 4.4 bic

```{r}
which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col="red", cex=2, pch=20)


```

- El modelo con menor bic posee 15 predictores


## 4.5 Conclusiones
```{r}
reg.summary$adjr2[6]
```
```{r}
reg.summary$adjr2[8]
```
```{r}
reg.summary$adjr2[10]
```
```{r}
reg.summary$adjr2[15]
```
```{r}
reg.summary$adjr2[21]
```

- A juzgar por los graficos pareciera que el numero adecuado de predictores es 15, el R2 ajustado varia muy poco al aumentar los predictores.


## 4.6 Validación Stepwise Selection Forward

```{r}

index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
```
- Particion del dataset en entrenamiento y validacion

```{r}
set.seed(10)
dim(rawdata3)
```

```{r}
model.fwd <- regsubsets(Price ~., data = data.train, nvmax = 31, method = "forward", really.big = TRUE)
summary(model.fwd)
```

```{r}
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(model.fwd, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
rmse <- sqrt(val.errors)
```


```{r}
set.seed(123)
value.min <- which.min(rmse)
value.min
```

```{r}
rmsesfw_min <- rmse[value.min]
rmsesfw_min
```

- La validacion arroja que el modelo con menor RMSE es el de 20 variables con un RMSE de 1035


```{r}
set.seed(123)
rmse[15]
```

```{r}
set.seed(123)
rmse[10]
```

- La diferencia de rmse del modelo con 15 predictores y el modelo con 20 predictores es minima. Se puede optar por el modelo de 15 predictores.

```{r}
set.seed(123)
plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE Best Subset", pch = 20, type = "b")
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
```

```{r}
coef(model.fwd, which.min(rmse))
```


## 4.7 Validacion cruzada Stepwise Forward

```{r}
set.seed(14)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
```

```{r}
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
  model.fwd <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31, method = "forward", really.big = TRUE)
   for(i in 1:31) {
    test <- rawdata3[folds == k,]
    predictions <- predict.regsubsets(object = model.fwd, newdata = rawdata3[folds == k,] , id = i)
    cv.errors[k,i] <- mean((test$Price - predictions)^2)
  }
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
```


```{r}
set.seed(14)
value.min <- which.min(rmse.cv)
value.min
```

```{r}
rmsesfw.cv_min <- rmse.cv[value.min]
rmsesfw.cv_min
```

- El modelo con rmse minimo es el de 22 preditores con un valor de 1057

```{r}
set.seed(14)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
value.min <- which.min(rmse.cv)
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
```

```{r}
coef(model.fwd, which.min(rmse.cv))
```

- coeficientes del modelo con 22 predictores.


############
# 5. Stepwise Selection: Backward Model
############

```{r}
regfitbwd.full = regsubsets(Price ~., data = rawdata3, nvmax=31, method="backward")
summary(regfit.full)
```

- Stepwise backward , 31 modelos

```{r}
reg.summary = summary(regfitbwd.full)
names(reg.summary)
```

## 5.1 RSS
```{r}
which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "b")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
```

- El modelo con menor rss posee 31 predictores

## 5.2 ADJR2
```{r}
which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="ADJR2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)

```

- El modelo con mayor R2 ajustado posee 21 predictores

## 5.3 CP

```{r}
which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col="red", cex=2, pch=20)


```

- El modelo con menor CP posee 21 predictores


## 5.4 bic

```{r}
which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col="red", cex=2, pch=20)


```

- El modelo con menor bic posee 15 predictores


## 5.5 Conclusiones
```{r}
reg.summary$adjr2[6]
```
```{r}
reg.summary$adjr2[8]
```
```{r}
reg.summary$adjr2[10]
```
```{r}
reg.summary$adjr2[15]
```
```{r}
reg.summary$adjr2[21]
```

- A juzgar por los graficos pareciera que el numero adecuado de predictores es 15, el R2 ajustado varia muy poco al aumentar los predictores.


## 5.6 Validación Stepwise Selection Backward

```{r}

index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
```
- Particion del dataset en entrenamiento y validacion

```{r}
set.seed(191)
dim(rawdata3)
```

```{r}
model.bwd <- regsubsets(Price ~., data = data.train, nvmax = 31, method = "backward", really.big = TRUE)
summary(model.bwd)
```

```{r}
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(model.bwd, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
rmse <- sqrt(val.errors)
```


```{r}
set.seed(191)
value.min <- which.min(rmse)
value.min
```

```{r}
rmsesbw_min <- rmse[value.min]
rmsesbw_min
```

- La validacion arroja que el modelo con menor RMSE es el de 16 variables con un RMSE de 1042

```{r}
set.seed(191)
rmse[20]
```
```{r}
set.seed(191)
rmse[15]
```

```{r}
set.seed(191)
rmse[10]
```

- La diferencia de rmse del modelo con 16 predictores y el modelo con 20 predictores es minima.

```{r}
set.seed(191)
plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE", pch = 20, type = "b")
value.min <- which.min(rmse)
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
```

```{r}
coef(model.fwd, which.min(rmse))
```


## 5.7 Validacion cruzada Stepwise Backward

```{r}
set.seed(141)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
```

```{r}
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
  model.bwd <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31, method = "backward", really.big = TRUE)
   for(i in 1:31) {
    test <- rawdata3[folds == k,]
    predictions <- predict.regsubsets(object = model.bwd, newdata = rawdata3[folds == k,] , id = i)
    cv.errors[k,i] <- mean((test$Price - predictions)^2)
  }
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
```


```{r}
set.seed(141)
value.min <- which.min(rmse.cv)
value.min
```

```{r}
rmsesbw.cv_min <- rmse.cv[value.min]
rmsesbw.cv_min
```

- El modelo con rmse minimo es el de 20 preditores con un valor de 1059

```{r}
set.seed(141)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
```

```{r}
coef(model.bwd, which.min(rmse.cv))
```

- coeficientes del modelo con 20 predictores.



###############
# 6. Ridge
###############

## 6.1 Estandarizacion de valores

```{r}

x <- model.matrix(Price ~. , rawdata3)[,-1]
y <- rawdata3$Price
grid <- 10^seq(10,-2,length = 100)
ridge.model <- glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.model))
```


```{r}
coef(ridge.model)
```


```{r}
plot(ridge.model, xvar ='lambda',label = TRUE)
```


- A medida que lambda crece los coeficientes tienden a cero.


```{r}
options(scipen=999)
predict(ridge.model, s = 10, type = "coefficients")
```


## 6.2 Validacion Cruzada
```{r}
set.seed(23)
indices <- sample(c(TRUE,FALSE), nrow(rawdata3), replace = TRUE)
y.test <- y[-indices]

ridge.model.cv <- cv.glmnet(x[indices,], y[indices], alpha = 0) # Por defecto, 10-fold
coef(ridge.model.cv)
```

```{r}
plot(ridge.model.cv)
```

- Las lineas horizontales representan, la primera, el minimo de la validacion cruzada, y la segunda, el error estandar.

```{r}
set.seed(23)
bestlam <- ridge.model.cv$lambda.min
bestlam
```

- Calculamos el menor lambda para aplicarlo a la validacion cruzada
- Menor lambda 285.38

## 6.3

```{r}
set.seed(23)
ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
mse <-mean((ridge.pred - y[-indices])^2)
rmse <-sqrt(mse)
value.min <- which.min(rmse)
value.min
```

```{r}
rmseridge <- rmse[value.min]
rmseridge
```

- El modelo obtenido con Ridge Regression utilizando el lambda minimo nos da un RMSE de 1056.6

```{r}
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)
```

- Ninguno de los coeﬁcientes es exactamente igual a cero, ya que Ridge no realiza seleccion de variables.

```{r}
plot(out, xvar ='lambda',label = TRUE)

```

- Los coeficientes se ajustan mas a cero


###############
# 7. Lasso
###############


```{r}
lasso.model <- glmnet(x, y, alpha = 1)
plot(lasso.model, xvar ='lambda',label = TRUE)
```

## 7.1 Validacion Cruzada

```{r}
indices <- sample(c(TRUE,FALSE), nrow(rawdata3), replace = TRUE)
cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)
plot(cv.out)
```

- el valor minimo de la validacion cruzada, es de tamaño 32, mientras que el error estandar es de tamaño 15. 

```{r}
coef(cv.out)
```

- La validacion cruzada no toma en cuenta 17 variables, el modelo queda con 15 variables.

```{r}
bestlam <- cv.out$lambda.min
bestlam
```

- Lambda minimo es 0.79


## 7.2 Validacion

```{r}
lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
mse <-mean((lasso.pred - y[-indices])^2)
rmse <-sqrt(mse)
value.min <- which.min(rmse)
rmselasso <- rmse[value.min]
rmselasso

```

-El valor de RMSE es de 1025 

```{r}
options(scipen=999)
out <- glmnet(x, y, alpha = 1)
lasso.coef <- predict(out, type = "coefficients", s = bestlam)
lasso.coef
```

```{r}
plot(out, xvar='lambda', label= TRUE)
```

- Se reemplaza el valor arbitrario con el mejor lambda obtenido


###############
# 8. PCA
###############


```{r}

rawdata4 <- rawdata2
rawdata4 <- dummy_cols(rawdata4, select_columns = "Fuel_Type", remove_first_dummy = TRUE)
rawdata4 <- select(rawdata4, - Fuel_Type, - Mfg_Year, - Mfg_Month, - Cylinders)

```

```{r}
apply(X = rawdata4, MARGIN = 2, FUN = mean)

```

- La media de las variables, vemos que existe mucha diferencia entre algunas medias como la de Price o KM. Es necesario una estandarizacion. LA media debe ser cercana a cero.


```{r}
apply(X = rawdata4, MARGIN = 2, FUN = var)
```

- La varianza es la desviacion estandar al cuadrado, analizo la varianza ya que la desviacion estandar puede valores negativos. Se puede observar que algunas variables poseen varianzas dispares. La varianza debe ser cercana a uno.


## 8.1 Modelo PCA
```{r}
pca.model <- prcomp(rawdata4[,-1], scale = TRUE)
names(pca.model) 
```


```{r}
pca.model$center
```

- Representa la media de las variables previa a la estandarizacion

```{r}
pca.model$scale
```

- Representa la desviacion de las variables previa a la estandarizacion 

```{r}
pca.model$rotation
```

- Cada variable de las componentes tienen asignado un peso, las de mayor peso son las que explican dicha componente.

```{r}
dim(pca.model$x)
```

- Dimension del dataset, 1436 observaciones y 32 componentes.

```{r}
biplot(x = pca.model, scale = 0, cex = 0.6, col = c("blue", "brown3"))
```

- Grafica Primera y segunda componente


## 8.2 Varianza explicada de las componentes
```{r}
pca.model$sdev^2
```


```{r}
prop.variance <- pca.model$sdev^2/sum(pca.model$sdev^2)
prop.variance
```

- Proporcion respecto al total


```{r}
prop.cumulative <- cumsum(prop.variance)
prop.cumulative
```

- Proporcion de la varianza acumulada

```{r}
plot(prop.cumulative, type = "b", pch = 20, ylab = "Proporcion de la Varianza Acumulada", 
xlab = "Componente Principal" )
points(prop.cumulative, col = "red")

```

- La 1era componente explica el 15%, la  2da el 27% ,etc. Se necesitan 15 componentes principales para explicar el 81%
- Para explicar el 98% se necesitan 26 variables.
- Con las primeras 21 componentes principales se pueden exlicar 92% de la varianza acumulada

##############
#9. PCR
##############


## 9.2 Particion del dataset
```{r}
set.seed(789)
indices <- createDataPartition(rawdata4$Price, p = 0.8, list = FALSE)
```

```{r}
set.seed(789)
pcr.model <- pcr(formula = Price ~ ., data = rawdata4[indices, ], scale = TRUE, ncomp=21)
summary(pcr.model)
```

- Se reraliza el ajuste del modelo, se utilizan las primeras 21 componentes principales

## 9.3 Validacion

```{r}
pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,], ncomp = 21)
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr <- sqrt(mse.test)
rmsepcr
```

- El RMSE es de 1332, se utilizan los datos de test.


## 9.4 Validacion cruzada

```{r}
set.seed(678)
pcr.model <- pcr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
summary(pcr.model)
```

```{r}
pcr.model.cv <- MSEP(pcr.model, estimate = "CV")
m <- which.min(pcr.model.cv$val)
m
```

```{r}
rmsepcr.cv <- sqrt(pcr.model.cv$val[m])
rmsepcr.cv
```

- LA validacion cruzada da como resultado un modelo con 31 componentes y un RMSE de 1227


```{r}
plot(pcr.model.cv$val, main = "MSE vs nro componentes", type = "l",
ylab = "MSE", col = "blue", xlab = "Componentes")
```

- Observando el grafico, el menor MSE se da con 31 componentes, que es el numero de predictores, concluimos que no pude reducir la dimensionalidad del dataset original.


```{r}
pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,])
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
```

-Evaluacion del modelo PCR contra los datos de prueba. Calculo del MSE y RMSE correspondiente.
- El modelo no es bueno dada la igualdad componentes-predictores pero los datos sirven de referencia para otros metodos.

```{r}
pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,], ncomp = 26)
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
```

- Con 26 componentes el Rmse es de 1324.



###############
# 10 PLS
###############


```{r}
set.seed(332)
indices <- createDataPartition(rawdata4$Price, p = 0.8, list = FALSE)
```

```{r}
set.seed(332)
pls.model <- plsr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
pls.model.cv <- MSEP(pls.model, estimate = "CV")
which.min(pls.model.cv$val)

```

```{r}
plot(pls.model.cv$val, main = "MSE vs N Componentes", type = "l", ylab = "MSE", col = "blue", xlab = "Componentes")
```

```{r}
pls.model.pred <- predict(pls.model, newdata = rawdata4[-indices,], ncomp = 5)
mse.test <- mean((pls.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
```

- El RMSE para 5 componentes como sugiere pls, es de 1282.



## 10.1 Analisis con 21 predictores.


### 10.1.1 Creacion de modelo
```{r}
set.seed(333)
pls.model <- plsr(formula = Price ~ ., data = rawdata4[indices, ], scale = TRUE, ncomp=21)
summary(pls.model)
```

- Se utilizo 21 predictores, que explican el 83%, con PCR explicaba un 92%

### 10.1.2 Validacion

```{r}
pls.model.pred <- predict(pls.model, newdata = rawdata4[-indices,], ncomp = 21)
mse.test <- mean((pls.model.pred - rawdata4[-indices,]$Price)^2)
rmsepcr <- sqrt(mse.test)
rmsepcr
which.min(pls.model.cv$val)
```

- RMSE 1196 para 21 predictores. 


### 10.1.3 Validacion Cruzada
```{r}
set.seed(333)
pls.model <- plsr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
summary(pls.model)
```

```{r}
pls.model.cv <- MSEP(pls.model, estimate = "CV")
m <- which.min(pls.model.cv$val)
m
```

```{r}
rmseplst.cv <- sqrt(mse.test)
rmseplst.cv
```


############
# 11. Conclusiones
############

```{r}
metodo <- c("BSS","FW", "BW", "Ridge", "Lasso", "PCR", "PLS")
test_RMSE <- c(rmsebss.cv_min, rmsesfw.cv_min, rmsesbw.cv_min, rmseridge,
              rmselasso, rmsepcr.cv, rmseplst.cv)
resultados <- data.frame(metodo, test_RMSE)
resultados
```


```{r}
ggplot(data = resultados, aes(x = reorder(metodo, test_RMSE),
                              y = test_RMSE)) +
  geom_col(width = 0.5) +
  lims(y = c(0, 1600)) +
  geom_text(aes(label = round(test_RMSE, 2)), vjust = -1) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```


- El mejor modelo es el resultado del metodo Lasso, ya que posee el menor RMSE.
- Se oberva una gran diferencia entre PLS, PCR, BW(Los de mas alto RMSE) y los demás metodos.



########
# 12. Modelo con Lasso
########

- Las variables seleccionadas por Lasso son:

- Age_08_04         
- KM                  
- HP                  
- cc                  
- Gears              
- Weight              
- Mfr_Guarantee      
- BOVAG_Guarantee     
- Guarantee_Period    
- Airco               
- Automatic_airco   
- CD_Player           
- Central_Lock       
- Powered_Windows    
- Metallic_Rim       


## 12.1 Limpieza
```{r}
datalasso <- rawdata3
datalasso <- select(datalasso, Price, Age_08_04, KM, HP, cc, Gears, Weight, Mfr_Guarantee, BOVAG_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Central_Lock, Powered_Windows, Metallic_Rim)

```

```{r}
dim(datalasso)
```

## 12.2 Regresion Lineal

```{r}
lassoLM <- (lm(formula = Price ~. , data = datalasso))
summary(lassoLM)
```

- La variable Central_Lock tiene un t_value que no es significativamente distinto a cero, es probable que no agregue nada al modelo.

- R cuadrado es de 0.89

```{r}
hist(lassoLM$residuals) # histograma de los residuos estandarizados 
#qqnorm(lassoLM$residuals) # gráfico de cuantiles de los residuos estandarizados 
#qqline(lassoLM$residuals)
```


- Se puede observar simetria en el histograma de residuos.

```{r}
par(mfrow=c(2,2)) # divide la ventana en una fila y tres columnas 
plot(lassoLM)
```

- Residuals vs Fitted: El grafico muestra una distribucion de naturaleza aleatoria. El promedio de la grafica se pierde en cero.

- QQ: Ciertas muestras no se encuentran sobre la linea, es indicativo de que se puede mejorar el modelo

- Scale-Location: Muestra la varianza de los residuos

- Residuals vs Leverage: No se observa ningun punto fiera de la linea punteada, es decir no se observan puntos influyentes.