library(leaps)
library(ggplot2)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
library(readr)
library(dplyr)
Attaching package: 㤼㸱dplyr㤼㸲
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
library(fastDummies)
library(glmnet)
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-18
library(caret)
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
rawdata <- read_csv("ToyotaCorolla.csv")
Parsed with column specification:
cols(
.default = col_double(),
Model = [31mcol_character()[39m,
Fuel_Type = [31mcol_character()[39m
)
See spec(...) for full column specifications.
summary(rawdata)
Id Model Price Age_08_04 Mfg_Month Mfg_Year KM
Min. : 1.0 Length:1436 Min. : 4350 Min. : 1.00 Min. : 1.000 Min. :1998 Min. : 1
1st Qu.: 361.8 Class :character 1st Qu.: 8450 1st Qu.:44.00 1st Qu.: 3.000 1st Qu.:1998 1st Qu.: 43000
Median : 721.5 Mode :character Median : 9900 Median :61.00 Median : 5.000 Median :1999 Median : 63390
Mean : 721.6 Mean :10731 Mean :55.95 Mean : 5.549 Mean :2000 Mean : 68533
3rd Qu.:1081.2 3rd Qu.:11950 3rd Qu.:70.00 3rd Qu.: 8.000 3rd Qu.:2001 3rd Qu.: 87021
Max. :1442.0 Max. :32500 Max. :80.00 Max. :12.000 Max. :2004 Max. :243000
Fuel_Type HP Met_Color Automatic cc Doors Cylinders
Length:1436 Min. : 69.0 Min. :0.0000 Min. :0.00000 Min. : 1300 Min. :2.000 Min. :4
Class :character 1st Qu.: 90.0 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 1400 1st Qu.:3.000 1st Qu.:4
Mode :character Median :110.0 Median :1.0000 Median :0.00000 Median : 1600 Median :4.000 Median :4
Mean :101.5 Mean :0.6748 Mean :0.05571 Mean : 1577 Mean :4.033 Mean :4
3rd Qu.:110.0 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 1600 3rd Qu.:5.000 3rd Qu.:4
Max. :192.0 Max. :1.0000 Max. :1.00000 Max. :16000 Max. :5.000 Max. :4
Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS
Min. :3.000 Min. : 19.00 Min. :1000 Min. :0.0000 Min. :0.0000 Min. : 3.000 Min. :0.0000
1st Qu.:5.000 1st Qu.: 69.00 1st Qu.:1040 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.: 3.000 1st Qu.:1.0000
Median :5.000 Median : 85.00 Median :1070 Median :0.0000 Median :1.0000 Median : 3.000 Median :1.0000
Mean :5.026 Mean : 87.12 Mean :1072 Mean :0.4095 Mean :0.8955 Mean : 3.815 Mean :0.8134
3rd Qu.:5.000 3rd Qu.: 85.00 3rd Qu.:1085 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.:1.0000
Max. :6.000 Max. :283.00 Max. :1615 Max. :1.0000 Max. :1.0000 Max. :36.000 Max. :1.0000
Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :1.0000 Median :1.0000 Median :1.0000 Median :0.00000 Median :0.0000 Median :0.0000
Mean :0.9708 Mean :0.7228 Mean :0.5084 Mean :0.05641 Mean :0.2946 Mean :0.2187
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
Central_Lock Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
Median :1.0000 Median :1.000 Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
Mean :0.5801 Mean :0.562 Mean :0.9777 Mean :0.1462 Mean :0.257 Mean :0.3001 Mean :0.7702
3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
Metallic_Rim Radio_cassette Tow_Bar
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.2047 Mean :0.1455 Mean :0.2779
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
str(rawdata)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1436 obs. of 37 variables:
$ Id : num 1 2 3 4 5 6 7 8 9 10 ...
$ Model : chr "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "?TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" ...
$ Price : num 13500 13750 13950 14950 13750 ...
$ Age_08_04 : num 23 23 24 26 30 32 27 30 27 23 ...
$ Mfg_Month : num 10 10 9 7 3 1 6 3 6 10 ...
$ Mfg_Year : num 2002 2002 2002 2002 2002 ...
$ KM : num 46986 72937 41711 48000 38500 ...
$ Fuel_Type : chr "Diesel" "Diesel" "Diesel" "Diesel" ...
$ HP : num 90 90 90 90 90 90 90 90 192 69 ...
$ Met_Color : num 1 1 1 0 0 0 1 1 0 0 ...
$ Automatic : num 0 0 0 0 0 0 0 0 0 0 ...
$ cc : num 2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
$ Doors : num 3 3 3 3 3 3 3 3 3 3 ...
$ Cylinders : num 4 4 4 4 4 4 4 4 4 4 ...
$ Gears : num 5 5 5 5 5 5 5 5 5 5 ...
$ Quarterly_Tax : num 210 210 210 210 210 210 210 210 100 185 ...
$ Weight : num 1165 1165 1165 1165 1170 ...
$ Mfr_Guarantee : num 0 0 1 1 1 0 0 1 0 0 ...
$ BOVAG_Guarantee : num 1 1 1 1 1 1 1 1 1 1 ...
$ Guarantee_Period: num 3 3 3 3 3 3 3 3 3 3 ...
$ ABS : num 1 1 1 1 1 1 1 1 1 1 ...
$ Airbag_1 : num 1 1 1 1 1 1 1 1 1 1 ...
$ Airbag_2 : num 1 1 1 1 1 1 1 1 0 1 ...
$ Airco : num 0 1 0 0 1 1 1 1 1 1 ...
$ Automatic_airco : num 0 0 0 0 0 0 0 0 0 0 ...
$ Boardcomputer : num 1 1 1 1 1 1 1 1 0 1 ...
$ CD_Player : num 0 1 0 0 0 0 0 1 0 0 ...
$ Central_Lock : num 1 1 0 0 1 1 1 1 1 0 ...
$ Powered_Windows : num 1 0 0 0 1 1 1 1 1 0 ...
$ Power_Steering : num 1 1 1 1 1 1 1 1 1 1 ...
$ Radio : num 0 0 0 0 0 0 0 0 1 0 ...
$ Mistlamps : num 0 0 0 0 1 1 0 0 0 0 ...
$ Sport_Model : num 0 0 0 0 0 0 1 0 0 0 ...
$ Backseat_Divider: num 1 1 1 1 1 1 1 1 0 1 ...
$ Metallic_Rim : num 0 0 0 0 0 0 0 0 1 0 ...
$ Radio_cassette : num 0 0 0 0 0 0 0 0 1 0 ...
$ Tow_Bar : num 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "spec")=
.. cols(
.. Id = [32mcol_double()[39m,
.. Model = [31mcol_character()[39m,
.. Price = [32mcol_double()[39m,
.. Age_08_04 = [32mcol_double()[39m,
.. Mfg_Month = [32mcol_double()[39m,
.. Mfg_Year = [32mcol_double()[39m,
.. KM = [32mcol_double()[39m,
.. Fuel_Type = [31mcol_character()[39m,
.. HP = [32mcol_double()[39m,
.. Met_Color = [32mcol_double()[39m,
.. Automatic = [32mcol_double()[39m,
.. cc = [32mcol_double()[39m,
.. Doors = [32mcol_double()[39m,
.. Cylinders = [32mcol_double()[39m,
.. Gears = [32mcol_double()[39m,
.. Quarterly_Tax = [32mcol_double()[39m,
.. Weight = [32mcol_double()[39m,
.. Mfr_Guarantee = [32mcol_double()[39m,
.. BOVAG_Guarantee = [32mcol_double()[39m,
.. Guarantee_Period = [32mcol_double()[39m,
.. ABS = [32mcol_double()[39m,
.. Airbag_1 = [32mcol_double()[39m,
.. Airbag_2 = [32mcol_double()[39m,
.. Airco = [32mcol_double()[39m,
.. Automatic_airco = [32mcol_double()[39m,
.. Boardcomputer = [32mcol_double()[39m,
.. CD_Player = [32mcol_double()[39m,
.. Central_Lock = [32mcol_double()[39m,
.. Powered_Windows = [32mcol_double()[39m,
.. Power_Steering = [32mcol_double()[39m,
.. Radio = [32mcol_double()[39m,
.. Mistlamps = [32mcol_double()[39m,
.. Sport_Model = [32mcol_double()[39m,
.. Backseat_Divider = [32mcol_double()[39m,
.. Metallic_Rim = [32mcol_double()[39m,
.. Radio_cassette = [32mcol_double()[39m,
.. Tow_Bar = [32mcol_double()[39m
.. )
sum(is.na(rawdata))
[1] 0
rawdata2 <- rawdata
rawdata2$Id <- NULL
rawdata2$Model <- NULL
rawdata2$Fuel_Type <- as.factor(rawdata2$Fuel_Type)
#rawdata2$Cylinders <- as.factor(rawdata2$Cylinders)
##(rawdata, aes(x = Price, y = Weight, colour = Fuel_Type))+ geom_point()
rawdata3 <- rawdata2
rawdata3$cc <- ifelse(rawdata3$cc == 16000, 1600, rawdata3$cc)
rawdata3 <- filter(rawdata3, rawdata3$Weight < 1200)
rawdata3 <- filter(rawdata3, rawdata3$KM > 1000)
rawdata3 <- filter(rawdata3, rawdata3$KM < 150000)
rawdata3$Mfg_Month <- NULL
rawdata3$Mfg_Year <- NULL
rawdata3$Cylinders <-NULL
str(rawdata3)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1340 obs. of 32 variables:
$ Price : num 13500 13750 13950 14950 13750 ...
$ Age_08_04 : num 23 23 24 26 30 32 27 23 25 22 ...
$ KM : num 46986 72937 41711 48000 38500 ...
$ Fuel_Type : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 3 2 3 3 ...
$ HP : num 90 90 90 90 90 90 192 69 192 192 ...
$ Met_Color : num 1 1 1 0 0 0 0 0 0 0 ...
$ Automatic : num 0 0 0 0 0 0 0 0 0 0 ...
$ cc : num 2000 2000 2000 2000 2000 2000 1800 1900 1800 1800 ...
$ Doors : num 3 3 3 3 3 3 3 3 3 3 ...
$ Gears : num 5 5 5 5 5 5 5 5 6 6 ...
$ Quarterly_Tax : num 210 210 210 210 210 210 100 185 100 100 ...
$ Weight : num 1165 1165 1165 1165 1170 ...
$ Mfr_Guarantee : num 0 0 1 1 1 0 0 0 1 1 ...
$ BOVAG_Guarantee : num 1 1 1 1 1 1 1 1 1 1 ...
$ Guarantee_Period: num 3 3 3 3 3 3 3 3 12 3 ...
$ ABS : num 1 1 1 1 1 1 1 1 1 1 ...
$ Airbag_1 : num 1 1 1 1 1 1 1 1 1 1 ...
$ Airbag_2 : num 1 1 1 1 1 1 0 1 1 1 ...
$ Airco : num 0 1 0 0 1 1 1 1 1 1 ...
$ Automatic_airco : num 0 0 0 0 0 0 0 0 1 1 ...
$ Boardcomputer : num 1 1 1 1 1 1 0 1 0 1 ...
$ CD_Player : num 0 1 0 0 0 0 0 0 1 0 ...
$ Central_Lock : num 1 1 0 0 1 1 1 0 1 1 ...
$ Powered_Windows : num 1 0 0 0 1 1 1 0 1 1 ...
$ Power_Steering : num 1 1 1 1 1 1 1 1 1 1 ...
$ Radio : num 0 0 0 0 0 0 1 0 0 0 ...
$ Mistlamps : num 0 0 0 0 1 1 0 0 0 1 ...
$ Sport_Model : num 0 0 0 0 0 0 0 0 0 1 ...
$ Backseat_Divider: num 1 1 1 1 1 1 0 1 0 1 ...
$ Metallic_Rim : num 0 0 0 0 0 0 1 0 1 1 ...
$ Radio_cassette : num 0 0 0 0 0 0 1 0 0 0 ...
$ Tow_Bar : num 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "spec")=
.. cols(
.. Id = [32mcol_double()[39m,
.. Model = [31mcol_character()[39m,
.. Price = [32mcol_double()[39m,
.. Age_08_04 = [32mcol_double()[39m,
.. Mfg_Month = [32mcol_double()[39m,
.. Mfg_Year = [32mcol_double()[39m,
.. KM = [32mcol_double()[39m,
.. Fuel_Type = [31mcol_character()[39m,
.. HP = [32mcol_double()[39m,
.. Met_Color = [32mcol_double()[39m,
.. Automatic = [32mcol_double()[39m,
.. cc = [32mcol_double()[39m,
.. Doors = [32mcol_double()[39m,
.. Cylinders = [32mcol_double()[39m,
.. Gears = [32mcol_double()[39m,
.. Quarterly_Tax = [32mcol_double()[39m,
.. Weight = [32mcol_double()[39m,
.. Mfr_Guarantee = [32mcol_double()[39m,
.. BOVAG_Guarantee = [32mcol_double()[39m,
.. Guarantee_Period = [32mcol_double()[39m,
.. ABS = [32mcol_double()[39m,
.. Airbag_1 = [32mcol_double()[39m,
.. Airbag_2 = [32mcol_double()[39m,
.. Airco = [32mcol_double()[39m,
.. Automatic_airco = [32mcol_double()[39m,
.. Boardcomputer = [32mcol_double()[39m,
.. CD_Player = [32mcol_double()[39m,
.. Central_Lock = [32mcol_double()[39m,
.. Powered_Windows = [32mcol_double()[39m,
.. Power_Steering = [32mcol_double()[39m,
.. Radio = [32mcol_double()[39m,
.. Mistlamps = [32mcol_double()[39m,
.. Sport_Model = [32mcol_double()[39m,
.. Backseat_Divider = [32mcol_double()[39m,
.. Metallic_Rim = [32mcol_double()[39m,
.. Radio_cassette = [32mcol_double()[39m,
.. Tow_Bar = [32mcol_double()[39m
.. )
regfit.full = regsubsets(Price ~., data = rawdata3, nvmax=31)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Price ~ ., data = rawdata3, nvmax = 31)
32 Variables (and intercept)
Forced in Forced out
Age_08_04 FALSE FALSE
KM FALSE FALSE
Fuel_TypeDiesel FALSE FALSE
Fuel_TypePetrol FALSE FALSE
HP FALSE FALSE
Met_Color FALSE FALSE
Automatic FALSE FALSE
cc FALSE FALSE
Doors FALSE FALSE
Gears FALSE FALSE
Quarterly_Tax FALSE FALSE
Weight FALSE FALSE
Mfr_Guarantee FALSE FALSE
BOVAG_Guarantee FALSE FALSE
Guarantee_Period FALSE FALSE
ABS FALSE FALSE
Airbag_1 FALSE FALSE
Airbag_2 FALSE FALSE
Airco FALSE FALSE
Automatic_airco FALSE FALSE
Boardcomputer FALSE FALSE
CD_Player FALSE FALSE
Central_Lock FALSE FALSE
Powered_Windows FALSE FALSE
Power_Steering FALSE FALSE
Radio FALSE FALSE
Mistlamps FALSE FALSE
Sport_Model FALSE FALSE
Backseat_Divider FALSE FALSE
Metallic_Rim FALSE FALSE
Radio_cassette FALSE FALSE
Tow_Bar FALSE FALSE
1 subsets of each size up to 31
Selection Algorithm: exhaustive
Age_08_04 KM Fuel_TypeDiesel Fuel_TypePetrol HP Met_Color Automatic cc Doors Gears Quarterly_Tax Weight
1 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " "
3 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " "*"
4 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" "*" " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " "*" " " " " "*" " " " " " " "*"
7 ( 1 ) "*" "*" " " " " "*" " " " " "*" " " " " " " "*"
8 ( 1 ) "*" "*" " " "*" " " " " " " " " " " " " "*" "*"
9 ( 1 ) "*" "*" " " " " "*" " " " " "*" " " " " " " "*"
10 ( 1 ) "*" "*" " " " " "*" " " " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " "*" " " " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " " " "*" " " " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
20 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
21 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
22 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
23 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" " " "*" "*"
24 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
25 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
26 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
27 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
28 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
29 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
31 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer
1 ( 1 ) " " " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " "*" " "
3 ( 1 ) " " " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " "*" " "
5 ( 1 ) " " " " " " " " " " " " " " "*" " "
6 ( 1 ) " " " " " " " " " " " " " " "*" " "
7 ( 1 ) " " " " " " " " " " " " " " "*" " "
8 ( 1 ) " " " " "*" " " " " " " " " "*" " "
9 ( 1 ) " " "*" "*" " " " " " " " " "*" " "
10 ( 1 ) " " "*" "*" " " " " " " " " "*" " "
11 ( 1 ) "*" "*" "*" " " " " " " " " "*" " "
12 ( 1 ) "*" "*" "*" " " " " " " "*" "*" " "
13 ( 1 ) "*" "*" "*" " " " " " " " " "*" " "
14 ( 1 ) "*" "*" "*" " " " " " " "*" "*" " "
15 ( 1 ) "*" "*" "*" " " " " " " "*" "*" " "
16 ( 1 ) "*" "*" "*" " " " " " " "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" " "
18 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
20 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
21 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
22 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
23 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
24 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
25 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
26 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
27 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
28 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
29 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
30 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
31 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
CD_Player Central_Lock Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " " " " " "
7 ( 1 ) " " " " "*" " " " " " " " " " "
8 ( 1 ) " " " " "*" " " " " " " " " " "
9 ( 1 ) " " " " "*" " " " " " " " " " "
10 ( 1 ) " " " " "*" " " " " " " " " " "
11 ( 1 ) " " " " "*" " " " " " " " " " "
12 ( 1 ) " " " " "*" " " " " " " " " " "
13 ( 1 ) " " " " "*" " " " " " " " " " "
14 ( 1 ) " " " " "*" " " " " " " " " " "
15 ( 1 ) " " " " "*" " " " " " " " " " "
16 ( 1 ) "*" " " "*" " " " " " " " " " "
17 ( 1 ) "*" " " "*" " " " " " " " " " "
18 ( 1 ) "*" " " "*" " " " " " " " " " "
19 ( 1 ) "*" " " "*" " " " " " " " " " "
20 ( 1 ) "*" " " "*" " " " " " " " " " "
21 ( 1 ) "*" " " "*" " " " " " " " " "*"
22 ( 1 ) "*" " " "*" " " " " " " " " "*"
23 ( 1 ) "*" " " "*" " " "*" " " " " "*"
24 ( 1 ) "*" " " "*" " " "*" " " " " "*"
25 ( 1 ) "*" " " "*" " " "*" " " " " "*"
26 ( 1 ) "*" " " "*" " " "*" " " " " "*"
27 ( 1 ) "*" " " "*" " " "*" " " "*" "*"
28 ( 1 ) "*" " " "*" " " "*" " " "*" "*"
29 ( 1 ) "*" "*" "*" " " "*" " " "*" "*"
30 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
31 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
Metallic_Rim Radio_cassette Tow_Bar
1 ( 1 ) " " " " " "
2 ( 1 ) " " " " " "
3 ( 1 ) " " " " " "
4 ( 1 ) " " " " " "
5 ( 1 ) " " " " " "
6 ( 1 ) " " " " " "
7 ( 1 ) " " " " " "
8 ( 1 ) " " " " " "
9 ( 1 ) " " " " " "
10 ( 1 ) " " " " " "
11 ( 1 ) " " " " " "
12 ( 1 ) " " " " " "
13 ( 1 ) " " " " " "
14 ( 1 ) " " " " " "
15 ( 1 ) " " " " " "
16 ( 1 ) " " " " " "
17 ( 1 ) " " " " " "
18 ( 1 ) " " " " " "
19 ( 1 ) " " " " "*"
20 ( 1 ) "*" " " "*"
21 ( 1 ) "*" " " "*"
22 ( 1 ) "*" " " "*"
23 ( 1 ) "*" "*" "*"
24 ( 1 ) "*" "*" "*"
25 ( 1 ) "*" "*" "*"
26 ( 1 ) "*" "*" "*"
27 ( 1 ) "*" "*" "*"
28 ( 1 ) "*" "*" "*"
29 ( 1 ) "*" "*" "*"
30 ( 1 ) "*" "*" "*"
31 ( 1 ) "*" "*" "*"
reg.summary = summary(regfit.full)
names(reg.summary)
which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "l")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="Adjr2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)
which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Numero de variables", ylab = "CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col= "red", cex=2, pch=20)
which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Numero de variables", ylab = "CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col= "red", cex=2, pch=20)
reg.summary$adjr2[6]
reg.summary$adjr2[8]
reg.summary$adjr2[10]
reg.summary$adjr2[15]
reg.summary$adjr2[21]
index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
set.seed(7)
dim(rawdata3)
regfit.train <- regsubsets(Price ~., data = data.train, nvmax = 31, really.big = TRUE)
#summary(regfit.train)
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(regfit.train, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
d<-which.min(val.errors)
rmse <- sqrt(val.errors)
value.min <- which.min(rmse)
value.min
rmsebss_min <- rmse[value.min]
rmsebss_min
plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE Best Subset", pch = 20, type = "b")
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
options(scipen = 999)
coef(regfit.train, which.min(rmse))
set.seed(7)
rmse[8]
rmse[10]
rmse[15]
rmse[20]
predict.regsubsets <- function(object, newdata, id, ...){
# Extraer la fórmula del modelo (variable dependiente ~ predictores)
form <- as.formula(object$call[[2]])
# Generar una matriz modelo con los nuevos datos y la fórmula
mat <- model.matrix(form, newdata)
# Extraer los coeficientes del modelo
coefi <- coef(object , id = id)
# Almacenar el nombre de las variables predictoras del modelo
xvars <- names(coefi)
# Producto matricial entre los coeficientes del modelo y los valores de
# los predictores de las nuevas observaciones para obtener las
# predicciones
mat[ , xvars] %*% coefi
}
set.seed(123)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
regfit.val <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31, really.big = TRUE)
for(i in 1:31) {
test <- rawdata3[folds == k,]
predictions <- predict.regsubsets(object = regfit.val, newdata = rawdata3[folds == k,] , id = i)
cv.errors[k,i] <- mean((test$Price - predictions)^2)
}
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
rmse.cv
set.seed(123)
value.min <- which.min(rmse.cv)
value.min
rmsebss.cv_min <- rmse.cv[value.min]
rmsebss.cv_min
set.seed(123)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
value.min <- which.min(rmse.cv)
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
coef(regfit.val, which.min(rmse.cv))
regfitfwd.full = regsubsets(Price ~., data = rawdata3, nvmax=31, method="forward")
summary(regfitfwd.full)
reg.summary = summary(regfitfwd.full)
names(reg.summary)
which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "b")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="ADJR2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)
which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col="red", cex=2, pch=20)
which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col="red", cex=2, pch=20)
reg.summary$adjr2[6]
reg.summary$adjr2[8]
reg.summary$adjr2[10]
reg.summary$adjr2[15]
reg.summary$adjr2[21]
index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
set.seed(10)
dim(rawdata3)
model.fwd <- regsubsets(Price ~., data = data.train, nvmax = 31, method = "forward", really.big = TRUE)
summary(model.fwd)
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(model.fwd, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
rmse <- sqrt(val.errors)
set.seed(123)
value.min <- which.min(rmse)
value.min
rmsesfw_min <- rmse[value.min]
rmsesfw_min
set.seed(123)
rmse[15]
set.seed(123)
rmse[10]
set.seed(123)
plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE Best Subset", pch = 20, type = "b")
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
coef(model.fwd, which.min(rmse))
set.seed(14)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
model.fwd <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31, method = "forward", really.big = TRUE)
for(i in 1:31) {
test <- rawdata3[folds == k,]
predictions <- predict.regsubsets(object = model.fwd, newdata = rawdata3[folds == k,] , id = i)
cv.errors[k,i] <- mean((test$Price - predictions)^2)
}
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
set.seed(14)
value.min <- which.min(rmse.cv)
value.min
rmsesfw.cv_min <- rmse.cv[value.min]
rmsesfw.cv_min
set.seed(14)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
value.min <- which.min(rmse.cv)
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
coef(model.fwd, which.min(rmse.cv))
regfitbwd.full = regsubsets(Price ~., data = rawdata3, nvmax=31, method="backward")
summary(regfit.full)
reg.summary = summary(regfitbwd.full)
names(reg.summary)
which.min(reg.summary$rss)
plot(reg.summary$rss, xlab="Numero de variables", ylab="RSS", type = "b")
points(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)], col="red", cex=2, pch=20)
which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Numero de variables", ylab="ADJR2", type = "b")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red", cex=2, pch=20)
which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col="red", cex=2, pch=20)
which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Numero de variables", ylab="CP", type = "b")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col="red", cex=2, pch=20)
reg.summary$adjr2[6]
reg.summary$adjr2[8]
reg.summary$adjr2[10]
reg.summary$adjr2[15]
reg.summary$adjr2[21]
index <- createDataPartition(rawdata3$Price, p = 0.7, list = FALSE)
data.train <- rawdata3[index, ]
data.test <- rawdata3[-index, ]
set.seed(191)
dim(rawdata3)
model.bwd <- regsubsets(Price ~., data = data.train, nvmax = 31, method = "backward", really.big = TRUE)
summary(model.bwd)
val.errors = rep(NA,31)
x.test <- model.matrix(Price ~., data = data.test)
for(i in 1:31) {
coeficientes <- coef(model.bwd, id = i)
predictors <- x.test[,names(coeficientes)] %*% coeficientes
val.errors[i] <- mean((data.test$Price - predictors)^2)
}
rmse <- sqrt(val.errors)
set.seed(191)
value.min <- which.min(rmse)
value.min
rmsesbw_min <- rmse[value.min]
rmsesbw_min
set.seed(191)
rmse[20]
set.seed(191)
rmse[15]
set.seed(191)
rmse[10]
set.seed(191)
plot(rmse, xlab = "Numero de Variables", ylab = "Root MSE", pch = 20, type = "b")
value.min <- which.min(rmse)
points(value.min, rmse[value.min], col="red", cex=2, pch=20)
coef(model.fwd, which.min(rmse))
set.seed(141)
folds <- sample(rep(1:10, length = nrow(rawdata3)))
table(folds)
cv.errors <- matrix(NA,10,31)
for(k in 1:10) {
model.bwd <- regsubsets(Price~., data = rawdata3[folds !=k,], nvmax = 31, method = "backward", really.big = TRUE)
for(i in 1:31) {
test <- rawdata3[folds == k,]
predictions <- predict.regsubsets(object = model.bwd, newdata = rawdata3[folds == k,] , id = i)
cv.errors[k,i] <- mean((test$Price - predictions)^2)
}
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))
set.seed(141)
value.min <- which.min(rmse.cv)
value.min
rmsesbw.cv_min <- rmse.cv[value.min]
rmsesbw.cv_min
set.seed(141)
plot(rmse.cv, xlab = "Numero de Variables", ylab = "Root MSE Cross Validation", pch = 20, type = "b")
points(value.min, rmse.cv[value.min], col="red", cex=2, pch=20)
coef(model.bwd, which.min(rmse.cv))
x <- model.matrix(Price ~. , rawdata3)[,-1]
y <- rawdata3$Price
grid <- 10^seq(10,-2,length = 100)
ridge.model <- glmnet(x, y, alpha = 0, lambda = grid)
dim(coef(ridge.model))
coef(ridge.model)
plot(ridge.model, xvar ='lambda',label = TRUE)
options(scipen=999)
predict(ridge.model, s = 10, type = "coefficients")
set.seed(23)
indices <- sample(c(TRUE,FALSE), nrow(rawdata3), replace = TRUE)
y.test <- y[-indices]
ridge.model.cv <- cv.glmnet(x[indices,], y[indices], alpha = 0) # Por defecto, 10-fold
coef(ridge.model.cv)
plot(ridge.model.cv)
set.seed(23)
bestlam <- ridge.model.cv$lambda.min
bestlam
set.seed(23)
ridge.pred <- predict(ridge.model, s = bestlam, newx = x[-indices,])
mse <-mean((ridge.pred - y[-indices])^2)
rmse <-sqrt(mse)
value.min <- which.min(rmse)
value.min
rmseridge <- rmse[value.min]
rmseridge
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)
plot(out, xvar ='lambda',label = TRUE)
lasso.model <- glmnet(x, y, alpha = 1)
plot(lasso.model, xvar ='lambda',label = TRUE)
indices <- sample(c(TRUE,FALSE), nrow(rawdata3), replace = TRUE)
cv.out <- cv.glmnet(x[indices,], y[indices], alpha = 1)
plot(cv.out)
coef(cv.out)
bestlam <- cv.out$lambda.min
bestlam
lasso.pred <- predict(lasso.model, s = bestlam, newx = x[-indices,])
mse <-mean((lasso.pred - y[-indices])^2)
rmse <-sqrt(mse)
value.min <- which.min(rmse)
rmselasso <- rmse[value.min]
rmselasso
-El valor de RMSE es de 1025
options(scipen=999)
out <- glmnet(x, y, alpha = 1)
lasso.coef <- predict(out, type = "coefficients", s = bestlam)
lasso.coef
plot(out, xvar='lambda', label= TRUE)
rawdata4 <- rawdata2
rawdata4 <- dummy_cols(rawdata4, select_columns = "Fuel_Type", remove_first_dummy = TRUE)
rawdata4 <- select(rawdata4, - Fuel_Type, - Mfg_Year, - Mfg_Month, - Cylinders)
apply(X = rawdata4, MARGIN = 2, FUN = mean)
apply(X = rawdata4, MARGIN = 2, FUN = var)
pca.model <- prcomp(rawdata4[,-1], scale = TRUE)
names(pca.model)
pca.model$center
pca.model$scale
pca.model$rotation
dim(pca.model$x)
biplot(x = pca.model, scale = 0, cex = 0.6, col = c("blue", "brown3"))
pca.model$sdev^2
prop.variance <- pca.model$sdev^2/sum(pca.model$sdev^2)
prop.variance
prop.cumulative <- cumsum(prop.variance)
prop.cumulative
plot(prop.cumulative, type = "b", pch = 20, ylab = "Proporcion de la Varianza Acumulada",
xlab = "Componente Principal" )
points(prop.cumulative, col = "red")
#9. PCR ##############
set.seed(789)
indices <- createDataPartition(rawdata4$Price, p = 0.8, list = FALSE)
set.seed(789)
pcr.model <- pcr(formula = Price ~ ., data = rawdata4[indices, ], scale = TRUE, ncomp=21)
summary(pcr.model)
pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,], ncomp = 21)
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr <- sqrt(mse.test)
rmsepcr
set.seed(678)
pcr.model <- pcr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
summary(pcr.model)
pcr.model.cv <- MSEP(pcr.model, estimate = "CV")
m <- which.min(pcr.model.cv$val)
m
rmsepcr.cv <- sqrt(pcr.model.cv$val[m])
rmsepcr.cv
plot(pcr.model.cv$val, main = "MSE vs nro componentes", type = "l",
ylab = "MSE", col = "blue", xlab = "Componentes")
pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,])
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
-Evaluacion del modelo PCR contra los datos de prueba. Calculo del MSE y RMSE correspondiente. - El modelo no es bueno dada la igualdad componentes-predictores pero los datos sirven de referencia para otros metodos.
pcr.model.pred <- predict(pcr.model, newdata = rawdata4[-indices,], ncomp = 26)
mse.test <- mean((pcr.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
set.seed(332)
indices <- createDataPartition(rawdata4$Price, p = 0.8, list = FALSE)
set.seed(332)
pls.model <- plsr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
pls.model.cv <- MSEP(pls.model, estimate = "CV")
which.min(pls.model.cv$val)
plot(pls.model.cv$val, main = "MSE vs N Componentes", type = "l", ylab = "MSE", col = "blue", xlab = "Componentes")
pls.model.pred <- predict(pls.model, newdata = rawdata4[-indices,], ncomp = 5)
mse.test <- mean((pls.model.pred - rawdata4[-indices,]$Price)^2)
mse.test
rmsepcr.cv <- sqrt(mse.test)
rmsepcr.cv
set.seed(333)
pls.model <- plsr(formula = Price ~ ., data = rawdata4[indices, ], scale = TRUE, ncomp=21)
summary(pls.model)
pls.model.pred <- predict(pls.model, newdata = rawdata4[-indices,], ncomp = 21)
mse.test <- mean((pls.model.pred - rawdata4[-indices,]$Price)^2)
rmsepcr <- sqrt(mse.test)
rmsepcr
which.min(pls.model.cv$val)
set.seed(333)
pls.model <- plsr(formula = Price ~., data = rawdata4[indices, ], scale = TRUE, validation = "CV")
summary(pls.model)
pls.model.cv <- MSEP(pls.model, estimate = "CV")
m <- which.min(pls.model.cv$val)
m
rmseplst.cv <- sqrt(mse.test)
rmseplst.cv
metodo <- c("BSS","FW", "BW", "Ridge", "Lasso", "PCR", "PLS")
test_RMSE <- c(rmsebss.cv_min, rmsesfw.cv_min, rmsesbw.cv_min, rmseridge,
rmselasso, rmsepcr.cv, rmseplst.cv)
resultados <- data.frame(metodo, test_RMSE)
resultados
ggplot(data = resultados, aes(x = reorder(metodo, test_RMSE),
y = test_RMSE)) +
geom_col(width = 0.5) +
lims(y = c(0, 1600)) +
geom_text(aes(label = round(test_RMSE, 2)), vjust = -1) +
theme_gray() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Las variables seleccionadas por Lasso son:
Metallic_Rim
datalasso <- rawdata3
datalasso <- select(datalasso, Price, Age_08_04, KM, HP, cc, Gears, Weight, Mfr_Guarantee, BOVAG_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Central_Lock, Powered_Windows, Metallic_Rim)
dim(datalasso)
lassoLM <- (lm(formula = Price ~. , data = datalasso))
summary(lassoLM)
La variable Central_Lock tiene un t_value que no es significativamente distinto a cero, es probable que no agregue nada al modelo.
R cuadrado es de 0.89
hist(lassoLM$residuals) # histograma de los residuos estandarizados
#qqnorm(lassoLM$residuals) # gráfico de cuantiles de los residuos estandarizados
#qqline(lassoLM$residuals)
par(mfrow=c(2,2)) # divide la ventana en una fila y tres columnas
plot(lassoLM)
Residuals vs Fitted: El grafico muestra una distribucion de naturaleza aleatoria. El promedio de la grafica se pierde en cero.
QQ: Ciertas muestras no se encuentran sobre la linea, es indicativo de que se puede mejorar el modelo
Scale-Location: Muestra la varianza de los residuos
Residuals vs Leverage: No se observa ningun punto fiera de la linea punteada, es decir no se observan puntos influyentes.