Dear Yuki,
I wanted to send you the current structure of the ML model portion of my thesis.
I know that this is a lot of code and output, but you can get most of the important information by scanning the plots and charts and by reading the summary results at the end of sections 1.3 and 2.2.2.
I will spend the day tomorrow tuning the prediction models to increase accuracy, but the structure will remain roughly the same unless you have any additions, objections, or any other input.
I’m looking forward to reviewing these with you soon.
Best, Sawyer
library(leaps) # regsubsets() comes from the leaps package
library(glmnet) # glmnet() is the main function in the glmnet package (must pass in an x matrix as well as a y vector)
# Standard Model on full data set (choosing forward selection for now)
nvmax <- 72
regfit.base <- regsubsets(log(sold_price) ~ .,
                          data = data_bi,
                          nvmax = nvmax,
                          method = "forward")
summary(regfit.base)
Subset selection object
Call: regsubsets.formula(log(sold_price) ~ ., data = data_bi, nvmax = nvmax,
method = "forward")
71 Variables (and intercept)
Forced in Forced out
property_type_CND FALSE FALSE
property_type_OTH FALSE FALSE
property_type_PAT FALSE FALSE
property_type_SGL FALSE FALSE
air_conditioning_central FALSE FALSE
appartment_bi FALSE FALSE
patio_bi FALSE FALSE
school_high FALSE FALSE
school_junior FALSE FALSE
school_middle FALSE FALSE
photo_count FALSE FALSE
pool_bi FALSE FALSE
rear_yard_access_bi FALSE FALSE
roof_type_metal FALSE FALSE
roof_type_shingle FALSE FALSE
roof_type_slate FALSE FALSE
gas_type_natural FALSE FALSE
out_building_livable_bi FALSE FALSE
out_building_not_livable_bi FALSE FALSE
living_area FALSE FALSE
land_acres FALSE FALSE
appliances_included_bi FALSE FALSE
garage_bi FALSE FALSE
condition_new FALSE FALSE
condition_excellent FALSE FALSE
condition_very_good FALSE FALSE
energy_efficient_bi FALSE FALSE
exterior_brick FALSE FALSE
exterior_type_metal FALSE FALSE
exterior_type_vinyl FALSE FALSE
exterior_type_wood FALSE FALSE
exterior_features_balcony FALSE FALSE
exterior_features_courtyard FALSE FALSE
exterior_features_fence FALSE FALSE
exterior_features_porch FALSE FALSE
exterior_features_tennis_court FALSE FALSE
fire_place_bi FALSE FALSE
foundation_type_raised FALSE FALSE
foundation_type_slab FALSE FALSE
total_area FALSE FALSE
beds_total_1 FALSE FALSE
beds_total_2 FALSE FALSE
beds_total_3 FALSE FALSE
beds_total_4 FALSE FALSE
bath_full_0 FALSE FALSE
bath_full_1 FALSE FALSE
bath_full_2 FALSE FALSE
bath_full_3 FALSE FALSE
bath_full_4 FALSE FALSE
bath_full_5 FALSE FALSE
bath_full_6 FALSE FALSE
bath_half_0 FALSE FALSE
bath_half_1 FALSE FALSE
bath_half_2 FALSE FALSE
bath_half_3 FALSE FALSE
age FALSE FALSE
days_on_market FALSE FALSE
sewer_type_city FALSE FALSE
sewer_type_septic FALSE FALSE
spa_location_inside FALSE FALSE
spa_location_outside FALSE FALSE
stories FALSE FALSE
property_style_mobile FALSE FALSE
property_style_modular FALSE FALSE
city_limit_bi FALSE FALSE
subdivision_bi FALSE FALSE
termite_contract FALSE FALSE
water_type_public FALSE FALSE
water_type_well FALSE FALSE
water_type_other FALSE FALSE
waterfront_bi FALSE FALSE
1 subsets of each size up to 71
Selection Algorithm: forward
property_type_CND property_type_OTH property_type_PAT property_type_SGL air_conditioning_central
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " "*"
3 ( 1 ) " " " " " " " " "*"
4 ( 1 ) " " " " " " " " "*"
5 ( 1 ) " " " " " " " " "*"
6 ( 1 ) " " " " " " " " "*"
7 ( 1 ) " " " " " " " " "*"
8 ( 1 ) " " " " " " " " "*"
9 ( 1 ) " " " " " " " " "*"
10 ( 1 ) " " " " " " " " "*"
11 ( 1 ) " " " " " " " " "*"
12 ( 1 ) " " " " " " " " "*"
13 ( 1 ) " " " " " " " " "*"
14 ( 1 ) " " " " " " " " "*"
appartment_bi patio_bi school_high school_junior school_middle photo_count pool_bi rear_yard_access_bi
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " " " " " " " " "
9 ( 1 ) " " " " " " " " " " "*" " " " "
10 ( 1 ) " " " " " " " " "*" "*" " " " "
11 ( 1 ) " " " " " " " " "*" "*" " " " "
12 ( 1 ) " " " " " " " " "*" "*" " " " "
13 ( 1 ) " " " " " " " " "*" "*" " " " "
14 ( 1 ) " " " " " " " " "*" "*" " " " "
roof_type_metal roof_type_shingle roof_type_slate gas_type_natural out_building_livable_bi
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) " " "*" " " " " " "
5 ( 1 ) " " "*" " " " " " "
6 ( 1 ) " " "*" " " " " " "
7 ( 1 ) " " "*" " " " " " "
8 ( 1 ) " " "*" " " " " " "
9 ( 1 ) " " "*" " " " " " "
10 ( 1 ) " " "*" " " " " " "
11 ( 1 ) " " "*" " " " " " "
12 ( 1 ) " " "*" " " " " " "
13 ( 1 ) " " "*" " " " " " "
14 ( 1 ) " " "*" " " " " " "
out_building_not_livable_bi living_area land_acres appliances_included_bi garage_bi condition_new
1 ( 1 ) " " "*" " " " " " " " "
2 ( 1 ) " " "*" " " " " " " " "
3 ( 1 ) " " "*" " " " " " " " "
4 ( 1 ) " " "*" " " " " " " " "
5 ( 1 ) " " "*" " " "*" " " " "
6 ( 1 ) " " "*" " " "*" " " " "
7 ( 1 ) " " "*" " " "*" " " " "
8 ( 1 ) " " "*" " " "*" " " " "
9 ( 1 ) " " "*" " " "*" " " " "
10 ( 1 ) " " "*" " " "*" " " " "
11 ( 1 ) " " "*" " " "*" " " " "
12 ( 1 ) " " "*" " " "*" " " " "
13 ( 1 ) " " "*" " " "*" " " " "
14 ( 1 ) " " "*" " " "*" " " " "
condition_excellent condition_very_good energy_efficient_bi exterior_brick exterior_type_metal
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " " "
4 ( 1 ) " " " " " " " " " "
5 ( 1 ) " " " " " " " " " "
6 ( 1 ) " " " " " " " " " "
7 ( 1 ) " " " " " " " " " "
8 ( 1 ) "*" " " " " " " " "
9 ( 1 ) "*" " " " " " " " "
10 ( 1 ) "*" " " " " " " " "
11 ( 1 ) "*" " " "*" " " " "
12 ( 1 ) "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" " " " "
14 ( 1 ) "*" "*" "*" " " " "
exterior_type_vinyl exterior_type_wood exterior_features_balcony exterior_features_courtyard
1 ( 1 ) " " " " " " " "
2 ( 1 ) " " " " " " " "
3 ( 1 ) " " " " " " " "
4 ( 1 ) " " " " " " " "
5 ( 1 ) " " " " " " " "
6 ( 1 ) " " " " " " " "
7 ( 1 ) " " " " " " " "
8 ( 1 ) " " " " " " " "
9 ( 1 ) " " " " " " " "
10 ( 1 ) " " " " " " " "
11 ( 1 ) " " " " " " " "
12 ( 1 ) " " " " " " " "
13 ( 1 ) " " " " " " " "
14 ( 1 ) " " " " " " " "
exterior_features_fence exterior_features_porch exterior_features_tennis_court fire_place_bi
1 ( 1 ) " " " " " " " "
2 ( 1 ) " " " " " " " "
3 ( 1 ) " " " " " " " "
4 ( 1 ) " " " " " " " "
5 ( 1 ) " " " " " " " "
6 ( 1 ) " " " " " " " "
7 ( 1 ) " " " " " " " "
8 ( 1 ) " " " " " " " "
9 ( 1 ) " " " " " " " "
10 ( 1 ) " " " " " " " "
11 ( 1 ) " " " " " " " "
12 ( 1 ) " " " " " " " "
13 ( 1 ) " " " " " " " "
14 ( 1 ) " " " " " " " "
foundation_type_raised foundation_type_slab total_area beds_total_1 beds_total_2 beds_total_3
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " "*" " " " " " " " "
5 ( 1 ) " " "*" " " " " " " " "
6 ( 1 ) " " "*" " " " " " " " "
7 ( 1 ) " " "*" " " " " " " " "
8 ( 1 ) " " "*" " " " " " " " "
9 ( 1 ) " " "*" " " " " " " " "
10 ( 1 ) " " "*" " " " " " " " "
11 ( 1 ) " " "*" " " " " " " " "
12 ( 1 ) " " "*" " " " " " " " "
13 ( 1 ) "*" "*" " " " " " " " "
14 ( 1 ) "*" "*" " " " " " " " "
beds_total_4 bath_full_0 bath_full_1 bath_full_2 bath_full_3 bath_full_4 bath_full_5 bath_full_6
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " "*" " " " " " " " " " "
7 ( 1 ) " " " " "*" " " " " " " " " " "
8 ( 1 ) " " " " "*" " " " " " " " " " "
9 ( 1 ) " " " " "*" " " " " " " " " " "
10 ( 1 ) " " " " "*" " " " " " " " " " "
11 ( 1 ) " " " " "*" " " " " " " " " " "
12 ( 1 ) " " " " "*" " " " " " " " " " "
13 ( 1 ) " " " " "*" " " " " " " " " " "
14 ( 1 ) " " " " "*" " " " " " " " " " "
bath_half_0 bath_half_1 bath_half_2 bath_half_3 age days_on_market sewer_type_city sewer_type_septic
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " " " " " " " " "
9 ( 1 ) " " " " " " " " " " " " " " " "
10 ( 1 ) " " " " " " " " " " " " " " " "
11 ( 1 ) " " " " " " " " " " " " " " " "
12 ( 1 ) " " " " " " " " " " " " " " " "
13 ( 1 ) " " " " " " " " " " " " " " " "
14 ( 1 ) " " " " " " " " " " " " " " " "
spa_location_inside spa_location_outside stories property_style_mobile property_style_modular
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " " "
4 ( 1 ) " " " " " " " " " "
5 ( 1 ) " " " " " " " " " "
6 ( 1 ) " " " " " " " " " "
7 ( 1 ) " " " " " " "*" " "
8 ( 1 ) " " " " " " "*" " "
9 ( 1 ) " " " " " " "*" " "
10 ( 1 ) " " " " " " "*" " "
11 ( 1 ) " " " " " " "*" " "
12 ( 1 ) " " " " " " "*" " "
13 ( 1 ) " " " " " " "*" " "
14 ( 1 ) " " " " " " "*" " "
city_limit_bi subdivision_bi termite_contract water_type_public water_type_well water_type_other
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " " " " "
9 ( 1 ) " " " " " " " " " " " "
10 ( 1 ) " " " " " " " " " " " "
11 ( 1 ) " " " " " " " " " " " "
12 ( 1 ) " " " " " " " " " " " "
13 ( 1 ) " " " " " " " " " " " "
14 ( 1 ) " " " " " " " " " " " "
waterfront_bi
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) " "
8 ( 1 ) " "
9 ( 1 ) " "
10 ( 1 ) " "
11 ( 1 ) " "
12 ( 1 ) " "
13 ( 1 ) " "
14 ( 1 ) "*"
[ reached getOption("max.print") -- omitted 57 rows ]
mse_base <- summary(regfit.base)$rss / nrow(data_bi) # Manual MSE for each subset size: RSS / n (no squaring)
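As a quick sanity check on the definition used here: the mean squared error of a fitted model is its residual sum of squares divided by the number of observations, with no further squaring. A minimal sketch in base R, using the built-in mtcars data as a stand-in for data_bi (an assumption for illustration only; the variables are not from the thesis):

```r
# Stand-in data: mtcars ships with base R; it is NOT the thesis data.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)   # residual sum of squares
n   <- nrow(mtcars)            # number of observations used in the fit

mse_from_rss  <- rss / n                   # MSE = RSS / n
mse_from_mean <- mean(residuals(fit)^2)    # same quantity, computed directly

all.equal(mse_from_rss, mse_from_mean)
```

For a regsubsets fit, summary(...)$rss is a vector with one RSS per subset size, so dividing it by n gives one training MSE per size.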
#Validation set approach
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(data_bi), replace = TRUE) #use only the training observations to perform all aspects of model-fitting - including variable selection
test <- (!train)
table(train) #Checking to make sure data didn't get randomly split in a weird way between training and test
train
FALSE TRUE
7278 7331
table(test) # Subset selection uses the training data from the validation set approach (as opposed to K-fold)
test
FALSE TRUE
7331 7278
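The comment above contrasts the validation set approach with K-fold cross-validation. For comparison, a minimal sketch of how fold labels could be assigned in base R. The value of k is an illustrative assumption, n is the 14,609 rows tabulated above, and nothing here is part of the current pipeline:

```r
set.seed(1)
n <- 14609   # number of rows, as in the tables above (7278 + 7331)
k <- 10      # hypothetical number of folds

# Assign each row to one of k folds of (nearly) equal size, in random order
folds <- sample(rep(seq_len(k), length.out = n))

table(folds)   # fold sizes differ by at most one row

# Rows held out and rows used for fitting in, say, fold 1
test_1  <- folds == 1
train_1 <- !test_1
```

Unlike the TRUE/FALSE sampling above, this assignment fixes the fold sizes in advance, so the split cannot come out lopsided by chance.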
# lm Check
# Note: Each random split can create linear dependencies among variables that show little variation.
lm <- lm(sold_price ~ ., data_bi)
summary(lm)
Call:
lm(formula = sold_price ~ ., data = data_bi)
Residuals:
Min 1Q Median 3Q Max
-765272 -37461 120 31842 1846438
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.316e+05 7.218e+04 -3.209 0.001333 **
property_type_CND -2.531e+04 7.361e+03 -3.438 0.000588 ***
property_type_OTH 9.078e+04 4.609e+04 1.970 0.048884 *
property_type_PAT 4.235e+04 1.445e+04 2.931 0.003387 **
property_type_SGL 3.237e+04 5.654e+03 5.725 1.06e-08 ***
air_conditioning_central 2.599e+04 2.633e+03 9.869 < 2e-16 ***
appartment_bi -3.275e+03 1.974e+04 -0.166 0.868189
patio_bi 8.733e+03 1.481e+03 5.897 3.78e-09 ***
school_high 2.485e+04 7.167e+03 3.468 0.000526 ***
school_junior -8.598e+04 6.414e+03 -13.406 < 2e-16 ***
school_middle 3.955e+04 7.208e+03 5.487 4.15e-08 ***
photo_count 1.443e+03 8.029e+01 17.975 < 2e-16 ***
pool_bi 1.876e+04 2.375e+03 7.901 2.97e-15 ***
rear_yard_access_bi 2.313e+04 4.028e+03 5.744 9.44e-09 ***
roof_type_metal 2.602e+03 2.494e+03 1.044 0.296655
roof_type_shingle 3.007e+04 1.750e+03 17.180 < 2e-16 ***
roof_type_slate 1.911e+04 1.069e+04 1.788 0.073817 .
gas_type_natural 5.336e+02 1.934e+03 0.276 0.782574
out_building_livable_bi 3.884e+04 1.210e+04 3.211 0.001327 **
out_building_not_livable_bi -9.093e+03 1.571e+03 -5.789 7.22e-09 ***
living_area 7.196e+01 1.818e+00 39.585 < 2e-16 ***
land_acres 4.083e-01 4.152e+00 0.098 0.921653
appliances_included_bi 2.376e+04 1.970e+03 12.063 < 2e-16 ***
garage_bi 1.125e+04 1.478e+03 7.613 2.83e-14 ***
condition_new 1.048e+05 9.584e+03 10.932 < 2e-16 ***
condition_excellent 1.040e+05 4.597e+03 22.616 < 2e-16 ***
condition_very_good 1.465e+04 2.142e+03 6.839 8.28e-12 ***
energy_efficient_bi 1.502e+04 1.646e+03 9.129 < 2e-16 ***
exterior_brick -1.019e+04 2.057e+03 -4.955 7.30e-07 ***
exterior_type_metal -1.814e+03 4.226e+03 -0.429 0.667815
exterior_type_vinyl -7.661e+03 1.754e+03 -4.368 1.26e-05 ***
exterior_type_wood -6.392e+03 3.039e+03 -2.103 0.035468 *
exterior_features_balcony 8.119e+04 7.909e+03 10.265 < 2e-16 ***
exterior_features_courtyard 9.988e+04 1.134e+04 8.806 < 2e-16 ***
exterior_features_fence -1.288e+04 1.547e+03 -8.324 < 2e-16 ***
exterior_features_porch -1.817e+03 2.357e+03 -0.771 0.440665
exterior_features_tennis_court 3.713e+03 3.258e+04 0.114 0.909266
fire_place_bi 1.066e+04 1.558e+03 6.846 7.89e-12 ***
foundation_type_raised -1.379e+04 2.429e+03 -5.676 1.41e-08 ***
foundation_type_slab 7.816e+02 2.147e+03 0.364 0.715767
total_area -2.756e-03 1.933e-03 -1.425 0.154114
beds_total_1 4.538e+04 8.658e+03 5.241 1.62e-07 ***
beds_total_2 4.115e+04 5.423e+03 7.588 3.44e-14 ***
beds_total_3 3.314e+04 4.776e+03 6.940 4.09e-12 ***
beds_total_4 3.133e+04 4.543e+03 6.896 5.56e-12 ***
bath_full_0 -6.347e+04 4.580e+04 -1.386 0.165789
bath_full_1 -9.999e+04 3.998e+04 -2.501 0.012382 *
bath_full_2 -7.331e+04 3.986e+04 -1.839 0.065887 .
bath_full_3 -3.216e+04 3.980e+04 -0.808 0.419078
bath_full_4 4.174e+04 4.010e+04 1.041 0.297864
bath_full_5 1.972e+05 4.241e+04 4.649 3.36e-06 ***
bath_full_6 5.324e+05 6.062e+04 8.782 < 2e-16 ***
bath_half_0 1.484e+05 5.628e+04 2.637 0.008374 **
bath_half_1 1.736e+05 5.630e+04 3.083 0.002053 **
bath_half_2 2.007e+05 5.685e+04 3.530 0.000416 ***
bath_half_3 6.245e+05 6.294e+04 9.922 < 2e-16 ***
age 7.151e+02 5.819e+01 12.289 < 2e-16 ***
days_on_market -7.607e+01 7.111e+00 -10.698 < 2e-16 ***
sewer_type_city 4.637e+03 1.563e+03 2.966 0.003024 **
sewer_type_septic -4.078e+03 2.428e+03 -1.679 0.093108 .
spa_location_inside 7.641e+04 2.309e+04 3.309 0.000938 ***
spa_location_outside 9.634e+04 1.961e+04 4.913 9.06e-07 ***
stories -1.030e+03 2.350e+03 -0.438 0.661241
property_style_mobile -5.338e+04 3.921e+03 -13.614 < 2e-16 ***
property_style_modular -3.477e+04 1.563e+04 -2.224 0.026133 *
city_limit_bi 1.338e+04 4.327e+03 3.091 0.001998 **
subdivision_bi -7.761e+03 2.685e+03 -2.890 0.003858 **
termite_contract 5.626e+04 4.609e+03 12.207 < 2e-16 ***
water_type_public 1.123e+04 1.777e+04 0.632 0.527569
water_type_well 2.619e+04 1.901e+04 1.378 0.168361
water_type_other 3.266e+04 1.788e+04 1.827 0.067674 .
waterfront_bi 3.744e+04 2.566e+03 14.593 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 78950 on 14537 degrees of freedom
Multiple R-squared: 0.6462, Adjusted R-squared: 0.6445
F-statistic: 374 on 71 and 14537 DF, p-value: < 2.2e-16
lm <- lm(sold_price ~ ., data_bi[train,])
summary(lm)
Call:
lm(formula = sold_price ~ ., data = data_bi[train, ])
Residuals:
Min 1Q Median 3Q Max
-703508 -38592 -418 32483 1832204
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.279e+05 1.198e+05 -1.068 0.285770
property_type_CND -7.715e+03 1.067e+04 -0.723 0.469677
property_type_OTH 7.629e+04 5.791e+04 1.317 0.187787
property_type_PAT 4.309e+04 2.076e+04 2.076 0.037964 *
property_type_SGL 3.093e+04 8.461e+03 3.656 0.000258 ***
air_conditioning_central 2.558e+04 3.923e+03 6.522 7.40e-11 ***
appartment_bi -1.095e+04 2.469e+04 -0.443 0.657471
patio_bi 8.614e+03 2.142e+03 4.021 5.85e-05 ***
school_high 1.646e+04 1.067e+04 1.542 0.123048
school_junior -9.723e+04 9.029e+03 -10.769 < 2e-16 ***
school_middle 4.661e+04 1.073e+04 4.345 1.41e-05 ***
photo_count 1.369e+03 1.163e+02 11.775 < 2e-16 ***
pool_bi 1.692e+04 3.435e+03 4.926 8.57e-07 ***
rear_yard_access_bi 2.862e+04 5.905e+03 4.846 1.29e-06 ***
roof_type_metal 7.609e+02 3.617e+03 0.210 0.833368
roof_type_shingle 3.088e+04 2.521e+03 12.250 < 2e-16 ***
roof_type_slate 1.413e+04 1.383e+04 1.022 0.306867
gas_type_natural -4.538e+01 2.768e+03 -0.016 0.986919
out_building_livable_bi 9.653e+04 1.823e+04 5.294 1.23e-07 ***
out_building_not_livable_bi -8.530e+03 2.272e+03 -3.755 0.000175 ***
living_area 6.991e+01 2.685e+00 26.039 < 2e-16 ***
land_acres 5.065e-02 4.995e+00 0.010 0.991909
appliances_included_bi 2.389e+04 2.818e+03 8.479 < 2e-16 ***
garage_bi 1.140e+04 2.152e+03 5.299 1.20e-07 ***
condition_new 1.130e+05 1.259e+04 8.976 < 2e-16 ***
condition_excellent 9.436e+04 6.648e+03 14.193 < 2e-16 ***
condition_very_good 1.772e+04 3.077e+03 5.760 8.78e-09 ***
energy_efficient_bi 1.323e+04 2.372e+03 5.580 2.49e-08 ***
exterior_brick -9.663e+03 2.981e+03 -3.242 0.001193 **
exterior_type_metal 2.440e+03 5.938e+03 0.411 0.681150
exterior_type_vinyl -6.409e+03 2.532e+03 -2.531 0.011388 *
exterior_type_wood 3.312e+03 4.359e+03 0.760 0.447375
exterior_features_balcony 8.209e+04 1.142e+04 7.186 7.32e-13 ***
exterior_features_courtyard 9.354e+04 1.769e+04 5.287 1.28e-07 ***
exterior_features_fence -1.111e+04 2.242e+03 -4.956 7.37e-07 ***
exterior_features_porch 2.631e+03 3.422e+03 0.769 0.441937
exterior_features_tennis_court 2.620e+04 4.833e+04 0.542 0.587701
fire_place_bi 1.234e+04 2.238e+03 5.515 3.61e-08 ***
foundation_type_raised -8.228e+03 3.512e+03 -2.343 0.019168 *
foundation_type_slab 5.961e+03 3.092e+03 1.928 0.053864 .
total_area -2.506e-03 2.790e-03 -0.898 0.369107
beds_total_1 4.036e+04 1.251e+04 3.227 0.001258 **
beds_total_2 4.637e+04 7.783e+03 5.958 2.67e-09 ***
beds_total_3 4.100e+04 6.849e+03 5.986 2.25e-09 ***
beds_total_4 4.350e+04 6.517e+03 6.674 2.67e-11 ***
bath_full_0 -1.777e+05 8.929e+04 -1.990 0.046602 *
bath_full_1 -1.695e+05 8.280e+04 -2.047 0.040720 *
bath_full_2 -1.446e+05 8.259e+04 -1.751 0.079999 .
bath_full_3 -1.039e+05 8.241e+04 -1.260 0.207557
bath_full_4 -2.125e+04 8.249e+04 -0.258 0.796741
bath_full_5 8.009e+04 8.445e+04 0.948 0.342997
bath_full_6 -1.404e+05 1.156e+05 -1.215 0.224403
bath_half_0 1.341e+05 8.096e+04 1.657 0.097575 .
bath_half_1 1.560e+05 8.099e+04 1.926 0.054153 .
bath_half_2 1.824e+05 8.193e+04 2.227 0.026009 *
bath_half_3 5.719e+05 8.881e+04 6.439 1.28e-10 ***
age 5.667e+02 6.930e+01 8.177 3.41e-16 ***
days_on_market -7.698e+01 1.058e+01 -7.279 3.72e-13 ***
sewer_type_city 5.388e+03 2.248e+03 2.397 0.016554 *
sewer_type_septic -6.309e+03 3.577e+03 -1.763 0.077867 .
spa_location_inside 1.325e+04 4.142e+04 0.320 0.749141
spa_location_outside 1.957e+04 3.348e+04 0.585 0.558880
stories 1.869e+03 3.591e+03 0.520 0.602741
property_style_mobile -5.359e+04 5.897e+03 -9.087 < 2e-16 ***
property_style_modular -4.314e+04 1.929e+04 -2.236 0.025374 *
city_limit_bi 1.017e+04 6.226e+03 1.633 0.102455
subdivision_bi -1.317e+04 3.821e+03 -3.447 0.000570 ***
termite_contract 6.268e+04 6.536e+03 9.590 < 2e-16 ***
water_type_public -5.773e+03 2.717e+04 -0.213 0.831706
water_type_well 4.407e+03 2.891e+04 0.152 0.878849
water_type_other 1.353e+04 2.730e+04 0.496 0.620117
waterfront_bi 4.106e+04 3.761e+03 10.917 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80720 on 7259 degrees of freedom
Multiple R-squared: 0.6284, Adjusted R-squared: 0.6248
F-statistic: 172.9 on 71 and 7259 DF, p-value: < 2.2e-16
lm <- lm(sold_price ~ ., data_bi[test,])
summary(lm)
Call:
lm(formula = sold_price ~ ., data = data_bi[test, ])
Residuals:
Min 1Q Median 3Q Max
-677618 -36246 -255 32167 1412450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.443e+05 9.477e+04 -2.578 0.009968 **
property_type_CND -4.503e+04 1.015e+04 -4.435 9.32e-06 ***
property_type_OTH 1.339e+05 7.704e+04 1.738 0.082180 .
property_type_PAT 4.356e+04 1.993e+04 2.185 0.028912 *
property_type_SGL 3.549e+04 7.510e+03 4.725 2.34e-06 ***
air_conditioning_central 2.587e+04 3.508e+03 7.375 1.83e-13 ***
appartment_bi -1.960e+04 3.512e+04 -0.558 0.576797
patio_bi 8.626e+03 2.026e+03 4.258 2.09e-05 ***
school_high 2.990e+04 9.529e+03 3.138 0.001708 **
school_junior -7.622e+04 9.161e+03 -8.321 < 2e-16 ***
school_middle 3.552e+04 9.594e+03 3.702 0.000216 ***
photo_count 1.531e+03 1.097e+02 13.950 < 2e-16 ***
pool_bi 2.187e+04 3.260e+03 6.709 2.11e-11 ***
rear_yard_access_bi 1.668e+04 5.466e+03 3.052 0.002282 **
roof_type_metal 4.068e+03 3.403e+03 1.195 0.231964
roof_type_shingle 3.038e+04 2.407e+03 12.623 < 2e-16 ***
roof_type_slate 2.025e+04 1.715e+04 1.181 0.237607
gas_type_natural 1.428e+03 2.672e+03 0.534 0.593140
out_building_livable_bi -2.592e+04 1.620e+04 -1.600 0.109580
out_building_not_livable_bi -1.028e+04 2.154e+03 -4.773 1.85e-06 ***
living_area 7.164e+01 2.458e+00 29.147 < 2e-16 ***
land_acres 1.896e+00 7.628e+00 0.249 0.803722
appliances_included_bi 2.386e+04 2.726e+03 8.751 < 2e-16 ***
garage_bi 1.195e+04 2.013e+03 5.937 3.04e-09 ***
condition_new 9.257e+04 1.499e+04 6.176 6.91e-10 ***
condition_excellent 1.174e+05 6.330e+03 18.541 < 2e-16 ***
condition_very_good 1.117e+04 2.955e+03 3.780 0.000158 ***
energy_efficient_bi 1.652e+04 2.260e+03 7.310 2.96e-13 ***
exterior_brick -1.014e+04 2.811e+03 -3.606 0.000314 ***
exterior_type_metal -5.710e+03 5.984e+03 -0.954 0.340004
exterior_type_vinyl -8.347e+03 2.404e+03 -3.471 0.000521 ***
exterior_type_wood -1.592e+04 4.202e+03 -3.789 0.000153 ***
exterior_features_balcony 7.955e+04 1.098e+04 7.247 4.69e-13 ***
exterior_features_courtyard 1.003e+05 1.471e+04 6.820 9.87e-12 ***
exterior_features_fence -1.440e+04 2.114e+03 -6.811 1.05e-11 ***
exterior_features_porch -5.782e+03 3.221e+03 -1.795 0.072682 .
exterior_features_tennis_court -3.639e+03 4.424e+04 -0.082 0.934447
fire_place_bi 9.224e+03 2.147e+03 4.296 1.76e-05 ***
foundation_type_raised -1.887e+04 3.333e+03 -5.661 1.56e-08 ***
foundation_type_slab -4.480e+03 2.953e+03 -1.517 0.129284
total_area -3.145e-03 2.648e-03 -1.188 0.234937
beds_total_1 5.224e+04 1.198e+04 4.360 1.32e-05 ***
beds_total_2 3.861e+04 7.520e+03 5.134 2.90e-07 ***
beds_total_3 2.857e+04 6.632e+03 4.308 1.67e-05 ***
beds_total_4 2.259e+04 6.314e+03 3.578 0.000348 ***
bath_full_0 -1.369e+04 5.410e+04 -0.253 0.800253
bath_full_1 -8.224e+04 4.473e+04 -1.839 0.066020 .
bath_full_2 -5.250e+04 4.458e+04 -1.178 0.239005
bath_full_3 -9.633e+03 4.455e+04 -0.216 0.828791
bath_full_4 5.790e+04 4.510e+04 1.284 0.199234
bath_full_5 2.788e+05 4.985e+04 5.592 2.33e-08 ***
bath_full_6 8.876e+05 7.074e+04 12.548 < 2e-16 ***
bath_half_0 1.123e+05 7.831e+04 1.434 0.151598
bath_half_1 1.409e+05 7.833e+04 1.799 0.072015 .
bath_half_2 1.600e+05 7.903e+04 2.025 0.042952 *
bath_half_3 6.643e+05 9.008e+04 7.374 1.84e-13 ***
age 1.074e+03 1.124e+02 9.554 < 2e-16 ***
days_on_market -7.612e+01 9.566e+00 -7.957 2.03e-15 ***
sewer_type_city 3.701e+03 2.153e+03 1.719 0.085678 .
sewer_type_septic -7.032e+02 3.272e+03 -0.215 0.829845
spa_location_inside 1.093e+05 2.737e+04 3.992 6.61e-05 ***
spa_location_outside 1.326e+05 2.418e+04 5.485 4.27e-08 ***
stories -6.511e+02 3.084e+03 -0.211 0.832782
property_style_mobile -5.330e+04 5.185e+03 -10.280 < 2e-16 ***
property_style_modular -1.783e+04 2.727e+04 -0.654 0.513115
city_limit_bi 1.444e+04 5.996e+03 2.409 0.016021 *
subdivision_bi -6.023e+02 3.756e+03 -0.160 0.872606
termite_contract 4.642e+04 6.469e+03 7.175 7.93e-13 ***
water_type_public 2.588e+04 2.324e+04 1.114 0.265496
water_type_well 4.927e+04 2.498e+04 1.973 0.048586 *
water_type_other 5.008e+04 2.340e+04 2.140 0.032360 *
waterfront_bi 3.239e+04 3.475e+03 9.322 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 76120 on 7206 degrees of freedom
Multiple R-squared: 0.676, Adjusted R-squared: 0.6728
F-statistic: 211.8 on 71 and 7206 DF, p-value: < 2.2e-16
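The note at the top of this check concerns dummy variables with little variation: in a given split such a column can end up constant, or an exact combination of other columns, and lm() then reports NA for that coefficient. A minimal sketch of how that could be detected, using a small made-up data frame (toy, x1, x2 and rare_bi are hypothetical names, not thesis variables):

```r
# Toy data: rare_bi is a dummy that is 1 for only two rows
toy <- data.frame(
  y       = c(10, 12, 9, 15, 11, 14, 13, 16),
  x1      = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2      = c(2, 1, 4, 3, 6, 5, 8, 7),
  rare_bi = c(0, 0, 0, 0, 0, 0, 1, 1)
)

# A split that happens to drop both rows where rare_bi == 1
keep <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
sub  <- toy[keep, ]

X <- model.matrix(y ~ ., data = sub)

# Full column rank means rank == number of columns
rank_X <- qr(X)$rank
rank_X < ncol(X)                      # TRUE: the design matrix is rank deficient

# lm() flags the affected coefficient with NA
fit <- lm(y ~ ., data = sub)
names(coef(fit))[is.na(coef(fit))]    # "rare_bi"
```

None of the three summaries above shows an NA coefficient, so this particular split (seed 1) does not appear to hit the problem; the check matters mainly when the seed or the split proportion changes.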
# Forward selection on training data
nvmax <- 72
regfit.fwd <- regsubsets(log(sold_price) ~ .,
                         data = data_bi[train,],
                         nvmax = nvmax,
                         method = "forward")
summary(regfit.fwd)
Subset selection object
Call: regsubsets.formula(log(sold_price) ~ ., data = data_bi[train,
], nvmax = nvmax, method = "forward")
71 Variables (and intercept)
Forced in Forced out
property_type_CND FALSE FALSE
property_type_OTH FALSE FALSE
property_type_PAT FALSE FALSE
property_type_SGL FALSE FALSE
air_conditioning_central FALSE FALSE
appartment_bi FALSE FALSE
patio_bi FALSE FALSE
school_high FALSE FALSE
school_junior FALSE FALSE
school_middle FALSE FALSE
photo_count FALSE FALSE
pool_bi FALSE FALSE
rear_yard_access_bi FALSE FALSE
roof_type_metal FALSE FALSE
roof_type_shingle FALSE FALSE
roof_type_slate FALSE FALSE
gas_type_natural FALSE FALSE
out_building_livable_bi FALSE FALSE
out_building_not_livable_bi FALSE FALSE
living_area FALSE FALSE
land_acres FALSE FALSE
appliances_included_bi FALSE FALSE
garage_bi FALSE FALSE
condition_new FALSE FALSE
condition_excellent FALSE FALSE
condition_very_good FALSE FALSE
energy_efficient_bi FALSE FALSE
exterior_brick FALSE FALSE
exterior_type_metal FALSE FALSE
exterior_type_vinyl FALSE FALSE
exterior_type_wood FALSE FALSE
exterior_features_balcony FALSE FALSE
exterior_features_courtyard FALSE FALSE
exterior_features_fence FALSE FALSE
exterior_features_porch FALSE FALSE
exterior_features_tennis_court FALSE FALSE
fire_place_bi FALSE FALSE
foundation_type_raised FALSE FALSE
foundation_type_slab FALSE FALSE
total_area FALSE FALSE
beds_total_1 FALSE FALSE
beds_total_2 FALSE FALSE
beds_total_3 FALSE FALSE
beds_total_4 FALSE FALSE
bath_full_0 FALSE FALSE
bath_full_1 FALSE FALSE
bath_full_2 FALSE FALSE
bath_full_3 FALSE FALSE
bath_full_4 FALSE FALSE
bath_full_5 FALSE FALSE
bath_full_6 FALSE FALSE
bath_half_0 FALSE FALSE
bath_half_1 FALSE FALSE
bath_half_2 FALSE FALSE
bath_half_3 FALSE FALSE
age FALSE FALSE
days_on_market FALSE FALSE
sewer_type_city FALSE FALSE
sewer_type_septic FALSE FALSE
spa_location_inside FALSE FALSE
spa_location_outside FALSE FALSE
stories FALSE FALSE
property_style_mobile FALSE FALSE
property_style_modular FALSE FALSE
city_limit_bi FALSE FALSE
subdivision_bi FALSE FALSE
termite_contract FALSE FALSE
water_type_public FALSE FALSE
water_type_well FALSE FALSE
water_type_other FALSE FALSE
waterfront_bi FALSE FALSE
1 subsets of each size up to 71
Selection Algorithm: forward
property_type_CND property_type_OTH property_type_PAT property_type_SGL air_conditioning_central
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " "*"
3 ( 1 ) " " " " " " " " "*"
4 ( 1 ) " " " " " " " " "*"
5 ( 1 ) " " " " " " " " "*"
6 ( 1 ) " " " " " " " " "*"
7 ( 1 ) " " " " " " " " "*"
8 ( 1 ) " " " " " " " " "*"
9 ( 1 ) " " " " " " " " "*"
10 ( 1 ) " " " " " " " " "*"
11 ( 1 ) " " " " " " " " "*"
12 ( 1 ) " " " " " " " " "*"
13 ( 1 ) " " " " " " " " "*"
14 ( 1 ) " " " " " " " " "*"
appartment_bi patio_bi school_high school_junior school_middle photo_count pool_bi rear_yard_access_bi
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " " " " " " " " "
9 ( 1 ) " " " " " " " " "*" " " " " " "
10 ( 1 ) " " " " " " " " "*" " " " " " "
11 ( 1 ) " " " " " " " " "*" "*" " " " "
12 ( 1 ) " " " " " " " " "*" "*" " " " "
13 ( 1 ) " " " " " " " " "*" "*" " " " "
14 ( 1 ) " " " " " " " " "*" "*" " " " "
roof_type_metal roof_type_shingle roof_type_slate gas_type_natural out_building_livable_bi
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " "
4 ( 1 ) " " "*" " " " " " "
5 ( 1 ) " " "*" " " " " " "
6 ( 1 ) " " "*" " " " " " "
7 ( 1 ) " " "*" " " " " " "
8 ( 1 ) " " "*" " " " " " "
9 ( 1 ) " " "*" " " " " " "
10 ( 1 ) " " "*" " " " " " "
11 ( 1 ) " " "*" " " " " " "
12 ( 1 ) " " "*" " " " " " "
13 ( 1 ) " " "*" " " " " " "
14 ( 1 ) " " "*" " " " " " "
out_building_not_livable_bi living_area land_acres appliances_included_bi garage_bi condition_new
1 ( 1 ) " " "*" " " " " " " " "
2 ( 1 ) " " "*" " " " " " " " "
3 ( 1 ) " " "*" " " " " " " " "
4 ( 1 ) " " "*" " " " " " " " "
5 ( 1 ) " " "*" " " "*" " " " "
6 ( 1 ) " " "*" " " "*" " " " "
7 ( 1 ) " " "*" " " "*" " " " "
8 ( 1 ) " " "*" " " "*" " " " "
9 ( 1 ) " " "*" " " "*" " " " "
10 ( 1 ) " " "*" " " "*" " " " "
11 ( 1 ) " " "*" " " "*" " " " "
12 ( 1 ) " " "*" " " "*" " " " "
13 ( 1 ) " " "*" " " "*" " " " "
14 ( 1 ) " " "*" " " "*" " " " "
condition_excellent condition_very_good energy_efficient_bi exterior_brick exterior_type_metal
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " " "
4 ( 1 ) " " " " " " " " " "
5 ( 1 ) " " " " " " " " " "
6 ( 1 ) "*" " " " " " " " "
7 ( 1 ) "*" " " " " " " " "
8 ( 1 ) "*" " " " " " " " "
9 ( 1 ) "*" " " " " " " " "
10 ( 1 ) "*" "*" " " " " " "
11 ( 1 ) "*" "*" " " " " " "
12 ( 1 ) "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" " " " "
14 ( 1 ) "*" "*" "*" " " " "
exterior_type_vinyl exterior_type_wood exterior_features_balcony exterior_features_courtyard
1 ( 1 ) " " " " " " " "
2 ( 1 ) " " " " " " " "
3 ( 1 ) " " " " " " " "
4 ( 1 ) " " " " " " " "
5 ( 1 ) " " " " " " " "
6 ( 1 ) " " " " " " " "
7 ( 1 ) " " " " " " " "
8 ( 1 ) " " " " " " " "
9 ( 1 ) " " " " " " " "
10 ( 1 ) " " " " " " " "
11 ( 1 ) " " " " " " " "
12 ( 1 ) " " " " " " " "
13 ( 1 ) " " " " " " " "
14 ( 1 ) " " " " " " " "
exterior_features_fence exterior_features_porch exterior_features_tennis_court fire_place_bi
1 ( 1 ) " " " " " " " "
2 ( 1 ) " " " " " " " "
3 ( 1 ) " " " " " " " "
4 ( 1 ) " " " " " " " "
5 ( 1 ) " " " " " " " "
6 ( 1 ) " " " " " " " "
7 ( 1 ) " " " " " " " "
8 ( 1 ) " " " " " " " "
9 ( 1 ) " " " " " " " "
10 ( 1 ) " " " " " " " "
11 ( 1 ) " " " " " " " "
12 ( 1 ) " " " " " " " "
13 ( 1 ) " " " " " " " "
14 ( 1 ) " " " " " " " "
foundation_type_raised foundation_type_slab total_area beds_total_1 beds_total_2 beds_total_3
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " "*" " " " " " " " "
5 ( 1 ) " " "*" " " " " " " " "
6 ( 1 ) " " "*" " " " " " " " "
7 ( 1 ) " " "*" " " " " " " " "
8 ( 1 ) " " "*" " " " " " " " "
9 ( 1 ) " " "*" " " " " " " " "
10 ( 1 ) " " "*" " " " " " " " "
11 ( 1 ) " " "*" " " " " " " " "
12 ( 1 ) " " "*" " " " " " " " "
13 ( 1 ) " " "*" " " " " " " " "
14 ( 1 ) " " "*" " " " " " " " "
beds_total_4 bath_full_0 bath_full_1 bath_full_2 bath_full_3 bath_full_4 bath_full_5 bath_full_6
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " " " " " "
7 ( 1 ) " " " " "*" " " " " " " " " " "
8 ( 1 ) " " " " "*" " " " " " " " " " "
9 ( 1 ) " " " " "*" " " " " " " " " " "
10 ( 1 ) " " " " "*" " " " " " " " " " "
11 ( 1 ) " " " " "*" " " " " " " " " " "
12 ( 1 ) " " " " "*" " " " " " " " " " "
13 ( 1 ) " " " " "*" " " " " " " " " " "
14 ( 1 ) " " " " "*" " " " " " " " " " "
bath_half_0 bath_half_1 bath_half_2 bath_half_3 age days_on_market sewer_type_city sewer_type_septic
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " " " " " " " " "
9 ( 1 ) " " " " " " " " " " " " " " " "
10 ( 1 ) " " " " " " " " " " " " " " " "
11 ( 1 ) " " " " " " " " " " " " " " " "
12 ( 1 ) " " " " " " " " " " " " " " " "
13 ( 1 ) " " " " " " " " " " " " " " " "
14 ( 1 ) "*" " " " " " " " " " " " " " "
spa_location_inside spa_location_outside stories property_style_mobile property_style_modular
1 ( 1 ) " " " " " " " " " "
2 ( 1 ) " " " " " " " " " "
3 ( 1 ) " " " " " " " " " "
4 ( 1 ) " " " " " " " " " "
5 ( 1 ) " " " " " " " " " "
6 ( 1 ) " " " " " " " " " "
7 ( 1 ) " " " " " " " " " "
8 ( 1 ) " " " " " " "*" " "
9 ( 1 ) " " " " " " "*" " "
10 ( 1 ) " " " " " " "*" " "
11 ( 1 ) " " " " " " "*" " "
12 ( 1 ) " " " " " " "*" " "
13 ( 1 ) " " " " " " "*" " "
14 ( 1 ) " " " " " " "*" " "
city_limit_bi subdivision_bi termite_contract water_type_public water_type_well water_type_other
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " " " " "
9 ( 1 ) " " " " " " " " " " " "
10 ( 1 ) " " " " " " " " " " " "
11 ( 1 ) " " " " " " " " " " " "
12 ( 1 ) " " " " " " " " " " " "
13 ( 1 ) " " " " " " " " " " " "
14 ( 1 ) " " " " " " " " " " " "
waterfront_bi
1 ( 1 ) " "
2 ( 1 ) " "
3 ( 1 ) " "
4 ( 1 ) " "
5 ( 1 ) " "
6 ( 1 ) " "
7 ( 1 ) " "
8 ( 1 ) " "
9 ( 1 ) " "
10 ( 1 ) " "
11 ( 1 ) " "
12 ( 1 ) " "
13 ( 1 ) "*"
14 ( 1 ) "*"
[ reached getOption("max.print") -- omitted 57 rows ]
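mse_train <- summary(regfit.fwd)$rss / nrow(data_bi) #Manual training MSE for each subset size: RSS / n (no squaring). If regfit.fwd was fit on a training subset only, divide by the number of training rows instead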
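As a sanity check on the manual MSE calculation, here is a minimal sketch on simulated data (not the housing data; sim, fit, rss, mse_manual and mse_direct are illustrative names). It shows that the MSE is the residual sum of squares divided by the number of observations used in the fit, with no extra squaring.

```r
# Simulated data; every name here is illustrative, not part of the thesis code
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
rss <- sum(residuals(fit)^2)                  # residual sum of squares
mse_manual <- rss / n                         # MSE = RSS / n
mse_direct <- mean((sim$y - fitted(fit))^2)   # same quantity computed directly
all.equal(mse_manual, mse_direct)
```

The same relationship should hold for each entry of summary(regfit.fwd)$rss, provided n is the number of rows the model was actually fit on.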
# Make a model matrix from the test data, then create predictions on the test data with the model trained on the training data
test.mat <- model.matrix(log(sold_price) ~ . ,
data = data_bi[test,]) #nvmax and method are regsubsets() arguments, not model.matrix() arguments, so they are dropped here
dim(test.mat)
[1] 7278 72
val.errors <- rep(0, 71) #Empty container for the validation errors of the 1- to 71-variable models
for (i in 1:71){
coef.i <- coef(regfit.fwd, i) #Extract the coefficients of the i-variable model (fit on the TRAINING data)
pred.i <- test.mat[, names(coef.i)] %*% coef.i #Multiply the coefficients into the matching columns of the TEST model matrix to form the predictions
val.errors[i] <- mean((log(data_bi$sold_price[test]) - pred.i)^2) #compute the test MSE
}
val.errors
[1] 0.3981389 0.3460456 0.3131914 0.2857834 0.2690171 0.2587374 0.2477949 0.2376078 0.2318361 0.2294201
[11] 0.2235994 0.2203535 0.2180126 0.2165033 0.2146756 0.2129553 0.2118975 0.2099353 0.2073327 0.2038707
[21] 0.2036459 0.2029766 0.2015432 0.2022469 0.2017999 0.2005087 0.1994757 0.1990417 0.1986158 0.1985793
[31] 0.1986899 0.1991310 0.1993242 0.1993470 0.1991443 0.1990559 0.1989207 0.1989950 0.1990568 0.1988342
[41] 0.1986396 0.1986040 0.1984280 0.1984807 0.1983732 0.1984372 0.1985181 0.1986470 0.1986643 0.1986536
[51] 0.1983438 0.1983531 0.1982854 0.1982847 0.1982942 0.1982212 0.1981750 0.1980991 0.1981443 0.1982113
[61] 0.1981467 0.1981829 0.1981768 0.1981728 0.1981549 0.1981229 0.1981381 0.1981239 0.1981097 0.1981131
[71] 0.1981076
which.min(val.errors) #58-variable model has the minimum test MSE
[1] 58
coef(regfit.fwd, 58) #Shows the coefficients of the selected 58-variable model
(Intercept) property_type_CND property_type_OTH property_type_PAT
1.053620e+01 -9.612058e-02 3.622257e-01 1.061965e-01
air_conditioning_central appartment_bi patio_bi school_high
4.388982e-01 -2.419822e-01 6.114000e-02 1.052042e-01
school_junior school_middle photo_count pool_bi
-5.159281e-01 2.389779e-01 8.786759e-03 4.014091e-02
rear_yard_access_bi roof_type_metal roof_type_shingle roof_type_slate
1.376895e-01 -4.487685e-02 2.104819e-01 6.622378e-02
gas_type_natural out_building_livable_bi living_area land_acres
-3.585821e-02 2.728194e-01 2.393582e-04 2.613134e-05
appliances_included_bi garage_bi condition_new condition_excellent
2.623661e-01 7.624987e-02 5.717235e-01 4.814971e-01
condition_very_good energy_efficient_bi exterior_brick exterior_type_metal
1.940528e-01 1.182910e-01 -5.370564e-02 -6.249268e-02
exterior_type_vinyl exterior_type_wood exterior_features_balcony exterior_features_courtyard
-2.670202e-02 -4.107985e-02 2.012439e-01 2.550949e-01
exterior_features_fence exterior_features_porch fire_place_bi foundation_type_raised
-5.883263e-02 2.728448e-02 9.787377e-02 -1.390797e-01
foundation_type_slab total_area beds_total_2 beds_total_3
3.956151e-02 -1.737406e-08 1.331216e-02 8.019988e-02
beds_total_4 bath_full_0 bath_full_1 bath_full_3
1.103319e-01 -3.102680e-01 -2.807862e-01 1.357063e-01
bath_full_4 bath_full_5 bath_full_6 bath_half_0
2.015260e-01 -4.582929e-01 2.375432e-01 -1.116315e-01
bath_half_3 age days_on_market sewer_type_city
3.999695e-01 2.843575e-03 -3.518382e-04 2.599214e-02
stories property_style_mobile property_style_modular city_limit_bi
-3.899330e-02 -5.165234e-01 -3.166140e-01 2.311786e-02
termite_contract water_type_public waterfront_bi
1.529996e-01 -8.383389e-02 2.235910e-01
# Graphing MSE
par(mfrow = c(1,1))
plot(val.errors, ylab = "Test Mean Squared Error" , xlab = "Number of Variables", main = "Test MSE using Validation Set Approach")
lines(val.errors, lwd = 2, col = "blue")
abline(v = which.min(val.errors))
# A reusable way to get validation errors: define a predict() method for regsubsets objects
predict.regsubsets <- function(object, newdata, id, ...){ #predict() method for regsubsets()
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coef.i <- coef(object, id)
xvars <- names(coef.i)
mat[, xvars] %*% coef.i
}
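To make the matrix step inside predict.regsubsets() concrete, here is a small sketch on simulated data in which lm() stands in for one subset model. All names (sim, train_set, test_set, pred.manual, pred.lm) are illustrative. It shows that selecting the design-matrix columns named by the coefficients and multiplying gives the same predictions as predict().

```r
# Simulated data; lm() is used here only as a stand-in for one regsubsets model
set.seed(1)
sim <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
sim$y <- 3 + sim$x1 - 2 * sim$x3 + rnorm(50)
train_set <- sim[1:35, ]
test_set <- sim[36:50, ]

fit <- lm(y ~ x1 + x3, data = train_set)                # a model using a subset of the variables
coef.i <- coef(fit)                                     # named vector, including "(Intercept)"
mat <- model.matrix(y ~ x1 + x2 + x3, data = test_set)  # full design matrix for the test rows
pred.manual <- drop(mat[, names(coef.i)] %*% coef.i)    # keep only the columns the model uses
pred.lm <- predict(fit, newdata = test_set)
all.equal(unname(pred.manual), unname(pred.lm))
```

Indexing the matrix by names(coef.i) is what lets the same function serve every subset size.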
val.errors <- rep(0, 71)
for (i in 1:71){
pred.i <- predict(regfit.fwd, data_bi[test,], i)
val.errors[i] <- mean((log(data_bi$sold_price[test]) - pred.i)^2)
}
val.errors
[1] 0.3981389 0.3460456 0.3131914 0.2857834 0.2690171 0.2587374 0.2477949 0.2376078 0.2318361 0.2294201
[11] 0.2235994 0.2203535 0.2180126 0.2165033 0.2146756 0.2129553 0.2118975 0.2099353 0.2073327 0.2038707
[21] 0.2036459 0.2029766 0.2015432 0.2022469 0.2017999 0.2005087 0.1994757 0.1990417 0.1986158 0.1985793
[31] 0.1986899 0.1991310 0.1993242 0.1993470 0.1991443 0.1990559 0.1989207 0.1989950 0.1990568 0.1988342
[41] 0.1986396 0.1986040 0.1984280 0.1984807 0.1983732 0.1984372 0.1985181 0.1986470 0.1986643 0.1986536
[51] 0.1983438 0.1983531 0.1982854 0.1982847 0.1982942 0.1982212 0.1981750 0.1980991 0.1981443 0.1982113
[61] 0.1981467 0.1981829 0.1981768 0.1981728 0.1981549 0.1981229 0.1981381 0.1981239 0.1981097 0.1981131
[71] 0.1981076
which.min(val.errors) #Again, we see that 58-Variable model has min Test MSE
[1] 58
#k-fold cross-validation
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(data_bi), replace = TRUE)
sum.errors <- rep(0, 71)
sum2.errors <- rep(0, 71)
for (j in 1:k){
best.fit <- regsubsets(log(sold_price) ~ . ,
data = data_bi[folds != j,],
nvmax = 71,
method = "forward")
for (i in 1:71){
pred <- predict(best.fit, data_bi[folds == j,], i)
sum.errors[i] <- sum.errors[i] + sum((log(data_bi$sold_price[folds == j]) - pred)^2)
sum2.errors[i] <- sum2.errors[i] + sum(((log(data_bi$sold_price[folds == j]) - pred)^2)^2)
}
}
cv.errors <- sum.errors / nrow(data_bi) #Cross Validation Test Errors
cv.errors
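One design note on the fold assignment: sample(1:k, n, replace = TRUE) gives folds of unequal size. The sketch below, with made-up n and illustrative names folds_random and folds_balanced, compares it with a balanced alternative. This is only an option to consider, not a change to the analysis above.

```r
# Illustrative sizes; n and k here are not taken from the housing data
set.seed(1)
n <- 103
k <- 10
folds_random <- sample(1:k, n, replace = TRUE)      # approach used above: fold sizes vary
folds_balanced <- sample(rep(1:k, length.out = n))  # alternative: fold sizes differ by at most 1
table(folds_random)
table(folds_balanced)
```

With a data set of several thousand rows the difference is small, but balanced folds make the per-fold errors easier to compare.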
#Standard error (NOT standard deviation). Know the difference
se.errors <- 1 / sqrt(nrow(data_bi)) * sqrt(nrow(data_bi) / (nrow(data_bi) - 1) * (sum2.errors / nrow(data_bi) - cv.errors^2))
cv.errors; se.errors
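The standard-error formula can be checked against base R. In this sketch, sq.err is a simulated vector of squared prediction errors and the other names are illustrative. It shows that the expression built from the sum of squared errors and the sum of their squares equals sd() divided by sqrt(n).

```r
# Simulated squared prediction errors; stands in for one column of CV results
set.seed(1)
sq.err <- rnorm(500)^2
n <- length(sq.err)
cv.err <- sum(sq.err) / n     # plays the role of cv.errors[i]
sum2 <- sum(sq.err^2)         # plays the role of sum2.errors[i]
se.manual <- 1 / sqrt(n) * sqrt(n / (n - 1) * (sum2 / n - cv.err^2))
se.direct <- sd(sq.err) / sqrt(n)   # standard error of the mean squared error
all.equal(se.manual, se.direct)
```

So se.errors is the standard error of the mean squared error, which is much smaller than the standard deviation of the individual squared errors.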
which.min(cv.errors)
cv.errors <= min(cv.errors) + se.errors[which.min(cv.errors)] #TRUE for models whose CV error is within one standard error of the minimum CV error
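The one-standard-error rule can be wrapped in a small helper. This is a sketch under stated assumptions: one_se_choice is a hypothetical function name, and cv.toy and se.toy are made-up values rather than results from the housing data.

```r
# Hypothetical helper: smallest model whose CV error is within one SE of the minimum
one_se_choice <- function(cv.errors, se.errors) {
  min.id <- which.min(cv.errors)                      # model with the lowest CV error
  threshold <- cv.errors[min.id] + se.errors[min.id]  # minimum plus one standard error
  min(which(cv.errors <= threshold))                  # fewest variables under the threshold
}

# Toy values for illustration only
cv.toy <- c(0.40, 0.30, 0.25, 0.21, 0.205, 0.20, 0.202)
se.toy <- rep(0.012, 7)
one_se_choice(cv.toy, se.toy)
```

In the toy example the minimum is at model 6 (0.20), the threshold is 0.212, and the smallest model under it is model 4. Calling one_se_choice(cv.errors, se.errors) on the real vectors would give the size to use instead of reading it off the plot.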
# Errors
summary(regfit.fwd)$rss / nrow(data_bi) #Manual training MSE for each subset size (RSS / n, no squaring)
val.errors
cv.errors
#Graphing
#Note: Training error is tiny compared to test MSE of both validation and cross-validation approaches
plot(summary(regfit.fwd)$rss / nrow(data_bi), xlab = "Number of Variables", ylab = "Mean Squared Error",
type = "l", lwd = 2, col = "black", ylim = c(0,0.5))
lines(val.errors, lwd = 2, col = "red")
lines(cv.errors, lwd = 2, col = "blue")
legend("topright", legend = c("Training error (forward selection)", "Validation set approach", "10-fold cross-validation"), col = c("black", "red", "blue"), lty = 1, lwd = 2)
# At 58
abline(h = cv.errors[58], v = 58, lwd = 1, col = "cornflowerblue")
points(58, cv.errors[58], col = "cornflowerblue", cex = 2, pch = 20)
text(58, cv.errors[58], "Actual Minimum", pos = 3)
# At 58 +1SE
abline(h = cv.errors[58] + se.errors[58], v = 14, lwd = 1, col = "cornflowerblue")
points(14, cv.errors[14], col = "cornflowerblue", cex = 2, pch = 20)
text(14, cv.errors[14] + .002, "One-standard-error rule", pos = 3)
# Notes and to-dos:
# - It may be the case that the data is simply not good enough to predict any closer to the ideal fit on the training data.
# However, this doesn't change my ability to compare the improvements in predictability between subsets.
# - Need to find the 1-SE rule and implement it for a final variable selection level and model.
# - NOTE: the response has been switched to log(sold_price).
# - Need to run BEST subset selection for base_case.
# - Changed data set to binary only. Fit OLS with this?
# - We have found that 14 is the smallest number of variables whose CV error is within 1 standard
# error of the minimum test MSE (the 58-variable model),
# so we now run the best 14-variable model on the full data set.
This 14-variable model is the most parsimonious model (the one using the fewest variables) whose error is within 1 standard error of the 58-variable model, which produced the absolute minimum test MSE.
Printed below is the best 14-variable model from our data set according to a forward stepwise selection process.
coef(regfit.base, 14) #Final minimum test MSE + 1SE model on full data set
library(readxl)
library(tidyr) #drop_na() comes from tidyr
data_bi <- read_excel("Data/Data__Bi_ML_20.12.21.xlsx")
data_bi <- drop_na(data_bi) # Drop rows with NA values
attach(data_bi)
# Remove linear dependencies
names(data_bi)
data_bi <- subset(data_bi, select = -c(beds_total, school_general,
bath_full, bath_half, bath_half_4,
bath_full_7, property_type_DUP,
post_corona_bi, property_type_TNH,
roof_type_other, condition_other,
exterior_type_other, exterior_features_none,
foundation_type_other, beds_total_5,
beds_total_6, bath_half_5,
sewer_type_other, spa_location_none,
property_style_other, water_type_none, sold_date))
# Set x-y definitions for glmnet package
x <- model.matrix(log(sold_price) ~ . ,
data = data_bi)[, -1]
y <- log(data_bi$sold_price)
# General grid
grid <- exp(seq(10, -72, length = 101)) #grid of lambda values from exp(10) [null model] down to exp(-72) [effectively least squares]
# Questions: what is the 61?
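On the question of the 61: it is just a position in the lambda grid, so what it means depends on how the grid is built. A short base-R sketch (step is an illustrative name) shows the spacing of this grid on the log scale and the value at position 61.

```r
grid <- exp(seq(10, -72, length = 101))
step <- diff(log(grid))[1]   # spacing on the log scale: (-72 - 10) / 100 = -0.82
step
log(grid[61])                # 10 + 60 * (-0.82) = -39.2
range(grid)                  # smallest and largest lambda in the grid
```

With this grid, index 61 corresponds to log(lambda) = -39.2, which is effectively no penalty, so the coefficients at that index are close to the least-squares fit.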
# Ridge
par(mfrow = c(1,1))
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid) #if alpha = 0 then ridge regression (variables are standardized by default)
dim(coef(ridge.mod)) #one row for each predictor, plus an intercept, one column for each value of lambda
[1] 72 101
plot(ridge.mod, "lambda") #coefficients vs. log(lambda)
print(ridge.mod)
Call: glmnet(x = x, y = y, alpha = 0, lambda = grid)
Df %Dev Lambda
1 71 0.02 22030.0
2 71 0.04 9701.0
3 71 0.08 4273.0
4 71 0.19 1882.0
5 71 0.43 828.8
6 71 0.97 365.0
7 71 2.16 160.8
8 71 4.74 70.8
9 71 9.93 31.2
10 71 19.15 13.7
11 71 32.18 6.0
12 71 45.31 2.7
13 71 54.43 1.2
14 71 59.21 0.5
15 71 61.35 0.2
16 71 62.17 0.1
17 71 62.43 0.0
18 71 62.51 0.0
19 71 62.52 0.0
20 71 62.53 0.0
21 71 62.53 0.0
22 71 62.53 0.0
23 71 62.53 0.0
24 71 62.53 0.0
25 71 62.53 0.0
26 71 62.53 0.0
27 71 62.53 0.0
28 71 62.53 0.0
29 71 62.53 0.0
30 71 62.53 0.0
31 71 62.53 0.0
32 71 62.53 0.0
33 71 62.53 0.0
34 71 62.53 0.0
35 71 62.53 0.0
36 71 62.53 0.0
37 71 62.53 0.0
38 71 62.53 0.0
39 71 62.53 0.0
40 71 62.53 0.0
41 71 62.53 0.0
42 71 62.53 0.0
43 71 62.53 0.0
44 71 62.53 0.0
45 71 62.53 0.0
46 71 62.53 0.0
47 71 62.53 0.0
48 71 62.53 0.0
49 71 62.53 0.0
50 71 62.53 0.0
51 71 62.53 0.0
52 71 62.53 0.0
53 71 62.53 0.0
54 71 62.53 0.0
55 71 62.53 0.0
56 71 62.53 0.0
57 71 62.53 0.0
58 71 62.53 0.0
59 71 62.53 0.0
60 71 62.53 0.0
61 71 62.53 0.0
62 71 62.53 0.0
63 71 62.53 0.0
64 71 62.53 0.0
65 71 62.53 0.0
66 71 62.53 0.0
67 71 62.53 0.0
68 71 62.53 0.0
69 71 62.53 0.0
70 71 62.53 0.0
71 71 62.53 0.0
72 71 62.53 0.0
73 71 62.53 0.0
74 71 62.53 0.0
75 71 62.53 0.0
76 71 62.53 0.0
77 71 62.53 0.0
78 71 62.53 0.0
79 71 62.53 0.0
80 71 62.53 0.0
81 71 62.53 0.0
82 71 62.53 0.0
83 71 62.53 0.0
84 71 62.53 0.0
85 71 62.53 0.0
86 71 62.53 0.0
87 71 62.53 0.0
88 71 62.53 0.0
89 71 62.53 0.0
90 71 62.53 0.0
91 71 62.53 0.0
92 71 62.53 0.0
93 71 62.53 0.0
94 71 62.53 0.0
95 71 62.53 0.0
96 71 62.53 0.0
97 71 62.53 0.0
98 71 62.53 0.0
99 71 62.53 0.0
100 71 62.53 0.0
101 71 62.53 0.0
coef(ridge.mod, s = 0.1)
collapsing to unique 'x' values
72 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.044502e+01
property_type_CND -8.190396e-02
property_type_OTH 5.472962e-01
property_type_PAT 1.081101e-01
property_type_SGL 5.276265e-02
air_conditioning_central 4.199224e-01
appartment_bi -9.232795e-02
patio_bi 6.464450e-02
school_high 1.305857e-01
school_junior -3.793498e-01
school_middle 1.555841e-01
photo_count 8.466339e-03
pool_bi 5.594976e-02
rear_yard_access_bi 1.125973e-01
roof_type_metal -2.218945e-02
roof_type_shingle 1.828963e-01
roof_type_slate 1.155304e-01
gas_type_natural -2.909950e-02
out_building_livable_bi 1.485061e-01
out_building_not_livable_bi -1.063201e-02
living_area 2.046186e-04
land_acres 1.706656e-05
appliances_included_bi 2.530165e-01
garage_bi 8.509423e-02
condition_new 4.524760e-01
condition_excellent 3.951693e-01
condition_very_good 1.387157e-01
energy_efficient_bi 1.070482e-01
exterior_brick -2.285908e-02
exterior_type_metal -5.834935e-02
exterior_type_vinyl -2.436926e-02
exterior_type_wood -6.011719e-02
exterior_features_balcony 2.052184e-01
exterior_features_courtyard 2.424293e-01
exterior_features_fence -4.399838e-02
exterior_features_porch 1.667952e-02
exterior_features_tennis_court -3.988141e-02
fire_place_bi 1.027598e-01
foundation_type_raised -1.553237e-01
foundation_type_slab 4.812967e-02
total_area -9.424950e-09
beds_total_1 -3.831750e-02
beds_total_2 -6.862840e-02
beds_total_3 8.135931e-03
beds_total_4 4.758326e-02
bath_full_0 -1.258941e-01
bath_full_1 -2.339301e-01
bath_full_2 6.272852e-02
bath_full_3 1.812841e-01
bath_full_4 2.479715e-01
bath_full_5 -2.525897e-02
bath_full_6 6.368741e-01
bath_half_0 -5.849209e-02
bath_half_1 5.910688e-02
bath_half_2 5.443303e-02
bath_half_3 3.882697e-01
age 3.119376e-03
days_on_market -3.472691e-04
sewer_type_city 1.716631e-02
sewer_type_septic 1.416659e-03
spa_location_inside 1.141094e-01
spa_location_outside 1.485881e-01
stories -1.952608e-03
property_style_mobile -4.413359e-01
property_style_modular -1.840919e-01
city_limit_bi 2.330825e-02
subdivision_bi 1.845676e-03
termite_contract 1.647901e-01
water_type_public -4.028100e-02
water_type_well 7.377684e-02
water_type_other 3.946069e-02
waterfront_bi 1.829982e-01
# Note: coef(ridge.mod, s = "lambda.min") fails with "non-numeric argument to binary operator" because "lambda.min" is only defined for cv.glmnet objects; use coef(cv.out, s = "lambda.min") after the cross-validation step below.
# Lasso
par(mfrow = c(1,1))
lasso.mod <- glmnet(x, y, alpha = 1, lambda = grid) #if alpha = 1 then lasso (some of the coefficients will be exactly equal to zero)
dim(coef(lasso.mod))
[1] 72 101
plot(lasso.mod, "lambda")
lasso.mod$lambda[61]; log(lasso.mod$lambda[61])
[1] 9.454886e-18
[1] -39.2
coef(lasso.mod)[, 61]
(Intercept) property_type_CND property_type_OTH
1.027131e+01 -9.635935e-02 6.229564e-01
property_type_PAT property_type_SGL air_conditioning_central
1.280238e-01 5.668257e-02 4.576522e-01
appartment_bi patio_bi school_high
-1.365493e-01 5.744469e-02 1.089062e-01
school_junior school_middle photo_count
-5.163464e-01 2.342838e-01 9.142794e-03
pool_bi rear_yard_access_bi roof_type_metal
4.865636e-02 1.241408e-01 -4.444897e-03
roof_type_shingle roof_type_slate gas_type_natural
1.995057e-01 1.144544e-01 -2.994145e-02
out_building_livable_bi out_building_not_livable_bi living_area
1.263447e-01 -1.641365e-02 2.573496e-04
land_acres appliances_included_bi garage_bi
1.852096e-05 2.823372e-01 8.300679e-02
condition_new condition_excellent condition_very_good
6.120504e-01 5.007767e-01 1.690670e-01
energy_efficient_bi exterior_brick exterior_type_metal
1.139828e-01 -4.918078e-02 -5.146689e-02
exterior_type_vinyl exterior_type_wood exterior_features_balcony
-2.602425e-02 -6.920131e-02 2.055929e-01
exterior_features_courtyard exterior_features_fence exterior_features_porch
2.255922e-01 -6.295169e-02 5.717464e-03
exterior_features_tennis_court fire_place_bi foundation_type_raised
-6.096909e-02 9.519642e-02 -1.923984e-01
foundation_type_slab total_area beds_total_1
1.336588e-02 -1.267816e-08 8.773937e-02
beds_total_2 beds_total_3 beds_total_4
2.373623e-02 8.194183e-02 9.343540e-02
bath_full_0 bath_full_1 bath_full_2
-1.934697e-01 -2.953839e-01 2.620844e-03
bath_full_3 bath_full_4 bath_full_5
1.219724e-01 1.694761e-01 -1.765427e-01
bath_full_6 bath_half_0 bath_half_1
5.590766e-01 -3.608565e-02 7.908154e-02
bath_half_2 bath_half_3 age
3.769801e-02 3.669492e-01 3.780402e-03
days_on_market sewer_type_city sewer_type_septic
-4.049446e-04 1.508662e-02 3.313740e-03
spa_location_inside spa_location_outside stories
9.917311e-02 1.313907e-01 -3.336124e-02
property_style_mobile property_style_modular city_limit_bi
-5.039226e-01 -2.105006e-01 4.356197e-02
subdivision_bi termite_contract water_type_public
1.809149e-02 1.531048e-01 -5.318026e-02
water_type_well water_type_other waterfront_bi
7.171215e-02 4.643681e-02 1.964201e-01
sum(abs(coef(lasso.mod)[-1, 61])) #l1 norm
[1] 9.816054
plot(lasso.mod)
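sum(abs(predict(lasso.mod, s = 0, exact = TRUE, type = "coefficients", x = x, y = y)[2:29])) #l1 norm at lambda = 0; note that 2:29 covers only the first 28 of the 71 predictors (use [-1] for all of them)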
collapsing to unique 'x' values
[1] 4.899363
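The norm calculations in this section index the coefficients with 2:29, which skips the intercept but stops after the first 28 predictors. The sketch below uses a made-up coefficient vector (coefs, with hypothetical names var1 to var71) to show the difference between that slice and a norm over all 71 predictors.

```r
# Made-up coefficients: one intercept plus 71 predictors (values are not from the model)
set.seed(1)
coefs <- c(10.3, rnorm(71, sd = 0.2))
names(coefs) <- c("(Intercept)", paste0("var", 1:71))

l1_first28 <- sum(abs(coefs[2:29]))   # what an index of 2:29 measures
l1_all <- sum(abs(coefs[-1]))         # l1 norm over all predictors, intercept excluded
l2_all <- sqrt(sum(coefs[-1]^2))      # l2 norm over all predictors, intercept excluded
c(l1_first28 = l1_first28, l1_all = l1_all, l2_all = l2_all)
```

Dropping the intercept with [-1] keeps the calculation correct if the number of predictors changes again.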
#k-fold cross-validation
# Ridge
par(mfrow = c(1,1))
set.seed(1)
cv.out <- cv.glmnet(x, y, alpha = 0, lambda = grid, nfolds = 10) #ridge regression (ten-fold cross-validation)
collapsing to unique 'x' values [message repeated 10 times]
plot(cv.out) #test MSE vs. log(lambda)
coef(cv.out, s = "lambda.min")
collapsing to unique 'x' values
72 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.029982e+01
property_type_CND -9.285520e-02
property_type_OTH 6.079979e-01
property_type_PAT 1.228874e-01
property_type_SGL 5.522544e-02
air_conditioning_central 4.501073e-01
appartment_bi -1.271012e-01
patio_bi 5.907678e-02
school_high 1.337090e-01
school_junior -4.834311e-01
school_middle 1.957628e-01
photo_count 9.006747e-03
pool_bi 5.049495e-02
rear_yard_access_bi 1.209132e-01
roof_type_metal -8.607398e-03
roof_type_shingle 1.960967e-01
roof_type_slate 1.175016e-01
gas_type_natural -2.955014e-02
out_building_livable_bi 1.311425e-01
out_building_not_livable_bi -1.503507e-02
living_area 2.410222e-04
land_acres 1.833708e-05
appliances_included_bi 2.757159e-01
garage_bi 8.380439e-02
condition_new 5.708364e-01
condition_excellent 4.745168e-01
condition_very_good 1.616231e-01
energy_efficient_bi 1.126265e-01
exterior_brick -4.289466e-02
exterior_type_metal -5.277627e-02
exterior_type_vinyl -2.570559e-02
exterior_type_wood -6.671777e-02
exterior_features_balcony 2.061056e-01
exterior_features_courtyard 2.315487e-01
exterior_features_fence -5.864469e-02
exterior_features_porch 8.368235e-03
exterior_features_tennis_court -6.234307e-02
fire_place_bi 9.803442e-02
foundation_type_raised -1.825316e-01
foundation_type_slab 2.264904e-02
total_area -1.185654e-08
beds_total_1 2.948293e-02
beds_total_2 -2.480850e-02
beds_total_3 4.014147e-02
beds_total_4 6.084056e-02
bath_full_0 -1.289124e-01
bath_full_1 -2.342684e-01
bath_full_2 6.455283e-02
bath_full_3 1.827602e-01
bath_full_4 2.326052e-01
bath_full_5 -9.687905e-02
bath_full_6 6.322253e-01
bath_half_0 -5.282689e-02
bath_half_1 6.334629e-02
bath_half_2 3.175875e-02
bath_half_3 3.650931e-01
age 3.627154e-03
days_on_market -3.915258e-04
sewer_type_city 1.573334e-02
sewer_type_septic 3.285474e-03
spa_location_inside 1.039582e-01
spa_location_outside 1.331717e-01
stories -2.618781e-02
property_style_mobile -4.908022e-01
property_style_modular -2.061023e-01
city_limit_bi 3.948137e-02
subdivision_bi 1.374761e-02
termite_contract 1.574858e-01
water_type_public -4.667732e-02
water_type_well 7.677674e-02
water_type_other 4.844728e-02
waterfront_bi 1.944598e-01
bestlam <- cv.out$lambda.min; bestlam; log(bestlam) #value of lambda that results in the smallest cross-validation error
[1] 0.01944821
[1] -3.94
out <- cv.out$glmnet.fit #full data set
ridge.coef <- predict(out, type = "coefficients", s = bestlam); ridge.coef
collapsing to unique 'x' values
72 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.029982e+01
property_type_CND -9.285520e-02
property_type_OTH 6.079979e-01
property_type_PAT 1.228874e-01
property_type_SGL 5.522544e-02
air_conditioning_central 4.501073e-01
appartment_bi -1.271012e-01
patio_bi 5.907678e-02
school_high 1.337090e-01
school_junior -4.834311e-01
school_middle 1.957628e-01
photo_count 9.006747e-03
pool_bi 5.049495e-02
rear_yard_access_bi 1.209132e-01
roof_type_metal -8.607398e-03
roof_type_shingle 1.960967e-01
roof_type_slate 1.175016e-01
gas_type_natural -2.955014e-02
out_building_livable_bi 1.311425e-01
out_building_not_livable_bi -1.503507e-02
living_area 2.410222e-04
land_acres 1.833708e-05
appliances_included_bi 2.757159e-01
garage_bi 8.380439e-02
condition_new 5.708364e-01
condition_excellent 4.745168e-01
condition_very_good 1.616231e-01
energy_efficient_bi 1.126265e-01
exterior_brick -4.289466e-02
exterior_type_metal -5.277627e-02
exterior_type_vinyl -2.570559e-02
exterior_type_wood -6.671777e-02
exterior_features_balcony 2.061056e-01
exterior_features_courtyard 2.315487e-01
exterior_features_fence -5.864469e-02
exterior_features_porch 8.368235e-03
exterior_features_tennis_court -6.234307e-02
fire_place_bi 9.803442e-02
foundation_type_raised -1.825316e-01
foundation_type_slab 2.264904e-02
total_area -1.185654e-08
beds_total_1 2.948293e-02
beds_total_2 -2.480850e-02
beds_total_3 4.014147e-02
beds_total_4 6.084056e-02
bath_full_0 -1.289124e-01
bath_full_1 -2.342684e-01
bath_full_2 6.455283e-02
bath_full_3 1.827602e-01
bath_full_4 2.326052e-01
bath_full_5 -9.687905e-02
bath_full_6 6.322253e-01
bath_half_0 -5.282689e-02
bath_half_1 6.334629e-02
bath_half_2 3.175875e-02
bath_half_3 3.650931e-01
age 3.627154e-03
days_on_market -3.915258e-04
sewer_type_city 1.573334e-02
sewer_type_septic 3.285474e-03
spa_location_inside 1.039582e-01
spa_location_outside 1.331717e-01
stories -2.618781e-02
property_style_mobile -4.908022e-01
property_style_modular -2.061023e-01
city_limit_bi 3.948137e-02
subdivision_bi 1.374761e-02
termite_contract 1.574858e-01
water_type_public -4.667732e-02
water_type_well 7.677674e-02
water_type_other 4.844728e-02
waterfront_bi 1.944598e-01
sqrt(sum(ridge.coef[2:29]^2)) #l2 norm of the first 28 predictors only; use ridge.coef[-1] to cover all 71
[1] 1.29274
bestlam2 <- cv.out$lambda.1se; bestlam2; log(bestlam2) #one-standard-error rule
[1] 0.5168513
[1] -0.66
ridge.coef2 <- predict(out, type = "coefficients", s = bestlam2); ridge.coef2
collapsing to unique 'x' values
72 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.077077e+01
property_type_CND -5.490650e-02
property_type_OTH 3.352204e-01
property_type_PAT 6.874305e-02
property_type_SGL 3.584764e-02
air_conditioning_central 3.173300e-01
appartment_bi 7.724902e-03
patio_bi 7.510424e-02
school_high 9.180535e-02
school_junior -1.549144e-01
school_middle 9.873627e-02
photo_count 6.580859e-03
pool_bi 6.662178e-02
rear_yard_access_bi 9.678706e-02
roof_type_metal -5.143247e-02
roof_type_shingle 1.419989e-01
roof_type_slate 8.464418e-02
gas_type_natural -2.715091e-02
out_building_livable_bi 1.643073e-01
out_building_not_livable_bi 1.484165e-03
living_area 1.441003e-04
land_acres 1.151787e-05
appliances_included_bi 1.910882e-01
garage_bi 8.151176e-02
condition_new 2.390557e-01
condition_excellent 2.401115e-01
condition_very_good 8.795974e-02
energy_efficient_bi 8.528046e-02
exterior_brick 1.511077e-02
exterior_type_metal -7.059224e-02
exterior_type_vinyl -2.155127e-02
exterior_type_wood -4.995540e-02
exterior_features_balcony 1.776841e-01
exterior_features_courtyard 2.298976e-01
exterior_features_fence -1.063526e-02
exterior_features_porch 2.966085e-02
exterior_features_tennis_court 2.470670e-02
fire_place_bi 1.000684e-01
foundation_type_raised -1.127591e-01
foundation_type_slab 8.160771e-02
total_area -3.865442e-09
beds_total_1 -1.096740e-01
beds_total_2 -9.888148e-02
beds_total_3 -9.238614e-03
beds_total_4 6.108453e-02
bath_full_0 -1.236595e-01
bath_full_1 -2.038255e-01
bath_full_2 5.526330e-02
bath_full_3 1.550280e-01
bath_full_4 2.327264e-01
bath_full_5 8.435515e-02
bath_full_6 5.189993e-01
bath_half_0 -5.622411e-02
bath_half_1 5.301270e-02
bath_half_2 9.377248e-02
bath_half_3 3.703164e-01
age 1.823671e-03
days_on_market -2.287987e-04
sewer_type_city 1.750425e-02
sewer_type_septic -7.663245e-03
spa_location_inside 1.235207e-01
spa_location_outside 1.715415e-01
stories 4.455519e-02
property_style_mobile -3.018031e-01
property_style_modular -1.215730e-01
city_limit_bi -1.795733e-02
subdivision_bi -9.982497e-03
termite_contract 1.493581e-01
water_type_public -2.048997e-02
water_type_well 5.779633e-02
water_type_other 1.915258e-02
waterfront_bi 1.374400e-01
sqrt(sum(ridge.coef2[2:29]^2)) #l2 norm of the first 28 predictors only; use ridge.coef2[-1] to cover all 71
[1] 0.7174466
#Lasso
set.seed(1)
cv.out <- cv.glmnet(x, y, alpha = 1, lambda = grid, nfolds = 10) #lasso
collapsing to unique 'x' values [message repeated 10 times]
plot(cv.out)
bestlam <- cv.out$lambda.min; bestlam; log(bestlam)
[1] 0.001661557
[1] -6.4
out <- cv.out$glmnet.fit
lasso.coef <- predict(out, type = "coefficients", s = bestlam); lasso.coef; lasso.coef[lasso.coef != 0]
collapsing to unique 'x' values
72 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.046040e+01
property_type_CND -8.531091e-02
property_type_OTH 4.865828e-01
property_type_PAT 7.955454e-02
property_type_SGL 3.984324e-02
air_conditioning_central 4.561841e-01
appartment_bi -5.543432e-02
patio_bi 5.646840e-02
school_high 1.029861e-01
school_junior -4.826426e-01
school_middle 2.250629e-01
photo_count 8.935697e-03
pool_bi 4.254607e-02
rear_yard_access_bi 1.148713e-01
roof_type_metal -2.742459e-03
roof_type_shingle 1.972797e-01
roof_type_slate 8.946792e-02
gas_type_natural -2.652554e-02
out_building_livable_bi 9.409751e-02
out_building_not_livable_bi -1.264038e-02
living_area 2.539450e-04
land_acres 7.802352e-06
appliances_included_bi 2.819675e-01
garage_bi 8.332536e-02
condition_new 5.649665e-01
condition_excellent 4.839410e-01
condition_very_good 1.625884e-01
energy_efficient_bi 1.129558e-01
exterior_brick -3.653939e-02
exterior_type_metal -3.917235e-02
exterior_type_vinyl -1.790019e-02
exterior_type_wood -5.621107e-02
exterior_features_balcony 1.865617e-01
exterior_features_courtyard 2.076102e-01
exterior_features_fence -5.726480e-02
exterior_features_porch 6.576076e-04
exterior_features_tennis_court .
fire_place_bi 9.479505e-02
foundation_type_raised -1.882866e-01
foundation_type_slab 1.698236e-02
total_area -7.281619e-09
beds_total_1 .
beds_total_2 -4.111505e-02
beds_total_3 1.895527e-02
beds_total_4 3.772588e-02
bath_full_0 -1.399782e-01
bath_full_1 -2.959936e-01
bath_full_2 .
bath_full_3 1.073872e-01
bath_full_4 1.300966e-01
bath_full_5 -1.642456e-01
bath_full_6 4.348065e-01
bath_half_0 -7.599938e-02
bath_half_1 3.335157e-02
bath_half_2 .
bath_half_3 2.543969e-01
age 3.570307e-03
days_on_market -3.871715e-04
sewer_type_city 1.184920e-02
sewer_type_septic .
spa_location_inside 3.956261e-02
spa_location_outside 7.491554e-02
stories -2.362044e-02
property_style_mobile -5.019624e-01
property_style_modular -1.792179e-01
city_limit_bi 2.468508e-02
subdivision_bi 7.251849e-03
termite_contract 1.470587e-01
water_type_public -8.928889e-02
water_type_well 1.435507e-02
water_type_other .
waterfront_bi 1.914500e-01
<sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
[1] 1.046040e+01 -8.531091e-02 4.865828e-01 7.955454e-02 3.984324e-02 4.561841e-01 -5.543432e-02
[8] 5.646840e-02 1.029861e-01 -4.826426e-01 2.250629e-01 8.935697e-03 4.254607e-02 1.148713e-01
[15] -2.742459e-03 1.972797e-01 8.946792e-02 -2.652554e-02 9.409751e-02 -1.264038e-02 2.539450e-04
[22] 7.802352e-06 2.819675e-01 8.332536e-02 5.649665e-01 4.839410e-01 1.625884e-01 1.129558e-01
[29] -3.653939e-02 -3.917235e-02 -1.790019e-02 -5.621107e-02 1.865617e-01 2.076102e-01 -5.726480e-02
[36] 6.576076e-04 9.479505e-02 -1.882866e-01 1.698236e-02 -7.281619e-09 -4.111505e-02 1.895527e-02
[43] 3.772588e-02 -1.399782e-01 -2.959936e-01 1.073872e-01 1.300966e-01 -1.642456e-01 4.348065e-01
[50] -7.599938e-02 3.335157e-02 2.543969e-01 3.570307e-03 -3.871715e-04 1.184920e-02 3.956261e-02
[57] 7.491554e-02 -2.362044e-02 -5.019624e-01 -1.792179e-01 2.468508e-02 7.251849e-03 1.470587e-01
[64] -8.928889e-02 1.435507e-02 1.914500e-01
sum(abs(lasso.coef[2:29])) #l1 norm of the first 28 predictors only; use lasso.coef[-1] to cover all 71
[1] 4.385722
bestlam2 <- cv.out$lambda.1se; bestlam2; log(bestlam2)
[1] 0.01944821
[1] -3.94
This is the final reduced 32-variable model (33 nonzero coefficients including the intercept), selected by LASSO with the one-standard-error rule (lambda.1se) under 10-fold CV.
Note that variables with “.” instead of coefficients were eliminated from the final model.
lasso.coef2 <- predict(out, type = "coefficients", s = bestlam2); lasso.coef2; lasso.coef2[lasso.coef2 != 0]
collapsing to unique 'x' values
72 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 10.5132013376
property_type_CND .
property_type_OTH .
property_type_PAT .
property_type_SGL .
air_conditioning_central 0.4316100327
appartment_bi .
patio_bi 0.0468739427
school_high 0.0286314948
school_junior -0.0983667993
school_middle 0.1842698523
photo_count 0.0070072231
pool_bi .
rear_yard_access_bi 0.0447876432
roof_type_metal .
roof_type_shingle 0.1824942418
roof_type_slate .
gas_type_natural -0.0073861928
out_building_livable_bi .
out_building_not_livable_bi .
living_area 0.0002836130
land_acres .
appliances_included_bi 0.2595525432
garage_bi 0.0702853119
condition_new 0.1431954282
condition_excellent 0.3643838708
condition_very_good 0.1136273898
energy_efficient_bi 0.0947213248
exterior_brick .
exterior_type_metal .
exterior_type_vinyl .
exterior_type_wood .
exterior_features_balcony 0.0634554504
exterior_features_courtyard 0.0336681268
exterior_features_fence .
exterior_features_porch .
exterior_features_tennis_court .
fire_place_bi 0.0664410944
foundation_type_raised -0.1377834041
foundation_type_slab 0.0603071649
total_area .
beds_total_1 .
beds_total_2 -0.0285079488
beds_total_3 .
beds_total_4 0.0087738276
bath_full_0 .
bath_full_1 -0.2911410493
bath_full_2 .
bath_full_3 0.0642121320
bath_full_4 .
bath_full_5 .
bath_full_6 .
bath_half_0 -0.0688215545
bath_half_1 .
bath_half_2 .
bath_half_3 .
age 0.0017208536
days_on_market -0.0002015102
sewer_type_city .
sewer_type_septic .
spa_location_inside .
spa_location_outside .
stories .
property_style_mobile -0.4410226829
property_style_modular .
city_limit_bi .
subdivision_bi .
termite_contract 0.1392234915
water_type_public -0.0016602213
water_type_well .
water_type_other .
waterfront_bi 0.1260309664
<sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
[1] 10.5132013376 0.4316100327 0.0468739427 0.0286314948 -0.0983667993 0.1842698523 0.0070072231
[8] 0.0447876432 0.1824942418 -0.0073861928 0.0002836130 0.2595525432 0.0702853119 0.1431954282
[15] 0.3643838708 0.1136273898 0.0947213248 0.0634554504 0.0336681268 0.0664410944 -0.1377834041
[22] 0.0603071649 -0.0285079488 0.0087738276 -0.2911410493 0.0642121320 -0.0688215545 0.0017208536
[29] -0.0002015102 -0.4410226829 0.1392234915 -0.0016602213 0.1260309664
sum(abs(lasso.coef2[2:29])) #l1 norm of the first 28 predictors only; use lasso.coef2[-1] to cover all 71
[1] 2.077477
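To list the variables the lasso kept without counting by eye, a small sketch is shown here. The six coefficients are copied from the lambda.1se output above, with "." entries written as 0. The names coefs and kept are illustrative.

```r
# A handful of coefficients from the lambda.1se fit; "." entries are exact zeros
coefs <- c("(Intercept)" = 10.5132013376,
           property_type_CND = 0,
           air_conditioning_central = 0.4316100327,
           appartment_bi = 0,
           patio_bi = 0.0468739427,
           pool_bi = 0)

# Keep the names of nonzero coefficients, leaving out the intercept
kept <- names(coefs)[coefs != 0 & names(coefs) != "(Intercept)"]
kept
length(kept)
```

For the full fit, drop(as.matrix(lasso.coef2)) should give a named vector that can be passed through the same two lines. That conversion is an assumption about the sparse matrix returned by predict().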
End of Document