This portfolio uses the Ames Housing dataset to
demonstrate statistical modeling with R and tidymodels. The
dataset includes information about residential homes sold in Ames, Iowa,
including size, quality, location, amenities, and sale price.
The main outcome is sale_price, the final selling price
of a home. Because home prices are usually right-skewed, I also create
log_sale_price, a transformed version of the outcome used
for the main regression models. This transformation reduces the skew,
makes the linear modeling framework more appropriate, and allows
coefficients to be interpreted as approximate relative (percentage)
changes in price.
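As a quick illustration of the log-scale interpretation (a sketch, not part of the modeling pipeline), a difference on the log scale translates directly into a multiplicative change in price:

# Illustration: a difference of 0.10 in log price corresponds to a
# multiplicative change of exp(0.10), roughly a 10.5% increase in price.
exp(0.10) - 1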
The analysis focuses on one guiding question:
Which housing characteristics are most useful for explaining and predicting home sale prices?
To answer this question, I use multiple regression, polynomial regression, lasso regression, multinomial logistic regression, and count (Poisson) regression. Together, these models demonstrate how statistical methods can be used to study both continuous and categorical outcomes.
library(tidymodels)
library(AmesHousing)  # make_ames()
library(tidyverse)
library(janitor)      # clean_names()
library(skimr)        # skim() summaries
library(scales)       # dollar axis labels
library(forcats)
library(broom)        # tidy() / glance()
library(glmnet)       # lasso engine
library(vip)          # variable importance
library(MASS)         # stepAIC() backward selection
library(poissonreg)   # Poisson regression engine for later count models
tidymodels_prefer()   # resolve common function-name conflicts
set.seed(42)
ames <- make_ames() |>
clean_names() |>
select(-longitude, -latitude) |>
mutate(log_sale_price = log(sale_price))
glimpse(ames)
## Rows: 2,930
## Columns: 80
## $ ms_sub_class <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ ms_zoning <fct> Residential_Low_Density, Residential_High_Density, …
## $ lot_frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ alley <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ lot_shape <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ land_contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ lot_config <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ land_slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ bldg_type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ house_style <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ overall_qual <fct> Above_Average, Average, Above_Average, Good, Averag…
## $ overall_cond <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ year_built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ roof_style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ roof_matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ mas_vnr_type <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ mas_vnr_area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ exter_qual <fct> Typical, Typical, Typical, Good, Typical, Typical, …
## $ exter_cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ bsmt_qual <fct> Typical, Typical, Typical, Typical, Good, Typical, …
## $ bsmt_cond <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ bsmt_exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ bsmt_fin_type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ bsmt_fin_sf_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ bsmt_fin_type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ bsmt_fin_sf_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ bsmt_unf_sf <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ total_bsmt_sf <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ heating_qc <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ central_air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ first_flr_sf <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ second_flr_sf <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ bsmt_full_bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ bsmt_half_bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ half_bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ bedroom_abv_gr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ kitchen_abv_gr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual <fct> Typical, Typical, Good, Excellent, Typical, Good, G…
## $ tot_rms_abv_grd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ fireplace_qu <fct> Good, No_Fireplace, No_Fireplace, Typical, Typical,…
## $ garage_type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ garage_finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ garage_cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ garage_area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ garage_qual <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ garage_cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ paved_drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ wood_deck_sf <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ open_porch_sf <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ enclosed_porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ screen_porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ pool_area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ misc_feature <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ misc_val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ mo_sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ year_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ sale_type <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ sale_condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ sale_price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ log_sale_price <dbl> 12.27839, 11.56172, 12.05525, 12.40492, 12.15425, 1…
dim(ames)
## [1] 2930 80
sum(is.na(ames))
## [1] 0
skim(ames)
| | |
|---|---|
| Name | ames |
| Number of rows | 2930 |
| Number of columns | 80 |
| Column type frequency: | |
| factor | 46 |
| numeric | 34 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| ms_sub_class | 0 | 1 | FALSE | 16 | One: 1079, Two: 575, One: 287, One: 192 |
| ms_zoning | 0 | 1 | FALSE | 7 | Res: 2273, Res: 462, Flo: 139, Res: 27 |
| street | 0 | 1 | FALSE | 2 | Pav: 2918, Grv: 12 |
| alley | 0 | 1 | FALSE | 3 | No_: 2732, Gra: 120, Pav: 78 |
| lot_shape | 0 | 1 | FALSE | 4 | Reg: 1859, Sli: 979, Mod: 76, Irr: 16 |
| land_contour | 0 | 1 | FALSE | 4 | Lvl: 2633, HLS: 120, Bnk: 117, Low: 60 |
| utilities | 0 | 1 | FALSE | 3 | All: 2927, NoS: 2, NoS: 1 |
| lot_config | 0 | 1 | FALSE | 5 | Ins: 2140, Cor: 511, Cul: 180, FR2: 85 |
| land_slope | 0 | 1 | FALSE | 3 | Gtl: 2789, Mod: 125, Sev: 16 |
| neighborhood | 0 | 1 | FALSE | 28 | Nor: 443, Col: 267, Old: 239, Edw: 194 |
| condition_1 | 0 | 1 | FALSE | 9 | Nor: 2522, Fee: 164, Art: 92, RRA: 50 |
| condition_2 | 0 | 1 | FALSE | 8 | Nor: 2900, Fee: 13, Art: 5, Pos: 4 |
| bldg_type | 0 | 1 | FALSE | 5 | One: 2425, Twn: 233, Dup: 109, Twn: 101 |
| house_style | 0 | 1 | FALSE | 8 | One: 1481, Two: 873, One: 314, SLv: 128 |
| overall_qual | 0 | 1 | FALSE | 10 | Ave: 825, Abo: 732, Goo: 602, Ver: 350 |
| overall_cond | 0 | 1 | FALSE | 9 | Ave: 1654, Abo: 533, Goo: 390, Ver: 144 |
| roof_style | 0 | 1 | FALSE | 6 | Gab: 2321, Hip: 551, Gam: 22, Fla: 20 |
| roof_matl | 0 | 1 | FALSE | 8 | Com: 2887, Tar: 23, WdS: 9, WdS: 7 |
| exterior_1st | 0 | 1 | FALSE | 16 | Vin: 1026, Met: 450, HdB: 442, Wd : 420 |
| exterior_2nd | 0 | 1 | FALSE | 17 | Vin: 1015, Met: 447, HdB: 406, Wd : 397 |
| mas_vnr_type | 0 | 1 | FALSE | 5 | Non: 1775, Brk: 880, Sto: 249, Brk: 25 |
| exter_qual | 0 | 1 | FALSE | 4 | Typ: 1799, Goo: 989, Exc: 107, Fai: 35 |
| exter_cond | 0 | 1 | FALSE | 5 | Typ: 2549, Goo: 299, Fai: 67, Exc: 12 |
| foundation | 0 | 1 | FALSE | 6 | PCo: 1310, CBl: 1244, Brk: 311, Sla: 49 |
| bsmt_qual | 0 | 1 | FALSE | 6 | Typ: 1283, Goo: 1219, Exc: 258, Fai: 88 |
| bsmt_cond | 0 | 1 | FALSE | 6 | Typ: 2616, Goo: 122, Fai: 104, No_: 80 |
| bsmt_exposure | 0 | 1 | FALSE | 5 | No: 1906, Av: 418, Gd: 284, Mn: 239 |
| bsmt_fin_type_1 | 0 | 1 | FALSE | 7 | GLQ: 859, Unf: 851, ALQ: 429, Rec: 288 |
| bsmt_fin_type_2 | 0 | 1 | FALSE | 7 | Unf: 2499, Rec: 106, LwQ: 89, No_: 81 |
| heating | 0 | 1 | FALSE | 6 | Gas: 2885, Gas: 27, Gra: 9, Wal: 6 |
| heating_qc | 0 | 1 | FALSE | 5 | Exc: 1495, Typ: 864, Goo: 476, Fai: 92 |
| central_air | 0 | 1 | FALSE | 2 | Y: 2734, N: 196 |
| electrical | 0 | 1 | FALSE | 6 | SBr: 2682, Fus: 188, Fus: 50, Fus: 8 |
| kitchen_qual | 0 | 1 | FALSE | 5 | Typ: 1494, Goo: 1160, Exc: 205, Fai: 70 |
| functional | 0 | 1 | FALSE | 8 | Typ: 2728, Min: 70, Min: 65, Mod: 35 |
| fireplace_qu | 0 | 1 | FALSE | 6 | No_: 1422, Goo: 744, Typ: 600, Fai: 75 |
| garage_type | 0 | 1 | FALSE | 7 | Att: 1731, Det: 782, Bui: 186, No_: 157 |
| garage_finish | 0 | 1 | FALSE | 4 | Unf: 1231, RFn: 812, Fin: 728, No_: 159 |
| garage_qual | 0 | 1 | FALSE | 6 | Typ: 2615, No_: 159, Fai: 124, Goo: 24 |
| garage_cond | 0 | 1 | FALSE | 6 | Typ: 2665, No_: 159, Fai: 74, Goo: 15 |
| paved_drive | 0 | 1 | FALSE | 3 | Pav: 2652, Dir: 216, Par: 62 |
| pool_qc | 0 | 1 | FALSE | 5 | No_: 2917, Exc: 4, Goo: 4, Typ: 3 |
| fence | 0 | 1 | FALSE | 5 | No_: 2358, Min: 330, Goo: 118, Goo: 112 |
| misc_feature | 0 | 1 | FALSE | 6 | Non: 2824, She: 95, Gar: 5, Oth: 4 |
| sale_type | 0 | 1 | FALSE | 10 | WD : 2536, New: 239, COD: 87, Con: 26 |
| sale_condition | 0 | 1 | FALSE | 6 | Nor: 2413, Par: 245, Abn: 190, Fam: 46 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lot_frontage | 0 | 1 | 57.65 | 33.50 | 0.00 | 43.00 | 63.00 | 78.00 | 313.00 | ▇▇▁▁▁ |
| lot_area | 0 | 1 | 10147.92 | 7880.02 | 1300.00 | 7440.25 | 9436.50 | 11555.25 | 215245.00 | ▇▁▁▁▁ |
| year_built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.00 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
| year_remod_add | 0 | 1 | 1984.27 | 20.86 | 1950.00 | 1965.00 | 1993.00 | 2004.00 | 2010.00 | ▅▂▂▃▇ |
| mas_vnr_area | 0 | 1 | 101.10 | 178.63 | 0.00 | 0.00 | 0.00 | 162.75 | 1600.00 | ▇▁▁▁▁ |
| bsmt_fin_sf_1 | 0 | 1 | 4.18 | 2.23 | 0.00 | 3.00 | 3.00 | 7.00 | 7.00 | ▃▂▇▁▇ |
| bsmt_fin_sf_2 | 0 | 1 | 49.71 | 169.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1526.00 | ▇▁▁▁▁ |
| bsmt_unf_sf | 0 | 1 | 559.07 | 439.54 | 0.00 | 219.00 | 465.50 | 801.75 | 2336.00 | ▇▅▂▁▁ |
| total_bsmt_sf | 0 | 1 | 1051.26 | 440.97 | 0.00 | 793.00 | 990.00 | 1301.50 | 6110.00 | ▇▃▁▁▁ |
| first_flr_sf | 0 | 1 | 1159.56 | 391.89 | 334.00 | 876.25 | 1084.00 | 1384.00 | 5095.00 | ▇▃▁▁▁ |
| second_flr_sf | 0 | 1 | 335.46 | 428.40 | 0.00 | 0.00 | 0.00 | 703.75 | 2065.00 | ▇▃▂▁▁ |
| low_qual_fin_sf | 0 | 1 | 4.68 | 46.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1064.00 | ▇▁▁▁▁ |
| gr_liv_area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.00 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
| bsmt_full_bath | 0 | 1 | 0.43 | 0.52 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 | ▇▆▁▁▁ |
| bsmt_half_bath | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| full_bath | 0 | 1 | 1.57 | 0.55 | 0.00 | 1.00 | 2.00 | 2.00 | 4.00 | ▁▇▇▁▁ |
| half_bath | 0 | 1 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | ▇▁▅▁▁ |
| bedroom_abv_gr | 0 | 1 | 2.85 | 0.83 | 0.00 | 2.00 | 3.00 | 3.00 | 8.00 | ▁▇▂▁▁ |
| kitchen_abv_gr | 0 | 1 | 1.04 | 0.21 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▁▇▁▁▁ |
| tot_rms_abv_grd | 0 | 1 | 6.44 | 1.57 | 2.00 | 5.00 | 6.00 | 7.00 | 15.00 | ▁▇▂▁▁ |
| fireplaces | 0 | 1 | 0.60 | 0.65 | 0.00 | 0.00 | 1.00 | 1.00 | 4.00 | ▇▇▁▁▁ |
| garage_cars | 0 | 1 | 1.77 | 0.76 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | ▅▇▂▁▁ |
| garage_area | 0 | 1 | 472.66 | 215.19 | 0.00 | 320.00 | 480.00 | 576.00 | 1488.00 | ▃▇▃▁▁ |
| wood_deck_sf | 0 | 1 | 93.75 | 126.36 | 0.00 | 0.00 | 0.00 | 168.00 | 1424.00 | ▇▁▁▁▁ |
| open_porch_sf | 0 | 1 | 47.53 | 67.48 | 0.00 | 0.00 | 27.00 | 70.00 | 742.00 | ▇▁▁▁▁ |
| enclosed_porch | 0 | 1 | 23.01 | 64.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1012.00 | ▇▁▁▁▁ |
| three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0.00 | 508.00 | ▇▁▁▁▁ |
| screen_porch | 0 | 1 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0.00 | 576.00 | ▇▁▁▁▁ |
| pool_area | 0 | 1 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 | ▇▁▁▁▁ |
| misc_val | 0 | 1 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0.00 | 17000.00 | ▇▁▁▁▁ |
| mo_sold | 0 | 1 | 6.22 | 2.71 | 1.00 | 4.00 | 6.00 | 8.00 | 12.00 | ▅▆▇▃▃ |
| year_sold | 0 | 1 | 2007.79 | 1.32 | 2006.00 | 2007.00 | 2008.00 | 2009.00 | 2010.00 | ▇▇▇▇▃ |
| sale_price | 0 | 1 | 180796.06 | 79886.69 | 12789.00 | 129500.00 | 160000.00 | 213500.00 | 755000.00 | ▇▇▁▁▁ |
| log_sale_price | 0 | 1 | 12.02 | 0.41 | 9.46 | 11.77 | 11.98 | 12.27 | 13.53 | ▁▁▆▇▁ |
The Ames Housing dataset contains 2,930 observations and a large set
of variables describing the physical characteristics, quality, location,
and sale information for residential properties. The response variable
for the first part of the portfolio is sale_price, modeled on the log
scale as log_sale_price. Because the dataset contains many possible
predictors, the analysis begins broadly and then uses both exploratory
analysis and formal feature-selection methods to narrow the model.
ames |>
ggplot(aes(x = sale_price)) +
geom_histogram(bins = 40, fill = "#2C7FB8", color = "white") +
scale_x_continuous(labels = dollar) +
labs(
title = "Distribution of Sale Prices",
x = "Sale Price",
y = "Count"
)
ames |>
ggplot(aes(x = log_sale_price)) +
geom_histogram(bins = 40, fill = "#7FCDBB", color = "white") +
labs(
title = "Distribution of Log(Sale Price)",
x = "Log Sale Price",
y = "Count"
)
The transformed outcome is more symmetric than the original sale
price distribution. For this reason, log_sale_price is used
as the primary outcome for the continuous regression models. The
original sale_price is retained only for interpretation and
for later derived outcomes.
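One way to quantify that improvement in symmetry is a simple moment-based skewness estimate; a quick sketch (the helper is ad hoc, written inline to avoid an extra package):

# Moment-based skewness: positive values indicate a right-skewed
# distribution, values near zero an approximately symmetric one.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
ames |>
  summarize(
    skew_sale_price = skewness(sale_price),
    skew_log_sale_price = skewness(log_sale_price)
  )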
Sale Price vs Living Area
ames |>
ggplot(aes(x = gr_liv_area, y = sale_price)) +
geom_point(alpha = 0.4, color = "#2C7FB8") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_y_continuous(labels = dollar) +
labs(
title = "Sale Price vs. Living Area",
x = "Above Ground Living Area (sq ft)",
y = "Sale Price"
)
Neighborhood Effects
ames |>
mutate(neighborhood = fct_reorder(neighborhood, sale_price, .fun = median)) |>
ggplot(aes(x = neighborhood, y = sale_price)) +
geom_boxplot(fill = "#A1D99B") +
coord_flip() +
scale_y_continuous(labels = dollar) +
labs(
title = "Sale Price by Neighborhood",
x = "Neighborhood",
y = "Sale Price"
)
Overall Quality and Sale Price
ames |>
ggplot(aes(x = overall_qual, y = sale_price)) +
geom_boxplot(fill = "#FDAE6B") +
scale_y_continuous(labels = dollar) +
labs(
title = "Sale Price by Overall Quality",
x = "Overall Quality",
y = "Sale Price"
)
The categorical plots suggest that several qualitative features are meaningfully related to sale price. In particular, neighborhood and overall quality show strong separation in median sale price.
# Correlations among numeric variables
numeric_correlations <- ames |>
select(where(is.numeric)) |>
cor(use = "pairwise.complete.obs") |>
as.data.frame() |>
rownames_to_column("variable_1") |>
pivot_longer(
cols = -variable_1,
names_to = "variable_2",
values_to = "correlation"
)
# Keep only stronger relationships. The correlation matrix is symmetric,
# so each retained pair appears twice (once per ordering of the variables).
strong_numeric_correlations <- numeric_correlations |>
  filter(
    variable_1 != variable_2,
    abs(correlation) >= 0.3
  )
# Plot
strong_numeric_correlations |>
ggplot(aes(x = variable_1, y = variable_2, fill = correlation)) +
geom_tile() +
geom_text(aes(label = round(correlation, 2)), size = 3) +
scale_fill_viridis_c(
option = "D",
limits = c(-1, 1)
) +
coord_fixed() +
labs(
title = "Strong Correlations Among Numeric Variables",
x = NULL,
y = NULL,
fill = "Correlation"
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
The exploratory analysis shows that sale price is strongly right-skewed, motivating the use of a log transformation. Living area and overall quality appear to have strong positive relationships with price, while neighborhood differences suggest meaningful variation across locations. These patterns guide the selection of predictors and modeling approaches in the following sections.
set.seed(42)
ames_split <- initial_split(ames, prop = 0.8)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
set.seed(42)
ames_folds <- vfold_cv(ames_train, v = 5)
ames_recipe <- recipe(log_sale_price ~ ., data = ames_train) |>
step_rm(sale_price) |>
step_log(gr_liv_area, base = 10) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
ames_prep <- prep(ames_recipe)
ames_baked <- bake(ames_prep, new_data = NULL)
glimpse(ames_baked)
## Rows: 2,344
## Columns: 302
## $ lot_frontage <dbl> 0.7468160, -1.1…
## $ lot_area <dbl> 0.008845521, -1…
## $ year_built <dbl> -0.29500974, -0…
## $ year_remod_add <dbl> -1.0272152, -0.…
## $ mas_vnr_area <dbl> 1.53612533, 0.7…
## $ bsmt_fin_sf_1 <dbl> -1.43710624, 1.…
## $ bsmt_fin_sf_2 <dbl> -0.2945113, -0.…
## $ bsmt_unf_sf <dbl> -0.1356635, 0.2…
## $ total_bsmt_sf <dbl> 0.06460710, -0.…
## $ first_flr_sf <dbl> -0.04002436, -1…
## $ second_flr_sf <dbl> -0.7831775, 0.4…
## $ low_qual_fin_sf <dbl> -0.1041543, -0.…
## $ gr_liv_area <dbl> -0.6688422, -0.…
## $ bsmt_full_bath <dbl> -0.8224714, -0.…
## $ bsmt_half_bath <dbl> 3.6941629, -0.2…
## $ full_bath <dbl> -1.0426515, -1.…
## $ half_bath <dbl> -0.7593844, 1.2…
## $ bedroom_abv_gr <dbl> 0.1761544, 0.17…
## $ kitchen_abv_gr <dbl> -0.2030956, -0.…
## $ tot_rms_abv_grd <dbl> -0.2826372, 0.3…
## $ fireplaces <dbl> 0.6281975, -0.9…
## $ garage_cars <dbl> -0.9975764, -2.…
## $ garage_area <dbl> -0.82502805, -2…
## $ wood_deck_sf <dbl> 2.0253838, 0.86…
## $ open_porch_sf <dbl> 0.4213274, -0.7…
## $ enclosed_porch <dbl> -0.3512185, -0.…
## $ three_season_porch <dbl> -0.09985435, -0…
## $ screen_porch <dbl> -0.2849768, -0.…
## $ pool_area <dbl> -0.06412225, -0…
## $ misc_val <dbl> -0.08763635, -0…
## $ mo_sold <dbl> -0.0732589, -0.…
## $ year_sold <dbl> -1.3783358, -1.…
## $ log_sale_price <dbl> 12.01370, 11.42…
## $ ms_sub_class_One_Story_1945_and_Older <dbl> -0.2154269, -0.…
## $ ms_sub_class_One_Story_with_Finished_Attic_All_Ages <dbl> -0.04622503, -0…
## $ ms_sub_class_One_and_Half_Story_Unfinished_All_Ages <dbl> -0.07749852, -0…
## $ ms_sub_class_One_and_Half_Story_Finished_All_Ages <dbl> -0.3241824, -0.…
## $ ms_sub_class_Two_Story_1946_and_Newer <dbl> -0.5013591, -0.…
## $ ms_sub_class_Two_Story_1945_and_Older <dbl> -0.2110551, -0.…
## $ ms_sub_class_Two_and_Half_Story_All_Ages <dbl> -0.08795059, -0…
## $ ms_sub_class_Split_or_Multilevel <dbl> 5.062156, -0.19…
## $ ms_sub_class_Split_Foyer <dbl> -0.124865, -0.1…
## $ ms_sub_class_Duplex_All_Styles_and_Ages <dbl> -0.1951168, -0.…
## $ ms_sub_class_One_Story_PUD_1946_and_Newer <dbl> -0.2733212, -0.…
## $ ms_sub_class_One_and_Half_Story_PUD_All_Ages <dbl> -0.02065481, -0…
## $ ms_sub_class_Two_Story_PUD_1946_and_Newer <dbl> -0.2154269, 4.6…
## $ ms_sub_class_PUD_Multilevel_Split_Level_Foyer <dbl> -0.07171891, -0…
## $ ms_sub_class_Two_Family_conversion_All_Styles_and_Ages <dbl> -0.1505919, -0.…
## $ ms_zoning_Residential_High_Density <dbl> -0.1016878, -0.…
## $ ms_zoning_Residential_Low_Density <dbl> 0.5410728, -1.8…
## $ ms_zoning_Residential_Medium_Density <dbl> -0.4293692, 2.3…
## $ ms_zoning_A_agr <dbl> -0.02921655, -0…
## $ ms_zoning_C_all <dbl> -0.09731674, -0…
## $ ms_zoning_I_all <dbl> -0.02921655, -0…
## $ street_Pave <dbl> 0.05850809, 0.0…
## $ alley_No_Alley_Access <dbl> 0.2724191, 0.27…
## $ alley_Paved <dbl> -0.1661556, -0.…
## $ lot_shape_Slightly_Irregular <dbl> -0.7026624, -0.…
## $ lot_shape_Moderately_Irregular <dbl> -0.1661556, -0.…
## $ lot_shape_Irregular <dbl> -0.08023581, -0…
## $ land_contour_HLS <dbl> -0.2054823, -0.…
## $ land_contour_Low <dbl> -0.144558, -0.1…
## $ land_contour_Lvl <dbl> 0.3360985, 0.33…
## $ utilities_NoSeWa <dbl> -0.02065481, -0…
## $ utilities_NoSewr <dbl> -0.02921655, -0…
## $ lot_config_CulDSac <dbl> -0.2567312, -0.…
## $ lot_config_FR2 <dbl> -0.1728127, -0.…
## $ lot_config_FR3 <dbl> -0.06865091, -0…
## $ lot_config_Inside <dbl> 0.6107455, 0.61…
## $ land_slope_Mod <dbl> -0.2132503, -0.…
## $ land_slope_Sev <dbl> -0.07749852, -0…
## $ neighborhood_College_Creek <dbl> -0.3209593, -0.…
## $ neighborhood_Old_Town <dbl> -0.2969349, -0.…
## $ neighborhood_Edwards <dbl> -0.2660421, -0.…
## $ neighborhood_Somerset <dbl> -0.2623472, -0.…
## $ neighborhood_Northridge_Heights <dbl> -0.2481288, -0.…
## $ neighborhood_Gilbert <dbl> -0.247159, -0.2…
## $ neighborhood_Sawyer <dbl> -0.2218514, -0.…
## $ neighborhood_Northwest_Ames <dbl> -0.2088409, -0.…
## $ neighborhood_Sawyer_West <dbl> -0.2088409, -0.…
## $ neighborhood_Mitchell <dbl> -0.2043525, -0.…
## $ neighborhood_Brookside <dbl> -0.1903566, -0.…
## $ neighborhood_Crawford <dbl> -0.1891505, -0.…
## $ neighborhood_Iowa_DOT_and_Rail_Road <dbl> -0.1817694, -0.…
## $ neighborhood_Timberland <dbl> -0.1535276, -0.…
## $ neighborhood_Northridge <dbl> -0.1647955, -0.…
## $ neighborhood_Stone_Brook <dbl> -0.1366732, -0.…
## $ neighborhood_South_and_West_of_Iowa_State_University <dbl> -0.1283422, -0.…
## $ neighborhood_Clear_Creek <dbl> -0.1283422, -0.…
## $ neighborhood_Meadow_Village <dbl> -0.1157445, -0.…
## $ neighborhood_Briardale <dbl> -0.1058857, 9.4…
## $ neighborhood_Bloomington_Heights <dbl> -0.1016878, -0.…
## $ neighborhood_Veenker <dbl> -0.08795059, -0…
## $ neighborhood_Northpark_Villa <dbl> -0.09038007, -0…
## $ neighborhood_Blueste <dbl> -0.06207049, -0…
## $ neighborhood_Greens <dbl> -0.0547176, -0.…
## $ neighborhood_Green_Hills <dbl> -0.02921655, -0…
## $ neighborhood_Landmark <dbl> -0.02065481, -0…
## $ condition_1_Feedr <dbl> -0.239291, -0.2…
## $ condition_1_Norm <dbl> 0.4025578, 0.40…
## $ condition_1_PosA <dbl> -0.08023581, -0…
## $ condition_1_PosN <dbl> -0.1157445, -0.…
## $ condition_1_RRAe <dbl> -0.1016878, -0.…
## $ condition_1_RRAn <dbl> -0.1300482, -0.…
## $ condition_1_RRNe <dbl> -0.03579046, -0…
## $ condition_1_RRNn <dbl> -0.0547176, -0.…
## $ condition_2_Feedr <dbl> -0.06865091, -0…
## $ condition_2_Norm <dbl> 0.1038071, 0.10…
## $ condition_2_PosA <dbl> -0.03579046, -0…
## $ condition_2_PosN <dbl> -0.03579046, -0…
## $ condition_2_RRAe <dbl> -0.02065481, -0…
## $ condition_2_RRAn <dbl> -0.02065481, -0…
## $ condition_2_RRNn <dbl> -0.02065481, -0…
## $ bldg_type_TwoFmCon <dbl> -0.1520662, -0.…
## $ bldg_type_Duplex <dbl> -0.1951168, -0.…
## $ bldg_type_Twnhs <dbl> -0.1939361, 5.1…
## $ bldg_type_TwnhsE <dbl> -0.2969349, -0.…
## $ house_style_One_and_Half_Unf <dbl> -0.08023581, -0…
## $ house_style_One_Story <dbl> -1.0143954, -1.…
## $ house_style_SFoyer <dbl> -0.1647955, -0.…
## $ house_style_SLvl <dbl> 4.7862919, -0.2…
## $ house_style_Two_and_Half_Fin <dbl> -0.05850809, -0…
## $ house_style_Two_and_Half_Unf <dbl> -0.08795059, -0…
## $ house_style_Two_Story <dbl> -0.6563761, 1.5…
## $ overall_qual_Poor <dbl> -0.06544205, -0…
## $ overall_qual_Fair <dbl> -0.117622, -0.1…
## $ overall_qual_Below_Average <dbl> -0.2900783, -0.…
## $ overall_qual_Average <dbl> -0.6226058, -0.…
## $ overall_qual_Above_Average <dbl> 1.7658361, 1.76…
## $ overall_qual_Good <dbl> -0.5186234, -0.…
## $ overall_qual_Very_Good <dbl> -0.3749301, -0.…
## $ overall_qual_Excellent <dbl> -0.1891505, -0.…
## $ overall_qual_Very_Excellent <dbl> -0.1038071, -0.…
## $ overall_cond_Poor <dbl> -0.0547176, -0.…
## $ overall_cond_Fair <dbl> -0.1398763, -0.…
## $ overall_cond_Below_Average <dbl> -0.1867183, -0.…
## $ overall_cond_Average <dbl> 0.8623979, 0.86…
## $ overall_cond_Above_Average <dbl> -0.463051, -0.4…
## $ overall_cond_Good <dbl> -0.3844881, -0.…
## $ overall_cond_Very_Good <dbl> -0.2260515, -0.…
## $ overall_cond_Excellent <dbl> -0.1194715, -0.…
## $ roof_style_Gable <dbl> 0.510004, 0.510…
## $ roof_style_Gambrel <dbl> -0.09038007, -0…
## $ roof_style_Hip <dbl> -0.4765797, -0.…
## $ roof_style_Mansard <dbl> -0.06544205, -0…
## $ roof_style_Shed <dbl> -0.04133609, -0…
## $ roof_matl_CompShg <dbl> 0.124865, 0.124…
## $ roof_matl_Membran <dbl> -0.02065481, -0…
## $ roof_matl_Roll <dbl> -0.02065481, -0…
## $ roof_matl_Tar.Grv <dbl> -0.09038007, -0…
## $ roof_matl_WdShake <dbl> -0.06207049, -0…
## $ roof_matl_WdShngl <dbl> -0.04622503, -0…
## $ exterior_1st_AsphShn <dbl> -0.02921655, -0…
## $ exterior_1st_BrkComm <dbl> -0.04622503, -0…
## $ exterior_1st_BrkFace <dbl> -0.1754127, -0.…
## $ exterior_1st_CBlock <dbl> -0.02921655, -0…
## $ exterior_1st_CemntBd <dbl> -0.2066069, 4.8…
## $ exterior_1st_HdBoard <dbl> 2.4189468, -0.4…
## $ exterior_1st_ImStucc <dbl> -0.02065481, -0…
## $ exterior_1st_MetalSd <dbl> -0.4328472, -0.…
## $ exterior_1st_Plywood <dbl> -0.2822282, -0.…
## $ exterior_1st_PreCast <dbl> -0.02065481, -0…
## $ exterior_1st_Stone <dbl> -0.02065481, -0…
## $ exterior_1st_Stucco <dbl> -0.1194715, -0.…
## $ exterior_1st_VinylSd <dbl> -0.7555144, -0.…
## $ exterior_1st_Wd.Sdng <dbl> -0.3953758, -0.…
## $ exterior_1st_WdShing <dbl> -0.1414527, -0.…
## $ exterior_2nd_AsphShn <dbl> -0.04133609, -0…
## $ exterior_2nd_Brk.Cmn <dbl> -0.08545424, -0…
## $ exterior_2nd_BrkFace <dbl> -0.1212944, -0.…
## $ exterior_2nd_CBlock <dbl> -0.03579046, -0…
## $ exterior_2nd_CmentBd <dbl> -0.2054823, 4.8…
## $ exterior_2nd_HdBoard <dbl> 2.5514784, -0.3…
## $ exterior_2nd_ImStucc <dbl> -0.07466342, -0…
## $ exterior_2nd_MetalSd <dbl> -0.4307617, -0.…
## $ exterior_2nd_Other <dbl> -0.02065481, -0…
## $ exterior_2nd_Plywood <dbl> -0.3152693, -0.…
## $ exterior_2nd_PreCast <dbl> -0.02065481, -0…
## $ exterior_2nd_Stone <dbl> -0.04622503, -0…
## $ exterior_2nd_Stucco <dbl> -0.1283422, -0.…
## $ exterior_2nd_VinylSd <dbl> -0.749257, -0.7…
## $ exterior_2nd_Wd.Sdng <dbl> -0.3895869, -0.…
## $ exterior_2nd_Wd.Shng <dbl> -0.1620447, -0.…
## $ mas_vnr_type_BrkFace <dbl> 1.5228669, 1.52…
## $ mas_vnr_type_None <dbl> -1.229504, -1.2…
## $ mas_vnr_type_Stone <dbl> -0.3111646, -0.…
## $ exter_qual_Fair <dbl> -0.1038071, -0.…
## $ exter_qual_Good <dbl> -0.7244479, -0.…
## $ exter_qual_Typical <dbl> 0.7993048, 0.79…
## $ exter_cond_Fair <dbl> -0.1549765, -0.…
## $ exter_cond_Good <dbl> -0.3407945, -0.…
## $ exter_cond_Poor <dbl> -0.03579046, -0…
## $ exter_cond_Typical <dbl> 0.3910379, 0.39…
## $ foundation_CBlock <dbl> 1.1692276, 1.16…
## $ foundation_PConc <dbl> -0.9146767, -0.…
## $ foundation_Slab <dbl> -0.1266148, -0.…
## $ foundation_Stone <dbl> -0.05850809, -0…
## $ foundation_Wood <dbl> -0.04133609, -0…
## $ bsmt_qual_Fair <dbl> -0.1779793, -0.…
## $ bsmt_qual_Good <dbl> -0.8586435, -0.…
## $ bsmt_qual_No_Basement <dbl> -0.1647955, -0.…
## $ bsmt_qual_Poor <dbl> -0.02921655, -0…
## $ bsmt_qual_Typical <dbl> 1.1480131, 1.14…
## $ bsmt_cond_Fair <dbl> -0.1891505, -0.…
## $ bsmt_cond_Good <dbl> -0.2110551, -0.…
## $ bsmt_cond_No_Basement <dbl> -0.1647955, -0.…
## $ bsmt_cond_Poor <dbl> -0.04133609, -0…
## $ bsmt_cond_Typical <dbl> 0.3446791, 0.34…
## $ bsmt_exposure_Gd <dbl> -0.3241824, -0.…
## $ bsmt_exposure_Mn <dbl> -0.3028478, -0.…
## $ bsmt_exposure_No <dbl> -1.3490446, 0.7…
## $ bsmt_exposure_No_Basement <dbl> -0.1675058, -0.…
## $ bsmt_fin_type_1_BLQ <dbl> -0.3136316, -0.…
## $ bsmt_fin_type_1_GLQ <dbl> -0.6563761, -0.…
## $ bsmt_fin_type_1_LwQ <dbl> -0.2291607, -0.…
## $ bsmt_fin_type_1_No_Basement <dbl> -0.1647955, -0.…
## $ bsmt_fin_type_1_Rec <dbl> -0.3329462, -0.…
## $ bsmt_fin_type_1_Unf <dbl> -0.6424354, 1.5…
## $ bsmt_fin_type_2_BLQ <dbl> -0.1564132, -0.…
## $ bsmt_fin_type_2_GLQ <dbl> -0.1099302, -0.…
## $ bsmt_fin_type_2_LwQ <dbl> -0.1779793, -0.…
## $ bsmt_fin_type_2_No_Basement <dbl> -0.1661556, -0.…
## $ bsmt_fin_type_2_Rec <dbl> -0.1951168, -0.…
## $ bsmt_fin_type_2_Unf <dbl> 0.4167571, 0.41…
## $ heating_GasA <dbl> 0.1266148, 0.12…
## $ heating_GasW <dbl> -0.09731674, -0…
## $ heating_Grav <dbl> -0.0547176, -0.…
## $ heating_OthW <dbl> -0.02921655, -0…
## $ heating_Wall <dbl> -0.05064781, -0…
## $ heating_qc_Fair <dbl> -0.1817694, -0.…
## $ heating_qc_Good <dbl> -0.4356222, -0.…
## $ heating_qc_Poor <dbl> -0.02921655, -0…
## $ heating_qc_Typical <dbl> 1.5704786, 1.57…
## $ central_air_Y <dbl> 0.2660421, 0.26…
## $ electrical_FuseF <dbl> -0.124865, -0.1…
## $ electrical_FuseP <dbl> -0.05850809, -0…
## $ electrical_Mix <dbl> -0.02065481, -0…
## $ electrical_SBrkr <dbl> 0.2994786, 0.29…
## $ electrical_Unknown <dbl> -0.02065481, -0…
## $ kitchen_qual_Fair <dbl> -0.1535276, -0.…
## $ kitchen_qual_Good <dbl> -0.8224307, -0.…
## $ kitchen_qual_Poor <dbl> -0.02065481, -0…
## $ kitchen_qual_Typical <dbl> 0.9904466, 0.99…
## $ functional_Maj2 <dbl> -0.0547176, -0.…
## $ functional_Min1 <dbl> -0.1476032, -0.…
## $ functional_Min2 <dbl> -0.1476032, -0.…
## $ functional_Mod <dbl> -0.1138377, -0.…
## $ functional_Sal <dbl> -0.02065481, -0…
## $ functional_Sev <dbl> -0.02921655, -0…
## $ functional_Typ <dbl> 0.2669599, 0.26…
## $ fireplace_qu_Fair <dbl> -0.1592514, -0.…
## $ fireplace_qu_Good <dbl> 1.7277501, -0.5…
## $ fireplace_qu_No_Fireplace <dbl> -0.9786828, 1.0…
## $ fireplace_qu_Poor <dbl> -0.1194715, -0.…
## $ fireplace_qu_Typical <dbl> -0.5106678, -0.…
## $ garage_type_Basment <dbl> -0.1119002, -0.…
## $ garage_type_BuiltIn <dbl> -0.2557863, -0.…
## $ garage_type_CarPort <dbl> -0.06207049, -0…
## $ garage_type_Detchd <dbl> -0.6041656, -0.…
## $ garage_type_More_Than_Two_Types <dbl> -0.09038007, -0…
## $ garage_type_No_Garage <dbl> -0.2412767, 4.1…
## $ garage_finish_No_Garage <dbl> -0.2432497, 4.1…
## $ garage_finish_RFn <dbl> -0.6199681, -0.…
## $ garage_finish_Unf <dbl> 1.1795096, -0.8…
## $ garage_qual_Fair <dbl> -0.2121551, -0.…
## $ garage_qual_Good <dbl> -0.09038007, -0…
## $ garage_qual_No_Garage <dbl> -0.2432497, 4.1…
## $ garage_qual_Poor <dbl> -0.04622503, -0…
## $ garage_qual_Typical <dbl> 0.3523743, -2.8…
## $ garage_cond_Fair <dbl> -0.1606535, -0.…
## $ garage_cond_Good <dbl> -0.07466342, -0…
## $ garage_cond_No_Garage <dbl> -0.2432497, 4.1…
## $ garage_cond_Poor <dbl> -0.07749852, -0…
## $ garage_cond_Typical <dbl> 0.3217669, -3.1…
## $ paved_drive_Partial_Pavement <dbl> -0.1414527, -0.…
## $ paved_drive_Paved <dbl> 0.3193401, -3.1…
## $ pool_qc_Fair <dbl> -0.02065481, -0…
## $ pool_qc_Good <dbl> -0.03579046, -0…
## $ pool_qc_No_Pool <dbl> 0.06865091, 0.0…
## $ pool_qc_Typical <dbl> -0.03579046, -0…
## $ fence_Good_Wood <dbl> -0.1997798, -0.…
## $ fence_Minimum_Privacy <dbl> -0.3500758, -0.…
## $ fence_Minimum_Wood_Wire <dbl> -0.06207049, -0…
## $ fence_No_Fence <dbl> 0.4880005, 0.48…
## $ misc_feature_Gar2 <dbl> -0.03579046, -0…
## $ misc_feature_None <dbl> 0.1891505, 0.18…
## $ misc_feature_Othr <dbl> -0.02065481, -0…
## $ misc_feature_Shed <dbl> -0.1830175, -0.…
## $ sale_type_Con <dbl> -0.04133609, -0…
## $ sale_type_ConLD <dbl> -0.09038007, -0…
## $ sale_type_ConLI <dbl> -0.0547176, -0.…
## $ sale_type_ConLw <dbl> -0.05064781, -0…
## $ sale_type_CWD <dbl> -0.06544205, -0…
## $ sale_type_New <dbl> -0.3028478, -0.…
## $ sale_type_Oth <dbl> -0.05064781, -0…
## $ sale_type_VWD <dbl> -0.02065481, -0…
## $ sale_type_WD. <dbl> 0.3975366, 0.39…
## $ sale_condition_AdjLand <dbl> -0.06207049, -0…
## $ sale_condition_Alloca <dbl> -0.08795059, -0…
## $ sale_condition_Family <dbl> -0.1230919, -0.…
## $ sale_condition_Normal <dbl> 0.4623717, -2.1…
## $ sale_condition_Partial <dbl> -0.3070247, -0.…
To ensure reliable model evaluation, the data was split into training and testing sets. The training data was used to fit models, while the test data was reserved for final evaluation. Five-fold cross-validation was applied within the training data to estimate model performance and reduce overfitting.
A preprocessing recipe was used to standardize transformations across all models. This included encoding categorical variables, removing zero-variance predictors, and normalizing numeric variables. These steps ensure that models are trained on consistent and appropriately scaled data.
lm_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
lm_spec
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
baseline_lm <- lm_spec |>
fit(log_sale_price ~ . - sale_price, data = ames_train)
baseline_lm_metrics <- glance(baseline_lm)
baseline_lm_metrics
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.950 0.943 0.0972 133. 0 293 2295. -3999. -2300.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The model formula excludes sale_price because
sale_price is the original version of the response.
Including it as a predictor of log_sale_price would create
data leakage and produce misleadingly strong results.
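A small illustration of why (not a model anyone should use):

# The untransformed response predicts its own log transform very strongly,
# so any model that includes sale_price inherits this artificial fit.
leaky_fit <- lm(log_sale_price ~ sale_price, data = ames_train)
summary(leaky_fit)$r.squared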
baseline_lm_coefs <- tidy(baseline_lm) |>
filter(!is.na(p.value)) |>
arrange(p.value) |>
mutate(remove_candidate = p.value > 0.05)
baseline_lm_coefs
## # A tibble: 294 × 6
## term estimate std.error statistic p.value remove_candidate
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 roof_matlCompShg 2.91 0.174 16.7 7.43e-59 FALSE
## 2 roof_matlWdShngl 2.99 0.181 16.5 1.26e-57 FALSE
## 3 roof_matlTar&Grv 2.93 0.180 16.3 2.95e-56 FALSE
## 4 roof_matlWdShake 2.92 0.180 16.2 6.17e-56 FALSE
## 5 first_flr_sf 0.000271 0.0000169 16.1 6.42e-55 FALSE
## 6 roof_matlMembran 3.17 0.215 14.7 1.04e-46 FALSE
## 7 roof_matlRoll 2.99 0.203 14.7 1.17e-46 FALSE
## 8 second_flr_sf 0.000263 0.0000187 14.0 8.36e-43 FALSE
## 9 misc_featureGar2 2.27 0.169 13.5 8.12e-40 FALSE
## 10 misc_featureShed 2.19 0.229 9.56 3.21e-21 FALSE
## # ℹ 284 more rows
Backward selection is used as a classical feature-selection baseline. It begins with a broad model and removes predictors based on AIC, an information criterion that balances model fit and model complexity.
lm_full <- lm(log_sale_price ~ . - sale_price, data = ames_train)
backward_lm <- MASS::stepAIC(
lm_full,
direction = "backward",
trace = FALSE
)
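To make the criterion concrete: AIC = -2 * logLik + 2k, where k counts every estimated parameter (the coefficients plus the residual variance). A quick sketch checking this definition against R's built-in AIC() for the full model:

# AIC from its definition versus the built-in function; the two should match.
ll <- logLik(lm_full)
c(
  manual = -2 * as.numeric(ll) + 2 * attr(ll, "df"),
  built_in = AIC(lm_full)
)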
## Selected terms
backward_terms <- attr(terms(backward_lm), "term.labels")
backward_terms
## [1] "ms_sub_class" "ms_zoning" "lot_area" "street"
## [5] "land_slope" "neighborhood" "condition_1" "condition_2"
## [9] "bldg_type" "overall_qual" "overall_cond" "year_built"
## [13] "year_remod_add" "roof_matl" "exterior_1st" "mas_vnr_type"
## [17] "exter_qual" "exter_cond" "foundation" "bsmt_qual"
## [21] "bsmt_exposure" "bsmt_fin_type_1" "bsmt_fin_type_2" "bsmt_unf_sf"
## [25] "total_bsmt_sf" "heating_qc" "central_air" "first_flr_sf"
## [29] "second_flr_sf" "low_qual_fin_sf" "bsmt_full_bath" "full_bath"
## [33] "half_bath" "bedroom_abv_gr" "kitchen_qual" "functional"
## [37] "fireplaces" "garage_cars" "garage_area" "garage_qual"
## [41] "wood_deck_sf" "open_porch_sf" "enclosed_porch" "screen_porch"
## [45] "pool_area" "misc_feature" "year_sold" "sale_condition"
## Final coefficient table
backward_lm_coefs <- tidy(backward_lm) |>
filter(!is.na(p.value)) |>
arrange(p.value) |>
mutate(remove = p.value > 0.05)
backward_lm_coefs
## # A tibble: 194 × 6
## term estimate std.error statistic p.value remove
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 roof_matlCompShg 2.66 0.121 22.0 6.10e-97 FALSE
## 2 roof_matlTar&Grv 2.68 0.124 21.7 1.89e-94 FALSE
## 3 roof_matlWdShngl 2.77 0.129 21.4 1.57e-92 FALSE
## 4 roof_matlWdShake 2.63 0.127 20.7 5.11e-87 FALSE
## 5 misc_featureNone 2.18 0.110 19.9 1.11e-80 FALSE
## 6 misc_featureShed 2.19 0.110 19.8 1.52e-80 FALSE
## 7 first_flr_sf 0.000278 0.0000146 19.0 7.25e-75 FALSE
## 8 second_flr_sf 0.000261 0.0000147 17.8 2.98e-66 FALSE
## 9 misc_featureGar2 2.29 0.130 17.7 2.47e-65 FALSE
## 10 roof_matlMembran 2.86 0.164 17.4 8.56e-64 FALSE
## # ℹ 184 more rows
## Compare model metrics
linear_model_comparison <- bind_rows(
full_model = glance(lm_full),
backward_selected_model = glance(backward_lm),
.id = "model"
) |>
select(
model,
r.squared,
adj.r.squared,
sigma,
statistic,
p.value,
df,
logLik,
AIC,
BIC,
deviance,
df.residual
)
linear_model_comparison
## # A tibble: 2 × 12
## model r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 full_mod… 0.950 0.943 0.0972 133. 0 293 2295. -3999.
## 2 backward… 0.947 0.943 0.0975 200. 0 193 2231. -4072.
## # ℹ 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
The full linear model explains approximately 95.0% of the variation in log-transformed sale price, while the backward-selected model explains approximately 94.7%. This small decrease suggests that many predictors in the full model contribute very little additional explanatory value. After adjusting for model complexity, the two models remain nearly identical, with adjusted R^2 values of 0.9430 and 0.9426, respectively.
The backward-selected model is preferable as a classical feature-selection baseline because it removes about 100 model terms while maintaining nearly the same explanatory strength. Although the full model has slightly lower deviance and residual error, the improvement is minimal relative to the added complexity. Both AIC and BIC are lower for the backward-selected model, indicating that it provides a better balance between goodness of fit and parsimony.
Statistically, this suggests that the Ames Housing data contain many predictors that overlap in explanatory information. The backward-selected model keeps the most useful terms while removing predictors that do not improve the model enough to justify their complexity. However, because these results are based on in-sample performance, the backward-selected model should be treated as an interpretable baseline rather than the final predictive model. Cross-validation and test-set performance are still needed to determine which model generalizes best to new housing data.
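For reference, the adjusted values quoted above follow directly from the definition adj. R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1); a quick sketch checking this against glance():

# n - p - 1 is the residual degrees of freedom reported by glance().
glance(lm_full) |>
  mutate(adj_check = 1 - (1 - r.squared) * (nobs - 1) / df.residual) |>
  select(adj.r.squared, adj_check)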
Lasso regression was fit as a separate feature-selection model using the full predictor set rather than the backward-selected variables. The penalty parameter controls how strongly coefficients are shrunk toward zero. Predictors with coefficients shrunk exactly to zero are removed from the model, while predictors with nonzero coefficients are retained. Because categorical predictors were dummy-coded before fitting the lasso model, lasso selection occurs at the dummy-variable level. Therefore, lasso may retain one level of a categorical variable while shrinking another level of the same original variable to zero. For interpretation, these dummy-level terms can be grouped back to the original variable names, but the fitted lasso model itself is based on the dummy-coded design matrix.
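Before tuning, here is a quick sketch of how the penalty drives sparsity, reusing ames_baked from the preprocessing step above (an illustration only; the tuned workflow below is the real fit):

# glmnet's df component counts the nonzero coefficients at each penalty
# value, so it shows how many terms survive as lambda grows.
x <- ames_baked |> select(-log_sale_price) |> as.matrix()
y <- ames_baked$log_sale_price
lasso_path <- glmnet(x, y, alpha = 1)  # alpha = 1 is the lasso penalty
tibble(lambda = lasso_path$lambda, nonzero = lasso_path$df) |>
  slice(round(seq(1, n(), length.out = 6)))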
lasso_spec <- linear_reg(
penalty = tune(),
mixture = 1
) |>
set_engine("glmnet") |>
set_mode("regression")
lasso_wf <- workflow() |>
add_recipe(ames_recipe) |>
add_model(lasso_spec)
## Penalty grid
set.seed(42)
lasso_grid <- grid_regular(
penalty(range = c(-4, 1)),
levels = 50
)
lasso_res <- tune_grid(
lasso_wf,
resamples = ames_folds,
grid = lasso_grid,
metrics = metric_set(rmse, rsq, mae)
)
lasso_res |>
collect_metrics() |>
arrange(.metric, mean)
## # A tibble: 150 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.000655 mae standard 0.0815 5 0.00173 pre0_mod09_post0
## 2 0.000518 mae standard 0.0816 5 0.00175 pre0_mod08_post0
## 3 0.000829 mae standard 0.0816 5 0.00168 pre0_mod10_post0
## 4 0.000409 mae standard 0.0818 5 0.00179 pre0_mod07_post0
## 5 0.00105 mae standard 0.0819 5 0.00162 pre0_mod11_post0
## 6 0.000324 mae standard 0.0819 5 0.00183 pre0_mod06_post0
## 7 0.000256 mae standard 0.0822 5 0.00186 pre0_mod05_post0
## 8 0.00133 mae standard 0.0822 5 0.00158 pre0_mod12_post0
## 9 0.000202 mae standard 0.0824 5 0.00194 pre0_mod04_post0
## 10 0.00168 mae standard 0.0826 5 0.00167 pre0_mod13_post0
## # ℹ 140 more rows
best_lasso <- select_best(lasso_res, metric = "rmse")
best_lasso
## # A tibble: 1 × 2
## penalty .config
## <dbl> <chr>
## 1 0.00133 pre0_mod12_post0
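As an aside, the one-standard-error rule is a common alternative to select_best(): it picks the simplest model (here, the largest penalty) whose cross-validated performance is within one standard error of the best. A sketch:

# Prefer a sparser model whose RMSE is statistically comparable to the best.
select_by_one_std_err(lasso_res, metric = "rmse", desc(penalty))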
autoplot(lasso_res) +
labs(
title = "Lasso Regression Tuning Results",
x = "Penalty",
y = "Cross-Validated Performance"
)
## Finalize workflow
final_lasso_wf <- finalize_workflow(
lasso_wf,
best_lasso
)
final_lasso_fit <- last_fit(
final_lasso_wf,
split = ames_split,
metrics = metric_set(rmse, rsq, mae)
)
## Test Set Performance
lasso_test_metrics <- final_lasso_fit |>
collect_metrics()
lasso_test_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.143 pre0_mod0_post0
## 2 rsq standard 0.877 pre0_mod0_post0
## 3 mae standard 0.0808 pre0_mod0_post0
lasso_coefs <- final_lasso_fit |>
extract_fit_parsnip() |>
tidy() |>
filter(estimate != 0) |>
arrange(desc(abs(estimate)))
selected_lasso_features <- lasso_coefs |>
filter(term != "(Intercept)", estimate != 0) |>
pull(term)
selected_lasso_features
## [1] "gr_liv_area"
## [2] "year_built"
## [3] "overall_qual_Excellent"
## [4] "misc_val"
## [5] "overall_qual_Very_Good"
## [6] "total_bsmt_sf"
## [7] "garage_cars"
## [8] "functional_Sal"
## [9] "overall_cond_Good"
## [10] "pool_qc_Good"
## [11] "overall_cond_Fair"
## [12] "condition_1_Norm"
## [13] "kitchen_qual_Typical"
## [14] "overall_qual_Very_Excellent"
## [15] "neighborhood_Northridge_Heights"
## [16] "misc_feature_Gar2"
## [17] "sale_condition_Normal"
## [18] "neighborhood_Crawford"
## [19] "fireplaces"
## [20] "year_remod_add"
## [21] "overall_cond_Very_Good"
## [22] "ms_zoning_C_all"
## [23] "overall_qual_Below_Average"
## [24] "neighborhood_Somerset"
## [25] "ms_sub_class_Two_Story_PUD_1946_and_Newer"
## [26] "functional_Typ"
## [27] "lot_area"
## [28] "neighborhood_Green_Hills"
## [29] "ms_zoning_Residential_Medium_Density"
## [30] "sale_type_New"
## [31] "neighborhood_Northridge"
## [32] "pool_area"
## [33] "neighborhood_Stone_Brook"
## [34] "overall_cond_Excellent"
## [35] "ms_sub_class_Duplex_All_Styles_and_Ages"
## [36] "full_bath"
## [37] "overall_qual_Good"
## [38] "bldg_type_Twnhs"
## [39] "overall_cond_Above_Average"
## [40] "kitchen_qual_Good"
## [41] "bsmt_full_bath"
## [42] "bsmt_unf_sf"
## [43] "overall_qual_Fair"
## [44] "overall_qual_Poor"
## [45] "bldg_type_TwnhsE"
## [46] "screen_porch"
## [47] "neighborhood_Meadow_Village"
## [48] "house_style_One_Story"
## [49] "bsmt_exposure_Gd"
## [50] "bsmt_fin_type_1_GLQ"
## [51] "exterior_1st_BrkFace"
## [52] "neighborhood_Edwards"
## [53] "sale_condition_Partial"
## [54] "functional_Sev"
## [55] "functional_Maj2"
## [56] "heating_qc_Typical"
## [57] "bsmt_exposure_No"
## [58] "garage_area"
## [59] "wood_deck_sf"
## [60] "roof_matl_WdShngl"
## [61] "overall_cond_Below_Average"
## [62] "kitchen_abv_gr"
## [63] "overall_qual_Average"
## [64] "bsmt_fin_sf_1"
## [65] "kitchen_qual_Fair"
## [66] "misc_feature_Shed"
## [67] "condition_1_PosN"
## [68] "bsmt_qual_Typical"
## [69] "neighborhood_Old_Town"
## [70] "exterior_1st_PreCast"
## [71] "exter_qual_Typical"
## [72] "sale_condition_Alloca"
## [73] "exter_cond_Poor"
## [74] "ms_sub_class_Two_Story_1946_and_Newer"
## [75] "sale_condition_AdjLand"
## [76] "central_air_Y"
## [77] "exter_cond_Fair"
## [78] "heating_GasW"
## [79] "bsmt_fin_type_2_No_Basement"
## [80] "lot_config_CulDSac"
## [81] "house_style_SFoyer"
## [82] "land_slope_Sev"
## [83] "condition_1_PosA"
## [84] "ms_zoning_A_agr"
## [85] "exterior_1st_HdBoard"
## [86] "neighborhood_Clear_Creek"
## [87] "garage_type_More_Than_Two_Types"
## [88] "roof_matl_Tar.Grv"
## [89] "heating_qc_Fair"
## [90] "condition_2_PosA"
## [91] "garage_cond_Typical"
## [92] "ms_sub_class_One_Story_1945_and_Older"
## [93] "roof_style_Gambrel"
## [94] "exterior_1st_Wd.Sdng"
## [95] "neighborhood_Timberland"
## [96] "ms_sub_class_One_and_Half_Story_PUD_All_Ages"
## [97] "bsmt_cond_Fair"
## [98] "heating_qc_Poor"
## [99] "half_bath"
## [100] "house_style_One_and_Half_Unf"
## [101] "bsmt_qual_No_Basement"
## [102] "heating_qc_Good"
## [103] "garage_qual_Good"
## [104] "open_porch_sf"
## [105] "bsmt_fin_type_2_Rec"
## [106] "bsmt_exposure_Mn"
## [107] "bsmt_qual_Fair"
## [108] "year_sold"
## [109] "bsmt_qual_Good"
## [110] "sale_type_Con"
## [111] "land_contour_Lvl"
## [112] "lot_shape_Moderately_Irregular"
## [113] "land_contour_HLS"
## [114] "foundation_Wood"
## [115] "paved_drive_Paved"
## [116] "low_qual_fin_sf"
## [117] "lot_shape_Irregular"
## [118] "foundation_Stone"
## [119] "roof_matl_Membran"
## [120] "kitchen_qual_Poor"
## [121] "pool_qc_Fair"
## [122] "bsmt_fin_type_2_GLQ"
## [123] "garage_finish_RFn"
## [124] "bsmt_cond_Poor"
## [125] "exterior_1st_Stucco"
## [126] "ms_zoning_I_all"
## [127] "pool_qc_Typical"
## [128] "fence_Good_Wood"
## [129] "lot_config_FR2"
## [130] "street_Pave"
## [131] "mas_vnr_type_Stone"
## [132] "neighborhood_Greens"
## [133] "garage_type_CarPort"
## [134] "heating_OthW"
## [135] "bsmt_fin_type_1_Unf"
## [136] "neighborhood_Veenker"
## [137] "roof_matl_CompShg"
## [138] "utilities_NoSewr"
## [139] "functional_Min1"
## [140] "enclosed_porch"
## [141] "neighborhood_Bloomington_Heights"
## [142] "exterior_1st_CemntBd"
## [143] "foundation_Slab"
## [144] "neighborhood_Sawyer_West"
## [145] "garage_cond_Poor"
## [146] "ms_sub_class_Two_and_Half_Story_All_Ages"
## [147] "alley_Paved"
## [148] "bsmt_cond_Good"
## [149] "utilities_NoSeWa"
## [150] "garage_qual_Fair"
## [151] "roof_style_Mansard"
## [152] "bsmt_exposure_No_Basement"
## [153] "exterior_2nd_BrkFace"
## [154] "sale_condition_Family"
## [155] "garage_qual_Poor"
## [156] "fireplace_qu_Good"
## [157] "exterior_2nd_Plywood"
## [158] "mas_vnr_area"
## [159] "ms_sub_class_Two_Family_conversion_All_Styles_and_Ages"
## [160] "sale_type_ConLI"
## [161] "neighborhood_Sawyer"
## [162] "condition_1_RRAn"
## [163] "garage_cond_Good"
## [164] "condition_2_RRNn"
## [165] "ms_sub_class_Two_Story_1945_and_Older"
## [166] "neighborhood_Blueste"
## [167] "exterior_2nd_Wd.Shng"
## [168] "house_style_Two_and_Half_Unf"
## [169] "bsmt_fin_type_2_Unf"
## [170] "three_season_porch"
## [171] "bedroom_abv_gr"
## [172] "garage_cond_Fair"
## [173] "heating_Grav"
## [174] "neighborhood_Northwest_Ames"
## [175] "fence_Minimum_Wood_Wire"
## [176] "neighborhood_Brookside"
## [177] "exterior_2nd_Wd.Sdng"
## [178] "neighborhood_Mitchell"
## [179] "exterior_2nd_Stone"
## [180] "lot_config_FR3"
## [181] "sale_type_WD."
## [182] "bsmt_fin_type_2_BLQ"
## [183] "neighborhood_Northpark_Villa"
## [184] "paved_drive_Partial_Pavement"
## [185] "bsmt_cond_No_Basement"
## [186] "condition_1_RRAe"
## [187] "bsmt_fin_type_1_BLQ"
## [188] "land_slope_Mod"
## [189] "bsmt_half_bath"
## [190] "neighborhood_Briardale"
## [191] "bsmt_fin_type_1_No_Basement"
## [192] "second_flr_sf"
## [193] "exterior_1st_CBlock"
## [194] "garage_finish_Unf"
## [195] "exterior_2nd_Other"
## [196] "condition_1_RRNn"
## [197] "bldg_type_Duplex"
## [198] "fence_No_Fence"
## [199] "neighborhood_Gilbert"
length(selected_lasso_features)
## [1] 199
Map dummy variables back to original variables:
# A predictor counts as "selected" if any retained dummy term begins with
# its name. Simple prefix matching works here because no Ames variable name
# is a prefix of another variable's dummy terms, but this is worth
# re-checking with other data.
original_vars <- names(ames_train) |>
  setdiff(c("sale_price", "log_sale_price"))
selected_original_vars <- original_vars[
  purrr::map_lgl(original_vars, \(var) {
    any(stringr::str_detect(selected_lasso_features, paste0("^", var)))
  })
]
selected_original_vars
selected_original_vars
## [1] "ms_sub_class" "ms_zoning" "lot_area"
## [4] "street" "alley" "lot_shape"
## [7] "land_contour" "utilities" "lot_config"
## [10] "land_slope" "neighborhood" "condition_1"
## [13] "condition_2" "bldg_type" "house_style"
## [16] "overall_qual" "overall_cond" "year_built"
## [19] "year_remod_add" "roof_style" "roof_matl"
## [22] "exterior_1st" "exterior_2nd" "mas_vnr_type"
## [25] "mas_vnr_area" "exter_qual" "exter_cond"
## [28] "foundation" "bsmt_qual" "bsmt_cond"
## [31] "bsmt_exposure" "bsmt_fin_type_1" "bsmt_fin_sf_1"
## [34] "bsmt_fin_type_2" "bsmt_unf_sf" "total_bsmt_sf"
## [37] "heating" "heating_qc" "central_air"
## [40] "second_flr_sf" "low_qual_fin_sf" "gr_liv_area"
## [43] "bsmt_full_bath" "bsmt_half_bath" "full_bath"
## [46] "half_bath" "bedroom_abv_gr" "kitchen_abv_gr"
## [49] "kitchen_qual" "functional" "fireplaces"
## [52] "fireplace_qu" "garage_type" "garage_finish"
## [55] "garage_cars" "garage_area" "garage_qual"
## [58] "garage_cond" "paved_drive" "wood_deck_sf"
## [61] "open_porch_sf" "enclosed_porch" "three_season_porch"
## [64] "screen_porch" "pool_area" "pool_qc"
## [67] "fence" "misc_feature" "misc_val"
## [70] "year_sold" "sale_type" "sale_condition"
length(selected_original_vars)
## [1] 72
The selected lasso features are the predictors with non-zero
coefficients in the finalized model. These are the best feature
candidates from the lasso procedure because they survived
cross-validated penalization. Larger absolute coefficient values
indicate stronger effects on log_sale_price, after
preprocessing and normalization.
# Group dummy-level terms back to their original variables and sum absolute
# coefficients as a rough importance score. which.max() picks the first
# matching prefix; a longest-match rule would be more robust in general,
# though no prefix conflicts arise among these variable names.
original_vars <- names(ames_train)
lasso_coefs_grouped <- lasso_coefs |>
  filter(term != "(Intercept)") |>
  rowwise() |>
  mutate(
    original_var = original_vars[
      which.max(startsWith(term, original_vars))
    ]
  ) |>
  ungroup() |>
  group_by(original_var) |>
  summarize(
    importance = sum(abs(estimate))
  ) |>
  arrange(desc(importance))
lasso_coefs_grouped |>
slice_max(order_by = importance, n = 25) |>
mutate(original_var = fct_reorder(original_var, importance)) |>
ggplot(aes(x = importance, y = original_var)) +
geom_col() +
labs(
title = "Top Lasso-Selected Variables",
x = "Total Absolute Coefficient",
y = NULL
) +
theme_minimal()
# Attach the raw sale price; collect_predictions() from last_fit() keeps
# the test-set row order (the .row column can be used to verify), so
# bind_cols() lines up correctly here.
lasso_preds <- final_lasso_fit |>
  collect_predictions() |>
  bind_cols(
    ames_test |>
      select(sale_price)
  ) |>
mutate(
pred_sale_price = exp(.pred),
residual_log = log_sale_price - .pred,
residual_dollars = sale_price - pred_sale_price
)
head(lasso_preds)
## # A tibble: 6 × 9
## .pred id log_sale_price .row .config sale_price pred_sale_price
## <dbl> <chr> <dbl> <int> <chr> <int> <dbl>
## 1 12.1 train/test split 12.2 5 pre0_m… 189900 186549.
## 2 12.1 train/test split 12.0 22 pre0_m… 170000 176648.
## 3 12.6 train/test split 12.6 38 pre0_m… 306000 284474.
## 4 13.1 train/test split 13.1 47 pre0_m… 500000 481507.
## 5 12.7 train/test split 12.7 48 pre0_m… 320000 324779.
## 6 12.2 train/test split 12.2 54 pre0_m… 192000 190586.
## # ℹ 2 more variables: residual_log <dbl>, residual_dollars <dbl>
lasso_log_metrics <- lasso_preds |>
metrics(truth = log_sale_price, estimate = .pred) |>
mutate(model = "Lasso")
lasso_log_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.143 Lasso
## 2 rsq standard 0.877 Lasso
## 3 mae standard 0.0808 Lasso
lasso_dollar_metrics <- lasso_preds |>
metrics(truth = sale_price, estimate = pred_sale_price) |>
mutate(model = "Lasso")
lasso_dollar_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 27273. Lasso
## 2 rsq standard 0.882 Lasso
## 3 mae standard 14049. Lasso
ggplot(lasso_preds, aes(x = log_sale_price, y = .pred)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Lasso Model: Actual vs. Predicted Log Sale Price",
x = "Actual Log Sale Price",
y = "Predicted Log Sale Price"
) +
theme_minimal()
Residuals vs Predicted
ggplot(lasso_preds, aes(x = .pred, y = residual_log)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Lasso Model: Residuals vs. Predicted Log Sale Price",
x = "Predicted Log Sale Price",
y = "Residual"
) +
theme_minimal()
Actual vs Predicted
ggplot(lasso_preds, aes(x = sale_price, y = pred_sale_price)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
scale_x_continuous(labels = scales::label_dollar()) +
scale_y_continuous(labels = scales::label_dollar()) +
labs(
title = "Lasso Model: Actual vs. Predicted Sale Price",
x = "Actual Sale Price",
y = "Predicted Sale Price"
) +
theme_minimal()
The lasso model was evaluated on the held-out test set to estimate how well it generalizes to new homes. Performance was assessed on both the log scale and the original dollar scale. The log-scale metrics align directly with the model’s training objective, while the dollar-scale metrics make the results easier to interpret for a general audience.
The actual-versus-predicted plots show whether predictions closely follow the 45-degree reference line, where perfect predictions would fall. Points far from the line represent homes that were overpredicted or underpredicted. The dollar-scale plot is especially useful for communication, but because predictions were back-transformed from the log scale with exp(), dollar-scale errors should be read as approximate practical summaries rather than errors on the scale the model was actually fit on.
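One standard refinement worth noting: exp(.pred) targets something closer to the median than the mean sale price, and Duan's smearing estimator can reduce this retransformation bias. A sketch of the mechanics (in practice the smearing factor should come from training residuals; test residuals are used here only for illustration):

# Multiply the naive back-transform by the mean of the exponentiated
# residuals (Duan's smearing factor).
smear <- mean(exp(lasso_preds$residual_log))
lasso_preds |>
  mutate(pred_sale_price_smeared = exp(.pred) * smear) |>
  select(sale_price, pred_sale_price, pred_sale_price_smeared) |>
  head()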
The residual plot helps evaluate whether errors show systematic patterns. Ideally, residuals should be centered around zero with no clear curve or funnel shape. If residuals widen for higher predicted prices, this suggests that expensive homes are harder to predict consistently, which is common in housing data.
Both the backward-selection and lasso models are evaluated on the test set to compare predictive performance.
# Align test-set factor levels with those seen in training so that
# predict() returns NA for unseen levels instead of erroring.
ames_test_lm <- ames_test
for (var in names(backward_lm$xlevels)) {
  ames_test_lm[[var]] <- factor(
    ames_test_lm[[var]],
    levels = backward_lm$xlevels[[var]]
  )
}
backward_preds <- predict(backward_lm, newdata = ames_test_lm) |>
tibble(.pred = _) |>
bind_cols(
ames_test |>
select(sale_price, log_sale_price)
) |>
mutate(
pred_sale_price = exp(.pred),
residual_log = log_sale_price - .pred,
residual_dollars = sale_price - pred_sale_price
)
# Drop rows whose unseen factor levels produced NA predictions
backward_preds_clean <- backward_preds |>
  filter(!is.na(.pred))
backward_log_metrics <- backward_preds_clean |>
metrics(truth = log_sale_price, estimate = .pred) |>
mutate(model = "Backward Selection")
backward_log_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.146 Backward Selection
## 2 rsq standard 0.873 Backward Selection
## 3 mae standard 0.0795 Backward Selection
backward_dollar_metrics <- backward_preds_clean |>
metrics(truth = sale_price, estimate = pred_sale_price) |>
mutate(model = "Backward Selection")
backward_dollar_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 37872. Backward Selection
## 2 rsq standard 0.796 Backward Selection
## 3 mae standard 14066. Backward Selection
log_model_comparison_wide <- bind_rows(
backward_log_metrics,
lasso_log_metrics
) |>
select(model, .metric, .estimate) |>
pivot_wider(
names_from = .metric,
values_from = .estimate
)
log_model_comparison_wide
## # A tibble: 2 × 4
## model rmse rsq mae
## <chr> <dbl> <dbl> <dbl>
## 1 Backward Selection 0.146 0.873 0.0795
## 2 Lasso 0.143 0.877 0.0808
On the log scale, the lasso model has lower RMSE and higher R^2, indicating stronger performance in predicting proportional differences in sale price. The backward-selection model shows slightly lower MAE, suggesting smaller typical errors, but lasso reduces larger errors more effectively due to its lower RMSE.
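A rough way to read the log-scale RMSE (an approximation, not an exact statement): an RMSE of r in log dollars corresponds to a typical multiplicative error of about exp(r).

# For the lasso's log-scale RMSE of 0.143:
exp(0.143) - 1  # about 0.15, i.e. predictions typically off by ~15%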
dollar_model_comparison_wide <- bind_rows(
backward_dollar_metrics,
lasso_dollar_metrics
) |>
select(model, .metric, .estimate) |>
pivot_wider(
names_from = .metric,
values_from = .estimate
)
dollar_model_comparison_wide
## # A tibble: 2 × 4
## model rmse rsq mae
## <chr> <dbl> <dbl> <dbl>
## 1 Backward Selection 37872. 0.796 14066.
## 2 Lasso 27273. 0.882 14049.
On the dollar scale, the lasso model performs better overall. It has a lower RMSE, higher R^2, and slightly lower MAE than the backward-selection model. The difference in RMSE is especially meaningful because RMSE penalizes larger errors more heavily; this suggests that lasso makes fewer large dollar-scale prediction errors. Although the MAE values are very close, lasso still has the smaller typical absolute error. Overall, lasso provides stronger predictive performance than backward selection on both the log scale and the dollar scale.
Because lasso performs better on both evaluation scales, it is the stronger feature-selection model for this analysis. It combines predictive accuracy with automatic feature selection, making it a useful bridge between interpretability and performance. Backward selection remains useful as a classical comparison model, but the test-set results support lasso as the preferred feature-selection approach.
The lasso model was used as an evidence-based guide for selecting predictors for the multiple linear regression model. Rather than choosing variables arbitrarily, I selected predictors that represented the main housing dimensions identified by lasso: quality, location, size, age, condition, basement features, garage capacity, amenities, and sale context.
This model is designed to be interpretable as well as predictive. Highly sparse, redundant, or difficult-to-explain predictors were excluded even if they appeared in the lasso ranking. This keeps the model useful for explaining patterns in home prices to a general audience.
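For reference, the ranking that informed this selection can be recovered from the fitted lasso workflow. The sketch below assumes final_lasso_fit is the fitted lasso workflow object from the previous section (the name is illustrative); filtering tidy() output to nonzero estimates lists the predictors that survived the penalty.
# Predictors retained by the lasso: nonzero coefficients at the chosen penalty
final_lasso_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  filter(term != "(Intercept)", estimate != 0) |>
  arrange(desc(abs(estimate)))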
mlr_features <- c(
"overall_qual",
"neighborhood",
"gr_liv_area",
"overall_cond",
"garage_cars",
"ms_sub_class",
"ms_zoning",
"year_built",
"sale_condition",
"bsmt_exposure",
"fireplaces",
"year_remod_add",
"total_bsmt_sf",
"full_bath",
"condition_1",
"bsmt_full_bath",
"exter_qual",
"central_air",
"first_flr_sf",
"screen_porch",
"bsmt_qual",
"kitchen_qual",
"foundation",
"lot_area",
"exterior_1st"
)
# Build the formula
mlr_formula <- as.formula(
paste("log_sale_price ~", paste(mlr_features, collapse = " + "))
)
mlr_formula
## log_sale_price ~ overall_qual + neighborhood + gr_liv_area +
## overall_cond + garage_cars + ms_sub_class + ms_zoning + year_built +
## sale_condition + bsmt_exposure + fireplaces + year_remod_add +
## total_bsmt_sf + full_bath + condition_1 + bsmt_full_bath +
## exter_qual + central_air + first_flr_sf + screen_porch +
## bsmt_qual + kitchen_qual + foundation + lot_area + exterior_1st
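An equivalent, slightly tidier construction uses reformulate(), the approach taken later for the multinomial and Poisson models:
# Equivalent formula construction with reformulate()
mlr_formula <- reformulate(
  termlabels = mlr_features,
  response = "log_sale_price"
)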
mlr_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
mlr_recipe <- recipe(mlr_formula, data = ames_train) |>
step_novel(all_nominal_predictors()) |>
step_unknown(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
mlr_workflow <- workflow() |>
add_recipe(mlr_recipe) |>
add_model(mlr_spec)
mlr_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
set.seed(42)
mlr_cv_results <- fit_resamples(
mlr_workflow,
resamples = ames_folds,
metrics = metric_set(rmse, rsq, mae),
control = control_resamples(save_pred = TRUE)
)
mlr_cv_metrics <- collect_metrics(mlr_cv_results)
mlr_cv_metrics
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 0.0863 5 0.000172 pre0_mod0_post0
## 2 rmse standard 0.140 5 0.00810 pre0_mod0_post0
## 3 rsq standard 0.882 5 0.0149 pre0_mod0_post0
final_mlr_fit <- fit(
mlr_workflow,
data = ames_train
)
final_mlr_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept)
## 5.177e+00
## gr_liv_area
## 2.302e-04
## garage_cars
## 5.458e-02
## year_built
## 2.108e-03
## fireplaces
## 3.435e-02
## year_remod_add
## 7.430e-04
## total_bsmt_sf
## 2.455e-06
## full_bath
## 3.944e-02
## bsmt_full_bath
## 5.301e-02
## first_flr_sf
## 1.164e-05
## screen_porch
## 2.719e-04
## lot_area
## 1.423e-06
## overall_qual_Poor
## -6.923e-02
## overall_qual_Fair
## 1.342e-01
## overall_qual_Below_Average
## 1.839e-01
## overall_qual_Average
## 2.460e-01
## overall_qual_Above_Average
## 2.876e-01
## overall_qual_Good
## 3.265e-01
## overall_qual_Very_Good
## 3.846e-01
## overall_qual_Excellent
## 4.821e-01
## overall_qual_Very_Excellent
## 4.168e-01
## neighborhood_College_Creek
## 2.425e-02
## neighborhood_Old_Town
##
## ...
## and 210 more lines.
mlr_tidy <- tidy(final_mlr_fit)
mlr_tidy_clean <- mlr_tidy |>
filter(term != "(Intercept)") |>
mutate(
percent_change = (exp(estimate) - 1) * 100
) |>
arrange(desc(abs(estimate)))
mlr_tidy_clean
## # A tibble: 126 × 6
## term estimate std.error statistic p.value percent_change
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ms_zoning_A_agr -1.00 0.138 -7.27 4.94e-13 -63.4
## 2 exterior_1st_PreCast 0.629 0.125 5.04 5.15e- 7 87.5
## 3 exterior_1st_CBlock 0.611 0.102 5.97 2.74e- 9 84.1
## 4 neighborhood_Green_Hills 0.598 0.0876 6.84 1.05e-11 81.9
## 5 overall_qual_Excellent 0.482 0.0999 4.83 1.47e- 6 62.0
## 6 overall_qual_Very_Excel… 0.417 0.103 4.04 5.57e- 5 51.7
## 7 overall_qual_Very_Good 0.385 0.0979 3.93 8.77e- 5 46.9
## 8 overall_cond_Excellent 0.369 0.0611 6.04 1.78e- 9 44.7
## 9 ms_sub_class_One_and_Ha… -0.355 0.123 -2.89 3.84e- 3 -29.9
## 10 overall_qual_Good 0.326 0.0974 3.35 8.15e- 4 38.6
## # ℹ 116 more rows
mlr_glance <- glance(final_mlr_fit)
mlr_glance
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.920 0.915 0.119 201. 0 126 1736. -3216. -2479.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The multiple linear regression model explains approximately 92.0% of the variation in log-transformed sale price, with an adjusted R^2 of 91.5% after accounting for model complexity. This indicates that the selected predictors provide strong explanatory power while maintaining a more interpretable structure than the full model. The residual standard error is 0.119 on the log scale; because exp(0.119) is about 1.13, the model's typical prediction error is roughly 12 to 13% in sale-price terms. The overall model test is highly significant, suggesting that the selected housing characteristics collectively provide meaningful information about home sale prices.
Although the model performs well, it is still complex because categorical variables are converted into many dummy-variable terms: the fit estimates 126 coefficients beyond the intercept, reflecting predictors such as neighborhood, quality level, zoning, basement features, exterior type, and sale condition. This complexity is appropriate for Ames Housing because home prices are shaped by both numeric features, such as living area and age, and categorical features, such as location and quality. However, the model should still be evaluated on test-set performance and diagnostic plots before being treated as a final predictive model.
mlr_preds <- predict(final_mlr_fit, new_data = ames_test) |>
bind_cols(
ames_test |>
select(sale_price, log_sale_price)
) |>
mutate(
pred_sale_price = exp(.pred),
residual_log = log_sale_price - .pred,
residual_dollars = sale_price - pred_sale_price
)
head(mlr_preds)
## # A tibble: 6 × 6
## .pred sale_price log_sale_price pred_sale_price residual_log residual_dollars
## <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 12.0 189900 12.2 170444. 0.108 19456.
## 2 12.1 170000 12.0 187368. -0.0973 -17368.
## 3 12.6 306000 12.6 292309. 0.0458 13691.
## 4 13.1 500000 13.1 484153. 0.0322 15847.
## 5 12.7 320000 12.7 325614. -0.0174 -5614.
## 6 12.1 192000 12.2 187871. 0.0217 4129.
mlr_log_metrics <- mlr_preds |>
metrics(truth = log_sale_price, estimate = .pred) |>
mutate(model = "Multiple Linear Regression")
mlr_log_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.136 Multiple Linear Regression
## 2 rsq standard 0.890 Multiple Linear Regression
## 3 mae standard 0.0870 Multiple Linear Regression
mlr_dollar_metrics <- mlr_preds |>
metrics(truth = sale_price, estimate = pred_sale_price) |>
mutate(model = "Multiple Linear Regression")
mlr_dollar_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 29291. Multiple Linear Regression
## 2 rsq standard 0.864 Multiple Linear Regression
## 3 mae standard 15452. Multiple Linear Regression
mlr_preds |>
ggplot(aes(x = sale_price, y = pred_sale_price)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
scale_x_continuous(labels = scales::label_dollar()) +
scale_y_continuous(labels = scales::label_dollar()) +
labs(
title = "Multiple Linear Regression: Actual vs. Predicted Sale Price",
x = "Actual Sale Price",
y = "Predicted Sale Price"
) +
theme_minimal()
mlr_preds |>
ggplot(aes(x = .pred, y = residual_log)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "MLR Residuals vs. Predicted Log Sale Price",
x = "Predicted Log Sale Price",
y = "Residual"
) +
theme_minimal()
mlr_preds |>
ggplot(aes(sample = residual_log)) +
stat_qq() +
stat_qq_line() +
labs(
title = "Q-Q Plot of MLR Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()
The actual-versus-predicted plot shows whether predicted prices closely
follow the observed prices. Points near the dashed line represent
accurate predictions. If points at the high end fall farther from the
line, this suggests that expensive homes are harder to predict, which is
common because luxury or unusual properties may have features not fully
captured by the model.
The residual plot checks whether prediction errors show systematic patterns. Ideally, residuals should be centered around zero with no clear curve or funnel shape. A wider spread for higher predicted prices would suggest heteroscedasticity, meaning the model makes less consistent predictions for more expensive homes.
The Q-Q plot checks whether residuals are approximately normally distributed. Residuals following the line in the center but deviating in the tails suggests that the model works reasonably well for typical homes but struggles more with unusually low- or high-priced properties.
The general fitted model is:
\[ \widehat{\log(\text{Sale Price})} = \beta_0 + \beta_1(\text{gr\_liv\_area}) + \beta_2(\text{garage\_cars}) + \beta_3(\text{year\_built}) + \beta_4(\text{fireplaces}) + \beta_5(\text{year\_remod\_add}) + \cdots \]
Using the fitted coefficient estimates, a simplified version of the model is:
\[ \widehat{\log(\text{Sale Price})} = 5.177 + 0.000230(\text{Gr Liv Area}) + 0.0546(\text{Garage Cars}) + 0.00211(\text{Year Built}) + 0.0344(\text{Fireplaces}) + 0.000743(\text{Year Remod Add}) + \cdots \]
The ellipsis represents the remaining predictors in the model, including additional numeric variables and dummy variables for categorical features such as overall quality, neighborhood, zoning, basement quality, and sale condition. These categorical variables are encoded relative to reference categories.
Because the response variable is log-transformed, each coefficient represents an approximate percentage change in sale price for a one-unit increase in the predictor, holding other variables constant. This allows the model to be interpreted in relative terms rather than absolute dollar changes.
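For example, applying this conversion to the garage_cars coefficient reproduces the interpretation listed below:
# Percent change in expected sale price implied by a log-scale coefficient
(exp(0.05458) - 1) * 100 # garage_cars: about a 5.6% increase per space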
Key Coefficient Interpretations:
gr_liv_area: Each additional square foot of above-ground living area is associated with about a 0.023% increase in expected sale price, holding other predictors constant.
garage_cars: Each additional garage space is associated with about a 5.6% increase in expected sale price, holding other factors constant.
year_built: Each one-year increase in year built is associated with about a 0.21% increase in expected sale price, holding other factors constant.
fireplaces: Each additional fireplace is associated with about a 3.5% increase in expected sale price, holding other factors constant.
full_bath: Each additional full bathroom is associated with about a 4.0% increase in expected sale price, holding other factors constant.
bsmt_full_bath: Each additional basement full bathroom is associated with about a 5.4% increase in expected sale price, holding other factors constant.
Polynomial regression was used to test whether selected numeric
predictors have nonlinear relationships with
log_sale_price. This model uses the same main feature set
as the multiple linear regression model, but allows several important
numeric predictors to have curved effects.
poly_vars <- c(
"gr_liv_area",
"total_bsmt_sf",
"year_built",
"lot_area"
)
poly_vars <- intersect(poly_vars, mlr_features)
poly_vars
## [1] "gr_liv_area" "total_bsmt_sf" "year_built" "lot_area"
poly_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
# Formula using the same selected features as MLR
poly_formula <- as.formula(
paste("log_sale_price ~", paste(mlr_features, collapse = " + "))
)
poly_formula
## log_sale_price ~ overall_qual + neighborhood + gr_liv_area +
## overall_cond + garage_cars + ms_sub_class + ms_zoning + year_built +
## sale_condition + bsmt_exposure + fireplaces + year_remod_add +
## total_bsmt_sf + full_bath + condition_1 + bsmt_full_bath +
## exter_qual + central_air + first_flr_sf + screen_porch +
## bsmt_qual + kitchen_qual + foundation + lot_area + exterior_1st
poly_recipe <- recipe(poly_formula, data = ames_train) |>
step_novel(all_nominal_predictors()) |>
step_unknown(all_nominal_predictors()) |>
step_poly(all_of(poly_vars), degree = 2) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
poly_workflow <- workflow() |>
add_recipe(poly_recipe) |>
add_model(poly_spec)
poly_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_poly()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
set.seed(42)
poly_cv_results <- fit_resamples(
poly_workflow,
resamples = ames_folds,
metrics = metric_set(rmse, rsq, mae),
control = control_resamples(save_pred = TRUE)
)
poly_cv_metrics <- collect_metrics(poly_cv_results)
poly_cv_metrics
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 0.0805 5 0.000818 pre0_mod0_post0
## 2 rmse standard 0.126 5 0.00426 pre0_mod0_post0
## 3 rsq standard 0.904 5 0.00747 pre0_mod0_post0
final_poly_fit <- fit(
poly_workflow,
data = ames_train
)
final_poly_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_poly()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept)
## 9.972e+00
## garage_cars
## 4.326e-02
## fireplaces
## 2.976e-02
## year_remod_add
## 6.577e-04
## full_bath
## 1.515e-02
## bsmt_full_bath
## 5.021e-02
## first_flr_sf
## -1.040e-06
## screen_porch
## 2.092e-04
## gr_liv_area_poly_1
## 6.541e+00
## gr_liv_area_poly_2
## -1.206e+00
## total_bsmt_sf_poly_1
## 1.359e+00
## total_bsmt_sf_poly_2
## -2.352e+00
## year_built_poly_1
## 4.629e+00
## year_built_poly_2
## 5.328e-01
## lot_area_poly_1
## 1.140e+00
## lot_area_poly_2
## -6.622e-01
## overall_qual_Poor
## -3.304e-02
## overall_qual_Fair
## 1.014e-01
## overall_qual_Below_Average
## 1.358e-01
## overall_qual_Average
## 1.870e-01
## overall_qual_Above_Average
## 2.162e-01
## overall_qual_Good
## 2.518e-01
## overall_qual_Very_Good
##
## ...
## and 218 more lines.
poly_glance <- glance(final_poly_fit)
poly_glance
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.936 0.932 0.106 247. 0 130 1994. -3723. -2963.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
poly_preds <- predict(final_poly_fit, new_data = ames_test) |>
bind_cols(
ames_test |>
select(sale_price, log_sale_price)
) |>
mutate(
pred_sale_price = exp(.pred),
residual_log = log_sale_price - .pred,
residual_dollars = sale_price - pred_sale_price
)
head(poly_preds)
## # A tibble: 6 × 6
## .pred sale_price log_sale_price pred_sale_price residual_log residual_dollars
## <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 12.1 189900 12.2 177955. 0.0650 11945.
## 2 12.1 170000 12.0 180117. -0.0578 -10117.
## 3 12.6 306000 12.6 289080. 0.0569 16920.
## 4 13.0 500000 13.1 427360. 0.157 72640.
## 5 12.7 320000 12.7 327385. -0.0228 -7385.
## 6 12.1 192000 12.2 185276. 0.0356 6724.
poly_log_metrics <- poly_preds |>
metrics(truth = log_sale_price, estimate = .pred) |>
mutate(model = "Polynomial Regression")
poly_log_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.130 Polynomial Regression
## 2 rsq standard 0.900 Polynomial Regression
## 3 mae standard 0.0811 Polynomial Regression
poly_dollar_metrics <- poly_preds |>
metrics(truth = sale_price, estimate = pred_sale_price) |>
mutate(model = "Polynomial Regression")
poly_dollar_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 25775. Polynomial Regression
## 2 rsq standard 0.895 Polynomial Regression
## 3 mae standard 14266. Polynomial Regression
model_comparison_log <- bind_rows(
mlr_log_metrics,
poly_log_metrics
) |>
select(model, .metric, .estimate) |>
pivot_wider(
names_from = .metric,
values_from = .estimate
)
model_comparison_log
## # A tibble: 2 × 4
## model rmse rsq mae
## <chr> <dbl> <dbl> <dbl>
## 1 Multiple Linear Regression 0.136 0.890 0.0870
## 2 Polynomial Regression 0.130 0.900 0.0811
model_comparison_dollars <- bind_rows(
mlr_dollar_metrics,
poly_dollar_metrics
) |>
select(model, .metric, .estimate) |>
pivot_wider(
names_from = .metric,
values_from = .estimate
)
model_comparison_dollars
## # A tibble: 2 × 4
## model rmse rsq mae
## <chr> <dbl> <dbl> <dbl>
## 1 Multiple Linear Regression 29291. 0.864 15452.
## 2 Polynomial Regression 25775. 0.895 14266.
# Actual vs predicted plot
poly_preds |>
ggplot(aes(x = sale_price, y = pred_sale_price)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
scale_x_continuous(labels = scales::label_dollar()) +
scale_y_continuous(labels = scales::label_dollar()) +
labs(
title = "Polynomial Regression: Actual vs. Predicted Sale Price",
x = "Actual Sale Price",
y = "Predicted Sale Price"
) +
theme_minimal()
poly_preds |>
ggplot(aes(x = .pred, y = residual_log)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Polynomial Regression Residuals vs. Predicted Log Sale Price",
x = "Predicted Log Sale Price",
y = "Residual"
) +
theme_minimal()
poly_preds |>
ggplot(aes(sample = residual_log)) +
stat_qq() +
stat_qq_line() +
labs(
title = "Q-Q Plot of Polynomial Regression Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()
# Visualize polynomial effect: living area
ames_train |>
ggplot(aes(x = gr_liv_area, y = log_sale_price)) +
geom_point(alpha = 0.35) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
labs(
title = "Polynomial Relationship Between Living Area and Log Sale Price",
x = "Above-Ground Living Area",
y = "Log Sale Price"
) +
theme_minimal()
# Visualize polynomial effect: basement size
ames_train |>
ggplot(aes(x = total_bsmt_sf, y = log_sale_price)) +
geom_point(alpha = 0.35) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
labs(
title = "Polynomial Relationship Between Basement Size and Log Sale Price",
x = "Total Basement Square Feet",
y = "Log Sale Price"
) +
theme_minimal()
# Visualize polynomial effect: year built
ames_train |>
ggplot(aes(x = year_built, y = log_sale_price)) +
geom_point(alpha = 0.35) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
labs(
title = "Polynomial Relationship Between Year Built and Log Sale Price",
x = "Year Built",
y = "Log Sale Price"
) +
theme_minimal()
# Visualize polynomial effect: lot area
ames_train |>
ggplot(aes(x = lot_area, y = log_sale_price)) +
geom_point(alpha = 0.35) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
labs(
title = "Polynomial Relationship Between Lot Area and Log Sale Price",
x = "Lot Area",
y = "Log Sale Price"
) +
theme_minimal()
The polynomial regression model improved prediction compared with the multiple linear regression model. In the training data, the polynomial model had an R^2 of 0.936 and an adjusted R^2 of 0.932, compared with 0.920 and 0.915 for the multiple linear regression model. The residual standard error also decreased from 0.119 to 0.106, suggesting that the polynomial model reduced typical prediction error on the log-price scale.
More importantly, the improvement also appeared on the test set. The polynomial model had a test-set RMSE of 0.130, compared with 0.136 for the multiple linear regression model. Its test-set R^2 increased from 0.890 to 0.900, and MAE decreased from 0.087 to 0.081. This indicates that the polynomial model generalized slightly better to new observations, rather than only improving the fit to the training data.
Statistically, these results suggest that some important housing variables have nonlinear relationships with sale price. Living area, basement size, year built, and lot area may not affect home value at a constant rate across their full ranges. By allowing second-degree polynomial effects for these predictors, the model captures curvature while remaining interpretable as an extension of multiple linear regression.
Because step_poly() builds orthogonal polynomial basis terms by default, the polynomial coefficients should be interpreted as evidence of curvature rather than as simple raw slope and squared-term effects. The main takeaway is that adding nonlinear flexibility improved predictive performance while preserving a clear regression-based modeling structure.
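If coefficients that map directly onto the raw variable and its square are preferred, step_poly() can be switched to a raw basis; its options list is passed through to stats::poly(). A sketch of the alternative recipe (the fitted values are identical because the two bases span the same column space; only the coefficients change):
# Same model with raw polynomial terms instead of the orthogonal basis
poly_recipe_raw <- recipe(poly_formula, data = ames_train) |>
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_poly(all_of(poly_vars), degree = 2, options = list(raw = TRUE)) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())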
The polynomial regression model extends the multiple linear
regression model by allowing selected numeric predictors to have curved
relationships with log_sale_price.
The fitted model, expressed in the polynomial basis terms, is:
\[ \widehat{\log(\text{Sale Price})} = 9.972 + 6.541(\text{gr\_liv\_area})_1 - 1.206(\text{gr\_liv\_area})_2 + 1.359(\text{total\_bsmt\_sf})_1 - 2.352(\text{total\_bsmt\_sf})_2 + 4.629(\text{year\_built})_1 + 0.533(\text{year\_built})_2 + 1.140(\text{lot\_area})_1 - 0.662(\text{lot\_area})_2 + \cdots \] Because step_poly() was used, the _1 and _2 terms represent first- and second-degree orthogonal polynomial basis terms, not the raw variable and its square.
In this model, gr_liv_area, total_bsmt_sf,
year_built, and lot_area were allowed to have
second-degree polynomial effects.
Because the response variable is log_sale_price, the
model predicts the log of home sale price rather than sale price
directly. The predicted dollar sale price is obtained by exponentiating
the fitted value:
\[ \widehat{\text{Sale Price}} = e^{\widehat{\log(\text{Sale Price})}} \]
Can selected Ames housing characteristics classify homes into low, middle, and high price categories?
We create price_category with three levels: Low, Middle, and High. Low serves as the baseline level, so the model estimates two equations, one comparing Middle with Low and one comparing High with Low.
We also use a smaller feature set than in the MLR model, because multinomial models with many categorical predictors can grow large quickly.
# Create tertile cut points using the training data only
price_cuts <- quantile(
ames_train$sale_price,
probs = c(1 / 3, 2 / 3),
na.rm = TRUE
)
price_cuts
## 33.33333% 66.66667%
## 139000 192000
# Add price category to training and testing data
ames_train_multi <- ames_train |>
mutate(
price_category = case_when(
sale_price <= price_cuts[1] ~ "Low",
sale_price <= price_cuts[2] ~ "Middle",
TRUE ~ "High"
),
price_category = factor(
price_category,
levels = c("Low", "Middle", "High")
)
)
ames_test_multi <- ames_test |>
mutate(
price_category = case_when(
sale_price <= price_cuts[1] ~ "Low",
sale_price <= price_cuts[2] ~ "Middle",
TRUE ~ "High"
),
price_category = factor(
price_category,
levels = c("Low", "Middle", "High")
)
)
# Check class balance
bind_rows(
train = ames_train_multi |> count(price_category),
test = ames_test_multi |> count(price_category),
.id = "dataset"
) |>
group_by(dataset) |>
mutate(prop = n / sum(n))
## # A tibble: 6 × 4
## # Groups: dataset [2]
## dataset price_category n prop
## <chr> <fct> <int> <dbl>
## 1 train Low 796 0.340
## 2 train Middle 773 0.330
## 3 train High 775 0.331
## 4 test Low 185 0.316
## 5 test Middle 228 0.389
## 6 test High 173 0.295
ames_train_multi |>
ggplot(aes(x = price_category, fill = price_category)) +
geom_bar() +
labs(
title = "Distribution of Housing Price Categories",
x = "Price Category",
y = "Number of Homes"
) +
theme_minimal() +
theme(legend.position = "none")
ames_train_multi |>
ggplot(aes(x = price_category, y = sale_price, fill = price_category)) +
geom_boxplot() +
scale_y_continuous(labels = scales::label_dollar()) +
labs(
title = "Sale Price by Price Category",
x = "Price Category",
y = "Sale Price"
) +
theme_minimal() +
theme(legend.position = "none")
The model uses selected housing predictors that were important in the earlier regression work. These include measures of quality, location, size, age, garage capacity, basement features, and sale context.
multi_features <- c(
"overall_qual",
"gr_liv_area",
"overall_cond",
"garage_cars",
"year_built",
"fireplaces",
"year_remod_add",
"total_bsmt_sf",
"full_bath",
"central_air",
"lot_area"
)
multi_formula <- reformulate(
termlabels = multi_features,
response = "price_category"
)
multi_formula
## price_category ~ overall_qual + gr_liv_area + overall_cond +
## garage_cars + year_built + fireplaces + year_remod_add +
## total_bsmt_sf + full_bath + central_air + lot_area
multi_spec <- multinom_reg() |>
set_engine("nnet") |>
set_mode("classification")
multi_recipe <- recipe(multi_formula, data = ames_train_multi) |>
step_novel(all_nominal_predictors()) |>
step_unknown(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
multi_workflow <- workflow() |>
add_recipe(multi_recipe) |>
add_model(multi_spec)
multi_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Multinomial Regression Model Specification (classification)
##
## Computational engine: nnet
set.seed(42)
multi_folds <- vfold_cv(
ames_train_multi,
v = 5,
strata = price_category
)
set.seed(42)
multi_cv_results <- fit_resamples(
multi_workflow,
resamples = multi_folds,
metrics = metric_set(accuracy, kap, mn_log_loss),
control = control_resamples(save_pred = TRUE)
)
multi_cv_metrics <- collect_metrics(multi_cv_results)
multi_cv_metrics
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.827 5 0.00987 pre0_mod0_post0
## 2 kap multiclass 0.740 5 0.0148 pre0_mod0_post0
## 3 mn_log_loss multiclass 0.519 5 0.0341 pre0_mod0_post0
final_multi_fit <- fit(
multi_workflow,
data = ames_train_multi
)
final_multi_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## nnet::multinom(formula = ..y ~ ., data = data, trace = FALSE)
##
## Coefficients:
## (Intercept) gr_liv_area garage_cars year_built fireplaces year_remod_add
## Middle -173.2418 0.003396774 0.8155047 0.04725986 0.7878914 0.02136111
## High -320.7770 0.007201304 1.1042862 0.08363337 1.3111405 0.05844962
## total_bsmt_sf full_bath lot_area overall_qual_Poor
## Middle 0.001548055 0.02140128 0.0001696131 -12.50913
## High 0.003591289 -0.20667223 0.0002515269 -59.98598
## overall_qual_Fair overall_qual_Below_Average overall_qual_Average
## Middle -19.63521 10.873724 11.468132
## High -73.09130 2.519599 4.614533
## overall_qual_Above_Average overall_qual_Good overall_qual_Very_Good
## Middle 12.251735 13.370957 14.01366
## High 6.351918 8.692872 10.97494
## overall_qual_Excellent overall_qual_Very_Excellent overall_cond_Poor
## Middle 65.35161 15.810281 10.315792
## High 62.25347 8.553918 4.609423
## overall_cond_Fair overall_cond_Below_Average overall_cond_Average
## Middle 14.320916 15.25583 15.78269
## High 7.743726 10.77664 12.75549
## overall_cond_Above_Average overall_cond_Good overall_cond_Very_Good
## Middle 16.64329 17.16676 17.30265
## High 14.60196 15.84057 15.18794
## overall_cond_Excellent central_air_Y
## Middle 18.47110 1.3265716
## High 16.65102 0.4867664
##
## Residual Deviance: 1920.906
## AIC: 2028.906
# Extract the underlying nnet::multinom fit; its coef() matrix has one row
# per non-baseline class (Middle, High) and one column per model term
multi_fit <- extract_fit_parsnip(final_multi_fit)
raw_multi_fit <- multi_fit$fit
multi_coef <- coef(raw_multi_fit)
# Reshape to long form and add odds ratios for interpretation
multi_tidy_clean <- as.data.frame(multi_coef) |>
  rownames_to_column("price_category") |>
  pivot_longer(
    cols = -price_category,
    names_to = "term",
    values_to = "estimate"
  ) |>
  mutate(
    odds_ratio = exp(estimate)
  ) |>
  arrange(price_category, desc(abs(estimate)))
multi_tidy_clean
## # A tibble: 54 × 4
## price_category term estimate odds_ratio
## <chr> <chr> <dbl> <dbl>
## 1 High (Intercept) -321. 4.88e-140
## 2 High overall_qual_Fair -73.1 1.81e- 32
## 3 High overall_qual_Excellent 62.3 1.09e+ 27
## 4 High overall_qual_Poor -60.0 8.88e- 27
## 5 High overall_cond_Excellent 16.7 1.70e+ 7
## 6 High overall_cond_Good 15.8 7.58e+ 6
## 7 High overall_cond_Very_Good 15.2 3.94e+ 6
## 8 High overall_cond_Above_Average 14.6 2.20e+ 6
## 9 High overall_cond_Average 12.8 3.46e+ 5
## 10 High overall_qual_Very_Good 11.0 5.84e+ 4
## # ℹ 44 more rows
multi_tidy_clean |>
group_by(price_category) |>
slice_max(order_by = abs(estimate), n = 10) |>
ungroup()
## # A tibble: 20 × 4
## price_category term estimate odds_ratio
## <chr> <chr> <dbl> <dbl>
## 1 High (Intercept) -321. 4.88e-140
## 2 High overall_qual_Fair -73.1 1.81e- 32
## 3 High overall_qual_Excellent 62.3 1.09e+ 27
## 4 High overall_qual_Poor -60.0 8.88e- 27
## 5 High overall_cond_Excellent 16.7 1.70e+ 7
## 6 High overall_cond_Good 15.8 7.58e+ 6
## 7 High overall_cond_Very_Good 15.2 3.94e+ 6
## 8 High overall_cond_Above_Average 14.6 2.20e+ 6
## 9 High overall_cond_Average 12.8 3.46e+ 5
## 10 High overall_qual_Very_Good 11.0 5.84e+ 4
## 11 Middle (Intercept) -173. 5.78e- 76
## 12 Middle overall_qual_Excellent 65.4 2.41e+ 28
## 13 Middle overall_qual_Fair -19.6 2.97e- 9
## 14 Middle overall_cond_Excellent 18.5 1.05e+ 8
## 15 Middle overall_cond_Very_Good 17.3 3.27e+ 7
## 16 Middle overall_cond_Good 17.2 2.85e+ 7
## 17 Middle overall_cond_Above_Average 16.6 1.69e+ 7
## 18 Middle overall_qual_Very_Excellent 15.8 7.35e+ 6
## 19 Middle overall_cond_Average 15.8 7.15e+ 6
## 20 Middle overall_cond_Below_Average 15.3 4.22e+ 6
multi_class_preds <- predict(
final_multi_fit,
new_data = ames_test_multi,
type = "class"
)
multi_prob_preds <- predict(
final_multi_fit,
new_data = ames_test_multi,
type = "prob"
)
multi_preds <- ames_test_multi |>
select(sale_price, price_category) |>
bind_cols(multi_class_preds, multi_prob_preds)
head(multi_preds)
## # A tibble: 6 × 6
## sale_price price_category .pred_class .pred_Low .pred_Middle .pred_High
## <int> <fct> <fct> <dbl> <dbl> <dbl>
## 1 189900 Middle Middle 1.28e- 2 0.827 0.160
## 2 170000 Middle Middle 1.17e- 2 0.797 0.191
## 3 306000 High High 7.12e- 7 0.00759 0.992
## 4 500000 High High 1.18e-34 0.0000114 1.000
## 5 320000 High High 1.98e- 8 0.00128 0.999
## 6 192000 Middle High 2.30e- 3 0.247 0.751
multi_conf_mat <- multi_preds |>
conf_mat(
truth = price_category,
estimate = .pred_class
)
multi_conf_mat
## Truth
## Prediction Low Middle High
## Low 164 39 0
## Middle 20 170 29
## High 1 19 144
# Plot the confusion matrix heatmap (yardstick places Truth on the x-axis
# and Prediction on the y-axis)
autoplot(multi_conf_mat, type = "heatmap") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(
    title = "Confusion Matrix Heatmap",
    x = "Actual Class",
    y = "Predicted Class"
  ) +
  theme_minimal()
multi_preds |>
count(price_category, .pred_class) |>
pivot_wider(
names_from = .pred_class,
values_from = n,
values_fill = 0
)
## # A tibble: 3 × 4
## price_category Low Middle High
## <fct> <int> <int> <int>
## 1 Low 164 20 1
## 2 Middle 39 170 19
## 3 High 0 29 144
multi_metrics <- multi_preds |>
metrics(
truth = price_category,
estimate = .pred_class
)
multi_log_loss <- multi_preds |>
mn_log_loss(
truth = price_category,
.pred_Low,
.pred_Middle,
.pred_High
)
multi_model_metrics <- bind_rows(
multi_metrics,
multi_log_loss
)
multi_model_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.816
## 2 kap multiclass 0.722
## 3 mn_log_loss multiclass 0.509
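Beyond accuracy, kappa, and log loss, yardstick's summary() method on the conf_mat object reports a fuller set of multiclass metrics, macro-averaged across the three categories:
# Additional metrics (sensitivity, specificity, precision, F-measure, etc.)
summary(multi_conf_mat)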
multi_preds |>
count(price_category, .pred_class) |>
ggplot(aes(x = price_category, y = n, fill = .pred_class)) +
geom_col(position = "stack") +
labs(
title = "Predicted vs. Actual Housing Price Categories",
x = "Actual Price Category",
y = "Number of Homes",
fill = "Predicted Category"
) +
theme_minimal()
multi_preds |>
pivot_longer(
cols = starts_with(".pred_") & where(is.numeric),
names_to = "predicted_category",
values_to = "predicted_probability"
) |>
ggplot(aes(x = price_category, y = predicted_probability, fill = price_category)) +
geom_boxplot() +
facet_wrap(~ predicted_category) +
labs(
title = "Predicted Probabilities by Actual Price Category",
x = "Actual Price Category",
y = "Predicted Probability"
) +
theme_minimal() +
theme(legend.position = "none")
The multinomial logistic regression model predicts whether a home belongs to the Low, Middle, or High price category. Unlike the earlier regression models, this model treats sale price as a categorical outcome. This makes the model useful for classification rather than exact price prediction.
The model uses Low as the reference category, so the
coefficient estimates describe how each predictor changes the log-odds
of being in the Middle or High category compared with the Low category.
Positive coefficients increase the odds of being in a higher price
category relative to Low, while negative coefficients decrease those
odds.
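Exponentiating a coefficient converts it from a log-odds change to an odds ratio. For example, using the fireplaces coefficient of 0.788 from the Middle equation shown in the fitted model above:
# Odds ratio for one additional fireplace (Middle vs. Low)
exp(0.7879) # about 2.20: each added fireplace roughly doubles the odds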
The test-set results show that the model performs well overall, especially at distinguishing Low- and High-price homes. The Middle category is harder to classify because these homes share characteristics with both cheaper and more expensive homes, a common challenge when the middle category is created by cutting an ordered, continuous outcome into ranges.
Because the response variable has three categories, the model
estimates two equations relative to the baseline category,
Low.
\[ \log\left(\frac{\hat{p}_{Middle}}{\hat{p}_{Low}}\right) = -173.242 + 0.00340(\text{gr\_liv\_area}) + 0.816(\text{garage\_cars}) + 0.0473(\text{year\_built}) + 0.788(\text{fireplaces}) + 0.0214(\text{year\_remod\_add}) + 0.00155(\text{total\_bsmt\_sf}) + 0.0214(\text{full\_bath}) + 0.000170(\text{lot\_area}) + 1.327(\text{central\_air\_Y}) + \cdots \]
\[ \log\left(\frac{\hat{p}_{High}}{\hat{p}_{Low}}\right) = -320.777 + 0.00720(\text{gr\_liv\_area}) + 1.104(\text{garage\_cars}) + 0.0836(\text{year\_built}) + 1.311(\text{fireplaces}) + 0.0584(\text{year\_remod\_add}) + 0.00359(\text{total\_bsmt\_sf}) - 0.207(\text{full\_bath}) + 0.000252(\text{lot\_area}) + 0.487(\text{central\_air\_Y}) + \cdots \]
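The two equations convert to class probabilities through the softmax form, where eta_Middle and eta_High denote the right-hand sides of the equations above:
\[ \hat{p}_{Low} = \frac{1}{1 + e^{\eta_{Middle}} + e^{\eta_{High}}}, \quad \hat{p}_{Middle} = \frac{e^{\eta_{Middle}}}{1 + e^{\eta_{Middle}} + e^{\eta_{High}}}, \quad \hat{p}_{High} = \frac{e^{\eta_{High}}}{1 + e^{\eta_{Middle}} + e^{\eta_{High}}} \]
These are exactly the .pred_Low, .pred_Middle, and .pred_High probabilities produced earlier by predict() with type = "prob".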
Because Low is the reference category, the model estimates two log-odds equations: one comparing Middle with Low and one comparing High with Low. Positive coefficients increase the log-odds of being classified as Middle or High rather than Low, while negative coefficients decrease them, holding other predictors constant.
The coefficient patterns are consistent with expectations. Larger living area, more garage capacity, newer construction, more fireplaces, larger basement size, and larger lot area generally increase the likelihood that a home belongs to a higher price category rather than the Low category. For example, the coefficient for gr_liv_area is larger in the High equation than in the Middle equation, suggesting that living area is especially important for distinguishing high-priced homes from low-priced homes.
The model classified approximately 81.6% of test homes correctly. Its kappa value of 0.722 indicates substantial agreement beyond chance, which is important for a multiclass classification problem. The multinomial log loss of 0.509 suggests that the model’s predicted probabilities are reasonably informative, not just its final class labels.
Overall, the multinomial logistic regression model provides strong evidence that selected Ames housing characteristics can distinguish low-, middle-, and high-price homes. The model is especially useful for showing how regression-based ideas can be extended from continuous price prediction to categorical classification.
In this section, I use Poisson regression to model a count outcome
from the Ames Housing dataset. The response variable is
tot_rms_abv_grd, which measures the total number of
above-ground rooms in a home.
Because this outcome is a count, Poisson regression is a reasonable generalized linear model to consider. I also fit a standard linear regression model using the same predictors as a comparison model.
ames_train |>
ggplot(aes(x = tot_rms_abv_grd)) +
geom_bar(fill = "steelblue") +
labs(
title = "Distribution of Total Rooms Above Ground",
x = "Total Rooms Above Ground",
y = "Number of Homes"
) +
theme_minimal()
# Mean and variance of count outcome
ames_train |>
summarise(
mean_rooms = mean(tot_rms_abv_grd, na.rm = TRUE),
variance_rooms = var(tot_rms_abv_grd, na.rm = TRUE)
)
## # A tibble: 1 × 2
## mean_rooms variance_rooms
## <dbl> <dbl>
## 1 6.45 2.52
Notably, the training-set variance of the room count (2.52) is well below its mean (6.45), an early hint of the underdispersion examined at the end of this section.
The predictors were chosen to represent housing characteristics that may explain the number of above-ground rooms, including living area, bathrooms, age, basement size, garage capacity, and overall quality.
poisson_features <- c(
"gr_liv_area",
"year_built",
"year_remod_add",
"full_bath",
"half_bath",
"bedroom_abv_gr",
"fireplaces",
"total_bsmt_sf",
"garage_cars",
"overall_qual",
"overall_cond"
)
poisson_formula <- reformulate(
termlabels = poisson_features,
response = "tot_rms_abv_grd"
)
poisson_formula
## tot_rms_abv_grd ~ gr_liv_area + year_built + year_remod_add +
## full_bath + half_bath + bedroom_abv_gr + fireplaces + total_bsmt_sf +
## garage_cars + overall_qual + overall_cond
Before fitting the Poisson model, I fit a standard linear regression model to the same count outcome. This provides a comparison point, but a limitation of linear regression is that it can theoretically predict negative room counts.
# Linear regression model for comparison
room_lm_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
room_lm_recipe <- recipe(poisson_formula, data = ames_train) |>
step_novel(all_nominal_predictors()) |>
step_unknown(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
room_lm_workflow <- workflow() |>
add_recipe(room_lm_recipe) |>
add_model(room_lm_spec)
room_lm_fit <- fit(
room_lm_workflow,
data = ames_train
)
room_lm_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept) gr_liv_area
## 1.9959713 0.0019109
## year_built year_remod_add
## -0.0050185 0.0044161
## full_bath half_bath
## 0.0485959 -0.0117184
## bedroom_abv_gr fireplaces
## 0.6530162 -0.0184058
## total_bsmt_sf garage_cars
## -0.0001223 0.0817076
## overall_qual_Poor overall_qual_Fair
## 0.5060055 0.8032186
## overall_qual_Below_Average overall_qual_Average
## 0.4190687 0.5477962
## overall_qual_Above_Average overall_qual_Good
## 0.3643046 0.4540228
## overall_qual_Very_Good overall_qual_Excellent
## 0.4532981 0.9205828
## overall_qual_Very_Excellent overall_cond_Poor
## 0.5895420 0.8073662
## overall_cond_Fair overall_cond_Below_Average
## 0.2664559 0.3296302
## overall_cond_Average overall_cond_Above_Average
## 0.3437264 0.2727707
## overall_cond_Good overall_cond_Very_Good
## 0.1713796 0.1664163
## overall_cond_Excellent
## 0.0276620
The Poisson regression model is fit using poisson_reg()
with the glm engine. This model uses a log link, meaning
the model estimates the log of the expected room count.
poisson_spec <- poisson_reg() |>
set_engine("glm") |>
set_mode("regression")
poisson_recipe <- recipe(poisson_formula, data = ames_train) |>
step_novel(all_nominal_predictors()) |>
step_unknown(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
poisson_workflow <- workflow() |>
add_recipe(poisson_recipe) |>
add_model(poisson_spec)
poisson_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: poisson_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Poisson Regression Model Specification (regression)
##
## Computational engine: glm
final_poisson_fit <- fit(
poisson_workflow,
data = ames_train
)
final_poisson_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: poisson_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::poisson, data = data)
##
## Coefficients:
## (Intercept) gr_liv_area
## 1.192e+00 2.366e-04
## year_built year_remod_add
## -9.319e-04 8.003e-04
## full_bath half_bath
## 1.641e-02 5.887e-03
## bedroom_abv_gr fireplaces
## 1.019e-01 2.144e-03
## total_bsmt_sf garage_cars
## -1.428e-05 1.632e-02
## overall_qual_Poor overall_qual_Fair
## 6.241e-02 2.315e-01
## overall_qual_Below_Average overall_qual_Average
## 1.744e-01 2.047e-01
## overall_qual_Above_Average overall_qual_Good
## 1.838e-01 2.047e-01
## overall_qual_Very_Good overall_qual_Excellent
## 2.042e-01 2.669e-01
## overall_qual_Very_Excellent overall_cond_Poor
## 1.836e-01 1.074e-01
## overall_cond_Fair overall_cond_Below_Average
## 7.728e-03 2.181e-02
## overall_cond_Average overall_cond_Above_Average
## 2.233e-02 8.637e-03
## overall_cond_Good overall_cond_Very_Good
## -1.472e-02 -1.614e-02
## overall_cond_Excellent
## -4.118e-02
##
## Degrees of Freedom: 2343 Total (i.e. Null); 2317 Residual
## Null Deviance: 891.9
## Residual Deviance: 240.3 AIC: 8966
poisson_tidy <- final_poisson_fit |>
extract_fit_parsnip() |>
tidy() |>
mutate(
rate_ratio = exp(estimate)
) |>
arrange(p.value)
poisson_tidy
## # A tibble: 27 × 6
## term estimate std.error statistic p.value rate_ratio
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bedroom_abv_gr 0.102 0.0129 7.91 2.62e-15 1.11
## 2 gr_liv_area 0.000237 0.0000310 7.62 2.53e-14 1.00
## 3 year_built -0.000932 0.000510 -1.83 6.78e- 2 0.999
## 4 year_remod_add 0.000800 0.000609 1.31 1.89e- 1 1.00
## 5 garage_cars 0.0163 0.0145 1.13 2.60e- 1 1.02
## 6 overall_qual_Excellent 0.267 0.262 1.02 3.09e- 1 1.31
## 7 (Intercept) 1.19 1.21 0.985 3.24e- 1 3.29
## 8 overall_qual_Fair 0.231 0.264 0.876 3.81e- 1 1.26
## 9 overall_qual_Average 0.205 0.257 0.798 4.25e- 1 1.23
## 10 overall_qual_Good 0.205 0.258 0.793 4.28e- 1 1.23
## # ℹ 17 more rows
poisson_glance <- final_poisson_fit |>
extract_fit_parsnip() |>
glance()
poisson_glance
## # A tibble: 1 × 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 892. 2343 -4456. 8966. 9122. 240. 2317 2344
room_lm_preds <- predict(
room_lm_fit,
new_data = ames_test
) |>
rename(lm_pred = .pred)
poisson_preds <- predict(
final_poisson_fit,
new_data = ames_test
) |>
rename(poisson_pred = .pred)
room_pred_comparison <- ames_test |>
select(tot_rms_abv_grd) |>
bind_cols(room_lm_preds, poisson_preds) |>
mutate(
lm_residual = tot_rms_abv_grd - lm_pred,
poisson_residual = tot_rms_abv_grd - poisson_pred
)
head(room_pred_comparison)
## # A tibble: 6 × 5
## tot_rms_abv_grd lm_pred poisson_pred lm_residual poisson_residual
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 6 6.88 6.75 -0.878 -0.745
## 2 6 5.83 5.95 0.171 0.0469
## 3 7 6.92 6.85 0.0807 0.151
## 4 10 9.11 9.15 0.885 0.851
## 5 7 7.95 7.84 -0.954 -0.842
## 6 6 5.41 5.53 0.590 0.473
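Because the linear model is unconstrained while the Poisson model's log link guarantees strictly positive predictions, it is worth checking whether the linear model ever produces implausibly small room counts on the test set:
# Smallest predicted room counts from each model; only the linear model
# can, in principle, dip to zero or below
room_pred_comparison |>
  summarise(
    min_lm_pred = min(lm_pred),
    min_poisson_pred = min(poisson_pred)
  )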
room_lm_metrics <- room_pred_comparison |>
metrics(
truth = tot_rms_abv_grd,
estimate = lm_pred
) |>
mutate(model = "Linear Regression")
room_lm_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.768 Linear Regression
## 2 rsq standard 0.742 Linear Regression
## 3 mae standard 0.586 Linear Regression
poisson_metrics <- room_pred_comparison |>
metrics(
truth = tot_rms_abv_grd,
estimate = poisson_pred
) |>
mutate(model = "Poisson Regression")
poisson_metrics
## # A tibble: 3 × 4
## .metric .estimator .estimate model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.784 Poisson Regression
## 2 rsq standard 0.731 Poisson Regression
## 3 mae standard 0.601 Poisson Regression
room_model_comparison <- bind_rows(
room_lm_metrics,
poisson_metrics
) |>
select(model, .metric, .estimate) |>
pivot_wider(
names_from = .metric,
values_from = .estimate
)
room_model_comparison
## # A tibble: 2 × 4
## model rmse rsq mae
## <chr> <dbl> <dbl> <dbl>
## 1 Linear Regression 0.768 0.742 0.586
## 2 Poisson Regression 0.784 0.731 0.601
room_pred_comparison |>
ggplot(aes(x = lm_pred, y = poisson_pred)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Comparison of Linear and Poisson Model Predictions",
x = "Predicted Rooms from Linear Regression",
y = "Predicted Rooms from Poisson Regression"
) +
theme_minimal()
room_pred_comparison |>
ggplot(aes(x = tot_rms_abv_grd, y = poisson_pred)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Poisson Regression: Actual vs. Predicted Total Rooms",
x = "Actual Total Rooms Above Ground",
y = "Predicted Total Rooms Above Ground"
) +
theme_minimal()
room_pred_comparison |>
ggplot(aes(x = poisson_pred, y = poisson_residual)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Poisson Regression Residuals",
x = "Predicted Total Rooms Above Ground",
y = "Residual"
) +
theme_minimal()
# Pearson dispersion ratio: sum of squared Pearson residuals divided by the
# residual degrees of freedom; values near 1 match the Poisson assumption,
# while values well below 1 indicate underdispersion
poisson_glm <- extract_fit_parsnip(final_poisson_fit)$fit
dispersion_ratio <- sum(residuals(poisson_glm, type = "pearson")^2) /
  poisson_glm$df.residual
dispersion_ratio
## [1] 0.102777
The Poisson regression model was used to model
tot_rms_abv_grd, a count outcome representing the number of
above-ground rooms in a home. The strongest predictors were the number
of above-ground bedrooms and above-ground living area. Each additional
above-ground bedroom was associated with a 10.7% higher expected room
count, holding other variables constant. Similarly, each additional 100
square feet of above-ground living area was associated with about a 2.4%
higher expected room count.
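These percentages follow directly from exponentiating the fitted coefficients, since the log link makes the model multiplicative on the response scale:
# Rate ratios implied by the Poisson coefficients
exp(0.102) # bedrooms: about 1.107, i.e., 10.7% more rooms per bedroom
exp(0.000237 * 100) # living area: about 1.024, i.e., 2.4% per extra 100 sq ft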
Most quality and condition variables were not statistically significant after accounting for size and bedrooms. This is reasonable because quality and condition are more directly related to home value than to the number of rooms. The model deviance decreased substantially from the null model to the fitted model, showing that the predictors improved model fit. However, the residual deviance was much smaller than the residual degrees of freedom, suggesting underdispersion. This means the room-count outcome is less variable than expected under a standard Poisson model.
\[ \log\left(\widehat{E(\text{Total Rooms Above Ground})}\right) = 1.192 + 0.000237(\text{gr\_liv\_area}) - 0.000932(\text{year\_built}) + 0.000800(\text{year\_remod\_add}) + 0.0164(\text{full\_bath}) + 0.00589(\text{half\_bath}) + 0.1019(\text{bedroom\_abv\_gr}) + 0.00214(\text{fireplaces}) - 0.0000143(\text{total\_bsmt\_sf}) + 0.0163(\text{garage\_cars}) + \cdots \]
The ellipsis represents dummy-variable terms for overall_qual and overall_cond. Since the model uses a log link, exponentiating both sides gives the expected room count:
\[ \widehat{E(\text{Total Rooms Above Ground})} = e^{ 1.192 + 0.000237(\text{gr\_liv\_area}) - 0.000932(\text{year\_built}) + 0.000800(\text{year\_remod\_add}) + 0.0164(\text{full\_bath}) + 0.00589(\text{half\_bath}) + 0.1019(\text{bedroom\_abv\_gr}) + 0.00214(\text{fireplaces}) - 0.0000143(\text{total\_bsmt\_sf}) + 0.0163(\text{garage\_cars}) + \cdots } \]
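Finally, the underdispersion noted above does not bias the coefficient estimates, but it does make the default Poisson standard errors conservative. One standard remedy is a quasi-Poisson fit, which estimates the dispersion parameter from the data rather than fixing it at 1. A minimal sketch with stats::glm(), which handles the factor predictors directly without the recipe steps:
# Quasi-Poisson fit: same mean structure, dispersion estimated from the data
quasi_fit <- glm(poisson_formula, data = ames_train, family = quasipoisson)
# Estimated dispersion; should sit near the Pearson ratio computed above (~0.10)
summary(quasi_fit)$dispersion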