Introduction

This portfolio uses the Ames Housing dataset to demonstrate statistical modeling with R and tidymodels. The dataset includes information about residential homes sold in Ames, Iowa, including size, quality, location, amenities, and sale price.

The main outcome is sale_price, the final selling price of a home. Because home prices are usually right-skewed, I also create log_sale_price, a transformed version of the outcome used for the main regression models. This transformation helps make the modeling framework more appropriate and improves interpretation of relative changes in price.

The analysis focuses on one guiding question:

Which housing characteristics are most useful for explaining and predicting home sale prices?

To answer this question, I use multiple regression, polynomial regression, lasso regression, multinomial logistic regression, and count or classification modeling. Together, these models demonstrate how statistical methods can be used to study both continuous and categorical outcomes.

Load and Describe the Data

Loading libraries

library(tidymodels)
library(AmesHousing)
library(tidyverse)
library(janitor)
library(skimr)
library(scales)
library(forcats)
library(broom)
library(glmnet)
library(vip)
library(MASS)
library(poissonreg)

tidymodels_prefer()

Loading data

set.seed(42)

ames <- make_ames() |>
  clean_names() |>
  select(-longitude, -latitude) |>
  mutate(log_sale_price = log(sale_price))
glimpse(ames)
## Rows: 2,930
## Columns: 80
## $ ms_sub_class       <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ ms_zoning          <fct> Residential_Low_Density, Residential_High_Density, …
## $ lot_frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ lot_area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ lot_shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ land_contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ lot_config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ land_slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ bldg_type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ house_style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ overall_qual       <fct> Above_Average, Average, Above_Average, Good, Averag…
## $ overall_cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ year_built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ year_remod_add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ roof_style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ roof_matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ mas_vnr_type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ mas_vnr_area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ exter_qual         <fct> Typical, Typical, Typical, Good, Typical, Typical, …
## $ exter_cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ bsmt_qual          <fct> Typical, Typical, Typical, Typical, Good, Typical, …
## $ bsmt_cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ bsmt_exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ bsmt_fin_type_1    <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ bsmt_fin_sf_1      <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ bsmt_fin_type_2    <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ bsmt_fin_sf_2      <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ bsmt_unf_sf        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ total_bsmt_sf      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ heating_qc         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ central_air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ first_flr_sf       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ second_flr_sf      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ low_qual_fin_sf    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ bsmt_full_bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ bsmt_half_bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ half_bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ bedroom_abv_gr     <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ kitchen_abv_gr     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual       <fct> Typical, Typical, Good, Excellent, Typical, Good, G…
## $ tot_rms_abv_grd    <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ fireplace_qu       <fct> Good, No_Fireplace, No_Fireplace, Typical, Typical,…
## $ garage_type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ garage_finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ garage_cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ garage_area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ garage_qual        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ garage_cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ paved_drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ wood_deck_sf       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ open_porch_sf      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ enclosed_porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ screen_porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ pool_area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ misc_feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ misc_val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ mo_sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ year_sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ sale_type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ sale_condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ sale_price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ log_sale_price     <dbl> 12.27839, 11.56172, 12.05525, 12.40492, 12.15425, 1…
dim(ames)
## [1] 2930   80
sum(is.na(ames))
## [1] 0
skim(ames)
Data summary
Name ames
Number of rows 2930
Number of columns 80
_______________________
Column type frequency:
factor 46
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ms_sub_class 0 1 FALSE 16 One: 1079, Two: 575, One: 287, One: 192
ms_zoning 0 1 FALSE 7 Res: 2273, Res: 462, Flo: 139, Res: 27
street 0 1 FALSE 2 Pav: 2918, Grv: 12
alley 0 1 FALSE 3 No_: 2732, Gra: 120, Pav: 78
lot_shape 0 1 FALSE 4 Reg: 1859, Sli: 979, Mod: 76, Irr: 16
land_contour 0 1 FALSE 4 Lvl: 2633, HLS: 120, Bnk: 117, Low: 60
utilities 0 1 FALSE 3 All: 2927, NoS: 2, NoS: 1
lot_config 0 1 FALSE 5 Ins: 2140, Cor: 511, Cul: 180, FR2: 85
land_slope 0 1 FALSE 3 Gtl: 2789, Mod: 125, Sev: 16
neighborhood 0 1 FALSE 28 Nor: 443, Col: 267, Old: 239, Edw: 194
condition_1 0 1 FALSE 9 Nor: 2522, Fee: 164, Art: 92, RRA: 50
condition_2 0 1 FALSE 8 Nor: 2900, Fee: 13, Art: 5, Pos: 4
bldg_type 0 1 FALSE 5 One: 2425, Twn: 233, Dup: 109, Twn: 101
house_style 0 1 FALSE 8 One: 1481, Two: 873, One: 314, SLv: 128
overall_qual 0 1 FALSE 10 Ave: 825, Abo: 732, Goo: 602, Ver: 350
overall_cond 0 1 FALSE 9 Ave: 1654, Abo: 533, Goo: 390, Ver: 144
roof_style 0 1 FALSE 6 Gab: 2321, Hip: 551, Gam: 22, Fla: 20
roof_matl 0 1 FALSE 8 Com: 2887, Tar: 23, WdS: 9, WdS: 7
exterior_1st 0 1 FALSE 16 Vin: 1026, Met: 450, HdB: 442, Wd : 420
exterior_2nd 0 1 FALSE 17 Vin: 1015, Met: 447, HdB: 406, Wd : 397
mas_vnr_type 0 1 FALSE 5 Non: 1775, Brk: 880, Sto: 249, Brk: 25
exter_qual 0 1 FALSE 4 Typ: 1799, Goo: 989, Exc: 107, Fai: 35
exter_cond 0 1 FALSE 5 Typ: 2549, Goo: 299, Fai: 67, Exc: 12
foundation 0 1 FALSE 6 PCo: 1310, CBl: 1244, Brk: 311, Sla: 49
bsmt_qual 0 1 FALSE 6 Typ: 1283, Goo: 1219, Exc: 258, Fai: 88
bsmt_cond 0 1 FALSE 6 Typ: 2616, Goo: 122, Fai: 104, No_: 80
bsmt_exposure 0 1 FALSE 5 No: 1906, Av: 418, Gd: 284, Mn: 239
bsmt_fin_type_1 0 1 FALSE 7 GLQ: 859, Unf: 851, ALQ: 429, Rec: 288
bsmt_fin_type_2 0 1 FALSE 7 Unf: 2499, Rec: 106, LwQ: 89, No_: 81
heating 0 1 FALSE 6 Gas: 2885, Gas: 27, Gra: 9, Wal: 6
heating_qc 0 1 FALSE 5 Exc: 1495, Typ: 864, Goo: 476, Fai: 92
central_air 0 1 FALSE 2 Y: 2734, N: 196
electrical 0 1 FALSE 6 SBr: 2682, Fus: 188, Fus: 50, Fus: 8
kitchen_qual 0 1 FALSE 5 Typ: 1494, Goo: 1160, Exc: 205, Fai: 70
functional 0 1 FALSE 8 Typ: 2728, Min: 70, Min: 65, Mod: 35
fireplace_qu 0 1 FALSE 6 No_: 1422, Goo: 744, Typ: 600, Fai: 75
garage_type 0 1 FALSE 7 Att: 1731, Det: 782, Bui: 186, No_: 157
garage_finish 0 1 FALSE 4 Unf: 1231, RFn: 812, Fin: 728, No_: 159
garage_qual 0 1 FALSE 6 Typ: 2615, No_: 159, Fai: 124, Goo: 24
garage_cond 0 1 FALSE 6 Typ: 2665, No_: 159, Fai: 74, Goo: 15
paved_drive 0 1 FALSE 3 Pav: 2652, Dir: 216, Par: 62
pool_qc 0 1 FALSE 5 No_: 2917, Exc: 4, Goo: 4, Typ: 3
fence 0 1 FALSE 5 No_: 2358, Min: 330, Goo: 118, Goo: 112
misc_feature 0 1 FALSE 6 Non: 2824, She: 95, Gar: 5, Oth: 4
sale_type 0 1 FALSE 10 WD : 2536, New: 239, COD: 87, Con: 26
sale_condition 0 1 FALSE 6 Nor: 2413, Par: 245, Abn: 190, Fam: 46

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lot_frontage 0 1 57.65 33.50 0.00 43.00 63.00 78.00 313.00 ▇▇▁▁▁
lot_area 0 1 10147.92 7880.02 1300.00 7440.25 9436.50 11555.25 215245.00 ▇▁▁▁▁
year_built 0 1 1971.36 30.25 1872.00 1954.00 1973.00 2001.00 2010.00 ▁▂▃▆▇
year_remod_add 0 1 1984.27 20.86 1950.00 1965.00 1993.00 2004.00 2010.00 ▅▂▂▃▇
mas_vnr_area 0 1 101.10 178.63 0.00 0.00 0.00 162.75 1600.00 ▇▁▁▁▁
bsmt_fin_sf_1 0 1 4.18 2.23 0.00 3.00 3.00 7.00 7.00 ▃▂▇▁▇
bsmt_fin_sf_2 0 1 49.71 169.14 0.00 0.00 0.00 0.00 1526.00 ▇▁▁▁▁
bsmt_unf_sf 0 1 559.07 439.54 0.00 219.00 465.50 801.75 2336.00 ▇▅▂▁▁
total_bsmt_sf 0 1 1051.26 440.97 0.00 793.00 990.00 1301.50 6110.00 ▇▃▁▁▁
first_flr_sf 0 1 1159.56 391.89 334.00 876.25 1084.00 1384.00 5095.00 ▇▃▁▁▁
second_flr_sf 0 1 335.46 428.40 0.00 0.00 0.00 703.75 2065.00 ▇▃▂▁▁
low_qual_fin_sf 0 1 4.68 46.31 0.00 0.00 0.00 0.00 1064.00 ▇▁▁▁▁
gr_liv_area 0 1 1499.69 505.51 334.00 1126.00 1442.00 1742.75 5642.00 ▇▇▁▁▁
bsmt_full_bath 0 1 0.43 0.52 0.00 0.00 0.00 1.00 3.00 ▇▆▁▁▁
bsmt_half_bath 0 1 0.06 0.25 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
full_bath 0 1 1.57 0.55 0.00 1.00 2.00 2.00 4.00 ▁▇▇▁▁
half_bath 0 1 0.38 0.50 0.00 0.00 0.00 1.00 2.00 ▇▁▅▁▁
bedroom_abv_gr 0 1 2.85 0.83 0.00 2.00 3.00 3.00 8.00 ▁▇▂▁▁
kitchen_abv_gr 0 1 1.04 0.21 0.00 1.00 1.00 1.00 3.00 ▁▇▁▁▁
tot_rms_abv_grd 0 1 6.44 1.57 2.00 5.00 6.00 7.00 15.00 ▁▇▂▁▁
fireplaces 0 1 0.60 0.65 0.00 0.00 1.00 1.00 4.00 ▇▇▁▁▁
garage_cars 0 1 1.77 0.76 0.00 1.00 2.00 2.00 5.00 ▅▇▂▁▁
garage_area 0 1 472.66 215.19 0.00 320.00 480.00 576.00 1488.00 ▃▇▃▁▁
wood_deck_sf 0 1 93.75 126.36 0.00 0.00 0.00 168.00 1424.00 ▇▁▁▁▁
open_porch_sf 0 1 47.53 67.48 0.00 0.00 27.00 70.00 742.00 ▇▁▁▁▁
enclosed_porch 0 1 23.01 64.14 0.00 0.00 0.00 0.00 1012.00 ▇▁▁▁▁
three_season_porch 0 1 2.59 25.14 0.00 0.00 0.00 0.00 508.00 ▇▁▁▁▁
screen_porch 0 1 16.00 56.09 0.00 0.00 0.00 0.00 576.00 ▇▁▁▁▁
pool_area 0 1 2.24 35.60 0.00 0.00 0.00 0.00 800.00 ▇▁▁▁▁
misc_val 0 1 50.64 566.34 0.00 0.00 0.00 0.00 17000.00 ▇▁▁▁▁
mo_sold 0 1 6.22 2.71 1.00 4.00 6.00 8.00 12.00 ▅▆▇▃▃
year_sold 0 1 2007.79 1.32 2006.00 2007.00 2008.00 2009.00 2010.00 ▇▇▇▇▃
sale_price 0 1 180796.06 79886.69 12789.00 129500.00 160000.00 213500.00 755000.00 ▇▇▁▁▁
log_sale_price 0 1 12.02 0.41 9.46 11.77 11.98 12.27 13.53 ▁▁▆▇▁

The Ames Housing dataset contains 2,930 observations and a large set of variables describing the physical characteristics, quality, location, and sale information for residential properties. The response variable for the first part of the portfolio is sale_price. Because the dataset contains many possible predictors, the analysis begins broadly and then uses both exploratory analysis and formal feature-selection methods to narrow the model.

Exploratory Data Analysis

Distribution of Sale Price

ames |>
  ggplot(aes(x = sale_price)) +
  geom_histogram(bins = 40, fill = "#2C7FB8", color = "white") +
  scale_x_continuous(labels = dollar) +
  labs(
    title = "Distribution of Sale Prices",
    x = "Sale Price",
    y = "Count"
  )

ames |>
  ggplot(aes(x = log_sale_price)) +
  geom_histogram(bins = 40, fill = "#7FCDBB", color = "white") +
  labs(
    title = "Distribution of Log(Sale Price)",
    x = "Log Sale Price",
    y = "Count"
  )

The transformed outcome is more symmetric than the original sale price distribution. For this reason, log_sale_price is used as the primary outcome for the continuous regression models. The original sale_price is retained only for interpretation and for later derived outcomes.

Univariate exploration: categorical predictors

Sale Price vs Living Area

ames |>
  ggplot(aes(x = gr_liv_area, y = sale_price)) +
  geom_point(alpha = 0.4, color = "#2C7FB8") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Sale Price vs. Living Area",
    x = "Above Ground Living Area (sq ft)",
    y = "Sale Price"
  )

Neighborhood Effects

ames |>
  mutate(neighborhood = fct_reorder(neighborhood, sale_price, .fun = median)) |>
  ggplot(aes(x = neighborhood, y = sale_price)) +
  geom_boxplot(fill = "#A1D99B") +
  coord_flip() +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Sale Price by Neighborhood",
    x = "Neighborhood",
    y = "Sale Price"
  )

Overall Quality and Sale Price

ames |>
  ggplot(aes(x = overall_qual, y = sale_price)) +
  geom_boxplot(fill = "#FDAE6B") +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Sale Price by Overall Quality",
    x = "Overall Quality",
    y = "Sale Price"
  )

The categorical plots suggest that several qualitative features are meaningfully related to sale price. In particular, neighborhood and overall quality show strong separation in median log sale price.

# Correlations among numeric variables
numeric_correlations <- ames |>
  select(where(is.numeric)) |>
  cor(use = "pairwise.complete.obs") |>
  as.data.frame() |>
  rownames_to_column("variable_1") |>
  pivot_longer(
    cols = -variable_1,
    names_to = "variable_2",
    values_to = "correlation"
  )

# Keep only stronger relationships
strong_numeric_correlations <- numeric_correlations |>
  filter(
    variable_1 != variable_2,
    abs(correlation) >= 0.3
  )

# Plot
strong_numeric_correlations |>
  ggplot(aes(x = variable_1, y = variable_2, fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 2)), size = 3) +
   scale_fill_viridis_c(
    option = "D",
    limits = c(-1, 1)
  ) +
  coord_fixed() +
  labs(
    title = "Strong Correlations Among Numeric Variables",
    x = NULL,
    y = NULL,
    fill = "Correlation"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

The exploratory analysis shows that sale price is strongly right-skewed, motivating the use of a log transformation. Living area and overall quality appear to have strong positive relationships with price, while neighborhood differences suggest meaningful variation across locations. These patterns guide the selection of predictors and modeling approaches in the following sections.

Data Splitting and Preprocessing

Train/Test Split

set.seed(42)

ames_split <- initial_split(ames, prop = 0.8)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

Cross-Validation Setup

set.seed(42)

ames_folds <- vfold_cv(ames_train, v = 5)

Preprocessing Recipe

ames_recipe <- recipe(log_sale_price ~ ., data = ames_train) |>
  step_rm(sale_price) |>
  step_log(gr_liv_area, base = 10) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Preview Processed data

ames_prep <- prep(ames_recipe)
ames_baked <- bake(ames_prep, new_data = NULL)

glimpse(ames_baked)
## Rows: 2,344
## Columns: 302
## $ lot_frontage                                           <dbl> 0.7468160, -1.1…
## $ lot_area                                               <dbl> 0.008845521, -1…
## $ year_built                                             <dbl> -0.29500974, -0…
## $ year_remod_add                                         <dbl> -1.0272152, -0.…
## $ mas_vnr_area                                           <dbl> 1.53612533, 0.7…
## $ bsmt_fin_sf_1                                          <dbl> -1.43710624, 1.…
## $ bsmt_fin_sf_2                                          <dbl> -0.2945113, -0.…
## $ bsmt_unf_sf                                            <dbl> -0.1356635, 0.2…
## $ total_bsmt_sf                                          <dbl> 0.06460710, -0.…
## $ first_flr_sf                                           <dbl> -0.04002436, -1…
## $ second_flr_sf                                          <dbl> -0.7831775, 0.4…
## $ low_qual_fin_sf                                        <dbl> -0.1041543, -0.…
## $ gr_liv_area                                            <dbl> -0.6688422, -0.…
## $ bsmt_full_bath                                         <dbl> -0.8224714, -0.…
## $ bsmt_half_bath                                         <dbl> 3.6941629, -0.2…
## $ full_bath                                              <dbl> -1.0426515, -1.…
## $ half_bath                                              <dbl> -0.7593844, 1.2…
## $ bedroom_abv_gr                                         <dbl> 0.1761544, 0.17…
## $ kitchen_abv_gr                                         <dbl> -0.2030956, -0.…
## $ tot_rms_abv_grd                                        <dbl> -0.2826372, 0.3…
## $ fireplaces                                             <dbl> 0.6281975, -0.9…
## $ garage_cars                                            <dbl> -0.9975764, -2.…
## $ garage_area                                            <dbl> -0.82502805, -2…
## $ wood_deck_sf                                           <dbl> 2.0253838, 0.86…
## $ open_porch_sf                                          <dbl> 0.4213274, -0.7…
## $ enclosed_porch                                         <dbl> -0.3512185, -0.…
## $ three_season_porch                                     <dbl> -0.09985435, -0…
## $ screen_porch                                           <dbl> -0.2849768, -0.…
## $ pool_area                                              <dbl> -0.06412225, -0…
## $ misc_val                                               <dbl> -0.08763635, -0…
## $ mo_sold                                                <dbl> -0.0732589, -0.…
## $ year_sold                                              <dbl> -1.3783358, -1.…
## $ log_sale_price                                         <dbl> 12.01370, 11.42…
## $ ms_sub_class_One_Story_1945_and_Older                  <dbl> -0.2154269, -0.…
## $ ms_sub_class_One_Story_with_Finished_Attic_All_Ages    <dbl> -0.04622503, -0…
## $ ms_sub_class_One_and_Half_Story_Unfinished_All_Ages    <dbl> -0.07749852, -0…
## $ ms_sub_class_One_and_Half_Story_Finished_All_Ages      <dbl> -0.3241824, -0.…
## $ ms_sub_class_Two_Story_1946_and_Newer                  <dbl> -0.5013591, -0.…
## $ ms_sub_class_Two_Story_1945_and_Older                  <dbl> -0.2110551, -0.…
## $ ms_sub_class_Two_and_Half_Story_All_Ages               <dbl> -0.08795059, -0…
## $ ms_sub_class_Split_or_Multilevel                       <dbl> 5.062156, -0.19…
## $ ms_sub_class_Split_Foyer                               <dbl> -0.124865, -0.1…
## $ ms_sub_class_Duplex_All_Styles_and_Ages                <dbl> -0.1951168, -0.…
## $ ms_sub_class_One_Story_PUD_1946_and_Newer              <dbl> -0.2733212, -0.…
## $ ms_sub_class_One_and_Half_Story_PUD_All_Ages           <dbl> -0.02065481, -0…
## $ ms_sub_class_Two_Story_PUD_1946_and_Newer              <dbl> -0.2154269, 4.6…
## $ ms_sub_class_PUD_Multilevel_Split_Level_Foyer          <dbl> -0.07171891, -0…
## $ ms_sub_class_Two_Family_conversion_All_Styles_and_Ages <dbl> -0.1505919, -0.…
## $ ms_zoning_Residential_High_Density                     <dbl> -0.1016878, -0.…
## $ ms_zoning_Residential_Low_Density                      <dbl> 0.5410728, -1.8…
## $ ms_zoning_Residential_Medium_Density                   <dbl> -0.4293692, 2.3…
## $ ms_zoning_A_agr                                        <dbl> -0.02921655, -0…
## $ ms_zoning_C_all                                        <dbl> -0.09731674, -0…
## $ ms_zoning_I_all                                        <dbl> -0.02921655, -0…
## $ street_Pave                                            <dbl> 0.05850809, 0.0…
## $ alley_No_Alley_Access                                  <dbl> 0.2724191, 0.27…
## $ alley_Paved                                            <dbl> -0.1661556, -0.…
## $ lot_shape_Slightly_Irregular                           <dbl> -0.7026624, -0.…
## $ lot_shape_Moderately_Irregular                         <dbl> -0.1661556, -0.…
## $ lot_shape_Irregular                                    <dbl> -0.08023581, -0…
## $ land_contour_HLS                                       <dbl> -0.2054823, -0.…
## $ land_contour_Low                                       <dbl> -0.144558, -0.1…
## $ land_contour_Lvl                                       <dbl> 0.3360985, 0.33…
## $ utilities_NoSeWa                                       <dbl> -0.02065481, -0…
## $ utilities_NoSewr                                       <dbl> -0.02921655, -0…
## $ lot_config_CulDSac                                     <dbl> -0.2567312, -0.…
## $ lot_config_FR2                                         <dbl> -0.1728127, -0.…
## $ lot_config_FR3                                         <dbl> -0.06865091, -0…
## $ lot_config_Inside                                      <dbl> 0.6107455, 0.61…
## $ land_slope_Mod                                         <dbl> -0.2132503, -0.…
## $ land_slope_Sev                                         <dbl> -0.07749852, -0…
## $ neighborhood_College_Creek                             <dbl> -0.3209593, -0.…
## $ neighborhood_Old_Town                                  <dbl> -0.2969349, -0.…
## $ neighborhood_Edwards                                   <dbl> -0.2660421, -0.…
## $ neighborhood_Somerset                                  <dbl> -0.2623472, -0.…
## $ neighborhood_Northridge_Heights                        <dbl> -0.2481288, -0.…
## $ neighborhood_Gilbert                                   <dbl> -0.247159, -0.2…
## $ neighborhood_Sawyer                                    <dbl> -0.2218514, -0.…
## $ neighborhood_Northwest_Ames                            <dbl> -0.2088409, -0.…
## $ neighborhood_Sawyer_West                               <dbl> -0.2088409, -0.…
## $ neighborhood_Mitchell                                  <dbl> -0.2043525, -0.…
## $ neighborhood_Brookside                                 <dbl> -0.1903566, -0.…
## $ neighborhood_Crawford                                  <dbl> -0.1891505, -0.…
## $ neighborhood_Iowa_DOT_and_Rail_Road                    <dbl> -0.1817694, -0.…
## $ neighborhood_Timberland                                <dbl> -0.1535276, -0.…
## $ neighborhood_Northridge                                <dbl> -0.1647955, -0.…
## $ neighborhood_Stone_Brook                               <dbl> -0.1366732, -0.…
## $ neighborhood_South_and_West_of_Iowa_State_University   <dbl> -0.1283422, -0.…
## $ neighborhood_Clear_Creek                               <dbl> -0.1283422, -0.…
## $ neighborhood_Meadow_Village                            <dbl> -0.1157445, -0.…
## $ neighborhood_Briardale                                 <dbl> -0.1058857, 9.4…
## $ neighborhood_Bloomington_Heights                       <dbl> -0.1016878, -0.…
## $ neighborhood_Veenker                                   <dbl> -0.08795059, -0…
## $ neighborhood_Northpark_Villa                           <dbl> -0.09038007, -0…
## $ neighborhood_Blueste                                   <dbl> -0.06207049, -0…
## $ neighborhood_Greens                                    <dbl> -0.0547176, -0.…
## $ neighborhood_Green_Hills                               <dbl> -0.02921655, -0…
## $ neighborhood_Landmark                                  <dbl> -0.02065481, -0…
## $ condition_1_Feedr                                      <dbl> -0.239291, -0.2…
## $ condition_1_Norm                                       <dbl> 0.4025578, 0.40…
## $ condition_1_PosA                                       <dbl> -0.08023581, -0…
## $ condition_1_PosN                                       <dbl> -0.1157445, -0.…
## $ condition_1_RRAe                                       <dbl> -0.1016878, -0.…
## $ condition_1_RRAn                                       <dbl> -0.1300482, -0.…
## $ condition_1_RRNe                                       <dbl> -0.03579046, -0…
## $ condition_1_RRNn                                       <dbl> -0.0547176, -0.…
## $ condition_2_Feedr                                      <dbl> -0.06865091, -0…
## $ condition_2_Norm                                       <dbl> 0.1038071, 0.10…
## $ condition_2_PosA                                       <dbl> -0.03579046, -0…
## $ condition_2_PosN                                       <dbl> -0.03579046, -0…
## $ condition_2_RRAe                                       <dbl> -0.02065481, -0…
## $ condition_2_RRAn                                       <dbl> -0.02065481, -0…
## $ condition_2_RRNn                                       <dbl> -0.02065481, -0…
## $ bldg_type_TwoFmCon                                     <dbl> -0.1520662, -0.…
## $ bldg_type_Duplex                                       <dbl> -0.1951168, -0.…
## $ bldg_type_Twnhs                                        <dbl> -0.1939361, 5.1…
## $ bldg_type_TwnhsE                                       <dbl> -0.2969349, -0.…
## $ house_style_One_and_Half_Unf                           <dbl> -0.08023581, -0…
## $ house_style_One_Story                                  <dbl> -1.0143954, -1.…
## $ house_style_SFoyer                                     <dbl> -0.1647955, -0.…
## $ house_style_SLvl                                       <dbl> 4.7862919, -0.2…
## $ house_style_Two_and_Half_Fin                           <dbl> -0.05850809, -0…
## $ house_style_Two_and_Half_Unf                           <dbl> -0.08795059, -0…
## $ house_style_Two_Story                                  <dbl> -0.6563761, 1.5…
## $ overall_qual_Poor                                      <dbl> -0.06544205, -0…
## $ overall_qual_Fair                                      <dbl> -0.117622, -0.1…
## $ overall_qual_Below_Average                             <dbl> -0.2900783, -0.…
## $ overall_qual_Average                                   <dbl> -0.6226058, -0.…
## $ overall_qual_Above_Average                             <dbl> 1.7658361, 1.76…
## $ overall_qual_Good                                      <dbl> -0.5186234, -0.…
## $ overall_qual_Very_Good                                 <dbl> -0.3749301, -0.…
## $ overall_qual_Excellent                                 <dbl> -0.1891505, -0.…
## $ overall_qual_Very_Excellent                            <dbl> -0.1038071, -0.…
## $ overall_cond_Poor                                      <dbl> -0.0547176, -0.…
## $ overall_cond_Fair                                      <dbl> -0.1398763, -0.…
## $ overall_cond_Below_Average                             <dbl> -0.1867183, -0.…
## $ overall_cond_Average                                   <dbl> 0.8623979, 0.86…
## $ overall_cond_Above_Average                             <dbl> -0.463051, -0.4…
## $ overall_cond_Good                                      <dbl> -0.3844881, -0.…
## $ overall_cond_Very_Good                                 <dbl> -0.2260515, -0.…
## $ overall_cond_Excellent                                 <dbl> -0.1194715, -0.…
## $ roof_style_Gable                                       <dbl> 0.510004, 0.510…
## $ roof_style_Gambrel                                     <dbl> -0.09038007, -0…
## $ roof_style_Hip                                         <dbl> -0.4765797, -0.…
## $ roof_style_Mansard                                     <dbl> -0.06544205, -0…
## $ roof_style_Shed                                        <dbl> -0.04133609, -0…
## $ roof_matl_CompShg                                      <dbl> 0.124865, 0.124…
## $ roof_matl_Membran                                      <dbl> -0.02065481, -0…
## $ roof_matl_Roll                                         <dbl> -0.02065481, -0…
## $ roof_matl_Tar.Grv                                      <dbl> -0.09038007, -0…
## $ roof_matl_WdShake                                      <dbl> -0.06207049, -0…
## $ roof_matl_WdShngl                                      <dbl> -0.04622503, -0…
## $ exterior_1st_AsphShn                                   <dbl> -0.02921655, -0…
## $ exterior_1st_BrkComm                                   <dbl> -0.04622503, -0…
## $ exterior_1st_BrkFace                                   <dbl> -0.1754127, -0.…
## $ exterior_1st_CBlock                                    <dbl> -0.02921655, -0…
## $ exterior_1st_CemntBd                                   <dbl> -0.2066069, 4.8…
## $ exterior_1st_HdBoard                                   <dbl> 2.4189468, -0.4…
## $ exterior_1st_ImStucc                                   <dbl> -0.02065481, -0…
## $ exterior_1st_MetalSd                                   <dbl> -0.4328472, -0.…
## $ exterior_1st_Plywood                                   <dbl> -0.2822282, -0.…
## $ exterior_1st_PreCast                                   <dbl> -0.02065481, -0…
## $ exterior_1st_Stone                                     <dbl> -0.02065481, -0…
## $ exterior_1st_Stucco                                    <dbl> -0.1194715, -0.…
## $ exterior_1st_VinylSd                                   <dbl> -0.7555144, -0.…
## $ exterior_1st_Wd.Sdng                                   <dbl> -0.3953758, -0.…
## $ exterior_1st_WdShing                                   <dbl> -0.1414527, -0.…
## $ exterior_2nd_AsphShn                                   <dbl> -0.04133609, -0…
## $ exterior_2nd_Brk.Cmn                                   <dbl> -0.08545424, -0…
## $ exterior_2nd_BrkFace                                   <dbl> -0.1212944, -0.…
## $ exterior_2nd_CBlock                                    <dbl> -0.03579046, -0…
## $ exterior_2nd_CmentBd                                   <dbl> -0.2054823, 4.8…
## $ exterior_2nd_HdBoard                                   <dbl> 2.5514784, -0.3…
## $ exterior_2nd_ImStucc                                   <dbl> -0.07466342, -0…
## $ exterior_2nd_MetalSd                                   <dbl> -0.4307617, -0.…
## $ exterior_2nd_Other                                     <dbl> -0.02065481, -0…
## $ exterior_2nd_Plywood                                   <dbl> -0.3152693, -0.…
## $ exterior_2nd_PreCast                                   <dbl> -0.02065481, -0…
## $ exterior_2nd_Stone                                     <dbl> -0.04622503, -0…
## $ exterior_2nd_Stucco                                    <dbl> -0.1283422, -0.…
## $ exterior_2nd_VinylSd                                   <dbl> -0.749257, -0.7…
## $ exterior_2nd_Wd.Sdng                                   <dbl> -0.3895869, -0.…
## $ exterior_2nd_Wd.Shng                                   <dbl> -0.1620447, -0.…
## $ mas_vnr_type_BrkFace                                   <dbl> 1.5228669, 1.52…
## $ mas_vnr_type_None                                      <dbl> -1.229504, -1.2…
## $ mas_vnr_type_Stone                                     <dbl> -0.3111646, -0.…
## $ exter_qual_Fair                                        <dbl> -0.1038071, -0.…
## $ exter_qual_Good                                        <dbl> -0.7244479, -0.…
## $ exter_qual_Typical                                     <dbl> 0.7993048, 0.79…
## $ exter_cond_Fair                                        <dbl> -0.1549765, -0.…
## $ exter_cond_Good                                        <dbl> -0.3407945, -0.…
## $ exter_cond_Poor                                        <dbl> -0.03579046, -0…
## $ exter_cond_Typical                                     <dbl> 0.3910379, 0.39…
## $ foundation_CBlock                                      <dbl> 1.1692276, 1.16…
## $ foundation_PConc                                       <dbl> -0.9146767, -0.…
## $ foundation_Slab                                        <dbl> -0.1266148, -0.…
## $ foundation_Stone                                       <dbl> -0.05850809, -0…
## $ foundation_Wood                                        <dbl> -0.04133609, -0…
## $ bsmt_qual_Fair                                         <dbl> -0.1779793, -0.…
## $ bsmt_qual_Good                                         <dbl> -0.8586435, -0.…
## $ bsmt_qual_No_Basement                                  <dbl> -0.1647955, -0.…
## $ bsmt_qual_Poor                                         <dbl> -0.02921655, -0…
## $ bsmt_qual_Typical                                      <dbl> 1.1480131, 1.14…
## $ bsmt_cond_Fair                                         <dbl> -0.1891505, -0.…
## $ bsmt_cond_Good                                         <dbl> -0.2110551, -0.…
## $ bsmt_cond_No_Basement                                  <dbl> -0.1647955, -0.…
## $ bsmt_cond_Poor                                         <dbl> -0.04133609, -0…
## $ bsmt_cond_Typical                                      <dbl> 0.3446791, 0.34…
## $ bsmt_exposure_Gd                                       <dbl> -0.3241824, -0.…
## $ bsmt_exposure_Mn                                       <dbl> -0.3028478, -0.…
## $ bsmt_exposure_No                                       <dbl> -1.3490446, 0.7…
## $ bsmt_exposure_No_Basement                              <dbl> -0.1675058, -0.…
## $ bsmt_fin_type_1_BLQ                                    <dbl> -0.3136316, -0.…
## $ bsmt_fin_type_1_GLQ                                    <dbl> -0.6563761, -0.…
## $ bsmt_fin_type_1_LwQ                                    <dbl> -0.2291607, -0.…
## $ bsmt_fin_type_1_No_Basement                            <dbl> -0.1647955, -0.…
## $ bsmt_fin_type_1_Rec                                    <dbl> -0.3329462, -0.…
## $ bsmt_fin_type_1_Unf                                    <dbl> -0.6424354, 1.5…
## $ bsmt_fin_type_2_BLQ                                    <dbl> -0.1564132, -0.…
## $ bsmt_fin_type_2_GLQ                                    <dbl> -0.1099302, -0.…
## $ bsmt_fin_type_2_LwQ                                    <dbl> -0.1779793, -0.…
## $ bsmt_fin_type_2_No_Basement                            <dbl> -0.1661556, -0.…
## $ bsmt_fin_type_2_Rec                                    <dbl> -0.1951168, -0.…
## $ bsmt_fin_type_2_Unf                                    <dbl> 0.4167571, 0.41…
## $ heating_GasA                                           <dbl> 0.1266148, 0.12…
## $ heating_GasW                                           <dbl> -0.09731674, -0…
## $ heating_Grav                                           <dbl> -0.0547176, -0.…
## $ heating_OthW                                           <dbl> -0.02921655, -0…
## $ heating_Wall                                           <dbl> -0.05064781, -0…
## $ heating_qc_Fair                                        <dbl> -0.1817694, -0.…
## $ heating_qc_Good                                        <dbl> -0.4356222, -0.…
## $ heating_qc_Poor                                        <dbl> -0.02921655, -0…
## $ heating_qc_Typical                                     <dbl> 1.5704786, 1.57…
## $ central_air_Y                                          <dbl> 0.2660421, 0.26…
## $ electrical_FuseF                                       <dbl> -0.124865, -0.1…
## $ electrical_FuseP                                       <dbl> -0.05850809, -0…
## $ electrical_Mix                                         <dbl> -0.02065481, -0…
## $ electrical_SBrkr                                       <dbl> 0.2994786, 0.29…
## $ electrical_Unknown                                     <dbl> -0.02065481, -0…
## $ kitchen_qual_Fair                                      <dbl> -0.1535276, -0.…
## $ kitchen_qual_Good                                      <dbl> -0.8224307, -0.…
## $ kitchen_qual_Poor                                      <dbl> -0.02065481, -0…
## $ kitchen_qual_Typical                                   <dbl> 0.9904466, 0.99…
## $ functional_Maj2                                        <dbl> -0.0547176, -0.…
## $ functional_Min1                                        <dbl> -0.1476032, -0.…
## $ functional_Min2                                        <dbl> -0.1476032, -0.…
## $ functional_Mod                                         <dbl> -0.1138377, -0.…
## $ functional_Sal                                         <dbl> -0.02065481, -0…
## $ functional_Sev                                         <dbl> -0.02921655, -0…
## $ functional_Typ                                         <dbl> 0.2669599, 0.26…
## $ fireplace_qu_Fair                                      <dbl> -0.1592514, -0.…
## $ fireplace_qu_Good                                      <dbl> 1.7277501, -0.5…
## $ fireplace_qu_No_Fireplace                              <dbl> -0.9786828, 1.0…
## $ fireplace_qu_Poor                                      <dbl> -0.1194715, -0.…
## $ fireplace_qu_Typical                                   <dbl> -0.5106678, -0.…
## $ garage_type_Basment                                    <dbl> -0.1119002, -0.…
## $ garage_type_BuiltIn                                    <dbl> -0.2557863, -0.…
## $ garage_type_CarPort                                    <dbl> -0.06207049, -0…
## $ garage_type_Detchd                                     <dbl> -0.6041656, -0.…
## $ garage_type_More_Than_Two_Types                        <dbl> -0.09038007, -0…
## $ garage_type_No_Garage                                  <dbl> -0.2412767, 4.1…
## $ garage_finish_No_Garage                                <dbl> -0.2432497, 4.1…
## $ garage_finish_RFn                                      <dbl> -0.6199681, -0.…
## $ garage_finish_Unf                                      <dbl> 1.1795096, -0.8…
## $ garage_qual_Fair                                       <dbl> -0.2121551, -0.…
## $ garage_qual_Good                                       <dbl> -0.09038007, -0…
## $ garage_qual_No_Garage                                  <dbl> -0.2432497, 4.1…
## $ garage_qual_Poor                                       <dbl> -0.04622503, -0…
## $ garage_qual_Typical                                    <dbl> 0.3523743, -2.8…
## $ garage_cond_Fair                                       <dbl> -0.1606535, -0.…
## $ garage_cond_Good                                       <dbl> -0.07466342, -0…
## $ garage_cond_No_Garage                                  <dbl> -0.2432497, 4.1…
## $ garage_cond_Poor                                       <dbl> -0.07749852, -0…
## $ garage_cond_Typical                                    <dbl> 0.3217669, -3.1…
## $ paved_drive_Partial_Pavement                           <dbl> -0.1414527, -0.…
## $ paved_drive_Paved                                      <dbl> 0.3193401, -3.1…
## $ pool_qc_Fair                                           <dbl> -0.02065481, -0…
## $ pool_qc_Good                                           <dbl> -0.03579046, -0…
## $ pool_qc_No_Pool                                        <dbl> 0.06865091, 0.0…
## $ pool_qc_Typical                                        <dbl> -0.03579046, -0…
## $ fence_Good_Wood                                        <dbl> -0.1997798, -0.…
## $ fence_Minimum_Privacy                                  <dbl> -0.3500758, -0.…
## $ fence_Minimum_Wood_Wire                                <dbl> -0.06207049, -0…
## $ fence_No_Fence                                         <dbl> 0.4880005, 0.48…
## $ misc_feature_Gar2                                      <dbl> -0.03579046, -0…
## $ misc_feature_None                                      <dbl> 0.1891505, 0.18…
## $ misc_feature_Othr                                      <dbl> -0.02065481, -0…
## $ misc_feature_Shed                                      <dbl> -0.1830175, -0.…
## $ sale_type_Con                                          <dbl> -0.04133609, -0…
## $ sale_type_ConLD                                        <dbl> -0.09038007, -0…
## $ sale_type_ConLI                                        <dbl> -0.0547176, -0.…
## $ sale_type_ConLw                                        <dbl> -0.05064781, -0…
## $ sale_type_CWD                                          <dbl> -0.06544205, -0…
## $ sale_type_New                                          <dbl> -0.3028478, -0.…
## $ sale_type_Oth                                          <dbl> -0.05064781, -0…
## $ sale_type_VWD                                          <dbl> -0.02065481, -0…
## $ sale_type_WD.                                          <dbl> 0.3975366, 0.39…
## $ sale_condition_AdjLand                                 <dbl> -0.06207049, -0…
## $ sale_condition_Alloca                                  <dbl> -0.08795059, -0…
## $ sale_condition_Family                                  <dbl> -0.1230919, -0.…
## $ sale_condition_Normal                                  <dbl> 0.4623717, -2.1…
## $ sale_condition_Partial                                 <dbl> -0.3070247, -0.…

To ensure reliable model evaluation, the data was split into training and testing sets. The training data was used to fit models, while the test data was reserved for final evaluation. Five-fold cross-validation was applied within the training data to estimate model performance and reduce overfitting.

A preprocessing recipe was used to standardize transformations across all models. This included encoding categorical variables, removing zero-variance predictors, and normalizing numeric variables. These steps ensure that models are trained on consistent and appropriately scaled data.

Baseline linear model

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
baseline_lm <- lm_spec |>
  fit(log_sale_price ~ . - sale_price, data = ames_train)

baseline_lm_metrics <- glance(baseline_lm)

baseline_lm_metrics
## # A tibble: 1 × 12
##   r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.950         0.943 0.0972      133.       0   293  2295. -3999. -2300.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The model formula excludes sale_price because sale_price is the original version of the response. Including it as a predictor of log_sale_price would create data leakage and produce misleadingly strong results.

baseline_lm_coefs <- tidy(baseline_lm) |>
  filter(!is.na(p.value)) |>
  arrange(p.value) |>
  mutate(remove_candidate = p.value > 0.05)

baseline_lm_coefs
## # A tibble: 294 × 6
##    term             estimate std.error statistic  p.value remove_candidate
##    <chr>               <dbl>     <dbl>     <dbl>    <dbl> <lgl>           
##  1 roof_matlCompShg 2.91     0.174         16.7  7.43e-59 FALSE           
##  2 roof_matlWdShngl 2.99     0.181         16.5  1.26e-57 FALSE           
##  3 roof_matlTar&Grv 2.93     0.180         16.3  2.95e-56 FALSE           
##  4 roof_matlWdShake 2.92     0.180         16.2  6.17e-56 FALSE           
##  5 first_flr_sf     0.000271 0.0000169     16.1  6.42e-55 FALSE           
##  6 roof_matlMembran 3.17     0.215         14.7  1.04e-46 FALSE           
##  7 roof_matlRoll    2.99     0.203         14.7  1.17e-46 FALSE           
##  8 second_flr_sf    0.000263 0.0000187     14.0  8.36e-43 FALSE           
##  9 misc_featureGar2 2.27     0.169         13.5  8.12e-40 FALSE           
## 10 misc_featureShed 2.19     0.229          9.56 3.21e-21 FALSE           
## # ℹ 284 more rows

Backward Selection Using AIC

Backward selection is used as a classical feature-selection baseline. It begins with a broad model and removes predictors based on AIC, an information criterion that balances model fit and model complexity.

lm_full <- lm(log_sale_price ~ . - sale_price, data = ames_train)

backward_lm <- MASS::stepAIC(
  lm_full,
  direction = "backward",
  trace = FALSE
)
## Selected terms
backward_terms <- attr(terms(backward_lm), "term.labels")

backward_terms
##  [1] "ms_sub_class"    "ms_zoning"       "lot_area"        "street"         
##  [5] "land_slope"      "neighborhood"    "condition_1"     "condition_2"    
##  [9] "bldg_type"       "overall_qual"    "overall_cond"    "year_built"     
## [13] "year_remod_add"  "roof_matl"       "exterior_1st"    "mas_vnr_type"   
## [17] "exter_qual"      "exter_cond"      "foundation"      "bsmt_qual"      
## [21] "bsmt_exposure"   "bsmt_fin_type_1" "bsmt_fin_type_2" "bsmt_unf_sf"    
## [25] "total_bsmt_sf"   "heating_qc"      "central_air"     "first_flr_sf"   
## [29] "second_flr_sf"   "low_qual_fin_sf" "bsmt_full_bath"  "full_bath"      
## [33] "half_bath"       "bedroom_abv_gr"  "kitchen_qual"    "functional"     
## [37] "fireplaces"      "garage_cars"     "garage_area"     "garage_qual"    
## [41] "wood_deck_sf"    "open_porch_sf"   "enclosed_porch"  "screen_porch"   
## [45] "pool_area"       "misc_feature"    "year_sold"       "sale_condition"
## Final coefficient table

backward_lm_coefs <- tidy(backward_lm) |>
  filter(!is.na(p.value)) |>
  arrange(p.value) |>
  mutate(remove = p.value > 0.05)

backward_lm_coefs
## # A tibble: 194 × 6
##    term             estimate std.error statistic  p.value remove
##    <chr>               <dbl>     <dbl>     <dbl>    <dbl> <lgl> 
##  1 roof_matlCompShg 2.66     0.121          22.0 6.10e-97 FALSE 
##  2 roof_matlTar&Grv 2.68     0.124          21.7 1.89e-94 FALSE 
##  3 roof_matlWdShngl 2.77     0.129          21.4 1.57e-92 FALSE 
##  4 roof_matlWdShake 2.63     0.127          20.7 5.11e-87 FALSE 
##  5 misc_featureNone 2.18     0.110          19.9 1.11e-80 FALSE 
##  6 misc_featureShed 2.19     0.110          19.8 1.52e-80 FALSE 
##  7 first_flr_sf     0.000278 0.0000146      19.0 7.25e-75 FALSE 
##  8 second_flr_sf    0.000261 0.0000147      17.8 2.98e-66 FALSE 
##  9 misc_featureGar2 2.29     0.130          17.7 2.47e-65 FALSE 
## 10 roof_matlMembran 2.86     0.164          17.4 8.56e-64 FALSE 
## # ℹ 184 more rows
## Compare model metrics
linear_model_comparison <- bind_rows(
  full_model = glance(lm_full),
  backward_selected_model = glance(backward_lm),
  .id = "model"
) |>
  select(
    model,
    r.squared,
    adj.r.squared,
    sigma,
    statistic,
    p.value,
    df,
    logLik,
    AIC,
    BIC,
    deviance,
    df.residual
  )

linear_model_comparison
## # A tibble: 2 × 12
##   model     r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC
##   <chr>         <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>
## 1 full_mod…     0.950         0.943 0.0972      133.       0   293  2295. -3999.
## 2 backward…     0.947         0.943 0.0975      200.       0   193  2231. -4072.
## # ℹ 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>

Interpretation

The full linear model explains approximately 95.0% of the variation in log-transformed sale price, while the backward-selected model explains approximately 94.7%. This small decrease suggests that many predictors in the full model contribute very little additional explanatory value. After adjusting for model complexity, the two models remain nearly identical, with adjusted R^2 values of 0.9430 and 0.9426, respectively.

The backward-selected model is preferable as a classical feature-selection baseline because it removes about 100 model terms while maintaining nearly the same explanatory strength. Although the full model has slightly lower deviance and residual error, the improvement is minimal relative to the added complexity. Both AIC and BIC are lower for the backward-selected model, indicating that it provides a better balance between goodness of fit and parsimony.

Statistically, this suggests that the Ames Housing data contain many predictors that overlap in explanatory information. The backward-selected model keeps the most useful terms while removing predictors that do not improve the model enough to justify their complexity. However, because these results are based on in-sample performance, the backward-selected model should be treated as an interpretable baseline rather than the final predictive model. Cross-validation and test-set performance are still needed to determine which model generalizes best to new housing data.

Project 1: Lasso Regression

Lasso regression was fit as a separate feature-selection model using the full predictor set rather than the backward-selected variables. The penalty parameter controls how strongly coefficients are shrunk toward zero. Predictors with coefficients shrunk exactly to zero are removed from the model, while predictors with nonzero coefficients are retained. Because categorical predictors were dummy-coded before fitting the lasso model, lasso selection occurs at the dummy-variable level. Therefore, lasso may retain one level of a categorical variable while shrinking another level of the same original variable to zero. For interpretation, these dummy-level terms can be grouped back to the original variable names, but the fitted lasso model itself is based on the dummy-coded design matrix.

Model Specification

lasso_spec <- linear_reg(
  penalty = tune(),
  mixture = 1
) |>
  set_engine("glmnet") |>
  set_mode("regression")

Workflow

lasso_wf <- workflow() |>
  add_recipe(ames_recipe) |>
  add_model(lasso_spec)

Tune Penalty Parameter

## Penalty grid
set.seed(42)

lasso_grid <- grid_regular(
  penalty(range = c(-4, 1)),
  levels = 50
)

lasso_res <- tune_grid(
  lasso_wf,
  resamples = ames_folds,
  grid = lasso_grid,
  metrics = metric_set(rmse, rsq, mae)
)

Cross-Validation Results

lasso_res |>
  collect_metrics() |>
  arrange(.metric, mean)
## # A tibble: 150 × 7
##     penalty .metric .estimator   mean     n std_err .config         
##       <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>           
##  1 0.000655 mae     standard   0.0815     5 0.00173 pre0_mod09_post0
##  2 0.000518 mae     standard   0.0816     5 0.00175 pre0_mod08_post0
##  3 0.000829 mae     standard   0.0816     5 0.00168 pre0_mod10_post0
##  4 0.000409 mae     standard   0.0818     5 0.00179 pre0_mod07_post0
##  5 0.00105  mae     standard   0.0819     5 0.00162 pre0_mod11_post0
##  6 0.000324 mae     standard   0.0819     5 0.00183 pre0_mod06_post0
##  7 0.000256 mae     standard   0.0822     5 0.00186 pre0_mod05_post0
##  8 0.00133  mae     standard   0.0822     5 0.00158 pre0_mod12_post0
##  9 0.000202 mae     standard   0.0824     5 0.00194 pre0_mod04_post0
## 10 0.00168  mae     standard   0.0826     5 0.00167 pre0_mod13_post0
## # ℹ 140 more rows

Select Best Penalty

best_lasso <- select_best(lasso_res, metric = "rmse")

best_lasso
## # A tibble: 1 × 2
##   penalty .config         
##     <dbl> <chr>           
## 1 0.00133 pre0_mod12_post0

Visualize Tuning Results

autoplot(lasso_res) +
  labs(
    title = "Lasso Regression Tuning Results",
    x = "Penalty",
    y = "Cross-Validated Performance"
  )

Finalize and Fit Lasso Model

## Finalize workflow
final_lasso_wf <- finalize_workflow(
  lasso_wf,
  best_lasso
)

final_lasso_fit <- last_fit(
  final_lasso_wf,
  split = ames_split,
  metrics = metric_set(rmse, rsq, mae)
)

##Test Set Performance

lasso_test_metrics <- final_lasso_fit |>
  collect_metrics()

lasso_test_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate .config        
##   <chr>   <chr>          <dbl> <chr>          
## 1 rmse    standard      0.143  pre0_mod0_post0
## 2 rsq     standard      0.877  pre0_mod0_post0
## 3 mae     standard      0.0808 pre0_mod0_post0

Final Lasso Coefficients

lasso_coefs <- final_lasso_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  filter(estimate != 0) |>
  arrange(desc(abs(estimate)))
selected_lasso_features <- lasso_coefs |>
  filter(term != "(Intercept)", estimate != 0) |>
  pull(term)

selected_lasso_features
##   [1] "gr_liv_area"                                           
##   [2] "year_built"                                            
##   [3] "overall_qual_Excellent"                                
##   [4] "misc_val"                                              
##   [5] "overall_qual_Very_Good"                                
##   [6] "total_bsmt_sf"                                         
##   [7] "garage_cars"                                           
##   [8] "functional_Sal"                                        
##   [9] "overall_cond_Good"                                     
##  [10] "pool_qc_Good"                                          
##  [11] "overall_cond_Fair"                                     
##  [12] "condition_1_Norm"                                      
##  [13] "kitchen_qual_Typical"                                  
##  [14] "overall_qual_Very_Excellent"                           
##  [15] "neighborhood_Northridge_Heights"                       
##  [16] "misc_feature_Gar2"                                     
##  [17] "sale_condition_Normal"                                 
##  [18] "neighborhood_Crawford"                                 
##  [19] "fireplaces"                                            
##  [20] "year_remod_add"                                        
##  [21] "overall_cond_Very_Good"                                
##  [22] "ms_zoning_C_all"                                       
##  [23] "overall_qual_Below_Average"                            
##  [24] "neighborhood_Somerset"                                 
##  [25] "ms_sub_class_Two_Story_PUD_1946_and_Newer"             
##  [26] "functional_Typ"                                        
##  [27] "lot_area"                                              
##  [28] "neighborhood_Green_Hills"                              
##  [29] "ms_zoning_Residential_Medium_Density"                  
##  [30] "sale_type_New"                                         
##  [31] "neighborhood_Northridge"                               
##  [32] "pool_area"                                             
##  [33] "neighborhood_Stone_Brook"                              
##  [34] "overall_cond_Excellent"                                
##  [35] "ms_sub_class_Duplex_All_Styles_and_Ages"               
##  [36] "full_bath"                                             
##  [37] "overall_qual_Good"                                     
##  [38] "bldg_type_Twnhs"                                       
##  [39] "overall_cond_Above_Average"                            
##  [40] "kitchen_qual_Good"                                     
##  [41] "bsmt_full_bath"                                        
##  [42] "bsmt_unf_sf"                                           
##  [43] "overall_qual_Fair"                                     
##  [44] "overall_qual_Poor"                                     
##  [45] "bldg_type_TwnhsE"                                      
##  [46] "screen_porch"                                          
##  [47] "neighborhood_Meadow_Village"                           
##  [48] "house_style_One_Story"                                 
##  [49] "bsmt_exposure_Gd"                                      
##  [50] "bsmt_fin_type_1_GLQ"                                   
##  [51] "exterior_1st_BrkFace"                                  
##  [52] "neighborhood_Edwards"                                  
##  [53] "sale_condition_Partial"                                
##  [54] "functional_Sev"                                        
##  [55] "functional_Maj2"                                       
##  [56] "heating_qc_Typical"                                    
##  [57] "bsmt_exposure_No"                                      
##  [58] "garage_area"                                           
##  [59] "wood_deck_sf"                                          
##  [60] "roof_matl_WdShngl"                                     
##  [61] "overall_cond_Below_Average"                            
##  [62] "kitchen_abv_gr"                                        
##  [63] "overall_qual_Average"                                  
##  [64] "bsmt_fin_sf_1"                                         
##  [65] "kitchen_qual_Fair"                                     
##  [66] "misc_feature_Shed"                                     
##  [67] "condition_1_PosN"                                      
##  [68] "bsmt_qual_Typical"                                     
##  [69] "neighborhood_Old_Town"                                 
##  [70] "exterior_1st_PreCast"                                  
##  [71] "exter_qual_Typical"                                    
##  [72] "sale_condition_Alloca"                                 
##  [73] "exter_cond_Poor"                                       
##  [74] "ms_sub_class_Two_Story_1946_and_Newer"                 
##  [75] "sale_condition_AdjLand"                                
##  [76] "central_air_Y"                                         
##  [77] "exter_cond_Fair"                                       
##  [78] "heating_GasW"                                          
##  [79] "bsmt_fin_type_2_No_Basement"                           
##  [80] "lot_config_CulDSac"                                    
##  [81] "house_style_SFoyer"                                    
##  [82] "land_slope_Sev"                                        
##  [83] "condition_1_PosA"                                      
##  [84] "ms_zoning_A_agr"                                       
##  [85] "exterior_1st_HdBoard"                                  
##  [86] "neighborhood_Clear_Creek"                              
##  [87] "garage_type_More_Than_Two_Types"                       
##  [88] "roof_matl_Tar.Grv"                                     
##  [89] "heating_qc_Fair"                                       
##  [90] "condition_2_PosA"                                      
##  [91] "garage_cond_Typical"                                   
##  [92] "ms_sub_class_One_Story_1945_and_Older"                 
##  [93] "roof_style_Gambrel"                                    
##  [94] "exterior_1st_Wd.Sdng"                                  
##  [95] "neighborhood_Timberland"                               
##  [96] "ms_sub_class_One_and_Half_Story_PUD_All_Ages"          
##  [97] "bsmt_cond_Fair"                                        
##  [98] "heating_qc_Poor"                                       
##  [99] "half_bath"                                             
## [100] "house_style_One_and_Half_Unf"                          
## [101] "bsmt_qual_No_Basement"                                 
## [102] "heating_qc_Good"                                       
## [103] "garage_qual_Good"                                      
## [104] "open_porch_sf"                                         
## [105] "bsmt_fin_type_2_Rec"                                   
## [106] "bsmt_exposure_Mn"                                      
## [107] "bsmt_qual_Fair"                                        
## [108] "year_sold"                                             
## [109] "bsmt_qual_Good"                                        
## [110] "sale_type_Con"                                         
## [111] "land_contour_Lvl"                                      
## [112] "lot_shape_Moderately_Irregular"                        
## [113] "land_contour_HLS"                                      
## [114] "foundation_Wood"                                       
## [115] "paved_drive_Paved"                                     
## [116] "low_qual_fin_sf"                                       
## [117] "lot_shape_Irregular"                                   
## [118] "foundation_Stone"                                      
## [119] "roof_matl_Membran"                                     
## [120] "kitchen_qual_Poor"                                     
## [121] "pool_qc_Fair"                                          
## [122] "bsmt_fin_type_2_GLQ"                                   
## [123] "garage_finish_RFn"                                     
## [124] "bsmt_cond_Poor"                                        
## [125] "exterior_1st_Stucco"                                   
## [126] "ms_zoning_I_all"                                       
## [127] "pool_qc_Typical"                                       
## [128] "fence_Good_Wood"                                       
## [129] "lot_config_FR2"                                        
## [130] "street_Pave"                                           
## [131] "mas_vnr_type_Stone"                                    
## [132] "neighborhood_Greens"                                   
## [133] "garage_type_CarPort"                                   
## [134] "heating_OthW"                                          
## [135] "bsmt_fin_type_1_Unf"                                   
## [136] "neighborhood_Veenker"                                  
## [137] "roof_matl_CompShg"                                     
## [138] "utilities_NoSewr"                                      
## [139] "functional_Min1"                                       
## [140] "enclosed_porch"                                        
## [141] "neighborhood_Bloomington_Heights"                      
## [142] "exterior_1st_CemntBd"                                  
## [143] "foundation_Slab"                                       
## [144] "neighborhood_Sawyer_West"                              
## [145] "garage_cond_Poor"                                      
## [146] "ms_sub_class_Two_and_Half_Story_All_Ages"              
## [147] "alley_Paved"                                           
## [148] "bsmt_cond_Good"                                        
## [149] "utilities_NoSeWa"                                      
## [150] "garage_qual_Fair"                                      
## [151] "roof_style_Mansard"                                    
## [152] "bsmt_exposure_No_Basement"                             
## [153] "exterior_2nd_BrkFace"                                  
## [154] "sale_condition_Family"                                 
## [155] "garage_qual_Poor"                                      
## [156] "fireplace_qu_Good"                                     
## [157] "exterior_2nd_Plywood"                                  
## [158] "mas_vnr_area"                                          
## [159] "ms_sub_class_Two_Family_conversion_All_Styles_and_Ages"
## [160] "sale_type_ConLI"                                       
## [161] "neighborhood_Sawyer"                                   
## [162] "condition_1_RRAn"                                      
## [163] "garage_cond_Good"                                      
## [164] "condition_2_RRNn"                                      
## [165] "ms_sub_class_Two_Story_1945_and_Older"                 
## [166] "neighborhood_Blueste"                                  
## [167] "exterior_2nd_Wd.Shng"                                  
## [168] "house_style_Two_and_Half_Unf"                          
## [169] "bsmt_fin_type_2_Unf"                                   
## [170] "three_season_porch"                                    
## [171] "bedroom_abv_gr"                                        
## [172] "garage_cond_Fair"                                      
## [173] "heating_Grav"                                          
## [174] "neighborhood_Northwest_Ames"                           
## [175] "fence_Minimum_Wood_Wire"                               
## [176] "neighborhood_Brookside"                                
## [177] "exterior_2nd_Wd.Sdng"                                  
## [178] "neighborhood_Mitchell"                                 
## [179] "exterior_2nd_Stone"                                    
## [180] "lot_config_FR3"                                        
## [181] "sale_type_WD."                                         
## [182] "bsmt_fin_type_2_BLQ"                                   
## [183] "neighborhood_Northpark_Villa"                          
## [184] "paved_drive_Partial_Pavement"                          
## [185] "bsmt_cond_No_Basement"                                 
## [186] "condition_1_RRAe"                                      
## [187] "bsmt_fin_type_1_BLQ"                                   
## [188] "land_slope_Mod"                                        
## [189] "bsmt_half_bath"                                        
## [190] "neighborhood_Briardale"                                
## [191] "bsmt_fin_type_1_No_Basement"                           
## [192] "second_flr_sf"                                         
## [193] "exterior_1st_CBlock"                                   
## [194] "garage_finish_Unf"                                     
## [195] "exterior_2nd_Other"                                    
## [196] "condition_1_RRNn"                                      
## [197] "bldg_type_Duplex"                                      
## [198] "fence_No_Fence"                                        
## [199] "neighborhood_Gilbert"
length(selected_lasso_features)
## [1] 199

Map dummy variables back to original variables:

original_vars <- names(ames_train) |>
  setdiff(c("sale_price", "log_sale_price"))

selected_original_vars <- original_vars[
  purrr::map_lgl(original_vars, \(var) {
    any(stringr::str_detect(selected_lasso_features, paste0("^", var)))
  })
]

selected_original_vars
##  [1] "ms_sub_class"       "ms_zoning"          "lot_area"          
##  [4] "street"             "alley"              "lot_shape"         
##  [7] "land_contour"       "utilities"          "lot_config"        
## [10] "land_slope"         "neighborhood"       "condition_1"       
## [13] "condition_2"        "bldg_type"          "house_style"       
## [16] "overall_qual"       "overall_cond"       "year_built"        
## [19] "year_remod_add"     "roof_style"         "roof_matl"         
## [22] "exterior_1st"       "exterior_2nd"       "mas_vnr_type"      
## [25] "mas_vnr_area"       "exter_qual"         "exter_cond"        
## [28] "foundation"         "bsmt_qual"          "bsmt_cond"         
## [31] "bsmt_exposure"      "bsmt_fin_type_1"    "bsmt_fin_sf_1"     
## [34] "bsmt_fin_type_2"    "bsmt_unf_sf"        "total_bsmt_sf"     
## [37] "heating"            "heating_qc"         "central_air"       
## [40] "second_flr_sf"      "low_qual_fin_sf"    "gr_liv_area"       
## [43] "bsmt_full_bath"     "bsmt_half_bath"     "full_bath"         
## [46] "half_bath"          "bedroom_abv_gr"     "kitchen_abv_gr"    
## [49] "kitchen_qual"       "functional"         "fireplaces"        
## [52] "fireplace_qu"       "garage_type"        "garage_finish"     
## [55] "garage_cars"        "garage_area"        "garage_qual"       
## [58] "garage_cond"        "paved_drive"        "wood_deck_sf"      
## [61] "open_porch_sf"      "enclosed_porch"     "three_season_porch"
## [64] "screen_porch"       "pool_area"          "pool_qc"           
## [67] "fence"              "misc_feature"       "misc_val"          
## [70] "year_sold"          "sale_type"          "sale_condition"
length(selected_original_vars)
## [1] 72

The selected lasso features are the predictors with non-zero coefficients in the finalized model. These are the best feature candidates from the lasso procedure because they survived cross-validated penalization. Larger absolute coefficient values indicate stronger effects on log_sale_price, after preprocessing and normalization.

Plot strongest lasso-selected features

original_vars <- names(ames_train)

lasso_coefs_grouped <- lasso_coefs |>
  filter(term != "(Intercept)") |>
  rowwise() |>
  mutate(
    original_var = original_vars[
      which.max(startsWith(term, original_vars))
    ]
  ) |>
  ungroup() |>
  group_by(original_var) |>
  summarize(
    importance = sum(abs(estimate))
  ) |>
  arrange(desc(importance))
lasso_coefs_grouped |>
  slice_max(order_by = importance, n = 25) |>
  mutate(original_var = fct_reorder(original_var, importance)) |>
  ggplot(aes(x = importance, y = original_var)) +
  geom_col() +
  labs(
    title = "Top Lasso-Selected Variables",
    x = "Total Absolute Coefficient",
    y = NULL
  ) +
  theme_minimal()

Evaluate finalized lasso model on the test set

lasso_preds <- final_lasso_fit |>
  collect_predictions() |>
  bind_cols(
    ames_test |>
      select(sale_price)
  ) |>
  mutate(
    pred_sale_price = exp(.pred),
    residual_log = log_sale_price - .pred,
    residual_dollars = sale_price - pred_sale_price
  )

head(lasso_preds)
## # A tibble: 6 × 9
##   .pred id               log_sale_price  .row .config sale_price pred_sale_price
##   <dbl> <chr>                     <dbl> <int> <chr>        <int>           <dbl>
## 1  12.1 train/test split           12.2     5 pre0_m…     189900         186549.
## 2  12.1 train/test split           12.0    22 pre0_m…     170000         176648.
## 3  12.6 train/test split           12.6    38 pre0_m…     306000         284474.
## 4  13.1 train/test split           13.1    47 pre0_m…     500000         481507.
## 5  12.7 train/test split           12.7    48 pre0_m…     320000         324779.
## 6  12.2 train/test split           12.2    54 pre0_m…     192000         190586.
## # ℹ 2 more variables: residual_log <dbl>, residual_dollars <dbl>

Evaluating Lasso Perfomance

lasso_log_metrics <- lasso_preds |>
  metrics(truth = log_sale_price, estimate = .pred) |>
  mutate(model = "Lasso")

lasso_log_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model
##   <chr>   <chr>          <dbl> <chr>
## 1 rmse    standard      0.143  Lasso
## 2 rsq     standard      0.877  Lasso
## 3 mae     standard      0.0808 Lasso
lasso_dollar_metrics <- lasso_preds |>
  metrics(truth = sale_price, estimate = pred_sale_price) |>
  mutate(model = "Lasso")

lasso_dollar_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model
##   <chr>   <chr>          <dbl> <chr>
## 1 rmse    standard   27273.    Lasso
## 2 rsq     standard       0.882 Lasso
## 3 mae     standard   14049.    Lasso

Diagnostics plots

ggplot(lasso_preds, aes(x = log_sale_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Lasso Model: Actual vs. Predicted Log Sale Price",
    x = "Actual Log Sale Price",
    y = "Predicted Log Sale Price"
  ) +
  theme_minimal()

Residual VS Predicted

ggplot(lasso_preds, aes(x = .pred, y = residual_log)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Lasso Model: Residuals vs. Predicted Log Sale Price",
    x = "Predicted Log Sale Price",
    y = "Residual"
  ) +
  theme_minimal()

Actual VS Predicted

ggplot(lasso_preds, aes(x = sale_price, y = pred_sale_price)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_x_continuous(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Lasso Model: Actual vs. Predicted Sale Price",
    x = "Actual Sale Price",
    y = "Predicted Sale Price"
  ) +
  theme_minimal()

The lasso model was evaluated on the held-out test set to estimate how well it generalizes to new homes. Performance was assessed on both the log scale and the original dollar scale. The log-scale metrics align directly with the model’s training objective, while the dollar-scale metrics make the results easier to interpret for a general audience.

The actual-versus-predicted plots show whether predictions closely follow the 45-degree reference line, where perfect predictions would fall. Points far from the line represent homes that were overpredicted or underpredicted. The dollar-scale plot is especially useful for communication, but because predictions were back-transformed from the log scale, dollar-scale errors should be interpreted as approximate practical errors rather than the original scale used to fit the model.

The residual plot helps evaluate whether errors show systematic patterns. Ideally, residuals should be centered around zero with no clear curve or funnel shape. If residuals widen for higher predicted prices, this suggests that expensive homes are harder to predict consistently, which is common in housing data.

Final Model Evaluation and Comparison

Both the backward-selection and lasso models are evaluated on the test set to compare predictive performance.

Backward Model Predictions

ames_test_lm <- ames_test

for (var in names(backward_lm$xlevels)) {
  ames_test_lm[[var]] <- factor(
    ames_test_lm[[var]],
    levels = backward_lm$xlevels[[var]]
  )
}
backward_preds <- predict(backward_lm, newdata = ames_test_lm) |>
  tibble(.pred = _) |>
  bind_cols(
    ames_test |>
      select(sale_price, log_sale_price)
  ) |>
  mutate(
    pred_sale_price = exp(.pred),
    residual_log = log_sale_price - .pred,
    residual_dollars = sale_price - pred_sale_price
  )

backward_preds_clean <- backward_preds |>
  filter(!is.na(.pred))

Evaluate backward model

backward_log_metrics <- backward_preds_clean |>
  metrics(truth = log_sale_price, estimate = .pred) |>
  mutate(model = "Backward Selection")

backward_log_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model             
##   <chr>   <chr>          <dbl> <chr>             
## 1 rmse    standard      0.146  Backward Selection
## 2 rsq     standard      0.873  Backward Selection
## 3 mae     standard      0.0795 Backward Selection
backward_dollar_metrics <- backward_preds_clean |>
  metrics(truth = sale_price, estimate = pred_sale_price) |>
  mutate(model = "Backward Selection")

backward_dollar_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model             
##   <chr>   <chr>          <dbl> <chr>             
## 1 rmse    standard   37872.    Backward Selection
## 2 rsq     standard       0.796 Backward Selection
## 3 mae     standard   14066.    Backward Selection

Compare Models on Log Scale

log_model_comparison_wide <- bind_rows(
  backward_log_metrics,
  lasso_log_metrics
) |>
  select(model, .metric, .estimate) |>
  pivot_wider(
    names_from = .metric,
    values_from = .estimate
  )

log_model_comparison_wide
## # A tibble: 2 × 4
##   model               rmse   rsq    mae
##   <chr>              <dbl> <dbl>  <dbl>
## 1 Backward Selection 0.146 0.873 0.0795
## 2 Lasso              0.143 0.877 0.0808

On the log scale, the lasso model has lower RMSE and higher R^2, indicating stronger performance in predicting proportional differences in sale price. The backward-selection model shows slightly lower MAE, suggesting smaller typical errors, but lasso reduces larger errors more effectively due to its lower RMSE.

Compare Models on Dollar Scale

dollar_model_comparison_wide <- bind_rows(
  backward_dollar_metrics,
  lasso_dollar_metrics
) |>
  select(model, .metric, .estimate) |>
  pivot_wider(
    names_from = .metric,
    values_from = .estimate
  )

dollar_model_comparison_wide
## # A tibble: 2 × 4
##   model                rmse   rsq    mae
##   <chr>               <dbl> <dbl>  <dbl>
## 1 Backward Selection 37872. 0.796 14066.
## 2 Lasso              27273. 0.882 14049.

On the dollar scale, the lasso model performs better overall. It has a lower RMSE, higher R^2, and slightly lower MAE than the backward-selection model. The difference in RMSE is especially meaningful because RMSE penalizes larger errors more heavily; this suggests that lasso makes fewer large dollar-scale prediction errors. Although the MAE values are very close, lasso still has the smaller typical absolute error. Overall, lasso provides stronger predictive performance than backward selection on both the log scale and the dollar scale.

Conclusion

Because lasso performs better on both evaluation scales, it is the stronger feature-selection model for this analysis. It combines predictive accuracy with automatic feature selection, making it a useful bridge between interpretability and performance. Backward selection remains useful as a classical comparison model, but the test-set results support lasso as the preferred feature-selection approach.

Project 2: Multiple Linear Regression

The lasso model was used as an evidence-based guide for selecting predictors for the multiple linear regression model. Rather than choosing variables arbitrarily, I selected predictors that represented the main housing dimensions identified by lasso: quality, location, size, age, condition, basement features, garage capacity, amenities, and sale context.

This model is designed to be interpretable as well as predictive. Highly sparse, redundant, or difficult-to-explain predictors were excluded even if they appeared in the lasso ranking. This keeps the model useful for explaining patterns in home prices to a general audience.

mlr_features <- c(
  "overall_qual",
  "neighborhood",
  "gr_liv_area",
  "overall_cond",
  "garage_cars",
  "ms_sub_class",
  "ms_zoning",
  "year_built",
  "sale_condition",
  "bsmt_exposure",
  "fireplaces",
  "year_remod_add",
  "total_bsmt_sf",
  "full_bath",
  "condition_1",
  "bsmt_full_bath",
  "exter_qual",
  "central_air",
  "first_flr_sf",
  "screen_porch",
  "bsmt_qual",
  "kitchen_qual",
  "foundation",
  "lot_area",
  "exterior_1st"
)

Build the formula

mlr_formula <- as.formula(
  paste("log_sale_price ~", paste(mlr_features, collapse = " + "))
)

mlr_formula
## log_sale_price ~ overall_qual + neighborhood + gr_liv_area + 
##     overall_cond + garage_cars + ms_sub_class + ms_zoning + year_built + 
##     sale_condition + bsmt_exposure + fireplaces + year_remod_add + 
##     total_bsmt_sf + full_bath + condition_1 + bsmt_full_bath + 
##     exter_qual + central_air + first_flr_sf + screen_porch + 
##     bsmt_qual + kitchen_qual + foundation + lot_area + exterior_1st

Fit the MLR model with tidymodels

mlr_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

mlr_recipe <- recipe(mlr_formula, data = ames_train) |>
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

mlr_workflow <- workflow() |>
  add_recipe(mlr_recipe) |>
  add_model(mlr_spec)

mlr_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Cross-validation performance

set.seed(42)

mlr_cv_results <- fit_resamples(
  mlr_workflow,
  resamples = ames_folds,
  metrics = metric_set(rmse, rsq, mae),
  control = control_resamples(save_pred = TRUE)
)

mlr_cv_metrics <- collect_metrics(mlr_cv_results)

mlr_cv_metrics
## # A tibble: 3 × 6
##   .metric .estimator   mean     n  std_err .config        
##   <chr>   <chr>       <dbl> <int>    <dbl> <chr>          
## 1 mae     standard   0.0863     5 0.000172 pre0_mod0_post0
## 2 rmse    standard   0.140      5 0.00810  pre0_mod0_post0
## 3 rsq     standard   0.882      5 0.0149   pre0_mod0_post0

Fit final model on training data

final_mlr_fit <- fit(
  mlr_workflow,
  data = ames_train
)

final_mlr_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##                                            (Intercept)  
##                                              5.177e+00  
##                                            gr_liv_area  
##                                              2.302e-04  
##                                            garage_cars  
##                                              5.458e-02  
##                                             year_built  
##                                              2.108e-03  
##                                             fireplaces  
##                                              3.435e-02  
##                                         year_remod_add  
##                                              7.430e-04  
##                                          total_bsmt_sf  
##                                              2.455e-06  
##                                              full_bath  
##                                              3.944e-02  
##                                         bsmt_full_bath  
##                                              5.301e-02  
##                                           first_flr_sf  
##                                              1.164e-05  
##                                           screen_porch  
##                                              2.719e-04  
##                                               lot_area  
##                                              1.423e-06  
##                                      overall_qual_Poor  
##                                             -6.923e-02  
##                                      overall_qual_Fair  
##                                              1.342e-01  
##                             overall_qual_Below_Average  
##                                              1.839e-01  
##                                   overall_qual_Average  
##                                              2.460e-01  
##                             overall_qual_Above_Average  
##                                              2.876e-01  
##                                      overall_qual_Good  
##                                              3.265e-01  
##                                 overall_qual_Very_Good  
##                                              3.846e-01  
##                                 overall_qual_Excellent  
##                                              4.821e-01  
##                            overall_qual_Very_Excellent  
##                                              4.168e-01  
##                             neighborhood_College_Creek  
##                                              2.425e-02  
##                                  neighborhood_Old_Town  
## 
## ...
## and 210 more lines.

Model coefficients

mlr_tidy <- tidy(final_mlr_fit)

mlr_tidy_clean <- mlr_tidy |>
  filter(term != "(Intercept)") |>
  mutate(
    percent_change = (exp(estimate) - 1) * 100
  ) |>
  arrange(desc(abs(estimate)))

mlr_tidy_clean
## # A tibble: 126 × 6
##    term                     estimate std.error statistic  p.value percent_change
##    <chr>                       <dbl>     <dbl>     <dbl>    <dbl>          <dbl>
##  1 ms_zoning_A_agr            -1.00     0.138      -7.27 4.94e-13          -63.4
##  2 exterior_1st_PreCast        0.629    0.125       5.04 5.15e- 7           87.5
##  3 exterior_1st_CBlock         0.611    0.102       5.97 2.74e- 9           84.1
##  4 neighborhood_Green_Hills    0.598    0.0876      6.84 1.05e-11           81.9
##  5 overall_qual_Excellent      0.482    0.0999      4.83 1.47e- 6           62.0
##  6 overall_qual_Very_Excel…    0.417    0.103       4.04 5.57e- 5           51.7
##  7 overall_qual_Very_Good      0.385    0.0979      3.93 8.77e- 5           46.9
##  8 overall_cond_Excellent      0.369    0.0611      6.04 1.78e- 9           44.7
##  9 ms_sub_class_One_and_Ha…   -0.355    0.123      -2.89 3.84e- 3          -29.9
## 10 overall_qual_Good           0.326    0.0974      3.35 8.15e- 4           38.6
## # ℹ 116 more rows

Model summary

mlr_glance <- glance(final_mlr_fit)

mlr_glance
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.920         0.915 0.119      201.       0   126  1736. -3216. -2479.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The multiple linear regression model explains approximately 92.0% of the variation in log-transformed sale price, with an adjusted R^2 of 91.5% after accounting for model complexity. This indicates that the selected predictors provide strong explanatory power while maintaining a more interpretable structure than the full model. The residual standard error is 0.119 on the log scale, meaning that the model’s typical prediction error is roughly around 12% in sale price terms. The overall model test is highly significant, suggesting that the selected housing characteristics collectively provide meaningful information about home sale prices.

Although the model performs well, it is still complex because categorical variables are converted into many dummy-variable terms. The model has 126 degrees of freedom, reflecting predictors such as neighborhood, quality level, zoning, basement features, exterior type, and sale condition. This is appropriate for Ames Housing because home prices are shaped by both numeric features, such as living area and age, and categorical features, such as location and quality. However, the model should still be evaluated on test-set performance and diagnostic plots before being treated as a final predictive model.

Test-set predictions

mlr_preds <- predict(final_mlr_fit, new_data = ames_test) |>
  bind_cols(
    ames_test |>
      select(sale_price, log_sale_price)
  ) |>
  mutate(
    pred_sale_price = exp(.pred),
    residual_log = log_sale_price - .pred,
    residual_dollars = sale_price - pred_sale_price
  )

head(mlr_preds)
## # A tibble: 6 × 6
##   .pred sale_price log_sale_price pred_sale_price residual_log residual_dollars
##   <dbl>      <int>          <dbl>           <dbl>        <dbl>            <dbl>
## 1  12.0     189900           12.2         170444.       0.108            19456.
## 2  12.1     170000           12.0         187368.      -0.0973          -17368.
## 3  12.6     306000           12.6         292309.       0.0458           13691.
## 4  13.1     500000           13.1         484153.       0.0322           15847.
## 5  12.7     320000           12.7         325614.      -0.0174           -5614.
## 6  12.1     192000           12.2         187871.       0.0217            4129.

Test-set metrics

mlr_log_metrics <- mlr_preds |>
  metrics(truth = log_sale_price, estimate = .pred) |>
  mutate(model = "Multiple Linear Regression")

mlr_log_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model                     
##   <chr>   <chr>          <dbl> <chr>                     
## 1 rmse    standard      0.136  Multiple Linear Regression
## 2 rsq     standard      0.890  Multiple Linear Regression
## 3 mae     standard      0.0870 Multiple Linear Regression
mlr_dollar_metrics <- mlr_preds |>
  metrics(truth = sale_price, estimate = pred_sale_price) |>
  mutate(model = "Multiple Linear Regression")

mlr_dollar_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model                     
##   <chr>   <chr>          <dbl> <chr>                     
## 1 rmse    standard   29291.    Multiple Linear Regression
## 2 rsq     standard       0.864 Multiple Linear Regression
## 3 mae     standard   15452.    Multiple Linear Regression

Actual vs predicted plot

mlr_preds |>
  ggplot(aes(x = sale_price, y = pred_sale_price)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_x_continuous(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Multiple Linear Regression: Actual vs. Predicted Sale Price",
    x = "Actual Sale Price",
    y = "Predicted Sale Price"
  ) +
  theme_minimal()

Residual plot

mlr_preds |>
  ggplot(aes(x = .pred, y = residual_log)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "MLR Residuals vs. Predicted Log Sale Price",
    x = "Predicted Log Sale Price",
    y = "Residual"
  ) +
  theme_minimal()

Q-Q plot of residuals

mlr_preds |>
  ggplot(aes(sample = residual_log)) +
  stat_qq() +
  stat_qq_line() +
  labs(
    title = "Q-Q Plot of MLR Residuals",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

The actual-versus-predicted plot shows whether predicted prices closely follow the observed prices. Points near the dashed line represent accurate predictions. If points at the high end fall farther from the line, this suggests that expensive homes are harder to predict, which is common because luxury or unusual properties may have features not fully captured by the model.

The residual plot checks whether prediction errors show systematic patterns. Ideally, residuals should be centered around zero with no clear curve or funnel shape. A wider spread for higher predicted prices would suggest heteroscedasticity, meaning the model makes less consistent predictions for more expensive homes.

The Q-Q plot checks whether residuals are approximately normally distributed. Residuals following the line in the center but deviating in the tails suggests that the model works reasonably well for typical homes but struggles more with unusually low- or high-priced properties.

Fitted Model Equation

The general fitted model is:

\[ \widehat{\log(\text{Sale Price})} = \beta_0 + \beta_1(\text{gr\_liv\_area}) + \beta_2(\text{garage\_cars}) + \beta_3(\text{year\_built}) + \beta_4(\text{fireplaces}) + \beta_5(\text{year\_remod\_add}) + \cdots \]

Using the fitted coefficient estimates, a simplified version of the model is:

\[ \widehat{\log(\text{Sale Price})} = 5.177 + 0.000230(\text{Gr Liv Area}) + 0.0546(\text{Garage Cars}) + 0.00211(\text{Year Built}) + 0.0344(\text{Fireplaces}) + 0.000743(\text{Year Remod Add}) + \cdots \]

The ellipsis represents the remaining predictors in the model, including additional numeric variables and dummy variables for categorical features such as overall quality, neighborhood, zoning, basement quality, and sale condition. These categorical variables are encoded relative to reference categories.

Because the response variable is log-transformed, each coefficient represents an approximate percentage change in sale price for a one-unit increase in the predictor, holding other variables constant. This allows the model to be interpreted in relative terms rather than absolute dollar changes.

Key Coefficient Interpretations:

gr_liv_area: Each additional square foot of above-ground living area is associated with about a 0.023% increase in expected sale price, holding other predictors constant.

garage_cars: Each additional garage space is associated with about a 5.6% increase in expected sale price, holding other factors constant.

year_built: Each additional year newer is associated with about a 0.21% increase in expected sale price other factors held constant.

fireplaces: Each additional fireplace is associated with about a 3.5% increase in expected sale price other factors held constant.

full_bath: Each additional full bathroom is associated with about a 4.0% increase in expected sale price other factors held constant.

bsmt_full_bath: Each additional basement full bathroom is associated with about a 5.4% increase in expected sale price holding other factors constant.

Project 3: Polynomial Regression for Nonlinear Effects

Polynomial regression was used to test whether selected numeric predictors have nonlinear relationships with log_sale_price. This model uses the same main feature set as the multiple linear regression model, but allows several important numeric predictors to have curved effects.

poly_vars <- c(
  "gr_liv_area",
  "total_bsmt_sf",
  "year_built",
  "lot_area"
)

poly_vars <- intersect(poly_vars, mlr_features)

poly_vars
## [1] "gr_liv_area"   "total_bsmt_sf" "year_built"    "lot_area"

Polynomial Regression

poly_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")
# Formula using the same selected features as MLR
poly_formula <- as.formula(
  paste("log_sale_price ~", paste(mlr_features, collapse = " + "))
)

poly_formula
## log_sale_price ~ overall_qual + neighborhood + gr_liv_area + 
##     overall_cond + garage_cars + ms_sub_class + ms_zoning + year_built + 
##     sale_condition + bsmt_exposure + fireplaces + year_remod_add + 
##     total_bsmt_sf + full_bath + condition_1 + bsmt_full_bath + 
##     exter_qual + central_air + first_flr_sf + screen_porch + 
##     bsmt_qual + kitchen_qual + foundation + lot_area + exterior_1st

Fitting the recipe

poly_recipe <- recipe(poly_formula, data = ames_train) |>
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_poly(all_of(poly_vars), degree = 2) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

Polynomial regression workflow

poly_workflow <- workflow() |>
  add_recipe(poly_recipe) |>
  add_model(poly_spec)

poly_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_poly()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Cross-Validation Perfomance

set.seed(42)

poly_cv_results <- fit_resamples(
  poly_workflow,
  resamples = ames_folds,
  metrics = metric_set(rmse, rsq, mae),
  control = control_resamples(save_pred = TRUE)
)

poly_cv_metrics <- collect_metrics(poly_cv_results)

poly_cv_metrics
## # A tibble: 3 × 6
##   .metric .estimator   mean     n  std_err .config        
##   <chr>   <chr>       <dbl> <int>    <dbl> <chr>          
## 1 mae     standard   0.0805     5 0.000818 pre0_mod0_post0
## 2 rmse    standard   0.126      5 0.00426  pre0_mod0_post0
## 3 rsq     standard   0.904      5 0.00747  pre0_mod0_post0

Fit Final Polynomial Model

final_poly_fit <- fit(
  poly_workflow,
  data = ames_train
)

final_poly_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_poly()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##                                            (Intercept)  
##                                              9.972e+00  
##                                            garage_cars  
##                                              4.326e-02  
##                                             fireplaces  
##                                              2.976e-02  
##                                         year_remod_add  
##                                              6.577e-04  
##                                              full_bath  
##                                              1.515e-02  
##                                         bsmt_full_bath  
##                                              5.021e-02  
##                                           first_flr_sf  
##                                             -1.040e-06  
##                                           screen_porch  
##                                              2.092e-04  
##                                     gr_liv_area_poly_1  
##                                              6.541e+00  
##                                     gr_liv_area_poly_2  
##                                             -1.206e+00  
##                                   total_bsmt_sf_poly_1  
##                                              1.359e+00  
##                                   total_bsmt_sf_poly_2  
##                                             -2.352e+00  
##                                      year_built_poly_1  
##                                              4.629e+00  
##                                      year_built_poly_2  
##                                              5.328e-01  
##                                        lot_area_poly_1  
##                                              1.140e+00  
##                                        lot_area_poly_2  
##                                             -6.622e-01  
##                                      overall_qual_Poor  
##                                             -3.304e-02  
##                                      overall_qual_Fair  
##                                              1.014e-01  
##                             overall_qual_Below_Average  
##                                              1.358e-01  
##                                   overall_qual_Average  
##                                              1.870e-01  
##                             overall_qual_Above_Average  
##                                              2.162e-01  
##                                      overall_qual_Good  
##                                              2.518e-01  
##                                 overall_qual_Very_Good  
## 
## ...
## and 218 more lines.

Model Summary

poly_glance <- glance(final_poly_fit)

poly_glance
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.936         0.932 0.106      247.       0   130  1994. -3723. -2963.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Test-Set Predictions

poly_preds <- predict(final_poly_fit, new_data = ames_test) |>
  bind_cols(
    ames_test |>
      select(sale_price, log_sale_price)
  ) |>
  mutate(
    pred_sale_price = exp(.pred),
    residual_log = log_sale_price - .pred,
    residual_dollars = sale_price - pred_sale_price
  )

head(poly_preds)
## # A tibble: 6 × 6
##   .pred sale_price log_sale_price pred_sale_price residual_log residual_dollars
##   <dbl>      <int>          <dbl>           <dbl>        <dbl>            <dbl>
## 1  12.1     189900           12.2         177955.       0.0650           11945.
## 2  12.1     170000           12.0         180117.      -0.0578          -10117.
## 3  12.6     306000           12.6         289080.       0.0569           16920.
## 4  13.0     500000           13.1         427360.       0.157            72640.
## 5  12.7     320000           12.7         327385.      -0.0228           -7385.
## 6  12.1     192000           12.2         185276.       0.0356            6724.

Test-Set Metrics

poly_log_metrics <- poly_preds |>
  metrics(truth = log_sale_price, estimate = .pred) |>
  mutate(model = "Polynomial Regression")

poly_log_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model                
##   <chr>   <chr>          <dbl> <chr>                
## 1 rmse    standard      0.130  Polynomial Regression
## 2 rsq     standard      0.900  Polynomial Regression
## 3 mae     standard      0.0811 Polynomial Regression
poly_dollar_metrics <- poly_preds |>
  metrics(truth = sale_price, estimate = pred_sale_price) |>
  mutate(model = "Polynomial Regression")

poly_dollar_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model                
##   <chr>   <chr>          <dbl> <chr>                
## 1 rmse    standard   25775.    Polynomial Regression
## 2 rsq     standard       0.895 Polynomial Regression
## 3 mae     standard   14266.    Polynomial Regression

Compare MLR and Polynomial Regression

model_comparison_log <- bind_rows(
  mlr_log_metrics,
  poly_log_metrics
) |>
  select(model, .metric, .estimate) |>
  pivot_wider(
    names_from = .metric,
    values_from = .estimate
  )

model_comparison_log
## # A tibble: 2 × 4
##   model                       rmse   rsq    mae
##   <chr>                      <dbl> <dbl>  <dbl>
## 1 Multiple Linear Regression 0.136 0.890 0.0870
## 2 Polynomial Regression      0.130 0.900 0.0811
model_comparison_dollars <- bind_rows(
  mlr_dollar_metrics,
  poly_dollar_metrics
) |>
  select(model, .metric, .estimate) |>
  pivot_wider(
    names_from = .metric,
    values_from = .estimate
  )

model_comparison_dollars
## # A tibble: 2 × 4
##   model                        rmse   rsq    mae
##   <chr>                       <dbl> <dbl>  <dbl>
## 1 Multiple Linear Regression 29291. 0.864 15452.
## 2 Polynomial Regression      25775. 0.895 14266.

Diagnostic Plots

# Actual vs predicted plot
poly_preds |>
  ggplot(aes(x = sale_price, y = pred_sale_price)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_x_continuous(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Polynomial Regression: Actual vs. Predicted Sale Price",
    x = "Actual Sale Price",
    y = "Predicted Sale Price"
  ) +
  theme_minimal()

poly_preds |>
  ggplot(aes(x = .pred, y = residual_log)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Polynomial Regression Residuals vs. Predicted Log Sale Price",
    x = "Predicted Log Sale Price",
    y = "Residual"
  ) +
  theme_minimal()

poly_preds |>
  ggplot(aes(sample = residual_log)) +
  stat_qq() +
  stat_qq_line() +
  labs(
    title = "Q-Q Plot of Polynomial Regression Residuals",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

Visualize Polynomial Effects

# Visualize polynomial effect: living area
ames_train |>
  ggplot(aes(x = gr_liv_area, y = log_sale_price)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
  labs(
    title = "Polynomial Relationship Between Living Area and Log Sale Price",
    x = "Above-Ground Living Area",
    y = "Log Sale Price"
  ) +
  theme_minimal()

# Visualize polynomial effect: basement size
ames_train |>
  ggplot(aes(x = total_bsmt_sf, y = log_sale_price)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
  labs(
    title = "Polynomial Relationship Between Basement Size and Log Sale Price",
    x = "Total Basement Square Feet",
    y = "Log Sale Price"
  ) +
  theme_minimal()

# Visualize polynomial effect: year built
ames_train |>
  ggplot(aes(x = year_built, y = log_sale_price)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
  labs(
    title = "Polynomial Relationship Between Year Built and Log Sale Price",
    x = "Year Built",
    y = "Log Sale Price"
  ) +
  theme_minimal()

# Visualize polynomial effect: lot area
ames_train |>
  ggplot(aes(x = lot_area, y = log_sale_price)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE) +
  labs(
    title = "Polynomial Relationship Between Lot Area and Log Sale Price",
    x = "Lot Area",
    y = "Log Sale Price"
  ) +
  theme_minimal()

Polynomial Regression Results

The polynomial regression model improved prediction compared with the multiple linear regression model. In the training data, the polynomial model had an R^2 of 0.936 and an adjusted R^2 of 0.932, compared with 0.920 and 0.915 for the multiple linear regression model. The residual standard error also decreased from 0.119 to 0.106, suggesting that the polynomial model reduced typical prediction error on the log-price scale.

More importantly, the improvement also appeared on the test set. The polynomial model had a test-set RMSE of 0.130, compared with 0.136 for the multiple linear regression model. Its test-set R^2 increased from 0.890 to 0.900, and MAE decreased from 0.087 to 0.081. This indicates that the polynomial model generalized slightly better to new observations, rather than only improving the fit to the training data.

Statistically, these results suggest that some important housing variables have nonlinear relationships with sale price. Living area, basement size, year built, and lot area may not affect home value at a constant rate across their full ranges. By allowing second-degree polynomial effects for these predictors, the model captures curvature while remaining interpretable as an extension of multiple linear regression.

Because step_poly() uses transformed polynomial terms, the polynomial coefficients should be interpreted as evidence of curvature rather than as simple raw squared effects. The main takeaway is that adding nonlinear flexibility improved predictive performance while preserving a clear regression-based modeling structure.

Polynomial Regression Equation

The polynomial regression model extends the multiple linear regression model by allowing selected numeric predictors to have curved relationships with log_sale_price.

The general form of the model is:

\[ \widehat{\log(\text{Sale Price})} = 9.972 + 6.541(\text{gr\_liv\_area})_1 - 1.206(\text{gr\_liv\_area})_2 + 1.359(\text{total\_bsmt\_sf})_1 - 2.352(\text{total\_bsmt\_sf})_2 + 4.629(\text{year\_built})_1 + 0.533(\text{year\_built})_2 + 1.140(\text{lot\_area})_1 - 0.662(\text{lot\_area})_2 + \cdots \] Because step_poly() was used, the _1 and _2 terms represent first- and second-degree polynomial basis terms, not the raw variable and raw squared variable.

In this model, gr_liv_area, total_bsmt_sf, year_built, and lot_area were allowed to have second-degree polynomial effects.

Because the response variable is log_sale_price, the model predicts the log of home sale price rather than sale price directly. The predicted dollar sale price is obtained by exponentiating the fitted value:

\[ \widehat{\text{Sale Price}} = e^{\widehat{\log(\text{Sale Price})}} \]

Project 4: Multinomial Logistic Regression

Can selected Ames housing characteristics classify homes into low, middle, and high price categories?

We will create price_category with 3 levels low, middle and high. Baseline level will be low. The model estimates 2 equations:

We will use a smaller feature set than MLR. Multinomial models with many categorical predictors can become large quickly.

Create Price Categories

# Create tertile cut points using the training data only
price_cuts <- quantile(
  ames_train$sale_price,
  probs = c(1 / 3, 2 / 3),
  na.rm = TRUE
)

price_cuts
## 33.33333% 66.66667% 
##    139000    192000
# Add price category to training and testing data
ames_train_multi <- ames_train |>
  mutate(
    price_category = case_when(
      sale_price <= price_cuts[1] ~ "Low",
      sale_price <= price_cuts[2] ~ "Middle",
      TRUE ~ "High"
    ),
    price_category = factor(
      price_category,
      levels = c("Low", "Middle", "High")
    )
  )

ames_test_multi <- ames_test |>
  mutate(
    price_category = case_when(
      sale_price <= price_cuts[1] ~ "Low",
      sale_price <= price_cuts[2] ~ "Middle",
      TRUE ~ "High"
    ),
    price_category = factor(
      price_category,
      levels = c("Low", "Middle", "High")
    )
  )
# Check class balance
bind_rows(
  train = ames_train_multi |> count(price_category),
  test = ames_test_multi |> count(price_category),
  .id = "dataset"
) |>
  group_by(dataset) |>
  mutate(prop = n / sum(n))
## # A tibble: 6 × 4
## # Groups:   dataset [2]
##   dataset price_category     n  prop
##   <chr>   <fct>          <int> <dbl>
## 1 train   Low              796 0.340
## 2 train   Middle           773 0.330
## 3 train   High             775 0.331
## 4 test    Low              185 0.316
## 5 test    Middle           228 0.389
## 6 test    High             173 0.295

Visualize Price Categories

ames_train_multi |>
  ggplot(aes(x = price_category, fill = price_category)) +
  geom_bar() +
  labs(
    title = "Distribution of Housing Price Categories",
    x = "Price Category",
    y = "Number of Homes"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

ames_train_multi |>
  ggplot(aes(x = price_category, y = sale_price, fill = price_category)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Sale Price by Price Category",
    x = "Price Category",
    y = "Sale Price"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Multinomial Model Features

The model uses selected housing predictors that were important in the earlier regression work. These include measures of quality, location, size, age, garage capacity, basement features, and sale context.

multi_features <- c(
  "overall_qual",
  "gr_liv_area",
  "overall_cond",
  "garage_cars",
  "year_built",
  "fireplaces",
  "year_remod_add",
  "total_bsmt_sf",
  "full_bath",
  "central_air",
  "lot_area"
)

multi_formula <- reformulate(
  termlabels = multi_features,
  response = "price_category"
)

multi_formula
## price_category ~ overall_qual + gr_liv_area + overall_cond + 
##     garage_cars + year_built + fireplaces + year_remod_add + 
##     total_bsmt_sf + full_bath + central_air + lot_area

Model Specification and Workflow

multi_spec <- multinom_reg() |>
  set_engine("nnet") |>
  set_mode("classification")

multi_recipe <- recipe(multi_formula, data = ames_train_multi) |>
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

multi_workflow <- workflow() |>
  add_recipe(multi_recipe) |>
  add_model(multi_spec)

multi_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Multinomial Regression Model Specification (classification)
## 
## Computational engine: nnet

Cross-Validation Performance

set.seed(42)

multi_folds <- vfold_cv(
  ames_train_multi,
  v = 5,
  strata = price_category
)
set.seed(42)

multi_cv_results <- fit_resamples(
  multi_workflow,
  resamples = multi_folds,
  metrics = metric_set(accuracy, kap, mn_log_loss),
  control = control_resamples(save_pred = TRUE)
)

multi_cv_metrics <- collect_metrics(multi_cv_results)

multi_cv_metrics
## # A tibble: 3 × 6
##   .metric     .estimator  mean     n std_err .config        
##   <chr>       <chr>      <dbl> <int>   <dbl> <chr>          
## 1 accuracy    multiclass 0.827     5 0.00987 pre0_mod0_post0
## 2 kap         multiclass 0.740     5 0.0148  pre0_mod0_post0
## 3 mn_log_loss multiclass 0.519     5 0.0341  pre0_mod0_post0

Fit Final Model

final_multi_fit <- fit(
  multi_workflow,
  data = ames_train_multi
)

final_multi_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## nnet::multinom(formula = ..y ~ ., data = data, trace = FALSE)
## 
## Coefficients:
##        (Intercept) gr_liv_area garage_cars year_built fireplaces year_remod_add
## Middle   -173.2418 0.003396774   0.8155047 0.04725986  0.7878914     0.02136111
## High     -320.7770 0.007201304   1.1042862 0.08363337  1.3111405     0.05844962
##        total_bsmt_sf   full_bath     lot_area overall_qual_Poor
## Middle   0.001548055  0.02140128 0.0001696131         -12.50913
## High     0.003591289 -0.20667223 0.0002515269         -59.98598
##        overall_qual_Fair overall_qual_Below_Average overall_qual_Average
## Middle         -19.63521                  10.873724            11.468132
## High           -73.09130                   2.519599             4.614533
##        overall_qual_Above_Average overall_qual_Good overall_qual_Very_Good
## Middle                  12.251735         13.370957               14.01366
## High                     6.351918          8.692872               10.97494
##        overall_qual_Excellent overall_qual_Very_Excellent overall_cond_Poor
## Middle               65.35161                   15.810281         10.315792
## High                 62.25347                    8.553918          4.609423
##        overall_cond_Fair overall_cond_Below_Average overall_cond_Average
## Middle         14.320916                   15.25583             15.78269
## High            7.743726                   10.77664             12.75549
##        overall_cond_Above_Average overall_cond_Good overall_cond_Very_Good
## Middle                   16.64329          17.16676               17.30265
## High                     14.60196          15.84057               15.18794
##        overall_cond_Excellent central_air_Y
## Middle               18.47110     1.3265716
## High                 16.65102     0.4867664
## 
## Residual Deviance: 1920.906 
## AIC: 2028.906

Coefficients and Odds Ratios

multi_fit <- extract_fit_parsnip(final_multi_fit)
raw_multi_fit <- multi_fit$fit

multi_coef <- coef(raw_multi_fit)

multi_tidy_clean <- as.data.frame(multi_coef) |>
  rownames_to_column("price_category") |>
  pivot_longer(
    cols = -price_category,
    names_to = "term",
    values_to = "estimate"
  ) |>
  mutate(
    odds_ratio = exp(estimate)
  ) |>
  arrange(price_category, desc(abs(estimate)))

multi_tidy_clean
## # A tibble: 54 × 4
##    price_category term                       estimate odds_ratio
##    <chr>          <chr>                         <dbl>      <dbl>
##  1 High           (Intercept)                  -321.   4.88e-140
##  2 High           overall_qual_Fair             -73.1  1.81e- 32
##  3 High           overall_qual_Excellent         62.3  1.09e+ 27
##  4 High           overall_qual_Poor             -60.0  8.88e- 27
##  5 High           overall_cond_Excellent         16.7  1.70e+  7
##  6 High           overall_cond_Good              15.8  7.58e+  6
##  7 High           overall_cond_Very_Good         15.2  3.94e+  6
##  8 High           overall_cond_Above_Average     14.6  2.20e+  6
##  9 High           overall_cond_Average           12.8  3.46e+  5
## 10 High           overall_qual_Very_Good         11.0  5.84e+  4
## # ℹ 44 more rows
multi_tidy_clean |>
  group_by(price_category) |>
  slice_max(order_by = abs(estimate), n = 10) |>
  ungroup()
## # A tibble: 20 × 4
##    price_category term                        estimate odds_ratio
##    <chr>          <chr>                          <dbl>      <dbl>
##  1 High           (Intercept)                   -321.   4.88e-140
##  2 High           overall_qual_Fair              -73.1  1.81e- 32
##  3 High           overall_qual_Excellent          62.3  1.09e+ 27
##  4 High           overall_qual_Poor              -60.0  8.88e- 27
##  5 High           overall_cond_Excellent          16.7  1.70e+  7
##  6 High           overall_cond_Good               15.8  7.58e+  6
##  7 High           overall_cond_Very_Good          15.2  3.94e+  6
##  8 High           overall_cond_Above_Average      14.6  2.20e+  6
##  9 High           overall_cond_Average            12.8  3.46e+  5
## 10 High           overall_qual_Very_Good          11.0  5.84e+  4
## 11 Middle         (Intercept)                   -173.   5.78e- 76
## 12 Middle         overall_qual_Excellent          65.4  2.41e+ 28
## 13 Middle         overall_qual_Fair              -19.6  2.97e-  9
## 14 Middle         overall_cond_Excellent          18.5  1.05e+  8
## 15 Middle         overall_cond_Very_Good          17.3  3.27e+  7
## 16 Middle         overall_cond_Good               17.2  2.85e+  7
## 17 Middle         overall_cond_Above_Average      16.6  1.69e+  7
## 18 Middle         overall_qual_Very_Excellent     15.8  7.35e+  6
## 19 Middle         overall_cond_Average            15.8  7.15e+  6
## 20 Middle         overall_cond_Below_Average      15.3  4.22e+  6

Test-Set Predictions

multi_class_preds <- predict(
  final_multi_fit,
  new_data = ames_test_multi,
  type = "class"
)

multi_prob_preds <- predict(
  final_multi_fit,
  new_data = ames_test_multi,
  type = "prob"
)

multi_preds <- ames_test_multi |>
  select(sale_price, price_category) |>
  bind_cols(multi_class_preds, multi_prob_preds)

head(multi_preds)
## # A tibble: 6 × 6
##   sale_price price_category .pred_class .pred_Low .pred_Middle .pred_High
##        <int> <fct>          <fct>           <dbl>        <dbl>      <dbl>
## 1     189900 Middle         Middle       1.28e- 2    0.827          0.160
## 2     170000 Middle         Middle       1.17e- 2    0.797          0.191
## 3     306000 High           High         7.12e- 7    0.00759        0.992
## 4     500000 High           High         1.18e-34    0.0000114      1.000
## 5     320000 High           High         1.98e- 8    0.00128        0.999
## 6     192000 Middle         High         2.30e- 3    0.247          0.751

Confusion Matrix

multi_conf_mat <- multi_preds |>
  conf_mat(
    truth = price_category,
    estimate = .pred_class
  )

multi_conf_mat
##           Truth
## Prediction Low Middle High
##     Low    164     39    0
##     Middle  20    170   29
##     High     1     19  144
# 3. Plot heatmap
autoplot(multi_conf_mat, type = "heatmap") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(
    title = "Confusion Matrix Heatmap",
    x = "Predicted Class",
    y = "Actual Class"
  ) +
  theme_minimal()

multi_preds |>
  count(price_category, .pred_class) |>
  pivot_wider(
    names_from = .pred_class,
    values_from = n,
    values_fill = 0
  )
## # A tibble: 3 × 4
##   price_category   Low Middle  High
##   <fct>          <int>  <int> <int>
## 1 Low              164     20     1
## 2 Middle            39    170    19
## 3 High               0     29   144

Test-Set Classification Metrics

multi_metrics <- multi_preds |>
  metrics(
    truth = price_category,
    estimate = .pred_class
  )

multi_log_loss <- multi_preds |>
  mn_log_loss(
    truth = price_category,
    .pred_Low,
    .pred_Middle,
    .pred_High
  )

multi_model_metrics <- bind_rows(
  multi_metrics,
  multi_log_loss
)

multi_model_metrics
## # A tibble: 3 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    multiclass     0.816
## 2 kap         multiclass     0.722
## 3 mn_log_loss multiclass     0.509

Visualize Predicted vs. Actual Categories

multi_preds |>
  count(price_category, .pred_class) |>
  ggplot(aes(x = price_category, y = n, fill = .pred_class)) +
  geom_col(position = "stack") +
  labs(
    title = "Predicted vs. Actual Housing Price Categories",
    x = "Actual Price Category",
    y = "Number of Homes",
    fill = "Predicted Category"
  ) +
  theme_minimal()

Predicted Probability Distributions

multi_preds |>
  pivot_longer(
    cols = starts_with(".pred_") & where(is.numeric),
    names_to = "predicted_category",
    values_to = "predicted_probability"
  ) |>
  ggplot(aes(x = price_category, y = predicted_probability, fill = price_category)) +
  geom_boxplot() +
  facet_wrap(~ predicted_category) +
  labs(
    title = "Predicted Probabilities by Actual Price Category",
    x = "Actual Price Category",
    y = "Predicted Probability"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation

The multinomial logistic regression model predicts whether a home belongs to the Low, Middle, or High price category. Unlike the earlier regression models, this model treats sale price as a categorical outcome. This makes the model useful for classification rather than exact price prediction.

The model uses Low as the reference category, so the coefficient estimates describe how each predictor changes the log-odds of being in the Middle or High category compared with the Low category. Positive coefficients increase the odds of being in a higher price category relative to Low, while negative coefficients decrease those odds.

The test-set results show that the model performs well overall, especially for distinguishing Low- and High-price homes. The Middle category is more difficult to classify because these homes share characteristics with both cheaper and more expensive homes. This is a common classification challenge when the middle class is created from ordered price ranges.

Multinomial Logistic Regression Equations

Because the response variable has three categories, the model estimates two equations relative to the baseline category, Low.

\[ \log\left(\frac{\hat{p}_{Middle}}{\hat{p}_{Low}}\right) = -173.242 + 0.00340(\text{gr\_liv\_area}) + 0.816(\text{garage\_cars}) + 0.0473(\text{year\_built}) + 0.788(\text{fireplaces}) + 0.0214(\text{year\_remod\_add}) + 0.00155(\text{total\_bsmt\_sf}) + 0.0214(\text{full\_bath}) + 0.000170(\text{lot\_area}) + 1.327(\text{central\_air\_Y}) + \cdots \]

\[ \log\left(\frac{\hat{p}_{High}}{\hat{p}_{Low}}\right) = -320.777 + 0.00720(\text{gr\_liv\_area}) + 1.104(\text{garage\_cars}) + 0.0836(\text{year\_built}) + 1.311(\text{fireplaces}) + 0.0584(\text{year\_remod\_add}) + 0.00359(\text{total\_bsmt\_sf}) - 0.207(\text{full\_bath}) + 0.000252(\text{lot\_area}) + 0.487(\text{central\_air\_Y}) + \cdots \]

The multinomial logistic regression model predicts whether a home belongs to the Low, Middle, or High price category. Because Low is the reference category, the model estimates two log-odds equations: one comparing Middle to Low and one comparing High to Low. Positive coefficients increase the log-odds of being classified as Middle or High rather than Low, while negative coefficients decrease those log-odds, holding other predictors constant.

The coefficient patterns are consistent with expectations. Larger living area, more garage capacity, newer construction, more fireplaces, larger basement size, and larger lot area generally increase the likelihood that a home belongs to a higher price category rather than the Low category. For example, the coefficient for gr_liv_area is larger in the High equation than in the Middle equation, suggesting that living area is especially important for distinguishing high-priced homes from low-priced homes.

The model classified approximately 81.6% of test homes correctly. Its kappa value of 0.722 indicates substantial agreement beyond chance, which is important for a multiclass classification problem. The multinomial log loss of 0.509 suggests that the model’s predicted probabilities are reasonably informative, not just its final class labels.

Overall, the multinomial logistic regression model provides strong evidence that selected Ames housing characteristics can distinguish low-, middle-, and high-price homes. The model is especially useful for showing how regression-based ideas can be extended from continuous price prediction to categorical classification.

Project 5: Poisson Regression

Purpose of the Model

In this section, I use Poisson regression to model a count outcome from the Ames Housing dataset. The response variable is tot_rms_abv_grd, which measures the total number of above-ground rooms in a home.

Because this outcome is a count, Poisson regression is a reasonable generalized linear model to consider. I also fit a standard linear regression model using the same predictors as a comparison model.

Check distribution of count outcome

ames_train |>
  ggplot(aes(x = tot_rms_abv_grd)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Distribution of Total Rooms Above Ground",
    x = "Total Rooms Above Ground",
    y = "Number of Homes"
  ) +
  theme_minimal()

# Mean and variance of count outcome
ames_train |>
  summarise(
    mean_rooms = mean(tot_rms_abv_grd, na.rm = TRUE),
    variance_rooms = var(tot_rms_abv_grd, na.rm = TRUE)
  )
## # A tibble: 1 × 2
##   mean_rooms variance_rooms
##        <dbl>          <dbl>
## 1       6.45           2.52

Predictor Set

The predictors were chosen to represent housing characteristics that may explain the number of above-ground rooms, including living area, bathrooms, age, basement size, garage capacity, and overall quality.

poisson_features <- c(
  "gr_liv_area",
  "year_built",
  "year_remod_add",
  "full_bath",
  "half_bath",
  "bedroom_abv_gr",
  "fireplaces",
  "total_bsmt_sf",
  "garage_cars",
  "overall_qual",
  "overall_cond"
)

poisson_formula <- reformulate(
  termlabels = poisson_features,
  response = "tot_rms_abv_grd"
)

poisson_formula
## tot_rms_abv_grd ~ gr_liv_area + year_built + year_remod_add + 
##     full_bath + half_bath + bedroom_abv_gr + fireplaces + total_bsmt_sf + 
##     garage_cars + overall_qual + overall_cond

Linear Regression Comparison Model

Before fitting the Poisson model, I fit a standard linear regression model to the same count outcome. This provides a comparison point, but a limitation of linear regression is that it can theoretically predict negative room counts.

# Linear regression model for comparison
room_lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

room_lm_recipe <- recipe(poisson_formula, data = ames_train) |>
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

room_lm_workflow <- workflow() |>
  add_recipe(room_lm_recipe) |>
  add_model(room_lm_spec)

room_lm_fit <- fit(
  room_lm_workflow,
  data = ames_train
)

room_lm_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##                 (Intercept)                  gr_liv_area  
##                   1.9959713                    0.0019109  
##                  year_built               year_remod_add  
##                  -0.0050185                    0.0044161  
##                   full_bath                    half_bath  
##                   0.0485959                   -0.0117184  
##              bedroom_abv_gr                   fireplaces  
##                   0.6530162                   -0.0184058  
##               total_bsmt_sf                  garage_cars  
##                  -0.0001223                    0.0817076  
##           overall_qual_Poor            overall_qual_Fair  
##                   0.5060055                    0.8032186  
##  overall_qual_Below_Average         overall_qual_Average  
##                   0.4190687                    0.5477962  
##  overall_qual_Above_Average            overall_qual_Good  
##                   0.3643046                    0.4540228  
##      overall_qual_Very_Good       overall_qual_Excellent  
##                   0.4532981                    0.9205828  
## overall_qual_Very_Excellent            overall_cond_Poor  
##                   0.5895420                    0.8073662  
##           overall_cond_Fair   overall_cond_Below_Average  
##                   0.2664559                    0.3296302  
##        overall_cond_Average   overall_cond_Above_Average  
##                   0.3437264                    0.2727707  
##           overall_cond_Good       overall_cond_Very_Good  
##                   0.1713796                    0.1664163  
##      overall_cond_Excellent  
##                   0.0276620

Poisson Regression Model

The Poisson regression model is fit using poisson_reg() with the glm engine. This model uses a log link, meaning the model estimates the log of the expected room count.

poisson_spec <- poisson_reg() |>
  set_engine("glm") |>
  set_mode("regression")

poisson_recipe <- recipe(poisson_formula, data = ames_train) |>
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

poisson_workflow <- workflow() |>
  add_recipe(poisson_recipe) |>
  add_model(poisson_spec)

poisson_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: poisson_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Poisson Regression Model Specification (regression)
## 
## Computational engine: glm
final_poisson_fit <- fit(
  poisson_workflow,
  data = ames_train
)

final_poisson_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: poisson_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_novel()
## • step_unknown()
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::poisson, data = data)
## 
## Coefficients:
##                 (Intercept)                  gr_liv_area  
##                   1.192e+00                    2.366e-04  
##                  year_built               year_remod_add  
##                  -9.319e-04                    8.003e-04  
##                   full_bath                    half_bath  
##                   1.641e-02                    5.887e-03  
##              bedroom_abv_gr                   fireplaces  
##                   1.019e-01                    2.144e-03  
##               total_bsmt_sf                  garage_cars  
##                  -1.428e-05                    1.632e-02  
##           overall_qual_Poor            overall_qual_Fair  
##                   6.241e-02                    2.315e-01  
##  overall_qual_Below_Average         overall_qual_Average  
##                   1.744e-01                    2.047e-01  
##  overall_qual_Above_Average            overall_qual_Good  
##                   1.838e-01                    2.047e-01  
##      overall_qual_Very_Good       overall_qual_Excellent  
##                   2.042e-01                    2.669e-01  
## overall_qual_Very_Excellent            overall_cond_Poor  
##                   1.836e-01                    1.074e-01  
##           overall_cond_Fair   overall_cond_Below_Average  
##                   7.728e-03                    2.181e-02  
##        overall_cond_Average   overall_cond_Above_Average  
##                   2.233e-02                    8.637e-03  
##           overall_cond_Good       overall_cond_Very_Good  
##                  -1.472e-02                   -1.614e-02  
##      overall_cond_Excellent  
##                  -4.118e-02  
## 
## Degrees of Freedom: 2343 Total (i.e. Null);  2317 Residual
## Null Deviance:       891.9 
## Residual Deviance: 240.3     AIC: 8966

Poisson Coefficients

poisson_tidy <- final_poisson_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  mutate(
    rate_ratio = exp(estimate)
  ) |>
  arrange(p.value)

poisson_tidy
## # A tibble: 27 × 6
##    term                    estimate std.error statistic  p.value rate_ratio
##    <chr>                      <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
##  1 bedroom_abv_gr          0.102    0.0129        7.91  2.62e-15      1.11 
##  2 gr_liv_area             0.000237 0.0000310     7.62  2.53e-14      1.00 
##  3 year_built             -0.000932 0.000510     -1.83  6.78e- 2      0.999
##  4 year_remod_add          0.000800 0.000609      1.31  1.89e- 1      1.00 
##  5 garage_cars             0.0163   0.0145        1.13  2.60e- 1      1.02 
##  6 overall_qual_Excellent  0.267    0.262         1.02  3.09e- 1      1.31 
##  7 (Intercept)             1.19     1.21          0.985 3.24e- 1      3.29 
##  8 overall_qual_Fair       0.231    0.264         0.876 3.81e- 1      1.26 
##  9 overall_qual_Average    0.205    0.257         0.798 4.25e- 1      1.23 
## 10 overall_qual_Good       0.205    0.258         0.793 4.28e- 1      1.23 
## # ℹ 17 more rows
poisson_glance <- final_poisson_fit |>
  extract_fit_parsnip() |>
  glance()

poisson_glance
## # A tibble: 1 × 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          892.    2343 -4456. 8966. 9122.     240.        2317  2344

Predictions from Linear and Poisson Models

room_lm_preds <- predict(
  room_lm_fit,
  new_data = ames_test
) |>
  rename(lm_pred = .pred)

poisson_preds <- predict(
  final_poisson_fit,
  new_data = ames_test
) |>
  rename(poisson_pred = .pred)

room_pred_comparison <- ames_test |>
  select(tot_rms_abv_grd) |>
  bind_cols(room_lm_preds, poisson_preds) |>
  mutate(
    lm_residual = tot_rms_abv_grd - lm_pred,
    poisson_residual = tot_rms_abv_grd - poisson_pred
  )

head(room_pred_comparison)
## # A tibble: 6 × 5
##   tot_rms_abv_grd lm_pred poisson_pred lm_residual poisson_residual
##             <int>   <dbl>        <dbl>       <dbl>            <dbl>
## 1               6    6.88         6.75     -0.878           -0.745 
## 2               6    5.83         5.95      0.171            0.0469
## 3               7    6.92         6.85      0.0807           0.151 
## 4              10    9.11         9.15      0.885            0.851 
## 5               7    7.95         7.84     -0.954           -0.842 
## 6               6    5.41         5.53      0.590            0.473

Prediction Metrics

room_lm_metrics <- room_pred_comparison |>
  metrics(
    truth = tot_rms_abv_grd,
    estimate = lm_pred
  ) |>
  mutate(model = "Linear Regression")

room_lm_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model            
##   <chr>   <chr>          <dbl> <chr>            
## 1 rmse    standard       0.768 Linear Regression
## 2 rsq     standard       0.742 Linear Regression
## 3 mae     standard       0.586 Linear Regression
poisson_metrics <- room_pred_comparison |>
  metrics(
    truth = tot_rms_abv_grd,
    estimate = poisson_pred
  ) |>
  mutate(model = "Poisson Regression")

poisson_metrics
## # A tibble: 3 × 4
##   .metric .estimator .estimate model             
##   <chr>   <chr>          <dbl> <chr>             
## 1 rmse    standard       0.784 Poisson Regression
## 2 rsq     standard       0.731 Poisson Regression
## 3 mae     standard       0.601 Poisson Regression
room_model_comparison <- bind_rows(
  room_lm_metrics,
  poisson_metrics
) |>
  select(model, .metric, .estimate) |>
  pivot_wider(
    names_from = .metric,
    values_from = .estimate
  )

room_model_comparison
## # A tibble: 2 × 4
##   model               rmse   rsq   mae
##   <chr>              <dbl> <dbl> <dbl>
## 1 Linear Regression  0.768 0.742 0.586
## 2 Poisson Regression 0.784 0.731 0.601

Prediction and Diagnostic Plots

room_pred_comparison |>
  ggplot(aes(x = lm_pred, y = poisson_pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Comparison of Linear and Poisson Model Predictions",
    x = "Predicted Rooms from Linear Regression",
    y = "Predicted Rooms from Poisson Regression"
  ) +
  theme_minimal()

room_pred_comparison |>
  ggplot(aes(x = tot_rms_abv_grd, y = poisson_pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Poisson Regression: Actual vs. Predicted Total Rooms",
    x = "Actual Total Rooms Above Ground",
    y = "Predicted Total Rooms Above Ground"
  ) +
  theme_minimal()

room_pred_comparison |>
  ggplot(aes(x = poisson_pred, y = poisson_residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Poisson Regression Residuals",
    x = "Predicted Total Rooms Above Ground",
    y = "Residual"
  ) +
  theme_minimal()

Check Dispersion

poisson_glm <- extract_fit_parsnip(final_poisson_fit)$fit

dispersion_ratio <- sum(residuals(poisson_glm, type = "pearson")^2) /
  poisson_glm$df.residual

dispersion_ratio
## [1] 0.102777

Interpretation

The Poisson regression model was used to model tot_rms_abv_grd, a count outcome representing the number of above-ground rooms in a home. The strongest predictors were the number of above-ground bedrooms and above-ground living area. Each additional above-ground bedroom was associated with a 10.7% higher expected room count, holding other variables constant. Similarly, each additional 100 square feet of above-ground living area was associated with about a 2.4% higher expected room count.

Most quality and condition variables were not statistically significant after accounting for size and bedrooms. This is reasonable because quality and condition are more directly related to home value than to the number of rooms. The model deviance decreased substantially from the null model to the fitted model, showing that the predictors improved model fit. However, the residual deviance was much smaller than the residual degrees of freedom, suggesting underdispersion. This means the room-count outcome is less variable than expected under a standard Poisson model.

Poisson Regression Equation

\[ \log\left(\widehat{E(\text{Total Rooms Above Ground})}\right) = 1.192 + 0.000237(\text{gr\_liv\_area}) - 0.000932(\text{year\_built}) + 0.000800(\text{year\_remod\_add}) + 0.0164(\text{full\_bath}) + 0.00589(\text{half\_bath}) + 0.1019(\text{bedroom\_abv\_gr}) + 0.00214(\text{fireplaces}) - 0.0000143(\text{total\_bsmt\_sf}) + 0.0163(\text{garage\_cars}) + \cdots \]

The ellipsis represents dummy-variable terms for overall_qual and overall_cond. Since the model uses a log link, exponentiating both sides gives the expected room count:

\[ \widehat{E(\text{Total Rooms Above Ground})} = e^{ 1.192 + 0.000237(\text{gr\_liv\_area}) - 0.000932(\text{year\_built}) + 0.000800(\text{year\_remod\_add}) + 0.0164(\text{full\_bath}) + 0.00589(\text{half\_bath}) + 0.1019(\text{bedroom\_abv\_gr}) + 0.00214(\text{fireplaces}) - 0.0000143(\text{total\_bsmt\_sf}) + 0.0163(\text{garage\_cars}) + \cdots } \]