Housing prices are influenced by a combination of structural, locational, and amenity-related characteristics. Understanding how these features relate to sale price is an important statistical modeling problem because it involves both explanation and prediction. In this project, I use the Ames Housing dataset to investigate how selected housing features are associated with house sale prices.
The main goal of this project is to determine whether a multiple
linear regression model or a polynomial regression model provides a
better explanation of the relationship between sale_price and housing
characteristics. I focus on building interpretable models using the
tidymodels framework in R and evaluating them using
held-out test data and regression diagnostics.
The main research questions for this project are:
- Which housing features are most important in explaining variation in sale price?
- Does a polynomial regression model improve upon a standard multiple linear regression model?
- Which model provides the best balance between predictive performance and interpretability?
A working hypothesis is that larger homes, newer homes, and homes located in more desirable neighborhoods will tend to have higher sale prices. In addition, it is plausible that the relationship between living area and sale price is not strictly linear, which motivates the comparison with a polynomial regression model.
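To make the comparison concrete, the two candidate models can be written out as a sketch. Here $x_i$ stands for the living area (gr_liv_area) of home $i$, $z_{ij}$ for the remaining predictors (including dummy variables for categorical features), and $\varepsilon_i$ for the error term. These symbols are my own shorthand for illustration, not notation taken from the dataset documentation.

```latex
\begin{align}
\text{Linear:}     \quad & \text{sale\_price}_i = \beta_0 + \beta_1 x_i + \sum_{j} \gamma_j z_{ij} + \varepsilon_i \\
\text{Polynomial:} \quad & \text{sale\_price}_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \sum_{j} \gamma_j z_{ij} + \varepsilon_i
\end{align}
```

The linear model is the special case $\beta_2 = 0$, so the comparison amounts to asking whether the extra quadratic term improves predictions on data the model has not seen.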
In this section, I load the Ames Housing data, inspect its structure, and create a cleaned modeling dataset. Because the original dataset contains many predictors, the initial exploration focuses on understanding variable types and identifying plausible features for the regression analysis.
# Load packages and data
library(tidymodels)
library(modeldata)
library(tidyverse)
library(janitor)
library(skimr)
library(forcats)
set.seed(42)
data(ames)
ames <- ames %>%
  clean_names()
glimpse(ames)
## Rows: 2,930
## Columns: 74
## $ ms_sub_class <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ ms_zoning <fct> Residential_Low_Density, Residential_High_Density, …
## $ lot_frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ alley <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ lot_shape <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ land_contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ lot_config <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ land_slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ bldg_type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ house_style <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ overall_cond <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ year_built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ roof_style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ roof_matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ mas_vnr_type <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ mas_vnr_area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ exter_cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ bsmt_cond <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ bsmt_exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ bsmt_fin_type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ bsmt_fin_sf_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ bsmt_fin_type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ bsmt_fin_sf_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ bsmt_unf_sf <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ total_bsmt_sf <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ heating_qc <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ central_air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ first_flr_sf <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ second_flr_sf <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ bsmt_full_bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ bsmt_half_bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ half_bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ bedroom_abv_gr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ kitchen_abv_gr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ tot_rms_abv_grd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ garage_type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ garage_finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ garage_cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ garage_area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ garage_cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ paved_drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ wood_deck_sf <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ open_porch_sf <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ enclosed_porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ screen_porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ pool_area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ misc_feature <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ misc_val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ mo_sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ year_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ sale_type <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ sale_condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ sale_price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ longitude <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
skim(ames)
| Name | ames |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| factor | 40 |
| numeric | 34 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| ms_sub_class | 0 | 1 | FALSE | 16 | One: 1079, Two: 575, One: 287, One: 192 |
| ms_zoning | 0 | 1 | FALSE | 7 | Res: 2273, Res: 462, Flo: 139, Res: 27 |
| street | 0 | 1 | FALSE | 2 | Pav: 2918, Grv: 12 |
| alley | 0 | 1 | FALSE | 3 | No_: 2732, Gra: 120, Pav: 78 |
| lot_shape | 0 | 1 | FALSE | 4 | Reg: 1859, Sli: 979, Mod: 76, Irr: 16 |
| land_contour | 0 | 1 | FALSE | 4 | Lvl: 2633, HLS: 120, Bnk: 117, Low: 60 |
| utilities | 0 | 1 | FALSE | 3 | All: 2927, NoS: 2, NoS: 1 |
| lot_config | 0 | 1 | FALSE | 5 | Ins: 2140, Cor: 511, Cul: 180, FR2: 85 |
| land_slope | 0 | 1 | FALSE | 3 | Gtl: 2789, Mod: 125, Sev: 16 |
| neighborhood | 0 | 1 | FALSE | 28 | Nor: 443, Col: 267, Old: 239, Edw: 194 |
| condition_1 | 0 | 1 | FALSE | 9 | Nor: 2522, Fee: 164, Art: 92, RRA: 50 |
| condition_2 | 0 | 1 | FALSE | 8 | Nor: 2900, Fee: 13, Art: 5, Pos: 4 |
| bldg_type | 0 | 1 | FALSE | 5 | One: 2425, Twn: 233, Dup: 109, Twn: 101 |
| house_style | 0 | 1 | FALSE | 8 | One: 1481, Two: 873, One: 314, SLv: 128 |
| overall_cond | 0 | 1 | FALSE | 9 | Ave: 1654, Abo: 533, Goo: 390, Ver: 144 |
| roof_style | 0 | 1 | FALSE | 6 | Gab: 2321, Hip: 551, Gam: 22, Fla: 20 |
| roof_matl | 0 | 1 | FALSE | 8 | Com: 2887, Tar: 23, WdS: 9, WdS: 7 |
| exterior_1st | 0 | 1 | FALSE | 16 | Vin: 1026, Met: 450, HdB: 442, Wd : 420 |
| exterior_2nd | 0 | 1 | FALSE | 17 | Vin: 1015, Met: 447, HdB: 406, Wd : 397 |
| mas_vnr_type | 0 | 1 | FALSE | 5 | Non: 1775, Brk: 880, Sto: 249, Brk: 25 |
| exter_cond | 0 | 1 | FALSE | 5 | Typ: 2549, Goo: 299, Fai: 67, Exc: 12 |
| foundation | 0 | 1 | FALSE | 6 | PCo: 1310, CBl: 1244, Brk: 311, Sla: 49 |
| bsmt_cond | 0 | 1 | FALSE | 6 | Typ: 2616, Goo: 122, Fai: 104, No_: 80 |
| bsmt_exposure | 0 | 1 | FALSE | 5 | No: 1906, Av: 418, Gd: 284, Mn: 239 |
| bsmt_fin_type_1 | 0 | 1 | FALSE | 7 | GLQ: 859, Unf: 851, ALQ: 429, Rec: 288 |
| bsmt_fin_type_2 | 0 | 1 | FALSE | 7 | Unf: 2499, Rec: 106, LwQ: 89, No_: 81 |
| heating | 0 | 1 | FALSE | 6 | Gas: 2885, Gas: 27, Gra: 9, Wal: 6 |
| heating_qc | 0 | 1 | FALSE | 5 | Exc: 1495, Typ: 864, Goo: 476, Fai: 92 |
| central_air | 0 | 1 | FALSE | 2 | Y: 2734, N: 196 |
| electrical | 0 | 1 | FALSE | 6 | SBr: 2682, Fus: 188, Fus: 50, Fus: 8 |
| functional | 0 | 1 | FALSE | 8 | Typ: 2728, Min: 70, Min: 65, Mod: 35 |
| garage_type | 0 | 1 | FALSE | 7 | Att: 1731, Det: 782, Bui: 186, No_: 157 |
| garage_finish | 0 | 1 | FALSE | 4 | Unf: 1231, RFn: 812, Fin: 728, No_: 159 |
| garage_cond | 0 | 1 | FALSE | 6 | Typ: 2665, No_: 159, Fai: 74, Goo: 15 |
| paved_drive | 0 | 1 | FALSE | 3 | Pav: 2652, Dir: 216, Par: 62 |
| pool_qc | 0 | 1 | FALSE | 5 | No_: 2917, Exc: 4, Goo: 4, Typ: 3 |
| fence | 0 | 1 | FALSE | 5 | No_: 2358, Min: 330, Goo: 118, Goo: 112 |
| misc_feature | 0 | 1 | FALSE | 6 | Non: 2824, She: 95, Gar: 5, Oth: 4 |
| sale_type | 0 | 1 | FALSE | 10 | WD : 2536, New: 239, COD: 87, Con: 26 |
| sale_condition | 0 | 1 | FALSE | 6 | Nor: 2413, Par: 245, Abn: 190, Fam: 46 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lot_frontage | 0 | 1 | 57.65 | 33.50 | 0.00 | 43.00 | 63.00 | 78.00 | 313.00 | ▇▇▁▁▁ |
| lot_area | 0 | 1 | 10147.92 | 7880.02 | 1300.00 | 7440.25 | 9436.50 | 11555.25 | 215245.00 | ▇▁▁▁▁ |
| year_built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.00 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
| year_remod_add | 0 | 1 | 1984.27 | 20.86 | 1950.00 | 1965.00 | 1993.00 | 2004.00 | 2010.00 | ▅▂▂▃▇ |
| mas_vnr_area | 0 | 1 | 101.10 | 178.63 | 0.00 | 0.00 | 0.00 | 162.75 | 1600.00 | ▇▁▁▁▁ |
| bsmt_fin_sf_1 | 0 | 1 | 4.18 | 2.23 | 0.00 | 3.00 | 3.00 | 7.00 | 7.00 | ▃▂▇▁▇ |
| bsmt_fin_sf_2 | 0 | 1 | 49.71 | 169.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1526.00 | ▇▁▁▁▁ |
| bsmt_unf_sf | 0 | 1 | 559.07 | 439.54 | 0.00 | 219.00 | 465.50 | 801.75 | 2336.00 | ▇▅▂▁▁ |
| total_bsmt_sf | 0 | 1 | 1051.26 | 440.97 | 0.00 | 793.00 | 990.00 | 1301.50 | 6110.00 | ▇▃▁▁▁ |
| first_flr_sf | 0 | 1 | 1159.56 | 391.89 | 334.00 | 876.25 | 1084.00 | 1384.00 | 5095.00 | ▇▃▁▁▁ |
| second_flr_sf | 0 | 1 | 335.46 | 428.40 | 0.00 | 0.00 | 0.00 | 703.75 | 2065.00 | ▇▃▂▁▁ |
| gr_liv_area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.00 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
| bsmt_full_bath | 0 | 1 | 0.43 | 0.52 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 | ▇▆▁▁▁ |
| bsmt_half_bath | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| full_bath | 0 | 1 | 1.57 | 0.55 | 0.00 | 1.00 | 2.00 | 2.00 | 4.00 | ▁▇▇▁▁ |
| half_bath | 0 | 1 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | ▇▁▅▁▁ |
| bedroom_abv_gr | 0 | 1 | 2.85 | 0.83 | 0.00 | 2.00 | 3.00 | 3.00 | 8.00 | ▁▇▂▁▁ |
| kitchen_abv_gr | 0 | 1 | 1.04 | 0.21 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▁▇▁▁▁ |
| tot_rms_abv_grd | 0 | 1 | 6.44 | 1.57 | 2.00 | 5.00 | 6.00 | 7.00 | 15.00 | ▁▇▂▁▁ |
| fireplaces | 0 | 1 | 0.60 | 0.65 | 0.00 | 0.00 | 1.00 | 1.00 | 4.00 | ▇▇▁▁▁ |
| garage_cars | 0 | 1 | 1.77 | 0.76 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | ▅▇▂▁▁ |
| garage_area | 0 | 1 | 472.66 | 215.19 | 0.00 | 320.00 | 480.00 | 576.00 | 1488.00 | ▃▇▃▁▁ |
| wood_deck_sf | 0 | 1 | 93.75 | 126.36 | 0.00 | 0.00 | 0.00 | 168.00 | 1424.00 | ▇▁▁▁▁ |
| open_porch_sf | 0 | 1 | 47.53 | 67.48 | 0.00 | 0.00 | 27.00 | 70.00 | 742.00 | ▇▁▁▁▁ |
| enclosed_porch | 0 | 1 | 23.01 | 64.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1012.00 | ▇▁▁▁▁ |
| three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0.00 | 508.00 | ▇▁▁▁▁ |
| screen_porch | 0 | 1 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0.00 | 576.00 | ▇▁▁▁▁ |
| pool_area | 0 | 1 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 | ▇▁▁▁▁ |
| misc_val | 0 | 1 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0.00 | 17000.00 | ▇▁▁▁▁ |
| mo_sold | 0 | 1 | 6.22 | 2.71 | 1.00 | 4.00 | 6.00 | 8.00 | 12.00 | ▅▆▇▃▃ |
| year_sold | 0 | 1 | 2007.79 | 1.32 | 2006.00 | 2007.00 | 2008.00 | 2009.00 | 2010.00 | ▇▇▇▇▃ |
| sale_price | 0 | 1 | 180796.06 | 79886.69 | 12789.00 | 129500.00 | 160000.00 | 213500.00 | 755000.00 | ▇▇▁▁▁ |
| longitude | 0 | 1 | -93.64 | 0.03 | -93.69 | -93.66 | -93.64 | -93.62 | -93.58 | ▅▅▇▆▁ |
| latitude | 0 | 1 | 42.03 | 0.02 | 41.99 | 42.02 | 42.03 | 42.05 | 42.06 | ▂▂▇▇▇ |
total_na <- sum(is.na(ames))
total_na
## [1] 0
# View column names
names(ames)
## [1] "ms_sub_class" "ms_zoning" "lot_frontage"
## [4] "lot_area" "street" "alley"
## [7] "lot_shape" "land_contour" "utilities"
## [10] "lot_config" "land_slope" "neighborhood"
## [13] "condition_1" "condition_2" "bldg_type"
## [16] "house_style" "overall_cond" "year_built"
## [19] "year_remod_add" "roof_style" "roof_matl"
## [22] "exterior_1st" "exterior_2nd" "mas_vnr_type"
## [25] "mas_vnr_area" "exter_cond" "foundation"
## [28] "bsmt_cond" "bsmt_exposure" "bsmt_fin_type_1"
## [31] "bsmt_fin_sf_1" "bsmt_fin_type_2" "bsmt_fin_sf_2"
## [34] "bsmt_unf_sf" "total_bsmt_sf" "heating"
## [37] "heating_qc" "central_air" "electrical"
## [40] "first_flr_sf" "second_flr_sf" "gr_liv_area"
## [43] "bsmt_full_bath" "bsmt_half_bath" "full_bath"
## [46] "half_bath" "bedroom_abv_gr" "kitchen_abv_gr"
## [49] "tot_rms_abv_grd" "functional" "fireplaces"
## [52] "garage_type" "garage_finish" "garage_cars"
## [55] "garage_area" "garage_cond" "paved_drive"
## [58] "wood_deck_sf" "open_porch_sf" "enclosed_porch"
## [61] "three_season_porch" "screen_porch" "pool_area"
## [64] "pool_qc" "fence" "misc_feature"
## [67] "misc_val" "mo_sold" "year_sold"
## [70] "sale_type" "sale_condition" "sale_price"
## [73] "longitude" "latitude"
The Ames Housing dataset contains 2,930 observations and 74 variables, consisting of a mixture of numerical and categorical predictors describing various aspects of residential properties. The response variable of interest in this analysis is sale_price, which represents the final sale price of each home.
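The check above shows that the dataset contains no missing values in any of its 74 variables (sum(is.na(ames)) returns 0), which simplifies the preprocessing stage. Note, however, that absent features appear to be encoded explicitly rather than as NA: for example, alley has the level No_Alley_Access, and lot_frontage has a minimum of 0, which may stand in for an unrecorded value. The presence of both numerical and categorical variables makes this dataset well-suited for multiple regression modeling using dummy encoding for categorical predictors.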
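As a small illustration of what dummy encoding does, the sketch below uses base R's model.matrix() on a toy data frame. The four rows are made up for illustration and are not taken from the Ames data, and the column names produced by step_dummy() in a recipe may be formatted differently from the ones shown here.

```r
# Toy data frame (illustrative values, not rows from the Ames data)
toy <- data.frame(
  central_air = factor(c("Y", "N", "Y", "Y")),
  bldg_type   = factor(c("OneFam", "Duplex", "TwnhsE", "OneFam"))
)

# Treatment (dummy) coding: one indicator column per non-reference level
dummies <- model.matrix(~ central_air + bldg_type, data = toy)

colnames(dummies)
dummies
```

Each factor with k levels contributes k - 1 indicator columns; the omitted level (here "N" and "Duplex", the first levels alphabetically) acts as the reference category against which the coefficients are interpreted.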
Before fitting regression models, it is important to examine the distribution of the response variable, sale_price. Understanding the center, spread, skewness, and potential outliers in the response helps provide context for later model interpretation and diagnostic assessment. In particular, if the response is highly skewed or contains extreme values, this may affect model fit and residual behavior.
summary(ames$sale_price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129500 160000 180796 213500 755000
sd(ames$sale_price)
## [1] 79886.69
IQR(ames$sale_price)
## [1] 84000
Histogram of sale_price
ggplot(ames, aes(x = sale_price)) +
  geom_histogram(bins = 30, fill = "#3B528B", color = "white") +
  labs(
    title = "Distribution of Sale Price",
    x = "Sale Price",
    y = "Count"
  ) +
  scale_x_continuous(labels = label_comma()) +
  theme_minimal()
Boxplot of sale_price
ggplot(ames, aes(y = sale_price)) +
  geom_boxplot(fill = "#3B528B") +
  labs(
    title = "Boxplot of Sale Price",
    y = "Sale Price"
  ) +
  scale_y_continuous(labels = label_comma()) +
  theme_minimal()
The histogram shows that sale prices are right-skewed, with the majority of homes concentrated between approximately $100,000 and $250,000. A smaller number of homes extend into much higher price ranges, producing a long right tail. This pattern is common in real estate data, where a few high-value properties can substantially exceed typical prices.
This skewness is also reflected in the summary statistics. The mean sale price ($180,796) is higher than the median ($160,000), which is consistent with a right-skewed distribution. The wide range of values, from $12,789 to $755,000, and the relatively large standard deviation (≈ $79,887) indicate substantial variability in housing prices.
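As a rough numerical check on the skewness described above, Pearson's second skewness coefficient, 3 × (mean − median) / sd, can be computed directly from the summary statistics already reported. This is only a sketch using one simple coefficient; it is not a statistic computed elsewhere in this analysis, and the variable names are illustrative.

```r
# Summary statistics copied from the output above
mean_price   <- 180796.06
median_price <- 160000
sd_price     <- 79886.69

# Pearson's second skewness coefficient: positive values suggest right skew
pearson_skew <- 3 * (mean_price - median_price) / sd_price
round(pearson_skew, 2)
```

A positive value is consistent with the long right tail seen in the histogram.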
The boxplot further highlights the presence of high-end outliers, with many observations plotted beyond the upper whisker and several extreme values well above $400,000. These observations are not necessarily errors but represent genuinely expensive homes. However, they are important because they may influence regression results, particularly by increasing residual variance at higher predicted values.
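The threshold beyond which a boxplot draws points individually can be worked out from the quartiles reported earlier. The sketch below applies the usual 1.5 × IQR rule (the default used by geom_boxplot()) to those values; the variable names are illustrative.

```r
# Quartiles copied from summary(ames$sale_price)
q1 <- 129500
q3 <- 213500
iqr <- q3 - q1                      # matches IQR(ames$sale_price)

# Points beyond these fences are drawn as individual outliers
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

Because the minimum sale price ($12,789) lies above the lower fence, every point flagged by this rule is on the high side of the distribution.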
For this analysis, sale_price is retained on its original scale to preserve interpretability in dollar units. However, the observed skewness and presence of outliers suggest that the model may exhibit heteroskedasticity, especially for higher-priced homes. This will be examined more carefully in the model diagnostics section.
Neighborhood
ames %>%
ggplot(aes(x = neighborhood, fill = neighborhood)) +
geom_bar() +
scale_fill_viridis_d(option = "D") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Neighborhood", x = "Category", y = "Count")
House Style
ames %>%
ggplot(aes(x = house_style, fill = house_style)) +
geom_bar() +
scale_fill_viridis_d(option = "D") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "House Style", x = "Category", y = "Count")
Building Type
ames %>%
ggplot(aes(x = bldg_type, fill = bldg_type)) +
geom_bar() +
scale_fill_viridis_d(option = "D") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Building Type", x = "Category", y = "Count")
Central Air
ames %>%
ggplot(aes(x = central_air, fill = central_air)) +
geom_bar() +
scale_fill_viridis_d(option = "D") +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Central Air", x = "Category", y = "Count")
ames_num <- ames %>%
  select(where(is.numeric))
sale_price_corr <- ames_num %>%
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("predictor") %>%
  select(predictor, sale_price) %>%
  filter(predictor != "sale_price") %>%
  arrange(desc(abs(sale_price)))
sale_price_corr
## predictor sale_price
## 1 gr_liv_area 0.706779921
## 2 garage_cars 0.647561613
## 3 garage_area 0.640138298
## 4 total_bsmt_sf 0.632528849
## 5 first_flr_sf 0.621676063
## 6 year_built 0.558426106
## 7 full_bath 0.545603901
## 8 year_remod_add 0.532973754
## 9 mas_vnr_area 0.502195977
## 10 tot_rms_abv_grd 0.495474417
## 11 fireplaces 0.474558093
## 12 wood_deck_sf 0.327143174
## 13 open_porch_sf 0.312950506
## 14 latitude 0.290891384
## 15 half_bath 0.285056032
## 16 bsmt_full_bath 0.275822661
## 17 second_flr_sf 0.269373357
## 18 lot_area 0.266549220
## 19 longitude -0.251397253
## 20 lot_frontage 0.201874510
## 21 bsmt_unf_sf 0.183307587
## 22 bedroom_abv_gr 0.143913428
## 23 bsmt_fin_sf_1 -0.134905479
## 24 enclosed_porch -0.128787442
## 25 kitchen_abv_gr -0.119813720
## 26 screen_porch 0.112151214
## 27 pool_area 0.068403247
## 28 bsmt_half_bath -0.035816609
## 29 mo_sold 0.035258842
## 30 three_season_porch 0.032224649
## 31 year_sold -0.030569087
## 32 misc_val -0.015691463
## 33 bsmt_fin_sf_2 0.006017568
Next, I visualize the relationships between sale price and the 15 numeric predictors most strongly correlated with it.
top_numeric_predictors <- sale_price_corr %>%
  slice_head(n = 15) %>%
  pull(predictor)
ames %>%
  select(sale_price, all_of(top_numeric_predictors)) %>%
  pivot_longer(
    cols = -sale_price,
    names_to = "predictor",
    values_to = "value"
  ) %>%
  ggplot(aes(x = value, y = sale_price)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(
    title = "Pairwise Relationships with Sale Price",
    x = "Predictor Value",
    y = "Sale Price"
  ) +
  scale_y_continuous(labels = label_comma()) +
  theme_minimal()
Among the top predictors, garage_cars and garage_area are highly
correlated (r = 0.89), so I will avoid keeping both. In addition,
first_flr_sf is strongly correlated with total_bsmt_sf (r = 0.80) and
moderately correlated with gr_liv_area (r = 0.56), so including it
would add largely overlapping information. Two other pairs worth
noting are tot_rms_abv_grd and gr_liv_area (r = 0.81), and
year_remod_add and year_built (r = 0.61).
ames %>%
select(sale_price, all_of(top_numeric_predictors)) %>%
cor(use = "pairwise.complete.obs") %>%
round(2)
## sale_price gr_liv_area garage_cars garage_area total_bsmt_sf
## sale_price 1.00 0.71 0.65 0.64 0.63
## gr_liv_area 0.71 1.00 0.49 0.48 0.45
## garage_cars 0.65 0.49 1.00 0.89 0.44
## garage_area 0.64 0.48 0.89 1.00 0.49
## total_bsmt_sf 0.63 0.45 0.44 0.49 1.00
## first_flr_sf 0.62 0.56 0.44 0.49 0.80
## year_built 0.56 0.24 0.54 0.48 0.41
## full_bath 0.55 0.63 0.48 0.41 0.33
## year_remod_add 0.53 0.32 0.42 0.38 0.30
## mas_vnr_area 0.50 0.40 0.36 0.37 0.39
## tot_rms_abv_grd 0.50 0.81 0.36 0.33 0.28
## fireplaces 0.47 0.45 0.32 0.29 0.33
## wood_deck_sf 0.33 0.25 0.24 0.24 0.23
## open_porch_sf 0.31 0.34 0.20 0.23 0.25
## latitude 0.29 0.18 0.26 0.21 0.18
## half_bath 0.29 0.43 0.23 0.18 -0.05
## first_flr_sf year_built full_bath year_remod_add mas_vnr_area
## sale_price 0.62 0.56 0.55 0.53 0.50
## gr_liv_area 0.56 0.24 0.63 0.32 0.40
## garage_cars 0.44 0.54 0.48 0.42 0.36
## garage_area 0.49 0.48 0.41 0.38 0.37
## total_bsmt_sf 0.80 0.41 0.33 0.30 0.39
## first_flr_sf 1.00 0.31 0.37 0.24 0.39
## year_built 0.31 1.00 0.47 0.61 0.31
## full_bath 0.37 0.47 1.00 0.46 0.25
## year_remod_add 0.24 0.61 0.46 1.00 0.19
## mas_vnr_area 0.39 0.31 0.25 0.19 1.00
## tot_rms_abv_grd 0.39 0.11 0.53 0.20 0.28
## fireplaces 0.41 0.17 0.23 0.13 0.27
## wood_deck_sf 0.23 0.23 0.18 0.22 0.17
## open_porch_sf 0.24 0.20 0.26 0.24 0.14
## latitude 0.13 0.25 0.21 0.18 0.22
## half_bath -0.10 0.27 0.16 0.21 0.19
## tot_rms_abv_grd fireplaces wood_deck_sf open_porch_sf latitude
## sale_price 0.50 0.47 0.33 0.31 0.29
## gr_liv_area 0.81 0.45 0.25 0.34 0.18
## garage_cars 0.36 0.32 0.24 0.20 0.26
## garage_area 0.33 0.29 0.24 0.23 0.21
## total_bsmt_sf 0.28 0.33 0.23 0.25 0.18
## first_flr_sf 0.39 0.41 0.23 0.24 0.13
## year_built 0.11 0.17 0.23 0.20 0.25
## full_bath 0.53 0.23 0.18 0.26 0.21
## year_remod_add 0.20 0.13 0.22 0.24 0.18
## mas_vnr_area 0.28 0.27 0.17 0.14 0.22
## tot_rms_abv_grd 1.00 0.30 0.15 0.24 0.15
## fireplaces 0.30 1.00 0.23 0.16 0.15
## wood_deck_sf 0.15 0.23 1.00 0.04 0.03
## open_porch_sf 0.24 0.16 0.04 1.00 0.09
## latitude 0.15 0.15 0.03 0.09 1.00
## half_bath 0.35 0.20 0.12 0.18 0.17
## half_bath
## sale_price 0.29
## gr_liv_area 0.43
## garage_cars 0.23
## garage_area 0.18
## total_bsmt_sf -0.05
## first_flr_sf -0.10
## year_built 0.27
## full_bath 0.16
## year_remod_add 0.21
## mas_vnr_area 0.19
## tot_rms_abv_grd 0.35
## fireplaces 0.20
## wood_deck_sf 0.12
## open_porch_sf 0.18
## latitude 0.17
## half_bath 1.00
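Scanning a large correlation matrix by eye is error-prone, so the redundant pairs can also be flagged programmatically. The sketch below uses a hypothetical helper, flag_high_cor(), written for this illustration (it is not part of tidymodels), and applies it to a six-variable subset of the rounded correlations printed above. The 0.8 cutoff is an arbitrary choice for illustration.

```r
# Subset of the rounded correlation matrix printed above
vars <- c("gr_liv_area", "garage_cars", "garage_area",
          "total_bsmt_sf", "first_flr_sf", "tot_rms_abv_grd")
cor_mat <- matrix(c(
  1.00, 0.49, 0.48, 0.45, 0.56, 0.81,
  0.49, 1.00, 0.89, 0.44, 0.44, 0.36,
  0.48, 0.89, 1.00, 0.49, 0.49, 0.33,
  0.45, 0.44, 0.49, 1.00, 0.80, 0.28,
  0.56, 0.44, 0.49, 0.80, 1.00, 0.39,
  0.81, 0.36, 0.33, 0.28, 0.39, 1.00
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: list each pair with |r| at or above the threshold once
flag_high_cor <- function(cor_mat, threshold = 0.8) {
  # upper.tri() keeps one copy of each pair and drops the diagonal
  idx <- which(abs(cor_mat) >= threshold & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    correlation = cor_mat[idx],
    stringsAsFactors = FALSE
  )
}

high_pairs <- flag_high_cor(cor_mat, threshold = 0.8)
high_pairs
```

With this subset and cutoff, the helper returns the garage_cars/garage_area, total_bsmt_sf/first_flr_sf, and gr_liv_area/tot_rms_abv_grd pairs, which are the overlaps that motivated dropping one variable from each pair.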
ames %>%
  select(sale_price, all_of(top_numeric_predictors)) %>%
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column(var = "var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") %>%
  # Tile color encodes the absolute correlation; the text label keeps the sign
  ggplot(aes(x = var1, y = var2, fill = abs(correlation))) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 2)), size = 3) +
  scale_fill_viridis_c(option = "D") +
  labs(
    title = "Correlation Heatmap for Top Numeric Variables",
    fill = "|Correlation|",
    x = NULL,
    y = NULL
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Selected predictors:
- gr_liv_area
- garage_area
- total_bsmt_sf
- year_built
- full_bath
- neighborhood
- house_style
- bldg_type
- central_air
Predictors were narrowed using exploratory pairwise screening. For numeric variables, I examined their correlations with sale price and screened out variables that appeared highly redundant with stronger alternatives. For categorical variables, I examined group differences in sale price using boxplots and retained factors that showed meaningful separation and were easy to interpret in the housing context.
ames_model <- ames %>%
  select(
    sale_price,
    gr_liv_area,
    garage_area,
    total_bsmt_sf,
    year_built,
    full_bath,
    neighborhood,
    house_style,
    bldg_type,
    central_air
  )
# Train/test split
set.seed(2026)
ames_split <- initial_split(ames_model, prop = 0.80)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
# Check sizes
dim(ames_train)
## [1] 2344 10
dim(ames_test)
## [1] 586 10
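The sizes reported above follow directly from the 80/20 proportion. As a sketch of the idea in base R (this is an illustration of a simple random split, not rsample's internal implementation, and the index vectors here will not match the rows chosen by initial_split()):

```r
n_total <- 2930                         # rows in the modeling dataset
n_train <- floor(0.80 * n_total)        # 80% for training
n_test  <- n_total - n_train            # remainder held out for testing

set.seed(2026)
# Sample row indices without replacement for the training set
train_idx <- sample(seq_len(n_total), size = n_train)
test_idx  <- setdiff(seq_len(n_total), train_idx)

c(train = length(train_idx), test = length(test_idx))
```

The key property is that the two index sets do not overlap, so test-set metrics reflect performance on homes the model never saw during fitting.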
Next, I specify the linear regression model, define a preprocessing recipe, and combine the two in a workflow.
# Specify the model
lm_model <- linear_reg() %>%
  set_engine("lm")
# Create the recipe
lm_recipe <- recipe(
  sale_price ~ gr_liv_area + garage_area + total_bsmt_sf +
    year_built + full_bath + neighborhood +
    house_style + bldg_type + central_air,
  data = ames_train
) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)
lm_fit <- lm_workflow %>%
  fit(data = ames_train)
lm_preds <- predict(lm_fit, new_data = ames_test) %>%
bind_cols(ames_test)
head(lm_preds)
## # A tibble: 6 × 11
## .pred sale_price gr_liv_area garage_area total_bsmt_sf year_built full_bath
## <dbl> <int> <int> <dbl> <dbl> <int> <int>
## 1 133691. 105000 896 730 882 1961 1
## 2 190557. 195500 1604 470 926 1998 2
## 3 175442. 180400 1465 393 789 1998 2
## 4 426089. 538000 3279 841 1650 2003 3
## 5 186253. 164000 1752 492 559 1988 2
## 6 137944. 149000 1004 480 1004 1970 1
## # ℹ 4 more variables: neighborhood <fct>, house_style <fct>, bldg_type <fct>,
## # central_air <fct>
# Overall metrics
lm_metrics <- lm_preds %>%
metrics(truth = sale_price, estimate = .pred)
lm_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 33711.
## 2 rsq standard 0.844
## 3 mae standard 22016.
# RMSE
lm_preds %>%
rmse(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 33711.
# R-squared
lm_preds %>%
rsq(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.844
# MAE
lm_preds %>%
mae(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 22016.
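To show what these three metrics measure, the sketch below computes them by hand in base R for the six test rows printed by head(lm_preds) above. Two caveats: the predictions are the rounded values shown in that output, and six rows are far too few to reproduce the full test-set figures, so the numbers will differ from those reported. The R-squared here is the squared correlation between actual and predicted values, which is my understanding of how yardstick's rsq() is defined.

```r
# Actual and (rounded) predicted prices for the first six test rows
truth <- c(105000, 195500, 180400, 538000, 164000, 149000)
pred  <- c(133691, 190557, 175442, 426089, 186253, 137944)

resid <- truth - pred

rmse_val <- sqrt(mean(resid^2))    # penalizes large errors more heavily
mae_val  <- mean(abs(resid))       # average absolute miss in dollars
rsq_val  <- cor(truth, pred)^2     # squared correlation of truth and prediction

round(c(rmse = rmse_val, mae = mae_val, rsq = rsq_val), 3)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, as with the $538,000 home in this small sample.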
# Actual vs Predicted
ggplot(lm_preds, aes(x = sale_price, y = .pred)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Actual vs Predicted Sale Price",
x = "Actual Sale Price",
y = "Predicted Sale Price"
) +
theme_minimal()
# Residual plot
lm_preds <- lm_preds %>%
mutate(residual = sale_price - .pred)
ggplot(lm_preds, aes(x = .pred, y = residual)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Residuals vs Predicted Sale Price",
x = "Predicted Sale Price",
y = "Residuals"
) +
theme_minimal()
# New recipe with polynomial term
poly_recipe <- recipe(
  sale_price ~ garage_area + total_bsmt_sf +
    year_built + full_bath +
    neighborhood + house_style +
    bldg_type + central_air + gr_liv_area,
  data = ames_train
) %>%
  step_poly(gr_liv_area, degree = 2) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
# Workflow
poly_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(poly_recipe)
# Fit model
poly_fit <- poly_workflow %>%
fit(data = ames_train)
# Predictions
poly_preds <- predict(poly_fit, new_data = ames_test) %>%
bind_cols(ames_test)
# Metrics
poly_metrics <- poly_preds %>%
metrics(truth = sale_price, estimate = .pred)
poly_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 34781.
## 2 rsq standard 0.833
## 3 mae standard 22662.
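One detail worth keeping in mind when reading the polynomial model's coefficients: as I understand its documentation, step_poly() builds its columns with stats::poly(), which by default produces orthogonal polynomial terms rather than raw x and x². The sketch below uses simulated data (not the Ames data) to illustrate that the two parameterizations give the same fitted values even though their coefficients differ.

```r
set.seed(1)
# Simulated living areas and prices, for illustration only
x <- runif(200, min = 500, max = 4000)
y <- 20000 + 90 * x - 0.005 * x^2 + rnorm(200, sd = 20000)

fit_orth <- lm(y ~ poly(x, 2))       # orthogonal polynomial basis
fit_raw  <- lm(y ~ x + I(x^2))       # raw x and x^2

# Fitted values agree up to numerical error
max_diff <- max(abs(fitted(fit_orth) - fitted(fit_raw)))
max_diff

# Coefficients differ because the basis columns differ
coef(fit_orth)
coef(fit_raw)
```

The practical consequence is that test-set metrics are unaffected by the choice of basis, but the individual polynomial coefficients should not be read as dollars per square foot or per squared square foot unless a raw basis is used.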
lm_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 33711.
## 2 rsq standard 0.844
## 3 mae standard 22016.
poly_metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 34781.
## 2 rsq standard 0.833
## 3 mae standard 22662.
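The two sets of metrics are easier to compare side by side. The sketch below places the test-set values reported above in one base R data frame and computes the difference (polynomial minus linear); the object name comparison is illustrative.

```r
# Test-set metrics copied from the lm_metrics and poly_metrics output above
comparison <- data.frame(
  metric     = c("rmse", "rsq", "mae"),
  linear     = c(33711, 0.844, 22016),
  polynomial = c(34781, 0.833, 22662)
)

# Positive differences for rmse/mae and a negative one for rsq
# mean the polynomial model did worse on that metric
comparison$difference <- comparison$polynomial - comparison$linear
comparison
```

On this particular split, the polynomial model has a higher RMSE and MAE and a slightly lower R-squared than the linear model.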
# Actual vs Predicted
ggplot(poly_preds, aes(x = sale_price, y = .pred)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Polynomial Model: Actual vs Predicted",
x = "Actual Sale Price",
y = "Predicted Sale Price"
) +
theme_minimal()
# Residuals
poly_preds <- poly_preds %>%
mutate(residual = sale_price - .pred)
ggplot(poly_preds, aes(x = .pred, y = residual)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(
title = "Polynomial Model: Residuals vs Predicted",
x = "Predicted",
y = "Residuals"
) +
theme_minimal()
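On the held-out test set (586 homes), the multiple linear regression model achieved an RMSE of about $33,711, an MAE of about $22,016, and an R-squared of 0.844. The polynomial model, which adds a quadratic term for gr_liv_area, achieved an RMSE of about $34,781, an MAE of about $22,662, and an R-squared of 0.833.
In plain language, the linear model's predictions track actual sale prices closely (R-squared of about 0.84), and a typical prediction misses the actual price by roughly $22,000. Adding the quadratic term did not improve test-set performance on this split: all three metrics were slightly worse.
Based on these results, I would choose the multiple linear regression model, since it predicts at least as well on held-out data and is simpler to interpret. This conclusion rests on a single train/test split, so the small gap between the two models should be read with caution.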