Introduction

Housing prices are influenced by a combination of structural, locational, and amenity-related characteristics. Understanding how these features relate to sale price is an important statistical modeling problem because it involves both explanation and prediction. In this project, I use the Ames Housing dataset to investigate how selected housing features are associated with house sale prices.

The main goal of this project is to determine whether a multiple linear regression model or a polynomial regression model provides a better explanation of the relationship between sale_price and housing characteristics. I focus on building interpretable models using the tidymodels framework in R and evaluating them using held-out test data and regression diagnostics.

The main research questions for this project are:

  • Which housing features are most important in explaining variation in sale price?
  • Does a polynomial regression model improve upon a standard multiple linear regression model?
  • Which model provides the best balance between predictive performance and interpretability?

A working hypothesis is that larger homes, newer homes, and homes located in more desirable neighborhoods will tend to have higher sale prices. In addition, it is plausible that the relationship between living area and sale price is not strictly linear, which motivates the comparison with a polynomial regression model.

Data and Variables

In this section, I load the Ames Housing data, inspect its structure, and create a cleaned modeling dataset. Because the original dataset contains many predictors, the initial exploration focuses on understanding variable types and identifying plausible features for the regression analysis.

# Load packages and data

library(tidymodels)
library(modeldata)
library(tidyverse)
library(janitor)
library(skimr)
library(forcats)

set.seed(42)

data(ames)

ames <- ames %>%
  clean_names()
glimpse(ames)
## Rows: 2,930
## Columns: 74
## $ ms_sub_class       <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ ms_zoning          <fct> Residential_Low_Density, Residential_High_Density, …
## $ lot_frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ lot_area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ lot_shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ land_contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ lot_config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ land_slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ bldg_type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ house_style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ overall_cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ year_built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ year_remod_add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ roof_style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ roof_matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ mas_vnr_type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ mas_vnr_area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ exter_cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ bsmt_cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ bsmt_exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ bsmt_fin_type_1    <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ bsmt_fin_sf_1      <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ bsmt_fin_type_2    <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ bsmt_fin_sf_2      <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ bsmt_unf_sf        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ total_bsmt_sf      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ heating_qc         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ central_air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ first_flr_sf       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ second_flr_sf      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ gr_liv_area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ bsmt_full_bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ bsmt_half_bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ half_bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ bedroom_abv_gr     <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ kitchen_abv_gr     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ tot_rms_abv_grd    <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ garage_type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ garage_finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ garage_cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ garage_area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ garage_cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ paved_drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ wood_deck_sf       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ open_porch_sf      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ enclosed_porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ screen_porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ pool_area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ misc_feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ misc_val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ mo_sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ year_sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ sale_type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ sale_condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ sale_price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
skim(ames)
Data summary
Name ames
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
factor 40
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ms_sub_class 0 1 FALSE 16 One: 1079, Two: 575, One: 287, One: 192
ms_zoning 0 1 FALSE 7 Res: 2273, Res: 462, Flo: 139, Res: 27
street 0 1 FALSE 2 Pav: 2918, Grv: 12
alley 0 1 FALSE 3 No_: 2732, Gra: 120, Pav: 78
lot_shape 0 1 FALSE 4 Reg: 1859, Sli: 979, Mod: 76, Irr: 16
land_contour 0 1 FALSE 4 Lvl: 2633, HLS: 120, Bnk: 117, Low: 60
utilities 0 1 FALSE 3 All: 2927, NoS: 2, NoS: 1
lot_config 0 1 FALSE 5 Ins: 2140, Cor: 511, Cul: 180, FR2: 85
land_slope 0 1 FALSE 3 Gtl: 2789, Mod: 125, Sev: 16
neighborhood 0 1 FALSE 28 Nor: 443, Col: 267, Old: 239, Edw: 194
condition_1 0 1 FALSE 9 Nor: 2522, Fee: 164, Art: 92, RRA: 50
condition_2 0 1 FALSE 8 Nor: 2900, Fee: 13, Art: 5, Pos: 4
bldg_type 0 1 FALSE 5 One: 2425, Twn: 233, Dup: 109, Twn: 101
house_style 0 1 FALSE 8 One: 1481, Two: 873, One: 314, SLv: 128
overall_cond 0 1 FALSE 9 Ave: 1654, Abo: 533, Goo: 390, Ver: 144
roof_style 0 1 FALSE 6 Gab: 2321, Hip: 551, Gam: 22, Fla: 20
roof_matl 0 1 FALSE 8 Com: 2887, Tar: 23, WdS: 9, WdS: 7
exterior_1st 0 1 FALSE 16 Vin: 1026, Met: 450, HdB: 442, Wd : 420
exterior_2nd 0 1 FALSE 17 Vin: 1015, Met: 447, HdB: 406, Wd : 397
mas_vnr_type 0 1 FALSE 5 Non: 1775, Brk: 880, Sto: 249, Brk: 25
exter_cond 0 1 FALSE 5 Typ: 2549, Goo: 299, Fai: 67, Exc: 12
foundation 0 1 FALSE 6 PCo: 1310, CBl: 1244, Brk: 311, Sla: 49
bsmt_cond 0 1 FALSE 6 Typ: 2616, Goo: 122, Fai: 104, No_: 80
bsmt_exposure 0 1 FALSE 5 No: 1906, Av: 418, Gd: 284, Mn: 239
bsmt_fin_type_1 0 1 FALSE 7 GLQ: 859, Unf: 851, ALQ: 429, Rec: 288
bsmt_fin_type_2 0 1 FALSE 7 Unf: 2499, Rec: 106, LwQ: 89, No_: 81
heating 0 1 FALSE 6 Gas: 2885, Gas: 27, Gra: 9, Wal: 6
heating_qc 0 1 FALSE 5 Exc: 1495, Typ: 864, Goo: 476, Fai: 92
central_air 0 1 FALSE 2 Y: 2734, N: 196
electrical 0 1 FALSE 6 SBr: 2682, Fus: 188, Fus: 50, Fus: 8
functional 0 1 FALSE 8 Typ: 2728, Min: 70, Min: 65, Mod: 35
garage_type 0 1 FALSE 7 Att: 1731, Det: 782, Bui: 186, No_: 157
garage_finish 0 1 FALSE 4 Unf: 1231, RFn: 812, Fin: 728, No_: 159
garage_cond 0 1 FALSE 6 Typ: 2665, No_: 159, Fai: 74, Goo: 15
paved_drive 0 1 FALSE 3 Pav: 2652, Dir: 216, Par: 62
pool_qc 0 1 FALSE 5 No_: 2917, Exc: 4, Goo: 4, Typ: 3
fence 0 1 FALSE 5 No_: 2358, Min: 330, Goo: 118, Goo: 112
misc_feature 0 1 FALSE 6 Non: 2824, She: 95, Gar: 5, Oth: 4
sale_type 0 1 FALSE 10 WD : 2536, New: 239, COD: 87, Con: 26
sale_condition 0 1 FALSE 6 Nor: 2413, Par: 245, Abn: 190, Fam: 46

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lot_frontage 0 1 57.65 33.50 0.00 43.00 63.00 78.00 313.00 ▇▇▁▁▁
lot_area 0 1 10147.92 7880.02 1300.00 7440.25 9436.50 11555.25 215245.00 ▇▁▁▁▁
year_built 0 1 1971.36 30.25 1872.00 1954.00 1973.00 2001.00 2010.00 ▁▂▃▆▇
year_remod_add 0 1 1984.27 20.86 1950.00 1965.00 1993.00 2004.00 2010.00 ▅▂▂▃▇
mas_vnr_area 0 1 101.10 178.63 0.00 0.00 0.00 162.75 1600.00 ▇▁▁▁▁
bsmt_fin_sf_1 0 1 4.18 2.23 0.00 3.00 3.00 7.00 7.00 ▃▂▇▁▇
bsmt_fin_sf_2 0 1 49.71 169.14 0.00 0.00 0.00 0.00 1526.00 ▇▁▁▁▁
bsmt_unf_sf 0 1 559.07 439.54 0.00 219.00 465.50 801.75 2336.00 ▇▅▂▁▁
total_bsmt_sf 0 1 1051.26 440.97 0.00 793.00 990.00 1301.50 6110.00 ▇▃▁▁▁
first_flr_sf 0 1 1159.56 391.89 334.00 876.25 1084.00 1384.00 5095.00 ▇▃▁▁▁
second_flr_sf 0 1 335.46 428.40 0.00 0.00 0.00 703.75 2065.00 ▇▃▂▁▁
gr_liv_area 0 1 1499.69 505.51 334.00 1126.00 1442.00 1742.75 5642.00 ▇▇▁▁▁
bsmt_full_bath 0 1 0.43 0.52 0.00 0.00 0.00 1.00 3.00 ▇▆▁▁▁
bsmt_half_bath 0 1 0.06 0.25 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
full_bath 0 1 1.57 0.55 0.00 1.00 2.00 2.00 4.00 ▁▇▇▁▁
half_bath 0 1 0.38 0.50 0.00 0.00 0.00 1.00 2.00 ▇▁▅▁▁
bedroom_abv_gr 0 1 2.85 0.83 0.00 2.00 3.00 3.00 8.00 ▁▇▂▁▁
kitchen_abv_gr 0 1 1.04 0.21 0.00 1.00 1.00 1.00 3.00 ▁▇▁▁▁
tot_rms_abv_grd 0 1 6.44 1.57 2.00 5.00 6.00 7.00 15.00 ▁▇▂▁▁
fireplaces 0 1 0.60 0.65 0.00 0.00 1.00 1.00 4.00 ▇▇▁▁▁
garage_cars 0 1 1.77 0.76 0.00 1.00 2.00 2.00 5.00 ▅▇▂▁▁
garage_area 0 1 472.66 215.19 0.00 320.00 480.00 576.00 1488.00 ▃▇▃▁▁
wood_deck_sf 0 1 93.75 126.36 0.00 0.00 0.00 168.00 1424.00 ▇▁▁▁▁
open_porch_sf 0 1 47.53 67.48 0.00 0.00 27.00 70.00 742.00 ▇▁▁▁▁
enclosed_porch 0 1 23.01 64.14 0.00 0.00 0.00 0.00 1012.00 ▇▁▁▁▁
three_season_porch 0 1 2.59 25.14 0.00 0.00 0.00 0.00 508.00 ▇▁▁▁▁
screen_porch 0 1 16.00 56.09 0.00 0.00 0.00 0.00 576.00 ▇▁▁▁▁
pool_area 0 1 2.24 35.60 0.00 0.00 0.00 0.00 800.00 ▇▁▁▁▁
misc_val 0 1 50.64 566.34 0.00 0.00 0.00 0.00 17000.00 ▇▁▁▁▁
mo_sold 0 1 6.22 2.71 1.00 4.00 6.00 8.00 12.00 ▅▆▇▃▃
year_sold 0 1 2007.79 1.32 2006.00 2007.00 2008.00 2009.00 2010.00 ▇▇▇▇▃
sale_price 0 1 180796.06 79886.69 12789.00 129500.00 160000.00 213500.00 755000.00 ▇▇▁▁▁
longitude 0 1 -93.64 0.03 -93.69 -93.66 -93.64 -93.62 -93.58 ▅▅▇▆▁
latitude 0 1 42.03 0.02 41.99 42.02 42.03 42.05 42.06 ▂▂▇▇▇
total_na <- sum(is.na(ames))
total_na
## [1] 0
# View column names
names(ames)
##  [1] "ms_sub_class"       "ms_zoning"          "lot_frontage"      
##  [4] "lot_area"           "street"             "alley"             
##  [7] "lot_shape"          "land_contour"       "utilities"         
## [10] "lot_config"         "land_slope"         "neighborhood"      
## [13] "condition_1"        "condition_2"        "bldg_type"         
## [16] "house_style"        "overall_cond"       "year_built"        
## [19] "year_remod_add"     "roof_style"         "roof_matl"         
## [22] "exterior_1st"       "exterior_2nd"       "mas_vnr_type"      
## [25] "mas_vnr_area"       "exter_cond"         "foundation"        
## [28] "bsmt_cond"          "bsmt_exposure"      "bsmt_fin_type_1"   
## [31] "bsmt_fin_sf_1"      "bsmt_fin_type_2"    "bsmt_fin_sf_2"     
## [34] "bsmt_unf_sf"        "total_bsmt_sf"      "heating"           
## [37] "heating_qc"         "central_air"        "electrical"        
## [40] "first_flr_sf"       "second_flr_sf"      "gr_liv_area"       
## [43] "bsmt_full_bath"     "bsmt_half_bath"     "full_bath"         
## [46] "half_bath"          "bedroom_abv_gr"     "kitchen_abv_gr"    
## [49] "tot_rms_abv_grd"    "functional"         "fireplaces"        
## [52] "garage_type"        "garage_finish"      "garage_cars"       
## [55] "garage_area"        "garage_cond"        "paved_drive"       
## [58] "wood_deck_sf"       "open_porch_sf"      "enclosed_porch"    
## [61] "three_season_porch" "screen_porch"       "pool_area"         
## [64] "pool_qc"            "fence"              "misc_feature"      
## [67] "misc_val"           "mo_sold"            "year_sold"         
## [70] "sale_type"          "sale_condition"     "sale_price"        
## [73] "longitude"          "latitude"

The Ames Housing dataset contains 2,930 observations and 74 variables, consisting of a mixture of numerical and categorical predictors describing various aspects of residential properties. The response variable of interest in this analysis is sale_price, which represents the final sale price of each home.

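Initial inspection shows that the dataset contains no missing values: sum(is.na(ames)) returns 0, and skim() reports a complete rate of 1 for every variable. Absent features are instead encoded explicitly, either as factor levels such as No_Pool and No_Alley_Access or as zeros in the area variables, which simplifies the preprocessing stage. The presence of both numerical and categorical variables makes this dataset well suited to multiple regression modeling, with dummy encoding for the categorical predictors.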
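To make the idea of dummy encoding concrete, here is a minimal base-R sketch. It uses model.matrix() rather than the recipes steps used later in this report, and the four-row data frame is a hypothetical toy example, not Ames data; only the Y/N levels of central_air are taken from the output above.

```r
# Toy data: hypothetical values for illustration only (not rows from ames).
toy <- data.frame(
  gr_liv_area = c(900, 1600, 1300, 2100),
  central_air = factor(c("Y", "Y", "N", "Y"))
)

# model.matrix() expands the factor into 0/1 indicator columns.
# With the default treatment contrasts, the first level ("N") is the
# reference category and gets no column of its own.
X <- model.matrix(~ gr_liv_area + central_air, data = toy)

colnames(X)
X[, "central_airY"]
```

A factor with k levels becomes k - 1 indicator columns, so each indicator's coefficient is read as a difference from the reference level.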

Exploratory Data Analysis

Exploring the response variable

Before fitting regression models, it is important to examine the distribution of the response variable, sale_price. Understanding the center, spread, skewness, and potential outliers in the response helps provide context for later model interpretation and diagnostic assessment. In particular, if the response is highly skewed or contains extreme values, this may affect model fit and residual behavior.

summary(ames$sale_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000
sd(ames$sale_price)
## [1] 79886.69
IQR(ames$sale_price)
## [1] 84000

Histogram of sale_price

ggplot(ames, aes(x = sale_price)) +
  geom_histogram(bins = 30, fill = "#3B528B", color = "white") +
  labs(
    title = "Distribution of Sale Price",
    x = "Sale Price",
    y = "Count"
  ) +
  scale_x_continuous(labels = label_comma()) +
  theme_minimal()

Boxplot of sale_price

ggplot(ames, aes(y = sale_price)) +
  geom_boxplot(fill = "#3B528B") +
  labs(
    title = "Boxplot of Sale Price",
    y = "Sale Price"
  ) +
  scale_y_continuous(labels = label_comma()) +
  theme_minimal()

The histogram shows that sale prices are right-skewed, with the majority of homes concentrated between approximately $100,000 and $250,000. A smaller number of homes extend into much higher price ranges, producing a long right tail. This pattern is common in real estate data, where a few high-value properties can substantially exceed typical prices.

This skewness is also reflected in the summary statistics. The mean sale price ($180,796) is higher than the median ($160,000), which is consistent with a right-skewed distribution. The wide range of values, from $12,789 to $755,000, and the relatively large standard deviation (≈ $79,887) indicate substantial variability in housing prices.

The boxplot further highlights the presence of high-end outliers, with many observations beyond the upper whisker (above roughly $339,500, the third quartile plus 1.5 times the IQR) and several extreme values well beyond $400,000. These observations are not necessarily errors but represent genuinely expensive homes. However, they are important because they may influence regression results, particularly by increasing residual variance at higher predicted values.
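The cutoff that a boxplot uses to flag outliers can be worked out from the summary statistics reported above. The sketch below applies the conventional 1.5 × IQR rule to those quartiles. The hinges that ggplot2 draws can differ slightly from these quantiles, so the fences should be read as approximate.

```r
# Quartiles of sale_price as reported by summary() above.
q1 <- 129500
q3 <- 213500
iqr <- q3 - q1            # 84000, matching IQR(ames$sale_price)

# Conventional 1.5 * IQR fences.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
# lower = 3500, upper = 339500
```

Because the minimum sale price ($12,789) lies above the lower fence, this rule flags only high-priced homes.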

For this analysis, sale_price is retained on its original scale to preserve interpretability in dollar units. However, the observed skewness and presence of outliers suggest that the model may exhibit heteroskedasticity, especially for higher-priced homes. This will be examined more carefully in the model diagnostics section.

Univariate Exploration: Categorical Predictors

Neighborhood

ames %>%
  ggplot(aes(x = neighborhood, fill = neighborhood)) +
  geom_bar() +
  scale_fill_viridis_d(option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(title = "Neighborhood", x = "Category", y = "Count")

House Style

ames %>%
  ggplot(aes(x = house_style, fill = house_style)) +
  geom_bar() +
  scale_fill_viridis_d(option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(title = "House Style", x = "Category", y = "Count")

Building Type

ames %>%
  ggplot(aes(x = bldg_type, fill = bldg_type)) +
  geom_bar() +
  scale_fill_viridis_d(option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(title = "Building Type", x = "Category", y = "Count")

Central Air

ames %>%
  ggplot(aes(x = central_air, fill = central_air)) +
  geom_bar() +
  scale_fill_viridis_d(option = "D") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Central Air", x = "Category", y = "Count")

Bivariate exploration: numeric predictors vs sale_price

ames_num <- ames %>%
  select(where(is.numeric))

sale_price_corr <- ames_num %>%
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("predictor") %>%
  select(predictor, sale_price) %>%
  filter(predictor != "sale_price") %>%
  arrange(desc(abs(sale_price)))

sale_price_corr
##             predictor   sale_price
## 1         gr_liv_area  0.706779921
## 2         garage_cars  0.647561613
## 3         garage_area  0.640138298
## 4       total_bsmt_sf  0.632528849
## 5        first_flr_sf  0.621676063
## 6          year_built  0.558426106
## 7           full_bath  0.545603901
## 8      year_remod_add  0.532973754
## 9        mas_vnr_area  0.502195977
## 10    tot_rms_abv_grd  0.495474417
## 11         fireplaces  0.474558093
## 12       wood_deck_sf  0.327143174
## 13      open_porch_sf  0.312950506
## 14           latitude  0.290891384
## 15          half_bath  0.285056032
## 16     bsmt_full_bath  0.275822661
## 17      second_flr_sf  0.269373357
## 18           lot_area  0.266549220
## 19          longitude -0.251397253
## 20       lot_frontage  0.201874510
## 21        bsmt_unf_sf  0.183307587
## 22     bedroom_abv_gr  0.143913428
## 23      bsmt_fin_sf_1 -0.134905479
## 24     enclosed_porch -0.128787442
## 25     kitchen_abv_gr -0.119813720
## 26       screen_porch  0.112151214
## 27          pool_area  0.068403247
## 28     bsmt_half_bath -0.035816609
## 29            mo_sold  0.035258842
## 30 three_season_porch  0.032224649
## 31          year_sold -0.030569087
## 32           misc_val -0.015691463
## 33      bsmt_fin_sf_2  0.006017568

I visualize the relationships between sale_price and the 15 numeric predictors most strongly correlated with it.

top_numeric_predictors <- sale_price_corr %>%
  slice_head(n = 15) %>%
  pull(predictor)

ames %>%
  select(sale_price, all_of(top_numeric_predictors)) %>%
  pivot_longer(
    cols = -sale_price,
    names_to = "predictor",
    values_to = "value"
  ) %>%
  ggplot(aes(x = value, y = sale_price)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(
    title = "Pairwise Relationships with Sale Price",
    x = "Predictor Value",
    y = "Sale Price"
  ) +
  scale_y_continuous(labels = label_comma()) +
  theme_minimal()

Among the top predictors, garage_cars and garage_area are highly correlated (r = 0.89), so I avoid keeping both. Similarly, first_flr_sf is strongly related to total_bsmt_sf (r = 0.80) and moderately related to gr_liv_area (r = 0.56), so including all three would introduce overlapping information. Other pairs to note are tot_rms_abv_grd and gr_liv_area (r = 0.81), and year_remod_add and year_built (r = 0.61).

Correlation among top predictors

ames %>%
  select(sale_price, all_of(top_numeric_predictors)) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)
##                 sale_price gr_liv_area garage_cars garage_area total_bsmt_sf
## sale_price            1.00        0.71        0.65        0.64          0.63
## gr_liv_area           0.71        1.00        0.49        0.48          0.45
## garage_cars           0.65        0.49        1.00        0.89          0.44
## garage_area           0.64        0.48        0.89        1.00          0.49
## total_bsmt_sf         0.63        0.45        0.44        0.49          1.00
## first_flr_sf          0.62        0.56        0.44        0.49          0.80
## year_built            0.56        0.24        0.54        0.48          0.41
## full_bath             0.55        0.63        0.48        0.41          0.33
## year_remod_add        0.53        0.32        0.42        0.38          0.30
## mas_vnr_area          0.50        0.40        0.36        0.37          0.39
## tot_rms_abv_grd       0.50        0.81        0.36        0.33          0.28
## fireplaces            0.47        0.45        0.32        0.29          0.33
## wood_deck_sf          0.33        0.25        0.24        0.24          0.23
## open_porch_sf         0.31        0.34        0.20        0.23          0.25
## latitude              0.29        0.18        0.26        0.21          0.18
## half_bath             0.29        0.43        0.23        0.18         -0.05
##                 first_flr_sf year_built full_bath year_remod_add mas_vnr_area
## sale_price              0.62       0.56      0.55           0.53         0.50
## gr_liv_area             0.56       0.24      0.63           0.32         0.40
## garage_cars             0.44       0.54      0.48           0.42         0.36
## garage_area             0.49       0.48      0.41           0.38         0.37
## total_bsmt_sf           0.80       0.41      0.33           0.30         0.39
## first_flr_sf            1.00       0.31      0.37           0.24         0.39
## year_built              0.31       1.00      0.47           0.61         0.31
## full_bath               0.37       0.47      1.00           0.46         0.25
## year_remod_add          0.24       0.61      0.46           1.00         0.19
## mas_vnr_area            0.39       0.31      0.25           0.19         1.00
## tot_rms_abv_grd         0.39       0.11      0.53           0.20         0.28
## fireplaces              0.41       0.17      0.23           0.13         0.27
## wood_deck_sf            0.23       0.23      0.18           0.22         0.17
## open_porch_sf           0.24       0.20      0.26           0.24         0.14
## latitude                0.13       0.25      0.21           0.18         0.22
## half_bath              -0.10       0.27      0.16           0.21         0.19
##                 tot_rms_abv_grd fireplaces wood_deck_sf open_porch_sf latitude
## sale_price                 0.50       0.47         0.33          0.31     0.29
## gr_liv_area                0.81       0.45         0.25          0.34     0.18
## garage_cars                0.36       0.32         0.24          0.20     0.26
## garage_area                0.33       0.29         0.24          0.23     0.21
## total_bsmt_sf              0.28       0.33         0.23          0.25     0.18
## first_flr_sf               0.39       0.41         0.23          0.24     0.13
## year_built                 0.11       0.17         0.23          0.20     0.25
## full_bath                  0.53       0.23         0.18          0.26     0.21
## year_remod_add             0.20       0.13         0.22          0.24     0.18
## mas_vnr_area               0.28       0.27         0.17          0.14     0.22
## tot_rms_abv_grd            1.00       0.30         0.15          0.24     0.15
## fireplaces                 0.30       1.00         0.23          0.16     0.15
## wood_deck_sf               0.15       0.23         1.00          0.04     0.03
## open_porch_sf              0.24       0.16         0.04          1.00     0.09
## latitude                   0.15       0.15         0.03          0.09     1.00
## half_bath                  0.35       0.20         0.12          0.18     0.17
##                 half_bath
## sale_price           0.29
## gr_liv_area          0.43
## garage_cars          0.23
## garage_area          0.18
## total_bsmt_sf       -0.05
## first_flr_sf        -0.10
## year_built           0.27
## full_bath            0.16
## year_remod_add       0.21
## mas_vnr_area         0.19
## tot_rms_abv_grd      0.35
## fireplaces           0.20
## wood_deck_sf         0.12
## open_porch_sf        0.18
## latitude             0.17
## half_bath            1.00
ames %>%
  select(sale_price, all_of(top_numeric_predictors)) %>%
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column(var = "var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") %>%
  ggplot(aes(x = var1, y = var2, fill = abs(correlation))) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 2)), size = 3) +
  scale_fill_viridis_c(option = "D") +
  labs(
    title = "Correlation Heatmap for Top Numeric Variables",
    fill = "Absolute correlation",
    x = NULL,
    y = NULL
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
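The redundancy screening can also be expressed as a small rule of thumb. The sketch below is a minimal base-R illustration, not the procedure actually used to choose predictors. It copies a five-variable block of the rounded correlation matrix printed above and lists each pair whose absolute correlation meets a cutoff. The 0.80 cutoff is an assumption made for illustration.

```r
# Correlations copied from the rounded matrix printed above.
vars <- c("gr_liv_area", "garage_cars", "garage_area",
          "total_bsmt_sf", "first_flr_sf")
r <- matrix(c(
  1.00, 0.49, 0.48, 0.45, 0.56,
  0.49, 1.00, 0.89, 0.44, 0.44,
  0.48, 0.89, 1.00, 0.49, 0.49,
  0.45, 0.44, 0.49, 1.00, 0.80,
  0.56, 0.44, 0.49, 0.80, 1.00
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.80  # hypothetical screening threshold

# Look only above the diagonal so each pair is reported once.
idx <- which(abs(r) >= cutoff & upper.tri(r), arr.ind = TRUE)
redundant <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
redundant
```

At this cutoff the sketch flags garage_cars with garage_area and total_bsmt_sf with first_flr_sf. The pair tot_rms_abv_grd and gr_liv_area (0.81) would also be flagged if those columns were included in the block.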

Numeric Candidates

  • gr_liv_area
  • garage_area
  • total_bsmt_sf
  • year_built
  • full_bath

Categorical Candidates

  • neighborhood
  • house_style
  • bldg_type
  • central_air

Predictors were narrowed using exploratory pairwise screening. For numeric variables, I examined their correlations with sale price and screened out variables that appeared highly redundant with stronger alternatives. For categorical variables, I examined group differences in sale price using boxplots and retained factors that showed meaningful separation and were easy to interpret in the housing context.
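As a rough sketch of the categorical screening step, the base-R code below compares the median response across the levels of one factor, using the same five-number summaries that a boxplot draws. The data frame is a hypothetical toy example with invented prices, not rows from ames. Only the Y/N levels of central_air come from the output above.

```r
# Hypothetical toy data for illustration only (not rows from ames).
toy <- data.frame(
  sale_price  = c(95000, 110000, 120000, 150000, 180000, 210000, 260000),
  central_air = factor(c("N", "N", "N", "Y", "Y", "Y", "Y"))
)

# boxplot(plot = FALSE) returns the five-number summaries without drawing;
# row 3 of $stats holds the group medians.
bp <- boxplot(sale_price ~ central_air, data = toy, plot = FALSE)
group_medians <- setNames(bp$stats[3, ], bp$names)
group_medians
```

A large gap between group medians relative to the spread within groups is the kind of separation that would support keeping a factor as a candidate predictor.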

Modeling Dataset and Train/Test Split

ames_model <- ames %>%
  select(
    sale_price,
    gr_liv_area,
    garage_area,
    total_bsmt_sf,
    year_built,
    full_bath,
    neighborhood,
    house_style,
    bldg_type,
    central_air
  )

Train/Test split

# Train/test split
set.seed(2026)

ames_split <- initial_split(ames_model, prop = 0.80)

ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Check sizes
dim(ames_train)
## [1] 2344   10
dim(ames_test)
## [1] 586  10
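The split sizes follow directly from the 80% proportion. The base-R sketch below illustrates a simple random split of the same size. It is not a reimplementation of initial_split(), and the rows it selects are not guaranteed to match those chosen above.

```r
n <- 2930          # rows in ames, as reported by glimpse()
prop <- 0.80

set.seed(2026)     # same seed value as above; selected rows may still differ
train_idx <- sample(seq_len(n), size = floor(prop * n))
test_idx  <- setdiff(seq_len(n), train_idx)

c(train = length(train_idx), test = length(test_idx))
# train = 2344, test = 586
```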

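Multiple Linear Regression Model

I specify a linear regression model with the lm engine, define a preprocessing recipe, and combine the two in a workflow.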

# Specify the model
lm_model <- linear_reg() %>%
  set_engine("lm")
# Create the recipe
lm_recipe <- recipe(
  sale_price ~ gr_liv_area + garage_area + total_bsmt_sf +
    year_built + full_bath + neighborhood +
    house_style + bldg_type + central_air,
  data = ames_train
) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)

lm_fit <- lm_workflow %>%
  fit(data = ames_train)

lm_preds <- predict(lm_fit, new_data = ames_test) %>%
  bind_cols(ames_test)

head(lm_preds)
## # A tibble: 6 × 11
##     .pred sale_price gr_liv_area garage_area total_bsmt_sf year_built full_bath
##     <dbl>      <int>       <int>       <dbl>         <dbl>      <int>     <int>
## 1 133691.     105000         896         730           882       1961         1
## 2 190557.     195500        1604         470           926       1998         2
## 3 175442.     180400        1465         393           789       1998         2
## 4 426089.     538000        3279         841          1650       2003         3
## 5 186253.     164000        1752         492           559       1988         2
## 6 137944.     149000        1004         480          1004       1970         1
## # ℹ 4 more variables: neighborhood <fct>, house_style <fct>, bldg_type <fct>,
## #   central_air <fct>
# Overall metrics
lm_metrics <- lm_preds %>%
  metrics(truth = sale_price, estimate = .pred)

lm_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   33711.   
## 2 rsq     standard       0.844
## 3 mae     standard   22016.
# RMSE
lm_preds %>%
  rmse(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      33711.
# R-squared
lm_preds %>%
  rsq(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.844
# MAE
lm_preds %>%
  mae(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard      22016.
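To show what the three metrics measure, the sketch below computes them by hand in base R for the six rows printed by head(lm_preds). The results will not match the full test-set values, because only six rounded predictions are used. To my understanding, yardstick's rsq() reports the squared correlation between truth and estimate, while rsq_trad() gives the 1 - SSE/SST form. The sketch follows the squared-correlation definition.

```r
# First six test-set rows shown by head(lm_preds) above
# (predictions are rounded as printed).
pred   <- c(133691, 190557, 175442, 426089, 186253, 137944)
actual <- c(105000, 195500, 180400, 538000, 164000, 149000)

err <- actual - pred

rmse_val <- sqrt(mean(err^2))      # penalizes large errors more heavily
mae_val  <- mean(abs(err))         # average absolute error in dollars
rsq_val  <- cor(actual, pred)^2    # squared correlation

c(rmse = rmse_val, mae = mae_val, rsq = rsq_val)
```

RMSE is never smaller than MAE, and the gap between them widens when a few errors are much larger than the rest, as with the $538,000 home in this sample.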

Actual vs Predicted Plot

# Actual vs Predicted
ggplot(lm_preds, aes(x = sale_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Actual vs Predicted Sale Price",
    x = "Actual Sale Price",
    y = "Predicted Sale Price"
  ) +
  theme_minimal()

Residual Plot

# Residual plot
lm_preds <- lm_preds %>%
  mutate(residual = sale_price - .pred)

ggplot(lm_preds, aes(x = .pred, y = residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals vs Predicted Sale Price",
    x = "Predicted Sale Price",
    y = "Residuals"
  ) +
  theme_minimal()

Polynomial Regression Model

# New recipe with polynomial term
poly_recipe <- recipe(
  sale_price ~ garage_area + total_bsmt_sf +
    year_built + full_bath +
    neighborhood + house_style +
    bldg_type + central_air + gr_liv_area,
  data = ames_train
) %>%
  step_poly(gr_liv_area, degree = 2) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

# Workflow
poly_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(poly_recipe)

# Fit model
poly_fit <- poly_workflow %>%
  fit(data = ames_train)
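The polynomial step needs a word of explanation. As I understand it, step_poly() builds on stats::poly() and therefore produces orthogonal polynomial columns by default, not raw powers of gr_liv_area. The base-R sketch below uses synthetic data (not the Ames data) to illustrate that the two bases give the same fitted values but different coefficients. This means the degree-2 coefficients from the orthogonal basis should not be read as "dollars per square foot squared".

```r
set.seed(1)

# Synthetic data for illustration only: living area in thousands of sq ft
# and a price with a mild quadratic component plus noise.
x <- runif(200, min = 0.5, max = 4)
y <- 50 + 80 * x + 10 * x^2 + rnorm(200, sd = 15)

# Orthogonal basis (the default of stats::poly) versus raw powers.
fit_orth <- lm(y ~ poly(x, degree = 2))
fit_raw  <- lm(y ~ poly(x, degree = 2, raw = TRUE))

# Same column space, so the fitted values agree up to rounding error ...
all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw)))

# ... but the coefficients differ, because the basis columns differ.
coef(fit_orth)
coef(fit_raw)

# The orthogonal columns are uncorrelated with each other.
round(cor(poly(x, degree = 2))[1, 2], 10)
```

Predictions and test-set metrics are unaffected by the choice of basis. Only the interpretation of individual coefficients changes.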

Evaluate Polynomial Model

# Predictions
poly_preds <- predict(poly_fit, new_data = ames_test) %>%
  bind_cols(ames_test)

# Metrics
poly_metrics <- poly_preds %>%
  metrics(truth = sale_price, estimate = .pred)

poly_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   34781.   
## 2 rsq     standard       0.833
## 3 mae     standard   22662.

Model Comparison

lm_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   33711.   
## 2 rsq     standard       0.844
## 3 mae     standard   22016.
poly_metrics
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard   34781.   
## 2 rsq     standard       0.833
## 3 mae     standard   22662.
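The two sets of metrics are easier to compare side by side. The sketch below is one possible way to tabulate them in base R, with the values copied by hand from the output above instead of pulled from lm_metrics and poly_metrics.

```r
# Test-set metrics copied from the output above.
comparison <- data.frame(
  metric     = c("rmse", "rsq", "mae"),
  linear     = c(33711, 0.844, 22016),
  polynomial = c(34781, 0.833, 22662)
)

# Positive differences in rmse/mae mean the polynomial model is worse;
# a negative difference in rsq means the same.
comparison$difference <- comparison$polynomial - comparison$linear
comparison
```

On this split the polynomial model's RMSE is $1,070 higher and its MAE is $646 higher, with an R-squared that is 0.011 lower, so the quadratic term does not improve held-out performance here.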

Plot Diagnostics

# Actual vs Predicted
ggplot(poly_preds, aes(x = sale_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Polynomial Model: Actual vs Predicted",
    x = "Actual Sale Price",
    y = "Predicted Sale Price"
  ) +
  theme_minimal()

# Residuals
poly_preds <- poly_preds %>%
  mutate(residual = sale_price - .pred)

ggplot(poly_preds, aes(x = .pred, y = residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Polynomial Model: Residuals vs Predicted",
    x = "Predicted",
    y = "Residuals"
  ) +
  theme_minimal()
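A residual plot can be complemented by a numerical check for heteroskedasticity. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic in base R. It is an illustration on synthetic data in which the error spread is built to grow with the predictor. It is not a result for the Ames models, and the variable names are hypothetical.

```r
set.seed(1)

# Synthetic data for illustration only: the error spread grows with x,
# mimicking larger residuals for more expensive homes.
n <- 500
x <- runif(n, min = 0.5, max = 4)
y <- 50 + 100 * x + rnorm(n, sd = 20 * x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor.
e2  <- resid(fit)^2
aux <- lm(e2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the auxiliary
# regression, compared with a chi-square with 1 df (one predictor).
bp_stat <- n * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = p_value)
```

A small p-value suggests that the residual variance changes with the predictor, which would call for robust standard errors or a transformed response. With several predictors, the auxiliary regression would include all of them and the degrees of freedom would equal their number.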

Final Model Evaluation

Report test-set performance and diagnostic plots.

Interpretation and Discussion

Summarize major findings and what they mean in plain language.

Conclusion

State what you learned and what model you would choose.