Real Estate Analytics and House Price Prediction Using R

Author

Dharani

Introduction

This project analyzes the Ames housing data to identify the factors that most influence house prices and builds a linear regression model in R to predict sale price.

Load Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(corrplot)
corrplot 0.95 loaded
library(modeldata)

Load Dataset

data(ames, package = "modeldata")

glimpse(ames)
Rows: 2,930
Columns: 74
$ MS_SubClass        <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
$ MS_Zoning          <fct> Residential_Low_Density, Residential_High_Density, …
$ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
$ Lot_Area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
$ Street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
$ Alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
$ Lot_Shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
$ Land_Contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
$ Utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
$ Lot_Config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
$ Land_Slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
$ Neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
$ Condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
$ Condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
$ Bldg_Type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
$ House_Style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
$ Overall_Cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
$ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
$ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
$ Roof_Style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
$ Roof_Matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
$ Exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
$ Exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
$ Mas_Vnr_Type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
$ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
$ Exter_Cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
$ Foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
$ Bsmt_Cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
$ Bsmt_Exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
$ BsmtFin_Type_1     <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
$ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
$ BsmtFin_Type_2     <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
$ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
$ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
$ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
$ Heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
$ Heating_QC         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
$ Central_Air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
$ Electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
$ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
$ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
$ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
$ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
$ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
$ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
$ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
$ Functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
$ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
$ Garage_Type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
$ Garage_Finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
$ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
$ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
$ Garage_Cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
$ Paved_Drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
$ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
$ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
$ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
$ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Pool_QC            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
$ Fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
$ Misc_Feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
$ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
$ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
$ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
$ Sale_Type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
$ Sale_Condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
$ Sale_Price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
$ Longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
$ Latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
summary(ames)
                               MS_SubClass  
 One_Story_1946_and_Newer_All_Styles :1079  
 Two_Story_1946_and_Newer            : 575  
 One_and_Half_Story_Finished_All_Ages: 287  
 One_Story_PUD_1946_and_Newer        : 192  
 One_Story_1945_and_Older            : 139  
 Two_Story_PUD_1946_and_Newer        : 129  
 (Other)                             : 529  
                        MS_Zoning     Lot_Frontage       Lot_Area     
 Floating_Village_Residential: 139   Min.   :  0.00   Min.   :  1300  
 Residential_High_Density    :  27   1st Qu.: 43.00   1st Qu.:  7440  
 Residential_Low_Density     :2273   Median : 63.00   Median :  9436  
 Residential_Medium_Density  : 462   Mean   : 57.65   Mean   : 10148  
 A_agr                       :   2   3rd Qu.: 78.00   3rd Qu.: 11555  
 C_all                       :  25   Max.   :313.00   Max.   :215245  
 I_all                       :   2                                    
  Street                 Alley                     Lot_Shape    Land_Contour
 Grvl:  12   Gravel         : 120   Regular             :1859   Bnk: 117    
 Pave:2918   No_Alley_Access:2732   Slightly_Irregular  : 979   HLS: 120    
             Paved          :  78   Moderately_Irregular:  76   Low:  60    
                                    Irregular           :  16   Lvl:2633    
                                                                            
                                                                            
                                                                            
  Utilities      Lot_Config   Land_Slope             Neighborhood 
 AllPub:2927   Corner : 511   Gtl:2789   North_Ames        : 443  
 NoSeWa:   1   CulDSac: 180   Mod: 125   College_Creek     : 267  
 NoSewr:   2   FR2    :  85   Sev:  16   Old_Town          : 239  
               FR3    :  14              Edwards           : 194  
               Inside :2140              Somerset          : 182  
                                         Northridge_Heights: 166  
                                         (Other)           :1439  
  Condition_1    Condition_2      Bldg_Type              House_Style  
 Norm   :2522   Norm   :2900   OneFam  :2425   One_Story       :1481  
 Feedr  : 164   Feedr  :  13   TwoFmCon:  62   Two_Story       : 873  
 Artery :  92   Artery :   5   Duplex  : 109   One_and_Half_Fin: 314  
 RRAn   :  50   PosA   :   4   Twnhs   : 101   SLvl            : 128  
 PosN   :  39   PosN   :   4   TwnhsE  : 233   SFoyer          :  83  
 RRAe   :  28   RRNn   :   2                   Two_and_Half_Unf:  24  
 (Other):  35   (Other):   2                   (Other)         :  27  
        Overall_Cond    Year_Built   Year_Remod_Add   Roof_Style  
 Average      :1654   Min.   :1872   Min.   :1950   Flat   :  20  
 Above_Average: 533   1st Qu.:1954   1st Qu.:1965   Gable  :2321  
 Good         : 390   Median :1973   Median :1993   Gambrel:  22  
 Very_Good    : 144   Mean   :1971   Mean   :1984   Hip    : 551  
 Below_Average: 101   3rd Qu.:2001   3rd Qu.:2004   Mansard:  11  
 Fair         :  50   Max.   :2010   Max.   :2010   Shed   :   5  
 (Other)      :  58                                               
   Roof_Matl     Exterior_1st   Exterior_2nd   Mas_Vnr_Type   Mas_Vnr_Area   
 CompShg:2887   VinylSd:1026   VinylSd:1015   BrkCmn :  25   Min.   :   0.0  
 Tar&Grv:  23   MetalSd: 450   MetalSd: 447   BrkFace: 880   1st Qu.:   0.0  
 WdShake:   9   HdBoard: 442   HdBoard: 406   CBlock :   1   Median :   0.0  
 WdShngl:   7   Wd Sdng: 420   Wd Sdng: 397   None   :1775   Mean   : 101.1  
 ClyTile:   1   Plywood: 221   Plywood: 274   Stone  : 249   3rd Qu.: 162.8  
 Membran:   1   CemntBd: 126   CmentBd: 126                  Max.   :1600.0  
 (Other):   2   (Other): 245   (Other): 265                                  
     Exter_Cond    Foundation         Bsmt_Cond        Bsmt_Exposure 
 Excellent:  12   BrkTil: 311   Excellent  :   3   Av         : 418  
 Fair     :  67   CBlock:1244   Fair       : 104   Gd         : 284  
 Good     : 299   PConc :1310   Good       : 122   Mn         : 239  
 Poor     :   3   Slab  :  49   No_Basement:  80   No         :1906  
 Typical  :2549   Stone :  11   Poor       :   5   No_Basement:  83  
                  Wood  :   5   Typical    :2616                     
                                                                     
     BsmtFin_Type_1  BsmtFin_SF_1       BsmtFin_Type_2  BsmtFin_SF_2    
 ALQ        :429    Min.   :0.000   ALQ        :  53   Min.   :   0.00  
 BLQ        :269    1st Qu.:3.000   BLQ        :  68   1st Qu.:   0.00  
 GLQ        :859    Median :3.000   GLQ        :  34   Median :   0.00  
 LwQ        :154    Mean   :4.177   LwQ        :  89   Mean   :  49.71  
 No_Basement: 80    3rd Qu.:7.000   No_Basement:  81   3rd Qu.:   0.00  
 Rec        :288    Max.   :7.000   Rec        : 106   Max.   :1526.00  
 Unf        :851                    Unf        :2499                    
  Bsmt_Unf_SF     Total_Bsmt_SF   Heating         Heating_QC   Central_Air
 Min.   :   0.0   Min.   :   0   Floor:   1   Excellent:1495   N: 196     
 1st Qu.: 219.0   1st Qu.: 793   GasA :2885   Fair     :  92   Y:2734     
 Median : 465.5   Median : 990   GasW :  27   Good     : 476              
 Mean   : 559.1   Mean   :1051   Grav :   9   Poor     :   3              
 3rd Qu.: 801.8   3rd Qu.:1302   OthW :   2   Typical  : 864              
 Max.   :2336.0   Max.   :6110   Wall :   6                               
                                                                          
   Electrical    First_Flr_SF    Second_Flr_SF     Gr_Liv_Area  
 FuseA  : 188   Min.   : 334.0   Min.   :   0.0   Min.   : 334  
 FuseF  :  50   1st Qu.: 876.2   1st Qu.:   0.0   1st Qu.:1126  
 FuseP  :   8   Median :1084.0   Median :   0.0   Median :1442  
 Mix    :   1   Mean   :1159.6   Mean   : 335.5   Mean   :1500  
 SBrkr  :2682   3rd Qu.:1384.0   3rd Qu.: 703.8   3rd Qu.:1743  
 Unknown:   1   Max.   :5095.0   Max.   :2065.0   Max.   :5642  
                                                                
 Bsmt_Full_Bath   Bsmt_Half_Bath      Full_Bath       Half_Bath     
 Min.   :0.0000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000  
 Median :0.0000   Median :0.00000   Median :2.000   Median :0.0000  
 Mean   :0.4311   Mean   :0.06109   Mean   :1.567   Mean   :0.3795  
 3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :3.0000   Max.   :2.00000   Max.   :4.000   Max.   :2.0000  
                                                                    
 Bedroom_AbvGr   Kitchen_AbvGr   TotRms_AbvGrd      Functional  
 Min.   :0.000   Min.   :0.000   Min.   : 2.000   Typ    :2728  
 1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000   Min2   :  70  
 Median :3.000   Median :1.000   Median : 6.000   Min1   :  65  
 Mean   :2.854   Mean   :1.044   Mean   : 6.443   Mod    :  35  
 3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000   Maj1   :  19  
 Max.   :8.000   Max.   :3.000   Max.   :15.000   Maj2   :   9  
                                                  (Other):   4  
   Fireplaces                  Garage_Type     Garage_Finish   Garage_Cars   
 Min.   :0.0000   Attchd             :1731   Fin      : 728   Min.   :0.000  
 1st Qu.:0.0000   Basment            :  36   No_Garage: 159   1st Qu.:1.000  
 Median :1.0000   BuiltIn            : 186   RFn      : 812   Median :2.000  
 Mean   :0.5993   CarPort            :  15   Unf      :1231   Mean   :1.766  
 3rd Qu.:1.0000   Detchd             : 782                    3rd Qu.:2.000  
 Max.   :4.0000   More_Than_Two_Types:  23                    Max.   :5.000  
                  No_Garage          : 157                                   
  Garage_Area        Garage_Cond             Paved_Drive    Wood_Deck_SF    
 Min.   :   0.0   Excellent:   3   Dirt_Gravel     : 216   Min.   :   0.00  
 1st Qu.: 320.0   Fair     :  74   Partial_Pavement:  62   1st Qu.:   0.00  
 Median : 480.0   Good     :  15   Paved           :2652   Median :   0.00  
 Mean   : 472.7   No_Garage: 159                           Mean   :  93.75  
 3rd Qu.: 576.0   Poor     :  14                           3rd Qu.: 168.00  
 Max.   :1488.0   Typical  :2665                           Max.   :1424.00  
                                                                            
 Open_Porch_SF    Enclosed_Porch    Three_season_porch  Screen_Porch
 Min.   :  0.00   Min.   :   0.00   Min.   :  0.000    Min.   :  0  
 1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:  0.000    1st Qu.:  0  
 Median : 27.00   Median :   0.00   Median :  0.000    Median :  0  
 Mean   : 47.53   Mean   :  23.01   Mean   :  2.592    Mean   : 16  
 3rd Qu.: 70.00   3rd Qu.:   0.00   3rd Qu.:  0.000    3rd Qu.:  0  
 Max.   :742.00   Max.   :1012.00   Max.   :508.000    Max.   :576  
                                                                    
   Pool_Area            Pool_QC                   Fence      Misc_Feature
 Min.   :  0.000   Excellent:   4   Good_Privacy     : 118   Elev:   1   
 1st Qu.:  0.000   Fair     :   2   Good_Wood        : 112   Gar2:   5   
 Median :  0.000   Good     :   4   Minimum_Privacy  : 330   None:2824   
 Mean   :  2.243   No_Pool  :2917   Minimum_Wood_Wire:  12   Othr:   4   
 3rd Qu.:  0.000   Typical  :   3   No_Fence         :2358   Shed:  95   
 Max.   :800.000                                             TenC:   1   
                                                                         
    Misc_Val           Mo_Sold         Year_Sold      Sale_Type   
 Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :2536  
 1st Qu.:    0.00   1st Qu.: 4.000   1st Qu.:2007   New    : 239  
 Median :    0.00   Median : 6.000   Median :2008   COD    :  87  
 Mean   :   50.64   Mean   : 6.216   Mean   :2008   ConLD  :  26  
 3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   CWD    :  12  
 Max.   :17000.00   Max.   :12.000   Max.   :2010   ConLI  :   9  
                                                    (Other):  21  
 Sale_Condition   Sale_Price       Longitude         Latitude    
 Abnorml: 190   Min.   : 12789   Min.   :-93.69   Min.   :41.99  
 AdjLand:  12   1st Qu.:129500   1st Qu.:-93.66   1st Qu.:42.02  
 Alloca :  24   Median :160000   Median :-93.64   Median :42.03  
 Family :  46   Mean   :180796   Mean   :-93.64   Mean   :42.03  
 Normal :2413   3rd Qu.:213500   3rd Qu.:-93.62   3rd Qu.:42.05  
 Partial: 245   Max.   :755000   Max.   :-93.58   Max.   :42.06  
                                                                 

Data Cleaning

The dataset was prepared for analysis by standardizing column names to snake_case with janitor's clean_names() and adding a log-transformed sale price (log_sale_price) to serve as the modeling target.

ames_clean <- ames |>
  clean_names() |>
  mutate(log_sale_price = log(sale_price))

glimpse(ames_clean)
Rows: 2,930
Columns: 75
$ ms_sub_class       <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
$ ms_zoning          <fct> Residential_Low_Density, Residential_High_Density, …
$ lot_frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
$ lot_area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
$ street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
$ alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
$ lot_shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
$ land_contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
$ utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
$ lot_config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
$ land_slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
$ neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
$ condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
$ condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
$ bldg_type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
$ house_style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
$ overall_cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
$ year_built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
$ year_remod_add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
$ roof_style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
$ roof_matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
$ exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
$ exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
$ mas_vnr_type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
$ mas_vnr_area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
$ exter_cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
$ foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
$ bsmt_cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
$ bsmt_exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
$ bsmt_fin_type_1    <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
$ bsmt_fin_sf_1      <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
$ bsmt_fin_type_2    <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
$ bsmt_fin_sf_2      <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
$ bsmt_unf_sf        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
$ total_bsmt_sf      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
$ heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
$ heating_qc         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
$ central_air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
$ electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
$ first_flr_sf       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
$ second_flr_sf      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
$ gr_liv_area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
$ bsmt_full_bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
$ bsmt_half_bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ full_bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
$ half_bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ bedroom_abv_gr     <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
$ kitchen_abv_gr     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ tot_rms_abv_grd    <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
$ functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
$ fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
$ garage_type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
$ garage_finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
$ garage_cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
$ garage_area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
$ garage_cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
$ paved_drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
$ wood_deck_sf       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
$ open_porch_sf      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
$ enclosed_porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ screen_porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
$ pool_area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pool_qc            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
$ fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
$ misc_feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
$ misc_val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
$ mo_sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
$ year_sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
$ sale_type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
$ sale_condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
$ sale_price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
$ longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
$ latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
$ log_sale_price     <dbl> 12.27839, 11.56172, 12.05525, 12.40492, 12.15425, 1…
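
The log transform matters because sale prices are right-skewed: the summary above reports a mean of 180,796 against a median of 160,000. A minimal base R sketch, using three prices taken from that summary rather than the full dataset, shows how the transform compresses the upper tail and how exp() maps values back to dollars:

```r
# Three sale prices from the summary above: minimum, median, maximum
prices <- c(12789, 160000, 755000)

# Natural log, as in mutate(log_sale_price = log(sale_price))
log_prices <- log(prices)

# On the raw scale the maximum is about 4.7 times the median;
# on the log scale that becomes an additive gap of about 1.55
ratio_raw <- prices[3] / prices[2]
gap_log <- log_prices[3] - log_prices[2]

# exp() reverses the transform, so predictions made on the log scale
# can be converted back to dollars
recovered <- exp(log_prices)

print(round(log_prices, 3))
print(round(c(ratio_raw = ratio_raw, gap_log = gap_log), 3))
```

A multiplicative difference in price becomes an additive difference in log price, which is what makes a linear model on log_sale_price a reasonable choice.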

Missing Values Analysis

ames_clean |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(cols = everything(),
               names_to = "column",
               values_to = "missing_values") |>
  arrange(desc(missing_values)) |>
  head(15)
# A tibble: 15 × 2
   column       missing_values
   <chr>                 <int>
 1 ms_sub_class              0
 2 ms_zoning                 0
 3 lot_frontage              0
 4 lot_area                  0
 5 street                    0
 6 alley                     0
 7 lot_shape                 0
 8 land_contour              0
 9 utilities                 0
10 lot_config                0
11 land_slope                0
12 neighborhood              0
13 condition_1               0
14 condition_2               0
15 bldg_type                 0
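
Because the table is sorted in descending order and its first row is already zero, no column contains NA values; the 15 rows shown are simply the first 15 columns. Zero NA values does not mean there is no missing information, though. The summary above shows placeholder levels such as No_Alley_Access and No_Pool, and lot_frontage has a minimum of 0, which may stand in for an unrecorded value. The sketch below uses a small made-up data frame (toy and its values are illustrative assumptions, not the Ames data) to count NA values and placeholder values side by side:

```r
# Toy data frame (hypothetical values) mixing real NAs and placeholder codes
toy <- data.frame(
  lot_frontage = c(141, 0, 81, NA, 0),
  alley = c("No_Alley_Access", "Gravel", "No_Alley_Access", NA, "Paved"),
  stringsAsFactors = FALSE
)

# Count of NA values per column (what the pipeline above reports)
na_counts <- colSums(is.na(toy))

# Count of placeholder values that stand in for "not applicable or unknown"
placeholder_counts <- c(
  lot_frontage = sum(toy$lot_frontage == 0, na.rm = TRUE),
  alley = sum(toy$alley == "No_Alley_Access", na.rm = TRUE)
)

print(na_counts)
print(placeholder_counts)
```

A check like this is worth running on any predictor whose zero value or "No_..." level could reflect missing rather than measured data.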

Exploratory Data Analysis

Exploratory Data Analysis helps identify patterns, trends, and relationships in the housing data before building the prediction model.

Distribution of Sale Price

ggplot(ames_clean, aes(x = sale_price)) +
  geom_histogram(bins = 40) +
  labs(
    title = "Distribution of House Sale Prices",
    x = "Sale Price",
    y = "Number of Houses"
  )

Living Area vs Sale Price

ggplot(ames_clean, aes(x = gr_liv_area, y = sale_price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Living Area vs Sale Price",
    x = "Above Ground Living Area",
    y = "Sale Price"
  )
`geom_smooth()` using formula = 'y ~ x'

Sale Price by Neighborhood

ggplot(ames_clean, aes(x = neighborhood, y = sale_price)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Sale Price by Neighborhood",
    x = "Neighborhood",
    y = "Sale Price"
  )

Overall Condition vs Sale Price

ggplot(ames_clean, aes(x = factor(overall_cond), y = sale_price)) +
  geom_boxplot() +
  labs(
    title = "Overall Condition vs Sale Price",
    x = "Overall Condition",
    y = "Sale Price"
  )

Predictive Modeling

A linear regression model was fitted to predict log sale price from four predictors: above-ground living area, overall condition, year built, and garage area.

Train-Test Split

set.seed(123)

ames_split <- initial_split(ames_clean, prop = 0.8)

ames_train <- training(ames_split)

ames_test <- testing(ames_split)
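
With prop = 0.8 and 2,930 rows, the split assigns 2,344 rows to training and 586 to testing. This agrees with the model summary in the next section, where 2,332 residual degrees of freedom plus 12 estimated coefficients give 2,344 observations. As a rough base R sketch of what a simple random 80/20 split does (it illustrates the idea with sample(); it is not rsample's internal implementation and will not select the same rows):

```r
set.seed(123)

n <- 2930                      # number of rows in the Ames data
n_train <- floor(0.8 * n)      # 80% of rows go to training

# Draw row indices for the training set without replacement
train_idx <- sample(seq_len(n), size = n_train)

# Everything not drawn becomes the test set
test_idx <- setdiff(seq_len(n), train_idx)

print(c(train = length(train_idx), test = length(test_idx)))
```

initial_split() also accepts a strata argument, which could be used here to keep the sale price distribution similar in both sets.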

Fit Linear Regression Model

ames_fit <- lm(
  log_sale_price ~ gr_liv_area +
    overall_cond +
    year_built +
    garage_area,
  data = ames_train
)

summary(ames_fit)

Call:
lm(formula = log_sale_price ~ gr_liv_area + overall_cond + year_built + 
    garage_area, data = ames_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.26984 -0.09842 -0.00320  0.10595  0.78006 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -1.203e+00  3.683e-01  -3.267   0.0011 ** 
gr_liv_area                4.056e-04  9.291e-06  43.650  < 2e-16 ***
overall_condPoor           1.027e-01  1.042e-01   0.986   0.3244    
overall_condFair           2.180e-01  8.612e-02   2.532   0.0114 *  
overall_condBelow_Average  3.654e-01  8.372e-02   4.365 1.33e-05 ***
overall_condAverage        5.042e-01  8.174e-02   6.168 8.13e-10 ***
overall_condAbove_Average  5.490e-01  8.162e-02   6.727 2.17e-11 ***
overall_condGood           6.294e-01  8.178e-02   7.697 2.04e-14 ***
overall_condVery_Good      6.896e-01  8.314e-02   8.294  < 2e-16 ***
overall_condExcellent      7.839e-01  8.851e-02   8.857  < 2e-16 ***
year_built                 6.037e-03  1.872e-04  32.244  < 2e-16 ***
garage_area                3.814e-04  2.422e-05  15.746  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1974 on 2332 degrees of freedom
Multiple R-squared:  0.7733,    Adjusted R-squared:  0.7722 
F-statistic: 723.2 on 11 and 2332 DF,  p-value: < 2.2e-16
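
Because the response is log(sale_price), each coefficient is read as an approximate proportional change in price rather than a dollar amount. The sketch below converts three of the estimates printed above into percentage effects using 100 * (exp(b * delta) - 1). The coefficient values are copied from the summary; the helper pct_change() and the 100-square-foot and 10-year increments are chosen only for illustration:

```r
# Estimates copied from the summary(ames_fit) output above
b_gr_liv_area <- 4.056e-04   # per square foot of above-ground living area
b_year_built  <- 6.037e-03   # per year of construction date
b_garage_area <- 3.814e-04   # per square foot of garage

# Percent change in predicted price for a change of `delta` units,
# holding the other predictors fixed
pct_change <- function(b, delta) 100 * (exp(b * delta) - 1)

effects <- c(
  living_area_100_sqft = pct_change(b_gr_liv_area, 100),
  built_10_years_later = pct_change(b_year_built, 10),
  garage_100_sqft      = pct_change(b_garage_area, 100)
)

print(round(effects, 2))
```

Under this model, an extra 100 square feet of living area corresponds to roughly a 4% higher predicted price, and a house built 10 years later to roughly 6% higher. The overall_cond coefficients are differences from the factor's reference level, which does not appear in the table.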

Model Evaluation

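The fitted model was checked with base R's standard diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage) to assess the linear regression assumptions on the training data.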

par(mfrow = c(2,2))
plot(ames_fit)
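
The diagnostic plots use the training data only, and the held-out ames_test set is not scored anywhere in this report. One common way to do that is to predict on the test rows and compute the root mean squared error (RMSE). The sketch below walks through those steps on simulated data so that it runs on its own; sim, area, and the generating coefficients are made-up assumptions, and the resulting numbers say nothing about the Ames model:

```r
set.seed(123)

# Simulated stand-in for the housing data (hypothetical values)
n <- 500
sim <- data.frame(area = runif(n, 500, 3000))
sim$log_price <- 11 + 4e-04 * sim$area + rnorm(n, sd = 0.2)

# 80/20 split, then fit on the training rows only
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]
fit <- lm(log_price ~ area, data = train)

# Predict on unseen rows and compute RMSE on the log scale
pred_log <- predict(fit, newdata = test)
rmse_log <- sqrt(mean((test$log_price - pred_log)^2))

# Back-transform with exp() to express the error in price units
rmse_price <- sqrt(mean((exp(test$log_price) - exp(pred_log))^2))

print(c(rmse_log = rmse_log, rmse_price = rmse_price))
```

With the objects from this report, the same pattern would be predict(ames_fit, newdata = ames_test) compared against ames_test$log_sale_price.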

Conclusion

This project analyzed the Ames housing data in R to identify factors that influence house prices. The exploratory plots showed clear relationships between sale price and living area, neighborhood, and overall condition, and the regression coefficients for living area, year built, and garage area were all positive and highly significant.

A linear regression model was developed to predict log-transformed sale prices from four housing characteristics. It explained about 77% of the variance in log sale price on the training data (adjusted R-squared 0.772). The held-out test set was created but not used for evaluation, so out-of-sample accuracy remains to be measured.

Overall, the project combined data cleaning, visualization, and predictive modeling in R into a single real estate analytics workflow.