Libraries Used

library(tidyr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(scales)

Loading Data

setwd("D:/LPU/Sem 2/R programming/Project/house-prices-advanced-regression-techniques")
data <- read.csv("train.csv")
test_data <- read.csv("test.csv")

# Lookup table: abbreviated neighborhood code -> full name
neighborhood_names <- c(
  "Blmngtn" = "Bloomington Heights",
  "Blueste" = "Bluestem",
  "BrDale"  = "Briardale",
  "BrkSide" = "Brookside",
  "ClearCr" = "Clear Creek",
  "CollgCr" = "College Creek",
  "Crawfor" = "Crawford",
  "Edwards" = "Edwards",
  "Gilbert" = "Gilbert",
  "IDOTRR"  = "Iowa DOT and Railroad",
  "MeadowV" = "Meadow Village",
  "Mitchel" = "Mitchell",
  "NAmes"   = "North Ames",
  "NoRidge" = "North Ridge",
  "NPkVill" = "Northpark Villa",
  "NridgHt" = "North Ridge Heights",
  "NWAmes"  = "Northwest Ames",
  "OldTown" = "Old Town",
  "SWISU"   = "South and West Iowa State University",
  "Sawyer"  = "Sawyer",
  "SawyerW" = "Sawyer West",
  "Somerst" = "Somerset",
  "StoneBr" = "Stone Brook",
  "Timber"  = "Timberland",
  "Veenker" = "Veenker"
)

# Apply the same mapping to the training and test sets
data$Neighborhood      <- recode(data$Neighborhood, !!!neighborhood_names)
test_data$Neighborhood <- recode(test_data$Neighborhood, !!!neighborhood_names)


head(data)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope  Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl College Creek       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl       Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl College Creek       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawford       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl   North Ridge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchell       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

The dataset preview confirms successful data import and provides an initial understanding of the available variables and their formats.

—————————————————————————

LEVEL 1: UNDERSTANDING DATA

—————————————————————————

Question 1.1: What is the structure of the dataset (number of observations, variables, and data types)?

dim(data)
## [1] 1460   81
str(data)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "College Creek" "Veenker" "College Creek" "Crawford" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
dim(test_data)
## [1] 1459   80
str(test_data)
## 'data.frame':    1459 obs. of  80 variables:
##  $ Id           : int  1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
##  $ MSSubClass   : int  20 20 60 60 120 60 20 60 20 20 ...
##  $ MSZoning     : chr  "RH" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  80 81 74 78 43 75 NA 63 85 70 ...
##  $ LotArea      : int  11622 14267 13830 9978 5005 10000 7980 8402 10176 8400 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "IR1" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "Corner" "Inside" "Inside" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "North Ames" "North Ames" "Gilbert" "Gilbert" ...
##  $ Condition1   : chr  "Feedr" "Norm" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "1Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  5 6 5 6 8 6 6 6 7 4 ...
##  $ OverallCond  : int  6 6 5 6 5 5 7 5 5 5 ...
##  $ YearBuilt    : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ YearRemodAdd : int  1961 1958 1998 1998 1992 1994 2007 1998 1990 1970 ...
##  $ RoofStyle    : chr  "Gable" "Hip" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
##  $ Exterior2nd  : chr  "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
##  $ MasVnrType   : chr  "None" "BrkFace" "None" "BrkFace" ...
##  $ MasVnrArea   : int  0 108 0 20 0 0 0 0 0 0 ...
##  $ ExterQual    : chr  "TA" "TA" "TA" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "CBlock" "CBlock" "PConc" "PConc" ...
##  $ BsmtQual     : chr  "TA" "TA" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "TA" ...
##  $ BsmtExposure : chr  "No" "No" "No" "No" ...
##  $ BsmtFinType1 : chr  "Rec" "ALQ" "GLQ" "GLQ" ...
##  $ BsmtFinSF1   : int  468 923 791 602 263 0 935 0 637 804 ...
##  $ BsmtFinType2 : chr  "LwQ" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  144 0 0 0 0 0 0 0 0 78 ...
##  $ BsmtUnfSF    : int  270 406 137 324 1017 763 233 789 663 0 ...
##  $ TotalBsmtSF  : int  882 1329 928 926 1280 763 1168 789 1300 882 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "TA" "TA" "Gd" "Ex" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  896 1329 928 926 1280 763 1187 789 1341 882 ...
##  $ X2ndFlrSF    : int  0 0 701 678 0 892 0 676 0 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  896 1329 1629 1604 1280 1655 1187 1465 1341 882 ...
##  $ BsmtFullBath : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ BsmtHalfBath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  1 1 2 2 2 2 2 2 1 1 ...
##  $ HalfBath     : int  0 1 1 1 0 1 0 1 1 0 ...
##  $ BedroomAbvGr : int  2 3 3 3 2 3 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ KitchenQual  : chr  "TA" "Gd" "TA" "Gd" ...
##  $ TotRmsAbvGrd : int  5 6 6 7 5 7 6 7 5 4 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 0 1 1 0 1 0 1 1 0 ...
##  $ FireplaceQu  : chr  NA NA "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ GarageYrBlt  : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ GarageFinish : chr  "Unf" "Unf" "Fin" "Fin" ...
##  $ GarageCars   : int  1 1 2 2 2 2 2 2 2 2 ...
##  $ GarageArea   : int  730 312 482 470 506 440 420 393 506 525 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  140 393 212 360 0 157 483 0 192 240 ...
##  $ OpenPorchSF  : int  0 36 34 36 82 84 21 75 0 0 ...
##  $ EnclosedPorch: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ScreenPorch  : int  120 0 0 0 144 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  "MnPrv" NA "MnPrv" NA ...
##  $ MiscFeature  : chr  NA "Gar2" NA NA ...
##  $ MiscVal      : int  0 12500 0 0 0 0 500 0 0 0 ...
##  $ MoSold       : int  6 6 3 6 1 4 3 5 2 4 ...
##  $ YrSold       : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Normal" ...
setdiff(colnames(data), colnames(test_data))
## [1] "SalePrice"
length(setdiff(colnames(data), colnames(test_data)))
## [1] 1

The training dataset contains 1460 observations and 81 variables; the test dataset contains 1459 observations and 80 variables. Both include numerical and categorical features that capture structural, quality, and locational aspects of houses. The only column present in the training data but not in the test data is SalePrice, which serves as the target variable for prediction.
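The `setdiff()` call above checks one direction only (columns in the training data that are missing from the test data). A minimal sketch of a two-way check is shown below; `train_toy` and `test_toy` are tiny stand-in data frames invented for illustration, not the Kaggle files.

```r
# Toy stand-ins for the training and test sets (illustrative values only).
train_toy <- data.frame(Id = 1:3,
                        LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
test_toy  <- data.frame(Id = 4:5,
                        LotArea = c(11622, 14267))

# Columns present in one set but not in the other.
only_in_train <- setdiff(colnames(train_toy), colnames(test_toy))
only_in_test  <- setdiff(colnames(test_toy), colnames(train_toy))

only_in_train
only_in_test

# The test set should lack only the target variable.
stopifnot(identical(only_in_train, "SalePrice"),
          length(only_in_test) == 0)
```

Checking both directions guards against a test file that carries extra or renamed columns, which the one-way comparison would not reveal.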

Question 1.2: What types of variables are present in the dataset (numerical vs categorical)?

num_vars <- data %>% select(where(is.numeric)) %>% ncol()
cat_vars <- data %>% select(where(is.character)) %>% ncol()
num_vars
## [1] 38
cat_vars
## [1] 43

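Of the 81 variables, 38 are stored as numeric (integer) and 43 as character. The numeric columns cover measurable attributes such as area and price, while the character columns hold qualitative features such as neighborhood. This split reflects storage type only: some integer columns, such as MSSubClass, are category codes rather than quantities.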
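Counting columns by storage type treats every integer column as numerical, including coded ones. The sketch below, on a small made-up data frame, shows how converting a coded column to a factor changes the counts. Treating MSSubClass as categorical is an assumption made here for illustration, not a step taken in this report.

```r
# Toy data frame; MSSubClass values are dwelling-type codes, not quantities.
toy <- data.frame(
  MSSubClass = c(60L, 20L, 60L, 70L),
  LotArea    = c(8450L, 9600L, 11250L, 9550L),
  MSZoning   = c("RL", "RL", "RL", "RL"),
  stringsAsFactors = FALSE
)

# Count columns by storage type, as in the question above.
table(sapply(toy, class))

# Treat the coded column as categorical, then count again.
toy$MSSubClass <- factor(toy$MSSubClass)
n_numeric     <- sum(sapply(toy, is.numeric))
n_categorical <- sum(sapply(toy, function(x) is.character(x) || is.factor(x)))

n_numeric
n_categorical
```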

Question 1.3: Create a cleaned dataset with relevant variables for analysis

df <- data %>% 
  select(SalePrice,
         GrLivArea,
         OverallQual,
         YearBuilt,
         TotalBsmtSF,
         GarageArea,
         BedroomAbvGr,
         FullBath,
         Neighborhood,
         GarageCars,
         TotRmsAbvGrd,
         LotArea) %>% 
  mutate(across(where(is.numeric), ~replace_na(., 0)))

test_df <- test_data %>% 
  select(GrLivArea,
         OverallQual,
         YearBuilt,
         TotalBsmtSF,
         GarageArea,
         BedroomAbvGr,
         FullBath,
         GarageCars,
         TotRmsAbvGrd,
         LotArea,
         Neighborhood) %>%
  mutate(across(where(is.numeric), ~replace_na(., 0)))


dim(df)
## [1] 1460   12

A refined dataset of 12 variables (SalePrice plus 11 predictors) was created to keep the analysis focused. The selection covers size, quality, age, and location, all of which are expected to be relevant to house prices. Any missing numeric values in these columns are replaced with 0.

Question 1.4: Are there any missing values in the cleaned dataset?

colSums(is.na(df))
##    SalePrice    GrLivArea  OverallQual    YearBuilt  TotalBsmtSF   GarageArea 
##            0            0            0            0            0            0 
## BedroomAbvGr     FullBath Neighborhood   GarageCars TotRmsAbvGrd      LotArea 
##            0            0            0            0            0            0

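No missing values remain in the cleaned dataset. For the numeric columns this is guaranteed by the zero-fill applied in the previous step; the check additionally confirms that Neighborhood, the only character column, has no missing entries. The dataset is therefore complete and ready for analysis and modeling.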
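Because the zero-fill runs before this check, the check cannot show how many values were filled. A minimal sketch of counting missing values before and after imputation follows; the data frame `toy` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data with one missing value per column (illustrative only).
toy <- data.frame(
  GarageArea   = c(548, NA, 608),
  Neighborhood = c("Veenker", "Crawford", NA)
)

# Missing values per column before any imputation.
na_before <- colSums(is.na(toy))

# Same zero-fill as in Question 1.3: only numeric columns are touched.
toy_filled <- toy %>%
  mutate(across(where(is.numeric), ~ replace_na(., 0)))
na_after <- colSums(is.na(toy_filled))

na_before
na_after
```

In this toy case the character column keeps its missing entry, which shows that the zero-fill does not cover non-numeric columns.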

—————————————————————————

LEVEL 2: DATA EXTRACTION AND ANALYSIS

—————————————————————————

Question 2.1: What is the contribution of different features to house price (basic correlation scan)?

cor(df %>% select(where(is.numeric)))["SalePrice", ] %>% 
  sort(decreasing = TRUE)
##    SalePrice  OverallQual    GrLivArea   GarageCars   GarageArea  TotalBsmtSF 
##    1.0000000    0.7909816    0.7086245    0.6404092    0.6234314    0.6135806 
##     FullBath TotRmsAbvGrd    YearBuilt      LotArea BedroomAbvGr 
##    0.5606638    0.5337232    0.5228973    0.2638434    0.1682132

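The correlation results show that OverallQual (0.79) and GrLivArea (0.71) have the strongest positive linear relationships with SalePrice, suggesting that quality and size are the features most closely associated with house value. GarageCars, GarageArea, TotalBsmtSF, FullBath, TotRmsAbvGrd, and YearBuilt show moderate correlations (0.52 to 0.64), while LotArea (0.26) and BedroomAbvGr (0.17) show weak ones. Overall, house prices track quality and living space more closely than bedroom count. These are pairwise correlations, so they describe association rather than causal effect.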
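OverallQual appears to be an ordinal rating rather than a continuous measurement, so a rank-based coefficient is a reasonable cross-check on the Pearson values above. The sketch below compares the two on the six rows shown in the data preview; six rows are far too few to draw conclusions, so this only illustrates the call.

```r
# First six rows of OverallQual and SalePrice from the preview above.
overall_qual <- c(7, 6, 7, 7, 8, 5)
sale_price   <- c(208500, 181500, 223500, 140000, 250000, 143000)

pearson  <- cor(overall_qual, sale_price, method = "pearson")
spearman <- cor(overall_qual, sale_price, method = "spearman")

# Both coefficients lie in [-1, 1]; Spearman depends only on ranks.
c(pearson = pearson, spearman = spearman)
```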

Question 2.2: What is the distribution of house prices across quantiles?

summary(df$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
quantile(df$SalePrice)
##     0%    25%    50%    75%   100% 
##  34900 129975 163000 214000 755000
range(df$SalePrice)
## [1]  34900 755000

The quartiles show that half of the houses sold for 163,000 or less and three quarters for 214,000 or less, while the maximum reaches 755,000. The mean (180,921) exceeds the median, and the top quarter of prices spans a far wider range (214,000 to 755,000) than the bottom quarter (34,900 to 129,975). This indicates substantial market variation and a small segment of high-value properties.

Question 2.3: What proportion of houses are priced above the mean?

mean(df$SalePrice > mean(df$SalePrice)) * 100
## [1] 38.35616

Approximately 38.36% of houses are priced above the mean, so the majority fall below it. This is consistent with the mean exceeding the median in Question 2.2 and points to a right-skewed distribution, in which a small number of high-value properties pulls the average upward.
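The link between skewness and the share of values above the mean can be sketched on a small made-up price vector. The `skewness()` helper below is a hand-written moment-based estimate defined for this example; it does not come from any package used in this report.

```r
# Toy prices with one expensive outlier (made-up values).
prices <- c(120000, 130000, 150000, 160000, 175000, 700000)

share_above_mean <- mean(prices > mean(prices)) * 100

# Moment-based skewness: third central moment over the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

mean(prices) > median(prices)  # TRUE for right-skewed data like this
share_above_mean
skewness(prices)
```

A positive skewness value, a mean above the median, and a share above the mean well below 50% all point in the same direction.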

—————————————————————————

LEVEL 3: GROUPING & PATTERN ANALYSIS

—————————————————————————

Question 3.1: How does house price vary with number of bedrooms (BedroomAbvGr)?

df %>% 
  group_by(BedroomAbvGr) %>% 
  summarise(
    avg_price = mean(SalePrice),
    count = n()
  ) %>% 
  arrange(BedroomAbvGr)
## # A tibble: 8 × 3
##   BedroomAbvGr avg_price count
##          <int>     <dbl> <int>
## 1            0   221493.     6
## 2            1   173162.    50
## 3            2   158198.   358
## 4            3   181057.   804
## 5            4   220421.   213
## 6            5   180819.    21
## 7            6   143779      7
## 8            8   200000      1

The number of bedrooms does not show a consistent relationship with house price. Among the well-populated groups, average price rises from about 158,000 for two bedrooms to about 220,000 for four, then falls for five and six bedrooms. The groups with zero, six, or eight bedrooms each contain seven houses or fewer, so their averages are unreliable. This suggests that adding bedrooms does not by itself add value unless it is supported by overall size and quality. In some cases, more bedrooms may reflect inefficient space usage rather than higher property value.
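Group averages based on a handful of houses can be misleading. One possible refinement, sketched below on made-up data, reports the median alongside the mean and drops groups below a minimum size. The threshold `min_count` is a hypothetical choice for illustration.

```r
library(dplyr)

# Toy data (made up): the 6-bedroom group has a single house.
toy <- data.frame(
  BedroomAbvGr = c(2, 2, 2, 3, 3, 3, 6),
  SalePrice    = c(150000, 160000, 155000, 180000, 185000, 400000, 140000)
)

min_count <- 3  # hypothetical threshold, chosen only for illustration

bedroom_summary <- toy %>%
  group_by(BedroomAbvGr) %>%
  summarise(
    avg_price    = mean(SalePrice),
    median_price = median(SalePrice),
    count        = n()
  ) %>%
  filter(count >= min_count)  # drop groups too small to compare

bedroom_summary
```

In the 3-bedroom toy group, a single expensive house raises the mean well above the median, which is why the median is a useful companion statistic.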

Question 3.2: How does house price vary with number of bathrooms (FullBath)?

df %>% 
  group_by(FullBath) %>% 
  summarise(
    avg_price = mean(SalePrice),
    count = n()
  ) %>% 
  arrange(FullBath)
## # A tibble: 4 × 3
##   FullBath avg_price count
##      <int>     <dbl> <int>
## 1        0   165201.     9
## 2        1   134751.   650
## 3        2   213010.   768
## 4        3   347823.    33

From one to three full bathrooms, average price rises steadily, from about 135,000 to about 348,000, indicating that additional bathrooms are associated with higher property value. The zero-bathroom group is an exception (about 165,000), but it contains only nine houses. Unlike bedrooms, this relationship is consistent across the well-populated groups, suggesting that bathroom count is a more reliable indicator of housing value.

Question 3.3: How does house price vary across different neighborhoods?

df %>% 
  group_by(Neighborhood) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(desc(avg_price))
## # A tibble: 25 × 2
##    Neighborhood        avg_price
##    <chr>                   <dbl>
##  1 North Ridge           335295.
##  2 North Ridge Heights   316271.
##  3 Stone Brook           310499 
##  4 Timberland            242247.
##  5 Veenker               238773.
##  6 Somerset              225380.
##  7 Clear Creek           212565.
##  8 Crawford              210625.
##  9 College Creek         197966.
## 10 Bloomington Heights   194871.
## # ℹ 15 more rows

House prices vary strongly across neighborhoods, with a small number of areas such as North Ridge, North Ridge Heights, and Stone Brook dominating the high-value segment, each averaging above 310,000. This points to clear market segmentation by location. Because these averages do not control for house size or quality, they show a location premium but cannot by themselves separate it from structural differences between neighborhoods.

Question 3.4: Does the age of the house (YearBuilt) affect its price?

df %>% 
  mutate(Age = 2024 - YearBuilt,
         AgeGroup = cut(Age, breaks = c(0, 20, 40, 60, 80, 150))) %>% 
  group_by(AgeGroup) %>% 
  summarise(avg_price = mean(SalePrice), count = n())
## # A tibble: 6 × 3
##   AgeGroup avg_price count
##   <fct>        <dbl> <int>
## 1 (0,20]     248928.   276
## 2 (20,40]    224340.   311
## 3 (40,60]    156229.   322
## 4 (60,80]    140125.   277
## 5 (80,150]   133439.   273
## 6 <NA>       122000      1

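Average price declines steadily with age, from about 249,000 for houses up to 20 years old to about 133,000 for those over 80, indicating that newer houses are valued higher in the market. The sharpest drop occurs between the (20,40] and (40,60] groups. Age here is measured relative to 2024, not to the year of sale, so it indicates construction era rather than age at sale. The single <NA> row is a house whose age falls outside the (0, 150] range covered by the breaks. Age alone does not determine value, however: older houses can still achieve high prices if they have strong attributes such as superior construction quality or a prime location.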
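The `<NA>` row in the table above appears because `cut()` returns NA for any value outside the outermost breaks. A minimal sketch of an open-ended last bin is shown below on made-up ages. Using `Inf` as the upper break is one possible fix, not the approach used in this report.

```r
# Toy ages (made up), including one value above the 150-year break.
age <- c(14, 35, 58, 79, 120, 152)

# Original breaks: anything above 150 (or at/below 0) becomes NA.
closed_bins <- cut(age, breaks = c(0, 20, 40, 60, 80, 150))

# Open-ended last bin; include.lowest also keeps an age of exactly 0.
open_bins <- cut(age, breaks = c(0, 20, 40, 60, 80, Inf),
                 include.lowest = TRUE)

sum(is.na(closed_bins))
sum(is.na(open_bins))
table(open_bins)
```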

Overall Insight from Level 3

House prices are influenced by multiple structural and locational factors. Bathroom count shows a positive relationship with price and neighborhood shows large differences in average price, while bedroom count shows no consistent pattern. This highlights the importance of quality and efficient space utilization over simple quantity-based measures.

—————————————————————————

LEVEL 4: SORTING & RANKING ANALYSIS

—————————————————————————

Question 4.1: Which features differentiate high-priced houses from low-priced houses?

df %>% 
  mutate(Category = ifelse(SalePrice > median(SalePrice), "High", "Low")) %>% 
  group_by(Category) %>% 
  summarise(
    avg_area = mean(GrLivArea),
    avg_quality = mean(OverallQual),
    avg_garage = mean(GarageArea),
    avg_rooms = mean(TotRmsAbvGrd)
  )
## # A tibble: 2 × 5
##   Category avg_area avg_quality avg_garage avg_rooms
##   <chr>       <dbl>       <dbl>      <dbl>     <dbl>
## 1 High        1814.        7.03       581.      7.20
## 2 Low         1219.        5.17       365.      5.84

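High-priced houses (those above the median price) have, on average, roughly 50% more living area (1,814 vs. 1,219 sq ft), higher overall quality (7.03 vs. 5.17), larger garages, and more rooms than low-priced houses. This indicates that price differences are associated with size and construction quality together rather than with any single feature.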

Question 4.2: Do the largest houses correspond to the highest prices?

df %>% 
  arrange(desc(GrLivArea)) %>% 
  select(GrLivArea, SalePrice, Neighborhood, OverallQual) %>% 
  head(5)
##   GrLivArea SalePrice Neighborhood OverallQual
## 1      5642    160000      Edwards          10
## 2      4676    184750      Edwards          10
## 3      4476    745000  North Ridge          10
## 4      4316    755000  North Ridge          10
## 5      3627    625000  North Ridge          10

The largest houses do not consistently correspond to the highest prices, indicating that size alone is not a sufficient determinant of house value. The two largest houses (5,642 and 4,676 sq ft, both in Edwards) sold for only 160,000 and 184,750, while the next three (all in North Ridge) sold for 625,000 to 755,000. All five share the top OverallQual rating of 10, so quality does not explain the gap; neighborhood, and possibly sale circumstances not captured in this subset, appear to. This highlights that the impact of size on price is conditional rather than absolute.
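Sales that combine a very large area with a low price can be isolated with a simple filter. The sketch below uses the five rows from the output above; the cut-off values `area_cutoff` and `price_cutoff` are hypothetical and were picked only to separate the two Edwards sales from the rest.

```r
# The five largest houses, copied from the output above.
largest <- data.frame(
  GrLivArea    = c(5642, 4676, 4476, 4316, 3627),
  SalePrice    = c(160000, 184750, 745000, 755000, 625000),
  Neighborhood = c("Edwards", "Edwards", "North Ridge",
                   "North Ridge", "North Ridge")
)

# Hypothetical cut-offs, chosen only to isolate the two unusual sales.
area_cutoff  <- 4000
price_cutoff <- 300000

unusual <- subset(largest, GrLivArea > area_cutoff & SalePrice < price_cutoff)
unusual
```

Whether such rows should be kept, flagged, or excluded before modeling is a judgment call that depends on the goal of the analysis.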

Question 4.3: Which houses have the highest price per unit area (Price per sqft)?

df <- df %>% 
  mutate(
    Total_SF = GrLivArea + TotalBsmtSF,
    Price_per_sqft = SalePrice / Total_SF
  )

df %>% 
  select(Neighborhood, SalePrice, Total_SF, Price_per_sqft) %>% 
  arrange(desc(Price_per_sqft)) %>% 
  head(5)
##          Neighborhood SalePrice Total_SF Price_per_sqft
## 1         Stone Brook    392000     2838       138.1254
## 2 North Ridge Heights    611657     4694       130.3061
## 3          North Ames    107500      827       129.9879
## 4 North Ridge Heights    582933     4556       127.9484
## 5          North Ames    106500      882       120.7483

The houses with the highest price per square foot form a mixed group. Two are small North Ames houses (827 and 882 sq ft) that sold for just over 100,000, while the other three are larger properties in Stone Brook and North Ridge Heights that sold for 392,000 to 611,657. A high price per square foot can therefore arise either from a very small total area or from a premium location, which indicates that this measure is not driven by size alone.
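The price-per-square-foot values can be reproduced directly from the SalePrice and Total_SF columns. The sketch below recomputes them for the five rows shown above and summarises area by neighborhood, which makes the difference in house size between the groups visible.

```r
# Top five rows copied from the output above.
top5 <- data.frame(
  Neighborhood = c("Stone Brook", "North Ridge Heights", "North Ames",
                   "North Ridge Heights", "North Ames"),
  SalePrice    = c(392000, 611657, 107500, 582933, 106500),
  Total_SF     = c(2838, 4694, 827, 4556, 882)
)

# Price per square foot is price divided by total area.
top5$Price_per_sqft <- top5$SalePrice / top5$Total_SF
round(top5$Price_per_sqft, 4)

# Median area per neighborhood within these five rows.
aggregate(Total_SF ~ Neighborhood, data = top5, FUN = median)
```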

Overall Insight from Level 4

The ranking and comparison analysis reveals that house prices are not determined solely by size. While larger houses may have higher total prices, price efficiency and overall valuation are strongly influenced by location and construction quality. This reinforces that high-value properties are defined by a combination of factors rather than a single attribute.

—————————————————————————

LEVEL 5: FEATURE ENGINEERING

—————————————————————————

Consolidated Creation of New Metrics

df <- df %>% 
  mutate(
    Total_SF       = GrLivArea + TotalBsmtSF,
    Price_per_sqft = SalePrice / Total_SF,
    Price_per_room = SalePrice / TotRmsAbvGrd,
    Age            = 2024 - YearBuilt
  )
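Price-based features cannot be computed for the test set, which has no SalePrice column. One way to keep the two datasets consistent is a small helper that adds only the price-independent features. The function `add_size_age_features()` and its `reference_year` argument are hypothetical names introduced for this sketch.

```r
library(dplyr)

# Hypothetical helper: adds only the features that need no SalePrice,
# so it can be applied to both the training and the test frame.
add_size_age_features <- function(d, reference_year = 2024) {
  d %>%
    mutate(
      Total_SF = GrLivArea + TotalBsmtSF,
      Age      = reference_year - YearBuilt
    )
}

# One row taken from the first record of the data preview.
toy <- data.frame(GrLivArea = 1710, TotalBsmtSF = 856, YearBuilt = 2003)
toy_features <- add_size_age_features(toy)
toy_features
```

Making the reference year an argument also makes it easy to switch from 2024 to another baseline later.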

Question 5.1: How does pricing efficiency (price per square foot) vary across houses?

summary(df$Price_per_sqft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.61   60.33   69.51   69.81   78.93  138.13

Price per square foot varies considerably across houses: the middle half falls between about 60 and 79, but the full range runs from 13.61 to 138.13. Property value therefore depends not only on size but also on factors such as quality and location.

Question 5.2: What does the distribution of house age reveal about the housing market?

df %>% 
  mutate(AgeGroup = cut(Age, breaks = c(0, 20, 40, 60, 80, 150))) %>% 
  group_by(AgeGroup) %>% 
  summarise(count = n())
## # A tibble: 6 × 2
##   AgeGroup count
##   <fct>    <int>
## 1 (0,20]     276
## 2 (20,40]    311
## 3 (40,60]    322
## 4 (60,80]    277
## 5 (80,150]   273
## 6 <NA>         1

The houses are spread fairly evenly across the age groups: each of the five bins holds between 273 and 322 houses (roughly 19% to 22% of the data), with the (40,60] group the largest. This suggests a mature housing market with a steady mix of construction eras rather than a concentration in any single period. The presence of both newer and older houses provides diversity, enabling analysis of how age interacts with other factors such as quality and location.
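Raw counts are easier to compare as shares of the total. The sketch below converts the counts from the table above into percentages.

```r
# Counts per age group, copied from the output above (NA row included).
counts <- c("(0,20]" = 276, "(20,40]" = 311, "(40,60]" = 322,
            "(60,80]" = 277, "(80,150]" = 273, "NA" = 1)

# Share of all houses in each group, in percent.
shares <- round(100 * counts / sum(counts), 1)
shares
```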

Question 5.3: How does price per room compare to overall house pricing?

summary(df$Price_per_room)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6317   21943   26467   27909   32378   78500

Price per room varies widely across houses, from 6,317 to 78,500 with a median of 26,467, indicating that room count alone does not determine property value. Houses with fewer rooms can exhibit a higher price per room if they are of superior quality or located in premium neighborhoods. This highlights that value is driven more by quality and efficient space utilization than by the number of rooms.

Question 5.4: How does total usable space vary across houses, and what does it indicate about property scale?

summary(df$Total_SF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    2014    2479    2573    3008   11752

Total usable space ranges from 334 to 11,752 sq ft, indicating the presence of both compact homes and extremely large properties. Most houses fall within a moderate range (the middle half lies between 2,014 and 3,008 sq ft), and the mean of 2,573 sitting above the median of 2,479 reflects a small segment of very large, high-end houses. However, large size alone does not guarantee higher value, reinforcing that space must be considered alongside quality and location.

Overall Insight from Level 5

The engineered features provide deeper insight into pricing dynamics, showing that efficiency-based measures such as price per square foot and price per room reveal variations that are not captured by total price alone. These features highlight that house value is influenced not just by size, but by how effectively space is utilized, along with quality and location.

—————————————————————————

LEVEL 6 — DATA VISUALIZATION & DISTRIBUTION ANALYSIS

—————————————————————————

Question 6.1: How is house price distributed across the dataset?

df <- df %>% 
  mutate(Log_SalePrice = log(SalePrice))

# Original distribution
ggplot(df, aes(x = SalePrice)) +
  geom_histogram(bins = 30, fill = "seagreen", color = "black") +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "Distribution of Sale Price",
    x = "Sale Price",
    y = "Frequency"
  ) +
  theme_minimal()

# Log-transformed distribution
ggplot(df, aes(x = Log_SalePrice)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  labs(
    title = "Distribution of Log(SalePrice)",
    x = "Log(Sale Price)",
    y = "Frequency"
  ) +
  theme_minimal()

The distribution of SalePrice is positively skewed, with a concentration of houses in the lower price range and a long right tail representing high-value properties. After applying a log transformation, the distribution becomes more symmetric and less influenced by extreme values. This indicates that the transformation stabilizes the variance and makes the data more suitable for statistical modeling.
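
The effect of the log transformation on skewness can also be checked numerically. This is a rough sketch on simulated right-skewed values, not on the project data; the skewness() helper is a hand-written moment estimator, and the lognormal parameters are arbitrary assumptions.

```r
# Simulated right-skewed prices; parameters are illustrative, not fitted to the dataset.
set.seed(42)
prices <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(prices)
skew_log <- skewness(log(prices))

print(c(raw = skew_raw, log = skew_log))
```

A value near zero after the transformation is what a roughly symmetric distribution would produce; the same helper could be applied to df$SalePrice and df$Log_SalePrice.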

Question 6.2: How does living area relate to house price visually?

ggplot(df, aes(x = GrLivArea, y = SalePrice)) +
  # Draw the points first so the fitted line stays visible on top of them
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "SalePrice vs Living Area",
    x = "Living Area",
    y = "Sale Price"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot shows a clear positive relationship between living area and house price, indicating that larger houses tend to have higher prices. However, the spread of points increases for larger houses, suggesting that size alone does not fully explain price variation. Other factors such as quality and location also influence pricing.

Question 6.3: How are outliers distributed in house prices?

ggplot(df, aes(x = factor(OverallQual), y = SalePrice, fill = factor(OverallQual))) +
  geom_boxplot() +
  scale_fill_manual(values = c(
    "red",
    "orangered",
    "orange",
    "gold",
    "yellow",
    "yellowgreen",
    "green3",
    "forestgreen",
    "green4",
    "darkgreen"
  )) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "House Prices Across Quality Levels",
    x = "Overall Quality",
    y = "Sale Price"
  ) +
  theme_minimal()

The boxplots show a clear upward relationship between overall house quality and sale price. Lower quality homes (red shades) are concentrated in lower price ranges, while higher quality homes (green shades) have significantly higher median prices and wider price variation. Several outliers are also visible in the higher quality categories, representing premium-priced properties in the housing market.

Question 6.4: What does the density distribution of SalePrice reveal?

ggplot(df, aes(x = SalePrice)) +
  geom_density(fill = "lightblue") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Density Distribution of Sale Price",
    x = "Sale Price",
    y = "Density"
  ) +
  theme_minimal()

The density plot confirms that house prices are concentrated in lower and moderate ranges, while a smaller number of expensive properties form a long upper tail. This reinforces the presence of positive skewness in the dataset.

—————————————————————————

LEVEL 7 — CORRELATION ANALYSIS

—————————————————————————

Question 7.1: What is the correlation structure among numerical variables?

numeric_df <- df %>% 
  select(where(is.numeric))

cor_matrix <- cor(numeric_df)

round(cor_matrix, 2)
##                SalePrice GrLivArea OverallQual YearBuilt TotalBsmtSF GarageArea
## SalePrice           1.00      0.71        0.79      0.52        0.61       0.62
## GrLivArea           0.71      1.00        0.59      0.20        0.45       0.47
## OverallQual         0.79      0.59        1.00      0.57        0.54       0.56
## YearBuilt           0.52      0.20        0.57      1.00        0.39       0.48
## TotalBsmtSF         0.61      0.45        0.54      0.39        1.00       0.49
## GarageArea          0.62      0.47        0.56      0.48        0.49       1.00
## BedroomAbvGr        0.17      0.52        0.10     -0.07        0.05       0.07
## FullBath            0.56      0.63        0.55      0.47        0.32       0.41
## GarageCars          0.64      0.47        0.60      0.54        0.43       0.88
## TotRmsAbvGrd        0.53      0.83        0.43      0.10        0.29       0.34
## LotArea             0.26      0.26        0.11      0.01        0.26       0.18
## Total_SF            0.78      0.88        0.66      0.34        0.82       0.56
## Price_per_sqft      0.64      0.13        0.52      0.52        0.03       0.40
## Price_per_room      0.78      0.28        0.65      0.57        0.54       0.51
## Age                -0.52     -0.20       -0.57     -1.00       -0.39      -0.48
## Log_SalePrice       0.95      0.70        0.82      0.59        0.61       0.65
##                BedroomAbvGr FullBath GarageCars TotRmsAbvGrd LotArea Total_SF
## SalePrice              0.17     0.56       0.64         0.53    0.26     0.78
## GrLivArea              0.52     0.63       0.47         0.83    0.26     0.88
## OverallQual            0.10     0.55       0.60         0.43    0.11     0.66
## YearBuilt             -0.07     0.47       0.54         0.10    0.01     0.34
## TotalBsmtSF            0.05     0.32       0.43         0.29    0.26     0.82
## GarageArea             0.07     0.41       0.88         0.34    0.18     0.56
## BedroomAbvGr           1.00     0.36       0.09         0.68    0.12     0.36
## FullBath               0.36     1.00       0.47         0.55    0.13     0.57
## GarageCars             0.09     0.47       1.00         0.36    0.15     0.53
## TotRmsAbvGrd           0.68     0.55       0.36         1.00    0.19     0.68
## LotArea                0.12     0.13       0.15         0.19    1.00     0.31
## Total_SF               0.36     0.57       0.53         0.68    0.31     1.00
## Price_per_sqft        -0.16     0.26       0.44         0.05    0.10     0.10
## Price_per_room        -0.27     0.30       0.52        -0.06    0.19     0.46
## Age                    0.07    -0.47      -0.54        -0.10   -0.01    -0.34
## Log_SalePrice          0.21     0.59       0.68         0.53    0.26     0.77
##                Price_per_sqft Price_per_room   Age Log_SalePrice
## SalePrice                0.64           0.78 -0.52          0.95
## GrLivArea                0.13           0.28 -0.20          0.70
## OverallQual              0.52           0.65 -0.57          0.82
## YearBuilt                0.52           0.57 -1.00          0.59
## TotalBsmtSF              0.03           0.54 -0.39          0.61
## GarageArea               0.40           0.51 -0.48          0.65
## BedroomAbvGr            -0.16          -0.27  0.07          0.21
## FullBath                 0.26           0.30 -0.47          0.59
## GarageCars               0.44           0.52 -0.54          0.68
## TotRmsAbvGrd             0.05          -0.06 -0.10          0.53
## LotArea                  0.10           0.19 -0.01          0.26
## Total_SF                 0.10           0.46 -0.34          0.77
## Price_per_sqft           1.00           0.70 -0.52          0.63
## Price_per_room           0.70           1.00 -0.57          0.77
## Age                     -0.52          -0.57  1.00         -0.59
## Log_SalePrice            0.63           0.77 -0.59          1.00

The correlation matrix reveals strong positive relationships among variables related to house size and quality. Features such as GrLivArea, OverallQual, and Total_SF show strong associations with SalePrice, indicating their importance in determining house value.

Question 7.2: Which variables have the strongest relationship with SalePrice?

cor_matrix["SalePrice", ] %>% 
  sort(decreasing = TRUE)
##      SalePrice  Log_SalePrice    OverallQual       Total_SF Price_per_room 
##      1.0000000      0.9483737      0.7909816      0.7789588      0.7758415 
##      GrLivArea Price_per_sqft     GarageCars     GarageArea    TotalBsmtSF 
##      0.7086245      0.6408187      0.6404092      0.6234314      0.6135806 
##       FullBath   TotRmsAbvGrd      YearBuilt        LotArea   BedroomAbvGr 
##      0.5606638      0.5337232      0.5228973      0.2638434      0.1682132 
##            Age 
##     -0.5228973

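OverallQual (0.79) has the strongest correlation with SalePrice among the original predictors, followed by the engineered Total_SF (0.78) and GrLivArea (0.71). Log_SalePrice, Price_per_room, and Price_per_sqft also rank highly, but they are computed from SalePrice itself and so cannot be treated as independent predictors. Construction quality and overall size are therefore the features most strongly associated with house price.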
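
When ranking correlations with the target, it can help to drop columns that are derived from the target first. The sketch below illustrates the idea on a small simulated data frame; the column names mirror the report, but the values and the derived vector are assumptions made up for the example.

```r
# Simulated data with the same column names as the report; values are invented.
set.seed(1)
n <- 200
toy <- data.frame(
  GrLivArea   = runif(n, 800, 3000),
  OverallQual = sample(1:10, n, replace = TRUE)
)
toy$SalePrice      <- 50 * toy$GrLivArea + 15000 * toy$OverallQual + rnorm(n, sd = 10000)
toy$Log_SalePrice  <- log(toy$SalePrice)
toy$Price_per_sqft <- toy$SalePrice / toy$GrLivArea

# Correlation of every column with the target.
cors <- cor(toy)["SalePrice", ]

# Columns computed from SalePrice (and SalePrice itself) are excluded
# before ranking, so only genuine predictors remain.
derived <- c("SalePrice", "Log_SalePrice", "Price_per_sqft")
predictor_cors <- sort(cors[!names(cors) %in% derived], decreasing = TRUE)

predictor_cors
```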

Question 7.3: How can correlations be visualized using a heatmap?

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.7
)

The correlation heatmap provides a visual representation of relationships among numerical variables, making it easier to identify strong positive, weak, and negative correlations within the dataset.

Question 7.4: Are there signs of multicollinearity among features?

cor_matrix
##                 SalePrice  GrLivArea OverallQual   YearBuilt TotalBsmtSF
## SalePrice       1.0000000  0.7086245   0.7909816  0.52289733  0.61358055
## GrLivArea       0.7086245  1.0000000   0.5930074  0.19900971  0.45486820
## OverallQual     0.7909816  0.5930074   1.0000000  0.57232277  0.53780850
## YearBuilt       0.5228973  0.1990097   0.5723228  1.00000000  0.39145200
## TotalBsmtSF     0.6135806  0.4548682   0.5378085  0.39145200  1.00000000
## GarageArea      0.6234314  0.4689975   0.5620218  0.47895382  0.48666546
## BedroomAbvGr    0.1682132  0.5212695   0.1016764 -0.07065122  0.05044996
## FullBath        0.5606638  0.6300116   0.5505997  0.46827079  0.32372241
## GarageCars      0.6404092  0.4672474   0.6006707  0.53785009  0.43458483
## TotRmsAbvGrd    0.5337232  0.8254894   0.4274523  0.09558913  0.28557256
## LotArea         0.2638434  0.2631162   0.1058057  0.01422765  0.26083313
## Total_SF        0.7789588  0.8803240   0.6648303  0.33548845  0.82288840
## Price_per_sqft  0.6408187  0.1331439   0.5150323  0.51789555  0.03447252
## Price_per_room  0.7758415  0.2769877   0.6497751  0.56995718  0.53621843
## Age            -0.5228973 -0.1990097  -0.5723228 -1.00000000 -0.39145200
## Log_SalePrice   0.9483737  0.7009267   0.8171844  0.58657024  0.61213398
##                 GarageArea BedroomAbvGr   FullBath  GarageCars TotRmsAbvGrd
## SalePrice       0.62343144   0.16821315  0.5606638  0.64040920   0.53372316
## GrLivArea       0.46899748   0.52126951  0.6300116  0.46724742   0.82548937
## OverallQual     0.56202176   0.10167636  0.5505997  0.60067072   0.42745234
## YearBuilt       0.47895382  -0.07065122  0.4682708  0.53785009   0.09558913
## TotalBsmtSF     0.48666546   0.05044996  0.3237224  0.43458483   0.28557256
## GarageArea      1.00000000   0.06525253  0.4056562  0.88247541   0.33782212
## BedroomAbvGr    0.06525253   1.00000000  0.3632520  0.08610644   0.67661994
## FullBath        0.40565621   0.36325198  1.0000000  0.46967204   0.55478425
## GarageCars      0.88247541   0.08610644  0.4696720  1.00000000   0.36228857
## TotRmsAbvGrd    0.33782212   0.67661994  0.5547843  0.36228857   1.00000000
## LotArea         0.18040276   0.11968991  0.1260306  0.15487074   0.19001478
## Total_SF        0.55846594   0.35945861  0.5744031  0.52960762   0.67880245
## Price_per_sqft  0.39757007  -0.16243932  0.2625206  0.44228129   0.05132142
## Price_per_room  0.51320440  -0.27149296  0.2954026  0.51801751  -0.06276533
## Age            -0.47895382   0.07065122 -0.4682708 -0.53785009  -0.09558913
## Log_SalePrice   0.65088756   0.20904368  0.5947705  0.68062481   0.53442220
##                    LotArea   Total_SF Price_per_sqft Price_per_room         Age
## SalePrice       0.26384335  0.7789588     0.64081873     0.77584146 -0.52289733
## GrLivArea       0.26311617  0.8803240     0.13314387     0.27698773 -0.19900971
## OverallQual     0.10580574  0.6648303     0.51503230     0.64977513 -0.57232277
## YearBuilt       0.01422765  0.3354884     0.51789555     0.56995718 -1.00000000
## TotalBsmtSF     0.26083313  0.8228884     0.03447252     0.53621843 -0.39145200
## GarageArea      0.18040276  0.5584659     0.39757007     0.51320440 -0.47895382
## BedroomAbvGr    0.11968991  0.3594586    -0.16243932    -0.27149296  0.07065122
## FullBath        0.12603063  0.5744031     0.26252061     0.29540264 -0.46827079
## GarageCars      0.15487074  0.5296076     0.44228129     0.51801751 -0.53785009
## TotRmsAbvGrd    0.19001478  0.6788025     0.05132142    -0.06276533 -0.09558913
## LotArea         1.00000000  0.3068137     0.09586740     0.18617218 -0.01422765
## Total_SF        0.30681366  1.0000000     0.10331220     0.46235332 -0.33548845
## Price_per_sqft  0.09586740  0.1033122     1.00000000     0.70301207 -0.51789555
## Price_per_room  0.18617218  0.4623533     0.70301207     1.00000000 -0.56995718
## Age            -0.01422765 -0.3354884    -0.51789555    -0.56995718  1.00000000
## Log_SalePrice   0.25731989  0.7732768     0.63296046     0.76769748 -0.58657024
##                Log_SalePrice
## SalePrice          0.9483737
## GrLivArea          0.7009267
## OverallQual        0.8171844
## YearBuilt          0.5865702
## TotalBsmtSF        0.6121340
## GarageArea         0.6508876
## BedroomAbvGr       0.2090437
## FullBath           0.5947705
## GarageCars         0.6806248
## TotRmsAbvGrd       0.5344222
## LotArea            0.2573199
## Total_SF           0.7732768
## Price_per_sqft     0.6329605
## Price_per_room     0.7676975
## Age               -0.5865702
## Log_SalePrice      1.0000000

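Several predictor pairs are correlated strongly enough to indicate potential multicollinearity: GrLivArea and Total_SF (0.88), GarageArea and GarageCars (0.88), GrLivArea and TotRmsAbvGrd (0.83), and TotalBsmtSF and Total_SF (0.82). Age and YearBuilt are perfectly negatively correlated (-1.00) because Age is computed directly from YearBuilt, so only one of the two should enter a model. These overlaps mean that some features carry largely the same information.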
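
Rather than scanning the full matrix by eye, highly correlated pairs can be extracted programmatically. The sketch below rebuilds a five-variable subset of the matrix from the rounded values printed in Question 7.1 and lists the pairs above a cutoff; the 0.8 cutoff and the names r, idx, and high_pairs are assumptions for illustration.

```r
vars <- c("GrLivArea", "Total_SF", "GarageArea", "GarageCars", "TotRmsAbvGrd")
r <- matrix(1, nrow = 5, ncol = 5, dimnames = list(vars, vars))

# Rounded correlations from the report, entered in column-major order
# of the upper triangle: (1,2), (1,3), (2,3), (1,4), (2,4), (3,4), (1,5), ...
r[upper.tri(r)] <- c(0.88, 0.47, 0.56, 0.47, 0.53, 0.88, 0.83, 0.68, 0.34, 0.36)
r[lower.tri(r)] <- t(r)[lower.tri(r)]  # mirror to make the matrix symmetric

# Keep each pair once (upper triangle only) when |r| exceeds the cutoff.
cutoff <- 0.8
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

high_pairs
```

Applied to the full cor_matrix, the same which() call would return every pair above the cutoff without printing the whole matrix.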

—————————————————————————

LEVEL 8 — SIMPLE LINEAR REGRESSION

—————————————————————————

Question 8.1: How does living area predict house price?

model1 <- lm(SalePrice ~ GrLivArea, data = df)

summary(model1)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462999  -29800   -1124   21957  339832 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18569.026   4480.755   4.144 3.61e-05 ***
## GrLivArea     107.130      2.794  38.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared:  0.5021, Adjusted R-squared:  0.5018 
## F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16
ggplot(df, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Linear Regression: Living Area vs Sale Price",
    x = "Living Area (sq ft)",
    y = "Sale Price"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The simple linear regression model shows that living area has a significant positive effect on house price (p < 2e-16). The slope of about 107 means that each additional square foot of living area is associated with roughly 107 more in sale price on average. The R² of 0.50 indicates that living area alone explains about half of the variation in price, leaving the other half to factors outside this model.
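
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the summary(model1) output above; the living areas of 1,500 and 2,500 sq ft are arbitrary example inputs, and predict_price() is a hypothetical helper, not part of the original analysis.

```r
# Coefficients copied from the summary(model1) output.
intercept <- 18569.026
slope     <- 107.130

# Fitted value under the simple linear model for a given living area.
predict_price <- function(living_area) intercept + slope * living_area

price_1500 <- predict_price(1500)

# Difference in fitted price between two houses 1,000 sq ft apart.
price_gap <- predict_price(2500) - predict_price(1500)

print(c(price_1500 = price_1500, price_gap = price_gap))
```

With the model object available, predict(model1, newdata = data.frame(GrLivArea = 1500)) should give the same fitted value up to rounding of the printed coefficients.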

—————————————————————————

LEVEL 9 — MULTIPLE & POLYNOMIAL REGRESSION

—————————————————————————

Question 9.1: How do multiple features jointly influence house prices?

model2 <- lm(
  SalePrice ~ GrLivArea + OverallQual +
    GarageArea + FullBath + Age,
  data = df
)

summary(model2)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + GarageArea + 
##     FullBath + Age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -424848  -21012   -2355   17693  295926 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -44592.420   7753.719  -5.751 1.08e-08 ***
## GrLivArea       59.886      3.043  19.683  < 2e-16 ***
## OverallQual  23242.964   1160.471  20.029  < 2e-16 ***
## GarageArea      56.652      6.255   9.058  < 2e-16 ***
## FullBath     -7174.235   2716.719  -2.641  0.00836 ** 
## Age           -428.100     48.026  -8.914  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39550 on 1454 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.7522 
## F-statistic: 886.6 on 5 and 1454 DF,  p-value: < 2.2e-16

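The multiple regression model shows that house price is jointly influenced by several features, with OverallQual and GrLivArea having the largest t-values (about 20 each). Adjusted R² rises to 0.75, compared with 0.50 for the simple model, so the added predictors improve explanatory power substantially. The negative coefficient on FullBath should be read with care: FullBath is positively correlated with SalePrice on its own (0.56), so the sign reversal likely reflects its overlap with GrLivArea and the other predictors rather than a genuine negative effect.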
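
Overlap between predictors in a multiple regression is commonly quantified with the variance inflation factor (VIF), defined for each predictor as 1 / (1 - R²) from regressing it on the others. The sketch below computes it by hand in base R on simulated predictors; the column names area, baths, and lot and the vif_manual() helper are made up for the example and are not results from the project data.

```r
# Simulated predictors: 'baths' is built to overlap strongly with 'area',
# while 'lot' is independent of both.
set.seed(2024)
n <- 200
area  <- rnorm(n)
baths <- area + rnorm(n, sd = 0.3)
lot   <- rnorm(n)
predictors <- data.frame(area, baths, lot)

# VIF for each column: regress it on all other columns and
# convert the resulting R-squared into 1 / (1 - R-squared).
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(predictors)
round(vifs, 2)
```

Values close to 1 suggest little overlap, while large values suggest that a coefficient's size and sign are hard to interpret in isolation. The same helper could be applied to the predictor columns used in model2.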

Question 9.2: Does a non-linear relationship exist between living area and house price?

poly_model <- lm(
  SalePrice ~ poly(GrLivArea, 2),
  data = df
)

summary(poly_model)
## 
## Call:
## lm(formula = SalePrice ~ poly(GrLivArea, 2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -321613  -30369    -876   22954  338146 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           180921       1459 124.038  < 2e-16 ***
## poly(GrLivArea, 2)1  2150288      55733  38.582  < 2e-16 ***
## poly(GrLivArea, 2)2  -241924      55733  -4.341 1.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55730 on 1457 degrees of freedom
## Multiple R-squared:  0.5085, Adjusted R-squared:  0.5078 
## F-statistic: 753.7 on 2 and 1457 DF,  p-value: < 2.2e-16

The quadratic term is statistically significant (p = 1.52e-05), so there is evidence that the effect of living area on price is not perfectly linear across all property sizes; its negative sign suggests a curve that flattens slightly for larger houses. The practical gain is small, however: R² rises only from 0.502 for the simple linear model to 0.509.
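
Because the linear model is nested inside the quadratic one, the two can be compared formally with an F-test via anova(). The following is a sketch on simulated data with built-in curvature; the variable names and coefficients are invented for the example, and the same call could be applied to model1 and poly_model.

```r
# Simulated data with a deliberately concave relationship; not the project data.
set.seed(7)
n <- 300
area  <- runif(n, 500, 4000)
price <- 20000 + 120 * area - 0.01 * area^2 + rnorm(n, sd = 20000)
sim   <- data.frame(area, price)

linear_fit    <- lm(price ~ area, data = sim)
quadratic_fit <- lm(price ~ poly(area, 2), data = sim)

# F-test for whether the quadratic term reduces residual variance
# beyond what the linear term already explains.
comparison <- anova(linear_fit, quadratic_fit)
p_value <- comparison[["Pr(>F)"]][2]

# Size of the improvement, to separate statistical from practical significance.
r2_gain <- summary(quadratic_fit)$r.squared - summary(linear_fit)$r.squared

print(c(p_value = p_value, r2_gain = r2_gain))
```

Reporting the gain in R² next to the p-value keeps a significant but small improvement from being overstated.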

Question 9.3: Which regression model performs better?

model_comparison <- data.frame(
  Model = c("Simple Regression",
            "Multiple Regression",
            "Polynomial Regression"),

  R_Squared = c(
    summary(model1)$r.squared,
    summary(model2)$r.squared,
    summary(poly_model)$r.squared
  ),

  Adjusted_R_Squared = c(
    summary(model1)$adj.r.squared,
    summary(model2)$adj.r.squared,
    summary(poly_model)$adj.r.squared
  )
)

model_comparison
##                   Model R_Squared Adjusted_R_Squared
## 1     Simple Regression 0.5021487          0.5018072
## 2   Multiple Regression 0.7530226          0.7521733
## 3 Polynomial Regression 0.5085048          0.5078302

The multiple regression model performs best, with the highest R² (0.753) and adjusted R² (0.752). Combining multiple predictors therefore explains house price variation far more effectively than relying on living area alone. The polynomial model improves on simple linear regression only marginally (adjusted R² of 0.508 versus 0.502), so adding predictors matters much more here than adding curvature in GrLivArea.
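
R² is computed on the same data used for fitting, so a complementary check is to compare prediction error on held-out rows. The sketch below is one possible setup on simulated data; the 70/30 split, the column names, and the rmse() helper are assumptions, not part of the original analysis.

```r
# Simulated data in which price depends on both area and quality.
set.seed(123)
n <- 400
sim <- data.frame(
  area    = runif(n, 600, 3500),
  quality = sample(1:10, n, replace = TRUE)
)
sim$price <- 30000 + 60 * sim$area + 20000 * sim$quality + rnorm(n, sd = 25000)

# Random 70/30 split into fitting and holdout rows.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train   <- sim[train_idx, ]
holdout <- sim[-train_idx, ]

# Root mean squared error of a model's predictions on new data.
rmse <- function(model, newdata) {
  sqrt(mean((newdata$price - predict(model, newdata = newdata))^2))
}

simple_fit   <- lm(price ~ area, data = train)
multiple_fit <- lm(price ~ area + quality, data = train)

holdout_rmse <- c(
  simple   = rmse(simple_fit, holdout),
  multiple = rmse(multiple_fit, holdout)
)

round(holdout_rmse)
```

A lower holdout RMSE for the larger model indicates that the extra predictors help on unseen rows and not only on the rows used for fitting.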

—————————————————————————

LEVEL 10 — MODEL DIAGNOSTICS & OUTLIER ANALYSIS

—————————————————————————

Question 10.1: Are there significant outliers in house prices?

Q1 <- quantile(df$SalePrice, 0.25)
Q3 <- quantile(df$SalePrice, 0.75)

IQR_value <- Q3 - Q1

outliers <- df %>% 
  filter(
    SalePrice < (Q1 - 1.5 * IQR_value) |
    SalePrice > (Q3 + 1.5 * IQR_value)
  )

nrow(outliers)
## [1] 61
summary(outliers$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  341000  372500  394617  425954  440000  755000

The IQR method flags 61 houses as outliers. All of them lie above the upper fence, with prices from 341,000 to 755,000; no house falls below the lower fence. These unusually expensive properties may influence regression coefficients, residual distribution, and overall model accuracy.
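
The fence calculation can be wrapped in a small helper so that low and high outliers are counted separately. This is a sketch on a toy vector; iqr_fences() is a hypothetical function name, and the multiplier k = 1.5 follows the rule used above.

```r
# Lower and upper fences for the IQR rule with multiplier k.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Toy values with one obvious high outlier; not taken from the dataset.
toy_prices <- c(1:10, 100)
fences <- iqr_fences(toy_prices)

# Count outliers on each side of the distribution separately.
n_low  <- sum(toy_prices < fences["lower"])
n_high <- sum(toy_prices > fences["upper"])

print(fences)
print(c(low = n_low, high = n_high))
```

Calling iqr_fences(df$SalePrice) would give the two thresholds used in the filter above and make it explicit on which side the flagged houses lie.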

Question 10.2: Do regression residuals satisfy model assumptions?

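par(mfrow = c(2, 2))  # 2 x 2 grid for the four diagnostic plots
plot(model2)
par(mfrow = c(1, 1))  # restore the default single-panel layout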

The diagnostic plots help evaluate regression assumptions such as linearity, normality of residuals, homoscedasticity, and the presence of influential observations. Minor deviations suggest that while the model performs reasonably well, certain assumptions are not perfectly satisfied.
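
Influential observations can also be identified numerically with Cook's distance. The sketch below fits a model to simulated data with one planted extreme point and flags observations above 4 / n, which is a common rule of thumb rather than a fixed standard; all names and values are illustrative.

```r
# Simulated linear data plus one planted high-leverage, badly fitted point.
set.seed(99)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 3)
x <- c(x, 100)  # far outside the range of the other x values
y <- c(y, 0)    # far from the line the other points follow

fit <- lm(y ~ x)

# Cook's distance measures how much the fitted values change
# when each observation is left out.
cd <- cooks.distance(fit)
threshold <- 4 / length(cd)  # rule-of-thumb cutoff, an assumption
flagged <- which(cd > threshold)

flagged
```

Running cooks.distance(model2) in the same way would list the rows worth inspecting alongside the diagnostic plots.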

—————————————————————————

LEVEL 11 — MODEL EVALUATION & FINAL INSIGHTS

—————————————————————————

Question 11.1: Which variables emerge as the strongest predictors of house price across all analyses?

importance_summary <- data.frame(
  Feature = c("OverallQual", "GrLivArea", "Total_SF"),
  Importance = c("Very High", "Very High", "High")
)

importance_summary
##       Feature Importance
## 1 OverallQual  Very High
## 2   GrLivArea  Very High
## 3    Total_SF       High

Across statistical analysis, correlation analysis, and regression modeling, OverallQual, GrLivArea, and Total_SF consistently emerge as the strongest predictors of house price. This indicates that construction quality and effective living space play a dominant role in determining property value.

FINAL CONCLUSION

The project demonstrates that house prices are strongly influenced by structural quality, living area, and location. Statistical analysis, visualization, feature engineering, correlation analysis, and regression modeling collectively reveal that property value is determined by a combination of size, quality, and market positioning rather than a single feature alone.

The regression models further confirm that multiple variables jointly explain house price variation more effectively than individual features in isolation. Overall, the project successfully transforms raw housing data into meaningful analytical insights using R programming techniques aligned with exploratory data analysis and predictive modeling principles.

The project also demonstrates how statistical modeling techniques in R can support data-driven decision making in real estate valuation and housing market analysis.