library(tidyr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(scales)
setwd("D:/LPU/Sem 2/R programming/Project/house-prices-advanced-regression-techniques")
data <- read.csv("train.csv")
test_data <- read.csv("test.csv")
# Lookup table mapping abbreviated neighborhood codes to full names.
neighborhood_names <- c(
  "Blmngtn" = "Bloomington Heights",
  "Blueste" = "Bluestem",
  "BrDale" = "Briardale",
  "BrkSide" = "Brookside",
  "ClearCr" = "Clear Creek",
  "CollgCr" = "College Creek",
  "Crawfor" = "Crawford",
  "Edwards" = "Edwards",
  "Gilbert" = "Gilbert",
  "IDOTRR" = "Iowa DOT and Railroad",
  "MeadowV" = "Meadow Village",
  "Mitchel" = "Mitchell",
  "NAmes" = "North Ames",
  "NoRidge" = "North Ridge",
  "NPkVill" = "Northpark Villa",
  "NridgHt" = "North Ridge Heights",
  "NWAmes" = "Northwest Ames",
  "OldTown" = "Old Town",
  "SWISU" = "South and West Iowa State University",
  "Sawyer" = "Sawyer",
  "SawyerW" = "Sawyer West",
  "Somerst" = "Somerset",
  "StoneBr" = "Stone Brook",
  "Timber" = "Timberland",
  "Veenker" = "Veenker"
)
# Apply the same mapping to both data sets; !!! splices the named vector into recode().
data$Neighborhood <- recode(data$Neighborhood, !!!neighborhood_names)
test_data$Neighborhood <- recode(test_data$Neighborhood, !!!neighborhood_names)
head(data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl College Creek Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl College Creek Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawford Norm Norm 1Fam
## 5 AllPub FR2 Gtl North Ridge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchell Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
The dataset preview confirms successful data import and provides an initial understanding of the available variables and their formats.
dim(data)
## [1] 1460 81
str(data)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "College Creek" "Veenker" "College Creek" "Crawford" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
dim(test_data)
## [1] 1459 80
str(test_data)
## 'data.frame': 1459 obs. of 80 variables:
## $ Id : int 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
## $ MSSubClass : int 20 20 60 60 120 60 20 60 20 20 ...
## $ MSZoning : chr "RH" "RL" "RL" "RL" ...
## $ LotFrontage : int 80 81 74 78 43 75 NA 63 85 70 ...
## $ LotArea : int 11622 14267 13830 9978 5005 10000 7980 8402 10176 8400 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "IR1" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "Corner" "Inside" "Inside" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "North Ames" "North Ames" "Gilbert" "Gilbert" ...
## $ Condition1 : chr "Feedr" "Norm" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "1Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 5 6 5 6 8 6 6 6 7 4 ...
## $ OverallCond : int 6 6 5 6 5 5 7 5 5 5 ...
## $ YearBuilt : int 1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
## $ YearRemodAdd : int 1961 1958 1998 1998 1992 1994 2007 1998 1990 1970 ...
## $ RoofStyle : chr "Gable" "Hip" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
## $ Exterior2nd : chr "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
## $ MasVnrType : chr "None" "BrkFace" "None" "BrkFace" ...
## $ MasVnrArea : int 0 108 0 20 0 0 0 0 0 0 ...
## $ ExterQual : chr "TA" "TA" "TA" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "CBlock" "CBlock" "PConc" "PConc" ...
## $ BsmtQual : chr "TA" "TA" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "TA" ...
## $ BsmtExposure : chr "No" "No" "No" "No" ...
## $ BsmtFinType1 : chr "Rec" "ALQ" "GLQ" "GLQ" ...
## $ BsmtFinSF1 : int 468 923 791 602 263 0 935 0 637 804 ...
## $ BsmtFinType2 : chr "LwQ" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 144 0 0 0 0 0 0 0 0 78 ...
## $ BsmtUnfSF : int 270 406 137 324 1017 763 233 789 663 0 ...
## $ TotalBsmtSF : int 882 1329 928 926 1280 763 1168 789 1300 882 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "TA" "TA" "Gd" "Ex" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 896 1329 928 926 1280 763 1187 789 1341 882 ...
## $ X2ndFlrSF : int 0 0 701 678 0 892 0 676 0 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 896 1329 1629 1604 1280 1655 1187 1465 1341 882 ...
## $ BsmtFullBath : int 0 0 0 0 0 0 1 0 1 1 ...
## $ BsmtHalfBath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 1 1 2 2 2 2 2 2 1 1 ...
## $ HalfBath : int 0 1 1 1 0 1 0 1 1 0 ...
## $ BedroomAbvGr : int 2 3 3 3 2 3 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ KitchenQual : chr "TA" "Gd" "TA" "Gd" ...
## $ TotRmsAbvGrd : int 5 6 6 7 5 7 6 7 5 4 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 0 1 1 0 1 0 1 1 0 ...
## $ FireplaceQu : chr NA NA "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ GarageYrBlt : int 1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
## $ GarageFinish : chr "Unf" "Unf" "Fin" "Fin" ...
## $ GarageCars : int 1 1 2 2 2 2 2 2 2 2 ...
## $ GarageArea : int 730 312 482 470 506 440 420 393 506 525 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 140 393 212 360 0 157 483 0 192 240 ...
## $ OpenPorchSF : int 0 36 34 36 82 84 21 75 0 0 ...
## $ EnclosedPorch: int 0 0 0 0 0 0 0 0 0 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ScreenPorch : int 120 0 0 0 144 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr "MnPrv" NA "MnPrv" NA ...
## $ MiscFeature : chr NA "Gar2" NA NA ...
## $ MiscVal : int 0 12500 0 0 0 0 500 0 0 0 ...
## $ MoSold : int 6 6 3 6 1 4 3 5 2 4 ...
## $ YrSold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Normal" ...
setdiff(colnames(data), colnames(test_data))
## [1] "SalePrice"
length(setdiff(colnames(data), colnames(test_data)))
## [1] 1
The training set contains 1,460 observations and 81 variables; the test set contains 1,459 observations and 80 variables. Both include numerical and categorical features that capture structural, quality, and locational aspects of houses. The only column present in the training set but not in the test set is SalePrice, which serves as the target variable for prediction.
num_vars <- data %>% select(where(is.numeric)) %>% ncol()
cat_vars <- data %>% select(where(is.character)) %>% ncol()
num_vars
## [1] 38
cat_vars
## [1] 43
The training set includes 38 numeric and 43 character variables, enabling analysis of measurable attributes such as size and price, along with qualitative features such as neighborhood.
df <- data %>%
select(SalePrice,
GrLivArea,
OverallQual,
YearBuilt,
TotalBsmtSF,
GarageArea,
BedroomAbvGr,
FullBath,
Neighborhood,
GarageCars,
TotRmsAbvGrd,
LotArea) %>%
mutate(across(where(is.numeric), ~replace_na(., 0)))
test_df <- test_data %>%
select(GrLivArea,
OverallQual,
YearBuilt,
TotalBsmtSF,
GarageArea,
BedroomAbvGr,
FullBath,
GarageCars,
TotRmsAbvGrd,
LotArea,
Neighborhood) %>%
mutate(across(where(is.numeric), ~replace_na(., 0)))
dim(df)
## [1] 1460 12
A working subset of 12 variables (SalePrice plus 11 predictors covering size, quality, construction year, rooms, garage, lot area, and neighborhood) was selected to keep the analysis focused. Missing values in the numeric columns are replaced with 0 as part of this step, in both the training and test subsets.
colSums(is.na(df))
## SalePrice GrLivArea OverallQual YearBuilt TotalBsmtSF GarageArea
## 0 0 0 0 0 0
## BedroomAbvGr FullBath Neighborhood GarageCars TotRmsAbvGrd LotArea
## 0 0 0 0 0 0
No missing values remain in the working dataset, so it is ready for analysis and modeling. Because this check runs after numeric NAs were replaced with 0, it confirms completeness but does not show how many values were imputed.
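To see how much imputation actually took place, the NA count can be taken before the fill step. The following is a minimal sketch in base R on a small made-up data frame (`toy` and its values are illustrative, not the Ames data); it mirrors the `replace_na(., 0)` step without claiming to reproduce it exactly.

```r
# Toy data standing in for the selected columns (values are illustrative only).
toy <- data.frame(
  GarageArea  = c(548, NA, 608, 642),
  TotalBsmtSF = c(856, 1262, NA, NA),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Count missing values per column BEFORE imputing, so the audit is meaningful.
na_before <- colSums(is.na(toy))

# Replace numeric NAs with 0 (base R equivalent of the replace_na(., 0) step).
filled <- toy
num_cols <- vapply(filled, is.numeric, logical(1))
filled[num_cols] <- lapply(filled[num_cols], function(x) {
  x[is.na(x)] <- 0
  x
})

# After imputation every count is zero, which is why a late check hides the gaps.
na_after <- colSums(is.na(filled))

print(na_before)
print(na_after)
```

Reporting `na_before` alongside the final check makes it clear which columns contain imputed zeros.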
cor(df %>% select(where(is.numeric)))["SalePrice", ] %>%
sort(decreasing = TRUE)
## SalePrice OverallQual GrLivArea GarageCars GarageArea TotalBsmtSF
## 1.0000000 0.7909816 0.7086245 0.6404092 0.6234314 0.6135806
## FullBath TotRmsAbvGrd YearBuilt LotArea BedroomAbvGr
## 0.5606638 0.5337232 0.5228973 0.2638434 0.1682132
OverallQual (0.79) and GrLivArea (0.71) have the strongest positive correlations with SalePrice, which points to quality and size as the variables most closely associated with house value. The garage, basement, bathroom, total-room, and construction-year variables show moderate correlations (0.52 to 0.64), while LotArea (0.26) and BedroomAbvGr (0.17) are only weakly related to price. Overall, price is more closely associated with quality and floor area than with bedroom count. These are linear correlations and do not by themselves establish causation.
summary(df$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
quantile(df$SalePrice)
## 0% 25% 50% 75% 100%
## 34900 129975 163000 214000 755000
range(df$SalePrice)
## [1] 34900 755000
Half of all houses sold for between 129,975 and 214,000, while the maximum (755,000) is about 3.5 times the third quartile. The distance from the median to the maximum is far larger than the distance from the median to the minimum, indicating substantial market variation and a small segment of high-value properties.
mean(df$SalePrice > mean(df$SalePrice)) * 100
## [1] 38.35616
Approximately 38.36% of houses are priced above the mean, so the majority fall below it. Together with a mean (180,921) well above the median (163,000), this indicates a right-skewed distribution in which a small number of high-value properties pull the average upward.
df %>%
group_by(BedroomAbvGr) %>%
summarise(
avg_price = mean(SalePrice),
count = n()
) %>%
arrange(BedroomAbvGr)
## # A tibble: 8 × 3
## BedroomAbvGr avg_price count
## <int> <dbl> <int>
## 1 0 221493. 6
## 2 1 173162. 50
## 3 2 158198. 358
## 4 3 181057. 804
## 5 4 220421. 213
## 6 5 180819. 21
## 7 6 143779 7
## 8 8 200000 1
Bedroom count has no consistent relationship with price. Average price falls from 221,493 (0 bedrooms) to 158,198 (2 bedrooms), rises to 220,421 at 4 bedrooms, then drops again for 5 and 6 bedrooms. The groups with 0, 6, and 8 bedrooms contain only 6, 7, and 1 houses respectively, so their averages are unreliable. The pattern suggests that adding bedrooms does not by itself add value unless overall size and quality also increase.
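Group averages based on a handful of houses can be misleading, so it can help to report a median next to the count and to flag small groups. A minimal base R sketch on made-up data follows; `toy`, its values, and the `min_n` threshold are illustrative assumptions, not part of the original analysis.

```r
# Toy data: bedroom counts and sale prices (illustrative values only).
toy <- data.frame(
  BedroomAbvGr = c(2, 2, 2, 3, 3, 3, 3, 8),
  SalePrice    = c(150000, 160000, 155000, 180000, 175000, 190000, 185000, 200000)
)

# Hypothetical minimum group size below which a summary is treated as unreliable.
min_n <- 3

# tapply() and table() both order groups by the sorted unique values,
# so the three columns line up row by row.
by_bed <- data.frame(
  BedroomAbvGr = sort(unique(toy$BedroomAbvGr)),
  median_price = as.numeric(tapply(toy$SalePrice, toy$BedroomAbvGr, median)),
  count        = as.integer(table(toy$BedroomAbvGr))
)

# Flag groups that have enough observations to be interpreted.
by_bed$reliable <- by_bed$count >= min_n

print(by_bed)
```

The median is less sensitive than the mean to a single unusual sale, which matters most in exactly the small groups that the flag identifies.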
df %>%
group_by(FullBath) %>%
summarise(
avg_price = mean(SalePrice),
count = n()
) %>%
arrange(FullBath)
## # A tibble: 4 × 3
## FullBath avg_price count
## <int> <dbl> <int>
## 1 0 165201. 9
## 2 1 134751. 650
## 3 2 213010. 768
## 4 3 347823. 33
Average price rises steeply from one to three full bathrooms (134,751, then 213,010, then 347,823). The exception is the group with no full bathroom, which averages 165,201 but contains only 9 houses. Compared with bedrooms, this relationship is much more consistent, suggesting that full-bath count is a more reliable indicator of housing value.
df %>%
group_by(Neighborhood) %>%
summarise(avg_price = mean(SalePrice)) %>%
arrange(desc(avg_price))
## # A tibble: 25 × 2
## Neighborhood avg_price
## <chr> <dbl>
## 1 North Ridge 335295.
## 2 North Ridge Heights 316271.
## 3 Stone Brook 310499
## 4 Timberland 242247.
## 5 Veenker 238773.
## 6 Somerset 225380.
## 7 Clear Creek 212565.
## 8 Crawford 210625.
## 9 College Creek 197966.
## 10 Bloomington Heights 194871.
## # ℹ 15 more rows
Average price varies strongly across the 25 neighborhoods. North Ridge, North Ridge Heights, and Stone Brook each average above 310,000, well ahead of fourth-placed Timberland (242,247). This points to clear market segmentation by location. These averages do not control for house size or quality, however, so they do not show that the location premium is independent of structural features.
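One way to check whether a neighborhood premium survives after accounting for size is to include both in a linear model. The sketch below uses simulated data in which neighborhood "B" is given a built-in premium of 40,000; the labels, coefficients, and noise level are all assumptions chosen for illustration, and nothing here is fitted to the Ames data.

```r
set.seed(3)
n <- 300

# Simulated houses in two hypothetical neighborhoods with similar size ranges.
toy <- data.frame(
  Neighborhood = rep(c("A", "B"), each = n / 2),
  GrLivArea    = runif(n, 800, 3000)
)

# Price depends on size plus a fixed premium for neighborhood B, plus noise.
premium <- ifelse(toy$Neighborhood == "B", 40000, 0)
toy$SalePrice <- 50000 + 80 * toy$GrLivArea + premium + rnorm(n, sd = 15000)

# The NeighborhoodB coefficient estimates the premium at a given living area.
fit <- lm(SalePrice ~ GrLivArea + Neighborhood, data = toy)
premium_hat <- coef(fit)[["NeighborhoodB"]]

print(round(coef(fit), 1))
```

If the neighborhood coefficient stays large after size (and, in a fuller model, quality) is included, the claim of a location premium is better supported than by raw group means alone.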
df %>%
mutate(Age = 2024 - YearBuilt,
AgeGroup = cut(Age, breaks = c(0, 20, 40, 60, 80, 150))) %>%
group_by(AgeGroup) %>%
summarise(avg_price = mean(SalePrice), count = n())
## # A tibble: 6 × 3
## AgeGroup avg_price count
## <fct> <dbl> <int>
## 1 (0,20] 248928. 276
## 2 (20,40] 224340. 311
## 3 (40,60] 156229. 322
## 4 (60,80] 140125. 277
## 5 (80,150] 133439. 273
## 6 <NA> 122000 1
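Average price declines steadily with age, from 248,928 for houses up to 20 years old to 133,439 for houses over 80 years old, with the sharpest drop between the (20,40] and (40,60] groups. Age here is measured relative to 2024 rather than the year of sale. The single <NA> row is a house older than 150 years, which falls outside the last break passed to cut(). Age alone does not determine value: older houses can still sell for high prices if they have strong attributes such as superior construction quality or a prime location.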
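Two adjustments would avoid the unassigned house and the dependence on a fixed reference year: an open-ended last bin, and age measured at the time of sale. The following base R sketch shows one possible alternative on made-up data. It assumes that `YrSold` is carried into the working data frame (it is not among the columns selected above) and that no house is sold before it is built.

```r
# Toy data: construction and sale years (illustrative values only).
toy <- data.frame(
  YearBuilt = c(2005, 1990, 1960, 1930, 1872),
  YrSold    = c(2008, 2008, 2007, 2009, 2006)
)

# Age at the time of sale instead of age relative to a hard-coded year.
toy$AgeAtSale <- toy$YrSold - toy$YearBuilt

# Inf as the last break keeps very old houses in the top bin;
# include.lowest = TRUE keeps houses sold in the year they were built (age 0).
toy$AgeGroup <- cut(
  toy$AgeAtSale,
  breaks = c(0, 20, 40, 60, 80, Inf),
  include.lowest = TRUE
)

print(toy)
print(table(toy$AgeGroup, useNA = "ifany"))
```

With these breaks every house receives a group, so no `<NA>` row appears in the grouped summary.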
House prices are associated with multiple structural and locational factors. Full-bath count and neighborhood show strong associations with price, while bedroom count shows no consistent pattern, which suggests that quality and overall space matter more than simple room counts.
df %>%
mutate(Category = ifelse(SalePrice > median(SalePrice), "High", "Low")) %>%
group_by(Category) %>%
summarise(
avg_area = mean(GrLivArea),
avg_quality = mean(OverallQual),
avg_garage = mean(GarageArea),
avg_rooms = mean(TotRmsAbvGrd)
)
## # A tibble: 2 × 5
## Category avg_area avg_quality avg_garage avg_rooms
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 High 1814. 7.03 581. 7.20
## 2 Low 1219. 5.17 365. 5.84
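Houses priced above the median average about 1,814 sq ft of living area and an overall quality of 7.03, compared with 1,219 sq ft and 5.17 for houses at or below the median. They also have larger garages (581 vs. 365 sq ft) and more rooms (7.20 vs. 5.84). Price differences therefore align with size and construction quality together rather than with any single feature.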
df %>%
arrange(desc(GrLivArea)) %>%
select(GrLivArea, SalePrice, Neighborhood, OverallQual) %>%
head(5)
## GrLivArea SalePrice Neighborhood OverallQual
## 1 5642 160000 Edwards 10
## 2 4676 184750 Edwards 10
## 3 4476 745000 North Ridge 10
## 4 4316 755000 North Ridge 10
## 5 3627 625000 North Ridge 10
The largest houses do not consistently command the highest prices, so size alone is not a sufficient determinant of value. The two largest houses (5,642 and 4,676 sq ft, both in Edwards) sold for 160,000 and 184,750, while the next three (all in North Ridge) sold for 625,000 to 755,000. All five have an OverallQual of 10, so quality does not explain the gap; neighborhood is the visible difference in this table. The effect of size on price is therefore conditional rather than absolute, and the two Edwards houses are candidate outliers for any model of price on living area.
df <- df %>%
mutate(
Total_SF = GrLivArea + TotalBsmtSF,
Price_per_sqft = SalePrice / Total_SF
)
df %>%
select(Neighborhood, SalePrice, Total_SF, Price_per_sqft) %>%
arrange(desc(Price_per_sqft)) %>%
head(5)
## Neighborhood SalePrice Total_SF Price_per_sqft
## 1 Stone Brook 392000 2838 138.1254
## 2 North Ridge Heights 611657 4694 130.3061
## 3 North Ames 107500 827 129.9879
## 4 North Ridge Heights 582933 4556 127.9484
## 5 North Ames 106500 882 120.7483
The five highest prices per square foot come from two distinct kinds of house: properties in premium neighborhoods (Stone Brook and North Ridge Heights, with 2,838 to 4,694 sq ft of total area) and very small houses in North Ames (827 and 882 sq ft). A high price per square foot therefore does not simply track size or neighborhood prestige. One plausible explanation for the small houses is that the value of the lot and location is spread over very little floor area.
The ranking and comparison analysis reveals that house prices are not determined solely by size. While larger houses may have higher total prices, price efficiency and overall valuation are strongly influenced by location and construction quality. This reinforces that high-value properties are defined by a combination of factors rather than a single attribute.
# Total_SF and Price_per_sqft already exist from the previous step; add the remaining features.
df <- df %>%
  mutate(
    Price_per_room = SalePrice / TotRmsAbvGrd,
    Age = 2024 - YearBuilt
  )
summary(df$Price_per_sqft)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.61 60.33 69.51 69.81 78.93 138.13
Price per square foot ranges from 13.61 to 138.13, with a median of 69.51 and an interquartile range of 60.33 to 78.93. This spread indicates that property value depends not only on size but also on factors such as quality and location.
df %>%
mutate(AgeGroup = cut(Age, breaks = c(0, 20, 40, 60, 80, 150))) %>%
group_by(AgeGroup) %>%
summarise(count = n())
## # A tibble: 6 × 2
## AgeGroup count
## <fct> <int>
## 1 (0,20] 276
## 2 (20,40] 311
## 3 (40,60] 322
## 4 (60,80] 277
## 5 (80,150] 273
## 6 <NA> 1
Houses are spread fairly evenly across the age groups, with 273 to 322 houses in each of the five bins and a slight peak in the 40 to 60 year range. This mix of newer and older houses makes it possible to analyze how age interacts with other factors such as quality and location. The single <NA> row is a house whose age exceeds the 150-year upper break.
summary(df$Price_per_room)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6317 21943 26467 27909 32378 78500
Price per room ranges from 6,317 to 78,500, with a median of 26,467, so room count alone does not determine property value. Houses with fewer rooms can show a higher price per room if they are of superior quality or located in premium neighborhoods. This suggests that value is driven more by quality and efficient use of space than by the number of rooms.
summary(df$Total_SF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 2014 2479 2573 3008 11752
Total_SF (above-ground living area plus total basement area) ranges from 334 to 11,752 sq ft, with a median of 2,479. Most houses fall within a moderate size range (the middle half lies between 2,014 and 3,008 sq ft), while a few very large houses form a distinct upper segment. Large size alone does not guarantee higher value, which reinforces that space must be considered alongside quality and location.
The engineered features provide deeper insight into pricing dynamics, showing that efficiency-based measures such as price per square foot and price per room reveal variations that are not captured by total price alone. These features highlight that house value is influenced not just by size, but by how effectively space is utilized, along with quality and location.
df <- df %>%
mutate(Log_SalePrice = log(SalePrice))
# Original distribution
ggplot(df, aes(x = SalePrice)) +
geom_histogram(bins = 30, fill = "seagreen", color = "black") +
scale_x_continuous(labels = scales::comma) +
labs(
title = "Distribution of Sale Price",
x = "Sale Price",
y = "Frequency"
) +
theme_minimal()
# Log-transformed distribution
ggplot(df, aes(x = Log_SalePrice)) +
geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
labs(
title = "Distribution of Log(SalePrice)",
x = "Log(Sale Price)",
y = "Frequency"
) +
theme_minimal()
The distribution of SalePrice is positively skewed, with a concentration of houses in the lower price range and a long right tail of high-value properties. After a log transformation, the distribution is more symmetric and less influenced by extreme values. A more symmetric target with a compressed upper tail is generally better suited to linear modeling, although the histograms alone do not demonstrate constant variance.
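The visual impression can be backed by a number. The sketch below computes moment-based sample skewness before and after a log transform. It uses simulated log-normal prices rather than the Ames data, and `skewness()` is a small helper defined here for illustration, not a function from a package.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
# This is a hypothetical helper defined for this sketch.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
# Simulated right-skewed prices (log-normal); parameters are illustrative.
prices <- exp(rnorm(1000, mean = 12, sd = 0.4))

skew_raw <- skewness(prices)       # clearly positive for right-skewed data
skew_log <- skewness(log(prices))  # close to zero after the transform

print(c(raw = skew_raw, log = skew_log))
```

Applied to `df$SalePrice` and `df$Log_SalePrice`, the same helper would quantify how much of the skew the transformation removes.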
ggplot(df, aes(x = GrLivArea, y = SalePrice)) +
  # Draw the points first so the fitted line is not hidden beneath them.
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "red") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "SalePrice vs Living Area",
    x = "Living Area",
    y = "Sale Price"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot shows a clear positive relationship between living area and house price, indicating that larger houses tend to have higher prices. However, the spread of points increases for larger houses, suggesting that size alone does not fully explain price variation. Other factors such as quality and location also influence pricing.
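The widening spread for larger houses can be checked by comparing the residual spread of a linear fit for smaller and larger houses. The sketch below uses simulated data in which the noise grows with area; the coefficients and noise scale are assumptions for illustration only.

```r
set.seed(1)

# Simulated living areas and prices whose noise increases with area.
area  <- runif(500, 500, 4000)
price <- 50000 + 100 * area + rnorm(500, sd = 20 * area)

# Fit a simple linear model and extract residuals.
fit <- lm(price ~ area)
res <- resid(fit)

# Split houses at the median area and compare residual standard deviations.
half   <- ifelse(area <= median(area), "smaller", "larger")
spread <- tapply(res, half, sd)

print(round(spread))
```

A noticeably larger residual spread in the "larger" half is the numerical counterpart of the fan shape seen in a scatter plot, and it is one reason a log-transformed target is often preferred.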
ggplot(df, aes(x = factor(OverallQual), y = SalePrice, fill = factor(OverallQual))) +
geom_boxplot() +
scale_fill_manual(values = c(
"red",
"orangered",
"orange",
"gold",
"yellow",
"yellowgreen",
"green3",
"forestgreen",
"green4",
"darkgreen"
)) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "House Prices Across Quality Levels",
x = "Overall Quality",
y = "Sale Price"
) +
theme_minimal()
The boxplots show a clear upward relationship between overall house quality and sale price. Lower quality homes (red shades) are concentrated in lower price ranges, while higher quality homes (green shades) have significantly higher median prices and wider price variation. Several outliers are also visible in the higher quality categories, representing premium-priced properties in the housing market.
ggplot(df, aes(x = SalePrice)) +
geom_density(fill = "lightblue") +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Density Distribution of Sale Price",
x = "Sale Price",
y = "Density"
) +
theme_minimal()
The density plot confirms that house prices are concentrated in lower and moderate ranges, while a smaller number of expensive properties form a long upper tail. This reinforces the presence of positive skewness in the dataset.
numeric_df <- df %>%
select(where(is.numeric))
cor_matrix <- cor(numeric_df)
round(cor_matrix, 2)
## SalePrice GrLivArea OverallQual YearBuilt TotalBsmtSF GarageArea
## SalePrice 1.00 0.71 0.79 0.52 0.61 0.62
## GrLivArea 0.71 1.00 0.59 0.20 0.45 0.47
## OverallQual 0.79 0.59 1.00 0.57 0.54 0.56
## YearBuilt 0.52 0.20 0.57 1.00 0.39 0.48
## TotalBsmtSF 0.61 0.45 0.54 0.39 1.00 0.49
## GarageArea 0.62 0.47 0.56 0.48 0.49 1.00
## BedroomAbvGr 0.17 0.52 0.10 -0.07 0.05 0.07
## FullBath 0.56 0.63 0.55 0.47 0.32 0.41
## GarageCars 0.64 0.47 0.60 0.54 0.43 0.88
## TotRmsAbvGrd 0.53 0.83 0.43 0.10 0.29 0.34
## LotArea 0.26 0.26 0.11 0.01 0.26 0.18
## Total_SF 0.78 0.88 0.66 0.34 0.82 0.56
## Price_per_sqft 0.64 0.13 0.52 0.52 0.03 0.40
## Price_per_room 0.78 0.28 0.65 0.57 0.54 0.51
## Age -0.52 -0.20 -0.57 -1.00 -0.39 -0.48
## Log_SalePrice 0.95 0.70 0.82 0.59 0.61 0.65
## BedroomAbvGr FullBath GarageCars TotRmsAbvGrd LotArea Total_SF
## SalePrice 0.17 0.56 0.64 0.53 0.26 0.78
## GrLivArea 0.52 0.63 0.47 0.83 0.26 0.88
## OverallQual 0.10 0.55 0.60 0.43 0.11 0.66
## YearBuilt -0.07 0.47 0.54 0.10 0.01 0.34
## TotalBsmtSF 0.05 0.32 0.43 0.29 0.26 0.82
## GarageArea 0.07 0.41 0.88 0.34 0.18 0.56
## BedroomAbvGr 1.00 0.36 0.09 0.68 0.12 0.36
## FullBath 0.36 1.00 0.47 0.55 0.13 0.57
## GarageCars 0.09 0.47 1.00 0.36 0.15 0.53
## TotRmsAbvGrd 0.68 0.55 0.36 1.00 0.19 0.68
## LotArea 0.12 0.13 0.15 0.19 1.00 0.31
## Total_SF 0.36 0.57 0.53 0.68 0.31 1.00
## Price_per_sqft -0.16 0.26 0.44 0.05 0.10 0.10
## Price_per_room -0.27 0.30 0.52 -0.06 0.19 0.46
## Age 0.07 -0.47 -0.54 -0.10 -0.01 -0.34
## Log_SalePrice 0.21 0.59 0.68 0.53 0.26 0.77
## Price_per_sqft Price_per_room Age Log_SalePrice
## SalePrice 0.64 0.78 -0.52 0.95
## GrLivArea 0.13 0.28 -0.20 0.70
## OverallQual 0.52 0.65 -0.57 0.82
## YearBuilt 0.52 0.57 -1.00 0.59
## TotalBsmtSF 0.03 0.54 -0.39 0.61
## GarageArea 0.40 0.51 -0.48 0.65
## BedroomAbvGr -0.16 -0.27 0.07 0.21
## FullBath 0.26 0.30 -0.47 0.59
## GarageCars 0.44 0.52 -0.54 0.68
## TotRmsAbvGrd 0.05 -0.06 -0.10 0.53
## LotArea 0.10 0.19 -0.01 0.26
## Total_SF 0.10 0.46 -0.34 0.77
## Price_per_sqft 1.00 0.70 -0.52 0.63
## Price_per_room 0.70 1.00 -0.57 0.77
## Age -0.52 -0.57 1.00 -0.59
## Log_SalePrice 0.63 0.77 -0.59 1.00
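The correlation matrix shows strong positive relationships among the size and quality variables. GrLivArea, OverallQual, and Total_SF are all strongly associated with SalePrice. Several predictors are also highly correlated with each other (GarageCars and GarageArea at 0.88, GrLivArea and TotRmsAbvGrd at 0.83), and Age is an exact negative linear function of YearBuilt (-1.00), so these pairs carry largely redundant information.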
cor_matrix["SalePrice", ] %>%
sort(decreasing = TRUE)
## SalePrice Log_SalePrice OverallQual Total_SF Price_per_room
## 1.0000000 0.9483737 0.7909816 0.7789588 0.7758415
## GrLivArea Price_per_sqft GarageCars GarageArea TotalBsmtSF
## 0.7086245 0.6408187 0.6404092 0.6234314 0.6135806
## FullBath TotRmsAbvGrd YearBuilt LotArea BedroomAbvGr
## 0.5606638 0.5337232 0.5228973 0.2638434 0.1682132
## Age
## -0.5228973
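Log_SalePrice, Price_per_room, and Price_per_sqft are computed from SalePrice itself, so their high correlations are expected and say nothing about predictive value. Among the remaining variables, OverallQual (0.79), Total_SF (0.78), and GrLivArea (0.71) have the strongest positive correlations with SalePrice, which identifies construction quality and floor area as the strongest linear predictors of price in this set.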
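Features computed from the target can be dropped before ranking so that only genuine predictors are compared. The sketch below re-enters the correlations printed above as a named vector (so it runs without the data) and removes the target-derived entries; the vector name `target_derived` is introduced here for illustration.

```r
# Correlations with SalePrice, copied from the sorted output above.
sale_cor <- c(
  SalePrice = 1.0000000, Log_SalePrice = 0.9483737, OverallQual = 0.7909816,
  Total_SF = 0.7789588, Price_per_room = 0.7758415, GrLivArea = 0.7086245,
  Price_per_sqft = 0.6408187, GarageCars = 0.6404092, GarageArea = 0.6234314,
  TotalBsmtSF = 0.6135806, FullBath = 0.5606638, TotRmsAbvGrd = 0.5337232,
  YearBuilt = 0.5228973, LotArea = 0.2638434, BedroomAbvGr = 0.1682132,
  Age = -0.5228973
)

# Columns that are SalePrice itself or are calculated from it.
target_derived <- c("SalePrice", "Log_SalePrice", "Price_per_room", "Price_per_sqft")

# Keep only genuine predictors and rank them by correlation.
predictor_cor <- sort(sale_cor[!(names(sale_cor) %in% target_derived)],
                      decreasing = TRUE)

print(head(predictor_cor, 3))
```

In the analysis itself the same filter could be applied to `cor_matrix["SalePrice", ]` directly.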
corrplot(
cor_matrix,
method = "color",
type = "upper",
tl.cex = 0.7
)
The correlation heatmap provides a visual representation of relationships among numerical variables, making it easier to identify strong positive, weak, and negative correlations within the dataset.
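Strongly correlated predictor pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses a four-variable excerpt of the rounded matrix printed earlier; the 0.8 cutoff is an assumed threshold, not one taken from the original analysis.

```r
# Excerpt of the rounded correlation matrix shown earlier in the document.
vars <- c("GrLivArea", "TotRmsAbvGrd", "GarageArea", "GarageCars")
m <- matrix(c(
  1.00, 0.83, 0.47, 0.47,
  0.83, 1.00, 0.34, 0.36,
  0.47, 0.34, 1.00, 0.88,
  0.47, 0.36, 0.88, 1.00
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical cutoff for calling two predictors highly correlated.
threshold <- 0.8

# upper.tri() restricts the search to one triangle, so each pair appears once
# and the diagonal of ones is ignored.
idx <- which(abs(m) >= threshold & upper.tri(m), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)

print(high_pairs)
```

Pairs flagged this way are candidates for keeping only one member when fitting a linear model, since both carry much the same information.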
cor_matrix
## SalePrice GrLivArea OverallQual YearBuilt TotalBsmtSF
## SalePrice 1.0000000 0.7086245 0.7909816 0.52289733 0.61358055
## GrLivArea 0.7086245 1.0000000 0.5930074 0.19900971 0.45486820
## OverallQual 0.7909816 0.5930074 1.0000000 0.57232277 0.53780850
## YearBuilt 0.5228973 0.1990097 0.5723228 1.00000000 0.39145200
## TotalBsmtSF 0.6135806 0.4548682 0.5378085 0.39145200 1.00000000
## GarageArea 0.6234314 0.4689975 0.5620218 0.47895382 0.48666546
## BedroomAbvGr 0.1682132 0.5212695 0.1016764 -0.07065122 0.05044996
## FullBath 0.5606638 0.6300116 0.5505997 0.46827079 0.32372241
## GarageCars 0.6404092 0.4672474 0.6006707 0.53785009 0.43458483
## TotRmsAbvGrd 0.5337232 0.8254894 0.4274523 0.09558913 0.28557256
## LotArea 0.2638434 0.2631162 0.1058057 0.01422765 0.26083313
## Total_SF 0.7789588 0.8803240 0.6648303 0.33548845 0.82288840
## Price_per_sqft 0.6408187 0.1331439 0.5150323 0.51789555 0.03447252
## Price_per_room 0.7758415 0.2769877 0.6497751 0.56995718 0.53621843
## Age -0.5228973 -0.1990097 -0.5723228 -1.00000000 -0.39145200
## Log_SalePrice 0.9483737 0.7009267 0.8171844 0.58657024 0.61213398
## GarageArea BedroomAbvGr FullBath GarageCars TotRmsAbvGrd
## SalePrice 0.62343144 0.16821315 0.5606638 0.64040920 0.53372316
## GrLivArea 0.46899748 0.52126951 0.6300116 0.46724742 0.82548937
## OverallQual 0.56202176 0.10167636 0.5505997 0.60067072 0.42745234
## YearBuilt 0.47895382 -0.07065122 0.4682708 0.53785009 0.09558913
## TotalBsmtSF 0.48666546 0.05044996 0.3237224 0.43458483 0.28557256
## GarageArea 1.00000000 0.06525253 0.4056562 0.88247541 0.33782212
## BedroomAbvGr 0.06525253 1.00000000 0.3632520 0.08610644 0.67661994
## FullBath 0.40565621 0.36325198 1.0000000 0.46967204 0.55478425
## GarageCars 0.88247541 0.08610644 0.4696720 1.00000000 0.36228857
## TotRmsAbvGrd 0.33782212 0.67661994 0.5547843 0.36228857 1.00000000
## LotArea 0.18040276 0.11968991 0.1260306 0.15487074 0.19001478
## Total_SF 0.55846594 0.35945861 0.5744031 0.52960762 0.67880245
## Price_per_sqft 0.39757007 -0.16243932 0.2625206 0.44228129 0.05132142
## Price_per_room 0.51320440 -0.27149296 0.2954026 0.51801751 -0.06276533
## Age -0.47895382 0.07065122 -0.4682708 -0.53785009 -0.09558913
## Log_SalePrice 0.65088756 0.20904368 0.5947705 0.68062481 0.53442220
## LotArea Total_SF Price_per_sqft Price_per_room Age
## SalePrice 0.26384335 0.7789588 0.64081873 0.77584146 -0.52289733
## GrLivArea 0.26311617 0.8803240 0.13314387 0.27698773 -0.19900971
## OverallQual 0.10580574 0.6648303 0.51503230 0.64977513 -0.57232277
## YearBuilt 0.01422765 0.3354884 0.51789555 0.56995718 -1.00000000
## TotalBsmtSF 0.26083313 0.8228884 0.03447252 0.53621843 -0.39145200
## GarageArea 0.18040276 0.5584659 0.39757007 0.51320440 -0.47895382
## BedroomAbvGr 0.11968991 0.3594586 -0.16243932 -0.27149296 0.07065122
## FullBath 0.12603063 0.5744031 0.26252061 0.29540264 -0.46827079
## GarageCars 0.15487074 0.5296076 0.44228129 0.51801751 -0.53785009
## TotRmsAbvGrd 0.19001478 0.6788025 0.05132142 -0.06276533 -0.09558913
## LotArea 1.00000000 0.3068137 0.09586740 0.18617218 -0.01422765
## Total_SF 0.30681366 1.0000000 0.10331220 0.46235332 -0.33548845
## Price_per_sqft 0.09586740 0.1033122 1.00000000 0.70301207 -0.51789555
## Price_per_room 0.18617218 0.4623533 0.70301207 1.00000000 -0.56995718
## Age -0.01422765 -0.3354884 -0.51789555 -0.56995718 1.00000000
## Log_SalePrice 0.25731989 0.7732768 0.63296046 0.76769748 -0.58657024
## Log_SalePrice
## SalePrice 0.9483737
## GrLivArea 0.7009267
## OverallQual 0.8171844
## YearBuilt 0.5865702
## TotalBsmtSF 0.6121340
## GarageArea 0.6508876
## BedroomAbvGr 0.2090437
## FullBath 0.5947705
## GarageCars 0.6806248
## TotRmsAbvGrd 0.5344222
## LotArea 0.2573199
## Total_SF 0.7732768
## Price_per_sqft 0.6329605
## Price_per_room 0.7676975
## Age -0.5865702
## Log_SalePrice 1.0000000
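Several predictor pairs are strongly correlated with each other, which indicates potential multicollinearity: GrLivArea and Total_SF (0.88), GarageArea and GarageCars (0.88), GrLivArea and TotRmsAbvGrd (0.83), and TotalBsmtSF and Total_SF (0.82). Age and YearBuilt are perfectly collinear (-1.00), so only one of them can be used in a linear model. Features within each of these pairs carry largely overlapping information.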
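Pairwise correlations only hint at multicollinearity; a variance inflation factor (VIF) measures how well each predictor is explained by all the others. The sketch below computes VIFs with base R only. The helper `vif_base` and the simulated `demo` data frame are hypothetical stand-ins, not part of the original analysis.

```r
# VIF for predictor p is 1 / (1 - R^2), where R^2 comes from regressing p
# on the remaining predictors.
vif_base <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated stand-in data: Total_SF is built from GrLivArea, so the two overlap,
# while LotArea is generated independently.
set.seed(1)
n <- 200
area <- rnorm(n, mean = 1500, sd = 400)
demo <- data.frame(
  GrLivArea = area,
  Total_SF  = area + rnorm(n, mean = 1000, sd = 150),
  LotArea   = rnorm(n, mean = 10000, sd = 3000)
)

vifs <- vif_base(demo, c("GrLivArea", "Total_SF", "LotArea"))
print(round(vifs, 2))
```

On the real data the call would be along the lines of `vif_base(df, c("GrLivArea", "OverallQual", "GarageArea", "FullBath", "Age"))`. With only two predictors the VIF reduces to 1 / (1 - r²), so a correlation of 0.88 corresponds to a VIF of about 4.4.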
model1 <- lm(SalePrice ~ GrLivArea, data = df)
summary(model1)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462999 -29800 -1124 21957 339832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026 4480.755 4.144 3.61e-05 ***
## GrLivArea 107.130 2.794 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
ggplot(df, aes(x = GrLivArea, y = SalePrice)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Linear Regression: Living Area vs Sale Price",
x = "Living Area (sq ft)",
y = "Sale Price"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The simple linear regression model shows that living area has a significant positive effect on house price (p < 2e-16). The slope of 107.13 means that each additional square foot of living area is associated with an increase of about 107 in predicted sale price. The R² of 0.502 indicates that living area alone explains about half of the variation in price, with a residual standard error of 56,070.
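As a worked example of reading the fitted line, the sketch below plugs the coefficients reported by `summary(model1)` into the regression equation. The function name `predict_price` and the example living areas are illustrative choices.

```r
# Coefficients copied from the summary(model1) output above.
intercept <- 18569.026
slope <- 107.130  # change in predicted price per extra square foot

# Predicted sale price for a given above-ground living area (sq ft).
predict_price <- function(gr_liv_area) {
  intercept + slope * gr_liv_area
}

price_1500 <- predict_price(1500)
gain_500 <- predict_price(2000) - predict_price(1500)

print(price_1500)  # predicted price for a 1,500 sq ft house
print(gain_500)    # predicted difference for 500 extra sq ft
```

With the fitted model in memory, `predict(model1, newdata = data.frame(GrLivArea = 1500))` should give the same value up to rounding of the printed coefficients.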
model2 <- lm(
SalePrice ~ GrLivArea + OverallQual +
GarageArea + FullBath + Age,
data = df
)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + GarageArea +
## FullBath + Age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -424848 -21012 -2355 17693 295926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -44592.420 7753.719 -5.751 1.08e-08 ***
## GrLivArea 59.886 3.043 19.683 < 2e-16 ***
## OverallQual 23242.964 1160.471 20.029 < 2e-16 ***
## GarageArea 56.652 6.255 9.058 < 2e-16 ***
## FullBath -7174.235 2716.719 -2.641 0.00836 **
## Age -428.100 48.026 -8.914 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39550 on 1454 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.7522
## F-statistic: 886.6 on 5 and 1454 DF, p-value: < 2.2e-16
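The multiple regression model shows that house price is jointly influenced by several features, with OverallQual (t = 20.0) and GrLivArea (t = 19.7) emerging as the strongest predictors. Adjusted R² rises from 0.502 in the simple model to 0.752, and the residual standard error falls from 56,070 to 39,550. The FullBath coefficient is negative (-7,174) even though FullBath correlates positively with SalePrice (0.56); this likely reflects its overlap with GrLivArea (0.63) rather than a genuine negative effect.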
poly_model <- lm(
SalePrice ~ poly(GrLivArea, 2),
data = df
)
summary(poly_model)
##
## Call:
## lm(formula = SalePrice ~ poly(GrLivArea, 2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -321613 -30369 -876 22954 338146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180921 1459 124.038 < 2e-16 ***
## poly(GrLivArea, 2)1 2150288 55733 38.582 < 2e-16 ***
## poly(GrLivArea, 2)2 -241924 55733 -4.341 1.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55730 on 1457 degrees of freedom
## Multiple R-squared: 0.5085, Adjusted R-squared: 0.5078
## F-statistic: 753.7 on 2 and 1457 DF, p-value: < 2.2e-16
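The quadratic term in the polynomial model is statistically significant (p = 1.52e-05), so the relationship between living area and price is not perfectly linear, and its negative sign suggests that the price gain per additional square foot tapers off for larger houses. The practical improvement is small, however: R² rises only from 0.502 to 0.509 compared with the simple linear model.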
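The size of that improvement can be checked with a partial F-test for the added quadratic term. The sketch below computes it from the R² values and residual degrees of freedom reported in the summaries, so it runs without the data; the variable names are illustrative.

```r
# Values copied from the model summaries in this report.
r2_linear <- 0.5021487     # summary(model1)$r.squared
r2_quadratic <- 0.5085048  # summary(poly_model)$r.squared
df_resid <- 1457           # residual degrees of freedom of poly_model
extra_terms <- 1           # one added term (the quadratic)

# Partial F statistic for nested linear models.
f_stat <- ((r2_quadratic - r2_linear) / extra_terms) /
  ((1 - r2_quadratic) / df_resid)
p_value <- pf(f_stat, extra_terms, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

For a single added term the F statistic equals the square of that term's t value (-4.341 here), and `anova(model1, poly_model)` performs the same comparison directly on the fitted models.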
model_comparison <- data.frame(
Model = c("Simple Regression",
"Multiple Regression",
"Polynomial Regression"),
R_Squared = c(
summary(model1)$r.squared,
summary(model2)$r.squared,
summary(poly_model)$r.squared
),
Adjusted_R_Squared = c(
summary(model1)$adj.r.squared,
summary(model2)$adj.r.squared,
summary(poly_model)$adj.r.squared
)
)
model_comparison
## Model R_Squared Adjusted_R_Squared
## 1 Simple Regression 0.5021487 0.5018072
## 2 Multiple Regression 0.7530226 0.7521733
## 3 Polynomial Regression 0.5085048 0.5078302
The multiple regression model performs best overall, with the highest R² (0.753) and adjusted R² (0.752). Combining several predictors explains house price variation far more effectively than relying on a single variable. The polynomial model improves on simple linear regression only marginally (adjusted R² of 0.508 versus 0.502). All three figures are in-sample measures, so they describe fit on the training data rather than predictive accuracy on new houses.
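Because R² is measured on the same rows used for fitting, a hold-out split gives a more honest comparison. The sketch below is one possible approach, not part of the original analysis: `holdout_rmse` is a hypothetical helper, and `demo` is simulated data that reuses the report's column names so the block runs on its own.

```r
# Fit on a random training subset and report RMSE on the remaining rows.
# Note: set.seed() inside the function resets the global RNG state.
holdout_rmse <- function(formula, data, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit <- lm(formula, data = data[train_idx, ])
  test <- data[-train_idx, ]
  response <- all.vars(formula)[1]
  sqrt(mean((test[[response]] - predict(fit, newdata = test))^2))
}

# Simulated stand-in data (not the housing data).
set.seed(7)
n <- 300
demo <- data.frame(
  GrLivArea = runif(n, 600, 3000),
  OverallQual = sample(1:10, n, replace = TRUE)
)
demo$SalePrice <- 20000 + 60 * demo$GrLivArea +
  20000 * demo$OverallQual + rnorm(n, sd = 30000)

rmse_simple <- holdout_rmse(SalePrice ~ GrLivArea, demo)
rmse_multi <- holdout_rmse(SalePrice ~ GrLivArea + OverallQual, demo)
print(c(simple = rmse_simple, multiple = rmse_multi))
```

The same seed is used for every call, so each model is scored on the same test rows. On the real data, the three formulas from `model1`, `model2`, and `poly_model` could be passed with `data = df`.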
Q1 <- quantile(df$SalePrice, 0.25)
Q3 <- quantile(df$SalePrice, 0.75)
IQR_value <- Q3 - Q1
outliers <- df %>%
filter(
SalePrice < (Q1 - 1.5 * IQR_value) |
SalePrice > (Q3 + 1.5 * IQR_value)
)
nrow(outliers)
## [1] 61
summary(outliers$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 341000 372500 394617 425954 440000 755000
The IQR method flags 61 of the 1,460 sales (about 4.2%) as outliers. All of them lie above the upper fence, with prices from 341,000 to 755,000, and none fall below the lower fence. These unusually expensive properties may influence the regression coefficients, the residual distribution, and overall model accuracy.
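The fence calculation can be wrapped in a small reusable function. The sketch below is illustrative: `iqr_fences` is a hypothetical helper and the `prices` vector is made-up example data, not taken from the dataset.

```r
# Lower and upper outlier fences: Q1 - k * IQR and Q3 + k * IQR.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up prices with one obviously extreme value.
prices <- c(120000, 135000, 150000, 163000, 180000, 214000, 755000)

fences <- iqr_fences(prices)
flagged <- prices[prices < fences[["lower"]] | prices > fences[["upper"]]]

print(fences)
print(flagged)
```

Applied to the report's data, `iqr_fences(df$SalePrice)` would return the two thresholds used in the filter above.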
par(mfrow = c(2,2))
plot(model2)
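The four diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage) help evaluate regression assumptions such as linearity, normality of residuals, homoscedasticity, and the presence of influential observations. The residual summary of model2 already hints at heavy tails: the extremes (-424,848 and 295,926) are far larger than the quartiles (-21,012 and 17,693). The model performs reasonably well, but these assumptions are not perfectly satisfied.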
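The visual checks can be complemented with a few numbers. The sketch below is a minimal example using base R only; `diagnose_lm` is a hypothetical helper, the 4/n cutoff for Cook's distance is a common rule of thumb rather than a strict test, and the data are simulated so the block runs on its own.

```r
# Numeric companions to plot(model): residual normality and influence.
diagnose_lm <- function(fit) {
  res <- residuals(fit)
  cooks <- cooks.distance(fit)
  n <- length(res)
  list(
    shapiro_p = shapiro.test(res)$p.value,  # small p suggests non-normal residuals
    n_influential = sum(cooks > 4 / n),     # rule-of-thumb cutoff (assumption)
    max_cooks = max(cooks)
  )
}

# Simulated stand-in data whose noise grows with x (heteroscedastic).
set.seed(3)
n <- 150
x <- runif(n, 600, 3000)
y <- 20000 + 100 * x + rnorm(n, sd = 20 * x)
fit <- lm(y ~ x)

checks <- diagnose_lm(fit)
print(checks)
```

With the fitted model in memory, `diagnose_lm(model2)` would return the same three quantities for the multiple regression. `shapiro.test()` accepts between 3 and 5,000 observations, which covers this dataset.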
importance_summary <- data.frame(
Feature = c("OverallQual", "GrLivArea", "Total_SF"),
Importance = c("Very High", "Very High", "High")
)
importance_summary
## Feature Importance
## 1 OverallQual Very High
## 2 GrLivArea Very High
## 3 Total_SF High
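This table is a qualitative summary rather than a computed importance score. Across the correlation analysis and regression modeling, OverallQual and GrLivArea consistently emerge as the strongest predictors of house price. Total_SF ranks alongside them by correlation with SalePrice (0.78), although it was not included in the regression models. Together these results indicate that construction quality and effective living space play a dominant role in determining property value.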
The project demonstrates that house prices are strongly influenced by structural quality, living area, and location. Statistical analysis, visualization, feature engineering, correlation analysis, and regression modeling collectively reveal that property value is determined by a combination of size, quality, and market positioning rather than a single feature alone.
The regression models further confirm that several variables together explain house price variation more effectively than any single feature: adjusted R² rises from 0.50 with GrLivArea alone to 0.75 with five predictors. Overall, the project turns raw housing data into analytical insights using exploratory data analysis and regression modeling in R.
The project also demonstrates how statistical modeling techniques in R can support data-driven decision making in real estate valuation and housing market analysis.