setwd("D:/LPU/2nd Sem/R Programming for Data Analysis/HOUSE-PRICES-PREDICTION-PROJECT")
data <- read.csv("train.csv")
test_data <- read.csv("test.csv")
#-----------------------------
#LEVEL 1: UNDERSTANDING DATA
#-----------------------------
#Question 1.1
#What is the structure of the dataset (number of observations, variables, and data types)?
dim(data)
## [1] 1460   81
str(data)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
dim(test_data)
## [1] 1459   80
str(test_data)
## 'data.frame':    1459 obs. of  80 variables:
##  $ Id           : int  1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
##  $ MSSubClass   : int  20 20 60 60 120 60 20 60 20 20 ...
##  $ MSZoning     : chr  "RH" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  80 81 74 78 43 75 NA 63 85 70 ...
##  $ LotArea      : int  11622 14267 13830 9978 5005 10000 7980 8402 10176 8400 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "IR1" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "Corner" "Inside" "Inside" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "NAmes" "NAmes" "Gilbert" "Gilbert" ...
##  $ Condition1   : chr  "Feedr" "Norm" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "1Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  5 6 5 6 8 6 6 6 7 4 ...
##  $ OverallCond  : int  6 6 5 6 5 5 7 5 5 5 ...
##  $ YearBuilt    : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ YearRemodAdd : int  1961 1958 1998 1998 1992 1994 2007 1998 1990 1970 ...
##  $ RoofStyle    : chr  "Gable" "Hip" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
##  $ Exterior2nd  : chr  "VinylSd" "Wd Sdng" "VinylSd" "VinylSd" ...
##  $ MasVnrType   : chr  "None" "BrkFace" "None" "BrkFace" ...
##  $ MasVnrArea   : int  0 108 0 20 0 0 0 0 0 0 ...
##  $ ExterQual    : chr  "TA" "TA" "TA" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "CBlock" "CBlock" "PConc" "PConc" ...
##  $ BsmtQual     : chr  "TA" "TA" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "TA" ...
##  $ BsmtExposure : chr  "No" "No" "No" "No" ...
##  $ BsmtFinType1 : chr  "Rec" "ALQ" "GLQ" "GLQ" ...
##  $ BsmtFinSF1   : int  468 923 791 602 263 0 935 0 637 804 ...
##  $ BsmtFinType2 : chr  "LwQ" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  144 0 0 0 0 0 0 0 0 78 ...
##  $ BsmtUnfSF    : int  270 406 137 324 1017 763 233 789 663 0 ...
##  $ TotalBsmtSF  : int  882 1329 928 926 1280 763 1168 789 1300 882 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "TA" "TA" "Gd" "Ex" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  896 1329 928 926 1280 763 1187 789 1341 882 ...
##  $ X2ndFlrSF    : int  0 0 701 678 0 892 0 676 0 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  896 1329 1629 1604 1280 1655 1187 1465 1341 882 ...
##  $ BsmtFullBath : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ BsmtHalfBath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  1 1 2 2 2 2 2 2 1 1 ...
##  $ HalfBath     : int  0 1 1 1 0 1 0 1 1 0 ...
##  $ BedroomAbvGr : int  2 3 3 3 2 3 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ KitchenQual  : chr  "TA" "Gd" "TA" "Gd" ...
##  $ TotRmsAbvGrd : int  5 6 6 7 5 7 6 7 5 4 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 0 1 1 0 1 0 1 1 0 ...
##  $ FireplaceQu  : chr  NA NA "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ GarageYrBlt  : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ GarageFinish : chr  "Unf" "Unf" "Fin" "Fin" ...
##  $ GarageCars   : int  1 1 2 2 2 2 2 2 2 2 ...
##  $ GarageArea   : int  730 312 482 470 506 440 420 393 506 525 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  140 393 212 360 0 157 483 0 192 240 ...
##  $ OpenPorchSF  : int  0 36 34 36 82 84 21 75 0 0 ...
##  $ EnclosedPorch: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ScreenPorch  : int  120 0 0 0 144 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  "MnPrv" NA "MnPrv" NA ...
##  $ MiscFeature  : chr  NA "Gar2" NA NA ...
##  $ MiscVal      : int  0 12500 0 0 0 0 500 0 0 0 ...
##  $ MoSold       : int  6 6 3 6 1 4 3 5 2 4 ...
##  $ YrSold       : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Normal" ...
setdiff(colnames(data), colnames(test_data)) #lists columns present in train but absent from test
## [1] "SalePrice"
#The training set consists of 1460 observations and 81 variables, representing various structural, locational, and quality-related attributes of houses. The test set has 1459 observations and 80 variables; the only column it lacks is SalePrice, the target. The data include both numerical variables (e.g., area, price) and categorical variables (e.g., neighborhood, stored as character), allowing for comprehensive analysis.
#Question 1.2
#Why is it necessary to select a subset of variables for analysis?
colnames(data)
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "SalePrice"
#Although the dataset contains 81 variables, not all are directly relevant for price analysis. A subset of key variables is selected to focus on meaningful factors such as size, quality, and location. This improves interpretability and avoids unnecessary complexity.
#Question 1.3
#Create a cleaned dataset with relevant variables for analysis
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
df <- data %>% 
  select(SalePrice,
         GrLivArea,
         OverallQual,
         YearBuilt,
         TotalBsmtSF,
         GarageArea,
         BedroomAbvGr,
         FullBath,
         Neighborhood,
         GarageCars,
         TotRmsAbvGrd,
         LotArea) %>% 
  mutate(across(where(is.numeric), ~replace_na(., 0)))

test_df <- test_data %>% 
  select(GrLivArea,
         OverallQual,
         YearBuilt,
         TotalBsmtSF,
         GarageArea,
         BedroomAbvGr,
         FullBath,
         GarageCars,
         TotRmsAbvGrd,
         LotArea,
         Neighborhood) %>%
  mutate(across(where(is.numeric), ~replace_na(., 0)))


dim(df)
## [1] 1460   12
dim(test_df)
## [1] 1459   11
setdiff(colnames(df), colnames(test_df))
## [1] "SalePrice"
#A refined dataset containing key variables was created to ensure focused and meaningful analysis. This dataset retains the most influential features affecting house prices while eliminating redundant or less relevant variables.
#Question 1.4
#Are there any missing values in the cleaned dataset?
colSums(is.na(df))
##    SalePrice    GrLivArea  OverallQual    YearBuilt  TotalBsmtSF   GarageArea 
##            0            0            0            0            0            0 
## BedroomAbvGr     FullBath Neighborhood   GarageCars TotRmsAbvGrd      LotArea 
##            0            0            0            0            0            0
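#No missing values remain in the cleaned dataset. Note that the numeric columns were passed through replace_na(., 0) in Question 1.3, so any missing numeric value would now appear as 0 rather than NA; the character column Neighborhood was not imputed and has no missing values.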
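#A minimal sketch of how to see what the 0-imputation in Question 1.3 changes: count NAs per column before and after filling. The helper name na_report and the toy data frame are illustrative assumptions, not part of the original analysis.
na_report <- function(d) colSums(is.na(d))
toy <- data.frame(GarageArea   = c(548, NA, 608),
                  Neighborhood = c("CollgCr", "Veenker", NA))
na_report(toy)                                   #NA count per column before imputation
toy_filled <- toy
num_cols <- vapply(toy_filled, is.numeric, logical(1))
toy_filled[num_cols] <- lapply(toy_filled[num_cols], function(x) { x[is.na(x)] <- 0; x })
na_report(toy_filled)                            #numeric NAs are filled, character NAs remain
#On the real data, na_report(test_data[, colnames(test_df)]) would show the raw NA counts that the mutate() step replaced with 0.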
#Question 1.5
#What is the distribution of house prices?
summary(df$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
hist(df$SalePrice)

#The distribution of house prices is positively skewed, as the mean is higher than the median. This indicates that a small number of high-priced houses significantly influence the overall distribution. The wide range of values suggests substantial variability in housing prices.
#Question 1.6
#How can we address the skewness in SalePrice?
#We apply a log transformation to make the distribution more symmetric (closer to normal).
df <- df %>% 
  mutate(Log_SalePrice = log(SalePrice))
#Visual Comparison
par(mfrow=c(1,2)) 
hist(df$SalePrice, main = "Original Price (Skewed)", col="lightblue", xlab = "Price")
hist(df$Log_SalePrice, main = "Log Price (More Symmetric)", col = "lightgreen", xlab = "Log(Price)")

par(mfrow=c(1,1))
#The observed variability and skewness in house prices indicate that structural attributes such as living area and quality, along with locational factors like neighborhood, are likely to have a significant influence on property valuation. These relationships will be examined in detail in the subsequent analysis.
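#A hedged sketch for quantifying skewness rather than judging it from histograms alone. skewness_g1 is a hypothetical helper (base R has no built-in skewness function) that computes the moment-based sample skewness; the ten prices are the first SalePrice values shown by str(data) above, used only as a small illustration.
skewness_g1 <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5   #third central moment / variance^(3/2)
}
first_prices <- c(208500, 181500, 223500, 140000, 250000,
                  143000, 307000, 200000, 129900, 118000)
skewness_g1(first_prices)        #skewness on the original scale
skewness_g1(log(first_prices))   #skewness after the log transform
#On the full column, compare skewness_g1(df$SalePrice) with skewness_g1(df$Log_SalePrice); a value nearer 0 indicates a more symmetric distribution.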
#---------------------------------------------------
#LEVEL 2 - DATA EXTRACTION AND ANALYTICAL QUESTIONS
#---------------------------------------------------
#Question 2.1
#Which neighborhoods have the highest and lowest average house prices?
df %>% 
  group_by(Neighborhood) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(desc(avg_price))
## # A tibble: 25 × 2
##    Neighborhood avg_price
##    <chr>            <dbl>
##  1 NoRidge        335295.
##  2 NridgHt        316271.
##  3 StoneBr        310499 
##  4 Timber         242247.
##  5 Veenker        238773.
##  6 Somerst        225380.
##  7 ClearCr        212565.
##  8 Crawfor        210625.
##  9 CollgCr        197966.
## 10 Blmngtn        194871.
## # ℹ 15 more rows
#House prices show substantial variation across neighborhoods, with NoRidge, NridgHt, and StoneBr emerging as the highest-priced areas, each averaging above 310,000. The drop of roughly 68,000 from third place (StoneBr, 310,499) to fourth (Timber, 242,247) suggests a clear segmentation between premium and mid-range housing markets, indicating that location is a critical determinant of property value. The printed tibble is truncated to the top 10 of 25 neighborhoods, so the lowest-priced neighborhoods are not visible in this output.
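#The question also asks for the lowest-priced neighborhoods, which a 10-row tibble print does not show. A minimal base-R sketch for viewing both ends of the ranking; rank_by_mean is a hypothetical helper and the toy vectors are made-up values for illustration only.
rank_by_mean <- function(values, groups) {
  m <- vapply(split(values, groups), mean, numeric(1))   #named vector of group means
  m[order(m, decreasing = TRUE)]
}
toy_price <- c(100, 200, 300, 50, 70)
toy_group <- c("A", "A", "B", "C", "C")
ranked <- rank_by_mean(toy_price, toy_group)
head(ranked, 1)   #group with the highest mean
tail(ranked, 1)   #group with the lowest mean
#On the real data: ranked <- rank_by_mean(df$SalePrice, df$Neighborhood), then head(ranked, 3) and tail(ranked, 3).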
#Question 2.2
#What are the top 10 most expensive houses in the dataset?
df %>% 
  arrange(desc(SalePrice)) %>% 
  head(10)
##    SalePrice GrLivArea OverallQual YearBuilt TotalBsmtSF GarageArea
## 1     755000      4316          10      1994        2444        832
## 2     745000      4476          10      1996        2396        813
## 3     625000      3627          10      1995        1930        807
## 4     611657      2364           9      2009        2330        820
## 5     582933      2822           9      2008        1734       1020
## 6     556581      2868           9      2005        1992        716
## 7     555000      2402          10      2008        3094        672
## 8     538000      3279           8      2003        1650        841
## 9     501837      2234           9      2008        2216       1166
## 10    485000      3140           9      2008        1926        820
##    BedroomAbvGr FullBath Neighborhood GarageCars TotRmsAbvGrd LotArea
## 1             4        3      NoRidge          3           10   21535
## 2             4        3      NoRidge          3           10   15623
## 3             4        3      NoRidge          3           10   35760
## 4             2        2      NridgHt          3           11   12919
## 5             4        3      NridgHt          3           12   13891
## 6             4        3      StoneBr          3           11   16056
## 7             2        2      NridgHt          3           10   15431
## 8             4        3      StoneBr          3           12   53504
## 9             1        2      StoneBr          3            9   17423
## 10            4        3      NridgHt          3           11   13518
##    Log_SalePrice
## 1       13.53447
## 2       13.52114
## 3       13.34551
## 4       13.32393
## 5       13.27583
## 6       13.22957
## 7       13.22672
## 8       13.19561
## 9       13.12603
## 10      13.09190
#The top 10 most expensive houses are concentrated in high-end neighborhoods such as NoRidge, NridgHt, and StoneBr, reinforcing the importance of location in determining house prices. These properties consistently exhibit high overall quality ratings (mostly 9 and 10) and large living areas, along with greater garage capacity. This indicates that premium pricing is driven by a combination of superior construction quality, larger size, and desirable location rather than a single attribute.
#Question 2.3
#Are there extreme high-value outliers in house prices?
boxplot(df$SalePrice)

#The large gap between the maximum price (755,000) and the median price (~163,000) indicates the presence of extreme high-value outliers. These outliers represent luxury properties and significantly extend the upper range of the distribution. Their presence contributes to the right-skewed nature of house prices and affects aggregate measures such as the mean.
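#A minimal sketch of the usual 1.5 * IQR rule for flagging high outliers, using the quartiles printed by summary(df$SalePrice) in Question 1.5. upper_fence is a hypothetical helper; boxplot() derives its whiskers from hinges, which can differ slightly from these quartiles.
upper_fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)
fence <- upper_fence(129975, 214000)
fence   #340037.5
#On the real data, sum(df$SalePrice > fence) counts the houses priced above the fence.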
#Question 2.4
#What proportion of houses are priced above the average price?
mean(df$SalePrice > mean(df$SalePrice)) * 100
## [1] 38.35616
#Approximately 38.36% of houses are priced above the average value, indicating that the majority of houses fall below the mean. This supports the earlier observation of a right-skewed distribution, where a smaller proportion of high-value properties increases the average price disproportionately.
#Question 2.5
#Is there a strong relationship between house size (GrLivArea) and price?
cor(df$GrLivArea, df$SalePrice)
## [1] 0.7086245
#The correlation coefficient of approximately 0.71 indicates a strong positive relationship between living area and house price. This suggests that larger houses tend to command higher prices. However, since the correlation is not perfect, it also implies that other factors such as quality and neighborhood play a significant role in influencing property values.
#Overall, the analysis demonstrates that house prices are influenced by multiple interrelated factors. While living area shows a strong relationship with price, the concentration of high-value properties in specific neighborhoods and their consistently high quality ratings indicate that location and construction quality are equally critical determinants of housing prices.
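#Pearson correlation on raw prices is sensitive to skew and outliers. A hedged sketch comparing it with Spearman rank correlation and with Pearson on log price; cor_compare is a hypothetical helper, and the ten pairs are the first GrLivArea and SalePrice values shown by str(data), used only as a small illustration.
cor_compare <- function(x, y) {
  c(pearson     = cor(x, y),
    spearman    = cor(x, y, method = "spearman"),
    pearson_log = cor(x, log(y)))
}
first_area  <- c(1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077)
first_price <- c(208500, 181500, 223500, 140000, 250000,
                 143000, 307000, 200000, 129900, 118000)
cor_compare(first_area, first_price)
#On the full data: cor_compare(df$GrLivArea, df$SalePrice).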
#--------------------------------------
#LEVEL 3 - GROUPING & PATTERN ANALYSIS
#--------------------------------------
#Question 3.1
#How does house price vary with overall quality (OverallQual)?
df %>% 
  group_by(OverallQual) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(OverallQual)
## # A tibble: 10 × 2
##    OverallQual avg_price
##          <int>     <dbl>
##  1           1    50150 
##  2           2    51770.
##  3           3    87474.
##  4           4   108421.
##  5           5   133523.
##  6           6   161603.
##  7           7   207716.
##  8           8   274736.
##  9           9   367513.
## 10          10   438588.
#House prices increase consistently with overall quality, indicating a strong positive relationship between construction quality and property value. The increase is not linear: the step between adjacent levels grows from about 28,000 (5 to 6) to about 93,000 (8 to 9), then narrows to about 71,000 (9 to 10). Premium-quality houses therefore command disproportionately higher prices than mid- and low-quality houses, and overall quality emerges as one of the most influential factors in determining house prices.
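#The step sizes between consecutive quality levels can be checked with diff(). The values below are the rounded averages printed in the table above, so each difference is accurate only to within about a unit.
avg_by_qual <- c(50150, 51770, 87474, 108421, 133523,
                 161603, 207716, 274736, 367513, 438588)
steps <- diff(avg_by_qual)   #steps[i] is the rise from quality i to quality i + 1
steps
which.max(steps)             #position of the largest single-level increase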
#Question 3.2
#How does house price vary with number of bedrooms (BedroomAbvGr)?
df %>% 
  group_by(BedroomAbvGr) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(BedroomAbvGr)
## # A tibble: 8 × 2
##   BedroomAbvGr avg_price
##          <int>     <dbl>
## 1            0   221493.
## 2            1   173162.
## 3            2   158198.
## 4            3   181057.
## 5            4   220421.
## 6            5   180819.
## 7            6   143779 
## 8            8   200000
#The relationship between the number of bedrooms and house price is inconsistent and does not exhibit a clear increasing trend. While average prices rise from 2 to 4 bedrooms, they decline for 5 and 6 bedrooms, indicating a non-linear pattern. Houses with 0 bedrooms have the highest average (about 221,000), but group sizes are not shown, so the averages for rare counts (0, 6, and 8 bedrooms) may rest on very few houses. Bedroom count alone is therefore not a strong determinant of house value; factors such as overall quality, total living area, and neighborhood appear to have greater influence.
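#Group averages are easier to interpret next to group sizes. A minimal base-R sketch that reports both; mean_with_n is a hypothetical helper and the toy vectors are made-up values for illustration only.
mean_with_n <- function(values, groups) {
  parts <- split(values, groups)
  data.frame(group = names(parts),
             n     = lengths(parts),                      #houses per group
             avg   = vapply(parts, mean, numeric(1)),     #mean value per group
             row.names = NULL)
}
toy_price    <- c(10, 20, 30)
toy_bedrooms <- c(0, 3, 3)
mean_with_n(toy_price, toy_bedrooms)
#On the real data: mean_with_n(df$SalePrice, df$BedroomAbvGr); the dplyr equivalent adds n = n() inside summarise().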
#Question 3.3
#How does house price vary with number of bathrooms (FullBath)?
df %>% 
  group_by(FullBath) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(FullBath)
## # A tibble: 4 × 2
##   FullBath avg_price
##      <int>     <dbl>
## 1        0   165201.
## 2        1   134751.
## 3        2   213010.
## 4        3   347823.
#House prices generally increase with the number of full bathrooms. The exception is the 0-bathroom group, whose average (about 165,000) exceeds that of 1-bathroom houses (about 135,000); this group is likely small, although its size is not shown. From 1 to 3 bathrooms the average rises sharply, and houses with 3 full bathrooms average about 348,000, suggesting that additional bathrooms are associated with premium housing segments. The number of bathrooms is therefore a moderately strong indicator of house price.
#Question 3.4
#How does house price vary across different neighborhoods? (comparison view)
df %>% 
  group_by(Neighborhood) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(desc(avg_price))
## # A tibble: 25 × 2
##    Neighborhood avg_price
##    <chr>            <dbl>
##  1 NoRidge        335295.
##  2 NridgHt        316271.
##  3 StoneBr        310499 
##  4 Timber         242247.
##  5 Veenker        238773.
##  6 Somerst        225380.
##  7 ClearCr        212565.
##  8 Crawfor        210625.
##  9 CollgCr        197966.
## 10 Blmngtn        194871.
## # ℹ 15 more rows
#House prices vary significantly across neighborhoods, with NoRidge, NridgHt, and StoneBr emerging as the highest-priced areas, having average prices above 300,000. These neighborhoods clearly represent premium residential zones. In contrast, several other neighborhoods fall in the mid-range (~190,000–240,000), indicating a substantial drop from the top tier. This sharp difference suggests a strong location-based segmentation in the housing market, where a small number of neighborhoods command a significant price premium. Therefore, location plays a dominant role in determining house prices, alongside structural factors such as size and quality. The steep decline in average prices after the top three neighborhoods indicates that premium housing is concentrated in a limited number of locations rather than being evenly distributed across all areas.
#Question 3.5
#Does the age of the house (YearBuilt) affect its price?
df %>% 
  mutate(Age = 2010 - YearBuilt) %>% 
  group_by(Age) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(Age)
## # A tibble: 112 × 2
##      Age avg_price
##    <dbl>     <dbl>
##  1     0   394432 
##  2     1   269220 
##  3     2   348849.
##  4     3   255363.
##  5     4   251775.
##  6     5   229681.
##  7     6   210348.
##  8     7   227409.
##  9     8   226870.
## 10     9   242630 
## # ℹ 102 more rows
#Among the newest houses shown, average prices tend to fall as age increases, consistent with newer houses having higher market value. Age here is measured against a fixed reference year (2010) rather than each house's sale year, and only the first 10 of 112 age groups are printed, so the pattern for older houses is not visible in this output. The relationship is also not monotonic (for example, 2-year-old houses average more than 1-year-old houses), which suggests that age is not the sole determinant of price. Other factors such as construction quality, renovation, and location also influence valuation, so older houses can still command high prices if they are of high quality or located in premium neighborhoods.
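#With 112 distinct ages, per-year averages are noisy. A minimal sketch that groups ages into 10-year bands before averaging; age_band is a hypothetical helper, and the ten values are the first YearBuilt and SalePrice entries shown by str(data), used only as a small illustration.
age_band <- function(age, width = 10) floor(age / width) * width   #lower edge of the band
year_built <- c(2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 1939)
price      <- c(208500, 181500, 223500, 140000, 250000,
                143000, 307000, 200000, 129900, 118000)
band <- age_band(2010 - year_built)
tapply(price, band, mean)   #average price per 10-year age band
#On the real data: tapply(df$SalePrice, age_band(2010 - df$YearBuilt), mean).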
#-------------------------------------
#LEVEL 4 - SORTING & RANKING ANALYSIS
#-------------------------------------
#Question 4.1
#Rank neighborhoods based on average house price (highest to lowest)
df %>% 
  group_by(Neighborhood) %>% 
  summarise(avg_price = mean(SalePrice)) %>% 
  arrange(desc(avg_price))
## # A tibble: 25 × 2
##    Neighborhood avg_price
##    <chr>            <dbl>
##  1 NoRidge        335295.
##  2 NridgHt        316271.
##  3 StoneBr        310499 
##  4 Timber         242247.
##  5 Veenker        238773.
##  6 Somerst        225380.
##  7 ClearCr        212565.
##  8 Crawfor        210625.
##  9 CollgCr        197966.
## 10 Blmngtn        194871.
## # ℹ 15 more rows
#The ranking of neighborhoods based on average house price shows that NoRidge, NridgHt, and StoneBr consistently occupy the top positions, confirming their status as premium residential areas. These neighborhoods significantly outperform others in terms of average pricing, indicating a strong location-based price advantage. The ranking further highlights a clear hierarchy in the housing market, where a few neighborhoods dominate the high-value segment while the majority fall into mid- and lower-price categories. The presence of a distinct ranking order suggests that location not only influences price but also establishes a structured market segmentation among different neighborhoods.
#Question 4.2
#What are the top 5 houses based on living area (GrLivArea)?
df %>% 
  arrange(desc(GrLivArea)) %>% 
  head(5)
##   SalePrice GrLivArea OverallQual YearBuilt TotalBsmtSF GarageArea BedroomAbvGr
## 1    160000      5642          10      2008        6110       1418            3
## 2    184750      4676          10      2007        3138        884            3
## 3    745000      4476          10      1996        2396        813            4
## 4    755000      4316          10      1994        2444        832            4
## 5    625000      3627          10      1995        1930        807            4
##   FullBath Neighborhood GarageCars TotRmsAbvGrd LotArea Log_SalePrice
## 1        2      Edwards          2           12   63887      11.98293
## 2        3      Edwards          3           11   40094      12.12676
## 3        3      NoRidge          3           10   15623      13.52114
## 4        3      NoRidge          3           10   21535      13.53447
## 5        3      NoRidge          3           10   35760      13.34551
#The largest houses by living area are not necessarily the most expensive, indicating that size alone does not determine house price. The two largest houses (5,642 and 4,676 sq ft, both in Edwards) sold for 160,000 and 184,750, while the next three (all in NoRidge) sold for 625,000 to 755,000. All five have an overall quality rating of 10, so quality does not explain the gap; neighborhood differs, and the two Edwards sales may also be atypical transactions, which this subset of variables cannot confirm. The effect of living area on price is therefore conditional on other factors, and these two houses are candidates for outlier treatment before modeling.
#Question 4.3
#Which houses have the highest price per unit area (Price per sqft)?
df <- df %>% 
  mutate(
    Total_SF = GrLivArea + TotalBsmtSF,
    Price_per_sqft = SalePrice / Total_SF)
summary(df$Price_per_sqft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.61   60.33   69.51   69.81   78.93  138.13
df %>% 
  select(Neighborhood, SalePrice, Total_SF, Price_per_sqft) %>% 
  arrange(desc(Price_per_sqft)) %>% 
  head(5)
##   Neighborhood SalePrice Total_SF Price_per_sqft
## 1      StoneBr    392000     2838       138.1254
## 2      NridgHt    611657     4694       130.3061
## 3        NAmes    107500      827       129.9879
## 4      NridgHt    582933     4556       127.9484
## 5        NAmes    106500      882       120.7483
#The houses with the highest price per square foot are not the largest properties. The top five include one StoneBr house and two NridgHt houses, but also two small NAmes houses (827 and 882 sq ft of total space) that sold for under 110,000, so a high price per square foot arises both from premium location and from very small size. The two NridgHt houses also appear in the top-10 price list of Question 2.2 with an overall quality rating of 9; quality and bedroom counts for the other three are not shown in this output. This highlights that price per square foot rewards small or high-quality houses in desirable locations rather than sheer size.
#House price is not determined by a single factor, but by the combined effect of size, quality, and location, with quality and location often outweighing size in determining value efficiency.
#------------------------------
#LEVEL 5 - FEATURE ENGINEERING
#------------------------------
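#Consolidated creation of new metrics (Total_SF and Price_per_sqft were already created in Question 4.3 and are recomputed here unchanged)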
df <- df %>% 
  mutate(Total_SF       = GrLivArea + TotalBsmtSF,
         Price_per_sqft = SalePrice / Total_SF,
         Price_per_room = SalePrice / TotRmsAbvGrd,
         Age            = 2010 - YearBuilt)
df$Age[df$Age < 0] <- 0
#Question 5.1
#What is the price per square foot (based on total space) across houses?
summary(df$Price_per_sqft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.61   60.33   69.51   69.81   78.93  138.13
#The price per square foot ranges from approximately 13.6 to 138.1, indicating substantial variation in pricing efficiency across houses. The mean (69.81) and median (69.51) are nearly equal, suggesting a roughly symmetric distribution, and half of all houses fall between about 60 and 79. The minimum corresponds to the largest house in the dataset (160,000 for 11,752 sq ft, from Question 4.2), while the maximum is the StoneBr house from Question 4.3. This confirms that house prices are not determined solely by size but are also influenced by factors such as location and construction quality.
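#A worked check of the formula Price_per_sqft = SalePrice / (GrLivArea + TotalBsmtSF), using the largest house from Question 4.2 (SalePrice 160000, GrLivArea 5642, TotalBsmtSF 6110). price_per_sqft is a hypothetical helper name.
price_per_sqft <- function(price, gr_liv_area, bsmt_sf) price / (gr_liv_area + bsmt_sf)
5642 + 6110                                    #11752, the maximum Total_SF
round(price_per_sqft(160000, 5642, 6110), 2)   #13.61, the minimum in the summary above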
#Question 5.2
#How does the age of the house vary, and what does it indicate about the dataset?
summary(df$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   37.00   38.73   56.00  138.00
#The age of houses in the dataset, measured relative to 2010, ranges from 0 to 138 years, indicating a wide mix of newly constructed and very old properties. The median of 37 years means that half of the houses were at least 37 years old. The broad distribution of ages provides a strong basis for analyzing how property age influences house prices. While newer houses are generally expected to have higher value due to modern construction and amenities, the presence of older houses with competitive pricing indicates that factors such as renovation, quality, and location also play a crucial role.
#Question 5.3
#What is the price per room and how does it vary across houses?
summary(df$Price_per_room)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6317   21943   26467   27909   32378   78500
#The price per room ranges from approximately 6,317 to 78,500, indicating substantial variation in how value is distributed across rooms in different houses. The median value of around 26,467 suggests that a typical room contributes moderately to the overall house price. However, the wide range between minimum and maximum values highlights that houses with a similar number of rooms can have significantly different valuations. This confirms that room count alone is not a strong determinant of house price, and that factors such as overall quality, location, and total living area play a more influential role. Extremely high price-per-room values are likely associated with premium properties where each room contributes significantly to overall value, while lower values may correspond to larger houses with more rooms but lower pricing efficiency.
#Question 5.4
#What is the total usable space (Total_SF) of a house?
#(Note: Total_SF was first calculated in Question 4.3 and recomputed at the start of LEVEL 5)
summary(df$Total_SF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    2014    2479    2573    3008   11752
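#Total_SF (above-ground living area plus basement area) ranges from 334 to 11,752 square feet. The median of about 2,479 sq ft suggests most houses are moderate in size. The maximum belongs to the Edwards house from Question 4.2 (5,642 + 6,110 sq ft), which sold for only 160,000, so the largest house in the dataset is not a luxury property by price.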