The aim of this article is to apply the PCA algorithm for dimensionality reduction to housing market data.
PCA stands for Principal Component Analysis, and it is a statistical technique used for analyzing datasets in order to identify patterns and relationships between variables.
The goal of PCA is to reduce the dimensionality of a dataset by finding a smaller set of variables, called principal components, that capture the majority of the variation in the original data. These principal components are linear combinations of the original variables, and they are chosen such that they are orthogonal to each other and ordered by the amount of variance they explain.
In other words, PCA is a method for transforming a high-dimensional dataset into a lower-dimensional space while retaining as much of the original information as possible. This can be useful for a variety of applications, such as data visualization, data compression, and feature selection in machine learning.
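As a minimal illustration of the idea (a sketch using R's built-in iris measurements rather than the housing data analyzed below), a standardized PCA of four correlated variables shows how the components are ordered by explained variance and how each one is a linear combination of the original columns:
toy_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)   # toy example on a built-in dataset
summary(toy_pca)              # proportion of variance explained by each component
round(toy_pca$rotation, 2)    # loadings: the weights of the original variables in each component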
The dataset was downloaded from Kaggle and contains housing market data for Sydney and Melbourne. It can be found at the following link: https://www.kaggle.com/datasets/shree1992/housedata?datasetId=46927
The dataset also contains non-numerical variables, which are not of interest in this project. One of the first steps is therefore to remove those variables; the next is to remove rows with NAs. I also deleted the binary variable “waterfront”, which could disrupt the PCA algorithm. There were also observations where the price of the property was equal to 0, so I filtered the dataset to exclude them. Additionally, I observed that a large number of observations have yr_renovated equal to zero, so I deleted that column from the dataset as well.
After these operations the dataset contains 11 variables:
library(dplyr)
library(DT)

# Load the raw data and keep only the numeric variables
df <- read.csv("./data/data.csv")
df <- df %>% select(where(is.numeric))

# Remove rows with missing values and drop the binary "waterfront" variable
df <- na.omit(df)
df <- subset(df, select = -c(waterfront))

# Keep only observations with a positive price
df <- df %>%
  filter(price > 0)

# How many observations have yr_renovated equal to 0?
df %>%
  filter(yr_renovated == 0) %>%
  count()
## n
## 1 2706
# Since most values of yr_renovated are 0, drop that column as well
df <- subset(df, select = -c(yr_renovated))

# Interactive preview of the cleaned dataset
datatable(df, options = list(
  searching = FALSE,
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20),
  scrollX = TRUE,
  scrollCollapse = TRUE
))
summary(df)
## price bedrooms bathrooms sqft_living
## Min. : 7800 Min. :0.000 Min. :0.000 Min. : 370
## 1st Qu.: 326264 1st Qu.:3.000 1st Qu.:1.750 1st Qu.: 1460
## Median : 465000 Median :3.000 Median :2.250 Median : 1970
## Mean : 557906 Mean :3.395 Mean :2.155 Mean : 2132
## 3rd Qu.: 657500 3rd Qu.:4.000 3rd Qu.:2.500 3rd Qu.: 2610
## Max. :26590000 Max. :9.000 Max. :8.000 Max. :13540
## sqft_lot floors view condition
## Min. : 638 Min. :1.000 Min. :0.0000 Min. :1.000
## 1st Qu.: 5000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:3.000
## Median : 7680 Median :1.500 Median :0.0000 Median :3.000
## Mean : 14835 Mean :1.512 Mean :0.2347 Mean :3.449
## 3rd Qu.: 10978 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :1074218 Max. :3.500 Max. :4.0000 Max. :5.000
## sqft_above sqft_basement yr_built
## Min. : 370 Min. : 0.0 Min. :1900
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median :1590 Median : 0.0 Median :1976
## Mean :1822 Mean : 310.2 Mean :1971
## 3rd Qu.:2300 3rd Qu.: 600.0 3rd Qu.:1997
## Max. :9410 Max. :4820.0 Max. :2014
From the summary statistics above we can deduce that there are likely many outliers. In the next step I remove them using the z-score method.
The z-score method is a statistical technique used to standardize and normalize data by calculating how many standard deviations a given data point is from the mean of a dataset. The z-score formula is:
z = (x - μ) / σ
Where x is the value of the data point, μ is the mean of the dataset, and σ is its standard deviation.
A z-score of 0 means that the data point is exactly at the mean, a z-score of 1 means that the data point is one standard deviation above the mean, and a z-score of -1 means that the data point is one standard deviation below the mean.
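For example, with purely illustrative numbers (not values from the dataset), a price of 800,000 in a sample with mean 550,000 and standard deviation 250,000 has a z-score of 1:
x     <- 800000   # hypothetical observation
mu    <- 550000   # hypothetical mean
sigma <- 250000   # hypothetical standard deviation
(x - mu) / sigma  # z = 1: one standard deviation above the mean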
In the code below I work with absolute z-scores and use 3 as the cut-off value, which means that every observation in the cleaned dataset lies within three standard deviations of the mean.
# Compute the absolute z-score of every value
z_scores <- abs(scale(df))
z_scores <- as.data.frame(z_scores)
colnames(z_scores) <- c("price_Z", "bedrooms_Z", "bathrooms_Z", "sqft_living_Z", "sqft_lot_Z",
                        "floors_Z", "view_Z", "condition_Z", "sqft_above_Z", "sqft_basement_Z",
                        "yr_built_Z")

# Attach the z-score columns (columns 12 to 22) to the original variables
df_Z <- cbind(df, z_scores)
rm(z_scores)

# Mark every value with an absolute z-score above 3 as NA ...
for (i in 12:ncol(df_Z)) {
  df_Z[df_Z[, i] > 3, i] <- NA
}

# ... then drop the affected rows and remove the helper z-score columns
df_Z <- na.omit(df_Z)
df_Z <- df_Z[, -c(12:22)]
# Number of observations remaining after outlier removal
count(df_Z)
## n
## 1 4189
library(corrplot)

# Correlation matrix of the cleaned variables
cor_w <- cor(df_Z)
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
Choosing the proper number of components in PCA is an important step, as it can have a significant impact on the interpretation of the results.
The Kaiser rule is a commonly used method for selecting the optimal number of components in PCA. The rule states that only components with eigenvalues greater than 1 should be retained, as these components explain more variance than a single original variable.
PCA is based on the covariance matrix of the data, and the covariance matrix is sensitive to the scale of the variables. To ensure that all variables are on the same scale and carry equal weight in the analysis, I standardize the data.
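As a quick sketch of what the standardization does: scale() centers each column at zero and rescales it to unit standard deviation, which is the same transformation prcomp() applies internally when called with center = TRUE and scale = TRUE.
df_scaled <- scale(df_Z)              # subtract the column mean, divide by the column standard deviation
round(colMeans(df_scaled), 10)        # all column means are (numerically) 0
round(apply(df_scaled, 2, sd), 10)    # all column standard deviations are 1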
library(factoextra)
library(gridExtra)
# PCA on the standardized (centered and scaled) data
pca <- prcomp(df_Z, center = TRUE, scale = TRUE)

# Eigenvalues of the correlation matrix, used for the Kaiser rule
eigen(cor_w)$values
## [1] 3.934009e+00 1.996474e+00 1.068024e+00 9.748807e-01 8.891597e-01
## [6] 6.374111e-01 5.786156e-01 3.932263e-01 2.901358e-01 2.380648e-01
## [11] -5.551115e-17
fviz_eig(pca, choice = "eigenvalue", addlabels = TRUE, main = "Eigenvalues")
According to the Kaiser rule, only the first three eigenvalues (3.93, 2.00 and 1.07) are greater than 1, so the proper number of components is 3. The last eigenvalue is numerically zero, which points to an exact linear dependence among the variables (most likely sqft_living being the sum of sqft_above and sqft_basement).
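The same check can be done programmatically (a small sketch using the correlation matrix computed above):
sum(eigen(cor_w)$values > 1)   # number of eigenvalues above 1; the Kaiser rule suggests 3 components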
Now I would like to know how much variance is explained by the components. For this purpose I will use the proportion of variance and a scree plot. A scree plot is a graphical representation of the eigenvalues of the principal components in a PCA; it is used to visualize the amount of variance explained by each principal component.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9834 1.4130 1.03345 0.98736 0.94295 0.79838 0.7607
## Proportion of Variance 0.3576 0.1815 0.09709 0.08863 0.08083 0.05795 0.0526
## Cumulative Proportion 0.3576 0.5391 0.63623 0.72485 0.80569 0.86363 0.9162
## PC8 PC9 PC10 PC11
## Standard deviation 0.62708 0.53864 0.48792 1.238e-15
## Proportion of Variance 0.03575 0.02638 0.02164 0.000e+00
## Cumulative Proportion 0.95198 0.97836 1.00000 1.000e+00
fviz_eig(pca, addlabels = TRUE)
The results show that PC1 explains 35.8% of the variance, PC2 explains 18.2% and PC3 explains 9.7%. To explain about 63% of the variance, three components are needed.
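These proportions can also be recomputed by hand from the component standard deviations stored in the prcomp object (a short sketch; the variance of each component is the square of its standard deviation):
var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
round(var_explained[1:3], 3)                    # roughly 0.358, 0.182 and 0.097
round(cumsum(var_explained)[3], 3)              # roughly 0.636 for the first three components together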
The following plot displays how much each variable contributes to the principal components, with red colors indicating a greater contribution.
We can observe that the variables sqft_living, sqft_above and bathrooms have the most influence on the principal components. In contrast, the variables sqft_lot and view have the least influence.
fviz_pca_var(pca, col.var = "contrib") +
  scale_color_gradient2(low = "green", mid = "darkgoldenrod1",
                        high = "red", midpoint = 11)
A loading plot is a graph that shows the relationship between the variables and the principal components. Each variable is represented by a vector, and the direction and length of the vector indicate the strength and direction of the variable’s relationship with each principal component.
Below are shown two loading plots: one for PC1 and PC2, and one for PC2 and PC3. The contribution of each variable to the corresponding principal components is indicated by its color; the scale is the same as in the chart above, from green through yellow to red, with red indicating the greatest contribution.
The first graph shows that the variables sqft_living, sqft_above and bathrooms have the greatest influence on PC1 and PC2, and the variables view and sqft_lot the least.
The second graph shows that variables sqft_lot, sqft_basement have the greatest influence on PC2 and PC3, and variables bathrooms and sqft_above the least.
fviz_pca_var(pca, col.var = "contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title = "Variables loadings for PC1 and PC2", x = "PC1", y = "PC2") +
  scale_color_gradient2(low = "green", mid = "darkgoldenrod1",
                        high = "red", midpoint = 11)

fviz_pca_var(pca, col.var = "contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title = "Variables loadings for PC2 and PC3", x = "PC2", y = "PC3") +
  scale_color_gradient2(low = "green", mid = "darkgoldenrod1",
                        high = "red", midpoint = 11)
Below are shown plots of the contributions of the variables to the first three components; each bar gives the percentage contribution of a variable to the given component.
# Contributions of the variables to PC1, PC2 and PC3, displayed side by side
PC1 <- fviz_contrib(pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(pca, choice = "var", axes = 3)
grid.arrange(PC1, PC2, PC3, ncol = 3)
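If contributions are defined in the usual way, as squared loadings expressed as a percentage of each component (the convention I assume factoextra follows for a standardized PCA), they can be reproduced directly from the rotation matrix:
contrib <- 100 * pca$rotation^2   # columns of the rotation matrix have unit length, so each column sums to 100
round(contrib[, 1:3], 1)          # percentage contribution of each variable to PC1, PC2 and PC3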
The main goal of this article was to find out whether the housing market dataset could be described by a smaller number of dimensions. The results showed that it is possible to reduce the number of dimensions from 11 to 3.
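In practice the reduced representation is simply the matrix of scores on the first three components, which can be extracted from the prcomp object and used in place of the original 11 variables:
df_reduced <- as.data.frame(pca$x[, 1:3])   # scores of each observation on PC1, PC2 and PC3
dim(df_reduced)                             # 4189 observations described by 3 dimensions instead of 11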