The aim of this article is to apply the PCA algorithm for dimensionality reduction to housing market data.
PCA stands for Principal Component Analysis, and it is a statistical technique used for analyzing datasets in order to identify patterns and relationships between variables.
The goal of PCA is to reduce the dimensionality of a dataset by finding a smaller set of variables, called principal components, that capture the majority of the variation in the original data. These principal components are linear combinations of the original variables, and they are chosen such that they are orthogonal to each other and ordered by the amount of variance they explain.
In other words, PCA is a method for transforming a high-dimensional dataset into a lower-dimensional space while retaining as much of the original information as possible. This can be useful for a variety of applications, such as data visualization, data compression, and feature selection in machine learning.
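As a minimal illustration of the idea (a sketch using R's built-in iris measurements rather than the housing data analyzed below), a standardized PCA of four correlated variables shows how the components are ordered by explained variance and how each one is a linear combination of the original columns:
toy_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)   # toy example on a built-in dataset
summary(toy_pca)              # proportion of variance explained by each component
round(toy_pca$rotation, 2)    # loadings: the weights of the original variables in each component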
The dataset was downloaded from Kaggle and contains housing market data for Sydney and Melbourne. It can be found at the following link: https://www.kaggle.com/datasets/shree1992/housedata?datasetId=46927
The dataset also contains non-numerical variables, which are not of interest in this project. One of the first steps is therefore to remove those variables; the next is to remove rows with NAs. I also deleted the binary variable “waterfront”, which could disrupt the PCA algorithm. There were also observations where the price of the property was equal to 0, so I filtered the dataset to exclude them. Additionally, I observed that a large number of observations have yr_renovated equal to zero, so I deleted that column from the dataset as well.
After these operations the dataset contains 11 variables:
library(dplyr)
library(DT)

# Load the raw data and keep only the numeric variables
df <- read.csv("./data/data.csv")
df <- df %>% select(where(is.numeric))

# Remove rows with missing values and drop the binary "waterfront" variable
df <- na.omit(df)
df <- subset(df, select = -c(waterfront))

# Keep only observations with a positive price
df <- df %>%
  filter(price > 0)

# How many observations have yr_renovated equal to 0?
df %>%
  filter(yr_renovated == 0) %>%
  count()
## n
## 1 2706
# Since most values of yr_renovated are 0, drop that column as well
df <- subset(df, select = -c(yr_renovated))

# Interactive preview of the cleaned dataset
datatable(df, options = list(
  searching = FALSE,
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20),
  scrollX = TRUE,
  scrollCollapse = TRUE
))
summary(df)
## price bedrooms bathrooms sqft_living
## Min. : 7800 Min. :0.000 Min. :0.000 Min. : 370
## 1st Qu.: 326264 1st Qu.:3.000 1st Qu.:1.750 1st Qu.: 1460
## Median : 465000 Median :3.000 Median :2.250 Median : 1970
## Mean : 557906 Mean :3.395 Mean :2.155 Mean : 2132
## 3rd Qu.: 657500 3rd Qu.:4.000 3rd Qu.:2.500 3rd Qu.: 2610
## Max. :26590000 Max. :9.000 Max. :8.000 Max. :13540
## sqft_lot floors view condition
## Min. : 638 Min. :1.000 Min. :0.0000 Min. :1.000
## 1st Qu.: 5000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:3.000
## Median : 7680 Median :1.500 Median :0.0000 Median :3.000
## Mean : 14835 Mean :1.512 Mean :0.2347 Mean :3.449
## 3rd Qu.: 10978 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :1074218 Max. :3.500 Max. :4.0000 Max. :5.000
## sqft_above sqft_basement yr_built
## Min. : 370 Min. : 0.0 Min. :1900
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median :1590 Median : 0.0 Median :1976
## Mean :1822 Mean : 310.2 Mean :1971
## 3rd Qu.:2300 3rd Qu.: 600.0 3rd Qu.:1997
## Max. :9410 Max. :4820.0 Max. :2014
From the summary statistics above we can deduce that there are likely many outliers. In the next step I remove them using the z-score method.
The z-score method is a statistical technique used to standardize and normalize data by calculating how many standard deviations a given data point is from the mean of a dataset. The z-score formula is:
z = (x - μ) / σ
Where x is the value of the data point, μ is the mean of the dataset, and σ is its standard deviation.
A z-score of 0 means that the data point is exactly at the mean, a z-score of 1 means that the data point is one standard deviation above the mean, and a z-score of -1 means that the data point is one standard deviation below the mean.
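For example, with purely illustrative numbers (not values from the dataset), a price of 800,000 in a sample with mean 550,000 and standard deviation 250,000 has a z-score of 1:
x     <- 800000   # hypothetical observation
mu    <- 550000   # hypothetical mean
sigma <- 250000   # hypothetical standard deviation
(x - mu) / sigma  # z = 1: one standard deviation above the mean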
In the code below I work with absolute z-scores and use 3 as the cut-off value, which means that every observation in the cleaned dataset lies within three standard deviations of the mean.
# Compute the absolute z-score of every value
z_scores <- abs(scale(df))
z_scores <- as.data.frame(z_scores)
colnames(z_scores) <- c("price_Z", "bedrooms_Z", "bathrooms_Z", "sqft_living_Z", "sqft_lot_Z",
                        "floors_Z", "view_Z", "condition_Z", "sqft_above_Z", "sqft_basement_Z",
                        "yr_built_Z")

# Attach the z-score columns (columns 12 to 22) to the original variables
df_Z <- cbind(df, z_scores)
rm(z_scores)

# Mark every value with an absolute z-score above 3 as NA ...
for (i in 12:ncol(df_Z)) {
  df_Z[df_Z[, i] > 3, i] <- NA
}

# ... then drop the affected rows and remove the helper z-score columns
df_Z <- na.omit(df_Z)
df_Z <- df_Z[, -c(12:22)]
# Number of observations remaining after outlier removal
count(df_Z)
## n
## 1 4189
library(corrplot)

# Correlation matrix of the cleaned variables
cor_w <- cor(df_Z)
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
Choosing the proper number of components in PCA is an important step, as it can have a significant impact on the interpretation of the results.
The Kaiser rule is a commonly used method for selecting the optimal number of components in PCA. The rule states that only components with eigenvalues greater than 1 should be retained, as these components explain more variance than a single original variable.
PCA is based on the covariance matrix of the data, and the covariance matrix is sensitive to the scale of the variables. To ensure that all variables are on the same scale and carry equal weight in the analysis, I standardize the data.
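As a quick sketch of what the standardization does: scale() centers each column at zero and rescales it to unit standard deviation, which is the same transformation prcomp() applies internally when called with center = TRUE and scale = TRUE.
df_scaled <- scale(df_Z)              # subtract the column mean, divide by the column standard deviation
round(colMeans(df_scaled), 10)        # all column means are (numerically) 0
round(apply(df_scaled, 2, sd), 10)    # all column standard deviations are 1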
library(factoextra)
library(gridExtra)
# PCA on the standardized (centered and scaled) data
pca <- prcomp(df_Z, center = TRUE, scale = TRUE)

# Eigenvalues of the correlation matrix, used for the Kaiser rule
eigen(cor_w)$values
## [1] 3.934009e+00 1.996474e+00 1.068024e+00 9.748807e-01 8.891597e-01
## [6] 6.374111e-01 5.786156e-01 3.932263e-01 2.901358e-01 2.380648e-01
## [11] -5.551115e-17
fviz_eig(pca, choice = "eigenvalue", addlabels = TRUE, main = "Eigenvalues")
According to the Kaiser rule, only the first three eigenvalues (3.93, 2.00 and 1.07) are greater than 1, so the proper number of components is 3. The last eigenvalue is numerically zero, which points to an exact linear dependence among the variables (most likely sqft_living being the sum of sqft_above and sqft_basement).
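The same check can be done programmatically (a small sketch using the correlation matrix computed above):
sum(eigen(cor_w)$values > 1)   # number of eigenvalues above 1; the Kaiser rule suggests 3 components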
Now I would like to know how much variance is explained by the components. For this purpose I will use the proportion of variance and a scree plot. A scree plot is a graphical representation of the eigenvalues of the principal components in a PCA; it is used to visualize the amount of variance explained by each principal component.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9834 1.4130 1.03345 0.98736 0.94295 0.79838 0.7607
## Proportion of Variance 0.3576 0.1815 0.09709 0.08863 0.08083 0.05795 0.0526
## Cumulative Proportion 0.3576 0.5391 0.63623 0.72485 0.80569 0.86363 0.9162
## PC8 PC9 PC10 PC11
## Standard deviation 0.62708 0.53864 0.48792 1.238e-15
## Proportion of Variance 0.03575 0.02638 0.02164 0.000e+00
## Cumulative Proportion 0.95198 0.97836 1.00000 1.000e+00
fviz_eig(pca, addlabels = TRUE)
The results show that PC1 explains 35.8% of the variance, PC2 explains 18.2% and PC3 explains 9.7%. To explain about 63% of the variance, three components are needed.
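These proportions can also be recomputed by hand from the component standard deviations stored in the prcomp object (a short sketch; the variance of each component is the square of its standard deviation):
var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
round(var_explained[1:3], 3)                    # roughly 0.358, 0.182 and 0.097
round(cumsum(var_explained)[3], 3)              # roughly 0.636 for the first three components together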
The following plot displays how much each variable contributes to the principal components, with red colors indicating a greater contribution.
We can observe that the variables sqft_living, sqft_above and bathrooms have the most influence on the principal components. In contrast, the variables sqft_lot and view have the least influence.
fviz_pca_var(pca, col.var = "contrib") +
  scale_color_gradient2(low = "green", mid = "darkgoldenrod1",
                        high = "red", midpoint = 11)
A loading plot is a graph that shows the relationship between the variables and the principal components. Each variable is represented by a vector, and the direction and length of the vector indicate the strength and direction of the variable’s relationship with each principal component.
Below are shown two loading plots: one for PC1 and PC2, and one for PC2 and PC3. The contribution of each variable to the corresponding principal components is indicated by its color; the scale is the same as in the chart above, from green through yellow to red, with red indicating the greatest contribution.
The first graph shows that the variables sqft_living, sqft_above and bathrooms have the greatest influence on PC1 and PC2, and the variables view and sqft_lot the least.
The second graph shows that variables sqft_lot, sqft_basement have the greatest influence on PC2 and PC3, and variables bathrooms and sqft_above the least.
fviz_pca_var(pca, col.var = "contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title = "Variables loadings for PC1 and PC2", x = "PC1", y = "PC2") +
  scale_color_gradient2(low = "green", mid = "darkgoldenrod1",
                        high = "red", midpoint = 11)

fviz_pca_var(pca, col.var = "contrib", repel = TRUE, axes = c(2, 3)) +
  labs(title = "Variables loadings for PC2 and PC3", x = "PC2", y = "PC3") +
  scale_color_gradient2(low = "green", mid = "darkgoldenrod1",
                        high = "red", midpoint = 11)
Below are shown plots of the contributions of the variables to the first three components; each bar gives the percentage contribution of a variable to the given component.
# Contributions of the variables to PC1, PC2 and PC3, displayed side by side
PC1 <- fviz_contrib(pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(pca, choice = "var", axes = 3)
grid.arrange(PC1, PC2, PC3, ncol = 3)
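If contributions are defined in the usual way, as squared loadings expressed as a percentage of each component (the convention I assume factoextra follows for a standardized PCA), they can be reproduced directly from the rotation matrix:
contrib <- 100 * pca$rotation^2   # columns of the rotation matrix have unit length, so each column sums to 100
round(contrib[, 1:3], 1)          # percentage contribution of each variable to PC1, PC2 and PC3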
The main goal of this article was to find out whether the housing market dataset could be described by a smaller number of dimensions. The results showed that it is possible to reduce the number of dimensions from 11 to 3.
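In practice the reduced representation is simply the matrix of scores on the first three components, which can be extracted from the prcomp object and used in place of the original 11 variables:
df_reduced <- as.data.frame(pca$x[, 1:3])   # scores of each observation on PC1, PC2 and PC3
dim(df_reduced)                             # 4189 observations described by 3 dimensions instead of 11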