Introduction
Literature review
Basic transformation
Correlation matrix
Principal component analysis
Hierarchical clustering
Summary
References

The analysis of principal components should start with the preparation of appropriate packages and libraries built into the R programming language environment.

# install.packages(
#   c(
#     "corrplot",
#     "caret",
#     "factoextra",
#     "pdp",
#     "tidyverse",
#     "FactoMineR",
#     "kableExtra",
#     "GGally",
#     "gridExtra",
#     "psych",
#     "ClusterR",
#     "pheatmap",
#     "gplots"
#   )
# )

library(corrplot)
library(caret)
library(factoextra)
library(pdp)
library(tidyverse)
library(FactoMineR)
library(knitr)
library(kableExtra)
library(GGally)
library(gridExtra)
library(psych)
library(ClusterR)
library(gplots)
library(pheatmap)


realEstate <- read.table("data.csv", sep = ',', header = TRUE)

1 Introdution

The purpose of this paper is to show how principal component analysis works for data sets with a large number of variables describing a phenomenon. Principal component analysis is an algorithm used to reduce the number of variables (dimensions) in order to reduce the complexity of the dataset, to simplify the batch data that we want to use in further modeling, and to be able to visualize in 2D or 3D the relevant results from the modeling.

Principal component analysis reduces larger dimensions to smaller dimensions in an attempt to retain as much information as possible about the reduced variables. This information is preserved in the variance that the principal components explain. Moderation in the number of principal components should be exercised by keeping in mind the highest possible variance of all explanatory variables in the data set.

2 Literature review

A measure of the effect of air pollution on the marginal willingness to pay higher housing costs in the city of Boston was analyzed in a paper Harrison Jr and Rubinfeld (1978). The entire article raises very important economic issues for citizens. The authors first dealt with an attempt to determine the hedonic equation of housing value as a function of the degree of pollution, then tried to determine the marginal willingness to pay for cleaner air, and to determine the equation of the housing demand curve as a function of many important parameters such as distance from the nearest 5 jobs, highway availability, tax rates, etc. They found unequivocally that it is very difficult to estimate correctly the marginal willingness to pay for cleaner air because people incorrectly underestimate the benefits of such conditions by up to 60%. Neither changing the functional form of marginal willingness to pay nor changing the variables to other variables had a statistically significant effect on the correct valuation of housing and social benefits.

3 Basic transformation

The dataset, which is accessible through a web service Kaggle Boston Real Estate describes the characteristics of selected apartments in various locations in Boston. It includes a wealth of information on crime per capita, access to the nearest jobs, freeway accessibility, number of rooms in the apartment, or air pollution. The literature presented, in which the authors tried to explain the variation of housing price depending on many variables using econometric methods (linear regression model) shows the significance of these variables.

realEstate %>% 
  rename(CrimesPerCapita = CRIM,
         PropotionOfLandZonedForLots = ZN,
         PropotionOfNonRetailBusiness = INDUS,
         NitricOxidesConcentration = NOX,
         AverageNumberOfRooms = RM,
         ProportionOfOccupiedUnits = AGE,
         WeightedDistancesToFiveEmploy = DIS,
         IndexOfAccessibilityHighway = RAD,
         ValueOfTaxRate = TAX,
         PupilTeacherRatio = PTRATIO,
         PropotionOfBlackPeople = B,
         ProportionOfPopLowerStatus = LSTAT,
         MedianValueHome = MEDV) %>% 
  select(-CHAS) %>% 
  drop_na(AverageNumberOfRooms)  -> new_realEstate

new_realEstate %>% 
  select(everything()) %>% 
  describe(quant = c(0.25, 0.75)) %>% 
  as_tibble(rownames = "rowname") %>% 
  kable(caption = "Basic statistics of the variables") %>%
  kable_styling(font_size = 12, bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Basic statistics of the variables
rowname	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se	Q0.25	Q0.75
CrimesPerCapita	1	506	3.6174040	8.6001233	0.266005	1.6864663	0.3427252	0.00632	88.9762	88.96988	5.1936288	36.6125426	0.3823221	0.0822675	3.677083
PropotionOfLandZonedForLots	2	506	11.2895257	23.3253499	0.000000	4.9876847	0.0000000	0.00000	100.0000	100.00000	2.2203246	3.9770836	1.0369383	0.0000000	12.500000
PropotionOfNonRetailBusiness	3	506	11.1748419	6.8245918	9.690000	10.9850000	9.3107280	0.46000	27.7400	27.28000	0.2794927	-1.2277573	0.3033901	5.1900000	18.100000
NitricOxidesConcentration	4	506	0.5552089	0.1156112	0.538000	0.5457005	0.1275036	0.38500	0.8710	0.48600	0.7195751	-0.0765268	0.0051395	0.4490000	0.624000
AverageNumberOfRooms	5	506	6.2875889	0.7038016	6.209000	6.2565567	0.5203926	3.56100	8.7800	5.21900	0.3906614	1.8062933	0.0312878	5.8855000	6.629750
ProportionOfOccupiedUnits	6	506	68.5557312	28.1615730	77.500000	71.1721675	28.9848300	2.90000	100.0000	97.10000	-0.5955805	-0.9809586	1.2519346	45.0250000	93.975000
WeightedDistancesToFiveEmploy	7	506	3.7752308	2.0961474	3.122200	3.5175500	1.8545102	1.12960	12.1265	10.99690	1.0276487	0.5242025	0.0931851	2.0985000	5.117675
IndexOfAccessibilityHighway	8	506	9.5316206	8.7166614	5.000000	8.7093596	2.9652000	1.00000	24.0000	23.00000	0.9999212	-0.8785117	0.3875029	4.0000000	24.000000
ValueOfTaxRate	9	506	408.3300395	168.3826846	330.000000	400.1600985	108.2298000	187.00000	711.0000	524.00000	0.6680231	-1.1466598	7.4855234	280.2500000	666.000000
PupilTeacherRatio	10	506	18.4984190	2.2020777	19.100000	18.6908867	1.6308600	12.60000	23.0000	10.40000	-0.7272263	-0.2628700	0.0978943	17.4000000	20.200000
PropotionOfBlackPeople	11	506	356.2283794	91.2534620	391.260000	382.6140887	8.3618640	0.32000	396.9000	396.58000	-2.8632279	7.0658163	4.0567112	374.6875000	396.210000
ProportionOfPopLowerStatus	12	506	12.8725692	7.8235275	11.465000	11.9849261	7.2054360	1.73000	76.0000	74.27000	1.7981388	8.2818194	0.3477982	6.9500000	17.107500
MedianValueHome	13	506	22.7118577	9.5205197	21.200000	21.6610837	6.0786600	5.00000	67.0000	62.00000	1.2099885	1.9635032	0.4232387	17.0250000	25.075000

4 Correlation matrix

Before proceeding to principal component analysis, it would be necessary to initially determine the matrix of correlation coefficients between the analyzed variables. All variables can be considered as continuous variables or as ratio or interval variables. For this purpose we determine Pearson correlation coefficients.

new_realEstate %>% cor(method = "pearson") -> corVal
corVal %>% corrplot(method = "square")

At first glance, we can see that most of the variables are neither negatively nor positively correlated with each other. On the other hand, there is also a moderate pair of variables that are positively or negatively correlated. To the positive correlations we may consider the median price of apartments with the average number of rooms. This makes economic sense, since the more rooms, the larger the apartment, the more expensive it will cost. Also, the amount of housing tax is positively correlated with the highway accessibility index. Among the negative correlations we can include the amount of housing prices with the proportion of wealthy people. The higher the prices, the less people will be able to afford the housing.

5 Principal component analysis

Principal component analysis is the most important part of this study. Principal component analysis as mentioned earlier is responsible for reducing the number of dimensions while keeping as much of the variance of all variables as possible. The higher percentage of variance is explained by principal components the less information is lost with this transformation.

In order to obtain unencumbered eigenvalues and eigenvectors, one should first standardize all the variables under consideration. This gives us values from a normal distribution with mean 0 and variance equal to 1.

new_realEstate %>% 
  preProcess(method = c("center", "scale")) %>% 
  predict(new_realEstate) -> standRealEstate

summary(standRealEstate)

##  CrimesPerCapita     PropotionOfLandZonedForLots PropotionOfNonRetailBusiness
##  Min.   :-0.419887   Min.   :-0.48400            Min.   :-1.5700             
##  1st Qu.:-0.411056   1st Qu.:-0.48400            1st Qu.:-0.8770             
##  Median :-0.389692   Median :-0.48400            Median :-0.2176             
##  Mean   : 0.000000   Mean   : 0.00000            Mean   : 0.0000             
##  3rd Qu.: 0.006939   3rd Qu.: 0.05189            3rd Qu.: 1.0147             
##  Max.   : 9.925299   Max.   : 3.80318            Max.   : 2.4273             
##  NitricOxidesConcentration AverageNumberOfRooms ProportionOfOccupiedUnits
##  Min.   :-1.4723           Min.   :-3.8741      Min.   :-2.3314          
##  1st Qu.:-0.9187           1st Qu.:-0.5713      1st Qu.:-0.8356          
##  Median :-0.1489           Median :-0.1117      Median : 0.3176          
##  Mean   : 0.0000           Mean   : 0.0000      Mean   : 0.0000          
##  3rd Qu.: 0.5950           3rd Qu.: 0.4862      3rd Qu.: 0.9026          
##  Max.   : 2.7315           Max.   : 3.5414      Max.   : 1.1166          
##  WeightedDistancesToFiveEmploy IndexOfAccessibilityHighway ValueOfTaxRate   
##  Min.   :-1.2621               Min.   :-0.9788             Min.   :-1.3144  
##  1st Qu.:-0.7999               1st Qu.:-0.6346             1st Qu.:-0.7606  
##  Median :-0.3115               Median :-0.5199             Median :-0.4652  
##  Mean   : 0.0000               Mean   : 0.0000             Mean   : 0.0000  
##  3rd Qu.: 0.6404               3rd Qu.: 1.6599             3rd Qu.: 1.5303  
##  Max.   : 3.9841               Max.   : 1.6599             Max.   : 1.7975  
##  PupilTeacherRatio PropotionOfBlackPeople ProportionOfPopLowerStatus
##  Min.   :-2.6786   Min.   :-3.9002        Min.   :-1.4242           
##  1st Qu.:-0.4988   1st Qu.: 0.2023        1st Qu.:-0.7570           
##  Median : 0.2732   Median : 0.3839        Median :-0.1799           
##  Mean   : 0.0000   Mean   : 0.0000        Mean   : 0.0000           
##  3rd Qu.: 0.7727   3rd Qu.: 0.4381        3rd Qu.: 0.5413           
##  Max.   : 2.0442   Max.   : 0.4457        Max.   : 8.0689           
##  MedianValueHome  
##  Min.   :-1.8604  
##  1st Qu.:-0.5973  
##  Median :-0.1588  
##  Mean   : 0.0000  
##  3rd Qu.: 0.2482  
##  Max.   : 4.6519

standRealEstate %>% cov() -> standRealEstate.cov
standRealEstate.cov %>% eigen() -> standRealEstate.eigen
standRealEstate.eigen$values

##  [1] 6.40307267 1.47929098 1.33275143 0.84222164 0.67286378 0.53787365
##  [7] 0.44836594 0.36141387 0.27422943 0.24708642 0.18628796 0.15130026
## [13] 0.06324197

Let’s now turn to the visualization of the ScreePlot plot, which shows us how successive principal components play a role in explaining the variance of all the variables. In the chart below, we see that the first principal component explains almost 50% of the variance of the variables in the dataset. The second and third principal components explain 11% and 10%, respectively. If we were interested in 3D visualization we would have to choose the first three principal components, which together explain almost 71% of the variance. This is a very good and satisfying result. It is also an unwritten rule that the standard deviation of the principal component should exceed 1.

new_realEstate %>% 
  prcomp(center = TRUE, scale = TRUE) -> pcaEstate

fviz_eig(pcaEstate)

summary(pcaEstate)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.5304 1.2163 1.1544 0.91773 0.82028 0.73340 0.66960
## Proportion of Variance 0.4925 0.1138 0.1025 0.06479 0.05176 0.04137 0.03449
## Cumulative Proportion  0.4925 0.6063 0.7089 0.77364 0.82540 0.86677 0.90126
##                           PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.6012 0.52367 0.49708 0.43161 0.38897 0.25148
## Proportion of Variance 0.0278 0.02109 0.01901 0.01433 0.01164 0.00486
## Cumulative Proportion  0.9291 0.95016 0.96917 0.98350 0.99514 1.00000

The following visualization concretely shows through vectors which variable plays a key role for a given observation. Additionally, the longer the length of the vector, the more the variable plays a role in creating the principal component. Vectors that are perpendicular to each other are considered independent. As we have concluded from the correlation matrix analysis, the variables average number of rooms and median price of apartments are identically correlated. However, vector perpendicular to these two is, for example, weighted average of distance to the 5 nearest workplaces.

pcaEstate %>% 
  fviz_pca_var(col.var="contrib",
               repel = TRUE,
               gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
               ggtheme = theme_minimal())

At the same time, we can look at the vectors of variables that build the principal components, but also at the observations that have been transformed with the principal components. So with a bird’s eye view, we can look at which observations (using numbering) are more influenced by specific variables. More expensive housing with more rooms will be located in the upper left corner of the graph, while observations with higher crime index or pollution volume will be analogously located on the opposite side.

pcaEstate %>% 
  fviz_pca_ind( col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ), repel=TRUE)

pcaEstate %>% 
  fviz_pca_biplot( repel = TRUE,
                   geom = c("point"),
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

6 Hierarchical clustering

Hierarchical clustering can also be used to confirm conclusions. Heatmap is used to visualize clusters in two dimensions. It allows you to look at the observations as the observations themselves, as well as look at how the variables used in the analysis group together. Darker and lighter regions suggest strong association between variables or observations.

standRealEstate %>%
  t() %>% 
  as.matrix() %>%
  heatmap.2(labCol = FALSE,
            main = "Heatmap observations/variables",
            xlab = "Houses",
            cexRow = 0.8,
            margins = c(8,8),
            tracecol = NA)

7 Summary

Principal component analysis by itself does not add much to modeling or creating relationships. However, there is a huge potential in this algorithm because not only does it simplify reality and the number of dimensions, but machine learning models are more likely to accept simplified data sets than very complex and extensive ones. Principal component analysis is a fairly simple, transparent algorithm that is worth keeping in your statistical and mathematical workshop.

References

Harrison Jr, David, and Daniel L Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5 (1): 81–102.

Principal component analysis (PCA) for Real Estate Boston

Robert Kowalczyk

01/2022