1. Introduction
  2. Literature review
  3. Basic transformation
  4. Correlation matrix
  5. Principal component analysis
  6. Hierarchical clustering
  7. Summary
  8. References

The analysis of principal components should start with the preparation of appropriate packages and libraries built into the R programming language environment.

# install.packages(
#   c(
#     "corrplot",
#     "caret",
#     "factoextra",
#     "pdp",
#     "tidyverse",
#     "FactoMineR",
#     "kableExtra",
#     "GGally",
#     "gridExtra",
#     "psych",
#     "ClusterR",
#     "pheatmap",
#     "gplots"
#   )
# )
library(corrplot)
library(caret)
library(factoextra)
library(pdp)
library(tidyverse)
library(FactoMineR)
library(knitr)
library(kableExtra)
library(GGally)
library(gridExtra)
library(psych)
library(ClusterR)
library(gplots)
library(pheatmap)


realEstate <- read.table("data.csv", sep = ',', header = TRUE)

1 Introdution

The purpose of this paper is to show how principal component analysis works for data sets with a large number of variables describing a phenomenon. Principal component analysis is an algorithm used to reduce the number of variables (dimensions) in order to reduce the complexity of the dataset, to simplify the batch data that we want to use in further modeling, and to be able to visualize in 2D or 3D the relevant results from the modeling.

Principal component analysis reduces larger dimensions to smaller dimensions in an attempt to retain as much information as possible about the reduced variables. This information is preserved in the variance that the principal components explain. Moderation in the number of principal components should be exercised by keeping in mind the highest possible variance of all explanatory variables in the data set.

2 Literature review

A measure of the effect of air pollution on the marginal willingness to pay higher housing costs in the city of Boston was analyzed in a paper Harrison Jr and Rubinfeld (1978). The entire article raises very important economic issues for citizens. The authors first dealt with an attempt to determine the hedonic equation of housing value as a function of the degree of pollution, then tried to determine the marginal willingness to pay for cleaner air, and to determine the equation of the housing demand curve as a function of many important parameters such as distance from the nearest 5 jobs, highway availability, tax rates, etc. They found unequivocally that it is very difficult to estimate correctly the marginal willingness to pay for cleaner air because people incorrectly underestimate the benefits of such conditions by up to 60%. Neither changing the functional form of marginal willingness to pay nor changing the variables to other variables had a statistically significant effect on the correct valuation of housing and social benefits.

3 Basic transformation

The dataset, which is accessible through a web service Kaggle Boston Real Estate describes the characteristics of selected apartments in various locations in Boston. It includes a wealth of information on crime per capita, access to the nearest jobs, freeway accessibility, number of rooms in the apartment, or air pollution. The literature presented, in which the authors tried to explain the variation of housing price depending on many variables using econometric methods (linear regression model) shows the significance of these variables.

realEstate %>% 
  rename(CrimesPerCapita = CRIM,
         PropotionOfLandZonedForLots = ZN,
         PropotionOfNonRetailBusiness = INDUS,
         NitricOxidesConcentration = NOX,
         AverageNumberOfRooms = RM,
         ProportionOfOccupiedUnits = AGE,
         WeightedDistancesToFiveEmploy = DIS,
         IndexOfAccessibilityHighway = RAD,
         ValueOfTaxRate = TAX,
         PupilTeacherRatio = PTRATIO,
         PropotionOfBlackPeople = B,
         ProportionOfPopLowerStatus = LSTAT,
         MedianValueHome = MEDV) %>% 
  select(-CHAS) %>% 
  drop_na(AverageNumberOfRooms)  -> new_realEstate

new_realEstate %>% 
  select(everything()) %>% 
  describe(quant = c(0.25, 0.75)) %>% 
  as_tibble(rownames = "rowname") %>% 
  kable(caption = "Basic statistics of the variables") %>%
  kable_styling(font_size = 12, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Basic statistics of the variables
rowname vars n mean sd median trimmed mad min max range skew kurtosis se Q0.25 Q0.75
CrimesPerCapita 1 506 3.6174040 8.6001233 0.266005 1.6864663 0.3427252 0.00632 88.9762 88.96988 5.1936288 36.6125426 0.3823221 0.0822675 3.677083
PropotionOfLandZonedForLots 2 506 11.2895257 23.3253499 0.000000 4.9876847 0.0000000 0.00000 100.0000 100.00000 2.2203246 3.9770836 1.0369383 0.0000000 12.500000
PropotionOfNonRetailBusiness 3 506 11.1748419 6.8245918 9.690000 10.9850000 9.3107280 0.46000 27.7400 27.28000 0.2794927 -1.2277573 0.3033901 5.1900000 18.100000
NitricOxidesConcentration 4 506 0.5552089 0.1156112 0.538000 0.5457005 0.1275036 0.38500 0.8710 0.48600 0.7195751 -0.0765268 0.0051395 0.4490000 0.624000
AverageNumberOfRooms 5 506 6.2875889 0.7038016 6.209000 6.2565567 0.5203926 3.56100 8.7800 5.21900 0.3906614 1.8062933 0.0312878 5.8855000 6.629750
ProportionOfOccupiedUnits 6 506 68.5557312 28.1615730 77.500000 71.1721675 28.9848300 2.90000 100.0000 97.10000 -0.5955805 -0.9809586 1.2519346 45.0250000 93.975000
WeightedDistancesToFiveEmploy 7 506 3.7752308 2.0961474 3.122200 3.5175500 1.8545102 1.12960 12.1265 10.99690 1.0276487 0.5242025 0.0931851 2.0985000 5.117675
IndexOfAccessibilityHighway 8 506 9.5316206 8.7166614 5.000000 8.7093596 2.9652000 1.00000 24.0000 23.00000 0.9999212 -0.8785117 0.3875029 4.0000000 24.000000
ValueOfTaxRate 9 506 408.3300395 168.3826846 330.000000 400.1600985 108.2298000 187.00000 711.0000 524.00000 0.6680231 -1.1466598 7.4855234 280.2500000 666.000000
PupilTeacherRatio 10 506 18.4984190 2.2020777 19.100000 18.6908867 1.6308600 12.60000 23.0000 10.40000 -0.7272263 -0.2628700 0.0978943 17.4000000 20.200000
PropotionOfBlackPeople 11 506 356.2283794 91.2534620 391.260000 382.6140887 8.3618640 0.32000 396.9000 396.58000 -2.8632279 7.0658163 4.0567112 374.6875000 396.210000
ProportionOfPopLowerStatus 12 506 12.8725692 7.8235275 11.465000 11.9849261 7.2054360 1.73000 76.0000 74.27000 1.7981388 8.2818194 0.3477982 6.9500000 17.107500
MedianValueHome 13 506 22.7118577 9.5205197 21.200000 21.6610837 6.0786600 5.00000 67.0000 62.00000 1.2099885 1.9635032 0.4232387 17.0250000 25.075000

4 Correlation matrix

Before proceeding to principal component analysis, it would be necessary to initially determine the matrix of correlation coefficients between the analyzed variables. All variables can be considered as continuous variables or as ratio or interval variables. For this purpose we determine Pearson correlation coefficients.

new_realEstate %>% cor(method = "pearson") -> corVal
corVal %>% corrplot(method = "square")

At first glance, we can see that most of the variables are neither negatively nor positively correlated with each other. On the other hand, there is also a moderate pair of variables that are positively or negatively correlated. To the positive correlations we may consider the median price of apartments with the average number of rooms. This makes economic sense, since the more rooms, the larger the apartment, the more expensive it will cost. Also, the amount of housing tax is positively correlated with the highway accessibility index. Among the negative correlations we can include the amount of housing prices with the proportion of wealthy people. The higher the prices, the less people will be able to afford the housing.

5 Principal component analysis

Principal component analysis is the most important part of this study. Principal component analysis as mentioned earlier is responsible for reducing the number of dimensions while keeping as much of the variance of all variables as possible. The higher percentage of variance is explained by principal components the less information is lost with this transformation.

In order to obtain unencumbered eigenvalues and eigenvectors, one should first standardize all the variables under consideration. This gives us values from a normal distribution with mean 0 and variance equal to 1.

new_realEstate %>% 
  preProcess(method = c("center", "scale")) %>% 
  predict(new_realEstate) -> standRealEstate

summary(standRealEstate)
##  CrimesPerCapita     PropotionOfLandZonedForLots PropotionOfNonRetailBusiness
##  Min.   :-0.419887   Min.   :-0.48400            Min.   :-1.5700             
##  1st Qu.:-0.411056   1st Qu.:-0.48400            1st Qu.:-0.8770             
##  Median :-0.389692   Median :-0.48400            Median :-0.2176             
##  Mean   : 0.000000   Mean   : 0.00000            Mean   : 0.0000             
##  3rd Qu.: 0.006939   3rd Qu.: 0.05189            3rd Qu.: 1.0147             
##  Max.   : 9.925299   Max.   : 3.80318            Max.   : 2.4273             
##  NitricOxidesConcentration AverageNumberOfRooms ProportionOfOccupiedUnits
##  Min.   :-1.4723           Min.   :-3.8741      Min.   :-2.3314          
##  1st Qu.:-0.9187           1st Qu.:-0.5713      1st Qu.:-0.8356          
##  Median :-0.1489           Median :-0.1117      Median : 0.3176          
##  Mean   : 0.0000           Mean   : 0.0000      Mean   : 0.0000          
##  3rd Qu.: 0.5950           3rd Qu.: 0.4862      3rd Qu.: 0.9026          
##  Max.   : 2.7315           Max.   : 3.5414      Max.   : 1.1166          
##  WeightedDistancesToFiveEmploy IndexOfAccessibilityHighway ValueOfTaxRate   
##  Min.   :-1.2621               Min.   :-0.9788             Min.   :-1.3144  
##  1st Qu.:-0.7999               1st Qu.:-0.6346             1st Qu.:-0.7606  
##  Median :-0.3115               Median :-0.5199             Median :-0.4652  
##  Mean   : 0.0000               Mean   : 0.0000             Mean   : 0.0000  
##  3rd Qu.: 0.6404               3rd Qu.: 1.6599             3rd Qu.: 1.5303  
##  Max.   : 3.9841               Max.   : 1.6599             Max.   : 1.7975  
##  PupilTeacherRatio PropotionOfBlackPeople ProportionOfPopLowerStatus
##  Min.   :-2.6786   Min.   :-3.9002        Min.   :-1.4242           
##  1st Qu.:-0.4988   1st Qu.: 0.2023        1st Qu.:-0.7570           
##  Median : 0.2732   Median : 0.3839        Median :-0.1799           
##  Mean   : 0.0000   Mean   : 0.0000        Mean   : 0.0000           
##  3rd Qu.: 0.7727   3rd Qu.: 0.4381        3rd Qu.: 0.5413           
##  Max.   : 2.0442   Max.   : 0.4457        Max.   : 8.0689           
##  MedianValueHome  
##  Min.   :-1.8604  
##  1st Qu.:-0.5973  
##  Median :-0.1588  
##  Mean   : 0.0000  
##  3rd Qu.: 0.2482  
##  Max.   : 4.6519
standRealEstate %>% cov() -> standRealEstate.cov
standRealEstate.cov %>% eigen() -> standRealEstate.eigen
standRealEstate.eigen$values
##  [1] 6.40307267 1.47929098 1.33275143 0.84222164 0.67286378 0.53787365
##  [7] 0.44836594 0.36141387 0.27422943 0.24708642 0.18628796 0.15130026
## [13] 0.06324197

Let’s now turn to the visualization of the ScreePlot plot, which shows us how successive principal components play a role in explaining the variance of all the variables. In the chart below, we see that the first principal component explains almost 50% of the variance of the variables in the dataset. The second and third principal components explain 11% and 10%, respectively. If we were interested in 3D visualization we would have to choose the first three principal components, which together explain almost 71% of the variance. This is a very good and satisfying result. It is also an unwritten rule that the standard deviation of the principal component should exceed 1.

new_realEstate %>% 
  prcomp(center = TRUE, scale = TRUE) -> pcaEstate

fviz_eig(pcaEstate)

summary(pcaEstate)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.5304 1.2163 1.1544 0.91773 0.82028 0.73340 0.66960
## Proportion of Variance 0.4925 0.1138 0.1025 0.06479 0.05176 0.04137 0.03449
## Cumulative Proportion  0.4925 0.6063 0.7089 0.77364 0.82540 0.86677 0.90126
##                           PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.6012 0.52367 0.49708 0.43161 0.38897 0.25148
## Proportion of Variance 0.0278 0.02109 0.01901 0.01433 0.01164 0.00486
## Cumulative Proportion  0.9291 0.95016 0.96917 0.98350 0.99514 1.00000

The following visualization concretely shows through vectors which variable plays a key role for a given observation. Additionally, the longer the length of the vector, the more the variable plays a role in creating the principal component. Vectors that are perpendicular to each other are considered independent. As we have concluded from the correlation matrix analysis, the variables average number of rooms and median price of apartments are identically correlated. However, vector perpendicular to these two is, for example, weighted average of distance to the 5 nearest workplaces.

pcaEstate %>% 
  fviz_pca_var(col.var="contrib",
               repel = TRUE,
               gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
               ggtheme = theme_minimal()) 

At the same time, we can look at the vectors of variables that build the principal components, but also at the observations that have been transformed with the principal components. So with a bird’s eye view, we can look at which observations (using numbering) are more influenced by specific variables. More expensive housing with more rooms will be located in the upper left corner of the graph, while observations with higher crime index or pollution volume will be analogously located on the opposite side.

pcaEstate %>% 
  fviz_pca_ind( col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ), repel=TRUE)

pcaEstate %>% 
  fviz_pca_biplot( repel = TRUE,
                   geom = c("point"),
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

6 Hierarchical clustering

Hierarchical clustering can also be used to confirm conclusions. Heatmap is used to visualize clusters in two dimensions. It allows you to look at the observations as the observations themselves, as well as look at how the variables used in the analysis group together. Darker and lighter regions suggest strong association between variables or observations.

standRealEstate %>%
  t() %>% 
  as.matrix() %>%
  heatmap.2(labCol = FALSE,
            main = "Heatmap observations/variables",
            xlab = "Houses",
            cexRow = 0.8,
            margins = c(8,8),
            tracecol = NA)

7 Summary

Principal component analysis by itself does not add much to modeling or creating relationships. However, there is a huge potential in this algorithm because not only does it simplify reality and the number of dimensions, but machine learning models are more likely to accept simplified data sets than very complex and extensive ones. Principal component analysis is a fairly simple, transparent algorithm that is worth keeping in your statistical and mathematical workshop.

References

Harrison Jr, David, and Daniel L Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5 (1): 81–102.