The analysis of principal components should start with the preparation of appropriate packages and libraries built into the R programming language environment.
# install.packages(
# c(
# "corrplot",
# "caret",
# "factoextra",
# "pdp",
# "tidyverse",
# "FactoMineR",
# "kableExtra",
# "GGally",
# "gridExtra",
# "psych",
# "ClusterR",
# "pheatmap",
# "gplots"
# )
# )
library(corrplot)
library(caret)
library(factoextra)
library(pdp)
library(tidyverse)
library(FactoMineR)
library(knitr)
library(kableExtra)
library(GGally)
library(gridExtra)
library(psych)
library(ClusterR)
library(gplots)
library(pheatmap)
realEstate <- read.table("data.csv", sep = ',', header = TRUE)
The purpose of this paper is to show how principal component analysis works for data sets with a large number of variables describing a phenomenon. Principal component analysis is an algorithm used to reduce the number of variables (dimensions) in order to reduce the complexity of the dataset, to simplify the batch data that we want to use in further modeling, and to be able to visualize in 2D or 3D the relevant results from the modeling.
Principal component analysis reduces larger dimensions to smaller dimensions in an attempt to retain as much information as possible about the reduced variables. This information is preserved in the variance that the principal components explain. Moderation in the number of principal components should be exercised by keeping in mind the highest possible variance of all explanatory variables in the data set.
A measure of the effect of air pollution on the marginal willingness to pay higher housing costs in the city of Boston was analyzed in a paper Harrison Jr and Rubinfeld (1978). The entire article raises very important economic issues for citizens. The authors first dealt with an attempt to determine the hedonic equation of housing value as a function of the degree of pollution, then tried to determine the marginal willingness to pay for cleaner air, and to determine the equation of the housing demand curve as a function of many important parameters such as distance from the nearest 5 jobs, highway availability, tax rates, etc. They found unequivocally that it is very difficult to estimate correctly the marginal willingness to pay for cleaner air because people incorrectly underestimate the benefits of such conditions by up to 60%. Neither changing the functional form of marginal willingness to pay nor changing the variables to other variables had a statistically significant effect on the correct valuation of housing and social benefits.
The dataset, which is accessible through a web service Kaggle Boston Real Estate describes the characteristics of selected apartments in various locations in Boston. It includes a wealth of information on crime per capita, access to the nearest jobs, freeway accessibility, number of rooms in the apartment, or air pollution. The literature presented, in which the authors tried to explain the variation of housing price depending on many variables using econometric methods (linear regression model) shows the significance of these variables.
realEstate %>%
rename(CrimesPerCapita = CRIM,
PropotionOfLandZonedForLots = ZN,
PropotionOfNonRetailBusiness = INDUS,
NitricOxidesConcentration = NOX,
AverageNumberOfRooms = RM,
ProportionOfOccupiedUnits = AGE,
WeightedDistancesToFiveEmploy = DIS,
IndexOfAccessibilityHighway = RAD,
ValueOfTaxRate = TAX,
PupilTeacherRatio = PTRATIO,
PropotionOfBlackPeople = B,
ProportionOfPopLowerStatus = LSTAT,
MedianValueHome = MEDV) %>%
select(-CHAS) %>%
drop_na(AverageNumberOfRooms) -> new_realEstate
new_realEstate %>%
select(everything()) %>%
describe(quant = c(0.25, 0.75)) %>%
as_tibble(rownames = "rowname") %>%
kable(caption = "Basic statistics of the variables") %>%
kable_styling(font_size = 12, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
rowname | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | Q0.25 | Q0.75 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CrimesPerCapita | 1 | 506 | 3.6174040 | 8.6001233 | 0.266005 | 1.6864663 | 0.3427252 | 0.00632 | 88.9762 | 88.96988 | 5.1936288 | 36.6125426 | 0.3823221 | 0.0822675 | 3.677083 |
PropotionOfLandZonedForLots | 2 | 506 | 11.2895257 | 23.3253499 | 0.000000 | 4.9876847 | 0.0000000 | 0.00000 | 100.0000 | 100.00000 | 2.2203246 | 3.9770836 | 1.0369383 | 0.0000000 | 12.500000 |
PropotionOfNonRetailBusiness | 3 | 506 | 11.1748419 | 6.8245918 | 9.690000 | 10.9850000 | 9.3107280 | 0.46000 | 27.7400 | 27.28000 | 0.2794927 | -1.2277573 | 0.3033901 | 5.1900000 | 18.100000 |
NitricOxidesConcentration | 4 | 506 | 0.5552089 | 0.1156112 | 0.538000 | 0.5457005 | 0.1275036 | 0.38500 | 0.8710 | 0.48600 | 0.7195751 | -0.0765268 | 0.0051395 | 0.4490000 | 0.624000 |
AverageNumberOfRooms | 5 | 506 | 6.2875889 | 0.7038016 | 6.209000 | 6.2565567 | 0.5203926 | 3.56100 | 8.7800 | 5.21900 | 0.3906614 | 1.8062933 | 0.0312878 | 5.8855000 | 6.629750 |
ProportionOfOccupiedUnits | 6 | 506 | 68.5557312 | 28.1615730 | 77.500000 | 71.1721675 | 28.9848300 | 2.90000 | 100.0000 | 97.10000 | -0.5955805 | -0.9809586 | 1.2519346 | 45.0250000 | 93.975000 |
WeightedDistancesToFiveEmploy | 7 | 506 | 3.7752308 | 2.0961474 | 3.122200 | 3.5175500 | 1.8545102 | 1.12960 | 12.1265 | 10.99690 | 1.0276487 | 0.5242025 | 0.0931851 | 2.0985000 | 5.117675 |
IndexOfAccessibilityHighway | 8 | 506 | 9.5316206 | 8.7166614 | 5.000000 | 8.7093596 | 2.9652000 | 1.00000 | 24.0000 | 23.00000 | 0.9999212 | -0.8785117 | 0.3875029 | 4.0000000 | 24.000000 |
ValueOfTaxRate | 9 | 506 | 408.3300395 | 168.3826846 | 330.000000 | 400.1600985 | 108.2298000 | 187.00000 | 711.0000 | 524.00000 | 0.6680231 | -1.1466598 | 7.4855234 | 280.2500000 | 666.000000 |
PupilTeacherRatio | 10 | 506 | 18.4984190 | 2.2020777 | 19.100000 | 18.6908867 | 1.6308600 | 12.60000 | 23.0000 | 10.40000 | -0.7272263 | -0.2628700 | 0.0978943 | 17.4000000 | 20.200000 |
PropotionOfBlackPeople | 11 | 506 | 356.2283794 | 91.2534620 | 391.260000 | 382.6140887 | 8.3618640 | 0.32000 | 396.9000 | 396.58000 | -2.8632279 | 7.0658163 | 4.0567112 | 374.6875000 | 396.210000 |
ProportionOfPopLowerStatus | 12 | 506 | 12.8725692 | 7.8235275 | 11.465000 | 11.9849261 | 7.2054360 | 1.73000 | 76.0000 | 74.27000 | 1.7981388 | 8.2818194 | 0.3477982 | 6.9500000 | 17.107500 |
MedianValueHome | 13 | 506 | 22.7118577 | 9.5205197 | 21.200000 | 21.6610837 | 6.0786600 | 5.00000 | 67.0000 | 62.00000 | 1.2099885 | 1.9635032 | 0.4232387 | 17.0250000 | 25.075000 |
Before proceeding to principal component analysis, it would be necessary to initially determine the matrix of correlation coefficients between the analyzed variables. All variables can be considered as continuous variables or as ratio or interval variables. For this purpose we determine Pearson correlation coefficients.
new_realEstate %>% cor(method = "pearson") -> corVal
corVal %>% corrplot(method = "square")
At first glance, we can see that most of the variables are neither negatively nor positively correlated with each other. On the other hand, there is also a moderate pair of variables that are positively or negatively correlated. To the positive correlations we may consider the median price of apartments with the average number of rooms. This makes economic sense, since the more rooms, the larger the apartment, the more expensive it will cost. Also, the amount of housing tax is positively correlated with the highway accessibility index. Among the negative correlations we can include the amount of housing prices with the proportion of wealthy people. The higher the prices, the less people will be able to afford the housing.
Principal component analysis is the most important part of this study. Principal component analysis as mentioned earlier is responsible for reducing the number of dimensions while keeping as much of the variance of all variables as possible. The higher percentage of variance is explained by principal components the less information is lost with this transformation.
In order to obtain unencumbered eigenvalues and eigenvectors, one should first standardize all the variables under consideration. This gives us values from a normal distribution with mean 0 and variance equal to 1.
new_realEstate %>%
preProcess(method = c("center", "scale")) %>%
predict(new_realEstate) -> standRealEstate
summary(standRealEstate)
## CrimesPerCapita PropotionOfLandZonedForLots PropotionOfNonRetailBusiness
## Min. :-0.419887 Min. :-0.48400 Min. :-1.5700
## 1st Qu.:-0.411056 1st Qu.:-0.48400 1st Qu.:-0.8770
## Median :-0.389692 Median :-0.48400 Median :-0.2176
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.006939 3rd Qu.: 0.05189 3rd Qu.: 1.0147
## Max. : 9.925299 Max. : 3.80318 Max. : 2.4273
## NitricOxidesConcentration AverageNumberOfRooms ProportionOfOccupiedUnits
## Min. :-1.4723 Min. :-3.8741 Min. :-2.3314
## 1st Qu.:-0.9187 1st Qu.:-0.5713 1st Qu.:-0.8356
## Median :-0.1489 Median :-0.1117 Median : 0.3176
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5950 3rd Qu.: 0.4862 3rd Qu.: 0.9026
## Max. : 2.7315 Max. : 3.5414 Max. : 1.1166
## WeightedDistancesToFiveEmploy IndexOfAccessibilityHighway ValueOfTaxRate
## Min. :-1.2621 Min. :-0.9788 Min. :-1.3144
## 1st Qu.:-0.7999 1st Qu.:-0.6346 1st Qu.:-0.7606
## Median :-0.3115 Median :-0.5199 Median :-0.4652
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6404 3rd Qu.: 1.6599 3rd Qu.: 1.5303
## Max. : 3.9841 Max. : 1.6599 Max. : 1.7975
## PupilTeacherRatio PropotionOfBlackPeople ProportionOfPopLowerStatus
## Min. :-2.6786 Min. :-3.9002 Min. :-1.4242
## 1st Qu.:-0.4988 1st Qu.: 0.2023 1st Qu.:-0.7570
## Median : 0.2732 Median : 0.3839 Median :-0.1799
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7727 3rd Qu.: 0.4381 3rd Qu.: 0.5413
## Max. : 2.0442 Max. : 0.4457 Max. : 8.0689
## MedianValueHome
## Min. :-1.8604
## 1st Qu.:-0.5973
## Median :-0.1588
## Mean : 0.0000
## 3rd Qu.: 0.2482
## Max. : 4.6519
standRealEstate %>% cov() -> standRealEstate.cov
standRealEstate.cov %>% eigen() -> standRealEstate.eigen
standRealEstate.eigen$values
## [1] 6.40307267 1.47929098 1.33275143 0.84222164 0.67286378 0.53787365
## [7] 0.44836594 0.36141387 0.27422943 0.24708642 0.18628796 0.15130026
## [13] 0.06324197
Let’s now turn to the visualization of the ScreePlot plot, which shows us how successive principal components play a role in explaining the variance of all the variables. In the chart below, we see that the first principal component explains almost 50% of the variance of the variables in the dataset. The second and third principal components explain 11% and 10%, respectively. If we were interested in 3D visualization we would have to choose the first three principal components, which together explain almost 71% of the variance. This is a very good and satisfying result. It is also an unwritten rule that the standard deviation of the principal component should exceed 1.
new_realEstate %>%
prcomp(center = TRUE, scale = TRUE) -> pcaEstate
fviz_eig(pcaEstate)
summary(pcaEstate)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5304 1.2163 1.1544 0.91773 0.82028 0.73340 0.66960
## Proportion of Variance 0.4925 0.1138 0.1025 0.06479 0.05176 0.04137 0.03449
## Cumulative Proportion 0.4925 0.6063 0.7089 0.77364 0.82540 0.86677 0.90126
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.6012 0.52367 0.49708 0.43161 0.38897 0.25148
## Proportion of Variance 0.0278 0.02109 0.01901 0.01433 0.01164 0.00486
## Cumulative Proportion 0.9291 0.95016 0.96917 0.98350 0.99514 1.00000
The following visualization concretely shows through vectors which variable plays a key role for a given observation. Additionally, the longer the length of the vector, the more the variable plays a role in creating the principal component. Vectors that are perpendicular to each other are considered independent. As we have concluded from the correlation matrix analysis, the variables average number of rooms and median price of apartments are identically correlated. However, vector perpendicular to these two is, for example, weighted average of distance to the 5 nearest workplaces.
pcaEstate %>%
fviz_pca_var(col.var="contrib",
repel = TRUE,
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
ggtheme = theme_minimal())
At the same time, we can look at the vectors of variables that build the principal components, but also at the observations that have been transformed with the principal components. So with a bird’s eye view, we can look at which observations (using numbering) are more influenced by specific variables. More expensive housing with more rooms will be located in the upper left corner of the graph, while observations with higher crime index or pollution volume will be analogously located on the opposite side.
pcaEstate %>%
fviz_pca_ind( col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07" ), repel=TRUE)
pcaEstate %>%
fviz_pca_biplot( repel = TRUE,
geom = c("point"),
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
Hierarchical clustering can also be used to confirm conclusions. Heatmap is used to visualize clusters in two dimensions. It allows you to look at the observations as the observations themselves, as well as look at how the variables used in the analysis group together. Darker and lighter regions suggest strong association between variables or observations.
standRealEstate %>%
t() %>%
as.matrix() %>%
heatmap.2(labCol = FALSE,
main = "Heatmap observations/variables",
xlab = "Houses",
cexRow = 0.8,
margins = c(8,8),
tracecol = NA)
Principal component analysis by itself does not add much to modeling or creating relationships. However, there is a huge potential in this algorithm because not only does it simplify reality and the number of dimensions, but machine learning models are more likely to accept simplified data sets than very complex and extensive ones. Principal component analysis is a fairly simple, transparent algorithm that is worth keeping in your statistical and mathematical workshop.