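# Correlation matrix of the variables, drawn as an upper-triangle plot
# with variables ordered by hierarchical clustering
datamatrix <- cor(data_X)
corrplot(datamatrix, order = "hclust", type = "upper", tl.srt = 45)
# Covariance matrix of the standardized data
cov_mat <- cov(X_std)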
pcaobj <- prcomp(X_std)
print(pcaobj)
## Standard deviations (1, .., p=9):
## [1] 2.0140181 1.6663949 1.0147369 0.8319572 0.4542260 0.3579423 0.2274711
## [8] 0.2013117 0.1352858
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3
## total_homes_sold 0.026214138 -0.57072301 0.159404117
## median_sale_price 0.470008835 0.02168610 -0.024850850
## total_new_listings 0.039679991 -0.58677946 0.070576135
## median_new_listing_price 0.474098524 0.01566229 -0.032900591
## active_listings 0.008739652 -0.57069952 -0.150383084
## median_active_list_price 0.473896842 0.01475440 -0.007512401
## average_of_median_list_price 0.427735256 0.01898059 -0.055379485
## average_of_median_offer_price 0.374481952 0.01920246 -0.037613238
## median_days_on_market -0.064898543 -0.05103806 -0.969925859
## PC4 PC5 PC6
## total_homes_sold -0.013005627 0.0009467114 -0.662923582
## median_sale_price 0.321385009 -0.0934434067 -0.077856371
## total_new_listings 0.004213931 -0.0013971181 -0.026426255
## median_new_listing_price 0.313010353 -0.0773727930 0.040116449
## active_listings -0.003044756 0.0009112646 0.708773054
## median_active_list_price 0.320774894 -0.0916466193 0.035991441
## average_of_median_list_price -0.420458296 0.7978788323 -0.004874567
## average_of_median_offer_price -0.719725206 -0.5830685576 0.001455649
## median_days_on_market 0.029222219 -0.0173039103 -0.220187479
## PC7 PC8 PC9
## total_homes_sold 0.389659971 -0.2357630905 0.033489519
## median_sale_price 0.332054247 0.7396974700 0.050098551
## total_new_listings -0.735399081 0.3234121444 -0.054855157
## median_new_listing_price -0.214053026 -0.3945456923 0.683261643
## active_listings 0.375222477 -0.0880593634 0.026361007
## median_active_list_price -0.090194362 -0.3593872897 -0.724749880
## average_of_median_list_price 0.003230162 0.0080523228 -0.010979765
## average_of_median_offer_price -0.001798892 0.0001657396 0.001039342
## median_days_on_market -0.048311260 -0.0048121403 -0.020835075
summary(pcaobj)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0140 1.6664 1.0147 0.83196 0.45423 0.35794 0.22747
## Proportion of Variance 0.4507 0.3085 0.1144 0.07691 0.02292 0.01424 0.00575
## Cumulative Proportion 0.4507 0.7592 0.8737 0.95055 0.97348 0.98771 0.99346
## PC8 PC9
## Standard deviation 0.2013 0.13529
## Proportion of Variance 0.0045 0.00203
## Cumulative Proportion 0.9980 1.00000
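The "Proportion of Variance" row can be reproduced by hand from the standard deviations, which may help when reading the summary. The sketch below copies the nine standard deviations from the printed prcomp output rather than reading them from pcaobj, so the variable names here (sdev, prop_var, cum_var) are illustrative only.

```r
# Standard deviations copied from the prcomp output above
sdev <- c(2.0140181, 1.6663949, 1.0147369, 0.8319572, 0.4542260,
          0.3579423, 0.2274711, 0.2013117, 0.1352858)

# Each component's variance is its squared standard deviation
eigenvalues <- sdev^2

# Share of total variance per component, and the running total
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

print(round(rbind(prop_var, cum_var), 4))
```

Because the data were standardized, the variances sum to roughly 9, the number of variables.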
ev <- eigen(cor(X_std))
ev$values
## [1] 4.05626880 2.77687183 1.02969093 0.69215276 0.20632123 0.12812271 0.05174309
## [8] 0.04052641 0.01830224
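# Scree plot: eigenvalue against component number
plot(ev$values, type = "b", xlab = "Component", ylab = "Eigenvalue")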
# Rule of thumb: keep enough components to explain about 80% of the variance.
# PC1 and PC2 together reach only about 76%, so that means PC1, PC2, and PC3 (about 87%).
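The rule of thumb can also be checked numerically instead of by eye. This is a minimal sketch using the eigenvalues copied from the output above; the 0.80 cutoff comes from the rule of thumb, while the eigenvalue-greater-than-1 check (often called the Kaiser criterion) is an extra assumption added here for comparison and is not part of the original analysis.

```r
# Eigenvalues copied from the ev$values output above
eigenvalues <- c(4.05626880, 2.77687183, 1.02969093, 0.69215276, 0.20632123,
                 0.12812271, 0.05174309, 0.04052641, 0.01830224)

# Cumulative share of variance explained
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of components reaching at least 80% of the variance
k_80 <- which(cum_var >= 0.80)[1]

# Number of components with eigenvalue above 1 (Kaiser criterion, an added assumption)
k_kaiser <- sum(eigenvalues > 1)

print(c(k_80 = k_80, k_kaiser = k_kaiser))
```

Both checks agree with keeping three components, although the third eigenvalue (about 1.03) only barely clears the Kaiser cutoff.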
From the scree plot above and the rule of thumb, I will probably keep PC1, PC2, and PC3. The variables with the largest loadings on each component are:

PC1: median_sale_price, median_new_listing_price, median_active_list_price, average_of_median_list_price, and average_of_median_offer_price
PC2: total_homes_sold, total_new_listings, and active_listings
PC3: median_days_on_market
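The contributor lists can be derived from the rotation matrix rather than read off manually. The sketch below retypes the PC1 to PC3 loadings from the printed output, rounded to three decimals, and flags variables whose absolute loading exceeds 0.3. The 0.3 cutoff is an arbitrary assumption chosen for illustration, not a value used in the analysis.

```r
vars <- c("total_homes_sold", "median_sale_price", "total_new_listings",
          "median_new_listing_price", "active_listings",
          "median_active_list_price", "average_of_median_list_price",
          "average_of_median_offer_price", "median_days_on_market")

# Loadings copied from the Rotation output above (rounded), filled column by column
loadings <- matrix(
  c( 0.026,  0.470,  0.040,  0.474,  0.009,  0.474,  0.428,  0.374, -0.065,
    -0.571,  0.022, -0.587,  0.016, -0.571,  0.015,  0.019,  0.019, -0.051,
     0.159, -0.025,  0.071, -0.033, -0.150, -0.008, -0.055, -0.038, -0.970),
  ncol = 3, dimnames = list(vars, c("PC1", "PC2", "PC3"))
)

cutoff <- 0.3  # assumed threshold for a "main" contributor

# For each component, list the variables whose absolute loading exceeds the cutoff
contributors <- sapply(colnames(loadings), function(pc) {
  rownames(loadings)[abs(loadings[, pc]) > cutoff]
}, simplify = FALSE)

print(contributors)
```

With the real object, the same logic would apply to pcaobj$rotation[, 1:3] directly.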
biplot(pcaobj,scale=0, cex=1)
library(factoextra)  # provides get_pca_var()
var <- get_pca_var(pcaobj)
library(corrplot)
# cos2 is not a correlation matrix, so is.corr = FALSE
corrplot(var$cos2, is.corr = FALSE)
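For reference, the cos2 values can be approximated in base R. As far as I understand it, for a prcomp object the variable coordinates are the loadings scaled by the component standard deviations, and cos2 is the square of those coordinates. The sketch below uses the built-in USArrests dataset as a stand-in (an assumption, since the housing data is not reproduced here) and is not a copy of factoextra's internals.

```r
# Stand-in data: built-in USArrests, standardized inside prcomp
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: each loading column multiplied by its component sdev
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: quality of representation of each variable on each component
cos2 <- coord^2

print(round(cos2, 3))
print(rowSums(cos2))
```

With standardized data the coordinates are the correlations between variables and components, so each row of cos2 sums to 1 when all components are kept.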
I am thinking of embedding this part in the factor analysis section of my final paper. There I will compare the results of factor analysis and PCA, specifically the differences in the weight (importance) each method assigns to each variable.
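One possible way to set up that comparison is to put the share of each variable's variance captured by the retained components next to the communalities from a factor model. The sketch below is only an illustration on synthetic data (two simulated latent factors with three indicators each); all names in it (X_demo, pca_demo, fa_demo) are hypothetical, and the number of factors would need to be chosen for the real data.

```r
# Synthetic data only: two latent factors, three noisy indicators each
set.seed(1)
n <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
X_demo <- cbind(
  a1 = f1 + rnorm(n, sd = 0.5),
  a2 = f1 + rnorm(n, sd = 0.5),
  a3 = f1 + rnorm(n, sd = 0.5),
  b1 = f2 + rnorm(n, sd = 0.5),
  b2 = f2 + rnorm(n, sd = 0.5),
  b3 = f2 + rnorm(n, sd = 0.5)
)

k <- 2  # number of components / factors retained in this demo

# PCA on standardized data; scale loadings by component sdev
pca_demo <- prcomp(X_demo, scale. = TRUE)
pca_load <- sweep(pca_demo$rotation[, 1:k], 2, pca_demo$sdev[1:k], `*`)

# Variance of each variable captured by the first k components
pca_communality <- rowSums(pca_load^2)

# Maximum-likelihood factor analysis; communality = 1 - uniqueness
fa_demo <- factanal(X_demo, factors = k)
fa_communality <- 1 - fa_demo$uniquenesses

comparison <- data.frame(pca = round(pca_communality, 3),
                         fa = round(fa_communality, 3))
print(comparison)
```

Comparing these per-variable totals rather than raw loadings avoids having to line up rotated factors against unrotated components.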