Step 1. Establish the optimal number of components: visualize the scree plot and explain your decision

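library(corrplot)
# Correlation matrix of the raw variables, ordered by hierarchical clustering
datamatrix <- cor(data_X)
corrplot(datamatrix, order = "hclust", type = "upper", tl.srt = 45)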

# X_std is assumed to be the standardized version of data_X
cov_mat <- cov(X_std)
pcaobj <- prcomp(X_std)
print(pcaobj)
## Standard deviations (1, .., p=9):
## [1] 2.0140181 1.6663949 1.0147369 0.8319572 0.4542260 0.3579423 0.2274711
## [8] 0.2013117 0.1352858
## 
## Rotation (n x k) = (9 x 9):
##                                        PC1         PC2          PC3
## total_homes_sold               0.026214138 -0.57072301  0.159404117
## median_sale_price              0.470008835  0.02168610 -0.024850850
## total_new_listings             0.039679991 -0.58677946  0.070576135
## median_new_listing_price       0.474098524  0.01566229 -0.032900591
## active_listings                0.008739652 -0.57069952 -0.150383084
## median_active_list_price       0.473896842  0.01475440 -0.007512401
## average_of_median_list_price   0.427735256  0.01898059 -0.055379485
## average_of_median_offer_price  0.374481952  0.01920246 -0.037613238
## median_days_on_market         -0.064898543 -0.05103806 -0.969925859
##                                        PC4           PC5          PC6
## total_homes_sold              -0.013005627  0.0009467114 -0.662923582
## median_sale_price              0.321385009 -0.0934434067 -0.077856371
## total_new_listings             0.004213931 -0.0013971181 -0.026426255
## median_new_listing_price       0.313010353 -0.0773727930  0.040116449
## active_listings               -0.003044756  0.0009112646  0.708773054
## median_active_list_price       0.320774894 -0.0916466193  0.035991441
## average_of_median_list_price  -0.420458296  0.7978788323 -0.004874567
## average_of_median_offer_price -0.719725206 -0.5830685576  0.001455649
## median_days_on_market          0.029222219 -0.0173039103 -0.220187479
##                                        PC7           PC8          PC9
## total_homes_sold               0.389659971 -0.2357630905  0.033489519
## median_sale_price              0.332054247  0.7396974700  0.050098551
## total_new_listings            -0.735399081  0.3234121444 -0.054855157
## median_new_listing_price      -0.214053026 -0.3945456923  0.683261643
## active_listings                0.375222477 -0.0880593634  0.026361007
## median_active_list_price      -0.090194362 -0.3593872897 -0.724749880
## average_of_median_list_price   0.003230162  0.0080523228 -0.010979765
## average_of_median_offer_price -0.001798892  0.0001657396  0.001039342
## median_days_on_market         -0.048311260 -0.0048121403 -0.020835075
summary(pcaobj)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0140 1.6664 1.0147 0.83196 0.45423 0.35794 0.22747
## Proportion of Variance 0.4507 0.3085 0.1144 0.07691 0.02292 0.01424 0.00575
## Cumulative Proportion  0.4507 0.7592 0.8737 0.95055 0.97348 0.98771 0.99346
##                           PC8     PC9
## Standard deviation     0.2013 0.13529
## Proportion of Variance 0.0045 0.00203
## Cumulative Proportion  0.9980 1.00000
ev <- eigen(cor(X_std))
ev$values
## [1] 4.05626880 2.77687183 1.02969093 0.69215276 0.20632123 0.12812271 0.05174309
## [8] 0.04052641 0.01830224
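# Scree plot: eigenvalue against component number
plot(ev$values, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")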

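# Rule of thumb: keep enough components to explain about 80% of the total variance.
# PC1 and PC2 together explain 75.9%; adding PC3 brings the total to 87.4%,
# so I keep PC1, PC2, and PC3. These are also the only components with eigenvalues above 1.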
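
The 80% rule can also be checked numerically rather than read off the plot. The following is a minimal sketch that uses the eigenvalues printed above; the 0.80 cutoff and the names `eig`, `n_keep`, and `n_kaiser` are illustrative choices, not part of the original analysis.

```r
# Eigenvalues copied from the ev$values output above
eig <- c(4.05626880, 2.77687183, 1.02969093, 0.69215276, 0.20632123,
         0.12812271, 0.05174309, 0.04052641, 0.01830224)

prop_var <- eig / sum(eig)    # proportion of variance per component
cum_var  <- cumsum(prop_var)  # cumulative proportion of variance

# Smallest number of components whose cumulative share reaches the cutoff
cutoff <- 0.80                # assumed threshold taken from the rule of thumb
n_keep <- which(cum_var >= cutoff)[1]

# Kaiser criterion as a cross-check: count eigenvalues greater than 1
n_kaiser <- sum(eig > 1)

print(round(cum_var, 4))
print(c(n_keep = n_keep, n_kaiser = n_kaiser))
```

With these eigenvalues, both criteria point to three components.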

Step 2. Visualize PC1 and PC2 and describe which variables contribute to each principal component


From the scree plot above and the rule of thumb, I will keep PC1, PC2, and PC3.

PC1 contributors: median_sale_price, median_new_listing_price, median_active_list_price, average_of_median_list_price, and average_of_median_offer_price (the price variables)
PC2 contributors: total_homes_sold, total_new_listings, and active_listings (the volume variables)
PC3 contributors: median_days_on_market
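
The contributor lists can be reproduced from the loadings with a simple cutoff. This is a minimal sketch: the loadings are copied (rounded to four decimals) from the print(pcaobj) output, and the 0.3 cutoff is an assumed, illustrative threshold. With the fitted object available, `pcaobj$rotation[, 1:3]` could be used in place of the hand-typed matrix.

```r
# Loadings for PC1-PC3, copied (rounded) from the print(pcaobj) output above
vars <- c("total_homes_sold", "median_sale_price", "total_new_listings",
          "median_new_listing_price", "active_listings",
          "median_active_list_price", "average_of_median_list_price",
          "average_of_median_offer_price", "median_days_on_market")
loadings <- matrix(
  c( 0.0262, -0.5707,  0.1594,
     0.4700,  0.0217, -0.0249,
     0.0397, -0.5868,  0.0706,
     0.4741,  0.0157, -0.0329,
     0.0087, -0.5707, -0.1504,
     0.4739,  0.0148, -0.0075,
     0.4277,  0.0190, -0.0554,
     0.3745,  0.0192, -0.0376,
    -0.0649, -0.0510, -0.9699),
  ncol = 3, byrow = TRUE,
  dimnames = list(vars, c("PC1", "PC2", "PC3"))
)

# Variables whose absolute loading exceeds the (assumed) cutoff
cutoff <- 0.3
contributors <- lapply(colnames(loadings), function(pc) {
  rownames(loadings)[abs(loadings[, pc]) > cutoff]
})
names(contributors) <- colnames(loadings)
print(contributors)
```

The sign of a loading is ignored here because only the magnitude determines how strongly a variable contributes to a component.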

biplot(pcaobj, scale = 0, cex = 1)

library(factoextra)  # provides get_pca_var()
var <- get_pca_var(pcaobj)
library(corrplot)
# cos2: quality of representation of each variable on each component
corrplot(var$cos2, is.corr = FALSE)

Step 3. Reflect how you could use the reduced dimensionality in your final paper

I plan to include this analysis in the factor analysis section of my final paper. I will compare the factor analysis and PCA results, focusing on how the weight (importance) assigned to each variable differs between the two methods.
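
One possible way to set up that comparison is sketched below. It uses simulated data as a stand-in for the housing variables, so every name in it (`sim`, `f1`, `f2`, `a1` to `b3`) is hypothetical, and the choice of two factors with a varimax rotation is an assumption made only for illustration.

```r
set.seed(1)
n  <- 500
f1 <- rnorm(n)  # hypothetical latent factor 1
f2 <- rnorm(n)  # hypothetical latent factor 2

# Six simulated variables: three driven by each latent factor
sim <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.5),
  a2 = f1 + rnorm(n, sd = 0.5),
  a3 = f1 + rnorm(n, sd = 0.5),
  b1 = f2 + rnorm(n, sd = 0.5),
  b2 = f2 + rnorm(n, sd = 0.5),
  b3 = f2 + rnorm(n, sd = 0.5)
)

pca <- prcomp(sim, scale. = TRUE)
fa  <- factanal(sim, factors = 2, rotation = "varimax")

# Scale the PCA loadings by the component standard deviations so they are
# on the same correlation scale as the factor loadings
pca_load <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])
fa_load  <- unclass(fa$loadings)

# Side-by-side table of variable weights under the two methods
comparison <- data.frame(round(pca_load, 2), round(fa_load, 2))
names(comparison) <- c("PC1", "PC2", "Factor1", "Factor2")
print(comparison)
```

For the real data, `sim` would be replaced by the standardized housing variables and the number of factors set to match the number of retained components.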