1. The Kaiser-Guttman rule and the Scree test

In the video, you saw the three most common methods used to decide how many principal components to retain:

  1. Kaiser-Guttman rule
  2. Scree test (constructing the screeplot)
  3. Parallel Analysis

Your task now is to apply the first two to R's built-in airquality dataset; parallel analysis follows in the next exercise.

# install.packages("FactoMineR", dependencies = TRUE)
library(FactoMineR)

# Conduct a PCA on the airquality dataset
pca_air <- PCA(airquality)
## Warning in PCA(airquality): Missing values are imputed by the mean of the
## variable: you should use the imputePCA function of the missMDA package

# Apply the Kaiser-Guttman rule
summary(pca_air, ncp = 4)
## 
## Call:
## PCA(X = airquality) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               2.318   1.165   0.983   0.790   0.435   0.310
## % of var.             38.625  19.411  16.385  13.175   7.246   5.158
## Cumulative % of var.  38.625  58.036  74.421  87.596  94.842 100.000
## 
## Individuals (the 10 first)
##             Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1       |  2.582 | -0.570  0.092  0.049 | -1.539  1.329  0.355 | -0.229
## 2       |  2.404 | -0.663  0.124  0.076 | -0.922  0.477  0.147 | -0.437
## 3       |  2.473 | -1.536  0.665  0.386 | -1.246  0.871  0.254 | -0.834
## 4       |  3.101 | -1.536  0.665  0.245 | -2.467  3.416  0.633 | -0.148
## 5       |  3.225 | -2.191  1.354  0.462 | -1.668  1.561  0.267 | -0.136
## 6       |  2.653 | -1.948  1.071  0.540 | -1.549  1.346  0.341 | -0.368
## 7       |  2.667 | -0.947  0.253  0.126 | -2.050  2.358  0.591 |  0.257
## 8       |  3.101 | -2.668  2.008  0.741 | -0.737  0.305  0.057 | -0.302
## 9       |  4.380 | -3.841  4.161  0.769 | -0.329  0.061  0.006 | -0.874
## 10      |  1.863 | -0.679  0.130  0.133 | -1.106  0.687  0.353 |  0.455
##            ctr   cos2    Dim.4    ctr   cos2  
## 1        0.035  0.008 | -1.861  2.863  0.519 |
## 2        0.127  0.033 | -2.072  3.551  0.743 |
## 3        0.463  0.114 | -1.001  0.828  0.164 |
## 4        0.015  0.002 | -0.318  0.084  0.011 |
## 5        0.012  0.002 | -0.782  0.506  0.059 |
## 6        0.090  0.019 | -0.445  0.163  0.028 |
## 7        0.044  0.009 | -0.696  0.400  0.068 |
## 8        0.060  0.009 | -1.140  1.074  0.135 |
## 9        0.508  0.040 | -0.526  0.229  0.014 |
## 10       0.137  0.060 | -1.207  1.205  0.420 |
## 
## Variables
##            Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Ozone   |  0.828 29.610  0.686 | -0.078  0.517  0.006 |  0.295  8.877
## Solar.R |  0.385  6.402  0.148 | -0.720 44.559  0.519 |  0.167  2.821
## Wind    | -0.715 22.029  0.511 | -0.178  2.719  0.032 | -0.200  4.072
## Temp    |  0.866 32.341  0.750 |  0.056  0.267  0.003 | -0.126  1.612
## Month   |  0.447  8.608  0.199 |  0.558 26.725  0.311 | -0.514 26.881
## Day     | -0.153  1.010  0.023 |  0.542 25.212  0.294 |  0.740 55.737
##           cos2    Dim.4    ctr   cos2  
## Ozone    0.087 | -0.082  0.854  0.007 |
## Solar.R  0.028 |  0.481 29.245  0.231 |
## Wind     0.040 |  0.493 30.782  0.243 |
## Temp     0.016 |  0.127  2.039  0.016 |
## Month    0.264 |  0.404 20.665  0.163 |
## Day      0.548 |  0.360 16.416  0.130 |
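The Kaiser-Guttman rule keeps the components whose eigenvalue exceeds 1, that is, components that explain more variance than a single standardized variable. As a minimal sketch, the eigenvalues below are typed in from the "Variance" row of the summary above; with the fitted object they should also be available in the first column of pca_air$eig.

```r
# Eigenvalues copied from the "Variance" row of summary(pca_air) above.
eigenvalues <- c(Dim.1 = 2.318, Dim.2 = 1.165, Dim.3 = 0.983,
                 Dim.4 = 0.790, Dim.5 = 0.435, Dim.6 = 0.310)

# Kaiser-Guttman rule: retain components with an eigenvalue greater than 1.
keep <- eigenvalues > 1
n_keep <- sum(keep)
names(eigenvalues)[keep]  # "Dim.1" "Dim.2"
n_keep                    # 2

# Share of total variance captured by the retained components
# (compare with the cumulative percentage for Dim.2 printed above).
round(100 * sum(eigenvalues[keep]) / sum(eigenvalues), 1)
```

On this dataset the rule suggests two components, since Dim.3 (0.983) falls just below the cutoff.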
# Perform the screeplot test
# install.packages("factoextra", dependencies = TRUE)
library(factoextra)
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_screeplot(pca_air, ncp = 5)

2. Parallel Analysis with paran()

In this exercise, you will use two R functions for conducting parallel analysis for PCA:

  • paran() of the paran package and
  • fa.parallel() of the psych package.

fa.parallel() has one advantage over paran(): it lets you use more of your data when building the correlation matrix. paran(), on the other hand, does not handle missing data, so you should exclude missing values before passing the data to the function. To check the suggested number of PCs to retain, use the ncomp element of fa.parallel()'s output object.

The built-in R dataset airquality, on which you will be doing your parallel analyses, describes daily air quality measurements in New York from May to September 1973 and includes missing values.

# Subset the complete rows of airquality.
airquality_complete <- airquality[complete.cases(airquality), ]

# Conduct a parallel analysis with paran().
# install.packages("paran", dependencies = TRUE)
library(paran)
## Loading required package: MASS
air_paran <- paran(airquality_complete, seed = 1)
## 
## Using eigendecomposition of correlation matrix.
## Computing: 10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
## 
## 
## Results of Horn's Parallel Analysis for component retention
## 180 iterations, using the mean estimate
## 
## -------------------------------------------------- 
## Component   Adjusted    Unadjusted    Estimated 
##             Eigenvalue  Eigenvalue    Bias 
## -------------------------------------------------- 
## 1           2.137510    2.468840      0.331329
## -------------------------------------------------- 
## 
## Adjusted eigenvalues > 1 indicate dimensions to retain.
## (1 components retained)
# Check out air_paran's suggested number of PCs to retain.
air_paran
## $Retained
## [1] 1
## 
## $AdjEv
## [1] 2.1375108 0.9482615 0.9527979 0.8286986 0.6025979 0.5301334
## 
## $Ev
## [1] 2.4688406 1.1131258 0.9983882 0.7682592 0.4246993 0.2266869
## 
## $RndEv
## [1] 1.3313298 1.1648643 1.0455903 0.9395607 0.8221013 0.6965535
## 
## $Bias
## [1]  0.33132984  0.16486435  0.04559032 -0.06043933 -0.17789868 -0.30344649
## 
## $SimEvs
##            [,1]     [,2]      [,3]      [,4]      [,5]      [,6]
##   [1,] 1.224118 1.131226 1.0752308 0.9935903 0.8193453 0.7564889
##   [2,] 1.268741 1.078492 0.9930671 0.9748260 0.8759164 0.8089578
##   [3,] 1.516501 1.083730 0.9969250 0.9033106 0.8238899 0.6756435
##   [4,] 1.275651 1.124091 1.0542289 0.9815151 0.8237031 0.7408112
##   [5,] 1.315201 1.247901 1.0034169 0.9226841 0.8431972 0.6676001
##   [6,] 1.354071 1.112538 1.0270687 0.9862366 0.8135938 0.7064921
##   [7,] 1.349247 1.325950 0.9551017 0.8456129 0.8146764 0.7094119
##   [8,] 1.479519 1.233551 1.0852321 0.8035660 0.7469296 0.6512024
##   [9,] 1.383507 1.193352 1.0492556 0.9750082 0.7896749 0.6092026
##  [10,] 1.196308 1.124168 1.0382455 1.0081813 0.8551243 0.7779725
##  [11,] 1.355647 1.215461 1.0596736 0.8782092 0.7808199 0.7101887
##  [12,] 1.480487 1.181731 1.0046303 0.8650389 0.7503372 0.7177751
##  [13,] 1.365008 1.178037 1.0483682 0.8898827 0.8544917 0.6642125
##  [14,] 1.434219 1.184054 1.0290323 0.9012887 0.7803443 0.6710615
##  [15,] 1.309963 1.230484 1.0097014 0.8910772 0.8058219 0.7529518
##  [16,] 1.352824 1.120526 0.9915702 0.9066423 0.8572514 0.7711861
##  [17,] 1.215240 1.115258 1.0880958 0.9650069 0.8983806 0.7180188
##  [18,] 1.200463 1.097499 1.0174071 0.9983648 0.9428617 0.7434040
##  [19,] 1.281613 1.139133 1.0509600 0.9722194 0.8734553 0.6826196
##  [20,] 1.429993 1.139216 1.0257668 0.9371782 0.8307338 0.6371114
##  [21,] 1.274397 1.161670 1.1189738 1.0144499 0.8113773 0.6191323
##  [22,] 1.298627 1.171242 1.1192630 0.9869436 0.7780373 0.6458873
##  [23,] 1.336722 1.188723 1.0522123 0.9422725 0.7706136 0.7094565
##  [24,] 1.377848 1.136640 1.0573741 0.9027620 0.8217153 0.7036607
##  [25,] 1.167928 1.118408 1.0452599 0.9850514 0.8662276 0.8171250
##  [26,] 1.286621 1.094897 1.0589507 0.9550176 0.8704555 0.7340582
##  [27,] 1.502922 1.133530 1.0208576 0.8509795 0.8428670 0.6488442
##  [28,] 1.290053 1.205168 1.0352525 0.9830437 0.8269086 0.6595748
##  [29,] 1.288588 1.164356 1.0785146 0.9481275 0.8051331 0.7152816
##  [30,] 1.247194 1.159532 1.0287545 1.0162000 0.8626717 0.6856479
##  [31,] 1.315107 1.094759 1.0488663 0.9967186 0.7900381 0.7545110
##  [32,] 1.217838 1.110290 1.0821962 1.0076891 0.8502416 0.7317457
##  [33,] 1.303085 1.176155 1.0705407 0.9127635 0.8015662 0.7358893
##  [34,] 1.294644 1.144322 1.0241828 0.9297681 0.8444462 0.7626369
##  [35,] 1.383665 1.150966 1.0968468 0.9013357 0.8344591 0.6327275
##  [36,] 1.500264 1.098856 0.9842212 0.9583670 0.7990819 0.6592101
##  [37,] 1.432081 1.133216 1.0156538 0.9753212 0.7790055 0.6647225
##  [38,] 1.241306 1.190111 1.0580949 0.9538305 0.8117722 0.7448855
##  [39,] 1.500705 1.146445 1.0389146 1.0019294 0.7661906 0.5458146
##  [40,] 1.356392 1.178044 1.0819563 0.9711388 0.7571661 0.6553031
##  [41,] 1.326852 1.214333 1.0222281 0.8936583 0.8057597 0.7371692
##  [42,] 1.227394 1.172616 1.0982472 0.9892330 0.8163092 0.6962008
##  [43,] 1.418379 1.219434 1.0715631 0.9686858 0.7992939 0.5226434
##  [44,] 1.355480 1.177922 1.0874628 0.8994612 0.7866652 0.6930092
##  [45,] 1.397271 1.149776 1.0636688 0.9000110 0.7712021 0.7180710
##  [46,] 1.245930 1.113199 0.9956366 0.9649122 0.8621183 0.8182043
##  [47,] 1.296584 1.169497 0.9699690 0.9318219 0.8364618 0.7956669
##  [48,] 1.311610 1.163565 1.0101828 0.8960687 0.8604555 0.7581179
##  [49,] 1.235026 1.161982 0.9842707 0.9632626 0.9016942 0.7537650
##  [50,] 1.230424 1.210636 1.0671649 0.9644451 0.7769693 0.7503614
##  [51,] 1.301309 1.210494 1.1237719 0.9616775 0.7094615 0.6932867
##  [52,] 1.253942 1.144620 1.0997036 0.9324509 0.8465504 0.7227327
##  [53,] 1.390297 1.224710 1.1319662 0.9256544 0.7067932 0.6205795
##  [54,] 1.348815 1.195180 1.0311313 0.9113535 0.7909776 0.7225424
##  [55,] 1.363731 1.221170 1.0598420 0.8855564 0.8286112 0.6410890
##  [56,] 1.265898 1.179480 1.0210498 0.9731966 0.7840419 0.7763339
##  [57,] 1.296328 1.203475 1.0072713 0.8926310 0.8385882 0.7617058
##  [58,] 1.293159 1.173373 1.0313460 1.0123721 0.8314274 0.6583222
##  [59,] 1.253301 1.117383 0.9969127 0.9586626 0.8944106 0.7793300
##  [60,] 1.262664 1.155809 1.1295790 0.9354351 0.8237339 0.6927787
##  [61,] 1.383129 1.195118 1.0517181 0.9247330 0.8354336 0.6098687
##  [62,] 1.308953 1.181685 1.0365917 0.9713006 0.8202104 0.6812599
##  [63,] 1.344078 1.163544 0.9520393 0.9364791 0.8620707 0.7417889
##  [64,] 1.531044 1.165391 1.1409087 0.8906585 0.7274974 0.5445003
##  [65,] 1.272825 1.220320 1.0883338 0.9499726 0.8209410 0.6476071
##  [66,] 1.365451 1.183139 1.0971806 0.9461657 0.8221517 0.5859115
##  [67,] 1.350555 1.121161 1.0516732 0.9609751 0.8295301 0.6861054
##  [68,] 1.297714 1.088284 1.0854776 0.9034904 0.8756759 0.7493585
##  [69,] 1.258360 1.184816 1.0833513 0.9440709 0.8249875 0.7044145
##  [70,] 1.360884 1.202978 1.0359219 0.8561910 0.8368138 0.7072110
##  [71,] 1.372508 1.141583 1.0734941 1.0364194 0.7562971 0.6196985
##  [72,] 1.305170 1.138238 0.9673124 0.9628976 0.8585089 0.7678732
##  [73,] 1.421421 1.171755 1.0524431 0.9449634 0.7303744 0.6790426
##  [74,] 1.230346 1.175767 1.0069927 0.9831804 0.8438422 0.7598716
##  [75,] 1.561418 1.144113 0.9759837 0.9327225 0.7811597 0.6046025
##  [76,] 1.246979 1.084998 1.0725859 1.0304370 0.8586456 0.7063550
##  [77,] 1.374296 1.206760 1.0406182 0.9236351 0.7672561 0.6874345
##  [78,] 1.349302 1.241454 1.1418002 0.8128299 0.7822079 0.6724060
##  [79,] 1.426358 1.085935 1.0002091 0.9193133 0.8142966 0.7538884
##  [80,] 1.255644 1.148588 1.0346601 0.9711501 0.8411516 0.7488056
##  [81,] 1.416326 1.136058 1.0671428 0.9094559 0.8563590 0.6146583
##  [82,] 1.235646 1.120424 1.0654215 1.0003666 0.8395815 0.7385608
##  [83,] 1.309437 1.120046 1.0140890 0.8961578 0.8767492 0.7835212
##  [84,] 1.353748 1.180640 1.0667022 0.8745794 0.7818777 0.7424531
##  [85,] 1.289601 1.081858 1.0324944 0.9346241 0.8486028 0.8128202
##  [86,] 1.371884 1.095346 0.9876916 0.9416828 0.8687028 0.7346931
##  [87,] 1.334824 1.177365 1.0111907 0.9022971 0.8534819 0.7208412
##  [88,] 1.278658 1.132988 1.0056342 0.9203163 0.8489453 0.8134587
##  [89,] 1.350017 1.151034 1.0245123 0.9288475 0.8643915 0.6811983
##  [90,] 1.376702 1.164864 1.0553996 0.9215212 0.8774015 0.6041115
##  [91,] 1.433322 1.171291 1.0769126 0.9253315 0.7592811 0.6338619
##  [92,] 1.184326 1.160059 1.0494062 0.9501484 0.8696776 0.7863831
##  [93,] 1.421712 1.125479 1.0229255 0.9011297 0.8705914 0.6581621
##  [94,] 1.353508 1.174272 0.9968587 0.9705144 0.8517735 0.6530728
##  [95,] 1.326724 1.179941 1.0819087 0.9099860 0.8181542 0.6832864
##  [96,] 1.267852 1.192097 1.0613213 0.9343009 0.8491104 0.6953186
##  [97,] 1.439927 1.196617 1.0221718 0.9641855 0.7363012 0.6407973
##  [98,] 1.264309 1.164978 1.0144831 0.9633449 0.8408765 0.7520083
##  [99,] 1.333953 1.139375 1.0722555 0.9822795 0.8320994 0.6400375
## [100,] 1.230198 1.085176 1.0211460 0.9679588 0.9344327 0.7610883
## [101,] 1.275576 1.174322 1.0315770 0.8966313 0.8532778 0.7686162
## [102,] 1.258433 1.151753 1.0173267 0.9587395 0.8796857 0.7340620
## [103,] 1.410101 1.315501 1.0222753 0.8956742 0.7570464 0.5994019
## [104,] 1.277467 1.235793 0.9919912 0.9085929 0.8705002 0.7156558
## [105,] 1.372219 1.192988 1.0068545 0.9000525 0.8663024 0.6615831
## [106,] 1.339290 1.172100 1.0522358 0.9438207 0.7845360 0.7080165
## [107,] 1.259836 1.200126 1.0375907 0.9265690 0.8695238 0.7063553
## [108,] 1.275223 1.141782 1.0586116 0.9675778 0.8421366 0.7146697
## [109,] 1.388065 1.062500 1.0334845 0.9145595 0.8744462 0.7269452
## [110,] 1.406495 1.142878 1.0682931 0.9395096 0.7929826 0.6498413
## [111,] 1.372262 1.260930 1.0523427 0.9662200 0.7199244 0.6283216
## [112,] 1.413715 1.200698 1.0880080 0.8415065 0.7512209 0.7048515
## [113,] 1.285306 1.195523 1.0387358 0.9764442 0.8654714 0.6385199
## [114,] 1.297262 1.143197 1.0166682 0.9483736 0.8709473 0.7235518
## [115,] 1.301080 1.250272 1.1329333 0.9707244 0.7281906 0.6168006
## [116,] 1.229024 1.191066 1.0998519 1.0079533 0.7499848 0.7221205
## [117,] 1.370920 1.196362 1.0286956 0.9368311 0.8103066 0.6568841
## [118,] 1.377728 1.191513 0.9931849 0.8975421 0.7982432 0.7417885
## [119,] 1.304057 1.129281 1.0235109 0.9803878 0.8644574 0.6983060
## [120,] 1.261980 1.095052 1.0460005 0.9519124 0.8896488 0.7554064
## [121,] 1.335215 1.178312 1.0962483 1.0535747 0.7555006 0.5811497
## [122,] 1.477378 1.113452 0.9798634 0.8926823 0.8466256 0.6899991
## [123,] 1.288906 1.157825 1.0395820 0.9853136 0.9340215 0.5943516
## [124,] 1.348828 1.196020 1.0096877 0.9527047 0.8476101 0.6451495
## [125,] 1.345336 1.197583 1.0801170 0.9446689 0.7234890 0.7088060
## [126,] 1.349707 1.149391 1.0041329 0.9899834 0.8246650 0.6821207
## [127,] 1.400363 1.109139 0.9608139 0.9230087 0.8217975 0.7848779
## [128,] 1.306060 1.191992 1.0233480 0.9653893 0.8032209 0.7099904
## [129,] 1.305945 1.178186 1.0199141 0.9850006 0.8412813 0.6696735
## [130,] 1.318446 1.134878 1.0332927 0.9466785 0.8264406 0.7402635
## [131,] 1.428465 1.116247 1.0737736 0.9007903 0.8151898 0.6655344
## [132,] 1.356316 1.143602 1.1383255 0.9534557 0.8852277 0.5230727
## [133,] 1.355958 1.286157 1.0010471 0.9406589 0.8004161 0.6157629
## [134,] 1.418964 1.076116 1.0127036 0.9721374 0.7792829 0.7407961
## [135,] 1.469738 1.302096 1.1016390 0.8694868 0.6860795 0.5709601
## [136,] 1.370664 1.187321 1.0642547 0.8630256 0.7945737 0.7201600
## [137,] 1.312399 1.151663 1.0524973 0.9783977 0.8783154 0.6267275
## [138,] 1.229339 1.125029 1.0420546 0.8992309 0.8850607 0.8192854
## [139,] 1.383690 1.188978 0.9999740 0.9180920 0.7738426 0.7354234
## [140,] 1.297919 1.189160 1.0386987 0.9648097 0.8381032 0.6713095
## [141,] 1.422223 1.225830 0.8812756 0.8682304 0.8420318 0.7604091
## [142,] 1.319777 1.099486 0.9817816 0.9468968 0.8958941 0.7561646
## [143,] 1.194842 1.153390 1.0847297 0.9674482 0.8459148 0.7536757
## [144,] 1.316853 1.248634 1.0643644 0.9568810 0.8570024 0.5562644
## [145,] 1.292923 1.159763 1.0291091 0.9205033 0.8501031 0.7475981
## [146,] 1.288863 1.157549 1.0983432 0.9242982 0.8055270 0.7254203
## [147,] 1.300540 1.183350 1.0525308 0.9850334 0.8242239 0.6543215
## [148,] 1.434848 1.238748 1.0423658 0.9394993 0.7331430 0.6113956
## [149,] 1.302028 1.192951 0.9464232 0.9207072 0.8254651 0.8124256
## [150,] 1.399419 1.180305 1.1088879 0.9812033 0.6862374 0.6439468
## [151,] 1.331616 1.141430 1.1011517 1.0169344 0.7557034 0.6531641
## [152,] 1.409681 1.174902 1.0884470 0.8731723 0.8154741 0.6383238
## [153,] 1.280388 1.155483 1.0873585 0.9540923 0.7809118 0.7417661
## [154,] 1.251922 1.177234 1.1441697 0.9494293 0.8551848 0.6220608
## [155,] 1.326250 1.175596 1.0173583 0.9221453 0.8463358 0.7123141
## [156,] 1.317082 1.112763 1.0976221 0.9056369 0.8320331 0.7348624
## [157,] 1.446005 1.160582 1.0709269 0.8938750 0.7756817 0.6529300
## [158,] 1.392210 1.155073 1.0430289 0.9185163 0.8600739 0.6310988
## [159,] 1.270733 1.212669 1.0106105 0.9646833 0.8650210 0.6762829
## [160,] 1.179876 1.169633 1.0648537 0.9655234 0.8971688 0.7229456
## [161,] 1.356567 1.154530 1.0554808 0.9418888 0.8795455 0.6119880
## [162,] 1.389493 1.207301 1.1069064 0.9329762 0.7087772 0.6545453
## [163,] 1.317841 1.262138 1.0961311 0.9341910 0.7383228 0.6513762
## [164,] 1.274252 1.211299 1.0545031 0.9561379 0.7680651 0.7357425
## [165,] 1.320213 1.132978 1.0902116 0.8877619 0.8658726 0.7029624
## [166,] 1.278334 1.115484 1.0399034 0.9711404 0.8910958 0.7040419
## [167,] 1.259543 1.121302 0.9830341 0.9280042 0.8584543 0.8496618
## [168,] 1.328778 1.042107 1.0008968 0.9415997 0.8785200 0.8080982
## [169,] 1.313654 1.162153 1.0741882 0.9346474 0.7896155 0.7257418
## [170,] 1.339718 1.247390 1.0718715 0.8782942 0.7417305 0.7209963
## [171,] 1.330189 1.125621 1.0153115 0.9252616 0.8907308 0.7128867
## [172,] 1.396970 1.176271 1.0555445 0.8970679 0.7672308 0.7069159
## [173,] 1.251013 1.123910 1.0550290 0.9568395 0.8922237 0.7209844
## [174,] 1.410772 1.175246 1.0779690 0.8780586 0.7935113 0.6644428
## [175,] 1.310059 1.114904 1.0196810 0.9296775 0.8993728 0.7263063
## [176,] 1.378394 1.236796 0.9796657 0.9017316 0.7830390 0.7203733
## [177,] 1.376068 1.134129 1.0700317 0.9426921 0.8184312 0.6586479
## [178,] 1.248628 1.178825 1.1129237 0.9723193 0.8163871 0.6709169
## [179,] 1.398654 1.162430 1.0973091 0.9476966 0.7612598 0.6326498
## [180,] 1.238664 1.110698 1.0603892 1.0051988 0.8778525 0.7071975
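The fields printed above fit together simply: $RndEv holds the mean eigenvalues obtained from random data, $Bias is $RndEv - 1, and $AdjEv is $Ev - $Bias. The base-R sketch below re-creates that logic under stated assumptions (uncorrelated standard-normal data of the same size, 180 simulations, the mean as the estimate). It illustrates Horn's procedure rather than reproducing paran()'s own implementation, so its simulated values will not match the output above digit for digit.

```r
# Rough base-R sketch of Horn's parallel analysis (not paran()'s own code).
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

air <- airquality[complete.cases(airquality), ]
n <- nrow(air)
p <- ncol(air)

# Observed eigenvalues of the correlation matrix (compare with $Ev).
obs_ev <- eigen(cor(air), only.values = TRUE)$values

# Eigenvalues of 180 random normal datasets with the same dimensions.
sim_ev <- replicate(180, {
  random_data <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(random_data), only.values = TRUE)$values
})

rnd_ev <- rowMeans(sim_ev)  # mean random eigenvalue per component ($RndEv)
bias   <- rnd_ev - 1        # estimated sampling bias ($Bias)
adj_ev <- obs_ev - bias     # adjusted eigenvalues ($AdjEv)

# Retain the components whose adjusted eigenvalue exceeds 1.
n_retained <- sum(adj_ev > 1)
n_retained
```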
# Conduct a parallel analysis with fa.parallel().
# install.packages("psych", dependencies = TRUE)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
air_fa_parallel <- fa.parallel(airquality)

## Parallel analysis suggests that the number of factors =  3  and the number of components =  1
# Check out air_fa_parallel's suggested number of PCs to retain.
air_fa_parallel
## Call: fa.parallel(x = airquality)
## Parallel analysis suggests that the number of factors =  3  and the number of components =  1 
## 
##  Eigen Values of 
##   Original factors Resampled data Simulated data Original components
## 1             2.02           0.66           0.63                2.43
## 2             0.29           0.18           0.17                1.17
## 3             0.21           0.09           0.08                0.98
##   Resampled components Simulated components
## 1                 1.30                 1.29
## 2                 1.15                 1.14
## 3                 1.03                 1.04

3. Why is mean imputation problematic?

Why is using the mean to impute missing data not the best approach?

  1. Because it is quick.
  2. Because it will distort the real distribution of the variables if the data has a lot of missing values.
  3. Because it is the default method of FactoMineR’s PCA() function.

A better approach is to fit a linear model on the most correlated variable for each of the variables that contain NAs and impute the missing values based on the regression model.
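As a rough illustration of both points, the base-R sketch below uses the Ozone column of airquality. It first shows that mean imputation shrinks the variance, then imputes from a linear model on the most correlated variable. The object names are illustrative, and a single-predictor lm() is only one simple way to implement the idea.

```r
ozone <- airquality$Ozone
missing <- is.na(ozone)

# Mean imputation: every gap gets the same value, which shrinks the variance.
ozone_mean <- ifelse(missing, mean(ozone, na.rm = TRUE), ozone)
var(ozone, na.rm = TRUE)  # variance of the observed values
var(ozone_mean)           # smaller: the imputed points add no spread

# Regression imputation: find the variable most correlated with Ozone ...
cors <- cor(airquality, use = "pairwise.complete.obs")["Ozone", ]
cors <- cors[names(cors) != "Ozone"]
best <- names(which.max(abs(cors)))

# ... fit a linear model on it, and predict the missing values.
fit <- lm(reformulate(best, response = "Ozone"), data = airquality)
ozone_reg <- ozone
ozone_reg[missing] <- predict(fit, newdata = airquality[missing, ])
```

If the chosen predictor had missing values of its own, predict() would return NA for those rows, so a fallback would be needed in that case.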

4. Estimating missing values with missMDA

As you saw in the video, R has two packages for conducting PCA on a dataset with missing values: pcaMethods and missMDA. In this exercise, you are going to use the first method introduced in the video: combining missMDA and FactoMineR. Both packages are loaded for you in this exercise.

The two-step procedure consists of a) estimating the number of dimensions for the PCA by cross-validation and b) estimating the missing values with an iterative PCA algorithm.

In this exercise, you will be working with the ozone dataset of the missMDA package, which includes 112 daily measurements of meteorological variables (wind speed, temperature, rainfall, etc.) and ozone concentration recorded in Rennes (France) during the summer of 2001.

# install.packages("missMDA", dependencies = TRUE)
library(missMDA)
data("ozone", package = "missMDA")

# Check out the number of cells with missing values.
sum(is.na(ozone[,1:11]))
## [1] 126
# Estimate the optimal number of dimensions for imputation. 
ozone_ncp <- estim_ncpPCA(ozone[, 1:11])

# Do the actual data imputation. 
complete_ozone <- imputePCA(ozone[,1:11], ncp = ozone_ncp$ncp, scale = TRUE)
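To finish the procedure, the completed data can be passed to FactoMineR. The following is a minimal sketch that repeats the steps above so it runs on its own; imputePCA() returns a list whose completeObs element holds the dataset with the missing cells filled in. The names ozone_imputed and pca_ozone are illustrative.

```r
library(missMDA)
library(FactoMineR)
data("ozone", package = "missMDA")

# Estimate the number of dimensions, then impute with iterative PCA.
ozone_ncp <- estim_ncpPCA(ozone[, 1:11])
complete_ozone <- imputePCA(ozone[, 1:11], ncp = ozone_ncp$ncp, scale = TRUE)

# The imputed dataset should contain no missing cells.
ozone_imputed <- complete_ozone$completeObs
sum(is.na(ozone_imputed))

# Run the PCA on the completed data (graph = FALSE suppresses the plots).
pca_ozone <- PCA(ozone_imputed, graph = FALSE)
head(pca_ozone$eig)
```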

You now know which steps to follow to conduct PCA on datasets with missing values using both FactoMineR and missMDA.

5. Topic detection with N-NMF: Part I

In the next two exercises, you will be detecting topics in corpora. corpus_tdm, a term-document matrix of 50 texts sampled from the BBCsport dataset and classified into 5 subject areas (athletics, cricket, football, rugby, tennis), is loaded into your workspace. To get verbose output from nmf(), with details about the runtime and the number of iterations required, simply set its .options argument to "v".

In this exercise, your goal is to extract the basis matrix, W. The columns of W can be easily interpreted as the conditional probabilities of the terms in a corpus given a topic, if we normalize their column values to sum to 1. We have prepared and loaded the function normal() for you that achieves just that.

library(NMF)
## Loading required package: pkgmaker
## Loading required package: registry
## 
## Attaching package: 'pkgmaker'
## The following object is masked from 'package:base':
## 
##     isFALSE
## Loading required package: rngtools
## Loading required package: cluster
## NMF - BioConductor layer [NO: missing Biobase] | Shared memory capabilities [OK] | Cores 3/4
##   To enable the Bioconductor layer, try: install.extras('NMF') [with Bioconductor repository enabled]
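# Download the pre-computed NMF result. The variable is not called `url`
# so that it does not shadow the base function url() used below.
bbc_url <- 'https://assets.datacamp.com/production/course_4249/datasets/bbc_res.rds'

bbc_res <- readRDS(gzcon(url(bbc_url)))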

# Get the term-topic matrix W.
W <- basis(bbc_res)

# Check out the dimensions of W.
dim(W)
## [1] 3137    5
# Normalize W.
normal <- function(m) {
  m/sum(m)
}
normal_W <- apply(W, 2, normal)
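
A quick way to convince yourself that the normalization does what is claimed is to try it on a small matrix. The sketch below uses a made-up 3 x 2 matrix (W_toy is hypothetical, not course data) and also shows base R's prop.table() as an equivalent one-liner.

```r
normal <- function(m) {
  m / sum(m)
}

# Hypothetical toy basis matrix: 3 terms (rows) x 2 topics (columns).
W_toy <- matrix(c(2, 1, 1,
                  0, 3, 1),
                nrow = 3,
                dimnames = list(c("term1", "term2", "term3"),
                                c("topic1", "topic2")))

normal_W_toy <- apply(W_toy, 2, normal)
colSums(normal_W_toy)  # each column now sums to 1

# prop.table() with margin = 2 performs the same column-wise normalization.
all.equal(normal_W_toy, prop.table(W_toy, 2))
```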

You created the basis, or term-topic matrix, W. Let’s move on to the extraction of H, the coefficient matrix.

6. Topic detection with N-NMF: Part II

The next step in topic detection is to extract the topic-text, or coefficient, matrix H. The columns of H can be interpreted as the conditional probabilities of the topics given a text in the corpus, if we normalize their column values to sum to 1. bbc_res, the 5-rank approximation of corpus_tdm used in the last exercise, is at your disposal, as is the function normal() for performing the normalization.

# Get the topic-text matrix H.
H <- coef(bbc_res)

# Check out the dimensions of H.
dim(H)
## [1]  5 50
# Normalize H.
normal <- function(m) {
  m/sum(m)
}
normal_H <- apply(H, 2, normal)

You have automatically detected topics in a sample of the BBC corpus! Remember that the columns of W and H can be interpreted as the conditional probabilities of the terms given a topic and of the topics given a text, respectively. As you can see, the number of columns of W and the number of rows of H both equal 5, the rank of the low-rank approximation that we were looking for.
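
Once normalized, the two matrices can be read off directly. The sketch below uses small made-up matrices (W_toy, H_toy and the term names are hypothetical stand-ins for normal_W and normal_H) to list the highest-probability terms per topic and the dominant topic of each text.

```r
# Hypothetical toy factors: 4 terms x 2 topics, and 2 topics x 3 texts.
W_toy <- matrix(c(5.0, 1.0, 0.5, 0.1,
                  0.2, 0.3, 4.0, 6.0),
                nrow = 4,
                dimnames = list(c("goal", "striker", "innings", "wicket"),
                                c("topic1", "topic2")))
H_toy <- matrix(c(3.0, 1.0,
                  0.5, 2.0,
                  1.0, 4.0),
                nrow = 2,
                dimnames = list(c("topic1", "topic2"),
                                c("text1", "text2", "text3")))

# Column-wise normalization, equivalent to apply(x, 2, normal) in the exercises.
normal_W_toy <- prop.table(W_toy, 2)
normal_H_toy <- prop.table(H_toy, 2)

# Two most probable terms for each topic (one column per topic).
top_terms <- apply(normal_W_toy, 2, function(p) {
  names(sort(p, decreasing = TRUE))[1:2]
})

# Most probable topic for each text.
dominant_topic <- rownames(normal_H_toy)[apply(normal_H_toy, 2, which.max)]
names(dominant_topic) <- colnames(normal_H_toy)

top_terms
dominant_topic
```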

7. Trying different N-NMF algorithms

The main differences between the algorithms lie in how the objective function is computed and in the optimization techniques used for the update steps. By default, the NMF package runs brunet, but you can choose any of the 11 algorithms implemented in the package and pass it as the third argument of nmf(). To browse the available N-NMF algorithms, simply use the nmfAlgorithm() function. Called without arguments, it returns a vector with all 11 algorithms, optimized in C++. To get the older versions of some of these algorithms, written in R, set the version argument to "R". Let's put all this into practice!

# Explore the nmf's algorithms.
alg <- nmfAlgorithm()

# Choose the algorithms implemented in R.
R_alg <- nmfAlgorithm(version = "R")

# Get a 5-rank approximation of corpus_tdm.
# bbc_double_opt <- nmf(corpus_tdm, 5, R_alg, .options = 'v')

You are now in a position to use any of the algorithms listed by nmfAlgorithm(). In the upcoming video, I'll demonstrate how to trace performance differences among N-NMF algorithms.