We will apply dimensionality reduction techniques such as Factor Analysis and Item Cluster Analysis to understand the latent constructs in the NDDT (Numeracy Detailed Diagnostic Test) battery.
Factor Analysis assumes that the observed data were produced by the following generative model:
\(X_{d \times 1} = A_{d \times k} F_{k \times 1} + \mu_{d \times 1} + \epsilon_{d \times 1}\)
where \(X\) is the observed \(d \times 1\) response vector, \(A\) is the \(d \times k\) matrix of factor loadings, \(F\) is the \(k \times 1\) vector of latent factors, \(\mu\) is the mean vector, and \(\epsilon\) is item-specific noise.
It is further assumed that \(F\) and \(\epsilon\) are zero-mean and uncorrelated with each other, with \(Cov(F) = I_k\) and \(Cov(\epsilon) = \Psi\) diagonal.
Then, it follows that \(\Sigma \equiv Cov(X)= A A^T + \Psi\)
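This follows in one line from the assumptions above, since \(\mu\) is a constant and \(F\) and \(\epsilon\) are uncorrelated:
\(Cov(X) = Cov(A F + \mu + \epsilon) = A\,Cov(F)\,A^T + Cov(\epsilon) = A I_k A^T + \Psi = A A^T + \Psi\)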
Factor Analysis tries to decompose the covariance (or correlation) matrix and recover the factor loading matrix \(A\). There are various techniques to decompose \(\Sigma\), such as Principal Axis, MinRes (OLS minimization), and Maximum Likelihood, among others. We use the Principal Axis method as it does not assume normality of the data.
Nevertheless, the estimated factor loadings are not unique. Several transformation methods (rotations) such as Promax, Varimax, Varimin, and Oblimin exist, each of which optimizes a certain objective to make the estimated factor loadings more interpretable. For example, Varimax maximizes the variance of the squared loadings within each factor, so that each item loads strongly on as few factors as possible (informally speaking, it seeks a "simple structure"). We will use the Varimax and Oblimin rotations in the following analysis. A minimal sketch of the two rotations follows below.
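As a minimal sketch (assuming rho is an item correlation matrix; the choice of 3 factors is purely illustrative), the same Principal Axis solution under the two rotations via psych looks like:

library(psych)
library(GPArotation)  # the oblimin rotation comes from GPArotation

# Principal Axis extraction; orthogonal vs. oblique rotation (3 factors: illustrative only)
f.varimax <- fa(rho, nfactors = 3, fm = "pa", rotate = "varimax")
f.oblimin <- fa(rho, nfactors = 3, fm = "pa", rotate = "oblimin")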
An important step in Factor Analysis is finding the right number of factors; it is arguably the most challenging task in the whole exercise. There are many solutions to this problem, none of which is uniformly the best. "Solving the number of factors problem is easy, I do it everyday before breakfast. But knowing the right solution is harder" (Kaiser, 195x).
Several techniques exist to arrive at an optimal number of factors. R's psych package provides VSS (Very Simple Structure), Parallel Analysis, MAP, and the scree test. The R package nFactors provides another set of alternatives, most of them based on the structure of the eigenvalues of \(\Sigma\). For example, the Bartlett, Anderson, and Lawley procedures test the hypothesis that the bottom \(d-k\) eigenvalues are all equal, and report the corresponding \(k\) (sketched below). It is expected that there will be no consensus among the methods. We will do a Monte Carlo analysis in order to understand this variation.
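As an illustration of the eigenvalue-based procedures, nFactors exposes the Bartlett, Anderson, and Lawley tests through nBartlett; a minimal sketch, assuming rc is an item correlation matrix and using the smallest per-item sample size (234) for N:

library(nFactors)
# ordered eigenvalues of an (assumed) item correlation matrix rc
ev <- eigen(rc, symmetric = TRUE, only.values = TRUE)$values
nBartlett(x = ev, N = 234, alpha = 0.05, details = FALSE)  # reports k for each of the three tests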
Load libraries and read NDDT data from S3
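The loading chunk itself is not echoed; a minimal sketch of what it might look like, with hypothetical bucket and object names, using e.g. the aws.s3 package:

library(psych)
library(dplyr)
library(aws.s3)  # assumed S3 client; any other would do

# bucket/object names below are placeholders, not the real locations
resp <- s3read_using(FUN = read.csv, bucket = "nddt-bucket", object = "responses.csv")
meta <- s3read_using(FUN = read.csv, bucket = "nddt-bucket", object = "item-metadata.csv")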
## Reading NDDT Item meta-data
## read 1024 records on 60 items
Inspect the data
## % of correct responses and their confidence intervals
## % of missing values
## summary stats of the responses
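A sketch of these inspection steps, assuming resp is the 0/1 response matrix with NAs for unattempted items:

p.hat  <- colMeans(resp, na.rm = TRUE)       # % correct per item
n.item <- colSums(!is.na(resp))              # per-item sample sizes
se <- sqrt(p.hat * (1 - p.hat) / n.item)
ci <- cbind(lower = p.hat - 1.96 * se, upper = p.hat + 1.96 * se)  # normal-approximation CI
summary(n.item)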
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 234.0 440.0 634.0 550.8 656.0 791.0
## smallest sample size is: 234
## Heatmap of the response pattern
Sample size does not seem to be a problem. Let us check the item correlations, which form the backbone of all the dimensionality reduction techniques under consideration in this analysis.
## Compute tetrachoric correlation and look at the item correlations
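The items are binary (correct/incorrect), hence the tetrachoric rather than the Pearson correlation; a minimal sketch, assuming resp is the 0/1 response matrix:

# r$rho holds the tetrachoric correlation matrix, r$tau the item thresholds
r <- tetrachoric(resp)
round(r$rho[1:5, 1:5], 2)  # peek at the first few items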
Graph Viz of Item Correlation
Plot the force-directed network
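A sketch of one way to build such a network with igraph and visNetwork; the 0.3 cutoff on correlations is an illustrative assumption:

library(igraph)
library(visNetwork)

adj <- r$rho
adj[abs(adj) < 0.3] <- 0   # drop weak edges (0.3 is illustrative)
diag(adj) <- 0
g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
visIgraph(g)               # interactive force-directed layout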
Items in level-1 correlate well with each other. Within-level correlation becomes weaker (in fact, one notices negative correlations) as one moves up the levels. There could be several reasons: relatively smaller sample sizes and greater variability in responses at the higher levels.
Roughly, each level corresponds to a grade: e.g., level-1 is grade-1.
Let us proceed to do a Factor Analysis.
Estimating the number of factors
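A minimal sketch of the parallel analysis call behind the output below, again assuming resp holds the 0/1 responses:

# parallel analysis with tetrachoric correlations and Principal Axis extraction
pa.fit <- fa.parallel(resp, cor = "tet", fm = "pa")
pa.fit$nfact  # suggested number of factors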
## Apply Parallel Analysis. Using tetrachoric correlation and Principal Axis estimation method
## The estimated weights for the factor scores are probably incorrect. Try a different factor extraction method.
## Parallel analysis suggests that the number of factors = 20 and the number of components = 15
PA suggests that 20 factors should be used: factors are retained as long as the (ordered) eigenvalues of the real data exceed the corresponding eigenvalues of the simulated (random) data, and this holds up to the 20th. However, if we go by Kaiser's rule of thumb that components with eigenvalues below 1 are not significant, we get 9 components.
Let us get a bootstrapped estimate of the number of factors and its error bar (optional).
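A rough sketch of such a bootstrap, with resp as the assumed response matrix and 100 replicates chosen only to keep the run short:

# resample students with replacement and re-run parallel analysis each time
boot.nfact <- replicate(100, {
  idx <- sample(nrow(resp), replace = TRUE)
  fa.parallel(resp[idx, ], cor = "tet", fm = "pa", plot = FALSE)$nfact
})
quantile(boot.nfact, c(0.025, 0.5, 0.975))  # median and error bar for the number of factors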
Let us try VSS with Principal Axis extraction, under the Varimax and Oblimin rotations.
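The two calls are sketched below; n = 8 caps the number of factors examined, matching the output that follows:

# VSS with tetrachoric correlations and Principal Axis extraction, two rotations
vss(resp, n = 8, rotate = "varimax", fm = "pa", cor = "tet")
vss(resp, n = 8, rotate = "oblimin", fm = "pa", cor = "tet")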
## Apply VSS. Using tetrachoric correlation and Principal Axis estimation method
## Warning in cor.smooth(mat): Matrix was not positive definite, smoothing was
## done
## The estimated weights for the factor scores are probably incorrect. Try a different factor extraction method.
##
## Very Simple Structure
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm,
## n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.69 with 1 factors
## VSS complexity 2 achieves a maximimum of 0.76 with 2 factors
##
## The Velicer MAP achieves a minimum of 0.02 with 3 factors
## BIC achieves a minimum of 370334.1 with 8 factors
## Sample Size adjusted BIC achieves a minimum of 374520.2 with 8 factors
##
## Statistics by number of factors
## vss1 vss2 map dof chisq prob sqresid fit RMSEA BIC SABIC
## 1 0.69 0.00 0.026 1710 396489 0 90 0.69 0.48 384637 390068
## 2 0.50 0.76 0.023 1651 392513 0 70 0.76 0.49 381069 386313
## 3 0.46 0.76 0.023 1593 389927 0 59 0.80 0.49 378885 383945
## 4 0.43 0.74 0.024 1536 387537 0 51 0.83 0.50 376891 381769
## 5 0.41 0.67 0.024 1480 385022 0 44 0.85 0.51 374763 379464
## 6 0.36 0.61 0.025 1425 382969 0 39 0.87 0.52 373092 377618
## 7 0.34 0.60 0.026 1371 381262 0 34 0.88 0.53 371759 376113
## 8 0.40 0.66 0.027 1318 379470 0 30 0.90 0.54 370334 374520
## complex eChisq SRMR eCRMS eBIC
## 1 1.0 54413 0.123 0.125 42560
## 2 1.4 39199 0.104 0.108 27755
## 3 1.7 32103 0.094 0.099 21061
## 4 2.0 26904 0.086 0.092 16257
## 5 2.2 22681 0.079 0.087 12423
## 6 2.5 19826 0.074 0.082 9948
## 7 2.7 17247 0.069 0.078 7744
## 8 2.6 15052 0.064 0.075 5916
## Loading required namespace: GPArotation
## Warning in cor.smooth(mat): Matrix was not positive definite, smoothing was
## done
## The estimated weights for the factor scores are probably incorrect. Try a different factor extraction method.
##
## Very Simple Structure
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm,
## n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.69 with 1 factors
## VSS complexity 2 achieves a maximimum of 0.69 with 2 factors
##
## The Velicer MAP achieves a minimum of 0.02 with 3 factors
## BIC achieves a minimum of 370334.1 with 8 factors
## Sample Size adjusted BIC achieves a minimum of 374520.2 with 8 factors
##
## Statistics by number of factors
## vss1 vss2 map dof chisq prob sqresid fit RMSEA BIC SABIC
## 1 0.69 0.00 0.026 1710 396489 0 90 0.69 0.48 384637 390068
## 2 0.48 0.69 0.023 1651 392513 0 91 0.69 0.49 381069 386313
## 3 0.45 0.67 0.023 1593 389927 0 82 0.72 0.49 378885 383945
## 4 0.41 0.63 0.024 1536 387537 0 75 0.75 0.50 376891 381769
## 5 0.37 0.58 0.024 1480 385022 0 78 0.74 0.51 374763 379464
## 6 0.27 0.49 0.025 1425 382969 0 86 0.71 0.52 373092 377618
## 7 0.24 0.42 0.026 1371 381262 0 87 0.71 0.53 371759 376113
## 8 0.23 0.41 0.027 1318 379470 0 84 0.72 0.54 370334 374520
## complex eChisq SRMR eCRMS eBIC
## 1 1.0 54413 0.123 0.125 42560
## 2 1.4 39199 0.104 0.108 27755
## 3 1.6 32103 0.094 0.099 21061
## 4 1.9 26904 0.086 0.092 16257
## 5 2.1 22681 0.079 0.087 12423
## 6 2.4 19826 0.074 0.082 9948
## 7 2.8 17247 0.069 0.078 7744
## 8 2.9 15052 0.064 0.075 5916
The suggested number of factors varies quite a bit across the criteria. Let us push the search further, examining up to 20 factors.
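The wider search below relies on psych's nfactors wrapper, which runs the VSS/MAP/BIC battery over a larger range; a minimal sketch:

# examine solutions with up to 20 factors
nfactors(resp, n = 20, rotate = "varimax", fm = "pa", cor = "tet")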
## Apply VSS. Using tetrachoric correlation and Principal Axis estimation method
## Warning in cor.smooth(mat): Matrix was not positive definite, smoothing was
## done
## The estimated weights for the factor scores are probably incorrect. Try a different factor extraction method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : A Heywood case was detected. Examine the loadings carefully.
##
## Number of factors
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm,
## n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.69 with 1 factors
## VSS complexity 2 achieves a maximimum of 0.76 with 2 factors
## The Velicer MAP achieves a minimum of 0.02 with 3 factors
## Empirical BIC achieves a minimum of -2605.45 with 20 factors
## Sample Size adjusted BIC achieves a minimum of 354028 with 20 factors
##
## Statistics by number of factors
## vss1 vss2 map dof chisq prob sqresid fit RMSEA BIC SABIC
## 1 0.69 0.00 0.026 1710 396489 0 90.1 0.69 0.48 384637 390068
## 2 0.50 0.76 0.023 1651 392513 0 69.6 0.76 0.49 381069 386313
## 3 0.46 0.76 0.023 1593 389927 0 58.8 0.80 0.49 378885 383945
## 4 0.43 0.74 0.024 1536 387537 0 50.6 0.83 0.50 376891 381769
## 5 0.41 0.67 0.024 1480 385022 0 43.9 0.85 0.51 374763 379464
## 6 0.36 0.61 0.025 1425 382969 0 39.1 0.87 0.52 373092 377618
## 7 0.34 0.60 0.026 1371 381262 0 34.4 0.88 0.53 371759 376113
## 8 0.40 0.66 0.027 1318 379470 0 30.4 0.90 0.54 370334 374520
## 9 0.38 0.63 0.028 1266 377522 0 27.0 0.91 0.55 368746 372767
## 10 0.36 0.61 0.029 1215 375595 0 23.9 0.92 0.56 367173 371032
## 11 0.36 0.62 0.030 1165 373794 0 21.0 0.93 0.57 365719 369419
## 12 0.29 0.55 0.031 1116 372199 0 18.5 0.94 0.58 364463 368008
## 13 0.27 0.51 0.032 1068 369902 0 16.5 0.94 0.59 362499 365892
## 14 0.26 0.46 0.033 1021 367930 0 14.3 0.95 0.60 360853 364095
## 15 0.26 0.48 0.036 975 366094 0 12.8 0.96 0.61 359336 362432
## 16 0.26 0.47 0.038 930 363844 0 11.2 0.96 0.63 357398 360351
## 17 0.25 0.46 0.039 886 361988 0 9.8 0.97 0.64 355846 358660
## 18 0.26 0.47 0.041 843 360004 0 8.9 0.97 0.66 354161 356838
## 19 0.27 0.47 0.043 801 357998 0 7.8 0.97 0.67 352446 354990
## 20 0.27 0.47 0.045 760 356882 0 7.1 0.98 0.69 351614 354028
## complex eChisq SRMR eCRMS eBIC
## 1 1.0 54413 0.123 0.125 42560
## 2 1.4 39199 0.104 0.108 27755
## 3 1.7 32103 0.094 0.099 21061
## 4 2.0 26904 0.086 0.092 16257
## 5 2.2 22681 0.079 0.087 12423
## 6 2.5 19826 0.074 0.082 9948
## 7 2.7 17247 0.069 0.078 7744
## 8 2.6 15052 0.064 0.075 5916
## 9 2.8 13193 0.060 0.071 4418
## 10 3.0 11488 0.056 0.068 3067
## 11 3.1 9929 0.052 0.065 1853
## 12 3.3 8647 0.049 0.062 912
## 13 3.6 7452 0.045 0.058 49
## 14 3.9 6315 0.042 0.055 -762
## 15 3.7 5574 0.039 0.053 -1184
## 16 3.9 4839 0.037 0.050 -1608
## 17 4.3 4137 0.034 0.048 -2005
## 18 4.4 3569 0.031 0.045 -2275
## 19 4.2 3115 0.029 0.044 -2437
## 20 4.3 2662 0.027 0.041 -2605
VSS and MAP give the same answers as before, while empirical BIC (eBIC) favors the maximum number of factors requested, which was set to 20 in this case. However, from an interpretation point of view, a 20-factor solution is unlikely to be helpful.
Let us look at the scree tests from the nFactors R package (optional).
cat("Not run")
####################
# use pysch's defualt way of treating the missing values and check the loadings
rc <- (r$rho+t(r$rho))/2
nf.scree <- nScree(x=rc,model="factors")
print(nf.scree)
Before actually extracting the factors, let us look at two more alternative techniques:
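One of these is Item Cluster Analysis (ICLUST) from psych; a sketch of the calls behind the output below:

ic <- iclust(r$rho)   # hierarchical clustering of items on the tetrachoric correlations
print(ic, cut = 0.5)  # show only loadings greater than 0.5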
## Use iclust to determine the structure. Goal of clustering is to improve Cronbach's alpha within cluster
## ICLUST (Item Cluster Analysis)
## Call: iclust(r.mat = r$rho)
## ICLUST
##
## Purified Alpha:
## C58 C52
## 0.94 0.49
##
## Guttman Lambda6*
## C58 C52
## 1 1
##
## Original Beta:
## C58 C52
## 0.10 0.36
##
## Cluster size:
## C58 C52
## 51 9
##
## Purified scale intercorrelations
## reliabilities on diagonal
## correlations corrected for attenuation above diagonal:
## C58 C52
## C58 0.9432 -0.014
## C52 -0.0092 0.494
## Show clusters with loadings greater than 0.5
## ICLUST (Item Cluster Analysis)
## Call: iclust(r.mat = r$rho)
##
## Purified Alpha:
## C58 C52
## 0.94 0.49
##
## G6* reliability:
## C58 C52
## 0.8523 0.0033
##
## Original Beta:
## C58 C52
## 0.10 0.36
##
## Cluster size:
## C58 C52
## 51 9
##
## Item by Cluster Structure matrix:
## O P C58 C52
## l1q1 C58 C58 -0.52
## l1q2 C58 C58 -0.51
## l1q3 C58 C58
## l1q4 C58 C58
## l1q5 C58 C58 -0.63
## l1q6 C58 C58 -0.61
## l1q7 C58 C58 -0.62
## l1q8 C58 C58 -0.64
## l1q9 C58 C58
## l1q10 C58 C58 -0.71
## l1q11 C58 C58 -0.61
## l1q12 C58 C58 -0.62
## l2q1 C58 C58 -0.58
## l2q2 C58 C58 -0.51
## l2q3 C58 C58
## l2q4 C58 C58 -0.59
## l2q5 C58 C58 -0.65
## l2q6 C58 C58 -0.62
## l2q7 C58 C58 -0.56
## l2q8 C58 C58 -0.58
## l2q9 C58 C58 -0.51
## l2q10 C58 C58 -0.69
## l2q11 C58 C58 -0.71
## l2q12 C58 C58 -0.57
## l3q1 C58 C58
## l3q2 C58 C58 -0.50
## l3q3 C58 C58 -0.58
## l3q4 C58 C58 -0.58
## l3q5 C58 C58
## l3q6 C58 C58
## l3q7 C58 C58 -0.56
## l3q8 C58 C58 -0.60
## l3q9 C58 C58 -0.60
## l3q10 C58 C58 -0.60
## l3q11 C58 C58
## l3q12 C58 C58 -0.55
## l4q1 C58 C58
## l4q2 C52 C52
## l4q3 C52 C52 -0.70
## l4q4 C58 C58 -0.51
## l4q5 C58 C52
## l4q6 C58 C58 -0.60
## l4q7 C58 C58
## l4q8 C58 C58
## l4q9 C58 C58
## l4q10 C58 C52
## l4q11 C58 C58 -0.54
## l4q12 C58 C58 -0.58
## l5q1 C58 C58
## l5q2 C58 C52
## l5q3 C58 C52 -0.55
## l5q4 C58 C58 -0.55
## l5q5 C52 C52
## l5q6 C58 C58
## l5q7 C58 C58
## l5q8 C58 C58
## l5q9 C58 C58
## l5q10 C58 C58
## l5q11 C52 C52
## l5q12 C58 C52
##
## With eigenvalues of:
## C58 C52
## 14.3 2.9
##
## Purified scale intercorrelations
## reliabilities on diagonal
## correlations corrected for attenuation above diagonal:
## C58 C52
## C58 0.94 -0.01
## C52 -0.01 0.49
##
## Cluster fit = 0.69 Pattern fit = 0.83 RMSR = 0.12
TBD
The above analysis was done using:
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] visNetwork_1.0.1 plyr_1.8.3 reshape2_1.4.1
## [4] clusteval_0.1 clusterSim_0.44-2 clValid_0.6-6
## [7] cluster_2.0.3 nFactors_2.3.3 lattice_0.20-33
## [10] boot_1.3-18 MASS_7.3-45 igraph_1.0.1
## [13] psych_1.5.6 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 knitr_1.10.5 magrittr_1.5
## [4] mnormt_1.5-3 R6_2.1.2 stringr_1.1.0
## [7] tools_3.2.4 parallel_3.2.4 grid_3.2.4
## [10] e1071_1.6-6 DBI_0.4-1 htmltools_0.2.6
## [13] class_7.3-14 rgl_0.95.1247 ade4_1.7-2
## [16] yaml_2.1.13 assertthat_0.1 digest_0.6.10
## [19] tibble_1.1 GPArotation_2014.11-1 formatR_1.2
## [22] htmlwidgets_0.5 evaluate_0.9 rmarkdown_0.7
## [25] stringi_1.1.1 R2HTML_2.3.1 jsonlite_1.0
## [28] mvtnorm_1.0-3