We will apply dimensionality reduction techniques such as Factor Analysis, Item Cluster Analysis to understand the latent constructs in the NDDT (Numeracy Detailed Diagnostic Test) battery.

Factor Analysis:

Factor Analysis assumes the following generative model that must’ve produced the observed data:

\(X_{d \times 1} = A_{d \times k} F_{k \times 1} + \mu_{d \times 1} + \epsilon_{d \times 1}\)

where

It is further assumed that:

Then, it follows that \(\Sigma \equiv Cov(X)= A A^T + \Psi\)

Factor Analysis tries to decompose the covariance (or correlation) matrix and recover the factor loading matrix \(A\). There are various techniques to decompose \(\Sigma\) such as Principal Axis, MinRes (OLS minimization), Maximum Likelihood, among others. We use Principal Axis method as it does not assume normality of the data.

Nevertheless, the estimated Factor Loadings are not unique. Several transformation methods (rotations) such as Promax, Varimax, Varimin, Obliqmin etc exist each of which tries to attain a certain objective and make the estimated factor loadings more interpretable. For example, Varimax rotation tries to distribute the factor loadings across factors so that all factors get equal share (informally speaking, that is). We use will Varimax and Obliqmin roation in the following analysis.

An important step in Factor Analysis is finding the right number of Factors. Notably, it is the most challenging task.

Determining the most interpretable number of factors from a factor analysis is perhaps one of the greatest challenges in factor analysis. There are many solutions to this problem, none of which is uniformly the best. “Solving the number of factors problem is easy, I do it everyday before breakfast. But knowing the right solution is harder” (Kaiser, 195x).

Several techniques exist to arrive at an optimal number of components. R’s pysch package, provides the VSS (very simple structure), Parallel Analysis, MAP, Scree test. R package nFactors prvovides another set of alternatives - most of them based on the strcuture of the eigen values of \(\Sigma\). For example, the Bartlet, Anderson, Lawley procedure tests the hypothesis that the bottom \(d-k\) eigen-values are all equal, and reports the corresponding \(k\). It is expected that there would be no consensus among the methods. We will do a monte-carlo analysis in order to understand this variation.

NDDT Data

Load libraries and read NDDT data from S3

## Reading NDDT Item meta-data
## read  1024 records on  60 items

Inspect the data

## % of correct responses and their confidence intervels

## % of missing values

## summary stats of the responses
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   234.0   440.0   634.0   550.8   656.0   791.0
## smallest sample size is:  234
## Heatmap of the response pattern

Sample size does not seem to be a problem. Check item correlation, which forms the back-bone of all dimensionality techniques under consideration in this analysis.

## Compute tetrochoric correlation and look at the item correlations

Graph Viz of Item Correlation

plot forced n/w

## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## 
## Attaching package: 'visNetwork'
## 
## The following object is masked from 'package:igraph':
## 
##     %>%

Items in level-1 have better correlation with each other. Within level correlation is lower (in fact, one notices negative correlation) as one moves up the level. There could be several reasons. Relatively smaller sample sizes, variability in responses at higher levels.

Roughly, each level corresponds to a grade: eg level-1 is grade-1

Let us proceed to do a FA ## Estimating the number of factors

## Apply Parallel Analysis. Using tetrochoric correlation and Principal Axis estimating method
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.

## Parallel analysis suggests that the number of factors =  20  and the number of components =  15

PA suggests that 20 factors should be used. It is the smallest index at which (ordered) eigen-values of simulated data exceed the corresponding eigen value of real data. However, if we go by the Kaiser’s rule of thumb that components below an eigen value of 1 are not significant, we get 9 components.

Let us get a bootsrapped estimate of the # of factors and and its error bar (optional)

Let us try VSS structure with PA with varimax and oblimin transformations

## Apply VSS. Using tetrochoric correlation and Principal Axis estimating method
## Warning in cor.smooth(mat): Matrix was not positive definite, smoothing was
## done
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.

## 
## Very Simple Structure
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
##     n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.69  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.76  with  2  factors
## 
## The Velicer MAP achieves a minimum of 0.02  with  3  factors 
## BIC achieves a minimum of  370334.1  with  8  factors
## Sample Size adjusted BIC achieves a minimum of  374520.2  with  8  factors
## 
## Statistics by number of factors 
##   vss1 vss2   map  dof  chisq prob sqresid  fit RMSEA    BIC  SABIC
## 1 0.69 0.00 0.026 1710 396489    0      90 0.69  0.48 384637 390068
## 2 0.50 0.76 0.023 1651 392513    0      70 0.76  0.49 381069 386313
## 3 0.46 0.76 0.023 1593 389927    0      59 0.80  0.49 378885 383945
## 4 0.43 0.74 0.024 1536 387537    0      51 0.83  0.50 376891 381769
## 5 0.41 0.67 0.024 1480 385022    0      44 0.85  0.51 374763 379464
## 6 0.36 0.61 0.025 1425 382969    0      39 0.87  0.52 373092 377618
## 7 0.34 0.60 0.026 1371 381262    0      34 0.88  0.53 371759 376113
## 8 0.40 0.66 0.027 1318 379470    0      30 0.90  0.54 370334 374520
##   complex eChisq  SRMR eCRMS  eBIC
## 1     1.0  54413 0.123 0.125 42560
## 2     1.4  39199 0.104 0.108 27755
## 3     1.7  32103 0.094 0.099 21061
## 4     2.0  26904 0.086 0.092 16257
## 5     2.2  22681 0.079 0.087 12423
## 6     2.5  19826 0.074 0.082  9948
## 7     2.7  17247 0.069 0.078  7744
## 8     2.6  15052 0.064 0.075  5916
## Loading required namespace: GPArotation
## Warning in cor.smooth(mat): Matrix was not positive definite, smoothing was
## done
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.

## 
## Very Simple Structure
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
##     n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.69  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.69  with  2  factors
## 
## The Velicer MAP achieves a minimum of 0.02  with  3  factors 
## BIC achieves a minimum of  370334.1  with  8  factors
## Sample Size adjusted BIC achieves a minimum of  374520.2  with  8  factors
## 
## Statistics by number of factors 
##   vss1 vss2   map  dof  chisq prob sqresid  fit RMSEA    BIC  SABIC
## 1 0.69 0.00 0.026 1710 396489    0      90 0.69  0.48 384637 390068
## 2 0.48 0.69 0.023 1651 392513    0      91 0.69  0.49 381069 386313
## 3 0.45 0.67 0.023 1593 389927    0      82 0.72  0.49 378885 383945
## 4 0.41 0.63 0.024 1536 387537    0      75 0.75  0.50 376891 381769
## 5 0.37 0.58 0.024 1480 385022    0      78 0.74  0.51 374763 379464
## 6 0.27 0.49 0.025 1425 382969    0      86 0.71  0.52 373092 377618
## 7 0.24 0.42 0.026 1371 381262    0      87 0.71  0.53 371759 376113
## 8 0.23 0.41 0.027 1318 379470    0      84 0.72  0.54 370334 374520
##   complex eChisq  SRMR eCRMS  eBIC
## 1     1.0  54413 0.123 0.125 42560
## 2     1.4  39199 0.104 0.108 27755
## 3     1.6  32103 0.094 0.099 21061
## 4     1.9  26904 0.086 0.092 16257
## 5     2.1  22681 0.079 0.087 12423
## 6     2.4  19826 0.074 0.082  9948
## 7     2.8  17247 0.069 0.078  7744
## 8     2.9  15052 0.064 0.075  5916

Suggested # of factors vary quite a bit

## Apply VSS. Using tetrochoric correlation and Principal Axis estimating method
## Warning in cor.smooth(mat): Matrix was not positive definite, smoothing was
## done
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate =
## rotate, : A Heywood case was detected. Examine the loadings carefully.
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.

## 
## Number of factors
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
##     n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.69  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.76  with  2  factors
## The Velicer MAP achieves a minimum of 0.02  with  3  factors 
## Empirical BIC achieves a minimum of  -2605.45  with  20  factors
## Sample Size adjusted BIC achieves a minimum of  354028  with  20  factors
## 
## Statistics by number of factors 
##    vss1 vss2   map  dof  chisq prob sqresid  fit RMSEA    BIC  SABIC
## 1  0.69 0.00 0.026 1710 396489    0    90.1 0.69  0.48 384637 390068
## 2  0.50 0.76 0.023 1651 392513    0    69.6 0.76  0.49 381069 386313
## 3  0.46 0.76 0.023 1593 389927    0    58.8 0.80  0.49 378885 383945
## 4  0.43 0.74 0.024 1536 387537    0    50.6 0.83  0.50 376891 381769
## 5  0.41 0.67 0.024 1480 385022    0    43.9 0.85  0.51 374763 379464
## 6  0.36 0.61 0.025 1425 382969    0    39.1 0.87  0.52 373092 377618
## 7  0.34 0.60 0.026 1371 381262    0    34.4 0.88  0.53 371759 376113
## 8  0.40 0.66 0.027 1318 379470    0    30.4 0.90  0.54 370334 374520
## 9  0.38 0.63 0.028 1266 377522    0    27.0 0.91  0.55 368746 372767
## 10 0.36 0.61 0.029 1215 375595    0    23.9 0.92  0.56 367173 371032
## 11 0.36 0.62 0.030 1165 373794    0    21.0 0.93  0.57 365719 369419
## 12 0.29 0.55 0.031 1116 372199    0    18.5 0.94  0.58 364463 368008
## 13 0.27 0.51 0.032 1068 369902    0    16.5 0.94  0.59 362499 365892
## 14 0.26 0.46 0.033 1021 367930    0    14.3 0.95  0.60 360853 364095
## 15 0.26 0.48 0.036  975 366094    0    12.8 0.96  0.61 359336 362432
## 16 0.26 0.47 0.038  930 363844    0    11.2 0.96  0.63 357398 360351
## 17 0.25 0.46 0.039  886 361988    0     9.8 0.97  0.64 355846 358660
## 18 0.26 0.47 0.041  843 360004    0     8.9 0.97  0.66 354161 356838
## 19 0.27 0.47 0.043  801 357998    0     7.8 0.97  0.67 352446 354990
## 20 0.27 0.47 0.045  760 356882    0     7.1 0.98  0.69 351614 354028
##    complex eChisq  SRMR eCRMS  eBIC
## 1      1.0  54413 0.123 0.125 42560
## 2      1.4  39199 0.104 0.108 27755
## 3      1.7  32103 0.094 0.099 21061
## 4      2.0  26904 0.086 0.092 16257
## 5      2.2  22681 0.079 0.087 12423
## 6      2.5  19826 0.074 0.082  9948
## 7      2.7  17247 0.069 0.078  7744
## 8      2.6  15052 0.064 0.075  5916
## 9      2.8  13193 0.060 0.071  4418
## 10     3.0  11488 0.056 0.068  3067
## 11     3.1   9929 0.052 0.065  1853
## 12     3.3   8647 0.049 0.062   912
## 13     3.6   7452 0.045 0.058    49
## 14     3.9   6315 0.042 0.055  -762
## 15     3.7   5574 0.039 0.053 -1184
## 16     3.9   4839 0.037 0.050 -1608
## 17     4.3   4137 0.034 0.048 -2005
## 18     4.4   3569 0.031 0.045 -2275
## 19     4.2   3115 0.029 0.044 -2437
## 20     4.3   2662 0.027 0.041 -2605

VSS and MAP results show the same, eBIC shows that max factors requested should be used which was set to 20 in this case. However, from an interpreation PoV, 20 factors could not be helpful.

Let us look at the scree tests from nFactors R package (optional)

cat("Not run")
####################
# use pysch's defualt way of treating the missing values and check the loadings
rc <- (r$rho+t(r$rho))/2
nf.scree <- nScree(x=rc,model="factors")
print(nf.scree)

Before actually extracting the factors, let us look at two more alternative tecniques:

Hierarchical Iterm Cluster Analysis

## Use iclust to determine the structure. Goal of clustering is to improve Chronbach's alpha with in cluster

## ICLUST (Item Cluster Analysis)Call: iclust(r.mat = r$rho)
## ICLUST 
## 
## Purified Alpha:
##  C58  C52 
## 0.94 0.49 
## 
##  Guttman Lambda6* 
## C58 C52 
##   1   1 
## 
## Original Beta:
##  C58  C52 
## 0.10 0.36 
## 
## Cluster size:
## C58 C52 
##  51   9 
## 
## Purified scale intercorrelations
##  reliabilities on diagonal
##  correlations corrected for attenuation above diagonal: 
##         C58    C52
## C58  0.9432 -0.014
## C52 -0.0092  0.494
## Show cluster's with loadings greater than 0.5
## ICLUST (Item Cluster Analysis)
## Call: iclust(r.mat = r$rho)
## 
## Purified Alpha:
##  C58  C52 
## 0.94 0.49 
## 
## G6* reliability:
##    C58    C52 
## 0.8523 0.0033 
## 
## Original Beta:
##  C58  C52 
## 0.10 0.36 
## 
## Cluster size:
## C58 C52 
##  51   9 
## 
## Item by Cluster Structure matrix:
##         O   P   C58   C52
## l1q1  C58 C58 -0.52      
## l1q2  C58 C58 -0.51      
## l1q3  C58 C58            
## l1q4  C58 C58            
## l1q5  C58 C58 -0.63      
## l1q6  C58 C58 -0.61      
## l1q7  C58 C58 -0.62      
## l1q8  C58 C58 -0.64      
## l1q9  C58 C58            
## l1q10 C58 C58 -0.71      
## l1q11 C58 C58 -0.61      
## l1q12 C58 C58 -0.62      
## l2q1  C58 C58 -0.58      
## l2q2  C58 C58 -0.51      
## l2q3  C58 C58            
## l2q4  C58 C58 -0.59      
## l2q5  C58 C58 -0.65      
## l2q6  C58 C58 -0.62      
## l2q7  C58 C58 -0.56      
## l2q8  C58 C58 -0.58      
## l2q9  C58 C58 -0.51      
## l2q10 C58 C58 -0.69      
## l2q11 C58 C58 -0.71      
## l2q12 C58 C58 -0.57      
## l3q1  C58 C58            
## l3q2  C58 C58 -0.50      
## l3q3  C58 C58 -0.58      
## l3q4  C58 C58 -0.58      
## l3q5  C58 C58            
## l3q6  C58 C58            
## l3q7  C58 C58 -0.56      
## l3q8  C58 C58 -0.60      
## l3q9  C58 C58 -0.60      
## l3q10 C58 C58 -0.60      
## l3q11 C58 C58            
## l3q12 C58 C58 -0.55      
## l4q1  C58 C58            
## l4q2  C52 C52            
## l4q3  C52 C52       -0.70
## l4q4  C58 C58 -0.51      
## l4q5  C58 C52            
## l4q6  C58 C58 -0.60      
## l4q7  C58 C58            
## l4q8  C58 C58            
## l4q9  C58 C58            
## l4q10 C58 C52            
## l4q11 C58 C58 -0.54      
## l4q12 C58 C58 -0.58      
## l5q1  C58 C58            
## l5q2  C58 C52            
## l5q3  C58 C52       -0.55
## l5q4  C58 C58 -0.55      
## l5q5  C52 C52            
## l5q6  C58 C58            
## l5q7  C58 C58            
## l5q8  C58 C58            
## l5q9  C58 C58            
## l5q10 C58 C58            
## l5q11 C52 C52            
## l5q12 C58 C52            
## 
## With eigenvalues of:
##  C58  C52 
## 14.3  2.9 
## 
## Purified scale intercorrelations
##  reliabilities on diagonal
##  correlations corrected for attenuation above diagonal: 
##       C58   C52
## C58  0.94 -0.01
## C52 -0.01  0.49
## 
## Cluster fit =  0.69   Pattern fit =  0.83  RMSR =  0.12

Estimate the Factor Loadings, for a given no. of factors

TBD

The above analysis was done using

## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] visNetwork_1.0.1  plyr_1.8.3        reshape2_1.4.1   
##  [4] clusteval_0.1     clusterSim_0.44-2 clValid_0.6-6    
##  [7] cluster_2.0.3     nFactors_2.3.3    lattice_0.20-33  
## [10] boot_1.3-18       MASS_7.3-45       igraph_1.0.1     
## [13] psych_1.5.6       dplyr_0.5.0      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6           knitr_1.10.5          magrittr_1.5         
##  [4] mnormt_1.5-3          R6_2.1.2              stringr_1.1.0        
##  [7] tools_3.2.4           parallel_3.2.4        grid_3.2.4           
## [10] e1071_1.6-6           DBI_0.4-1             htmltools_0.2.6      
## [13] class_7.3-14          rgl_0.95.1247         ade4_1.7-2           
## [16] yaml_2.1.13           assertthat_0.1        digest_0.6.10        
## [19] tibble_1.1            GPArotation_2014.11-1 formatR_1.2          
## [22] htmlwidgets_0.5       evaluate_0.9          rmarkdown_0.7        
## [25] stringi_1.1.1         R2HTML_2.3.1          jsonlite_1.0         
## [28] mvtnorm_1.0-3