S3729C Data Analytics Seminar

01 Principal Component Analysis, Item Analysis and Confirmatory Factor Analysis

Class on 28 August 2021

Principal Component Analysis

In the first section of the S3729C Data Analytics Seminar, you will be led through the process of Principal Component Analysis (PCA). This will be done using the Community2Campus base data as an example.

For more information about PCA, you may refer to the following resources

In this section, you are not required to carry out any hands-on practice. It is meant to give you an appreciation of the process that goes into conducting exploratory data analysis, and to help you better visualise the variation present in a dataset with many variables, or a “wide” dataset as we refer to it in the field.

This is especially pertinent when you have “wide” datasets such as the Community2Campus dataset which you are already familiar with from Lesson 06 to 08, as well as your CW1 presentations in Lesson 10.

Loading the Data and Packages

This code installs pacman, which is used later to load all the other packages.

install.packages("pacman",repos = "http://cran.us.r-project.org")
## package 'pacman' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpY7sLRp\downloaded_packages

Load the packages required for this section

pacman::p_load(GPArotation, magrittr, pacman, psych, rio, tidyverse, lavaan)

You can now import the PCA data with the read.csv command from the csv file which I have placed online on GitHub at https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/PCArevdata.csv.

df <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/PCArevdata.csv", header = TRUE, sep = ",")

Check on the output by reading the column names

df %>% colnames()
##  [1] "NCSS_P1" "NCSS_P2" "NCSS_P3" "NCSS_P4" "NCSS_P5" "AT_PT1"  "AT_PT2" 
##  [8] "AT_PT3"  "AT_WT1"  "AT_WT2"  "AT_WT3"  "AT_WT4"  "AT_WT5"  "AT_CT1" 
## [15] "AT_CT2"  "AT_CT3"  "AT_CT4"  "AT_CT5"  "AT_LA1"  "AT_LA2"  "AT_LA3" 
## [22] "AT_LA4"  "AT_LA5"  "AT_SC1"  "AT_SC2"  "AT_SC3"  "AT_SC4"  "AT_SC5" 
## [29] "AT_SC6"  "ST_SD1"  "ST_SD2"  "ST_SD3"  "ST_SD4"  "ST_SD5"  "ST_SR2" 
## [36] "ST_SR1"  "ST_SR3"  "ST_SR4"  "ST_SR5"

Principal Component Analysis Using the Default Method

There are three methods of PCA available in R

We will use the default method prcomp first

pc <- df %>%
  prcomp(center = TRUE, scale. = TRUE)  # scale. is the full argument name in prcomp
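To see what centring and scaling do here, it may help to note that PCA on centred, unit-variance variables amounts to an eigen-decomposition of the correlation matrix. The following is a minimal sketch on simulated data; the toy matrix x, the seed and the name pc_toy are assumptions for illustration and have nothing to do with the Community2Campus data.

```r
# Toy data: 200 observations on 4 variables (illustrative only)
set.seed(1)
z <- rnorm(200)
x <- cbind(v1 = z + rnorm(200),
           v2 = z + rnorm(200),
           v3 = rnorm(200),
           v4 = 10 * rnorm(200))  # deliberately on a much larger scale

# PCA on centred and standardised variables
pc_toy <- prcomp(x, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix
ev <- eigen(cor(x))$values

# The component variances (sdev^2) match the eigenvalues,
# and they sum to the number of variables
all.equal(pc_toy$sdev^2, ev)
sum(ev)
```

Without scaling, a variable measured on a larger scale (v4 in the sketch) would dominate the first component, which is why scaling is usually preferred when items are not on a common metric.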

Get summary stats

summary(pc)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.9552 2.2443 1.28210 1.24473 1.17405 0.97936 0.93336
## Proportion of Variance 0.4011 0.1291 0.04215 0.03973 0.03534 0.02459 0.02234
## Cumulative Proportion  0.4011 0.5303 0.57241 0.61214 0.64748 0.67208 0.69442
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.91529 0.91023 0.84528 0.80615 0.77563 0.76162 0.74171
## Proportion of Variance 0.02148 0.02124 0.01832 0.01666 0.01543 0.01487 0.01411
## Cumulative Proportion  0.71590 0.73714 0.75546 0.77212 0.78755 0.80242 0.81653
##                           PC15    PC16    PC17   PC18    PC19    PC20   PC21
## Standard deviation     0.72811 0.70806 0.69206 0.6756 0.66110 0.65390 0.6244
## Proportion of Variance 0.01359 0.01286 0.01228 0.0117 0.01121 0.01096 0.0100
## Cumulative Proportion  0.83012 0.84298 0.85526 0.8670 0.87817 0.88913 0.8991
##                           PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     0.60497 0.59783 0.56482 0.55525 0.52770 0.51575 0.4997
## Proportion of Variance 0.00938 0.00916 0.00818 0.00791 0.00714 0.00682 0.0064
## Cumulative Proportion  0.90851 0.91768 0.92586 0.93376 0.94090 0.94772 0.9541
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.48621 0.46285 0.45290 0.44246 0.42277 0.39387 0.36975
## Proportion of Variance 0.00606 0.00549 0.00526 0.00502 0.00458 0.00398 0.00351
## Cumulative Proportion  0.96019 0.96568 0.97094 0.97596 0.98054 0.98452 0.98803
##                           PC36    PC37    PC38    PC39
## Standard deviation     0.36186 0.35104 0.34942 0.30114
## Proportion of Variance 0.00336 0.00316 0.00313 0.00233
## Cumulative Proportion  0.99138 0.99454 0.99767 1.00000

Get a screeplot of the eigenvalues

plot(pc)
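The bars in the scree plot are the component variances (eigenvalues), that is, the squared standard deviations from the summary. As a rough sketch, the values printed by summary(pc) can be squared by hand. Counting eigenvalues above 1 (the Kaiser rule, a common heuristic that this section does not itself invoke) is one quick way to shortlist a number of components.

```r
# Standard deviations of the first seven components, copied from summary(pc)
sdev <- c(3.9552, 2.2443, 1.28210, 1.24473, 1.17405, 0.97936, 0.93336)

# Eigenvalue of each component = squared standard deviation
eigenvalues <- sdev^2

# With 39 standardised variables the total variance is 39,
# so the proportion of variance is eigenvalue / 39
prop_var <- eigenvalues / 39
round(prop_var[1], 4)  # agrees with the 0.4011 reported for PC1

# Number of components with eigenvalue > 1
# (PC6 onwards all have standard deviations below 1)
sum(eigenvalues > 1)
```

On these figures five components have eigenvalues above 1, which is in line with the five components and factors extracted later in the section.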

Very Simple Structure (VSS) Algorithm

df %>% 
select(1:39) %>%
vss(n = 10)

## 
## Very Simple Structure
## Call: vss(x = ., n = 10)
## VSS complexity 1 achieves a maximimum of 0.86  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.95  with  2  factors
## 
## The Velicer MAP achieves a minimum of 0.01  with  5  factors 
## BIC achieves a minimum of  -137.51  with  10  factors
## Sample Size adjusted BIC achieves a minimum of  1120.59  with  10  factors
## 
## Statistics by number of factors 
##    vss1 vss2   map dof chisq prob sqresid  fit RMSEA   BIC SABIC complex eChisq
## 1  0.86 0.00 0.048 702 23939    0    40.1 0.86 0.131 18630 20861     1.0  45097
## 2  0.67 0.95 0.014 664 11597    0    15.0 0.95 0.093  6575  8685     1.3   7471
## 3  0.54 0.85 0.012 627  8999    0    12.6 0.96 0.083  4258  6250     1.7   5105
## 4  0.60 0.91 0.011 591  7079    0    10.4 0.96 0.076  2610  4488     1.7   3176
## 5  0.47 0.84 0.011 556  5753    0     8.7 0.97 0.070  1548  3315     2.1   1959
## 6  0.52 0.88 0.011 522  4978    0     7.9 0.97 0.067  1031  2689     2.0   1522
## 7  0.48 0.85 0.012 489  4403    0     7.3 0.97 0.064   705  2259     2.1   1199
## 8  0.48 0.86 0.013 457  3735    0     6.7 0.98 0.061   279  1731     2.2    923
## 9  0.46 0.86 0.014 426  3320    0     6.3 0.98 0.059    98  1452     2.2    744
## 10 0.47 0.86 0.016 396  2857    0     6.0 0.98 0.057  -138  1121     2.4    606
##     SRMR eCRMS  eBIC
## 1  0.126 0.129 39788
## 2  0.051 0.054  2449
## 3  0.042 0.046   364
## 4  0.033 0.037 -1293
## 5  0.026 0.030 -2246
## 6  0.023 0.028 -2425
## 7  0.021 0.025 -2498
## 8  0.018 0.023 -2533
## 9  0.016 0.021 -2478
## 10 0.015 0.020 -2388

nFactors Algorithm

df %>%
select(1:39) %>%
nfactors(n = 10)

## 
## Number of factors
## Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
##     n.obs = n.obs, plot = FALSE, title = title, use = use, cor = cor)
## VSS complexity 1 achieves a maximimum of 0.86  with  1  factors
## VSS complexity 2 achieves a maximimum of 0.95  with  2  factors
## The Velicer MAP achieves a minimum of 0.01  with  5  factors 
## Empirical BIC achieves a minimum of  -2532.75  with  8  factors
## Sample Size adjusted BIC achieves a minimum of  1120.59  with  10  factors
## 
## Statistics by number of factors 
##    vss1 vss2   map dof chisq prob sqresid  fit RMSEA   BIC SABIC complex eChisq
## 1  0.86 0.00 0.048 702 23939    0    40.1 0.86 0.131 18630 20861     1.0  45097
## 2  0.67 0.95 0.014 664 11597    0    15.0 0.95 0.093  6575  8685     1.3   7471
## 3  0.54 0.85 0.012 627  8999    0    12.6 0.96 0.083  4258  6250     1.7   5105
## 4  0.60 0.91 0.011 591  7079    0    10.4 0.96 0.076  2610  4488     1.7   3176
## 5  0.47 0.84 0.011 556  5753    0     8.7 0.97 0.070  1548  3315     2.1   1959
## 6  0.52 0.88 0.011 522  4978    0     7.9 0.97 0.067  1031  2689     2.0   1522
## 7  0.48 0.85 0.012 489  4403    0     7.3 0.97 0.064   705  2259     2.1   1199
## 8  0.48 0.86 0.013 457  3735    0     6.7 0.98 0.061   279  1731     2.2    923
## 9  0.46 0.86 0.014 426  3320    0     6.3 0.98 0.059    98  1452     2.2    744
## 10 0.47 0.86 0.016 396  2857    0     6.0 0.98 0.057  -138  1121     2.4    606
##     SRMR eCRMS  eBIC
## 1  0.126 0.129 39788
## 2  0.051 0.054  2449
## 3  0.042 0.046   364
## 4  0.033 0.037 -1293
## 5  0.026 0.030 -2246
## 6  0.023 0.028 -2425
## 7  0.021 0.025 -2498
## 8  0.018 0.023 -2533
## 9  0.016 0.021 -2478
## 10 0.015 0.020 -2388
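The header lines of the output simply report where each criterion reaches its best value in the table. A minimal sketch that reproduces this from the printed columns is given below; the data frame name stats is an assumption, and the MAP column is left out because its printed values are rounded to ties (0.011 for 4, 5 and 6 factors), so the reported minimum at 5 factors cannot be recovered from the rounded table.

```r
# Fit criteria for 1 to 10 factors, copied from the table above
stats <- data.frame(
  n_factors = 1:10,
  vss1 = c(0.86, 0.67, 0.54, 0.60, 0.47, 0.52, 0.48, 0.48, 0.46, 0.47),
  vss2 = c(0.00, 0.95, 0.85, 0.91, 0.84, 0.88, 0.85, 0.86, 0.86, 0.86),
  BIC  = c(18630, 6575, 4258, 2610, 1548, 1031, 705, 279, 98, -138),
  eBIC = c(39788, 2449, 364, -1293, -2246, -2425, -2498, -2533, -2478, -2388)
)

# VSS fit is maximised; the information criteria are minimised
best <- c(
  vss1 = stats$n_factors[which.max(stats$vss1)],
  vss2 = stats$n_factors[which.max(stats$vss2)],
  BIC  = stats$n_factors[which.min(stats$BIC)],
  eBIC = stats$n_factors[which.min(stats$eBIC)]
)
best
```

The criteria point to 1, 2, 10 and 8 factors respectively, and MAP to 5, so they do not agree. The choice of the number of factors remains a judgement call that should also weigh interpretability.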

Factor Analysis

Calculate and plot factors with fa()

pa5.out <- fa(df[1:39],
nfactors = 5,
fm = "pa",
max.iter = 100,
rotate = "oblimin")

Plot the factor diagram with fa.diagram() to visualise the rotated factor solution and judge whether it is interpretable

fa.diagram(pa5.out)

Hierarchical Clustering

Hierarchical clustering of items with iclust()

df %>% 
  select(1:39) %>% 
  iclust()

## ICLUST (Item Cluster Analysis)
## Call: iclust(r.mat = .)
## 
## Purified Alpha:
##  C37  C26 
## 0.95 0.88 
## 
## G6* reliability:
## C37 C26 
## 1.0 0.9 
## 
## Original Beta:
##  C37  C26 
## 0.75 0.70 
## 
## Cluster size:
## C37 C26 
##  30   9 
## 
## Item by Cluster Structure matrix:
##           O   P   C37   C26
## NCSS_P1 C37 C26  0.43  0.63
## NCSS_P2 C37 C37  0.32 -0.10
## NCSS_P3 C37 C37  0.56  0.24
## NCSS_P4 C37 C26  0.49  0.53
## NCSS_P5 C37 C26  0.55  0.65
## AT_PT1  C37 C37  0.78  0.38
## AT_PT2  C37 C37  0.79  0.30
## AT_PT3  C37 C37  0.87  0.35
## AT_WT1  C37 C26  0.65  0.82
## AT_WT2  C26 C26  0.06 -0.54
## AT_WT3  C26 C26  0.18  0.67
## AT_WT4  C26 C26 -0.23 -0.71
## AT_WT5  C37 C37  0.58  0.06
## AT_CT1  C37 C37  0.66  0.09
## AT_CT2  C37 C37  0.69  0.06
## AT_CT3  C37 C37  0.84  0.43
## AT_CT4  C37 C37  0.30 -0.17
## AT_CT5  C37 C37  0.55  0.50
## AT_LA1  C37 C37  0.65  0.18
## AT_LA2  C37 C37  0.37 -0.03
## AT_LA3  C37 C37  0.53  0.05
## AT_LA4  C26 C26  0.31  0.65
## AT_LA5  C37 C26  0.58  0.78
## AT_SC1  C37 C37  0.40 -0.02
## AT_SC2  C37 C37  0.80  0.46
## AT_SC3  C37 C37  0.68  0.19
## AT_SC4  C37 C37  0.42  0.10
## AT_SC5  C37 C37  0.29  0.06
## AT_SC6  C37 C37  0.85  0.36
## ST_SD1  C37 C37  0.87  0.41
## ST_SD2  C37 C37  0.90  0.41
## ST_SD3  C37 C37  0.88  0.38
## ST_SD4  C37 C37  0.59  0.19
## ST_SD5  C37 C37  0.51  0.45
## ST_SR2  C37 C37  0.75  0.74
## ST_SR1  C37 C37  0.74  0.65
## ST_SR3  C37 C37  0.76  0.70
## ST_SR4  C37 C37  0.73  0.69
## ST_SR5  C37 C37  0.61  0.57
## 
## With eigenvalues of:
##  C37  C26 
## 13.9  5.5 
## 
## Purified scale intercorrelations
##  reliabilities on diagonal
##  correlations corrected for attenuation above diagonal: 
##      C37  C26
## C37 0.95 0.52
## C26 0.48 0.88
## 
## Cluster fit =  0.86   Pattern fit =  0.98  RMSR =  0.07

Principal Component Analysis with K Factors

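First PCA with the default rotation of principal(), which is varimax (hence the RC labels for rotated components in the output), specifying 5 components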

df %>% principal(nfactors = 5)
## Principal Components Analysis
## Call: principal(r = ., nfactors = 5)
## Standardized loadings (pattern matrix) based upon correlation matrix
##           RC2   RC1   RC4   RC3   RC5   h2   u2 com
## NCSS_P1  0.44  0.11 -0.20  0.59  0.09 0.60 0.40 2.3
## NCSS_P2  0.02  0.09  0.64  0.38  0.23 0.61 0.39 2.0
## NCSS_P3  0.17  0.35  0.30  0.53  0.25 0.59 0.41 3.2
## NCSS_P4  0.43  0.19  0.01  0.69 -0.19 0.74 0.26 2.0
## NCSS_P5  0.54  0.15 -0.01  0.68 -0.03 0.78 0.22 2.0
## AT_PT1   0.33  0.70  0.03  0.23  0.13 0.68 0.32 1.8
## AT_PT2   0.34  0.69  0.19  0.16  0.11 0.67 0.33 1.8
## AT_PT3   0.41  0.73  0.18  0.10  0.21 0.79 0.21 2.0
## AT_WT1   0.78  0.25 -0.20  0.18  0.12 0.76 0.24 1.5
## AT_WT2  -0.32  0.15  0.64  0.00 -0.03 0.54 0.46 1.6
## AT_WT3   0.51 -0.12 -0.43  0.18  0.17 0.52 0.48 2.6
## AT_WT4  -0.42 -0.04  0.55 -0.33 -0.11 0.61 0.39 2.7
## AT_WT5   0.26  0.58  0.35 -0.10 -0.16 0.56 0.44 2.3
## AT_CT1   0.27  0.58  0.40 -0.05  0.07 0.58 0.42 2.3
## AT_CT2   0.19  0.66  0.36  0.00  0.13 0.62 0.38 1.8
## AT_CT3   0.49  0.67  0.11  0.07  0.20 0.74 0.26 2.1
## AT_CT4  -0.01  0.34  0.35 -0.28  0.21 0.37 0.63 3.6
## AT_CT5   0.76  0.13  0.16  0.00 -0.02 0.61 0.39 1.2
## AT_LA1   0.22  0.68  0.12 -0.02  0.12 0.54 0.46 1.3
## AT_LA2  -0.22  0.61 -0.07  0.08  0.28 0.51 0.49 1.8
## AT_LA3  -0.10  0.78 -0.07  0.10  0.07 0.64 0.36 1.1
## AT_LA4   0.53  0.02 -0.38  0.15  0.22 0.50 0.50 2.4
## AT_LA5   0.77  0.13 -0.15  0.23  0.19 0.72 0.28 1.4
## AT_SC1   0.16  0.21  0.49  0.03  0.33 0.42 0.58 2.4
## AT_SC2   0.61  0.48  0.17 -0.04  0.35 0.76 0.24 2.8
## AT_SC3   0.34  0.45  0.37  0.05  0.33 0.57 0.43 3.8
## AT_SC4   0.19  0.26  0.17 -0.16  0.62 0.54 0.46 1.9
## AT_SC5  -0.06  0.23  0.01  0.12  0.69 0.55 0.45 1.3
## AT_SC6   0.44  0.71  0.14  0.01  0.24 0.78 0.22 2.0
## ST_SD1   0.46  0.72  0.11  0.10  0.18 0.79 0.21 2.0
## ST_SD2   0.45  0.76  0.11  0.14  0.14 0.83 0.17 1.8
## ST_SD3   0.43  0.75  0.13  0.16  0.10 0.80 0.20 1.8
## ST_SD4   0.07  0.75 -0.07  0.20 -0.09 0.61 0.39 1.2
## ST_SD5   0.58  0.25  0.00  0.17 -0.19 0.47 0.53 1.8
## ST_SR2   0.87  0.27  0.01  0.12  0.14 0.86 0.14 1.3
## ST_SR1   0.84  0.30  0.05  0.10  0.03 0.81 0.19 1.3
## ST_SR3   0.82  0.32  0.00  0.18  0.06 0.82 0.18 1.4
## ST_SR4   0.80  0.33 -0.04  0.21 -0.04 0.79 0.21 1.5
## ST_SR5   0.63  0.33 -0.04  0.25 -0.14 0.59 0.41 2.0
## 
##                        RC2  RC1  RC4  RC3  RC5
## SS loadings           9.08 8.83 2.82 2.49 2.03
## Proportion Var        0.23 0.23 0.07 0.06 0.05
## Cumulative Var        0.23 0.46 0.53 0.60 0.65
## Proportion Explained  0.36 0.35 0.11 0.10 0.08
## Cumulative Proportion 0.36 0.71 0.82 0.92 1.00
## 
## Mean item complexity =  2
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.04 
##  with the empirical chi square  3907.38  with prob <  0 
## 
## Fit based upon off diagonal values = 0.99
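The h2, u2 and com columns of the loadings table can be reproduced from a single row of loadings. Here is a minimal sketch for NCSS_P1, using the rounded loadings printed above, so the results agree with the table only up to rounding. The variable names are illustrative.

```r
# Loadings of NCSS_P1 on the five components, copied from the output above
loadings_p1 <- c(RC2 = 0.44, RC1 = 0.11, RC4 = -0.20, RC3 = 0.59, RC5 = 0.09)

# Communality: share of the item's variance reproduced by the components
h2 <- sum(loadings_p1^2)

# Uniqueness: the share that is left over
u2 <- 1 - h2

# Hofmann's complexity index: (sum of a^2)^2 / sum of a^4
# A value of 1 means the item loads on one component only
com <- sum(loadings_p1^2)^2 / sum(loadings_p1^4)

round(c(h2 = h2, u2 = u2, com = com), 2)
```

These come out close to the 0.60, 0.40 and 2.3 shown for NCSS_P1. A complexity above 2 indicates that the item is spread over more than two components, which is worth keeping in mind when assigning items to scales.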

Second PCA with oblimin (oblique) rotation

df %>% 
  principal(
    nfactors = 5, 
    rotate = "oblimin"
  ) %>% 
  plot()  # Plot position of variables on components

Item Analysis

Form a scoring key for the scale scores, grouping the variables by column index according to the PCA solution

key <- list(
MR1 = c(6, 7 ,8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33),
MR2 = c(2, 10, 11, 12, 22),
MR3 = c(9,18, 23, 25, 34, 35, 36, 37, 38, 39),
MR4 = c(1, 3, 4, 5),
MR5 = c(24, 26, 27, 28)
)

Matrix of alphas, correlations, and correlations corrected for attenuation using scoreItems() from psych

scoreItems(key, df)
## Call: scoreItems(keys = key, items = df)
## 
## (Unstandardized) Alpha:
##        MR1  MR2  MR3  MR4  MR5
## alpha 0.88 0.77 0.76 0.83 0.89
## 
## Standard errors of unstandardized Alpha:
##          MR1   MR2   MR3   MR4   MR5
## ASE   0.0061 0.015 0.011 0.015 0.013
## 
## Average item correlation:
##           MR1 MR2  MR3  MR4  MR5
## average.r 0.3 0.4 0.24 0.55 0.66
## 
## Median item correlation:
##  MR1  MR2  MR3  MR4  MR5 
## 0.32 0.46 0.25 0.54 0.64 
## 
##  Guttman 6* reliability: 
##           MR1  MR2  MR3  MR4  MR5
## Lambda.6 0.94 0.87 0.86 0.88 0.91
## 
## Signal/Noise based upon av.r : 
##              MR1 MR2 MR3 MR4 MR5
## Signal/Noise   7 3.4 3.2   5 7.8
## 
## Scale intercorrelations corrected for attenuation 
##  raw correlations below the diagonal, alpha on the diagonal 
##  corrected correlations above the diagonal:
##      MR1  MR2  MR3  MR4  MR5
## MR1 0.88 1.04 1.03 0.98 0.89
## MR2 0.86 0.77 0.97 1.04 0.76
## MR3 0.85 0.75 0.76 0.95 0.81
## MR4 0.83 0.83 0.76 0.83 0.65
## MR5 0.78 0.63 0.67 0.55 0.89
## 
##  In order to see the item by scale loadings and frequency counts of the data
##  print with the short option = FALSE
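The corrected correlations above the diagonal follow the classical correction for attenuation: the raw correlation divided by the square root of the product of the two reliabilities. A minimal sketch with the values printed for MR1 and MR2 follows; the helper name disattenuate is hypothetical and is not a psych function.

```r
# Correction for attenuation: r_xy / sqrt(rel_x * rel_y)
disattenuate <- function(r_xy, rel_x, rel_y) {
  r_xy / sqrt(rel_x * rel_y)
}

# MR1 and MR2: raw correlation 0.86, alphas 0.88 and 0.77 (from the output above)
r12 <- disattenuate(0.86, 0.88, 0.77)
round(r12, 2)
```

The result is above 1, as in the printed matrix. This happens whenever the raw correlation between two scales exceeds the geometric mean of their reliabilities, and it is usually read as a sign that the two scales are not empirically distinct or that alpha understates their reliability.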

Add Scale Scores to Data

df %<>% 
  mutate(
    MR1 = rowMeans(df[c(6, 7, 8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33)], na.rm = TRUE),
    MR2 = rowMeans(df[c(2, 10, 11, 12, 22)], na.rm = TRUE),
    MR3 = rowMeans(df[c(9, 18, 23, 25, 34, 35, 36, 37, 38, 39)], na.rm = TRUE),
    MR4 = rowMeans(df[c(1, 3, 4, 5)], na.rm = TRUE),
    MR5 = rowMeans(df[c(24, 26, 27, 28)], na.rm = TRUE)
  )
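Selecting items by position is compact, but it breaks silently if the column order ever changes. As an alternative sketch (not the approach used in this section), the same kind of scale score can be computed from a key of column names. The toy data frame items and the key toy_key below are made up for illustration.

```r
# Toy item data with one missing response (illustrative only)
items <- data.frame(a1 = c(1, 2, 3),
                    a2 = c(3, 2, NA),
                    b1 = c(5, 4, 3))

# Scoring key by column name rather than by position
toy_key <- list(A = c("a1", "a2"), B = "b1")

# One column of row means per scale; missing responses are ignored
scores <- sapply(toy_key, function(cols) rowMeans(items[cols], na.rm = TRUE))

# Attach the scale scores to the item data
scored <- cbind(items, scores)
scored
```

Note that a plain row mean treats every item as positively keyed. Items that load negatively on their scale would need to be reverse-scored first.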

Descriptives for scale scores

df %>% 
select(MR1:MR5) %>%
describe()
##     vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## MR1    1 1924 2.94 0.98    3.0    2.95 1.11 1.0 5.0   4.0 -0.06    -0.77 0.02
## MR2    2 1924 2.85 0.46    2.8    2.84 0.30 1.2 4.6   3.4  0.12     0.87 0.01
## MR3    3 1924 3.31 1.17    3.7    3.37 1.33 1.0 5.0   4.0 -0.39    -1.25 0.03
## MR4    4 1924 2.96 1.02    3.0    2.95 1.11 1.0 5.0   4.0  0.06    -0.81 0.02
## MR5    5 1924 2.83 0.86    3.0    2.83 0.74 1.0 5.0   4.0  0.06    -0.02 0.02

Plot of Means and SDs for Aggregated Variables

df %>% 
select(MR1:MR5) %>%
error.bars(
ylim = c(1, 5),
sd = T,
arrow.col = "gray", 
eyes = F,
pch = 19,
main = "M & SD of C2C Aggregated Factors",
ylab = "Means",
xlab = "C2C Aggregated Factors"
)

Pairs Panels

The pairs panel puts histograms, scatterplots and correlation coefficients into a single figure, to summarise the outcome of the factor and item analysis done in previous sections.

df %>% 
select(MR1:MR5) %>%
pairs.panels(
hist.col = "gray", 
jiggle = T,
main = "C2C Aggregated Factors"
)

Cronbach’s Alpha

Cronbach’s alpha is a measure of internal consistency, that is, how closely related a set of items is as a group. We will be computing Cronbach’s alpha for each of the aggregated factors and their components. This is an example with MR1.

df %>%
select (c(6, 7 ,8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33)) %>%
psych::alpha()
## 
## Reliability analysis   
## Call: psych::alpha(x = .)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.95      0.95    0.96      0.53  18 0.0016  2.9 0.98     0.53
## 
##  lower alpha upper     95% confidence boundaries
## 0.95 0.95 0.95 
## 
##  Reliability if an item is dropped:
##        raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## AT_PT1      0.94      0.94    0.95      0.52  16   0.0017 0.029  0.53
## AT_PT2      0.94      0.94    0.95      0.52  16   0.0017 0.029  0.52
## AT_PT3      0.94      0.94    0.95      0.51  16   0.0018 0.027  0.52
## AT_WT5      0.95      0.95    0.95      0.54  17   0.0016 0.030  0.53
## AT_CT1      0.95      0.94    0.95      0.53  17   0.0017 0.030  0.52
## AT_CT2      0.95      0.94    0.95      0.53  17   0.0017 0.031  0.52
## AT_CT3      0.94      0.94    0.95      0.52  16   0.0018 0.029  0.52
## AT_CT4      0.95      0.95    0.96      0.56  19   0.0015 0.021  0.54
## AT_LA1      0.95      0.94    0.95      0.53  17   0.0017 0.031  0.53
## AT_LA2      0.95      0.95    0.96      0.55  19   0.0015 0.026  0.54
## AT_LA3      0.95      0.95    0.95      0.54  17   0.0016 0.031  0.54
## AT_SC6      0.94      0.94    0.95      0.51  16   0.0018 0.028  0.52
## ST_SD1      0.94      0.94    0.95      0.51  16   0.0018 0.027  0.52
## ST_SD2      0.94      0.94    0.95      0.51  16   0.0018 0.026  0.52
## ST_SD3      0.94      0.94    0.95      0.51  16   0.0018 0.026  0.52
## ST_SD4      0.95      0.95    0.95      0.54  17   0.0016 0.030  0.53
## 
##  Item statistics 
##           n raw.r std.r r.cor r.drop mean  sd
## AT_PT1 1924  0.79  0.79  0.78   0.76  3.0 1.3
## AT_PT2 1924  0.82  0.82  0.81   0.79  2.9 1.3
## AT_PT3 1924  0.88  0.88  0.88   0.86  2.9 1.4
## AT_WT5 1924  0.65  0.66  0.62   0.60  2.7 1.2
## AT_CT1 1924  0.71  0.72  0.70   0.67  2.7 1.2
## AT_CT2 1924  0.76  0.76  0.75   0.72  2.8 1.2
## AT_CT3 1924  0.84  0.83  0.83   0.81  3.1 1.5
## AT_CT4 1924  0.42  0.42  0.37   0.35  2.7 1.2
## AT_LA1 1924  0.73  0.73  0.71   0.69  3.1 1.3
## AT_LA2 1924  0.49  0.51  0.46   0.44  3.0 1.1
## AT_LA3 1924  0.65  0.66  0.63   0.61  3.0 1.2
## AT_SC6 1924  0.87  0.86  0.86   0.84  3.1 1.4
## ST_SD1 1924  0.87  0.87  0.87   0.85  3.0 1.4
## ST_SD2 1924  0.89  0.89  0.89   0.87  3.1 1.4
## ST_SD3 1924  0.88  0.87  0.88   0.86  3.0 1.4
## ST_SD4 1924  0.67  0.67  0.64   0.62  2.9 1.2
## 
## Non missing response frequency for each item
##           1    2    3    4    5 miss
## AT_PT1 0.15 0.22 0.28 0.16 0.19    0
## AT_PT2 0.17 0.23 0.28 0.16 0.17    0
## AT_PT3 0.20 0.25 0.20 0.15 0.21    0
## AT_WT5 0.20 0.21 0.36 0.13 0.10    0
## AT_CT1 0.17 0.27 0.30 0.16 0.09    0
## AT_CT2 0.17 0.25 0.32 0.16 0.10    0
## AT_CT3 0.20 0.18 0.19 0.17 0.26    0
## AT_CT4 0.20 0.24 0.31 0.16 0.09    0
## AT_LA1 0.14 0.19 0.31 0.20 0.17    0
## AT_LA2 0.10 0.23 0.38 0.20 0.10    0
## AT_LA3 0.09 0.25 0.34 0.19 0.14    0
## AT_SC6 0.17 0.23 0.23 0.12 0.25    0
## ST_SD1 0.19 0.20 0.24 0.12 0.25    0
## ST_SD2 0.18 0.19 0.24 0.16 0.23    0
## ST_SD3 0.19 0.21 0.24 0.16 0.21    0
## ST_SD4 0.15 0.19 0.39 0.15 0.12    0
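
Histogram of AT_CT4, the item with the lowest item-total correlation (r.drop = 0.35) in the table above

df %>%
  pull(AT_CT4) %>%
  hist()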
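For intuition about what the alpha figures measure, the raw coefficient can be written directly from item variances, and the standardised coefficient from the average inter-item correlation. The following is a minimal base R sketch, not the psych implementation: the helper cronbach_alpha and the toy data are assumptions, and it uses complete cases only.

```r
# Raw Cronbach's alpha:
# alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score)
cronbach_alpha <- function(items) {
  items <- na.omit(items)           # complete cases only, for simplicity
  k <- ncol(items)
  item_var <- sapply(items, var)    # variance of each item
  total_var <- var(rowSums(items))  # variance of the summed score
  k / (k - 1) * (1 - sum(item_var) / total_var)
}

# Toy example: three items that move together (illustrative data)
toy <- data.frame(i1 = c(1, 2, 3, 4, 5),
                  i2 = c(2, 2, 3, 5, 5),
                  i3 = c(1, 3, 3, 4, 4))
cronbach_alpha(toy)

# Standardised alpha from the number of items and the average inter-item r,
# here with k = 16 and average_r = 0.53 as reported for MR1
k <- 16
r_bar <- 0.53
std_alpha <- k * r_bar / (1 + (k - 1) * r_bar)
round(std_alpha, 2)
```

The second calculation reproduces the std.alpha of 0.95 reported for MR1. It also shows why long scales reach high alpha values even with moderate inter-item correlations, since alpha rises with the number of items k.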

Item Response Theory

Item Response Theory Output for MR1

df %>%
select (c(6, 7 ,8, 13, 14, 15, 16, 17, 19, 20, 21, 29, 30, 31, 32, 33)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR1") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##               -3    -2    -1     0     1     2    3
## AT_PT1      0.35  0.79  1.06  1.17  1.16  0.78 0.29
## AT_PT2      0.35  0.86  1.12  1.22  1.29  0.98 0.38
## AT_PT3      0.69  1.49  1.17  1.76  2.00  1.87 0.64
## AT_WT5      0.20  0.36  0.54  0.63  0.62  0.49 0.30
## AT_CT1      0.22  0.44  0.66  0.78  0.77  0.61 0.35
## AT_CT2      0.25  0.55  0.81  0.91  0.91  0.74 0.40
## AT_CT3      0.31  1.02  1.54  1.70  1.50  0.74 0.18
## AT_CT4      0.12  0.15  0.17  0.18  0.18  0.16 0.13
## AT_LA1      0.30  0.57  0.80  0.86  0.77  0.52 0.25
## AT_LA2      0.17  0.22  0.26  0.27  0.26  0.22 0.17
## AT_LA3      0.26  0.41  0.54  0.60  0.55  0.41 0.24
## AT_SC6      0.57  1.23  1.32  1.61  1.95  1.11 0.22
## ST_SD1      0.57  1.48  1.55  1.47  2.05  1.44 0.28
## ST_SD2      1.26  1.40  1.85  1.26  2.02  1.96 0.49
## ST_SD3      0.72  1.48  1.53  1.42  1.83  1.68 0.49
## ST_SD4      0.24  0.43  0.60  0.67  0.63  0.48 0.29
## Test Info   6.57 12.88 15.51 16.51 18.49 14.18 5.12
## SEM         0.39  0.28  0.25  0.25  0.23  0.27 0.44
## Reliability 0.85  0.92  0.94  0.94  0.95  0.93 0.80
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 104  and the objective function was  1.82 
## The number of observations was  1924  with Chi Square =  3495.98  with prob <  0 
## 
## The root mean square of the residuals (RMSA) is  0.05 
## The df corrected root mean square of the residuals is  0.06 
## 
## Tucker Lewis Index of factoring reliability =  0.873
## RMSEA index =  0.13  and the 10 % confidence intervals are  0.127 0.134
## BIC =  2709.51
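The SEM and Reliability rows of the summary are consistent with two simple transformations of the test information: SEM = 1 / sqrt(information) and reliability = 1 - 1 / information. A minimal sketch using the MR1 test information printed above is shown here; the variable names are illustrative.

```r
# Test information for MR1 at theta = -3, ..., 3, copied from the output above
theta <- -3:3
info <- c(6.57, 12.88, 15.51, 16.51, 18.49, 14.18, 5.12)

# Standard error of measurement and reliability implied by the information
sem <- 1 / sqrt(info)
reliability <- 1 - 1 / info

round(data.frame(theta, info, sem, reliability), 2)
```

Because reliability is 1 - 1 / information, it turns negative whenever the test information drops below 1. This is what happens at the extremes of the trait range for the shorter scales in the outputs that follow.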

Item Response Theory Output for MR2

df %>%
select (c(2, 10, 11, 12, 22)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR2") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3   -2   -1    0    1    2    3
## AT_WT2       0.19 0.35 0.52 0.62 0.61 0.49 0.32
## AT_WT3       0.25 0.49 0.68 0.74 0.72 0.61 0.40
## AT_WT4       0.22 0.52 0.79 0.84 0.83 0.78 0.56
## AT_LA4       0.18 0.30 0.42 0.49 0.47 0.38 0.26
## Test Info    0.85 1.66 2.41 2.69 2.63 2.26 1.53
## SEM          1.08 0.78 0.64 0.61 0.62 0.66 0.81
## Reliability -0.17 0.40 0.58 0.63 0.62 0.56 0.35
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 5  and the objective function was  0.1 
## The number of observations was  1924  with Chi Square =  193.02  with prob <  8.9e-40 
## 
## The root mean square of the residuals (RMSA) is  0.07 
## The df corrected root mean square of the residuals is  0.1 
## 
## Tucker Lewis Index of factoring reliability =  0.846
## RMSEA index =  0.14  and the 10 % confidence intervals are  0.123 0.157
## BIC =  155.2

Item Response Theory Output for MR3

df %>%
select (c(9,18, 23, 25, 34, 35, 36, 37, 38, 39)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR3") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##               -3    -2    -1     0     1    2    3
## AT_WT1      0.52  1.14  1.49  1.43  1.00 0.39 0.10
## AT_CT5      0.30  0.57  0.79  0.85  0.70 0.43 0.21
## AT_LA5      0.40  0.99  1.43  1.48  1.03 0.39 0.10
## AT_SC2      0.21  0.53  0.96  1.17  0.92 0.47 0.18
## ST_SD5      0.29  0.49  0.66  0.68  0.57 0.37 0.20
## ST_SR2      1.42  2.49  1.44  2.24  4.47 0.50 0.00
## ST_SR1      1.07  2.17  2.16  2.19  1.82 1.16 0.11
## ST_SR3      2.13  1.68  1.59  2.39  1.13 2.19 0.26
## ST_SR4      1.00  1.64  1.88  1.80  1.51 0.88 0.13
## ST_SR5      0.46  0.75  0.85  0.83  0.72 0.45 0.20
## Test Info   7.81 12.45 13.25 15.06 13.85 7.23 1.49
## SEM         0.36  0.28  0.27  0.26  0.27 0.37 0.82
## Reliability 0.87  0.92  0.92  0.93  0.93 0.86 0.33
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 35  and the objective function was  0.76 
## The number of observations was  1924  with Chi Square =  1452.62  with prob <  7.1e-283 
## 
## The root mean square of the residuals (RMSA) is  0.04 
## The df corrected root mean square of the residuals is  0.04 
## 
## Tucker Lewis Index of factoring reliability =  0.917
## RMSEA index =  0.145  and the 10 % confidence intervals are  0.139 0.152
## BIC =  1187.94

Item Response Theory Output for MR4

df %>%
select (c(1, 3, 4, 5)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR4") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3   -2   -1    0    1    2    3
## NCSS_P1      0.35 0.58 0.75 0.77 0.63 0.38 0.18
## NCSS_P3      0.15 0.21 0.26 0.29 0.29 0.25 0.19
## NCSS_P4      0.37 0.84 1.04 1.09 1.16 1.02 0.51
## NCSS_P5      0.02 0.17 4.93 0.33 2.44 1.22 0.47
## Test Info    0.89 1.80 6.97 2.48 4.51 2.87 1.35
## SEM          1.06 0.75 0.38 0.63 0.47 0.59 0.86
## Reliability -0.12 0.44 0.86 0.60 0.78 0.65 0.26
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 2  and the objective function was  0.01 
## The number of observations was  1924  with Chi Square =  11.09  with prob <  0.0039 
## 
## The root mean square of the residuals (RMSA) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## Tucker Lewis Index of factoring reliability =  0.993
## RMSEA index =  0.049  and the 10 % confidence intervals are  0.023 0.078
## BIC =  -4.04

Item Response Theory Output for MR5

df %>%
select (c(24, 26, 27, 28)) %>%
irt.fa %T>%  # T-pipe
plot(main = "C2C Aggregated Factor : MR5") %>% 
print()

## Item Response Analysis using Factor Analysis  
## 
## Call: irt.fa(x = .)
## Item Response Analysis using Factor Analysis  
## 
##  Summary information by factor and item
##  Factor =  1 
##                -3   -2   -1    0    1    2    3
## AT_SC1       0.18 0.28 0.38 0.43 0.42 0.34 0.24
## AT_SC3       0.25 0.65 1.00 1.10 1.16 0.90 0.38
## AT_SC4       0.23 0.36 0.49 0.55 0.51 0.39 0.25
## AT_SC5       0.14 0.17 0.20 0.21 0.21 0.18 0.15
## Test Info    0.79 1.47 2.07 2.29 2.30 1.82 1.01
## SEM          1.12 0.82 0.70 0.66 0.66 0.74 0.99
## Reliability -0.26 0.32 0.52 0.56 0.57 0.45 0.01
## 
## Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
##     fm = fm)
## 
## Test of the hypothesis that 1 factor is sufficient.
## The degrees of freedom for the model is 2  and the objective function was  0.06 
## The number of observations was  1924  with Chi Square =  106.46  with prob <  7.6e-24 
## 
## The root mean square of the residuals (RMSA) is  0.06 
## The df corrected root mean square of the residuals is  0.1 
## 
## Tucker Lewis Index of factoring reliability =  0.794
## RMSEA index =  0.165  and the 10 % confidence intervals are  0.139 0.192
## BIC =  91.33

Confirmatory Factor Analysis

Define the factors for the CFA model

cfa.model <- 'MR1L =~ AT_PT1 + AT_PT2 + AT_PT3 + AT_WT5 + AT_CT1 + AT_CT2 + AT_CT3 + AT_CT4 + AT_LA1 + AT_LA2 + AT_LA3 + AT_SC6 + ST_SD1 + ST_SD2 + ST_SD3 + ST_SD4
MR2L =~ NCSS_P2 + AT_WT2 + AT_WT3 + AT_WT4 + AT_LA4
MR3L =~ AT_WT1 + AT_CT5 + AT_LA5 + AT_SC2 + ST_SD5 + ST_SR2 + ST_SR1 + ST_SR3 + ST_SR4 + ST_SR5
MR4L =~ NCSS_P1 + NCSS_P3 + NCSS_P4 + NCSS_P5
MR5L =~ AT_SC1 + AT_SC3 + AT_SC4 + AT_SC5'

Fit the CFA Model

fit <- cfa(cfa.model, data = df)
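The headline fit indices that lavaan reports for a fitted model can be reproduced by hand from the chi-square tests of the user model and the baseline model. Here is a minimal sketch using the test statistics from the summary output of this model, shown further down. The RMSEA line divides by N rather than N - 1, which is an assumption about the convention in use; with 1924 observations the difference is negligible.

```r
# Test statistics copied from the lavaan summary of this model
chisq_m <- 11636.334; df_m <- 692   # user model
chisq_b <- 60703.928; df_b <- 741   # baseline (independence) model
n <- 1924                           # number of observations

# Comparative Fit Index
cfi <- 1 - max(chisq_m - df_m, 0) / max(chisq_b - df_b, chisq_m - df_m, 0)

# Tucker-Lewis Index
tli <- (chisq_b / df_b - chisq_m / df_m) / (chisq_b / df_b - 1)

# Root Mean Square Error of Approximation
rmsea <- sqrt(max(chisq_m - df_m, 0) / (df_m * n))

round(c(CFI = cfi, TLI = tli, RMSEA = rmsea), 3)
```

CFI and TLI both compare the misfit of the user model with that of a baseline model in which all items are uncorrelated, so values near 1 mean the model removes most of that misfit.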

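Check the results. For a quick check, look at the CFI and TLI under “User Model versus Baseline Model”, which should be above 0.90, or preferably 0.95, for a well-fitting model. Here they are 0.817 and 0.805, so the fit falls short of that benchmark.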

fit %>% summary(fit.measures = TRUE)
## lavaan 0.6-9 ended normally after 91 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        88
##                                                       
##   Number of observations                          1924
##                                                       
## Model Test User Model:
##                                                        
##   Test statistic                              11636.334
##   Degrees of freedom                                692
##   P-value (Chi-square)                            0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                             60703.928
##   Degrees of freedom                               741
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.817
##   Tucker-Lewis Index (TLI)                       0.805
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)            -101202.716
##   Loglikelihood unrestricted model (H1)     -95384.549
##                                                       
##   Akaike (AIC)                              202581.432
##   Bayesian (BIC)                            203070.902
##   Sample-size adjusted Bayesian (BIC)       202791.325
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.091
##   90 Percent confidence interval - lower         0.089
##   90 Percent confidence interval - upper         0.092
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.103
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   MR1L =~                                             
##     AT_PT1            1.000                           
##     AT_PT2            1.018    0.025   40.916    0.000
##     AT_PT3            1.220    0.026   47.354    0.000
##     AT_WT5            0.693    0.025   28.136    0.000
##     AT_CT1            0.756    0.024   31.508    0.000
##     AT_CT2            0.814    0.024   34.155    0.000
##     AT_CT3            1.188    0.028   42.885    0.000
##     AT_CT4            0.385    0.026   14.690    0.000
##     AT_LA1            0.823    0.025   32.675    0.000
##     AT_LA2            0.424    0.024   17.989    0.000
##     AT_LA3            0.640    0.024   26.744    0.000
##     AT_SC6            1.194    0.026   45.286    0.000
##     ST_SD1            1.238    0.026   47.311    0.000
##     ST_SD2            1.239    0.025   48.806    0.000
##     ST_SD3            1.206    0.025   47.491    0.000
##     ST_SD4            0.714    0.024   29.903    0.000
##   MR2L =~                                             
##     NCSS_P2           1.000                           
##     AT_WT2            3.396    0.506    6.707    0.000
##     AT_WT3           -3.831    0.564   -6.793    0.000
##     AT_WT4            4.009    0.589    6.806    0.000
##     AT_LA4           -3.900    0.578   -6.750    0.000
##   MR3L =~                                             
##     AT_WT1            1.000                           
##     AT_CT5            0.780    0.023   34.374    0.000
##     AT_LA5            1.005    0.024   42.029    0.000
##     AT_SC2            0.979    0.027   36.257    0.000
##     ST_SD5            0.686    0.023   29.992    0.000
##     ST_SR2            1.258    0.023   54.258    0.000
##     ST_SR1            1.178    0.023   51.893    0.000
##     ST_SR3            1.129    0.021   53.474    0.000
##     ST_SR4            1.089    0.022   49.657    0.000
##     ST_SR5            0.711    0.020   35.829    0.000
##   MR4L =~                                             
##     NCSS_P1           1.000                           
##     NCSS_P3           0.668    0.035   19.186    0.000
##     NCSS_P4           1.155    0.035   33.049    0.000
##     NCSS_P5           1.457    0.041   35.784    0.000
##   MR5L =~                                             
##     AT_SC1            1.000                           
##     AT_SC3            1.651    0.075   21.968    0.000
##     AT_SC4            0.980    0.055   17.901    0.000
##     AT_SC5            0.621    0.048   12.811    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   MR1L ~~                                             
##     MR2L             -0.037    0.008   -4.778    0.000
##     MR3L              0.870    0.040   21.630    0.000
##     MR4L              0.451    0.028   16.184    0.000
##     MR5L              0.528    0.031   16.948    0.000
##   MR2L ~~                                             
##     MR3L             -0.141    0.022   -6.529    0.000
##     MR4L             -0.096    0.015   -6.401    0.000
##     MR5L              0.003    0.004    0.866    0.386
##   MR3L ~~                                             
##     MR4L              0.724    0.036   19.955    0.000
##     MR5L              0.407    0.028   14.540    0.000
##   MR4L ~~                                             
##     MR5L              0.189    0.018   10.273    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .AT_PT1            0.640    0.022   29.436    0.000
##    .AT_PT2            0.592    0.020   29.243    0.000
##    .AT_PT3            0.391    0.014   27.147    0.000
##    .AT_WT5            0.926    0.030   30.492    0.000
##    .AT_CT1            0.802    0.026   30.297    0.000
##    .AT_CT2            0.728    0.024   30.097    0.000
##    .AT_CT3            0.652    0.023   28.823    0.000
##    .AT_CT4            1.303    0.042   30.901    0.000
##    .AT_LA1            0.854    0.028   30.215    0.000
##    .AT_LA2            1.012    0.033   30.837    0.000
##    .AT_LA3            0.901    0.029   30.557    0.000
##    .AT_SC6            0.495    0.018   28.097    0.000
##    .ST_SD1            0.405    0.015   27.171    0.000
##    .ST_SD2            0.321    0.012   26.155    0.000
##    .ST_SD3            0.374    0.014   27.068    0.000
##    .ST_SD4            0.833    0.027   30.397    0.000
##    .NCSS_P2           1.340    0.044   30.803    0.000
##    .AT_WT2            0.903    0.033   27.301    0.000
##    .AT_WT3            0.604    0.025   23.834    0.000
##    .AT_WT4            0.564    0.025   22.587    0.000
##    .AT_LA4            0.913    0.035   26.145    0.000
##    .AT_WT1            0.641    0.022   28.872    0.000
##    .AT_CT5            0.909    0.030   30.097    0.000
##    .AT_LA5            0.792    0.027   29.262    0.000
##    .AT_SC2            1.223    0.041   29.941    0.000
##    .ST_SD5            1.017    0.033   30.383    0.000
##    .ST_SR2            0.331    0.014   24.389    0.000
##    .ST_SR1            0.403    0.015   26.242    0.000
##    .ST_SR3            0.300    0.012   25.114    0.000
##    .ST_SR4            0.447    0.016   27.351    0.000
##    .ST_SR5            0.668    0.022   29.979    0.000
##    .NCSS_P1           0.802    0.029   27.892    0.000
##    .NCSS_P3           1.249    0.041   30.187    0.000
##    .NCSS_P4           0.513    0.022   23.611    0.000
##    .NCSS_P5           0.259    0.023   11.505    0.000
##    .AT_SC1            1.044    0.037   28.204    0.000
##    .AT_SC3            0.519    0.036   14.575    0.000
##    .AT_SC4            0.988    0.035   28.159    0.000
##    .AT_SC5            1.188    0.039   30.088    0.000
##     MR1L              1.089    0.052   20.845    0.000
##     MR2L              0.041    0.012    3.435    0.001
##     MR3L              1.357    0.062   22.042    0.000
##     MR4L              0.761    0.045   16.887    0.000
##     MR5L              0.433    0.037   11.692    0.000
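
The raw estimates above can be hard to read directly, so here is a minimal base R sketch for interpreting them. It uses numbers copied by hand from the output (rounded to three decimals), and the 0.08 cutoffs for RMSEA and SRMR are commonly quoted rules of thumb assumed here, not values produced by the model. The helper `std_loading()` is a hypothetical name introduced only for this illustration.

```r
# Values copied by hand from the summary output above (rounded to 3 d.p.).
fit_indices <- c(rmsea = 0.091, srmr = 0.103)

# Assumed rule-of-thumb cutoffs; these are conventions, not part of the output.
cutoffs <- c(rmsea = 0.08, srmr = 0.08)
within_cutoff <- fit_indices <= cutoffs
print(within_cutoff)

# Standardized loading of an indicator that loads on a single factor:
# lambda * sqrt(psi) / sqrt(lambda^2 * psi + theta)
# lambda = raw loading, psi = factor variance, theta = residual variance.
std_loading <- function(lambda, psi, theta) {
  lambda * sqrt(psi) / sqrt(lambda^2 * psi + theta)
}

psi_mr2l <- 0.041                                    # variance of MR2L
std_ncss_p2 <- std_loading(1.000, psi_mr2l, 1.340)   # marker item of MR2L
std_at_wt4  <- std_loading(4.009, psi_mr2l, 0.564)   # largest raw loading on MR2L
print(round(c(NCSS_P2 = std_ncss_p2, AT_WT4 = std_at_wt4), 2))

# Latent correlation = covariance / sqrt(product of the two factor variances).
cor_mr1l_mr3l <- 0.870 / sqrt(1.089 * 1.357)
print(round(cor_mr1l_mr3l, 2))
```

Under these assumptions, both fit indices fall outside the cutoffs. The large raw loadings on MR2L appear to come from its marker item NCSS_P2 loading weakly, which makes the factor variance small. If the fitted model were stored in an object called `fit` (a hypothetical name), lavaan's `fitMeasures(fit)` and `standardizedSolution(fit)` would report these quantities directly without hand calculation.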

Congratulations!

You have completed session 1 of the S3729C Data Analytics Seminar.