Principal component analysis

Load data:

library(tidyverse)

data <- read_csv("../data/by_ppt_means_wide.csv") %>%
  select(where(~ !any(is.na(.x)))) # remove variables that contain NAs

Run principal component analysis:

# Centre and scale all numeric (double) variables, then run the PCA
pc <- prcomp(dplyr::select(data, where(is.double)), center = TRUE, scale. = TRUE)

The scree plot suggests retaining 6 or 7 PCs.
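A scree plot shows the proportion of variance explained by each component; a common heuristic is to retain the components that come before the curve flattens out. The following is a minimal sketch on R's built-in `USArrests` data, which stands in for the study data here. The names `pc_demo` and `var_explained` are illustrative and are not objects from the analysis above.

```r
# Illustration on built-in data (USArrests), not the study data
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC
var_explained <- pc_demo$sdev^2 / sum(pc_demo$sdev^2)

# Scree plot: look for the "elbow" where the curve flattens
plot(var_explained, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")

# Cumulative proportion, as a complement to the visual check
cumsum(var_explained)
```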

These are the strongest loadings for the first three PCs. Variables whose absolute loading on a PC was below .2 (\(|\text{loading}| < .2\)) were omitted.

dv pc loading
is_lookback PC1 -0.34
edit PC1 -0.31
is_lookback_post-word_production PC1 -0.30
is_lookback_within-word_production PC1 -0.30
productionseq PC1 0.30
edit_post-word PC1 -0.30
edit_within-word PC1 -0.29
nonlookbackseq PC1 0.27
n_edges PC1 -0.21
n_jumps PC1 -0.20
log_totalfix_duration PC2 -0.39
n_fixes PC2 -0.38
log_totalfix_duration_post-word_production PC2 -0.36
log_totalfix_duration_within-word_production PC2 -0.35
n_fixes_post-word_production PC2 -0.30
n_fixes_within-word_production PC2 -0.29
log_totalfix_duration_post-sentence_production PC2 -0.26
n_fixes_post-sentence_production PC2 -0.24
log_event_duration_within-word_production PC3 0.32
edit_within-word PC3 -0.31
log_event_duration PC3 0.31
edit PC3 -0.31
log_event_duration_within-word_deletion PC3 0.26
n_edges PC3 -0.24
log_event_duration_post-word_production PC3 0.24
is_lookback_within-word_production PC3 0.23
edit_post-word PC3 -0.22
n_jumps PC3 -0.22
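A table of this kind can be derived from the rotation matrix returned by `prcomp()`. The sketch below assumes that the loadings are the entries of `$rotation` and again uses the built-in `USArrests` data in place of the study data. The object names (`pc_demo`, `loadings`, `strong`) are illustrative; the column names mirror the table header.

```r
# Illustration on built-in data (USArrests), not the study data
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Reshape the rotation matrix to long format: one row per variable/PC pair
loadings <- as.data.frame(as.table(pc_demo$rotation), responseName = "loading")
names(loadings) <- c("dv", "pc", "loading")

# Keep the first three PCs and drop weak loadings (|loading| < .2)
strong <- subset(loadings, pc %in% c("PC1", "PC2", "PC3") & abs(loading) >= 0.2)

# Order by PC, then by absolute loading (strongest first)
strong <- strong[order(strong$pc, -abs(strong$loading)), ]
strong
```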

Loadings arranged by PC, with weak ones omitted. Within each PC, variables are ordered from the strongest to the weakest loading. Negative loadings are marked (-) and positive loadings (+).

PC1                                     PC2                                                 PC3
is_lookback (-)                         log_totalfix_duration (-)                           log_event_duration_within-word_production (+)
edit (-)                                n_fixes (-)                                         edit_within-word (-)
is_lookback_post-word_production (-)    log_totalfix_duration_post-word_production (-)      log_event_duration (+)
is_lookback_within-word_production (-)  log_totalfix_duration_within-word_production (-)    edit (-)
productionseq (+)                       n_fixes_post-word_production (-)                    log_event_duration_within-word_deletion (+)
edit_post-word (-)                      n_fixes_within-word_production (-)                  n_edges (-)
edit_within-word (-)                    log_totalfix_duration_post-sentence_production (-)  log_event_duration_post-word_production (+)
nonlookbackseq (+)                      n_fixes_post-sentence_production (-)                is_lookback_within-word_production (+)
n_edges (-)                             NA                                                  edit_post-word (-)
n_jumps (-)                             NA                                                  n_jumps (-)

Clustering

Hartigan's rule suggests 4 to 5 clusters.

Gap statistic suggests 2 clusters.
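Both criteria can be computed with a few lines of R. The sketch below is one possible way to do so, not necessarily the procedure used for the results above. It runs on simulated data (`x`, a hypothetical stand-in for the PC scores), computes Hartigan's statistic by hand from the k-means within-cluster sums of squares, and uses `clusGap()` and `maxSE()` from the `cluster` package for the gap statistic.

```r
library(cluster)  # clusGap(), maxSE()

set.seed(1)
# Hypothetical stand-in for the PC scores: 60 points around three centres
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Hartigan's rule: H(k) = (W_k / W_{k+1} - 1) * (n - k - 1),
# where W_k is the total within-cluster sum of squares for k clusters.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
n <- nrow(x)
k <- 1:5
hartigan <- (wss[k] / wss[k + 1] - 1) * (n - k - 1)
# By convention, H(k) > 10 suggests that k + 1 clusters improve on k clusters.
hartigan

# Gap statistic, with k chosen by the criterion of Tibshirani et al. (2001)
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               verbose = FALSE, nstart = 25)
k_gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
k_gap
```

The two criteria answer slightly different questions, so it is not unusual for them to disagree, as they do in the results above.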

Create clusters using \(K\)-means.

# pcs holds the participants' PC scores (assumed here: pc$x restricted to the retained PCs)
k3 <- kmeans(pcs, centers = 3, nstart = 25)

3D cluster plots

The plot is interactive: click on it to rotate it and to zoom in and out. Each sphere represents one participant; colour indicates cluster.

Three-cluster solution (interactive 3D plot).

By-cluster means

Means of principal components by cluster.

# A tibble: 3 × 4
  pc    `Cluster 1` `Cluster 2` `Cluster 3`
  <chr>       <dbl>       <dbl>       <dbl>
1 PC1        -0.467      -1.28        4.18 
2 PC2        -1.08        2.36       -0.361
3 PC3         0.552      -0.766      -0.641
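A table of by-cluster means like this one can be obtained by averaging the PC scores within each cluster. The sketch below assumes that the clustering was run on the scores of the first three PCs and uses the built-in `USArrests` data in place of the study data. All object names ending in `_demo`, as well as `cluster_means` and `means_wide`, are illustrative.

```r
set.seed(1)

# Illustration on built-in data (USArrests), not the study data
pc_demo  <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pcs_demo <- pc_demo$x[, 1:3]  # assumed: scores on the first three PCs
k3_demo  <- kmeans(pcs_demo, centers = 3, nstart = 25)

# Mean PC score per cluster (rows = clusters, columns = PCs)
cluster_means <- aggregate(as.data.frame(pcs_demo),
                           by = list(cluster = k3_demo$cluster),
                           FUN = mean)

# Transpose so that PCs are rows and clusters are columns
means_wide <- t(cluster_means[, -1])
colnames(means_wide) <- paste("Cluster", cluster_means$cluster)
means_wide
```

Because k-means centres are the per-cluster means, the same values are also available as `k3_demo$centers`.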

Simplified: rank the clusters by their mean score on each of the three principal components. In the resulting table, the cluster with the highest mean is shown as \(+\), the cluster with the lowest mean is shown as \(-\), and the middle one is left blank.

     Cluster 1  Cluster 2  Cluster 3
PC1             \(-\)      \(+\)
PC2  \(-\)      \(+\)
PC3  \(+\)      \(-\)
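The simplification can be reproduced directly from the by-cluster means reported above. In this sketch the numbers are copied from that table; the helper `mark_extremes()` and the object names are illustrative.

```r
# Cluster means copied from the table of by-cluster means
means <- matrix(
  c(-0.467, -1.28,   4.18,
    -1.08,   2.36,  -0.361,
     0.552, -0.766, -0.641),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("PC1", "PC2", "PC3"), paste("Cluster", 1:3))
)

# For each PC (row): "+" for the highest mean, "-" for the lowest, "" otherwise
mark_extremes <- function(x) {
  out <- rep("", length(x))
  out[which.max(x)] <- "+"
  out[which.min(x)] <- "-"
  out
}

simplified <- t(apply(means, 1, mark_extremes))
dimnames(simplified) <- dimnames(means)
simplified
```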

Participants that changed cluster

Of the 30 participants, the following changed cluster from task 1 to task 2:

[1] "F19-0038" "F20-0013" "F21-0034" "F22-0043" "M19-0006"
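Given one cluster label per participant and task, the participants who changed cluster are those whose two labels differ. The sketch below uses a small made-up data frame (`clusters`, with hypothetical participant IDs) and assumes that cluster labels are comparable across tasks, for example because both tasks were clustered in a single k-means run.

```r
# Hypothetical cluster assignments for six made-up participants
clusters <- data.frame(
  ppt   = c("P01", "P02", "P03", "P04", "P05", "P06"),
  task1 = c(1, 2, 3, 1, 2, 3),
  task2 = c(1, 3, 3, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Participants whose cluster differs between task 1 and task 2
changed <- clusters$ppt[clusters$task1 != clusters$task2]
changed
# [1] "P02" "P04"
```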

Cluster changes by participant

Cluster membership changes by task

Cluster membership for the first and the second task, showing how many participants switched cluster and to which cluster.
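The same information can be summarised numerically as a cross-tabulation of task 1 against task 2 cluster membership. This is a minimal sketch with made-up assignments for six hypothetical participants; `task1`, `task2`, `transitions` and `n_switched` are illustrative names.

```r
# Hypothetical cluster assignments for six made-up participants
task1 <- factor(c(1, 2, 3, 1, 2, 3), levels = 1:3)
task2 <- factor(c(1, 3, 3, 2, 2, 3), levels = 1:3)

# Rows: cluster in task 1; columns: cluster in task 2.
# Off-diagonal cells count participants who switched cluster.
transitions <- table(task1 = task1, task2 = task2)
transitions

# Total number of participants who switched cluster
n_switched <- sum(transitions) - sum(diag(transitions))
n_switched
# [1] 2
```

Fixing the factor levels keeps the table square even if a cluster happens to be empty in one of the tasks.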