Load data:
library(tidyverse)
data <- read_csv("../data/by_ppt_means_wide.csv") %>%
  select(where(~ !any(is.na(.x)))) # remove variables that contain NAs
Run principal component analysis:
# Principal component analysis
pc <- prcomp(dplyr::select(data, where(is.double)), center = TRUE, scale. = TRUE)
The scree plot suggests retaining 6 or 7 PCs.
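The quantity behind a scree plot is the proportion of variance explained by each component. The following is a minimal sketch of how it could be computed and plotted. It uses the built-in `USArrests` data as a stand-in (an assumption for illustration, not the study data), and `pc_demo` is a hypothetical name.

```r
# Stand-in data (assumption): the built-in USArrests data set, not the study data
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pc_demo$sdev^2 / sum(pc_demo$sdev^2)

# Scree plot: look for the "elbow" where the curve flattens
plot(var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained")

# Cumulative proportion, useful for judging how many PCs to keep
cumsum(var_explained)
```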
These are the strongest loadings for the first three PCs. Variables whose absolute loading on a PC was below \(.2\) were omitted.
dv | pc | loading |
---|---|---|
is_lookback | PC1 | -0.34 |
edit | PC1 | -0.31 |
is_lookback_post-word_production | PC1 | -0.30 |
is_lookback_within-word_production | PC1 | -0.30 |
productionseq | PC1 | 0.30 |
edit_post-word | PC1 | -0.30 |
edit_within-word | PC1 | -0.29 |
nonlookbackseq | PC1 | 0.27 |
n_edges | PC1 | -0.21 |
n_jumps | PC1 | -0.20 |
log_totalfix_duration | PC2 | -0.39 |
n_fixes | PC2 | -0.38 |
log_totalfix_duration_post-word_production | PC2 | -0.36 |
log_totalfix_duration_within-word_production | PC2 | -0.35 |
n_fixes_post-word_production | PC2 | -0.30 |
n_fixes_within-word_production | PC2 | -0.29 |
log_totalfix_duration_post-sentence_production | PC2 | -0.26 |
n_fixes_post-sentence_production | PC2 | -0.24 |
log_event_duration_within-word_production | PC3 | 0.32 |
edit_within-word | PC3 | -0.31 |
log_event_duration | PC3 | 0.31 |
edit | PC3 | -0.31 |
log_event_duration_within-word_deletion | PC3 | 0.26 |
n_edges | PC3 | -0.24 |
log_event_duration_post-word_production | PC3 | 0.24 |
is_lookback_within-word_production | PC3 | 0.23 |
edit_post-word | PC3 | -0.22 |
n_jumps | PC3 | -0.22 |
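A table of this kind could be derived from the rotation matrix returned by `prcomp()`. The sketch below is one possible way to do it in base R, again with `USArrests` as stand-in data (an assumption). The names `pc_demo`, `loadings_long` and `strong` are hypothetical, and the cut-off of .2 is applied to the unrounded loadings.

```r
# Stand-in data (assumption): built-in USArrests, not the study data
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Reshape the loading matrix (variables x PCs) into long format
rot <- pc_demo$rotation[, 1:3]
loadings_long <- data.frame(
  dv      = rep(rownames(rot), times = ncol(rot)),
  pc      = rep(colnames(rot), each = nrow(rot)),
  loading = as.vector(rot)
)

# Keep loadings with an absolute value of at least .2,
# ordered by PC and then from strongest to weakest
strong <- subset(loadings_long, abs(loading) >= 0.2)
strong <- strong[order(strong$pc, -abs(strong$loading)), ]
strong$loading <- round(strong$loading, 2)
strong
```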
Loadings arranged by PC, with weak ones omitted. Within each PC, variables are ordered from the strongest to the weakest loading. Negative correlations are indicated in red and positive correlations in blue.
PC1 | PC2 | PC3 |
---|---|---|
is_lookback | log_totalfix_duration | log_event_duration_within-word_production |
edit | n_fixes | edit_within-word |
is_lookback_post-word_production | log_totalfix_duration_post-word_production | log_event_duration |
is_lookback_within-word_production | log_totalfix_duration_within-word_production | edit |
productionseq | n_fixes_post-word_production | log_event_duration_within-word_deletion |
edit_post-word | n_fixes_within-word_production | n_edges |
edit_within-word | log_totalfix_duration_post-sentence_production | log_event_duration_post-word_production |
nonlookbackseq | n_fixes_post-sentence_production | is_lookback_within-word_production |
n_edges | NA | edit_post-word |
n_jumps | NA | n_jumps |
Hartigan’s rule suggests 4-5 clusters.
Gap statistic suggests 2 clusters.
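Both criteria can be computed along the following lines. This is a sketch under stated assumptions, not the code used for the results above: it uses scaled `USArrests` as stand-in data, computes Hartigan's index by hand as \(H(k) = (W_k / W_{k+1} - 1)(n - k - 1)\), where \(W_k\) is the total within-cluster sum of squares for \(k\) clusters (a further cluster is usually added while \(H(k) > 10\)), and relies on `clusGap()` from the `cluster` package, assumed to be installed, for the gap statistic.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
# Stand-in data (assumption): scaled USArrests, not the study's PC scores
x <- scale(USArrests)
n <- nrow(x)

# Total within-cluster sum of squares for k = 1..7
wss <- sapply(1:7, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Hartigan's index for k = 1..6: add another cluster while H(k) > 10
hartigan <- (wss[-7] / wss[-1] - 1) * (n - 1:6 - 1)
round(hartigan, 1)

# Gap statistic: compares log(W_k) with its expectation under a reference distribution
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

# Suggested number of clusters under the Tibshirani et al. (2001) rule
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```

The two criteria answer slightly different questions, so disagreement between them, as reported here, is not unusual.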
Create clusters using \(K\)-means.
k3 <- kmeans(pcs, centers = 3, nstart = 25)
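The call above relies on an object `pcs` that this excerpt does not define. One plausible construction, assuming the scores on the first three components were clustered (the cluster summaries report PC1 to PC3), is sketched here with stand-in data. The names `pc_demo`, `pcs_demo` and `k3_demo` are hypothetical.

```r
set.seed(123)  # k-means uses random starts; fix the seed for reproducibility

# Stand-in data (assumption): built-in USArrests, not the study data
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Assumption: cluster on the scores of the first three components
pcs_demo <- pc_demo$x[, 1:3]

k3_demo <- kmeans(pcs_demo, centers = 3, nstart = 25)

# Cluster sizes and mean component scores per cluster
table(k3_demo$cluster)
aggregate(as.data.frame(pcs_demo), by = list(cluster = k3_demo$cluster), FUN = mean)
```

Setting `nstart = 25` reruns the algorithm from 25 random starting configurations and keeps the best solution, which makes the result less sensitive to initialisation.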
Click on plot to rotate it and zoom in and out. Every sphere represents one participant. Colour indicates cluster.
Three-cluster solution:
Means of principal components by cluster.
# A tibble: 3 × 4
pc `Cluster 1` `Cluster 2` `Cluster 3`
<chr> <dbl> <dbl> <dbl>
1 PC1 -0.467 -1.28 4.18
2 PC2 -1.08 2.36 -0.361
3 PC3 0.552 -0.766 -0.641
Simplified: rank the clusters' mean scores on each of the three principal components. Then create a table in which the maximum score is shown as \(+\), the minimum as \(-\), and the middle one is left blank.
 | Cluster 1 | Cluster 2 | Cluster 3 |
---|---|---|---|
PC1 | | \(-\) | \(+\) |
PC2 | \(-\) | \(+\) | |
PC3 | \(+\) | \(-\) | |
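The simplification can be reproduced directly from the cluster means reported earlier. The sketch below copies those means into a matrix and marks, for each PC, the highest and lowest cluster mean. The object names `means` and `signs` are hypothetical.

```r
# Cluster means copied from the table of PC means by cluster
means <- matrix(
  c(-0.467, -1.28,   4.18,
    -1.08,   2.36,  -0.361,
     0.552, -0.766, -0.641),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("PC1", "PC2", "PC3"), paste("Cluster", 1:3))
)

# For each PC (row): "+" for the highest mean, "-" for the lowest, blank otherwise
signs <- t(apply(means, 1, function(m) {
  ifelse(m == max(m), "+", ifelse(m == min(m), "-", ""))
}))
signs
```

On PC1, for example, Cluster 3 has the highest mean (4.18) and Cluster 2 the lowest (-1.28), so Cluster 1 is left blank.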
Of the 30 participants (ppts), the following changed cluster from task 1 to task 2:
[1] "F19-0038" "F20-0013" "F21-0034" "F22-0043" "M19-0006"
Cluster membership for the first and second tasks, showing how many participants switched cluster, and to which cluster.
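A comparison of this kind can be expressed as a cross-tabulation of cluster membership in the two tasks. The sketch below uses made-up memberships for six hypothetical participants (the IDs and values are illustrative only, not the study data) and assumes that cluster labels are comparable across tasks, for example because both tasks were clustered in a single k-means fit.

```r
# Hypothetical cluster memberships for six participants (illustrative values only)
membership <- data.frame(
  ppt   = c("P01", "P02", "P03", "P04", "P05", "P06"),
  task1 = c(1, 1, 2, 2, 3, 3),
  task2 = c(1, 2, 2, 2, 3, 1)
)

# Participants whose cluster differs between tasks
switched <- membership$ppt[membership$task1 != membership$task2]
switched

# Transition table: rows = cluster in task 1, columns = cluster in task 2
transitions <- table(task1 = membership$task1, task2 = membership$task2)
transitions
```

Off-diagonal cells of the transition table count participants who moved between clusters, and the row and column show where they moved from and to.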