This document serves as the preregistration for Phase 1 of Study 1 of the STEM Research Experience Project, which is aimed at determining the nature and associations of different underlying components of STEM research experience.
Study 1 relies on archival data from the AScILS (Assessing Scientific Inquiry and Leadership Skills) project collected from alumni of undergraduate majors in STEM. This Study 1 dataset is hereafter referred to as the “Alumni” data. We begin with this dataset because it is the largest and most variable. The purpose of Phase 1 is to determine the underlying dimensionality of the construct of “research experience” using exploratory factor analysis. Phase 2 of Study 1 will build on the resulting factor structure; its general plans are outlined at the end of this document.
We will prepare a separate preregistration for Phase 2, as the approach will depend on the final factor structure of Phase 1.
There will also be subsequent preregistrations for Study 2 and Study 3, to be prepared upon completion of Study 1. Briefly, these studies will involve the following:
Study 2 will involve confirming the factor model in a sample of high school students and will include mentor ratings of identity. We refer to this as the “COSMOS” project.
Study 3 will involve confirming the factor model in a sample of current undergraduate students and will include both measures of identity and a scientific reasoning assessment. We refer to this as the “PSTEM” project.
The data for this project were previously published as Study 1 in:
Syed, M., Zurbriggen, E., Chemers, M. M., Goza, B. K., Bearman, S., Crosby, F., Shaw, J. M., Hunter, L., & Morgan, E. M. (2019). The role of self-efficacy and identity in mediating the effects of STEM support experiences. Analysis of Social Issues and Public Policy, 19(1), 7-49. https://doi.org/10.1111/asap.12170
Moin Syed has worked with these data extensively, in ways that go beyond what is reported in the above paper. This work included looking into the dimensionality of the research experience measure. Most of this work was conducted while he was a graduate student in 2005-2008 (the current year is 2023), so his memory of what he did or found is not great, although the hints of possible dimensionality are what motivated the current project. This might seem like sufficient background knowledge to “invalidate” a preregistration, but we proceed regardless because we want to be as transparent as possible, and what we provide here is our analytic strategy as of 2023.
Load the packages we will need: {haven}, {labelled}, {dplyr}, {summarytools}, {psych}, {MASS}, {EFAtools}.
The data file is “SciEngSurvey Alumni Retro MASTER.sav”
These are the 19 variables assessing research experience, each rated on a 1-5 scale:
labelled::look_for((ascils_dat %>% dplyr::select(outcls1:outcls19)), details = FALSE)
## pos variable label
## 1 outcls1 Participated in Research / Eng Projects
## 2 outcls2 Worked in Sci / Eng
## 3 outcls3 Member of Research / Eng Team
## 4 outcls4 Played Leadership Role
## 5 outcls5 Generated Research Question / Eng Problem
## 6 outcls6 Identified Questions
## 7 outcls7 Collected Data / Identified Constraints
## 8 outcls8 Interpreted Data / Found Solutions
## 9 outcls9 Explained Results / Evaluated Solution Fit
## 10 outcls10 Used Literature
## 11 outcls11 Related Results to Work of Others
## 12 outcls12 Gave Presentation to Students
## 13 outcls13 Gave Professional Presentation
## 14 outcls14 Wrote Article
## 15 outcls15 Planned Research / Projects
## 16 outcls16 Attended Lectures
## 17 outcls17 Attended Conferences
## 18 outcls18 Learned Technical Skills
## 19 outcls19 Learned Terminology
We will randomly select 60% of cases (302) and conduct an exploratory factor analysis (EFA) to determine the optimal factor structure.
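The split described above could be drawn along the following lines. This is a minimal sketch rather than the registered code: the seed is a hypothetical value chosen only for reproducibility, and the row counts (502 total, 302 for the EFA) are taken from the figures reported in this document.

```r
# Row counts taken from this document: 502 cases in total, 302 for the EFA.
n_total <- 502
n_efa   <- 302

set.seed(2023)  # hypothetical seed, not part of the registered plan

# Draw the EFA rows at random; the remaining rows are held out for Phase 2.
efa_rows <- sort(sample(seq_len(n_total), size = n_efa))
cfa_rows <- setdiff(seq_len(n_total), efa_rows)

# With the real data these indices would subset the rows, e.g.
# efa_dat <- ascils_dat[efa_rows, ]
# cfa_dat <- ascils_dat[cfa_rows, ]
length(efa_rows)
length(cfa_rows)
```

Storing the row indices, not only the subsetted data, makes it easy to confirm later that the EFA and holdout samples do not overlap.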
The first step will be to estimate the initial communalities for the
item set, using the squared multiple correlations (SMC) from the
smc() function in the {psych} package as
lower-bound estimates (Guttman,
1956). To be maximally inclusive, we set the minimum communality to .15,
meaning any item with a communality below that level will be excluded
from the item set and the communalities will be re-estimated. This
process will continue until all items exceed .15.
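The pruning rule above can be sketched as a loop. To keep the example self-contained it uses base R only: `smc_base()` and `prune_by_smc()` are hypothetical helper names, and the SMC is computed as 1 - 1 / diag(R^-1), which should agree with psych::smc() for complete data. The toy data (six items sharing one factor plus one pure-noise item) are simulated for illustration only.

```r
# Squared multiple correlation of each item with all other items,
# computed from the inverse of the correlation matrix R.
smc_base <- function(R) 1 - 1 / diag(solve(R))

# Drop items below the cutoff, re-estimate, and repeat until every
# remaining item reaches the cutoff.
prune_by_smc <- function(dat, cutoff = 0.15) {
  repeat {
    h2  <- smc_base(cor(dat, use = "pairwise.complete.obs"))
    low <- h2 < cutoff
    if (!any(low)) break
    if (sum(!low) < 2) stop("Fewer than two items would remain.")
    dat <- dat[, !low, drop = FALSE]
  }
  dat
}

# Toy data: six items driven by one common factor, plus one noise item.
set.seed(1)
f     <- rnorm(300)
items <- sapply(1:6, function(j) 0.7 * f + rnorm(300, sd = 0.7))
toy   <- cbind(items, rnorm(300))
colnames(toy) <- c(paste0("item", 1:6), "noise")

kept <- prune_by_smc(toy)
colnames(kept)                 # items that survived the pruning
round(smc_base(cor(kept)), 2)  # their re-estimated communalities
```

Because the communalities are re-estimated after every removal, an item that passed in one round can fall below the cutoff in the next, which is why the rule is written as a loop and not as a single filter.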
Next, we will test the assumptions that the data are suitable for
factor analysis by conducting Bartlett's test of sphericity (Bartlett, 1954) and the
Kaiser-Meyer-Olkin test of sampling adequacy (KMO; Kaiser, 1970).
In order to proceed, we require a significant \(\chi^2\) test from Bartlett's test and
a KMO index of at least 0.5 (Netemeyer
et al., 2003). The functions used will be
EFAtools::BARTLETT() and EFAtools::KMO().
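For readers who want to see what the two checks compute, the following base-R sketch implements the textbook formulas: Bartlett's statistic is -(n - 1 - (2p + 5)/6) ln|R| on p(p - 1)/2 degrees of freedom, and the overall KMO is the ratio of summed squared correlations to summed squared correlations plus summed squared partial correlations. The helper names and toy data are hypothetical, and the alpha level of .05 is an assumption; the registered analysis will use the {EFAtools} functions named above.

```r
# Bartlett's test of sphericity from a correlation matrix R and sample size n.
bartlett_sphericity <- function(R, n) {
  p     <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p_value = pchisq(chisq, df, lower.tail = FALSE))
}

# Overall Kaiser-Meyer-Olkin index from a correlation matrix R.
kmo_overall <- function(R) {
  Rinv <- solve(R)
  # Partial correlations between each pair of items, given all other items.
  P   <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))
  off <- row(R) != col(R)  # off-diagonal cells only
  sum(R[off]^2) / (sum(R[off]^2) + sum(P[off]^2))
}

# Toy data: six items sharing one common factor (illustration only).
set.seed(1)
n   <- 300
f   <- rnorm(n)
toy <- sapply(1:6, function(j) 0.7 * f + rnorm(n, sd = 0.7))
R   <- cor(toy)

bart <- bartlett_sphericity(R, n)
kmo  <- kmo_overall(R)

# Decision rule from the text: significant Bartlett's test and KMO >= 0.5.
proceed <- bart$p_value < 0.05 && kmo >= 0.5
proceed
```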
Next, we will conduct parallel analysis using
EFAtools::PARALLEL() to determine the optimal number of
factors to extract. We will use the default number of 1,000 datasets to
simulate and the average simulated eigenvalues for the decision
rule.
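The logic of the decision rule can be illustrated with a simplified, base-R version of parallel analysis. Note the swap: this sketch compares eigenvalues of the unreduced correlation matrix (the classic PCA-type variant), whereas the registered analysis uses EFA-based eigenvalues from EFAtools::PARALLEL(), so the two need not agree on the same data. `parallel_sketch()` and the toy data are hypothetical.

```r
# Simplified parallel analysis: retain factors as long as the observed
# eigenvalue exceeds the mean eigenvalue from random normal data of the
# same dimensions.
parallel_sketch <- function(dat, n_datasets = 1000) {
  n <- nrow(dat)
  p <- ncol(dat)
  obs <- eigen(cor(dat), symmetric = TRUE, only.values = TRUE)$values
  sim <- replicate(n_datasets, {
    z <- matrix(rnorm(n * p), nrow = n)
    eigen(cor(z), symmetric = TRUE, only.values = TRUE)$values
  })
  ref   <- rowMeans(sim)  # mean simulated eigenvalue per position
  above <- obs > ref
  # Count leading eigenvalues above the reference, stopping at the first miss.
  n_fac <- if (all(above)) p else which(!above)[1] - 1
  list(observed = obs, reference = ref, n_factors = n_fac)
}

# Toy data: two uncorrelated factors with four items each (illustration only).
set.seed(42)
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- cbind(
  sapply(1:4, function(j) 0.7 * f1 + rnorm(n, sd = 0.7)),
  sapply(1:4, function(j) 0.7 * f2 + rnorm(n, sd = 0.7))
)

# 200 simulated datasets keep the example fast; the plan specifies 1,000.
res <- parallel_sketch(toy, n_datasets = 200)
res$n_factors
```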
Once the number of factors is determined, EFA will be conducted using
the psych::fa() function. We will use maximum likelihood
estimation with an oblimin rotation and examine the pattern of factor
loadings. Any standardized loading greater than .15 will be interpreted,
and double-loadings will be retained. If any item fails to load at least
.15 on any factor, then that item will be removed and we will re-run the
parallel analysis and re-rotate the solution. That process will continue
until all items load at least .15 on one of the factors. The final
factor solution will be tabled and prepared for interpretation. We will
use a modified version of thematic analysis to create labels for the
resultant latent factors. At that point, Phase 1 will be complete.
We start with simulated data for 200 respondents across 10 items, drawn from a multivariate normal distribution, and compute the initial communalities:
# simulated multivariate normal data
set.seed(202304)
n <- 200 # 200 respondents
mu <- rep(x = 0, times = 10) # 10 items
sigma <- matrix(rnorm(n = 100,
                      mean = 0.2,
                      sd = 0.1),
                ncol = 10)
diag(sigma) <- 1
sim <- MASS::mvrnorm(n = n,
                     mu = mu,
                     Sigma = sigma)
# calculate communalities
psych::smc(sim)
## V1 V2 V3 V4 V5 V6 V7
## 0.30207770 0.28691894 0.03510222 0.30570632 0.21624682 0.33742889 0.16764447
## V8 V9 V10
## 0.32696904 0.26062969 0.24574557
According to these initial communalities, we will drop item 3, whose communality (0.035) falls below .15, and re-compute the communalities across the remaining items.
# remove item 3
sim_common <- sim[, -3]
# re-calculate communalities
psych::smc(sim_common)
## V1 V2 V3 V4 V5 V6 V7 V8
## 0.3006148 0.2863965 0.3045363 0.2145922 0.3280205 0.1676430 0.3215448 0.2538971
## V9
## 0.2456853
We now see that all items exceed the minimum threshold of 0.15, and thus we can move on to the next stage.
# Bartlett's test of sphericity
EFAtools::BARTLETT(x = sim_common)
## ℹ 'x' was not a correlation matrix. Correlations are found from entered raw data.
##
## ✔ The Bartlett's test of sphericity was significant at an alpha level of .05.
## These data are probably suitable for factor analysis.
##
## 𝜒²(36) = 291.08, p < .001
# KMO test of sampling adequacy
EFAtools::KMO(x = sim_common)
## ℹ 'x' was not a correlation matrix. Correlations are found from entered raw data.
##
## ── Kaiser-Meyer-Olkin criterion (KMO) ──────────────────────────────────────────
##
## ✖ The overall KMO value for your data is miserable.
## These data are hardly suitable for factor analysis.
##
## Overall: 0.575
##
## For each variable:
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 0.691 0.601 0.549 0.626 0.520 0.583 0.430 0.549 0.675
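Both checks are passed under our criteria: Bartlett's test is significant, and the overall KMO of 0.575 exceeds our minimum of 0.5, even though {EFAtools} labels values in this range “miserable.” We can therefore move on to parallel analysis to determine the number of factors to extract.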
EFAtools::PARALLEL(x = sim_common,
                   n_datasets = 1000,
                   eigen_type = "EFA")
## ℹ 'x' was not a correlation matrix. Correlations are found from entered raw data.
## Parallel Analysis performed using 1000 simulated random data sets
## Eigenvalues were found using EFA
##
## Decision rule used: means
##
## ── Number of factors to retain according to ────────────────────────────────────
##
## ◌ EFA-determined eigenvalues: 5
Lastly, we can now conduct the EFA:
psych::fa(r = cor(sim_common),
          nfactors = 5,
          rotate = "oblimin",
          fm = "ml")
## Loading required namespace: GPArotation
## Factor Analysis using method = ml
## Call: psych::fa(r = cor(sim_common), nfactors = 5, rotate = "oblimin",
## fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 ML3 ML4 ML5 h2 u2 com
## 1 0.15 -0.12 0.49 0.37 -0.01 0.43 0.571 2.2
## 2 1.00 0.03 -0.03 -0.02 0.02 1.00 0.005 1.0
## 3 -0.02 0.15 -0.10 0.71 0.03 0.59 0.409 1.1
## 4 -0.05 0.17 -0.09 0.17 0.54 0.37 0.625 1.5
## 5 0.03 0.97 0.05 0.03 0.01 1.00 0.005 1.0
## 6 0.07 -0.22 0.16 0.29 0.27 0.22 0.776 3.6
## 7 -0.07 0.12 0.81 -0.09 0.04 0.72 0.283 1.1
## 8 0.16 -0.07 0.15 -0.11 0.58 0.46 0.539 1.4
## 9 0.29 0.05 0.20 0.31 -0.18 0.29 0.706 3.4
##
## ML1 ML2 ML3 ML4 ML5
## SS loadings 1.20 1.10 1.06 0.94 0.78
## Proportion Var 0.13 0.12 0.12 0.10 0.09
## Cumulative Var 0.13 0.26 0.37 0.48 0.56
## Proportion Explained 0.24 0.22 0.21 0.19 0.15
## Cumulative Proportion 0.24 0.45 0.66 0.85 1.00
##
## With factor correlations of
## ML1 ML2 ML3 ML4 ML5
## ML1 1.00 0.13 0.17 0.27 0.27
## ML2 0.13 1.00 0.22 0.33 0.12
## ML3 0.17 0.22 1.00 0.05 0.23
## ML4 0.27 0.33 0.05 1.00 0.12
## ML5 0.27 0.12 0.23 0.12 1.00
##
## Mean item complexity = 1.8
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 36 and the objective function was 1.49
## The degrees of freedom for the model are 1 and the objective function was 0.03
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.11
##
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## ML1 ML2 ML3 ML4 ML5
## Correlation of (regression) scores with factors 1.00 1.00 0.87 0.82 0.76
## Multiple R square of scores with factors 0.99 0.99 0.75 0.67 0.58
## Minimum correlation of possible factor scores 0.99 0.99 0.51 0.34 0.16
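The .15 loading rule can then be applied to the pattern matrix. The sketch below re-enters the rounded loadings printed above by hand; with a fitted object they would come from the model's loadings instead. Treating the rule as "absolute loading of at least .15" is an assumption, since the plan does not state how negative loadings are handled.

```r
# Rounded pattern loadings transcribed from the printed output above.
L <- matrix(c(
   0.15, -0.12,  0.49,  0.37, -0.01,
   1.00,  0.03, -0.03, -0.02,  0.02,
  -0.02,  0.15, -0.10,  0.71,  0.03,
  -0.05,  0.17, -0.09,  0.17,  0.54,
   0.03,  0.97,  0.05,  0.03,  0.01,
   0.07, -0.22,  0.16,  0.29,  0.27,
  -0.07,  0.12,  0.81, -0.09,  0.04,
   0.16, -0.07,  0.15, -0.11,  0.58,
   0.29,  0.05,  0.20,  0.31, -0.18
), nrow = 9, byrow = TRUE,
dimnames = list(paste0("item", 1:9), paste0("ML", 1:5)))

cutoff  <- 0.15
salient <- abs(L) >= cutoff  # assumption: absolute value, "at least .15"

# Items with no salient loading on any factor would be removed,
# after which the parallel analysis and EFA would be re-run.
to_drop <- rownames(L)[rowSums(salient) == 0]

# Items with salient loadings on more than one factor are retained as
# double-loadings.
n_salient <- rowSums(salient)

to_drop
n_salient
```

In this simulated example no item falls below the cutoff, so no re-run would be triggered.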
Phase 2 is not included in the current preregistration, but we outline the general plans below for completeness.
In Phase 2a, we will confirm the finalized structure with CFA on the remaining 200 cases. Any multi-dimensional solution should be compared against a one-factor model and any other plausible models (this depends on the outcome of the EFA).
Starting with the finalized factor structure, we will assess measurement invariance by gender (men/women), race/ethnicity (URM/Asian/White), and major (science/engineering).
summarytools::freq(ascils_dat$gennumpp)
## Frequencies
## ascils_dat$gennumpp
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 249 57.51 57.51 49.60 49.60
## 2 184 42.49 100.00 36.65 86.25
## <NA> 69 13.75 100.00
## Total 502 100.00 100.00 100.00 100.00
summarytools::freq(ascils_dat$ethnumpp)
## Frequencies
## ascils_dat$ethnumpp
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 31 6.21 6.21 6.18 6.18
## 2 71 14.23 20.44 14.14 20.32
## 3 70 14.03 34.47 13.94 34.26
## 4 155 31.06 65.53 30.88 65.14
## 5 33 6.61 72.14 6.57 71.71
## 6 25 5.01 77.15 4.98 76.69
## 7 5 1.00 78.16 1.00 77.69
## 8 2 0.40 78.56 0.40 78.09
## 9 107 21.44 100.00 21.31 99.40
## <NA> 3 0.60 100.00
## Total 502 100.00 100.00 100.00 100.00
summarytools::freq(ascils_dat$ethncity)
## Frequencies
## ascils_dat$ethncity
## Label: Ethnicity
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 0 19 3.78 3.78 3.78 3.78
## 1 170 33.86 37.65 33.86 37.65
## 2 199 39.64 77.29 39.64 77.29
## 3 114 22.71 100.00 22.71 100.00
## <NA> 0 0.00 100.00
## Total 502 100.00 100.00 100.00 100.00
summarytools::freq(ascils_dat$type)
## Frequencies
## ascils_dat$type
## Label: Type - Scientist / Engineer
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 392 78.09 78.09 78.09 78.09
## 2 110 21.91 100.00 21.91 100.00
## <NA> 0 0.00 100.00
## Total 502 100.00 100.00 100.00 100.00
Here we will compute simple bivariate correlations with identity, with significance tests for differences in correlation strength. We will then test for differences in correlation strength by gender, race/ethnicity, and major. The identity items are as follows:
tibble(labelled::look_for((ascils_dat %>% dplyr::select(ident1,
                                                        ident3,
                                                        ident5,
                                                        ident7,
                                                        ident9,
                                                        ident10)), details = FALSE))
## # A tibble: 6 × 3
## pos variable label
## <int> <chr> <chr>
## 1 1 ident1 Sci / Eng Part of Self-Image
## 2 2 ident3 Belong to Sci / Eng Community
## 3 3 ident5 Sci / Eng Reflection of Self
## 4 4 ident7 Think of Self as Sci / Eng
## 5 5 ident9 Belong in Field of Sci / Eng
## 6 6 ident10 I am a Sci / Eng
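One common way to test the between-group differences in correlation strength mentioned above is Fisher's r-to-z comparison for independent samples. The plan does not name a specific test, so this is an assumption. The helper name and the two correlations are hypothetical, while the group sizes (249 and 184) are taken from the gennumpp frequencies reported earlier. Comparing correlations within the same sample would require a test for dependent correlations, which is not shown here.

```r
# Fisher r-to-z test for the difference between two correlations that
# come from independent groups.
compare_independent_r <- function(r1, n1, r2, n2) {
  z <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  c(z = z, p_value = 2 * pnorm(abs(z), lower.tail = FALSE))
}

# Hypothetical correlations; group sizes from the gennumpp table above.
res <- compare_independent_r(r1 = 0.40, n1 = 249, r2 = 0.25, n2 = 184)
res
```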