This document serves as the preregistration for Phase 1 of Study 1 of the STEM Research Experience Project, which is aimed at determining the nature and associations of different underlying components of STEM research experience.

Study 1 relies on archival data from the AScILS (Assessing Scientific Inquiry and Leadership Skills) project, collected from alumni of undergraduate STEM majors. This Study 1 dataset is hereafter referred to as the “Alumni” data. We begin with this dataset because it is the largest and most variable. The purpose of Phase 1 is to determine the underlying dimensionality of the construct of “research experience” using exploratory factor analysis. Phase 2 of Study 1 will confirm the resulting structure with confirmatory factor analysis (Phase 2a), assess measurement invariance (Phase 2b), and compute correlations between the dimensions of research experience and identity as a scientist (Phase 2c).

We will prepare a separate preregistration for Phase 2, as the approach will depend on the final factor structure of Phase 1.

There will also be subsequent preregistrations for Study 2 and Study 3, to be prepared upon completion of Study 1. Briefly, these studies will involve the following:

Study 2 will involve confirming the factor model in a sample of high school students and will include mentor ratings of identity. We refer to this as the “COSMOS” project.

Study 3 will involve confirming the factor model in a sample of current undergraduate students and will include both measures of identity and a scientific reasoning assessment. We refer to this as the “PSTEM” project.

Study 1 Data: The Alumni Study

The data for this project were previously published as Study 1 in:

Syed, M., Zurbriggen, E., Chemers, M. M., Goza, B. K., Bearman, S., Crosby, F., Shaw, J. M., Hunter, L., & Morgan, E. M. (2019). The role of self-efficacy and identity in mediating the effects of STEM support experiences. Analyses of Social Issues and Public Policy, 19(1), 7-49. https://doi.org/10.1111/asap.12170

Prior Knowledge of the Data

Moin Syed has worked with these data extensively, in ways that go beyond what is reported in the above paper. This work included looking into the dimensionality of the research experience measure. Most of this work was conducted while he was a graduate student in 2005-2008 (the current year is 2023), so his memory of what he did or found is limited, although the suggestions of possible dimensionality are what motivated the current project. This might seem like sufficient background knowledge to “invalidate” a preregistration, but we proceed regardless because we want to be as transparent as possible, and what we provide here is our analytic strategy as of 2023.

Load the packages we will need: {haven}, {labelled}, {dplyr}, {summarytools}, {psych}, {MASS}, {EFAtools}.

Phase 1: Determine the Optimal Structure Via Exploratory Factor Analysis

The data file is “SciEngSurvey Alumni Retro MASTER.sav”

These are the 19 variables assessing research experience, each rated on a 1-5 scale:

labelled::look_for((ascils_dat %>% dplyr::select(outcls1:outcls19)), details = FALSE)

##  pos variable label                                     
##   1  outcls1  Participated in Research / Eng Projects   
##   2  outcls2  Worked in Sci / Eng                       
##   3  outcls3  Member of Research / Eng Team             
##   4  outcls4  Played Leadership Role                    
##   5  outcls5  Generated Research Question / Eng Problem 
##   6  outcls6  Identified Questions                      
##   7  outcls7  Collected Data / Identified Constraints   
##   8  outcls8  Interpreted Data / Found Solutions        
##   9  outcls9  Explained Results / Evaluated Solution Fit
##  10  outcls10 Used Literature                           
##  11  outcls11 Related Results to Work of Others         
##  12  outcls12 Gave Presentation to Students             
##  13  outcls13 Gave Professional Presentation            
##  14  outcls14 Wrote Article                             
##  15  outcls15 Planned Research / Projects               
##  16  outcls16 Attended Lectures                         
##  17  outcls17 Attended Conferences                      
##  18  outcls18 Learned Technical Skills                  
##  19  outcls19 Learned Terminology

We will randomly select approximately 60% of the 502 cases (302) and conduct an exploratory factor analysis (EFA) on that subsample to determine the optimal factor structure.
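A minimal base-R sketch of this split is given below. The seed, the placeholder data frame `dat` (standing in for the Alumni data), and the object names `efa_dat` and `cfa_dat` are illustrative assumptions, not part of the registered plan.

```r
# Sketch of the random split; `dat` is a placeholder for the Alumni data.
set.seed(20230414)                  # hypothetical seed, for reproducibility only
n_total <- 502                      # total cases in the Alumni data
n_efa   <- 302                      # cases allocated to the Phase 1 EFA

dat <- data.frame(id = seq_len(n_total))            # placeholder rows

efa_rows <- sample(seq_len(n_total), size = n_efa)  # draw without replacement
efa_dat  <- dat[efa_rows, , drop = FALSE]           # exploratory subsample
cfa_dat  <- dat[-efa_rows, , drop = FALSE]          # held out for Phase 2a

c(efa = nrow(efa_dat), cfa = nrow(cfa_dat))
```

Fixing the seed before sampling keeps the two subsamples identical across reruns, so the held-out cases stay untouched until Phase 2.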

The first step will be to estimate the initial communalities for the set of items, using the squared multiple correlations (SMC) from the smc() function in the {psych} package as lower-bound estimates (Guttman, 1956). To be maximally inclusive, we set the minimum communality to .15, meaning any items with communalities below that level will be excluded from the item set and the communalities will be re-estimated. This process will continue until all remaining items exceed .15.
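The iterative exclusion rule can be sketched as a small loop. The version below is an illustration only: `smc_base()` is a base-R stand-in for psych::smc() that assumes complete data and an invertible correlation matrix, and the function names and toy data are hypothetical.

```r
# Base-R stand-in for psych::smc(): SMC_i = 1 - 1 / [R^-1]_ii
smc_base <- function(x) {
  r_inv <- solve(cor(x))
  1 - 1 / diag(r_inv)
}

# Drop every item below the cutoff, re-estimate, and repeat.
# Stops when no item is below the cutoff, or when dropping would
# leave fewer than three items.
prune_by_smc <- function(x, cutoff = 0.15) {
  repeat {
    h2  <- smc_base(x)
    low <- h2 < cutoff
    if (!any(low) || sum(!low) < 3) break
    x <- x[, !low, drop = FALSE]
  }
  list(items = colnames(x), smc = smc_base(x))
}

# Toy data (hypothetical): five items share one factor, x6 is pure noise
set.seed(1)
f   <- rnorm(300)
toy <- sapply(1:5, function(j) 0.7 * f + rnorm(300, sd = sqrt(1 - 0.7^2)))
toy <- cbind(toy, rnorm(300))
colnames(toy) <- paste0("x", 1:6)

res <- prune_by_smc(toy)
res$items          # items retained after pruning
round(res$smc, 2)  # their re-estimated SMCs
```

In the actual analysis the call to `smc_base()` would be replaced by psych::smc() on the 19 research experience items.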

Next, we will test whether the data are suitable for factor analysis by conducting Bartlett's test of sphericity (Bartlett, 1954) and the Kaiser-Meyer-Olkin test of sampling adequacy (KMO; Kaiser, 1970). In order to proceed, we require a significant \(\chi^2\) test from Bartlett’s test and an overall KMO index of at least 0.5 (Netemeyer et al., 2003). The functions used will be EFAtools::BARTLETT() and EFAtools::KMO().
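For readers who want to see what these two checks compute, the following base-R sketch implements the standard textbook formulas. It is not the EFAtools implementation; the function names and toy data are hypothetical, and complete data are assumed.

```r
# Bartlett's test of sphericity: chi-square approximation based on det(R)
bartlett_sphericity <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(cor(x)))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df, p = pchisq(chisq, df, lower.tail = FALSE))
}

# Overall KMO: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal elements only)
kmo_overall <- function(x) {
  r     <- cor(x)
  r_inv <- solve(r)
  q     <- -r_inv / sqrt(outer(diag(r_inv), diag(r_inv)))  # partial correlations
  off   <- row(r) != col(r)
  sum(r[off]^2) / (sum(r[off]^2) + sum(q[off]^2))
}

# Toy data (hypothetical): five items sharing one common factor
set.seed(1)
f   <- rnorm(300)
toy <- sapply(1:5, function(j) 0.7 * f + rnorm(300, sd = sqrt(1 - 0.7^2)))

bt  <- bartlett_sphericity(toy)
kmo <- kmo_overall(toy)

# Decision rule from the text: significant Bartlett test and KMO >= 0.5
proceed <- bt$p < .05 && kmo >= 0.5
proceed
```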

Next, we will conduct parallel analysis using EFAtools::PARALLEL() to determine the optimal number of factors to extract. We will use the default number of 1,000 datasets to simulate and the average simulated eigenvalues for the decision rule.
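The logic of the decision rule can be illustrated with a simplified base-R sketch. Note the assumptions: it uses eigenvalues of the unreduced correlation matrix rather than EFA-based eigenvalues, and `parallel_sketch()` and the toy data are hypothetical, so its results need not match EFAtools::PARALLEL().

```r
# Simplified parallel analysis: retain factors while the observed eigenvalue
# exceeds the mean eigenvalue from random normal data of the same dimensions
parallel_sketch <- function(x, n_datasets = 1000) {
  n <- nrow(x)
  p <- ncol(x)
  obs <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values
  sim <- replicate(n_datasets, {
    z <- matrix(rnorm(n * p), nrow = n)
    eigen(cor(z), symmetric = TRUE, only.values = TRUE)$values
  })
  sim_mean <- rowMeans(sim)                  # mean simulated eigenvalue per position
  n_retain <- sum(cumprod(obs > sim_mean))   # stop at the first failure
  list(observed = obs, simulated_mean = sim_mean, n_factors = n_retain)
}

# Toy data (hypothetical): two uncorrelated factors with three items each
set.seed(2)
n   <- 400
f1  <- rnorm(n)
f2  <- rnorm(n)
toy <- cbind(sapply(1:3, function(j) 0.8 * f1 + rnorm(n, sd = 0.6)),
             sapply(1:3, function(j) 0.8 * f2 + rnorm(n, sd = 0.6)))

pa <- parallel_sketch(toy)
pa$n_factors
```

Using the mean of the simulated eigenvalues corresponds to the "means" decision rule described above; a percentile-based rule would be more conservative.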

Once the number of factors is determined, EFA will be conducted using the psych::fa() function. We will use maximum likelihood estimation with an oblimin rotation and examine the pattern of factor loadings. Any standardized loading of at least .15 will be interpreted, and double-loadings will be retained. If any item fails to load at least .15 on any factor, then that item will be removed, and we will re-run the parallel analysis and re-rotate the solution. That process will continue until all items load at least .15 on one of the factors. The final factor solution will be tabled and prepared for interpretation. We will use a modified version of thematic analysis to create labels for the resultant latent factors. At that point, Phase 1 will be complete.

Sample R code

We start with simulated data for 200 respondents across 10 items, drawn from a multivariate normal distribution, and compute the initial communalities:

# simulated multivariate normal data
set.seed(202304)
n     <- 200                    # 200 respondents                                       
mu    <- rep(x = 0, times = 10) # 10 items
sigma <- matrix(rnorm(n    = 100,
                      mean = 0.2,
                      sd   = 0.1),
                ncol = 10)
diag(sigma) <- 1
sim <- MASS::mvrnorm(n = n,
                     mu = mu,
                     Sigma = sigma)

# calculate communalities
psych::smc(sim)

##         V1         V2         V3         V4         V5         V6         V7 
## 0.30207770 0.28691894 0.03510222 0.30570632 0.21624682 0.33742889 0.16764447 
##         V8         V9        V10 
## 0.32696904 0.26062969 0.24574557

According to these initial communalities, we will drop item 3 (SMC = .04, below the .15 minimum) and re-compute communalities across the remaining items.

# remove item 3
sim_common <- sim[, -3]

# re-calculate communalities
psych::smc(sim_common)

##        V1        V2        V3        V4        V5        V6        V7        V8 
## 0.3006148 0.2863965 0.3045363 0.2145922 0.3280205 0.1676430 0.3215448 0.2538971 
##        V9 
## 0.2456853

We now see that all items exceed the minimum of 0.15, and thus we can move on to the next stage.

# Bartlett's test of sphericity
EFAtools::BARTLETT(x = sim_common)

## ℹ 'x' was not a correlation matrix. Correlations are found from entered raw data.

## 
## ✔ The Bartlett's test of sphericity was significant at an alpha level of .05.
##   These data are probably suitable for factor analysis.
## 
##   𝜒²(36) = 291.08, p < .001

# KMO test of sampling adequacy
EFAtools::KMO(x = sim_common)

## ℹ 'x' was not a correlation matrix. Correlations are found from entered raw data.

## 
## ── Kaiser-Meyer-Olkin criterion (KMO) ──────────────────────────────────────────
## 
## ✖ The overall KMO value for your data is miserable.
##   These data are hardly suitable for factor analysis.
## 
##   Overall: 0.575
## 
##   For each variable:
##    V1    V2    V3    V4    V5    V6    V7    V8    V9 
## 0.691 0.601 0.549 0.626 0.520 0.583 0.430 0.549 0.675

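Bartlett's test is significant and the overall KMO of 0.575 exceeds our prespecified minimum of 0.5 (despite the “miserable” label in the output), so both assumption checks pass and we can move on to parallel analysis to determine the number of factors to extract.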

EFAtools::PARALLEL(x = sim_common,
                   n_datasets = 1000,
                   eigen_type = "EFA")

## ℹ 'x' was not a correlation matrix. Correlations are found from entered raw data.

## Parallel Analysis performed using 1000 simulated random data sets
## Eigenvalues were found using EFA
## 
## Decision rule used: means
## 
## ── Number of factors to retain according to ────────────────────────────────────
## 
## ◌ EFA-determined eigenvalues:  5

Lastly, we conduct the EFA with the five factors suggested by the parallel analysis:

psych::fa(r = cor(sim_common),
          nfactors = 5, 
          rotate = "oblimin", 
          fm = "ml")

## Loading required namespace: GPArotation

## Factor Analysis using method =  ml
## Call: psych::fa(r = cor(sim_common), nfactors = 5, rotate = "oblimin", 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##     ML1   ML2   ML3   ML4   ML5   h2    u2 com
## 1  0.15 -0.12  0.49  0.37 -0.01 0.43 0.571 2.2
## 2  1.00  0.03 -0.03 -0.02  0.02 1.00 0.005 1.0
## 3 -0.02  0.15 -0.10  0.71  0.03 0.59 0.409 1.1
## 4 -0.05  0.17 -0.09  0.17  0.54 0.37 0.625 1.5
## 5  0.03  0.97  0.05  0.03  0.01 1.00 0.005 1.0
## 6  0.07 -0.22  0.16  0.29  0.27 0.22 0.776 3.6
## 7 -0.07  0.12  0.81 -0.09  0.04 0.72 0.283 1.1
## 8  0.16 -0.07  0.15 -0.11  0.58 0.46 0.539 1.4
## 9  0.29  0.05  0.20  0.31 -0.18 0.29 0.706 3.4
## 
##                        ML1  ML2  ML3  ML4  ML5
## SS loadings           1.20 1.10 1.06 0.94 0.78
## Proportion Var        0.13 0.12 0.12 0.10 0.09
## Cumulative Var        0.13 0.26 0.37 0.48 0.56
## Proportion Explained  0.24 0.22 0.21 0.19 0.15
## Cumulative Proportion 0.24 0.45 0.66 0.85 1.00
## 
##  With factor correlations of 
##      ML1  ML2  ML3  ML4  ML5
## ML1 1.00 0.13 0.17 0.27 0.27
## ML2 0.13 1.00 0.22 0.33 0.12
## ML3 0.17 0.22 1.00 0.05 0.23
## ML4 0.27 0.33 0.05 1.00 0.12
## ML5 0.27 0.12 0.23 0.12 1.00
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 5 factors are sufficient.
## 
## The degrees of freedom for the null model are  36  and the objective function was  1.49
## The degrees of freedom for the model are 1  and the objective function was  0.03 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.11 
## 
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3  ML4  ML5
## Correlation of (regression) scores with factors   1.00 1.00 0.87 0.82 0.76
## Multiple R square of scores with factors          0.99 0.99 0.75 0.67 0.58
## Minimum correlation of possible factor scores     0.99 0.99 0.51 0.34 0.16
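As an illustration of the item-retention rule, the sketch below applies the .15 cutoff to the pattern matrix printed above. The loadings are typed in from the rounded output, so a printed 0.15 may sit slightly below the cutoff in the unrounded solution. Applying the cutoff to absolute loadings is our assumption here, and the object names are hypothetical.

```r
# Pattern matrix copied from the rounded sample output (9 items x 5 factors)
lambda <- matrix(c(
   0.15, -0.12,  0.49,  0.37, -0.01,
   1.00,  0.03, -0.03, -0.02,  0.02,
  -0.02,  0.15, -0.10,  0.71,  0.03,
  -0.05,  0.17, -0.09,  0.17,  0.54,
   0.03,  0.97,  0.05,  0.03,  0.01,
   0.07, -0.22,  0.16,  0.29,  0.27,
  -0.07,  0.12,  0.81, -0.09,  0.04,
   0.16, -0.07,  0.15, -0.11,  0.58,
   0.29,  0.05,  0.20,  0.31, -0.18),
  ncol = 5, byrow = TRUE,
  dimnames = list(paste0("item", 1:9), paste0("ML", 1:5)))

cutoff  <- 0.15
salient <- abs(lambda) >= cutoff            # loadings that would be interpreted

n_salient  <- rowSums(salient)              # number of salient loadings per item
weak_items <- rownames(lambda)[n_salient == 0]  # items failing the rule

weak_items   # items to remove before re-running the parallel analysis
n_salient    # values above 1 indicate double-loadings, which are retained
```

With a fitted psych::fa() object, the same check could be run on the unrounded loadings (for example via `unclass(fit$loadings)`, where `fit` is a hypothetical object name).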

Phase 2a: Confirm the Structure Via Confirmatory Factor Analysis

Phase 2 is not included in the current preregistration, but we outline the general plans below for completeness.

In Phase 2a, we will confirm the finalized structure with CFA on the remaining 200 cases. Any multidimensional solution will be compared against a one-factor model and any other plausible models (which depend on the outcome of the EFA).

Phase 2b: Assess Measurement Invariance

Starting with the finalized factor structure, we will assess measurement invariance by gender (men/women), race/ethnicity (URM/Asian/White), and major (science/engineering).

summarytools::freq(ascils_dat$gennumpp)

## Frequencies  
## ascils_dat$gennumpp  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    249     57.51          57.51     49.60          49.60
##           2    184     42.49         100.00     36.65          86.25
##        <NA>     69                              13.75         100.00
##       Total    502    100.00         100.00    100.00         100.00

summarytools::freq(ascils_dat$ethnumpp)

## Frequencies  
## ascils_dat$ethnumpp  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1     31      6.21           6.21      6.18           6.18
##           2     71     14.23          20.44     14.14          20.32
##           3     70     14.03          34.47     13.94          34.26
##           4    155     31.06          65.53     30.88          65.14
##           5     33      6.61          72.14      6.57          71.71
##           6     25      5.01          77.15      4.98          76.69
##           7      5      1.00          78.16      1.00          77.69
##           8      2      0.40          78.56      0.40          78.09
##           9    107     21.44         100.00     21.31          99.40
##        <NA>      3                               0.60         100.00
##       Total    502    100.00         100.00    100.00         100.00

summarytools::freq(ascils_dat$ethncity)

## Frequencies  
## ascils_dat$ethncity  
## Label: Ethnicity  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0     19      3.78           3.78      3.78           3.78
##           1    170     33.86          37.65     33.86          37.65
##           2    199     39.64          77.29     39.64          77.29
##           3    114     22.71         100.00     22.71         100.00
##        <NA>      0                               0.00         100.00
##       Total    502    100.00         100.00    100.00         100.00

summarytools::freq(ascils_dat$type)

## Frequencies  
## ascils_dat$type  
## Label: Type - Scientist / Engineer  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    392     78.09          78.09     78.09          78.09
##           2    110     21.91         100.00     21.91         100.00
##        <NA>      0                               0.00         100.00
##       Total    502    100.00         100.00    100.00         100.00

Phase 2c: Compute Correlations Between Dimensions of Research Experience and Identity as a Scientist

We will compute simple bivariate correlations between each research experience dimension and identity, with significance tests for differences in correlation strength. We will then test for differences in correlation strength by gender, race/ethnicity, and major. The identity items are as follows:

tibble(labelled::look_for((ascils_dat %>% dplyr::select(ident1,
                                                 ident3,
                                                 ident5,
                                                 ident7,
                                                 ident9,
                                                 ident10)), details = FALSE))

## # A tibble: 6 × 3
##     pos variable label                        
##   <int> <chr>    <chr>                        
## 1     1 ident1   Sci / Eng Part of Self-Image 
## 2     2 ident3   Belong to Sci / Eng Community
## 3     3 ident5   Sci / Eng Reflection of Self 
## 4     4 ident7   Think of Self as Sci / Eng   
## 5     5 ident9   Belong in Field of Sci / Eng 
## 6     6 ident10  I am a Sci / Eng
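The outline above does not name a specific test for comparing correlation strength across groups. One common option for independent groups is Fisher's r-to-z test, sketched below in base R. The choice of test, the function name, and the correlation values are assumptions for illustration; the group sizes are the gender frequencies reported above and would shrink with missing data.

```r
# Fisher r-to-z test for the difference between two independent correlations
fisher_z_diff <- function(r1, n1, r2, n2) {
  z <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  list(z = z, p = 2 * pnorm(abs(z), lower.tail = FALSE))  # two-sided p value
}

# Hypothetical correlations between one research experience dimension and
# identity, computed separately in the two gender groups
out <- fisher_z_diff(r1 = 0.50, n1 = 249, r2 = 0.30, n2 = 184)
out$z
out$p
```

Comparing correlations across dimensions within the same sample involves dependent correlations and would need a different test than this one.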

STEM Research Experience Project - Study 1, Phase 1 Preregistration

Moin Syed and Linh Nguyen

2023-04-14