Code written: 2020-01-02
Last ran: 2020-01-16
Website: http://rpubs.com/navona/thesis_dataInterpolationClinical
Description: We have a small amount of missing data in the neurocognition and social cognition variables that we will use in the CCA analysis. Though the count is small, we want to avoid list-wise deletion because of overall sample size considerations. Here, we impute scores for all participants who are missing values for social cognition and/or neurocognition but have at least one score from each set (participants missing all values have been excluded).
Method: We elected not to apply mean substitution, which leaves the mean unchanged (desirable) but decreases variance (undesirable). Instead, we used the MICE (Multivariate Imputation via Chained Equations) method. MICE is similar to regression imputation, except that it uses all relevant observed information to predict a plausible value (drawn from the observed distribution) for each missing data point, in each of the m imputed datasets. The main virtues of the method, as applied here with predictive mean matching, are that imputations are restricted to the observed values and that it can preserve non-linear relations. The method assumes that data are Missing at Random (MAR). Moreover, this is the method that Lindsay used for imputation in her SEM paper.
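The contrast between mean substitution and donor-based imputation can be illustrated with a minimal base R sketch. Everything here is an assumption made for illustration: the data are simulated, there is a single predictor, and the matching step is a simplified version of predictive mean matching (it uses k = 5 donors and skips the parameter draw that the mice package performs), so it is not the procedure used in this analysis.

```r
# Hypothetical data: one predictor x and one outcome y with 10 values removed.
set.seed(500)
n <- 200
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
y_na <- y
y_na[sample(n, 10)] <- NA
obs <- !is.na(y_na)

# Mean substitution: every missing value receives the observed mean.
y_mean_sub <- ifelse(obs, y_na, mean(y_na, na.rm = TRUE))
mean_unchanged <- isTRUE(all.equal(mean(y_mean_sub), mean(y_na, na.rm = TRUE)))
variance_shrinks <- var(y_mean_sub) < var(y_na, na.rm = TRUE)

# Simplified predictive mean matching (assumptions: one predictor, k = 5 donors,
# no draw of regression parameters).
fit <- lm(y_na ~ x)                      # rows with missing y are dropped by default
pred <- predict(fit, newdata = data.frame(x = x))

draw_from_donors <- function(i, k = 5) {
  distance <- abs(pred[obs] - pred[i])
  donors <- order(distance)[seq_len(k)]  # k observed cases with the closest predictions
  y_na[obs][sample(donors, 1)]           # borrow one donor's observed value
}

y_pmm <- y_na
y_pmm[!obs] <- vapply(which(!obs), draw_from_donors, numeric(1))

# Every imputed value is a value that was actually observed.
only_observed_values <- all(y_pmm[!obs] %in% y_na[obs])
```

Because each imputed value is borrowed from an observed case, the imputations cannot fall outside the observed range, which is the property described above.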
Evaluation: The m=5 imputations vary more for some variables than for others. Overall, however, the imputed values are comparable across imputations and come from the same or similar distributions as the observed values. This imputation method is ideal for our dataset (based on its n x p size and number of missing values).
References: https://www.jstatsoft.org/article/view/v045i03 ; https://datascienceplus.com/imputing-missing-data-with-r-mice-package/
In total, 7 neurocognitive data points are missing across the following participants:
record_id | group |
---|---|
SPN01_CMH_0013 | SSD |
SPN01_CMH_0097 | SSD |
SPN01_ZHP_0065 | HC |
SPN01_ZHP_0081 | SSD |
SPN01_ZHP_0083 | SSD |
SPN01_ZHP_0086 | SSD |
SPN01_ZHP_0118 | HC |
All participants are missing data from the same factor, attention and vigilance, which reflects performance on the CPT-IP (in the output below, the Count column gives the proportion of missing values per variable):
##
## Variables sorted by number of missings:
## Variable Count
## ncog_attn_vig 0.01658768
## ncog_process_speed 0.00000000
## ncog_work_memory 0.00000000
## ncog_verbal_learning 0.00000000
## ncog_visual_learning 0.00000000
## ncog_problem_solving 0.00000000
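A summary like the one above can be approximated in base R. The sketch below is an assumption-laden illustration: the data frame `df_demo` and its values are hypothetical, and the 422 rows are inferred from the reported figures (7 missing values at a proportion of 0.01658768 implies 7 / 422). The printed layout above resembles the output of `VIM::aggr`, but the source does not show which function produced it.

```r
# Hypothetical stand-in for the neurocognition data frame (422 rows inferred
# from 7 missing values at a proportion of 0.01658768).
set.seed(1)
n <- 422
df_demo <- data.frame(
  ncog_attn_vig      = rnorm(n),
  ncog_process_speed = rnorm(n),
  ncog_work_memory   = rnorm(n)
)
df_demo$ncog_attn_vig[sample(n, 7)] <- NA

# Number and proportion of missing values per variable, sorted by proportion.
missing_count <- colSums(is.na(df_demo))
missing_prop  <- sort(colMeans(is.na(df_demo)), decreasing = TRUE)

# Row indices of participants with at least one missing score.
rows_with_missing <- which(!complete.cases(df_demo))

print(missing_prop)
```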
We implement the MICE imputation method as follows. The m=5 argument sets the number of imputed datasets (this is also the default value). The maxit argument specifies the number of iterations. The method='pmm' argument selects the imputation method: predictive mean matching. The seed argument sets a random seed for reproducibility.
imputed_ncog <- mice(df_ncogNumeric, m=5, maxit = 50, method = 'pmm', seed = 500)
The code returns m=5 imputed values, each based on 50 iterations, for all participants with missing neurocognitive data. Ultimately, we will combine, or ‘pool’, these values. Inspection of the values shows some variability across m, which is to be expected. The following visualizations show the distributions of the observed and imputed data (density plot and scatterplot). In both cases, the imputed data are red and the observed data are blue. We see acceptable overlap between the imputed and observed data points and distributions.
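The inspection and plotting steps described above might look roughly like the following. This is a sketch under stated assumptions: `df_ncogNumeric` is not available here, so the `nhanes` example dataset shipped with the mice package stands in for it, and the object names `imp_demo`, `imputed_bmi` and `completed_first` are hypothetical. The plotting calls used to produce the original figures are not shown in the source.

```r
library(mice)  # also provides the small `nhanes` example dataset used below

# Stand-in for df_ncogNumeric: `nhanes` has missing values in bmi, hyp and chl.
imp_demo <- mice(nhanes, m = 5, maxit = 50, method = "pmm", seed = 500,
                 printFlag = FALSE)

# Imputed values for one variable: rows are cases with a missing value,
# columns are the m imputed datasets.
imputed_bmi <- imp_demo$imp$bmi
print(imputed_bmi)

# Observed (blue) versus imputed (red) distributions.
print(densityplot(imp_demo, ~ bmi))
print(stripplot(imp_demo, bmi ~ .imp, pch = 20))

# One completed dataset (here the first of the m).
completed_first <- complete(imp_demo, 1)
```

With predictive mean matching, every value in `imputed_bmi` is a value that also occurs among the observed `bmi` scores, which is why the red and blue points share the same support in the strip plot.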
Social cognition
In total, 33 social cognition data points are missing across the following participants:
We implemented the MICE imputation in the same way as above.
As above, the tabbed visualizations show the resulting 5 imputed values, each based on 50 iterations. The imputed data are adequately similar to the observed data. We pooled the imputed data across all m.
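The source does not show how the imputed values were pooled. One possible reading, sketched here as an assumption, is to average the m imputed values for each missing data point and write the result back into the data. The participant labels and scores below are hypothetical. Note that in multiple-imputation practice, pooling is more commonly applied to analysis estimates (for example with the mice functions `with()` and `pool()`) than to the imputed values themselves.

```r
# Hypothetical imputed values, laid out like the `imp` element of a mice result:
# one row per participant with a missing score, one column per imputed dataset.
imputed_values <- data.frame(
  `1` = c(-0.4,  0.9,  0.1),
  `2` = c(-0.2,  1.1,  0.3),
  `3` = c(-0.6,  0.7,  0.1),
  `4` = c(-0.4,  0.9, -0.1),
  `5` = c(-0.4,  0.9,  0.1),
  check.names = FALSE,
  row.names = c("P01", "P02", "P03")
)

# Average across the m columns, and keep the spread as a rough stability check.
pooled <- rowMeans(imputed_values)
spread <- apply(imputed_values, 1, sd)

# Write the pooled values back into a hypothetical score vector.
scores <- c(P01 = NA, P02 = NA, P03 = NA, P04 = 0.3, P05 = -1.2)
scores[names(pooled)] <- pooled

print(data.frame(pooled = pooled, spread = spread))
```

Averaging yields a single completed dataset, but the pooled values are no longer guaranteed to be values that were actually observed, and the between-imputation variability is discarded.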