SPINS CCA: Missing data interpolation

Code written: 2020-01-02
Last ran: 2020-01-16
Website: http://rpubs.com/navona/thesis_dataInterpolationClinical

Description: A small amount of data is missing from the neurocognition and social cognition variables that we will use in the CCA. Although the count is small, we want to avoid listwise deletion because of overall sample size considerations. Here, we impute scores for all participants who are missing social cognition and/or neurocognition values but have at least one score from each set (participants missing all values have been excluded).
Method: We elected not to apply mean substitution, which leaves the mean unchanged (desirable) but decreases the variance (undesirable). Instead, we used MICE (Multivariate Imputation by Chained Equations). MICE is similar to regression imputation, except that it uses all relevant existing information to predict a plausible value for each missing data point in each imputed dataset; with predictive mean matching, the method we use, that value is drawn from the observed values. The main virtues of this approach are that imputations are restricted to the observed values and that it can preserve non-linear relations. The method assumes that missing data are Missing at Random (MAR). Moreover, this is the method that Lindsay used for interpolation in her SEM paper.
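The trade-off of mean substitution can be checked with a minimal base R sketch. The scores below are simulated and hypothetical, not SPINS data; the point is only that filling gaps with the observed mean keeps the mean fixed while shrinking the variance.

```r
# Hypothetical scores; not taken from the SPINS dataset.
set.seed(1)
scores <- rnorm(200, mean = 50, sd = 10)
scores_missing <- scores
scores_missing[sample(200, 20)] <- NA   # remove 20 values at random

# Mean substitution: replace every NA with the mean of the observed values.
observed_mean <- mean(scores_missing, na.rm = TRUE)
scores_filled <- ifelse(is.na(scores_missing), observed_mean, scores_missing)

mean_before <- mean(scores_missing, na.rm = TRUE)
mean_after  <- mean(scores_filled)
var_before  <- var(scores_missing, na.rm = TRUE)
var_after   <- var(scores_filled)

# The mean is unchanged, but the variance is smaller after substitution.
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(var_before = var_before, var_after = var_after))
```

The filled-in values add nothing to the sum of squared deviations while increasing the sample size, which is why the variance can only go down.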
Evaluation: The m=5 imputations vary more for some variables than for others. Overall, however, the imputed values are comparable across imputations and come from the same or similar distributions as the observed values. This imputation method is well suited to our dataset (given its n x p size and the number of missing values).
References: https://www.jstatsoft.org/article/view/v045i03 ; https://datascienceplus.com/imputing-missing-data-with-r-mice-package/

Neurocognition

In total, 7 neurocognitive data points are missing across the following participants:

record_id	group
SPN01_CMH_0013	SSD
SPN01_CMH_0097	SSD
SPN01_ZHP_0065	HC
SPN01_ZHP_0081	SSD
SPN01_ZHP_0083	SSD
SPN01_ZHP_0086	SSD
SPN01_ZHP_0118	HC

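All participants are missing data from the same factor, attention and vigilance, which reflects performance on the CPT-IP. In the output below, the Count column gives the proportion of participants with a missing value for each variable: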

## 
##  Variables sorted by number of missings: 
##              Variable      Count
##         ncog_attn_vig 0.01658768
##    ncog_process_speed 0.00000000
##      ncog_work_memory 0.00000000
##  ncog_verbal_learning 0.00000000
##  ncog_visual_learning 0.00000000
##  ncog_problem_solving 0.00000000
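Per-variable missingness of this kind can also be computed directly in base R. The sketch below uses a small hypothetical data frame (`df_example` and its column names are illustrative, not the SPINS variables) and is not necessarily how the output above was produced.

```r
# Hypothetical data frame standing in for a set of cognitive scores.
df_example <- data.frame(
  attn_vig      = c(1.2, NA, 0.8, NA, 1.5),
  process_speed = c(0.3, 0.9, 1.1, 0.7, 0.4)
)

# is.na() marks missing cells; column means give proportions,
# column sums give raw counts.
missing_prop  <- colMeans(is.na(df_example))
missing_count <- colSums(is.na(df_example))

# Variables sorted by proportion missing, largest first.
print(sort(missing_prop, decreasing = TRUE))
print(missing_count)
```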

We implement the MICE imputation as follows. The m=5 argument sets the number of imputed datasets (this is also the default value). The maxit argument specifies the number of iterations. The method='pmm' argument selects the imputation method, predictive mean matching. The seed argument sets a random seed for reproducibility.

library(mice)
imputed_ncog <- mice(df_ncogNumeric, m=5, maxit = 50, method = 'pmm', seed = 500)
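To make the idea behind predictive mean matching concrete, here is a simplified base R sketch for a single variable with one predictor. The helper `pmm_sketch` and the simulated data are hypothetical, and this is not the mice implementation: mice additionally draws the regression parameters at random for each imputation and cycles through all incomplete variables, both of which are omitted here.

```r
# Simplified predictive mean matching for one variable (illustration only).
pmm_sketch <- function(y, x, donors = 5) {
  observed <- !is.na(y)
  fit <- lm(y ~ x)                                  # rows with missing y are dropped
  predicted <- predict(fit, newdata = data.frame(x = x))
  y_imputed <- y
  for (i in which(!observed)) {
    # Distance between this case's prediction and each observed case's prediction.
    distance <- abs(predicted[observed] - predicted[i])
    nearest <- order(distance)[seq_len(donors)]     # closest candidate donors
    chosen <- nearest[sample.int(length(nearest), 1)]
    y_imputed[i] <- y[observed][chosen]             # borrow the donor's observed value
  }
  y_imputed
}

# Simulated example data; not SPINS data.
set.seed(500)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
y_missing <- y
y_missing[c(3, 27, 58)] <- NA

y_completed <- pmm_sketch(y_missing, x)
print(y_completed[c(3, 27, 58)])
```

Because each imputed value is copied from a donor, it is always a value that was actually observed, which is the property that keeps imputations within the observed range.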

The mice() call returns m=5 imputed values, each based on 50 iterations, for every participant with missing neurocognitive data. Ultimately, we will combine, or ‘pool’, these values. Inspection of the values shows some variability across m; this is to be expected. The following visualizations show the distributions of the observed and imputed data (density plot and strip plot). In both cases, the imputed data are red and the observed data are blue. We see acceptable overlap between the imputed and observed data points and distributions.

Density plot

Strip plot
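The comparison made in the density and strip plots can be approximated in base R. This is a rough sketch on simulated, hypothetical values (the `observed` vector and `imputed` matrix are stand-ins, with one row per missing case and one column per imputation); it is not the plotting code used for the figures above.

```r
# Hypothetical observed scores and imputed values; not SPINS data.
set.seed(500)
observed <- rnorm(300, mean = 0, sd = 1)
# 7 missing cases x 5 imputations, drawn from the observed values as PMM would do.
imputed <- matrix(sample(observed, 7 * 5, replace = TRUE), nrow = 7, ncol = 5)

# Check that every imputed value lies within the observed range.
within_range <- all(imputed >= min(observed) & imputed <= max(observed))
# Mean of the imputed values within each of the 5 imputations.
imputation_means <- colMeans(imputed)
print(within_range)
print(round(imputation_means, 2))

# Overlay densities: observed in blue, each imputation in red.
plot(density(observed), col = "blue", main = "Observed vs imputed")
for (j in seq_len(ncol(imputed))) {
  lines(density(imputed[, j]), col = "red")
}
```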

Social cognition

In total, 33 social cognition data points are missing across the following participants:

record_id	group
SPN01_CMH_0095	SSD
SPN01_CMH_0165	SSD
SPN01_CMH_0168	SSD
SPN01_CMH_0180	SSD
SPN01_CMH_0196	SSD
SPN01_CMH_0198	SSD
SPN01_CMH_0207	SSD
SPN01_CMP_0205	SSD
SPN01_CMP_0209	HC
SPN01_MRP_0151	SSD
SPN01_ZHH_0040	SSD
SPN01_ZHH_0041	SSD
SPN01_ZHP_0061	SSD
SPN01_ZHP_0091	SSD
SPN01_ZHP_0094	SSD
SPN01_ZHP_0095	SSD
SPN01_ZHP_0099	SSD
SPN01_ZHP_0108	HC
SPN01_ZHP_0110	SSD
SPN01_ZHP_0117	HC
SPN01_ZHP_0123	SSD
SPN01_ZHP_0140	SSD
SPN01_ZHP_0144	SSD
SPN01_ZHP_0172	SSD

## 
##  Variables sorted by number of missings: 
##      Variable       Count
##       scog_EA 0.035545024
##    scog_ER_40 0.011848341
##     scog_RMET 0.007109005
##      scog_RAD 0.007109005
##  scog_TASIT_3 0.007109005
##  scog_TASIT_1 0.004739336
##  scog_TASIT_2 0.004739336
##      scog_IRI 0.000000000

We implemented the MICE imputation in the same way as above:

imputed_scog <- mice(df_scogNumeric, m=5, maxit = 50, method = 'pmm', seed = 500)

As above, the tabbed visualizations show the resulting 5 imputed values, each based on 50 iterations. The imputed data are adequately similar to the observed data. We pooled the imputed data across all m.
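The text above does not specify how the values were pooled. One simple possibility, assumed here purely for illustration, is to average each cell across the completed datasets. The sketch uses three small hypothetical data frames (rather than five, for brevity) whose column names are illustrative.

```r
# Hypothetical completed datasets of identical shape with no missing values.
# Rows 1-3 are participants; only some cells differ across imputations.
completed <- list(
  data.frame(EA = c(1, 2, 3), RMET = c(10, 20, 30)),
  data.frame(EA = c(1, 4, 3), RMET = c(10, 20, 36)),
  data.frame(EA = c(1, 3, 3), RMET = c(10, 20, 33))
)

# Cell-wise average across the list: sum the data frames, divide by their number.
pooled <- Reduce(`+`, completed) / length(completed)
print(pooled)
```

Cells that were observed are identical in every completed dataset, so averaging leaves them unchanged and only affects the imputed cells.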
