Code written: 2020-01-02
Last ran: 2020-01-16
Website: http://rpubs.com/navona/thesis_dataInterpolationClinical


Description: A small number of values are missing from the neurocognition and social cognition variables that we will use in the CCA. Although the count is small, we want to avoid list-wise deletion because of overall sample size considerations. Here, we impute scores for all participants who are missing values for social cognition and/or neurocognition but have at least one score from each set (participants missing all values have been excluded).
Method: We elected not to apply mean substitution, which leaves the mean unchanged (desirable) but decreases variance (undesirable). Instead, we used the MICE (Multivariate Imputation by Chained Equations) method. MICE is similar to regression imputation, except that it uses all relevant available information to predict a plausible value for each missing data point under each simulated regression model; with predictive mean matching, that value is drawn from the observed values. The main virtues of the method are that imputations are restricted to the observed values and that it can preserve non-linear relations. The method assumes that the data are Missing at Random (MAR). Moreover, this is the method that Lindsay used for imputation in her SEM paper.
Evaluation: The m=5 imputations vary more for some variables than for others. Overall, however, the imputed values are comparable across imputations and come from the same or similar distributions as the observed values. This imputation method is ideal for our dataset (based on its n x p size and number of missing values).
References: https://www.jstatsoft.org/article/view/v045i03 ; https://datascienceplus.com/imputing-missing-data-with-r-mice-package/
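
The trade-off noted under Method (mean substitution keeps the mean but shrinks the variance) can be checked on simulated data. The sketch below uses only base R; the scores, the sample size, and the 20% missingness rate are made-up assumptions and are not the thesis data.

```r
# Toy illustration (hypothetical data): mean substitution vs. variance.
set.seed(1)
n <- 200
x <- rnorm(n, mean = 50, sd = 10)      # simulated test scores
x_miss <- x
x_miss[sample(n, 40)] <- NA            # delete 20% of values at random

# Replace every missing value with the mean of the observed values.
obs_mean  <- mean(x_miss, na.rm = TRUE)
x_meansub <- ifelse(is.na(x_miss), obs_mean, x_miss)

var_observed <- var(x_miss, na.rm = TRUE)
var_meansub  <- var(x_meansub)

# The mean is unchanged; the variance is smaller after substitution.
print(c(mean_observed = obs_mean, mean_after = mean(x_meansub)))
print(c(var_observed = var_observed, var_after = var_meansub))
```

Because every substituted value sits exactly at the mean, the sum of squared deviations does not change while the denominator grows, so the variance shrinks by the factor (n_obs - 1) / (n - 1), here 159/199.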


Neurocognition

In total, 7 neurocognitive data points are missing across the following participants:

record_id group
SPN01_CMH_0013 SSD
SPN01_CMH_0097 SSD
SPN01_ZHP_0065 HC
SPN01_ZHP_0081 SSD
SPN01_ZHP_0083 SSD
SPN01_ZHP_0086 SSD
SPN01_ZHP_0118 HC


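All seven participants are missing data from the same factor, attention and vigilance, which reflects performance on the CPT-IP. Note that the Count column in the output below gives the proportion of participants with a missing value, not a raw count: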

## 
##  Variables sorted by number of missings: 
##              Variable      Count
##         ncog_attn_vig 0.01658768
##    ncog_process_speed 0.00000000
##      ncog_work_memory 0.00000000
##  ncog_verbal_learning 0.00000000
##  ncog_visual_learning 0.00000000
##  ncog_problem_solving 0.00000000
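
A summary of this kind can be reproduced in base R. The sketch below is an assumption about how such a table might be computed, not the code that produced the output above; the data frame `df_toy` and its values are hypothetical and only borrow two variable names for readability.

```r
# Hypothetical toy data frame with one incomplete variable.
df_toy <- data.frame(
  ncog_attn_vig      = c(NA, 48, 52, NA, 55),
  ncog_process_speed = c(50, 47, 53, 49, 51)
)

# Proportion of missing values per variable, largest first.
prop_missing <- sort(colMeans(is.na(df_toy)), decreasing = TRUE)

# Raw count of missing values per variable.
n_missing <- colSums(is.na(df_toy))

# Row indices of participants with at least one missing value.
rows_missing <- which(!complete.cases(df_toy))

print(prop_missing)
print(n_missing)
print(rows_missing)
```

Read this way, the reported proportion of 0.01658768 is consistent with 7 missing values in 422 rows (7/422), although the sample size is not stated in this section.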

We implement the MICE imputation method as follows. The m=5 argument sets the number of imputed datasets (5 is also the default). The maxit argument sets the number of iterations. The method='pmm' argument selects the imputation method, here predictive mean matching. The seed argument sets a random seed for reproducibility.

imputed_ncog <- mice(df_ncogNumeric, m=5, maxit = 50, method = 'pmm', seed = 500)
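
To make the method = 'pmm' choice concrete, here is a simplified, hand-rolled sketch of predictive mean matching for a single variable in base R. It is an illustration under stated assumptions, not the mice implementation: the data are simulated, the names `x`, `y`, and `k` are hypothetical, and the sketch omits the random draw of regression parameters that mice performs before matching.

```r
# Simplified predictive mean matching on simulated (hypothetical) data.
set.seed(500)
n <- 100
x <- rnorm(n)                        # fully observed predictor
y <- 2 + 1.5 * x + rnorm(n)          # variable that will have missing values
y[sample(n, 10)] <- NA               # delete 10 values at random

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)      # regression fitted on observed cases
pred <- predict(fit, newdata = data.frame(x = x))  # predictions for all cases

k <- 5                               # donors per missing case (assumed value)
y_imp <- y
for (i in which(!obs)) {
  # Observed cases whose predicted value is closest to this case's prediction.
  donors <- order(abs(pred[obs] - pred[i]))[seq_len(k)]
  # Impute the observed value of one randomly chosen donor.
  y_imp[i] <- y[obs][donors[sample.int(k, 1)]]
}

print(y_imp[!obs])
```

Every imputed value is copied from an observed case, which is why imputations under this method cannot fall outside the range of the observed data.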

The code returns m=5 imputed values, each based on 50 iterations, for all participants with missing neurocognitive data. Ultimately, we will combine, or ‘pool’, these values. Inspection of the values shows some variability across m, which is to be expected. The following visualizations show the distributions of the observed and imputed data (density plot and strip plot). In both, the imputed data are red and the observed data are blue. We see acceptable overlap between the imputed and observed data points and distributions.

Density plot

Strip plot
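
Plots like the two above can be drawn with the lattice methods that mice provides for imputation objects. The sketch below is an assumption about how this might be done, not the original plotting code; it uses `nhanes`, a small example dataset bundled with mice, in place of the thesis data.

```r
library(mice)

# nhanes ships with mice; it stands in for the thesis data here.
imp <- mice(nhanes, m = 5, maxit = 50, method = "pmm",
            seed = 500, printFlag = FALSE)

# Imputed values: one row per case with a missing bmi, one column per imputation.
print(imp$imp$bmi)

# Density of observed (blue) vs. imputed (red) values for one variable.
print(densityplot(imp, ~ bmi))

# Strip plot of the same variable, split by imputation number.
print(stripplot(imp, bmi ~ .imp, pch = 20, cex = 1.2))
```

Printing `imp$imp$bmi` is a quick way to see how much the imputed values differ across the m datasets for each case.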


Social cognition

In total, 33 social cognition data points are missing across the following participants:

record_id group
SPN01_CMH_0095 SSD
SPN01_CMH_0165 SSD
SPN01_CMH_0168 SSD
SPN01_CMH_0180 SSD
SPN01_CMH_0196 SSD
SPN01_CMH_0198 SSD
SPN01_CMH_0207 SSD
SPN01_CMP_0205 SSD
SPN01_CMP_0209 HC
SPN01_MRP_0151 SSD
SPN01_ZHH_0040 SSD
SPN01_ZHH_0041 SSD
SPN01_ZHP_0061 SSD
SPN01_ZHP_0091 SSD
SPN01_ZHP_0094 SSD
SPN01_ZHP_0095 SSD
SPN01_ZHP_0099 SSD
SPN01_ZHP_0108 HC
SPN01_ZHP_0110 SSD
SPN01_ZHP_0117 HC
SPN01_ZHP_0123 SSD
SPN01_ZHP_0140 SSD
SPN01_ZHP_0144 SSD
SPN01_ZHP_0172 SSD

## 
##  Variables sorted by number of missings: 
##      Variable       Count
##       scog_EA 0.035545024
##    scog_ER_40 0.011848341
##     scog_RMET 0.007109005
##      scog_RAD 0.007109005
##  scog_TASIT_3 0.007109005
##  scog_TASIT_1 0.004739336
##  scog_TASIT_2 0.004739336
##      scog_IRI 0.000000000

We implemented the MICE imputation in the same way as above:

imputed_scog <- mice(df_scogNumeric, m=5, maxit = 50, method = 'pmm', seed = 500)

As above, the tabbed visualizations show the resulting 5 imputed values, each based on 50 iterations. The imputed data are adequately similar to the observed data. We pooled the imputed data across all m.

Density plot

Strip plot
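
The text does not say how the imputed values were pooled across m. One possible reading, sketched below as an assumption, is to average the m completed datasets cell by cell so that each participant ends up with a single value per variable. The example again uses the `nhanes` dataset bundled with mice rather than the thesis data.

```r
library(mice)

# nhanes ships with mice; all of its columns are numeric.
imp <- mice(nhanes, m = 5, maxit = 50, method = "pmm",
            seed = 500, printFlag = FALSE)

# Extract each of the m completed datasets (observed values plus imputations).
completed <- lapply(seq_len(imp$m), function(i) mice::complete(imp, i))

# Average the completed datasets cell by cell (assumed pooling rule).
pooled <- Reduce(`+`, completed) / imp$m

print(head(pooled))
```

Observed cells are identical in every completed dataset, so averaging leaves them unchanged; only the originally missing cells become the mean of their m imputed values.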