08a | Review and remove outliers

Description. We looked at outliers during data cleaning, but didn't act on them then. The outliers visualized here may be slightly different, as some participants have been removed (for ineligibility, incomplete data, etc.), small amounts of missing data have been interpolated, and FA values have been corrected for covariates/confounds (age, sex, parental education, and for patients, antipsychotic exposure). The review here immediately precedes CCA, so we are only reviewing the k=84 variables (k=6 neurocognition, k=6 social cognition, k=72 white matter FA) intended for the CCA, across the n=174 (n=83 SPINS, n=91 PNS) participants we otherwise believe to have 'good' data. Note: we later decided to include only 23 of these white matter tracts in the CCA analysis, based on hypothesized contribution to social cognition and neurocognition, respectively.

Outliers

In keeping with past lab work, we define outliers in accordance with the interquartile range (IQR). The IQR is the central 50% or the area between the 75th and the 25th percentile of a distribution. A point is an outlier if it is above the 75th or below the 25th percentile by a factor of 1.5 times the IQR. An outlier would be a point below [Q1- (1.5)IQR] or above [Q3+(1.5)IQR].

#function to get into long format
plot_fn <- function(df) {
  df %>%
  as_tibble() %>%
  select_if(is.numeric) %>%
  gather(key='variable', value='value')
}

#function for plot - boxplot without outliers
boxplot_fn <- function(df){
ggplot(df, aes(x=value, y='')) +
  geom_boxplot(outlier.colour='red') + 
  geom_jitter(alpha=.2) +
  theme(axis.title = element_blank()) +
  facet_wrap(~variable, scales = 'free', ncol=7)
}

Univariate outliers. Several variables have a small number of outlying participants, indicated in red, determined by Tukey's fences.

Entire sample

df_plot <- plot_fn(df)
boxplot_fn(df_plot)

SPINS only

df_plot_spins <- plot_fn(df_spins)
boxplot_fn(df_plot_spins)

PNS only

df_plot_pns <- plot_fn(df_pns)
boxplot_fn(df_plot_pns)

"Extreme outlier" participants. Some of the outliers visualized above come from the same participants, i.e., some participants show >1 outlying value. We could decide to exclude participants with a high number of outliers -- but at the time, we have not done so. In the tables below, count indicates number of outliers, and number indicates how many participants show that given count of outlying values:

#function to identify those who _aren't_ outliers, based on boxplot (Tukey fences)
isnt_out_tukey <- function(x, k = 1.5, na.rm = TRUE) {
  quar <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  iqr <- diff(quar)
  (quar[1] - k * iqr <= x) & (x <= quar[2] + k * iqr)
}

#transmute, so that outliers are indicated as FALSE
outliers <- df %>% transmute_if(is.numeric, isnt_out_tukey)

#for each variable, identify those with 'FALSE' (i.e., outliers)
outliers <- apply(
  outliers, 1, 
  function(x) paste(names(which(x==FALSE)), collapse="," ))

#make dataframe
outliers <- as.data.frame(outliers)

#split at delimiter
outliers <- cbind(row.names(df), splitstackshape::cSplit(outliers, 'outliers', sep=','))

#add a variable for site
outliers$site <- str_sub(outliers$V1, 1, 3)

#add a rowSums column, indicating how many outliers
outliers$count <- apply(outliers, 1, function(x) sum(!is.na(x))) 
outliers$count <- outliers$count - 2 #don't count name or site

#across the 85 variables, identify participants with over 10 outlying values - these will be removed
#exclude_outliers <- paste('>10 outliers', outliers$V1[which(rowSums(!is.na(outliers)) > 10)]) #include ID in count

#write out names for record
#write(exclude_outliers, file = "../csvs/modified/outliers.txt",
      #ncolumns = 1,
      #append = TRUE, sep = " ")

#remove from dataframe, as don't want in normality calcualtions
#df <- df[!row.names(df) %in% outliers$V1[which(rowSums(!is.na(outliers)) > 10)],]

Entire sample

outliers %>%
  select(count) %>%
  group_by(count) %>%
  dplyr::summarise(number = n()) %>% 
  t(.) %>%
  kable() %>% 
  kable_styling()

count	0	1	2	3	4	5	6	7	12	14	15	17	22
number	68	54	25	14	1	2	2	1	1	2	1	2	1

SPINS only

outliers %>%
  filter(site == 'SPN') %>%
  select(count) %>%
  group_by(count) %>%
  dplyr::summarise(number = n()) %>% 
  t(.) %>%
  kable() %>% 
  kable_styling()

count	0	1	2	3	4	6	12	14	17
number	31	28	13	5	1	1	1	1	2

PNS only

outliers %>%
  filter(site == 'PNS') %>%
  select(count) %>%
  group_by(count) %>%
  dplyr::summarise(number = n()) %>% 
  t(.) %>%
  kable() %>% 
  kable_styling()

count	0	1	2	3	5	6	7	14	15	22
number	37	26	12	9	2	1	1	1	1	1

Multivariate outliers (Malhalanobis distance). We see that 43 participants (of 174) are considered to be multivariate outliers, based on Malhalanobis distance. Mahalanobis distance is often used for multivariate outliers detection as this distance takes into account the shape of the observations. As the Mahalanobis distance can be approximated by a Chi squared distribution, we can use the alpha quantile of the chi-square distribution with k degrees of freedom (k being the number of columns). By default, the alpha threshold is set to 0.025 (corresponding to the 2.5% most extreme observations).

Note that we must calculate multivariate normality on the combined sample, as the individual datasets (SPINS n=83, PNS n=91) would require us to arbitary drop variables (the Malhalanobis calculation requires n >> number of features).

#review outliers with MVN package 
result <- mvn(df, multivariateOutlierMethod = 'adj',showOutliers = TRUE,showNewData = TRUE) #44 outliers

In the full sample plot above, n=16 are from SPINS (out of 83), and n=27 from PNS (out of 91). Thus, it does not appear that multivariate outliers are driven by one particular site. Participant IDs and distance values are as follows:

DT::datatable(result$multivariateOutliers) %>%
  DT::formatRound(digits=3, columns = c('Observation', 'Mahalanobis Distance'))

Replace outliers

As CCA is sensitive to outliers, we have decided to remove extreme outlying values, even though for cognition variables, we have verified that these are 'real' values, i.e., not data entry errors, and for white matter FA variables, we have visually confirmed that tracts appear plausible.

Attempt 1: Capping method

First, we replaced outlying values (defined by IQR) with the highest (lowest) value defined by IQR that is not an outlier. As visualized below, this method leaves some of our distributions with "heavy tails". By definition, this method removes all outliers. Note: we later discovered that these adjustments precluded correction to normality via Yeo-Johnson.

#function to replace outliers with 'capped' quantile - and make very high/low or else get huge tails
outlierReplace_fn <- function(x){
  quantiles <- quantile(x, c(.25, .75))
  limit <- 1.5 * IQR(x)
  x[x<quantiles[1] - limit] <- quantiles[1] - limit
  x[x>quantiles[2] + limit] <- quantiles[2] + limit
  return(x)
}

#apply outlier replace function to entire dataset
df_noExtremeOutliers <- df %>%
  rownames_to_column('record_id') %>%
  mutate_if(is.numeric, funs(outlierReplace_fn)) %>%
  column_to_rownames('record_id')

df_plot_noExtremeOutliers <- plot_fn(df_noExtremeOutliers)
boxplot_fn(df_plot_noExtremeOutliers)

Attempt 2: Removal and interpolation

Next, we used the same IQR criterion, but instead of replacing the values, we removed them, and then replaced missing values via multivariate interpolation (on the basis of all other k=84 variables, with the mice package). In total, n=290 were interpolated, across the k=84 variables, and n=174 participants (mirroring the number of values adjusted above). This second method is much more successful at maintaining a normal-esque data shape, but it risks being less reflective of the real data. It is also the case that, in a few cases, the interpolated values are themselves outliers. We can put restraints on the interpolation so this is not the case, but I have not done so here.

df_plot_noOutliers <- plot_fn(df_noOutliers)
boxplot_fn(df_plot_noOutliers)

#write.csv(df_noExtremeOutliers, paste('../csvs/modified/df_CCA_noExtremeOutliers_', Sys.Date(), '.csv', sep=''), row.names = T)
write.csv(df_noOutliers, paste('../csvs/modified/df_CCA_noOutliersINT_', Sys.Date(), '.csv', sep=''), row.names = T)