Missing data imputation

Author

Aggie Turlo

Published

December 8, 2025

Packages

These are the packages that you will need to install and load:

library(tidyverse)
library(ggplot2)
library(pheatmap)
library(imputeLCMD)
library(gridExtra)
library(ggfortify)

Data

Load the example dataset (save it in your working directory first):

df <- read.csv('MSC_proteomics.csv')

Missing data filtering

Remove variables (columns) that have more than 30% observations missing in at least one experimental group (here represented by the ‘Tissue’ column).

First, subset observations (samples) representing each tissue type.

rownames(df) <- df$ID
assay <- df[,-c(1:4)] # keep only columns with protein abundances

type <- df$Tissue == 'Bone marrow' # which rows are of Tissue category Bone marrow?
bm <- assay[type, ] # subset Bone marrow sample dataset
ad <- assay[!type, ] # subset Adipose sample dataset

Calculate % of missing values within each data subset.

Vnas_bm <- data.frame(Fnas  = colSums(is.na(bm))/nrow(bm)*100)
Vnas_ad <- data.frame(Fnas  = colSums(is.na(ad))/nrow(ad)*100)

Join the outputs to be able to compare % missing between data subsets.

Vnas_all <- Vnas_bm %>% dplyr::rename(BM = Fnas) %>% mutate(prot = rownames(.)) %>% left_join(Vnas_ad %>% dplyr::rename(AD = Fnas) %>% mutate(prot = rownames(.)))

Identify proteins with >30% values missing and remove them from the dataset.

Rm_all <- Vnas_all %>% dplyr::filter(AD > 30 | BM > 30)
df_f <- df %>% dplyr::select(!Rm_all$prot)

Missing data type assessment

To make decisions about imputation, we need to determine first if data is likely to be missing at random (MAR) or not at random (MNAR).

Missingness vs experimental variables

We start by looking at potential associations between missingness and experimental variables using a heatmap. First, subset the metadata (first 4 columns) and protein abundance data into two objects:

meta <- df_f[,1:4]
assay_f <- df_f[,5:ncol(df_f)]

Code missing and observed values in protein dataset as 0s and 1s. First check if there are any 0 values in the dataset already.

table(assay_f[assay_f == 0]) 
< table of extent 0 >

This result means that none of the protein levels were recorded as 0. Now we can go ahead and implement the 0-1 code for missing and observed data.

assay_f[is.na(assay_f) == TRUE] <- 0 # code NAs as 0
assay_f[assay_f > 0] <- 1 # code observed data as 1

Plot the heatmap

Code
# remember that row names must be the same in the plotted data and metadata
heatmap <- pheatmap(t(as.matrix(assay_f)), 
                    show_rownames = F, 
                    cluster_cols =  T, 
                    cluster_rows = T, 
                    show_colnames = T,
                    fontsize_row = 4,
                    main = 'Missing values',
                    legend  = T, 
                    color = c('Tomato1','Black'),
                    treeheight_col = 35,
                    treeheight_row = 0,
                    # argument below specifies metadata to be included on heatmap
                    annotation_col = meta %>% dplyr::select(Tissue, Age, Sex),  
                    clustering_method = "ward.D2")
Figure 1: Missing value heatmap (0 = missing, 1 = observed)

After filtering out low abundance variables it doesn’t seem like there is systemic missingness related to experimental groups (hierarchical clustering does not group samples clearly by age, sex or tissue type). Some adipose samples at the edges of the plot however seem to have more missing data than others.

We can calculate % missing values within samples to examine that further

# subset protein data from the filtered dataset
assay_f <- df_f[,-c(1:4)]

# calculate % values missing per sample
Snas <- data.frame(Fnas  = rowSums(is.na(assay_f))/ncol(assay_f)*100) %>%
  mutate(ID = rownames(.)) %>% left_join(meta)

Visualise the results

Code
# plot missingness within samples across tissue types
Snas %>% ggplot(aes(x = Tissue, y = Fnas)) + 
  geom_boxplot(outliers = FALSE) + 
  geom_jitter(pch = 21, size = 2) + 
  theme_bw() +
  ylab('% missing values')
Figure 2: Boxplots of missing values within samples

Adipose cells seem to have more variability in missingness than bone marrow cells but there is little evidence that missingness is systematically related to either tissue type.

Missingness vs abundances

Another important consideration is the relationship between missing values and observed values within variables and samples. This is especially helpful in detecting data MNAR due to not meeting the detection limit of the analytical method.

Start with log-transforming protein abundances for visualisations

df_f[,5:ncol(df_f)] <- log(df_f[,5:ncol(df_f)])

Plot protein abundance distributions across samples

Code
# change data format from wide to long
df_fL <- df_f %>% pivot_longer(cols =! c(ID, Tissue, Age, Sex), 
                               names_to = 'prot', values_to = 'abundance')

# plot intensity distributions
dens <- df_fL %>% ggplot(aes(x = abundance, group = ID, colour = Tissue)) + 
  geom_density(alpha = 0.5, size = 0.5) + 
  theme_bw() +
  theme(legend.position = 'none') +
  ggtitle('Individual samples')

dens_sum <- df_fL %>% ggplot(aes(x = abundance, group = Tissue, colour = Tissue)) +
  geom_density(alpha = 0.5, size = 0.5) + 
  theme_bw() +
  ggtitle('Summarised')

grid.arrange(dens, dens_sum, ncol = 2, widths = c(0.4, 0.6))
Figure 3: Density plots of log-transformed protein abundances in cell samples

This can be also shown using box plots.

Code
box <- df_fL %>% 
  ggplot(aes(y = abundance, x = ID, fill = Tissue)) + 
  geom_boxplot(alpha = 0.6) + theme_bw() +
  geom_hline(yintercept = median(df_fL$abundance, na.rm = TRUE), 
             size = 0.8, colour = 'darkblue') +
  theme(axis.text.x = element_text(size = 6))

box
Figure 4: Boxplots of log-transformed protein abundances in cell samples. Blue line represents median abundance of the whole dataset.

Plot % of missing values against mean protein abundance

Code
# transform data from wide to long
Vnas_allL <- Vnas_all %>% 
  pivot_longer(cols =! prot, names_to = 'Tissue', values_to ='FraqNA') %>%
  mutate(Tissue = ifelse(Tissue == 'AD', 'Adipose', 'Bone marrow'))

# create a plot
df_fL %>% group_by(Tissue, prot) %>% 
  summarise(mean_logAb = mean(abundance, na.rm = TRUE)) %>%
  left_join(Vnas_allL) %>%
  ggplot(aes(x = mean_logAb, y = FraqNA, group = FraqNA)) +
  geom_violin() + 
  geom_boxplot() + 
  theme_bw() +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = 10, face = 'bold')) +
  facet_wrap(~Tissue, ncol = 2) +
  ylab('% missing values') +
  xlab('mean log abundance')
Figure 5: Violin and box plots of log-transformed protein abundances depending on % observations missing

Figure 5 shows that proteins with more observations missing tend to have lower mean abundance (the observed values in those proteins are lower than in more complete proteins). This suggests data MNAR due to low abundance, or left-censored data. Meanwhile, Figure 3 shows that bone marrow cells have higher mean protein concentrations.

A reasonable approach to missing value imputation in this scenario is to use a method suitable for left-censored data but perform imputation separately in bone marrow and adipose data subsets, due to differences in their global distributions of protein abundances.

Imputation

Again, we start by subsetting protein abundance data according to cell tissue source.

assay_bm <- t(df_f[type, -c(1:4)]) # subset Bone marrow samples and transpose
assay_ad <- t(df_f[!type, -c(1:4)]) # subset Adipose samples and transpose

Impute missing data in each subset using quantile regression (Wei et al. 2018).

assay_bm_imp <- impute.QRILC(assay_bm)[[1]]
assay_ad_imp <- impute.QRILC(assay_ad)[[1]]

Concatenate back into one dataset.

assay_imp <- rbind(as.data.frame(t(assay_bm_imp)) %>% mutate(ID = rownames(.)),
                   as.data.frame(t(assay_ad_imp)) %>% mutate(ID = rownames(.)))

# add metadata
df_f_imp <- meta %>% left_join(assay_imp)

Evaluating outcomes of imputation

It is recommended to examine (and report) how imputation affected your data (e.g. variance structure, summary statistics). Plotting your data before and after imputation is useful for that. Below are examples of plots that you may want to include in a publication involving imputed dataset.

Plot intensity distributions and compare with Figure 3.

Code
# change data format from wide to long
df_f_impL <- df_f_imp %>% pivot_longer(cols =! c(ID, Tissue, Age, Sex), 
                               names_to = 'prot', values_to = 'abundance')

dens_imp <- df_f_impL %>% ggplot(aes(x = abundance, group = ID, colour = Tissue)) +
  geom_density(alpha = 0.5, size = 0.5) + 
  theme_bw() +
  ggtitle('Individual samples (imputed)')

dens_sum_imp <- df_f_impL %>% 
  ggplot(aes(x = abundance, group = Tissue, colour = Tissue)) +
  geom_density(alpha = 0.5, size = 0.5) + 
  theme_bw() +
  ggtitle('Summarised (imputed)')

grid.arrange(dens, dens_imp, 
             dens_sum + theme(legend.position = 'none'), dens_sum_imp, 
             ncol = 2, nrow = 2, widths = c(0.4, 0.6))
Figure 6

Some of the imputed individual distributions, especially in the adipose cell samples, gained a low-abundance second peak (visible also in summarised density plots). This is expected result of left-censored imputation and samples with most missing values are the most affected.

Repeat the same using boxplots.

Code
box_imp <- df_f_impL %>% ggplot(aes(y = abundance, x = ID, fill = Tissue)) + 
  geom_boxplot(alpha = 0.6) + theme_bw() +
  geom_hline(yintercept = median(df_f_impL$abundance, na.rm = TRUE), 
             size = 0.8, colour = 'darkblue') +
  theme(axis.text.x = element_text(size = 6)) +
  ggtitle('Observed + Imputed')

grid.arrange(box + ggtitle('Observed'), box_imp, nrow = 2)
Figure 7

Principal Component Analysis score plots are useful to evaluate the variance structure of the dataset before and after imputation and spot the outliers. PCA can be only performed on a complete dataset. Therefore, we will be comparing the imputed dataset with dataset including only proteins where all observations were recorded.

Start with filtering out proteins with any missing values.

assay_c <- t(t(assay_f) %>% as.data.frame() %>% drop_na())
ncol(assay_c)
[1] 314

Complete dataset includes 314 proteins (compared with 571 in the imputed dataset). Now we can perform PCA and create scores plot.

pc_c <- prcomp(assay_c, scale = TRUE, center = TRUE)

pc_c_t <- autoplot(pc_c, data = Snas, 
                   fill = 'Tissue', shape = 21, size = 2.5) +
  theme_bw() +
  theme(legend.position = 'bottom')

pc_c_na <- autoplot(pc_c, data = Snas, 
                    fill = 'Fnas', shape = 21, size = 2.5) + 
  scale_fill_viridis_c(name = '% missing') +
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.key.height = unit(0.1, 'cm'))

grid.arrange(pc_c_t, pc_c_na, ncol = 2)
Figure 8: PCA scores plots of complete protein abundance dataset.

Repeat the same for the imputed dataset and compare with Figure 8.

Code
assay_imp <- df_f_imp[,-c(1:4)]
pc_imp <- prcomp(assay_imp, scale = TRUE, center = TRUE)

pc_imp_t <- autoplot(pc_imp, data = Snas, 
                     fill = 'Tissue', shape = 21, size = 2.5) + 
  theme_bw() +
  theme(legend.position = 'bottom')

pc_imp_na <- autoplot(pc_imp, data = Snas, 
                      fill = 'Fnas', shape = 21, size = 2.5) + 
  scale_fill_viridis_c(name = '% missing') +
  theme_bw() +
  theme(legend.position = 'bottom',
        legend.key.height = unit(0.1, 'cm'))

grid.arrange(pc_imp_t, pc_imp_na, ncol = 2)
Figure 9: PCA scores plots of imputed protein abundance dataset.

Location of data points representing samples on PCA score plots changed slightly following imputation but the sample groupings in the plots remained similar (they resemble mirror images of each other along horizontal axis). For example, the tendency of samples to group according to tissue type or for the samples with higher % of data missing to separate along PC1 is similar between the datasets.

Remember that complete dataset does not necessarily represent true protein profiles more than the imputed dataset - removing incomplete variable if data are missing not at random can introduce bias as much as incorrect imputation!

References

Wei, Runmin, Jingye Wang, Mingming Su, Erik Jia, Shaoqiu Chen, Tianlu Chen, and Yan Ni. 2018. “Missing Value Imputation Approach for Mass Spectrometry-Based Metabolomics Data.” Scientific Reports 8 (1): 663. https://doi.org/10.1038/s41598-017-19120-0.