if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(ggplot2, mice, naniar)

Missingness

The airquality dataset contains daily air quality measurements in New York from May to September 1973. This dataset includes some missing values, making it ideal for this exercise.

Understanding Missing Data Mechanisms

Before addressing missing data, it’s essential to understand the mechanisms that lead to missingness, as they dictate the appropriate handling methods:

  • Missing Completely at Random (MCAR): The probability of missingness is the same for all observations.
  • Missing at Random (MAR): The probability of missingness is related to observed data but not the missing data itself.
  • Missing Not at Random (MNAR): The probability of missingness is related to the value of the missing data itself.

Identifying the type of missingness helps in choosing the right imputation or analysis method.

Visualizing Missing Data

Visual representations can help identify patterns and potential reasons behind missing data.

Basic Visualization with vis_miss()

The vis_miss() function provides a quick overview of the missing data in your dataset:

vis_miss(airquality)

This plot displays the proportion of missing data in each column and highlights the missingness in the dataset.

Exploring Missing Data Patterns with gg_miss_upset()

To examine combinations of missingness across variables, use gg_miss_upset():

# Visualize combinations of missing data
gg_miss_upset(airquality)

This plot helps identify if certain variables tend to have missing values simultaneously, which can inform the choice of imputation methods.

Visualizing Missing Data in Scatter Plots with geom_miss_point()

When exploring relationships between two continuous variables, it’s important to visualize missing data points:

# Scatter plot highlighting missing values
ggplot(data = airquality, aes(x = Ozone, y = Solar.R)) +
  geom_miss_point()

This scatter plot highlights missing values by placing them 10% below the minimum value of the respective variable.

Summarizing Missing Data

Quantifying missing data helps in understanding its scope and making informed decisions on handling it.

Overall Missing Data Summary

Use the following functions to get a quick summary:

# Total number of missing values
n_miss(airquality)
## [1] 44
# Proportion of missing values
prop_miss(airquality)
## [1] 0.04793028

Variable-wise Missing Data Summary

To see missing data per variable:

miss_var_summary(airquality)

This provides a tibble with the count and percentage of missing values for each variable.

Handling Missing Data

Imputation with naniar’s impute_mean()

For exploratory purposes, impute missing values using the mean of the observed data:

# Impute missing values with the mean
airquality_imputed <- airquality %>%
  impute_mean_all()

This method replaces all missing values with the mean of their respective variables. However, be cautious with this approach, as it can underestimate variability.

Advanced Imputation with mice and Visualization with naniar

For more robust imputation, the mice package can be used, and naniar can help visualize the imputed values:

# Perform multiple imputation
imputed_data <- mice(airquality, m = 5, method = 'pmm', seed = 123)
## 
##  iter imp variable
##   1   1  Ozone  Solar.R
##   1   2  Ozone  Solar.R
##   1   3  Ozone  Solar.R
##   1   4  Ozone  Solar.R
##   1   5  Ozone  Solar.R
##   2   1  Ozone  Solar.R
##   2   2  Ozone  Solar.R
##   2   3  Ozone  Solar.R
##   2   4  Ozone  Solar.R
##   2   5  Ozone  Solar.R
##   3   1  Ozone  Solar.R
##   3   2  Ozone  Solar.R
##   3   3  Ozone  Solar.R
##   3   4  Ozone  Solar.R
##   3   5  Ozone  Solar.R
##   4   1  Ozone  Solar.R
##   4   2  Ozone  Solar.R
##   4   3  Ozone  Solar.R
##   4   4  Ozone  Solar.R
##   4   5  Ozone  Solar.R
##   5   1  Ozone  Solar.R
##   5   2  Ozone  Solar.R
##   5   3  Ozone  Solar.R
##   5   4  Ozone  Solar.R
##   5   5  Ozone  Solar.R
# Complete the dataset after imputation
completed_data <- complete(imputed_data)

# Visualize the imputed data
completed_data %>%
ggplot(aes(x = Ozone, y = Solar.R)) +
  geom_point()

This approach allows for the assessment the distribution of imputed values and ensure they make sense in the context of the data.

Advanced Techniques

Creating Shadow Matrices with naniar**

A shadow matrix is a data structure that mirrors your original dataset but focuses solely on indicating the presence or absence of data. Each variable in the shadow matrix corresponds to a variable in the original dataset, with a suffix _NA added to its name. The values in these shadow variables are categorical indicators:

  • !NA signifies that the corresponding value in the original dataset is present (not missing).
  • NA signifies that the corresponding value is missing.

This structure allows for a clear and organized representation of missing data patterns within your dataset.

# Create a shadow matrix for the airquality dataset
shadow_airquality <- as_shadow(airquality)

# View the first few rows of the shadow matrix
head(shadow_airquality)

This will produce a tibble where each variable from the airquality dataset has a corresponding _NA variable indicating the presence (!NA) or absence (NA) of data.

Binding the Shadow Matrix to the Original Data

For a more integrated analysis, bind the shadow matrix to the original dataset, creating what is known as nabular data (a combination of “NA” and “tabular”). This can be achieved using the bind_shadow() function:

# Bind the shadow matrix to the original dataset
nabular_airquality <- bind_shadow(airquality)

# View the first few rows of the combined dataset
head(nabular_airquality)

In this combined dataset, each original variable is accompanied by its _NA counterpart, providing a comprehensive view of the data alongside its missingness indicators.

Practical Applications of Shadow Matrices

  1. Analyzing Relationships Involving Missing Data: Shadow matrices allow you to examine how missing data in one variable relates to the presence or values of another variable. For instance, assess whether missing Ozone values are associated with specific Temp values:
nabular_airquality %>%
  ggplot(aes(x = Temp, fill = Ozone_NA)) +
     geom_histogram(binwidth = 5, position = "dodge")

This histogram visualizes the distribution of temperature (Temp) in the airquality dataset, grouped by whether Ozone values are missing or not.

  • Missingness is not random across temperature:
    • There are more missing Ozone values (teal bars) at higher temperatures, especially in the 80–90°F range.
    • This suggests the possibility that the missingness in Ozone is not completely at random (not MCAR). It may depend on temperature (possibly MAR — Missing At Random).
  • Higher temperatures are well-represented:
    • Most data points cluster around 75–85°F, for both missing and non-missing Ozone values.
    • This shows the bulk of the data is in the warmer range of the observed temperatures.
  • Few low-temp days:
    • There are fewer observations below 65°F, and these have mostly complete Ozone data.
  • Note:
    • When handling missing data or building imputation models, it might be useful to include Temp as a predictor for missing Ozone values, since there’s an apparent relationship.
  1. Facilitating Imputation: When performing data imputation, shadow matrices help track which values were originally missing. This tracking is essential for evaluating the performance of imputation methods and ensuring that analyses account for imputed versus observed data.

Handling Special Missing Values

In some datasets, specific codes (e.g., -99, 9999) are used to represent missing data due to particular reasons, such as instrument failure or data entry errors. The naniar package provides the recode_shadow() function to handle such special missing values by assigning them meaningful labels:

# Example dataset with a special missing value code
df <- tibble::tribble(
  ~wind, ~temp,
  -99, 45,
  68, NA,
  72, 25
)

# Bind the shadow to the data
df_shadow <- bind_shadow(df)

# Recode the special missing value
df_recode <- recode_shadow(df_shadow, wind = .where(wind == -99 ~ "instrument_failure"))

# View the recoded dataset
df_recode

In this example, the value -99 in the wind variable is recoded as a special missing value labeled “instrument_failure,” preserving the context of the missingness.

sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Dubai
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] naniar_1.1.0  mice_3.17.0   ggplot2_3.5.1
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6      shape_1.4.6.1     xfun_0.51         bslib_0.9.0      
##  [5] visdat_0.6.0      lattice_0.22-6    vctrs_0.6.5       tools_4.4.3      
##  [9] Rdpack_2.6.3      generics_0.1.3    tibble_3.2.1      pacman_0.5.1     
## [13] pan_1.9           pkgconfig_2.0.3   jomo_2.7-6        Matrix_1.7-3     
## [17] lifecycle_1.0.4   compiler_4.4.3    farver_2.1.2      munsell_0.5.1    
## [21] codetools_0.2-20  htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.10      
## [25] glmnet_4.1-8      pillar_1.10.1     nloptr_2.2.1      jquerylib_0.1.4  
## [29] tidyr_1.3.1       MASS_7.3-65       cachem_1.1.0      reformulas_0.4.0 
## [33] iterators_1.0.14  rpart_4.1.24      boot_1.3-31       foreach_1.5.2    
## [37] mitml_0.4-5       nlme_3.1-167      tidyselect_1.2.1  digest_0.6.37    
## [41] dplyr_1.1.4       purrr_1.0.4       labeling_0.4.3    forcats_1.0.0    
## [45] splines_4.4.3     fastmap_1.2.0     grid_4.4.3        colorspace_2.1-1 
## [49] cli_3.6.4         magrittr_2.0.3    survival_3.8-3    broom_1.0.8      
## [53] withr_3.0.2       scales_1.3.0      backports_1.5.0   rsthemes_0.5.0   
## [57] rmarkdown_2.29    nnet_7.3-20       lme4_1.1-37       gridExtra_2.3    
## [61] evaluate_1.0.3    knitr_1.50        UpSetR_1.4.0      rbibutils_2.3    
## [65] rlang_1.1.5       Rcpp_1.0.14       glue_1.8.0        rstudioapi_0.17.1
## [69] minqa_1.2.8       jsonlite_2.0.0    R6_2.6.1          plyr_1.8.9