if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(ggplot2, mice, naniar)The airquality dataset contains daily air quality
measurements in New York from May to September 1973. This dataset
includes some missing values, making it ideal for this exercise.
Before addressing missing data, it’s essential to understand the mechanisms that lead to missingness, as they dictate the appropriate handling methods:
Identifying the type of missingness helps in choosing the right imputation or analysis method.
Visual representations can help identify patterns and potential reasons behind missing data.
vis_miss()The vis_miss() function provides a quick overview of the
missing data in your dataset:
This plot displays the proportion of missing data in each column and highlights the missingness in the dataset.
gg_miss_upset()To examine combinations of missingness across variables, use
gg_miss_upset():
This plot helps identify if certain variables tend to have missing values simultaneously, which can inform the choice of imputation methods.
geom_miss_point()When exploring relationships between two continuous variables, it’s important to visualize missing data points:
# Scatter plot highlighting missing values
ggplot(data = airquality, aes(x = Ozone, y = Solar.R)) +
geom_miss_point()This scatter plot highlights missing values by placing them 10% below the minimum value of the respective variable.
Quantifying missing data helps in understanding its scope and making informed decisions on handling it.
Use the following functions to get a quick summary:
## [1] 44
## [1] 0.04793028
naniar’s
impute_mean()For exploratory purposes, impute missing values using the mean of the observed data:
This method replaces all missing values with the mean of their respective variables. However, be cautious with this approach, as it can underestimate variability.
mice and Visualization with
naniarFor more robust imputation, the mice package can be
used, and naniar can help visualize the imputed values:
##
## iter imp variable
## 1 1 Ozone Solar.R
## 1 2 Ozone Solar.R
## 1 3 Ozone Solar.R
## 1 4 Ozone Solar.R
## 1 5 Ozone Solar.R
## 2 1 Ozone Solar.R
## 2 2 Ozone Solar.R
## 2 3 Ozone Solar.R
## 2 4 Ozone Solar.R
## 2 5 Ozone Solar.R
## 3 1 Ozone Solar.R
## 3 2 Ozone Solar.R
## 3 3 Ozone Solar.R
## 3 4 Ozone Solar.R
## 3 5 Ozone Solar.R
## 4 1 Ozone Solar.R
## 4 2 Ozone Solar.R
## 4 3 Ozone Solar.R
## 4 4 Ozone Solar.R
## 4 5 Ozone Solar.R
## 5 1 Ozone Solar.R
## 5 2 Ozone Solar.R
## 5 3 Ozone Solar.R
## 5 4 Ozone Solar.R
## 5 5 Ozone Solar.R
# Complete the dataset after imputation
completed_data <- complete(imputed_data)
# Visualize the imputed data
completed_data %>%
ggplot(aes(x = Ozone, y = Solar.R)) +
geom_point()This approach allows for the assessment the distribution of imputed values and ensure they make sense in the context of the data.
naniar**A shadow matrix is a data structure that mirrors
your original dataset but focuses solely on indicating the presence or
absence of data. Each variable in the shadow matrix corresponds to a
variable in the original dataset, with a suffix _NA added
to its name. The values in these shadow variables are categorical
indicators:
!NA signifies that the corresponding value in the
original dataset is present (not missing).NA signifies that the corresponding value is
missing.This structure allows for a clear and organized representation of missing data patterns within your dataset.
# Create a shadow matrix for the airquality dataset
shadow_airquality <- as_shadow(airquality)
# View the first few rows of the shadow matrix
head(shadow_airquality)This will produce a tibble where each variable from the
airquality dataset has a corresponding _NA
variable indicating the presence (!NA) or absence
(NA) of data.
For a more integrated analysis, bind the shadow matrix to the
original dataset, creating what is known as nabular
data (a combination of “NA” and “tabular”). This can be achieved using
the bind_shadow() function:
# Bind the shadow matrix to the original dataset
nabular_airquality <- bind_shadow(airquality)
# View the first few rows of the combined dataset
head(nabular_airquality)In this combined dataset, each original variable is accompanied by
its _NA counterpart, providing a comprehensive view of the
data alongside its missingness indicators.
Ozone values are associated with specific
Temp values:nabular_airquality %>%
ggplot(aes(x = Temp, fill = Ozone_NA)) +
geom_histogram(binwidth = 5, position = "dodge")This histogram visualizes the distribution of temperature
(Temp) in the airquality dataset,
grouped by whether Ozone values are missing or
not.
Ozone values (teal bars) at
higher temperatures, especially in the 80–90°F
range.Ozone is not completely at random (not
MCAR). It may depend on temperature (possibly MAR — Missing At
Random).Ozone values.Ozone data.Temp as a predictor
for missing Ozone values, since there’s an apparent
relationship.In some datasets, specific codes (e.g., -99,
9999) are used to represent missing data due to particular
reasons, such as instrument failure or data entry errors. The
naniar package provides the recode_shadow()
function to handle such special missing values by assigning them
meaningful labels:
# Example dataset with a special missing value code
df <- tibble::tribble(
~wind, ~temp,
-99, 45,
68, NA,
72, 25
)
# Bind the shadow to the data
df_shadow <- bind_shadow(df)
# Recode the special missing value
df_recode <- recode_shadow(df_shadow, wind = .where(wind == -99 ~ "instrument_failure"))
# View the recoded dataset
df_recodeIn this example, the value -99 in the wind
variable is recoded as a special missing value labeled
“instrument_failure,” preserving the context of the missingness.
## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Dubai
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] naniar_1.1.0 mice_3.17.0 ggplot2_3.5.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 shape_1.4.6.1 xfun_0.51 bslib_0.9.0
## [5] visdat_0.6.0 lattice_0.22-6 vctrs_0.6.5 tools_4.4.3
## [9] Rdpack_2.6.3 generics_0.1.3 tibble_3.2.1 pacman_0.5.1
## [13] pan_1.9 pkgconfig_2.0.3 jomo_2.7-6 Matrix_1.7-3
## [17] lifecycle_1.0.4 compiler_4.4.3 farver_2.1.2 munsell_0.5.1
## [21] codetools_0.2-20 htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
## [25] glmnet_4.1-8 pillar_1.10.1 nloptr_2.2.1 jquerylib_0.1.4
## [29] tidyr_1.3.1 MASS_7.3-65 cachem_1.1.0 reformulas_0.4.0
## [33] iterators_1.0.14 rpart_4.1.24 boot_1.3-31 foreach_1.5.2
## [37] mitml_0.4-5 nlme_3.1-167 tidyselect_1.2.1 digest_0.6.37
## [41] dplyr_1.1.4 purrr_1.0.4 labeling_0.4.3 forcats_1.0.0
## [45] splines_4.4.3 fastmap_1.2.0 grid_4.4.3 colorspace_2.1-1
## [49] cli_3.6.4 magrittr_2.0.3 survival_3.8-3 broom_1.0.8
## [53] withr_3.0.2 scales_1.3.0 backports_1.5.0 rsthemes_0.5.0
## [57] rmarkdown_2.29 nnet_7.3-20 lme4_1.1-37 gridExtra_2.3
## [61] evaluate_1.0.3 knitr_1.50 UpSetR_1.4.0 rbibutils_2.3
## [65] rlang_1.1.5 Rcpp_1.0.14 glue_1.8.0 rstudioapi_0.17.1
## [69] minqa_1.2.8 jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9