One of the most prevalent characteristics of real-world data is arguably the presence of missing values.
While the best strategy to treat missing values largely depends on the purpose of our analysis and the algorithm to be used, the first step is always to identify the magnitude and patterns of missing in our data. Missing value identification is therefore important: oversight or negligence could easily bias our analysis.
Here I will share my workflow and favorite tools in this “detective work”. Hope it is of some help and would love to hear your thoughts too. :)
Overview of missings
First of all, it is helpful to get an overview about missing across different variables in the dataset. The naniar package comes in handy here.
# Explore and visualize missing values
library(naniar)
# other packages
library(tidyverse) # data wrangling
library(tidyquant) # theme and color schemes
Missings in each variables
The following table shows the number of percentage of missing values in each variable. The result is automatically sorted, with variables that has the most missing at the top. The dataset explored here, airquality, is a built-in dataset in R.
# create a table that shows missing in each variable
airquality |>
miss_var_summary() |>
kable()
|
variable
|
n_miss
|
pct_miss
|
|
Ozone
|
37
|
24.18
|
|
Solar.R
|
7
|
4.58
|
|
Wind
|
0
|
0.00
|
|
Temp
|
0
|
0.00
|
|
Month
|
0
|
0.00
|
|
Day
|
0
|
0.00
|
When you have a large number of variables in the dataset, the result of miss_var_summary() can be used as a filter to help zoom in to a subgroup with particular situation in missing. For example, we can choose to single out variables with more than 10% missing values.
more_than_10pct_missing <- airquality |>
miss_var_summary() |>
# choose variable(s) that have more than 10% missing values
filter(pct_miss > 10) |>
pull(variable)
more_than_10pct_missing
## [1] "Ozone"
A summary of missings
The table below summarized missing in all variables. For example, from the first row, we can tell that four variables (see n_vars) have 0 missing values (see n_miss_in_var), which account for 66.67% (i.e., 4 out of 6, see pct_vars) of all the variables.
# a summary of variables by missing
airquality |>
miss_var_table() |>
kable()
|
n_miss_in_var
|
n_vars
|
pct_vars
|
|
0
|
4
|
66.67
|
|
7
|
1
|
16.67
|
|
37
|
1
|
16.67
|
DataViz for missings
Missing overview: Graph version
This visual offers a quick overview regarding missings in all variables. You can customize it with ggplot’s grammar of graphics. In the following example, a title was added using labs() and the appearance of x and y axis title have been adjusted in theme().
# number of missing across all variables (ranked from high to low)
airquality |>
gg_miss_var(show_pct=TRUE)+
# add a tile
labs(title = "Missing in the Airquality Data") +
# appearance
theme_minimal() +
theme(axis.title = element_text(color="grey10",face = "italic"))

Side by side comparison
This graph made by vis_miss() can help us identify the extent to which missing across variables happen together, i.e., happen among the same cases/observations. In order to do this, I normally sort a variable of interest first, so a) all missing values on that variable are displayed together, and b) matches and mismatches can be identified more easily.
From the following visual, we can see that only a small proportion of observations are missing on both Ozone and Solar.R.
# missing by observations across all variables
airquality |>
# sort a variable of interests
arrange(Solar.R) |>
# visualize missing
vis_miss() +
# appearance setting
theme_tq() +
labs(title = "Missing in the Airquality Data")

Patterns in missing
Detecting patterns in missing can be very important, especially when we have reason to suspect that missing in our data may have happened systematically rather than completely at random. For example, in a dataset, observation from some specific ethnic, age or income group might be more likely to have missing values on a particular variable than the rest of the population. In such cases, missing doesn’t happen completely at random, relevant patterns need to be identified and handled accordingly. In this note we will focus on pattern detection, but if you are interested in relevant imputation methods, here is an interesting read.
In the code chunk below, geom_miss_point() is the function that plots missing vs non-missing points in different colors, then facet_wrap creates faceted plot to reveal patterns across different months. Notice that we have missing values in Ozone across all five months, with the most missing happen in June (i.e., month = 6) and least in September. Missings in Solar.R are only present in May and August.
Because missing in Solar.R and Ozone do not overlap much, it would be interesting to check this: Do observations missing on Ozone (vs not) score differently on Solar.R? How about observations missing on Solar.R, do they have different values on Ozone from the rest of the observations?
# work with ggplot2 to create a faceted plot to help identify possible patterns in missing
airquality |>
ggplot(aes(x=Ozone, y=Solar.R)) +
# visualize missings & non-missings
geom_miss_point() +
# faceting by months
facet_wrap(~Month) +
# appearance setting
labs(title="Missing vs Not-Missing Values in the Airquality Data", color="") +
theme_tq() +
scale_color_tq()

Hidden Missing Values
So far we have been talking about missing values that are clearly labeled as missing. But, sometimes missing values can be hidden.
For example, missing in dates are sometimes represented by a particular value such as “1900-01-01”. Similarly, missing in numerical variables are sometimes represented by values such as 999. A even trickier case is that in some system, missing values can be recorded as zero.
Hidden missing detection
One can imagine that this type of hidden missing values can seriously bias our analysis if not detected and correctly labeled. Some data comes with documentation that explains special values (e.g., 999) which represent missing. However, this is not always the case. While there is no single magic formula to detect such hidden missing values, checking out basic descriptive statistics might be a good starting point.
Descriptive statistics
R’s built in summary() function is pretty good. Suspicious values discussed above are likely to emerge as min or max.
airquality |>
summary()
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
descr() in the summarytools packages are nice too. (also recommended is skim() in the skimr package)
library(summarytools)
airquality |>
descr()
## Descriptive Statistics
## airquality
## N: 153
##
## Day Month Ozone Solar.R Temp Wind
## ----------------- -------- -------- -------- --------- -------- --------
## Mean 15.80 6.99 42.13 185.93 77.88 9.96
## Std.Dev 8.86 1.42 32.99 90.06 9.47 3.52
## Min 1.00 5.00 1.00 7.00 56.00 1.70
## Q1 8.00 6.00 18.00 115.00 72.00 7.40
## Median 16.00 7.00 31.50 205.00 79.00 9.70
## Q3 23.00 8.00 63.50 259.00 85.00 11.50
## Max 31.00 9.00 168.00 334.00 97.00 20.70
## MAD 11.86 1.48 25.95 98.59 8.90 3.41
## IQR 15.00 2.00 45.25 143.00 13.00 4.10
## CV 0.56 0.20 0.78 0.48 0.12 0.35
## Skewness 0.00 0.00 1.21 -0.42 -0.37 0.34
## SE.Skewness 0.20 0.20 0.22 0.20 0.20 0.20
## Kurtosis -1.22 -1.32 1.11 -1.00 -0.46 0.03
## N.Valid 153.00 153.00 116.00 146.00 153.00 153.00
## Pct.Valid 100.00 100.00 75.82 95.42 100.00 100.00
Common NA strings
Missing values sometimes are represented by descriptions such as “N/A”, “NA” or “NULL”. Admittedly, when working with real-world data, such description is not necessarily consistently applied within one dataset, or even within one variable. No problem, they can be easily identified and replaced (demonstrated later).
Missing together variables
Another trick that I found helpful is to think about variables that are likely to be missing together in one observation. Use the side by side comparison graph made by vis_miss() discussed above to check whether the pattern make sense. If not, go investigate.
Set missings as missing
Once we have done our detective work and have identified any values that should have been treated as missing, several functions in naniar will come in handy to set things right. Please see a few examples below, more examples are available in the naniar package documentation).
The demo data
# the demo data
df <- tibble::tribble(
~name, ~x, ~y, ~z,
"N/A", 1, "2022-02-02", -100,
"N A", 3, "1900-01-01", 100,
"N / A", NA, "2021-12-26", 999,
"Not Available", 99, "2022-01-15", 99,
"John Smith", 999, "1900-01-01", -1) |>
mutate(y=as.Date(y))
df
## # A tibble: 5 x 4
## name x y z
## <chr> <dbl> <date> <dbl>
## 1 N/A 1 2022-02-02 -100
## 2 N A 3 1900-01-01 100
## 3 N / A NA 2021-12-26 999
## 4 Not Available 99 2022-01-15 99
## 5 John Smith 999 1900-01-01 -1
Replacing relevant values with NA
Numeric values and dates
Three handy functions, replace_with_na(), replace_with_na_at() and replace_with_na_if() are demonstrated below.
# set 99 and 999 in x as missing
df |>
replace_with_na(replace=list(x=c(99,999)))
## # A tibble: 5 x 4
## name x y z
## <chr> <dbl> <date> <dbl>
## 1 N/A 1 2022-02-02 -100
## 2 N A 3 1900-01-01 100
## 3 N / A NA 2021-12-26 999
## 4 Not Available NA 2022-01-15 99
## 5 John Smith NA 1900-01-01 -1
# set all 99 and 999 in both x and z as missing
df |>
replace_with_na_at(.vars = c("x","z"),
condition = ~.x %in% c(99,999))
## # A tibble: 5 x 4
## name x y z
## <chr> <dbl> <date> <dbl>
## 1 N/A 1 2022-02-02 -100
## 2 N A 3 1900-01-01 100
## 3 N / A NA 2021-12-26 NA
## 4 Not Available NA 2022-01-15 NA
## 5 John Smith NA 1900-01-01 -1
# for all "date" fields, set "1900-01-01" as missing
df |>
replace_with_na_if(.predicate = is.Date,
condition = ~.x == "1900-01-01")
## # A tibble: 5 x 4
## name x y z
## <chr> <dbl> <date> <dbl>
## 1 N/A 1 2022-02-02 -100
## 2 N A 3 NA 100
## 3 N / A NA 2021-12-26 999
## 4 Not Available 99 2022-01-15 99
## 5 John Smith 999 NA -1
Strings
For string variables, common_na_strings in the naniar packages contains a rich set of characters (e.g., NA, N/A) that normally represent missing in real-world data.
# for missing values labeled by strings, naniar offers common_na_strings
common_na_strings
## [1] "NA" "N A" "N/A" "#N/A" "NA " " NA" "N /A" "N / A"
## [9] " N / A" "N / A " "na" "n a" "n/a" "na " " na" "n /a"
## [17] "n / a" " a / a" "n / a " "NULL" "null" "" "\\?" "\\*"
## [25] "\\."
# utilize common_na_strings to set missing
df |>
replace_with_na_all(condition = ~.x %in% common_na_strings)
## # A tibble: 5 x 4
## name x y z
## <chr> <dbl> <date> <dbl>
## 1 <NA> 1 2022-02-02 -100
## 2 <NA> 3 1900-01-01 100
## 3 <NA> NA 2021-12-26 999
## 4 Not Available 99 2022-01-15 99
## 5 John Smith 999 1900-01-01 -1
Sometimes we may have NA strings that are not covered in common_na_strings. In the above table, we can see that “Not Available” is meant to represent missing but have not been replaced yet. We can easily create a custom set by adding custom na strings to common_na_strings. Please see the example below.
# adding items to common_na_strings to make a custom set
na_strings <- c("Not Available",
common_na_strings)
# utilize the custom set to replace missing
df |>
replace_with_na_all(condition = ~.x %in% na_strings)
## # A tibble: 5 x 4
## name x y z
## <chr> <dbl> <date> <dbl>
## 1 <NA> 1 2022-02-02 -100
## 2 <NA> 3 1900-01-01 100
## 3 <NA> NA 2021-12-26 999
## 4 <NA> 99 2022-01-15 99
## 5 John Smith 999 1900-01-01 -1