One of the most prevalent characteristics of real-world data is arguably the presence of missing values.

While the best strategy to treat missing values largely depends on the purpose of our analysis and the algorithm to be used, the first step is always to identify the magnitude and patterns of missing in our data. Missing value identification is therefore important: oversight or negligence could easily bias our analysis.

Here I will share my workflow and favorite tools in this “detective work”. Hope it is of some help and would love to hear your thoughts too. :)


Overview of missings

First of all, it is helpful to get an overview about missing across different variables in the dataset. The naniar package comes in handy here.

# Explore and visualize missing values
library(naniar)
# other packages
library(tidyverse) # data wrangling
library(tidyquant) # theme and color schemes

Missings in each variables

The following table shows the number of percentage of missing values in each variable. The result is automatically sorted, with variables that has the most missing at the top. The dataset explored here, airquality, is a built-in dataset in R.

# create a table that shows missing in each variable
airquality |> 
    miss_var_summary() |> 
    kable()
variable n_miss pct_miss
Ozone 37 24.18
Solar.R 7 4.58
Wind 0 0.00
Temp 0 0.00
Month 0 0.00
Day 0 0.00

When you have a large number of variables in the dataset, the result of miss_var_summary() can be used as a filter to help zoom in to a subgroup with particular situation in missing. For example, we can choose to single out variables with more than 10% missing values.

more_than_10pct_missing <- airquality |> 
  miss_var_summary() |>   
  
  # choose variable(s) that have more than 10% missing values
  filter(pct_miss > 10) |> 
  pull(variable)

more_than_10pct_missing
## [1] "Ozone"


A summary of missings

The table below summarized missing in all variables. For example, from the first row, we can tell that four variables (see n_vars) have 0 missing values (see n_miss_in_var), which account for 66.67% (i.e., 4 out of 6, see pct_vars) of all the variables.

# a summary of variables by missing
airquality |> 
    miss_var_table() |> 
    kable()
n_miss_in_var n_vars pct_vars
0 4 66.67
7 1 16.67
37 1 16.67



DataViz for missings

Missing overview: Graph version

This visual offers a quick overview regarding missings in all variables. You can customize it with ggplot’s grammar of graphics. In the following example, a title was added using labs() and the appearance of x and y axis title have been adjusted in theme().

# number of missing across all variables (ranked from high to low)
airquality |> 
    gg_miss_var(show_pct=TRUE)+
  
    # add a tile
    labs(title = "Missing in the Airquality Data") +
  
    # appearance
    theme_minimal() +
    theme(axis.title = element_text(color="grey10",face = "italic"))


Side by side comparison

This graph made by vis_miss() can help us identify the extent to which missing across variables happen together, i.e., happen among the same cases/observations. In order to do this, I normally sort a variable of interest first, so a) all missing values on that variable are displayed together, and b) matches and mismatches can be identified more easily.

From the following visual, we can see that only a small proportion of observations are missing on both Ozone and Solar.R.

# missing by observations across all variables
airquality |> 
  
    # sort a variable of interests
    arrange(Solar.R) |> 
  
    # visualize missing
    vis_miss() +
  
    # appearance setting
    theme_tq() +
    labs(title = "Missing in the Airquality Data")


Patterns in missing

Detecting patterns in missing can be very important, especially when we have reason to suspect that missing in our data may have happened systematically rather than completely at random. For example, in a dataset, observation from some specific ethnic, age or income group might be more likely to have missing values on a particular variable than the rest of the population. In such cases, missing doesn’t happen completely at random, relevant patterns need to be identified and handled accordingly. In this note we will focus on pattern detection, but if you are interested in relevant imputation methods, here is an interesting read.

In the code chunk below, geom_miss_point() is the function that plots missing vs non-missing points in different colors, then facet_wrap creates faceted plot to reveal patterns across different months. Notice that we have missing values in Ozone across all five months, with the most missing happen in June (i.e., month = 6) and least in September. Missings in Solar.R are only present in May and August.

Because missing in Solar.R and Ozone do not overlap much, it would be interesting to check this: Do observations missing on Ozone (vs not) score differently on Solar.R? How about observations missing on Solar.R, do they have different values on Ozone from the rest of the observations?

# work with ggplot2 to create a faceted plot to help identify possible patterns in missing
airquality |> 
    ggplot(aes(x=Ozone, y=Solar.R)) + 
  
    # visualize missings & non-missings
    geom_miss_point() +
  
    # faceting by months
    facet_wrap(~Month) +
  
    # appearance setting
    labs(title="Missing vs Not-Missing Values in the Airquality Data", color="") +
    theme_tq() +
    scale_color_tq()


Hidden Missing Values

So far we have been talking about missing values that are clearly labeled as missing. But, sometimes missing values can be hidden.

For example, missing in dates are sometimes represented by a particular value such as “1900-01-01”. Similarly, missing in numerical variables are sometimes represented by values such as 999. A even trickier case is that in some system, missing values can be recorded as zero.

Hidden missing detection

One can imagine that this type of hidden missing values can seriously bias our analysis if not detected and correctly labeled. Some data comes with documentation that explains special values (e.g., 999) which represent missing. However, this is not always the case. While there is no single magic formula to detect such hidden missing values, checking out basic descriptive statistics might be a good starting point.

Descriptive statistics

R’s built in summary() function is pretty good. Suspicious values discussed above are likely to emerge as min or max.

airquality |> 
    summary()
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

descr() in the summarytools packages are nice too. (also recommended is skim() in the skimr package)

library(summarytools)

airquality |> 
    descr()
## Descriptive Statistics  
## airquality  
## N: 153  
## 
##                        Day    Month    Ozone   Solar.R     Temp     Wind
## ----------------- -------- -------- -------- --------- -------- --------
##              Mean    15.80     6.99    42.13    185.93    77.88     9.96
##           Std.Dev     8.86     1.42    32.99     90.06     9.47     3.52
##               Min     1.00     5.00     1.00      7.00    56.00     1.70
##                Q1     8.00     6.00    18.00    115.00    72.00     7.40
##            Median    16.00     7.00    31.50    205.00    79.00     9.70
##                Q3    23.00     8.00    63.50    259.00    85.00    11.50
##               Max    31.00     9.00   168.00    334.00    97.00    20.70
##               MAD    11.86     1.48    25.95     98.59     8.90     3.41
##               IQR    15.00     2.00    45.25    143.00    13.00     4.10
##                CV     0.56     0.20     0.78      0.48     0.12     0.35
##          Skewness     0.00     0.00     1.21     -0.42    -0.37     0.34
##       SE.Skewness     0.20     0.20     0.22      0.20     0.20     0.20
##          Kurtosis    -1.22    -1.32     1.11     -1.00    -0.46     0.03
##           N.Valid   153.00   153.00   116.00    146.00   153.00   153.00
##         Pct.Valid   100.00   100.00    75.82     95.42   100.00   100.00

Common NA strings

Missing values sometimes are represented by descriptions such as “N/A”, “NA” or “NULL”. Admittedly, when working with real-world data, such description is not necessarily consistently applied within one dataset, or even within one variable. No problem, they can be easily identified and replaced (demonstrated later).

Missing together variables

Another trick that I found helpful is to think about variables that are likely to be missing together in one observation. Use the side by side comparison graph made by vis_miss() discussed above to check whether the pattern make sense. If not, go investigate.

Set missings as missing

Once we have done our detective work and have identified any values that should have been treated as missing, several functions in naniar will come in handy to set things right. Please see a few examples below, more examples are available in the naniar package documentation).

The demo data

# the demo data
df <- tibble::tribble(
  ~name,           ~x,  ~y,              ~z,  
  "N/A",           1,   "2022-02-02",    -100, 
  "N A",           3,   "1900-01-01",     100,
  "N / A",         NA,  "2021-12-26",     999,
  "Not Available", 99,  "2022-01-15",     99,
  "John Smith",    999, "1900-01-01",      -1) |> 
  mutate(y=as.Date(y))

df
## # A tibble: 5 x 4
##   name              x y              z
##   <chr>         <dbl> <date>     <dbl>
## 1 N/A               1 2022-02-02  -100
## 2 N A               3 1900-01-01   100
## 3 N / A            NA 2021-12-26   999
## 4 Not Available    99 2022-01-15    99
## 5 John Smith      999 1900-01-01    -1

Replacing relevant values with NA

Numeric values and dates

Three handy functions, replace_with_na(), replace_with_na_at() and replace_with_na_if() are demonstrated below.

# set 99 and 999 in x as missing
df |> 
    replace_with_na(replace=list(x=c(99,999))) 
## # A tibble: 5 x 4
##   name              x y              z
##   <chr>         <dbl> <date>     <dbl>
## 1 N/A               1 2022-02-02  -100
## 2 N A               3 1900-01-01   100
## 3 N / A            NA 2021-12-26   999
## 4 Not Available    NA 2022-01-15    99
## 5 John Smith       NA 1900-01-01    -1
# set all 99 and 999 in both x and z as missing
df |> 
    replace_with_na_at(.vars = c("x","z"),
                       condition = ~.x %in% c(99,999)) 
## # A tibble: 5 x 4
##   name              x y              z
##   <chr>         <dbl> <date>     <dbl>
## 1 N/A               1 2022-02-02  -100
## 2 N A               3 1900-01-01   100
## 3 N / A            NA 2021-12-26    NA
## 4 Not Available    NA 2022-01-15    NA
## 5 John Smith       NA 1900-01-01    -1
# for all "date" fields, set "1900-01-01" as missing
df |> 
    replace_with_na_if(.predicate = is.Date,
                       condition = ~.x == "1900-01-01")
## # A tibble: 5 x 4
##   name              x y              z
##   <chr>         <dbl> <date>     <dbl>
## 1 N/A               1 2022-02-02  -100
## 2 N A               3 NA           100
## 3 N / A            NA 2021-12-26   999
## 4 Not Available    99 2022-01-15    99
## 5 John Smith      999 NA            -1

Strings

For string variables, common_na_strings in the naniar packages contains a rich set of characters (e.g., NA, N/A) that normally represent missing in real-world data.

# for missing values labeled by strings, naniar offers common_na_strings
common_na_strings
##  [1] "NA"     "N A"    "N/A"    "#N/A"   "NA "    " NA"    "N /A"   "N / A" 
##  [9] " N / A" "N / A " "na"     "n a"    "n/a"    "na "    " na"    "n /a"  
## [17] "n / a"  " a / a" "n / a " "NULL"   "null"   ""       "\\?"    "\\*"   
## [25] "\\."
# utilize common_na_strings to set missing
df |> 
    replace_with_na_all(condition = ~.x %in% common_na_strings) 
## # A tibble: 5 x 4
##   name              x y              z
##   <chr>         <dbl> <date>     <dbl>
## 1 <NA>              1 2022-02-02  -100
## 2 <NA>              3 1900-01-01   100
## 3 <NA>             NA 2021-12-26   999
## 4 Not Available    99 2022-01-15    99
## 5 John Smith      999 1900-01-01    -1

Sometimes we may have NA strings that are not covered in common_na_strings. In the above table, we can see that “Not Available” is meant to represent missing but have not been replaced yet. We can easily create a custom set by adding custom na strings to common_na_strings. Please see the example below.

# adding items to common_na_strings to make a custom set
na_strings <- c("Not Available",
                common_na_strings)

# utilize the custom set to replace missing
df |> 
    replace_with_na_all(condition = ~.x %in% na_strings) 
## # A tibble: 5 x 4
##   name           x y              z
##   <chr>      <dbl> <date>     <dbl>
## 1 <NA>           1 2022-02-02  -100
## 2 <NA>           3 1900-01-01   100
## 3 <NA>          NA 2021-12-26   999
## 4 <NA>          99 2022-01-15    99
## 5 John Smith   999 1900-01-01    -1