Assessing missingness of data in research

Author

Duncan Kabiito Matovu

Introduction

Missingness

Research and data analysis frequently face the problem of missing data: the absence of observations or responses in a dataset. Data can go missing for a number of reasons, including survey non-response, data entry mistakes, or dropout from a longitudinal study. Missing data must be handled carefully, because failing to do so can result in biased findings and incorrect inferences.

To maintain the validity and integrity of their studies, researchers must deal with missing data deliberately. The gaps can be filled in using methods like imputation, which estimates missing values from the observed data. It is also crucial to perform a sensitivity analysis to determine how missingness affects the study results.

Fundamentally, recognizing and accounting for missing data is not only a statistical requirement but also a critical step in performing trustworthy and credible research. It helps ensure that research results appropriately reflect the population under study and make a meaningful contribution to scientific or social understanding.

Imputation

Imputation is a statistical approach for handling missing data by estimating (“imputing”) the missing values from the observed data. It matters in data analysis and research for several reasons.

First, imputation helps maintain sample size and statistical power. Simply dropping every case that has a missing value can drastically reduce the sample size, which may result in underpowered analyses and unreliable findings. Imputation enables researchers to keep as much data as they can.

Second, imputation can lessen bias and help maintain the dataset’s representativeness. Without imputation, the analysis may be biased toward the people or cases with complete information, which could result in inaccurate estimates. Well-chosen imputed values help restore that balance, so that the dataset better represents the complete population.

Imputation can be done in a number of ways, such as mean imputation, regression imputation, and multiple imputation. The strategy chosen depends on the type of data being used and on the assumptions made about the mechanism that produced the missing data.

In conclusion, imputation is an effective method for addressing missing data that helps preserve the statistical validity of a study and can yield a more accurate understanding of the phenomenon being studied. Researchers working with incomplete datasets should take this technique into account.

Correcting for missingness

Below are five steps to ensuring missing data are correctly identified and appropriately dealt with:

  1. Ensure your data are coded correctly.

  2. Identify missing values within each variable.

  3. Look for patterns of missingness.

  4. Check for associations between missing and observed data.

  5. Decide how to handle missing data.

NOTE: The examples below use a simulated linelist dataset from the Epi R Handbook.

Assessing missingness within your dataset

Percent of all data frame values that are missing

## [1] 6.688745

Percent of rows with any value missing

## [1] 69.12364

Percent of rows that are complete (no values missing)

## [1] 30.87636
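As a rough sketch of how the three percentages above can be computed, the base R snippet below runs the same calculations on a small made-up data frame. The `toy` data frame and its columns are hypothetical stand-ins, not the linelist, so the numbers differ from those reported above. The naniar package offers `pct_miss()`, `pct_miss_case()` and `pct_complete_case()` for the same purpose.

```r
# Hypothetical toy data, not the linelist used above
toy <- data.frame(
  age     = c(23, NA, 41, 35),
  temp    = c(36.8, 37.2, NA, 36.9),
  outcome = c("Recover", NA, "Death", "Recover")
)

# Percent of all data frame values (cells) that are missing
pct_cells_missing <- mean(is.na(toy)) * 100

# Percent of rows with at least one missing value
pct_rows_any_missing <- mean(!complete.cases(toy)) * 100

# Percent of rows that are complete (no values missing)
pct_rows_complete <- mean(complete.cases(toy)) * 100

print(c(pct_cells_missing, pct_rows_any_missing, pct_rows_complete))
```

The last two figures always sum to 100, which is a quick sanity check on the output (69.12 + 30.88 in the results above).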

Visualizing missingness

Show percentage of missingness in the dataset, for all variables

Percentage of missingness by factor variable

Show missingness of all variables by a factor variable of your choice

Below, I wanted to understand the percentage of missingness across the outcome and gender variables.
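The plots themselves are not reproduced here, but the numbers behind them can be sketched in base R. The example below assumes a hypothetical `toy` data frame with `gender`, `outcome` and `temp` columns; it computes the percentage of missing values per variable, and then the same percentages within each level of a factor variable. For the plotted versions, naniar provides `gg_miss_var()` and `gg_miss_fct()`.

```r
# Hypothetical toy data, not the linelist used above
toy <- data.frame(
  gender  = c("f", "m", "f", "m", "f", "m"),
  outcome = c("Recover", NA, "Death", NA, NA, "Recover"),
  temp    = c(36.8, NA, 37.5, 36.9, 37.1, 38.0)
)

# Percent missing for every variable
pct_by_var <- colMeans(is.na(toy)) * 100
print(round(pct_by_var, 1))

# Percent missing for every variable, within each level of gender.
# split() drops rows where the grouping variable itself is NA.
pct_by_gender <- sapply(
  split(toy, toy$gender),
  function(group) colMeans(is.na(group)) * 100
)
print(round(pct_by_gender, 1))
```

The result of the second step is a matrix with one row per variable and one column per level of the grouping variable.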

Heatplot of missingness across the entire data frame

A heatplot is another way to quickly visualize missingness across all variables in your dataset.

Note that the infector and source variables have about 35% missingness, while overall missingness in this dataset stands at 6.7%.

This seems easier to read than the lollipop plot above.

Heatplot of missingness only for a specific variable, chosen variable is differ

Sometimes you want to focus on a single variable. Use the select function to pick the variable of your choice; here I used the infector variable.
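A minimal sketch of both heatplots follows, assuming the naniar and dplyr packages are installed. It uses `airquality`, a dataset that ships with base R and has missing values in `Ozone` and `Solar.R`, as a stand-in for the linelist, so the plots will not match the ones described above.

```r
library(naniar)  # provides vis_miss()
library(dplyr)   # provides select() and the %>% pipe

# Heatplot of missingness across the entire data frame
heat_all <- vis_miss(airquality)

# Heatplot restricted to a single variable of interest
heat_one <- airquality %>%
  select(Ozone) %>%
  vis_miss()

print(heat_all)
print(heat_one)
```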

Explore and visualize missingness relationships

There are times when you would like to see how missingness in one variable relates to another. Here I used age in years and temperature.
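One simple, non-graphical way to look at such a relationship is to cross-tabulate the missingness indicators of the two variables. The sketch below does this in base R on a hypothetical `toy` data frame whose column names `age_years` and `temp` are assumptions. For a plotted version, naniar's `geom_miss_point()` is one option.

```r
# Hypothetical toy data, not the linelist used above
toy <- data.frame(
  age_years = c(5, NA, 20, 33, NA, 58, 64, NA),
  temp      = c(37.4, NA, 37.1, NA, 36.9, 36.7, 36.8, NA)
)

# Cross-tabulate: is age missing? is temperature missing?
miss_table <- table(
  age_missing  = is.na(toy$age_years),
  temp_missing = is.na(toy$temp)
)
print(miss_table)

# Share of rows with missing temperature, split by whether age is missing
prop_temp_missing <- tapply(is.na(toy$temp), is.na(toy$age_years), mean)
print(round(prop_temp_missing, 2))
```

If temperature is missing far more often when age is also missing, the two variables are probably not missing independently of each other, which matters when choosing how to handle them.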

Imputation

As introduced earlier, there are some rudimentary ways of imputing, such as replacing missing values with the mean, median, maximum, or minimum. These methods are deterministic and predictable, so they can distort the data and bias inference.

Regression imputation

A slightly more sophisticated approach is to fill in each missing value with a statistical model’s prediction of what it is likely to be.

Below, a simple temperature model is used to predict values only for the observations where temperature is missing.

A sample of the predicted values for missing temperature is displayed; only 10 rows are shown.

  Predictions for missing temp
  36.98
  36.97
  36.97
  36.98
  36.98
  36.98
  36.98
  36.97
  36.98
  36.98
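The idea can be sketched in base R as follows. This is an illustration under stated assumptions rather than the model that produced the predictions above: the `toy` data frame is made up, and using `age_years` as the only predictor is a choice made here for brevity.

```r
# Hypothetical toy data; age_years as the only predictor is an assumption
toy <- data.frame(
  age_years = c(5, 12, 20, 33, 41, 58, 64, 70),
  temp      = c(37.4, NA, 37.1, 36.9, NA, 36.7, 36.8, NA)
)

# lm() drops rows with a missing temp by default, so the model is
# fitted on the observed temperatures only
temp_model <- lm(temp ~ age_years, data = toy)

# Predict only for the rows where temp is missing
temp_is_missing <- is.na(toy$temp)
predicted_temp  <- predict(temp_model, newdata = toy[temp_is_missing, ])

# Keep the observed values and fill the gaps with the predictions
toy$temp_imputed <- toy$temp
toy$temp_imputed[temp_is_missing] <- predicted_temp

print(round(predicted_temp, 2))
```

Every imputed value sits exactly on the fitted regression line, so this approach understates the natural variability of the data. A row whose predictor is also missing would still receive `NA`.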

Last Observation Carried Forward (LOCF) and Baseline Observation Carried Forward (BOCF)

LOCF and BOCF are approaches for imputing time series or longitudinal data. LOCF substitutes the most recent observed value for each missing one, while BOCF substitutes the baseline (first) observation instead. When several values in a row are missing, LOCF repeats the most recent observed value across the whole run.

Imputing the missing years using the fill function

  Year  Male       Female
  2004  3,751,700  3,657,700
  2005  3,319,500  3,178,000
  2005  2,937,000  NA
  2007  2,638,800  2,570,100
  2008  NA         2,241,500
  2008  1,650,300  NA
  2010  NA         1,534,700

Imputing the male population using the fill function

  Year  Male       Female
  2004  3,751,700  3,657,700
  2005  3,319,500  3,178,000
  NA    2,937,000  NA
  2007  2,638,800  2,570,100
  2008  2,638,800  2,241,500
  NA    1,650,300  NA
  2010  1,650,300  1,534,700

Imputing the missing male population values in the “down” direction

  Year  Male       Female
  2004  3,751,700  3,657,700
  2005  3,319,500  3,178,000
  NA    2,937,000  NA
  2007  2,638,800  2,570,100
  2008  2,638,800  2,241,500
  NA    1,650,300  NA
  2010  1,650,300  1,534,700

Imputing the missing female population values using the “up” direction

  Year  Male       Female
  2004  3,751,700  3,657,700
  2005  3,319,500  3,178,000
  NA    2,937,000  2,570,100
  2007  2,638,800  2,570,100
  2008  NA         2,241,500
  NA    1,650,300  1,534,700
  2010  NA         1,534,700
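The four tables above can be approximated with `fill()` from the tidyr package, as sketched below. The data frame name `pop` is hypothetical, and the positions of the missing cells are inferred by comparing the four tables with one another, so treat the starting data as an assumption.

```r
library(tidyr)  # provides fill()

# Starting data; NA positions are inferred from the tables above
pop <- data.frame(
  Year   = c(2004, 2005, NA, 2007, 2008, NA, 2010),
  Male   = c(3751700, 3319500, 2937000, 2638800, NA, 1650300, NA),
  Female = c(3657700, 3178000, NA, 2570100, 2241500, NA, 1534700)
)

# Fill missing years with the previous observed year (default direction is "down")
year_filled <- fill(pop, Year)

# Fill missing male values; the default and an explicit "down" give the same result
male_default <- fill(pop, Male)
male_down    <- fill(pop, Male, .direction = "down")

# Fill missing female values from the next observed value instead
female_up <- fill(pop, Female, .direction = "up")

print(year_filled)
print(male_down)
print(female_up)
```

Filling "down" is the LOCF behaviour described earlier; filling "up" carries the next observation backward, which is why the female value for the third row matches the 2007 figure.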

Multiple Imputation

Multiple imputation creates several copies of the dataset, each with the missing values imputed to plausible values. Depending on your research data you might want more or fewer of these imputed datasets; the mice package sets the default number to 5.

Percentage of missingness

This is the percentage of missingness in the dataset we shall utilize for multiple imputation:

## [1] 4.954993

After multiple imputation, the percentage of missingness has reduced to:

## [1] 3.564878
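A before-and-after comparison of this kind can be sketched with the mice package as shown below. The example uses `nhanes`, a small dataset bundled with mice, rather than the data behind the figures above, so its percentages will differ. The seed value and the model `bmi ~ age` are arbitrary choices made for illustration.

```r
library(mice)

# Percentage of missing cells before imputation
pct_before <- mean(is.na(nhanes)) * 100

# Create m = 5 imputed datasets (the mice default); the seed makes the run repeatable
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Extract the first completed dataset and recompute the percentage
completed_1 <- mice::complete(imp, 1)
pct_after   <- mean(is.na(completed_1)) * 100
print(c(before = pct_before, after = pct_after))

# Fit the same model in every imputed dataset, then pool the estimates
fits   <- with(imp, lm(bmi ~ age))
pooled <- pool(fits)
print(summary(pooled))
```

Fitting the model in each imputed dataset and pooling the results is what separates multiple imputation from filling in the gaps once: the pooled standard errors reflect the uncertainty about the imputed values.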

For a more robust analysis when missing data is a significant concern, multiple imputation is a good solution that isn’t always much more work than doing a complete case analysis.