Assessing missingness of data in research
Introduction
Missingness
Missing data, the absence of observations or responses in a dataset, is a frequent problem in research and data analysis. It can arise for many reasons, including survey non-response, data entry mistakes, or dropout from a longitudinal study. Missing data must be handled carefully because failing to do so can produce biased findings and incorrect inferences.
To maintain the validity and integrity of their studies, researchers must deal with missing data deliberately. Gaps can be filled using methods such as imputation, which estimates missing values from the observed data. It is also important to perform a sensitivity analysis to determine how missingness affects the study results.
Fundamentally, recognizing and compensating for missing data is not only a statistical requirement but also a critical step in producing trustworthy research. It helps ensure that results accurately reflect the population under study and contribute meaningfully to scientific or social understanding.
Imputation
A statistical approach known as “imputation” is used to handle missing data by “imputing” or predicting the values that are missing based on the observed data. For several reasons, it is essential to data analysis and research.
First, imputation helps maintain sample size and statistical power. Simply dropping cases with missing values can drastically reduce the sample size, which may result in underpowered analyses and unreliable findings. Imputation lets researchers keep as much of the data as they can.
Second, imputation can lessen bias and help preserve the dataset’s representativeness. Without it, the analysis is restricted to people or cases with complete information, who may differ systematically from the rest, which can yield inaccurate estimates. Imputed values help restore that balance so that the dataset better represents the whole population.
Imputation can be done in a number of ways, such as mean imputation, regression imputation, and multiple imputation. The strategy chosen depends on the type of data and on the assumptions made about the missing-data mechanism, for example whether values are missing completely at random, at random, or not at random.
In conclusion, imputation is an effective way to address missing data. Applied appropriately, it helps preserve the statistical validity of a study and gives a more accurate picture of the phenomenon being studied. Researchers working with incomplete datasets should take this technique into account.
Correcting for missingness
Below are five steps to ensure missing data are correctly identified and appropriately dealt with:
1. Ensure your data are coded correctly.
2. Identify missing values within each variable.
3. Look for patterns of missingness.
4. Check for associations between missing and observed data.
5. Decide how to handle missing data.
NOTE: The analysis below uses a simulated (pseudo) linelist dataset from the Epi R Handbook.
Assessing missingness within your dataset
Percent of all data frame values that are missing
## [1] 6.688745
Percent of rows with any value missing
## [1] 69.12364
Percent of rows that are complete (no values missing)
## [1] 30.87636
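The three summary figures above can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not the R code that produced the output; the toy records, the column names, and the helper function names are all assumptions made for the example.

```python
# Toy data: each dict is one row, and None marks a missing value.
# Column names and values are made up for illustration.
rows = [
    {"age": 34, "gender": "f", "temp": 36.8},
    {"age": None, "gender": "m", "temp": 37.1},
    {"age": 51, "gender": None, "temp": None},
    {"age": 28, "gender": "f", "temp": 36.9},
]


def pct_values_missing(rows):
    """Percent of all cells in the table that are missing."""
    cells = [value for row in rows for value in row.values()]
    return 100 * sum(value is None for value in cells) / len(cells)


def pct_rows_with_missing(rows):
    """Percent of rows that have at least one missing value."""
    incomplete = sum(any(v is None for v in row.values()) for row in rows)
    return 100 * incomplete / len(rows)


def pct_rows_complete(rows):
    """Percent of rows with no missing values at all."""
    return 100 - pct_rows_with_missing(rows)


print(pct_values_missing(rows))     # 3 of the 12 cells are missing
print(pct_rows_with_missing(rows))  # 2 of the 4 rows have a gap
print(pct_rows_complete(rows))      # the remaining rows are complete
```

The two row-level percentages always sum to 100, just as 69.12 and 30.88 do in the output above.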
Visualizing missingness
Show percentage of missingness in the dataset, for all variables
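The numbers behind a per-variable missingness plot are simple to compute. Here is a minimal sketch in plain Python (not the R code used for the plot); the records and the function name `pct_missing_per_variable` are hypothetical.

```python
def pct_missing_per_variable(rows):
    """Map each variable name to the percent of its values that are missing."""
    variables = rows[0].keys()
    return {
        var: 100 * sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }


# Hypothetical records; None marks a missing value.
records = [
    {"infector": None, "source": None, "age": 22},
    {"infector": "c1", "source": "other", "age": 35},
    {"infector": None, "source": "funeral", "age": None},
    {"infector": "c7", "source": "other", "age": 60},
]

# Rank the variables so the ones with the most missingness come first.
ranked = sorted(pct_missing_per_variable(records).items(), key=lambda item: -item[1])
for name, pct in ranked:
    print(f"{name:10s} {pct:5.1f}%")
```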
Percentage of missingness by factor variable
Show missingness of all variables by a factor variable of your choice
Below, I look at the percentage of missingness for the outcome and gender variables.
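Splitting missingness by a factor amounts to computing the per-variable percentages inside each level of that factor. A rough sketch in plain Python, assuming made-up records and a hypothetical helper `pct_missing_by_group`; rows whose grouping value is itself missing are kept as their own group.

```python
from collections import defaultdict


def pct_missing_by_group(rows, group_key, variables):
    """Return {group: {variable: percent missing}} for the chosen variables."""
    grouped = defaultdict(list)
    for row in rows:
        # A missing grouping value (None) forms its own group.
        grouped[row[group_key]].append(row)
    return {
        group: {
            var: 100 * sum(r[var] is None for r in members) / len(members)
            for var in variables
        }
        for group, members in grouped.items()
    }


# Hypothetical linelist rows; None marks a missing value.
cases = [
    {"outcome": "Recover", "gender": "f", "age": 30},
    {"outcome": "Recover", "gender": None, "age": 41},
    {"outcome": "Death", "gender": "m", "age": None},
    {"outcome": "Death", "gender": None, "age": None},
    {"outcome": None, "gender": "f", "age": 12},
]

by_outcome = pct_missing_by_group(cases, "outcome", ["gender", "age"])
for outcome, percentages in by_outcome.items():
    print(outcome, percentages)
```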
Heatplot of missingness across the entire data frame
A heatplot is another way to quickly visualize missingness across all variables in your dataset.
The infector and source variables each have about 35% missingness, while overall missingness in this data stands at 6.7%.
This reads much better than the lollipop plot above.
Heatplot of missingness for a specific variable
Sometimes you want to focus on a single variable. Use the select function to pick the variable of your choice; here I used the infector variable.
Explore and visualize missingness relationships
There are times when you would like to see how missingness relates across two variables. Here I used age in years and temperature.
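One simple numeric companion to such a plot is to summarise one variable separately for the rows where the other is missing and where it is observed. The sketch below is plain Python with made-up values; the column names `age_years` and `temp` and the helper `compare_by_missingness` are assumptions for illustration.

```python
from statistics import mean


def compare_by_missingness(rows, target, other):
    """Summarise `other` separately for rows where `target` is missing vs observed."""
    when_missing = [r[other] for r in rows if r[target] is None and r[other] is not None]
    when_observed = [r[other] for r in rows if r[target] is not None and r[other] is not None]
    return {
        "n_missing": len(when_missing),
        "mean_when_missing": mean(when_missing) if when_missing else None,
        "n_observed": len(when_observed),
        "mean_when_observed": mean(when_observed) if when_observed else None,
    }


# Hypothetical patients: temperature is missing mostly for young children.
patients = [
    {"age_years": 2, "temp": None},
    {"age_years": 4, "temp": None},
    {"age_years": 30, "temp": 37.2},
    {"age_years": 40, "temp": 36.9},
    {"age_years": None, "temp": 38.0},
    {"age_years": 50, "temp": 36.5},
]

summary = compare_by_missingness(patients, target="temp", other="age_years")
print(summary)
```

A large gap between the two means, as in this toy data, hints that temperature is not missing completely at random with respect to age.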
Imputation
As introduced earlier, there are some rudimentary ways of imputing, such as filling in the mean, median, maximum, or minimum. These methods replace every missing value with the same predictable value, which can lead to erroneous results and bias inference.
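To see why a single fill-in value is a problem, here is a small sketch of mean imputation in plain Python. The temperatures and the helper name `impute_constant` are invented for the example.

```python
from statistics import mean, variance


def impute_constant(values, statistic=mean):
    """Replace every missing value (None) with one statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = statistic(observed)
    return [fill_value if v is None else v for v in values]


# Made-up temperatures with two gaps.
temps = [36.5, None, 37.5, None, 38.5, 39.5]
observed = [t for t in temps if t is not None]
imputed = impute_constant(temps)

print(imputed)
# The mean is unchanged, but the spread shrinks because the fills sit at the centre.
print(variance(observed), variance(imputed))
```

Passing `statistic=median`, `max`, or `min` gives the other constant-value variants; all of them understate the variability of the data in the same way.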
Regression imputation
A slightly more sophisticated approach is to fill in each missing value with a statistical model’s prediction of what it is likely to be.
Below, a simple temperature model is used to predict values only for the observations where temperature is missing.
A sample of the predicted values for missing temperatures is displayed; only 10 rows are shown.
Predictions for missing temp |
|---|
36.98 |
36.97 |
36.97 |
36.98 |
36.98 |
36.98 |
36.98 |
36.97 |
36.98 |
36.98 |
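
The idea can be sketched in plain Python with a one-predictor least-squares line. The original does not say which predictors the temperature model used, so regressing temperature on age here is purely an assumption, as are the data and the function names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def regression_impute(rows, target, predictor):
    """Fit on complete rows, then predict `target` only where it is missing."""
    complete = [r for r in rows if r[target] is not None and r[predictor] is not None]
    intercept, slope = fit_line(
        [r[predictor] for r in complete], [r[target] for r in complete]
    )
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[target] is None and new_row[predictor] is not None:
            new_row[target] = round(intercept + slope * new_row[predictor], 2)
        filled.append(new_row)
    return filled


# Hypothetical data: temperature is missing for the 40-year-old.
data = [
    {"age": 10, "temp": 37.0},
    {"age": 20, "temp": 37.2},
    {"age": 30, "temp": 37.4},
    {"age": 40, "temp": None},
    {"age": 50, "temp": 37.8},
]

for row in regression_impute(data, target="temp", predictor="age"):
    print(row)
```

Observed temperatures are left as they are; only the gap is replaced by the fitted value.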
Last Observation Carried Forward (LOCF) and Baseline Observation Carried Forward (BOCF)
LOCF and BOCF are approaches for imputing time series or longitudinal data. LOCF substitutes the most recent observed value for each missing value; when several consecutive values are missing, each one is replaced by the last value observed before the gap. BOCF instead substitutes the baseline (first) observation for every missing value.
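As a minimal sketch in plain Python, the two rules differ only in which earlier value they reuse. The series below is a made-up set of repeated measurements for one subject, and the sketch assumes the baseline itself was observed.

```python
def locf(series):
    """Last Observation Carried Forward: reuse the most recent observed value."""
    filled, last_seen = [], None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


def bocf(series):
    """Baseline Observation Carried Forward: reuse the first (baseline) value."""
    baseline = series[0]  # assumes the baseline measurement is not missing
    return [baseline if value is None else value for value in series]


# Hypothetical repeated measurements; None marks a missed visit.
visits = [120, 115, None, None, 108, None]

print(locf(visits))  # gaps take the value from the visit just before them
print(bocf(visits))  # gaps take the baseline value
```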
Imputing missing years using the fill function
| Year | Male | Female |
|---|---|---|
| 2004 | 3,751,700 | 3,657,700 |
| 2005 | 3,319,500 | 3,178,000 |
| 2005 | 2,937,000 | |
| 2007 | 2,638,800 | 2,570,100 |
| 2008 | | 2,241,500 |
| 2008 | 1,650,300 | |
| 2010 | | 1,534,700 |
Imputing the male population using the fill function
| Year | Male | Female |
|---|---|---|
| 2004 | 3,751,700 | 3,657,700 |
| 2005 | 3,319,500 | 3,178,000 |
| | 2,937,000 | |
| 2007 | 2,638,800 | 2,570,100 |
| 2008 | 2,638,800 | 2,241,500 |
| | 1,650,300 | |
| 2010 | 1,650,300 | 1,534,700 |
Imputing the missing male population values in the “down” direction
| Year | Male | Female |
|---|---|---|
| 2004 | 3,751,700 | 3,657,700 |
| 2005 | 3,319,500 | 3,178,000 |
| | 2,937,000 | |
| 2007 | 2,638,800 | 2,570,100 |
| 2008 | 2,638,800 | 2,241,500 |
| | 1,650,300 | |
| 2010 | 1,650,300 | 1,534,700 |
Imputing the missing female population values using the “up” direction
| Year | Male | Female |
|---|---|---|
| 2004 | 3,751,700 | 3,657,700 |
| 2005 | 3,319,500 | 3,178,000 |
| | 2,937,000 | 2,570,100 |
| 2007 | 2,638,800 | 2,570,100 |
| 2008 | | 2,241,500 |
| | 1,650,300 | 1,534,700 |
| 2010 | | 1,534,700 |
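
The directional behaviour shown in the tables above can be mimicked in plain Python. This is a sketch rather than the fill function itself: the helper below is hypothetical, and the input values are read off the tables above, with None standing in for the empty cells.

```python
def fill(rows, column, direction="down"):
    """Copy the nearest observed value of `column` into its missing cells.

    "down" carries earlier values forward; "up" carries later values backward.
    """
    if direction not in ("down", "up"):
        raise ValueError("direction must be 'down' or 'up'")
    ordered = rows if direction == "down" else list(reversed(rows))
    filled, last_seen = [], None
    for row in ordered:
        new_row = dict(row)  # copy so the input table is not modified
        if new_row[column] is None:
            new_row[column] = last_seen
        else:
            last_seen = new_row[column]
        filled.append(new_row)
    return filled if direction == "down" else list(reversed(filled))


# Values taken from the tables above; None marks an empty cell.
population = [
    {"Year": 2004, "Male": 3751700, "Female": 3657700},
    {"Year": 2005, "Male": 3319500, "Female": 3178000},
    {"Year": None, "Male": 2937000, "Female": None},
    {"Year": 2007, "Male": 2638800, "Female": 2570100},
    {"Year": 2008, "Male": None, "Female": 2241500},
    {"Year": None, "Male": 1650300, "Female": None},
    {"Year": 2010, "Male": None, "Female": 1534700},
]

years_down = [r["Year"] for r in fill(population, "Year")]
male_down = [r["Male"] for r in fill(population, "Male", direction="down")]
female_up = [r["Female"] for r in fill(population, "Female", direction="up")]

print(years_down)
print(male_down)
print(female_up)
```

Filling down never fills a gap in the first row, and filling up never fills one in the last row, because there is no neighbouring value in that direction to copy.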
Multiple Imputation
When you do multiple imputation, you create multiple datasets in which the missing values are imputed with plausible values. Depending on your research data you might want to create more or fewer of these imputed datasets; the mice package sets the default number to 5.
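The workflow can be illustrated with a deliberately crude sketch in plain Python. Here each missing value is replaced by a random draw from the observed values, which is only a stand-in for the chained-equation models that mice fits; the data, the seed, and the function name are assumptions. Each imputed dataset is then analysed separately, and the estimates are averaged into a pooled point estimate.

```python
import random
from statistics import mean


def multiply_impute(values, m=5, seed=1):
    """Create m completed copies of `values`, each with its own random fills.

    Simplification: fills are random draws from the observed values,
    not predictions from an imputation model.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [
        [rng.choice(observed) if v is None else v for v in values]
        for _ in range(m)
    ]


# Made-up temperatures with three gaps.
temps = [36.6, None, 37.4, 38.1, None, 36.9, 37.8, None]

datasets = multiply_impute(temps)         # m = 5 completed datasets
estimates = [mean(d) for d in datasets]   # analyse each dataset separately
pooled = mean(estimates)                  # pooled point estimate

print(len(datasets), "imputed datasets")
print("per-dataset means:", [round(e, 3) for e in estimates])
print("pooled mean:", round(pooled, 3))
```

The spread of the per-dataset estimates reflects the extra uncertainty that comes from not knowing the missing values, which a single imputation hides.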
Percentage of missingness
This is the percentage of missingness in the dataset we shall utilize for multiple imputation:
## [1] 4.954993
After multiple imputation, the percentage of missingness has reduced to:
## [1] 3.564878
For a more robust analysis when missing data is a significant concern, multiple imputation is a good solution that isn’t always much more work than a complete case analysis.