Introduction

Big data is known for its velocity, volume, value, variety, and veracity. As such, the field of data science has emerged to make sense of large amounts of data by generating useful insights. However, whether in storage, processing, or analysis, missing data is a problem that data scientists commonly encounter, especially when datasets are incomplete due to non-response or drop-out.1

Missing data is defined as any data value that is not stored for a variable in an observation of interest. Missing data can be problematic for several reasons: (1) it reduces statistical power, (2) it biases the estimation of parameters, (3) it reduces the representativeness of the samples, and (4) it complicates the analyses of studies. Incomplete data can weaken the ability of data scientists to mine for insights that inform sensible policy decisions, clinical predictions, or statistical inference. As such, it is essential that data handling, in the form of data imputation, is done properly to ensure that subsequent data analysis is reliable.1

However, the approach to data handling is often confusing and may vary with the intended use of the data. By studying the performance of data imputation methods across different data characteristics and analysis aims, we can make better decisions about which methods are appropriate for dealing with missing data. In this essay, we will cover the types of missing data, how to handle them appropriately, and some recommendations moving forward.1

Types of Missing Data

Missing data are often described according to the reasons for their missingness. Based on the mechanism of missingness, we assume three types of missing data.1,4,5

1. Missing Completely at Random (MCAR)

MCAR means that the probability that a value is missing is related neither to the specific value that was supposed to be obtained nor to the set of observed responses.1 Examples of MCAR include data that are missing by design, or missing due to equipment failure, samples lost in transit, or technically unsatisfactory measurements.1

As a rule of thumb, the assumption that missing data are MCAR is easy to check: if one can predict which units have missing data, using common sense or regression, then the data are most likely not MCAR. A more formal and accurate way of testing for MCAR is Little’s MCAR test.2 In the test, a p value of less than 0.05 is usually interpreted as evidence that the missing data are not MCAR.2
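One implementation of Little’s test is the mcar_test() function from the naniar package (an assumption on tooling; other packages offer equivalents). A minimal sketch, using the airquality dataset that appears later in this essay:

```r
# Little's MCAR test via the naniar package (one available implementation)
# install.packages("naniar")
library(naniar)

# Returns the chi-squared statistic, degrees of freedom, and p value;
# p < 0.05 suggests the data are not MCAR
mcar_test(airquality)
```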

An advantage of MCAR is that it does not introduce bias into statistical analyses. Although statistical power is decreased, the estimated parameters are not biased as a result of the missing data.2

2. Missing at Random (MAR)

“Missing at random” is often misleading in its phrasing, because MAR data exhibit a systematic pattern of missingness rather than the “random” pattern the name suggests. Data are assumed to be MAR when their missingness can be predicted from other observed variables and does not depend on any unobserved variable.1

MAR is often viewed as a spectrum, depending on how well other observed variables can explain missingness. An example of MAR in its purest form: suppose a dataset contains two test scores, test_1 and test_2, from sequential tests, with test_1 taken before test_2, and students who scored higher than 85 on test_1 are excused from test_2. Supposing no students dropped out between the tests, the missingness of data in test_2 would be entirely due to the test_1 variable.2
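This mechanism is easy to simulate; here is a hypothetical sketch (the scores are randomly generated, and only the cut-off of 85 comes from the example above):

```r
# Hypothetical simulation of pure MAR: test_2 is missing exactly when
# the fully observed test_1 exceeds 85
set.seed(123)
test_1 <- round(rnorm(100, mean = 75, sd = 10))
test_2 <- round(rnorm(100, mean = 75, sd = 10))
test_2[test_1 > 85] <- NA  # missingness depends only on the observed test_1
```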

However, in a large dataset, missingness on a variable may be significantly related to another observed variable, yet the relation may be so weak, owing to a small effect size, that missingness cannot be predicted from that variable. This is the threshold that differentiates MAR from MNAR: MAR still allows missingness to be predicted from other observed variables, however weakly, whereas MNAR data are insufficiently related to the other observed variables in the dataset.2

We might be tempted to consider MAR unproblematic on the grounds that it produces no statistical bias. However, the missingness that MAR data produce cannot be neglected: methods that ignore it, such as complete-case analysis, can still yield biased estimates.

3. Missing Not at Random (MNAR)

MNAR is the most complicated of the three assumed mechanisms of missing data. In some literature, it is also termed Not Missing at Random (NMAR). MNAR means that the probability of a value being missing varies for reasons unknown to us. When the characteristics of missing data meet those of neither MCAR nor MAR, the data are categorised as MNAR. In the typical MNAR case, the missingness is directly related to the value that is missing.2

Examples:

  • A student skips a tutorial lesson because the student knows that attendance will not be graded.
  • A researcher collects data from a weighing scale.
    • Suppose that the weighing mechanism deteriorates over time, producing more missing data as time goes on.
    • When heavier objects are measured later on, the distribution of the measurements will be shifted and unreliable.

As such, this missingness of data is often difficult to diagnose and handle.2

Handling Missing Data

Ad-hoc Methods

1. Listwise Deletion

The most common approach data scientists use to deal with missing data is simply to omit the cases with missing data and analyse the rest of the dataset. This method is known as listwise deletion or complete-case analysis.3 The na.omit() function in R removes all cases with one or more missing values from a dataset.

Let’s create a new small dataframe as an example to demonstrate listwise deletion.
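A sketch of one way to build it (the values are taken from the printout below):

```r
# Small example data frame with one incomplete case
df <- data.frame(
  Name = c("John", "Tim", NA),
  Sex  = c("men", "men", "women"),
  Age  = c(45, 53, NA)
)
```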

A look at the dataframe df:

##   Name   Sex Age
## 1 John   men  45
## 2  Tim   men  53
## 3 <NA> women  NA

Perform listwise deletion using na.omit(df):

##   Name Sex Age
## 1 John men  45
## 2  Tim men  53

Now we can see that the na.omit() function removes all cases (rows) that contain any NAs.

The greatest advantage of this method is its convenience. However, even if the data are MCAR, listwise deletion yields standard errors and significance levels that are correct for the reduced dataset but often larger than those for the full dataset with its missing values.3 This way of dealing with missing data can therefore be wasteful: if cases are removed, one must be aware of the decreased ability to detect the true effect of the variables of interest.3

Moreover, if the data are not MCAR, complete-case analysis can severely bias estimates of means, regression coefficients, and correlations, and listwise deletion may produce nonsensical subsamples.3

Consider the larger airquality dataset, which records daily air quality measurements in New York from May to September 1973 and contains NA values in several of its variables. The 153 rows of the dataset represent consecutive days, so deleting rows breaks the continuity of time, which may affect any time series analyses.3

Let’s have a closer look at the airquality dataset.

Firstly, let’s peek at the top six cases of the dataset:
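Since airquality ships with base R, this is simply:

```r
# Inspect the first six rows of the built-in airquality dataset
head(airquality)
```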

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

We can already observe some NA values within the dataset.

Next, we call na.omit(airquality) to remove the cases that contain NAs.

Lastly, let’s look at the result:

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8

Again, we see that every row that contained an NA in any variable was removed from the dataframe; rows 5 and 6, which held the NAs above, are gone.

2. Pairwise Deletion

Available-case analysis, or pairwise deletion, is an attempt to fix the data loss caused by listwise deletion. In this method, the means and (co)variances are calculated from all the observed data: for variables X and Y, each mean is based on all observed values of the respective variable, and each covariance on all cases where both are observed.3 The resulting summary statistics can then be fed into a program for further regression analysis.

Let’s demonstrate pairwise deletion using the airquality dataset as an example!

Load and peek at the first 6 rows of the airquality dataset:

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

We notice that there are some NAs in the dataset. We first calculate the means and covariances of the airquality data under pairwise deletion in R:
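A sketch following van Buuren’s example,3 where the three-variable subset is inferred from the output below:

```r
# Pairwise deletion: each statistic is computed from all cases observed
# for the variable (or pair of variables) involved
dat <- airquality[, c("Ozone", "Solar.R", "Wind")]
mu  <- colMeans(dat, na.rm = TRUE)  # per-variable means
cv  <- cov(dat, use = "pairwise")   # pairwise covariances
```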

Let’s have a look at the means mu:

##      Ozone    Solar.R       Wind 
##  42.129310 185.931507   9.957516

Let’s have a look at the covariances cv:

##              Ozone    Solar.R      Wind
## Ozone   1088.20052 1056.58346 -70.93853
## Solar.R 1056.58346 8110.51941 -17.94597
## Wind     -70.93853  -17.94597  12.41154

Next, instead of the lm() function, which cannot accept summary statistics as input, we use the lavaan() function from the lavaan package, which takes the means and covariances as input.
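A sketch following van Buuren’s example,3 regressing Ozone on Wind and Solar.R from the summary statistics computed above:

```r
library(lavaan)

# Fit the regression from the pairwise means and covariances rather than
# from the raw data; the residual variance of Ozone is left free
fit <- lavaan("Ozone ~ 1 + Wind + Solar.R
               Ozone ~~ Ozone",
              sample.mean = mu,
              sample.cov  = cv,
              sample.nobs = sum(complete.cases(dat)))
summary(fit)
```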

## lavaan 0.6-5 ended normally after 21 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                          4
##                                                       
##   Number of observations                           111
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0

This method of dealing with missing data is simple, and with MCAR data it still generates consistent estimates of means, correlations, and covariances.4 Despite these strengths, its application is restricted to MCAR data: if the data are not MCAR, the estimates may be biased. In addition, pairwise deletion assumes numerical variables that are approximately normally distributed, which may not hold in reality, where datasets contain variables of many mixed types.3

Therefore, the pairwise deletion method is best used to handle missing data when the data follow an approximately normal distribution, the correlations between the observed variables are low, and the missing data are MCAR in nature.

3. Mean Imputation

Some data scientists or statisticians may look for a quick fix by replacing missing data with the mean; the mode is often used instead to impute categorical data.

For example, suppose we want to impute the mean for the missing values in the airquality dataset. Here, we use the R package mice: the argument method = "mean" specifies mean imputation, m = 1 requests a single imputed dataset, and maxit = 1 runs a single iteration. The theoretical rationale for using the mean to impute missing data is that the mean is a good estimate for an observation randomly selected from a normal distribution.

Now, we shall attempt to impute the mean for the Ozone and Solar.R variables of the airquality dataset!

Firstly, we shall load the mice and mi packages (install them from CRAN first using install.packages("mice") and install.packages("mi")).

The means of the variables Ozone and Solar.R can be imputed with the mice() function:
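A sketch of the call, with the arguments described above:

```r
library(mice)

# Mean imputation: one imputed dataset (m = 1), one iteration (maxit = 1)
imp <- mice(airquality, method = "mean", m = 1, maxit = 1)
```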

## 
##  iter imp variable
##   1   1  Ozone  Solar.R

Let’s look at a histogram and scatterplot of what the airquality dataset looks like after mean imputation:
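The figures referred to here, and in the later sections, can be approximated with the lattice-based plotting methods that ship with mice; this is a sketch, and the original plots may have been drawn differently:

```r
# Scatterplot of Ozone against Solar.R from the imputed object; mice
# colours imputed values red and observed values blue by default
library(lattice)
xyplot(imp, Ozone ~ Solar.R)
```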

In the histogram shown above, the red bar representing the mean-imputed data stands out from the rest of the distribution. We can also expect the standard deviation of the imputed data to be lower than that of the observed data. In the scatterplot of the relation between Ozone and Solar.R, the mean imputations have changed the correlation coefficient between the two variables.

However, mean imputation distorts the distribution of the data as a whole. In a dataset whose missing values are not completely random, this method may result in a biased analysis.

4. Regression Imputation

In regression imputation, we take what we know from other variables to produce smarter imputations. This is done by first building a linear model of the observed data; the missing data are then predicted using the fitted model.3

Cases of MNAR data remain problematic. The only way to obtain an unbiased estimate of the parameters in such a case is to model the missing data; the model may then be incorporated into a more complex one for estimating the missing values.3

Let’s try the regression imputation method on the airquality dataset!

Firstly, let’s build the linear regression model and make predictions:
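A sketch using mice, whose "norm.predict" method fits the linear model and fills in the predictions in one step (the two-variable subset and the seed are assumptions):

```r
# Regression imputation: missing Ozone values are replaced by their
# predictions from a linear model fitted on Solar.R
dat <- airquality[, c("Ozone", "Solar.R")]
imp <- mice(dat, method = "norm.predict", m = 1, maxit = 1, seed = 1)
```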

Next, let’s plot the histogram of the imputed values and the original observed values

Here, we focus on the scatterplot, because we cannot derive much from the histogram when comparing the distributions of the two data types (red for imputed values, blue for original values).

Imputing these predicted values affects the correlation: the red imputed values fall exactly on the regression line, so among themselves they have a correlation of 1. When the original and imputed values are combined, the correlation between Ozone and Solar.R increases.

Imputed values are based on estimates from the model, so they vary less than the observed values in the actual dataset. Moreover, by imputing predicted values from a regression model, we also affect the correlation: correlations are biased upwards.3

However, while regression imputation can produce very realistic-looking imputations, one must always be careful with its use, because it may artificially strengthen the apparent effects in the data and eventually introduce false positives and spurious relations.3

5. Stochastic Regression Imputation

Essentially, stochastic regression imputation is a refinement of regression imputation that incorporates noise into the predictions.

Let’s see an example of imputing Ozone from Solar.R by stochastic regression imputation!

Firstly, we subset Ozone and Solar.R from the original airquality dataset, then use the mice() function to impute Ozone from Solar.R:
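A sketch of the call (the seed is an arbitrary choice):

```r
# Stochastic regression imputation: regression predictions plus random
# draws from the residual distribution
dat <- airquality[, c("Ozone", "Solar.R")]
imp <- mice(dat, method = "norm.nob", m = 1, maxit = 1, seed = 1)
```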

Next, we plot a simple histogram and scatterplot to look at the difference between the distributions of the imputed vs the original data

In the plots above, we can see that the stochastic regression imputation method spreads out the distribution of imputed values.

The argument method = "norm.nob" applies plain stochastic regression: it first estimates the intercept, the slope, and the residual variance under the linear model, then calculates the predicted values, and finally adds to each prediction a random draw from the residuals.3

However, this imputation method has its consequences: a negative value has been imputed into our dataset. Such values can arise from the residual distribution, or from applying a linear model to data that cannot be negative in reality; if the variable cannot take negative values, these imputed values are not plausible. In addition, the higher end of the distribution is not well covered by the imputations. Done well, however, stochastic regression imputation not only preserves the regression weights but also retains the correlation between the variables in the dataset. This method is the basis of more advanced imputation techniques.3

6. Last Observation Carried Forward (LOCF)

LOCF is an imputation method for longitudinal data in which the previous observed value is carried forward to replace a missing value. It is best used with confidence in cases where the data scientist or statistician knows what the missing data should be. Although this method of imputation is favoured by the US Food and Drug Administration for imputing clinical trial data, LOCF results still need to be put through a proper statistical analysis comparing the real and imputed data. The method is especially useful for data that will be plotted in a time series analysis.

Let’s go through a short demonstration of the LOCF method on the airquality dataset!

To perform the LOCF imputation method, we simply run the fill() function from the tidyr package on the dataset, naming the variable with the missing data:
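A sketch of the call (airquality_locf is an illustrative name):

```r
library(tidyr)

# Carry each last observed Ozone value forward into the missing entries
airquality_locf <- fill(airquality, Ozone, .direction = "down")
```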

We then look at a time-series plot of Ozone against time in days:

The above time-series plot looks at the first 80 days of the Ozone variable. Again, the red dots represent the newly imputed values, and the blue dots represent the original data.

LOCF can be a good method of choice because it produces a complete dataset and suits cases where we know what the missing values should be. However, it requires proper subsequent statistical analyses to determine the difference between the real and imputed data before drawing any insights from deeper data analysis.

Multiple Imputation

Multiple imputation is a process with three main steps: imputation, analysis, and pooling. It differs from the single imputation methods above because it substitutes missing data with credible values while retaining the variability and uncertainty of the original data, which gives the imputed data valid statistical inference.3

Steps for Multiple Imputation:

  • Firstly, in the imputation step, the process is repeated to generate multiple imputed datasets.1
  • Secondly, each imputed dataset is analysed with the standard statistical methods used for complete data.1
  • Lastly, the results from these analyses are combined into a single overall result for the pooled dataset.1

The characteristic of natural variability is incorporated by imputing values predicted from variables that are correlated with the missing data. The characteristic of uncertainty is captured by imputing the missing data multiple times and observing the variability between the imputations.1

Multiple imputation has been shown to produce valid statistical inference that reflects the uncertainty associated with the estimation of the missing data. Furthermore, it turns out to be robust to violations of the normality assumption and produces appropriate results even with small sample sizes or large amounts of missing data.1

Now let’s have a look at how to perform multiple imputation on the airquality data set!

Firstly, let’s impute the dataset with the mice() function and fit a linear model with the lm() function:
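A sketch of the standard mice workflow (defaults assumed; the exact estimates depend on the seed and the mice version):

```r
library(mice)

# Step 1 (imputation): generate multiple imputed datasets (mice default m = 5)
imp <- mice(airquality, seed = 1, printFlag = FALSE)

# Step 2 (analysis): fit the same linear model to each completed dataset
fit <- with(imp, lm(Ozone ~ Wind + Temp + Solar.R))
```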

Secondly, let’s have a look at the summary of the pooled coefficients of the model:
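The pooling step combines the per-dataset estimates using Rubin’s rules; a sketch follows (the layout of the table below suggests it may have been produced by an older mice version, so the printed column names can differ):

```r
# Step 3 (pooling): combine the estimates across the imputed datasets
summary(pool(fit))
```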

##                 Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) -64.34207893 23.05472435 -2.790841 6.226638e-03
## Wind         -3.33359131  0.65440710 -5.094063 1.515934e-06
## Temp          1.65209291  0.25352979  6.516366 2.423506e-09
## Solar.R       0.05982059  0.02318647  2.579979 1.123664e-02

Thirdly, let’s look at a histogram and scatterplot to visualise our imputed and actual data

Upon observation, the blue (original data) and red (imputed data) distributions are similar in the histogram, and this time there are no negative values as seen in the previous distributions.

The red imputed points look like what could have been measured if they were not missing.

Lastly, let’s connect the dots of the original data values of the variable Ozone

The above graph shows the entire completed Ozone data. Note that the connecting lines are drawn only between original values, not for imputed values.

The model recognises the low Ozone values within the first 20 days, imputes higher values on the hot and sunny days from day 20 to 40, and expects moderate Ozone values from day 40 to 80. This is driven by the other information available in the airquality dataset: wind, temperature, and sunshine.3

Recommendations

1. Prevention

The key to dealing with missing data is prevention. Missing data can greatly decrease the statistical power of a study, so researchers should, first and foremost, pay close attention to their data collection methods in order to prevent missing data. Imputation and deletion techniques should only be applied once the efforts to prevent missing data at collection have been exhausted.

2. Understand the Nature of Missing Data

As described above, each type of missing data, be it MCAR, MAR, or MNAR, has to be dealt with differently. A wrong choice of data handling method may lead to false statistical conclusions.

3. Understand the Use of Data

While it is important to know the nature of the missing data, it is also important to know what the data will be used for. For example, if the data will be used for a time series analysis, simple listwise deletion will not work, because it removes dates/times from the data.

4. Understand the Mechanisms of Data Handling Methods

With the many data handling methods available, it is easy for statisticians or the average data analyst to brush off the problem of missing data with a method of choice. However, one should always take extra care to understand the mechanism behind each data handling method, so as not to lose statistical power or important data unnecessarily.

5. Check the Data After Deletion/Imputation

It is important to check back with a simple plot to see what difference the imputed data values make. That way, if spurious results appear in the insights gained from deeper data analysis, the data scientist can backtrack and resolve any issues with the data handling. It may also be useful to check whether any plotted outliers stem from imputed rather than observed data.

Conclusion

In conclusion, researchers should always understand the nature of the missingness in their data (MCAR, MAR, or MNAR) and then deal with it using an appropriate data handling method. Generally, the multiple imputation technique is regarded as the most widely used and best approach for dealing with missing data.

Moving forward, one should always take care to (1) prevent missing data, (2) understand its nature, (3) understand the usage of the data, (4) understand the mechanisms of data handling methods, and to (5) always check the data after handling.

References

  1. Kang, H. (2013). The prevention and handling of the missing data. Korean Journal of Anesthesiology, 64(5), 402–406. doi:10.4097/kjae.2013.64.5.402

  2. Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198-1202.

  3. van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC. Retrieved from https://stefvanbuuren.name/fimd/sec-simplesolutions.html

  4. Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: Wiley.

  5. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.1093/biomet/63.3.581