Correlation matrix of missing values.

Intro: What is missing data?

Missing data can be both simple and complex. The concepts are simple to understand, but they are surprising in breadth and depth. Fortunately, the complexity can be easily overcome with telling examples and explanations. Before peeling back the layers of missing data, let’s establish some common ground.

A dataset has columns representing variables and rows representing instances. The columns can be further classified as explanatory variables (i.e., predictors or features) or response variables (i.e., outcomes). In a modeling problem, you use the values of the explanatory variables to predict an outcome of interest. If any of those values are blank, or otherwise absent from the dataset, they are considered missing data. You may be wondering why I called missing data complex when this paragraph only explained something you already knew or could have easily figured out. With this common ground in place, we can now explore types of missing data, patterns of missingness, and what they mean for your modeling project.

Types of missing data

Above we introduced the columns of your dataset as variables and the rows as the instances of data. Unsurprisingly, missing data can be defined as “column-wise” or “row-wise” depending on the perspective. This distinction plays an important role in deciding what to do, if anything, about the missing data. We will start with column-wise missing data, which is what we more commonly think of when we say missing data. Let us say we have a dataset with three explanatory variables (age, gender, and number of sexual partners) and one outcome of interest (HIV status). To calculate the proportion of missing data for, say, gender, we would count the number of rows in which gender is missing and divide this by the total number of rows in the dataset.
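The calculation described above can be sketched in a few lines of pandas. The toy values below are made up purely for illustration, and the column names (age, gender, n_partners, hiv_status) are assumptions chosen to mirror the example. The sketch also counts missing values per row, which is one way to look at the row-wise perspective.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN/None marks a missing value.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 27, 45],
    "gender": ["F", None, "M", None, "F"],
    "n_partners": [2, 1, 3, np.nan, 4],
    "hiv_status": [0, 1, 0, 0, 1],
})

# Column-wise: fraction of rows that are missing in each column.
missing_by_column = df.isna().mean()

# Row-wise: number of missing values in each row.
missing_by_row = df.isna().sum(axis=1)

print(missing_by_column)
print(missing_by_row)
```

Here gender is missing in 2 of 5 rows, so its column-wise missing proportion is 0.4.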

Patterns of missingness

Now, not all missing data is created equal. While knowing the amount or percentage of missing data is essential, it is not sufficient. You must also consider the pattern of missingness. Depending on the pattern, you may need to do nothing, use statistics to impute the missing values, or drop the variable altogether. Patterns of missingness can be defined as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).

MNAR

MNAR means the missingness is the result of the value of the missing variable itself. Let us say we have three predictors (age, temperature, and white blood cell count, or WBC) and one outcome (post-surgical infection). In this example, the health care providers measure the temperature of all patients. However, when a patient’s temperature is normal, they are less likely to document the number in the medical chart. The reason, then, for the temperature column to have a missing value has to do with the value of the patient’s temperature itself. MNAR missingness is the most problematic for modeling.

MAR

MAR means the missingness can be explained by another variable in the dataset. Keeping with the example above, let’s say health care providers are more likely to order a WBC count for patients who have a suspected infection. In that case, patients with a temperature above 38 °C would be more likely to have their WBC count measured than patients with a normal temperature. Whether the WBC column has a missing value depends on the value of another column, temperature. MAR missingness is less problematic than MNAR and can be handled with different modeling techniques. However, MAR data can still introduce bias into your model.

MCAR

You never want missing data, but if you have it, you want it to be MCAR. MCAR means the missingness cannot be explained by any variable in the dataset, including the variable itself. Again, using our example of age, WBC count, and temperature to predict infection, let’s say we have missing data in the age column. If this missingness were not explained by a patient’s temperature or WBC count, or by the age value itself, then the missing data would be MCAR. MCAR can be handled by imputation, and there is less concern about introducing bias into your model than with MAR or MNAR. Unfortunately, missing data is most often not MCAR. Example causes of MCAR data are equipment failure or research samples lost in transit.
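To make the three patterns concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the distributions, the 38 °C fever cutoff, the missingness probabilities, and the link between temperature and WBC count are invented, not taken from real clinical data. The same temperature-dependent rule produces MAR when it hides WBC values and MNAR when it hides temperature values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical patients: temperature in degrees Celsius, and a WBC count
# that tends to be higher when temperature is higher.
temperature = rng.normal(37.5, 0.8, n)
wbc = 9.0 + 2.0 * (temperature - 37.5) + rng.normal(0.0, 3.0, n)
fever = temperature > 38.0

# MCAR: every WBC value has the same 20% chance of being missing.
mcar_wbc_missing = rng.random(n) < 0.20

# MAR: WBC is missing more often when another column (temperature) is normal.
mar_wbc_missing = rng.random(n) < np.where(fever, 0.05, 0.40)

# MNAR: temperature is missing more often when temperature itself is normal.
mnar_temp_missing = rng.random(n) < np.where(fever, 0.05, 0.40)


def rate_by_fever(missing):
    """Return the missing rate for (normal temperature, fever) patients."""
    return float(missing[~fever].mean()), float(missing[fever].mean())


print("MCAR WBC missing rate (normal, fever):", rate_by_fever(mcar_wbc_missing))
print("MAR WBC missing rate (normal, fever):", rate_by_fever(mar_wbc_missing))
print("True mean temperature:", temperature.mean())
print("Mean of recorded temperatures (MNAR):", temperature[~mnar_temp_missing].mean())
print("True mean WBC:", wbc.mean())
print("Mean of recorded WBC (MAR):", wbc[~mar_wbc_missing].mean())
```

In this sketch the MCAR missing rate is about the same with or without fever, while under MAR and MNAR the values that remain over-represent febrile patients, so simple averages of the recorded values drift upward.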

Why is it a problem?

Missing data, ubiquitous in global health research, present challenges to any statistical analysis, including predictive models derived from machine learning. These challenges include reductions in the amount of usable data, statistical power, and representativeness, along with biased estimates.(1)

What can you do about it?

Complete case analysis

Complete case analysis, also referred to as listwise deletion, is much like it sounds – you keep the rows in your dataset which have a value for all columns. Rows with one or more missing values are dropped from the dataset. To use this approach on your dataset, the pattern of missingness must be MCAR (one can test for MCAR using the TestMCARNormality function from the MissMech package in R). If the missing data are not MCAR, listwise deletion may introduce bias into your analysis.
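As a minimal sketch of listwise deletion, pandas' `dropna()` with its default arguments keeps only rows that have a value in every column. The data below are hypothetical, and the sketch does not check the MCAR assumption discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each of three rows.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 27, 45],
    "temperature": [36.8, np.nan, 38.4, 37.1, 39.0],
    "wbc": [7.2, 11.5, 13.1, np.nan, 15.4],
    "infection": [0, 0, 1, 0, 1],
})

# Listwise deletion: drop every row that has at least one missing value.
complete_cases = df.dropna()
rows_dropped = len(df) - len(complete_cases)

print(complete_cases)
print("Rows dropped:", rows_dropped)
```

Only three values are missing in total, yet three of the five rows are dropped, which illustrates how quickly this approach can shrink a dataset.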

Pairwise deletion

Pairwise deletion is similar to listwise deletion; however, only rows with missing values in the variables (columns) of interest are deleted. Missing data elsewhere in the dataset does not result in row deletion. As a result, pairwise deletion preserves more of your data compared to listwise deletion.(1)
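A rough sketch of the difference, using the same kind of hypothetical data: restricting `dropna()` to the columns used in a given analysis keeps rows that are missing values elsewhere. The choice of temperature and wbc as the "columns of interest" is an assumption for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each of three rows.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 27, 45],
    "temperature": [36.8, np.nan, 38.4, 37.1, 39.0],
    "wbc": [7.2, 11.5, 13.1, np.nan, 15.4],
    "infection": [0, 0, 1, 0, 1],
})

# Listwise deletion: a row is dropped if any column is missing.
listwise = df.dropna()

# Pairwise deletion for an analysis of temperature vs. WBC:
# a row is dropped only if one of these two columns is missing.
pairwise = df.dropna(subset=["temperature", "wbc"])

# pandas' corr() excludes missing values pair by pair, so each
# coefficient uses every row where both columns are present.
pairwise_corr = df[["age", "temperature", "wbc"]].corr()

print(len(listwise), "rows kept by listwise deletion")
print(len(pairwise), "rows kept by pairwise deletion")
print(pairwise_corr)
```

The row with a missing age survives pairwise deletion here because age is not part of the temperature and WBC analysis.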

Simple imputation

Mean, Median, Mode

This approach substitutes a missing value with the mean, median, or mode of the known values of a variable or column. As a general rule of thumb, missing categorical data should be replaced with the mode, and missing continuous data should be replaced with the median (avoid using the mean, as it is influenced by outliers and the median is not). The advantage of this approach is that it is quick and simple. The disadvantage is that these options reduce the variance of your data.
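The rule of thumb above can be sketched with pandas' `fillna()`. The values are invented for illustration, with one deliberately extreme age (90) to show why the median is preferred over the mean for continuous data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a continuous column with an outlier, and a categorical column.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 90.0, np.nan],
    "gender": ["F", "M", "F", None, "F", "M"],
})

imputed = df.copy()

# Continuous variable: fill with the median of the observed values.
imputed["age"] = df["age"].fillna(df["age"].median())

# Categorical variable: fill with the most frequent observed category.
imputed["gender"] = df["gender"].fillna(df["gender"].mode().iloc[0])

# The variance of the imputed column is smaller than that of the observed values.
var_before = df["age"].var()
var_after = imputed["age"].var()

print(imputed)
print("Mean of observed ages:", df["age"].mean(), "median:", df["age"].median())
print("Variance before:", var_before, "after:", var_after)
```

The observed mean age is 45 because of the outlier, while the median is 32.5. Filling both gaps with the same value also shrinks the variance, which is the drawback noted above.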

Linear regression

Linear regression is a more robust method of simple imputation to replace missing values. First, a correlation matrix is used to identify predictors (other columns in the dataset) of the variable that has missing values. Once the best predictors are identified, a regression equation is created where the independent variables are the columns with known values and the dependent variable is the column with missing values. Initially, the regression is run on cases with complete data to generate the regression equation. Then the equation is used to predict the missing value for incomplete cases. The disadvantages of this approach are that (1) the imputed values are derived from the data, so the values may overfit the data, and (2) there is an assumption that the variables in the regression equation follow a linear relationship.
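The fit-then-predict steps can be sketched as follows. This is a simplified illustration under stated assumptions: the data are invented and perfectly linear, there is a single predictor (temperature), and the predictor-selection step via a correlation matrix is skipped. `np.polyfit` with `deg=1` is used here as one simple way to fit a straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: WBC is missing in two rows, temperature is complete.
df = pd.DataFrame({
    "temperature": [36.5, 37.0, 37.5, 38.0, 38.5, 39.0],
    "wbc": [6.0, 7.0, np.nan, 9.0, np.nan, 11.0],
})

observed = df["wbc"].notna()

# Step 1: fit the regression on complete cases only.
slope, intercept = np.polyfit(
    df.loc[observed, "temperature"].to_numpy(),
    df.loc[observed, "wbc"].to_numpy(),
    deg=1,
)

# Step 2: use the fitted equation to predict WBC for the incomplete cases.
imputed = df.copy()
imputed.loc[~observed, "wbc"] = intercept + slope * df.loc[~observed, "temperature"]

print(imputed)
```

Because every imputed value falls exactly on the fitted line, the imputed data look more strongly related than real data would, which is one way to see the overfitting concern mentioned above.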

Multiple imputation

KNN (K Nearest Neighbors)

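In this technique, the K rows most similar to the row with the missing value (its nearest neighbors, as measured by a distance metric) are identified, and their observed values for the missing variable are averaged. This average is then used to impute the missing value. You can specify the number of nearest neighbors and the distance metric.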
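One readily available implementation is scikit-learn's `KNNImputer`, sketched below on invented data with two columns (temperature and WBC count). The choice of K = 2 and the toy values are assumptions for illustration; `nan_euclidean` is a Euclidean distance that skips missing coordinates.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: columns are temperature and WBC count.
# The last row has a normal temperature and a missing WBC count.
X = np.array([
    [36.6, 6.0],
    [36.8, 7.0],
    [38.9, 14.0],
    [39.1, 15.0],
    [36.7, np.nan],
])

# Average the WBC values of the 2 nearest rows, weighting them equally.
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The two rows closest in temperature to the incomplete row have WBC counts of 6.0 and 7.0, so the imputed value is their average, 6.5. With real data, columns on very different scales would usually be standardized first so that no single column dominates the distance.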

What to do with missing data?

  • Drop that observation
  • Drop that feature
  • Cool ways to visualize if missing variables are correlated
  • R vignette for visualizing missing data
  • Other ways to visualize missing data
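One simple way to check whether missingness in different variables is correlated is to convert each column to a 0/1 missing indicator and compute the correlation matrix of those indicators. The sketch below uses simulated data in which temperature and WBC count are deliberately made to be missing together; the column names, rates, and random seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1_000

# Hypothetical complete data.
df = pd.DataFrame({
    "age": rng.normal(50, 15, n),
    "temperature": rng.normal(37.5, 0.8, n),
    "wbc": rng.normal(9.0, 3.0, n),
})

# Make temperature and WBC tend to be missing in the same rows,
# and make age missing independently of both.
shared = rng.random(n) < 0.25
df.loc[shared, "temperature"] = np.nan
df.loc[shared & (rng.random(n) < 0.8), "wbc"] = np.nan
df.loc[rng.random(n) < 0.10, "age"] = np.nan

# 1 where a value is missing, 0 where it is present.
indicators = df.isna().astype(int)

# Correlation matrix of the missingness indicators.
missing_corr = indicators.corr()

print(missing_corr.round(2))
```

A high correlation between two indicators suggests the two variables tend to be missing in the same rows, which is a hint that the missingness may not be MCAR. The resulting matrix can be passed to any heatmap plotting function for a visual version.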

Resources and references

  • Some background on patterns of missingness and imputation techniques
  • Vignette on MICE
  • Light intro to missing data in machine learning with some visualization examples
  • Gallery of missing data visualizations