Applied statistics: Missing data handling tutorial

Missing data is not a trivial problem when analyzing a dataset, and handling it is rarely straightforward. If the amount of missing data is very small relative to the size of the dataset, leaving out the few samples with missing values may be a good solution that does not bias the analysis. However, leaving out available data discards some information, and depending on the situation you face, you may want to look for other fixes before throwing away potentially useful data. While quick solutions such as mean imputation may be adequate in some cases, such simple approaches usually introduce bias into the data. For instance, mean imputation leaves the mean unchanged (which is desirable) but decreases the variance, which may be undesirable.
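The effect of mean imputation on the mean and the variance can be checked on a small vector. The values below are made up purely for illustration and are not taken from any dataset:

```r
# Illustrative vector with two missing entries (hypothetical values)
x <- c(2, 4, NA, 8, 10, NA, 6)

# Mean imputation: replace each NA with the mean of the observed values
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

# The mean is preserved ...
mean(x, na.rm = TRUE)
mean(x_imputed)

# ... but the variance shrinks, because the imputed points sit exactly on the mean
var(x, na.rm = TRUE)
var(x_imputed)
```

In this toy case the mean stays at 6 while the variance drops from 10 to about 6.67: the sum of squared deviations is unchanged, but it is now divided by a larger sample size.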

On the other hand, the mice method imputes missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing data point, which can be a good solution.

In this tutorial, we use R's built-in ‘airquality’ dataset as a sample dataset.

We will identify missing data
We will visualize missing data
We will handle missing data: imputation, mice, …

# Load the airquality dataset
data("airquality")
help("airquality")

Description of the airquality dataset: Daily air quality measurements in New York, May to September 1973.

Format: A data frame with 153 observations on 6 variables.

[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1–12)
[,6] Day numeric Day of month (1–31)

Details: Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport.

1. Identification of Missing Data

Let us now check whether this dataset contains missing data (you can guess that this is the case ;) ). Please fill in the following code cells to answer the questions.

Q. Let us show some lines from this dataset in order to see what it looks like

# View the first few rows of the dataset
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Q. Produce a summary of the dataset. What do you see? How many features have missing data? How many values are missing for each feature?

# Summary of the dataset to check for missing values
summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

From the summary of the dataset, we observe that the features Ozone and Solar.R have missing values. Ozone has 37 missing entries and Solar.R has 7 missing entries. The other variables (Wind, Temp, Month, Day) do not contain any missing values. Therefore, in total, the dataset contains 44 missing values.

Q. We can use another R function in order to give us directly information about missing data: sapply

# Count missing values in each column
sapply(airquality, function(x) sum(is.na(x)))

##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0

# Overall missing value count
sum(is.na(airquality))

## [1] 44

We can see that there are 44 missing values in the dataset in total. The Ozone column has 37 missing values and the Solar.R column has 7, while the Wind, Temp, Month and Day columns have none.

We can also calculate the percentage of missing values in each column. This can be really useful for big and messy datasets. To learn more about the sapply function, see its help page (?sapply).

Q. Use the sapply function to get the percentage of missing values.

# Calculate the percentage of missing values in each column

sapply(airquality, function(x) mean(is.na(x)) * 100)

##     Ozone   Solar.R      Wind      Temp     Month       Day 
## 24.183007  4.575163  0.000000  0.000000  0.000000  0.000000
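The same counts and percentages can also be obtained without an anonymous function. The sketch below uses base R's colSums() and colMeans() on the logical matrix returned by is.na(); the names na_counts, na_percent and na_summary are illustrative choices:

```r
# Vectorised equivalents of the sapply() calls above
na_counts  <- colSums(is.na(airquality))                    # missing values per column
na_percent <- round(colMeans(is.na(airquality)) * 100, 1)   # percentage per column

# Combine both into one small summary table, most affected columns first
na_summary <- data.frame(n_missing = na_counts, pct_missing = na_percent)
na_summary[order(-na_summary$n_missing), ]
```

Keeping counts and percentages side by side makes it easier to decide which columns deserve attention in a wide dataset.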

Q. The mice package provides a function md.pattern() to get a better understanding of the pattern of missing data. Apply this function.

The output tells us that 111 samples are complete, 35 samples miss only the Ozone measurement, 5 samples miss only the Solar.R value, and 2 samples miss both. The last row gives the number of missing values per variable (37 for Ozone, 7 for Solar.R, 44 in total).

library(mice)

## Warning: package 'mice' was built under R version 4.3.3

## Warning in check_dep_version(): ABI version mismatch: 
## lme4 was built with Matrix ABI version 1
## Current Matrix ABI version is 0
## Please re-install lme4 from source or restore original 'Matrix' package

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

md.pattern(airquality)

##     Wind Temp Month Day Solar.R Ozone   
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44
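The pattern counts reported by md.pattern() can be cross-checked with base R alone. This is a minimal sketch that only looks at the two columns known to contain missing values; the name pattern_counts is illustrative:

```r
# Cross-tabulate missingness in the two affected columns
pattern_counts <- table(
  Ozone_missing   = is.na(airquality$Ozone),
  Solar.R_missing = is.na(airquality$Solar.R)
)
pattern_counts
```

The four cells correspond to the four rows of the md.pattern() output: both observed, only Ozone missing, only Solar.R missing, and both missing.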

Q. A perhaps more helpful visual representation can be obtained using the VIM package as follows. Take a look here to learn how to apply this function: https://cran.r-project.org/web/packages/VIM/vignettes/VisualImp.html The plot helps us understand that about 72.5% of the samples are not missing any information, about 23% are missing only the Ozone value, and the remaining ones show other missing patterns. With this approach the situation looks a bit clearer.

library(VIM) #install.packages("VIM")

## Warning: package 'VIM' was built under R version 4.3.3

## Loading required package: colorspace

## Warning: package 'colorspace' was built under R version 4.3.3

## Loading required package: grid

## VIM is ready to use.

## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep
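The listing above only loads VIM and does not show the plotting call itself. A typical way to produce such a plot is VIM's aggr() function; the argument values below (colours, label size, axis titles) are illustrative choices and not necessarily the ones used for the original figure:

```r
library(VIM)

# Aggregation plot: left panel = proportion missing per variable,
# right panel = combinations of missing/observed values across variables
aggr_plot <- aggr(
  airquality,
  col      = c("navyblue", "red"),   # observed vs missing
  numbers  = TRUE,                   # print the proportion of each combination
  sortVars = TRUE,                   # order variables by amount of missingness
  labels   = names(airquality),
  cex.axis = 0.7,
  gap      = 3,
  ylab     = c("Proportion of missing data", "Pattern")
)
```

The proportions printed next to the pattern panel are the ones discussed in the text (complete rows, rows missing only Ozone, and so on).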

2. Handle Missing Data

Let us now study two solutions for handling missing data: deleting missing data and imputing missing data. We will also compare the results in order to show the quality of each solution.

2.1. Deleting Missing Rows

One of the simplest approaches to address missing data in a dataset is to delete observations (instances or rows) that contain any missing values. This method, often referred to as “listwise deletion” or “complete case analysis,” involves removing entire records from the analysis if they are missing any data point in one or more features.

When to Consider Deleting Missing Rows?

Minimal Missing Data: If the missing data is slight and seemingly random, eliminating those incomplete observations is unlikely to significantly affect the dataset’s overall quality.
MCAR Data: Deletion is most appropriate when the missing data is Missing Completely At Random (MCAR), meaning there is no systematic difference between the missing and observed values.
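Whether missingness looks random can be explored informally before deleting anything. The sketch below is only a descriptive check, not a formal MCAR test: it compares the fully observed variables between rows with and without an Ozone value and tabulates missingness by month. The names ozone_missing, group_means and by_month are illustrative:

```r
# Flag rows where Ozone is missing
ozone_missing <- is.na(airquality$Ozone)

# Mean Wind and Temp for rows with (FALSE) and without (TRUE) an Ozone value
group_means <- sapply(
  airquality[c("Wind", "Temp")],
  function(v) tapply(v, ozone_missing, mean)
)
group_means

# Is the missingness spread evenly over the months?
by_month <- table(Month = airquality$Month, Ozone_missing = ozone_missing)
by_month
```

Large differences between the two groups, or missing values concentrated in particular months, would be a hint that the MCAR assumption is questionable and that deletion may bias the results.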

Q. Let us remove each row that contains a missing value. You can use the function na.omit (learn more about it here https://www.rdocumentation.org/packages/photobiology/versions/0.13.2/topics/na.omit)

# Remove rows with any missing value
airquality_no_missing <- na.omit(airquality)

Q. Let us check if there is any remaining missing value

# See the missing value now
sum(is.na(airquality_no_missing))

## [1] 0

Q. Check the new dimensions of the dataset (learn about dim here https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim)

# Get the dimensions of the dataset (rows and columns)
dim(airquality_no_missing)

## [1] 111   6

We can see that the new dataset contains no missing values; it keeps 111 of the original 153 rows.
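It is worth quantifying how much data listwise deletion costs. A minimal sketch using base R's complete.cases(), which flags the rows that na.omit() keeps (the names incomplete and same_rows are illustrative):

```r
# Rows with at least one missing value
incomplete <- !complete.cases(airquality)
sum(incomplete)                     # number of rows that na.omit() drops
round(mean(incomplete) * 100, 1)    # the same as a percentage of all rows

# complete.cases() selects the same rows as na.omit()
same_rows <- identical(
  rownames(airquality[!incomplete, ]),
  rownames(na.omit(airquality))
)
same_rows
```

Here 42 rows are dropped, which is more than a quarter of the dataset, even though only two of the six columns contain missing values.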

Note that many statistical software packages and functions designed for linear regression and similar models have built-in mechanisms to address missing data. Typically, these mechanisms involve automatically removing rows with missing values in any variable included in the model (listwise deletion). Therefore if you do not have specific needs or sensitive data, it may not be necessary to manually remove all the missing values.
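This built-in behaviour can be observed with lm(). The model formula below is only an illustrative choice; na.action = na.omit is written out explicitly, although it is normally the default setting in R:

```r
# Fit a linear model on the raw data, missing values included
fit <- lm(Ozone ~ Solar.R + Wind + Temp,
          data = airquality, na.action = na.omit)

nobs(fit)               # number of rows actually used in the fit
length(fit$na.action)   # number of rows silently dropped
```

The model is fitted on the same 111 complete rows that na.omit() returns, so the deletion happens whether or not it is done by hand.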

2.2. Imputation with Mean

When dealing with missing data, a common and straightforward approach is to fill in the missing values with the mean of the available values in the same variable. This method, known as mean imputation, involves calculating the average of the non-missing values for each variable and substituting that average for the missing entries.

When to Consider Mean Imputation?

Suitable for MCAR: Mean imputation is most effective when the data is Missing Completely At Random (MCAR). It assumes that the missing values, on average, are similar to the observed ones.
Preliminary Analysis: It can serve as a quick fix for preliminary analyses, allowing for a fuller utilization of the dataset, especially when dealing with minimal missingness.

Partial imputation: We can choose to impute only certain columns if the other columns have only trivial missingness. In our example, as the Ozone variable is the only one with a lot of missing values, we can perform mean imputation for this variable only. The Solar.R variable will then still contain 7 missing values (which is trivial).

Q. If you use a pre-existing function like this one https://www.rdocumentation.org/packages/missMethods/versions/0.4.0/topics/impute_mean, it will impute all variables containing missing values, which is not our objective here. We focus only on one variable with lots of missing data, the Ozone. So, find a way to perform this.

# For a single column
# Create a copy of the original dataframe
airquality_mean <- airquality
airquality_mean$Ozone[is.na(airquality_mean$Ozone)] <- 
  mean(airquality_mean$Ozone, na.rm = TRUE)

Q. Check whether the missing values have been imputed in the Ozone feature. We can see that the dataset now has only 7 missing values, all in the Solar.R column, which we chose not to impute.

# Count missing values in each column
sapply(airquality_mean, function(x) sum(is.na(x)))

##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       7       0       0       0       0

Whole dataset: We can also impute every column containing missing values in the dataset. Note that if you use the code below, you must make sure that all the columns with missing values are numeric. Fill in the following code cell to do that!

# For all columns in a dataset
airquality_allmean <- airquality

for (col in names(airquality_allmean)) {
  if (is.numeric(airquality_allmean[[col]])) {
    airquality_allmean[[col]][is.na(airquality_allmean[[col]])] <- 
      mean(airquality_allmean[[col]], na.rm = TRUE)
  }
}

Q. Let us check that :

# Count missing values in each column
sapply(airquality_allmean, function(x) sum(is.na(x)))

##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0

Check the quality of the applied solutions: original data vs. deletion vs. imputation of all variables

We can plot the density of the three data sources: original data, data with deleted instances, and data with imputation of all features. We can use the geom_density function (learn about it here https://ggplot2.tidyverse.org/reference/geom_density.html). We can see that, compared with the original dataset, the imputed dataset has a high density peak around 42 (the mean of the observed Ozone values), which is expected because the mean was imputed for all 37 missing Ozone entries.

# Density plots 
library(ggplot2)

# Original dataset
orig <- airquality

# Dataset with deleted missing values
deleted <- na.omit(airquality)

# Dataset with mean imputation
imputed <- airquality_allmean   # created earlier in section 2.2

# Combine into one dataframe for plotting
orig$Type <- "Original"
deleted$Type <- "Deleted"
imputed$Type <- "Imputed"

# Bind together
combined <- rbind(orig, deleted, imputed)

# Plot density for Ozone column as example
ggplot(combined, aes(x = Ozone, fill = Type)) +
  geom_density(alpha = 0.4) +
  labs(title = "Density Plot of Ozone: Original vs Deleted vs Imputed")

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_density()`).
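The spike visible in the density plot can also be quantified, together with a second side effect of mean imputation: correlations with other variables are pulled towards zero. The sketch below rebuilds the mean-imputed Ozone column so that it runs on its own; the variable names are illustrative:

```r
# Rebuild the mean-imputed Ozone column so this snippet is self-contained
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_mean_imp <- airquality$Ozone
ozone_mean_imp[is.na(ozone_mean_imp)] <- ozone_mean

# How many values now sit exactly on the mean? (this is the spike in the plot)
n_at_mean <- sum(ozone_mean_imp == ozone_mean)
n_at_mean

# Correlation between Ozone and Temp, before and after mean imputation
r_observed <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
r_imputed  <- cor(ozone_mean_imp, airquality$Temp)
round(c(observed = r_observed, mean_imputed = r_imputed), 3)
```

All 37 imputed entries share one identical value, and since they carry no information about Temp, the correlation computed on the imputed column cannot be stronger than the one computed on the observed pairs.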

2.3. Imputation with MICE (Multivariate Imputation by Chained Equations)

In this section we will use the R packages ‘mice’ and ‘naniar’ to do the imputation. See an introduction to the MICE package and installation guide here https://www.rdocumentation.org/packages/mice/versions/3.17.0

Getting started with naniar package here https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html

Let us load necessary packages

# Load the packages
library(mice)
library(ggplot2)
library(naniar) ##install.packages("naniar")

## Warning: package 'naniar' was built under R version 4.3.3

Q. Then we can visualize the missing data in each column using the vis_miss() function. From the graph, similarly, we can see that the other four columns do not have any missing values. 24% of Ozone is missing and 5% of Solar.R is missing.

# Visualize missing data
vis_miss(airquality)

Q. We will now use the mice package to impute the missing values. MICE means multivariate imputation by chained equations.

# Set the seed for reproducibility
set.seed(12345)

# Perform Multiple Imputation using MICE
imp <- mice(airquality, m = 5, method = "pmm", maxit = 50, seed = 12345)

## 
##  iter imp variable
##   1   1  Ozone  Solar.R
##   1   2  Ozone  Solar.R
##   1   3  Ozone  Solar.R
##   1   4  Ozone  Solar.R
##   1   5  Ozone  Solar.R
##   ... (the same five lines repeat for iterations 2 to 49) ...
##   50   1  Ozone  Solar.R
##   50   2  Ozone  Solar.R
##   50   3  Ozone  Solar.R
##   50   4  Ozone  Solar.R
##   50   5  Ozone  Solar.R

Explanation of the mice function:

Number of Imputations (m=?): The ‘m’ argument specifies how many complete datasets you wish to generate, each with missing values filled in. By setting m to 5, the function will create five versions of your dataset, each with missing values imputed differently. This multiplicity captures the uncertainty inherent in the imputation process.
Imputation Method: The ‘method’ argument dictates the statistical technique mice uses to predict missing values. PMM means Predictive Mean Matching, a non-parametric approach particularly suited for continuous data. PMM operates by finding observed cases whose predicted values are close to the prediction for the missing entry. The missing value is then imputed with the observed value of one of these donor cases, thus preserving the distribution and variance of the original data more effectively than simpler methods such as mean imputation. Other methods: cart (classification and regression trees), lasso.norm (lasso linear regression).
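The matching idea behind PMM can be sketched in a few lines of base R. This is a deliberately simplified illustration and not the implementation used by mice: it skips the random draw of regression coefficients, uses only the fully observed Wind and Temp as predictors, and assumes a pool of 5 donors per missing value. All names are illustrative:

```r
set.seed(1)  # donor selection is random

observed <- !is.na(airquality$Ozone)

# 1. Regress Ozone on fully observed predictors, using the observed rows only
fit <- lm(Ozone ~ Wind + Temp, data = airquality[observed, ])

# 2. Predict Ozone for every row, observed or not
predicted <- predict(fit, newdata = airquality)

# 3. For each missing row, find the 5 observed rows with the closest prediction
#    and copy the observed Ozone value of one of them, chosen at random
ozone_pmm <- airquality$Ozone
donor_values <- airquality$Ozone[observed]
for (i in which(!observed)) {
  distance <- abs(predicted[observed] - predicted[i])
  donors <- order(distance)[1:5]
  ozone_pmm[i] <- donor_values[sample(donors, 1)]
}

summary(ozone_pmm)
```

Because every imputed value is copied from a real observation, the imputations can never fall outside the observed range, which is the property referred to when PMM is said to preserve the distribution of the data.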

We can also check whether the imputation distorted the distribution of a variable too much.

# imputed_data
imputed_data <- complete(imp, 1)
head(imputed_data)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    37     148 14.3   56     5   5
## 6    28     149 14.9   66     5   6

After fitting the imputation model, examine the imputations by plotting the observed and imputed data together. Ideally, the imputed values are plausible compared to the observed values. Imputed values that are far away from the distribution of the observed values (not possible with pmm but possible with other methods) may indicate a problem with the imputation model, or perhaps that an MNAR model is needed. Imputed values which only span a subset of the distribution of the observed values are interesting in that they provide some information about the nature of the missing values that may assist in reducing missing data in future studies.
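Alongside the plots, plausibility can be checked numerically. The sketch below reruns mice with printFlag = FALSE to suppress the iteration log, then inspects the imputed Ozone values stored in the imp component of the result. The names imp_check, ozone_imputed and all_in_range are illustrative:

```r
library(mice)

# Rerun the imputation quietly so the snippet is self-contained
imp_check <- mice(airquality, m = 5, method = "pmm",
                  seed = 123, printFlag = FALSE)

# One row per originally missing Ozone value, one column per imputed dataset
ozone_imputed <- imp_check$imp$Ozone
dim(ozone_imputed)

# Range of the imputed values in each dataset vs range of the observed values
observed_range <- range(airquality$Ozone, na.rm = TRUE)
sapply(ozone_imputed, range)
observed_range

# With pmm every imputed value should lie inside the observed range
imputed_values <- unlist(ozone_imputed)
all_in_range <- all(imputed_values >= observed_range[1] &
                    imputed_values <= observed_range[2])
all_in_range
```

With other imputation methods this check can fail, which would be a reason to look more closely at the imputation model.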

The stripplot() function can be used to plot the observed and imputed values for continuous variables. The observed data are plotted (labeled as 0 on the x-axis) as well as the observed and imputed data together for each completed dataset (labeled as 1 to the number of imputations). The points are “jittered” to provide some spread, so it is easier to see the imputed values superimposed over the observed values.

Inspecting the distribution of original and imputed data

Q. Let us see the distribution of Ozone per imputed data set. Use the stripplot function, learn about it here https://www.rdocumentation.org/packages/lattice/versions/0.3-1/topics/stripplot In general, we would like the imputations to be plausible, i.e., values that could have been observed if they had not been missing.

# inspect quality of imputations
# Perform Multiple Imputation using mice
imputed_data <- mice(airquality, m = 5, method = "pmm", seed = 123)

## 
##  iter imp variable
##   1   1  Ozone  Solar.R
##   1   2  Ozone  Solar.R
##   1   3  Ozone  Solar.R
##   1   4  Ozone  Solar.R
##   1   5  Ozone  Solar.R
##   2   1  Ozone  Solar.R
##   2   2  Ozone  Solar.R
##   2   3  Ozone  Solar.R
##   2   4  Ozone  Solar.R
##   2   5  Ozone  Solar.R
##   3   1  Ozone  Solar.R
##   3   2  Ozone  Solar.R
##   3   3  Ozone  Solar.R
##   3   4  Ozone  Solar.R
##   3   5  Ozone  Solar.R
##   4   1  Ozone  Solar.R
##   4   2  Ozone  Solar.R
##   4   3  Ozone  Solar.R
##   4   4  Ozone  Solar.R
##   4   5  Ozone  Solar.R
##   5   1  Ozone  Solar.R
##   5   2  Ozone  Solar.R
##   5   3  Ozone  Solar.R
##   5   4  Ozone  Solar.R
##   5   5  Ozone  Solar.R

# Stripplot on mids object
stripplot(imputed_data, Ozone ~ .imp, pch = 20, cex = 1.2)

Let us look at another way to compare the distributions of original and imputed data. First of all, we can use a scatterplot and plot Ozone against all the other variables. What we would like to see is that the shape of the red points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed “plausible values”. You can use xyplot (learn about it here https://www.rdocumentation.org/packages/lattice/versions/0.13-4/topics/xyplot)

# xyplot plot
xyplot(imputed_data, Ozone ~ Wind + Temp + Solar.R | .imp,
       pch = 20, cex = 1.2, col = c("blue", "red"))

Q. Another helpful plot is the density plot https://www.rdocumentation.org/packages/car/versions/3.1-3/topics/densityPlot: The density of the imputed data for each imputed dataset is shown in magenta while the density of the observed data is shown in blue. Again, under our previous assumptions we expect the distributions to be similar.

# density plot
densityplot(imputed_data, ~ Ozone,
            plot.points = FALSE,
            main = "Density plot of observed vs imputed Ozone values")

Finally, we can get back the completed dataset using the complete() function. It is almost plain English. The missing values have been replaced with the imputed values in the first of the five datasets. If you wish to use another one, just change the second parameter in the complete() function.

completedData <- complete(imputed_data,1)
completedData

##     Ozone Solar.R Wind Temp Month Day
## 1      41     190  7.4   67     5   1
## 2      36     118  8.0   72     5   2
## 3      12     149 12.6   74     5   3
## 4      18     313 11.5   62     5   4
## 5      18     150 14.3   56     5   5
## 6      28      48 14.9   66     5   6
## 7      23     299  8.6   65     5   7
## 8      19      99 13.8   59     5   8
## 9       8      19 20.1   61     5   9
## 10     12     194  8.6   69     5  10
## 11      7     273  6.9   74     5  11
## 12     16     256  9.7   69     5  12
## 13     11     290  9.2   66     5  13
## 14     14     274 10.9   68     5  14
## 15     18      65 13.2   58     5  15
## 16     14     334 11.5   64     5  16
## 17     34     307 12.0   66     5  17
## 18      6      78 18.4   57     5  18
## 19     30     322 11.5   68     5  19
## 20     11      44  9.7   62     5  20
## 21      1       8  9.7   59     5  21
## 22     11     320 16.6   73     5  22
## 23      4      25  9.7   61     5  23
## 24     32      92 12.0   61     5  24
## 25     18      66 16.6   57     5  25
## 26     13     266 14.9   58     5  26
## 27     20       7  8.0   57     5  27
## 28     23      13 12.0   67     5  28
## 29     45     252 14.9   81     5  29
## 30    115     223  5.7   79     5  30
## 31     37     279  7.4   76     5  31
## 32     16     286  8.6   78     6   1
## 33     12     287  9.7   74     6   2
## 34     19     242 16.1   67     6   3
## 35     52     186  9.2   84     6   4
## 36      7     220  8.6   85     6   5
## 37     21     264 14.3   79     6   6
## 38     29     127  9.7   82     6   7
## 39     64     273  6.9   87     6   8
## 40     71     291 13.8   90     6   9
## 41     39     323 11.5   87     6  10
## 42     64     259 10.9   93     6  11
## 43     61     250  9.2   92     6  12
## 44     23     148  8.0   82     6  13
## 45     30     332 13.8   80     6  14
## 46     13     322 11.5   79     6  15
## 47     21     191 14.9   77     6  16
## 48     37     284 20.7   72     6  17
## 49     20      37  9.2   65     6  18
## 50     12     120 11.5   73     6  19
## 51     13     137 10.3   76     6  20
## 52     23     150  6.3   77     6  21
## 53     85      59  1.7   76     6  22
## 54     37      91  4.6   76     6  23
## 55     23     250  6.3   76     6  24
## 56     29     135  8.0   75     6  25
## 57     47     127  8.0   78     6  26
## 58     44      47 10.3   73     6  27
## 59     46      98 11.5   80     6  28
## 60     11      31 14.9   77     6  29
## 61     78     138  8.0   83     6  30
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 65     18     101 10.9   84     7   4
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 72     52     139  8.6   82     7  11
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 75     40     291 14.9   91     7  14
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 83      7     258  9.7   81     7  22
## 84     29     295 11.5   82     7  23
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     39      83  6.9   81     8   1
## 94      9      24 13.8   81     8   2
## 95     16      77  7.4   82     8   3
## 96     78     213  6.9   86     8   4
## 97     35     295  7.4   85     8   5
## 98     66     191  4.6   87     8   6
## 99    122     255  4.0   89     8   7
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 102   110     222  8.6   92     8  10
## 103    28     137 11.5   86     8  11
## 104    44     192 11.5   86     8  12
## 105    28     273 11.5   82     8  13
## 106    65     157  9.7   80     8  14
## 107    16      64 11.5   79     8  15
## 108    22      71 10.3   77     8  16
## 109    59      51  6.3   79     8  17
## 110    23     115  7.4   76     8  18
## 111    31     244 10.9   78     8  19
## 112    44     190 10.3   78     8  20
## 113    21     259 15.5   77     8  21
## 114     9      36 14.3   72     8  22
## 115    22     255 12.6   75     8  23
## 116    45     212  9.7   79     8  24
## 117   168     238  3.4   81     8  25
## 118    73     215  8.0   86     8  26
## 119   122     153  5.7   88     8  27
## 120    76     203  9.7   97     8  28
## 121   118     225  2.3   94     8  29
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4
## 128    47      95  7.4   87     9   5
## 129    32      92 15.5   84     9   6
## 130    20     252 10.9   80     9   7
## 131    23     220 10.3   78     9   8
## 132    21     230 10.9   75     9   9
## 133    24     259  9.7   73     9  10
## 134    44     236 14.9   81     9  11
## 135    21     259 15.5   76     9  12
## 136    28     238  6.3   77     9  13
## 137     9      24 10.9   71     9  14
## 138    13     112 11.5   71     9  15
## 139    46     237  6.9   78     9  16
## 140    18     224 13.8   67     9  17
## 141    13      27 10.3   76     9  18
## 142    24     238 10.3   68     9  19
## 143    16     201  8.0   82     9  20
## 144    13     238 12.6   64     9  21
## 145    23      14  9.2   71     9  22
## 146    36     139 10.3   81     9  23
## 147     7      49 10.3   69     9  24
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    27     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30
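complete() returns a single one of the m imputed datasets. To make use of all of them, and therefore of the uncertainty that the multiple imputations are meant to capture, mice provides with() to fit a model on each dataset and pool() to combine the results. The regression formula below is only an illustrative choice, and the imputation is rerun quietly so that the snippet is self-contained:

```r
library(mice)

# Rerun the imputation without the iteration log
imp_all <- mice(airquality, m = 5, method = "pmm",
                seed = 123, printFlag = FALSE)

# Fit the same model on each of the 5 completed datasets
fits <- with(imp_all, lm(Ozone ~ Solar.R + Wind + Temp))

# Combine the 5 sets of estimates into one set of pooled estimates
pooled <- pool(fits)
pooled_summary <- summary(pooled)
pooled_summary
```

The pooled standard errors account for both the sampling variability within each dataset and the variability between imputations, which an analysis of a single completed dataset would ignore.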
