Missing data is not a trivial problem when analyzing a dataset, and handling it is rarely straightforward. If the amount of missing data is very small relative to the size of the dataset, leaving out the few samples with missing values may be the best strategy in order not to bias the analysis. However, leaving out available data (even a few samples) discards some amount of information, and depending on the situation you face, you may want to look for other fixes before throwing away potentially useful data points. While quick solutions such as mean imputation may be acceptable in some cases, such simple approaches usually introduce bias into the data. For instance, mean imputation leaves the mean unchanged (which is desirable) but decreases the variance, which may be undesirable.
On the other hand, the mice method helps you impute missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing data point, which can be a good solution.
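The effect of mean imputation on the mean and the variance can be checked on a small synthetic example. The sketch below uses simulated values; the vector `x`, its size and the 30% missingness rate are illustrative assumptions and are not part of the tutorial's data.

```r
# Illustrative only: simulated data, not the airquality dataset
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)
x[sample(200, 60)] <- NA            # remove 30% of the values at random

# Mean imputation: replace every NA with the mean of the observed values
x_imp <- x
x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)

# The mean is preserved, but the variance shrinks because the
# imputed points add no spread around the mean
c(mean_observed = mean(x, na.rm = TRUE), mean_imputed = mean(x_imp))
c(var_observed  = var(x, na.rm = TRUE),  var_imputed  = var(x_imp))
```

The more values are imputed this way, the stronger the shrinkage of the variance.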
In this tutorial, we use the built-in R dataset ‘airquality’ as a sample dataset.
# Load the airquality dataset
data("airquality")
help("airquality")
## starting httpd help server ... done
Description of the airquality dataset: Daily air quality measurements in New York, May to September 1973.
Format: A data frame with 153 observations on 6 variables.
Details: Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.
Let us now check whether this dataset contains missing data (you can guess that this is the case ;) ). Please fill in the following code cells to answer the questions.
Q. Let us show a few rows of this dataset in order to see what it looks like.
# View the first few rows of the dataset
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Q. Produce a summary of the dataset. What do you see? How many features have missing data? How many values are missing for each feature?
# Summary of the dataset to check for missing values
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
From the summary of the dataset, we observe that the features Ozone and Solar.R have missing values. Ozone has 37 missing entries and Solar.R has 7 missing entries. The other variables (Wind, Temp, Month, Day) do not contain any missing values. Therefore, in total, the dataset contains 44 missing values.
Q. We can use another R function, sapply, to get information about missing data directly.
# Count missing values in each column
sapply(airquality, function(x) sum(is.na(x)))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
# Overall missing value count
sum(is.na(airquality))
## [1] 44
We can see that there are 44 missing values in total in the dataset. The Wind, Temp, Month and Day columns have no missing data, the Ozone column has 37 missing values, and the Solar.R column has 7.
We can also calculate the percentage of missing values in each column. This can be really useful for big and messy datasets. To learn more about the sapply function, see its help page (?sapply).
Q. Use the sapply function to get the percentage of missing values.
# Calculate the percentage of missing values in each column
sapply(airquality, function(x) mean(is.na(x)) * 100)
## Ozone Solar.R Wind Temp Month Day
## 24.183007 4.575163 0.000000 0.000000 0.000000 0.000000
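As a possible alternative to the two sapply() calls, counts and percentages can be gathered in a single table with colSums() and colMeans(). This is only a sketch; the object name `missing_summary` is an illustrative choice.

```r
# is.na() on a data frame returns a logical matrix, so column-wise
# sums and means give counts and proportions directly
missing_summary <- data.frame(
  n_missing   = colSums(is.na(airquality)),
  pct_missing = round(colMeans(is.na(airquality)) * 100, 2)
)

# Sort so that the most incomplete variables come first
missing_summary[order(-missing_summary$n_missing), ]
```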
Q. The mice package provides a function md.pattern() to get a better understanding of the pattern of missing data. Apply this function.
The output tells us that 111 samples are complete, 35 samples miss only the Ozone measurement, 5 samples miss only the Solar.R value, and 2 samples miss both.
library(mice)
## Warning: package 'mice' was built under R version 4.3.3
## Warning in check_dep_version(): ABI version mismatch:
## lme4 was built with Matrix ABI version 1
## Current Matrix ABI version is 0
## Please re-install lme4 from source or restore original 'Matrix' package
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
md.pattern(airquality)
## Wind Temp Month Day Solar.R Ozone
## 111 1 1 1 1 1 1 0
## 35 1 1 1 1 1 0 1
## 5 1 1 1 1 0 1 1
## 2 1 1 1 1 0 0 2
## 0 0 0 0 7 37 44
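The row counts in the md.pattern() table can be cross-checked with base R alone. The following sketch labels each row by the variables it is missing and tabulates the labels; the names `na_matrix` and `pattern` are illustrative.

```r
# Logical matrix: TRUE where a value is missing
na_matrix <- is.na(airquality)

# Label each row by the variables it is missing
pattern <- apply(na_matrix, 1, function(row) {
  if (any(row)) paste(colnames(na_matrix)[row], collapse = " + ") else "complete"
})

# Number of rows per missingness pattern
table(pattern)
```

The table should show the same four patterns as md.pattern(): complete rows, rows missing only Ozone, rows missing only Solar.R, and rows missing both.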
Q. A perhaps more helpful visual representation can be obtained using the VIM package as follows. Take a look here in order to learn how to apply this function https://cran.r-project.org/web/packages/VIM/vignettes/VisualImp.html The plot helps us understand that about 72% of the samples are not missing any information, about 23% are missing only the Ozone value, and the remaining ones show other missing patterns. With this approach the situation looks a bit clearer.
library(VIM) #install.packages("VIM")
## Warning: package 'VIM' was built under R version 4.3.3
## Loading required package: colorspace
## Warning: package 'colorspace' was built under R version 4.3.3
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
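One way to produce such a plot is the aggr() function from VIM. The call below is a sketch: the colours, axis labels and the object name `aggr_plot` are illustrative choices, not requirements.

```r
library(VIM)

# Left panel: proportion of missing values per variable
# Right panel: combinations of missing (red) and observed (blue) values
aggr_plot <- aggr(airquality,
                  col = c("navyblue", "red"),
                  numbers = TRUE,    # print the proportion of each combination
                  sortVars = TRUE,   # order variables by amount of missingness
                  labels = names(airquality),
                  cex.axis = 0.7, gap = 3,
                  ylab = c("Proportion of missing data", "Pattern"))
```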
Let us now study two solutions for handling missing data: deleting missing data and imputing missing data. We will also compare them in order to assess the quality of each solution.
One of the simplest approaches to address missing data in a dataset is to delete observations (instances or rows) that contain any missing values. This method, often referred to as “listwise deletion” or “complete case analysis,” involves removing entire records from the analysis if they are missing any data point in one or more features.
When to Consider Deleting Missing Rows?
Q. Let us remove each row that contains a missing value. You can use the function na.omit (learn more about it here https://www.rdocumentation.org/packages/photobiology/versions/0.13.2/topics/na.omit)
# Remove rows with any missing value
airquality_no_missing <- na.omit(airquality)
Q. Let us check if there is any remaining missing value
# See the missing value now
sum(is.na(airquality_no_missing))
## [1] 0
Q. Check the new dimensions of the dataset, (learn about it here https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim)
# Get the dimensions of the dataset (rows and columns)
dim(airquality_no_missing)
## [1] 111 6
We can see that the new dataset contains no missing values and that 111 of the original 153 rows remain.
Note that many statistical software packages and functions designed for linear regression and similar models have built-in mechanisms to address missing data. Typically, these mechanisms involve automatically removing rows with missing values in any variable included in the model (listwise deletion). Therefore if you do not have specific needs or sensitive data, it may not be necessary to manually remove all the missing values.
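This built-in listwise deletion can be observed with lm(). The sketch below fits an arbitrary regression chosen only for illustration (the formula is an assumption, not part of the tutorial) and checks how many rows were actually used.

```r
# lm() applies getOption("na.action"), which is na.omit by default,
# so incomplete rows are dropped before fitting
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

nobs(fit)                # number of rows actually used in the fit
length(na.action(fit))   # number of rows dropped because of NAs
nrow(na.omit(airquality))  # same as nobs(fit): the complete cases
```

Since only Ozone and Solar.R contain missing values and both appear in the model, the fit uses exactly the complete cases.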
When dealing with missing data, a common and straightforward approach is to fill in the missing values with the mean of the available values in the same variable. This method, known as mean imputation, involves calculating the average of the non-missing values for each variable and substituting that average for the missing entries.
When to Consider Mean Imputation?
Partial imputation: We can choose to impute only certain columns if the other columns have only trivial missingness. In our example, since Ozone is the only variable with a lot of missing values, we can perform mean imputation for this variable alone. The Solar.R variable will then still contain 7 missing values (which is trivial).
Q. If you use a pre-existing function like this one https://www.rdocumentation.org/packages/missMethods/versions/0.4.0/topics/impute_mean, it will impute all variables containing missing values, which is not our objective here. We focus only on one variable with lots of missing data, the Ozone. So, find a way to perform this.
# For a single column
# Create a copy of the original dataframe
airquality_mean <- airquality
airquality_mean$Ozone[is.na(airquality_mean$Ozone)] <-
  mean(airquality_mean$Ozone, na.rm = TRUE)
Q. Check that the missing values have been imputed in the Ozone feature. We can see that the dataset now has only 7 missing values, in the Solar.R column, which we chose not to impute.
# Count missing values in each column
sapply(airquality_mean, function(x) sum(is.na(x)))
## Ozone Solar.R Wind Temp Month Day
## 0 7 0 0 0 0
Whole dataset: We can also impute every column containing missing values in the dataset. Note that if you use the code below, you should make sure all the columns with missing values are numeric. Fill in the following code cell to do that!
# For all columns in a dataset
airquality_allmean <- airquality
for (col in names(airquality_allmean)) {
  if (is.numeric(airquality_allmean[[col]])) {
    airquality_allmean[[col]][is.na(airquality_allmean[[col]])] <-
      mean(airquality_allmean[[col]], na.rm = TRUE)
  }
}
Q. Let us check that :
# Count missing values in each column
sapply(airquality_allmean, function(x) sum(is.na(x)))
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
We can plot the density of the three data sources: original data, data with deleted instances, and data with imputation of all features. We can use the geom_density function (learn about it here https://ggplot2.tidyverse.org/reference/geom_density.html). Compared with the original dataset, the imputed dataset shows a sharp peak around the mean of Ozone (about 42), which is expected because the same mean value was imputed for all 37 missing Ozone entries.
# Density plots
library(ggplot2)
# Original dataset
orig <- airquality
# Dataset with deleted missing values
deleted <- na.omit(airquality)
# Dataset with mean imputation
imputed <- airquality_allmean # created earlier
# Combine into one dataframe for plotting
orig$Type <- "Original"
deleted$Type <- "Deleted"
imputed$Type <- "Imputed"
# Bind together
combined <- rbind(orig, deleted, imputed)
# Plot density for Ozone column as example
ggplot(combined, aes(x = Ozone, fill = Type)) +
  geom_density(alpha = 0.4) +
  labs(title = "Density Plot of Ozone: Original vs Deleted vs Imputed")
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_density()`).
In this section we will use the R packages ‘mice’ and ‘naniar’ to do the imputation. See an introduction to the mice package and an installation guide here https://www.rdocumentation.org/packages/mice/versions/3.17.0
To get started with the naniar package, see https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html
Let us load the necessary packages.
# Load the packages
library(mice)
library(ggplot2)
library(naniar) ##install.packages("naniar")
## Warning: package 'naniar' was built under R version 4.3.3
Q. Then we can visualize the missing data in each column using the vis_miss() function. As before, the graph shows that the other four columns do not have any missing values, while 24% of Ozone and 5% of Solar.R are missing.
# Visualize missing data
vis_miss(airquality)
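The figures shown in the vis_miss() plot can also be obtained as numbers. The sketch below uses the naniar summary helpers miss_var_summary(), n_case_complete() and n_case_miss(); the object name `var_summary` is illustrative.

```r
library(naniar)

# One row per variable: number and percentage of missing values
var_summary <- miss_var_summary(airquality)
var_summary

# Number of complete and incomplete rows
n_case_complete(airquality)
n_case_miss(airquality)
```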
Q. We will now use the mice package to impute the missing values. MICE stands for Multivariate Imputation by Chained Equations.
# Set the seed for reproducibility
set.seed(12345)
# Perform Multiple Imputation using MICE
imp <- mice(airquality, m = 5, method = "pmm", maxit = 50, seed = 12345)
##
## iter imp variable
## 1 1 Ozone Solar.R
## 1 2 Ozone Solar.R
## 1 3 Ozone Solar.R
## 1 4 Ozone Solar.R
## 1 5 Ozone Solar.R
## 2 1 Ozone Solar.R
## 2 2 Ozone Solar.R
## 2 3 Ozone Solar.R
## 2 4 Ozone Solar.R
## 2 5 Ozone Solar.R
## 3 1 Ozone Solar.R
## 3 2 Ozone Solar.R
## 3 3 Ozone Solar.R
## 3 4 Ozone Solar.R
## 3 5 Ozone Solar.R
## 4 1 Ozone Solar.R
## 4 2 Ozone Solar.R
## 4 3 Ozone Solar.R
## 4 4 Ozone Solar.R
## 4 5 Ozone Solar.R
## 5 1 Ozone Solar.R
## 5 2 Ozone Solar.R
## 5 3 Ozone Solar.R
## 5 4 Ozone Solar.R
## 5 5 Ozone Solar.R
## 6 1 Ozone Solar.R
## 6 2 Ozone Solar.R
## 6 3 Ozone Solar.R
## 6 4 Ozone Solar.R
## 6 5 Ozone Solar.R
## 7 1 Ozone Solar.R
## 7 2 Ozone Solar.R
## 7 3 Ozone Solar.R
## 7 4 Ozone Solar.R
## 7 5 Ozone Solar.R
## 8 1 Ozone Solar.R
## 8 2 Ozone Solar.R
## 8 3 Ozone Solar.R
## 8 4 Ozone Solar.R
## 8 5 Ozone Solar.R
## 9 1 Ozone Solar.R
## 9 2 Ozone Solar.R
## 9 3 Ozone Solar.R
## 9 4 Ozone Solar.R
## 9 5 Ozone Solar.R
## 10 1 Ozone Solar.R
## 10 2 Ozone Solar.R
## 10 3 Ozone Solar.R
## 10 4 Ozone Solar.R
## 10 5 Ozone Solar.R
## 11 1 Ozone Solar.R
## 11 2 Ozone Solar.R
## 11 3 Ozone Solar.R
## 11 4 Ozone Solar.R
## 11 5 Ozone Solar.R
## 12 1 Ozone Solar.R
## 12 2 Ozone Solar.R
## 12 3 Ozone Solar.R
## 12 4 Ozone Solar.R
## 12 5 Ozone Solar.R
## 13 1 Ozone Solar.R
## 13 2 Ozone Solar.R
## 13 3 Ozone Solar.R
## 13 4 Ozone Solar.R
## 13 5 Ozone Solar.R
## 14 1 Ozone Solar.R
## 14 2 Ozone Solar.R
## 14 3 Ozone Solar.R
## 14 4 Ozone Solar.R
## 14 5 Ozone Solar.R
## 15 1 Ozone Solar.R
## 15 2 Ozone Solar.R
## 15 3 Ozone Solar.R
## 15 4 Ozone Solar.R
## 15 5 Ozone Solar.R
## 16 1 Ozone Solar.R
## 16 2 Ozone Solar.R
## 16 3 Ozone Solar.R
## 16 4 Ozone Solar.R
## 16 5 Ozone Solar.R
## 17 1 Ozone Solar.R
## 17 2 Ozone Solar.R
## 17 3 Ozone Solar.R
## 17 4 Ozone Solar.R
## 17 5 Ozone Solar.R
## 18 1 Ozone Solar.R
## 18 2 Ozone Solar.R
## 18 3 Ozone Solar.R
## 18 4 Ozone Solar.R
## 18 5 Ozone Solar.R
## 19 1 Ozone Solar.R
## 19 2 Ozone Solar.R
## 19 3 Ozone Solar.R
## 19 4 Ozone Solar.R
## 19 5 Ozone Solar.R
## 20 1 Ozone Solar.R
## 20 2 Ozone Solar.R
## 20 3 Ozone Solar.R
## 20 4 Ozone Solar.R
## 20 5 Ozone Solar.R
## 21 1 Ozone Solar.R
## 21 2 Ozone Solar.R
## 21 3 Ozone Solar.R
## 21 4 Ozone Solar.R
## 21 5 Ozone Solar.R
## 22 1 Ozone Solar.R
## 22 2 Ozone Solar.R
## 22 3 Ozone Solar.R
## 22 4 Ozone Solar.R
## 22 5 Ozone Solar.R
## 23 1 Ozone Solar.R
## 23 2 Ozone Solar.R
## 23 3 Ozone Solar.R
## 23 4 Ozone Solar.R
## 23 5 Ozone Solar.R
## 24 1 Ozone Solar.R
## 24 2 Ozone Solar.R
## 24 3 Ozone Solar.R
## 24 4 Ozone Solar.R
## 24 5 Ozone Solar.R
## 25 1 Ozone Solar.R
## 25 2 Ozone Solar.R
## 25 3 Ozone Solar.R
## 25 4 Ozone Solar.R
## 25 5 Ozone Solar.R
## 26 1 Ozone Solar.R
## 26 2 Ozone Solar.R
## 26 3 Ozone Solar.R
## 26 4 Ozone Solar.R
## 26 5 Ozone Solar.R
## 27 1 Ozone Solar.R
## 27 2 Ozone Solar.R
## 27 3 Ozone Solar.R
## 27 4 Ozone Solar.R
## 27 5 Ozone Solar.R
## 28 1 Ozone Solar.R
## 28 2 Ozone Solar.R
## 28 3 Ozone Solar.R
## 28 4 Ozone Solar.R
## 28 5 Ozone Solar.R
## 29 1 Ozone Solar.R
## 29 2 Ozone Solar.R
## 29 3 Ozone Solar.R
## 29 4 Ozone Solar.R
## 29 5 Ozone Solar.R
## 30 1 Ozone Solar.R
## 30 2 Ozone Solar.R
## 30 3 Ozone Solar.R
## 30 4 Ozone Solar.R
## 30 5 Ozone Solar.R
## 31 1 Ozone Solar.R
## 31 2 Ozone Solar.R
## 31 3 Ozone Solar.R
## 31 4 Ozone Solar.R
## 31 5 Ozone Solar.R
## 32 1 Ozone Solar.R
## 32 2 Ozone Solar.R
## 32 3 Ozone Solar.R
## 32 4 Ozone Solar.R
## 32 5 Ozone Solar.R
## 33 1 Ozone Solar.R
## 33 2 Ozone Solar.R
## 33 3 Ozone Solar.R
## 33 4 Ozone Solar.R
## 33 5 Ozone Solar.R
## 34 1 Ozone Solar.R
## 34 2 Ozone Solar.R
## 34 3 Ozone Solar.R
## 34 4 Ozone Solar.R
## 34 5 Ozone Solar.R
## 35 1 Ozone Solar.R
## 35 2 Ozone Solar.R
## 35 3 Ozone Solar.R
## 35 4 Ozone Solar.R
## 35 5 Ozone Solar.R
## 36 1 Ozone Solar.R
## 36 2 Ozone Solar.R
## 36 3 Ozone Solar.R
## 36 4 Ozone Solar.R
## 36 5 Ozone Solar.R
## 37 1 Ozone Solar.R
## 37 2 Ozone Solar.R
## 37 3 Ozone Solar.R
## 37 4 Ozone Solar.R
## 37 5 Ozone Solar.R
## 38 1 Ozone Solar.R
## 38 2 Ozone Solar.R
## 38 3 Ozone Solar.R
## 38 4 Ozone Solar.R
## 38 5 Ozone Solar.R
## 39 1 Ozone Solar.R
## 39 2 Ozone Solar.R
## 39 3 Ozone Solar.R
## 39 4 Ozone Solar.R
## 39 5 Ozone Solar.R
## 40 1 Ozone Solar.R
## 40 2 Ozone Solar.R
## 40 3 Ozone Solar.R
## 40 4 Ozone Solar.R
## 40 5 Ozone Solar.R
## 41 1 Ozone Solar.R
## 41 2 Ozone Solar.R
## 41 3 Ozone Solar.R
## 41 4 Ozone Solar.R
## 41 5 Ozone Solar.R
## 42 1 Ozone Solar.R
## 42 2 Ozone Solar.R
## 42 3 Ozone Solar.R
## 42 4 Ozone Solar.R
## 42 5 Ozone Solar.R
## 43 1 Ozone Solar.R
## 43 2 Ozone Solar.R
## 43 3 Ozone Solar.R
## 43 4 Ozone Solar.R
## 43 5 Ozone Solar.R
## 44 1 Ozone Solar.R
## 44 2 Ozone Solar.R
## 44 3 Ozone Solar.R
## 44 4 Ozone Solar.R
## 44 5 Ozone Solar.R
## 45 1 Ozone Solar.R
## 45 2 Ozone Solar.R
## 45 3 Ozone Solar.R
## 45 4 Ozone Solar.R
## 45 5 Ozone Solar.R
## 46 1 Ozone Solar.R
## 46 2 Ozone Solar.R
## 46 3 Ozone Solar.R
## 46 4 Ozone Solar.R
## 46 5 Ozone Solar.R
## 47 1 Ozone Solar.R
## 47 2 Ozone Solar.R
## 47 3 Ozone Solar.R
## 47 4 Ozone Solar.R
## 47 5 Ozone Solar.R
## 48 1 Ozone Solar.R
## 48 2 Ozone Solar.R
## 48 3 Ozone Solar.R
## 48 4 Ozone Solar.R
## 48 5 Ozone Solar.R
## 49 1 Ozone Solar.R
## 49 2 Ozone Solar.R
## 49 3 Ozone Solar.R
## 49 4 Ozone Solar.R
## 49 5 Ozone Solar.R
## 50 1 Ozone Solar.R
## 50 2 Ozone Solar.R
## 50 3 Ozone Solar.R
## 50 4 Ozone Solar.R
## 50 5 Ozone Solar.R
Explanation of the mice function:
Number of Imputations (m=?): The ‘m’ argument specifies how many complete datasets you wish to generate, each with missing values filled in. By setting m to 5, the function will create five versions of your dataset, each with missing values imputed differently. This multiplicity captures the uncertainty inherent in the imputation process.
Imputation Method: The ‘method’ argument dictates the statistical technique mice will use to predict missing values. PMM stands for Predictive Mean Matching, a non-parametric approach particularly suited to continuous data. PMM operates by finding observed cases whose predicted values are similar to that of the missing entry. The missing value is then imputed using an observed value borrowed from one of these cases, thus preserving the distribution and variance of the original data more effectively than simpler methods such as mean imputation. Other methods include cart (classification and regression trees) and lasso.norm (lasso linear regression).
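To make the matching idea concrete, here is a deliberately simplified, single-variable sketch of predictive mean matching in base R. It is not the mice implementation: mice additionally draws the regression parameters at random to reflect their uncertainty, which this sketch skips. The function name `pmm_sketch`, the donor count `k` and the choice of predictors are all hypothetical.

```r
# Simplified single-variable PMM, for illustration only
pmm_sketch <- function(data, target, predictors, k = 5) {
  # Use only rows where all predictors are observed
  ok  <- complete.cases(data[, predictors, drop = FALSE])
  obs <- ok & !is.na(data[[target]])
  mis <- ok & is.na(data[[target]])

  # Regress the target on the predictors using the observed cases
  fit <- lm(reformulate(predictors, response = target), data = data[obs, ])
  pred_obs <- predict(fit, newdata = data[obs, ])
  pred_mis <- predict(fit, newdata = data[mis, ])
  y_obs <- data[[target]][obs]

  # For each missing case, pick one of the k donors whose predicted
  # value is closest, and borrow that donor's observed value
  imputed <- vapply(pred_mis, function(p) {
    donors <- order(abs(pred_obs - p))[seq_len(k)]
    y_obs[donors[sample.int(k, 1)]]
  }, numeric(1))

  data[[target]][mis] <- imputed
  data
}

set.seed(42)
filled <- pmm_sketch(airquality, target = "Ozone", predictors = c("Wind", "Temp"))
sum(is.na(filled$Ozone))
```

Because every imputed value is copied from a donor, the imputations can never fall outside the set of observed Ozone values.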
We can also check whether the imputation distorted the distribution of a variable too much.
# imputed_data
imputed_data <- complete(imp, 1)
head(imputed_data)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 37 148 14.3 56 5 5
## 6 28 149 14.9 66 5 6
After fitting the imputation model, examine the imputations by plotting the observed and imputed data together. Ideally, the imputed values are plausible compared to the observed values. Imputed values that are far away from the distribution of the observed values (not possible with pmm but possible with other methods) may indicate a problem with the imputation model, or perhaps that an MNAR model is needed. Imputed values which only span a subset of the distribution of the observed values are interesting in that they provide some information about the nature of the missing values that may assist in reducing missing data in future studies.
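The plausibility of the imputations can also be checked numerically. The sketch below reruns mice with the console log switched off (printFlag = FALSE) and compares the range of the imputed Ozone values with the observed range; the object names `imp_check` and `imputed_ozone` are illustrative.

```r
library(mice)

imp_check <- mice(airquality, m = 5, method = "pmm", seed = 123,
                  printFlag = FALSE)

# imp_check$imp$Ozone holds the imputed Ozone values:
# one row per originally missing case, one column per imputation
imputed_ozone <- unlist(imp_check$imp$Ozone)

range(airquality$Ozone, na.rm = TRUE)     # observed range
range(imputed_ozone)                      # imputed range
all(imputed_ozone %in% airquality$Ozone)  # pmm only re-uses observed values
```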
The stripplot() function can be used to plot the observed and imputed values for continuous variables. The observed data are plotted alone (labeled as 0 on the x-axis), followed by the observed and imputed data together for each completed dataset (labeled from 1 to the number of imputations). The points are “jittered” to provide some spread, so it is easier to see the imputed values superimposed over the observed values.
Q. Let us see the distribution of Ozone per imputed data set. Use the stripplot function, learn about it here https://www.rdocumentation.org/packages/lattice/versions/0.3-1/topics/stripplot In general, we would like the imputations to be plausible, i.e., values that could have been observed if they had not been missing.
# inspect quality of imputations
# Perform Multiple Imputation using mice
imputed_data <- mice(airquality, m = 5, method = "pmm", seed = 123)
##
## iter imp variable
## 1 1 Ozone Solar.R
## 1 2 Ozone Solar.R
## 1 3 Ozone Solar.R
## 1 4 Ozone Solar.R
## 1 5 Ozone Solar.R
## 2 1 Ozone Solar.R
## 2 2 Ozone Solar.R
## 2 3 Ozone Solar.R
## 2 4 Ozone Solar.R
## 2 5 Ozone Solar.R
## 3 1 Ozone Solar.R
## 3 2 Ozone Solar.R
## 3 3 Ozone Solar.R
## 3 4 Ozone Solar.R
## 3 5 Ozone Solar.R
## 4 1 Ozone Solar.R
## 4 2 Ozone Solar.R
## 4 3 Ozone Solar.R
## 4 4 Ozone Solar.R
## 4 5 Ozone Solar.R
## 5 1 Ozone Solar.R
## 5 2 Ozone Solar.R
## 5 3 Ozone Solar.R
## 5 4 Ozone Solar.R
## 5 5 Ozone Solar.R
# Stripplot on mids object
stripplot(imputed_data, Ozone ~ .imp, pch = 20, cex = 1.2)
Let us look at another way to compare the distributions of original and imputed data. We can use a scatterplot and plot Ozone against the other measured variables (Wind, Temp and Solar.R). What we would like to see is that the shape of the red points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed “plausible values”. You can use xyplot (learn about it here https://www.rdocumentation.org/packages/lattice/versions/0.13-4/topics/xyplot)
# xyplot plot
xyplot(imputed_data, Ozone ~ Wind + Temp + Solar.R | .imp,
       pch = 20, cex = 1.2, col = c("blue", "red"))
Q. Another helpful plot is the density plot https://www.rdocumentation.org/packages/car/versions/3.1-3/topics/densityPlot: The density of the imputed data for each imputed dataset is shown in magenta while the density of the observed data is shown in blue. Again, under our previous assumptions we expect the distributions to be similar. Note that the call below uses the densityplot() method that mice provides for its imputation objects.
# density plot
densityplot(imputed_data, ~ Ozone,
            plot.points = FALSE,
            main = "Density plot of observed vs imputed Ozone values")
Finally, we can get back the completed dataset using the complete() function. Here the missing values have been replaced with the imputed values from the first of the five datasets. If you wish to use another one, just change the second parameter of the complete() function.
completedData <- complete(imputed_data,1)
completedData
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 18 150 14.3 56 5 5
## 6 28 48 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 12 194 8.6 69 5 10
## 11 7 273 6.9 74 5 11
## 12 16 256 9.7 69 5 12
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15
## 16 14 334 11.5 64 5 16
## 17 34 307 12.0 66 5 17
## 18 6 78 18.4 57 5 18
## 19 30 322 11.5 68 5 19
## 20 11 44 9.7 62 5 20
## 21 1 8 9.7 59 5 21
## 22 11 320 16.6 73 5 22
## 23 4 25 9.7 61 5 23
## 24 32 92 12.0 61 5 24
## 25 18 66 16.6 57 5 25
## 26 13 266 14.9 58 5 26
## 27 20 7 8.0 57 5 27
## 28 23 13 12.0 67 5 28
## 29 45 252 14.9 81 5 29
## 30 115 223 5.7 79 5 30
## 31 37 279 7.4 76 5 31
## 32 16 286 8.6 78 6 1
## 33 12 287 9.7 74 6 2
## 34 19 242 16.1 67 6 3
## 35 52 186 9.2 84 6 4
## 36 7 220 8.6 85 6 5
## 37 21 264 14.3 79 6 6
## 38 29 127 9.7 82 6 7
## 39 64 273 6.9 87 6 8
## 40 71 291 13.8 90 6 9
## 41 39 323 11.5 87 6 10
## 42 64 259 10.9 93 6 11
## 43 61 250 9.2 92 6 12
## 44 23 148 8.0 82 6 13
## 45 30 332 13.8 80 6 14
## 46 13 322 11.5 79 6 15
## 47 21 191 14.9 77 6 16
## 48 37 284 20.7 72 6 17
## 49 20 37 9.2 65 6 18
## 50 12 120 11.5 73 6 19
## 51 13 137 10.3 76 6 20
## 52 23 150 6.3 77 6 21
## 53 85 59 1.7 76 6 22
## 54 37 91 4.6 76 6 23
## 55 23 250 6.3 76 6 24
## 56 29 135 8.0 75 6 25
## 57 47 127 8.0 78 6 26
## 58 44 47 10.3 73 6 27
## 59 46 98 11.5 80 6 28
## 60 11 31 14.9 77 6 29
## 61 78 138 8.0 83 6 30
## 62 135 269 4.1 84 7 1
## 63 49 248 9.2 85 7 2
## 64 32 236 9.2 81 7 3
## 65 18 101 10.9 84 7 4
## 66 64 175 4.6 83 7 5
## 67 40 314 10.9 83 7 6
## 68 77 276 5.1 88 7 7
## 69 97 267 6.3 92 7 8
## 70 97 272 5.7 92 7 9
## 71 85 175 7.4 89 7 10
## 72 52 139 8.6 82 7 11
## 73 10 264 14.3 73 7 12
## 74 27 175 14.9 81 7 13
## 75 40 291 14.9 91 7 14
## 76 7 48 14.3 80 7 15
## 77 48 260 6.9 81 7 16
## 78 35 274 10.3 82 7 17
## 79 61 285 6.3 84 7 18
## 80 79 187 5.1 87 7 19
## 81 63 220 11.5 85 7 20
## 82 16 7 6.9 74 7 21
## 83 7 258 9.7 81 7 22
## 84 29 295 11.5 82 7 23
## 85 80 294 8.6 86 7 24
## 86 108 223 8.0 85 7 25
## 87 20 81 8.6 82 7 26
## 88 52 82 12.0 86 7 27
## 89 82 213 7.4 88 7 28
## 90 50 275 7.4 86 7 29
## 91 64 253 7.4 83 7 30
## 92 59 254 9.2 81 7 31
## 93 39 83 6.9 81 8 1
## 94 9 24 13.8 81 8 2
## 95 16 77 7.4 82 8 3
## 96 78 213 6.9 86 8 4
## 97 35 295 7.4 85 8 5
## 98 66 191 4.6 87 8 6
## 99 122 255 4.0 89 8 7
## 100 89 229 10.3 90 8 8
## 101 110 207 8.0 90 8 9
## 102 110 222 8.6 92 8 10
## 103 28 137 11.5 86 8 11
## 104 44 192 11.5 86 8 12
## 105 28 273 11.5 82 8 13
## 106 65 157 9.7 80 8 14
## 107 16 64 11.5 79 8 15
## 108 22 71 10.3 77 8 16
## 109 59 51 6.3 79 8 17
## 110 23 115 7.4 76 8 18
## 111 31 244 10.9 78 8 19
## 112 44 190 10.3 78 8 20
## 113 21 259 15.5 77 8 21
## 114 9 36 14.3 72 8 22
## 115 22 255 12.6 75 8 23
## 116 45 212 9.7 79 8 24
## 117 168 238 3.4 81 8 25
## 118 73 215 8.0 86 8 26
## 119 122 153 5.7 88 8 27
## 120 76 203 9.7 97 8 28
## 121 118 225 2.3 94 8 29
## 122 84 237 6.3 96 8 30
## 123 85 188 6.3 94 8 31
## 124 96 167 6.9 91 9 1
## 125 78 197 5.1 92 9 2
## 126 73 183 2.8 93 9 3
## 127 91 189 4.6 93 9 4
## 128 47 95 7.4 87 9 5
## 129 32 92 15.5 84 9 6
## 130 20 252 10.9 80 9 7
## 131 23 220 10.3 78 9 8
## 132 21 230 10.9 75 9 9
## 133 24 259 9.7 73 9 10
## 134 44 236 14.9 81 9 11
## 135 21 259 15.5 76 9 12
## 136 28 238 6.3 77 9 13
## 137 9 24 10.9 71 9 14
## 138 13 112 11.5 71 9 15
## 139 46 237 6.9 78 9 16
## 140 18 224 13.8 67 9 17
## 141 13 27 10.3 76 9 18
## 142 24 238 10.3 68 9 19
## 143 16 201 8.0 82 9 20
## 144 13 238 12.6 64 9 21
## 145 23 14 9.2 71 9 22
## 146 36 139 10.3 81 9 23
## 147 7 49 10.3 69 9 24
## 148 14 20 16.6 63 9 25
## 149 30 193 6.9 70 9 26
## 150 27 145 13.2 77 9 27
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
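Instead of extracting one completed dataset at a time, complete() can also stack all of them. The sketch below uses action = "long" and compares the mean of Ozone across the five completed datasets; the object names `imp_all` and `long` are illustrative.

```r
library(mice)

imp_all <- mice(airquality, m = 5, method = "pmm", seed = 123,
                printFlag = FALSE)

# Stack all m completed datasets; the .imp column identifies the imputation
long <- complete(imp_all, action = "long")
dim(long)

# Mean Ozone in each completed dataset: the spread across rows
# reflects the uncertainty caused by the missing values
aggregate(Ozone ~ .imp, data = long, FUN = mean)
```

If the means differ noticeably between imputations, conclusions drawn from a single completed dataset should be treated with caution.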