Yiqiong Yang (Miriam) 2022-08-18
Missing data occur when no data value is stored for the variable in an observation. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data (Kang, 2013).
In this markdown we will introduce some basic approaches to address missing data.
library(knitr) # Markdown presentation
library(tidyverse) # Data wragnling and visualisation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(visdat) # Visualise missing data
library(mice) # Multiple imputation
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(dplyr) # Data wrangling
Now we are creating a dataset with four columns Col1: the name of eight students Col2 - Col4: numbers of supervision hours across three academic years
suphours <- data.frame(Student = c("John", "Amos", "Can", "Eman", "Joy",
"Flora", "Max", "Kay"),
year2019 = c(10, NA, 13, 8, 12, NA, 9, 11),
year2020 = sample(12:20, 8, replace = F),
year2021 = c(12, NA, 12, NA, NA, 9, 25, 7))
## Transpose the wide format to long format for demo purpose
suphourslong <- suphours %>%
gather("Years", "Hours", 2:4) %>%
mutate(Years = gsub("\\D", "", Years))
head(suphours, 8)
## Student year2019 year2020 year2021
## 1 John 10 18 12
## 2 Amos NA 13 NA
## 3 Can 13 12 12
## 4 Eman 8 16 NA
## 5 Joy 12 14 NA
## 6 Flora NA 19 9
## 7 Max 9 20 25
## 8 Kay 11 15 7
head(suphourslong, 8)
## Student Years Hours
## 1 John 2019 10
## 2 Amos 2019 NA
## 3 Can 2019 13
## 4 Eman 2019 8
## 5 Joy 2019 12
## 6 Flora 2019 NA
## 7 Max 2019 9
## 8 Kay 2019 11
From the dataset we could see that there are several na in Hours. The first approach is to remove all rows contain NA. When we have a large sample size with less missing values, removal of missing data would not necessarily bias the result of analysis.
Visualising missing date provide you a general idea of where they are and how many they are. For example, whether missing data follow a certain pattern? Whether there are a big chunk of missing data? If they do, then they are probably not missing at random, i.e., missing data at entry stage. Therefore, different treatment is needed.
# Visualise missing data
visdat::vis_dat(suphourslong)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
mice::md.pattern(suphourslong)
## Student Years Hours
## 19 1 1 1 0
## 5 1 1 0 1
## 0 0 5 5
There are different ways to remove rows containing NA. We can do that through dplyr::filter or drop_na. Visualised graphs tell that the NA have been completely removed.
# dplyr::filter
suphourslong %>%
filter(!is.na(Hours)) %>% # is.na() return rows containing na. '!' means not equal
visdat::vis_dat()
suphourslong %>%
filter(complete.cases(Hours)) %>% # complete.cases is self-explanatory
mice::md.pattern()
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Student Years Hours
## 19 1 1 1 0
## 0 0 0 0
# drop_na
suphourslong %>%
drop_na(Hours) %>%
visdat::vis_dat()
Removing cases or participants would cause bias results, so we want to keep them into the analysis. We could replace them with other values or characters. The code below demonstrate how to replace missing values with 0.
# For example, replace NA with 0 with long format data
suphourslong %>%
mutate(Hours = replace_na(Hours, 0)) %>%
head()
## Student Years Hours
## 1 John 2019 10
## 2 Amos 2019 0
## 3 Can 2019 13
## 4 Eman 2019 8
## 5 Joy 2019 12
## 6 Flora 2019 0
# For example, replace NA with 0 with wide format data applying to all the element and return a data frame, this is when 'apply' is applicable.
suphours_na_0 <-
as.data.frame(apply(suphours[, 2:4], 2, replace_na, 0)) %>%
# Combine it with the student name from the original dataset
cbind(Student = suphours[,1])
# Check the new dataset, na has been replaced by 0.
head(suphours_na_0, 8)
## year2019 year2020 year2021 Student
## 1 10 18 12 John
## 2 0 13 0 Amos
## 3 13 12 12 Can
## 4 8 16 0 Eman
## 5 12 14 0 Joy
## 6 0 19 9 Flora
## 7 9 20 25 Max
## 8 11 15 7 Kay
We can replace NA with mean
suphourslong %>%
mutate(Hours = replace_na(Hours, round(mean(Hours, na.rm = T), digits = 0))) %>%
head(8)
## Student Years Hours
## 1 John 2019 10
## 2 Amos 2019 13
## 3 Can 2019 13
## 4 Eman 2019 8
## 5 Joy 2019 12
## 6 Flora 2019 13
## 7 Max 2019 9
## 8 Kay 2019 11
# Through error
# suphours_na_mean <-
# as.data.frame(sapply(suphours[,2:4], as.numeric)) %>%
# mutate_at(vars(1:3), ~replace_na(., mean(., na.rm = TRUE)))
#
When we want to replace na with character, we need to be careful with the type. If we replace na with character in numeric column/rows, we need to make sure the type is consistent throughout. So we need to convert that column into a matching type, i.e., character in the code below. Inconsistent type would through an error.
suphourslong %>%
mutate(Hours = replace_na(as.character(Hours), "none"),
Student = replace_na(as.character(Student), "none")) %>%
head(8)
## Student Years Hours
## 1 John 2019 10
## 2 Amos 2019 none
## 3 Can 2019 13
## 4 Eman 2019 8
## 5 Joy 2019 12
## 6 Flora 2019 none
## 7 Max 2019 9
## 8 Kay 2019 11
The above approaches to replace or remove na are basic and limited to certain sample size or data. When we do not want to remove any missing cases, or replace them with other values, we could apply multiple imputation.
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.
m – is the number of imputations, the default is 5. defaultMethod – indicates which method you want to use for imputation, the default is PMM. Accepted values are logreg, polyreg, and polr. maxit– A scalar gives the number of iterations. The default is 5. seed – offsetting the random number generator.
Start imputation using mice:
# Here we are asking R to iterate the data for 3 times,
imputed_data <- mice(suphours,m=3,maxit=50,meth='pmm',seed=500)
##
## iter imp variable
## 1 1 year2019 year2021
## 1 2 year2019 year2021
## 1 3 year2019 year2021
## 2 1 year2019 year2021
## 2 2 year2019 year2021
## 2 3 year2019 year2021
## 3 1 year2019 year2021
## 3 2 year2019 year2021
## 3 3 year2019 year2021
## 4 1 year2019 year2021
## 4 2 year2019 year2021
## 4 3 year2019 year2021
## 5 1 year2019 year2021
## 5 2 year2019 year2021
## 5 3 year2019 year2021
## 6 1 year2019 year2021
## 6 2 year2019 year2021
## 6 3 year2019 year2021
## 7 1 year2019 year2021
## 7 2 year2019 year2021
## 7 3 year2019 year2021
## 8 1 year2019 year2021
## 8 2 year2019 year2021
## 8 3 year2019 year2021
## 9 1 year2019 year2021
## 9 2 year2019 year2021
## 9 3 year2019 year2021
## 10 1 year2019 year2021
## 10 2 year2019 year2021
## 10 3 year2019 year2021
## 11 1 year2019 year2021
## 11 2 year2019 year2021
## 11 3 year2019 year2021
## 12 1 year2019 year2021
## 12 2 year2019 year2021
## 12 3 year2019 year2021
## 13 1 year2019 year2021
## 13 2 year2019 year2021
## 13 3 year2019 year2021
## 14 1 year2019 year2021
## 14 2 year2019 year2021
## 14 3 year2019 year2021
## 15 1 year2019 year2021
## 15 2 year2019 year2021
## 15 3 year2019 year2021
## 16 1 year2019 year2021
## 16 2 year2019 year2021
## 16 3 year2019 year2021
## 17 1 year2019 year2021
## 17 2 year2019 year2021
## 17 3 year2019 year2021
## 18 1 year2019 year2021
## 18 2 year2019 year2021
## 18 3 year2019 year2021
## 19 1 year2019 year2021
## 19 2 year2019 year2021
## 19 3 year2019 year2021
## 20 1 year2019 year2021
## 20 2 year2019 year2021
## 20 3 year2019 year2021
## 21 1 year2019 year2021
## 21 2 year2019 year2021
## 21 3 year2019 year2021
## 22 1 year2019 year2021
## 22 2 year2019 year2021
## 22 3 year2019 year2021
## 23 1 year2019 year2021
## 23 2 year2019 year2021
## 23 3 year2019 year2021
## 24 1 year2019 year2021
## 24 2 year2019 year2021
## 24 3 year2019 year2021
## 25 1 year2019 year2021
## 25 2 year2019 year2021
## 25 3 year2019 year2021
## 26 1 year2019 year2021
## 26 2 year2019 year2021
## 26 3 year2019 year2021
## 27 1 year2019 year2021
## 27 2 year2019 year2021
## 27 3 year2019 year2021
## 28 1 year2019 year2021
## 28 2 year2019 year2021
## 28 3 year2019 year2021
## 29 1 year2019 year2021
## 29 2 year2019 year2021
## 29 3 year2019 year2021
## 30 1 year2019 year2021
## 30 2 year2019 year2021
## 30 3 year2019 year2021
## 31 1 year2019 year2021
## 31 2 year2019 year2021
## 31 3 year2019 year2021
## 32 1 year2019 year2021
## 32 2 year2019 year2021
## 32 3 year2019 year2021
## 33 1 year2019 year2021
## 33 2 year2019 year2021
## 33 3 year2019 year2021
## 34 1 year2019 year2021
## 34 2 year2019 year2021
## 34 3 year2019 year2021
## 35 1 year2019 year2021
## 35 2 year2019 year2021
## 35 3 year2019 year2021
## 36 1 year2019 year2021
## 36 2 year2019 year2021
## 36 3 year2019 year2021
## 37 1 year2019 year2021
## 37 2 year2019 year2021
## 37 3 year2019 year2021
## 38 1 year2019 year2021
## 38 2 year2019 year2021
## 38 3 year2019 year2021
## 39 1 year2019 year2021
## 39 2 year2019 year2021
## 39 3 year2019 year2021
## 40 1 year2019 year2021
## 40 2 year2019 year2021
## 40 3 year2019 year2021
## 41 1 year2019 year2021
## 41 2 year2019 year2021
## 41 3 year2019 year2021
## 42 1 year2019 year2021
## 42 2 year2019 year2021
## 42 3 year2019 year2021
## 43 1 year2019 year2021
## 43 2 year2019 year2021
## 43 3 year2019 year2021
## 44 1 year2019 year2021
## 44 2 year2019 year2021
## 44 3 year2019 year2021
## 45 1 year2019 year2021
## 45 2 year2019 year2021
## 45 3 year2019 year2021
## 46 1 year2019 year2021
## 46 2 year2019 year2021
## 46 3 year2019 year2021
## 47 1 year2019 year2021
## 47 2 year2019 year2021
## 47 3 year2019 year2021
## 48 1 year2019 year2021
## 48 2 year2019 year2021
## 48 3 year2019 year2021
## 49 1 year2019 year2021
## 49 2 year2019 year2021
## 49 3 year2019 year2021
## 50 1 year2019 year2021
## 50 2 year2019 year2021
## 50 3 year2019 year2021
## Warning: Number of logged events: 1
summary(imputed_data)
## Class: mids
## Number of multiple imputations: 3
## Imputation methods:
## Student year2019 year2020 year2021
## "" "pmm" "" "pmm"
## PredictorMatrix:
## Student year2019 year2020 year2021
## Student 0 1 1 1
## year2019 0 0 1 1
## year2020 0 1 0 1
## year2021 0 1 1 0
## Number of logged events: 1
## it im dep meth out
## 1 0 0 constant Student
The output of calling mice is a group of lists. If we want to check the imputed values, we need to specify it in ‘imp’, and further specify the column of the imputation.
imputed_data$imp$year2019 # For example, check the imputed values for year2019
## 1 2 3
## 2 8 11 12
## 6 9 12 12
When imputation has done, we could replace the NA in the original dataset by the imputed values we just created.
# Assign the imputed dataset to 'finished_imputed'
finished_imputed <- complete(imputed_data, 2)
head(suphours, 8)
## Student year2019 year2020 year2021
## 1 John 10 18 12
## 2 Amos NA 13 NA
## 3 Can 13 12 12
## 4 Eman 8 16 NA
## 5 Joy 12 14 NA
## 6 Flora NA 19 9
## 7 Max 9 20 25
## 8 Kay 11 15 7
head(finished_imputed, 8)
## Student year2019 year2020 year2021
## 1 John 10 18 12
## 2 Amos 11 13 12
## 3 Can 13 12 12
## 4 Eman 8 16 7
## 5 Joy 12 14 12
## 6 Flora 12 19 9
## 7 Max 9 20 25
## 8 Kay 11 15 7
# Note the replacement of categorical column is still incomplete
References:
Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013 May;64(5):402-6. doi: 10.4097/kjae.2013.64.5.402. Epub 2013 May 24. PMID: 23741561; PMCID: PMC3668100.
https://datasciencebeginners.com/2018/11/11/a-brief-introduction-to-mice-r-package/
https://www.rdocumentation.org/packages/mice/versions/3.14.0/topics/mice
https://www.codingprof.com/how-to-replace-nas-with-the-mean-in-r-examples/
https://datascienceplus.com/imputing-missing-data-with-r-mice-package/