The goal of this tutorial is to find the percentage of missing values (NA) in each variable and to understand how those NA are distributed. In particular, we want to identify whether two or more variables are systematically missing in the same rows.
library(VIM)
# In this example we work with a small dataset that we build ourselves
vec1 <- c(1,2,1,2,1,NA,NA,1,2,3)
vec2 <- c(1,2,1,2,1,2,4,5,4,NA)
vec3 <- c(1,2,1,NA,NA,1,2,3,10,10)
vec4 <- c(1,2,1,2,NA,NA,NA,1,2,3)
dataset <- data.frame(vec1,vec2,vec3,vec4)
dataset
##    vec1 vec2 vec3 vec4
## 1     1    1    1    1
## 2     2    2    2    2
## 3     1    1    1    1
## 4     2    2   NA    2
## 5     1    1   NA   NA
## 6    NA    2    1   NA
## 7    NA    4    2   NA
## 8     1    5    3    1
## 9     2    4   10    2
## 10    3   NA   10    3
# We have introduced some NA to the dataset
summary(dataset)
##       vec1            vec2            vec3            vec4
##  Min.   :1.000   Min.   :1.000   Min.   : 1.00   Min.   :1.000
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 1.00   1st Qu.:1.000
##  Median :1.500   Median :2.000   Median : 2.00   Median :2.000
##  Mean   :1.625   Mean   :2.444   Mean   : 3.75   Mean   :1.714
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.: 4.75   3rd Qu.:2.000
##  Max.   :3.000   Max.   :5.000   Max.   :10.00   Max.   :3.000
##  NA's   :2       NA's   :1       NA's   :2       NA's   :3
# We can count the total number of NA in the dataset with a single instruction
length(which(is.na(dataset)))
## [1] 8
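Because the stated goal is the percentage of NA in each variable, a minimal base R sketch of that calculation may be useful here. It rebuilds the same dataset so that it runs on its own; the names `na_count` and `na_percent` are illustrative and not part of the original tutorial.

```r
# Rebuild the tutorial's dataset so this block is self-contained
dataset <- data.frame(
  vec1 = c(1, 2, 1, 2, 1, NA, NA, 1, 2, 3),
  vec2 = c(1, 2, 1, 2, 1, 2, 4, 5, 4, NA),
  vec3 = c(1, 2, 1, NA, NA, 1, 2, 3, 10, 10),
  vec4 = c(1, 2, 1, 2, NA, NA, NA, 1, 2, 3)
)

# is.na() returns a logical matrix; its column sums count the NA per variable
na_count <- colSums(is.na(dataset))

# The column means of a logical matrix are proportions; scale them to percent
na_percent <- colMeans(is.na(dataset)) * 100
na_percent
## vec1 vec2 vec3 vec4
##   20   10   20   30
```

The counts agree with the `NA's` row printed by `summary(dataset)`, and their sum is the total of 8 obtained above.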
# Counting the rows that contain at least one NA is a different question, because a single row can hold more than one NA
# The function complete.cases flags the rows that contain no NA, so negating it selects the rows with at least one NA
length(which(!complete.cases(dataset)))
## [1] 5
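To see which rows are incomplete, and not only how many, the same idea can be extended as in the sketch below. It uses base R only; `incomplete_rows` and `na_per_row` are illustrative names introduced here.

```r
# Rebuild the tutorial's dataset so this block is self-contained
dataset <- data.frame(
  vec1 = c(1, 2, 1, 2, 1, NA, NA, 1, 2, 3),
  vec2 = c(1, 2, 1, 2, 1, 2, 4, 5, 4, NA),
  vec3 = c(1, 2, 1, NA, NA, 1, 2, 3, 10, 10),
  vec4 = c(1, 2, 1, 2, NA, NA, NA, 1, 2, 3)
)

# Positions of the rows that contain at least one NA
incomplete_rows <- which(!complete.cases(dataset))
incomplete_rows
## [1]  4  5  6  7 10

# Number of NA in each row; rows with a value above 1 explain why
# the total NA count (8) is larger than the number of incomplete rows (5)
na_per_row <- rowSums(is.na(dataset))
```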
# We are going to use the aggr function from the VIM package for this purpose
# The plot on the left shows the percentage of missing values for each variable
# The plot on the right shows the combination of missing values between variables
# How to read the plot on the right:
# The first blue row shows that 5 rows contain no NA
# There are two rows where vec1 and vec4 contain NA at the same time
# There is one row where only vec3 contains a missing value
# There is one row where vec3 and vec4 both contain a missing value
# There is one row where only vec2 contains a missing value
summary(aggr(dataset))
##
##  Missings per variable:
##  Variable Count
##      vec1     2
##      vec2     1
##      vec3     2
##      vec4     3
##
##  Missings in combinations of variables:
##  Combinations Count Percent
##       0:0:0:0     5      50
##       0:0:1:0     1      10
##       0:0:1:1     1      10
##       0:1:0:0     1      10
##       1:0:0:1     2      20
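To make the combination codes less opaque, the sketch below rebuilds the same table with base R only. Each code has one digit per variable, in column order, where 1 means missing and 0 means observed. This is an illustrative reconstruction, not the way VIM computes it internally; `pattern` and `combos` are names introduced here.

```r
# Rebuild the tutorial's dataset so this block is self-contained
dataset <- data.frame(
  vec1 = c(1, 2, 1, 2, 1, NA, NA, 1, 2, 3),
  vec2 = c(1, 2, 1, 2, 1, 2, 4, 5, 4, NA),
  vec3 = c(1, 2, 1, NA, NA, 1, 2, 3, 10, 10),
  vec4 = c(1, 2, 1, 2, NA, NA, NA, 1, 2, 3)
)

# Turn the logical NA matrix into 0/1 and paste each row into a code
# such as "1:0:0:1" (vec1 and vec4 missing, vec2 and vec3 observed)
pattern <- apply(is.na(dataset) * 1L, 1, paste, collapse = ":")

# Count how often each code occurs and convert the counts to percentages
combos <- table(pattern)
combo_percent <- prop.table(combos) * 100
combos
```

The counts should match the `Count` column printed by `summary(aggr(dataset))`: five complete rows, two rows with the `1:0:0:1` pattern, and one row for each of the remaining patterns.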
In this tutorial we have learnt how to study the amount and distribution of the missing values in a dataset. Statistics on missing values can be very valuable in a data quality process.