1 Goal

The goal of this tutorial is to find out what percentage of each variable is NA and how those NA are distributed across the dataset. In particular, we want to see whether two variables tend to be missing in the same rows, which would point to a systematic pattern.

2 Data import

library(VIM)

# In this example we work with a small dataset that we build by hand

vec1 <- c(1,2,1,2,1,NA,NA,1,2,3)
vec2 <- c(1,2,1,2,1,2,4,5,4,NA)
vec3 <- c(1,2,1,NA,NA,1,2,3,10,10)
vec4 <- c(1,2,1,2,NA,NA,NA,1,2,3)

dataset <- data.frame(vec1,vec2,vec3,vec4)
dataset

##    vec1 vec2 vec3 vec4
## 1     1    1    1    1
## 2     2    2    2    2
## 3     1    1    1    1
## 4     2    2   NA    2
## 5     1    1   NA   NA
## 6    NA    2    1   NA
## 7    NA    4    2   NA
## 8     1    5    3    1
## 9     2    4   10    2
## 10    3   NA   10    3

# We have introduced some NA into the dataset; summary() reports them per variable in its last row
summary(dataset)

##       vec1            vec2            vec3            vec4      
##  Min.   :1.000   Min.   :1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 1.00   1st Qu.:1.000  
##  Median :1.500   Median :2.000   Median : 2.00   Median :2.000  
##  Mean   :1.625   Mean   :2.444   Mean   : 3.75   Mean   :1.714  
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.: 4.75   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :5.000   Max.   :10.00   Max.   :3.000  
##  NA's   :2       NA's   :1       NA's   :2       NA's   :3

3 How many NA are there in my dataset

# We can count the total number of NA with a single instruction
length(which(is.na(dataset)))

## [1] 8
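
The same logical matrix returned by is.na can also be summed column by column, which gives the count and percentage of NA for each variable without any extra package. The following is a minimal sketch in base R; the names total_na, na_count and na_percent are chosen here for illustration and do not appear in the original tutorial.

```r
# Same toy dataset as in the Data import section
dataset <- data.frame(
  vec1 = c(1, 2, 1, 2, 1, NA, NA, 1, 2, 3),
  vec2 = c(1, 2, 1, 2, 1, 2, 4, 5, 4, NA),
  vec3 = c(1, 2, 1, NA, NA, 1, 2, 3, 10, 10),
  vec4 = c(1, 2, 1, 2, NA, NA, NA, 1, 2, 3)
)

# is.na() returns a logical matrix; TRUE counts as 1 when summed
total_na   <- sum(is.na(dataset))             # total number of NA
na_count   <- colSums(is.na(dataset))         # NA count per variable
na_percent <- 100 * colMeans(is.na(dataset))  # NA percentage per variable

total_na
na_count
na_percent
```

The per-variable counts should agree with the NA's row printed by summary(dataset).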

4 How many rows contain NA

# This question is different from the previous one, because a single row can contain more than one NA
# The function complete.cases returns TRUE for rows without any NA, so we negate it to select the rows that do contain NA

length(which(!complete.cases(dataset)))

## [1] 5
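
Beyond the count, it is often useful to know which rows are incomplete and how many NA each one holds. The sketch below uses only base R; incomplete and na_per_row are illustrative names, not part of the original tutorial.

```r
# Same toy dataset as in the Data import section
dataset <- data.frame(
  vec1 = c(1, 2, 1, 2, 1, NA, NA, 1, 2, 3),
  vec2 = c(1, 2, 1, 2, 1, 2, 4, 5, 4, NA),
  vec3 = c(1, 2, 1, NA, NA, 1, 2, 3, 10, 10),
  vec4 = c(1, 2, 1, 2, NA, NA, NA, 1, 2, 3)
)

# TRUE for every row that has at least one NA
incomplete <- !complete.cases(dataset)

sum(incomplete)    # number of incomplete rows
which(incomplete)  # positions of the incomplete rows

# Number of NA in each row, then how many rows have 0, 1, 2, ... NA
na_per_row <- rowSums(is.na(dataset))
table(na_per_row)
```

Multiplying each NA count by its frequency in the table and adding the products gives the total number of NA again, which is a quick consistency check.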

5 What percentage of each variable is NA

# We are going to use the aggr function from the VIM package for this purpose
# The plot on the left shows the percentage of missing values for each variable
# The plot on the right shows the combinations of missing values across variables
# How to read the plot on the right:
      # The first blue row shows that 5 rows contain no NA
      # There are two rows where vec1 and vec4 contain NA at the same time
      # There is one row where only vec3 contains a missing value
      # There is one row where vec3 and vec4 both contain a missing value
      # There is one row where only vec2 contains a missing value
summary(aggr(dataset))

## 
##  Missings per variable: 
##  Variable Count
##      vec1     2
##      vec2     1
##      vec3     2
##      vec4     3
## 
##  Missings in combinations of variables: 
##  Combinations Count Percent
##       0:0:0:0     5      50
##       0:0:1:0     1      10
##       0:0:1:1     1      10
##       0:1:0:0     1      10
##       1:0:0:1     2      20
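
To see where the combination counts come from, the table above can be rebuilt by hand. The sketch below is a base R approximation and not how aggr is implemented: it encodes each row as a 0/1 pattern (1 meaning NA), tabulates the patterns, and also builds a matrix of how often each pair of variables is missing together. The names miss, pattern and co_missing are illustrative.

```r
# Same toy dataset as in the Data import section
dataset <- data.frame(
  vec1 = c(1, 2, 1, 2, 1, NA, NA, 1, 2, 3),
  vec2 = c(1, 2, 1, 2, 1, 2, 4, 5, 4, NA),
  vec3 = c(1, 2, 1, NA, NA, 1, 2, 3, 10, 10),
  vec4 = c(1, 2, 1, 2, NA, NA, NA, 1, 2, 3)
)

# 0/1 indicator matrix: 1 where the value is NA
miss <- is.na(dataset) * 1L

# One pattern string per row, e.g. "1:0:0:1" means vec1 and vec4 are NA
pattern <- apply(miss, 1, paste, collapse = ":")
pattern_count <- table(pattern)
pattern_count
100 * prop.table(pattern_count)  # same counts expressed as percentages

# Pairwise co-missingness: entry [i, j] is the number of rows where
# variables i and j are both NA; the diagonal holds the per-variable counts
co_missing <- crossprod(miss)
co_missing
```

In co_missing, an off-diagonal value close to the diagonal values of the two variables involved suggests that they are usually missing in the same rows, as with vec1 and vec4 here.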

6 Conclusion

In this tutorial we have learnt how to study the amount and distribution of our missing values. Statistics on missing values can be very valuable in a data quality process.

Study the distribution of NA: aggr function

Luis Serra @ Ubiqum Code Academy
