1 Preamble

Understanding the problem and cleaning the dataset covers around 70% of our data analysis. In this notes we shall look at what are missing values and how can we handle them in a dataset before analyzing the data.

2 Missing observations

A data is collected from different sources by different persons. It is not always possible to get complete data where all the values of features for all observations are available. Therefore it is common to have some observations missing in the dataset.

3 Dataset and pakages used

Let us use tidyverse for data wrangling and modeest for finding mode

library(tidyverse)
library(modeest)

We shall use employee dataset which represents wage and other information of 11 persons

Sl.No Variable_name Type
1 Age int
2 Education chr
3 Experience.in.years int
4 Wage int
5 Balance int
6 Jobclass chr

4 Handling missing observations

Once we have a dataset with missing values we can either remove the observation or we can remove feature if it has more number of missing values when compared to the total number of values. If we have very less number of missing observations, we can replace them with some value based on available values of dataset.

In R missing values are stored as NA. There is a small difficulty that R only identifies missing values if the cell is blank in the dataset or it is mentioned as NA ( means ‘Not Available’) or NAn(Not a number).

Let us look at our dataset in hand

Age Education Experience.in.years Wage Balance Jobclass
45 HS 15 35000 245000 Information
44 PG 20 38000 322000 Industry
38 UG 14 450000 Not available
55 HS 20 30000 NA N/A
60 PG NA 595000 N/A
38 PHD 1 12300 na
50 UG 29 40000 99000 Information
49 HS 20 35500 150000
65 HS 40 55800 NA
39 UG NA 38000 NA Industry
50 28 Not available NA

Here the cells which are blank and cell having the following values ‘NA’, ‘N/A’, ‘na’, ‘Not available’ are all missing values.

4.1 Finding existence of missing value

is.na() is our function for identifying missing observation in a dataset, it returns a data frame or vector based on input with values TRUE for indicating missing observation or FALSE otherwise.

is.na(da)
##         Age Education Experience.in.years  Wage Balance Jobclass
##  [1,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [2,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [3,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [4,] FALSE     FALSE               FALSE FALSE    TRUE    FALSE
##  [5,] FALSE     FALSE                TRUE FALSE   FALSE    FALSE
##  [6,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [7,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [8,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [9,] FALSE     FALSE               FALSE FALSE    TRUE    FALSE
## [10,] FALSE     FALSE                TRUE FALSE    TRUE    FALSE
## [11,] FALSE     FALSE               FALSE FALSE    TRUE    FALSE

Here we can see for some cells it has TRUE, whereas in original dataset we have more missing observations which are represented in various forms. To overcome this, we can use dplyr to mutate the missing values which are represented in different form to NA using replace() function.

4.1.1 Conversion for missing observation

4.1.1.1 Empty string in Character/Factor type

Empty cell in a categorical/ Factor or character type variable is considered as a separate value which is taken as empty string. we need to change this to NA so that R can identify it as missing observation. For factor variables we need to convert it to character and then replace and finally back to factor. We have two variables Education and Jobclass which has different style of representation of missing value. Let us change those

da=da %>% mutate(Education=replace(Education,Education=="",NA))
da=da %>% mutate(Jobclass=replace(Jobclass,Jobclass=="Not available",NA),
                 Jobclass=replace(Jobclass,Jobclass=="N/A",NA),
                 Jobclass=replace(Jobclass,Jobclass=="na",NA),
                 Jobclass=replace(Jobclass,Jobclass=="",NA))

4.1.1.2 Missing value represented as string in a numeric variable

When a missing observation is represented using a string in a numeric variable, the whole variable will be considered as character in R. For this variable we replace missing observation first and then convert the variable to numeric

da=da %>% mutate(Wage=replace(Wage,Wage=="",NA),
                 Wage=replace(Wage,Wage=="Not available",NA),
                 Wage=as.numeric(Wage))
Let us now look at our modified dataset and also corresponding missing value indication
Age Education Experience.in.years Wage Balance Jobclass
45 HS 15 35000 245000 Information
44 PG 20 38000 322000 Industry
38 UG 14 NA 450000 NA
55 HS 20 30000 NA NA
60 PG NA NA 595000 NA
38 PHD 1 NA 12300 NA
50 UG 29 40000 99000 Information
49 HS 20 35500 150000 NA
65 HS 40 55800 NA NA
39 UG NA 38000 NA Industry
50 NA 28 NA NA NA
Age Education Experience.in.years Wage Balance Jobclass
FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE TRUE FALSE TRUE
FALSE FALSE FALSE FALSE TRUE TRUE
FALSE FALSE TRUE TRUE FALSE TRUE
FALSE FALSE FALSE TRUE FALSE TRUE
FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE TRUE
FALSE FALSE FALSE FALSE TRUE TRUE
FALSE FALSE TRUE FALSE TRUE FALSE
FALSE TRUE FALSE TRUE TRUE TRUE

4.1.2 Counting missing observations

Once we have prepared data in a form where missing values are clearly indicated as NA we can count

  1. Total number of missing values
sum(is.na(da))
## [1] 18
  1. Number of missing values in each column
colSums(is.na(da))
##                 Age           Education Experience.in.years                Wage 
##                   0                   1                   2                   4 
##             Balance            Jobclass 
##                   4                   7
  1. Number of missing values in each row
rowSums(is.na(da))
##  [1] 0 0 2 2 3 2 0 1 2 2 4

We can compare this number of missing values with total number of rows and columns and if the proportion missing values is high with total number of values we can exclude the specific row or column from the dataset.

dim(da)
## [1] 11  6

Since the last row has 4 missing values out of 6, we can remove the single row and store it as another data

da1=da[,-6]

4.1.3 Finding numerical summary if missing values are present

While attempting to find Numerical summaries like mean,median,variance etc. for variables with NA we will get error, to overcome this we use argument na.rm=T

mean(da$Experience.in.years,na.rm=T)
## [1] 20.77778
median(da$Balance,na.rm=T)
## [1] 245000
var(da$Wage,na.rm=T)
## [1] 65736667

4.1.4 Removing rows which have missing values

We can remove all the rows in a dataset which has missing values using na.omit().we shall remove all rows with missing values and store it as separate data.

da2=na.omit(da)
dim(da2)
## [1] 3 6

We can see that this dataset is much smaller when compared to original data

4.1.5 Replacing NA with values

If we remove rows or columns having missing values we lose many information from the dataset, to avoid this, we can replace missing observation with some value, let us see some simple techniques for replacement

4.1.6 Replacing NA with values central tendecy for numeric data

For numerical data we can replace missing observation with mean value if all values are closer or median if some values are extreme. For this we can use either base::replace()``** or **tidyr::replace_na()``.

da=da %>% mutate(Experience.in.years=
                   replace(Experience.in.years,
                           is.na(Experience.in.years),
                           mean(Experience.in.years,na.rm=T)
                           )
                 )

4.1.7 Replacing NA with most frequent values for Categorical data

For categorical data we can replace missing observation with mode

da=da %>% mutate(Jobclass=replace_na(Jobclass,mfv1(Jobclass,na_rm=T))) 

5 Final note

We have discussed what are missing values, converting various representations to Standard NA. Identifying missing observations. Removing missing rows or columns and finally how to replace them with suitable values for analysis. More methods can be used for replacing and those can be learned once we master basic ones.