1 Preamble

Understanding the problem and cleaning the dataset covers around 70% of our data analysis. In this notes we shall look at what are missing values and how can we handle them in a dataset before analyzing the data.

2 Missing observations

A data is collected from different sources by different persons. It is not always possible to get complete data where all the values of features for all observations are available. Therefore it is common to have some observations missing in the dataset.

3 Dataset and pakages used

Let us use tidyverse for data wrangling and modeest for finding mode

library(tidyverse)
library(modeest)

We shall use employee dataset which represents wage and other information of 11 persons

Sl.No	Variable_name	Type
1	Age	int
2	Education	chr
3	Experience.in.years	int
4	Wage	int
5	Balance	int
6	Jobclass	chr

4 Handling missing observations

Once we have a dataset with missing values we can either remove the observation or we can remove feature if it has more number of missing values when compared to the total number of values. If we have very less number of missing observations, we can replace them with some value based on available values of dataset.

In R missing values are stored as NA. There is a small difficulty that R only identifies missing values if the cell is blank in the dataset or it is mentioned as NA ( means ‘Not Available’) or NAn(Not a number).

Let us look at our dataset in hand

Age	Education	Experience.in.years	Wage	Balance	Jobclass
45	HS	15	35000	245000	Information
44	PG	20	38000	322000	Industry
38	UG	14		450000	Not available
55	HS	20	30000	NA	N/A
60	PG	NA		595000	N/A
38	PHD	1		12300	na
50	UG	29	40000	99000	Information
49	HS	20	35500	150000
65	HS	40	55800	NA
39	UG	NA	38000	NA	Industry
50		28	Not available	NA

Here the cells which are blank and cell having the following values ‘NA’, ‘N/A’, ‘na’, ‘Not available’ are all missing values.

4.1 Finding existence of missing value

is.na() is our function for identifying missing observation in a dataset, it returns a data frame or vector based on input with values TRUE for indicating missing observation or FALSE otherwise.

is.na(da)

##         Age Education Experience.in.years  Wage Balance Jobclass
##  [1,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [2,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [3,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [4,] FALSE     FALSE               FALSE FALSE    TRUE    FALSE
##  [5,] FALSE     FALSE                TRUE FALSE   FALSE    FALSE
##  [6,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [7,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [8,] FALSE     FALSE               FALSE FALSE   FALSE    FALSE
##  [9,] FALSE     FALSE               FALSE FALSE    TRUE    FALSE
## [10,] FALSE     FALSE                TRUE FALSE    TRUE    FALSE
## [11,] FALSE     FALSE               FALSE FALSE    TRUE    FALSE

Here we can see for some cells it has TRUE, whereas in original dataset we have more missing observations which are represented in various forms. To overcome this, we can use dplyr to mutate the missing values which are represented in different form to NA using replace() function.

4.1.1 Conversion for missing observation

4.1.1.1 Empty string in Character/Factor type

Empty cell in a categorical/ Factor or character type variable is considered as a separate value which is taken as empty string. we need to change this to NA so that R can identify it as missing observation. For factor variables we need to convert it to character and then replace and finally back to factor. We have two variables Education and Jobclass which has different style of representation of missing value. Let us change those

da=da %>% mutate(Education=replace(Education,Education=="",NA))
da=da %>% mutate(Jobclass=replace(Jobclass,Jobclass=="Not available",NA),
                 Jobclass=replace(Jobclass,Jobclass=="N/A",NA),
                 Jobclass=replace(Jobclass,Jobclass=="na",NA),
                 Jobclass=replace(Jobclass,Jobclass=="",NA))

4.1.1.2 Missing value represented as string in a numeric variable

When a missing observation is represented using a string in a numeric variable, the whole variable will be considered as character in R. For this variable we replace missing observation first and then convert the variable to numeric

da=da %>% mutate(Wage=replace(Wage,Wage=="",NA),
                 Wage=replace(Wage,Wage=="Not available",NA),
                 Wage=as.numeric(Wage))

Let us now look at our modified dataset and also corresponding missing value indication

Age	Education	Experience.in.years	Wage	Balance	Jobclass
45	HS	15	35000	245000	Information
44	PG	20	38000	322000	Industry
38	UG	14	NA	450000	NA
55	HS	20	30000	NA	NA
60	PG	NA	NA	595000	NA
38	PHD	1	NA	12300	NA
50	UG	29	40000	99000	Information
49	HS	20	35500	150000	NA
65	HS	40	55800	NA	NA
39	UG	NA	38000	NA	Industry
50	NA	28	NA	NA	NA

Age	Education	Experience.in.years	Wage	Balance	Jobclass
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	TRUE	FALSE	TRUE
FALSE	FALSE	FALSE	FALSE	TRUE	TRUE
FALSE	FALSE	TRUE	TRUE	FALSE	TRUE
FALSE	FALSE	FALSE	TRUE	FALSE	TRUE
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
FALSE	FALSE	FALSE	FALSE	FALSE	TRUE
FALSE	FALSE	FALSE	FALSE	TRUE	TRUE
FALSE	FALSE	TRUE	FALSE	TRUE	FALSE
FALSE	TRUE	FALSE	TRUE	TRUE	TRUE

4.1.2 Counting missing observations

Once we have prepared data in a form where missing values are clearly indicated as NA we can count

Total number of missing values

sum(is.na(da))

## [1] 18

Number of missing values in each column

colSums(is.na(da))

##                 Age           Education Experience.in.years                Wage 
##                   0                   1                   2                   4 
##             Balance            Jobclass 
##                   4                   7

Number of missing values in each row

rowSums(is.na(da))

##  [1] 0 0 2 2 3 2 0 1 2 2 4

We can compare this number of missing values with total number of rows and columns and if the proportion missing values is high with total number of values we can exclude the specific row or column from the dataset.

dim(da)

## [1] 11  6

Since the last row has 4 missing values out of 6, we can remove the single row and store it as another data

da1=da[,-6]

4.1.3 Finding numerical summary if missing values are present

While attempting to find Numerical summaries like mean,median,variance etc. for variables with NA we will get error, to overcome this we use argument na.rm=T

mean(da$Experience.in.years,na.rm=T)

## [1] 20.77778

median(da$Balance,na.rm=T)

## [1] 245000

var(da$Wage,na.rm=T)

## [1] 65736667

4.1.4 Removing rows which have missing values

We can remove all the rows in a dataset which has missing values using na.omit().we shall remove all rows with missing values and store it as separate data.

da2=na.omit(da)
dim(da2)

## [1] 3 6

We can see that this dataset is much smaller when compared to original data

4.1.5 Replacing NA with values

If we remove rows or columns having missing values we lose many information from the dataset, to avoid this, we can replace missing observation with some value, let us see some simple techniques for replacement

4.1.6 Replacing NA with values central tendecy for numeric data

For numerical data we can replace missing observation with mean value if all values are closer or median if some values are extreme. For this we can use either base::replace()``** or **tidyr::replace_na()``.

da=da %>% mutate(Experience.in.years=
                   replace(Experience.in.years,
                           is.na(Experience.in.years),
                           mean(Experience.in.years,na.rm=T)
                           )
                 )

4.1.7 Replacing NA with most frequent values for Categorical data

For categorical data we can replace missing observation with mode

da=da %>% mutate(Jobclass=replace_na(Jobclass,mfv1(Jobclass,na_rm=T)))

5 Final note

We have discussed what are missing values, converting various representations to Standard NA. Identifying missing observations. Removing missing rows or columns and finally how to replace them with suitable values for analysis. More methods can be used for replacing and those can be learned once we master basic ones.

Handling missing values using R

RSN_Thinklabz_Chennai

Website https://www.thinklabz.ai

11/09/2021