Understanding the problem and cleaning the dataset covers around 70% of our data analysis. In this notes we shall look at what are missing values and how can we handle them in a dataset before analyzing the data.
A data is collected from different sources by different persons. It is not always possible to get complete data where all the values of features for all observations are available. Therefore it is common to have some observations missing in the dataset.
Let us use tidyverse for data wrangling and modeest for finding mode
library(tidyverse)
library(modeest)
We shall use employee dataset which represents wage and other information of 11 persons
| Sl.No | Variable_name | Type |
|---|---|---|
| 1 | Age | int |
| 2 | Education | chr |
| 3 | Experience.in.years | int |
| 4 | Wage | int |
| 5 | Balance | int |
| 6 | Jobclass | chr |
Once we have a dataset with missing values we can either remove the observation or we can remove feature if it has more number of missing values when compared to the total number of values. If we have very less number of missing observations, we can replace them with some value based on available values of dataset.
In R missing values are stored as NA. There is a small difficulty that R only identifies missing values if the cell is blank in the dataset or it is mentioned as NA ( means ‘Not Available’) or NAn(Not a number).
Let us look at our dataset in hand
| Age | Education | Experience.in.years | Wage | Balance | Jobclass |
|---|---|---|---|---|---|
| 45 | HS | 15 | 35000 | 245000 | Information |
| 44 | PG | 20 | 38000 | 322000 | Industry |
| 38 | UG | 14 | 450000 | Not available | |
| 55 | HS | 20 | 30000 | NA | N/A |
| 60 | PG | NA | 595000 | N/A | |
| 38 | PHD | 1 | 12300 | na | |
| 50 | UG | 29 | 40000 | 99000 | Information |
| 49 | HS | 20 | 35500 | 150000 | |
| 65 | HS | 40 | 55800 | NA | |
| 39 | UG | NA | 38000 | NA | Industry |
| 50 | 28 | Not available | NA |
Here the cells which are blank and cell having the following values ‘NA’, ‘N/A’, ‘na’, ‘Not available’ are all missing values.
is.na() is our function for identifying missing observation in a dataset, it returns a data frame or vector based on input with values TRUE for indicating missing observation or FALSE otherwise.
is.na(da)
## Age Education Experience.in.years Wage Balance Jobclass
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE TRUE FALSE
## [5,] FALSE FALSE TRUE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE TRUE FALSE
## [10,] FALSE FALSE TRUE FALSE TRUE FALSE
## [11,] FALSE FALSE FALSE FALSE TRUE FALSE
Here we can see for some cells it has TRUE, whereas in original dataset we have more missing observations which are represented in various forms. To overcome this, we can use dplyr to mutate the missing values which are represented in different form to NA using replace() function.
Empty cell in a categorical/ Factor or character type variable is considered as a separate value which is taken as empty string. we need to change this to NA so that R can identify it as missing observation. For factor variables we need to convert it to character and then replace and finally back to factor. We have two variables Education and Jobclass which has different style of representation of missing value. Let us change those
da=da %>% mutate(Education=replace(Education,Education=="",NA))
da=da %>% mutate(Jobclass=replace(Jobclass,Jobclass=="Not available",NA),
Jobclass=replace(Jobclass,Jobclass=="N/A",NA),
Jobclass=replace(Jobclass,Jobclass=="na",NA),
Jobclass=replace(Jobclass,Jobclass=="",NA))
When a missing observation is represented using a string in a numeric variable, the whole variable will be considered as character in R. For this variable we replace missing observation first and then convert the variable to numeric
da=da %>% mutate(Wage=replace(Wage,Wage=="",NA),
Wage=replace(Wage,Wage=="Not available",NA),
Wage=as.numeric(Wage))
Let us now look at our modified dataset and also corresponding missing value indication
| Age | Education | Experience.in.years | Wage | Balance | Jobclass |
|---|---|---|---|---|---|
| 45 | HS | 15 | 35000 | 245000 | Information |
| 44 | PG | 20 | 38000 | 322000 | Industry |
| 38 | UG | 14 | NA | 450000 | NA |
| 55 | HS | 20 | 30000 | NA | NA |
| 60 | PG | NA | NA | 595000 | NA |
| 38 | PHD | 1 | NA | 12300 | NA |
| 50 | UG | 29 | 40000 | 99000 | Information |
| 49 | HS | 20 | 35500 | 150000 | NA |
| 65 | HS | 40 | 55800 | NA | NA |
| 39 | UG | NA | 38000 | NA | Industry |
| 50 | NA | 28 | NA | NA | NA |
| Age | Education | Experience.in.years | Wage | Balance | Jobclass |
|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | TRUE | FALSE | TRUE |
| FALSE | FALSE | FALSE | FALSE | TRUE | TRUE |
| FALSE | FALSE | TRUE | TRUE | FALSE | TRUE |
| FALSE | FALSE | FALSE | TRUE | FALSE | TRUE |
| FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
| FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| FALSE | FALSE | FALSE | FALSE | TRUE | TRUE |
| FALSE | FALSE | TRUE | FALSE | TRUE | FALSE |
| FALSE | TRUE | FALSE | TRUE | TRUE | TRUE |
Once we have prepared data in a form where missing values are clearly indicated as NA we can count
sum(is.na(da))
## [1] 18
colSums(is.na(da))
## Age Education Experience.in.years Wage
## 0 1 2 4
## Balance Jobclass
## 4 7
rowSums(is.na(da))
## [1] 0 0 2 2 3 2 0 1 2 2 4
We can compare this number of missing values with total number of rows and columns and if the proportion missing values is high with total number of values we can exclude the specific row or column from the dataset.
dim(da)
## [1] 11 6
Since the last row has 4 missing values out of 6, we can remove the single row and store it as another data
da1=da[,-6]
While attempting to find Numerical summaries like mean,median,variance etc. for variables with NA we will get error, to overcome this we use argument na.rm=T
mean(da$Experience.in.years,na.rm=T)
## [1] 20.77778
median(da$Balance,na.rm=T)
## [1] 245000
var(da$Wage,na.rm=T)
## [1] 65736667
We can remove all the rows in a dataset which has missing values using na.omit().we shall remove all rows with missing values and store it as separate data.
da2=na.omit(da)
dim(da2)
## [1] 3 6
We can see that this dataset is much smaller when compared to original data
If we remove rows or columns having missing values we lose many information from the dataset, to avoid this, we can replace missing observation with some value, let us see some simple techniques for replacement
For numerical data we can replace missing observation with mean value if all values are closer or median if some values are extreme. For this we can use either base::replace()``** or **tidyr::replace_na()``.
da=da %>% mutate(Experience.in.years=
replace(Experience.in.years,
is.na(Experience.in.years),
mean(Experience.in.years,na.rm=T)
)
)
For categorical data we can replace missing observation with mode
da=da %>% mutate(Jobclass=replace_na(Jobclass,mfv1(Jobclass,na_rm=T)))
We have discussed what are missing values, converting various representations to Standard NA. Identifying missing observations. Removing missing rows or columns and finally how to replace them with suitable values for analysis. More methods can be used for replacing and those can be learned once we master basic ones.