Data Preparation and Missing Values

In data analysis we often come across datasets with missing values. In this post, I will try out different ways of handling them.

I will use the iris dataset. It does not have any missing values, so I will insert some NA values for the purpose of this post and then show how to work with them.

library(mice)      # multiple imputation
library(Amelia)    # missmap()
library(tidyverse) # attaches dplyr and tidyr, among others
data <- iris

# Insert four NAs into each of the two sepal columns
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA

str(data)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 NA NA NA 5.4 4.6 5 4.4 NA ...
##  $ Sepal.Width : num  NA NA NA 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.125   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.872   Mean   :3.049   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##  NA's   :4       NA's   :4                                      
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##                 
## 
missmap(data)

mean(complete.cases(data))
## [1] 0.9533333

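As we can see, one package we can use is Amelia. Its missmap() function draws a plot that shows in a very simple way where values are missing and what percentage of the data is missing. The complete.cases() result tells us that about 95% of the rows have no missing values.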
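The same information can also be read as plain numbers. The following is a minimal base R sketch, not part of the original analysis; it rebuilds the example data so that it runs on its own, and the variable names (na_count, na_percent, incomplete_rows) are only illustrative.

```r
# Rebuild the example data: iris with a few NAs inserted by hand
data <- iris
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA

# Number of missing values in each column
na_count <- colSums(is.na(data))

# Percentage of missing values in each column
na_percent <- round(100 * colMeans(is.na(data)), 2)

# Row numbers that contain at least one missing value
incomplete_rows <- which(!complete.cases(data))

print(na_count)
print(na_percent)
print(incomplete_rows)
```

Knowing which rows and columns are affected helps when deciding between dropping rows and filling in values.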

There are different ways to deal with missing data, and the right one depends on how much the missing observations matter for our analysis. Sometimes you won’t need them, and you can simply drop the rows that contain missing values.

drop_na() - replace_na()

df <- drop_na(data)

missmap(df)

# Replace NAs with 0 in the numeric columns only.
# Species is a factor and 0 is not one of its levels, so it is left untouched.
df2 <- data %>% mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
missmap(df2)
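To get a feeling for what each of these two approaches costs, here is a small sketch in base R. It is only an illustration under the same setup as above: na.omit() stands in for drop_na(), and the names dropped, zero_filled, rows_lost, mean_observed and mean_zero are made up for this example.

```r
# Rebuild the example data: iris with a few NAs inserted by hand
data <- iris
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA

# Option 1: drop every row with at least one NA
# (na.omit() is the base R counterpart of tidyr::drop_na())
dropped <- na.omit(data)
rows_lost <- nrow(data) - nrow(dropped)

# Option 2: fill the NAs in Sepal.Length with 0
zero_filled <- data
zero_filled$Sepal.Length[is.na(zero_filled$Sepal.Length)] <- 0

# Compare the column mean before and after the zero fill
mean_observed <- mean(data$Sepal.Length, na.rm = TRUE)
mean_zero <- mean(zero_filled$Sepal.Length)

print(rows_lost)
print(c(observed = mean_observed, zero_filled = mean_zero))
```

Dropping rows throws away the observed values in those rows as well, while filling with 0 pulls the mean of the column down, because no real sepal is 0 cm long.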

Imputation of the data (Mice Package)

Sometimes those missing observations are very important for our analysis, and we can’t just drop the rows because that would affect the results too much. In that case we can impute the values, for example with the mean or median of the column, or with a package such as “mice”.
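Mean or median imputation needs no extra package. The sketch below is one possible way to do it in base R; impute_column() is a hypothetical helper written for this post, not a function from mice or the tidyverse.

```r
# Rebuild the example data: iris with a few NAs inserted by hand
data <- iris
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA

# Hypothetical helper: replace the NAs in a numeric vector with a
# summary statistic (mean by default) of its observed values
impute_column <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

simple_imputed <- data
simple_imputed$Sepal.Length <- impute_column(data$Sepal.Length, mean)
simple_imputed$Sepal.Width <- impute_column(data$Sepal.Width, median)

# No missing values should be left
print(anyNA(simple_imputed))
```

This is quick, but every missing value in a column gets the same number, which shrinks the spread of that column. That is one reason to look at a method such as predictive mean matching in mice.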

missmap(data)

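# Predictive mean matching, 5 imputed datasets, 50 iterations.
# The result is stored as `imp` so it does not shadow base R's t().
imp <- mice(data, m = 5, maxit = 50, method = "pmm", seed = 500, printFlag = FALSE)

# complete() returns the first of the imputed datasets by default;
# the mice:: prefix avoids any clash with tidyr's complete()
imputed.data <- mice::complete(imp)

missmap(imputed.data)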

Conclusion

As we can see, there are several ways to handle a dataset that has missing values. Depending on how important the missing observations are, you can drop those rows, replace the missing values with a constant or with the mean or median, or impute them with a package such as mice.