When doing data analysis, we often run into datasets that have missing values. In this blog post, I will try different ways of handling them.
I will use the iris dataset. This dataset doesn't have missing values, so I will insert some NA values myself to show how to work with missing data.
library(mice)
library(dplyr)
library(Amelia)
library(tidyr)
library(tidyverse)
data <- iris
# Set four values in each sepal column to NA
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA
str(data)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 NA NA NA 5.4 4.6 5 4.4 NA ...
## $ Sepal.Width : num NA NA NA 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.125 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.872 Mean :3.049 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## NA's :4 NA's :4
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
##
missmap(data)
mean(complete.cases(data))
## [1] 0.9533333
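As we can see, one package we can use is Amelia, whose missmap() function draws a graph that shows at a glance the percentage of missing data. The result of mean(complete.cases(data)) also tells us that about 95% of the rows are complete.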
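The same information is also available as numbers. As a minimal sketch in base R (the data frame is rebuilt so the snippet runs on its own, and `na_count` and `na_percent` are names chosen only for illustration), the count and percentage of missing values per column could be computed like this:

```r
# Rebuild the example data with the same NAs inserted earlier
data <- iris
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA

# Number of missing values in each column
na_count <- colSums(is.na(data))

# Percentage of missing values in each column, rounded to 2 decimals
na_percent <- round(100 * colMeans(is.na(data)), 2)

print(na_count)
print(na_percent)
```

This gives a per-column view, while mean(complete.cases(data)) summarizes completeness per row.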
There are different ways to deal with missing data, and the right one depends on how much the missing observations matter for our analysis. Sometimes you don't need those observations, and you can simply drop the rows that contain missing values.
df <- drop_na(data)
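missmap(df)
# Another option is to replace the NAs with a fixed value such as 0.
# Only numeric columns are touched: applying replace_na() with 0 to the
# Species factor raises an "invalid factor level" warning.
df2 <- data %>% mutate(across(where(is.numeric), ~ replace_na(.x, 0)))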
missmap(df2)
Sometimes the missing observations are very important for our analysis, and we can't just drop those rows because doing so would affect the results too much. We can then impute the values, for example with the mean or median, or by using a package such as mice.
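Mean or median imputation needs no extra package. The following is a minimal sketch in base R, not a definitive implementation: `impute_with` is a hypothetical helper written for this example, and the data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the example data with the same NAs as before
data <- iris
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA

# Hypothetical helper: replace the NAs in a numeric vector with a summary
# statistic (the mean by default) computed from the observed values
impute_with <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

df_mean <- data
df_mean$Sepal.Length <- impute_with(df_mean$Sepal.Length)         # mean
df_mean$Sepal.Width  <- impute_with(df_mean$Sepal.Width, median)  # median

# Count the missing values that remain after imputation
sum(is.na(df_mean))
```

Mean imputation keeps the column mean unchanged but reduces its variance, which is one reason to consider a model-based approach such as the mice call that follows.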
missmap(data)
# Impute with predictive mean matching: 5 imputed datasets, up to 50 iterations
imp <- mice(data, m = 5, maxit = 50, method = 'pmm', seed = 500, printFlag = FALSE)
# Extract the first completed dataset; the call is namespace-qualified
# because tidyr also exports a function named complete()
imputed.data <- mice::complete(imp)
missmap(imputed.data)
As we can see, there are several ways to handle a dataset that has missing values. Depending on how important the missing observations are, you can drop those rows, replace the missing values with a fixed value, or impute them, for example with the mean or median or with a package such as mice.
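One way to weigh these options is to check how each one changes a simple statistic. The sketch below uses only base R and compares the mean of Sepal.Length in the untouched iris data with the mean after dropping rows, filling with 0, and filling with the observed mean. The variable names are illustrative, and mice is left out so that the snippet needs no extra packages.

```r
# Rebuild the example data with the same NAs as before
data <- iris
data$Sepal.Length[c(3, 4, 5, 10)] <- NA
data$Sepal.Width[c(1, 2, 3, 11)] <- NA

x <- data$Sepal.Length

dropped   <- na.omit(data)$Sepal.Length                  # rows with any NA removed
zero_fill <- ifelse(is.na(x), 0, x)                      # NAs replaced by 0
mean_fill <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # NAs replaced by the mean

# Mean of Sepal.Length under each strategy, next to the original data
comparison <- c(
  original  = mean(iris$Sepal.Length),
  dropped   = mean(dropped),
  zero_fill = mean(zero_fill),
  mean_fill = mean(mean_fill)
)
print(round(comparison, 3))
```

Filling with 0 pulls the mean down because 0 lies far below every observed sepal length, so a fixed value should be chosen with care.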