The goal of this tutorial is to find missing values in a factor variable.
# This example uses iris, the classic plant classification data set that ships with R.
data("iris")
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##                 
# First we are going to introduce different kinds of missing values.
# We start with real missing values (NA) in 5 randomly chosen rows
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 42 69 89 125 145
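# Side note: sample() draws at random, so the indices printed in this tutorial
# will differ from run to run. A minimal sketch of a reproducible draw, assuming
# an arbitrary seed (123, first_draw and second_draw are hypothetical choices,
# not part of the original tutorial):
set.seed(123)
first_draw <- sample(1:150, 5)
set.seed(123)
second_draw <- sample(1:150, 5)
# Resetting the same seed reproduces the same draw
identical(first_draw, second_draw)
## [1] TRUE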
# Assigning a value that is not among the factor's levels (here the string "NA") produces a real NA, along with a warning
iris$Species[na_index] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, na_index, value = structure(c(1L, 1L,
## 1L, : invalid factor level, NA generated
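# The warning above appears because the string "NA" is not a level of the factor.
# A minimal sketch on a toy factor (toy_species is a hypothetical name, not part
# of the tutorial data): assigning the NA constant itself gives a missing value
# without any warning.
toy_species <- factor(c("setosa", "versicolor", "virginica"))
toy_species[2] <- NA      # NA is always a valid value for a factor
is.na(toy_species)
## [1] FALSE  TRUE FALSE
levels(toy_species)       # the levels themselves are unchanged
## [1] "setosa"     "versicolor" "virginica"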
# We convert the variable to character so that we can insert arbitrary placeholder strings
iris$Species <- as.character(iris$Species)
# Let's pick 5 random rows in which to introduce missing values
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 127 112 57 81 116
# Now we introduce missing values coded as "?"
iris$Species[na_index] <- "?"
# Let's repeat the same procedure to introduce " ":
# again we pick 5 random rows
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 78 105 132 67 38
# Now we introduce missing values coded as " "
iris$Species[na_index] <- " "
# Finally we convert the variable back to a factor
iris$Species <- factor(iris$Species)
# We can examine the levels of the factor to see if something shows up
levels(iris$Species)
## [1] " " "?" "setosa" "versicolor" "virginica"
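# levels() lists the distinct values, but not how often each one occurs, and it
# never shows NA. A minimal sketch with table() on a toy factor (toy_levels and
# toy_counts are hypothetical names): useNA = "ifany" adds a count for NA.
toy_levels <- factor(c("setosa", "?", " ", NA, "setosa"))
toy_counts <- table(toy_levels, useNA = "ifany")
toy_counts
# The same call on iris$Species would count " ", "?", the three species and NA.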
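# And we can search for these values (%in% tests each row against both levels)
which(iris$Species %in% levels(iris$Species)[c(1,2)])
## [1]  38  57  67  78  81 105 112 116 127 132
# We can see the rows where the missing values are located
iris[which(iris$Species %in% levels(iris$Species)[c(1,2)]), ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 38           4.9         3.6          1.4         0.1        
## 57           6.3         3.3          4.7         1.6       ?
## 67           5.6         3.0          4.5         1.5        
## 78           6.7         3.0          5.0         1.7        
## 81           5.5         2.4          3.8         1.1       ?
## 105          6.5         3.0          5.8         2.2        
## 112          6.4         2.7          5.3         1.9       ?
## 116          6.4         3.2          5.3         2.3       ?
## 127          6.2         2.8          4.8         1.8       ?
## 132          7.9         3.8          6.4         2.0        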
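# Side note on the comparison operator, as a minimal sketch on a toy factor
# (toy_flags and placeholders are hypothetical names): == compares element by
# element and recycles the shorter vector, so it can silently miss matches,
# while %in% tests membership in the whole set.
toy_flags <- factor(c("?", " ", "setosa", "?"))
placeholders <- c(" ", "?")
# Recycled comparison: "?" vs " ", " " vs "?", "setosa" vs " ", "?" vs "?"
which(toy_flags == placeholders)
## [1] 4
# Membership test: finds every placeholder
which(toy_flags %in% placeholders)
## [1] 1 2 4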
# Real missing values (NA) can be found directly with is.na()
which(is.na(iris$Species))
## [1] 42 69 89 125 145
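# is.na() only finds the real NA values, not the " " and "?" placeholders.
# One possible next step, sketched on a toy factor (toy_mixed is a hypothetical
# name; this step is an assumption, not part of the original tutorial): recode
# the placeholders as NA and drop their now unused levels.
toy_mixed <- factor(c("setosa", "?", " ", NA, "virginica"))
toy_mixed[toy_mixed %in% c(" ", "?")] <- NA
toy_mixed <- droplevels(toy_mixed)
levels(toy_mixed)
## [1] "setosa"    "virginica"
# All three kinds of missing value are now found by is.na()
which(is.na(toy_mixed))
## [1] 2 3 4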
In this tutorial we have learned how to locate missing values in a factor variable, both real NA values and placeholder codes such as "?" and " ".