The goal of this tutorial is to find missing values in a factor variable.
# In this example we use the built-in iris dataset, which classifies iris flowers into three species.
data("iris")
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# We are going to introduce different kinds of missing values,
# starting with real missing values (NA) in 5 random rows
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 118 86 75 56 148
# Assigning a value that is not one of the factor's levels (here the string "NA")
# produces a real missing value, with a warning
iris$Species[na_index] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, na_index, value = structure(c(1L, 1L,
## 1L, : invalid factor level, NA generated
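The warning above appears because the string "NA" is not one of the factor's levels. As a minimal sketch on a toy factor (the vector `f` is illustrative and not part of the tutorial data), assigning the constant `NA` marks an element as missing without any warning:

```r
# Toy factor used only for illustration
f <- factor(c("setosa", "versicolor", "virginica", "setosa"))

# Assigning the NA constant marks the element as missing, with no warning
f[2] <- NA

# The levels are unchanged; only the second element is now missing
levels(f)
is.na(f)
```
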
# We convert the variable to character so that arbitrary strings can be stored as missing-value placeholders
iris$Species <- as.character(iris$Species)
# Let's create 5 random rows to introduce missing values
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 66 103 41 141 55
# Now we introduce missing values in the form "?"
iris$Species[na_index] <- "?"
# Let's repeat the same procedure with 5 new random rows to introduce " "
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 30 48 137 43 60
# Now we introduce missing values in the form " "
iris$Species[na_index] <- " "
# Finally we convert the variable back to a factor
iris$Species <- factor(iris$Species)
# We can examine the levels of the factor to see if something shows up
levels(iris$Species)
## [1] " " "?" "setosa" "versicolor" "virginica"
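Besides listing the levels, it can help to count how often each one occurs, including real `NA` entries. The following is a small sketch on a toy factor (the vector `f` is an assumption made for illustration), using `table()` with `useNA = "ifany"`:

```r
# Toy factor mixing valid species, placeholder strings and a real NA
f <- factor(c("setosa", "?", " ", "setosa", NA, "virginica"))

# Count every level; useNA = "ifany" adds a <NA> column when NAs are present
counts <- table(f, useNA = "ifany")
counts
```

Placeholder levels with small counts next to well-populated valid levels are a hint that they encode missing data.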
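# And we can search for these values; %in% is used because == would recycle
# the two levels along the column and miss most of the matches
which(iris$Species %in% levels(iris$Species)[c(1,2)])
## [1] 30 41 43 48 55 60 66 103 137 141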
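When searching for more than one value, note that `==` recycles the shorter vector element by element, whereas `%in%` tests membership in the whole set. A minimal sketch on a toy character vector (the names `x` and `placeholders` are illustrative):

```r
x <- c("?", " ", "setosa", " ", "?", "?")
placeholders <- c(" ", "?")

# `==` compares position 1 with " ", position 2 with "?", position 3 with " ", ...
by_eq <- which(x == placeholders)
by_eq   # 6

# `%in%` checks every element of x against both placeholders
by_in <- which(x %in% placeholders)
by_in   # 1 2 4 5 6
```
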
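# We can see the rows where the missing values are located
# (%in% matches either placeholder level in every row)
iris[which(iris$Species %in% levels(iris$Species)[c(1,2)]), ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 30 4.7 3.2 1.6 0.2
## 41 5.0 3.5 1.3 0.3 ?
## 43 4.4 3.2 1.3 0.2
## 48 4.6 3.2 1.4 0.2
## 55 6.5 2.8 4.6 1.5 ?
## 60 5.2 2.7 3.9 1.4
## 66 6.7 3.1 4.4 1.4 ?
## 103 7.1 3.0 5.9 2.1 ?
## 137 6.3 3.4 5.6 2.4
## 141 6.7 3.1 5.6 2.4 ?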
# Real missing values (NA) can be found directly with is.na()
which(is.na(iris$Species))
## [1] 56 75 86 118 148
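Once the placeholder strings are known, one possible next step is to turn them into real `NA` values so that a single `is.na()` call finds everything. The sketch below works on a toy factor; the names `species` and `placeholders`, and the choice of "?" and " " as placeholders, are assumptions for illustration:

```r
# Toy factor with valid species, two placeholder strings and a real NA
species <- factor(c("setosa", "?", " ", "virginica", NA, "setosa"))
placeholders <- c("?", " ")   # assumed placeholder strings

# Mark the placeholder entries as missing, then drop the now-unused levels
species[species %in% placeholders] <- NA
species <- droplevels(species)

levels(species)
which(is.na(species))
```

After this step only the valid species remain as levels, and the placeholder rows are reported together with the original `NA` rows.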
In this tutorial we have learned how to locate missing values in a factor variable, both real NA values and placeholder strings such as "?" and " ".