# 1 Goal

The goal of this tutorial is to find missing values in a factor variable.

# 2 Data import

# In this example we use the iris plant classification data set that ships with R.
data("iris")
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
## 

# 3 Introducing missing values

# We are going to introduce different kinds of missing values.
# First, we introduce real missing values (NA) in 5 random rows
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 118  86  75  56 148
# We generate a missing value by introducing a value that is not in the levels
iris$Species[na_index] <- "NA"
## Warning in [<-.factor(*tmp*, na_index, value = structure(c(1L, 1L,
## 1L, : invalid factor level, NA generated
# We change the variable to character to introduce the missing values
iris$Species <- as.character(iris$Species)
# Let's create 5 random rows to introduce missing values
na_index <- sample(1:nrow(iris), 5)
na_index
## [1]  66 103  41 141  55
# Now we introduce missing values with the form ?
iris$Species[na_index] <- "?"
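
The "invalid factor level" warning earlier in this section appears because a factor only accepts values that are already among its levels; anything else is stored as NA. A minimal sketch on a toy factor (the names `f` and `g` and the data are hypothetical, not part of the iris example) shows that assigning the string "NA" and assigning NA itself end up with the same result:

```r
# Toy factors (hypothetical data, not the iris column)
f <- factor(c("a", "b", "a"))
g <- factor(c("a", "b", "a"))

# "NA" is a string that is not among the levels, so R stores a real NA
# and emits the "invalid factor level, NA generated" warning
f[1] <- "NA"

# Assigning NA directly stores the same missing value without a warning
g[1] <- NA

is.na(f)         # first element is missing
levels(f)        # the levels are unchanged: no "NA" level was created
identical(f, g)  # both assignments give the same factor
```

Assigning NA directly is usually the clearer choice when the intent is simply to mark a value as missing.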

# Let's repeat the same procedure to introduce " "
# Let's create 5 random rows to introduce missing values
na_index <- sample(1:nrow(iris), 5)
na_index
## [1]  30  48 137  43  60
# Now we introduce missing values with the form " "
iris$Species[na_index] <- " "
# Finally we go back to factor
iris$Species <- factor(iris$Species)

# 4 Finding missing values in factor

# We can examine the levels of the factor to see if something shows up
levels(iris$Species)
## [1] " "          "?"          "setosa"     "versicolor" "virginica"
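# And we can search for these values. We use %in% rather than ==, because
# == recycles the two levels along the column and misses most of the matches
which(iris$Species %in% levels(iris$Species)[c(1,2)])
##  [1]  30  41  43  48  55  60  66 103 137 141
# We can see the rows where the missing values are located
iris[which(iris$Species %in% levels(iris$Species)[c(1,2)]), ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 30           4.7         3.2          1.6         0.2
## 41           5.0         3.5          1.3         0.3       ?
## 43           4.4         3.2          1.3         0.2
## 48           4.6         3.2          1.4         0.2
## 55           6.5         2.8          4.6         1.5       ?
## 60           5.2         2.7          3.9         1.4
## 66           6.7         3.1          4.4         1.4       ?
## 103          7.1         3.0          5.9         2.1       ?
## 137          6.3         3.4          5.6         2.4
## 141          6.7         3.1          5.6         2.4       ?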
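
When a factor is matched against several placeholder levels at once, the choice between `==` and `%in%` matters: `==` compares element by element and recycles the shorter vector, while `%in%` tests each element against the whole set. A small sketch on toy data (the vector `x` and the name `placeholders` are hypothetical) makes the difference visible:

```r
# Toy factor with two placeholder levels (hypothetical data)
x <- factor(c(" ", "?", "?", " ", "a", "b"))
placeholders <- c(" ", "?")

# == recycles placeholders as " ", "?", " ", "?", " ", "?",
# so positions 3 and 4 are compared with the wrong placeholder
recycled <- which(x == placeholders)

# %in% checks every element against both placeholders
matched <- which(x %in% placeholders)

recycled  # positions 1 and 2 only
matched   # positions 1, 2, 3 and 4
```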
# Real missing values (NA) are not a level, so we look for them directly with is.na()
which(is.na(iris$Species))
## [1]  56  75  86 118 148
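
The two searches in this section can also be combined by first converting every placeholder to a real NA, so that a single `is.na()` call finds all missing values. The helper below is a minimal sketch, not part of the original tutorial: the function name `mark_missing`, its `placeholders` argument, and the toy data are assumptions made for illustration.

```r
# Hypothetical helper: treat the listed placeholder strings as missing values
mark_missing <- function(x, placeholders = c("", "?", "NA")) {
  chr <- trimws(as.character(x))     # " " becomes "" after trimming
  chr[chr %in% placeholders] <- NA   # entries that are already NA stay NA
  factor(chr)                        # rebuild the factor without placeholder levels
}

# Toy data (not the modified iris column)
species <- factor(c("setosa", "?", " ", NA, "virginica", "setosa"))
clean <- mark_missing(species)

levels(clean)                  # only the real species remain as levels
which(is.na(clean))            # positions of all missing values
table(clean, useNA = "ifany")  # counts per level, including NA
```

Rebuilding the factor at the end drops the placeholder levels, so `levels()` no longer reports them once they have been handled.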

# 5 Conclusion

In this tutorial we have learnt how to find missing values in a factor variable, whether they are stored as real NA values or as placeholder levels such as "?" or " ".