1 Goal


The goal of this tutorial is to find missing values in a factor variable, both real NA values and placeholder strings such as "?" or " " that stand in for missing data.


2 Data import


# In this example we will use iris, a classic open data set for classifying iris plants by species.
data("iris")
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

3 Introducing missing values


# We are going to introduce three different kinds of missing values
# First, real missing values (NA) in 5 randomly chosen rows
na_index <- sample(1:nrow(iris), 5)
na_index
## [1] 118  86  75  56 148
# Assigning a value that is not among the factor levels (here the string "NA") produces a real NA, with a warning
iris$Species[na_index] <- "NA"
## Warning in `[<-.factor`(`*tmp*`, na_index, value = structure(c(1L, 1L,
## 1L, : invalid factor level, NA generated
# We convert the variable to character so that we can insert arbitrary strings as missing-value markers
iris$Species <- as.character(iris$Species)

# Let's pick 5 random rows in which to introduce missing values
na_index <- sample(1:nrow(iris), 5)
na_index
## [1]  66 103  41 141  55
# Now we introduce missing values coded as "?"
iris$Species[na_index] <- "?"

# Let's repeat the same procedure to introduce " "
# Again we pick 5 random rows
na_index <- sample(1:nrow(iris), 5)
na_index
## [1]  30  48 137  43  60
# Now we introduce missing values coded as " "
iris$Species[na_index] <- " "

# Finally we convert the variable back to factor
iris$Species <- factor(iris$Species)
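
The two behaviours used in this section can be checked on a small toy factor, independently of iris. This is only a minimal sketch: the vector `f` and the strings "zzz" and "?" are made up for illustration.

```r
# Toy factor (hypothetical data, not part of iris)
f <- factor(c("a", "b", "a", "c"))

# Assigning a string that is not a level yields NA.
# R emits an "invalid factor level" warning here, silenced for the demo.
g <- f
suppressWarnings(g[2] <- "zzz")
is.na(g[2])
levels(g)   # the levels themselves are unchanged

# Going through character allows arbitrary strings;
# converting back to factor turns them into ordinary levels.
h <- as.character(f)
h[2] <- "?"
h <- factor(h)
levels(h)
```

The consequence is that a placeholder such as "?" is not reported by is.na(): once it is a level, R treats it like any other category.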

4 Finding missing values in factor


# We can examine the levels of the factor to spot values that do not belong there
levels(iris$Species)
## [1] " "          "?"          "setosa"     "versicolor" "virginica"
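# And we can search for these values
# (%in% is needed here: == would recycle the two levels element by element and miss most matches)
which(iris$Species %in% levels(iris$Species)[c(1, 2)])
## [1]  30  41  43  48  55  60  66 103 137 141
# We can see the rows where the missing values are located
iris[which(iris$Species %in% levels(iris$Species)[c(1, 2)]), ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 30           4.7         3.2          1.6         0.2
## 41           5.0         3.5          1.3         0.3       ?
## 43           4.4         3.2          1.3         0.2
## 48           4.6         3.2          1.4         0.2
## 55           6.5         2.8          4.6         1.5       ?
## 60           5.2         2.7          3.9         1.4
## 66           6.7         3.1          4.4         1.4       ?
## 103          7.1         3.0          5.9         2.1       ?
## 137          6.3         3.4          5.6         2.4
## 141          6.7         3.1          5.6         2.4       ?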
# Real missing values (NA) can be found directly with is.na()
which(is.na(iris$Species))
## [1]  56  75  86 118 148
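
Once the placeholder levels are known, one possible follow-up is to recode them as real NA so that a single is.na() call finds every missing entry. The sketch below uses a toy factor rather than iris; the vector `x` and the name `missing_codes` are assumptions made for illustration.

```r
# Toy factor with three kinds of missing markers (hypothetical data)
x <- factor(c("setosa", "?", " ", NA, "virginica", "?"))

# Strings that should count as missing (an assumption; adapt to your data)
missing_codes <- c(" ", "?")

# Setting the matching levels to NA turns those entries into real NAs
# and removes the placeholder levels from the factor
levels(x)[levels(x) %in% missing_codes] <- NA

levels(x)                   # "setosa" "virginica"
which(is.na(x))             # 2 3 4 6
table(x, useNA = "ifany")   # counts per level, including NA
```

Recoding through levels() avoids a round trip through character and keeps the remaining levels in their original order.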

5 Conclusion


In this tutorial we have learnt how to find missing values in a factor variable, whether they are stored as real NA values or coded as placeholder levels such as "?" or " ".