1 Goal

The goal of this tutorial is to learn methods to find, identify and handle missing values. We will use different methods to solve the existence of missing values in our dataset.

2 Data loading

# In this example we will use the open repository of plants classification Iris. 
data("iris")
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

3 Introducing missing values in the dataset

# First we create a copy of our dataset
iris_copy <- iris

# We know that missing values are identifed in R by the symbol NA
# We introduce several missing values in some columns
iris_copy$Sepal.Length[c(15, 20, 50, 67, 97, 118)] <- NA
iris_copy$Sepal.Width[c(4, 80, 97, 106)] <- NA
iris_copy$Petal.Length[c(5, 17, 35, 49)] <- NA

# Now we see that there are missing values in some columns
summary(iris_copy)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.400   Median :1.300  
##  Mean   :5.844   Mean   :3.062   Mean   :3.822   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.375   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##  NA's   :6       NA's   :4       NA's   :4                      
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##                 
##

4 Find NA’s in my dataset

# The first thing we can do is to ask if there is any missing value in our table
length(which(is.na(iris_copy)))

## [1] 14

# We can check that we introduced 14 missing values in the table

# There are several ways to identify rows containing NA's.
# First we will use the complete.cases function (check ?complete.cases for information)
# This function returns only rows without NA's. Putting ! in front of it we get only rows with NA's

iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 4            4.6          NA          1.5         0.2     setosa
## 5            5.0         3.6           NA         0.2     setosa
## 15            NA         4.0          1.2         0.2     setosa
## 17           5.4         3.9           NA         0.4     setosa
## 20            NA         3.8          1.5         0.3     setosa
## 35           4.9         3.1           NA         0.2     setosa
## 49           5.3         3.7           NA         0.2     setosa
## 50            NA         3.3          1.4         0.2     setosa
## 67            NA         3.0          4.5         1.5 versicolor
## 80           5.7          NA          3.5         1.0 versicolor
## 97            NA          NA          4.2         1.3 versicolor
## 106          7.6          NA          6.6         2.1  virginica
## 118           NA         3.8          6.7         2.2  virginica

# We see that we have 13 rows with missing values on it 

# Another way is to search for TRUE values in the is.na function

iris_NA <- iris_copy[rowSums(is.na(iris_copy)) > 0, ]
iris_NA

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 4            4.6          NA          1.5         0.2     setosa
## 5            5.0         3.6           NA         0.2     setosa
## 15            NA         4.0          1.2         0.2     setosa
## 17           5.4         3.9           NA         0.4     setosa
## 20            NA         3.8          1.5         0.3     setosa
## 35           4.9         3.1           NA         0.2     setosa
## 49           5.3         3.7           NA         0.2     setosa
## 50            NA         3.3          1.4         0.2     setosa
## 67            NA         3.0          4.5         1.5 versicolor
## 80           5.7          NA          3.5         1.0 versicolor
## 97            NA          NA          4.2         1.3 versicolor
## 106          7.6          NA          6.6         2.1  virginica
## 118           NA         3.8          6.7         2.2  virginica

# We obtain the same result. However the function complete cases is

5 Removing rows containing NA’s

# Depending on the case, there are different things we can do with NA's. However there is not a unique and general solution to this issue. If the missing value can be calculated direclty using other columns the problem is solved.

# The first choice can be to just remove the rows containing NA's
iris_clean <- complete.cases(iris_copy)
length(which(is.na(iris_clean)))

## [1] 0

# The number of missing values in this table is 0

6 Substituting NA’s with numerical values

# In other cases we don't want to lose the information that we have in one row with missing values
# In this case we will substitute the missing value with a numerical value 

# The first thing we can do is to introduce the mean of a column in a missing value
# However it's more safe to use the median because it's not affected by outliers
# However we should be careful as in this case it's more correct to introduce the mean for the proper species
# We should do it column by column

iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "setosa"),"Sepal.Length"] <- median(iris_copy$Sepal.Length[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 4            4.6          NA          1.5         0.2     setosa
## 5            5.0         3.6           NA         0.2     setosa
## 17           5.4         3.9           NA         0.4     setosa
## 35           4.9         3.1           NA         0.2     setosa
## 49           5.3         3.7           NA         0.2     setosa
## 67            NA         3.0          4.5         1.5 versicolor
## 80           5.7          NA          3.5         1.0 versicolor
## 97            NA          NA          4.2         1.3 versicolor
## 106          7.6          NA          6.6         2.1  virginica
## 118           NA         3.8          6.7         2.2  virginica

# Now we have removed 3 NA's. Only 11 left

iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "versicolor"),"Sepal.Length"] <- median(iris_copy$Sepal.Length[which(iris_copy$Species == "versicolor")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 4           4.60          NA          1.5         0.2     setosa
## 5           5.00         3.6           NA         0.2     setosa
## 17          5.40         3.9           NA         0.4     setosa
## 35          4.90         3.1           NA         0.2     setosa
## 49          5.30         3.7           NA         0.2     setosa
## 80          5.70          NA          3.5         1.0 versicolor
## 97          5.95          NA          4.2         1.3 versicolor
## 106         7.60          NA          6.6         2.1  virginica
## 118           NA         3.8          6.7         2.2  virginica

# Now we have removed 2 NA's. Only 9 left

iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "virginica"),"Sepal.Length"] <- median(iris_copy$Sepal.Length[which(iris_copy$Species == "virginica")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 4           4.60          NA          1.5         0.2     setosa
## 5           5.00         3.6           NA         0.2     setosa
## 17          5.40         3.9           NA         0.4     setosa
## 35          4.90         3.1           NA         0.2     setosa
## 49          5.30         3.7           NA         0.2     setosa
## 80          5.70          NA          3.5         1.0 versicolor
## 97          5.95          NA          4.2         1.3 versicolor
## 106         7.60          NA          6.6         2.1  virginica

# Now we have removed 1 NA's. Only 8 left

iris_copy[is.na(iris_copy$Sepal.Width) & (iris_copy$Species == "setosa"),"Sepal.Width"] <- median(iris_copy$Sepal.Width[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 5           5.00         3.6           NA         0.2     setosa
## 17          5.40         3.9           NA         0.4     setosa
## 35          4.90         3.1           NA         0.2     setosa
## 49          5.30         3.7           NA         0.2     setosa
## 80          5.70          NA          3.5         1.0 versicolor
## 97          5.95          NA          4.2         1.3 versicolor
## 106         7.60          NA          6.6         2.1  virginica

# Now we have removed 1 NA's. Only 7 left
iris_copy[is.na(iris_copy$Petal.Length) & (iris_copy$Species == "setosa"),"Petal.Length"] <- median(iris_copy$Petal.Length[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 80          5.70          NA          3.5         1.0 versicolor
## 97          5.95          NA          4.2         1.3 versicolor
## 106         7.60          NA          6.6         2.1  virginica

# Now it's your time to finish the logic and remove all the remaining NA's of the table

# We have saved a lot of interesting data. However we must be careful because each case is different and we can affect the result if we introduce the wrong number in the NA position

7 Conclusion

In this tutorial we have learnt how to find and locate NA’s in a dataset. We have used three different ways to address this situation, but we encourage you to read more about this issue and how to handle NA’s.

How do I find missing values

Luis Serra @ Ubiqum Code Academy

1 Goal

2 Data loading

3 Introducing missing values in the dataset

4 Find NA’s in my dataset

5 Removing rows containing NA’s

6 Substituting NA’s with numerical values

7 Conclusion