Dealing with Missing Values

Prologue

  • Missing values are often present in data frames.
  • They may be explicit, for example “NA”.
  • They may also not be explicit, for example blanks (“”) - Imagine empty cells in an Excel spreadsheet.
  • In some cases, you may want to define some values as missing, for example “NULL”, “0”, etc.
  • The first step in interpreting missing values is during data import. Define the values you wish to be interpreted as missing using the na.strings argument when importing data using the read.table() or read.csv() functions.
  • Because missing values play an important role in manipulating data frame, I have specially create a session here just for missing values to supplement previous sessions on data manipulation.

Subsetting data frame

  • We will use the iris dataset for this section.
  • Simply use the is.na() function to include missing values.
  • Add an exclamation mark in front of the function if you wish to exclude missing values.
  • The is.na() function can be used alongside the subset() function to include or exclude missing values.
  • More information on subsetting data frame here.
# Load dataset
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Insert missing values into data frame
iris[2,3] <- NA
iris[3,3] <- NA

# Subset rows with missing values
missing <- subset(iris, subset=(is.na(iris$Petal.Length)), )
missing
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2          4.9         3.0           NA         0.2  setosa
## 3          4.7         3.2           NA         0.2  setosa
# Subset rows with non-missing values
no_missing <- subset(iris, subset=(!is.na(iris$Petal.Length)), )
head(no_missing)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 7          4.6         3.4          1.4         0.3  setosa
## 8          5.0         3.4          1.5         0.2  setosa

Ordering data frame

  • Missing values will always appear at the bottom of the data frame after ordering a data frame using column(s) with the missing values in it regardless of whether the column(s) is sorted in an ascending or descending fashion.
  • This goes to show that missing values do not have positions or ranks relative to the other values within the same column.
  • More information on ordering data frame here.
# Load dataset
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Insert missing values into data frame
iris[2,3] <- NA
iris[3,3] <- NA

# Order data frame by Petal.Length in ascending order
asc <- iris[order(iris$Petal.Length), ]
tail(asc)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 106          7.6         3.0          6.6         2.1 virginica
## 118          7.7         3.8          6.7         2.2 virginica
## 123          7.7         2.8          6.7         2.0 virginica
## 119          7.7         2.6          6.9         2.3 virginica
## 2            4.9         3.0           NA         0.2    setosa
## 3            4.7         3.2           NA         0.2    setosa
# Order data frame by Petal.Length in descending order
desc <- iris[order(iris$Petal.Length, decreasing=TRUE), ]
tail(asc)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 106          7.6         3.0          6.6         2.1 virginica
## 118          7.7         3.8          6.7         2.2 virginica
## 123          7.7         2.8          6.7         2.0 virginica
## 119          7.7         2.6          6.9         2.3 virginica
## 2            4.9         3.0           NA         0.2    setosa
## 3            4.7         3.2           NA         0.2    setosa

Collapsing data frame

  • You may observe missing values across the columns that you wish you calculate the statistics of.
  • Example of this scenario as follows. Here, you wish you calculate the mean age and mean popularity 1 score collapsed by, say, animal type.
# Set seed for reproduciblity
set.seed(888)

# Define vectors
col1 <- c(1:100)
col2 <- sample(c("Dog", "Cat", "Pikachu"), size=100, replace=TRUE)
col3 <- sample(c("Pink", "Blue", "Yellow"), size=100, replace=TRUE)
col4 <- rnorm(n=100, mean=5, sd=1)
col5 <- rnorm(n=100, mean=10, sd=2)
col6 <- runif(n=100, min=1, max=10)
col7 <- runif(n=100, min=1, max=10)
col8 <- runif(n=100, min=1, max=10)

# Create data frame
df <- data.frame("ID"=col1, "Animal"=col2, "Colour"=col3, "Weight"=col4, "Age"=col5, "Popularity_1"=col6, "Popularity_2"=col7, "Popularity_3"=col8)
head(df)
##   ID  Animal Colour   Weight       Age Popularity_1 Popularity_2
## 1  1     Dog Yellow 5.729918 10.055249     3.475518     8.039746
## 2  2     Cat   Pink 4.154880 12.855135     9.084591     8.183140
## 3  3     Dog   Blue 3.792285 13.096655     1.850557     2.558149
## 4  4 Pikachu Yellow 6.760088 12.471876     7.714097     9.637706
## 5  5 Pikachu   Pink 3.671745  8.126629     6.591289     4.228270
## 6  6     Dog Yellow 4.352879  9.523793     7.704855     6.027007
##   Popularity_3
## 1     8.181624
## 2     9.984992
## 3     8.841783
## 4     3.884761
## 5     4.876320
## 6     8.366354
# Insert missing values into dataset
df2 <- df
df2[1,5] <- NA
df2[2,6] <- NA
df2[3, c(5,6)] <- NA
head(df2)
##   ID  Animal Colour   Weight       Age Popularity_1 Popularity_2
## 1  1     Dog Yellow 5.729918        NA     3.475518     8.039746
## 2  2     Cat   Pink 4.154880 12.855135           NA     8.183140
## 3  3     Dog   Blue 3.792285        NA           NA     2.558149
## 4  4 Pikachu Yellow 6.760088 12.471876     7.714097     9.637706
## 5  5 Pikachu   Pink 3.671745  8.126629     6.591289     4.228270
## 6  6     Dog Yellow 4.352879  9.523793     7.704855     6.027007
##   Popularity_3
## 1     8.181624
## 2     9.984992
## 3     8.841783
## 4     3.884761
## 5     4.876320
## 6     8.366354
  • This is not a problem when the missing values appear in all the columns for which you want to calculate the statistic for within the same row, e.g. in row no. 3. aggregate() simply omits this particular row from calculating the statistic.
  • The problem arises when the missing values do not appear in all the columns for which you want to calculate the statistic for within the same row, e.g. in row no. 1 and 2. This is of concern because aggregate() will completely remove the entire row when it calculates the statistic even if there are non-missing values within the same row. This is because of the default argument na.action=na.omit.
  • To prevent aggregate() from doing this, simply pass the argument na.action=NULL.
# Using original dataset with the 3rd row removed
aggregate(cbind(Age, Popularity_1) ~ Animal, data=df[-3,], mean)
##    Animal       Age Popularity_1
## 1     Cat 10.070069     6.186497
## 2     Dog 10.321719     5.187912
## 3 Pikachu  9.675325     5.658100
# Missing age at a single row and collapse using default arguments
aggregate(cbind(Age, Popularity_1) ~ Animal, data=df2, mean, na.rm=TRUE)
##    Animal       Age Popularity_1
## 1     Cat  9.977233     6.089893
## 2     Dog 10.329120     5.235479
## 3 Pikachu  9.675325     5.658100
  • Notice that the mean popularity_1 score here is affected by the missing age value.
  • Use the na.action=NULL argument to enable aggregate() to take the non-missing value into consideration even if there are missing values in the same row/other columns.
aggregate(cbind(Age, Popularity_1) ~ Animal, data=df2, mean, na.rm=TRUE, na.action=NULL)
##    Animal       Age Popularity_1
## 1     Cat 10.070069     6.089893
## 2     Dog 10.329120     5.187912
## 3 Pikachu  9.675325     5.658100

Concatenating multiple columns

  • Missing values will be converted to literal NA’s after concatenating columns with missing values in them, i.e. the new combined character strings will have NA(s) embedded in them.
  • More information on concatenating and also parsing here
# Set seed for reproduciblity
set.seed(888)

# Define vectors
col1 <- sample(letters, size=5, replace=TRUE)
col2 <- sample(letters, size=5, replace=TRUE)
col3 <- sample(letters, size=5, replace=TRUE)

# Create data frame
df <- data.frame("var1"=col1, "var2"=col2, "var3"=col3)

# Insert missing values into data frame
df[1,1] <- NA
df[2,2] <- NA
df[4,2] <- NA
df[4,3] <- NA

# Concatenate all columns
df$var4 <- paste(df$var1, df$var2, df$var3, sep=" ")
df
##   var1 var2 var3    var4
## 1 <NA>    c    k  NA c k
## 2    j <NA>    b  j NA b
## 3    b    h    a   b h a
## 4    r <NA> <NA> r NA NA
## 5    t    x    s   t x s
  • You would have to make the additional step of replacing these NA’s if you wish to obliterate them from your character strings.
library(qdap)

df$var4new <- multigsub(pattern=c("NA", "^NA " ," NA$"), replacement="", df$var4)
df
##   var1 var2 var3    var4 var4new
## 1 <NA>    c    k  NA c k     c k
## 2    j <NA>    b  j NA b     j b
## 3    b    h    a   b h a   b h a
## 4    r <NA> <NA> r NA NA       r
## 5    t    x    s   t x s   t x s