Missing Values

Identifying

Across dataframe

# is.na(df)

In a single column

# is.na(df$col)

Location of NAs in a column

# which(is.na(df$col))

Number of NAs in a column

# sum(is.na(df$col))

Sum of missing values per column 1

# colSums(is.na(df))

Sum of missing values per column 2

# sapply(df, function(x) sum( is.na(x) ))

Return rows with specific number of missing values

# df[rowSums(is.na(df))==9,]
# df[rowSums(is.na(df)) >9,]
# df[rowSums(is.na(df)) <9,]

Recoding

Recode with mean of column

# df$col[is.na(df$col)] <- mean(df$col, na.rm=TRUE)

Recode from “..” to NA

# df[df==".."] <- NA

Excluding

A common method of handling missing values is simply to omit the records or fields with missing values from the analysis.

However, this may be dangerous, since the pattern of missing values may in fact be systematic, and simply deleting records with missing values would lead to a biased subset of the data.

When to exclude

Some authors recommend that if the amount of missing data is very small relative to the size of the data set (up to 5%), then leaving out the few values with missing features would be the best strategy in order not to bias the analysis.

Exclude missing values from maths operations

# mean(df$col, na.rm=TRUE)

Exclude rows which have missing values

# na.omit(df)

Exclude rows with all missing values (x=number of columns)

# df[rowSums(is.na(df)) < x, ]

Replacing with Constant

If values in a dataframe follow a certain rule, such as total = staff + housing + cleaning, then we can use packages to solve missing values according to the formula.

# install.packages("deductive")
# install.packages("validate")
# library(deductive)
# library(validate)

Define rules as a validator expression

# Rules <- validator( staff + cleaning + housing == total,
#                   staff >= 0,
#                   housing >= 0,
#                   cleaning >= 0)

Use rules to fill out missing values (impute_lr() function)

# imputed_df <- impute_lr(df, Rules)

Replacing with Mean/Median/Mode

The Hmisc package has a convenient wrapper function allowing you to specify what function is used to compute imputed values from the non-missing.

# install.packages("Hmisc")
# library(Hmisc)

Columns

Replace missing values with the COLUMN mean, median, mode

# col <- impute(df$col, fun = mean)
# col <- impute(df$col, fun = median)
# col <- impute(df$col, fun = mode)

Rows

Replace missing values with the ROW mean. This function replaces values in the 2017_Q4 col with the rowmeans of Q1, Q2 and Q3. See ifelse() below.

# pr$`2017_Q4` <- ifelse(is.na(pr$`2017_Q4`),
#                     rowMeans(pr[, c("2017_Q1", 
#                                     "2017_Q2", 
#                                      "2017_Q3")], 
#                               na.rm = TRUE), 
#                      pr$`2017_Q4`)

ifelse(condition, x, y)

X is the value to replace a value, if the condition is met
Y is the value to replace a value if the condition is not met

Checking which values have been replaced

A nice feature of the impute function is that the resulting vector remembers what values were imputed. TRUE means a value has been imputed.

# is.imputed(df$col)

Special Values

In addition to missing values, there are a few special values that are used in R. These are -Inf, Inf and NaN.

If a computation results in a number that is too big, R will return Inf (meaning positive infinity) for a positive number and -Inf for a negative number (meaning negative infinity).

Identifying Special Values

Calculations involving special values often result in special values, thus it is important to handle special values prior to analysis.

The is.finite, is.infinite, or is.nan functions will generate logical values (TRUE or FALSE) and they can be used to identify the special values in a data set.

Check if values are finite

# is.inite(df$col)

Check if values are infinite

# is.infinite(df$col)

Check if values are not a number (NaN)

# is.nan(df$col)

The apply family functions will apply a specified function to a given data object (e.g. vectors, lists, matrices, data frames). Most common forms of apply functions are:

apply() for matrices and data frames
lapply() for lists (output as list)
sapply() for lists (output simplified)
tapply() for vectors

Check if values are infinite in a dataframe using sapply

# sapply(df, is.inifite)

Check for multiple different types of missing values using a function

This function checks every numerical column for infinite, NaN or na values.

# is.specialorNA <- function(x){
#   if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))}

Apply the function to the dataframe. TRUE means the value is infinite, not a number or na as specified in the function.

# sapply(df, is.specialorNA)

Sum of missing values for each column

This function checks population_NA for na values and returns a sum of the values in a nice little table. The x in the functions below needs to stay as it is part of the function returning the sum.

Use a special function like in the last heading if you want to test for multiple special values.

# sapply(population_NA, function(x) sum( is.na(x) ))