Across dataframe
# is.na(df)
In a single column
# is.na(df$col)
Location of NAs in a column
# which(is.na(df$col))
Number of NAs in a column
# sum(is.na(df$col))
Sum of missing values per column 1
# colSums(is.na(df))
Sum of missing values per column 2
# sapply(df, function(x) sum( is.na(x) ))
Return rows with specific number of missing values
# df[rowSums(is.na(df))==9,]
# df[rowSums(is.na(df)) >9,]
# df[rowSums(is.na(df)) <9,]
Recode with mean of column
# df$col[is.na(df$col)] <- mean(df$col, na.rm=TRUE)
Recode from “..” to NA
# df[df==".."] <- NA
A common method of handling missing values is simply to omit the records or fields with missing values from the analysis.
However, this may be dangerous, since the pattern of missing values may in fact be systematic, and simply deleting records with missing values would lead to a biased subset of the data.
Some authors recommend that if the amount of missing data is very small relative to the size of the data set (up to 5%), then leaving out the few values with missing features would be the best strategy in order not to bias the analysis.
# mean(df$col, na.rm=TRUE)
# na.omit(df)
# df[rowSums(is.na(df)) < x, ]
If values in a dataframe follow a certain rule, such as total = staff + housing + cleaning, then we can use packages to solve missing values according to the formula.
# install.packages("deductive")
# install.packages("validate")
# library(deductive)
# library(validate)
Define rules as a validator expression
# Rules <- validator( staff + cleaning + housing == total,
# staff >= 0,
# housing >= 0,
# cleaning >= 0)
Use rules to fill out missing values (impute_lr() function)
# imputed_df <- impute_lr(df, Rules)
The Hmisc package has a convenient wrapper function allowing you to specify what function is used to compute imputed values from the non-missing.
# install.packages("Hmisc")
# library(Hmisc)
Replace missing values with the COLUMN mean, median, mode
# col <- impute(df$col, fun = mean)
# col <- impute(df$col, fun = median)
# col <- impute(df$col, fun = mode)
Replace missing values with the ROW mean. This function replaces values in the 2017_Q4 col with the rowmeans of Q1, Q2 and Q3. See ifelse() below.
# pr$`2017_Q4` <- ifelse(is.na(pr$`2017_Q4`),
# rowMeans(pr[, c("2017_Q1",
# "2017_Q2",
# "2017_Q3")],
# na.rm = TRUE),
# pr$`2017_Q4`)
ifelse(condition, x, y)
A nice feature of the impute function is that the resulting vector remembers what values were imputed. TRUE means a value has been imputed.
# is.imputed(df$col)
In addition to missing values, there are a few special values that are used in R. These are -Inf, Inf and NaN.
If a computation results in a number that is too big, R will return Inf (meaning positive infinity) for a positive number and -Inf for a negative number (meaning negative infinity).
Calculations involving special values often result in special values, thus it is important to handle special values prior to analysis.
The is.finite, is.infinite, or is.nan functions will generate logical values (TRUE or FALSE) and they can be used to identify the special values in a data set.
# is.inite(df$col)
# is.infinite(df$col)
# is.nan(df$col)
The apply family functions will apply a specified function to a given data object (e.g. vectors, lists, matrices, data frames). Most common forms of apply functions are:
# sapply(df, is.inifite)
This function checks every numerical column for infinite, NaN or na values.
# is.specialorNA <- function(x){
# if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))}
Apply the function to the dataframe. TRUE means the value is infinite, not a number or na as specified in the function.
# sapply(df, is.specialorNA)
This function checks population_NA for na values and returns a sum of the values in a nice little table. The x in the functions below needs to stay as it is part of the function returning the sum.
Use a special function like in the last heading if you want to test for multiple special values.
# sapply(population_NA, function(x) sum( is.na(x) ))