The goal of this tutorial is to automatise the process of finding NAs in columns.
We are going to define a general function that returns the number of missing value of a column. This function can then be used in many different ways that will make our life easier.
# Let's define our function
columnHasMissingValues <- function( myDataframe, columnName){
# function takes two arguments
# myDataframe is the table we want to inspect
# columnName is the name of column in the table we want to make sure does not contain missing values
# We make sure that the column exists before performing any operations
if( columnName %in% names( myDataframe)){
# We keep only the column we are interested in
checkTable <- myDataframe[, columnName]
# We return the number of missing values in that column
return( sum(!complete.cases(checkTable)))
}else{
# If column does not exist the function returns a negative 1
return( -1)
}
}
# Adding some helpful information
attr(columnHasMissingValues, "help") <- "columnHasMissingValues takes two attributes, a dataframe and a column name. It returns the number of missing values in the column of choice or -1 if the column name does not exist in the given dataframe. Column name is case sensitive"
More often than not, we find mandatory fields in our SQL databases missing. We can write a script that automatically runs every morning and alerts us if there was any corruption in the data the day before.
# We can check if the iris dataset contains any missing Species
myColumnName <- "Species"
# We need to check that the dataframe exists before performing any operation
if( exists( "iris")){ # I always run functions inside of protection brackets like these
missing_species <- columnHasMissingValues( iris, myColumnName)
# We print the number of missing values in the column Species
print( missing_species)
}
## [1] 0
We can now create a check that prints a message that alerts us that something went wrong
# First we introduce a missing value in the dataset
iris$Species[50] <- NA
if( exists( "iris")){ # I always run functions inside of protection brackets like these
missing_species <- columnHasMissingValues( iris, myColumnName)
if(missing_species > 0){
print(paste("Your column", myColumnName, "contains missing values, please check what went wrong"))
}
}
## [1] "Your column Species contains missing values, please check what went wrong"
We can find cases where our Machine Learning algorithms depend on a low missing value ratio of our data. We can also monitor if the data quality of our data is evolving with time in the wrong direction by checking the ratio of missing values of the total.
# Let's introduce 50 missing values
iris$Species[1:50] <- NA
# We need to check that the dataframe exists before performing any operation
if( exists( "iris")){ # I always run functions inside of protection brackets like these
missing_species <- columnHasMissingValues( iris, myColumnName)
# We print the number of missing values in the column Species
print( missing_species)
}
## [1] 50
# Now we can put a limit on the ratio of missing values
if( exists( "iris")){ # I always run functions inside of protection brackets like these
missing_species <- columnHasMissingValues( iris, myColumnName)
# We divide the number of missing values by the number of rows to find the ratio
if( missing_species / nrow( iris) > 0.2){
print( paste0("More than 20% of missing values found in the column ", myColumnName ,". Please check."))
}
}
## [1] "More than 20% of missing values found in the column Species. Please check."
Finally we can check the help attribute for our function to make sure that we understand the input and output of our function and to help others how to use it.
attr(columnHasMissingValues, "help")
## [1] "columnHasMissingValues takes two attributes, a dataframe and a column name. It returns the number of missing values in the column of choice or -1 if the column name does not exist in the given dataframe. Column name is case sensitive"
In this tutorial we have learnt how to create a useful function to find out if key columns contain missing values and how many. We can use this general function to automatise scripts to alert us if the data contains more missing values than expected or desired.