Detailed Example

Eliora Henzler

2019-02-11

Specific test functions

What is “all” in inspect_all?

inspect_all() wraps a number of specific data cleaning checks. These can all be accessed directly as individual functions, outlined below.

Load the cleaninginspectoR Package

library("cleaninginspectoR")

Example data frame

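Here we create some fake data for illustration purposes. It is not important to understand the code; we keep it in so you can run the example yourself if you like. The dataset contains 100 rows and five columns: a, random numbers between 0 and 1 plus two extreme values (7287 and -100); b, random letters; uuid, an ID column in which the values 4 and 20 appear twice; water.source.other, empty except for two "neighbour's well" entries; and GPS.lat, random numbers between 0 and 1.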

testdf <- data.frame(a = c(runif(98), 7287, -100),
                     b = sample(letters, 100, replace = TRUE),
                     uuid = c(1:98, 4, 20),
                     water.source.other = c(rep(NA, 98), "neighbour's well", "neighbour's well"),
                     GPS.lat = runif(100)
                     )
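
Because runif() and sample() draw random values, testdf differs between runs. If you want the same fake data every time, one option is to call set.seed() before creating it. The sketch below shows the effect with base R only; the seed value 2019 is an arbitrary choice for this illustration.

```r
# Fixing the seed makes random draws repeatable.
set.seed(2019)            # arbitrary seed, chosen only for this example
first <- runif(3)

set.seed(2019)            # resetting the same seed replays the same draws
second <- runif(3)

identical(first, second)  # TRUE
```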

Finding duplicates in certain columns

There is a generic function to find duplicates in a certain specified column:

find_duplicates(testdf, duplicate.column.name = "uuid")
index value variable has_issue issue_type
99    4     uuid     TRUE      duplicate in uuid
100   20    uuid     TRUE      duplicate in uuid

This is often used on a column of UUIDs, so there is a wrapper that looks for “uuid” in the column names and returns duplicates in the first matching column it finds. It gives the same result as the call above:

find_duplicates_uuid(testdf)
index value variable has_issue issue_type
99    4     uuid     TRUE      duplicate in uuid
100   20    uuid     TRUE      duplicate in uuid

Run ?find_duplicates or ?find_duplicates_uuid for details.
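
To make the idea behind the duplicate check concrete, here is a minimal base R sketch that finds repeated values in a single column. It is not the package's implementation, only an illustration built on duplicated(); the object names is_repeat and duplicates are made up for this example.

```r
# Base R sketch of a duplicate check; not how the package does it internally.
uuid <- c(1:98, 4, 20)          # same values as the uuid column of testdf

# duplicated() marks every repeat of a value after its first occurrence
is_repeat <- duplicated(uuid)

duplicates <- data.frame(index    = which(is_repeat),
                         value    = uuid[is_repeat],
                         variable = "uuid")
print(duplicates)
```

Note that duplicated() flags only the later occurrences (rows 99 and 100), not the first rows holding 4 and 20, which lines up with the indices reported above.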

Checking for outliers

find_outliers(testdf)
index value variable has_issue issue_type
99    7287  a        TRUE      normal distribution outlier

Run ?find_outliers for details.
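
The issue type reported above is “normal distribution outlier”. As a rough illustration of that idea, the sketch below flags values lying more than three standard deviations from the mean, using base R only. The three-standard-deviation threshold and the seed are assumptions made for this example; the rule find_outliers() actually applies may differ, so check ?find_outliers.

```r
# Base R sketch of a three-standard-deviation rule.
# The threshold is an assumption, not taken from the package.
set.seed(1)                            # arbitrary seed for reproducibility
a <- c(runif(98), 7287, -100)          # same construction as testdf$a

distance   <- abs(a - mean(a))         # distance of each value from the mean
is_outlier <- distance > 3 * sd(a)     # TRUE where the distance exceeds 3 SDs

outliers <- data.frame(index = which(is_outlier),
                       value = a[is_outlier])
print(outliers)
```

Under this rule only row 99 (7287) is flagged. That single extreme value inflates the standard deviation so much that -100 stays within three standard deviations of the mean, which is consistent with the output shown above.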

Checking for other responses

find_other_responses(testdf)
index value                             variable           has_issue issue_type
NA    neighbour’s well \ 2 instance(s)  water.source.other NA        ‘other’ response. may need recoding.

Run ?find_other_responses for details.
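
As a rough picture of what such a check involves, the base R sketch below picks out columns whose name contains “other” and counts each distinct answer in them. Selecting columns by matching “other” in the name is an assumption made for this example, as are the object names df, other_cols and other_counts; see ?find_other_responses for how the package actually selects columns.

```r
# Base R sketch; matching column names on "other" is an assumption.
df <- data.frame(
  a = 1:100,
  water.source.other = c(rep(NA, 98), "neighbour's well", "neighbour's well"),
  stringsAsFactors = FALSE
)

# Names of all columns containing "other", ignoring case
other_cols <- grep("other", names(df), ignore.case = TRUE, value = TRUE)

# Count each distinct non-missing answer per matching column
other_counts <- lapply(df[other_cols], function(x) table(x, useNA = "no"))
print(other_counts)
```

Counting the answers instead of listing every row keeps the result short, in the same spirit as the “2 instance(s)” summary in the output above.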