Specific test functions

What is “all” in inspect_all?

inspect_all() wraps a number of specific data cleaning checks. These can all be accessed directly as individual functions, outlined below.

Load the cleaninginspectoR Package

library("cleaninginspectoR")

Example data frame

Here we create some fake data for illustration purposes. It is not important to understand this; we keep it in so you can run the example yourself if you like. The dataset contains:

variable a: random values and outliers
variable uuid: values should be unique but are not
variable water.source.other: all NA except for two
variable GPS.lat: just some numbers, but the column header indicates that this is potentially sensitive

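testdf <- data.frame(
  a = c(runif(98), 7287, -100),                 # random values plus two extreme values
  b = sample(letters, 100, replace = TRUE),     # random letters
  uuid = c(1:98, 4, 20),                        # 4 and 20 appear twice
  water.source.other = c(rep(NA, 98), "neighbour's well", "neighbour's well"),
  GPS.lat = runif(100)                          # column name hints at sensitive data
)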

Finding duplicates in certain columns

find_duplicates() is a generic function that finds duplicates in a specified column:

find_duplicates(testdf, duplicate.column.name = "uuid")

index	value	variable	has_issue	issue_type
99	4	uuid	TRUE	duplicate in uuid
100	20	uuid	TRUE	duplicate in uuid

Often this is used on a column of UUIDs, so there is a wrapper that looks for “uuid” in the column names and returns duplicates in the first matching column it finds. This gives the same result as above:

find_duplicates_uuid(testdf)

index	value	variable	has_issue	issue_type
99	4	uuid	TRUE	duplicate in uuid
100	20	uuid	TRUE	duplicate in uuid

Run ?find_duplicates or ?find_duplicates_uuid for details.
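To make the output above easier to read, here is a minimal base R sketch of a duplicate check that produces a table with the same columns. The helper sketch_duplicates() is hypothetical and is not part of cleaninginspectoR; the package's own implementation may differ.

```r
# Hypothetical helper for illustration only; not the package's implementation.
sketch_duplicates <- function(df, column) {
  values <- df[[column]]
  # duplicated() marks the second and later occurrences of each value
  is_dup <- duplicated(values)
  n_dup <- sum(is_dup)
  data.frame(
    index = which(is_dup),
    value = values[is_dup],
    variable = rep(column, n_dup),
    has_issue = rep(TRUE, n_dup),
    issue_type = rep(paste("duplicate in", column), n_dup),
    stringsAsFactors = FALSE
  )
}

# Same uuid column as in the example data frame above
example_df <- data.frame(uuid = c(1:98, 4, 20))
dups <- sketch_duplicates(example_df, "uuid")
print(dups)
```

Note that only the repeated occurrences (rows 99 and 100) are reported, not the first occurrences in rows 4 and 20, which matches the table shown above.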

Checking for outliers

find_outliers(testdf)

index	value	variable	has_issue	issue_type
99	7287	a	TRUE	normal distribution outlier

Run ?find_outliers for details.
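The issue type “normal distribution outlier” suggests a rule based on distance from the mean in standard deviations. The sketch below assumes a threshold of three standard deviations; both the helper sketch_outliers() and the threshold are assumptions made for illustration, so check ?find_outliers for the rule the package actually applies.

```r
# Hypothetical helper; the 3-standard-deviation threshold is an assumption.
sketch_outliers <- function(x, n_sd = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  # Indices of values further than n_sd standard deviations from the mean
  which(abs(x - centre) > n_sd * spread)
}

set.seed(1)  # fixed seed so the random part is reproducible
a <- c(runif(98), 7287, -100)
flagged <- sketch_outliers(a)
print(flagged)
```

Under this assumed rule only the value 7287 (row 99) is flagged. The value -100 is not, because the single extreme value 7287 inflates the standard deviation so much that -100 stays within three standard deviations of the mean.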

Checking for other responses

find_other_responses(testdf)

index	value	variable	has_issue	issue_type
NA	neighbour’s well \ 2 instance(s)	water.source.other	NA	‘other’ response. may need recoding.

Run ?find_other_responses for details.
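As a rough illustration of the idea, the sketch below looks for columns whose name contains “other” and counts each distinct non-NA response. The helper sketch_other_responses() and the name-matching rule are assumptions; the package may select columns and format results differently.

```r
# Hypothetical helper; matching on "other" in the column name is an assumption.
sketch_other_responses <- function(df) {
  other_cols <- grep("other", names(df), ignore.case = TRUE, value = TRUE)
  results <- lapply(other_cols, function(col) {
    # table() drops NA by default, so only real responses are counted
    counts <- table(df[[col]], useNA = "no")
    if (length(counts) == 0) return(NULL)
    data.frame(
      variable = col,
      value = names(counts),
      instances = as.integer(counts),
      stringsAsFactors = FALSE
    )
  })
  # Returns NULL when no matching column has any non-NA response
  do.call(rbind, results)
}

example_df <- data.frame(
  a = 1:5,
  water.source.other = c(NA, NA, NA, "neighbour's well", "neighbour's well"),
  stringsAsFactors = FALSE
)
others <- sketch_other_responses(example_df)
print(others)
```

Each distinct response is reported once with its count, which is why the table above shows a single row for two identical answers.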

Detailed Example

Eliora Henzler

2019-02-11