inspect_all() wraps a number of specific data cleaning checks. These can all be accessed directly as individual functions, outlined below.
library("cleaninginspectoR")

Here we create some fake data for illustration purposes. You do not need to understand this code; it is included so that you can run the examples yourself. The dataset contains:

- a: random values and outliers
- uuid: values that should be unique but are not
- water.source.other: all NA except for two values
- GPS.lat: just some numbers, but the column header indicates that this column is potentially sensitive

testdf <- data.frame(a= c(runif(98),7287,-100),
b=sample(letters,100,replace=TRUE),
uuid=c(1:98, 4,20),
water.source.other = c(rep(NA,98),"neighbour's well","neighbour's well"),
GPS.lat = runif(100)
)

There is a generic function to find duplicates in a specified column:
find_duplicates(testdf, duplicate.column.name = "uuid")

| index | value | variable | has_issue | issue_type |
|---|---|---|---|---|
| 99 | 4 | uuid | TRUE | duplicate in uuid |
| 100 | 20 | uuid | TRUE | duplicate in uuid |
Often this is used on a column of UUIDs, so there is a wrapper that looks for “uuid” in the column names and returns duplicates in the first matching column it finds. This gives the same result as the call above:
find_duplicates_uuid(testdf)

| index | value | variable | has_issue | issue_type |
|---|---|---|---|---|
| 99 | 4 | uuid | TRUE | duplicate in uuid |
| 100 | 20 | uuid | TRUE | duplicate in uuid |
Run ?find_duplicates or ?find_duplicates_uuid for details.
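To make the idea behind the duplicate check concrete, here is a minimal base R sketch. It is not the package's implementation: `sketch_find_duplicates` and `sketch_find_duplicates_uuid` are hypothetical helpers written for this illustration, and they assume that a "duplicate" means the second or later occurrence of a value.

```r
# Hypothetical helper, not part of cleaninginspectoR:
# flag repeated values in one column of a data frame.
sketch_find_duplicates <- function(df, column) {
  values <- df[[column]]
  # duplicated() marks the second and later occurrences of each value
  idx <- which(duplicated(values))
  data.frame(
    index = idx,
    value = values[idx],
    variable = rep(column, length(idx)),
    has_issue = rep(TRUE, length(idx)),
    issue_type = rep(paste("duplicate in", column), length(idx)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper: use the first column whose name contains "uuid".
sketch_find_duplicates_uuid <- function(df) {
  matches <- grep("uuid", names(df), ignore.case = TRUE, value = TRUE)
  if (length(matches) == 0) stop("no column name contains 'uuid'")
  sketch_find_duplicates(df, matches[1])
}

# Same uuid column as in testdf: 4 and 20 each appear twice
example_df <- data.frame(uuid = c(1:98, 4, 20))
dups <- sketch_find_duplicates_uuid(example_df)
print(dups)
```

Under these assumptions the sketch flags rows 99 and 100, the same rows reported in the tables above.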
find_outliers(testdf)

| index | value | variable | has_issue | issue_type |
|---|---|---|---|---|
| 99 | 7287 | a | TRUE | normal distribution outlier |
Run ?find_outliers for details.
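The issue type "normal distribution outlier" suggests a rule based on distance from the mean. The sketch below assumes a cutoff of three standard deviations; this threshold and the helper `sketch_find_outliers` are assumptions made for illustration, and the package may apply a different rule (see ?find_outliers). Under this assumption, only 7287 is flagged: that single extreme value inflates the standard deviation to several hundred, so -100 stays within three standard deviations of the mean.

```r
# Hypothetical helper, not the package's implementation:
# flag values more than `n_sd` standard deviations from the mean.
sketch_find_outliers <- function(x, n_sd = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  which(abs(x - centre) > n_sd * spread)
}

set.seed(1)  # only to make the random part reproducible
a <- c(runif(98), 7287, -100)
outlier_index <- sketch_find_outliers(a)
print(a[outlier_index])
```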
find_other_responses(testdf)

| index | value | variable | has_issue | issue_type |
|---|---|---|---|---|
| NA | neighbour’s well \ 2 instance(s) | water.source.other | NA | ‘other’ response. may need recoding. |
Run ?find_other_responses for details.
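A check like this can be approximated in base R by counting the non-missing answers in columns whose name contains "other". The sketch below is not the package's implementation: `sketch_find_other_responses` is a hypothetical helper, and matching on the substring "other" is an assumption about how such columns are identified.

```r
# Hypothetical helper, not the package's implementation:
# count non-missing answers in every column whose name contains "other".
sketch_find_other_responses <- function(df) {
  other_cols <- grep("other", names(df), ignore.case = TRUE, value = TRUE)
  results <- lapply(other_cols, function(col) {
    counts <- table(df[[col]], useNA = "no")  # NA values are not counted
    if (length(counts) == 0) return(NULL)     # column is entirely NA
    data.frame(
      variable = col,
      value = names(counts),
      instances = as.integer(counts),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, results)
}

# Same "other" column as in testdf
example_df <- data.frame(
  water.source.other = c(rep(NA, 98), "neighbour's well", "neighbour's well"),
  stringsAsFactors = FALSE
)
others <- sketch_find_other_responses(example_df)
print(others)
```

Listing each distinct answer once with its count, rather than once per row, keeps the output short when many respondents typed the same free-text answer.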