Quickstart

Eliora Henzler

2019-02-11

Load the cleaninginspectoR library

library("cleaninginspectoR")

Example data frame

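Here we create some fake data for illustration purposes. You do not need to understand this code; it is included so that you can run the example yourself. The dataset has 100 rows and contains a numeric column a with two extreme values (7287 and -100), a column b of random letters, a uuid column in which two IDs (4 and 20) are repeated, a column water.source.other with two free-text "other" responses, and a column GPS.lat of random numbers standing in for coordinates: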

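testdf <- data.frame(a = c(runif(98), 7287, -100),   # two extreme values at the end
                     b = sample(letters, 100, TRUE), # random letters, sampled with replacement
                     uuid = c(1:98, 4, 20),          # IDs 4 and 20 appear twice
                     water.source.other = c(rep(NA, 98), "neighbour's well", "neighbour's well"),
                     GPS.lat = runif(100)
                     )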
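As a quick sanity check on the fake data, the planted problems can be located with base R alone. The following is a minimal sketch and not part of cleaninginspectoR; the call to set.seed and the object names planted_dups and other_text are additions for illustration only.

```r
# Recreate the example data; set.seed() is an addition so that the
# random columns are the same on every run.
set.seed(1)
testdf <- data.frame(a = c(runif(98), 7287, -100),
                     b = sample(letters, 100, TRUE),
                     uuid = c(1:98, 4, 20),
                     water.source.other = c(rep(NA, 98), "neighbour's well", "neighbour's well"),
                     GPS.lat = runif(100))

# Rows whose uuid repeats an earlier uuid
planted_dups <- which(duplicated(testdf$uuid))
print(planted_dups)

# The two extreme values in column a sit in the last two rows
print(tail(testdf$a, 2))

# Free-text "other" answers that are not missing
other_text <- testdf$water.source.other[!is.na(testdf$water.source.other)]
print(table(other_text))
```

The first print should show rows 99 and 100, which are the rows holding the repeated IDs 4 and 20.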

Perform all available cleaning inspections

The function inspect_all runs all cleaning checks that are available.

inspect_all(testdf)
index | value | variable | has_issue | issue_type
NA | NA | GPS.lat | TRUE | Potentially sensitive information. Please ensure all PII is removed
99 | 4 | uuid | TRUE | duplicate in uuid
100 | 20 | uuid | TRUE | duplicate in uuid
99 | 7287 | a | TRUE | normal distribution outlier
NA | neighbour's well \ 2 instance(s) | water.source.other | NA | ‘other’ response. may need recoding.
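The output above flags 7287 in column a but not -100. One way to see why a check based on the normal distribution can behave like this is a simple three-standard-deviation rule. This is a sketch under that assumption: the rule and the threshold of 3 are chosen for illustration and are not necessarily what cleaninginspectoR does internally. The set.seed call and the names z and flagged are also additions.

```r
set.seed(1)  # added for reproducibility
a <- c(runif(98), 7287, -100)

# Distance of each value from the mean, measured in standard deviations
z <- abs(a - mean(a)) / sd(a)

# Assumed rule: flag values more than 3 standard deviations from the mean
flagged <- which(z > 3)
print(flagged)

# Compare the two extreme values in rows 99 and 100
print(round(z[99:100], 2))
```

Under this rule only row 99 is flagged. The single very large value inflates both the mean and the standard deviation so much that -100 ends up well within three standard deviations, which is a reminder that such checks can miss problems a human would spot immediately.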

For unusual UUID column names

One of the checks inspect_all performs is to look for duplicates in the first column whose name contains “uuid”. If your ID column has a different name, pass that name as the second argument, uuid.column.name:

inspect_all(df = testdf, uuid.column.name = "b")
# optional: render the result as a table (requires the knitr package)
knitr::kable(inspect_all(df = testdf, uuid.column.name = "b"))
index | value | variable | has_issue | issue_type
NA | NA | GPS.lat | TRUE | Potentially sensitive information. Please ensure all PII is removed
10 | b | b | TRUE | duplicate in b
13 | k | b | TRUE | duplicate in b
15 | t | b | TRUE | duplicate in b
16 | d | b | TRUE | duplicate in b
19 | u | b | TRUE | duplicate in b
21 | h | b | TRUE | duplicate in b
22 | h | b | TRUE | duplicate in b
23 | e | b | TRUE | duplicate in b
25 | p | b | TRUE | duplicate in b
26 | t | b | TRUE | duplicate in b
27 | u | b | TRUE | duplicate in b
30 | z | b | TRUE | duplicate in b
32 | r | b | TRUE | duplicate in b
33 | t | b | TRUE | duplicate in b
36 | u | b | TRUE | duplicate in b
37 | f | b | TRUE | duplicate in b
38 | k | b | TRUE | duplicate in b
39 | l | b | TRUE | duplicate in b
40 | u | b | TRUE | duplicate in b
41 | w | b | TRUE | duplicate in b
42 | e | b | TRUE | duplicate in b
44 | p | b | TRUE | duplicate in b
46 | s | b | TRUE | duplicate in b
47 | e | b | TRUE | duplicate in b
48 | t | b | TRUE | duplicate in b
50 | p | b | TRUE | duplicate in b
52 | m | b | TRUE | duplicate in b
53 | l | b | TRUE | duplicate in b
54 | v | b | TRUE | duplicate in b
55 | y | b | TRUE | duplicate in b
56 | c | b | TRUE | duplicate in b
57 | v | b | TRUE | duplicate in b
58 | v | b | TRUE | duplicate in b
59 | c | b | TRUE | duplicate in b
60 | j | b | TRUE | duplicate in b
61 | j | b | TRUE | duplicate in b
62 | j | b | TRUE | duplicate in b
63 | i | b | TRUE | duplicate in b
64 | m | b | TRUE | duplicate in b
66 | d | b | TRUE | duplicate in b
67 | l | b | TRUE | duplicate in b
68 | h | b | TRUE | duplicate in b
69 | t | b | TRUE | duplicate in b
70 | l | b | TRUE | duplicate in b
71 | a | b | TRUE | duplicate in b
72 | a | b | TRUE | duplicate in b
73 | y | b | TRUE | duplicate in b
74 | q | b | TRUE | duplicate in b
75 | p | b | TRUE | duplicate in b
76 | w | b | TRUE | duplicate in b
77 | y | b | TRUE | duplicate in b
78 | o | b | TRUE | duplicate in b
79 | m | b | TRUE | duplicate in b
80 | n | b | TRUE | duplicate in b
81 | u | b | TRUE | duplicate in b
82 | f | b | TRUE | duplicate in b
83 | s | b | TRUE | duplicate in b
84 | s | b | TRUE | duplicate in b
85 | o | b | TRUE | duplicate in b
86 | u | b | TRUE | duplicate in b
87 | o | b | TRUE | duplicate in b
88 | t | b | TRUE | duplicate in b
89 | y | b | TRUE | duplicate in b
90 | v | b | TRUE | duplicate in b
91 | r | b | TRUE | duplicate in b
92 | o | b | TRUE | duplicate in b
93 | y | b | TRUE | duplicate in b
94 | v | b | TRUE | duplicate in b
95 | t | b | TRUE | duplicate in b
96 | h | b | TRUE | duplicate in b
97 | t | b | TRUE | duplicate in b
98 | l | b | TRUE | duplicate in b
99 | e | b | TRUE | duplicate in b
100 | u | b | TRUE | duplicate in b
99 | 7287 | a | TRUE | normal distribution outlier
NA | neighbour's well \ 2 instance(s) | water.source.other | NA | ‘other’ response. may need recoding.
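The long list of duplicates above is expected: column b holds 100 values drawn from only 26 letters, so at least 74 rows must repeat a letter seen in an earlier row. A minimal base R sketch of that count follows; set.seed and the name n_dups are additions for illustration, and the code does not use cleaninginspectoR.

```r
set.seed(1)  # added for reproducibility
b <- sample(letters, 100, TRUE)

# Each distinct letter is "new" exactly once; every other row repeats one
n_dups <- sum(duplicated(b))
print(n_dups)

# The duplicate count always equals rows minus distinct values
print(n_dups == 100 - length(unique(b)))
```

A column like b is a poor choice of ID in practice; the example only illustrates how uuid.column.name redirects the duplicate check to another column.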

More Details

For more information and individual check functions, see the detailed example.