Recently I've found a package called 'validate' (so simple), which specializes in (you guessed it) data validation.
So again, while this package might sound like it helps create validation sets when working with data, it's a little simpler than that: it's a package for validating our data against constraints!
Here is the base example, showcased on R-bloggers:
library(validate)
library(magrittr)  # provides the %>% pipe

iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length,
  mean(Sepal.Width) > 0,
  if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
) %>% summary()
## name items passes fails nNA error warning
## 1 V1 150 66 84 0 FALSE FALSE
## 2 V2 1 1 0 0 FALSE FALSE
## 3 V3 150 84 66 0 FALSE FALSE
## expression
## 1 Sepal.Width > 0.5 * Sepal.Length
## 2 mean(Sepal.Width) > 0
## 3 !(Sepal.Width > 0.5 * Sepal.Length) | (Sepal.Length > 10)
As you can see, the validate package lets us check our data against conditions (listed under the expression column) and gives us a detailed write-up of how our tests went. That saves quite a lot of time, considering each rule in this write-up would otherwise have been a separate piece of code with its own output. But the validate package doesn't stop there.
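To give a feel for the time saved, here is a rough sketch of what tallying the first and third rules by hand might look like in base R. The names `rule1`, `rule3` and `manual` are made up for this illustration; the counts it prints should line up with the passes and fails columns in the summary above.

```r
# Manual equivalent of the first and third rules, using base R only.
# `rule1`, `rule3` and `manual` are hypothetical names for this sketch.
rule1 <- iris$Sepal.Width > 0.5 * iris$Sepal.Length

# A conditional rule "if (A) B" is logically the same as "!A | B",
# which is how the expression column reports it.
rule3 <- !(iris$Sepal.Width > 0.5 * iris$Sepal.Length) | (iris$Sepal.Length > 10)

# Tally items, passes and fails for each rule by hand.
manual <- data.frame(
  name   = c("rule1", "rule3"),
  items  = c(length(rule1), length(rule3)),
  passes = c(sum(rule1), sum(rule3)),
  fails  = c(sum(!rule1), sum(!rule3))
)
print(manual)
```

Every extra rule means another logical vector and another set of sums, which is exactly the bookkeeping check_that() and summary() do for us.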
The next function, validator(), is similar, but it stores our constraints in an object so we can do other things with them later. Let's use it on another dataset to see what I mean.
library(mice)  # assumed source of mammalsleep (columns bw and brw)
data("mammalsleep")
v <- validator(
  bigGuys   = bw > median(bw),
  smartGuys = brw > mean(brw)
)
v
## Object of class 'validator' with 2 elements:
## bigGuys : bw > median(bw)
## smartGuys: brw > mean(brw)
Printing the validator object lists the rules it holds, which we can then apply to our dataset.
c = confront(mammalsleep,v)
c
## Object of class 'validation'
## Call:
## confront(x = mammalsleep, dat = v)
##
## Confrontations: 2
## With fails : 2
## Warnings : 0
## Errors : 0
We check whether there were any issues when applying our rules to the data using confront(), and graph the result with barplot().
barplot(c,main='Mammals')
Now, with a few lines of code, we can see how many mammals pass or fail each rule: body weight above the median, and brain weight above the mean. This could easily have been done with a histogram, but we did it using a package that can support much more sophisticated queries.
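As a small taste of those other queries, here is a minimal sketch of digging further into the same confrontation. It assumes that mammalsleep comes from the mice package and that summary() and values() behave as described in the validate documentation. The names `cf` and `both` are my own, with `cf` chosen to avoid masking base R's c().

```r
library(validate)
library(mice)  # assumption: source of the mammalsleep data used above

data("mammalsleep")

# Same rules as in the post.
v <- validator(
  bigGuys   = bw > median(bw),
  smartGuys = brw > mean(brw)
)

# `cf` is used instead of `c` to avoid masking base::c().
cf <- confront(mammalsleep, v)

# One row per rule: items, passes, fails, NAs, errors and warnings.
summary(cf)

# Record-level results: one logical column per rule, one row per mammal.
head(values(cf))

# The record-level results can be combined, e.g. to count the mammals
# that satisfy both rules at once.
both <- values(cf)[, "bigGuys"] & values(cf)[, "smartGuys"]
sum(both, na.rm = TRUE)
```

Because the rules live in `v`, the same object can be confronted with any other data frame that has bw and brw columns, without rewriting the checks.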