D621B4

Michael Muller

May 24, 2018


If you’re like me, your suite of data validation techniques stops after assigning a vector of random ints as partition flags for a training and test set.

Recently I found a package called ‘validate’ (so simple) that specializes in, you guessed it, data validation.

The validate package lets us validate our data in a number of ways, most of which involve defining rules as objects and applying them to an entire data frame.

So while the name might suggest a tool for creating validation sets, the package is a little simpler than that: it validates our data against constraints!

Here is the basic example, showcased on R-bloggers:

library(validate)
library(magrittr)  # provides the %>% pipe

iris %>% check_that(
  Sepal.Width > 0.5 * Sepal.Length,
  mean(Sepal.Width) > 0,
  if (Sepal.Width > 0.5 * Sepal.Length) Sepal.Length > 10
) %>% summary()
##   name items passes fails nNA error warning
## 1   V1   150     66    84   0 FALSE   FALSE
## 2   V2     1      1     0   0 FALSE   FALSE
## 3   V3   150     84    66   0 FALSE   FALSE
##                                                  expression
## 1                          Sepal.Width > 0.5 * Sepal.Length
## 2                                     mean(Sepal.Width) > 0
## 3 !(Sepal.Width > 0.5 * Sepal.Length) | (Sepal.Length > 10)

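As you can see, the validate package lets us check our data against conditions (listed under the expression column) and gives us a detailed write-up of how each test went: how many items were checked, how many passed, how many failed, and how many were NA. Note that the conditional rule is rewritten as a plain logical expression, so a record passes when the condition is false or the consequence is true. This saves quite a lot of time, considering each row in this write-up would otherwise have been a separate piece of code with its own output. But the validate package doesn't stop there.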
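To make the third rule less mysterious, here is a minimal base-R sketch of what that rewritten expression computes. It is only an illustration of the logic, not how the package is implemented internally; the variable names `cond`, `then`, `implied`, and `rule_counts` are made up for this example.

```r
# Base-R sketch of how the conditional rule is evaluated (no packages needed).
cond <- iris$Sepal.Width > 0.5 * iris$Sepal.Length   # the "if" part
then <- iris$Sepal.Length > 10                       # the consequence

# A record passes when the condition is false or the consequence is true,
# which is the !(...) | (...) form shown in the expression column.
implied <- !cond | then

# Counts that correspond to the passes and fails columns for this rule.
rule_counts <- c(passes = sum(implied), fails = sum(!implied))
rule_counts
```

Since no iris flower has a sepal longer than 10, the consequence is never true here, so the rule passes exactly for the records where the condition is false.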

This next function, validator(), is similar, but it stores every rule in an object so we can do other things with it later. Let's use it on another dataset to see what I mean.

# mammalsleep ships with the mice package
data("mammalsleep", package = "mice")
v <- validator(
  bigGuys = bw > median(bw),
  smartGuys = brw > mean(brw)
)
v
## Object of class 'validator' with 2 elements:
##  bigGuys  : bw > median(bw)
##  smartGuys: brw > mean(brw)

Printing the validator object returns our rules, which we can then apply to our dataset.

c = confront(mammalsleep,v)
c
## Object of class 'validation'
## Call:
##     confront(x = mammalsleep, dat = v)
## 
## Confrontations: 2
## With fails    : 2
## Warnings      : 0
## Errors        : 0
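The printed confrontation only gives totals. As a rough sketch of how to dig further, `summary()` and `values()` from validate can be called on the result. This assumes the validate and mice packages are installed; the names `rules`, `result`, `outcomes`, and `both` are made up for this example.

```r
library(validate)

# Assumption: mammalsleep is the dataset shipped with the mice package.
data("mammalsleep", package = "mice")

rules <- validator(
  bigGuys   = bw > median(bw),
  smartGuys = brw > mean(brw)
)
result <- confront(mammalsleep, rules)

# One row per rule: items checked, passes, fails, and NAs.
summary(result)

# Record-level outcomes: one row per mammal, one column per rule.
outcomes <- values(result)
head(outcomes)

# Row indices of mammals that pass both rules (which() drops any NA outcomes).
both <- which(outcomes[, "bigGuys"] & outcomes[, "smartGuys"])
mammalsleep[both, c("species", "bw", "brw")]
```

Keeping the rules in a validator object is what makes this convenient: the same `rules` can be confronted with a different data frame without retyping them.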

We check whether there were any issues applying our rules to the data using confront(), and graph the results with barplot().

barplot(c,main='Mammals')

Now, with a few lines of code, we can see how many mammals weigh more than the median body weight (about half, by construction) and how many have brains heavier than the mean brain weight. A histogram would show much the same thing, but we did so using a package that can support much more sophisticated rules.