This post was previously published on my WordPress blog:
https://dav1d00.wordpress.com/2014/12/20/132/


Simple random sampling is the most common practice when dealing with data sets that are large enough to be split into a training and a test set for predictive purposes.

Think of classification models.

You randomly extract, say, \(\frac{3}{4}\) of the rows, and that is a fair technique, at least as long as you are quite sure that both of your sets will have a sufficient number of observations in all the possible classes.
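
For reference, a plain random 3/4 split might look like this in base R (a minimal sketch, where df stands for a generic data frame):

# Simple random sampling: a minimal sketch, assuming a generic data frame "df"
set.seed(123)                                   # just for reproducibility
train_idx <- sample(nrow(df), size = round(3/4 * nrow(df)))
train <- df[ train_idx, ]
test  <- df[-train_idx, ]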

To picture the real problem, I often think of a professor who teaches his students three methods (A, B and C) to perform a task, but who devotes almost all of the lectures to A and B and barely mentions C.

What might this behavior lead to? The students will master A and B, but they will hardly be able to recognize the situations in which C is the right method.

Your classification models are like those students, and your data is the professor whose job is to teach them about the heterogeneity of the population.
If your data mirrors this diversity in all its nuances, your models will become aware of the complexity of the phenomenon and their predictions will be more accurate.

Take, for instance, a data set whose response is characterized by three classes: A, B and C.

If class C has only a few observations, simple random sampling may not be what you are looking for, because you may want greater control over the split, in order to make sure that your model first learns about the existence of that class and is then tested on it.

If you fail to control the split, you might run into situations in which you do not have enough observations to make the model fully aware of the difference between C and the other two classes.
The result would be particular attention to A and B, with C identified only in the very few cases in which the predictors take on values that are extremely different from those characterizing A and B (graphically, the points labelled C lie far away from the decision boundary).
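
As a hypothetical illustration (not taken from the original post), with a small data set and a rare class, a plain random split can easily leave the test set without a single observation of that class:

# Hypothetical illustration: 200 rows, class "C" in roughly 2% of them
set.seed(1)
y <- sample(c("A", "B", "C"), size = 200, replace = TRUE,
            prob = c(.49, .49, .02))
train_idx <- sample(length(y), size = 150)  # 3/4 of the rows go to training
table(y[-train_idx])                        # the test set may contain no "C" at all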

If recognizing class C is your priority, and you are willing to accept a possibly higher overall misclassification error, introducing class weights in your model may mitigate the problem.
However, it is hard to know in advance the exact weights to assign to the classes, and you clearly risk affecting the margins between the other classes and/or giving too much importance to C, ending up with a misclassification disaster.
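
Just to fix ideas, a common heuristic is to weight each class inversely to its frequency; here is a minimal sketch with e1071::svm(), which accepts a named class.weights vector. The data frame Df and its factor response y are placeholders here (they happen to match the fake data built later in this post):

# Inverse-frequency class weights: a sketch, assuming a data frame "Df"
# with a factor response "y"
library(e1071)
tab <- table(Df$y)
wts <- setNames(as.numeric(sum(tab) / tab), names(tab))  # rarer classes get larger weights
fit <- svm(y ~ ., data = Df, class.weights = wts)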

All of the above considered, you can simply resort to stratified sampling.

Thanks to it, it is as if you could suggest that the professor dedicate roughly the same attention to all three methods, so that the students end up with a complete overall knowledge.

As in every field, and in statistics above all, everything has a cost, or at least may have one.

If you stratify, your data will benefit from higher representativeness, but do not overdo it!
If you start considering too many criteria, to ensure the presence of, say, all the levels of each categorical predictor in each class, you will end up with empty or very small groups.
It is fine to give randomization a hand, but you cannot expect to control every possible case.
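
To see how quickly groups shrink, consider crossing just a handful of categorical criteria (a hypothetical illustration, not from the original post); you will typically find a fair number of empty cells:

# Hypothetical illustration: crossing a few categorical variables produces
# more cells than the data can fill
set.seed(2)
f1 <- sample(letters[1:4], 500, replace = TRUE)  #  4 levels
f2 <- sample(LETTERS[1:5], 500, replace = TRUE)  #  5 levels
f3 <- sample(month.abb,    500, replace = TRUE)  # 12 levels
sum(table(f1, f2, f3) == 0)  # counts how many of the 240 cells got no observation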

The pair of functions I am going to show implements the stratified split of a data set.
I have also included a threshold that triggers a warning whenever, for one or more groups, the test set ends up with fewer observations than the threshold itself. The cut-off defaults to zero (no control), but the analyst ought to specify it.

Why is the cut-off placed on the test set?

Because you generally save for the test set a number of rows that is lower than (at most equal to) the number of training rows. Therefore, if you can control the test set, you automatically control the training set.
Another reason is that you unquestionably care a lot about the representativeness of all the main cases in your training sample, but you also demand that the test set contain everything.
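
As a quick back-of-the-envelope check (my numbers, not the post’s): with a 75/25 split, asking for at least 300 test observations per group automatically guarantees about three times as many training observations.

# With tr_percent = .75, a test threshold of 300 implies roughly 900 training rows
n_test  <- 300
n_train <- n_test * 0.75 / 0.25  # = 900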

Otherwise, how could the professor measure the students’ preparation on all the situations characterizing the task?
He would be left guessing that he had provided them with all the right tools, without actually knowing whether everything had been grasped and learnt.

Finally, the R functions:

To run these functions, you must have installed both dplyr and magrittr.
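
If they are not installed yet, the usual CRAN call does the job:

install.packages(c("dplyr", "magrittr"))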

strat_sample <- function(data, gr_variab, tr_percent, thresh_test = 0, seed) {
  
  stopifnot(tr_percent > 0 & tr_percent < 1)
  
  if(require(dplyr) & require(magrittr)) {
    
    if(!missing(seed)) set.seed(seed)
    
    # Rename the columns so that the temporary variables n, tRows and SET
    # cannot clash with existing ones
    names0 <- names(data)
    gr_variab <- which(names0 == gr_variab)
    names(data) <- make.unique(c("n", "tRows", "SET", names0))[-(1:3)]
    gr_variab <- names(data)[gr_variab]        
    
    # Shuffle the rows, then compute the group sizes (n) and the number of
    # training rows per group (tRows)
    # (group_by_ is the old standard-evaluation verb, deprecated in dplyr >= 0.7)
    data %<>% 
      sample_frac %>% 
      group_by_(gr_variab) %>%
      mutate(n = n(), tRows = round(tr_percent * n))
    
    # Warn if any group would leave fewer than thresh_test rows to the test set
    with(data, if(any(n - tRows < thresh_test))        
      warning("Zero or too few observations in one or more groups"))
    
    # Flag the first tRows rows of each (shuffled) group as training rows
    data %<>%
      mutate(SET = ifelse(row_number() <= tRows, "Train", "Test")) %>%
      select(-n, -tRows) %>%
      ungroup
    
    # Restore the original column names and append SET
    names(data) <- make.unique(c(names0, "SET"))
    
    data
    
  }
  
}

extract_set <- function(data, whichSET) {
  
  stopifnot(is.element(whichSET, c("Train", "Test")))
  
  if(require(dplyr)) {
    
    # The SET column is the last one appended by strat_sample()
    variab <- names(data)[ncol(data)]
    condit <- get(variab, data) == whichSET
    
    # Keep only the requested set and drop the SET column
    data %>%
      filter_(~ condit) %>%
      select_(paste0("-", variab)) 
    
  }
}

Let’s try it on fake data.

n <- 1e+5

set.seed(386)

Df <- data.frame(V1 = rnorm(n),
                 V2 = rt(n, df = 4),
                 V3 = rpois(n, lambda = 1),
                 y = sample(letters[1:4], n, replace = T, 
                            prob = c(.33, .33, .33, .01))) # "d" will just appear in about 1% of rows

groups <- strat_sample(Df, "y", .75, 
                       thresh_test = 300) # at least 300 observations per group in the test set
## Warning: Zero or too few observations in one or more groups

The warning is due to class “d”, which appears in fewer than 300 rows of the test set.
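
A quick check of the absolute counts (not part of the original output) confirms it:

# Class "d" covers about 1% of the 1e+5 rows, so its test portion is
# roughly 0.25 * 1000 = 250 observations, below the 300 threshold
with(groups, table(y, SET))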

Let’s now see the proportions of class observations in each set.

with(groups, prop.table(table(y, SET), 1))
##    SET
## y        Test     Train
##   a 0.2500076 0.7499924
##   b 0.2500000 0.7500000
##   c 0.2500000 0.7500000
##   d 0.2497492 0.7502508

Everything is as expected.

Let’s now extract both sets.

extract_set(groups, "Train")
## Source: local data frame [75,000 x 4]
## 
##             V1          V2 V3 y
## 1  -1.15255450 -0.09175182  1 b
## 2   1.04397437  1.90545498  1 b
## 3  -0.06141112  0.06718518  1 b
## 4   1.74595667  1.19711538  0 c
## 5  -0.69313791 -0.76256078  0 b
## 6   0.42058714  0.73443779  0 b
## 7  -1.14555445 -1.04763448  1 c
## 8  -0.94041132  0.10522625  0 a
## 9  -0.72705699  0.35570454  0 b
## 10 -1.30080135 -2.37771513  1 a
## ..         ...         ... .. .
extract_set(groups, "Test")
## Source: local data frame [25,000 x 4]
## 
##             V1           V2 V3 y
## 1  -0.88376571 -0.465006741  2 d
## 2  -0.08263995  2.630909005  0 d
## 3  -0.25888570 -0.009938382  1 d
## 4  -0.03194042 -3.484653822  2 d
## 5   0.39469734 -0.779444603  0 d
## 6  -0.49666191 -0.121627489  2 d
## 7  -0.14770261  0.301794833  0 d
## 8   0.43400765 -0.651536060  1 d
## 9  -1.90505923  0.736410437  2 b
## 10  1.44438995  2.219118526  1 b
## ..         ...          ... .. .

Hope you enjoyed it.
Feel free to leave comments and suggestions.