Summary

My goal was to segment my own continuous data into a few different ordinal levels (think low, medium, high – small, medium, large – 12 months, 12-24 months, 24-36 months, etc). This is a reproducible example solving that same problem with the mtcars dataset.


1. Data

I chose to work with the displacement variable, as it has a nice range (71.1, 472).

## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

It also has 3 somewhat distinct segments. We can choose to break at these pre-determined points (0 - 235, 235 - 425, 425 - 500).

2. Segmentation

So our goal is to generate segments based on engine displacement (small, medium, large). We will classify this as 0 - 235, 236 - 425, 426 - 500 as mentioned before. We then need to classify it as a factor. We can do this many different ways.


2.1 Segment via cut

I’m skipping ahead to what I think are the two optimal ways (yay for small wins!). Based on an example in “R for Data Science” and one from @, I used the cut function to segment without doing any fancy for, else, if loops or sapply, purrr functions.

# create dataset
mtcars_cut <- mtcars

# create labels
seg_levels <- c('small', 'medium', 'large')

# run cut function to segment and change labels/levels
mtcars_cut$disp <- cut(mtcars_cut$disp, 
                       breaks = c(0, 235, 425, Inf), 
                       labels = seg_levels)
##  Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small"  "medium" "large"

2.2 Segment via dplyr::case_when

This example was initiated by Russel Pierce (@RussellSPierce) after I asked some questions about this segmentation issue on Twitter. This method of solving the problem is still quite a bit of hardcoding and takes more coding overall, but is very readable and pipe-able. dplyr::case_when will evaluate each of the logicals and allows you to set labels by using the ~.

mtcars %>%
  mutate(
    # this is a vectorized switch/if-else loop in dplyr!
    # evaluates each value in disp and generates a label
    # (small, medium, large) in new column disp_segment
    disp_segment = case_when(
          disp > 425 ~ "large",
          disp > 235 & disp <= 425 ~ "medium",
          disp <= 235 ~ "small"
    )) %>% 
  mutate(
    #creates a factor out of disp_size
    disp_size = factor(disp_segment)) %>% 
  group_by(
    disp_segment) %>% 
  # min and max to check our segmented disp factor against disp numerically
  summarize(min = min(disp),
            max = max(disp),
            median = median(disp))
## # A tibble: 3 x 4
##   disp_segment   min   max median
##          <chr> <dbl> <dbl>  <dbl>
## 1        large 440.0   472    460
## 2       medium 258.0   400    311
## 3        small  71.1   225    121

2.3 Segmentation via Loop

Our alternative option is to create an if-else loop with hard coded breaks.

displacement_segment <- function(x){
  if (x <= 235) {
    return('small')
  }else if (x > 235 && x <= 425){
    return('medium')
  }else if (x > 425){
    return('large')
  }
}

2.4 Segment function applied via sapply

So we created our if-else loop, now we need to loop this function over each value in the dataframe at column disp.

# create a dataset to test our loop on
mtcars_test <- mtcars

# run the for loop function on the disp column
mtcars_test$disp <- sapply(mtcars_test$disp, displacement_segment)

# create labels for factors
mtcars_test$disp <- factor(mtcars_test$disp, levels = seg_levels)
##  Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small"  "medium" "large"

2.5 Segment function with purrr

Alternatively, we can skip the sapply call and use purrr::map_chr.

# create a dataset to test our loop on
mtcars_purrr <- mtcars

# run the for loop function on the disp column
mtcars_purrr$disp <- map_chr(mtcars_purrr$disp, displacement_segment)
mtcars_purrr$disp <- factor(mtcars_purrr$disp, levels = seg_levels)
##  Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small"  "medium" "large"

2.6 Segment function with alternative purrr call

Now we use map_chr and mutate with the pipe.

# create a dataset to test our loop on
mtcars_purrr <- mtcars

# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>% 
  mutate(disp = map_chr(disp, displacement_segment),
         disp = factor(disp, levels = seg_levels))

  #modify_at(3, displacement_segment)
##  Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small"  "medium" "large"

2.7 Segment function yet another purrr call

Interestingly, our function does not play nice with another purrr::modify_at using either the direct name or the column number. It seems to stop at the first loop of the function, which is an issue with the repeated if-else loops in our written functiin. I need to work on my vectorization of the custom function to prevent purrr from reading only the 1st loop! Otherwise it will end up ONLY evaluating the first loop and assigning small to every value in the column. I would prefer to solve this problem with either cut or dplyr::case_when as they are more predictable/readable.

# create a dataset to test our loop on
mtcars_purrr <- mtcars

# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>% 
  modify_at("disp", displacement_segment)
## Warning in if (x <= 235) {: the condition has length > 1 and only the first
## element will be used
##  chr [1:32] "small" "small" "small" "small" "small" "small" "small" ...
## [1] "small"

Again we can see that the call generates only small labels.

# create a dataset to test our loop on
mtcars_purrr <- mtcars

# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>% 
  modify_at(3, displacement_segment)
## Warning in if (x <= 235) {: the condition has length > 1 and only the first
## element will be used
##  chr [1:32] "small" "small" "small" "small" "small" "small" "small" ...
## [1] "small"