My goal was to segment my own continuous data into a few different ordinal levels (think low, medium, high – small, medium, large – 12 months, 12-24 months, 24-36 months, etc). This is a reproducible example solving that same problem with the mtcars dataset.
I chose to work with the displacement variable, as it has a nice range (71.1, 472).
## Observations: 32
## Variables: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
It also has 3 somewhat distinct segments. We can choose to break at these pre-determined points (0 - 235, 235 - 425, 425 - 500).
So our goal is to generate segments based on engine displacement (small, medium, large). We will classify this as 0 - 235, 236 - 425, 426 - 500 as mentioned before. We then need to classify it as a factor. We can do this many different ways.
I’m skipping ahead to what I think are the two optimal ways (yay for small wins!). Based on an example in “R for Data Science” and one from @, I used the cut function to segment without doing any fancy for, else, if loops or sapply, purrr functions.
# create dataset
mtcars_cut <- mtcars
# create labels
seg_levels <- c('small', 'medium', 'large')
# run cut function to segment and change labels/levels
mtcars_cut$disp <- cut(mtcars_cut$disp,
breaks = c(0, 235, 425, Inf),
labels = seg_levels)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
This example was initiated by Russel Pierce (@RussellSPierce) after I asked some questions about this segmentation issue on Twitter. This method of solving the problem is still quite a bit of hardcoding and takes more coding overall, but is very readable and pipe-able. dplyr::case_when will evaluate each of the logicals and allows you to set labels by using the ~.
mtcars %>%
mutate(
# this is a vectorized switch/if-else loop in dplyr!
# evaluates each value in disp and generates a label
# (small, medium, large) in new column disp_segment
disp_segment = case_when(
disp > 425 ~ "large",
disp > 235 & disp <= 425 ~ "medium",
disp <= 235 ~ "small"
)) %>%
mutate(
#creates a factor out of disp_size
disp_size = factor(disp_segment)) %>%
group_by(
disp_segment) %>%
# min and max to check our segmented disp factor against disp numerically
summarize(min = min(disp),
max = max(disp),
median = median(disp))
## # A tibble: 3 x 4
## disp_segment min max median
## <chr> <dbl> <dbl> <dbl>
## 1 large 440.0 472 460
## 2 medium 258.0 400 311
## 3 small 71.1 225 121
Our alternative option is to create an if-else loop with hard coded breaks.
displacement_segment <- function(x){
if (x <= 235) {
return('small')
}else if (x > 235 && x <= 425){
return('medium')
}else if (x > 425){
return('large')
}
}
sapplySo we created our if-else loop, now we need to loop this function over each value in the dataframe at column disp.
# create a dataset to test our loop on
mtcars_test <- mtcars
# run the for loop function on the disp column
mtcars_test$disp <- sapply(mtcars_test$disp, displacement_segment)
# create labels for factors
mtcars_test$disp <- factor(mtcars_test$disp, levels = seg_levels)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
purrrAlternatively, we can skip the sapply call and use purrr::map_chr.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr$disp <- map_chr(mtcars_purrr$disp, displacement_segment)
mtcars_purrr$disp <- factor(mtcars_purrr$disp, levels = seg_levels)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
purrr callNow we use map_chr and mutate with the pipe.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>%
mutate(disp = map_chr(disp, displacement_segment),
disp = factor(disp, levels = seg_levels))
#modify_at(3, displacement_segment)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
purrr callInterestingly, our function does not play nice with another purrr::modify_at using either the direct name or the column number. It seems to stop at the first loop of the function, which is an issue with the repeated if-else loops in our written functiin. I need to work on my vectorization of the custom function to prevent purrr from reading only the 1st loop! Otherwise it will end up ONLY evaluating the first loop and assigning small to every value in the column. I would prefer to solve this problem with either cut or dplyr::case_when as they are more predictable/readable.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>%
modify_at("disp", displacement_segment)
## Warning in if (x <= 235) {: the condition has length > 1 and only the first
## element will be used
## chr [1:32] "small" "small" "small" "small" "small" "small" "small" ...
## [1] "small"
Again we can see that the call generates only small labels.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>%
modify_at(3, displacement_segment)
## Warning in if (x <= 235) {: the condition has length > 1 and only the first
## element will be used
## chr [1:32] "small" "small" "small" "small" "small" "small" "small" ...
## [1] "small"