My goal was to segment my own continuous data into a few different ordinal levels (think low, medium, high – small, medium, large – 12 months, 12-24 months, 24-36 months, etc). This is a reproducible example solving that same problem with the mtcars
dataset.
I chose to work with the displacement variable, as it has a nice range (71.1, 472).
## Observations: 32
## Variables: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
It also has 3 somewhat distinct segments. We can choose to break at these pre-determined points (0 - 235
, 235 - 425
, 425 - 500
).
So our goal is to generate segments based on engine displacement (small, medium, large). We will classify this as 0 - 235
, 236 - 425
, 426 - 500
as mentioned before. We then need to classify it as a factor. We can do this many different ways.
I’m skipping ahead to what I think are the two optimal ways (yay for small wins!). Based on an example in “R for Data Science” and one from @, I used the cut function to segment without doing any fancy for
, else
, if
loops or sapply
, purrr
functions.
# create dataset
mtcars_cut <- mtcars
# create labels
seg_levels <- c('small', 'medium', 'large')
# run cut function to segment and change labels/levels
mtcars_cut$disp <- cut(mtcars_cut$disp,
breaks = c(0, 235, 425, Inf),
labels = seg_levels)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
This example was initiated by Russel Pierce (@RussellSPierce) after I asked some questions about this segmentation issue on Twitter. This method of solving the problem is still quite a bit of hardcoding and takes more coding overall, but is very readable and pipe-able. dplyr::case_when
will evaluate each of the logicals and allows you to set labels by using the ~
.
mtcars %>%
mutate(
# this is a vectorized switch/if-else loop in dplyr!
# evaluates each value in disp and generates a label
# (small, medium, large) in new column disp_segment
disp_segment = case_when(
disp > 425 ~ "large",
disp > 235 & disp <= 425 ~ "medium",
disp <= 235 ~ "small"
)) %>%
mutate(
#creates a factor out of disp_size
disp_size = factor(disp_segment)) %>%
group_by(
disp_segment) %>%
# min and max to check our segmented disp factor against disp numerically
summarize(min = min(disp),
max = max(disp),
median = median(disp))
## # A tibble: 3 x 4
## disp_segment min max median
## <chr> <dbl> <dbl> <dbl>
## 1 large 440.0 472 460
## 2 medium 258.0 400 311
## 3 small 71.1 225 121
Our alternative option is to create an if-else loop with hard coded breaks.
displacement_segment <- function(x){
if (x <= 235) {
return('small')
}else if (x > 235 && x <= 425){
return('medium')
}else if (x > 425){
return('large')
}
}
sapply
So we created our if-else loop, now we need to loop this function over each value in the dataframe at column disp
.
# create a dataset to test our loop on
mtcars_test <- mtcars
# run the for loop function on the disp column
mtcars_test$disp <- sapply(mtcars_test$disp, displacement_segment)
# create labels for factors
mtcars_test$disp <- factor(mtcars_test$disp, levels = seg_levels)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
purrr
Alternatively, we can skip the sapply
call and use purrr::map_chr
.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr$disp <- map_chr(mtcars_purrr$disp, displacement_segment)
mtcars_purrr$disp <- factor(mtcars_purrr$disp, levels = seg_levels)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
purrr
callNow we use map_chr
and mutate
with the pipe.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>%
mutate(disp = map_chr(disp, displacement_segment),
disp = factor(disp, levels = seg_levels))
#modify_at(3, displacement_segment)
## Factor w/ 3 levels "small","medium",..: 1 1 1 2 2 1 2 1 1 1 ...
## [1] "small" "medium" "large"
purrr
callInterestingly, our function does not play nice with another purrr::modify_at
using either the direct name or the column number. It seems to stop at the first loop of the function, which is an issue with the repeated if-else
loops in our written functiin. I need to work on my vectorization
of the custom function to prevent purrr
from reading only the 1st loop! Otherwise it will end up ONLY evaluating the first loop and assigning small
to every value in the column. I would prefer to solve this problem with either cut
or dplyr::case_when
as they are more predictable/readable.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>%
modify_at("disp", displacement_segment)
## Warning in if (x <= 235) {: the condition has length > 1 and only the first
## element will be used
## chr [1:32] "small" "small" "small" "small" "small" "small" "small" ...
## [1] "small"
Again we can see that the call generates only small
labels.
# create a dataset to test our loop on
mtcars_purrr <- mtcars
# run the for loop function on the disp column
mtcars_purrr <- mtcars_purrr %>%
modify_at(3, displacement_segment)
## Warning in if (x <= 235) {: the condition has length > 1 and only the first
## element will be used
## chr [1:32] "small" "small" "small" "small" "small" "small" "small" ...
## [1] "small"