A familiar situation among data analysts is the need to split or cut a series of values into groups. When an ‘age’ variable is given as specific ages, it is usually necessary to group the ages into bins to simplify the data.

#Age
x <- c(18, 42, 31, 25, 76, 22)

We would like to sort these values into the groups ‘18-34’, ‘35-64’, and ‘65+’. We could use nested if and else statements, as we would in many programming languages.

x2 <- ifelse(x >= 18 & x <= 34, "18-34", ifelse(x >= 35 & x <= 64, "35-64", ifelse(x >= 65, "65+", NA)))
x2
## [1] "18-34" "35-64" "18-34" "18-34" "65+"   "18-34"

This approach works, but it does not scale as the number of breaks increases: every extra group adds another nested ifelse.

The function ‘cut’ helps to simplify this syntax.

cut(x, breaks=c(17, 34, 64, Inf), labels=c("18-34", "35-64", "65+"))
## [1] 18-34 35-64 18-34 18-34 65+   18-34
## Levels: 18-34 35-64 65+

How does the cut function work?

#Basic 'cut' structure
cut(vector, breaks, labels)

There are three main parts: the object to be cut, where the breaks should occur, and what labels should be used. The ‘labels’ argument does not have to be specified; when it is missing, default labels are generated.
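To see what those default labels look like, here is a minimal sketch that cuts the same ages without a ‘labels’ argument. The breaks are the ones from the earlier example; the name default_groups is only illustrative.

```r
# Ages from the example above
x <- c(18, 42, 31, 25, 76, 22)

# No 'labels' argument: cut() builds the labels from the break points,
# using interval notation such as (17,34]
default_groups <- cut(x, breaks = c(17, 34, 64, Inf))
levels(default_groups)
## [1] "(17,34]"  "(34,64]"  "(64,Inf]"
```

The round bracket marks an endpoint that is excluded and the square bracket one that is included, which is a handy way to check what a set of breaks really does before attaching custom labels.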

We will learn through the most common errors and show how to overcome them. Let’s first create labels for the final output.

mylabels <- c("18-34", "35-64", "65+")

Now on to the cut.

cut(x, breaks=c(18, 35, 65), labels=mylabels)
#Error in cut.default(x, breaks = c(18, 35, 65), labels = mylabels) : 
#  lengths of 'breaks' and 'labels' differ

The error says lengths of 'breaks' and 'labels' differ, but the lengths are both ‘3’. There are three labels and three breaks. Labels: “18-34”, “35-64”, “65+” and Breaks: 18, 35, 65.

The problem and the solution both lie in the fact that the breaks in ‘cut’ form the endpoints of the groups.

There are three break points but only two groups! We need one more break point than we have groups, so let’s add another. We could use c(17, 34, 64, 100) to add 100 as the last break, which would cover everyone up to age 100. But in case there is an individual older than that, we can set the last break to positive infinity: c(17, 34, 64, Inf). That way, anyone aged 65 or older will be placed in the last group. Note that the breaks are also shifted down by one (17, 34, 64 rather than 18, 35, 65), because by default each interval excludes its lower endpoint and includes its upper one.

cut(x, breaks=c(17, 34, 64, Inf), labels=c("18-34", "35-64", "65+"))
## [1] 18-34 35-64 18-34 18-34 65+   18-34
## Levels: 18-34 35-64 65+

#Plot the break points (p1 and geom_cut() are assumed to be defined elsewhere)
p1 + geom_cut(breaks=c(17, 34, 64, Inf),
              yint=2.5, size=1, color="grey", spread=.5) + 
  geom_text(aes(label=x), vjust=2, size=3)

As we can see above, there are four final break points and three groups.
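The relationship between breaks and labels can be checked directly before calling ‘cut’. The following is a small sketch, not part of the function itself; the names mybreaks and groups are illustrative.

```r
x <- c(18, 42, 31, 25, 76, 22)
mybreaks <- c(17, 34, 64, Inf)
mylabels <- c("18-34", "35-64", "65+")

# cut() needs exactly one more break point than labels
length(mybreaks) == length(mylabels) + 1
## [1] TRUE

groups <- cut(x, breaks = mybreaks, labels = mylabels)

# Count how many ages landed in each group
# (18-34: 4, 35-64: 1, 65+: 1)
table(groups)
```

Running a check like this first turns the "lengths differ" error into something we can spot before the cut is ever made.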

Good to Know

Two other pitfalls are worth knowing about: the arguments include.lowest and right are best left at their defaults.

The argument include.lowest defaults to FALSE, which means a value equal to the lowest break will not be included in any group and becomes NA. In our case, someone aged 18 would be dropped if breaks were set to c(18, 34, 64, Inf): the lowest break is ‘18’ and is not included. Therefore, we set the breaks to c(17, 34, 64, Inf).
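A short sketch of that behavior, using two made-up ages (the vector ages is hypothetical and only there to show the edge case):

```r
# Hypothetical ages: one sits exactly on the lowest break
ages <- c(18, 25)

# Default include.lowest = FALSE: the first interval is (18,34],
# so 18 falls outside every group and becomes NA
dropped <- cut(ages, breaks = c(18, 34, 64, Inf))
is.na(dropped)   # TRUE for 18, FALSE for 25

# include.lowest = TRUE closes the first interval: [18,34]
kept <- cut(ages, breaks = c(18, 34, 64, Inf), include.lowest = TRUE)
levels(kept)
## [1] "[18,34]"  "(34,64]"  "(64,Inf]"

# Shifting the lowest break to 17 gives the same grouping with the defaults
shifted <- cut(ages, breaks = c(17, 34, 64, Inf))
```

Both kept and shifted place 18 in the first group; the shifted breaks simply get there without touching an extra argument.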

The reason we avoid changing this argument when possible is for our own sanity. The more parts we move, the more confusing the operation of the function will be to us.

The same goes for the right argument. By default it is set to TRUE, meaning the value on the high end of each group is included in the range. For example, a range of 18-34 can be interpreted in different ways: it isn’t clear whether 34 should be considered part of the group or the start of the next. By default, the ‘cut’ function treats 34 as part of the range. Changing this to right=FALSE pushes 34 to the next range.

This should definitely be avoided when possible. Confusion will abound once this feature is tweaked. It will be very difficult to remember which break is which, and which edge case belongs where.

By working within the confines of the defaults, we learn one way to use the function and gain the peace of mind to add these features once the basic behavior has been fully internalized.
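The effect of right is easiest to see on values that sit at or next to a break. The sketch below uses two made-up ages (the vector edge is hypothetical) with the same breaks as before and no custom labels, so the interval notation shows where each value lands.

```r
# Hypothetical ages on either side of the 34 break
edge <- c(34, 35)

# Default right = TRUE: intervals are (17,34], (34,64], (64,Inf]
closed_right <- cut(edge, breaks = c(17, 34, 64, Inf))
as.character(closed_right)   # 34 -> "(17,34]", 35 -> "(34,64]"

# right = FALSE: intervals become [17,34), [34,64), [64,Inf)
closed_left <- cut(edge, breaks = c(17, 34, 64, Inf), right = FALSE)
as.character(closed_left)    # both 34 and 35 -> "[34,64)"
```

With right=FALSE and these breaks, the label "18-34" would no longer match its contents, since 34 has moved to the second group and 17 now belongs to the first.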