When analyzing data, it can sometimes be useful to group numeric values into buckets or bins. For example, when dealing with age data, perhaps you’d like to group the ages into age groups like 20 to 24, 25 to 29 and so on.

For this demonstration, we will use the training data from the Titanic data set, which can be downloaded from https://www.kaggle.com/c/titanic. I know there are a few rows in this data set where the age is missing (“NA”), so we will remove those rows first.

df <- read.csv("train.csv")  # Load data
missing.age <- is.na(df$Age)
df <- df[!missing.age,]  # Remove NA
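
To see what this filtering step does without downloading the file, here is a minimal sketch on a made-up data frame. The `toy` data frame and its values are assumptions for illustration only and are not part of the Titanic data.

```r
# Toy data frame standing in for train.csv (values are invented for illustration)
toy <- data.frame(Name = c("A", "B", "C", "D"),
                  Age  = c(22, NA, 38, NA))

sum(is.na(toy$Age))            # how many rows have a missing age: 2
toy <- toy[!is.na(toy$Age), ]  # keep only the rows where Age is present
nrow(toy)                      # rows remaining: 2
```

Counting the missing values before dropping them is a cheap way to confirm how many rows the filter removes.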

Next we will name our age groups (or “bins” or “buckets”) 0-4, 5-9, 10-14 and so on up to 95-99, with a final 100+ group for everyone older.

labs <- c(paste(seq(0, 95, by = 5), seq(0 + 5 - 1, 100 - 1, by = 5),
                sep = "-"), paste(100, "+", sep = ""))
labs
##  [1] "0-4"   "5-9"   "10-14" "15-19" "20-24" "25-29" "30-34" "35-39"
##  [9] "40-44" "45-49" "50-54" "55-59" "60-64" "65-69" "70-74" "75-79"
## [17] "80-84" "85-89" "90-94" "95-99" "100+"
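
If you want to try other bin widths, the same label construction can be wrapped in a small helper. This is only a sketch: `make_age_labels` is a hypothetical name, not a function from base R or any package, and it assumes that `top` is a multiple of `width`.

```r
# Hypothetical helper: build labels such as "0-4", "5-9", ..., "100+"
# Assumes `top` is a multiple of `width`.
make_age_labels <- function(width = 5, top = 100) {
  lower <- seq(0, top - width, by = width)     # lower bound of each bin
  c(paste(lower, lower + width - 1, sep = "-"),
    paste0(top, "+"))                          # open-ended last bin
}

labs5  <- make_age_labels(5)    # the same 21 labels as above
labs10 <- make_age_labels(10)   # "0-9", "10-19", ..., "90-99", "100+"
```

The matching breaks for a given width would then be c(seq(0, top, by = width), Inf), so that the number of labels is always one less than the number of breaks.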

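To assign each age to an age group, we create a new column AgeGroup and use the cut function to break Age into groups with the labels we defined in the previous step. Setting right = FALSE makes each interval closed on the left and open on the right, so an age of exactly 5 falls into 5-9 rather than 0-4. The final break of Inf catches every age of 100 or more.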

df$AgeGroup <- cut(df$Age, breaks = c(seq(0, 100, by = 5), Inf), labels = labs, right = FALSE)
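
The effect of right = FALSE is easiest to see on ages that sit exactly on a bin boundary. The sketch below uses a handful of made-up ages (an assumption for illustration, not values from the Titanic data) and compares it with the default, right = TRUE.

```r
# Made-up ages chosen to sit on bin boundaries
ages   <- c(0, 4, 5, 99, 100, 120)
breaks <- c(seq(0, 100, by = 5), Inf)
labs   <- c(paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"), "100+")

# right = FALSE: bins are [lower, upper), so 5 lands in "5-9"
left_closed <- as.character(cut(ages, breaks = breaks, labels = labs, right = FALSE))
left_closed
# "0-4" "0-4" "5-9" "95-99" "100+" "100+"

# right = TRUE (the default): bins are (lower, upper], so 0 falls outside
# every bin and becomes NA, 5 lands in "0-4", and 100 lands in "95-99"
right_closed <- as.character(cut(ages, breaks = breaks, labels = labs))
right_closed
# NA "0-4" "0-4" "95-99" "95-99" "100+"
```

With the default setting the labels would no longer describe the bins correctly, which is why right = FALSE matters for labels of the form 0-4, 5-9 and so on.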

And here is our new AgeGroup column shown alongside the Age data.

head(df[c("Age", "AgeGroup")], 15)
##    Age AgeGroup
## 1   22    20-24
## 2   38    35-39
## 3   26    25-29
## 4   35    35-39
## 5   35    35-39
## 7   54    50-54
## 8    2      0-4
## 9   27    25-29
## 10  14    10-14
## 11   4      0-4
## 12  58    55-59
## 13  20    20-24
## 14  39    35-39
## 15  14    10-14
## 16  55    55-59
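
A natural next step is to count how many passengers fall into each group. As a minimal sketch, the code below rebuilds the groups for just the 15 ages printed above (rather than the full data set) and tabulates them. Using droplevels to hide empty groups is one possible choice, not something the steps above require.

```r
# The 15 ages shown in the output above
ages <- c(22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55)
labs <- c(paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"), "100+")

age_group <- cut(ages, breaks = c(seq(0, 100, by = 5), Inf),
                 labels = labs, right = FALSE)

# Count ages per group; droplevels() removes groups with no members
counts <- table(droplevels(age_group))
counts
```

For these 15 ages the largest group is 35-39 with four passengers. On the full data frame the equivalent call would be table(df$AgeGroup), which keeps the empty groups as zero counts.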