When analyzing data, it can be useful to group numeric values into buckets or bins. For example, when dealing with age data, you might want to group the ages into age groups like 20 to 24, 25 to 29 and so on.
For this demonstration, we will use the training data from the Titanic dataset, which can be downloaded from https://www.kaggle.com/c/titanic. Some rows in this dataset have a missing age (NA), so we will remove those rows first.
df <- read.csv("train.csv") # Load data
missing.age <- is.na(df$Age)
df <- df[!missing.age,] # Remove NA
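To see what this filter does without needing the Titanic file, here is a minimal sketch on a small made-up data frame. The `toy` data and the names `n.missing` and `toy.clean` are assumptions for illustration only.

```r
# Toy data standing in for train.csv (hypothetical values)
toy <- data.frame(Name = c("a", "b", "c", "d"),
                  Age  = c(22, NA, 38, NA))

missing.age <- is.na(toy$Age)      # TRUE where Age is NA
n.missing   <- sum(missing.age)    # how many rows will be dropped
toy.clean   <- toy[!missing.age, ] # keep only rows with a known age

n.missing
rownames(toy.clean)
```

Note that the surviving rows keep their original row names, so a gap in the row numbers of later output marks a row that was dropped here.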
Next we will name our age groups (or “bins” or “buckets”) 0-4, 5-9, 10-14 and so on.
labs <- c(paste(seq(0, 95, by = 5), seq(0 + 5 - 1, 100 - 1, by = 5),
sep = "-"), paste(100, "+", sep = ""))
labs
## [1] "0-4" "5-9" "10-14" "15-19" "20-24" "25-29" "30-34" "35-39"
## [9] "40-44" "45-49" "50-54" "55-59" "60-64" "65-69" "70-74" "75-79"
## [17] "80-84" "85-89" "90-94" "95-99" "100+"
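If you want a different bin width, the same label pattern can be wrapped in a small function. This is only a sketch: `age.labels` is a hypothetical helper, not a base R function, and it assumes `to - from` is a whole multiple of `by`.

```r
# Hypothetical helper: "lower-upper" labels plus a final open-ended label
age.labels <- function(from, to, by) {
  lower <- seq(from, to - by, by = by)  # lower edge of each closed bin
  upper <- lower + by - 1               # upper edge shown in the label
  c(paste(lower, upper, sep = "-"), paste(to, "+", sep = ""))
}

labs5  <- age.labels(0, 100, 5)   # same 21 labels as above
labs10 <- age.labels(0, 100, 10)  # ten-year groups: "0-9", "10-19", ...
```

Whatever width you pick, the breaks passed to cut later must use the same width, so that there is exactly one label per interval.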
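To assign each age to an age group, we create a new column AgeGroup and use the cut function to break Age into groups with the labels we defined in the previous step. Setting right = FALSE makes each interval closed on the left and open on the right, so an age of exactly 5 falls in 5-9 rather than 0-4.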
df$AgeGroup <- cut(df$Age, breaks = c(seq(0, 100, by = 5), Inf), labels = labs, right = FALSE)
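The right = FALSE argument matters at the bin edges. The sketch below uses a few made-up boundary ages (the `ages` vector is an assumption, not Titanic data) to compare it with the default right = TRUE.

```r
# Made-up ages sitting on or near bin boundaries
ages   <- c(0, 4.5, 5, 99, 100, 120)
breaks <- c(seq(0, 100, by = 5), Inf)
labs   <- c(paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"), "100+")

# Intervals [0,5), [5,10), ..., [100,Inf): the labels match the data
left.closed <- cut(ages, breaks = breaks, labels = labs, right = FALSE)

# Default intervals (0,5], (5,10], ...: age 0 becomes NA and
# age 5 is filed under "0-4", so the labels no longer fit
right.closed <- cut(ages, breaks = breaks, labels = labs)

data.frame(ages, left.closed, right.closed)
```

With the default, an age of 0 falls outside the first interval and an age of 5 lands in the bin labelled 0-4, which is why the labels above only make sense together with right = FALSE.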
And here is our new AgeGroup column shown alongside the Age data.
head(df[c("Age", "AgeGroup")], 15)
##    Age AgeGroup
## 1   22    20-24
## 2   38    35-39
## 3   26    25-29
## 4   35    35-39
## 5   35    35-39
## 7   54    50-54
## 8    2      0-4
## 9   27    25-29
## 10  14    10-14
## 11   4      0-4
## 12  58    55-59
## 13  20    20-24
## 14  39    35-39
## 15  14    10-14
## 16  55    55-59
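Once the ages are binned, a natural next step is to count how many rows fall in each group. Here is a minimal sketch using the first few ages from the output above; the vector `ages` and the names `age.group` and `counts` are assumptions for illustration.

```r
# First seven ages from the output above
ages   <- c(22, 38, 26, 35, 35, 54, 2)
breaks <- c(seq(0, 100, by = 5), Inf)
labs   <- c(paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"), "100+")

age.group <- cut(ages, breaks = breaks, labels = labs, right = FALSE)

counts <- table(age.group)  # one entry per label, including empty groups
counts[counts > 0]          # show only the groups that actually occur
```

Because AgeGroup is a factor, table keeps every level, so empty age groups show up with a count of zero unless you filter them out.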