Categorization of numeric variables

Hello, everyone!

In this tutorial I will show you how to quickly and easily create a categorical variable from a numeric one.

In data analysis we sometimes need to divide a continuous scale into groups according to its numeric values. This is needed to classify observations, and it can be useful in subsequent analysis.

The first approach that comes to mind is to write conditional functions with logical expressions that test whether a value is larger or smaller than some number. However, this approach can take a lot of time and space, because in some cases it produces huge cascades of nested ‘ifelse’ calls. Fortunately, there is a function that does the same job in a shorter and faster way.
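For comparison, a cascade of this kind might look like the following minimal sketch in base R. The vector `ages` and the three group labels are made up for illustration and are not part of the tutorial's data.

```r
# Hypothetical ages, used only for illustration
ages <- c(18, 30, 45, 59, 60, 74)

# Each additional group requires another nested ifelse() call
age_group <- ifelse(ages < 45, "Young age",
             ifelse(ages < 60, "Middle age",
                    "Elderly age or older"))

print(data.frame(ages, age_group))
```

With five or more groups the nesting becomes hard to read, and it is easy to get a boundary wrong.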

First, let’s load the library we need, ‘Hmisc’.

library(Hmisc)

I will use the simplest example one can imagine: dividing age data into age categories. To do this, let’s create a sample of numbers ranging from 18 to 100.

set.seed(10)
age_num <- sample(c(18:100), 1000, replace = TRUE) 
summary(age_num)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   37.75   59.00   58.85   80.00  100.00

Now we have a sample of 1000 random numbers between 18 and 100; let’s pretend that these are age data from some survey. Our task is to group the observations according to the age classification of the World Health Organization.

Age period	Age group
18 - 44	Young age
45 - 59	Middle age
60 - 74	Elderly age
75 - 89	Senile age
90+	Longevity

To split age into groups, we apply the cut2 function from the ‘Hmisc’ package, which is already loaded. The function takes the variable to divide as its first argument and a vector of cut points as its second argument. You can also specify other arguments if necessary; they are all described in the function’s R documentation (?cut2).

age_cat <- cut2(age_num, c(45, 60, 75, 90))
summary(age_cat)

## [ 18, 45) [ 45, 60) [ 60, 75) [ 75, 90) [ 90,100] 
##       326       184       169       188       133

In the summary you see the intervals and the number of observations that fall into each of them. A square bracket means that the endpoint next to it belongs to the interval; a parenthesis means that the endpoint does not belong to it. For example, the first interval, [18, 45), starts at 18 and runs up to, but does not include, 45, so for whole-number ages it ends at 44. The last interval, [90, 100], is closed on both sides, so it includes 100.
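A quick way to see how the boundaries are treated is to pass cut2 a few values that sit right at the cut points. This is only a small sketch; the vector `edge_ages` is made up for illustration.

```r
library(Hmisc)

# Hypothetical ages placed on both sides of each cut point
edge_ages <- c(18, 44, 45, 59, 60, 74, 75, 89, 90, 100)
edge_cat <- cut2(edge_ages, c(45, 60, 75, 90))

# Integer codes show which interval each value fell into
print(data.frame(edge_ages, interval = as.integer(edge_cat)))
```

Each cut point (45, 60, 75, 90) should land in the interval that starts at it, while the value just below it stays in the previous interval.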

To check that the division was done correctly, we can combine the vectors of numbers and categories into a data frame and summarize it. Also, to make the categories more meaningful, let’s rename the levels of the factor.

levels(age_cat) <- c("Young age", "Middle age", "Elderly age", "Senile age", "Longevity")
summary(age_cat)

##   Young age  Middle age Elderly age  Senile age   Longevity 
##         326         184         169         188         133

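# data.frame() keeps age_cat as a factor; cbind() would reduce it to integer codes
df1 <- data.frame(age_cat, age_num)
library(dplyr)
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling()
df1 %>%
  group_by(age_cat) %>%
  summarise(Min = min(age_num), Max = max(age_num)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "responsive", "striped"), full_width = FALSE)

age_cat	Min	Max
Young age	18	44
Middle age	45	59
Elderly age	60	74
Senile age	75	89
Longevity	90	100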

The table shows that the minimum and maximum values of each age group match the WHO classification. Thus, we can conclude that the division was done correctly.
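The same check can also be sketched in base R, without dplyr or the table-styling packages. The example below regenerates the sample so that it runs on its own; `bounds` is a name introduced here for illustration.

```r
library(Hmisc)

# Recreate the sample and the grouping used in this tutorial
set.seed(10)
age_num <- sample(18:100, 1000, replace = TRUE)
age_cat <- cut2(age_num, c(45, 60, 75, 90))
levels(age_cat) <- c("Young age", "Middle age", "Elderly age",
                     "Senile age", "Longevity")

# Smallest and largest age observed within each category
bounds <- data.frame(
  Min = tapply(age_num, age_cat, min),
  Max = tapply(age_num, age_cat, max)
)
print(bounds)

# Stop with an error if any group leaves its intended range
stopifnot(bounds$Min >= c(18, 45, 60, 75, 90),
          bounds$Max <= c(44, 59, 74, 89, 100))
```

If a cut point were mistyped, the final stopifnot() call would fail instead of relying on a visual check of the table.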

I hope you will find this function useful. Now it’s your turn to try the cut2 function. Good luck!

by Artyom Kulikov

17/03/2021