In some situations, we want to group the data of a continuous variable into a discrete set of categories. Although we lose information, it helps to give meaning to the variable and to explore its relationship with other variables during EDA. For instance, instead of looking at the height of basketball players as a continuous variable, we can define a new variable with 3 discrete values: short, medium and tall. Then, we can look at the composition of the starting lineup of a team.
So, given a vector of numeric data from a continuous variable, how can we transform it into a discrete set? A similar problem arises when we build a histogram, where the variable is grouped into bins to build that representation.
We can start by creating k groups of equal size. First, we order the data. Second, assuming the data have n values, the n values are divided into k groups. The quantile function helps to achieve this. In what follows, we will use the rivers dataset as an example. It gives the lengths of the 141 “major” rivers in North America.
data(rivers)
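As a quick illustration of the histogram parallel mentioned above, we can peek at the break points that hist would choose for this data without drawing anything (the exact values depend on hist's default Sturges rule)
# bins that hist() would use for rivers, without producing a plot
hist(rivers, plot = FALSE)$breaks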
Suppose we want to create 4 equal groups. Then
br <- quantile(rivers, (0:4) / 4)
gives the right breaks to use in the cut function to obtain the factor that defines the 4 equal groups. Take into account that there are different methods to compute quantiles, so the exact boundaries of the groups may differ slightly between them. By default, quantile uses type 7.
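For instance, a quick comparison of two of the nine available types shows slightly different candidate breaks (this is only a side check; type 7 is used in the rest of the note)
# the interpolation method changes the estimated quantiles slightly
rbind(type7 = quantile(rivers, (0:4) / 4, type = 7),
      type6 = quantile(rivers, (0:4) / 4, type = 6))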
The split into 4 equal-sized groups is then
groups <- cut(rivers, br, include.lowest = TRUE, dig.lab = 4)
levels(groups)
## [1] "[135,310]" "(310,425]" "(425,680]" "(680,3710]"
summary(groups)
## [135,310] (310,425] (425,680] (680,3710]
## 36 35 35 35
By grouping, we lose information. We can look at the proportion of the variance in the rivers data that is explained by the 4 groups
ss_decomposition <- function(v, f){
  # group means and sizes for the factor f (empty levels set to 0)
  means <- tapply(v, f, mean)
  means[is.na(means)] <- 0
  n <- tapply(v, f, length)
  n[is.na(n)] <- 0
  avg_v <- mean(v)
  # total sum of squares
  sst <- sum((v - avg_v)^2)
  # between-group (explained) sum of squares
  sse <- sum(n * (means - avg_v)^2)
  # within-group (residual) sum of squares
  means_f <- rep(means, n)
  ssi <- sum((v[order(f)] - means_f)^2)
  list(mean_t = avg_v, means = means, sizes = n,
       sst = sst, sse = sse, ssi = ssi, check = sse + ssi,
       ratio_ss = sse/sst,
       var_explained = sprintf("%.2f%%", sse/sst * 100))
}
test_variance <- ss_decomposition(rivers, groups)
test_variance$var_explained
## [1] "56.47%"
We could also use the aov function to compute the variance explained
aov_summary <- summary(aov(rivers ~ groups))
aov_summary[[1]]$`Sum Sq`[1] / sum(aov_summary[[1]]$`Sum Sq`)
## [1] 0.5647245
Alternatively, we can compute it as follows
ss <- aov_summary[[1]][, 2]
cat(sprintf("variance explained = %.2f%%\n", 100 * ss[1]/sum(ss)))
## variance explained = 56.47%
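The same ratio is also the R-squared of a linear model that uses the grouping factor as its only predictor, which gives a one-line alternative that should return the same number
# for a one-way layout, R-squared equals the between-group / total SS ratio
summary(lm(rivers ~ groups))$r.squared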
The intervals chosen have no meaning beyond being the ones that divide the rivers dataset into 4 equal groups.
We can define other ranges and check whether they explain more or less variance. For instance, I’d suggest the following groups
br2 <- c(0, 300, 400, 700, 5000)
groups2 <- cut(rivers, breaks = br2, include.lowest = TRUE,
               labels = c("<=300", "(300, 400]", "(400, 700]", ">700"))
summary(groups2)
## <=300 (300, 400] (400, 700] >700
## 32 32 43 34
For groups2, the variance explained is
ss_decomposition(rivers, groups2)$var_explained
## [1] "57.00%"
Without the restriction of having the same number of data points in each group, we can divide the range of the variable into k equal parts. This is exactly what the cut function does when breaks = k
groups3 <- cut(rivers, 4, include.lowest = TRUE, dig.lab = 4)
summary(groups3)
## [131.4,1029] (1029,1922] (1922,2816] (2816,3714]
## 125 12 3 1
which explains this much of the variance
ss_decomposition(rivers, groups3)$var_explained
## [1] "83.42%"
In order to obtain the exact break points, we can read the levels obtained after applying the function cut without the labels argument and with a large number of decimal positions, 10 for instance.
group_control <- cut(rivers, 4, dig.lab = 10)
levels_obtained <- levels(group_control)
# strip the interval delimiters and split each level at the comma
br_temp <- unlist(
  strsplit(gsub("[\\[\\]\\)\\(]", replacement = "",
                levels_obtained, perl = TRUE),
           split = ","))
# first lower bound followed by every upper bound
br_method3 <- as.numeric(c(br_temp[1], br_temp[c(FALSE, TRUE)]))
It is possible to obtain the breaks directly, using the same method as the cut function
range_v <- range(rivers)
br_method3_direct <- c(range_v[1] - diff(range_v) * 0.001,
                       (1:3) * diff(range_v) / 4 + range_v[1],
                       range_v[2] + diff(range_v) * 0.001)
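Both routes should agree up to the rounding introduced by dig.lab, which can be verified with a tolerance-based comparison
# compare with a tolerance instead of ==, since br_method3 was parsed
# from levels printed with 10 significant digits
all.equal(br_method3, br_method3_direct)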
Once we have this suggestion, we can define groups with simpler boundaries
br4 <- c(0, 1000, 2000, 3000, 5000)
groups4 <- cut(rivers, breaks = br4, include.lowest = TRUE,
               labels = c("<=1e3", "(1e3, 2e3]", "(2e3, 3e3]", ">3e3"))
summary(groups4)
## <=1e3 (1e3, 2e3] (2e3, 3e3] >3e3
## 125 12 3 1
which explains the same variance as groups3 but has more appealing boundaries
ss_decomposition(rivers, groups4)$var_explained
## [1] "83.42%"
Clustering methods can be used to define the groups. For instance,
d <- dist(rivers)
hc <- hclust(d)
resp <- cutree(hc, k = 4)
# (min, max) of each cluster; keep the first cluster's minimum and every cluster's maximum
hclust_temp <- unlist(tapply(rivers, resp, range))
br_hclust <- as.numeric(c(hclust_temp[1], hclust_temp[c(FALSE, TRUE)]))
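Note that cutree labels clusters by order of appearance in the data rather than by value, so the breaks above are not guaranteed to come out sorted. A small defensive variant (the names cluster_max and br_hclust_safe are just illustrative) is
# take the upper end of each cluster and sort, since cluster labels
# are not necessarily ordered by river length
cluster_max <- tapply(rivers, resp, max)
br_hclust_safe <- c(min(rivers), sort(unname(cluster_max)))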
Once we have this suggestion, a new grouping can be defined
br5 <- c(0, 1100, 2000, 3000, 5000)
groups5 <- cut(rivers, breaks = br5, include.lowest = TRUE,
               labels = c("<=1.1e3", "(1.1e3, 2e3]", "(2e3, 3e3]", ">3e3"))
summary(groups5)
## <=1.1e3 (1.1e3, 2e3] (2e3, 3e3] >3e3
## 128 9 3 1
which explains this proportion of the variance
ss_decomposition(rivers, groups5)$var_explained
## [1] "81.03%"
In addition, we can transform the original variable and find a new suggestion of boundaries for the 4 groups
group_control <- cut(log(rivers), 4, dig.lab = 10)
levels_obtained <- levels(group_control)
br_temp <- unlist(
  strsplit(gsub("[\\[\\]\\)\\(]", replacement = "",
                levels_obtained, perl = TRUE),
           split = ","))
br6_temp <- as.numeric(c(br_temp[1], br_temp[c(FALSE, TRUE)]))
br6 <- exp(br6_temp)
group6 <- cut(rivers, br6, dig.lab = 10)
aov_rivers6 <- aov(rivers ~ group6)
ss6 <- summary(aov_rivers6)[[1]][, 2]
cat(sprintf("variance explained = %.2f%%\n", 100 * ss6[1]/sum(ss6)))
## variance explained = 86.22%
Looking at the breaks generated
br6
## [1] 134.5534 309.0963 707.7076 1620.3688 3722.3135
we can, again, define simpler boundaries without sacrificing too much variance
br7 <- c(0, 300, 700, 1500, 5000)
groups7 <- cut(rivers, breaks = br7, include.lowest = TRUE,
               labels = c("<=3e2", "(3e2, 7e2]", "(7e2, 15e2]", ">15e2"))
summary(groups7)
## <=3e2 (3e2, 7e2] (7e2, 15e2] >15e2
## 32 75 28 6
aov_rivers7 <- aov(rivers ~ groups7)
ss7 <- summary(aov_rivers7)[[1]][, 2]
cat(sprintf("variance explained = %.2f%%\n", 100 * ss7[1]/sum(ss7)))
## variance explained = 86.11%
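As an aside, rounding a suggestion like br6 to a couple of significant digits is a quick mechanical way to get round boundaries before fine-tuning them by hand (just a convenience; the hand-picked values above work equally well)
# round the log-based suggestion to 2 significant digits
signif(br6, 2)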
In some cases, we might be interested in finding the number of groups that explains a given proportion of the variance, 90% for instance. The minimum k that explains a given amount of variance can be found by iteration. The function k_best is a simple prototype.
var_explained <- function(k, v){
  # proportion of variance explained by an equal-width split into k groups
  computed_breaks <- cut(v, k, include.lowest = TRUE, dig.lab = 4)
  proportion_variance_explained <- ss_decomposition(v, computed_breaks)$ratio_ss
  return(proportion_variance_explained)
}
k_best <- function(v, limit = 0.8, kmax = 20){
  # we assume more than 50 points
  total <- sapply(2:kmax, var_explained, v)
  # smallest number of groups (from 2 to kmax) reaching the required limit
  n_bins <- which(total >= limit)[1] + 1
  list(k = n_bins, groups = cut(v, n_bins, include.lowest = TRUE, dig.lab = 4))
}
k_best(rivers, 0.90)$k
## [1] 6
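Before settling on a value of k, it can be useful to look at how the explained variance grows with the number of groups; a small exploratory sketch using the helper above (the name vexp is just illustrative) is
# proportion of variance explained by equal-width splits with 2 to 10 groups
vexp <- sapply(2:10, var_explained, v = rivers)
names(vexp) <- 2:10
round(vexp, 3)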
We give one more example of grouping a continuous variable. In this case we take the faithful dataset and try to group the eruption times.
data(faithful)
eruptions <- faithful$eruptions
groups_best <- k_best(eruptions, 0.9)
groups_best$k
## [1] 3
summary(groups_best$groups)
## [1.597,2.767] (2.767,3.933] (3.933,5.103]
## 94 36 142
Looking at the data always helps
hist(eruptions, 20)
We realize that, although 3 groups explain more than 90% of the variance, two groups could be enough depending on the purpose
br8 <- c(0, 3, 6)
groups8 <- cut(eruptions, breaks = br8, include.lowest = TRUE,
labels = c("<=3 min", ">3 min"))
summary(groups8)
## <=3 min >3 min
## 97 175
which already accounts for a great part of the variance
ss_decomposition(eruptions, groups8)$var_explained
## [1] "89.74%"
This short note is dedicated to Luca Cerone, a data scientist I worked with and an R activist.