In some situations, we want to group the data of a continuous variable into a discrete set of categories. Although we lose information, it helps to give meaning to the variable and to explore its relationship with other variables during EDA. For instance, instead of looking at the height of basketball players as a continuous variable, we can define a new variable with 3 discrete values: short, medium and tall. Then, we can look at the composition of the starting lineup of a team.
So, given a vector of numeric data from a continuous variable, how can we transform it into a discrete set? A similar problem arises when we build a histogram, where the variable is grouped into bins to build that representation.
We can start by creating k groups of equal size. First, we order the data. Second, assuming the data have n values, the n values are divided into k groups. The quantile function helps to achieve this. In what follows, we will use the rivers dataset as an example. It gives the lengths of the 141 “major” rivers in North America.
data(rivers)
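As a quick illustration of the histogram parallel mentioned above, we can peek at the break points that hist would choose for this data without drawing anything (the exact values depend on hist's default Sturges rule)
# bins that hist() would use for rivers, without producing a plot
hist(rivers, plot = FALSE)$breaks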
Suppose we want to create 4 equal groups. Then
br <- quantile(rivers, (0:4) / 4)
gives the right breaks to use in the cut function to obtain the factor that defines the 4 equal groups. Take into account that there are different methods to compute quantiles, so the exact boundaries of the groups may differ slightly between them. By default, quantile uses type 7.
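For instance, a quick comparison of two of the nine available types shows slightly different candidate breaks (this is only a side check; type 7 is used in the rest of the note)
# the interpolation method changes the estimated quantiles slightly
rbind(type7 = quantile(rivers, (0:4) / 4, type = 7),
      type6 = quantile(rivers, (0:4) / 4, type = 6))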
The split into 4 equal-sized groups is then
groups <- cut(rivers, br, include.lowest = TRUE, dig.lab = 4)
levels(groups)
## [1] "[135,310]" "(310,425]" "(425,680]" "(680,3710]"
summary(groups)
## [135,310] (310,425] (425,680] (680,3710]
## 36 35 35 35
By grouping, we lose information. We can look at the proportion of the variance in the rivers data that is explained by the 4 groups
ss_decomposition <- function(v, f){
  # group means and sizes for the factor f (empty levels set to 0)
  means <- tapply(v, f, mean)
  means[is.na(means)] <- 0
  n <- tapply(v, f, length)
  n[is.na(n)] <- 0
  avg_v <- mean(v)
  # total sum of squares
  sst <- sum((v - avg_v)^2)
  # between-group (explained) sum of squares
  sse <- sum(n * (means - avg_v)^2)
  # within-group (residual) sum of squares
  means_f <- rep(means, n)
  ssi <- sum((v[order(f)] - means_f)^2)
  list(mean_t = avg_v, means = means, sizes = n,
       sst = sst, sse = sse, ssi = ssi, check = sse + ssi,
       ratio_ss = sse/sst,
       var_explained = sprintf("%.2f%%", sse/sst * 100))
}
test_variance <- ss_decomposition(rivers, groups)
test_variance$var_explained
## [1] "56.47%"
We could also use the aov function to compute the variance explained
aov_summary <- summary(aov(rivers ~ groups))
aov_summary[[1]]$`Sum Sq`[1] / sum(aov_summary[[1]]$`Sum Sq`)
## [1] 0.5647245
Alternatively, we can compute it as follows
ss <- aov_summary[[1]][, 2]
cat(sprintf("variance explained = %.2f%%\n", 100 * ss[1]/sum(ss)))
## variance explained = 56.47%
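The same ratio is also the R-squared of a linear model that uses the grouping factor as its only predictor, which gives a one-line alternative that should return the same number
# for a one-way layout, R-squared equals the between-group / total SS ratio
summary(lm(rivers ~ groups))$r.squared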
The intervals chosen have no meaning beyond being the ones that divide the rivers dataset into 4 equal groups.
We can define other ranges and check whether they explain more or less variance. For instance, I’d suggest the following groups
br2 <- c(0, 300, 400, 700, 5000)
groups2 <- cut(rivers, breaks = br2, include.lowest = TRUE,
               labels = c("<=300", "(300, 400]", "(400, 700]", ">700"))
summary(groups2)
## <=300 (300, 400] (400, 700] >700
## 32 32 43 34
For groups2, the variance explained is
ss_decomposition(rivers, groups2)$var_explained
## [1] "57.00%"
Without the restriction of having the same number of data points in each group, we can divide the range of the variable into k equal parts. This is exactly what the cut function does when breaks = k
groups3 <- cut(rivers, 4, include.lowest = TRUE, dig.lab = 4)
summary(groups3)
## [131.4,1029] (1029,1922] (1922,2816] (2816,3714]
## 125 12 3 1
which explains this much of the variance
ss_decomposition(rivers, groups3)$var_explained
## [1] "83.42%"
In order to obtain the exact break points, we can read the levels obtained after applying the function cut without the labels argument and with a large number of decimal positions, 10 for instance.
group_control <- cut(rivers, 4, dig.lab = 10)
levels_obtained <- levels(group_control)
# strip the interval delimiters and split each level at the comma
br_temp <- unlist(
  strsplit(gsub("[\\[\\]\\)\\(]", replacement = "",
                levels_obtained, perl = TRUE),
           split = ","))
# first lower bound followed by every upper bound
br_method3 <- as.numeric(c(br_temp[1], br_temp[c(FALSE, TRUE)]))
It is possible to obtain the breaks directly, using the same method as the cut function
range_v <- range(rivers)
br_method3_direct <- c(range_v[1] - diff(range_v) * 0.001,
                       (1:3) * diff(range_v) / 4 + range_v[1],
                       range_v[2] + diff(range_v) * 0.001)
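Both routes should agree up to the rounding introduced by dig.lab, which can be verified with a tolerance-based comparison
# compare with a tolerance instead of ==, since br_method3 was parsed
# from levels printed with 10 significant digits
all.equal(br_method3, br_method3_direct)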
Once we have this suggestion, we can define groups with simpler boundaries
br4 <- c(0, 1000, 2000, 3000, 5000)
groups4 <- cut(rivers, breaks = br4, include.lowest = TRUE,
               labels = c("<=1e3", "(1e3, 2e3]", "(2e3, 3e3]", ">3e3"))
summary(groups4)
## <=1e3 (1e3, 2e3] (2e3, 3e3] >3e3
## 125 12 3 1
which explains the same variance as groups3 but has more appealing boundaries
ss_decomposition(rivers, groups4)$var_explained
## [1] "83.42%"
Clustering methods can be used to define the groups. For instance,
d <- dist(rivers)
hc <- hclust(d)
resp <- cutree(hc, k = 4)
# (min, max) of each cluster; keep the first cluster's minimum and every cluster's maximum
hclust_temp <- unlist(tapply(rivers, resp, range))
br_hclust <- as.numeric(c(hclust_temp[1], hclust_temp[c(FALSE, TRUE)]))
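Note that cutree labels clusters by order of appearance in the data rather than by value, so the breaks above are not guaranteed to come out sorted. A small defensive variant (the names cluster_max and br_hclust_safe are just illustrative) is
# take the upper end of each cluster and sort, since cluster labels
# are not necessarily ordered by river length
cluster_max <- tapply(rivers, resp, max)
br_hclust_safe <- c(min(rivers), sort(unname(cluster_max)))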
Once we have this suggestion, a new grouping can be defined
br5 <- c(0, 1100, 2000, 3000, 5000)
groups5 <- cut(rivers, breaks = br5, include.lowest = TRUE,
               labels = c("<=1.1e3", "(1.1e3, 2e3]", "(2e3, 3e3]", ">3e3"))
summary(groups5)
## <=1.1e3 (1.1e3, 2e3] (2e3, 3e3] >3e3
## 128 9 3 1
which explains this proportion of the variance
ss_decomposition(rivers, groups5)$var_explained
## [1] "81.03%"
In addition, we can transform the original variable and find a new suggestion of boundaries for the 4 groups
group_control <- cut(log(rivers), 4, dig.lab = 10)
levels_obtained <- levels(group_control)
br_temp <- unlist(
  strsplit(gsub("[\\[\\]\\)\\(]", replacement = "",
                levels_obtained, perl = TRUE),
           split = ","))
br6_temp <- as.numeric(c(br_temp[1], br_temp[c(FALSE, TRUE)]))
br6 <- exp(br6_temp)
group6 <- cut(rivers, br6, dig.lab = 10)
aov_rivers6 <- aov(rivers ~ group6)
ss6 <- summary(aov_rivers6)[[1]][, 2]
cat(sprintf("variance explained = %.2f%%\n", 100 * ss6[1]/sum(ss6)))
## variance explained = 86.22%
Looking at the breaks generated
br6
## [1] 134.5534 309.0963 707.7076 1620.3688 3722.3135
we can, again, define simpler boundaries without sacrificing too much variance
br7 <- c(0, 300, 700, 1500, 5000)
groups7 <- cut(rivers, breaks = br7, include.lowest = TRUE,
               labels = c("<=3e2", "(3e2, 7e2]", "(7e2, 15e2]", ">15e2"))
summary(groups7)
## <=3e2 (3e2, 7e2] (7e2, 15e2] >15e2
## 32 75 28 6
aov_rivers7 <- aov(rivers ~ groups7)
ss7 <- summary(aov_rivers7)[[1]][, 2]
cat(sprintf("variance explained = %.2f%%\n", 100 * ss7[1]/sum(ss7)))
## variance explained = 86.11%
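As an aside, rounding a suggestion like br6 to a couple of significant digits is a quick mechanical way to get round boundaries before fine-tuning them by hand (just a convenience; the hand-picked values above work equally well)
# round the log-based suggestion to 2 significant digits
signif(br6, 2)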
In some cases, we might be interested in finding the number of groups that explains a given proportion of the variance, 90% for instance. The minimum k that explains a given amount of variance can be found by iteration. The function k_best is a simple prototype.
var_explained <- function(k, v){
  # proportion of variance explained by an equal-width split into k groups
  computed_breaks <- cut(v, k, include.lowest = TRUE, dig.lab = 4)
  proportion_variance_explained <- ss_decomposition(v, computed_breaks)$ratio_ss
  return(proportion_variance_explained)
}
k_best <- function(v, limit = 0.8, kmax = 20){
  # we assume more than 50 points
  total <- sapply(2:kmax, var_explained, v)
  # smallest number of groups (from 2 to kmax) reaching the required limit
  n_bins <- which(total >= limit)[1] + 1
  list(k = n_bins, groups = cut(v, n_bins, include.lowest = TRUE, dig.lab = 4))
}
k_best(rivers, 0.90)$k
## [1] 6
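Before settling on a value of k, it can be useful to look at how the explained variance grows with the number of groups; a small exploratory sketch using the helper above (the name vexp is just illustrative) is
# proportion of variance explained by equal-width splits with 2 to 10 groups
vexp <- sapply(2:10, var_explained, v = rivers)
names(vexp) <- 2:10
round(vexp, 3)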
We give one more example of grouping a continuous variable. In this case we take the faithful dataset and try to group the eruption times.
data(faithful)
eruptions <- faithful$eruptions
groups_best <- k_best(eruptions, 0.9)
groups_best$k
## [1] 3
summary(groups_best$groups)
## [1.597,2.767] (2.767,3.933] (3.933,5.103]
## 94 36 142
Looking at the data always helps
hist(eruptions, 20)
We realize that, although 3 groups explain more than 90% of the variance, two groups could be enough depending on the purpose
br8 <- c(0, 3, 6)
groups8 <- cut(eruptions, breaks = br8, include.lowest = TRUE,
labels = c("<=3 min", ">3 min"))
summary(groups8)
## <=3 min >3 min
## 97 175
which already accounts for a great part of the variance
ss_decomposition(eruptions, groups8)$var_explained
## [1] "89.74%"
This short note is dedicated to Luca Cerone, a data scientist I worked with and an R activist.