Equal-width (distance) binning

In equal-width binning, the range of the variable is divided into \(n\) intervals of equal width. If \(y_{max}\) and \(y_{min}\) are the maximum and minimum values of the variable, the width of each interval is:

\[ w = \frac{\left( y_{max} - y_{min} \right)}{n} \]

Thus, you need to choose the number of intervals, \(n\), before binning. There is rarely an obvious value for the analyst to pick, which is one of the disadvantages of this method.
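To make the formula concrete, here is a minimal base R sketch. The vector `y` and the choice `n = 4` are made-up values for illustration only, not data from this lesson, and `cut()` is just one of several ways to assign observations to the resulting intervals.

```r
# Hypothetical values, chosen only to illustrate the formula
y <- c(2, 4, 7, 9, 12, 15, 20, 22)
n <- 4                                   # number of intervals (an assumption)

# Interval width: w = (y_max - y_min) / n
w <- (max(y) - min(y)) / n

# n + 1 equally spaced edges running from y_min to y_max
edges <- seq(min(y), max(y), length.out = n + 1)

# Assign each value to an interval; include.lowest = TRUE keeps y_min in bin 1
bin <- cut(y, breaks = edges, include.lowest = TRUE, labels = FALSE)

print(w)       # 5
print(edges)   # 2 7 12 17 22
print(data.frame(y = y, bin = bin))
```

With the default right-closed intervals of `cut()`, a value that sits exactly on an interior edge, such as 7, is placed in the lower bin. Other tools may use the opposite convention, so edge values can be labelled differently.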

A histogram uses equal-width binning to describe the distribution of the data. In this histogram of the JohnsonJohnson dataset in R, note that every bin has the same width, while the frequency counts (number of observations in each bin) differ.

hist(JohnsonJohnson, breaks = 10)
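If you want to check the bins numerically rather than by eye, one possible approach is to ask hist() for the binning without drawing it. The sketch below assumes only base R; the object names `x`, `h` and `widths` are illustrative. Note that when breaks is a single number, hist() treats it as a suggestion, so the number of bins actually used may differ from the value requested.

```r
# JohnsonJohnson is a quarterly time series; use its values as a plain numeric vector
x <- as.numeric(JohnsonJohnson)

# plot = FALSE returns the binning information instead of drawing the histogram
h <- hist(x, breaks = 10, plot = FALSE)

# Width of each bin: difference between consecutive break points
widths <- diff(h$breaks)

print(h$breaks)   # bin edges chosen by hist()
print(widths)     # the same width for every bin
print(h$counts)   # number of observations per bin (these differ)
```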

Also note that as the number of bins changes (and therefore the bin width also changes), the appearance of the distribution may vary. The division into too many bins can over-emphasise noise in the data.

hist(JohnsonJohnson, breaks = 15)

Too few bins, however, may lead to a loss of detail. This is why choosing a suitable number of bins can be challenging.

hist(JohnsonJohnson, breaks = 4)
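When there is no obvious number of bins, a rule of thumb can provide a starting point. The sketch below compares three rules that ship with base R (Sturges, Scott and Freedman-Diaconis) through the nclass.*() helper functions. It is offered as one possible aid, not as a recommendation from this lesson, and the suggested counts should still be checked against the plotted histogram.

```r
x <- as.numeric(JohnsonJohnson)

# Suggested number of bins under three common rules of thumb
candidates <- c(
  Sturges = nclass.Sturges(x),  # based on the sample size only
  Scott   = nclass.scott(x),    # based on the standard deviation
  FD      = nclass.FD(x)        # based on the interquartile range
)
print(candidates)

# The same rules can be requested by name when drawing the histogram
hist(x, breaks = "Sturges")
```

The rules often disagree, which is another reminder that the number of bins is a judgement call rather than a fixed property of the data.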

Let’s practise

Perform the following steps on the ‘iris’ dataset to apply equal-width binning.

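In R, you can use the discretize() function from the {infotheo} package to apply equal-width binning. By default, discretize() sets the number of bins to roughly the cube root of the number of observations; use its nbins argument to choose \(n\) yourself.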

Step 1: Take a subset of the ‘iris’ dataset to which equal-width binning will be applied. Your code should look like this:

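# load the packages that provide %>%, filter() and select() (dplyr) and %<>% (magrittr)
library(dplyr)
library(magrittr)

# subset iris to the versicolor flowers and keep only the Sepal.Length variable
versicolor_sl <- iris %>%
  filter(Species == "versicolor") %>%
  dplyr::select(Sepal.Length)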

head(versicolor_sl)
##   Sepal.Length
## 1          7.0
## 2          6.4
## 3          6.9
## 4          5.5
## 5          6.5
## 6          5.7

Step 2: Apply equal-width binning to the Sepal.Length variable using the code given below:

ew_binned <- infotheo::discretize(versicolor_sl, disc = "equalwidth")
names(ew_binned) <- "sepal_length_binned"

versicolor_sl %<>% 
  bind_cols(ew_binned) 

versicolor_sl %>% 
  head(n = 10)
##    Sepal.Length sepal_length_binned
## 1           7.0                   3
## 2           6.4                   3
## 3           6.9                   3
## 4           5.5                   1
## 5           6.5                   3
## 6           5.7                   2
## 7           6.3                   3
## 8           4.9                   1
## 9           6.6                   3
## 10          5.2                   1
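To see where bin labels such as these come from, the binning can be approximated by hand in base R. The sketch below assumes three bins, which matches the labels 1 to 3 in the output above, and uses left-closed intervals. The names `x`, `n_bins`, `edges` and `bins` are illustrative, and this is not the internal code of discretize(). Values lying exactly on an edge may be labelled differently by different tools because of floating-point rounding or a different interval convention.

```r
# Sepal.Length of the versicolor flowers, as a plain numeric vector
x <- iris$Sepal.Length[iris$Species == "versicolor"]

n_bins <- 3                                  # assumption: three bins
w <- (max(x) - min(x)) / n_bins              # equal interval width
edges <- seq(min(x), max(x), length.out = n_bins + 1)

# right = FALSE gives intervals [a, b); include.lowest = TRUE closes the last one
bins <- cut(x, breaks = edges, right = FALSE,
            include.lowest = TRUE, labels = FALSE)

print(w)
print(edges)
print(table(bins))                           # observations per bin
```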