Binning in Histograms

Exploratory Data Analysis/Binning/Activity1A

As with a stem plot, the quality of a histogram depends closely on a good choice of bin width. In this part of the activity, we will use a “slider” function to adjust the number of bins in a histogram, experiment with choosing good bin widths for some simulated data, and then suggest a rule for determining the bin width in a histogram.

knitr::opts_chunk$set(error = TRUE)

set.seed(886784)
library(LearnEDA)

## Loading required package: vcd

## Loading required package: grid

## Loading required package: manipulate

library(vcd)
library(grid)
library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(printr)
library(manipulate)
y=rnorm(50)
slider.histogram(y)

## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio

The histogram with 9 bins looks acceptable: it has no empty bins, although it fails to show the spread of the data well. We would not want to choose a noticeably smaller number of bins, nor a noticeably larger one.
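Since slider.histogram only runs inside RStudio, one non-interactive way to mimic the slider is to draw the same sample with several bin counts side by side. The following is a minimal sketch in base R and is not part of LearnEDA; the helper name exact_breaks, the seed and the candidate bin counts are assumptions chosen for illustration.

```r
# Sketch only: compare several bin counts without the interactive slider.
set.seed(1)                      # arbitrary seed, an assumption for this example
y <- rnorm(50)

# Hypothetical helper: k equal-width bins spanning exactly min(y) to max(y)
exact_breaks <- function(y, k) seq(min(y), max(y), length.out = k + 1)

bin_counts <- c(5, 9, 15, 30)    # candidate numbers of bins to compare
op <- par(mfrow = c(2, 2))       # 2 x 2 grid of plots
for (k in bin_counts) {
  hist(y, breaks = exact_breaks(y, k), main = paste(k, "bins"), xlab = "y")
}
par(op)                          # restore the previous plot layout
```

Passing an explicit vector of break points, rather than a single number, makes hist use exactly that many bins.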

Let us take samples of size 30, 100, 250 and 1000. For each of these four cases, we will use the ‘slider.histogram’ function to construct a histogram of a sample of normally distributed data, and adjust the number of classes to produce a suitable representation of the data.

For each sample we will also draw the histogram and compute the bin width of the best one.

n=c(30,100,250,1000)
number_bins=bin_widths=freeman_width=freeman_bins=c(0,0,0,0)

y=rnorm(30)
slider.histogram(y)

## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio

hist(y, breaks = 30)

number_bins[1]=30
bin_widths[1]=.1
lval(y)

	depth	lo	hi	mids	spreads
M	15.5	-0.0762517	-0.0762517	-0.0762517	0.000000
H	8.0	-0.7218991	0.7804406	0.0292707	1.502340
E	4.5	-0.8359390	1.0796184	0.1218397	1.915557
D	2.5	-0.9863350	1.4476985	0.2306817	2.434033
C	1.0	-1.2614452	1.6609938	0.1997743	2.922439

lval = lval(y)
binwidth=2*(lval[2,5])/30^((1/3))
binwidth

## [1] 0.9669953

freeman_width[1]=binwidth
min(y) + 3*freeman_width[1]

## [1] 1.639541

min(y) + 4*freeman_width[1]

## [1] 2.606536

freeman_bins[1] = 4

What do we see? The histogram with 30 bins looks good. We would not want to go below about 10 bins, as the structure of the data may not be clearly visible. We can also see the empty bins, which shed light on the structure. The bins start neatly at 0.00 and finish at 1.0 with no overlap or gaps, which makes the bins easier to count. The bin widths are simple decimals, which makes the display easier to read. We can count the number of bins from the displayed histogram and read off the bin width.
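The bin count and bin width can also be read from the object that hist returns, instead of being counted on the plot. This is a minimal sketch in base R on a fresh simulated sample; the seed is an arbitrary assumption. Note that a single number passed as breaks is only a suggestion, because hist rounds the break points to "pretty" values.

```r
# Sketch only: inspect the bins that hist() actually used.
set.seed(1)                                   # arbitrary seed, an assumption
y <- rnorm(30)

h <- hist(y, breaks = 30, plot = FALSE)       # compute the bins without drawing

n_bins <- length(h$counts)                    # number of bins actually used
width  <- unique(round(diff(h$breaks), 10))   # common bin width
empty  <- sum(h$counts == 0)                  # number of empty bins

n_bins
width
empty
```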

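The lval function gives us the fourth-spread (IQR), which is the spread in the H row of its output. We calculated the Freeman width from it using the formula bin width = 2 * IQR / n^(1/3), where n is the sample size. This formula is commonly known as the Freedman-Diaconis rule.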

Suppose the histogram starts at the smallest value. We keep adding the bin width until we reach a bin that contains the maximum, and then we stop. The number of widths added is the number of Freeman bins.
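The counting procedure above can be written as a small function. This is a sketch under stated assumptions, not part of LearnEDA: the name freeman_rule is hypothetical, and it uses the base R function fivenum for the hinges, which should correspond to the H row of the lval output.

```r
# Hypothetical helper: bin width 2 * fourth-spread / n^(1/3) and the number
# of bins needed to get from min(y) to a bin that contains max(y).
freeman_rule <- function(y) {
  hinges <- fivenum(y)[c(2, 4)]                  # lower and upper hinge
  width  <- 2 * diff(hinges) / length(y)^(1/3)   # rule-based bin width
  n_bins <- ceiling((max(y) - min(y)) / width)   # bins needed to cover the range
  list(width = width, n_bins = n_bins)
}

# Usage on the built-in faithful data
freeman_rule(faithful$waiting)
```

Taking the ceiling of range / width gives the same count as adding the bin width to the minimum one step at a time until the maximum is covered.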

y=rnorm(100)
slider.histogram(y)

## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio

hist(y, breaks = 22)

number_bins[2]=22
bin_widths[2]=.1
lval(y)

	depth	lo	hi	mids	spreads
M	50.5	-0.1868858	-0.1868858	-0.1868858	0.000000
H	25.5	-0.8498964	0.6186209	-0.1156377	1.468517
E	13.0	-1.2274889	1.1560293	-0.0357298	2.383518
D	7.0	-1.8104089	1.3323456	-0.2390316	3.142755
C	4.0	-2.0940875	1.7915783	-0.1512546	3.885666
B	2.5	-2.1147769	2.6095016	0.2473623	4.724278
A	1.0	-2.2534208	3.0634742	0.4050267	5.316895

lval = lval(y)
binwidth=2*(lval[2,5])/100^((1/3))
binwidth

## [1] 0.6327649

freeman_width[2]=binwidth
min(y) + 3*freeman_width[2]

## [1] -0.355126

min(y) + 8*freeman_width[2]

## [1] 2.808699

min(y) + 9*freeman_width[2]

## [1] 3.441463

freeman_bins[2] = 9

We see that the bins are ‘skinny’, starting exactly at 0 with a bin width of 0.1, which is again a simple decimal. We can count the number of bins from the displayed histogram, and the bin width is now easy to see. Every histogram here will be bell-shaped or close to it because the data are normal, but our goal is to observe the structure. Several bins are empty.

y=rnorm(250)
slider.histogram(y)

## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio

hist(y, breaks = 30)

number_bins[3]=30
bin_widths[3]=.2
lval(y)

	depth	lo	hi	mids	spreads
M	125.5	0.1370814	0.1370814	0.1370814	0.000000
H	63.0	-0.6131411	0.9114082	0.1491336	1.524549
E	32.0	-0.9771524	1.3286121	0.1757298	2.305764
D	16.5	-1.4198268	1.7569927	0.1685829	3.176820
C	8.5	-1.7249883	1.9459183	0.1104650	3.670907
B	4.5	-2.0687376	2.2396310	0.0854467	4.308369
A	2.5	-2.4932656	2.3224460	-0.0854098	4.815712
Z	1.0	-2.5893304	3.0621857	0.2364276	5.651516

lval = lval(y)
binwidth=2*(lval[2,5])/250^((1/3))
binwidth

## [1] 0.4840142

freeman_width[3]=binwidth
min(y) + 6*freeman_width[3]

## [1] 0.3147549

min(y) + 8*freeman_width[3]

## [1] 1.282783

min(y) + 9*freeman_width[3]

## [1] 1.766798

min(y) + 12*freeman_width[3]

## [1] 3.21884

freeman_bins[3] = 12

The bins start neatly at exactly 0. The bin width is 0.2. Fewer bins are empty.

y=rnorm(1000)
slider.histogram(y)

## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio

hist(y, breaks = 37)

number_bins[4]=37
bin_widths[4]=.2
lval(y)

	depth	lo	hi	mids	spreads
M	500.5	0.0133952	0.0133952	0.0133952	0.000000
H	250.5	-0.6901463	0.6402821	-0.0249321	1.330428
E	125.5	-1.1203163	1.1067599	-0.0067782	2.227076
D	63.0	-1.5242109	1.5748041	0.0252966	3.099015
C	32.0	-1.8165123	1.7810568	-0.0177278	3.597569
B	16.5	-2.0133688	2.0527098	0.0196705	4.066079
A	8.5	-2.3028392	2.3801847	0.0386728	4.683024
Z	4.5	-2.5216192	2.8241555	0.1512681	5.345775
Y	2.5	-2.7781253	3.2144109	0.2181428	5.992536
X	1.0	-3.3748261	3.4240355	0.0246047	6.798862

lval = lval(y)
binwidth=2*(lval[2,5])/1000^((1/3))
binwidth

## [1] 0.2660857

freeman_width[4]=binwidth
min(y) + 3*freeman_width[4]

## [1] -2.576569

min(y) + 8*freeman_width[4]

## [1] -1.246141

min(y) + 9*freeman_width[4]

## [1] -0.980055

min(y) + 18*freeman_width[4]

## [1] 1.414716

min(y) + 25*freeman_width[4]

## [1] 3.277316

min(y) + 26*freeman_width[4]

## [1] 3.543401

freeman_bins[4] = 26

The bins start neatly at exactly 0 and the bin width is 0.2, which is a simple decimal. Overall, we can observe that as the sample size increases, the Freeman bin width comes down (from about 0.97 at n = 30 to about 0.27 at n = 1000) and the number of Freeman bins goes up (from 4 to 26).

We can also compare the number of bins we judged best with the number given by the formula, and discuss the differences that we see.

table=data.frame(number_bins, bin_widths, freeman_width, freeman_bins)
table

number_bins	bin_widths	freeman_width	freeman_bins
30	0.1	0.9669953	4
22	0.1	0.6327649	9
30	0.2	0.4840142	12
37	0.2	0.2660857	26

The Freeman rule gives far fewer bins than we chose by eye, and its bin widths are much larger than ours. The rule is designed to pick a bin width that is close to optimal for estimating the underlying density, which is why it uses far fewer bins than we selected. Because it depends only on the spread of the data (the IQR) and the sample size, it is a more reliable guide than visual judgment alone.
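The rule-based columns of this comparison can also be produced in a single loop over the sample sizes. This is a minimal sketch in base R; the seed and the names rows and rule_table are assumptions, and because the samples are new the numbers will differ from those in the table above.

```r
# Sketch only: rule-based bin width and bin count for each sample size.
set.seed(1)                                   # arbitrary seed, an assumption
n <- c(30, 100, 250, 1000)

rows <- lapply(n, function(size) {
  y      <- rnorm(size)                       # fresh normal sample
  hinges <- fivenum(y)[c(2, 4)]               # lower and upper hinge
  width  <- 2 * diff(hinges) / size^(1/3)     # 2 * fourth-spread / n^(1/3)
  data.frame(n = size,
             width = width,
             bins = ceiling(diff(range(y)) / width))
})

rule_table <- do.call(rbind, rows)            # one row per sample size
rule_table
```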

Let us examine this further using “faithful”, a built-in dataset in R. You can learn more about the dataset from its R help page (?faithful).

faithful$waiting

##   [1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78
##  [24] 69 74 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83
##  [47] 64 53 82 59 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65
##  [70] 73 82 56 79 71 62 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90
##  [93] 50 78 63 72 84 75 51 82 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59
## [116] 81 50 85 59 87 53 69 77 56 88 81 45 82 55 90 45 83 56 89 46 82 51 86
## [139] 53 79 81 60 82 77 76 59 80 49 96 53 77 77 65 81 71 70 81 93 53 89 45
## [162] 86 58 78 66 76 63 88 52 93 49 57 77 68 81 81 73 50 85 74 55 77 83 83
## [185] 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78 60 82 91 53 78 46 77
## [208] 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78 79 78 78 70 79
## [231] 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74 54 83 73
## [254] 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74

slider.histogram(faithful$waiting)

## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio

lval(faithful$waiting)

	depth	lo	hi	mids	spreads
M	136.5	76	76	76.0	0
H	68.5	58	82	70.0	24
E	34.5	52	85	68.5	33
D	17.5	49	88	68.5	39
C	9.0	46	90	68.0	44
B	5.0	46	92	69.0	46
A	3.0	45	93	69.0	48
Z	2.0	45	94	69.5	49
Y	1.0	43	96	69.5	53

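# Freeman bin width: 2 * fourth-spread / n^(1/3), with fourth-spread 24 and n = 272
binwidth=2*(24/272^(1/3))
# Nine break points starting at the minimum give eight bins that cover the maximum
hist(faithful$waiting, breaks = min(faithful$waiting) + (0:8)*binwidth)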

What we have is a histogram using the Freeman bin width. It starts with a bin at the minimum and ends with the bin that contains the maximum.

The bin width obtained using the formula is 7.4082.
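Base R also ships this rule, so the hand computation can be cross-checked. The sketch below uses nclass.FD and the breaks = "FD" option of hist; the variable names are our own. Since hist treats the resulting number of classes as a suggestion and rounds the break points to "pretty" values, the histogram it draws need not have exactly the number of bins that nclass.FD reports.

```r
# Sketch only: cross-check the hand-computed width with base R.
x <- faithful$waiting

manual_width <- 2 * IQR(x) / length(x)^(1/3)  # same formula as above
k <- nclass.FD(x)                             # number of classes from the rule

h <- hist(x, breaks = "FD", plot = FALSE)     # let hist apply the rule itself

manual_width
k
length(h$counts)                              # bins hist actually used
```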

Again, the bin width obtained here (7.40) is bigger than the widths we got in the previous cases. This means the class intervals are wider. We prefer narrower bins that spread the data out enough to show its structure clearly. In this case there are only a few bins, so we cannot say much about the structure. With a sample size of 272 we could have had a wider spread of bins that reveals some structure in the data.

Suresh Gajapathy

September 9, 2016