Exploratory Data Analysis/Binning/Activity1A
The quality of a histogram (as with a stem plot) depends heavily on the choice of bin width. In this part of the activity, we will use a “slider” function to adjust the number of bins in a histogram, experiment with choosing good bin widths for some simulated data, and then suggest a rule for determining the bin width in a histogram.
knitr::opts_chunk$set(error = TRUE)
set.seed(886784)
library(LearnEDA)
## Loading required package: vcd
## Loading required package: grid
## Loading required package: manipulate
library(vcd)
library(grid)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(printr)
library(manipulate)
y=rnorm(50)
slider.histogram(y)
## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio
The histogram with 9 bins looks okay: it has no empty bins, but it fails to show the spread well. We do not want to choose too few bins, and at the same time not too many.
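Because slider.histogram depends on the manipulate package, it only runs inside RStudio, which is why the call above fails. A minimal non-interactive sketch of the same experiment, assuming base R graphics only, draws one histogram per candidate number of bins. The helper name hist_grid and the candidate bin counts are illustrative assumptions and are not part of LearnEDA.

```r
# Hypothetical helper (not from LearnEDA): draw one histogram per candidate
# number of bins so the choices can be compared side by side.
hist_grid <- function(y, bin_counts = c(5, 9, 15, 25)) {
  old_par <- par(mfrow = c(2, ceiling(length(bin_counts) / 2)))
  on.exit(par(old_par))
  widths <- numeric(length(bin_counts))
  for (i in seq_along(bin_counts)) {
    k <- bin_counts[i]
    # Explicit, equally spaced breaks give exactly k bins over the data range.
    breaks <- seq(min(y), max(y), length.out = k + 1)
    hist(y, breaks = breaks, main = paste(k, "bins"), xlab = "y")
    widths[i] <- diff(breaks)[1]
  }
  # Return the bin width used in each panel.
  data.frame(bins = bin_counts, width = widths)
}

set.seed(886784)
y <- rnorm(50)
hist_grid(y)
```

Passing explicit breaks, rather than a single number, forces the requested number of bins, so each panel shows exactly the count in its title.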
Let us take samples of size 30, 100, 250 and 1000. For each of these four cases, we will use the ‘slider.histogram’ function to construct a histogram of a sample of normally distributed data, and adjust the number of classes to produce a suitable representation of the data.
We will also produce the graphs and compute the bin width for the best histogram.
n=c(30,100,250,1000)
number_bins=bin_widths=freeman_width=freeman_bins=c(0,0,0,0)
y=rnorm(30)
slider.histogram(y)
## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio
hist(y, breaks = 30)
number_bins[1]=30
bin_widths[1]=.1
lval(y)
| | depth | lo | hi | mids | spreads |
|---|---|---|---|---|---|
| M | 15.5 | -0.0762517 | -0.0762517 | -0.0762517 | 0.000000 |
| H | 8.0 | -0.7218991 | 0.7804406 | 0.0292707 | 1.502340 |
| E | 4.5 | -0.8359390 | 1.0796184 | 0.1218397 | 1.915557 |
| D | 2.5 | -0.9863350 | 1.4476985 | 0.2306817 | 2.434033 |
| C | 1.0 | -1.2614452 | 1.6609938 | 0.1997743 | 2.922439 |
lval = lval(y)
binwidth=2*(lval[2,5])/30^((1/3))
binwidth
## [1] 0.9669953
freeman_width[1]=binwidth
min(y) + 3*freeman_width[1]
## [1] 1.639541
min(y) + 4*freeman_width[1]
## [1] 2.606536
freeman_bins[1] = 4
What do we see? The histogram with 30 bins looks good. We would not want to go below 10 bins, as the structure of the data may not be clearly visible. The empty bins are also visible and shed light on the structure. The bins start cleanly at 0.00 and finish at 1.0 with no overlaps or gaps, which makes counting the bins easier. The bin widths are simple decimals, which makes the display easier to read. We can count the number of bins from the displayed histogram and read off the bin width.
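One caveat when counting bins from the plot: a single number passed as breaks is only a suggestion to base R's hist, which rounds the breakpoints to "pretty" values. A short sketch, assuming a fresh simulated sample with an arbitrary seed rather than the sample plotted above, reads the bin count and width from the returned object instead of counting by eye.

```r
set.seed(1)            # arbitrary seed, chosen only for reproducibility
y <- rnorm(30)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(y, breaks = 30, plot = FALSE)

actual_bins  <- length(h$counts)                    # bins actually used
actual_width <- unique(round(diff(h$breaks), 10))   # common bin width

actual_bins
actual_width
range(h$breaks)        # first and last breakpoints
```

The same three lines can be run on any of the samples in this activity to confirm the values recorded in number_bins and bin_widths.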
The lval function gives us the fourth-spread (IQR), listed in the H row of the spreads column. We calculated the Freeman width (the Freedman-Diaconis rule) as 2 * (fourth-spread) / n^(1/3).
Suppose the histogram starts at the smallest value. We keep adding the bin width until we reach a bin that contains the maximum, and then we stop. That gives us the number of Freeman bins.
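The two steps described above, the width from the fourth-spread and the count obtained by stepping from the minimum until the maximum is covered, can be sketched as a pair of small functions. This is a sketch under assumptions: the names fd_width and fd_bins are hypothetical, and base R's fivenum is used for the hinges in place of lval. The two should agree on the fourth-spread, but that is worth checking on your own data.

```r
# Freedman-Diaconis style bin width: 2 * fourth-spread / n^(1/3).
fd_width <- function(y) {
  hinges <- fivenum(y)[c(2, 4)]          # lower and upper hinges
  2 * diff(hinges) / length(y)^(1/3)
}

# Smallest k such that min(y) + k * width reaches max(y).
fd_bins <- function(y, width = fd_width(y)) {
  ceiling(diff(range(y)) / width)
}

set.seed(886784)
y <- rnorm(30)
w <- fd_width(y)
k <- fd_bins(y, w)
c(width = w, bins = k)
min(y) + k * w >= max(y)   # the last bin covers the maximum
```

Taking the ceiling of range / width replaces the trial-and-error calls of the form min(y) + 3*width, min(y) + 4*width, and so on.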
y=rnorm(100)
slider.histogram(y)
## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio
hist(y, breaks = 22)
number_bins[2]=22
bin_widths[2]=.1
lval(y)
| | depth | lo | hi | mids | spreads |
|---|---|---|---|---|---|
| M | 50.5 | -0.1868858 | -0.1868858 | -0.1868858 | 0.000000 |
| H | 25.5 | -0.8498964 | 0.6186209 | -0.1156377 | 1.468517 |
| E | 13.0 | -1.2274889 | 1.1560293 | -0.0357298 | 2.383518 |
| D | 7.0 | -1.8104089 | 1.3323456 | -0.2390316 | 3.142755 |
| C | 4.0 | -2.0940875 | 1.7915783 | -0.1512546 | 3.885666 |
| B | 2.5 | -2.1147769 | 2.6095016 | 0.2473623 | 4.724278 |
| A | 1.0 | -2.2534208 | 3.0634742 | 0.4050267 | 5.316895 |
lval = lval(y)
binwidth=2*(lval[2,5])/100^((1/3))
binwidth
## [1] 0.6327649
freeman_width[2]=binwidth
min(y) + 3*freeman_width[2]
## [1] -0.355126
min(y) + 8*freeman_width[2]
## [1] 2.808699
min(y) + 9*freeman_width[2]
## [1] 3.441463
freeman_bins[2] = 9
We see that the bins are ‘skinny’, with bins starting exactly at 0 and a bin width of 0.1, which is again a simple decimal. We can count the number of bins from the displayed histogram, and the bin width is easy to see. Every histogram here will be bell-shaped or close to it because the data are normal, but our goal is to observe the structure. Several bins are empty.
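Empty bins can be counted rather than spotted by eye. The sketch below uses a hypothetical helper, empty_bins, that places k equal-width bins across the range of the data and reports how many contain no observations. It draws its own sample of size 100, so the counts will not necessarily match the histogram discussed above.

```r
# Hypothetical helper: number of empty bins when k equal-width bins
# span the range of y.
empty_bins <- function(y, k) {
  breaks <- seq(min(y), max(y), length.out = k + 1)
  counts <- hist(y, breaks = breaks, plot = FALSE)$counts
  sum(counts == 0)
}

set.seed(886784)
y <- rnorm(100)
# Empty-bin counts for a range of candidate bin numbers.
sapply(c(5, 10, 22, 40, 80), function(k) empty_bins(y, k))
```

A rising count of empty bins as k grows is one signal that the bins have become too narrow for the sample size.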
y=rnorm(250)
slider.histogram(y)
## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio
hist(y, breaks = 30)
number_bins[3]=30
bin_widths[3]=.2
lval(y)
| | depth | lo | hi | mids | spreads |
|---|---|---|---|---|---|
| M | 125.5 | 0.1370814 | 0.1370814 | 0.1370814 | 0.000000 |
| H | 63.0 | -0.6131411 | 0.9114082 | 0.1491336 | 1.524549 |
| E | 32.0 | -0.9771524 | 1.3286121 | 0.1757298 | 2.305764 |
| D | 16.5 | -1.4198268 | 1.7569927 | 0.1685829 | 3.176820 |
| C | 8.5 | -1.7249883 | 1.9459183 | 0.1104650 | 3.670907 |
| B | 4.5 | -2.0687376 | 2.2396310 | 0.0854467 | 4.308369 |
| A | 2.5 | -2.4932656 | 2.3224460 | -0.0854098 | 4.815712 |
| Z | 1.0 | -2.5893304 | 3.0621857 | 0.2364276 | 5.651516 |
lval = lval(y)
binwidth=2*(lval[2,5])/250^((1/3))
binwidth
## [1] 0.4840142
freeman_width[3]=binwidth
min(y) + 6*freeman_width[3]
## [1] 0.3147549
min(y) + 8*freeman_width[3]
## [1] 1.282783
min(y) + 9*freeman_width[3]
## [1] 1.766798
min(y) + 12*freeman_width[3]
## [1] 3.21884
freeman_bins[3] = 12
The bins start exactly at 0 and the bin width is 0.2. Fewer bins are empty than before.
y=rnorm(1000)
slider.histogram(y)
## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio
hist(y, breaks = 37)
number_bins[4]=37
bin_widths[4]=.2
lval(y)
| | depth | lo | hi | mids | spreads |
|---|---|---|---|---|---|
| M | 500.5 | 0.0133952 | 0.0133952 | 0.0133952 | 0.000000 |
| H | 250.5 | -0.6901463 | 0.6402821 | -0.0249321 | 1.330428 |
| E | 125.5 | -1.1203163 | 1.1067599 | -0.0067782 | 2.227076 |
| D | 63.0 | -1.5242109 | 1.5748041 | 0.0252966 | 3.099015 |
| C | 32.0 | -1.8165123 | 1.7810568 | -0.0177278 | 3.597569 |
| B | 16.5 | -2.0133688 | 2.0527098 | 0.0196705 | 4.066079 |
| A | 8.5 | -2.3028392 | 2.3801847 | 0.0386728 | 4.683024 |
| Z | 4.5 | -2.5216192 | 2.8241555 | 0.1512681 | 5.345775 |
| Y | 2.5 | -2.7781253 | 3.2144109 | 0.2181428 | 5.992536 |
| X | 1.0 | -3.3748261 | 3.4240355 | 0.0246047 | 6.798862 |
lval = lval(y)
binwidth=2*(lval[2,5])/1000^((1/3))
binwidth
## [1] 0.2660857
freeman_width[4]=binwidth
min(y) + 3*freeman_width[4]
## [1] -2.576569
min(y) + 8*freeman_width[4]
## [1] -1.246141
min(y) + 9*freeman_width[4]
## [1] -0.980055
min(y) + 18*freeman_width[4]
## [1] 1.414716
min(y) + 25*freeman_width[4]
## [1] 3.277316
min(y) + 26*freeman_width[4]
## [1] 3.543401
freeman_bins[4] = 26
The bins start exactly at 0 and the bin width is 0.2, which is a simple decimal. Overall, we can observe that as the sample size increases, the Freeman bin width comes down and the number of Freeman bins goes up.
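For standard normal data the population fourth-spread is 2 * qnorm(0.75), about 1.349, so the width given by the formula should shrink roughly like 2.7 / n^(1/3). The sketch below, which assumes the sample fourth-spread equals its population value, tabulates this theoretical width for the four sample sizes.

```r
n <- c(30, 100, 250, 1000)

# Population interquartile range of a standard normal distribution.
iqr_normal <- qnorm(0.75) - qnorm(0.25)

# Width implied by 2 * IQR / n^(1/3) when the sample IQR equals
# the population value.
theoretical_width <- 2 * iqr_normal / n^(1/3)

data.frame(n = n, theoretical_width = round(theoretical_width, 3))
```

These come out at roughly 0.87, 0.58, 0.43 and 0.27, close to the sample-based widths of 0.97, 0.63, 0.48 and 0.27 computed in the four cases.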
We can also compare the number of bins we judged best with the number found using the formula, and discuss the differences that we see.
table=data.frame(number_bins, bin_widths, freeman_width, freeman_bins)
table
| number_bins | bin_widths | freeman_width | freeman_bins |
|---|---|---|---|
| 30 | 0.1 | 0.9669953 | 4 |
| 22 | 0.1 | 0.6327649 | 9 |
| 30 | 0.2 | 0.4840142 | 12 |
| 37 | 0.2 | 0.2660857 | 26 |
The Freeman bins are far fewer than the number of bins we came up with, and the Freeman widths are much larger than ours. The Freedman-Diaconis rule aims at a bin width that is close to optimal for estimating the underlying density, which is why it gives far fewer bins than we selected by eye. The width depends only on the fourth-spread and the sample size.
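Base R already ships bin-count rules that can serve as a cross-check on the comparison above. The sketch below draws new normal samples of the four sizes and asks the grDevices functions nclass.Sturges, nclass.scott and nclass.FD how many bins each would suggest. Two caveats: these are fresh draws, so the counts need not match the table, and nclass.FD uses IQR() rather than the hinge-based fourth-spread from lval, so small differences are expected.

```r
set.seed(886784)
n <- c(30, 100, 250, 1000)

# For each sample size, draw a normal sample and record the number of
# bins suggested by each built-in rule.
rules <- t(sapply(n, function(size) {
  y <- rnorm(size)
  c(n       = size,
    sturges = nclass.Sturges(y),
    scott   = nclass.scott(y),
    fd      = nclass.FD(y))
}))

as.data.frame(rules)
```

The same rules can be passed to hist by name, for example hist(y, breaks = "FD"), although hist still rounds the resulting breakpoints to pretty values.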
Let us examine this further using “faithful”, a built-in dataset in R. You can learn more about the dataset using the help options in R.
faithful$waiting
## [1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78
## [24] 69 74 83 55 76 78 79 73 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83
## [47] 64 53 82 59 75 90 54 80 54 83 71 64 77 81 59 84 48 82 60 92 78 78 65
## [70] 73 82 56 79 71 62 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90
## [93] 50 78 63 72 84 75 51 82 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59
## [116] 81 50 85 59 87 53 69 77 56 88 81 45 82 55 90 45 83 56 89 46 82 51 86
## [139] 53 79 81 60 82 77 76 59 80 49 96 53 77 77 65 81 71 70 81 93 53 89 45
## [162] 86 58 78 66 76 63 88 52 93 49 57 77 68 81 81 73 50 85 74 55 77 83 83
## [185] 51 78 84 46 83 55 81 57 76 84 77 81 87 77 51 78 60 82 91 53 78 46 77
## [208] 84 49 83 71 80 49 75 64 76 53 94 55 76 50 82 54 75 78 79 78 78 70 79
## [231] 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82 67 74 54 83 73
## [254] 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74
slider.histogram(faithful$waiting)
## Error in manipulate(plot.hist(num, y, var.name), num = manipulate::slider(1, : The manipulate package must be run from within RStudio
lval(faithful$waiting)
| | depth | lo | hi | mids | spreads |
|---|---|---|---|---|---|
| M | 136.5 | 76 | 76 | 76.0 | 0 |
| H | 68.5 | 58 | 82 | 70.0 | 24 |
| E | 34.5 | 52 | 85 | 68.5 | 33 |
| D | 17.5 | 49 | 88 | 68.5 | 39 |
| C | 9.0 | 46 | 90 | 68.0 | 44 |
| B | 5.0 | 46 | 92 | 69.0 | 46 |
| A | 3.0 | 45 | 93 | 69.0 | 48 |
| Z | 2.0 | 45 | 94 | 69.5 | 49 |
| Y | 1.0 | 43 | 96 | 69.5 | 53 |
binwidth=2*(24/272^(1/3))
hist(faithful$waiting, breaks = min(faithful$waiting) + (0:8) * binwidth)
What we have is a histogram using the Freeman bin width. It starts with a bin at the minimum and ends with the bin that contains the maximum.
The bin width obtained using the formula is 7.4082.
Again, as in the previous cases, the bin width obtained from the formula (7.40) is bigger than we would like. This means the class intervals are wide. We prefer smaller bins that show enough spread to see the structure of the data clearly. In this case there are only a few bins and we cannot comment clearly on the structure. With a sample size of 272 we could have had a broader spread that reveals some structure in the data.
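The number of breaks used in the hist call for faithful$waiting can be derived from the data instead of being typed in. A minimal sketch, assuming fivenum gives the same hinges as lval (58 and 82 here), builds the breaks and tabulates the counts per bin so the shape can be read without the plot.

```r
x <- faithful$waiting

# Width from the fourth-spread (hinges via fivenum) and the sample size.
width <- 2 * diff(fivenum(x)[c(2, 4)]) / length(x)^(1/3)

# Number of bins needed to step from the minimum until the maximum is covered.
k <- ceiling(diff(range(x)) / width)
breaks <- min(x) + (0:k) * width

# Tabulate the counts per bin without drawing the histogram.
h <- hist(x, breaks = breaks, plot = FALSE)
data.frame(lower = head(h$breaks, -1), upper = h$breaks[-1], count = h$counts)
```

Rerunning the last two lines with a smaller width of your own choosing is a quick way to test whether narrower bins reveal more structure in the waiting times.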