Exploring Midwest Demographic Distributions

Background

A key task in assessing a data set is to understand how the variables are distributed. Visualizing the distribution is often the first step, and a common way to do this is via a histogram.

A traditional (regular) histogram uses bins of equal width. One may also construct a histogram using bins of equal area, in which each bin contains approximately the same number of observations.
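As a minimal sketch of the difference, the base R snippet below computes both kinds of breaks for a synthetic skewed sample. The data, seed, bin count, and variable names are illustrative assumptions and are not part of the county analysis.

```r
# Synthetic, illustrative data (not the county data used later)
set.seed(1)
x <- rexp(500)   # right-skewed sample
nbins <- 10      # arbitrary choice for illustration

# Equal-width breaks: evenly spaced across the range of the data
width_breaks <- seq(min(x), max(x), length.out = nbins + 1)

# Equal-area breaks: sample quantiles, so each bin holds about n / nbins points
area_breaks <- quantile(x, probs = seq(0, 1, length.out = nbins + 1), names = FALSE)

# Count the observations falling in each bin, without plotting
width_counts <- hist(x, breaks = width_breaks, plot = FALSE)$counts
area_counts  <- hist(x, breaks = area_breaks,  plot = FALSE)$counts
```

Comparing `width_counts` with `area_counts` shows how unevenly equal-width bins fill for skewed data, while the quantile-based bins hold nearly the same number of points each.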

Equal-width histograms tend to show outliers well, but they oversmooth in high-density regions and respond poorly to spikes in the data. Equal-area histograms handle spikes well but oversmooth in the tails. The diagonally cut histogram (dhist) proposed by Denby and Mallows strikes a balance between these two approaches and “preserves the desirable features of both the equal width hist and the equal area hist. It will show tall narrow bins like the [equal area] hist when there are spikes in the data and will show isolated outliers just like the usual histogram” (Denby and Mallows, 2009).
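One rough way to picture the compromise is to shift each sorted value by an amount proportional to its rank and then cut the shifted values into equal-width bins: a tiny shift gives nearly equal-width bins, and a huge shift gives nearly equal-count bins. The sketch below illustrates that idea in base R. It is a simplified illustration under that assumption, not the dhist implementation; the function name `diag_cut_breaks`, the tilt parameter `a`, and its default value are hypothetical, and spikes get no special treatment here.

```r
# Simplified, illustrative sketch of a "diagonal cut"; NOT the dhist() code.
# diag_cut_breaks, a, and the default for a are hypothetical choices.
diag_cut_breaks <- function(x, nbins = 10, a = 5 * IQR(x)) {
  stopifnot(a > 0)
  x <- sort(x[!is.na(x)])
  n <- length(x)
  # Shift each sorted value by a times its empirical CDF value (rank / n)
  y <- x + a * (seq_len(n) / n)
  # Equal-width breaks on the shifted scale
  ybr <- seq(x[1], x[n] + a, length.out = nbins + 1)
  # Map the breaks back to the original scale by linear interpolation
  xbr <- approx(c(x[1], y), c(x[1], x), xout = ybr, rule = 2)$y
  # Pin the end points to the data range and drop any duplicated breaks
  xbr[c(1, length(xbr))] <- c(x[1], x[n])
  unique(xbr)
}

# Synthetic, illustrative data
set.seed(1)
x <- rexp(500)

near_width <- diag_cut_breaks(x, a = 1e-9)  # tiny tilt: close to equal-width
near_area  <- diag_cut_breaks(x, a = 1e9)   # huge tilt: close to equal-count
mixed      <- diag_cut_breaks(x)            # somewhere in between
```

The resulting vectors can be passed as `breaks` to `hist()` or `geom_histogram()` to compare the three shapes side by side.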

Let’s demonstrate this by using both regular (equal-width) and dhist histograms to examine the distribution of the percentage of adult poverty in non-metro Wisconsin counties:

Regular (equal-width) Histograms

Histograms are most commonly built using equal-width bins; these are called regular or equal-width histograms:

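library(ggplot2)
library(histogram)  # assumption: histogram() below comes from this package

# Compute the bin breaks without plotting; demo (a data frame) and
# sel_stat (a column name) are assumed to be defined earlier
hbreaks <- histogram(demo[[sel_stat]], plot = FALSE, verbose = FALSE)$breaks

# Draw the histogram using those breaks
g0 <- ggplot(demo, aes_string(x = sel_stat))
g1 <- g0 + geom_histogram(breaks = hbreaks, fill = "white", colour = "black", position = "identity")
g2 <- g1 + labs(title = paste0("Regular Histogram: ", sel_stat),
                y = "Number of counties", x = "Percentage of demographic")
g2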

Note that with equal-width bins the resulting histogram is quite coarse.

Diagonally cut (dhist) Histograms

Diagonally cut histograms are built using bins of approximately equal numbers of data points, with special handling for spikes in the data:

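# Compute the diagonally cut bin breaks without plotting; dhist() is assumed
# to be loaded already (its source is not stated here)
dbreaks <- dhist(demo[[sel_stat]], plot = FALSE)$xbr

# Draw the histogram using the variable-width breaks
g0 <- ggplot(demo, aes_string(x = sel_stat))
g1 <- g0 + geom_histogram(breaks = dbreaks, fill = "white", colour = "black", position = "identity")
g2 <- g1 + labs(title = paste0("Diagonally Cut Histogram: ", sel_stat),
                y = "Number of counties", x = "Percentage of demographic")
g2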

By allowing the bin widths to vary, we can see the spikes in the data more clearly.

Exploring more

Interactive App

You can explore more online with this easy-to-use Shiny app.

References

L. Denby and C. Mallows. Variations on the histogram. Journal of Computational and Graphical Statistics, 18(1):21-31, 2009. URL http://pubs.amstat.org/doi/abs/10.1198/jcgs.2009.0002.

See Also PDF