Kernel Density Estimate

library(tidyverse)

For a video walkthrough of what follows see https://www.youtube.com/watch?v=ZcVhekkzUAs.

Nick Strayer argued that the kernel density plot given by geom_density() may be the best way to display the distribution of a single quantitative variable. It avoids the problem of distortion due to binning, since every point in the observed data plays a clearly defined role.

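Each point has a normal curve plotted at its exact value. The standard deviation of the normal curve is controlled by the argument bw. A larger value of bw yields a larger standard deviation. If bw is not supplied, geom_density() does not fall back to a fixed value. Its default is the "nrd0" rule of thumb (see stats::bw.nrd0()), which chooses a bandwidth from the spread of the data and the number of observations.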
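To see what bandwidth is chosen when bw is left out, here is a minimal sketch using base R. It assumes the same four illustrative values used later in this post, and it calls stats::density() directly on the understanding that geom_density() hands its bw argument to that function.

```r
# Illustrative values only (the same small artificial data set used below)
x <- c(1, 1.5, 4, 15)

# Rule-of-thumb bandwidth that the "nrd0" setting selects
bw_default <- stats::bw.nrd0(x)
bw_default

# density() reports the bandwidth it actually used in its bw component
d <- stats::density(x)
all.equal(d$bw, bw_default)
```

Passing a number such as bw = 2 overrides this choice and sets the kernel's standard deviation directly.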

The kernel density plot is constructed as the vertical sum of all of these individual normal curves, divided by the number of observations so that the total area under the curve is 1.0. Larger values of bw give each point more influence farther away from the actual data value. This has the effect of smoothing out the curve.
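The construction can be checked by hand. The sketch below is not how geom_density() is implemented internally (density() uses a faster binned approximation); it just follows the description above: one normal curve per observation, summed and divided by the number of observations. The data values, the bandwidth of 2, and the names grid, kernels, and kde_manual are assumptions made for the illustration.

```r
x  <- c(1, 1.5, 4, 15)   # illustrative data
bw <- 2                  # standard deviation of each normal curve

# Evaluation grid extending 4 standard deviations past the data on each side
grid <- seq(min(x) - 4 * bw, max(x) + 4 * bw, length.out = 2001)

# One column per observation: a normal curve centred at that observation
kernels <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = bw))

# Vertical sum of the curves, divided by n so the total area is 1
kde_manual <- rowSums(kernels) / length(x)

# Trapezoid-rule estimate of the area under the summed curve
step <- grid[2] - grid[1]
area <- sum((kde_manual[-1] + kde_manual[-length(kde_manual)]) / 2) * step
area
```

The area comes out very slightly below 1 only because the grid cuts off the far tails of the outermost curves.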

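If the number of observations is not too large, adding geom_rug() marks the location of every actual data value along the axis. As the number of observations increases, decreasing the value of alpha makes the marks more transparent, so that overlapping marks show up as darker regions instead of a solid band.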

I will illustrate the process with a small artificial set of data.

# tibble() replaces the deprecated data_frame()
df <- tibble(x = c(1, 1.5, 4, 15))
df %>% ggplot(aes(x = x)) + geom_density(bw = 2) + geom_rug(color = "red")
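To see the smoothing effect of bw on this small data set, the following sketch overlays three bandwidths on one plot. The values 0.5, 2, and 5 and the colours are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# Same four illustrative values as above
df_bw <- data.frame(x = c(1, 1.5, 4, 15))

p <- ggplot(df_bw, aes(x = x)) +
  geom_density(bw = 0.5, colour = "steelblue") +   # narrow curves: bumpy
  geom_density(bw = 2,   colour = "black") +       # the value used above
  geom_density(bw = 5,   colour = "darkorange") +  # wide curves: very smooth
  geom_rug(colour = "red")
p
```

With the smallest bandwidth, the isolated point at 15 gets its own distinct bump; with the largest, it blends into one broad curve.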

Now let’s put this to use with the diamonds dataset. I’ll take a random sample of 5,000 diamonds and look at the distribution of prices within each cut.

diamonds %>%
  sample_n(5000) %>%
  ggplot(aes(x = price)) +
  geom_density() +
  geom_rug(color = "red", alpha = 0.05) +
  facet_wrap(~cut)