The purpose of drawing histograms, like that of all other statistical techniques, is to acquire information. Once we have the information, we frequently need to describe what we’ve learned to others. We describe the shape of histograms on the basis of the following characteristics.
Symmetry: A histogram is said to be symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size.
Skewness: A skewed histogram is one with a long tail extending to either the right or the left. The former is called positively skewed, and the latter negatively skewed. For example, incomes of employees in large firms tend to be positively skewed because there are many relatively low-paid workers and only a few well-paid executives. The time students take to write exams is frequently negatively skewed because few students hand in their exams early; most prefer to reread their papers and hand them in near the end of the scheduled test period. Skewness can be calculated from the following formula:
\[\text{Skewness}=\frac{\sum_{i=1}^{N}(x_i-\bar{x})^3}{(N-1)s^3}\]
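where \(x_i\) is the \(i\)-th observation, \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(N\) is the number of observations.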
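The formula translates directly into a few lines of R. The following is a minimal sketch rather than a reference implementation: the helper name `sample_skewness` is made up for illustration, and it assumes that s is the sample standard deviation as returned by `sd()` (which divides by N - 1).

```r
# Hypothetical helper: skewness as sum((x - mean)^3) / ((N - 1) * s^3).
# Assumption: s is the sample standard deviation computed by sd().
sample_skewness <- function(x) {
  x <- x[!is.na(x)]            # drop missing values
  n <- length(x)
  stopifnot(n >= 2)            # sd() needs at least two observations
  sum((x - mean(x))^3) / ((n - 1) * sd(x)^3)
}

sample_skewness(c(1, 2, 3, 4, 5))    # symmetric data: 0
sample_skewness(c(1, 1, 2, 2, 30))   # long right tail: positive
sample_skewness(c(-30, 2, 2, 1, 1))  # long left tail: negative
```

The sign is what matters when describing a histogram: a positive value points to a long right tail, a negative value to a long left tail.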
This post presents R code for creating and comparing different types of data distributions by plotting histograms. R code for Figure 1:
# Simulate data:
library(tidyverse)
set.seed(5)
n <- 10000
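# rexp() has a long right tail (positive skew); rbeta(n, 10, 1) has a long left tail (negative skew).
# The factor levels fix the group order used by the summaries and facets below.
tibble(Value = c(rexp(n, 0.1), rbeta(n, 10, 1), rnorm(n)),
       DataType = factor(rep(c("Positive Skew", "Negative Skew", "Normal Distribution"), each = n),
                         levels = c("Positive Skew", "Normal Distribution", "Negative Skew"))) -> df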
# Calculate median and mean by data group:
df %>%
  group_by(DataType) %>%
  summarise(Median = median(Value)) %>%
  ungroup() %>%
  mutate(Text = paste0("Median: ", round(Median, 2))) -> df_median
df %>%
  group_by(DataType) %>%
  summarise(Mean = mean(Value)) %>%
  ungroup() %>%
  mutate(Text = paste0("Mean: ", round(Mean, 2))) -> df_mean
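# Make a draft of the plot:
library(extrafont)
# "Ubuntu Condensed" must be installed and registered with extrafont
# (font_import() once, then loadfonts()); otherwise use any available font family.
my_font <- "Ubuntu Condensed"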
color_text <- c("#0098FF", "#8F1C3F")
theme_set(theme_bw())
df %>%
  ggplot(aes(x = Value)) +
  geom_density(fill = "red", color = "red", alpha = 0.1) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "blue", fill = "blue", alpha = 0.1) +
  facet_wrap(~ DataType, scales = "free", nrow = 1) +
  # linewidth replaces size for line geoms (ggplot2 >= 3.4.0)
  geom_vline(data = df_median, aes(xintercept = Median), color = color_text[1], linewidth = 1) +
  geom_vline(data = df_mean, aes(xintercept = Mean), color = color_text[2], linewidth = 1) +
  theme(text = element_text(family = my_font, size = 14)) +
  theme(plot.margin = unit(rep(0.7, 4), "cm")) +
  labs(x = NULL, y = NULL,
       title = "Figure 1: Shape of Histogram by Distribution",
       caption = "Source: Data Based on Simulation") -> plot_draft
# Show the draft:
plot_draft
# Add annotation (solution 1):
# Label positions are hand-picked for this simulation, one per panel in group order.
# Label colors match the vertical lines: color_text[1] for the median, color_text[2] for the mean.
y_for_mean <- c(0.06, 0.31, 6.3)
plot_draft +
  geom_label(data = df_mean %>% mutate(Value = c(60, 2.8, 0.55)),
             aes(label = Text, y = y_for_mean), family = my_font,
             size = 3.5, color = color_text[2]) +
  geom_label(data = df_median %>% mutate(Value = c(60, 2.8, 0.55)),
             aes(label = Text, y = 0.91*y_for_mean), family = my_font,
             size = 3.5, color = color_text[1])
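As a numerical cross-check on Figure 1, the skewness formula can be applied to each simulated sample. The sketch below is self-contained base R and re-simulates the three samples instead of reusing df; the helper `skew` and the objects `samples` and `check` are hypothetical names, and s is assumed to be the value returned by `sd()`. For these three distributions, the exponential sample should show positive skewness with the mean above the median, the beta sample the reverse, and the normal sample a value near zero.

```r
# Re-simulate the three samples (assumption: same seed and parameters as above).
set.seed(5)
n <- 10000
samples <- list(
  "Positive Skew"       = rexp(n, 0.1),
  "Negative Skew"       = rbeta(n, 10, 1),
  "Normal Distribution" = rnorm(n)
)

# Hypothetical helper implementing sum((x - mean)^3) / ((N - 1) * s^3).
skew <- function(x) sum((x - mean(x))^3) / ((length(x) - 1) * sd(x)^3)

# One row per sample: mean, median and skewness.
check <- data.frame(
  Mean     = sapply(samples, mean),
  Median   = sapply(samples, median),
  Skewness = sapply(samples, skew)
)
round(check, 2)
```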