1 Introduction

In this vignette I explore the use of charts to display basic descriptive statistics in order to understand the centre and spread of a variable in a dataset. First I use the summary() function in R, then plot a histogram to show the distribution of the variable, and finally use a boxplot to examine the distribution further and highlight key data points: the mean, the median, the 1st and 3rd quartiles, and outliers.

1.1 The gapminder dataset

For this example, I will use the gapminder dataset, an excerpt of the data found on the Gapminder website, published through the R package gapminder. To get the dataset, install the package with install.packages("gapminder") and load it in your code with library(gapminder).

We will use the year, continent and lifeExp variables to demonstrate the concepts covered in this vignette.

library(gapminder)
library(tidyverse)  # loads dplyr (%>%, filter), ggplot2 and tidyr (gather)

df <- gapminder

knitr::kable(df %>% head(10), caption = "Table: Gapminder dataset (Sample)")
Table: Gapminder dataset (Sample)
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134
Afghanistan Asia 1982 39.854 12881816 978.0114
Afghanistan Asia 1987 40.822 13867957 852.3959
Afghanistan Asia 1992 41.674 16317921 649.3414
Afghanistan Asia 1997 41.763 22227415 635.3414

1.2 Basic summary

A descriptive summary of a dataset, or of any single variable, can be produced with the summary() function. Called on a numeric variable, it shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum.

summary(df$lifeExp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.60   48.20   60.71   59.47   70.85   82.60
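
The same figures can also be computed one at a time, which is handy when a value is needed later in the analysis. The sketch below uses only base R functions; the variable names (life_exp, life_iqr and so on) are chosen for illustration and do not appear elsewhere in this vignette.

```r
library(gapminder)

life_exp <- gapminder$lifeExp

# Location: mean and median
life_mean   <- mean(life_exp)
life_median <- median(life_exp)

# Spread: quartiles, inter-quartile range, standard deviation and range
life_quartiles <- quantile(life_exp, probs = c(0.25, 0.75))
life_iqr       <- IQR(life_exp)
life_sd        <- sd(life_exp)
life_range     <- range(life_exp)
```

The inter-quartile range is simply the distance between the two quartiles returned by quantile(), so life_iqr equals the difference of the values in life_quartiles.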

This information is useful, but you often need a more detailed look at the spread and distribution of the data. Histograms and boxplots can help you understand the data better.

1.3 Distribution using histograms

Histograms show how often the values of a variable fall into each of a series of intervals (bins).

The histogram below of life expectancy in 1952 shows that the distribution is bimodal, meaning it has two peaks: one at around 40 years of age and one at around 65 years of age.

df %>% filter(year == 1952) %>%
  ggplot(aes(x=lifeExp)) + 
  geom_histogram(binwidth = 2) + 
  theme_light() +
  labs(title = "Life Expectancy", 
       subtitle = "year 1952",
       y = "Count", 
       x = "Life Expectancy")
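
To look at the peaks numerically rather than by eye, the bin counts can be tabulated directly. This is a minimal sketch using base R's hist() with plot = FALSE; the bin edges here start at an even number of years and may not line up exactly with the bins that geom_histogram() draws, and the names life_1952, breaks and bin_counts are illustrative.

```r
library(gapminder)
library(dplyr)

life_1952 <- gapminder %>%
  filter(year == 1952) %>%
  pull(lifeExp)

# Bin edges every 2 years, wide enough to cover every observation
breaks <- seq(floor(min(life_1952) / 2) * 2,
              ceiling(max(life_1952) / 2) * 2,
              by = 2)

# Compute the histogram without drawing it
bins <- hist(life_1952, breaks = breaks, plot = FALSE)

# One row per bin: midpoint and number of countries in the bin
bin_counts <- data.frame(mid = bins$mids, count = bins$counts)

# The fullest bins first
head(bin_counts[order(-bin_counts$count), ], 5)
```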

Faceting the chart by continent reveals more detail about the distribution: in 1952, life expectancy in African countries tended to be lower than in the rest of the world.

df %>% filter(year == 1952) %>%
  ggplot(aes(x=lifeExp)) + 
  geom_histogram(binwidth = 2) + 
  theme_light() +
  labs(title = "Life Expectancy", 
       subtitle = "year 1952",
       caption = "... by continent",
       y = "Count", 
       x = "Life Expectancy") + 
  facet_wrap(~ continent)

2 Boxplots

The boxplot is a good chart for showing the distribution of a variable. It displays key summary statistics in a compact form: the median, the 1st and 3rd quartiles, two whiskers, and outlier points.

df %>% filter(year == 1952, continent=='Europe') %>%
  ggplot(aes(y=lifeExp)) + 
  geom_boxplot() + 
  theme_light() +
  labs(title = "Life Expectancy", 
       subtitle = "year 1952",
       caption= "... for Europe",
       y = "Life Expectancy", 
       x = "") 

Note that the continuous variable is specified as the y aesthetic in the ggplot() function call: ggplot(aes(y=lifeExp)).

2.1 Interpreting the boxplot

The black line in the middle of the box represents the median of the variable. The lower and upper edges of the box are the 1st and 3rd quartiles (the 25th and 75th percentiles), so the box visually shows the range of the middle 50% of the values.

Boxplots also display two whiskers that extend from the lower and upper edges of the box to the smallest and largest non-outlier data points. Leaving outliers aside, the whiskers show where the lowest and highest 25% of the data fall.

The upper whisker extends from the upper edge of the box to the largest value no further than 1.5 * IQR from that edge. IQR is the inter-quartile range, the distance between the first and third quartiles. The lower whisker extends from the lower edge of the box to the smallest value no further than 1.5 * IQR from that edge.

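The skewness of the data also shows in a boxplot. A distribution is skewed left if the observations are concentrated in the upper part of the box, so that the median sits closer to the upper edge and the lower tail is longer; it is skewed right if they are concentrated in the lower part of the box. In the above example, the median sits closer to the upper edge of the box and the outlier lies below the lower whisker, so the data is skewed left.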
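
A quick numeric check can back up what the box suggests. As a rule of thumb, a mean below the median together with a longer lower half of the box points to a left skew, and the reverse points to a right skew. The sketch below collects those figures for Europe in 1952; the names europe_1952 and skew_check are illustrative.

```r
library(gapminder)
library(dplyr)

europe_1952 <- gapminder %>%
  filter(year == 1952, continent == "Europe") %>%
  pull(lifeExp)

# Quartiles and median as drawn by the box
q <- quantile(europe_1952, probs = c(0.25, 0.5, 0.75))

skew_check <- c(
  mean            = mean(europe_1952),
  median          = unname(q[2]),
  median_minus_q1 = unname(q[2] - q[1]),  # height of the lower half of the box
  q3_minus_median = unname(q[3] - q[2])   # height of the upper half of the box
)
skew_check
```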

Data points that fall beyond the whiskers, that is, more than 1.5 * IQR from the edges of the box, are known as outliers. All outlier data points are plotted individually on the chart. In this example, there is one outlier in the dataset.
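
The whisker and outlier rules can be reproduced by hand. The following is a sketch under the assumption that the quartiles are computed with quantile() and its default settings; the names lower_fence, upper_fence, inside and outliers are introduced here for illustration only.

```r
library(gapminder)
library(dplyr)

europe_1952 <- gapminder %>%
  filter(year == 1952, continent == "Europe") %>%
  pull(lifeExp)

# Edges of the box and the inter-quartile range
q1  <- unname(quantile(europe_1952, 0.25))
q3  <- unname(quantile(europe_1952, 0.75))
iqr <- q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
inside        <- europe_1952[europe_1952 >= lower_fence & europe_1952 <= upper_fence]
lower_whisker <- min(inside)
upper_whisker <- max(inside)

# Everything outside the limits is drawn as an individual point
outliers <- europe_1952[europe_1952 < lower_fence | europe_1952 > upper_fence]
```

Note that the whiskers stop at actual observations, so they are usually shorter than the full 1.5 * IQR distance.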

2.2 Compare the distribution for multiple categorical groups

In the histogram example, we examined the distribution of life expectancy by continent using the facet_wrap() function to facet by continent.

In this example, we show a boxplot for each continent by using the x aesthetic, ggplot(aes(y=lifeExp, x=continent)), and split each continent by year with the fill aesthetic.

df %>% filter(year == 1952 | year == 2007) %>%
  ggplot(aes(y=lifeExp, x=continent, fill=as.factor(year))) + 
  geom_boxplot() + 
  theme_light() +
  labs(title = "Change in Life Expectancy", 
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy", 
       x = "Continent", 
       fill = "Year"
       ) 

We can clearly see that life expectancy improved over the years in all continents, with Africa still having the lowest life expectancy in 2007.
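
The comparison can also be put into numbers. This sketch computes the median life expectancy per continent for the two years and the change between them; it assumes dplyr 1.0 or later (for the .groups argument) and tidyr 1.0 or later (for pivot_wider()), and the name median_change is illustrative.

```r
library(gapminder)
library(dplyr)
library(tidyr)

median_change <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(median_lifeExp = median(lifeExp), .groups = "drop") %>%
  # One row per continent, one column per year
  pivot_wider(names_from = year, values_from = median_lifeExp,
              names_prefix = "median_") %>%
  mutate(change = median_2007 - median_1952) %>%
  # Lowest 2007 median first
  arrange(median_2007)

median_change
```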

3 Showing additional data labels

The ggplot2 geom_boxplot() function marks only the median of each group. You can, however, add a point that represents the mean: first calculate the mean, then use the geom_point() function to overlay it on the plot.

3.1 Summary data labels

Because the boxplot summarises the data by year and continent, the values for the labels dataset are computed for each combination of year and continent. The code that prepares this summary dataset uses the group_by() function before piping the data to the summarise() function, then gather() to reshape the result into one row per label.

df_labels <- df %>% filter(year == 1952 | year == 2007) %>% 
  group_by(year, continent) %>% 
  summarise(
    mean = mean(lifeExp), 
    median = median(lifeExp), 
    min = min(lifeExp), 
    max = max(lifeExp), 
    quantile_25 = quantile(lifeExp, .25),
    quantile_75 = quantile(lifeExp, .75)
  ) %>% gather("label", "value", mean, median, min, max, quantile_25, quantile_75)

This is what the labels dataset looks like:

Summary labels for Africa 1952
year continent label value
1952 Africa mean 39.13550
1952 Africa median 38.83300
1952 Africa min 30.00000
1952 Africa max 52.72400
1952 Africa quantile_25 35.81175
1952 Africa quantile_75 42.11775
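
The tidyr documentation marks gather() as superseded by pivot_longer(). As an alternative sketch, the same labels dataset can be built as follows, assuming tidyr 1.0 or later and dplyr 1.0 or later; the name df_labels_long is illustrative, and the rows come out in a different order than with gather().

```r
library(gapminder)
library(dplyr)
library(tidyr)

df_labels_long <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(year, continent) %>%
  summarise(
    mean = mean(lifeExp),
    median = median(lifeExp),
    min = min(lifeExp),
    max = max(lifeExp),
    quantile_25 = quantile(lifeExp, 0.25),
    quantile_75 = quantile(lifeExp, 0.75),
    .groups = "drop"
  ) %>%
  # Every column except the grouping keys becomes a label/value pair
  pivot_longer(cols = -c(year, continent),
               names_to = "label", values_to = "value")
```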

3.2 Add a mean data point to the chart

To add a mean data point, geom_point() is used to plot the mean value from the labels dataset: geom_point(data = ..., aes(y = value, x = continent), shape = 23). An additional call to geom_text() then adds a label above each mean data point on the chart.

df %>% filter(year == 1952 | year == 2007) %>%
  ggplot(aes(y = lifeExp, x = continent, fill = as.factor(year))) +
  geom_boxplot() +
  theme_light() +
  labs(title = "Change in Life Expectancy",
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy",
       x = "Continent",
       fill = "Year") +
  # Mean markers; dodge them so each one sits on the box for its own year
  geom_point(data = df_labels %>% filter(label == "mean"),
             aes(y = value, x = continent, group = as.factor(year)),
             shape = 23,
             position = position_dodge(width = 0.75)) +
  # Text labels just above each mean marker
  geom_text(data = df_labels %>% filter(label == "mean"),
            aes(label = paste("mean:", format(value, digits = 2, nsmall = 0)),
                y = value,
                x = continent,
                group = as.factor(year)),
            position = position_dodge(width = 0.75),
            vjust = -.5)

A similar approach can be used to overlay other data points from the labels dataset prepared earlier.
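
As one possible illustration, the sketch below labels the maximum of each box instead of the mean. To stay self-contained it builds a small helper dataset, df_max (a hypothetical name), rather than reusing the labels dataset; the dodge width of 0.75 is an assumption chosen to line the labels up with the default dodged boxplots.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

years <- c(1952, 2007)

# Hypothetical helper data: highest life expectancy per year and continent
df_max <- gapminder %>%
  filter(year %in% years) %>%
  group_by(year, continent) %>%
  summarise(value = max(lifeExp), .groups = "drop")

p <- gapminder %>%
  filter(year %in% years) %>%
  ggplot(aes(y = lifeExp, x = continent, fill = as.factor(year))) +
  geom_boxplot() +
  theme_light() +
  # One label per box, placed just above the maximum
  geom_text(data = df_max,
            aes(y = value, label = round(value, 1), group = as.factor(year)),
            position = position_dodge(width = 0.75),
            vjust = -0.5, size = 3) +
  labs(title = "Change in Life Expectancy",
       subtitle = "year 1952 and 2007, maximum labelled",
       y = "Life Expectancy",
       x = "Continent",
       fill = "Year")

p
```

Swapping max() for min(), median() or quantile() in the summarise() call gives the corresponding label for the other statistics.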