1 Introduction
2 Boxplots
- 2.1 Interpreting the boxplot
- 2.2 Compare the distribution for multiple categorical groups
3 Showing additional data labels
- 3.1 Summary data labels
- 3.2 Add a mean data point to the chart

1 Introduction

In this vignette I will explore the use of charts to display basic descriptive statistics in order to understand the mean and spread of a variable in a dataset. First I will use the summary() function in R, then plot a histogram to show the distribution of the variable, and finally use a boxplot to examine the distribution further and highlight the key data points: the mean, the median, the 1st and 3rd quartiles, and outliers.

1.1 The gapminder dataset

For this example, I will be using the gapminder dataset, an excerpt of the data found on the Gapminder website, published through the R package gapminder. To get the dataset you must install the package with install.packages("gapminder") and load it in your code with library(gapminder).

We will use the year, continent and lifeExp variables to demonstrate the concepts covered in this vignette.

library(gapminder)
library(tidyverse)  # loads dplyr, ggplot2 and tidyr, all used below

df <- gapminder

knitr::kable(df %>% head(10), caption = "Table: Gapminder dataset (Sample)")

Table: Gapminder dataset (Sample)
country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1972	36.088	13079460	739.9811
Afghanistan	Asia	1977	38.438	14880372	786.1134
Afghanistan	Asia	1982	39.854	12881816	978.0114
Afghanistan	Asia	1987	40.822	13867957	852.3959
Afghanistan	Asia	1992	41.674	16317921	649.3414
Afghanistan	Asia	1997	41.763	22227415	635.3414

1.2 Basic summary

A descriptive summary of the dataset, or of any single variable, can be produced with the summary() function. Called on a numeric variable, it shows the range of the variable (min and max), the mean, the median, and the 1st and 3rd quartiles.

summary(df$lifeExp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.60   48.20   60.71   59.47   70.85   82.60
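
The individual figures that summary() reports can also be computed one at a time, which is handy when you only need one of them or want a measure of spread that summary() omits. The following is a minimal sketch, assuming the gapminder package is installed; the names life, quartiles and spread are illustrative only.

```r
library(gapminder)

life <- gapminder$lifeExp

# the same quartiles that summary() prints (R's default quantile method)
quartiles <- quantile(life, probs = c(0.25, 0.5, 0.75))

# two common measures of spread that summary() does not report
spread <- c(iqr = IQR(life), sd = sd(life))

quartiles
spread
```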

This information is useful, but you often need a more detailed look at the spread and distribution of the data. Histograms and boxplots can help you understand the data better.

1.3 Distribution using histograms

Histograms are useful charts for showing how frequently the values of a variable occur in a dataset.

The histogram below of life expectancy in 1952 shows that the distribution is bimodal, meaning it has two peaks: one around 40 years of age and one around 65 years of age.

df %>% filter(year == 1952) %>%
  ggplot(aes(x=lifeExp)) + 
  geom_histogram(binwidth = 2) + 
  theme_light() +
  labs(title = "Life Expectancy", 
       subtitle = "year 1952",
       y = "Count", 
       x = "Life Expectancy")

Faceting the chart by continent reveals the distribution in more detail: in 1952, life expectancy in Africa tended to be lower than in the rest of the world.

df %>% filter(year == 1952) %>%
ggplot(aes(x=lifeExp)) + 
  geom_histogram(binwidth = 2) + 
  theme_light() +
  labs(title = "Life Expectancy", 
       subtitle = "year 1952",
       caption = "... by continent",
       y = "Count", 
       x = "Life Expectancy") + 
  facet_wrap(~ continent)

2 Boxplots

The boxplot is a good chart for showing the distribution of a variable. It displays key summary statistics in a compact form: the median, the 1st and 3rd quartiles (the box), two whiskers, and outlier points.

df %>% filter(year == 1952, continent=='Europe') %>%
ggplot(aes(y=lifeExp)) + 
  geom_boxplot() + 
  theme_light() +
  labs(title = "Life Expectancy", 
       subtitle = "year 1952",
       caption= "... for Europe",
       y = "Life Expectancy", 
       x = "")

Note that the continuous variable is specified as the y aesthetic in the ggplot function call: ggplot(aes(y=lifeExp)).
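
If a horizontal boxplot reads better, the continuous variable can be mapped to x instead. This is only a sketch: it assumes ggplot2 version 3.3.0 or later, where geom_boxplot() infers the orientation from the aesthetics (on older versions coord_flip() would be needed), and the name p_horizontal is illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# same Europe 1952 subset as above, drawn horizontally
p_horizontal <- gapminder %>%
  filter(year == 1952, continent == "Europe") %>%
  ggplot(aes(x = lifeExp)) +
  geom_boxplot() +
  theme_light() +
  labs(title = "Life Expectancy",
       subtitle = "year 1952",
       caption = "... for Europe",
       x = "Life Expectancy",
       y = "")

p_horizontal
```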

2.1 Interpreting the boxplot

The black line in the middle of the box represents the median of the variable. The lower and upper edges of the box are the 1st and 3rd quartiles (the 25th and 75th percentiles), so the box visually shows the range of the middle 50% of the variable.

Boxplots also display two whiskers that extend from the lower and upper edges of the box to the smallest and largest non-outlier data points. When there are no outliers, the whiskers show where the lowest and highest 25% of the data fall.

The upper whisker extends from the upper edge of the box to the largest value no further than 1.5 * IQR from that edge. IQR is the inter-quartile range, the distance between the first and third quartiles. The lower whisker extends from the lower edge of the box to the smallest value no further than 1.5 * IQR from that edge.

The skewness of the data is also visible in a boxplot. A distribution is skewed left if the observations are concentrated in the upper part of the box (the median sits closer to the upper edge), and skewed right if they are concentrated in the lower part of the box. In the above example, the data is skewed left: the median lies closer to the upper edge of the box.
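
One rough way to check the direction of the skew numerically is to compare the two halves of the box, and the mean with the median. The sketch below assumes the gapminder package is installed and uses the same Europe 1952 subset as the chart; the names box_halves and centre are illustrative.

```r
library(gapminder)

x <- subset(gapminder, year == 1952 & continent == "Europe")$lifeExp
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)

# height of the box below and above the median line;
# a taller lower half suggests left skew, a taller upper half right skew
box_halves <- c(lower = q[2] - q[1], upper = q[3] - q[2])

# a mean below the median also points to left skew, and vice versa
centre <- c(mean = mean(x), median = median(x))

box_halves
centre
```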

Data points that fall beyond the whiskers, that is, more than 1.5 * IQR from the edges of the box, are known as outliers. All outlier data points are plotted individually on the chart. In this example, there is one outlier in the data.
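
The whisker and outlier rules described above can be reproduced by hand. This is a sketch under the assumption that the quartiles are computed with R's default quantile() method; the names lower_fence, upper_fence, whisker_low, whisker_high and outliers are illustrative and not part of ggplot2.

```r
library(gapminder)

# life expectancy for Europe in 1952, the same subset as the boxplot above
europe_1952 <- subset(gapminder, year == 1952 & continent == "Europe")
x <- europe_1952$lifeExp

q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# values beyond these fences are treated as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# the whiskers end at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# countries plotted as individual outlier points
outliers <- europe_1952[x < lower_fence | x > upper_fence,
                        c("country", "lifeExp")]

c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Listing the outliers by country, rather than only counting them, makes it easier to decide whether an unusual value is a data problem or a real feature of the data.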

2.2 Compare the distribution for multiple categorical groups

In the histogram example, we examined the distribution of life expectancy by continent using the facet_wrap() function to facet by continent.

In this example, we can show a boxplot for each continent by using the x aesthetic: ggplot(aes(y=lifeExp, x=continent)).

df %>% filter(year == 1952 | year == 2007) %>%
ggplot(aes(y=lifeExp, x=continent, fill=as.factor(year))) + 
  geom_boxplot() + 
  theme_light() +
  labs(title = "Change in Life Expectancy", 
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy", 
       x = "Continent", 
       fill = "Year"
       )

We can clearly see that life expectancy improved over the years in all continents, with Africa still having the lowest life expectancy in 2007.
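
The reading of the chart can be cross-checked against the numbers behind it. The sketch below computes the median life expectancy for each continent and year; it assumes dplyr 1.0.0 or later for the .groups argument, and the names medians and median_lifeExp are illustrative.

```r
library(gapminder)
library(dplyr)

medians <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(median_lifeExp = median(lifeExp), .groups = "drop") %>%
  # sort so that the lowest median within each year comes first
  arrange(year, median_lifeExp)

medians
```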

3 Showing additional data labels

The ggplot geom_boxplot() function only shows the median of the data. You can, however, add a point that represents the mean: first calculate the mean, then use the geom_point() function to overlay it on the plot.

3.1 Summary data labels

The boxplot summarises the data by year and continent, so the values for the labels dataset are computed for each combination of year and continent. The code that prepares the summary dataset uses the group_by() function before piping the data to the summarise() function, and then reshapes the result into one row per label with gather().

df_labels <- df %>% filter(year == 1952 | year == 2007) %>% 
  group_by(year, continent) %>% 
  summarise(
    mean = mean(lifeExp), 
    median = median(lifeExp), 
    min = min(lifeExp), 
    max = max(lifeExp), 
    quantile_25 = quantile(lifeExp, .25),
    quantile_75 = quantile(lifeExp, .75)
  ) %>% gather("label", "value", mean, median, min, max, quantile_25, quantile_75)

This is what the labels dataset looks like:

Summary labels for Africa 1952
year	continent	label	value
1952	Africa	mean	39.13550
1952	Africa	median	38.83300
1952	Africa	min	30.00000
1952	Africa	max	52.72400
1952	Africa	quantile_25	35.81175
1952	Africa	quantile_75	42.11775
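
In recent versions of tidyr, gather() is marked as superseded in favour of pivot_longer(). The following sketch shows the same reshaping step with pivot_longer(), assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later; only two of the summary values are computed to keep it short, and the name df_labels_long is illustrative.

```r
library(gapminder)
library(dplyr)
library(tidyr)

df_labels_long <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(year, continent) %>%
  summarise(
    mean = mean(lifeExp),
    median = median(lifeExp),
    .groups = "drop"
  ) %>%
  # every column except the grouping keys becomes a label/value pair
  pivot_longer(cols = -c(year, continent),
               names_to = "label",
               values_to = "value")

head(df_labels_long)
```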

3.2 Add a mean data point to the chart

To add a mean data point, geom_point() is used to plot the mean value from the labels dataset: geom_point(data = ..., aes(y = value, x = continent), shape = 23). An additional call to geom_text() then adds a label above each mean data point on the chart.

df %>% filter(year == 1952 | year == 2007) %>%
  ggplot(aes(y = lifeExp, x = continent, fill = as.factor(year))) +
  geom_boxplot() +
  theme_light() +
  labs(title = "Change in Life Expectancy",
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy",
       x = "Continent",
       fill = "Year") +
  # dodge the mean markers by year so that each one sits on its own box
  geom_point(data = df_labels %>% filter(label == "mean"),
             aes(y = value, x = continent, group = year),
             shape = 23,
             position = position_dodge(width = 0.75)) +
  # vjust is a fixed setting, so it belongs outside aes()
  geom_text(data = df_labels %>% filter(label == "mean"),
            aes(label = paste("mean:", format(value, digits = 2, nsmall = 0)),
                y = value,
                x = continent,
                group = year),
            vjust = -.5,
            position = position_dodge(width = 0.75))

A similar approach can be used to overlay other data points from the labels dataset that was prepared earlier.
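
As an alternative to preparing a labels dataset, ggplot2 can compute the mean while drawing the chart. This is a sketch rather than the approach used in this vignette: it assumes ggplot2 3.3.0 or later (where stat_summary() takes a fun argument), the dodge width of 0.75 is an assumption chosen to match the default spacing of dodged boxplots, and the name p_mean is illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

p_mean <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  ggplot(aes(y = lifeExp, x = continent, fill = as.factor(year))) +
  geom_boxplot() +
  # one diamond per continent and year, placed on top of its box
  stat_summary(fun = mean, geom = "point", shape = 23,
               position = position_dodge(width = 0.75)) +
  theme_light() +
  labs(title = "Change in Life Expectancy",
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy",
       x = "Continent",
       fill = "Year")

p_mean
```

The pre-computed labels dataset remains the better choice when the same values are also needed as text labels or in a table.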

Using Boxplots for Data Exploration

Mutaz Abu Ghazaleh

02 September, 2018