In this vignette I will explore the use of charts to display basic descriptive statistics in order to understand the center and spread of a variable in a dataset. First I will use the summary() function in R, then plot a histogram to show the distribution of the variable, and finally use a boxplot to examine the distribution further and highlight the key data points: the mean, the median, the 1st and 3rd quartiles, and outliers.
For this example, I will be using the gapminder dataset. The gapminder dataset is an excerpt of the data found on the Gapminder website, published through the R package gapminder. To get the dataset you must install the package with install.packages("gapminder") and load it in your code with library(gapminder).
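The install and load steps can be combined so that the package is only downloaded when it is missing. This is a minimal sketch, assuming a CRAN mirror is reachable from your R session:

```r
# Install gapminder only if it is not already available, then load it.
if (!requireNamespace("gapminder", quietly = TRUE)) {
  install.packages("gapminder")
}
library(gapminder)

# Quick structural check of the data frame the package provides.
str(gapminder)
```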
In this vignette we will use the year, continent and lifeExp variables to demonstrate these concepts.
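library(gapminder)
library(tidyverse)  # loads dplyr (%>%, filter), tidyr and ggplot2, all used below
df <- gapminder
knitr::kable(df %>% head(10), caption = "Table: Gapminder dataset (Sample)")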
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 |
Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 |
Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 |
Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |
A descriptive summary of the whole data set, or of any single variable, can be produced with the summary() function. Called on a numeric variable, it shows the minimum, the 1st quartile, the median, the mean, the 3rd quartile and the maximum.
summary(df$lifeExp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.60 48.20 60.71 59.47 70.85 82.60
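The same figures can be computed one at a time with base R functions, which is handy when you need a single statistic as a number rather than a printed summary. The sketch below also adds the inter-quartile range and the standard deviation as two common measures of spread; the variable names are illustrative only.

```r
library(gapminder)

x <- gapminder$lifeExp

# Minimum, 1st quartile, median, 3rd quartile and maximum
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five)

# Mean, inter-quartile range (Q3 - Q1) and standard deviation
avg <- mean(x)
iqr <- IQR(x)
sdv <- sd(x)
print(c(mean = avg, IQR = iqr, sd = sdv))
```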
This information is useful, but you often need a more detailed look at the spread and distribution of the data. Histograms and boxplots can help you understand the data better.
Histograms are useful charts for showing how often the values of a variable fall into each range (bin).
The histogram below of life expectancy in 1952 shows that the distribution is bimodal, meaning it has two peaks: one at around 40 years of age and one at around 65 years of age.
df %>%
  filter(year == 1952) %>%
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 2) +
  theme_light() +
  labs(title = "Life Expectancy",
       subtitle = "year 1952",
       y = "Count",
       x = "Life Expectancy")
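To check the two peaks numerically, the counts behind a histogram can be tabulated directly. This is a minimal sketch, assuming 2-year bins that start at even ages; ggplot2 may place its bin edges differently, so individual counts can differ slightly from the chart. The names bin_counts, bin_start and countries are illustrative.

```r
library(gapminder)
library(dplyr)

bin_counts <- gapminder %>%
  filter(year == 1952) %>%
  # Assign each country to a 2-year bin, labelled by its lower edge (assumed edges)
  mutate(bin_start = floor(lifeExp / 2) * 2) %>%
  # Count how many countries fall into each bin
  count(bin_start, name = "countries")

print(bin_counts, n = Inf)
```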
Faceting the chart by continent reveals the distribution in more detail: in 1952, life expectancy in Africa tended to be lower than in the rest of the world.
df %>%
  filter(year == 1952) %>%
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 2) +
  theme_light() +
  labs(title = "Life Expectancy",
       subtitle = "year 1952",
       caption = "... by continent",
       y = "Count",
       x = "Life Expectancy") +
  facet_wrap(~ continent)
The boxplot is a good chart option to show the distribution of a variable. It displays key summary statistics in a compact form: the median, the 1st and 3rd quartiles, two whiskers, and outlier points.
df %>%
  filter(year == 1952, continent == "Europe") %>%
  ggplot(aes(y = lifeExp)) +
  geom_boxplot() +
  theme_light() +
  labs(title = "Life Expectancy",
       subtitle = "year 1952",
       caption = "... for Europe",
       y = "Life Expectancy",
       x = "")
Note that the continuous variable is specified as the y aesthetic in the ggplot() function call:
ggplot(aes(y=lifeExp))
The black line in the middle of the box represents the median of the variable. The lower and upper edges of the box are the 1st and 3rd quartiles (the 25th and 75th percentiles), so the box visually shows the range of the middle 50% of the values.
Boxplots also display two whiskers that extend from the lower and upper edges of the box to the smallest and largest non-outlier data points. They show roughly where the lowest and highest 25% of the data fall.
The upper whisker extends from the upper edge of the box to the largest value no further than 1.5 * IQR from that edge, where IQR is the inter-quartile range, the distance between the first and third quartiles. The lower whisker extends from the lower edge to the smallest value no further than 1.5 * IQR from that edge.
The skewness of the data is also visible in a boxplot: a distribution is skewed left if observations are concentrated in the upper part of the box, and skewed right if they are concentrated in the lower part of the box. In the above example, the data is skewed left: the median sits in the upper part of the box and the longer tail points towards the lower values.
Data points that fall beyond the whiskers, that is, more than 1.5 * IQR from the edges of the box, are known as outliers. All outliers are plotted individually on the chart. In this example, there is one outlier in the data set.
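The elements of the boxplot can be reproduced by hand, which makes the 1.5 * IQR rule concrete. The sketch below recomputes the quartiles, the whisker limits and the outliers for Europe in 1952. It assumes that the default method of quantile() matches the quartiles ggplot2 draws, and the names europe_1952, lower_fence, upper_fence, whiskers and outliers are illustrative.

```r
library(gapminder)
library(dplyr)

europe_1952 <- gapminder %>%
  filter(year == 1952, continent == "Europe")
life <- europe_1952$lifeExp

# Edges of the box and the inter-quartile range
q1 <- unname(quantile(life, 0.25))
q3 <- unname(quantile(life, 0.75))
iqr <- q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the limits
whiskers <- range(life[life >= lower_fence & life <= upper_fence])

# Points outside the limits are drawn individually as outliers
outliers <- europe_1952 %>%
  filter(lifeExp < lower_fence | lifeExp > upper_fence) %>%
  select(country, lifeExp)

print(c(q1 = q1, median = median(life), q3 = q3, mean = mean(life)))
print(whiskers)
print(outliers)
```

Comparing the mean with the median in the first printed vector is a quick numeric check on skewness: a mean below the median points to a longer lower tail.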
In the histogram example, we examined the distribution of life expectancy by continent using the facet_wrap() function to facet by continent.
In this example, we can show a boxplot for each continent by using the x aesthetic: ggplot(aes(y=lifeExp, x=continent)).
df %>%
  filter(year == 1952 | year == 2007) %>%
  ggplot(aes(y = lifeExp, x = continent, fill = as.factor(year))) +
  geom_boxplot() +
  theme_light() +
  labs(title = "Change in Life Expectancy",
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy",
       x = "Continent",
       fill = "Year")
We can clearly see that life expectancy improved on every continent between 1952 and 2007, with Africa still having the lowest life expectancy in 2007.
The ggplot2 geom_boxplot() function only shows the median, not the mean. You can, however, add a point that represents the mean: first calculate the mean, then use the geom_point() function to overlay it on the plot.
The boxplot summarises the data by year and continent, so the values in the labels dataset must be computed for each combination of year and continent. The code that prepares the summary dataset therefore uses the group_by() function before piping the data to the summarise() function.
# Summary statistics for each year/continent pair
df_labels <- df %>%
  filter(year == 1952 | year == 2007) %>%
  group_by(year, continent) %>%
  summarise(
    mean = mean(lifeExp),
    median = median(lifeExp),
    min = min(lifeExp),
    max = max(lifeExp),
    quantile_25 = quantile(lifeExp, .25),
    quantile_75 = quantile(lifeExp, .75)
  ) %>%
  # Stack the statistic columns into label/value pairs, one row per statistic
  gather("label", "value", mean, median, min, max, quantile_25, quantile_75)
This is what the labels dataset looks like:
year | continent | label | value |
---|---|---|---|
1952 | Africa | mean | 39.13550 |
1952 | Africa | median | 38.83300 |
1952 | Africa | min | 30.00000 |
1952 | Africa | max | 52.72400 |
1952 | Africa | quantile_25 | 35.81175 |
1952 | Africa | quantile_75 | 42.11775 |
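In current versions of tidyr, gather() is marked as superseded in favour of pivot_longer(). The sketch below builds the same kind of long labels table with pivot_longer(), limited to the mean and median to keep it short. The name df_labels_long is illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(gapminder)
library(dplyr)
library(tidyr)

df_labels_long <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(year, continent) %>%
  summarise(
    mean = mean(lifeExp),
    median = median(lifeExp),
    .groups = "drop"   # return an ungrouped table
  ) %>%
  # One row per year, continent and statistic
  pivot_longer(cols = c(mean, median), names_to = "label", values_to = "value")

head(df_labels_long)
```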
To add the mean data point, geom_point() is used to plot the mean value from the labels dataset: geom_point( data = ... , aes(y = value, x=continent), shape=23 ). Then an additional call to geom_text() adds a label over the mean data point on the chart.
df %>%
  filter(year == 1952 | year == 2007) %>%
  ggplot(aes(y = lifeExp, x = continent, fill = as.factor(year))) +
  geom_boxplot() +
  theme_light() +
  labs(title = "Change in Life Expectancy",
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy",
       x = "Continent",
       fill = "Year") +
  # Dodge the points and labels so each mean sits on its own year's box
  geom_point(data = df_labels %>% filter(label == "mean"),
             aes(y = value, x = continent), shape = 23,
             position = position_dodge(width = 0.75)) +
  geom_text(data = df_labels %>% filter(label == "mean"),
            aes(label = paste("mean:", format(value, digits = 2, nsmall = 0)),
                y = value,
                x = continent),
            vjust = -.5,
            position = position_dodge(width = 0.75))
A similar approach can be used to overlay other data points from the labels dataset that was prepared earlier.
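As an alternative to a separate labels dataset, ggplot2 can compute the mean inside the plot with stat_summary(). This is a different technique from the one used above, shown here only as a sketch. It assumes ggplot2 3.3.0 or later (where the argument is named fun), and a dodge width of 0.75 to line the points up with the default boxplot spacing.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

p <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  ggplot(aes(y = lifeExp, x = continent, fill = as.factor(year))) +
  geom_boxplot() +
  # Compute the mean per continent and year and draw it as a diamond
  stat_summary(fun = mean, geom = "point", shape = 23,
               position = position_dodge(width = 0.75)) +
  theme_light() +
  labs(title = "Change in Life Expectancy",
       subtitle = "year 1952 and 2007",
       y = "Life Expectancy",
       x = "Continent",
       fill = "Year")

print(p)
```

This keeps the code shorter when only the mean is needed, while the labels dataset remains the more flexible choice when several statistics or text labels have to be overlaid.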