Histograms, Frequency charts and Boxplots

We again take the example of the ESS survey, in which more than 40,000 respondents from 29 countries participated. This time, we filter the respondents from Sweden (N = 1539). If we want to visualize the distribution of a metric (continuous) variable such as age, news consumption per day or years of education, we can choose among three chart types:

Histogram with geom_histogram()
Frequency polygon with geom_freqpoly()
Boxplot with geom_boxplot()

The most important argument of the geom_histogram() function is “binwidth”, which sets the width of each “bin” (column) and thereby determines how many bins the chart has. A bin is a range between two values (e.g. 15 and 19) within which all observations are grouped, or “binned”, so that they can be displayed as one column. The smaller the binwidth, the more detail we see in the histogram. The binwidth argument can also be used with geom_freqpoly(), which bins the data in the same way but draws a line instead of columns.
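To make the relation between binwidth and the number of bins concrete, here is a minimal sketch. It uses a small simulated data frame called “toy” (a hypothetical stand-in, not the ESS data) and the ggplot2 helper layer_data() to look at the bins that are actually computed. The helper function get_bins() is defined here only for illustration.

```r
library(ggplot2)

# Hypothetical ages, simulated for illustration only (not the ESS data)
set.seed(1)
toy <- data.frame(age = sample(15:90, size = 500, replace = TRUE))

# Return the bins that geom_histogram() computes for a given binwidth
get_bins <- function(width) {
  p <- ggplot(toy, aes(age)) + geom_histogram(binwidth = width)
  layer_data(p)  # one row per bin, with columns such as xmin, xmax and count
}

bins_wide   <- get_bins(10)
bins_narrow <- get_bins(2.5)

# A smaller binwidth yields more, narrower bins
nrow(bins_wide)
nrow(bins_narrow)
```

Every observation falls into exactly one bin, so the counts add up to the number of rows in both cases; only the level of detail changes.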

Here is an example with age (age in years) from the filtered dataset for Sweden (data object called “se”):

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

ess <- read_csv("C:/Users/petemaur/Teaching/Data/ess_data.csv")

## Rows: 49519 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): cntry
## dbl (11): idno, nwspol, polintr, trstprl, trstep, trstun, vote, gndr, yrbrn,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Recode numerically coded variables as categorical variables (factors)
ess$gndr <- recode_factor(ess$gndr, "1" = "Male", "2" = "Female")
ess$polintr <- recode_factor(ess$polintr, "1" = "very", "2" = "quite", "3" = "hardly", "4" = "not at all")

# Calculate age from year of birth
ess <- mutate(ess, age = 2018 - yrbrn)

# Filter Swedish respondents
se <- filter(ess, cntry == "SE")

# Start creating charts

ggplot(se, aes(age))+
  geom_histogram(binwidth = 5)+
  theme_classic()

If we want more details, we can reduce binwidth to 2.5:

ggplot(se, aes(age))+
  geom_histogram(binwidth = 2.5)+
  theme_classic()

We can also use a frequency polygon chart to visualize the same distribution:

ggplot(se, aes(age))+
  geom_freqpoly(binwidth = 5)+
  theme_classic()
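Because both geoms bin the data in the same way, they can be drawn on top of each other to check that they describe the same distribution. The following is only a sketch with simulated data (the data frame “toy” is hypothetical, not the ESS data):

```r
library(ggplot2)

# Hypothetical ages, simulated for illustration only (not the ESS data)
set.seed(4)
toy <- data.frame(age = sample(15:90, size = 500, replace = TRUE))

# Histogram and frequency polygon with the same binwidth in one chart
p_both <- ggplot(toy, aes(age)) +
  geom_histogram(binwidth = 5, fill = "grey80") +
  geom_freqpoly(binwidth = 5) +
  theme_classic()

p_both
```

The line of the frequency polygon connects the tops of the histogram columns and drops to zero at both ends.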

If we want to compare by gender (women vs. men), we can map gender as a grouping variable to the “color” aesthetic inside aes(). For a histogram, we would use the “fill” aesthetic instead.

ggplot(se, aes(age, color = gndr))+
  geom_freqpoly(binwidth = 5)+
  theme_classic()
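The histogram variant mentioned above could look like the following sketch. It uses simulated data (the data frame “toy” is hypothetical, not the ESS data). The arguments position = "identity" and alpha = 0.5 are one possible choice: they overlay the two groups semi-transparently instead of stacking them, which is the default.

```r
library(ggplot2)

# Hypothetical respondents, simulated for illustration only (not the ESS data)
set.seed(2)
toy <- data.frame(
  age  = sample(15:90, size = 400, replace = TRUE),
  gndr = factor(sample(c("Male", "Female"), size = 400, replace = TRUE))
)

# For columns, "fill" colours the interior; "color" would only change the outline
p_hist <- ggplot(toy, aes(age, fill = gndr)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) +
  theme_classic()

p_hist
```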

If we want to compare the distribution of age, years of education or news consumption etc. between two or more groups, like gender (male vs. female) or respondents who have different levels of political interest (high, medium, low), we can use boxplots. Boxplots show us the median of each distribution in a convenient way. With the argument “notch = TRUE”, the box is drawn with a notch around the median, which makes it easier to compare medians between groups, but it is optional.

ggplot(se, aes(age, gndr))+
  geom_boxplot(notch = FALSE)+
  theme_classic()

ggplot(se, aes(age, gndr))+
  geom_boxplot(notch = TRUE)+
  theme_classic()
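To see which numbers a boxplot is built from, we can calculate them by hand. The following is a minimal sketch with simulated data (the data frame “toy” and the column names in “box_stats” are hypothetical). It assumes the notch limits documented for ggplot2, which extend 1.58 * IQR / sqrt(n) on either side of the median.

```r
library(dplyr)

# Hypothetical respondents, simulated for illustration only (not the ESS data)
set.seed(3)
toy <- tibble(
  age  = sample(15:90, size = 400, replace = TRUE),
  gndr = sample(c("Male", "Female"), size = 400, replace = TRUE)
)

# Per-group numbers behind a boxplot: quartiles, median and notch limits
box_stats <- toy %>%
  group_by(gndr) %>%
  summarise(
    n_obs       = n(),
    q1          = quantile(age, 0.25, names = FALSE),
    med         = median(age),
    q3          = quantile(age, 0.75, names = FALSE),
    notch_lower = med - 1.58 * (q3 - q1) / sqrt(n_obs),
    notch_upper = med + 1.58 * (q3 - q1) / sqrt(n_obs)
  )

box_stats
```

If the notches of two groups do not overlap, this suggests that their medians differ.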

We can examine the distribution of respondents’ news use in minutes per day in the same way. First, we use a frequency polygon to see the general distribution, and then we use a boxplot to compare the time of news consumption per day between people with different levels of political interest. For this chart, we set the order of the categories on the y-axis with scale_y_discrete() and label both axes with xlab() and ylab().

ggplot(se, aes(nwspol))+
  geom_freqpoly(binwidth = 30)+
  theme_classic()

## Warning: Removed 25 rows containing non-finite values (stat_bin).

ggplot(se, aes(nwspol, polintr))+
  geom_boxplot(notch = TRUE, varwidth = FALSE)+
  scale_y_discrete(limits = c("very", "quite", "hardly", "not at all"))+
  xlab("News consumption in minutes/day")+
  ylab("Political interest")+
  theme_classic()

## Warning: Removed 2 rows containing missing values (stat_boxplot).

## Warning: Removed 24 rows containing non-finite values (stat_boxplot).
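The warnings tell us that rows with missing values were dropped before plotting. As a sketch, we can count the missing values per variable beforehand and, if we want, remove them explicitly. The example uses a small made-up data frame (“toy” is hypothetical, not the ESS data):

```r
library(dplyr)

# Hypothetical data with some missing values, for illustration only
toy <- tibble(
  nwspol  = c(30, 60, NA, 15, 120, NA, 45),
  polintr = c("very", "quite", "hardly", NA, "quite", "very", "not at all")
)

# Count missing values per variable before plotting
na_counts <- toy %>%
  summarise(across(c(nwspol, polintr), ~ sum(is.na(.x))))

# Keep only complete rows so that these rows are not dropped silently later
toy_complete <- toy %>%
  filter(!is.na(nwspol), !is.na(polintr))

na_counts
nrow(toy_complete)
```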

We clearly see that respondents who are quite or very interested in politics consume more news per day. We also see the outliers in each distribution.