Self-learning tutorial: Creating charts with ggplot and the ESS data set: Bar plots with percentages, scatter plots, box plots

Importing a data set

Import the data set into RStudio:
Either the small ESS training data set OR the large ESS full data set
The ESS full data set is posted on Canvas as an xlsx file in the section “Data lab and homework seminar 3”
The ESS full data set is very large, with 500+ variables
Load the packages with library()
Use the menu File > Import Dataset > From Excel to import the data

library(tidyverse)
library(readxl)
library(scales)

# Adjust the path so that it points to the location of the file on your own computer
ess <- read_excel("C:/Users/Peter Maurer/Documents/_Lehre_Karlstad/MKGB93_DataVisualization/Datasets/Data_labs/ess_data.xlsx",
                  col_types = c("guess", "guess", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
                                "numeric", "numeric", "numeric"))

Filtering variables and observations

Before working with ESS full data, search for interesting variables in the code book https://ess.sikt.no/en/datafile/b2b0bf39-176b-4eca-8d26-3c05ea83d2cb?tab=1
After importing, subset the ESS full data set: select variables you are interested in
Filter the observations you are interested in (by country and/or other variables such as age or gender) if you don’t want to use all observations
To select variables, it is useful to print the variable names with colnames()
To select several variables from a larger data set, use select()
To filter observations, use filter()
Save the subset under the same or another name with “<-” (careful: using the same name overwrites the old object)

colnames(ess)

##  [1] "idno"    "cntry"   "nwspol"  "polintr" "trstprl" "trstep"  "trstun" 
##  [8] "vote"    "gndr"    "yrbrn"   "eduyrs"

ess %>%  #selection is not saved in this example
  select(cntry, trstprl, trstep,  trstun,  vote,  gndr) %>%
  filter(cntry %in% c("SE", "DE", "DK", "IT", "NO"))

We have selected the variables for country, political trust, voting and gender
We have then kept only the respondents from Sweden, Germany, Denmark, Italy and Norway
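
The selection above is only printed, not saved. A minimal sketch of how a subset can be stored with “<-” under a new name, so that the original object is kept; ess_toy and ess_nordic are hypothetical names, and the values are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data with some of the column names of the ESS training data
ess_toy <- tibble(
  cntry   = c("SE", "SE", "DE", "FR", "IT", "NO"),
  trstprl = c(7, 5, 6, 4, 3, 8),
  vote    = c(1, 2, 1, 1, 3, 1),
  yrbrn   = c(1980, 1995, 1970, 1988, 2001, 1965)
)

# Save the subset under a new name so that ess_toy itself stays unchanged
ess_nordic <- ess_toy %>%
  select(cntry, trstprl, vote) %>%
  filter(cntry %in% c("SE", "NO"))

dim(ess_nordic)   # rows and columns of the subset
dim(ess_toy)      # the original object still has all rows and columns
```

Had we written ess_toy <- ess_toy %>% ... instead, the full data would have been overwritten and would have to be imported again.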

Analyse voting behavior in the selected countries

Turn “vote” into a factor and save it, use drop_na() to remove the NA values in vote, and then start with the visualization.

ess$vote <- recode_factor(ess$vote, "1" = "voted", "2" = "abstained", "3" = "non eligible")
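
To see what this recoding does, here is a small sketch on a hypothetical vector of numeric codes (vote_codes is made up and not part of the ESS data). It assumes a dplyr version in which recode_factor() is available:

```r
library(dplyr)

vote_codes <- c(1, 2, 3, 1, NA)   # hypothetical numeric codes, including one missing value

vote_factor <- recode_factor(vote_codes,
                             "1" = "voted",
                             "2" = "abstained",
                             "3" = "non eligible")

levels(vote_factor)                   # the levels follow the order of the replacements
table(vote_factor, useNA = "ifany")   # the missing value stays NA
```

The order of the levels matters later, because ggplot uses it for the order of the bars and of the legend entries.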

ess %>%
  filter(cntry %in% c("SE", "DE", "DK", "IT", "NO")) %>%
  drop_na(vote) %>%
  ggplot(aes(x = vote, fill = vote)) +
  geom_bar()+
  scale_fill_discrete(type = c("skyblue", "gray", "orange"), name = "Voted or not?")+
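  labs(x = "Voting", y = "Number", title = "Voting behavior")+
  theme_classic()+                  # apply the complete theme first
  theme(legend.position = "top")    # then adjust it; in the reverse order theme_classic() would reset the legend position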

Using theme() to make the chart more informative and look more professional

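We can change the colors, adjust the legend, and add a title and a subtitle to the chart. The colors are set with scale_fill_manual() and the text of the title and subtitle with labs(). The position of the legend and the appearance of the title and subtitle are set with theme() in the last line. theme() must come after theme_classic(), because theme_classic() would otherwise reset these adjustments.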

ess %>%
  filter(cntry %in% c("SE", "DE", "DK", "IT", "NO")) %>%
  drop_na(vote) %>%
  ggplot(aes(x = vote, fill = vote)) +
  geom_bar()+
  scale_fill_manual(values = c( 
                        "voted" = "tomato",
                      "abstained" = "green",
                      "non eligible" = "beige"), name = "Voted or not?")+
  labs(x = "Voting", y = "Number", title = "Voting behavior", subtitle = "With data from the ESS survey from 2020")+
  theme_classic()+
  theme(legend.position = "bottom", plot.title=element_text(hjust=0.5, vjust=1.5, face='bold'),
        plot.subtitle = element_text(hjust=0.5, vjust=1.5))
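
Why theme() is placed after theme_classic() can be checked without drawing a chart. The following is a sketch that only combines theme objects; lost and kept are hypothetical names:

```r
library(ggplot2)

# theme() first, complete theme second: the adjustment is replaced
lost <- theme(legend.position = "top") + theme_classic()

# complete theme first, theme() second: the adjustment survives
kept <- theme_classic() + theme(legend.position = "top")

lost$legend.position   # falls back to the default of theme_classic()
kept$legend.position   # keeps our setting
```

A complete theme such as theme_classic() replaces everything that was set before it, so individual adjustments belong at the end of the plot code.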

Making a chart with percent (%) instead of raw numbers (N’s)

Imagine we want to compare voter turnout between countries. We prefer to show the vote data now as relative frequencies (in percent) within each country. That facilitates comparisons between countries with a different total number (total N) of respondents.

To do this, we must first calculate the percentage for each value of “vote” within each country. We do this by grouping the respondents by country with group_by(). Then we calculate the percentage for each value of “vote” in each country by dividing the n’s for the values “voted”, “abstained” and “non eligible” by the country-level n. Lastly, we drop the value NA from the data set with drop_na(). Because drop_na() comes last, the country-level n still includes the respondents with a missing answer, so the percentages are shares of all respondents in a country.

ess %>%
  filter(cntry %in% c("SE", "DE", "DK", "IT", "NO")) %>%
  group_by(cntry) %>%
  count(vote) %>%
  group_by(cntry) %>%
  mutate(percent = n / sum(n)) %>%
  drop_na()
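
The same steps can be followed by hand on a small made-up data set. This is only a sketch; toy and shares are hypothetical names and the values are invented:

```r
library(dplyr)
library(tidyr)

# Hypothetical answers: 4 respondents in SE, 5 in IT (one of them missing)
toy <- tibble(
  cntry = c("SE", "SE", "SE", "SE", "IT", "IT", "IT", "IT", "IT"),
  vote  = c("voted", "voted", "voted", "abstained",
            "voted", "voted", "abstained", "abstained", NA)
)

shares <- toy %>%
  group_by(cntry) %>%
  count(vote) %>%                     # n for each value of vote within each country
  group_by(cntry) %>%
  mutate(percent = n / sum(n)) %>%    # divide by the country-level n (NA included)
  ungroup()

shares               # SE: 3 of 4 voted, IT: 2 of 5 voted
drop_na(shares)      # the NA row is removed, the percentages stay the same
```

In this toy example the Italian shares add up to less than 100 % after drop_na(), because the missing answer was part of the denominator. To compute shares of valid answers only, drop_na(vote) would have to come before count().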

Now, we can create the chart. We join both steps with a pipe operator %>% between the last function above and ggplot(). It is important that we have loaded the scales package before we start plotting, since the function percent, which formats the labels on the y axis as percentages, comes from that package. With n.breaks = 8, we ask for about eight percent labels on the y axis; ggplot may choose a slightly different number to get round values.

ess %>%
  filter(cntry %in% c("SE", "DE", "DK", "IT", "NO")) %>%
  group_by(cntry) %>%
  count(vote) %>%
  group_by(cntry) %>%
  mutate(percent = n / sum(n)) %>%
  drop_na() %>%
  ggplot(aes(x = cntry, y = percent, fill = vote))+
  geom_bar(stat = "identity", position = "dodge")+
  scale_fill_manual(values = c( 
    "voted" = "tomato",
    "abstained" = "green",
    "non eligible" = "beige"), name = "Voted or not?")+
  scale_y_continuous(labels = percent, n.breaks = 8)+
  labs(x = "Country", y = "Voting in %", title = "Voting behavior in last election", subtitle = "With data from the ESS survey from 2020")+
  theme_classic()
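
What labels = percent does can be tried out directly in the console. A short sketch with made-up proportions:

```r
library(scales)

# percent() turns proportions into text labels with a percent sign
percent(0.5)
percent(c(0.25, 0.75))
```

This is why the variable “percent” is calculated as a proportion between 0 and 1 and not multiplied by 100: the function does the multiplication when it creates the axis labels.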

Making the chart more informative by ordering the bars and adding elements

Now we can make the chart a bit nicer:

We can order the countries by voter turnout (highest to lowest). This is done by turning country into a factor and setting the order of the factor levels manually, as in the first line of code below
With theme() we can tweak the axes, labels, colors, text, title, background etc.
We can move the legend to the bottom
We can change the labels on the x axis
We can add a title and a subtitle
We can add a background color
We can add a grid line to make it easier to read the percentage for each column from the chart
We can change the font type, the face and the color of the text, etc.

ess$cntry <- factor(ess$cntry, levels = c("SE", "DK", "NO", "DE", "IT")) # This turns all other countries into NA

ess %>%
  filter(cntry %in% c("SE", "DE", "DK", "IT", "NO")) %>%
  group_by(cntry) %>%
  count(vote) %>%
  group_by(cntry) %>%
  mutate(percent = n / sum(n)) %>%
  drop_na() %>%
  ggplot(aes(x = cntry, y = percent, fill = vote))+
  geom_bar(stat = "identity", position = "dodge")+
  scale_fill_manual(values = c( 
    "voted" = "tomato",
    "abstained" = "green",
    "non eligible" = "beige"), name = "Voted or not?")+
  scale_y_continuous(labels = percent, n.breaks = 8)+
  scale_x_discrete(labels = c("Sweden", "Denmark", "Norway", "Germany", "Italy"))+
  labs(x = "Country", y = "Voting in %", title = "Voting behavior in last election", subtitle = "With data from the ESS survey from 2020")+
  theme_classic()+
  theme(legend.position = "bottom", plot.title=element_text(hjust=0.5, vjust=1.5, face='bold'),
        plot.subtitle = element_text(hjust=0.5, vjust=1.5), panel.grid.major.y = element_line(linetype = "dashed", color = "gray40"),
        panel.background = element_rect(fill = "grey90"), axis.text.x = element_text(family = "mono", size = 11))
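
The comment on the factor() line above says that all other countries are turned into NA. A small sketch with hypothetical country codes shows this effect:

```r
countries <- c("SE", "FR", "IT", "DK", "PL")   # hypothetical values, FR and PL are not among the levels

five <- factor(countries, levels = c("SE", "DK", "NO", "DE", "IT"))

five           # FR and PL have become NA
is.na(five)
levels(five)   # the levels keep the order we typed, which is also the order of the bars
```

This is also the reason why the data must be imported again later if all countries are needed.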

Exploring correlations visually with scatter plots and geom_jitter()

If we want to check whether two numeric (continuous) variables from the ESS data set are correlated, such as trust in the national parliament and trust in the European Parliament, we can use a scatter plot. In a scatter plot, each observation is drawn as a dot in a two-dimensional coordinate system spanned by the x and the y axis. The shape of the point cloud shows whether a correlation exists. Because both trust variables take only whole numbers from 0 to 10, many respondents fall on exactly the same spot. geom_jitter() adds a small amount of random noise to each dot so that overlapping observations become visible. We want to explore this relationship first for Sweden and then for Italy. We also want to know whether the relationship is the same for men and women.

ess$gndr <- recode_factor(ess$gndr, "1" = "Male", "2" = "Female")

ess %>%
  filter(cntry %in% c("SE")) %>%
  drop_na() %>%
  ggplot(aes(x = trstprl, y = trstep, color = as_factor(gndr)))+  # map x = trust in national parliament, y = trust in European Parliament, color = gender
  geom_jitter()+   # scatter plot with a little random noise so that overlapping dots become visible
  geom_smooth()+   # add a trend line (smoother) for each gender to see the relationship between x and y
  labs(color = "Gender", x = "Trust in Riksdag", y = "Trust in European Parl.", title = "Relationship between two sorts of political trust", subtitle = "ESS data from Sweden, 2020")+  # set the legend title, axis labels, title and subtitle
  scale_color_manual(values = c("Male" = "chartreuse", "Female" = "chocolate"))+
  scale_y_continuous(breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))+
  scale_x_continuous(breaks = seq(0, 10, 1))+
  theme_classic()+
  theme(legend.position = "bottom", plot.background = element_rect(fill = "lightgrey", color = "black"), panel.grid = element_blank(), panel.background = element_rect(fill = "beige"))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
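
Why geom_jitter() is used here instead of geom_point() can be illustrated with simulated answers on two 0 to 10 scales. This is a sketch with invented data; toy, p_point and p_jitter are hypothetical names, and the width, height and alpha values are arbitrary choices:

```r
library(dplyr)
library(ggplot2)

set.seed(1)   # make the simulated data reproducible
toy <- tibble(
  trstprl = sample(0:10, 500, replace = TRUE),
  # second scale: the first one plus a small deviation, kept within 0 and 10
  trstep  = pmin(pmax(trstprl + sample(-2:2, 500, replace = TRUE), 0), 10)
)

nrow(toy)                              # number of simulated respondents
nrow(distinct(toy, trstprl, trstep))   # number of distinct positions, far fewer

# geom_point() stacks identical answers on top of each other
p_point <- ggplot(toy, aes(x = trstprl, y = trstep)) +
  geom_point()

# geom_jitter() spreads them out a little around the true position
p_jitter <- ggplot(toy, aes(x = trstprl, y = trstep)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)
```

Typing p_point and p_jitter in the console draws both charts. The jittered version shows where many respondents gave the same combination of answers.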

The same scatter plot for Italy, where the lower house of parliament is called Camera dei deputati:

ess %>%
  filter(cntry %in% c("IT")) %>%
  drop_na() %>%
  ggplot(aes(x = trstprl, y = trstep, color = as_factor(gndr)))+  # map x = trust in national parliament, y = trust in European Parliament, color = gender
  geom_jitter()+   # scatter plot with a little random noise so that overlapping dots become visible
  geom_smooth()+   # add a trend line (smoother) for each gender to see the relationship between x and y
  labs(color = "Gender", x = "Trust in Camera dei deputati", y = "Trust in European Parl.", title = "Relationship between two sorts of political trust", subtitle = "ESS data from Italy, 2020")+  # set the legend title, axis labels, title and subtitle
  scale_color_manual(values = c("Male" = "chartreuse", "Female" = "chocolate"))+
  scale_y_continuous(breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))+
  scale_x_continuous(breaks = seq(0, 10, 1))+
  theme_classic()+
  theme(legend.position = "bottom", plot.background = element_rect(fill = "lightgrey", color = "black"), panel.grid = element_blank(), panel.background = element_rect(fill = "beige"))

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Comparing distributions of metric variables between groups and countries with box plots

Last but not least, we can also compare trust in national parliaments, measured with the metric variable “trstprl”, between all 29 European countries. We can use one box plot for each country. With coord_flip(), we swap the axes so that the countries are listed on the vertical axis.

Because we have changed “cntry” to a factor with only 5 valid values along the way, we must load the data again to have all countries.

ess <- read_excel("C:/Users/Peter Maurer/Documents/_Lehre_Karlstad/MKGB93_DataVisualization/Datasets/Data_labs/ess_data.xlsx",
                  col_types = c("guess", "guess", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
                                "numeric", "numeric", "numeric"))

ess %>%
  drop_na() %>%
  ggplot(aes(x = cntry, y = trstprl))+
  geom_boxplot(notch = TRUE)+
  coord_flip()+   # swap the axes; the labels in labs() still refer to the x and y aesthetics
  labs(x = "Country", y = "Trust in the parliament of [country]", title = "Who trusts their national parliament most in Europe?")+
  theme_classic()

The problem is that the countries are not yet ordered according to their average level of trust. This can be achieved with reorder(). To make this work, “cntry” is converted to a factor (it is now a character variable), and we specify that cntry should be ordered according to the mean value of “trstprl” in each country. reorder() sorts the factor levels by increasing mean. With the - sign in front of “trstprl”, the levels are therefore sorted from the highest to the lowest mean trust. Because coord_flip() draws the first level at the bottom, the chart reads from the lowest trust at the top to the highest trust at the bottom. Without the - sign, the highest trust would be at the top.

ess %>%
  drop_na() %>%
  ggplot(aes(x = reorder(as.factor(cntry), -trstprl, mean), y = trstprl))+
  geom_boxplot(notch = TRUE)+
  coord_flip()+
  labs(x = "Country", y = "Trust in the parliament of [country]", title = "Who trusts their national parliament most in Europe?")+
  theme_classic()
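
How reorder() sorts the factor levels, with and without the - sign, can be checked on a small hypothetical data frame (toy and its values are made up):

```r
# Hypothetical trust scores: the means are IT = 3.5, DE = 5.5, SE = 7.5
toy <- data.frame(
  cntry   = c("IT", "IT", "SE", "SE", "DE", "DE"),
  trstprl = c(3, 4, 7, 8, 5, 6)
)

# Without the minus sign: levels sorted by increasing mean
levels(reorder(as.factor(toy$cntry), toy$trstprl, mean))    # "IT" "DE" "SE"

# With the minus sign: levels sorted by decreasing mean
levels(reorder(as.factor(toy$cntry), -toy$trstprl, mean))   # "SE" "DE" "IT"
```

In a flipped chart the first level is drawn at the bottom, so the second version puts SE at the bottom and IT at the top.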