Data preparation

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
ess <- read_csv("C:/Users/petemaur/Teaching/Data/ess_data.csv")
## Rows: 49519 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): cntry
## dbl (11): idno, nwspol, polintr, trstprl, trstep, trstun, vote, gndr, yrbrn,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ess$gndr <- recode_factor(ess$gndr, "1" = "Male", "2" = "Female")

ess_scand <- filter(ess, cntry %in% c("NO", "SE"))

We have loaded the dataset and selected only respondents from Sweden and Norway. We have named the new dataset “ess_scand”. We want to examine whether two metric variables are correlated: trust in the national parliament (Riksdag and Storting) and trust in the European Parliament. Trust in both institutions was measured on a scale from 0 to 10, where 0 means least trust and 10 means most trust. We expect the two variables to be positively correlated: the more a person trusts one parliament, the more they trust the other.
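
Before plotting, the expected relationship can also be checked numerically. The following is a minimal sketch using a small made-up data frame called toy (an assumption for illustration, not real ESS data); it shows how cor() handles missing values with use = "complete.obs".

```r
# Toy stand-in for ess_scand: the values are made up, NOT real ESS data
toy <- data.frame(
  trstprl = c(2, 5, 7, 8, NA, 4, 9, 6),
  trstep  = c(1, 4, 6, 9, 5, NA, 8, 5)
)

# Pearson correlation; rows with a missing value in either variable are dropped
r <- cor(toy$trstprl, toy$trstep, use = "complete.obs")
r
```

With the real data, the same call would presumably be cor(ess_scand$trstprl, ess_scand$trstep, use = "complete.obs").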

We use a scatterplot for this task, so the geom is geom_point().

  1. In the ggplot() function, we define the data and, with aes(), which variable goes on which axis. Note the short form: the arguments are passed by position, without writing out data =, x =, and y =.
  2. We choose a geom (what type of chart?).
  3. We can select a theme (here the black-and-white theme_bw()).

At the end of each line except the last, we need to write a “+” sign to tell R that the command continues.
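
As a small illustration of the “+” rule, here is a sketch with a made-up data frame toy (hypothetical, for illustration only). The “+” has to sit at the end of a line, not at the start of the next one. A plot stored in an object can also be extended later.

```r
library(ggplot2)

# Toy data, made up for illustration only
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# The "+" at the end of each line tells R that the command continues
p <- ggplot(toy, aes(x, y)) +
  geom_point() +
  theme_bw()

# A stored plot can be extended later with another "+"
p2 <- p + xlab("x values")
p2
```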

library(ggplot2)
ggplot(ess_scand, aes(trstprl, trstep))+
  geom_point(position = "jitter")+
  theme_bw()
## Warning: Removed 386 rows containing missing values (geom_point).

This looks like a positive correlation. Now, we want to see if the relationship is similar for Norwegians and Swedes. To do this, we introduce the categorical variable country (cntry) and color the dots according to the country. This is done with a third argument in the aes() function: cntry is mapped to the color aesthetic. The position = "jitter" argument in geom_point() adds a small amount of random noise to each dot. Because trust is measured in whole numbers, many dots would otherwise lie exactly on top of each other; jittering makes the overlapping dots visible.

ggplot(ess_scand, aes(trstprl, trstep, color = cntry))+
  geom_point(position = "jitter")+
  theme_bw()
## Warning: Removed 386 rows containing missing values (geom_point).
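
To see what jittering does, the following sketch compares a plain and a jittered scatterplot on made-up integer data (toy is hypothetical, not ESS data). It uses position_jitter(), which lets us set the amount of noise and a seed; the values 0.3 and 42 are arbitrary choices for illustration.

```r
library(ggplot2)

# Toy data on an integer 0-10 scale, made up for illustration only
set.seed(1)
toy <- data.frame(
  trstprl = sample(0:10, 200, replace = TRUE),
  trstep  = sample(0:10, 200, replace = TRUE)
)

# Without jitter: many of the 200 points sit exactly on top of each other
p_plain <- ggplot(toy, aes(trstprl, trstep)) +
  geom_point() +
  theme_bw()

# With jitter: each point is shifted by a small random amount.
# width/height set the maximum shift in each direction;
# seed makes the random shifts reproducible
p_jitter <- ggplot(toy, aes(trstprl, trstep)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 42)) +
  theme_bw()

p_jitter
```

Printing p_plain next to p_jitter shows the difference: the plain version displays far fewer visible dots than there are rows in the data.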

If we do not want to use color as an aesthetic, we can use shape instead:

ggplot(ess_scand, aes(trstprl, trstep, shape = cntry))+
  geom_point(position = "jitter")+
  theme_bw()
## Warning: Removed 386 rows containing missing values (geom_point).

But the groups are harder to tell apart with different shapes.

Instead of using position = "jitter" as an argument in geom_point(), we can also use geom_jitter() directly:

ggplot(ess_scand, aes(trstprl, trstep, color = cntry))+
  geom_jitter()+
  theme_bw()
## Warning: Removed 386 rows containing missing values (geom_point).

Now, if we want to investigate the distribution of a metric variable, for example the time spent reading or watching political news in minutes (nwspol), we can use a geom called a frequency polygon. Again, we can do this with or without a categorical variable (like country or gender) to distinguish between groups.

First, the overall distribution:

  1. In the ggplot() function, we give the data, and in aes() we map only one variable (univariate distribution).
  2. As the geom, we choose geom_freqpoly() and set the binwidth argument to 20. If no binwidth is given, ggplot2 defaults to 30 bins; the smaller the binwidth, the more detailed the line.

ggplot(ess_scand, aes(nwspol))+
  geom_freqpoly(binwidth = 20)+
  theme_bw()
## Warning: Removed 59 rows containing non-finite values (stat_bin).
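
The effect of the binwidth can be sketched on made-up data (toy is hypothetical; the skewed values only loosely imitate minutes of news consumption). The binwidths 20 and 60 are arbitrary choices for comparison.

```r
library(ggplot2)

# Toy data: 500 made-up, right-skewed values (NOT real ESS data)
set.seed(1)
toy <- data.frame(nwspol = round(rexp(500, rate = 1 / 60)))

# No binwidth given: ggplot2 falls back to 30 bins and prints a message
p_default <- ggplot(toy, aes(nwspol)) +
  geom_freqpoly() +
  theme_bw()

# Narrow bins: more points on the line, more detail (and more noise)
p_narrow <- ggplot(toy, aes(nwspol)) +
  geom_freqpoly(binwidth = 20) +
  theme_bw()

# Wide bins: fewer points on the line, smoother shape
p_wide <- ggplot(toy, aes(nwspol)) +
  geom_freqpoly(binwidth = 60) +
  theme_bw()

p_narrow
```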

If we want to distinguish between Norway and Sweden, we again use the color argument in aes(), as before:

ggplot(ess_scand, aes(nwspol, color = cntry))+
  geom_freqpoly(binwidth = 20)+
  theme_bw()
## Warning: Removed 59 rows containing non-finite values (stat_bin).

We may want to get an overview of the distribution of a categorical variable, for example to compare how many respondents are Swedish and how many are Norwegian. For that, we use a bar chart with the geom_bar() function. It counts how many cases (here: respondents) are in each category.

  1. In the ggplot() function, we give the data, and in aes() we map the variable of interest to the x-axis. So we need only one argument in aes().
  2. We choose geom_bar() as our chart type.
ggplot(ess_scand, aes(cntry))+
  geom_bar()+
  theme_bw()
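
To make the counting step explicit, here is a sketch on a made-up data frame (toy is hypothetical). It computes the counts by hand with dplyr's count() and then plots them with geom_col(), which draws bars from values that are already computed. This is an alternative route for illustration, not the method used above.

```r
library(dplyr)
library(ggplot2)

# Toy data, made up for illustration only
toy <- data.frame(cntry = c("NO", "SE", "SE", "NO", "SE"))

# count() returns one row per category with the number of cases in column n;
# this is the number that geom_bar() computes internally
counts <- count(toy, cntry)
counts

# geom_col() plots the precomputed counts directly
p <- ggplot(counts, aes(cntry, n)) +
  geom_col() +
  theme_bw()
p
```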

We may want to add a second variable to see the gender distribution in each country. For that, we map the gender variable (gndr) to the color argument in aes(), as above. Here, we also use the position argument in the geom_bar() function to tell R to plot the bars for male and female respondents in each country next to each other and to leave a space between the countries.

ggplot(ess_scand, aes(cntry, color = gndr))+
  geom_bar(position = "dodge")+
  theme_bw()

Something has not worked as expected: color only colors the outlines of the bars! The fix is to use the fill argument, either together with color or on its own, to fill the bars for each group.

ggplot(ess_scand, aes(cntry, color = gndr, fill = gndr))+
  geom_bar(position = "dodge")+
  theme_bw()

Next, we may want to adjust and label the axes. This is done with the xlim()/ylim() functions and the xlab()/ylab() functions. In this example, we label both axes and change the scale of the y-axis. Note that this is done by adding functions to the basic chart: 1)… 2)… 3) We label the y-axis using the ylab() function and put the name in quotation marks. 4) We do the same for the x-axis. 5) We set a new upper limit for the y-axis. For geom_bar(), the lower limit must be 0, because the bars start at 0.

ggplot(ess_scand, aes(cntry, color = gndr, fill = gndr))+
  geom_bar(position = "dodge")+
  ylab("Respondents")+
  xlab("Country")+
  ylim(0, 1000)+
  theme_bw()
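
As a possible alternative, the labels can be set in a single labs() call, and the visible range can be set with coord_cartesian(). The difference is that ylim() removes data outside the limits, while coord_cartesian() only zooms the view. The sketch below uses a made-up data frame (toy is hypothetical), and the upper limit of 5 is chosen only to fit the toy counts.

```r
library(ggplot2)

# Toy data, made up for illustration only
toy <- data.frame(
  cntry = c("NO", "NO", "NO", "SE", "SE", "SE", "SE"),
  gndr  = c("Male", "Female", "Female", "Male", "Male", "Female", "Female")
)

p <- ggplot(toy, aes(cntry, fill = gndr)) +
  geom_bar(position = "dodge") +
  # labs() sets axis and legend titles in one call
  labs(x = "Country", y = "Respondents", fill = "Gender") +
  # coord_cartesian() zooms the y-axis without dropping any data
  coord_cartesian(ylim = c(0, 5)) +
  theme_bw()
p
```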