Technical notes

  • These are guidelines and templates for SMI205 students on preparing publishable figures in R Markdown documents knitted to HTML.

  • The good news is that, in contrast to tables, most plotting functions work without any additional code to help R Markdown translate their output into HTML; figures are simply embedded as images in the HTML page. At the same time, R Markdown creates a dedicated folder with the figures in your working directory, called R_Markdown_script_name_files.

  • Before running this R Markdown document, install the following packages if you do not already have them on your machine:

  • This HTML page was produced using the material theme from the rmdformats package. The same document in the default R Markdown theme, with downloadable code, is available here.

rmdformats::material:
  highlight: kate
  code_folding: show

  • More YAML settings are described here: https://rmarkdown.rstudio.com/docs/reference/html_document.html

  • As before, I set the global chunk options so that the code of all R chunks is displayed, while messages and warnings are hidden (both set to FALSE).

  • cache=TRUE saves time when re-knitting, as knitr stores the results and reuses them in future knits.
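The global chunk options described above are usually set once, in a setup chunk near the top of the .Rmd file. The following is a minimal sketch that assumes only the four options named in these notes; the chunk that produced this page may set others.

```r
# Global chunk options, set once in a setup chunk near the top of the document.
# Only the options mentioned in these notes are shown here.
knitr::opts_chunk$set(
  echo    = TRUE,   # display the code of every R chunk
  message = FALSE,  # hide messages (e.g. package start-up notes)
  warning = FALSE,  # hide warnings
  cache   = TRUE    # store chunk results so that re-knitting is faster
)
```

Any single chunk can still override these defaults in its own header, for example with results = 'hide'.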

Data

I use the European Social Survey (ESS) 2016 (wave 8) for this exercise. The data can be easily downloaded using the essurvey package.

In the options of the R chunk below I specify results = 'hide', so that R Markdown does not print all the information about the loaded data.

I subset the data and keep the following variables in my new, smaller dataset called round_8_final: idno, cntry, blgetmg, imbgeco, imueclt, imwbcnt, gndr, agea, eisced, polintr, hincfel.
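The subsetting step can be sketched in base R as follows. The object round_8 and its values are a hypothetical stand-in for the downloaded ESS data (downloading requires a registered ESS account, so it is not reproduced here); only the list of variable names comes from the text above.

```r
# Toy stand-in for the downloaded ESS data (hypothetical values and object name)
round_8 <- data.frame(
  idno = 1:4, cntry = c("AT", "BE", "AT", "CH"),
  blgetmg = c(2, 2, 1, 2), imbgeco = c(5, 7, NA, 3),
  imueclt = c(6, 5, 4, 8), imwbcnt = c(5, 5, 3, 7),
  gndr = c(1, 2, 2, 1), agea = c(34, 58, 21, 70),
  eisced = c(4, 6, 2, 55), polintr = c(2, 3, 1, 4),
  hincfel = c(1, 2, 3, 2),
  extra_var = 0  # a column we do not need
)

# Variables to keep, exactly as listed in the text
keep <- c("idno", "cntry", "blgetmg", "imbgeco", "imueclt", "imwbcnt",
          "gndr", "agea", "eisced", "polintr", "hincfel")

# Select the columns by name; all rows are retained
round_8_final <- round_8[, keep]
names(round_8_final)
```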

The remaining sections of this page give an overview of some simple code for visualising descriptive statistics and the results of fitted models.

Frequency of distribution

Histogram

There are a few handy base R functions that we can use to plot simple graphs. The first one is hist, which produces a histogram: a bar chart used to display the distribution of a metric/numeric variable. A histogram divides the range of values into a series of intervals called bins and displays the number of observations in each bin.

Let’s explore a histogram of the age variable:

Below I specify more options: I divide the histogram into 5 equal bins and specify its fill and border colours.

With just 5 breaks/bins we get a much less informative graph, so let’s stick to the default setting. We have to tidy up the graph by adding a title and x axis label (xlab). You can also specify x axis range with xlim.
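The three histogram steps above can be sketched like this. The ages are simulated and the colours, title and axis label are my own placeholders, not the ones used for the original figures.

```r
# Simulated ages standing in for round_8_final$agea (hypothetical data)
set.seed(205)
age <- sample(15:90, 1000, replace = TRUE)

# Default histogram: hist() chooses the breaks itself
h_default <- hist(age)

# About 5 bins, with fill and border colours.
# A single number passed to `breaks` is treated by hist() as a suggestion only.
h_five <- hist(age, breaks = 5, col = "steelblue", border = "white")

# Tidied version: default breaks plus a title, an x-axis label and an x-axis range
h_tidy <- hist(age,
               col = "steelblue", border = "white",
               main = "Age of respondents",
               xlab = "Age in years",
               xlim = c(10, 100))
```

hist() also returns the bin boundaries and counts invisibly, so h_tidy$breaks and h_tidy$counts can be inspected when you want to check what was drawn.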

This looks much better now. Full R documentation for the hist function is here: https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist

Bar chart

Instead of producing a frequency table to get an overview of a variable, you can plot its distribution in a graph. Below is a quick recap of how to do it with another base R function, barplot. First, we have to use the table function, which tabulates categorical data. This is what we get:

Let’s tidy up the graph a little bit and add the same colours as above. To add a title use the main option, and use names.arg to specify labels for all categories.
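A sketch of the table and barplot steps is shown below. I assume interest in politics (polintr, coded 1 to 4) as the example variable, because the text does not say which variable was plotted; the data are simulated and the colours are placeholders.

```r
# Simulated interest-in-politics answers, coded 1-4 (hypothetical data).
# Declaring the levels keeps all four categories even if one were empty.
set.seed(205)
polintr <- factor(sample(1:4, 500, replace = TRUE), levels = 1:4)

# table() counts how many observations fall into each category
freq <- table(polintr)
print(freq)

# barplot() draws one bar per category; names.arg needs one label per bar
barplot(freq,
        col = "steelblue", border = "white",
        main = "Interest in politics",
        names.arg = c("Very", "Quite", "Hardly", "Not at all"))
```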

R documentation for barplot is here: https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/barplot

The above version in ggplot would look like this:

With minor changes to the code we can crosstabulate two categorical variables: the second variable is added as the fill aesthetic, while position = "fill" turns the count bars into proportional bars.

A proportional bar chart, in which the height of each segment is proportional to the share it represents, allows a quick and clear comparison of a categorical variable across groups.
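The ggplot versions of the count and proportional bar charts might look like the sketch below. The variables (interest in politics by gender), the simulated data and all labels are assumptions made for illustration.

```r
library(ggplot2)

# Simulated data (hypothetical): interest in politics and gender
set.seed(205)
lev <- c("Very", "Quite", "Hardly", "Not at all")
toy <- data.frame(
  polintr = factor(sample(lev, 500, replace = TRUE), levels = lev),
  gndr    = factor(sample(c("Male", "Female"), 500, replace = TRUE))
)

# Count bars for one categorical variable
p_count <- ggplot(toy, aes(x = polintr)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Interest in politics", x = NULL, y = "Count")

# Second variable as the fill aesthetic;
# position = "fill" rescales every bar to a total height of 1
p_prop <- ggplot(toy, aes(x = polintr, fill = gndr)) +
  geom_bar(position = "fill") +
  labs(title = "Interest in politics by gender",
       x = NULL, y = "Proportion", fill = "Gender")

print(p_count)
print(p_prop)
```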

Central tendency statistics

Median values - boxplot

A boxplot displays the median value of a metric variable. The box marks the first and third quartiles. Let’s use the boxplot function to visualise the median age in the entire sample.

This is not very informative without any reference point. Let’s add colour, a title and labels.

Now, a similar boxplot, but for more than one variable. They should be measured on the same scale, otherwise a comparison does not make sense.

If you explored the data cross-nationally, looking at median values by country would be useful. So let’s use boxplot to plot values of one variable across groups defined by a second variable.
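The three boxplot variants described above (one variable, several variables on the same scale, one variable across groups) can be sketched in base R as follows. The data are simulated, and the choice of the three immigration items, which share a 0-10 scale in the ESS, is my assumption about what is being compared.

```r
# Simulated data (hypothetical values)
set.seed(205)
toy <- data.frame(
  cntry   = sample(c("AT", "BE", "CH"), 300, replace = TRUE),
  agea    = sample(15:90, 300, replace = TRUE),
  imbgeco = sample(0:10, 300, replace = TRUE),
  imueclt = sample(0:10, 300, replace = TRUE),
  imwbcnt = sample(0:10, 300, replace = TRUE)
)

# One variable, whole sample
boxplot(toy$agea, col = "steelblue",
        main = "Age of respondents", ylab = "Age in years")

# Several variables measured on the same 0-10 scale
boxplot(toy[, c("imbgeco", "imueclt", "imwbcnt")],
        col = "steelblue", main = "Attitudes to immigration",
        names = c("Economy", "Culture", "Place to live"))

# One variable across groups: the formula reads "imbgeco by cntry"
b <- boxplot(imbgeco ~ cntry, data = toy, col = "steelblue",
             main = "Immigration and the economy, by country",
             xlab = "Country", ylab = "imbgeco (0-10)")

# Medians per country are stored in the third row of b$stats
setNames(b$stats[3, ], b$names)
```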

R documentation for boxplot is here: https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/boxplot

It is annoying that with boxplot not all country labels are displayed (as there’s a lot of them), so let’s try redoing the same visualisation in ggplot:
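A ggplot version of the grouped boxplot could be sketched like this, again with simulated data and placeholder labels. ggplot draws a tick label for every level of a discrete x variable, which is what helps with the country labels here.

```r
library(ggplot2)

# Simulated data (hypothetical values)
set.seed(205)
toy <- data.frame(
  cntry   = sample(c("AT", "BE", "CH", "CZ", "DE"), 500, replace = TRUE),
  imbgeco = sample(0:10, 500, replace = TRUE)
)

# One box per country; every country code gets its own axis label
p_box <- ggplot(toy, aes(x = cntry, y = imbgeco)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "Immigration and the economy, by country",
       x = "Country", y = "imbgeco (0-10)")

print(p_box)
```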

In the SMI105 module you had an advanced overview of the ggplot function. Revisit this material if necessary.

Mean values - qplot

Source for this section: https://rcompanion.org/handbook/C_04.html

I use the groupwiseMean function from rcompanion, which aggregates data by group (here by country) and calculates the lower and upper values of the error bars to be displayed in the graph.

Below you can see the code for this and for qplot plotting the mean values of the imbgeco variable (one measuring attitudes to immigration). It is combined with the geom_errorbar function from ggplot2.

It turns out this does not show error bars at all. Let’s take a quick look into our new data frame Sum, with mean values of imbgeco per country.

   cntry    n Mean Conf.level Trad.lower Trad.upper
1     AT 1960 4.64       0.95         NA         NA
2     BE 1759 4.98       0.95         NA         NA
3     CH 1496 6.02       0.95         NA         NA
4     CZ 2193 3.96       0.95         NA         NA
5     DE 2817 5.83       0.95         NA         NA
6     EE 1976 4.52       0.95         NA         NA
7     ES 1883 5.39       0.95         NA         NA
8     FI 1906 5.46       0.95         NA         NA
9     FR 2043 4.82       0.95         NA         NA
10    GB 1927 5.67       0.95         NA         NA
11    HU 1488 3.07       0.95         NA         NA
12    IE 2699 5.72       0.95         NA         NA
13    IL 2357 4.96       0.95         NA         NA
14    IS  869 6.69       0.95         NA         NA
15    IT 2516 4.21       0.95         NA         NA
16    LT 1940 5.03       0.95         NA         NA
17    NL 1639 5.29       0.95         NA         NA
18    NO 1525 5.63       0.95         NA         NA
19    PL 1570 5.03       0.95         NA         NA
20    PT 1243 5.68       0.95         NA         NA
21    RU 2230 3.72       0.95         NA         NA
22    SE 1503 5.75       0.95         NA         NA
23    SI 1286 3.99       0.95         NA         NA

There are only NAs for lower and upper values. Before I repeat the function, I will subset round_8_final data to exclude observations with missing data for imbgeco variable.
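The fix described above, dropping the rows with a missing imbgeco before summarising, can be sketched in base R. The helper mean_ci is hypothetical: it computes a mean with a t-based 95% interval per group as a stand-in for the kind of summary groupwiseMean reports, and it is not that function's implementation. The data are simulated.

```r
# Simulated data with a few missing answers (hypothetical values)
set.seed(205)
toy <- data.frame(
  cntry   = rep(c("AT", "BE", "CH"), each = 50),
  imbgeco = sample(0:10, 150, replace = TRUE)
)
toy$imbgeco[c(3, 60, 120)] <- NA

# Keep only the rows where imbgeco was answered
toy_complete <- subset(toy, !is.na(imbgeco))

# Hypothetical helper: mean and t-based confidence interval for one group
mean_ci <- function(x, conf = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  tq <- qt(1 - (1 - conf) / 2, df = n - 1)
  c(n = n, Mean = mean(x), lower = mean(x) - tq * se, upper = mean(x) + tq * se)
}

# Apply the helper to each country and stack the results into one table
Sum_toy <- do.call(rbind,
                   lapply(split(toy_complete$imbgeco, toy_complete$cntry), mean_ci))
Sum_toy
```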

We get a very good looking graph now. Additionally, I added theme_set(theme_minimal()) to change the colour scheme of the graph.
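A plot of group means with error bars can be sketched as below. I use ggplot() with geom_point() instead of qplot(), because qplot() is deprecated in recent ggplot2 releases; the layers are otherwise the ones named in the text. The summary table is a hypothetical one with the same column names as Sum, and its values are invented for illustration.

```r
library(ggplot2)

# Switch the default theme for all following plots
theme_set(theme_minimal())

# Hypothetical summary table shaped like Sum (values invented for illustration)
Sum_toy <- data.frame(
  cntry      = c("AT", "BE", "CH"),
  Mean       = c(4.6, 5.0, 6.0),
  Trad.lower = c(4.5, 4.9, 5.9),
  Trad.upper = c(4.7, 5.1, 6.1)
)

# Points for the means, error bars for the confidence limits
p_means <- ggplot(Sum_toy, aes(x = cntry, y = Mean)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = Trad.lower, ymax = Trad.upper), width = 0.2) +
  labs(title = "Mean of imbgeco by country",
       x = "Country", y = "Mean (0-10)")

print(p_means)
```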

R documentation for qplot function: https://www.rdocumentation.org/packages/ggplot2/versions/3.3.0/topics/qplot, while geom_errorbar here: https://www.rdocumentation.org/packages/ggplot2/versions/0.9.0/topics/geom_errorbar.

Regression results

Coefficient plots

In past practicals I introduced the tab_model function, which prepares very good looking HTML tables of regression results. The same sjPlot package has the plot_model function, which can be used to plot regression results. I’m using round_8_final without education category ‘55’ (‘other’), as it has a small number of respondents and with it you cannot treat this variable as continuous. I run a simple linear regression first (lm).

We can add labels to variables using the axis.labels option. Interestingly, you have to list them in the reverse of the order in which they appear in your model. I also add a title for the graph and use axis.title to be more specific. Finally, displaying the values of the coefficients above the dots is useful if some coefficients cluster close to the 0 line but might still be statistically significant, like age in this graph. You can do this by specifying show.values = TRUE, and move the value closer to the dot with value.offset = .3.
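The steps above (dropping education category 55, fitting lm, then plotting with labels) can be sketched as follows. The data, the predictors and the label texts are assumptions made for illustration; the plot_model options are the ones named in the text. The plotting part runs only if sjPlot is installed.

```r
# Simulated data (hypothetical values); 55 stands for the 'other' education category
set.seed(205)
n <- 400
toy <- data.frame(
  agea   = sample(15:90, n, replace = TRUE),
  gndr   = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  eisced = sample(c(1:7, 55), n, replace = TRUE)
)
toy$imbgeco <- 5 - 0.02 * toy$agea +
  0.3 * toy$eisced * (toy$eisced != 55) + rnorm(n)

# Drop category 55 so that eisced can be treated as continuous
toy_lm <- subset(toy, eisced != 55)

# Simple linear regression; coefficient order: agea, gndrMale, eisced
m1 <- lm(imbgeco ~ agea + gndr + eisced, data = toy_lm)
summary(m1)$coefficients

if (requireNamespace("sjPlot", quietly = TRUE)) {
  p_coef <- sjPlot::plot_model(
    m1,
    axis.labels  = c("Education", "Gender: male", "Age"),  # reverse of model order
    title        = "Attitudes to immigration",
    axis.title   = "Coefficient estimate",
    show.values  = TRUE,   # print the estimates next to the dots
    value.offset = 0.3     # distance between each dot and its value
  )
  print(p_coef)
}
```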

This website has some excellent examples for plot_model: https://cran.r-project.org/web/packages/sjPlot/vignettes/plot_model_estimates.html

Coefficient plots can also be produced with the coefplot package. Its multiplot function makes it easier to compare models. Some settings are very similar to those in the previous graphs. The exception is re-labelling variables with newNames, which requires the current name of each variable you want to rename and, for a categorical variable, the number of the category (you will see these when you plot with the default names). The names option adds names to the models.
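A multiplot call comparing two models might look like the sketch below. The models and data are invented for illustration, and the argument names (names, newNames, title, intercept) are given as I understand them from the coefplot documentation, so check them against the manual linked in these notes. The plot is drawn only if coefplot is installed.

```r
# Simulated data (hypothetical values); gndr is coded 1/2 as in the ESS
set.seed(205)
n <- 400
toy <- data.frame(
  agea   = sample(15:90, n, replace = TRUE),
  gndr   = factor(sample(1:2, n, replace = TRUE)),
  eisced = sample(1:7, n, replace = TRUE)
)
toy$imbgeco <- 5 - 0.02 * toy$agea + 0.3 * toy$eisced + rnorm(n)

# Two nested models to compare
m1 <- lm(imbgeco ~ agea + gndr, data = toy)
m2 <- lm(imbgeco ~ agea + gndr + eisced, data = toy)

# Default coefficient names; the factor appears with its category number (gndr2)
names(coef(m2))

if (requireNamespace("coefplot", quietly = TRUE)) {
  p_multi <- coefplot::multiplot(
    m1, m2,
    names     = c("Model 1", "Model 2"),  # labels for the models
    newNames  = c(agea = "Age", gndr2 = "Female", eisced = "Education"),
    title     = "Attitudes to immigration",
    intercept = FALSE                      # leave the intercept out of the plot
  )
  print(p_multi)
}
```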

R documentation for coefplot package: https://cran.r-project.org/web/packages/coefplot/coefplot.pdf

Let’s return now to plot_model, as it has a useful option for plotting random effects. First, I run a multilevel linear regression with respondents nested in countries using the lmer function.

The default in plot_model is type = "est", which means that fixed effects (our model coefficients) are plotted; random effects are requested with type = "re". In the graph above we see how countries differ from each other relative to the average effect (country intercepts). If you have more random effects, i.e. you would argue that the effect of another variable differs across countries, you could plot both random effects. Below I added eisced to the random part of the model.
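The two multilevel models described above, a random intercept per country and then a random slope for eisced as well, can be sketched as follows. The data are simulated with country-specific intercepts and slopes, and the fixed part of the model is my own choice. Model fitting runs only if lme4 is installed and plotting only if sjPlot is installed; with simulated data lmer may print convergence or singular-fit messages.

```r
# Simulated data with country-level variation (hypothetical values)
set.seed(205)
countries <- c("AT", "BE", "CH", "CZ", "DE", "EE", "ES", "FI")
n <- 800
toy <- data.frame(
  cntry  = sample(countries, n, replace = TRUE),
  agea   = sample(15:90, n, replace = TRUE),
  eisced = sample(1:7, n, replace = TRUE)
)
u0 <- setNames(rnorm(8, sd = 1),   countries)  # country intercept deviations
u1 <- setNames(rnorm(8, sd = 0.2), countries)  # country slope deviations
toy$imbgeco <- unname(5 + u0[toy$cntry] - 0.02 * toy$agea +
                        (0.3 + u1[toy$cntry]) * toy$eisced + rnorm(n))

if (requireNamespace("lme4", quietly = TRUE)) {
  # Random intercept: respondents nested in countries
  m_ri <- lme4::lmer(imbgeco ~ agea + eisced + (1 | cntry), data = toy)

  # Random intercept plus a random slope for eisced
  m_rs <- lme4::lmer(imbgeco ~ agea + eisced + (1 + eisced | cntry), data = toy)

  # Country-level deviations from the average intercept and slope
  print(lme4::ranef(m_rs)$cntry)

  if (requireNamespace("sjPlot", quietly = TRUE)) {
    print(sjPlot::plot_model(m_ri, type = "re"))  # country intercepts
    print(sjPlot::plot_model(m_rs, type = "re"))  # intercepts and eisced slopes
  }
}
```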

R documentation to find about many other options: https://www.rdocumentation.org/packages/sjPlot/versions/2.8.3/topics/plot_model

Marginal effects

Finally, using plot_model we can plot predicted values of attitudes for different values of the independent variables in the model by specifying type = "pred". These are so-called ‘marginal effects’ (as we specify ‘margins’/borders/constraints for the estimation) and tell us how a dependent variable (attitudes) changes across values of a specific independent variable. You request this with the terms option.

If a variable is metric (linear), plot_model will plot the effects as a line graph. Below you can see the (estimated) attitudes to immigration of people of different ages according to our model, keeping all other variables fixed.
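To make clear what such a plot shows, the predicted values can also be computed by hand with base R's predict(). This is a stand-in for illustration and not what plot_model runs internally; with sjPlot, the call would be along the lines of plot_model(m1, type = "pred", terms = "agea"). The data are simulated, and holding eisced at its sample mean is my choice for the sketch.

```r
# Simulated data (hypothetical values)
set.seed(205)
n <- 400
toy <- data.frame(
  agea   = sample(15:90, n, replace = TRUE),
  eisced = sample(1:7, n, replace = TRUE)
)
toy$imbgeco <- 6 - 0.03 * toy$agea + 0.3 * toy$eisced + rnorm(n)

m1 <- lm(imbgeco ~ agea + eisced, data = toy)

# Grid of ages, with the other predictor held fixed at its mean
grid <- data.frame(agea = seq(20, 80, by = 10), eisced = mean(toy$eisced))

# Predicted attitudes with 95% confidence limits (columns fit, lwr, upr)
pred <- predict(m1, newdata = grid, interval = "confidence")
cbind(grid, pred)

# Line for the predictions, dashed lines for the confidence limits
plot(grid$agea, pred[, "fit"], type = "l", ylim = range(pred),
     xlab = "Age", ylab = "Predicted imbgeco",
     main = "Predicted attitudes by age")
lines(grid$agea, pred[, "lwr"], lty = 2)
lines(grid$agea, pred[, "upr"], lty = 2)
```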

Next, let’s plot marginal effects across values of a categorical variable – interest in politics. This is displayed as a graph with values for each category with confidence intervals.

This is a useful page with overview of more customisation options for plot_model: https://strengejacke.github.io/sjPlot/articles/plot_marginal_effects.html

The terms option is particularly handy for models with interaction terms. Below I specify a model in which age is interacted with being in an ethnic minority, as in my paper I argued (let’s assume so) that the effect of age on attitudes will differ between ethnic majority and minority groups.

Indeed, the lines are not parallel, so the strength of the relationship differs: the effect of age is weaker for people who self-identify as belonging to an ethnic minority group. However, the 95% confidence intervals overlap, which suggests that this difference may not be statistically significant.

In the second model I interact two categorical variables, arguing that the effect of political interest on attitudes will again differ between the ethnic majority and ethnic minority populations.
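The age by minority interaction can be sketched with base R as below. The data are simulated so that the age slope is weaker in the minority group, and blgetmg is given readable labels instead of the ESS codes. With sjPlot, the corresponding plot would be requested along the lines of plot_model(m_int, type = "pred", terms = c("agea", "blgetmg")). Here the predictions are computed by hand to show why non-parallel lines indicate an interaction.

```r
# Simulated data (hypothetical values): the age slope differs by group
set.seed(205)
n <- 600
toy <- data.frame(
  agea    = sample(15:90, n, replace = TRUE),
  blgetmg = factor(sample(c("Majority", "Minority"), n,
                          replace = TRUE, prob = c(0.8, 0.2)))
)
slope <- ifelse(toy$blgetmg == "Majority", -0.03, -0.005)
toy$imbgeco <- 6 + slope * toy$agea + rnorm(n)

# agea * blgetmg expands to both main effects plus their interaction
m_int <- lm(imbgeco ~ agea * blgetmg, data = toy)
coef(m_int)

# Predicted attitudes at two ages for each group
grid <- expand.grid(agea = c(20, 80), blgetmg = levels(toy$blgetmg))
grid$fit <- predict(m_int, newdata = grid)
grid

# Implied age slope per group: change in prediction per year of age
slopes <- tapply(grid$fit, grid$blgetmg, diff) / 60
slopes
```

The difference between the two slopes equals the interaction coefficient agea:blgetmgMinority, and the test of that coefficient in summary(m_int) is the formal check of whether the lines differ.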

Some other options in plotting interaction terms with plot_model are nicely discussed here: https://strengejacke.github.io/sjPlot/articles/plot_interactions.html