R Code Quick Reference for Assignment 2 (SO648, 2026)

Assignment 2

1500 words (excluding code), due Monday May 11th, 23:59

Select a dataset from those provided to you on Moodle.
Identify a dependent variable, and three independent variables to base your assignment on.
Produce a linear regression model using the variables identified in step 2.
Interpret the results of your model.
Give a short discussion, with supporting references, of your results.
Include the code used to produce your results at the end of your assignment.

This page contains everything you need to answer this. You should use this page by following along in RStudio as you read. Wherever you encounter code, you can copy it over to RStudio and run it yourself. The key to doing the assignment in this way, with this page as a guide, is (1) using the supplied code, do the examples yourself as you do, and (2) substitute the variables, titles, labels, etc with those chosen by you, for your assignment. Ultimately, you can complete this assignment with four variables from any one of the course datasets. All of the steps you need to follow are given below. Please read the text fully and carefully before emailing for assistance (which you are welcome to do - but if the answer is in this page somewhere, I will politely send you back here). The backup video tutorials are available from my YouTube course playlist at this link.

01. Setting up Your Session in R

01.1 Setting your directory

The first thing you should do in any R session is ensure you have saved all of your required data files to a folder, and then set your working directory to point to this folder. Download and save all of the required files from Moodle, put them in a folder titled ‘so648_2026’, then set your working directory using the code below. Remember, you will need to change the folder address to the working directory on your own computer. Mine is in a folder located in a sub folder ‘…Teaching/so648_2026’. Use the code below to set yours.

setwd("C:/Users/eflaherty/iCloudDrive/Teaching/so648_2026")

01.2 Packages and libraries

Next you need to load the packages (libraries) that you will use for your work session. The ones you will need for this assignment can be loaded with the following code.

library(pacman)
pacman::p_load(dplyr, readr, tidyverse, ggplot2, haven, janitor, ggcorrplot, labelled, 
               Hmisc, pastecs, vtable, broom, GGally, cowplot, ggspatial, 
               expss, patchwork, gapminder, dendroTools, gtsummary, forcats, xfun)

This is the new procedure we introduced in week 8, using the pacman package. This allows yuou to specify a list of packages and load them into R. It will detect if any are not installed, and complete the installation for you. If you ever need to manually install a package, for example, ggplot, then you can use the following code.

install.packages("ggplot2")

01.3 Loading data from .csv (spreadsheet) format

Next, you will need some data. Making sure your datasets are saved into your working directory, as outlined above, run the following code (which depends on the ‘readr’ library) to read .csv spreadsheet data into R. The resulting dataset will save to the Environment panel to the right in RStudio. The code below will get two .csv files from your working directory, so make sure the data files from Moodle are saved into the folder set above as your working directory. If you loading in your own data that you have produced yourself in Excel, ensure that it is saved in .csv format, and not .xls or .xlsx format if you are using the commands below as they rely on read_csv. R can handle other Excel formats however, and you can use the manual Import Dataset procedure in the Environment window to import your own data if you want

so648_country <- read_csv("so648_country_2026.csv")
eurostat <- read_csv("eurostat_data_2026.csv")

01.4 Loading data from .dta (Stata) format

The European Social Survey file is supplied in a format native to the statistical analysis programme Stata. Stata uses a .dta extension to save its datasets. We can use the haven package from R to import this, and you have already installed it using the above combined package installation and library loading process provided by pacman. The code to import the ESS dataset is as follows - make sure you have set your working directory already by following the procedures above.

ess11_ireland <- read_dta("ess11_ireland.dta")

Stata-formatted datasets will be imported to R with a few specific differences that we will need to take care of with some additional code. First, Stata has a particular way of assigning different codes for different kinds of missing data. Whilst we tend to use NA in R as a universal designation for a missing value, Stata uses codes such as ‘.a, .b, .c,’ to designate cases where the question is refused, the respondent doesn’t know, or where data are missing possibly due to mis-coding. You will notice some additional code in some of the procedures below, and this is to account for two things: (1) how R has handled the missing cases when the data were imported from the original .dta Stata file, and (2) how R assigns a variable type to the imported variables. The second issue will largely affect how R identifies certain variables as factors, and sometimes where you see additional code, it will be to ‘force’ R to treat a variable as a factor, without making complex or permanent changes to the dataset. This is in line with good practice in data pipelining, where we want to pass, or ‘pipe’ our data through certain filters or settings, before applying our commands. This will leave the original dataset largely untouched, and will allow us to do many iterations of certain actions using our code, without altering the original dataset. We will deal with these issues as we encounter them below, and don’t worry, you won’t have to figure this part out yourselves. Let’s see how this looks in practice. In the output below, we query the structure of a specific variable, first in the World Bank dataset, then in the ESS dataset.

str(ess11_ireland$trstplc)

##  dbl+lbl [1:2017]     3,    10,     3,     9,    10,     5,     7,     4,  ...
##  @ label       : chr "Trust in the police"
##  @ format.stata: chr "%4.0g"
##  @ labels      : Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
##   ..- attr(*, "names")= chr [1:14] "No trust at all" "1" "2" "3" ...

str(so648_country$eu)

##  chr [1:47] "Non-EU" "EU" "EU" "Non-EU" "EU" "Non-EU" "Non-EU" "Non-EU" ...

As you can see from the output, the variable eu from the so648_country dataset is a character variable recording countries as either EU or Non-EU. The variable trstplt from the ess11_ireland dataset contains more information. It’s type is dbl_lbl which is a labelled format variable, and the assigned labels are listed in the @ labelsrow. We could query for further information on this, for example using the unique command before the bracketed dataset and variable name, but for now all we need to appreciate is that when R imports the ESS dataset from its native Stata format, it assigns a specific type to it, and also retains the labelling system created in Stata. Later, we may want to treat these variables as ‘factors’ in our regression analysis, and to do so will require some fine-tuning in the code. We will do this by ‘piping’ the data to the command, applying some additional settings in the process.

02. Summary Statistics and Basic Visaulisations in R

02.1 Simple summary statistics

If you want to check your variables as you did with assignment 1, you can use a small selection of commands. The focus of the assignment is on regression modelling, but it is always necessary to perform some basic inspections of the data to get familiar with the variables you are working with. Let’s start with trstplc from the European Social Survey. Let’s first check the variable structure, and get a look at the unique values in the variable.

str(ess11_ireland$trstplc)

##  dbl+lbl [1:2017]     3,    10,     3,     9,    10,     5,     7,     4,  ...
##  @ label       : chr "Trust in the police"
##  @ format.stata: chr "%4.0g"
##  @ labels      : Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
##   ..- attr(*, "names")= chr [1:14] "No trust at all" "1" "2" "3" ...

unique(ess11_ireland$trstplc)

## <labelled<double>[12]>: Trust in the police
##  [1]     3    10     9     5     7     4     8     0     2 NA(a)     6     1
## 
## Labels:
##  value           label
##      0 No trust at all
##      1               1
##      2               2
##      3               3
##      4               4
##      5               5
##      6               6
##      7               7
##      8               8
##      9               9
##     10  Complete trust
##  NA(a)         Refusal
##  NA(b)      Don't know
##  NA(c)       No answer

If you want, you could also use val_labels(ess11_ireland$trstplc) although the output is slightly less legible for large variables. Next, we take a look at the summary statistics. This is a little different, since trstplc is a variable that by definition is ordinal, but which we can sometimes treat as ratio, and therefore calculate a mean. The mean

summary(ess11_ireland$trstplc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.000   7.000   6.508   8.000  10.000      18

describe(ess11_ireland$trstplc)

##    vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 1999 6.51 2.36      7    6.51 1.48   0  10    10 -0.93     0.54 0.05

round(prop.table(table(as_factor(ess11_ireland$trstplc))) * 100, 1)

## 
## No trust at all               1               2               3               4 
##             3.2             1.9             2.7             3.6             5.1 
##               5               6               7               8               9 
##            11.8            11.2            20.0            23.1            10.4 
##  Complete trust         Refusal      Don't know       No answer 
##             6.3             0.0             0.8             0.0

So from the above pieces of output, we can see that trstplc is a variable coded from 0-10, where 0 = NO trust at all, and 10 = Complete trust. It has a mean of 6.51, a standard deviation of 2.36, and a median of 7. 23.26% of respondents recorded their trust in the police as ‘8’.

02.2 Basic visualisations

The example graph below uses the same variable, this time drawing a bar chart to show the distribution of cases across each category. This is the same code you used in assignment 1, but note some additions in the code, namely %>% filter(!is.na(trstplc) (a filter that removes any NA categories from the procedure), and as_factor(trstplc) (a temporary transformation that will instruct R to treat the variable trstplc as a factor variable). This will ensure that it recognises the variable as made up of factors with a fixed set of values.

ggplot(data = ess11_ireland %>% filter(!is.na(trstplc)), 
       aes(x = as_factor(trstplc))) +
  labs(
    y = "Percentage",
    x = "Trust in the Police"
  ) +
  geom_bar(aes(y = after_stat(count / sum(count) * 100))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Repeating the process with another of our possible independent variables, maritalb (marital status), we need to take a little more care. This is a nominal variable, and therefore calculating a mean or median would not make sense. Instead, we can see the percentage of respondents falling into each marital status category, and graph this. Taking the code from above and adapting it with this variable, we get the following output.

round(prop.table(table(as_factor(ess11_ireland$maritalb))) * 100, 1)

## 
##                                                    Legally married 
##                                                               51.0 
##                                In a legally registered civil union 
##                                                                1.3 
##                                                  Legally separated 
##                                                                2.7 
##                             Legally divorced/Civil union dissolved 
##                                                                5.1 
##                                         Widowed/Civil partner died 
##                                                                8.5 
## None of these (NEVER married or in legally registered civil union) 
##                                                               30.3 
##                                                            Refusal 
##                                                                0.7 
##                                                         Don't know 
##                                                                0.3 
##                                                          No answer 
##                                                                0.0

ggplot(data = ess11_ireland %>% filter(!is.na(maritalb)), 
       aes(x = as_factor(maritalb))) +
  labs(
    y = "Percentage",
    x = "Marital status"
  ) +
  geom_bar(aes(y = after_stat(count / sum(count) * 100))) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

03. Linear Regression in R

03.1 Doing linear regression

Once you have chosen your set of variables (three independent and one dependent), you are ready to implement your regression model. Remember, the method we have reviewed in class covers linear regression for ratio dependent variables only. This means your dependent variable should be measures at the ratio level, or at least be subject to the same assumptions applied to the trstplc variable above - a variable with sufficient categories to be meaningfully treated as numerical, even if by definition it is not.

Taking the example we looked at in class, we tried to model political placement in relation to education. Graphically, this involved plotting the distribution of respondent scores on both variables, and fitting a linear trend to this plot to see the general direction of association. We found a small propensity toward leftist scores with higher levels of education, and we implemented this using the code as shown below. In particualar, notice the inclusion of another layer, geom-jitter. One of the nice features of graphing in R with ggplot is that we can layer up our graphs as we like. Without the jitter function, the graph will show heaping on the individual scores of lrscale, and make the graph harder to read. Try running the code below again with this line removed, if you want to compare the difference.

ggplot(data = ess11_ireland) +
  geom_point(mapping = aes(x = eduyrs, y = lrscale)) +
  theme_gray(base_size = 12) +
  geom_jitter(aes(x = eduyrs, y = lrscale), alpha = .5, width = 2, height = .5) +
  geom_smooth(aes(x = eduyrs, y = lrscale), method = "lm")

At the conclusion of week 10, we fit out model, plotted, and interpreted the results using the <- lm procedure to assign a linear model to an object in the global environment. In the example below, we model placement on the left-right scale (lrscle) as a function of eduyrs, nwspol, atchctr, and iphppla. By interpreting the results shown below, we were table to conlcude that education had a small effect on political placement, whilst attachment to country and prosocial values also showed some evidence of association. Applying the ‘unit effect interpretation’, each additional year of education is associated with a change of -.027 units of lrscale - a modest left-ward driaft, as 0 = Left in the coding of lrscale.

model1 <- lm(formula = lrscale ~ eduyrs + nwspol + atchctr + iphlppla,
                 data = ess11_ireland)

summary(model1)

## 
## Call:
## lm(formula = lrscale ~ eduyrs + nwspol + atchctr + iphlppla, 
##     data = ess11_ireland)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7403 -0.8961  0.0550  0.8566  6.0187 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.0640813  0.3146564  12.916  < 2e-16 ***
## eduyrs      -0.0269096  0.0119747  -2.247   0.0248 *  
## nwspol       0.0005866  0.0004597   1.276   0.2021    
## atchctr      0.1029085  0.0261734   3.932 8.77e-05 ***
## iphlppla     0.1949188  0.0478931   4.070 4.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.941 on 1695 degrees of freedom
##   (317 observations deleted due to missingness)
## Multiple R-squared:  0.02058,    Adjusted R-squared:  0.01827 
## F-statistic: 8.905 on 4 and 1695 DF,  p-value: 4.084e-07

There are instances, however, where our variables can not be interpreted as either ratio (‘real’ numbers), or ordinal (rank-ordered), and to demonstrate this, we looked at the case of religiosity as measured by frequency of religious service attendance. First, examine the variable using the procedures shown above. We might use str or unique and then create a table to do so. This shows us that rlgatnd is an ordinal variable that measures attendance in the form of frequency, but with intervals that are not evenly spaced. In other words, we cannot say precisely how many ‘units’ more of attendance ‘less often’ is compared to ‘daily’.

str(ess11_ireland$rlgatnd)

##  dbl+lbl [1:2017]     5,     6,     7,     7,     3,     1,     3,     3,  ...
##  @ label       : chr "How often attend religious services apart from special occasions"
##  @ format.stata: chr "%4.0g"
##  @ labels      : Named num [1:10] 1 2 3 4 5 6 7 NA NA NA
##   ..- attr(*, "names")= chr [1:10] "Every day" "More than once a week" "Once a week" "At least once a month" ...

unique(ess11_ireland$rlgatnd)

## <labelled<double>[8]>: How often attend religious services apart from special occasions
## [1]     5     6     7     3     1 NA(a)     2     4
## 
## Labels:
##  value                     label
##      1                 Every day
##      2     More than once a week
##      3               Once a week
##      4     At least once a month
##      5 Only on special holy days
##      6                Less often
##      7                     Never
##  NA(a)                   Refusal
##  NA(b)                Don't know
##  NA(c)                 No answer

round(prop.table(table(as_factor(ess11_ireland$rlgatnd))) * 100, 1)

## 
##                 Every day     More than once a week               Once a week 
##                       1.9                       4.8                      21.0 
##     At least once a month Only on special holy days                Less often 
##                      12.5                      16.3                      21.4 
##                     Never                   Refusal                Don't know 
##                      21.6                       0.4                       0.1 
##                 No answer 
##                       0.0

Including this variable in a regression without modification would give us a result that does not make sense, since what we want to know is the effect of each category membership (each variety of attendance frequency) on the dependent variable (political orientation). The coefficient in the rlgatnd line under Coefficients below cannot be interpreted. We need to transform the variable first into a factor variable, before supplying it to the lm command.

model2 <- lm(formula = lrscale ~ eduyrs + nwspol + rlgatnd + iphlppla,
                 data = ess11_ireland)

summary(model2)

## 
## Call:
## lm(formula = lrscale ~ eduyrs + nwspol + rlgatnd + iphlppla, 
##     data = ess11_ireland)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8833 -0.8851  0.0954  0.7868  5.7286 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.8558267  0.2476185  23.649  < 2e-16 ***
## eduyrs      -0.0149707  0.0119302  -1.255  0.20971    
## nwspol       0.0004853  0.0004550   1.067  0.28634    
## rlgatnd     -0.2157747  0.0282370  -7.642 3.57e-14 ***
## iphlppla     0.1521931  0.0467505   3.255  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.92 on 1690 degrees of freedom
##   (322 observations deleted due to missingness)
## Multiple R-squared:  0.04463,    Adjusted R-squared:  0.04237 
## F-statistic: 19.74 on 4 and 1690 DF,  p-value: 6.778e-16

This is how we do it for rlgatnd. Although the code looks lengthy, take a moment to look at the component parts line by line. We have asked R to mutate the variable into a new variable called rlgatnd_factor, supply this to the original dataset ess11_ireland, and to calculate it by treating the original rlgatnd as a factor with as_factor.

ess11_ireland <- ess11_ireland %>%
  mutate(rlgatnd_factor = as_factor(rlgatnd) %>%
           fct_recode(NULL = "Refusal", NULL = "Don't know", NULL = "No answer", NULL = "Not applicable"))

Now, we can supply this to the regresion command, and observe what happens with the results.

model3 <- lm(formula = lrscale ~ eduyrs + nwspol + rlgatnd_factor + iphlppla,
                 data = ess11_ireland)

summary(model3)

## 
## Call:
## lm(formula = lrscale ~ eduyrs + nwspol + rlgatnd_factor + iphlppla, 
##     data = ess11_ireland)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0240 -0.8865  0.1001  0.8461  5.8155 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              5.7853799  0.3943382  14.671  < 2e-16
## eduyrs                                  -0.0152520  0.0119738  -1.274 0.202918
## nwspol                                   0.0005167  0.0004560   1.133 0.257294
## rlgatnd_factorMore than once a week     -0.8890253  0.3896850  -2.281 0.022649
## rlgatnd_factorOnce a week               -0.4371420  0.3395978  -1.287 0.198189
## rlgatnd_factorAt least once a month     -1.0005650  0.3496561  -2.862 0.004268
## rlgatnd_factorOnly on special holy days -0.8911214  0.3443221  -2.588 0.009735
## rlgatnd_factorLess often                -1.1756930  0.3391098  -3.467 0.000540
## rlgatnd_factorNever                     -1.5294429  0.3396458  -4.503 7.16e-06
## iphlppla                                 0.1563222  0.0466660   3.350 0.000827
##                                            
## (Intercept)                             ***
## eduyrs                                     
## nwspol                                     
## rlgatnd_factorMore than once a week     *  
## rlgatnd_factorOnce a week                  
## rlgatnd_factorAt least once a month     ** 
## rlgatnd_factorOnly on special holy days ** 
## rlgatnd_factorLess often                ***
## rlgatnd_factorNever                     ***
## iphlppla                                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.916 on 1685 degrees of freedom
##   (322 observations deleted due to missingness)
## Multiple R-squared:  0.05183,    Adjusted R-squared:  0.04676 
## F-statistic: 10.23 on 9 and 1685 DF,  p-value: 1.737e-15

Now we can interpret the results, because we can see the unique effect of ‘being in’ each group, relative to the reference group of ‘Daily’. A note on how to interpret this. R will always designate the first category of the variable the reference category unless we tell it otherwise. So, the coefficient for ‘Never’ is interpreted as the difference in lrscale scores for those who Never attend relative to those who attend Daily. The effect on lrscale is -1.53 for those in the never group, a considerable leftward effect. Similarly, those who attend At least once a month exhibit a difference of -1 points on lrscale relative to those who attend Daily. The unit effect interpretation still applied, it is just interpreted now as the effect of category membership on dependent variable scores. In plain language, less frequent religious service attendance is associated with more leftist placement on the political continuum.

03.2 Visualising your results

Finally, though not necessarily, you might want to visualise your results to help with interpretation. There are several ways to do this, but the broom/tidy method is most straightforward. Taking our model output from the above regression as assigned to the object model3, we can tidy up the results and pass them to a plot in ggplot.

out_model3 <- tidy(model3)
out_model3 |> round_df()

## # A tibble: 10 × 5
##    term                                    estimate std.error statistic p.value
##    <chr>                                      <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)                                5.78      0.394     14.7    0    
##  2 eduyrs                                    -0.015     0.012     -1.27   0.203
##  3 nwspol                                     0.001     0          1.13   0.257
##  4 rlgatnd_factorMore than once a week       -0.889     0.39      -2.28   0.023
##  5 rlgatnd_factorOnce a week                 -0.437     0.34      -1.29   0.198
##  6 rlgatnd_factorAt least once a month       -1.00      0.35      -2.86   0.004
##  7 rlgatnd_factorOnly on special holy days   -0.891     0.344     -2.59   0.01 
##  8 rlgatnd_factorLess often                  -1.18      0.339     -3.47   0.001
##  9 rlgatnd_factorNever                       -1.53      0.34      -4.50   0    
## 10 iphlppla                                   0.156     0.047      3.35   0.001

ggplot(data = filter(out_model3, term != "(Intercept)")) +
  geom_point(mapping = aes(x = estimate, y = reorder(term, estimate)))

Don’t forget your earlier ggplot annotation code - we can make this a lot neater by adding labels and source details.

out_model3 <- tidy(model3)
out_model3 |> round_df()

## # A tibble: 10 × 5
##    term                                    estimate std.error statistic p.value
##    <chr>                                      <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)                                5.78      0.394     14.7    0    
##  2 eduyrs                                    -0.015     0.012     -1.27   0.203
##  3 nwspol                                     0.001     0          1.13   0.257
##  4 rlgatnd_factorMore than once a week       -0.889     0.39      -2.28   0.023
##  5 rlgatnd_factorOnce a week                 -0.437     0.34      -1.29   0.198
##  6 rlgatnd_factorAt least once a month       -1.00      0.35      -2.86   0.004
##  7 rlgatnd_factorOnly on special holy days   -0.891     0.344     -2.59   0.01 
##  8 rlgatnd_factorLess often                  -1.18      0.339     -3.47   0.001
##  9 rlgatnd_factorNever                       -1.53      0.34      -4.50   0    
## 10 iphlppla                                   0.156     0.047      3.35   0.001

ggplot(data = filter(out_model3, term != "(Intercept)")) +
  geom_point(mapping = aes(x = estimate, y = reorder(term, estimate))) +
  labs(
    x = "Coefficient",
    y = "variable / factor level",
    title = "Regression Output: DV = lrscale",
    subtitle = "0 = Left, 10 = Right",
    caption = "Source European Social Survey Round 11, 2023"
  )

03.3 Categorical dependent variables

Sometimes, you might want to use a dependent variable that is non-numeric, either nominal or ordinal. This belongs to a different set of techniques more broadly known as logistic regression. Logistic regression is very common in the social sciences especially as so much of what we want to measure or explain does not fit the conditions or assumptions of ratio measurement. The easiest way to implement this initially, is to do so with a binary dependent variable. This is a variable that can only take on two categories. Let’s look at an example with the vote variable. Inspecting the variable using the code below, we see how it takes on three categories (Yes, No, and Not eligible to vote).

unique(ess11_ireland$vote)

## <labelled<double>[4]>: Voted last national election
## [1]     1     2     3 NA(a)
## 
## Labels:
##  value                label
##      1                  Yes
##      2                   No
##      3 Not eligible to vote
##  NA(a)              Refusal
##  NA(b)           Don't know
##  NA(c)            No answer

round(prop.table(table(as_factor(ess11_ireland$vote))) * 100, 1)

## 
##                  Yes                   No Not eligible to vote 
##                 75.2                 17.4                  7.1 
##              Refusal           Don't know            No answer 
##                  0.1                  0.2                  0.0

What if we wanted to model voting as our dependent variable, with gender, education, and political orientation as our independent variables? First, we need to do some recoding to get the vote variable into two categories.

vote_factor <- to_factor(ess11_ireland$vote)
ess11_ireland$vote_binary <- factor(
  ifelse(vote_factor == "Yes", "Yes", "No"),
  levels = c("No", "Yes")
)
table(ess11_ireland$vote_binary)

## 
##   No  Yes 
##  493 1517

What have we done here? In the ifelseline we have identified, first of all our key category of ‘Yes’, and that the binary will be composed of those answering ‘Yes’, with others assigned to ‘No’. The two levels of the variable will be defined as ‘No’ and ‘Yes’ respectively. Looking at this another way, and with a more complex variable (marital status), we apply the same logic - identifying the key category (the ‘thing we want to explain’ - in this case, being single), and designate this as 1, and all others as 0. For marital status, we could do it like this.

maritalb_factor <- to_factor(ess11_ireland$maritalb)
ess11_ireland$never_married <- factor(
  ifelse(maritalb_factor == "None of these (NEVER married or in legally registered civil union)", 
         "Never Married/Registered", 
         "Ever Married/Registered"),
  levels = c("Ever Married/Registered", "Never Married/Registered")
)

table(ess11_ireland$never_married, useNA = "ifany")

## 
##  Ever Married/Registered Never Married/Registered                     <NA> 
##                     1383                      611                       23

Now, if we want to model this, we can use the logistic regression procedure to do so. The interpretation is slightly different here, as the coefficients must be interpreted as the effect of a unit change in the independent variable on the

model4 <- glm(vote_binary ~ trstplc + eduyrs + rlgatnd_factor, 
             data = ess11_ireland, 
             family = binomial)
tidy(model4, exponentiate = TRUE, conf.int = TRUE)

## # A tibble: 9 × 7
##   term                   estimate std.error statistic p.value conf.low conf.high
##   <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)               3.43     0.586      2.10  3.56e-2   1.19      12.5  
## 2 trstplc                   1.12     0.0222     4.94  7.81e-7   1.07       1.17 
## 3 eduyrs                    1.01     0.0141     0.802 4.22e-1   0.984      1.04 
## 4 rlgatnd_factorMore th…    0.422    0.589     -1.46  1.44e-1   0.116      1.23 
## 5 rlgatnd_factorOnce a …    0.696    0.549     -0.660 5.09e-1   0.202      1.84 
## 6 rlgatnd_factorAt leas…    0.576    0.557     -0.991 3.22e-1   0.165      1.55 
## 7 rlgatnd_factorOnly on…    0.331    0.545     -2.03  4.26e-2   0.0964     0.865
## 8 rlgatnd_factorLess of…    0.362    0.543     -1.87  6.10e-2   0.106      0.939
## 9 rlgatnd_factorNever       0.241    0.540     -2.63  8.52e-3   0.0708     0.623

What we have here is a table reporting the results of the model in odds ratios, given by the line of code tidy(model, exponentiate = TRUE, conf.int = TRUE), calling on the broom package that we have used already to tidy up our regression output. The odds ratios can be interpreted using the same unit interpretation, only this time the effect is not read in terms of the score on the dependent variable, but on the odds of being in the ‘Yes’ category. In other words, each line in the term column above shows the effect of each variable or category on the odds of voting. So, for trstplc, the odds ratio is 1.116 (with rounding). Therefore, a one-point or unit increase in trust in the police is associated with an 11.6% increase in the odds of voting. For religious attendance, particularly the rlgatnd_factorNever group, the odds ratio is 0.241. Whenever we have an odds ratio less than 1, we interpret this as ‘negative’, since the odds ratios in this form can be read as differences between proportions in a table. Taking 1-OR gives us .759, and we interpret the odds of voting for those in the ‘Never’ group as having 75.9% lower odds of voting than those who attend daily. You can also map the resulting output using the same mapping code as we did for the linear regression above.

out_model4 <- tidy(model4)
out_model4 |> round_df()

## # A tibble: 9 × 5
##   term                                    estimate std.error statistic p.value
##   <chr>                                      <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                                1.23      0.586     2.10    0.036
## 2 trstplc                                    0.11      0.022     4.94    0    
## 3 eduyrs                                     0.011     0.014     0.802   0.422
## 4 rlgatnd_factorMore than once a week       -0.862     0.589    -1.46    0.144
## 5 rlgatnd_factorOnce a week                 -0.362     0.549    -0.66    0.509
## 6 rlgatnd_factorAt least once a month       -0.552     0.557    -0.991   0.322
## 7 rlgatnd_factorOnly on special holy days   -1.11      0.545    -2.03    0.043
## 8 rlgatnd_factorLess often                  -1.02      0.543    -1.87    0.061
## 9 rlgatnd_factorNever                       -1.42      0.54     -2.63    0.009

ggplot(data = filter(out_model4, term != "(Intercept)")) +
  geom_point(mapping = aes(x = estimate, y = reorder(term, estimate))) +
  labs(
    x = "Odds ratio",
    y = "variable / factor level",
    title = "Logistic Regression Output: DV = Vote",
    subtitle = "1 = Voted, 0 = Did not vote",
    caption = "Source European Social Survey Round 11, 2023"
  )

04. Saving Output

Want to save something? If you have an image you want to save, you can save it manually by going to the plot pane (the lower-right quadrant of RStudio) and clicking Export then Save as Image as shown on the image below. Give it a file name, and it will save to your working directory once you hit save.

Saving Your Graph

If you are happy with your summary statistics table, and want to save that, then you can also use the export option as shown below. You have two options here. By selecting ‘Save as Image…’ you can change the resolution and file type. By selecting ‘Copy to Clipboard’ you can then go straight to a word document and paste it in. Either method is fine.

Saving Your Tables