Assignment 1

Write a 1000-word (excluding code) assignment addressing the following questions. The assignment is due via online submission (Moodle) on Monday March 30th, 23:59.

  1. Select a dataset from those provided to you on Moodle.
  2. Produce a table of summary statistics including at least three variables.
  3. Produce two visualisations from your data (boxplot, histogram, scatterplot).
  4. Write a brief commentary on what you observe from your analysis.
  5. Include the code used to conduct your analysis.
  6. Find one journal article on a topic represented by one or more of your variables and provide a short summary of its main argument.

This page contains everything you need to complete the assignment. Follow along in RStudio as you read: wherever you encounter code, you can copy it over to RStudio and run it yourself. The key to working in this way, with this page as a guide, is to (1) run the examples yourself as you go, using the supplied code, and (2) substitute the variables, titles, labels, etc. with those you have chosen for your assignment. Ultimately, you can complete this assignment with three variables from one of the datasets. All of the steps you need to follow are given below. Please read the text fully and carefully before emailing for assistance (which you are welcome to do, but if the answer is on this page somewhere, I will politely send you back here). The backup video tutorials are available from my YouTube course playlist at this link.

01. Setting up Your Session in R

The first thing you should do in any R session is ensure you have saved all of your required data files to a folder, and then set your working directory to point to this folder. Download and save all of the required files from Moodle, put them in a folder titled ‘so648_2026’, then set your working directory using the code below. Remember, you will need to change the folder address to the working directory on your own computer. Mine is in a folder located in a sub folder ‘…Teaching/so648_2026’. Use the code below to set yours.

setwd("C:/Users/eflaherty/iCloudDrive/Teaching/so648_2026")
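A quick way to confirm that the working directory is set correctly is to print it and check that the data files are visible from it. The lines below are a minimal sketch, assuming the two file names used in the read_csv() calls later on this page. They only report what they find and change nothing.

```r
# Print the folder R is currently pointed at
getwd()

# File names assumed from the read_csv() calls later on this page
needed_files <- c("so648_country_2026.csv", "eurostat_data_2026.csv")

# TRUE means the file is in the working directory; FALSE means it is not
found <- file.exists(needed_files)
names(found) <- needed_files
print(found)
```

If either value is FALSE, the file is not in the folder that getwd() printed, and read_csv() will not find it there.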

Next you need to load the packages (libraries) that you will use for your work session. The ones you will need for this assignment can be loaded with the following code.

library(dplyr)
library(readr)
library(tidyverse)
library(ggplot2)
library(haven)
library(labelled)
library(vtable)

If you receive an error message for any of the above, you likely need to install the package. This can happen if you have not installed it before trying to load the library, or if you are working on a new computer. You only need to install each package once; after that, you can simply load the libraries using the above code. If you need to install a package, for example ggplot2, then use the following code.

install.packages("ggplot2")
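If you are unsure which packages are missing, you can ask R to compare the list used on this page against what is already installed. This is a sketch only: the object names needed, installed and missing_pkgs are my own for illustration, and the install line is left commented out so that nothing is installed until you choose to run it.

```r
# Packages loaded on this page
needed <- c("dplyr", "readr", "tidyverse", "ggplot2",
            "haven", "labelled", "vtable")

# Names of everything already installed on this computer
installed <- rownames(installed.packages())

# Keep only the packages that are not yet installed
missing_pkgs <- setdiff(needed, installed)
print(missing_pkgs)

# Uncomment the next line to install whatever is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

An empty result (character(0)) means everything on the list is already installed.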

Next, you will need some data. Making sure your datasets are saved into your working directory, as outlined above, run the following code (which depends on the ‘readr’ library) to read .csv spreadsheet data into R. The resulting dataset will save to the Environment panel to the right in RStudio. The code below will get two .csv files from your working directory, so make sure the data files from Moodle are saved into the folder set above as your working directory.

so648_country <- read_csv("so648_country_2026.csv")
eurostat <- read_csv("eurostat_data_2026.csv")
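To see what read_csv() does without touching the course files, the sketch below writes a tiny invented dataset to a temporary .csv file and reads it back in. The values and the names toy, toy_path and toy_in are made up for illustration and are not the Moodle data.

```r
library(readr)

# Invented values, for illustration only (not the Moodle data)
toy <- data.frame(cntry = c("A", "B", "C"),
                  life  = c(80.1, 76.4, 82.3))

# Write the toy data to a temporary .csv file, then read it back in
toy_path <- file.path(tempdir(), "toy_country.csv")
write_csv(toy, toy_path)
toy_in <- read_csv(toy_path, show_col_types = FALSE)

# Number of rows and columns, and the type of each column
dim(toy_in)
sapply(toy_in, class)
```

The same two checks, dim() and sapply(..., class), can be run on so648_country and eurostat once they are loaded.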

Let’s check the contents of our datasets. We can look at the column names (variable names) for our so648_country and eurostat datasets by running the following code.

colnames(so648_country)

 [1] "cntry"        "cntry_code"   "unemp_nat"    "unemp_ilo"    "gini"
 [6] "pop_grow"     "gni_grow"     "gnipc_grow"   "gdppc_grow"   "gdp_grow"
[11] "gni_pc"       "healthex_gdp" "healthex_exp" "fdiin_gdp"    "tech_exp"
[16] "service_emp"  "trade"        "rd_exp"       "rd_people"    "life"
[21] "credit"       "stocks"       "top1"         "ls"           "eu"
[26] "un_developed" "union"        "murder"
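Column names are only a first look. It also helps to check the size of the dataset, the type of each variable, and how many values are missing, since not every variable is available for every country. The sketch below uses a small invented data frame (toy) so that it runs on its own; swap in so648_country to inspect the real data.

```r
# Invented values, for illustration only; NA marks a missing value
toy <- data.frame(
  cntry = c("A", "B", "C", "D"),
  life  = c(80.1, 76.4, NA, 82.3),
  top1  = c(9.5, NA, NA, 14.2)
)

dim(toy)        # number of rows (cases) and columns (variables)
head(toy, 3)    # first three rows
str(toy)        # type of each variable

# Count the missing values in each variable
n_missing <- colSums(is.na(toy))
print(n_missing)
```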

There are other steps we may wish to take, such as assigning long-form labels to our variables. We will look more closely at labeling systems later in the course. For now, you can simply supply a title to the graphs you create using the labeling functions in ggplot. Examples of this will be given below. Remember if we want to open the Help window for any package/library, we can type help(vtable) or help(ggplot2) to open the help window for that package/library.

02. Summary Statistics in R

There are several different ways to do everything in R. The packages I have chosen for the course should generally make things as easy as possible to code, or should be able to handle different kinds of data or measures within the same grammar. vtable and its sumtable command are examples of this. We loaded the vtable library above, using the code library(vtable). We can now use the sumtable command from vtable to make a table that is formatted in a way we can use easily in our documents.

The following code produces a simple table of summary statistics for as many variables as we supply to the vars argument. Here, I supply four variables: ls, life, top1 and eu.

sumtable(so648_country, vars = c('ls', 'life', 'top1', 'eu'))
Summary Statistics
Variable     N    Mean  Std. Dev.  Min   Pctl. 25  Pctl. 75  Max
ls           39   52    6.2        36    48        56        65
life         47   79    4.5        62    76        82        84
top1         30   13    5.3        6.3   8.8       14        24
eu           47
… EU         28   60%
… Non-EU     19   40%

This is fine for most purposes, but typically we want more customisation. The title, the variable names, the contents of each column, can all be modified. sumtable can also handle nominal and ordinal (factor) variables, which will save us some hassle later on. You can see this from the output above, where we can see that 60% of countries in the dataset are in the EU, and 40% are Non-EU. In the example below, I have added a median to the output, changed the title, and also added a grouping variable to look at within-group summary statistics. You can modify the code below to include your data, variables, and titles if you wish.

sumtable(so648_country, vars = c('ls', 'life', 'top1'),
        digits = 3, 
        add.median = TRUE, 
        title = 'Assignment 1 Summary Statistics (by EU membership)',
        group = 'eu')
Assignment 1 Summary Statistics (by EU membership)
eu          EU                           Non-EU
Variable    N    Mean   SD     Median    N    Mean   SD     Median
ls          28   51.8   5.66   52.1      11   53.9   7.51   56.3
life        28   79.6   2.8    80.9      19   77.6   6.11   79.3
top1        14   9.89   2.44   9.5       16   14.9   5.96   13.1

You do not necessarily need to use the longer form of this code for your assignment, but it is useful to get familiar with working with sub-groups as this is important for analysis later on. If you want the longer form output but don’t want to split it by EU/Non-EU, just remove the code group = 'eu' (along with the comma at the end of the line before it) and close the brackets.
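If you want to check what the columns of a summary table mean, the same statistics can be computed by hand in base R. The sketch below uses invented values and is not a replacement for sumtable. It assumes that N counts only the non-missing cases, which would explain why N differs between variables in the tables above.

```r
# Invented values, for illustration only; NA marks a missing value
toy <- data.frame(
  eu   = c("EU", "EU", "EU", "Non-EU", "Non-EU", "Non-EU"),
  life = c(80, 82, 81, 75, NA, 79)
)

# Overall: N (non-missing cases), mean, standard deviation, median
n_life      <- sum(!is.na(toy$life))
mean_life   <- mean(toy$life, na.rm = TRUE)
sd_life     <- sd(toy$life, na.rm = TRUE)
median_life <- median(toy$life, na.rm = TRUE)
c(N = n_life, Mean = mean_life, SD = sd_life, Median = median_life)

# By group: the same statistics within each category of eu
group_means   <- tapply(toy$life, toy$eu, mean, na.rm = TRUE)
group_medians <- tapply(toy$life, toy$eu, median, na.rm = TRUE)
print(group_means)
print(group_medians)
```

The argument na.rm = TRUE tells each function to drop missing values before calculating. Without it, any variable containing an NA returns NA.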

03. Histograms and Boxplots Using ggplot

For most basic plots in ggplot, we can draw on a common grammar to make annotations and modify the graph parameters. This is a big advantage of using ggplot, as we can transfer code between graph types. So, the same code that places a title on a histogram will also place one on a time series graph. The code that modifies an axis of a scatterplot can also modify that of a boxplot. Histograms can be produced using either a simple or expanded form of code. In the example below, the first graph is produced using the simple form, where we run two lines and then let R automate the rest for us. The second uses the same initial lines as the first, but modifies some of the graph parameters manually, such as the title and number of bins. Note the difference between the two. You are free to modify your own graphs as you wish.

The first gives us a histogram of the unionisation rate union, with density on the y-axis rather than frequency. Density rescales the bars so that their areas sum to 1: the area of each bar (its height multiplied by the bin width) is the proportion of cases falling within the range represented by that bin.

ggplot(so648_country, mapping = aes(union, after_stat(density))) +
  geom_histogram()
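To see where the density values come from, the arithmetic can be reproduced by hand. The sketch below uses base R's hist() rather than ggplot, purely to show the calculation. The values in x are invented and the bins are set to a width of 10 for convenience.

```r
# Invented values, for illustration only
x <- c(12, 15, 18, 22, 25, 27, 31, 35, 44, 52)

# Compute the histogram without drawing it, using bins of width 10
h <- hist(x, breaks = seq(10, 60, by = 10), plot = FALSE)

bin_width  <- diff(h$breaks)             # width of each bin
proportion <- h$counts / sum(h$counts)   # share of cases in each bin

# Density is the proportion divided by the bin width
density_by_hand <- proportion / bin_width
all.equal(density_by_hand, h$density)

# The bar areas (density multiplied by width) sum to 1
total_area <- sum(h$density * bin_width)
print(total_area)
```

Because density depends on the bin width, changing the number of bins changes the heights of the bars but not their total area.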

We can modify this by adding labels, changing the theme of the graph to alter the overall aesthetic, changing the bins to give us higher or lower resolution.

ggplot(so648_country, mapping = aes(union, after_stat(density))) +
  geom_histogram(col = "black", fill = "darkgray", bins = 15) +
  theme_gray(base_size=16) +
  labs(x = "Unionisation Rate (% workers in trade union)", y = "Density",
       title = "Unionisation Rates (2024)",
       caption = "Source: World Bank Databank.")

Take your time, and look at the differences between the two. Both show the same information; we have just exerted more control over the chart elements in the second than in the first. The additions in geom_histogram, for example, set the colour of the column outlines to black and fill them in dark gray (try typing "red" in there instead of "darkgray" and see what happens). The information contained between the brackets after ‘labs’ sends a label to each component. So to change them, just change the text typed in between the inverted commas.

We can do the same with boxplots. In the example below, we follow the same logic and grammar structure. The advantage of boxplots is that we can do a little more with them (visually) in terms of splitting the distribution by categories of a second variable. We will do this now. The first graph below shows a simple boxplot with minimal settings. The variable is the top 1% income share (top1).

ggplot(so648_country, aes(y=top1))+
  geom_boxplot()

The expanded form of this graph, using full annotation from the ggplot grammar, would look like this.

ggplot(so648_country, aes(y=top1))+
  geom_boxplot() +
  theme_gray(base_size = 16) +
  ylab("% total income to 1%") +
  xlab("Income Share of Top 1%")+
  labs(title = "Top 1% Income Share (2024)", 
       subtitle = "Data from World Bank")

One of the nice features of boxplots is that, depending on how we draw them, we can compare distributions across sub-sets. By introducing a grouping factor variable (in this case EU membership) we can see not only if median inequality differs, but if the variability is different.

ggplot(so648_country, aes(y=top1, x=eu))+
  geom_boxplot() +
  theme_gray(base_size = 16) +
  ylab("% total income to 1%") +
  xlab("EU Membership")+
  labs(title = "Top 1% Income Share by EU Membership (2024)", 
       subtitle = "Data from World Bank")
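The two things a grouped boxplot lets us compare, the median and the spread, can also be read off as numbers. The sketch below uses invented values and computes the median and interquartile range (IQR) within each group, using R's default quantile method.

```r
# Invented values, for illustration only
toy <- data.frame(
  eu   = c("EU", "EU", "EU", "EU", "EU",
           "Non-EU", "Non-EU", "Non-EU", "Non-EU", "Non-EU"),
  top1 = c(8, 9, 10, 11, 12, 9, 12, 15, 18, 21)
)

# Median: the line drawn inside each box
group_median <- tapply(toy$top1, toy$eu, median)

# Interquartile range: the distance from the bottom to the top of each box
group_iqr <- tapply(toy$top1, toy$eu, IQR)

print(group_median)
print(group_iqr)
```

In this toy example the Non-EU group has both the higher median and the larger IQR, so its box would sit higher and be taller. If your variable has missing values, add na.rm = TRUE as a further argument to tapply().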

04. Scatterplots Using ggplot

Scatterplots can be drawn by supplying two continuous (ratio) variables to the plot. In the example below, we supply gni_pc (Gross National Income Per Capita) to the x-axis, and life (life expectancy) to the y.

ggplot(data = so648_country) +
  geom_point(mapping = aes(x = gni_pc, y = life))

Again, we can expand this using the same expanded options as we supplied to both the histogram and boxplots previously.

ggplot(data = so648_country) +
  geom_point(mapping = aes(x = gni_pc, y = life)) +
  labs(y = "Life expectancy (years)", x = "GNI per capita",
       title = "Global Life Expectancy and National Income (2018-2022)",
       subtitle = "National Income = per capita Gross National Income (GNI)",
       caption = "Source: World Bank Databank")

What if we wanted to illustrate the overall relationship better? We can add our fit line (resistant, regression, slope) to the plot as a separate geom, with (what R calls) a linear smoother. This is important, and something we will come back to in week 8 / week 9 as we explore correlation and regression further. Watch how the code is identical to the above aside from one additional line.

ggplot(data = so648_country) +
  geom_point(mapping = aes(x = gni_pc, y = life)) +
  geom_smooth(aes(x = gni_pc, y = life), method = "lm") +
  labs(y = "Life expectancy (years)", x = "GNI per capita",
       title = "Global Life Expectancy and National Income (2018-2022)",
       subtitle = "National Income = per capita Gross National Income (GNI)",
       caption = "Source: World Bank Databank")
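The line drawn by geom_smooth(method = "lm") is a straight-line (linear regression) fit. To see the numbers behind such a line, the same kind of model can be fitted directly with lm(). The sketch below uses invented values, with gni_pc in arbitrary units, so the results say nothing about the real data.

```r
# Invented values, for illustration only
toy <- data.frame(
  gni_pc = c(10, 20, 30, 40, 50),
  life   = c(72, 75, 77, 80, 81)
)

# Fit a straight line: life as a function of gni_pc
fit <- lm(life ~ gni_pc, data = toy)

# Intercept (where the line crosses the y-axis) and slope
# (change in life for a one-unit increase in gni_pc)
coef(fit)

# Correlation between the two variables
r <- cor(toy$gni_pc, toy$life)
print(r)
```

With the real dataset, add use = "complete.obs" to cor() so that countries with missing values on either variable are dropped.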

05. Time Series Graphs Using ggplot

For time series data (in this case the eurostat dataset loaded above) we will use a similar approach, with a graphing method suited to the type of data. Time series data are ordered by time, to varying degrees of resolution (annual, monthly, weekly, daily, etc.). The data we usually work with are yearly or quarterly, though the latter is much rarer in sociology and criminology aside from some economic indicators that might be examined on their own. In the example below, the graph is produced by specifying an x variable (usually the variable that identifies time for this kind of plot), and a y variable to be graphed. Remember, to draw a plot of this type, the data must be ordered by time / in time series format. The country level data is not suitable for this.

ggplot(data=eurostat) +
  geom_line(aes(x = year, y = cpi_change))

Here is the same plot, but with annotations as shown in week 6.

ggplot(data=eurostat) +
  geom_line(aes(x = year, y = cpi_change)) +
  theme_gray(base_size=16) +  
  labs(y = "Annual % change",
       x = "year",
       title = "Changes in Cost of Living, 2000-2025",
       subtitle="Consumer Price Index % Annual Change",
       caption = "Data from Eurostat EU-SILC")
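If you are unsure whether a dataset is ordered by time, you can check and sort it in base R. The sketch below uses invented values with the rows deliberately out of order. Note that geom_line() connects points in order of the x variable, so sorting mainly helps when you inspect the data yourself.

```r
# Invented values, for illustration only; rows deliberately out of order
toy_ts <- data.frame(
  year       = c(2003, 2001, 2002, 2000),
  cpi_change = c(3.5, 4.9, 4.7, 5.6)
)

# TRUE here means the years are not in ascending order
is.unsorted(toy_ts$year)

# Sort the rows by year
toy_sorted <- toy_ts[order(toy_ts$year), ]
print(toy_sorted)
```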

What about variables with a shorter timespan? Remember, as with all other plot types, we can alter the axes according to our needs, statistical and aesthetic. Inspect the dataset and you will see that data on sexual assault is only available from 2009 to 2023 for Ireland in Eurostat. If we graph this without modification, the plot comes out with empty space wherever there is no available data. This is because, given the way we have coded it, ggplot will automatically draw the plot for the full range of the series.

ggplot(data=eurostat) +
  geom_line(aes(x = year, y = sa_rate_fem))

We can ‘fix’ this by limiting the range of the x-axis (time/year) to include only the available range.

ggplot(data=eurostat) +
  geom_line(aes(x = year, y = sa_rate_fem)) +
  xlim(2009, 2023)
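Rather than reading the first and last available years off the dataset by eye, you can ask R for them. The sketch below uses an invented series (toy_ts, with a made-up rate column) that has missing values at both ends, and finds the range of years that actually have data.

```r
# Invented values, for illustration only; NA where no data are available
toy_ts <- data.frame(
  year = 2005:2012,
  rate = c(NA, NA, 10.2, 11.0, 11.7, 12.1, NA, NA)
)

# Keep only the years with an observed value
toy_obs <- toy_ts[!is.na(toy_ts$rate), ]

# First and last year with data; these are the values to give to xlim()
year_range <- range(toy_obs$year)
print(year_range)
```

For the real data, the equivalent would be to replace toy_ts with eurostat and rate with sa_rate_fem.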

Finally, we can add full annotation to the graph to clean it up.

ggplot(data=eurostat) +
  geom_line(aes(x = year, y = sa_rate_fem)) +
  xlim(2009, 2023) +
  theme_gray(base_size=16) +  
  labs(y = "per 100,000",
       x = "year",
       title = "Sexual Assaults (Female Victim), 2009-2023",
       subtitle="Ireland, Police-Recorded Offences",
       caption = "Data from Eurostat EU-SILC")

06. Saving Output

Want to save something? If you have an image you want to save, you can save it manually by going to the plot pane (the lower-right quadrant of RStudio) and clicking Export then Save as Image, as shown in the image below. Give it a file name, and it will save to your working directory once you hit save.

If you are happy with your summary statistics table, and want to save that, then you can also use the export option as shown below. You have two options here. By selecting ‘Save as Image…’ you can change the resolution and file type. By selecting ‘Copy to Clipboard’ you can then go straight to a Word document and paste it in. Either method is fine.
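Plots can also be saved from code using ggsave() from ggplot2, which is handy if you want the same size and resolution every time. The sketch below uses invented data and saves to a temporary folder so that it runs anywhere. The file name, width, height and dpi are example values only.

```r
library(ggplot2)

# Invented values, for illustration only
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# Store the plot in an object rather than only drawing it
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Saved to a temporary folder here; use a plain file name such as
# "my_plot.png" to save into your working directory instead
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)
```

By default, width and height are in inches, and the file type is taken from the file extension.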