It’s fine

Hi everyone. I know there is always some anxiety about this module, so I want to write something to let you know a little more about what we will do. We will work mainly with R. This page, the one you are reading now, was written in R, and much of the material you will receive and work with in class will look something like this. R is a computing environment for managing, manipulating, analysing, and visualising data. You use it by typing code into a command line or console, running that code, and then interpreting the output. Most of the time, I will supply this code to you each week, and you will run it to produce an output. This can be very simple, like so:

1+1

In this case, running a line of code like this will give us a very simple output.

1+1

[1] 2

Each week, I will post a code file in ‘.R’ format for you to download from Moodle, and we will work through the code together in class. The code will look something like the example below. Try to think of it as a set of instructions that tells R where to find something (like a dataset) and what to do with it (make a table or draw a graph). You will not have to memorise this code, and I will not ask you to go away and write your own original code; at most, you will make small adjustments to some of the terms.

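library(readr)  # provides read_csv()
so648_country <- read_csv("so859_country_2025.csv")  # import the data and give it a name
summary(so648_country)       # summary statistics for the whole dataset
summary(so648_country$gini)  # summary statistics for the 'gini' variable only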

So with the code above, we have imported some data from a spreadsheet and stored it in R under the name ‘so648_country’, then asked R for summary statistics for the entire dataset using summary(so648_country), and then asked it to summarise a single variable titled ‘gini’ using summary(so648_country$gini).
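If you want to see the same two steps without needing the module data, here is a minimal sketch using a tiny made-up dataset. The object name toy_country and all of the numbers in it are invented purely for illustration; only summary() and the $ notation are the same as above.

```r
# Toy data, invented for illustration only (this is not the module dataset)
toy_country <- data.frame(
  country = c("A", "B", "C", "D"),
  gini    = c(25, 30, 35, 50)
)

# Summarise every column in the dataset
summary(toy_country)

# Summarise one variable: the $ picks a single column out of the dataset
gini_summary <- summary(toy_country$gini)
gini_summary
```

The second summary reports the minimum, quartiles, median, mean, and maximum of the one variable, which is usually what we will want to talk about in class.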

Interpreting Output

Sometimes, we will use some longer code, or call on specific packages to do something different, like creating this table.

library(vtable)  # provides sumtable()
sumtable(so648_country, vars = c('gini', 'top1', 'ls'),
         digits = 4, add.median = TRUE,
         title = 'Table 1: Country Data Summary Statistics')

Table 1: Country Data Summary Statistics

Variable   N    Mean    Std. Dev.   Min     Pctl. 25   Pctl. 50   Pctl. 75   Max
gini       37   34.39   7.684       25.4    29.2       32.7       36.2       63
top1       30   12.58   5.251       6.3     8.8        11.2       13.83      23.7
ls         39   52.4    6.207       35.56   48.39      52.88      56.39      65.39

In class, we might spend some time talking about what these measures mean, for example which one might be best for measuring income inequality. Should we look at the super-rich (top1) or at the overall share of GDP going to workers (ls)? How could we tidy this table up to make it more presentable? What other kinds of comparison might we want to make? Are European countries more equal than non-European ones, and what about differences between countries within Europe?
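A comparison like the European vs non-European one could be sketched roughly as follows. This is only an illustration: the region column and every number here are made up, and the real dataset may label its groups differently.

```r
# Toy data, invented for illustration only
toy_regions <- data.frame(
  region = c("Europe", "Europe", "Other", "Other"),
  gini   = c(28, 32, 40, 44)
)

# Mean gini within each region: "gini ~ region" reads as "gini by region"
group_means <- aggregate(gini ~ region, data = toy_regions, FUN = mean)
group_means
```

The result is a small table with one row per group, which we could then read side by side and discuss.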

Making Graphs

We will also use our code to create graphs to explore these questions further. We will spend a lot of time doing this, emphasising interpretation more than coding or calculation. We will work mainly with the ggplot2 package for this. I will always give you the code to do this, and we will focus on reading and interpreting the graphs together in class. So the code below will also give us the following graph:

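library(ggplot2)  # provides ggplot() and the plotting functions used below
# Pass the dataset once, then refer to its columns by name inside aes()
ggplot(so648_country, aes(y = top1, x = un_developed)) +
  geom_boxplot() +
  ylab("% of total income going to top 1%") +
  xlab("Development status") +  # the x-axis shows the un_developed groups
  labs(title = "Top 1% Income Share (2018-2022)",
       subtitle = "Data from World Bank")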

Once we have produced the graph, we will spend some time interpreting it. For the graph above, we would discuss why income inequality is much higher in ‘developing’ countries than in ‘developed’ ones. We might discuss how to relabel the data into something less problematic than ‘developing’ (possibly high vs low income, or advanced vs transition). Then we might rearrange the graph to make it more presentable by adding labels, titles, etc. The goal is interpretation and communication.
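Relabelling of this kind is a small adjustment rather than new code. A minimal sketch, assuming the variable stores the words "developed" and "developing" (the real values may differ) and using invented numbers:

```r
# Toy data, invented for illustration only
toy_status <- data.frame(
  un_developed = c("developed", "developing", "developing", "developed"),
  top1         = c(9, 18, 21, 11)
)

# Create a relabelled copy of the variable; the original column is left untouched
toy_status$income_group <- factor(
  toy_status$un_developed,
  levels = c("developed", "developing"),
  labels = c("High income", "Low income")
)

# Check how many countries fall under each new label
table(toy_status$income_group)
```

Any graph drawn with income_group on the x-axis would then show the new labels automatically.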

We can also explore other social issues visually. The example below uses data from the 2021 Census of Northern Ireland, which included a question on Autism for the first time.

Figure 1: Autism in Belfast, 2021


Using this data, we can identify districts with the highest rates, and also plot the distribution of Autism visually in order to identify things like spatial clusters, hotspots, coldspots, etc. Then we can check to see if there are any co-occurrences like higher or lower rates of education, particular family types, or proximity to health services.
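Identifying the districts with the highest rates comes down to calculating a rate and sorting by it. A rough sketch with made-up numbers (the district names, counts, and populations below are invented and are not Census figures):

```r
# Toy data, invented for illustration only
toy_districts <- data.frame(
  district   = c("District A", "District B", "District C", "District D"),
  cases      = c(30, 45, 20, 60),
  population = c(10000, 9000, 8000, 15000)
)

# Rate per 1,000 residents, so districts of different sizes can be compared
toy_districts$rate_per_1000 <- toy_districts$cases / toy_districts$population * 1000

# Sort the rows so the highest rate comes first
ranked <- toy_districts[order(toy_districts$rate_per_1000, decreasing = TRUE), ]
ranked
```

Note that the district with the most cases is not the one with the highest rate once population is taken into account, which is exactly the kind of point we would discuss in class.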

This module is not…

  1. A statistics course. We have a separate department for this, and this is not a data analytics qualification - it is a social research course. You will leave here able to replicate this yourself, but I am not here to teach you mathematics or statistics, nor will you be asked to do any calculation by hand in any part of your assessment.

  2. A programming course. Some people like to call this coding. Even though we will work with code, we will not be using it to write overly complex instructions, nor will we be writing programs to automate things for us (though this can be done in R with additional learning). We will issue instructions to R to find data, name it, label it, transform it, then do ‘something’ to it like draw a graph or produce a summary statistic. We will tell it to show us the mean of a variable, draw a time series graph, or compare groups: in short, we will use it to answer sociologically interesting questions.

  3. Something other than sociology. Sociologists use statistics in very specific ways, and the difference between us and econometricians is the theory we bring to the design and interpretation. This is as important as technical knowledge when it comes to doing good sociology. We will use the technology to answer interesting questions about the social world.

Why do we learn it like this?

  1. I want you to understand the process behind the production of statistical output. This is the emphasis, and this is more important (for this module) than understanding the mechanics of the different procedures. We will cover the logic of the various statistical procedures, but at a level that assumes no prior experience.

  2. I want you to understand the ‘back end’ of the analysis process, so that when you read something that contains quantitative output, you understand the technical process linking what you see on the page to the data in the background. Understanding the workflow of quantitative research opens up many possibilities for further study.

  3. Working in R has some added benefits. R is completely free and open source, meaning you will always have access to the tools you need, unlike proprietary programmes, which you lose access to after you graduate. R packages are also user-developed, so updates appear frequently, along with new procedures written by the people directly involved in their development. There is a big R community online, and the simple format of R files makes them easily transferable from platform to platform.

I hope you enjoy this, and I am looking forward to teaching it,

Eoin