Week 1: The Structure and Workflow of R

01. An Overview of the R Environment for New Arrivals

01.1 What is different about working with code in R

Establishing good workflow practices from the very beginning is essential to working successfully in R. Good workflow is achievable by putting some time and effort in at the very beginning to embed a routine. If you follow this routine every time you work, the long-term payoff will be big. You will be able to reproduce your analysis very quickly from a complete record of each step you took in your project. If you need to update your data, you can quickly reproduce all of your output without having to manually re-run each procedure individually. No more re-running graphs and models, saving them separately, and re-compiling them into your written document. You will not need to remember, or record in an unsystematic way, how to do each procedure in your data analysis (where to click, in what order), since all will be coded and saved in your script or document. Dialog-based systems such as SPSS are unhelpful in this regard, as every procedure comes with small variations in the windows you must click through to make the action happen. Stata is a good alternative as it operates on a syntax that is optimised for econometrics - the family of statistical techniques currently dominant in economics and sociology. There are differences in the language that these programmes use.

In Stata, a regression analysis with three variables would involve typing the following code.

regress y1 x1 x2

In SPSS, we would use:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT y1
  /METHOD=ENTER x1 x2.

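In R, using tidyverse conventions (with tidy() supplied by the broom package), we would type:

library(broom)  # provides tidy()
# y1, x1 and x2 are placeholders for the column names in your data
model1 <- lm(y1 ~ x1 + x2, data = USArrests)
tidy(model1)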

Some differences are immediately apparent. In SPSS, we often need to be more detailed with the conditions we specify at the start of the procedure. In Stata, the language is cleaner, in that we identify the command ‘regress’ and a dependent variable (y1) plus two predictors (x1, x2). One of the advantages we will find with R is that we assign the model output to an object that lives in our ‘Environment’ window. We can then recall it, and manipulate its contents if we wish. We can do this in other programmes too, but it is a little more opaque. The code above is a good example of R’s object-oriented nature. When we work in R, we will often create objects (in the above example, model1) that we then populate with output according to the instructions we place after them. The literal nature of R code also makes it more usable. The use of the <- operator, for example, is quite literally telling R to fill the object model1 with the results of the linear model produced from y1 ~ x1 + x2, using the data USArrests. Chaining these steps together gives us this.

model1 <- lm(y1 ~ x1 + x2, data = USArrests)

Don’t worry too much about the terminology or technique here; focus on the structure of the code. We will be able to use these properties of R code to our advantage later on. Compared to SPSS, the syntax of a programme like Stata is superior, but both carry one large disadvantage - their cost. When you register for a programme of study such as an undergraduate or master’s degree, you typically gain access to whatever suite of computer programmes your education institution has subscribed to. Often you will lose access to these once your registration expires. The cost of this can be prohibitive, especially as software providers across all industries are moving away from ownership toward subscription-based distribution. An annual subscription to the base edition of SPSS (without add-ons) currently runs to over €1400. R is both free and open-source, meaning there is no installation or subscription cost aside from a computer to run it on, and it is infinitely expandable by the user community. This is a considerable advantage as it means new procedures are added all the time by experts who work on everything from new statistical methods to more refined ways of doing common tasks. Choosing R is fine if the choice is yours, but that is not always the case. Preference for statistical software can vary from place to place, with programmes like SAS more commonly used across North America. It also varies by discipline, with SPSS historically more common in psychology and marketing, and Stata more frequently used in economics. Your choice of platform will sometimes be dictated by the existing skill base and preferences of your teacher, or your organisation. This is becoming less of an issue as the user base for proprietary software shrinks. The shift from software ownership to subscription is also making proprietary tools less attractive to users who want to own, and not rent, the tools of their trade.

If nothing else, one of the main attractions of R as a teaching tool for me is that my students can still access it once they graduate. But - and this is a big one - it does come with a learning curve. When you first start RStudio, the window you see looks unfamiliar. If you have come to a sociology programme, this is probably not what you subscribed to at all. There are 16+ tabs in the default window, various pull-down menus, and it drops you in with no immediate direction. This can be intimidating, and often prompts a degree of anxiety. You didn’t want to learn to code, and that is fine because our job as social scientists does not require us to be professional coders. Keep in mind that our objective here is to make the software produce results that are interesting, informative, and that ultimately help us draw conclusions that are sociologically relevant, and you will be fine.

Let’s start with the basics for now, with some understanding of why we should work in a platform like this, and the level of expertise we are aiming for. If you are coming to R with some experience on other platforms such as SPSS, Stata, or SAS, you may have some familiarity with the workings of code through SPSS or Stata syntax. This will serve you well, as the general workflow will start to feel familiar after a while. The general framework for many procedures is the same: write a piece of code that will do something to your data. Maybe one that calculates some summary statistics, or produces a plot. Point that command toward some data, tell it which variables you want it to work on, and then execute it. If I wanted to produce a table of summary statistics for all variables in a dataset called gaming, I could write it like this.

summary(gaming)

Textbooks on SPSS or Stata will refer to the initial part of the code as a command. We can adopt the same logic here. The ‘command’ (summary) will apply to the name of the dataset supplied between the brackets (gaming), and produce a table of summary statistics. If I save this command/code in a file such as an R Script (.R), I can come back to my project at any point and re-run the same code, to produce exactly the same results. I can share the exact steps of my analysis by sharing the script with a collaborator or colleague. If someone wanted to check my work, assuming they also have access to my data, they can do so instantly with a copy of the code. This satisfies several important principles of good practice: reproduction, sharing, validation, and repetition. So far, this is not too different to how it works in other programmes in terms of workflow. The main difference (and it is an important one) is that we have removed the intermediary software corporation, holding our capacity to do good social science behind a paywall. This is not the only reason, as the section on packages and libraries explains further on. R has a large community of users and coders, who create their own custom packages and release them free to use. Some, such as tidyverse and the ggplot2 graphics environment, are so ubiquitous that they are near-standard tools in social research now. The advantages of the community-based nature of R will become apparent as you work through the course. The point for now is that we can do all of our work, with code, from inside the R environment. We will not need to go outside of RStudio to acquire these packages, because they are integrated into the R environment and can be called into use as needed.
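The gaming dataset is not supplied at this point, so here is a minimal sketch of the same command run on USArrests, a dataset that ships with R. The object name arrests_summary is only an illustrative choice, not something the course requires.

```r
# USArrests ships with base R, so no import step is needed
arrests_summary <- summary(USArrests)

# Print the table of summary statistics for every variable
print(arrests_summary)

# The same command also works on a single variable
summary(USArrests$Murder)
```

Saving these lines in an R Script and re-running them later should reproduce the same table each time.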

01.2 Packages and Libraries

Packages and libraries are central to the R experience. They are at the core of the R workflow, and define a part of what makes R efficient to work in. Packages provide us with custom-written code, and sometimes accompanying data, that perform a specific function not provided for in ‘base R’. This is the basic version of R that you get on first installation, and we will distinguish between ‘base R’ both as a programming language in itself, and as a particular state in which R operates prior to the installation of packages. A note on terminology: you may have noticed that we used the terms package and library interchangeably in places. Both terms describe something that we will load into R to provide additional functionality. That is all that a package/library does. By definition, a ‘package’ is the collection of code and data that makes a new function available in R. A ‘library’ is the name given to the function in R that loads the contents and features of a package into a work session. The ‘library’ is also the place where this collection of additional code lives. This is all definitional, and better illustrated with an example.

In the later section on preliminary data inspection, we will call on a package that will streamline the process of producing summary statistics for us, allowing us to generate print and report-ready tables easily. One of these packages is called vtable and was developed by Nick Huntington-Klein, an Economics Professor at Seattle University. Using the workflow of code, we would call this package into use as needed, first by installing it using the code:

install.packages("vtable")

Then, when we want to use the functions of this package in future work sessions, we call on its functions using library to make its features available to us:

library(vtable)

You will encounter this kind of action in your workflow quite often when working in R. It is also what makes the open source nature of R, and the wider community of users and function developers, most apparent. By default, R will not start a new session with all additional packages loaded into memory. We will install each package only once using specific code (install.packages as shown above). After installation, we need only use library(package) to make the functions of that package available to us. We can also access help pages for our packages by using the help command. For example,

help(vtable)

Again, working with code in a script means we can retrace each of our steps, and eventually, write our own sets of project setup commands that load only the libraries we need for each of our work sessions. There is quite a lot of terminology here, and some of this is merely incidental - the distinction between a library and a package matters mainly to us in terms of the sequence in which we run either of these commands.
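One possible shape for such a project setup block is sketched below. The helper name load_or_install is hypothetical (it is not part of R or any package), and the sketch assumes you are happy for a missing package to be installed automatically. It is demonstrated on stats, which is always installed, so that the example runs without an internet connection.

```r
# Hypothetical helper: load a package, installing it first only if it is missing
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # runs once, only when the package is absent
  }
  library(pkg, character.only = TRUE)  # runs in every session
}

# 'stats' comes with R, so nothing is installed here
load_or_install("stats")

# In a real project the call might instead be:
# load_or_install("vtable")
```

The design choice here is simply to keep the install step and the load step in the sequence described above, so the script can be re-run on a new computer without editing.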

Table - package, library, function, command, syntax

| Term | Description | Examples |
| --- | --- | --- |
| Package | A collection of code providing new functionality in R. | ggplot2 for graphics and data visualisation, dplyr for data manipulation. install.packages("ggplot2") is used to install the package ggplot2. |
| Library | The location in which installed packages are stored. After installation, a package is often loosely referred to as a library, and both terms identify essentially the same thing - additional functionality in R. | In your work session in R, library(ggplot2) is used to make the functions of the package available. |
| Function | ‘Function’ can refer either to a mathematical function or to a specific operation in R. | The dplyr package provides several functions including filter(), which allows you to select subsets of cases based on specific values. |
| Command | ‘Command’ is the term used to describe keyword operators in programmes such as SPSS or Stata that execute a particular function. | We can refer to a procedure such as summary(gaming) as executing a command. The ‘command’ in this case is summary. |
| Syntax | The flavor of grammar that a particular programme employs. | SPSS and Stata have their own individual syntax, a structure of writing code that allows commands to run. The syntax will have specific rules about keywords, use of special characters, or the order in which things must be written to successfully execute. |

01.3 How much math do I need to know?

Most of us who come to sociology did so to avoid mathematics. Math avoidance in sociology was such a big concern that in 2014 the Nuffield Foundation launched the Q-Step programme in the UK. The initiative established 17 centers across UK universities with the goal of embedding quantitative training and general numeracy in core social science programmes. So if you are feeling hesitant about math, you are not alone. Luckily, this is not a maths or statistics or programming course. Much of what we will do here with statistics is develop just enough fundamental knowledge to help us understand what is going on behind the code once we run a test, or produce a graphic. We do not need to understand the fundamentals in great detail - this is for another discipline. You can drive a car your whole life without understanding the physics of internal combustion, or the chemistry of the lithium-ion battery. Even where we will consider mathematical formulae, it might be more helpful to think of these as sets of instructions rendered in a very specific language. Consider the formula for calculating a standard deviation, a common measure of variability.

\[sd = \sqrt{\frac{\sum (y_i-\bar{y})^2}{n-1}}\]

If you read an introductory statistics book, which this is not, you will come across formulae like this early on. This is the kind of thing that provokes fear in those of us who got into sociology to escape mathematics. I am one of those people. But what if we reframe this not as something to be ‘solved’ but as a series of steps to follow? When we break it down into its component parts, we find a series of simple calculations performed for as many points of data as we have in our dataset.

  1. It is telling us to take the original value from each case in our variable (\(y_i\)), and subtract the variable’s overall mean (\(\bar{y}\)) from it.

  2. We repeat this for all of our data points, and square each result to remove the negative signs. What we get, once we add together all of these squared distances from the original values to the mean, is a figure that satisfies this part of the formula \(\sum (y_i-\bar{y})^2\).

  3. To get the average squared distance from each value on this variable to its mean for the entire dataset, we need to divide by the sample size minus one (\(n-1\)). This gives us \(\frac{\sum (y_i-\bar{y})^2}{n-1}\).

  4. Finally, we take the square root of this figure to get our standard deviation. This gives us \(sd = \sqrt{\frac{\sum (y_i-\bar{y})^2}{n-1}}\).
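The four steps can be sketched directly in R. This is only an illustration: it uses the Murder variable from the built-in USArrests data as a stand-in for \(y\), and the object names (deviations, sum_sq, variance, sd_manual) are my own choices.

```r
y <- USArrests$Murder                  # the variable we want to describe

deviations <- y - mean(y)              # step 1: each value minus the mean
sum_sq     <- sum(deviations^2)        # step 2: square each distance and add them up
variance   <- sum_sq / (length(y) - 1) # step 3: divide by n - 1
sd_manual  <- sqrt(variance)           # step 4: take the square root

sd_manual
sd(y)   # R's built-in function, for comparison with the manual result
```

In everyday work you would simply call sd(), but following the steps once by hand shows that the formula is a set of instructions and nothing more.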

A low standard deviation relative to the mean of a variable indicates that the points are gathered close together. We might be more confident that the mean is representative of a typical score in the dataset, than for a variable with a large standard deviation relative to the mean.

What you may often encounter when reading published research based on quantitative analysis is a model specification. This is often a more formal way of writing how a piece of analysis was conducted, how effects were derived, and what additional adjustments were made. It is important, because for the research to be critically assessed, we need to be precise about what steps were taken. Consider the following specification:

\[y=a+B_1x_1 + B_2x_2\]

This is another way of writing a linear regression model with two independent variables and one dependent variable. Now let’s interpret this practically. We need to know what the various terms mean. In this case, \(y\) is a given value of the dependent variable. Let’s say that this is life expectancy. The term \(a\) is the intercept, the expected value of \(y\) when both predictors are zero. And in this model, which we have ‘fit’ to our data, we hypothesise that \(x_1\) is per capita gross national income, and \(x_2\) is health spending as a percentage of Gross Domestic Product. The specification is telling us that, in the form in which we have specified it in our data analysis, we have calculated by how much average life expectancy changes (in years) with each unit increase in per capita national income (\(x_1\)) and in health spending (\(x_2\)), in each case holding the other predictor constant. The coefficients \(B_1\) and \(B_2\) carry these two quantities. Another way to say it would be, for this type of model and within the limitations of our data, we examine the extent to which life expectancy changes with respect to unit increases in each. The mathematical specification becomes another way for us to communicate - with greater clarity than written word and in a consistent format - what exactly we did in our analysis. Once you can begin to reframe the function of mathematical writing in social science research as a communication tool, it will become less intimidating.
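To connect the specification to code, here is a small sketch fitted to the built-in USArrests data. It does not use life expectancy data: Murder stands in for \(y\), Assault for \(x_1\) and UrbanPop for \(x_2\), purely as assumptions for illustration, and the object names are hypothetical.

```r
# Fit y = a + B_1 * x_1 + B_2 * x_2 with stand-in variables
model_example <- lm(Murder ~ Assault + UrbanPop, data = USArrests)

# (Intercept) corresponds to a, Assault to B_1, UrbanPop to B_2
estimates <- coef(model_example)
estimates

# Plug the first case into the specification by hand
manual_first <- estimates[["(Intercept)"]] +
  estimates[["Assault"]]  * USArrests$Assault[1] +
  estimates[["UrbanPop"]] * USArrests$UrbanPop[1]

manual_first
unname(fitted(model_example)[1])  # the model's own fitted value for the same case
```

The hand calculation and the fitted value should agree, which is all the specification is saying: a predicted \(y\) is the intercept plus each coefficient multiplied by its variable.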

Some insist that you should undertake a full programme of training in econometrics before calling yourself a competent analyst. This might be true of economics, but not sociology. We are at our best when we can move and communicate across multiple methodologies. This doesn’t mean you can’t become expert in one, but your abilities as a sociologist will be greatly enhanced by your skillset diversity. That said, I do believe that all sociologists should be proficient in quantitative methods to some degree. Especially so, as much of what we claim rests on some claims to quantity or change, yet we seem to either ignore or deny the measurement and computation that underpins this. References to inequality are founded on observations, the extent of which only became fully apparent once we began to measure and collect data systematically at a population level. Equally, we cannot explain the origins and reproduction of these inequalities through counting alone, hence the bizarre state of a discipline that remains, to some extent, divided on methodological grounds. More of this later.

01.4 Scripts, directories, and folders

Getting to this point of producing output requires embedding some habits. The very first thing you should do in any R session is tell R which folder you are going to work from for a particular work session. This will be your ‘working directory’ until you either end the session, or point R to another location on your computer. For our purposes we can use the terms ‘folder’ and ‘directory’ interchangeably. We are not programmers and the distinction is of little practical consequence for us. Working from a directory and writing your work in a script will allow you to open, save, and export all of your work to a single location. Whenever you start a new session or project in R, make sure you have saved all of the data files you will need for that project into a single folder. Give it a recognisable name, and name it according to some convention or hierarchy.

Once you have installed RStudio, start by opening a new R Script file by navigating to File - New File - R Script. When you encounter code, it will be identified with formatting like this. Where these lines of code are set off in lines of their own, like the code below, you should copy them to your own R Script within RStudio and run them yourself. Follow along with these examples as you read. In the example code below, I set my working directory to a folder on my iCloud Drive titled ‘rbook_data_2026’. You should replace the address between the quotation marks with the address for the folder containing your data. This is an address in Windows format, and this will differ for Mac.

setwd("C:/Users/eflaherty/iCloudDrive/rbook_data_2026")

You can locate the address in Windows by inspecting the properties of your folder. Right-click on the folder and click ‘Properties’ to open the Properties window. From here you can locate the address. You can also get the address directly by right-clicking on the folder icon for the folder that you will use for your working directory, and clicking Copy as path from the menu that appears. This will place the address into your clipboard, and you can then return to R and paste it into your R Script. Windows copies the path with backslashes (\), which R does not accept inside a quoted address as they stand, so change each one to a forward slash (/) as in the example above.

To confirm that R is indeed looking in this folder for files, we can ask it to check or ‘get’ our working directory.

getwd()

Or we can ask it to list all of the files in our working directory to be absolutely sure, but also to confirm the names of files we may wish to import.

list.files()
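The three commands can be tried together without touching your own folders. The sketch below works inside a temporary folder that R creates for the session. The folder name rbook_demo and the dated file name are hypothetical, chosen only to illustrate a naming convention.

```r
old_wd <- getwd()                                   # remember where we started

project_dir <- file.path(tempdir(), "rbook_demo")   # hypothetical project folder
dir.create(project_dir, showWarnings = FALSE)
setwd(project_dir)                                  # point R at the project folder

write.csv(USArrests, "arrests_2026-01-15.csv")      # save a dated file into it

getwd()                                             # confirm the working directory
files_found <- list.files()                         # confirm what R can see there
files_found

setwd(old_wd)                                       # return to the original folder
```

In your own work you would replace the temporary folder with the address of your project folder, as in the setwd() example earlier.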

Establishing this kind of discipline with your folders, files, filepaths, and file names, is very important for keeping your work traceable, and reproducible. Give your files and folders names that will identify their content, and their date. Use sub folders, and try to use a consistent naming convention. If you have followed the conventions so far and have set up your first R Script to do so, you should be looking at something like this. You can use # to identify text in your script that is not code. Whenever you want to write a note, give something a heading, or just detail what you were thinking and doing with a particular piece of code, you can use this to make notes that R will bypass if you attempt to run the script. It will also clearly distinguish your notes in your script, making it easier to read.
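As a rough illustration of how # notes can sit alongside code, a short script might be laid out as follows. The heading text and object names are placeholders and not a required template.

```r
# ------------------------------------------------------------
# Week 1 practice script (placeholder title)
# Purpose: try out comments and a first calculation
# ------------------------------------------------------------

# Lines that start with # are notes. R skips them when the script runs.

scores <- c(2, 4, 6)          # a small set of example values
mean_score <- mean(scores)    # text after # on a code line is also skipped

mean_score
```

Headings made from # lines like these can make a long script much easier to scan when you return to it months later.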

02. Data Structures in Social Science

02.1 Data units, levels, and resolution

In sociology we work a lot with macrodata. These are data based on units other than individual people. We might usefully distinguish between individual, meso and macro data. Individual data are collected from human respondents. This was often done by having them complete a questionnaire, but it also includes many other sources now. Opinion polls completed online, transaction or customer data, or large datasets such as the European Social Survey are all examples of microdata comprising the responses of individuals. The structure of these datasets means that each row represents one set of responses for one individual. Sometimes, in large-scale household surveys, one of these rows might hold data on the household alongside that of the individual. The meso level in sociology identifies intermediary social institutions, organisations, or firms. Data on the characteristics of firms or businesses, and data on aspects of a country’s health or education system, are examples of data collected at the meso level.

Macrodata is data collected at the societal level - often community, region, or country. Countries will often be organised into geographical sub-units for data collection purposes, where the populations of each area are kept to a common level. This is to allow for comparison and analysis across a wide number of units, and to account for the impact of population density on the widely varying number of individuals that would be included if we collected data based on area alone. A region of 1 square kilometer (sqkm) in New York City would include far more people than 1sqkm in rural Ireland. As such, these boundaries tend to be revised slightly between census waves to account for changing populations due to inward and outward migration. The Central Statistics Office of Ireland collects and distributes data at the Electoral Division level, of which there are 3,420. For finer-scale analysis, data are also organised at the Small Area level, of which there are 18,919. This difference in granularity is known as spatial resolution. The Small Area data (N=18,919) is of a higher resolution than that of the Electoral Division (N=3,420).

Data can also possess temporal resolution, as in the case of time series data, where each data point represents a value for a particular point in time. High temporal resolution data might have observations at a very high frequency. Some, such as stock data, may be recorded by the minute or second. In economics, quarterly data are common, and for macrodata at the country level, the temporal resolution is typically yearly. These properties will inform the kinds of analysis we can do, but also the kinds of conclusions we can draw. Macrodata can only tell us about the behaviours of individuals in a limited way, but they tell us a lot more about how ‘high level’ social properties such as different kinds of social policy, social institutions, and forms of regulation impact other things in the social world such as levels of inequality. Time series data can tell us a lot about change patterns, and about how societies respond to or recover from shock events like disasters or recessions. We can do even more if the data are organised in a panel with multiple countries recorded over time on the same variables. That can help us answer questions like whether changes in political regime have an immediate or more long-term impact on society, or how changes in social policies such as decreasing taxation on the rich may impact levels of inequality.
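Temporal resolution can be seen by collapsing a monthly series into a yearly one. The sketch below assumes the built-in UKDriverDeaths series (monthly road casualty counts for Great Britain) as an example, and the object names monthly and yearly are my own.

```r
monthly <- UKDriverDeaths      # a monthly time series that ships with R
frequency(monthly)             # number of observations per year

# Lower the resolution: sum the months within each year
yearly <- aggregate(monthly, nfrequency = 1, FUN = sum)
frequency(yearly)

length(monthly)   # number of observations at the higher resolution
length(yearly)    # number of observations at the lower resolution
```

The yearly series describes the same period with far fewer data points, which is the trade-off between high and low temporal resolution described above.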

| Data type | Units | Characteristics |
| --- | --- | --- |
| Individual cross-sectional | Individuals | Rectangular data arising from individual surveys administered at a single point in time. One-off surveys and student projects using survey forms are typically individual cross-sectional |
| Macro cross-sectional | Regions or countries | Rectangular data arising from countries or regions observed at a single or common time point. Geographic data from a single census wave, or macrodata from a public data repository such as Eurostat taken at one point in time (i.e. multiple countries but for 2025 only) |
| Longitudinal | Different individuals at different times | Repeat cross-section if new samples are recruited each time. Common in large-scale public social surveys |
| Panel | Same individuals at different times | Panel data that allows for more complex analysis of causation due to information on time-ordering of change |
| Time series | The same country, region, or organisation at different time points | Time series data used in sociology are often annual data arising from government or national statistics |
| Macro longitudinal | Countries observed at multiple points in time | Pooled time series cross-section, but sometimes also referred to as panel or macro-data. Yields more observations and allows for more complex models. Common in quantitative comparative sociology and political economy |
| Network | Relation or link between nodes | Connections between individuals or other entities. The nodes and the links between them define the network structure. Common in social media profile data, follower networks, co-authorship. Underlying units can be individuals, organisations, products, institutions, firms |

Data with individual, human-subject units are especially powerful at helping us figure out the factors that determine things like educational achievement, earnings, or attitudes. They are even more powerful if we combine them with data at higher levels in the form of multilevel data. In education research, this could be data collected on children, but also on the household and school, such as with the Growing up in Ireland survey. Sometimes these data are cross-sectional, where the set represents data recorded at a single point or window in time only. They can also be longitudinal, as is the case with repeated cross-section data like the European Social Survey. Panel data are especially powerful. These sets contain information on the same individuals measured at different points in time. Surveys such as the British Household Panel Survey (now Understanding Society) contain many decades of data, and allow us to study the time-order of causal processes in a way that cannot be achieved with cross-sectional data. Unfortunately they are also incredibly expensive, and thus tend to be quite rare. They are also subject to panel attrition, where respondents can drop out either voluntarily or due to mortality. It is important to be aware of the kind of data you are working with as this will place limits on what kinds of analysis you can perform and conclusions you can draw, as well as what kinds of theory will be needed to interpret your findings.

| Resolution | Data type | Example |
| --- | --- | --- |
| High | Macro cross-sectional | Geographical data such as the Irish Census Small Areas (N=18,919) |
| Medium | Time series (Unit, Time \(Y_{it}\)) | Monthly Road Fatality Statistics (Country, Month) |
| Low | Individual cross-sectional | Small-sample social survey |

02.2 Simple Data Structures: Rectangular Datasets

The simplest kind of dataset is rectangular. It is the most common type of dataset you will encounter when working with individual-level data (i.e. data collected from human subject respondents). If you conduct a survey using an online data collection platform such as Qualtrics or JISC, on a sample of respondents at a single time point, then the data you output will be in rectangular format. You may encounter different terms for this kind of data, but they describe essentially the same thing, where the output data arrives in a file of some kind (usually a spreadsheet in .csv or .xlsx format), with data arranged in rows and columns. You may see this written in shorthand as ‘r x c’. These are the most common data structures in sociology, and most of the datasets you will encounter will be in this format. The USArrests dataset is rectangular, as the columns represent individual variables such as Murder and Assault. The rows represent the individual cases, in this case states including Alabama, Alaska, etc. Cross-sectional data drawn from individuals will also be rendered in rectangular format. An example of a data structure like this is a survey administered to a sample of respondents, where the answers are numerically coded and recorded in a spreadsheet.
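The rows-by-columns shape of USArrests can be checked with a few base R commands. This is just a quick sketch for inspecting a rectangular dataset. Nothing here changes the data.

```r
dim(USArrests)                 # number of rows (cases), then columns (variables)
names(USArrests)               # the variable names held in the columns
head(rownames(USArrests), 3)   # the first few cases, here US states
head(USArrests, 3)             # the first three rows of the rectangle
```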

Consider the sample data below. Here we have a dataset with five variables:

  1. id (respondent id)
  2. type (type of tenure)
  3. inc (annual gross income)
  4. gen (gender)
  5. job (occupation type)

Look also at the variable types. type is a <chr> variable, as it is entered as characters - (i.e. text). inc is a <dbl> variable as it is a type of variable holding numeric values with decimal points. The name comes from ‘double’, and distinguishes this type of data from an integer (<int> in R) which is a whole number.

## # A tibble: 5 × 5
##   id    type              inc gen    job       
##   <chr> <chr>           <dbl> <chr>  <chr>     
## 1 001   Rent           105000 Female Pilot     
## 2 002   Own             55000 Male   Teacher   
## 3 003   Social housing  52000 Male   Teacher   
## 4 004   Own            120000 Female Broker    
## 5 005   Rent            45000 Female Bus driver

The “row x column”, “r x c” or rectangular structure is apparent here. Each of the rows represents a set of responses from one individual. The first row represented by id 001 has a tenure type rent (they rent their home privately), an income inc of 105,000. Their gender gen is Female, and their main occupation is Pilot.
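A dataset like the one printed above can be typed in by hand. The sketch below is an assumption about how that might be done: it uses base R’s data.frame() so that it runs without extra packages, whereas the printed output comes from a tibble, and the object name housing is hypothetical.

```r
housing <- data.frame(
  id   = c("001", "002", "003", "004", "005"),
  type = c("Rent", "Own", "Social housing", "Own", "Rent"),
  inc  = c(105000, 55000, 52000, 120000, 45000),
  gen  = c("Female", "Male", "Male", "Female", "Female"),
  job  = c("Pilot", "Teacher", "Teacher", "Broker", "Bus driver"),
  stringsAsFactors = FALSE      # keep text columns as characters
)

str(housing)          # variable names and their types
housing[1, ]          # the first row: one respondent's full set of answers
typeof(housing$inc)   # how R stores the income variable
```

Quoting the id values keeps the leading zeros, which would be dropped if they were entered as numbers.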

02.3 Complex Data Structures: Panel, Longitudinal, and Network Datasets

Rectangular data are the most common type you will encounter in sociology. They are the kind of data that are generated from more common types of social survey such as a one-off survey of individuals, data from a single year on several countries, or a single census wave. When a further dimension of time is introduced, we get either longitudinal or panel data, depending on the underlying units. When connections between the units in our data are recorded, we get network data. Let’s take a look at panel data, in the form of pooled time series cross-section data. If you retrieve country-level data from Eurostat on a range of countries, where each of those countries has data for different time points (say, 27 European countries from 1990-2025), then we have a panel or pooled time series cross-section dataset. Since the terms are often used interchangeably, we will use macro-panel to refer to time-ordered data on multiple units where human subjects are not involved. The example below shows a common structure for this type of data. It is arranged in this way to help you visualise the structure of the dataset, but it is important to be aware that we will often need to reshape these data to a wide or long format, depending on what we want to do with them, and the programme or command set we are working with. Sometimes we will do this for ease of manipulation. For now, just be aware that panel data can become complex, and there are decisions to take around how you enter and organise the data. This can easily be achieved in R with some code, and we will look at this in later sections.

## # A tibble: 6 × 5
##   country  year population employ depriv
##   <chr>   <dbl>      <dbl>  <dbl>  <dbl>
## 1 AT       2022        9     77.3    4.7
## 2 AT       2023        9.1   77.2    6.9
## 3 AT       2024        9.2   77.4    7.5
## 4 BE       2022       11.6   71.9    4.7
## 5 BE       2023       11.7   72.1    5  
## 6 BE       2024       11.8   72.3    5.1

This is an example of long-format macro-panel data, where the columns represent different variables (country, year, population, employment rate, and deprivation rate), and the rows are organised by country and year. This is a common format for statistical analysis of panel data, and has several advantages over the alternative wide format illustrated below. We will discuss these later; they mainly arise when we wish to break the data down further - say, to obtain separate employment rates for different genders, or to look at the material deprivation rate in each country-year for owners and renters. In wide-format macro-panel data, time is organised horizontally, with the complete series of values for a given country on a single variable represented in each row. The example below shows the population variable for Austria (AT) and Belgium (BE) reshaped into wide format. The tidyr package, part of the tidyverse, provides a neat function, pivot_wider, that will do this for you, so don’t worry about having to figure this out from scratch.

## # A tibble: 2 × 4
##   country `2022` `2023` `2024`
##   <chr>    <dbl>  <dbl>  <dbl>
## 1 AT         9      9.1    9.2
## 2 BE        11.6   11.7   11.8
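As a minimal sketch of how the wide table above could be produced, the code below re-enters the long-format example by hand and reshapes the population variable with pivot_wider. The object names panel_long and population_wide are my own choices for illustration, and the sketch assumes the tibble, dplyr and tidyr packages are installed.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Long-format macro-panel data, typed in from the example table in the text
panel_long <- tribble(
  ~country, ~year, ~population, ~employ, ~depriv,
  "AT",     2022,   9.0,        77.3,    4.7,
  "AT",     2023,   9.1,        77.2,    6.9,
  "AT",     2024,   9.2,        77.4,    7.5,
  "BE",     2022,  11.6,        71.9,    4.7,
  "BE",     2023,  11.7,        72.1,    5.0,
  "BE",     2024,  11.8,        72.3,    5.1
)

# Keep one variable, then spread the years across the columns
population_wide <- panel_long %>%
  select(country, year, population) %>%
  pivot_wider(names_from = year, values_from = population)

population_wide
```

Selecting a single variable first keeps the wide table readable. Pivoting several variables at once is possible, but it produces one column per variable-year combination.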

The wide format shown above is how much of the data from tools such as Eurostat arrives, but you can pivot the data either in the website database tool itself, or afterwards in R. Reshaping is a step that causes some confusion when working with time series data, and it can be difficult to visualise the result before making the change. So it is important to at least be aware of the distinction now as you browse for your own data, and to understand that you have some choices to make when it comes to formatting your data. This is an important type of data because it can be used effectively and descriptively even if you do not plan to do a ‘full’ quantitative analysis. Time series data can give context to a qualitative project by describing change patterns and key change points in a variable of interest. They can be used to compare countries over time, to establish whether the country in which you are collecting your data or conducting your case study is different from or similar to others. You can trace the development or impact of social policies on things like unemployment, gender equality, or quality of life, or check what conditions were like in your case study area 10, 20, possibly 60 years ago at the touch of a button. As such, these datasets can enhance your case studies by allowing you to contextualise them both comparatively (in relation to other countries or regions), and longitudinally (by describing or visualising change processes within the country or region itself).
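Going the other way, from wide to long, can be sketched with pivot_longer from tidyr. The code below re-enters the wide population table from the text by hand. The object names are my own, and the sketch assumes the tibble and tidyr packages are installed.

```r
library(tibble)
library(tidyr)

# Wide-format population data, typed in from the example table in the text
population_wide <- tibble(
  country = c("AT", "BE"),
  `2022`  = c(9.0, 11.6),
  `2023`  = c(9.1, 11.7),
  `2024`  = c(9.2, 11.8)
)

# Gather every column except country into year/population pairs.
# names_transform turns the year labels from text into whole numbers.
population_long <- pivot_longer(
  population_wide,
  cols = -country,
  names_to = "year",
  values_to = "population",
  names_transform = list(year = as.integer)
)

population_long
```

The result has one row per country-year, the same layout as the long-format example earlier in this section.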

03. Loading data into R

03.1 Data Formats and Sources

There are many ways to get data into R. The most common methods we will use involve importing data from spreadsheets (.csv), from Stata format datasets (.dta), or from SPSS (.sav). Data that we retrieve from public data repositories such as the World Bank Databank or Eurostat are usually downloaded as spreadsheets in native Excel format (.xlsx) or the more generic Comma-Separated Values (.csv). Both are spreadsheet formats, but .csv is plain text, which makes it more readable and more easily exchanged between different machines and programmes. Later we will explain how to format and prepare these in a way that makes them easily readable in R. Many agencies now supply data through their own Application Programming Interface (API), and some have developed custom packages in R that streamline the process. These allow us to import data directly from their websites in a machine-readable format. We may also receive datasets from other projects or researchers that are prepared in the formats they work with. Either way, it is important that we are able to work with a variety of different file types so we can easily begin working on any dataset we might receive.
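As a rough preview of what importing Stata and SPSS files can look like, the sketch below uses the haven package, which I am assuming here rather than taking from the text. To keep it self-contained, it first writes a tiny made-up data frame to temporary .dta and .sav files and then reads them back. In practice you would pass the path of the file you were given.

```r
library(haven)

# A tiny made-up data frame, used only to create files we can read back
example_data <- data.frame(
  id  = 1:3,
  inc = c(105000, 42000, 58000)  # placeholder incomes, not real survey data
)

# Temporary file paths standing in for files supplied by a colleague
dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")

write_dta(example_data, dta_path)  # Stata format
write_sav(example_data, sav_path)  # SPSS format

# Import each file into its own object
from_stata <- read_dta(dta_path)
from_spss  <- read_sav(sav_path)

from_stata
from_spss
```

Excel files (.xlsx) can be read in a similar way with read_excel from the readxl package.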

03.2 Built-in datasets in R

For introductory purposes, R comes with some built-in datasets. Some of these are well known and feature in textbooks, or in the many online examples and tutorials produced by the R community. They are fine for an introduction to code, but their topical coverage is limited. One such dataset is USArrests, which we can load into R without installing additional packages or calling the data from outside the R system. From now on, whenever you see a chunk of R code in the text, the output you should expect to see in RStudio is included beneath it, on lines beginning with ##. The command data("USArrests") below simply loads the dataset, so it prints nothing to the console if it has worked properly.

data("USArrests")

Now that we have some data loaded into our session, we can start to explore it. The dataset contains four variables (columns), and 50 observations (rows, in this case U.S. states). We can get a quick sense of the contents of the dataset with:

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

By default, R gives us the first six rows. If we want 10, we can make a small addition to the code:

head(USArrests, 10)
##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8

We can see, from the results in the console window, that Alabama had a murder arrest rate of 13.2 per 100,000 residents, compared to 9.0 in California. We can take a look at summary statistics for the whole dataset by typing:

summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

The median murder arrest rate across the U.S. states in 1973 was 7.25 per 100,000, and the median assault arrest rate was 159. This is fine for demonstration, and it does give us a good sense of the workflow within R - as above, much of what we will do will involve variations on a simple process. Type a command, point it toward some data, generate the results. Modify the code a little to fine-tune the output, and run the code again.
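The same pattern of small variations applies when we want a single figure rather than the full summary table. The sketch below uses only base R and the built-in USArrests data. The object names n_rows_cols and median_murder are my own.

```r
data("USArrests")

# Number of rows (states) and columns (variables)
n_rows_cols <- dim(USArrests)
n_rows_cols

# Names of the variables in the dataset
names(USArrests)

# The $ operator picks out one column; median() then summarises it.
# This should match the Median shown for Murder in the summary() output.
median_murder <- median(USArrests$Murder)
median_murder
```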

03.3 Loading external data from spreadsheets or Excel files

For cases where you already have your data in a spreadsheet, you can load it from your working directory. The simplest format to do this with is .csv. It does not allow for multiple sheets organised in tabs as .xlsx does, and it is the most interoperable file type: it can be easily and quickly shared and read by users of different platforms. There are cases where alternative types might be used, and this will again depend on organisational needs, but for a lone researcher working on a project they have full control of, it is the best format to start with. We will deal with datasets that may have been supplied to you in Stata or SPSS format in the following sections. A rectangular dataset in a spreadsheet will look something like this. Here, I have a set of country-level data on several variables. This is a macro cross-sectional dataset created in Excel and saved in .csv format, with country units and observations at a single time point (2023). The filename for the dataset is world_bank.csv.

To load this data into my session in R, I will use the read_csv command from the readr package. Let’s revise the proper workflow for this. I start my work sessions by setting my working directory, pointing R to the folder on my computer where my data are stored, and where I will call and send my files.

setwd("C:/Users/eflaherty/iCloudDrive/Teaching/rbook_data_2026")
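Before going further, it can help to confirm that R is pointing where you expect. The following is a minimal base R sketch. The object names are my own, and the file check will simply return FALSE if world_bank.csv is not in the current folder.

```r
# Where is R currently looking for files?
current_dir <- getwd()
current_dir

# Which .csv files are in that folder?
csv_files <- list.files(pattern = "\\.csv$")
csv_files

# Is the file we want to import actually there? (TRUE or FALSE)
has_world_bank <- file.exists("world_bank.csv")
has_world_bank
```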

Next, I load the library I will need to perform the data import. The readr package will simplify this process for us. If this is your first time using the package, you should run the following code.

install.packages("readr")

If you have already installed the package, you only need to load it with library(). The terms ‘package’ and ‘library’ are often used interchangeably. Technically, a ‘package’ is what you install first, and library() is the function that loads an installed package into your session. So, depending on the stage at which you are using it (installation or later loading), you will see readr referred to as both.

library(readr)

Next, we load our data, making use of the object-oriented nature of code in R. We will create an object in the Environment called world_bank that we will fill with our data using the <- operator.

world_bank <- read_csv("world_bank.csv")
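To see the import step run end to end without the course file, the sketch below writes a tiny stand-in .csv to a temporary location and reads it back with read_csv. The column names country and gini and all of the values are invented placeholders, not real World Bank figures. With your own data you would only need the read_csv line.

```r
library(readr)

# Write a tiny stand-in file to a temporary location.
# The values are invented placeholders, not real World Bank figures.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("country,gini",
    "Country A,30.1",
    "Country B,41.5",
    "Country C,35.0"),
  csv_path
)

# Import the file into an object, exactly as we would with world_bank.csv
toy_data <- read_csv(csv_path)

# Quick checks that the import worked as expected
dim(toy_data)    # rows and columns
names(toy_data)  # variable names
head(toy_data)   # first few rows
```

read_csv also prints a short message describing the type it guessed for each column, which is worth reading after every import.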

This object-oriented process will be very useful later on. If we want to subset our data (select only some countries, for example), we can create a new object (dataset) in our Environment containing just these countries. This is useful when we want a lasting, changed version of the data to work with, while the original object remains intact. Say we only want to work with EU countries in our project. Then we can repeat the process by writing something like world_bank_eu <- ... where the code after the <- defines the selections we would like to make. Manipulating your data in this way is one of the most important aspects of data analysis, and in the wider data science/analytics literature you will see many references to piping data in various ways. You may have encountered SQL (often pronounced ‘sequel’), a query language built on logical and mathematical operators that is used to pull data from database sources into analytics software. R has a useful collection of packages called the tidyverse, which includes dplyr, a selection grammar of its own that allows us to do this in a consistent way.

Situations arise often in sociology where you will want to make selections, filter, or subset your data. If we wanted to limit our time series analysis to everything from 2000 onward instead of the full range of values in our set, to focus our analysis of the impact of unemployment on economic hardship on those aged 30-40, or to compare fear of crime in cities with that in rural areas, this would all involve ‘piping’ our data in some way. Later, we will make extensive use of the pipe operator %>% to select these rows before passing the data to our command. I like the intuitive nature of the pipe, as you can visualise the data being passed through an imaginary pipe in which it is manipulated, emerging from the other end transformed in some way.
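The kind of subsetting described above can be sketched with dplyr and the pipe. Since world_bank.csv is not reproduced in the text, the example re-enters the small macro-panel table from section 02.3 by hand. The object names panel_at and panel_recent are my own, and the sketch assumes the tibble and dplyr packages are installed.

```r
library(tibble)
library(dplyr)

# Long-format macro-panel data, typed in from the example table in section 02.3
panel_long <- tribble(
  ~country, ~year, ~population, ~employ, ~depriv,
  "AT",     2022,   9.0,        77.3,    4.7,
  "AT",     2023,   9.1,        77.2,    6.9,
  "AT",     2024,   9.2,        77.4,    7.5,
  "BE",     2022,  11.6,        71.9,    4.7,
  "BE",     2023,  11.7,        72.1,    5.0,
  "BE",     2024,  11.8,        72.3,    5.1
)

# Keep only the rows for one country, stored as a new object
panel_at <- panel_long %>%
  filter(country == "AT")

# Keep 2023 onward, and only the columns we need
panel_recent <- panel_long %>%
  filter(year >= 2023) %>%
  select(country, year, employ)

panel_at
panel_recent
```

In both cases panel_long itself is unchanged. Each pipe chain reads from it and stores the result under a new name.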

04. Summary of the Workflow Process in R

We end with a review of the workflow of a typical R session. Beginning by opening a new .R script, we set our working directory, load our required libraries, and run some basic code to inspect our data. We will look at some more efficient ways to do some of these tasks later on. For the remainder of the course, the workflow will follow this pattern, and you will always start with these steps. Remember to ensure that the directory you have pointed R to in the setwd() code below is the one that contains all of the files you will need for that particular session.

  1. Set your working directory.
setwd("C:/Users/eflaherty/iCloudDrive/Teaching/rbook_data_2026")
  2. Load your libraries, provided they have been installed using install.packages() first.
library(readr)
library(ggplot2)
  3. Load some data into your work session, in this case from a spreadsheet of data titled ‘world_bank.csv’.
world_bank <- read_csv("world_bank.csv")
  4. Look at summary statistics for all variables in the dataset.
summary(world_bank)
  5. Produce a basic boxplot using ggplot2. Look carefully at the code below. Here, we call on our dataset ‘world_bank’ and produce a plot from a specific variable called ‘gini’.
ggplot(world_bank, aes(y = gini)) +
  geom_boxplot()