Week 1: The Structure and Workflow of R
01. An Overview of the R Environment for New Arrivals
01.1 What is different about working with code in R
Establishing good workflow practices from the very beginning is essential to working successfully in R. Good workflow is achievable by putting some time and effort in at the very beginning to embed a routine. If you follow this routine every time you work, the long-term payoff will be big. You will be able to reproduce your analysis very quickly from a complete record of each step you took in your project. If you need to update your data, you can quickly reproduce all of your output without having to manually re-run each procedure individually. No more re-running graphs and models, saving them separately, and re-compiling them into your written document. You will not need to remember, or record in an unsystematic way, how to do each procedure in your data analysis (where to click, in what order), since all will be coded and saved in your script or document. Dialog-based systems such as SPSS are unhelpful in this regard, as every procedure comes with small variations in the windows you must click through to make the action happen. Stata is a good alternative as it operates on a syntax that is optimised for econometrics - the family of statistical techniques currently dominant in economics and sociology. There are differences in the language that these programmes use.
In Stata, a regression analysis with three variables would involve typing the following code.
regress y1 x1 x2
In SPSS, we would use:
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT y1
/METHOD=ENTER x1 x2.
In R, using the tidyverse conventions, we would type:
library(broom)  # tidy() comes from the broom package
model1 <- lm(y1 ~ x1 + x2, data = USArrests)  # y1, x1, x2 are placeholder names
tidy(model1)
Some differences are more apparent. In SPSS, we often need to be more
detailed with the conditions we specify at the start of the procedure.
In Stata, the language is cleaner, in that we identify the command
‘regress’ and a dependent variable (y1) plus two predictors (x1, x2).
One of the advantages we will find with R is that we assign the model
output to an object that lives in our ‘Environment’ window. We can then
recall it, and manipulate its contents if we wish. We can do this in
other programmes too, but it is a little more opaque. The code above is
a good example of R’s object-oriented nature. When we work in R, we will
often create objects (in the above example, model1), that
we then populate with output according to the instructions we place
after it. The literal nature of R code also makes it more usable. The
use of the <- operator, for example, is quite literally
telling R to fill the object model1 with the results of the
linear model produced from y1 ~ x1 + x2, using the data
USArrests. Chaining these steps together gives us this.
model1 <- lm(y1 ~ x1 + x2, data = USArrests)
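To see what it means for the model to live in an object, here is a minimal sketch that can be run as written. It assumes we swap the placeholder names y1, x1 and x2 for real columns of the built-in USArrests data (Murder, Assault and UrbanPop). That choice of variables is made here for illustration only.

```r
# Fit a linear model and store the result in an object called model1.
# Murder, Assault and UrbanPop are real columns of USArrests, chosen
# here only to stand in for the placeholders y1, x1 and x2.
model1 <- lm(Murder ~ Assault + UrbanPop, data = USArrests)

# The object now sits in the Environment; we can ask what kind of thing it is
class(model1)

# ... list the pieces stored inside it
names(model1)

# ... and pull out one piece, here the estimated coefficients
model1_coefs <- coef(model1)
model1_coefs
```

Because the results are stored rather than just printed, any later line of the script can reuse model1 without fitting the model again.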
Don’t worry too much about the terminology or technique here; focus on the structure of the code. We will be able to use these properties of R code to our advantage later on.
Compared to SPSS, the syntax of a programme like Stata is superior, but both carry one large disadvantage - their cost. When you register for a programme of study such as an undergraduate or master’s degree, you typically gain access to whatever suite of computer programmes your education institution has subscribed to. Often you will lose access to these once your registration expires. The cost of this can be prohibitive, especially as software providers across all industries are moving away from ownership toward subscription-based distribution. An annual subscription to the base edition of SPSS (without add-ons) currently runs to over €1400. R is both free and open-source, meaning there is no installation or subscription cost aside from a computer to run it on, and it is infinitely expandable by the user community. This is a considerable advantage as it means new procedures are added all the time by experts who work on everything from new statistical methods, to more refined ways of doing common tasks.
This is fine if the choice is yours, but that is not always the case. Preference for statistical software can vary from place to place, with programmes like SAS more commonly used across North America. It also varies by discipline, with SPSS historically more common in psychology and marketing, and Stata more frequently used in economics. Your choice of platform will sometimes be dictated by the existing skill base and preferences of your teacher, or your organisation. This is becoming less of an issue as the user base for proprietary software shrinks. The shift away from software ownership to subscription is also fast becoming less attractive for users who want to own, and not rent, the tools of their trade.
If nothing else, one of the main attractions for R as a teaching tool for me is that my students can still access it once they graduate. But - and this is a big one - it does come with a learning curve. When you first start RStudio, the window you see looks unfamiliar. If you have come to a sociology programme, this is probably not what you subscribed to at all. There are 16+ tabs in the default window, various pull-down menus, and it drops you in with no immediate direction. This can be intimidating, and often prompts a degree of anxiety. You didn’t want to learn to code, and that is fine because our job as social scientists does not require us to be professional coders. Keep in mind that our objective here is to make the software produce results that are interesting, informative, and that ultimately help us draw conclusions that are sociologically relevant, and you will be fine.
Let’s start with the basics for now, with some understanding of why
we should work in a platform like this, and the level of expertise we
are aiming for. If you are coming to R with some experience on other
platforms such as SPSS, Stata, or SAS, you may have some familiarity
with the workings of code through SPSS or Stata syntax. This will serve
you well, as the general workflow will start to feel familiar after a
while. The general framework for many procedures is the same: write a
piece of code that will do something to your data. Maybe one that
calculates some summary statistics, or produces a plot. Point that
command toward some data, tell it which variables you want it to work
on, and then execute it. If I wanted to produce a table of summary
statistics for all variables in a dataset called gaming, I
could write it like this.
summary(gaming)
Textbooks on SPSS or Stata will refer to the initial part of the code
as a command. We can adopt the same logic here. The ‘command’
(summary) will apply to the name of the dataset supplied
between the brackets (gaming), and produce a table of
summary statistics. If I save this command/code in a file such as an R
Script (.R), I can come back to my project at any point and re-run the
same code, to produce exactly the same results. I can share the exact
steps of my analysis by sharing the script with a collaborator or
colleague. If someone wanted to check my work, assuming they also have
access to my data, they can do so instantly with a copy of the code.
This satisfies several important principles of good practice:
reproduction, sharing, validation, and repetition. So far not too
different to how it works in other programmes in terms of workflow. The
main difference (and it is an important one) is that we have removed the
intermediary software corporation, holding our capacity to do good
social science behind a paywall. This is not the only reason, as the
section on packages and libraries explains further on. R has a large
community of users and coders, who create their own custom packages and
release them free to use. Some, such as
tidyverse and the ggplot2 graphics
environment, are so ubiquitous that they are near-standard tools in social research now.
The advantages of the community-based nature of R will become apparent
as you work through the course. The point for now is that we can do all
of our work, with code, from inside the R environment. We will not need
to go outside of RStudio to acquire these packages, because they are
integrated into the R environment and can be called into use as
needed.
01.2 Packages and Libraries
Packages and libraries are central to the R experience. They are at the core of the R workflow, and define a part of what makes R efficient to work in. Packages provide us with custom-written code, and sometimes accompanying data, that perform a specific function not provided for in ‘base R’. This is the basic version of R that you get on first installation, and we will distinguish between ‘base R’ both as a programming language in itself, and as a particular state in which R operates prior to the installation of packages. A note on terminology: you may have noticed that we used the terms package and library interchangeably in places. Both terms describe something that we will load into R to provide additional functionality. That is all that a package/library does. By definition, a ‘package’ is the collection of code and data that makes a new function available in R. A ‘library’ is the name given to the function in R that loads the contents and features of a package into a work session. The ‘library’ is also the place where this collection of additional code lives. This is all definitional, and better illustrated with an example.
In the later section on preliminary data inspection, we will call on
a package that will streamline the process of producing summary
statistics for us, allowing us to generate print and report-ready tables
easily. One of these packages is called vtable and was
developed by Nick Huntington-Klein, an
Economics Professor at Seattle University. Using the workflow of code,
we would call this package into use as needed, first by installing it
using the code:
install.packages("vtable")
Then, when we want to use the functions of this package in future
work sessions, we call on its functions using library to
make its features available to us:
library(vtable)
You will encounter this kind of action in your workflow quite often
when working in R. It is also what makes the open source nature of R,
and the wider community of users and function developers, most apparent.
By default, R will not start a new session with all additional packages
loaded into memory. We will install each package only once using
specific code (install.packages as shown above). After
installation, we need only use library(package) to make the
functions of that package available to us. We can also access help pages
for our packages by using the help command. For
example,
help(vtable)
Again, working with code in a script means we can retrace each of our steps, and eventually, write our own sets of project setup commands that load only the libraries we need for each of our work sessions. There is quite a lot of terminology here, and some of this is merely incidental - the distinction between a library and a package matters mainly to us in terms of the sequence in which we run either of these commands.
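A project setup block of the kind described above might look like the following sketch. The helper name load_or_install is hypothetical (it is not part of R or of any package); it relies only on the base functions requireNamespace(), install.packages() and library(). It is demonstrated on stats, which ships with every R installation, so nothing is downloaded when you try it.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is not yet installed
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(pkg)
}

# 'stats' is part of every R installation, so this call installs nothing
load_or_install("stats")

# Check that the package is now attached to the session
stats_loaded <- "package:stats" %in% search()
stats_loaded
```

In a real script you might call the helper once for each package the project needs, for example load_or_install("vtable"), which keeps the install-once, load-every-session sequence in one place.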
Table - package, library, function, command, syntax
| Term | Description | Examples |
|---|---|---|
| Package | A collection of code providing new functionality in R. | ggplot2 for graphics and data visualisation, dplyr for data manipulation. install.packages("ggplot2") is used to install the package ggplot2. |
| Library | The location in which the packages are stored. | After installation, a package is referred to as a library. Both terms identify essentially the same thing - a specific new function in R. In your work session in R, library(ggplot2) is used to make the functions of the package available in your work session. |
| Function | ‘Function’ can refer either to a mathematical function or to a specific operation in R. | The dplyr package provides several functions including filter() that allows you to select subsets of cases based on specific values. |
| Command | ‘Command’ is the term used to describe keyword operators in programmes such as SPSS or Stata, that execute a particular function. | We can refer to a procedure such as summary(gaming) as executing a command. The ‘command’ in this case is summary. |
| Syntax | The flavor of grammar that a particular programme employs. | SPSS and Stata have their own individual syntax, a structure of writing code that allows commands to run. The syntax will have specific rules about keywords, use of special characters, or the order in which things must be written to successfully execute. |
01.3 How much math do I need to know?
Most of us who come to sociology did so to avoid mathematics. Math avoidance in sociology was such a big concern that in 2014 the Nuffield Foundation launched the Q-Step programme in the UK. The initiative established 17 centers across UK universities with the goal of embedding quantitative training and general numeracy in core social science programmes. So if you are feeling hesitant about math, you are not alone. Luckily, this is not a maths or statistics or programming course. Much of what we will do here with statistics is develop just enough fundamental knowledge to help us understand what is going on behind the code once we run a test, or produce a graphic. We do not need to understand the fundamentals in great detail - this is for another discipline. You can drive a car your whole life without understanding the physics of internal combustion, or the chemistry of the lithium-ion battery. Even where we will consider mathematical formulae, it might be more helpful to think of these as sets of instructions rendered in a very specific language. Consider the formula for calculating a standard deviation, a common measure of variability.
\[sd = \sqrt{\frac{\sum (y_i-\bar{y})^2}{n-1}}\]
If you read an introductory statistics book, which this is not, you will come across formulae like this early on. This is the kind of thing that provokes fear in those of us who got into sociology to escape mathematics. I am one of those people. But what if we reframe this not as something to be ‘solved’ but as a series of steps to follow? When we break it down into its component parts, we find a series of simple calculations performed for as many points of data as we have in our dataset.
It is telling us to take the original value from each case in our variable (\(y_i\)), and subtract the variable’s overall mean (\(\bar{y}\)) from it.
We repeat this for all of our data points, and square each result to remove the negative signs. What we get, once we add together all of these squared distances from each original value to the mean, is a figure that satisfies this part of the formula: \(\sum (y_i-\bar{y})^2\).
To get the average squared distance from each value on this variable to its mean for the entire dataset, we divide by the sample size minus one (\(n-1\)). This gives us \(\frac{\sum (y_i-\bar{y})^2}{n-1}\).
Finally, we take the square root of this figure to get our standard deviation. This gives us \(sd = \sqrt{\frac{\sum (y_i-\bar{y})^2}{n-1}}\).
A low standard deviation relative to the mean of a variable indicates that the points are gathered close together. We might be more confident that the mean is representative of a typical score in the dataset, than for a variable with a large standard deviation relative to the mean.
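The steps above can be followed line by line in R. The sketch below assumes the Murder column of the built-in USArrests data as the variable y. Any numeric variable would do, and the object names are chosen here for illustration.

```r
# Murder column of the built-in USArrests data, used only as an example
y <- USArrests$Murder

deviations  <- y - mean(y)                    # each value minus the mean
squared     <- deviations^2                   # square each distance
sum_squares <- sum(squared)                   # add the squared distances
variance    <- sum_squares / (length(y) - 1)  # divide by n - 1
sd_by_hand  <- sqrt(variance)                 # take the square root

# Compare with R's built-in function; the two should agree
sd_by_hand
sd(y)
```

Each line corresponds to one part of the formula, which is the sense in which a formula can be read as a set of instructions.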
What you may often encounter when reading published research based on quantitative analysis is a model specification. This is often a more formal way of writing how a piece of analysis was conducted, how effects were derived, and what additional adjustments were made. It is important, because for the research to be critically assessed, we need to be precise about what steps were taken. Consider the following specification
\[y=a+B_1x_1 + B_2x_2\]
This is another way of writing a linear regression model with two independent variables and one dependent variable. Now let’s interpret this practically. We need to know what the various terms mean. In this case, \(y\) is a given value of the dependent variable. Let’s say that this is life expectancy. And in this model, which we have ‘fit’ to our data, we hypothesise that \(x_1\) is per capita gross national income, and \(x_2\) is health spending as a percentage of Gross Domestic Product. The specification is telling us that, in the form in which we have specified it in our data analysis, we have calculated by how much average life expectancy changes (in years) with each unit increase in per capita national income (\(x_1\)) and in health spending (\(x_2\)). Another way to say it would be, for this type of model and within the limitations of our data, we examine the extent to which life expectancy changes with respect to unit increases in both. The mathematical specification becomes another way for us to communicate - with greater clarity than written word and in a consistent format - what exactly we did in our analysis. Once you can begin to reframe the function of mathematical writing in social science research as a communication tool, it will become less intimidating.
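To connect the specification to code, the sketch below fits a model of the same shape and then applies the equation by hand. No life expectancy data are loaded at this point, so it assumes stand-in variables from the built-in USArrests data: Murder plays the role of \(y\), Assault of \(x_1\), and UrbanPop of \(x_2\). The substantive choice is arbitrary; only the structure matters.

```r
# Stand-in variables from the built-in USArrests data (an assumption for
# illustration): y = Murder, x1 = Assault, x2 = UrbanPop
fit <- lm(Murder ~ Assault + UrbanPop, data = USArrests)

a  <- coef(fit)[["(Intercept)"]]  # the intercept, a
b1 <- coef(fit)[["Assault"]]      # B_1, the change in y per unit of x1
b2 <- coef(fit)[["UrbanPop"]]     # B_2, the change in y per unit of x2

# Apply y = a + B_1*x_1 + B_2*x_2 by hand for one state
x1 <- USArrests["Alabama", "Assault"]
x2 <- USArrests["Alabama", "UrbanPop"]
y_hat_by_hand <- a + b1 * x1 + b2 * x2

# R's own fitted value for the same state should match
y_hat_from_r <- fitted(fit)[["Alabama"]]
c(by_hand = y_hat_by_hand, from_r = y_hat_from_r)
```

The two numbers agree because the fitted value is nothing more than the equation evaluated with the estimated coefficients.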
Some insist that you should undertake a full programme of training in econometrics before calling yourself a competent analyst. This might be true of economics, but not sociology. We are at our best when we can move and communicate across multiple methodologies. This doesn’t mean you can’t become expert in one, but your abilities as a sociologist will be greatly enhanced by your skillset diversity. That said, I do believe that all sociologists should be proficient in quantitative methods to some degree. Especially so, as much of what we claim rests on some claims to quantity or change, yet we seem to either ignore or deny the measurement and computation that underpins this. References to inequality are founded on observations, the extent of which only became fully apparent once we began to measure and collect data systematically at a population level. Equally, we cannot explain the origins and reproduction of these inequalities through counting alone, hence the bizarre state of a discipline that remains, to some extent, divided on methodological grounds. More of this later.
01.4 Scripts, directories, and folders
Getting to this point of producing output requires embedding some habits. The very first thing you should do in any R session is tell R which folder you are going to work from for a particular work session. This will be your ‘working directory’ until you either end the session, or point R to another location on your computer. For our purposes we can use the terms ‘folder’ and ‘directory’ interchangeably. We are not programmers and the distinction is of little practical consequence for us. Working from a directory and writing your work in a script will allow you to open, save, and export all of your work to a single location. Whenever you start a new session or project in R, make sure you have saved all of the data files you will need for that project into a single folder. Give it a recognisable name, and name it according to some convention or hierarchy.
Once you have installed RStudio, start by opening a new R Script file
by navigating to File - New File - R Script. When you
encounter code, it will be identified with formatting
like this. Where these lines of code are set off in lines
of their own, like the code below, you should copy them to your own R
Script within RStudio and run them yourself. Follow along with these
examples as you read. In the example code below, I set my working
directory to a folder on my iCloud drive titled ‘rbook_data_2026’. You
should replace the address between the “” with the address for the
folder containing your data. This is an address in Windows format, and
this will differ for Mac.
setwd("C:/Users/eflaherty/iCloudDrive/rbook_data_2026")
You can locate the address in Windows by inspecting the properties of your folder. Right-click on the folder and click ‘Properties’ to open the Properties window. From here you can locate the address. You can also get the address directly by right-clicking on the folder icon for the folder that you will use for your working directory, and clicking Copy as path from the menu that appears. This will place the address into your clipboard, and you can then return to R and paste it into your R Script.
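One practical wrinkle: a path copied from Windows uses backslashes, while R treats a backslash inside quotes as a special character and expects forward slashes instead (as in the setwd() example above). You can edit the slashes by hand, or let R do it. The sketch below assumes a hypothetical folder path and uses only the base function gsub().

```r
# A path as Windows "Copy as path" would give it. The folder is hypothetical.
# Inside an R string every backslash has to be doubled to be read literally.
win_path <- "C:\\Users\\yourname\\Documents\\rbook_data_2026"

# Swap each backslash for a forward slash, which R accepts on Windows
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path

# The cleaned path could then be passed to setwd(), e.g. setwd(r_path)
```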
To confirm that R is indeed looking in this folder for files, we can ask it to check or ‘get’ our working directory.
getwd()
Or we can ask it to list all of the files in our working directory to be absolutely sure, but also to confirm the names of files we may wish to import.
list.files()
Establishing this kind of discipline with your folders, files,
filepaths, and file names, is very important for keeping your work
traceable, and reproducible. Give your files and folders names that will
identify their content, and their date. Use sub folders, and try to use
a consistent naming convention. If you have followed the conventions so
far and have set up your first R Script to do so, you should be looking
at something like this. You can use # to identify text in
your script that is not code. Whenever you want to write a note, give
something a heading, or just detail what you were thinking and doing
with a particular piece of code, you can use this to make notes that R
will bypass if you attempt to run the script. It will also clearly
distinguish your notes in your script, making it easier to read.
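As a sketch of how such a script might be laid out, the example below uses # for a heading and for notes. So that it runs on any computer, it assumes a temporary folder created with tempdir() in place of a real project folder. In your own script you would put the path to your own folder inside setwd().

```r
# ---------------------------------------------
# Week 1 practice script (illustrative layout)
# ---------------------------------------------

# Remember where R was looking before we change anything
old_wd <- getwd()

# Set the working directory. tempdir() is a stand-in for a real project
# folder; replace it with the path to your own folder.
setwd(tempdir())

# Note to self: confirm that R is looking in the right place
current_wd <- getwd()
current_wd

# List the files R can see in this folder
files_here <- list.files()
files_here

# Return to the original folder (only needed for this demonstration)
setwd(old_wd)
```

Every line beginning with # is skipped when the script runs, so notes can be as long or as informal as you find useful.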
03. Loading data into R
03.1 Data Formats and Sources
There are many ways to get data into R. The most common methods we will use involve importing data from spreadsheets (.csv), from Stata format datasets (.dta), or from SPSS (.sav). Data that we retrieve from public data repositories such as the World Bank Databank or Eurostat are downloaded initially as spreadsheets in native Excel format (.xlsx) or the more generic Comma-Separated Values (.csv). Both are formats of spreadsheet, but .csv is more readable and interchangeable between different machines and programmes. Later we will explain how to format and prepare these in a way that makes them easily readable in R. Many agencies now supply data through their own Application Programming Interface (API), and some have developed custom packages in R that streamline the process. These allow us to import data directly from their websites in a machine readable format. We may also receive datasets from other projects or researchers that are prepared in the formats they work with. Either way, it is important that we are able to work with a variety of different file types so we can easily begin working on a dataset we might receive.
03.2 Built-in datasets in R
For introductory purposes, R comes with some built-in datasets. Some
of these are well known and feature in textbooks, or the many online
examples and tutorials that accompany the R community. They are fine for
an introduction to code, but their topical coverage is limited. One such
dataset is the USArrests dataset that we can load into R
without installing additional packages or calling the data from outside
the R system. From now on, whenever you see a chunk of R code in the
text here, the output you should see in RStudio will be included. So
for the command data(USArrests) below, I include the output
you should expect to see also if the command or code has worked
properly.
data(USArrests)
Now that we have some data loaded into our session, we can start to explore it. The dataset contains four variables (columns), and 50 observations (rows, in this case U.S. States). We can get a quick sense of the contents of the dataset with head(USArrests):
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
By default, head() gives us the first six rows. If we want 10, we can make a small addition to the code, head(USArrests, 10):
##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8
We can see, from the results in the console window, that Alabama had a murder rate of 13.2 per 100,000, compared to 9.0 in California. We can take a look at the averages for the whole dataset by typing summary(USArrests):
##      Murder          Assault         UrbanPop          Rape
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
Across U.S. states in 1973, the median murder rate was 7.25 arrests per 100,000 residents, and the median assault rate was 159. This is fine for demonstration, and it does give us a good sense of the workflow within R - as above, much of what we will do will involve variations on a simple process. Type a command, point it toward some data, generate the results. Modify the code a little to fine-tune the output, and run the code again.
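The same figures can be pulled out individually rather than read off the printed tables. The sketch below uses only base R functions on the built-in USArrests data; the object names are chosen here for illustration.

```r
# Size of the dataset: rows (states) and columns (variables)
dims <- dim(USArrests)
dims

# Count the rows head() returns by default and when asked for ten
n_default <- nrow(head(USArrests))
n_ten     <- nrow(head(USArrests, 10))

# Pick out a single cell by row name and column name
alabama_murder <- USArrests["Alabama", "Murder"]

# Compute one summary figure directly rather than reading it off the table
median_murder <- median(USArrests$Murder)

c(alabama = alabama_murder, median = median_murder)
```

This is the "modify the code a little and run it again" pattern in miniature: each line asks one narrow question of the same data.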
03.3 Loading external data from spreadsheets or Excel files
For cases where you have your data already in a spreadsheet, you can
load it from your working directory. The simplest format to do this with
is .csv. It is a format that does
not allow for multiple sheets organised in tabs like
.xlsx, and it is the most interoperable file type that
can be easily and quickly shared and read by users of different
platforms. There are cases where alternative types might be used and
this will again depend on organisational needs, but for a lone
researcher working on a project they have full control of, it is the
best to start with. We will deal with datasets that may have been
supplied to you in Stata or SPSS format in the following sections. A
rectangular dataset in a spreadsheet will look something like this.
Here, I have a set of country-level data on several variables. This is a
macro cross-sectional dataset written in Excel in .csv format, with
country units and observations at a single time point (2023). The
filename for the dataset is world_bank.csv.
To load this data into my session in R, I will use the
read_csv command from the readr package. Let’s
revise the proper workflow for this. I start my work sessions by setting
my working directory, pointing R to the folder on my computer where my
data are stored, and where I will call and send my files.
setwd("C:/Users/eflaherty/iCloudDrive/Teaching/rbook_data_2026")
Next, I load the library I will need to perform the data import. The
readr package will simplify this process for us. If this is
your first time using the package, you should run the
following code.
install.packages("readr")
If you have already installed the package, you only need to load the
library. These terms are used interchangeably, and technically a
‘package’ is what you install first, before loading the ‘library’. So,
depending on what stage you are using it (installation or later
loading), readr can be referred to as both.
library(readr)
Next, we load our data. But we do this using the object-oriented
nature of code in R. We will create an object in the Environment called
world_bank that we will fill with our data using the
<- operator.
world_bank <- read_csv("world_bank.csv")
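If you do not yet have a .csv file to hand, the import step can be rehearsed with a file that R creates itself. The sketch below is an assumption-laden stand-in: it writes the first rows of the built-in USArrests data to a temporary file with the hypothetical name example_data.csv, and reads it back with read.csv(), the base R function, so that it runs without any packages. With readr loaded, read_csv() is called in the same way.

```r
# Write the first rows of a built-in dataset to a temporary .csv file,
# so there is a file to import (a stand-in for world_bank.csv)
csv_path <- file.path(tempdir(), "example_data.csv")
write.csv(head(USArrests), csv_path, row.names = FALSE)

# Import the file and store it in a new object, as in the text.
# read.csv() is the base R function; readr's read_csv() is used the same way.
example_data <- read.csv(csv_path)

# Inspect the imported object
dim(example_data)
names(example_data)
```

Once the line runs, example_data appears in the Environment window, just as world_bank would.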
This object-oriented process will be very useful later on. If we want
to subset our data (select only some countries for example) we can
create a new object (dataset) with just these countries in our
Environment. This will be useful if we want a lasting, changed copy of the
data while keeping the original intact. Say, we only want to work with EU countries
in our project. Then we can repeat the process, by writing something
like world_bank_eu <- ... where the code after the
<- defines the selections we would like to make.
Manipulating your data in this way is one of the most important aspects
of data analysis, and in the wider data science/analytics literature you
will see many references to piping data in various ways. You may have
encountered SQL (often pronounced ‘sequel’) which is a
grammar of logical and mathematical operators used to pipe data from
database sources into analytics software. R has a useful collection of
packages called tidyverse, which includes a data manipulation grammar of its own,
dplyr, that allows us to do this in a consistent way.
Situations arise often in sociology where you will want to make
selections, filter, or subset your data. If we wanted to limit our time
series analysis to everything from 2000 onward instead of the full range
of values in our set, if we wanted to focus our analysis of the impact
of unemployment on economic hardship but only for those aged 30-40, or
if we wanted to compare fear of crime in cities to that in rural areas,
this would all involve ‘piping’ our data in some way. Later, we will
make extensive use of the pipe operator %>% to select
out these rows specifically before passing the data to our command. I
like the intuitive nature of the pipe, as you can visualise the data
being passed through an imaginary pipe where it is manipulated,
emerging from the other end transformed in some way.
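To give a feel for piping before the packages are introduced, here is a minimal sketch that runs in base R alone. Note the swap: it uses the native pipe |> (available from R 4.1) and the base function subset() as stand-ins for %>% and dplyr's filter(). The idea of passing data from one step to the next is the same. The cut-off of 80 is arbitrary.

```r
# Keep only the states where at least 80% of people live in urban areas.
# The data enter the pipe on the left and are passed to subset().
urban_states <- USArrests |>
  subset(UrbanPop >= 80)

# Pass the reduced dataset straight on to summary()
urban_states |>
  summary()

# The same selection written without a pipe, for comparison
urban_states_nested <- subset(USArrests, UrbanPop >= 80)
identical(urban_states, urban_states_nested)
```

Read aloud, the piped version goes "take USArrests, then keep the urban states, then summarise", which is the order in which we tend to think about the task.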
04. Summary of the Workflow Process in R
We end with a review of the workflow of a typical R session.
Beginning by opening a new .R script, we set our working directory, load
our required libraries, and run some basic code to inspect our data. We
will look at some more efficient ways to do some of these tasks later
on. For the remainder of the course, the workflow will follow this
pattern, and you will always start with these steps. Remember to ensure
that the directory you have pointed R to in the setwd()
code below is the one that contains all of the files you will need for
that particular session.
- Set your working directory
setwd("C:/Users/eflaherty/iCloudDrive/Teaching/rbook_data_2026")
- Load your libraries, provided they have been installed using
install.packages() first.
library(readr)
library(ggplot2)
- Load some data into your work session, in this case from a spreadsheet of data titled ‘world_bank.csv’.
world_bank <- read_csv("world_bank.csv")
- Look at summary statistics for all variables in the dataset
summary(world_bank)
- Produce a basic boxplot using
ggplot2. Look carefully at the code below. Here, we call on our dataset ‘world_bank’ and produce a plot from a specific variable called ‘gini’.
ggplot(world_bank, aes(y=gini))+
geom_boxplot()