Week 1: Establishing Good Workflow

01. Establishing Good Workflow Habits to Future-Proof Your Work

01.1 What is different about working in R

Establishing good workflow practices from the very beginning is essential to working successfully in R. Good workflow is achievable by putting some time and effort in at the very beginning to embed a routine. If you follow this routine every time you work, the long-term payoff will be big. You will be able to reproduce your analysis very quickly from a complete record of each step you took in your project. If you need to update your data, you can quickly reproduce all of your output without having to manually re-run each procedure individually. No more re-running graphs and models, saving them separately, and re-compiling them into your written document. You will not need to remember, or record in an unsystematic way, how to do each procedure in your data analysis (where to click, in what order), since all will be coded and saved in your script or document. Dialog-based systems such as SPSS are unhelpful in this regard, as every procedure comes with small variations in the windows you must click through to make the action happen. Stata is a good alternative as it operates on a syntax that is optimised for econometrics - the family of statistical techniques currently dominant in economics and sociology. There are differences in the language that these programmes use.

In Stata, a regression analysis with three variables would involve typing the following code.

regress y1 x1 x2

In SPSS, we would use:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT y1
  /METHOD=ENTER x1 x2.

In R, using the tidyverse conventions, we would type:

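library(broom)  # tidy() comes from the broom package, which is installed with the tidyverse
model1 <- lm(y1 ~ x1 + x2, data = USArrests)  # y1, x1, x2 are placeholder variable names
tidy(model1)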

Some differences are immediately apparent. In SPSS, we often need to be more detailed with the conditions we specify at the start of the procedure. In Stata, the language is cleaner, in that we identify the command ‘regress’ and a dependent variable (y1) plus two predictors (x1, x2). One of the advantages we will find with R is that we assign the model output to an object that lives in our ‘Environment’ window. We can then recall it, and manipulate its contents if we wish. We can do this in other programmes too, but it is a little more opaque. The code above is a good example of R’s object-oriented nature. When we work in R, we will often create objects (in the above example, model1), that we then populate with output according to the instructions we place after it. The literal nature of R code also makes it more usable. The use of the <- operator, for example, is quite literally telling R to fill the object model1 with the results of the linear model produced from y1 ~ x1 + x2, using the data USArrests. Chaining these steps together gives us this.

model1 <- lm(y1 ~ x1 + x2, data = USArrests)
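Because y1, x1 and x2 are placeholder names rather than columns in USArrests, the line above will not run as written. A minimal sketch that does run, assuming we treat Murder as the outcome and UrbanPop and Assault as predictors (a choice made purely for illustration, and the object names are my own), shows how a stored model can be recalled and inspected:

```r
# Fit a linear model on the built-in USArrests data.
# The choice of outcome and predictors is an assumption for illustration only.
model_murder <- lm(Murder ~ UrbanPop + Assault, data = USArrests)

# The result is stored as an object, so we can recall it by name
class(model_murder)   # the kind of object that lm() returns

# ...and pull out specific parts of it, such as the estimated coefficients
model_coefs <- coef(model_murder)
names(model_coefs)
```

Once the model lives in an object, nothing needs to be re-estimated to look at it again; we simply ask for the part we want.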

Don’t worry too much about the terminology or technique here; focus on the structure of the code. We will be able to use these properties of R code to our advantage later on. Compared to SPSS, the syntax of a programme like Stata is superior, but both carry one large disadvantage - their cost. When you register for a programme of study such as an undergraduate or master’s degree, you typically gain access to whatever suite of computer programmes your educational institution has subscribed to. Often you will lose access to these once your registration expires. The cost of continued access can be prohibitive, especially as software providers across all industries are moving away from ownership toward subscription-based distribution. An annual subscription to the base edition of SPSS (without add-ons) currently runs to over €1400. R is both free and open-source, meaning there is no installation or subscription cost aside from a computer to run it on, and it is infinitely expandable by the user community. This is a considerable advantage as it means new procedures are added all the time by experts who work on everything from new statistical methods, to more refined ways of doing common tasks. Choosing R is fine if the choice is yours, but that is not always the case. Preference for statistical software can vary from place to place, with programmes like SAS more commonly used across North America. It also varies by discipline, with SPSS historically more common in psychology and marketing, and Stata more frequently used in economics. Your choice of platform will sometimes be dictated by the existing skill base and preferences of your teacher, or your organisation. This is becoming less of an issue as the user base for proprietary software shrinks. The shift away from software ownership to subscription is also fast becoming less attractive for users who want to own, and not rent, the tools of their trade.

If nothing else, one of the main attractions for R as a teaching tool for me is that my students can still access it once they graduate. But - and this is a big one - it does come with a learning curve. When you first start RStudio, the window you see looks unfamiliar. If you have come to a sociology programme, this is probably not what you subscribed to at all. There are 16+ tabs in the default window, various pull-down menus, and it drops you in with no immediate direction. This can be intimidating, and often prompts a degree of anxiety. You didn’t want to learn to code, and you certainly don’t want to do any maths. Luckily, this is not a maths or statistics or programming course. Much of what we will do here with statistics is develop just enough fundamental knowledge to help us understand what is going on behind the code once we run a test, or produce a graphic. We do not need to understand the fundamentals in great detail - this is for another discipline. You can drive a car your whole life without understanding the physics of internal combustion, or the chemistry of the lithium-ion battery.

Some insist that you should undertake a full programme of training in econometrics before calling yourself a competent analyst. This might be true of economics, but not sociology. We are at our best when we can move and communicate across multiple methodologies. This doesn’t mean you can’t become expert in one, but your abilities as a sociologist will be greatly enhanced by your skillset diversity. That said, I do believe that all sociologists should be proficient in quantitative methods to some degree. Especially so, as much of what we claim rests on some claims to quantity or change, yet we seem to either ignore or deny the measurement and computation that underpins this. References to inequality are founded on observations, the extent of which only became fully apparent once we began to measure and collect data systematically at a population level. Equally, we cannot explain the origins and reproduction of these inequalities through counting alone, hence the bizarre state of a discipline that remains, to some extent, divided on methodological grounds. More of this later.

01.2 Working with code in R

Let’s start with the basics for now, with some understanding of why we should work in a platform like this, and the level of expertise we are aiming for in this course. If you are coming to R with some experience on other platforms such as SPSS, Stata, or SAS, you may have some familiarity with the workings of code through SPSS or Stata syntax. This will serve you well, as the general workflow will start to feel familiar after a while. The general framework for many procedures is the same: write a piece of code that will do something to your data. Maybe one that calculates some summary statistics, or produces a plot. Point that command toward some data, tell it which variables you want it to work on, and then execute it. If I wanted to produce a table of summary statistics for all variables in a dataset called gaming, I could write it like this.

summary(gaming)

Textbooks on SPSS or Stata will refer to the initial part of the code as a command. We can adopt the same logic here. The ‘command’ (summary) will apply to the name of the dataset supplied between the brackets (gaming), and produce a table of summary statistics. If I save this command/code in a file such as an R Script (.R), I can come back to my project at any point and re-run the same code, to produce exactly the same results. I can share the exact steps of my analysis by sharing the script with a collaborator or colleague. If someone wanted to check my work, assuming they also have access to my data, they can do so instantly with a copy of the code. This satisfies several important principles of good practice: reproduction, sharing, validation, and repetition. So far, this is not too different from how other programmes work in terms of workflow. The main difference (and it is an important one) is that we have removed the intermediary software corporation, holding our capacity to do good social science behind a paywall. This is not the only reason. Later, we will call on packages that streamline the process of producing summary statistics for us, allowing us to generate print and report-ready tables easily. One of these packages is called vtable and was developed by Nick Huntington-Klein, an Economics Professor at Seattle University.

The advantages of the community-based nature of R will become apparent as you work through the course. We can do all of this with code from inside the R environment; we will not need to go outside of RStudio to acquire these packages, because they are integrated into the R environment. By default, R will not start a new session with all additional packages loaded into memory. We will install each package only once using specific code. For the vtable package this would be install.packages("vtable"). Then, when we want to use the functions of this package, we load it as a library to make its features available to us in our R session using library(vtable). By working with code, we can do all of this without going outside of R. We can also access help pages for all of our packages by using the help command. For example, help(vtable). Again, working with code in a script means we can retrace each of our steps, and eventually, write our own sets of project setup commands that load only the libraries we need for each of our work sessions.
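The install-once, load-every-session routine can be wrapped into a small set of project setup commands. The sketch below is one possible way to do this, not a standard R feature: load_or_install is a hypothetical helper name of my own, and the package list uses "stats" (which ships with R) only so that the example runs without an internet connection.

```r
# Hypothetical helper (not part of R): install a package only if it is
# missing, then load it for the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # runs only when the package is missing; needs internet
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Packages this project needs. "stats" is always available, so this runs offline.
# Add "vtable" to the vector once you want to use it.
project_packages <- c("stats")
for (pkg in project_packages) load_or_install(pkg)
```

Placing a block like this at the top of a script means a collaborator can run the whole file without first working out which packages it depends on.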

01.3 Scripts, directories, and folders

Getting to this point of producing output requires embedding some habits. The very first thing you should do in any R session is tell R which folder you are going to work from for a particular work session. This will be your ‘working directory’ until you either end the session, or point R to another location on your computer. For our purposes we can use the terms ‘folder’ and ‘directory’ interchangeably. We are not programmers and the distinction is of little practical consequence for us. Working from a directory and writing your work in a script will allow you to open, save, and export all of your work to a single location. Whenever you start a new session or project in R, make sure you have saved all of the data files you will need for that project into a single folder. Give it a recognisable name, and name it according to some convention or hierarchy.

Once you have installed RStudio, start by opening a new R Script file by navigating to File - New File - R Script. When you encounter code, it will be identified with formatting like this. Where these lines of code are set off in lines of their own, like the code below, you should copy them to your own R Script within RStudio and run them yourself. Follow along with these examples as you read. In the example code below, I set my working directory to a folder on my iCloud drive titled ‘rbook_data_2026’. You should replace the address between the quotation marks with the address for the folder containing your data. The address below is in Windows format; the format will differ on a Mac.

setwd("C:/Users/eflaherty/iCloudDrive/rbook_data_2026")

You can locate the address in Windows by inspecting the properties of your folder. Right-click on the folder and click ‘Properties’ to open the Properties window. From here you can locate the address. You can also get the address directly by right-clicking on the folder icon for the folder that you will use for your working directory, and clicking Copy as path from the menu that appears. This will place the address into your clipboard, and you can then return to R and paste it into your R Script. If you paste a Windows path, change each backslash (\) to a forward slash (/), as in the example above, since R treats a single backslash inside quotation marks as a special character.

To confirm that R is indeed looking in this folder for files, we can ask it to check or ‘get’ our working directory.

getwd()

Or we can ask it to list all of the files in our working directory to be absolutely sure, but also to confirm the names of files we may wish to import.

list.files()
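These checks can be combined into a short routine that confirms a folder exists before switching to it. This is only a sketch: project_dir is a hypothetical name, and it points to R's temporary folder as a stand-in so that the code runs on any machine. You would replace it with the address of your own project folder.

```r
# Stand-in for your own project folder; tempdir() always exists,
# so this sketch runs anywhere. Replace it with your own path.
project_dir <- tempdir()

# Only switch folders if the path really exists; otherwise stop with a clear message
if (!dir.exists(project_dir)) {
  stop("Folder not found: ", project_dir)
}
old_dir <- setwd(project_dir)   # setwd() hands back the previous working directory

# List only the .csv files, the ones we are most likely to import
csv_files <- list.files(pattern = "\\.csv$")
csv_files

setwd(old_dir)   # go back to where we started (only needed in this demonstration)
```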

Establishing this kind of discipline with your folders, files, filepaths, and file names, is very important for keeping your work traceable, and reproducible. Give your files and folders names that will identify their content, and their date. Use sub folders, and try to use a consistent naming convention. If you have followed the conventions so far and have set up your first R Script to do so, you should be looking at something like this. You can use # to identify text in your script that is not code. Whenever you want to write a note, give something a heading, or just detail what you were thinking and doing with a particular piece of code, you can use this to make notes that R will bypass if you attempt to run the script. It will also clearly distinguish your notes in your script, making it easier to read.
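As an illustration of how notes and code sit together, a first script might be laid out along the following lines. The headings, wording of the notes, and object names are my own suggestions rather than a required format.

```r
# Week 1 practice script
# Purpose: check the working directory and see which files are available

# 1. Where is R currently looking for files?
current_folder <- getwd()
current_folder

# 2. Which files are in that folder?
files_here <- list.files()

# Note to self: the count below should match what I see in the folder itself
n_files <- length(files_here)
n_files
```

Every line beginning with # is skipped when the script runs, so notes can be as long or as informal as you find useful.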

01.4 Loading data into R

There are many ways to get data into R. The most common methods we will use involve importing data from spreadsheets (.csv), from Stata format datasets (.dta), or from SPSS (.sav). Data that we retrieve from public data repositories such as the World Bank Databank or Eurostat are downloaded initially as spreadsheets in native Excel format (.xlsx) or the more generic Comma-Separated Values (.csv). Both are spreadsheet formats, but .csv is plain text, which makes it more readable and interchangeable between different machines and programmes. Later we will explain how to format and prepare these in a way that makes them easily readable in R. Many agencies now supply data through their own Application Programming Interface (API), and some have developed custom packages in R that streamline the process. These allow us to import data directly from their websites in a machine-readable format. We may also receive datasets from other projects or researchers that are prepared in the formats they work with. Either way, it is important that we are able to work with a variety of different file types so we can easily begin working on a dataset we might receive.
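To give a first sense of what importing looks like, the sketch below writes a tiny .csv file to a temporary location and reads it back in with base R. The values and the object names (csv_path, example_data) are invented for illustration. The commented lines at the end show the functions from the haven package that are commonly used for Stata and SPSS files, with hypothetical file names.

```r
# Create a tiny spreadsheet in a temporary location so the example is
# self-contained. The values are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = c(1, 2, 3), score = c(10.5, 8.0, 12.25)),
  file = csv_path,
  row.names = FALSE
)

# Import the .csv file with base R and inspect its structure
example_data <- read.csv(csv_path)
str(example_data)

# Stata (.dta) and SPSS (.sav) files are commonly read with the haven package:
# library(haven)
# stata_data <- read_dta("my_file.dta")   # hypothetical file name
# spss_data  <- read_sav("my_file.sav")   # hypothetical file name
```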

01.5 Built-in datasets in R

For introductory purposes, R comes with some built-in datasets. Some of these are well known and feature in textbooks, or the many online examples and tutorials that accompany the R community. They are fine for an introduction to code, but their topical coverage is limited. One such dataset is the USArrests dataset that we can load into R without installing additional packages or calling the data from outside the R system. From now on, whenever you see a chunk of R code in the text here, the output you should see in RStudio will be included. Where a command produces output, I include what you should expect to see if the code has worked properly. The command data("USArrests") below prints nothing to the console when it succeeds.

data("USArrests")

Now that we have some data loaded into our session, we can start to explore it. The dataset contains four variables (columns), and 50 observations (rows, in this case U.S. States). We can take a quick sense of the contents of the dataset with:

head(USArrests)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

By default, R gives us the first six rows. If we want 10, we can make a small addition to the code:

head(USArrests, 10)

##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8

We can see, from the results in the console window, that Alabama had a murder arrest rate of 13.2 per 100,000 residents, compared to 9.0 in California. We can take a look at the summary statistics for every variable in the dataset by typing:

summary(USArrests)

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

Across the 50 states in 1973, the year these data refer to, the median murder arrest rate was 7.25 per 100,000 residents, and the median assault arrest rate was 159. This is fine for demonstration, and it does give us a good sense of the workflow within R - as above, much of what we will do will involve variations on a simple process. Type a command, point it toward some data, generate the results. Modify the code a little to fine-tune the output, and run the code again.
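As a small example of modifying the code and running it again, the sketch below points the same kind of command at a single variable rather than the whole dataset. The object names are my own; the functions are all part of base R.

```r
# How many rows (states) and columns (variables) are there?
dim(USArrests)             # 50 rows, 4 columns

# Same idea, narrower target: one variable instead of the whole dataset
murder_median <- median(USArrests$Murder)     # 7.25, matching the summary() table
assault_median <- median(USArrests$Assault)   # 159

# A small change to the command gives a different statistic
murder_mean <- mean(USArrests$Murder)
round(murder_mean, 3)      # 7.788
```

The $ sign tells R which column of the dataset we want, so USArrests$Murder reads as "the Murder variable inside USArrests".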

01.6 Data units, levels, and resolution

In sociology we work a lot with macrodata. These are data based on units other than individual people. We might usefully distinguish between individual, meso and macro data. Individual data are collected from human respondents. This was often done by having them complete a questionnaire, but it also includes many other sources now. Opinion polls completed online, transaction or customer data, or large datasets such as the European Social Survey are all examples of microdata comprising the responses of individuals. The structure of these datasets means that each row represents one set of responses for one individual. Sometimes, in large-scale household surveys, one of these rows might contain data on the household alongside that of the individual. The meso level in sociology identifies intermediary social institutions, organisations, or firms. Data on the characteristics of firms or businesses, and data on aspects of a country’s health or education system, are examples of data collected at the meso level.

Macrodata is data collected at the societal level - often community, region, or country. Countries will often be organised into geographical sub-units for data collection purposes, where the populations of each area are kept to a common level. This is to allow for comparison and analysis across a wide number of units, and to account for the impact of population density on the widely varying number of individuals that would be included if we collected data based on area alone. A region of 1 square kilometre (sqkm) in New York City would include far more people than 1sqkm in rural Ireland. As such, these boundaries tend to be revised slightly between census waves to account for changing populations due to inward and outward migration. The Central Statistics Office of Ireland collects and distributes data at the Electoral Division level, of which there are 3,420. For finer-scale analysis, data are also organised at the Small Area level, of which there are 18,919. This difference in granularity is known as spatial resolution. The Small Area data (N=18,919) is of a higher resolution than that of the Electoral Division (N=3,420).

Data can also possess temporal resolution. This applies to time series data, where each data point represents a value for a particular point in time. High temporal resolution data might have observations at a very high frequency. Some, such as stock data, may be recorded by the minute or second. In economics, quarterly data are common, and for macrodata at the country level, the temporal resolution is typically yearly. These properties will inform the kinds of analysis we can do, but also the kinds of conclusions we can draw. Macrodata can only tell us about the behaviours of individuals in a limited way, but they tell us a lot more about how ‘high level’ social properties such as different kinds of social policy, social institutions, and forms of regulation impact other things in the social world such as levels of inequality. Time series data can tell us a lot about change patterns, and about how societies respond to or recover from shock events like disasters or recessions. We can do even more if the data are organised in a panel with multiple countries recorded over time on the same variables. That can help us answer questions like whether changes in political regimes have an immediate or more long-term impact on society, or how changes in social policies such as decreasing taxation on the rich may impact levels of inequality.
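To make the idea of a country panel concrete, the sketch below builds a very small one in R. Everything in it is hypothetical: the countries are labelled A and B, the years are arbitrary, and the values are invented purely to show the layout, with one row per country per year.

```r
# Every combination of year and country: 3 years x 2 countries = 6 rows
panel <- expand.grid(
  year = 2020:2022,
  country = c("A", "B"),
  stringsAsFactors = FALSE
)

# Invented values for a single variable, in the same row order as above
panel$value <- c(1.0, 1.2, 1.1, 2.0, 2.3, 2.1)

panel
nrow(panel)            # 6 rows in total
table(panel$country)   # 3 yearly observations for each country
```

Reading down the rows for one country gives a time series at yearly resolution; reading across countries within a single year gives a cross-section.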

Individual data are especially powerful at helping us figure out the factors that determine things like educational achievement, earnings, or attitudes. They are even more powerful if we combine them with data at higher levels in the form of multilevel data. In education research, this could be data collected on children, but also on the household and school, such as with the Growing Up in Ireland survey. Sometimes these data are cross-sectional, where the set represents data recorded at a single point or window in time only. They can also be longitudinal, as is the case with repeated cross-section data like the European Social Survey. Panel data are especially powerful. These sets contain information on the same individuals measured at different points in time. Surveys such as the British Household Panel Survey (now Understanding Society) contain many decades of data, and allow us to study the time-order of causal processes in a way that cannot be achieved with cross-sectional data. Unfortunately they are also incredibly expensive, and thus tend to be quite rare. They are also subject to panel attrition, where respondents can drop out either voluntarily or due to mortality. It is important to be aware of the kind of data you are working with as this will place limits on what kinds of analysis you can perform and conclusions you can draw, as well as what kinds of theory will be needed to interpret your findings.

01.7 Loading external data from spreadsheets or Excel files

The simplest kind of dataset is rectangular. You may encounter different terms for this kind of data, but they describe essentially the same thing - a file of some kind (such as a spreadsheet in .csv format), with data arranged in rows and columns. You may see this written in shorthand as ‘r x c’. They are the most common data structures in sociology, and most of the datasets you will encounter will be in this format. The USArrests dataset is rectangular, as the columns represent individual variables such as Murder and Assault. The rows represent the individual cases, in this case states including Alabama, Alaska, etc. Cross-sectional data drawn from individuals will also be rendered in rectangular format. An example of a data structure like this is a survey administered to a sample of respondents, where the answers are numerically coded and recorded in a spreadsheet.

Consider the sample data below. Here we have a dataset with five variables:

id (respondent id)
type (type of tenure)
inc (annual gross income)
gen (gender)
job (occupation type)

Look also at the variable types. type is a <chr> variable, as it is entered as characters (i.e. text). inc is a <dbl> variable, as it holds numeric values that can include decimal points. The name comes from ‘double’, and distinguishes this type of data from an integer (<int> in R), which is a whole number. Notice that id is also stored as <chr>: because it is held as text, the leading zeros in 001 are preserved.

## # A tibble: 5 × 5
##   id    type              inc gen    job       
##   <chr> <chr>           <dbl> <chr>  <chr>     
## 1 001   Rent           105000 Female Pilot     
## 2 002   Own             55000 Male   Teacher   
## 3 003   Social housing  52000 Male   Teacher   
## 4 004   Own            120000 Female Broker    
## 5 005   Rent            45000 Female Bus driver

The “row x column”, “r x c” or rectangular structure is apparent here. Each of the rows represents a set of responses from one individual. The first row, identified by id 001, has a tenure type of Rent (they rent their home privately) and an income inc of 105,000. Their gender gen is Female, and their main occupation job is Pilot.
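The same five rows can be typed directly into R to see the rectangular structure for ourselves. This sketch uses a base R data frame rather than a tibble so that it runs without extra packages, and the object name housing is my own choice.

```r
# Rebuild the sample data: one vector per column, all of equal length
housing <- data.frame(
  id   = c("001", "002", "003", "004", "005"),   # kept as text to preserve the zeros
  type = c("Rent", "Own", "Social housing", "Own", "Rent"),
  inc  = c(105000, 55000, 52000, 120000, 45000),
  gen  = c("Female", "Male", "Male", "Female", "Female"),
  job  = c("Pilot", "Teacher", "Teacher", "Broker", "Bus driver"),
  stringsAsFactors = FALSE
)

dim(housing)             # 5 rows by 5 columns
sapply(housing, class)   # the storage type of each column
typeof(housing$inc)      # "double", shown as <dbl> in a tibble
housing[1, ]             # the first respondent's full set of answers
```

The square brackets follow the same "r x c" logic: housing[1, ] asks for row 1 and, by leaving the column position empty, all columns.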