Introduction

This page contains code snippets, notes and examples for FT R Learners to refer back to when trying to remember how to do something in R.

It is maintained by Ella Hollowood (ella.hollowood@ft.com) and Oliver Hawkins (oli.hawkins@ft.com).

If something’s not in here that you think should be, let us know!

Workflows: How to…

Set up an R Project

The best way to start doing a piece of work in R is to create an R project for it, as this will bundle everything in a portable, self-contained folder. To create an R project:
1. Open up RStudio and select “File” and “New project…”.
2. Select ‘New directory’, then give your project a name and choose a folder for it to go in.
For a more in-depth intro to using R projects, check out this FT R workshop: Working with RStudio projects

Structure folders

A typical folder structure within an R project might be:
- R project
  - R folder for R scripts
  - data folder for data inputs
  - dist folder for data outputs
  - plots folder for plot outputs
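
The folders above can also be created from the R console. This is a minimal sketch, assuming the four folder names listed above; project_dir is a hypothetical stand-in that points at a temporary directory, so swap in your own project path.

```r
# Hypothetical stand-in for your project folder (replace with your own path)
project_dir <- file.path(tempdir(), "my_project")

# Folder names taken from the structure described above
folders <- c("R", "data", "dist", "plots")

# Create each folder (and the project folder itself, if it does not exist yet)
for (folder in folders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Check what was created
list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
```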

Structure scripts

A typical script workflow might be:

# Brief description of what the script does

# Setup ------------------------------------------------------------------------

library(tidyverse)

# Import -----------------------------------------------------------------------

raw <- read_csv("my_data.csv")

# Cleaning ---------------------------------------------------------------------

clean <- raw |> 
 # pipe functions to clean up the raw data, eg...
 rename(new_column_name = old_column_name) |> 
 mutate(value = as.numeric(value))

# Analysis ---------------------------------------------------------------------

analysis <- clean |>
 # pipe functions to analyse data, eg...
 group_by(new_column_name) |> 
 summarise(sum = sum(value))

Packages: How to…

A major strength of R is that there is a huge online community of R users willing to share packages (also known as libraries) of code and useful functions that they have written with other R users. For example, there is an OECD package that any R user can use to search and extract OECD data. One of the most useful and widely-known packages is actually a compilation of packages known as the tidyverse.

Packages need to first be installed on your computer (once) and then loaded with each new R session.

Install a package for the first time

install.packages("package_name")

Note: install.packages will install a package from the official R packages repository, known as CRAN (the Comprehensive R Archive Network). But it’s also possible to install packages that aren’t on CRAN from other places, such as a GitHub repository. The FT has its own set of R packages stored on FTRAN.

Load a package in an R script

library(package_name)

Update packages

Packages get updated over time in order to fix bugs or make improvements. Because of this, it’s useful to know that you can:

Update the packages you’ve installed in R, by selecting “Tools” and “Check for Package Updates…”
Control the version of the package that you use in an R project with the (ahem) package renv. For more info see RStudio reference page. The FT R workshop Working with RStudio projects also includes an introduction to using renv.

Fundamentals: How to…

Use punctuation

Parentheses

Like spreadsheets, R has functions that perform tasks and generally work as follows:
- Function name
- Open parenthesis: (
- Function arguments
- Close parenthesis: )
For example in sum(x), sum is the name of the function, and x is the argument or inputs that the function operates on. In this example x could be a column in a dataframe, such as sum(my_df$column_name), or a range of numbers, such as sum(1:5). Or x might be an object that you’ve already defined with the <- operator.
To find out what arguments a function takes, run ?function_name in the R Console in RStudio. For example if you type ?select in the Console and hit enter, this will bring up a Help window in RStudio.
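
As a small illustration of the pattern described above, the sketch below calls sum() in three ways. The object name x and the values are made up for the example.

```r
# sum is the function name; what goes inside the parentheses is the argument
sum(1:5)
#> [1] 15

# An object defined earlier with <- can be passed as the argument
x <- c(2, 4, 6)
sum(x)
#> [1] 12

# Arguments can be named: na.rm = TRUE tells sum() to skip missing values
sum(c(1, NA, 3), na.rm = TRUE)
#> [1] 4
```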

Quote marks

Single (') and double (") quotes can be used interchangeably, but double quotes are generally the preferred style.
When referring to values entered as text, or to dates, put them in quote marks, like this: "United States", or "2016-07-26".
Numbers are not quoted (and if they are, R interprets them as a text string).
Backticks (`) are used to refer to a variable whose name contains spaces or other special characters, making it explicit that you mean the variable rather than a text string (for example the column heading `Hours per day`, rather than the text string "Hours per day").
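
A short sketch of these quoting rules in practice. The data frame name hours and its values are hypothetical, made up for illustration.

```r
# Single and double quotes create the same text string
identical('United States', "United States")
#> [1] TRUE

# A quoted number is a text string, not a number
class(1)
#> [1] "numeric"
class("1")
#> [1] "character"

# Backticks refer to a column whose name contains spaces
# (check.names = FALSE stops data.frame() from altering the name)
hours <- data.frame(`Hours per day` = c(7, 8), check.names = FALSE)
hours$`Hours per day`
#> [1] 7 8
```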

Colons

Like spreadsheets, you can specify a range of values with a colon.
For example c(1:10) creates a sequence of the whole numbers from one to ten.

Use operators

Equals signs

R understands single and double equals signs as follows:

= makes an object equal to a value and works like the <- assignment operator
== tests whether an object is equal to a value. This is often used when filtering data.
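
To make the difference concrete, here is a minimal sketch using a made-up object x:

```r
# A single equals sign assigns a value, the same as x <- 5
x = 5

# A double equals sign asks a question and returns TRUE or FALSE
x == 5
#> [1] TRUE
x == 6
#> [1] FALSE

# The test is applied to every value when there is more than one
c(1, 5, 10) == 5
#> [1] FALSE  TRUE FALSE
```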

Arithmetic

R can work like a calculator and understands arithmetic operators like:

+ add
- subtract
* multiply
/ divide

Logic

R understands logical operators like:

& AND
| OR
! NOT
> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
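
A brief sketch of these operators in use. The object names and values are made up for illustration.

```r
age <- 30

age > 18 & age < 65    # both conditions must be true
#> [1] TRUE
age < 18 | age >= 65   # at least one condition must be true
#> [1] FALSE
!(age == 30)           # NOT flips TRUE to FALSE and vice versa
#> [1] FALSE
age != 30
#> [1] FALSE

# Logical tests are often used to pick out values
values <- c(3, 8, 12)
values[values >= 8]
#> [1]  8 12
```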

The pipe

|> is known as the “pipe” (in older tidyverse code you will also see the magrittr pipe, %>%, which works in a similar way). It chains code together by making the output of one line of code the input for the next. It can be helpful to think of |> as meaning “then”.
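
The sketch below shows the same calculation written with and without the pipe, using only base R functions. It assumes R 4.1 or later, which is when the |> pipe was added to the language.

```r
# Without the pipe: the code has to be read from the inside out
round(sqrt(sum(c(16, 9))), 1)
#> [1] 5

# With the pipe: take the numbers, then sum them, then take the
# square root, then round the result
c(16, 9) |>
  sum() |>
  sqrt() |>
  round(1)
#> [1] 5
```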

Create objects

Objects can be created in R with <-, which is known as an “assignment operator” and means “Make the object named to the left equal to the output of the code to the right”. Single values, text strings, datasets and lists can all be defined as “objects” in R.

my_object <- 1
my_object <- "abc"
my_object <- mtcars

Objects can also be removed from your environment with:

rm(my_object)

Create lists

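Lists can be created with the function c(), which combines values into a list, with the values separated by commas. (Strictly speaking, R calls the result a “vector”; R’s list() function creates a different kind of object.)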

list1 <- c("1","2")
list2 <- c("2022-12-31","2023-01-01")
list3 <- c("New Year's Eve", "New Year's Day")

Create dataframes

Dataframes can be made with the function data.frame()

my_df <- data.frame(order = list1, 
                    date = list2, 
                    celebration = list3)

To manually build up a dataset, you can also create an empty dataframe and use the function add_row to add each entry.

library(tidyverse)

my_df <- 
  # Create empty dataframe, specifying the data type
  data.frame(key = character(),
             value = numeric()) |> 
  # Add one row at a time
  add_row(key = "Example entry",
          value = 0)

Getting data: How to…

Import data

From a csv

library(readr)
raw <- read_csv("folder/filename.csv")
raw <- read_csv("url.csv")

More info: Tidyverse reference page

From an Excel file

library(readxl)
raw <- read_excel("folder/filename.xls", sheet = 2)

More info: Tidyverse reference page

From an Excel download link

Define the url of your Excel download
Define the destination of your download
Pass both to the function download.file()
Use read_excel() as normal

library(readxl)

excel_url <- "https://www.website.com/filename.xls"

destfile <- "~/Downloads/filename.xls" # on Mac

download.file(excel_url, destfile)

raw <- read_excel(destfile)

From a googlesheet

Note: you will need to set up authentication with your Google account. See gist

library(googlesheets4)
sheet_url <- "https://docs.google.com/spreadsheets/d/1SFmm6TWziRlrosj-h0uuMa8--nd1HwroQntM9oult2E/edit?usp=sharing"
raw <- read_sheet(sheet_url)

More info: Tidyverse reference page

From a table on a webpage

library(rvest)
page_url <- "https://webpage.com"
# Read the page, then extract its HTML tables (returns a list of tables)
raw <- read_html(page_url) |>
  html_table()

More info: Tidyverse reference page

From a PDF

library(pdftools)
pdf_file <- "folder/filename.pdf"
raw <- pdf_text(pdf_file)

Examples: 1. Scraping AA fuel prices from online PDF reports 2. Getting YouGov data out of tables in a pdf

More info: pdftools package

From multiple csvs

Create a vector of all the csvs in your directory

# pattern is a regular expression: match file names ending in .csv
csvs <- list.files(pattern = "\\.csv$")

Run the read_csv function for all csv files

library(tidyverse) # provides map_dfr() and read_csv()
data <- map_dfr(csvs, read_csv)

From multiple urls

Create a vector of urls

urls <- c("url1", "url2", "url3")

Write a function that gets one url

get_my_urls <- function(x) {
  # Read the page and return its HTML tables
  read_html(x) |>
    html_table()
}

Test function on one url

test <- get_my_urls("url1")

Run function for all urls

results <- map_dfr(urls, get_my_urls)

Gist: Scraping MP voting records from Public Whip

Gist: Scraping ambulance data from NHS England

Inspect data

Structure

str(dataset)

Column headings

colnames(dataset)

Number of rows/columns

dim(dataset)

Unique categories in a column

unique(my_df$column_name)

Breakdown of results in column

table(my_df$column_name)

Number of NA values in a column

sum(is.na(my_df$column_name))
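
To try these inspection functions out, the sketch below runs them on mtcars, a dataset that ships with R. It uses dim() for the row and column counts.

```r
# Structure: column names, types and the first few values
str(mtcars)

# Column headings
colnames(mtcars)

# Number of rows, then number of columns
dim(mtcars)
#> [1] 32 11

# Unique categories in the cyl (number of cylinders) column
unique(mtcars$cyl)
#> [1] 6 4 8

# How many cars fall into each category
table(mtcars$cyl)

# Number of NA values in the mpg column
sum(is.na(mtcars$mpg))
#> [1] 0
```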

Wrangling data: How to…

Note: the code snippets in this section assume you already have the tidyverse loaded with library(tidyverse)

Select columns

By column index

Include

my_df |>
 select(2, 5:10)

Remove

my_df |>
 select(-2)

By name

Include

my_df |>
 select(column_a, column_c:column_e)

Remove

my_df |>
 select(-column_a)

That start/end with

Include

my_df |>
 select(starts_with("unemploy_"),
        ends_with("_2020"))

Remove

my_df |>
 # Wrap both helpers in c() so that columns matching either one are removed
 select(!c(starts_with("unemploy_"),
           ends_with("_2020")))

Select rows

By row number

Include

my_df |>
 slice(2)

Remove

my_df |>
 slice(-1)

By row range

Include

my_df |>
 slice(2:10)

Remove (often useful for Excel spreadsheets)

my_df |>
 slice(-c(1:2))

More info: Tidyverse reference page

Filter

Single items

Include

my_df |>
 filter(column_name == "example")

Remove

my_df |>
 filter(column_name != "example")

Lists of items

Include

my_df |>
 filter(column_name %in% c("item1", "item2", "item3"))

Remove

my_df |>
 filter(!(column_name %in% c("item1", "item2", "item3")))

If contains text string(s)

Single text string

my_df |> 
 filter(str_detect(column_name, "string"))

Multiple text strings

my_df |> 
 filter(str_detect(column_name, "string1|string2|string3"))

More info: Tidyverse reference page

Rename

Rename columns

my_df |>
  rename(new_name = old_name)

Rename categories within a column

my_df |>
  mutate(category_new = case_when(
    category_old == "old name_1" ~ "new name_1",
    category_old == "old name_2" ~ "new name_2",
    # Keep the existing value for every other category
    TRUE ~ category_old
))

Gist: Cleaning Spice Girls song lyrics with mutate and case_when

More info: Tidyverse reference page

Sort

From small to large

my_df |>
  arrange(variable)

From large to small

my_df |> 
 arrange(desc(variable))

According to a logical order

eg Set an order for the categories “Very poor”, “Poor”, “Okay”, “Good”, “Very good”

my_df |>
 mutate(column_name = factor(column_name, 
                             levels = c("Very poor", "Poor", "Okay", "Good", "Very good"))) |> 
 arrange(column_name)
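
To see why the factor step matters, here is a minimal base R sketch with made-up ratings. Text is sorted alphabetically by default, whereas a factor is sorted by the order of its levels.

```r
ratings <- c("Good", "Very poor", "Okay")

# Plain text sorts alphabetically, which is not the logical order
sort(ratings)
#> [1] "Good"      "Okay"      "Very poor"

# A factor sorts according to the order of its levels
ordered_ratings <- factor(
  ratings,
  levels = c("Very poor", "Poor", "Okay", "Good", "Very good")
)
as.character(sort(ordered_ratings))
#> [1] "Very poor" "Okay"      "Good"
```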

Reclassify

Convert to ‘numeric’

Eg to enable calculations

my_df |>
 mutate(variable = as.numeric(variable))

Convert to ‘date’

Eg to enable sorting in chronological order

my_df |>
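 # as.Date() needs a day of the month, so add the 1st before converting (eg "2020-Jan")
 mutate(variable = as.Date(paste0(variable, "-01"), format = "%Y-%b-%d"))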

Eg to convert an Excel date displaying as a number

my_df |>
 mutate(date = as.Date(date, origin = "1899-12-30"))
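
As a quick check of the Excel conversion, the sketch below converts two serial numbers with the same origin. The numbers are examples chosen for illustration: 44562 is the value Excel stores for 1 January 2022.

```r
# Excel stores dates as the number of days since its origin date
excel_serial <- c(44562, 44563)

as.Date(excel_serial, origin = "1899-12-30")
#> [1] "2022-01-01" "2022-01-02"
```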

Convert to ‘character’

Eg to enable a join

my_df |>
 mutate(variable = as.character(variable))

Pivot

‘Tidy data’ is a concept introduced by the author of the tidyverse packages, Hadley Wickham. He came up with the concept in order to have a consistent way of structuring tables of data that would work well with the tidyverse packages.

The fundamental principles of ‘tidy data’ are that:

Each variable (measures or attributes) must have its own column. Some examples of measures and attributes include millimetres of rainfall, GDP per capita and income level.
Each observation (thing being observed) must have its own row. An example of a thing that is being observed might be a day, a country or a demographic.
Each value must have its own cell. For example the value ‘12’ to indicate the millimetres of rainfall (variable) on a particular day (observation). The value ‘119’ to indicate the GDP per capita (variable) of a particular country (observation). Or the value ‘Less than $25,000’ to indicate the income level (variable) of a particular demographic (observation).

More info: Hadley Wickham on the concept of tidy data.

FT R Workshop: Essential data wrangling with the tidyverse
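
To illustrate the principles above, here is a minimal sketch that reshapes a small table into tidy form and back again. The table, its column names and its figures are all made up for illustration, and it assumes the tidyr package is installed.

```r
library(tidyr)

# A 'wide' table with one column per year (hypothetical figures)
wide <- data.frame(
  country = c("A", "B"),
  `2020` = c(10, 20),
  `2021` = c(11, 22),
  check.names = FALSE
)

# Tidy ('long') version: one row per country-year observation,
# with year and value each in their own column
long <- wide |>
  pivot_longer(cols = -country, names_to = "year", values_to = "value")

# Back to the human-readable 'wide' layout
wide_again <- long |>
  pivot_wider(names_from = year, values_from = value)
```

Here long has four rows (two countries times two years), while wide_again has the same two rows and three columns as the original table.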

pivot_wider()

Pivot a ‘long’ dataset to a ‘wide’ format (human-readable table). Typically, you will want to take the names for your new columns from a column with categorical data and the values for each observation from a column with numerical data:

my_df |> 
  pivot_wider(
    names_from = categorical_column,
    values_from = numerical_column
  )

More info: Tidyverse reference page

pivot_longer()

Pivot a ‘wide’ dataset to a ‘long’ format (machine-readable database). Note that you can decide what you want the column name for the names to be called, and the column names for the values to be called - the below example uses key and value respectively:

my_df |> 
  pivot_longer(
    cols = 2:last_col(),
    names_to = "key",
    values_to = "value" 
  )

More info: Tidyverse reference page

Join

Join two datasets by a column

A bit like VLOOKUP in Excel, where “column_key” is the column to join by that features in both datasets

my_df |> 
 left_join(dataset2, by = "column_key")
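
A small worked example of the join, assuming dplyr is installed. Both data frames and their values are made up for illustration.

```r
library(dplyr)

# Two hypothetical datasets that share a "country" column
sales <- data.frame(country = c("A", "B", "C"),
                    sales = c(100, 200, 300))
regions <- data.frame(country = c("A", "B"),
                      region = c("North", "South"))

# Keep every row of sales and look up the matching region
joined <- sales |>
  left_join(regions, by = "country")

joined
```

Country "C" has no match in regions, so it is kept in the result with NA in the region column.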

Join two datasets with matching columns

my_df <- bind_rows(dataset1, dataset2)

Remove duplicate rows

my_df |> 
 distinct()

Handle missing data

Missing values in datasets are shown as NA.

Filter to see only NA values

my_df |>
 filter(is.na(column_name))

Filter to remove NA values

my_df |>
 filter(!is.na(column_name))

Change NA values to 0

my_df |> 
 mutate(column_name = replace_na(column_name, 0))

Fill missing values with previous value

my_df |> 
 fill(column_name, .direction  = "down")
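
The sketch below combines two of the techniques above on a small made-up data frame, assuming dplyr and tidyr are installed. It fills gaps in one column from the row above and replaces missing numbers in another with 0.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with gaps in both columns
gaps <- data.frame(region = c("North", NA, "South", NA),
                   sales = c(10, NA, 5, 7))

filled <- gaps |>
  # Carry the last known region down into the rows below it
  fill(region, .direction = "down") |>
  # Treat missing sales as zero
  mutate(sales = replace_na(sales, 0))

filled$region
#> [1] "North" "North" "South" "South"
filled$sales
#> [1] 10  0  5  7
```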

Create a sequence of dates

You can use the seq() function to create a sequence of dates in R. Just give it two date objects and then tell it the interval with the by argument.

For example, to get the first day of every month in 2022 you can do this:

first_day_of_month <- seq(as.Date("2022-01-01"), as.Date("2022-12-01"), by = "month")

You can combine this with lubridate’s period functions to create some handy results. For example, to get the last day of every month in 2022, irrespective of the month length, just get the first day of each month from February 2022 to January 2023 and subtract one day:

library(lubridate)

last_day_of_month <- seq(as.Date("2022-02-01"), as.Date("2023-01-01"), by = "month") - days(1)

Note: The by argument in seq can take other intervals - for example, you can use "week" to get every Monday (or any other day of the week) from a given start date to a given end date.
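
For instance, here is a minimal sketch of the weekly interval. The start date is chosen because 3 January 2022 was a Monday, so every date in the sequence is also a Monday.

```r
# Every Monday in January 2022
mondays <- seq(as.Date("2022-01-03"), as.Date("2022-01-31"), by = "week")

mondays
#> [1] "2022-01-03" "2022-01-10" "2022-01-17" "2022-01-24" "2022-01-31"

# Check the day of the week as a number (1 = Monday)
format(mondays, "%u")
#> [1] "1" "1" "1" "1" "1"
```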

Analysing data: How to…

Note: the code snippets in this section assume you already have the tidyverse loaded with library(tidyverse)

Calculate new columns

Mutate to create a new column

my_df |>
 mutate(new_column_name = old_column_name * 100)

Mutate to override an existing variable

my_df |>
 mutate(column_name = column_name * 100)

More info: Tidyverse reference page

Summarise by groups

Count categories in a column

my_df |> 
 group_by(categorical_column) |> 
 count() |> 
 ungroup()

More info: Tidyverse reference page

Summary statistics by group

my_df |> 
 group_by(categorical_column) |> 
 summarise(mean = mean(numerical_column),
           max = max(numerical_column),
           sum = sum(numerical_column)) |> 
 ungroup()

More info: Tidyverse reference page

% of

Eg below converts a proportion (a value between 0 and 1) to a percentage and rounds to 1 decimal place

my_df |> 
  mutate(pct = round(100 * column_name, 1))

% change

% difference

my_df |> 
 mutate(pct_diff = ((column_name_1 / column_name_2) - 1) * 100)

Year-on-year % change

my_df |>
 arrange(date) |>
 mutate(yoy_change = ((column_name / lag(column_name)) - 1) * 100)
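
The year-on-year calculation can be tried on a small made-up annual series, assuming dplyr is installed. lag() returns the value from the previous row, so this only gives a year-on-year change when there is one row per year; for monthly data, one option would be lag(value, 12) to look back twelve rows.

```r
library(dplyr)

# Hypothetical annual figures
annual <- data.frame(year = 2019:2022,
                     value = c(100, 110, 99, 99))

yoy <- annual |>
  # Make sure the rows are in chronological order first
  arrange(year) |>
  # Compare each year with the year before
  mutate(yoy_change = ((value / lag(value)) - 1) * 100)

yoy
```

The first row is NA because there is no earlier year to compare with; the remaining values are approximately 10, -10 and 0.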

Sum

And ignore any NA values

sum(my_df$column_name, na.rm = TRUE)

Averages

And ignore any NA values

median(my_df$column_name, na.rm = TRUE)
mean(my_df$column_name, na.rm = TRUE)
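# R's mode() returns an object's storage type, not the most common value.
# One way to find the most common value (table() ignores NA values by default):
names(which.max(table(my_df$column_name)))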

Min/max

And ignore any NA values

min(my_df$column_name, na.rm = TRUE)
max(my_df$column_name, na.rm = TRUE)

Rank

my_df |> 
 mutate(rank = rank(column_name))

Quantile

my_df |> 
 mutate(decile = ntile(column_name, 10))

Rebasing

Remember to arrange the series in chronological order first

my_df |>
 arrange(date) |>
 mutate(col_rebased = 100 * (col_to_rebase/col_to_rebase[1]))

Reproducible example to try in R: Rebase the unemployment rate to the value in the first and nth rows

library(tidyverse)
rebase <- economics

# Rebase a single series:
rebase_1 <- rebase  |>
  # Ensure dataframe is arranged by date
  arrange(date) |>
  # Create new column with data rebased to the value in the first row
  mutate(unemploy_rebased = 100 * (unemploy / first(unemploy)))

rebase_2 <- rebase_1 |>
  # Ensure dataframe is arranged by date
  arrange(date) |>
  # Create new column with data rebased to the value in the nth row
  mutate(unemploy_rebased_2 = 100 * (unemploy / nth(unemploy, 5)))

Gist: rebase_solutions.R

Rolling averages

Remember to arrange the series in chronological order first

library(RcppRoll)

my_df |> 
 arrange(date) |> 
 # Column names cannot start with a number unless wrapped in backticks
 mutate(avg_7da = roll_meanr(column_name, n = 7))

Reproducible example to try in R: Calculate the 7 day moving average for new Covid cases in the UK

library(RcppRoll)

moving_avg <- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=newCasesByPublishDate&format=csv") |>
  # Select columns
  select(date, area = areaName, new_cases = newCasesByPublishDate)

moving_avg_2 <- moving_avg |>
  # Ensure arranged by date first
  arrange(date) |>
  # Ensure no missing dates
  complete(date) |>
  # Group by category (optional)
  group_by(area) |>
  # Create a column with rolling average for every nth value
  mutate(new_cases_7da = roll_meanr(new_cases, n = 7)) |>
  # Round to 0dp
  mutate(new_cases_7da = round(new_cases_7da))

Gist: rolling_averages.R

Compounding

Reproducible example to try in R: The UK population is forecast to grow by 0.5% per year over the next 10 years. The current population is 68.3 million. What is the estimated population in 10 years?

# Define inputs
annual_growth_rate <- 1.005
years <- 10
start_value <- 68.3

# Calculate the final value after X years of Y% annual growth
final_value = start_value * (annual_growth_rate ^ years)

# Calculate the absolute difference between final value and start value
absolute_change = final_value - start_value

# Calculate the overall percentage growth over the period
pct_change = ((annual_growth_rate ^ years) - 1) * 100

And what is the estimated population each year based on this growth rate?

# Define start and end date
start_year <- 2021
end_year <- 2031

compound_series <-
  # Create a dataframe where first column is each year
  data.frame(year = start_year:end_year,
             # ...and second column has the annual growth rate for all subsequent years
             annual_growth_rate = c(1, rep(annual_growth_rate, years))) |>
  # Calculate the cumulative growth rate for each year
  mutate(cumulative_growth_rate = cumprod(annual_growth_rate)) |>
  # Multiply the start value by each year's cumulative growth rate to get the value each year
  mutate(value = start_value * cumulative_growth_rate)

Currency exchange

Reproducible example to try in R: Convert US dollars per UK pound to UK pounds per US dollar

# Get currency exchange rates via FRED
# (requires a FRED API key, which can be set with fredr_set_key())
library(fredr)

# Define inputs
series_id <- "DEXUSUK"
start_date <- "2020-01-01"

# Get US dollars per UK pound
invert_currency_1 <- fredr(series_id = series_id) |>
  select(date, dollar_per_pound = value) |>
  filter(date >= start_date)

# Convert to UK pounds per US dollar
invert_currency_2 <- invert_currency_1 |>
  mutate(pound_per_dollar = 1 / dollar_per_pound)

Currency appreciation

Reproducible example to try in R: Calculate the appreciation of the US dollar against the UK pound

# Define base period
base_period <- "2020-01-02"

# Mutate to calculate the % change in the strength of the pound v dollar relative to base period
# (a rise in dollars per pound means the pound has strengthened)
appreciate_currency_1 <- invert_currency_2 |>
  mutate(pound_appreciation =
           ((dollar_per_pound / dollar_per_pound[date == base_period]) - 1) * 100)

# Mutate to calculate the % change in the strength of the dollar v pound relative to base period
appreciate_currency_2 <- appreciate_currency_1 |>
  mutate(dollar_appreciation = ((1 / (1 + (pound_appreciation / 100))) - 1) * 100)

Troubleshooting tips

It’s easy to feel overwhelmed by frequent and/or confusing error messages when learning R. But don’t be too disheartened, as it’s all part of the learning process and often the error will be due to one of the following:

Typo: first try to rule out whether something as simple as a misspelling, missing comma or bracket has caused the code to break.
Data in the wrong format: use the class() function to check if the error could be due to data not being in the right format. For example, if R is interpreting numbers as characters or text strings, it won’t be able to perform calculations on that data. This kind of error often generates a message with the phrase “non-numeric argument to binary operator”. To resolve an error like this, convert data to/from numeric and character formats using mutate() and the functions as.numeric() and as.character().
Setting working directory: if the issue relates to being unable to read or write a file on your computer, then it might be helpful to put your work into an R project. Doing this will massively simplify the way you reference your file directories, as R projects essentially bundle everything within the R project in a portable, self-contained folder.
Installing a package: if the issue relates to being unable to install an R package, then check out this summary of common package installation problems and solutions.
Running a function: if the issue relates to not being able to run a function, check you’ve understood how the function works and what arguments it takes by typing “?” in the R Console, followed by the function name, eg ?sum. This should bring up the “Help” window in RStudio with information about the function.
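
As an illustration of the “data in the wrong format” case above, the sketch below reproduces the error with made-up values, checks the class, and then fixes it. tryCatch() is used here only so that the example keeps running after the error.

```r
# Numbers that have been read in as text
values <- c("1", "2", "3")

# class() reveals the problem
class(values)
#> [1] "character"

# Doing arithmetic on text fails; capture the error message
# instead of stopping the script
error_message <- tryCatch(values * 2,
                          error = function(e) conditionMessage(e))
error_message

# Convert to numeric and the calculation works
values_numeric <- as.numeric(values)
values_numeric * 2
#> [1] 2 4 6
```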

If none of the above apply, then try:

Googling the error message to see if other people have posted similar issues on forums like Stack Overflow. Check out the responses they got and see if you can learn from these to fix your issue.
Posting a question in the #r-learners Slack channel
