This page contains code snippets, notes and examples for FT R Learners to refer back to when trying to remember how to do something in R.
It is maintained by Ella Hollowood (ella.hollowood@ft.com) and Oliver Hawkins (oli.hawkins@ft.com).
If something’s not in here that you think should be, let us know!
You might also be interested in:
R basics: find out what R is, how to download it and get started in RStudio on the FT Visual and Data Playbook
FT R Learners workshops: find links to videos, repos and slide decks from our internal R workshops in the R Workshops Log spreadsheet.
FT Gist List: find gists (code snippets) that team members have shared in the Gist List spreadsheet.
The best way to start doing a piece of work in R is to create an R project for it, as this will bundle everything in a portable, self-contained folder. To create an R project:
Open up RStudio and select “File” and “New Project…”. This should open up the New Project window.
Select ‘New Directory’ and, when prompted, give your project a name and choose a folder for it to go in.
For a more in-depth intro to using R projects, check out this FT R workshop: Working with RStudio projects
R project
R folder for R scripts
data folder for data inputs
dist folder for data outputs
plots folder for plot outputs
# Brief description of what the script does
# Setup ------------------------------------------------------------------------
library(tidyverse)
# Import -----------------------------------------------------------------------
raw <- read_csv("my_data.csv")
# Cleaning ---------------------------------------------------------------------
clean <- raw |>
# pipe functions to clean up the raw data, eg...
rename(new_column_name = old_column_name) |>
mutate(value = as.numeric(value))
# Analysis ---------------------------------------------------------------------
analysis <- clean |>
# pipe functions to analyse data, eg...
group_by(new_column_name) |>
summarise(sum = sum(value))
A major strength of R is that there is a huge online community of R users willing to share packages (also known as libraries) of code and useful functions that they have written with other R users. For example, there is an OECD package that any R user can use to search and extract OECD data. One of the most useful and widely-known packages is actually a compilation of packages known as the tidyverse.
Packages need to first be installed on your computer (once) and then loaded with each new R session.
install.packages("package_name")
Note: install.packages will install a package from the
official R packages repository, known as CRAN (the Comprehensive R
Archive Network). But it’s also possible to install packages that aren’t on CRAN
from other places, such as a GitHub repository. The FT has its own set
of R packages stored on FTRAN.
library(package_name)
Packages get updated over time in order to fix bugs or make improvements. Because of this, it’s useful to know that you can:
Update the packages you’ve installed in R, by selecting “Tools” and “Check for Package Updates…”
Control the version of the package that you use in an R project
with the (ahem) package renv. For more info see the RStudio
reference page. The FT R workshop Working
with RStudio projects also includes an introduction to using
renv.
Functions are written as a name followed by brackets (). In sum(x), sum is the name of the function, and x is the argument or input that the function operates on. In this example x could be a column in a dataframe, such as sum(my_df$column_name), or a range of numbers, such as sum(1:5). Or x might be an object that you’ve already defined with the <- operator.
To get help on a function, type ?function_name in the R Console in RStudio. For example, if you type ?select in the Console and hit enter, this will bring up a Help window in RStudio.
Text strings are wrapped in quotation marks, such as "United States" or "2016-07-26".
Names that contain spaces are wrapped in backticks (eg `Hours per day`, rather than the text string “Hours per day”).
c(1:10) creates a list of numbers from one to ten.
R understands single and double equals signs as follows:
= makes an object equal to a value and works like
the <- assignment operator
== tests whether an object is equal to a value. This
is often used when filtering data.
R can work like a calculator and understands arithmetic operators like:
+ add
- subtract
* multiply
/ divide
R understands logical operators like:
& AND
| OR
! NOT
> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
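As a quick illustration of the operators above, here is a minimal sketch using a made-up vector called ages (the name and values are assumptions for the example, not FT data):

```r
# Hypothetical example values
ages <- c(15, 22, 37, 64, 81)

# Arithmetic works element by element
ages_next_year <- ages + 1

# Comparison operators return TRUE or FALSE for each value
is_adult <- ages >= 18

# Combine conditions with & (AND), | (OR) and ! (NOT)
working_age <- ages >= 18 & ages < 65
young_or_old <- ages < 18 | ages >= 65
not_working_age <- !working_age

# == tests for equality, != tests for inequality
ages == 22
ages != 22

# Use a logical test to keep only the matching values
ages[working_age]
```

The same tests are what you pass to filter() when working with a dataframe.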
|> is known as the “pipe” and chains code together by making the output of one line of code the input for the next. It can be helpful to think of |> as meaning “then”. You will often see the older %>% pipe from the tidyverse in existing code; it works in much the same way.
Objects can be created in R with <-, which is known as an “assignment operator” and means “Make the object named to the left equal to the output of the code to the right”. Single values, text strings, datasets and lists can all be defined as “objects” in R.
my_object <- 1
my_object <- "abc"
my_object <- mtcars
Objects can also be removed from your environment with:
rm(my_object)
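To see how assignment and the pipe fit together, the sketch below computes the same result twice: once with nested functions and once with the pipe. The object names nested and piped are made up for the example, and the |> pipe needs R 4.1 or later.

```r
# Without the pipe: functions are nested and have to be read inside-out
nested <- round(mean(sqrt(c(1, 4, 9, 16))), 1)

# With the pipe: read top to bottom as
# "take the numbers, then square root them, then average them, then round"
piped <- c(1, 4, 9, 16) |>
  sqrt() |>
  mean() |>
  round(1)

# Both approaches give the same answer
identical(nested, piped)
```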
Lists can be created with the function c(), which
combines values into a list (strictly speaking, a vector), with the values separated
by commas.
list1 <- c("1","2")
list2 <- c("2022-12-31","2023-01-01")
list3 <- c("New Year's Eve", "New Year's Day")
Dataframes can be made with the function
data.frame()
my_df <- data.frame(order = list1,
date = list2,
celebration = list3)
To manually build up a dataset, you can also create an empty
dataframe and use the function add_row to add each
entry.
library(tidyverse)
my_df <-
# Create empty dataframe, specifying the data type
data.frame(key = character(),
value = numeric()) |>
# Add one row at a time
add_row(key = "Example entry",
value = 0)
library(readr)
raw <- read_csv("folder/filename.csv")
raw <- read_csv("url.csv")
More info: Tidyverse reference page
library(readxl)
raw <- read_excel("folder/filename.xls", sheet = 2)
More info: Tidyverse reference page
Define the url of your Excel download
Define the destination of your download
Pass both to the function download.file()
Use read_excel() as normal
library(readxl)
excel_url <- "https://www.website.com/filename.xls"
destfile <- "~/Downloads/filename.xls" # on Mac
download.file(excel_url, destfile)
raw <- read_excel(destfile)
Note: you will need to set up authentication with your Google account. See gist
library(googlesheets4)
sheet_url <- "https://docs.google.com/spreadsheets/d/1SFmm6TWziRlrosj-h0uuMa8--nd1HwroQntM9oult2E/edit?usp=sharing"
raw <- read_sheet(sheet_url)
More info: Tidyverse reference page
library(rvest)
page_url <- "https://webpage.com"
raw <- read_html(page_url)
More info: Tidyverse reference page
library(pdftools)
pdf_file <- "folder/filename.pdf"
raw <- pdf_text(pdf_file)
Examples:
1. Scraping AA fuel prices from online PDF reports
2. Getting YouGov data out of tables in a PDF
More info: pdftools package
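library(tidyverse)
# pattern is a regular expression: match file names ending in .csv
csvs <- list.files(pattern = "\\.csv$")
data <- map_dfr(csvs, read_csv)
library(rvest)
urls <- c("url1", "url2", "url3")
# Function to read the tables on a page
get_my_urls <- function(x){
  read_html(x) |>
    html_table()
}
# Test the function on one url first
test <- get_my_urls("url1")
# Then apply it to every url and combine the results
results <- map_dfr(urls, get_my_urls)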
str(dataset)
colnames(dataset)
summary(dataset)
unique(my_df$column_name)
table(my_df$column_name)
sum(is.na(my_df$column_name))
Note: the code snippets in this section assume you already have the
tidyverse loaded with library(tidyverse)
Include
my_df |>
select(2, 5:10)
Remove
my_df |>
select(-2)
Include
my_df |>
select(column_a, column_c:column_e)
Remove
my_df |>
select(-column_a)
Include
my_df |>
  select(starts_with("unemploy_"),
         ends_with("_2020"))
Remove
my_df |>
  # Wrap the helpers in !( | ) so that columns matching either are removed
  select(!(starts_with("unemploy_") |
           ends_with("_2020")))
Include
my_df |>
slice(2)
Remove
my_df |>
slice(-1)
Include
my_df |>
slice(2:10)
Remove (often useful for Excel spreadsheets)
my_df |>
slice(-c(1:2))
More info: Tidyverse reference page
Include
my_df |>
filter(column_name == "example")
Remove
my_df |>
filter(column_name != "example")
Include
my_df |>
filter(column_name %in% c("item1", "item2", "item3"))
Remove
my_df |>
filter(!(column_name %in% c("item1", "item2", "item3")))
Single text string
my_df |>
filter(str_detect(column_name, "string"))
Multiple text strings
my_df |>
filter(str_detect(column_name, "string1|string2|string3"))
More info: Tidyverse reference page
my_df |>
rename(new_name = old_name)
my_df |>
mutate(category_new = case_when(
category_old == "old name_1" ~ "new name_1",
category_old == "old name_2" ~ "new name_2",
    # Keep the original value for everything else
    TRUE ~ category_old
))
Gist: Cleaning Spice Girls song lyrics with mutate and case_when
More info: Tidyverse reference page
my_df |>
arrange(variable)
my_df |>
arrange(desc(variable))
eg Set an order for the categories “Very poor”, “Poor”, “Okay”, “Good”, “Very good”
my_df |>
mutate(column_name = factor(column_name,
levels = c("Very poor", "Poor", "Okay", "Good", "Very good"))) |>
arrange(column_name)
Eg to enable calculations
my_df |>
mutate(variable = as.numeric(variable))
Eg to enable sorting in chronological order
my_df |>
  # as.Date() needs a day as well as a year and month, so add the 1st
  mutate(variable = as.Date(paste0(variable, "-01"), format = "%Y-%b-%d"))
Eg to convert an Excel date displaying as a number
my_df |>
mutate(date = as.Date(date, origin = "1899-12-30"))
Eg to enable a join
my_df |>
mutate(variable = as.character(variable))
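As a quick check of the conversions above, this sketch applies them to single made-up values rather than a dataframe column. The object names and input values are assumptions for the example.

```r
# An Excel date that has been read in as a number (hypothetical value)
excel_serial <- 44562
excel_date <- as.Date(excel_serial, origin = "1899-12-30")
excel_date

# A text date, parsed by describing its layout with a format string
text_date <- as.Date("31/12/2022", format = "%d/%m/%Y")
text_date

# A number stored as text needs converting before it can be used in calculations
doubled <- as.numeric("12.5") * 2
doubled
```

Testing a conversion on one or two values like this before applying it inside mutate() makes it easier to spot a wrong format string, which would otherwise silently produce NA.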
‘Tidy data’ is a concept introduced by the author of the tidyverse packages, Hadley Wickham. He came up with the concept in order to have a consistent way of structuring tables of data that would work well with the tidyverse packages.
The fundamental principles of ‘tidy data’ are that:
Each variable (measures or attributes) must have its own column. Some examples of measures and attributes include millimetres of rainfall, GDP per capita and income level.
Each observation (thing being observed) must have its own row. An example of a thing that is being observed might be a day, a country or a demographic.
Each value must have its own cell. For example the value ‘12’ to indicate the millimetres of rainfall (variable) on a particular day (observation). The value ‘119’ to indicate the GDP per capita (variable) of a particular country (observation). Or the value ‘Less than $25,000’ to indicate the income level (variable) of a particular demographic (observation).
More info: Hadley Wickham on the concept of tidy data.
FT R Workshop: Essential data wrangling with the tidyverse
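To make the three principles concrete, here is a small sketch that stores the same made-up rainfall figures in an untidy and a tidy layout. The city names, days and values are invented for illustration.

```r
# Untidy: one variable (rainfall) is spread across a column per day
untidy <- data.frame(
  city = c("London", "Leeds"),
  monday = c(12, 3),
  tuesday = c(0, 8)
)

# Tidy: each row is one observation (a city on a day),
# each column is one variable, and each cell holds one value
tidy <- data.frame(
  city = c("London", "London", "Leeds", "Leeds"),
  day = c("monday", "tuesday", "monday", "tuesday"),
  rainfall_mm = c(12, 0, 3, 8)
)

# With tidy data, grouped calculations need only the column names
total_rainfall <- aggregate(rainfall_mm ~ city, data = tidy, FUN = sum)
total_rainfall
```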
Pivot a ‘long’ dataset to a ‘wide’ format (human-readable table). Typically, you will want to take the names for your new columns from a column with categorical data and the values for each observation from a column with numerical data:
my_df |>
pivot_wider(
names_from = categorical_column,
values_from = numerical_column
)
More info: Tidyverse reference page
Pivot a ‘wide’ dataset to a ‘long’ format (machine-readable
database). You can choose the name of the new column that holds the old
column names and the name of the new column that holds the values. The
example below uses key and value respectively:
my_df |>
pivot_longer(
cols = 2:last_col(),
names_to = "key",
values_to = "value"
)
More info: Tidyverse reference page
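The sketch below takes a small made-up dataset from long to wide and back again, so you can compare the shapes at each step. The dataset and its column names are assumptions for the example.

```r
library(tidyr)

# Hypothetical long dataset: one row per city and day
long <- data.frame(
  city = c("London", "London", "Leeds", "Leeds"),
  day = c("monday", "tuesday", "monday", "tuesday"),
  rainfall_mm = c(12, 0, 3, 8)
)

# Long to wide: one row per city, one column per day
wide <- long |>
  pivot_wider(names_from = day, values_from = rainfall_mm)

# Wide back to long: gather the day columns into key/value pairs
long_again <- wide |>
  pivot_longer(cols = 2:last_col(), names_to = "key", values_to = "value")
```

Here wide has two rows and three columns, while long_again is back to four rows.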
A bit like VLOOKUP in Excel, where “column_key” is the column to join by that features in both datasets
my_df |>
left_join(dataset2, by = "column_key")
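One thing worth checking after a join is whether every row found a match. A minimal sketch, using two made-up datasets (orders and customers are hypothetical names):

```r
library(dplyr)

# Hypothetical datasets that share a customer_id column
orders <- data.frame(customer_id = c(1, 2, 3), amount = c(50, 20, 75))
customers <- data.frame(customer_id = c(1, 2), name = c("Ana", "Ben"))

# Keep every row of orders and add the matching columns from customers
joined <- orders |>
  left_join(customers, by = "customer_id")

# Rows with no match are kept, with NA in the joined columns
unmatched <- joined |>
  filter(is.na(name))
```

If unmatched has any rows, the keys in the two datasets may be spelled or formatted differently.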
my_df <- bind_rows(dataset1, dataset2)
my_df |>
distinct()
Missing values in datasets are shown as NA.
my_df |>
filter(is.na(column_name))
my_df |>
filter(!is.na(column_name))
my_df |>
mutate(column_name = replace_na(column_name, 0))
my_df |>
fill(column_name, .direction = "down")
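The snippets above can be combined on a small made-up dataset to see what each one does. The column names and values here are assumptions for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical dataset with gaps, as often found in spreadsheets with merged cells
my_df <- data.frame(
  region = c("North", NA, NA, "South", NA),
  value = c(1, NA, 3, NA, 5)
)

# Count the missing values in a column
missing_values <- sum(is.na(my_df$value))

cleaned <- my_df |>
  # Carry the last known region down into the blank rows below it
  fill(region, .direction = "down") |>
  # Replace missing values with zero
  mutate(value = replace_na(value, 0))
```

Replacing NA with zero changes the results of calculations such as the mean, so it is only appropriate when a missing value really does mean zero.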
You can use the seq() function to create a sequence of
dates in R. Just give it two date objects and then tell it the interval
with the by argument.
For example, to get the first day of every month in 2022 you can do this:
first_day_of_month <- seq(as.Date("2022-01-01"), as.Date("2022-12-01"), by = "month")
You can combine this with lubridate’s period functions
to create some handy results. For example, to get the last day of every
month in 2022, irrespective of the month length, take the first day
of each month from February 2022 to January 2023 and subtract one day:
library(lubridate)
last_day_of_month <- seq(as.Date("2022-02-01"), as.Date("2023-01-01"), by = "month") - days(1)
Note: The by argument in seq can take other
intervals. For example, you can use "week" to get every Monday (or any
other day of the week) from a given start date to a given end
date.
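For instance, a weekly sequence could look like the sketch below. The start and end dates are chosen for illustration; the start date determines which day of the week you get.

```r
# 3 January 2022 was a Monday, so stepping by week returns every Monday
mondays <- seq(as.Date("2022-01-03"), as.Date("2022-01-31"), by = "week")
mondays

# Check the day of the week for each date (1 means Monday)
format(mondays, "%u")
```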
Note: the code snippets in this section assume you already have the
tidyverse loaded with library(tidyverse)
my_df |>
mutate(new_column_name = old_column_name * 100)
my_df |>
mutate(column_name = column_name * 100)
More info: Tidyverse reference page
my_df |>
group_by(categorical_column) |>
count() |>
ungroup()
More info: Tidyverse reference page
my_df |>
group_by(categorical_column) |>
summarise(mean = mean(numerical_column),
max = max(numerical_column),
sum = sum(numerical_column)) |>
ungroup()
More info: Tidyverse reference page
Eg below rounds to 1 decimal place
my_df |>
mutate(pct = round(100 * column_name, 1))
my_df |>
mutate(pct_diff = ((column_name_1 / column_name_2) - 1) * 100)
my_df |>
arrange(date) |>
mutate(yoy_change = ((column_name / lag(column_name)) - 1) * 100)
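A small worked example of the change calculation, using a made-up annual series that starts out of order (the values are invented for illustration):

```r
library(dplyr)

# Hypothetical annual series, deliberately out of order
my_df <- data.frame(
  date = as.Date(c("2022-01-01", "2020-01-01", "2021-01-01")),
  value = c(121, 100, 110)
)

changes <- my_df |>
  # lag() looks at the previous row, so the order of the rows matters
  arrange(date) |>
  mutate(yoy_change = ((value / lag(value)) - 1) * 100)
```

The first row has no previous value, so its change is NA. Note that lag() steps back one row by default, which is only a year-on-year change when there is one row per year. For monthly data, lag(value, 12) steps back twelve rows instead.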
And ignore any NA values
sum(my_df$column_name, na.rm = TRUE)
And ignore any NA values
median(my_df$column_name, na.rm = TRUE)
mean(my_df$column_name, na.rm = TRUE)
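# R's mode() returns the storage type, not the most common value, so count the values with table() instead
names(which.max(table(my_df$column_name)))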
And ignore any NA values
min(my_df$column_name, na.rm = TRUE)
max(my_df$column_name, na.rm = TRUE)
my_df |>
mutate(rank = rank(column_name))
my_df |>
mutate(decile = ntile(column_name, 10))
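It helps to see what these functions return on a small example, particularly for ties and for ranking from highest to lowest. The sketch below uses made-up scores, and min_rank() and desc() are dplyr functions not used elsewhere on this page.

```r
library(dplyr)

# Hypothetical scores with a tie at the top
scores <- data.frame(name = c("a", "b", "c", "d"), score = c(50, 90, 70, 90))

ranked <- scores |>
  mutate(
    # rank() gives 1 to the smallest value and averages the ranks of ties
    rank_low_to_high = rank(score),
    # min_rank(desc()) gives 1 to the largest value, and ties share the best rank
    rank_high_to_low = min_rank(desc(score)),
    # ntile() splits the rows into n roughly equal groups, here two halves
    half = ntile(score, 2)
  )
```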
Remember to arrange the series in chronological order first
my_df |>
arrange(date) |>
mutate(col_rebased = 100 * (col_to_rebase/col_to_rebase[1]))
Reproducible example to try in R: Rebase the unemployment rate to the value in the first and nth rows
library(tidyverse)
rebase <- economics
# Rebase a single series:
rebase_1 <- rebase |>
# Ensure dataframe is arranged by date
arrange(date) |>
# Create new column with data rebased to the value in the first row
mutate(unemploy_rebased = 100 * (unemploy / first(unemploy)))
rebase_2 <- rebase_1 |>
# Ensure dataframe is arranged by date
arrange(date) |>
# Create new column with data rebased to the value in the nth row
mutate(unemploy_rebased_2 = 100 * (unemploy / nth(unemploy, 5)))
Gist: rebase_solutions.R
Remember to arrange the series in chronological order first
library(RcppRoll)
my_df |>
arrange(date) |>
  # Column names cannot start with a number unless wrapped in backticks
  mutate(avg_7d = roll_meanr(column_name, n = 7))
Reproducible example to try in R: Calculate the 7 day moving average for new Covid cases in the UK
library(tidyverse)
library(RcppRoll)
moving_avg <- read_csv("https://api.coronavirus.data.gov.uk/v2/data?areaType=nation&areaCode=E92000001&metric=newCasesByPublishDate&format=csv") |>
  # Select columns
  select(date, area = areaName, new_cases = newCasesByPublishDate)
moving_avg_2 <- moving_avg |>
  # Ensure no missing dates by adding a row for every day in the range
  complete(area, date = seq(min(date), max(date), by = "day")) |>
  # Ensure arranged by date
  arrange(date) |>
  # Group by category (optional)
  group_by(area) |>
  # Create a column with the rolling average of the latest 7 values
  mutate(new_cases_7da = roll_meanr(new_cases, n = 7)) |>
  # Round to 0dp
  mutate(new_cases_7da = round(new_cases_7da)) |>
  ungroup()
Gist: rolling_averages.R
Reproducible example to try in R: The UK population is forecast to grow by 0.5% per year over the next 10 years. The current population is 68.3 million. What is the estimated population in 10 years?
# Define inputs
annual_growth_rate <- 1.005
years <- 10
start_value <- 68.3
# Calculate the final value after X years of Y% annual growth
final_value <- start_value * (annual_growth_rate ^ years)
# Calculate the absolute difference between final value and start value
absolute_change <- final_value - start_value
# Calculate the overall percentage growth over the period
pct_change <- ((annual_growth_rate ^ years) - 1) * 100
And what is the estimated population each year based on this growth rate?
# Define start and end date
start_year <- 2021
end_year <- 2031
compound_series <-
# Create a dataframe where first column is each year
data.frame(year = start_year:end_year,
# ...and second column has the annual growth rate for all subsequent years
annual_growth_rate = c(1, rep(annual_growth_rate, years))) |>
# Calculate the cumulative growth rate for each year
mutate(cumulative_growth_rate = cumprod(annual_growth_rate)) |>
# Multiply the start value by each year's cumulative growth rate to get the value each year
mutate(value = start_value * cumulative_growth_rate)
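As a sanity check, the two approaches above should agree: the last value of the year-by-year series should equal the closed-form result, start value × growth rate ^ years. A minimal sketch in base R, reusing the inputs from the example (yearly_values is a name made up for the check):

```r
# Inputs from the example above
annual_growth_rate <- 1.005
years <- 10
start_value <- 68.3

# Closed form: start value multiplied by the growth rate raised to the number of years
final_value <- start_value * annual_growth_rate ^ years

# Step by step: apply the growth rate one year at a time
yearly_values <- start_value * cumprod(c(1, rep(annual_growth_rate, years)))

# The last value of the series should match the closed form
all.equal(final_value, yearly_values[length(yearly_values)])
```

With these inputs the estimated population after 10 years comes out at roughly 71.8 million.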
Reproducible example to try in R: Convert US dollars per UK pound to UK pounds per US dollar
# Get currency exchange rates via FRED (requires a FRED API key, set with fredr_set_key())
library(tidyverse)
library(fredr)
# Define inputs
series_id <- "DEXUSUK"
start_date <- "2020-01-01"
# Get US dollars per UK pound
invert_currency_1 <- fredr(series_id = series_id) |>
select(date, dollar_per_pound = value) |>
filter(date >= start_date)
# Convert to UK pounds per US dollar
invert_currency_2 <- invert_currency_1 |>
mutate(pound_per_dollar = 1 / dollar_per_pound)
Reproducible example to try in R: Calculate the appreciation of the US dollar against the UK pound
# Define base period
base_period <- "2020-01-02"
# Mutate to calculate the % change in the strength of the dollar v pound relative to base period
# (the dollar strengthens when one dollar buys more pounds)
appreciate_currency_1 <- invert_currency_2 |>
  mutate(dollar_appreciation =
           ((pound_per_dollar / pound_per_dollar[date == base_period]) - 1) * 100)
# Mutate to calculate the % change in the strength of the pound v dollar relative to base period
appreciate_currency_2 <- appreciate_currency_1 |>
mutate(pound_appreciation = ((1 / ( 1 + (dollar_appreciation/100))) -1 ) * 100)
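The direction of these calculations is easy to get backwards, so it can help to test the logic on round numbers first. In the sketch below the exchange rates are made up: the pound goes from buying $1.25 to buying $1.00, so the dollar has strengthened.

```r
# Hypothetical rates: the pound buys fewer dollars at the end than at the start
dollar_per_pound_start <- 1.25
dollar_per_pound_end <- 1.00

# Invert to get pounds per dollar
pound_per_dollar_start <- 1 / dollar_per_pound_start
pound_per_dollar_end <- 1 / dollar_per_pound_end

# % change in the value of the dollar, measured in pounds
dollar_appreciation <- ((pound_per_dollar_end / pound_per_dollar_start) - 1) * 100

# % change in the value of the pound, measured in dollars
pound_appreciation <- ((dollar_per_pound_end / dollar_per_pound_start) - 1) * 100

# The pound's change can also be derived from the dollar's change
pound_from_dollar <- ((1 / (1 + dollar_appreciation / 100)) - 1) * 100
```

Here the dollar appreciates by 25% while the pound depreciates by 20%: the two figures are linked, but one is not simply the other with the sign flipped.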
It’s easy to feel overwhelmed by frequent and/or confusing error messages when learning R. But don’t be too disheartened, as it’s all part of the learning process and often the error will be due to one of the following:
Typo: first try to rule out whether something as simple as a misspelling, missing comma or bracket has caused the code to break.
Data in the wrong format: use the
class() function to check if the error could be due to data
not being in the right format. For example, if R is interpreting numbers
as characters or text strings, it won’t be able to perform calculations
on that data. This kind of error often generates a message with the
phrase “non-numeric argument to binary operator”. To resolve an error
like this, convert data to/from numeric and character formats using
mutate() and the functions as.numeric() and
as.character().
Setting working directory: if the issue relates to being unable to read or write a file on your computer, then it might be helpful to put your work into an R project. Doing this will massively simplify the way you reference your file directories, as R projects essentially bundle everything within the R project in a portable, self-contained folder.
Installing a package: if the issue relates to being unable to install an R package, then check out this summary of common package installation problems and solutions.
Running a function: if the issue relates to not
being able to run a function, check you’ve understood how the function
works and what arguments it takes by typing “?” in the R Console,
followed by the function name, eg ?sum. This should bring
up the “Help” window in RStudio with information about the
function.
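To illustrate the “data in the wrong format” case above, this sketch deliberately triggers the error on a made-up vector of numbers stored as text, then fixes it. tryCatch() is used only so the example keeps running after the error.

```r
# Numbers stored as text, as often happens after importing a messy file
value <- c("10", "20", "30")

# Trying to do arithmetic on text triggers the error; capture its message
error_message <- tryCatch(
  value * 2,
  error = function(e) conditionMessage(e)
)
error_message

# Check the type of the data
class(value)

# Convert to numeric and the calculation works
fixed <- as.numeric(value) * 2
fixed
```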
If none of the above apply, then try:
Googling the error message to see if other people have posted similar issues on forums like Stack Overflow. Check out the responses they got and see if you can learn from these to fix your issue.
Posting a question in the #r-learners Slack channel