This post serves as the primary tutorial notes for the “Introduction to R” sessions of the Intensive Statistics course for MCom (Economics) students at Stellenbosch University (2025). This lecture is intended to offer a cursory introduction to enable students to perform basic operations in R and RStudio.
Advantages of working in R and RStudio include:

- Free updates and dissemination.
- Widespread availability of helpful resources like Stack Overflow.
- Code completion in RStudio (Ctrl/Cmd + Space is very useful).
- The ability to author, connect to data, and run code in R Markdown, RStudio’s native authoring framework for data science.
Before you start, please complete the following steps to download and install the necessary software on your machine:
“R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.”
Select the appropriate link corresponding to your machine’s operating system and follow the subsequent instructions:
Execute the newly downloaded R-4.4.2-win.exe
file and follow the instructions. If you encounter any difficulties, please ask for help!
Next, RStudio (the program through which we will write and run R code):

“RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.”
Once you have completed downloading and installing R, do the same for RStudio. Execute the newly downloaded RStudio-2024.12.0-467.exe file and follow the instructions.
File > New Project > New Directory > New Project > Choose directory location and name
It is advisable to work from an R project,1 as this keeps track of your most current workspace and variable environment. Its location will serve as your “root folder” or main directory for subsequent operations in the project.
To work from a particular project, open it with File > Open Project, or start your session in RStudio by clicking the relevant .Rproj file.
For this tutorial: please create a folder and name it whatever you want; in that folder, create a subfolder called tutorial and save your R script there. We will list files in directories later.
Once operating from a new project, you should be faced with the default workspace, as illustrated in Step 3 above. The pane on the left is called the Console. As in Stata, code can be typed and executed directly in your console (the equivalent of Stata’s “Command” window).
Try: Enter the code
getwd()
in your console to observe the location of your working directory, the location of your project.
Alternatively, this code can be stored in and executed from a script or .R file, much like do-files in Stata.
Open a new script with File > New File > R Script, or with Ctrl/Cmd + Shift + N. Save the .R file in your working directory with File > Save or Ctrl/Cmd + S. After creating and saving a new script, you should observe a workspace layout like the one summarised in Table 1.
Location | Function |
---|---|
Top left | Script/RMarkdown |
Top right | Global environment |
Bottom left | Console |
Bottom right | Files/Plots/Packages/Help |
With the software installed and operating, and your new project and script created, you are ready to get started with R and RStudio. Proceed by copying the code from this post, or by working from the tutorial’s corresponding master script available here.
NB: Highlighted code in the console or script can be executed using
Ctrl/Cmd + Enter
. When working from a script, output will typically appear in the console.
We rely on packages to access pre-written functions performing specific tasks.2 Packages can be downloaded and installed from CRAN, the official repository of packages, or from GitHub, which often acts as a repository for third-party libraries or beta versions.
Installing, loading and maintaining packages can be tedious. Packages need only be installed once with install.packages("x"), after which only library(x) is necessary.
# First install the package from CRAN
install.packages("pacman")
# Load the installed package into your workspace
library(pacman)
If a package x has already been installed, running install.packages("x") again is unnecessary and may produce an error or warning.
It can be difficult to keep track of all the packages on your machine. The pacman
package allows us to easily install and load packages from CRAN with p_load()
, and from GitHub with p_load_gh()
. This will install and load a new package, or merely load previously installed ones. Let’s install the packages we will be using in this tutorial.
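For instance, a single p_load() call can cover the packages this tutorial draws on. A sketch (pacman must already be loaded; the list matches the packages used below):

# Install (if necessary) and load the packages used in this tutorial
p_load(tidyverse, glue, skimr, huxtable, fixest, urca, mFilter, MetBrewer)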
NB: Functions are called using the function’s name, e.g.,
p_load()
, or its full name, e.g., pacman::p_load()
. The latter is useful when functions from different packages share the same name.
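A classic illustration of such a name clash (my example, not part of the original notes) is filter(), which exists in both the stats and dplyr packages:

# Two different functions share the name "filter"; the prefix removes the ambiguity
stats::filter(1:10, rep(1/3, 3)) # a moving-average (linear) filter on a series
dplyr::filter(mtcars, cyl == 6)  # subset the rows of a data frame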
Directories can be viewed in the bottom-right pane, in addition to plot outputs, currently loaded packages, and help files. Should you ever require help or additional information regarding a specific command, add a ?
before that command and run the code, for example:
?pacman::p_load()
R
is an object-oriented language. Objects of various classes (scalars, matrices, data frames, vectors, etc.) can be stored in memory for later use. Once named and saved, these objects will appear in your global environment.
We use the assignment operators <-
or =
to name and save objects.
Shortcut:
Alt/Option + Minus
to get <-
# object name <- (or =) value(s)
a <- 10
hello <- "Hello world!"
test <- TRUE
# Determine the class of an object
class(a)
[1] "numeric"
class(hello)
[1] "character"
class(test)
[1] "logical"
By executing the relevant code, the objects should appear in your global environment pane (top right).
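You can also list the contents of your environment from the console with base R’s ls(); assuming only the three objects above have been created, it should return their names:

# List the objects currently in the global environment
ls()
[1] "a"     "hello" "test"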
Report these variables in your output by running the following:
a
[1] 10
# or
print(hello)
[1] "Hello world!"
# or by using the glue package for something more fancy
glue::glue("It's {test}. I saved a variable which contains {hello} and I stored the number {a}.")
It's TRUE. I saved a variable which contains Hello world! and I stored the number 10.
An array object is equivalent to a vector of values from the same class. Arrays can be created by concatenating values using the function c()
.
x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
z <- c(7, 8, 9, 10)
# Useful functions to perform on arrays/vectors
sum(x)
[1] 10
min(x)
[1] 1
median(x)
[1] 2.5
# summary() provides a summary of the functions above
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.75 2.50 2.50 3.25 4.00
# Missing values denoted by NA
x_with_missing <- c(1, 2, 3, NA)
# Take care to properly treat missing values:
sum(x_with_missing)
[1] NA
sum(x_with_missing, na.rm = TRUE)
[1] 6
Data frames, consisting of rows and columns, are the workhorse of statistical analysis in R
. They can be created in various ways. Rows and columns can also be named. As always: try to keep your names meaningful and functional - here the data frame’s name comes from the fact that it is the first data frame.
# data.frame() can create columns from arrays and assign column names
df_1 <- data.frame(A = x, B = y, C = z)
# Some useful operations
# as expected, the below provides the column names
colnames(df_1)
[1] "A" "B" "C"
# if you want to create a dataset that looks the same
# as the previous one but with a new name (and therefore more memory):
df_1_copy <- df_1
# If you want to change the names of your variables
colnames(df_1_copy) <- c("col1", "col2", "col3")
colnames(df_1_copy)
[1] "col1" "col2" "col3"
# To count the number of rows and columns
#(sometimes you'll need this if you do manual transformation)
nrow(df_1)
[1] 4
ncol(df_1)
[1] 3
Specific rows, columns, and cells can be referenced as follows:
# Return column "A" as a vector
df_1$A
[1] 1 2 3 4
# df_1[row no., column no.] - empty implies all
df_1[, 1]
[1] 1 2 3 4
# Using tidyverse's pipe operator %>%
# the below pulls all values from column A
# it does not replace the values of df_1 with column A
df_1 %>% pull(A)
[1] 1 2 3 4
# If we want to save that result as an array, we can write:
col_A <- df_1 %>% pull(A)
Shortcut:
Ctrl/Cmd + Shift + M
to get %>%
There are numerous ways to subset a data frame.
# Similarly with rows
# Return row 2 as a single row data frame
df_1[2, ]
# Return row 2-3 as a two row data frame
df_1[2:3, ]
# Return cell in row 2 column 1
df_1[2, 1]
Let’s create a new column named “D”:
# Create a new column "D" that is the sum of A and B
df_1$D <- df_1$A + df_1$B
# is the same as
df_1 <- df_1 %>% mutate(D = A + B)
We can also create a subset of the data based on a condition.
# you can permanently remove missing data from a
# data frame using the na.omit code below
df_missing <- data.frame(A = x_with_missing, B = y, C = z)
df_missing_clean <- na.omit(df_missing)
# count the rows and columns, you will see there is one less row now
nrow(df_missing_clean)
[1] 3
ncol(df_missing_clean)
[1] 3
# You can also remove rows or columns with specific conditions
# e.g. drop all rows where the value in column C is 9 or above:
df_missing_9 <- df_missing[df_missing$C < 9, ]
# count the rows and columns: the two rows with C of 9 or more are gone
nrow(df_missing_9)
[1] 2
ncol(df_missing_9)
[1] 3
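As an aside, the same subset can be written with dplyr’s filter() (a sketch, assuming the tidyverse is loaded):

# Keep only the rows where C is below 9
df_missing_9_tidy <- df_missing %>% filter(C < 9)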
Data is often imported from external files, such as .csv files. Let’s consider an example. If you haven’t done so already, download the files attached in the email I sent. Alternatively, you can use this link. Extract the files in the .zip folder and copy the data folder into your working directory.
We will generally import data from either the internet or a local drive. Where possible, avoid calling data from the internet repeatedly.
Only run this code if you are sure about your folder structure. If you struggle with this code, I’ll help you at the end.
To make sure your data loads, you need to confirm that the files exist and are called from the correct location. In the code below, we look for a folder called data
in our directory.
Note the location prefixes in the code below:

- The ‘..’ prefix means that the code will go to the parent directory of the current working directory, and then find the file specified.
- The ‘.’ prefix means that the data is in a subfolder of the current working directory.
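If you just want to check a single file, base R’s file.exists() offers a quick sanity test (a sketch; adjust the path to your own structure):

# Returns TRUE if the file is where we think it is
file.exists("../data/Ireland_energy.csv")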
# DIRECTORIES
# It is often useful to confirm your working directory first
getwd()
[1] "G:/Timeseries_2025/r_introduction/tutorial"
# But often, it helps to confirm all folders in the directory
# recursive = FALSE stops further subfolders from being searched
# full.names = FALSE makes sure we only get the folder name and not the full path
folders <- list.dirs(path = ".", full.names = FALSE, recursive = FALSE)
print(folders)
character(0)
# If we do not have the folder in the current working directory,
# check the parent directory
parent_folders <- list.dirs(path = "..", full.names = FALSE, recursive = FALSE)
print(parent_folders)
[1] ".Rproj.user" "assignment" "data" "images"
[5] "scripts" "tutorial"
We can also use the structure above to create a list with all files in a specific directory - this is often useful for messily-scraped data. Make sure you confirm that the directory references match your folder structure.
# You can list the files in a specific directory as below
# Here I use the full names as we might want to use the names in the list later
list_files <- list.files(path = "../data/", recursive = FALSE, full.names = TRUE)
print(list_files)
[1] "../data/cs_data.csv" "../data/cs_df.csv"
[3] "../data/Ireland_energy.csv" "../data/Ireland_population.csv"
# You can also specify that you only want a specific type of file, example CSVs
list_files_csv <- list.files(path = "../data/", pattern = "\\.csv$", recursive = FALSE, full.names = TRUE)
print(list_files_csv)
[1] "../data/cs_data.csv" "../data/cs_df.csv"
[3] "../data/Ireland_energy.csv" "../data/Ireland_population.csv"
# You can also specify that you only want files with specific text in them
# in this example, we only look for "ireland"
list_files_irel <- list.files(path = "../data/", pattern = "ireland", recursive = FALSE, full.names = TRUE)
print(list_files_irel)
character(0)
# It returns nothing! This is because pattern matching is case-sensitive,
# so lowercase "ireland" does not match "Ireland"; we need to ignore case
list_files_ire <- list.files(path = "../data/", pattern = "ireland", recursive = FALSE, full.names = TRUE, ignore.case = TRUE)
print(list_files_ire)
[1] "../data/Ireland_energy.csv" "../data/Ireland_population.csv"
Sometimes we also want to create directories to store output in:
# Specify the name of the directory
dir_name <- "./figures"
# Create the directory
# Confirm that directory exists, else create it
if (!dir.exists(dir_name)) { # an if statement: the branch runs when dir.exists(dir_name) is FALSE
# i.e. when the directory does not exist, we create the directory called [dir_name] as defined above
dir.create(dir_name)
cat("Directory created:", dir_name, "\n")
} else {
cat("Directory already exists:", dir_name, "\n")
}
Directory created: ./figures
Alternatively, we may know that we will always create output in a certain format and want to keep it in certain folders:
# List of folder names
# - this helps if you know you'll always have the same structures
folders <- c("figures", "tables")
# Loop over the list of folder names
for (folder in folders) { # a standard for loop: it repeats the commands for every entry in folders
# each entry is referred to as [folder] inside the loop;
# in the first iteration of the loop [folder] takes the value [figures]
# in the second iteration [folder] takes the value [tables]
# there is no third iteration, as folders <- c("figures", "tables") has only two entries
# Construct the directory path
dir_name <- paste0("./", folder)
# Check if the directory exists
if (!dir.exists(dir_name)) {
# If it doesn't, create the directory
dir.create(dir_name)
cat("Directory created:", dir_name, "\n") # this prints that the directory is created, the cat command is to concatenate the text.
} else {
cat("Directory already exists:", dir_name, "\n") #
}
}
Directory already exists: ./figures
Directory created: ./tables
Now, read the data from the Ireland_energy.csv
file in the data
folder using read.csv()
, and create a data frame called ire_energy
with an assignment operator. The data represents Ireland’s energy consumption for 1980-2018.
ire_energy <- read.csv(file = "../data/Ireland_energy.csv",
header = TRUE)
# The "file" argument refers to the relative file path from your root directory
# The "header" argument is set to true because the .csv file contains column headings
Do the same for a corresponding file containing data on Ireland’s population from Ireland_population.csv
, and create a data frame called ire_pop
.
ire_pop <- read.csv("../data/Ireland_population.csv")
# Sometimes it's unnecessary to spell out the arguments
# here, we don't specify that headers are true
We now have two data frames. Take a peek at the first 5 observations in ire_energy
and the last 5 observations in ire_pop
with the following commands:
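# head() and tail() are base R; the second argument sets the number of rows shown
head(ire_energy, 5) # first 5 observations
tail(ire_pop, 5)    # last 5 observations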
We can merge the two data frames on the basis of some common value, e.g. Year
, to create a single one called ireland_df
.
# We use the merge command as below where x and y
# are the names of our imported dataframes
ireland_df <- merge(x = ire_energy, y = ire_pop, by.x = "Year", by.y = "Year")
# merge() merges data frames x and y on the basis of some column
# by.x for x's column and by.y for y's column
# Let's confirm our code did what we said it would:
skimr::skim(ireland_df)
Name | ireland_df |
Number of rows | 39 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 1999 | 11.4 | 1980 | 1989.5 | 1999 | 2008.5 | 2018 | ▇▇▇▇▇ |
GJ | 0 | 1 | 522828057 | 147396263.1 | 311769904 | 367775214.1 | 586079129 | 650282165.0 | 727737237 | ▇▂▁▅▇ |
Population | 0 | 1 | 3976410 | 491594.5 | 3417800 | 3538152.5 | 3760480 | 4518187.5 | 4876547 | ▇▂▂▂▃ |
The code above can be simplified if the variable exists with the same name in both datasets:
# We can also simplify the command if the same variable exists in both datasets:
ireland_df_a <- merge(x = ire_energy, y = ire_pop, by = c("Year"))
skimr::skim(ireland_df_a)
Name | ireland_df_a |
Number of rows | 39 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 1999 | 11.4 | 1980 | 1989.5 | 1999 | 2008.5 | 2018 | ▇▇▇▇▇ |
GJ | 0 | 1 | 522828057 | 147396263.1 | 311769904 | 367775214.1 | 586079129 | 650282165.0 | 727737237 | ▇▂▁▅▇ |
Population | 0 | 1 | 3976410 | 491594.5 | 3417800 | 3538152.5 | 3760480 | 4518187.5 | 4876547 | ▇▂▂▂▃ |
If we want to remove data frames from memory, we use the code below - this is especially important if you are working on a shared server in a secure facility.
# Sometimes we want to drop dataframes:
rm(ireland_df_a)
In the first example above, we showed a general procedure where the name of the merging variable does not have to be the same in both datasets - below I give an example of this.
# The names of the variables to merge on do not have to be the same
# Let's create a dataframe where we change the name of the previous one:
irepop_difname <- ire_pop %>%
rename("data_year" = "Year")
# We can still merge the data, but now we have to change the by.y variable
ireland_df_a <- merge(x = ire_energy, y = irepop_difname, by.x = "Year", by.y = "data_year")
# As you can see both are the same
skimr::skim(ireland_df)
Name | ireland_df |
Number of rows | 39 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 1999 | 11.4 | 1980 | 1989.5 | 1999 | 2008.5 | 2018 | ▇▇▇▇▇ |
GJ | 0 | 1 | 522828057 | 147396263.1 | 311769904 | 367775214.1 | 586079129 | 650282165.0 | 727737237 | ▇▂▁▅▇ |
Population | 0 | 1 | 3976410 | 491594.5 | 3417800 | 3538152.5 | 3760480 | 4518187.5 | 4876547 | ▇▂▂▂▃ |
skimr::skim(ireland_df_a)
Name | ireland_df_a |
Number of rows | 39 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 1999 | 11.4 | 1980 | 1989.5 | 1999 | 2008.5 | 2018 | ▇▇▇▇▇ |
GJ | 0 | 1 | 522828057 | 147396263.1 | 311769904 | 367775214.1 | 586079129 | 650282165.0 | 727737237 | ▇▂▁▅▇ |
Population | 0 | 1 | 3976410 | 491594.5 | 3417800 | 3538152.5 | 3760480 | 4518187.5 | 4876547 | ▇▂▂▂▃ |
# Dropping the example dataframe
rm(ireland_df_a)
Thereafter, create and save a new column called ln_energy_pc
, representing the natural logarithm of Ireland’s per capita energy consumption.
ireland_df <- ireland_df %>%
mutate(ln_energy_pc = log(GJ / Population))
# You should recognise mutate() from before
# log()'s default setting implies natural logarithmic transformation
# Instead of tidyverse piping, you could have done this:
ireland_df <- mutate(.data = ireland_df, ln_energy_pc = log(GJ / Population))
# But piping is more useful when you require multiple consecutive operations
# For example, everything we've done thus far could've been condensed
ireland_df <- read.csv("../data/Ireland_energy.csv") %>%
merge(
x = ., # full stop represents the result of all previous operations
y = read.csv("../data/Ireland_population.csv"),
by.x = "Year",
by.y = "Year"
) %>%
mutate(ln_energy_pc = log(GJ / Population))
It is sensible to ensure that a data frame’s variables are in the appropriate format or class. For example, ireland_df
’s Year
column is of the class integer
.
ireland_df$Year %>%
class(.)
[1] "integer"
Many useful functions require that your vectors/columns be of class date
. Transform the Year
column from an integer to a date.
# Transform the `Year` column from an integer to a date.
# note that we protect the value of the variable year with {}
# if we didn't, the glue command would change every value to the word [Year-01-01]
ireland_df <- ireland_df %>%
mutate(Year = glue::glue("{Year}-01-01"))
# but the result is of class `character`:
# sometimes base syntax is faster than piping for quick commands:
class(ireland_df$Year)
[1] "glue" "character"
# We want to transform the year variable to a date
# The as.Date() below renders characters of a given format into dates
# We set that format as: "%Y-%m-%d" means yyyy-mm-dd
ireland_df <- ireland_df %>%
mutate(Year = as.Date(x = Year, format = "%Y-%m-%d"))
# Finally it is in date format
ireland_df$Year %>%
class(.)
[1] "Date"
# Ensure the data is in chronological order
ireland_df <- ireland_df %>%
arrange(Year) # this is equivalent to sorting by Year
Let’s see what our new data frame ireland_df
looks like. Run View(ireland_df)
or click the appropriate icon in your Global Environment to view the data frame in your workspace.
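For a quick, non-interactive look in the console, base R’s head() also works:

head(ireland_df) # prints the first rows of the finished data frame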
When reporting your data frame in a .Rmd
or Markdown document, you can create a table using the huxtable
package.
ireland_df %>%
filter(Year > as.Date("2013-01-01")) %>%
# subsets data for entries after 2013
as_hux() %>%
# or huxtable::as_hux() to transform data frame into huxtable object
# hereafter code to define certain aesthetic qualities of our table
theme_basic() %>%
# use a theme to make tables more presentable, e.g. theme_article() or theme_compact()
set_number_format(col = c(2, 4), value = 2) %>%
# set number of decimals to 2 in the 2nd and 4th column
set_font_size(10) %>%
set_caption("Ireland's energy consumption after 2013")
Year | GJ | Population | ln_energy_pc |
---|---|---|---|
2014-01-01 | 603311823.70 | 4662713 | 4.86 |
2015-01-01 | 632644273.00 | 4708040 | 4.90 |
2016-01-01 | 665078014.40 | 4762595 | 4.94 |
2017-01-01 | 674283626.40 | 4813138 | 4.94 |
2018-01-01 | 695195813.70 | 4876547 | 4.96 |
We can now write this data frame back to a .csv
file. The code below saves it in the data
folder in our working directory. (Remember your directory conventions!)
ireland_df %>% # data frame to be written to csv
write.csv(
x = ., # ireland_df is piped into "."
file = "../data/ireland_complete.csv", # file path and file name we choose
#(note the .. because of the directory structure)
row.names = FALSE
) # because we have no row names
Let’s perform some time series analysis with this data, particularly with respect to Ireland’s energy consumption per capita.
It is always useful to eyeball the data before proceeding with formal analysis. Here are some examples using Base R:
# plot()'s default is a scatterplot
# inputting a single vector
ireland_df$ln_energy_pc %>%
plot(.)
# We have an undefined x-axis above, so
# inputting a data frame with two columns (x, y)
ireland_df %>%
select(Year, ln_energy_pc) %>%
plot(.)
# lines instead of points
ireland_df %>%
select(Year, ln_energy_pc) %>%
plot(., type = "l")
If we want to save the figure for use in another text editor, we can use the code below. I will use the folder we created earlier:
# Open a PNG graphics device in the figures folder we created:
png("./figures/ireland_energy_plot.png", width = 800, height = 600)
# Generate the plot and save it
ireland_df %>%
select(Year, ln_energy_pc) %>%
plot(., type = "l",
main = "Energy Consumption per Capita in Ireland",
xlab = "Year",
ylab = "Log Energy per Capita")
# Close the graphics device
dev.off()
png
2
As an alternative to Base R visualisations, ggplot is often used. It offers a comprehensive suite of graphs that can be flexibly tweaked to your liking. ggplot is part of the so-called tidyverse. To illustrate, let us make a line graph and consider the sequential adding of layers to our “canvas”.
# ggplot() creates an empty 'canvas'
ireland_df %>%
ggplot()
# aes() provides the coordinates (or "mapping")
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc))
# on the canvas, we add layers with pluses (+)
# add one of many existing themes
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw()
# add a scatterplot
# it inherits the previously defined mapping
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point()
# add a line
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point() +
geom_line()
# add the appropriate labels
# add more breaks to the x-axis and change its labels
# change the range of the y-axis
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point() +
geom_line() +
labs(
title = "Ireland's Primary Energy Consumption",
y = "ln(GJs Per Capita)",
x = "Date"
) +
scale_x_date(
date_labels = "`%y",
date_breaks = "2 year"
) +
scale_y_continuous(limits = c(4, 5.5))
# and customise as you please (I am not going over this in class)
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point(
aes(color = ifelse(Year < as.Date("2000-01-01"), "Before 2000",
ifelse(Year > as.Date("2000-01-01"), "After 2000", "2000")
)),
size = 1.5
) +
geom_line(
alpha = 0.5,
color = "lightgrey",
size = 1
) +
labs(
title = "Ireland's Primary Energy Consumption",
y = "ln(GJs Per Capita)",
x = "Date"
) +
scale_x_date(
date_labels = "`%y",
date_breaks = "2 year"
) +
scale_y_continuous(limits = c(4, 5.5)) +
theme(
plot.title = element_text(hjust = 0.5, size = 10),
axis.title.y = element_text(
margin = margin(t = 0, r = 10, b = 0, l = 0),
size = 10
),
axis.title.x = element_text(
margin = margin(t = 0, r = 0, b = 0, l = 0),
size = 10
),
axis.text.x = element_text(angle = 45),
legend.position = "bottom",
legend.margin = margin(t = -10, r = 0, b = 0, l = 0),
legend.title = element_blank()
) +
geom_label(
data = . %>% filter(Year == as.Date("2000-01-01")),
aes(label = round(ln_energy_pc, 1)),
nudge_y = 0.15,
size = 3,
color = met.brewer("Austria", type = "discrete")[1]
) +
geom_hline(aes(color = "Mean", yintercept = mean(ln_energy_pc)),
size = 1,
linetype = "dashed",
show.legend = FALSE
) +
scale_color_manual(values = met.brewer("Austria", type = "discrete"))
Often, you may find yourself unsatisfied with a colour scheme for a variety of reasons. Below, I show how you can define your own colours. I specify colourblind-friendly colours based on Paul Tol’s palette. (Note that the Brewer schemes have plenty of colourblind-friendly options.)
friendly_blue <- '#0072B2'
friendly_orange <- '#D55E00'
friendly_green <- '#009E73'
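As a quick sketch of how these colours behave (a toy plot of my own, assuming ggplot2 is loaded), pass them to a manual scale:

# Preview the custom colours on dummy data
data.frame(group = c("A", "B", "C"), value = 1:3) %>%
ggplot(aes(x = group, y = value, fill = group)) +
geom_col() +
scale_fill_manual(values = c(friendly_blue, friendly_orange, friendly_green))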
Now that we know what the series looks like, proceed to compute and plot the autocorrelation and partial autocorrelation functions of Ireland’s per capita energy consumption.

Below we use R’s built-in autocorrelation function. Recall that mathematically the autocorrelation function plots the following for each \(j\) (up to however many lags we specify):
\[ \rho_j = \frac{\text{cov}(y_t, y_{t-j})}{\text{Var}(y_t)} \]
ireland_df %>%
select(ln_energy_pc) %>% # isolate ln_energy_pc in data frame
acf(
plot = TRUE, # create a plot
type = "correlation"
) # standard ACF
We can also use the partial autocorrelation function:
pacf_result <- ireland_df %>%
select(ln_energy_pc) %>%
acf(
plot = TRUE,
type = "partial"
) # PACF option
Our data might have a long-term trend; this may be problematic when we want to confirm whether our data is stationary.
A simple way to detrend data is by removing the linear or quadratic trend - we can get these trends with simple linear regression.
Let’s do the regressions by “hand” first and then using the R packages. Note that it is often useful to confirm that you understand what your underlying models do by checking the results by “hand”.
# Remember, OLS Coefficients are: (β̂ = (X'X)^(-1) * X'Y)
# Load data
y <- ireland_df$ln_energy_pc # Dependent variable
# Convert 'Year' column to numeric
year_numeric <- as.numeric(format(ireland_df$Year, "%Y"))
# Construct the matrix representing the data (including intercept)
X <- cbind(1, year_numeric) # cbind combines a vector of 1s (the intercept) with the year vector
# Convert y to a column matrix
y_matrix <- matrix(y, ncol = 1)
# Step 1: Compute X'X (X transpose times X)
XtX <- t(X) %*% X # note that t(X) transposes the matrix X and %*% does the matrix multiplication
print("X'X:")
[1] "X'X:"
print(XtX)
                   year_numeric
                39        77961
year_numeric 77961    155848979
# Step 2: Compute the inverse of X'X with solve() (inverting a matrix by hand via Gaussian elimination is tedious)
XtX_inv <- solve(XtX)
# Step 3: Compute X'Y (X transpose times Y)
XtY <- t(X) %*% y_matrix
# Step 4: Compute the coefficients (β̂ = (X'X)^(-1) * X'Y)
beta_hat <- XtX_inv %*% XtY
print("Estimated Coefficients (β̂):")
[1] "Estimated Coefficients (β̂):"
print(beta_hat)
                     [,1]
             -23.11463880
year_numeric   0.01398618
# Step 5: Compute predicted values (ŷ = X * β̂)
y_hat <- X %*% beta_hat
# Step 6: Calculate residuals (ε = y - ŷ)
residuals <- y_matrix - y_hat
Here we do exactly the same, but using R’s lm
framework instead. You can confirm that the estimated coefficients are the same.
# Extract the dependent variable
y <- ireland_df$ln_energy_pc
# Extract the year as a numeric variable from the 'Year' column
year_numeric <- as.numeric(format(ireland_df$Year, "%Y"))
# Combine into a data frame
detrended_example <- data.frame(
y = y,
year = year_numeric,
year_squared = year_numeric^2 # Add quadratic term for future use
)
# ------------------------------------------
# Run the OLS regression for a linear trend
# ------------------------------------------
# note that the y ~ year part is the part that specifies the trend
# Note that the lm(.) means that we are running a linear model
model_linear <- lm(y ~ year, data = detrended_example)
# View the regression summary
summary(model_linear)
Call:
lm(formula = y ~ year, data = detrended_example)
Residuals:
Min 1Q Median 3Q Max
-0.19070 -0.12764 -0.01876 0.12072 0.25112
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -23.114639 3.956777 -5.842 1.03e-06 ***
year 0.013986 0.001979 7.066 2.32e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1391 on 37 degrees of freedom
Multiple R-squared: 0.5744, Adjusted R-squared: 0.5629
F-statistic: 49.93 on 1 and 37 DF, p-value: 2.315e-08
# compare to betas from matrix multiplication
print(beta_hat)
                     [,1]
             -23.11463880
year_numeric   0.01398618
Now we get the predicted values and detrend - in the results below we show that the residuals from the model are identical to the series minus the fitted trend.
# Predict the trend (fitted values from the model)
detrended_example$trend <- predict(model_linear)
# Create the detrended series
detrended_example$detrended_y_b <- detrended_example$y - detrended_example$trend
# Note you can do the same with:
detrended_example$detrended_lin <- residuals(model_linear)
# You can confirm that the detrended series and the residuals are
# the same thing here:
check_detrend <- lm(detrended_lin ~ detrended_y_b, data=detrended_example)
# As you can see this leads to exactly the same result
summary(check_detrend)
Call:
lm(formula = detrended_lin ~ detrended_y_b, data = detrended_example)
Residuals:
Min 1Q Median 3Q Max
-2.795e-14 -1.442e-16 8.579e-16 1.466e-15 2.402e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.727e-15 7.653e-16 -8.790e+00 1.37e-10 ***
detrended_y_b 1.000e+00 5.648e-15 1.771e+14 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.779e-15 on 37 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 3.135e+28 on 1 and 37 DF, p-value: < 2.2e-16
Compare the manually constructed residuals to the residuals from the lm model:
# Place residuals from matrix multiplication here
detrended_example$matrixtrend <- residuals
model_check <- lm(detrended_lin ~ matrixtrend, data=detrended_example)
summary(model_check)
Call:
lm(formula = detrended_lin ~ matrixtrend, data = detrended_example)
Residuals:
Min 1Q Median 3Q Max
-1.876e-13 -7.850e-14 9.570e-16 8.005e-14 1.592e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.374e-11 1.567e-14 1.515e+03 <2e-16 ***
matrixtrend 1.000e+00 1.156e-13 8.648e+12 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.785e-14 on 37 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 7.479e+25 on 1 and 37 DF, p-value: < 2.2e-16
We can do the same for the quadratic trend and plot both:
# ------------------------------------------
# Run the OLS regression for a quadratic trend
# ------------------------------------------
# note that the y ~ year part is the part that specifies the trend
model_quad <- lm(y ~ year + year_squared, data = detrended_example)
# View the regression summary
summary(model_quad)
Call:
lm(formula = y ~ year + year_squared, data = detrended_example)
Residuals:
Min 1Q Median 3Q Max
-0.13022 -0.07776 -0.02081 0.07998 0.22320
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.531e+03 5.421e+02 -6.514 1.44e-07 ***
year 3.524e+00 5.424e-01 6.497 1.51e-07 ***
year_squared -8.779e-04 1.357e-04 -6.471 1.64e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09589 on 36 degrees of freedom
Multiple R-squared: 0.8032, Adjusted R-squared: 0.7923
F-statistic: 73.49 on 2 and 36 DF, p-value: 1.952e-13
# Predict the trend (fitted values from the model)
detrended_example$trend_quad <- predict(model_quad)
detrended_example$detrended_quad <- residuals(model_quad)
# Create the ggplot
ggplot(detrended_example, aes(x = year)) +
geom_line(aes(y = y, color = "Log Gigajoules per Capita"), size = 1) + # Original series
geom_line(aes(y = trend, color = "Linear Trend"), size = 1, linetype = "solid") + # Linear trend
geom_line(aes(y = trend_quad, color = "Quadratic Trend"), size = 1, linetype = "dashed") + # Quadratic trend
labs(
title = "Linear and Quadratic Trends",
x = "Year",
y = "Value",
color = "Legend"
) +
scale_color_manual(
values = c("Log Gigajoules per Capita" = friendly_blue, "Linear Trend" = friendly_orange, "Quadratic Trend" = friendly_green)
) +
theme_minimal() + # Clean theme
theme(
legend.position = "top", # Adjust legend position (e.g., "top", "bottom", "left", "right")
legend.text = element_text(size = 10), # Adjust legend text size
plot.title = element_text(hjust = 0.5, size = 14) # Center and size the title
)
Sometimes you will find yourself needing to access stored results. In the matrix and lm models above, we can do this fairly simply. In the environment, you’ll see the regression we ran above as model_linear
- the name we gave it.
The text in the environment pane is a bit unruly - but we can look at it more closely by clicking the magnifying glass. This gives us more details on the object model_linear.
If we continue exploring we will see that model_linear
has several callable components. If we just want the coefficient for year
for example, we can simply run:
# You can also get the above window by typing:
# View(model_linear)
# We can do the below if we want to keep the variable name
lin_year_coef = model_linear$coefficients["year"]
print(lin_year_coef)
year
0.01398618
# More often, if we want to do this - we'll need the coefficients as vectors
coef_values <- as.numeric(model_linear$coefficients)
print(coef_values)
[1] -23.11463880 0.01398618
# I often find myself needing only specific coefficients from models in order to do the transformation of interest
# suppose I only need the year coefficient from both equations above (as unrealistic as that is)
# Loop over model names and extract "year" coefficient
# set up data-frame for coefficients:
example_coefs <- data.frame(model_type = character(), year_coef = numeric(), stringsAsFactors = FALSE)
# run the loop over model names
# note: when running large-scale data on subgroups of different sizes,
# it is often better to populate this dynamically
# from files stored as csv or txt data.
# This applies especially in secure data facilities.
for (model_name in c("model_linear","model_quad")) {
model <- get(model_name)
coef_value <- as.numeric(model$coefficients["year"]) # Extract the coefficient of 'year'
# Handle cases where 'year' might not exist in the model
if (is.na(coef_value)) {
coef_value <- NA
}
# Append to results data frame
example_coefs <- rbind(example_coefs, data.frame(
model_type = sub("model_", "", model_name), # Extract suffix
year_coef = coef_value
))
}
# Print results
print(example_coefs)
model_type year_coef
1 linear 0.01398618
2 quad 3.52397017
Often, we want to confirm whether a time series has a unit root, which implies that the series follows a random walk and lacks mean reversion. A unit root exists when the coefficient of autocorrelation equals one, meaning the series is non-stationary.
The test can be thought of as testing
\[ H_0: \beta_1 = 1 \]
in the regression below
\[ y_t = \beta_1 y_{t-1} + e_t \]
The Dickey-Fuller test tests the following: \[ (y_t - y_{t-1}) = \alpha_1 y_{t-1} + e_t \]
where \(\alpha_1 = \beta_1 - 1\)
The null hypothesis is now more standard:
\[ \alpha_1 = 0 \]
The issue is that the above still assumes that \(e_t\) is iid. If we have a unit root, that is very unlikely to begin with. The Augmented Dickey-Fuller test addresses this by controlling for persistence in our error term up to a specified number of lags:
\[ (y_t - y_{t-1}) = \Delta y_{t} = \alpha_1 y_{t-1} + \sum_{j=1}^{p} \phi_{j} \Delta y_{t-j} + u_t \]
We can also control for a trend or drift by adding a time trend or a constant, respectively. A constant controls for the mean of the data - the mean we are testing reversion to need not be zero. A time trend allows for a deterministic trend in the data.
Use the urca
package to test for stationarity by performing Augmented Dickey-Fuller (ADF) tests with ur.df()
.
# load the urca package
p_load(urca)
# ur.df() requires a vector/array
# you should recognise pull() from before
test_vector <- ireland_df %>%
pull(ln_energy_pc)
my_adf1 <- ur.df(
y = test_vector, # vector
type = "trend", # type of ADF - trend + constant
lags = 5, # max number of lags
selectlags = "AIC"
) # lag selection criteria
# use summary() to present the saved ADF object
# summary() wraps many different kinds of objects
summary(my_adf1)
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.087796 -0.019856 0.009542 0.029923 0.045227
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3937065 0.2278056 1.728 0.0950 .
z.lag.1 -0.0797608 0.0498511 -1.600 0.1208
tt 0.0001319 0.0010831 0.122 0.9039
z.diff.lag1 0.1723326 0.1771983 0.973 0.3391
z.diff.lag2 0.3115141 0.1794628 1.736 0.0936 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03842 on 28 degrees of freedom
Multiple R-squared: 0.296, Adjusted R-squared: 0.1954
F-statistic: 2.943 on 4 and 28 DF, p-value: 0.03778
Value of test-statistic is: -1.6 1.6437 2.1039
Critical values for test statistics:
1pct 5pct 10pct
tau3 -4.15 -3.50 -3.18
phi2 7.02 5.13 4.31
phi3 9.31 6.73 5.61
Is the null hypothesis rejected? How about an ADF test that specifies only a constant/drift?
my_adf2 <- ur.df(
y = test_vector,
type = "drift", # type of ADF - with drift
lags = 5,
selectlags = "AIC"
)
summary(my_adf2)
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression drift
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.088270 -0.019198 0.009114 0.030175 0.044932
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.37689 0.17811 2.116 0.0430 *
z.lag.1 -0.07569 0.03633 -2.083 0.0461 *
z.diff.lag1 0.16583 0.16608 0.999 0.3263
z.diff.lag2 0.30364 0.16454 1.845 0.0752 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03776 on 29 degrees of freedom
Multiple R-squared: 0.2956, Adjusted R-squared: 0.2228
F-statistic: 4.057 on 3 and 29 DF, p-value: 0.01595
Value of test-statistic is: -2.0834 2.5446
Critical values for test statistics:
1pct 5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1 7.06 4.86 3.94
Does this mean our series should be detrended or not?
(Not covered in class)
We can compare those results to results from the already-detrended series, as below. Note that the results are only valid for the detrended series - rejecting a unit root in the detrended data does not tell us whether the original data was stationary.
For the linearly detrended series:
# Results for linearly detrended series:
detrend_lin <- detrended_example %>%
pull(detrended_lin)
adf_lin <- ur.df(
y = detrend_lin,
type = "drift", # type of ADF - with drift
lags = 2
)
summary(adf_lin)
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression drift
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.08750 -0.02700 0.01137 0.02868 0.04876
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0007899 0.0063372 0.125 0.9016
z.lag.1 -0.0854229 0.0480295 -1.779 0.0848 .
z.diff.lag1 0.2461152 0.1606244 1.532 0.1353
z.diff.lag2 0.3893391 0.1616341 2.409 0.0219 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03768 on 32 degrees of freedom
Multiple R-squared: 0.2708, Adjusted R-squared: 0.2025
F-statistic: 3.962 on 3 and 32 DF, p-value: 0.01649
Value of test-statistic is: -1.7786 1.5817
Critical values for test statistics:
1pct 5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1 7.06 4.86 3.94
For the quadratically detrended series:
# Results for the quadratically detrended series
detrend_quad <- detrended_example %>%
pull(detrended_quad)
adf_quad <- ur.df(
y = detrend_quad,
type = "drift", # type of ADF - with drift
lags = 2
)
summary(adf_quad)
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression drift
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.076880 -0.025863 0.009698 0.024754 0.050565
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005764 0.0059899 0.096 0.9239
z.lag.1 -0.1856822 0.0723037 -2.568 0.0151 *
z.diff.lag1 0.2228600 0.1525139 1.461 0.1537
z.diff.lag2 0.3813441 0.1466459 2.600 0.0140 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03539 on 32 degrees of freedom
Multiple R-squared: 0.3287, Adjusted R-squared: 0.2657
F-statistic: 5.222 on 3 and 32 DF, p-value: 0.004763
Value of test-statistic is: -2.5681 3.3616
Critical values for test statistics:
1pct 5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1 7.06 4.86 3.94
Using the HP filter to detrend seems to finally get us data where we can reject the null of a unit root. Note that we set the frequency to 6.25 (meaning \(\lambda=6.25\)), which is consistent with annual data.
p_load("mFilter")
library(mFilter)
# Apply the HP filter to test_vector
hpfiltered <- hpfilter(test_vector, freq = 6.25)
# Extract the cyclical component (non-trend)
hp_cycle <- hpfiltered$cycle
# Apply the ADF test to the cyclical component
adf_hp <- ur.df(
y = hp_cycle,
type = "drift", # ADF with drift (no trend, focus on stationarity of cycle - note in these the mean is zero)
lags = 2,
)
# Summary of the ADF test
summary(adf_hp)
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression drift
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.032757 -0.014573 -0.001604 0.012624 0.041336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001153 0.003380 -0.341 0.7353
z.lag.1 -1.422755 0.265403 -5.361 6.95e-06 ***
z.diff.lag1 0.510783 0.210331 2.428 0.0210 *
z.diff.lag2 0.411025 0.161014 2.553 0.0157 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02024 on 32 degrees of freedom
Multiple R-squared: 0.5421, Adjusted R-squared: 0.4992
F-statistic: 12.63 on 3 and 32 DF, p-value: 1.299e-05
Value of test-statistic is: -5.3607 14.3688
Critical values for test statistics:
1pct 5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1 7.06 4.86 3.94
The fixest
package aids the estimation of various kinds of regression models. Let’s run a few simple OLS regressions to illustrate. Our example tries to answer the question: what is the relationship between education and wages in the CPS data for 1974? We will also control for age and marital status.
Instead of sourcing .csv
files locally, we can download such files directly from the internet. As before, we can use read.csv()
to create a data frame called cs_df
.
# Replace a local file path with a web address
# Subset the data to only those observations in 1974
# To restrict memory usage, select only the relevant columns
cs_df <- read.csv("https://raw.githubusercontent.com/stata2r/stata2r.github.io/main/data/cps_long.csv") %>%
filter(year == 1974) %>%
select(wage, educ, age, marr)
Before we run the regressions, we should probably get a better picture of the data we are dealing with. Take note of the dimensions of the sample and its variables’ types and distributions.
# Get an overview of the sample
# Do you notice any issues?
skimr::skim(cs_df)
Name | cs_df |
Number of rows | 15992 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
wage | 0 | 1 | 14016.80 | 9569.80 | 0 | 4403.45 | 15123.58 | 23584.18 | 25862.32 | ▆▂▃▃▇ |
educ | 0 | 1 | 12.03 | 2.87 | 0 | 11.00 | 12.00 | 13.00 | 18.00 | ▁▁▂▇▂ |
age | 0 | 1 | 33.23 | 11.05 | 16 | 24.00 | 31.00 | 42.00 | 55.00 | ▆▇▆▅▅ |
marr | 0 | 1 | 0.71 | 0.45 | 0 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
# Get an impression of wage by marital status
cs_df %>% # call the cross-section dataframe
select(wage, marr) %>% # Select the variables wage and marriage from the dataframe
group_by(marr) %>% # group the selected variables from the dataframe
skimr::skim() # skim the data grouped by marriage, limited to the variables selected from the dataframe
Name | Piped data |
Number of rows | 15992 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | marr |
Variable type: numeric
skim_variable | marr | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
wage | 0 | 0 | 1 | 7681.53 | 8325.02 | 0 | 386.47 | 4325.08 | 12987.49 | 25862.32 | ▇▂▂▁▂ |
wage | 1 | 0 | 1 | 16582.74 | 8818.61 | 0 | 10237.66 | 18677.69 | 25781.50 | 25862.32 | ▃▂▂▃▇ |
(This section only covered if there is time at the end)
Sometimes we don’t want all of these details, especially not for publication.
Below, I provide a table where the mean, standard deviation, and observation count for the Unmarried and Married groups are given separately.
stat_table <- cs_df %>% # Calling the tidyverse
group_by(marr) %>% # Group by marital status
summarise( # Summarise using specific values
Observations = n(), # observations
Wage_mean = mean(wage), # Mean of a variable
Wage_sd = sd(wage), # Standard deviation of a variable
Education_mean = mean(educ),
Education_sd = sd(educ),
)
stat_table
marr | Observations | Wage_mean | Wage_sd | Education_mean | Education_sd |
---|---|---|---|---|---|
0 | 4610 | 7.68e+03 | 8.33e+03 | 12 | 2.67 |
1 | 11382 | 1.66e+04 | 8.82e+03 | 12 | 2.95 |
We can clean up this table fairly easily, as below.
flipped_table <- cs_df %>% # Calling the tidyverse
group_by(marr) %>% # Group by marital status
summarise( # Summarise using specific values
Observations = n(), # observations
Wage = mean(wage), # Mean of a variable
` ` = sd(wage), # Note the standard deviation is not given a name here,
# rather a unique number of spaces for each variable.
# I use this approach when looping over longer structures
Education = mean(educ),
`  ` = sd(educ), # two spaces this time, keeping the blank name unique
) %>%
pivot_longer( # Pivot the table above
cols = -marr, # All columns except 'marr'
names_to = "Statistic", # Put the names that are not the group variable to statistics
values_to = "Value" # Call their values "Value" - which we replace below/
) %>%
mutate(
Value = case_when(
Statistic == "Observations" ~ as.character(as.integer(Value)), # Ensure integer format
Statistic == " " ~ paste0("(", formatC(as.numeric(Value), format = "f", big.mark = ",", digits = 3), ")"),
# add parentheses around the standard deviations
Statistic == " " ~ paste0("(", formatC(as.numeric(Value), format = "f", big.mark = ",", digits = 3), ")"),
TRUE ~ formatC(as.numeric(Value), format = "f", big.mark = ",", digits = 3) # Format the "Value" with commas
)
) %>%
pivot_wider( # pivot the table again but now using married
names_from = marr, # Set 'marr' levels as column headers
values_from = Value # get the values from Value
)
colnames(flipped_table) <- c("Statistic", "Unmarried", "Married") # Give the columns names
flipped_table
Statistic | Unmarried | Married |
---|---|---|
Observations | 4610 | 11382 |
Wage | 7,681.535 | 16,582.744 |
(8,325.019) | (8,818.607) | |
Education | 12.035 | 12.024 |
(2.671) | (2.948) |
Which we can turn into a huxtable:
# Convert to huxtable
flipped_hux <- as_hux(flipped_table) %>%
add_footnote("Source: Author's calculations based on CPS in 1974. \n \n Each variable's mean is provided by group with standard deviation in parentheses. Missing ") %>%
set_caption("Descriptive Statistics by Marital Status") %>%
set_bold(1, everywhere) %>%
set_bottom_border(1, everywhere, 1) %>%
set_bottom_border(2, everywhere, 1) %>%
set_bold(1, everywhere) %>%
set_align(everywhere, everywhere, "center") %>%
set_width(1)
# We can print the huxtable:
flipped_hux
Statistic | Unmarried | Married |
---|---|---|
Observations | 4610 | 11382 |
Wage | 7,681.535 | 16,582.744 |
(8,325.019) | (8,818.607) | |
Education | 12.035 | 12.024 |
(2.671) | (2.948) | |
Source: Author's calculations based on CPS in 1974. Each variable's mean is provided by group with standard deviation in parentheses. Missing |
# The above can be transformed into a latex table
cat(to_latex(flipped_hux), file = "./tables/example_table.tex")
We can perform standard OLS regressions with fixest
’s feols()
function. This function requires two arguments, formula
(or fml
) and data
. Formulas are given in the format y ~ x1 + x2
, and data
refers to a data frame.
# Since we are regressing on wages, we should probably log first
# (there are some zero wages, which log() turns into -Inf):
cs_df <- cs_df %>% mutate(lwage = log(wage))
# Our first model
model1 <- feols(fml = lwage ~ educ, data = cs_df)
# Adding an explanatory continuous variable: age
model2 <- feols(lwage ~ educ + age, cs_df)
# Adding a categorical variable
model3 <- feols(lwage ~ educ + age + factor(marr), cs_df)
# As before, use summary() to display the results of model1
summary(model1)
OLS estimation, Dep. Var.: lwage
Observations: 14,079
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.055392 0.039695 228.12249 < 2.2e-16 ***
educ 0.024358 0.003205 7.59909 3.17e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.08271 Adj. R2: 0.004015
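Note that the models use 14,079 observations while the sample has 15,992: log(0) is -Inf, and feols() drops non-finite observations. A quick check in base R:

# Zero wages become -Inf under the log transform and are dropped by feols()
sum(cs_df$wage == 0)
sum(!is.finite(cs_df$lwage))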
Consider using the huxreg()
function from the huxtable
package. It enables the neat presentation of multiple regression models in a single table. Let’s consider the standard version first.
huxreg(model1, model2, model3)
(1) | (2) | (3) | |
---|---|---|---|
(Intercept) | 9.055 *** | 7.200 *** | 7.096 *** |
(0.040) | (0.048) | (0.046) | |
educ | 0.024 *** | 0.054 *** | 0.051 *** |
(0.003) | (0.003) | (0.003) | |
age | 0.044 *** | 0.033 *** | |
(0.001) | (0.001) | ||
factor(marr)1 | 0.726 *** | ||
(0.019) | |||
N | 14079 | 14079 | 14079 |
R2 | 0.004 | 0.190 | 0.263 |
logLik | -21096.007 | -19640.118 | -18976.952 |
AIC | 42196.014 | 39286.237 | 37961.903 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
However, you may want to improve the look of your regression table with some aesthetic adjustments, as we did for Table 5. Please note all of the optional arguments to the huxreg()
function itself.
huxreg(
"Model 1" = model1, "Model 2" = model2, "Model 3" = model3,
statistics = c("N" = "nobs", "R-squared" = "r.squared"),
stars = c(`*` = 0.1, `**` = 0.05, `***` = 0.01, `****` = 0.001),
number_format = 2,
coefs = c(
"Education" = "educ",
"Age" = "age",
"Married" = "factor(marr)1"
)
) %>%
set_font_size(8) %>%
set_caption("Regression of Log Wages on Education, Age, and Marital Status")
Model 1 | Model 2 | Model 3 | |
---|---|---|---|
Education | 0.02 **** | 0.05 **** | 0.05 **** |
(0.00) | (0.00) | (0.00) | |
Age | 0.04 **** | 0.03 **** | |
(0.00) | (0.00) | ||
Married | 0.73 **** | ||
(0.02) | |||
N | 14079 | 14079 | 14079 |
R-squared | 0.00 | 0.19 | 0.26 |
**** p < 0.001; *** p < 0.01; ** p < 0.05; * p < 0.1. |
fixest
also contains various plotting functions based on your regressions. For example, consider the following coefficient plot which graphs the coefficients for each model:
# Notice that models need to be entered as a list() object
# Generate the coefficient plot with explicit inclusion of desired coefficients
coefplot(list(model1, model2, model3), drop = "Constant")
You may also need to run regressions using interaction effects. Consider the following example where the effect of education is moderated by marital status:
# Same as before, but "*" denotes an interaction
model4 <- feols(lwage ~ educ * factor(marr), data = cs_df)
# What do our results say?
summary(model4)
OLS estimation, Dep. Var.: lwage
Observations: 14,079
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.549777 0.078292 96.43044 < 2.2e-16 ***
educ 0.084907 0.006253 13.57775 < 2.2e-16 ***
factor(marr)1 1.873869 0.088143 21.25941 < 2.2e-16 ***
educ:factor(marr)1 -0.069062 0.007062 -9.77877 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.980798 Adj. R2: 0.182566
# maybe this is easier:
huxreg(model4)
(1) | |
---|---|
(Intercept) | 7.550 *** |
(0.078) | |
educ | 0.085 *** |
(0.006) | |
factor(marr)1 | 1.874 *** |
(0.088) | |
educ:factor(marr)1 | -0.069 *** |
(0.007) | |
N | 14079 |
R2 | 0.183 |
logLik | -19704.266 |
AIC | 39416.532 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
We can visualise the differential effects of education on wages by marital status using ggplot
and a layer called geom_smooth
. The latter can represent regression lines for married and unmarried respondents.
cs_df %>%
ggplot(aes(x = educ, y = lwage)) +
theme_bw() +
geom_point(alpha = 0.5) + # creates a scatterplot
geom_smooth(
formula = y ~ x, # x, y inherited from aes()
method = "lm", # specifies linear model
aes(color = factor(marr)), # creates two regression lines
se = TRUE, # display confidence interval,
level = 0.95
) + # confidence level to 95%
theme(legend.position = "bottom") +
labs(
y = "Wage", x = "Years of Education", color = "Married",
title = "The effect of education on wage by marital status"
)
That was the cursory introduction to R
and RStudio, with a focus on some basic operations, data visualisations, and a little bit of econometrics. I hope it was useful! You can consult the Acknowledgements and Further Reading sections for additional resources. Thank you for your attention and please feel free to reach out if you encounter any issues.
If you are planning on applying version control to your new project, it is useful to check Create git repository
.↩︎
Packages are extensions to the R
language. They contain code, data, and documentation in a standardised collection format that can be installed by users, typically via a centralised software repository such as CRAN (The Comprehensive R Archive Network). CRAN is a network of ftp (file transfer protocol) and web servers around the world that store identical, up-to-date, versions of code and documentation for R.↩︎