Introduction to R: Tutorial

This post serves as the primary tutorial notes for the “Introduction to R” sessions of the Intensive Statistics course for MCom (Economics) students at Stellenbosch University (2025). This lecture is intended to offer a cursory introduction to enable students to perform basic operations in R and RStudio.

Why R?

  1. R is open-source.
  2. R uses packages.
  3. R offers predictive code completion (Ctrl/Cmd + Space is very useful).
  4. R is compatible with Markdown.

Setup

Before you start, please proceed with the following steps, and download and install the necessary software on your machine:

  1. R (The programming language you will be using)

“R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.”

Select the appropriate link corresponding to your machine’s operating system and follow the subsequent instructions:

Execute the newly downloaded R-4.4.2-win.exe file and follow the instructions. If you encounter any difficulties, please ask for help!

  2. RStudio Desktop (The program you will be using to interface with the R code)

“RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.”

Once you have completed downloading and installing R, do the same for RStudio. Execute the newly downloaded RStudio-2024.12.0-467.exe file and follow the instructions that follow.

  3. After completing Steps 1 and 2, open up RStudio. It should look something like this:
This is the default RStudio Desktop workspace. The Console is on the left, the Environment pane is top-right and the Files pane is bottom-right. You can change the layout of your workspace in the taskbar with View > Panes > Pane Layout. You can change the aesthetics of your workspace in the taskbar with Tools > Global Options > Appearance.
  4. File > New Project > New directory > New project > Choose directory location and name¹

It is advisable to work from an R project, as this keeps track of your most current workspace and variable environment. Its location will serve as your “root folder” or main directory for subsequent operations in the project.

To work from a particular project, open it with File > Open Project, or start your session in RStudio by clicking the relevant .Rproj file.

For this tutorial: please create a folder, name it whatever you like, create a subfolder called tutorial inside it, and save your R script there. We will list files in directories later.

To ensure that you are working from the correct project, check whether your project’s name appears in the top right-hand corner.

Once operating from a new project, you should be faced with the default workspace, as illustrated in Step 3 above. The pane on the left is called the Console. As in Stata’s Command window, code can be typed and executed directly in your console.

Try: Enter the code getwd() in your console to observe the location of your working directory (the location of your project).
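Something like the following should appear (the path is purely illustrative and will differ on your machine):

getwd()
[1] "C:/Users/yourname/Documents/my_project"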

Alternatively, this code can be stored in and executed from a script, or .R file, much like do-files in Stata.

  5. Open a new script in the taskbar with File > New File > R Script or with Ctrl/Cmd + Shift + N. Save the .R file in your working directory with File > Save or Ctrl/Cmd + S.

After creating and saving a new script, you should observe a workspace layout like the one summarised in Table 1.

Table 1: Workspace Layout
Location       Function
Top left       Script/RMarkdown
Top right      Global environment
Bottom left    Console
Bottom right   Files/Plots/Packages/Help

Basics

With the software installed and operating, and your new project and script created, you are ready to get started with R and RStudio. Proceed by copying the code from this post, or by working from the tutorial’s corresponding master script available here.

NB: Highlighted code in the console or script can be executed using Ctrl/Cmd + Enter. When working from a script, output will typically appear in the console.

Packages

We rely on packages to access pre-written functions performing specific tasks.² Packages can be downloaded and installed from CRAN, the official repository of packages, or GitHub, which often acts as a repository for third-party libraries or beta versions.

Installing, loading and maintaining packages can be tedious. Packages need only be installed with install.packages("x") once, after which only library(x) is necessary.

# First install the package from CRAN
install.packages("pacman")

# Load the installed package into your workspace
library(pacman)

If a package x has already been installed, running install.packages("x") again is unnecessary and may produce an error.

It can be difficult to keep track of all the packages on your machine. The pacman package allows us to easily install and load packages from CRAN with p_load(), and from GitHub with p_load_gh(). This will install and load a new package, or merely load previously installed ones. Let’s install the packages we will be using in this tutorial.

# From CRAN
p_load(fixest, tidyverse, huxtable, modelsummary, glue, skimr, labelled, devtools, MetBrewer) # or
pacman::p_load(fixest, tidyverse, huxtable, modelsummary, glue, skimr, labelled, devtools, MetBrewer)

NB: Functions are called using the function’s name, e.g., p_load(), or its full name, e.g., pacman::p_load(). The latter is useful when functions from different packages share the same name.
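A minimal sketch of such a clash: once the tidyverse is loaded, dplyr’s filter() masks stats::filter(), so the prefix makes your intent explicit (the vector v below is just a throwaway example):

# stats::filter() computes moving averages; dplyr::filter() subsets rows
v <- c(1, 2, 3, 4)
stats::filter(v, rep(1/2, 2)) # two-period moving average from base stats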

Directories can be viewed in the bottom-right pane, in addition to plot outputs, currently loaded packages, and help files. Should you ever require help or additional information regarding a specific command, add a ? before that command and run the code, for example:

?pacman::p_load()

Environment

R is an object-oriented language. Objects of various classes (scalars, matrices, data frames, vectors, etc.) can be stored in memory for later use. Once named and saved, these objects will appear in your global environment.

We use the assignment operators <- or = to name and save objects.

Shortcut: Alt/Option + Minus to get <-

# object name <- (or =) value(s)
a <- 10
hello <- "Hello world!"
test <- TRUE

# Determine the class of an object
class(a)
[1] "numeric"
class(hello)
[1] "character"
class(test)
[1] "logical"

By executing the relevant code, objects should appear in your global environment like this:

Report these variables in your output by running the following:

a
[1] 10
# or
print(hello)
[1] "Hello world!"
# or by using the glue package for something more fancy
glue::glue("It's {test}. I saved a variable which contains {hello} and I stored the number {a}.")
It's TRUE. I saved a variable which contains Hello world! and I stored the number 10.

Arrays

An array object is equivalent to a vector of values of the same class. Arrays can be created by concatenating values using the function c().

x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
z <- c(7, 8, 9, 10)

# Useful functions to perform on arrays/vectors
sum(x)
[1] 10
min(x)
[1] 1
median(x)
[1] 2.5
# summary() provides a summary of the functions above
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.75    2.50    2.50    3.25    4.00 
# Missing values denoted by NA
x_with_missing <- c(1, 2, 3, NA)

# Take care to properly treat missing values:
sum(x_with_missing)
[1] NA
sum(x_with_missing, na.rm = TRUE)
[1] 6

Data frames

Data frames, consisting of rows and columns, are the workhorse of statistical analysis in R. They can be created in various ways. Rows and columns can also be named. As always: try to keep your names meaningful and functional - here the data frame’s name comes from the fact that it is the first data frame.

# data.frame() can create columns from arrays and assign column names
  df_1 <- data.frame(A = x, B = y, C = z)

# Some useful operations
  # as expected, the below provides the column names
  colnames(df_1)
[1] "A" "B" "C"
  # if you want to create a dataset that looks the same 
  # as the previous one but with a new name (and therefore more memory):
  df_1_copy <- df_1
  
  # If you want to change the names of your variables
  colnames(df_1_copy) <- c("col1", "col2", "col3")
  colnames(df_1_copy)
[1] "col1" "col2" "col3"
  # To count the number of rows and columns
    # (sometimes you'll need this if you do a manual transformation)
  nrow(df_1)
[1] 4
  ncol(df_1)
[1] 3

Specific rows, columns, and cells can be referenced as follows:

# Return column "A" as a vector
df_1$A
[1] 1 2 3 4
# df_1[row no., column no.] - empty implies all
df_1[, 1]
[1] 1 2 3 4
# Using tidyverse's pipe operator %>%
  # the below pulls all values from column A
  # it does not replace the values of df_1 with column A
df_1 %>% pull(A)
[1] 1 2 3 4
  # If we wanted to store that column as an array, we could write:
col_A <- df_1 %>% pull(A)

Shortcut: Ctrl/Cmd + Shift + M to get %>%

There are numerous ways to subset a data frame.

# Similarly with rows
# Return row 2 as a single row data frame
df_1[2, ]

# Return row 2-3 as a two row data frame
df_1[2:3, ]

# Return cell in row 2 column 1
df_1[2, 1]

Let’s create a new column named “D”:

# Create a new column "D" that is the sum of A and B
df_1$D <- df_1$A + df_1$B

# is the same as
df_1 <- df_1 %>% mutate(D = A + B)

We can also create a subset of the data based on a condition.

  # you can permanently remove missing data from a 
  # data frame using the na.omit code below 
  df_missing <- data.frame(A = x_with_missing, B = y, C = z) 
  
  df_missing_clean <- na.omit(df_missing)
  # count the rows and columns, you will see there is one less row now
  nrow(df_missing_clean)  
[1] 3
  ncol(df_missing_clean)  
[1] 3
  # You can also remove rows or columns that meet specific conditions
  # if we want to drop all rows where the value in column C is 9 or above:
  df_missing_9 <- df_missing[df_missing$C < 9, ]

  # count the rows and columns, you will see there is one less row now
  nrow(df_missing_9)  
[1] 2
  ncol(df_missing_9)  
[1] 3
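For reference, the same subsetting can be done in tidyverse style; a sketch using the df_missing data frame from above (drop_na() comes from tidyr and filter() from dplyr, both loaded with the tidyverse):

  # tidyverse equivalents of the base-R subsetting above
  df_missing_clean_tv <- df_missing %>% drop_na()
  df_missing_9_tv <- df_missing %>% filter(C < 9)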

Managing Data Files: Directories and Importing

Data is often imported from external files, such as .csv files. Let’s consider an example. If you haven’t done so already, download the files attached in the email I sent. Alternatively, you can use this link. Extract the files in the .zip folder and copy the data folder into your working directory. We will generally import data from either the internet or a local drive; where possible, try not to repeatedly call data from the internet.

Directories

Only run this code if you are sure about your folder structure. If you struggle with this code, I’ll help you at the end.

To make sure your data loads, you need to confirm that the files exist and are called from the correct location. In the code below, we look for a folder called data in our directory.

Note the location prefixes in the code below:

  - The ‘..’ prefix means that the code will go to the parent directory of the current working directory, and then find the file specified.
  - The ‘.’ prefix means that the data is in a subfolder of the current working directory.

# DIRECTORIES 
  # It is often useful to confirm your working directory first
  getwd()
[1] "G:/Timeseries_2025/r_introduction/tutorial"
  # But often, it helps to confirm all folders in the directory 
    # recursive = FALSE stops subfolders from being searched recursively
    # full.names = FALSE makes sure we only get the folder name and not the full path
  folders <- list.dirs(path = ".", full.names = FALSE, recursive = FALSE)
  print(folders)
character(0)
  # If we do not have the folder in the current working directory, 
  # check the parent directory 
  parent_folders <- list.dirs(path = "..", full.names = FALSE, recursive = FALSE)
  print(parent_folders)
[1] ".Rproj.user" "assignment"  "data"        "images"     
[5] "scripts"     "tutorial"   

We can also use the structure above to create a list with all files in a specific directory - this is often useful for messily-scraped data. Make sure you confirm that the directory references match your folder structure.

# You can list the files in a specific directory as below
  # Here I use the full names as we might want to use the names in the list later
  list_files <- list.files(path = "../data/", recursive = FALSE, full.names = TRUE)
print(list_files)
[1] "../data/cs_data.csv"            "../data/cs_df.csv"             
[3] "../data/Ireland_energy.csv"     "../data/Ireland_population.csv"
# You can also specify that you only want a specific type of file, for example CSVs
  list_files_csv <- list.files(path = "../data/", pattern = "\\.csv$", recursive = FALSE, full.names = TRUE)

  print(list_files_csv)
[1] "../data/cs_data.csv"            "../data/cs_df.csv"             
[3] "../data/Ireland_energy.csv"     "../data/Ireland_population.csv"
# You can also specify that you only want files with specific text in them 
  # in this example, we only look for "ireland"
  list_files_irel <- list.files(path = "../data/", pattern = "ireland", recursive = FALSE, full.names = TRUE)
  print(list_files_irel)
character(0)
  # It returns nothing! This is because the pattern is case-sensitive
  # and our file names start with a capital "I", so we tell R to ignore case
  list_files_ire <- list.files(path = "../data/", pattern = "ireland", recursive = FALSE, full.names = TRUE, ignore.case = TRUE)
  print(list_files_ire)
[1] "../data/Ireland_energy.csv"     "../data/Ireland_population.csv"

Sometimes we also want to create directories to store output in:

# Specify the name of the directory
  dir_name <- "./figures"

# Create the directory
  # Confirm that directory exists, else create it

  if (!dir.exists(dir_name)) {        # this is an if statement: if the directory does not exist,
                                      # create the directory called [dir_name] as defined above
    dir.create(dir_name)
    cat("Directory created:", dir_name, "\n")
  } else {
    cat("Directory already exists:", dir_name, "\n")
  }
Directory created: ./figures 

Alternatively, we know we will always create output in a certain format - and we want to keep it in certain folders:

# List of folder names 
  # - this helps if you know you'll always have the same structures
  folders <- c("figures", "tables")

# Loop over the list of folder names 
  for (folder in folders) {           # this is a standard for loop; it repeats the command for every entry in folders 
                                      # we call the current entry folder:
                                      # in the first iteration of the loop [folder] takes the value [figures]
                                      # in the second iteration [folder] takes the value [tables] 
                                      # there is no third iteration as folders <- c("figures", "tables") is only two entries long
    # Construct the directory path
    dir_name <- paste0("./", folder)
    
    # Check if the directory exists
    if (!dir.exists(dir_name)) {     
      # If it doesn't, create the directory
      dir.create(dir_name)
      cat("Directory created:", dir_name, "\n")   # this prints that the directory is created, the cat command is to concatenate the text.
    } else {
      cat("Directory already exists:", dir_name, "\n")  #
    }
  }
Directory already exists: ./figures 
Directory created: ./tables 

Reading data to Data Frames

Now, read the data from the Ireland_energy.csv file in the data folder using read.csv(), and create a data frame called ire_energy with an assignment operator. The data represents Ireland’s energy consumption data for 1980-2018.

ire_energy <- read.csv(file = "../data/Ireland_energy.csv", 
                       header = TRUE)
# The "file" argument refers to the relative file path from your root directory
# The "header" argument is set to true because the .csv file contains column headings

Do the same for a corresponding file containing data on Ireland’s population from Ireland_population.csv, and create a data frame called ire_pop.

ire_pop <- read.csv("../data/Ireland_population.csv")
# Sometimes it's unnecessary to spell out the arguments
  # here, we don't specify that headers are true 

Tip: When calling your function, access the tooltip with Ctrl/Cmd + Space. Note the potential arguments for the particular function and some arguments’ default settings, e.g., header = TRUE.

We now have two data frames. Take a peek at the first 5 observations in ire_energy and the last 5 observations in ire_pop with the following commands:

head(ire_energy, 5) # same as ire_energy[1:5, ]
tail(ire_pop, 5) # same as ire_pop[(nrow(ire_pop) - 4): nrow(ire_pop), ]

# skim() can provide useful overviews of data frames
skimr::skim(ire_pop)

Manipulating Data

Merging Data

We can merge the two data frames on the basis of some common value, e.g. Year, to create a single one called ireland_df.

# We use the merge command as below where x and y 
# are the names of our imported dataframes
ireland_df <- merge(x = ire_energy, y = ire_pop, by.x = "Year", by.y = "Year")
# merge() merges data frames x and y on the basis of some column
# by.x for x's column and by.y for y's column
# Let's confirm our code did what we say it did:
skimr::skim(ireland_df)
Table 2: Data summary
Name ireland_df
Number of rows 39
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 1999 11.4 1980 1989.5 1999 2008.5 2018 ▇▇▇▇▇
GJ 0 1 522828057 147396263.1 311769904 367775214.1 586079129 650282165.0 727737237 ▇▂▁▅▇
Population 0 1 3976410 491594.5 3417800 3538152.5 3760480 4518187.5 4876547 ▇▂▂▂▃

The code above can be simplified if the variable exists with the same name in both datasets:

# We can also simplify the command if the same variable exists in both datasets:
ireland_df_a <- merge(x = ire_energy, y = ire_pop, by = c("Year"))
skimr::skim(ireland_df_a)
Table 3: Data summary
Name ireland_df_a
Number of rows 39
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 1999 11.4 1980 1989.5 1999 2008.5 2018 ▇▇▇▇▇
GJ 0 1 522828057 147396263.1 311769904 367775214.1 586079129 650282165.0 727737237 ▇▂▁▅▇
Population 0 1 3976410 491594.5 3417800 3538152.5 3760480 4518187.5 4876547 ▇▂▂▂▃

If we want to drop dataframes we use the code below - this is especially important if you are working on a shared server in a secure facility.

# Sometimes we want to drop dataframes: 
rm(ireland_df_a)

In the first example above, we showed the general procedure, where the names of the merging variables do not have to be the same in both datasets - below is an example where they actually differ:

# The names of the variables to merge on do not have to be the same
  # Let's create a dataframe where we change the name of the previous one:
  irepop_difname <- ire_pop %>% 
                      rename("data_year" = "Year")
    
  # We can still merge the data, but now we have to change the by.y variable
  ireland_df_a <- merge(x = ire_energy, y = irepop_difname, by.x = "Year", by.y = "data_year")
  # As you can see both are the same
  skimr::skim(ireland_df)
Table 4: Data summary
Name ireland_df
Number of rows 39
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 1999 11.4 1980 1989.5 1999 2008.5 2018 ▇▇▇▇▇
GJ 0 1 522828057 147396263.1 311769904 367775214.1 586079129 650282165.0 727737237 ▇▂▁▅▇
Population 0 1 3976410 491594.5 3417800 3538152.5 3760480 4518187.5 4876547 ▇▂▂▂▃
  skimr::skim(ireland_df_a)
Table 4: Data summary
Name ireland_df_a
Number of rows 39
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 1999 11.4 1980 1989.5 1999 2008.5 2018 ▇▇▇▇▇
GJ 0 1 522828057 147396263.1 311769904 367775214.1 586079129 650282165.0 727737237 ▇▂▁▅▇
Population 0 1 3976410 491594.5 3417800 3538152.5 3760480 4518187.5 4876547 ▇▂▂▂▃
  # Dropping the example dataframe
  rm(ireland_df_a)
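For completeness: dplyr’s join verbs do the same job as merge(). A sketch using left_join() (the rm() call just cleans up the example object):

  # left_join() keeps all rows of the left data frame
  ireland_df_join <- ire_energy %>% 
    left_join(ire_pop, by = "Year")
  # with differing key names: by = c("Year" = "data_year")
  rm(ireland_df_join)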

Transformation

Thereafter, create and save a new column called ln_energy_pc, representing the natural logarithm of Ireland’s per capita energy consumption.

ireland_df <- ireland_df %>%
  mutate(ln_energy_pc = log(GJ / Population))
# You should recognise mutate() from before
# log()'s default setting implies natural logarithmic transformation

# Instead of tidyverse piping, you could have done this:
ireland_df <- mutate(.data = ireland_df, ln_energy_pc = log(GJ / Population))

# But piping is more useful when you require multiple consecutive operations
# For example, everything we've done thus far could've been condensed
ireland_df <- read.csv("../data/Ireland_energy.csv") %>%
  merge(
    x = ., # full stop represents the result of all previous operations
    y = read.csv("../data/Ireland_population.csv"),
    by.x = "Year",
    by.y = "Year"
  ) %>%
  mutate(ln_energy_pc = log(GJ / Population))

Data formats/class

It is sensible to ensure that a data frame’s variables are in the appropriate format or class. For example, ireland_df’s Year column is of the class integer.

ireland_df$Year %>%
  class(.)
[1] "integer"

Many useful functions require that your vectors/columns be of class Date. Transform the Year column from an integer to a date.

# Transform the `Year` column from an integer to a date.
  # note that we protect the value of the variable Year with {}
  # if we didn't, glue would render every value as the literal text [Year-01-01]
ireland_df <- ireland_df %>%
  mutate(Year = glue::glue("{Year}-01-01"))
  # but the result is of class `character`: 
  # sometimes the base script is faster than piping for quick commands:
class(ireland_df$Year)
[1] "glue"      "character"
# We want to transform the year variable to a date
  # The as.Date() below renders characters of a given format into dates
  # We set that format as: "%Y-%m-%d" means yyyy-mm-dd
ireland_df <- ireland_df %>%
  mutate(Year = as.Date(x = Year, format = "%Y-%m-%d"))
  # Finally it is in date format
ireland_df$Year %>%
  class(.)
[1] "Date"
# Confirm that data is chronological
ireland_df <- ireland_df %>%
  arrange(Year) # this is equivalent to sorting by Year

Let’s see what our new data frame ireland_df looks like. Run View(ireland_df) or click the appropriate icon in your Global Environment to view the data frame in your workspace.

When reporting your data frame in a .Rmd or Markdown document, you can create a table using the huxtable package.

ireland_df %>%
  filter(Year > as.Date("2013-01-01")) %>%
  # subsets data for entries after 2013

  as_hux() %>%
  # or huxtable::as_hux() to transform data frame into huxtable object
  # hereafter code to define certain aesthetic qualities of our table

  theme_basic() %>%
  # use a theme to make tables more presentable, e.g. theme_article() or theme_compact()

  set_number_format(col = c(2, 4), value = 2) %>%
  # set number of decimals to 2 in the 2nd and 4th column

  set_font_size(10) %>%
  set_caption("Ireland's energy consumption after 2013")
Table 5: Ireland's energy consumption after 2013

Year          GJ             Population   ln_energy_pc
2014-01-01    603311823.70   4662713      4.86
2015-01-01    632644273.00   4708040      4.90
2016-01-01    665078014.40   4762595      4.94
2017-01-01    674283626.40   4813138      4.94
2018-01-01    695195813.70   4876547      4.96

Writing data

We can now write this data frame back to a .csv file. The code below saves it in the data folder in our working directory. (Remember your directory conventions!)

ireland_df %>% # data frame to be written to csv
  write.csv(
    x = ., # ireland_df is piped into "."
    file = "../data/ireland_complete.csv", # file path and file name we choose 
                                           #(note the .. because of the directory structure)
    row.names = FALSE
  ) # because we have no row names
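As an aside, readr (loaded with the tidyverse) provides write_csv(), which omits row names by default; a sketch:

# readr equivalent of the base write.csv() call above
ireland_df %>% write_csv("../data/ireland_complete.csv")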

Time series analysis

Let’s perform some time series analysis with this data, particularly with respect to Ireland’s energy consumption per capita.

Visualising time series

It is always useful to eyeball the data before proceeding with formal analysis. Here are some examples using Base R:

# plot()'s default is a scatterplot
# inputting a single vector
ireland_df$ln_energy_pc %>%
  plot(.)
# We have an undefined x-axis above, so
# inputting a data frame with two columns (x, y)
ireland_df %>%
  select(Year, ln_energy_pc) %>%
  plot(.)
# lines instead of points
ireland_df %>%
  select(Year, ln_energy_pc) %>%
  plot(., type = "l")

If we want to save the figure for use in another text editor, we can use the code below. I will use the folder we created earlier:

# Open a PNG graphics device in the figures folder we created:
png("./figures/ireland_energy_plot.png", width = 800, height = 600)

# Generate the plot and save it
ireland_df %>%
  select(Year, ln_energy_pc) %>%
  plot(., type = "l", 
       main = "Energy Consumption per Capita in Ireland",
       xlab = "Year", 
       ylab = "Log Energy per Capita")

# Close the graphics device
dev.off()
png 
  2 

As an alternative to visualisations with Base R, ggplot is often used. It offers a comprehensive suite of graphs that can be flexibly tweaked to your liking. ggplot is part of the so-called tidyverse. To illustrate, let us make a line graph and consider the sequential adding of layers to our “canvas”.

# ggplot() creates an empty 'canvas'
ireland_df %>%
  ggplot()
# aes() provides the coordinates (or "mapping")
ireland_df %>%
  ggplot(aes(x = Year, y = ln_energy_pc))
# on the canvas, we add layers with pluses (+)
# add one of many existing themes
ireland_df %>%
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw()
# add a scatterplot
# it inherits the previously defined mapping
ireland_df %>%
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point()
# add a line
ireland_df %>%
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point() +
  geom_line()
# add the appropriate labels
# add more breaks to the x-axis and change its labels
# change the range of the y-axis
ireland_df %>%
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point() +
  geom_line() +
  labs(
    title = "Ireland's Primary Energy Consumption",
    y = "ln(GJs Per Capita)",
    x = "Date"
  ) +
  scale_x_date(
    date_labels = "`%y",
    date_breaks = "2 year"
  ) +
  scale_y_continuous(limits = c(4, 5.5))
# and customise as you please (I am not going over this in class)
ireland_df %>%
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point(
    aes(color = ifelse(Year < as.Date("2000-01-01"), "Before 2000",
      ifelse(Year > as.Date("2000-01-01"), "After 2000", "2000")
    )),
    size = 1.5
  ) +
  geom_line(
    alpha = 0.5,
    color = "lightgrey",
    size = 1
  ) +
  labs(
    title = "Ireland's Primary Energy Consumption",
    y = "ln(GJs Per Capita)",
    x = "Date"
  ) +
  scale_x_date(
    date_labels = "`%y",
    date_breaks = "2 year"
  ) +
  scale_y_continuous(limits = c(4, 5.5)) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 10),
    axis.title.y = element_text(
      margin = margin(t = 0, r = 10, b = 0, l = 0),
      size = 10
    ),
    axis.title.x = element_text(
      margin = margin(t = 0, r = 0, b = 0, l = 0),
      size = 10
    ),
    axis.text.x = element_text(angle = 45),
    legend.position = "bottom",
    legend.margin = margin(t = -10, r = 0, b = 0, l = 0),
    legend.title = element_blank()
  ) +
  geom_label(
    data = . %>% filter(Year == as.Date("2000-01-01")),
    aes(label = round(ln_energy_pc, 1)),
    nudge_y = 0.15,
    size = 3,
    color = met.brewer("Austria", type = "discrete")[1]
  ) +
  geom_hline(aes(color = "Mean", yintercept = mean(ln_energy_pc)),
    size = 1,
    linetype = "dashed",
    show.legend = FALSE
  ) +
  scale_color_manual(values = met.brewer("Austria", type = "discrete"))
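If you prefer saving ggplot figures without the png()/dev.off() steps, ggsave() writes the most recently displayed plot to disk (the file name below is just an example):

# Save the last ggplot to the figures folder we created earlier
ggsave("./figures/ireland_energy_ggplot.png", width = 8, height = 6)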

Defining your own colours

Often, you may find yourself unsatisfied with a colour scheme for a variety of reasons. Below, I show how you can define your own colours. I specify colourblind-friendly colours (the hex codes below come from the Okabe-Ito palette; Paul Tol’s palettes are another well-known colourblind-friendly option). (Note that the Brewer schemes also have plenty of colourblind-friendly options.)

friendly_blue <- '#0072B2'
friendly_orange <- '#D55E00'
friendly_green <- '#009E73'
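These objects can then be passed anywhere a colour is expected, for instance:

# use a named colour object in a ggplot layer
ireland_df %>%
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  geom_line(color = friendly_blue) +
  theme_bw()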

Autocorrelation

Now that we know what the series looks like, proceed to compute and plot the autocorrelation and partial autocorrelation functions of Ireland’s per capita energy consumption.

Below we use R’s built-in autocorrelation function. Recall that, mathematically, the autocorrelation function plots the following for each \(j\) (up to however many lags we specify):

\[ \rho_j = \frac{\text{cov}(y_t, y_{t-j})}{\text{Var}(y_t)} \]

ireland_df %>%
  select(ln_energy_pc) %>% # isolate ln_energy_pc in data frame
  acf(
    plot = TRUE, # create a plot
    type = "correlation"
  ) # standard ACF
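In the spirit of confirming results by hand, the lag-1 autocorrelation can be computed directly from the formula above; a sketch (this matches acf()’s estimator, which takes deviations from the overall mean):

# compute rho_1 manually and compare to the first bar of the ACF plot
y_ac <- ireland_df$ln_energy_pc
n <- length(y_ac)
rho_1 <- sum((y_ac[-1] - mean(y_ac)) * (y_ac[-n] - mean(y_ac))) / sum((y_ac - mean(y_ac))^2)
print(rho_1)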

We can also use the partial autocorrelation function

pacf_result <- ireland_df %>%
  select(ln_energy_pc) %>%
  acf(
    plot = TRUE,
    type = "partial"
  ) # PACF option

Basic Regressions and Detrending

Our data might have a long-term trend, which may be problematic when we want to confirm whether our data is stationary.

A simple way to detrend data is by removing the linear or quadratic trend - we can estimate these trends with simple linear regression.

Let’s do the regressions by “hand” first and then using the R packages. Note that it is often useful to confirm that you understand what your underlying models do by reproducing the results by “hand”.

# Remember, OLS Coefficients are: (β̂ = (X'X)^(-1) * X'Y)
# Load data
y <- ireland_df$ln_energy_pc  # Dependent variable

# Convert 'Year' column to numeric
year_numeric <- as.numeric(format(ireland_df$Year, "%Y"))

# Construct the matrix representing the data (including intercept)
X <- cbind(1, year_numeric) # cbind combines the vector of 1s with the year vector

# Convert y to a column matrix
y_matrix <- matrix(y, ncol = 1)

# Step 1: Compute X'X (X transpose times X)
XtX <- t(X) %*% X  # note that t(X) transposes the matrix X and %*% does the matrix multiplication
print("X'X:") 
[1] "X'X:"
print(XtX)
                   year_numeric
                39        77961
year_numeric 77961    155848979
# Step 2: Compute the inverse of X'X with solve() (it is really difficult to invert matrices by hand)
XtX_inv <- solve(XtX)


# Step 3: Compute X'Y (X transpose times Y)
XtY <- t(X) %*% y_matrix


# Step 4: Compute the coefficients (β̂ = (X'X)^(-1) * X'Y)
beta_hat <- XtX_inv %*% XtY
print("Estimated Coefficients (β̂):")
[1] "Estimated Coefficients (β̂):"
print(beta_hat)
                     [,1]
             -23.11463880
year_numeric   0.01398618
# Step 5: Compute predicted values (ŷ = X * β̂)
y_hat <- X %*% beta_hat


# Step 6: Calculate residuals (ε = y - ŷ)
residuals <- y_matrix - y_hat

Here we do exactly the same, but using R’s lm framework instead. You can confirm that the estimated coefficients are the same.

  # Extract the dependent variable
    y <- ireland_df$ln_energy_pc

  # Extract the year as a numeric variable from the 'Year' column
    year_numeric <- as.numeric(format(ireland_df$Year, "%Y"))

  # Combine into a data frame
    detrended_example <- data.frame(
      y = y,
      year = year_numeric,
      year_squared = year_numeric^2  # Add quadratic term for future use
    )
  # ------------------------------------------
  # Run the OLS regression for a linear trend
  # ------------------------------------------

    # note that the y ~ year part specifies the trend
      # lm(.) means that we are running a linear model
    model_linear <- lm(y ~ year, data = detrended_example)

    # View the regression summary
      summary(model_linear)

Call:
lm(formula = y ~ year, data = detrended_example)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.19070 -0.12764 -0.01876  0.12072  0.25112 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -23.114639   3.956777  -5.842 1.03e-06 ***
year          0.013986   0.001979   7.066 2.32e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1391 on 37 degrees of freedom
Multiple R-squared:  0.5744,    Adjusted R-squared:  0.5629 
F-statistic: 49.93 on 1 and 37 DF,  p-value: 2.315e-08
    # compare to betas from matrix multiplication 
      print(beta_hat)
                     [,1]
             -23.11463880
year_numeric   0.01398618

Now we get the predicted values and detrend - in the results below we show that the detrended series (the residuals) is the same as the series minus the fitted trend.

      # Predict the trend (fitted values from the model)
        detrended_example$trend <- predict(model_linear)

      # Create the detrended series
        detrended_example$detrended_y_b <- detrended_example$y - detrended_example$trend
      # Note you can do the same with:
        detrended_example$detrended_lin <- residuals(model_linear)

      # You can confirm that the detrended series and the residuals are             
        # the same thing here:
        check_detrend <- lm(detrended_lin ~ detrended_y_b, data=detrended_example)
        # As you can see this leads to exactly the same result
        summary(check_detrend)

Call:
lm(formula = detrended_lin ~ detrended_y_b, data = detrended_example)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.795e-14 -1.442e-16  8.579e-16  1.466e-15  2.402e-15 

Coefficients:
                Estimate Std. Error    t value Pr(>|t|)    
(Intercept)   -6.727e-15  7.653e-16 -8.790e+00 1.37e-10 ***
detrended_y_b  1.000e+00  5.648e-15  1.771e+14  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.779e-15 on 37 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 3.135e+28 on 1 and 37 DF,  p-value: < 2.2e-16

Compare the manually constructed results to the residuals from the lm model:

  # Place residuals from matrix multiplication here
  detrended_example$matrixtrend <- residuals
  model_check <- lm(detrended_lin ~ matrixtrend, data=detrended_example)
  summary(model_check)

Call:
lm(formula = detrended_lin ~ matrixtrend, data = detrended_example)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.876e-13 -7.850e-14  9.570e-16  8.005e-14  1.592e-13 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 2.374e-11  1.567e-14 1.515e+03   <2e-16 ***
matrixtrend 1.000e+00  1.156e-13 8.648e+12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.785e-14 on 37 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 7.479e+25 on 1 and 37 DF,  p-value: < 2.2e-16

We can do the same for the quadratic trend and plot both:

    # ------------------------------------------
    # Run the OLS regression for a quadratic trend
    # ------------------------------------------
        # note that the y ~ year part specifies the trend
        model_quad <- lm(y ~ year + year_squared, data = detrended_example)

        # View the regression summary
        summary(model_quad)

Call:
lm(formula = y ~ year + year_squared, data = detrended_example)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.13022 -0.07776 -0.02081  0.07998  0.22320 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.531e+03  5.421e+02  -6.514 1.44e-07 ***
year          3.524e+00  5.424e-01   6.497 1.51e-07 ***
year_squared -8.779e-04  1.357e-04  -6.471 1.64e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09589 on 36 degrees of freedom
Multiple R-squared:  0.8032,    Adjusted R-squared:  0.7923 
F-statistic: 73.49 on 2 and 36 DF,  p-value: 1.952e-13
        # Predict the trend (fitted values from the model)
        detrended_example$trend_quad <- predict(model_quad)

        detrended_example$detrended_quad <- residuals(model_quad)
        
        # Create the ggplot
        ggplot(detrended_example, aes(x = year)) +
            geom_line(aes(y = y, color = "Log Gigajoules per Capita"), size = 1) + # Original series
          geom_line(aes(y = trend, color = "Linear Trend"), size = 1, linetype = "solid") + # Linear trend
          geom_line(aes(y = trend_quad, color = "Quadratic Trend"), size = 1, linetype = "dashed") + # Quadratic trend
        labs(
          title = "Linear and Quadratic Trends",
          x = "Year",
          y = "Value",
          color = "Legend"
        ) +
        scale_color_manual(
          values = c("Log Gigajoules per Capita" = friendly_blue, "Linear Trend" = friendly_orange, "Quadratic Trend" = friendly_green)
        ) +
        theme_minimal() + # Clean theme
        theme(
          legend.position = "top", # Adjust legend position (e.g., "top", "bottom", "left", "right")
          legend.text = element_text(size = 10), # Adjust legend text size
          plot.title = element_text(hjust = 0.5, size = 14) # Center and size the title
        )

Using stored results

Sometimes you will find yourself needing to access stored results. In the matrix and lm models above, we can do this fairly simply. In the environment, you’ll see the regression we ran above as model_linear - the name we gave it.

Exploring the environment of a model

The text in the environment pane is a bit unruly - but we can look at it more closely by clicking the magnifying glass. This gives us more details on the object model_linear:

Exploring the object

If we continue exploring, we will see that model_linear has several callable contents. If we just want the coefficient for year, for example, we can simply run:

# You can also get the above window by typing: 
  # View(model_linear)


# We can do the below if we want to keep the variable name
lin_year_coef = model_linear$coefficients["year"]
print(lin_year_coef)
      year 
0.01398618 
# More often, if we want to do this - we'll need the coefficients as vectors
coef_values <- as.numeric(model_linear$coefficients)
print(coef_values)
[1] -23.11463880   0.01398618
# I often find myself needing only specific coefficients from models in order to do the transformation of interest 
# suppose I only need the year coefficient from both equations above (as unrealistic as that is)


# Loop over model names and extract "year" coefficient
  # set up data-frame for coefficients:
example_coefs <- data.frame(model_type = character(), year_coef = numeric(), stringsAsFactors = FALSE)

  # run the loop over model names
    # note: when running large-scale jobs on sub-groups of different sizes,
    # it is often better to populate this dynamically
    # from file names stored in csv or txt files. 
    # This especially applies in secure data facilities.

for (model_name in c("model_linear","model_quad")) {
  model <- get(model_name) 
  coef_value <- as.numeric(model$coefficients["year"])  # Extract the coefficient of 'year'
  
  # Handle cases where 'year' might not exist in the model
  if (is.na(coef_value)) {
    coef_value <- NA
  }

  # Append to results data frame
  example_coefs <- rbind(example_coefs, data.frame(
    model_type = sub("model_", "", model_name),  # Extract suffix
    year_coef = coef_value
  ))
}

# Print results
print(example_coefs)
  model_type  year_coef
1     linear 0.01398618
2       quad 3.52397017

A Time Series Example using Packages: Unit root tests

Often, we want to confirm whether a time series has a unit root, which implies that the series follows a random walk and lacks mean reversion. A unit root exists when the coefficient of autocorrelation equals one, meaning the series is non-stationary.

The test can be thought of as testing \[ H_0: \beta_1 = 1 \] in the regression below
\[ y_t = \beta_1 y_{t-1} + e_t \]

The Dickey-Fuller test then tests the following: \[ (y_t - y_{t-1}) = \alpha_1 y_{t-1} + e_t \]

where \(\alpha_1 = \beta_1 - 1\)

The null hypothesis is now more standard:
\[ \alpha_1 = 0 \]

The issue is that the above still assumes that \(e_t\) is iid. If we have a unit root, that is very unlikely to begin with. The Augmented Dickey-Fuller test addresses this by controlling for persistence in our error term up to a specified number of lags:

\[ (y_t - y_{t-1}) = \Delta y_{t} = \alpha_1 y_{t-1} + \sum_{j=1}^{p} \phi_{j} \Delta y_{t-j} + u_t \]

We can also control for a trend or drift by adding a time trend or a constant, respectively. A constant controls for the mean of the data, so that the mean we are testing reversion to need not be zero. A time trend allows for a deterministic trend in the data.
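To connect the algebra to code before we turn to a package: a minimal sketch of the simple Dickey-Fuller regression with a constant, run by hand with lm(). Remember that the t-value on the lagged level must be compared against Dickey-Fuller critical values, not the usual t-distribution:

# simple DF regression: change in y on lagged y plus a constant
y_df <- ireland_df$ln_energy_pc
dy <- diff(y_df)                 # y_t - y_{t-1}
y_lag <- y_df[-length(y_df)]     # y_{t-1}
summary(lm(dy ~ y_lag))          # t-value on y_lag is the DF statistic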

Original Series

Use the urca package to test for stationarity by performing Augmented Dickey-Fuller (ADF) tests with ur.df().

# load the urca package
p_load(urca)

# ur.df() requires a vector/array
# you should recognise pull() from before
test_vector <- ireland_df %>%
  pull(ln_energy_pc)

my_adf1 <- ur.df(
  y = test_vector, # vector
  type = "trend", # type  of ADF - trend + constant
  lags = 5, # max number of lags
  selectlags = "AIC"
) # lag selection criteria

# use summary() to present the saved ADF object
# summary() wraps many different kinds of objects
summary(my_adf1)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.087796 -0.019856  0.009542  0.029923  0.045227 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.3937065  0.2278056   1.728   0.0950 .
z.lag.1     -0.0797608  0.0498511  -1.600   0.1208  
tt           0.0001319  0.0010831   0.122   0.9039  
z.diff.lag1  0.1723326  0.1771983   0.973   0.3391  
z.diff.lag2  0.3115141  0.1794628   1.736   0.0936 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03842 on 28 degrees of freedom
Multiple R-squared:  0.296, Adjusted R-squared:  0.1954 
F-statistic: 2.943 on 4 and 28 DF,  p-value: 0.03778


Value of test-statistic is: -1.6 1.6437 2.1039 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -4.15 -3.50 -3.18
phi2  7.02  5.13  4.31
phi3  9.31  6.73  5.61

Is the null hypothesis rejected? How about an ADF test that specifies only a constant/drift?

my_adf2 <- ur.df(
  y = test_vector,
  type = "drift", # type  of ADF - with drift
  lags = 5,
  selectlags = "AIC"
)
summary(my_adf2)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.088270 -0.019198  0.009114  0.030175  0.044932 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.37689    0.17811   2.116   0.0430 *
z.lag.1     -0.07569    0.03633  -2.083   0.0461 *
z.diff.lag1  0.16583    0.16608   0.999   0.3263  
z.diff.lag2  0.30364    0.16454   1.845   0.0752 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03776 on 29 degrees of freedom
Multiple R-squared:  0.2956,    Adjusted R-squared:  0.2228 
F-statistic: 4.057 on 3 and 29 DF,  p-value: 0.01595


Value of test-statistic is: -2.0834 2.5446 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1  7.06  4.86  3.94

Does this mean our series should be detrended or not?

Detrended Series

(Not covered in class)

We can compare those results to the results from already detrended series as below. Note that the results are only valid for the detrended series - if we reject a unit root in the detrended data, that does not mean that the original data was or was not stationary.

For the linearly detrended series:

# Results for linearly detrended series: 
detrend_lin <- detrended_example %>%
  pull(detrended_lin)

adf_lin <- ur.df(
  y = detrend_lin,
  type = "drift", # type  of ADF - with drift
  lags = 2,
)
summary(adf_lin)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.08750 -0.02700  0.01137  0.02868  0.04876 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.0007899  0.0063372   0.125   0.9016  
z.lag.1     -0.0854229  0.0480295  -1.779   0.0848 .
z.diff.lag1  0.2461152  0.1606244   1.532   0.1353  
z.diff.lag2  0.3893391  0.1616341   2.409   0.0219 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03768 on 32 degrees of freedom
Multiple R-squared:  0.2708,    Adjusted R-squared:  0.2025 
F-statistic: 3.962 on 3 and 32 DF,  p-value: 0.01649


Value of test-statistic is: -1.7786 1.5817 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1  7.06  4.86  3.94

For the quadratically detrended series:

# Results for quadratically detrended series 

detrend_quad <- detrended_example %>%
  pull(detrended_quad)

adf_quad <- ur.df(
  y = detrend_quad,
  type = "drift", # type  of ADF - with drift
  lags = 2,
)
summary(adf_quad)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.076880 -0.025863  0.009698  0.024754  0.050565 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.0005764  0.0059899   0.096   0.9239  
z.lag.1     -0.1856822  0.0723037  -2.568   0.0151 *
z.diff.lag1  0.2228600  0.1525139   1.461   0.1537  
z.diff.lag2  0.3813441  0.1466459   2.600   0.0140 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03539 on 32 degrees of freedom
Multiple R-squared:  0.3287,    Adjusted R-squared:  0.2657 
F-statistic: 5.222 on 3 and 32 DF,  p-value: 0.004763


Value of test-statistic is: -2.5681 3.3616 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1  7.06  4.86  3.94

Using the HP filter to detrend seems to finally get us data where we can reject the null of a unit root. Note that we set the frequency to 6.25 (meaning \(\lambda=6.25\)), which is consistent with annual data.

p_load("mFilter")
library(mFilter)

# Apply the HP filter to test_vector
hpfiltered <- hpfilter(test_vector, freq = 6.25)

# Extract the cyclical component (non-trend)
hp_cycle <- hpfiltered$cycle

# Apply the ADF test to the cyclical component
adf_hp <- ur.df(
  y = hp_cycle,
  type = "drift", # ADF with drift (no trend, focus on stationarity of cycle - note in these the mean is zero)
  lags = 2,
)

# Summary of the ADF test
summary(adf_hp)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.032757 -0.014573 -0.001604  0.012624  0.041336 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.001153   0.003380  -0.341   0.7353    
z.lag.1     -1.422755   0.265403  -5.361 6.95e-06 ***
z.diff.lag1  0.510783   0.210331   2.428   0.0210 *  
z.diff.lag2  0.411025   0.161014   2.553   0.0157 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02024 on 32 degrees of freedom
Multiple R-squared:  0.5421,    Adjusted R-squared:  0.4992 
F-statistic: 12.63 on 3 and 32 DF,  p-value: 1.299e-05


Value of test-statistic is: -5.3607 14.3688 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1  7.06  4.86  3.94

Cross section analysis

The fixest package aids the estimation of various kinds of regression models. Let’s run a few simple OLS regressions to illustrate. Our example tries to answer the question: what is the relationship between education and wages in the CPS data for the year 1974? We will also control for age and marital status.

Instead of sourcing .csv files locally, we can download such files directly from the internet. As before, we can use read.csv() to create a data frame called cs_df.

# Replace a local file path with a web address
# Subset the data to only those observations in 1974
# To restrict memory usage, select only the relevant columns
cs_df <- read.csv("https://raw.githubusercontent.com/stata2r/stata2r.github.io/main/data/cps_long.csv") %>%
  filter(year == 1974) %>%
  select(wage, educ, age, marr)

Descriptive statistics

Before we run the regressions, we should probably get a better picture of the data we are dealing with. Take note of the dimensions of the sample and its variables’ types and distributions.

# Get an overview of the sample
# Do you notice any issues?
skimr::skim(cs_df)
Table 6: Data summary
Name cs_df
Number of rows 15992
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
wage 0 1 14016.80 9569.80 0 4403.45 15123.58 23584.18 25862.32 ▆▂▃▃▇
educ 0 1 12.03 2.87 0 11.00 12.00 13.00 18.00 ▁▁▂▇▂
age 0 1 33.23 11.05 16 24.00 31.00 42.00 55.00 ▆▇▆▅▅
marr 0 1 0.71 0.45 0 0.00 1.00 1.00 1.00 ▃▁▁▁▇
# Get an impression of wage by marital status
cs_df %>%                 # call the cross-section dataframe
  select(wage, marr) %>%  # Select the variables wage and marriage from the dataframe 
  group_by(marr) %>%      # group the selected variables from the dataframe 
  skimr::skim()           # skim the data grouped by marriage, limited to the selected variables
Table 6: Data summary
Name Piped data
Number of rows 15992
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables marr

Variable type: numeric

skim_variable marr n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
wage 0 0 1 7681.53 8325.02 0 386.47 4325.08 12987.49 25862.32 ▇▂▂▁▂
wage 1 0 1 16582.74 8818.61 0 10237.66 18677.69 25781.50 25862.32 ▃▂▂▃▇

More Specific Descriptive Statistics

(This section only covered if there is time at the end)

Sometimes we don’t want all of these details, especially not for publication.

Below, I provide a table where the mean, standard deviation, and observation count for the Unmarried and Married groups are given separately.

stat_table <- cs_df %>%       # Calling the tidyverse
  group_by(marr) %>%          # Group by marital status
  summarise(                  # Summarise using specific values 
    Observations = n(),       # observations
    Wage_mean = mean(wage),   # Mean of a variable
    Wage_sd  = sd(wage),      # Standard deviation of a variable
    Education_mean = mean(educ), 
    Education_sd  = sd(educ),
  )    

stat_table
Table 7:

marr   Observations   Wage_mean   Wage_sd    Education_mean   Education_sd
0      4610           7.68e+03    8.33e+03   12               2.67
1      11382          1.66e+04    8.82e+03   12               2.95

We can clean up this table fairly easily, as below.

flipped_table <- cs_df %>%    # Calling the tidyverse
  group_by(marr) %>%          # Group by marital status
  summarise(                  # Summarise using specific values 
    Observations = n(),       # observations
    Wage = mean(wage),        # Mean of a variable
    ` `  = sd(wage),          # Note the standard deviation is not given a name here, 
                                # rather a unique number of spaces for each variable.
                                # I use this approach when looping over longer structures
    Education = mean(educ), 
    `  `  = sd(educ),
  )    %>%
  pivot_longer(                # Pivot the table above                   
    cols = -marr,              # All columns except 'marr'
    names_to = "Statistic",    # Put the names that are not the group variable to statistics
    values_to = "Value"        # Call their values "Value" - which we replace below/
  ) %>%
  mutate(
    Value = case_when(
      Statistic == "Observations" ~ as.character(as.integer(Value)), # Ensure integer format
      Statistic == " " ~ paste0("(", formatC(as.numeric(Value), format = "f", big.mark = ",", digits = 3), ")"), 
      # add parentheses around the standard deviations
      Statistic == "  " ~ paste0("(", formatC(as.numeric(Value), format = "f", big.mark = ",", digits = 3), ")"),
      TRUE ~ formatC(as.numeric(Value), format = "f", big.mark = ",", digits = 3) # Format the "Value" with commas  
    )
  ) %>%
  pivot_wider(                # pivot the table again but now using married 
    names_from = marr, # Set 'marr' levels as column headers
    values_from = Value       # get the values from Value
  )
colnames(flipped_table) <- c("Statistic", "Unmarried", "Married")   # Give the columns names
flipped_table
Table 8:

Statistic      Unmarried     Married
Observations   4610          11382
Wage           7,681.535     16,582.744
               (8,325.019)   (8,818.607)
Education      12.035        12.024
               (2.671)       (2.948)

Which we can turn into a huxtable:

# Convert to huxtable
flipped_hux <- as_hux(flipped_table) %>%
  add_footnote("Source: Author's calculations based on CPS in 1974.  \n \n Each variable's mean is provided by group with standard deviation in parentheses. Missing ") %>%
  set_caption("Descriptive Statistics by Marital Status") %>%
  set_bold(1, everywhere) %>%
  set_bottom_border(1, everywhere, 1) %>% 
  set_bottom_border(2, everywhere, 1) %>% 
  set_align(everywhere, everywhere, "center") %>%
  set_width(1)

# We can print the huxtable:
flipped_hux
Table 9: Descriptive Statistics by Marital Status

Statistic      Unmarried     Married
Observations   4610          11382
Wage           7,681.535     16,582.744
               (8,325.019)   (8,818.607)
Education      12.035        12.024
               (2.671)       (2.948)

Source: Author's calculations based on CPS in 1974.
Each variable's mean is provided by group with standard deviation in parentheses. Missing
# The above can be transformed into a LaTeX table
cat(to_latex(flipped_hux), file = "./tables/example_table.tex")

Regressions

We can perform standard OLS regressions with fixest’s feols() function. This function requires two arguments, formula (or fml) and data. Formulas are given in the format y ~ x1 + x2, and data refers to a data frame.

# Since we are regressing on wages, we should probably log first:
# Note there are some zero wages; log(0) is -Inf, and feols() drops such observations (see N below)
cs_df <- cs_df %>% mutate(lwage = log(wage))


# Our first model
model1 <- feols(fml = lwage ~ educ, data = cs_df)

# Adding an explanatory continuous variable: age
model2 <- feols(lwage ~ educ + age, cs_df)

# Adding a categorical variable
model3 <- feols(lwage ~ educ + age + factor(marr), cs_df)

# As before, use summary() to display the results of model1
summary(model1)
OLS estimation, Dep. Var.: lwage
Observations: 14,079
Standard-errors: IID 
            Estimate Std. Error   t value  Pr(>|t|)    
(Intercept) 9.055392   0.039695 228.12249 < 2.2e-16 ***
educ        0.024358   0.003205   7.59909  3.17e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.08271   Adj. R2: 0.004015
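fixest also makes it easy to swap in alternative standard errors; a sketch requesting heteroskedasticity-robust errors (assuming a reasonably recent version of fixest, which accepts the vcov argument):

# Robust standard errors at summary time
summary(model1, vcov = "hetero")
# or at estimation time
model1_hetero <- feols(lwage ~ educ, data = cs_df, vcov = "hetero")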

Visualising results

Consider using the huxreg() function from the huxtable package. It enables the neat presentation of multiple regression models in a single table. Let’s consider the standard version first.

huxreg(model1, model2, model3)
Table 10:

                (1)          (2)          (3)
(Intercept)     9.055 ***    7.200 ***    7.096 ***
                (0.040)      (0.048)      (0.046)
educ            0.024 ***    0.054 ***    0.051 ***
                (0.003)      (0.003)      (0.003)
age                          0.044 ***    0.033 ***
                             (0.001)      (0.001)
factor(marr)1                             0.726 ***
                                          (0.019)
N               14079        14079        14079
R2              0.004        0.190        0.263
logLik          -21096.007   -19640.118   -18976.952
AIC             42196.014    39286.237    37961.903
*** p < 0.001; ** p < 0.01; * p < 0.05.

However, you may want to improve the look of your regression table with some aesthetic adjustments, as we did for Table 5. Please note all of the optional arguments to the huxreg() function itself.

huxreg(
  "Model 1" = model1, "Model 2" = model2, "Model 3" = model3,
  statistics = c("N" = "nobs", "R-squared" = "r.squared"),
  stars = c(`*` = 0.1, `**` = 0.05, `***` = 0.01, `****` = 0.001),
  number_format = 2,
  coefs = c(
    "Education" = "educ",
    "Age" = "age",
    "Married" = "factor(marr)1"
  )
) %>%
  set_font_size(8) %>%
  set_caption("Regression of Log Wages on Education, Age, and Marital Status")
Table 11: Regression of Log Wages on Education, Age, and Marital Status

            Model 1      Model 2      Model 3
Education   0.02 ****    0.05 ****    0.05 ****
            (0.00)       (0.00)       (0.00)
Age                      0.04 ****    0.03 ****
                         (0.00)       (0.00)
Married                               0.73 ****
                                      (0.02)
N           14079        14079        14079
R-squared   0.00         0.19         0.26
**** p < 0.001; *** p < 0.01; ** p < 0.05; * p < 0.1.

fixest also contains various plotting functions based on your regressions. For example, consider the following coefficient plot which graphs the coefficients for each model:

# Notice that models need to be entered as a list() object

# Generate the coefficient plot with explicit inclusion of desired coefficients
coefplot(list(model1, model2, model3), drop = "Constant")

Interaction effects

You may also need to run regressions using interaction effects. Consider the following example where the effect of education is moderated by marital status:

# Same as before, but "*" denotes an interaction
model4 <- feols(lwage ~ educ * factor(marr), data = cs_df)

# What do our results say?
summary(model4)
OLS estimation, Dep. Var.: lwage
Observations: 14,079
Standard-errors: IID 
                    Estimate Std. Error  t value  Pr(>|t|)    
(Intercept)         7.549777   0.078292 96.43044 < 2.2e-16 ***
educ                0.084907   0.006253 13.57775 < 2.2e-16 ***
factor(marr)1       1.873869   0.088143 21.25941 < 2.2e-16 ***
educ:factor(marr)1 -0.069062   0.007062 -9.77877 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.980798   Adj. R2: 0.182566
# maybe this is easier:
huxreg(model4)
Table 12:

                      (1)
(Intercept)           7.550 ***
                      (0.078)
educ                  0.085 ***
                      (0.006)
factor(marr)1         1.874 ***
                      (0.088)
educ:factor(marr)1    -0.069 ***
                      (0.007)
N                     14079
R2                    0.183
logLik                -19704.266
AIC                   39416.532
*** p < 0.001; ** p < 0.01; * p < 0.05.
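To interpret the interaction: the effect of an extra year of education for married respondents is the educ coefficient plus the interaction term. We can compute it from the stored results (a sketch; coef() works on fixest objects):

# effect of education for unmarried respondents
coef(model4)["educ"]
# effect of education for married respondents
coef(model4)["educ"] + coef(model4)["educ:factor(marr)1"]
# roughly 0.085 - 0.069 = 0.016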

We can visualise the differential effects of education on wages by marital status using ggplot’s geom_smooth(), which can draw separate regression lines for married and unmarried respondents.

cs_df %>%
  ggplot(aes(x = educ, y = lwage)) +
  theme_bw() +
  geom_point(alpha = 0.5) + # creates a scatterplot
  geom_smooth(
    formula = y ~ x, # x, y inherited from aes()
    method = "lm", # specifies linear model
    aes(color = factor(marr)), # creates two regression lines
    se = TRUE, # display confidence interval,
    level = 0.95
  ) + # confidence level to 95%
  theme(legend.position = "bottom") +
  labs(
    y = "Wage", x = "Years of Education", color = "Married",
    title = "The effect of education on wage by marital status"
  )

End

That was a cursory introduction to R and RStudio with a focus on some basic operations, data visualisations, and a little bit of econometrics. I hope it was useful! You can consult the Acknowledgements and Further Reading sections for additional resources. Thank you for your attention and please feel free to reach out if you encounter any issues.


  1. If you are planning on applying version control to your new project, it is useful to check Create git repository.↩︎

  2. Packages are extensions to the R language. They contain code, data, and documentation in a standardised collection format that can be installed by users, typically via a centralised software repository such as CRAN (The Comprehensive R Archive Network). CRAN is a network of ftp (file transfer protocol) and web servers around the world that store identical, up-to-date, versions of code and documentation for R.↩︎