Learning to build reports using R

Goals

Give you a sense for what is possible with R Markdown
Show you the basic components so you can publish your work online
Build something, make mistakes, and iterate

In the longer term: Use this file and tutorial as a starting point, then go to YouTube, Google or your preferred AI to keep going. Ideally, you’ll be able to articulate what you want a little better after this tutorial.

Remember: Almost nothing works well the first time. It just needs to work well enough for you to learn something and prove a concept. This tutorial is meant to be rough and ready ™. It’s okay if you’re unsure about some part of the process. Getting comfortable will take time.

Components of an R based report

R Markdown file (.Rmd)

Contains all the pieces we need and will be rendered into our outputs
Make a new .Rmd file in RStudio: click File -> New File -> R Markdown

Outputs (what you publish/share)

PDF
Word
HTML (website)
Graphs (PNG/JPG)

Publishing (if we have time)

GitHub Pages
DukeSites
Free forever, and relatively easy to use!
Publishing your work in GitHub Pages

The Anatomy of an R Markdown File

YAML
- lives at the top of the Rmd file
- sets the outputs for making PDFs, HTML, Word docs, etc
- Ideally, set up a template or example that has all the options you like and just copy/paste when you want to use it
Markdown code
- All of your text is written in markdown
  - Make headers by putting a hash symbol (#) in front of the text
  - More hash symbols mean a smaller header (# H1, ## H2, ### H3)
  - Using headers correctly will make it easy to create tables of contents (see Table of contents for example)
- Download the Cheatsheet
- See RMarkdown documentation for help in the future: https://rmarkdown.rstudio.com/lesson-1.html
Code Chunks
- Where all the code lives!
- Used to work with
  - data
  - calculations
  - graphs and data viz
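
To make the three parts concrete, here is a minimal sketch that writes a tiny .Rmd file from R and prints it. The file name, title, and chunk contents are made up for illustration. RStudio's File -> New File -> R Markdown creates a similar starting file for you, so treat this as a way to see the pieces rather than the usual workflow.

```r
# Build the three-backtick chunk delimiter without typing it literally
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                              # YAML header starts
  "title: \"My first report\"",       # hypothetical title
  "output: html_document",            # output format to render to
  "---",                              # YAML header ends
  "",
  "# A top-level header",             # Markdown text
  "",
  "Some plain text written in Markdown.",
  "",
  paste0(fence, "{r}"),               # code chunk starts
  "summary(cars)",                    # R code inside the chunk
  fence                               # code chunk ends
)

# Write to a temporary file so nothing in your project is overwritten
rmd_path <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Print the file so you can see the three parts
cat(readLines(rmd_path), sep = "\n")

# To render it, you could run (needs the rmarkdown package and pandoc):
# rmarkdown::render(rmd_path)
```

The YAML block sits between the two `---` lines, the Markdown text follows it, and the code chunk is the part between the two fence lines.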

Coding in R

Set up your workspace

Although R has a lot of built-in functionality, we almost always load extra packages to add features. Which packages you use will depend on what you’re trying to do. Ask Google or an LLM for help figuring out what you need.

# Install packages to work with; they add features that R doesn't have built in
# Lines starting with # are code comments: they don't run, they're for your reference
# Remove the leading # to run an install line (you only need to install a package once)
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("tidyverse")
#install.packages("readr")
#install.packages("knitr")
#install.packages("broom")
#install.packages("gt")

# helps with data manipulation
library(dplyr)

# helps with plotting
library(ggplot2)

# loads a bundle of packages at once (including dplyr, ggplot2, readr, and tidyr)
library(tidyverse)

# work with CSV data
library(readr)
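
If you are not sure which packages are already on your machine, a small check like the one below can help. This is only a sketch: `find_missing_packages` is a hypothetical helper written for this example, not a function from any package, and the install line is left commented out so nothing is installed by accident.

```r
# Return the names of packages that are not installed yet
# (requireNamespace() loads a package's namespace if it is found)
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("dplyr", "ggplot2", "tidyverse", "readr")
missing_pkgs <- find_missing_packages(needed)

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them:
  # install.packages(missing_pkgs)
} else {
  message("All packages are installed")
}
```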

Loading data

food_security_df <- read.csv('data/raw_data/FAOSTAT_data_en_1-19-2026.csv')
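
After loading a file, it helps to look at what R actually read before doing anything else. The sketch below uses a tiny made-up CSV (the areas, items, and values are invented, and only the column names mirror the ones used in this tutorial). It also shows that a single entry such as "<0.1" is enough for `read.csv` to treat a whole column as text. Whether the real file contains entries like that is an assumption.

```r
# A tiny made-up CSV standing in for the real file (values are invented)
toy_csv <- "Area,Item,Year,Value
Country A,Some index,2021,0.52
Country A,Some index,2022,<0.1
Country B,Some index,2021,-1.30"

toy_df <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# First few rows
head(toy_df)

# Column names and the type of each column
str(toy_df)

# Just the types: Value comes back as "character" because of the "<0.1" entry
sapply(toy_df, class)
```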

Data cleanup/setup

# Remove the columns we don't want
food_security_df_modified <- food_security_df %>%
  select(-Domain, -Element)

# Remove rows where Value is missing (NA) or empty
food_security_df_modified <- food_security_df_modified[!is.na(food_security_df_modified$Value) & food_security_df_modified$Value != "", ]
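
The row filter above uses base R indexing, which can be hard to read at first. As a sketch on made-up data, the same idea can be written with dplyr's `filter()`; both versions keep only rows where `Value` is neither NA nor an empty string.

```r
library(dplyr)

# Made-up data with one NA and one empty string in Value
toy_df <- data.frame(
  Area  = c("A", "B", "C", "D"),
  Value = c("1.2", NA, "", "0.4"),
  stringsAsFactors = FALSE
)

# Base R version, same pattern as above
kept_base <- toy_df[!is.na(toy_df$Value) & toy_df$Value != "", ]

# dplyr version: conditions separated by commas are combined with AND
kept_dplyr <- toy_df %>%
  filter(!is.na(Value), Value != "")

# Both keep the same areas
kept_base$Area
kept_dplyr$Area
```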

Summary stats

# Let's grab only one indicator for now
df_filtered_stability <- food_security_df_modified %>%
  filter(Item == "Political stability and absence of violence/terrorism (index)")


summary(df_filtered_stability$Value)

##    Length     Class      Mode 
##       390 character character

It looks like our values are being treated like text (strings). We need to make them numeric. R needs to be told what kind of values it’s working with so we can get summary statistics.

# Right now our values are being treated like text (strings), so we need to make them numeric
# R needs to be told what kind of values it's working with
df_filtered_stability$value_num <- as.numeric(df_filtered_stability$Value)
df_filtered_stability$Year <- as.numeric(df_filtered_stability$Year)


summary(df_filtered_stability$value_num)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.78000 -0.66000  0.01500 -0.07741  0.76000  1.88000
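
One thing to watch when converting text to numbers: anything `as.numeric()` cannot parse becomes NA. The sketch below uses made-up values to show how you might check what was lost in the conversion. It is an illustration only, not a claim about what the real data contains.

```r
# Made-up text values; the last one is not a clean number
values_text <- c("1.5", "-0.66", "<2.5")

# as.numeric() turns anything it can't parse into NA (with a warning)
values_num <- suppressWarnings(as.numeric(values_text))

# Which entries failed to convert?
values_text[is.na(values_num)]

# How many were lost?
sum(is.na(values_num))
```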

Basic Graphs

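# group = Area draws one line per area instead of a single zigzag through every point
ggplot(df_filtered_stability, aes(x = Year, y = value_num, group = Area)) +
  geom_line() +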
  geom_point() +
  scale_x_continuous(breaks = c(2021, 2022)) +
  labs(
    x = "Year",
    y = "Political stability index",
    title = "Political Stability, 2021–2022"
  ) +
  theme_minimal()

#Simple box plot
ggplot(df_filtered_stability, aes(x = factor(Year), y = value_num)) +
  geom_boxplot() +
  labs(
    x = "Year",
    y = "Political stability index",
    title = "Political Stability by Year"
  ) +
  theme_minimal()

#Pretty box plot
ggplot(df_filtered_stability, aes(x = factor(Year), y = value_num, fill = factor(Year))) +
  geom_boxplot(
    width = 0.6,
    alpha = 0.8,
    outlier.shape = NA
  ) +
  geom_jitter(
    width = 0.15,
    alpha = 0.35,
    size = 1.5
  ) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    x = "Year",
    y = "Political stability index",
    title = "Political Stability and Absence of Violence/Terrorism",
    subtitle = "Distribution by year"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  )

# Violin plot

temp_plot <- ggplot(df_filtered_stability, aes(x = factor(Year), y = value_num, fill = factor(Year))) +
  geom_violin(
    trim = FALSE,
    alpha = 0.8
  ) +
  geom_boxplot(
    width = 0.12,
    fill = "white",
    outlier.shape = NA
  ) +
  geom_jitter(
    width = 0.1,
    alpha = 0.3,
    size = 1.3
  ) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    x = "Year",
    y = "Political stability index",
    title = "Political Stability and Absence of Violence/Terrorism",
    subtitle = "Distribution by year"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  )

Saving plots as images

#Saving the plots as images
ggsave(
  filename = "political_stability_violin_2021_2022_2.png",
  plot = temp_plot,
  width = 8,
  height = 6,
  dpi = 300
)
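
If you would rather keep images in their own folder, a sketch like the following creates the folder first and then saves into it. The folder and file names are hypothetical, and a temporary directory plus R's built-in `cars` dataset are used here so the example runs anywhere.

```r
library(ggplot2)

# A small example plot using R's built-in cars dataset
example_plot <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  theme_minimal()

# Hypothetical output folder; a temporary directory is used here
out_dir <- file.path(tempdir(), "figures")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Build the full path and save the plot there
out_file <- file.path(out_dir, "example_plot.png")
ggsave(filename = out_file, plot = example_plot, width = 8, height = 6, dpi = 300)

# Confirm the file was written
file.exists(out_file)
```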

Regressions

Let’s investigate the relationship between political stability and GDP.

# Let's see how stability and GDP compare
reg_df <- food_security_df_modified %>%
  filter(
    Item == "Political stability and absence of violence/terrorism (index)" |
      Item == "Gross domestic product per capita, PPP, (constant 2021 international $)"
  )


reg_df_years <- reg_df %>% 
  filter(
    Year == 2021 | Year == 2022
  )

# Reshaping the data so we can run the regressions. Right now it's long, we need it wide (i.e. we're going to pivot the table)
df_wide <- reg_df_years %>%
  filter(Item %in% c(
    "Gross domestic product per capita, PPP, (constant 2021 international $)",
    "Political stability and absence of violence/terrorism (index)"
  )) %>%
  #making sure our values are treated as numbers
  mutate(
    Value = as.numeric(Value),     # if Value is character
    Year  = as.integer(Year)
  ) %>%
  #pivoting the table
  select(Area, Year, Item, Value) %>%
  pivot_wider(
    names_from = Item,
    values_from = Value
  ) %>%
  # renaming columns for ease of use
  rename(
    gdp = `Gross domestic product per capita, PPP, (constant 2021 international $)`,
    pol_stab   = `Political stability and absence of violence/terrorism (index)`
  ) %>%
  # getting rid of any blank values is good practice
  drop_na(gdp, pol_stab)
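
If the long-to-wide reshape is hard to picture, the sketch below does the same pivot on four rows of made-up data. The short item names `gdp` and `pol_stab` are used directly here to keep the example small.

```r
library(tidyr)

# Made-up long data: one row per area and indicator
toy_long <- data.frame(
  Area  = c("A", "A", "B", "B"),
  Year  = c(2021, 2021, 2021, 2021),
  Item  = c("gdp", "pol_stab", "gdp", "pol_stab"),
  Value = c(1000, 0.5, 2000, -0.2)
)

# Wide data: one row per area and year, one column per indicator
toy_wide <- pivot_wider(toy_long, names_from = Item, values_from = Value)

toy_wide
```

Four long rows become two wide rows, with `gdp` and `pol_stab` side by side, which is the shape `lm()` needs.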

Running the regression

Raw GDP

#running the regression

# Raw GDP
m1 <- lm(gdp ~ pol_stab, data = df_wide)
summary(m1)

## 
## Call:
## lm(formula = gdp ~ pol_stab, data = df_wide)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39475 -13757  -2334  10534  93768 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    26106       1158   22.54   <2e-16 ***
## pol_stab       14200       1196   11.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22430 on 374 degrees of freedom
## Multiple R-squared:  0.2737, Adjusted R-squared:  0.2718 
## F-statistic:   141 on 1 and 374 DF,  p-value: < 2.2e-16

Log GDP, since GDP tends to skew right

# Log GDP, since it tends to be skewed right
m2 <- lm(log(gdp) ~ pol_stab, data = df_wide)
summary(m2)

## 
## Call:
## lm(formula = log(gdp) ~ pol_stab, data = df_wide)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3856 -0.6761  0.2010  0.6945  2.0400 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.60146    0.04899  195.99   <2e-16 ***
## pol_stab     0.69955    0.05058   13.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9488 on 374 degrees of freedom
## Multiple R-squared:  0.3384, Adjusted R-squared:  0.3366 
## F-statistic: 191.3 on 1 and 374 DF,  p-value: < 2.2e-16
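
Because the outcome is on a log scale, the `pol_stab` coefficient is easier to read once it is converted back. The sketch below takes the estimate from the output above and applies the standard log-linear conversion. It describes an association in this data, not a causal effect.

```r
# Coefficient on pol_stab from the log(GDP) model output above
coef_pol_stab <- 0.69955

# In a log-linear model, a one-unit increase in x multiplies y by exp(coefficient)
multiplier <- exp(coef_pol_stab)

# Expressed as a percent difference
pct_diff <- (multiplier - 1) * 100

round(multiplier, 2)  # about 2.01
round(pct_diff)       # about 101
```

In words: a one-point higher stability index goes along with roughly double the GDP per capita in this sample.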

Time Fixed Effects

# With time fixed effects
m3 <- lm(log(gdp) ~ pol_stab + factor(Year), data = df_wide)
summary(m3)

## 
## Call:
## lm(formula = log(gdp) ~ pol_stab + factor(Year), data = df_wide)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3680 -0.6812  0.2064  0.6954  2.0225 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.58378    0.06950 137.894   <2e-16 ***
## pol_stab          0.69959    0.05064  13.815   <2e-16 ***
## factor(Year)2022  0.03519    0.09797   0.359     0.72    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9499 on 373 degrees of freedom
## Multiple R-squared:  0.3386, Adjusted R-squared:  0.3351 
## F-statistic: 95.48 on 2 and 373 DF,  p-value: < 2.2e-16

Controlling for Area

# Controlling for area
m4 <- lm(log(gdp) ~ pol_stab + factor(Area) + factor(Year), data = df_wide)
summary(m4)
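
A model with `factor(Area)` estimates a separate coefficient for almost every area, so the full `summary()` printout gets very long. One way to look only at the variable of interest is sketched below on simulated data. The areas, values, and the 0.5 effect are all made up for illustration.

```r
set.seed(1)

# Simulated panel: 20 made-up areas observed in 2 years
toy_panel <- data.frame(
  Area = rep(paste0("area_", 1:20), each = 2),
  Year = rep(c(2021, 2022), times = 20),
  pol_stab = rnorm(40)
)
toy_panel$log_gdp <- 9 + 0.5 * toy_panel$pol_stab + rnorm(40, sd = 0.3)

toy_fit <- lm(log_gdp ~ pol_stab + factor(Area) + factor(Year), data = toy_panel)

# Number of estimated coefficients (mostly area dummies)
length(coef(toy_fit))

# Pull out only the row for the variable of interest
coef(summary(toy_fit))["pol_stab", ]
```

With only two years per area, most of the model's degrees of freedom go to the area terms, so expect wider standard errors than in the simpler models.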

Graphing Regression results

ggplot(df_wide, aes(x = pol_stab, y = log(gdp))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE) +
  theme_minimal() +
  labs(
    x = "Political stability and absence of violence/terrorism (index)",
    y = "log(GDP per capita, PPP)",
    title = "GDP and Political Stability"
  )

## `geom_smooth()` using formula = 'y ~ x'

Nice regression tables

library(broom)
library(gt)

tidy(m1) %>%
  gt() %>%
  fmt_number(columns = c(estimate, std.error), decimals = 3) %>%
  tab_header(
    title = "Regression results",
    subtitle = "Dependent variable: GDP per capita"
  )

Regression results
Dependent variable: GDP per capita
term	estimate	std.error	statistic	p.value
(Intercept)	26,105.629	1,158.398	22.53597	1.176300e-71
pol_stab	14,200.471	1,196.006	11.87324	8.254138e-28

# Making things nicer by rounding values
# (very small p-values will display as 0; read them as "less than 0.001")
tidy(m2) %>%
  mutate(
    estimate  = round(estimate, 3),
    std.error = round(std.error, 3),
    statistic = round(statistic, 2),
    p.value   = round(p.value, 3)
  ) %>% 
  gt() %>%
  fmt_number(columns = c(estimate, std.error), decimals = 2) %>%
  tab_header(
    title = "Regression results",
    subtitle = "Dependent variable: log(GDP per capita)"
  )

Regression results
Dependent variable: log(GDP per capita)
term	estimate	std.error	statistic	p.value
(Intercept)	9.60	0.05	195.99	0
pol_stab	0.70	0.05	13.83	0
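
A p-value shown as 0 is a rounding artifact, since p-values are never exactly zero. As a sketch of one alternative, base R's `format.pval()` turns very small values into a "less than" label. The p-values below are copied from the first regression table, and the 0.001 cutoff is a common convention rather than a requirement.

```r
# p-values copied from the first regression table above
p_values <- c(1.176300e-71, 8.254138e-28)

# Rounding to 3 decimals makes both look like exactly 0
round(p_values, 3)

# format.pval() reports anything below eps as "less than eps" instead
p_labels <- format.pval(p_values, digits = 3, eps = 0.001)
p_labels
```

The result is text rather than a number, so it suits a display column in a table but not further calculations.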

For Future Reference

Keyboard Shortcuts

Make a new code chunk (Command + Option + I on Mac, Ctrl + Alt + I on Windows)
Insert the assignment operator <- (Option + Dash on Mac, Alt + Dash on Windows)

Code Chunk Options

Chunk option	What it controls	Why it matters in reports	Plain-language explanation
`echo`	Whether the R code is shown	Separates analysis from presentation	Show or hide the code
`eval`	Whether the code is executed	Lets you display example code without running it	Run this code or not
`message`	Package and function messages	Keeps output clean and professional	Hide messages like “Attaching package…”
`warning`	Warning messages	Prevents confusing output for readers	Show or hide warnings
`include`	Whether code and results appear	Allows silent background computation	Run this but don’t show anything
`results`	How printed output is treated	Required for tables and formatted text	Control how output is displayed
`fig.width` / `fig.height`	Figure size (in inches)	Ensures readable, publication-ready figures	Control plot size
`fig.cap`	Figure caption text	Enables figure numbering and captions	Text shown under the figure
`cache`	Whether results are saved	Speeds up slow reports	Don’t re-run unless code changes
`error`	Behavior when errors occur	Useful for teaching and debugging	Keep knitting even if this fails
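
Rather than repeating the same options on every chunk, you can set defaults once near the top of the document, usually in a setup chunk. The values below are just one reasonable starting point, not a recommendation from this tutorial.

```r
library(knitr)

# Set defaults for every chunk that follows in the document
opts_chunk$set(
  echo = TRUE,       # show code
  message = FALSE,   # hide package messages
  warning = FALSE,   # hide warnings
  fig.width = 8,     # figure width in inches
  fig.height = 6     # figure height in inches
)

# Check a current default
opts_chunk$get("message")
```

Any single chunk can still override a default in its own header, for example `{r, echo = FALSE}` to hide the code for just that chunk.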

lm(y ~ x, data = df)

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
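
The output above is shown without the code that produced it. It matches what `summary()` returns for `cars`, a small dataset of speeds and stopping distances that ships with R, so the chunk was most likely something like this sketch:

```r
# cars is a dataset that ships with R (speed and stopping distance)
cars_summary <- summary(cars)
cars_summary
```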