DATA101 Midterm Project

My midterm project was on a dataset of complete U.S. foreign expenses. From foreignassistance.gov; almost 3.5GB, 3967456 rows/observations by 56 columns/variables: expenses by country, region, managing/implementing/funding agency/entity/partner(s), type of expense, how much, and when. I decided to find out whether the U.S. switched foreign priorities (aid category type) after the 2020s.

# load libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.3

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.3

## Warning: package 'dplyr' was built under R version 4.5.3

## Warning: package 'forcats' was built under R version 4.5.3

## Warning: package 'lubridate' was built under R version 4.5.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# load dataset
setwd("C:/Users/chesl/Desktop/DATA101/datasets")
df <- read_csv("us_foreign_aid_complete.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 3967456 Columns: 56
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (32): Country Code, Country Name, Region Name, Income Group ID, Income G...
## dbl (24): Country ID, Region ID, Managing Agency ID, Managing Sub-agency or ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# dataset dimensions: 3967456*56
dim(df)

## [1] 3967456      56

Cleaning the dataset was relatively easy: remove the unnecessary vars, rename the remaining vars (tolower(), gsub(), rename()); filter for after 2000, but before 2024 (2025–6 columns are blank), and remove na values. Note: using “constant” dollar amounts, i.e. adjusted for inflation.

# selecting important vars
df2 <- df |> select(
  "Country Name",
  "Managing Agency Name", "Managing Sub-agency or Bureau Name",
  "Implementing Partner Category Name", "Implementing Partner Sub-category Name", "Implementing Partner Name",
  "US Category Name", "US Sector Name",
  "Funding Account Name", "Funding Agency Name",
  "Activity Name", "Activity Description",
  "Transaction Type Name", "Fiscal Year", "Current Dollar Amount", "Constant Dollar Amount")

names(df2) <- tolower(names(df2))
names(df2) <- gsub(" ", "_", names(df2))

df3 <- df2 |>
  rename(year = fiscal_year) |>
  rename(category = us_category_name) |>
  rename(constant = constant_dollar_amount) |>
  filter((year >= 2000) & (year <= 2024)) |>
  filter(category != "Multi-sector") |>
  filter(!is.na(year) & !is.na(category) & !is.na(constant))

# df_pre/post: data before/after 2020.
df_pre <- df3 |>
  filter((year < 2020))

df_pre <- df_pre |>
  group_by(category) |>
  summarize(p = sum(constant)/sum(df_pre$constant))

df_post <- df3 |>
  filter(year >= 2020)

df_post <- df_post |>
  group_by(category) |>
  summarize(expenses = sum(constant))

plot1 <- df3 |>
  ggplot(aes(x = year, y = constant)) +
  geom_line()

plot1

This linechart shows total spending by year. Surprisingly, it follows a rough normal distribution, peaking in the middle of the graph, 2010–2012. This graph suggests that the Obama adminstration spent the most amount of money on foreign expenses, whereas President Bush, Trump (1st term), and Biden spent comparatively less.

Statistical analysis: Chi-square test

I am trying to determine whether the 2020s heralded enough change to change the distribution of U.S. foreign expenses. The pre-COVID proportions are the assumed “normal” scenario, which

Ho: the post-COVID expenses have the same proportions.

Ha: the post-COVID expenses have different proportions.

chisq.test(df_post$expenses, df_pre$p)

## Warning in chisq.test(df_post$expenses, df_pre$p): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  df_post$expenses and df_pre$p
## X-squared = 56, df = 49, p-value = 0.2289

# X-squared: 56, df = 49, p-value = 0.2289

# conclusion: do not reject null hypothesis.

The Chi-square test did not surface significant evidence. At a p-value of 0.2289, the post-COVID19 proportions occur 22.89% of the time under pre-COVID19 conditions. What’s the most interesting to me is that category proportions may be related to total spending; although I proved that the year likely doesn’t matter as much with proportions, I would like to revisit this project in the future to assess whether the amount of money influences with categories/sectors are priofize

dataset source: foreignassistance.gov/data, accessed before June 30, 2026.

DATA101 Midterm Project

Cheslav Lukashanets