report-ver2-fixed.knit

Introduction

In this project, I explore whether there is a relationship between a country’s education expenditure as a percentage of GDP and the number of refugees it hosts.

Research Question:

Does higher education expenditure (% of GDP) correlate with more refugees being hosted by the country?

This hypothesis is based on the idea that countries investing more in education may have stronger social systems to support refugee populations.

Data

I used two datasets:
- World Bank: Education expenditure (% of GDP)
- UNHCR: Number of refugees by country and year

The datasets were joined using a composite key (Country Name + Year) in R, equivalent to VLOOKUP in Excel.

library(readxl)
library(tidyverse)
library(readr)

# Load World Bank data
excel_file <- file.path("data", "Edu_GDP_Refugee.xlsx")
wb_data <- read_excel(excel_file, sheet = "WORLDBANK", range = "A5:P271")

# Load UNHCR data
unhcr_data <- read_excel(excel_file, sheet = "UNHCR_元Data")

# Countries of interest
countries <- c("Canada", "United Kingdom", "India", "Kenya",
               "Philippines", "United States", "Japan")

# World Bank filtering & reshaping
wb_filt <- wb_data |>
  filter(`Country Name` %in% countries) |>
  pivot_longer(cols = `2010`:`2023`,
               names_to = "Year",
               values_to = "EduExp_GDP") |>
  mutate(Key = str_c(`Country Name`, Year)) |>
  select(`Country Name`, Year, Key, EduExp_GDP) |>
  filter(!is.na(EduExp_GDP))

# UNHCR filtering
unhcr_filt <- unhcr_data |>
  filter(`Country of Asylum` %in% countries) |>
  mutate(Key = str_c(`Country of Asylum`, Year)) |>
  select(Key, Refugees)

# Join datasets
combined <- wb_filt |>
  left_join(unhcr_filt, by = "Key")

# Remove missing
combined_clean <- combined |>
  filter(!is.na(EduExp_GDP), !is.na(Refugees))

# Preview
head(combined_clean)

## # A tibble: 6 × 5
##   `Country Name` Year  Key        EduExp_GDP Refugees
##   <chr>          <chr> <chr>           <dbl>    <dbl>
## 1 Canada         2019  Canada2019       4.77   101757
## 2 Canada         2020  Canada2020       4.89   109214
## 3 Canada         2021  Canada2021       4.75   130125
## 4 Canada         2022  Canada2022       4.14   140621
## 5 India          2019  India2019        3.90   195103
## 6 India          2020  India2020        4.04   195373

Exploratory Data Analysis

First, I checked for missing values and summarized the dataset.

# Check for missing values
sum(is.na(combined_clean))

## [1] 0

# Show basic summary statistics
summary(combined_clean)

##  Country Name           Year               Key              EduExp_GDP   
##  Length:22          Length:22          Length:22          Min.   :3.155  
##  Class :character   Class :character   Class :character   1st Qu.:3.591  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.001  
##                                                           Mean   :4.109  
##                                                           3rd Qu.:4.720  
##                                                           Max.   :5.374  
##     Refugees     
##  Min.   :   680  
##  1st Qu.:  1215  
##  Median :119670  
##  Mean   :171237  
##  3rd Qu.:235230  
##  Max.   :538899

Next, I visualized the relationship with a scatter plot and calculated the correlation coefficient.

ggplot(combined_clean,
       aes(x = EduExp_GDP, y = Refugees, color = `Country Name`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Education Expenditure vs Refugees",
       x = "Education Expenditure (% of GDP)",
       y = "Number of Refugees")

## `geom_smooth()` using formula = 'y ~ x'

Figure 1: Scatter plot showing education expenditure vs number of refugees, with a fitted regression line.

# Correlation coefficient
cor(combined_clean$EduExp_GDP, combined_clean$Refugees)

## [1] 0.6656362

Methods

I used a simple linear regression model to quantify the relationship between education expenditure and the number of refugees.

model <- lm(Refugees ~ EduExp_GDP, data = combined_clean)
summary(model)

## 
## Call:
## lm(formula = Refugees ~ EduExp_GDP, data = combined_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212734  -74235  -17379   58023  396088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -623986     201784  -3.092 0.005744 ** 
## EduExp_GDP    193524      48516   3.989 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 146300 on 20 degrees of freedom
## Multiple R-squared:  0.4431, Adjusted R-squared:  0.4152 
## F-statistic: 15.91 on 1 and 20 DF,  p-value: 0.000722

Results

The scatter plot shows a moderate positive trend.
The correlation coefficient provides the strength and direction of the relationship.
From the regression output:

The coefficient for EduExp_GDP shows how the number of refugees changes per unit increase.
The R-squared value indicates the proportion of variance explained by education expenditure alone.

I also generated predicted refugee counts for further use.

combined_clean <- combined_clean |>
  mutate(Refugees_Pred = predict(model))

# Save the dataset with predictions for further analysis
write_csv(combined_clean, "data/combined_with_predictions.csv")
head(combined_clean)

## # A tibble: 6 × 6
##   `Country Name` Year  Key        EduExp_GDP Refugees Refugees_Pred
##   <chr>          <chr> <chr>           <dbl>    <dbl>         <dbl>
## 1 Canada         2019  Canada2019       4.77   101757       299689.
## 2 Canada         2020  Canada2020       4.89   109214       321948.
## 3 Canada         2021  Canada2021       4.75   130125       294980.
## 4 Canada         2022  Canada2022       4.14   140621       177119.
## 5 India          2019  India2019        3.90   195103       130238.
## 6 India          2020  India2020        4.04   195373       157877.

Discussion

This simple model suggests a moderate positive relationship between education expenditure and the number of refugees hosted by a country overall (correlation coefficient ≈ 0.67). The R-squared value of about 0.44 indicates that education expenditure alone explains around 44% of the variation in refugee numbers, but other factors are likely significant as well. For example, countries like Canada and India show a clear positive trend, while Kenya shows a slightly negative one, highlighting the importance of local context.

Limitations:
- This model only uses one predictor variable (education expenditure).
- Other possible factors such as GDP per capita, political climate, or migration policies are not included.
- The dataset is limited to selected countries and years, which may affect generalizability.

Next Steps:
- Include more predictors (e.g., GDP per capita, population).
- Use multiple regression models to control for other variables that may affect refugee numbers and better isolate the impact of education expenditure.
- Add more recent years and additional countries to the dataset to improve robustness.

Personal Reflection:
Through this analysis, I realized that quantitative data alone is not enough — understanding refugee dynamics requires a broader view of social, economic, and political contexts. In the future, I want to explore how policy interventions beyond education spending could better support both refugees and host communities.

References

World Bank: https://data.worldbank.org/
UNHCR: https://www.unhcr.org/refugee-statistics/
R packages: tidyverse, readr, readxl, ggplot2