Project 2

Author

Ryan Seabold

“The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future. At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries - developed and developing - in a global partnership.” - UNESCO on Sustainable Development Goals

The data used in this project focus on UNESCO’s Sustainable Development Goal (SDG) 4.3—specifically, indicators 4.3.1, 4.3.2, and 4.3.3, which track progress in equitable access to quality technical, vocational, and tertiary education. These variables are numeric, representing scores or percentages that reflect progress toward these educational goals. The dataset spans the years 2010 to 2022 and includes data for all European Union (EU) countries, sourced from UNESCO. Additionally, the project incorporates GDP per capita data, measured in euros, from Eurostat to evaluate the relationship between economic capacity and educational outcomes.

Data preparation involved several steps, primarily composed of the following: - Filtering the SDG datasets to include only EU countries to ensure comparability across a common economic and political framework. - Creating new variables to capture specific years (e.g., 2022 values) and trends (e.g., year-over-year differences and long-term improvements). - Calculating averages across years to provide summary insights into each country’s performance over time. - Merging data from multiple dataframes into a single cohesive dataset for comprehensive analysis. - Exporting the cleaned and combined dataset for use in Tableau Public, where visualizations highlight patterns and relationships.

This topic and dataset were chosen for their relevance to the EU’s efforts toward achieving educational equity and sustainability. Examining the relationship between economic factors and educational outcomes provides valuable insights into the effectiveness of EU and UNESCO policies, making it meaningful for understanding how economic and social progress are interconnected.

Import necessary libraries

# Import the tidyverse library for data manipulation and visualization

library(tidyverse)

# Import the scales library for graph customization

library(scales)

Set the working directory

Import and clean SDG 4.3 data

# Import the data for SDG 4.3.1, 4.3.2, and 4.3.3 from CSV files
# Each dataset corresponds to a different SDG 4.3 indicator

d431 <- read_csv("./data/4.3.1.csv") # SDG 4.3.1 - Participation rate of youth and adults in formal and non-formal education and training in the previous 12 months
d432 <- read_csv("./data/4.3.2.csv") # SDG 4.3.2 - Gross enrolment ratio for tertiary education by sex 
d433 <- read_csv("./data/4.3.3.csv") # SDG 4.3.3 - Participation rate in technical-vocational programmes (15- to 24-year-olds)

Focus on EU countries

Cyprus is part of the EU, but geographically located in Asia and categorized as such by UNESCO

# Filter the datasets to retain only the "Europe and Northern America" region
# and remove the unnecessary `Region` column after adjusting Cyprus's classification.

d431 <- d431 %>% filter(Region == cyprus_region) %>% select(-Region) # SDG 4.3.1
d432 <- d432 %>% filter(Region == cyprus_region) %>% select(-Region) # SDG 4.3.2
d433 <- d433 %>% filter(Region == cyprus_region) %>% select(-Region) # SDG 4.3.3

Fix country column name

Fix Netherlands country name

# A tibble: 6 × 7
  country  `2010` `2011` `2012` `2013` `2014` `2015`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Austria    13.9   13.8   14     14.1   14.2   14.2
2 Belgium    18.6   18.4   17.9   18.0   18.6   18.1
3 Bulgaria   11.8   11.3   11.5   11.4   11.2   11.6
4 Croatia    13.5   14.0   14.6   14.3   13.5   14.0
5 Czechia    14.1   13.8   13.5   13.0   12.5   11.9
6 Denmark    41.5   41.4   41.0   41.3   41.8   41.2

# A tibble: 6 × 7
  `2016` `2017` `2018` `2019` `2020` `2021` `2022`
   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1   14.0   14.3   13.5   13.2   13.2   13.1   12.9
2   18.6   19.2   18.3   18.8   17.8   23.7   18.5
3   11.6   11.5   11.9   11.4   11.1   11.7   NA  
4   13.1   12.6   13.5   13.6   13.5   15.0   12.7
5   11.8   11.8   11.6   11.4   11.4   11.7   11.7
6   38.4   36.8   33.5   35     30.5   32.4   27.9

Get the differences between 2010 and 2022 data for 4.3.3

Focusing on 4.3.3 as the main indicator

# Calculate the differences between 2010 and 2022 for SDG 4.3.3 by country
# The code uses the `coalesce()` function to handle missing 2010 data by falling back to later years if necessary

d433 <- d433 %>%
  mutate(difference_433_2010_2022 = `2022` - # Subtract 2022 from 2010 (or closest available year)
           coalesce(`2010`, `2011`, `2012`, `2013`)) # Fallback to later years

Import and clean economic data

# Import the GDP per capita data from the CSV file

gdp_per_capita_long <- read_csv("./data/gdppercapita.csv") %>%
  select(geo, TIME_PERIOD, OBS_VALUE) %>% # Remove unnecessary columns for analysis
  rename(country = geo, year = TIME_PERIOD, value = OBS_VALUE) # Rename columns for consistency

Compare change in GDP and change in 4.3.3 between 2010 and 2022

# A tibble: 6 × 4
  `2022` difference_433_2010_2022 difference_gdp_per_ca…¹ 433_over_gdp_2010_20…²
   <dbl>                    <dbl>                   <dbl>                  <dbl>
1   28.8                    4.08                     3390               0.00120 
2   23.3                   -1.83                     3830              -0.000478
3   20.2                    4.35                     2650               0.00164 
4   24.5                    0.950                    3930               0.000242
5   29.3                    3.54                     3520               0.00101 
6   12.2                   -2.39                     7720              -0.000310
# ℹ abbreviated names: ¹difference_gdp_per_capita_2010_2022,
#   ²`433_over_gdp_2010_2022`

Create long versions of data

Add 2022 data to long dataframes

Compile data into one file

# Merge the data into a single dataset for easier analysis

data <- list(d431_long, d432_long, d433_long, gdp_per_capita_long) %>%
  reduce(full_join, by = c("country", "year")) %>%
  rename(gdp_per_capita = value) # Rename the GDP per capita column for clarity

# Create a column for the GDP per capita of each country in 2022

data <- data %>%
  left_join(gdp_per_capita %>% select(country, `2022`),
            by = c("country" = "country")) %>%  # Join GDP data for 2022
  rename(gdp_per_capita_2022 = `2022`)  # Rename the 2022 column for clarity

Change the column order in data

Add Boolean columns for improvement over time

# Add a column to indicate whether the country improved in 4.3.3 between 2010 and 2022

data <- data %>%
  group_by(country) %>%
  mutate(improved_2010_2022 = difference_433_2010_2022 > 0)  # Improvement if the difference is positive

Create columns for effective 4.3.2 value

# Invert the 4.3.2 score to reflect improvement, as lower values indicate worse outcomes.

data <- data %>%
  mutate(`432_effective` = 100 - abs(100 - `432`), # Inverts the score, so lower values become higher (reflecting better outcomes)
         `432_effective_2022` = 100 - abs(100 - coalesce(`2022_432`, as.numeric(d432[19, "2021"])))) # Use 2021 data for the Netherlands, as 2022 data is missing

Create columns for average score

# Create a column for the average score across the three indicators: 4.3.1, effective 4.3.2, and 4.3.3

data <- data %>%
  mutate(average_score = rowMeans(select(., `431`, `432_effective`, `433`),
                                  na.rm = TRUE), # Calculate row-wise mean, ignoring NA values
         average_score_2022 = rowMeans(select(., `2022_431`, `432_effective_2022`, `2022_433`),
                                       na.rm = TRUE)) # Calculate row-wise mean, ignoring NA values

Give each country a region

Ensure columns are the correct type

Change the column order in data again

Create visualizations

Plot each 4.3 subgoal by year

Plot the average 4.3 subgoal measured value by year

Compare the difference in 4.3.3 measured score from 2010 to 2022 and the GDP per capita in 2022

Compare the average 4.3 subgoal measured value by GDP per capita in 2022

Investigate the link between GDP per capita and improvement in subgoal 4.3.3

The negative correlation between GDP per capita and improvement in 4.3.3 measured value implies that poorer countries make more judicious use of funds when improving performance with respect to subgoal 4.3.3, or that poorer countries have more room for improvement in this respect and/or that improving in 4.3.3 is easier for poorer countries.

Create a linear regression

# Fit a linear regression model to compare GDP per capita and the average measured values for the SDG 4.3 goal

lm_model <- lm(gdp_per_capita_2022 ~ average_score_2022, data = data)
summary(lm_model)


Call:
lm(formula = gdp_per_capita_2022 ~ average_score_2022, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-22121 -15050  -4548  11451  49751 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         43876.2     5184.0   8.464 7.25e-16 ***
average_score_2022   -380.1      134.9  -2.818   0.0051 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18560 on 349 degrees of freedom
Multiple R-squared:  0.02225,   Adjusted R-squared:  0.01945 
F-statistic: 7.942 on 1 and 349 DF,  p-value: 0.005105

The low p-value suggests that the results are unlikely to be random. However, the very low adjusted R-squared value suggests that the model is very inaccurate.

Visualize the linear regression

Warning: package 'viridis' was built under R version 4.4.2

Loading required package: viridisLite


Attaching package: 'viridis'

The following object is masked from 'package:scales':

    viridis_pal

Warning: package 'plotly' was built under R version 4.4.2


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

`geom_smooth()` using formula = 'y ~ x'

Make final changes before exporting the data

Fix Boolean columns for compatibility with Tableau

Fix Bulgaria’s 2022 4.3.1 data

# A tibble: 6 × 7
  country  year region         `431` `432` `433` `2022_431`
  <chr>   <dbl> <chr>          <dbl> <dbl> <dbl>      <dbl>
1 Austria  2010 Western Europe  13.9  75.0  24.7       12.9
2 Austria  2011 Western Europe  13.8  77.3  24.2       12.9
3 Austria  2012 Western Europe  14    79.6  23.8       12.9
4 Austria  2013 Western Europe  14.1  81.0  29.6       12.9
5 Austria  2014 Western Europe  14.2  81.0  29.2       12.9
6 Austria  2015 Western Europe  14.2  82.3  28.6       12.9

# A tibble: 6 × 7
  `2022_432` `2022_433` gdp_per_capita gdp_per_capita_2022 `433_over_gdp`
       <dbl>      <dbl>          <dbl>               <dbl>          <dbl>
1       94.5       28.8          35170               38560       0.000703
2       94.5       28.8          36080               38560       0.000669
3       94.5       28.8          36140               38560       0.000658
4       94.5       28.8          35840               38560       0.000827
5       94.5       28.8          35830               38560       0.000815
6       94.5       28.8          35930               38560       0.000797
# ℹ 2 more variables: difference_433_2010_2022 <dbl>,
#   difference_gdp_per_capita_2010_2022 <dbl>

# A tibble: 6 × 7
  `433_over_gdp_2010_2022` improved_2010_2022 improved_from_previous_year
                     <dbl>              <int>                       <int>
1                  0.00120                  1                           1
2                  0.00120                  1                           0
3                  0.00120                  1                           0
4                  0.00120                  1                           1
5                  0.00120                  1                           0
6                  0.00120                  1                           0
# ℹ 4 more variables: `432_effective` <dbl>, `432_effective_2022` <dbl>,
#   average_score <dbl>, average_score_2022 <dbl>

Export the data

# Save the cleaned and transformed data to a CSV file

write.csv(data, file = "data.csv", row.names = FALSE)

Essay

Introduction

Final essay

The research for this project began about a week before the due date, when Project 2 was first introduced in class. My data was approved two days later, after I had found a source for the SDG 4.3 data (UNESCO), a source for the GDP per capita data (Eurostat), and a source for the shapefiles used in Tableau (Natural Earth). I originally thought about using data from data.census.gov, like with Project 1, but I chose to instead focus on the European Union instead.

The visualizations are quite varied and represent many different things. The fact that SDF 4 includes multiple categories, and that 4.3 includes three subgoals, increases the complexity of the project. There aren’t very many patterns present in the visualizations, and GDP per capita and improvement in subgoal 4.3 do not seem to be strongly connected. However, a consistent trend is that countries with a lower GDP per capita tend to start with a lower score and slowly improve. This makes another pattern, that poorer countries tend to improve more, make sense, since the countries that start lower have more room to improve, and generally, lower scores are easier to improve in general.

While working on the project, I realized that I did not know where I wanted to go with it. This is probably visible through the lack of a strong central theme or goal. I plan to plan earlier and spend more time planning in order to make a more cohesive final result.

Data sources: UNESCO and Eurostat Shapefiles source: Natural Earth