# Import the tidyverse library for data manipulation and visualization
library(tidyverse)
# Import the scales library for graph customization
library(scales)
Project 2
“The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future. At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries - developed and developing - in a global partnership.” - UNESCO on Sustainable Development Goals
The data used in this project focus on UNESCO’s Sustainable Development Goal (SDG) 4.3—specifically, indicators 4.3.1, 4.3.2, and 4.3.3, which track progress in equitable access to quality technical, vocational, and tertiary education. These variables are numeric, representing scores or percentages that reflect progress toward these educational goals. The dataset spans the years 2010 to 2022 and includes data for all European Union (EU) countries, sourced from UNESCO. Additionally, the project incorporates GDP per capita data, measured in euros, from Eurostat to evaluate the relationship between economic capacity and educational outcomes.
Data preparation involved several steps, primarily composed of the following: - Filtering the SDG datasets to include only EU countries to ensure comparability across a common economic and political framework. - Creating new variables to capture specific years (e.g., 2022 values) and trends (e.g., year-over-year differences and long-term improvements). - Calculating averages across years to provide summary insights into each country’s performance over time. - Merging data from multiple dataframes into a single cohesive dataset for comprehensive analysis. - Exporting the cleaned and combined dataset for use in Tableau Public, where visualizations highlight patterns and relationships.
This topic and dataset were chosen for their relevance to the EU’s efforts toward achieving educational equity and sustainability. Examining the relationship between economic factors and educational outcomes provides valuable insights into the effectiveness of EU and UNESCO policies, making it meaningful for understanding how economic and social progress are interconnected.
Import necessary libraries
Set the working directory
Import and clean SDG 4.3 data
# Import the data for SDG 4.3.1, 4.3.2, and 4.3.3 from CSV files
# Each dataset corresponds to a different SDG 4.3 indicator
<- read_csv("./data/4.3.1.csv") # SDG 4.3.1 - Participation rate of youth and adults in formal and non-formal education and training in the previous 12 months
d431 <- read_csv("./data/4.3.2.csv") # SDG 4.3.2 - Gross enrolment ratio for tertiary education by sex
d432 <- read_csv("./data/4.3.3.csv") # SDG 4.3.3 - Participation rate in technical-vocational programmes (15- to 24-year-olds) d433
Focus on EU countries
Cyprus is part of the EU, but geographically located in Asia and categorized as such by UNESCO
# Filter the datasets to retain only the "Europe and Northern America" region
# and remove the unnecessary `Region` column after adjusting Cyprus's classification.
<- d431 %>% filter(Region == cyprus_region) %>% select(-Region) # SDG 4.3.1
d431 <- d432 %>% filter(Region == cyprus_region) %>% select(-Region) # SDG 4.3.2
d432 <- d433 %>% filter(Region == cyprus_region) %>% select(-Region) # SDG 4.3.3 d433
Fix country column name
Fix Netherlands country name
# A tibble: 6 × 7
country `2010` `2011` `2012` `2013` `2014` `2015`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Austria 13.9 13.8 14 14.1 14.2 14.2
2 Belgium 18.6 18.4 17.9 18.0 18.6 18.1
3 Bulgaria 11.8 11.3 11.5 11.4 11.2 11.6
4 Croatia 13.5 14.0 14.6 14.3 13.5 14.0
5 Czechia 14.1 13.8 13.5 13.0 12.5 11.9
6 Denmark 41.5 41.4 41.0 41.3 41.8 41.2
# A tibble: 6 × 7
`2016` `2017` `2018` `2019` `2020` `2021` `2022`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14.0 14.3 13.5 13.2 13.2 13.1 12.9
2 18.6 19.2 18.3 18.8 17.8 23.7 18.5
3 11.6 11.5 11.9 11.4 11.1 11.7 NA
4 13.1 12.6 13.5 13.6 13.5 15.0 12.7
5 11.8 11.8 11.6 11.4 11.4 11.7 11.7
6 38.4 36.8 33.5 35 30.5 32.4 27.9
Get the differences between 2010 and 2022 data for 4.3.3
Focusing on 4.3.3 as the main indicator
# Calculate the differences between 2010 and 2022 for SDG 4.3.3 by country
# The code uses the `coalesce()` function to handle missing 2010 data by falling back to later years if necessary
<- d433 %>%
d433 mutate(difference_433_2010_2022 = `2022` - # Subtract 2022 from 2010 (or closest available year)
coalesce(`2010`, `2011`, `2012`, `2013`)) # Fallback to later years
Import and clean economic data
# Import the GDP per capita data from the CSV file
<- read_csv("./data/gdppercapita.csv") %>%
gdp_per_capita_long select(geo, TIME_PERIOD, OBS_VALUE) %>% # Remove unnecessary columns for analysis
rename(country = geo, year = TIME_PERIOD, value = OBS_VALUE) # Rename columns for consistency
Compare change in GDP and change in 4.3.3 between 2010 and 2022
# A tibble: 6 × 4
`2022` difference_433_2010_2022 difference_gdp_per_ca…¹ 433_over_gdp_2010_20…²
<dbl> <dbl> <dbl> <dbl>
1 28.8 4.08 3390 0.00120
2 23.3 -1.83 3830 -0.000478
3 20.2 4.35 2650 0.00164
4 24.5 0.950 3930 0.000242
5 29.3 3.54 3520 0.00101
6 12.2 -2.39 7720 -0.000310
# ℹ abbreviated names: ¹difference_gdp_per_capita_2010_2022,
# ²`433_over_gdp_2010_2022`
Create long versions of data
Add 2022 data to long dataframes
Compile data into one file
# Merge the data into a single dataset for easier analysis
<- list(d431_long, d432_long, d433_long, gdp_per_capita_long) %>%
data reduce(full_join, by = c("country", "year")) %>%
rename(gdp_per_capita = value) # Rename the GDP per capita column for clarity
# Create a column for the GDP per capita of each country in 2022
<- data %>%
data left_join(gdp_per_capita %>% select(country, `2022`),
by = c("country" = "country")) %>% # Join GDP data for 2022
rename(gdp_per_capita_2022 = `2022`) # Rename the 2022 column for clarity
Change the column order in data
Add Boolean columns for improvement over time
# Add a column to indicate whether the country improved in 4.3.3 between 2010 and 2022
<- data %>%
data group_by(country) %>%
mutate(improved_2010_2022 = difference_433_2010_2022 > 0) # Improvement if the difference is positive
Create columns for effective 4.3.2 value
# Invert the 4.3.2 score to reflect improvement, as lower values indicate worse outcomes.
<- data %>%
data mutate(`432_effective` = 100 - abs(100 - `432`), # Inverts the score, so lower values become higher (reflecting better outcomes)
`432_effective_2022` = 100 - abs(100 - coalesce(`2022_432`, as.numeric(d432[19, "2021"])))) # Use 2021 data for the Netherlands, as 2022 data is missing
Create columns for average score
# Create a column for the average score across the three indicators: 4.3.1, effective 4.3.2, and 4.3.3
<- data %>%
data mutate(average_score = rowMeans(select(., `431`, `432_effective`, `433`),
na.rm = TRUE), # Calculate row-wise mean, ignoring NA values
average_score_2022 = rowMeans(select(., `2022_431`, `432_effective_2022`, `2022_433`),
na.rm = TRUE)) # Calculate row-wise mean, ignoring NA values
Give each country a region
Ensure columns are the correct type
Change the column order in data again
Create visualizations
Plot each 4.3 subgoal by year
Plot the average 4.3 subgoal measured value by year
Compare the difference in 4.3.3 measured score from 2010 to 2022 and the GDP per capita in 2022
Compare the average 4.3 subgoal measured value by GDP per capita in 2022
Investigate the link between GDP per capita and improvement in subgoal 4.3.3
The negative correlation between GDP per capita and improvement in 4.3.3 measured value implies that poorer countries make more judicious use of funds when improving performance with respect to subgoal 4.3.3, or that poorer countries have more room for improvement in this respect and/or that improving in 4.3.3 is easier for poorer countries.
Create a linear regression
# Fit a linear regression model to compare GDP per capita and the average measured values for the SDG 4.3 goal
<- lm(gdp_per_capita_2022 ~ average_score_2022, data = data)
lm_model summary(lm_model)
Call:
lm(formula = gdp_per_capita_2022 ~ average_score_2022, data = data)
Residuals:
Min 1Q Median 3Q Max
-22121 -15050 -4548 11451 49751
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43876.2 5184.0 8.464 7.25e-16 ***
average_score_2022 -380.1 134.9 -2.818 0.0051 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18560 on 349 degrees of freedom
Multiple R-squared: 0.02225, Adjusted R-squared: 0.01945
F-statistic: 7.942 on 1 and 349 DF, p-value: 0.005105
The low p-value suggests that the results are unlikely to be random. However, the very low adjusted R-squared value suggests that the model is very inaccurate.
Visualize the linear regression
Warning: package 'viridis' was built under R version 4.4.2
Loading required package: viridisLite
Attaching package: 'viridis'
The following object is masked from 'package:scales':
viridis_pal
Warning: package 'plotly' was built under R version 4.4.2
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
`geom_smooth()` using formula = 'y ~ x'
Make final changes before exporting the data
Fix Boolean columns for compatibility with Tableau
Fix Bulgaria’s 2022 4.3.1 data
# A tibble: 6 × 7
country year region `431` `432` `433` `2022_431`
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Austria 2010 Western Europe 13.9 75.0 24.7 12.9
2 Austria 2011 Western Europe 13.8 77.3 24.2 12.9
3 Austria 2012 Western Europe 14 79.6 23.8 12.9
4 Austria 2013 Western Europe 14.1 81.0 29.6 12.9
5 Austria 2014 Western Europe 14.2 81.0 29.2 12.9
6 Austria 2015 Western Europe 14.2 82.3 28.6 12.9
# A tibble: 6 × 7
`2022_432` `2022_433` gdp_per_capita gdp_per_capita_2022 `433_over_gdp`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 94.5 28.8 35170 38560 0.000703
2 94.5 28.8 36080 38560 0.000669
3 94.5 28.8 36140 38560 0.000658
4 94.5 28.8 35840 38560 0.000827
5 94.5 28.8 35830 38560 0.000815
6 94.5 28.8 35930 38560 0.000797
# ℹ 2 more variables: difference_433_2010_2022 <dbl>,
# difference_gdp_per_capita_2010_2022 <dbl>
# A tibble: 6 × 7
`433_over_gdp_2010_2022` improved_2010_2022 improved_from_previous_year
<dbl> <int> <int>
1 0.00120 1 1
2 0.00120 1 0
3 0.00120 1 0
4 0.00120 1 1
5 0.00120 1 0
6 0.00120 1 0
# ℹ 4 more variables: `432_effective` <dbl>, `432_effective_2022` <dbl>,
# average_score <dbl>, average_score_2022 <dbl>
Export the data
# Save the cleaned and transformed data to a CSV file
write.csv(data, file = "data.csv", row.names = FALSE)
Essay
Introduction
“The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future. At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries - developed and developing - in a global partnership.” - UNESCO on Sustainable Development Goals
The data used in this project focus on UNESCO’s Sustainable Development Goal (SDG) 4.3—specifically, indicators 4.3.1, 4.3.2, and 4.3.3, which track progress in equitable access to quality technical, vocational, and tertiary education. These variables are numeric, representing scores or percentages that reflect progress toward these educational goals. The dataset spans the years 2010 to 2022 and includes data for all European Union (EU) countries, sourced from UNESCO. Additionally, the project incorporates GDP per capita data, measured in euros, from Eurostat to evaluate the relationship between economic capacity and educational outcomes.
Data preparation involved several steps, primarily composed of the following: - Filtering the SDG datasets to include only EU countries to ensure comparability across a common economic and political framework. - Creating new variables to capture specific years (e.g., 2022 values) and trends (e.g., year-over-year differences and long-term improvements). - Calculating averages across years to provide summary insights into each country’s performance over time. - Merging data from multiple dataframes into a single cohesive dataset for comprehensive analysis. - Exporting the cleaned and combined dataset for use in Tableau Public, where visualizations highlight patterns and relationships.
This topic and dataset were chosen for their relevance to the EU’s efforts toward achieving educational equity and sustainability. Examining the relationship between economic factors and educational outcomes provides valuable insights into the effectiveness of EU and UNESCO policies, making it meaningful for understanding how economic and social progress are interconnected.
Final essay
The research for this project began about a week before the due date, when Project 2 was first introduced in class. My data was approved two days later, after I had found a source for the SDG 4.3 data (UNESCO), a source for the GDP per capita data (Eurostat), and a source for the shapefiles used in Tableau (Natural Earth). I originally thought about using data from data.census.gov, like with Project 1, but I chose to instead focus on the European Union instead.
The visualizations are quite varied and represent many different things. The fact that SDF 4 includes multiple categories, and that 4.3 includes three subgoals, increases the complexity of the project. There aren’t very many patterns present in the visualizations, and GDP per capita and improvement in subgoal 4.3 do not seem to be strongly connected. However, a consistent trend is that countries with a lower GDP per capita tend to start with a lower score and slowly improve. This makes another pattern, that poorer countries tend to improve more, make sense, since the countries that start lower have more room to improve, and generally, lower scores are easier to improve in general.
While working on the project, I realized that I did not know where I wanted to go with it. This is probably visible through the lack of a strong central theme or goal. I plan to plan earlier and spend more time planning in order to make a more cohesive final result.
Data sources: UNESCO and Eurostat Shapefiles source: Natural Earth