The .Rmd and .html (or .pdf) should be uploaded in TUWEL by the deadline. Refrain from using explanatory comments in the R code chunks but write them as text instead. Points will be deducted if the submitted file is not in a decent form.

Data

Load the data set you exported in the final Task of Case Study 2. Eliminate all observations with missing values in the income status variable.

As a reminder, the data set includes world data from 2020, focusing on:

for most world entities in 2020. The data was downloaded from https://www.cia.gov/the-world-factbook/about/archives/. Additional information on continent, subcontinent/region and income status was appended to the dataset in Case Study 2.

Tasks:

a. Education expenditure in different income levels

Using ggplot2, create a density plot of the education expenditure grouped by income status. The densities for the different groups are superimposed in the same plot rather than in different plots. Ensure that you order the levels of the income status such that in the plots the legend is ordered from High (H) to Low (L).

  • The color of the density lines is black.

  • The area under the density curve should be colored differently among the income status levels.

  • For the colors, choose a transparency level of 0.5 for better visibility.

  • Position the legend at the top center of the plot and give it no title (hint: use element_blank()).

  • Rename the x axis as “Education expenditure (% of GDP)”

Comment briefly on the plot.

library(ggplot2)

# Load data, treating "." as NA
df <- read.csv("file_out2.csv", na.strings = ".")

# Remove rows where income status is missing
df <- df[!is.na(df$status), ]

# Convert expenditure to numeric (the "." were already converted to NA above)
df$expenditure <- as.numeric(df$expenditure)

# Order income status from High to Low
df$status <- factor(df$status, levels = c("H", "UM", "LM", "L"))

# Create density plot
ggplot(df[!is.na(df$expenditure), ], aes(x = expenditure, fill = status)) +
  geom_density(color = "black", alpha = 0.5) +
  scale_fill_manual(values = c("H"  = "#E69F00", "UM" = "#56B4E9", "LM" = "#009E73", "L"  = "#D55E00")) +
  labs(x = "Education expenditure (% of GDP)", y = "Density") +
  theme(
    legend.position = "top",
    legend.justification = "center",
    legend.title = element_blank()
  )

High income (H) countries tend to spend more on education, with their distribution peaking around 4–5% of GDP. As income decreases, the distributions shift towards lower expenditure values. Low income (L) countries are concentrated at the lower end, around 1–3%. However, there is considerable overlap between all groups, suggesting that income status alone does not fully determine education spending.

b. Income status in different continents

Investigate how the income status is distributed in the different continents.

  • Using ggplot2, create a stacked barplot of absolute frequencies showing how the entities are split into continents and income status. Comment the plot.

  • Create another stacked barplot of relative frequencies (height of the bars should be one). Comment the plot.

  • Create a mosaic plot of continents and income status using base R functions.

  • Briefly comment on the differences between the three plots generated to investigate the income distribution among the different continents.

# Keeping only rows where both continent and status are not missing
df_b <- df[!is.na(df$continent) & !is.na(df$status), ]

df_b$status <- factor(df_b$status, levels = c("H", "UM", "LM", "L"))


income_colors <- c("H" = "green", "UM" = "yellow", "LM" = "orange", "L" = "red")

# Plot 1: Stacked barplot of ABSOLUTE frequencies 
ggplot(df_b, aes(x = continent, fill = status)) + geom_bar(position = "stack") + scale_fill_manual(values = income_colors) +
  labs(x = "Continent", y = "Count", title = "Absolute frequency of income status by continent") + theme(legend.position = "top",
        legend.title = element_blank())

# Plot 2: Stacked barplot of RELATIVE frequencies 
ggplot(df_b, aes(x = continent, fill = status)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = income_colors) +
  labs(x = "Continent", y = "Proportion",
       title = "Relative frequency of income status by continent") +
  theme(legend.position = "top",
        legend.title = element_blank())

#  Plot 3: Mosaic plot (base R) 
mosaic_table <- table(df_b$continent, df_b$status)
mosaicplot(mosaic_table,
           main  = "Mosaic plot of continent and income status",
           xlab  = "Continent",
           ylab  = "Income Status",
           color = c("green", "yellow", "orange", "red"))

The absolute barplot is useful for seeing the actual count of countries per continent but makes proportion comparison difficult. The relative barplot makes it easy to compare income distributions across continents since all bars reach 1, but loses information about the total number of countries. The mosaic plot combines both — column width represents the number of countries per continent, while segment height shows the income proportion — making it the most complete visualization of the three.

c. Income status in different subcontinents

For Oceania, investigate further how the income status distribution is in the different subcontinents. Use one of the plots in b. for this purpose. Comment on the results.

# Filter only Oceania
df_c <- df_b[df_b$continent == "Oceania", ]

# Relative frequency stacked barplot for Oceania subcontinents
ggplot(df_c, aes(x = subcontinent, fill = status)) + geom_bar(position = "fill") + scale_fill_manual(values = income_colors) +
  labs(x = "Subcontinent", y = "Proportion", title = "Relative frequency of income status in Oceania subcontinents") +
  theme(legend.position = "top",
        legend.title = element_blank())

Australia and New Zealand consist entirely of High (H) income countries, reflecting their status as wealthy developed nations. Melanesia is predominantly Lower-Middle (LM) income, with a small proportion of Upper-Middle (UM) and High income countries. Micronesia shows a more mixed distribution across all three income groups present in Oceania. Polynesia is mostly Upper-Middle (UM) income, with some High and Lower-Middle income countries. Notably, no Low (L) income countries exist in any Oceania subcontinent, which distinguishes Oceania from continents like Africa.

d. Net migration in different continents

  • Using ggplot2, create parallel boxplots showing the distribution of the net migration rate in the different continents.

  • Prettify the plot (change y-, x-axis labels, etc).

  • Identify which country in Asia constitutes the largest negative outlier and which country in Asia constitutes the largest positive outlier.

  • Comment on the plot.

# Filter out missing continents
df_d <- df[!is.na(df$continent) & !is.na(df$net_migr_rate), ]

# --- Plot: Parallel boxplots of net migration rate by continent ---
ggplot(df_d, aes(x = continent, y = net_migr_rate, fill = continent)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Net Migration Rate (per 1,000 persons)", title = "Distribution of Net Migration Rate by Continent") +
  theme(legend.position = "none")

# --- Identify largest negative and positive outlier in Asia ---
df_asia <- df_d[df_d$continent == "Asia", ]

# Largest negative outlier (lowest net migration rate)
df_asia[which.min(df_asia$net_migr_rate), c("country", "net_migr_rate")]
# Largest positive outlier (highest net migration rate)
df_asia[which.max(df_asia$net_migr_rate), c("country", "net_migr_rate")]

Most continents have a median net migration rate close to zero, indicating a rough balance between people entering and leaving. Europe is the only continent with a clearly positive median, suggesting it is a net destination for migrants. Asia shows the most extreme outliers, with Lebanon having the largest negative rate (-88.7) and Syria the largest positive rate (27.1), likely driven by political instability in the region.

e. Net migration in different subcontinents

The graph in d. clearly does not convey the whole picture. It would be interesting also to look at the subcontinents, as it is likely that a lot of migration flows happen within the continent.

  • Investigate the net migration in different subcontinents using again parallel boxplots. Group the boxplots by continent (hint: use facet_grid with scales = "free_x").

  • Remember to prettify the plot (rotate axis labels if needed).

  • Describe what you see.

# Filter out missing subcontinents and net migration rate
df_e <- df[!is.na(df$subcontinent) & !is.na(df$net_migr_rate) & !is.na(df$continent), ]

# Parallel boxplots of net migration rate by subcontinent, grouped by continent
ggplot(df_e, aes(x = subcontinent, y = net_migr_rate, fill = continent)) +
  geom_boxplot() +
  facet_grid(. ~ continent, scales = "free_x", space = "free_x") +
  labs(x = "Subcontinent",
       y = "Net Migration Rate (per 1,000 persons)",
       title = "Distribution of Net Migration Rate by Subcontinent") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text = element_text(size = 8))

Within Asia, Western Asia shows the most variability in net migration rate, with extreme outliers in both directions, while other Asian subcontinents remain close to zero. In Europe, all subcontinents have positive or near-zero medians, confirming it is generally a migration destination, with Northern and Western Europe showing slightly higher inflows. Oceania’s Polynesia and Micronesia show consistently negative migration rates, suggesting these island regions experience steady population outflow, likely due to limited economic opportunities.

f. Median net migration rate per subcontinent.

The plot in task e. shows the distribution of the net migration rate for each subcontinent. Here you will work on visualizing only one summary statistic, namely the median.

For each subcontinent, calculate the median net migration rate. Then create a plot which contains the sub-regions on the y-axis and the median net migration rate on the x-axis.

  • As geoms use points.

  • Color the points by continent – use a colorblind friendly palette (see e.g., here).

  • Rename the axes.

  • Using fct_reorder from the forcats package, arrange the levels of subcontinent such that in the plot the lowest (bottom) subcontinent contains the lowest median net migration rate and the upper most region contains the highest median net migration rate.

  • Comment on the plot. E.g., what are the regions with the most influx? What are the regions with the most outflux?

library(forcats)

#computing median net migration rate per subcontinent

df_f_summary <- aggregate(
  net_migr_rate ~ subcontinent + continent,
  data = df_e,
  FUN  = function(x) median(x, na.rm = TRUE)
)
names(df_f_summary)[3] <- "median_migration"

#reordering subcontinent by median
df_f_summary$subcontinent <- fct_reorder(df_f_summary$subcontinent,
                                         df_f_summary$median_migration)

continent_colors <- c(
  "Africa"   = "#E69F00",
  "Americas" = "#56B4E9",
  "Asia"     = "#009E73",
  "Europe"   = "#CC79A7",
  "Oceania"  = "#0072B2"
)

ggplot(df_f_summary, aes(x = median_migration,
                          y = subcontinent,
                          color = continent)) +
  geom_point(size = 3) +
  scale_color_manual(values = continent_colors) +
  labs(
    x     = "Median net migration rate (per 1,000 persons)",
    y     = "Subcontinent",
    color = NULL
  ) +
  theme_bw() +
  theme(legend.position = "top")

Australia and New Zealand has the highest median net migration rate, making it the region with the most influx, followed by Western Europe and Northern Europe. On the other end, Micronesia shows the strongest outflux, followed by Polynesia, likely due to emigration to larger neighboring countries. European subcontinents generally attract migrants while small Pacific island regions tend to lose population. African and Asian subcontinents mostly cluster around zero, indicating relatively balanced migration flows.

g. Median youth unemployment rate per subcontinent

For each subcontinent, calculate the median youth unemployment rate. Then create a plot which contains the sub-regions on the y-axis and the median unemployment rate on the x-axis.

  • Use a black and white theme (?theme_bw())

  • As geoms use bars. (hint: pay attention to the statistical transformation taking place in geom_bar() – look into argument stat="identity")

  • Color the bars by continent – use a colorblind friendly palette.

  • Make the bars transparent (use alpha = 0.7).

  • Rename the axes.

  • Using fct_reorder from the forcats package, arrange the levels of subcontinent such that in the plot the lowest (bottom) subcontinent contains the lowest median youth unemployment rate and the upper most region contains the highest median youth unemployment rate.

  • Comment on the plot. E.g., what are the regions with the highest vs lowest youth unemployment rate?

df_g <- df[!is.na(df$subcontinent) & !is.na(df$youth_unempl_rate) & 
           !is.na(df$continent), ]

df_g_summary <- aggregate(
  youth_unempl_rate ~ subcontinent + continent,
  data = df_g,
  FUN  = function(x) median(x, na.rm = TRUE)
)
names(df_g_summary)[3] <- "median_unempl"

df_g_summary$subcontinent <- fct_reorder(
  df_g_summary$subcontinent,
  df_g_summary$median_unempl
)

ggplot(df_g_summary, aes(x = median_unempl,
                          y = subcontinent,
                          fill = continent)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  scale_fill_manual(values = continent_colors) +
  labs(
    x    = "Median youth unemployment rate (%)",
    y    = "Subcontinent",
    fill = NULL
  ) +
  theme_bw() +
  theme(legend.position = "top")

Northern Africa has the highest median youth unemployment rate, followed by Polynesia and Southern Europe. On the lower end, Eastern Asia, Central Asia, and South-eastern Asia show the lowest rates (around 8–9%), suggesting stronger youth employment in those regions. African and European subcontinents generally tend to have higher youth unemployment, while Asian subcontinents cluster toward the lower end. Northern America and Australia and New Zealand sit in the middle range, reflecting relatively moderate youth unemployment in developed Oceanian and American regions.

h. Median youth unemployment rate per subcontinent – with error bars

The value displayed in the barplot in g. is the result of an aggregation, so it might be useful to also plot error bars, to have a general idea on how precise the median unemployment is. This can be achieved by plotting the error bars which reflect the standard deviation or the interquartile range of the variable in each of the subcontinents.

Repeat the plot from Task g. but include also error bars which reflect the 25% and 75% quantiles. You can use geom_errorbar in ggplot2.

# Aggregating all three statistics at once
df_h_summary <- aggregate(
  youth_unempl_rate ~ subcontinent + continent,
  data = df_g,                          
  FUN  = function(x) c(
    median = median(x, na.rm = TRUE),
    q25    = quantile(x, 0.25, na.rm = TRUE),
    q75    = quantile(x, 0.75, na.rm = TRUE)
  )
)

df_h_summary <- do.call(data.frame, df_h_summary)
names(df_h_summary)[3:5] <- c("median_unempl", "q25", "q75")

#applying same fct_reorder as task g
df_h_summary$subcontinent <- fct_reorder(
  df_h_summary$subcontinent,
  df_h_summary$median_unempl
)

ggplot(df_h_summary, aes(x = median_unempl,
                          y = subcontinent,
                          fill = continent)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  geom_errorbar(aes(xmin = q25, xmax = q75),
                width = 0.4,
                color = "black") +
  scale_fill_manual(values = continent_colors) +
  labs(
    x    = "Median youth unemployment rate (%) with IQR",
    y    = "Subcontinent",
    fill = NULL
  ) +
  theme_bw() +
  theme(legend.position = "top")

The error bars reveal how much variation exists within each subcontinent. Polynesia and Northern Africa not only have the highest median youth unemployment but also show wide error bars, indicating high variability among countries within those regions. Sub-Saharan Africa and Latin America and the Caribbean also show notably wide IQRs, suggesting countries within these regions differ greatly from each other. On the other hand, Northern Europe, Australia and New Zealand, and Eastern Asia show very narrow error bars, meaning countries within these subcontinents have consistently similar youth unemployment rates. In general, regions with higher median unemployment tend to also show greater variability.

i. Relationship between education expenditure and net migration rate

Using ggplot2, create a plot showing the relationship between education expenditure and net migration rate.

  • Color the geoms based on the income status.

  • Add a regression line for each development status (using geom_smooth()).

Comment on the plot. Do you see any relationship between the two variables? Do you see any difference among the income levels?

# Filtering out missing values in all three needed columns
df_i <- df[!is.na(df$expenditure) & 
           !is.na(df$net_migr_rate) & 
           !is.na(df$status), ]

ggplot(df_i, aes(x = expenditure,
                  y = net_migr_rate,
                  color = status)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_color_manual(values = c("H"  = "#E69F00",
                                "UM" = "#56B4E9",
                                "LM" = "#009E73",
                                "L"  = "#D55E00")) +
  labs(
    x     = "Education expenditure (% of GDP)",
    y     = "Net migration rate (per 1,000 persons)",
    color = NULL
  ) +
  theme_bw() +
  theme(legend.position = "top")
## `geom_smooth()` using formula = 'y ~ x'

There is no strong clear relationship between education expenditure and net migration rate. The regression lines for H and L income groups are nearly flat, suggesting little to no association. The UM group shows a slight positive trend while LM shows a slight negative trend, but both come with very wide confidence bands indicating high uncertainty. The plot is heavily influenced by two extreme outliers — one country with a net migration rate of around −90 and another with very high education expenditure — which distort the scale and make patterns harder to read. High income countries (H) generally cluster around zero or slightly positive net migration, while lower income groups show more scattered values.

j. Relationship between youth unemployment and net migration rate

Create a plot as in Task i. but for youth unemployment and net migration rate. Comment briefly.

# Filter out missing values
df_j <- df[!is.na(df$youth_unempl_rate) & 
           !is.na(df$net_migr_rate) & 
           !is.na(df$status), ]

ggplot(df_j, aes(x = youth_unempl_rate,
                  y = net_migr_rate,
                  color = status)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  scale_color_manual(values = c("H"  = "#E69F00",
                                "UM" = "#56B4E9",
                                "LM" = "#009E73",
                                "L"  = "#D55E00")) +
  labs(
    x     = "Youth unemployment rate (%)",
    y     = "Net migration rate (per 1,000 persons)",
    color = NULL
  ) +
  theme_bw() +
  theme(legend.position = "top")
## `geom_smooth()` using formula = 'y ~ x'

There is no strong relationship between youth unemployment and net migration rate. The H group shows a slight negative trend — countries with higher youth unemployment tend to have slightly lower net migration, which is intuitive as fewer people migrate to countries with poor job prospects. The L group shows a slight positive trend but with a very wide confidence band, indicating high uncertainty. The UM and LM groups are nearly flat. Overall, the conclusion is similar to task i, no clear linear relationship is visible between the two variables.

k. Merging population data

Go online and find a data set which contains the 2020 population for the countries of the world together with ISO codes.

  • Download this data and merge it to the dataset you are working on in this case study using a left join. (A possible source: World Bank))

  • Inspect the data and check whether the join worked well.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pop_url <- "https://raw.githubusercontent.com/datasets/population/main/data/population.csv"
pop_raw <- read.csv(pop_url)

pop <- pop_raw[pop_raw$Year == 2020, c("Country.Code", "Value")]
names(pop) <- c("iso3c", "population")

df_k <- left_join(df, pop, by = c("ISO3" = "iso3c"))

head(df_k[, c("country", "ISO3", "population")])
sum(!is.na(df_k$population))
## [1] 215
sum(is.na(df_k$population))
## [1] 1

l. Scatterplot of education expenditure and net migration rate in Europe

Make a scatterplot of education expenditure and net migration rate for the countries of Europe.

  • Scale the size of the points according to each country’s population.

  • For better visibility, use a transparency of alpha=0.7.

  • Remove the legend.

  • Comment on the plot.

df_l <- df_k[!is.na(df_k$continent) & df_k$continent == "Europe" &
               !is.na(df_k$expenditure) & !is.na(df_k$net_migr_rate) &
               !is.na(df_k$population), ]

ggplot(df_l, aes(x = expenditure, y = net_migr_rate, size = population)) +
  geom_point(alpha = 0.7) +
  labs(x = "Education expenditure (% of GDP)",
       y = "Net migration rate (per 1,000 persons)") +
  theme_bw() +
  theme(legend.position = "none")

European countries generally have positive net migration rates, meaning more people are moving in than out. Countries with higher education expenditure tend to cluster around moderate to positive net migration values, though the relationship is not strong. Larger countries (bigger points) sit mostly near zero or slightly positive migration.

m. Interactive plot

On the merged data set from Task k., using function ggplotly from package plotly
re-create the scatterplot in Task l., but this time for all countries. Color the points according to their continent.

When hovering over the points the name of the country, the values for education expenditure, net migration rate, and population should be shown. (Hint: use the aesthetic text = Country. In ggplotly use the argument tooltip = c("text", "x", "y", "size")).

Hint: If you are creating a .pdf file, keep in mind that PDF output is static and does not support HTML widgets. To avoid build errors, you can wrap interactive plots in a conditional check or set the chunk option eval=FALSE. Alternatively, you can use the webshot package to automatically convert interactive elements into static images during the knit process.

library(plotly)
## Warning: package 'plotly' was built under R version 4.5.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
df_m <- df_k[!is.na(df_k$expenditure) & !is.na(df_k$net_migr_rate) &
               !is.na(df_k$population) & !is.na(df_k$continent), ]

p <- ggplot(df_m, aes(x = expenditure, y = net_migr_rate,
                       size = population, color = continent,
                       text = country)) +
  geom_point(alpha = 0.7) +
  labs(x = "Education expenditure (% of GDP)",
       y = "Net migration rate (per 1,000 persons)") +
  theme_bw()

ggplotly(p, tooltip = c("text", "x", "y", "size"))

n. Parallel coordinate plot

In parallel coordinate plots each observation or data point is depicted as a line traversing a series of parallel axes, corresponding to a specific variable or dimension. It is often used for identifying clusters in the data.

One can create such a plot using the GGally R package. You should create such a plot where you look at the three main variables in the data set: education expenditure, youth unemployment rate and net migration rate. Color the lines based on the income status. Briefly comment.

library(GGally)
## Warning: package 'GGally' was built under R version 4.5.3
df_n <- df_k[!is.na(df_k$expenditure) & !is.na(df_k$youth_unempl_rate) &
               !is.na(df_k$net_migr_rate) & !is.na(df_k$status), ]

ggparcoord(df_n,
           columns = which(names(df_n) %in% c("expenditure", "youth_unempl_rate", "net_migr_rate")),
           groupColumn = "status",
           scale = "uniminmax",
           alphaLines = 0.4) +
  scale_color_manual(values = c("H"="#E69F00","UM"="#56B4E9","LM"="#009E73","L"="#D55E00")) +
  labs(x = "Variable", y = "Scaled value", color = NULL) +
  theme_bw() +
  theme(legend.position = "top")

High income (H) countries tend to have higher education expenditure, lower youth unemployment, and near-zero or positive net migration. Low income (L) countries show the opposite pattern. There is no single dominant pattern linking all three variables, but income status does partially separate the lines, especially for unemployment.

o. World map visualisation

Create a world map of the education expenditure per country. You can use the vignette https://cran.r-project.org/web/packages/rworldmap/vignettes/rworldmap.pdf to find how to do this in R. Alternatively, you can use other packages (such as ggplot2, sf and rnaturalearthdata) to create a map.

library(rworldmap)
## Warning: package 'rworldmap' was built under R version 4.5.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.5.3
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')
map_data <- joinCountryData2Map(df_k,
                                joinCode = "ISO3",
                                nameJoinColumn = "ISO3",
                                verbose = FALSE)
## 215 codes from your data successfully matched countries in the map
## 1 codes from your data failed to match with a country code in the map
## 29 codes from the map weren't represented in your data
mapCountryData(map_data,
               nameColumnToPlot = "expenditure",
               mapTitle = "Education Expenditure (% of GDP) by Country, 2020",
               catMethod = "quantiles",
               colourPalette = "heat")