Introduction to Global Education Analysis

This analysis is based on the “Global Education” dataset, which collects educational measures from countries and areas around the world. It includes enrollment, out-of-school, completion, proficiency, and literacy rates alongside demographic and economic statistics such as birth and unemployment rates. The source file had to be read with ‘ISO-8859-1’ encoding, and preparing it for analysis centered on handling missing and irrelevant values.

My interest in this dataset stems from a deep curiosity about the interplay between education and societal factors. Education is a crucial driver of personal and social progress, and analyzing these global metrics offers insights into the educational landscape across different regions.

# Load packages
library(readr)
library(dplyr)
library(plotly)
library(tidyr)
# Read data from CSV file
df <- read_csv("Global_Education.csv", locale = locale(encoding = "ISO-8859-1"))
## Rows: 202 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Countries and areas
## dbl (28): Latitude, Longitude, OOSR_Pre0Primary_Age_Male, OOSR_Pre0Primary_A...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Outlier Analysis

This step identifies and removes outliers in ‘Unemployment Rate’ using the Interquartile Range (IQR) method. The IQR is the difference between the third and first quartiles (Q3 - Q1); the lower bound is Q1 - 1.5 × IQR and the upper bound is Q3 + 1.5 × IQR. Rows whose unemployment rate falls on or outside these bounds are dropped, and the filtered result replaces the original dataset. The filter reduces the data from 202 rows to the 193 rows summarized in the next section.

# IQR method for Unemployment Rate
Q1 <- quantile(df$Unemployment_Rate, 0.25, na.rm = TRUE)
Q3 <- quantile(df$Unemployment_Rate, 0.75, na.rm = TRUE)
iqr_value <- Q3 - Q1  # not named IQR, to avoid shadowing stats::IQR()

# Define bounds
lower_bound <- Q1 - 1.5 * iqr_value
upper_bound <- Q3 + 1.5 * iqr_value

# Filter out outliers (strict inequalities; rows with a missing rate are dropped too)
df_filtered <- subset(df, Unemployment_Rate > lower_bound & Unemployment_Rate < upper_bound)
df <- df_filtered
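
The IQR rule is easier to check on a handful of numbers than on the full dataset. The following is a minimal sketch in base R, not the code used for the report: the helper name iqr_fences and the values in rates are made up for illustration.

```r
# Sketch: IQR fences on a small, made-up vector of unemployment rates.
# iqr_fences() is a hypothetical helper, not part of any package.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]                      # interquartile range
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

rates <- c(2, 3, 4, 5, 6, 7, 8, 30)          # illustrative values only
fences <- iqr_fences(rates)

# Keep values strictly inside the fences, as in the filter above
kept <- rates[rates > fences["lower"] & rates < fences["upper"]]

print(fences)
print(kept)
```

With these values the single extreme observation (30) lies above the upper fence and is dropped, while the other seven values are kept.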

Data Exploration

# Summary statistics
summary(df)
##  Countries and areas    Latitude          Longitude       
##  Length:193          Min.   : 0.02356   Min.   :  0.8248  
##  Class :character    1st Qu.:11.20269   1st Qu.: 19.0208  
##  Mode  :character    Median :21.00789   Median : 45.0792  
##                      Mean   :25.19712   Mean   : 56.3761  
##                      3rd Qu.:40.14310   3rd Qu.: 78.9629  
##                      Max.   :64.96305   Max.   :178.0650  
##  OOSR_Pre0Primary_Age_Male OOSR_Pre0Primary_Age_Female OOSR_Primary_Age_Male
##  Min.   : 0.0              Min.   : 0.00               Min.   : 0.000       
##  1st Qu.: 0.0              1st Qu.: 0.00               1st Qu.: 0.000       
##  Median : 8.0              Median : 7.00               Median : 1.000       
##  Mean   :19.1              Mean   :18.79               Mean   : 5.373       
##  3rd Qu.:31.0              3rd Qu.:30.00               3rd Qu.: 6.000       
##  Max.   :96.0              Max.   :96.00               Max.   :58.000       
##  OOSR_Primary_Age_Female OOSR_Lower_Secondary_Age_Male
##  Min.   : 0.000          Min.   : 0.00                
##  1st Qu.: 0.000          1st Qu.: 0.00                
##  Median : 1.000          Median : 2.00                
##  Mean   : 5.715          Mean   : 8.87                
##  3rd Qu.: 7.000          3rd Qu.:13.00                
##  Max.   :67.000          Max.   :61.00                
##  OOSR_Lower_Secondary_Age_Female OOSR_Upper_Secondary_Age_Male
##  Min.   : 0.000                  Min.   : 0.00                
##  1st Qu.: 0.000                  1st Qu.: 1.00                
##  Median : 2.000                  Median :15.00                
##  Mean   : 9.047                  Mean   :20.63                
##  3rd Qu.:11.000                  3rd Qu.:33.00                
##  Max.   :70.000                  Max.   :84.00                
##  OOSR_Upper_Secondary_Age_Female Completion_Rate_Primary_Male
##  Min.   : 0.00                   Min.   :  0.00              
##  1st Qu.: 1.00                   1st Qu.:  0.00              
##  Median :12.00                   Median : 32.00              
##  Mean   :20.37                   Mean   : 40.57              
##  3rd Qu.:31.00                   3rd Qu.: 84.00              
##  Max.   :89.00                   Max.   :100.00              
##  Completion_Rate_Primary_Female Completion_Rate_Lower_Secondary_Male
##  Min.   :  0.00                 Min.   :  0.00                      
##  1st Qu.:  0.00                 1st Qu.:  0.00                      
##  Median : 25.00                 Median : 16.00                      
##  Mean   : 40.74                 Mean   : 31.82                      
##  3rd Qu.: 90.00                 3rd Qu.: 63.00                      
##  Max.   :100.00                 Max.   :100.00                      
##  Completion_Rate_Lower_Secondary_Female Completion_Rate_Upper_Secondary_Male
##  Min.   :  0.00                         Min.   :  0                         
##  1st Qu.:  0.00                         1st Qu.:  0                         
##  Median : 10.00                         Median :  6                         
##  Mean   : 31.98                         Mean   : 22                         
##  3rd Qu.: 69.00                         3rd Qu.: 37                         
##  Max.   :100.00                         Max.   :100                         
##  Completion_Rate_Upper_Secondary_Female Grade_2_3_Proficiency_Reading
##  Min.   :  0.00                         Min.   : 0.00                
##  1st Qu.:  0.00                         1st Qu.: 0.00                
##  Median :  4.00                         Median : 0.00                
##  Mean   : 22.12                         Mean   :22.82                
##  3rd Qu.: 37.00                         3rd Qu.:44.00                
##  Max.   :100.00                         Max.   :99.00                
##  Grade_2_3_Proficiency_Math Primary_End_Proficiency_Reading
##  Min.   : 0.00              Min.   : 0.00                  
##  1st Qu.: 0.00              1st Qu.: 0.00                  
##  Median : 0.00              Median : 0.00                  
##  Mean   :18.25              Mean   :11.04                  
##  3rd Qu.:34.00              3rd Qu.: 0.00                  
##  Max.   :97.00              Max.   :99.00                  
##  Primary_End_Proficiency_Math Lower_Secondary_End_Proficiency_Reading
##  Min.   : 0.00                Min.   : 0.00                          
##  1st Qu.: 0.00                1st Qu.: 0.00                          
##  Median : 0.00                Median : 0.00                          
##  Mean   :10.68                Mean   :26.75                          
##  3rd Qu.: 0.00                3rd Qu.:58.00                          
##  Max.   :89.00                Max.   :89.00                          
##  Lower_Secondary_End_Proficiency_Math Youth_15_24_Literacy_Rate_Male
##  Min.   : 0.00                        Min.   :  0.00                
##  1st Qu.: 0.00                        1st Qu.:  0.00                
##  Median : 0.00                        Median :  0.00                
##  Mean   :25.37                        Mean   : 36.53                
##  3rd Qu.:53.00                        3rd Qu.: 96.00                
##  Max.   :94.00                        Max.   :100.00                
##  Youth_15_24_Literacy_Rate_Female   Birth_Rate   
##  Min.   :  0.00                   Min.   : 0.00  
##  1st Qu.:  0.00                   1st Qu.:10.33  
##  Median :  0.00                   Median :17.26  
##  Mean   : 35.75                   Mean   :18.83  
##  3rd Qu.: 97.00                   3rd Qu.:27.89  
##  Max.   :100.00                   Max.   :46.08  
##  Gross_Primary_Education_Enrollment Gross_Tertiary_Education_Enrollment
##  Min.   :  0.00                     Min.   :  0.00                     
##  1st Qu.: 97.10                     1st Qu.:  8.10                     
##  Median :101.30                     Median : 27.00                     
##  Mean   : 94.63                     Mean   : 34.91                     
##  3rd Qu.:106.90                     3rd Qu.: 61.70                     
##  Max.   :142.50                     Max.   :136.60                     
##  Unemployment_Rate
##  Min.   : 0.000   
##  1st Qu.: 2.180   
##  Median : 4.390   
##  Mean   : 5.313   
##  3rd Qu.: 7.270   
##  Max.   :17.240
library(ggplot2)

# Histogram for Unemployment Rate using ggplot2
ggplot(df, aes(x = Unemployment_Rate)) +
  geom_histogram(binwidth = 1, fill = "#bf8bff", color = "black") +
  labs(title = "Histogram of Unemployment Rate", x = "Unemployment Rate", y = "Frequency") +
  theme_minimal()

This histogram shows the distribution of unemployment rates across the countries and areas that remain after outlier removal. The x-axis is the unemployment rate (in percent) and the y-axis is the number of countries in each one-point bin. Most of the data sits at the low end of the scale, with the bulk of the values between 0 and 5%. The distribution is right-skewed: the mean (5.31) exceeds the median (4.39), and rates above 10% are comparatively rare, reaching a maximum of 17.24. In other words, most countries report low unemployment, while a small number report much higher rates.
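
Right skew can also be expressed as a number rather than read off the plot. The sketch below computes the moment-based sample skewness in base R on made-up values; the function name sample_skewness and the vector rates are illustrative assumptions, and the same function could be applied to df$Unemployment_Rate.

```r
# Sketch: moment-based sample skewness (positive values indicate a right skew).
# sample_skewness() is a hypothetical helper written for this example.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

rates <- c(1, 2, 2, 3, 3, 4, 5, 7, 12, 17)   # illustrative values only
skew <- sample_skewness(rates)

print(skew)
print(c(mean = mean(rates), median = median(rates)))
```

A positive result, together with a mean above the median, is consistent with the long right tail seen in the histogram.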

Create categorical variables

# For Unemployment_Rate: split at the tertiles into low / medium / high
breaks_unemployment <- quantile(df$Unemployment_Rate, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
df$Unemployment_Category <- cut(df$Unemployment_Rate,
                                breaks = breaks_unemployment,
                                labels = c("low", "medium", "high"),
                                include.lowest = TRUE)

# For Birth_Rate: split at the tertiles into low / medium / high
breaks_birth <- quantile(df$Birth_Rate, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
df$Birth_Rate_Category <- cut(df$Birth_Rate,
                              breaks = breaks_birth,
                              labels = c("low", "medium", "high"),
                              include.lowest = TRUE)
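
To make the tertile split concrete, here is a small sketch on made-up values. It uses the same quantile() and cut() calls as the code above; the vector values and the duplicate-break check are additions for illustration. The check matters because cut() raises an error when two break points coincide, which can happen if many observations share one value.

```r
# Sketch: tertile-based categories on a made-up vector.
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)       # illustrative values only

breaks <- quantile(values, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)

# cut() fails on repeated break points, so check before binning
if (anyDuplicated(breaks) > 0) {
  stop("Quantile breaks are not unique; choose different cut points.")
}

groups <- cut(values,
              breaks = breaks,
              labels = c("low", "medium", "high"),
              include.lowest = TRUE)

print(table(groups))
```

Each category receives three of the nine values here, which is the roughly equal split that tertiles are meant to produce.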

Boxplot for Unemployment Rate

library(ggdist)
# Boxplot for Unemployment Rate
df |>
  ggplot(aes(x= Birth_Rate_Category, y = Unemployment_Rate, fill = Birth_Rate_Category)) +
  geom_boxplot(
    width = 0.12,
    outlier.color = NA,
    alpha = 0.5
  )  +
  stat_dots(
    side = "left",
    justification = 1.1,
    binwidth = 0.25
  ) +
  stat_halfeye(
    adjust = 0.5,
    justification = -0.2,
    .width = 0,
    point_colour = NA
  )

The graph is a raincloud-style plot of the unemployment rate for each birth rate category (low, medium, and high). For every category it combines three layers: a half-density curve (stat_halfeye) showing where unemployment rates are concentrated, a narrow box plot marking the median and quartiles, and a dot plot (stat_dots) of the individual countries. Outlier markers on the box plots are hidden (outlier.color = NA) because the dots already display every observation. The shape and spread of the distributions differ between categories, which suggests a possible association between birth rates and unemployment, although the plot alone does not establish one.
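
The visual comparison between categories can be backed by per-group summaries. The sketch below uses base R's tapply() on a small made-up data frame; the name toy and its columns birth_cat and unemployment are hypothetical stand-ins for df$Birth_Rate_Category and df$Unemployment_Rate.

```r
# Sketch: median and IQR of unemployment within each birth rate category.
# The data frame below is made up for illustration.
toy <- data.frame(
  birth_cat = factor(
    c("low", "low", "low", "medium", "medium", "medium", "high", "high", "high"),
    levels = c("low", "medium", "high")
  ),
  unemployment = c(3.0, 4.5, 5.0, 4.0, 6.0, 7.5, 2.0, 5.5, 9.0)
)

group_medians <- tapply(toy$unemployment, toy$birth_cat, median)
group_iqr     <- tapply(toy$unemployment, toy$birth_cat, IQR)

print(group_medians)
print(group_iqr)
```

Comparing these numbers across categories gives a quick check on whether the differences seen in the plot are large relative to the spread within each group.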

Boxplot for Birth Rate

# Boxplot for Birth Rate
df |>
  ggplot(aes(x= Unemployment_Category, y = Birth_Rate, fill = Unemployment_Category)) +
  geom_boxplot(
    width = 0.12,
    outlier.color = NA,
    alpha = 0.5
  )  +
  stat_dots(
    side = "left",
    justification = 1.1,
    binwidth = 0.25
  ) +
  stat_halfeye(
    adjust = 0.5,
    justification = -0.2,
    .width = 0,
    point_colour = NA
  )

Scatter plot for Unemployment Rates by Countries and areas

library(plotly)

# Create a scatter plot for Unemployment Rates
fig <- plot_ly(data = df, x = ~`Countries and areas`, y = ~Unemployment_Rate,
               type = 'scatter', mode = 'markers',
               marker = list(color = ~Unemployment_Rate, size = 10)) %>%
  layout(title = 'Unemployment Rates Across Countries',
         xaxis = list(title = 'Countries and areas', tickangle = -45),
         yaxis = list(title = 'Unemployment Rate'))

# Show the plot
fig

The graph is a dot plot of unemployment rates by country or area. Each dot is one country: its position on the x-axis identifies the country, and its height on the y-axis gives the unemployment rate. All dots have the same size (size = 10). The color of each dot is mapped to the unemployment rate itself, so color repeats the information on the y-axis rather than adding another variable. Several countries have unemployment rates above 10%, visible as the dots at the highest y-values.

Scatter plot for Proficiency in Reading vs. Math for Grade 2-3 Students

# Create a scatter plot
fig <- plot_ly(data = df, x = ~Grade_2_3_Proficiency_Reading, y = ~Grade_2_3_Proficiency_Math,
               type = 'scatter', mode = 'markers', hoverinfo = 'text', color = ~Youth_15_24_Literacy_Rate_Male,
               text = ~paste('Country:', `Countries and areas`, 
                             '<br>Reading Proficiency:', Grade_2_3_Proficiency_Reading, 
                             '<br>Math Proficiency:', Grade_2_3_Proficiency_Math,
                             '<br>Youth Literacy Rate (Male):', Youth_15_24_Literacy_Rate_Male),
               marker = list(size = 10, opacity = 0.7)) %>%
  layout(title = "Proficiency in Reading vs. Math for Grade 2-3 Students",
         xaxis = list(title = "Reading Proficiency"),
         yaxis = list(title = "Math Proficiency"))

# Show the plot
fig

The graph is a scatter plot of reading proficiency against math proficiency for Grade 2-3 students. Each point is one country or area, with reading proficiency on the x-axis and math proficiency on the y-axis, both expressed as percentages. Point color is mapped to the literacy rate of males aged 15 to 24 on a continuous color scale, and hovering over a point shows the country and its values. Points toward the top right indicate high proficiency in both reading and math, while points toward the bottom left indicate low proficiency in both. In the summary statistics the median of both proficiency columns is 0, so at least half of the countries sit at the origin; these zeros may stand for unreported values rather than truly absent proficiency, which would inflate the bottom-left cluster. With that caveat, the plot suggests a positive relationship between early reading and math skills, and possibly with literacy among young males.
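
A correlation coefficient puts a number on the relationship that the scatter plot suggests. The sketch below uses base R's cor() on made-up scores; the vectors reading and math are illustrative stand-ins for the Grade_2_3_Proficiency_Reading and Grade_2_3_Proficiency_Math columns, and it assumes zeros that represent unreported values have been excluded beforehand.

```r
# Sketch: Pearson and Spearman correlation between two proficiency measures.
# Values are made up for illustration.
reading <- c(20, 35, 50, 65, 80, 90)
math    <- c(15, 30, 55, 60, 70, 95)

pearson  <- cor(reading, math, method = "pearson")   # linear association
spearman <- cor(reading, math, method = "spearman")  # rank-based association

print(c(pearson = pearson, spearman = spearman))
```

Spearman's coefficient depends only on the ordering of the values, so it is less sensitive to a few extreme countries than Pearson's.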

Unemployment Rate by Countries and areas world map

# Create a choropleth map
fig <- df %>%
  plot_geo(locations = ~`Countries and areas`, locationmode = "country names") %>%
  add_trace(
    z = ~`Unemployment_Rate`, 
    colors = "Greens", 
    zmin = min(df$Unemployment_Rate), 
    zmax = max(df$Unemployment_Rate)
  ) %>%
  layout(
    title = "Unemployment Rate by Country",
    geo = list(projection = list(type = "natural earth"))
  )

# Show the plot
fig

The graphic is a choropleth world map of unemployment rates by country. The intensity of green encodes each country’s unemployment rate: lighter shades correspond to lower rates and darker shades to higher rates. The color bar on the right gives the scale, which spans the observed range from 0 to about 17%. The map allows a rapid visual comparison across regions, showing at a glance which countries have higher or lower unemployment rates.

3D scatter plot to show the relationship of Unemployment vs. Proficiency Reading/Math

df$Unemployment_Category <- as.factor(df$Unemployment_Category)

# Create a new column for colors based on 'Unemployment_Category'
df$color <- ifelse(df$Unemployment_Category == 'low', 'green',
                   ifelse(df$Unemployment_Category == 'medium', 'yellow', 'red'))

# Create the 3D scatter plot using the color column
fig <- plot_ly(data = df, x = ~Primary_End_Proficiency_Math, y = ~Primary_End_Proficiency_Reading, z = ~Unemployment_Rate,
               type = 'scatter3d', mode = 'markers',
               marker = list(size = 10, opacity = 0.7, color = ~color)) %>%
  layout(title = 'Unemployment vs. Proficiency Reading/Math')
# Show the plot
fig

The graphic is a 3D scatter plot that compares unemployment rates with reading and math proficiency at the end of primary school. The x-axis is Primary_End_Proficiency_Math, the y-axis is Primary_End_Proficiency_Reading, and the z-axis is Unemployment_Rate. Points are colored by unemployment category: green for low, yellow for medium, and red for high. Because color and height both encode unemployment, the red points necessarily sit at the top of the plot, so that placement is not in itself evidence of a relationship with proficiency. The summary statistics also show a third quartile of 0 for both primary-end proficiency measures, so at least three quarters of the countries lie at zero proficiency, possibly because the values were not reported. The plot is therefore a starting point for investigating the relationship between educational outcomes and unemployment rather than a demonstration of a trend.

The main challenge I faced was a skewed dataset that influenced the analytical outcomes. There appears to be a link between education levels and the unemployment rate, but the dataset is not exhaustive and does not cover the full spectrum of educational levels. The data collection did not obtain a balanced sample across educational backgrounds, so some educational demographics are overrepresented while others are neglected. This imbalance can distort the perceived correlation between educational attainment and unemployment, either overstating or understating the actual influence that education has on employment prospects.
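
One way to gauge how uneven the coverage is would be to measure the share of zero entries in each column, since many columns in the summary above have a median of 0. The sketch below does this in base R on a made-up data frame. The column names are borrowed from the dataset, but the values are hypothetical, and treating 0 as "not reported" is an assumption rather than something the dataset documents.

```r
# Sketch: share of zero entries per column, then recoding zeros as NA.
# Values are made up; treating 0 as missing is an assumption.
toy <- data.frame(
  Primary_End_Proficiency_Math = c(0, 0, 0, 45, 0, 62, 0, 0),
  Birth_Rate                   = c(10.3, 17.2, 0, 27.9, 12.1, 35.4, 9.8, 20.0)
)

# Fraction of rows equal to 0 in each column
zero_share <- sapply(toy, function(col) mean(col == 0, na.rm = TRUE))
print(zero_share)

# Recode zeros as missing so that later summaries skip them
toy_na <- toy
toy_na[toy_na == 0] <- NA
print(colSums(is.na(toy_na)))
```

Columns with a high zero share would then be candidates for exclusion, or for analysis restricted to the countries that actually report a value.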

Bibliography

Reference links:
https://online.hbs.edu/blog/post/types-of-statistical-bias
https://www.unesco.org/en
https://www.unicefusa.org/