##Data Set Introduction
I chose a dataset that contains information on global education, including student enrollment rates, literacy rates, and rates of tertiary (college) education. All the data was collected from the UNESCO Institute for Statistics and a global database. I mostly cleaned up the data by selecting the columns I wanted to focus on. There were almost 30 columns, and I narrowed them down to fewer than 10 by focusing on literacy rates and birth rates. I also created new columns to explore relationships such as the difference in literacy rates between genders.
This data set caught my eye because I believe that bringing problems to light increases awareness. Whether it is gender disparity or general shortcomings in education, it should all be brought to light.
Importing Data/Showing Data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(leaflet)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
setwd("/Users/Briancaceres/Desktop/Data_110")
education_dataset <- read.csv("projectdosdata - Sheet1.csv")
head(education_dataset)
## Countries.and.areas Latitude Longitude OOSR_Pre0Primary_Age_Male
## 1 Afghanistan 34.5289 69.1725 0
## 2 Albania 41.3275 19.8189 4
## 3 Algeria 36.7525 3.0420 0
## 4 Andorra 42.5078 1.5211 0
## 5 Angola -8.8368 13.2343 31
## 6 Anguilla 18.2170 -63.0578 14
## OOSR_Pre0Primary_Age_Female OOSR_Primary_Age_Male OOSR_Primary_Age_Female
## 1 0 0 0
## 2 2 6 3
## 3 0 0 0
## 4 0 0 0
## 5 39 0 0
## 6 0 0 0
## OOSR_Lower_Secondary_Age_Male OOSR_Lower_Secondary_Age_Female
## 1 0 0
## 2 6 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## OOSR_Upper_Secondary_Age_Male OOSR_Upper_Secondary_Age_Female
## 1 44 69
## 2 21 15
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Completion_Rate_Primary_Male Completion_Rate_Primary_Female
## 1 67 40
## 2 94 96
## 3 93 93
## 4 0 0
## 5 63 57
## 6 0 0
## Completion_Rate_Lower_Secondary_Male Completion_Rate_Lower_Secondary_Female
## 1 49 26
## 2 98 97
## 3 49 65
## 4 0 0
## 5 42 32
## 6 0 0
## Completion_Rate_Upper_Secondary_Male Completion_Rate_Upper_Secondary_Female
## 1 32 14
## 2 76 80
## 3 22 37
## 4 0 0
## 5 24 15
## 6 0 0
## Grade_2_3_Proficiency_Reading Grade_2_3_Proficiency_Math
## 1 22 25
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Primary_End_Proficiency_Reading Primary_End_Proficiency_Math
## 1 13 11
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Lower_Secondary_End_Proficiency_Reading Lower_Secondary_End_Proficiency_Math
## 1 0 0
## 2 48 58
## 3 21 19
## 4 0 0
## 5 0 0
## 6 0 0
## Youth_15_24_Literacy_Rate_Male Youth_15_24_Literacy_Rate_Female Birth_Rate
## 1 74 56 32.49
## 2 99 100 11.78
## 3 98 97 24.28
## 4 0 0 7.20
## 5 0 0 40.73
## 6 0 0 0.00
## Gross_Primary_Education_Enrollment Gross_Tertiary_Education_Enrollment
## 1 104.0 9.7
## 2 107.0 55.0
## 3 109.9 51.4
## 4 106.4 0.0
## 5 113.5 9.3
## 6 0.0 0.0
## Unemployment_Rate
## 1 11.12
## 2 12.33
## 3 11.70
## 4 0.00
## 5 6.89
## 6 0.00
I want to make all column names lowercase to keep later code consistent. I also see that many countries have missing data recorded as 0, so I want to convert these 0s to NA.
names(education_dataset) <- tolower(names(education_dataset))
education_dataset[education_dataset == 0] <- NA
education_dataset |>
head()
## countries.and.areas latitude longitude oosr_pre0primary_age_male
## 1 Afghanistan 34.5289 69.1725 NA
## 2 Albania 41.3275 19.8189 4
## 3 Algeria 36.7525 3.0420 NA
## 4 Andorra 42.5078 1.5211 NA
## 5 Angola -8.8368 13.2343 31
## 6 Anguilla 18.2170 -63.0578 14
## oosr_pre0primary_age_female oosr_primary_age_male oosr_primary_age_female
## 1 NA NA NA
## 2 2 6 3
## 3 NA NA NA
## 4 NA NA NA
## 5 39 NA NA
## 6 NA NA NA
## oosr_lower_secondary_age_male oosr_lower_secondary_age_female
## 1 NA NA
## 2 6 1
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## oosr_upper_secondary_age_male oosr_upper_secondary_age_female
## 1 44 69
## 2 21 15
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## completion_rate_primary_male completion_rate_primary_female
## 1 67 40
## 2 94 96
## 3 93 93
## 4 NA NA
## 5 63 57
## 6 NA NA
## completion_rate_lower_secondary_male completion_rate_lower_secondary_female
## 1 49 26
## 2 98 97
## 3 49 65
## 4 NA NA
## 5 42 32
## 6 NA NA
## completion_rate_upper_secondary_male completion_rate_upper_secondary_female
## 1 32 14
## 2 76 80
## 3 22 37
## 4 NA NA
## 5 24 15
## 6 NA NA
## grade_2_3_proficiency_reading grade_2_3_proficiency_math
## 1 22 25
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## primary_end_proficiency_reading primary_end_proficiency_math
## 1 13 11
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## lower_secondary_end_proficiency_reading lower_secondary_end_proficiency_math
## 1 NA NA
## 2 48 58
## 3 21 19
## 4 NA NA
## 5 NA NA
## 6 NA NA
## youth_15_24_literacy_rate_male youth_15_24_literacy_rate_female birth_rate
## 1 74 56 32.49
## 2 99 100 11.78
## 3 98 97 24.28
## 4 NA NA 7.20
## 5 NA NA 40.73
## 6 NA NA NA
## gross_primary_education_enrollment gross_tertiary_education_enrollment
## 1 104.0 9.7
## 2 107.0 55.0
## 3 109.9 51.4
## 4 106.4 NA
## 5 113.5 9.3
## 6 NA NA
## unemployment_rate
## 1 11.12
## 2 12.33
## 3 11.70
## 4 NA
## 5 6.89
## 6 NA
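One caveat with replacing every 0 is that a genuine zero (for example, a coordinate of exactly 0) is also turned into NA. A minimal sketch of a narrower alternative, assuming only the indicator columns use 0 as a missing-value code, is shown below. The toy data frame `toy` and its column names are made up for illustration and are not taken from the real file.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  country = c("A", "B", "C"),
  longitude = c(0, 19.8, 3.04),    # a real coordinate can be exactly 0
  literacy_male = c(74, 0, 98),    # here 0 is assumed to mean "not reported"
  literacy_female = c(56, 0, 97)
)

# Replace 0 with NA only in the literacy columns, leaving longitude untouched
toy_clean <- toy |>
  mutate(across(starts_with("literacy_"), ~ na_if(.x, 0)))

toy_clean
```

Whether a 0 in a given column really means "missing" depends on the source data, so the set of columns passed to `across()` is a judgment call.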
I want to see if there is a correlation between a country's birth rate and various other variables. First I have to simplify some data, so I will average the male and female literacy rates into a single mean literacy rate.
total_education_data <- education_dataset |>
  group_by(countries.and.areas) |>
  mutate(mean_literacy_rate = (youth_15_24_literacy_rate_female + youth_15_24_literacy_rate_male) / 2)
total_education_data |>
head()
## # A tibble: 6 × 30
## # Groups: countries.and.areas [6]
## countries.and.areas latitude longitude oosr_pre0primary_age_male
## <chr> <dbl> <dbl> <int>
## 1 Afghanistan 34.5 69.2 NA
## 2 Albania 41.3 19.8 4
## 3 Algeria 36.8 3.04 NA
## 4 Andorra 42.5 1.52 NA
## 5 Angola -8.84 13.2 31
## 6 Anguilla 18.2 -63.1 14
## # ℹ 26 more variables: oosr_pre0primary_age_female <int>,
## # oosr_primary_age_male <int>, oosr_primary_age_female <int>,
## # oosr_lower_secondary_age_male <int>, oosr_lower_secondary_age_female <int>,
## # oosr_upper_secondary_age_male <int>, oosr_upper_secondary_age_female <int>,
## # completion_rate_primary_male <int>, completion_rate_primary_female <int>,
## # completion_rate_lower_secondary_male <int>,
## # completion_rate_lower_secondary_female <int>, …
Scatterplot:
ggplot(total_education_data, aes(x=birth_rate, y = mean_literacy_rate)) +
labs(
x = "Birth Rate per 1000",
y = "Youth Literacy Rate",
caption = "Youth defined as ages between 15-24 years old",
title = "Comparing a Country's Birthrate to their Literacy Rate")+
geom_point()+
geom_smooth(method = "lm",
se = FALSE)+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 125 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 125 rows containing missing values (`geom_point()`).
I can see a clear negative correlation between literacy rate and birth rate:
the higher a country's birth rate, the lower its literacy rate. This may
be an indicator of a lack of access to education.
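The strength of the relationship seen in a scatterplot like this can be quantified with a correlation coefficient. The sketch below uses made-up numbers in a hypothetical `toy` data frame rather than the real data, so the value it prints says nothing about the actual dataset. The argument `use = "complete.obs"` drops rows where either value is NA, similar to the rows ggplot removed from the plot.

```r
# Hypothetical values for illustration only
toy <- data.frame(
  birth_rate = c(32.5, 11.8, 24.3, 7.2, 40.7, NA),
  mean_literacy_rate = c(65, 99.5, 97.5, NA, 60, 90)
)

# Pearson correlation computed on the rows where both values are present
r <- cor(toy$birth_rate, toy$mean_literacy_rate, use = "complete.obs")
r
```

A negative value of `r` means the two variables move in opposite directions. It measures association only and does not show that one causes the other.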
I also want to explore how literacy rates compare between the sexes. I will do that with a bar graph below.
First I created a new data set to keep things organized. I selected a handful of columns after creating a new column using mutate. The new column “literacy_difference” is the female literacy rate minus the male literacy rate.
bar_plot <- total_education_data |>
group_by(countries.and.areas) |>
mutate(literacy_difference =
(youth_15_24_literacy_rate_female -
youth_15_24_literacy_rate_male)) |> #this is how I calculate the difference in literacy rate between gender and store it in a new column
select(
countries.and.areas,
literacy_difference,
longitude,
latitude,
mean_literacy_rate,
birth_rate
) |>
filter(literacy_difference != 0) |>
mutate(pos = literacy_difference >= 0)|>
na.omit() |>
arrange(desc(literacy_difference))
bar_plot |>
head()
## # A tibble: 6 × 7
## # Groups: countries.and.areas [6]
## countries.and.areas literacy_difference longitude latitude mean_literacy_rate
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Rwanda 5 26.1 44.4 86.5
## 2 Gabon 3 2.35 48.9 89.5
## 3 Honduras 3 -58.2 6.80 96.5
## 4 East Timor 3 36.3 33.5 83.5
## 5 Bangladesh 2 90.4 23.7 95
## 6 Namibia 2 -6.83 34.0 95
## # ℹ 2 more variables: birth_rate <dbl>, pos <lgl>
ggplot(bar_plot, aes(x = reorder(countries.and.areas, -literacy_difference), y = literacy_difference, fill = pos)) +
  geom_col(show.legend = FALSE) + # geom_col() plots the values as given, so it takes no stat argument
  labs(
    x = "Country",
    y = "Literacy Rate Difference",
    title = "Youth Female Literacy Rate - Youth Male Literacy Rate",
    caption = "Youth defined as age group between 15-24"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, size = 10))
Looking at the graphs above, I want to focus on the literacy disparity
between the sexes for each country. I will also try to incorporate birth rate
in my final visualization.
##Attempting to plot onto world map.
I again created a new data set to keep things organized. I only took the data columns that I wanted to show in my final visualization.
finalviz <- total_education_data |>
group_by(countries.and.areas) |>
mutate(literacy_difference =
(youth_15_24_literacy_rate_female -
youth_15_24_literacy_rate_male)) |>
select(
countries.and.areas,
literacy_difference,
longitude,
latitude,
mean_literacy_rate,
birth_rate
)
finalviz |>
head()
## # A tibble: 6 × 6
## # Groups: countries.and.areas [6]
## countries.and.areas literacy_difference longitude latitude mean_literacy_rate
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan -18 69.2 34.5 65
## 2 Albania 1 19.8 41.3 99.5
## 3 Algeria -1 3.04 36.8 97.5
## 4 Andorra NA 1.52 42.5 NA
## 5 Angola NA 13.2 -8.84 NA
## 6 Anguilla NA -63.1 18.2 NA
## # ℹ 1 more variable: birth_rate <dbl>
Creating a new data set for map interactivity:
literacy <- finalviz
literacy$longitude <- as.numeric(literacy$longitude)
literacy$latitude <- as.numeric(literacy$latitude)
literacy |>
head()
## # A tibble: 6 × 6
## # Groups: countries.and.areas [6]
## countries.and.areas literacy_difference longitude latitude mean_literacy_rate
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan -18 69.2 34.5 65
## 2 Albania 1 19.8 41.3 99.5
## 3 Algeria -1 3.04 36.8 97.5
## 4 Andorra NA 1.52 42.5 NA
## 5 Angola NA 13.2 -8.84 NA
## 6 Anguilla NA -63.1 18.2 NA
## # ℹ 1 more variable: birth_rate <dbl>
I use the paste0 function to build the popup text for the interactive map. I also use the leaflet package to map the data points using the existing latitude and longitude columns.
# Build every label from total_education_data, which has the same rows in the
# same order as finalviz, so each popup matches the circle it belongs to
labels <- paste0(
  "Birth Rate: ", total_education_data$birth_rate, "<br>",
  "Average Literacy Rate: ", total_education_data$mean_literacy_rate, "<br>",
  "Female Literacy Rate: ", total_education_data$youth_15_24_literacy_rate_female, "<br>",
  "Male Literacy Rate: ", total_education_data$youth_15_24_literacy_rate_male, "<br>"
)
literacy <- leaflet() |>
  setView(lng = 0, lat = 0, zoom = 1.5) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(data = finalviz,
             radius = finalviz$birth_rate * 3000, # radius taken from the same rows as the plotted data
             color = "brown",
             popup = labels
  )
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
literacy
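Popup labels line up with the circles only if they are built from the same rows, in the same order, as the data passed to `addCircles()`. One way to guarantee that, sketched below on a made-up `toy` data frame, is to store the label as a column of the data itself so that filtering or sorting keeps each label with its country. The column names here are illustrative, not the ones in the real dataset.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- data.frame(
  country = c("A", "B", "C"),
  birth_rate = c(32.49, 11.78, 24.28),
  mean_literacy_rate = c(65, 99.5, 97.5)
)

toy_labeled <- toy |>
  # The label is built row by row, so it always describes its own country
  mutate(label = paste0(
    "Birth Rate: ", birth_rate, "<br>",
    "Average Literacy Rate: ", mean_literacy_rate
  )) |>
  # Reordering the rows moves each label along with its row
  arrange(desc(mean_literacy_rate))

toy_labeled$label[1]
```

With leaflet, a column like this could then be passed using the formula interface, for example `popup = ~label`.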
##Final Essay
Sources: https://unstats.un.org/sdgs/report/2019/goal-04/
UNStats provided some background information on possible reasons some countries have lower literacy rates. The article shows that the Sub-Saharan Africa region has the lowest percentage of trained teachers in pre-primary school (48%). My data visualization shows low literacy rates in the same region, which is consistent with this.
The final visualization shows how regions and countries vary from each other in literacy rates and birth rates. It is powerful seeing it on the map, as one can start making educated guesses as to whether the differences in statistics are related to geographical factors.