For my midterm project, I wanted to connect the analysis to my major, international studies, so I looked at whether countries with higher female literacy rates also tend to have higher female life expectancy.
I chose this topic because, as we have heard in school, education (especially for women) is often linked to health and quality of life, and I was curious whether that relationship shows up in real-world data.
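The data I use come from the World Bank Open Data website, downloaded as two files: one for female literacy rate and one for female life expectancy. Each file has one row per country, along with some regional aggregate rows (for example, "Africa Eastern and Southern"), and one column per year.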
The two main variables I use are:
- Female literacy rate (% of females ages 15 and above)
- Female life expectancy at birth (years)
library(tidyverse)
literacy <- read.csv("literacy.csv", skip = 4)
life_exp <- read.csv("life_exp.csv", skip = 4)
head(literacy[, 1:6])
## Country.Name Country.Code
## 1 Aruba ABW
## 2 Africa Eastern and Southern AFE
## 3 Afghanistan AFG
## 4 Africa Western and Central AFW
## 5 Angola AGO
## 6 Albania ALB
## Indicator.Name
## 1 Literacy rate, adult female (% of females ages 15 and above)
## 2 Literacy rate, adult female (% of females ages 15 and above)
## 3 Literacy rate, adult female (% of females ages 15 and above)
## 4 Literacy rate, adult female (% of females ages 15 and above)
## 5 Literacy rate, adult female (% of females ages 15 and above)
## 6 Literacy rate, adult female (% of females ages 15 and above)
## Indicator.Code X1960 X1961
## 1 SE.ADT.LITR.FE.ZS NA NA
## 2 SE.ADT.LITR.FE.ZS NA NA
## 3 SE.ADT.LITR.FE.ZS NA NA
## 4 SE.ADT.LITR.FE.ZS NA NA
## 5 SE.ADT.LITR.FE.ZS NA NA
## 6 SE.ADT.LITR.FE.ZS NA NA
head(life_exp[, 1:6])
## Country.Name Country.Code
## 1 Aruba ABW
## 2 Africa Eastern and Southern AFE
## 3 Afghanistan AFG
## 4 Africa Western and Central AFW
## 5 Angola AGO
## 6 Albania ALB
## Indicator.Name Indicator.Code X1960 X1961
## 1 Life expectancy at birth, female (years) SP.DYN.LE00.FE.IN 67.45900 67.39400
## 2 Life expectancy at birth, female (years) SP.DYN.LE00.FE.IN 45.99807 46.36846
## 3 Life expectancy at birth, female (years) SP.DYN.LE00.FE.IN 33.54900 34.04300
## 4 Life expectancy at birth, female (years) SP.DYN.LE00.FE.IN 39.81759 40.12092
## 5 Life expectancy at birth, female (years) SP.DYN.LE00.FE.IN 39.73900 39.88500
## 6 Life expectancy at birth, female (years) SP.DYN.LE00.FE.IN 58.87700 59.99500
## Preparing my data
literacy_2019 <- literacy[, c("Country.Name", "Country.Code", "X2019")]
life_2019 <- life_exp[, c("Country.Name", "Country.Code", "X2019")]
names(literacy_2019)[3] <- "literacy_rate"
names(life_2019)[3] <- "life_expectancy"
project_data <- merge(literacy_2019, life_2019, by = c("Country.Name", "Country.Code"))
head(project_data)
## Country.Name Country.Code literacy_rate life_expectancy
## 1 Afghanistan AFG NA 66.14400
## 2 Africa Eastern and Southern AFE 67.35 66.43444
## 3 Africa Western and Central AFW 51.63 58.20482
## 4 Albania ALB NA 81.06000
## 5 Algeria DZA 74.21 77.09100
## 6 American Samoa ASM NA 75.72300
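The merged preview above already shows several missing literacy values. A minimal sketch of how that missingness could be checked before any analysis is given below. The data frames `lit_toy` and `life_toy` and all of their values are made up for illustration; they only stand in for `literacy_2019` and `life_2019`.

```r
# Toy stand-ins for the two 2019 tables (values are invented, not World Bank data)
lit_toy <- data.frame(
  Country.Code  = c("AAA", "BBB", "CCC"),
  literacy_rate = c(92.5, NA, 61.0)
)
life_toy <- data.frame(
  Country.Code    = c("AAA", "BBB", "CCC"),
  life_expectancy = c(80.1, 74.3, NA)
)

# Same kind of merge as in the report: match rows on the country code
merged_toy <- merge(lit_toy, life_toy, by = "Country.Code")

# Count missing values in each column
missing_counts <- colSums(is.na(merged_toy))
print(missing_counts)

# Keep only rows where both variables are present
keep <- complete.cases(merged_toy[, c("literacy_rate", "life_expectancy")])
complete_toy <- merged_toy[keep, ]
print(complete_toy)
```

Running the same two checks on `project_data` would show how many rows can actually be used in the comparison.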
## Now I'm creating high and low literacy groups
mean_lit <- mean(project_data$literacy_rate, na.rm = TRUE)
project_data$group <- ifelse(project_data$literacy_rate >= mean_lit, "High Literacy", "Low Literacy")
table(project_data$group)
##
## High Literacy Low Literacy
## 61 36
head(project_data)
## Country.Name Country.Code literacy_rate life_expectancy
## 1 Afghanistan AFG NA 66.14400
## 2 Africa Eastern and Southern AFE 67.35 66.43444
## 3 Africa Western and Central AFW 51.63 58.20482
## 4 Albania ALB NA 81.06000
## 5 Algeria DZA 74.21 77.09100
## 6 American Samoa ASM NA 75.72300
## group
## 1 <NA>
## 2 Low Literacy
## 3 Low Literacy
## 4 <NA>
## 5 Low Literacy
## 6 <NA>
summary(project_data)
## Country.Name Country.Code literacy_rate life_expectancy
## Length :266 Length :266 Min. : 18.64 Min. :38.77
## N.unique :266 N.unique :266 1st Qu.: 63.01 1st Qu.:69.57
## N.blank : 0 N.blank : 0 Median : 87.37 Median :76.61
## Min.nchar: 4 Min.nchar: 3 Mean : 79.40 Mean :75.33
## Max.nchar: 73 Max.nchar: 3 3rd Qu.: 95.91 3rd Qu.:81.29
## Max. :100.00 Max. :88.32
## NAs :169 NAs :1
## group
## Length :266
## N.unique : 2
## N.blank : 0
## Min.nchar: 12
## Max.nchar: 13
## NAs :169
##
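The grouping step above can be sketched on a small made-up vector to show why rows with a missing literacy rate end up with `NA` in the `group` column. The values in `lit_toy` are invented for illustration; only the `ifelse()` logic mirrors the report.

```r
# Invented literacy rates, including one missing value
lit_toy <- c(95, 60, NA, 80, 85)

# Mean of the non-missing values: (95 + 60 + 80 + 85) / 4 = 80
mean_toy <- mean(lit_toy, na.rm = TRUE)

# Values at or above the mean are "High Literacy"; NA stays NA
group_toy <- ifelse(lit_toy >= mean_toy, "High Literacy", "Low Literacy")

# useNA = "ifany" makes the missing group visible in the count
print(table(group_toy, useNA = "ifany"))
```

Because `ifelse()` returns `NA` whenever the comparison is `NA`, those rows are silently dropped by `table()`, `boxplot()`, and `t.test()` unless missing values are requested explicitly.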
Using 2019 values, I divided the countries into two groups based on whether their female literacy rate was at or above the average (about 79.4%) or below it. Only 97 of the 266 rows in the merged data have a 2019 literacy value, so the remaining 169 rows are left out of the comparison. I created a boxplot because it is a good way to compare the distribution of female life expectancy between two groups. I also used an independent two-sample t-test because it lets me check whether the difference in average life expectancy between the two groups is statistically significant.
boxplot(life_expectancy ~ group,
        data = project_data,
        main = "Female Life Expectancy by Literacy Group",
        xlab = "Literacy Group",
        ylab = "Female Life Expectancy (Years)")
## Statistical Analysis
To answer my research question, I used an independent two-sample t-test. R's t.test() runs the Welch version of the test by default, which does not assume that the two groups have equal variances. This test compares the average female life expectancy between countries with high female literacy rates and countries with low female literacy rates. I used a standard significance level of α = 0.05.
H₀: μHigh = μLow
H₁: μHigh ≠ μLow
Null Hypothesis (H₀): There is no difference in the average female life expectancy between countries with high female literacy rates and countries with low female literacy rates.
Alternative Hypothesis (H₁): Countries with high female literacy rates have a different average female life expectancy than countries with low female literacy rates.
t_test <- t.test(life_expectancy ~ group, data = project_data)
t_test
##
## Welch Two Sample t-test
##
## data: life_expectancy by group
## t = 8.0649, df = 53.371, p-value = 8.401e-11
## alternative hypothesis: true difference in means between group High Literacy and group Low Literacy is not equal to 0
## 95 percent confidence interval:
## 8.245268 13.702885
## sample estimates:
## mean in group High Literacy mean in group Low Literacy
## 76.96767 65.99359
Results of T-Test
The Welch t-test produced t = 8.06 (df ≈ 53.4) and a p-value of 8.401 × 10⁻¹¹, which is much smaller than my significance level of α = 0.05. Because the p-value is less than 0.05, I reject the null hypothesis. Based on this analysis, there is strong evidence that countries with high female literacy rates have a different average female life expectancy than countries with low female literacy rates. In my dataset, countries with high female literacy had an average female life expectancy of about 77.0 years, compared to about 66.0 years for countries with low female literacy, a difference of about 11.0 years (95% confidence interval: about 8.2 to 13.7 years).
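To make the reported t statistic and degrees of freedom less of a black box, here is a minimal sketch that computes the Welch statistic by hand and compares it with `t.test()`. The vectors `high` and `low` are invented values, not the project data, so the numbers will not match the output above.

```r
# Invented life expectancy values for two groups (illustration only)
high <- c(78, 81, 75, 80, 77)
low  <- c(66, 70, 62, 68)

# Per-group variance of the mean: s^2 / n
v_high <- var(high) / length(high)
v_low  <- var(low)  / length(low)

# Welch t statistic: difference in means over the combined standard error
t_manual <- (mean(high) - mean(low)) / sqrt(v_high + v_low)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (v_high + v_low)^2 /
  (v_high^2 / (length(high) - 1) + v_low^2 / (length(low) - 1))

# t.test() uses the Welch version by default (var.equal = FALSE)
res <- t.test(high, low)

print(c(manual = t_manual, builtin = unname(res$statistic)))
print(c(manual = df_manual, builtin = unname(res$parameter)))

# Individual pieces of the result can be pulled out by name
print(res$p.value)
print(res$conf.int)
```

With the formula interface used in the report, the groups are ordered alphabetically, so the estimated difference is "High Literacy" minus "Low Literacy".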
I also created a scatterplot to look at the relationship between female literacy rate and female life expectancy across all countries.
plot(project_data$literacy_rate,
     project_data$life_expectancy,
     main = "Female Literacy Rate vs Female Life Expectancy",
     xlab = "Female Literacy Rate (%)",
     ylab = "Female Life Expectancy (Years)")
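One possible way to put a number on the pattern in a scatterplot like this is a Pearson correlation together with a simple linear fit. This was not part of my original analysis; the sketch below uses invented vectors (`lit_toy`, `life_toy`) only to show the calls.

```r
# Invented paired values; one literacy value is missing on purpose
lit_toy  <- c(45, 60, 72, 88, 95, NA)
life_toy <- c(58, 63, 70, 76, 81, 74)

# cor.test() drops incomplete pairs and returns the correlation
# together with a test of whether it differs from zero
cor_res <- cor.test(lit_toy, life_toy)
print(cor_res$estimate)
print(cor_res$p.value)

# A simple linear fit; lm() also omits rows with missing values by default
fit <- lm(life_toy ~ lit_toy)
print(coef(fit))
```

After a `plot()` call, `abline(fit)` would draw the fitted line on top of the points.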
Based on my analyses, I found that countries with higher female literacy rates generally have higher female life expectancy. The boxplot showed that countries in the high literacy group tended to have higher life expectancy values, and the t-test confirmed that this difference was statistically significant.
This project helped me see how education and health can be related on a global scale. While this analysis does not prove that literacy directly causes people to live longer, it does show a strong association between the two variables.
In the future, I think it would be interesting to include other factors such as income level, access to healthcare, or education spending to see whether they also help explain differences in female life expectancy.
World Bank Open Data. (2026). Literacy rate, adult female (% of females ages 15 and above).
World Bank Open Data. (2026). Life expectancy at birth, female (years).