Since ancient times people have written myths and tales about what might help us, if not attain immortality, then at least extend our time on Earth. Even though the survival instinct guided human behavior from the earliest societies, average life expectancy remained below 40 years for most of history. With advances in medicine and improvements in living conditions, the twentieth and twenty-first centuries brought a real breakthrough: average life expectancy surpassed 70 years worldwide, and in some countries, for example, Japan, it reached 83 years.
But why did this happen? Japan is undoubtedly a highly developed country, yet this dramatic increase occurred over a relatively short period in the second half of the twentieth century. Many countries with similar economic development levels do not share the same high life expectancy. Could it be that relatively small factors, such as literacy or access to electricity, play an outsized role?
This is precisely what we investigate in our project. Our main research question is: How do particular socioeconomic, health, and environmental factors influence life expectancy across countries?
We used the lifeexpectancy1.csv dataset to answer this question. The dataset contains 63 rows and 20 columns. Each row represents a country, and the 20 columns include variables such as life expectancy, homicide rate, schooling, and various socioeconomic indicators. The names of all variables are listed below.
dim(life)
## [1] 63 20
names(life)
## [1] "Country" "literacyrate" "homicidiesper100k"
## [4] "electricity" "Schooling" "HIV.AIDS"
## [7] "Status" "wateraccess" "tuberculosis"
## [10] "inflation" "healthexppercapita" "fertilityrate"
## [13] "lifeexp" "internet" "gdppercapita"
## [16] "CO2" "forest" "urbanpop"
## [19] "urbanpopgrowth" "leastdeveloped"
The table below gives a general overview of the structure of the dataset.
head(life)
## # A tibble: 6 × 20
## Country literacyrate homicidiesper100k electricity Schooling HIV.AIDS Status
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Albania 96.8 4.01 2309. 14.2 0.1 Devel…
## 2 Algeria 72.6 1.48 1356. 14.4 0.1 Devel…
## 3 Armenia 99.7 2.48 1966. 12.7 0.1 Devel…
## 4 Azerbaij… 99.8 2.48 2202. 12.2 0.1 Devel…
## 5 Bahrain 94.6 0.524 19592. 14.5 0.1 Devel…
## 6 Banglade… 59.7 2.83 310. 10 0.1 Least…
## # ℹ 13 more variables: wateraccess <dbl>, tuberculosis <dbl>, inflation <dbl>,
## # healthexppercapita <dbl>, fertilityrate <dbl>, lifeexp <dbl>,
## # internet <dbl>, gdppercapita <dbl>, CO2 <dbl>, forest <dbl>,
## # urbanpop <dbl>, urbanpopgrowth <dbl>, leastdeveloped <dbl>
The histogram allows us to examine the distribution of life expectancy, which is the main focus of our analysis.
hist(life$lifeexp)
Through the table below, we can see summary statistics such as the mean, median, minimum, and maximum values of life expectancy in the dataset.
life %>%
summarise(
mean_life = mean(lifeexp, na.rm = TRUE),
median_life = median(lifeexp, na.rm = TRUE),
min_life = min(lifeexp, na.rm = TRUE),
max_life = max(lifeexp, na.rm = TRUE)
)
## # A tibble: 1 × 4
## mean_life median_life min_life max_life
## <dbl> <dbl> <dbl> <dbl>
## 1 74.5 74.9 60.8 83.2
Apart from the country name, Status is the only variable stored as text, so it is categorical rather than continuous. The table below shows how many countries fall into each status.
table(life$Status)
##
## Developed Developing Leastdeveloped
## 14 45 4
We can check the average and median life expectancy by status in the table below.
life %>%
group_by(Status) %>%
summarise(
mean_life = mean(lifeexp, na.rm = TRUE),
median_life = median(lifeexp, na.rm = TRUE)
)
## # A tibble: 3 × 3
## Status mean_life median_life
## <chr> <dbl> <dbl>
## 1 Developed 78.7 78.9
## 2 Developing 73.9 74.5
## 3 Leastdeveloped 67.1 67.9
For this project, we use the Life Expectancy Dataset containing country-level indicators across economic, social, demographic, and environmental dimensions. The dataset includes variables compiled from authoritative international sources, such as the World Health Organization (WHO), the World Bank, and the United Nations Development Programme (UNDP). The version we use was accessed through Kaggle, which serves as a repository for publicly available datasets.
Our dataset contained many different factors, but we chose to focus on the less obvious ones. It is immediately clear that countries with a higher burden of disease tend to have lower life expectancy, so we aimed to look beyond these straightforward relationships and concentrate on subtler, less intuitive influences.
To explain cross-country differences, we selected several predictors from different domains:
Homicide rate – captures public safety and social stability. High levels of violence can directly reduce life expectancy and signal broader structural issues.
Average years of schooling – represents educational attainment, which is strongly linked to health behaviors, employment opportunities, and long-term well-being.
Access to electricity and safe water – measures basic infrastructure. These services support disease prevention and improve daily living conditions, especially in lower-income countries.
GDP per capita – reflects a country’s economic development and overall living conditions. It is a widely used determinant of population health.
In this project, only minimal data cleaning was required because the Life Expectancy dataset was already well-structured.
First, the CSV file was loaded into R using read_csv(). Several variables were then recoded for analysis. For example, for the first graph, the variable leastdeveloped was converted from numeric values (0/1) into categorical labels ("Non-LDC", "LDC"), and a new factor variable was created using ntile() to group countries into four levels of homicide rate.
In addition, summary functions such as mean() and median() were applied with na.rm = TRUE to safely handle any missing values. These steps allowed us to prepare the dataset in a clean and consistent format before conducting the visualizations.
To begin examining which factors may influence life expectancy, we start with one of the less obvious variables in our dataset — homicide rate. In this first graph, countries are grouped into four levels of violence.
We observe that nations in the lowest homicide group have the highest median life expectancy, while those in the highest group show clearly lower values. At the same time, the moderate, high, and very high homicide categories cluster at nearly the same level, suggesting that once violence exceeds a certain threshold, further increases do not lead to equally sharp declines in life expectancy.
A notable detail is the position of Least Developed Countries: instead of falling into the very high homicide group, they appear in the moderate and high homicide groups, indicating that their lower life expectancy is shaped by broader structural factors beyond violence alone.
life$leastdeveloped <- factor(
life$leastdeveloped,
levels = c(0, 1),
labels = c("Non-LDC", "LDC")
)
life <- life %>%
mutate(
homicide_quartile = ntile(homicidiesper100k, 4),
homicide_quartile = factor(
homicide_quartile,
labels = c("Low violence", "Moderate", "High", "Very high")
)
)
library(ggplot2)
life_homicide_boxplot <- ggplot(
life,
aes(
x = homicide_quartile,
y = lifeexp,
fill = leastdeveloped
)
) +
geom_boxplot(
alpha = 0.85,
colour = "black"
) +
scale_fill_manual(
values = c("Non-LDC" = "#1f77b4", "LDC" = "#ff7f0e")
) +
labs(
title = "Life Expectancy by Level of Homicide Rates",
subtitle = "Countries grouped into quartiles by homicide rate",
x = "Homicide Rate Group",
y = "Life Expectancy (years)",
fill = "Development Status"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "top"
)
print(life_homicide_boxplot)
Following our exploration of violence as a potential determinant of life expectancy, we turn to another less obvious factor — education. In this graph, we examine how average years of schooling relate to life expectancy, separating countries into Least Developed (LDC) and non-LDC groups.
A clear pattern emerges: countries with more years of schooling generally achieve longer life expectancy. This relationship is especially strong and consistent among non-LDC nations, which form a well-defined upward trend. In contrast, LDC countries remain clustered in the lower ranges of both education and life expectancy, suggesting that limited access to schooling is closely tied to broader structural constraints that shape population health.
This comparison helps us see that, alongside violence levels, education represents another important, though often overlooked, pathway through which social conditions influence how long people live.
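The code for the schooling graph is not shown, so the following is only a minimal sketch of how such a plot could be built. It assumes the column names Schooling, lifeexp, and leastdeveloped from the dataset; the data frame toy and all of its values are invented placeholders, not rows from the real data.

```r
library(ggplot2)

# Toy stand-in for `life`; every value here is invented for illustration.
toy <- data.frame(
  Schooling = c(8.5, 9.2, 10.1, 12.4, 13.0, 14.2, 15.1, 16.3),
  lifeexp   = c(62.0, 64.5, 66.0, 71.2, 73.5, 75.0, 78.4, 81.0),
  leastdeveloped = factor(
    c(1, 1, 1, 0, 0, 0, 0, 0),
    levels = c(0, 1),
    labels = c("Non-LDC", "LDC")
  )
)

# Scatterplot of schooling against life expectancy, coloured by LDC status.
schooling_plot <- ggplot(
  toy,
  aes(x = Schooling, y = lifeexp, colour = leastdeveloped)
) +
  geom_point(size = 2.5, alpha = 0.8) +
  labs(
    title = "Life Expectancy by Years of Schooling",
    x = "Average Years of Schooling",
    y = "Life Expectancy (years)",
    colour = "Development Status"
  ) +
  theme_minimal(base_size = 13)

print(schooling_plot)
```

Replacing toy with the real life data frame should give the corresponding figure, provided leastdeveloped has already been converted to a factor as in the first graph.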
After looking at education, we move to another structural factor that can shape population health: basic infrastructure. This graph examines the relationship between access to electricity and clean water and life expectancy across countries.
Greater infrastructure access is associated with longer life expectancy. What is particularly interesting is that several non-LDC, and even some LDC, countries with relatively low income levels still achieve higher life expectancy when electricity and water coverage are high.
This suggests that infrastructure can partially compensate for economic limitations, offering a pathway to improved health outcomes even in countries that are not yet fully developed. In other words, reliable access to essential services like electricity and clean water plays a crucial and often underestimated role in shaping how long people live.
The last factor we look at is GDP. Before examining the relationship between GDP and life expectancy, we first redefined the country groups and checked their distribution using a histogram. In our analysis, we changed the original two-group classification (LDC and Non-LDC) into three groups: Developed, Developing, and Least Developed. This allows us to look at the GDP–life expectancy relationship in a more detailed way.
We chose a histogram because it is the most intuitive way to show how a continuous variable like life expectancy is distributed. It visually shows how many countries fall into each range, so we can easily see the center, spread, and asymmetry of the data.
We also used facet_wrap() to separate the groups into different panels. If all countries were shown in one plot, it would be difficult to interpret, so splitting the groups makes the differences clearer.
To improve readability, we applied transparency (alpha = 0.7) so the bars would not look too strong, and we used theme_minimal() to remove unnecessary background elements. We also set the number of bins to 20 to avoid creating an overly segmented or noisy distribution.
Overall, this graph shows a clear step pattern based on development level. Developed countries cluster at high life expectancy, developing countries are spread out in the middle range, and least developed countries are concentrated in the lower range.
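The settings described above (20 bins, alpha = 0.7, facet_wrap(), theme_minimal()) could be combined roughly as follows. This is a sketch rather than the original code: the data frame toy holds invented values, and the single-column panel layout and hidden legend are our own assumptions.

```r
library(ggplot2)

# Toy stand-in for `life`; the life expectancy values are invented.
toy <- data.frame(
  lifeexp = c(
    79.1, 80.4, 81.0, 82.5,             # Developed
    70.2, 72.8, 73.5, 74.9, 75.6, 77.0, # Developing
    61.5, 66.0, 68.3                    # Leastdeveloped
  ),
  Status = rep(
    c("Developed", "Developing", "Leastdeveloped"),
    times = c(4, 6, 3)
  )
)

# One histogram panel per development status, sharing the same x-axis.
life_hist <- ggplot(toy, aes(x = lifeexp, fill = Status)) +
  geom_histogram(bins = 20, alpha = 0.7) +
  facet_wrap(~ Status, ncol = 1) +
  labs(
    title = "Distribution of Life Expectancy by Development Status",
    x = "Life Expectancy (years)",
    y = "Number of Countries"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

print(life_hist)
```

Stacking the panels in one column keeps the x-axis aligned, which makes the step pattern between groups easier to compare by eye.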
Whereas Figure 4 shows the distribution of life expectancy for each group using three separate facets, Figure 5 brings all three groups together in one graph so we can directly compare the relationship between GDP and life expectancy. With this scatterplot, we can look at both variables at the same time and understand the overall pattern.
This graph was created using ggplot(), with GDP per capita on the x-axis and life expectancy on the y-axis. We used geom_point(alpha = 0.6) to make the points slightly transparent so that overlapping areas are easier to see. The colors separate countries by development status. We also used labs() to tidy the titles and labels and theme_minimal() to remove unnecessary visual elements.
GDP has a very wide range, so if we draw it on a normal scale, most countries get pushed to one side and the pattern becomes hard to see. To fix this, we used scale_x_log10(), which changes the axis into a log scale. This reduces distortion and makes the structure of the data much clearer.
Figure 5 shows a clear positive relationship: as GDP increases, life expectancy also increases. Developed countries cluster in the high-GDP, high-life-expectancy area, developing countries are spread out in the middle range, and LDCs are concentrated in the low range. We also see that within the same GDP range, the three groups show similar life expectancy levels. This suggests that GDP is one of the strongest single factors explaining differences in life expectancy across countries.
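A minimal sketch of the scatterplot described above, using geom_point(alpha = 0.6) and scale_x_log10(). The data frame toy is an invented stand-in for life, so the points do not correspond to real countries; only the column names gdppercapita, lifeexp, and Status are taken from the dataset.

```r
library(ggplot2)

# Toy stand-in for `life`; GDP and life expectancy values are invented.
toy <- data.frame(
  gdppercapita = c(600, 1200, 3500, 7800, 15000, 42000),
  lifeexp      = c(61.0, 66.5, 71.0, 74.0, 77.5, 82.0),
  Status = c(
    "Leastdeveloped", "Leastdeveloped", "Developing",
    "Developing", "Developed", "Developed"
  )
)

# GDP spans two orders of magnitude here, so the x-axis is drawn on a
# log10 scale to stop the low-GDP countries from bunching up on the left.
gdp_scatter <- ggplot(
  toy,
  aes(x = gdppercapita, y = lifeexp, colour = Status)
) +
  geom_point(alpha = 0.6, size = 3) +
  scale_x_log10() +
  labs(
    title = "GDP per Capita and Life Expectancy",
    x = "GDP per Capita (USD, log scale)",
    y = "Life Expectancy (years)",
    colour = "Development Status"
  ) +
  theme_minimal()

print(gdp_scatter)
```

On the log scale, equal distances along the x-axis represent equal ratios of GDP, so a move from 1,000 to 10,000 USD takes up as much space as a move from 10,000 to 100,000 USD.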
As mentioned earlier, a scatterplot is the most intuitive way to show the GDP–life expectancy relationship, so we used geom_point() once again. We also added geom_smooth(method = "lm") to check the overall linear trend between the two variables. This helps us clearly see the general increasing pattern among the countries.
Within this linear relationship, however, we wanted to explore whether there are countries that show unexpected patterns. Specifically, we became curious about whether countries with high GDP levels but lower-than-expected life expectancy actually exist.
At first, we tried to identify outliers using the average. We used mean() and subset() to select countries with ① GDP higher than the average and ② life expectancy lower than the average. However, the life expectancy dataset contains a very large proportion of developing countries, and life expectancy follows an asymmetric rather than a normal distribution. Because the average does not properly represent the center of such a distribution, it is not reasonable to classify a country as an outlier simply because its life expectancy is slightly below the average.
In fact, Lithuania (74.51 years) and Latvia (74.12 years) were classified as outliers based only on the average value, even though their life expectancy relative to GDP does not deviate from the linear pattern. The difference between these countries and the world average life expectancy (74.54 years) is extremely small, and the issue occurred because the average was not an appropriate standard. In other words, they appeared to be outliers because of the limitations of the average-based criterion.
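The mean-based screen described above can be sketched as follows. The data frame toy is an invented stand-in with placeholder country names (A to F), chosen so that one country sits only slightly below the mean life expectancy; only the use of mean() and subset() mirrors the text.

```r
# Toy stand-in for `life`; values and country names are invented.
toy <- data.frame(
  Country      = c("A", "B", "C", "D", "E", "F"),
  gdppercapita = c(900, 3000, 8000, 14000, 16000, 40000),
  lifeexp      = c(64.0, 71.0, 77.0, 72.8, 70.0, 83.2)
)

# Thresholds: the plain averages of each variable.
mean_gdp  <- mean(toy$gdppercapita, na.rm = TRUE)
mean_life <- mean(toy$lifeexp, na.rm = TRUE)

# Flag countries with above-average GDP and below-average life expectancy.
mean_outliers <- subset(
  toy,
  gdppercapita > mean_gdp & lifeexp < mean_life
)

# How far below the mean each flagged country actually is.
mean_outliers$gap_to_mean <- mean_life - mean_outliers$lifeexp

print(mean_outliers)
```

In this toy example, country D is flagged even though its gap to the mean is a fraction of a year, which is the same weakness that made Lithuania and Latvia appear as outliers.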
To solve this problem, we redefined outliers using the median. The median is more stable because it is not affected by extreme values. However, the median GDP (7,864 USD) became much lower than the mean GDP (11,401 USD), creating the problem that too many countries met the condition of “GDP higher than the median.” In other words, the mean standard was too strict, and the median standard was too loose.
Therefore, we finally tried using the quantile() function. quantile() calculates the values that divide the dataset into specific proportions. For example, the 75% quantile marks the beginning of the top 25%. Using this method, we selected only countries in the top 25% for GDP and the bottom 25% for life expectancy. Several countries (e.g., Lithuania, Latvia, Kazakhstan, and Trinidad and Tobago) appeared as outlier candidates under the mean and median criteria, but only Trinidad and Tobago remained an outlier under the quantile method. This result suggests that Trinidad and Tobago is a structurally unique case where life expectancy is unusually low compared to its GDP level.
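To make the difference between the median and quantile criteria concrete, here is a small sketch. The data frame toy and its country names (A to H) are invented, and the strict inequalities are our assumption about how "top 25%" and "bottom 25%" were applied; quantile() is used with its default interpolation method.

```r
library(dplyr)

# Toy stand-in for `life`; values and country names are invented.
toy <- data.frame(
  Country      = c("A", "B", "C", "D", "E", "F", "G", "H"),
  gdppercapita = c(800, 2000, 4000, 7000, 9000, 15000, 20000, 45000),
  lifeexp      = c(61, 66, 72, 73, 74, 70, 65, 82)
)

# Median criterion: above-median GDP and below-median life expectancy.
median_outliers <- toy %>%
  filter(
    gdppercapita > median(gdppercapita, na.rm = TRUE),
    lifeexp < median(lifeexp, na.rm = TRUE)
  )

# Quantile criterion: top 25% for GDP and bottom 25% for life expectancy.
gdp_q75  <- quantile(toy$gdppercapita, 0.75, na.rm = TRUE, names = FALSE)
life_q25 <- quantile(toy$lifeexp, 0.25, na.rm = TRUE, names = FALSE)

quantile_outliers <- toy %>%
  filter(gdppercapita > gdp_q75, lifeexp < life_q25)

print(median_outliers)
print(quantile_outliers)
```

With these toy values the median criterion flags two countries while the quantile criterion keeps only the one that is extreme on both variables, which illustrates why the quantile rule is the more selective of the two.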
During this process, geom_text_repel() from the ggrepel package was used to clearly label the outlier countries. This function prevents labels from overlapping, making the graph easier to read. We also set fontface = "bold" to bold the labels and used labs() to organize the titles and axis names.
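A labelled version of the trend plot could look roughly like this. It is a sketch under several assumptions: the data frame toy and the flagged country "G" are invented, the log-scaled x-axis is carried over from the earlier scatterplot, and se = FALSE is our own choice to keep the example uncluttered.

```r
library(ggplot2)
library(ggrepel)

# Toy stand-in for `life`; values and country names are invented.
toy <- data.frame(
  Country      = c("A", "B", "C", "D", "E", "F", "G", "H"),
  gdppercapita = c(800, 2000, 4000, 7000, 9000, 15000, 20000, 45000),
  lifeexp      = c(61, 66, 72, 73, 74, 70, 65, 82)
)

# Hypothetical outlier set; in the report this comes from the quantile rule.
outliers <- subset(toy, Country == "G")

outlier_plot <- ggplot(toy, aes(x = gdppercapita, y = lifeexp)) +
  geom_point(alpha = 0.6) +
  # Straight-line fit to show the overall trend.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Label only the flagged countries; ggrepel nudges labels off the points.
  geom_text_repel(
    data = outliers,
    aes(label = Country),
    fontface = "bold"
  ) +
  scale_x_log10() +
  labs(
    title = "GDP per Capita and Life Expectancy with Outliers Labelled",
    x = "GDP per Capita (USD, log scale)",
    y = "Life Expectancy (years)"
  ) +
  theme_minimal()

print(outlier_plot)
```

Passing a separate data argument to geom_text_repel() keeps the labels limited to the flagged rows while the points and trend line still use every country.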
Figure 7-1 compares various social, health, and education indicators with the global average to explain why Trinidad and Tobago appears as an outlier with low life expectancy relative to its GDP.
To make this graph, we first extracted the row for Trinidad and Tobago using filter() and select(), and then calculated the global average of each variable using summarise(). After that, we created a data frame that puts the two values side by side, converted it into long form using gather(), and visualized it as a bar graph.
A bar graph is useful because it clearly shows the difference between the two groups: Trinidad and Tobago (TT) and the global average. We used geom_col(position = "dodge") so the two bars appear next to each other for every variable, making it easy to see where the differences are. Because the variables are on very different scales, we applied scale_x_log10() to reduce distortion, and used theme_minimal() to remove unnecessary visual elements.
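The pipeline described above (filter() and select(), then summarise(), then gather(), then a dodged bar graph) might be sketched as follows. The data frame toy, the placeholder country "X", and the choice of three indicators are all invented for illustration. Because the text mentions scale_x_log10(), the sketch assumes the values were drawn on the x-axis as horizontal bars.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for `life`; "X" plays the role of the country of interest.
toy <- data.frame(
  Country           = c("X", "A", "B", "C"),
  literacyrate      = c(98, 95, 80, 99),
  wateraccess       = c(97, 92, 70, 100),
  homicidiesper100k = c(30, 2, 5, 3)
)
vars <- c("literacyrate", "wateraccess", "homicidiesper100k")

# Row for the focus country, keeping only the indicators to compare.
focus <- toy %>%
  filter(Country == "X") %>%
  select(all_of(vars))

# Average of each indicator across all countries in the data.
global_avg <- toy %>%
  summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE)))

# Stack the two rows, then reshape to long form: one row per group/indicator.
comparison <- bind_rows(
  mutate(focus, group = "Country X"),
  mutate(global_avg, group = "Global average")
) %>%
  gather(key = "indicator", value = "value", -group)

comparison_plot <- ggplot(
  comparison,
  aes(x = value, y = indicator, fill = group)
) +
  geom_col(position = "dodge") +
  scale_x_log10() +
  labs(x = "Value (log scale)", y = NULL, fill = NULL) +
  theme_minimal()

print(comparison_plot)
```

gather() still works but is marked as superseded in tidyr, so pivot_longer() would be the more current way to do the same reshaping.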
From this graph, we can see that Trinidad and Tobago’s literacy rate, education level, water access, and internet penetration are similar to the global average, but its homicide rate—an indicator of public safety—is much higher.
To show this more clearly, we added a separate graph that compares only the homicide rate. We used annotate("text") to add the actual numbers above the bars, and we removed the legend with legend.position = "none" because it was not needed. In this graph, we can clearly see again that Trinidad and Tobago’s homicide rate (29.9) is very high, about 3.5 times the world average (8.5).
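The homicide comparison can be sketched with the two figures quoted above (29.9 and 8.5 per 100,000). The data frame homicide, the one-unit label offset, and the bar width are our own choices, not taken from the original code.

```r
library(ggplot2)

# The two rates quoted in the text, per 100,000 people.
homicide <- data.frame(
  group = c("Trinidad and Tobago", "World average"),
  rate  = c(29.9, 8.5)
)

# Ratio behind the "about 3.5 times" statement.
ratio <- homicide$rate[1] / homicide$rate[2]

homicide_plot <- ggplot(homicide, aes(x = group, y = rate, fill = group)) +
  geom_col(width = 0.6) +
  # Print each value just above its bar.
  annotate(
    "text",
    x = homicide$group,
    y = homicide$rate + 1,
    label = format(homicide$rate, nsmall = 1),
    fontface = "bold"
  ) +
  labs(
    title = "Homicide Rate: Trinidad and Tobago vs. World Average",
    x = NULL,
    y = "Homicides per 100,000"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

print(homicide_plot)
print(round(ratio, 1))
```

The legend is dropped because the x-axis labels already identify the two bars.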
How can we relate these characteristics of Trinidad and Tobago to its high GDP? Trinidad and Tobago’s high GDP comes mainly from oil and gas. However, these industries are technology-intensive, mechanized, and capital-intensive, so they do not create many jobs. As a result, even though the country’s GDP is high, structural social and economic problems—such as income inequality, unstable employment, and regional poverty—still remain. These vulnerable conditions especially affect young people and disadvantaged groups, making them more likely to be pulled into gangs.
In fact, Trinidad and Tobago has repeatedly declared states of emergency in recent years due to severe gang violence and high crime rates. In 2024 and 2025, the government even mobilized the military because organized crime and retaliatory killings increased sharply. The number of homicides exceeded 600 per year, which is extremely high for a country of its size. This shows that safety issues and social instability—problems that GDP or infrastructure cannot fully explain—may be important reasons why the country’s life expectancy is lower than expected.
Our analysis shows that differences in life expectancy across countries cannot be explained by a single factor. Instead, they are shaped by a combination of social, educational, infrastructural, and safety-related factors that interact with one another.
We began by examining relatively “non-obvious” variables: homicide rates, years of schooling, and access to electricity and clean water. These analyses show three things. ① Countries with lower homicide rates tend to have higher life expectancy, and Least Developed Countries fall into the middle-to-high homicide groups rather than the very high group, which means their low life expectancy cannot be explained by violence alone. ② Countries with more years of schooling generally have higher life expectancy, so education plays an important role in improving health and life expectancy. ③ Basic infrastructure, such as access to electricity and clean water, can partly offset economic limitations: when these essential services are widely available, even lower-income countries can achieve higher life expectancy.
We then explored the relationship between GDP and life expectancy. Although GDP emerged as one of the strongest single predictors and most countries showed a clear upward relationship, this pattern did not apply uniformly to all nations.
To understand these deviations, we identified countries whose life expectancy was “lower than expected” for their GDP level. Ultimately, Trinidad and Tobago stood out as a structural outlier. Despite its high GDP, extreme levels of violence and persistent social instability significantly suppress the country’s life expectancy. This finding highlights that economic prosperity does not automatically resolve deeper structural vulnerabilities, and that public safety and social stability can be as critical as economic indicators in determining life expectancy.
Taken together, our findings emphasize that life expectancy is shaped by an interconnected system of factors—economy, education, infrastructure, and public safety—not by any single dimension. Improving life expectancy requires a broad policy approach that goes beyond economic growth and focuses on expanding education, strengthening infrastructure, reducing inequality, and ensuring public safety. Only through such integrated efforts can countries build conditions that support longer and healthier lives.