Anusha Y K (1NT24IS040) and Asha N Bhat (1NT24IS046)
STEP 1: Load Required Libraries
For this program, we use three specific libraries to handle the data workflow:
ggplot2: The standard for data visualization (DV) in R.
dplyr: Used for data manipulation, filtering, and creating new variables.
countrycode: An essential tool for converting country names into their respective continents (our categorical grouping).
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(countrycode)
STEP 2: Extract External Dataset
We extract data directly from the Our World in Data repository. Wrapping the address in url() opens a connection that read.csv() can read from; the https:// scheme in the address is what makes the transfer encrypted.
# Define the URL path
url_path <- "https://ourworldindata.org/grapher/life-expectancy-vs-gdp-per-capita.csv"

# Read the external CSV file
data_raw <- read.csv(url(url_path))

# Standardize column names to lowercase and remove special characters
# This prevents "Object not found" errors during plotting
names(data_raw) <- tolower(names(data_raw))
names(data_raw) <- gsub("[^a-z0-9]", "_", names(data_raw))
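To make the renaming step concrete, here is a minimal base R sketch of the same two calls applied to a few header names. The names in raw_names are hypothetical stand-ins for what read.csv() might produce, not the actual headers of the downloaded file.

```r
# Hypothetical header names, roughly as read.csv() might return them after
# replacing spaces and brackets with dots (check.names = TRUE is the default).
raw_names <- c("Entity", "Code", "Year", "GDP.per.capita..PPP.", "Life.expectancy")

# Step 1: lowercase everything.
clean_names <- tolower(raw_names)

# Step 2: replace every character that is not a-z or 0-9 with "_".
clean_names <- gsub("[^a-z0-9]", "_", clean_names)

print(clean_names)
```

Because every dot becomes an underscore, a header with two consecutive dots ends up with a double underscore. This is harmless for the grep() lookups used in Step 3.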
STEP 3: Data Cleaning and Categorical Grouping
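Raw data often contains “Regional” rows (like “World” or “High Income”) that aren’t actual countries. We filter out rows whose code column is empty. As the countrycode warning below shows, some aggregates still carry a code and pass this filter; they receive no continent and are dropped by the later na.omit() step. We also use mutate to create cleaner variable names.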
data_clean <- data_raw %>%
  # Filter for the most recent complete year (2022)
  filter(year == 2022) %>%
  # Remove non-country entities (rows where code is empty)
  filter(code != "" & !is.na(code)) %>%
  mutate(
    country = entity,
    life_exp = .[[grep("life", names(.), value = TRUE)[1]]],
    gdp = .[[grep("gdp", names(.), value = TRUE)[1]]],
    population = .[[grep("population", names(.), value = TRUE)[1]]]
  )

# Use countrycode to map countries to their continents
data_clean$continent <- countrycode(data_clean$country, origin = "country.name", destination = "continent")
Warning: Some values were not matched unambiguously: Africa, Asia, Czechoslovakia, Europe, High-income countries, Kosovo, Low-income countries, Lower-middle-income countries, Micronesia (country), Oceania, Upper-middle-income countries, World, Yugoslavia
To fix unmatched values, please use the `custom_match` argument. If you think the default matching rules should be improved, please file an issue at https://github.com/vincentarelbundock/countrycode/issues
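# Remove any remaining missing values (rows with NA in any column).
# This also drops every entity listed in the warning above, because
# countrycode() returned NA for its continent (including Kosovo and Micronesia).
data_clean <- na.omit(data_clean)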
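The cleaning logic can be checked on a small scale before trusting it on the full dataset. The sketch below uses base R only and a made-up data frame (fictional entities and invented numbers, with a continent column filled in by hand) to show the code filter, the grep()-based column lookup, and how to list what na.omit() will silently remove.

```r
# Made-up rows for illustration only; not taken from the real dataset.
toy <- data.frame(
  entity = c("Atlantis", "Regionia", "Borduria"),
  code = c("ATL", "", "BOR"),
  year = c(2022, 2022, 2022),
  life_expectancy_at_birth = c(71.2, 68.0, 75.4),
  gdp_per_capita = c(4200, 9000, 31000),
  continent = c("Europe", NA, NA),
  stringsAsFactors = FALSE
)

# Keep rows with a non-empty code (base R equivalent of the filter() call).
toy <- toy[!is.na(toy$code) & toy$code != "", ]

# Find the first column whose name contains "life", as the grep() call does.
life_col <- grep("life", names(toy), value = TRUE)[1]
toy$life_exp <- toy[[life_col]]

# List the entities that na.omit() would drop because continent is NA.
dropped <- toy$entity[is.na(toy$continent)]
print(dropped)

# Apply the final cleaning step.
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Printing the dropped entities before calling na.omit() is a cheap way to notice when a real country, rather than an aggregate, is about to disappear from the plot.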
STEP 4: Generate the Bubble Chart
A bubble chart is a variation of a scatter plot where a third dimension (population) is shown through the size of the dots.
ggplot(data_clean, aes(x = gdp, y = life_exp, size = population, color = continent)) +
  geom_point(alpha = 0.6) +  # Set transparency to see overlapping bubbles
  scale_x_log10() +  # Apply log scale to spread out GDP data
  scale_size(range = c(1, 15), name = "Population") +
  labs(
    title = "Life Expectancy vs GDP per Capita (2022)",
    subtitle = "Categorical grouping by Continent",
    x = "Wealth: GDP per Capita (Log Scale)",
    y = "Health: Life Expectancy (Years)",
    color = "Continent"
  ) +
  theme_minimal()
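As a rough illustration of what scale_size(range = c(1, 15)) does with population, the base R sketch below mimics the mapping as we understand it from the ggplot2 documentation, which states that scale_size() scales point area rather than radius. The exact internal formula is an assumption here, and the population values are hypothetical.

```r
# Hypothetical populations; not real data.
population <- c(1e6, 1e7, 1e8, 1e9)
size_range <- c(1, 15)  # same range as scale_size(range = c(1, 15))

# Rescale population to [0, 1], take the square root so that area (not
# radius) tracks the value, then stretch into the requested size range.
# This mirrors our reading of the area-based mapping; treat it as a sketch.
p01 <- (population - min(population)) / (max(population) - min(population))
point_size <- size_range[1] + sqrt(p01) * diff(size_range)

print(round(point_size, 2))
```

The square root is the design choice that matters: if size tracked radius directly, a country with ten times the population would look about a hundred times larger in area and would visually overwhelm the chart.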
STEP 5: Interpretation and Insights
The Health-Wealth Correlation: The plot shows a strong positive association between GDP per capita and life expectancy. Comparing low-income with middle-income countries, life expectancy is markedly higher in the latter.
Grouping by Continent:
Africa: Represents the majority of countries in the lower-left quadrant.
Europe: Most European nations are clustered in the top-right, indicating high wealth and high longevity.
Asia: Shows the largest bubbles (population) and the most significant “climb” toward higher life expectancy.
Use of Logarithmic Scaling: We used scale_x_log10() because the difference in GDP per capita between the poorest and richest nations is massive. Without a log scale, most of the data would be bunched up against the left axis and unreadable.
The “Bubble” Effect: By mapping population to size, we can immediately see that the world’s most populous nations have reached a life expectancy of over 70 years, despite varying GDP levels.
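The point about logarithmic scaling can be verified with a few numbers. This minimal base R sketch uses three hypothetical GDP per capita values, each ten times the previous one, and compares where they would fall on a linear axis and on a log10 axis.

```r
# Hypothetical GDP per capita values spanning two tenfold steps.
gdp <- c(500, 5000, 50000)

# On a linear axis, position is proportional to the raw value.
linear_pos <- gdp / max(gdp)

# On a log10 axis, position is proportional to log10 of the value.
log_pos <- log10(gdp)

print(linear_pos)     # 0.01 0.10 1.00
print(diff(log_pos))  # 1 1: each tenfold step covers the same distance
```

On the linear axis the two poorer values sit in the leftmost tenth of the plot, while on the log axis all three are evenly spaced. This is why scale_x_log10() keeps low-income countries distinguishable.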