Assignment — LA1: Data Visualization using R

Authors

Anusha Y K (1NT24IS040) and Asha N Bhat (1NT24IS046)

STEP 1: Load Required Libraries

For this program, we use three specific libraries to handle the data workflow:

  • ggplot2: A widely used grammar-of-graphics package for data visualization (DV) in R.

  • dplyr: Used for data manipulation, filtering, and creating new variables.

  • countrycode: Converts country names and codes between naming schemes; here it maps each country name to its continent (our categorical grouping).

library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(countrycode)
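
Before loading the libraries, it can help to confirm that all three are installed. The following is a minimal sketch of such a check, not part of the original program; the variable names pkgs, installed and missing_pkgs are illustrative.

```r
# Packages this report loads
pkgs <- c("ggplot2", "dplyr", "countrycode")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All required packages are available.")
}
```

Wrapping a call as suppressPackageStartupMessages(library(dplyr)) hides the masking notice that dplyr prints when it is attached.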

STEP 2: Extract External Dataset

  • We read the data directly from the Our World in Data website. The url() function opens a connection to the online CSV file and read.csv() reads from it; read.csv() also accepts the URL string directly.
# Define the URL path
url_path <- "https://ourworldindata.org/grapher/life-expectancy-vs-gdp-per-capita.csv"

# Read the external CSV file
data_raw <- read.csv(url(url_path))

# Standardize column names: convert to lowercase, then replace every
# character that is not a letter or digit with an underscore.
# Predictable names help prevent "object not found" errors during plotting
names(data_raw) <- tolower(names(data_raw))
names(data_raw) <- gsub("[^a-z0-9]", "_", names(data_raw))
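
To see what the two renaming lines do, here is a small offline sketch. The CSV text, the entity "Exampleland" and the header names are made up for illustration and are only assumed to resemble the real file's headers.

```r
# Hypothetical CSV text standing in for the downloaded file
csv_text <- paste(
  '"Entity","Code","Year","Life expectancy (years)","GDP per capita, PPP"',
  '"Exampleland","EXL",2022,70.5,12000',
  sep = "\n"
)

# read.csv() first turns invalid name characters into dots (check.names = TRUE)
sample_raw <- read.csv(text = csv_text)
print(names(sample_raw))

# Same two steps as in the report: lowercase, then non-alphanumerics to "_"
names(sample_raw) <- tolower(names(sample_raw))
names(sample_raw) <- gsub("[^a-z0-9]", "_", names(sample_raw))
print(names(sample_raw))
```

Each space, bracket or comma becomes its own underscore, so names can contain doubled or trailing underscores. This is why the later steps look columns up by a keyword such as "life" instead of typing the full name.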

STEP 3: Data Cleaning and Categorical Grouping

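Raw data often contains aggregate rows (like “World” or “High-income countries”) that aren’t actual countries. We first drop rows with an empty code column. As the countrycode warning in the output shows, some aggregates still pass that check; they receive an NA continent and are removed by the final na.omit() call, along with any real countries that countrycode cannot match (here Kosovo and Micronesia). We also use mutate() to copy the key columns into shorter variable names.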

data_clean <- data_raw %>%
  # Filter for the most recent complete year (2022)
  filter(year == 2022) %>%
  # Remove non-country entities (rows where code is empty)
  filter(code != "" & !is.na(code)) %>%
  mutate(
    country = entity,
    life_exp = .[[grep("life", names(.), value = TRUE)[1]]],
    gdp = .[[grep("gdp", names(.), value = TRUE)[1]]],
    population = .[[grep("population", names(.), value = TRUE)[1]]]
  )
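
The keyword lookup inside mutate() takes the first column whose name contains the pattern. The sketch below shows the same idea in base R on a toy data frame; the helper first_match() and all values are hypothetical, and the explicit error is an addition that the original pipeline does not have.

```r
# Toy data frame with names shaped like the standardized ones (made-up values)
toy <- data.frame(
  entity = c("Exampleland", "Samplestan"),
  code = c("EXL", "SMP"),
  year = c(2022, 2022),
  life_expectancy__years_ = c(70.5, 64.2),
  gdp_per_capita__ppp = c(12000, 3400)
)

# Return the first column name containing `pattern`, or stop with a clear error.
# Without this check, grep(...)[1] yields NA when nothing matches.
first_match <- function(pattern, df) {
  hits <- grep(pattern, names(df), value = TRUE)
  if (length(hits) == 0) {
    stop("no column name contains '", pattern, "'")
  }
  hits[1]
}

life_col <- first_match("life", toy)
gdp_col <- first_match("gdp", toy)

# Copy the matched columns into short, stable names
toy$life_exp <- toy[[life_col]]
toy$gdp <- toy[[gdp_col]]
print(toy[, c("entity", "life_exp", "gdp")])
```

Because only the first match is used, it is worth printing the matched name once to confirm that the intended column was picked.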

# Use countrycode to map countries to their Continents
data_clean$continent <- countrycode(data_clean$country, 
                                    origin = "country.name", 
                                    destination = "continent")
Warning: Some values were not matched unambiguously: Africa, Asia, Czechoslovakia, Europe, High-income countries, Kosovo, Low-income countries, Lower-middle-income countries, Micronesia (country), Oceania, Upper-middle-income countries, World, Yugoslavia
To fix unmatched values, please use the `custom_match` argument. If you think the default matching rules should be improved, please file an issue at https://github.com/vincentarelbundock/countrycode/issues
# Drop rows with an NA in any column; this also removes the entities
# that countrycode could not match (listed in the warning above)
data_clean <- na.omit(data_clean)
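
The warning suggests the custom_match argument for unmatched names. A minimal sketch of that approach follows, assuming we want to keep Kosovo and Micronesia; the continent assignments in manual are our own choice, and the entity list is a small illustrative sample.

```r
library(countrycode)

# Sample entities: one ordinary country, two names from the warning, one aggregate
entities <- c("India", "Kosovo", "Micronesia (country)", "World")

# Named vector: origin name -> destination value (assignments chosen by us)
manual <- c("Kosovo" = "Europe", "Micronesia (country)" = "Oceania")

continent <- countrycode(entities,
                         origin = "country.name",
                         destination = "continent",
                         custom_match = manual,
                         warn = FALSE)  # silence the warning for "World"

print(data.frame(entity = entities, continent = continent))
```

Aggregates such as "World" still come back as NA, so the na.omit() step continues to remove them while the two real countries are kept.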

STEP 4: Generate the Bubble Chart

  • A bubble chart is a variation of a scatter plot in which a third variable (population) is shown through the size of the points. In ggplot2, scale_size() maps the value to the area of each point.
ggplot(data_clean, aes(x = gdp, y = life_exp, size = population, color = continent)) +
  geom_point(alpha = 0.6) + # Set transparency to see overlapping bubbles
  scale_x_log10() +         # Apply log scale to spread out GDP data
  scale_size(range = c(1, 15), name = "Population") +
  labs(
    title = "Life Expectancy vs GDP per Capita (2022)",
    subtitle = "Categorical grouping by Continent",
    x = "Wealth: GDP per Capita (Log Scale)",
    y = "Health: Life Expectancy (Years)",
    color = "Continent"
  ) +
  theme_minimal()
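
The chart can be tried without a network connection on a few made-up rows. The sketch below uses hypothetical values and also adds one optional refinement that is not in the original program: a label function, in_millions(), so the population legend reads in millions.

```r
library(ggplot2)

# Hypothetical values, for illustration only
toy <- data.frame(
  gdp = c(900, 4000, 15000, 52000),
  life_exp = c(59, 68, 75, 82),
  population = c(2.5e7, 1.4e9, 2.1e8, 6.0e7),
  continent = c("Africa", "Asia", "Americas", "Europe")
)

# Format raw counts as millions for the legend
in_millions <- function(x) paste0(round(x / 1e6), " M")

p <- ggplot(toy, aes(x = gdp, y = life_exp, size = population, color = continent)) +
  geom_point(alpha = 0.6) +
  scale_x_log10() +
  scale_size(range = c(1, 15), name = "Population", labels = in_millions) +
  theme_minimal()

p
```

The same labels argument can be added to the scale_size() call in the full chart.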

STEP 5: Interpretation and Insights

  1. The Health-Wealth Correlation: The plot shows a strong positive relationship between GDP per capita and life expectancy. Moving from the low-income to the middle-income range of the x-axis, life expectancy rises sharply.

  2. Grouping by Continent:

  • Africa: Represents the majority of countries in the lower-left quadrant.

  • Europe: Most European nations are clustered in the top-right, indicating high wealth and high longevity.

  • Asia: Contains the largest bubbles (population), and its countries are spread along the rising part of the curve toward higher life expectancy.

  3. Use of Logarithmic Scaling: We used scale_x_log10() because GDP per capita differs by orders of magnitude between the poorest and richest nations. On a linear axis, most countries would be bunched up against the left edge and unreadable.

  4. The “Bubble” Effect: By mapping population to size, we can immediately see that the largest bubbles, the world’s most populous nations, mostly sit at around 70 years of life expectancy or higher, despite varying GDP levels.
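
The correlation and log-scale points above can be checked numerically. The sketch below uses made-up numbers to show the method only; to check the real data, substitute data_clean$gdp and data_clean$life_exp.

```r
# Hypothetical values for illustration; not taken from the dataset
gdp <- c(800, 2500, 9000, 30000, 65000)
life_exp <- c(58, 66, 72, 79, 83)

# Pearson correlation on the raw and on the log-transformed GDP
r_raw <- cor(gdp, life_exp)
r_log <- cor(log10(gdp), life_exp)

# Ratio between the richest and poorest value; a large ratio is
# what squeezes points against the left edge of a linear axis
gdp_ratio <- max(gdp) / min(gdp)

cat("Pearson r, raw GDP:   ", round(r_raw, 2), "\n")
cat("Pearson r, log10 GDP: ", round(r_log, 2), "\n")
cat("Richest / poorest GDP:", gdp_ratio, "\n")
```

If the relationship is roughly linear in log GDP, as the chart suggests, the log-scale correlation will be the higher of the two.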