DATA 110 - DS Labs Assignment

Author

Catherine Z. Matenje

DS Labs Assignment

For this assignment, I used the Gapminder data set from the dslabs package, which contains global health and demographic data across countries and years. I decided to create a heat map to visualize how life expectancy is changing over time across continents.

First, to create the visualization, I explored the data set by viewing the variable names and the values within the continent variable. This allowed me to better understand the structure of the data and identify anything that needed renaming. I noticed that one of the continent categories was labeled “Oceania,” which may not be obvious to all audiences, so I renamed it to “Australia & Oceania”. Second, I cleaned the data set by removing rows with missing values in my key variables: continent, year, and life expectancy.

Third, I grouped the data by continent and year, then calculated the average life expectancy for each group. This allows me to summarize the data by region/continent rather than individual countries. Fourth, I reshaped the data set into a matrix format so that the rows represent continents and the columns represent years.

Lastly, I passed the matrix to the heatmap() function to create a colorful visualization depicting how life expectancy has changed over time across the continents. The color gradient represents the level of life expectancy, with darker colors representing lower life expectancy and brighter/lighter colors representing higher life expectancy.

Loading Packages and Data Set

# Loading the appropriate/required libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
Warning: package 'dslabs' was built under R version 4.5.3
library(viridis)
Loading required package: viridisLite
# Load gapminder dataset

data(gapminder)

head(gapminder)
              country year infant_mortality life_expectancy fertility
1             Albania 1960           115.40           62.87      6.19
2             Algeria 1960           148.20           47.50      7.65
3              Angola 1960           208.00           35.98      7.32
4 Antigua and Barbuda 1960               NA           62.97      4.43
5           Argentina 1960            59.87           65.39      3.11
6             Armenia 1960               NA           66.86      4.55
  population          gdp continent          region
1    1636054           NA    Europe Southern Europe
2   11124892  13828152297    Africa Northern Africa
3    5270844           NA    Africa   Middle Africa
4      54681           NA  Americas       Caribbean
5   20619075 108322326649  Americas   South America
6    1867396           NA      Asia    Western Asia

Cleaning the Data Set and Creating a Summarized Data Set for the Heat Map

# Viewing variable names in dataset

names(gapminder)
[1] "country"          "year"             "infant_mortality" "life_expectancy" 
[5] "fertility"        "population"       "gdp"              "continent"       
[9] "region"          
# Viewing names of continents in continent variable

unique(gapminder$continent)
[1] Europe   Africa   Americas Asia     Oceania 
Levels: Africa Americas Asia Europe Oceania
# Continent includes Oceania; I want to rename it slightly for clarity

gapminder <- gapminder |>
  mutate(
    continent = case_when(
      continent == "Oceania" ~ "Australia & Oceania",
      TRUE ~ continent
    )
  )

# Verify change
unique(gapminder$continent)
[1] "Europe"              "Africa"              "Americas"           
[4] "Asia"                "Australia & Oceania"
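
The quoted strings in the output above indicate that continent is now a character vector rather than the original factor. If keeping it as a factor matters (for example, to control the row order later), one option is to rename the level directly. The following is only a minimal base R sketch on a small made-up vector; `continent_demo` is a hypothetical name and is not part of the assignment code.

```r
# Hypothetical toy factor standing in for gapminder$continent
continent_demo <- factor(c("Europe", "Africa", "Oceania", "Asia", "Oceania"))

# Rename a single level in place; the vector remains a factor
levels(continent_demo)[levels(continent_demo) == "Oceania"] <- "Australia & Oceania"

# Inspect the result
levels(continent_demo)
is.factor(continent_demo)
```

Either approach produces the same labels on the heat map; the difference only matters if later steps rely on factor levels.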
# Create a summarized data set for the heatmap

gapminder_summary <- gapminder |>
  
  # Remove missing values from my key variables
  filter(!is.na(continent), 
         !is.na(life_expectancy), 
         !is.na(year)) |>
  
  # Group data by continent and year
  group_by(continent, year) |>
  
  # Calculate average life expectancy per group
  summarise(
    avg_life_exp = mean(life_expectancy),
    .groups = "drop"
  )

# View first few rows
head(gapminder_summary)
# A tibble: 6 × 3
  continent  year avg_life_exp
  <chr>     <int>        <dbl>
1 Africa     1960         43.1
2 Africa     1961         43.6
3 Africa     1962         44.1
4 Africa     1963         44.6
5 Africa     1964         45.0
6 Africa     1965         45.5

Reshaping the Data into Matrix Format

# I am reshaping the data so that rows = continents, columns = years, values = average life expectancy

gapminder_wide <- gapminder_summary |>
  select(continent, year, avg_life_exp) |>
  pivot_wider(names_from = year, values_from = avg_life_exp)

# Convert to matrix format
gapminder_matrix <- data.matrix(gapminder_wide[, -1])

# Assigning row names to be continents
row.names(gapminder_matrix) <- gapminder_wide$continent
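
As a cross-check on the reshaping step, the same continent-by-year matrix can probably be built in one call with base R's tapply(), which averages within each cell and returns a matrix with row and column names already set. This is a sketch on made-up data, not the approach used above; `toy` and `toy_matrix` are hypothetical names and the values are invented for illustration.

```r
# Hypothetical toy data: two continents, two years, two countries per cell
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe",
                "Africa", "Africa", "Europe", "Europe"),
  year = c(1960, 1960, 1960, 1960, 1961, 1961, 1961, 1961),
  life_expectancy = c(40, 44, 68, 70, 41, 45, 69, NA)
)

# Average life expectancy per continent-year cell;
# na.rm = TRUE is passed through to mean() to skip missing values
toy_matrix <- tapply(
  toy$life_expectancy,
  list(toy$continent, toy$year),
  mean,
  na.rm = TRUE
)

# Rows are continents, columns are years
toy_matrix
```

Comparing a few cells of a matrix built this way against gapminder_matrix would be one way to confirm that the pivot and conversion steps behaved as intended.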

Creating the Heat Map of Average Life Expectancy

# Create heatmap of average life expectancy

gapminder_heatmap <- heatmap(
  gapminder_matrix,
  
  Rowv = NA,   
  Colv = NA,
  col = viridis(25, option = "plasma"),
  cexCol = 0.7,   # adjusting column labels
  cexRow = 0.9,   # adjusting row labels
  cex.main = 0.7, # adjusting main title 
  scale = "none",   
  xlab = "",
  ylab = "",
  
  main = "Heatmap of Average Life Expectancy by Continent and Year"
)
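
The base heatmap() function does not draw a color legend, so readers have to be told which end of the palette means higher life expectancy. One possible alternative is a ggplot2 tile plot, which adds a legend automatically. The sketch below uses a small made-up summary table (`toy_summary` and `toy_plot` are hypothetical names, and the values are for illustration only); it assumes the long-format columns continent, year, and avg_life_exp.

```r
library(ggplot2)

# Hypothetical toy summary in long format (values are illustrative only)
toy_summary <- data.frame(
  continent = rep(c("Africa", "Europe"), each = 3),
  year = rep(c(1960, 1961, 1962), times = 2),
  avg_life_exp = c(43.1, 43.6, 44.1, 68.0, 68.4, 68.7)
)

# geom_tile() draws one tile per continent-year pair;
# the continuous viridis fill scale adds a color legend
toy_plot <- ggplot(toy_summary,
                   aes(x = year, y = continent, fill = avg_life_exp)) +
  geom_tile() +
  scale_fill_viridis_c(option = "plasma", name = "Life expectancy (years)") +
  labs(
    x = "Year",
    y = NULL,
    title = "Average Life Expectancy by Continent and Year (toy data)"
  )

toy_plot
```

With the real data, gapminder_summary could be passed in place of the toy table, since it already has one row per continent and year.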

Summary:

This heat map depicts average life expectancy across continents over time (1960-2016). Each row represents a continent, while each column represents a year. The color gradient indicates life expectancy, with darker colors representing lower life expectancy and lighter colors representing higher life expectancy. Overall, the heat map shows that life expectancy has increased across all continents over time. Europe and the Americas have consistently shown the highest life expectancy, although Europe appears to fare better than the Americas. Africa and Asia have also improved, although Africa has the lowest life expectancy overall.

One result I found surprising is that Australia & Oceania appears to have lower life expectancy than regions like Europe and even Asia. However, this likely reflects how the data are grouped and averaged rather than reality. Oceania includes not only Australia and New Zealand but also several small Pacific island countries with lower life expectancy. Because each country counts equally in the average regardless of its population, these countries are likely pulling the regional average down. Therefore, the heat map should be interpreted with caution rather than as a direct reflection of life expectancy on the continent.
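
To illustrate the averaging effect described above, the sketch below compares an unweighted mean with a population-weighted mean. All numbers are invented for illustration and are not taken from the Gapminder data; `toy_region`, `unweighted`, and `weighted` are hypothetical names.

```r
# Hypothetical numbers for illustration only (not from gapminder)
toy_region <- data.frame(
  country = c("Large country", "Small island A", "Small island B"),
  life_expectancy = c(82, 66, 68),
  population = c(24000000, 100000, 200000)
)

# Unweighted mean: every country counts equally,
# which mirrors mean(life_expectancy) within a group
unweighted <- mean(toy_region$life_expectancy)

# Population-weighted mean: larger countries contribute more
weighted <- weighted.mean(toy_region$life_expectancy,
                          w = toy_region$population)

round(c(unweighted = unweighted, weighted = weighted), 1)
```

In this toy case the unweighted average sits well below the weighted one, which is the same kind of gap that may explain the lower values for Australia & Oceania in the heat map.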