LA1 - COVID-19 Data Visualization

Author

Stephen George & Battini Jeevan Kumar

USN

1NT24IS227 & 1NT24IS057

Introduction

This report analyzes the COVID-19 pandemic using the Our World in Data dataset.

The dataset contains:

  • Daily records for 262 locations (countries and territories, plus aggregate rows such as "World" or "Asia")
  • Date range: January 2020 to December 2025
  • Key columns: new_cases_smoothed, new_deaths_smoothed, total_cases, total_deaths, country, continent, date

We will create four visualizations:

  1. Line plot of cases and deaths over time (India)
  2. Faceted plots for multiple countries
  3. Scatter plot: total cases vs total deaths
  4. Area plot for cumulative cases

Step 1: Load Libraries

We load the following libraries:

  • tidyverse — a collection of packages for data manipulation and visualization (includes ggplot2, dplyr, readr)
  • scales — used to format axis labels (e.g., comma-separated numbers)
  • ggrepel — used to add non-overlapping text labels on plots
library(tidyverse)
library(scales)
library(ggrepel)

Step 2: Load the Dataset

We load the CSV file using read_csv().

  • The date column is in DD-MM-YYYY format, so we use as.Date() with format = "%d-%m-%Y" to convert it properly.
  • We use head() to preview the first few rows.
covid <- read_csv("compact.csv", show_col_types = FALSE)

# Convert date column from character to Date type
covid <- covid |>
  mutate(date = as.Date(date, format = "%d-%m-%Y"))

# Preview
head(covid[, 1:6])
# A tibble: 6 × 6
  country     date       total_cases new_cases new_cases_smoothed
  <chr>       <date>           <dbl>     <dbl>              <dbl>
1 Afghanistan 2020-01-01          NA        NA                 NA
2 Afghanistan 2020-01-02          NA        NA                 NA
3 Afghanistan 2020-01-03          NA        NA                 NA
4 Afghanistan 2020-01-04           0         0                 NA
5 Afghanistan 2020-01-05           0         0                 NA
6 Afghanistan 2020-01-06           0         0                 NA
# ℹ 1 more variable: total_cases_per_million <dbl>
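
As a small illustration of the date conversion step, the sketch below parses a few made-up DD-MM-YYYY strings (they are not taken from compact.csv) and counts the values that fail to parse. Counting NA values after the conversion is one way to confirm that the format string matches the file.

```r
# Hypothetical date strings; the last one is deliberately in a different format
raw_dates <- c("01-01-2020", "15-03-2021", "2021/03/15")

# %d = day, %m = month, %Y = four-digit year, separated by hyphens
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

# Strings that do not match the format become NA instead of raising an error
n_failed <- sum(is.na(parsed))

print(parsed)
print(n_failed)
```

A non-zero count here would suggest that some rows use a different date layout and need separate handling.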

Step 3: Explore the Dataset

Before plotting, we examine the structure of the dataset.

We check:

  • Number of rows and columns
  • Column names
  • Data types
  • Summary statistics
# Dimensions
dim(covid)
[1] 570606     61
# Column names
names(covid)[1:10]
 [1] "country"                        "date"                          
 [3] "total_cases"                    "new_cases"                     
 [5] "new_cases_smoothed"             "total_cases_per_million"       
 [7] "new_cases_per_million"          "new_cases_smoothed_per_million"
 [9] "total_deaths"                   "new_deaths"                    
# Structure
str(covid)
# Summary statistics
summary(covid[, c("total_cases", "total_deaths")])
  total_cases         total_deaths    
 Min.   :        0   Min.   :      0  
 1st Qu.:     9674   1st Qu.:     80  
 Median :    89168   Median :   1074  
 Mean   : 14698134   Mean   : 158357  
 3rd Qu.:  1113731   3rd Qu.:  13071  
 Max.   :779056637   Max.   :7111504  
 NA's   :12348       NA's   :12348    

Step 4: Filter Data for Selected Countries

We select 6 representative countries for multi-country visualizations.

  • filter() keeps only rows where country matches one of our selected countries
  • These countries represent different continents and pandemic experiences
selected_countries <- c("India", "United States", "Brazil",
                        "United Kingdom", "Germany", "South Africa")

covid_sel <- covid |>
  filter(country %in% selected_countries)

# Confirm filtering worked
unique(covid_sel$country)
[1] "Brazil"         "Germany"        "India"          "South Africa"  
[5] "United Kingdom" "United States" 
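
The filter above relies on %in% rather than ==. The toy vectors below (illustrative values only) sketch the difference: == compares position by position and recycles the shorter vector, so it can silently miss matching rows.

```r
# Toy vectors; the values are illustrative only
country <- c("India", "Brazil", "Germany", "India")
selected <- c("Brazil", "India")

# %in% asks, for each element of country, whether it appears anywhere in selected
keep_in <- country %in% selected

# == compares element by element, recycling selected as Brazil, India, Brazil, India
keep_eq <- country == selected

print(keep_in)
print(keep_eq)
```

Only the %in% version keeps every India and Brazil row, which is why it is the safer choice inside filter() when matching against several values.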

Visualization 1: Line Plot — Cases and Deaths Over Time (India)

What We Will Do

We plot new_cases_smoothed and new_deaths_smoothed for India over time.

  • Smoothed values (7-day rolling average) reduce day-to-day noise
  • We use a dual-axis line chart — cases on the left y-axis, deaths on the right
  • scale_factor scales the deaths line so both fit on the same plot

Step 5a: Prepare India Data

india <- covid |>
  filter(country == "India") |>
  select(date, new_cases_smoothed, new_deaths_smoothed) |>
  drop_na()

head(india)
# A tibble: 6 × 3
  date       new_cases_smoothed new_deaths_smoothed
  <date>                  <dbl>               <dbl>
1 2020-01-09                  0                   0
2 2020-01-10                  0                   0
3 2020-01-11                  0                   0
4 2020-01-12                  0                   0
5 2020-01-13                  0                   0
6 2020-01-14                  0                   0

Step 5b: Calculate Scale Factor for Dual Axis

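# Ratio of peak cases to peak deaths; multiplying deaths by this value
# lets both lines share the left y-axis
scale_factor <- max(india$new_cases_smoothed, na.rm = TRUE) /
                max(india$new_deaths_smoothed, na.rm = TRUE)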

scale_factor
[1] 93.38414
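
To make the role of the scale factor concrete, here is a minimal worked example with made-up numbers (not taken from the dataset). It shows the two directions of the transformation: deaths are multiplied by the factor to be drawn against the cases axis, and the secondary axis divides by the same factor to label them in their original units.

```r
# Illustrative daily values, not taken from the dataset
cases  <- c(1000, 50000, 400000, 120000)
deaths <- c(10, 400, 4000, 2500)

# Same definition as in the report: ratio of the two peaks
scale_factor <- max(cases) / max(deaths)

# What geom_line() draws for deaths, in units of the left (cases) axis
deaths_plotted <- deaths * scale_factor

# What a secondary axis defined as ~ . / scale_factor shows as labels
deaths_labelled <- deaths_plotted / scale_factor

print(scale_factor)
print(max(deaths_plotted))
```

With this choice of factor the two peaks reach the same height, so the relative vertical position of the two lines carries no meaning on its own; only the shape and timing of each curve should be compared.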

Step 5c: Plot Cases and Deaths Over Time

  • geom_line() draws the line for each metric
  • sec_axis() creates the secondary y-axis for deaths
  • scale_colour_manual() assigns custom colors to each line
  • labs() adds titles, axis labels, and legend title
ggplot(india, aes(x = date)) +
  geom_line(aes(y = new_cases_smoothed, colour = "New Cases"),
            linewidth = 0.8) +
  geom_line(aes(y = new_deaths_smoothed * scale_factor, colour = "New Deaths"),
            linewidth = 0.8, linetype = "dashed") +
  scale_y_continuous(
    name = "New Cases (smoothed)",
    labels = label_comma(),
    sec.axis = sec_axis(
      ~ . / scale_factor,
      name = "New Deaths (smoothed)",
      labels = label_comma()
    )
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "6 months") +
  scale_colour_manual(values = c("New Cases" = "#2196F3",
                                  "New Deaths" = "#F44336")) +
  labs(
    title = "COVID-19 Daily Cases and Deaths — India",
    subtitle = "7-day smoothed values | January 2020 to December 2025",
    x = "Date",
    colour = "Metric"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold")
  )

Interpretation

  • The line plot shows how COVID-19 cases and deaths changed over time in India.
  • The curves show multiple waves in which cases rise and then fall.
  • Deaths rise shortly after cases do, reflecting the delay between infection and outcome.
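
One possible way to quantify the delay between the two curves is to shift one series against the other and look for the shift with the highest correlation. The sketch below uses a synthetic wave in which the lag is set to 14 days by construction, so it only demonstrates the technique; the vectors, the 1% ratio, and the 14-day delay are assumptions, not results from the India data.

```r
# Synthetic wave: a single bell-shaped bump in daily cases over 120 days
day <- 1:120
cases <- 1000 * exp(-((day - 50)^2) / (2 * 10^2))

# Assumption for the sketch: deaths are 1% of cases, delayed by exactly 14 days
true_lag <- 14
deaths <- c(rep(0, true_lag), 0.01 * cases[1:(120 - true_lag)])

# Correlation between cases and deaths shifted back by each candidate lag
candidate_lags <- 0:30
lag_cor <- sapply(candidate_lags, function(k) {
  cor(cases[1:(120 - k)], deaths[(1 + k):120])
})

# The lag with the highest correlation is the estimated delay
best_lag <- candidate_lags[which.max(lag_cor)]
print(best_lag)
```

Applied to real smoothed series, the same loop would give only a rough estimate, since reporting practices and overlapping waves blur the relationship.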

Visualization 2: Faceted Plots — Multiple Countries

What We Will Do

We create one panel per country using facet_wrap().

  • Each panel shows new_cases_smoothed for one country
  • scales = "free_y" allows each panel to have its own y-axis scale
  • This keeps countries with smaller case counts from being flattened by larger ones (such as the United States)
  • ncol = 2 arranges panels in 2 columns
covid_sel |>
  select(country, date, new_cases_smoothed) |>
  drop_na() |>
  ggplot(aes(x = date, y = new_cases_smoothed, fill = country)) +
  geom_area(alpha = 0.6) +
  facet_wrap(~ country, scales = "free_y", ncol = 2) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "COVID-19 Daily New Cases — Multi-Country Faceted View",
    subtitle = "7-day smoothed values | Free y-axis scale per country",
    x = "Date",
    y = "New Cases (smoothed)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold")
  )
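
The effect of scales = "free_y" can be estimated with simple arithmetic. The peak values below are illustrative placeholders rather than numbers read from compact.csv; the sketch compares how much of the panel height each country's peak would occupy under a shared axis and under a free axis.

```r
# Illustrative peak values of new_cases_smoothed; not read from compact.csv
peak_cases <- c("United States" = 800000, "India" = 390000, "South Africa" = 20000)

# With a shared y-axis every panel runs up to the global maximum, so each
# country's peak reaches only this fraction of the panel height
height_shared <- peak_cases / max(peak_cases)

# With a free y-axis each panel is scaled (roughly) to its own maximum
height_free <- peak_cases / peak_cases

print(height_shared)
print(height_free)
```

The trade-off is that free axes make the shape of each wave readable but remove the ability to compare magnitudes across panels by eye, so the axis labels have to be read for each panel.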

Interpretation

  • United States: Persistently high waves across the entire period, with the largest absolute case count
  • Brazil: Severe and prolonged wave through 2021, associated mainly with the Gamma variant, with a high mortality impact
  • India: Sharp Delta spike in spring 2021 followed by a rapid decline
  • United Kingdom and Germany: Dominant Omicron surge in early 2022
  • South Africa: Earlier, shorter waves; the Omicron variant was first reported here

Visualization 3: Scatter Plot — Total Cases vs Total Deaths

What We Will Do

We plot total cumulative cases vs total cumulative deaths for every country.

  • We take the latest record per country with slice_max()
  • !is.na(continent) removes aggregated regional rows like “World” or “Asia”
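  • A log scale is applied to both axes because case and death counts span several orders of magnitude across countries
  • geom_smooth(method = "lm") adds a linear regression trend line, fitted to the log-transformed values because ggplot2 applies the scale transformation before computing the fit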
  • geom_text_repel() labels our 6 selected countries without overlapping

Step 7a: Get Latest Record Per Country

latest <- covid |>
  filter(!is.na(continent)) |>
  group_by(country) |>
  slice_max(date, n = 1) |>
  ungroup() |>
  select(country, continent, total_cases, total_deaths) |>
  drop_na() |>
  filter(total_cases > 0, total_deaths > 0)
head(latest)
# A tibble: 6 × 4
  country        continent total_cases total_deaths
  <chr>          <chr>           <dbl>        <dbl>
1 Afghanistan    Asia           235214         7998
2 Albania        Europe         337234         3608
3 Algeria        Africa         272435         6881
4 American Samoa Oceania          8359           34
5 Andorra        Europe          48015          159
6 Angola         Africa         107487         1937
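
The order of "take the latest row" and "drop missing values" can matter. The toy table below is hypothetical (two made-up countries, A and B) and sketches what happens when a country's newest row has a missing total: selecting the latest row first loses that country, while filtering first keeps its last reported value. Whether this affects the real data depends on how compact.csv fills its final rows, which is not checked here.

```r
library(dplyr)

# Hypothetical data: country B has no total on its newest date
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                   "2024-01-01", "2024-01-02", "2024-01-03")),
  total_cases = c(10, 20, 30, 5, 8, NA)
)

# Latest row first, NA filter second: B is lost entirely
latest_then_drop <- toy |>
  group_by(country) |>
  slice_max(date, n = 1, with_ties = FALSE) |>
  ungroup() |>
  filter(!is.na(total_cases))

# NA filter first, latest row second: B keeps its last reported value
drop_then_latest <- toy |>
  filter(!is.na(total_cases)) |>
  group_by(country) |>
  slice_max(date, n = 1, with_ties = FALSE) |>
  ungroup()

print(latest_then_drop)
print(drop_then_latest)
```

Setting with_ties = FALSE is an optional safeguard that guarantees one row per country even if a date appears twice.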

Step 7b: Plot the Scatter

ggplot(latest, aes(x = total_cases, y = total_deaths,
                   colour = continent)) +
  geom_point(alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey30",
              linewidth = 0.8, linetype = "dashed") +
  geom_text_repel(
    data = latest |> filter(country %in% selected_countries),
    aes(label = country),
    size = 3, fontface = "bold", max.overlaps = 10
  ) +
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_comma()) +
  scale_colour_brewer(palette = "Dark2") +
  labs(
    title = "Total COVID-19 Cases vs Total Deaths by Country",
    subtitle = "Log-log scale | Labelled: 6 highlighted countries",
    x = "Total Confirmed Cases (log scale)",
    y = "Total Deaths (log scale)",
    colour = "Continent"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold")
  )

Interpretation

  • The scatter plot shows the relationship between total cases and total deaths.
  • Countries with more total cases also tend to report more total deaths, a clear positive relationship on the log-log scale.
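
A straight line on log-log axes corresponds to a power-law relationship, and the slope of the regression is the exponent. The sketch below uses synthetic totals generated from an assumed power law (the constant 0.012 and exponent 0.95 are arbitrary choices for illustration) and recovers both values with lm() on log10-transformed data.

```r
# Synthetic totals following an exact power law: deaths = 0.012 * cases^0.95
total_cases  <- 10^(3:8)
total_deaths <- 0.012 * total_cases^0.95

# Regression on log10 values, mirroring a linear fit drawn on log-log axes
fit <- lm(log10(total_deaths) ~ log10(total_cases))

slope <- unname(coef(fit)[2])       # exponent of the power law
intercept <- unname(coef(fit)[1])   # log10 of the multiplicative constant

print(slope)
print(10^intercept)
```

A slope near 1 would mean that deaths scale roughly in proportion to cases, in which case 10 raised to the intercept can be read as an approximate overall deaths-per-case ratio.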

Visualization 4: Area Plot — Cumulative Cases Over Time

What We Will Do

We create a stacked area plot showing how cumulative cases grew over time for all 6 countries combined.

  • geom_area(position = "stack") stacks each country’s values on top of each other
  • This shows both the total combined scale and each country’s relative contribution
  • scale_fill_brewer(palette = "Set2") assigns distinct colors
covid_sel |>
  select(country, date, total_cases) |>
  drop_na() |>
  ggplot(aes(x = date, y = total_cases, fill = country)) +
  geom_area(position = "stack", alpha = 0.85) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "6 months") +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Cumulative COVID-19 Cases — Selected Countries",
    subtitle = "Stacked area chart | January 2020 to December 2025",
    x = "Date",
    y = "Cumulative Total Cases",
    fill = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold")
  )

Interpretation

  • The area plot shows the overall growth of COVID-19 cases over time.
  • It helps compare how much each country contributed to the total number of cases.
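
Stacking can be reproduced by hand to see what the chart encodes. In the sketch below (made-up cumulative totals for three countries on three dates), the upper edge of each band is the running sum of the countries beneath it, and each country's share is its own value divided by the combined total. The order in which ggplot2 stacks the bands may differ from the row order used here.

```r
# Made-up cumulative totals: rows are countries, columns are dates
cases <- rbind(
  India   = c(100, 250, 400),
  Brazil  = c(50, 120, 300),
  Germany = c(20, 60, 90)
)
colnames(cases) <- c("2020-06-01", "2020-12-01", "2021-06-01")

# Upper edge of each band: running sum down the rows, computed per date
band_top <- apply(cases, 2, cumsum)

# Each country's share of the combined total on each date
share <- sweep(cases, 2, colSums(cases), "/")

print(band_top)
print(round(share, 2))
```

The top edge of the last band equals the combined total, which is why the outline of a stacked area chart can be read as the sum across all selected countries.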

Summary

Visualization       Key Finding
Line plot (India)   Three waves; Delta was deadliest; deaths lag cases by ~2 weeks
Faceted plots       Each country had a unique pandemic trajectory
Scatter plot        Strong case-death correlation; outliers reflect healthcare capacity
Area plot           Burden concentrated in USA and India; wave-driven step growth

Conclusion

The analysis shows that COVID-19 trends differed across countries. Among the six selected countries, the United States recorded the highest number of cases, while India and Brazil also saw large surges. The visualizations show how the pandemic spread over time and how cases and deaths are related.

