LA1 - COVID-19 Data Visualization

Author

Stephen George & Battini Jeevan Kumar

USN

1NT24IS227 & 1NT24IS057

RPubs Link

https://rpubs.com/Stephen__/1422413

Introduction

This report analyzes the COVID-19 pandemic using the Our World in Data dataset.

The dataset contains:

Daily records for 262 locations (countries, territories, and aggregate rows such as “World”)
Date range: January 2020 to December 2025
Key columns: new_cases_smoothed, new_deaths_smoothed, total_cases, total_deaths, country, continent, date

We will create four visualizations:

Line plot of cases and deaths over time (India)
Faceted plots for multiple countries
Scatter plot: total cases vs total deaths
Area plot for cumulative cases

Step 1: Load Libraries

We load the following libraries:

tidyverse — a collection of packages for data manipulation and visualization (includes ggplot2, dplyr, readr)
scales — used to format axis labels (e.g., comma-separated numbers)
ggrepel — used to add non-overlapping text labels on plots

library(tidyverse)
library(scales)
library(ggrepel)

Step 2: Load the Dataset

We load the CSV file using read_csv().

The date column is in DD-MM-YYYY format, so we use as.Date() with format = "%d-%m-%Y" to convert it properly.
We use head() to preview the first few rows.

covid <- read_csv("compact.csv", show_col_types = FALSE)

# Convert date column from character to Date type
covid <- covid |>
  mutate(date = as.Date(date, format = "%d-%m-%Y"))

# Preview
head(covid[, 1:6])

# A tibble: 6 × 6
  country     date       total_cases new_cases new_cases_smoothed
  <chr>       <date>           <dbl>     <dbl>              <dbl>
1 Afghanistan 2020-01-01          NA        NA                 NA
2 Afghanistan 2020-01-02          NA        NA                 NA
3 Afghanistan 2020-01-03          NA        NA                 NA
4 Afghanistan 2020-01-04           0         0                 NA
5 Afghanistan 2020-01-05           0         0                 NA
6 Afghanistan 2020-01-06           0         0                 NA
# ℹ 1 more variable: total_cases_per_million <dbl>
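A mismatched format string does not always raise an error in as.Date(); it can return NA for some rows and silently misread others. The following is a minimal sketch on made-up date strings (raw_dates is illustrative and not taken from compact.csv) showing one way to check the conversion:

```r
# Hypothetical sample strings in DD-MM-YYYY form (not from the dataset)
raw_dates <- c("01-01-2020", "15-03-2021", "31-12-2025")

# Correct format string: every value parses
parsed_ok <- as.Date(raw_dates, format = "%d-%m-%Y")

# Day and month swapped: "15" and "31" are not valid months, so those rows
# become NA, while the ambiguous "01-01-2020" still parses
parsed_bad <- as.Date(raw_dates, format = "%m-%d-%Y")

# Count parse failures under each format
sum(is.na(parsed_ok))
sum(is.na(parsed_bad))

# The parsed range should fall inside the expected reporting window
range(parsed_ok)
```

Checking both the NA count and the date range after conversion catches most format mistakes before they reach a plot.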

Step 3: Explore the Dataset

Before plotting, we examine the structure of the dataset.

We check:

Number of rows and columns
Column names
Data types
Summary statistics

# Dimensions
dim(covid)

[1] 570606     61

# Column names
names(covid)[1:10]

 [1] "country"                        "date"                          
 [3] "total_cases"                    "new_cases"                     
 [5] "new_cases_smoothed"             "total_cases_per_million"       
 [7] "new_cases_per_million"          "new_cases_smoothed_per_million"
 [9] "total_deaths"                   "new_deaths"

# Structure
str(covid)

# Summary statistics
summary(covid[, c("total_cases", "total_deaths")])

  total_cases         total_deaths    
 Min.   :        0   Min.   :      0  
 1st Qu.:     9674   1st Qu.:     80  
 Median :    89168   Median :   1074  
 Mean   : 14698134   Mean   : 158357  
 3rd Qu.:  1113731   3rd Qu.:  13071  
 Max.   :779056637   Max.   :7111504  
 NA's   :12348       NA's   :12348
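The NA's row in the summary can also be obtained directly for any set of columns. This is a small sketch on a toy data frame (toy and its values are made up; it only stands in for covid):

```r
# Toy data frame standing in for `covid` (values are made up)
toy <- data.frame(
  total_cases  = c(NA, 0, 5, 12),
  total_deaths = c(NA, 0, NA, 1)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Share of rows that are missing, per column
na_share <- colMeans(is.na(toy))
na_share
```

Knowing how many values are missing helps when judging how much data drop_na() will remove in later steps.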

Step 4: Filter Data for Selected Countries

We select 6 representative countries for multi-country visualizations.

filter() keeps only rows where country matches one of our selected countries
These countries represent different continents and pandemic experiences

selected_countries <- c("India", "United States", "Brazil",
                        "United Kingdom", "Germany", "South Africa")

covid_sel <- covid |>
  filter(country %in% selected_countries)

# Confirm filtering worked
unique(covid_sel$country)

[1] "Brazil"         "Germany"        "India"          "South Africa"  
[5] "United Kingdom" "United States"
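Since %in% silently ignores names that do not occur in the data, a misspelled or differently formatted country name would simply vanish from the result. A minimal sketch of a check, using illustrative vectors (available and requested are hypothetical names, not part of the original code):

```r
# Hypothetical list of names present in the data (illustrative only)
available <- c("Brazil", "Germany", "India", "South Africa",
               "United Kingdom", "United States")

# Requested names, including one deliberate mismatch ("USA")
requested <- c("India", "USA", "Brazil")

# Names that were requested but do not appear in the data
missing_names <- setdiff(requested, available)
missing_names
```

An empty result means every requested country was matched.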

Visualization 1: Line Plot — Cases and Deaths Over Time (India)

What We Will Do

We plot new_cases_smoothed and new_deaths_smoothed for India over time.

Smoothed values (7-day rolling average) reduce day-to-day noise
We use a dual-axis line chart — cases on the left y-axis, deaths on the right
scale_factor (the ratio of peak cases to peak deaths) rescales the deaths line so both series span the same vertical range
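To illustrate what a 7-day rolling average does, here is a small sketch on made-up daily counts. It assumes a trailing window (the current day plus the six before it); the exact window alignment used in the dataset is an assumption here, not something this report verifies.

```r
# Made-up daily counts with a weekly reporting dip (not real data)
daily <- c(10, 12, 9, 11, 13, 2, 1, 14, 15, 12, 16, 17, 3, 2)

# Trailing 7-day mean: each value averages the current day and the 6 before it.
# stats::filter is called explicitly because dplyr::filter masks it.
smoothed <- as.numeric(stats::filter(daily, rep(1 / 7, 7), sides = 1))

# The first 6 entries are NA because a full 7-day window is not yet available
smoothed
```

The weekend dips in daily are averaged out in smoothed, which is why the smoothed columns give cleaner lines.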

Step 5a: Prepare India Data

india <- covid |>
  filter(country == "India") |>
  select(date, new_cases_smoothed, new_deaths_smoothed) |>
  drop_na()

head(india)

# A tibble: 6 × 3
  date       new_cases_smoothed new_deaths_smoothed
  <date>                  <dbl>               <dbl>
1 2020-01-09                  0                   0
2 2020-01-10                  0                   0
3 2020-01-11                  0                   0
4 2020-01-12                  0                   0
5 2020-01-13                  0                   0
6 2020-01-14                  0                   0

Step 5b: Calculate Scale Factor for Dual Axis

scale_factor <- max(india$new_cases_smoothed, na.rm = TRUE) /
                max(india$new_deaths_smoothed, na.rm = TRUE)

scale_factor

[1] 93.38414
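The role of the scale factor is easier to see on round numbers. The sketch below uses made-up values (cases, deaths, and sf are illustrative, not the India figures) to show the forward transform applied to the plotted line and the inverse transform applied to the secondary axis labels:

```r
# Made-up series for illustration (not the India values)
cases  <- c(0, 500, 2000, 800)
deaths <- c(0, 4, 20, 10)

# Ratio of the two maxima, computed the same way as in Step 5b
sf <- max(cases) / max(deaths)   # 2000 / 20

# Forward transform: deaths stretched onto the cases axis for plotting
deaths_scaled <- deaths * sf

# Inverse transform: the same division a secondary axis uses for its labels
deaths_back <- deaths_scaled / sf

sf
deaths_scaled
deaths_back
```

Because the secondary axis divides by the same factor, its labels show the original death counts even though the line is drawn in case units.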

Step 5c: Plot Cases and Deaths Over Time

geom_line() draws the line for each metric
sec_axis() creates the secondary y-axis for deaths
scale_colour_manual() assigns custom colors to each line
labs() adds titles, axis labels, and legend title

ggplot(india, aes(x = date)) +
  geom_line(aes(y = new_cases_smoothed, colour = "New Cases"),
            linewidth = 0.8) +
  geom_line(aes(y = new_deaths_smoothed * scale_factor, colour = "New Deaths"),
            linewidth = 0.8, linetype = "dashed") +
  scale_y_continuous(
    name = "New Cases (smoothed)",
    labels = label_comma(),
    sec.axis = sec_axis(
      ~ . / scale_factor,
      name = "New Deaths (smoothed)",
      labels = label_comma()
    )
  ) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "6 months") +
  scale_colour_manual(values = c("New Cases" = "#2196F3",
                                  "New Deaths" = "#F44336")) +
  labs(
    title = "COVID-19 Daily Cases and Deaths — India",
    subtitle = "7-day smoothed values | January 2020 to December 2025",
    x = "Date",
    colour = "Metric"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold")
  )

Interpretation

The line plot shows how COVID-19 cases and deaths changed over time in India.
The graph shows multiple waves where cases rise and fall.
Deaths increase after cases, showing a delay between infection and outcome.
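One way to put a number on such a delay is a cross-correlation between the two series. The sketch below is not part of the original analysis: it uses synthetic data with a 14-day delay built in (an assumption made purely for illustration) and base R's ccf() to recover it.

```r
# Synthetic series with a built-in 14-day delay (illustrative assumption)
days   <- 1:200
cases  <- 1000 * exp(-((days - 80) / 10)^2)         # a single wave
deaths <- 0.01 * c(rep(0, 14), head(cases, -14))    # same wave, 14 days later

# Cross-correlation between deaths[t + k] and cases[t] for lags k = -30..30
cc <- ccf(deaths, cases, lag.max = 30, plot = FALSE)

# Lag with the highest correlation; a positive value means deaths trail cases
est_lag <- cc$lag[which.max(cc$acf)]
est_lag
```

Applied to the india data frame, the same call would use new_deaths_smoothed and new_cases_smoothed; real data with several waves gives a less clean peak than this toy example.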

Visualization 2: Faceted Plots — Multiple Countries

What We Will Do

We create one panel per country using facet_wrap().

Each panel shows new_cases_smoothed for one country
scales = "free_y" allows each panel to have its own y-axis scale
This prevents countries with fewer cases from being flattened by those with far more (such as the United States)
ncol = 2 arranges panels in 2 columns

covid_sel |>
  select(country, date, new_cases_smoothed) |>
  drop_na() |>
  ggplot(aes(x = date, y = new_cases_smoothed, fill = country)) +
  geom_area(alpha = 0.6) +
  facet_wrap(~ country, scales = "free_y", ncol = 2) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "COVID-19 Daily New Cases — Multi-Country Faceted View",
    subtitle = "7-day smoothed values | Free y-axis scale per country",
    x = "Date",
    y = "New Cases (smoothed)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold")
  )

Interpretation

United States: Persistent high waves across the entire period, with the largest absolute case count
Brazil: Severe and prolonged wave in 2021, associated mainly with the Gamma variant rather than Delta, with high mortality
India: Sharp Delta spike in 2021 followed by a rapid decline
United Kingdom and Germany: Dominant Omicron surge in early 2022
South Africa: Earlier, shorter waves; the Omicron variant was first reported from here

Visualization 3: Scatter Plot — Total Cases vs Total Deaths

What We Will Do

We plot total cumulative cases vs total cumulative deaths for every country.

We take the latest available record per country with slice_max()
!is.na(continent) removes aggregated regional rows like “World” or “Asia”
Log scale is applied to both axes because case counts vary from thousands to hundreds of millions
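geom_smooth(method = "lm") adds a linear regression trend line; because the log scales are applied before the fit, the line is fitted to the log-transformed values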
geom_text_repel() labels our 6 selected countries without overlapping
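A straight line on log-log axes corresponds to a power-law relationship between the two variables. The sketch below shows this on synthetic data (the multiplier 0.02 and the exponent 0.9 are made-up numbers, not estimates from the dataset) by fitting lm() to log10-transformed values:

```r
# Synthetic power-law data: deaths = 0.02 * cases^0.9 (made-up coefficients)
cases  <- 10^(3:8)
deaths <- 0.02 * cases^0.9

# Fit on log10-transformed values, mirroring a linear fit on log-log axes
fit <- lm(log10(deaths) ~ log10(cases))

# Slope is the power-law exponent; intercept is log10 of the multiplier
coef(fit)
```

A slope close to 1 would mean deaths grow roughly in proportion to cases, while a slope below 1 would mean deaths grow more slowly than cases.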

Step 7a: Get Latest Record Per Country

latest <- covid |>
  filter(!is.na(continent)) |>
  group_by(country) |>
  slice_max(date, n = 1) |>
  ungroup() |>
  select(country, continent, total_cases, total_deaths) |>
  drop_na() |>
  filter(total_cases > 0, total_deaths > 0)
head(latest)

# A tibble: 6 × 4
  country        continent total_cases total_deaths
  <chr>          <chr>           <dbl>        <dbl>
1 Afghanistan    Asia           235214         7998
2 Albania        Europe         337234         3608
3 Algeria        Africa         272435         6881
4 American Samoa Oceania          8359           34
5 Andorra        Europe          48015          159
6 Angola         Africa         107487         1937
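The order of slice_max() and drop_na() matters when a country's most recent row has missing totals. The toy example below (country names and values are made up) compares the order used above with a possible alternative that removes missing values first; whether this affects any country in the real dataset is not checked here.

```r
library(dplyr)
library(tidyr)

# Toy data: country "B" has a missing value on its latest date
toy <- tibble(
  country     = c("A", "A", "B", "B"),
  date        = as.Date(c("2025-01-01", "2025-01-02",
                          "2025-01-01", "2025-01-02")),
  total_cases = c(10, 12, 5, NA)
)

# Latest row first, then drop NA: "B" disappears entirely
latest_then_drop <- toy |>
  group_by(country) |>
  slice_max(date, n = 1) |>
  ungroup() |>
  drop_na()

# Drop NA first, then latest row: "B" keeps its last non-missing record
drop_then_latest <- toy |>
  drop_na(total_cases) |>
  group_by(country) |>
  slice_max(date, n = 1, with_ties = FALSE) |>
  ungroup()

latest_then_drop
drop_then_latest
```

Comparing the number of rows from both orders on the real data would show whether any country is lost this way.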

Step 7b: Plot the Scatter

ggplot(latest, aes(x = total_cases, y = total_deaths,
                   colour = continent)) +
  geom_point(alpha = 0.7, size = 2.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey30",
              linewidth = 0.8, linetype = "dashed") +
  geom_text_repel(
    data = latest |> filter(country %in% selected_countries),
    aes(label = country),
    size = 3, fontface = "bold", max.overlaps = 10
  ) +
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_comma()) +
  scale_colour_brewer(palette = "Dark2") +
  labs(
    title = "Total COVID-19 Cases vs Total Deaths by Country",
    subtitle = "Log-log scale | Labelled: 6 highlighted countries",
    x = "Total Confirmed Cases (log scale)",
    y = "Total Deaths (log scale)",
    colour = "Continent"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold")
  )

Interpretation

The scatter plot shows the relationship between total cases and total deaths on log-log axes.
Countries with more cumulative cases also report more cumulative deaths.
This indicates a clear positive relationship between cases and deaths.

Visualization 4: Area Plot — Cumulative Cases Over Time

What We Will Do

We create a stacked area plot showing how cumulative cases grew over time for all 6 countries combined.

geom_area(position = "stack") stacks each country’s values on top of each other
This shows both the total combined scale and each country’s relative contribution
scale_fill_brewer(palette = "Set2") assigns distinct colors
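The two quantities a stacked area chart conveys, the combined total and each country's share of it, can also be computed as numbers. A minimal sketch on made-up values (the countries "A", "B", "C" and their counts are hypothetical):

```r
library(dplyr)

# Made-up cumulative counts for two dates (not real data)
toy <- tibble(
  country     = rep(c("A", "B", "C"), times = 2),
  date        = as.Date(rep(c("2021-01-01", "2022-01-01"), each = 3)),
  total_cases = c(100, 300, 600, 500, 500, 1000)
)

# Per date: combined total (the top of the stack) and each country's share
shares <- toy |>
  group_by(date) |>
  mutate(stack_total = sum(total_cases),
         share = total_cases / stack_total) |>
  ungroup()

shares
```

The stack_total column matches the upper edge of the stacked chart, and share gives the relative thickness of each country's band on that date.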

covid_sel |>
  select(country, date, total_cases) |>
  drop_na() |>
  ggplot(aes(x = date, y = total_cases, fill = country)) +
  geom_area(position = "stack", alpha = 0.85) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "6 months") +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Cumulative COVID-19 Cases — Selected Countries",
    subtitle = "Stacked area chart | January 2020 to December 2025",
    x = "Date",
    y = "Cumulative Total Cases",
    fill = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold")
  )

Interpretation

The area plot shows the overall growth of COVID-19 cases over time.
It helps compare how much each country contributed to the total number of cases.

Summary

Visualization	Key Finding
Line plot (India)	Three waves; Delta was deadliest; deaths lag cases by ~2 weeks
Faceted plots	Each country had a unique pandemic trajectory
Scatter plot	Strong case-death correlation; outliers may reflect differences in healthcare capacity or reporting
Area plot	Burden concentrated in the United States and India; wave-driven step growth

Conclusion

The analysis shows that COVID-19 trends were different across countries. The United States recorded the highest number of cases, while India and Brazil also showed significant increases. The visualizations help us understand how the pandemic spread over time and how cases and deaths are related.

References

Our World in Data. (2025). Coronavirus (COVID-19) Deaths. https://ourworldindata.org/covid-deaths