library(tidyverse)
library(scales)
library(ggrepel)
LA1 - COVID-19 Data Visualization
USN
1NT24IS227 & 1NT24IS057
RPubs Link
https://rpubs.com/Stephen__/1422413
Introduction
This report analyzes the COVID-19 pandemic using the Our World in Data dataset.
The dataset contains:
- Daily records from 262 countries
- Date range: January 2020 to December 2025
- Key columns: `new_cases_smoothed`, `new_deaths_smoothed`, `total_cases`, `total_deaths`, `country`, `continent`, `date`
We will create four visualizations:
- Line plot of cases and deaths over time (India)
- Faceted plots for multiple countries
- Scatter plot: total cases vs total deaths
- Area plot for cumulative cases
Step 1: Load Libraries
We load the following libraries:
- `tidyverse`: a collection of packages for data manipulation and visualization (includes `ggplot2`, `dplyr`, `readr`)
- `scales`: used to format axis labels (e.g., comma-separated numbers)
- `ggrepel`: used to add non-overlapping text labels on plots
Step 2: Load the Dataset
We load the CSV file using read_csv().
- The `date` column is in `DD-MM-YYYY` format, so we use `as.Date()` with `format = "%d-%m-%Y"` to convert it properly.
- We use `head()` to preview the first few rows.
covid <- read_csv("compact.csv", show_col_types = FALSE)
# Convert date column from character to Date type
covid <- covid |>
  mutate(date = as.Date(date, format = "%d-%m-%Y"))
# Preview
head(covid[, 1:6])

# A tibble: 6 × 6
country date total_cases new_cases new_cases_smoothed
<chr> <date> <dbl> <dbl> <dbl>
1 Afghanistan 2020-01-01 NA NA NA
2 Afghanistan 2020-01-02 NA NA NA
3 Afghanistan 2020-01-03 NA NA NA
4 Afghanistan 2020-01-04 0 0 NA
5 Afghanistan 2020-01-05 0 0 NA
6 Afghanistan 2020-01-06 0 0 NA
# ℹ 1 more variable: total_cases_per_million <dbl>
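To illustrate why the explicit `format` matters in Step 2, here is a minimal sketch using made-up date strings (the vector `raw_dates` is hypothetical and not taken from the dataset). Without a `format`, `as.Date()` assumes a year-first layout, so day-first strings need to be described explicitly. Counting `NA` values afterwards is one simple way to confirm that every string matched.

```r
# Hypothetical day-first strings, laid out as described for the date column
raw_dates <- c("04-01-2020", "15-03-2021", "31-12-2025")

# Parse with an explicit day-month-year format
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

# Strings that do not match the format become NA, so count them as a check
n_failed <- sum(is.na(parsed))

format(parsed, "%Y-%m-%d")
n_failed
```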
Step 3: Explore the Dataset
Before plotting, we examine the structure of the dataset.
We check:
- Number of rows and columns
- Column names
- Data types
- Summary statistics
# Dimensions
dim(covid)
[1] 570606     61
# Column names
names(covid)[1:10]
 [1] "country" "date"
 [3] "total_cases" "new_cases"
 [5] "new_cases_smoothed" "total_cases_per_million"
 [7] "new_cases_per_million" "new_cases_smoothed_per_million"
 [9] "total_deaths" "new_deaths"
# Structure
str(covid)
# Summary statistics
summary(covid[, c("total_cases", "total_deaths")])
  total_cases         total_deaths
Min. : 0 Min. : 0
1st Qu.: 9674 1st Qu.: 80
Median : 89168 Median : 1074
Mean : 14698134 Mean : 158357
3rd Qu.: 1113731 3rd Qu.: 13071
Max. :779056637 Max. :7111504
NA's :12348 NA's :12348
Step 4: Filter Data for Selected Countries
We select 6 representative countries for multi-country visualizations.
- `filter()` keeps only rows where `country` matches one of our selected countries
- These countries represent different continents and pandemic experiences
selected_countries <- c("India", "United States", "Brazil",
                        "United Kingdom", "Germany", "South Africa")
covid_sel <- covid |>
  filter(country %in% selected_countries)
# Confirm filtering worked
unique(covid_sel$country)
[1] "Brazil" "Germany" "India" "South Africa"
[5] "United Kingdom" "United States"
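Beyond printing the unique names, a filter like the one in Step 4 can also be checked for names that silently failed to match, for example because of a spelling difference. The sketch below uses a made-up tibble `toy` and a hypothetical vector `wanted`; neither is part of the original analysis.

```r
library(dplyr)

# Toy data standing in for the full dataset (values are made up)
toy <- tibble(
  country     = c("India", "Brazil", "USA", "Germany"),
  total_cases = c(100, 200, 300, 400)
)

wanted <- c("India", "United States", "Germany")

toy_sel <- toy |>
  filter(country %in% wanted)

# Names that were requested but are absent from the filtered data
missing_names <- setdiff(wanted, unique(toy_sel$country))
missing_names
```

An empty result from `setdiff()` would indicate that every requested country was found.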
Visualization 1: Line Plot — Cases and Deaths Over Time (India)
What We Will Do
We plot new_cases_smoothed and new_deaths_smoothed for India over time.
- Smoothed values (7-day rolling average) reduce day-to-day noise
- We use a dual-axis line chart — cases on the left y-axis, deaths on the right
- `scale_factor` rescales the deaths series so both lines fit on the same plot
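As a rough illustration of what a 7-day rolling average does, the sketch below smooths a short made-up series. It assumes a trailing window (the current day plus the six before it); the dataset's exact smoothing definition is not restated in this report, so treat this as an approximation of the idea rather than a reproduction of the published columns.

```r
# Made-up daily counts for ten days
daily <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Trailing 7-day mean: each value averages the current day and the six before it.
# stats::filter is named explicitly because dplyr masks filter() once tidyverse is loaded.
smoothed <- as.numeric(stats::filter(daily, rep(1 / 7, 7), sides = 1))

# The first six entries are NA because a full 7-day window is not yet available
smoothed
```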
Step 5a: Prepare India Data
india <- covid |>
  filter(country == "India") |>
  select(date, new_cases_smoothed, new_deaths_smoothed) |>
  drop_na()
head(india)

# A tibble: 6 × 3
date new_cases_smoothed new_deaths_smoothed
<date> <dbl> <dbl>
1 2020-01-09 0 0
2 2020-01-10 0 0
3 2020-01-11 0 0
4 2020-01-12 0 0
5 2020-01-13 0 0
6 2020-01-14 0 0
Step 5b: Calculate Scale Factor for Dual Axis
scale_factor <- max(india$new_cases_smoothed, na.rm = TRUE) /
  max(india$new_deaths_smoothed, na.rm = TRUE)
scale_factor
[1] 93.38414
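The arithmetic behind the dual axis can be checked on a few made-up numbers. In this sketch the vectors `cases` and `deaths` are hypothetical values chosen to keep the ratio easy to follow. Multiplying deaths by the scale factor lifts the deaths peak to the height of the cases peak, and dividing by the same factor (as the secondary axis formula `~ . / scale_factor` does) recovers the original death counts for the axis labels.

```r
# Made-up values, chosen only to keep the arithmetic easy to follow
cases  <- c(0, 50000, 400000, 100000)
deaths <- c(0, 500, 4000, 1500)

# Ratio of the two peaks: 400000 / 4000
scale_factor <- max(cases) / max(deaths)

# Forward transform: what gets drawn against the left (cases) axis
deaths_plotted <- deaths * scale_factor

# Inverse transform: what the secondary axis labels show
deaths_labelled <- deaths_plotted / scale_factor

scale_factor
max(deaths_plotted)
```

Because the factor is derived from the two maxima, the tallest point of each line reaches the same height; the lines cannot be compared by height alone, only by shape and timing.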
Step 5c: Plot Cases and Deaths Over Time
- `geom_line()` draws the line for each metric
- `sec_axis()` creates the secondary y-axis for deaths
- `scale_colour_manual()` assigns custom colors to each line
- `labs()` adds titles, axis labels, and legend title
ggplot(india, aes(x = date)) +
geom_line(aes(y = new_cases_smoothed, colour = "New Cases"),
linewidth = 0.8) +
geom_line(aes(y = new_deaths_smoothed * scale_factor, colour = "New Deaths"),
linewidth = 0.8, linetype = "dashed") +
scale_y_continuous(
name = "New Cases (smoothed)",
labels = label_comma(),
sec.axis = sec_axis(
~ . / scale_factor,
name = "New Deaths (smoothed)",
labels = label_comma()
)
) +
scale_x_date(date_labels = "%b %Y", date_breaks = "6 months") +
scale_colour_manual(values = c("New Cases" = "#2196F3",
"New Deaths" = "#F44336")) +
labs(
title = "COVID-19 Daily Cases and Deaths — India",
subtitle = "7-day smoothed values | January 2020 to December 2025",
x = "Date",
colour = "Metric"
) +
theme_minimal() +
theme(
legend.position = "top",
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold")
)
Interpretation
- The line plot shows how COVID-19 cases and deaths changed over time in India.
- The graph shows multiple waves where cases rise and fall.
- Deaths increase after cases, showing a delay between infection and outcome.
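The delay between cases and deaths can be estimated rather than read off the chart. One possible approach, sketched below on synthetic data, is to shift the deaths series by each candidate lag and keep the lag with the highest correlation. The 14-day delay and the 1% ratio built into the synthetic series are assumptions made for the illustration, not results from the India data.

```r
# Synthetic series: one smooth wave of cases, with deaths following 14 days later.
# The 14-day delay and the 1% ratio are assumptions made for this illustration.
n <- 200
days <- seq_len(n)
cases <- 1000 * exp(-((days - 80)^2) / (2 * 15^2))
true_lag <- 14
deaths <- c(rep(0, true_lag), 0.01 * cases[seq_len(n - true_lag)])

# Correlation between cases and deaths shifted back by each candidate lag
candidate_lags <- 0:30
lag_cor <- sapply(candidate_lags, function(k) {
  cor(cases[seq_len(n - k)], deaths[(k + 1):n])
})

# The lag with the highest correlation is the estimated delay in days
best_lag <- candidate_lags[which.max(lag_cor)]
best_lag
```

On real data the peak correlation is usually less sharp, so the estimate is best read as an approximate range.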
Visualization 2: Faceted Plots — Multiple Countries
What We Will Do
We create one panel per country using facet_wrap().
- Each panel shows `new_cases_smoothed` for one country
- `scales = "free_y"` allows each panel to have its own y-axis scale
- This avoids smaller countries being flattened by larger ones (like USA)
- `ncol = 2` arranges panels in 2 columns
covid_sel |>
select(country, date, new_cases_smoothed) |>
drop_na() |>
ggplot(aes(x = date, y = new_cases_smoothed, fill = country)) +
geom_area(alpha = 0.6) +
facet_wrap(~ country, scales = "free_y", ncol = 2) +
scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
scale_y_continuous(labels = label_comma()) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "COVID-19 Daily New Cases — Multi-Country Faceted View",
subtitle = "7-day smoothed values | Free y-axis scale per country",
x = "Date",
y = "New Cases (smoothed)"
) +
theme_minimal() +
theme(
legend.position = "none",
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold")
)
Interpretation
- United States: Persistent high waves across the entire period, with the largest absolute case count of the six
- Brazil: Severe and prolonged 2021 wave with high mortality impact, generally attributed to the Gamma variant rather than Delta
- India: Iconic sharp Delta spike followed by rapid decline
- United Kingdom and Germany: Dominant Omicron surge in early 2022
- South Africa: Earlier, shorter waves; the Omicron variant was first reported from here
Visualization 3: Scatter Plot — Total Cases vs Total Deaths
What We Will Do
We plot total cumulative cases vs total cumulative deaths for every country.
- We take the latest available record per country using `slice_max()`
- `!is.na(continent)` removes aggregated regional rows like "World" or "Asia"
- Log scale is applied to both axes because case counts vary from thousands to hundreds of millions
- `geom_smooth(method = "lm")` adds a linear regression trend line
- `geom_text_repel()` labels our 6 selected countries without overlapping
Step 7a: Get Latest Record Per Country
latest <- covid |>
  filter(!is.na(continent)) |>
  group_by(country) |>
  slice_max(date, n = 1) |>
  ungroup() |>
  select(country, continent, total_cases, total_deaths) |>
  drop_na() |>
  filter(total_cases > 0, total_deaths > 0)
head(latest)

# A tibble: 6 × 4
country continent total_cases total_deaths
<chr> <chr> <dbl> <dbl>
1 Afghanistan Asia 235214 7998
2 Albania Europe 337234 3608
3 Algeria Africa 272435 6881
4 American Samoa Oceania 8359 34
5 Andorra Europe 48015 159
6 Angola Africa 107487 1937
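One detail of Step 7a worth being aware of: taking the latest row first and dropping missing values afterwards removes any country whose most recent row has no totals. Whether that affects the real dataset has not been checked here. The sketch below uses a made-up tibble `toy` to show the difference between the two orderings.

```r
library(dplyr)

# Toy data: country B has no total on its most recent date (values are made up)
toy <- tibble(
  country     = c("A", "A", "B", "B"),
  date        = as.Date(c("2025-01-01", "2025-01-02", "2025-01-01", "2025-01-02")),
  total_cases = c(10, 20, 5, NA)
)

# Order used in Step 7a: take the latest row, then drop missing values
latest_then_drop <- toy |>
  group_by(country) |>
  slice_max(date, n = 1) |>
  ungroup() |>
  filter(!is.na(total_cases))

# Alternative: drop missing values first, then take the latest remaining row
drop_then_latest <- toy |>
  filter(!is.na(total_cases)) |>
  group_by(country) |>
  slice_max(date, n = 1) |>
  ungroup()

nrow(latest_then_drop)
nrow(drop_then_latest)
```

With the alternative ordering, country B is kept using its last reported total instead of being dropped.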
Step 7b: Plot the Scatter
ggplot(latest, aes(x = total_cases, y = total_deaths,
colour = continent)) +
geom_point(alpha = 0.7, size = 2.5) +
geom_smooth(method = "lm", se = TRUE, colour = "grey30",
linewidth = 0.8, linetype = "dashed") +
geom_text_repel(
data = latest |> filter(country %in% selected_countries),
aes(label = country),
size = 3, fontface = "bold", max.overlaps = 10
) +
scale_x_log10(labels = label_comma()) +
scale_y_log10(labels = label_comma()) +
scale_colour_brewer(palette = "Dark2") +
labs(
title = "Total COVID-19 Cases vs Total Deaths by Country",
subtitle = "Log-log scale | Labelled: 6 highlighted countries",
x = "Total Confirmed Cases (log scale)",
y = "Total Deaths (log scale)",
colour = "Continent"
) +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(face = "bold")
)
Interpretation
- The scatter plot shows the relationship between total cases and total deaths.
- As the number of cases increases, the number of deaths also increases.
- This shows a clear positive relationship between cases and deaths.
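Because ggplot2 applies scale transformations before computing statistics, the trend line on log-scaled axes is a straight-line fit in log-log space. The relationship can be quantified with an equivalent model fitted by hand. The sketch below uses synthetic values in which deaths are exactly 1% of cases (an assumption for illustration only): the slope then describes how deaths scale with cases, and the back-transformed intercept gives the implied deaths-per-case ratio.

```r
# Synthetic countries where deaths are exactly 1% of cases (an assumption for illustration)
cases  <- c(1e3, 1e4, 1e5, 1e6, 1e7)
deaths <- 0.01 * cases

# Linear model on log10-transformed values, mirroring a fit on log-scaled axes
fit <- lm(log10(deaths) ~ log10(cases))
slope <- unname(coef(fit)[2])
intercept <- unname(coef(fit)[1])

slope          # close to 1: deaths scale proportionally with cases
10^intercept   # close to 0.01: the implied deaths-per-case ratio
```

On the real data, a slope different from 1 would suggest that the deaths-per-case ratio changes with the size of the outbreak.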
Visualization 4: Area Plot — Cumulative Cases Over Time
What We Will Do
We create a stacked area plot showing how cumulative cases grew over time for all 6 countries combined.
- `geom_area(position = "stack")` stacks each country's values on top of each other
- This shows both the total combined scale and each country's relative contribution
- `scale_fill_brewer(palette = "Set2")` assigns distinct colors
covid_sel |>
select(country, date, total_cases) |>
drop_na() |>
ggplot(aes(x = date, y = total_cases, fill = country)) +
geom_area(position = "stack", alpha = 0.85) +
scale_x_date(date_labels = "%b %Y", date_breaks = "6 months") +
scale_y_continuous(labels = label_comma()) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Cumulative COVID-19 Cases — Selected Countries",
subtitle = "Stacked area chart | January 2020 to December 2025",
x = "Date",
y = "Cumulative Total Cases",
fill = "Country"
) +
theme_minimal() +
theme(
legend.position = "right",
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold")
)
Interpretation
- The area plot shows the overall growth of COVID-19 cases over time.
- It helps compare how much each country contributed to the total number of cases.
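The contribution of each country can also be expressed as a number. The sketch below uses made-up totals for three hypothetical countries on a single date: each band's thickness relative to the full stack height is that country's share, and the running sum gives the upper edge of each band.

```r
# Made-up cumulative totals on a single date for three hypothetical countries
totals <- c(A = 600, B = 300, C = 100)

# Share of the combined total: band thickness relative to the full stack height
shares <- totals / sum(totals)

# Upper edge of each stacked band (band order here is illustrative; ggplot2 picks its own)
stack_tops <- cumsum(totals)

round(100 * shares, 1)
stack_tops
```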
Summary
| Visualization | Key Finding |
|---|---|
| Line plot (India) | Three waves; Delta was deadliest; deaths lag cases by ~2 weeks |
| Faceted plots | Each country had a unique pandemic trajectory |
| Scatter plot | Strong positive case-death relationship; outliers may reflect differences in healthcare capacity or reporting |
| Area plot | Burden concentrated in USA and India; wave-driven step growth |
Conclusion
The analysis shows that COVID-19 trends differed across countries. Among the six selected countries, the United States recorded the highest number of cases, while India and Brazil also showed large increases. The visualizations show how the pandemic spread over time and how cases and deaths are related.
References
- Our World in Data. (2025). Coronavirus (COVID-19) Deaths. https://ourworldindata.org/covid-deaths