Group members
Ung Keng Jie 25084156
Nyo Jing Jie 24228485
Yim Wen Jun 24201054
Yu Tao 17216097
Shao Jin Bo 25086040
The COVID-19 pandemic has affected countries around the world and created many challenges for public health services. During this period, data science has played an important role in tracking the spread of the virus, understanding infection patterns, and helping decision-makers especially governments to allocate resources more effectively.
This project uses the “Corona Virus Report” dataset from Kaggle. The dataset contains the cumulative number of confirmed cases, deaths, and recoveries recorded each day in many countries and regions during the COVID-19 pandemic. The dataset has 49,068 rows and 10 columns, which gives a total of 490,680 data points. Therefore, it meets the project requirement of having more than 100,000 dimensions.
This project aims to transform raw and unstructured pandemic data into actionable insights by cleaning the data, performing exploratory data analysis (EDA) and modeling the data using machine learning techniques.
To accomplish this, we have stated two different types of analytical problems, according to the technical needs of the project:
Through these dual approaches, this project demonstrates a complete data science pipeline from raw data tidying to advanced predictive evaluation and forecasting.
# Load necessary library
library(readr)
covid_data <- read_csv("covid_19_clean_complete.csv")
dim(covid_data)## [1] 49068 10
First, we load the necessary libraries and the raw dataset. We will
use the tidyverse ecosystem for efficient data manipulation
and lubridate for handling dates.
# Load tidyverse (which includes readr, dplyr, tidyr, ggplot2, etc.)
library(tidyverse)
PATH <- "covid_19_clean_complete.csv"
df <- read_csv(PATH, show_col_types = FALSE)
cat("Shape :", nrow(df), "x", ncol(df), "\n")## Shape : 49068 x 10
## Columns : Province/State, Country/Region, Lat, Long, Date, Confirmed, Deaths, Recovered, Active, WHO Region
## Date range: 2020-01-22 -> 2020-07-27
## Countries : 187
## # A tibble: 5 × 10
## `Province/State` `Country/Region` Lat Long Date Confirmed Deaths
## <chr> <chr> <dbl> <dbl> <date> <dbl> <dbl>
## 1 <NA> Afghanistan 33.9 67.7 2020-01-22 0 0
## 2 <NA> Albania 41.2 20.2 2020-01-22 0 0
## 3 <NA> Algeria 28.0 1.66 2020-01-22 0 0
## 4 <NA> Andorra 42.5 1.52 2020-01-22 0 0
## 5 <NA> Angola -11.2 17.9 2020-01-22 0 0
## # ℹ 3 more variables: Recovered <dbl>, Active <dbl>, `WHO Region` <chr>
## Rows: 49,068
## Columns: 10
## $ `Province/State` <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Australian Capital T…
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Ango…
## $ Lat <dbl> 33.93911, 41.15330, 28.03390, 42.50630, -11.20270, 17…
## $ Long <dbl> 67.709953, 20.168300, 1.659600, 1.521800, 17.873900, …
## $ Date <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-22, 2020…
## $ Confirmed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Recovered <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Active <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `WHO Region` <chr> "Eastern Mediterranean", "Europe", "Africa", "Europe"…
## Province/State Country/Region Lat Long
## Length:49068 Length:49068 Min. :-51.796 Min. :-135.00
## Class :character Class :character 1st Qu.: 7.873 1st Qu.: -15.31
## Mode :character Mode :character Median : 23.634 Median : 21.75
## Mean : 21.434 Mean : 23.53
## 3rd Qu.: 41.204 3rd Qu.: 80.77
## Max. : 71.707 Max. : 178.06
## Date Confirmed Deaths Recovered
## Min. :2020-01-22 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.:2020-03-08 1st Qu.: 4 1st Qu.: 0.0 1st Qu.: 0
## Median :2020-04-24 Median : 168 Median : 2.0 Median : 29
## Mean :2020-04-24 Mean : 16885 Mean : 884.2 Mean : 7916
## 3rd Qu.:2020-06-10 3rd Qu.: 1518 3rd Qu.: 30.0 3rd Qu.: 666
## Max. :2020-07-27 Max. :4290259 Max. :148011.0 Max. :1846641
## Active WHO Region
## Min. : -14 Length:49068
## 1st Qu.: 0 Class :character
## Median : 26 Mode :character
## Mean : 8085
## 3rd Qu.: 606
## Max. :2816444
## Missing Values by Column:
## Province/State Country/Region Lat Long Date
## 34404 0 0 0 0
## Confirmed Deaths Recovered Active WHO Region
## 0 0 0 0 0
##
## Duplicate Rows: 0
Data quality problem always the first to identify during the initial
investigation like inspect missing value for the Province/State and some
minor inconsistencies in the Active cases field. To ensure
the data was reliable and suitable for machine learning applications, a
data pre-processing pipeline was implemented. For example, obsrved
missing values were checked for duplication before transformation.
Based on the exploration above, we will tidy and transform the
dataset using dplyr and tidyr packages from
the tidyverse ecosystem. The full-cleaning process is as follows:
Date column to a valid Date object and renaming columns
(e.g., Country/Region to Country) for easier
coding access.NA
values in the Province column with an empty string, and
filling NAs in numerical columns (Confirmed,
Deaths, Recovered) with 0.Active cases where missing, and using the
pmax() function to clip any negative values to
0 (removing anomalies).Death_Rate and
Recovery_Rate.Country and Date to summarize provincial data
into national daily totals (country_day).latest) to summarize the global pandemic
situation.| Variable | Description |
|---|---|
| Confirmed | Total confirmed COVID-19 cases |
| Deaths | Total COVID-19 related deaths |
| Recovered | Total recovered patients |
| Active | Current active COVID-19 cases |
| Death_Rate | Percentage of deaths among confirmed cases |
| Recovery_Rate | Percentage of recovered cases among confirmed cases |
# Step 1-4: Basic Data Cleaning & Feature Engineering
df_clean <- df %>%
# Parse dates
mutate(Date = as.Date(Date)) %>%
# Rename columns for convenience
rename(
Country = `Country/Region`,
Province = `Province/State`
) %>%
# Fill missing values
replace_na(list(Province = "", Confirmed = 0, Deaths = 0, Recovered = 0)) %>%
# Handle Active column & clip(lower = 0) to remove negative anomalies
mutate(
Active = ifelse(
is.na(Active),
Confirmed - Deaths - Recovered,
Active
),
Active = pmax(Active, 0)
) %>%
# Derived metrics: Death Rate and Recovery Rate
mutate(
Death_Rate = ifelse(Confirmed > 0, (Deaths / Confirmed) * 100, 0),
Recovery_Rate = ifelse(Confirmed > 0, (Recovered / Confirmed) * 100, 0)
)
# Step 5: Aggregate by Country + Date (remove province duplicates)
country_day <- df_clean %>%
group_by(Country, Date) %>%
summarise(
Confirmed = sum(Confirmed, na.rm = TRUE),
Deaths = sum(Deaths, na.rm = TRUE),
Recovered = sum(Recovered, na.rm = TRUE),
Active = sum(Active, na.rm = TRUE),
.groups = 'drop' # Ungroup after summarizing
) %>%
# Recalculate rates after aggregation
mutate(
Death_Rate = ifelse(Confirmed > 0, (Deaths / Confirmed) * 100, 0),
Recovery_Rate = ifelse(Confirmed > 0, (Recovered / Confirmed) * 100, 0)
)
# Step 6: Extract the Latest snapshot
latest <- country_day %>%
filter(Date == max(Date)) %>%
arrange(desc(Confirmed))
# Print the summary of the cleaned and aggregated dataset
cat("Cleaned shape :", nrow(country_day), "x", ncol(country_day), "\n")## Cleaned shape : 35156 x 8
## Latest date : 2020-07-27
## Total confirmed: 16,480,485
## Total deaths : 654,036
## Total recovered: 9,468,087
## # A tibble: 10 × 8
## Country Date Confirmed Deaths Recovered Active Death_Rate Recovery_Rate
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 US 2020-07-27 4290259 148011 1325804 2.82e6 3.45 30.9
## 2 Brazil 2020-07-27 2442375 87618 1846641 5.08e5 3.59 75.6
## 3 India 2020-07-27 1480073 33408 951166 4.95e5 2.26 64.3
## 4 Russia 2020-07-27 816680 13334 602249 2.01e5 1.63 73.7
## 5 South … 2020-07-27 452529 7067 274925 1.71e5 1.56 60.8
## 6 Mexico 2020-07-27 395489 44022 303810 4.77e4 11.1 76.8
## 7 Peru 2020-07-27 389717 18418 272547 9.88e4 4.73 69.9
## 8 Chile 2020-07-27 347923 9187 319954 1.88e4 2.64 92.0
## 9 United… 2020-07-27 301708 45844 1437 2.54e5 15.2 0.476
## 10 Iran 2020-07-27 293606 15912 255144 2.26e4 5.42 86.9
The data in the preprocessing stage was successfully converted into a
tidy analytical structure. Values of the variable Active
with negative values were set to zero with the help of the
pmax() function as this would indicate reporting
inconsistencies and would not be considered meaningful epidemiological
observations. In addition, aggregation at the country level was carried
out as a measure of removing duplicate reporting of the provinces to
create a clean, daily country summary for visualization and machine
learning modeling.
This section explores the cleaned data to uncover patterns and trends before applying machine learning models.
To better understand the severity and trajectory of the pandemic, we
will create visualizations using the ggplot2 package. We
will also use the scales package to format our axis labels
for better readability.
This section aggregates the data globally, calculates rolling averages (crucial for our later models), and generates a KPI dashboard.
To calculate the 7-day rolling averages smoothly, we introduce the
zoo package. For the visualization, we utilize
ggplot2 to construct custom summary cards, mimicking a
business intelligence dashboard.
# Load zoo package for rolling average calculations
library(zoo)
# Step 1: Global daily totals
global_day <- country_day %>%
group_by(Date) %>%
summarise(
Confirmed = sum(Confirmed, na.rm = TRUE),
Deaths = sum(Deaths, na.rm = TRUE),
Recovered = sum(Recovered, na.rm = TRUE),
Active = sum(Active, na.rm = TRUE),
.groups = 'drop'
)
# Step 2: Daily new cases & 7-day rolling averages
global_day <- global_day %>%
arrange(Date) %>%
mutate(
# Use pmax and lag to replicate Python's diff().clip(lower=0)
New_Cases = pmax(Confirmed - lag(Confirmed, default = 0), 0),
New_Deaths = pmax(Deaths - lag(Deaths, default = 0), 0),
# Calculate 7-day rolling average using zoo::rollmean
`7d_avg_cases` = rollmean(New_Cases, k = 7, fill = NA, align = "right"),
`7d_avg_deaths` = rollmean(New_Deaths, k = 7, fill = NA, align = "right")
)
# Step 3: Extract latest summary card statistics
last_row <- global_day %>% filter(Date == max(Date))
total_confirmed <- last_row$Confirmed
total_deaths <- last_row$Deaths
total_recovered <- last_row$Recovered
global_cfr <- (total_deaths / total_confirmed) * 100
# Step 4: Create a dataframe for the ggplot cards
cards_data <- tibble(
id = 1:4,
label = c("Total Confirmed", "Total Deaths", "Total Recovered", "Case Fatality Rate"),
# Format numbers to Millions (M) and Percentages (%)
value = c(
sprintf("%.1fM", total_confirmed / 1e6),
sprintf("%.2fM", total_deaths / 1e6),
sprintf("%.1fM", total_recovered / 1e6),
sprintf("%.2f%%", global_cfr)
),
color = c("#1e3a8a", "#ef4444", "#16a34a", "#d97706")
)
# Step 5: Plot the Summary Cards using ggplot2
p_cards <- ggplot(cards_data, aes(x = id, y = 1)) +
# Draw the colored background boxes
geom_tile(aes(fill = color), width = 0.9, height = 0.8) +
# Add the value text (Big Numbers)
geom_text(aes(label = value), vjust = -0.3, color = "white", size = 10, fontface = "bold") +
# Add the label text (Small Letters)
geom_text(aes(label = label), vjust = 3.0, color = "white", size = 4, alpha = 0.85) +
# Apply exact colors
scale_fill_identity() +
# Remove all axes, grids, and background
theme_void() +
labs(title = paste("COVID-19 Global Summary - As of", max(global_day$Date))) +
theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5, margin = ggplot2::margin(b = 10)))
# Print the plot in R Markdown
print(p_cards)This section visualizes the global pandemic trajectory using multi-panel plots to analyze cumulative cases, new cases, and recovery trends.
library(patchwork) # for combining multiple plots
library(scales) # for formatting axis labels
# 1. Cumulative Confirmed
p1 <- ggplot(global_day, aes(x = Date, y = Confirmed / 1e6)) +
geom_area(fill = "#1e3a8a", alpha = 0.25) +
geom_line(color = "#1e3a8a", linewidth = 1) +
labs(title = "Cumulative Confirmed Cases (Millions)", y = "Cases (M)", x = "") +
theme_minimal()
# 2. Daily New + 7-day Rolling Average
p2 <- ggplot(global_day, aes(x = Date)) +
geom_col(aes(y = New_Cases / 1e3), fill = "#2563eb", alpha = 0.35) +
geom_line(aes(y = `7d_avg_cases` / 1e3), color = "#1e3a8a", linewidth = 1) +
labs(title = "Daily New Cases (Thousands)", y = "Cases (K)", x = "") +
theme_minimal()
# 3. Cumulative Deaths
p3 <- ggplot(global_day, aes(x = Date, y = Deaths / 1e6)) +
geom_area(fill = "#ef4444", alpha = 0.25) +
geom_line(color = "#ef4444", linewidth = 1) +
labs(title = "Cumulative Deaths (Millions)", y = "Deaths (M)", x = "Date") +
theme_minimal()
# 4. Active vs Recovered
# Note: we first reshape the data into long format to draw a stacked area plot
p4_data <- global_day %>%
select(Date, Active, Recovered) %>%
pivot_longer(-Date, names_to = "Status", values_to = "Count")
p4 <- ggplot(p4_data, aes(x = Date, y = Count / 1e6, fill = Status)) +
geom_area(alpha = 0.4) +
scale_fill_manual(values = c("Active" = "#f59e0b", "Recovered" = "#16a34a")) +
labs(title = "Active vs Recovered Cases (Millions)", y = "Cases (M)", x = "Date") +
theme_minimal() + theme(legend.position = "bottom")
# Combine plots using patchwork
combined_plot <- (p1 + p2) / (p3 + p4) +
plot_annotation(title = "Global COVID-19 Trends",
theme = theme(plot.title = element_text(size = 18, face = "bold")))
print(combined_plot)This section provides a comparative analysis of the 15 most impacted countries, evaluating them across four dimensions: Confirmed cases, Deaths, Case Fatality Rate (CFR), and Recovery Rate.
# 1. Prepare plotting data: Top 15 ranking
top15 <- latest %>% head(15)
# Helper: unified plotting theme
base_theme <- theme_minimal() + theme(axis.title.y = element_blank())
# Plot 1: Confirmed Cases (Millions)
p1 <- ggplot(top15, aes(x = Confirmed / 1e6, y = reorder(Country, Confirmed))) +
geom_col(aes(fill = Confirmed)) +
scale_fill_gradient(low = "#d1e3ff", high = "#1e3a8a") +
geom_text(aes(label = sprintf("%.1fM", Confirmed / 1e6)), hjust = -0.1, size = 3) +
labs(title = "Confirmed (M)", x = "Cases (M)") + base_theme + theme(legend.position = "none")
# Plot 2: Total Deaths (Thousands)
top15_deaths <- latest %>% arrange(desc(Deaths)) %>% head(15)
p2 <- ggplot(top15_deaths, aes(x = Deaths / 1e3, y = reorder(Country, Deaths))) +
geom_col(fill = "#ef4444") +
labs(title = "Deaths (K)", x = "Deaths (K)") + base_theme
# Plot 3: Case Fatality Rate (conditional coloring)
p3 <- ggplot(top15, aes(x = Death_Rate, y = reorder(Country, Death_Rate))) +
geom_col(aes(fill = case_when(
Death_Rate > 3 ~ "#ef4444",
Death_Rate > 1.5 ~ "#f59e0b",
TRUE ~ "#1e3a8a"
))) +
geom_vline(xintercept = mean(top15$Death_Rate), linetype = "dashed") +
scale_fill_identity() +
labs(title = "CFR (%) - Top 15", x = "CFR %") + base_theme
# Plot 4: Recovery Rate
p4 <- ggplot(top15, aes(x = Recovery_Rate, y = reorder(Country, Recovery_Rate))) +
geom_col(fill = "#16a34a") +
labs(title = "Recovery Rate (%)", x = "Recovery %") + base_theme
# Combined layout
(p1 + p2) / (p3 + p4) +
plot_annotation(title = "Top 15 Countries: Comparative Analysis",
theme = theme(plot.title = element_text(size = 16, face = "bold")))This section addresses the project’s first research question through time series forecasting. We aim to predict the future trajectory of daily new COVID-19 cases using both classical statistical methods (ARIMA, ETS) and a modern machine learning approach (Random Forest with lag features). Time series forecasting is a regression problem at its core — we are predicting continuous future values from historical data — so this analysis fully satisfies the project’s “predictive modeling” requirement.
Before fitting any forecasting model, we prepare the raw daily case counts into a proper time series object and run a series of diagnostic checks to understand its underlying structure — including stationarity, trend, weekly seasonality, and autocorrelation patterns. The data is then split chronologically into training (80%) and testing (20%) sets to simulate real-world forecasting conditions where the model only sees past data.
# Load forecasting and machine learning libraries
library(forecast) # ARIMA, ETS, autoplot for forecast objects
library(tseries) # adf.test for stationarity
library(randomForest) # Random Forest model
library(Metrics) # MAE, RMSE, MAPE metricsTime Series Object Construction
We construct the time series from global_day$New_Cases,
which represents the worldwide daily new confirmed cases. We set
frequency = 7 to inform the model of expected weekly
seasonality (caused by weekend reporting lags in many countries).
# Prepare clean time series data
ts_input <- global_day %>%
arrange(Date) %>%
filter(!is.na(New_Cases))
# Create ts object with weekly seasonality
new_cases_ts <- ts(ts_input$New_Cases, frequency = 7)
cat("Time series length :", length(new_cases_ts), "days\n")## Time series length : 188 days
## Date range : 2020-01-22 to 2020-07-27
## Mean daily new cases : 87,662
## Max daily new cases : 282,756
# Plot the raw series
autoplot(new_cases_ts, color = "#1e3a8a", linewidth = 0.8) +
labs(title = "Daily New COVID-19 Cases (Global Aggregated)",
subtitle = "Raw time series with weekly seasonality embedded",
y = "New Cases", x = "Weeks since start") +
theme_minimal()The raw series shows a clear upward trend with visible short-cycle fluctuations — early hints of weekly seasonality from reporting cycles. This is exactly the kind of structure ARIMA and STL decomposition are designed to capture.
Stationarity Testing (Augmented Dickey-Fuller)
Most classical time series models (including ARIMA) require the input series to be stationary — that is, its statistical properties (mean, variance, autocorrelation) should not change over time. If the ADF test’s p-value is less than 0.05, we reject the null hypothesis and conclude the series is stationary.
# Test 1: Original series
adf_original <- adf.test(new_cases_ts, alternative = "stationary")
cat("ADF Test (Original Series):\n")## ADF Test (Original Series):
## Statistic: -0.1549
## p-value : 0.99
## Conclusion: Non-stationary
# Test 2: First-differenced series
new_cases_diff <- diff(new_cases_ts, differences = 1)
adf_diff <- adf.test(new_cases_diff, alternative = "stationary")
cat("ADF Test (First-Differenced Series):\n")## ADF Test (First-Differenced Series):
## Statistic: -12.8607
## p-value : 0.01
## Conclusion: Stationary
# Visualize the differenced series
autoplot(new_cases_diff, color = "#ef4444", linewidth = 0.7) +
labs(title = "First-Differenced Series of Daily New Cases",
subtitle = "Removing trend to achieve stationarity",
y = "Differenced Value", x = "Weeks since start") +
theme_minimal()If the original series is non-stationary (high p-value),
differencing typically resolves it. The auto.arima()
function in the next section will automatically determine the optimal
differencing order d, but inspecting it manually here helps
validate that decision.
STL Decomposition (Trend, Seasonality, Residual)
STL (Seasonal-Trend decomposition using Loess) splits the series into three interpretable components. This is a critical sanity check before forecasting — it tells us how much of the variation is trend-driven vs. seasonal vs. noise.
# Apply STL decomposition with periodic seasonality
stl_decomp <- stl(new_cases_ts, s.window = "periodic", robust = TRUE)
# Plot the four panels: data, seasonal, trend, remainder
autoplot(stl_decomp) +
labs(title = "STL Decomposition: Daily New COVID-19 Cases",
subtitle = "Trend dominates; weekly seasonality and residual noise visible") +
theme_minimal()# Quantify component strength
trend_strength <- 1 - var(stl_decomp$time.series[, "remainder"]) /
var(stl_decomp$time.series[, "trend"] + stl_decomp$time.series[, "remainder"])
seasonal_strength <- 1 - var(stl_decomp$time.series[, "remainder"]) /
var(stl_decomp$time.series[, "seasonal"] + stl_decomp$time.series[, "remainder"])
cat("Trend strength :", round(trend_strength, 3), "(0 = no trend, 1 = strong trend)\n")## Trend strength : 0.982 (0 = no trend, 1 = strong trend)
cat("Seasonal strength :", round(seasonal_strength, 3), "(0 = no seasonality, 1 = strong seasonality)\n")## Seasonal strength : 0.261 (0 = no seasonality, 1 = strong seasonality)
A high trend strength confirms that the long-term upward
direction is the dominant signal. A moderate seasonal strength validates
our choice of frequency = 7 — weekly reporting cycles are a
real feature, not just noise.
ACF and PACF Diagnostics
The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots help identify the AR (p) and MA (q) terms for ARIMA. Significant spikes outside the blue confidence bands indicate meaningful lag relationships.
# Plot ACF and PACF side by side
par(mfrow = c(1, 2))
acf(new_cases_diff, lag.max = 30, main = "ACF: Differenced Series")
pacf(new_cases_diff, lag.max = 30, main = "PACF: Differenced Series")Spikes at lag 7, 14, 21 in the ACF would confirm weekly seasonal
autocorrelation. The PACF cutoff pattern guides the choice of AR order.
We let auto.arima() formalize this decision via AIC
minimization.
Train-Test Split
We hold out the final 20% of the series as a test set. This simulates real forecasting conditions: the model only sees the past and must predict the future.
# 80/20 chronological split
n <- length(new_cases_ts)
train_size <- floor(0.8 * n)
train_ts <- ts(new_cases_ts[1:train_size], frequency = 7)
test_ts <- ts(new_cases_ts[(train_size + 1):n], frequency = 7,
start = c(1, train_size + 1))
cat("Total observations:", n, "days\n")## Total observations: 188 days
## Training set : 150 days (80%)
## Testing set : 38 days (20%)
ARIMA (AutoRegressive Integrated Moving Average) is the workhorse of
classical time series forecasting. auto.arima() automates
order selection by minimizing AIC across candidate (p, d, q)
combinations.
# Fit ARIMA model with seasonal component
arima_model <- auto.arima(train_ts,
seasonal = TRUE,
stepwise = FALSE,
approximation = FALSE)
# Display the chosen model
cat("Selected ARIMA model:\n")## Selected ARIMA model:
## Series: train_ts
## ARIMA(1,1,3)(0,1,1)[7]
##
## Coefficients:
## ar1 ma1 ma2 ma3 sma1
## 0.8079 -1.3341 0.3135 0.1937 -0.6717
## s.e. 0.1468 0.1761 0.2007 0.1061 0.0737
##
## sigma^2 = 47511619: log likelihood = -1455.97
## AIC=2923.93 AICc=2924.56 BIC=2941.67
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,3)(0,1,1)[7]
## Q* = 3.0628, df = 9, p-value = 0.9618
##
## Model df: 5. Total lags used: 14
# Forecast on the test horizon
arima_fc <- forecast(arima_model, h = length(test_ts))
# Plot forecast vs actual
autoplot(arima_fc, fcol = "#1e3a8a") +
autolayer(test_ts, series = "Actual", color = "#ef4444", linewidth = 0.8) +
labs(title = "ARIMA Forecast vs Actual Values (Test Set)",
subtitle = "Blue = forecast with 80%/95% confidence bands, Red = actual",
y = "New Cases", x = "Weeks since start") +
theme_minimal()Residual diagnostics (Ljung-Box test, ACF of residuals, residual histogram) check whether the model captured all available structure. A non-significant Ljung-Box p-value (> 0.05) confirms residuals are essentially random noise — the model is well-specified.
ETS models adapt smoothly to changes in trend and seasonality by exponentially down-weighting older observations. They are often competitive with ARIMA for series with clear trend or seasonal patterns and are robust to short series.
# Fit ETS model with automatic component selection
ets_model <- ets(train_ts)
cat("Selected ETS model:\n")## Selected ETS model:
## ETS(A,A,A)
##
## Call:
## ets(y = train_ts)
##
## Smoothing parameters:
## alpha = 0.2441
## beta = 0.0723
## gamma = 0.2806
##
## Initial states:
## l = -3083.6371
## b = 638.8013
## s = -2193.207 -6849.699 -3085.867 2457.108 4324.671 5159.357
## 187.6376
##
## sigma: 7136.171
##
## AIC AICc BIC
## 3426.051 3428.328 3462.178
# Forecast on test horizon
ets_fc <- forecast(ets_model, h = length(test_ts))
# Plot forecast vs actual
autoplot(ets_fc, fcol = "#16a34a") +
autolayer(test_ts, series = "Actual", color = "#ef4444", linewidth = 0.8) +
labs(title = "ETS Forecast vs Actual Values (Test Set)",
subtitle = "Green = forecast, Red = actual",
y = "New Cases", x = "Weeks since start") +
theme_minimal()Classical models assume a specific parametric form. Machine learning approaches like Random Forest make no such assumption — they can capture non-linear interactions between lagged values. We engineer lag features (yesterday’s value, last week’s value, etc.) and a 7-day rolling mean to use as predictors.
# Engineer lag features and rolling statistics
ts_df <- tibble(
y = as.numeric(new_cases_ts),
lag1 = dplyr::lag(as.numeric(new_cases_ts), 1),
lag2 = dplyr::lag(as.numeric(new_cases_ts), 2),
lag3 = dplyr::lag(as.numeric(new_cases_ts), 3),
lag7 = dplyr::lag(as.numeric(new_cases_ts), 7),
lag14 = dplyr::lag(as.numeric(new_cases_ts), 14),
roll7 = rollmean(as.numeric(new_cases_ts), k = 7, fill = NA, align = "right"),
roll14 = rollmean(as.numeric(new_cases_ts), k = 14, fill = NA, align = "right")
) %>% na.omit()
# Chronological train/test split (preserving time order)
rf_train_idx <- 1:floor(0.8 * nrow(ts_df))
rf_train <- ts_df[rf_train_idx, ]
rf_test <- ts_df[-rf_train_idx, ]
cat("Random Forest training rows:", nrow(rf_train), "\n")## Random Forest training rows: 139
## Random Forest testing rows : 35
# Train Random Forest with 500 trees
set.seed(123)
rf_model <- randomForest(y ~ ., data = rf_train,
ntree = 500,
importance = TRUE)
# Display model summary
print(rf_model)##
## Call:
## randomForest(formula = y ~ ., data = rf_train, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 53194843
## % Var explained: 97.6
# Variable importance plot — which lags matter most?
varImpPlot(rf_model,
main = "Random Forest: Feature Importance",
color = "#1e3a8a", pch = 19)The variable importance plot reveals which lags carry the most
predictive signal. Typically, lag1 (yesterday’s value) and
the 7-day rolling mean dominate, confirming that recent history is the
strongest predictor of tomorrow.
We evaluate all three models on the held-out test set using three standard forecasting error metrics:
# Calculate metrics for all three models
calc_metrics <- function(actual, predicted, model_name) {
tibble(
Model = model_name,
MAE = round(mae(actual, predicted), 2),
RMSE = round(rmse(actual, predicted), 2),
MAPE = round(mape(actual, predicted) * 100, 2)
)
}
results_arima <- calc_metrics(as.numeric(test_ts), as.numeric(arima_fc$mean), "ARIMA")
results_ets <- calc_metrics(as.numeric(test_ts), as.numeric(ets_fc$mean), "ETS")
results_rf <- calc_metrics(rf_test$y, rf_pred, "Random Forest")
# Combine into comparison table
comparison <- bind_rows(results_arima, results_ets, results_rf) %>%
arrange(RMSE)
# Display the comparison
print(comparison)## # A tibble: 3 × 4
## Model MAE RMSE MAPE
## <chr> <dbl> <dbl> <dbl>
## 1 ARIMA 16122. 19642. 8.08
## 2 ETS 21024. 27447. 10.2
## 3 Random Forest 57293. 65808. 25.4
# Identify the best model based on RMSE
best_model_name <- comparison$Model[1]
cat("\nBest performing model (lowest RMSE):", best_model_name, "\n")##
## Best performing model (lowest RMSE): ARIMA
# Visualize the comparison
comparison_long <- comparison %>%
pivot_longer(-Model, names_to = "Metric", values_to = "Value")
ggplot(comparison_long, aes(x = Model, y = Value, fill = Model)) +
geom_col() +
facet_wrap(~ Metric, scales = "free_y") +
geom_text(aes(label = round(Value, 2)), vjust = -0.3, size = 3.5) +
scale_fill_manual(values = c("ARIMA" = "#1e3a8a", "ETS" = "#16a34a", "Random Forest" = "#d97706")) +
labs(title = "Forecasting Model Comparison",
subtitle = "Lower is better for all three metrics",
y = "Error Value", x = "") +
theme_minimal() +
theme(legend.position = "none")We retrain the best-performing model on the full dataset (no held-out test set this time) and generate a 30-day forecast. The shaded bands represent 80% and 95% confidence intervals — wider bands further into the future reflect growing uncertainty.
# Retrain on full series for production forecast
final_model <- auto.arima(new_cases_ts, seasonal = TRUE,
stepwise = FALSE, approximation = FALSE)
# Forecast the next 30 days
final_forecast <- forecast(final_model, h = 30, level = c(80, 95))
# Visualize the forecast with confidence intervals
autoplot(final_forecast, fcol = "#1e3a8a") +
labs(title = "30-Day Forecast of Daily New COVID-19 Cases (Global)",
subtitle = "Shaded bands: 80% (dark) and 95% (light) confidence intervals",
y = "Daily New Cases", x = "Weeks since start of data") +
theme_minimal()# Display point forecasts for the next 30 days
forecast_summary <- tibble(
Day = 1:30,
Point_Forecast = round(as.numeric(final_forecast$mean), 0),
Lower_80 = round(as.numeric(final_forecast$lower[, 1]), 0),
Upper_80 = round(as.numeric(final_forecast$upper[, 1]), 0),
Lower_95 = round(as.numeric(final_forecast$lower[, 2]), 0),
Upper_95 = round(as.numeric(final_forecast$upper[, 2]), 0)
)
cat("First 10 days of forecast:\n")## First 10 days of forecast:
## # A tibble: 10 × 6
## Day Point_Forecast Lower_80 Upper_80 Lower_95 Upper_95
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 263390 252523 274257 246771 280009
## 2 2 288606 276955 300258 270787 306426
## 3 3 295053 283150 306956 276849 313256
## 4 4 292414 280222 304605 273769 311059
## 5 5 275556 262882 288230 256172 294939
## 6 6 242695 229298 256092 222206 263184
## 7 7 246899 232575 261224 224992 268807
## 8 8 276573 259321 293825 250188 302958
## 9 9 300961 282135 319788 272169 329754
## 10 10 308143 288011 328275 277354 338933
##
## Last 10 days of forecast:
## # A tibble: 10 × 6
## Day Point_Forecast Lower_80 Upper_80 Lower_95 Upper_95
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 281213 242985 319440 222749 339676
## 2 22 311293 270176 352410 248410 374176
## 3 23 335946 292874 379018 270073 401819
## 4 24 343294 298507 388081 274799 411789
## 5 25 341877 295403 388350 270801 412952
## 6 26 326116 277925 374306 252414 399817
## 7 27 294103 244164 344041 217729 370477
## 8 28 298910 247211 350609 219844 377977
## 9 29 328991 274382 383599 245475 412507
## 10 30 353644 296939 410348 266922 440366
This section addresses the project’s second research question through classification analysis. We aim to categorize countries into “High-Risk” or “Low-Risk” zones based on their Case Fatality Rate (Death Rate), using three classification algorithms — Logistic Regression, Decision Tree, and Random Forest — and comparing their performance using Accuracy, Precision, Recall, and F1-Score.
A country is labelled High_Risk if its
Death Rate is at or above the median across all countries,
and Low_Risk otherwise. This binary target is then
modeled using four pandemic indicators: total Confirmed cases, total
Active cases, Recovery Rate, and Active Rate. The data is split 70/30
into training and testing sets using stratified sampling to preserve the
class balance.
library(randomForest)
library(rpart)
library(rpart.plot)
library(caret)
# A country is high risk if its death rate is above the median
median_death_rate <- median(latest$Death_Rate, na.rm = TRUE)
dClassification <- latest %>%
filter(Confirmed > 0) %>%
mutate(
Active_Rate = ifelse(Confirmed > 0, (Active / Confirmed) * 100, 0),
Risk_Level = factor(ifelse(Death_Rate >= median_death_rate, "High_Risk", "Low_Risk"),
levels = c("Low_Risk", "High_Risk"))
) %>%
select(Risk_Level, Confirmed, Deaths, Recovered, Active, Death_Rate, Recovery_Rate, Active_Rate) %>%
na.omit()
cat("Total countries for classification:", nrow(dClassification), "\n")## Total countries for classification: 187
## Class distribution:
##
## Low_Risk High_Risk
## 93 94
# 70/30 train-test split (stratified)
set.seed(42)
train_idx <- createDataPartition(dClassification$Risk_Level, p = 0.7, list = FALSE)
train_class <- dClassification[ train_idx, ]
test_class <- dClassification[-train_idx, ]
cat("\nClassification train rows:", nrow(train_class), "\n")##
## Classification train rows: 132
## Classification test rows : 55
Logistic Regression serves as our interpretable baseline. It models the log-odds of being High_Risk as a linear combination of the predictors. Coefficients are easy to interpret, and the model provides probabilistic outputs that can be thresholded for decision-making.
# Train
logit_model <- glm(Risk_Level ~ Confirmed + Active + Recovery_Rate + Active_Rate,
data = train_class, family = binomial(link = "logit"))
summary(logit_model)##
## Call:
## glm(formula = Risk_Level ~ Confirmed + Active + Recovery_Rate +
## Active_Rate, family = binomial(link = "logit"), data = train_class)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.552e+04 3.025e+06 0.018 0.985
## Confirmed -6.524e-05 5.253e-02 -0.001 0.999
## Active 1.186e-04 2.033e-01 0.001 1.000
## Recovery_Rate -5.673e+02 3.092e+04 -0.018 0.985
## Active_Rate -5.671e+02 3.090e+04 -0.018 0.985
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.8299e+02 on 131 degrees of freedom
## Residual deviance: 9.9411e-07 on 127 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
# Predict
logit_prob <- predict(logit_model, newdata = test_class, type = "response")
logit_pred <- factor(ifelse(logit_prob >= 0.5, "High_Risk", "Low_Risk"),
levels = c("Low_Risk", "High_Risk"))
# Evaluate
cat("\n--- Logistic Regression Confusion Matrix ---\n")##
## --- Logistic Regression Confusion Matrix ---
cm_logit <- confusionMatrix(logit_pred, test_class$Risk_Level, positive = "High_Risk")
print(cm_logit)## Confusion Matrix and Statistics
##
## Reference
## Prediction Low_Risk High_Risk
## Low_Risk 27 0
## High_Risk 0 28
##
## Accuracy : 1
## 95% CI : (0.9351, 1)
## No Information Rate : 0.5091
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5091
## Detection Rate : 0.5091
## Detection Prevalence : 0.5091
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : High_Risk
##
##
## Accuracy : 1
## Precision: 1
## Recall : 1
## F1 Score : 1
# Confusion matrix heatmap
cm_logit_table <- as.data.frame(cm_logit$table)
ggplot(cm_logit_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), size = 6, fontface = "bold", color = "white") +
scale_fill_gradient(low = "blue", high = "red") +
labs(title = "Logistic Regression - Confusion Matrix") +
theme_minimal()A Decision Tree captures non-linear interactions and provides a visually interpret-able split structure. We constrain the tree to a maximum depth of 5 to prevent over-fitting on this relatively small country-level data set.
set.seed(42)
# Train
tree_model <- rpart(Risk_Level ~ Confirmed + Active + Recovery_Rate + Active_Rate,
data = train_class, method = "class",
control = rpart.control(maxdepth = 5, cp = 0.01))
# Visualize the tree
rpart.plot(tree_model, type = 4, extra = 104, main = "Decision Tree")# Predict
tree_pred <- predict(tree_model, newdata = test_class, type = "class")
# Evaluate
cat("\n--- Decision Tree Confusion Matrix ---\n")##
## --- Decision Tree Confusion Matrix ---
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low_Risk High_Risk
## Low_Risk 14 13
## High_Risk 13 15
##
## Accuracy : 0.5273
## 95% CI : (0.388, 0.6635)
## No Information Rate : 0.5091
## P-Value [Acc > NIR] : 0.4468
##
## Kappa : 0.0542
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5357
## Specificity : 0.5185
## Pos Pred Value : 0.5357
## Neg Pred Value : 0.5185
## Prevalence : 0.5091
## Detection Rate : 0.2727
## Detection Prevalence : 0.5091
## Balanced Accuracy : 0.5271
##
## 'Positive' Class : High_Risk
##
##
## Accuracy : 0.5273
## Precision: 0.5357
## Recall : 0.5357
## F1 Score : 0.5357
Random Forest is an ensemble of decision trees that typically outperforms a single tree by averaging across many bootstrapped trees, reducing variance and overfitting. We train with 200 trees and extract feature importance to understand which predictors drive the risk classification.
set.seed(42)
# Train
rf_class_model <- randomForest(Risk_Level ~ Confirmed + Active + Recovery_Rate + Active_Rate,
data = train_class, ntree = 200, importance = TRUE)
# Predict
rf_class_pred <- predict(rf_class_model, newdata = test_class)
# Evaluate
cat("\n--- Random Forest Classification Confusion Matrix ---\n")##
## --- Random Forest Classification Confusion Matrix ---
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low_Risk High_Risk
## Low_Risk 14 11
## High_Risk 13 17
##
## Accuracy : 0.5636
## 95% CI : (0.4232, 0.697)
## No Information Rate : 0.5091
## P-Value [Acc > NIR] : 0.2504
##
## Kappa : 0.1258
##
## Mcnemar's Test P-Value : 0.8383
##
## Sensitivity : 0.6071
## Specificity : 0.5185
## Pos Pred Value : 0.5667
## Neg Pred Value : 0.5600
## Prevalence : 0.5091
## Detection Rate : 0.3091
## Detection Prevalence : 0.5455
## Balanced Accuracy : 0.5628
##
## 'Positive' Class : High_Risk
##
##
## Accuracy : 0.5636
## Precision: 0.5667
## Recall : 0.6071
## F1 Score : 0.5862
We summarize the three models side-by-side across all four classification metrics. The best model is the one that strikes the right balance between Precision (avoiding false High_Risk alarms) and Recall (catching genuinely high-risk countries) — typically captured well by the F1 Score.
class_comparison <- data.frame(
Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
Accuracy = c(cm_logit$overall["Accuracy"],
cm_tree$overall["Accuracy"],
cm_rf$overall["Accuracy"]),
Precision = c(cm_logit$byClass["Precision"],
cm_tree$byClass["Precision"],
cm_rf$byClass["Precision"]),
Recall = c(cm_logit$byClass["Recall"],
cm_tree$byClass["Recall"],
cm_rf$byClass["Recall"]),
F1 = c(cm_logit$byClass["F1"],
cm_tree$byClass["F1"],
cm_rf$byClass["F1"])
)
class_comparison[, -1] <- round(class_comparison[, -1], 4)
print(class_comparison)## Model Accuracy Precision Recall F1
## 1 Logistic Regression 1.0000 1.0000 1.0000 1.0000
## 2 Decision Tree 0.5273 0.5357 0.5357 0.5357
## 3 Random Forest 0.5636 0.5667 0.6071 0.5862
# Bar chart comparison
class_long <- class_comparison %>%
tidyr::pivot_longer(cols = c(Accuracy, Precision, Recall, F1),
names_to = "Metric", values_to = "Value")
ggplot(class_long, aes(x = Model, y = Value, fill = Model)) +
geom_col(width = 0.5) +
facet_wrap(~Metric, scales = "free_y") +
scale_fill_manual(values = c(
"Logistic Regression" = "green",
"Decision Tree" = "yellow",
"Random Forest" = "red"
)) +
labs(title = "Classification Model Comparison", y = "Score", x = "") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1))This project demonstrated a complete end-to-end data science pipeline applied to the global COVID-19 pandemic, employing a dual-approach methodology that addressed both predictive forecasting and risk classification.
Time Series Forecasting (Section 5):
Three forecasting models such as ARIMA, ETS, and Random Forest with lag features were developed and evaluated using MAE, RMSE, and MAPE as error metrics. The key findings are:
auto.arima() to capture both the trend and weekly seasonal
patterns inherent in the reporting cycle. Its residual diagnostics
(Ljung-Box test) confirmed that the model successfully extracted the
systematic structure from the data.lag1 (yesterday’s cases) and the 7-day rolling average
were the most predictive features, confirming the strong short-term
autocorrelation in pandemic data.Classification Analysis (Section 6):
Three classification algorithms such as Logistic Regression, Decision Tree, and Random Forest were also employed to categorize countries into High-Risk and Low-Risk zones based on their Case Fatality Rate. The key findings are:
This project made several methodological contributions:
Active values clipped to zero), feature
engineering (Death_Rate, Recovery_Rate,
Active_Rate), and country-level aggregation — transforming
raw messy data into analysis-ready structures.Despite the comprehensive approach, several limitations should be acknowledged:
In conclusion, this project showed that data science methods, from data visualization to machine learning, can provide useful insights from pandemic data. The project used two approaches such as forecasting future trends and classifying risk levels. Together, these approaches give a broader view of the data and can help with short-term planning and long-term decision-making. The models used in this project are simple and cannot fully represent the complexity of a real pandemic. However, they provide a good starting point that can be improved in the future with more data and better techniques.