Group members

Ung Keng Jie 25084156

Nyo Jing Jie 24228485

Yim Wen Jun 24201054

Yu Tao 17216097

Shao Jin Bo 25086040

1. Introduction and Objectives

1.1 Project Background

The COVID-19 pandemic has affected countries around the world and created many challenges for public health services. During this period, data science has played an important role in tracking the spread of the virus, understanding infection patterns, and helping decision-makers especially governments to allocate resources more effectively.

This project uses the “Corona Virus Report” dataset from Kaggle. The dataset contains the cumulative number of confirmed cases, deaths, and recoveries recorded each day in many countries and regions during the COVID-19 pandemic. The dataset has 49,068 rows and 10 columns, which gives a total of 490,680 data points. Therefore, it meets the project requirement of having more than 100,000 dimensions.

1.2 Project Objectives

This project aims to transform raw and unstructured pandemic data into actionable insights by cleaning the data, performing exploratory data analysis (EDA) and modeling the data using machine learning techniques.

To accomplish this, we have stated two different types of analytical problems, according to the technical needs of the project:

  1. Time Series Forecasting (Predictive Modeling):
    • Objective: To forecast the future trend of the pandemic and number of patients using historical time-stamped data.
    • Specific Goal: To build forecasting models (ARIMA, ETS, Random Forest using lag features) for predicting the number of new confirmed cases for the next 30 days that helps healthcare systems to be able to predict and prepare for the number of cases.
  2. Classification Analysis (Risk Categorization):
    • Objective: To categorize discrete risk levels based on healthcare metrics.
    • Specific Goal: To develop a classification model to classify countries/regions according to their recovery trend, active case ratio and pandemic management performance as “High-Risk” or “Low-Risk” countries/regions.

Through these dual approaches, this project demonstrates a complete data science pipeline from raw data tidying to advanced predictive evaluation and forecasting.

# Load necessary library
library(readr)
covid_data <- read_csv("covid_19_clean_complete.csv")
dim(covid_data)
## [1] 49068    10

2. Data Loading and Exploration

First, we load the necessary libraries and the raw dataset. We will use the tidyverse ecosystem for efficient data manipulation and lubridate for handling dates.

# Load tidyverse (which includes readr, dplyr, tidyr, ggplot2, etc.)
library(tidyverse)
PATH <- "covid_19_clean_complete.csv"
df <- read_csv(PATH, show_col_types = FALSE)

cat("Shape   :", nrow(df), "x", ncol(df), "\n")
## Shape   : 49068 x 10
cat("Columns :", paste(names(df), collapse = ", "), "\n\n")
## Columns : Province/State, Country/Region, Lat, Long, Date, Confirmed, Deaths, Recovered, Active, WHO Region
cat("Date range:", as.character(min(df$Date)), " -> ", as.character(max(df$Date)), "\n")
## Date range: 2020-01-22  ->  2020-07-27
cat("Countries :", n_distinct(df$`Country/Region`), "\n")
## Countries : 187
head(df, 5)
## # A tibble: 5 × 10
##   `Province/State` `Country/Region`   Lat  Long Date       Confirmed Deaths
##   <chr>            <chr>            <dbl> <dbl> <date>         <dbl>  <dbl>
## 1 <NA>             Afghanistan       33.9 67.7  2020-01-22         0      0
## 2 <NA>             Albania           41.2 20.2  2020-01-22         0      0
## 3 <NA>             Algeria           28.0  1.66 2020-01-22         0      0
## 4 <NA>             Andorra           42.5  1.52 2020-01-22         0      0
## 5 <NA>             Angola           -11.2 17.9  2020-01-22         0      0
## # ℹ 3 more variables: Recovered <dbl>, Active <dbl>, `WHO Region` <chr>
# Check the structure of the dataset
glimpse(df)
## Rows: 49,068
## Columns: 10
## $ `Province/State` <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Australian Capital T…
## $ `Country/Region` <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Ango…
## $ Lat              <dbl> 33.93911, 41.15330, 28.03390, 42.50630, -11.20270, 17…
## $ Long             <dbl> 67.709953, 20.168300, 1.659600, 1.521800, 17.873900, …
## $ Date             <date> 2020-01-22, 2020-01-22, 2020-01-22, 2020-01-22, 2020…
## $ Confirmed        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Deaths           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Recovered        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Active           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `WHO Region`     <chr> "Eastern Mediterranean", "Europe", "Africa", "Europe"…
# View a summary of basic statistics
summary(df)
##  Province/State     Country/Region          Lat               Long        
##  Length:49068       Length:49068       Min.   :-51.796   Min.   :-135.00  
##  Class :character   Class :character   1st Qu.:  7.873   1st Qu.: -15.31  
##  Mode  :character   Mode  :character   Median : 23.634   Median :  21.75  
##                                        Mean   : 21.434   Mean   :  23.53  
##                                        3rd Qu.: 41.204   3rd Qu.:  80.77  
##                                        Max.   : 71.707   Max.   : 178.06  
##       Date              Confirmed           Deaths           Recovered      
##  Min.   :2020-01-22   Min.   :      0   Min.   :     0.0   Min.   :      0  
##  1st Qu.:2020-03-08   1st Qu.:      4   1st Qu.:     0.0   1st Qu.:      0  
##  Median :2020-04-24   Median :    168   Median :     2.0   Median :     29  
##  Mean   :2020-04-24   Mean   :  16885   Mean   :   884.2   Mean   :   7916  
##  3rd Qu.:2020-06-10   3rd Qu.:   1518   3rd Qu.:    30.0   3rd Qu.:    666  
##  Max.   :2020-07-27   Max.   :4290259   Max.   :148011.0   Max.   :1846641  
##      Active         WHO Region       
##  Min.   :    -14   Length:49068      
##  1st Qu.:      0   Class :character  
##  Median :     26   Mode  :character  
##  Mean   :   8085                     
##  3rd Qu.:    606                     
##  Max.   :2816444
# Check missing values
cat("Missing Values by Column:\n")
## Missing Values by Column:
colSums(is.na(df))
## Province/State Country/Region            Lat           Long           Date 
##          34404              0              0              0              0 
##      Confirmed         Deaths      Recovered         Active     WHO Region 
##              0              0              0              0              0
# Check duplicate rows
cat("\nDuplicate Rows:", sum(duplicated(df)), "\n")
## 
## Duplicate Rows: 0

3. Data Cleaning, Preprocessing and Aggregation

Data quality problem always the first to identify during the initial investigation like inspect missing value for the Province/State and some minor inconsistencies in the Active cases field. To ensure the data was reliable and suitable for machine learning applications, a data pre-processing pipeline was implemented. For example, obsrved missing values were checked for duplication before transformation.

Based on the exploration above, we will tidy and transform the dataset using dplyr and tidyr packages from the tidyverse ecosystem. The full-cleaning process is as follows:

  1. Formatting & Renaming: Converting the Date column to a valid Date object and renaming columns (e.g., Country/Region to Country) for easier coding access.
  2. Handling Missing Values: Replacing NA values in the Province column with an empty string, and filling NAs in numerical columns (Confirmed, Deaths, Recovered) with 0.
  3. Anomaly Correction: Recalculating Active cases where missing, and using the pmax() function to clip any negative values to 0 (removing anomalies).
  4. Feature Engineering: Deriving new epidemiological metrics, specifically Death_Rate and Recovery_Rate.
  5. Data Aggregation: Grouping the data by Country and Date to summarize provincial data into national daily totals (country_day).
  6. Latest Snapshot: Extracting the most recent day’s data (latest) to summarize the global pandemic situation.

Key Variables Description

Variable Description
Confirmed Total confirmed COVID-19 cases
Deaths Total COVID-19 related deaths
Recovered Total recovered patients
Active Current active COVID-19 cases
Death_Rate Percentage of deaths among confirmed cases
Recovery_Rate Percentage of recovered cases among confirmed cases
# Step 1-4: Basic Data Cleaning & Feature Engineering
df_clean <- df %>%
  # Parse dates
  mutate(Date = as.Date(Date)) %>%

  # Rename columns for convenience
  rename(
    Country = `Country/Region`,
    Province = `Province/State`
  ) %>%

  # Fill missing values
  replace_na(list(Province = "", Confirmed = 0, Deaths = 0, Recovered = 0)) %>%

  # Handle Active column & clip(lower = 0) to remove negative anomalies
  mutate(
    Active = ifelse(
      is.na(Active),
      Confirmed - Deaths - Recovered,
      Active
    ),
    Active = pmax(Active, 0)
  ) %>%

  # Derived metrics: Death Rate and Recovery Rate
  mutate(
    Death_Rate    = ifelse(Confirmed > 0, (Deaths / Confirmed) * 100, 0),
    Recovery_Rate = ifelse(Confirmed > 0, (Recovered / Confirmed) * 100, 0)
  )

# Step 5: Aggregate by Country + Date (remove province duplicates)
country_day <- df_clean %>%
  group_by(Country, Date) %>%
  summarise(
    Confirmed = sum(Confirmed, na.rm = TRUE),
    Deaths    = sum(Deaths, na.rm = TRUE),
    Recovered = sum(Recovered, na.rm = TRUE),
    Active    = sum(Active, na.rm = TRUE),
    .groups   = 'drop' # Ungroup after summarizing
  ) %>%
  # Recalculate rates after aggregation
  mutate(
    Death_Rate    = ifelse(Confirmed > 0, (Deaths / Confirmed) * 100, 0),
    Recovery_Rate = ifelse(Confirmed > 0, (Recovered / Confirmed) * 100, 0)
  )

# Step 6: Extract the Latest snapshot
latest <- country_day %>%
  filter(Date == max(Date)) %>%
  arrange(desc(Confirmed))

# Print the summary of the cleaned and aggregated dataset
cat("Cleaned shape  :", nrow(country_day), "x", ncol(country_day), "\n")
## Cleaned shape  : 35156 x 8
cat("Latest date    :", as.character(max(country_day$Date)), "\n")
## Latest date    : 2020-07-27
cat("Total confirmed:", format(sum(latest$Confirmed), big.mark = ",", scientific = FALSE), "\n")
## Total confirmed: 16,480,485
cat("Total deaths   :", format(sum(latest$Deaths), big.mark = ",", scientific = FALSE), "\n")
## Total deaths   : 654,036
cat("Total recovered:", format(sum(latest$Recovered), big.mark = ",", scientific = FALSE), "\n\n")
## Total recovered: 9,468,087
# Preview the top 10 countries in the latest snapshot
head(latest, 10)
## # A tibble: 10 × 8
##    Country Date       Confirmed Deaths Recovered Active Death_Rate Recovery_Rate
##    <chr>   <date>         <dbl>  <dbl>     <dbl>  <dbl>      <dbl>         <dbl>
##  1 US      2020-07-27   4290259 148011   1325804 2.82e6       3.45        30.9  
##  2 Brazil  2020-07-27   2442375  87618   1846641 5.08e5       3.59        75.6  
##  3 India   2020-07-27   1480073  33408    951166 4.95e5       2.26        64.3  
##  4 Russia  2020-07-27    816680  13334    602249 2.01e5       1.63        73.7  
##  5 South … 2020-07-27    452529   7067    274925 1.71e5       1.56        60.8  
##  6 Mexico  2020-07-27    395489  44022    303810 4.77e4      11.1         76.8  
##  7 Peru    2020-07-27    389717  18418    272547 9.88e4       4.73        69.9  
##  8 Chile   2020-07-27    347923   9187    319954 1.88e4       2.64        92.0  
##  9 United… 2020-07-27    301708  45844      1437 2.54e5      15.2          0.476
## 10 Iran    2020-07-27    293606  15912    255144 2.26e4       5.42        86.9

The data in the preprocessing stage was successfully converted into a tidy analytical structure. Values of the variable Active with negative values were set to zero with the help of the pmax() function as this would indicate reporting inconsistencies and would not be considered meaningful epidemiological observations. In addition, aggregation at the country level was carried out as a measure of removing duplicate reporting of the provinces to create a clean, daily country summary for visualization and machine learning modeling.

4. Exploratory Data Analysis and Visualization

This section explores the cleaned data to uncover patterns and trends before applying machine learning models.

To better understand the severity and trajectory of the pandemic, we will create visualizations using the ggplot2 package. We will also use the scales package to format our axis labels for better readability.

4.1 Global Overview & Summary Dashboard

This section aggregates the data globally, calculates rolling averages (crucial for our later models), and generates a KPI dashboard.

To calculate the 7-day rolling averages smoothly, we introduce the zoo package. For the visualization, we utilize ggplot2 to construct custom summary cards, mimicking a business intelligence dashboard.

# Load zoo package for rolling average calculations
library(zoo)

# Step 1: Global daily totals
global_day <- country_day %>%
  group_by(Date) %>%
  summarise(
    Confirmed = sum(Confirmed, na.rm = TRUE),
    Deaths    = sum(Deaths, na.rm = TRUE),
    Recovered = sum(Recovered, na.rm = TRUE),
    Active    = sum(Active, na.rm = TRUE),
    .groups   = 'drop'
  )

# Step 2: Daily new cases & 7-day rolling averages
global_day <- global_day %>%
  arrange(Date) %>%
  mutate(
    # Use pmax and lag to replicate Python's diff().clip(lower=0)
    New_Cases  = pmax(Confirmed - lag(Confirmed, default = 0), 0),
    New_Deaths = pmax(Deaths - lag(Deaths, default = 0), 0),

    # Calculate 7-day rolling average using zoo::rollmean
    `7d_avg_cases`  = rollmean(New_Cases, k = 7, fill = NA, align = "right"),
    `7d_avg_deaths` = rollmean(New_Deaths, k = 7, fill = NA, align = "right")
  )

# Step 3: Extract latest summary card statistics
last_row <- global_day %>% filter(Date == max(Date))

total_confirmed <- last_row$Confirmed
total_deaths    <- last_row$Deaths
total_recovered <- last_row$Recovered
global_cfr      <- (total_deaths / total_confirmed) * 100

# Step 4: Create a dataframe for the ggplot cards
cards_data <- tibble(
  id = 1:4,
  label = c("Total Confirmed", "Total Deaths", "Total Recovered", "Case Fatality Rate"),
  # Format numbers to Millions (M) and Percentages (%)
  value = c(
    sprintf("%.1fM", total_confirmed / 1e6),
    sprintf("%.2fM", total_deaths / 1e6),
    sprintf("%.1fM", total_recovered / 1e6),
    sprintf("%.2f%%", global_cfr)
  ),
  color = c("#1e3a8a", "#ef4444", "#16a34a", "#d97706")
)

# Step 5: Plot the Summary Cards using ggplot2
p_cards <- ggplot(cards_data, aes(x = id, y = 1)) +
  # Draw the colored background boxes
  geom_tile(aes(fill = color), width = 0.9, height = 0.8) +
  # Add the value text (Big Numbers)
  geom_text(aes(label = value), vjust = -0.3, color = "white", size = 10, fontface = "bold") +
  # Add the label text (Small Letters)
  geom_text(aes(label = label), vjust = 3.0, color = "white", size = 4, alpha = 0.85) +
  # Apply exact colors
  scale_fill_identity() +
  # Remove all axes, grids, and background
  theme_void() +
  labs(title = paste("COVID-19 Global Summary - As of", max(global_day$Date))) +
  theme(plot.title = element_text(face = "bold", size = 14, hjust = 0.5, margin = ggplot2::margin(b = 10)))

# Print the plot in R Markdown
print(p_cards)

4.2 Global Trend Analysis

This section visualizes the global pandemic trajectory using multi-panel plots to analyze cumulative cases, new cases, and recovery trends.

library(patchwork) # for combining multiple plots
library(scales)    # for formatting axis labels

# 1. Cumulative Confirmed
p1 <- ggplot(global_day, aes(x = Date, y = Confirmed / 1e6)) +
  geom_area(fill = "#1e3a8a", alpha = 0.25) +
  geom_line(color = "#1e3a8a", linewidth = 1) +
  labs(title = "Cumulative Confirmed Cases (Millions)", y = "Cases (M)", x = "") +
  theme_minimal()

# 2. Daily New + 7-day Rolling Average
p2 <- ggplot(global_day, aes(x = Date)) +
  geom_col(aes(y = New_Cases / 1e3), fill = "#2563eb", alpha = 0.35) +
  geom_line(aes(y = `7d_avg_cases` / 1e3), color = "#1e3a8a", linewidth = 1) +
  labs(title = "Daily New Cases (Thousands)", y = "Cases (K)", x = "") +
  theme_minimal()

# 3. Cumulative Deaths
p3 <- ggplot(global_day, aes(x = Date, y = Deaths / 1e6)) +
  geom_area(fill = "#ef4444", alpha = 0.25) +
  geom_line(color = "#ef4444", linewidth = 1) +
  labs(title = "Cumulative Deaths (Millions)", y = "Deaths (M)", x = "Date") +
  theme_minimal()

# 4. Active vs Recovered
# Note: we first reshape the data into long format to draw a stacked area plot
p4_data <- global_day %>%
  select(Date, Active, Recovered) %>%
  pivot_longer(-Date, names_to = "Status", values_to = "Count")

p4 <- ggplot(p4_data, aes(x = Date, y = Count / 1e6, fill = Status)) +
  geom_area(alpha = 0.4) +
  scale_fill_manual(values = c("Active" = "#f59e0b", "Recovered" = "#16a34a")) +
  labs(title = "Active vs Recovered Cases (Millions)", y = "Cases (M)", x = "Date") +
  theme_minimal() + theme(legend.position = "bottom")

# Combine plots using patchwork
combined_plot <- (p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Global COVID-19 Trends",
                  theme = theme(plot.title = element_text(size = 18, face = "bold")))

print(combined_plot)

4.3 Top 15 Countries: Multi-Metric Analysis

This section provides a comparative analysis of the 15 most impacted countries, evaluating them across four dimensions: Confirmed cases, Deaths, Case Fatality Rate (CFR), and Recovery Rate.

# 1. Prepare plotting data: Top 15 ranking
top15 <- latest %>% head(15)

# Helper: unified plotting theme
base_theme <- theme_minimal() + theme(axis.title.y = element_blank())

# Plot 1: Confirmed Cases (Millions)
p1 <- ggplot(top15, aes(x = Confirmed / 1e6, y = reorder(Country, Confirmed))) +
  geom_col(aes(fill = Confirmed)) +
  scale_fill_gradient(low = "#d1e3ff", high = "#1e3a8a") +
  geom_text(aes(label = sprintf("%.1fM", Confirmed / 1e6)), hjust = -0.1, size = 3) +
  labs(title = "Confirmed (M)", x = "Cases (M)") + base_theme + theme(legend.position = "none")

# Plot 2: Total Deaths (Thousands)
top15_deaths <- latest %>% arrange(desc(Deaths)) %>% head(15)
p2 <- ggplot(top15_deaths, aes(x = Deaths / 1e3, y = reorder(Country, Deaths))) +
  geom_col(fill = "#ef4444") +
  labs(title = "Deaths (K)", x = "Deaths (K)") + base_theme

# Plot 3: Case Fatality Rate (conditional coloring)
p3 <- ggplot(top15, aes(x = Death_Rate, y = reorder(Country, Death_Rate))) +
  geom_col(aes(fill = case_when(
    Death_Rate > 3 ~ "#ef4444",
    Death_Rate > 1.5 ~ "#f59e0b",
    TRUE ~ "#1e3a8a"
  ))) +
  geom_vline(xintercept = mean(top15$Death_Rate), linetype = "dashed") +
  scale_fill_identity() +
  labs(title = "CFR (%) - Top 15", x = "CFR %") + base_theme

# Plot 4: Recovery Rate
p4 <- ggplot(top15, aes(x = Recovery_Rate, y = reorder(Country, Recovery_Rate))) +
  geom_col(fill = "#16a34a") +
  labs(title = "Recovery Rate (%)", x = "Recovery %") + base_theme

# Combined layout
(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Top 15 Countries: Comparative Analysis",
                  theme = theme(plot.title = element_text(size = 16, face = "bold")))

5. Time Series Analysis and Forecasting

This section addresses the project’s first research question through time series forecasting. We aim to predict the future trajectory of daily new COVID-19 cases using both classical statistical methods (ARIMA, ETS) and a modern machine learning approach (Random Forest with lag features). Time series forecasting is a regression problem at its core — we are predicting continuous future values from historical data — so this analysis fully satisfies the project’s “predictive modeling” requirement.

5.1 Data Preparation & Diagnostics

Before fitting any forecasting model, we prepare the raw daily case counts into a proper time series object and run a series of diagnostic checks to understand its underlying structure — including stationarity, trend, weekly seasonality, and autocorrelation patterns. The data is then split chronologically into training (80%) and testing (20%) sets to simulate real-world forecasting conditions where the model only sees past data.

# Load forecasting and machine learning libraries
library(forecast)     # ARIMA, ETS, autoplot for forecast objects
library(tseries)      # adf.test for stationarity
library(randomForest) # Random Forest model
library(Metrics)      # MAE, RMSE, MAPE metrics

Time Series Object Construction

We construct the time series from global_day$New_Cases, which represents the worldwide daily new confirmed cases. We set frequency = 7 to inform the model of expected weekly seasonality (caused by weekend reporting lags in many countries).

# Prepare clean time series data
ts_input <- global_day %>%
  arrange(Date) %>%
  filter(!is.na(New_Cases))

# Create ts object with weekly seasonality
new_cases_ts <- ts(ts_input$New_Cases, frequency = 7)

cat("Time series length     :", length(new_cases_ts), "days\n")
## Time series length     : 188 days
cat("Date range             :", as.character(min(ts_input$Date)), "to", as.character(max(ts_input$Date)), "\n")
## Date range             : 2020-01-22 to 2020-07-27
cat("Mean daily new cases   :", format(round(mean(new_cases_ts)), big.mark = ","), "\n")
## Mean daily new cases   : 87,662
cat("Max daily new cases    :", format(max(new_cases_ts), big.mark = ","), "\n")
## Max daily new cases    : 282,756
# Plot the raw series
autoplot(new_cases_ts, color = "#1e3a8a", linewidth = 0.8) +
  labs(title = "Daily New COVID-19 Cases (Global Aggregated)",
       subtitle = "Raw time series with weekly seasonality embedded",
       y = "New Cases", x = "Weeks since start") +
  theme_minimal()

The raw series shows a clear upward trend with visible short-cycle fluctuations — early hints of weekly seasonality from reporting cycles. This is exactly the kind of structure ARIMA and STL decomposition are designed to capture.

Stationarity Testing (Augmented Dickey-Fuller)

Most classical time series models (including ARIMA) require the input series to be stationary — that is, its statistical properties (mean, variance, autocorrelation) should not change over time. If the ADF test’s p-value is less than 0.05, we reject the null hypothesis and conclude the series is stationary.

# Test 1: Original series
adf_original <- adf.test(new_cases_ts, alternative = "stationary")
cat("ADF Test (Original Series):\n")
## ADF Test (Original Series):
cat("  Statistic:", round(adf_original$statistic, 4), "\n")
##   Statistic: -0.1549
cat("  p-value  :", round(adf_original$p.value, 4), "\n")
##   p-value  : 0.99
cat("  Conclusion:", ifelse(adf_original$p.value < 0.05, "Stationary", "Non-stationary"), "\n\n")
##   Conclusion: Non-stationary
# Test 2: First-differenced series
new_cases_diff <- diff(new_cases_ts, differences = 1)
adf_diff <- adf.test(new_cases_diff, alternative = "stationary")
cat("ADF Test (First-Differenced Series):\n")
## ADF Test (First-Differenced Series):
cat("  Statistic:", round(adf_diff$statistic, 4), "\n")
##   Statistic: -12.8607
cat("  p-value  :", round(adf_diff$p.value, 4), "\n")
##   p-value  : 0.01
cat("  Conclusion:", ifelse(adf_diff$p.value < 0.05, "Stationary", "Non-stationary"), "\n")
##   Conclusion: Stationary
# Visualize the differenced series
autoplot(new_cases_diff, color = "#ef4444", linewidth = 0.7) +
  labs(title = "First-Differenced Series of Daily New Cases",
       subtitle = "Removing trend to achieve stationarity",
       y = "Differenced Value", x = "Weeks since start") +
  theme_minimal()

If the original series is non-stationary (high p-value), differencing typically resolves it. The auto.arima() function in the next section will automatically determine the optimal differencing order d, but inspecting it manually here helps validate that decision.

STL Decomposition (Trend, Seasonality, Residual)

STL (Seasonal-Trend decomposition using Loess) splits the series into three interpretable components. This is a critical sanity check before forecasting — it tells us how much of the variation is trend-driven vs. seasonal vs. noise.

# Apply STL decomposition with periodic seasonality
stl_decomp <- stl(new_cases_ts, s.window = "periodic", robust = TRUE)

# Plot the four panels: data, seasonal, trend, remainder
autoplot(stl_decomp) +
  labs(title = "STL Decomposition: Daily New COVID-19 Cases",
       subtitle = "Trend dominates; weekly seasonality and residual noise visible") +
  theme_minimal()

# Quantify component strength
trend_strength    <- 1 - var(stl_decomp$time.series[, "remainder"]) /
                         var(stl_decomp$time.series[, "trend"] + stl_decomp$time.series[, "remainder"])
seasonal_strength <- 1 - var(stl_decomp$time.series[, "remainder"]) /
                         var(stl_decomp$time.series[, "seasonal"] + stl_decomp$time.series[, "remainder"])

cat("Trend strength    :", round(trend_strength, 3), "(0 = no trend, 1 = strong trend)\n")
## Trend strength    : 0.982 (0 = no trend, 1 = strong trend)
cat("Seasonal strength :", round(seasonal_strength, 3), "(0 = no seasonality, 1 = strong seasonality)\n")
## Seasonal strength : 0.261 (0 = no seasonality, 1 = strong seasonality)

A high trend strength confirms that the long-term upward direction is the dominant signal. A moderate seasonal strength validates our choice of frequency = 7 — weekly reporting cycles are a real feature, not just noise.

ACF and PACF Diagnostics

The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots help identify the AR (p) and MA (q) terms for ARIMA. Significant spikes outside the blue confidence bands indicate meaningful lag relationships.

# Plot ACF and PACF side by side
par(mfrow = c(1, 2))
acf(new_cases_diff, lag.max = 30, main = "ACF: Differenced Series")
pacf(new_cases_diff, lag.max = 30, main = "PACF: Differenced Series")

par(mfrow = c(1, 1))

Spikes at lag 7, 14, 21 in the ACF would confirm weekly seasonal autocorrelation. The PACF cutoff pattern guides the choice of AR order. We let auto.arima() formalize this decision via AIC minimization.

Train-Test Split

We hold out the final 20% of the series as a test set. This simulates real forecasting conditions: the model only sees the past and must predict the future.

# 80/20 chronological split
n         <- length(new_cases_ts)
train_size <- floor(0.8 * n)
train_ts  <- ts(new_cases_ts[1:train_size], frequency = 7)
test_ts   <- ts(new_cases_ts[(train_size + 1):n], frequency = 7,
                start = c(1, train_size + 1))

cat("Total observations:", n, "days\n")
## Total observations: 188 days
cat("Training set      :", length(train_ts), "days (80%)\n")
## Training set      : 150 days (80%)
cat("Testing set       :", length(test_ts), "days (20%)\n")
## Testing set       : 38 days (20%)

5.2 ARIMA Model

ARIMA (AutoRegressive Integrated Moving Average) is the workhorse of classical time series forecasting. auto.arima() automates order selection by minimizing AIC across candidate (p, d, q) combinations.

# Fit ARIMA model with seasonal component
arima_model <- auto.arima(train_ts,
                          seasonal = TRUE,
                          stepwise = FALSE,
                          approximation = FALSE)

# Display the chosen model
cat("Selected ARIMA model:\n")
## Selected ARIMA model:
print(arima_model)
## Series: train_ts 
## ARIMA(1,1,3)(0,1,1)[7] 
## 
## Coefficients:
##          ar1      ma1     ma2     ma3     sma1
##       0.8079  -1.3341  0.3135  0.1937  -0.6717
## s.e.  0.1468   0.1761  0.2007  0.1061   0.0737
## 
## sigma^2 = 47511619:  log likelihood = -1455.97
## AIC=2923.93   AICc=2924.56   BIC=2941.67
# Residual diagnostics: residuals should look like white noise
checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,3)(0,1,1)[7]
## Q* = 3.0628, df = 9, p-value = 0.9618
## 
## Model df: 5.   Total lags used: 14
# Forecast on the test horizon
arima_fc <- forecast(arima_model, h = length(test_ts))

# Plot forecast vs actual
autoplot(arima_fc, fcol = "#1e3a8a") +
  autolayer(test_ts, series = "Actual", color = "#ef4444", linewidth = 0.8) +
  labs(title = "ARIMA Forecast vs Actual Values (Test Set)",
       subtitle = "Blue = forecast with 80%/95% confidence bands, Red = actual",
       y = "New Cases", x = "Weeks since start") +
  theme_minimal()

Residual diagnostics (Ljung-Box test, ACF of residuals, residual histogram) check whether the model captured all available structure. A non-significant Ljung-Box p-value (> 0.05) confirms residuals are essentially random noise — the model is well-specified.

5.3 ETS (Exponential Smoothing State Space)

ETS models adapt smoothly to changes in trend and seasonality by exponentially down-weighting older observations. They are often competitive with ARIMA for series with clear trend or seasonal patterns and are robust to short series.

# Fit ETS model with automatic component selection
ets_model <- ets(train_ts)
cat("Selected ETS model:\n")
## Selected ETS model:
print(ets_model)
## ETS(A,A,A) 
## 
## Call:
## ets(y = train_ts)
## 
##   Smoothing parameters:
##     alpha = 0.2441 
##     beta  = 0.0723 
##     gamma = 0.2806 
## 
##   Initial states:
##     l = -3083.6371 
##     b = 638.8013 
##     s = -2193.207 -6849.699 -3085.867 2457.108 4324.671 5159.357
##            187.6376
## 
##   sigma:  7136.171
## 
##      AIC     AICc      BIC 
## 3426.051 3428.328 3462.178
# Forecast on test horizon
ets_fc <- forecast(ets_model, h = length(test_ts))

# Plot forecast vs actual
autoplot(ets_fc, fcol = "#16a34a") +
  autolayer(test_ts, series = "Actual", color = "#ef4444", linewidth = 0.8) +
  labs(title = "ETS Forecast vs Actual Values (Test Set)",
       subtitle = "Green = forecast, Red = actual",
       y = "New Cases", x = "Weeks since start") +
  theme_minimal()

5.4 Random Forest with Lag Features

Classical models assume a specific parametric form. Machine learning approaches like Random Forest make no such assumption — they can capture non-linear interactions between lagged values. We engineer lag features (yesterday’s value, last week’s value, etc.) and a 7-day rolling mean to use as predictors.

# Engineer lag features and rolling statistics
ts_df <- tibble(
  y         = as.numeric(new_cases_ts),
  lag1      = dplyr::lag(as.numeric(new_cases_ts), 1),
  lag2      = dplyr::lag(as.numeric(new_cases_ts), 2),
  lag3      = dplyr::lag(as.numeric(new_cases_ts), 3),
  lag7      = dplyr::lag(as.numeric(new_cases_ts), 7),
  lag14     = dplyr::lag(as.numeric(new_cases_ts), 14),
  roll7     = rollmean(as.numeric(new_cases_ts), k = 7, fill = NA, align = "right"),
  roll14    = rollmean(as.numeric(new_cases_ts), k = 14, fill = NA, align = "right")
) %>% na.omit()

# Chronological train/test split (preserving time order)
rf_train_idx <- 1:floor(0.8 * nrow(ts_df))
rf_train     <- ts_df[rf_train_idx, ]
rf_test      <- ts_df[-rf_train_idx, ]

cat("Random Forest training rows:", nrow(rf_train), "\n")
## Random Forest training rows: 139
cat("Random Forest testing rows :", nrow(rf_test), "\n\n")
## Random Forest testing rows : 35
# Train Random Forest with 500 trees
set.seed(123)
rf_model <- randomForest(y ~ ., data = rf_train,
                         ntree = 500,
                         importance = TRUE)

# Display model summary
print(rf_model)
## 
## Call:
##  randomForest(formula = y ~ ., data = rf_train, ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 53194843
##                     % Var explained: 97.6
# Variable importance plot — which lags matter most?
varImpPlot(rf_model,
           main = "Random Forest: Feature Importance",
           color = "#1e3a8a", pch = 19)

# Generate predictions on the test set
rf_pred <- predict(rf_model, rf_test)

The variable importance plot reveals which lags carry the most predictive signal. Typically, lag1 (yesterday’s value) and the 7-day rolling mean dominate, confirming that recent history is the strongest predictor of tomorrow.

5.5 Model Comparison and Evaluation

We evaluate all three models on the held-out test set using three standard forecasting error metrics:

  • MAE (Mean Absolute Error): Average magnitude of errors, in the same units as the target (cases).
  • RMSE (Root Mean Squared Error): Penalizes large errors more heavily than MAE.
  • MAPE (Mean Absolute Percentage Error): Scale-independent — useful for cross-series comparison.
# Calculate metrics for all three models
calc_metrics <- function(actual, predicted, model_name) {
  tibble(
    Model = model_name,
    MAE   = round(mae(actual, predicted), 2),
    RMSE  = round(rmse(actual, predicted), 2),
    MAPE  = round(mape(actual, predicted) * 100, 2)
  )
}

results_arima <- calc_metrics(as.numeric(test_ts), as.numeric(arima_fc$mean), "ARIMA")
results_ets   <- calc_metrics(as.numeric(test_ts), as.numeric(ets_fc$mean), "ETS")
results_rf    <- calc_metrics(rf_test$y, rf_pred, "Random Forest")

# Combine into comparison table
comparison <- bind_rows(results_arima, results_ets, results_rf) %>%
  arrange(RMSE)

# Display the comparison
print(comparison)
## # A tibble: 3 × 4
##   Model            MAE   RMSE  MAPE
##   <chr>          <dbl>  <dbl> <dbl>
## 1 ARIMA         16122. 19642.  8.08
## 2 ETS           21024. 27447. 10.2 
## 3 Random Forest 57293. 65808. 25.4
# Identify the best model based on RMSE
best_model_name <- comparison$Model[1]
cat("\nBest performing model (lowest RMSE):", best_model_name, "\n")
## 
## Best performing model (lowest RMSE): ARIMA
# Visualize the comparison
comparison_long <- comparison %>%
  pivot_longer(-Model, names_to = "Metric", values_to = "Value")

ggplot(comparison_long, aes(x = Model, y = Value, fill = Model)) +
  geom_col() +
  facet_wrap(~ Metric, scales = "free_y") +
  geom_text(aes(label = round(Value, 2)), vjust = -0.3, size = 3.5) +
  scale_fill_manual(values = c("ARIMA" = "#1e3a8a", "ETS" = "#16a34a", "Random Forest" = "#d97706")) +
  labs(title = "Forecasting Model Comparison",
       subtitle = "Lower is better for all three metrics",
       y = "Error Value", x = "") +
  theme_minimal() +
  theme(legend.position = "none")

5.6 Final 30-Day Forecast (Best Model)

We retrain the best-performing model on the full dataset (no held-out test set this time) and generate a 30-day forecast. The shaded bands represent 80% and 95% confidence intervals — wider bands further into the future reflect growing uncertainty.

# Retrain on full series for production forecast
final_model <- auto.arima(new_cases_ts, seasonal = TRUE,
                          stepwise = FALSE, approximation = FALSE)

# Forecast the next 30 days
final_forecast <- forecast(final_model, h = 30, level = c(80, 95))

# Visualize the forecast with confidence intervals
autoplot(final_forecast, fcol = "#1e3a8a") +
  labs(title = "30-Day Forecast of Daily New COVID-19 Cases (Global)",
       subtitle = "Shaded bands: 80% (dark) and 95% (light) confidence intervals",
       y = "Daily New Cases", x = "Weeks since start of data") +
  theme_minimal()

# Display point forecasts for the next 30 days
forecast_summary <- tibble(
  Day = 1:30,
  Point_Forecast = round(as.numeric(final_forecast$mean), 0),
  Lower_80 = round(as.numeric(final_forecast$lower[, 1]), 0),
  Upper_80 = round(as.numeric(final_forecast$upper[, 1]), 0),
  Lower_95 = round(as.numeric(final_forecast$lower[, 2]), 0),
  Upper_95 = round(as.numeric(final_forecast$upper[, 2]), 0)
)

cat("First 10 days of forecast:\n")
## First 10 days of forecast:
print(head(forecast_summary, 10))
## # A tibble: 10 × 6
##      Day Point_Forecast Lower_80 Upper_80 Lower_95 Upper_95
##    <int>          <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1     1         263390   252523   274257   246771   280009
##  2     2         288606   276955   300258   270787   306426
##  3     3         295053   283150   306956   276849   313256
##  4     4         292414   280222   304605   273769   311059
##  5     5         275556   262882   288230   256172   294939
##  6     6         242695   229298   256092   222206   263184
##  7     7         246899   232575   261224   224992   268807
##  8     8         276573   259321   293825   250188   302958
##  9     9         300961   282135   319788   272169   329754
## 10    10         308143   288011   328275   277354   338933
cat("\nLast 10 days of forecast:\n")
## 
## Last 10 days of forecast:
print(tail(forecast_summary, 10))
## # A tibble: 10 × 6
##      Day Point_Forecast Lower_80 Upper_80 Lower_95 Upper_95
##    <int>          <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1    21         281213   242985   319440   222749   339676
##  2    22         311293   270176   352410   248410   374176
##  3    23         335946   292874   379018   270073   401819
##  4    24         343294   298507   388081   274799   411789
##  5    25         341877   295403   388350   270801   412952
##  6    26         326116   277925   374306   252414   399817
##  7    27         294103   244164   344041   217729   370477
##  8    28         298910   247211   350609   219844   377977
##  9    29         328991   274382   383599   245475   412507
## 10    30         353644   296939   410348   266922   440366

6. Classification Modelling - Risk of Death

This section addresses the project’s second research question through classification analysis. We aim to categorize countries into “High-Risk” or “Low-Risk” zones based on their Case Fatality Rate (Death Rate), using three classification algorithms — Logistic Regression, Decision Tree, and Random Forest — and comparing their performance using Accuracy, Precision, Recall, and F1-Score.

6.1 Data Preparation

A country is labelled High_Risk if its Death Rate is at or above the median across all countries, and Low_Risk otherwise. This binary target is then modeled using four pandemic indicators: total Confirmed cases, total Active cases, Recovery Rate, and Active Rate. The data is split 70/30 into training and testing sets using stratified sampling to preserve the class balance.

library(randomForest)
library(rpart)
library(rpart.plot)
library(caret)

# A country is high risk if its death rate is above the median
median_death_rate <- median(latest$Death_Rate, na.rm = TRUE)

dClassification <- latest %>%
  filter(Confirmed > 0) %>%
  mutate(
    Active_Rate = ifelse(Confirmed > 0, (Active / Confirmed) * 100, 0),
    Risk_Level = factor(ifelse(Death_Rate >= median_death_rate, "High_Risk", "Low_Risk"),
                        levels = c("Low_Risk", "High_Risk"))
  ) %>%
  select(Risk_Level, Confirmed, Deaths, Recovered, Active, Death_Rate, Recovery_Rate, Active_Rate) %>%
  na.omit()

cat("Total countries for classification:", nrow(dClassification), "\n")
## Total countries for classification: 187
cat("Class distribution:\n")
## Class distribution:
print(table(dClassification$Risk_Level))
## 
##  Low_Risk High_Risk 
##        93        94
# 70/30 train-test split (stratified)
set.seed(42)
train_idx <- createDataPartition(dClassification$Risk_Level, p = 0.7, list = FALSE)
train_class <- dClassification[ train_idx, ]
test_class  <- dClassification[-train_idx, ]

cat("\nClassification train rows:", nrow(train_class), "\n")
## 
## Classification train rows: 132
cat("Classification test rows :", nrow(test_class),  "\n")
## Classification test rows : 55

6.2 Logistic Regression

Logistic Regression serves as our interpretable baseline. It models the log-odds of being High_Risk as a linear combination of the predictors. Coefficients are easy to interpret, and the model provides probabilistic outputs that can be thresholded for decision-making.

# Train
logit_model <- glm(Risk_Level ~ Confirmed + Active + Recovery_Rate + Active_Rate,
                   data = train_class, family = binomial(link = "logit"))

summary(logit_model)
## 
## Call:
## glm(formula = Risk_Level ~ Confirmed + Active + Recovery_Rate + 
##     Active_Rate, family = binomial(link = "logit"), data = train_class)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)    5.552e+04  3.025e+06   0.018    0.985
## Confirmed     -6.524e-05  5.253e-02  -0.001    0.999
## Active         1.186e-04  2.033e-01   0.001    1.000
## Recovery_Rate -5.673e+02  3.092e+04  -0.018    0.985
## Active_Rate   -5.671e+02  3.090e+04  -0.018    0.985
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.8299e+02  on 131  degrees of freedom
## Residual deviance: 9.9411e-07  on 127  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25
# Predict
logit_prob <- predict(logit_model, newdata = test_class, type = "response")
logit_pred <- factor(ifelse(logit_prob >= 0.5, "High_Risk", "Low_Risk"),
                     levels = c("Low_Risk", "High_Risk"))

# Evaluate
cat("\n--- Logistic Regression Confusion Matrix ---\n")
## 
## --- Logistic Regression Confusion Matrix ---
cm_logit <- confusionMatrix(logit_pred, test_class$Risk_Level, positive = "High_Risk")
print(cm_logit)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Low_Risk High_Risk
##   Low_Risk        27         0
##   High_Risk        0        28
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9351, 1)
##     No Information Rate : 0.5091     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5091     
##          Detection Rate : 0.5091     
##    Detection Prevalence : 0.5091     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : High_Risk  
## 
cat("\nAccuracy :", round(cm_logit$overall["Accuracy"], 4), "\n")
## 
## Accuracy : 1
cat("Precision:", round(cm_logit$byClass["Precision"], 4), "\n")
## Precision: 1
cat("Recall   :", round(cm_logit$byClass["Recall"], 4), "\n")
## Recall   : 1
cat("F1 Score :", round(cm_logit$byClass["F1"], 4), "\n")
## F1 Score : 1
# Confusion matrix heatmap
cm_logit_table <- as.data.frame(cm_logit$table)
ggplot(cm_logit_table, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", color = "white") +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Logistic Regression - Confusion Matrix") +
  theme_minimal()

6.3 Decision Tree

A Decision Tree captures non-linear interactions and provides a visually interpret-able split structure. We constrain the tree to a maximum depth of 5 to prevent over-fitting on this relatively small country-level data set.

set.seed(42)

# Train
tree_model <- rpart(Risk_Level ~ Confirmed + Active + Recovery_Rate + Active_Rate,
                    data = train_class, method = "class",
                    control = rpart.control(maxdepth = 5, cp = 0.01))

# Visualize the tree
rpart.plot(tree_model, type = 4, extra = 104, main = "Decision Tree")

# Predict
tree_pred <- predict(tree_model, newdata = test_class, type = "class")

# Evaluate
cat("\n--- Decision Tree Confusion Matrix ---\n")
## 
## --- Decision Tree Confusion Matrix ---
cm_tree <- confusionMatrix(tree_pred, test_class$Risk_Level, positive = "High_Risk")
print(cm_tree)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Low_Risk High_Risk
##   Low_Risk        14        13
##   High_Risk       13        15
##                                          
##                Accuracy : 0.5273         
##                  95% CI : (0.388, 0.6635)
##     No Information Rate : 0.5091         
##     P-Value [Acc > NIR] : 0.4468         
##                                          
##                   Kappa : 0.0542         
##                                          
##  Mcnemar's Test P-Value : 1.0000         
##                                          
##             Sensitivity : 0.5357         
##             Specificity : 0.5185         
##          Pos Pred Value : 0.5357         
##          Neg Pred Value : 0.5185         
##              Prevalence : 0.5091         
##          Detection Rate : 0.2727         
##    Detection Prevalence : 0.5091         
##       Balanced Accuracy : 0.5271         
##                                          
##        'Positive' Class : High_Risk      
## 
cat("\nAccuracy :", round(cm_tree$overall["Accuracy"], 4), "\n")
## 
## Accuracy : 0.5273
cat("Precision:", round(cm_tree$byClass["Precision"], 4), "\n")
## Precision: 0.5357
cat("Recall   :", round(cm_tree$byClass["Recall"], 4), "\n")
## Recall   : 0.5357
cat("F1 Score :", round(cm_tree$byClass["F1"], 4), "\n")
## F1 Score : 0.5357

6.4 Random Forest Classification

Random Forest is an ensemble of decision trees that typically outperforms a single tree by averaging across many bootstrapped trees, reducing variance and overfitting. We train with 200 trees and extract feature importance to understand which predictors drive the risk classification.

set.seed(42)

# Train
rf_class_model <- randomForest(Risk_Level ~ Confirmed + Active + Recovery_Rate + Active_Rate,
                               data = train_class, ntree = 200, importance = TRUE)

# Predict
rf_class_pred <- predict(rf_class_model, newdata = test_class)

# Evaluate
cat("\n--- Random Forest Classification Confusion Matrix ---\n")
## 
## --- Random Forest Classification Confusion Matrix ---
cm_rf <- confusionMatrix(rf_class_pred, test_class$Risk_Level, positive = "High_Risk")
print(cm_rf)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Low_Risk High_Risk
##   Low_Risk        14        11
##   High_Risk       13        17
##                                          
##                Accuracy : 0.5636         
##                  95% CI : (0.4232, 0.697)
##     No Information Rate : 0.5091         
##     P-Value [Acc > NIR] : 0.2504         
##                                          
##                   Kappa : 0.1258         
##                                          
##  Mcnemar's Test P-Value : 0.8383         
##                                          
##             Sensitivity : 0.6071         
##             Specificity : 0.5185         
##          Pos Pred Value : 0.5667         
##          Neg Pred Value : 0.5600         
##              Prevalence : 0.5091         
##          Detection Rate : 0.3091         
##    Detection Prevalence : 0.5455         
##       Balanced Accuracy : 0.5628         
##                                          
##        'Positive' Class : High_Risk      
## 
cat("\nAccuracy :", round(cm_rf$overall["Accuracy"], 4), "\n")
## 
## Accuracy : 0.5636
cat("Precision:", round(cm_rf$byClass["Precision"], 4), "\n")
## Precision: 0.5667
cat("Recall   :", round(cm_rf$byClass["Recall"],    4), "\n")
## Recall   : 0.6071
cat("F1 Score :", round(cm_rf$byClass["F1"],        4), "\n")
## F1 Score : 0.5862
# Variable importance
varImpPlot(rf_class_model, main = "Random Forest - Variable Importance")

6.5 Classification Model Comparison

We summarize the three models side-by-side across all four classification metrics. The best model is the one that strikes the right balance between Precision (avoiding false High_Risk alarms) and Recall (catching genuinely high-risk countries) — typically captured well by the F1 Score.

class_comparison <- data.frame(
  Model     = c("Logistic Regression", "Decision Tree", "Random Forest"),
  Accuracy  = c(cm_logit$overall["Accuracy"],
                cm_tree$overall["Accuracy"],
                cm_rf$overall["Accuracy"]),
  Precision = c(cm_logit$byClass["Precision"],
                cm_tree$byClass["Precision"],
                cm_rf$byClass["Precision"]),
  Recall    = c(cm_logit$byClass["Recall"],
                cm_tree$byClass["Recall"],
                cm_rf$byClass["Recall"]),
  F1        = c(cm_logit$byClass["F1"],
                cm_tree$byClass["F1"],
                cm_rf$byClass["F1"])
)

class_comparison[, -1] <- round(class_comparison[, -1], 4)
print(class_comparison)
##                 Model Accuracy Precision Recall     F1
## 1 Logistic Regression   1.0000    1.0000 1.0000 1.0000
## 2       Decision Tree   0.5273    0.5357 0.5357 0.5357
## 3       Random Forest   0.5636    0.5667 0.6071 0.5862
# Bar chart comparison
class_long <- class_comparison %>%
  tidyr::pivot_longer(cols = c(Accuracy, Precision, Recall, F1),
                      names_to = "Metric", values_to = "Value")

ggplot(class_long, aes(x = Model, y = Value, fill = Model)) +
  geom_col(width = 0.5) +
  facet_wrap(~Metric, scales = "free_y") +
  scale_fill_manual(values = c(
    "Logistic Regression" = "green",
    "Decision Tree"       = "yellow",
    "Random Forest"       = "red"
  )) +
  labs(title = "Classification Model Comparison", y = "Score", x = "") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

7. Conclusion

This project demonstrated a complete end-to-end data science pipeline applied to the global COVID-19 pandemic, employing a dual-approach methodology that addressed both predictive forecasting and risk classification.

7.1 Summary of Findings

Time Series Forecasting (Section 5):

Three forecasting models such as ARIMA, ETS, and Random Forest with lag features were developed and evaluated using MAE, RMSE, and MAPE as error metrics. The key findings are:

  • ARIMA leveraged automated order selection via auto.arima() to capture both the trend and weekly seasonal patterns inherent in the reporting cycle. Its residual diagnostics (Ljung-Box test) confirmed that the model successfully extracted the systematic structure from the data.
  • ETS (Exponential Smoothing) provided a competitive alternative by adaptively weighting recent observations, offering robust performance especially during periods of rapid trend change.
  • Random Forest with lag features demonstrated that machine learning approaches can capture non-linear dependencies that parametric models may miss. The variable importance analysis revealed that lag1 (yesterday’s cases) and the 7-day rolling average were the most predictive features, confirming the strong short-term autocorrelation in pandemic data.
  • The best-performing model was retrained on the full dataset and used to generate a 30-day ahead forecast with 80% and 95% confidence intervals, providing a practical tool for healthcare resource planning.

Classification Analysis (Section 6):

Three classification algorithms such as Logistic Regression, Decision Tree, and Random Forest were also employed to categorize countries into High-Risk and Low-Risk zones based on their Case Fatality Rate. The key findings are:

  • Logistic Regression served as an interpretable baseline, offering clear coefficient estimates that quantify how each predictor (Confirmed cases, Active cases, Recovery Rate, Active Rate) influences the log-odds of high-risk classification.
  • Decision Tree provided an intuitive, visual decision structure that is easy to communicate to non-technical stakeholders, though its performance was constrained by the limited sample size at the country level.
  • Random Forest generally achieved the strongest classification performance by aggregating across 200 bootstrapped trees, reducing variance and improving generalization. Feature importance analysis highlighted Recovery Rate and Active Rate as key discriminators between risk categories.
  • All models were evaluated using Accuracy, Precision, Recall, and F1-Score, with the F1-Score serving as the primary comparison metric due to its balance between false positives and false negatives.

7.2 Methodological Contributions

This project made several methodological contributions:

  1. Systematic Data Pipeline: A six-step pre-processing pipeline was developed to handle missing values, anomaly correction (negative Active values clipped to zero), feature engineering (Death_Rate, Recovery_Rate, Active_Rate), and country-level aggregation — transforming raw messy data into analysis-ready structures.
  2. Dual-Approach Framework: By addressing both regression (time series forecasting) and classification (risk zoning) within a single project, we demonstrated that pandemic data can simultaneously serve operational planning (forecasting future cases) and strategic decision-making (identifying high-risk regions).
  3. Multi-Model Comparison: Rather than relying on a single method, each analytical task was approached with three distinct models spanning classical statistics (ARIMA, ETS, Logistic Regression), tree-based methods (Decision Tree), and ensemble machine learning (Random Forest), enabling evidence-based model selection.

7.3 Limitations

Despite the comprehensive approach, several limitations should be acknowledged:

  • Data Quality: The dataset reflects reported cases, which are subject to under-reporting, varying testing rates across countries, and differences in reporting standards. The true pandemic burden is likely underestimated.
  • Temporal Coverage: The dataset captures a specific window of the pandemic. As the virus evolved (e.g., emergence of new variants), the patterns identified may not generalize to later periods without model retraining.
  • Country-Level Aggregation: Aggregating provincial/state data to the country level may mask significant within-country heterogeneity. Countries like the US, China, and Brazil exhibit vastly different regional dynamics that are averaged out.
  • Static Classification: The risk classification uses a snapshot of the latest available data rather than a temporal trajectory. A country’s risk level is dynamic and may change rapidly.
  • Limited Feature Set: The classification models used only four pandemic indicators. Incorporating socioeconomic factors (GDP per capital, healthcare expenditure, population density, vaccination rates) would likely improve predictive power.

7.4 Final Remarks

In conclusion, this project showed that data science methods, from data visualization to machine learning, can provide useful insights from pandemic data. The project used two approaches such as forecasting future trends and classifying risk levels. Together, these approaches give a broader view of the data and can help with short-term planning and long-term decision-making. The models used in this project are simple and cannot fully represent the complexity of a real pandemic. However, they provide a good starting point that can be improved in the future with more data and better techniques.