INTRODUCTION

COVID-19 has significantly influenced global health and socio-economic structures since its outbreak in late 2019. Analyzing country-specific cumulative incidence data helps reveal patterns and disparities in the spread and impact of the virus.

Objective

To demonstrate a structured data analytics process, including:

Data cleaning and preprocessing for consistency.
Exploratory data analysis (EDA) to uncover trends.
Visualization of country-level cumulative incidence using R.

Data Source:

International Centre for Mathematical Modelling and Data Analytics

Significance:

This project provides insights into country-level trends and highlights the importance of data-driven strategies in pandemic management and public health planning.

Data Exploration and Visualization

Import the data set

setwd("C:/Users/USER PC/Documents/GIT/Data Analytics with Prof/data sets")
covid <- read.csv("covid_incidence.csv")
head(covid)

##   X                     Name Cumulative_incidence
## 1 1                   Global             11220.31
## 2 2 United States of America             64796.07
## 3 3                    India              7545.93
## 4 4                   Brazil             37456.22
## 5 5       Russian Federation             23154.94
## 6 6       The United Kingdom             43565.32

View(covid)

Summary statistics

summary(covid)

##        X              Name           Cumulative_incidence
##  Min.   :  1.00   Length:238         Min.   :     0.0    
##  1st Qu.: 60.25   Class :character   1st Qu.:   878.5    
##  Median :119.50   Mode  :character   Median :  6853.4    
##  Mean   :119.50                      Mean   : 16482.5    
##  3rd Qu.:178.75                      3rd Qu.: 26727.9    
##  Max.   :238.00                      Max.   :109868.6    
##                                      NA's   :1

dim(covid)

## [1] 238   3

Deal with missing values

sum(is.na(covid))

## [1] 1

# Which row has the missing value?
row_with_missing <- covid[rowSums(is.na(covid))>0,]
row_with_missing

##       X  Name Cumulative_incidence
## 187 187 Other                   NA

This row is labelled "Other" rather than naming a country. It seems to group together countries that are not documented individually, and its cumulative incidence is missing. I will remove the row as part of cleaning the data.

covid <- covid[-187,]
dim(covid)

## [1] 237   3
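Dropping the row by position works for this file, but the index 187 is specific to it. A minimal sketch of the same step keyed on the missing value itself is shown here; the `toy` data frame and its values are made up for illustration and are not taken from the data set.

```r
# Toy data frame with the same columns as `covid` (values are made up)
toy <- data.frame(
  X = 1:4,
  Name = c("A", "B", "Other", "C"),
  Cumulative_incidence = c(10.5, 200.0, NA, 35.2),
  stringsAsFactors = FALSE
)

# Count missing values per column to see where they are
colSums(is.na(toy))

# Keep only the rows that have a recorded cumulative incidence
toy_clean <- toy[!is.na(toy$Cumulative_incidence), ]
dim(toy_clean)
```

Applied to the real data, the same filter would be `covid[!is.na(covid$Cumulative_incidence), ]`, which should remove the same single row as long as it is the only one with a missing value.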

Let's look at the summary statistics again

summary(covid)

##        X             Name           Cumulative_incidence
##  Min.   :  1.0   Length:237         Min.   :     0.0    
##  1st Qu.: 60.0   Class :character   1st Qu.:   878.5    
##  Median :119.0   Mode  :character   Median :  6853.4    
##  Mean   :119.2                      Mean   : 16482.5    
##  3rd Qu.:178.0                      3rd Qu.: 26727.9    
##  Max.   :238.0                      Max.   :109868.6

The data are skewed to the right: the mean (16482.5) is well above the median (6853.4). The outliers are treated as valid, since each value is the cumulative incidence of a different country and the incidence in one country cannot be said to depend on the incidence in another.
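The direction of the skew can also be checked numerically. The sketch below is one way to do it in base R, assuming a simple moment-based skewness; the `skewness` helper and the `toy` vector are illustrative names and values, not part of the original analysis.

```r
# Moment-based sample skewness; positive values indicate a right skew
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up values with one large observation, for illustration only
toy <- c(1, 2, 2, 3, 3, 4, 50)

mean(toy) > median(toy)  # the large value pulls the mean above the median
skewness(toy)            # positive for right-skewed data
```

On the real data the call would be `skewness(covid$Cumulative_incidence)`.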

Let's see if we can group these countries into different regions

library(countrycode)

## Warning: package 'countrycode' was built under R version 4.4.2

# Map countries to regions using the 'continent' field
covid$Region <- countrycode(sourcevar = covid$Name,
                           origin = "country.name",
                           destination = "continent")

## Warning: Some values were not matched unambiguously: Bonaire, Global, Kosovo[1], Saba, Saint Martin, Sint Eustatius

# Identify unmatched values
unmatched <- covid$Name[is.na(covid$Region)]
unmatched

## [1] "Global"         "Kosovo[1]"      "Saint Martin"   "Bonaire"       
## [5] "Sint Eustatius" "Saba"

I will create a manual mapping for the names that did not match. "Global" is the worldwide aggregate rather than a country, so it gets its own label.

# Custom mapping for unmatched values
custom_regions <- data.frame(
  Name = c("Bonaire", "Global", "Kosovo[1]", "Saba", "Saint Martin", "Sint Eustatius"),
  Region = c("Caribbean", "Global", "Europe", "Caribbean", "Caribbean", "Caribbean")
)
# Merge custom mapping with original dataset
library(tidyverse)
covid <- covid %>%
  left_join(custom_regions, by = "Name", suffix = c("", ".custom"))
# Fill in missing regions
covid$Region <- ifelse(is.na(covid$Region), covid$Region.custom, covid$Region)
covid <- covid %>% select(-Region.custom)

head(covid)

##   X                     Name Cumulative_incidence   Region
## 1 1                   Global             11220.31   Global
## 2 2 United States of America             64796.07 Americas
## 3 3                    India              7545.93     Asia
## 4 4                   Brazil             37456.22 Americas
## 5 5       Russian Federation             23154.94   Europe
## 6 6       The United Kingdom             43565.32   Europe

Now we have an extra column that shows the region of each country.
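The fill-in step can also be written without a join. The following is a minimal base R sketch of that idea, assuming the manual assignments are kept in a named vector; `name`, `region`, and `manual` are toy objects made up for the example.

```r
# Toy names and regions; NA marks names that the automatic lookup missed
name   <- c("India", "Saba", "Brazil", "Kosovo[1]")
region <- c("Asia", NA, "Americas", NA)

# Manual assignments as a named vector: names are the keys, values the regions
manual <- c("Saba" = "Caribbean", "Kosovo[1]" = "Europe")

# Fill only the missing entries; entries that already matched stay unchanged
unmatched <- is.na(region)
region[unmatched] <- unname(manual[name[unmatched]])
region
```

Any name that is absent from `manual` would stay `NA`, which makes leftover unmatched names easy to spot.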

Let's move on to visualization

Let's view the countries and their cumulative incidence

covid$Highlight <- ifelse(covid$Cumulative_incidence == max(covid$Cumulative_incidence), "Highest", "Other")

ggplot(covid, aes(as.factor(Name), Cumulative_incidence, fill = Highlight)) +
  geom_col() +
  scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8))+
  labs(title = "Countries and cumulative incidence",x = "Country", y = "Cumulative Incidence",
    fill = "Legend")

Andorra has the highest cumulative incidence.
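With more than two hundred bars, the highest value is easier to read from the data than from the chart. This is a small sketch of how that could be done in base R; the `toy` data frame is made up, and excluding the "Global" aggregate before ranking is an assumption of the example rather than a step in the analysis above.

```r
# Made-up rows in the same shape as `covid`; "Global" is an aggregate row
toy <- data.frame(
  Name = c("Global", "A", "B", "C"),
  Cumulative_incidence = c(500, 120, 980, 45),
  stringsAsFactors = FALSE
)

# Leave out the aggregate so that only individual countries are ranked
countries <- toy[toy$Name != "Global", ]

# Country with the highest value
top_name <- countries$Name[which.max(countries$Cumulative_incidence)]
top_name

# Rows sorted from highest to lowest, for a quick ranked view
ranked <- countries[order(countries$Cumulative_incidence, decreasing = TRUE), ]
head(ranked, 3)
```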

Let's take a look by region

ggplot(covid, aes(as.factor(Region), Cumulative_incidence)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
  labs(title = "Regions and cumulative incidence")

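Summed across its countries, Europe has a higher cumulative incidence than any other region. Because geom_col() stacks the values of all countries in a region, each bar is a sum, so the comparison also reflects how many countries each region contains.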
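Since a region with many countries accumulates a taller bar, a per-country summary such as the median can be a useful companion to the total. The sketch below uses `aggregate()` on a made-up `toy` data frame to show how the two can rank regions differently; the numbers are illustrative only.

```r
# Made-up values: "Europe" has more countries, "Asia" has higher values
toy <- data.frame(
  Region = c("Europe", "Europe", "Europe", "Europe", "Asia", "Asia"),
  Cumulative_incidence = c(100, 200, 300, 400, 450, 500),
  stringsAsFactors = FALSE
)

# Total per region: this is what stacked bars add up to
region_sum <- aggregate(Cumulative_incidence ~ Region, data = toy, FUN = sum)

# Median per region: a typical country value, independent of group size
region_median <- aggregate(Cumulative_incidence ~ Region, data = toy, FUN = median)

region_sum
region_median
```

In this toy example Europe has the larger total while Asia has the larger median.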

Let's look at each region more closely

Countries in Africa

africa <- subset(covid, Region == "Africa")
africa$Highlight <- ifelse(africa$Cumulative_incidence == max(africa$Cumulative_incidence), "Highest", "Other")
ggplot(africa, aes(Name, Cumulative_incidence, fill = Highlight))+
  geom_col() +
  scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
  labs(title = "Countries in Africa and cumulative incidence")

The Republic of Congo has a higher cumulative incidence than any other country in Africa.

Let's look at the Americas

americas <- subset(covid, Region == "Americas")
ggplot(americas, aes(Name, Cumulative_incidence))+
  geom_col(fill = "steelblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
  labs(title = "Countries in the Americas and cumulative incidence")

The United States of America has the highest cumulative incidence in the Americas.

Let's look at Asia

asia <- subset(covid, Region == "Asia")
asia$Highlight <- ifelse(asia$Cumulative_incidence == max(asia$Cumulative_incidence), "Highest", "Other")
ggplot(asia, aes(Name, Cumulative_incidence, fill = Highlight))+
  geom_col() +
  scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
  labs(title = "Countries in Asia and cumulative incidence")

Georgia has the highest cumulative incidence in Asia.

Let's look at the Caribbean

caribbean <- subset(covid, Region == "Caribbean")
caribbean$Highlight <- ifelse(caribbean$Cumulative_incidence == max(caribbean$Cumulative_incidence), "Highest", "Other")
ggplot(caribbean, aes(Name, Cumulative_incidence, fill = Highlight))+
  geom_col() +
  scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 15))+
  labs(title = "Countries in Caribbean and cumulative incidence")

Saint Martin has the highest cumulative incidence in the Caribbean.

Let's look at Europe

europe <- subset(covid, Region == "Europe")
europe$Highlight <- ifelse(europe$Cumulative_incidence == max(europe$Cumulative_incidence), "Highest", "Other")
ggplot(europe, aes(Name, Cumulative_incidence, fill = Highlight))+
  geom_col() +
  scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
  labs(title = "Countries in Europe and cumulative incidence")

Andorra has the highest cumulative incidence in Europe.

Let's take a look at Oceania

oceania <- subset(covid, Region == "Oceania")
oceania$Highlight <- ifelse(oceania$Cumulative_incidence == max(oceania$Cumulative_incidence), "Highest", "Other")
ggplot(oceania, aes(Name, Cumulative_incidence, fill = Highlight))+
  geom_col() +
  scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 15))+
  labs(title = "Countries in Oceania and cumulative incidence")

French Polynesia has the highest incidence in Oceania.
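The region-by-region steps above repeat the same subset and highlight logic. One way to flag the top country of every region in a single pass is sketched here with base R's `ave()`; the `toy` data frame is made up for the example.

```r
# Made-up rows with the same columns used in the regional plots
toy <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Region = c("Africa", "Africa", "Asia", "Asia", "Asia"),
  Cumulative_incidence = c(10, 30, 5, 80, 20),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the maximum within that row's region
region_max <- ave(toy$Cumulative_incidence, toy$Region, FUN = max)

# Flag the top country in each region
toy$Highlight <- ifelse(toy$Cumulative_incidence == region_max, "Highest", "Other")

# One row per region: the country with the highest value
toy[toy$Highlight == "Highest", c("Region", "Name", "Cumulative_incidence")]
```

With a column like this on the full data, a single ggplot call faceted by region (for example with `facet_wrap(~ Region, scales = "free_x")`) might stand in for the separate per-region blocks, although the individual plots give more room for the country labels.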

Analyzing Cumulative Incidence of Covid-19 Across Countries