Akanni, Samuel Ifeoluwa
2024-11-26
COVID-19 has significantly influenced global health and socio-economic structures since its outbreak in late 2019. Analyzing country-specific cumulative incidence data helps reveal patterns and disparities in the spread and impact of the virus.
Objective
To demonstrate a structured data analytics process, including:
Data Source:
International Centre for Mathematical Modelling and Data Analytics
Significance:
This project provides insights into country-level trends and highlights the importance of data-driven strategies in pandemic management and public health planning.
Import the data set
setwd("C:/Users/USER PC/Documents/GIT/Data Analytics with Prof/data sets")
covid <- read.csv("covid_incidence.csv")
head(covid)
## X Name Cumulative_incidence
## 1 1 Global 11220.31
## 2 2 United States of America 64796.07
## 3 3 India 7545.93
## 4 4 Brazil 37456.22
## 5 5 Russian Federation 23154.94
## 6 6 The United Kingdom 43565.32
## X Name Cumulative_incidence
## Min. : 1.00 Length:238 Min. : 0.0
## 1st Qu.: 60.25 Class :character 1st Qu.: 878.5
## Median :119.50 Mode :character Median : 6853.4
## Mean :119.50 Mean : 16482.5
## 3rd Qu.:178.75 3rd Qu.: 26727.9
## Max. :238.00 Max. :109868.6
## NA's :1
## [1] 238 3
## [1] 1
## X Name Cumulative_incidence
## 187 187 Other NA
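The outputs above are consistent with a standard inspection sequence in base R. The chunk that produced them is not shown in this report, so the following is only a sketch of what such a sequence might look like. `covid_demo` is a hypothetical stand-in built from the rows printed in this report, not the full data set.

```r
# Hypothetical stand-in for the full data set: the six rows printed by head()
# plus the "Other" row that carries the missing value.
covid_demo <- data.frame(
  X = c(1, 2, 3, 4, 5, 6, 187),
  Name = c("Global", "United States of America", "India", "Brazil",
           "Russian Federation", "The United Kingdom", "Other"),
  Cumulative_incidence = c(11220.31, 64796.07, 7545.93, 37456.22,
                           23154.94, 43565.32, NA),
  stringsAsFactors = FALSE
)

summary(covid_demo)   # per-column summary, including the NA count
dim(covid_demo)       # number of rows and columns

# Count the missing values, then show the rows that contain them
n_missing <- sum(is.na(covid_demo))
missing_rows <- covid_demo[!complete.cases(covid_demo), ]
n_missing
missing_rows
```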
The row with the missing value is labelled "Other" rather than a country name. It appears to combine countries that are not documented individually, so I will remove this row as part of cleaning the data.
## [1] 237 3
## X Name Cumulative_incidence
## Min. : 1.0 Length:237 Min. : 0.0
## 1st Qu.: 60.0 Class :character 1st Qu.: 878.5
## Median :119.0 Mode :character Median : 6853.4
## Mean :119.2 Mean : 16482.5
## 3rd Qu.:178.0 3rd Qu.: 26727.9
## Max. :238.0 Max. :109868.6
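The removal step is not shown in this report. A minimal sketch of one way to do it, assuming the row is identified by its missing `Cumulative_incidence` value, is given below. `covid_demo` and `covid_clean` are hypothetical names used only for illustration.

```r
# Hypothetical stand-in: two rows from the report plus the "Other" row
covid_demo <- data.frame(
  X = c(1, 2, 187),
  Name = c("Global", "United States of America", "Other"),
  Cumulative_incidence = c(11220.31, 64796.07, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows where cumulative incidence is recorded
covid_clean <- covid_demo[!is.na(covid_demo$Cumulative_incidence), ]
dim(covid_clean)
```

The `Cumulative_incidence` statistics are identical before and after the removal because `summary()` already excludes missing values when computing them; only the row count and the `X` column change.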
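The distribution of cumulative incidence is skewed to the right: the mean (16482.5) is well above the median (6853.4). The outliers are valid observations rather than errors, since each value is the incidence recorded for a different country, and the incidence in one country cannot be assumed to depend on the incidence in another.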
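The skew and the presence of high outliers can be checked directly from the summary statistics printed above. The sketch below uses the common 1.5 × IQR rule as the outlier threshold; that rule is an assumption chosen for illustration, not necessarily the criterion used in this analysis.

```r
# Summary statistics for Cumulative_incidence, copied from the output above
q1           <- 878.5
median_value <- 6853.4
mean_value   <- 16482.5
q3           <- 26727.9
max_value    <- 109868.6

# A mean well above the median points to a right-skewed distribution
right_skewed <- mean_value > median_value

# Upper fence under the 1.5 * IQR convention (an illustrative choice)
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

# The maximum lies above the fence, so at least one high outlier exists
has_high_outlier <- max_value > upper_fence

right_skewed
upper_fence
has_high_outlier
```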
library(countrycode)
## Warning: package 'countrycode' was built under R version 4.4.2
# Map countries to regions using the 'continent' field
covid$Region <- countrycode(sourcevar = covid$Name,
origin = "country.name",
destination = "continent")
## Warning: Some values were not matched unambiguously: Bonaire, Global, Kosovo[1], Saba, Saint Martin, Sint Eustatius
## [1] "Global" "Kosovo[1]" "Saint Martin" "Bonaire"
## [5] "Sint Eustatius" "Saba"
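The list of unmatched names above can be recovered from the data itself, because unmatched names are left as `NA` in the new column. The chunk that printed the list is not shown, so this is a sketch of one possible approach. `covid_demo` is a hypothetical stand-in whose `Region` values were entered by hand to mimic that result.

```r
# Hypothetical stand-in: Region holds NA wherever a name could not be matched
covid_demo <- data.frame(
  Name = c("Global", "India", "Kosovo[1]", "Brazil", "Saba"),
  Region = c(NA, "Asia", NA, "Americas", NA),
  stringsAsFactors = FALSE
)

# Names whose region is still missing after the mapping step
unmatched <- unique(covid_demo$Name[is.na(covid_demo$Region)])
unmatched
```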
# Custom mapping for unmatched values
custom_regions <- data.frame(
Name = c("Bonaire", "Global", "Kosovo[1]", "Saba", "Saint Martin", "Sint Eustatius"),
Region = c("Caribbean", "Global", "Europe", "Caribbean", "Caribbean", "Caribbean")
)
# Merge custom mapping with original dataset
library(tidyverse)
covid <- covid %>%
left_join(custom_regions, by = "Name", suffix = c("", ".custom"))
# Fill in missing regions
covid$Region <- ifelse(is.na(covid$Region), covid$Region.custom, covid$Region)
covid <- covid %>% select(-Region.custom)
head(covid)
## X Name Cumulative_incidence Region
## 1 1 Global 11220.31 Global
## 2 2 United States of America 64796.07 Americas
## 3 3 India 7545.93 Asia
## 4 4 Brazil 37456.22 Americas
## 5 5 Russian Federation 23154.94 Europe
## 6 6 The United Kingdom 43565.32 Europe
Now we have an extra column that shows the region of each country.
Let's view the cumulative incidence for each country.
covid$Highlight <- ifelse(covid$Cumulative_incidence == max(covid$Cumulative_incidence), "Highest", "Other")
ggplot(covid, aes(as.factor(Name), Cumulative_incidence, fill = Highlight)) +
geom_col() +
scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8))+
labs(title = "Countries and cumulative incidence",x = "Country", y = "Cumulative Incidence",
fill = "Legend")
Andorra has the highest cumulative incidence.
ggplot(covid, aes(as.factor(Region), Cumulative_incidence)) +
geom_col(fill = "steelblue") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
labs(title = "Regions and cumulative incidence")
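Because geom_col() stacks the values of all countries in a region, each bar shows the sum of the country-level cumulative incidences. On that measure, Europe is higher than the other regions.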
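A regional total depends on how many countries fall in each region, so it may help to look at the mean or median per region alongside it. The sketch below is a suggestion rather than part of the original analysis. `covid_demo` is a hypothetical stand-in built from the rows printed earlier, with the "Global" row left out because it is not a region.

```r
# Hypothetical stand-in: five country rows from the head() output above
covid_demo <- data.frame(
  Name = c("United States of America", "India", "Brazil",
           "Russian Federation", "The United Kingdom"),
  Cumulative_incidence = c(64796.07, 7545.93, 37456.22, 23154.94, 43565.32),
  Region = c("Americas", "Asia", "Americas", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Split the incidence values by region, then summarise each group
by_region <- split(covid_demo$Cumulative_incidence, covid_demo$Region)
region_summary <- data.frame(
  Region      = names(by_region),
  n_countries = sapply(by_region, length),
  total       = sapply(by_region, sum),
  mean        = sapply(by_region, mean),
  median      = sapply(by_region, median),
  row.names   = NULL,
  stringsAsFactors = FALSE
)
region_summary
```

Plotting the `mean` or `median` column instead of the stacked total would compare typical country-level incidence across regions rather than the number of countries in each.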
Countries in Africa
africa <- subset(covid, Region == "Africa")
africa$Highlight <- ifelse(africa$Cumulative_incidence == max(africa$Cumulative_incidence), "Highest", "Other")
ggplot(africa, aes(Name, Cumulative_incidence, fill = Highlight))+
geom_col() +
scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
labs(title = "Countries in Africa and cumulative incidence")
The Republic of Congo has the highest cumulative incidence among countries in Africa.
americas <- subset(covid, Region == "Americas")
ggplot(americas, aes(Name, Cumulative_incidence))+
geom_col(fill = "steelblue") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
labs(title = "Countries in the Americas and cumulative incidence")
The United States of America has the highest cumulative incidence in the Americas.
asia <- subset(covid, Region == "Asia")
asia$Highlight <- ifelse(asia$Cumulative_incidence == max(asia$Cumulative_incidence), "Highest", "Other")
ggplot(asia, aes(Name, Cumulative_incidence, fill = Highlight))+
geom_col() +
scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
labs(title = "Countries in Asia and cumulative incidence")
Georgia has the highest cumulative incidence in Asia.
caribbean <- subset(covid, Region == "Caribbean")
caribbean$Highlight <- ifelse(caribbean$Cumulative_incidence == max(caribbean$Cumulative_incidence), "Highest", "Other")
ggplot(caribbean, aes(Name, Cumulative_incidence, fill = Highlight))+
geom_col() +
scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 15))+
labs(title = "Countries in Caribbean and cumulative incidence")
Saint Martin has the highest cumulative incidence in the Caribbean.
europe <- subset(covid, Region == "Europe")
europe$Highlight <- ifelse(europe$Cumulative_incidence == max(europe$Cumulative_incidence), "Highest", "Other")
ggplot(europe, aes(Name, Cumulative_incidence, fill = Highlight))+
geom_col() +
scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 10))+
labs(title = "Countries in Europe and cumulative incidence")
Andorra has the highest cumulative incidence in Europe.
oceania <- subset(covid, Region == "Oceania")
oceania$Highlight <- ifelse(oceania$Cumulative_incidence == max(oceania$Cumulative_incidence), "Highest", "Other")
ggplot(oceania, aes(Name, Cumulative_incidence, fill = Highlight))+
geom_col() +
scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 15))+
labs(title = "Countries in Oceania and cumulative incidence")
French Polynesia has the highest incidence in Oceania.
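The regional charts above repeat the same steps: subset by region, flag the highest value, and draw a bar chart. A minimal sketch of how those steps could be wrapped in reusable functions follows. `flag_highest()` and `plot_region()` are hypothetical helpers, and `covid_demo` is a hypothetical stand-in built from rows printed earlier in this report.

```r
library(ggplot2)

# Hypothetical helper: mark the row(s) with the largest cumulative incidence
flag_highest <- function(data) {
  top <- max(data$Cumulative_incidence, na.rm = TRUE)
  data$Highlight <- ifelse(data$Cumulative_incidence == top, "Highest", "Other")
  data
}

# Hypothetical helper: bar chart of cumulative incidence for one region
plot_region <- function(data, region, label_size = 10) {
  region_data <- flag_highest(data[which(data$Region == region), ])
  ggplot(region_data, aes(Name, Cumulative_incidence, fill = Highlight)) +
    geom_col() +
    scale_fill_manual(values = c("Highest" = "red", "Other" = "steelblue")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1,
                                     size = label_size)) +
    labs(title = paste("Countries in", region, "and cumulative incidence"))
}

# Hypothetical stand-in data, taken from the head() output earlier
covid_demo <- data.frame(
  Name = c("United States of America", "India", "Brazil",
           "Russian Federation", "The United Kingdom"),
  Cumulative_incidence = c(64796.07, 7545.93, 37456.22, 23154.94, 43565.32),
  Region = c("Americas", "Asia", "Americas", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

europe_flagged <- flag_highest(covid_demo[covid_demo$Region == "Europe", ])
europe_plot <- plot_region(covid_demo, "Europe")
```

With the full data set, a call such as `plot_region(covid, "Oceania", label_size = 15)` would stand in for one of the repeated blocks above.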