COVID-19 has been one of the most significant global health crises of our time. As the pandemic has spread across the world, it has affected millions of people, causing illness, death, and widespread disruption to daily life. Governments and public health organizations have responded with a range of measures, from lockdowns and travel restrictions to vaccination campaigns and public education efforts.
In this document, we explore a dataset of COVID-19 cases and deaths from around the world. Our goal is to gain insight into how the pandemic has affected different countries. We start by importing the dataset into R, then clean and reshape the data to prepare it for analysis. With a clean dataset, we generate visualizations and summary statistics to answer key questions about the pandemic, such as the total number of cases and deaths, how they are distributed across countries, which countries have the most active cases, and how the case fatality rate varies between countries.
data <- read.csv("/Users/paul/Downloads/covid_worldwide.csv", stringsAsFactors = FALSE)
We notice that some of the columns that should be numeric are currently stored as character strings. We use the as.integer() function to convert the Total.Cases and Total.Deaths columns to the integer data type.
data$Total.Cases <- as.integer(data$Total.Cases)
data$Total.Deaths <- as.integer(data$Total.Deaths)
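One caveat with this conversion: as.integer() returns NA (with a warning) for any string that is not a plain number, so values written with thousands separators would silently become missing. The sketch below uses a made-up vector, raw_counts, rather than the actual dataset, and assumes commas are the only stray characters.

```r
# Hypothetical character column; these values are illustrative, not from the dataset
raw_counts <- c("1403", "37,866", "N/A")

# Direct conversion: anything that is not a plain number becomes NA (with a warning)
direct <- suppressWarnings(as.integer(raw_counts))

# Stripping the commas first recovers the second value; "N/A" still becomes NA
cleaned <- suppressWarnings(as.integer(gsub(",", "", raw_counts)))

print(direct)
print(cleaned)
```

If the raw file did use such formatting, some of the missing values counted in the next step could come from the conversion itself rather than from the source data.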
We check for missing values in the dataset with sum(is.na(data)), which counts every missing cell across all rows and columns.
sum(is.na(data))
## [1] 48
We find that there are 48 missing values in the dataset. To ensure the integrity of our analysis, we remove rows with missing values using the na.omit() function.
data_clean <- na.omit(data)
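Before dropping rows, it can help to see which columns the missing values sit in, because na.omit() removes a whole row if any single field is missing. A minimal sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical data frame with missing values in two different columns
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Total_Cases = c(100L, NA, 300L, 400L),
  Total_Test = c(1000, 2000, NA, 4000),
  stringsAsFactors = FALSE
)

# Count missing values column by column
na_per_column <- colSums(is.na(toy))

# na.omit() drops every row that has at least one NA
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```

Here two single missing cells cost two full rows, which is worth keeping in mind when a column that is not needed for a given question has many gaps.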
We check the column types of our cleaned dataset to confirm that they are correctly formatted. We use the sapply() function to apply the class() function to each column of our data frame. We will also make the column names more uniform.
sapply(data_clean, class)
## Serial_Number Country Total.Cases Total.Deaths Total_Recovered
## "integer" "character" "integer" "integer" "numeric"
## Total_Test Population
## "numeric" "numeric"
# Replace "." with "_" in column names
colnames(data_clean) <- gsub("\\.", "_", colnames(data_clean))
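The double backslash in the pattern matters: in a regular expression an unescaped "." matches any character. The short sketch below shows the difference on made-up names; passing fixed = TRUE is an alternative that treats the pattern as literal text.

```r
# Illustrative column names mixing both separators
old_names <- c("Total.Cases", "Total.Deaths", "Total_Test")

# Escaped dot: only literal periods are replaced
renamed <- gsub("\\.", "_", old_names)

# Unescaped dot: every character matches, so everything is replaced
unescaped <- gsub(".", "_", "a.b")

# fixed = TRUE treats "." as a literal character, with no escaping needed
renamed_fixed <- gsub(".", "_", old_names, fixed = TRUE)

print(renamed)
print(unescaped)
```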
Now that we have cleaned and prepared the dataset, we can start exploring the data to gain insights into the pandemic. We will start by answering some basic questions about the dataset.
Let’s take a quick look at some basic information about the dataset:
# View the first few rows of the dataset
head(data_clean)
## Serial_Number Country Total_Cases Total_Deaths Total_Recovered Total_Test
## 1 1 USA 104196861 1132935 101322779 1159832679
## 2 2 India 44682784 530740 44150289 915265788
## 3 3 France 39524311 164233 39264546 271490188
## 4 4 Germany 37779833 165711 37398100 122332384
## 5 5 Brazil 36824580 697074 35919372 63776166
## 6 6 Japan 32588442 68399 21567425 92144639
## Population
## 1 334805269
## 2 1406631776
## 3 65584518
## 4 83883596
## 5 215353593
## 6 125584838
# View the dimensions of the dataset
dim(data_clean)
## [1] 195 7
# Check the data types of the columns
sapply(data_clean, class)
## Serial_Number Country Total_Cases Total_Deaths Total_Recovered
## "integer" "character" "integer" "integer" "numeric"
## Total_Test Population
## "numeric" "numeric"
# Check for missing values
sum(is.na(data_clean))
## [1] 0
# Summary
summary(data_clean)
## Serial_Number Country Total_Cases Total_Deaths
## Min. : 1.0 Length:195 Min. : 1403 Min. : 1
## 1st Qu.: 51.5 Class :character 1st Qu.: 37866 1st Qu.: 313
## Median :104.0 Mode :character Median : 297757 Median : 3155
## Mean :107.2 Mean : 3329258 Mean : 33787
## 3rd Qu.:162.0 3rd Qu.: 1723625 3rd Qu.: 16877
## Max. :225.0 Max. :104196861 Max. :1132935
## Total_Recovered Total_Test Population
## Min. : 438 Min. :7.850e+03 Min. :4.965e+03
## 1st Qu.: 34700 1st Qu.:4.010e+05 1st Qu.:1.100e+06
## Median : 288991 Median :2.610e+06 Median :6.845e+06
## Mean : 3197261 Mean :3.375e+07 Mean :3.207e+07
## 3rd Qu.: 1708095 3rd Qu.:1.477e+07 3rd Qu.:2.783e+07
## Max. :101322779 Max. :1.160e+09 Max. :1.407e+09
The head() function shows the first few rows of the dataset, giving us an idea of the structure of the data. The dim() function tells us how many rows and columns are in the dataset, and sapply() with class() shows the data type of each column. The sum(is.na()) call confirms that no missing values remain, and summary() reports the range, quartiles, and mean of each numeric column.
One of the most basic questions we can ask about the pandemic is: how many cases and deaths have occurred worldwide? We can use the sum() function to calculate the total number of cases and deaths:
# Calculate the total number of cases and deaths
total_cases <- sum(data_clean$Total_Cases)
total_deaths <- sum(data_clean$Total_Deaths)
# Print the results
cat("Total number of cases worldwide:", total_cases, "\n")
## Total number of cases worldwide: 649205332
cat("Total number of deaths worldwide:", total_deaths, "\n")
## Total number of deaths worldwide: 6588473
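These sums work because the worldwide totals stay below R's integer limit of 2,147,483,647. When an integer sum exceeds that limit, sum() returns NA with a warning rather than a number; converting to numeric first avoids this. A small illustration, independent of the dataset:

```r
# Two integers whose sum exceeds the largest representable integer
big <- c(.Machine$integer.max, 1L)

# Integer sum overflows: R returns NA and issues a warning
overflowed <- suppressWarnings(sum(big))

# Converting to double before summing gives the expected total
safe_total <- sum(as.numeric(big))

print(overflowed)
print(safe_total)
```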
Next, let’s look at the distribution of cases and deaths by country. We can use a bar chart to visualize the number of cases and deaths in each country:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# install.packages("ggthemes")  # run once in the console if ggthemes is not installed
library(ggthemes)
# Create a bar chart of the top 10 countries with the most cases
top_10_cases <- data_clean %>%
arrange(desc(Total_Cases)) %>%
head(10) %>%
arrange(Total_Cases)
ggplot(top_10_cases, aes(x = reorder(Country, Total_Cases), y = Total_Cases, fill = Country)) +
geom_bar(stat = "identity") +
ggtitle("Top 10 Countries by Total Cases") +
xlab("Country") +
ylab("Total Cases") +
theme_economist() +
theme(legend.position = "none")
# Create a bar chart of the top 10 countries with the most deaths
top_10_deaths <- data_clean %>%
arrange(desc(Total_Deaths)) %>%
head(10) %>%
arrange(Total_Deaths)
ggplot(top_10_deaths, aes(x = reorder(Country, Total_Deaths), y = Total_Deaths, fill = Country)) +
geom_bar(stat = "identity") +
ggtitle("Top 10 Countries by Total Deaths") +
xlab("Country") +
ylab("Total Deaths") +
theme_economist() +
theme(legend.position = "none")
These charts give us a quick overview of the countries that have been hit hardest by the pandemic. We can see that the United States has had the most cases and deaths by far, followed by India, Brazil, and several European countries.
Finally, let’s create a geographical plot of the distribution of deaths around the world. We can use the ggplot2 and maps packages to draw a choropleth world map in which each country is shaded according to its number of deaths:
library(maps)
# Aggregate the data by country
data_agg <- data_clean %>%
group_by(Country) %>%
summarise(Total_Deaths = sum(Total_Deaths)) %>%
ungroup()
# Create the world map
world_map <- map_data("world")
# Join the COVID-19 data with the world map data
# (named map_covid so it does not mask ggplot2's map_data() function)
map_covid <- left_join(world_map, data_agg, by = c("region" = "Country"))
# Create the choropleth map
deaths_map <- ggplot(map_covid, aes(x = long, y = lat, group = group, fill = Total_Deaths)) +
  geom_polygon(color = "black", linewidth = 0.2) +
  scale_fill_gradient(low = "#f1eef6", high = "#b30000", guide = "colorbar", labels = scales::label_comma()) +
  theme_void() +
  theme(panel.background = element_rect(fill = "#f8f8f8", color = NA),
        legend.background = element_rect(fill = "#f8f8f8", color = NA),
        legend.key.size = unit(0.4, "cm"),
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 12, face = "bold"),
        plot.margin = unit(c(0, 0, 0, 0), "cm"),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
# Add labels and title
deaths_map <- deaths_map +
labs(title = "COVID-19 Deaths by Country",
subtitle = "Data as of April 1st, 2023",
fill = "Deaths",
caption = "Source: Our World in Data") +
theme(plot.title = element_text(size = 20, face = "bold", margin = margin(0, 0, 0.5, 0)),
plot.subtitle = element_text(size = 16, margin = margin(0, 0, 0.5, 0)),
plot.caption = element_text(hjust = 0, size = 10, margin = margin(1, 0, 0, 0)))
# Display the map
deaths_map
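A join on country names only fills in countries whose spelling is identical in both tables; any region without a match gets NA and is drawn in the scale's missing-value colour. One way to spot mismatches is to compare the two name sets before joining. The vectors and the recode lookup below are invented for illustration and do not reproduce the actual names in either source.

```r
# Hypothetical country names from a COVID table and from a map table
covid_names <- c("USA", "India", "S. Korea", "Czechia")
map_regions <- c("USA", "India", "South Korea", "Czech Republic", "Greenland")

# Names present in the COVID table but absent from the map
unmatched <- setdiff(covid_names, map_regions)

# Hypothetical lookup that maps the mismatched spellings to the map's spellings
recode <- c("S. Korea" = "South Korea", "Czechia" = "Czech Republic")
harmonised <- ifelse(covid_names %in% names(recode), recode[covid_names], covid_names)

print(unmatched)
print(setdiff(harmonised, map_regions))
```

With the real data, the same check would compare data_agg$Country against unique(world_map$region).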
To find the top 5 countries in active cases, we need to calculate the number of active cases for each country. We can do this by subtracting the number of total deaths and total recoveries from the total cases.
library(dplyr)
# Calculate active cases for each country
data_clean$Active_Cases <- data_clean$Total_Cases - data_clean$Total_Deaths - data_clean$Total_Recovered
# Get top 5 countries with highest active cases
top_active_cases <- data_clean %>%
select(Country, Active_Cases) %>%
arrange(desc(Active_Cases)) %>%
head(5)
# Format numbers with comma separator
top_active_cases$Active_Cases <- format(top_active_cases$Active_Cases, big.mark = ",", scientific = FALSE)
# Print the table with header row
knitr::kable(top_active_cases, align = "c", col.names = c("Country", "Active Cases"), caption = "Top 5 Countries with Highest Active Cases")
| Country | Active Cases |
|---|---|
| Japan | 10,952,618 |
| USA | 1,741,147 |
| Poland | 925,549 |
| Vietnam | 870,843 |
| Mexico | 429,421 |
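As a quick check of the formula, the two largest entries can be reproduced by hand from the figures in the head() output shown earlier. The vector names below are introduced only for this sketch.

```r
# Figures copied from the head() output for Japan and the USA
cases <- c(Japan = 32588442, USA = 104196861)
deaths <- c(Japan = 68399, USA = 1132935)
recovered <- c(Japan = 21567425, USA = 101322779)

# Active cases = total cases - total deaths - total recovered
active <- cases - deaths - recovered

# Same comma formatting as used for the table
print(format(active, big.mark = ",", scientific = FALSE))
```

Both results agree with the table: 10,952,618 for Japan and 1,741,147 for the USA.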
The case fatality rate (CFR) is a useful metric for understanding the severity of a disease outbreak. It is the proportion of confirmed cases that end in death, which makes it an important measure of the disease's impact on public health. By comparing the CFR across different countries, we can get a sense of how well different health systems are coping with the pandemic and identify areas where more resources may be needed to prevent further deaths.
To compare countries, we calculate the CFR for each one by dividing the total deaths by the total cases and multiplying by 100 to get a percentage. We then create a bar chart of the 30 countries with the highest CFR.
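As a worked example of the calculation, the sketch below applies the formula to two countries using the totals from the head() output shown earlier. The vector names are introduced only for this illustration.

```r
# Totals copied from the head() output for the USA and Brazil
cases <- c(USA = 104196861, Brazil = 36824580)
deaths <- c(USA = 1132935, Brazil = 697074)

# CFR as a percentage: deaths divided by cases, times 100
cfr_percent <- deaths / cases * 100

print(round(cfr_percent, 2))
```

Both values come out between 1% and 2%, which gives a sense of scale for reading the chart.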
# Load the required packages
library(ggplot2)
# Calculate CFR (as a percentage) for each country
cfr_data <- data_clean %>%
  group_by(Country) %>%
  summarise(Total_Cases = sum(Total_Cases), Total_Deaths = sum(Total_Deaths)) %>%
  mutate(CFR = Total_Deaths / Total_Cases * 100)
# Select top 30 countries by CFR
top30_cfr <- cfr_data %>%
  arrange(desc(CFR)) %>%
  head(30)
# Create CFR plot with economist theme
cfr_plot <- ggplot(top30_cfr, aes(x = reorder(Country, CFR), y = CFR)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(title = "Case Fatality Rates by Country (Top 30)",
       subtitle = "Data as of April 1st, 2023",
       x = "Country",
       y = "Case Fatality Rate (%)",
       caption = "Source: Our World in Data") +
  theme_economist(base_size = 12, base_family = "Arial") +
  theme(plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 16),
        plot.caption = element_text(hjust = 0, size = 10),
        axis.text.y = element_text(size = 10),
        axis.title.y = element_text(size = 12),
        axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
# Display the plot
cfr_plot
From the bar chart, we can see that Yemen has the highest CFR, followed by Peru, Mexico, Syria, and Brazil. It’s important to note that the CFR is not a perfect measure of the severity of COVID-19, as it can be influenced by many factors such as the age distribution of the population, healthcare capacity, and testing availability.
In summary, this document explores a dataset of COVID-19 cases and deaths from around the world and uses R to clean, manipulate, and analyze the data. It presents several key findings, including the total number of cases and deaths worldwide, the distribution of cases and deaths by country, the countries with the most active cases, and the case fatality rate (CFR) across different countries.
Notable findings include that the United States has the highest number of cases and deaths by far, followed by India and Brazil, and that Yemen has the highest CFR, followed by Peru, Mexico, Syria, and Brazil. Several visualizations support these findings: bar charts of the top 10 countries by cases and by deaths, a choropleth map of deaths around the world, and a bar chart comparing the CFR across countries.
Taken together, these results illustrate the global impact of the COVID-19 pandemic and highlight the importance of continued efforts to control and mitigate the spread of the virus.
Thank you for reading.
Paul Carmody