Introduction

COVID-19 has been one of the most significant global health crises of our time. As the pandemic has spread across the world, it has affected millions of people, causing illness, death, and widespread disruption to daily life. Governments and public health organizations have responded with a range of measures, from lockdowns and travel restrictions to vaccination campaigns and public education efforts.

In this document, we explore a dataset of COVID-19 cases and deaths from around the world. Our goal is to gain insight into how the pandemic has affected different countries. We start by importing the dataset into RStudio, then clean and reshape the data to prepare it for analysis. Once we have a clean dataset, we generate visualizations and summary statistics to answer key questions about the pandemic, such as the total number of cases and deaths, how they are distributed across countries, and how the case fatality rate varies between countries.

Import data:

data <- read.csv("/Users/paul/Downloads/covid_worldwide.csv", stringsAsFactors = FALSE)

Data Preparation

Changing Column Types

We notice that some of the columns that should be numeric are currently stored as character strings. We use the as.integer() function to convert the Total.Cases and Total.Deaths columns to the integer data type. Any value that cannot be parsed as a number becomes NA.

data$Total.Cases <- as.integer(data$Total.Cases)
data$Total.Deaths <- as.integer(data$Total.Deaths)
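A common reason for a numeric column being read as character is formatting such as thousands separators. The following is a minimal sketch, not the method used in this document: the helper name to_integer_clean and the sample strings are hypothetical, and it assumes the problem values contain commas or stray spaces. It strips those characters before converting and counts the values that still cannot be parsed.

```r
# Hypothetical helper: strip thousands separators and whitespace, then convert.
to_integer_clean <- function(x) {
  cleaned <- gsub("[,[:space:]]", "", x)   # "1,234" -> "1234", " 56 " -> "56"
  # as.integer() returns NA for anything that is still not a number;
  # the warning is suppressed because the NAs are counted explicitly below
  suppressWarnings(as.integer(cleaned))
}

# Made-up sample values for illustration only
raw <- c("1,234", " 56 ", "N/A", "")
converted <- to_integer_clean(raw)
converted

# Number of values that could not be converted
sum(is.na(converted))
```

Counting the NA values immediately after the conversion makes it easier to tell whether missing values were already in the file or were introduced by the conversion itself.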

Removing Missing Values

We check for missing values by combining is.na() with sum(): sum(is.na(data)) returns the total number of missing values across all columns of the dataset.

sum(is.na(data))
## [1] 48

We find that there are 48 missing values in the dataset. To ensure the integrity of our analysis, we remove them with the na.omit() function, which drops every row that contains at least one missing value.

data_clean <- na.omit(data)
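Because na.omit() works row by row, it can help to see which columns hold the missing values and how many rows will be dropped. The sketch below uses a made-up toy data frame (the values are not from the real dataset) and base R only.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  Country      = c("A", "B", "C", "D"),
  Total_Cases  = c(100L, NA, 300L, 400L),
  Total_Deaths = c(1L, 2L, NA, 4L),
  Total_Test   = c(NA, NA, 3000, 4000)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Rows that na.omit() would keep: those with no NA in any column
rows_kept    <- sum(complete.cases(toy))
rows_dropped <- nrow(toy) - rows_kept
cat("Rows kept:", rows_kept, " Rows dropped:", rows_dropped, "\n")
```

In this toy example only four values are missing, yet three of the four rows are dropped, because a row is removed as soon as any one of its columns is NA.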

Checking Column Types

We check the column types of our cleaned dataset to confirm that they are correctly formatted. We use the sapply() function to apply the class() function to each column of our data frame. We also make the column names more uniform by replacing the periods in Total.Cases and Total.Deaths with underscores.

sapply(data_clean, class)
##   Serial_Number         Country     Total.Cases    Total.Deaths Total_Recovered 
##       "integer"     "character"       "integer"       "integer"       "numeric" 
##      Total_Test      Population 
##       "numeric"       "numeric"
# Replace "." with "_" in column names
colnames(data_clean) <- gsub("\\.", "_", colnames(data_clean))
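The double backslash in the pattern matters because an unescaped period is a regular-expression wildcard. A short sketch on a plain character vector (a stand-in for the column names) shows the difference, along with the fixed = TRUE alternative.

```r
cols <- c("Serial_Number", "Total.Cases", "Total.Deaths")

# Unescaped "." matches any character, so every character is replaced
wildcard_result <- gsub(".", "_", cols[2])
wildcard_result

# Escaping the dot matches a literal period only
renamed_regex <- gsub("\\.", "_", cols)

# fixed = TRUE treats the pattern as a plain string, so no escaping is needed
renamed_fixed <- gsub(".", "_", cols, fixed = TRUE)
renamed_fixed
```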

Explore Data

Now that we have cleaned and prepared the dataset, we can start exploring the data to gain insights into the pandemic. We will start by answering some basic questions about the dataset.

Basic Information

Let’s take a quick look at some basic information about the dataset:

# View the first few rows of the dataset
head(data_clean)
##   Serial_Number Country Total_Cases Total_Deaths Total_Recovered Total_Test
## 1             1     USA   104196861      1132935       101322779 1159832679
## 2             2   India    44682784       530740        44150289  915265788
## 3             3  France    39524311       164233        39264546  271490188
## 4             4 Germany    37779833       165711        37398100  122332384
## 5             5  Brazil    36824580       697074        35919372   63776166
## 6             6   Japan    32588442        68399        21567425   92144639
##   Population
## 1  334805269
## 2 1406631776
## 3   65584518
## 4   83883596
## 5  215353593
## 6  125584838
# View the dimensions of the dataset
dim(data_clean)
## [1] 195   7
# Check the data types of the columns
sapply(data_clean, class)
##   Serial_Number         Country     Total_Cases    Total_Deaths Total_Recovered 
##       "integer"     "character"       "integer"       "integer"       "numeric" 
##      Total_Test      Population 
##       "numeric"       "numeric"
# Check for missing values
sum(is.na(data_clean))
## [1] 0
# Summary
summary(data_clean)
##  Serial_Number     Country           Total_Cases         Total_Deaths    
##  Min.   :  1.0   Length:195         Min.   :     1403   Min.   :      1  
##  1st Qu.: 51.5   Class :character   1st Qu.:    37866   1st Qu.:    313  
##  Median :104.0   Mode  :character   Median :   297757   Median :   3155  
##  Mean   :107.2                      Mean   :  3329258   Mean   :  33787  
##  3rd Qu.:162.0                      3rd Qu.:  1723625   3rd Qu.:  16877  
##  Max.   :225.0                      Max.   :104196861   Max.   :1132935  
##  Total_Recovered       Total_Test          Population       
##  Min.   :      438   Min.   :7.850e+03   Min.   :4.965e+03  
##  1st Qu.:    34700   1st Qu.:4.010e+05   1st Qu.:1.100e+06  
##  Median :   288991   Median :2.610e+06   Median :6.845e+06  
##  Mean   :  3197261   Mean   :3.375e+07   Mean   :3.207e+07  
##  3rd Qu.:  1708095   3rd Qu.:1.477e+07   3rd Qu.:2.783e+07  
##  Max.   :101322779   Max.   :1.160e+09   Max.   :1.407e+09

The head() function shows the first few rows of the dataset, giving us an idea of the structure of the data. The dim() function tells us how many rows and columns are in the dataset (195 rows and 7 columns here), and sapply() with class() shows the data type of each column. The sum(is.na()) call confirms that no missing values remain, and summary() reports the range and quartiles of each numeric column.

Number of Cases and Deaths

One of the most basic questions we can ask about the pandemic is: how many cases and deaths have occurred worldwide? We can use the sum() function to total the cases and deaths across the 195 countries that remain after cleaning:

# Calculate the total number of cases and deaths
total_cases <- sum(data_clean$Total_Cases)
total_deaths <- sum(data_clean$Total_Deaths)

# Print the results
cat("Total number of cases worldwide:", total_cases, "\n")
## Total number of cases worldwide: 649205332
cat("Total number of deaths worldwide:", total_deaths, "\n")
## Total number of deaths worldwide: 6588473
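The totals above fit within the integer range in R, but Total_Cases and Total_Deaths are stored as integers, and sum() on an integer vector returns NA with a warning once the result exceeds .Machine$integer.max (2147483647). The sketch below uses two made-up counts to show the behaviour and one way around it; it is an illustration, not a change to the analysis.

```r
# Two made-up integer counts whose sum exceeds .Machine$integer.max
counts <- c(2000000000L, 500000000L)

# Integer sum overflows: R returns NA and emits a warning (suppressed here)
int_total <- suppressWarnings(sum(counts))
is.na(int_total)

# Converting to double before summing avoids the overflow
dbl_total <- sum(as.numeric(counts))
format(dbl_total, big.mark = ",", scientific = FALSE)
```

Wrapping the column in as.numeric() before summing is a cheap safeguard if the same code is later run on larger counts.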

Distribution of Cases and Deaths by Country

Next, let’s look at the distribution of cases and deaths by country. We can use bar charts to visualize the 10 countries with the most cases and the 10 countries with the most deaths:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
# Install ggthemes only if it is not already available
if (!requireNamespace("ggthemes", quietly = TRUE)) install.packages("ggthemes")
library(ggthemes)
# Create a bar chart of the top 10 countries with the most cases
top_10_cases <- data_clean %>% 
  arrange(desc(Total_Cases)) %>% 
  head(10) %>% 
  arrange(Total_Cases)

ggplot(top_10_cases, aes(x = reorder(Country, Total_Cases), y = Total_Cases, fill = Country)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 10 Countries by Total Cases") +
  xlab("Country") +
  ylab("Total Cases") +
  theme_economist() +
  theme(legend.position = "none")

# Create a bar chart of the top 10 countries with the most deaths
top_10_deaths <- data_clean %>% 
  arrange(desc(Total_Deaths)) %>% 
  head(10) %>% 
  arrange(Total_Deaths)

ggplot(top_10_deaths, aes(x = reorder(Country, Total_Deaths), y = Total_Deaths, fill = Country)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 10 Countries by Total Deaths") +
  xlab("Country") +
  ylab("Total Deaths") +
  theme_economist() +
  theme(legend.position = "none")

These charts give us a quick overview of the countries with the largest reported totals. The United States has had by far the most cases and deaths. By cases, it is followed by India, France, Germany, and Brazil, while Brazil and India have reported far more deaths than France or Germany.

Geographical Distribution of Deaths

Next, let’s create a geographical plot of the distribution of deaths around the world. We can use the ggplot2 and maps packages to create a choropleth map in which each country is shaded according to its number of deaths:

library(maps)

# Aggregate the data by country
data_agg <- data_clean %>% 
  group_by(Country) %>% 
  summarise(Total_Deaths = sum(Total_Deaths)) %>% 
  ungroup()

# Create the world map
world_map <- map_data("world")

# Join the COVID-19 data with the world map data
# (stored as map_df to avoid confusion with the ggplot2::map_data() function)
map_df <- left_join(world_map, data_agg, by = c("region" = "Country"))

# Create the choropleth map
deaths_map <- ggplot(map_df, aes(x = long, y = lat, group = group, fill = Total_Deaths)) +
  geom_polygon(color = "black", linewidth = 0.2) +
  scale_fill_gradient(low = "#f1eef6", high = "#b30000", guide = "colorbar", labels = scales::comma_format()) +
  theme_void() +
  theme(panel.background = element_rect(fill = "#f8f8f8", color = NA),
        legend.background = element_rect(fill = "#f8f8f8", color = NA),
        legend.key.size = unit(0.4, "cm"),
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 12, face = "bold"),
        plot.margin = unit(c(0, 0, 0, 0), "cm"),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
# Add labels and title
deaths_map <- deaths_map +
  labs(title = "COVID-19 Deaths by Country",
       subtitle = "Data as of April 1st, 2023",
       fill = "Deaths",
       caption = "Source: Our World in Data") +
  theme(plot.title = element_text(size = 20, face = "bold", margin = margin(0, 0, 0.5, 0)),
        plot.subtitle = element_text(size = 16, margin = margin(0, 0, 0.5, 0)),
        plot.caption = element_text(hjust = 0, size = 10, margin = margin(1, 0, 0, 0)))

# Display the map
deaths_map
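The join above matches countries by exact name, so any country whose name in the dataset differs from the region name in the map data is left without a value on the map. The sketch below shows one way to find and fix such mismatches. All names in it are hypothetical examples; in practice the two vectors would come from unique(world_map$region) and data_agg$Country, and the lookup table would be built from whatever mismatches are found.

```r
# Hypothetical example names, for illustration only
map_regions    <- c("USA", "UK", "South Korea", "Czech Republic", "India")
data_countries <- c("USA", "UK", "S. Korea", "Czechia", "India")

# Countries in the data that have no map region of the same name
unmatched <- setdiff(data_countries, map_regions)
unmatched

# A small lookup table (also hypothetical) to harmonise names before joining
name_fixes <- c("S. Korea" = "South Korea", "Czechia" = "Czech Republic")
harmonised <- ifelse(data_countries %in% names(name_fixes),
                     name_fixes[data_countries],
                     data_countries)

# After harmonising, no names should remain unmatched
remaining <- setdiff(harmonised, map_regions)
length(remaining)
```

Running the setdiff() check before the join gives a concrete list of names to fix instead of discovering the gaps visually on the finished map.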

Top 5 Countries by Active Cases

To find the top 5 countries in active cases, we need to calculate the number of active cases for each country. We can do this by subtracting the number of total deaths and total recoveries from the total cases.

library(dplyr)

# Calculate active cases for each country
data_clean$Active_Cases <- data_clean$Total_Cases - data_clean$Total_Deaths - data_clean$Total_Recovered

# Get top 5 countries with highest active cases
top_active_cases <- data_clean %>% 
  select(Country, Active_Cases) %>% 
  arrange(desc(Active_Cases)) %>% 
  head(5)

# Format numbers with comma separator
top_active_cases$Active_Cases <- format(top_active_cases$Active_Cases, big.mark = ",", scientific = FALSE)

# Print the table with header row
knitr::kable(top_active_cases, align = "c", col.names = c("Country", "Active Cases"), caption = "Top 5 Countries with Highest Active Cases")
Top 5 Countries with Highest Active Cases

Country    Active Cases
Japan        10,952,618
USA           1,741,147
Poland          925,549
Vietnam         870,843
Mexico          429,421
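As a quick check of the active-case arithmetic, the sketch below recomputes the figures for the USA and Japan from the values shown in the head(data_clean) output earlier in this document. The data frame name check is introduced here for illustration only.

```r
# Figures copied from the head(data_clean) output earlier in this document
check <- data.frame(
  Country         = c("USA", "Japan"),
  Total_Cases     = c(104196861, 32588442),
  Total_Deaths    = c(1132935, 68399),
  Total_Recovered = c(101322779, 21567425)
)

# Active cases = total cases - total deaths - total recovered
check$Active_Cases <- check$Total_Cases - check$Total_Deaths - check$Total_Recovered
check[, c("Country", "Active_Cases")]
```

Both results agree with the table above. Japan's figure is driven by a recovered count far below its case count, so it may reflect differences in how recoveries are reported and not only ongoing infections.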

What is the case fatality rate (CFR) for COVID-19 across different countries?

The case fatality rate (CFR) is a useful metric for understanding the severity of a disease outbreak. It is the proportion of confirmed cases that end in death, which makes it an important measure of the impact of the disease on public health. By comparing the CFR across different countries, we can get a sense of how well different health systems are coping with the pandemic and identify areas where more resources may be needed to prevent further deaths.

To answer this question, we can calculate the CFR for each country by dividing the total deaths by the total cases and multiplying by 100 to get a percentage. We can then create a bar chart to compare the CFR across different countries.

# Load the required packages
library(ggplot2)

# Calculate CFR (as a percentage of total cases) for each country
cfr_data <- data_clean %>%
  group_by(Country) %>%
  summarise(Total_Cases = sum(Total_Cases), Total_Deaths = sum(Total_Deaths)) %>%
  mutate(CFR = 100 * Total_Deaths / Total_Cases)

# Select top 30 countries by CFR
top30_cfr <- cfr_data %>%
  arrange(desc(CFR)) %>%
  head(30)

# Create CFR plot with economist theme
cfr_plot <- ggplot(top30_cfr, aes(x = reorder(Country, CFR), y = CFR)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() +
  labs(title = "Case Fatality Rates by Country (Top 30)",
       subtitle = "Data as of April 1st, 2023",
       x = "Country",
       y = "Case Fatality Rate (%)",
       caption = "Source: Our World in Data") +
  theme_economist(base_size = 12, base_family = "Arial") +
  theme(plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(size = 16),
        plot.caption = element_text(hjust = 0, size = 10),
        axis.text.y = element_text(size = 10),
        axis.title.y = element_text(size = 12),
        axis.text.x = element_text(size = 8, angle = 45, hjust = 1))

# Display the plot
cfr_plot

From the bar chart, we can see that Yemen has the highest CFR, followed by Peru, Mexico, Syria, and Brazil. It’s important to note that the CFR is not a perfect measure of the severity of COVID-19, as it can be influenced by many factors such as the age distribution of the population, healthcare capacity, and testing availability.
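To make the CFR calculation concrete, the sketch below applies it to three countries using the case and death counts shown in the head(data_clean) output earlier in this document. The data frame name cfr_check and the rounding to two decimals are choices made for this illustration.

```r
# Figures copied from the head(data_clean) output earlier in this document
cfr_check <- data.frame(
  Country      = c("USA", "Brazil", "Japan"),
  Total_Cases  = c(104196861, 36824580, 32588442),
  Total_Deaths = c(1132935, 697074, 68399)
)

# CFR as a percentage of confirmed cases, rounded to two decimals
cfr_check$CFR_percent <- round(100 * cfr_check$Total_Deaths / cfr_check$Total_Cases, 2)

# Sort from highest to lowest CFR
cfr_check[order(-cfr_check$CFR_percent), c("Country", "CFR_percent")]
```

Because the denominator is confirmed cases, a country that tests less will tend to show a higher CFR for the same number of deaths, which is one reason the ranking should be read with caution.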

Conclusion

Overall, the document explores a dataset of COVID-19 cases and deaths from around the world and uses RStudio to perform data cleaning, manipulation, and analysis. The document presents several key findings, including the total number of cases and deaths worldwide, the distribution of cases and deaths by country, and the case fatality rate (CFR) for COVID-19 across different countries.

Notable findings include that the United States has by far the highest number of cases and deaths, followed by India and Brazil. The document also shows that Yemen has the highest CFR, followed by Peru, Mexico, Syria, and Brazil. Several visualizations help to interpret the data: bar charts of the top 10 countries by cases and by deaths, a choropleth map of deaths around the world, and a bar chart comparing the CFR across countries.

Taken together, these results give a picture of the global impact of the COVID-19 pandemic and highlight the importance of continued efforts to control and mitigate the spread of the virus.

Thank you for reading.
Paul Carmody