Data Analysis Project: Investigating COVID-19 Trends

Introduction

The COVID-19 pandemic has resulted in an unprecedented global health crisis, and understanding how the virus spread is essential to combating it. In this project, we analyze a publicly available dataset sourced from Kaggle.com, which contains daily counts of confirmed cases, deaths, recoveries, active cases, and tests conducted across countries between January 20, 2020 and June 1, 2020. The main objective is to identify the 10 countries that conducted the most tests and, for each of them, to compute the ratio of positive cases to tests conducted.

1. Load and Understand the Data

First, we load the necessary packages and the data. Understanding the data structure is the first step in any data analysis.

# Loading the libraries required for our project
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(knitr)
library(ggplot2)

# Loading the data
covid_df <- read_csv("./covid19.csv")

## Rows: 10903 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Continent_Name, Two_Letter_Country_Code, Country_Region, Province_...
## dbl  (9): positive, hospitalized, recovered, death, total_tested, active, ho...
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Displaying basic data
glimpse(covid_df)

## Rows: 10,903
## Columns: 14
## $ Date                    <date> 2020-01-20, 2020-01-22, 2020-01-22, 2020-01-2…
## $ Continent_Name          <chr> "Asia", "North America", "North America", "Nor…
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "US"…
## $ Country_Region          <chr> "South Korea", "United States", "United States…
## $ Province_State          <chr> "All States", "All States", "Washington", "All…
## $ positive                <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0, 1…
## $ hospitalized            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ recovered               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ death                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_tested            <dbl> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ active                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hospitalizedCurr        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ daily_tested            <dbl> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ daily_positive          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

We can see that our dataset contains 10,903 rows and 14 columns.

2. Isolate the Rows We Need

We are interested in country-level data, so we filter for ‘All States’ from the Province_State column.

covid_df_all_states <- covid_df %>% filter(Province_State=="All States")
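
To illustrate why this filter matters, here is a minimal sketch on a small made-up table (the values are hypothetical and not taken from the dataset). It assumes that province-level rows are a breakdown of the same tests reported in the 'All States' row, so summing without the filter would count them twice.

```r
library(dplyr)

# Hypothetical toy data: one country-level row plus two province-level rows
toy <- data.frame(
  Country_Region = c("United States", "United States", "United States"),
  Province_State = c("All States", "Washington", "Oregon"),
  daily_tested   = c(100, 60, 40)
)

# Without the filter, province rows are added on top of the country total
unfiltered_total <- sum(toy$daily_tested)

# With the filter, only the country-level row contributes
filtered_total <- toy %>%
  filter(Province_State == "All States") %>%
  summarize(tested = sum(daily_tested)) %>%
  pull(tested)

unfiltered_total
filtered_total
```

In this toy case the unfiltered total is twice the filtered one, which is the kind of inflation the filter is meant to avoid.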

3. Isolate the Columns We Need

For our analysis, we need only a subset of the available columns.

covid_df_all_states_daily <- covid_df_all_states %>% 
  select("Date", "Country_Region", "active", "hospitalizedCurr", "daily_tested", "daily_positive")

4. Extracting the Top Ten Countries by Tests Conducted

Here, we group the rows by country, calculate the total number of tests conducted, and select the ten countries with the highest totals.

covid_df_all_states_daily_sum <- covid_df_all_states_daily %>% 
  group_by(Country_Region) %>%
  summarize(tested=sum(daily_tested), positive=sum(daily_positive), active=sum(active), hospitalized=sum(hospitalizedCurr)) %>%
  arrange(desc(tested))

covid_top_10 <- covid_df_all_states_daily_sum %>% slice_max(tested, n = 10)

# Print the summary tables
kable(covid_top_10, caption = "Top Ten Tested Countries")

Top Ten Tested Countries
Country_Region	tested	positive	active	hospitalized
United States	17282363	1877179	0	0
Russia	10542266	406368	6924890	0
Italy	4091291	251710	6202214	1699003
India	3692851	60959	0	0
Turkey	2031192	163941	2980960	0
Canada	1654779	90873	56454	0
United Kingdom	1473672	166909	0	0
Australia	1252900	7200	134586	6655
Peru	976790	59497	0	0
Poland	928256	23987	538203	0

After this step, we find that the countries with the highest number of tests conducted are the United States, Russia and Italy.
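
One caveat about the table above: `active` and `hospitalizedCurr` appear to be point-in-time counts rather than daily increments. This is an assumption based on the column names and on the fact that Russia's summed `active` value (6,924,890) far exceeds its total of 406,368 positive cases. Summing such columns over dates counts the same patients repeatedly. A possible alternative, sketched here on made-up data with hypothetical countries "A" and "B", is to sum the daily columns but keep only the most recent value of the snapshot columns.

```r
library(dplyr)

# Hypothetical toy data: three days for each of two made-up countries
toy <- data.frame(
  Date = as.Date(c("2020-05-30", "2020-05-31", "2020-06-01",
                   "2020-05-30", "2020-05-31", "2020-06-01")),
  Country_Region = rep(c("A", "B"), each = 3),
  daily_tested   = c(10, 20, 30, 5, 5, 5),
  active         = c(100, 110, 120, 50, 45, 40)
)

toy_summary <- toy %>%
  group_by(Country_Region) %>%
  summarize(
    tested        = sum(daily_tested),        # daily increments: summing is appropriate
    active_latest = active[which.max(Date)],  # snapshot count: keep the latest value
    .groups = "drop"
  )

toy_summary
```

Whether the latest value, the peak, or a daily average is the right summary depends on the question being asked; the sketch only shows the mechanics.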

5. Identifying the Highest Ratio of Positive Cases to Tests

Next, we calculate the proportion of positive cases to tests conducted for the top 10 countries.

covid_top_10 <- covid_top_10 %>%
  mutate(prop_positive = positive / tested) %>%
  arrange(desc(prop_positive))

positive_tested_top_10 <- covid_top_10 %>% slice_max(prop_positive, n = 10)

# Print the summary tables
kable(positive_tested_top_10, caption = "Top Ten Positive Against Tested Countries")

Top Ten Positive Against Tested Countries
Country_Region	tested	positive	active	hospitalized	prop_positive
United Kingdom	1473672	166909	0	0	0.1132606
United States	17282363	1877179	0	0	0.1086182
Turkey	2031192	163941	2980960	0	0.0807117
Italy	4091291	251710	6202214	1699003	0.0615234
Peru	976790	59497	0	0	0.0609107
Canada	1654779	90873	56454	0	0.0549155
Russia	10542266	406368	6924890	0	0.0385466
Poland	928256	23987	538203	0	0.0258409
India	3692851	60959	0	0	0.0165073
Australia	1252900	7200	134586	6655	0.0057467

From this analysis, we can see that the countries with the highest proportion of positive cases to tests conducted are the United Kingdom, the United States, and Turkey.
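
As a quick arithmetic check, the ratios for the three leading countries can be recomputed directly from the figures in the first table. The sketch below uses base R only; the column name `pct_positive` is introduced here for illustration.

```r
# Figures copied from the "Top Ten Tested Countries" table
top3 <- data.frame(
  Country_Region = c("United Kingdom", "United States", "Turkey"),
  tested         = c(1473672, 17282363, 2031192),
  positive       = c(166909, 1877179, 163941)
)

# Ratio of positive cases to tests, and the same value as a percentage
top3$prop_positive <- top3$positive / top3$tested
top3$pct_positive  <- round(100 * top3$prop_positive, 2)

top3
```

The resulting percentages (11.33, 10.86, and 8.07) match the `prop_positive` column above.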

6. Visualizing the Data

We visualize the countries with the highest proportion of positive cases.

# Creating a plot to visualize the highest positive to tested cases ratio
ggplot(positive_tested_top_10, aes(x = reorder(Country_Region, prop_positive), y = prop_positive)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(prop_positive, 3)), 
            hjust = -0.1, 
            size = 3.5) +
  coord_flip() +
  labs(title = "Top 10 Countries with the Highest COVID-19 Positive to Tested Cases Ratio",
       x = "Country", 
       y = "Positive to Tested Cases Ratio") +
  theme_minimal() +
  theme(plot.margin = margin(1, 2, 1, 1, "cm"))

The visualization clearly shows that the United Kingdom has the highest positive to tested cases ratio among the countries with the most tests conducted, followed by the United States and Turkey.
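
As an optional variation, the bars could be labelled as percentages so that the chart reads the same way as the figures quoted in the conclusion. The sketch below is one way to do this, not the plot used in this report. It uses only three rows copied from the table above, and the `pct_label` column is a helper introduced for illustration. It assumes the `scales` package (installed alongside ggplot2) is available.

```r
library(ggplot2)

# Three rows copied from the table above; the remaining rows are omitted for brevity
plot_df <- data.frame(
  Country_Region = c("United Kingdom", "United States", "Turkey"),
  prop_positive  = c(0.1132606, 0.1086182, 0.0807117)
)

# Pre-format the bar labels as percentages with one decimal place
plot_df$pct_label <- sprintf("%.1f%%", 100 * plot_df$prop_positive)

p <- ggplot(plot_df, aes(x = reorder(Country_Region, prop_positive), y = prop_positive)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = pct_label), hjust = -0.1, size = 3.5) +
  # Percent axis labels, with extra room at the end so bar labels are not clipped
  scale_y_continuous(labels = scales::label_percent(),
                     expand = expansion(mult = c(0, 0.15))) +
  coord_flip() +
  labs(x = "Country", y = "Positive to Tested Cases Ratio") +
  theme_minimal()

p
```

Expanding the axis is an alternative to widening the plot margin when the goal is to keep the text labels inside the panel.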

Conclusion

This project examined trends in COVID-19 cases from January 20, 2020 to June 1, 2020, focusing specifically on the top 10 countries with the highest number of tests conducted. The data reveals that the United States, Russia, and Italy were the countries with the highest number of tests conducted, with 17,282,363, 10,542,266, and 4,091,291 tests respectively.

Investigation of the ratio of positive cases to tests conducted revealed that the United Kingdom (11.33%), the United States (10.86%), and Turkey (8.07%) had the highest proportions. This might suggest a high transmission rate of the virus in these countries, although a high ratio can also arise when testing is aimed mainly at people who already show symptoms. Notably, Australia, while ranking 8th in terms of testing, had the lowest proportion of positive cases (0.57%) and comparatively few hospitalizations, which is consistent with a lower transmission rate.

The United States stands out: despite conducting the most tests by a wide margin, it had the second-highest proportion of positive cases among the ten countries. This highlights the severity and widespread nature of the disease in the country during the period under study.

The information derived from this project can be crucial for various stakeholders in formulating response strategies to the pandemic. For health authorities, these findings can offer insights into the effectiveness of testing strategies and help identify areas where more testing may be needed. Comparing the proportion of positive cases can also provide insights into the spread and severity of the disease in different regions.

Limitations of this project include its reliance on the accuracy and consistency of reporting in each country, which can vary with testing capacity and reporting guidelines. The analysis also did not account for population size, so the raw test totals favor large countries and say little about how much of each population was tested. Lastly, the dataset ends on June 1, 2020, so the findings describe only the early months of the pandemic and are likely to differ for later periods.

For future research, incorporating additional data such as population size, health infrastructure, and government policies could provide more context to the findings and improve the understanding of the global impact of COVID-19.
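
As a rough sketch of the first suggestion, test totals could be joined with a population table and expressed per 1,000 people. The dataset used here has no population column, so the `population` table, the country names "A" and "B", and all numbers below are hypothetical placeholders.

```r
library(dplyr)

# Hypothetical test totals for two made-up countries
tests <- data.frame(
  Country_Region = c("A", "B"),
  tested         = c(1000, 500)
)

# Hypothetical population figures; real values would come from an external source
population <- data.frame(
  Country_Region = c("A", "B"),
  population     = c(100000, 10000)
)

per_capita <- tests %>%
  left_join(population, by = "Country_Region") %>%
  mutate(tests_per_1000 = 1000 * tested / population) %>%
  arrange(desc(tests_per_1000))

per_capita
```

In this toy example, country "B" conducts fewer tests in total but tests a larger share of its population, which is the kind of reversal a per-capita view can reveal.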