The COVID-19 pandemic has resulted in an unprecedented global health crisis. Understanding how the virus spread is essential to combating it. For this project, we analyze a publicly available dataset from Kaggle.com, which contains daily counts of confirmed cases, deaths, recoveries, active cases, and tests conducted across countries between January 20, 2020 and June 1, 2020. The main objective is to identify the 10 countries that conducted the most tests and to determine the ratio of positive cases to tests conducted in each of them.
First, we load the necessary packages and the data. Understanding the data structure is the first step in any data analysis.
# Loading the libraries required for our project
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(ggplot2) # already attached by tidyverse; repeated here for explicitness
# Loading the data
covid_df <- read_csv("./covid19.csv")
## Rows: 10903 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Continent_Name, Two_Letter_Country_Code, Country_Region, Province_...
## dbl (9): positive, hospitalized, recovered, death, total_tested, active, ho...
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Displaying basic data
glimpse(covid_df)
## Rows: 10,903
## Columns: 14
## $ Date <date> 2020-01-20, 2020-01-22, 2020-01-22, 2020-01-2…
## $ Continent_Name <chr> "Asia", "North America", "North America", "Nor…
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "US"…
## $ Country_Region <chr> "South Korea", "United States", "United States…
## $ Province_State <chr> "All States", "All States", "Washington", "All…
## $ positive <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0, 1…
## $ hospitalized <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ recovered <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ death <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_tested <dbl> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ active <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hospitalizedCurr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ daily_tested <dbl> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ daily_positive <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
We can see that our dataset contains 10,903 rows and 14 columns.
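Since the later steps rely on `sum()`, it may be worth checking for missing values at this point. The sketch below uses a small invented data frame (`toy_df` is a hypothetical stand-in, not the real `covid_df`) to show how missing values can be counted per column and how `sum()` behaves when one is present.

```r
# Toy stand-in for covid_df; the values are invented for illustration only.
toy_df <- data.frame(
  Country_Region = c("A", "A", "B", "B"),
  daily_tested   = c(10, NA, 5, 7),
  daily_positive = c(1, 2, 0, 1)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy_df))
print(na_counts)

# sum() returns NA if any value is missing, unless na.rm = TRUE is set.
sum_default <- sum(toy_df$daily_tested)
sum_na_rm   <- sum(toy_df$daily_tested, na.rm = TRUE)
```

Running `colSums(is.na(covid_df))` on the real data would show whether `na.rm = TRUE` is needed in the aggregation step.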
We are interested in country-level data, so we keep only the rows where Province_State is "All States".
covid_df_all_states <- covid_df %>% filter(Province_State=="All States")
For our analysis, we need only a subset of the available columns.
covid_df_all_states_daily <- covid_df_all_states %>%
  select("Date", "Country_Region", "active", "hospitalizedCurr", "daily_tested", "daily_positive")
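Here, we group the data by country, sum the daily test and positive-case counts, and select the ten countries with the most tests. The active and hospitalized columns are also summed across days, so they should be read as sums of daily values rather than counts on any single date.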
covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
  group_by(Country_Region) %>%
  summarize(tested = sum(daily_tested), positive = sum(daily_positive),
            active = sum(active), hospitalized = sum(hospitalizedCurr)) %>%
  arrange(desc(tested))
covid_top_10 <- covid_df_all_states_daily_sum %>% slice_max(tested, n = 10)
# Print the summary tables
kable(covid_top_10, caption = "Top Ten Tested Countries")
| Country_Region | tested | positive | active | hospitalized |
|---|---|---|---|---|
| United States | 17282363 | 1877179 | 0 | 0 |
| Russia | 10542266 | 406368 | 6924890 | 0 |
| Italy | 4091291 | 251710 | 6202214 | 1699003 |
| India | 3692851 | 60959 | 0 | 0 |
| Turkey | 2031192 | 163941 | 2980960 | 0 |
| Canada | 1654779 | 90873 | 56454 | 0 |
| United Kingdom | 1473672 | 166909 | 0 | 0 |
| Australia | 1252900 | 7200 | 134586 | 6655 |
| Peru | 976790 | 59497 | 0 | 0 |
| Poland | 928256 | 23987 | 538203 | 0 |
The table shows that the three countries with the most tests conducted are the United States, Russia, and Italy.
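The `active` and `hospitalized` columns in the table above come from summing `active` and `hospitalizedCurr` over every day in the dataset. If those columns are daily snapshots (an assumption about the dataset that the source does not confirm), the most recent value per country may be easier to interpret than the sum. The following is a minimal sketch on invented data; `toy_daily` and its values are hypothetical.

```r
library(dplyr)

# Invented daily snapshots for two placeholder countries.
toy_daily <- data.frame(
  Country_Region = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2020-05-30", "2020-05-31", "2020-06-01",
                   "2020-05-31", "2020-06-01")),
  active = c(100, 120, 90, 40, 55)
)

# Compare the sum of daily snapshots with the value on the latest date.
snapshot_summary <- toy_daily %>%
  group_by(Country_Region) %>%
  summarize(
    active_summed = sum(active),
    active_latest = active[which.max(Date)],
    .groups = "drop"
  )

print(snapshot_summary)
```

In this toy case the summed column counts the same active cases on several days, which is why it is much larger than the latest snapshot.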
Next, we calculate the proportion of positive cases to tests conducted for the top 10 countries.
covid_top_10 <- covid_top_10 %>%
  mutate(prop_positive = positive / tested) %>%
  arrange(desc(prop_positive))
positive_tested_top_10 <- covid_top_10 %>% slice_max(prop_positive, n = 10)
# Print the summary tables
kable(positive_tested_top_10, caption = "Top Ten Positive Against Tested Countries")
| Country_Region | tested | positive | active | hospitalized | prop_positive |
|---|---|---|---|---|---|
| United Kingdom | 1473672 | 166909 | 0 | 0 | 0.1132606 |
| United States | 17282363 | 1877179 | 0 | 0 | 0.1086182 |
| Turkey | 2031192 | 163941 | 2980960 | 0 | 0.0807117 |
| Italy | 4091291 | 251710 | 6202214 | 1699003 | 0.0615234 |
| Peru | 976790 | 59497 | 0 | 0 | 0.0609107 |
| Canada | 1654779 | 90873 | 56454 | 0 | 0.0549155 |
| Russia | 10542266 | 406368 | 6924890 | 0 | 0.0385466 |
| Poland | 928256 | 23987 | 538203 | 0 | 0.0258409 |
| India | 3692851 | 60959 | 0 | 0 | 0.0165073 |
| Australia | 1252900 | 7200 | 134586 | 6655 | 0.0057467 |
From this analysis, we can see that the countries with the highest proportion of positive cases to tests conducted are the United Kingdom, the United States, and Turkey.
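As a quick arithmetic check, the proportions for the three leading countries can be recomputed directly from the `tested` and `positive` values printed in the table. This sketch uses base R so that it runs without the dataset; the data frame name `top3` and the column `pct_positive` are introduced here for illustration only.

```r
# Values copied from the summary table above.
top3 <- data.frame(
  Country_Region = c("United Kingdom", "United States", "Turkey"),
  tested   = c(1473672, 17282363, 2031192),
  positive = c(166909, 1877179, 163941)
)

# Proportion of tests that were positive, and the same value as a percentage.
top3$prop_positive <- top3$positive / top3$tested
top3$pct_positive  <- sprintf("%.2f%%", 100 * top3$prop_positive)

print(top3)
```

The recomputed percentages are 11.33%, 10.86%, and 8.07%, matching the `prop_positive` column.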
We visualize the countries with the highest proportion of positive cases.
# Creating a plot to visualize the highest positive to tested cases ratio
ggplot(positive_tested_top_10, aes(x = reorder(Country_Region, prop_positive), y = prop_positive)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(prop_positive, 3)),
            hjust = -0.1,
            size = 3.5) +
  coord_flip() +
  labs(title = "Top 10 Countries with the Highest COVID-19 Positive to Tested Cases Ratio",
       x = "Country",
       y = "Positive to Tested Cases Ratio") +
  theme_minimal() +
  theme(plot.margin = margin(1, 2, 1, 1, "cm"))
The visualization clearly shows that the United Kingdom has the highest positive to tested cases ratio among the countries with the most tests conducted, followed by the United States and Turkey.
This project examined trends in COVID-19 cases from January 20, 2020 to June 1, 2020, focusing specifically on the top 10 countries with the highest number of tests conducted. The data reveals that the United States, Russia, and Italy were the countries with the highest number of tests conducted, with 17,282,363, 10,542,266, and 4,091,291 tests respectively.
The ratio of positive cases to tests conducted was highest in the United Kingdom (11.33%), the United States (10.86%), and Turkey (8.07%). This may indicate a high transmission rate in these countries, although the share of positive tests also depends on how widely tests were offered and to whom. Notably, Australia, which ranked 8th in number of tests, recorded only 7,200 positive cases (0.57% of tests), the lowest proportion among the ten countries, which suggests a lower transmission rate.
The United States stands out: despite conducting by far the most tests, it had the second-highest proportion of positive cases among the ten countries. This highlights the severity and widespread nature of the disease in the country during the period under study.
The information derived from this project can be crucial for various stakeholders in formulating response strategies to the pandemic. For health authorities, these findings can offer insights into the effectiveness of testing strategies and help identify areas where more testing may be needed. Comparing the proportion of positive cases can also provide insights into the spread and severity of the disease in different regions.
Limitations of this project include the reliance on the accuracy and consistency of reporting in each country, which can vary due to factors such as differences in testing capacity, reporting guidelines, and disease prevalence. The analysis also did not consider population size, so the ranking by absolute number of tests favors large countries and says little about testing coverage. Lastly, the findings reflect only the period through June 1, 2020; because the situation with COVID-19 evolved rapidly, they are likely to differ for later periods.
For future research, incorporating additional data such as population size, health infrastructure, and government policies could provide more context to the findings and improve the understanding of the global impact of COVID-19.
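One way population size could be incorporated is to join a population table to the test counts and rank countries by tests per 1,000 people. The sketch below is only an illustration under stated assumptions: the countries "A", "B", and "C", their figures, and the names `toy_tests`, `toy_population`, and `tests_per_1000` are all hypothetical and do not come from the dataset.

```r
library(dplyr)

# Invented test counts and populations for three placeholder countries.
toy_tests <- data.frame(
  Country_Region = c("A", "B", "C"),
  tested = c(1000000, 500000, 200000)
)
toy_population <- data.frame(
  Country_Region = c("A", "B", "C"),
  population = c(100000000, 10000000, 2000000)
)

# Join the two tables and rank by tests per 1,000 people.
tests_per_capita <- toy_tests %>%
  left_join(toy_population, by = "Country_Region") %>%
  mutate(tests_per_1000 = 1000 * tested / population) %>%
  arrange(desc(tests_per_1000))

print(tests_per_capita)
```

In this toy example the country with the most tests in absolute terms ranks last once population is taken into account, which is the kind of reordering a per-capita analysis might reveal.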