Title: “DataQuest.io Guided Project: Investigating COVID-19 Virus Trends”
Author: “Charlie Ha”
Date: Aug 10, 2020
Output: html_notebook
A pneumonia of unknown cause detected in Wuhan, China was first internationally reported from China on 31 December 2019. Today we know this virus as Coronavirus. COVID-19 which stands for COronaVIrus Disease is the disease caused by this virus. Since then, the world has been engaged in the fight against this pandemic. Several measures have therefore been taken to “flatten the curve”. We have consequently experienced social distancing and many people have passed away as well.
In the solidarity to face this unprecedented global crisis, several organizations did not hesitate to share several datasets allowing the conduction of several kinds of analysis in order to understand this pandemic.
It is natural for us to analyze these datasets by ourselves to answer questions since we cannot always rely on the news, and we are data scientists.
This analysis tries to provide an answer to this question:
Which countries have had the highest number of positive cases against the number of tests?
Step 1: Understanding the Data The full dataset is available from Kaggle: https://www.kaggle.com/lin0li/covid19testing
The following dataset has been prepared and available for download here: (https://dq-content.s3.amazonaws.com/505/covid19.csv) <- “covid19.cvs”
#Loding the data into RStudio
library(readr)
covid_df <- read.csv("covid19.csv")
#How much data do we have?
dim(covid_df)
## [1] 10903 14
#What are the column names of the data?
vector_cols <- colnames(covid_df)
vector_cols
## [1] "Date" "Continent_Name"
## [3] "Two_Letter_Country_Code" "Country_Region"
## [5] "Province_State" "positive"
## [7] "hospitalized" "recovered"
## [9] "death" "total_tested"
## [11] "active" "hospitalizedCurr"
## [13] "daily_tested" "daily_positive"
#Displaying the first few rows of the dataset
head(covid_df)
## Date Continent_Name Two_Letter_Country_Code Country_Region
## 1 2020-01-20 Asia KR South Korea
## 2 2020-01-22 North America US United States
## 3 2020-01-22 North America US United States
## 4 2020-01-23 North America US United States
## 5 2020-01-23 North America US United States
## 6 2020-01-24 Asia KR South Korea
## Province_State positive hospitalized recovered death total_tested active
## 1 All States 1 0 0 0 4 0
## 2 All States 1 0 0 0 1 0
## 3 Washington 1 0 0 0 1 0
## 4 All States 1 0 0 0 1 0
## 5 Washington 1 0 0 0 1 0
## 6 All States 2 0 0 0 27 0
## hospitalizedCurr daily_tested daily_positive
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 5 0
#Display summary of the dataset using a funtion from the tibble package
library(tibble)
glimpse(covid_df)
## Rows: 10,903
## Columns: 14
## $ Date <chr> "2020-01-20", "2020-01-22", "2020-01-22", "...
## $ Continent_Name <chr> "Asia", "North America", "North America", "...
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "...
## $ Country_Region <chr> "South Korea", "United States", "United Sta...
## $ Province_State <chr> "All States", "All States", "Washington", "...
## $ positive <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0...
## $ hospitalized <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ recovered <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ death <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ total_tested <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, ...
## $ active <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ hospitalizedCurr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_tested <int> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_positive <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
From this observation, we deduce that this data consists of 14 columns and 10903 rows. We also see that the Province_State column has mixed data from different levels. We will extract only the country-level data in order to not bias our analyses.
Step 2: Isolating the Rows We Need
#The dataset is filtered using the filter() function from dyplr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
covid_df_all_states <- filter(covid_df,Province_State=="All States")
Step 3: Isolating the Columns We Need
covid_df_all_states_daily <- select(covid_df_all_states,Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)
Step 4: Extracting the Top Ten Cases Countries Now that we have cleaned up the data, we can now group and aggregate the data by country while summing up daily totals.
covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
group_by(Country_Region) %>%
summarise(tested = sum(daily_tested),
positive = sum(daily_positive),
active = sum(active),
hospitalized = sum(hospitalizedCurr)) %>%
arrange(-tested)
## `summarise()` ungrouping output (override with `.groups` argument)
Lets isolate to only the top ten countries:
covid_top_10 <- head(covid_df_all_states_daily_sum, 10)
print(covid_top_10)
## # A tibble: 10 x 5
## Country_Region tested positive active hospitalized
## <chr> <int> <int> <int> <int>
## 1 United States 17282363 1877179 0 0
## 2 Russia 10542266 406368 6924890 0
## 3 Italy 4091291 251710 6202214 1699003
## 4 India 3692851 60959 0 0
## 5 Turkey 2031192 163941 2980960 0
## 6 Canada 1654779 90873 56454 0
## 7 United Kingdom 1473672 166909 0 0
## 8 Australia 1252900 7200 134586 6655
## 9 Peru 976790 59497 0 0
## 10 Poland 928256 23987 538203 0
Step 5: Identifying the Highest Positive Against Tested Cases Vectors are created from the covid_top_10 dataframe
countries <- covid_top_10$Country_Region
tested_cases <- covid_top_10$tested
positive_cases <- covid_top_10$positive
active_cases <- covid_top_10$active
hospitalized_cases <-covid_top_10$hospitalized
Now we can name the different vectors with the country names:
names(positive_cases) <- countries
names(tested_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries
To identify the top three positive against total tested cases, we can divide the vector positive_cases by the vector tested_cases
pos_vs_tested<- covid_top_10%>%
mutate(
positive_vs_tested = positive_cases/tested_cases
)%>%
arrange(-positive_vs_tested)
View(pos_vs_tested)
positive_tested_top_3 <- c("United Kingdom" = 0.11,"United States" = 0.10,"Turkey" = 0.08)
positive_tested_top_3
## United Kingdom United States Turkey
## 0.11 0.10 0.08
Step 6: Keeping relevant Information To make sure we won’t lose other information about these countries we can create a matrix that contains the ratio and the overall number of COVID-19 tested, positive, active and hospitalized cases.
united_kingdom <- c(0.11, 1473672, 166909, 0, 0)
united_states <- c(0.10, 17282363, 1877179, 0, 0)
turkey <- c(0.08, 2031192, 163941, 2980960, 0)
covid_mat <- rbind(united_kingdom,united_states,turkey)
colnames(covid_mat) <- c("Ratio", "tested", "positive", "active", "hospitalized")
print(covid_mat)
## Ratio tested positive active hospitalized
## united_kingdom 0.11 1473672 166909 0 0
## united_states 0.10 17282363 1877179 0 0
## turkey 0.08 2031192 163941 2980960 0
Step 7: Putting it all together We put all our answers and datasets together in a list that allows us to have a global view
question <- "Which countries have had the highest number of positive cases against the number of tests?"
answer <- c("Positive tested cases" = positive_tested_top_3)
data_structure_list <- list(covid_df,covid_df_all_states,covid_df_all_states_daily,covid_top_10)
matrix <- list(covid_mat)
vector_list <- list(vector_cols, countries)
names(data_structure_list) <- c("Dataframes", "Matrices", "Vectors")
We can put everything together :
covid_analysis_list <- list(question, answer, data_structure_list)
covid_analysis_list[[2]]
## Positive tested cases.United Kingdom Positive tested cases.United States
## 0.11 0.10
## Positive tested cases.Turkey
## 0.08
Conclusion After cleaning up the data and performing some simple analysis, we can determine that the countries have had the highest number of positive cases against the number of tests are United Kingdom, United States and Turkey between the 20th of January and the 1st of June 2020.