Investigating COVID-19 Virus Trends

A pneumonia of unknown cause detected in Wuhan, China was first internationally reported from China on 31 December 2019. Today we know this virus as Coronavirus. COVID-19 which stands for COronaVIrus Disease is the disease caused by this virus. Since then, the world has been engaged in the fight against this pandemic. Several measures have therefore been taken to “flatten the curve”. We have consequently experienced social distancing and many people have passed away as well.

In the solidarity to face this unprecedented global crisis, several organizations did not hesitate to share several datasets allowing the conduction of several kinds of analysis in order to understand this pandemic.

It is natural for us to analyze these datasets by ourselves to answer questions since we cannot always rely on the news, and we are data scientists.

This analysis tries to provide an answer to this question:

Which countries have had the highest number of positive cases against the number of tests?

Step 1: Understanding the Data The full dataset is available from Kaggle: https://www.kaggle.com/lin0li/covid19testing

The following dataset has been prepared and available for download here: (https://dq-content.s3.amazonaws.com/505/covid19.csv) <- “covid19.cvs”

#Loding the data into RStudio 

library(readr)

covid_df <- read.csv("covid19.csv")

#How much data do we have?

dim(covid_df)

## [1] 10903    14

#What are the column names of the data? 

vector_cols <- colnames(covid_df)

vector_cols

##  [1] "Date"                    "Continent_Name"         
##  [3] "Two_Letter_Country_Code" "Country_Region"         
##  [5] "Province_State"          "positive"               
##  [7] "hospitalized"            "recovered"              
##  [9] "death"                   "total_tested"           
## [11] "active"                  "hospitalizedCurr"       
## [13] "daily_tested"            "daily_positive"

#Displaying the first few rows of the dataset

head(covid_df)

##         Date Continent_Name Two_Letter_Country_Code Country_Region
## 1 2020-01-20           Asia                      KR    South Korea
## 2 2020-01-22  North America                      US  United States
## 3 2020-01-22  North America                      US  United States
## 4 2020-01-23  North America                      US  United States
## 5 2020-01-23  North America                      US  United States
## 6 2020-01-24           Asia                      KR    South Korea
##   Province_State positive hospitalized recovered death total_tested active
## 1     All States        1            0         0     0            4      0
## 2     All States        1            0         0     0            1      0
## 3     Washington        1            0         0     0            1      0
## 4     All States        1            0         0     0            1      0
## 5     Washington        1            0         0     0            1      0
## 6     All States        2            0         0     0           27      0
##   hospitalizedCurr daily_tested daily_positive
## 1                0            0              0
## 2                0            0              0
## 3                0            0              0
## 4                0            0              0
## 5                0            0              0
## 6                0            5              0

#Display summary of the dataset using a funtion from the tibble package

library(tibble)

glimpse(covid_df)

## Rows: 10,903
## Columns: 14
## $ Date                    <chr> "2020-01-20", "2020-01-22", "2020-01-22", "...
## $ Continent_Name          <chr> "Asia", "North America", "North America", "...
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "...
## $ Country_Region          <chr> "South Korea", "United States", "United Sta...
## $ Province_State          <chr> "All States", "All States", "Washington", "...
## $ positive                <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0...
## $ hospitalized            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ recovered               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ death                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ total_tested            <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, ...
## $ active                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ hospitalizedCurr        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_tested            <int> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_positive          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

From this observation, we deduce that this data consists of 14 columns and 10903 rows. We also see that the Province_State column has mixed data from different levels. We will extract only the country-level data in order to not bias our analyses.

Step 2: Isolating the Rows We Need

#The dataset is filtered using the filter() function from dyplr

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

covid_df_all_states <- filter(covid_df,Province_State=="All States")

Step 3: Isolating the Columns We Need

covid_df_all_states_daily <- select(covid_df_all_states,Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)

Step 4: Extracting the Top Ten Cases Countries Now that we have cleaned up the data, we can now group and aggregate the data by country while summing up daily totals.

covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%

group_by(Country_Region) %>%

summarise(tested = sum(daily_tested),
          positive = sum(daily_positive),
          active = sum(active),
          hospitalized = sum(hospitalizedCurr)) %>%
arrange(-tested)

## `summarise()` ungrouping output (override with `.groups` argument)

Lets isolate to only the top ten countries:

covid_top_10 <- head(covid_df_all_states_daily_sum, 10)

print(covid_top_10)

## # A tibble: 10 x 5
##    Country_Region   tested positive  active hospitalized
##    <chr>             <int>    <int>   <int>        <int>
##  1 United States  17282363  1877179       0            0
##  2 Russia         10542266   406368 6924890            0
##  3 Italy           4091291   251710 6202214      1699003
##  4 India           3692851    60959       0            0
##  5 Turkey          2031192   163941 2980960            0
##  6 Canada          1654779    90873   56454            0
##  7 United Kingdom  1473672   166909       0            0
##  8 Australia       1252900     7200  134586         6655
##  9 Peru             976790    59497       0            0
## 10 Poland           928256    23987  538203            0

Step 5: Identifying the Highest Positive Against Tested Cases Vectors are created from the covid_top_10 dataframe

countries <- covid_top_10$Country_Region

tested_cases <- covid_top_10$tested

positive_cases <- covid_top_10$positive

active_cases <- covid_top_10$active

hospitalized_cases <-covid_top_10$hospitalized

Now we can name the different vectors with the country names:

names(positive_cases) <- countries

names(tested_cases) <- countries

names(active_cases) <- countries

names(hospitalized_cases) <- countries

To identify the top three positive against total tested cases, we can divide the vector positive_cases by the vector tested_cases

pos_vs_tested<- covid_top_10%>%

mutate(

    positive_vs_tested = positive_cases/tested_cases
    
)%>%

arrange(-positive_vs_tested)

View(pos_vs_tested)

positive_tested_top_3 <- c("United Kingdom" = 0.11,"United States" = 0.10,"Turkey" = 0.08)

positive_tested_top_3

## United Kingdom  United States         Turkey 
##           0.11           0.10           0.08

Step 6: Keeping relevant Information To make sure we won’t lose other information about these countries we can create a matrix that contains the ratio and the overall number of COVID-19 tested, positive, active and hospitalized cases.

united_kingdom <- c(0.11, 1473672, 166909, 0, 0)

united_states <- c(0.10, 17282363, 1877179, 0, 0)

turkey <- c(0.08, 2031192, 163941, 2980960, 0)

covid_mat <- rbind(united_kingdom,united_states,turkey)

colnames(covid_mat) <- c("Ratio", "tested", "positive", "active", "hospitalized")

print(covid_mat)

##                Ratio   tested positive  active hospitalized
## united_kingdom  0.11  1473672   166909       0            0
## united_states   0.10 17282363  1877179       0            0
## turkey          0.08  2031192   163941 2980960            0

Step 7: Putting it all together We put all our answers and datasets together in a list that allows us to have a global view

question <- "Which countries have had the highest number of positive cases against the number of tests?"

answer <- c("Positive tested cases" = positive_tested_top_3)

data_structure_list <- list(covid_df,covid_df_all_states,covid_df_all_states_daily,covid_top_10)

matrix <- list(covid_mat)

vector_list <- list(vector_cols, countries)

names(data_structure_list) <- c("Dataframes", "Matrices", "Vectors")

We can put everything together :

covid_analysis_list <- list(question, answer, data_structure_list)

covid_analysis_list[[2]]

## Positive tested cases.United Kingdom  Positive tested cases.United States 
##                                 0.11                                 0.10 
##         Positive tested cases.Turkey 
##                                 0.08

Conclusion After cleaning up the data and performing some simple analysis, we can determine that the countries have had the highest number of positive cases against the number of tests are United Kingdom, United States and Turkey between the 20th of January and the 1st of June 2020.

Covid-19 Virus Trends

Charlie Ha

11/08/2020

Investigating COVID-19 Virus Trends