Investigating COVID-19 Virus Trends 🚀

In this project, we will analyze some insights and trends of a pandemic caused by Coronavirus disease, also known as COVID-19. This analysis is conducted with the dataset made available by Sloth in Kaggle.

The dataset was collected between January 20 and June 1 in the year 2020, and contains the following variables:

Date: Date
Continent_Name: Continent names
Two_Letter_Country_Code: Country codes
Country_Region: Country names
Province_State: States/province names; value is All States when state/provincial level data is not available.
positive: Cumulative number of positive cases reported.
active: Number of active cases on that day.
hospitalized: Cumulative number of hospitalized cases reported.
hospitalizedCurr: Number of actively hospitalized cases on that day.
recovered: Cumulative number of recovered cases reported.
death: Cumulative number of deaths reported.
total_tested: Cumulative number of tests conducted.
daily_tested: Number of tests conducted on the day; if daily data is unavailable, daily tested is averaged across number of days in between.
daily_positive: Number of positive cases reported on the day; if daily data is unavailable, daily positive is averaged across number of days in.

Problem Statement

This project aims to answer the two questions:

Which top 10 countries have the most COVID-19 cases?

Which top 3 countries have reported the highest number of positive cases in relation to the number of tests conducted?

Summary of Analysis

After analyzing the dataset, we have the answers to the questions in the problem statement.

The top 10 countries with the most COVID-19 cases are:

Country_Region	tested
United States	34218115
Russia	10542266
Italy	4091291
India	3692851
Canada	3314703
Australia	2658762
Turkey	2031192
United Kingdom	1473672
Peru	976790
Poland	928256

The top 3 countries reporting the highest number of positive cases against the number of tests conducted are:

Country	ratio
United Kingdom	0.11326062
United States	0.10697945
Turkey	0.08071172

Exploration and Preparation

In this section, we will obtain some basic information about the COVID-19 dataset, and prepare it for analysis.

Initial Information

Let us now do:

load the dataset in a dataframe,
get the dimensions,
get (verify) the column names,
display the first few records, and
obtain some summary information.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

covid_df <- read.csv("covid19.csv")

dimensions <- dim(covid_df)
dimensions

## [1] 10903    14

After loading the dataset in a dataframe, we find that there are

10,903 records, and
14 variables.

vector_cols <- colnames(covid_df)
vector_cols

##  [1] "Date"                    "Continent_Name"         
##  [3] "Two_Letter_Country_Code" "Country_Region"         
##  [5] "Province_State"          "positive"               
##  [7] "hospitalized"            "recovered"              
##  [9] "death"                   "total_tested"           
## [11] "active"                  "hospitalizedCurr"       
## [13] "daily_tested"            "daily_positive"

Once we find that there are 14 columns, we obtain the column names and store them in a character vector.

head(covid_df)

##         Date Continent_Name Two_Letter_Country_Code Country_Region
## 1 2020-01-20           Asia                      KR    South Korea
## 2 2020-01-22  North America                      US  United States
## 3 2020-01-22  North America                      US  United States
## 4 2020-01-23  North America                      US  United States
## 5 2020-01-23  North America                      US  United States
## 6 2020-01-24           Asia                      KR    South Korea
##   Province_State positive hospitalized recovered death total_tested active
## 1     All States        1            0         0     0            4      0
## 2     All States        1            0         0     0            1      0
## 3     Washington        1            0         0     0            1      0
## 4     All States        1            0         0     0            1      0
## 5     Washington        1            0         0     0            1      0
## 6     All States        2            0         0     0           27      0
##   hospitalizedCurr daily_tested daily_positive
## 1                0            0              0
## 2                0            0              0
## 3                0            0              0
## 4                0            0              0
## 5                0            0              0
## 6                0            5              0

Here are the first few rows of the loaded dataset. While most columns have a clear data type, we must ascertain for sure beyond viewing a sample of the dataset.

glimpse(covid_df)

## Rows: 10,903
## Columns: 14
## $ Date                    <chr> "2020-01-20", "2020-01-22", "2020-01-22", "202…
## $ Continent_Name          <chr> "Asia", "North America", "North America", "Nor…
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "US"…
## $ Country_Region          <chr> "South Korea", "United States", "United States…
## $ Province_State          <chr> "All States", "All States", "Washington", "All…
## $ positive                <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0, 1…
## $ hospitalized            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ recovered               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ death                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_tested            <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ active                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hospitalizedCurr        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ daily_tested            <int> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ daily_positive          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Here, we confirm that:

the first 5 columns are character columns, where they contain the date and location information of a record, and
the remaining 9 columns are integer columns, where they contain numerical information about a record relevant to understanding the viral pandemic.

Note that most of the numerical columns contain numerous 0s. If we take a look at the date column, the dates are really close to the beginning of the COVID-19 pandemic. This explains why numbers are close to 0.

Gathering Relevant Records

In this section, we will accomplish two tasks:

filter records that represent the country, instead of a province or a state in that country, and
filter our dataset by keeping the variables invovled with daily information.

covid_df_all_states <- covid_df %>%
  filter(Province_State == "All States") %>%
  select(-Province_State)
head(covid_df_all_states)

##         Date Continent_Name Two_Letter_Country_Code Country_Region positive
## 1 2020-01-20           Asia                      KR    South Korea        1
## 2 2020-01-22  North America                      US  United States        1
## 3 2020-01-23  North America                      US  United States        1
## 4 2020-01-24           Asia                      KR    South Korea        2
## 5 2020-01-24  North America                      US  United States        1
## 6 2020-01-25        Oceania                      AU      Australia        4
##   hospitalized recovered death total_tested active hospitalizedCurr
## 1            0         0     0            4      0                0
## 2            0         0     0            1      0                0
## 3            0         0     0            1      0                0
## 4            0         0     0           27      0                0
## 5            0         0     0            1      0                0
## 6            0         0     0            0      0                0
##   daily_tested daily_positive
## 1            0              0
## 2            0              0
## 3            0              0
## 4            5              0
## 5            0              0
## 6            0              0

In the first task, we keep only the records where the Province_State column contains the value All States. Since the remaining records will be the same (all of them will be All States), it is safe to remove the Province_State entirely from the dataset.

covid_df_all_states_daily <- covid_df %>%
  select(
    Date,
    Country_Region,
    active,
    hospitalizedCurr,
    daily_tested,
    daily_positive
  )
head(covid_df_all_states_daily)

##         Date Country_Region active hospitalizedCurr daily_tested daily_positive
## 1 2020-01-20    South Korea      0                0            0              0
## 2 2020-01-22  United States      0                0            0              0
## 3 2020-01-22  United States      0                0            0              0
## 4 2020-01-23  United States      0                0            0              0
## 5 2020-01-23  United States      0                0            0              0
## 6 2020-01-24    South Korea      0                0            5              0

In the last task, we keep the columns which involve daily information which are the Date, active, hospitalizedCurr, daily_tested, and daily_positive columns. Country_region is obviously kept for to keep the location information necessary to answer the project’s main question.

Analysis on Daily Pandemic Variables

In this section, we will accomplish the following tasks:

determine the sum of tested cases by country,
determine the sum of positive cases by country,
determine the sum of active cases by country,
determine the sum of hospitalized cases (within a day) by country, and
determine the top 10 countries with the highest number of positive cases against their number of tests.

Note that the first four tasks will be needed to accomplish the last task.

covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
  group_by(Country_Region) %>%
  summarise(
    tested = sum(daily_tested),
    positive = sum(daily_positive),
    active = sum(active),
    hospitalized = sum(hospitalizedCurr)
  ) %>%
  arrange(-tested)
covid_df_all_states_daily_sum

## # A tibble: 109 × 5
##    Country_Region   tested positive   active hospitalized
##    <chr>             <int>    <int>    <int>        <int>
##  1 United States  34218115  3660635 27687993      2899151
##  2 Russia         10542266   406368  6924890            0
##  3 Italy           4091291   251710  6202214      1699003
##  4 India           3692851    60959        0            0
##  5 Canada          3314703   181769  1477358            0
##  6 Australia       2658762    14400   269172        14020
##  7 Turkey          2031192   163941  2980960            0
##  8 United Kingdom  1473672   166909        0            0
##  9 Peru             976790    59497        0            0
## 10 Poland           928256    23987   538203            0
## # ℹ 99 more rows

In this first four tasks, we first group the dataframe by Country_region before obtaining the tested, positive, active, and hospitalized columns. Afterwards, we sort the resulting dataframe by the tested column in descending order.

covid_top_10 <- head(covid_df_all_states_daily_sum, 10)
covid_top_10

## # A tibble: 10 × 5
##    Country_Region   tested positive   active hospitalized
##    <chr>             <int>    <int>    <int>        <int>
##  1 United States  34218115  3660635 27687993      2899151
##  2 Russia         10542266   406368  6924890            0
##  3 Italy           4091291   251710  6202214      1699003
##  4 India           3692851    60959        0            0
##  5 Canada          3314703   181769  1477358            0
##  6 Australia       2658762    14400   269172        14020
##  7 Turkey          2031192   163941  2980960            0
##  8 United Kingdom  1473672   166909        0            0
##  9 Peru             976790    59497        0            0
## 10 Poland           928256    23987   538203            0

In the Country_Region column of the covid_top10 dataframe, we have the top 10 countries with the most COVID-19 cases.

ans1 <- covid_top_10 %>% select("Country_Region", "tested")
ans1

## # A tibble: 10 × 2
##    Country_Region   tested
##    <chr>             <int>
##  1 United States  34218115
##  2 Russia         10542266
##  3 Italy           4091291
##  4 India           3692851
##  5 Canada          3314703
##  6 Australia       2658762
##  7 Turkey          2031192
##  8 United Kingdom  1473672
##  9 Peru             976790
## 10 Poland           928256

To accomplish the last task, we must extract the first ten countries which appear in the covid_df_all_states_daily_sum dataframe obtained in the first four tasks.

countries <- covid_top_10[["Country_Region"]]
tested_cases <- covid_top_10[["tested"]]
positive_cases <- covid_top_10[["positive"]]
active_cases <- covid_top_10[["active"]]
hospitalized_cases <- covid_top_10[["hospitalized"]]

names(tested_cases) <- countries
names(positive_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries

positive_tested <- positive_cases / tested_cases
positive_tested

##  United States         Russia          Italy          India         Canada 
##    0.106979446    0.038546552    0.061523368    0.016507300    0.054837191 
##      Australia         Turkey United Kingdom           Peru         Poland 
##    0.005416055    0.080711720    0.113260617    0.060910738    0.025840932

We obtain the ratio of the number of positive cases against the number of tested cases in the top 10 countries, storing the resulting operation in a vector.

positive_tested_top_3 <- c(
  "United Kingdom" = 0.113260617,
  "United States" = 0.106979446,
  "Turkey" = 0.080711720
)
positive_tested_top_3

## United Kingdom  United States         Turkey 
##     0.11326062     0.10697945     0.08071172

From this vector, we obtain the top 3 countries with the highest ratios of positive cases against the tested cases.

covid_top_10_wratio <- tibble(
  Country = countries,
  ratio = positive_tested,
  tested = tested_cases,
  positive = positive_cases,
  active = active_cases,
  hospitalized = hospitalized_cases
)

united_kingdom <- covid_top_10_wratio %>%
  filter(Country == "United Kingdom")
united_states <- covid_top_10_wratio %>%
  filter(Country == "United States")
turkey <- covid_top_10_wratio %>%
  filter(Country == "Turkey")

covid_top3_df <- rbind(
  united_kingdom,
  united_states,
  turkey
) %>%
  mutate(
    tested = as.numeric(tested),
    positive = as.numeric(positive),
    active = as.numeric(active),
    hospitalized = as.numeric(hospitalized)
  )
covid_top3_df

## # A tibble: 3 × 6
##   Country         ratio   tested positive   active hospitalized
##   <chr>           <dbl>    <dbl>    <dbl>    <dbl>        <dbl>
## 1 United Kingdom 0.113   1473672   166909        0            0
## 2 United States  0.107  34218115  3660635 27687993      2899151
## 3 Turkey         0.0807  2031192   163941  2980960            0

united_kingdom_vec <- c(0.11326062, 1473672, 166909, 0, 0)
united_states_vec <- c(0.10697945, 34218115, 3660635, 27687993, 2899151)
turkey_vec <- c(0.08071172, 2031192, 163941, 2980960, 0)

covid_mat <- rbind(
  united_kingdom_vec,
  united_states_vec,
  turkey_vec
)
colnames(covid_mat) <- c("ratio", "tested", "positive", "active", "hospitalized")
covid_mat

##                         ratio   tested positive   active hospitalized
## united_kingdom_vec 0.11326062  1473672   166909        0            0
## united_states_vec  0.10697945 34218115  3660635 27687993      2899151
## turkey_vec         0.08071172  2031192   163941  2980960            0

positive_tested_top_3 <- covid_mat[, "ratio"]
positive_tested_top_3

## united_kingdom_vec  united_states_vec         turkey_vec 
##         0.11326062         0.10697945         0.08071172

question <- "Which countries have had the highest number of positive cases against the number of tests?"
answer <- c("Positive against tested cases" = positive_tested_top_3)

data_structure_list <- list(
  covid_df,
  covid_df_all_states,
  covid_df_all_states_daily,
  covid_top_10,
  vector_cols,
  countries
)

covid_analysis_list <- list(
  question,
  answer,
  data_structure_list
)

covid_analysis_list[[2]]

## Positive against tested cases.united_kingdom_vec 
##                                       0.11326062 
##  Positive against tested cases.united_states_vec 
##                                       0.10697945 
##         Positive against tested cases.turkey_vec 
##                                       0.08071172