In this project, we will analyze some insights and trends of a pandemic caused by Coronavirus disease, also known as COVID-19. This analysis is conducted with the dataset made available by Sloth in Kaggle.
The dataset was collected between January 20 and June 1 in the year 2020, and contains the following variables:
Date: DateContinent_Name: Continent namesTwo_Letter_Country_Code: Country codesCountry_Region: Country namesProvince_State: States/province names; value is
All States when state/provincial level data is not
available.positive: Cumulative number of positive cases
reported.active: Number of active cases on that day.hospitalized: Cumulative number of hospitalized cases
reported.hospitalizedCurr: Number of actively hospitalized cases
on that day.recovered: Cumulative number of recovered cases
reported.death: Cumulative number of deaths reported.total_tested: Cumulative number of tests
conducted.daily_tested: Number of tests conducted on the day; if
daily data is unavailable, daily tested is averaged across number of
days in between.daily_positive: Number of positive cases reported on
the day; if daily data is unavailable, daily positive is averaged across
number of days in.This project aims to answer the two questions:
- Which top 10 countries have the most COVID-19 cases?
- Which top 3 countries have reported the highest number of positive cases in relation to the number of tests conducted?
After analyzing the dataset, we have the answers to the questions in the problem statement.
The top 10 countries with the most COVID-19 cases are:
| Country_Region | tested |
|---|---|
| United States | 34218115 |
| Russia | 10542266 |
| Italy | 4091291 |
| India | 3692851 |
| Canada | 3314703 |
| Australia | 2658762 |
| Turkey | 2031192 |
| United Kingdom | 1473672 |
| Peru | 976790 |
| Poland | 928256 |
The top 3 countries reporting the highest number of positive cases against the number of tests conducted are:
| Country | ratio |
|---|---|
| United Kingdom | 0.11326062 |
| United States | 0.10697945 |
| Turkey | 0.08071172 |
In this section, we will obtain some basic information about the COVID-19 dataset, and prepare it for analysis.
Let us now do:
library(tidyverse)
## ββ Attaching core tidyverse packages ββββββββββββββββββββββββ tidyverse 2.0.0 ββ
## β dplyr 1.1.4 β readr 2.1.5
## β forcats 1.0.1 β stringr 1.5.2
## β ggplot2 4.0.0 β tibble 3.3.0
## β lubridate 1.9.4 β tidyr 1.3.1
## β purrr 1.1.0
## ββ Conflicts ββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
## β dplyr::filter() masks stats::filter()
## β dplyr::lag() masks stats::lag()
## βΉ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
covid_df <- read.csv("covid19.csv")
dimensions <- dim(covid_df)
dimensions
## [1] 10903 14
After loading the dataset in a dataframe, we find that there are
vector_cols <- colnames(covid_df)
vector_cols
## [1] "Date" "Continent_Name"
## [3] "Two_Letter_Country_Code" "Country_Region"
## [5] "Province_State" "positive"
## [7] "hospitalized" "recovered"
## [9] "death" "total_tested"
## [11] "active" "hospitalizedCurr"
## [13] "daily_tested" "daily_positive"
Once we find that there are 14 columns, we obtain the column names and store them in a character vector.
head(covid_df)
## Date Continent_Name Two_Letter_Country_Code Country_Region
## 1 2020-01-20 Asia KR South Korea
## 2 2020-01-22 North America US United States
## 3 2020-01-22 North America US United States
## 4 2020-01-23 North America US United States
## 5 2020-01-23 North America US United States
## 6 2020-01-24 Asia KR South Korea
## Province_State positive hospitalized recovered death total_tested active
## 1 All States 1 0 0 0 4 0
## 2 All States 1 0 0 0 1 0
## 3 Washington 1 0 0 0 1 0
## 4 All States 1 0 0 0 1 0
## 5 Washington 1 0 0 0 1 0
## 6 All States 2 0 0 0 27 0
## hospitalizedCurr daily_tested daily_positive
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 5 0
Here are the first few rows of the loaded dataset. While most columns have a clear data type, we must ascertain for sure beyond viewing a sample of the dataset.
glimpse(covid_df)
## Rows: 10,903
## Columns: 14
## $ Date <chr> "2020-01-20", "2020-01-22", "2020-01-22", "202β¦
## $ Continent_Name <chr> "Asia", "North America", "North America", "Norβ¦
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "US"β¦
## $ Country_Region <chr> "South Korea", "United States", "United Statesβ¦
## $ Province_State <chr> "All States", "All States", "Washington", "Allβ¦
## $ positive <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0, 1β¦
## $ hospitalized <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0β¦
## $ recovered <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0β¦
## $ death <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0β¦
## $ total_tested <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, 0, β¦
## $ active <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0β¦
## $ hospitalizedCurr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0β¦
## $ daily_tested <int> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0β¦
## $ daily_positive <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0β¦
Here, we confirm that:
Note that most of the numerical columns contain numerous 0s. If we take a look at the date column, the dates are really close to the beginning of the COVID-19 pandemic. This explains why numbers are close to 0.
In this section, we will accomplish two tasks:
covid_df_all_states <- covid_df %>%
filter(Province_State == "All States") %>%
select(-Province_State)
head(covid_df_all_states)
## Date Continent_Name Two_Letter_Country_Code Country_Region positive
## 1 2020-01-20 Asia KR South Korea 1
## 2 2020-01-22 North America US United States 1
## 3 2020-01-23 North America US United States 1
## 4 2020-01-24 Asia KR South Korea 2
## 5 2020-01-24 North America US United States 1
## 6 2020-01-25 Oceania AU Australia 4
## hospitalized recovered death total_tested active hospitalizedCurr
## 1 0 0 0 4 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 27 0 0
## 5 0 0 0 1 0 0
## 6 0 0 0 0 0 0
## daily_tested daily_positive
## 1 0 0
## 2 0 0
## 3 0 0
## 4 5 0
## 5 0 0
## 6 0 0
In the first task, we keep only the records where the
Province_State column contains the value
All States. Since the remaining records will be the same
(all of them will be All States), it is safe to remove the
Province_State entirely from the dataset.
covid_df_all_states_daily <- covid_df %>%
select(
Date,
Country_Region,
active,
hospitalizedCurr,
daily_tested,
daily_positive
)
head(covid_df_all_states_daily)
## Date Country_Region active hospitalizedCurr daily_tested daily_positive
## 1 2020-01-20 South Korea 0 0 0 0
## 2 2020-01-22 United States 0 0 0 0
## 3 2020-01-22 United States 0 0 0 0
## 4 2020-01-23 United States 0 0 0 0
## 5 2020-01-23 United States 0 0 0 0
## 6 2020-01-24 South Korea 0 0 5 0
In the last task, we keep the columns which involve daily information
which are the Date, active,
hospitalizedCurr, daily_tested, and
daily_positive columns. Country_region is
obviously kept for to keep the location information necessary to answer
the projectβs main question.
In this section, we will accomplish the following tasks:
Note that the first four tasks will be needed to accomplish the last task.
covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
group_by(Country_Region) %>%
summarise(
tested = sum(daily_tested),
positive = sum(daily_positive),
active = sum(active),
hospitalized = sum(hospitalizedCurr)
) %>%
arrange(-tested)
covid_df_all_states_daily_sum
## # A tibble: 109 Γ 5
## Country_Region tested positive active hospitalized
## <chr> <int> <int> <int> <int>
## 1 United States 34218115 3660635 27687993 2899151
## 2 Russia 10542266 406368 6924890 0
## 3 Italy 4091291 251710 6202214 1699003
## 4 India 3692851 60959 0 0
## 5 Canada 3314703 181769 1477358 0
## 6 Australia 2658762 14400 269172 14020
## 7 Turkey 2031192 163941 2980960 0
## 8 United Kingdom 1473672 166909 0 0
## 9 Peru 976790 59497 0 0
## 10 Poland 928256 23987 538203 0
## # βΉ 99 more rows
In this first four tasks, we first group the dataframe by
Country_region before obtaining the tested,
positive, active, and
hospitalized columns. Afterwards, we sort the resulting
dataframe by the tested column in descending order.
covid_top_10 <- head(covid_df_all_states_daily_sum, 10)
covid_top_10
## # A tibble: 10 Γ 5
## Country_Region tested positive active hospitalized
## <chr> <int> <int> <int> <int>
## 1 United States 34218115 3660635 27687993 2899151
## 2 Russia 10542266 406368 6924890 0
## 3 Italy 4091291 251710 6202214 1699003
## 4 India 3692851 60959 0 0
## 5 Canada 3314703 181769 1477358 0
## 6 Australia 2658762 14400 269172 14020
## 7 Turkey 2031192 163941 2980960 0
## 8 United Kingdom 1473672 166909 0 0
## 9 Peru 976790 59497 0 0
## 10 Poland 928256 23987 538203 0
In the Country_Region column of the
covid_top10 dataframe, we have the top 10 countries with
the most COVID-19 cases.
ans1 <- covid_top_10 %>% select("Country_Region", "tested")
ans1
## # A tibble: 10 Γ 2
## Country_Region tested
## <chr> <int>
## 1 United States 34218115
## 2 Russia 10542266
## 3 Italy 4091291
## 4 India 3692851
## 5 Canada 3314703
## 6 Australia 2658762
## 7 Turkey 2031192
## 8 United Kingdom 1473672
## 9 Peru 976790
## 10 Poland 928256
To accomplish the last task, we must extract the first ten countries
which appear in the covid_df_all_states_daily_sum dataframe
obtained in the first four tasks.
countries <- covid_top_10[["Country_Region"]]
tested_cases <- covid_top_10[["tested"]]
positive_cases <- covid_top_10[["positive"]]
active_cases <- covid_top_10[["active"]]
hospitalized_cases <- covid_top_10[["hospitalized"]]
names(tested_cases) <- countries
names(positive_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries
positive_tested <- positive_cases / tested_cases
positive_tested
## United States Russia Italy India Canada
## 0.106979446 0.038546552 0.061523368 0.016507300 0.054837191
## Australia Turkey United Kingdom Peru Poland
## 0.005416055 0.080711720 0.113260617 0.060910738 0.025840932
We obtain the ratio of the number of positive cases against the number of tested cases in the top 10 countries, storing the resulting operation in a vector.
positive_tested_top_3 <- c(
"United Kingdom" = 0.113260617,
"United States" = 0.106979446,
"Turkey" = 0.080711720
)
positive_tested_top_3
## United Kingdom United States Turkey
## 0.11326062 0.10697945 0.08071172
From this vector, we obtain the top 3 countries with the highest ratios of positive cases against the tested cases.
covid_top_10_wratio <- tibble(
Country = countries,
ratio = positive_tested,
tested = tested_cases,
positive = positive_cases,
active = active_cases,
hospitalized = hospitalized_cases
)
united_kingdom <- covid_top_10_wratio %>%
filter(Country == "United Kingdom")
united_states <- covid_top_10_wratio %>%
filter(Country == "United States")
turkey <- covid_top_10_wratio %>%
filter(Country == "Turkey")
covid_top3_df <- rbind(
united_kingdom,
united_states,
turkey
) %>%
mutate(
tested = as.numeric(tested),
positive = as.numeric(positive),
active = as.numeric(active),
hospitalized = as.numeric(hospitalized)
)
covid_top3_df
## # A tibble: 3 Γ 6
## Country ratio tested positive active hospitalized
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 United Kingdom 0.113 1473672 166909 0 0
## 2 United States 0.107 34218115 3660635 27687993 2899151
## 3 Turkey 0.0807 2031192 163941 2980960 0
united_kingdom_vec <- c(0.11326062, 1473672, 166909, 0, 0)
united_states_vec <- c(0.10697945, 34218115, 3660635, 27687993, 2899151)
turkey_vec <- c(0.08071172, 2031192, 163941, 2980960, 0)
covid_mat <- rbind(
united_kingdom_vec,
united_states_vec,
turkey_vec
)
colnames(covid_mat) <- c("ratio", "tested", "positive", "active", "hospitalized")
covid_mat
## ratio tested positive active hospitalized
## united_kingdom_vec 0.11326062 1473672 166909 0 0
## united_states_vec 0.10697945 34218115 3660635 27687993 2899151
## turkey_vec 0.08071172 2031192 163941 2980960 0
positive_tested_top_3 <- covid_mat[, "ratio"]
positive_tested_top_3
## united_kingdom_vec united_states_vec turkey_vec
## 0.11326062 0.10697945 0.08071172
question <- "Which countries have had the highest number of positive cases against the number of tests?"
answer <- c("Positive against tested cases" = positive_tested_top_3)
data_structure_list <- list(
covid_df,
covid_df_all_states,
covid_df_all_states_daily,
covid_top_10,
vector_cols,
countries
)
covid_analysis_list <- list(
question,
answer,
data_structure_list
)
covid_analysis_list[[2]]
## Positive against tested cases.united_kingdom_vec
## 0.11326062
## Positive against tested cases.united_states_vec
## 0.10697945
## Positive against tested cases.turkey_vec
## 0.08071172