The world has been engaged in the fight against this pandemic. Several measures have been taken to flatten the curve. We have consequently experienced social distancing and many people have passes away around the world as well.
In the solidarity to face this unprecedented global crisis, several organizations did not hesitate to share several datasets allowing the conduction of several kinds of analysis in order to gain a deeper understanding of this pandemic.
Our analysis tries to provide answers to the following questions:
Which countries have had the highest number of deaths due to COVID- 19?
Which countries have had the highest number of positive cases against the number of tests?
Which countries have made the best effort in terms of the number of COVID-19 tests conducted related to their population?
Which countries were ultimately the most and least affected related to their population?
Step 1: Understanding the Data
#load readr package to allow us to read the dataset
library(readr)
covid_df <- read.csv("covid19.csv")
#Check the dimensions of the dataframe
dim(covid_df)
## [1] 10903 14
#Store the column names into a vector
vector_cols <- colnames(covid_df)
vector_cols
## [1] "Date" "Continent_Name"
## [3] "Two_Letter_Country_Code" "Country_Region"
## [5] "Province_State" "positive"
## [7] "hospitalized" "recovered"
## [9] "death" "total_tested"
## [11] "active" "hospitalizedCurr"
## [13] "daily_tested" "daily_positive"
# Sneak peek of the data using the glimpse function from tidyverse
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v dplyr 1.0.1
## v tibble 3.0.3 v stringr 1.4.0
## v tidyr 1.1.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
glimpse(covid_df)
## Rows: 10,903
## Columns: 14
## $ Date <chr> "2020-01-20", "2020-01-22", "2020-01-22", "...
## $ Continent_Name <chr> "Asia", "North America", "North America", "...
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "...
## $ Country_Region <chr> "South Korea", "United States", "United Sta...
## $ Province_State <chr> "All States", "All States", "Washington", "...
## $ positive <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0...
## $ hospitalized <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ recovered <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ death <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ total_tested <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, ...
## $ active <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ hospitalizedCurr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_tested <int> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_positive <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
As we can see, these function give us a summary of the data. There are 10903 rows and 14 columns of data.
Isolating the Data
The column Province_state has mixed data from different levels so we must filter country-level data
#Using the filter function to isolate all states
library(dplyr)
covid_df_all_states<-filter(covid_df, Province_State=="All States")
#Using the select function to select columns related to cumulative measures
covid_df_all_states_cumulative<- select(covid_df_all_states,Date, Continent_Name, Two_Letter_Country_Code, positive, hospitalized, recovered, death, total_tested)
#Using the select function again to select columns related to daily measures
covid_df_all_states_daily <- select(covid_df_all_states, Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)
Identifying Countries with the Highest Number of Deaths
To determine which countries have had the highest number of deaths due to COVID-19, we can use data visualization to explore all the data at once. This provides a global view of the data.
#Sum up the total deaths by countries
covid_df_all_states_cumulative_max <- covid_df_all_states_cumulative %>%
group_by(Continent_Name, Two_Letter_Country_Code)%>%
filter(death>0)%>%
summarise(total_death = sum(death),
total_positive = sum(positive),
total_hospitalized = sum(hospitalized),
total_recovered = sum(recovered),
sum_total = sum(total_tested)) %>%
arrange(-total_death)
## `summarise()` regrouping output by 'Continent_Name' (override with `.groups` argument)
#Load ggplot2 to visualize the top deaths by country
library(ggplot2)
qplot(x=Two_Letter_Country_Code,
y=total_death,
colour=Continent_Name,
data=covid_df_all_states_cumulative_max)
death_top_3<-c("US", "IT", "GB")
As we can see from the plot, the top three countries with the highest COVID-19 deaths are the United States of America, Italy, and Great Britain.
Extracting the Top Ten Tested Cases Countries
The data can now be agregated by country by summing up the daily totals.
covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
group_by(Country_Region) %>%
summarise(tested = sum(daily_tested),
positive = sum(daily_positive),
active = sum(active),
hospitalized = sum(hospitalizedCurr)) %>%
arrange(-tested)
## `summarise()` ungrouping output (override with `.groups` argument)
Let’s isolate only the top ten countries:
covid_top_10 <- head(covid_df_all_states_daily_sum, 10)
print(covid_top_10)
## # A tibble: 10 x 5
## Country_Region tested positive active hospitalized
## <chr> <int> <int> <int> <int>
## 1 United States 17282363 1877179 0 0
## 2 Russia 10542266 406368 6924890 0
## 3 Italy 4091291 251710 6202214 1699003
## 4 India 3692851 60959 0 0
## 5 Turkey 2031192 163941 2980960 0
## 6 Canada 1654779 90873 56454 0
## 7 United Kingdom 1473672 166909 0 0
## 8 Australia 1252900 7200 134586 6655
## 9 Peru 976790 59497 0 0
## 10 Poland 928256 23987 538203 0
Identifying the Highest Positive Against Tested Cases
The data from the top countries can be stored into vectors for analysis
#Column data is stored into vectors
countries <- covid_top_10$Country_Region
tested_cases <- covid_top_10$tested
positive_cases <- covid_top_10$positive
active_cases <- covid_top_10$active
hospitalized_cases <- covid_top_10$hospitalized
#Vectors are named by their respective countries
names(tested_cases) <- countries
names(positive_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries
#Calculating the ratio of positive cases against the number of tested cases
pos_vs_test<-positive_cases/tested_cases
print(pos_vs_test)
## United States Russia Italy India Turkey
## 0.108618191 0.038546552 0.061523368 0.016507300 0.080711720
## Canada United Kingdom Australia Peru Poland
## 0.054915490 0.113260617 0.005746668 0.060910738 0.025840932
positive_tested_top_3<-c("United Kingdom"=0.11, "United States"=0.10, "Turkey"=0.08)
print(positive_tested_top_3)
## United Kingdom United States Turkey
## 0.11 0.10 0.08
We can see that United Kingdom has the highest number of positive cases of COVID-19 per total number of individuals tested.
Scaling of Data to Population Level
To win this fight against the virus, each country has to defend itself as best it can. We can quantify this effort for the top ten tested cases at the population level of each country.
We first need the total population of each of the top ten countries:
covid_mat <- cbind(tested_cases, positive_cases, active_cases, hospitalized_cases)
population<- c(331002651, 145934462, 60461826, 1380004385, 84339067, 37742154, 67886011, 25499884, 32971854, 37846611)
#Compare all data in proportion to the countries population
covid_mat <- covid_mat*100/population
print(covid_mat)
## tested_cases positive_cases active_cases hospitalized_cases
## United States 5.221216 0.567119023 0.0000000 0.00000000
## Russia 7.223973 0.278459244 4.7452054 0.00000000
## Italy 6.766734 0.416312269 10.2580660 2.81004249
## India 0.267597 0.004417305 0.0000000 0.00000000
## Turkey 2.408364 0.194383227 3.5344949 0.00000000
## Canada 4.384432 0.240773221 0.1495781 0.00000000
## United Kingdom 2.170804 0.245866560 0.0000000 0.00000000
## Australia 4.913356 0.028235423 0.5277906 0.02609816
## Peru 2.962496 0.180447845 0.0000000 0.00000000
## Poland 2.452679 0.063379519 1.4220639 0.00000000
Ranking Countries Related to their population
To be able to compare these values to determine how the countries have been impacted related to one another, we can rank each country.
#ranking each column from previous Covid-19 matrix
tested_cases_rank <- rank(covid_mat[,"tested_cases"])
positive_cases_rank <- rank(covid_mat[,"positive_cases"])
active_cases_rank <- rank(covid_mat[,"active_cases"])
hospitalized_cases_rank <- rank(covid_mat[,"hospitalized_cases"])
#bind each ranking by row
covid_mat_rank <- rbind(tested_cases_rank,positive_cases_rank,active_cases_rank,hospitalized_cases_rank)
covid_mat_rank
## United States Russia Italy India Turkey Canada
## tested_cases_rank 8.0 10.0 9 1.0 3.0 6.0
## positive_cases_rank 10.0 8.0 9 1.0 5.0 6.0
## active_cases_rank 2.5 9.0 10 2.5 8.0 5.0
## hospitalized_cases_rank 4.5 4.5 10 4.5 4.5 4.5
## United Kingdom Australia Peru Poland
## tested_cases_rank 2.0 7 5.0 4.0
## positive_cases_rank 7.0 2 4.0 3.0
## active_cases_rank 2.5 6 2.5 7.0
## hospitalized_cases_rank 4.5 9 4.5 4.5
Identifying Affected Countries Related to their Population
We can provide each country with an aggregate score by summing the rankings.
colSums(covid_mat_rank[-1, ])
## United States Russia Italy India Turkey
## 17.0 21.5 29.0 8.0 17.5
## Canada United Kingdom Australia Peru Poland
## 15.5 14.0 17.0 11.0 14.5
best_effort_tested_cased_top_3 <- c("India", "United Kingdom", "Turkey")
most_affected_country <- "Italy"
least_affected_country <- "India"
Putting it all Together
Now that we have completed our analysis, lets review the questions we were trying to answer.
Which countries have had the highest number of deaths due to COVID- 19?
Which countries have had the highest number of positive cases against the number of tests?
Which countries have made the best effort in terms of the number of COVID-19 tests conducted related to their population?
Which countries were ultimately the most and least affected related to their population?