Covid-19

The world has been engaged in the fight against this pandemic. Several measures have been taken to flatten the curve. We have consequently experienced social distancing and many people have passes away around the world as well.

In the solidarity to face this unprecedented global crisis, several organizations did not hesitate to share several datasets allowing the conduction of several kinds of analysis in order to gain a deeper understanding of this pandemic.

Our analysis tries to provide answers to the following questions:

Step 1: Understanding the Data

#load readr package to allow us to read the dataset

library(readr)
covid_df <- read.csv("covid19.csv")
#Check the dimensions of the dataframe 

dim(covid_df)
## [1] 10903    14
#Store the column names into a vector

vector_cols <- colnames(covid_df)
vector_cols
##  [1] "Date"                    "Continent_Name"         
##  [3] "Two_Letter_Country_Code" "Country_Region"         
##  [5] "Province_State"          "positive"               
##  [7] "hospitalized"            "recovered"              
##  [9] "death"                   "total_tested"           
## [11] "active"                  "hospitalizedCurr"       
## [13] "daily_tested"            "daily_positive"
# Sneak peek of the data using the glimpse function from tidyverse
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v dplyr   1.0.1
## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.1     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
glimpse(covid_df)
## Rows: 10,903
## Columns: 14
## $ Date                    <chr> "2020-01-20", "2020-01-22", "2020-01-22", "...
## $ Continent_Name          <chr> "Asia", "North America", "North America", "...
## $ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "...
## $ Country_Region          <chr> "South Korea", "United States", "United Sta...
## $ Province_State          <chr> "All States", "All States", "Washington", "...
## $ positive                <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0...
## $ hospitalized            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ recovered               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ death                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ total_tested            <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, ...
## $ active                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ hospitalizedCurr        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_tested            <int> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ daily_positive          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

As we can see, these function give us a summary of the data. There are 10903 rows and 14 columns of data.

Isolating the Data

The column Province_state has mixed data from different levels so we must filter country-level data

#Using the filter function to isolate all states

library(dplyr)
covid_df_all_states<-filter(covid_df, Province_State=="All States")
#Using the select function to select columns related to cumulative measures

covid_df_all_states_cumulative<- select(covid_df_all_states,Date, Continent_Name, Two_Letter_Country_Code, positive, hospitalized, recovered, death, total_tested)
#Using the select function again to select columns related to daily measures

covid_df_all_states_daily <- select(covid_df_all_states, Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)

Identifying Countries with the Highest Number of Deaths

To determine which countries have had the highest number of deaths due to COVID-19, we can use data visualization to explore all the data at once. This provides a global view of the data.

#Sum up the total deaths by countries

covid_df_all_states_cumulative_max <- covid_df_all_states_cumulative %>%
  
group_by(Continent_Name, Two_Letter_Country_Code)%>%
filter(death>0)%>%
summarise(total_death = sum(death),
          total_positive = sum(positive),
          total_hospitalized = sum(hospitalized),
          total_recovered = sum(recovered),
          sum_total = sum(total_tested)) %>%
arrange(-total_death)
## `summarise()` regrouping output by 'Continent_Name' (override with `.groups` argument)
#Load ggplot2 to visualize the top deaths by country

library(ggplot2)
qplot(x=Two_Letter_Country_Code,
      y=total_death,
      colour=Continent_Name,
      data=covid_df_all_states_cumulative_max)

death_top_3<-c("US", "IT", "GB")

As we can see from the plot, the top three countries with the highest COVID-19 deaths are the United States of America, Italy, and Great Britain.

Extracting the Top Ten Tested Cases Countries

The data can now be agregated by country by summing up the daily totals.

covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%

group_by(Country_Region) %>%

summarise(tested = sum(daily_tested),
          positive = sum(daily_positive),
          active = sum(active),
          hospitalized = sum(hospitalizedCurr)) %>%
arrange(-tested)
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s isolate only the top ten countries:

covid_top_10 <- head(covid_df_all_states_daily_sum, 10)

print(covid_top_10)
## # A tibble: 10 x 5
##    Country_Region   tested positive  active hospitalized
##    <chr>             <int>    <int>   <int>        <int>
##  1 United States  17282363  1877179       0            0
##  2 Russia         10542266   406368 6924890            0
##  3 Italy           4091291   251710 6202214      1699003
##  4 India           3692851    60959       0            0
##  5 Turkey          2031192   163941 2980960            0
##  6 Canada          1654779    90873   56454            0
##  7 United Kingdom  1473672   166909       0            0
##  8 Australia       1252900     7200  134586         6655
##  9 Peru             976790    59497       0            0
## 10 Poland           928256    23987  538203            0

Identifying the Highest Positive Against Tested Cases

The data from the top countries can be stored into vectors for analysis

#Column data is stored into vectors
countries <- covid_top_10$Country_Region
tested_cases <- covid_top_10$tested
positive_cases <- covid_top_10$positive
active_cases <- covid_top_10$active
hospitalized_cases <- covid_top_10$hospitalized
  
#Vectors are named by their respective countries 
names(tested_cases) <- countries
names(positive_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries

#Calculating the ratio of positive cases against the number of tested cases
pos_vs_test<-positive_cases/tested_cases
print(pos_vs_test)
##  United States         Russia          Italy          India         Turkey 
##    0.108618191    0.038546552    0.061523368    0.016507300    0.080711720 
##         Canada United Kingdom      Australia           Peru         Poland 
##    0.054915490    0.113260617    0.005746668    0.060910738    0.025840932
positive_tested_top_3<-c("United Kingdom"=0.11, "United States"=0.10, "Turkey"=0.08)
print(positive_tested_top_3)
## United Kingdom  United States         Turkey 
##           0.11           0.10           0.08

We can see that United Kingdom has the highest number of positive cases of COVID-19 per total number of individuals tested.

Scaling of Data to Population Level

To win this fight against the virus, each country has to defend itself as best it can. We can quantify this effort for the top ten tested cases at the population level of each country.

We first need the total population of each of the top ten countries:

covid_mat <- cbind(tested_cases, positive_cases, active_cases, hospitalized_cases)
population<- c(331002651, 145934462, 60461826, 1380004385, 84339067, 37742154, 67886011, 25499884, 32971854, 37846611)

#Compare all data in proportion to the countries population
covid_mat <- covid_mat*100/population
print(covid_mat)
##                tested_cases positive_cases active_cases hospitalized_cases
## United States      5.221216    0.567119023    0.0000000         0.00000000
## Russia             7.223973    0.278459244    4.7452054         0.00000000
## Italy              6.766734    0.416312269   10.2580660         2.81004249
## India              0.267597    0.004417305    0.0000000         0.00000000
## Turkey             2.408364    0.194383227    3.5344949         0.00000000
## Canada             4.384432    0.240773221    0.1495781         0.00000000
## United Kingdom     2.170804    0.245866560    0.0000000         0.00000000
## Australia          4.913356    0.028235423    0.5277906         0.02609816
## Peru               2.962496    0.180447845    0.0000000         0.00000000
## Poland             2.452679    0.063379519    1.4220639         0.00000000

Ranking Countries Related to their population

To be able to compare these values to determine how the countries have been impacted related to one another, we can rank each country.

#ranking each column from previous Covid-19 matrix

tested_cases_rank <- rank(covid_mat[,"tested_cases"])
positive_cases_rank <- rank(covid_mat[,"positive_cases"])
active_cases_rank <- rank(covid_mat[,"active_cases"])
hospitalized_cases_rank <- rank(covid_mat[,"hospitalized_cases"])

#bind each ranking by row
covid_mat_rank <- rbind(tested_cases_rank,positive_cases_rank,active_cases_rank,hospitalized_cases_rank)
covid_mat_rank
##                         United States Russia Italy India Turkey Canada
## tested_cases_rank                 8.0   10.0     9   1.0    3.0    6.0
## positive_cases_rank              10.0    8.0     9   1.0    5.0    6.0
## active_cases_rank                 2.5    9.0    10   2.5    8.0    5.0
## hospitalized_cases_rank           4.5    4.5    10   4.5    4.5    4.5
##                         United Kingdom Australia Peru Poland
## tested_cases_rank                  2.0         7  5.0    4.0
## positive_cases_rank                7.0         2  4.0    3.0
## active_cases_rank                  2.5         6  2.5    7.0
## hospitalized_cases_rank            4.5         9  4.5    4.5

Identifying Affected Countries Related to their Population

We can provide each country with an aggregate score by summing the rankings.

colSums(covid_mat_rank[-1, ])
##  United States         Russia          Italy          India         Turkey 
##           17.0           21.5           29.0            8.0           17.5 
##         Canada United Kingdom      Australia           Peru         Poland 
##           15.5           14.0           17.0           11.0           14.5
best_effort_tested_cased_top_3 <- c("India", "United Kingdom", "Turkey")
most_affected_country <- "Italy"
least_affected_country <- "India"

Putting it all Together

Now that we have completed our analysis, lets review the questions we were trying to answer.

Which countries have had the highest number of deaths due to COVID- 19?

Which countries have had the highest number of positive cases against the number of tests?

Which countries have made the best effort in terms of the number of COVID-19 tests conducted related to their population?

Which countries were ultimately the most and least affected related to their population?