INVESTIGATING COVID-19 VIRUS TRENDS

by Dr. Michael Ige

Introduction
This is my first practice project in the Data Analyst learning path at DataQuest. The project is about the COVID-19 pandemic, which first broke out in China and then spread around the world, disrupting daily life in virtually every country. To fight the pandemic, scientists relied on data collected about cases to analyze and understand the disease and to find solutions. This guided project uses a data set collected between January 20th and June 1st, 2020.

The objective of this exercise is to build my skills and understanding of the data analysis workflow by evaluating the COVID-19 situation through this data set. By the end of the project, my goal is to answer the question: Which countries have had the highest number of positive cases against the number of tests?

Data

The data set for this project comes from: https://app.dataquest.io/c/92/m/505/guided-project%3A-investigating-covid-19-virus-trends/2/understanding-the-data. It contains the daily and cumulative number of COVID-19 tests conducted and the number of positive, hospitalized, recovered, and death cases reported by country.

Data Import

The first step in getting familiar with the data is to import and store the data set:

library(readr)
tested_worldwide <- read_csv("C:/Users/babao/OneDrive - Oakland City University/OAKLAND CITY/DATA ANALYTICS/DataQuest/Data/tested_worldwide.csv")

## Rows: 27641 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country_Region, Province_State
## dbl  (9): positive, active, hospitalized, hospitalizedCurr, recovered, death...
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

covid_df <- tested_worldwide

The data set has now been imported and stored as “covid_df”.
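The import message above suggests specifying the column types. A minimal sketch of that option, assuming readr 2.0 or later and using a three-row inline sample in place of the real file (the names sample_csv and sample_df are introduced here for illustration only):

```r
library(readr)

# Three-row inline sample standing in for tested_worldwide.csv.
# I() tells read_csv() to treat the string as literal data, not a path.
sample_csv <- I("Date,Country_Region,Province_State,positive
2020-01-16,Iceland,All States,3
2020-01-17,Iceland,All States,4
2020-01-18,Iceland,All States,7
")

# Declaring the column types up front means read_csv() does not have to
# guess them and does not print the column specification message.
sample_df <- read_csv(
  sample_csv,
  col_types = cols(
    Date = col_date(format = "%Y-%m-%d"),
    Country_Region = col_character(),
    Province_State = col_character(),
    positive = col_double()
  )
)

class(sample_df$Date)
```

Explicit types also make the import fail loudly, with parsing problems reported, if a column ever arrives in an unexpected format.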

Data Structure

We now explore the structure of the data frame using some common inspection functions.

Dimension of the dataframe:

dim(covid_df)

## [1] 27641    12

Column names of the dataframe:

colnames(covid_df)

##  [1] "Date"             "Country_Region"   "Province_State"   "positive"        
##  [5] "active"           "hospitalized"     "hospitalizedCurr" "recovered"       
##  [9] "death"            "total_tested"     "daily_tested"     "daily_positive"

Column descriptions.

The dataset has the following 12 columns:

Date: Date
Country_Region: Country names
Province_State: States/province names; value is All States when state/provincial level data is not available
positive: Cumulative number of positive cases reported.
active: Number of active cases on that day.
hospitalized: Cumulative number of hospitalized cases reported.
hospitalizedCurr: Number of actively hospitalized cases on that day.
recovered: Cumulative number of recovered cases reported.
death: Cumulative number of deaths reported.
total_tested: Cumulative number of tests conducted.
daily_tested: Number of tests conducted on the day; if daily data is unavailable, daily tested is averaged across number of days in between.
daily_positive: Number of positive cases reported on the day; if daily data is unavailable, daily positive is averaged across number of days in between.
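The difference between the cumulative and the daily columns can be illustrated with a short sketch. It uses the three Iceland values for 2020-01-16 to 2020-01-18 that appear in this data set; the variable names are chosen here for illustration:

```r
# Cumulative positive counts on three consecutive days.
cumulative_positive <- c(3, 4, 7)

# diff() gives the day-over-day change. The first day has no previous
# value to compare against, so it is padded with NA.
daily_positive <- c(NA, diff(cumulative_positive))
daily_positive   # NA, 1, 3

# Going the other way, a running total of the daily values rebuilds the
# cumulative series once the starting count is added back.
rebuilt <- cumulative_positive[1] + cumsum(daily_positive[-1])
rebuilt          # 4, 7
```

This is why the daily columns are the ones to sum per country: summing a cumulative column would count the same cases many times over.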

Store the column names as “vector_cols”, display the contents, and determine the data structure of the variable.

vector_cols <- colnames(covid_df)
vector_cols

##  [1] "Date"             "Country_Region"   "Province_State"   "positive"        
##  [5] "active"           "hospitalized"     "hospitalizedCurr" "recovered"       
##  [9] "death"            "total_tested"     "daily_tested"     "daily_positive"

class(vector_cols)

## [1] "character"
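class() reports "character" because colnames() returns a character vector with one string per column. A short sketch of a few related checks, using a hypothetical three-column data frame (toy_df) rather than covid_df:

```r
# Hypothetical data frame used only for illustration.
toy_df <- data.frame(
  Date = as.Date(c("2020-01-16", "2020-01-17")),
  Country_Region = c("Iceland", "Iceland"),
  positive = c(3, 4)
)

toy_cols <- colnames(toy_df)

class(toy_cols)      # "character": the type of the elements
is.vector(toy_cols)  # TRUE: a plain atomic vector
length(toy_cols)     # 3: one name per column, same as ncol(toy_df)

# The columns themselves can have different classes even though their
# names are all strings.
sapply(toy_df, class)
```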

Display the first few rows of the dataset:

head(covid_df)

## # A tibble: 6 × 12
##   Date       Country_Region Province_State positive active hospitalized
##   <date>     <chr>          <chr>             <dbl>  <dbl>        <dbl>
## 1 2020-01-16 Iceland        All States            3     NA           NA
## 2 2020-01-17 Iceland        All States            4     NA           NA
## 3 2020-01-18 Iceland        All States            7     NA           NA
## 4 2020-01-20 South Korea    All States            1     NA           NA
## 5 2020-01-22 United States  All States            0     NA           NA
## 6 2020-01-22 United States  Massachusetts         0     NA           NA
## # … with 6 more variables: hospitalizedCurr <dbl>, recovered <dbl>,
## #   death <dbl>, total_tested <dbl>, daily_tested <dbl>, daily_positive <dbl>

Display the summary of the dataset.

library(tibble)
glimpse(covid_df)

## Rows: 27,641
## Columns: 12
## $ Date             <date> 2020-01-16, 2020-01-17, 2020-01-18, 2020-01-20, 2020…
## $ Country_Region   <chr> "Iceland", "Iceland", "Iceland", "South Korea", "Unit…
## $ Province_State   <chr> "All States", "All States", "All States", "All States…
## $ positive         <dbl> 3, 4, 7, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, NA, NA, NA,…
## $ active           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalized     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalizedCurr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ recovered        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ death            <dbl> NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA…
## $ total_tested     <dbl> NA, NA, NA, 4, 0, 0, 0, 0, 0, 0, 27, 0, 0, 0, NA, NA,…
## $ daily_tested     <dbl> NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 5, 0, 0, 0, NA, …
## $ daily_positive   <dbl> NA, 1, 3, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, NA, NA…

The “glimpse()” function is a convenient way to see the columns of the dataset along with the first few values of each.
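The glimpse() output shows many NA values. One way to quantify them is to count the missing values per column. The sketch below does this on a small hypothetical data frame (toy_df) standing in for covid_df:

```r
# Hypothetical sample with missing values, for illustration only.
toy_df <- data.frame(
  positive = c(3, 4, NA, 1),
  active = c(NA_real_, NA_real_, NA_real_, NA_real_),
  daily_tested = c(NA, 10, 12, NA)
)

# is.na() marks each missing cell; colSums() counts them per column.
na_counts <- colSums(is.na(toy_df))
na_counts

# Share of missing values per column, between 0 and 1.
na_share <- na_counts / nrow(toy_df)
na_share
```

Running colSums(is.na(covid_df)) on the real data would show which measures are complete enough to aggregate.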

Dataframe

At this point we want to extract the data that is relevant to the underlying question of this project. We shall keep only the country-level data, that is, the rows where “Province_State” is “All States”.

Filter the rows related to “All States”, remove the “Province_State” column, and store the result in “covid_df_all_states”:

library(dplyr)

covid_df_all_states <- covid_df %>%
  filter(Province_State == "All States") %>%
  select(-Province_State)

We are able to remove the “Province_State” column without losing information because we are interested in the data at the country level only.
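The head() output earlier shows why the filter matters: the United States appears on the same date both as “All States” and as “Massachusetts”, so keeping both kinds of rows could count the same tests twice. A minimal sketch on hypothetical rows (toy_df and its numbers are made up for illustration):

```r
library(dplyr)

# Hypothetical rows mimicking the layout of covid_df: a country-level
# row ("All States") and a state-level row for the same country.
toy_df <- tibble(
  Country_Region = c("United States", "United States", "Iceland"),
  Province_State = c("All States", "Massachusetts", "All States"),
  daily_tested = c(100, 20, 5)
)

# Keep only country-level rows, then drop the now-constant column.
toy_all_states <- toy_df %>%
  filter(Province_State == "All States") %>%
  select(-Province_State)

toy_all_states
```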

Extract the columns related to the daily measures and store results as “covid_df_all_states_daily”:

covid_df_all_states_daily <- covid_df_all_states %>% 
  select(Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)

Top Ten Cases Countries Data

Summarize the “covid_df_all_states_daily” dataframe by computing the sum of the number of tested, positive, active, and hospitalized cases grouped by the “Country_Region” column, and sort the result by the number of tests in descending order:

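# Sum each daily measure per country and sort by the number of tests.
# sum() yields NA for a country if any of its daily values is missing.
covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
  group_by(Country_Region) %>%
  summarize(tested = sum(daily_tested),
            positive = sum(daily_positive),
            active = sum(active),
            hospitalized = sum(hospitalizedCurr)) %>%
  arrange(desc(tested))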

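Extract the first ten rows from the “covid_df_all_states_daily_sum” dataframe and store them as “covid_top_10”. Every “tested” total is NA in the output below, so arrange() has nothing to rank and the rows are simply the first ten countries in alphabetical order: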

covid_top_10 <- head(covid_df_all_states_daily_sum, 10)
covid_top_10

## # A tibble: 10 × 5
##    Country_Region tested positive   active hospitalized
##    <chr>           <dbl>    <dbl>    <dbl>        <dbl>
##  1 Afghanistan        NA       NA       NA           NA
##  2 Albania            NA       NA       NA           NA
##  3 Algeria            NA       NA       NA           NA
##  4 Argentina          NA       NA       NA           NA
##  5 Armenia            NA       NA  1846922           NA
##  6 Australia          NA       NA       NA           NA
##  7 Austria            NA       NA       NA           NA
##  8 Azerbaijan         NA       NA       NA           NA
##  9 Bahrain            NA       NA       NA           NA
## 10 Bangladesh         NA       NA 14479558           NA
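The NA totals in the table above are what sum() returns whenever a group contains at least one missing value. A minimal sketch of one common remedy, passing na.rm = TRUE, on hypothetical data (toy_daily and its values are made up; this is not the output of the real data set):

```r
library(dplyr)

# Hypothetical daily records; country "B" has one missing test count.
toy_daily <- tibble(
  Country_Region = c("A", "A", "B", "B"),
  daily_tested = c(10, 20, 100, NA),
  daily_positive = c(1, 2, 30, 5)
)

toy_sum <- toy_daily %>%
  group_by(Country_Region) %>%
  summarize(
    tested_default = sum(daily_tested),          # NA for "B"
    tested = sum(daily_tested, na.rm = TRUE),    # skips the missing day
    positive = sum(daily_positive, na.rm = TRUE)
  ) %>%
  arrange(desc(tested))

toy_sum
```

With na.rm = TRUE the missing days contribute nothing to the total, so the sums are best read as lower bounds for countries with incomplete reporting.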

Create vectors from “covid_top_10” dataframe:

countries <- covid_top_10$Country_Region
tested_cases <- covid_top_10$tested
positive_cases <- covid_top_10$positive
active_cases <- covid_top_10$active
hospitalized_cases <- covid_top_10$hospitalized

Name the vectors:

names(positive_cases) <- countries
names(tested_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries

Inspect the positive cases and compute their ratio to the tested cases:

positive_cases

## Afghanistan     Albania     Algeria   Argentina     Armenia   Australia 
##          NA          NA          NA          NA          NA          NA 
##     Austria  Azerbaijan     Bahrain  Bangladesh 
##          NA          NA          NA          NA

sum(positive_cases)

## [1] NA

mean(positive_cases)

## [1] NA

positive_cases/sum(positive_cases)

## Afghanistan     Albania     Algeria   Argentina     Armenia   Australia 
##          NA          NA          NA          NA          NA          NA 
##     Austria  Azerbaijan     Bahrain  Bangladesh 
##          NA          NA          NA          NA

positive_cases/tested_cases

## Afghanistan     Albania     Algeria   Argentina     Armenia   Australia 
##          NA          NA          NA          NA          NA          NA 
##     Austria  Azerbaijan     Bahrain  Bangladesh 
##          NA          NA          NA          NA
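With complete totals, the same named-vector arithmetic ranks the countries directly. A sketch with hypothetical numbers (the countries A to D and their counts are made up for illustration):

```r
# Hypothetical totals for four countries, named by country.
tested_toy <- c(A = 1000, B = 500, C = 2000, D = 800)
positive_toy <- c(A = 50, B = 60, C = 40, D = 80)

# Division is element-wise and the result keeps the country names.
ratio_toy <- positive_toy / tested_toy

# Sort from highest to lowest and keep the first three.
top_3_toy <- head(sort(ratio_toy, decreasing = TRUE), 3)
round(top_3_toy, 2)
```

Note that sort() drops NA values by default, so applying it to an all-NA vector returns an empty result rather than a ranking.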

Store the top three positive-to-tested ratios in a named vector. The ratios computed above are all NA, so the values here are entered by hand:

positive_tested_top_3 <- c("United Kingdom" = 0.11, "United States" = 0.10, "Turkey" = 0.08)

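Create a vector for each of the top three countries, holding its ratio, tested, positive, active, and hospitalized values: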

united_kingdom <- c(0.11, 1473672, 166909, 0, 0)
united_states <- c(0.10, 17282363, 1877179, 0, 0)
turkey <- c(0.08, 2031192, 163941, 2980960, 0)

Top 3 countries table
Create a matrix for the top three countries:

covid_mat <- rbind(united_kingdom, united_states, turkey)

Name the columns:

colnames(covid_mat) <- c("Ratio", "tested", "positive", "active", "hospitalized")
covid_mat

##                Ratio   tested positive  active hospitalized
## united_kingdom  0.11  1473672   166909       0            0
## united_states   0.10 17282363  1877179       0            0
## turkey          0.08  2031192   163941 2980960            0
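The “Ratio” column can be cross-checked against the “positive” and “tested” columns of the same matrix. The sketch below rebuilds the matrix from the values listed above; covid_mat_check and ratio_check are names introduced here for illustration:

```r
# Rebuild the matrix from the values listed in this section.
covid_mat_check <- rbind(
  united_kingdom = c(0.11, 1473672, 166909, 0, 0),
  united_states  = c(0.10, 17282363, 1877179, 0, 0),
  turkey         = c(0.08, 2031192, 163941, 2980960, 0)
)
colnames(covid_mat_check) <- c("Ratio", "tested", "positive", "active", "hospitalized")

# Indexing by column name returns a named numeric vector, one entry per row.
ratio_check <- covid_mat_check[, "positive"] / covid_mat_check[, "tested"]
round(ratio_check, 4)
```

The recomputed value for the United States is about 0.109, so the stored 0.10 appears to be truncated rather than rounded to two decimals.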

Project Question and Answer

question <- "Which countries have had the highest number of positive cases against the number of tests?"
answer <- c("Positive tested cases" = positive_tested_top_3)
datasets <- list(
  original = covid_df,
  allstates = covid_df_all_states,
  daily = covid_df_all_states_daily,
  top_10 = covid_top_10
)
matrices <- list(covid_mat)
vectors <- list(vector_cols, countries)
data_structure_list <- list("dataframe" = datasets, "matrix" = matrices, "vector" = vectors)
covid_analysis_list <- list(question, answer, data_structure_list)
covid_analysis_list[[2]]

## Positive tested cases.United Kingdom  Positive tested cases.United States 
##                                 0.11                                 0.10 
##         Positive tested cases.Turkey 
##                                 0.08
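The list above is indexed by position with [[2]]. Naming the top-level elements is an alternative that makes the lookups easier to read. A sketch with a smaller, hypothetical list (analysis_toy), reusing the three ratios from this section:

```r
# Smaller illustrative version of the analysis list with named elements.
analysis_toy <- list(
  question = "Which countries have the highest positive-to-tested ratio?",
  answer = c("United Kingdom" = 0.11, "United States" = 0.10, "Turkey" = 0.08),
  data_structures = list(vector = c("Date", "Country_Region"))
)

# Three equivalent ways to reach the answer.
analysis_toy[[2]]         # by position
analysis_toy[["answer"]]  # by name
analysis_toy$answer       # by name, shorthand

# Nested elements chain the same operators.
analysis_toy$data_structures$vector[1]
```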