Introduction
This is my first practice project in the Data Analyst learning program
at DataQuest. The project is about the COVID-19 pandemic, which first
broke out in China and then spread around the world, disrupting daily
life in virtually every country. To fight the pandemic, scientists
relied on data collected about cases to analyze the outbreak, understand
the disease, and look for solutions. This guided project uses a data set
collected between January 20th and June 1st, 2020.
The objective of this exercise is to build my skills and understanding of the data analysis workflow by evaluating the COVID-19 situation through this data set. By the end of the project, my goal is to answer the question: Which countries have had the highest number of positive cases against the number of tests?
Data
Data Import
library(readr)
tested_worldwide <- read_csv("C:/Users/babao/OneDrive - Oakland City University/OAKLAND CITY/DATA ANALYTICS/DataQuest/Data/tested_worldwide.csv")
## Rows: 27641 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country_Region, Province_State
## dbl (9): positive, active, hospitalized, hospitalizedCurr, recovered, death...
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid_df <- tested_worldwide
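The column-specification message printed above can be avoided by declaring the column types when calling read_csv(). The following is a minimal sketch, assuming readr 2.0 or later (for literal input wrapped in I()) and using a made-up two-row CSV in place of the real file; csv_text and sample_df are hypothetical names.

```r
library(readr)

# Hypothetical inline CSV standing in for tested_worldwide.csv
csv_text <- I("Date,Country_Region,Province_State,positive\n2020-01-16,Iceland,All States,3\n2020-01-17,Iceland,All States,4\n")

# Declaring the types up front means readr has nothing to guess or report
sample_df <- read_csv(
  csv_text,
  col_types = cols(
    Date = col_date(),
    Country_Region = col_character(),
    Province_State = col_character(),
    positive = col_double()
  )
)

# Show the column specification that was actually used
spec(sample_df)
```

Passing show_col_types = FALSE instead keeps type guessing but silences the message, as the message itself suggests.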
Data Structure
We start by exploring the structure of the data frame using some common inspection functions.
dim(covid_df)
## [1] 27641 12
colnames(covid_df)
## [1] "Date" "Country_Region" "Province_State" "positive"
## [5] "active" "hospitalized" "hospitalizedCurr" "recovered"
## [9] "death" "total_tested" "daily_tested" "daily_positive"
The dataset has the following 12 columns:
vector_cols <- colnames(covid_df)
vector_cols
## [1] "Date" "Country_Region" "Province_State" "positive"
## [5] "active" "hospitalized" "hospitalizedCurr" "recovered"
## [9] "death" "total_tested" "daily_tested" "daily_positive"
class(vector_cols)
## [1] "character"
head(covid_df)
## # A tibble: 6 × 12
##   Date       Country_Region Province_State positive active hospitalized
##   <date>     <chr>          <chr>             <dbl>  <dbl>        <dbl>
## 1 2020-01-16 Iceland        All States            3     NA           NA
## 2 2020-01-17 Iceland        All States            4     NA           NA
## 3 2020-01-18 Iceland        All States            7     NA           NA
## 4 2020-01-20 South Korea    All States            1     NA           NA
## 5 2020-01-22 United States  All States            0     NA           NA
## 6 2020-01-22 United States  Massachusetts         0     NA           NA
## # … with 6 more variables: hospitalizedCurr <dbl>, recovered <dbl>,
## #   death <dbl>, total_tested <dbl>, daily_tested <dbl>, daily_positive <dbl>
library(tibble)
glimpse(covid_df)
## Rows: 27,641
## Columns: 12
## $ Date <date> 2020-01-16, 2020-01-17, 2020-01-18, 2020-01-20, 2020…
## $ Country_Region <chr> "Iceland", "Iceland", "Iceland", "South Korea", "Unit…
## $ Province_State <chr> "All States", "All States", "All States", "All States…
## $ positive <dbl> 3, 4, 7, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, NA, NA, NA,…
## $ active <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalized <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalizedCurr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ recovered <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ death <dbl> NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA…
## $ total_tested <dbl> NA, NA, NA, 4, 0, 0, 0, 0, 0, 0, 27, 0, 0, 0, NA, NA,…
## $ daily_tested <dbl> NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 5, 0, 0, 0, NA, …
## $ daily_positive <dbl> NA, 1, 3, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, NA, NA…
Dataframe
At this point we want to extract the data that is relevant to the underlying question of this project. We keep only the country-level rows, that is, the rows where Province_State is “All States”.
library(dplyr)
covid_df_all_states <- covid_df %>%
  filter(Province_State == "All States") %>%
  select(-Province_State)
covid_df_all_states_daily <- covid_df_all_states %>%
  select(Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)
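The filter on "All States" matters because the file mixes country-level rows with state-level rows (the head() output shows both "All States" and "Massachusetts" for the United States). Assuming the "All States" rows already hold the country-wide figures, summing every row would count some cases twice. A small sketch with made-up numbers (toy_df and its values are hypothetical):

```r
library(dplyr)

# Hypothetical data: one country-level row and one state-level row for the same country
toy_df <- tibble(
  Country_Region = c("United States", "United States", "Iceland"),
  Province_State = c("All States", "Massachusetts", "All States"),
  daily_tested   = c(100, 40, 10)
)

# Keep only the country-level rows, then drop the now-constant column
country_level <- toy_df %>%
  filter(Province_State == "All States") %>%
  select(-Province_State)

country_level
```

After the filter, the state-level row is gone, so per-country totals are built from country-level figures only.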
Top Ten Countries by Number of Tests
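# Sum the daily figures per country. Note: sum() returns NA whenever a group
# contains a missing value, which is why most totals in the output below are NA
# (na.rm = TRUE would skip missing values instead).
covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
  group_by(Country_Region) %>%
  summarize(tested = sum(daily_tested),
            positive = sum(daily_positive),
            active = sum(active),
            hospitalized = sum(hospitalizedCurr)) %>%
  arrange(desc(tested))
# With tested NA for every country, arrange() cannot rank the rows, so head()
# returns the first ten countries in alphabetical order rather than a true top ten.
covid_top_10 <- head(covid_df_all_states_daily_sum, 10)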
covid_top_10
## # A tibble: 10 × 5
##    Country_Region tested positive   active hospitalized
##    <chr>           <dbl>    <dbl>    <dbl>        <dbl>
##  1 Afghanistan        NA       NA       NA           NA
##  2 Albania            NA       NA       NA           NA
##  3 Algeria            NA       NA       NA           NA
##  4 Argentina          NA       NA       NA           NA
##  5 Armenia            NA       NA  1846922           NA
##  6 Australia          NA       NA       NA           NA
##  7 Austria            NA       NA       NA           NA
##  8 Azerbaijan         NA       NA       NA           NA
##  9 Bahrain            NA       NA       NA           NA
## 10 Bangladesh         NA       NA 14479558           NA
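The NA totals in the table above come from sum(), which returns NA as soon as one value in a group is missing. One common way around this is na.rm = TRUE. The sketch below uses made-up data (toy_daily and its values are hypothetical); whether dropping missing days is appropriate for the real dataset is a judgment call that this sketch does not settle.

```r
library(dplyr)

# Hypothetical daily data with a missing day for country "A"
toy_daily <- tibble(
  Country_Region = c("A", "A", "B", "B"),
  daily_tested   = c(10, NA, 5, 7),
  daily_positive = c(1, NA, 2, 3)
)

toy_sum <- toy_daily %>%
  group_by(Country_Region) %>%
  summarize(
    tested_default = sum(daily_tested),               # NA for "A": one day is missing
    tested         = sum(daily_tested, na.rm = TRUE), # missing days are skipped
    positive       = sum(daily_positive, na.rm = TRUE)
  ) %>%
  arrange(desc(tested))

toy_sum
```

With non-missing totals, arrange(desc(tested)) can actually rank the countries, so head(..., 10) would return a real top ten.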
countries <- covid_top_10$Country_Region
tested_cases <- covid_top_10$tested
positive_cases <- covid_top_10$positive
active_cases <- covid_top_10$active
hospitalized_cases <- covid_top_10$hospitalized
names(positive_cases) <- countries
names(tested_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries
positive_cases
## Afghanistan Albania Algeria Argentina Armenia Australia
## NA NA NA NA NA NA
## Austria Azerbaijan Bahrain Bangladesh
## NA NA NA NA
sum(positive_cases)
## [1] NA
mean(positive_cases)
## [1] NA
positive_cases/sum(positive_cases)
## Afghanistan Albania Algeria Argentina Armenia Australia
## NA NA NA NA NA NA
## Austria Azerbaijan Bahrain Bangladesh
## NA NA NA NA
positive_cases/tested_cases
## Afghanistan Albania Algeria Argentina Armenia Australia
## NA NA NA NA NA NA
## Austria Azerbaijan Bahrain Bangladesh
## NA NA NA NA
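To answer the project question directly, the positive-to-tested ratio can be computed per country and then ranked. The following is a sketch on made-up totals; the country names and numbers are hypothetical and are not results from the dataset.

```r
library(dplyr)

# Hypothetical per-country totals
toy_totals <- tibble(
  Country_Region = c("A", "B", "C", "D"),
  tested   = c(1000, 500, 2000, 0),
  positive = c(100, 75, 120, 0)
)

top_3 <- toy_totals %>%
  filter(tested > 0) %>%                  # avoid dividing by zero
  mutate(ratio = positive / tested) %>%   # share of tests that were positive
  arrange(desc(ratio)) %>%
  slice_head(n = 3)

top_3
```

Ranking by the ratio rather than by the raw number of tests keeps countries with few tests but a high positive share from being overlooked.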
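# Values entered by hand for the three countries reported as the answer
positive_tested_top_3 <- c("United Kingdom" = 0.11, "United States" = 0.10, "Turkey" = 0.08)
# Each vector holds, in order: ratio, tested, positive, active, hospitalized
united_kingdom <- c(0.11, 1473672, 166909, 0, 0)
united_states <- c(0.10, 17282363, 1877179, 0, 0)
turkey <- c(0.08, 2031192, 163941, 2980960, 0)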
Top 3 countries table
Create a matrix for the top three countries:
covid_mat <- rbind(united_kingdom, united_states, turkey)
colnames(covid_mat) <- c("Ratio", "tested", "positive", "active", "hospitalized")
covid_mat
##                Ratio   tested positive  active hospitalized
## united_kingdom  0.11  1473672   166909       0            0
## united_states   0.10 17282363  1877179       0            0
## turkey          0.08  2031192   163941 2980960            0
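The Ratio column above was typed in as literal values. It can also be derived from the tested and positive columns. The sketch below rebuilds the same counts (copied from the matrix above; counts and ratio are hypothetical names) and computes the ratio with matrix indexing.

```r
# Counts copied from the matrix above
counts <- rbind(
  united_kingdom = c(tested = 1473672,  positive = 166909),
  united_states  = c(tested = 17282363, positive = 1877179),
  turkey         = c(tested = 2031192,  positive = 163941)
)

# Dividing one column by another keeps the row names
ratio <- counts[, "positive"] / counts[, "tested"]
round(ratio, 3)
```

Rounded to three places this gives roughly 0.113, 0.109 and 0.081, so the 0.10 entered for the United States appears to be truncated rather than rounded. The ranking of the three countries is the same either way.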
Project Question and Answer
question <- "Which countries have had the highest number of positive cases against the number of tests?"
answer <- c("Positive tested cases" = positive_tested_top_3)
datasets <- list(
  original = covid_df,
  allstates = covid_df_all_states,
  daily = covid_df_all_states_daily,
  top_10 = covid_top_10
)
matrices <- list(covid_mat)
vectors <- list(vector_cols, countries)
data_structure_list <- list("dataframe" = datasets, "matrix" = matrices, "vector" = vectors)
covid_analysis_list <- list(question, answer, data_structure_list)
covid_analysis_list[[2]]
## Positive tested cases.United Kingdom Positive tested cases.United States
## 0.11 0.10
## Positive tested cases.Turkey
## 0.08
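Elements deeper in a nested list like this one can be reached by chaining [[ ]] and $. A small stand-alone sketch follows; toy_list is a hypothetical, simplified stand-in for covid_analysis_list.

```r
# Hypothetical nested list: a question, a named answer vector, and stored structures
toy_list <- list(
  "What is the question?",
  c(first = 0.11, second = 0.10),
  list(
    matrix = list(matrix(1:4, nrow = 2)),
    vector = list(c("a", "b"))
  )
)

toy_list[[2]]["first"]           # one named element of the answer vector
toy_list[[3]]$matrix[[1]]        # the first stored matrix
toy_list[[3]][["vector"]][[1]]   # same idea, using [[ ]] with a name
```

Because the top-level elements here are unnamed, they are reached by position; naming them (for example question = ..., answer = ...) would allow access with $ as well.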