---
title: "512 - Final Project"
author: "Zhuoxin Jiang"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
theme: readable
source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(dygraphs)
library(ggthemes)
library(corrplot)
library(countrycode)
library(stringr)
library(forcats)
library(plotly)
```
Project Objective {data-icon="fa-database"}
=============================
Column {data-width=400}
-----------------------------------------------------------------------
### Data Summary
Since the World Health Organization raised the alert on December 31, 2019, an unprecedented and extremely infectious virus has spread from Wuhan City, Hubei Province, China, to the rest of China and the world. Daily data on the affected people can provide insights and help people and governments take effective protective actions.
Data Collection: The dataset used in the project is 2019 Novel Coronavirus (2019-nCoV). There are three separate datasets:
1) time_series_covid_19_confirmed.csv;
2) time_series_covid_19_recovered.csv;
3) time_series_covid_19_deaths.csv
Data Description:
1. ObservationDate (wide format) - Date of the observation in MM/DD/YYYY
2. Province/State - Province or state of the observation (empty when missing)
3. Country/Region - Country of observation
4. Confirmed - Cumulative number of confirmed cases till that date
5. Deaths - Cumulative number of deaths till that date
6. Recovered - Cumulative number of recovered cases till that date
Research Questions:
1. What do COVID-19 cases (confirmed/recovered/death) look like around the world, by country/region?
2. Which countries/regions rank first, and which are in the top 10, for confirmed, death, and recovered cases?
3. What is the global trend in new COVID-19 cases (confirmed/recovered/death)?
4. What is the current situation in the U.S., such as new cases by day and the overall trend?
5. What are the trends in the U.S. mortality rate and recovery rate?
Visualization Methods: The project combines several visualization tools, mainly the plotly and ggplot2 packages. The charts presented and analyzed include interactive maps, bar charts, and time series charts.
Please find the write-up for each graphic in the Summary section.
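
Because the Confirmed, Deaths, and Recovered columns are cumulative, the questions about new cases need day-over-day differences. A minimal sketch of that step, assuming a single region and made-up counts (the names `cumulative_toy` and `daily_toy` are hypothetical and not part of the dashboard):

```r
library(dplyr)

# Made-up cumulative counts for one region (illustration only)
cumulative_toy <- tibble::tibble(
  Date = as.Date("2020-03-02") + 0:3,
  Confirmed = c(10, 15, 15, 22)
)

daily_toy <- cumulative_toy %>%
  arrange(Date) %>%
  # lag() shifts the series by one row; using the first value as the
  # default makes the first day's difference 0 instead of NA
  mutate(New_Cases = Confirmed - lag(Confirmed, default = first(Confirmed)))

daily_toy$New_Cases
```

With several regions in one table, the same `mutate()` would need a `group_by()` on the region columns first so that differences are not taken across regions.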
Column {data-width=600}
-----------------------------------------------------------------------
### Data Snapshot - Tabular View
```{r}
# import original datasets
confirm <- read_csv("~/Downloads/time_series_covid_19_confirmed.csv")
recover <- read_csv("~/Downloads/time_series_covid_19_recovered.csv")
death <- read_csv("~/Downloads/time_series_covid_19_deaths.csv")
# shape three datasets into long format
confirm_long <- confirm %>%
gather("Date", "Confirmed", -c("Province/State", "Country/Region", "Lat", "Long")) %>%
mutate(Date = as.Date(Date, "%m/%d/%y"))
recover_long <- recover %>%
gather("Date", "Recover", -c("Province/State", "Country/Region", "Lat", "Long")) %>%
mutate(Date = as.Date(Date, "%m/%d/%y"))
death_long <- death %>%
gather("Date", "Death", -c("Province/State", "Country/Region", "Lat", "Long")) %>%
mutate(Date = as.Date(Date, "%m/%d/%y"))
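# Combine three datasets on their shared key columns & keep data after March 1, 2020
join_keys <- c("Province/State", "Country/Region", "Lat", "Long", "Date")
df <- confirm_long %>% left_join(recover_long, by = join_keys) %>% left_join(death_long, by = join_keys)
df <- subset(df, Date > as.Date('2020-03-01'))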
df$Confirmed <- as.numeric(df$Confirmed)
df$`Province/State`<- as.factor(df$`Province/State`)
df$`Country/Region` <- as.factor(df$`Country/Region`)
# Missing values - replace the NA value at Confirmed/Death/Recover columns with 0
df$Confirmed[is.na(df$Confirmed)] <- 0
df$Death[is.na(df$Death)] <- 0
df$Recover[is.na(df$Recover)] <- 0
DT::datatable(df)
```
### Data Wrangling
After importing the three separate datasets, I reshaped them from wide to long format and converted the Date variable to the Date type using the as.Date() function. I then combined the three datasets with left joins on their shared columns (province/state, country/region, latitude, longitude, and date) and kept the observations after March 1, 2020. Finally, I replaced the NA/missing values in the Confirmed, Recover, and Death variables with zero, on the assumption that an NA value means no such cases were reported on that date.
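
The reshaping, joining, and NA-filling steps can be sketched on toy data. This is a minimal illustration under stated assumptions, not the dashboard's own code: the two small tables and the helper `to_long()` are hypothetical, the values are made up, and `pivot_longer()` is used here as tidyr's current counterpart to `gather()`.

```r
library(dplyr)
library(tidyr)

# Toy wide tables standing in for the real CSV files (values are made up)
confirm_toy <- tibble::tibble(
  `Country/Region` = c("A", "B"),
  `3/2/20` = c(10, 5),
  `3/3/20` = c(15, 8)
)
recover_toy <- tibble::tibble(
  `Country/Region` = "A",
  `3/2/20` = 1,
  `3/3/20` = 4
)

# Hypothetical helper: reshape one wide table into long format
to_long <- function(wide, value_name) {
  wide %>%
    pivot_longer(-`Country/Region`, names_to = "Date", values_to = value_name) %>%
    mutate(Date = as.Date(Date, format = "%m/%d/%y"))
}

combined_toy <- to_long(confirm_toy, "Confirmed") %>%
  # explicit keys make it clear which columns identify an observation
  left_join(to_long(recover_toy, "Recover"), by = c("Country/Region", "Date")) %>%
  # country "B" has no recovered rows, so its NA values become 0
  mutate(Recover = replace_na(Recover, 0))

combined_toy
```

Naming the join keys explicitly avoids relying on whichever columns the tables happen to share.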
Total COVID-19 Cases {data-icon="fa-globe" data-orientation=rows}
=============================
Row {data-height=180}
-----------------------------------------------------------------------
```{r}
# Create a new dataset of total cases grouped by Date and Country/Region
df_total <- df %>%
select(Date,`Country/Region`,Confirmed,Recover,Death) %>%
group_by(`Country/Region`,Date) %>%
summarise(total_confirmed =sum(Confirmed),
total_recovered =sum(Recover),
total_death = sum(Death)) %>%
filter(Date == max(Date))
# import Country codes dataset
df1 <- read.csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')
names(df1) <- c("Country","GDP_Billions","Code")
df1$Country <- df1$Country %>%
str_replace(pattern = "United States", replacement = "US") %>%
str_replace(pattern = "Macedonia", replacement = "North Macedonia") %>%
str_replace(pattern = "Czech Republic", replacement = "Czechia") %>%
str_replace(pattern = "Taiwan", replacement = "Taiwan*") %>%
str_replace(pattern = "West Bank", replacement = "West Bank and Gaza") %>%
str_replace(pattern = "Congo, Democratic Republic of the", replacement = "Congo (Kinshasa)") %>%
str_replace(pattern = "Congo, Republic of the", replacement = "Congo (Brazzaville)") %>%
str_replace(pattern = "Bahamas, The", replacement = "Bahamas") %>%
str_replace(pattern = "Swaziland", replacement = "Eswatini") %>%
str_replace(pattern = "Gambia, The", replacement = "Gambia")
names(df_total) <- c('Country','Date','Total_Confirmed_Cases','Total_Recovered_Cases','Total_Death_Cases')
# merge the dataset and filter out the missing rows
df_total_map <- df_total %>% left_join(df1, by = 'Country') %>% filter(!is.na(Code))
df_total_map$Country <- as.factor(df_total_map$Country)
# View(df_total_map)  # View() is meant for interactive sessions, so it is disabled for rendering
```
### Total Confirmed Cases
```{r}
total_confirm <- sum(df_total_map$Total_Confirmed_Cases)
valueBox(value = paste(round(total_confirm/1000000, 2),'Million'), caption = "Global Confirmed Cases Till Sept. 2020", color = ifelse(total_confirm > 30000000, "success", "warning"))
```
### Total Recovered Cases
```{R}
total_recover <- sum(df_total_map$Total_Recovered_Cases)
valueBox(value = paste(round(total_recover/1000000, 2),'Million'), caption = "Global Recovered Cases Till Sept. 2020", color = ifelse(total_recover > 20000000, "success", "warning"))
```
### Total Death Cases
```{r}
total_death <- sum(df_total_map$Total_Death_Cases)
valueBox(value = paste(round(total_death/1000000, 2),'Million'), caption = "Global Death Cases Till Sept. 2020", color = ifelse(total_death < 0, "success", "warning"))
```
Row {.tabset}
-----------------------------------------------------------------------
```{r}
# visualize the world map
line <- list(color = "grey", width = 0.3)
g <- list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = 'mercator')
) # specify map projection/options
```
### Confirmed Cases Map
```{R}
world_covid_con_map <- plot_geo(df_total_map) %>%
add_trace(
z = ~ Total_Confirmed_Cases, color = ~Total_Confirmed_Cases, colors = 'Blues',
text = ~ Country, locations = ~Code, marker = list(line = line)
) %>%
colorbar(title = 'Total Confirmed COVID-19 Cases') %>%
layout(
title = 'The Latest COVID-19 Confirmed Cases',
geo = g
)
htmltools::tagList(list(world_covid_con_map))
```
### Recovered Cases Map
```{R}
world_covid_rec_map <- plot_geo(df_total_map) %>%
add_trace(
z = ~ Total_Recovered_Cases, color = ~Total_Recovered_Cases, colors = 'Greens',
text = ~ Country, locations = ~Code, marker = list(line = line)
) %>%
colorbar(title = 'Total Recovered COVID-19 Cases') %>%
layout(
title = 'The Latest COVID-19 Recovered Cases',
geo = g
)
htmltools::tagList(list(world_covid_rec_map))
```
### Death Cases Map
```{r}
world_covid_dea_map <- plot_geo(df_total_map) %>%
add_trace(
z = ~ Total_Death_Cases, color = ~ Total_Death_Cases, colors = 'Reds',
text = ~ Country, locations = ~Code, marker = list(line = line)
) %>%
colorbar(title = 'Total Death COVID-19 Cases') %>%
layout(
title = 'The Latest COVID-19 Death Cases',
geo = g
)
htmltools::tagList(list(world_covid_dea_map))
```
New Cases Trend {data-icon="fa-chart-line"}
=============================
Column {data-width=500}
-----------------------------------------------------------------------
### New Cases - Tabular View
```{R}
new_cases <- df %>%
select(Date,Confirmed,Recover,Death) %>%
group_by(Date) %>%
summarise(Confirmed = sum(Confirmed),
Death = sum(Death),
Recovered = sum(Recover)) %>%
mutate("New_Cases" = Confirmed - lag(Confirmed, 1))
new_cases$New_Cases[is.na(new_cases$New_Cases)] <- 0
DT::datatable(new_cases)
```
Column {data-width=500}
-----------------------------------------------------------------------
### World New Confirmed Cases
```{r}
ggplot(new_cases, aes(x=Date)) +
geom_bar(aes(y=Confirmed), position = 'stack', stat = 'identity', fill='Blue') +
geom_bar(aes(y=Recovered), position = 'stack', stat = 'identity', fill='Green') +
geom_bar(aes(y=Death), position = 'stack', stat = 'identity', fill='Red') +
labs(title = "World Cases Trend",
subtitle = "Blue-Confirmed; Green-Recovered; Red-Death") +
# apply the complete theme first so the custom text settings below are not overridden
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = 'bold'),
plot.subtitle = element_text(size = 10, color = 'red')
)
```
Country/Region Analysis {data-icon="fa-flag" data-orientation=rows}
=============================
```{R}
df2 <- df_total_map %>%
select(Country,Total_Confirmed_Cases) %>% arrange(desc(Total_Confirmed_Cases))
df2_top10 <- df2[1:10,]
df3 <- df_total_map %>%
select(Country,Total_Recovered_Cases) %>% arrange(desc(Total_Recovered_Cases))
df3_top10 <- df3[1:10,]
df4 <- df_total_map %>%
select(Country,Total_Death_Cases) %>% arrange(desc(Total_Death_Cases))
df4_top10 <- df4[1:10,]
```
Row {data-height=180}
-----------------------------------------------------------------------
### Country - the most confirmed cases
```{R}
# take the name from the data instead of hard-coding it
top_confirmed <- df2[1,]
valueBox(value = as.character(top_confirmed$Country), caption = "the Most Confirmed Cases")
```
### Country - the most recovered cases
```{R}
# take the name from the data instead of hard-coding it
top_recovered <- df3[1,]
valueBox(value = as.character(top_recovered$Country), caption = "the Most Recovered Cases")
```
### Country - the most death cases
```{R}
# take the name from the data instead of hard-coding it
top_death <- df4[1,]
valueBox(value = as.character(top_death$Country), caption = "the Most Death Cases")
```
Row {.tabset}
-----------------------------------------------------------------------
### Top 10 countries (most confirmed cases)
```{R}
ggplot(df2_top10, aes(x= reorder(Country,Total_Confirmed_Cases), y= Total_Confirmed_Cases)) +
geom_bar(stat = 'identity', fill='blue') +
scale_y_continuous(breaks = seq(0, 65000000, by = 500000)) +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 45,hjust = 1)) +
coord_flip() +
xlab('Country') +
ylab('Confirmed Cases') +
ggtitle('Top 10 Countries - Most Confirmed Cases')
```
### Top 10 countries (most recovered cases)
```{r}
ggplot(df3_top10, aes(x= reorder(Country,Total_Recovered_Cases), y= Total_Recovered_Cases)) +
geom_bar(stat = 'identity', fill='green') +
scale_y_continuous(breaks = seq(0, 45000000, by = 500000)) +
# apply the complete theme first so the custom text settings below are not overridden
theme_classic() +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 45,hjust = 1)) +
coord_flip() +
xlab('Country') +
ylab('Recovered Cases') +
ggtitle('Top 10 Countries - Most Recovered Cases')
```
### Top 10 countries (most death cases)
```{r}
ggplot(df4_top10, aes(x= reorder(Country,Total_Death_Cases), y= Total_Death_Cases)) +
geom_bar(stat = 'identity', fill='red') +
# apply the complete theme first so the custom text settings below are not overridden
theme_classic() +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 45,hjust = 1)) +
coord_flip() +
xlab('Country') +
ylab('Death Cases') +
ggtitle('Top 10 Countries - Most Death Cases')
```
U.S. COVID-19 Analysis {data-icon="fa-globe-americas" data-orientation=rows}
=============================
```{R}
df5 <- df %>%
select(Date,`Country/Region`,`Province/State`,Lat,Long,Confirmed,Recover,Death) %>%
group_by(`Country/Region`,`Province/State`,Lat, Long, Date) %>%
summarise(Confirmed = sum(Confirmed),
Deaths = sum(Death),
Recovered = sum(Recover)) %>%
mutate("New_Confirmed_Cases" = Confirmed - lag(Confirmed, 1))
df5$New_Confirmed_Cases[is.na(df5$New_Confirmed_Cases)] <-0
us <- df5[df5$`Country/Region` %in% c('US'),]
```
Row {data-height=180}
-----------------------------------------------------------------------
### Total confirmed cases
```{R}
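# use the latest cumulative count; summing daily differences would miss cases reported before the March 2020 cutoff
us_total_confirm <- max(us$Confirmed)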
valueBox(value = paste(round(us_total_confirm/1000000, 2),'Million'), caption = "U.S. Confirmed Cases Till Sept. 2020", color = ifelse(total_confirm > 30000000, "success", "warning"))
```
### Total recovered cases
```{R}
# latest U.S. row, also used by the death-cases box
df6 <- us %>% filter(Date == max(Date))
valueBox(value = paste(round(df6$Recovered/1000000, 2),'Million'), caption = "U.S. Recovered Cases Till Sept. 2020", color = ifelse(df6$Recovered > 300000, "success", "warning"))
```
### Total death cases
```{R}
valueBox(value = paste(round(df6$Deaths/1000000, 2),'Million'), caption = "U.S. Death Cases Till Sept. 2020", color = ifelse(df6$Deaths > 100000, "warning", "success"))
```
Row {data-orientation=columns data-height=640}
-----------------------------------------------------------------------
### Time series/trend analysis/stack
```{R}
ggplot(us, aes(x=Date)) +
geom_bar(aes(y=Confirmed), position = 'stack', stat = 'identity', fill='Blue') +
geom_bar(aes(y=Recovered), position = 'stack', stat = 'identity', fill='Green') +
geom_bar(aes(y=Deaths), position = 'stack', stat = 'identity', fill='Red') +
labs(title = "U.S. Cases Trend",
subtitle = "Blue-Confirmed; Green-Recovered; Red-Death") +
# apply the complete theme first so the custom text settings below are not overridden
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = 'bold'),
plot.subtitle = element_text(size = 10, color = 'red')
)
```
### New Cases
```{R}
ggplot(us, aes(Date, New_Confirmed_Cases))+
geom_smooth() +
geom_bar(stat = 'identity', position = 'stack') +
theme_light() +
labs(title = "U.S. New Confirmed Cases Daily Trend") +
ylab( 'Daily Confirmed Cases')
```
### Mortality rate and recovery rate
```{r}
# rates are row-wise ratios of cumulative counts, so no grouping is needed
df_9 <- us %>% ungroup() %>%
mutate(Mortality_rate = Deaths/Confirmed,
Recovery_rate = Recovered/Confirmed)
df_9 %>% select(Date, Recovery_rate, Mortality_rate) %>%
gather(Status, Ratio, -Date) %>%
ggplot(aes(Date, Ratio, fill =Status)) +
geom_bar(stat = "identity", position = "Stack") +
ggtitle("U.S. Mortality Rate and Recovery Rate Trend") +
theme_clean()
```
Summary {data-icon="fa-book"}
=============================
Column
-----------------------------------------------------------------------
### Conclusion
Here are several insights from the plots:
1. From the global perspective, 31.78 million cases had been confirmed by September 2020, along with 21.75 million recovered cases and 0.9 million deaths. Geographically, the Americas are the most affected continent by number of COVID-19 cases, with the United States and Brazil among the countries with the most confirmed cases. India has the most confirmed cases in Asia. Russia, South Africa, and Spain are also badly affected. Recovered cases broadly track confirmed cases; surprisingly, however, the recovered cases of Brazil and of India are each nearly double those of the United States. Deaths in the United States are nearly two and a half times those of India and double those of Brazil.
2. The U.S. has the most confirmed COVID-19 cases and deaths, while India has the most recovered cases.
3. Up to September 2020, the U.S. has nearly 7 million confirmed cases, about one fifth of the global total. Its 2.67 million recovered cases, however, are only around one eighth of the global recovered cases, which suggests that the reported U.S. recovery rate (about 0.38) is well below the global figure (about 0.68). Nearly 0.2 million people have died from COVID-19 in the U.S., roughly one fifth of global deaths.
4. Looking at the U.S. daily new-case trend, June and July had slower growth; August and September then showed a second spike, which may be related to the Black Lives Matter protests. The end of September showed a third spike, which may reflect school and restaurant reopening policies.
5. World daily new confirmed cases were growing slowly back in June and seem to grow faster after September. Based on the trend, we seem to be in a second wave.
6. In the U.S., the recovery rate is well above the mortality rate at the latest date: about 0.38 versus about 0.03, based on 2.67 million recoveries and nearly 0.2 million deaths out of nearly 7 million confirmed cases. Because the chart stacks the two rates, the top of each bar (near 0.42 at the latest date) appears to be their sum rather than the mortality rate itself. Both rates were lowest back in March and April, when COVID-19 started to spread in the States; they rose sharply at the end of July and fell at the beginning of August.
7. Top 10 countries with the most confirmed cases: U.S., India, Brazil, Russia, Colombia, Peru, Mexico, Spain, South Africa, and Argentina.
8. Top 10 countries with the most recovered cases: India, Brazil, U.S., Russia, Colombia, Peru, Mexico, South Africa, Argentina, and Chile. Although Spain is among the top 10 countries by confirmed cases, it is not among the top 10 by recovered cases.
9. Top 10 countries with the most deaths: U.S., Brazil, India, Mexico, United Kingdom, Italy, Peru, France, Iran, and Colombia. It is quite surprising that countries such as the U.S., the UK, Italy, and France have more deaths than most developing countries.
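
The mortality and recovery rates discussed in item 6 are ratios of cumulative deaths (or recoveries) to cumulative confirmed cases. A minimal sketch of that arithmetic, assuming the rounded U.S. totals quoted in item 3 (so the results are approximate) and a hypothetical helper `rate()` that is not part of the dashboard:

```r
# Rounded U.S. totals quoted in the summary, in millions (approximate)
us_confirmed <- 7
us_recovered <- 2.67
us_deaths <- 0.2

# Hypothetical helper: share of confirmed cases with a given outcome;
# returns NA when there are no confirmed cases, to avoid dividing by zero
rate <- function(outcome, confirmed) {
  ifelse(confirmed > 0, outcome / confirmed, NA_real_)
}

recovery_rate <- rate(us_recovered, us_confirmed)
mortality_rate <- rate(us_deaths, us_confirmed)

round(c(recovery = recovery_rate, mortality = mortality_rate), 2)
```

Keeping the two rates as separate values, rather than reading them off a stacked bar, makes it easier to compare them directly.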