For this project, I wanted to look into some data on Covid-19 in Vietnam, my home country. When Covid shut the world down, a few countries were able to contain it in the beginning, Vietnam being one of them. The Vietnamese government was strict about quarantine rules, and lockdowns were issued and monitored. Some might say the country’s methods were drastic, but they were necessary.
This dataset comprises information sourced from the Vietnam Health Ministry and multiple Vietnamese news websites such as vnexpress.net, thanhnien.vn, and bnews.vn. To alert its citizens, Vietnam tracked and published as many of the cases’ names, and the general locations where they had been, as possible. This practice could be seen as controversial, since some would argue that it violated patients’ privacy, but it certainly did help the Vietnamese government keep Covid-19 contained, at least at the beginning of the pandemic.
In order to create the graphs that I envisioned, I had to clean the data up a bit, starting with converting the column titles to lower case and removing spaces. Then I eliminated all the “unnecessary” columns to focus on the ones that I do need:
#install.packages("treemap")
#install.packages("RColorBrewer")
library(treemap)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
library(RColorBrewer)
setwd("C:/Users/myngu/OneDrive/Montgomery College/DATA 110/Data Sets")
VN_COVID <- read_csv("vietnamcovid.csv")
## Rows: 288 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (28): ID, Gender, Nationality, Detection Location, Treatment Location, H...
## dbl (2): Age, Travel History
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(VN_COVID) <- tolower(names(VN_COVID))
names(VN_COVID) <- gsub(" ","",names(VN_COVID))
str(VN_COVID)
## spc_tbl_ [288 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : chr [1:288] "BN1" "BN2" "BN3" "BN4" ...
## $ gender : chr [1:288] "M" "M" "F" "M" ...
## $ age : num [1:288] 66 28 25 29 23 25 73 29 30 42 ...
## $ nationality : chr [1:288] "China" "China" "Vietnam" "Vietnam" ...
## $ detectionlocation : chr [1:288] "Ho Chi Minh City" "Ho Chi Minh City" "Thanh Hoa" "Vinh Phuc" ...
## $ treatmentlocation : chr [1:288] "Ho Chi Minh City" "Ho Chi Minh City" "Thanh Hoa" "Vinh Phuc" ...
## $ hospital : chr [1:288] "Bệnh viện Chợ Rẫy" "Bệnh viện Chợ Rẫy" "Bệnh viện đa khoa tỉnh Thanh Hóa" "Bệnh viện Bệnh Nhiệt đới Trung Ương (cơ sở Đông Anh)" ...
## $ confirmeddate : chr [1:288] "23-Jan" "23-Jan" "30-Jan" "30-Jan" ...
## $ travelhistory : num [1:288] 1 0 1 1 1 0 1 1 1 0 ...
## $ travelcountry : chr [1:288] "China" "Vietnam" "China" "China" ...
## $ travelcountry,correct : chr [1:288] "China" NA "China" "China" ...
## $ sourceofinfection : chr [1:288] NA "F1 of BN1" NA NA ...
## $ relationship : chr [1:288] NA "Son" NA NA ...
## $ flightid(date) : chr [1:288] NA NA NA NA ...
## $ infectioncluster : chr [1:288] NA NA NA "Vinh Phuc" ...
## $ healthconditionwhenconfirmed : chr [1:288] "Cold/Flu-like symptoms" "Cold/Flu-like symptoms" "Cold/Flu-like symptoms" "Cold/Flu-like symptoms" ...
## $ detailsymptomswhenconfirmed(cleanup): chr [1:288] "fever" "fever" "fever, cough" "fever, cough, sore throat" ...
## $ detailsymptomswhenconfirmed : chr [1:288] "fever" "fever" "fever, cough" "fever, cough, sore throat" ...
## $ developingsymptoms : chr [1:288] NA NA NA NA ...
## $ underlyinghealthcondition : chr [1:288] NA NA NA NA ...
## $ dischargeddate : chr [1:288] "12-Feb" "4-Feb" "(Feb 2020, no exact date)" "18-Feb" ...
## $ re-infected : chr [1:288] NA NA NA NA ...
## $ re-discharged : chr [1:288] NA NA NA NA ...
## $ reference1 : chr [1:288] "https://vnexpress.net/dich-viem-phoi-corona/hai-nguoi-viem-phoi-vu-han-cach-ly-tai-benh-vien-cho-ray-4046299.html" NA "https://vnexpress.net/dich-viem-phoi-corona/ba-nguoi-viet-viem-phoi-da-tiep-xuc-nhieu-nguoi-4048068.html" "https://vnexpress.net/dich-viem-phoi-corona/ba-nguoi-viet-viem-phoi-da-tiep-xuc-nhieu-nguoi-4048068.html" ...
## $ reference2 : chr [1:288] NA NA NA "https://vnexpress.net/suc-khoe/hai-benh-nhan-corona-vinh-phuc-xuat-vien-4056081.html" ...
## $ reference3 : chr [1:288] NA NA NA "https://bnews.vn/dich-do-virus-corona-them-2-benh-nhan-o-vinh-phuc-duoc-xuat-vien/147907.html" ...
## $ reference4 : chr [1:288] NA NA NA NA ...
## $ reference5 : chr [1:288] NA NA NA NA ...
## $ note : chr [1:288] NA NA NA NA ...
## $ numberofnegativetestbeforedischarged: chr [1:288] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_character(),
## .. Gender = col_character(),
## .. Age = col_double(),
## .. Nationality = col_character(),
## .. `Detection Location` = col_character(),
## .. `Treatment Location` = col_character(),
## .. Hospital = col_character(),
## .. `Confirmed Date` = col_character(),
## .. `Travel History` = col_double(),
## .. `Travel Country` = col_character(),
## .. `Travel Country, Correct` = col_character(),
## .. `Source of Infection` = col_character(),
## .. Relationship = col_character(),
## .. `Flight ID (Date)` = col_character(),
## .. `Infection Cluster` = col_character(),
## .. `Health Condition When Confirmed` = col_character(),
## .. `Detail Symptoms When Confirmed (clean up)` = col_character(),
## .. `Detail Symptoms When Confirmed` = col_character(),
## .. `Developing Symptoms` = col_character(),
## .. `Underlying Health Condition` = col_character(),
## .. `Discharged Date` = col_character(),
## .. `Re-Infected` = col_character(),
## .. `Re-discharged` = col_character(),
## .. `Reference 1` = col_character(),
## .. `Reference 2` = col_character(),
## .. `Reference 3` = col_character(),
## .. `Reference 4` = col_character(),
## .. `Reference 5` = col_character(),
## .. Note = col_character(),
## .. `Number of negative test before discharged` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
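The `str()` output above shows that a few of the cleaned names, such as `travelcountry,correct`, `flightid(date)`, and `re-infected`, still contain punctuation, so they would need backticks whenever they are referenced. Here is a minimal sketch of one way to get fully syntactic names using only base R. The helper name `clean_names_basic` is hypothetical and was not part of my original cleanup:

```r
# Hypothetical helper (not part of the original analysis): lower-case the
# names, then drop every character that is not a letter or a digit.
clean_names_basic <- function(x) {
  x <- tolower(x)
  gsub("[^a-z0-9]", "", x)
}

clean_names_basic(c("Travel Country, Correct", "Flight ID (Date)", "Re-Infected"))
## [1] "travelcountrycorrect" "flightiddate"         "reinfected"
```

It could be applied with `names(VN_COVID) <- clean_names_basic(names(VN_COVID))`, assuming no two columns collapse to the same cleaned name.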
# Keep only the columns used in the graphs below (one row per patient)
vncovid <- VN_COVID %>%
  select(confirmeddate, travelcountry, detectionlocation, gender, id, age)
The older generation was speculated to have worse symptoms when contracting Covid-19 and hence to be more likely to be hospitalized. Let’s see if the numbers reflect that. I also wanted to explore whether gender plays a role.
Each point represents a Covid patient who was hospitalized in Vietnam from January to May 2020. Each faint vertical line in the graph area represents a particular day. On certain days in the range there were zero hospitalized cases, while most days had multiple.
graph1 <- ggplot(vncovid, aes(x = confirmeddate, y = age)) +
geom_point(aes(color = age)) +
theme_dark(base_size = 10) +
ggtitle("Vietnam Hospitalized Covid Cases by Age and Gender") +
ylab("Age") +
xlab("Confirmed Date")
graph1 <- ggplotly(graph1)
graph1
This bar graph shows the hospitalized Covid cases by gender more clearly, as they are split into two categories on the x axis: Male and Female.
graph2 <- vncovid %>%
ggplot() +
geom_bar(aes(x=gender, y=age, fill=confirmeddate),
position = "dodge", stat = "identity") +
theme_minimal(base_size = 10) +
ggtitle("Vietnam Hospitalized Covid Patients by Age and Gender") +
xlab("Gender") +
ylab("Age") +
labs(fill = "Confirmed Date")
graph2 <- ggplotly(graph2)
graph2
I elected to use a treemap to present this data because it gives a quick, easy-to-understand visual of the cities, from the most hospitalized Covid cases to the least.
treemap(vncovid, index="detectionlocation", vSize="age",
vColor="age", type="manual",
palette="Set1")
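One caveat: `vSize = "age"` sizes each tile by the sum of the patients’ ages in that city rather than by the number of cases. Here is a minimal sketch of counting the cases per city first, in base R. The function name `count_cases_by_city` and the `cases` column are hypothetical, and the `treemap()` call is left as a comment because it assumes the treemap package is installed:

```r
# Hypothetical helper: one row per city with its number of cases,
# sorted from most cases to fewest.
count_cases_by_city <- function(df) {
  counts <- as.data.frame(table(detectionlocation = df$detectionlocation),
                          responseName = "cases",
                          stringsAsFactors = FALSE)
  counts[order(-counts$cases), , drop = FALSE]
}

# Possible usage with the data from this project:
# city_counts <- count_cases_by_city(vncovid)
# treemap(city_counts, index = "detectionlocation",
#         vSize = "cases", vColor = "cases", type = "value")
```

With tiles sized this way, a large tile means many patients rather than a few older ones.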
For this graph, I wanted to see how many hospitalized cases came from people who traveled from another country into Vietnam, versus cases that were contracted within the country.
colourCount <- length(unique(vncovid$travelcountry))
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))
graph3 <- vncovid %>%
ggplot() +
geom_histogram(aes(age, fill = travelcountry)) +
scale_fill_manual(values = getPalette(colourCount))+
theme_light(base_size = 10) +
ggtitle("Original Location of Vietnam Hospitalized Covid Cases") +
xlab("Age") +
labs(fill = "Travel Country")
graph3 <- ggplotly(graph3)
graph3
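The histogram bins the cases by age, so the split between imported and domestic cases has to be read off the fill colors. Here is a minimal sketch of tallying that split directly. It assumes, as the `str()` output suggests, that `travelcountry` is "Vietnam" for cases with no travel history. The function name `tally_case_origin` is hypothetical:

```r
# Hypothetical helper: label each case by where it was contracted.
# Assumption: travelcountry == "Vietnam" marks a domestic case.
tally_case_origin <- function(travelcountry) {
  origin <- ifelse(is.na(travelcountry), "Unknown",
                   ifelse(travelcountry == "Vietnam", "Domestic", "Imported"))
  table(origin)
}

# Possible usage with the data from this project:
# tally_case_origin(vncovid$travelcountry)
```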
I include ggplotly in my graphs whenever I can because I like the interactivity it provides. It gives the graphs more dimension and makes certain elements easier to understand.
Contrary to my expectation, graph 1 shows that there are more hospitalized cases among people between the ages of 25 and 50 than among those older than 50. This could be the result of younger people being more careless. As far as gender goes, the bar graph suggests that it does not play a role in the hospitalization of Covid cases.
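The age and gender comparison can also be checked with a table instead of by eye. Here is a minimal sketch in base R. The function name `count_by_age_band` is hypothetical, and the cut points (under 25, 25 to 50, over 50) are my reading of the ranges discussed above, assuming ages are recorded as whole years:

```r
# Hypothetical helper: cross-tabulate cases by age band and gender.
# With right = FALSE the bands are [0, 25), [25, 51), and [51, Inf),
# so for whole-year ages they match "under 25", "25 to 50", and "over 50".
count_by_age_band <- function(age, gender) {
  band <- cut(age,
              breaks = c(0, 25, 51, Inf),
              labels = c("under 25", "25 to 50", "over 50"),
              right = FALSE)
  table(age_band = band, gender = gender, useNA = "ifany")
}

# Possible usage with the data from this project:
# count_by_age_band(vncovid$age, vncovid$gender)
```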
It was no surprise to me that Ha Noi and Ho Chi Minh City have the most hospitalized cases, given that they are the two biggest cities in the country and major travel hubs. Additionally, they have the most developed hospitals, and it’s likely that people travel there to get treated.
I made a huge assumption that most travel cases would come from China, since the coronavirus allegedly originated there and Vietnam and China share a border. However, much to my surprise, a lot of the cases were from people who came to Vietnam from the UK or the United Arab Emirates. Most of the cases came from within Vietnam, potentially from first-, second-, or third-hand (and so on) exposures.
1: I wish I had figured out a way to convert the “Confirmed Dates” of the cases into months; the graphs would potentially be cleaner and the x axis labels would not be as messy.
2: Whenever I use the Confirmed Dates in a graph, they are not in chronological order. I’m not sure why, though a likely cause is that the column was read in as text (chr) rather than as dates, so values such as “23-Jan” sort alphabetically. Had they been in order, I could determine which months had more cases than others. My assumption is that the cases would go up over time, but I could be wrong. As the country got a hold of Covid, maybe the number went down in mid-April or May.
3: I wanted to incorporate geom_area to show the Travel Countries but could not get it to show the visual I wanted in a logical manner. I would much rather show that than the histogram for my last graph; I think it would show the data in a much more meaningful way.
4: Graph 2, the bar graph, was rough until I figured it out. Here’s a picture:
whatafail
So my attempt to add the picture is a fail… although it did show up in R!
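For the first two points in the list above, here is a minimal sketch of how the “23-Jan” style text could be turned into real dates in base R. The function name `parse_confirmed_date` is hypothetical, and the sketch assumes that every confirmed date falls in 2020 and uses an English three-letter month abbreviation:

```r
# Hypothetical helper: turn "23-Jan" style text into Date values.
# Assumptions: all dates are in 2020 and months are abbreviated in English.
parse_confirmed_date <- function(x, year = 2020) {
  parts <- strsplit(x, "-", fixed = TRUE)
  day   <- as.integer(vapply(parts, `[`, character(1), 1))
  # month.abb is R's built-in c("Jan", "Feb", ...), so this does not
  # depend on the computer's language settings.
  month <- match(vapply(parts, `[`, character(1), 2), month.abb)
  as.Date(sprintf("%d-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

dates <- parse_confirmed_date(c("4-Feb", "23-Jan", "30-Jan"))
sort(dates)          # real dates sort chronologically
format(dates, "%m")  # month number as text, usable for grouping by month
```

Once the column holds dates, ggplot2 should place the x axis in chronological order, and something like `scale_x_date(date_breaks = "1 month", date_labels = "%b")` could keep the labels to one per month.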