Dataset Topic

For this project, I wanted to look at data on Covid-19 in Vietnam, my home country. When Covid shut the world down, a few countries were able to contain it in the beginning, and Vietnam was one of them. The Vietnamese government was strict about quarantine rules, and lockdowns were issued and monitored. Some might say the country’s methods were drastic, but they were necessary.

This dataset comprises information sourced from the Vietnam Health Ministry and multiple Vietnamese news websites such as vnexpress.net, thanhnien.vn, and bnews.vn. To alert its citizens, Vietnam tracked and published as many of the cases’ names and the general locations they had visited as possible. This practice could be seen as controversial, since some would argue that it violated patients’ privacy, but it certainly helped the Vietnamese government keep Covid-19 contained, at least at the beginning of the pandemic.

This dataset has 30 variables, which offer a few different topics that I want to explore:
  1. The potential roles of age and gender in Covid hospitalizations (geom_point and geom_bar)
  2. Among the cities where patients were hospitalized, which cities had the most patients (treemap)
  3. Where the hospitalized patients came to Vietnam from (histogram)

Clean up the data

To create the graphs that I envisioned, I had to clean the data up a bit, starting with converting the column titles to lower case and removing the spaces. Then I dropped the columns I don’t need so I could focus on the ones I do:

  • Travel Country
  • Confirmed Date
  • Detection Location
  • Gender
  • Patient ID
  • Age

Load the dataset

#install.packages("treemap")
#install.packages("RColorBrewer")
library(treemap)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(RColorBrewer)
setwd("C:/Users/myngu/OneDrive/Montgomery College/DATA 110/Data Sets")
VN_COVID <- read_csv("vietnamcovid.csv")
## Rows: 288 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (28): ID, Gender, Nationality, Detection Location, Treatment Location, H...
## dbl  (2): Age, Travel History
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean up the data

names(VN_COVID) <- tolower(names(VN_COVID))
names(VN_COVID) <- gsub(" ","",names(VN_COVID))
str(VN_COVID)
## spc_tbl_ [288 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                                  : chr [1:288] "BN1" "BN2" "BN3" "BN4" ...
##  $ gender                              : chr [1:288] "M" "M" "F" "M" ...
##  $ age                                 : num [1:288] 66 28 25 29 23 25 73 29 30 42 ...
##  $ nationality                         : chr [1:288] "China" "China" "Vietnam" "Vietnam" ...
##  $ detectionlocation                   : chr [1:288] "Ho Chi Minh City" "Ho Chi Minh City" "Thanh Hoa" "Vinh Phuc" ...
##  $ treatmentlocation                   : chr [1:288] "Ho Chi Minh City" "Ho Chi Minh City" "Thanh Hoa" "Vinh Phuc" ...
##  $ hospital                            : chr [1:288] "Bệnh viện Chợ Rẫy" "Bệnh viện Chợ Rẫy" "Bệnh viện đa khoa tỉnh Thanh Hóa" "Bệnh viện Bệnh Nhiệt đới Trung Ương (cơ sở Đông Anh)" ...
##  $ confirmeddate                       : chr [1:288] "23-Jan" "23-Jan" "30-Jan" "30-Jan" ...
##  $ travelhistory                       : num [1:288] 1 0 1 1 1 0 1 1 1 0 ...
##  $ travelcountry                       : chr [1:288] "China" "Vietnam" "China" "China" ...
##  $ travelcountry,correct               : chr [1:288] "China" NA "China" "China" ...
##  $ sourceofinfection                   : chr [1:288] NA "F1 of BN1" NA NA ...
##  $ relationship                        : chr [1:288] NA "Son" NA NA ...
##  $ flightid(date)                      : chr [1:288] NA NA NA NA ...
##  $ infectioncluster                    : chr [1:288] NA NA NA "Vinh Phuc" ...
##  $ healthconditionwhenconfirmed        : chr [1:288] "Cold/Flu-like symptoms" "Cold/Flu-like symptoms" "Cold/Flu-like symptoms" "Cold/Flu-like symptoms" ...
##  $ detailsymptomswhenconfirmed(cleanup): chr [1:288] "fever" "fever" "fever, cough" "fever, cough, sore throat" ...
##  $ detailsymptomswhenconfirmed         : chr [1:288] "fever" "fever" "fever, cough" "fever, cough, sore throat" ...
##  $ developingsymptoms                  : chr [1:288] NA NA NA NA ...
##  $ underlyinghealthcondition           : chr [1:288] NA NA NA NA ...
##  $ dischargeddate                      : chr [1:288] "12-Feb" "4-Feb" "(Feb 2020, no exact date)" "18-Feb" ...
##  $ re-infected                         : chr [1:288] NA NA NA NA ...
##  $ re-discharged                       : chr [1:288] NA NA NA NA ...
##  $ reference1                          : chr [1:288] "https://vnexpress.net/dich-viem-phoi-corona/hai-nguoi-viem-phoi-vu-han-cach-ly-tai-benh-vien-cho-ray-4046299.html" NA "https://vnexpress.net/dich-viem-phoi-corona/ba-nguoi-viet-viem-phoi-da-tiep-xuc-nhieu-nguoi-4048068.html" "https://vnexpress.net/dich-viem-phoi-corona/ba-nguoi-viet-viem-phoi-da-tiep-xuc-nhieu-nguoi-4048068.html" ...
##  $ reference2                          : chr [1:288] NA NA NA "https://vnexpress.net/suc-khoe/hai-benh-nhan-corona-vinh-phuc-xuat-vien-4056081.html" ...
##  $ reference3                          : chr [1:288] NA NA NA "https://bnews.vn/dich-do-virus-corona-them-2-benh-nhan-o-vinh-phuc-duoc-xuat-vien/147907.html" ...
##  $ reference4                          : chr [1:288] NA NA NA NA ...
##  $ reference5                          : chr [1:288] NA NA NA NA ...
##  $ note                                : chr [1:288] NA NA NA NA ...
##  $ numberofnegativetestbeforedischarged: chr [1:288] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_character(),
##   ..   Gender = col_character(),
##   ..   Age = col_double(),
##   ..   Nationality = col_character(),
##   ..   `Detection Location` = col_character(),
##   ..   `Treatment Location` = col_character(),
##   ..   Hospital = col_character(),
##   ..   `Confirmed Date` = col_character(),
##   ..   `Travel History` = col_double(),
##   ..   `Travel Country` = col_character(),
##   ..   `Travel Country, Correct` = col_character(),
##   ..   `Source of Infection` = col_character(),
##   ..   Relationship = col_character(),
##   ..   `Flight ID (Date)` = col_character(),
##   ..   `Infection Cluster` = col_character(),
##   ..   `Health Condition When Confirmed` = col_character(),
##   ..   `Detail Symptoms When Confirmed (clean up)` = col_character(),
##   ..   `Detail Symptoms When Confirmed` = col_character(),
##   ..   `Developing Symptoms` = col_character(),
##   ..   `Underlying Health Condition` = col_character(),
##   ..   `Discharged Date` = col_character(),
##   ..   `Re-Infected` = col_character(),
##   ..   `Re-discharged` = col_character(),
##   ..   `Reference 1` = col_character(),
##   ..   `Reference 2` = col_character(),
##   ..   `Reference 3` = col_character(),
##   ..   `Reference 4` = col_character(),
##   ..   `Reference 5` = col_character(),
##   ..   Note = col_character(),
##   ..   `Number of negative test before discharged` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
vncovid <- VN_COVID %>%
  group_by(confirmeddate, travelcountry, detectionlocation, gender, id, age) %>%
  summarise(age = sum(age, na.rm = TRUE))
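Since the goal of this step is only to keep six columns, the same selection could probably be written with `select()`. The sketch below uses a small made-up table (`toy`) in place of VN_COVID, so the values are placeholders. One difference to be aware of: `select()` leaves a missing age as NA, whereas `sum(NA, na.rm = TRUE)` evaluates to 0.

```r
library(dplyr)

# Made-up stand-in rows; the real table is VN_COVID (288 rows, 30 columns)
toy <- tibble(
  id = c("P1", "P2", "P3"),
  gender = c("M", "M", "F"),
  age = c(66, 28, NA),
  nationality = c("China", "China", "Vietnam"),
  detectionlocation = c("Ho Chi Minh City", "Ho Chi Minh City", "Thanh Hoa"),
  confirmeddate = c("23-Jan", "23-Jan", "30-Jan"),
  travelcountry = c("China", "Vietnam", "China")
)

# Keep only the six columns of interest; no grouping is involved,
# so the result is a plain ungrouped tibble in the original row order
toy_small <- toy %>%
  select(confirmeddate, travelcountry, detectionlocation, gender, id, age)

print(toy_small)
```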

Different ages and gender of hospitalized cases

Older people were thought to have worse symptoms when contracting Covid-19 and were therefore more likely to be hospitalized. Let’s see if the numbers reflect that. I also wanted to explore whether gender plays a role.

Graph 1: Age

Each point represents a Covid patient who was hospitalized in Vietnam from January to May 2020. Each faint vertical line in the graph area represents a particular day. On some days in the range there were zero hospitalized cases, while most days had multiple.

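# One point per patient: confirmed date on x, age on y, colored by age
graph1 <- ggplot(vncovid, aes(x = confirmeddate, y = age)) +
  geom_point(aes(color = age)) +
  theme_dark(base_size = 10) +
  ggtitle("Vietnam Hospitalized Covid Cases by Age and Gender") +
  ylab("Age") +
  xlab("Confirmed Date")
# Convert the static ggplot into an interactive plotly widget
graph1 <- ggplotly(graph1)
graph1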

Graph 2: Bar Graph

This bar graph shows the hospitalized Covid cases by gender more clearly, as they are split into two categories on the x axis: male and female.

graph2 <- vncovid %>%
  ggplot() +
  geom_bar(aes(x=gender, y=age, fill=confirmeddate), 
           position = "dodge", stat = "identity") +
  theme_minimal(base_size = 10) +
  ggtitle("Vietnam Hospitalized Covid Patient by Age and Gender") +
  xlab("Gender") +
  ylab("Age") +
  labs(fill = "Confirmed Date")
graph2  <- ggplotly(graph2)
graph2

Vietnam Hospitalized Covid Cases by Location

I chose a treemap to present this data because it gives a quick, easy-to-understand visual of the cities, from most hospitalized Covid cases to least.

treemap(vncovid, index="detectionlocation", vSize="age", 
        vColor="age", type="manual",    
        palette="Set1")
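The call above sizes each tile by `age`, so a tile reflects the summed ages of the patients in a city rather than the number of patients. If the goal is to rank cities by case count, one option is to count rows per location first and size the tiles by that count. The sketch below is an assumption-laden example: `toy` is a made-up stand-in for vncovid, and `cases_by_location` and the title are names I invented for illustration.

```r
library(dplyr)

# Made-up stand-in rows; in the project this would be vncovid
toy <- tibble(
  detectionlocation = c("Ha Noi", "Ha Noi", "Ha Noi",
                        "Ho Chi Minh City", "Ho Chi Minh City", "Thanh Hoa"),
  age = c(25, 61, 30, 66, 28, 25)
)

# One row per location, with n = number of cases detected there
cases_by_location <- toy %>%
  ungroup() %>%
  count(detectionlocation, name = "n", sort = TRUE)

print(cases_by_location)

# Draw the treemap only if the package is installed;
# tile area is now proportional to the case count
if (requireNamespace("treemap", quietly = TRUE)) {
  treemap::treemap(as.data.frame(cases_by_location),
                   index = "detectionlocation", vSize = "n",
                   type = "index",
                   title = "Hospitalized cases by detection location")
}
```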

The original locations of hospitalized Covid cases in Vietnam

For this graph, I wanted to see how many hospitalized cases came from people who traveled into Vietnam from another country versus cases that were contracted within the country.

# Set1 has only 9 colors, so interpolate one color per travel country
colourCount <- length(unique(vncovid$travelcountry))
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))

graph3 <- vncovid %>%
  ggplot() +
  geom_histogram(aes(age, fill = travelcountry)) +
  scale_fill_manual(values = getPalette(colourCount))+
  theme_light(base_size = 10) +
  ggtitle("Original Location of Vietnam Hospitalized Covid Cases") +
  xlab("Age") + 
  labs(fill = "Travel Country")
graph3 <- ggplotly(graph3)
graph3
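The histogram bins cases by age, so the imported versus domestic split has to be read off the fill colors. A plain count table may answer the question more directly. This is a rough sketch on made-up rows (`toy`), and it assumes that a `travelcountry` value of "Vietnam" marks a case with no foreign travel, which is how the first rows of the data appear to be coded. The `origin` labels are my own.

```r
library(dplyr)

# Made-up stand-in rows; in the project this would be vncovid
toy <- tibble(
  travelcountry = c("China", "Vietnam", "UK", "Vietnam", "UK", "Vietnam")
)

# Assumption: travelcountry == "Vietnam" means the case did not travel abroad
labelled <- toy %>%
  ungroup() %>%
  mutate(origin = if_else(travelcountry == "Vietnam",
                          "Within Vietnam", "Imported"))

# Imported vs. domestic totals
by_origin <- labelled %>% count(origin)
print(by_origin)

# Cases per travel country, largest first
by_country <- labelled %>% count(travelcountry, sort = TRUE)
print(by_country)
```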

Findings

I included ggplotly in my graphs whenever I could because I like the interaction it provides. It gives more dimension to the graphs and makes certain elements easier to understand.

Ages and Gender

Contrary to my expectation, graph 1 shows more hospitalized cases among people between the ages of 25 and 50 than among those older than 50. One possible explanation is that younger people were more careless, though the data here cannot confirm that. As for gender, the bar graph suggests it does not play a role in the hospitalization of Covid cases.
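Reading age groups off a scatter plot is approximate, so a count per age band is a useful cross-check. The sketch below uses only the first ten ages shown in the `str()` output, and the band boundaries and labels are my own choice, so the resulting counts say nothing about the full dataset.

```r
library(dplyr)

# First ten ages from the str() output; the full data has 288 rows
toy <- tibble(age = c(66, 28, 25, 29, 23, 25, 73, 29, 30, 42))

# Assumed bands: [0, 25), [25, 50), [50, Inf)
age_bands <- toy %>%
  mutate(band = cut(age,
                    breaks = c(0, 25, 50, Inf),
                    right = FALSE,
                    labels = c("under 25", "25 to 49", "50 and over"))) %>%
  count(band)

print(age_bands)
```

Any missing age that was recorded as 0 during cleaning would land in the youngest band, so those rows are worth filtering out before counting.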

Covid cases by location

It was no surprise to me that Ha Noi and Ho Chi Minh City have the most hospitalized cases, given that they are the two biggest cities in the country and major travel hubs. Additionally, they have the most developed hospitals, and it’s likely that people travel there to get treated.

Covid cases that came from another country

I had assumed that most travel cases would come from China, since the coronavirus reportedly originated there and Vietnam and China share a border. However, much to my surprise, a lot of the cases were from people who came to Vietnam from the UK or the United Arab Emirates. Most of the cases came from within Vietnam, potentially from first-, second-, or third-hand (and so on) exposures.

Lessons learned

1: I wish I had figured out a way to convert the “Confirmed Dates” of the cases into months. The graphs would potentially be cleaner and the x axis labels wouldn’t be as messy.

2: The Confirmed Dates are not in chronological order whenever I use them in a graph. I did not track down the cause, though the str() output shows the column was read in as text (chr), and ggplot orders text labels alphabetically. Had they been in order, I could have determined which months had more cases than others. My assumption is that the cases would go up over time, but I could be wrong. As the country got a hold of Covid, maybe the number went down in mid-April or May.

3: I wanted to use geom_area to show the Travel Countries but could not get it to show the visual I wanted in a logical manner. I would much rather show that instead of the histogram for my last graph, since I think it would present the data in a much more meaningful way.

4: Graph 2, the bar graph, was rough until I figured it out. Here’s a picture:

whatafail

So my attempt to add the picture is a fail… although it did show up in R!
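Coming back to lessons 1 and 2: the confirmed dates are stored as text such as "23-Jan", and text sorts alphabetically. One way to get chronological order and month groupings is to parse the text into real dates. The sketch below is a guess at how that could look, not something I ran on the full dataset. It uses a few stand-in values, assumes every date falls in 2020 (the cases span January to May 2020), and assumes an English locale so that `%b` recognizes "Jan", "Feb", and so on.

```r
# Stand-in values in the same "day-Mon" text format as confirmeddate
confirmeddate <- c("23-Jan", "4-Feb", "12-Feb", "6-Mar", "1-Apr")

# Append the assumed year, then parse; %b expects English month
# abbreviations, so this relies on an English locale
confirmed <- as.Date(paste0(confirmeddate, "-2020"), format = "%d-%b-%Y")

# Text sorts alphabetically, dates sort chronologically
print(sort(confirmeddate))
print(sort(confirmed))

# First day of each month, usable as a grouping key or a cleaner x axis
confirmed_month <- as.Date(format(confirmed, "%Y-%m-01"))

# Number of stand-in cases per month
print(table(format(confirmed_month, "%Y-%m")))
```

In the project, the same conversion could be applied to the confirmeddate column before plotting, so that ggplot treats the x axis as dates instead of text labels.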