Taiwan COVID-19 data exploration

The data below were collected from the official website of the Taiwan Centers for Disease Control (CDC). As the world suffers from this pandemic, governments in every nation are trying their best to contain the virus and protect their citizens. Taiwan is widely regarded as a model in fighting COVID-19. The following data give an overview of the cases in Taiwan.
Import packages
rm(list = ls(all.names = TRUE))

library(dplyr)
library(ggplot2)
library(tidyverse)
library(corpus)
library(tm)
Importing the dataset
setwd("~/Desktop/Covid")
library(readxl)
## Warning: package 'readxl' was built under R version 3.5.2
T <- read_excel("嚴重特殊傳染性肺炎 Taiwan CDC 公布病例資訊.xlsx", sheet = "確診病例")
## New names:
## * `` -> ...14
View(T)
str(T)
## tibble [438 × 15] (S3: tbl_df/tbl/data.frame)
##  $ CDC 新聞稿 URL : chr [1:438] "https://www.cdc.gov.tw/Bulletin/Detail/6oHuoqzW9e_onW0AaMEemg?typeid=9" "https://www.cdc.gov.tw/Bulletin/Detail/ozDpnZZxwa-kBKTXbdS0Kw?typeid=9" "https://www.cdc.gov.tw/Bulletin/Detail/ozDpnZZxwa-kBKTXbdS0Kw?typeid=9" "https://www.cdc.gov.tw/Bulletin/Detail/1lqFGlxtUhCpE_quNLIfLg?typeid=9" ...
##  $ 案例編號       : num [1:438] 1 2 3 4 5 6 7 8 9 10 ...
##  $ 性別           : chr [1:438] "女" "女" "男" "女" ...
##  $ 年齡層
## (~多歲): num [1:438] 50 50 50 50 50 70 70 50 40 40 ...
##  $ 就醫日         : chr [1:438] "43850.0" "43853.0" "43853.0" "43855.0" ...
##  $ 確診日         : POSIXct[1:438], format: "2020-01-21" "2020-01-24" ...
##  $ 相關地點       : chr [1:438] "武漢" "武漢" "武漢" "武漢, 歐洲" ...
##  $ 境外或是本土   : chr [1:438] "境外" "境外" "境外" "境外" ...
##  $ 群組           : chr [1:438] NA NA NA NA ...
##  $ 確診前症狀     : chr [1:438] "發燒、咳嗽、呼吸急促" "發燒" "感冒症狀" "咳嗽" ...
##  $ 其他公布說明   : chr [1:438] "於中國大陸武漢工作,昨日由武漢搭機入境,因有發燒、咳嗽、呼吸急促等症狀,由機場檢疫人員安排後送就醫,X光檢查顯示"| __truncated__ "分別為50多歲中國籍女性(案1)及50多歲男性國人(案2),皆為1月21日入境;案1為1月23日因發燒就醫,案2於1月20日出現感冒"| __truncated__ "分別為50多歲中國籍女性(案1)及50多歲男性國人(案2),皆為1月21日入境;案1為1月23日因發燒就醫,案2於1月20日出現感冒"| __truncated__ "1月13日至15日曾有中國大陸武漢旅遊史,未前往華南海鮮市場,1月16日至25日至歐洲旅遊,個案於1月22日起有咳嗽症狀,25"| __truncated__ ...
##  $ 死亡日         : POSIXct[1:438], format: NA NA ...
##  $ 解除隔離日     : POSIXct[1:438], format: "2020-02-06" NA ...
##  $ ...14          : logi [1:438] NA NA NA NA NA NA ...
##  $ 個案頁面建立   : chr [1:438] "✔" "✔" "✔" "✔" ...
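
The visit-date column in the output above is read as text such as "43850.0", which looks like an Excel serial date rather than a calendar date. Here is a minimal sketch of one way to convert such values, assuming the workbook uses Excel's default 1900 date system. `visit_raw` is a hypothetical stand-in for that column, not a variable from the analysis.

```r
# Hypothetical stand-in for the visit-date column: Excel serial dates stored as text
visit_raw <- c("43850.0", "43853.0", "43853.0", "43855.0", NA)

# Assumption: the workbook uses Excel's 1900 date system,
# whose day 0 corresponds to 1899-12-30 in R
visit_date <- as.Date(as.numeric(visit_raw), origin = "1899-12-30")

print(visit_date)
```

Under that assumption the first value maps to 2020-01-20, one day before the first confirmation date shown above. Missing values stay `NA`.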
Overview of the trend

The first case in Taiwan was reported on 21/01/2020; by 04/05/2020, Taiwan had a total of 438 cases and 6 deaths.
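
Summary figures like these can be read directly off the case table. The sketch below shows one way to do it on a toy data frame. The column names `confirmed` and `death` and all values are made up for illustration. In the real table, the confirmation and death date columns would play these roles.

```r
# Toy stand-in for the case table; all values are made up
cases <- data.frame(
  confirmed = as.Date(c("2020-01-21", "2020-01-24", "2020-03-15", "2020-05-04")),
  death     = as.Date(c(NA, NA, "2020-03-29", NA))
)

first_case   <- min(cases$confirmed)      # date of the first confirmed case
last_case    <- max(cases$confirmed)      # date of the latest confirmed case
total_cases  <- nrow(cases)               # one row per confirmed case
total_deaths <- sum(!is.na(cases$death))  # rows with a recorded death date

cat(format(first_case, "%d/%m/%Y"), format(last_case, "%d/%m/%Y"),
    total_cases, total_deaths, "\n")
```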

##default ggplot colour
 gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
 }

T$確診日 <- as.Date(T$確診日)

T %>% count(確診日) %>% ggplot(aes(x=確診日,n)) + geom_bar(stat="identity",position = position_dodge(),fill = "#FF6666")  + ggtitle("Number of confirmed cases by date") + xlab("date") + ylab("number of cases") + scale_x_date(date_breaks = "1 week") + theme(axis.text.x= element_text(angle=90)) 

T %>%
  group_by(境外或是本土) %>%
  count(確診日) %>%
  ggplot(aes(x = 確診日, y = n)) +
  geom_line(aes(col = 境外或是本土)) +
  geom_point(aes(col = 境外或是本土)) +
  ggtitle("Number of confirmed cases by date and type") +
  xlab("date") + ylab("number of cases") +
  labs(color = "Type") +
  scale_x_date(date_breaks = "1 week") +
  scale_color_manual(labels = c("imported", "military", "domestic"), values = gg_color_hue(3)) +
  theme(axis.text.x = element_text(angle = 90))

## Average cases per day (domestic, then imported)
n_days <- as.numeric(max(T$確診日) - min(T$確診日))
sum(T$境外或是本土 == "本土", na.rm = TRUE) / n_days
## [1] 0.5288462
sum(T$境外或是本土 == "境外", na.rm = TRUE) / n_days
## [1] 3.336538
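
The averages above are simply the number of cases of each type divided by the number of days between the first and last confirmation. The same arithmetic can be sketched for all types at once with `table()`. The toy data and the English type labels below are illustrative only.

```r
# Toy data; the dates and type labels are illustrative only
toy <- data.frame(
  date = as.Date(c("2020-01-21", "2020-01-21", "2020-01-23",
                   "2020-01-25", "2020-01-31")),
  type = c("imported", "imported", "domestic", "imported", "domestic")
)

# Same denominator as above: days between the first and last confirmation
n_days <- as.numeric(max(toy$date) - min(toy$date))

# Cases per type divided by the length of the observation window
avg_per_day <- table(toy$type) / n_days
print(avg_per_day)
```

Note that `max - min` excludes one endpoint. Adding 1 to `n_days` would count both the first and the last day, which lowers the averages slightly.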

As the first graph shows, daily cases generally stayed below 10. From early March to April the situation worsened in Europe and the US, so cases peaked during that period as many citizens returned from abroad. The second graph shows that domestic cases remained very low, under 5 per day throughout the period. After mid-April there was another peak, caused by an outbreak on a warship that had returned from a mission abroad. Once the outbreak was discovered, the crew were immediately recalled and monitored, and the situation was quickly contained.

Key takeaways

  1. Low domestic cases (fewer than 0.6 per day on average)
  2. A high number of cases from citizens returning from abroad (about 3.3 per day on average)

The overall trend demonstrates the effort that Taiwan’s government has put into containing the virus, and it has done an excellent job of keeping domestic cases at a very low level. After the peaks caused by returning citizens and the military incident, the government quickly suppressed the cases and monitored the situation to prevent further spread. The agile and flexible approach taken by the Taiwan government has been a huge success.

Gender and age group
library(readr)
T2 <- read_csv("Day_Confirmation_Age_County_Gender_19CoV.csv")
## Parsed with column specification:
## cols(
##   確定病名 = col_character(),
##   個案研判日 = col_date(format = ""),
##   縣市 = col_character(),
##   性別 = col_character(),
##   是否為境外移入 = col_character(),
##   年齡層 = col_character(),
##   確定病例數 = col_double()
## )
View(T2)


#### T dataset from "嚴重特殊傳染性肺炎 Taiwan CDC 公布病例資訊"
names(T)[4] <- "Age"

## minimum age
min(T$Age, na.rm = TRUE)
## [1] 4
## maximum age
max(T$Age, na.rm = TRUE)
## [1] 80
##number of cases by age group
T %>% group_by(性別) %>% count(Age) %>% na.omit() %>% ggplot(aes(x=factor(Age),y=n,width=.75,fill=性別)) + geom_bar(stat="identity",position = position_dodge()) + ylab("number of cases") + xlab("Age") + labs(fill="Gender") + scale_fill_manual(labels = c("female", "male"), values = gg_color_hue(2)) + theme(axis.text.x= element_text(angle=90)) + ggtitle("number of cases by age group") 

#### T2 dataset from "Taiwan CDC open portal"
## restructure the interval for plotting
str(T2)
## tibble [406 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ 確定病名      : chr [1:406] "嚴重特殊傳染性肺炎" "嚴重特殊傳染性肺炎" "嚴重特殊傳染性肺炎" "嚴重特殊傳染性肺炎" ...
##  $ 個案研判日    : Date[1:406], format: "2020-01-22" "2020-01-24" ...
##  $ 縣市          : chr [1:406] "高雄市" "台北市" "高雄市" "台北市" ...
##  $ 性別          : chr [1:406] "女" "女" "男" "女" ...
##  $ 是否為境外移入: chr [1:406] "是" "是" "是" "是" ...
##  $ 年齡層        : chr [1:406] "55-59" "50-54" "55-59" "55-59" ...
##  $ 確定病例數    : num [1:406] 1 1 1 1 1 2 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   確定病名 = col_character(),
##   ..   個案研判日 = col_date(format = ""),
##   ..   縣市 = col_character(),
##   ..   性別 = col_character(),
##   ..   是否為境外移入 = col_character(),
##   ..   年齡層 = col_character(),
##   ..   確定病例數 = col_double()
##   .. )
T2$年齡層[which(T2$年齡層== "4")] <- "04"
T2$年齡層[which(T2$年齡層== "5-9")] <- "05-09"
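
The two assignments above zero-pad the youngest age groups so that they sort first alphabetically. An alternative, sketched here on made-up labels, is to order the labels by their numeric lower bound and store them as a factor, which ggplot2 then plots in level order. The label values, including "70+", are assumptions for illustration.

```r
# Made-up age-group labels in arbitrary order
age_groups <- c("55-59", "4", "5-9", "10-14", "70+", "55-59")

# Extract the leading number of each label as its lower bound
lower <- as.numeric(sub("^([0-9]+).*$", "\\1", age_groups))

# Use the labels, sorted by lower bound, as factor levels
age_factor <- factor(age_groups, levels = unique(age_groups[order(lower)]))

levels(age_factor)
```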

##
T2 %>% group_by(是否為境外移入) %>% count(性別) %>% na.omit() %>% ggplot(aes(x=是否為境外移入,y=n,width=.75,fill=性別)) + geom_bar(stat="identity",position = position_dodge()) + ylab("number of cases") + xlab("Type") + labs(fill="Gender") + scale_fill_manual(labels = c("female", "male"), values = gg_color_hue(2)) + ggtitle("number of cases by gender and area") + scale_x_discrete(labels=c("否" = "domestic", "是" = "imported")) + geom_text(aes(label = n), position = position_dodge(0.8), vjust = -0.3)

## number of cases by age group
T2 %>% group_by(性別) %>% count(年齡層) %>% na.omit() %>% ggplot(aes(x=年齡層,y=n,width=.75,fill=性別)) + geom_bar(stat="identity",position = position_dodge()) + ylab("number of cases") + xlab("Age") + labs(fill="Gender") + scale_fill_manual(labels = c("female", "male"), values = gg_color_hue(2)) + theme(axis.text.x= element_text(angle=90)) + ggtitle("number of cases by age group") 

## number of cases by age group (imported)
T2[T2$是否為境外移入 == "是",] %>% group_by(性別) %>% count(年齡層) %>% na.omit() %>% ggplot(aes(x=年齡層,y=n,width=.75,fill=性別)) + geom_bar(stat="identity",position = position_dodge()) + ylab("number of cases") + xlab("Age") + labs(fill="Gender") + scale_fill_manual(labels = c("female", "male"), values = gg_color_hue(2)) + theme(axis.text.x= element_text(angle=90)) + ggtitle("number of cases by age group - imported")

## number of cases by age group (domestic)
T2[T2$是否為境外移入 == "否",] %>% group_by(性別) %>% count(年齡層) %>% na.omit() %>% ggplot(aes(x=年齡層,y=n,width=.75,fill=性別)) + geom_bar(stat="identity",position = position_dodge()) + ylab("number of cases") + xlab("Age") + labs(fill="Gender") + scale_fill_manual(labels = c("female", "male"), values = gg_color_hue(2)) + theme(axis.text.x= element_text(angle=90)) + ggtitle("number of cases by age group - domestic")
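
One caveat about the plots above: `count()` tallies rows, while the `str(T2)` output shows that the case-count column can hold values above 1. If each row really is an aggregate of several cases, summing that column may be closer to the true totals. Here is a minimal sketch of the difference on toy data. The English column names are stand-ins, and the numbers are made up.

```r
# Toy stand-in for the open-data file; names and values are illustrative
toy <- data.frame(
  imported = c("yes", "yes", "no", "no"),
  gender   = c("F", "M", "F", "F"),
  cases    = c(1, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Number of rows per group (what a plain row count gives)
row_counts <- aggregate(cases ~ imported + gender, data = toy, FUN = length)

# Number of cases per group (sum of the per-row case counts)
case_totals <- aggregate(cases ~ imported + gender, data = toy, FUN = sum)

print(row_counts)
print(case_totals)
```

With dplyr, `count()` accepts a `wt` argument that serves the same purpose.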

Two files were collected from the CDC’s official website. Judging from the first file alone, it may seem confusing that the age group with the most cases is 20-30, which appears to contradict reports that older people are the most vulnerable to the virus. The second graph shows that cases are split almost equally between males and females, both for domestic and for imported cases.

The second file gives more specific age groups. Breaking the age-group data down into domestic and imported cases, we see that the young age groups have the most cases because those cases are mostly citizens returning from abroad, including students and young employees. Looking at domestic cases only, more cases occurred in older age groups.

Key takeaway

Overall, younger age groups (20-35) account for most cases, but this is driven by cases imported from abroad. Among domestic cases only, older age groups account for the most cases.

Location
## Imported cases (境外)
str(T)
## tibble [438 × 15] (S3: tbl_df/tbl/data.frame)
##  $ CDC 新聞稿 URL: chr [1:438] "https://www.cdc.gov.tw/Bulletin/Detail/6oHuoqzW9e_onW0AaMEemg?typeid=9" "https://www.cdc.gov.tw/Bulletin/Detail/ozDpnZZxwa-kBKTXbdS0Kw?typeid=9" "https://www.cdc.gov.tw/Bulletin/Detail/ozDpnZZxwa-kBKTXbdS0Kw?typeid=9" "https://www.cdc.gov.tw/Bulletin/Detail/1lqFGlxtUhCpE_quNLIfLg?typeid=9" ...
##  $ 案例編號      : num [1:438] 1 2 3 4 5 6 7 8 9 10 ...
##  $ 性別          : chr [1:438] "女" "女" "男" "女" ...
##  $ Age           : num [1:438] 50 50 50 50 50 70 70 50 40 40 ...
##  $ 就醫日        : chr [1:438] "43850.0" "43853.0" "43853.0" "43855.0" ...
##  $ 確診日        : Date[1:438], format: "2020-01-21" "2020-01-24" ...
##  $ 相關地點      : chr [1:438] "武漢" "武漢" "武漢" "武漢, 歐洲" ...
##  $ 境外或是本土  : chr [1:438] "境外" "境外" "境外" "境外" ...
##  $ 群組          : chr [1:438] NA NA NA NA ...
##  $ 確診前症狀    : chr [1:438] "發燒、咳嗽、呼吸急促" "發燒" "感冒症狀" "咳嗽" ...
##  $ 其他公布說明  : chr [1:438] "於中國大陸武漢工作,昨日由武漢搭機入境,因有發燒、咳嗽、呼吸急促等症狀,由機場檢疫人員安排後送就醫,X光檢查顯示"| __truncated__ "分別為50多歲中國籍女性(案1)及50多歲男性國人(案2),皆為1月21日入境;案1為1月23日因發燒就醫,案2於1月20日出現感冒"| __truncated__ "分別為50多歲中國籍女性(案1)及50多歲男性國人(案2),皆為1月21日入境;案1為1月23日因發燒就醫,案2於1月20日出現感冒"| __truncated__ "1月13日至15日曾有中國大陸武漢旅遊史,未前往華南海鮮市場,1月16日至25日至歐洲旅遊,個案於1月22日起有咳嗽症狀,25"| __truncated__ ...
##  $ 死亡日        : POSIXct[1:438], format: NA NA ...
##  $ 解除隔離日    : POSIXct[1:438], format: "2020-02-06" NA ...
##  $ ...14         : logi [1:438] NA NA NA NA NA NA ...
##  $ 個案頁面建立  : chr [1:438] "✔" "✔" "✔" "✔" ...
location <- T[T$境外或是本土 == "境外", 7]
location =Corpus(VectorSource(location$相關地點))
location[[1]]$content 
## [1] "武漢"
for(i in seq(location)){
  location[[i]]<-gsub(","," ",location[[i]])
  location[[i]]<-gsub("、"," ",location[[i]])
  }

location[[1]]$content
## [1] "武漢"
location = DocumentTermMatrix(location)

## turn into matrix 
location = as.data.frame(as.matrix(location))

## collect the term frequencies into a data frame for wordcloud2
freq <- colSums(location)
t1 <- data.frame(Var = names(freq), Freq = as.numeric(freq))

library(wordcloud2)
wordcloud2(t1)

The word cloud shows that the United States, the United Kingdom, France, Spain, and Turkey are the countries that most infected citizens had visited.
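
The frequencies behind such a word cloud are just counts of each place name across the comma-separated location strings. As a cross-check on the tm pipeline, the same tally can be sketched in base R. The place strings below are toy values, and `t1_sketch` is a hypothetical name, not the `t1` used above.

```r
# Toy location strings; "\u3001" is the ideographic comma used in the data
places <- c("Wuhan", "Wuhan, Europe", "USA\u3001UK", "USA")

# Normalise the ideographic comma to a plain comma, then split and trim
normalised <- gsub("\u3001", ",", places, fixed = TRUE)
terms <- trimws(unlist(strsplit(normalised, ",", fixed = TRUE)))

# Tally each place and reshape into the two-column layout used for the cloud
freq <- sort(table(terms), decreasing = TRUE)
t1_sketch <- data.frame(Var = names(freq), Freq = as.integer(freq),
                        stringsAsFactors = FALSE)
print(t1_sketch)
```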

Key takeaway

  1. The US and Europe are the main regions that infected citizens had visited before returning to Taiwan.
## Domestic cases (境內)
library(ggplot2)
library(showtext)
## Warning: package 'showtext' was built under R version 3.5.2
## Loading required package: sysfonts
## Loading required package: showtextdb
showtext_auto()
#### number of cases among cities in Taiwan / domestic
T2[T2$是否為境外移入 == "否",] %>% count(縣市) %>% na.omit() %>% ggplot(aes(x=reorder(縣市,n),y=n,width=.75)) + geom_bar(fill = "#FF6666",stat="identity",position = position_dodge()) + ylab("number of cases") + xlab("") + ggtitle("number of cases among cities in Taiwan / domestic") 
#### number of cases among cities in Taiwan / including imported cases
T2 %>% group_by(是否為境外移入) %>% count(縣市) %>% na.omit() %>% ggplot(aes(x=reorder(縣市,n),y=n,width=.75,fill=是否為境外移入)) + geom_bar(stat="identity",position = position_dodge()) + ylab("number of cases") + xlab("") + labs(fill="Type") + ggtitle("number of cases among cities in Taiwan") + theme(axis.text.x= element_text(angle=90)) + scale_fill_manual(labels = c("domestic", "imported"), values = gg_color_hue(2))

Number of cases among cities in Taiwan / domestic

Number of cases among cities in Taiwan

Key takeaways

  1. Domestically, New Taipei, Taipei, and Taoyuan had the most cases.
  2. When imported cases are included, the order is much the same: Taipei becomes the city with the most cases, followed by New Taipei, Kaohsiung, and Taoyuan. Note that Kaohsiung has no domestic cases; it rose to the top 3 because of the warship incident, since the ship docked at the port of Kaohsiung.
Symptom Analysis
## build corpus
corpus = Corpus(VectorSource(T$確診前症狀))
corpus[[1]]$content 
## [1] "發燒、咳嗽、呼吸急促"
## remove Punctuation
for(i in seq(corpus)){
  corpus[[i]]<-gsub("、"," ",corpus[[i]]) }

corpus[[1]]$content 
## [1] "發燒 咳嗽 呼吸急促"
##count frequency 
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 438, terms: 150)>>
## Non-/sparse entries: 1067/64633
## Sparsity           : 98%
## Maximal term length: 17
## Weighting          : term frequency (tf)
## turn into matrix 
matrix = as.data.frame(as.matrix(frequencies))

## collect the term frequencies into a data frame for wordcloud2
freq <- colSums(matrix)
t3 <- data.frame(Var = names(freq), Freq = as.numeric(freq))

library(wordcloud2)
Symptoms

Again, we use a word cloud to observe the symptoms of the cases.

Key takeaway

  1. The most common symptoms among the cases are fever (162), coughing (156), runny nose (92), sore throat (82), stuffy nose (49), and headache (35).
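
A ranked list like this can be read off a frequency table by sorting it. The sketch below uses the six counts quoted in this section with English labels. The data frame `symptoms` is a hypothetical stand-in for the full term-frequency table.

```r
# Counts quoted in the text, entered in arbitrary order
symptoms <- data.frame(
  Var  = c("headache", "fever", "stuffy nose", "coughing",
           "sore throat", "runny nose"),
  Freq = c(35, 162, 49, 156, 82, 92),
  stringsAsFactors = FALSE
)

# Sort by frequency, most common first, and show the top three
top <- symptoms[order(symptoms$Freq, decreasing = TRUE), ]
head(top, 3)
```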

Motion charts
library(googleVis)
## Creating a generic function for 'toJSON' from package 'jsonlite' in package 'googleVis'
## 
## Welcome to googleVis version 0.6.3
## 
## Please read Google's Terms of Use
## before you start using the package:
## https://developers.google.com/terms/
## 
## Note, the plot method of googleVis will by default use
## the standard browser to display its output.
## 
## See the googleVis package vignettes for more details,
## or visit https://github.com/mages/googleVis.
## 
## To suppress this message use:
## suppressPackageStartupMessages(library(googleVis))
### Taiwan data sorted by cities
####google Vis 
T3 = T2 %>% 
  group_by (個案研判日,縣市) %>%
  summarise(   
    group_size=n()) %>% ungroup()

str(T3)
## tibble [229 × 3] (S3: tbl_df/tbl/data.frame)
##  $ 個案研判日: Date[1:229], format: "2020-01-22" "2020-01-24" ...
##  $ 縣市      : chr [1:229] "高雄市" "台北市" "高雄市" "台北市" ...
##  $ group_size: int [1:229] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   確定病名 = col_character(),
##   ..   個案研判日 = col_date(format = ""),
##   ..   縣市 = col_character(),
##   ..   性別 = col_character(),
##   ..   是否為境外移入 = col_character(),
##   ..   年齡層 = col_character(),
##   ..   確定病例數 = col_double()
##   .. )
as.Date(T3$個案研判日) -> T3$個案研判日
Motion=gvisMotionChart(T3, 
                       "縣市",
                       "個案研判日")
plot(Motion)
## starting httpd help server ...
##  done
### Global data
library(readr)
setwd("~/Desktop/Covid/novel-corona-virus-2019-dataset")
G <- read_csv("covid_19_data.csv", 
    col_types = cols(ObservationDate = col_date(format = "%m/%d/%Y")))
View(G)

names(G)[4] <- "Country"
names(G)
## [1] "SNo"             "ObservationDate" "Province/State"  "Country"        
## [5] "Last Update"     "Confirmed"       "Deaths"          "Recovered"
G1 = G %>% group_by(ObservationDate,Country) %>%
  summarise(
    confirmed = sum(Confirmed),
    Deaths = sum(Deaths),
    Recovered = sum(Recovered)
  ) %>% ungroup()

Motion1=gvisMotionChart(G1, 
                       "Country",
                       "ObservationDate")
plot(Motion1)

setwd("~/Desktop/Covid")
Trend of confirmed cases among cities in Taiwan

Trend of confirmed cases in the world

Trend of confirmed cases and deaths in the world

Lastly, the googleVis graphs show the trend of confirmed cases both in Taiwan and in the world.

As discussed in the previous section, after the peak between March and April and the military incident, the number of cases among cities in Taiwan has been well controlled and shows a downward trend.

Globally, the US had the most confirmed cases, followed by Spain, Italy, and the UK. Overall, this is a difficult time for the US and Europe.
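
A ranking like this can also be checked numerically rather than read off the motion chart. The sketch below assumes that `Confirmed` in the global file is a cumulative count reported per province, so it keeps only the latest date and sums over provinces within each country. All values are made up, and `Province` is a shortened stand-in for the `Province/State` column.

```r
# Toy stand-in for the global file; all numbers are made up
g <- data.frame(
  ObservationDate = as.Date(c("2020-05-03", "2020-05-03", "2020-05-04",
                              "2020-05-04", "2020-05-04")),
  Province  = c("A", NA, "A", "B", NA),
  Country   = c("X", "Y", "X", "X", "Y"),
  Confirmed = c(10, 30, 12, 5, 31),
  stringsAsFactors = FALSE
)

# Assumption: Confirmed is cumulative, so only the latest date is needed
latest <- g[g$ObservationDate == max(g$ObservationDate), ]

# Sum over provinces within each country, then rank countries
by_country <- aggregate(Confirmed ~ Country, data = latest, FUN = sum)
ranking <- by_country[order(by_country$Confirmed, decreasing = TRUE), ]
print(ranking)
```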

In the past few days, Taiwan has recorded zero new cases for 7 consecutive days. Meanwhile, countries around the world have also started to see progress in containing the virus and reducing confirmed cases.

Key takeaway

Stay home, stay safe, and we will get through this together.