Data 607 - Project 2

We have chosen to investigate the coronavirus data set provided by John Hopkins University which themselves source the data from DXY, an online platform run by members of the Chinese nedical community.

COVID-19, the scientific name of the coronavirus, had its first recorded case in Wuhan, China in December 2019 and has since across the world with more than 100,000 cases globally as of early March.

We will start by looking at general trends of the spread of the virus, followed by a look at the mortality rate. We finish by investigating the growth rate of confirmed cases, specifically in the most mature case, China, to see if we can find any indication about growth there and make any conclusions about the future of the virus in the rest of the world.

1) Loading required libraries and downloading data

We load all required libraries:

library(httr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(date)
library(tidyverse)

We download there seperate CSV (for confirmed cases, recoveries and deaths) from the John Hopkins GitHub account and save them as data frames:

urlRemote  <- "https://raw.githubusercontent.com/"
pathGithub <- "CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
fileNameConfirmed   <- "time_series_19-covid-Confirmed.csv"
fileNameRecovered   <- "time_series_19-covid-Recovered.csv"
fileNameDeaths   <- "time_series_19-covid-Deaths.csv"


confs <- read.csv(paste0(urlRemote, pathGithub, fileNameConfirmed),header = TRUE)
recovs<- read.csv(paste0(urlRemote, pathGithub, fileNameRecovered),header = TRUE)
deaths <- read.csv(paste0(urlRemote, pathGithub, fileNameDeaths),header = TRUE)
 
kable(head(confs[,1:10]))

Province.State	Country.Region	Lat	Long	X1.22.20	X1.23.20	X1.24.20	X1.25.20	X1.26.20	X1.27.20
Anhui	Mainland China	31.8257	117.2264	1	9	15	39	60	70
Beijing	Mainland China	40.1824	116.4142	14	22	36	41	68	80
Chongqing	Mainland China	30.0572	107.8740	6	9	27	57	75	110
Fujian	Mainland China	26.0789	117.9874	1	5	10	18	35	59
Gansu	Mainland China	36.0611	103.8343	0	2	2	4	7	14
Guangdong	Mainland China	23.3417	113.4244	26	32	53	78	111	151

2) Data Clean-up

As we can see in the above table the data is provide in a long format. In the below section we transform all 3 data frames to take long format:

#Creating a long table using gather
confs <- tidyr::gather(confs, "Date", "Amount", -Province.State, -Country.Region, -Lat, -Long)
#transforming the date column into a date structure:
confs$Date<-as.Date(confs$Date,format='X%m.%d.%y')
#adding a status label to flag data belongs to confirmed data set
confs$Status <- 'confirmed'

#Creating a long table using gather
recovs <- tidyr::gather(recovs, "Date", "Amount", -Province.State, -Country.Region, -Lat, -Long)
#transforming the date column into a date structure:
recovs$Date<-as.Date(recovs$Date,format='X%m.%d.%y')
#adding a status label to flag data belongs to recovered data set
recovs$Status <- 'recovered'

#Creating a long table using gather
deaths <- tidyr::gather(deaths, "Date", "Amount", -Province.State, -Country.Region, -Lat, -Long)
#transforming the date column into a date structure:
deaths$Date<-as.Date(deaths$Date,format='X%m.%d.%y')
#adding a status label to flag data belongs to deaddata set
deaths$Status <- 'dead'

kable(head(confs))

Province.State	Country.Region	Lat	Long	Date	Amount	Status
Anhui	Mainland China	31.8257	117.2264	2020-01-22	1	confirmed
Beijing	Mainland China	40.1824	116.4142	2020-01-22	14	confirmed
Chongqing	Mainland China	30.0572	107.8740	2020-01-22	6	confirmed
Fujian	Mainland China	26.0789	117.9874	2020-01-22	1	confirmed
Gansu	Mainland China	36.0611	103.8343	2020-01-22	0	confirmed
Guangdong	Mainland China	23.3417	113.4244	2020-01-22	26	confirmed

The above example shows the first rows of our new long data frame for the confirmed data set. The recovered and dead data set have the identiacal layout, only with different values in the Amount and Status column.

Next we want to combine the confirmed, recovered and dead data set into on master data set:

#we stack all three data sets underneath each other
master <- rbind(confs,recovs,deaths)
#we then widen the data set to have three new columns one for each amount of confirmed, recovered and dead cases:
master <- tidyr::spread(master,Status,Amount)

kable(head(master))

Province.State	Country.Region	Lat	Long	Date	confirmed
Anhui	Mainland China	31.8257	117.2264	2020-01-22	1
Beijing	Mainland China	40.1824	116.4142	2020-01-22	14
Chongqing	Mainland China	30.0572	107.8740	2020-01-22	6
Fujian	Mainland China	26.0789	117.9874	2020-01-22	1
Gansu	Mainland China	36.0611	103.8343	2020-01-22	0
Guangdong	Mainland China	23.3417	113.4244	2020-01-22	26

The above is now our clean up data set that we will work with for the remainder of the investiagtion

3) Global Trend Investigation

First, we will gain a view of the global trends of confirmed cases, recoveries and deaths (going forward referred to as measurement tripplet):

#aggregate the data to make our original data set into a global data set with no more locational details:
mastertotal <- stats::aggregate(master[ ,6:8], FUN = "sum", by = list(master$Date))
#rename the date column to be date again
names(mastertotal)[1]<-"Date"

kable(head(mastertotal))

Date	confirmed	dead	recovered
2020-01-22	555	17	28
2020-01-23	653	18	30
2020-01-24	941	26	36
2020-01-25	1434	42	39
2020-01-26	2118	56	52
2020-01-27	2927	82	61

The global data set provides our measurement tripplet for global numbers for each date.

#sizing up our y axis
max_y <- max(mastertotal$confirmed)*1.1
#defining our color scheme and legend names:
colors <- c("confirmed" = "blue", "dead" = "red", "recovered" = "green")

#using ggplot with three geom_lines to show our measurement triplet in one graph:
ggplot(mastertotal, aes(x = Date)) +
    geom_line(aes(y = confirmed, color = "confirmed"), size = 1.2) +
    geom_line(aes(y = recovered, color = "recovered"), size = 1.2) +
    geom_line(aes(y = dead, color = "dead"), size = 1.2) +
    labs(x = "Date",
         y = "Number of patients",
         color = "Legend") +
    scale_color_manual(values = colors) +
    ggtitle("Global confirmed, recovered and dead patients") +
    theme(plot.title = element_text(hjust = 0.5))

One behavior we can see is a large spike in confirmed cases around February 10th which was a correction of measurement by the Chinese governement, strongly increasing the number of cases in a short span of time. We can see that after some lead time the recovery curve has moved somewhat in parallel while the curve for the cases that have passed away seemingly has not changed over the span of recording. This is mostly due to the very different scale which makes it is difficult to evaluate the behavior of deaths versus new confirmed cases. Therefore, I am interested in looking at the death per confirmed cases (mortality) ratio.

4) Mortality Rate Globally

Before we begin, we need to acknowledge that there will be a lag between the measurment of a confirmed case and a death cases, however since we do not have any information about how much that seems to be we will ignore this lag and see which conclusions we can draw.

#create a new column which contains the ratio of cases of death and total confirmed cases
mastertotal$deathratio <- mastertotal$dead/mastertotal$confirmed*100
colors <- c("death per confirmed" = "black")

ggplot(mastertotal, aes(x = Date)) +
    geom_line(aes(y = deathratio, color = "death per confirmed"), size = 1.2) +
    labs(x = "Date",
         y = "death/confirmed (%)",
         color = "Legend") +
    scale_color_manual(values = colors) +
    ggtitle("Deaths per confirmed case") +
    theme(plot.title = element_text(hjust = 0.5))

We can see that the death per confirmed case reduced to 2% in the first half of February, which was the number that was in the news throughout that time period. However, more recently, reporters have been speaking of a mortality rate closer to 3%, an uptick which can also be seen in the graph. This uptick could be precisely due to the aforementioned lag between a person being tested positive and passing away.

5) Focusing on China

I would like to see what the calculation looks like when we reduce the sample set to China, to see if we see similar behavior in a more “mature” state of the crisis.

#we filter our master data set to only include mainland china
masterChina <-  master %>% filter(Country.Region == "Mainland China")
#we then again remove all locational details to comtinue our investigation only with our measurement triplet for each date specific to China
masterChinatotal <- stats::aggregate(masterChina[ ,6:8], FUN = "sum", by = list(masterChina$Date))
names(masterChinatotal)[1]<-"Date"

#create a new column for mortality rate
masterChinatotal$deathratio <-masterChinatotal$dead/masterChinatotal$confirmed*100

colors <- c("death per confirmed" = "black")
ggplot(masterChinatotal, aes(x = Date)) +
    geom_line(aes(y = deathratio, color = "death per confirmed"), size = 1.2) +
    labs(x = "Date",
         y = "death/confirmed (%)",
         color = "Legend") +
    scale_color_manual(values = colors) +
    ggtitle("Deaths per confirmed case in China") +
    theme(plot.title = element_text(hjust = 0.5))

In general, we can see that both the global and Chinese mortality rate graph look very similar. This is primarily due to the fact that the number of Chinese confirmed and death cases vastly outweighs the rest of the world. We can see, however, that while the global mortality rate actually seemed to be plateauing that of China seems to be growing steadily towards the right end of our graphs. Again, this could be due to the fact that while the many countries are only now seeing their first confirmed cases with very few deaths, China might be seeing a more realistic mortality rate - not a great sign for this to come for the rest of the world.

To not leave this incestigation on a too pesimistic note we will look at the growth rate of confirmed cases in China:

masterChinatotal$confirmshift <- masterChinatotal$confirm

We create a 1 day shifted confirmed cases column in order to compare it the non shifted column:

 masterChinatotal$confirmshift[2:length(masterChinatotal$confirmshift)] <- masterChinatotal$confirmshift[1:(length(masterChinatotal$confirmshift)-1)]

 masterChinatotal$confirmgrowth <- masterChinatotal$confirmed / masterChinatotal$confirmshift -1

In the below graph we can see a clear reduction …

colors <- c("confirmed growth" = "black")
ggplot(masterChinatotal, aes(x = Date)) +
    geom_line(aes(y = confirmgrowth, color = "confirmed growth"), size = 1.2) +
    labs(x = "Date",
         y = "daily confirmed growth (%)",
         color = "Legend") +
    scale_color_manual(values = colors) +
    ggtitle("Daily Confirmed Growth Rate") +
    theme(plot.title = element_text(hjust = 0.5))

…and we can see even clearer when looking at more recent and stable data that there in no strong growth in new cases:

numrows <- nrow(masterChinatotal)

colors <- c("confirmed growth" = "black")

ggplot(masterChinatotal[25:numrows,], aes(x = Date)) +
    geom_line(aes(y = confirmgrowth, color = "confirmed growth"), size = 1.2) +
    labs(x = "Date",
         y = "daily confirmed growth (%)",
         color = "Legend") +
    scale_color_manual(values = colors) +
    ggtitle("Daily Confirmed Growth Rate") +
    theme(plot.title = element_text(hjust = 0.5))

GitHub: https://github.com/chilleundso/DATA607/blob/master/Assignment5/Data607_Assignment5.Rmd

RPubs: https://rpubs.com/ManolisM/Data607_Assignment5_flight_delays

Data 607 - Project 2 - Part 1

Manolis Manoli

1) Loading required libraries and downloading data

2) Data Clean-up

3) Global Trend Investigation

4) Mortality Rate Globally

5) Focusing on China