We have chosen to investigate the coronavirus data set provided by John Hopkins University which themselves source the data from DXY, an online platform run by members of the Chinese nedical community.
COVID-19, the scientific name of the coronavirus, had its first recorded case in Wuhan, China in December 2019 and has since across the world with more than 100,000 cases globally as of early March.
We will start by looking at general trends of the spread of the virus, followed by a look at the mortality rate. We finish by investigating the growth rate of confirmed cases, specifically in the most mature case, China, to see if we can find any indication about growth there and make any conclusions about the future of the virus in the rest of the world.
We load all required libraries:
library(httr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(date)
library(tidyverse)
We download there seperate CSV (for confirmed cases, recoveries and deaths) from the John Hopkins GitHub account and save them as data frames:
urlRemote <- "https://raw.githubusercontent.com/"
pathGithub <- "CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
fileNameConfirmed <- "time_series_19-covid-Confirmed.csv"
fileNameRecovered <- "time_series_19-covid-Recovered.csv"
fileNameDeaths <- "time_series_19-covid-Deaths.csv"
confs <- read.csv(paste0(urlRemote, pathGithub, fileNameConfirmed),header = TRUE)
recovs<- read.csv(paste0(urlRemote, pathGithub, fileNameRecovered),header = TRUE)
deaths <- read.csv(paste0(urlRemote, pathGithub, fileNameDeaths),header = TRUE)
kable(head(confs[,1:10]))
| Province.State | Country.Region | Lat | Long | X1.22.20 | X1.23.20 | X1.24.20 | X1.25.20 | X1.26.20 | X1.27.20 |
|---|---|---|---|---|---|---|---|---|---|
| Anhui | Mainland China | 31.8257 | 117.2264 | 1 | 9 | 15 | 39 | 60 | 70 |
| Beijing | Mainland China | 40.1824 | 116.4142 | 14 | 22 | 36 | 41 | 68 | 80 |
| Chongqing | Mainland China | 30.0572 | 107.8740 | 6 | 9 | 27 | 57 | 75 | 110 |
| Fujian | Mainland China | 26.0789 | 117.9874 | 1 | 5 | 10 | 18 | 35 | 59 |
| Gansu | Mainland China | 36.0611 | 103.8343 | 0 | 2 | 2 | 4 | 7 | 14 |
| Guangdong | Mainland China | 23.3417 | 113.4244 | 26 | 32 | 53 | 78 | 111 | 151 |
As we can see in the above table the data is provide in a long format. In the below section we transform all 3 data frames to take long format:
#Creating a long table using gather
confs <- tidyr::gather(confs, "Date", "Amount", -Province.State, -Country.Region, -Lat, -Long)
#transforming the date column into a date structure:
confs$Date<-as.Date(confs$Date,format='X%m.%d.%y')
#adding a status label to flag data belongs to confirmed data set
confs$Status <- 'confirmed'
#Creating a long table using gather
recovs <- tidyr::gather(recovs, "Date", "Amount", -Province.State, -Country.Region, -Lat, -Long)
#transforming the date column into a date structure:
recovs$Date<-as.Date(recovs$Date,format='X%m.%d.%y')
#adding a status label to flag data belongs to recovered data set
recovs$Status <- 'recovered'
#Creating a long table using gather
deaths <- tidyr::gather(deaths, "Date", "Amount", -Province.State, -Country.Region, -Lat, -Long)
#transforming the date column into a date structure:
deaths$Date<-as.Date(deaths$Date,format='X%m.%d.%y')
#adding a status label to flag data belongs to deaddata set
deaths$Status <- 'dead'
kable(head(confs))
| Province.State | Country.Region | Lat | Long | Date | Amount | Status |
|---|---|---|---|---|---|---|
| Anhui | Mainland China | 31.8257 | 117.2264 | 2020-01-22 | 1 | confirmed |
| Beijing | Mainland China | 40.1824 | 116.4142 | 2020-01-22 | 14 | confirmed |
| Chongqing | Mainland China | 30.0572 | 107.8740 | 2020-01-22 | 6 | confirmed |
| Fujian | Mainland China | 26.0789 | 117.9874 | 2020-01-22 | 1 | confirmed |
| Gansu | Mainland China | 36.0611 | 103.8343 | 2020-01-22 | 0 | confirmed |
| Guangdong | Mainland China | 23.3417 | 113.4244 | 2020-01-22 | 26 | confirmed |
The above example shows the first rows of our new long data frame for the confirmed data set. The recovered and dead data set have the identiacal layout, only with different values in the Amount and Status column.
Next we want to combine the confirmed, recovered and dead data set into on master data set:
#we stack all three data sets underneath each other
master <- rbind(confs,recovs,deaths)
#we then widen the data set to have three new columns one for each amount of confirmed, recovered and dead cases:
master <- tidyr::spread(master,Status,Amount)
kable(head(master))
| Province.State | Country.Region | Lat | Long | Date | confirmed | dead | recovered |
|---|---|---|---|---|---|---|---|
| Anhui | Mainland China | 31.8257 | 117.2264 | 2020-01-22 | 1 | 0 | 0 |
| Beijing | Mainland China | 40.1824 | 116.4142 | 2020-01-22 | 14 | 0 | 0 |
| Chongqing | Mainland China | 30.0572 | 107.8740 | 2020-01-22 | 6 | 0 | 0 |
| Fujian | Mainland China | 26.0789 | 117.9874 | 2020-01-22 | 1 | 0 | 0 |
| Gansu | Mainland China | 36.0611 | 103.8343 | 2020-01-22 | 0 | 0 | 0 |
| Guangdong | Mainland China | 23.3417 | 113.4244 | 2020-01-22 | 26 | 0 | 0 |
The above is now our clean up data set that we will work with for the remainder of the investiagtion
First, we will gain a view of the global trends of confirmed cases, recoveries and deaths (going forward referred to as measurement tripplet):
#aggregate the data to make our original data set into a global data set with no more locational details:
mastertotal <- stats::aggregate(master[ ,6:8], FUN = "sum", by = list(master$Date))
#rename the date column to be date again
names(mastertotal)[1]<-"Date"
kable(head(mastertotal))
| Date | confirmed | dead | recovered |
|---|---|---|---|
| 2020-01-22 | 555 | 17 | 28 |
| 2020-01-23 | 653 | 18 | 30 |
| 2020-01-24 | 941 | 26 | 36 |
| 2020-01-25 | 1434 | 42 | 39 |
| 2020-01-26 | 2118 | 56 | 52 |
| 2020-01-27 | 2927 | 82 | 61 |
The global data set provides our measurement tripplet for global numbers for each date.
#sizing up our y axis
max_y <- max(mastertotal$confirmed)*1.1
#defining our color scheme and legend names:
colors <- c("confirmed" = "blue", "dead" = "red", "recovered" = "green")
#using ggplot with three geom_lines to show our measurement triplet in one graph:
ggplot(mastertotal, aes(x = Date)) +
geom_line(aes(y = confirmed, color = "confirmed"), size = 1.2) +
geom_line(aes(y = recovered, color = "recovered"), size = 1.2) +
geom_line(aes(y = dead, color = "dead"), size = 1.2) +
labs(x = "Date",
y = "Number of patients",
color = "Legend") +
scale_color_manual(values = colors) +
ggtitle("Global confirmed, recovered and dead patients") +
theme(plot.title = element_text(hjust = 0.5))
One behavior we can see is a large spike in confirmed cases around February 10th which was a correction of measurement by the Chinese governement, strongly increasing the number of cases in a short span of time. We can see that after some lead time the recovery curve has moved somewhat in parallel while the curve for the cases that have passed away seemingly has not changed over the span of recording. This is mostly due to the very different scale which makes it is difficult to evaluate the behavior of deaths versus new confirmed cases. Therefore, I am interested in looking at the death per confirmed cases (mortality) ratio.
Before we begin, we need to acknowledge that there will be a lag between the measurment of a confirmed case and a death cases, however since we do not have any information about how much that seems to be we will ignore this lag and see which conclusions we can draw.
#create a new column which contains the ratio of cases of death and total confirmed cases
mastertotal$deathratio <- mastertotal$dead/mastertotal$confirmed*100
colors <- c("death per confirmed" = "black")
ggplot(mastertotal, aes(x = Date)) +
geom_line(aes(y = deathratio, color = "death per confirmed"), size = 1.2) +
labs(x = "Date",
y = "death/confirmed (%)",
color = "Legend") +
scale_color_manual(values = colors) +
ggtitle("Deaths per confirmed case") +
theme(plot.title = element_text(hjust = 0.5))
We can see that the death per confirmed case reduced to 2% in the first half of February, which was the number that was in the news throughout that time period. However, more recently, reporters have been speaking of a mortality rate closer to 3%, an uptick which can also be seen in the graph. This uptick could be precisely due to the aforementioned lag between a person being tested positive and passing away.
I would like to see what the calculation looks like when we reduce the sample set to China, to see if we see similar behavior in a more “mature” state of the crisis.
#we filter our master data set to only include mainland china
masterChina <- master %>% filter(Country.Region == "Mainland China")
#we then again remove all locational details to comtinue our investigation only with our measurement triplet for each date specific to China
masterChinatotal <- stats::aggregate(masterChina[ ,6:8], FUN = "sum", by = list(masterChina$Date))
names(masterChinatotal)[1]<-"Date"
#create a new column for mortality rate
masterChinatotal$deathratio <-masterChinatotal$dead/masterChinatotal$confirmed*100
colors <- c("death per confirmed" = "black")
ggplot(masterChinatotal, aes(x = Date)) +
geom_line(aes(y = deathratio, color = "death per confirmed"), size = 1.2) +
labs(x = "Date",
y = "death/confirmed (%)",
color = "Legend") +
scale_color_manual(values = colors) +
ggtitle("Deaths per confirmed case in China") +
theme(plot.title = element_text(hjust = 0.5))
In general, we can see that both the global and Chinese mortality rate graph look very similar. This is primarily due to the fact that the number of Chinese confirmed and death cases vastly outweighs the rest of the world. We can see, however, that while the global mortality rate actually seemed to be plateauing that of China seems to be growing steadily towards the right end of our graphs. Again, this could be due to the fact that while the many countries are only now seeing their first confirmed cases with very few deaths, China might be seeing a more realistic mortality rate - not a great sign for this to come for the rest of the world.
To not leave this incestigation on a too pesimistic note we will look at the growth rate of confirmed cases in China:
masterChinatotal$confirmshift <- masterChinatotal$confirm
We create a 1 day shifted confirmed cases column in order to compare it the non shifted column:
masterChinatotal$confirmshift[2:length(masterChinatotal$confirmshift)] <- masterChinatotal$confirmshift[1:(length(masterChinatotal$confirmshift)-1)]
masterChinatotal$confirmgrowth <- masterChinatotal$confirmed / masterChinatotal$confirmshift -1
In the below graph we can see a clear reduction …
colors <- c("confirmed growth" = "black")
ggplot(masterChinatotal, aes(x = Date)) +
geom_line(aes(y = confirmgrowth, color = "confirmed growth"), size = 1.2) +
labs(x = "Date",
y = "daily confirmed growth (%)",
color = "Legend") +
scale_color_manual(values = colors) +
ggtitle("Daily Confirmed Growth Rate") +
theme(plot.title = element_text(hjust = 0.5))
…and we can see even clearer when looking at more recent and stable data that there in no strong growth in new cases:
numrows <- nrow(masterChinatotal)
colors <- c("confirmed growth" = "black")
ggplot(masterChinatotal[25:numrows,], aes(x = Date)) +
geom_line(aes(y = confirmgrowth, color = "confirmed growth"), size = 1.2) +
labs(x = "Date",
y = "daily confirmed growth (%)",
color = "Legend") +
scale_color_manual(values = colors) +
ggtitle("Daily Confirmed Growth Rate") +
theme(plot.title = element_text(hjust = 0.5))
GitHub: https://github.com/chilleundso/DATA607/blob/master/Assignment5/Data607_Assignment5.Rmd
RPubs: https://rpubs.com/ManolisM/Data607_Assignment5_flight_delays