2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC
This dataset has daily-level information on the number of cases, deaths, and recoveries from the 2019 novel coronavirus. Note that this is time series data, so the number of cases on any given day is cumulative.
The data is available from 22 Jan, 2020.
The data used for this analysis can be downloaded here: Covid-19 Data
This dataset focuses on the early stages of the outbreak and shows how the epidemic gradually turned into a pandemic.
library(ggplot2)
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
knitr::opts_chunk$set(echo = TRUE)
data<-read.csv("C:/Users/virar/Desktop/ranga/R/covid/COVID19data.csv")
dim(data)
## [1] 1085 27
str(data)
## 'data.frame': 1085 obs. of 27 variables:
## $ ï..id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ case_in_country : int NA NA NA NA NA NA NA NA NA NA ...
## $ reporting.date : chr "1/20/2020" "1/20/2020" "1/21/2020" "1/21/2020" ...
## $ X : logi NA NA NA NA NA NA ...
## $ summary : chr "First confirmed imported COVID-19 pneumonia patient in Shenzhen (from Wuhan): male, 66, shenzheng residence, vi"| __truncated__ "First confirmed imported COVID-19 pneumonia patient in Shanghai (from Wuhan): female, 56, Wuhan residence, arri"| __truncated__ "First confirmed imported cases in Zhejiang: patient is male, 46, lives in Wuhan, self-driving from Wuhan to Han"| __truncated__ "new confirmed imported COVID-19 pneumonia in Tianjin: female, age 60, recently visited Wuhan, visited fever cli"| __truncated__ ...
## $ location : chr "Shenzhen, Guangdong" "Shanghai" "Zhejiang" "Tianjin" ...
## $ country : chr "China" "China" "China" "China" ...
## $ gender : chr "male" "female" "male" "female" ...
## $ age : num 66 56 46 60 58 44 34 37 39 56 ...
## $ symptom_onset : chr "01/03/20" "1/15/2020" "01/04/20" NA ...
## $ If_onset_approximated: int 0 0 0 NA NA 0 0 0 0 0 ...
## $ hosp_visit_date : chr "01/11/20" "1/15/2020" "1/17/2020" "1/19/2020" ...
## $ exposure_start : chr "12/29/2019" NA NA NA ...
## $ exposure_end : chr "01/04/20" "01/12/20" "01/03/20" NA ...
## $ visiting.Wuhan : int 1 0 0 1 0 0 0 1 1 1 ...
## $ from.Wuhan : int 0 1 1 0 0 1 1 0 0 0 ...
## $ death : chr "0" "0" "0" "0" ...
## $ recovered : chr "0" "0" "0" "0" ...
## $ symptom : chr "" "" "" "" ...
## $ source : chr "Shenzhen Municipal Health Commission" "Official Weibo of Shanghai Municipal Health Commission" "Health Commission of Zhejiang Province" "人民日报官方微博" ...
## $ link : chr "http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm" "https://www.weibo.com/2372649470/IqogQhgfa?from=page_1001062372649470_profile&wvr=6&mod=weibotime&type=comment" "http://www.zjwjw.gov.cn/art/2020/1/21/art_1202101_41786033.html" "https://m.weibo.cn/status/4463235401268457?" ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ X.2 : logi NA NA NA NA NA NA ...
## $ X.3 : logi NA NA NA NA NA NA ...
## $ X.4 : logi NA NA NA NA NA NA ...
## $ X.5 : logi NA NA NA NA NA NA ...
## $ X.6 : logi NA NA NA NA NA NA ...
We’ll focus primarily on the “death” column of the data, as our first objective is to calculate the death rate.
str(data$death)
## chr [1:1085] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
unique(data$death)
## [1] "0" "1" "2/14/2020" "2/26/2020" "2/13/2020" "2/28/2020"
## [7] "2/27/2020" "2/25/2020" "2/23/2020" "2/24/2020" "2/22/2020" "02/01/20"
## [13] "2/19/2020" "2/21/2020"
Some values in the column are “0”, meaning the patient did not die, and some are “1”, meaning the patient died. The remaining values record the date of the patient’s death instead of a flag. Since dates would be awkward to handle in the rest of the project, we’ll recode every non-zero value as 1 and then calculate the death rate.
data$death_number<-as.integer(data$death!="0") #every non-zero value (a "1" or a date of death) becomes 1, "0" becomes 0, and NAs stay NA
unique(data$death_number)
## [1] 0 1
death_rate<-sum(data$death_number)/nrow(data) #death rate calculation
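The recoding and rate calculation above can be tried on a small example. This is a minimal sketch in base R: the `death` vector below is a hypothetical toy stand-in for `data$death`, not the real column, and it uses `mean(..., na.rm = TRUE)` instead of `sum()/nrow()` so that a missing value would not silently count as a survivor.

```r
# Hypothetical toy stand-in for data$death: flags, dates of death, and one NA
death <- c("0", "1", "2/14/2020", "0", NA, "02/01/20")

# Any value other than "0" counts as a death; NA stays NA
death_number <- as.integer(death != "0")

# Share of deaths among the records with a known outcome
death_rate <- mean(death_number, na.rm = TRUE)

print(death_number)
print(death_rate)
```

When the column has no missing values, as `unique(data$death_number)` suggests here, both ways of computing the rate give the same number.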
Now, we’ll check whether the patients who survived were younger than those who died, and whether males were more likely than females to die from the coronavirus.
num_dead<- subset(data, data$death_number==1) #subsetting data based on whether the patient was alive or not
num_alive<- subset( data, data$death_number ==0)
mean(num_alive$age, na.rm= T) #mean age of each group, to compare survivors with patients who died
## [1] 48.07229
mean(num_dead$age, na.rm = T)
## [1] 68.58621
num_males<- subset(data, data$gender=="male") #subsetting data based on the gender of the patient
num_females<- subset( data, data$gender =="female")
mean(num_females$death_number, na.rm = T) #mean of the 0/1 death flag, i.e. the death rate for each gender
## [1] 0.03664921
mean(num_males$death_number,na.rm = T)
## [1] 0.08461538
The data shows that younger patients tended to survive more often than older patients, and that males had a higher death rate than females. Before drawing any conclusions, we need to check whether these differences are statistically significant.
t.test(num_alive$age,num_dead$age, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: num_alive$age and num_dead$age
## t = -10.839, df = 72.234, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.28669 -16.74114
## sample estimates:
## mean of x mean of y
## 48.07229 68.58621
t.test(num_females$death_number,num_males$death_number, alternative = "two.sided") #p-values for both comparisons, to check whether the null hypothesis can be rejected
##
## Welch Two Sample t-test
##
## data: num_females$death_number and num_males$death_number
## t = -3.084, df = 894.06, p-value = 0.002105
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07849151 -0.01744083
## sample estimates:
## mean of x mean of y
## 0.03664921 0.08461538
#if the p-value is below 0.05, the difference is considered statistically significant and we reject the null hypothesis of equal means
#both tests are two-sided
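Because `death_number` is a 0/1 flag, the Welch t-test on it is effectively comparing two proportions. A more conventional check for that case is a two-sample proportion test. The sketch below uses base R's `prop.test()`; the counts are hypothetical toy numbers chosen for illustration, not values taken from this dataset.

```r
# Hypothetical counts for illustration only (not from the COVID-19 data)
deaths <- c(female = 5, male = 15)     # number of deaths per group
totals <- c(female = 100, male = 100)  # number of patients per group

# Two-sided test of equal death proportions in the two groups
result <- prop.test(x = deaths, n = totals, alternative = "two.sided")

print(result$estimate)  # observed death rate in each group
print(result$p.value)   # a small p-value suggests the rates differ
```

With the real data, `x` would be the number of deaths and `n` the number of patients in `num_females` and `num_males`.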
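Based on the p-values calculated, we can conclude at the 5 percent significance level that patients who died were older on average than those who survived, and that males had a higher death rate than females.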
data2<-data %>% select(c(case_in_country, #selecting required columns from primary data
reporting.date,
age,
country,
from.Wuhan,
visiting.Wuhan)) %>%rename(date='reporting.date') #renaming reporting date
data2$date<- mdy(data2$date) #using lubridate's mdy(), parsing the month/day/year character strings into Date objects
sum(is.na(data2$case_in_country)) #shows the number of missing values in the data
## [1] 197
sum(is.na(data2$case_in_country[1:197])) #all 197 missing values are in the first 197 rows, hence we'll remove those rows
## [1] 197
nrow(data2)
## [1] 1085
data2<-data2[198:1085,] #subsetting the real values of cases_in_country
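Dropping rows by position works here because all the missing values sit in the first 197 rows. A sketch of a variant that does not depend on row order is shown below, in base R. The `cases` data frame is a hypothetical toy example, not the real `data2`.

```r
# Hypothetical toy data: missing case numbers are not all at the top
cases <- data.frame(
  case_in_country = c(NA, NA, 1, 2, NA, 3),
  country = c("China", "China", "Japan", "Japan", "Japan", "Japan")
)

# Keep only the rows where case_in_country is known, wherever they are
known <- cases[!is.na(cases$case_in_country), ]

print(nrow(known))
```

Filtering on the value itself keeps working if the file is later re-sorted or extended.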
#calculating a total per country: group the data by country, sum case_in_country, then sort in descending order
data2 %>%
  group_by(country) %>%
  summarise(total_cases=sum(case_in_country)) %>%
  arrange(desc(total_cases)) %>%
  ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 37 x 2
## country total_cases
## <chr> <int>
## 1 Japan 18145
## 2 South Korea 10477
## 3 Hong Kong 4434
## 4 Singapore 4371
## 5 Germany 1485
## 6 Thailand 861
## 7 France 780
## 8 Spain 595
## 9 Taiwan 595
## 10 Malaysia 276
## # ... with 27 more rows
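One caveat about the totals above: if `case_in_country` is a running index of cases within each country (an assumption based on the column name and its values), then summing it adds up index numbers rather than counting patients. The Japan total of 18145 is consistent with that reading, since 190 × 191 / 2 = 18145 is exactly the sum of the indices 1 to 190. The sketch below, in base R with hypothetical toy data, contrasts the sum with a row count and with the highest index.

```r
# Hypothetical toy data: case_in_country numbers the cases within each country
cases <- data.frame(
  country = c("Japan", "Japan", "Japan", "South Korea", "South Korea"),
  case_in_country = c(1, 2, 3, 1, 2)
)

summed  <- tapply(cases$case_in_country, cases$country, sum)  # adds up the indices
counted <- table(cases$country)                               # number of rows per country
highest <- tapply(cases$case_in_country, cases$country, max)  # largest index per country

print(summed)
print(counted)
print(highest)
```

In this toy example Japan has three cases, which the row count and the highest index both report, while the sum gives six. The ranking of countries is the same either way, so the choice of Japan and South Korea for the plots is unaffected.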
This shows that Japan and South Korea account for most of the data. Hence, we’ll plot those two countries and check how many of their patients were from Wuhan or contracted the virus while visiting Wuhan.
data2$from.Wuhan[data2$from.Wuhan==0]<- "NO"
data2$from.Wuhan[data2$from.Wuhan==1]<-"YES"
data2$visiting.Wuhan[data2$visiting.Wuhan==0]<- "NO"
data2$visiting.Wuhan[data2$visiting.Wuhan==1]<-"YES"
data2<-data2[complete.cases(data2[,2]),] #removing the NA values from the date column
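The 0/1 recoding above relies on R coercing the column to character after the first assignment. A sketch of an alternative that does the recoding in one step is shown below, using base R's `factor()`. The `from_wuhan` vector is a hypothetical toy example.

```r
# Hypothetical toy flag column with one missing value
from_wuhan <- c(0, 1, NA, 1, 0)

# Map 0 -> "NO" and 1 -> "YES" in a single step; NA stays NA
from_wuhan_label <- factor(from_wuhan, levels = c(0, 1), labels = c("NO", "YES"))

print(table(from_wuhan_label, useNA = "ifany"))
```

A factor also gives ggplot2 a fixed legend order ("NO" before "YES") regardless of which value appears first in the data.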
koreaplot1<-data2 %>% filter(country =="South Korea") %>% #plotting date vs cases' count in South Korea particularly who are from Wuhan and got infected with covid-19
ggplot(aes(date, case_in_country,colour=from.Wuhan)) +
theme_grey(base_size = 15)+
theme(axis.text.x = element_text(angle =90))+
scale_x_date(date_breaks = "1 week") +
geom_line(size=1)+
geom_smooth(method = "lm",colour="blue",lwd=1)+
xlab("Reporting Date of the Patient")+
ylab("Number of cases tested positive")+
ggtitle("Patients admitted to hospitals in South Korea")+ylim(0,1500)
japanplot1<-data2 %>% filter(country =="Japan") %>%
ggplot(aes(date, case_in_country,colour=from.Wuhan))+ #plotting date vs cases' count in Japan particularly who are from Wuhan and got infected with covid-19
theme_grey(base_size = 15)+
theme(axis.text.x = element_text(angle =90))+
scale_x_date(date_breaks = "1 week")+
geom_line(size=1)+
geom_smooth(method = "lm",colour="blue",lwd=1)+
xlab("Reporting Date of the Patient")+
ylab("Number of cases tested positive")+
ggtitle("Patients admitted to hospitals in Japan")+ ylim(0,1500)
grid.arrange(koreaplot1,japanplot1,ncol=2) #arranges both plots side by side
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 32 rows containing missing values (geom_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 34 rows containing missing values (geom_smooth).
The plots for patients who contracted the virus after visiting Wuhan as a tourist are given below.
koreaplot2<-data2 %>% filter(country =="South Korea") %>% #plotting date vs cases' count in South Korea particularly who visited Wuhan as tourist and got infected with covid-19
ggplot(aes(date, case_in_country,colour=visiting.Wuhan)) +
theme_grey(base_size = 15)+
theme(axis.text.x = element_text(angle =90))+
scale_x_date(date_breaks = "1 week") +geom_line(size=1)+
geom_smooth(method = "lm",colour="blue",lwd=1)+
xlab("Reporting Date of the Patient")+
ylab("Number of cases tested positive")+
ggtitle("Patients admitted to hospitals in South Korea")+ylim(0,1500)
japanplot2<-data2 %>% filter(country =="Japan") %>% #plotting date vs cases' count in Japan particularly who visited Wuhan as tourist and got infected with covid-19
ggplot(aes(date, case_in_country,colour=visiting.Wuhan)) +
theme_grey(base_size = 15)+
theme(axis.text.x = element_text(angle =90))+
scale_x_date(date_breaks = "1 week")+
geom_line(size=1)+
geom_smooth(method = "lm",colour="blue",lwd=1)+
xlab("Reporting Date of the Patient")+
ylab("Number of cases tested positive")+
ggtitle("Patients admitted to hospitals in Japan")+ ylim(0,1500)
grid.arrange(koreaplot2,japanplot2,ncol=2) #arranges both plots side by side
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 32 rows containing missing values (geom_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 34 rows containing missing values (geom_smooth).
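The four plots above share almost all of their code, so they could be produced by one small function. The sketch below is one possible way to do that. `plot_country()` and the `toy` data frame are hypothetical names introduced for illustration. It also swaps `ylim()` for `coord_cartesian()`: `ylim()` drops values outside the range before drawing, which is the likely source of the "Removed ... rows" warnings, whereas `coord_cartesian()` only zooms the view.

```r
library(ggplot2)

# Hypothetical helper: one country's cases over time, coloured by a flag column
plot_country <- function(df, country_name, colour_col) {
  country_df <- df[df$country == country_name, ]
  ggplot(country_df, aes(date, case_in_country, colour = .data[[colour_col]])) +
    theme_grey(base_size = 15) +
    theme(axis.text.x = element_text(angle = 90)) +
    scale_x_date(date_breaks = "1 week") +
    geom_line() +
    geom_smooth(method = "lm", formula = y ~ x, colour = "blue") +
    coord_cartesian(ylim = c(0, 1500)) +  # zoom without dropping data
    labs(x = "Reporting Date of the Patient",
         y = "Number of cases tested positive",
         title = paste("Patients admitted to hospitals in", country_name))
}

# Hypothetical toy data with the same column names as data2
toy <- data.frame(
  date = as.Date("2020-02-01") + 0:5,
  case_in_country = c(1, 2, 3, 1, 2, 3),
  country = rep(c("Japan", "South Korea"), each = 3),
  from.Wuhan = rep(c("NO", "YES", "NO"), 2)
)

p <- plot_country(toy, "Japan", "from.Wuhan")
```

With the real data, `plot_country(data2, "South Korea", "visiting.Wuhan")` would stand in for the `koreaplot2` block, and the results can still be passed to `grid.arrange()`.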