COVID-19 Data Analysis

INTRODUCTION

2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC

DATASET

This dataset has daily-level information on the number of cases, deaths and recoveries from the 2019 novel coronavirus. Please note that this is time series data, so the number of cases on any given day is cumulative.

The data is available from 22 Jan, 2020.

The data used for this analysis can be downloaded here: Covid-19 Data

This dataset focuses on the early stages of the outbreak and shows how the epidemic gradually turned into a pandemic.

library(ggplot2)
library(data.table)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.0.3

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

knitr::opts_chunk$set(echo = TRUE)

data<-read.csv("C:/Users/virar/Desktop/ranga/R/covid/COVID19data.csv")
dim(data)

## [1] 1085   27

str(data)

## 'data.frame':    1085 obs. of  27 variables:
##  $ ï..id                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ case_in_country      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ reporting.date       : chr  "1/20/2020" "1/20/2020" "1/21/2020" "1/21/2020" ...
##  $ X                    : logi  NA NA NA NA NA NA ...
##  $ summary              : chr  "First confirmed imported COVID-19 pneumonia patient in Shenzhen (from Wuhan): male, 66, shenzheng residence, vi"| __truncated__ "First confirmed imported COVID-19 pneumonia patient in Shanghai (from Wuhan): female, 56, Wuhan residence, arri"| __truncated__ "First confirmed imported cases in Zhejiang: patient is male, 46, lives in Wuhan, self-driving from Wuhan to Han"| __truncated__ "new confirmed imported COVID-19 pneumonia in Tianjin: female, age 60, recently visited Wuhan, visited fever cli"| __truncated__ ...
##  $ location             : chr  "Shenzhen, Guangdong" "Shanghai" "Zhejiang" "Tianjin" ...
##  $ country              : chr  "China" "China" "China" "China" ...
##  $ gender               : chr  "male" "female" "male" "female" ...
##  $ age                  : num  66 56 46 60 58 44 34 37 39 56 ...
##  $ symptom_onset        : chr  "01/03/20" "1/15/2020" "01/04/20" NA ...
##  $ If_onset_approximated: int  0 0 0 NA NA 0 0 0 0 0 ...
##  $ hosp_visit_date      : chr  "01/11/20" "1/15/2020" "1/17/2020" "1/19/2020" ...
##  $ exposure_start       : chr  "12/29/2019" NA NA NA ...
##  $ exposure_end         : chr  "01/04/20" "01/12/20" "01/03/20" NA ...
##  $ visiting.Wuhan       : int  1 0 0 1 0 0 0 1 1 1 ...
##  $ from.Wuhan           : int  0 1 1 0 0 1 1 0 0 0 ...
##  $ death                : chr  "0" "0" "0" "0" ...
##  $ recovered            : chr  "0" "0" "0" "0" ...
##  $ symptom              : chr  "" "" "" "" ...
##  $ source               : chr  "Shenzhen Municipal Health Commission" "Official Weibo of Shanghai Municipal Health Commission" "Health Commission of Zhejiang Province" "äººæ°‘æ—¥æŠ¥å®\230æ–¹å¾®å\215š" ...
##  $ link                 : chr  "http://wjw.sz.gov.cn/wzx/202001/t20200120_18987787.htm" "https://www.weibo.com/2372649470/IqogQhgfa?from=page_1001062372649470_profile&wvr=6&mod=weibotime&type=comment" "http://www.zjwjw.gov.cn/art/2020/1/21/art_1202101_41786033.html" "https://m.weibo.cn/status/4463235401268457?" ...
##  $ X.1                  : logi  NA NA NA NA NA NA ...
##  $ X.2                  : logi  NA NA NA NA NA NA ...
##  $ X.3                  : logi  NA NA NA NA NA NA ...
##  $ X.4                  : logi  NA NA NA NA NA NA ...
##  $ X.5                  : logi  NA NA NA NA NA NA ...
##  $ X.6                  : logi  NA NA NA NA NA NA ...
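The ï..id column name and the all-NA columns X and X.1 to X.6 in the output above look like side effects of how the file was read: a UTF-8 byte-order mark at the start of the file and empty trailing columns in the CSV. A minimal sketch of one way to handle both, assuming the file really is UTF-8 with a byte-order mark (the temporary file and its contents are made up for illustration):

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (illustrative data only)
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("id,country,\n1,China,\n2,Japan,\n")), tmp)

# "UTF-8-BOM" asks R to strip the byte-order mark while decoding the file
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")

# Drop columns that contain nothing but NA (here, the empty trailing column)
demo_clean <- demo[, colSums(!is.na(demo)) > 0, drop = FALSE]
names(demo_clean)
```

The same two steps could be applied to the real file by passing its path to read.csv in place of the temporary file.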

CLEANING THE DATA

We’ll focus primarily on the “death” column, since our first objective is to calculate the death rate.

str(data$death)

##  chr [1:1085] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...

unique(data$death)

##  [1] "0"         "1"         "2/14/2020" "2/26/2020" "2/13/2020" "2/28/2020"
##  [7] "2/27/2020" "2/25/2020" "2/23/2020" "2/24/2020" "2/22/2020" "02/01/20" 
## [13] "2/19/2020" "2/21/2020"

Some values in the column are “0”, which means the patient did not die, and some are “1”, which means the patient died. The remaining values record the date of death instead of a flag. Since dates would be awkward to handle in the rest of the analysis, we’ll recode every non-zero value as 1 and then calculate the death rate.

data$death_number<-as.integer(data$death!="0")  #1 for every non-zero value (a "1" or a date of death), 0 otherwise; any NA stays NA
unique(data$death_number)

## [1] 0 1
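Recoding the column to 0/1 discards the dates of death. If they are needed later, the flag and the date can be kept in separate vectors. A minimal base R sketch on illustrative values shaped like those in the death column (the variable names are made up, and the two date formats are assumed from the unique values printed above):

```r
# Illustrative values in the same shapes as the "death" column
death <- c("0", "1", "2/14/2020", "02/01/20")

# Flag: 1 for anything other than "0"
death_flag <- as.integer(death != "0")

# Date: parse only values that look like dates, choosing the year format
# by whether the string ends in a four-digit year
is_date <- grepl("/", death, fixed = TRUE)
four_digit_year <- grepl("/[0-9]{4}$", death)
long_year <- is_date & four_digit_year
short_year <- is_date & !four_digit_year

death_date <- rep(as.Date(NA), length(death))
death_date[long_year] <- as.Date(death[long_year], format = "%m/%d/%Y")
death_date[short_year] <- as.Date(death[short_year], format = "%m/%d/%y")

data.frame(death, death_flag, death_date)
```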

death_rate<-sum(data$death_number)/nrow(data)   #death rate calculation

Next, we’ll check whether patients who survived are younger on average than patients who died, and whether males have a higher fatality rate than females.

num_dead<- subset(data, death_number==1)          #patients who died
num_alive<- subset(data, death_number==0)         #patients who survived
mean(num_alive$age, na.rm= T)                     #mean age of the survivors, ignoring missing ages

## [1] 48.07229

mean(num_dead$age, na.rm = T)

## [1] 68.58621

num_males<- subset(data, gender=="male")          #subsetting data based on the gender of the patient
num_females<- subset(data, gender=="female")
mean(num_females$death_number, na.rm = T)         #mean of the 0/1 death flag, i.e. the death rate among females

## [1] 0.03664921

mean(num_males$death_number,na.rm = T)

## [1] 0.08461538

The data suggests that younger patients are more likely to survive than older patients and that males have a higher death rate than females. Before drawing any conclusions, we need to check whether these differences are statistically significant.

t.test(num_alive$age,num_dead$age, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  num_alive$age and num_dead$age
## t = -10.839, df = 72.234, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24.28669 -16.74114
## sample estimates:
## mean of x mean of y 
##  48.07229  68.58621

t.test(num_females$death_number,num_males$death_number, alternative = "two.sided")   #calculation of p-values of both the cases to make sure to reject the null hypothesis

## 
##  Welch Two Sample t-test
## 
## data:  num_females$death_number and num_males$death_number
## t = -3.084, df = 894.06, p-value = 0.002105
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07849151 -0.01744083
## sample estimates:
##  mean of x  mean of y 
## 0.03664921 0.08461538

#if the p-value is below 0.05, the difference is considered statistically significant and we reject the null hypothesis of no difference between the two groups
#both tests are two-sided

CONCLUSION

Based on the p-values and the 95 percent confidence intervals calculated above, we can conclude that:

Patients who died were on average 16.7 to 24.3 years older than patients who survived, so younger patients are more likely to survive.
Males have a fatality rate that is 1.7 to 7.8 percentage points higher than that of females.
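Because death_number is a 0/1 indicator, the gender comparison is a comparison of two proportions. A test designed for proportions, such as base R's prop.test, is a common alternative to the t-test used above. A minimal sketch on made-up counts (the numbers below are illustrative and are not taken from the dataset):

```r
# Hypothetical counts: deaths and group sizes for males and females
deaths <- c(male = 40, female = 15)
totals <- c(male = 500, female = 400)

# Two-sample test for equality of proportions (two-sided by default)
res <- prop.test(x = deaths, n = totals)

res$estimate   # observed fatality proportion in each group
res$conf.int   # 95% confidence interval for the difference in proportions
res$p.value    # p-value for the null hypothesis of equal proportions
```

With the real data, the counts could be obtained by summing death_number and counting rows within each gender subset.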

VISUALISING THE DATA

data2<-data %>% select(c(case_in_country,    #selecting required columns from primary data
                 reporting.date, 
                 age, 
                 country, 
                 from.Wuhan, 
                 visiting.Wuhan)) %>%rename(date='reporting.date') #renaming reporting date


data2$date<- mdy(data2$date) #using the lubridate package to parse the character dates into Date objects

sum(is.na(data2$case_in_country)) #shows the number of missing values in the data

## [1] 197

sum(is.na(data2$case_in_country[1:197])) #this proves that the first 197 values are unknown, hence we'll remove it

## [1] 197

nrow(data2)

## [1] 1085

data2<-data2[198:1085,] #keeping only the rows where case_in_country is known

data2 %>% group_by(country) %>% summarise(total_cases=sum(case_in_country))%>%arrange(desc(total_cases)) %>% ungroup()       #grouping the data by country, summing case_in_country within each country, then sorting in descending order

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 37 x 2
##    country     total_cases
##    <chr>             <int>
##  1 Japan             18145
##  2 South Korea       10477
##  3 Hong Kong          4434
##  4 Singapore          4371
##  5 Germany            1485
##  6 Thailand            861
##  7 France              780
##  8 Spain               595
##  9 Taiwan              595
## 10 Malaysia            276
## # ... with 27 more rows

This shows that Japan and South Korea have the largest totals in the data. Hence, we’ll plot these two countries and check how many of the patients were from Wuhan or contracted the virus while visiting Wuhan.
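One caveat about the table above: if case_in_country is a running case number within each country rather than a count (an assumption about this dataset that is worth checking against its documentation), then its sum is not the number of cases. Japan's total of 18145, for instance, equals 1 + 2 + ... + 190, which is what a running index over 190 rows would add up to. In that case, counting rows per country would be the more direct measure. A sketch on a small made-up data frame:

```r
library(dplyr)

# Made-up line list: one row per patient, case_in_country is a running index
toy <- data.frame(
  country = c("Japan", "Japan", "Japan", "South Korea", "South Korea"),
  case_in_country = c(1, 2, 3, 1, 2)
)

# Number of reported patients per country, largest first
cases_by_country <- toy %>%
  count(country, name = "n_cases", sort = TRUE)

cases_by_country
```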

data2$from.Wuhan[data2$from.Wuhan==0]<- "NO"   
data2$from.Wuhan[data2$from.Wuhan==1]<-"YES"
data2$visiting.Wuhan[data2$visiting.Wuhan==0]<- "NO"
data2$visiting.Wuhan[data2$visiting.Wuhan==1]<-"YES"
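The same recoding can also be written with factor(), which maps the 0/1 codes to labels in one step and leaves NA values as NA. A small sketch on an illustrative vector (the variable names are made up):

```r
# Illustrative 0/1 codes with one missing value
from_wuhan <- c(0L, 1L, NA, 1L)

# Map 0 -> "NO" and 1 -> "YES"; NA stays NA
from_wuhan_label <- factor(from_wuhan, levels = c(0, 1), labels = c("NO", "YES"))

table(from_wuhan_label, useNA = "ifany")
```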

data2<-data2[complete.cases(data2[,2]),]    #removing the NA values from the date column
 
koreaplot1<-data2 %>% filter(country =="South Korea") %>%   #plotting case_in_country against reporting date for South Korea, coloured by whether the patient is from Wuhan
  ggplot(aes(date, case_in_country,colour=from.Wuhan)) + 
  theme_grey(base_size = 15)+
  theme(axis.text.x = element_text(angle =90))+ 
  scale_x_date(date_breaks = "1 week") +
  geom_line(size=1)+ 
  geom_smooth(method = "lm",colour="blue",lwd=1)+
  xlab("Reporting Date of the Patient")+
  ylab("Number of cases tested positive")+
  ggtitle("Patients admitted to hospitals in South Korea")+ylim(0,1500)

japanplot1<-data2 %>% filter(country =="Japan") %>% 
  ggplot(aes(date, case_in_country,colour=from.Wuhan))+  #plotting case_in_country against reporting date for Japan, coloured by whether the patient is from Wuhan
  theme_grey(base_size = 15)+
  theme(axis.text.x = element_text(angle =90))+
  scale_x_date(date_breaks = "1 week")+
  geom_line(size=1)+
  geom_smooth(method = "lm",colour="blue",lwd=1)+
  xlab("Reporting Date of the Patient")+
  ylab("Number of cases tested positive")+
  ggtitle("Patients admitted to hospitals in Japan")+ ylim(0,1500)

grid.arrange(koreaplot1,japanplot1,ncol=2)    #arranges both plots side by side

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 32 rows containing missing values (geom_smooth).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 34 rows containing missing values (geom_smooth).

The plots below show the same data, coloured by whether the patient contracted the virus after visiting Wuhan as a tourist.

koreaplot2<-data2 %>% filter(country =="South Korea") %>%   #plotting case_in_country against reporting date for South Korea, coloured by whether the patient had visited Wuhan
  ggplot(aes(date, case_in_country,colour=visiting.Wuhan)) +
  theme_grey(base_size = 15)+
  theme(axis.text.x = element_text(angle =90))+ 
  scale_x_date(date_breaks = "1 week") +geom_line(size=1)+ 
  geom_smooth(method = "lm",colour="blue",lwd=1)+ 
  xlab("Reporting Date of the Patient")+
  ylab("Number of cases tested positive")+
  ggtitle("Patients admitted to hospitals in South Korea")+ylim(0,1500)

japanplot2<-data2 %>% filter(country =="Japan") %>%    #plotting case_in_country against reporting date for Japan, coloured by whether the patient had visited Wuhan
  ggplot(aes(date, case_in_country,colour=visiting.Wuhan)) + 
  theme_grey(base_size = 15)+
  theme(axis.text.x = element_text(angle =90))+
  scale_x_date(date_breaks = "1 week")+
  geom_line(size=1)+ 
  geom_smooth(method = "lm",colour="blue",lwd=1)+ 
  xlab("Reporting Date of the Patient")+
  ylab("Number of cases tested positive")+
  ggtitle("Patients admitted to hospitals in Japan")+ ylim(0,1500)

grid.arrange(koreaplot2,japanplot2,ncol=2)    #arranges both plots side by side

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 32 rows containing missing values (geom_smooth).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 34 rows containing missing values (geom_smooth).
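The four plots above differ only in the country and in the column used for colour, so the shared code could be wrapped in a function. A minimal sketch, assuming the column names used in data2; the helper plot_country_cases and the toy data frame are hypothetical and only there to show the idea:

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: one country, one column (given by name) used for colour
plot_country_cases <- function(df, country_name, colour_col) {
  df %>%
    filter(country == country_name) %>%
    ggplot(aes(date, case_in_country, colour = .data[[colour_col]])) +
    theme_grey(base_size = 15) +
    theme(axis.text.x = element_text(angle = 90)) +
    scale_x_date(date_breaks = "1 week") +
    geom_line(size = 1) +
    geom_smooth(method = "lm", colour = "blue", lwd = 1) +
    labs(x = "Reporting Date of the Patient",
         y = "Number of cases tested positive",
         colour = colour_col,
         title = paste("Patients admitted to hospitals in", country_name)) +
    ylim(0, 1500)
}

# Made-up data with the same column names as data2
toy <- data.frame(
  date = as.Date("2020-02-01") + 0:5,
  case_in_country = c(1, 2, 3, 4, 5, 6),
  country = "Japan",
  from.Wuhan = c("NO", "YES", "NO", "NO", "YES", "NO")
)

p <- plot_country_cases(toy, "Japan", "from.Wuhan")
```

With the real data, the four plots would then be four calls such as plot_country_cases(data2, "South Korea", "visiting.Wuhan"), passed to grid.arrange as before.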