COVID-19 Study: Forecasting using R Programming

Alok Pratap Singh

Research Scholar, Department of Psychology, University of Allahabad. Publication date: 11 September, 2020

Installing Package “coronavirus”

if(!require(coronavirus)) {
  install.packages("coronavirus", dependencies = T)
}

## Loading required package: coronavirus

On December 31, 2019, the World Health Organization (WHO) was informed of an outbreak of “pneumonia of unknown cause” detected in Wuhan City, Hubei Province, China – the seventh-largest city in China with 11 million residents. As of January 23, there are over 800 cases of 2019-nCoV confirmed globally, including cases in at least 20 regions in China and nine countries/territories. The first reported infected individuals, some of whom showed symptoms as early as December 8, were discovered to be among stallholders from the Wuhan South China Seafood Market. Subsequently, the wet market was closed on Jan 1. The virus causing the outbreak was quickly determined to be a novel coronavirus. On January 10, gene sequencing further determined it to be the new Wuhan coronavirus, namely 2019-nCoV, a betacoronavirus, related to the Middle Eastern Respiratory Syndrome virus (MERS-CoV) and the Severe Acute Respiratory Syndrome virus (SARSCoV). However, the mortality and transmissibility of 2019-nCoV are still unknown, and likely to vary from those of the prior referenced coronaviruses.

Fetching data

# coronavirus::refresh_coronavirus_jhu() -> dataframe

Loding data

readr::read_csv("C:/Users/Asus/Documents/R Pubs/R Pubs/alltis/alltis.csv") -> alltis

## Warning: Missing column names filled in: 'X1' [1]

## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   ds = col_date(format = ""),
##   location = col_character(),
##   location_type = col_character(),
##   location_code = col_character(),
##   location_code_type = col_character(),
##   data_type = col_character(),
##   value = col_double(),
##   lat = col_double(),
##   long = col_double()
## )

library(magrittr) # to use %>% (pipe operator)
names(alltis)

##  [1] "X1"                 "ds"                 "location"          
##  [4] "location_type"      "location_code"      "location_code_type"
##  [7] "data_type"          "value"              "lat"               
## [10] "long"

alltis$X1 <- NULL

knitr::kable(head(alltis,5))

ds	location	location_type	location_code	location_code_type	data_type	value	lat	long
2020-06-25	Afghanistan	country	AF	iso_3166_2	deaths_new	36	33.93911	67.70995
2020-02-20	Afghanistan	country	AF	iso_3166_2	cases_new	0	33.93911	67.70995
2020-03-23	Afghanistan	country	AF	iso_3166_2	deaths_new	0	33.93911	67.70995
2020-02-18	Afghanistan	country	AF	iso_3166_2	cases_new	0	33.93911	67.70995
2020-02-19	Afghanistan	country	AF	iso_3166_2	cases_new	0	33.93911	67.70995

dplyr::glimpse(alltis)

## Rows: 181,335
## Columns: 9
## $ ds                 <date> 2020-06-25, 2020-02-20, 2020-03-23, 2020-02-18, 20~
## $ location           <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha~
## $ location_type      <chr> "country", "country", "country", "country", "countr~
## $ location_code      <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF~
## $ location_code_type <chr> "iso_3166_2", "iso_3166_2", "iso_3166_2", "iso_3166~
## $ data_type          <chr> "deaths_new", "cases_new", "deaths_new", "cases_new~
## $ value              <dbl> 36, 0, 0, 0, 0, 37, 10, 0, 100, 24, 0, 0, 0, 0, 0, ~
## $ lat                <dbl> 33.93911, 33.93911, 33.93911, 33.93911, 33.93911, 3~
## $ long               <dbl> 67.70995, 67.70995, 67.70995, 67.70995, 67.70995, 6~

Missing Data

colSums(is.na(alltis))

##                 ds           location      location_type      location_code 
##                  0                  0                  0               3003 
## location_code_type          data_type              value                lat 
##               2310                  0                  0                  0 
##               long 
##                  0

alltis data frame have 1 Lakh and 81 thousand 335 number of observation, or 181335 rows.
Having 9 variables, or 9 columns. named- date, location, location_type, location_code, location_code_type, data_type, value, lat, long..
All things look normal on a cursory look. I just have to make the date in the correct format i.e ymd: year month date format.
I am not interested in studying all the variables provided in this dataframe. So, before starting I have to use the `filters and select command (in dplyr package)’ to select or remove the variables/observations that I am interested in.

Installing libraries

library(dplyr, warn.conflicts = F)  # data wrangling
library(lubridate, warn.conflicts = F) # works with dates
library(ggplot2) # for data visualization
library(ggthemes) # for interactive themes and colors
library(prophet) # forecasting

theme_set(theme_bw()) # all graph will follow the same theme automatically
options(scipen = 999) # No scientific annotation

Filtering data for India, new cases, removing observation containing `0`

# alltis %>% rename(ds= date) ->alltis
alltis$ds <- as.Date(alltis$ds) # renaming date as "ds"
alltis$ds <- ymd(alltis$ds) #year months date
data <- alltis # duplicating all `alltis` to`data`
data= data %>% dplyr::filter(location== "India") %>% 
  select(ds, location, data_type, data_type, value) %>% 
  filter(value!=0)
data <- as_tibble(data) # converting to tibble format
data %>% head() %>% kableExtra::kable()

ds	location	data_type	value
2020-05-09	India	cases_new	3113
2020-05-03	India	cases_new	2806
2020-05-07	India	cases_new	3364
2020-04-27	India	cases_new	1561
2020-04-28	India	cases_new	1873
2020-05-04	India	cases_new	3932

How many types of cases are there in our data?

What is the time period of the observations ?

How many observations are left after filtering ?

unique(data$data_type);range(data$ds); dim(data)

## [1] "cases_new"     "recovered_new" "deaths_new"

## [1] "2020-01-30" "2020-09-08"

## [1] 543   4

# arranging data in `decreasing order of date column`
data <- data %>% arrange(desc(ds))
# top 20 data
pander::pander(data[1:10,])

ds	location	data_type	value
2020-09-08	India	recovered_new	74894
2020-09-08	India	cases_new	89706
2020-09-08	India	deaths_new	1115
2020-09-07	India	recovered_new	73521
2020-09-07	India	cases_new	75809
2020-09-07	India	deaths_new	1133
2020-09-06	India	recovered_new	69564
2020-09-06	India	cases_new	90802
2020-09-06	India	deaths_new	1016
2020-09-05	India	recovered_new	73642

Data Visualization

ggplot(data, aes(ds, value, col= data_type)) +
  geom_line() + geom_point() +
  labs(x= "Months (2020)","Number of Cases",y= "New Caeses/ Day", title = "India", subtitle = "Condition of new cases of COVID-19 In India") + theme_linedraw()

See the effect of lockdown
New corona cases increased with the termination of lockdown.
Number of cases_new and recovered_new are close i.e death rate is 2 %.

`scale_y_log10()`

ggplot(data, aes(ds, value, col= data_type)) +
  geom_line(size=1) + geom_point(col= "black", size= .9) +
  labs(x= "Months (2020)","Number of Cases",y= "New Caeses (log10)", title = "India", 
       subtitle = "Condition of new cases of COVID-19") +
  theme_classic() +
  scale_y_log10()

Now we can see a sharp increase in the number of deaths.
The recovery rate is improving.

Forecasting

New corona cases upto the next 31 days

Note: library prophet is not for epidemiological model

Renaming y to my dependent variable, and filtering data_type: cases_new Making new data frame for data_type: cases_new named newcases

data %>% rename(y= value) ->data
#filtering
newcases <- data %>% dplyr::filter(data_type=="cases_new")
#now top 4 data
pander::pander(newcases[1:4,])

ds	location	data_type	y
2020-09-08	India	cases_new	89706
2020-09-07	India	cases_new	75809
2020-09-06	India	cases_new	90802
2020-09-05	India	cases_new	90632

Forceasting the new_cases/day for next 31 days

m <- prophet(df= newcases) # model to predict
futuredf1 <- make_future_dataframe(m = m, periods = 31) 
forecast1 <- predict(m, futuredf1) # predicting
# Plotting
dyplot.prophet(m, forecast1)

Result

This forecasting shows that on 09 oct 1 lakh 16 thousand cases will be added.

This is an interactive graph, you can move your cursor over the line and see the actual and predicated value of any day.

Probability to get infected

prophet::prophet_plot_components(m , forecast1)

Monday is safest day and Wednesday is most dangerous (hehe, kidding)
The lowest number of data has been uploaded on Monday, but the good thing is that they have done everything well on all other days of the week. They are working good!

Forecasting death/day for next 31 days

#filtering
newdeath <- data %>% dplyr::filter(data_type=="deaths_new")
#now top 4 data
pander::pander(head(newdeath,4))

ds	location	data_type	y
2020-09-08	India	deaths_new	1115
2020-09-07	India	deaths_new	1133
2020-09-06	India	deaths_new	1016
2020-09-05	India	deaths_new	1065

m <- prophet(newdeath)
futuredf2 <- make_future_dataframe(m = m, periods = 31)
forecast2 <- predict(m, futuredf2)
dyplot.prophet(m, forecast2)

Note: You can see the forecasting by moving the cursor over the line.

Probability of death…

prophet_plot_components(m, forecast2)

Most deaths on Tuesday, least on Monday (or most of death-data updated on Tuesday)

Forecasting Recovery/Day

#filtering
newrecovery <- data %>% dplyr::filter(data_type=="recovered_new")
# model making and forecasting
m <- prophet(df = newrecovery)
futuredf3 <- make_future_dataframe(m = m, periods = 31)
forecast3 <- predict(m, futuredf3)
dyplot.prophet(m, forecast3)

prophet::prophet_plot_components(m, forecast3)

India, US, Russia, new cases

data <- alltis
data %>% dplyr::filter(data_type=="cases_new", 
  location==c("India","US","Russia")) %>% 
  ggplot(aes(ds, value)) + 
  geom_line(aes(col= location, linetype= location), size=1) +
  scale_color_calc() + geom_point(aes(col=location),size=5)+
  labs(x= "Year 2020", y= "New cases/ Day",
       title = "India, Russia, US")

Cases are increasing in India, cases in America have started decreasing now, the situation in Russia is stable.

Top 7 countries in Corona Cases per Day

data <- alltis
data %>% group_by(location) %>% 
  summarise(value= sum(value)) %>% 
  ungroup() %>% arrange(desc(value)) %>% 
  top_n(20) %>% rename(new_cases= value)->n

# plotting barplot for top 7 countries in `new corona cases/Day`
n %>% top_n(7) %>% 
  ggplot(aes(reorder(location, desc(new_cases)), new_cases, fill= location)) + geom_col(col="black", size=2) +
  scale_fill_calc()+
  labs(x= "Countries", y= "New cases/Day", caption = Sys.Date())

pander::pander(n)

location	new_cases
US	8875773
Brazil	7861958
India	7842862
Russia	1898039
Mexico	1246485
Peru	1243802
South Africa	1223256
Colombia	1222161
Argentina	877029
Chile	834953
Iran	751068
Spain	714483
Saudi Arabia	624620
Pakistan	592524
Bangladesh	561612
Turkey	543297
Italy	526517
Germany	491672
Iraq	483559
France	460426

# extracting the names of top 7 countries
n %>% top_n(7)-> n

## Selecting by new_cases

Trends of top 7 countries regarding new cases of COVID-19

scale_y_log(10)

#removing all "0" containing values
data %>% dplyr::filter(value!= 0)->data
loc <- n$location
loc[1:7]  #top7 location in corona new cases

## [1] "US"           "Brazil"       "India"        "Russia"       "Mexico"      
## [6] "Peru"         "South Africa"

data %>% dplyr::filter(location== c("US","Brazil","India","Russia","Mexico","Peru","South Africa")) %>% dplyr::filter(data_type=="cases_new") %>% ggplot(aes(ds, value)) +
  geom_line(aes(col= location, linetype= location), size=1) +
  geom_point(aes(col=location),size=2) +scale_y_log10() +
  scale_color_calc() +
  labs(x= "Year 2020", y= "New_cases/day (log 10)")

Conclusions

Cases are continuously increasing in India, but we have to keep in mind that India’s population is very huge. India’s performance is still efficient in terms of cases per million population.
The corona cases have started to grow high, and if the control is not done here soon then the situation can become very bitter.

Converting and saving `alltis` to “.xlsx” format.

# CSV (excel) format
# write.csv(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.csv")
# .sav (spss) format
# haven::write_sav(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.sav")

#in .txt format
# write.table(alltis,"C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.txt)

Thank You

Regards

Please visit my profile

Alok Pratap Singh (Research Scholar)

Linkedin (Open in New TAB)

apsingh@allduniv.ac.in

Department of Psychology

University of Allahabad

Click on download alltis.rar and unrar to get data in .csv, .sav, & .txt format. Please open the download link in new tab

Without data you’re just another person with an opinion