COVID-19 Study: Forecasting using R Programming

Alok Pratap Singh

Research Scholar, Department of Psychology, University of Allahabad. Publication date: 11 September, 2020

Installing Package “coronavirus”

if(!require(coronavirus)) {
  install.packages("coronavirus", dependencies = T)
}
## Loading required package: coronavirus

On December 31, 2019, the World Health Organization (WHO) was informed of an outbreak of “pneumonia of unknown cause” detected in Wuhan City, Hubei Province, China – the seventh-largest city in China with 11 million residents. As of January 23, there are over 800 cases of 2019-nCoV confirmed globally, including cases in at least 20 regions in China and nine countries/territories. The first reported infected individuals, some of whom showed symptoms as early as December 8, were discovered to be among stallholders from the Wuhan South China Seafood Market. Subsequently, the wet market was closed on Jan 1. The virus causing the outbreak was quickly determined to be a novel coronavirus. On January 10, gene sequencing further determined it to be the new Wuhan coronavirus, namely 2019-nCoV, a betacoronavirus, related to the Middle Eastern Respiratory Syndrome virus (MERS-CoV) and the Severe Acute Respiratory Syndrome virus (SARSCoV). However, the mortality and transmissibility of 2019-nCoV are still unknown, and likely to vary from those of the prior referenced coronaviruses.

Fetching data

# coronavirus::refresh_coronavirus_jhu() -> dataframe

Loding data

readr::read_csv("C:/Users/Asus/Documents/R Pubs/R Pubs/alltis/alltis.csv") -> alltis
## Warning: Missing column names filled in: 'X1' [1]
## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   ds = col_date(format = ""),
##   location = col_character(),
##   location_type = col_character(),
##   location_code = col_character(),
##   location_code_type = col_character(),
##   data_type = col_character(),
##   value = col_double(),
##   lat = col_double(),
##   long = col_double()
## )
library(magrittr) # to use %>% (pipe operator)
names(alltis)
##  [1] "X1"                 "ds"                 "location"          
##  [4] "location_type"      "location_code"      "location_code_type"
##  [7] "data_type"          "value"              "lat"               
## [10] "long"
alltis$X1 <- NULL
knitr::kable(head(alltis,5))
ds location location_type location_code location_code_type data_type value lat long
2020-06-25 Afghanistan country AF iso_3166_2 deaths_new 36 33.93911 67.70995
2020-02-20 Afghanistan country AF iso_3166_2 cases_new 0 33.93911 67.70995
2020-03-23 Afghanistan country AF iso_3166_2 deaths_new 0 33.93911 67.70995
2020-02-18 Afghanistan country AF iso_3166_2 cases_new 0 33.93911 67.70995
2020-02-19 Afghanistan country AF iso_3166_2 cases_new 0 33.93911 67.70995
dplyr::glimpse(alltis)
## Rows: 181,335
## Columns: 9
## $ ds                 <date> 2020-06-25, 2020-02-20, 2020-03-23, 2020-02-18, 20~
## $ location           <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgha~
## $ location_type      <chr> "country", "country", "country", "country", "countr~
## $ location_code      <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF~
## $ location_code_type <chr> "iso_3166_2", "iso_3166_2", "iso_3166_2", "iso_3166~
## $ data_type          <chr> "deaths_new", "cases_new", "deaths_new", "cases_new~
## $ value              <dbl> 36, 0, 0, 0, 0, 37, 10, 0, 100, 24, 0, 0, 0, 0, 0, ~
## $ lat                <dbl> 33.93911, 33.93911, 33.93911, 33.93911, 33.93911, 3~
## $ long               <dbl> 67.70995, 67.70995, 67.70995, 67.70995, 67.70995, 6~

Missing Data

colSums(is.na(alltis))
##                 ds           location      location_type      location_code 
##                  0                  0                  0               3003 
## location_code_type          data_type              value                lat 
##               2310                  0                  0                  0 
##               long 
##                  0
  • alltis data frame have 1 Lakh and 81 thousand 335 number of observation, or 181335 rows.

  • Having 9 variables, or 9 columns. named- date, location, location_type, location_code, location_code_type, data_type, value, lat, long..

  • All things look normal on a cursory look. I just have to make the date in the correct format i.e ymd: year month date format.

  • I am not interested in studying all the variables provided in this dataframe. So, before starting I have to use the `filters and select command (in dplyr package)’ to select or remove the variables/observations that I am interested in.

Installing libraries

library(dplyr, warn.conflicts = F)  # data wrangling
library(lubridate, warn.conflicts = F) # works with dates
library(ggplot2) # for data visualization
library(ggthemes) # for interactive themes and colors
library(prophet) # forecasting
theme_set(theme_bw()) # all graph will follow the same theme automatically
options(scipen = 999) # No scientific annotation

Filtering data for India, new cases, removing observation containing 0

# alltis %>% rename(ds= date) ->alltis
alltis$ds <- as.Date(alltis$ds) # renaming date as "ds"
alltis$ds <- ymd(alltis$ds) #year months date
data <- alltis # duplicating all `alltis` to`data`
data= data %>% dplyr::filter(location== "India") %>% 
  select(ds, location, data_type, data_type, value) %>% 
  filter(value!=0)
data <- as_tibble(data) # converting to tibble format
data %>% head() %>% kableExtra::kable()
ds location data_type value
2020-05-09 India cases_new 3113
2020-05-03 India cases_new 2806
2020-05-07 India cases_new 3364
2020-04-27 India cases_new 1561
2020-04-28 India cases_new 1873
2020-05-04 India cases_new 3932

How many types of cases are there in our data?

What is the time period of the observations ?

How many observations are left after filtering ?

unique(data$data_type);range(data$ds); dim(data)
## [1] "cases_new"     "recovered_new" "deaths_new"
## [1] "2020-01-30" "2020-09-08"
## [1] 543   4
# arranging data in `decreasing order of date column`
data <- data %>% arrange(desc(ds))
# top 20 data
pander::pander(data[1:10,])
ds location data_type value
2020-09-08 India recovered_new 74894
2020-09-08 India cases_new 89706
2020-09-08 India deaths_new 1115
2020-09-07 India recovered_new 73521
2020-09-07 India cases_new 75809
2020-09-07 India deaths_new 1133
2020-09-06 India recovered_new 69564
2020-09-06 India cases_new 90802
2020-09-06 India deaths_new 1016
2020-09-05 India recovered_new 73642

Data Visualization

ggplot(data, aes(ds, value, col= data_type)) +
  geom_line() + geom_point() +
  labs(x= "Months (2020)","Number of Cases",y= "New Caeses/ Day", title = "India", subtitle = "Condition of new cases of COVID-19 In India") + theme_linedraw()

  • See the effect of lockdown
  • New corona cases increased with the termination of lockdown.
  • Number of cases_new and recovered_new are close i.e death rate is 2 %.

scale_y_log10()

ggplot(data, aes(ds, value, col= data_type)) +
  geom_line(size=1) + geom_point(col= "black", size= .9) +
  labs(x= "Months (2020)","Number of Cases",y= "New Caeses (log10)", title = "India", 
       subtitle = "Condition of new cases of COVID-19") +
  theme_classic() +
  scale_y_log10()

  • Now we can see a sharp increase in the number of deaths.

  • The recovery rate is improving.

Forecasting

New corona cases upto the next 31 days

Note: library prophet is not for epidemiological model

Renaming y to my dependent variable, and filtering data_type: cases_new Making new data frame for data_type: cases_new named newcases

data %>% rename(y= value) ->data
#filtering
newcases <- data %>% dplyr::filter(data_type=="cases_new")
#now top 4 data
pander::pander(newcases[1:4,])
ds location data_type y
2020-09-08 India cases_new 89706
2020-09-07 India cases_new 75809
2020-09-06 India cases_new 90802
2020-09-05 India cases_new 90632

Forceasting the new_cases/day for next 31 days

m <- prophet(df= newcases) # model to predict
futuredf1 <- make_future_dataframe(m = m, periods = 31) 
forecast1 <- predict(m, futuredf1) # predicting
# Plotting
dyplot.prophet(m, forecast1)

Result

  • This forecasting shows that on 09 oct 1 lakh 16 thousand cases will be added.

This is an interactive graph, you can move your cursor over the line and see the actual and predicated value of any day.

Probability to get infected

prophet::prophet_plot_components(m , forecast1)

  • Monday is safest day and Wednesday is most dangerous (hehe, kidding)

  • The lowest number of data has been uploaded on Monday, but the good thing is that they have done everything well on all other days of the week. They are working good!

Forecasting death/day for next 31 days

#filtering
newdeath <- data %>% dplyr::filter(data_type=="deaths_new")
#now top 4 data
pander::pander(head(newdeath,4))
ds location data_type y
2020-09-08 India deaths_new 1115
2020-09-07 India deaths_new 1133
2020-09-06 India deaths_new 1016
2020-09-05 India deaths_new 1065
m <- prophet(newdeath)
futuredf2 <- make_future_dataframe(m = m, periods = 31)
forecast2 <- predict(m, futuredf2)
dyplot.prophet(m, forecast2)

Note: You can see the forecasting by moving the cursor over the line.

Probability of death…

prophet_plot_components(m, forecast2)

  • Most deaths on Tuesday, least on Monday (or most of death-data updated on Tuesday)

Forecasting Recovery/Day

#filtering
newrecovery <- data %>% dplyr::filter(data_type=="recovered_new")
# model making and forecasting
m <- prophet(df = newrecovery)
futuredf3 <- make_future_dataframe(m = m, periods = 31)
forecast3 <- predict(m, futuredf3)
dyplot.prophet(m, forecast3)
prophet::prophet_plot_components(m, forecast3)

India, US, Russia, new cases

data <- alltis
data %>% dplyr::filter(data_type=="cases_new", 
  location==c("India","US","Russia")) %>% 
  ggplot(aes(ds, value)) + 
  geom_line(aes(col= location, linetype= location), size=1) +
  scale_color_calc() + geom_point(aes(col=location),size=5)+
  labs(x= "Year 2020", y= "New cases/ Day",
       title = "India, Russia, US")

  • Cases are increasing in India, cases in America have started decreasing now, the situation in Russia is stable.

Top 7 countries in Corona Cases per Day

data <- alltis
data %>% group_by(location) %>% 
  summarise(value= sum(value)) %>% 
  ungroup() %>% arrange(desc(value)) %>% 
  top_n(20) %>% rename(new_cases= value)->n

# plotting barplot for top 7 countries in `new corona cases/Day`
n %>% top_n(7) %>% 
  ggplot(aes(reorder(location, desc(new_cases)), new_cases, fill= location)) + geom_col(col="black", size=2) +
  scale_fill_calc()+
  labs(x= "Countries", y= "New cases/Day", caption = Sys.Date())

pander::pander(n)
location new_cases
US 8875773
Brazil 7861958
India 7842862
Russia 1898039
Mexico 1246485
Peru 1243802
South Africa 1223256
Colombia 1222161
Argentina 877029
Chile 834953
Iran 751068
Spain 714483
Saudi Arabia 624620
Pakistan 592524
Bangladesh 561612
Turkey 543297
Italy 526517
Germany 491672
Iraq 483559
France 460426
# extracting the names of top 7 countries
n %>% top_n(7)-> n
## Selecting by new_cases

Conclusions

  • Cases are continuously increasing in India, but we have to keep in mind that India’s population is very huge. India’s performance is still efficient in terms of cases per million population.

  • The corona cases have started to grow high, and if the control is not done here soon then the situation can become very bitter.

Converting and saving alltis to “.xlsx” format.

# CSV (excel) format
# write.csv(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.csv")
# .sav (spss) format
# haven::write_sav(alltis, "C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.sav")

#in .txt format
# write.table(alltis,"C:/Users/Alok Pratap Singh/Documents/R Pubs/R Pubs/alltis.txt)

Thank You

Regards

Please visit my profile

Alok Pratap Singh (Research Scholar)

Linkedin (Open in New TAB)

Department of Psychology

University of Allahabad

Click on download alltis.rar and unrar to get data in .csv, .sav, & .txt format. Please open the download link in new tab

Without data you’re just another person with an opinion

.