Project Overview:

  1. Pull live COVID-19 time series data with the ‘covid19.analytics’ package.

  2. Prepare the data for further analysis.

  3. Display the spread of COVID-19 in India.

  4. Forecast the next 3 months in India with the ‘prophet’ package.

  5. Evaluate model performance.

The data contains 269 records (one per country, territory or province worldwide), starting from 22 January 2020. The copy used here runs through 17 November 2020.

#############################
#### Scrape Covid19 data ####
#### and data preparation ###
#############################

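# Packages used throughout this analysis
library(covid19.analytics)  # covid19.data()
library(dplyr)              # %>%, filter(), slice()
library(lubridate)          # ymd()
library(ggplot2)            # qplot()
library(prophet)            # prophet(), make_future_dataframe(), dyplot.prophet()

# Download the global time series of confirmed cases
world_covid <- covid19.data(case = "ts-confirmed")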
## Data being read from JHU/CCSE repository
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Reading data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
## Data retrieved on 2020-11-19 10:57:58 || Range of dates on data: 2020-01-22--2020-11-17 | Nbr of records: 269
## --------------------------------------------------------------------------------
View(world_covid)

ind_covid <- world_covid %>% filter(Country.Region=='India')
View(ind_covid)

# Transpose so that each date becomes a row (all values become character)
ind_covid <- data.frame(t(ind_covid))

str(ind_covid)
## 'data.frame':    305 obs. of  1 variable:
##  $ t.ind_covid.: chr  "" "India" "20.59368" "78.96288" ...
# Move the row names (the dates) into a regular column
ind_covid <- cbind(rownames(ind_covid), data.frame(ind_covid, row.names = NULL))
View(ind_covid)

# Rename the columns
colnames(ind_covid) <- c('Date', 'Number_of_cases')

# Drop the first four rows (Province.State, Country.Region, Lat, Long)
ind_covid <- slice(ind_covid, -c(1:4))

str(ind_covid)
## 'data.frame':    301 obs. of  2 variables:
##  $ Date           : chr  "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25" ...
##  $ Number_of_cases: chr  "0" "0" "0" "0" ...
ind_covid$Date <- ymd(ind_covid$Date)

ind_covid$Number_of_cases <- as.numeric(ind_covid$Number_of_cases)
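
The transpose-and-slice steps above can also be written as a single reshape. The following is a minimal base R sketch on a made-up three-day row. The toy data frame `wide`, its values, and the helper name `wide_to_long` are illustrative assumptions, not part of the original analysis. It assumes the date columns are named like "2020-01-22", as in the `str()` output above.

```r
# Toy stand-in for one row of the wide time series table (values are invented).
wide <- data.frame(
  Province.State = "",
  Country.Region = "India",
  Lat = 20.59368,
  Long = 78.96288,
  `2020-01-22` = 0,
  `2020-01-23` = 0,
  `2020-01-24` = 3,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn the date columns of a single row into a long table.
# id_cols is the number of leading non-date columns to skip.
wide_to_long <- function(row, id_cols = 4) {
  stopifnot(nrow(row) == 1)
  date_cols <- names(row)[-seq_len(id_cols)]
  data.frame(
    Date = as.Date(date_cols),
    Number_of_cases = as.numeric(unlist(row[1, date_cols], use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

long <- wide_to_long(wide)
print(long)
```

This avoids the intermediate character columns, since dates and counts are converted in one step.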
################################################
#### Visualization Covid19 Spread in India #####
################################################
attach(ind_covid)
qplot(Date, Number_of_cases, xlab = '', ylab= 'Number of cases',
      main= 'Covid19 Spread in India')

The next step forecasts COVID-19 cases in India for the next 3 months (90 days) using the ‘prophet’ package.

################################################
#### Prediction of Covid19 Spread in India #####
################################################

# prophet expects a data frame with columns 'ds' (date) and 'y' (value)
mydf <- data.frame(ds = ind_covid$Date, y = ind_covid$Number_of_cases)

d<- prophet(mydf)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# Extend the dates 90 days beyond the last observation
pred <- make_future_dataframe(d, periods = 90)
View(tail(pred))


forecast <- predict(d,pred)

dyplot.prophet(d,forecast,xlab= '', ylab= 'Number of cases',
               main= 'Covid19 prediction in India')
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
prophet_plot_components(d,forecast)

By 15 February 2021, the number of confirmed cases in India may reach up to 14.8 million. This forecast holds only if the current trend continues.
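
The 15 February date follows from the data range and the horizon: the last observation is 2020-11-17 and the forecast extends 90 days past it. Below is a small sketch of that arithmetic, plus one way to read off the prediction and its interval for a given date. The `toy_forecast` frame and its numbers are made up for illustration; they only mimic the `ds`, `yhat`, `yhat_lower` and `yhat_upper` columns that prophet's `predict()` returns.

```r
last_observed <- as.Date("2020-11-17")  # end of the date range reported above
horizon_days  <- 90                     # periods passed to make_future_dataframe()
forecast_end  <- last_observed + horizon_days
print(forecast_end)

# Made-up stand-in for the last three rows of a prophet forecast.
toy_forecast <- data.frame(
  ds         = forecast_end - 2:0,
  yhat       = c(100, 110, 120),
  yhat_lower = c(90, 98, 105),
  yhat_upper = c(111, 123, 136)
)

# Pick the row for the final forecast date.
# With a real prophet forecast, ds is a date-time, so as.Date(forecast$ds)
# may be needed before comparing.
final_row <- toy_forecast[toy_forecast$ds == forecast_end, ]
print(final_row)
```

Reporting `yhat_lower` and `yhat_upper` alongside `yhat` makes the uncertainty of a 90-day forecast visible.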

The weekly component shows a drop in the number of cases every Tuesday and a spike every Wednesday. This pattern is unlikely to be real; it is more likely an artifact of data entry, testing and reporting schedules. It certainly does not mean that the risk of infection is lower or higher on particular days of the week.
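
One way to check whether a weekly pattern comes from reporting is to look at daily new cases by weekday. The sketch below does this in base R on synthetic cumulative counts with an artificial dip built in on Tuesdays. The data, the variable names, and the size of the dip are all invented for illustration.

```r
# Synthetic data: 8 full weeks starting on a Monday (2020-06-01 was a Monday).
dates <- seq(as.Date("2020-06-01"), by = "day", length.out = 56)
iso_wday <- as.integer(format(dates, "%u"))  # 1 = Monday ... 7 = Sunday

# Invented daily counts: 100 per day, with under-reporting on Tuesdays.
new_cases  <- ifelse(iso_wday == 2, 60, 100)
cumulative <- cumsum(new_cases)

# Recover daily new cases from the cumulative series,
# which is the form in which the confirmed counts are stored.
daily <- diff(c(0, cumulative))

# Average new cases per weekday; a consistent dip on one day
# hints at a reporting effect.
by_wday <- tapply(daily, iso_wday, mean)
print(by_wday)
```

Applied to the real series, the same grouping would use `diff()` on `Number_of_cases`, since the source data is cumulative.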

# Model accuracy.

########################
#### Model Accuracy ####
########################


# Fitted values for the historical period (the first 301 rows of the forecast)
projected <- forecast$yhat[seq_len(nrow(d$history))]

real_value <- d$history$y

plot(projected, real_value, xlab = 'Predicted Value', ylab = 'True Value',
     main= 'True Values vs Predicted Values')


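# Regress the y-axis values (true) on the x-axis values (predicted) so the line matches the plot
abline(lm(real_value ~ projected), col = 'blue')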

summary(lm(projected~real_value))
## 
## Call:
## lm(formula = projected ~ real_value)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -198508   -6859   -1132    3252  303318 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.738e+03  5.195e+03   0.335    0.738    
## real_value  9.992e-01  1.423e-03 701.955   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72260 on 299 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9994 
## F-statistic: 4.927e+05 on 1 and 299 DF,  p-value: < 2.2e-16

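A clear linear pattern can be seen. The slope is close to 1 (0.9992) and R-squared is 0.9994, so the fitted values track the observed values closely. The p-value is below 2.2e-16, which indicates that this linear relationship is statistically significant. Because the comparison uses the same data the model was fitted on, it reflects in-sample fit rather than accuracy on unseen dates.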
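
Error metrics in the original units are often easier to interpret than a regression summary. Here is a minimal base R sketch, with hypothetical helper names (`mae`, `rmse`, `mape`) and invented toy vectors standing in for `real_value` and `projected`:

```r
# Mean absolute error
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute percentage error, skipping zero actuals to avoid division by zero
mape <- function(actual, predicted) {
  keep <- actual != 0
  mean(abs((actual[keep] - predicted[keep]) / actual[keep])) * 100
}

# Invented toy values, not taken from the analysis above.
toy_real      <- c(0, 100, 200, 400)
toy_projected <- c(10, 90, 220, 400)

metrics <- c(
  MAE  = mae(toy_real, toy_projected),
  RMSE = rmse(toy_real, toy_projected),
  MAPE = mape(toy_real, toy_projected)
)
print(metrics)
```

For accuracy on held-out dates, prophet also provides `cross_validation()` and `performance_metrics()`; the horizon and cutoffs to use for this series would be a judgment call.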