Brief Introduction

In 2020 ,the world was shocked by the presence of a new virus from China, called the Novel Corona Virus (Covid - 19). The virus is so disastrous that it affects numerous country, and many people died.

Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.

At this time, there are no specific vaccines or treatments for COVID-19. However, there are many ongoing clinical trials evaluating potential treatments. WHO will continue to provide updated information as soon as clinical findings become available.

In the mean time Kaggle has launched several challenges in order to provide useful insights that may answer some of the open scientific questions about the virus. In this project we will try to give insights about the COVID19 Global Forecasting. in which we’re trying to fit worldwide data in order to predict the pandemic evolution, hopefully helping to determine which factors impact the transmission behavior of COVID-19.

Goals & Objective

Predict the pandemic evolution through forecasting from the previous trends.

Import Library

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(rpart)
library(tidyverse)

## -- Attaching packages ------------------------------------- tidyverse 1.3.0 --

## v tibble  2.1.3     v purrr   0.3.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'readr' was built under R version 3.6.3

## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()

library(forecast)

## Warning: package 'forecast' was built under R version 3.6.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(TTR)

## Warning: package 'TTR' was built under R version 3.6.3

Import Data

setwd("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Corona")

train <- read.csv("train.csv")

head(train)

##   Id Province.State Country.Region Lat Long       Date ConfirmedCases
## 1  1                   Afghanistan  33   65 2020-01-22              0
## 2  2                   Afghanistan  33   65 2020-01-23              0
## 3  3                   Afghanistan  33   65 2020-01-24              0
## 4  4                   Afghanistan  33   65 2020-01-25              0
## 5  5                   Afghanistan  33   65 2020-01-26              0
## 6  6                   Afghanistan  33   65 2020-01-27              0
##   Fatalities
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

Exploratory Data Analysis

Converting `Date` from a categorical variable to a date variable

train$Date <- as.character(train$Date)
train$Date <- as.Date(train$Date, format = "%Y-%m-%d")
range(train$Date)

## [1] "2020-01-22" "2020-03-24"

Number of Confirmed Cases & Deaths Across Countries / Cities

global_confirmed_cases <- train %>% select(Country.Region, ConfirmedCases, Date) %>%
          group_by(Date) %>% 
          summarise(ConfirmedCases = sum(ConfirmedCases))

head(global_confirmed_cases)

## # A tibble: 6 x 2
##   Date       ConfirmedCases
##   <date>              <dbl>
## 1 2020-01-22            539
## 2 2020-01-23            627
## 3 2020-01-24            901
## 4 2020-01-25           1347
## 5 2020-01-26           1959
## 6 2020-01-27           2694

global_fatalities <- train %>% select(Country.Region, Fatalities, Date) %>%
          group_by(Date) %>% 
          summarise(Fatalities = sum(Fatalities))

head(global_fatalities)

## # A tibble: 6 x 2
##   Date       Fatalities
##   <date>          <dbl>
## 1 2020-01-22         17
## 2 2020-01-23         18
## 3 2020-01-24         25
## 4 2020-01-25         41
## 5 2020-01-26         53
## 6 2020-01-27         79

Plotting the number of confirmed cases

p <- ggplot(global_confirmed_cases, aes(x=Date, y=ConfirmedCases, group=1)) +      
     geom_line() +  
     xlab("Date") + 
     ylab("Confirmed cases")
p

We can see a logarithmic trends starting at march 15. Therefore we will try to verify this by transformming the y-axis to logarithmic base 2.

p <- ggplot(global_confirmed_cases, aes(x=Date, y=ConfirmedCases, group=1)) +      
     geom_line() +  
     scale_y_continuous(trans='log2') + 
     xlab("Date") + 
     ylab("Confirmed cases")
p

We can see that exponential behaviour is confirmed. Therefore we can say that the number of cases exhibits exponentials behaviour at March 15.

Time Series Component

In time series analysis we separate the time series in different components, such as: level, trend, seasonal, and noise. The time series decomposition would be:

Additive: yt=level+trend+seasonal+noise Multiplicative: yt=level∗trend∗seasonal∗noise

As you know, there is no a seasonal component because there is no enough information but there are some infectious diseases. Infections diseases need at least two years to estimate the seasonal component.

First of all, we will plot the time series.

range(global_confirmed_cases$Date)

## [1] "2020-01-22" "2020-03-24"

inds <- seq(as.Date("2020-01-22"), as.Date("2020-03-23"), by = "day")

global_confirmed_cases_ts <- ts(global_confirmed_cases$ConfirmedCases,  
                           start = c( 2020,  as.numeric(format(inds[1], "%j"))), 
                           frequency = 365)

plot(global_confirmed_cases_ts, xlab="Time", ylab="Confirmed Cases", bty="l")

We are going to use an additive model since that it appears that the variance of the time series doesn’t change over different values of the time series. We will smooth the time series in order to see the trend component.

We will use exponential smoothing technique since we need to considered all the data.

Forecasting Model

We will first try to use

Our model will be:

yt=trend+noise

Performance Evaluation

We choose a week as a forecast horizon and as a metric we will use the RMSE defined as follows:

RMSE=∑i=1ne2tn−−−−−−√

Where et is the error defined as the difference between the observed and predicted value.

In order to train and validate the model we will use the following partitions:

## Create a daily Date object - helps my work on dates
inds_train <- seq(as.Date("2020-01-22"), as.Date("2020-03-15"), by = "day")
inds_valid <- seq(as.Date("2020-03-16"), as.Date("2020-03-22"), by = "day")


train.ts <- window(global_confirmed_cases_ts, start = c(2020, as.numeric(format(inds_train[1], "%j"))), end = c(2020, as.numeric(format(inds_train[length(inds_train)], "%j" ))))
valid.ts <- window(global_confirmed_cases_ts, start = c(2020, as.numeric(format(inds_valid[1], "%j"))), end = c(2020, as.numeric(format(inds_valid[length(inds_valid)], "%j"))))

Holts Exponential Smoothing

Holt’s exponential smoothing is good for time series that exhibits trend and has two parameters α and γ ,

α and γ are hyperparameters and we have to tune them, in order to do this we are going to use the ets function provided by the package forecast in the following way.

forecast_horizon <- 7 

hwin <- ets(train.ts, model = "AAN")
hwin_pred <- forecast(hwin, h = forecast_horizon, level = 0)

hwin

## ETS(A,A,N) 
## 
## Call:
##  ets(y = train.ts, model = "AAN") 
## 
##   Smoothing parameters:
##     alpha = 0.9393 
##     beta  = 0.4057 
## 
##   Initial states:
##     l = -1896.6408 
##     b = 1188.5646 
## 
##   sigma:  2956.111
## 
##      AIC     AICc      BIC 
## 1084.345 1085.595 1094.290

The value of alpha equals of 0.9393 means that the level of the series is more weight to most recent information but beta equals to 0.4057 means that there are some influence from past information.

plot(hwin_pred,  ylab = "Confirmed Cases", xlab = "Date", bty = "l")
lines(hwin_pred$fitted, col = "blue")
lines(valid.ts)

It seems that the model underestimate the global confirmed cases.

accuracy(hwin_pred, valid.ts)

##                      ME      RMSE       MAE       MPE     MAPE MASE        ACF1
## Training set   435.5461  2844.519  1463.763 -1.034337 14.47864  NaN -0.01896363
## Test set     38964.7018 50024.889 38964.702 13.863615 13.86362  NaN  0.58764349
##              Theil's U
## Training set        NA
## Test set      1.818235

We’ve got an RMSE of 50024.889 in the test set. This model will be our base line. Finally, we train the model with the whole data set.

base_line <- ets(train.ts, model = "AAN")
base_line_pred <- forecast(base_line, h = forecast_horizon, level = 0)

base_line_pred

##           Point Forecast     Lo 0     Hi 0
## 2020.2055       174830.1 174830.1 174830.1
## 2020.2082       185559.4 185559.4 185559.4
## 2020.2110       196288.8 196288.8 196288.8
## 2020.2137       207018.2 207018.2 207018.2
## 2020.2164       217747.5 217747.5 217747.5
## 2020.2192       228476.9 228476.9 228476.9
## 2020.2219       239206.2 239206.2 239206.2

plot(base_line_pred,  ylab = "Confirmed Cases", xlab = "Date", bty = "l")
lines(base_line_pred$fitted, col = "blue")
lines(valid.ts, col = "red")

# Conclusion

In this project we have succesfully perform a forecast modelling using Holt’s Exponential smoothing method, to predict the number of confirmed cases for Corona Virus.

Covid-19 Forecast Analysis

Ezra Soterion Nugroho

2/7/2020

Brief Introduction

Goals & Objective

Import Library

Import Data

Exploratory Data Analysis

Converting `Date` from a categorical variable to a date variable

Number of Confirmed Cases & Deaths Across Countries / Cities

Time Series Component

Forecasting Model

Performance Evaluation

Holts Exponential Smoothing

Covid-19 Forecast Analysis

Ezra Soterion Nugroho

2/7/2020

Brief Introduction

Goals & Objective

Import Library

Import Data

Exploratory Data Analysis

Converting Date from a categorical variable to a date variable

Number of Confirmed Cases & Deaths Across Countries / Cities

Time Series Component

Forecasting Model

Performance Evaluation

Holts Exponential Smoothing

Converting `Date` from a categorical variable to a date variable