Brief Introduction
In 2020 ,the world was shocked by the presence of a new virus from China, called the Novel Corona Virus (Covid - 19). The virus is so disastrous that it affects numerous country, and many people died.
Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.
At this time, there are no specific vaccines or treatments for COVID-19. However, there are many ongoing clinical trials evaluating potential treatments. WHO will continue to provide updated information as soon as clinical findings become available.
In the mean time Kaggle has launched several challenges in order to provide useful insights that may answer some of the open scientific questions about the virus. In this project we will try to give insights about the COVID19 Global Forecasting. in which we’re trying to fit worldwide data in order to predict the pandemic evolution, hopefully helping to determine which factors impact the transmission behavior of COVID-19.
Goals & Objective
- Predict the pandemic evolution through forecasting from the previous trends.
Import Library
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggplot2
## -- Attaching packages ------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v purrr 0.3.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'readr' was built under R version 3.6.3
## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## Warning: package 'forecast' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Warning: package 'TTR' was built under R version 3.6.3
Import Data
setwd("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Corona")
train <- read.csv("train.csv")
head(train)## Id Province.State Country.Region Lat Long Date ConfirmedCases
## 1 1 Afghanistan 33 65 2020-01-22 0
## 2 2 Afghanistan 33 65 2020-01-23 0
## 3 3 Afghanistan 33 65 2020-01-24 0
## 4 4 Afghanistan 33 65 2020-01-25 0
## 5 5 Afghanistan 33 65 2020-01-26 0
## 6 6 Afghanistan 33 65 2020-01-27 0
## Fatalities
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Exploratory Data Analysis
Converting Date from a categorical variable to a date variable
train$Date <- as.character(train$Date)
train$Date <- as.Date(train$Date, format = "%Y-%m-%d")
range(train$Date)## [1] "2020-01-22" "2020-03-24"
Number of Confirmed Cases & Deaths Across Countries / Cities
global_confirmed_cases <- train %>% select(Country.Region, ConfirmedCases, Date) %>%
group_by(Date) %>%
summarise(ConfirmedCases = sum(ConfirmedCases))## # A tibble: 6 x 2
## Date ConfirmedCases
## <date> <dbl>
## 1 2020-01-22 539
## 2 2020-01-23 627
## 3 2020-01-24 901
## 4 2020-01-25 1347
## 5 2020-01-26 1959
## 6 2020-01-27 2694
global_fatalities <- train %>% select(Country.Region, Fatalities, Date) %>%
group_by(Date) %>%
summarise(Fatalities = sum(Fatalities))## # A tibble: 6 x 2
## Date Fatalities
## <date> <dbl>
## 1 2020-01-22 17
## 2 2020-01-23 18
## 3 2020-01-24 25
## 4 2020-01-25 41
## 5 2020-01-26 53
## 6 2020-01-27 79
Plotting the number of confirmed cases
p <- ggplot(global_confirmed_cases, aes(x=Date, y=ConfirmedCases, group=1)) +
geom_line() +
xlab("Date") +
ylab("Confirmed cases")
p We can see a logarithmic trends starting at march 15. Therefore we will try to verify this by transformming the y-axis to logarithmic base 2.
p <- ggplot(global_confirmed_cases, aes(x=Date, y=ConfirmedCases, group=1)) +
geom_line() +
scale_y_continuous(trans='log2') +
xlab("Date") +
ylab("Confirmed cases")
pWe can see that exponential behaviour is confirmed. Therefore we can say that the number of cases exhibits exponentials behaviour at March 15.
Time Series Component
In time series analysis we separate the time series in different components, such as: level, trend, seasonal, and noise. The time series decomposition would be:
Additive: yt=level+trend+seasonal+noise Multiplicative: yt=level∗trend∗seasonal∗noise
As you know, there is no a seasonal component because there is no enough information but there are some infectious diseases. Infections diseases need at least two years to estimate the seasonal component.
First of all, we will plot the time series.
## [1] "2020-01-22" "2020-03-24"
inds <- seq(as.Date("2020-01-22"), as.Date("2020-03-23"), by = "day")
global_confirmed_cases_ts <- ts(global_confirmed_cases$ConfirmedCases,
start = c( 2020, as.numeric(format(inds[1], "%j"))),
frequency = 365) We are going to use an additive model since that it appears that the variance of the time series doesn’t change over different values of the time series. We will smooth the time series in order to see the trend component.
We will use exponential smoothing technique since we need to considered all the data.
Forecasting Model
We will first try to use
Our model will be:
yt=trend+noise
Performance Evaluation
We choose a week as a forecast horizon and as a metric we will use the RMSE defined as follows:
RMSE=∑i=1ne2tn−−−−−−√
Where et is the error defined as the difference between the observed and predicted value.
In order to train and validate the model we will use the following partitions:
## Create a daily Date object - helps my work on dates
inds_train <- seq(as.Date("2020-01-22"), as.Date("2020-03-15"), by = "day")
inds_valid <- seq(as.Date("2020-03-16"), as.Date("2020-03-22"), by = "day")
train.ts <- window(global_confirmed_cases_ts, start = c(2020, as.numeric(format(inds_train[1], "%j"))), end = c(2020, as.numeric(format(inds_train[length(inds_train)], "%j" ))))
valid.ts <- window(global_confirmed_cases_ts, start = c(2020, as.numeric(format(inds_valid[1], "%j"))), end = c(2020, as.numeric(format(inds_valid[length(inds_valid)], "%j"))))Holts Exponential Smoothing
Holt’s exponential smoothing is good for time series that exhibits trend and has two parameters α and γ ,
α and γ are hyperparameters and we have to tune them, in order to do this we are going to use the ets function provided by the package forecast in the following way.
forecast_horizon <- 7
hwin <- ets(train.ts, model = "AAN")
hwin_pred <- forecast(hwin, h = forecast_horizon, level = 0)## ETS(A,A,N)
##
## Call:
## ets(y = train.ts, model = "AAN")
##
## Smoothing parameters:
## alpha = 0.9393
## beta = 0.4057
##
## Initial states:
## l = -1896.6408
## b = 1188.5646
##
## sigma: 2956.111
##
## AIC AICc BIC
## 1084.345 1085.595 1094.290
The value of alpha equals of 0.9393 means that the level of the series is more weight to most recent information but beta equals to 0.4057 means that there are some influence from past information.
plot(hwin_pred, ylab = "Confirmed Cases", xlab = "Date", bty = "l")
lines(hwin_pred$fitted, col = "blue")
lines(valid.ts) It seems that the model underestimate the global confirmed cases.
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 435.5461 2844.519 1463.763 -1.034337 14.47864 NaN -0.01896363
## Test set 38964.7018 50024.889 38964.702 13.863615 13.86362 NaN 0.58764349
## Theil's U
## Training set NA
## Test set 1.818235
We’ve got an RMSE of 50024.889 in the test set. This model will be our base line. Finally, we train the model with the whole data set.
base_line <- ets(train.ts, model = "AAN")
base_line_pred <- forecast(base_line, h = forecast_horizon, level = 0)## Point Forecast Lo 0 Hi 0
## 2020.2055 174830.1 174830.1 174830.1
## 2020.2082 185559.4 185559.4 185559.4
## 2020.2110 196288.8 196288.8 196288.8
## 2020.2137 207018.2 207018.2 207018.2
## 2020.2164 217747.5 217747.5 217747.5
## 2020.2192 228476.9 228476.9 228476.9
## 2020.2219 239206.2 239206.2 239206.2
plot(base_line_pred, ylab = "Confirmed Cases", xlab = "Date", bty = "l")
lines(base_line_pred$fitted, col = "blue")
lines(valid.ts, col = "red") # Conclusion
In this project we have succesfully perform a forecast modelling using Holt’s Exponential smoothing method, to predict the number of confirmed cases for Corona Virus.