COVID-19 cases trend among Malaysia states and federal territories

Hello, everyone! In this report we present our project, a study of COVID-19 cases across the states and federal territories of Malaysia.

Title of the project

Introduction

Recent daily news reports have shown a large number of confirmed COVID-19 cases in Malaysia, and infection can cause serious harm. In this study we therefore examine the trend of COVID-19 cases across the different states and federal territories of Malaysia during 2020. We then use that trend to look ahead at the likely trend in early 2021.

Dataset source

https://github.com/ynshung/covid-19-malaysia/blob/master/covid-19-my-states-cases.csv

https://github.com/ynshung/covid-19-malaysia/blob/master/covid-19-malaysia.csv

Dataset details

Our first dataset contains the number of confirmed COVID-19 cases in Malaysia's 13 states and 3 federal territories.

There are 290 rows of data, from Mar 13th 2020 to Dec 27th 2020. The columns are a date column plus one column per state or federal territory holding the cumulative number of cases up to that day. The data need cleaning because our analysis requires the daily confirmed cases in each state rather than the cumulative totals. Empty rows, which indicate no newly confirmed cases on that day, also need to be handled, as do 3 rows that contain ‘-’ to indicate 0 cases on that day. Data cleaning is carried out before the analysis starts.

Our second dataset contains nationwide figures for Malaysia as a whole, including the number of deaths due to COVID-19 infection.

There are 339 rows of data, from Jan 24th 2020 to Dec 27th 2020. The columns hold the cumulative number of COVID-19 cases in Malaysia, discharged cases, deaths and the number of patients in ICU. We are interested in the death data, which we use for the classification question later. We therefore filter this dataset to the same date range as our first dataset, which starts on Mar 13th 2020, select only the death column and append it to our first dataset.

Some details on this project

1. Title: COVID-19 cases trend among Malaysia states and federal territories
1. Year: 2020 dataset (starting from Mar 13th 2020 up to Dec 27th 2020)
1. Purpose of dataset: To observe the trend of COVID-19 cases in Malaysia
1. Dimension: 290 rows x 17 columns (1st dataset) & 339 rows x 5 columns (2nd dataset)

Analysis details

Research questions:

1. What is the predicted number of daily new COVID-19 cases in Jan and Feb 2021?
1. Will there be any deaths on a given predicted day?

Research Objectives/Goals:

1. To identify the trend of COVID-19 cases among the different states and federal territories in Malaysia.
1. To observe the predicted trend of COVID-19 cases in Malaysia in the near future.
1. To enable states with a large number of cases to be more cautious and to adopt more concrete policies to stop the infection.
1. To observe whether a particular day in the near future will have any deaths.

Answers to be found from this dataset:

1. A clear picture of the trend of COVID-19 cases among the different states and federal territories in Malaysia.
1. A prediction of potential COVID-19 outbreaks in Malaysia in the near future.
1. Information that helps residents be more cautious about a potential outbreak by staying at home and following the rules and SOPs closely.
1. Information that lets the government anticipate deaths in the near future, so that it can prepare and come up with plans to break the chain of infection.

Data Cleaning steps

First, we ensure that the required packages are all installed, then load them using library().

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:purrr':
## 
##     compact

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(ggplot2)
library(TTR)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tseries)
library(caTools)
library(klaR)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select
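The plyr startup message above recommends loading plyr before dplyr, and MASS (attached as a dependency of klaR) masks `dplyr::select`. The following is a minimal sketch of one load order that avoids both conflicts, assuming the same packages are installed. The `pkgs` vector and the skip-if-missing behaviour are illustrative choices, not part of the original analysis.

```r
# Packages attached later mask those attached earlier, so plyr and MASS
# go first and dplyr's mutate(), summarise() and select() stay visible.
pkgs <- c("plyr", "MASS", "dplyr", "tidyr", "tidyverse", "caret",
          "ggplot2", "TTR", "forecast", "tseries", "caTools", "klaR")

# Report packages that are not installed instead of stopping with an error
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the available packages quietly, in the order given above
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

With this order, a call such as `select()` or `mutate()` resolves to the dplyr version without needing the `dplyr::` prefix.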

We then check the working directory to ensure the path is correct.

getwd()

## [1] "C:/Users/KEVIN LIM/Desktop/Master studies/Programming for Data Science/Group project/updated"

The dataset obtained from the data source is in CSV format, so we use read.csv() to read it into R.

data<-read.csv("new.dataset2.csv")

For a quick first view of the data, we use head().

head(data)

##         date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020      1     5            7     2       87              11      1
## 2 14/03/2020      2     5            7     2       92              19      6
## 3 15/03/2020     NA    NA           NA    NA       NA              NA     NA
## 4 16/03/2020      8    31           15    18      144              42     14
## 5 17/03/2020      8    31           23    23      161              45     17
## 6 18/03/2020      8    36           30    28      192              45     18
##   johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1    20      2         NA        3    15      NA              40            1
## 2    22      2         NA        3    26       6              43            1
## 3    NA     NA         NA       NA    NA      NA              NA             
## 4    52     19          4       18    57      21             106            -
## 5    77     28          7       25    82      29             113            -
## 6    88     29         10       30   103      49             119            -
##   wp.labuan
## 1         2
## 2         2
## 3        NA
## 4         4
## 5         4
## 6         5

We also use the functions str() and summary() to get an overview before starting to clean the data.

# Work on a copy of the data, then check its dimensions and structure for missing values.
data1<-data
dim(data1)

## [1] 290  17

str(data1)

## 'data.frame':    290 obs. of  17 variables:
##  $ date           : chr  "13/03/2020" "14/03/2020" "15/03/2020" "16/03/2020" ...
##  $ perlis         : int  1 2 NA 8 8 8 9 9 9 9 ...
##  $ kedah          : int  5 5 NA 31 31 36 40 41 47 52 ...
##  $ pulau.pinang   : int  7 7 NA 15 23 30 32 37 50 58 ...
##  $ perak          : int  2 2 NA 18 23 28 35 45 55 66 ...
##  $ selangor       : int  87 92 NA 144 161 192 223 263 292 309 ...
##  $ negeri.sembilan: int  11 19 NA 42 45 45 56 65 70 78 ...
##  $ melaka         : int  1 6 NA 14 17 18 20 22 22 23 ...
##  $ johor          : int  20 22 NA 52 77 88 101 114 129 145 ...
##  $ pahang         : int  2 2 NA 19 28 29 32 36 37 40 ...
##  $ terengganu     : int  NA NA NA 4 7 10 11 20 27 32 ...
##  $ kelantan       : int  3 3 NA 18 25 30 44 51 61 63 ...
##  $ sabah          : int  15 26 NA 57 82 103 112 119 136 158 ...
##  $ sarawak        : int  NA 6 NA 21 29 49 51 58 68 76 ...
##  $ wp.kuala.lumpur: int  40 43 NA 106 113 119 123 139 166 183 ...
##  $ wp.putrajaya   : chr  "1" "1" "" "-" ...
##  $ wp.labuan      : int  2 2 NA 4 4 5 5 5 5 5 ...

summary(data1)

##      date               perlis          kedah         pulau.pinang   
##  Length:290         Min.   : 1.00   Min.   :   5.0   Min.   :   7.0  
##  Class :character   1st Qu.:18.00   1st Qu.:  96.0   1st Qu.: 121.0  
##  Mode  :character   Median :20.00   Median : 125.0   Median : 121.0  
##                     Mean   :26.93   Mean   : 774.4   Mean   : 549.1  
##                     3rd Qu.:38.00   3rd Qu.:1890.0   3rd Qu.: 402.0  
##                     Max.   :46.00   Max.   :2957.0   Max.   :3201.0  
##                     NA's   :1       NA's   :1        NA's   :1       
##      perak           selangor     negeri.sembilan     melaka     
##  Min.   :   2.0   Min.   :   87   Min.   :  11    Min.   :  1.0  
##  1st Qu.: 255.0   1st Qu.: 1829   1st Qu.: 792    1st Qu.:216.0  
##  Median : 264.0   Median : 2130   Median :1029    Median :258.0  
##  Mean   : 538.4   Mean   : 4459   Mean   :1608    Mean   :268.2  
##  3rd Qu.: 383.0   3rd Qu.: 3014   3rd Qu.:1096    3rd Qu.:281.0  
##  Max.   :3050.0   Max.   :29272   Max.   :7496    Max.   :955.0  
##  NA's   :1        NA's   :1       NA's   :1       NA's   :1      
##      johor            pahang       terengganu       kelantan    
##  Min.   :  20.0   Min.   :   2   Min.   :  4.0   Min.   :  3.0  
##  1st Qu.: 671.0   1st Qu.: 344   1st Qu.:111.0   1st Qu.:156.0  
##  Median : 743.0   Median : 370   Median :114.0   Median :160.0  
##  Mean   : 898.3   Mean   : 368   Mean   :137.9   Mean   :191.1  
##  3rd Qu.: 834.0   3rd Qu.: 387   3rd Qu.:158.5   3rd Qu.:173.0  
##  Max.   :4632.0   Max.   :1003   Max.   :301.0   Max.   :606.0  
##  NA's   :1        NA's   :1      NA's   :3       NA's   :1      
##      sabah          sarawak       wp.kuala.lumpur wp.putrajaya      
##  Min.   :   15   Min.   :   6.0   Min.   :   40   Length:290        
##  1st Qu.:  343   1st Qu.: 549.0   1st Qu.: 1807   Class :character  
##  Median :  402   Median : 678.0   Median : 2491   Mode  :character  
##  Mean   : 6534   Mean   : 662.1   Mean   : 2893                     
##  3rd Qu.: 6286   3rd Qu.: 760.0   3rd Qu.: 2824                     
##  Max.   :36074   Max.   :1108.0   Max.   :12494                     
##  NA's   :1       NA's   :2        NA's   :1                         
##    wp.labuan     
##  Min.   :   2.0  
##  1st Qu.:  16.0  
##  Median :  20.0  
##  Mean   : 278.8  
##  3rd Qu.: 111.0  
##  Max.   :1636.0  
##  NA's   :1
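The `str()` output shows that `date` is stored as character in day/month/year form. If a true date type is needed later (for example for plotting or time-series work), a conversion along these lines could be used. This is a small sketch on hypothetical sample values, not a step from the original analysis.

```r
# Hypothetical sample in the same dd/mm/yyyy layout as the date column
date_chr <- c("13/03/2020", "14/03/2020", "27/12/2020")

# The format string must match the layout exactly: day, month, 4-digit year
date_parsed <- as.Date(date_chr, format = "%d/%m/%Y")

# Number of calendar days covered, counting both end points
span_days <- as.integer(max(date_parsed) - min(date_parsed)) + 1
print(span_days)
```

The span from Mar 13th 2020 to Dec 27th 2020 works out to 290 days, which agrees with the 290 rows reported by `dim()`, i.e. one row per calendar day.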

From the structure check above, we noticed that one column, ‘wp.putrajaya’, is stored as character instead of numeric, so we need to convert it. The conversion turns the ‘-’ and empty entries into NA, which is what the coercion warning reports.

data2<-transform(data1, wp.putrajaya=as.numeric(wp.putrajaya))

## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion

head(data2)

##         date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020      1     5            7     2       87              11      1
## 2 14/03/2020      2     5            7     2       92              19      6
## 3 15/03/2020     NA    NA           NA    NA       NA              NA     NA
## 4 16/03/2020      8    31           15    18      144              42     14
## 5 17/03/2020      8    31           23    23      161              45     17
## 6 18/03/2020      8    36           30    28      192              45     18
##   johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1    20      2         NA        3    15      NA              40            1
## 2    22      2         NA        3    26       6              43            1
## 3    NA     NA         NA       NA    NA      NA              NA           NA
## 4    52     19          4       18    57      21             106           NA
## 5    77     28          7       25    82      29             113           NA
## 6    88     29         10       30   103      49             119           NA
##   wp.labuan
## 1         2
## 2         2
## 3        NA
## 4         4
## 5         4
## 6         5
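The coercion warning can be avoided by marking the known placeholders as missing before converting. The sketch below uses a hypothetical sample vector shaped like the first rows of `wp.putrajaya`; the variable names are for illustration only.

```r
# Hypothetical sample: numbers as text, one empty entry and three "-" placeholders
raw <- c("1", "1", "", "-", "-", "-", "2")

# Turn the known placeholders into NA explicitly
clean <- raw
clean[clean %in% c("", "-")] <- NA

# Converting a character NA gives a numeric NA without a warning
putrajaya <- as.numeric(clean)
print(putrajaya)
```

If the placeholders are known in advance, `read.csv()` also accepts `na.strings = c("", "-", "NA")`, which marks them as missing at import time.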

We then check which positions contain NA or missing values.

rowSums(is.na(data2))==0

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [145]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [157]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [169]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [181]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [193]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [205]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [217]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [229]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [241]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [253]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [265]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [277]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [289]  TRUE  TRUE

which(is.na(data2), arr.ind=TRUE)

##       row col
##  [1,]   3   2
##  [2,]   3   3
##  [3,]   3   4
##  [4,]   3   5
##  [5,]   3   6
##  [6,]   3   7
##  [7,]   3   8
##  [8,]   3   9
##  [9,]   3  10
## [10,]   1  11
## [11,]   2  11
## [12,]   3  11
## [13,]   3  12
## [14,]   3  13
## [15,]   1  14
## [16,]   3  14
## [17,]   3  15
## [18,]   3  16
## [19,]   4  16
## [20,]   5  16
## [21,]   6  16
## [22,]   3  17
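A more compact summary of the same information counts the missing values per column and lists only the affected rows. The sketch below runs on a hypothetical toy data frame with gaps similar to those above.

```r
# Hypothetical toy frame with the same kind of gaps as the state data
toy <- data.frame(
  date = c("13/03/2020", "14/03/2020", "15/03/2020", "16/03/2020"),
  kedah = c(5, 5, NA, 31),
  terengganu = c(NA, NA, NA, 4)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Row numbers that contain at least one missing value
incomplete_rows <- which(!complete.cases(toy))

print(na_per_column)
print(incomplete_rows)
```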

Rows with empty values had no increase in the number of COVID-19 cases on that day, so the increment is 0. Since the data are cumulative, we use fill() to carry the value from the previous row forward into each row with missing values.

data3<-data2%>%
  fill(c(perlis, kedah, pulau.pinang, perak, selangor, negeri.sembilan, melaka, johor, pahang, terengganu, kelantan, sabah, sarawak, wp.kuala.lumpur, wp.putrajaya, wp.labuan), .direction = "down")

head(data3)

##         date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020      1     5            7     2       87              11      1
## 2 14/03/2020      2     5            7     2       92              19      6
## 3 15/03/2020      2     5            7     2       92              19      6
## 4 16/03/2020      8    31           15    18      144              42     14
## 5 17/03/2020      8    31           23    23      161              45     17
## 6 18/03/2020      8    36           30    28      192              45     18
##   johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1    20      2         NA        3    15      NA              40            1
## 2    22      2         NA        3    26       6              43            1
## 3    22      2         NA        3    26       6              43            1
## 4    52     19          4       18    57      21             106            1
## 5    77     28          7       25    82      29             113            1
## 6    88     29         10       30   103      49             119            1
##   wp.labuan
## 1         2
## 2         2
## 3         2
## 4         4
## 5         4
## 6         5
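To make the downward fill concrete, here is a base-R sketch of the same idea (last observation carried forward). `fill_down` is a hypothetical helper written for illustration; it is not the tidyr implementation.

```r
# Carry the most recent non-missing value forward; leading NAs stay NA
fill_down <- function(x) {
  seen <- !is.na(x)
  # cumsum(seen) counts the observed values so far, which is the position
  # (within the observed values) of the latest one at each row
  idx <- cumsum(seen)
  out <- x
  out[idx > 0] <- x[seen][idx[idx > 0]]
  out
}

cumulative <- c(NA, 6, NA, 21, 29, NA)
filled <- fill_down(cumulative)
print(filled)
```

The leading NA is left untouched because there is no earlier value to copy, which is the same situation as the Terengganu and Sarawak cells that remain NA after the fill.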

As Terengganu and Sarawak have no values on the first days of the data, there is no earlier row to carry forward, so we fill in 0 for the remaining empty cells.

which(is.na(data3), arr.ind=TRUE)

##      row col
## [1,]   1  11
## [2,]   2  11
## [3,]   3  11
## [4,]   1  14

data3[is.na(data3)]<-0
head(data3)

##         date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020      1     5            7     2       87              11      1
## 2 14/03/2020      2     5            7     2       92              19      6
## 3 15/03/2020      2     5            7     2       92              19      6
## 4 16/03/2020      8    31           15    18      144              42     14
## 5 17/03/2020      8    31           23    23      161              45     17
## 6 18/03/2020      8    36           30    28      192              45     18
##   johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1    20      2          0        3    15       0              40            1
## 2    22      2          0        3    26       6              43            1
## 3    22      2          0        3    26       6              43            1
## 4    52     19          4       18    57      21             106            1
## 5    77     28          7       25    82      29             113            1
## 6    88     29         10       30   103      49             119            1
##   wp.labuan
## 1         2
## 2         2
## 3         2
## 4         4
## 5         4
## 6         5

To confirm that there are no other missing values, we check again with the following functions.

which(is.na(data3), arr.ind=TRUE)

##      row col

data3[!complete.cases(data3),]

##  [1] date            perlis          kedah           pulau.pinang   
##  [5] perak           selangor        negeri.sembilan melaka         
##  [9] johor           pahang          terengganu      kelantan       
## [13] sabah           sarawak         wp.kuala.lumpur wp.putrajaya   
## [17] wp.labuan      
## <0 rows> (or 0-length row.names)

As the data are cumulative from one day to the next, we use the function below to compute the number of new confirmed COVID-19 cases each day. A new column is created for each state and federal territory. The function takes the difference between the current row and the previous row, and that difference is the number of new cases on that day.

data3$perlis_daily <- ave(data3$perlis, FUN = function(x) c(0, diff(x)))
data3$kedah_daily <- ave(data3$kedah, FUN = function(x) c(0, diff(x)))
data3$pulau.pinang_daily <- ave(data3$pulau.pinang, FUN = function(x) c(0, diff(x)))
data3$perak_daily <- ave(data3$perak, FUN = function(x) c(0, diff(x)))
data3$selangor_daily <- ave(data3$selangor, FUN = function(x) c(0, diff(x)))
data3$negeri.sembilan_daily <- ave(data3$negeri.sembilan, FUN = function(x) c(0, diff(x)))
data3$melaka_daily <- ave(data3$melaka, FUN = function(x) c(0, diff(x)))
data3$johor_daily <- ave(data3$johor, FUN = function(x) c(0, diff(x)))
data3$pahang_daily <- ave(data3$pahang, FUN = function(x) c(0, diff(x)))
data3$terengganu_daily <- ave(data3$terengganu, FUN = function(x) c(0, diff(x)))
data3$kelantan_daily <- ave(data3$kelantan, FUN = function(x) c(0, diff(x)))
data3$sabah_daily <- ave(data3$sabah, FUN = function(x) c(0, diff(x)))
data3$sarawak_daily <- ave(data3$sarawak, FUN = function(x) c(0, diff(x)))
data3$wp.kuala.lumpur_daily <- ave(data3$wp.kuala.lumpur, FUN = function(x) c(0, diff(x)))
data3$wp.putrajaya_daily <- ave(data3$wp.putrajaya, FUN = function(x) c(0, diff(x)))
data3$wp.labuan_daily <- ave(data3$wp.labuan, FUN = function(x) c(0, diff(x)))

head(data3)

##         date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020      1     5            7     2       87              11      1
## 2 14/03/2020      2     5            7     2       92              19      6
## 3 15/03/2020      2     5            7     2       92              19      6
## 4 16/03/2020      8    31           15    18      144              42     14
## 5 17/03/2020      8    31           23    23      161              45     17
## 6 18/03/2020      8    36           30    28      192              45     18
##   johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1    20      2          0        3    15       0              40            1
## 2    22      2          0        3    26       6              43            1
## 3    22      2          0        3    26       6              43            1
## 4    52     19          4       18    57      21             106            1
## 5    77     28          7       25    82      29             113            1
## 6    88     29         10       30   103      49             119            1
##   wp.labuan perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1         2            0           0                  0           0
## 2         2            1           0                  0           0
## 3         2            0           0                  0           0
## 4         4            6          26                  8          16
## 5         4            0           0                  8           5
## 6         5            0           5                  7           5
##   selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1              0                     0            0           0            0
## 2              5                     8            5           2            0
## 3              0                     0            0           0            0
## 4             52                    23            8          30           17
## 5             17                     3            3          25            9
## 6             31                     0            1          11            1
##   terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1                0              0           0             0
## 2                0              0          11             6
## 3                0              0           0             0
## 4                4             15          31            15
## 5                3              7          25             8
## 6                3              5          21            20
##   wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily
## 1                     0                  0               0
## 2                     3                  0               0
## 3                     0                  0               0
## 4                    63                  0               2
## 5                     7                  0               0
## 6                     6                  0               1

The previous step takes the differences of cumulative cases from one day to the next to get the number of new cases. For the very first row, however, there is no previous row to subtract, so we replace the first row of the newly computed columns with the first row of the cumulative data.

data3[1,18:33]<-data3[1,2:17]
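The differencing and the first-row correction can also be combined in a single step applied to every column. The sketch below uses the first six cumulative values of Perlis and Kedah shown in the `head()` output; `to_daily` is a hypothetical helper name.

```r
# Cumulative counts for two states (first six rows from the output above)
cum <- data.frame(
  perlis = c(1, 2, 2, 8, 8, 8),
  kedah  = c(5, 5, 5, 31, 31, 36)
)

# The first day keeps its cumulative value; later days are day-over-day differences
to_daily <- function(x) c(x[1], diff(x))

daily <- as.data.frame(lapply(cum, to_daily))
names(daily) <- paste0(names(cum), "_daily")

# Sanity checks: summing the daily values restores the cumulative series,
# and a cumulative series should never produce a negative increment
stopifnot(all(cumsum(daily$perlis_daily) == cum$perlis))
any_negative <- any(daily < 0)

print(daily)
```

A negative increment would point to a correction or a data-entry problem in the cumulative source, so it is worth checking before modelling.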

Get the total daily new cases across all the states and federal territories in Malaysia.

total_daily<-data3[,18:33]
data3$total_daily<-rowSums(total_daily)
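Positional indices such as `18:33` silently break if a column is added or reordered. A possible alternative is to select the daily columns by name, sketched here on a hypothetical two-row frame.

```r
# Hypothetical frame with three of the daily columns
daily <- data.frame(
  date = c("13/03/2020", "14/03/2020"),
  perlis_daily = c(1, 1),
  kedah_daily = c(5, 0),
  selangor_daily = c(87, 5)
)

# Pick every column whose name ends in "_daily"
daily_cols <- grep("_daily$", names(daily), value = TRUE)

# Row-wise total over just those columns
daily$total_daily <- rowSums(daily[, daily_cols])
print(daily)
```

Note that the pattern would also match `total_daily` itself, so the column names should be collected before the total is added, as done here.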

Generate a new data frame that includes the daily new cases for every state and the total.

daily_new_case<-data3[,-c(2:17)]
head(daily_new_case)

##         date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020            1           5                  7           2
## 2 14/03/2020            1           0                  0           0
## 3 15/03/2020            0           0                  0           0
## 4 16/03/2020            6          26                  8          16
## 5 17/03/2020            0           0                  8           5
## 6 18/03/2020            0           5                  7           5
##   selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1             87                    11            1          20            2
## 2              5                     8            5           2            0
## 3              0                     0            0           0            0
## 4             52                    23            8          30           17
## 5             17                     3            3          25            9
## 6             31                     0            1          11            1
##   terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1                0              3          15             0
## 2                0              0          11             6
## 3                0              0           0             0
## 4                4             15          31            15
## 5                3              7          25             8
## 6                3              5          21            20
##   wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily
## 1                    40                  1               2         197
## 2                     3                  0               0          41
## 3                     0                  0               0           0
## 4                    63                  0               2         316
## 5                     7                  0               0         120
## 6                     6                  0               1         117

We read in the data on deaths due to COVID-19 in Malaysia.

deathdata<-read.csv("deathdata.csv")
head(deathdata)

##         date cases discharged death icu
## 1 24/01/2020     0          0     0   0
## 2 25/01/2020     3          0     0   0
## 3 26/01/2020     4          0     0   0
## 4 27/01/2020     4          0     0   0
## 5 28/01/2020     4          0     0   0
## 6 29/01/2020     7          0     0   0

Narrow it down to the same period as our first dataset.

deathdata1<- deathdata[50:339,1:4]
head(deathdata1)

##          date cases discharged death
## 50 13/03/2020   197         32     0
## 51 14/03/2020   238         35     0
## 52 15/03/2020   428         42     0
## 53 16/03/2020   553         42     0
## 54 17/03/2020   673         49     2
## 55 18/03/2020   790         60     2
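The row range `50:339` depends on both files having exactly one row per day in the same order. A less fragile option is to align the two tables on the `date` column. The sketch below uses small hypothetical frames whose values are taken from the outputs above.

```r
# Hypothetical excerpts of the two tables
cases <- data.frame(
  date = c("16/03/2020", "17/03/2020", "18/03/2020"),
  total_daily = c(316, 120, 117)
)
deaths <- data.frame(
  date = c("14/03/2020", "15/03/2020", "16/03/2020", "17/03/2020", "18/03/2020"),
  death = c(0, 0, 0, 2, 2)
)

# For each case date, find the row of the death table with the same date
idx <- match(cases$date, deaths$date)

# Stop early if any date in the case table has no counterpart
stopifnot(!anyNA(idx))

cases$deathcum <- deaths$death[idx]
print(cases)
```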

We add the cumulative death column to our first dataset.

data5<-daily_new_case
data5$deathcum<-deathdata1$death
head(data5)

##         date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020            1           5                  7           2
## 2 14/03/2020            1           0                  0           0
## 3 15/03/2020            0           0                  0           0
## 4 16/03/2020            6          26                  8          16
## 5 17/03/2020            0           0                  8           5
## 6 18/03/2020            0           5                  7           5
##   selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1             87                    11            1          20            2
## 2              5                     8            5           2            0
## 3              0                     0            0           0            0
## 4             52                    23            8          30           17
## 5             17                     3            3          25            9
## 6             31                     0            1          11            1
##   terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1                0              3          15             0
## 2                0              0          11             6
## 3                0              0           0             0
## 4                4             15          31            15
## 5                3              7          25             8
## 6                3              5          21            20
##   wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily deathcum
## 1                    40                  1               2         197        0
## 2                     3                  0               0          41        0
## 3                     0                  0               0           0        0
## 4                    63                  0               2         316        0
## 5                     7                  0               0         120        2
## 6                     6                  0               1         117        2

Compute the daily deaths from the cumulative deaths in the data.

data5$death_daily <- ave(data5$deathcum, FUN = function(x) c(0, diff(x)))
head(data5)

##         date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020            1           5                  7           2
## 2 14/03/2020            1           0                  0           0
## 3 15/03/2020            0           0                  0           0
## 4 16/03/2020            6          26                  8          16
## 5 17/03/2020            0           0                  8           5
## 6 18/03/2020            0           5                  7           5
##   selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1             87                    11            1          20            2
## 2              5                     8            5           2            0
## 3              0                     0            0           0            0
## 4             52                    23            8          30           17
## 5             17                     3            3          25            9
## 6             31                     0            1          11            1
##   terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1                0              3          15             0
## 2                0              0          11             6
## 3                0              0           0             0
## 4                4             15          31            15
## 5                3              7          25             8
## 6                3              5          21            20
##   wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily deathcum
## 1                    40                  1               2         197        0
## 2                     3                  0               0          41        0
## 3                     0                  0               0           0        0
## 4                    63                  0               2         316        0
## 5                     7                  0               0         120        2
## 6                     6                  0               1         117        2
##   death_daily
## 1           0
## 2           0
## 3           0
## 4           0
## 5           2
## 6           0
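For the classification question stated earlier (whether a given day has any death), the daily deaths can also be turned into a two-level label. The sketch below uses the six cumulative values shown above. Storing the label as a factor with a fixed level order is an assumption about what a later classifier would need, not a step taken from the original code.

```r
# Cumulative deaths for the first six days (from the output above)
deathcum <- c(0, 0, 0, 0, 2, 2)

# Daily deaths: first day keeps its value, later days are differences
death_daily <- c(deathcum[1], diff(deathcum))

# Two-level label with an explicit level order, "NO" first
death_exist <- factor(ifelse(death_daily > 0, "YES", "NO"),
                      levels = c("NO", "YES"))

print(table(death_exist))
```

Fixing the levels explicitly keeps both classes present even in a subset of days where one of them does not occur.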

Add a new column that is YES if any death occurred on that day and NO otherwise.

data5$death_exist <- ifelse(data5$death_daily >0, "YES", "NO")
daily_new_case<-data5
head(data5)

##         date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020            1           5                  7           2
## 2 14/03/2020            1           0                  0           0
## 3 15/03/2020            0           0                  0           0
## 4 16/03/2020            6          26                  8          16
## 5 17/03/2020            0           0                  8           5
## 6 18/03/2020            0           5                  7           5
##   selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1             87                    11            1          20            2
## 2              5                     8            5           2            0
## 3              0                     0            0           0            0
## 4             52                    23            8          30           17
## 5             17                     3            3          25            9
## 6             31                     0            1          11            1
##   terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1                0              3          15             0
## 2                0              0          11             6
## 3                0              0           0             0
## 4                4             15          31            15
## 5                3              7          25             8
## 6                3              5          21            20
##   wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily deathcum
## 1                    40                  1               2         197        0
## 2                     3                  0               0          41        0
## 3                     0                  0               0           0        0
## 4                    63                  0               2         316        0
## 5                     7                  0               0         120        2
## 6                     6                  0               1         117        2
##   death_daily death_exist
## 1           0          NO
## 2           0          NO
## 3           0          NO
## 4           0          NO
## 5           2         YES
## 6           0          NO
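
The labelling step can be illustrated on a small vector. This is a minimal sketch with made-up death counts (the values and the standalone variable names are assumptions for illustration, not taken from the dataset); it also shows how the label could be stored as a factor with an explicit level order.

```r
# Hypothetical daily death counts for five days (illustrative values only)
death_daily <- c(0, 0, 2, 0, 1)

# Label each day "YES" if at least one death was recorded, otherwise "NO"
death_exist <- ifelse(death_daily > 0, "YES", "NO")

# Storing the label as a factor with explicit levels keeps "NO" as the first class
death_exist <- factor(death_exist, levels = c("NO", "YES"))
print(table(death_exist))
```

Fixing the level order up front means a later logistic regression models the probability of "YES", the second level.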

Drop the deathcum and death_daily columns (columns 19 and 20), which are no longer needed

daily_new_case<-daily_new_case[,-c(19:20)]
head(daily_new_case)

##         date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020            1           5                  7           2
## 2 14/03/2020            1           0                  0           0
## 3 15/03/2020            0           0                  0           0
## 4 16/03/2020            6          26                  8          16
## 5 17/03/2020            0           0                  8           5
## 6 18/03/2020            0           5                  7           5
##   selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1             87                    11            1          20            2
## 2              5                     8            5           2            0
## 3              0                     0            0           0            0
## 4             52                    23            8          30           17
## 5             17                     3            3          25            9
## 6             31                     0            1          11            1
##   terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1                0              3          15             0
## 2                0              0          11             6
## 3                0              0           0             0
## 4                4             15          31            15
## 5                3              7          25             8
## 6                3              5          21            20
##   wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily
## 1                    40                  1               2         197
## 2                     3                  0               0          41
## 3                     0                  0               0           0
## 4                    63                  0               2         316
## 5                     7                  0               0         120
## 6                     6                  0               1         117
##   death_exist
## 1          NO
## 2          NO
## 3          NO
## 4          NO
## 5         YES
## 6          NO
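
Dropping columns by position works here, but it silently removes the wrong columns if the column order ever changes. A minimal sketch of dropping by name instead, on a toy data frame (the values and the names df, drop_cols and df_clean are assumptions for illustration):

```r
# Toy data frame with the same kind of columns (values are made up)
df <- data.frame(
  total_daily = c(197, 41, 0),
  deathcum    = c(0, 0, 0),
  death_daily = c(0, 0, 0),
  death_exist = c("NO", "NO", "NO")
)

# Drop columns by name rather than by position
drop_cols <- c("deathcum", "death_daily")
df_clean <- df[, !(names(df) %in% drop_cols), drop = FALSE]
print(names(df_clean))
```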

Save the data to a CSV file named “daily case.csv”

write.csv(daily_new_case,"daily case.csv", row.names = TRUE)

str(daily_new_case)

## 'data.frame':    290 obs. of  19 variables:
##  $ date                 : chr  "13/03/2020" "14/03/2020" "15/03/2020" "16/03/2020" ...
##  $ perlis_daily         : num  1 1 0 6 0 0 1 0 0 0 ...
##  $ kedah_daily          : num  5 0 0 26 0 5 4 1 6 5 ...
##  $ pulau.pinang_daily   : num  7 0 0 8 8 7 2 5 13 8 ...
##  $ perak_daily          : num  2 0 0 16 5 5 7 10 10 11 ...
##  $ selangor_daily       : num  87 5 0 52 17 31 31 40 29 17 ...
##  $ negeri.sembilan_daily: num  11 8 0 23 3 0 11 9 5 8 ...
##  $ melaka_daily         : num  1 5 0 8 3 1 2 2 0 1 ...
##  $ johor_daily          : num  20 2 0 30 25 11 13 13 15 16 ...
##  $ pahang_daily         : num  2 0 0 17 9 1 3 4 1 3 ...
##  $ terengganu_daily     : num  0 0 0 4 3 3 1 9 7 5 ...
##  $ kelantan_daily       : num  3 0 0 15 7 5 14 7 10 2 ...
##  $ sabah_daily          : num  15 11 0 31 25 21 9 7 17 22 ...
##  $ sarawak_daily        : num  0 6 0 15 8 20 2 7 10 8 ...
##  $ wp.kuala.lumpur_daily: num  40 3 0 63 7 6 4 16 27 17 ...
##  $ wp.putrajaya_daily   : num  1 0 0 0 0 0 5 0 3 0 ...
##  $ wp.labuan_daily      : num  2 0 0 2 0 1 0 0 0 0 ...
##  $ total_daily          : num  197 41 0 316 120 117 109 130 153 123 ...
##  $ death_exist          : chr  "NO" "NO" "NO" "NO" ...
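
One detail of the write.csv() call above is worth knowing: with row.names = TRUE the file gains an extra, unnamed index column. A minimal sketch of what happens when such a file is read back, using a temporary file and a toy data frame (both are assumptions for illustration):

```r
# Toy data frame standing in for daily_new_case (values are made up)
df <- data.frame(date = c("13/03/2020", "14/03/2020"), total_daily = c(197, 41))

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = TRUE)

# With row.names = TRUE the unnamed first column is read back as "X"
with_index <- read.csv(path)
print(names(with_index))

# row.names = FALSE writes only the real columns
write.csv(df, path, row.names = FALSE)
without_index <- read.csv(path)
print(names(without_index))
```

If the saved file is meant to be reloaded for modelling, row.names = FALSE avoids carrying the index column along as a spurious predictor.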

Run the program for question 1: predict the daily new COVID-19 cases for January and February 2021. An ARIMA model is used for the regression because the data form a time series.

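# Daily date sequence covering the dataset (dates are stored as dd/mm/yyyy)
inds <- seq(as.Date("13/03/2020", format = "%d/%m/%Y"),
            as.Date("27/12/2020", format = "%d/%m/%Y"), by = "day")
# Convert the daily totals into a time series starting on the first date
myts <- ts(daily_new_case$total_daily,
    start = c(2020, as.numeric(format(inds[1], "%j"))), frequency = 365)
# Check the day of year of the first date
as.numeric(format(inds[1], "%j"))

## [1] 73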
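
The start position of the series depends on how the date strings are parsed. Without a format argument, as.Date() tries year-first formats, so a day-first string such as "13/03/2020" is not read as intended. A minimal sketch, using only base R, of parsing with an explicit format and checking the first and last dates (the variable names are assumptions for illustration):

```r
# Parse a dd/mm/yyyy string with an explicit format
first_day <- as.Date("13/03/2020", format = "%d/%m/%Y")
print(first_day)

# Day of the year: 31 (Jan) + 29 (Feb, leap year) + 13 = 73
doy <- as.numeric(format(first_day, "%j"))
print(doy)

# 290 consecutive daily observations end 289 days after the first one
last_day <- first_day + 289
print(format(last_day, "%d/%m/%Y"))
```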

EDA

Plot the number of daily cases vs date

plot(myts, main="Daily covid 19 cases in Malaysia",xlab="Date",ylab="Number of daily cases")

Regression

Fit an ARIMA model and forecast the number of daily cases for the next 60 days

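library(forecast)  # provides auto.arima() and forecast()
# Select the ARIMA orders automatically, then forecast 60 days ahead
fit <- auto.arima(myts)
fore <- forecast(fit, h = 60)
plot(fore, main="Daily covid 19 cases in Malaysia",xlab="Date",ylab="Number of daily cases")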

Evaluation of the ARIMA model

summary(fore)

## 
## Forecast method: ARIMA(2,1,3) with drift
## 
## Model Information:
## Series: myts 
## ARIMA(2,1,3) with drift 
## 
## Coefficients:
##           ar1      ar2      ma1     ma2      ma3   drift
##       -0.5486  -0.8767  -0.1113  0.4066  -0.7917  5.0579
## s.e.   0.0348   0.0819   0.0469  0.0855   0.0409  2.1015
## 
## sigma^2 estimated as 29327:  log likelihood=-1895.01
## AIC=3804.01   AICc=3804.41   BIC=3829.68
## 
## Error measures:
##                      ME     RMSE      MAE  MPE MAPE MASE        ACF1
## Training set -0.3008319 169.1729 95.68176 -Inf  Inf  NaN -0.03735384
## 
## Forecasts:
##           Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2021.0082       1516.094 1296.623 1735.566 1180.442 1851.747
## 2021.0110       1544.956 1313.137 1776.774 1190.419 1899.492
## 2021.0137       1692.030 1454.684 1929.376 1329.041 2055.020
## 2021.0164       1598.313 1360.349 1836.278 1234.378 1962.248
## 2021.0192       1533.055 1288.481 1777.629 1159.011 1907.099
## 2021.0219       1663.280 1410.341 1916.219 1276.443 2050.117
## 2021.0247       1661.318 1407.084 1915.552 1272.501 2050.135
## 2021.0274       1560.497 1303.149 1817.844 1166.917 1954.076
## 2021.0301       1629.792 1364.075 1895.509 1223.413 2036.171
## 2021.0329       1692.432 1423.834 1961.030 1281.646 2103.218
## 2021.0356       1609.586 1339.206 1879.967 1196.075 2023.098
## 2021.0384       1612.386 1335.617 1889.155 1189.104 2035.668
## 2021.0411       1695.745 1414.331 1977.160 1265.359 2126.131
## 2021.0438       1659.828 1376.738 1942.919 1226.879 2092.778
## 2021.0466       1618.720 1331.418 1906.021 1179.330 2058.110
## 2021.0493       1685.025 1392.188 1977.863 1237.169 2132.882
## 2021.0521       1696.957 1401.857 1992.056 1245.641 2148.273
## 2021.0548       1644.550 1346.638 1942.462 1188.934 2100.167
## 2021.0575       1675.106 1371.949 1978.263 1211.467 2138.745
## 2021.0603       1716.554 1410.232 2022.876 1248.074 2185.034
## 2021.0630       1679.296 1370.722 1987.869 1207.373 2151.218
## 2021.0658       1675.666 1362.806 1988.525 1197.188 2154.143
## 2021.0685       1722.587 1405.821 2039.353 1238.135 2207.039
## 2021.0712       1712.296 1393.250 2031.342 1224.357 2200.235
## 2021.0740       1689.074 1366.706 2011.441 1196.054 2182.093
## 2021.0767       1723.101 1396.594 2049.609 1223.752 2222.451
## 2021.0795       1737.060 1407.910 2066.210 1233.668 2240.451
## 2021.0822       1711.838 1379.987 2043.689 1204.315 2219.361
## 2021.0849       1725.704 1389.979 2061.429 1212.257 2239.151
## 2021.0877       1752.475 1413.670 2091.281 1234.317 2270.633
## 2021.0904       1737.900 1396.630 2079.169 1215.973 2259.827
## 2021.0932       1734.693 1390.050 2079.336 1207.607 2261.779
## 2021.0959       1761.497 1413.493 2109.501 1229.271 2293.723
## 2021.0986       1761.871 1411.355 2112.386 1225.804 2297.938
## 2021.1014       1750.434 1397.010 2103.858 1209.919 2290.950
## 2021.1041       1768.647 1411.845 2125.449 1222.965 2314.329
## 2021.1068       1780.949 1421.446 2140.452 1231.136 2330.761
## 2021.1096       1770.500 1408.381 2132.620 1216.686 2324.315
## 2021.1123       1777.714 1412.411 2143.018 1219.030 2336.398
## 2021.1151       1795.183 1426.991 2163.376 1232.082 2358.285
## 2021.1178       1791.543 1420.839 2162.247 1224.600 2358.486
## 2021.1205       1790.492 1416.878 2164.106 1219.099 2361.885
## 2021.1233       1806.527 1429.940 2183.114 1230.587 2382.467
## 2021.1260       1810.918 1431.791 2190.046 1231.093 2390.744
## 2021.1288       1806.719 1424.920 2188.518 1222.808 2390.630
## 2021.1315       1817.439 1432.712 2202.166 1229.050 2405.829
## 2021.1342       1827.507 1440.156 2214.858 1235.105 2419.909
## 2021.1370       1824.852 1434.976 2214.728 1228.589 2421.116
## 2021.1397       1829.750 1437.079 2222.420 1229.212 2430.287
## 2021.1425       1841.657 1446.297 2237.016 1237.007 2446.307
## 2021.1452       1843.098 1445.267 2240.929 1234.668 2451.528
## 2021.1479       1844.136 1443.668 2244.603 1231.673 2456.598
## 2021.1507       1854.570 1451.407 2257.733 1237.985 2471.154
## 2021.1534       1860.203 1454.562 2265.845 1239.828 2480.578
## 2021.1562       1860.232 1452.084 2268.380 1236.024 2484.440
## 2021.1589       1867.544 1456.756 2278.332 1239.298 2495.790
## 2021.1616       1875.774 1462.483 2289.066 1243.699 2507.849
## 2021.1644       1877.116 1461.399 2292.833 1241.331 2512.901
## 2021.1671       1881.432 1463.166 2299.698 1241.749 2521.114
## 2021.1699       1890.155 1469.376 2310.934 1246.628 2533.681
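
The error measures above are computed on the training data, and MPE and MAPE come out infinite because the series contains days with zero cases. A hold-out check gives a more honest picture of forecast quality. The following is a minimal sketch under stated assumptions: it uses a synthetic random-walk series rather than the real data, and a fixed ARIMA(1,1,0) fitted with base R's arima() instead of the model chosen by auto.arima().

```r
set.seed(42)
# Synthetic daily series standing in for total_daily (random walk, not real data)
y <- 100 + cumsum(rnorm(120))

# Hold out the last 20 days for testing
n_test <- 20
train_y <- y[1:(length(y) - n_test)]
test_y  <- y[(length(y) - n_test + 1):length(y)]

# Fit a fixed ARIMA(1,1,0) with base R; auto.arima() would choose the orders itself
fit_holdout <- arima(train_y, order = c(1, 1, 0), method = "ML")
pred <- as.numeric(predict(fit_holdout, n.ahead = n_test)$pred)

# Out-of-sample error measures
rmse <- sqrt(mean((test_y - pred)^2))
mae  <- mean(abs(test_y - pred))
print(c(RMSE = rmse, MAE = mae))

# MAPE divides by the actual value, so a single zero makes it infinite
actual <- c(197, 41, 0)
fitted_vals <- c(180, 50, 5)
mape <- mean(abs((actual - fitted_vals) / actual)) * 100
print(mape)
```

RMSE and MAE stay well defined when the series contains zeros, which makes them the more useful measures for this dataset.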

Classification

Build a classification model to predict whether any death occurs on a given day

# Drop the date column; keep the daily case counts and the death_exist label
classify<-daily_new_case[,-1]
head(classify)

##   perlis_daily kedah_daily pulau.pinang_daily perak_daily selangor_daily
## 1            1           5                  7           2             87
## 2            1           0                  0           0              5
## 3            0           0                  0           0              0
## 4            6          26                  8          16             52
## 5            0           0                  8           5             17
## 6            0           5                  7           5             31
##   negeri.sembilan_daily melaka_daily johor_daily pahang_daily terengganu_daily
## 1                    11            1          20            2                0
## 2                     8            5           2            0                0
## 3                     0            0           0            0                0
## 4                    23            8          30           17                4
## 5                     3            3          25            9                3
## 6                     0            1          11            1                3
##   kelantan_daily sabah_daily sarawak_daily wp.kuala.lumpur_daily
## 1              3          15             0                    40
## 2              0          11             6                     3
## 3              0           0             0                     0
## 4             15          31            15                    63
## 5              7          25             8                     7
## 6              5          21            20                     6
##   wp.putrajaya_daily wp.labuan_daily total_daily death_exist
## 1                  1               2         197          NO
## 2                  0               0          41          NO
## 3                  0               0           0          NO
## 4                  0               2         316          NO
## 5                  0               0         120         YES
## 6                  0               1         117          NO

Set up for classification

classify$death_exist<-as.factor(classify$death_exist)
str(classify)

## 'data.frame':    290 obs. of  18 variables:
##  $ perlis_daily         : num  1 1 0 6 0 0 1 0 0 0 ...
##  $ kedah_daily          : num  5 0 0 26 0 5 4 1 6 5 ...
##  $ pulau.pinang_daily   : num  7 0 0 8 8 7 2 5 13 8 ...
##  $ perak_daily          : num  2 0 0 16 5 5 7 10 10 11 ...
##  $ selangor_daily       : num  87 5 0 52 17 31 31 40 29 17 ...
##  $ negeri.sembilan_daily: num  11 8 0 23 3 0 11 9 5 8 ...
##  $ melaka_daily         : num  1 5 0 8 3 1 2 2 0 1 ...
##  $ johor_daily          : num  20 2 0 30 25 11 13 13 15 16 ...
##  $ pahang_daily         : num  2 0 0 17 9 1 3 4 1 3 ...
##  $ terengganu_daily     : num  0 0 0 4 3 3 1 9 7 5 ...
##  $ kelantan_daily       : num  3 0 0 15 7 5 14 7 10 2 ...
##  $ sabah_daily          : num  15 11 0 31 25 21 9 7 17 22 ...
##  $ sarawak_daily        : num  0 6 0 15 8 20 2 7 10 8 ...
##  $ wp.kuala.lumpur_daily: num  40 3 0 63 7 6 4 16 27 17 ...
##  $ wp.putrajaya_daily   : num  1 0 0 0 0 0 5 0 3 0 ...
##  $ wp.labuan_daily      : num  2 0 0 2 0 1 0 0 0 0 ...
##  $ total_daily          : num  197 41 0 316 120 117 109 130 153 123 ...
##  $ death_exist          : Factor w/ 2 levels "NO","YES": 1 1 1 1 2 1 1 1 2 2 ...

library(caTools)  # provides sample.split()
set.seed(1234)
# Split the data into training and test sets (70:30), preserving the class ratio
spl <- sample.split(classify$death_exist, SplitRatio = 0.7)
train <- subset(classify, spl == TRUE)
test <- subset(classify, spl == FALSE)
print(dim(train)); print(dim(test))

## [1] 203  18

## [1] 87 18
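
The split above keeps the share of "YES" and "NO" days roughly equal in both sets. The idea can be sketched in base R as follows; stratified_split is a hypothetical helper written for illustration, not the caTools implementation, and the labels are made up.

```r
set.seed(1234)
# Toy labels: 10 "NO" days and 10 "YES" days (made-up data)
label <- factor(rep(c("NO", "YES"), each = 10))

# Hypothetical helper: flag about `ratio` of the rows within each class as training rows
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    n_take <- round(ratio * length(idx))
    chosen <- idx[sample.int(length(idx), n_take)]
    in_train[chosen] <- TRUE
  }
  in_train
}

spl_demo <- stratified_split(label, 0.7)
print(table(label, spl_demo))
```

Sampling within each class separately is what guarantees that the training and test sets have the same class balance as the full data.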

Build the logistic regression model

model= glm(death_exist ~ . , family="binomial", data = train, maxit = 100)
summary(model)

## 
## Call:
## glm(formula = death_exist ~ ., family = "binomial", data = train, 
##     maxit = 100)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.06985  -0.61099   0.00006   0.40964   1.97652  
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.7177481  0.2892019  -5.940 2.86e-09 ***
## perlis_daily          -0.4139469  0.3215992  -1.287   0.1980    
## kedah_daily           -0.0144593  0.0104829  -1.379   0.1678    
## pulau.pinang_daily     0.0239075  0.0346198   0.691   0.4898    
## perak_daily            0.0286444  0.0349355   0.820   0.4123    
## selangor_daily         0.0011243  0.0030498   0.369   0.7124    
## negeri.sembilan_daily -0.0025468  0.0102248  -0.249   0.8033    
## melaka_daily          -0.0589361  0.0257683  -2.287   0.0222 *  
## johor_daily            0.0376186  0.0359527   1.046   0.2954    
## pahang_daily           0.1328124  0.0690213   1.924   0.0543 .  
## terengganu_daily       0.1452480  0.1431703   1.015   0.3103    
## kelantan_daily        -0.1248346  0.1093079  -1.142   0.2534    
## sabah_daily            0.0059690  0.0031291   1.908   0.0564 .  
## sarawak_daily          0.1498877  0.0516346   2.903   0.0037 ** 
## wp.kuala.lumpur_daily -0.0006991  0.0061294  -0.114   0.9092    
## wp.putrajaya_daily     0.1597031  0.1434928   1.113   0.2657    
## wp.labuan_daily        0.0248882  0.0738479   0.337   0.7361    
## total_daily                   NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 281.41  on 202  degrees of freedom
## Residual deviance: 152.86  on 186  degrees of freedom
## AIC: 186.86
## 
## Number of Fisher Scoring iterations: 8
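
The NA row for total_daily and the note "1 not defined because of singularities" are what glm() reports when one predictor is an exact linear combination of the others. Here total_daily appears to be the row-wise sum of the 16 state and territory columns (the first rows shown earlier add up exactly). A minimal sketch of the same effect on made-up data, with all variable names assumed for illustration:

```r
set.seed(1)
n <- 200
# Made-up daily counts for two regions and their sum
x1 <- rpois(n, 5)
x2 <- rpois(n, 3)
total <- x1 + x2  # exact linear combination of x1 and x2

# Made-up binary outcome that depends on x1 and x2
y <- rbinom(n, 1, plogis(-1 + 0.3 * x1 - 0.2 * x2))

# total carries no extra information, so glm() cannot estimate it and reports NA
fit_all <- glm(y ~ x1 + x2 + total, family = "binomial")
print(coef(fit_all))

# Leaving the redundant column out gives an equivalent fit without the NA
fit_reduced <- glm(y ~ x1 + x2, family = "binomial")
print(c(deviance(fit_all), deviance(fit_reduced)))
```

Applied to the model above, this would mean a formula such as death_exist ~ . - total_daily, which should also avoid the rank-deficient warning at prediction time.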

Baseline accuracy

prop.table(table(train$death_exist))

## 
##        NO       YES 
## 0.4975369 0.5024631

# The majority class (YES) accounts for about 50% of the training set, so the baseline accuracy is about 50%
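
The baseline is the accuracy obtained by always predicting the most frequent class. A minimal sketch of that calculation, using label counts reconstructed from the training proportions (101 "NO" and 102 "YES" out of 203 rows); the variable names are assumptions for illustration.

```r
# Training labels reconstructed from the counts: 101 "NO" days and 102 "YES" days
labels <- factor(c(rep("NO", 101), rep("YES", 102)))

# Class proportions
props <- prop.table(table(labels))

# Baseline accuracy = share of the most frequent class
baseline <- max(props)
majority <- names(which.max(props))
print(majority)
print(round(baseline, 4))
```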

Prediction on training data

# Without newdata, predict() returns the fitted probabilities for the training set
predictTrain = predict(model, type = "response")
# Confusion matrix at a 0.5 threshold
table(train$death_exist, predictTrain >= 0.5)

##      
##       FALSE TRUE
##   NO     91   10
##   YES    26   76

Accuracy of training data

(91+76)/nrow(train)

## [1] 0.8226601

Predictions on the test data

predictTest = predict(model, newdata = test, type = "response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

# Confusion Matrix
table(test$death_exist,predictTest>=0.5)

##      
##       FALSE TRUE
##   NO     39    4
##   YES    15   29

Accuracy of testing data

(39+29)/nrow(test)

## [1] 0.7816092
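
Accuracy alone hides how the errors are distributed between the two classes. A minimal sketch that recomputes accuracy from the test confusion matrix above and adds sensitivity and specificity; the matrix is typed in by hand from the printed output, and the variable names are assumptions for illustration.

```r
# Test-set confusion matrix copied from the output above (filled column by column)
cm <- matrix(c(39, 15, 4, 29), nrow = 2,
             dimnames = list(actual = c("NO", "YES"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)               # (39 + 29) / 87
sensitivity <- cm["YES", "TRUE"] / sum(cm["YES", ])  # death days correctly flagged
specificity <- cm["NO", "FALSE"] / sum(cm["NO", ])   # no-death days correctly flagged
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

On these numbers the model identifies no-death days more reliably (39 of 43) than death days (29 of 44).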

Overall, the logistic regression model beats the baseline accuracy of about 50% by a wide margin, reaching about 82% accuracy on the training set and about 78% on the test set.

COVID-19 cases trend among Malaysia states and federal territories

Lim Kevin 17140821 and Leow Yeong Chyi S2022330

1/3/2021
