Hello, everyone! We are going to present our project, a study of COVID-19 cases across the Malaysian states.
COVID-19 case trends among Malaysian states and federal territories
Recent daily news reports show a large number of confirmed COVID-19 cases in Malaysia, and COVID-19 infection can cause serious harm. In this study, we therefore examine the trend of COVID-19 cases across the different states in Malaysia. We then look at the trend going into 2021.
https://github.com/ynshung/covid-19-malaysia/blob/master/covid-19-my-states-cases.csv
https://github.com/ynshung/covid-19-malaysia/blob/master/covid-19-malaysia.csv
Our first dataset contains the number of confirmed COVID-19 cases in the 13 Malaysian states and 3 federal territories.
There are 290 rows of data, from 13 March 2020 to 27 December 2020. The columns hold a date and the cumulative number of cases for each state or territory. The data need cleaning because the analysis requires daily confirmed cases in each state rather than cumulative totals. Empty rows, which indicate no confirmed cases on that day, also need to be handled. In addition, 3 rows contain ‘-’ to denote 0 cases on that day. Data cleaning is carried out before the analysis starts.
Our second dataset contains information on deaths due to COVID-19 infection in Malaysia as a whole.
There are 339 rows of data, from 24 January 2020 to 27 December 2020. The columns give the cumulative number of COVID-19 cases in Malaysia, discharged cases, deaths, and the number of patients in ICU. We are interested in the death data for the classification question studied later. We therefore filter this dataset to the same date range as the first dataset, which starts on 13 March 2020, select only the death column, and append it to the first dataset.
Some details on this project
Research questions:
Research Objectives/Goals:
Answer to be found from this dataset:
First, we make sure the required packages are all installed, then load them using library().
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
library(TTR)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
library(caTools)
library(klaR)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
We will then check on the directory to ensure the path is correct.
getwd()
## [1] "C:/Users/KEVIN LIM/Downloads"
The dataset obtained from the data source is in CSV format. We use the read.csv() function to read the data into R.
data<-read.csv("new dataset2.csv")
To get a quick view of the data, we use the head() function.
head(data)
## date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020 1 5 7 2 87 11 1
## 2 14/03/2020 2 5 7 2 92 19 6
## 3 15/03/2020 NA NA NA NA NA NA NA
## 4 16/03/2020 8 31 15 18 144 42 14
## 5 17/03/2020 8 31 23 23 161 45 17
## 6 18/03/2020 8 36 30 28 192 45 18
## johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1 20 2 NA 3 15 NA 40 1
## 2 22 2 NA 3 26 6 43 1
## 3 NA NA NA NA NA NA NA
## 4 52 19 4 18 57 21 106 -
## 5 77 28 7 25 82 29 113 -
## 6 88 29 10 30 103 49 119 -
## wp.labuan
## 1 2
## 2 2
## 3 NA
## 4 4
## 5 4
## 6 5
We also use the functions str() and summary() to get an overview before starting to clean the data.
#To check whether our dataset has any missing values, we use str() and summary() for a quick glance.
data1<-data
dim(data1)
## [1] 290 17
str(data1)
## 'data.frame': 290 obs. of 17 variables:
## $ date : chr "13/03/2020" "14/03/2020" "15/03/2020" "16/03/2020" ...
## $ perlis : int 1 2 NA 8 8 8 9 9 9 9 ...
## $ kedah : int 5 5 NA 31 31 36 40 41 47 52 ...
## $ pulau.pinang : int 7 7 NA 15 23 30 32 37 50 58 ...
## $ perak : int 2 2 NA 18 23 28 35 45 55 66 ...
## $ selangor : int 87 92 NA 144 161 192 223 263 292 309 ...
## $ negeri.sembilan: int 11 19 NA 42 45 45 56 65 70 78 ...
## $ melaka : int 1 6 NA 14 17 18 20 22 22 23 ...
## $ johor : int 20 22 NA 52 77 88 101 114 129 145 ...
## $ pahang : int 2 2 NA 19 28 29 32 36 37 40 ...
## $ terengganu : int NA NA NA 4 7 10 11 20 27 32 ...
## $ kelantan : int 3 3 NA 18 25 30 44 51 61 63 ...
## $ sabah : int 15 26 NA 57 82 103 112 119 136 158 ...
## $ sarawak : int NA 6 NA 21 29 49 51 58 68 76 ...
## $ wp.kuala.lumpur: int 40 43 NA 106 113 119 123 139 166 183 ...
## $ wp.putrajaya : chr "1" "1" "" "-" ...
## $ wp.labuan : int 2 2 NA 4 4 5 5 5 5 5 ...
summary(data1)
## date perlis kedah pulau.pinang
## Length:290 Min. : 1.00 Min. : 5.0 Min. : 7.0
## Class :character 1st Qu.:18.00 1st Qu.: 96.0 1st Qu.: 121.0
## Mode :character Median :20.00 Median : 125.0 Median : 121.0
## Mean :26.93 Mean : 774.4 Mean : 549.1
## 3rd Qu.:38.00 3rd Qu.:1890.0 3rd Qu.: 402.0
## Max. :46.00 Max. :2957.0 Max. :3201.0
## NA's :1 NA's :1 NA's :1
## perak selangor negeri.sembilan melaka
## Min. : 2.0 Min. : 87 Min. : 11 Min. : 1.0
## 1st Qu.: 255.0 1st Qu.: 1829 1st Qu.: 792 1st Qu.:216.0
## Median : 264.0 Median : 2130 Median :1029 Median :258.0
## Mean : 538.4 Mean : 4459 Mean :1608 Mean :268.2
## 3rd Qu.: 383.0 3rd Qu.: 3014 3rd Qu.:1096 3rd Qu.:281.0
## Max. :3050.0 Max. :29272 Max. :7496 Max. :955.0
## NA's :1 NA's :1 NA's :1 NA's :1
## johor pahang terengganu kelantan
## Min. : 20.0 Min. : 2 Min. : 4.0 Min. : 3.0
## 1st Qu.: 671.0 1st Qu.: 344 1st Qu.:111.0 1st Qu.:156.0
## Median : 743.0 Median : 370 Median :114.0 Median :160.0
## Mean : 898.3 Mean : 368 Mean :137.9 Mean :191.1
## 3rd Qu.: 834.0 3rd Qu.: 387 3rd Qu.:158.5 3rd Qu.:173.0
## Max. :4632.0 Max. :1003 Max. :301.0 Max. :606.0
## NA's :1 NA's :1 NA's :3 NA's :1
## sabah sarawak wp.kuala.lumpur wp.putrajaya
## Min. : 15 Min. : 6.0 Min. : 40 Length:290
## 1st Qu.: 343 1st Qu.: 549.0 1st Qu.: 1807 Class :character
## Median : 402 Median : 678.0 Median : 2491 Mode :character
## Mean : 6534 Mean : 662.1 Mean : 2893
## 3rd Qu.: 6286 3rd Qu.: 760.0 3rd Qu.: 2824
## Max. :36074 Max. :1108.0 Max. :12494
## NA's :1 NA's :2 NA's :1
## wp.labuan
## Min. : 2.0
## 1st Qu.: 16.0
## Median : 20.0
## Mean : 278.8
## 3rd Qu.: 111.0
## Max. :1636.0
## NA's :1
From the structure check above, we notice that one column, ‘wp.putrajaya’, is stored as character rather than numeric, so we need to convert it. The ‘-’ entries cannot be parsed as numbers and become NA, which triggers the coercion warning below.
data2<-transform(data1, wp.putrajaya=as.numeric(wp.putrajaya))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(data2)
## date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020 1 5 7 2 87 11 1
## 2 14/03/2020 2 5 7 2 92 19 6
## 3 15/03/2020 NA NA NA NA NA NA NA
## 4 16/03/2020 8 31 15 18 144 42 14
## 5 17/03/2020 8 31 23 23 161 45 17
## 6 18/03/2020 8 36 30 28 192 45 18
## johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1 20 2 NA 3 15 NA 40 1
## 2 22 2 NA 3 26 6 43 1
## 3 NA NA NA NA NA NA NA NA
## 4 52 19 4 18 57 21 106 NA
## 5 77 28 7 25 82 29 113 NA
## 6 88 29 10 30 103 49 119 NA
## wp.labuan
## 1 2
## 2 2
## 3 NA
## 4 4
## 5 4
## 6 5
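The coercion warning above comes from the ‘-’ placeholders. One way to avoid it, sketched below under the assumption that the placeholders are exactly "-" and the empty string, is to declare them as missing when the file is read, so the column is numeric from the start. The inline CSV text and the name `toy` are made up for illustration.

```r
# Toy CSV mimicking the wp.putrajaya column (values are illustrative only)
csv_text <- "date,wp.putrajaya
13/03/2020,1
14/03/2020,1
15/03/2020,
16/03/2020,-"

# Treat "-" and empty cells as NA while reading, so no later coercion is needed
toy <- read.csv(text = csv_text, na.strings = c("", "-", "NA"))

str(toy$wp.putrajaya)   # integer column with NA for the placeholder rows
```

With this approach the later fill step still applies, because the placeholder rows are NA either way.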
We then check whether any position holds an NA or missing value.
rowSums(is.na(data2))==0
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [97] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [109] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [145] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [157] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [169] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [193] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [205] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [217] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [229] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [253] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [265] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [277] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [289] TRUE TRUE
which(is.na(data2), arr.ind=TRUE)
## row col
## [1,] 3 2
## [2,] 3 3
## [3,] 3 4
## [4,] 3 5
## [5,] 3 6
## [6,] 3 7
## [7,] 3 8
## [8,] 3 9
## [9,] 3 10
## [10,] 1 11
## [11,] 2 11
## [12,] 3 11
## [13,] 3 12
## [14,] 3 13
## [15,] 1 14
## [16,] 3 14
## [17,] 3 15
## [18,] 3 16
## [19,] 4 16
## [20,] 5 16
## [21,] 6 16
## [22,] 3 17
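The listings above are long, and a per-column count of missing values is often easier to read. The following is a small base-R sketch on a made-up data frame (`toy` is hypothetical, not the project data).

```r
toy <- data.frame(
  perlis     = c(1, 2, NA, 8),
  terengganu = c(NA, NA, NA, 4),
  sarawak    = c(NA, 6, NA, 21)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Row numbers that contain at least one missing value
rows_with_na <- unname(which(rowSums(is.na(toy)) > 0))
print(rows_with_na)
```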
Rows with empty entries had no increase in the number of COVID-19 cases on that day. The increment is therefore 0, and since the data are cumulative, we use fill() to carry the value from the previous row forward into the rows with missing entries.
data3<-data2%>%
fill(c(perlis,kedah, pulau.pinang, perak, selangor, negeri.sembilan, melaka, johor, pahang, terengganu, kelantan, sabah, sarawak, wp.kuala.lumpur, wp.putrajaya, wp.labuan), .direction = "down")
head(data3)
## date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020 1 5 7 2 87 11 1
## 2 14/03/2020 2 5 7 2 92 19 6
## 3 15/03/2020 2 5 7 2 92 19 6
## 4 16/03/2020 8 31 15 18 144 42 14
## 5 17/03/2020 8 31 23 23 161 45 17
## 6 18/03/2020 8 36 30 28 192 45 18
## johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1 20 2 NA 3 15 NA 40 1
## 2 22 2 NA 3 26 6 43 1
## 3 22 2 NA 3 26 6 43 1
## 4 52 19 4 18 57 21 106 1
## 5 77 28 7 25 82 29 113 1
## 6 88 29 10 30 103 49 119 1
## wp.labuan
## 1 2
## 2 2
## 3 2
## 4 4
## 5 4
## 6 5
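To make the carry-forward step concrete, here is a minimal base-R sketch of what a downward fill does to a single cumulative column. The helper `fill_down` is a hypothetical name written for this illustration; it is not the tidyr implementation.

```r
fill_down <- function(x) {
  # Count of non-missing values seen so far at each position (0 if none yet)
  last_seen <- cumsum(!is.na(x))
  # Prepend NA so that positions with no earlier value stay missing
  c(NA, x[!is.na(x)])[last_seen + 1]
}

sarawak <- c(NA, 6, NA, 21, 29)
filled  <- fill_down(sarawak)
print(filled)
```

The leading NA is left untouched because there is no earlier value to carry forward, which is why the first rows of Terengganu and Sarawak still need a separate step.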
As Terengganu and Sarawak have no value on the first days of the data, we fill in 0 for the respective empty entries.
which(is.na(data3), arr.ind=TRUE)
## row col
## [1,] 1 11
## [2,] 2 11
## [3,] 3 11
## [4,] 1 14
data3[is.na(data3)]<-0
head(data3)
## date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020 1 5 7 2 87 11 1
## 2 14/03/2020 2 5 7 2 92 19 6
## 3 15/03/2020 2 5 7 2 92 19 6
## 4 16/03/2020 8 31 15 18 144 42 14
## 5 17/03/2020 8 31 23 23 161 45 17
## 6 18/03/2020 8 36 30 28 192 45 18
## johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1 20 2 0 3 15 0 40 1
## 2 22 2 0 3 26 6 43 1
## 3 22 2 0 3 26 6 43 1
## 4 52 19 4 18 57 21 106 1
## 5 77 28 7 25 82 29 113 1
## 6 88 29 10 30 103 49 119 1
## wp.labuan
## 1 2
## 2 2
## 3 2
## 4 4
## 5 4
## 6 5
To confirm that there are no other missing values, we use the following functions to check again.
which(is.na(data3), arr.ind=TRUE)
## row col
data3[!complete.cases(data3),]
## [1] date perlis kedah pulau.pinang
## [5] perak selangor negeri.sembilan melaka
## [9] johor pahang terengganu kelantan
## [13] sabah sarawak wp.kuala.lumpur wp.putrajaya
## [17] wp.labuan
## <0 rows> (or 0-length row.names)
As the data are cumulative from one day to the next, we use the function below to compute the number of newly confirmed COVID-19 cases per day. A new column is created for each state and federal territory. diff() takes the difference between the current row and the previous row, which is the number of new cases on that day.
data3$perlis_daily <- ave(data3$perlis, FUN = function(x) c(0, diff(x)))
data3$kedah_daily <- ave(data3$kedah, FUN = function(x) c(0, diff(x)))
data3$pulau.pinang_daily <- ave(data3$pulau.pinang, FUN = function(x) c(0, diff(x)))
data3$perak_daily <- ave(data3$perak, FUN = function(x) c(0, diff(x)))
data3$selangor_daily <- ave(data3$selangor, FUN = function(x) c(0, diff(x)))
data3$negeri.sembilan_daily <- ave(data3$negeri.sembilan, FUN = function(x) c(0, diff(x)))
data3$melaka_daily <- ave(data3$melaka, FUN = function(x) c(0, diff(x)))
data3$johor_daily <- ave(data3$johor, FUN = function(x) c(0, diff(x)))
data3$pahang_daily <- ave(data3$pahang, FUN = function(x) c(0, diff(x)))
data3$terengganu_daily <- ave(data3$terengganu, FUN = function(x) c(0, diff(x)))
data3$kelantan_daily <- ave(data3$kelantan, FUN = function(x) c(0, diff(x)))
data3$sabah_daily <- ave(data3$sabah, FUN = function(x) c(0, diff(x)))
data3$sarawak_daily <- ave(data3$sarawak, FUN = function(x) c(0, diff(x)))
data3$wp.kuala.lumpur_daily <- ave(data3$wp.kuala.lumpur, FUN = function(x) c(0, diff(x)))
data3$wp.putrajaya_daily <- ave(data3$wp.putrajaya, FUN = function(x) c(0, diff(x)))
data3$wp.labuan_daily <- ave(data3$wp.labuan, FUN = function(x) c(0, diff(x)))
head(data3)
## date perlis kedah pulau.pinang perak selangor negeri.sembilan melaka
## 1 13/03/2020 1 5 7 2 87 11 1
## 2 14/03/2020 2 5 7 2 92 19 6
## 3 15/03/2020 2 5 7 2 92 19 6
## 4 16/03/2020 8 31 15 18 144 42 14
## 5 17/03/2020 8 31 23 23 161 45 17
## 6 18/03/2020 8 36 30 28 192 45 18
## johor pahang terengganu kelantan sabah sarawak wp.kuala.lumpur wp.putrajaya
## 1 20 2 0 3 15 0 40 1
## 2 22 2 0 3 26 6 43 1
## 3 22 2 0 3 26 6 43 1
## 4 52 19 4 18 57 21 106 1
## 5 77 28 7 25 82 29 113 1
## 6 88 29 10 30 103 49 119 1
## wp.labuan perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 2 0 0 0 0
## 2 2 1 0 0 0
## 3 2 0 0 0 0
## 4 4 6 26 8 16
## 5 4 0 0 8 5
## 6 5 0 5 7 5
## selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1 0 0 0 0 0
## 2 5 8 5 2 0
## 3 0 0 0 0 0
## 4 52 23 8 30 17
## 5 17 3 3 25 9
## 6 31 0 1 11 1
## terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1 0 0 0 0
## 2 0 0 11 6
## 3 0 0 0 0
## 4 4 15 31 15
## 5 3 7 25 8
## 6 3 5 21 20
## wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily
## 1 0 0 0
## 2 3 0 0
## 3 0 0 0
## 4 63 0 2
## 5 7 0 0
## 6 6 0 1
The previous step takes differences of the cumulative cases from one day to the next to get the daily increment. The very first row, however, has no previous row to subtract, so we replace the first row of the newly computed columns with the first row of the cumulative data.
data3[1,18:33]<-data3[1,2:17]
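The sixteen assignments and the first-row correction can also be written as a single pass over the columns. This is a sketch on a small made-up data frame: `cum` and `daily` are illustrative names, and `c(x[1], diff(x))` keeps the first cumulative value as the first day's count.

```r
cum <- data.frame(
  perlis   = c(1, 2, 2, 8),
  selangor = c(87, 92, 92, 144)
)

# First value stays as-is; later values are day-to-day differences
daily <- as.data.frame(lapply(cum, function(x) c(x[1], diff(x))))
names(daily) <- paste0(names(cum), "_daily")

# Row-wise total across all daily columns
daily$total_daily <- rowSums(daily)
print(daily)
```

A quick sanity check for this kind of conversion is that the cumulative sum of each daily column reproduces the original cumulative column.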
Get the total daily new cases from all the states in Malaysia
total_daily<-data3[,18:33]
data3$total_daily<-rowSums(total_daily)
Generate a new data frame that includes the new cases for every state and the total
daily_new_case<-data3[,-c(2:17)]
head(daily_new_case)
## date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020 1 5 7 2
## 2 14/03/2020 1 0 0 0
## 3 15/03/2020 0 0 0 0
## 4 16/03/2020 6 26 8 16
## 5 17/03/2020 0 0 8 5
## 6 18/03/2020 0 5 7 5
## selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1 87 11 1 20 2
## 2 5 8 5 2 0
## 3 0 0 0 0 0
## 4 52 23 8 30 17
## 5 17 3 3 25 9
## 6 31 0 1 11 1
## terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1 0 3 15 0
## 2 0 0 11 6
## 3 0 0 0 0
## 4 4 15 31 15
## 5 3 7 25 8
## 6 3 5 21 20
## wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily
## 1 40 1 2 197
## 2 3 0 0 41
## 3 0 0 0 0
## 4 63 0 2 316
## 5 7 0 0 120
## 6 6 0 1 117
We read in the data on deaths due to COVID-19 in Malaysia
deathdata<-read.csv("deathdata.csv")
head(deathdata)
## date cases discharged death icu
## 1 24/01/2020 0 0 0 0
## 2 25/01/2020 3 0 0 0
## 3 26/01/2020 4 0 0 0
## 4 27/01/2020 4 0 0 0
## 5 28/01/2020 4 0 0 0
## 6 29/01/2020 7 0 0 0
Narrow down to the dates that cover the same period as our first dataset.
deathdata1<- deathdata[50:339,1:4]
head(deathdata1)
## date cases discharged death
## 50 13/03/2020 197 32 0
## 51 14/03/2020 238 35 0
## 52 15/03/2020 428 42 0
## 53 16/03/2020 553 42 0
## 54 17/03/2020 673 49 2
## 55 18/03/2020 790 60 2
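Selecting rows 50 to 339 by position works for this file, but it depends on the row order staying the same. As a hedged alternative, the same window could be selected by date. The sketch below uses a made-up data frame `toy_death`; the column names follow the output above.

```r
toy_death <- data.frame(
  date  = c("11/03/2020", "12/03/2020", "13/03/2020", "14/03/2020"),
  death = c(0, 0, 0, 0)
)

# Dates are written day/month/year, so the format must be given explicitly
toy_death$date_parsed <- as.Date(toy_death$date, format = "%d/%m/%Y")

# Keep only rows on or after the first date of the state-level dataset
start_date <- as.Date("2020-03-13")
in_window  <- toy_death[toy_death$date_parsed >= start_date, ]
print(in_window)
```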
We add the death column to our first dataset
data5<-daily_new_case
data5$deathcum<-deathdata1$death
head(data5)
## date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020 1 5 7 2
## 2 14/03/2020 1 0 0 0
## 3 15/03/2020 0 0 0 0
## 4 16/03/2020 6 26 8 16
## 5 17/03/2020 0 0 8 5
## 6 18/03/2020 0 5 7 5
## selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1 87 11 1 20 2
## 2 5 8 5 2 0
## 3 0 0 0 0 0
## 4 52 23 8 30 17
## 5 17 3 3 25 9
## 6 31 0 1 11 1
## terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1 0 3 15 0
## 2 0 0 11 6
## 3 0 0 0 0
## 4 4 15 31 15
## 5 3 7 25 8
## 6 3 5 21 20
## wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily deathcum
## 1 40 1 2 197 0
## 2 3 0 0 41 0
## 3 0 0 0 0 0
## 4 63 0 2 316 0
## 5 7 0 0 120 2
## 6 6 0 1 117 2
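Assigning the death column by position assumes that both data frames list exactly the same dates in the same order. A sketch of a more defensive join on the date column, using small made-up frames `cases` and `deaths` (the values mirror three rows of the output above):

```r
cases <- data.frame(
  date        = c("16/03/2020", "17/03/2020", "18/03/2020"),
  total_daily = c(316, 120, 117)
)
deaths <- data.frame(
  date     = c("18/03/2020", "16/03/2020", "17/03/2020"),  # deliberately shuffled
  deathcum = c(2, 0, 2)
)

# Match rows on the shared date column; all.x = TRUE keeps every row of 'cases'
combined <- merge(cases, deaths, by = "date", all.x = TRUE)
print(combined)
```

Because the rows are matched on the date, a missing or reordered day in either file shows up as an NA or a shifted row instead of silently misaligning the two series.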
Compute the daily deaths from the cumulative deaths in the data
data5$death_daily <- ave(data5$deathcum, FUN = function(x) c(0, diff(x)))
head(data5)
## date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020 1 5 7 2
## 2 14/03/2020 1 0 0 0
## 3 15/03/2020 0 0 0 0
## 4 16/03/2020 6 26 8 16
## 5 17/03/2020 0 0 8 5
## 6 18/03/2020 0 5 7 5
## selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1 87 11 1 20 2
## 2 5 8 5 2 0
## 3 0 0 0 0 0
## 4 52 23 8 30 17
## 5 17 3 3 25 9
## 6 31 0 1 11 1
## terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1 0 3 15 0
## 2 0 0 11 6
## 3 0 0 0 0
## 4 4 15 31 15
## 5 3 7 25 8
## 6 3 5 21 20
## wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily deathcum
## 1 40 1 2 197 0
## 2 3 0 0 41 0
## 3 0 0 0 0 0
## 4 63 0 2 316 0
## 5 7 0 0 120 2
## 6 6 0 1 117 2
## death_daily
## 1 0
## 2 0
## 3 0
## 4 0
## 5 2
## 6 0
Add a new column with YES if a death occurred on that day and NO otherwise
data5$death_exist <- ifelse(data5$death_daily >0, "YES", "NO")
daily_new_case<-data5
head(data5)
## date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020 1 5 7 2
## 2 14/03/2020 1 0 0 0
## 3 15/03/2020 0 0 0 0
## 4 16/03/2020 6 26 8 16
## 5 17/03/2020 0 0 8 5
## 6 18/03/2020 0 5 7 5
## selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1 87 11 1 20 2
## 2 5 8 5 2 0
## 3 0 0 0 0 0
## 4 52 23 8 30 17
## 5 17 3 3 25 9
## 6 31 0 1 11 1
## terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1 0 3 15 0
## 2 0 0 11 6
## 3 0 0 0 0
## 4 4 15 31 15
## 5 3 7 25 8
## 6 3 5 21 20
## wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily deathcum
## 1 40 1 2 197 0
## 2 3 0 0 41 0
## 3 0 0 0 0 0
## 4 63 0 2 316 0
## 5 7 0 0 120 2
## 6 6 0 1 117 2
## death_daily death_exist
## 1 0 NO
## 2 0 NO
## 3 0 NO
## 4 0 NO
## 5 2 YES
## 6 0 NO
Drop the columns deathcum and death_daily
daily_new_case<-daily_new_case[,-c(19:20)]
head(daily_new_case)
## date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020 1 5 7 2
## 2 14/03/2020 1 0 0 0
## 3 15/03/2020 0 0 0 0
## 4 16/03/2020 6 26 8 16
## 5 17/03/2020 0 0 8 5
## 6 18/03/2020 0 5 7 5
## selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1 87 11 1 20 2
## 2 5 8 5 2 0
## 3 0 0 0 0 0
## 4 52 23 8 30 17
## 5 17 3 3 25 9
## 6 31 0 1 11 1
## terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1 0 3 15 0
## 2 0 0 11 6
## 3 0 0 0 0
## 4 4 15 31 15
## 5 3 7 25 8
## 6 3 5 21 20
## wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily
## 1 40 1 2 197
## 2 3 0 0 41
## 3 0 0 0 0
## 4 63 0 2 316
## 5 7 0 0 120
## 6 6 0 1 117
## death_exist
## 1 NO
## 2 NO
## 3 NO
## 4 NO
## 5 YES
## 6 NO
Save the data into a CSV file named “daily case.csv”
write.csv(daily_new_case,"daily case.csv", row.names = TRUE)
head(daily_new_case)
## date perlis_daily kedah_daily pulau.pinang_daily perak_daily
## 1 13/03/2020 1 5 7 2
## 2 14/03/2020 1 0 0 0
## 3 15/03/2020 0 0 0 0
## 4 16/03/2020 6 26 8 16
## 5 17/03/2020 0 0 8 5
## 6 18/03/2020 0 5 7 5
## selangor_daily negeri.sembilan_daily melaka_daily johor_daily pahang_daily
## 1 87 11 1 20 2
## 2 5 8 5 2 0
## 3 0 0 0 0 0
## 4 52 23 8 30 17
## 5 17 3 3 25 9
## 6 31 0 1 11 1
## terengganu_daily kelantan_daily sabah_daily sarawak_daily
## 1 0 3 15 0
## 2 0 0 11 6
## 3 0 0 0 0
## 4 4 15 31 15
## 5 3 7 25 8
## 6 3 5 21 20
## wp.kuala.lumpur_daily wp.putrajaya_daily wp.labuan_daily total_daily
## 1 40 1 2 197
## 2 3 0 0 41
## 3 0 0 0 0
## 4 63 0 2 316
## 5 7 0 0 120
## 6 6 0 1 117
## death_exist
## 1 NO
## 2 NO
## 3 NO
## 4 NO
## 5 YES
## 6 NO
str(daily_new_case)
## 'data.frame': 290 obs. of 19 variables:
## $ date : chr "13/03/2020" "14/03/2020" "15/03/2020" "16/03/2020" ...
## $ perlis_daily : num 1 1 0 6 0 0 1 0 0 0 ...
## $ kedah_daily : num 5 0 0 26 0 5 4 1 6 5 ...
## $ pulau.pinang_daily : num 7 0 0 8 8 7 2 5 13 8 ...
## $ perak_daily : num 2 0 0 16 5 5 7 10 10 11 ...
## $ selangor_daily : num 87 5 0 52 17 31 31 40 29 17 ...
## $ negeri.sembilan_daily: num 11 8 0 23 3 0 11 9 5 8 ...
## $ melaka_daily : num 1 5 0 8 3 1 2 2 0 1 ...
## $ johor_daily : num 20 2 0 30 25 11 13 13 15 16 ...
## $ pahang_daily : num 2 0 0 17 9 1 3 4 1 3 ...
## $ terengganu_daily : num 0 0 0 4 3 3 1 9 7 5 ...
## $ kelantan_daily : num 3 0 0 15 7 5 14 7 10 2 ...
## $ sabah_daily : num 15 11 0 31 25 21 9 7 17 22 ...
## $ sarawak_daily : num 0 6 0 15 8 20 2 7 10 8 ...
## $ wp.kuala.lumpur_daily: num 40 3 0 63 7 6 4 16 27 17 ...
## $ wp.putrajaya_daily : num 1 0 0 0 0 0 5 0 3 0 ...
## $ wp.labuan_daily : num 2 0 0 2 0 1 0 0 0 0 ...
## $ total_daily : num 197 41 0 316 120 117 109 130 153 123 ...
## $ death_exist : chr "NO" "NO" "NO" "NO" ...
Run the program for question 1: predict the daily new COVID-19 cases for January and February 2021, using an ARIMA model since the data form a time series
#dates are written day/month/year, so as.Date() needs an explicit format
inds <- seq(as.Date("13/03/2020", format = "%d/%m/%Y"), as.Date("27/12/2020", format = "%d/%m/%Y"), by = "day")
#convert the data into time based series
myts <- ts(daily_new_case$total_daily,
start = c(2020, as.numeric(format(inds[1], "%j"))), frequency = 365)
#check the first date
as.numeric(format(inds[1], "%j"))
## [1] 73
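Because the dates here are written as day/month/year, the format string matters when turning the first date into a day of the year. Without a format, as.Date() tries year-first layouts and is likely to misread such a string. A short base-R check, using only the start date stated earlier (13 March 2020); the variable names are illustrative:

```r
# Parse the day/month/year string with an explicit format
first_day <- as.Date("13/03/2020", format = "%d/%m/%Y")
print(first_day)

# Day of the year, used as the start position of a daily time series
day_of_year <- as.numeric(format(first_day, "%j"))
print(day_of_year)
```

13 March 2020 falls on day 73 because 2020 is a leap year (31 + 29 + 13).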
Plot the number of daily cases vs date
plot(myts, main="Daily covid 19 cases in Malaysia",xlab="Date",ylab="Number of daily cases")
Apply the ARIMA model and forecast the number of daily cases for the next 60 days
fit <- auto.arima(myts)
fore <- forecast(fit, h = 60)
plot(fore, main="Daily covid 19 cases in Malaysia",xlab="Date",ylab="Number of daily cases")
Evaluation of the ARIMA model
summary(fore)
##
## Forecast method: ARIMA(2,1,3) with drift
##
## Model Information:
## Series: myts
## ARIMA(2,1,3) with drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 drift
## -0.5486 -0.8767 -0.1113 0.4066 -0.7917 5.0579
## s.e. 0.0348 0.0819 0.0469 0.0855 0.0409 2.1015
##
## sigma^2 estimated as 29327: log likelihood=-1895.01
## AIC=3804.01 AICc=3804.41 BIC=3829.68
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.3008319 169.1729 95.68176 -Inf Inf NaN -0.03735384
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2021.0082 1516.094 1296.623 1735.566 1180.442 1851.747
## 2021.0110 1544.956 1313.137 1776.774 1190.419 1899.492
## 2021.0137 1692.030 1454.684 1929.376 1329.041 2055.020
## 2021.0164 1598.313 1360.349 1836.278 1234.378 1962.248
## 2021.0192 1533.055 1288.481 1777.629 1159.011 1907.099
## 2021.0219 1663.280 1410.341 1916.219 1276.443 2050.117
## 2021.0247 1661.318 1407.084 1915.552 1272.501 2050.135
## 2021.0274 1560.497 1303.149 1817.844 1166.917 1954.076
## 2021.0301 1629.792 1364.075 1895.509 1223.413 2036.171
## 2021.0329 1692.432 1423.834 1961.030 1281.646 2103.218
## 2021.0356 1609.586 1339.206 1879.967 1196.075 2023.098
## 2021.0384 1612.386 1335.617 1889.155 1189.104 2035.668
## 2021.0411 1695.745 1414.331 1977.160 1265.359 2126.131
## 2021.0438 1659.828 1376.738 1942.919 1226.879 2092.778
## 2021.0466 1618.720 1331.418 1906.021 1179.330 2058.110
## 2021.0493 1685.025 1392.188 1977.863 1237.169 2132.882
## 2021.0521 1696.957 1401.857 1992.056 1245.641 2148.273
## 2021.0548 1644.550 1346.638 1942.462 1188.934 2100.167
## 2021.0575 1675.106 1371.949 1978.263 1211.467 2138.745
## 2021.0603 1716.554 1410.232 2022.876 1248.074 2185.034
## 2021.0630 1679.296 1370.722 1987.869 1207.373 2151.218
## 2021.0658 1675.666 1362.806 1988.525 1197.188 2154.143
## 2021.0685 1722.587 1405.821 2039.353 1238.135 2207.039
## 2021.0712 1712.296 1393.250 2031.342 1224.357 2200.235
## 2021.0740 1689.074 1366.706 2011.441 1196.054 2182.093
## 2021.0767 1723.101 1396.594 2049.609 1223.752 2222.451
## 2021.0795 1737.060 1407.910 2066.210 1233.668 2240.451
## 2021.0822 1711.838 1379.987 2043.689 1204.315 2219.361
## 2021.0849 1725.704 1389.979 2061.429 1212.257 2239.151
## 2021.0877 1752.475 1413.670 2091.281 1234.317 2270.633
## 2021.0904 1737.900 1396.630 2079.169 1215.973 2259.827
## 2021.0932 1734.693 1390.050 2079.336 1207.607 2261.779
## 2021.0959 1761.497 1413.493 2109.501 1229.271 2293.723
## 2021.0986 1761.871 1411.355 2112.386 1225.804 2297.938
## 2021.1014 1750.434 1397.010 2103.858 1209.919 2290.950
## 2021.1041 1768.647 1411.845 2125.449 1222.965 2314.329
## 2021.1068 1780.949 1421.446 2140.452 1231.136 2330.761
## 2021.1096 1770.500 1408.381 2132.620 1216.686 2324.315
## 2021.1123 1777.714 1412.411 2143.018 1219.030 2336.398
## 2021.1151 1795.183 1426.991 2163.376 1232.082 2358.285
## 2021.1178 1791.543 1420.839 2162.247 1224.600 2358.486
## 2021.1205 1790.492 1416.878 2164.106 1219.099 2361.885
## 2021.1233 1806.527 1429.940 2183.114 1230.587 2382.467
## 2021.1260 1810.918 1431.791 2190.046 1231.093 2390.744
## 2021.1288 1806.719 1424.920 2188.518 1222.808 2390.630
## 2021.1315 1817.439 1432.712 2202.166 1229.050 2405.829
## 2021.1342 1827.507 1440.156 2214.858 1235.105 2419.909
## 2021.1370 1824.852 1434.976 2214.728 1228.589 2421.116
## 2021.1397 1829.750 1437.079 2222.420 1229.212 2430.287
## 2021.1425 1841.657 1446.297 2237.016 1237.007 2446.307
## 2021.1452 1843.098 1445.267 2240.929 1234.668 2451.528
## 2021.1479 1844.136 1443.668 2244.603 1231.673 2456.598
## 2021.1507 1854.570 1451.407 2257.733 1237.985 2471.154
## 2021.1534 1860.203 1454.562 2265.845 1239.828 2480.578
## 2021.1562 1860.232 1452.084 2268.380 1236.024 2484.440
## 2021.1589 1867.544 1456.756 2278.332 1239.298 2495.790
## 2021.1616 1875.774 1462.483 2289.066 1243.699 2507.849
## 2021.1644 1877.116 1461.399 2292.833 1241.331 2512.901
## 2021.1671 1881.432 1463.166 2299.698 1241.749 2521.114
## 2021.1699 1890.155 1469.376 2310.934 1246.628 2533.681
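In the error measures above, MPE and MAPE are reported as -Inf and Inf. A plausible explanation is that percentage errors divide by the observed value, and the series contains days with a total of 0 (for example 15/03/2020). The sketch below computes the measures by hand on made-up numbers; `actual` and `fitted_vals` are hypothetical, not the model's output.

```r
actual      <- c(197, 41, 0, 316)
fitted_vals <- c(180, 60, 25, 300)

err  <- actual - fitted_vals
rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))
mape <- mean(abs(err / actual)) * 100   # the zero observation makes this Inf

print(c(RMSE = rmse, MAE = mae, MAPE = mape))

# Restricting the percentage error to non-zero observations keeps it finite
nonzero      <- actual != 0
mape_nonzero <- mean(abs(err[nonzero] / actual[nonzero])) * 100
print(mape_nonzero)
```

For a series with zero counts, scale-dependent measures such as RMSE and MAE are therefore the more informative numbers in the summary.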
Classification model to predict whether a death occurs on each day
#Filter out the wanted data for classification
classify<-daily_new_case[,-1]
head(classify)
## perlis_daily kedah_daily pulau.pinang_daily perak_daily selangor_daily
## 1 1 5 7 2 87
## 2 1 0 0 0 5
## 3 0 0 0 0 0
## 4 6 26 8 16 52
## 5 0 0 8 5 17
## 6 0 5 7 5 31
## negeri.sembilan_daily melaka_daily johor_daily pahang_daily terengganu_daily
## 1 11 1 20 2 0
## 2 8 5 2 0 0
## 3 0 0 0 0 0
## 4 23 8 30 17 4
## 5 3 3 25 9 3
## 6 0 1 11 1 3
## kelantan_daily sabah_daily sarawak_daily wp.kuala.lumpur_daily
## 1 3 15 0 40
## 2 0 11 6 3
## 3 0 0 0 0
## 4 15 31 15 63
## 5 7 25 8 7
## 6 5 21 20 6
## wp.putrajaya_daily wp.labuan_daily total_daily death_exist
## 1 1 2 197 NO
## 2 0 0 41 NO
## 3 0 0 0 NO
## 4 0 2 316 NO
## 5 0 0 120 YES
## 6 0 1 117 NO
Set up for classification
classify$death_exist<-as.factor(classify$death_exist)
str(classify)
## 'data.frame': 290 obs. of 18 variables:
## $ perlis_daily : num 1 1 0 6 0 0 1 0 0 0 ...
## $ kedah_daily : num 5 0 0 26 0 5 4 1 6 5 ...
## $ pulau.pinang_daily : num 7 0 0 8 8 7 2 5 13 8 ...
## $ perak_daily : num 2 0 0 16 5 5 7 10 10 11 ...
## $ selangor_daily : num 87 5 0 52 17 31 31 40 29 17 ...
## $ negeri.sembilan_daily: num 11 8 0 23 3 0 11 9 5 8 ...
## $ melaka_daily : num 1 5 0 8 3 1 2 2 0 1 ...
## $ johor_daily : num 20 2 0 30 25 11 13 13 15 16 ...
## $ pahang_daily : num 2 0 0 17 9 1 3 4 1 3 ...
## $ terengganu_daily : num 0 0 0 4 3 3 1 9 7 5 ...
## $ kelantan_daily : num 3 0 0 15 7 5 14 7 10 2 ...
## $ sabah_daily : num 15 11 0 31 25 21 9 7 17 22 ...
## $ sarawak_daily : num 0 6 0 15 8 20 2 7 10 8 ...
## $ wp.kuala.lumpur_daily: num 40 3 0 63 7 6 4 16 27 17 ...
## $ wp.putrajaya_daily : num 1 0 0 0 0 0 5 0 3 0 ...
## $ wp.labuan_daily : num 2 0 0 2 0 1 0 0 0 0 ...
## $ total_daily : num 197 41 0 316 120 117 109 130 153 123 ...
## $ death_exist : Factor w/ 2 levels "NO","YES": 1 1 1 1 2 1 1 1 2 2 ...
set.seed(1234)
# split data into train-test set in 70:30
spl=sample.split(classify$death_exist,SplitRatio = 0.7)
train=subset(classify,spl==TRUE)
test=subset(classify,spl==FALSE)
print(dim(train));print(dim(test))
## [1] 203 18
## [1] 87 18
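sample.split() from caTools splits on the outcome so that both subsets keep roughly the same YES/NO proportion. To illustrate the idea without the package, here is a base-R sketch of a stratified 70:30 split on a made-up outcome vector. `stratified_split` is a hypothetical helper, not the caTools algorithm.

```r
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (cls in unique(as.character(y))) {
    idx    <- which(y == cls)
    n_pick <- round(length(idx) * ratio)
    # Sample positions within this class and mark them as training rows
    in_train[idx[sample.int(length(idx), n_pick)]] <- TRUE
  }
  in_train
}

set.seed(1234)
y   <- factor(rep(c("NO", "YES"), times = c(50, 50)))
spl <- stratified_split(y)

print(table(y, spl))
```

Each class contributes 70% of its rows to the training set, so the class balance is the same in both subsets.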
Build the logistic regression model
model= glm(death_exist ~ . , family="binomial", data = train, maxit = 100)
summary(model)
##
## Call:
## glm(formula = death_exist ~ ., family = "binomial", data = train,
## maxit = 100)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.06985 -0.61099 0.00006 0.40964 1.97652
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7177481 0.2892019 -5.940 2.86e-09 ***
## perlis_daily -0.4139469 0.3215992 -1.287 0.1980
## kedah_daily -0.0144593 0.0104829 -1.379 0.1678
## pulau.pinang_daily 0.0239075 0.0346198 0.691 0.4898
## perak_daily 0.0286444 0.0349355 0.820 0.4123
## selangor_daily 0.0011243 0.0030498 0.369 0.7124
## negeri.sembilan_daily -0.0025468 0.0102248 -0.249 0.8033
## melaka_daily -0.0589361 0.0257683 -2.287 0.0222 *
## johor_daily 0.0376186 0.0359527 1.046 0.2954
## pahang_daily 0.1328124 0.0690213 1.924 0.0543 .
## terengganu_daily 0.1452480 0.1431703 1.015 0.3103
## kelantan_daily -0.1248346 0.1093079 -1.142 0.2534
## sabah_daily 0.0059690 0.0031291 1.908 0.0564 .
## sarawak_daily 0.1498877 0.0516346 2.903 0.0037 **
## wp.kuala.lumpur_daily -0.0006991 0.0061294 -0.114 0.9092
## wp.putrajaya_daily 0.1597031 0.1434928 1.113 0.2657
## wp.labuan_daily 0.0248882 0.0738479 0.337 0.7361
## total_daily NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 281.41 on 202 degrees of freedom
## Residual deviance: 152.86 on 186 degrees of freedom
## AIC: 186.86
##
## Number of Fisher Scoring iterations: 8
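The NA row for total_daily ("1 not defined because of singularities") is expected: total_daily is the row sum of the sixteen state columns, so it carries no information beyond them. A small sketch of the same effect with simulated data (all names and numbers here are made up):

```r
set.seed(1)
x1    <- rnorm(50)
x2    <- rnorm(50)
total <- x1 + x2                      # exact linear combination of x1 and x2
y     <- rbinom(50, size = 1, prob = 0.5)

toy_fit <- glm(y ~ x1 + x2 + total, family = "binomial")
print(coef(toy_fit))                  # the coefficient for 'total' is NA
```

Dropping total_daily from the predictors before fitting would likely remove both the NA row and the rank-deficiency warning that appears at prediction time.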
Baseline accuracy
prop.table(table(train$death_exist))
##
## NO YES
## 0.4975369 0.5024631
#The majority class of the target variable has a proportion of 0.50, so the baseline accuracy is 50 percent
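The baseline can also be read off programmatically instead of by eye. A minimal sketch on a made-up outcome vector (`outcome` is hypothetical): the baseline accuracy is the share of the most frequent class, which is what a model that always predicts that class would achieve.

```r
outcome <- factor(c("NO", "YES", "YES", "NO", "YES"))

class_share       <- prop.table(table(outcome))
majority_class    <- names(which.max(class_share))
baseline_accuracy <- max(class_share)   # accuracy of always predicting the majority class

print(majority_class)
print(baseline_accuracy)
```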
Prediction on training data
# predict() for a glm has no 'data' argument; without newdata it returns the fitted values for the training data
predictTrain = predict(model, type = "response")
# Confusion Matrix
table(train$death_exist, predictTrain >= 0.5)
##
## FALSE TRUE
## NO 91 10
## YES 26 76
Accuracy of training data
(91+76)/nrow(train)
## [1] 0.8226601
Predictions on the test data
predictTest = predict(model, newdata = test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Confusion Matrix
table(test$death_exist,predictTest>=0.5)
##
## FALSE TRUE
## NO 39 4
## YES 15 29
Accuracy of testing data
(39+29)/nrow(test)
## [1] 0.7816092
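Accuracy alone hides how the errors are distributed between the two classes. As a sketch, the test-set confusion matrix reported above can be re-entered by hand to derive sensitivity and specificity as well. The object name `cm` and the choice of "YES" as the positive class are assumptions of this example.

```r
# Test-set confusion matrix as reported above (rows = actual, columns = predicted)
cm <- matrix(c(39, 15, 4, 29), nrow = 2,
             dimnames = list(actual = c("NO", "YES"), predicted = c("NO", "YES")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["YES", "YES"] / sum(cm["YES", ])   # share of death days correctly flagged
specificity <- cm["NO", "NO"] / sum(cm["NO", ])      # share of no-death days correctly flagged

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3))
```

On these numbers the model is better at recognising days without deaths (39 of 43) than days with deaths (29 of 44).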
Overall, the logistic regression model beats the 50% baseline accuracy by a wide margin on both the training set (82.3%) and the test set (78.2%).