Goal:

The goal is to predict the cumulative number of confirmed cases and resulting fatalities for upcoming dates within the United States.

COVID-19 Background:

COVID-19 is caused by a novel coronavirus. The Coronaviridae family includes a group of large, single-stranded, positive-sense RNA viruses isolated from multiple species and known to cause the common cold and diarrheal diseases in humans (Dey, Rahman, Siddiqi, & Howlader, 2020). In the United States, different parts of the country are seeing different levels of COVID-19 activity. Currently, the United States nationally is in the acceleration phase of the pandemic (National Center for Immunization and Respiratory Diseases, 2020).

Introduction and Dataset Description:

We are working with two COVID-19 datasets collected by Johns Hopkins University.

Dataset 1 - Johns Hopkins COVID-19 Case Tracker
  • The data are available at the county level in the United States and represent the number of confirmed cases and the number of deaths reported by each state.
  • There are 13 variables and 2886 observations, and this dataset is paired with population data for both urban and rural counties.

Dataset 2 - Timeseries COVID-19 Cases and Deaths
  • This dataset contains daily time series data for the United States, including confirmed cases and deaths reported at the county level.
  • There are 15 variables and 338208 observations in the dataset; we will use only 5 variables for our analysis, as the others are not required for predicting future cases. The 5 key variables we use are: ‘date’, ‘state’, ‘cumulative_cases’, ‘cumulative_deaths’, ‘location_name’.

Note: We have data from Jan-22-2020 through May-4-2020, but the cumulative cases and deaths are zero until Mar-23-2020.

Risk assessment:

  1. The dataset we are using was last updated on 2020-04-29 and is not refreshed dynamically, so it may not accurately reflect the present situation.
  2. There is only a limited amount of historical data available for making forecasts.

Analysis Methods:

  1. Exploratory Data Analysis (EDA)
  2. Linear Regression Model
  3. Time series analysis
  4. SIR Model using Covid19.Analytics package

Loading the required packages

# Install any packages that are missing, then load them all
check.packages <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

packages<-c("stats4", "dplyr", "stringr", "tidyr", "ggplot2", "reshape2", "rpart", "corrplot", "geosphere", "leaflet", "leaflet.extras", "maps", "ggpubr", "psych", "mice", "VIM", "tidyverse", "lubridate", "forecast", "tseries", "xts", "caret", "caTools","ggfortify", "astsa", "changepoint", "strucchange", "magrittr","TSstudio", "plotly", "RColorBrewer", "prophet", "lattice", "dygraphs", "viridis","usmap")
check.packages(packages)

Loading the covid-19 county level data

wd <- getwd()   # use the current working directory for the data file
setwd(wd)
df <- read.csv("county_level_confirmed_cases.csv")

Data Manipulation

#Assigning shorter names to the variables
colnames(df)
##  [1] "last_update"          "state"                "county_name"         
##  [4] "county_name_long"     "fips_code"            "lat"                 
##  [7] "lon"                  "NCHS_urbanization"    "total_population"    
## [10] "confirmed"            "confirmed_per_100000" "deaths"              
## [13] "deaths_per_100000"
names(df)[names(df) == "deaths_per_100000"] <- "DeathsPer"
names(df)[names(df) == "confirmed_per_100000"] <- "ConfirmedPer"
names(df)[names(df) == "total_population"] <- "Population"
names(df)[names(df) == "county_name_long"] <- "FullCountyName"
cat('Data Dimensions are: ', dim(df))
## Data Dimensions are:  2886 13
summary(df)
##               last_update              state          county_name  
##  2020-04-29 01:32:25:2886   Texas         : 209   Unassigned:  49  
##                             Georgia       : 159   Washington:  29  
##                             Virginia      : 134   Jefferson :  26  
##                             Kentucky      : 113   Franklin  :  24  
##                             Missouri      : 100   Jackson   :  22  
##                             North Carolina:  98   Lincoln   :  22  
##                             (Other)       :2073   (Other)   :2714  
##                        FullCountyName   fips_code          lat       
##  unassigned to a single county:   4   Min.   : 1001   Min.   :19.60  
##  Abbeville, South Carolina, US:   1   1st Qu.:18140   1st Qu.:34.53  
##  Acadia, Louisiana, US        :   1   Median :29153   Median :38.09  
##  Accomack, Virginia, US       :   1   Mean   :31335   Mean   :38.17  
##  Ada, Idaho, US               :   1   3rd Qu.:46049   3rd Qu.:41.57  
##  Adair, Iowa, US              :   1   Max.   :90056   Max.   :65.51  
##  (Other)                      :2877   NA's   :4       NA's   :57     
##       lon                    NCHS_urbanization   Population      
##  Min.   :-164.04                      :  60    Min.   :     418  
##  1st Qu.: -96.79   Large central metro:  64    1st Qu.:   14155  
##  Median : -89.15   Large fringe metro : 367    Median :   30270  
##  Mean   : -91.16   Medium metro       : 365    Mean   :  113355  
##  3rd Qu.: -82.95   Micropolitan       : 618    3rd Qu.:   77255  
##  Max.   : -67.63   Non-core           :1064    Max.   :10098052  
##  NA's   :57        Small metro        : 348    NA's   :60        
##    confirmed         ConfirmedPer         deaths           DeathsPer      
##  Min.   :     0.0   Min.   :   0.00   Min.   :    0.00   Min.   :  0.000  
##  1st Qu.:     5.0   1st Qu.:  31.55   1st Qu.:    0.00   1st Qu.:  0.000  
##  Median :    21.0   Median :  64.42   Median :    0.00   Median :  0.000  
##  Mean   :   350.2   Mean   : 150.90   Mean   :   20.19   Mean   :  6.295  
##  3rd Qu.:    90.0   3rd Qu.: 149.95   3rd Qu.:    3.00   3rd Qu.:  5.785  
##  Max.   :162338.0   Max.   :5512.96   Max.   :17682.00   Max.   :268.100  
##                     NA's   :60                           NA's   :60
  • As can be seen from the summary output above, there are a few NA values that we need to handle.
colSums(is.na(df))
##       last_update             state       county_name    FullCountyName 
##                 0                 0                 0                 0 
##         fips_code               lat               lon NCHS_urbanization 
##                 4                57                57                 0 
##        Population         confirmed      ConfirmedPer            deaths 
##                60                 0                60                 0 
##         DeathsPer 
##                60
  • Since there are missing values in the numeric columns, we think the best way to fill them is to impute using Predictive Mean Matching (PMM) from the MICE package.

  • MICE (Multivariate Imputation via Chained Equations) assumes that the data are missing at random, i.e., that the missingness depends only on the observed data. Under this assumption, the missing values can be predicted from the existing data via MICE.

numeric_df  <- select_if(df, is.numeric)
mice_plot <- aggr(numeric_df, col=mdc(1:2),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(numeric_df), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##      Variable       Count
##    Population 0.020790021
##  ConfirmedPer 0.020790021
##     DeathsPer 0.020790021
##           lat 0.019750520
##           lon 0.019750520
##     fips_code 0.001386001
##     confirmed 0.000000000
##        deaths 0.000000000
  • The missingness plot clearly displays the missing-data pattern of our dataset, and notably the confirmed case and death counts are complete, so they do not require any imputation. Meanwhile, the histogram clearly depicts the influence of missing values on the variables; for example, where the population count is missing, the confirmed cases per 100,000 cannot be computed either.

  • We impute the dataset using the PMM method; here ‘m’ = 5 refers to the number of imputed data sets, ‘maxit’ is the number of iterations used to impute the missing values, and ‘method’ selects predictive mean matching (‘pmm’).

imputed_Data <- mice(numeric_df, m=5, maxit = 10, method = 'pmm', seed = 40)

#summary(imputed_Data)
#Imputed dataset
imputed_Data=complete(imputed_Data,5)
cat(dim(imputed_Data))
## 2886 8
head(imputed_Data)
##   fips_code      lat       lon Population confirmed ConfirmedPer deaths
## 1      1001 32.53953 -86.64408      55200        40        72.46      4
## 2      1003 30.72775 -87.72207     208107       171        82.17      3
## 3      1005 31.86826 -85.38713      25782        37       143.51      0
## 4      1007 32.99642 -87.12511      22527        42       186.44      0
## 5      1009 33.98211 -86.56791      57645        34        58.98      0
## 6      1011 32.10031 -85.71266      10352        12       115.92      0
##   DeathsPer
## 1      7.25
## 2      1.44
## 3      0.00
## 4      0.00
## 5      0.00
## 6      0.00

Let’s check whether the NAs have been imputed:

colSums(is.na(imputed_Data)) # check the remaining NAs
##    fips_code          lat          lon   Population    confirmed ConfirmedPer 
##            0            0            0            0            0            0 
##       deaths    DeathsPer 
##            0            0
imputed_Data$last_update = df$last_update
imputed_Data$state = df$state
imputed_Data$county_name = df$county_name
imputed_Data$FullCountyName = df$FullCountyName
imputed_Data$NCHS_urbanization = df$NCHS_urbanization
  • Now there are no missing or NA values in the dataset, so we can proceed with further analysis.

Exploratory Data Analysis

Exploratory data analysis is an approach for initial data exploration that uses visualization and basic statistics to provide a preliminary understanding of the data and to facilitate subsequent, more rigorous analysis.

In this case, with two such large-scale datasets, we use EDA for basic data visualization to help us decide on strategies for further analysis. From charts such as bar charts and heat maps, we can discover significant patterns in the data and then dig into the underlying questions.

suppressPackageStartupMessages(library(tidyverse)) 
 suppressPackageStartupMessages(library(ggplot2))
 suppressPackageStartupMessages(library(viridis))

1. Total confirmed cases by state- top 10

 imputed_Data %>% filter(state!="Unknown") %>% group_by(state) %>% summarise(TOTALSTATE=sum(confirmed)) %>% 
 arrange(desc(TOTALSTATE)) %>% head(10) %>%
 ggplot(aes(x=reorder(state,TOTALSTATE),y=TOTALSTATE, fill=factor(state))) + 
 geom_bar(stat='identity') +  coord_flip() +
 theme_light() +
 ggtitle("Confirmed Cases by State") + xlab("County") + ylab("Count") +
 theme(legend.position="none")

From the above graph, it is clearly evident that the highest number of cases is in New York state, followed by New Jersey and California.

2. Total confirmed cases by county- top 10

 imputed_Data %>% filter(county_name!="Unknown") %>% group_by(county_name) %>% summarise(TOTAL=sum(confirmed)) %>% 
 arrange(desc(TOTAL)) %>% head(10) %>%
 ggplot(aes(x=reorder(county_name,TOTAL),y=TOTAL, fill=factor(county_name))) + 
 geom_bar(stat='identity') +  coord_flip() +
 theme_light() +
 ggtitle(" Confirmed Cases by County") + xlab("County") + ylab("Count") +
 theme(legend.position="none")

At the county level, New York City and Suffolk experienced the highest number of cases. Not surprisingly, New York state stays at the top of the list of confirmed cases, with a huge gap over the rest.

3. Total death cases by state- top 10

 imputed_Data %>% filter(state!="Unknown") %>% group_by(state) %>% summarise(TOTALSTATE=sum(deaths)) %>% 
 arrange(desc(TOTALSTATE)) %>% head(10) %>%
 ggplot(aes(x=reorder(state,TOTALSTATE),y=TOTALSTATE, fill=factor(state))) + 
 geom_bar(stat='identity') +  coord_flip() +
 theme_light() +
 ggtitle("Death cases by State") + xlab("State") + ylab("Count") +
 theme(legend.position="none")

4. Total death cases by county- top 10

 imputed_Data %>% filter(county_name!="Unknown") %>% group_by(county_name) %>% summarise(TOTAL=sum(deaths)) %>% 
 arrange(desc(TOTAL)) %>% head(10) %>%
 ggplot(aes(x=reorder(county_name,TOTAL),y=TOTAL, fill=factor(county_name))) + 
 geom_bar(stat='identity') +  coord_flip() +
 theme_light() +
 ggtitle("Death cases by County") + xlab("County") + ylab("Count") +
 theme(legend.position="none")

Summary:
From the above graphs, it is clear that New York and New Jersey have the most death cases, while Michigan ranks third, even though California is third in the number of confirmed cases.

It is worth mentioning that although Wayne and Nassau are not in the top 5 for confirmed cases, their death counts rank in the top 3.

5. Total confirmed cases and deaths on the U.S. map

state_map <- us_map(regions = "states")
county_map <- us_map(regions = "counties")
#Libraries for state heat map
 suppressPackageStartupMessages(library(usmap))

#Rename 
 colnames(statepop)[colnames(statepop)=="full"] <- "state"

#Summarize data for map
#Cases 
  imputed_Data %>% select(state, deaths, confirmed) %>% group_by(state) %>%
  summarize(TOTALCASES=sum(confirmed)) -> CASES
#Deaths
  imputed_Data %>% select(state, deaths, confirmed) %>% group_by(state) %>%
  summarize(TOTALDEATHS=sum(deaths)) -> DEATHS

#Convert summarized data to data frames
 CASES<- data.frame(CASES)
 DEATHS<-data.frame(DEATHS)

#Merge data
 ALLCASES<-left_join(CASES, statepop, by="state")
 ALLDEATHS<-left_join(DEATHS, statepop,  by="state")

#Plot maps
 plot_usmap(data=ALLCASES, values="TOTALCASES", regions = "state") + 
  scale_fill_viridis(discrete=FALSE) + 
  theme(legend.position = "right") + labs(fill="TOTALCASES") +
  ggtitle("Total Confirmed Cases by State")

plot_usmap(data=ALLDEATHS, values="TOTALDEATHS", regions = "state") + 
  scale_fill_viridis(discrete=FALSE) + 
  theme(legend.position = "right") + labs(fill="TOTALDEATHS") +
  ggtitle("Total Confirmed Deaths by State")

Interpretation:
The bar and map charts show that the numbers of confirmed cases and deaths differ significantly across states and counties. Therefore, we decided to explore the differences in confirmed cases between large cities to further study the geographical distribution. Both visualization methods used in the EDA clearly demonstrate the comparison between states or cities in confirmed cases or deaths: the maps directly show the density of cases, whereas the bar charts help the audience evaluate the differences. On the surface, the charts suggest that the greater the number of confirmed cases, the more deaths. To analyze the relationship between confirmed cases and deaths more deeply, we plan to build a linear regression model.

6. The total number of confirmed cases and deaths over time in the United States from March to May

Loading the timeseries dataset

timeSeriesData <- read.csv("C:\\Users\\Kashika\\Desktop\\county_timeseries.csv")
head(timeSeriesData)
##        uid location_type fips_code location_name   state      date
## 1 84001001        county      1001       Autauga Alabama 1/22/2020
## 2 84001001        county      1001       Autauga Alabama 1/23/2020
## 3 84001001        county      1001       Autauga Alabama 1/24/2020
## 4 84001001        county      1001       Autauga Alabama 1/25/2020
## 5 84001001        county      1001       Autauga Alabama 1/26/2020
## 6 84001001        county      1001       Autauga Alabama 1/27/2020
##   total_population cumulative_cases cumulative_cases_per_100_000
## 1            55200                0                            0
## 2            55200                0                            0
## 3            55200                0                            0
## 4            55200                0                            0
## 5            55200                0                            0
## 6            55200                0                            0
##   cumulative_deaths cumulative_deaths_per_100_000 new_cases new_deaths
## 1                 0                             0        NA         NA
## 2                 0                             0         0          0
## 3                 0                             0         0          0
## 4                 0                             0         0          0
## 5                 0                             0         0          0
## 6                 0                             0         0          0
##   new_cases_per_100_000 new_deaths_per_100_000
## 1                    NA                     NA
## 2                     0                      0
## 3                     0                      0
## 4                     0                      0
## 5                     0                      0
## 6                     0                      0
tail(timeSeriesData)
##             uid     location_type fips_code location_name          state
## 338203 84099999 special_exception     99999               Grand Princess
## 338204 84099999 special_exception     99999               Grand Princess
## 338205 84099999 special_exception     99999               Grand Princess
## 338206 84099999 special_exception     99999               Grand Princess
## 338207 84099999 special_exception     99999               Grand Princess
## 338208 84099999 special_exception     99999               Grand Princess
##             date total_population cumulative_cases cumulative_cases_per_100_000
## 338203 4/29/2020               NA              103                           NA
## 338204 4/30/2020               NA              103                           NA
## 338205  5/1/2020               NA              103                           NA
## 338206  5/2/2020               NA              103                           NA
## 338207  5/3/2020               NA              103                           NA
## 338208  5/4/2020               NA              103                           NA
##        cumulative_deaths cumulative_deaths_per_100_000 new_cases new_deaths
## 338203                 3                            NA         0          0
## 338204                 3                            NA         0          0
## 338205                 3                            NA         0          0
## 338206                 3                            NA         0          0
## 338207                 3                            NA         0          0
## 338208                 3                            NA         0          0
##        new_cases_per_100_000 new_deaths_per_100_000
## 338203                    NA                     NA
## 338204                    NA                     NA
## 338205                    NA                     NA
## 338206                    NA                     NA
## 338207                    NA                     NA
## 338208                    NA                     NA
timeSeriesData$date <- as.Date(timeSeriesData$date, format="%m/%d/%Y") # parse dates such as "1/22/2020"
#Total cases and deaths
 ggplot(timeSeriesData, aes(date)) +                   
 geom_line(aes(y=cumulative_cases), colour="cyan3") +
 geom_line(aes(y=cumulative_deaths), colour="darkred") +  
 theme_light() +
 ggtitle(label="Total Cases and Deaths Over Time",  subtitle="blue=cases, red=deaths") + 
 ylab("Count") + xlab("Date") + 
 theme(axis.text.x=element_text(angle=40))

Summary: It is evident from the above graph that as the number of confirmed cases increases exponentially, the number of deaths also increases.

Note:
Since the dataset contains only zero entries until March 22, the graph does not show any values before that date.


Linear Regression Model:

To explore the relationship between total cases and deaths in the nation, we decided to use a linear regression model to approximate the trend so that we can predict the upcoming situation. A linear model determines the value of a dependent variable based on the value of a given independent variable.
Currently, the number of confirmed cases is constantly increasing. Although the rate is not fixed, we can still use the linear regression model to make a short-term prediction of the number of deaths.

# extract the relevant columns
dat <- select(timeSeriesData, cumulative_cases,cumulative_deaths)
colnames(dat) <- c("totalConfirmed", "totalDeaths")
summary(dat)
##  totalConfirmed       totalDeaths       
##  Min.   :     0.00   Min.   :    0.000  
##  1st Qu.:     0.00   1st Qu.:    0.000  
##  Median :     0.00   Median :    0.000  
##  Mean   :    74.25   Mean   :    3.867  
##  3rd Qu.:     3.00   3rd Qu.:    0.000  
##  Max.   :175651.00   Max.   :19057.000
#build regression model using lm()
dat_lm1 <- lm(totalDeaths~totalConfirmed, data=dat)
options(digits=2)
summary(dat_lm1)
## 
## Call:
## lm(formula = totalDeaths ~ totalConfirmed, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2241.4     2.8     3.0     3.0  2820.7 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.00581    0.07317   -41.1   <2e-16 ***
## totalConfirmed  0.09256    0.00005  1850.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42 on 338206 degrees of freedom
## Multiple R-squared:  0.91,   Adjusted R-squared:  0.91 
## F-statistic: 3.43e+06 on 1 and 338206 DF,  p-value: <2e-16
#plot the model
ggplot(dat_lm1, aes(x = totalConfirmed, y = totalDeaths)) + 
  geom_point(col ="red" )+
  stat_smooth(method = "lm", col = "cornflowerblue")
## `geom_smooth()` using formula 'y ~ x'

Interpretation:
Since the p-values are smaller than 0.05, the linear regression shows a positive relationship between the total number of confirmed cases and deaths from February to May within the nation. With a slope of roughly 0.093, every 100 additional confirmed cases correspond to an increase of roughly 9 deaths.
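
The code below is a rough sketch of how the fitted model could be used for the short-term projection described above: it feeds hypothetical nationwide case totals into predict() to obtain projected deaths with prediction intervals. The case totals are illustrative placeholders, not values taken from the data.

# Sketch: project deaths from hypothetical confirmed-case totals using the
# fitted model dat_lm1 (the totals below are illustrative only)
hypothetical_cases <- data.frame(totalConfirmed = c(1200000, 1500000, 2000000))
predict(dat_lm1, newdata = hypothetical_cases, interval = "prediction")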


Time Series Analysis:

Time series analysis is a statistical technique for analyzing time-ordered data in order to extract meaningful statistics and other characteristics of the data. We perform time series forecasting to predict future values based on previously observed values.

Time series data are a set of observations of the values that a variable takes at different times (in our dataset, the date column represents time).

We will be using Facebook’s open source “Prophet model” to forecast total confirmed cases and deaths for future dates.

State-wise predictions for the top 3 states

  1. Data Preparation

Since we do not need all 15 variables, we keep only the columns required for our analysis by subsetting the data frame with a vector of column names.

keep <- c("uid", "location_name", "state", "date", "cumulative_cases", "cumulative_deaths")
df  <- timeSeriesData[keep]
head(df)
##        uid location_name   state       date cumulative_cases cumulative_deaths
## 1 84001001       Autauga Alabama 2020-01-22                0                 0
## 2 84001001       Autauga Alabama 2020-01-23                0                 0
## 3 84001001       Autauga Alabama 2020-01-24                0                 0
## 4 84001001       Autauga Alabama 2020-01-25                0                 0
## 5 84001001       Autauga Alabama 2020-01-26                0                 0
## 6 84001001       Autauga Alabama 2020-01-27                0                 0
tail(df)
##             uid location_name          state       date cumulative_cases
## 338203 84099999               Grand Princess 2020-04-29              103
## 338204 84099999               Grand Princess 2020-04-30              103
## 338205 84099999               Grand Princess 2020-05-01              103
## 338206 84099999               Grand Princess 2020-05-02              103
## 338207 84099999               Grand Princess 2020-05-03              103
## 338208 84099999               Grand Princess 2020-05-04              103
##        cumulative_deaths
## 338203                 3
## 338204                 3
## 338205                 3
## 338206                 3
## 338207                 3
## 338208                 3
colSums(is.na(df))
##               uid     location_name             state              date 
##                 0                 0                 0                 0 
##  cumulative_cases cumulative_deaths 
##                 0                 0
  • Now that we do not have any NA values, we are ready to perform exploratory data analysis. Also, let's assign meaningful names to the columns in order to make the visualizations more effective.
#changing names of the columns
names(df)[names(df) == "location_name"] <- "county"
names(df)[names(df) == "cumulative_cases"] <- "confirmedcases"
names(df)[names(df) == "cumulative_deaths"] <- "deaths"

colnames(df)
## [1] "uid"            "county"         "state"          "date"          
## [5] "confirmedcases" "deaths"
str(df)
## 'data.frame':    338208 obs. of  6 variables:
##  $ uid           : int  84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 ...
##  $ county        : Factor w/ 1900 levels "","Abbeville",..: 83 83 83 83 83 83 83 83 83 83 ...
##  $ state         : Factor w/ 53 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date          : Date, format: "2020-01-22" "2020-01-23" ...
##  $ confirmedcases: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ deaths        : int  0 0 0 0 0 0 0 0 0 0 ...
  • Two variables, county and state, are of factor type; let's convert them to character type and parse the date column as a Date before performing the forecast.
#Converting factor datatypes to character
df[["state"]] <- as.character(df[["state"]] )
df[["county"]] <- as.character(df[["county"]] )
df$date <- as.Date(df$date, format = "%m/%d/%Y")
str(df)
## 'data.frame':    338208 obs. of  6 variables:
##  $ uid           : int  84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 ...
##  $ county        : chr  "Autauga" "Autauga" "Autauga" "Autauga" ...
##  $ state         : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ date          : Date, format: "2020-01-22" "2020-01-23" ...
##  $ confirmedcases: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ deaths        : int  0 0 0 0 0 0 0 0 0 0 ...
head(df)
##        uid  county   state       date confirmedcases deaths
## 1 84001001 Autauga Alabama 2020-01-22              0      0
## 2 84001001 Autauga Alabama 2020-01-23              0      0
## 3 84001001 Autauga Alabama 2020-01-24              0      0
## 4 84001001 Autauga Alabama 2020-01-25              0      0
## 5 84001001 Autauga Alabama 2020-01-26              0      0
## 6 84001001 Autauga Alabama 2020-01-27              0      0

1. Getting Coronavirus data for New York state

We observed from our EDA that New York leads in both confirmed cases and fatalities; let's dig deeper and predict the number of cases and fatalities for the next 45 days for the top 3 states and for the USA.

We filter the data for New York state, group it by date, and aggregate by summing the total confirmed cases and deaths.

# group by date

coronaData_NewYork <- df %>%
  
  select(date, state, confirmedcases, deaths)  %>% 
  
  group_by(date) %>%
  
  filter(state == "New York") %>%
  
  summarise(totalConfirmed = sum(confirmedcases,na.rm=TRUE), totalDeaths = sum(deaths,na.rm=TRUE))

tail(coronaData_NewYork)
## # A tibble: 6 x 3
##   date       totalConfirmed totalDeaths
##   <date>              <int>       <int>
## 1 2020-04-29         299691       23477
## 2 2020-04-30         304372       23587
## 3 2020-05-01         308314       24039
## 4 2020-05-02         312977       24198
## 5 2020-05-03         316415       24708
## 6 2020-05-04         318953       24999

The Prophet model requires the date column to be named ‘ds’ and the variable column to be named ‘y’ to perform forecast.

# get coronavirus confirmed cases data for New York state
coronaData_confirmed_NewYork <- coronaData_NewYork %>%
  select(date, totalConfirmed) 

# get coronavirus deaths data for New York state
coronaData_death_NewYork <- coronaData_NewYork %>%
  select(date, totalDeaths)

# transforming data for forecasting    
names(coronaData_confirmed_NewYork) <- c("ds", "y")
names(coronaData_death_NewYork) <- c("ds", "y")
  • We can now apply the Prophet model to the New York dataset.
  • Here, ‘yhat’ represents the prediction, while ‘yhat_lower’ and ‘yhat_upper’ represent the lower and upper bounds of the prediction, respectively.
#Predicting covid-19 confirmed cases of New York State  using predict()  
mNYcc <- prophet(coronaData_confirmed_NewYork)

futureNYcc <- make_future_dataframe(mNYcc,periods = 45)

forecastNYcc <- predict(mNYcc, futureNYcc)

tail(forecastNYcc[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
##             ds   yhat yhat_lower yhat_upper
## 144 2020-06-13 572405     498084     647904
## 145 2020-06-14 577967     501424     654690
## 146 2020-06-15 583226     502954     662432
## 147 2020-06-16 589737     505455     672561
## 148 2020-06-17 595946     508087     680246
## 149 2020-06-18 602114     509864     688622

Plotting the Forecast

Visualization of time series data conveys more than raw numbers. Therefore, we plot the forecast by calling plot() and prophet_plot_components() to see the trend and seasonal components and get a clearer view of where the series is heading.

#plot the prediction
plot(mNYcc,forecastNYcc) + labs(title = "Predicting Coronavirus Confirmed Cases for New York State", 
                        x = "Date", 
                        y = "Count")

# break the prediction down into trend and weekly seasonality and plot
prophet_plot_components(mNYcc, forecastNYcc)

dyplot.prophet(mNYcc, forecastNYcc)
#Predicting covid-19 death cases of New York state
mNYdeaths <- prophet(coronaData_death_NewYork)

futureNYdeaths <- make_future_dataframe(mNYdeaths,periods = 45)

forecastNYdeaths <- predict(mNYdeaths, futureNYdeaths)

tail(forecastNYdeaths[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
##             ds  yhat yhat_lower yhat_upper
## 144 2020-06-13 47764      41676      55105
## 145 2020-06-14 48221      41705      55885
## 146 2020-06-15 48748      42090      56706
## 147 2020-06-16 49388      42537      57695
## 148 2020-06-17 49925      42767      58530
## 149 2020-06-18 50448      42960      59289
#plot the prediction
plot(mNYdeaths,forecastNYdeaths) + labs(title = "Predicting Coronavirus deaths for New York State", 
                        x = "Date", 
                        y = "Count")

# separate the trend and weekly seasonality
prophet_plot_components(mNYdeaths,forecastNYdeaths)

dyplot.prophet(mNYdeaths,forecastNYdeaths)

Interpretation:
The plots above show a significant upward trend. In particular, the prediction for New York state shows a dramatic increase in the number of confirmed cases and fatalities for the upcoming days. Let's analyze the trend further to arrive at a concrete conclusion.

The trend shows an increase in the number of confirmed cases as well as fatalities for the upcoming days, which is not good. We can also observe that cases and deaths increase more over the weekends than on weekdays, and that Mondays show fewer cases and deaths than other days; the sketch below checks this against the fitted weekly component.
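
As a rough check of the weekday pattern described above, we can average Prophet's fitted weekly component by day of the week. This is only a sketch and assumes the forecast data frame forecastNYcc contains the 'weekly' column that Prophet adds when weekly seasonality is fitted.

# Average the fitted weekly component by day of week (rough sanity check of
# the weekday pattern; assumes forecastNYcc contains Prophet's 'weekly' column)
wk <- data.frame(day = weekdays(forecastNYcc$ds), weekly = forecastNYcc$weekly)
aggregate(weekly ~ day, data = wk, FUN = mean)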

2. Getting Coronavirus data for New Jersey state

# group by date
coronaData_NewJersey <- df %>%
  
  select(date, state, confirmedcases, deaths)  %>% 
  
  group_by(date) %>%
  
  filter(state == "New Jersey") %>%
  
  summarise(totalConfirmed = sum(confirmedcases,na.rm=TRUE), totalDeaths = sum(deaths,na.rm=TRUE))

tail(coronaData_NewJersey)
## # A tibble: 6 x 3
##   date       totalConfirmed totalDeaths
##   <date>              <int>       <int>
## 1 2020-04-29         116365        6771
## 2 2020-04-30         118652        7228
## 3 2020-05-01         121190        7538
## 4 2020-05-02         123717        7742
## 5 2020-05-03         126744        7871
## 6 2020-05-04         128269        7910
# get coronavirus confirmed cases data for New Jersey state

coronaData_confirmed_NewJersey <- coronaData_NewJersey %>%
  select(date, totalConfirmed) 

# get coronavirus deaths data for New Jersey state

coronaData_death_NewJersey <- coronaData_NewJersey %>%
  select(date, totalDeaths)

# transforming data for forecasting  
names(coronaData_confirmed_NewJersey) <- c("ds", "y")
names(coronaData_death_NewJersey) <- c("ds", "y")
# for coronaData_confirmed_New Jersey

mNJ <- prophet(coronaData_confirmed_NewJersey)

futureNJ <- make_future_dataframe(mNJ,periods = 45)

forecastNJ <- predict(mNJ, futureNJ)

#tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])

plot(mNJ,forecastNJ) + labs(title = "Predicting Coronavirus Confirmed Cases for New Jersey State", 
                        x= "Date", 
                        y = "Count")

dyplot.prophet(mNJ, forecastNJ)

Predicting covid-19 death cases of New Jersey state

mNJdeaths <- prophet(coronaData_death_NewJersey)

futureNJdeaths <- make_future_dataframe(mNJdeaths,periods = 45)

forecastNJdeaths <- predict(mNJdeaths, futureNJdeaths)

#tail(forecastNJdeaths[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])

plot(mNJdeaths,forecastNJdeaths) + labs(title = "Predicting Coronavirus deaths for New Jersey State", 
                        x = "Date", 
                        y = "Count")

dyplot.prophet(mNJdeaths,forecastNJdeaths)

3. Getting Coronavirus data for Massachusetts state

# group by date

coronaData_Mass <- df %>%
  
  select(date, state, confirmedcases, deaths)  %>% 
  
  group_by(date) %>%
  
  filter(state == "Massachusetts") %>%
  
  summarise(totalConfirmed = sum(confirmedcases,na.rm=TRUE), totalDeaths = sum(deaths,na.rm=TRUE))
# get coronavirus confirmed cases data for Massachusetts state

coronaData_confirmed_Mass <- coronaData_Mass %>%
  select(date, totalConfirmed) 

# get coronavirus deaths data for Massachusetts state

coronaData_death_Mass <- coronaData_Mass %>%
  select(date, totalDeaths)
names(coronaData_confirmed_Mass) <- c("ds", "y")
names(coronaData_death_Mass) <- c("ds", "y")
# for coronaData_confirmed_Massachusetts 

mMass <- prophet(coronaData_confirmed_Mass)

futureMass <- make_future_dataframe(mMass,periods = 45)

forecastMass <- predict(mMass, futureMass)

#tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])

plot(mMass,forecastMass) + labs(title = "Predicting Coronavirus Confirmed Cases for Massachusetts State", 
                        x = "Date", 
                        y = "Count")

dyplot.prophet(mMass, forecastMass)

Predicting covid-19 death cases of Massachusetts state

mMassdeaths <- prophet(coronaData_death_Mass)

futureMassdeaths <- make_future_dataframe(mMassdeaths,periods = 45)

forecastMassdeaths <- predict(mMassdeaths, futureMassdeaths)

#tail(forecastMassdeaths[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])

plot(mMassdeaths,forecastMassdeaths) + labs(title = "Predicting Coronavirus deaths for Massachusetts State", 
                        x = "Date", 
                        y = "Count")

dyplot.prophet(mMassdeaths,forecastMassdeaths)

4. Getting Coronavirus data for the United States

The figure shows the actual trend and prediction of the number of nationwide confirmed cases and deaths. We aggregate the data frame by taking the sum of the total confirmed cases and deaths.

# group by date

coronaData_US <- df %>%
  select(date, state, confirmedcases, deaths )  %>% 
  
  group_by(date) %>%
  summarise(totalConfirmed = sum(confirmedcases,na.rm=TRUE), totalDeaths = sum(deaths,na.rm=TRUE))

tail(coronaData_US)
## # A tibble: 6 x 3
##   date       totalConfirmed totalDeaths
##   <date>              <int>       <int>
## 1 2020-04-29        1038264       60870
## 2 2020-04-30        1067660       62893
## 3 2020-05-01        1101661       64838
## 4 2020-05-02        1130557       66263
## 5 2020-05-03        1156007       67574
## 6 2020-05-04        1178307       68814
# get confirmed cases data of USA
coronaData_confirmedUS<- coronaData_US %>%
  select(date, totalConfirmed) 

# get death cases data of USA
coronaData_deathUS <- coronaData_US %>%
  select(date, totalDeaths )
  • We transform the columns to ds and y for forecasting, as the Prophet model requires the date column to be named ‘ds’ and the variable column to be named ‘y’.
names(coronaData_confirmedUS) <- c("ds", "y")
names(coronaData_deathUS) <- c("ds", "y")

Applying the Prophet model to the number of confirmed cases for the USA

Here, the Prophet model frames forecasting as a curve-fitting problem rather than explicitly modeling the time-based dependence of each observation within the time series.

m <- prophet(coronaData_confirmedUS)

future <- make_future_dataframe(m,periods = 45)

forecastUS <- predict(m, future)

tail(forecastUS[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
##             ds    yhat yhat_lower yhat_upper
## 144 2020-06-13 2339356    2180611    2520411
## 145 2020-06-14 2366946    2203075    2558842
## 146 2020-06-15 2394079    2224533    2593345
## 147 2020-06-16 2422979    2246656    2629651
## 148 2020-06-17 2451237    2268744    2663694
## 149 2020-06-18 2481011    2290905    2699061
plot(m,forecastUS) + labs(title = "Predicting Coronavirus Confirmed Cases for USA", 
                        x = "Date", 
                        y = "Count")

prophet_plot_components(m,forecastUS)

dyplot.prophet(m,forecastUS)

Interpretation:
The plots above show an upward trend, i.e., the number of confirmed cases is increasing exponentially for the upcoming days. Let's look at the components plot to analyze the daily and weekly trends.

The trend plot shows an increase in the number of confirmed cases for the upcoming days, which is not good. From the weekly component plot, we can observe that the number of cases increases more over the weekends; it also decreases from the start of the week until mid-week and then rises again from Thursday through Sunday.

Predicting covid-19 death cases for USA

mdeathUS <- prophet(coronaData_deathUS)

futuredeathUS <- make_future_dataframe(mdeathUS,periods = 45)

forecastdeathUS <- predict(mdeathUS,futuredeathUS)

tail(forecastdeathUS[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
##             ds   yhat yhat_lower yhat_upper
## 144 2020-06-13 150375     139131     160743
## 145 2020-06-14 152151     140434     162937
## 146 2020-06-15 154008     141854     165149
## 147 2020-06-16 156218     143555     167667
## 148 2020-06-17 158298     145109     170379
## 149 2020-06-18 160330     146755     173127
plot(mdeathUS,forecastdeathUS) + labs(title = "Predicting Coronavirus deaths for USA", 
                        x = "Date", 
                        y = "Count")

prophet_plot_components(mdeathUS,forecastdeathUS)

dyplot.prophet(mdeathUS,forecastdeathUS)

Interpretation:
The plots above show an upward trend, meaning that the number of fatalities is increasing rapidly for the upcoming days, following the same trend as the confirmed cases within the USA, which is not good. Let's look at the components plot to analyze the daily and weekly trends.
From the weekly component plot, we can observe that the number of deaths keeps increasing from Tuesday through Saturday.


Further Analysis:


1. In order to understand the effect of this pandemic and how it is progressing at a granular level, we explored the "covid19.analytics" R package.

This package allows users to obtain live worldwide data on the novel coronavirus disease originally reported in 2019, as published by the JHU CCSE repository.

Let’s install the package

#install.packages("covid19.analytics")
library(covid19.analytics)

2. The ‘report.summary’ function generates an overall report summarizing the different datasets. Let us generate the summary report of the time series data for the US.

report.summary(cases.to.process = "TS", geo.loc = "US")
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
## -------------------------------------------------------------------------------- 
## ################################################################################ 
##   ##### TS-CONFIRMED Cases  -- Data dated:  2020-05-17  ::  2020-05-18 12:16:23 
## ################################################################################ 
##   Number of Countries/Regions reported:  1 
##   Number of Cities/Provinces reported:  1 
##   Unique number of distinct geographical locations combined: 1 
## -------------------------------------------------------------------------------- 
##   For selected locations ts-confirmed  Totals: 1486757 
## -------------------------------------------------------------------------------- 
##   Country.Region Province.State  Totals RelPerc GlobalPerc LastDayChange   t-2   t-3   t-7  t-14  t-30
## 1             US                1486757     100         32         18937 24996 25050 18621 22335 32491
## -------------------------------------------------------------------------------- 
##   Global Perc. Average:  31.54 (sd: NA) 
##   Global Perc. Average in top  1 :  31.54 (sd: NA) 
## -------------------------------------------------------------------------------- 
## ================================================================================

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
## -------------------------------------------------------------------------------- 
## ################################################################################ 
##   ##### TS-DEATHS Cases  -- Data dated:  2020-05-17  ::  2020-05-18 12:16:24 
## ################################################################################ 
##   Number of Countries/Regions reported:  1 
##   Number of Cities/Provinces reported:  1 
##   Unique number of distinct geographical locations combined: 1 
## -------------------------------------------------------------------------------- 
##   For selected locations ts-deaths  Totals: 89562 
## -------------------------------------------------------------------------------- 
##   Country.Region Province.State Totals Perc LastDayChange  t-2  t-3  t-7 t-14 t-30
## 1             US                 89562    6           808 1224 1632 1156 1240 2342
## -------------------------------------------------------------------------------- 
## -------------------------------------------------------------------------------- 
## ================================================================================

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
## -------------------------------------------------------------------------------- 
## ################################################################################ 
##   ##### TS-RECOVERED Cases  -- Data dated:  2020-05-17  ::  2020-05-18 12:16:24 
## ################################################################################ 
##   Number of Countries/Regions reported:  1 
##   Number of Cities/Provinces reported:  1 
##   Unique number of distinct geographical locations combined: 1 
## -------------------------------------------------------------------------------- 
##   For selected locations ts-recovered  Totals: 272265 
## -------------------------------------------------------------------------------- 
##   Country.Region Province.State Totals Perc LastDayChange   t-2  t-3   t-7 t-14 t-30
## 1             US                272265   18          3889 17629 4333 16564 7028 6295
## -------------------------------------------------------------------------------- 
## -------------------------------------------------------------------------------- 
## ================================================================================

##  
## 
## ******************************************************************************** 
## ********************************  OVERALL SUMMARY******************************** 
## ******************************************************************************** 
##   ****  Time Series  US TOTS **** 
##       ts-confirmed   ts-deaths   ts-recovered 
##       1486757    89562   272265 
##              6.02%       18.31% 
##   ****  Time Series  US AVGS **** 
##       ts-confirmed   ts-deaths   ts-recovered 
##       1486757    89562   272265 
##              6.02%       18.31% 
##   ****  Time Series  US SDS **** 
##       ts-confirmed   ts-deaths   ts-recovered 
##       NA NA  NA 
##              NA%         NA% 
##   
## 
##  * Statistical estimators computed considering 1/1/1 independent reported entries per case-type 
## ********************************************************************************
  • From the overall summary above, we can see the summary reports of ‘ts-confirmed’, ‘ts-deaths’, and ‘ts-recovered’ for the United States. The number of confirmed cases in the US to date is 1,486,757, the total number of deaths is 89,562, and the total number of recovered cases is 272,265.
  • We can also see the percentages of deaths and recoveries relative to the confirmed number of cases, which are 6.02% and 18.31% respectively, as the quick check below confirms.
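
As a quick arithmetic check, these percentages can be reproduced directly from the totals printed in the report.summary output above.

# Recompute the death and recovery percentages from the summary totals
confirmed <- 1486757; deaths <- 89562; recovered <- 272265
round(100 * c(deaths = deaths, recovered = recovered) / confirmed, 2)
# deaths ~ 6.02, recovered ~ 18.31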

3. Now let us look at the daily changes, which we can compute using the growth.rate() function.

The growth rate is defined as the ratio of the daily changes between two consecutive dates.
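
As a simplified illustration of this idea (not necessarily the package's exact implementation), the daily changes and the ratios of consecutive changes can be computed directly from the nationwide series coronaData_US built earlier.

# Daily new cases and the ratio of consecutive daily changes (simplified
# illustration of the growth-rate idea; early zero days give NaN/Inf ratios)
daily_new    <- diff(coronaData_US$totalConfirmed)
growth_ratio <- daily_new[-1] / head(daily_new, -1)   # change(t) / change(t-1)
tail(round(growth_ratio, 2))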

# read all the time series data (confirmed, deaths, recovered)
data <- covid19.data("ts-ALL")
growth.rate(data, geo.loc = "US")

  • From the scatter and bar plots above, we can see that as the number of confirmed cases increases rapidly, recovered cases are increasing too, which means people are recovering. Deaths remain low compared to the numbers of confirmed and recovered cases: out of roughly 1.4 million confirmed cases, about 6% have resulted in death.

4. Let's now model the evolution of the spread of the virus using generate.SIR.model().

This function identifies the data points at which the onset of the epidemic began and uses the subsequent data points to generate a proper initial guess for the two parameters describing the SIR (Susceptible-Infected-Recovered) model.
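
For reference, the sketch below writes out the SIR equations that these two parameters (the transmission rate beta and the recovery rate gamma) describe, and integrates them with the deSolve package. deSolve is not in our package list above and would need to be installed; the beta and gamma values are taken from the model output further down, and the population size is a placeholder, so this is purely illustrative rather than a reproduction of the package's fit.

# Minimal SIR sketch (illustrative only):
#   dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
library(deSolve)

sir_ode <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}

N     <- 330e6                                  # placeholder population size
init  <- c(S = N - 1, I = 1, R = 0)
parms <- c(beta = 0.63, gamma = 0.37, N = N)    # values reported by the fit below
out   <- ode(y = init, times = 0:180, func = sir_ode, parms = parms)
head(out)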

library("plotly")

# modelling the spread for the US, storing the model and generating an interactive visualization
US.SIR.model <- generate.SIR.model(data,"US", staticPlt=FALSE)
## ################################################################################ 
## ################################################################################ 
## Processing...  US 
##   [1]       1       1       2       2       5       5       5       5       5
##  [10]       7       8       8      11      11      11      11      11      11
##  [19]      11      11      12      12      13      13      13      13      13
##  [28]      13      13      13      15      15      15      51      51      57
##  [37]      58      60      68      74      98     118     149     217     262
##  [46]     402     518     583     959    1281    1663    2179    2727    3499
##  [55]    4632    6421    7783   13747   19273   25600   33276   43843   53736
##  [64]   65778   83836  101657  121465  140909  161831  188172  213242  243622
##  [73]  275367  308650  336802  366317  397121  428654  462780  496535  526396
##  [82]  555313  580619  607670  636350  667592  699706  732197  758809  784326
##  [91]  811865  840351  869170  905358  938154  965785  988197 1012582 1039909
## [100] 1069424 1103461 1132539 1158040 1180375 1204351 1229331 1257023 1283929
## [109] 1309550 1329260 1347881 1369376 1390406 1417774 1442824 1467820 1486757
## [1] 33
##  [1]    15    51    51    57    58    60    68    74    98   118   149   217
## [13]   262   402   518   583   959  1281  1663  2179  2727  3499  4632  6421
## [25]  7783 13747
## ------------------------  Parameters used to create model ------------------------ 
##      Region: US 
##      Time interval to consider: t0=33 - t1= ; tfinal=90 
##          t0: 2020-02-24 -- t1:  
##      Number of days considered for initial guess: 26 
##      Fatality rate: 0.02 
##      Population of the region: 1.4e+09 
## -------------------------------------------------------------------------------- 
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
##  beta gamma 
##  0.63  0.37 
##   R0 = 1.72836242101617 
##   Max of infecteded: 146337626.31  ( 10.45 %) 
##   Max nbr of casualties, assuming  2% fatality rate: 2926752.53 
##   Max reached at day : 68 ==>  2020-05-02 
## ================================================================================
# plotting and visualizing the model
plt.SIR.model(US.SIR.model,"US",interactiveFig=TRUE, fileName="US.SIR.model" )

Note: Please ignore the last graph; the package is still under development, which causes the title to overlap the graphs.

Interpretation: From the above summary, we can see that the infected population peaks at 10.45% of the population, with casualties estimated under an assumed 2% fatality rate. Also, from the SIR plot, the curves for the numbers of susceptible and infected people bend towards zero, while the recovered curve shows an upward trend, indicating that people are recovering at a higher rate.


Conclusion:


  • The EDA gives analysts basic knowledge about the datasets, helping them plan further steps and explore meaningful questions.

  • In the analysis of the relationship between deaths and confirmed cases, the statistical results show that for every 100 additional confirmed cases, deaths increase by roughly 9.

  • Last but not least, with the exponential rise in the number of confirmed cases, the number of fatalities follows the same trend and increases rapidly over time.

  • Notably, the numbers of confirmed cases and deaths peak over the weekends, especially on Fridays and Saturdays. We assume this is driven by social activities over the weekends or the celebration of festivals. We also think one of the main reasons the confirmed count rises on Fridays is that, at the end of the working week, test centers release their accumulated results.

  • Infected people are recovering at a faster rate; the number of confirmed cases to date is about 1.4 million, of which around 6% have resulted in fatalities and around 18% have recovered.


References: