The goal is to predict the cumulative number of confirmed cases and resulting fatalities for upcoming dates within the United States.
COVID-19 is a disease caused by a novel coronavirus. The Coronaviridae family comprises large, single-stranded, positive-sense RNA viruses isolated from multiple species and known to cause the common cold and diarrheal diseases in humans (Dey, Rahman, Siddiqi, & Howlader, 2020). In the United States, different parts of the country are seeing different levels of COVID-19 activity; nationally, the country is currently in the acceleration phase of the pandemic (National Center for Immunization and Respiratory Diseases, 2020).
We work with two COVID-19 datasets collected by Johns Hopkins University.
Dataset 1 - Johns Hopkins COVID-19 Case Tracker
* The data is available at the county level in the United States and gives the number of confirmed cases and deaths reported by each state.
* There are 13 variables and 2886 observations, and the dataset is paired with population data for counties, both urban and rural.
Dataset 2 - Timeseries COVID-19 Cases and Deaths
* This dataset contains daily time series data for the United States, including confirmed cases and deaths, reported at the county level.
* There are 15 variables and 338208 observations in the dataset; we will use only 5 variables for our analysis, as the others are not needed to predict future cases. The 5 key variables are: ‘date’, ‘state’, ‘cumulative_cases’, ‘cumulative_deaths’, ‘location_name’.
Note: We have data from Jan-22-2020 through May-4-2020, but cumulative cases and deaths remain zero until Mar-23-2020.
#Install any packages that are missing, then load all of them
check.packages <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
packages<-c("stats4", "dplyr", "stringr", "tidyr", "ggplot2", "reshape2", "rpart", "corrplot", "geosphere", "leaflet", "leaflet.extras", "maps", "ggpubr", "psych", "mice", "VIM", "tidyverse", "lubridate", "forecast", "tseries", "xts", "caret", "caTools","ggfortify", "astsa", "changepoint", "strucchange", "magrittr","TSstudio", "plotly", "RColorBrewer", "prophet", "lattice", "dygraphs", "viridis","usmap")
check.packages(packages)
wd <- getwd()  #avoid masking the getwd() function with a variable of the same name
setwd(wd)
df <- read.csv("county_level_confirmed_cases.csv")
#Assigning smaller names to the variables
colnames(df)
## [1] "last_update" "state" "county_name"
## [4] "county_name_long" "fips_code" "lat"
## [7] "lon" "NCHS_urbanization" "total_population"
## [10] "confirmed" "confirmed_per_100000" "deaths"
## [13] "deaths_per_100000"
names(df)[names(df) == "deaths_per_100000"] <- "DeathsPer"
names(df)[names(df) == "confirmed_per_100000"] <- "ConfirmedPer"
names(df)[names(df) == "total_population"] <- "Population"
names(df)[names(df) == "county_name_long"] <- "FullCountyName"
cat('Data Dimensions are: ', dim(df))
## Data Dimensions are: 2886 13
summary(df)
## last_update state county_name
## 2020-04-29 01:32:25:2886 Texas : 209 Unassigned: 49
## Georgia : 159 Washington: 29
## Virginia : 134 Jefferson : 26
## Kentucky : 113 Franklin : 24
## Missouri : 100 Jackson : 22
## North Carolina: 98 Lincoln : 22
## (Other) :2073 (Other) :2714
## FullCountyName fips_code lat
## unassigned to a single county: 4 Min. : 1001 Min. :19.60
## Abbeville, South Carolina, US: 1 1st Qu.:18140 1st Qu.:34.53
## Acadia, Louisiana, US : 1 Median :29153 Median :38.09
## Accomack, Virginia, US : 1 Mean :31335 Mean :38.17
## Ada, Idaho, US : 1 3rd Qu.:46049 3rd Qu.:41.57
## Adair, Iowa, US : 1 Max. :90056 Max. :65.51
## (Other) :2877 NA's :4 NA's :57
## lon NCHS_urbanization Population
## Min. :-164.04 : 60 Min. : 418
## 1st Qu.: -96.79 Large central metro: 64 1st Qu.: 14155
## Median : -89.15 Large fringe metro : 367 Median : 30270
## Mean : -91.16 Medium metro : 365 Mean : 113355
## 3rd Qu.: -82.95 Micropolitan : 618 3rd Qu.: 77255
## Max. : -67.63 Non-core :1064 Max. :10098052
## NA's :57 Small metro : 348 NA's :60
## confirmed ConfirmedPer deaths DeathsPer
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 5.0 1st Qu.: 31.55 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 21.0 Median : 64.42 Median : 0.00 Median : 0.000
## Mean : 350.2 Mean : 150.90 Mean : 20.19 Mean : 6.295
## 3rd Qu.: 90.0 3rd Qu.: 149.95 3rd Qu.: 3.00 3rd Qu.: 5.785
## Max. :162338.0 Max. :5512.96 Max. :17682.00 Max. :268.100
## NA's :60 NA's :60
colSums(is.na(df))
## last_update state county_name FullCountyName
## 0 0 0 0
## fips_code lat lon NCHS_urbanization
## 4 57 57 0
## Population confirmed ConfirmedPer deaths
## 60 0 60 0
## DeathsPer
## 60
Since there are missing values in the numeric columns, we think the best way to fill them is to impute with Predictive Mean Matching (PMM) from the MICE package.
MICE (Multivariate Imputation via Chained Equations) assumes the data are missing at random, i.e., the probability that a value is missing depends only on the observed data. Under this assumption, the missing values can be predicted from the existing data via MICE.
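As a complementary check (an addition here, not part of the original pipeline), mice’s md.pattern() tabulates which variables tend to be missing together, which helps assess the missing-at-random assumption:
#Tabulate the joint missingness patterns across the numeric columns
md.pattern(select_if(df, is.numeric))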
numeric_df <- select_if(df, is.numeric)
mice_plot <- aggr(numeric_df, col=mdc(1:2),
                  numbers=TRUE, sortVars=TRUE,
                  labels=names(numeric_df), cex.axis=.7,
                  gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Population 0.020790021
## ConfirmedPer 0.020790021
## DeathsPer 0.020790021
## lat 0.019750520
## lon 0.019750520
## fips_code 0.001386001
## confirmed 0.000000000
## deaths 0.000000000
The aggregation plot clearly displays the missing-data pattern of our dataset; notably, the confirmed-case and death counts are completely observed, so they need no imputation. Meanwhile, the histogram depicts how missing values propagate across variables: for example, where the population count is missing, the confirmed cases per 100,000 cannot be computed either.
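To verify this co-missingness directly, a quick cross-tabulation (a small check added here) shows that the rows lacking Population are exactly the rows lacking the per-100,000 rate:
#Cross-tabulate missingness: rows without Population should also lack ConfirmedPer
with(df, table(Population_missing = is.na(Population),
               ConfirmedPer_missing = is.na(ConfirmedPer)))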
We impute the dataset using the PMM method: here ‘m’ refers to 5 imputed datasets, ‘maxit’ refers to the number of iterations taken to impute the missing values, and ‘method’ selects ‘pmm’, predictive mean matching.
imputed_Data <- mice(numeric_df, m=5, maxit = 10, method = 'pmm', seed = 40)
#summary(imputed_Data)
#Extract the fifth completed (imputed) dataset
imputed_Data <- complete(imputed_Data, 5)
cat(dim(imputed_Data))
## 2886 8
head(imputed_Data)
## fips_code lat lon Population confirmed ConfirmedPer deaths
## 1 1001 32.53953 -86.64408 55200 40 72.46 4
## 2 1003 30.72775 -87.72207 208107 171 82.17 3
## 3 1005 31.86826 -85.38713 25782 37 143.51 0
## 4 1007 32.99642 -87.12511 22527 42 186.44 0
## 5 1009 33.98211 -86.56791 57645 34 58.98 0
## 6 1011 32.10031 -85.71266 10352 12 115.92 0
## DeathsPer
## 1 7.25
## 2 1.44
## 3 0.00
## 4 0.00
## 5 0.00
## 6 0.00
Let’s check whether the NAs have been imputed:
colSums(is.na(imputed_Data)) #check the remaining NAs
## fips_code lat lon Population confirmed ConfirmedPer
## 0 0 0 0 0 0
## deaths DeathsPer
## 0 0
#Re-attach the non-numeric columns to the imputed data
imputed_Data$last_update = df$last_update
imputed_Data$state = df$state
imputed_Data$county_name = df$county_name
imputed_Data$FullCountyName = df$FullCountyName
imputed_Data$NCHS_urbanization = df$NCHS_urbanization
Exploratory data analysis (EDA) is the process of initial data exploration, using visualization and basic statistics to build a preliminary understanding of the data before more complex or rigorous analysis.
For these two large datasets, we use EDA visualizations to help us decide on strategies for further analysis. Charts such as bar charts and heatmaps let us discover significant patterns in the data and then dig out the underlying questions.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(viridis))
imputed_Data %>% filter(state!="Unknown") %>% group_by(state) %>%
  summarise(TOTALSTATE=sum(confirmed)) %>%
  arrange(desc(TOTALSTATE)) %>% head(10) %>%
  ggplot(aes(x=reorder(state,TOTALSTATE),y=TOTALSTATE, fill=factor(state))) +
  geom_bar(stat='identity') + coord_flip() +
  theme_light() +
  ggtitle("Confirmed Cases by State") + xlab("State") + ylab("Count") +
  theme(legend.position="none")
From the above graph, it is clearly evident that the highest number of cases is in New York State, followed by New Jersey and California.
imputed_Data %>% filter(county_name!="Unknown") %>% group_by(county_name) %>%
  summarise(TOTAL=sum(confirmed)) %>%
  arrange(desc(TOTAL)) %>% head(10) %>%
  ggplot(aes(x=reorder(county_name,TOTAL),y=TOTAL, fill=factor(county_name))) +
  geom_bar(stat='identity') + coord_flip() +
  theme_light() +
  ggtitle("Confirmed Cases by County") + xlab("County") + ylab("Count") +
  theme(legend.position="none")
At the county level, New York City and Suffolk experienced the highest number of cases. Not surprisingly, New York State’s confirmed-case count stays at the top of the list, with a huge gap over the rest.
imputed_Data %>% filter(state!="Unknown") %>% group_by(state) %>%
  summarise(TOTALSTATE=sum(deaths)) %>%
  arrange(desc(TOTALSTATE)) %>% head(10) %>%
  ggplot(aes(x=reorder(state,TOTALSTATE),y=TOTALSTATE, fill=factor(state))) +
  geom_bar(stat='identity') + coord_flip() +
  theme_light() +
  ggtitle("Death Cases by State") + xlab("State") + ylab("Count") +
  theme(legend.position="none")
imputed_Data %>% filter(county_name!="Unknown") %>% group_by(county_name) %>%
  summarise(TOTAL=sum(deaths)) %>%
  arrange(desc(TOTAL)) %>% head(10) %>%
  ggplot(aes(x=reorder(county_name,TOTAL),y=TOTAL, fill=factor(county_name))) +
  geom_bar(stat='identity') + coord_flip() +
  theme_light() +
  ggtitle("Death Cases by County") + xlab("County") + ylab("Count") +
  theme(legend.position="none")
Summary:
From the above graphs, the death counts in New York and New Jersey are the highest, while Michigan ranks third, rather than California, which is third in confirmed cases.
It is worth mentioning that although Wayne and Nassau are not in the top 5 for confirmed cases, their death counts rank in the top 3.
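To quantify this, a crude deaths-per-case ratio can be computed by county (a supplementary check added here; note that grouping by county_name aggregates same-named counties across states, matching the charts above):
#Crude deaths-per-case ratio for the counties with the most deaths
imputed_Data %>%
  group_by(county_name) %>%
  summarise(cases = sum(confirmed), deaths = sum(deaths)) %>%
  mutate(deaths_per_case = round(deaths / cases, 3)) %>%
  arrange(desc(deaths)) %>%
  head(5)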
state_map <- us_map(regions = "states")
county_map <- us_map(regions = "counties")
#Libraries for state heat map
suppressPackageStartupMessages(library(usmap))
#Rename
colnames(statepop)[colnames(statepop)=="full"] <- "state"
#Summarize data for map
#Cases
imputed_Data %>% select(state, deaths, confirmed) %>% group_by(state) %>%
  summarize(TOTALCASES=sum(confirmed)) -> CASES
#Deaths
imputed_Data %>% select(state, deaths, confirmed) %>% group_by(state) %>%
  summarize(TOTALDEATHS=sum(deaths)) -> DEATHS
#Convert summarized data to data frame
CASES<- data.frame(CASES)
DEATHS<-data.frame(DEATHS)
#Merge data
ALLCASES<-left_join(CASES, statepop, by="state")
ALLDEATHS<-left_join(DEATHS, statepop, by="state")
#Plot maps
plot_usmap(data=ALLCASES, values="TOTALCASES", regions = "state") +
  scale_fill_viridis(discrete=FALSE) +
  theme(legend.position = "right") + labs(fill="TOTALCASES") +
  ggtitle("Total Confirmed Cases by State")
plot_usmap(data=ALLDEATHS, values="TOTALDEATHS", regions = "state") +
  scale_fill_viridis(discrete=FALSE) +
  theme(legend.position = "right") + labs(fill="TOTALDEATHS") +
  ggtitle("Total Confirmed Deaths by State")
Interpretation:
The bar and map charts show that the numbers of confirmed cases and deaths differ significantly across states and counties. We therefore decided to explore the differences in confirmed cases between large cities and to study the geographical distribution further. Both EDA visualization methods clearly demonstrate the comparison between states or counties in confirmed cases or deaths: the maps directly show the density of cases, while the bar charts help the audience evaluate the differences. On the surface, the bar charts suggest that the more confirmed cases, the more deaths. To analyze the relationship between confirmed cases and deaths more deeply, we plan to build a linear regression model.
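As a quick sanity check before fitting that model (an added snippet, using the imputed county-level data), the Pearson correlation between confirmed cases and deaths can be computed directly:
#Correlation between county-level confirmed cases and deaths
cor(imputed_Data$confirmed, imputed_Data$deaths)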
Loading the timeseries dataset
timeSeriesData <- read.csv("county_timeseries.csv") #read from the working directory rather than a hard-coded absolute path
head(timeSeriesData)
## uid location_type fips_code location_name state date
## 1 84001001 county 1001 Autauga Alabama 1/22/2020
## 2 84001001 county 1001 Autauga Alabama 1/23/2020
## 3 84001001 county 1001 Autauga Alabama 1/24/2020
## 4 84001001 county 1001 Autauga Alabama 1/25/2020
## 5 84001001 county 1001 Autauga Alabama 1/26/2020
## 6 84001001 county 1001 Autauga Alabama 1/27/2020
## total_population cumulative_cases cumulative_cases_per_100_000
## 1 55200 0 0
## 2 55200 0 0
## 3 55200 0 0
## 4 55200 0 0
## 5 55200 0 0
## 6 55200 0 0
## cumulative_deaths cumulative_deaths_per_100_000 new_cases new_deaths
## 1 0 0 NA NA
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## new_cases_per_100_000 new_deaths_per_100_000
## 1 NA NA
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
tail(timeSeriesData)
## uid location_type fips_code location_name state
## 338203 84099999 special_exception 99999 Grand Princess
## 338204 84099999 special_exception 99999 Grand Princess
## 338205 84099999 special_exception 99999 Grand Princess
## 338206 84099999 special_exception 99999 Grand Princess
## 338207 84099999 special_exception 99999 Grand Princess
## 338208 84099999 special_exception 99999 Grand Princess
## date total_population cumulative_cases cumulative_cases_per_100_000
## 338203 4/29/2020 NA 103 NA
## 338204 4/30/2020 NA 103 NA
## 338205 5/1/2020 NA 103 NA
## 338206 5/2/2020 NA 103 NA
## 338207 5/3/2020 NA 103 NA
## 338208 5/4/2020 NA 103 NA
## cumulative_deaths cumulative_deaths_per_100_000 new_cases new_deaths
## 338203 3 NA 0 0
## 338204 3 NA 0 0
## 338205 3 NA 0 0
## 338206 3 NA 0 0
## 338207 3 NA 0 0
## 338208 3 NA 0 0
## new_cases_per_100_000 new_deaths_per_100_000
## 338203 NA NA
## 338204 NA NA
## 338205 NA NA
## 338206 NA NA
## 338207 NA NA
## 338208 NA NA
timeSeriesData$date <- as.Date(timeSeriesData$date, format="%m/%d/%Y") #set date format; the file uses 4-digit years, so %Y (not %y)
#Total cases and deaths, aggregated to daily national totals before plotting
timeSeriesData %>%
  group_by(date) %>%
  summarise(cumulative_cases = sum(cumulative_cases),
            cumulative_deaths = sum(cumulative_deaths)) %>%
  ggplot(aes(date)) +
  geom_line(aes(y=cumulative_cases), colour="cyan3") +
  geom_line(aes(y=cumulative_deaths), colour="darkred") +
  theme_light() +
  ggtitle(label="Total Cases and Deaths Over Time", subtitle="blue=cases, red=deaths") +
  ylab("Count") + xlab("Date") +
  theme(axis.text.x=element_text(angle=40))
Summary: It is evident from the above graph that as the number of confirmed cases increases exponentially, the number of deaths also increases.
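One way to check the claimed exponential growth (a supplementary sketch, not part of the original analysis) is to replot the national totals on a log10 scale, where exponential growth appears approximately linear:
#On a log scale, exponential growth is a straight line; drop zero-count dates first
timeSeriesData %>%
  group_by(date) %>%
  summarise(cases = sum(cumulative_cases)) %>%
  filter(cases > 0) %>%
  ggplot(aes(date, cases)) +
  geom_line(colour = "cyan3") +
  scale_y_log10() +
  theme_light() +
  ggtitle("National Cumulative Cases (log10 scale)")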
Note:
Since the dataset contains only zero counts until March 22, the graph shows no visible values before that date.
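We can verify the first date with a nonzero national total directly (a small verification snippet added here):
#Find the first date on which the national cumulative case count is positive
timeSeriesData %>%
  group_by(date) %>%
  summarise(total_cases = sum(cumulative_cases)) %>%
  filter(total_cases > 0) %>%
  slice(1)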
To explore the relationship between total cases and deaths in the nation, we decided to use a linear regression model to capture the trend so that we can predict upcoming conditions. A linear model determines the value of one dependent variable based on the value of one given independent variable.
Currently, the number of confirmed cases is constantly increasing. Although the rate is not fixed, we can still use a linear regression model to make a short-term prediction of the number of deaths.
#Extract the two columns of interest
dat <- select(timeSeriesData, cumulative_cases, cumulative_deaths)
colnames(dat) <- c("totalConfirmed", "totalDeaths")
summary(dat)
## totalConfirmed totalDeaths
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.000
## Mean : 74.25 Mean : 3.867
## 3rd Qu.: 3.00 3rd Qu.: 0.000
## Max. :175651.00 Max. :19057.000
#build regression model using lm()
dat_lm1 <- lm(totalDeaths~totalConfirmed, data=dat)
options(digits=2)
summary(dat_lm1)
##
## Call:
## lm(formula = totalDeaths ~ totalConfirmed, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2241.4 2.8 3.0 3.0 2820.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.00581 0.07317 -41.1 <2e-16 ***
## totalConfirmed 0.09256 0.00005 1850.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42 on 338206 degrees of freedom
## Multiple R-squared: 0.91, Adjusted R-squared: 0.91
## F-statistic: 3.43e+06 on 1 and 338206 DF, p-value: <2e-16
#plot the model
ggplot(dat_lm1, aes(x = totalConfirmed, y = totalDeaths)) +
  geom_point(col = "red") +
  stat_smooth(method = "lm", col = "cornflowerblue")
## `geom_smooth()` using formula 'y ~ x'
Interpretation:
Since the p-values are smaller than 0.05, the linear regression shows a positive relationship between the total number of confirmed cases and deaths from February through May nationwide. For every 100 additional confirmed cases, deaths increase by roughly 9.
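To see the slope in action, we can predict at two hypothetical case counts 100 apart (an illustrative use of the fitted model; the inputs are arbitrary):
#Predicted deaths at two hypothetical case counts differing by 100
predict(dat_lm1, newdata = data.frame(totalConfirmed = c(1000, 1100)))
#The difference between the two predictions is 100 * 0.09256, i.e. about 9.3 deaths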
Time series analysis is a statistical technique for analyzing time series data, or trends, in order to extract meaningful statistics and other characteristics from the data. We perform time series forecasting to predict future values based on previously observed values.
Time series data is a set of observations on the values that a variable takes at different times (in our dataset, the date column represents time).
We will be using Facebook’s open source “Prophet model” to forecast total confirmed cases and deaths for future dates.
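As a preview of that workflow, here is a minimal sketch of fitting Prophet to national daily totals (the aggregation and the 30-day horizon are illustrative assumptions, not the final modeling code; Prophet expects a data frame with columns ‘ds’ and ‘y’):
#Prophet requires columns 'ds' (date) and 'y' (series to forecast)
national <- timeSeriesData %>%
  group_by(date) %>%
  summarise(y = sum(cumulative_cases)) %>%
  rename(ds = date)
m <- prophet(national)                            #fit the model
future <- make_future_dataframe(m, periods = 30)  #extend the dates 30 days ahead
forecast <- predict(m, future)                    #forecast cumulative cases
plot(m, forecast)                                 #plot the fit and the forecast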
Since we do not need all 15 variables, we keep only the columns required for our analysis by subsetting with a vector of column names (‘keep’):
keep <- c("uid", "location_name", "state", "date", "cumulative_cases", "cumulative_deaths")
df <- timeSeriesData[keep]
head(df)
## uid location_name state date cumulative_cases cumulative_deaths
## 1 84001001 Autauga Alabama 2020-01-22 0 0
## 2 84001001 Autauga Alabama 2020-01-23 0 0
## 3 84001001 Autauga Alabama 2020-01-24 0 0
## 4 84001001 Autauga Alabama 2020-01-25 0 0
## 5 84001001 Autauga Alabama 2020-01-26 0 0
## 6 84001001 Autauga Alabama 2020-01-27 0 0
tail(df)
## uid location_name state date cumulative_cases
## 338203 84099999 Grand Princess 2020-04-29 103
## 338204 84099999 Grand Princess 2020-04-30 103
## 338205 84099999 Grand Princess 2020-05-01 103
## 338206 84099999 Grand Princess 2020-05-02 103
## 338207 84099999 Grand Princess 2020-05-03 103
## 338208 84099999 Grand Princess 2020-05-04 103
## cumulative_deaths
## 338203 3
## 338204 3
## 338205 3
## 338206 3
## 338207 3
## 338208 3
colSums(is.na(df))
## uid location_name state date
## 0 0 0 0
## cumulative_cases cumulative_deaths
## 0 0
#changing names of the columns
names(df)[names(df) == "location_name"] <- "county"
names(df)[names(df) == "cumulative_cases"] <- "confirmedcases"
names(df)[names(df) == "cumulative_deaths"] <- "deaths"
colnames(df)
## [1] "uid" "county" "state" "date"
## [5] "confirmedcases" "deaths"
str(df)
## 'data.frame': 338208 obs. of 6 variables:
## $ uid : int 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 ...
## $ county : Factor w/ 1900 levels "","Abbeville",..: 83 83 83 83 83 83 83 83 83 83 ...
## $ state : Factor w/ 53 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date : Date, format: "2020-01-22" "2020-01-23" ...
## $ confirmedcases: int 0 0 0 0 0 0 0 0 0 0 ...
## $ deaths : int 0 0 0 0 0 0 0 0 0 0 ...
#Converting factor datatypes to character
df[["state"]] <- as.character(df[["state"]] )
df[["county"]] <- as.character(df[["county"]] )
df$date <- as.Date(df$date, format = "%m/%d/%Y")
str(df)
## 'data.frame': 338208 obs. of 6 variables:
## $ uid : int 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 84001001 ...
## $ county : chr "Autauga" "Autauga" "Autauga" "Autauga" ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ date : Date, format: "2020-01-22" "2020-01-23" ...
## $ confirmedcases: int 0 0 0 0 0 0 0 0 0 0 ...
## $ deaths : int 0 0 0 0 0 0 0 0 0 0 ...
head(df)
## uid county state date confirmedcases deaths
## 1 84001001 Autauga Alabama 2020-01-22 0 0
## 2 84001001 Autauga Alabama 2020-01-23 0 0
## 3 84001001 Autauga Alabama 2020-01-24 0 0
## 4 84001001 Autauga Alabama 2020-01-25 0 0
## 5 84001001 Autauga Alabama 2020-01-26 0 0
## 6 84001001 Autauga Alabama 2020-01-27 0 0