COVID-19 Case/Death Count Predictions

NOTE: Unless otherwise stated, all code can be found at the end of this post (search “R Code”).

Introduction
Data Exploration, Literature Review, Assumptions
1. Confirmed Cases
2. Fatalities
Model Construction
1. ETS Model Exploration
  1. Confirmed Cases
  2. Fatalities
2. Random Walk Model Exploration
  1. Confirmed Cases
  2. Fatalities
Conclusions and Further Discussion
References
R Code

Introduction

COVID-19 case and fatality data was retrieved for the time period from 1/23/2020 through 6/10/2020 for the purposes of forecasting future levels of confirmed cases and resulting fatalities at the level of Country, Province/State (where applicable), and County (where applicable). At its most granular, the number of geographical regions for which a forecast could be produced is 3,463; this number reduces to 320 if viewed only at the levels of Country and Province/State.

A model for this purpose was especially important from the time period of 1/2020 through 6/2020 due to the risks associated with the virus’ newness and the desire to contain it with limited resources. With an ability to forecast confirmed cases and fatalities at the county-level (where applicable), officials would be better equipped to deploy scarce resources most effectively, and determine defensible public health measures to employ, such as mask wearing. In addition to responding to the virus’ onset, officials would also be able to generate new forecasts to determine the efficacy of enacted public health measures and resource deployment.

Despite the value that a county-level model would provide, many safety measures were enacted no lower than the Province/State level in the early days of the pandemic suggesting that spread of the virus was mostly caused by inner-Province/State travel/interaction. In this way, the data was summarized to 320 regions by Country and Province/State. Additionally, conducting analysis in this way allows for easier integration of other data useful in modeling case/fatality trajectory.

Data Exploration, Literature Review, Assumptions

To Table of Contents

The data did not contain any missing values between the period of 1/23/2020 and 6/10/2020; plots of confirmed cases and COVID-19 fatalities are below, aggregated to the global level. An early assumption of modeling was that the plot shapes seen below would be similar across essentially all regions in the data, although the scale of case counts and fatalities would undoubtedly be different. One cause of any difference might have been due to state-specific public health measures (e.g., mask wearing, social distancing, etc.) which was also retrieved for use in modeling.

As the below infographic from the US Center for Disease Control indicates, case transmissibility does not appear to be impacted by age; fatality, on the other hand does. Additionally, it can be argued that the availability of medicine and the quality of available medicine are positively correlated with the average income of a region and have a significant impact on the fatality rate. Although a variety of data exists across these items, it was assumed that the trend of COVID case counts and deaths would inherently capture these components in a region. Other demographics, such as political associations, were considered as there were differing opinions about the efficacy of different treatments and the willingness to cooperate with mandated public health measures. However, given the short time frame of the data, such items were once again considered to be captured in each region’s confirmed case and fatality trend.

COVID-19 - Transmissibility, Hospitalization, and Death relative to 18-29 Age Group

To summarize, all models, including fatalities were built on the following key assumptions:

The confirmed case and fatality data capture certain nuances by region, such as age exposure, political association, healthcare quality, etc.
In addition to analyzing existing trends/seasonality in confirmed case data, it has been assumed that confirmed case counts are likely impacted by public policy measures, such as masking and social distancing, enacted to prevent case counts from rising.
Fatalities, although likely driven in part by age and healthcare quality (see item 1), are considered most impacted by the presence of confirmed cases. The downward trend that is seemingly present in the data likely represents a higher prevalence of infections of younger people translating to less serious illness in addition to a learning curve in proper treatment by hospital staffs across the world. The data is assumed to capture these nuances along with the impacts from age and healthcare quality.

Confirmed Case Data Exploration

To Table of Contents

Pure time series data was joined to available data from the COVID-19 Data Hub (see References), which is a historical collection of government response actions during the pandemic. From numbers of tests administered, available treatment space, and details about the different public health measures enacted globally, the Data Hub provides a variety of useful cross-sectional data points as it relates to forecasting COVID-19 confirmed case counts, especially. Ultimately, four indexed statistics were retrieved from Data Hub and joined to the Kaggle COVID-19 datasets, and are summarized below with definitions taken directly from the Data Hub’s website:

Government Response Index: Records how the response of governments has varied, becoming stronger or weaker over the course of the outbreak;
Stringency Index: Records the strictness of ‘lockdown style’ policies that primarily restrict people’s behavior;
Containment Health Index: Combines ‘lockdown’ restrictions and closures with measures such as testing policy and contact tracing, short term investment in healthcare, as well as investments in vaccines; and
Economic Support Index: Records measures such as income support and debt relief.

Data Hub data was retrieved at the country level despite being available at the granularity of the county. In the absence of a consistent numeric key, it was expected that naming convention would differ between the Kaggle and Data Hub datasets suggesting that higher levels of granularity would add layers of complication when merging the data. Initial attempts of merging the data at the country level highlighted some of these complications with identified differences in naming convention and the start date of certain data points. For example, the table below summarizes countries which match in name between the Kaggle and Data Hub data, but do not line up perfectly in terms of the availability of data on certain days (e.g., Australia does not appear in Data Hub data until 1/25/2020 vs. 1/23/2020 in the Kaggle dataset).

Country_Region	Govt_Response	Stringency	Containment_Health	Economic_Support
Antigua and Barbuda	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Armenia	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Australia	2020-01-25	2020-01-25	2020-01-25	2020-01-25
Austria	2020-02-26	2020-02-26	2020-02-26	2020-02-26
Brazil	2020-02-25	2020-02-25	2020-02-25	2020-02-25
Chile	2020-03-03	2020-03-03	2020-03-03	2020-03-03
Colombia	2020-03-05	2020-03-05	2020-03-05	2020-03-05
Comoros	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Croatia	2020-02-25	2020-02-25	2020-02-25	2020-02-25
Diamond Princess	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Equatorial Guinea	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Germany	2020-01-02	2020-01-02	2020-01-02	2020-01-02
Grenada	2022-01-01	2022-01-01	2022-01-01	2022-01-01
Guinea-Bissau	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Holy See	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Italy	2020-02-24	2020-02-24	2020-02-24	2020-02-24
Liechtenstein	2020-02-24	2020-02-24	2020-02-24	2020-02-24
Lithuania	2020-03-20	2020-03-20	2020-03-20	2020-03-20
Maldives	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Montenegro	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Netherlands	2020-02-27	2020-02-27	2020-02-27	2020-02-27
North Macedonia	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Peru	2020-03-06	2020-03-06	2020-03-06	2020-03-06
Portugal	2020-02-26	2020-02-26	2020-02-26	2020-02-26
Saint Kitts and Nevis	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Saint Lucia	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Saint Vincent and the Grenadines	9999-12-31	9999-12-31	9999-12-31	9999-12-31
Sao Tome and Principe	9999-12-31	9999-12-31	9999-12-31	9999-12-31
South Africa	2020-02-07	2020-02-07	2020-02-07	2020-02-07
Sweden	2020-02-01	2020-02-01	2020-02-01	2020-02-01
United Kingdom	2020-01-30	2020-01-30	2020-01-30	2020-01-30

Upon further research, it was clear that some Data Hub countries did not contain any data across any of the four indices; some contained data, but it did not appear in the Data Hub dataset until after 1/23/2020. In this way, null values were filled with zeroes up until the point at which data was available. Although it’s likely that some countries with no measures in place by Data Hub’s standards were requiring public safety measures, the lack of data implies limited communication channels suggesting potentially low levels of compliance/awareness among these countries’ citizens.

The below table also highlights some of the difficulties in the naming conventions across both datasets. Overall, the changes made to country names were as follows:

CHANGES TO THE DATA HUB DATA

“Cape Verde” to “Cabo Verde”
“Congo” to “Congo (Brazzaville)”
“Congo, the Democratic Republic of the” to “Congo (Kinshasa)”
“Czech Republic” to “Czechia”
“Taiwan” to "Taiwan*"
“United States” to “US”
“Palestine” to “West Bank and Gaza”

CHANGES TO THE KAGGLE DATA

“Western Sahara” to “Morocco”

COUNTRIES NOT REPRESENTED IN DATA HUB Zeroes were recorded across all index values for these countries

Burma
Eswatini
MS Zaandam (cruise ship)

Country_Region	Spelling
Burma	1
Cabo Verde	1
Congo (Brazzaville)	1
Congo (Kinshasa)	1
Czechia	1
Eswatini	1
MS Zaandam	1
Taiwan*	1
US	1
West Bank and Gaza	1
Western Sahara	1

Literature throughout the pandemic has been focused on the lagged effects of certain policy measures on case counts, as well as the lagged effects of case counts on the number of hospitilizations and deaths. For this research, lagged effects of data external to the time series (e.g., case counts are external to the fatalities time series) were considered in weekly periods up to 4 weeks. If data was not available as far back as 4 weeks, the null values were replaced with zeroes indicating the absence of public policy measures.

Data exploration was largely conducted on a globally aggregated dataset of case counts and fatalities. Data Hub metrics were calculated as a weighted average based on the proportion of country’s population to the global population. It was thought that restrictions on high population versus low population countries could theoretically have different global impacts. This approach of course ignores the impacts of population densities and assumes similarities internationally.

As earlier plots indicated, case counts and deaths began to display signs of more consistent trend and seasonality at about 4/1/2020. Although the earlier period from 1/23/2020 through 3/31/2020 was important, the goal of current modeling is to forecast new case counts and fatalities from a base assumption that the expectations surrounding the virus will not significantly change going forward (e.g., the virus’ presence is no longer a surprise). In other words, although the earlier period signals that any new variant or new virus onset is likely to lead to exponentially increasing case counts and deaths from baseline levels, as of April 2020 it was expected that the response to virus would continue to strike a more serious tone than earlier months of virus onset. The plot below demonstrates the improvement in trend and seasonality when dates beyond 4/2020 are used as opposed to the complete time period.

Histograms of case count and public policy indices are below. Overall, data since 4/1/2020 display mixed results as it relates to the strength of normality. The case count data (top left) shows strong normality; however, many of the other government related metrics are skewed to the left. This result is not necessarily surprising as many large governments enacted extreme public safety measures; it was unclear the effects such measures had on cases, as early Pearson correlation coefficients (top row of correlation matrix) indicate. With difficulties identifying suitable transformations, often weak Pearson’s correlation and/or unexpected positive correlation, and a time series that displays more consistent trend and seasonality, it was determined that use of external data points, at least those studied, would be superfluous when trying to forecast future confirmed case counts.

Histogram Plots

Correlation Matrix

Fatalities Data Exploration

To Table of Contents

Exploration of fatalities is below. Like case counts, fatality data exhibits more consistent trend and seasonality beginning in 4/1/2020. Therefore, similarly to case count data, the decision was made to forecast fatalities using 4/1/2020 through 6/10/2020 as a model training period. Although a new virus or variant would likely cause an exponential jump in fatality count, the decision to model from 4/1/2020 operates on a base assumption that the virus is not expected to change drastically in the near term, and a recognition that predicting a new variant of the virus would be out of scope for this work. The plots below show a comparison between the full training dataset (right plot), and the data from 4/1/2020 (left plot). Data from 4/1/2020 onward indicates a slightly downward trend in fatalities despite an upward trend in case counts. It is likely that the virus’ effects on age are on display here; as a higher percentage of younger people were infected, the percent of cases that resulted in fatalities began to decrease. In this way, case data by age group would likely exhibit stronger correlative properties than case counts alone, as is elaborated on below.

The presence of a confirmed case is assumed to be a primary driver behind the presence of a fatality. This relationship was examined on a lagged basis from 0 days to 28 days. Histograms of fatality counts and case counts by lag are below. Generally, the data shows some normality indicating appropriateness of further examination for cross-sectional modeling.

The correlation matrix below offers insight into potential correlation between fatality count and case counts from 8 to 21 days prior. Research done by the National Institute of Health (Jin, 2021), suggested that COVID deaths lagged cases by an average of 8 days. As the plot shows, although some correlation exists, it is in a direction opposite to what would be expected. For example, confirmed cases from 9 days prior show a degree of negative correlation with fatalities which, intuitively, does not make sense. Therefore, although a death cannot exist without a case, it was decided to exclude case counts as an external predictor from modeling.

Model Construction

To Table of Contents

Forecasts for case counts and fatalities were made from pure time series models. Given the seemingly low level of volatility in the data from 4/1/2020 onward, ETS models were constructed from that time forward. Of note, it remains understood that the introduction of a new variant would likely cause any chosen model with an ETS construct to be highly inaccurate; however, these models have been built with the base assumption that, by 4/1/2020, governments and their citizens were taking the virus’ presence more seriously suggesting that trends seen from 4/1/2020 onward were likely to continue in the near term.

The intent of the models is to be region specific, forecasting future case counts and fatalities at the granularity of the state level (where applicable). Therefore, while global aggregation indicates more consistent trend and seasonality for case counts and fatalities, it is highly unlikely that each state has a confirmed case and/or fatality pattern that matches identically to the global aggregation. At 4/1/2020, the virus remained relatively new and had not struck some parts of the world as strongly as others. When limitations in COVID-19 testing and treatment are factored into consideration, the latter statement strengthens.

As a final note, it was not until model development that it was realized the test dataset covers the time period from 4/27/2020 through 6/10/2020, a time period also included in the training data. Therefore, in order to test the models against unknown data, ETS model development was only done on the period from 4/1/2020 through 4/26/2020; random-walk models were constructed with data from 1/23/2020 through 4/26/2020. It is understood that a 4/1/2020 training data start does not necessarily provide an ideal range of time for modeling; however, the presence of an exponentially increasing case count in the earliest periods of the pandemic justify to an extent the identification of a separating point in the data for ETS model construction.

ETS Model Exploration and Testing

To Table of Contents

ETS models were built at the global level as well as at the state level. The intent of this action was to enrich later discussion surrounding potential model limitations and findings.

Confirmed Cases

To Table of Contents

Global Data

Three ETS models were developed to forecast global confirmed cases; as a reminder, these models were trained on data from 4/1/2020 through 4/26/2020. In summary, the models tested include:

ETS(M,N,M) - This model suggests minimal to no data trend and a seasonality increasing in variability.
ETS(M,Ad,M) - This model was built as a variation of the ETS(M,N,M) model. Multiplicative seasonality was maintained, and a damped additive trend was included.
ETS(A,Ad,A) - Finally, this model was built to test against the multiplicative model with damped additive trend.

A plot of the models against actual values from 4/27/2020 through 6/10/2020 is below along with a summary of model performance on training data and test data.

.model	.type	RMSE	MAE	MPE	MAPE
ETS_MNM	Test	14882.514	10736.669	1.8900053	7.166570
ETS_MAdM	Test	12337.069	9503.132	-1.6351922	6.567055
ETS_AAdA	Test	14367.443	10610.904	0.7310527	7.156847
ETS_MNM	Training	7396.932	6196.212	-0.4193950	4.352037
ETS_MAdM	Training	7493.605	6395.282	-1.1288024	4.550829
ETS_AAdA	Training	7250.246	5995.457	0.4192436	4.178627

Target	.model	sigma2	log_lik	AIC	AICc	BIC	MSE	AMSE	MAE
ConfirmedCases	ETS_MNM	4.124600e-03	-273.8910	567.7819	582.4486	580.3629	54714609	50763958	0.0435129
ConfirmedCases	ETS_AAdA	9.762269e+07	-273.4638	572.9276	603.2610	589.2829	52566066	52979526	5995.4574007
ConfirmedCases	ETS_MAdM	5.172900e-03	-274.4932	574.9864	605.3197	591.3416	56154117	57529991	0.0449001

Based on the visuals above, with similar \(AIC_c\) values across the three models, the ETS(M,Ad,M) model performed best on the test set with an RMSE of \(12,337.069\). As a final step, the residuals across the three models were compared; residual plots are below and are generally representative of white noise across all three models.

State-level Data

Initially, it was hoped that state-level data could be modeled using the same three ETS constructs used for modeling the aggregated case data; with thrown errors for an ETS(M,Ad,M) model, only an auto-generated and ETS(A,Ad,A) model were constructed. Given the granularity of the data, assessing model performance was also not as straight-forward as for the aggregated data. Therefore, forecast performance was weighted by the proportion of total confirmed cases tallied in a particular region relative to toal confirmed cases globally. Although not perfect, this simplified the model comparisons. Of additional importance, residual plots were not thoroughly examined in this round of ETS modeling as they would have to be examined for each of the 320 regions modeled.

RMSE (Test) - ETSauto	RMSE (Train) - ETSauto	RMSE (Test) - ETS(A,Ad,A)	RMSE (Train) - ETS(A,Ad,A)
4147.301	1655.087	3673.851	1274.45

Overall, the ETS(A,Ad,A) model performed best across the entire dataset based on the results in the above table. This result is not largely surprising; states with matured case counts likely had trend and seasonality similar to that of the globally aggregated data.

Fatalities

To Table of Contents

Global Data

The forecasting of COVID-19 fatalities was approached in a way very similar to case data forecasting; three ETS models were produced:

ETS(M,N,M);
ETS(M,Ad,M); and
ETS(A,Ad,A)

The results of the models’ forecasts and predictability are shown in the plot and tables below. Looking at the plot, the prediction intervals for some forecasts go into the negative thousands; once realized, it was determined that an ETS model for fatality data at the state level would likely lead to complex interpretation.

.model	.type	RMSE	MAE	MPE	MAPE
ETS_MNM	Test	1683.832	1363.203	-15.254272	23.13700
ETS_MAdM	Test	2391.208	2079.548	-31.848103	35.56033
ETS_AAdA	Test	2985.747	2618.446	-40.529320	42.62803
ETS_MNM	Training	1664.187	1145.257	-2.927777	11.95205
ETS_MAdM	Training	1474.562	1028.742	-2.738848	10.31276
ETS_AAdA	Training	1713.981	1446.796	-1.708166	14.86066

State-level Data

State-level fatality forecasts were also produced using ETS modeling mirroring the approach for state-level confirmed case forecasting. The table below gives the weighted-average RMSE results across testing and training data for the two models built. Overall interpretation of the results is somewhat difficult; however, it is clear that the auto-generated model produced more favorable results.

RMSE (Test) - ETSauto	RMSE (Train) - ETSauto	RMSE (Test) - ETS(A,Ad,A)	RMSE (Train) - ETS(A,Ad,A)
441.4164	303.8716	467.6366	235.5334

Random-Walk Models (with and without Drift)

To Table of Contents

Below are two plots showing the path of confirmed cases and fatalities from 1/23/2020 through 4/26/2020. As is apparent in the data, and as has been previously discussed, both plots display an initial exponential increase in counts, followed by a period of more consistent trend and seasonality. ETS and model construction would likely struggle with the initial period of exponential growth. Although ARIMA modeling would likely do better, making the stationary is somewhat challenging. Further, transformation of the data is largely an inconclusive exercise.

Given these observations, it is likely that a random-walk model will provide reasonable and easily interpretable forecasts, especially over short periods such as 45 days. Further, such a model will be better at responding to changes in confirmed case and fatality trajectory, for example at at the onset of a new variant.

Confirmed Cases

To Table of Contents

Global Data

Results of random-walk models built from globally aggregated data are below. Overall, the models performed reasonably, although not as well as the ETS models built from later data (e.g., 4/1/2020 through 4/26/2020). Nevertheless, modeling with a random-walk might be a better hedge against potential changes in virus structure/transmission that result in a renewed period of exponentially increasing cases.

.model	.type	RMSE	MAE	MPE	MAPE
RW_drift	Test	22798.44	17844.53	8.616572	11.529119
RW_nodrift	Test	22798.44	17844.53	8.616572	11.529119
RW_drift	Training	14549.75	11912.88	-0.385294	8.463919
RW_nodrift	Training	14549.75	11912.88	-0.385294	8.463919

Residual plots are below and offer mixed results. Overall, the low residual values early in the data are likely most attributable to limited testing availability. Variability in the residuals naturally increases as the global response to COVID-19 increases. Additionally, the residuals from both models (which are actually identical) show high levels of autocorrelation suggesting errors that are not representative of white noise.

State-level Data

Once again, state-level data was examined in a way similar to global data. However, given the complexities of individually analyzing 320 different regions, RMSE values were once again examined on the training and test data as a weighted average based on the proportion of total confirmed cases by state through 4/26/2020. For example, if New York State in the US accounted for 50% of total cases through 4/26/2020, the RMSE values associated with New York State would be weighted accordingly.

A summary of RMSE values is below and displays some of the best results across all state-level modeling thusfar. RMSE values for random-walk models on the test data equated to a weighted-average of \(3,107.44\) versus RMSE \(= 4,147.3\) for the auto-generated ETS state-level model and RMSE \(= 3,673.851\) for the ETS(A,Ad,A) state-level model. This finding strengthens the arguments made for using random-walk at a more granular level. It is likely that state-level experience varied in timing (e.g., New York State was affected earlier and worse than Vermont in the United States) - random-walk models can pick up on these discrepancies by simplifying all assumptions.

RMSE (Test) - RW Drift	RMSE (Train) - RW Drift	RMSE (Test) - RW No Drift	RMSE (Train) - RW No Drift
3107.444	1220.407	3107.444	1220.407

Fatalities

To Table of Contents

State-level Data

Given the strength of the random-walk model on state-level data, it was decided that the same modeling approach would be used on fatality data. Once again, the newness of the virus and the expected lag between case counts and death data highlight the strength of a random-walk model - by simplifying the assumptions of the model to make forecasts equal to the last recorded value, random-walk models respond quickly to changes in data shape without the extra work of remodeling and re-examining model assumptions.

The weighted-average RMSE values are below for random-walk models with and without drift; ultimately these are identical in this exercise. Compared to the state-level results using ETS modeling, the random-walk models are more favorable.

RMSE (Test) - RW Drift	RMSE (Train) - RW Drift	RMSE (Test) - RW No Drift	RMSE (Train) - RW No Drift
248.2173	187.6042	248.2173	187.6042

Conclusions and Further Discussion

To Table of Contents

Random-walk modeling appears to be the most appropriate time series modeling technique to apply to COVID-19 confirmed cases and fatalities forecasting. Given the newness of the virus in the training data, and the potential for the data to have large unexpected swings, random-walk models respond most quickly to sudden changes, and are better structured for the significant differences that can be present globally as it relates to virus transmission.

Despite the models’ strengths as it relates to this work, there are limitations. For example, the models are only responsive; they are incapable of indicating when a spike in cases might occur and the scale of any increase. Additionally, the model does not appropriately capture all of the factors that result in confirmed cases and fatalities. For example, the number of tests administered undoubtedly have a large effect on the number of confirmed cases (e.g., there cannot be a confirmed case without a test). Additionally, as it relates to fatalities, the effect of age is absent in the data; having case counts by age group would likely highlight stronger correlation between case counts and fatalities.

Further work should be geared toward incorporating a wider variety of data into this analysis, and having data structure that better supports such incorporation (e.g., consistent data key by state/county). Doing so might allow for better forecasting of a significant increase in case counts.

Despite the limitations and the additional work that would enhance modeling results, the random-walk model remains a useful tool in giving a generally good idea of the virus’ containment/spread.

References

To Table of Contents

Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

Jin R. (2021). The lag between daily reported Covid-19 cases and deaths and its relationship to age. Journal of public health research, 10(3), 2049. https://doi.org/10.4081/jphr.2021.2049

Appendix: R Code

To Table of Contents

knitr::opts_chunk$set(echo = TRUE)
local({
  hook_source <- knitr::knit_hooks$get('source')
  knitr::knit_hooks$set(source = function(x, options) {
    x <- x[!grepl('# SECRET!!$', x)]
    hook_source(x, options)
  })
})

#INVOKE APPROPRIATE LIBRARIES

library("feasts")
library("seasonal")
library("tsibble")
library("tsibbledata")
library("dplyr")
library("ggplot2")
library("forecast")
library("fable")
library("fpp3")
library("sqldf")
library("psych")
library("PerformanceAnalytics")
library("car")
library("kableExtra")
library("glmnet")
library("ISLR")
library("leaps")


#CSV Path Definition
csv_path <- paste0("/Users/timothywhite/Library/CloudStorage/",
                   "OneDrive-Personal/Documents/PROFESSIONAL/GRAD SCHOOL/",
                   "BOSTON COLLEGE/SPRING 2022/",

#CSV File Import
train_data <- read.csv(paste0(csv_path, "train.csv"),
                       stringsAsFactors = TRUE)

train_data <- as.data.frame(train_data)

#Data Manipulation - blanks to "NA"; concatenation to different Geography_ID's
train_data_fin <- sqldf('

    WITH DATA AS (
    
        SELECT Id
        , CASE WHEN County = "" THEN "NA" ELSE County 
            END AS County
        , CASE WHEN Province_State = "" THEN "NA" ELSE Province_State
            END AS Province_State
        , Country_Region
        , Population
        , Weight
        , Date
        , Target
        , TargetValue
        
        FROM train_data
    
    )

    SELECT Id
    , County
    , Province_State
    , Country_Region
    , Country_Region AS Geography_ID1
    , Country_Region||"_"||Province_State AS Geography_ID2
    , Country_Region||"_"||Province_State||"_"||County AS Geography_ID3
    , Population
    , Weight
    , Date
    , Target
    , TargetValue
    
    FROM DATA
                    
')

#Geography_ID2 Reference Table
geography_ref2 <- sqldf('
                        
    SELECT DISTINCT Geography_ID2
    , Country_Region
    , Province_State
    
    FROM train_data_fin
                        
')

#Geography_ID3 Reference Table
geography_ref3 <- sqldf('
                       
    SELECT DISTINCT Geography_ID3
    , Country_Region
    , Province_State
    , County
    
    FROM train_data_fin
                       
')

#Training Data - Confirmed Cases
#Creating tsibble by Geography_ID3 and global aggregate
train_data_case <- 
  train_data_fin %>% 
  filter(Target == "ConfirmedCases")

train_data_case2 <-
  train_data_case %>%
  mutate(Date = as.Date(Date)) %>%
  select(Date,
         Country_Region,
         Province_State,
         Geography_ID2,
         Target,
         Population,
         TargetValue)

train_data_case3 <-
  train_data_case2 %>%
  group_by(Date, Country_Region, Province_State, Geography_ID2, Target) %>%
  summarize(TargetValue = sum(TargetValue),
            Population = sum(Population)) %>%
  ungroup()

train_data_case_ts <-
  train_data_case3 %>%
  mutate(TargetValue = as.double(TargetValue),
         Population = as.double(Population)) %>%
  as_tsibble(key = c(Country_Region,
                     Province_State,
                     Geography_ID2,
                     Target),
             index = Date)

train_data_case_agg <-
  train_data_case_ts %>%
  summarize(TargetT = sum(TargetValue),
            PopulationT = sum(Population))

#Training Data - Fatalities
#Creating tsibble by Geography_ID3 and global aggregate
train_data_dth <- 
  train_data_fin %>% 
  filter(Target == "Fatalities")

train_data_dth2 <-
  train_data_dth %>%
  mutate(Date = as.Date(Date)) %>%
  select(Date,
         Country_Region,
         Province_State,
         Geography_ID2,
         Target,
         Population,
         TargetValue)

train_data_dth3 <-
  train_data_dth2 %>%
  group_by(Date, Country_Region, Province_State, Geography_ID2, Target) %>%
  summarize(TargetValue = sum(TargetValue),
            Population = sum(Population)) %>%
  ungroup()

train_data_dth_ts <-
  train_data_dth3 %>%
  mutate(TargetValue = as.double(TargetValue),
         Population = as.double(Population)) %>%
  as_tsibble(key = c(Country_Region,
                     Province_State,
                     Geography_ID2,
                     Target),
             index = Date)

train_data_dth_agg <-
  train_data_dth_ts %>%
  summarize(TargetT = sum(TargetValue),
            PopulationT = sum(Population))


#Plot of case data, aggregated globally
train_data_case_agg %>%
  autoplot(TargetT) +
  labs(x = "Date",
       y = "Case Count",
       title = "Daily COVID-19 Confirmed Cases",
       subtitle = "Global Aggregation") +
  theme_classic()

#Plot of death data, aggregated globally
train_data_dth_agg %>%
  autoplot(TargetT, col = "red") +
  labs(x = "Date",
       y = "Fatality Count",
       title = "Daily COVID-19 Fatalities",
       subtitle = "Global Aggregation") +
  theme_classic()




#Pull in cross-sectional data from COVID-19 Data Hub
policy_measures <- read.csv("https://storage.covid19datahub.io/level/1.csv")

#Tidy up dataset for joining
policy_measures2 <-
  policy_measures %>%
  mutate(date = as.Date(date)) %>%
  select(date,
         administrative_area_level_1,
         government_response_index,
         stringency_index,
         containment_health_index, #tests likely correlated with index value
         economic_support_index)

#Initial join of external data to Kaggle dataset
train_data_case4 <- sqldf('
                          
    SELECT C.Date
    , C.Target
    , C.Country_Region
    , C.Province_State
    , C.Geography_ID2
    , P.government_response_index AS Govt_Response_Idx
    , P.stringency_index AS Stringency_Idx
    , P.containment_health_index AS Contain_Hlth_Idx
    , P.economic_support_index AS Economic_Support_Idx
    
    FROM train_data_case3 AS C
    
    LEFT JOIN policy_measures2 AS P
    ON P.administrative_area_level_1 = C.Country_Region
    AND P.date = C.Date
                          
')

#EXPLORATION OF NULL AND MISSING VALUES AFTER JOIN
#Country names from Kaggle Dataset
country_vector <- as.vector(unique(train_data_case4$Country_Region))

#Empty dataframe for results storing
results_df <- data.frame(matrix(ncol = 6,
                                nrow = length(country_vector)))

colnames(results_df) <- c("Country_Region", 
                          "Total_Obs", 
                          "PCT_Govt_Response_NA",
                          "PCT_Stringency_NA",
                          "PCT_Contain_Hlth_NA",
                          "PCT_Economic_Support_NA")

#Loop - Cycles through each country (Kaggle-named) individually
#Total Observations by country
#Missing government response index observations (%)
#Missing stringency index observations (%)
#Missing containment health index observations (%)
#Missing economic support index observations (%)
for(i in 1:length(country_vector)) {

  loop_tbl <-
    train_data_case4 %>%
    filter(Country_Region == country_vector[i])
  
  results_df[i,1] <- country_vector[i]
  
  Total_Obs <- length(rownames(loop_tbl))
  results_df[i,2] <- Total_Obs
  
  Govt_Response_NA <- loop_tbl$Govt_Response_Idx[is.na(loop_tbl$Govt_Response_Idx) == TRUE]
  results_df[i,3] <- length(Govt_Response_NA)/Total_Obs
  
  Stringency_NA <- loop_tbl$Stringency_Idx[is.na(loop_tbl$Stringency_Idx) == TRUE]
  results_df[i,4] <- length(Stringency_NA)/Total_Obs
  
  Contain_Hlth_NA <- loop_tbl$Contain_Hlth_Idx[is.na(loop_tbl$Contain_Hlth_Idx) == TRUE]
  results_df[i,5] <- length(Contain_Hlth_NA)/Total_Obs
  
  Economic_Support_NA <- loop_tbl$Economic_Support_Idx[is.na(loop_tbl$Economic_Support_Idx) == TRUE]
  results_df[i,6] <- length(Economic_Support_NA)/Total_Obs
  
}

#Pull only the rows where there is missing data
results_df_rsch <-
  results_df %>%
  filter(PCT_Govt_Response_NA > 0)

#Vector of countries with missing data (Kaggle-named)
missing_countries <- as.vector(unique(results_df_rsch$Country_Region))

#Vector of countries (Data Hub)
policy_countries <-
  sort(as.vector(unique(policy_measures2$administrative_area_level_1)))

#Empty dataframe for results storing
missing_df_rsch <- data.frame(matrix(ncol = 6,
                                     nrow = length(missing_countries)))

colnames(missing_df_rsch) <- c("Country_Region", 
                               "Govt_Response", 
                               "Stringency", 
                               "Containment_Health", 
                               "Economic_Support",
                               "Spelling")

#Loop ID'ing data as missing or ID'ing a spelling difference in country
for(i in 1:length(missing_countries)) {
  
  missing_df_rsch[i,1] <- missing_countries[i]
  
  if(missing_countries[i] %in% policy_countries) {
    
    missing_table <-
      policy_measures2 %>%
      filter(administrative_area_level_1 == missing_countries[i])
    
    missing_table_govt_rmna <-
      missing_table %>%
      filter(is.na(government_response_index) == FALSE) %>%
      select(date,
             administrative_area_level_1,
             government_response_index)
    
    if(length(rownames(missing_table_govt_rmna)) > 0) {
      
      missing_df_rsch[i,2] <- as.Date(min(missing_table_govt_rmna$date),
                                      origin = '1970-01-01')
      
    } else {
      
      missing_df_rsch[i,2] <- as.Date('9999-12-31',
                                      origin = '1970-01-01')
      
    }
    
    missing_table_stringency_rmna <-
      missing_table %>%
      filter(is.na(stringency_index) == FALSE) %>%
      select(date,
             administrative_area_level_1,
             stringency_index)
    
    if(length(rownames(missing_table_stringency_rmna)) > 0) {
      
      missing_df_rsch[i,3] <- as.Date(min(missing_table_stringency_rmna$date),
                                      origin = '1970-01-01')
      
    } else {
      
      missing_df_rsch[i,3] <- as.Date('9999-12-31',
                                      origin = '1970-01-01')
      
    }
    
    missing_table_containment_rmna <-
      missing_table %>%
      filter(is.na(containment_health_index) == FALSE) %>%
      select(date,
             administrative_area_level_1,
             containment_health_index)
    
    if(length(rownames(missing_table_containment_rmna)) > 0) {
      
      missing_df_rsch[i,4] <- as.Date(min(missing_table_containment_rmna$date),
                                      origin = '1970-01-01')
      
    } else {
      
      missing_df_rsch[i,4] <- as.Date('9999-12-31',
                                      origin = '1970-01-01')
      
    }
    
    missing_table_economic_rmna <-
      missing_table %>%
      filter(is.na(economic_support_index) == FALSE) %>%
      select(date,
             administrative_area_level_1,
             economic_support_index)
    
    if(length(rownames(missing_table_economic_rmna)) > 0) {
      
      missing_df_rsch[i,5] <- as.Date(min(missing_table_economic_rmna$date),
                                      origin = '1970-01-01')
      
    } else {
      
      missing_df_rsch[i,5] <- as.Date('9999-12-31',
                                      origin = '1970-01-01')
      
    }
    
  } else {
    
    missing_df_rsch[i,6] <- 1
    
  }
  
}

#Missing data due to date
missing_df_rsch_date <-
  missing_df_rsch %>%
  mutate(Govt_Response = as.Date(Govt_Response,
                                 origin = '1970-01-01'),
         Stringency = as.Date(Stringency,
                              origin = '1970-01-01'),
         Containment_Health = as.Date(Containment_Health,
                                      origin = '1970-01-01'),
         Economic_Support = as.Date(Economic_Support,
                                    origin = '1970-01-01')) %>%
  filter(is.na(Spelling) == TRUE) %>%
  select(Country_Region,
         Govt_Response,
         Stringency,
         Containment_Health,
         Economic_Support)

kable(missing_df_rsch_date) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")


missing_df_rsch_date2 <- sqldf("

    WITH DATA AS (
    
        SELECT M.Country_Region
        , M.Govt_Response
        , R.PCT_Govt_Response_NA
        , JULIANDAY('2020-06-10') - (JULIANDAY(DATE(M.Govt_Response))+2440588) 
            AS Govt_CoveredTime
        , M.Stringency
        , R.PCT_Stringency_NA
        , JULIANDAY('2020-06-10') - (JULIANDAY(DATE(M.Stringency))+2440588) 
            AS Stringency_CoveredTime
        , M.Containment_Health
        , R.PCT_Contain_Hlth_NA
        , JULIANDAY('2020-06-10') - (JULIANDAY(DATE(M.Containment_Health))+2440588) 
            AS Contain_CoveredTime
        , M.Economic_Support
        , R.PCT_Economic_Support_NA
        , JULIANDAY('2020-06-10') - (JULIANDAY(DATE(M.Economic_Support))+2440588) 
            AS Economic_CoveredTime
        , JULIANDAY('2020-06-10') - JULIANDAY('2020-01-23') AS Total_CoveredTime
        
        FROM missing_df_rsch_date AS M
        
        LEFT JOIN results_df_rsch AS R
        ON R.Country_Region = M.Country_Region
        
    )
    
    SELECT Country_Region
    , Govt_Response
    , PCT_Govt_Response_NA
    , CASE WHEN CAST(Govt_CoveredTime AS FLOAT) < 0 THEN CAST(0 AS FLOAT)
        ELSE CAST(Govt_CoveredTime AS FLOAT) END AS Govt_CoveredTime
    , Stringency
    , PCT_Stringency_NA
    , CASE WHEN CAST(Stringency_CoveredTime AS FLOAT) < 0 THEN CAST(0 AS FLOAT)
        ELSE CAST(Stringency_CoveredTime AS FLOAT) END AS Stringency_CoveredTime
    , Containment_Health
    , PCT_Contain_Hlth_NA
    , CASE WHEN CAST(Contain_CoveredTime AS FLOAT) < 0 THEN CAST(0 AS FLOAT)
        ELSE CAST(Contain_CoveredTime AS FLOAT) END AS Contain_CoveredTime
    , Economic_Support
    , PCT_Economic_Support_NA
    , CASE WHEN CAST(Economic_CoveredTime AS FLOAT) < 0 THEN CAST(0 AS FLOAT)
        ELSE CAST(Economic_CoveredTime AS FLOAT) END AS Economic_CoveredTime
    , CAST(Total_CoveredTime AS FLOAT) AS Total_CoveredTime
    
    FROM DATA
                               
")

missing_df_rsch_date3 <- sqldf("

    WITH DATA AS (
                               
        SELECT Country_Region
        , PCT_Govt_Response_NA
        , Govt_CoveredTime/Total_CoveredTime AS PCT_Govt_Covered
        , PCT_Stringency_NA
        , Stringency_CoveredTime/Total_CoveredTime AS PCT_Stringency_Covered
        , PCT_Contain_Hlth_NA
        , Contain_CoveredTime/Total_CoveredTime AS PCT_Contain_Hlth_Covered
        , PCT_Economic_Support_NA
        , Economic_CoveredTime/Total_CoveredTime AS PCT_Economic_Support_Covered
        
        FROM missing_df_rsch_date2
    
    )
    
    SELECT Country_Region
    , PCT_Govt_Response_NA + PCT_Govt_Covered AS PCT_Govt_Acctd
    , PCT_Stringency_NA + PCT_Stringency_Covered AS PCT_Stringency_Acctd
    , PCT_Contain_Hlth_NA + PCT_Contain_Hlth_Covered AS PCT_Contain_Hlth_Acctd
    , PCT_Economic_Support_NA + PCT_Economic_Support_Covered AS PCT_Economic_Support_Acctd
    
    FROM DATA
                               
")

missing_df_rsch_spelling <-
  missing_df_rsch %>%
  filter(is.na(Spelling) == FALSE) %>%
  select(Country_Region,
         Spelling)

kable(missing_df_rsch_spelling) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")

policy_title_old <- c("Cape Verde", 
                      "Congo", 
                      "Congo, the Democratic Republic of the",
                      "Czech Republic",
                      "Taiwan",
                      "United States",
                      "Palestine")

policy_title_new <- c("Cabo Verde", 
                      "Congo (Brazzaville)", 
                      "Congo (Kinshasa)",
                      "Czechia",
                      "Taiwan*",
                      "US",
                      "West Bank and Gaza")

#Changes to training data countries
train_data_case3b <- train_data_case3
train_data_case3b$Country_Region[train_data_case3b$Country_Region == "Western Sahara"] <- c("Morocco")

train_data_dth3b <- train_data_dth3
train_data_dth3b$Country_Region[train_data_dth3b$Country_Region == "Western Sahara"] <- c("Morocco")

policy_measures2b <- policy_measures2

for(i in 1:length(policy_title_old)) {
  
  policy_measures2b$administrative_area_level_1[
    policy_measures2b$administrative_area_level_1 == policy_title_old[i]
  ] <- policy_title_new[i]
  
}

train_data_case4b <- sqldf('
                          
    SELECT C.Date
    , C.Target
    , C.Country_Region
    , C.Province_State
    , C.Geography_ID2
    , CASE WHEN P.government_response_index IS NULL THEN 0
        ELSE P.government_response_index END AS Govt_Response_Idx
    , CASE WHEN P.stringency_index IS NULL THEN 0
        ELSE P.stringency_index END AS Stringency_Idx
    , CASE WHEN P.containment_health_index IS NULL THEN 0
        ELSE P.containment_health_index END AS Contain_Hlth_Idx
    , CASE WHEN P.economic_support_index IS NULL THEN 0
        ELSE P.economic_support_index END AS Economic_Support_Idx
    , TargetValue
    , Population
    
    FROM train_data_case3b AS C
    
    LEFT JOIN policy_measures2b AS P
    ON P.administrative_area_level_1 = C.Country_Region
    AND P.date = C.Date
                          
')


#lagging variables; creating weights for weighted averages (visuals)
train_data_case4c <- sqldf("
    
    WITH DATA AS (
        
        SELECT Date
        , Target
        , Country_Region
        , Province_State
        , Geography_ID2
        , CASE WHEN 
            MIN(Govt_Response_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Govt_Response_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING)
          END AS Govt_Response_28
        , CASE WHEN 
            MIN(Govt_Response_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Govt_Response_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING)
          END AS Govt_Response_21
        , CASE WHEN 
            MIN(Govt_Response_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Govt_Response_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING)
          END AS Govt_Response_14
        , CASE WHEN 
            MIN(Govt_Response_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Govt_Response_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING)
          END AS Govt_Response_7
        , Govt_Response_Idx AS Govt_Response_0
        , CASE WHEN 
            MIN(Stringency_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Stringency_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING)
          END AS Stringency_28
        , CASE WHEN 
            MIN(Stringency_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Stringency_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING)
          END AS Stringency_21
        , CASE WHEN 
            MIN(Stringency_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Stringency_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING)
          END AS Stringency_14
        , CASE WHEN 
            MIN(Stringency_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Stringency_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING)
          END AS Stringency_7
        , Stringency_Idx AS Stringency_0
        , CASE WHEN 
            MIN(Contain_Hlth_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Contain_Hlth_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING)
          END AS Contain_Hlth_28
        , CASE WHEN 
            MIN(Contain_Hlth_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Contain_Hlth_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING)
          END AS Contain_Hlth_21
        , CASE WHEN 
            MIN(Contain_Hlth_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Contain_Hlth_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING)
          END AS Contain_Hlth_14
        , CASE WHEN 
            MIN(Contain_Hlth_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Contain_Hlth_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING)
          END AS Contain_Hlth_7
        , Contain_Hlth_Idx AS Contain_Hlth_0
        , CASE WHEN 
            MIN(Economic_Support_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Economic_Support_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING)
          END AS Economic_Support_28
        , CASE WHEN 
            MIN(Economic_Support_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Economic_Support_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING)
          END AS Economic_Support_21
        , CASE WHEN 
            MIN(Economic_Support_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Economic_Support_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING)
          END AS Economic_Support_14
        , CASE WHEN 
            MIN(Economic_Support_Idx) 
              OVER (PARTITION BY Geography_ID2 
                    ORDER BY Date ASC 
                    ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING) IS NULL
            THEN CAST(0 AS FLOAT)
            ELSE MIN(Economic_Support_Idx) 
                  OVER (PARTITION BY Geography_ID2 
                        ORDER BY Date ASC 
                        ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING)
          END AS Economic_Support_7
        , Economic_Support_Idx AS Economic_Support_0
        , TargetValue
        , SUM(TargetValue) OVER(PARTITION BY Date) AS TargetValueT
        , CAST(Population AS FLOAT) AS Population
        , SUM(CAST(Population AS FLOAT)) OVER(PARTITION BY Date) AS PopulationT
        
        FROM train_data_case4b
    
    )
    
    SELECT C.*
    , CAST(Population AS FLOAT)/CAST(PopulationT AS FLOAT) AS CaseWeight
    
    FROM DATA AS C
                           
")

train_data_case4d <- sqldf("
                           
    WITH DATA AS (
    
        SELECT C.*
        , CaseWeight * Govt_Response_28 AS Govt28Weight
        , CaseWeight * Govt_Response_21 AS Govt21Weight
        , CaseWeight * Govt_Response_14 AS Govt14Weight
        , CaseWeight * Govt_Response_7 AS Govt7Weight
        , CaseWeight * Govt_Response_0 AS Govt0Weight
        , CaseWeight * Stringency_28 AS Stringency28Weight
        , CaseWeight * Stringency_21 AS Stringency21Weight
        , CaseWeight * Stringency_14 AS Stringency14Weight
        , CaseWeight * Stringency_7 AS Stringency7Weight
        , CaseWeight * Stringency_0 AS Stringency0Weight
        , CaseWeight * Contain_Hlth_28 AS Hlth28Weight
        , CaseWeight * Contain_Hlth_21 AS Hlth21Weight
        , CaseWeight * Contain_Hlth_14 AS Hlth14Weight
        , CaseWeight * Contain_Hlth_7 AS Hlth7Weight
        , CaseWeight * Contain_Hlth_0 AS Hlth0Weight
        , CaseWeight * Economic_Support_28 AS Economic28Weight
        , CaseWeight * Economic_Support_21 AS Economic21Weight
        , CaseWeight * Economic_Support_14 AS Economic14Weight
        , CaseWeight * Economic_Support_7 AS Economic7Weight
        , CaseWeight * Economic_Support_0 AS Economic0Weight
        
        FROM train_data_case4c AS C
    
    )
    
    SELECT C.*
    , SUM(Govt28Weight) OVER(PARTITION BY Date) AS Govt28T
    , SUM(Govt21Weight) OVER(PARTITION BY Date) AS Govt21T
    , SUM(Govt14Weight) OVER(PARTITION BY Date) AS Govt14T
    , SUM(Govt7Weight) OVER(PARTITION BY Date) AS Govt7T
    , SUM(Govt0Weight) OVER(PARTITION BY Date) AS Govt0T
    , SUM(Stringency28Weight) OVER(PARTITION BY Date) AS Stringency28T
    , SUM(Stringency21Weight) OVER(PARTITION BY Date) AS Stringency21T
    , SUM(Stringency14Weight) OVER(PARTITION BY Date) AS Stringency14T
    , SUM(Stringency7Weight) OVER(PARTITION BY Date) AS Stringency7T
    , SUM(Stringency0Weight) OVER(PARTITION BY Date) AS Stringency0T
    , SUM(Hlth28Weight) OVER(PARTITION BY Date) AS Hlth28T
    , SUM(Hlth21Weight) OVER(PARTITION BY Date) AS Hlth21T
    , SUM(Hlth14Weight) OVER(PARTITION BY Date) AS Hlth14T
    , SUM(Hlth7Weight) OVER(PARTITION BY Date) AS Hlth7T
    , SUM(Hlth0Weight) OVER(PARTITION BY Date) AS Hlth0T
    , SUM(Economic28Weight) OVER(PARTITION BY Date) AS Economic28T
    , SUM(Economic21Weight) OVER(PARTITION BY Date) AS Economic21T
    , SUM(Economic14Weight) OVER(PARTITION BY Date) AS Economic14T
    , SUM(Economic7Weight) OVER(PARTITION BY Date) AS Economic7T
    , SUM(Economic0Weight) OVER(PARTITION BY Date) AS Economic0T
    
    FROM DATA AS C
                           
")

train_data_case_agg2 <- sqldf('
                              
    SELECT Date
    , Target
    , AVG(PopulationT) AS PopulationT
    , CAST(AVG(TargetValueT) AS FLOAT) AS TargetValueT
    , AVG(Govt28T) AS Govt28T
    , AVG(Govt21T) AS Govt21T
    , AVG(Govt14T) AS Govt14T
    , AVG(Govt7T) AS Govt7T
    , AVG(Govt0T) AS Govt0T
    , AVG(Stringency28T) AS Stringency28T
    , AVG(Stringency21T) AS Stringency21T
    , AVG(Stringency14T) AS Stringency14T
    , AVG(Stringency7T) AS Stringency7T
    , AVG(Stringency0T) AS Stringency0T
    , AVG(Hlth28T) AS Hlth28T
    , AVG(Hlth21T) AS Hlth21T
    , AVG(Hlth14T) AS Hlth14T
    , AVG(Hlth7T) AS Hlth7T
    , AVG(Hlth0T) AS Hlth0T
    , AVG(Economic28T) AS Economic28T
    , AVG(Economic21T) AS Economic21T
    , AVG(Economic14T) AS Economic14T
    , AVG(Economic7T) AS Economic7T
    , AVG(Economic0T) AS Economic0T
    
    FROM train_data_case4d
    
    GROUP BY Date, Target
                              
')

train_data_case_agg2b <-
  train_data_case_agg2 %>%
  filter(Date >= '2020-04-01')


ggplot(data = train_data_case_agg2b,
       aes(x = Date,
           y = TargetValueT)) +
  labs(x = "Date",
       y = "Case Count",
       title = "Daily Confirmed COVID-19 Cases",
       subtitle = "4/1/2020 - 6/10/2020") +
  geom_line(col = "black") +
  theme_classic()

ggplot(data = train_data_case_agg2,
       aes(x = Date,
           y = TargetValueT)) +
  labs(x = "Date",
       y = "Case Count",
       title = "Daily Confirmed COVID-19 Cases",
       subtitle = "1/23/2020 - 6/10/2020") +
  geom_line(col = "black") +
  theme_classic()


#Case Count Visualizations

train_case_col <- as.vector(colnames(train_data_case_agg2b))

for(i in 4:length(train_case_col)) {
  
  hist_data <- data.frame(cbind(train_data_case_agg2b[,1],
                                train_data_case_agg2b[,i]))
  
  colnames(hist_data) <- c("Date",
                           "x")
  
  print(
    
    ggplot(data = hist_data,
           aes(x = x)) +
     geom_histogram(aes(y =..density..), 
                    col = "black", 
                    fill = "grey",
                    bins = 20) +
     geom_density(alpha = .2, fill = "#FF6666") +
     theme_classic() +
      labs(x = paste("Bin: ", train_case_col[i]),
           y = "Density",
           title = "Histogram",
           subtitle = paste("Case Data: ", train_case_col[i]))
    
  )
  
}


chart.Correlation(train_data_case_agg2b[, 4:24])


#Fatality Data

train_data_dth3b <-
  train_data_dth3 %>%
  mutate(TargetValue = as.double(TargetValue))

train_data_dth4 <- sqldf('
                         
    SELECT D.Date
    , D.Target
    , D.Country_Region
    , D.Province_State
    , D.Geography_ID2
    , C.TargetValue AS ConfirmedCases
    , D.TargetValue
    
    FROM train_data_dth3b AS D
    
    LEFT JOIN train_data_case3b AS C
    ON C.Geography_ID2 = D.Geography_ID2
    AND C.Date = D.Date
                         
')

train_data_dth4b <- sqldf('
                          
    SELECT D.Date
    , D.Target
    , D.Country_Region
    , D.Province_State
    , D.Geography_ID2
    , D.ConfirmedCases as Case0
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
      END AS Case1
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 2 PRECEDING AND 2 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 2 PRECEDING AND 2 PRECEDING)
      END AS Case2
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 3 PRECEDING AND 3 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 3 PRECEDING AND 3 PRECEDING)
      END AS Case3
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 4 PRECEDING AND 4 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 4 PRECEDING AND 4 PRECEDING)
      END AS Case4
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 5 PRECEDING AND 5 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 5 PRECEDING AND 5 PRECEDING)
      END AS Case5
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 6 PRECEDING AND 6 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 6 PRECEDING AND 6 PRECEDING)
      END AS Case6
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 7 PRECEDING AND 7 PRECEDING)
      END AS Case7
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 8 PRECEDING AND 8 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 8 PRECEDING AND 8 PRECEDING)
      END AS Case8
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 9 PRECEDING AND 9 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 9 PRECEDING AND 9 PRECEDING)
      END AS Case9
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 10 PRECEDING AND 10 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 10 PRECEDING AND 10 PRECEDING)
      END AS Case10
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 11 PRECEDING AND 11 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 11 PRECEDING AND 11 PRECEDING)
      END AS Case11
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 12 PRECEDING AND 12 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 12 PRECEDING AND 12 PRECEDING)
      END AS Case12
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 13 PRECEDING AND 13 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 13 PRECEDING AND 13 PRECEDING)
      END AS Case13
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 14 PRECEDING AND 14 PRECEDING)
      END AS Case14
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 15 PRECEDING AND 15 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 15 PRECEDING AND 15 PRECEDING)
      END AS Case15
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 16 PRECEDING AND 16 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 16 PRECEDING AND 16 PRECEDING)
      END AS Case16
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 17 PRECEDING AND 17 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 17 PRECEDING AND 17 PRECEDING)
      END AS Case17
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 18 PRECEDING AND 18 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 18 PRECEDING AND 18 PRECEDING)
      END AS Case18
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 19 PRECEDING AND 19 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 19 PRECEDING AND 19 PRECEDING)
      END AS Case19
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 20 PRECEDING AND 20 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 20 PRECEDING AND 20 PRECEDING)
      END AS Case20
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 21 PRECEDING AND 21 PRECEDING)
      END AS Case21
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 22 PRECEDING AND 22 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 22 PRECEDING AND 22 PRECEDING)
      END AS Case22
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 23 PRECEDING AND 23 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 23 PRECEDING AND 23 PRECEDING)
      END AS Case23
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 24 PRECEDING AND 24 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 24 PRECEDING AND 24 PRECEDING)
      END AS Case24
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 25 PRECEDING AND 25 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 25 PRECEDING AND 25 PRECEDING)
      END AS Case25
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 26 PRECEDING AND 26 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 26 PRECEDING AND 26 PRECEDING)
      END AS Case26
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 27 PRECEDING AND 27 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 27 PRECEDING AND 27 PRECEDING)
      END AS Case27
    , CASE WHEN 
        MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING)
          IS NULL THEN CAST(0 AS FLOAT) ELSE
         MIN(D.ConfirmedCases) OVER(PARTITION BY Geography_ID2
                                   ORDER BY Date ASC
                                   ROWS BETWEEN 28 PRECEDING AND 28 PRECEDING)
      END AS Case28
    , D.TargetValue
      
      FROM train_data_dth4 AS D
                          
')

train_data_dth4c <- sqldf("
                          
    SELECT Date
    , Target
    , Country_Region
    , Province_State
    , Geography_ID2
    , Case0
    , Case1
    , Case2
    , Case3
    , Case4
    , Case5
    , Case6
    , Case7
    , Case8
    , Case9
    , Case10
    , Case11
    , Case12
    , Case13
    , Case14
    , Case15
    , Case16
    , Case17
    , Case18
    , Case19
    , Case20
    , Case21
    , Case22
    , Case23
    , Case24
    , Case25
    , Case26
    , Case27
    , Case28
    , TargetValue
    /*TOTALS*/
    , SUM(Case0) OVER(PARTITION BY Date) AS Case0T
    , SUM(Case1) OVER(PARTITION BY Date) AS Case1T
    , SUM(Case2) OVER(PARTITION BY Date) AS Case2T
    , SUM(Case3) OVER(PARTITION BY Date) AS Case3T
    , SUM(Case4) OVER(PARTITION BY Date) AS Case4T
    , SUM(Case5) OVER(PARTITION BY Date) AS Case5T
    , SUM(Case6) OVER(PARTITION BY Date) AS Case6T
    , SUM(Case7) OVER(PARTITION BY Date) AS Case7T
    , SUM(Case8) OVER(PARTITION BY Date) AS Case8T
    , SUM(Case9) OVER(PARTITION BY Date) AS Case9T
    , SUM(Case10) OVER(PARTITION BY Date) AS Case10T
    , SUM(Case11) OVER(PARTITION BY Date) AS Case11T
    , SUM(Case12) OVER(PARTITION BY Date) AS Case12T
    , SUM(Case13) OVER(PARTITION BY Date) AS Case13T
    , SUM(Case14) OVER(PARTITION BY Date) AS Case14T
    , SUM(Case15) OVER(PARTITION BY Date) AS Case15T
    , SUM(Case16) OVER(PARTITION BY Date) AS Case16T
    , SUM(Case17) OVER(PARTITION BY Date) AS Case17T
    , SUM(Case18) OVER(PARTITION BY Date) AS Case18T
    , SUM(Case19) OVER(PARTITION BY Date) AS Case19T
    , SUM(Case20) OVER(PARTITION BY Date) AS Case20T
    , SUM(Case21) OVER(PARTITION BY Date) AS Case21T
    , SUM(Case22) OVER(PARTITION BY Date) AS Case22T
    , SUM(Case23) OVER(PARTITION BY Date) AS Case23T
    , SUM(Case24) OVER(PARTITION BY Date) AS Case24T
    , SUM(Case25) OVER(PARTITION BY Date) AS Case25T
    , SUM(Case26) OVER(PARTITION BY Date) AS Case26T
    , SUM(Case27) OVER(PARTITION BY Date) AS Case27T
    , SUM(Case28) OVER(PARTITION BY Date) AS Case28T
    , SUM(TargetValue) OVER(PARTITION BY Date) AS TargetValueT
    
    FROM train_data_dth4b
    
")

train_data_dth_agg2 <- sqldf("
                             
    SELECT Date
    , Target
    , AVG(TargetValueT) AS TargetValueT
    , AVG(Case0T) AS Case0T
    , AVG(Case1T) AS Case1T
    , AVG(Case2T) AS Case2T
    , AVG(Case3T) AS Case3T
    , AVG(Case4T) AS Case4T
    , AVG(Case5T) AS Case5T
    , AVG(Case6T) AS Case6T
    , AVG(Case7T) AS Case7T
    , AVG(Case8T) AS Case8T
    , AVG(Case9T) AS Case9T
    , AVG(Case10T) AS Case10T
    , AVG(Case11T) AS Case11T
    , AVG(Case12T) AS Case12T
    , AVG(Case13T) AS Case13T
    , AVG(Case14T) AS Case14T
    , AVG(Case15T) AS Case15T
    , AVG(Case16T) AS Case16T
    , AVG(Case17T) AS Case17T
    , AVG(Case18T) AS Case18T
    , AVG(Case19T) AS Case19T
    , AVG(Case20T) AS Case20T
    , AVG(Case21T) AS Case21T
    , AVG(Case22T) AS Case22T
    , AVG(Case23T) AS Case23T
    , AVG(Case24T) AS Case24T
    , AVG(Case25T) AS Case25T
    , AVG(Case26T) AS Case26T
    , AVG(Case27T) AS Case27T
    , AVG(Case28T) AS Case28T
    
    FROM train_data_dth4c
    
    GROUP BY Date, Target
                             
")

train_data_dth_agg2b <-
  train_data_dth_agg2 %>%
  filter(Date >= '2020-04-01')

ggplot(data = train_data_dth_agg2b,
       aes(x = Date,
           y = TargetValueT)) +
  geom_line() +
  labs(x = "Date",
       y = "Fatality Count",
       title = "Daily COVID-19 Fatality Count",
       subtitle = "4/1/2020 - 6/10/2020")

ggplot(data = train_data_dth_agg2,
       aes(x = Date,
           y = TargetValueT)) +
  geom_line() +
  labs(x = "Date",
       y = "Fatality Count",
       title = "Daily COVID-19 Fatality Count",
       subtitle = "1/23/2020 - 6/10/2020")


#Death Count Visualizations

train_dth_col <- as.vector(colnames(train_data_dth_agg2b))

for(i in 3:length(train_dth_col)) {
  
  hist_data <- data.frame(cbind(train_data_dth_agg2b[,1],
                                train_data_dth_agg2b[,i]))
  
  colnames(hist_data) <- c("Date",
                           "x")
  
  print(
    
    ggplot(data = hist_data,
           aes(x = x)) +
      geom_histogram(aes(y =..density..), 
                     col = "black", 
                     fill = "grey",
                     bins = 20) +
      geom_density(alpha = .2, fill = "#FF6666") +
      theme_classic() +
      labs(x = paste("Bin: ", train_dth_col[i]),
           y = "Density",
           title = "Histogram",
           subtitle = paste("Case Data: ", train_dth_col[i]))
    
  )
  
}


#Data says anywhere from 8 days to 3 weeks for lag

train_data_dth_agg2b %>%
  select(TargetValueT,
         Case21T,
         Case20T,
         Case19T,
         Case18T,
         Case17T,
         Case16T,
         Case15T,
         Case14T,
         Case13T,
         Case12T,
         Case11T,
         Case10T,
         Case9T,
         Case8T) %>%
  chart.Correlation()


#ETS MODEL CONSTRUCTION - CASES
train_data_case_aggts <-
  train_data_case_agg2b %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         TargetValueT) %>%
  mutate(TargetValueT = as.double(TargetValueT)) %>%
  as_tsibble(key = c(Target),
             index = Date)

#Test Data = 4/27/2020 through 6/10/2020
test_data_case_agg <-
  train_data_case_agg2b %>%
  filter(Date >= '2020-04-27') %>%
  select(Date,
         Target,
         TargetValueT) %>%
  mutate(TargetValueT = as.double(TargetValueT)) %>%
  as_tsibble(key = c(Target),
             index = Date)

train_data_case_agg_ETS <-
  train_data_case_aggts %>%
  model(ETS_MNM = ETS(TargetValueT ~ error("M") + trend("N") + season("M")),
        ETS_AAdA = ETS(TargetValueT ~ error("A") + trend("Ad") + season("A")),
        ETS_MAdM = ETS(TargetValueT ~ error("M") + trend("Ad") + season("M")))

train_data_case_agg_fc <-
  train_data_case_agg_ETS %>%
  forecast(h = 45)

train_data_case_agg_fc %>%
  autoplot(test_data_case_agg) +
  labs(x = "Month",
       y = "Confirmed Case Count",
       title = "Daily Confirmed Cases, Global",
       subtitle = "12 Month Forecast, ETS") +
  theme_classic()

ETS_case_agg_summary <- rbind(train_data_case_agg_fc %>% 
                                filter(.model == "ETS_MNM") %>%
                                accuracy(test_data_case_agg) %>%
                                select(.model, .type, RMSE:MAPE),
                              train_data_case_agg_fc %>% 
                                filter(.model == "ETS_MAdM") %>%
                                accuracy(test_data_case_agg) %>%
                                select(.model, .type, RMSE:MAPE),
                              train_data_case_agg_fc %>% 
                                filter(.model == "ETS_AAdA") %>%
                                accuracy(test_data_case_agg) %>%
                                select(.model, .type, RMSE:MAPE),
                              train_data_case_agg_ETS %>% 
                                select(ETS_MNM) %>%
                                accuracy() %>%
                                select(.model, .type, RMSE:MAPE),
                              train_data_case_agg_ETS %>% 
                                select(ETS_MAdM) %>%
                                accuracy() %>%
                                select(.model, .type, RMSE:MAPE),
                              train_data_case_agg_ETS %>% 
                                select(ETS_AAdA) %>%
                                accuracy() %>%
                                select(.model, .type, RMSE:MAPE))

kable(ETS_case_agg_summary) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")

kable(glance(train_data_case_agg_ETS)) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")


train_data_case_agg_ETS %>%
  select(ETS_MAdM) %>%
  gg_tsresiduals() +
  labs(title = "Residual Plots",
       subtitle = "ETS(M,Ad,M)")

train_data_case_agg_ETS %>%
  select(ETS_MNM) %>%
  gg_tsresiduals() +
  labs(title = "Residual Plots",
       subtitle = "ETS(M,N,M)")

train_data_case_agg_ETS %>%
  select(ETS_AAdA) %>%
  gg_tsresiduals() +
  labs(title = "Residual Plots",
       subtitle = "ETS(A,Ad,A)")


train_data_case_ts2 <-
  train_data_case4d %>%
  filter(Date >= '2020-04-01') %>%
  filter(Date < '2020-04-27') %>%
    select(Date,
           Target,
           Geography_ID2,
           TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

#Test Data = 4/27/2020 through 6/10/2020
test_data_case_ts <-
  train_data_case4d %>%
  filter(Date >= '2020-04-27') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

train_data_case_ETS <-
  train_data_case_ts2 %>%
  model(ETSauto = ETS(TargetValue),
        ETS_AAdA = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")))

train_data_case_fc <-
  train_data_case_ETS %>%
  forecast(h = 45)

ETS_case_results_state <- rbind(train_data_case_fc %>% accuracy(test_data_case_ts) %>% 
                                   select(Geography_ID2, 
                                          .model, 
                                          .type, 
                                          RMSE:MAPE),
                                 train_data_case_ETS %>% accuracy() %>%
                                   select(Geography_ID2, 
                                          .model, 
                                          .type, 
                                          RMSE:MAPE))

train_data_case_model <- 
  train_data_case4d %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue)

train_data_case_model2 <- sqldf("
    
    WITH DATA AS (
    
        SELECT Target,
        Geography_ID2,
        SUM(TargetValue) AS TargetValueState
        
        FROM train_data_case_model
        
        GROUP BY Target, Geography_ID2
    
    )
    
    SELECT C.Target,
    C.Geography_ID2,
    C.TargetValueState,
    SUM(C.TargetValueState) OVER(PARTITION BY C.Target) AS TargetValueT
    , M.RMSE AS RMSE_autoTest
    , M2.RMSE AS RMSE_AAdA_Test
    , M3.RMSE AS RMSE_autoTrain
    , M4.RMSE AS RMSE_AAdA_Train
    
    FROM DATA AS C
    
    LEFT JOIN ETS_case_results_state AS M
    ON M.Geography_ID2 = C.Geography_ID2
    AND M.'.model' = 'ETSauto'
    AND M.'.type' = 'Test'
    
    LEFT JOIN ETS_case_results_state AS M2
    ON M2.Geography_ID2 = C.Geography_ID2
    AND M2.'.model' = 'ETS_AAdA'
    AND M2.'.type' = 'Test'
    
    LEFT JOIN ETS_case_results_state AS M3
    ON M3.Geography_ID2 = C.Geography_ID2
    AND M3.'.model' = 'ETSauto'
    AND M3.'.type' = 'Training'
    
    LEFT JOIN ETS_case_results_state AS M4
    ON M4.Geography_ID2 = C.Geography_ID2
    AND M4.'.model' = 'ETS_AAdA'
    AND M4.'.type' = 'Training'
                                
")

train_data_case_model3 <- sqldf("
    
    WITH DATA AS (
               
        SELECT Target
        , Geography_ID2
        , CAST(TargetValueState AS FLOAT)/CAST(TargetValueT AS FLOAT) AS TargetWeight
        , RMSE_autoTest
        , RMSE_AAdA_Test
        , RMSE_autoTrain
        , RMSE_AAdA_Train
        
        FROM train_data_case_model2
    
    )
    
    SELECT Target
    , Geography_ID2
    , TargetWeight * RMSE_autoTest AS Tgt_RMSEautoTest
    , TargetWeight * RMSE_AAdA_Test AS Tgt_RMSEAAdA_Test
    , TargetWeight * RMSE_autoTrain AS Tgt_RMSEautoTrain
    , TargetWeight * RMSE_AAdA_Train AS Tgt_RMSEAAdA_Train
    
    FROM DATA
    
")

ETS_case_state_summary <- cbind(sum(train_data_case_model3$Tgt_RMSEautoTest),
                                   sum(train_data_case_model3$Tgt_RMSEautoTrain),
                                   sum(train_data_case_model3$Tgt_RMSEAAdA_Test),
                                   sum(train_data_case_model3$Tgt_RMSEAAdA_Train))

colnames(ETS_case_state_summary) <- c("RMSE (Test) - ETSauto",
                                         "RMSE (Train) - ETSauto",
                                         "RMSE (Test) - ETS(A,Ad,A)",
                                         "RMSE (Train) - ETS(A,Ad,A)")

kable(ETS_case_state_summary) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")


#Fatalities
#ETS MODEL CONSTRUCTION - FATALITIES
train_data_dth_aggts <-
  train_data_dth_agg2b %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         TargetValueT) %>%
  mutate(TargetValueT = as.double(TargetValueT)) %>%
  as_tsibble(key = c(Target),
             index = Date)

#Test Data = 4/27/2020 through 6/10/2020

test_data_dth_agg <-
  train_data_dth_agg2b %>%
  filter(Date >= '2020-04-27') %>%
  select(Date,
         Target,
         TargetValueT) %>%
  mutate(TargetValueT = as.double(TargetValueT)) %>%
  as_tsibble(key = c(Target),
             index = Date)

train_data_dth_agg_ETS <-
  train_data_dth_aggts %>%
  model(ETS_MNM = ETS(TargetValueT ~ error("M") + trend("N") + season("M")),
        ETS_AAdA = ETS(TargetValueT ~ error("A") + trend("Ad") + season("A")),
        ETS_MAdM = ETS(TargetValueT ~ error("M") + trend("Ad") + season("M")))

train_data_dth_agg_fc <-
  train_data_dth_agg_ETS %>%
  forecast(h = 45)

train_data_dth_agg_fc %>%
  autoplot(test_data_dth_agg) +
  labs(x = "Month",
       y = "Fatality Count",
       title = "Daily Fatalities, Global",
       subtitle = "12 Month Forecast, ETS")

ETS_dth_agg_summary <- rbind(train_data_dth_agg_fc %>% 
                                 filter(.model == "ETS_MNM") %>%
                                 accuracy(test_data_dth_agg) %>%
                                 select(.model, .type, RMSE:MAPE),
                               train_data_dth_agg_fc %>% 
                                 filter(.model == "ETS_MAdM") %>%
                                 accuracy(test_data_dth_agg) %>%
                                 select(.model, .type, RMSE:MAPE),
                               train_data_dth_agg_fc %>% 
                                 filter(.model == "ETS_AAdA") %>%
                                 accuracy(test_data_dth_agg) %>%
                                 select(.model, .type, RMSE:MAPE),
                               train_data_dth_agg_ETS %>% 
                                 select(ETS_MNM) %>%
                                 accuracy() %>%
                                 select(.model, .type, RMSE:MAPE),
                               train_data_dth_agg_ETS %>% 
                                 select(ETS_MAdM) %>%
                                 accuracy() %>%
                                 select(.model, .type, RMSE:MAPE),
                               train_data_dth_agg_ETS %>% 
                                 select(ETS_AAdA) %>%
                                 accuracy() %>%
                                 select(.model, .type, RMSE:MAPE))

kable(ETS_dth_agg_summary) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")


train_data_dth_ts2 <-
  train_data_dth4c %>%
  filter(Date >= '2020-04-01') %>%
  filter(Date < '2020-04-27') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

#Test Data = 4/27/2020 through 6/10/2020
test_data_dth_ts <-
  train_data_dth4c %>%
  filter(Date >= '2020-04-27') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

train_data_dth_ETS <-
  train_data_dth_ts2 %>%
  model(ETSauto = ETS(TargetValue),
        ETS_AAdA = ETS(TargetValue ~ error("A") + trend("Ad") + season("A")))

train_data_dth_fc <-
  train_data_dth_ETS %>%
  forecast(h = 45)

ETS_dth_results_state <- rbind(train_data_dth_fc %>% accuracy(test_data_dth_ts) %>% 
                                  select(Geography_ID2, 
                                         .model, 
                                         .type, 
                                         RMSE:MAPE),
                                train_data_dth_ETS %>% accuracy() %>%
                                  select(Geography_ID2, 
                                         .model, 
                                         .type, 
                                         RMSE:MAPE))

train_data_dth_model <- 
  train_data_dth4c %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue)

train_data_dth_model2 <- sqldf("
    
    WITH DATA AS (
    
        SELECT Target,
        Geography_ID2,
        SUM(TargetValue) AS TargetValueState
        
        FROM train_data_dth_model
        
        GROUP BY Target, Geography_ID2
    
    )
    
    SELECT C.Target,
    C.Geography_ID2,
    C.TargetValueState,
    SUM(C.TargetValueState) OVER(PARTITION BY C.Target) AS TargetValueT
    , M.RMSE AS RMSE_autoTest
    , M2.RMSE AS RMSE_AAdA_Test
    , M3.RMSE AS RMSE_autoTrain
    , M4.RMSE AS RMSE_AAdA_Train
    
    FROM DATA AS C
    
    LEFT JOIN ETS_dth_results_state AS M
    ON M.Geography_ID2 = C.Geography_ID2
    AND M.'.model' = 'ETSauto'
    AND M.'.type' = 'Test'
    
    LEFT JOIN ETS_dth_results_state AS M2
    ON M2.Geography_ID2 = C.Geography_ID2
    AND M2.'.model' = 'ETS_AAdA'
    AND M2.'.type' = 'Test'
    
    LEFT JOIN ETS_dth_results_state AS M3
    ON M3.Geography_ID2 = C.Geography_ID2
    AND M3.'.model' = 'ETSauto'
    AND M3.'.type' = 'Training'
    
    LEFT JOIN ETS_dth_results_state AS M4
    ON M4.Geography_ID2 = C.Geography_ID2
    AND M4.'.model' = 'ETS_AAdA'
    AND M4.'.type' = 'Training'
                                
")

train_data_dth_model3 <- sqldf("
    
    WITH DATA AS (
               
        SELECT Target
        , Geography_ID2
        , CAST(TargetValueState AS FLOAT)/CAST(TargetValueT AS FLOAT) AS TargetWeight
        , RMSE_autoTest
        , RMSE_AAdA_Test
        , RMSE_autoTrain
        , RMSE_AAdA_Train
        
        FROM train_data_dth_model2
    
    )
    
    SELECT Target
    , Geography_ID2
    , TargetWeight * RMSE_autoTest AS Tgt_RMSEautoTest
    , TargetWeight * RMSE_AAdA_Test AS Tgt_RMSEAAdA_Test
    , TargetWeight * RMSE_autoTrain AS Tgt_RMSEautoTrain
    , TargetWeight * RMSE_AAdA_Train AS Tgt_RMSEAAdA_Train
    
    FROM DATA
    
")

ETS_dth_state_summary <- cbind(sum(train_data_dth_model3$Tgt_RMSEautoTest),
                                sum(train_data_dth_model3$Tgt_RMSEautoTrain),
                                sum(train_data_dth_model3$Tgt_RMSEAAdA_Test),
                                sum(train_data_dth_model3$Tgt_RMSEAAdA_Train))

colnames(ETS_dth_state_summary) <- c("RMSE (Test) - ETSauto",
                                      "RMSE (Train) - ETSauto",
                                      "RMSE (Test) - ETS(A,Ad,A)",
                                      "RMSE (Train) - ETS(A,Ad,A)")

kable(ETS_dth_state_summary) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")


#Plot of case data, aggregated globally
train_data_case_agg %>%
  autoplot(TargetT) +
  labs(x = "Date",
       y = "Case Count",
       title = "Daily COVID-19 Confirmed Cases",
       subtitle = "Global Aggregation") +
  theme_classic()

#Plot of death data, aggregated globally
train_data_dth_agg %>%
  autoplot(TargetT, col = "red") +
  labs(x = "Date",
       y = "Fatality Count",
       title = "Daily COVID-19 Fatalities",
       subtitle = "Global Aggregation") +
  theme_classic()


train_data_caseRW_aggts <-
  train_data_case_agg2 %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         TargetValueT) %>%
  mutate(TargetValueT = as.double(TargetValueT)) %>%
  as_tsibble(key = c(Target),
             index = Date)

test_data_case_agg <-
  train_data_case_agg2b %>%
  filter(Date >= '2020-04-27') %>%
  select(Date,
         Target,
         TargetValueT) %>%
  mutate(TargetValueT = as.double(TargetValueT)) %>%
  as_tsibble(key = c(Target),
             index = Date)

train_data_case_agg_RW <-
  train_data_case_aggts %>%
  model(RW_drift = RW(TargetValueT, drift = TRUE),
        RW_nodrift = RW(TargetValueT, drift = FALSE))

train_data_caseRW_agg_fc <-
  train_data_case_agg_RW %>%
  forecast(h = 45)

train_data_caseRW_agg_fc %>%
  autoplot(test_data_case_agg) +
  labs(x = "Month",
       y = "Confirmed Case Count",
       title = "Daily Confirmed Cases, Global",
       subtitle = "12 Month Forecast, Random-Walk") +
  theme_classic()

RW_case_agg_summary <- rbind(train_data_caseRW_agg_fc %>% 
                                filter(.model == "RW_drift") %>%
                                accuracy(test_data_case_agg) %>%
                                select(.model, .type, RMSE:MAPE),
                             train_data_caseRW_agg_fc %>% 
                                filter(.model == "RW_nodrift") %>%
                                accuracy(test_data_case_agg) %>%
                                select(.model, .type, RMSE:MAPE),
                             train_data_case_agg_RW %>% 
                                select(RW_drift) %>%
                                accuracy() %>%
                                select(.model, .type, RMSE:MAPE),
                             train_data_case_agg_RW %>% 
                                select(RW_nodrift) %>%
                                accuracy() %>%
                                select(.model, .type, RMSE:MAPE))

kable(RW_case_agg_summary) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")


train_data_case_agg_RW %>%
  select(RW_drift) %>%
  gg_tsresiduals() +
  labs(title = "Residual Plots",
       subtitle = "RW w/ Drift")

train_data_case_agg_RW %>%
  select(RW_nodrift) %>%
  gg_tsresiduals() +
  labs(title = "Residual Plots",
       subtitle = "RW w/o Drift")


#RW MODEL CONSTRUCTION - CASES
train_data_caseRW_ts2 <-
  train_data_case4d %>%
  filter(Date <= '2020-04-26') %>%
    select(Date,
           Target,
           Geography_ID2,
           TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

#Test Data = 4/27/2020 through 6/10/2020
test_data_case_ts <-
  train_data_case4d %>%
  filter(Date >= '2020-04-27') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

train_data_case_RW <-
  train_data_caseRW_ts2 %>%
  model(RW_drift = RW(TargetValue, drift = TRUE),
        RW_nodrift = RW(TargetValue, drift = FALSE))

train_data_caseRW_fc <-
  train_data_case_RW %>%
  forecast(h = 45)

RW_case_results_state <- rbind(train_data_caseRW_fc %>% accuracy(test_data_case_ts) %>% 
                                   select(Geography_ID2, 
                                          .model, 
                                          .type, 
                                          RMSE:MAPE),
                               train_data_case_RW %>% accuracy() %>%
                                   select(Geography_ID2, 
                                          .model, 
                                          .type, 
                                          RMSE:MAPE))

train_data_caseRW_model <- 
  train_data_case4d %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue)

train_data_caseRW_model2 <- sqldf("
    
    WITH DATA AS (
    
        SELECT Target,
        Geography_ID2,
        SUM(TargetValue) AS TargetValueState
        
        FROM train_data_caseRW_model
        
        GROUP BY Target, Geography_ID2
    
    )
    
    SELECT C.Target,
    C.Geography_ID2,
    C.TargetValueState,
    SUM(C.TargetValueState) OVER(PARTITION BY C.Target) AS TargetValueT
    , M.RMSE AS RMSE_RWD_Test
    , M2.RMSE AS RMSE_RWND_Test
    , M3.RMSE AS RMSE_RWD_Train
    , M4.RMSE AS RMSE_RWND_Train
    
    FROM DATA AS C
    
    LEFT JOIN RW_case_results_state AS M
    ON M.Geography_ID2 = C.Geography_ID2
    AND M.'.model' = 'RW_drift'
    AND M.'.type' = 'Test'
    
    LEFT JOIN RW_case_results_state AS M2
    ON M2.Geography_ID2 = C.Geography_ID2
    AND M2.'.model' = 'RW_nodrift'
    AND M2.'.type' = 'Test'
    
    LEFT JOIN RW_case_results_state AS M3
    ON M3.Geography_ID2 = C.Geography_ID2
    AND M3.'.model' = 'RW_drift'
    AND M3.'.type' = 'Training'
    
    LEFT JOIN RW_case_results_state AS M4
    ON M4.Geography_ID2 = C.Geography_ID2
    AND M4.'.model' = 'RW_nodrift'
    AND M4.'.type' = 'Training'
                                
")

train_data_caseRW_model3 <- sqldf("
    
    WITH DATA AS (
               
        SELECT Target
        , Geography_ID2
        , CAST(TargetValueState AS FLOAT)/CAST(TargetValueT AS FLOAT) AS TargetWeight
        , RMSE_RWD_Test
        , RMSE_RWND_Test
        , RMSE_RWD_Train
        , RMSE_RWND_Train
        
        FROM train_data_caseRW_model2
    
    )
    
    SELECT Target
    , Geography_ID2
    , TargetWeight * RMSE_RWD_Test AS Tgt_RWDTest
    , TargetWeight * RMSE_RWND_Test AS Tgt_RWNDTest
    , TargetWeight * RMSE_RWD_Train AS Tgt_RWDTrain
    , TargetWeight * RMSE_RWND_Train AS Tgt_RWNDTrain
    
    FROM DATA
    
")

RW_case_state_summary <- cbind(sum(train_data_caseRW_model3$Tgt_RWDTest),
                                   sum(train_data_caseRW_model3$Tgt_RWDTrain),
                                   sum(train_data_caseRW_model3$Tgt_RWNDTest),
                                   sum(train_data_caseRW_model3$Tgt_RWNDTrain))

colnames(RW_case_state_summary) <- c("RMSE (Test) - RW Drift",
                                     "RMSE (Train) - RW Drift",
                                     "RMSE (Test) - RW No Drift",
                                     "RMSE (Train) - RW No Drift")

kable(RW_case_state_summary) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")


#RW MODEL CONSTRUCTION - FATALITIES
train_data_dthRW_ts2 <-
  train_data_dth4c %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

#Test Data = 4/27/2020 through 6/10/2020
test_data_dth_ts <-
  train_data_dth4c %>%
  filter(Date >= '2020-04-27') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue) %>%
  mutate(TargetValue = as.double(TargetValue)) %>%
  as_tsibble(key = c(Target,Geography_ID2),
             index = Date)

train_data_dth_RW <-
  train_data_dthRW_ts2 %>%
  model(RW_drift = RW(TargetValue, drift = TRUE),
        RW_nodrift = RW(TargetValue, drift = FALSE))

train_data_dthRW_fc <-
  train_data_dth_RW %>%
  forecast(h = 45)

RW_dth_results_state <- rbind(train_data_dthRW_fc %>% accuracy(test_data_dth_ts) %>% 
                                 select(Geography_ID2, 
                                        .model, 
                                        .type, 
                                        RMSE:MAPE),
                              train_data_dth_RW %>% accuracy() %>%
                                 select(Geography_ID2, 
                                        .model, 
                                        .type, 
                                        RMSE:MAPE))

train_data_dthRW_model <- 
  train_data_dth4c %>%
  filter(Date <= '2020-04-26') %>%
  select(Date,
         Target,
         Geography_ID2,
         TargetValue)

train_data_dthRW_model2 <- sqldf("
    
    WITH DATA AS (
    
        SELECT Target,
        Geography_ID2,
        SUM(TargetValue) AS TargetValueState
        
        FROM train_data_dthRW_model
        
        GROUP BY Target, Geography_ID2
    
    )
    
    SELECT C.Target,
    C.Geography_ID2,
    C.TargetValueState,
    SUM(C.TargetValueState) OVER(PARTITION BY C.Target) AS TargetValueT
    , M.RMSE AS RMSE_RWD_Test
    , M2.RMSE AS RMSE_RWND_Test
    , M3.RMSE AS RMSE_RWD_Train
    , M4.RMSE AS RMSE_RWND_Train
    
    FROM DATA AS C
    
    LEFT JOIN RW_dth_results_state AS M
    ON M.Geography_ID2 = C.Geography_ID2
    AND M.'.model' = 'RW_drift'
    AND M.'.type' = 'Test'
    
    LEFT JOIN RW_dth_results_state AS M2
    ON M2.Geography_ID2 = C.Geography_ID2
    AND M2.'.model' = 'RW_nodrift'
    AND M2.'.type' = 'Test'
    
    LEFT JOIN RW_dth_results_state AS M3
    ON M3.Geography_ID2 = C.Geography_ID2
    AND M3.'.model' = 'RW_drift'
    AND M3.'.type' = 'Training'
    
    LEFT JOIN RW_dth_results_state AS M4
    ON M4.Geography_ID2 = C.Geography_ID2
    AND M4.'.model' = 'RW_nodrift'
    AND M4.'.type' = 'Training'
                                
")

train_data_dthRW_model3 <- sqldf("
    
    WITH DATA AS (
               
        SELECT Target
        , Geography_ID2
        , CAST(TargetValueState AS FLOAT)/CAST(TargetValueT AS FLOAT) AS TargetWeight
        , RMSE_RWD_Test
        , RMSE_RWND_Test
        , RMSE_RWD_Train
        , RMSE_RWND_Train
        
        FROM train_data_dthRW_model2
    
    )
    
    SELECT Target
    , Geography_ID2
    , TargetWeight * RMSE_RWD_Test AS Tgt_RWDTest
    , TargetWeight * RMSE_RWND_Test AS Tgt_RWNDTest
    , TargetWeight * RMSE_RWD_Train AS Tgt_RWDTrain
    , TargetWeight * RMSE_RWND_Train AS Tgt_RWNDTrain
    
    FROM DATA
    
")

RW_dth_state_summary <- cbind(sum(train_data_dthRW_model3$Tgt_RWDTest),
                               sum(train_data_dthRW_model3$Tgt_RWDTrain),
                               sum(train_data_dthRW_model3$Tgt_RWNDTest),
                               sum(train_data_dthRW_model3$Tgt_RWNDTrain))

colnames(RW_dth_state_summary) <- c("RMSE (Test) - RW Drift",
                                     "RMSE (Train) - RW Drift",
                                     "RMSE (Test) - RW No Drift",
                                     "RMSE (Train) - RW No Drift")

kable(RW_dth_state_summary) %>%
  kable_styling(latex_options = "striped") %>%
  kable_styling(latex_options = "HOLD_position")

COVID-19 Case/Death Count Predictions

Timothy White

4/20/2022

TABLE OF CONTENTS

Introduction

Data Exploration, Literature Review, Assumptions

Confirmed Case Data Exploration

Fatalities Data Exploration

Model Construction

ETS Model Exploration and Testing

Confirmed Cases

Fatalities

Random-Walk Models (with and without Drift)

Confirmed Cases

Fatalities

Conclusions and Further Discussion

References

Appendix: R Code