library(tidyverse)
## -- Attaching packages ---------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1.9000 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'stringr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.4
## -- Conflicts ------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
d <- read.csv("FluNet_SAmerica_15.csv", skip = 3)
I am participating in a forecasting competition (this is a link to the page). Briefly, I am trying to forecast the number of influenza-positive samples recorded in Argentina for epiweek 28 (week ending July 15) this year, based on data from the World Health Organization.
Well, actually, the task is to forecast the chance that the number of influenza positive samples during epiweek 28 will fall into a particular range. The ranges are 0-60, 60-170, 170-270, 270-390, and > 390.
If I knew nothing about the typical trajectory of the flu season or what a typical number for that week was, I’d guess 20% chance for each of the five categories (the total adds up to 100%). However, I can access historical data to get an idea of what a typical number looks like. So I figure I should be able to do better than even chances across the board. Here’s a plot with what the data looks like for years since the 2009 pandemic:
d %>%
  filter(Country == "Argentina") %>%
  ggplot(data = ., aes(x = Week,
                       y = ALL_INF,
                       group = Year,
                       colour = factor(Year))) +
  geom_line(size = 1) +
  geom_hline(yintercept = c(60, 170, 270, 370), linetype = "dotted") +
  geom_vline(xintercept = 28)
The y-axis shows the weekly total number of influenza-positive samples, with one line per year. The x-axis shows the epidemiological week. Remember that I need to forecast for week 28, which I’ve marked with a black vertical line. The dotted horizontal lines mark the boundaries between the forecast ranges; in ascending order they have yintercept 60, 170, 270, and 370.
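As a rough sketch of how the historical data could improve on the flat 20% guess, the week-28 counts from past seasons can be sorted into the competition ranges and turned into empirical frequencies. The counts below are made-up placeholders rather than FluNet values, the top boundary is assumed to be 390 as in the stated ranges, and the add-one smoothing is my own assumption so that a range never seen in the past still gets some probability.

```r
# Hypothetical week-28 counts for past seasons (made-up numbers, not FluNet data)
wk28 <- data.frame(
  Year    = 2010:2017,
  ALL_INF = c(45, 120, 95, 410, 80, 150, 300, 520)
)

# Bin edges taken from the competition ranges (upper edge of 390 assumed)
breaks <- c(0, 60, 170, 270, 390, Inf)
labels <- c("0-60", "60-170", "170-270", "270-390", ">390")

# Assign each season's week-28 count to a range; intervals are right-closed,
# and include.lowest = TRUE keeps a count of exactly 0 in the first range
wk28$bin <- cut(wk28$ALL_INF, breaks = breaks, labels = labels,
                include.lowest = TRUE)

# Empirical share of seasons per range, with add-one smoothing so that
# ranges with no historical seasons still get a non-zero probability
counts <- table(wk28$bin)
probs  <- (counts + 1) / (sum(counts) + length(counts))
round(probs, 2)
```

With the real data, `wk28` would presumably come from filtering `d` to `Country == "Argentina"` and `Week == 28`. With so few seasons the frequencies are very noisy, so this is only a baseline to compare other forecasts against.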
There are two types of trajectories, broadly speaking: epidemics and non-epidemics. When I say epidemic I’m talking about years 2013, 2016, and 2017. These years have huge peaks, and their year-end totals are at least double those of the other years. The other years are the non-epidemics.
So far I’ve used the historical data to generate averages for the two kinds of trajectories, and I’m monitoring each new report for information about whether it’s going to be an epidemic year or a non-epidemic year. I just have to use my best judgement here because I don’t know what else to do.
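For concreteness, the averaging step looks roughly like the sketch below. It uses base R and a small made-up data frame so that it runs on its own; the column names mirror the ones in `d`, but the numbers are placeholders and not FluNet values.

```r
# Toy weekly counts for four seasons (made-up numbers, not FluNet data)
toy <- data.frame(
  Year    = rep(c(2013, 2014, 2015, 2016), each = 3),
  Week    = rep(26:28, times = 4),
  ALL_INF = c(200, 350, 500,   # 2013
               40,  60,  80,   # 2014
               50,  70,  90,   # 2015
              250, 400, 450)   # 2016
)

# Years I have labelled as epidemics (2017 does not appear in the toy data)
epidemic_years <- c(2013, 2016, 2017)
toy$type <- ifelse(toy$Year %in% epidemic_years, "epidemic", "non-epidemic")

# Mean weekly count for each trajectory type
avg_traj <- aggregate(ALL_INF ~ type + Week, data = toy, FUN = mean)
avg_traj[order(avg_traj$type, avg_traj$Week), ]
```

Each new weekly report can then be compared against the two average curves at the same week.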
Is there a way to use Bayesian methods to decide whether it’s going to be an epidemic or non-epidemic year in a data-driven way? Any general advice for making this type of forecast?
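To make the question concrete, one possible framing (only a sketch, not something I have validated) is Bayes' rule with two hypotheses, epidemic and non-epidemic. The prior is the historical share of epidemic seasons, and the likelihood is how probable the latest weekly count is under each type's average for that week. Every number below is a hypothetical placeholder, and the negative binomial likelihood with a fixed dispersion is an assumption chosen because weekly counts tend to be overdispersed.

```r
# All numbers below are hypothetical placeholders, not estimates from FluNet
prior_epi <- 3 / 8   # e.g. 3 epidemic seasons out of 8 historical seasons
mean_epi  <- 150     # assumed mean count at the current week in epidemic seasons
mean_non  <- 60      # assumed mean count at the current week in other seasons
size      <- 5       # assumed negative binomial dispersion parameter

observed <- 110      # count reported for the current week this season

# Likelihood of the observed count under each trajectory type
lik_epi <- dnbinom(observed, mu = mean_epi, size = size)
lik_non <- dnbinom(observed, mu = mean_non, size = size)

# Bayes' rule: posterior probability that this is an epidemic season
post_epi <- prior_epi * lik_epi /
  (prior_epi * lik_epi + (1 - prior_epi) * lik_non)
post_epi
```

The posterior could be updated again with each new weekly report, and it could serve as the weight when mixing the week-28 range probabilities implied by the two trajectory types.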