Given the attention the influenza virus has been receiving in the news lately, it seemed appropriate to take a look at historical flu data and learn more about it.
There are two main types of flu virus:
Type A
Type B
World Health Organization – Flu Data
Other data sourced but not used in this presentation includes:
The following R packages were used in producing the charts and graphs used to analyze the flu data.
The WHO flu data is obtained from the WHO website (listed above) by entering the date range and the level (national, state, etc.) for which you want data. The results are then downloaded as a .csv file.
The code below reads in the .csv file and derives three new fields. The first, “pctposcases”, is the share of all processed flu test specimens that came back positive. The other two, “pctposacases” and “pctposbcases”, are the shares of processed specimens that tested positive for flu type A and type B, respectively. All three are stored as proportions between 0 and 1 rather than multiplied by 100.
Next, the start and end dates, which arrive as month/day/year text, are converted to Date objects, and the Title field, which describes the status of the flu season (e.g. Sporadic, Local Outbreak, etc.), is coerced to a factor. Finally, the first few columns are not relevant to our research and are dropped.
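library(dplyr)  # provides select()
flu.data <- read.csv("FluCasesReported.csv")
# Share of processed specimens that tested positive: all flu, type A, type B
flu.data$pctposcases <- flu.data$ALL_INF / flu.data$SPEC_PROCESSED_NB
flu.data$pctposacases <- flu.data$INF_A / flu.data$SPEC_PROCESSED_NB
flu.data$pctposbcases <- flu.data$INF_B / flu.data$SPEC_PROCESSED_NB
# Parse the start and end dates (month/day/year text) into Date objects
flu.data$SDATE <- as.Date(flu.data$SDATE, "%m/%d/%Y")
flu.data$EDATE <- as.Date(flu.data$EDATE, "%m/%d/%Y")
flu.data$TITLE <- as.factor(flu.data$TITLE)
# Keep Year through pctposbcases; the columns before Year are dropped
flu.data <- select(flu.data, Year:pctposbcases)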
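As a small, self-contained illustration of the derived fields, the sketch below applies the same arithmetic to a made-up three-row data frame. The values and the names `toy` and `safe_share` are invented for the example; only the column names follow the code above. It also shows one possible guard, which is not part of the original analysis, for weeks in which no specimens were processed.

```r
# Toy data: values are invented for illustration, not WHO figures.
toy <- data.frame(
  SDATE = c("01/05/2009", "01/12/2009", "01/19/2009"),
  SPEC_PROCESSED_NB = c(200, 0, 400),
  INF_A = c(30, 0, 80),
  INF_B = c(10, 0, 20),
  stringsAsFactors = FALSE
)
# Assumption for the toy data: total positives are type A plus type B.
toy$ALL_INF <- toy$INF_A + toy$INF_B

# Share of processed specimens that tested positive (a proportion, 0 to 1).
# Weeks with zero processed specimens get NA instead of NaN from 0 / 0.
safe_share <- function(positive, processed) {
  ifelse(processed > 0, positive / processed, NA_real_)
}
toy$pctposcases  <- safe_share(toy$ALL_INF, toy$SPEC_PROCESSED_NB)
toy$pctposacases <- safe_share(toy$INF_A, toy$SPEC_PROCESSED_NB)
toy$pctposbcases <- safe_share(toy$INF_B, toy$SPEC_PROCESSED_NB)

# Parse month/day/year text into Date objects.
toy$SDATE <- as.Date(toy$SDATE, format = "%m/%d/%Y")

print(toy)
```

Whether to keep such weeks as NA or drop them depends on how the later time series steps should treat gaps.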
The following code creates the time series to be used in our analysis.
flu.ts <- ts(flu.data[,c(12,16,17,19)], start = 2009, frequency = 52)
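To see what `start = 2009, frequency = 52` implies, the sketch below builds a short weekly series from made-up numbers and inspects its time index. It assumes, as the call above does, that the first row is the first week of 2009 and that every year has exactly 52 weeks. Years with a 53rd reporting week would gradually shift the index, so this is an approximation worth checking against the data.

```r
# Two years of invented weekly counts (104 observations).
counts <- ts(seq_len(104), start = 2009, frequency = 52)

# Each observation advances time by 1/52 of a year.
print(head(time(counts), 3))

# cycle() gives the week-within-year position (1 to 52).
print(head(cycle(counts), 3))

# window() extracts a span by time; here, the second year.
second_year <- window(counts, start = c(2010, 1), end = c(2010, 52))
print(length(second_year))
```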
This chart shows all flu cases reported in the United States from 2009 through January 2018. Note the clear seasonality and the increasing trend since 2010.
autoplot(flu.ts[,3]) +
ggtitle("Flu Cases per Week") +
xlab("Weeks, 2009-present") +
ylab("Flu Cases") +
theme_classic()
The next chart layers types A and B against the total flu cases. Notice that type A cases are much higher than type B. This is expected, as type A is known to be more common and more prone to cause a pandemic outbreak.
autoplot(flu.ts[,1:3]) +
ggtitle("Flu Cases per Week") +
xlab("Weeks, 2009-present") +
ylab("Flu Cases") +
theme_classic()
Plotting type A cases against type B cases shows an interesting phenomenon: type B tends to rise only after type A has peaked for the season, and to peak when the type A season is nearly over. The other notable feature is the irregular data between 2009 and 2010, which coincides with the “Swine” Flu pandemic. This will be reviewed in just a moment.
autoplot(flu.ts[,c(1:2)]) +
ggtitle("Flu Cases A and B per Week") +
xlab("Weeks, 2009-present") +
ylab("Flu Cases") +
theme_classic()
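One way to put a number on the observation that type B peaks after type A is to find the peak week of each type within each season and take the difference. The sketch below does this on synthetic bell-shaped curves, not the WHO data. The 52-week season blocks and the names `a`, `b`, and `season` are assumptions made for illustration.

```r
# Synthetic seasons: type A peaks in week 6, type B in week 14 (invented).
weeks <- 1:52
n_seasons <- 3
a <- rep(exp(-((weeks - 6)^2) / 30), n_seasons)
b <- rep(exp(-((weeks - 14)^2) / 30), n_seasons)

# Label each observation with its season (one block of 52 weeks each).
season <- rep(seq_len(n_seasons), each = 52)

# Week of the maximum within each season, then the B-minus-A lag in weeks.
peak_a <- tapply(a, season, which.max)
peak_b <- tapply(b, season, which.max)
lag_weeks <- peak_b - peak_a
print(lag_weeks)
```

On real data the blocks would need to follow the flu season rather than the calendar year, since a winter peak can straddle the year boundary.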
Given the seasonality and the irregular 2009-2010 data, a polar plot is a good way to show both the seasonal pattern and the Swine flu exception.
ggseasonplot(flu.ts[ ,3], polar = TRUE) +
ylab("Flu Cases")+
ggtitle("Flu Cases by Week")
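The seasonal pattern that a polar plot displays can also be summarised numerically by averaging each week-of-year position across years. The sketch below does this with base R on an invented, noise-free series. The peak position, the yearly scaling factors, and the variable names are assumptions chosen only to make the result easy to check.

```r
# Four invented years with a winter peak at week 4 and different yearly sizes.
week <- rep(1:52, times = 4)
yearly_scale <- rep(c(1, 0.8, 1.2, 1.5), each = 52)
cases <- ts(yearly_scale * (200 + 150 * cos(2 * pi * (week - 4) / 52)),
            start = 2009, frequency = 52)

# Average the series at each week-of-year position (1 to 52).
week_of_year <- as.integer(cycle(cases))
weekly_mean <- tapply(as.numeric(cases), week_of_year, mean)

# The week with the highest average count is the typical seasonal peak.
peak_week <- as.integer(names(weekly_mean)[which.max(weekly_mean)])
print(peak_week)
```

An outlier year such as a pandemic season would pull these averages, so comparing the profile with and without that year may be informative.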
Let’s look at the anomaly that is 2009-2010. As noted earlier, the H1N1 (Swine Flu) strain was rampant during this period, and the polar plot clearly shows this. How does the H1N1 strain compare to the others, though?
flu.h1n1 <- ts(flu.data[,c(8,12,16,17,19)], start = 2009, frequency = 52)
Notice the outbreak of this strain in the 2009-2010 time frame, but also notice its cyclical recurrence roughly every two years. It appears to be beginning another surge in 2018.
autoplot(flu.h1n1[,c(1,4)]) +
ggtitle("Strain AH1N1 Flu Cases per Week") +
xlab("Weeks, 2009-present") +
ylab("Flu Cases") +
theme_classic()
What about all of the other types of A strains?
flu.typea <- ts(flu.data[,c(7:11,16,17,19)], start = 2009, frequency = 52)
autoplot(flu.typea[,c(1:5,7)]) +
ggtitle("Flu A Cases per Week Compared to All") +
xlab("Weeks, 2009-present") +
ylab("Flu Cases") +
theme_classic()
flu.typeb <- ts(flu.data[,c(13:15,16,17,19)], start = 2009, frequency = 52)
autoplot(flu.typeb[,c(1:3,5)]) +
ggtitle("Flu B Cases per Week Compared to All") +
xlab("Weeks, 2009-present") +
ylab("Flu Cases") +
theme_classic()
Next, we look at the flu tests administered.
flu.test <- ts(flu.data[,c(5:6,12,16:17,20:22)], start = 2009, frequency = 52)
This plot shows the number of flu tests processed and flu cases reported by week. Notice again the seasonality and the trend. One would assume that in years when the flu is more prevalent, more tests would be done as a precautionary measure. We find that higher test volumes correlate with higher case counts but as the plot
autoplot(flu.test[,c(2,5)])+
ggtitle("Processed Flu Tests and Cases") +
xlab("Weeks, 2009-present") +
ylab("Flu Tests") +
theme_classic()
Correlation test:
cor(flu.data[6], flu.data[17])
##                     ALL_INF
## SPEC_PROCESSED_NB 0.9042479
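The call above passes two one-column data frames, which is why the result prints as a 1 x 1 matrix labelled with both column names. A minimal sketch on invented numbers shows the equivalent vector form, along with `cor.test`, which adds a confidence interval and p-value. The `toy` data frame is hypothetical.

```r
# Invented weekly totals, for illustration only.
toy <- data.frame(
  SPEC_PROCESSED_NB = c(100, 250, 400, 800, 1200, 600),
  ALL_INF = c(5, 20, 60, 150, 260, 90)
)

# Data-frame form: returns a 1 x 1 matrix with row and column names.
m <- cor(toy["SPEC_PROCESSED_NB"], toy["ALL_INF"])
print(m)

# Vector form: returns a plain number with the same value.
r <- cor(toy$SPEC_PROCESSED_NB, toy$ALL_INF)
print(r)

# cor.test() reports the estimate with a confidence interval and p-value.
ct <- cor.test(toy$SPEC_PROCESSED_NB, toy$ALL_INF)
print(ct$conf.int)
```

Because positive cases are a subset of processed tests, some positive correlation is expected by construction, so the coefficient is best read alongside the percentage-positive series.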
The following graph shows the percentage of submitted tests that resulted in a diagnosis, for types A and B. Outside of the H1N1 outbreak in 2009-2010, the percentage of cases diagnosed is relatively consistent. This would seemingly debunk the “panic” theory that more testing during high flu activity years (such as 2015-16 and 2017-18) would produce more negative tests and therefore a lower percentage.
autoplot(flu.test[,7:8])+
ggtitle("Processed Flu Tests and Cases Comparing A vs. B") +
xlab("Weeks, 2009-present") +
ylab("Share of Tests Positive") +
theme_classic()
Initial time series modeling was performed using the stlm function from the forecast package. This function takes a time series, applies an STL decomposition to it, and then models the seasonally adjusted data using a specified modeling technique. ARIMA modeling is chosen here.
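To make the two-step idea concrete, here is a base-R sketch of the same approach: decompose with `stl`, fit an ARIMA model to the seasonally adjusted series, then add the seasonal pattern back to the forecast. This is not `stlm` itself. The synthetic series, the fixed AR(1) order, and the 26-week horizon are all assumptions made for illustration.

```r
# Synthetic weekly series: four years of a sine-shaped season plus noise.
set.seed(1)
n <- 52 * 4
week <- seq_len(n)
y <- ts(100 + 50 * sin(2 * pi * week / 52) + rnorm(n, sd = 5),
        start = 2009, frequency = 52)

# Step 1: STL decomposition, then remove the seasonal component.
dec <- stl(y, s.window = "periodic")
seasonal <- dec$time.series[, "seasonal"]
adjusted <- y - seasonal

# Step 2: model the seasonally adjusted series (AR(1) chosen by hand here).
fit <- arima(adjusted, order = c(1, 0, 0))

# Step 3: forecast the adjusted series and re-add last year's seasonal values.
# The series ends on week 52, so the forecast starts at week 1 of the cycle.
h <- 26
adj_fc <- predict(fit, n.ahead = h)$pred
seas_fc <- as.numeric(tail(seasonal, 52))[seq_len(h)]
fc <- adj_fc + seas_fc
print(round(head(fc, 5), 1))
```

With real data the ARIMA order would normally be selected rather than fixed, and the seasonal values would need aligning if the series does not end on the last week of a cycle.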
TBATS, available through the tbats function in the forecast package, is an exponential smoothing state space model with Box-Cox transformation, ARMA errors, and trend and seasonal components.