Introduction to the Influenza Virus

With the attention that the influenza virus has been receiving in the news lately, it seemed appropriate to take a look at historical flu data and see what it can tell us about the virus.

There are two main types of flu virus:

Type A

Type B

Data Used

World Health Organization – Flu Data

Other data sourced but not used in this presentation includes:

R Packages Used

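The charts and graphs used to analyze the flu data were produced in R. The code in this post calls functions from dplyr (select), ggplot2 (ggtitle, xlab, ylab, theme_classic), and forecast (autoplot for time series, ggseasonplot, stlm).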

WHO Flu Data

The WHO flu data is obtained from the WHO website (listed above) by entering the date range and the level (national, state, etc.) for which you want data. The results are then downloaded as a .csv file.

Read and manipulate flu data

This code reads in the .csv file and derives three new fields. The first, “pctposcases”, is the share of all processed flu test specimens that came back positive. The other two, “pctposacases” and “pctposbcases”, are the shares of processed specimens that tested positive for flu type A and type B, respectively. All three are stored as fractions between 0 and 1 rather than multiplied by 100.

Next, the start and end dates, which arrive as month/day/year text, are converted to R Date values, and the Title field, which defines the status of the flu season (e.g. Sporadic, Local Outbreak, etc.), is coerced to a factor. Finally, the first few columns are not relevant to our research and are dropped.

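library(dplyr)  # provides select()

# Read the WHO report that was downloaded as a .csv file
flu.data <- read.csv("FluCasesReported.csv")

# Share of processed specimens that tested positive: overall, type A, type B
flu.data$pctposcases <- flu.data$ALL_INF / flu.data$SPEC_PROCESSED_NB
flu.data$pctposacases <- flu.data$INF_A / flu.data$SPEC_PROCESSED_NB
flu.data$pctposbcases <- flu.data$INF_B / flu.data$SPEC_PROCESSED_NB

# Parse the month/day/year text into Date values
flu.data$SDATE <- as.Date(flu.data$SDATE, "%m/%d/%Y")
flu.data$EDATE <- as.Date(flu.data$EDATE, "%m/%d/%Y")

# Treat the flu season status as a factor, then drop the leading columns
flu.data$TITLE <- as.factor(flu.data$TITLE)
flu.data <- select(flu.data, Year:pctposbcases)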
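
To make the three derived fields concrete, here is a small sketch on a made-up data frame. The numbers are invented for illustration, the toy ALL_INF column is simply assumed to be the sum of INF_A and INF_B, and the helper name safe_share is hypothetical. It also shows one possible way to handle weeks with zero processed specimens, where a plain division would return NaN.

```r
# Toy data (invented numbers), using the same column names as the WHO file
toy <- data.frame(
  SPEC_PROCESSED_NB = c(200, 0, 500),
  INF_A             = c(30, 0, 100),
  INF_B             = c(10, 0, 25)
)
# Assumption for this toy example only: all positives are type A or type B
toy$ALL_INF <- toy$INF_A + toy$INF_B

# Hypothetical helper: return NA instead of NaN when nothing was processed
safe_share <- function(positive, processed) {
  ifelse(processed > 0, positive / processed, NA_real_)
}

toy$pctposcases  <- safe_share(toy$ALL_INF, toy$SPEC_PROCESSED_NB)
toy$pctposacases <- safe_share(toy$INF_A,   toy$SPEC_PROCESSED_NB)
toy$pctposbcases <- safe_share(toy$INF_B,   toy$SPEC_PROCESSED_NB)

print(toy)
```

In the first row, 40 of 200 specimens are positive, so pctposcases is 0.2; the second row has no processed specimens and is left as NA.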

The following code creates the time series to be used in our analysis.

flu.ts <- ts(flu.data[,c(12,16,17,19)], start = 2009, frequency = 52)
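
As a quick illustration of what start = 2009 and frequency = 52 mean, the sketch below builds a toy series of 104 weekly values (just the numbers 1 to 104, not flu data) and inspects how ts() labels them. It assumes the first row corresponds to week 1 of 2009. Note that ts() treats every year as exactly 52 observations, so in real data any reporting year with a 53rd week will shift the later labels slightly.

```r
# Toy series: 104 consecutive weekly values starting in week 1 of 2009
x <- ts(seq_len(104), start = 2009, frequency = 52)

start(x)  # year and week of the first observation
end(x)    # year and week of the last observation

# Observation 53 is the first week of the second year
time(x)[53]

# window() selects by (year, week); this keeps the second year only
second_year <- window(x, start = c(2010, 1))
length(second_year)
```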

Data Exploration and Insights

Plot All cases reported for flu data

This chart shows all flu cases reported in the United States from 2009 through January 2018. Note the clear seasonality and the upward trend since 2010.

autoplot(flu.ts[,3]) +
        ggtitle("Flu Cases per Week") +
        xlab("Weeks, 2009-present") +
        ylab("Flu Cases") + 
        theme_classic()

Adding Flu A and B

Here types A and B are layered in against the total flu data. Notice that flu A cases are much higher than flu B cases. This is expected, as type A is known to be more common and more prone to cause a pandemic outbreak.

autoplot(flu.ts[,1:3]) +
        ggtitle("Flu Cases per Week") +
        xlab("Weeks, 2009-present") +
        ylab("Flu Cases") + 
        theme_classic()

Plot Cases of Type A against B

Plotting cases of type A against type B shows an interesting phenomenon: type B tends to start rising only after type A has peaked for the season, and it peaks when the type A season is nearly over. The other interesting phenomenon is the irregular data between 2009 and 2010, the same period as the “Swine” flu pandemic. We will come back to this in a moment.

autoplot(flu.ts[,c(1:2)]) +
        ggtitle("Flu Cases A and B per Week") +
        xlab("Weeks, 2009-present") +
        ylab("Flu Cases") + 
        theme_classic()
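
One way to put a number on the observation that type B peaks after type A is a cross-correlation. The sketch below is not part of the original analysis: it uses two invented weekly series, with the type B bump deliberately placed 8 weeks after the type A bump, and base R's ccf() to recover that offset. For ccf(x, y), the value at lag k estimates the correlation between x[t+k] and y[t], so a peak at a negative lag means the first series leads the second.

```r
# Toy weekly series (made-up shapes): three seasons of 52 weeks each
week <- rep(0:51, times = 3)
type_a <- 1000 * exp(-((week - 10)^2) / (2 * 3^2))  # bump centered on week offset 10
type_b <- 400 * exp(-((week - 18)^2) / (2 * 3^2))   # same shape, 8 weeks later

# Cross-correlation for lags of -20 to +20 weeks, without drawing the plot
cc <- ccf(type_a, type_b, lag.max = 20, plot = FALSE)

# Lag (in weeks) at which the two series line up best
best_lag <- cc$lag[which.max(cc$acf)]
best_lag
```

Here the maximum falls at lag -8, matching the 8-week delay built into the toy data. On the real data the same call could be tried on the type A and type B case columns, keeping in mind that lags for a ts object with frequency 52 are reported in fractions of a year rather than in weeks.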

Plot onto a polar plot

Given what we know about the seasonality and our irregular 2009-2010 data, the polar plot is a great visual to show our seasonality and the Swine flu exception.

ggseasonplot(flu.ts[ ,3], polar = TRUE) +
        ylab("Flu Cases")+
        ggtitle("Flu Cases by Week")
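
A numeric companion to the polar plot is the average number of cases for each week of the year. This is a related summary rather than what ggseasonplot itself computes (it draws one line per year). The sketch below uses an invented three-year weekly series whose seasonal bump is centered on week 6 and grows each year; the variable names are illustrative only.

```r
# Toy data: three years of weekly counts with a peak in week 6 of each year
week <- rep(1:52, times = 3)
season_scale <- rep(c(1, 1.5, 2), each = 52)  # each season larger than the last
cases <- ts(100 + 900 * season_scale * exp(-((week - 6)^2) / (2 * 3^2)),
            start = 2009, frequency = 52)

# cycle() gives each observation's position within the year (1 to 52);
# averaging by that position yields one mean value per week of the year
weekly_profile <- tapply(as.numeric(cases), as.integer(cycle(cases)), mean)

# Week of the year with the highest average count
peak_week <- as.integer(names(which.max(weekly_profile)))
peak_week
```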

Looking at Flu Strain, starting with H1N1 outbreak of 2009

Let’s look at the anomaly that is 2009-2010. As noted earlier, the H1N1 (Swine Flu) strain was rampant during this period, and the polar plot clearly shows it. How does the H1N1 strain compare to the others, though?

flu.h1n1 <- ts(flu.data[,c(8,12,16,17,19)], start = 2009, frequency = 52)

Plot cases of H1N1 compared to total flu cases

Notice the outbreak of this strain in the 2009-2010 time frame, but also notice that it appears to recur on a roughly 2-year cycle. It appears to be beginning another surge in 2018.

autoplot(flu.h1n1[,c(1,4)]) +
        ggtitle("Strain AH1N1 Flu Cases per Week") +
        xlab("Weeks, 2009-present") +
        ylab("Flu Cases") + 
        theme_classic()

Looking at Flu A Strains compared to overall flu cases

What about all of the other types of A strains?

flu.typea <- ts(flu.data[,c(7:11,16,17,19)], start = 2009, frequency = 52)

Plot cases of Type A strains compared to total flu cases

autoplot(flu.typea[,c(1:5,7)]) +
        ggtitle("Flu A Cases per Week Compared to All") +
        xlab("Weeks, 2009-present") +
        ylab("Flu Cases") + 
        theme_classic()

Looking at Flu B Strains compared to overall flu cases

flu.typeb <- ts(flu.data[,c(13:15,16,17,19)], start = 2009, frequency = 52)

Plot cases of Type B strains compared to total flu cases

autoplot(flu.typeb[,c(1:3,5)]) +
        ggtitle("Flu B Cases per Week Compared to All") +
        xlab("Weeks, 2009-present") +
        ylab("Flu Cases") + 
        theme_classic()

Tests for the flu compared to flu cases

Looking now at the flu tests administered.

flu.test <- ts(flu.data[,c(5:6,12,16:17,20:22)], start = 2009, frequency = 52)

Plot flu tests processed

This plot shows the number of flu tests processed by week alongside the number of flu cases. Notice again the seasonality and the trend. One would assume that in years where the flu is more prevalent, a higher number of tests would be done as a precautionary measure. We find that higher test volumes do correlate with higher case counts.

autoplot(flu.test[,c(2,5)])+
        ggtitle("Processed Flu Tests and Cases") +
        xlab("Weeks, 2009-present") +
        ylab("Flu Tests") + 
        theme_classic()

Correlation test:

cor(flu.data[6], flu.data[17])
##                     ALL_INF
## SPEC_PROCESSED_NB 0.9042479
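
For readers who want to see what cor() is computing, here is a minimal sketch of the Pearson correlation worked out by hand on two short invented vectors (the names tests and cases and their values are made up). It checks that the manual formula agrees with cor().

```r
# Toy weekly figures (invented): tests processed and positive cases
tests <- c(100, 150, 300, 500, 800)
cases <- c(10, 20, 45, 90, 160)

# Pearson correlation: covariance of the deviations from the mean,
# scaled by the size of each variable's deviations
dev_tests <- tests - mean(tests)
dev_cases <- cases - mean(cases)
manual_r <- sum(dev_tests * dev_cases) /
  sqrt(sum(dev_tests^2) * sum(dev_cases^2))

# Built-in equivalent
builtin_r <- cor(tests, cases)

all.equal(manual_r, builtin_r)
```

Passing two plain vectors, such as flu.data$SPEC_PROCESSED_NB and flu.data$ALL_INF, returns a single number instead of the one-cell matrix shown above. If either column contains missing values, cor() returns NA unless an option such as use = "complete.obs" is supplied.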

The following graph shows the percentage of cases diagnosed from the tests submitted, split by type A and type B. Outside of the H1N1 outbreak in 2009-2010, the percentage of cases diagnosed is relatively consistent from season to season. This would seemingly debunk the “panic” theory, which holds that the extra testing done during high flu activity years (such as 2015-16 and 2017-18) would produce more negative tests and therefore a lower percentage of positives.

autoplot(flu.test[,7:8])+
        ggtitle("Processed Flu Tests and Cases Comparing A vs. B") +
        xlab("Weeks, 2009-present") +
        ylab("Share of Tests Positive") +
        theme_classic()

Initial ARIMA Models

Initial time series modeling was performed using the stlm function from the forecast package. This function takes a time series, applies an STL decomposition to it, and then models the seasonally adjusted data using a specified modeling technique. ARIMA modeling is chosen here.
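
The model calls and their output are not reproduced in this write-up, so the following is only a sketch of what an stlm fit with ARIMA can look like. It runs on a synthetic weekly series (a sine wave plus noise, generated here for illustration) rather than the flu data, and the choices s.window = "periodic" and h = 52 are assumptions, not the settings used in the analysis.

```r
library(forecast)

# Synthetic weekly series: six years of a seasonal wave plus random noise
set.seed(42)
n_weeks <- 52 * 6
t <- seq_len(n_weeks)
toy_cases <- ts(500 + 300 * sin(2 * pi * t / 52) + rnorm(n_weeks, sd = 30),
                start = 2009, frequency = 52)

# STL decomposition, then an ARIMA model on the seasonally adjusted series
fit <- stlm(toy_cases, s.window = "periodic", method = "arima")

# Forecast one year (52 weeks) ahead; the seasonal component is added back
fc <- forecast(fit, h = 52)
head(fc$mean)
```

The same pattern could be applied to a single column of flu.ts, for example flu.ts[,3] for all flu cases, and the result charted with autoplot(fc).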

Flu A

Flu B

All Flu

Adjusted ARIMA Models

Flu A

Flu B

All Flu

TBATS Models

TBATS is an exponential smoothing state space model with a Box-Cox transformation, ARMA errors, and trend and seasonal components.
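
As with the ARIMA section, the fitted models are not shown here, so this is a hedged sketch of a TBATS fit using the tbats function from the forecast package. The series is synthetic (generated below for illustration), and the settings use.parallel = FALSE and h = 26 are assumptions made to keep the example small.

```r
library(forecast)

# Synthetic weekly series: five years of a seasonal wave plus random noise
set.seed(7)
n_weeks <- 52 * 5
t <- seq_len(n_weeks)
toy_cases <- ts(500 + 300 * sin(2 * pi * t / 52) + rnorm(n_weeks, sd = 30),
                start = 2009, frequency = 52)

# tbats() decides for itself whether to use the Box-Cox transformation,
# a trend, and ARMA errors, and takes the seasonal period from the series
fit_tbats <- tbats(toy_cases, use.parallel = FALSE)

# Forecast half a year (26 weeks) ahead
fc_tbats <- forecast(fit_tbats, h = 26)
head(fc_tbats$mean)
```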

Flu A

Flu B

All Flu

Final Thoughts