I will do a time series analysis on the number of tornadoes that have been recorded each month in the United States since 1990. This data was first officially recorded by the National Oceanic and Atmospheric Administration (NOAA) starting in 1950. At first, I was going to make forecasts using the complete data from 1950, but there are inconsistencies from the earlier data. Over time, the frequency of tornadoes increases. This increase is due to an improvement in technology and recording devices for confirming sightings. Specifically, the widespread implementation of Doppler radar and satellite imaging improved the forecasting and sightings of tornadoes in recent decades. The complete data can be shown in the plot below. I will analyze this time series data and forecast the number of tornadoes in the coming months of 2021. There is already recorded data for April, so I can see if the predictions match the current trend.
Forecasting the number of tornadoes in the future can give people an idea of potential damages and also help people prepare. However, the number of tornadoes that occur is not always the best predictor of damages and fatalities. The severity of the tornadoes also need to be taken into account. For instance, in 1973 there were 1102 tornadoes recorded with 89 fatalities recorded, whereas in 1974 there was a single tornado outbreak from April 3th to April 4th that killed 307 people. The number of recorded tornadoes in 1974 is 945, 157 less than 1973. A system was implemented in 1973 called the Fujita (F) Scale which took into account the damages and wind speeds of tornadoes. The Fujita Scale ranges from F0 to F5, each indicating the severity of the storm. Ratings are given by looking at the damages left behind in the path of the tornado, with F0 being light damage and F5 being incredible damage. Improvements to the scale were made in 2007 with the Enhanced Fujita (EF) Scale, ranging from EF0 to EF5. With the EF scale and improving technology, the ability to forecast and prepare for tornadoes is getting better each year. Here is a list of infamous outbreaks in the United States:
and probably the most infamous…
When the Fujita Scale was introduced, the Techniques Development Unit of the National Severe Storms Forecast Center hired students to read publications and newspaper articles on tornadoes since 1950 and to assign F-scale damage ratings to these storms. This is why storms before 1973 have ratings.
The majority of tornadoes in the United States occur in an area called “Tornado Alley,” encompassing the central plain states from Texas to North Dakota. However, in recent years, research (such as this research and this research) has indicated that this high frequency area is shifting eastward toward states like Tennessee, Alabama, and Georgia. This area was named “Dixie Alley.”
Source: KSWO News, Research Indicates that the Significant Tornado Threat is Shifting Easward - Away From “Tornado Alley”
The problem with this shift is the lack of preparation across Dixie Alley. Most homes across the central plains have storm shelters, tornado alarms, and other measures in place to keep them out of harm’s way. However, most homes across Dixie Alley have no measures in place regarding tornado safety.
I want to forecast the number of tornadoes that will occur in the coming months of 2021. So far there have been 193 confirmed tornadoes in 2021, with 46 in April. Here is a plot of the complete monthly tornado data since 1990 with some of the outbreaks highlighted.
plot.ts(data,
xlab = "Year",
ylab = "Tornado Count",
main = "Monthly Confirmed Tornadoes in the US Since 1990",
ylim = c(0, 850))
text(2006, 800, labels = "2011 Super Outbreak", cex = 0.5, font = 2)
arrows(2009, 790, x1 = 2010.75, y1 = 750, code = 0)
text(2015, 600, labels = "2019 May Outbreak", cex = 0.5, font = 2)
arrows(2017, 575, x1 = 2018.9, y1 = 500, code = 0)
text(1995, 650, labels = "Outbreak Sequence of May 2003 and 2004", cex = 0.5, font = 2)
arrows(1997, 625, x1 = 2002.5, y1 = 500, code = 0)
Now let’s use the decompose() function to obtain the random noise. This function will remove the trend and seasonality.
trend <- decompose(data)$trend
season <- decompose(data)$seasonal
noise <- decompose(data)$random
noise <- noise[!is.na(noise)]
trend <- trend[!is.na(trend)]
plot(decompose(data))
Highlighted above is each component of the decomposition: observed, trend, seasonal, and random. Tornadoes are seasonal, so we expected a heavy seasonal component. The random component looks stationary with the exception of the one clear outlier. This outlier is the month of April in 2011 where there were five major outbreaks, including the Super Outbreak of 2011 from April 25-28. This month spawned a record 771 confirmed tornadoes! There was a total of four EF5 and 13 EF4 tornadoes, and the estimated total damages were $10.2 billion, the most for an outbreak ever!
I will apply multiple tests from the “randtests” package to test whether the random component is independent.
Box.test(noise, type="Ljung-Box", lag=20)
##
## Box-Ljung test
##
## data: noise
## X-squared = 56.962, df = 20, p-value = 2.08e-05
turning.point.test(noise)
##
## Turning Point Test
##
## data: noise
## statistic = -2.8287, n = 363, p-value = 0.004674
## alternative hypothesis: non randomness
difference.sign.test(noise)
##
## Difference Sign Test
##
## data: noise
## statistic = 0.18157, n = 363, p-value = 0.8559
## alternative hypothesis: nonrandomness
rank.test(noise)
##
## Mann-Kendall Rank Test
##
## data: noise
## statistic = -0.10606, n = 363, p-value = 0.9155
## alternative hypothesis: trend
qqnorm(noise)
qqline(noise)
Based on all of the tests, it appears that there is still a trend in the data. I think this is being caused by the major outbreaks, such as the 2011 and 2019 outbreaks. The Ljung-Box test was conclusive that the data is independent, so I will go with that conclusion. We ran into this same problem with the Johnson & Johnson data and Australian Wine data, but did not run into problems so I will continue.
To choose the number of AR and MA coefficients I will start by analyzing the autocorrelation function (ACF) and the partial autocorrelation function (PACF).
acf(noise, lag.max = 50)
acf(noise, lag.max = 50, type = "partial")
The ACF appears to cut off at lag 6 and the PACF appear to gradually decrease. This is an indication that an MA(6) model is appropriate. However, I will ultimately pick a model by looping through different models and choosing the one with the best AICC.
iterations <- 6 # maximum order we want to test for each p and q
AICCs <- matrix(NA, nrow = iterations + 1, ncol = iterations + 1)
bestAICC <- Inf
bestOrder <- c()
for (i in 0:iterations) { # loop over p
for (j in 0:iterations) { # loop over q
AICC <- arima(noise, order = c(i,0,j), include.mean = F, method = "ML")$aic
if (AICC < bestAICC){
bestOrder <- c(i,j) # (p,q)
bestAICC <- AICC
print(paste("The optimal number of AR and MA coefficients are", bestOrder[1], "and", bestOrder[2], sep = " "))
}
AICCs[i+1,j+1] <- AICC
}
}
## [1] "The optimal number of AR and MA coefficients are 0 and 0"
## [1] "The optimal number of AR and MA coefficients are 0 and 1"
## [1] "The optimal number of AR and MA coefficients are 0 and 2"
## [1] "The optimal number of AR and MA coefficients are 0 and 3"
## [1] "The optimal number of AR and MA coefficients are 0 and 4"
## [1] "The optimal number of AR and MA coefficients are 1 and 3"
## [1] "The optimal number of AR and MA coefficients are 2 and 1"
## [1] "The optimal number of AR and MA coefficients are 2 and 2"
## [1] "The optimal number of AR and MA coefficients are 2 and 3"
## [1] "The optimal number of AR and MA coefficients are 6 and 6"
I was surprised that the MA(6) was never even considered. From the results, there is an indication that the best model would be an ARMA(6,6). Because this model produced the minimum AICC, I will choose this to make my final predictions. Next, I will use the Hannan-Rissanen algorithm to estimate the ARMA coefficients.
model <- HannanRissanen(noise, bestOrder[1], bestOrder[2])
phi_hat <- model$phi
theta_hat <- model$theta
model
## $phi
## [1] -0.18369346 -0.03586984 0.10218728 0.04082744 -0.16024526 -0.23049881
##
## $theta
## [1] -0.01093936 -0.35901544 -0.49905523 -0.27119388 -0.01602192 0.14323928
The model coefficient can be shown above. Lastly, I will make predictions using the model found in the last step. I will use the Innovations-style h-step forecast to predict the the number of tornadoes in the remaining months of 2021.
pred2021 <- rep(0, 9)
for (i in 7:15) {
n <- length(noise)
xhats <- hStepPred(noise, i, phi_hat, theta_hat)
fit <- holt(trend, h = i)
pred2021[i-6] <- round(fit$mean[i]+season[i-3]+xhats[i])
}
finalPredictions <- as.data.frame(pred2021)
colnames(finalPredictions) <- "Predictions"
rownames(finalPredictions) <- c("Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
finalPredictions
## Predictions
## Apr 161
## May 245
## Jun 199
## Jul 105
## Aug 64
## Sep 49
## Oct 45
## Nov 40
## Dec 13
predTS <- ts(finalPredictions, start = c(2021, 4), frequency = 12)
plotTS <- window(data, start=c(2017,1), end = c(2021,3))
plot(NA, NA,
xlab = "Year",
ylab = "Tornado Count",
xlim = c(2017,2022),
ylim = c(0, max(as.matrix(plotTS), as.matrix(finalPredictions))),
main = "Tornadoes in the US from 2017 to 2022")
points(plotTS, type = "l")
points(predTS, type = "p", pch = 25, col = "red")
The forecasts look plausible and appear like a typical tornado season. However, with tornado seasons, it is extremely difficult to predict the catastrophic outbreaks.
Next, we may ask ourselves what damage should we expect? Or, how many fatalities might there be? I will look back into the past to see what year is most like the forecasted year 2021. I will then look at some statistics from those years to see what we might expect. To find these years, I will use the sum-of-squared-differences to compare with 2021.
year2021 <- c(data[373:375], finalPredictions[[1]])
diff <- rep(0, floor(length(data)/12))
for (i in 1:floor(length(data)/12)) {
diff[i] <- sum((data[(i*12-11):(i*12)] - year2021)^2)
}
closestYears <- order(diff)
plot.ts(year2021,
type = "l",
xlab = "Month",
ylab = "Tornado Count",
main = "2021 vs. 2000 and 2007", col = "black")
lines(data[(closestYears[1]*12-11):(closestYears[1]*12)],
type = "l",
col = "red")
lines(data[(closestYears[2]*12-11):(closestYears[2]*12)],
type = "l",
col = "blue")
legend(10, 240,
legend=c("2021", "2000", "2007"),
col=c("black", "red", "blue"),
lty=1, cex=0.8, text.font=4,
title="Year", bg='lightblue')
We can see that years 2000 and 2007 are very close to the current trend and forecast of 2021. Let’s now take a look at those years for the amount of damage recorded and the number of fatalities.
In the year 2000, there were $424 million of damages ($652 million adjusted for inflation) and there were 41 fatalities. In the year 2007, there were $1.4 billion of damages ($1.8 billion adjusted for inflation) and there were 81 fatalities. These figures are far different than I anticipated. This is an indication that we will not be able to make a good prediction for damages and fatalities. There may or may not be any outbreaks this season, and let’s hope not.
Tornado data is particularly difficult to forecast because of these outbreaks. Outbreaks seem to be completely random and follow no pattern. Even though the 2007 tornado season matched the number of tornadoes for 2000, there were some super-outbreaks that caused catastrophic damage. For example, from May 4-6 there was a major tornado outbreak that tore through Texas all the way to South Dakota. The worst tornado that formed in this outbreak was on May 4th. This tornado was rated at an EF5, and destroyed 95% of the entire town of Greensburg, Kansas. This town is still recovering from this natural disaster today. This outbreak from May 4-6 caused $268 million ($342 million adjusted for inflation) worth of damage, caused 14 fatalities, and spawned a total of 129 confirmed tornadoes.
There is a trend that every year a Fujita Scale is created or changed there is a major outbreak (1974 and 2007), so they should not update it. LOL
The biggest lesson I learned from this project was how difficult is is to predict tornadoes. Although my model did make forecasts that make sense, we still have no idea what is going to happen because of outbreaks. So far, there have been 46 tornadoes recorded in April as of April 26th. My forecast shows there should be 161 tornadoes this month. It could also be the case that not all of the confirmed cases have been filed yet.
To make predictions for this project I took the following steps.
The current technology that storm trackers and storm reporters have are keeping people safe across the United States. Statisticians are also keeping people safe by making correct forecast of the weather every day.