An Inquiry into U.S. Forecasting Error

Introduction

Which areas of the U.S. struggle the most with temperature forecasting accuracy, and why? To answer this question empirically, I analyze datasets sourced from the National Weather Service, covering 16 months of forecasts and observations from 167 cities across the United States. The data consists of 20 meteorological and geographic variables, indexed by time and location. Given the complexity of forecasting variability, I refined the scope of my analysis by first identifying key “problem areas”, and then examining variables most likely to contribute to forecast errors in those areas. By systematically narrowing down potential influences, I aim to uncover potential factors in determining forecasting accuracy, or perhaps more accurately, the lack thereof.

Data Wrangling & Initial Analysis

After initially reading in my data and ensuring the variables were all classified correctly, I inspected the possible_error variable to determine whether any errors were present, and if so, figure out how to fix them. In doing so, I identified the following recurring issues:

The outlook “VRYCLD”, meant to identify observations in which forecasters expect very cold weather, was used incorrectly in several cases.
A forecast temperature of -10 was incorrectly recorded in several cases.
A combination of the above errors was present for a few observations.
Some observed temperature values were incorrectly recorded as 0 or 108.

I determined that these errors seemed to be the product of incorrectly entered data, and because they were only a few cases out of a very large dataset, I decided to replace each incorrect case with NA.
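A minimal sketch of this cleaning step in Python might look like the following. Only possible_error is named in the data described above; the field names outlook, forecast_temp, and observed_temp, and the reduction of possible_error to a boolean flag, are assumptions made for illustration. The sketch also treats every listed value in a flagged row as suspect, which is a simplification.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One forecast/observation pair. Field names are hypothetical."""
    outlook: Optional[str]
    forecast_temp: Optional[float]
    observed_temp: Optional[float]
    flagged: bool  # assumed: True when possible_error marks the row


def blank_suspect_values(rec: Record) -> Record:
    """Return a copy of a flagged record with suspect values set to None."""
    if not rec.flagged:
        return rec  # unflagged rows are left untouched
    cleaned = rec
    if rec.outlook == "VRYCLD":
        cleaned = replace(cleaned, outlook=None)
    if rec.forecast_temp == -10:
        cleaned = replace(cleaned, forecast_temp=None)
    if rec.observed_temp in (0, 108):
        cleaned = replace(cleaned, observed_temp=None)
    return cleaned
```

Here None plays the role of NA, so later calculations can skip the blanked values rather than dropping the whole row.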

Once the data was ready to work with, I constructed error variables to use in my analysis. The two most relevant are absolute error (useful for comparing the magnitudes of errors) and bias (signed error, useful for determining whether a given observed temperature was underestimated, overestimated, or accurately forecast). An initial review of these variables suggested that underestimation was more prevalent across the dataset as a whole, but I wanted to create a visualization to be sure, and additionally to determine where absolute error was highest.
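The two error variables can be sketched in Python as follows. The sign convention (forecast minus observed, so that a negative bias means the observed temperature was underestimated) is an assumption, as are the function and parameter names.

```python
from typing import Optional


def bias(forecast_temp: Optional[float],
         observed_temp: Optional[float]) -> Optional[float]:
    """Signed error: forecast minus observed (sign convention assumed)."""
    if forecast_temp is None or observed_temp is None:
        return None  # a missing value on either side gives a missing error
    return forecast_temp - observed_temp


def absolute_error(forecast_temp: Optional[float],
                   observed_temp: Optional[float]) -> Optional[float]:
    """Magnitude of the error, ignoring its direction."""
    signed = bias(forecast_temp, observed_temp)
    return None if signed is None else abs(signed)


def describe(signed: Optional[float]) -> str:
    """Label a signed error as under-, over-, or accurately forecast."""
    if signed is None:
        return "missing"
    if signed < 0:
        return "underestimated"
    if signed > 0:
        return "overestimated"
    return "accurate"
```

For example, a forecast of 60 against an observed 65 gives a bias of -5 and an absolute error of 5, which describe labels as underestimated.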

Indeed, we see here that forecast bias is predominantly characterized by underestimation at a country-wide level. Northern and inland states seem to have the highest errors, with Alaska and Montana exemplifying this trend. To ensure the broader validity of this, I checked to see if the trend held for both high and low forecasts.

Here we see that the general trend does hold, though it’s clear that error magnitude is greater for low temperature forecasts. Given our previous discovery that underestimation is more prevalent than overestimation, if low temperatures are less accurately forecast than high ones, then one potential explanation for the increased inaccuracy seen in northern states may be a tendency for forecasters to expect colder temperatures than are actually observed in colder states. This pattern is familiar to most of us who reside in northern states, and may have climate change to blame. That, however, would not be easily parsed from the data I’m working with, so to identify other factors, I will return to the general trend of region’s apparent effect on forecast accuracy. If it is the case, as it appears here, that region is indicative of forecast error magnitude, it would be prudent to examine geographic factors as they relate to this error.
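The comparison between high and low temperature forecasts comes down to averaging absolute error within each forecast type. A rough Python sketch, assuming each row has been reduced to a (forecast type, absolute error) pair with the hypothetical labels "high" and "low":

```python
from collections import defaultdict
from statistics import mean


def mean_abs_error_by_type(rows):
    """Average absolute error per forecast type.

    rows: iterable of (forecast_type, abs_error) pairs; abs_error may be
    None for cleaned-out values, which are skipped.
    """
    grouped = defaultdict(list)
    for forecast_type, abs_error in rows:
        if abs_error is not None:
            grouped[forecast_type].append(abs_error)
    return {kind: mean(values) for kind, values in grouped.items()}
```

Using a (state, forecast type) tuple as the first element of each pair would give the same summary per state.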

Geographic Factors and Forecast Accuracy

I identified three variables which I deemed to be geographically important: latitude, longitude, and distance to coast. In order to get a better idea of what, if any, role these variables played in determining the regional forecasting error rates I previously observed, I set up a correlation matrix between these geographic factors and the aforementioned error variables, with absolute error split into categories for high and low temperature forecasts.

We can see many interesting correlations here, but in order to continue to pursue my angle of regional error differences, I will condense the most important information into two points:

It is, in fact, true that northern states (those with larger latitudes) and inland states (those with a greater distance to the coast) are associated with higher error magnitudes, in both high and low temperature forecasting.
Western states (those with lower longitudes) are associated with higher error magnitudes, but only in low temperature forecasting; the correlation between longitude and high temperature forecast error is not strong enough to suggest a relationship. However, western states are associated with higher temperature bias, suggesting that overestimation is specifically prevalent in the western U.S.
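Correlations like those summarized in the two points above could be computed along these lines. This sketch assumes Pearson correlation, which is not specified above, and uses hypothetical column names; pairs with a missing value are skipped.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, skipping None pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def correlation_table(columns, geo_names, error_names):
    """Correlate each geographic column with each error column.

    columns: dict mapping a column name to a list of values.
    Returns a nested dict: table[geo_name][error_name] -> coefficient.
    """
    return {
        geo: {err: pearson(columns[geo], columns[err]) for err in error_names}
        for geo in geo_names
    }
```

A positive coefficient between latitude and an error column would correspond to larger errors further north; a negative coefficient between longitude and an error column would correspond to larger errors further west.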

So, we now know which areas in the United States struggle more with forecasting accuracy: western states, northern states, and inland states. The question still remains, however, of why that is.

To attempt to answer that, I will take an approach similar to the one I used to find struggling regions. I will identify variables within the dataset that seem relevant to explaining regional differences in meteorological conditions, and then work to identify factors that make forecasting more difficult in certain regions.

Meteorological Analysis

This time, I have identified four variables that seem to be of potential significance: outlook (the type of weather expected), elevation, observed precipitation, and wind speed. They are all numeric, except for outlook, which is categorical. I will spare the details of my investigation into outlooks, as nothing of note emerged through this regional lens: the top 5 outlooks for the three regions of note (inland, northern, and western) were identical. Here, northern states are those with average latitudes greater than the national median latitude, western states are those with average longitudes less than the national median longitude, and inland states are those with average distance to coast greater than the national median distance to coast. Inquiries into the three numeric variables, however, proved to be more fruitful.

Above are the two boxplots which yielded interesting results for regional elevation disparities. Inland states tend to have higher elevations than coastal states. Similarly, western states tend to have higher elevation than eastern states. This suggests that increased elevation may lead to decreased forecasting accuracy.

Similarly, the two regions with notable trends to display are inland and western. Inland states tend to have less rainfall than coastal states, and western states tend to have a significantly lower amount of rainfall than their eastern counterparts. So, less rainfall may also lead to decreased forecasting accuracy.

For wind speed, only one region displayed a significant disparity from its counterpart: Inland states tend to have greater wind speeds than coastal states. This could mean that faster wind speeds lead to less accurate forecasting.
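The regional split defined earlier (state averages compared with a national median) and the comparisons in the three paragraphs above can be sketched in Python as follows. The row layout and field names are hypothetical, and taking the national median over the state averages is an assumption, since the median could also be taken over individual cities.

```python
from collections import defaultdict
from statistics import mean, median


def state_means(rows, field):
    """Average `field` per state, skipping missing values."""
    grouped = defaultdict(list)
    for row in rows:
        if row.get(field) is not None:
            grouped[row["state"]].append(row[field])
    return {state: mean(values) for state, values in grouped.items()}


def states_by_median(rows, field, above=True):
    """States whose average `field` is above (or below) the median state average.

    above=True suits latitude (northern) and distance to coast (inland);
    above=False suits longitude (western).
    """
    averages = state_means(rows, field)
    cutoff = median(averages.values())  # assumed: median of state averages
    if above:
        return {s for s, v in averages.items() if v > cutoff}
    return {s for s, v in averages.items() if v < cutoff}


def compare_regions(rows, region_states, field):
    """Median of `field` inside and outside a region, from city-level rows."""
    inside = [r[field] for r in rows
              if r["state"] in region_states and r.get(field) is not None]
    outside = [r[field] for r in rows
               if r["state"] not in region_states and r.get(field) is not None]
    return median(inside), median(outside)
```

Calling states_by_median on distance to coast and then compare_regions on elevation, precipitation, or wind speed yields the pair of medians that each boxplot comparison rests on. Judging whether a difference is significant would need a separate statistical test, which this sketch does not attempt.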

The glaring problem here is that northern states did not differ significantly from southern states in any of these meteorological variables. We know that northern states do struggle more with forecast accuracy, but nothing here provided an empirically compelling reason for that disparity. I would posit, as I mentioned earlier, that the lack of accuracy for northern states may be exacerbated significantly by climate change.

Conclusion

Rather than summarize what I have already stated, I will conclude by attempting to use what we’ve learned about the ways in which the United States’ forecasting accuracy is highly regionally-dependent to make inferences as to the underlying causes of the trends we have parsed.

Western states struggle with low-temperature forecasting due to high elevation, dry conditions, and wind speed, making them more likely to overestimate temperatures. The atmosphere is thinner at higher elevations, leading to increased cooling effects after the sun goes down. That, coupled with the prevalence of deserts in the western United States (which are famously hot during the day and drastically cooler at night), may make it difficult for forecasting models to make accurate predictions, as they may overestimate night-time temperatures.

Inland states have less rain, higher wind speeds, and more elevation, which contribute to increased forecast errors. One potential reason for this is the diverse terrain of inland states. Mountain ranges in the region could produce forecasting challenges, as could rain shadow effects of mountains. Further, the sheer variability of terrain and biome types makes characterizing the weather patterns in this region especially hard, since an inland region can take on a large range of both latitudes and longitudes.

Northern states’ accuracy struggles are not explained by precipitation, elevation, or wind, suggesting other variables, such as atmospheric instability, may play a role.