In this project, I use data sourced from the National Weather Service over a 16-month period. The data originally covered 167 US cities, but I cleaned it to contain only cities in the continental US, reducing that number to 161. One dataset contained forecast and observed high and low temperatures in degrees Fahrenheit for these cities; the other contained geographic information about major US cities.
The goal of this project is to explore possible explanations for
error in weather forecasts. By error, I mean the difference between the
forecast and observed temperatures, for both the high and the low
temperature. To investigate this for each city, I added a
column to the forecast dataset holding the absolute value of the mean
difference between the forecast and observed temperature for each city.
Then I combined the forecast dataset and the city information dataset by
city and state and created graphics to explore
and represent trends in the error.
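The error column and the join described above can be sketched as follows. This is only a minimal illustration in plain Python, not the code used in the project: the city names, the field name `distance_to_coast`, the helper names, and all of the numbers are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical forecast rows: (city, state, forecast_temp, observed_temp).
# All values are invented for illustration.
forecasts = [
    ("Alpha", "AA", 90, 93),
    ("Alpha", "AA", 88, 87),
    ("Beta", "BB", 70, 66),
    ("Beta", "BB", 72, 70),
]

# Hypothetical city information, keyed by (city, state).
city_info = {
    ("Alpha", "AA"): {"distance_to_coast": 150.0},
    ("Beta", "BB"): {"distance_to_coast": 400.0},
}


def abs_mean_error(rows):
    """Return the absolute value of the mean (forecast - observed) per city."""
    diffs = defaultdict(list)
    for city, state, forecast, observed in rows:
        diffs[(city, state)].append(forecast - observed)
    return {key: abs(mean(values)) for key, values in diffs.items()}


def join_city_info(errors, info):
    """Combine the per-city error with the city information by (city, state)."""
    return {
        key: {"error": err, **info[key]}
        for key, err in errors.items()
        if key in info  # keep only cities present in both datasets
    }


errors = abs_mean_error(forecasts)
combined = join_city_info(errors, city_info)
```

Because the mean is taken before `abs()`, over- and under-forecasts for the same city can cancel each other out. Averaging `abs(forecast - observed)` instead would give a different measure (the mean absolute error).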
On the whole, the error was small: the mean and median of the average error were 2.25 and 2.28 degrees for the high forecast, and 2.42 and 2.35 degrees for the low forecast.
Next, I looked for trends within the data. I first created maps with a colored dot for each city, shaded by its forecast error, to visually survey any trends. Based on these two maps, it initially seemed that the coasts were more consistently accurate than the middle of the country, since most of the Midwest was more lightly colored than the east, west, and south of the country, especially for the high forecast.
I then created a scatterplot to see whether any linear trend appeared between average forecast error and a city’s distance from the coast. For the high forecast error, there does appear to be a slight positive correlation between mean error and distance from the coast. That is not the case for the low forecast error, where most error values are similar to one another regardless of distance from the coast.
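One way to put a number on a trend like this is the Pearson correlation coefficient between distance from the coast and mean error. The sketch below is an assumption about how that could be checked, not something computed in this project; the function name `pearson_r` and the sample values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented example values: distance from the coast and mean high-forecast error.
distance_to_coast = [10.0, 200.0, 450.0, 800.0]
high_error = [1.9, 2.1, 2.4, 2.3]

r = pearson_r(distance_to_coast, high_error)
```

A value of `r` near 0 would match what the low forecast scatterplot suggests, while a small positive value would match the slight upward trend in the high forecast scatterplot.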
The trend seen in the high forecast makes conceptual sense: there are no massive bodies of water near the center of the country, so temperatures there are subject to more uncertainty. Water regulates the temperature of nearby land, such as coastal cities, so their temperatures are likely easier to forecast than those of Midwestern cities, which lack the regulation of large bodies of water and are more affected by cold or warm fronts.
I also looked at forecast error versus city latitude, which gave a graph similar to the previous one, just with a fuzzier trend, since places on the West Coast or in Texas are at larger latitudinal values. Latitude lines don’t affect cities in tangible ways, so I suspect the trend seen here is also due to distance from the coast.
I also explored other possible explanations for the error (city longitude and elevation), but there were no clear trends between these variables and the error.
Concerning the new element we were meant to use in this project, I do
not recall abs() being brought up in class. I used it to
get the absolute value of the error.