Exploring Reasoning Behind Errors in Weather Prediction for US States

Author

Helana Solomon

Introduction

The data sets come from the National Weather Service. Three data sets are provided: weather-forecasts.csv, forecast-cities.csv and outlook-meanings.csv. The first data set, weather-forecasts, includes weather forecasts and observations from 167 cities across 16 months. It also includes the type of temperature (high or low), the forecast outlook (e.g., cloudy, rainy, sunny) and any possible errors. The forecast-cities data set contains information about the cities in weather-forecasts.csv as well as other American cities, including the distance to the coast, the wind speed and the average annual precipitation of each city. The outlook-meanings data set provides the full name for the forecast outlook variable in the weather-forecasts data set. For this analysis, I focused on exploring the error in high and low temperature predictions by looking at the geographic locations of cities, the wind speed and the overall number of high and low temperature errors in the weather-forecasts data set.

Data Quality Issue

The main data quality issue is that some cities in different states share the same name. As shown below, Buffalo, Charleston, Columbus, Portland, Richmond, and Springfield each exist in more than one state. This can cause problems when joining the data sets unless the join is specified by both city and state. Furthermore, for some cities such as Buffalo, the rows for both states (New York and Wyoming) have the same observed and forecast temperatures, which is very unlikely since the climates of the two cities are different. This suggests that there was also a problem when those values were entered.

# A tibble: 13 × 2
   city        state
   <chr>       <chr>
 1 BUFFALO     NY   
 2 BUFFALO     WY   
 3 CHARLESTON  SC   
 4 CHARLESTON  WV   
 5 COLUMBUS    GA   
 6 COLUMBUS    OH   
 7 PORTLAND    ME   
 8 PORTLAND    OR   
 9 RICHMOND    VA   
10 RICHMOND    CA   
11 RICHMOND    RI   
12 SPRINGFIELD IL   
13 SPRINGFIELD MO

# A tibble: 6 × 4
  city    state observed_temp forecast_temp
  <chr>   <chr>         <dbl>         <dbl>
1 BUFFALO NY               28            24
2 BUFFALO WY               28            24
3 BUFFALO NY               13            15
4 BUFFALO WY               13            15
5 BUFFALO NY               13            14
6 BUFFALO WY               13            14
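
The check and the join described above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the code used to produce the tables in this report. The sample rows are made up, and the `wind` field and both function names are assumptions introduced for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; the real rows would come from forecast-cities.csv.
cities = [
    {"city": "BUFFALO", "state": "NY", "wind": 5.0},
    {"city": "BUFFALO", "state": "WY", "wind": 4.0},
    {"city": "DENVER", "state": "CO", "wind": 3.0},
    {"city": "PORTLAND", "state": "ME", "wind": 4.5},
    {"city": "PORTLAND", "state": "OR", "wind": 3.5},
]

def ambiguous_city_names(rows):
    """Return {city: sorted states} for city names that appear in more than one state."""
    states_by_city = defaultdict(set)
    for row in rows:
        states_by_city[row["city"]].add(row["state"])
    return {city: sorted(states) for city, states in states_by_city.items() if len(states) > 1}

def join_on_city_and_state(forecasts, city_rows):
    """Attach city attributes to forecast rows, matching on the (city, state) pair."""
    lookup = {(c["city"], c["state"]): c for c in city_rows}
    joined = []
    for forecast in forecasts:
        match = lookup.get((forecast["city"], forecast["state"]))
        if match is not None:
            joined.append({**match, **forecast})
    return joined

duplicates = ambiguous_city_names(cities)
```

Keying the lookup on the pair rather than on the city name alone is what keeps a Buffalo, WY forecast from picking up the attributes of Buffalo, NY.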

Top 10 cities with the most accuracy issues for both low and high temperatures

The graphs above show the locations of the top 10 cities with the largest number of rows in the data set that have an error between the observed and predicted temperatures. In this context, a high accuracy value means that there was a large difference between the predicted and observed temperature for that day and the time the prediction was made. The first graph shows the top 10 cities with the most inaccuracies for high temperatures, and all of the cities fall in the Midwest region. The states are Colorado, Iowa, Montana, Nebraska, North Dakota, and South Dakota. Montana, North Dakota, Nebraska, and Colorado each have two cities on the list, which may point to similarities between these states that are worth examining.

The second graph shows the top 10 cities with the largest count of inaccuracies for low temperatures. Outside of Alaska and Hawaii, most of the cities on this list fall within the Mountain West and West Coast areas. The cities are in Alaska, Arizona, Colorado, California, Idaho, Montana, Washington, and Wyoming. In contrast to the first graph, California is the only state with more than one city in the top 10. Other than the two outlier cities, the only similarity between the states is that they lie in the western part of the country.

From these graphs, we can infer that Montana and Colorado have a high count of inaccuracies between the observed and predicted temperatures, as they are both in the top 10 for high and low temperatures.
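
The ranking behind these graphs can be approximated by counting, for each city, the rows where the forecast and observed temperatures differ. The Python sketch below is one way to do that with the standard library and is not the code used for the graphs. The sample rows are invented, and the `high_or_low` column name with the values "high" and "low" is an assumption about how the temperature type is stored.

```python
from collections import Counter

# Hypothetical sample rows; the real rows would come from weather-forecasts.csv.
forecasts = [
    {"city": "HELENA", "state": "MT", "high_or_low": "high", "observed_temp": 50, "forecast_temp": 45},
    {"city": "HELENA", "state": "MT", "high_or_low": "high", "observed_temp": 40, "forecast_temp": 40},
    {"city": "HELENA", "state": "MT", "high_or_low": "high", "observed_temp": 30, "forecast_temp": 33},
    {"city": "DENVER", "state": "CO", "high_or_low": "high", "observed_temp": 60, "forecast_temp": 58},
    {"city": "DENVER", "state": "CO", "high_or_low": "low", "observed_temp": 20, "forecast_temp": 20},
    {"city": "MIAMI", "state": "FL", "high_or_low": "high", "observed_temp": 80, "forecast_temp": 80},
]

def top_cities_by_error_count(rows, temp_type, n=10):
    """Rank (city, state) pairs by the number of rows with a nonzero forecast error."""
    counts = Counter()
    for row in rows:
        if row["high_or_low"] != temp_type:
            continue
        # The absolute difference plays the role of the accuracy value in the text.
        if abs(row["observed_temp"] - row["forecast_temp"]) > 0:
            counts[(row["city"], row["state"])] += 1
    return counts.most_common(n)

top_high = top_cities_by_error_count(forecasts, "high")
```

Counting by the (city, state) pair rather than by city name keeps cities that share a name, such as the two Buffalos, from being merged in the ranking.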

Comparing Count of High and Low Temperature Accuracy Errors

These graphs explore the distribution of the size of the difference between observed and predicted temperatures for high and low temperatures. The left subgraph in each larger graph shows the unfiltered data, whereas the subgraph on the right is filtered to include only inaccuracies that are less than or equal to 20. I chose 20 as the cutoff because the graphs are skewed to the right and I wanted a closer look at the denser part of each graph. Surprisingly, after filtering the data set, it becomes clear that the distribution of values is similar between the two types of temperatures. For both high and low temperatures, 2 was the most common error value. We can also see that there were larger error values for high temperatures, because the x-axis reaches the 90 mark, whereas for low temperatures the x-axis only reaches 75.
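
The filtering step described above can be sketched in Python as a tally of error sizes with an optional cutoff. This is only an illustration with made-up error values. The function name is hypothetical, and leaving zero errors out of the tally is an assumption, since the text does not say whether the graphs include them.

```python
from collections import Counter

def error_tally(errors, max_error=None):
    """Count how often each nonzero error size occurs, optionally capped at max_error."""
    kept = [e for e in errors if e > 0 and (max_error is None or e <= max_error)]
    return Counter(kept)

# Hypothetical absolute differences between observed and forecast temperatures.
high_errors = [2, 2, 0, 1, 3, 2, 45, 90]

tally_all = error_tally(high_errors)                   # unfiltered view
tally_zoomed = error_tally(high_errors, max_error=20)  # errors of 20 or less only
most_common_error = tally_zoomed.most_common(1)[0][0]
largest_error = max(tally_all)
```

The cutoff only changes which values are shown, so the most common error size is the same in both views as long as it falls at or below the cutoff.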

Then, to examine the overall count of errors for high and low temperatures, I calculated the total number of rows with accuracy values greater than 0 and compared them across the two temperature types. This graph shows that there are more inaccuracies for low temperatures: low temperatures reach about 250,000 rows and high temperatures reach about 237,500. From this, we can say that weather prediction errors occur more often for low temperatures, even though the difference is not drastic.
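
To put the gap between the two counts in perspective, the sketch below counts rows with a nonzero error per temperature type and then expresses the approximate totals read off the graph as a relative difference. It is a Python illustration only. The function names and the `high_or_low` column with the values "high" and "low" are assumptions, and the two totals are the rounded figures quoted above rather than exact counts.

```python
def count_rows_with_error(rows):
    """Count rows with a nonzero forecast error, split by temperature type."""
    counts = {"high": 0, "low": 0}
    for row in rows:
        if abs(row["observed_temp"] - row["forecast_temp"]) > 0:
            counts[row["high_or_low"]] += 1
    return counts

def relative_gap(larger, smaller):
    """How much bigger the larger count is, as a fraction of the smaller count."""
    return (larger - smaller) / smaller

# Approximate totals read from the graph: low temperatures vs. high temperatures.
gap = relative_gap(250_000, 237_500)
gap_percent = round(gap * 100, 1)
```

With these rounded totals, low temperatures have roughly 5% more rows with an error than high temperatures, which matches the reading that the difference exists but is not drastic.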

Comparing Average Wind Speed and Average Annual Precipitation

Wind Speed

For high temperatures, the average wind speed for cities with an accuracy of zero is 3.35, compared to 3.41 for cities with inaccurate temperature predictions. We can conclude that there is no large or significant difference in the average wind speed for high temperatures depending on the accuracy level (0 or more). However, cities with accuracy values greater than zero have a higher average wind speed by less than .1.

For low temperatures, the average wind speed for cities with an accuracy of zero is 3.39, compared to 3.4 for cities with inaccurate temperature predictions. We can conclude that there is no large or significant difference in the average wind speed for low temperatures depending on the accuracy level (0 or more). In this case, the difference between the averages is .01, with accuracies greater than zero having the higher average.

Average Annual Precipitation

For high temperatures, the average annual precipitation for cities with an accuracy of zero is 38.77, compared to 39.16 for cities with inaccurate temperature predictions. We can conclude that there is no significant difference in the average annual precipitation for high temperatures depending on the accuracy level (0 or more). However, cities with accuracy values greater than zero have a higher average annual precipitation by about .4.

For low temperatures, the average annual precipitation for cities with an accuracy of zero is 40.64, compared to 38.85 for cities with inaccurate temperature predictions. We can conclude that there is no significant difference in the average annual precipitation for low temperatures depending on the accuracy level (0 or more). However, cities with an accuracy of zero have a higher average annual precipitation by about 1.8.
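
The comparisons in this section come down to averaging a city attribute separately for rows with zero error and rows with an error greater than zero, then subtracting. The Python sketch below shows that calculation and applies the subtraction to the averages reported above. It is not the code behind the report. The function name and the `error` field are assumptions, and the `reported` figures are copied from the text rather than recomputed from the data.

```python
from statistics import mean

def mean_by_error_group(rows, attribute):
    """Average an attribute for zero-error rows and for rows with error > 0.

    Raises statistics.StatisticsError if either group is empty.
    """
    zero = [r[attribute] for r in rows if r["error"] == 0]
    nonzero = [r[attribute] for r in rows if r["error"] > 0]
    return mean(zero), mean(nonzero)

# Averages as reported in the text: (zero error, error greater than zero).
reported = {
    ("wind", "high"): (3.35, 3.41),
    ("wind", "low"): (3.39, 3.40),
    ("precipitation", "high"): (38.77, 39.16),
    ("precipitation", "low"): (40.64, 38.85),
}

# Positive values mean the group with errors has the higher average.
differences = {key: round(nonzero - zero, 2) for key, (zero, nonzero) in reported.items()}
```

A negative difference, as for precipitation at low temperatures, means the zero-error group has the higher average. Whether any of these gaps is statistically significant would need a formal test, which this sketch does not attempt.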

For this section, I also created an interactive table where users can click on the type of temperature and compare numerical characteristics (wind, distance to coast, elevation, and average annual precipitation) between cities with no inaccuracy (0) and cities with inaccuracies (> 0). I was unable to embed it in this document as a Shiny app, so I published it separately. You can access the table with this link: https://solomon77h.shinyapps.io/examing_averages_shiny_table/.

Conclusion

Through this analysis, we explored which areas in the U.S. struggle with weather prediction and why they could be struggling. We found that Montana and Colorado were the states that struggled with weather prediction for both high and low temperatures. Overall, we found that prediction errors in the United States occur more often for low temperatures than for high temperatures. We explored two possible reasons why, specifically the average wind speed and the average annual precipitation. Although most of the averages for cities with inaccuracies were larger, the differences were so small that they are considered insignificant. Since the cities with the most inaccuracies for both temperatures are in the Midwest, location is something to focus on in the future. Overall, these are some factors to consider when making predictions for all cities in the future.