The data come from the National Weather Service. Three data sets are provided: weather-forecasts.csv, forecast-cities.csv, and outlook-meanings.csv. The first, weather-forecasts, includes weather forecasts and observations from 167 cities across 16 months. It also records the type of temperature (high or low), the forecast outlook (for example cloudy, rainy, or sunny), and any possible errors. The forecast-cities data set contains information about the cities in weather-forecasts as well as other American cities, including the distance to the coast, the wind speed, and the average annual precipitation. The outlook-meanings data set provides the full name for each value of the forecast outlook variable in weather-forecasts. For this analysis, I focused on the error in high and low temperature predictions by looking at the geographic locations of cities, the wind speed, and the overall number of rows with high and low temperature errors in the weather-forecasts data set.
The main data quality issue is that some cities in different states share the same name. As shown below, Buffalo, Charleston, Columbus, Portland, and Richmond each exist in more than one state. This can cause problems when joining the data sets unless the join is specified by both city and state. Furthermore, for some of these cities, such as Buffalo, the entries for both states (New York and Wyoming) have the same observed and forecast temperatures, which is very unlikely because the two cities have different climates. This suggests that there was also an error when those values were entered.
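One way to surface the duplicate names and keep the join safe is sketched below. The toy data frames and the column names `city`, `state`, `observed_temp`, and `wind` are assumptions for illustration, not the actual contents of the CSV files.

```r
library(dplyr)

# Toy stand-ins for the real tables; names and values are assumed.
forecasts <- tibble(
  city          = c("BUFFALO", "BUFFALO", "PORTLAND", "PORTLAND", "DENVER"),
  state         = c("NY", "WY", "ME", "OR", "CO"),
  observed_temp = c(30, 30, 41, 55, 48)
)
cities <- tibble(
  city  = c("BUFFALO", "BUFFALO", "PORTLAND", "PORTLAND", "DENVER"),
  state = c("NY", "WY", "ME", "OR", "CO"),
  wind  = c(5.1, 5.6, 3.9, 3.2, 4.0)
)

# City names that appear in more than one state.
duplicate_names <- cities |>
  distinct(city, state) |>
  count(city, name = "n_states") |>
  filter(n_states > 1)

# Joining on both keys matches each forecast row to exactly one city.
joined <- forecasts |>
  left_join(cities, by = c("city", "state"))
```

Joining on `city` alone would match each Buffalo row to both Buffalo entries and silently inflate the row count, so comparing `nrow(joined)` with `nrow(forecasts)` is a cheap sanity check.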
The graphs above show the locations of the 10 cities with the largest number of rows in which the observed and predicted temperatures differ. In this context, a high accuracy value means a large difference between the predicted and observed temperature for a given day and the time the prediction was made. The first graph shows the top 10 cities with the most inaccuracies for high temperatures, and all of them fall in the Midwest and the adjacent Mountain West. The states are Colorado, Iowa, Montana, Nebraska, North Dakota, and South Dakota. Montana, North Dakota, Nebraska, and Colorado each have two cities on the list, which may point to similarities between these states worth examining.
The second graph shows the top 10 cities with the largest count of inaccuracies for low temperatures. Outside of Alaska and Hawaii, most of the cities on this list fall within the Mountain West and West Coast. The cities are in Alaska, Arizona, Colorado, California, Hawaii, Idaho, Montana, Washington, and Wyoming. In contrast to the first graph, California is the only state with more than one city in the top 10. Other than the two outlier cities, the only similarity between the states is that they are in the western United States.
From these graphs, we can infer that Montana and Colorado have a high count of inaccuracies between the observed and predicted temperatures, as they both appear in the top 10 for high and low temperatures.
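The ranking behind these graphs can be sketched as follows. This is a minimal example on toy data: it assumes `accuracy` is the absolute difference between hypothetical `observed_temp` and `forecast_temp` columns, and the city values are made up.

```r
library(dplyr)

# Toy high-temperature rows; all values are invented for illustration.
high_temp <- tibble(
  city          = c("HELENA", "HELENA", "HELENA", "DENVER", "DENVER", "MIAMI", "MIAMI"),
  state         = c("MT", "MT", "MT", "CO", "CO", "FL", "FL"),
  observed_temp = c(30, 28, 35, 50, 52, 80, 81),
  forecast_temp = c(33, 29, 31, 48, 51, 80, 82)
) |>
  # Assumed definition of the error measure used in this report.
  mutate(accuracy = abs(observed_temp - forecast_temp))

top_n_cities <- 2  # the report uses 10

# Count the rows with a nonzero error per city, then keep the largest counts.
top_error_cities <- high_temp |>
  filter(accuracy > 0) |>
  count(city, state, name = "n_errors") |>
  slice_max(n_errors, n = top_n_cities, with_ties = FALSE)
```

Counting by both `city` and `state` keeps same-named cities in different states from being merged into one total.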
These graphs explore the distribution of the size of the difference between observed and predicted temperatures for high and low temperatures. The left panel of each graph shows the unfiltered data, whereas the right panel is filtered to include only inaccuracies less than or equal to 20. I chose 20 as the cutoff because the distributions are skewed to the right and I wanted a closer look at the denser part of each one. Surprisingly, once the data are filtered, it becomes clear that the distribution of values is similar between the two types of temperatures. For both high and low temperatures, 2 was the most common error value. The largest errors occurred for high temperatures: the x-axis reaches about 90 for high temperatures, whereas it only reaches about 75 for low temperatures.
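The filtering step and the "most common error value" can be reproduced with a short tabulation. The sketch below uses made-up `accuracy` values; only the cutoff of 20 comes from the analysis above.

```r
library(dplyr)

# Toy error values with a long right tail, as in the unfiltered graphs.
high_temp <- tibble(accuracy = c(0, 1, 2, 2, 2, 3, 5, 8, 35, 90))

cutoff <- 20

# Frequency of each error value at or below the cutoff.
error_counts <- high_temp |>
  filter(accuracy <= cutoff) |>
  count(accuracy, name = "n_rows")

# The error value with the most rows (the mode of the filtered data).
most_common_error <- error_counts |>
  slice_max(n_rows, n = 1, with_ties = FALSE) |>
  pull(accuracy)
```

Note that this filter keeps rows with an error of zero; adding `accuracy > 0` to the filter would restrict the tabulation to inaccurate forecasts only.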
Then, to compare the overall counts for high and low temperatures, I calculated the total number of rows with accuracy values greater than 0 for each type. The graph shows that there are more inaccuracies for low temperatures: about 250,000 rows, compared with about 237,500 for high temperatures. From this, we can say that prediction errors occur somewhat more often for low temperatures, although the difference is not drastic.
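A sketch of this comparison is shown below on toy data; the `high_or_low` column name is an assumption. Alongside the raw count it also computes the share of rows with an error, since a raw count depends on how many rows each temperature type has in the first place.

```r
library(dplyr)

# Toy rows; values are invented for illustration.
forecasts <- tibble(
  high_or_low = c("high", "high", "high", "low", "low", "low", "low"),
  accuracy    = c(0, 2, 5, 1, 0, 3, 4)
)

error_summary <- forecasts |>
  group_by(high_or_low) |>
  summarize(
    n_rows    = n(),                 # all rows of this type
    n_errors  = sum(accuracy > 0),   # rows with a nonzero error
    share_err = n_errors / n_rows    # errors as a fraction of rows
  )
```

If the two types have different numbers of rows, `share_err` is the fairer basis for saying that one type is mispredicted more often than the other.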
The average wind speed for cities with high temperatures and an accuracy
of zero is
`{r} high_temp |> filter(accuracy == 0) |> summarize(mean = round(mean(wind, na.rm = TRUE), 2))`,
compared to
`{r} high_temp |> filter(accuracy != 0) |> summarize(mean = round(mean(wind, na.rm = TRUE), 2))`
for cities with inaccurate temperature predictions.
There is no large difference in the average wind speed for high
temperatures between the two accuracy groups (zero versus greater than
zero), although rows with a nonzero error have a slightly higher average
wind speed, by less than 0.1.
The average wind speed for cities with low temperatures and an
accuracy of zero is
`{r} low_temp |> filter(accuracy == 0) |> summarize(mean = round(mean(wind, na.rm = TRUE), 2))`,
compared to
`{r} low_temp |> filter(accuracy != 0) |> summarize(mean = round(mean(wind, na.rm = TRUE), 2))`
for cities with inaccurate temperature predictions.
Again, there is no large difference in the average wind speed for low
temperatures between the two accuracy groups (zero versus greater than
zero). In this case, the difference between the averages is 0.1, with
the rows that have a nonzero error having the higher average.
The average annual precipitation for cities with high temperatures and
an accuracy of zero is
`{r} high_temp |> filter(accuracy == 0) |> summarize(mean = round(mean(avg_annual_precip, na.rm = TRUE), 2))`,
compared to
`{r} high_temp |> filter(accuracy != 0) |> summarize(mean = round(mean(avg_annual_precip, na.rm = TRUE), 2))`
for cities with inaccurate temperature predictions.
There is no large difference in the average annual precipitation for
high temperatures between the two accuracy groups (zero versus greater
than zero), although rows with a nonzero error have a slightly higher
average annual precipitation, by less than 0.1.
The average annual precipitation for cities with low temperatures and
an accuracy of zero is
`{r} low_temp |> filter(accuracy == 0) |> summarize(mean = round(mean(avg_annual_precip, na.rm = TRUE), 2))`,
compared to
`{r} low_temp |> filter(accuracy != 0) |> summarize(mean = round(mean(avg_annual_precip, na.rm = TRUE), 2))`
for cities with inaccurate temperature predictions.
There is no large difference in the average annual precipitation for low temperatures between the two accuracy groups (zero versus greater than zero). Here, however, the rows with an accuracy of zero have the higher average annual precipitation, by about 0.2.
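The four pairs of inline calculations above follow one pattern, so they can also be written as a single grouped summary. The sketch below runs on toy values for `low_temp`; the helper column `has_error` is hypothetical and not part of the original data.

```r
library(dplyr)

# Toy low-temperature rows; values are invented for illustration.
low_temp <- tibble(
  accuracy          = c(0, 0, 2, 5, 1, 0),
  wind              = c(4.0, 5.0, 4.5, 5.5, 5.0, NA),
  avg_annual_precip = c(40, 30, 35, 25, 45, 20)
)

# One row per accuracy group, one column per city characteristic.
group_means <- low_temp |>
  mutate(has_error = accuracy != 0) |>
  group_by(has_error) |>
  summarize(
    across(
      c(wind, avg_annual_precip),
      \(x) round(mean(x, na.rm = TRUE), 2)
    )
  )
```

Putting both groups in one table makes the differences easier to read side by side, and the same call works for `high_temp`.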
For this section, I also created an interactive table where users can select the type of temperature (high or low) and compare numerical characteristics (wind, distance to coast, elevation, and average annual precipitation) between cities with no error (accuracy of 0) and cities with inaccuracies (accuracy > 0). I was unable to add the Shiny code to this document properly, so I published it separately. You can access the table at this link: https://solomon77h.shinyapps.io/examing_averages_shiny_table/.
Through this analysis, we explored which areas in the U.S. struggle with weather prediction and why they might be struggling. We found that Montana and Colorado were the states that struggled with weather prediction for both high and low temperatures. Overall, we found that prediction errors in the United States occur more often for low temperatures than for high temperatures. We explored two possible reasons why, looking at the average wind speed and the average annual precipitation. Although most of the averages were larger for rows with inaccuracies (the exception being annual precipitation for low temperatures), the differences were too small to suggest a meaningful relationship. Since the cities with the most inaccuracies cluster geographically (the Midwest for high temperatures, the Mountain West and West Coast for low temperatures), location is something to focus on in the future. Overall, these are some factors to consider when making predictions for all cities.