http://rpubs.com/AngelML15/1269118

Introduction

In this analysis, we explore the accuracy of high and low temperature forecasts across 167 US cities over a 16-month period. Using data from the National Weather Service, we investigate which areas have the least accurate temperature forecasts and explore potential reasons for these discrepancies.

Loading Data

library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2

outlook_meanings <- read_csv("data/outlook_meanings.csv")

weather_forecasts <- read_csv("data/weather_forecasts.csv",
                              col_types = cols(
  date = col_date(),
  city = col_factor(),
  state = col_factor(),
  high_or_low = col_factor(),
  forecast_hours_before = col_integer(),
  observed_temp = col_integer(),
  forecast_temp = col_integer(),
  forecast_outlook = col_factor(),
  possible_error = col_factor()
))

forecast_cities <- read_csv("data/forecast_cities.csv")

Wrangling

combined_weather_data <- weather_forecasts |>
  left_join(forecast_cities, by = c("city", "state")) |>
  mutate(temp_error = abs(forecast_temp - observed_temp)) |>
  filter(!is.na(temp_error))
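
Because the join relies on city and state matching exactly in both tables, it can be worth checking for forecast rows that find no city record before computing errors. The following is a minimal sketch of such a check, assuming the same column names as above; the toy tables (toy_forecasts, toy_cities) and every value in them are hypothetical stand-ins for the real data, not results from it.

```r
library(dplyr)

# Hypothetical toy tables standing in for weather_forecasts and forecast_cities
toy_forecasts <- tibble(
  city = c("ABILENE", "AKRON", "NOWHERE"),
  state = c("TX", "OH", "ZZ"),
  forecast_temp = c(70L, 55L, 60L),
  observed_temp = c(68L, 59L, NA)
)

toy_cities <- tibble(
  city = c("ABILENE", "AKRON"),
  state = c("TX", "OH"),
  distance_to_coast = c(329.5, 405.2)  # made-up distances
)

# City/state pairs in the forecasts that have no matching city record
unmatched <- toy_forecasts |>
  distinct(city, state) |>
  anti_join(toy_cities, by = c("city", "state"))

# Same wrangling steps as above, applied to the toy data
toy_combined <- toy_forecasts |>
  left_join(toy_cities, by = c("city", "state")) |>
  mutate(temp_error = abs(forecast_temp - observed_temp)) |>
  filter(!is.na(temp_error))
```

An empty unmatched table would suggest every forecast row picked up its city attributes; any rows listed there would carry NA for fields such as distance_to_coast after the left join.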

Visualizations with Analysis

Average Forecast Error by US State

state_error <- combined_weather_data |>
  group_by(state) |>
  summarize(mean_error = mean(temp_error, na.rm = TRUE)) |>
  mutate(state = tolower(state)) |>
  # Lookup table mapping state abbreviations to the full names used by map_data()
  left_join(tibble(
    state = tolower(state.abb),
    name = tolower(state.name)
  ), by = "state")

us_states <- map_data("state")  # state outlines; requires the maps package

ggplot(data = state_error) +
  geom_map(aes(map_id = name, fill = mean_error), 
           color = "white", 
           map = us_states) +
  expand_limits(x = us_states$long, y = us_states$lat) +
  scale_fill_viridis_c(option = "G", direction = -1) +  
  coord_map() + 
  theme_void() +
  theme(legend.position = "bottom") +
  labs(
    title = "Average Forecast Error by US State",
    fill = "Avg. Temp Error (°F)"
  )

To visualize how forecasting accuracy varies across the country, I created a choropleth map showing the average temperature forecast errors by U.S. state. The map reveals that some states, particularly in the western US around Montana, exhibit relatively high forecasting errors, which could indicate that these regions face more challenges in weather prediction. In contrast, states in the southeastern US tend to have lower errors, suggesting more reliable temperature predictions in this region.

High vs. Low Average Forecast Error

state_error <- combined_weather_data |>
  group_by(state, high_or_low) |>
  summarize(mean_error = mean(temp_error, na.rm = TRUE), .groups = "drop") |>
  mutate(state = tolower(state)) |>
  left_join(tibble(
    state = tolower(state.abb),
    name = tolower(state.name)
  ), by = "state")

ggplot(data = state_error) +
  geom_map(aes(map_id = name, fill = mean_error), 
           color = "white", 
           map = us_states) +
  expand_limits(x = us_states$long, y = us_states$lat) +
  scale_fill_viridis_c(option = "G", direction = -1) +  
  coord_map() + 
  theme_void() +
  facet_wrap(~high_or_low) +
  theme(legend.position = "bottom") +
  labs(
    title = "Average Forecast Error by US State: High vs. Low Temperatures",
    fill = "Avg. Temp Error (°F)"
  )

Comparing the two choropleth maps of forecasting errors for high and low temperatures, I observed that low temperature forecasting errors were generally higher than those for high temperatures. The low temperature errors were particularly concentrated in the western half of the US. In contrast, high temperature forecast errors appeared to be more evenly distributed across states, with no clear regional pattern.
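
The visual comparison between high and low temperature errors can be backed by a simple numeric summary. The sketch below assumes the same high_or_low and temp_error columns as above; toy_errors is a hypothetical table with made-up values, used only so the snippet runs on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for combined_weather_data
toy_errors <- tibble(
  high_or_low = c("high", "high", "low", "low"),
  temp_error = c(1, 3, 2, 6)
)

# Mean and median absolute error for each forecast type
high_low_summary <- toy_errors |>
  group_by(high_or_low) |>
  summarize(
    mean_error = mean(temp_error),
    median_error = median(temp_error),
    n = n()
  )
```

Running the same pipeline on combined_weather_data would give one row per forecast type, which makes the size of the gap explicit rather than leaving it to color shading.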

Forecast Hours Before vs. Average Forecast Error

forecast_time_error <- combined_weather_data |>
  group_by(forecast_hours_before) |>
  summarize(mean_error = mean(temp_error, na.rm = TRUE))

ggplot(data = forecast_time_error) +
  geom_col(aes(x = factor(forecast_hours_before), y = mean_error, fill = mean_error)) +
  labs(title = "Average Forecast Error by Lead Time",
       x = "Forecast Hours Before Observation",
       y = "Avg. Temp Error (°F)") +
  theme(legend.position = "none")

Grouping by lead time shows that the average temperature error increases as the forecast is issued further ahead of the observation. This is expected, since predicting temperatures further in advance involves greater uncertainty. In practice, this means short-term forecasts are the more reliable ones, and accuracy declines over longer forecast periods.
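
Whether the mean error really rises at every step of lead time can be checked directly instead of read off the bars. This is a minimal sketch under the assumption that the summary has one row per lead time; toy_lead and its lead-time values and errors are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two observations per lead time
toy_lead <- tibble(
  forecast_hours_before = c(12L, 12L, 24L, 24L, 36L, 36L, 48L, 48L),
  temp_error = c(1, 2, 2, 2, 2, 3, 3, 4)
)

# Mean error per lead time, ordered from shortest to longest lead
lead_summary <- toy_lead |>
  group_by(forecast_hours_before) |>
  summarize(mean_error = mean(temp_error)) |>
  arrange(forecast_hours_before)

# TRUE only if each longer lead time has a strictly larger mean error
is_increasing <- all(diff(lead_summary$mean_error) > 0)
```

Applied to forecast_time_error, the diff() values would also show how much error each additional step of lead time adds.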

Distance to Coast vs. Average Forecast Error

distance_coast_error <- combined_weather_data |>
  group_by(distance_to_coast) |>
  summarize(mean_error = mean(temp_error, na.rm = TRUE))

ggplot(data = distance_coast_error) +
  geom_point(aes(x = distance_to_coast, y = mean_error), color = "blue") +  # Scatter plot
  labs(title = "Distance to Coast vs Average Forecast Error",
       x = "Distance to Coast (miles)",
       y = "Avg. Temp Error (°F)") +
  theme_minimal()

Lastly, the relationship between distance to the coast and average temperature forecast error shows a suggestive pattern: as the distance from the coast increases, the mean forecast error tends to rise slightly. However, there is still considerable scatter in the data, indicating that other factors also play a role. Distance to the coast may therefore be one of several factors influencing forecast accuracy.
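
One way to put a number on "rises slightly, with scatter" is a rank correlation between distance and mean error. The sketch below is an illustration rather than part of the original analysis: it summarizes per city (instead of grouping on distance alone, which would merge any cities sharing the same distance) and uses Spearman correlation as one reasonable choice. toy_city_errors and all of its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: five cities, two error observations each
toy_city_errors <- tibble(
  city = rep(c("A", "B", "C", "D", "E"), each = 2),
  distance_to_coast = rep(c(0, 50, 200, 400, 800), each = 2),
  temp_error = c(1.6, 2.0, 2.0, 2.2, 1.9, 2.1, 2.5, 2.7, 2.3, 2.5)
)

# One row per city with its mean absolute error
city_summary <- toy_city_errors |>
  group_by(city, distance_to_coast) |>
  summarize(mean_error = mean(temp_error), .groups = "drop")

# Spearman rank correlation between distance and mean error
rho <- cor(
  city_summary$distance_to_coast,
  city_summary$mean_error,
  method = "spearman"
)
```

A value of rho near 0 on the real data would indicate little monotonic relationship, while a clearly positive value would support the upward trend described above; adding geom_smooth() to the scatter plot would show the same trend visually.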