Portfolio Project 2: Exploring Weather Forecast Performance in the US

Import Data

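library(tidyverse)  # read_csv(), dplyr, tidyr, ggplot2, forcats, stringr
library(ggthemes)   # theme_map(); assumed source, the original does not show its library calls
library(patchwork)  # combine plots with `+`, plot_layout(), plot_annotation()
# map_data() and coord_map() also need the maps and mapproj packages installed
cities <- read_csv("data/forecast_cities.csv")
forecast <- read_csv("data/weather_forecasts.csv")
outlook <- read_csv("data/outlook_meanings.csv")
US <- map_data("state")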

Data Wrangling

forecast_high <- forecast |>
  filter(!state %in% c("AK", "HI", "PR", "VI")) |>
  filter(forecast_hours_before == "12") |>
  mutate(error = observed_temp - forecast_temp) |>
  pivot_wider(names_from = high_or_low,
              values_from = error) |>
  filter(!is.na(high)) |>
  rename(error_high = high) |>
  group_by(city, state) |>
  summarize(mean_error_high = mean(abs(error_high)), .groups = "keep") |>
  ungroup()

forecast_low <- forecast |>
  filter(!state %in% c("AK", "HI", "PR", "VI")) |>
  filter(forecast_hours_before == "12") |>
  mutate(error = observed_temp - forecast_temp) |>
  pivot_wider(names_from = high_or_low,
              values_from = error) |>
  filter(!is.na(low)) |>
  rename(error_low = low) |>
  group_by(city, state) |>
  summarize(mean_error_low = mean(abs(error_low)), .groups = "keep") |>
  ungroup()

cities1 <- cities |>
  full_join(forecast_high, by = c("city", "state")) |>
  full_join(forecast_low, by = c("city","state")) |>
  drop_na(mean_error_high, mean_error_low)

US <- US |>
  mutate(REGION = case_when(region %in% c("california", "nevada", "utah", "colorado", "wyoming", "montana", "idaho", "oregon", "washington") ~ "West",
                            region %in% c("arizona", "new mexico", "texas", "oklahoma") ~ "Southwest",
                            region %in% c("north dakota", "south dakota", "nebraska", "kansas", "minnesota", "iowa", "missouri", "wisconsin", "illinois", "michigan", "indiana", "ohio") ~ "Midwest",
                            region %in% c("maine", "new hampshire", "vermont", "massachusetts", "new york", "rhode island", "connecticut", "new jersey", "pennsylvania", "maryland", "delaware") ~ "Northeast",
                            region %in% c("west virginia", "district of columbia", "virginia", "kentucky", "tennessee", "north carolina", "south carolina", "georgia", "florida", "alabama", "mississippi", "louisiana", "arkansas") ~ "Southeast"))

This project aims to identify areas in the US that struggle the most with weather prediction. It investigates the relationship between forecast errors and geographical factors using the forecast_cities and weather_forecasts data sets (imported as cities and forecast, respectively). The forecast data set contains high and low temperature and precipitation predictions covering January 30, 2021 to June 1, 2022 for 167 US cities, while cities provides information about the location, climate, and topography of these cities. I chose to focus on elevation, average wind speed, and Köppen climate classification. Before visualizing these multivariate relationships, I calculated, for each city, the mean absolute error of both the high and low temperature forecasts made 12 hours before the temperatures were observed. I then combined the data sets into cities1. For mapping purposes, I introduced the US data set and defined five regions of the US. I also omitted the 6 cities in Alaska, Hawaii, Puerto Rico, and the Virgin Islands, restricting the analysis to the continental US. Finally, based on the outlook_meanings data set (imported as outlook), I grouped the outlook types into four main categories: sunny, cloudy, rainy, and snowy.
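The choice of mean absolute error over a plain mean of the errors can be illustrated with a minimal base R sketch. The data frame toy and its values are invented for illustration and are not taken from the project data; only the sign convention (observed minus forecast) follows the wrangling code above.

```r
# Toy data, invented for illustration (not taken from weather_forecasts.csv).
toy <- data.frame(
  city          = c("A", "A", "A", "B", "B", "B"),
  observed_temp = c(50, 55, 60, 70, 72, 68),
  forecast_temp = c(52, 54, 57, 70, 75, 64)
)

# Signed error, same convention as the wrangling code: observed minus forecast.
toy$error     <- toy$observed_temp - toy$forecast_temp
toy$abs_error <- abs(toy$error)

# Per-city mean of the signed errors and of the absolute errors.
mean_error     <- aggregate(error ~ city, data = toy, FUN = mean)
mean_abs_error <- aggregate(abs_error ~ city, data = toy, FUN = mean)

print(merge(mean_error, mean_abs_error, by = "city"))
```

In this toy data the positive and negative errors partly cancel, so the signed mean understates how far off the forecasts were. Averaging the absolute errors avoids that cancellation.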

Figure 1: Temperature Error vs. Elevation

p1 <- ggplot() + 
  geom_polygon(data = US, aes(x = long, y = lat, group = group, fill = REGION), color="brown") +
  geom_point(data = cities1, aes(x = lon, y = lat, color = elevation, size = mean_error_high), alpha = 0.5) +
  coord_map(projection = "sinusoidal") + 
  theme_map() +
  scale_size_continuous(breaks = c(1.0, 2.0, 3.0, 4.0), limits = c(1, 5), range = c(2, 10), name = "Mean absolute temperature error") +
  scale_fill_brewer(palette = "Pastel1", name = "Region") +
  scale_color_distiller(palette = "Blues", direction = -1) +
  labs(color = "Elevation") +
  annotate("text", x = -100, y = 20, label = "High temperature", size = 4, color = "black") +
  guides(fill = guide_legend(order = 1), color = guide_colorbar(order = 2))
 
p2 <- ggplot() + 
  geom_polygon(data = US, aes(x = long, y = lat, group = group, fill = REGION), color="brown") +
  geom_point(data = cities1, aes(x = lon, y = lat, color = elevation, size = mean_error_low), alpha = 0.5) +
  coord_map(projection = "sinusoidal") + 
  theme_map() +
  scale_size_continuous(breaks = c(1.0, 2.0, 3.0, 4.0), limits = c(1, 5), range = c(2, 10), name = "Mean absolute temperature error") +
  scale_fill_brewer(palette = "Pastel1", name = "Region") +
  scale_color_distiller(palette = "Blues", direction = -1) +
  labs(color = "Elevation") +
  annotate("text", x = -100, y = 20, label = "Low temperature", size = 4, color = "black") +
  guides(fill = guide_legend(order = 1), color = guide_colorbar(order = 2))

(p1 + p2) +
  plot_layout(guides = 'collect') +
  plot_annotation(title = "Mean absolute error in 12-hour prior temperature forecast in continental United States, color-coded by elevation") +
  theme(plot.title.position = "plot")

Figure 2: Temperature Error vs. Wind

p3 <- ggplot() + 
  geom_polygon(data = US, aes(x = long, y = lat, group = group, fill = REGION), color="brown") +
  geom_point(data = cities1, aes(x = lon, y = lat, color = wind, size = mean_error_high), alpha = 0.5) +
  coord_map(projection = "sinusoidal") + 
  theme_map() +
  scale_size_continuous(breaks = c(1.0, 2.0, 3.0, 4.0), limits = c(1, 5), range = c(2, 10), name = "Mean absolute temperature error") +
  scale_fill_brewer(palette = "Pastel1", name = "Region") +
  scale_color_distiller(palette = "Greys", direction = 1) +
  labs(color = "Average wind speed") +
  annotate("text", x = -100, y = 20, label = "High temperature", size = 4, color = "black")

p4 <- ggplot() + 
  geom_polygon(data = US, aes(x = long, y = lat, group = group, fill = REGION), color="brown") +
  geom_point(data = cities1, aes(x = lon, y = lat, color = wind, size = mean_error_low), alpha = 0.5) +
  coord_map(projection = "sinusoidal") + 
  theme_map() +
  scale_size_continuous(breaks = c(1.0, 2.0, 3.0, 4.0), limits = c(1, 5), range = c(2, 10), name = "Mean absolute temperature error") +
  scale_fill_brewer(palette = "Pastel1", name = "Region") +
  scale_color_distiller(palette = "Greys", direction = 1) +
  labs(color = "Average wind speed") +
  annotate("text", x = -100, y = 20, label = "Low temperature", size = 4, color = "black")

(p3 + p4) +
  plot_layout(guides = 'collect') +
  plot_annotation(title = "Mean absolute error in 12-hour prior temperature forecast in continental United States, color-coded by average wind speed") +
  theme(plot.title.position = "plot")

Comparing the left and right panels of either Figure 1 or Figure 2, there is no large discrepancy in the mean absolute error of forecast temperatures between highs and lows. Each of the five regions contains both smaller and larger dots for both high and low temperatures. However, there appears to be a slight but detectable increase in dot size from the east of the country to the west. In particular, states along the Rocky Mountains such as Montana and Wyoming are home to the cities with the biggest dots, with mean absolute errors of low temperature close to 4 degrees. This suggests that forecast temperatures in the West plausibly deviate more from observed temperatures than those in the Northeast and the Southeast. In other words, forecasts for states on the East Coast seem to have been somewhat more accurate overall than those for the West. Nonetheless, an error of one to three degrees is usually not considered large for citywide temperature prediction. It is therefore fair to conclude that forecasts in all five regions perform tolerably well when predicting temperatures 12 hours ahead.
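The east-to-west impression from the maps could be checked numerically with a rank correlation between longitude and mean absolute error. The sketch below uses invented values in a hypothetical data frame toy. With the project data, the analogous call would presumably be cor(cities1$lon, cities1$mean_error_low, method = "spearman"), which was not run here.

```r
# Toy values, invented for illustration; lon is in degrees east,
# so more negative values lie further west.
toy <- data.frame(
  lon            = c(-122, -112, -105, -97, -90, -84, -77, -71),
  mean_error_low = c(2.9, 3.1, 3.8, 2.6, 2.4, 2.5, 2.1, 2.2)
)

# Spearman's rank correlation only assumes a monotone trend, not a linear one.
rho <- cor(toy$lon, toy$mean_error_low, method = "spearman")
print(rho)
```

A clearly negative value would be consistent with larger errors further west. A value near zero would suggest the visual trend is weaker than it looks.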

Figure 3: Temperature Error vs. Köppen Climate Classification

p5 <- cities1 |> 
  filter(koppen %in% c("BSk", "Cfa", "Csb", "Dfa", "Dfb")) |>
  ggplot() +
  geom_histogram(aes(x = mean_error_high), bins = 12) +
  facet_wrap(~ koppen, scales = "fixed", ncol = 5) +
  theme_minimal() +
  labs(x = "Mean absolute error of high temperature forecast",
       y = "Count")

p6 <- cities1 |> 
  filter(koppen %in% c("BSk", "Cfa", "Csb", "Dfa", "Dfb")) |>
  ggplot() +
  geom_histogram(aes(x = mean_error_low), bins = 12) +
  facet_wrap(~ koppen, scales = "fixed", ncol = 5) +
  theme_minimal() +
  labs(x = "Mean absolute error of low temperature forecast",
       y = "Count")

(p5 + p6) +
  plot_layout(guides = 'collect') +
  plot_annotation(title = "Histograms of mean absolute errors for 12-hour prior temperature forecast \nacross cities and dates, sorted by Köppen climate classification") +
  theme(plot.title.position = "plot")

Furthermore, I examined possible effects of elevation and average wind speed on forecast error. The Rocky Mountain states, which I identified above as struggling slightly more, also have relatively greater elevation and average wind speed, especially compared to flatter areas along the East Coast. This shows up as lighter dot colors in Figure 1 and darker dot colors in Figure 2. Figure 3, on the other hand, shows that the distributions of mean absolute temperature error across five major Köppen climate classifications are similar in center, spread, and shape. They are all unimodal and slightly right-skewed, centered around 2 degrees for high temperatures and 3 degrees for low temperatures. Thus, Köppen climate classification does not offer a meaningful explanation for the difference in forecast performance between regions.
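The comparison of center and spread across climate classes can also be made with a small summary table rather than read off the histograms. This is a base R sketch on invented values. The data frame toy is hypothetical, and only the column names koppen and mean_error_high mirror cities1.

```r
# Toy data, invented for illustration; three cities per climate class.
toy <- data.frame(
  koppen          = rep(c("Cfa", "Dfa", "BSk"), each = 3),
  mean_error_high = c(1.8, 2.0, 2.5,
                      1.9, 2.1, 2.6,
                      2.0, 2.2, 2.7)
)

# Split the errors by climate class, then summarize center and spread.
by_class <- split(toy$mean_error_high, toy$koppen)
summary_by_class <- data.frame(
  koppen = names(by_class),
  median = unname(sapply(by_class, median)),
  iqr    = unname(sapply(by_class, IQR))
)

print(summary_by_class)
```

Similar medians and interquartile ranges across classes would support the reading that climate classification does not separate forecast performance.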

Figure 4: Observed Precipitation vs. Forecast Outlook

set.seed(86538)
sample_city <- forecast |>
  distinct(city) |>
  slice_sample(n = 12) 
  
forecast |>
  drop_na(observed_precip, forecast_outlook) |>
  filter(forecast_hours_before == "12") |>
  filter(!forecast_outlook %in% c("DUST", "FOG", "WINDY", "VRYCLD", "VRYHOT", "SMOKE")) |>
  mutate(forecast_outlook = fct_collapse(forecast_outlook, Rainy = c("SHWRS", "FZDRZL", "RNSNOW", "SLEET", "FZRAIN", "TSTRMS", "DRZL", "RAIN"),
                                                          Snowy = c("SNOSHW", "BLZZRD", "FLRRYS", "BLGSNO", "SNOW"),
                                                          Cloudy = c("CLOUDY", "MOCLDY", "PTCLDY"),
                                                          Sunny = c("SUNNY")),
         forecast_outlook = fct_relevel(forecast_outlook, c("Sunny", "Cloudy", "Rainy", "Snowy"))) |>
  filter(city %in% sample_city$city) |>
  ggplot() +
  geom_point(aes(x = observed_precip, y = forecast_outlook)) +
  facet_wrap(~ city, labeller = labeller(city = function(x) str_to_title(str_replace_all(x, "_", " ")))) + 
  labs(x = "Observed precipitation (in inches)", 
       y = "Forecast outlook",
       title = "Dot plot of observed precipitation vs. 12-hour prior forecast outlook \nfor 12 randomly selected cities") +
  theme_minimal()

Finally, the chart of observed precipitation versus forecast outlook (Figure 4) offers insight into how accurate forecast outlooks are in general. I randomly selected 12 cities, which I assumed to be representative of all 161 cities*. Among the four outlook categories, rainy consistently has the largest center across these cities. This means that days forecast as rainy did tend to produce more precipitation across the US, which speaks to the forecast’s reliability. Because all 12 cities share a similar distribution, there is no evidence that forecasts for any one city are better at predicting a specific type of weather than those for the others. What can be concluded, though, is that in all of these cities many days that were forecast to be sunny or cloudy ended up being rainy or snowy, since precipitation was still observed. In this sense, the forecast still has considerable room for improvement in predicting sunny and cloudy weather.
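Both observations, the larger center for rainy forecasts and the precipitation seen on days forecast as sunny or cloudy, can be expressed as per-category summaries. The sketch below uses invented values in a hypothetical data frame toy. The category names follow the four groups defined in the Figure 4 code.

```r
# Toy data, invented for illustration (precipitation in inches).
toy <- data.frame(
  forecast_outlook = c("Sunny", "Sunny", "Sunny", "Cloudy", "Cloudy",
                       "Rainy", "Rainy", "Rainy", "Snowy", "Snowy"),
  observed_precip  = c(0, 0, 0.05, 0, 0.10,
                       0.30, 0, 0.80, 0.20, 0.10)
)
toy$forecast_outlook <- factor(toy$forecast_outlook,
                               levels = c("Sunny", "Cloudy", "Rainy", "Snowy"))

# Typical precipitation per forecast category.
median_precip <- sapply(split(toy$observed_precip, toy$forecast_outlook), median)

# Share of days in each category on which any precipitation was observed.
share_wet <- sapply(split(toy$observed_precip > 0, toy$forecast_outlook), mean)

print(rbind(median_precip, share_wet))
```

A non-trivial share_wet for the Sunny and Cloudy categories would correspond to the dots that sit away from zero in those rows of Figure 4.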