Dry, cold, windy cities at high elevations inside the continent struggle with accurate weather prediction

Amberly Kroha Published on RPubs here: http://rpubs.com/krohaa/portfolio-2-krohaa

library(tidyverse)
library(dplyr)
library(stringr)
library(maps)
library(mapproj)
library(ggthemes)
library(plotly)
library(forcats)

This project uses weather station data from the National Weather Service. The data sets that I load below include general information for 167 US cities (forecast_cities), a key to weather forecast type codes (outlook_meanings), and information about over 500,000 weather forecasts (weather_forecasts) from these 167 cities. For this project, my goals are to locate which cities, on average, have the highest magnitude of difference between forecasted and actual temperatures, henceforth called “forecast error”, and to determine some reasons particular cities may struggle with forecasting.

forecast_cities <- read_csv("forecast_cities.csv")
outlook_meanings <- read_csv("outlook_meanings.csv")
weather_forecasts <- read_csv("weather_forecasts.csv")

Data Wrangling

First, I create a new data set that computes the forecast error for each forecast in weather_forecasts and joins it with the city-level information in forecast_cities. I can do this with the following code:

weather <- weather_forecasts %>% 
  # Keep only forecasts where both temperatures are available
  drop_na(forecast_temp, observed_temp) %>% 
  mutate(
    # Signed error: positive means the forecast was too warm
    prediction_error = forecast_temp - observed_temp,
    # Magnitude of the error, called "forecast error" in this report
    error_magnitude = abs(prediction_error)
  ) %>% 
  relocate(forecast_temp, .before = observed_temp) %>% 
  relocate(prediction_error, error_magnitude, .after = observed_temp) %>% 
  # Attach the city-level information to every forecast
  left_join(forecast_cities, by = c("city", "state")) %>% 
  relocate(13:14, .after = state)
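Because the join relies on `city` and `state` matching exactly in both tables, it can be worth checking that no forecast rows are left without city information and that the join does not change the number of rows. The following is a minimal sketch on made-up toy data; the tibbles `toy_forecasts` and `toy_cities` and all of their values are hypothetical and are not taken from the project files.

```r
library(dplyr)

# Hypothetical toy tables standing in for weather_forecasts and forecast_cities
toy_forecasts <- tibble(
  city = c("DENVER", "DENVER", "MIAMI", "NOWHERE"),
  state = c("CO", "CO", "FL", "ZZ"),
  forecast_temp = c(50, 42, 88, 60),
  observed_temp = c(45, 47, 87, 61)
)
toy_cities <- tibble(
  city = c("DENVER", "MIAMI"),
  state = c("CO", "FL"),
  elevation = c(1609, 2)
)

# Forecast rows that have no matching city information
unmatched <- anti_join(toy_forecasts, toy_cities, by = c("city", "state"))

# If each city/state pair appears only once in the city table,
# a left join keeps exactly as many rows as the left table
joined <- left_join(toy_forecasts, toy_cities, by = c("city", "state"))
same_row_count <- nrow(joined) == nrow(toy_forecasts)

print(unmatched)
print(same_row_count)
```

In the toy data, the one row for the made-up city "NOWHERE" is reported as unmatched; in the joined table it would carry a missing elevation.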

Data Analysis

Which cities, on average, struggle with accurate forecasting?

Next, I would like to explore which cities struggle with forecasting their weather data. To do this, I consider the average forecast error among many weather predictions for each of the 167 cities in the data set.

map_info <- forecast_cities %>%
  select(1:4)

NorthAmerica_Map <- map_data("state")

p <- weather %>% 
  filter(state != "AK", state != "VI", state != "PR", state != "HI") %>% 
  group_by(city, state) %>% 
  summarize(
    avg_error = mean(error_magnitude, na.rm = FALSE),
    lat = mean(lat, na.rm = FALSE),
    long = mean(lon, na.rm = FALSE)
  ) %>% 
  mutate(city = factor(city, unique(city))) %>%
  mutate(city = (city %>% str_replace_all("_", " ") %>% str_to_title())) %>% 
  mutate(interactive_text = paste(
    "City: ", city, "\n",
    "Average forecast error: ", round(avg_error, 3),
    sep = ""
  )) %>% 
  ggplot() +
  geom_polygon(data = NorthAmerica_Map, 
               aes(x = long, y = lat, group = group), 
               fill = "grey98", color = "grey80") +
  geom_point(aes(x = long, y = lat, color = avg_error, size = avg_error, text = interactive_text), alpha = 0.7) +
  scale_color_viridis_c(option = "A", begin = 0.02, end = 0.95) +
  coord_map() +
  theme_map() +
  labs(
    title = "Average forecast error among cities in the contiguous United States",
    color = "Forecast\nerror [degrees]",
    caption = "Figure 1. Interactive map of US cities and weather forecast errors"
  )

p <- ggplotly(p, tooltip = "text")
p

This map (Fig. 1), which I have made interactive with plotly to add a new element beyond course content, shows the average error in weather forecasts for different cities in the contiguous United States. In general, it seems that cities in the northwest tend to have higher forecast errors than other cities in the country. In fact, as shown in the column chart below (Fig. 2), many of the 20 cities with the highest average forecast errors, such as Helena, Missoula, and Casper, are located in the northwestern US. In the rest of the report, I will explore why this may be the case.

weather %>% 
  mutate(city = factor(city, unique(city))) %>%
  mutate(city = (city %>% str_replace_all("_", " ") %>% str_to_title())) %>%
  group_by(city, state) %>% 
  summarize(
    mean_error = mean(error_magnitude, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  arrange(desc(mean_error)) %>% 
  slice_head(n = 20) %>% 
  ggplot() +
  geom_col(aes(x = mean_error, y = fct_reorder(city, mean_error))) +
  theme_classic() +
  labs(
    title = "Top 20 cities for forecast errors",
    x = "Average forecast error [degrees]",
    y = "City",
    caption = "Figure 2. Column chart of highest average forecast errors"
  )

Non-explanations

Perhaps there are particular times when weather is hardest to predict. It seems reasonable that there could be a most difficult time of year in which to predict weather conditions, or that forecasts made further in advance of the observation could be less accurate. However, this is not supported by the data. The line graph below (Fig. 3), which shows the 100,000 least accurate predictions, reveals no discernible trend in when during the year predictions are worst, and the number of hours before a weather measurement that the forecast is made does not seem to make a difference in the resulting magnitude of error. These results suggest that time is not a good predictor of how accurate weather forecasts may be.

weather %>%
  mutate(forecast_hours_before = as.factor(forecast_hours_before)) %>%
  arrange(desc(error_magnitude)) %>%
  slice_head(n = 100000) %>%
  ggplot() +
  geom_line(aes(
    x = date, 
    y = error_magnitude, 
    color = fct_reorder2(forecast_hours_before, .x = date, .y = error_magnitude))) +
  scale_color_viridis_d(option = "A", begin = 0.3, end = 0.8) +
  theme_classic() +
  labs(
    title = "Forecast errors among the 100,000 least accurate weather predictions",
    x = "Date",
    y = "Forecast error [degrees]",
    color = "Number of hours\nbefore event",
    caption = "Figure 3. Time series of forecast errors among four\ndifferent prediction times"
  )
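Since Figure 3 is restricted to the largest errors, a complementary check is to average the error over all forecasts for each calendar month and each forecast lead time. The sketch below shows one way this could be done, using a made-up `toy_weather` tibble (hypothetical values) with the same column names as the `weather` data set used above.

```r
library(dplyr)

# Hypothetical toy data: two months, two forecast lead times
toy_weather <- tibble(
  date = as.Date(c("2021-01-05", "2021-01-05", "2021-01-20",
                   "2021-07-05", "2021-07-05", "2021-07-20")),
  forecast_hours_before = c(12, 48, 12, 12, 48, 48),
  error_magnitude = c(2, 4, 4, 1, 3, 5)
)

# Average error for each calendar month and lead time
monthly_error <- toy_weather %>%
  mutate(month = as.integer(format(date, "%m"))) %>%
  group_by(month, forecast_hours_before) %>%
  summarize(
    mean_error = mean(error_magnitude),
    n_forecasts = n(),
    .groups = "drop"
  )

print(monthly_error)
```

If time mattered, the resulting table would show mean errors that rise or fall systematically across months or lead times; keeping `n_forecasts` alongside makes it easier to spot averages that rest on very few forecasts.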

If not time, perhaps the type of weather may be a useful indicator of why particular cities struggle with weather forecasting. However, the box plot below (Fig. 4) shows that there are no easily discernible differences in prediction errors among different types of weather. Therefore, it does not seem that cities more prone to experiencing any particular type of (possibly unpredictable) weather are more likely to make forecasting errors.

weather %>%
  left_join(outlook_meanings, by = "forecast_outlook") %>% 
  relocate(22, .after = forecast_outlook) %>% 
  ggplot() +
  geom_boxplot(aes(x = error_magnitude, y = meaning)) +
  theme_classic() +
  labs(
    title = "The type of weather predicted has little to do with how accurate any\ngiven forecast will be",
    x = "Forecast error [degrees]",
    y = "Forecast outlook",
    caption = "Figure 4. Box plot of weather forecast types and\nforecast errors."
  )
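The visual comparison in the box plot can be backed up with a small numeric summary of the error distribution for each forecast outlook. The following is a minimal sketch on a made-up `toy_weather` tibble (hypothetical outlook labels and values); it assumes a `meaning` column like the one produced by the join with outlook_meanings above.

```r
library(dplyr)

# Hypothetical toy data: forecast outlooks with their error magnitudes
toy_weather <- tibble(
  meaning = c("Sunny", "Sunny", "Sunny", "Rain", "Rain", "Rain"),
  error_magnitude = c(1, 2, 3, 0, 3, 7)
)

# Median, spread, and number of forecasts for each outlook
outlook_summary <- toy_weather %>%
  group_by(meaning) %>%
  summarize(
    median_error = median(error_magnitude),
    iqr_error = IQR(error_magnitude),
    n_forecasts = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(median_error))

print(outlook_summary)
```

Similar medians and interquartile ranges across outlooks would be consistent with the reading of Figure 4 given above.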

Geographic explanations

It seems that errors in weather predictions are linked to geographic location. In fact, as the scatter plots below indicate, there are moderate to strong relationships between prediction error and longitude (Fig. 5), latitude (Fig. 6), elevation (Fig. 7), and distance to coast (Fig. 8). This analysis indicates that, on average, cities at northern latitudes, western longitudes, high elevations, and inland continental locations are most likely to have inaccurate weather forecasts. This makes sense, as distance from the equator, lower atmospheric pressure and/or weather events controlled by high topography, and distance from the moderating influence of the ocean tend to lead to more variable and unpredictable weather. Comparing these results to the 20 cities with the most inaccurate weather predictions (Fig. 2) supports these trends, as western, inland cities at high elevations (like Denver and Colorado Springs) struggle to accurately predict weather.

The following code calculates the correlation coefficients for the scatter plots (Figs. 5-8) below, which are created in the code chunks that follow.

long_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    long = mean(lon, na.rm = TRUE) 
  ) %>%
  ungroup() %>% 
  summarize(
    corl = cor(long, mean_error, use = "complete.obs")
  ) %>% 
  # Extract the coefficient as a plain number for use in the plot label
  pull(corl)

lat_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    lat = mean(lat, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  summarize(corl = cor(lat, mean_error)) %>% 
  pull(corl)

ele_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    ele = mean(elevation, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  summarize(corl = cor(ele, mean_error)) %>% 
  pull(corl)

coast_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    coast = mean(distance_to_coast, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  summarize(corl = cor(coast, mean_error)) %>% 
  pull(corl)
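The four correlation chunks above repeat the same pattern, so they could also be computed in a single step from one per-city summary table. The sketch below illustrates the idea on a made-up `toy_city_summary` tibble (hypothetical cities and values); it is one possible alternative, not the approach used in this report.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-city summary: one row per city
toy_city_summary <- tibble(
  city = c("A", "B", "C", "D"),
  mean_error = c(1, 2, 3, 4),
  lon = c(-80, -90, -100, -110),
  lat = c(30, 35, 40, 45),
  elevation = c(10, 300, 200, 1500),
  distance_to_coast = c(5, 100, 400, 900)
)

# Pearson correlation of each geographic variable with the mean error
geo_cors <- toy_city_summary %>%
  summarize(across(
    c(lon, lat, elevation, distance_to_coast),
    ~ cor(.x, mean_error, use = "complete.obs")
  )) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "correlation")

print(geo_cors)
```

The long format produced by `pivot_longer()` gives one row per variable, which makes the coefficients easy to compare side by side or to look up when labelling plots.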

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    long = mean(lon, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = long, y = mean_error)) +
  geom_point(color = "#F3B584", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#F3B584") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in Western cities",
    y = "Average forecast error [degrees]",
    x = "Longitude [degrees, negative = West]",
    caption = "Figure 5. Scatter plot of longitude against forecast error"
  ) +
    annotate("text",
    x = -70,
    y = 4,
    label = str_c("Correlation:\n", round(long_cor, 2)),
    size = 4,
    color = "#F3B584"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    lat = mean(lat, na.rm = TRUE) 
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = lat, y = mean_error)) +
  geom_point(color = "#F6A97A", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#F6A97A") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in Northern cities",
    y = "Average forecast error [degrees]",
    x = "Latitude [degrees North]",
    caption = "Figure 6. Scatter plot of latitude against forecast error"
  ) +
    annotate("text",
    x = 60,
    y = 4,
    label = str_c("Correlation:\n", round(lat_cor, 2)),
    size = 4,
    color = "#F6A97A"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    elevation = mean(elevation, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  arrange(mean_error) %>% 
  ggplot(aes(x = elevation, y = mean_error)) +
  geom_point(color = "#4B2991", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#4B2991") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in elevated cities",
    y = "Average forecast error [degrees]",
    x = "Elevation [meters above sea level]",
    caption = "Figure 7. Scatter plot of elevation against forecast error"
  ) +
  annotate("text",
    x = 2000,
    y = 4,
    label = str_c("Correlation:\n", round(ele_cor, 2)),
    size = 4,
    color = "#4B2991"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    distance_to_coast = mean(distance_to_coast, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = distance_to_coast, y = mean_error)) +
  geom_point(color = "#D44292", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#D44292") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in inland cities",
    y = "Average forecast error [degrees]",
    x = "Distance to coast [miles]",
    caption = "Figure 8. Scatter plot of distance to coast against forecast\nerror"
  ) +
    annotate("text",
    x = 1100,
    y = 4,
    label = str_c("Correlation:\n", round(coast_cor, 2)),
    size = 4,
    color = "#D44292"
  )

Climate-related explanations

While geography and climate are inextricably linked in terms of Earth system processes, I have chosen to separate the more climate-related factors from the geographic factors described above.

It seems that errors in weather predictions are also linked to aspects of local climate. As the scatter plots below indicate, there are relationships of varying strength between prediction error and event precipitation (Fig. 9), average annual precipitation (Fig. 10), event temperature (Fig. 11), and average wind speed (Fig. 12). This analysis indicates that, on average, cities with low precipitation (Figs. 9-10), cool temperatures (Fig. 11), and high winds (Fig. 12) are most likely to have inaccurate weather forecasts. This makes sense, as dry, cold, fast-moving air can contribute to transient and unpredictable weather conditions (source). Another way to visualize these trends is to explore the average prediction error for cities within each of the Koppen Climate Classifications (Fig. 13). This graph seems to indicate that coldness is highly influential in resultant forecast errors, though some of these effects are also modulated by dryness. Trends related to humidity are less clear in this visualization than in the scatter plots, which are built from more data points. Still, Figure 13 shows that, in general, cold locations struggle with forecast error. Comparing this result to the 20 cities with the most inaccurate weather predictions (Fig. 2) supports this trend, as cold cities (like Fairbanks) struggle to accurately predict weather.

The following code calculates the correlation coefficients for the scatter plots (Figs. 9-12) below, which are created in the code chunks that follow.

event_precip_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    precip = mean(observed_precip, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  summarize(corl = cor(precip, mean_error)) %>% 
  pull(corl)

annual_precip_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    precip = mean(avg_annual_precip, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  summarize(corl = cor(precip, mean_error, use = "complete.obs")) %>% 
  pull(corl)

temp_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    temp = mean(observed_temp, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  summarize(corl = cor(temp, mean_error, use = "complete.obs")) %>% 
  pull(corl)

wind_cor <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    wind = mean(wind, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  summarize(corl = cor(wind, mean_error, use = "complete.obs")) %>% 
  pull(corl)

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    avg_observed_precip = mean(observed_precip, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = avg_observed_precip, y = mean_error)) +
  geom_point(color = "#952EA0", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#952EA0") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher for dry weather events",
    y = "Average forecast error [degrees]",
    x = "Average precipitation during forecasted events [inches]",
    caption = "Figure 9. Scatter plot of forecast event precipitation against\nforecast error"
  ) +
    annotate("text",
    y = 4,
    x = 0.2,
    label = str_c("Correlation:\n", round(event_precip_cor, 2)),
    size = 4,
    color = "#952EA0"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    avg_annual_precip = mean(avg_annual_precip, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = avg_annual_precip, y = mean_error)) +
  geom_point(color = "#A3319F", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#A3319F") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be somewhat higher in drier cities",
    y = "Average forecast error [degrees]",
    x = "Average annual precipitation [inches]",
    caption = "Figure 10. Scatter plot of annual precipitation against forecast\nerror"
  ) +
    annotate("text",
    y = 4,
    x = 100,
    label = str_c("Correlation:\n", round(annual_precip_cor, 2)),
    size = 4,
    color = "#A3319F"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    temp = mean(observed_temp, na.rm = TRUE)
  ) %>%
  ungroup() %>% 
  ggplot(aes(x = temp, y = mean_error)) +
  geom_point(color = "#EFCC98", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#EFCC98") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in colder cities",
    y = "Average forecast error [degrees]",
    x = "Average temperature during forecasted events [degrees]",
    caption = "Figure 11. Scatter plot of event temperature against forecast\nerror"
  ) +
    annotate("text",
    y = 4,
    x = 78,
    label = str_c("Correlation:\n", round(temp_cor, 2)),
    size = 4,
    color = "#EFCC98"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    wind = mean(wind, na.rm = TRUE)
  ) %>%
  ungroup() %>% 
  ggplot(aes(x = wind, y = mean_error)) +
  geom_point(color = "#F66D7A", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#F66D7A") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be slightly higher in windier cities",
    y = "Average forecast error [degrees]",
    x = "Average wind speed [miles per hour]",
    caption = "Figure 12. Scatter plot of wind speed against forecast error"
  ) +
    annotate("text",
    y = 4,
    x = 5.5,
    label = str_c("Correlation:\n", round(wind_cor, 2)),
    size = 4,
    color = "#F66D7A"
  )

weather %>% 
  group_by(koppen) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE))
  ) %>% 
  mutate(koppen =
    fct_recode(koppen,
      "Tropical Rainforest climate" = "Af",
      "Tropical Monsoon climate" = "Am",
      "Tropical Savanna, dry summer climate" = "As",
      "Tropical Savanna, dry winter climate" = "Aw",
      "Hot Semi-arid climate" = "BSh",
      "Cold Semi-arid climate" = "BSk",
      "Hot Desert climate" = "BWh",
      "Cold Desert climate" = "BWk",
      "Humid Subtropical climate" = "Cfa",
      "Temperate Oceanic or Subtropical Highland climate" = "Cfb",
      "Hot Dry Temperate climate" = "Csa",
      "Warm Dry Temperate climate" = "Csb",
      "Hot Humid Continental climate" = "Dfa",
      "Warm Humid Continental climate" = "Dfb",
      "Cold Humid Continental climate" = "Dfc" 
      )
         ) %>% 
  ggplot() +
  geom_point(aes(x = mean_error, y = fct_reorder(koppen, mean_error))) +
  theme_classic() +
  labs(
    title = "Average prediction error in different Koppen\nClimate Classifications",
    subtitle = "Forecast errors tend to be higher in dry and continental\nclimates",
    y = "Koppen Classification",
    x = "Forecast error [degrees]",
    caption = "Figure 13. Dot plot for forecast errors among\nKoppen Climate Classifications"
  )
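Because some climate classes may contain only a few cities, it can help to report how many cities and forecasts stand behind each average in Figure 13. The sketch below shows one way to do this on a made-up `toy_weather` tibble (hypothetical cities, classes, and values) that uses the same `koppen` and `error_magnitude` column names as above.

```r
library(dplyr)

# Hypothetical toy data: three cities in two Koppen classes
toy_weather <- tibble(
  city = c("A", "A", "B", "C", "C", "C"),
  koppen = c("Dfc", "Dfc", "Cfa", "Cfa", "Cfa", "Cfa"),
  error_magnitude = c(4, 6, 1, 2, 2, 2)
)

# Mean error per class, with the number of cities and forecasts behind it
koppen_summary <- toy_weather %>%
  group_by(koppen) %>%
  summarize(
    mean_error = mean(error_magnitude),
    n_cities = n_distinct(city),
    n_forecasts = n(),
    .groups = "drop"
  )

print(koppen_summary)
```

A class represented by a single city, like "Dfc" in the toy data, reflects that city's forecasts rather than the climate class as a whole, which is worth keeping in mind when reading the dot plot.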

Summary

In general, the above analysis shows that cities that struggle with accurate weather forecasting, such as Fairbanks, Denver, and Missoula, tend to be geographically located in northern latitudes, western longitudes, high elevations, and inland continental regions and tend to have dry, cold, windy climates.