Dry, cold, windy cities at high elevations inside the continent struggle with accurate weather prediction

Amberly Kroha Published on RPubs here: http://rpubs.com/krohaa/portfolio-2-krohaa

library(tidyverse)
library(dplyr)
library(stringr)
library(maps)
library(mapproj)
library(ggthemes)
library(plotly)
library(forcats)

This project uses weather station data from the National Weather Service. The data sets that I load below include general information for 167 US cities (forecast_cities), a key to weather forecast type codes (outlook_meanings), and information about over 500,000 weather forecasts (weather_forecasts) from these 167 cities. My goals are to identify which cities, on average, have the largest difference in magnitude between forecasted and observed temperatures, henceforth called “forecast error”, and to explore some reasons particular cities may struggle with forecasting.

forecast_cities <- read_csv("forecast_cities.csv")
outlook_meanings <- read_csv("outlook_meanings.csv")
weather_forecasts <- read_csv("weather_forecasts.csv")

Data Wrangling

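First, I create a new data set, weather, that drops forecasts with a missing forecasted or observed temperature, computes the signed forecast error and its magnitude, and joins the informative aspects of the forecast_cities data set onto the weather_forecasts data set. I can do this with the following code: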

weather <- weather_forecasts %>% 
  drop_na(forecast_temp) %>% 
  drop_na(observed_temp) %>% 
  mutate(
    prediction_error = forecast_temp - observed_temp
  ) %>% 
  mutate(
    error_magnitude = abs(prediction_error)
  ) %>% 
  relocate(forecast_temp, .before = observed_temp) %>% 
  relocate(prediction_error, .after = observed_temp) %>% 
  relocate(error_magnitude, .after = prediction_error) %>% 
  left_join(forecast_cities, by = c("city", "state")) %>% 
  relocate(13:14, .after = state)
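
To make the two error columns concrete, here is a minimal sketch on made-up temperatures (the toy tibble and its values are invented for illustration and are not taken from the National Weather Service data). It applies the same two calculations as the chunk above.

```r
library(dplyr)

# Made-up temperatures, for illustration only
toy <- tibble(
  city          = c("A", "A", "B"),
  forecast_temp = c(70, 65, 40),
  observed_temp = c(68, 69, 40)
)

toy_errors <- toy %>%
  mutate(
    prediction_error = forecast_temp - observed_temp,  # signed: positive means the forecast was too warm
    error_magnitude  = abs(prediction_error)           # unsigned size of the miss
  )

toy_errors
```

In this toy case the signed errors (2, -4, 0) partly cancel when averaged, while the magnitudes (2, 4, 0) do not, which is why the rest of the report averages error_magnitude rather than prediction_error.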

Data Analysis

Which cities, on average, struggle with accurate forecasting?

Next, I would like to explore which cities struggle with forecasting their weather data. To do this, I consider the average forecast error among many weather predictions for each of the 167 cities in the data set.

map_info <- forecast_cities %>%
  select(1:4)

NorthAmerica_Map <- map_data("state")

p <- weather %>% 
  filter(state != "AK", state != "VI", state != "PR", state != "HI") %>% 
  group_by(city, state) %>% 
  summarize(
    avg_error = mean(error_magnitude),
    lat = mean(lat),
    long = mean(lon),
    .groups = "drop"  # return an ungrouped tibble and silence the regrouping message
  ) %>% 
  # Tidy city names for the tooltip: underscores become spaces, then title case
  mutate(city = city %>% str_replace_all("_", " ") %>% str_to_title()) %>% 
  mutate(interactive_text = paste(
    "City: ", city, "\n",
    "Average forecast error: ", round(avg_error, 3),
    sep = ""
  )) %>% 
  ggplot() +
  geom_polygon(data = NorthAmerica_Map, 
               aes(x = long, y = lat, group = group), 
               fill = "grey98", color = "grey80") +
  geom_point(aes(x = long, y = lat, color = avg_error, size = avg_error, text = interactive_text), alpha = 0.7) +
  scale_color_viridis_c(option = "A", begin = 0.02, end = 0.95) +
  coord_map() +
  theme_map() +
  labs(
    title = "Average forecast error among cities in the contiguous United States",
    color = "Forecast\nerror [degrees]",
    caption = "Figure 1. Interactive map of US cities and weather forecast errors"
  )

p <- ggplotly(p, tooltip = "text")
p
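
The city labels shown in the tooltips come from the two stringr calls in the chunk above. As a small sketch of what they do, assuming the raw city names are upper-case and underscore-separated (the input string here is a hypothetical example of that format):

```r
library(stringr)

# Hypothetical raw city name in an assumed upper-case, underscore-separated format
raw_name <- "SALT_LAKE_CITY"

# Replace underscores with spaces, then capitalize the first letter of each word
clean_name <- raw_name %>%
  str_replace_all("_", " ") %>%
  str_to_title()

clean_name
```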

This map (Fig. 1), which I have made interactive with plotly to add a new element beyond course content, shows the average forecast error for cities in the contiguous United States. In general, cities in the northwest tend to have higher forecast errors than cities elsewhere in the country. In fact, as the column chart below (Fig. 2) shows, many of the 20 cities with the highest average forecast errors, such as Helena, Missoula, and Casper, are located in the northwestern US. In the rest of the report, I explore why this may be the case.

weather %>% 
  mutate(city = factor(city, unique(city))) %>%
  mutate(city = (city %>% str_replace_all("_", " ") %>% str_to_title())) %>%
  group_by(city, state) %>% 
  summarize(
    mean_error = mean(error_magnitude, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  arrange(desc(mean_error)) %>% 
  slice_head(n = 20) %>% 
  ggplot() +
  geom_col(aes(x = mean_error, y = fct_reorder(city, mean_error))) +
  theme_classic() +
  labs(
    title = "Top 20 cities for forecast errors",
    x = "Average forecast error [degrees]",
    y = "City",
    caption = "Figure 2. Column chart of highest average forecast errors"
  )
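
The ranking step behind Fig. 2 can also be written with slice_max(), which keeps the rows with the largest values and returns them in descending order. This is only a sketch on invented data (the cities, states, and errors below are placeholders), not a replacement for the chunk above.

```r
library(dplyr)

# Invented per-forecast error magnitudes for three placeholder cities
toy <- tibble(
  city            = c("a", "a", "b", "b", "c", "c"),
  state           = c("XX", "XX", "YY", "YY", "ZZ", "ZZ"),
  error_magnitude = c(1, 3, 4, 6, 0, 2)
)

city_ranking <- toy %>%
  group_by(city, state) %>%
  summarize(mean_error = mean(error_magnitude), .groups = "drop") %>%
  slice_max(mean_error, n = 2)  # keep the two cities with the highest average error

city_ranking
```

On the real data, toy would be replaced by weather and n = 2 by n = 20. Note that slice_max() keeps ties by default, so it can return more than n rows when cities share the same average.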

Non-explanations

Perhaps there are particular times when weather is hardest to predict. It seems reasonable that there could be a most difficult time of year in which to predict weather conditions, or a most difficult lead time (the number of hours before the weather is recorded) at which to construct a forecast. However, the data do not support either idea. The line graph below (Fig. 3) shows no discernible trend in when during the year weather predictions are worst, and the number of hours before a weather measurement that the forecast is made does not seem to affect the resulting magnitude of error. These results indicate that time is not a good predictor of how accurate a weather forecast will be.

weather %>%
  mutate(forecast_hours_before = as.factor(forecast_hours_before)) %>%
  arrange(desc(error_magnitude)) %>%
  slice_head(n = 100000) %>%
  ggplot() +
  geom_line(aes(
    x = date, 
    y = error_magnitude, 
    color = fct_reorder2(forecast_hours_before, .x = date, .y = error_magnitude))) +
  scale_color_viridis_d(option = "A", begin = 0.3, end = 0.8) +
  theme_classic() +
  labs(
    title = "Forecast errors among the 100,000 least accurate weather predictions",
    x = "Date",
    y = "Forecast error [degrees]",
    color = "Number of hours\nbefore event",
    caption = "Figure 3. Timeseries for forecast errors among four\ndifferent prediction times"
  )
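
As a numeric complement to Fig. 3, the same question could be checked by summarizing error magnitude within each forecast lead time. The sketch below uses invented values, and the lead times of 12, 24, 36, and 48 hours are assumed for illustration; on the real data, toy would be replaced by weather.

```r
library(dplyr)

# Invented error magnitudes at four assumed lead times (hours before the observation)
toy <- tibble(
  forecast_hours_before = c(12, 12, 24, 24, 36, 36, 48, 48),
  error_magnitude       = c(1, 3, 2, 2, 1, 5, 3, 3)
)

lead_time_summary <- toy %>%
  group_by(forecast_hours_before) %>%
  summarize(
    mean_error   = mean(error_magnitude),
    median_error = median(error_magnitude),
    n            = n(),        # number of forecasts behind each average
    .groups = "drop"
  )

lead_time_summary
```

Unlike Fig. 3, which shows only the 100,000 largest errors, a summary like this would use every forecast, so similar averages across lead times would support the same conclusion from a different angle.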

If not time, perhaps the type of weather is a useful indicator of why particular cities struggle with forecasting. However, the box plot below (Fig. 4) shows no easily discernible differences in forecast error across different types of predicted weather. Therefore, it does not seem that cities more prone to any particular (possibly unpredictable) type of weather are more likely to make forecasting errors.

weather %>%
  left_join(outlook_meanings, by = "forecast_outlook") %>% 
  relocate(meaning, .after = forecast_outlook) %>% 
  ggplot() +
  geom_boxplot(aes(x = error_magnitude, y = meaning)) +
  theme_classic() +
  labs(
    title = "The type of weather predicted has little to do with how accurate any\ngiven forecast will be",
    x = "Forecast error [degrees]",
    y = "Forecast outlook",
    caption = "Figure 4. Box plot of weather forecast types and\nforecast errors."
  )
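
The join used for Fig. 4 attaches a readable meaning to each forecast_outlook code. A minimal sketch of that mechanic follows; the codes and meanings are invented placeholders rather than entries from outlook_meanings.

```r
library(dplyr)

# Placeholder forecasts and a placeholder key; "FOG" is deliberately missing from the key
toy_forecasts <- tibble(
  forecast_outlook = c("SUNNY", "RAIN", "FOG"),
  error_magnitude  = c(1, 2, 3)
)
toy_key <- tibble(
  forecast_outlook = c("SUNNY", "RAIN"),
  meaning          = c("Sunny", "Rain")
)

# left_join() keeps every forecast; codes absent from the key get NA as their meaning
joined <- toy_forecasts %>%
  left_join(toy_key, by = "forecast_outlook")

joined
```

Because a left join keeps unmatched rows, any outlook code without an entry in the key would appear in a box plot as its own NA category rather than being dropped.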

Geographic explanations

Errors in weather predictions do appear to be linked to geographic location. As the scatter plots below indicate, there are moderate to strong relationships between average forecast error and longitude (Fig. 5), latitude (Fig. 6), elevation (Fig. 7), and distance to coast (Fig. 8). This analysis indicates that, on average, cities at northern latitudes, western longitudes, and high elevations, and cities far inland, are most likely to forecast the weather inaccurately. This makes sense: distance from the equator, the lower atmospheric pressure and topographically controlled weather events found at high elevations, and distance from the moderating influence of the ocean all tend to lead to more variable and less predictable weather. Comparing these results with the 20 cities with the highest forecast errors (Fig. 2) is consistent with these trends: northwestern, inland cities at high elevations (like Denver and Colorado Springs) struggle to predict the weather accurately.

The following code calculates the correlation coefficients for the scatter plots (Figs. 5-8) below, which are created in the code chunks that follow.

# City-level averages shared by all four correlations
city_means <- weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = mean(error_magnitude, na.rm = TRUE),
    long = mean(lon, na.rm = TRUE),
    lat = mean(lat, na.rm = TRUE),
    ele = mean(elevation, na.rm = TRUE),
    coast = mean(distance_to_coast, na.rm = TRUE),
    .groups = "drop"
  )

# Store each correlation as a plain number so it can be rounded and pasted into a plot label
long_cor <- cor(city_means$long, city_means$mean_error, use = "complete.obs")
lat_cor <- cor(city_means$lat, city_means$mean_error, use = "complete.obs")
ele_cor <- cor(city_means$ele, city_means$mean_error, use = "complete.obs")
coast_cor <- cor(city_means$coast, city_means$mean_error, use = "complete.obs")

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    long = mean(lon, na.rm = TRUE)
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = long, y = mean_error)) +
  geom_point(color = "#F3B584", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#F3B584") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in Western cities",
    y = "Average forecast error [degrees]",
    x = "Longitude [degrees East; negative values are West]",
    caption = "Figure 5. Scatter plot of longitude against forecast error"
  ) +
  annotate("text",
    x = -70,
    y = 4,
    label = str_c("Correlation:\n", round(long_cor, 2)),
    size = 4,
    color = "#F3B584"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    lat = mean(lat, na.rm = TRUE) 
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = lat, y = mean_error)) +
  geom_point(color = "#F6A97A", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#F6A97A") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in Northern cities",
    y = "Average forecast error [degrees]",
    x = "Latitude [degrees North]",
    caption = "Figure 6. Scatter plot of latitude against forecast error"
  ) +
  annotate("text",
    x = 60,
    y = 4,
    label = str_c("Correlation:\n", round(lat_cor, 2)),
    size = 4,
    color = "#F6A97A"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    elevation = mean(elevation, na.rm = TRUE),
  ) %>% 
  ungroup() %>% 
  arrange(mean_error) %>% 
  ggplot(aes(x = elevation, y = mean_error)) +
  geom_point(color = "#4B2991", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#4B2991") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in elevated cities",
    y = "Average forecast error [degrees]",
    x = "Elevation [meters above sea level]",
    caption = "Figure 7. Scatter plot of elevation against forecast error"
  ) +
  annotate("text",
    x = 2000,
    y = 4,
    label = str_c("Correlation:\n", round(ele_cor, 2)),
    size = 4,
    color = "#4B2991"
  )

weather %>% 
  group_by(city, state) %>% 
  summarize(
    mean_error = abs(mean(error_magnitude, na.rm = TRUE)),
    distance_to_coast = mean(distance_to_coast, na.rm = TRUE),
  ) %>% 
  ungroup() %>% 
  ggplot(aes(x = distance_to_coast, y = mean_error)) +
  geom_point(color = "#D44292", alpha = 0.4) +
  geom_smooth(method = lm, se = FALSE, color = "#D44292") +
  theme_classic() +
  labs(
    title = "Forecast errors tend to be higher in inland cities",
    y = "Average forecast error [degrees]",
    x = "Distance to coast [miles]",
    caption = "Figure 8. Scatter plot of distance to coast against forecast\nerror"
  ) +
  annotate("text",
    x = 1100,
    y = 4,
    label = str_c("Correlation:\n", round(coast_cor, 2)),
    size = 4,
    color = "#D44292"
  )
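
The correlation coefficients printed on Figs. 5-8 describe the strength of each linear relationship but not its uncertainty. One way to gauge that, sketched here on invented city-level values (the elevations and errors below are placeholders, not results from this data set), is base R's cor.test(), which reports the same Pearson coefficient along with a confidence interval and p-value.

```r
# Invented city-level averages, for illustration only
toy_cities <- data.frame(
  elevation  = c(10, 50, 200, 600, 1200, 1600),
  mean_error = c(1.8, 2.0, 2.1, 2.6, 3.0, 3.4)
)

# Pearson correlation, as used for the plot annotations
r <- cor(toy_cities$elevation, toy_cities$mean_error)

# Same coefficient, plus a confidence interval and a test against zero correlation
elevation_test <- cor.test(toy_cities$elevation, toy_cities$mean_error)

elevation_test$estimate
elevation_test$conf.int
elevation_test$p.value
```

On the real data, the two columns would come from the per-city summary of weather. With 167 cities, even a moderate correlation would typically have a fairly narrow confidence interval, but that should be checked rather than assumed.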

Summary

In general, the above analysis shows that cities that struggle with accurate weather forecasting, such as Fairbanks, Denver, and Missoula, tend to be geographically located in northern latitudes, western longitudes, high elevations, and inland continental regions and tend to have dry, cold, windy climates.