The weather_forecasts dataset pairs weather forecasts with the weather actually observed at different times and places across the United States. The supplementary forecast_cities dataset gives geographical data about US cities. This report investigates sources of error in these forecasts by comparing, for each city, the mean observed high and low temperatures with the mean 12-hour forecasted high and low temperatures. Figure 1 shows the distribution of this error: the low temperature forecasts have more variable error, with a larger median and a larger maximum. Accurately forecasting the daily low therefore appears to be more challenging than forecasting the daily high.

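library(tidyverse)  # dplyr, tidyr, ggplot2, forcats

# Keep only 12-hour forecasts and put the high and low readings side by side,
# one row per city and date
wrangled_12hr <- weather_forecasts |>
  filter(forecast_hours_before == 12) |>
  pivot_wider(
    id_cols = c(date:state, observed_precip, possible_error),
    names_from = high_or_low,
    values_from = c(observed_temp, forecast_temp)
  ) |>
  # the inner join drops cities without geographical data
  inner_join(forecast_cities, join_by(city, state))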

# Mean signed error per city, reported as an absolute value
error_bycity_12hr <- wrangled_12hr |>
  group_by(city, state) |>
  mutate(
    low_error = abs(mean(forecast_temp_low - observed_temp_low, na.rm = TRUE)),
    high_error = abs(mean(forecast_temp_high - observed_temp_high, na.rm = TRUE))
  ) |>
  ungroup() |>
  # keep one row per city so each city counts once in the plots and fits
  distinct(city, state, .keep_all = TRUE) |>
  select(city, state, low_error, high_error, lon:avg_annual_precip)

# Collapse detailed Köppen codes into their major climate groups
error_bycity_12hr$koppen <- as.factor(error_bycity_12hr$koppen)
error_bycity_12hr$koppen2 <- fct_collapse(
  error_bycity_12hr$koppen,
  "A" = c("Af", "Am", "As", "Aw"),
  "B" = c("BSh", "BSk", "BWh", "BWk"),
  "C" = c("Cfa", "Cfb", "Csa", "Csb"),
  "D" = c("Dfa", "Dfb", "Dfc")
)

# Figure 1: one mean error per city and forecast type
weather_forecasts |>
  filter(forecast_hours_before == 12) |>
  group_by(city, state, high_or_low) |>
  summarise(
    forecast_error = abs(mean(forecast_temp - observed_temp, na.rm = TRUE)),
    .groups = "drop"
  ) |>
  ggplot(aes(x = forecast_error)) +
  geom_histogram(aes(fill = high_or_low), bins = 30) +
  facet_wrap(vars(high_or_low), nrow = 2) +
  labs(
    title = "Distribution of forecast error", 
    subtitle = "Averaged by city, using 12-hour forecasts",
    x = "Degrees between observed and forecasted temperature",
    y = "",
    fill = "Forecast type",
    tag = "Figure 1"
    ) +
  theme_minimal() +
  scale_fill_manual(values = c("red3", "turquoise4")) +
  theme(strip.text.x = element_blank())
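
The per-city error used above is the absolute value of the mean signed error, so overestimates and underestimates within a city cancel each other out. The following is a minimal sketch, on hypothetical toy data rather than weather_forecasts, of how that measure differs from the mean absolute error. The toy tibble and the column names abs_mean_error and mean_abs_error are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data: two cities with three forecasts each
toy <- tibble(
  city = c("A", "A", "A", "B", "B", "B"),
  forecast_temp = c(52, 48, 50, 71, 70, 72),
  observed_temp = c(50, 50, 50, 70, 70, 70)
)

city_error <- toy |>
  group_by(city) |>
  summarise(
    # absolute value of the mean signed error: +2 and -2 cancel for city A
    abs_mean_error = abs(mean(forecast_temp - observed_temp, na.rm = TRUE)),
    # mean of the absolute errors: every miss counts, whatever its direction
    mean_abs_error = mean(abs(forecast_temp - observed_temp), na.rm = TRUE)
  )

print(city_error)
```

In this toy case city A has no systematic bias, yet its forecasts still miss by more than a degree on average, so the two measures answer slightly different questions.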

In this project, I used an inner join so that only weather observations from cities with accompanying geographical data were included. I also learned to use geom_smooth() to draw a least squares regression line that displays the relationship between two quantitative variables.
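
As a small illustration of the inner join behaviour described above, the sketch below uses two hypothetical toy tables (not the real datasets) and shows that rows without a match in the city table are dropped. Using anti_join() to list the dropped rows is an added suggestion, not a step taken in this report.

```r
library(dplyr)

# Hypothetical toy tables standing in for weather_forecasts and forecast_cities
forecasts <- tibble(
  city = c("X", "Y", "Z"),
  state = c("AA", "BB", "CC"),
  forecast_temp = c(60, 75, 40)
)
cities <- tibble(
  city = c("X", "Y"),
  state = c("AA", "BB"),
  elevation = c(100, 1500)
)

# inner_join keeps only forecasts whose city and state appear in both tables
joined <- inner_join(forecasts, cities, join_by(city, state))

# anti_join lists the forecasts that the inner join discarded
dropped <- anti_join(forecasts, cities, join_by(city, state))

print(joined)
print(dropped)
```

Checking the dropped rows is a quick way to see how much data the join removes before any errors are averaged.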

Analysis Results

Forecasting lows

Error in forecasting low temperatures may be correlated with elevation and annual precipitation. Figure 2 shows a positive relationship between a city's mean low temperature forecasting error and the most severe elevation change among the four points closest to the city, so forecasts for cities in or near mountain ranges may tend to be less accurate for the daily low. The points of this scatterplot are also color-coded by each city's elevation, and cities at higher elevations appear to have a larger mean error. Figure 3 shows a negative relationship between a city's mean low temperature error and its average annual precipitation, suggesting that daily low temperatures may be harder to predict in drier climates.

error_bycity_12hr |>
  ggplot(aes(y = low_error, x = elevation_change_four)) +
  geom_point(aes(color = elevation)) +
  geom_smooth(
    method = "lm",
    formula = y~x,
    color = "black"
  ) +
  theme_minimal() +
  labs(
    title = "Low temperature error vs. nearby elevation change",
    x = "Nearby elevation change (meters)",
    y = "Forecasting error (degrees)",
    color = "Elevation (m)",
    tag = "Figure 2"
  ) +
  scale_color_viridis_c(option="G", begin=0.25)

error_bycity_12hr |>
  ggplot(aes(y = low_error, x = avg_annual_precip)) +
  geom_point(color = "turquoise4") +
  geom_smooth(
    method = "lm",
    formula = y~x,
    color = "black"
  ) +
  theme_minimal() +
  labs(
    title = "Low temperature error vs. average annual precipitation",
    subtitle = "Forecast observations averaged by city",
    x = "Average annual precipitation (inches)",
    y = "Forecasting error (degrees)",
    tag = "Figure 3"
  )
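
The relationships in Figures 2 and 3 are judged by eye from the fitted lines. One way to attach numbers to such a trend is a correlation coefficient and the slope of the least squares fit. The sketch below does this on hypothetical toy values, not the real city data, so the printed numbers say nothing about the actual forecasts.

```r
# Hypothetical toy values standing in for per-city precipitation and low error
toy <- data.frame(
  avg_annual_precip = c(10, 20, 30, 40, 50),
  low_error = c(3.0, 2.6, 2.1, 1.4, 1.0)
)

# Pearson correlation: sign and strength of the linear association
r <- cor(toy$avg_annual_precip, toy$low_error)

# Least squares line of low_error on avg_annual_precip
fit <- lm(low_error ~ avg_annual_precip, data = toy)
slope <- coef(fit)[["avg_annual_precip"]]

print(r)
print(slope)
```

geom_smooth(method = "lm", formula = y ~ x) draws the line from the same kind of lm() fit, so applying this to the real data would report the slope of the line drawn in the figure.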

Forecasting highs

Evaluating the sources of error in forecasting daily high temperatures proved more challenging because no strong patterns emerged, which may partly reflect the narrower range of high temperature errors. The strongest association I was able to identify was with latitude. Figure 4 indicates a positive relationship between latitude and mean high temperature error by city, so forecasts for cities farther from the equator appear to have greater difficulty with the daily high. Figure 5 displays another interesting observation: different Köppen climate classifications have different error distributions. Although there may not be adequate evidence that particular climates pose a particular challenge, the humid subtropical climate (Cfa) is much more strongly right-skewed than any other climate and has a minimum at 0 degrees of error, suggesting that high temperature forecasts in these regions may tend to be more accurate.

error_bycity_12hr |>
  ggplot(aes(y = high_error, x = lat)) +
  geom_point(color = "red3") +
  geom_smooth(
    method = "lm",
    formula = y~x,
    color = "black"
  ) +
  theme_minimal() +
  labs(
    title = "High temperature forecasting error by latitude",
    x = "Latitude (degrees)",
    y = "Forecasting error (degrees)",
    tag = "Figure 4"
  )

error_bycity_12hr |>
  ggplot(aes(x = high_error, fill = koppen2)) +
  geom_histogram(bins=10) +
  facet_wrap(vars(koppen)) +
  scale_fill_viridis_d(
    option="F", 
    begin = 0.2, 
    end = 0.8) +
  scale_x_continuous(breaks = c(0, 1, 2)) +
  # theme_minimal() must come first; a complete theme resets earlier theme() tweaks
  theme_minimal() +
  theme(panel.grid.minor = element_blank()) +
  labs(
    title = "Distribution of high temp forecasting error by climate classification",
    x = "Forecasting error (degrees)",
    fill = "Major climate group",
    tag = "Figure 5",
    y = ""
  )
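
The major climate group used to color Figure 5 is the first letter of each Köppen code. As a minimal sketch on a hypothetical toy vector (not the real koppen column), the example below collapses codes with fct_collapse() and checks that the result matches simply taking the first character of each code.

```r
library(forcats)

# Hypothetical toy vector of Köppen codes
koppen <- factor(c("Cfa", "Dfb", "BSk", "Cfa", "Csb"))

# Collapse detailed codes into major groups by listing them explicitly
major <- fct_collapse(
  koppen,
  B = c("BSk"),
  C = c("Cfa", "Csb"),
  D = c("Dfb")
)

# Alternative: the major group is the first letter of the code
major_by_letter <- substr(as.character(koppen), 1, 1)

print(table(major))
print(all(as.character(major) == major_by_letter))
```

The first-letter approach avoids having to list every code by hand, which may help if the data contain a classification that the explicit lists leave out.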