Introduction

Accurate weather forecasts play a vital role in decision-making for various sectors. However, forecast accuracy can vary significantly depending on geographic and climate factors. In this report, we explore forecast accuracy for high and low temperature predictions across 167 U.S. cities. We aim to determine which areas struggle with forecast errors and investigate possible reasons behind these inaccuracies.

Load Libraries and Data

library(tidyverse)
library(ggthemes)
library(lubridate)
library(viridis)
library(ggridges)

# Load datasets
weather_forecasts <- read_csv("data/weather_forecasts.csv")
forecast_cities <- read_csv("data/forecast_cities.csv")
outlook_meanings <- read_csv("data/outlook_meanings.csv")

Data Preparation

# Ensure date columns are properly parsed
weather_forecasts <- weather_forecasts %>%
  mutate(date = ymd(date),
         high_or_low = as.factor(high_or_low),
         forecast_outlook = as.factor(forecast_outlook))

forecast_cities <- forecast_cities %>%
  mutate(city = as.factor(city), state = as.factor(state))

Data Integration

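# Join weather data with city information
# Join on both city and state (assumes weather_forecasts also has a state column),
# because city names can repeat across states
full_data <- weather_forecasts %>%
  left_join(forecast_cities, by = c("city", "state")) %>%
  left_join(outlook_meanings, by = "forecast_outlook")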

# Create a new column for forecast error (absolute difference between forecast and observed temp)
full_data <- full_data %>%
  mutate(temp_error = abs(forecast_temp - observed_temp))
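
Several U.S. cities share a name across states, so a join keyed on city name alone can silently duplicate forecast rows. The sketch below uses small made-up tables (the `_toy` names and all values are hypothetical, not taken from the data) to show how comparing row counts before and after a join can reveal the problem.

```r
library(dplyr)

# Hypothetical toy data: two different cities are both named "PORTLAND"
forecasts_toy <- tibble(
  city          = c("PORTLAND", "PORTLAND", "DENVER"),
  state         = c("OR", "ME", "CO"),
  forecast_temp = c(60, 55, 70),
  observed_temp = c(62, 54, 66)
)
cities_toy <- tibble(
  city      = c("PORTLAND", "PORTLAND", "DENVER"),
  state     = c("OR", "ME", "CO"),
  elevation = c(15, 19, 1609)  # made-up elevations in meters
)

# Keyed on city alone, each Portland forecast matches both Portlands
joined_city_only <- left_join(forecasts_toy, cities_toy, by = "city")

# Keyed on city and state, each forecast matches exactly one city row
joined_city_state <- left_join(forecasts_toy, cities_toy, by = c("city", "state"))

nrow(forecasts_toy)      # rows before joining
nrow(joined_city_only)   # more rows than the input: duplicates were created
nrow(joined_city_state)  # same number of rows as the input
```

A row count that grows after a `left_join()` is a quick signal that the key does not uniquely identify rows in the lookup table.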

Summary of Forecast Errors

# Summary statistics for overall error
overall_accuracy <- full_data %>%
  summarize(
    mean_error = mean(temp_error, na.rm = TRUE),
    median_error = median(temp_error, na.rm = TRUE),
    error_sd = sd(temp_error, na.rm = TRUE)
  )

overall_accuracy
## # A tibble: 1 × 3
##   mean_error median_error error_sd
##        <dbl>        <dbl>    <dbl>
## 1       2.32            2     2.12
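
The overall summary pools high and low temperature forecasts together. A minimal sketch of how the same statistics could be split by `high_or_low` is shown below; the toy rows are hypothetical and only mirror the column names used in `full_data`.

```r
library(dplyr)

# Hypothetical toy rows; column names mirror full_data, values are made up
toy <- tibble(
  high_or_low   = c("high", "high", "low", "low"),
  forecast_temp = c(80, 75, 60, 55),
  observed_temp = c(83, 74, 58, 51)
) %>%
  mutate(temp_error = abs(forecast_temp - observed_temp))

# Same summary as above, computed separately for highs and lows
error_by_type <- toy %>%
  group_by(high_or_low) %>%
  summarize(
    mean_error   = mean(temp_error, na.rm = TRUE),
    median_error = median(temp_error, na.rm = TRUE),
    n            = n()
  )

error_by_type
```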

Analyzing Error by City

# Average error by city: keep the 10 cities with the highest mean absolute error
error_by_city_top <- full_data %>%
  group_by(city) %>%
  summarize(avg_error = mean(temp_error, na.rm = TRUE)) %>%
  arrange(desc(avg_error)) %>%
  slice_head(n = 10)

# Plot
ggplot(error_by_city_top, aes(x = reorder(city, avg_error), y = avg_error, fill = avg_error)) +
  geom_col() +
  scale_fill_viridis(option = "C") +
  labs(title = "Top 10 Cities with the Highest Forecast Errors",
       x = "City", y = "Average Error (°F)", fill = "Error") +
  theme_minimal() +
  coord_flip()

Error Metrics Calculation

# Calculate additional forecast accuracy metrics
accuracy_metrics <- full_data %>%
  summarize(
    # Signed bias (forecast minus observed); a negative value means forecasts run low on average
    mean_error = mean(forecast_temp - observed_temp, na.rm = TRUE),
    mean_absolute_error = mean(abs(forecast_temp - observed_temp), na.rm = TRUE),
    root_mean_squared_error = sqrt(mean((forecast_temp - observed_temp)^2, na.rm = TRUE)),
    # MAPE divides by observed_temp, so it returns Inf when any row has an
    # observed temperature of 0 and a nonzero error
    mean_absolute_percentage_error = mean(abs((forecast_temp - observed_temp) / observed_temp) * 100, na.rm = TRUE)
  )

accuracy_metrics
## # A tibble: 1 × 4
##   mean_error mean_absolute_error root_mean_squared_error mean_absolute_percent…¹
##        <dbl>               <dbl>                   <dbl>                   <dbl>
## 1     -0.429                2.32                    3.14                     Inf
## # ℹ abbreviated name: ¹​mean_absolute_percentage_error
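
The `Inf` in the last column comes from dividing by an observed temperature of exactly 0. The toy example below (made-up values) reproduces the effect and shows one possible workaround, dropping zero observations. Even then, a percentage error on the Fahrenheit scale is hard to interpret because 0 °F is an arbitrary zero point, so MAE and RMSE are likely the more useful metrics here.

```r
# Hypothetical values; one observed temperature is exactly 0 °F
observed <- c(50, 0, 70)
forecast <- c(52, 1, 67)

# Absolute percentage error per observation
ape <- abs((forecast - observed) / observed) * 100

# Inf: the zero observation forces a division by zero
mape_all <- mean(ape)

# One possible workaround: exclude zero observations before averaging
mape_nonzero <- mean(ape[observed != 0])

mape_all
mape_nonzero
```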

Relationship Between Elevation and Error

# Analyze relationship between elevation and forecast error
elevation_analysis <- full_data %>%
  group_by(city) %>%
  summarize(avg_error = mean(temp_error, na.rm = TRUE),
            elevation = mean(elevation, na.rm = TRUE))

# plot
plot_elevation <- ggplot(elevation_analysis, aes(x = elevation, y = avg_error)) +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Relationship Between Elevation and Forecast Error",
       x = "Elevation (meters)", y = "Average Temperature Error (°F)") +
  theme_minimal()

plot_elevation

Insight:

Higher elevation is associated with larger forecast errors, possibly due to complex atmospheric conditions at higher altitudes.
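
A fitted line on a scatter plot does not by itself say how strong the association is. As a sketch, the slope, R-squared, and a rank correlation could be computed as follows; the data here are simulated stand-ins for `elevation_analysis`, so the numbers they produce say nothing about the real cities.

```r
# Simulated stand-in for elevation_analysis (all values are made up)
set.seed(1)
toy <- data.frame(elevation = runif(50, min = 0, max = 2500))
toy$avg_error <- 2 + 0.0004 * toy$elevation + rnorm(50, sd = 0.3)

# Linear fit: the elevation coefficient is °F of error per meter
fit <- lm(avg_error ~ elevation, data = toy)
slope <- coef(fit)[["elevation"]]
r_squared <- summary(fit)$r.squared

# Rank correlation, which is less sensitive to a few very high cities
rank_test <- cor.test(toy$elevation, toy$avg_error, method = "spearman")

slope
r_squared
rank_test$estimate
```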

Error Over Time

# Analyze average error over time
error_over_time <- full_data %>%
  group_by(date) %>%
  summarize(avg_error = mean(temp_error, na.rm = TRUE))

# Plot
ggplot(error_over_time, aes(x = date, y = avg_error)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(title = "Forecast Accuracy Over Time",
       x = "Date", y = "Average Temperature Error (°F)") +
  theme_minimal()

Insight:

The plot shows fluctuations in error over time, with certain periods having higher spikes. These may correspond to extreme weather events or seasonal variability.
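
Daily averages are noisy, which makes seasonal patterns hard to read. One way to check for seasonality, sketched here on hypothetical toy rows, is to aggregate errors by calendar month with `lubridate::floor_date()` before plotting.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rows; column names mirror full_data, values are made up
toy <- tibble(
  date       = ymd(c("2021-01-05", "2021-01-20", "2021-07-03", "2021-07-18")),
  temp_error = c(4, 2, 1, 1)
)

# Round each date down to the first day of its month, then average per month
error_by_month <- toy %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarize(avg_error = mean(temp_error, na.rm = TRUE), n = n())

error_by_month
```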

Error by Forecast Horizon

# Analyze error by forecast horizon
error_by_horizon <- full_data %>%
  group_by(forecast_hours_before) %>%
  summarize(avg_error = mean(temp_error, na.rm = TRUE))


# Visualize with a boxplot; outliers are hidden and the y-axis is clipped at 8 °F for readability
plot_box <- ggplot(full_data, aes(x = factor(forecast_hours_before), y = temp_error)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", outlier.shape = NA) +
  labs(title = "Boxplot of Forecast Errors by Forecast Horizon",
       x = "Forecast Horizon (Hours Before Observation)", 
       y = "Temperature Error (°F)") +
  coord_cartesian(ylim = c(0, 8)) +  
  theme_minimal()

plot_box

Insight:

Forecast errors increase with longer forecast horizons, indicating reduced accuracy for longer-term predictions.
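
Because the boxplot hides outliers and clips the y-axis, a per-horizon table is a useful companion. The sketch below uses hypothetical toy rows to show how the mean and median could be tabulated and how to check that the mean error never decreases as the horizon grows.

```r
library(dplyr)

# Hypothetical toy rows; column names mirror full_data, values are made up
toy <- tibble(
  forecast_hours_before = c(12, 12, 24, 24, 48, 48),
  temp_error            = c(1, 2, 2, 3, 3, 5)
)

# Mean, median, and row count per forecast horizon, shortest horizon first
horizon_summary <- toy %>%
  group_by(forecast_hours_before) %>%
  summarize(
    avg_error    = mean(temp_error, na.rm = TRUE),
    median_error = median(temp_error, na.rm = TRUE),
    n            = n()
  ) %>%
  arrange(forecast_hours_before)

# TRUE when the mean error never decreases from one horizon to the next
error_is_monotone <- all(diff(horizon_summary$avg_error) >= 0)

horizon_summary
error_is_monotone
```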

Analyzing Climate Factors

# Group data by climate classification and calculate error
error_by_climate <- full_data %>%
  group_by(koppen) %>%
  summarize(avg_error = mean(temp_error, na.rm = TRUE)) %>%
  arrange(desc(avg_error))

# Plot
ggplot(error_by_climate, aes(x = reorder(koppen, avg_error), y = avg_error, fill = avg_error)) +
  geom_col() +
  scale_fill_viridis(option = "C") +
  labs(title = "Average Forecast Error by Climate Classification",
       x = "Climate Type", y = "Average Error (°F)", fill = "Error") +
  theme_minimal() +
  coord_flip()

Insight:

Climate classification has a noticeable impact on forecast accuracy, with certain climates (e.g., those with rapid weather changes) having larger errors.
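
Some climate classes may be represented by only one or two cities, in which case their average says more about those cities than about the climate type. A sketch of how group sizes could be reported alongside the averages, using hypothetical toy rows:

```r
library(dplyr)

# Hypothetical toy rows; column names mirror full_data, values are made up
toy <- tibble(
  city       = c("A", "A", "B", "C", "C"),
  koppen     = c("Cfa", "Cfa", "Cfa", "BSk", "BSk"),
  temp_error = c(2, 3, 1, 4, 6)
)

# Average error per climate class, with the number of cities and rows behind it
error_by_climate_n <- toy %>%
  group_by(koppen) %>%
  summarize(
    avg_error = mean(temp_error, na.rm = TRUE),
    n_cities  = n_distinct(city),
    n_obs     = n()
  ) %>%
  arrange(desc(avg_error))

error_by_climate_n
```

Classes with a small `n_cities` could then be flagged or excluded before comparing averages.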

Conclusion

This analysis indicates that forecast errors increase with forecast horizon and that cities at higher elevations tend to have larger errors, suggesting that geographic and climate factors influence forecast accuracy. Climate classification and extreme weather events may further contribute to these errors.

We identified that the top 10 cities with the highest forecast errors likely face challenges due to localized geographic or climatic factors. Additionally, higher elevations showed a clear relationship with increased forecast errors, possibly due to atmospheric complexities in these regions. The climate classification analysis demonstrated that areas with dynamic and variable climates tend to have greater forecast errors.

The results from these analyses highlight areas for improvement in weather forecasting models, particularly for regions with complex terrain or unstable weather patterns.

Further work could investigate other factors, such as wind patterns and precipitation types, to enhance the understanding of forecast accuracy across different scenarios.