library(tidyverse)
library(lubridate)
library(here)
library(patchwork)
library(plotly)
library(ggthemes)
library(gt)

Introduction

This report analyzes the accuracy of weather forecasts in the United States, focusing on temperature predictions across 167 cities over a 16-month period. The analysis examines forecast errors for high and low temperatures, explores potential factors affecting forecast accuracy, and identifies data quality issues in the dataset.

Loading Data

forecast_cities <- read_csv(here("data","forecast_cities.csv"))
outlook_meanings <- read_csv(here("data","outlook_meanings.csv"))
weather_forecasts <- read_csv(here("data","weather_forecasts.csv"))

Data Cleaning and Merging

weather_forecasts <- weather_forecasts %>%
  mutate(date = as.Date(date),
         forecast_outlook = as.character(forecast_outlook)) %>%
  left_join(outlook_meanings, by = "forecast_outlook")

weather_data <- weather_forecasts %>%
  left_join(forecast_cities, by = c("city", "state")) %>%
  filter(!is.na(observed_temp) & !is.na(forecast_temp))
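
Because the city table is joined on both city and state, it can be worth confirming that this key is unique before merging; a repeated key would silently multiply forecast rows. The following is a minimal sketch on toy data, not the report's actual check. The data frames `cities_toy` and `forecasts_toy` and their values are hypothetical stand-ins for `forecast_cities` and `weather_forecasts`.

```r
library(dplyr)

# Hypothetical stand-ins for forecast_cities and weather_forecasts
cities_toy <- data.frame(
  city      = c("RICHMOND", "RICHMOND", "BUFFALO"),
  state     = c("VA", "CA", "NY"),
  elevation = c(50, 15, 180),
  stringsAsFactors = FALSE
)
forecasts_toy <- data.frame(
  city          = c("RICHMOND", "RICHMOND", "BUFFALO"),
  state         = c("VA", "CA", "NY"),
  forecast_temp = c(41, 58, 27),
  stringsAsFactors = FALSE
)

# Any (city, state) key appearing more than once would duplicate rows in the join
duplicate_keys <- cities_toy %>%
  count(city, state) %>%
  filter(n > 1)

joined_toy <- forecasts_toy %>%
  left_join(cities_toy, by = c("city", "state"))

# A left join against a unique key leaves the row count unchanged
rows_preserved <- nrow(joined_toy) == nrow(forecasts_toy)
```

If `duplicate_keys` has any rows, or `rows_preserved` is `FALSE`, the lookup table needs deduplicating before the join.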

Data Overview

Below is a map of all the cities included in the data set. Note that this does not include cities in Hawaii, Alaska, Puerto Rico, or the Virgin Islands.

us_map <- map_data("state")
forecast_cities %>%
  filter(!state %in% c("HI", "AK", "PR", "VI")) %>%
  ggplot() +
  # Reuse the state outlines loaded above instead of calling map_data() again
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group),
               fill = "lightgray", color = "white") +
  geom_point(aes(x = lon, y = lat), color = "blue", size = 2) +
  theme_map() +
  labs(title = "Map of US Cities Included",
       x = "Longitude", y = "Latitude")

Next, we have a table including the Köppen classifications used.

koppen_labels <- tribble(
  ~"koppen", ~"climate_description", ~"example_city",
  "Af",  "Tropical Rainforest (no dry season)", "West Palm Beach, FL",
  "Am",  "Tropical Monsoon (short dry season)", "San Juan, PR",
  "As",  "Tropical Savanna (dry summer)", "Honolulu, HI",
  "Aw",  "Tropical Savanna (dry winter)", "Key West, FL",
  "BSh", "Hot Semi-Arid (steppe, hot)", "Tucson, AZ",
  "BSk", "Cold Semi-Arid (steppe, cold)", "Denver, CO",
  "BWh", "Hot Desert (arid, very hot)", "Phoenix, AZ",
  "BWk", "Cold Desert (arid, cold)", "Las Vegas, NV",
  "Cfa", "Humid Subtropical (hot summers, no dry season)", "Atlanta, GA",
  "Cfb", "Temperate Oceanic (warm summers, no dry season)", "Santa Fe, NM",
  "Cfc", "Subpolar Oceanic (cool summers, no dry season)", "Old Harbor, AK",
  "Csa", "Mediterranean (hot dry summers)", "Sacramento, CA", 
  "Csb", "Mediterranean (warm dry summers)", "Los Angeles, CA",
  "Dfa", "Humid Continental (hot summers, no dry season)", "Chicago, IL",
  "Dfb", "Humid Continental (warm summers, no dry season)", "Milwaukee, WI",
  "Dfc", "Subarctic (cool summers, severe winters)", "Anchorage, AK",
)
koppen_labels <- koppen_labels %>%
  mutate(
    climate_zone = case_when(
      startsWith(koppen, "A") ~ "Tropical",
      startsWith(koppen, "B") ~ "Dry",
      startsWith(koppen, "C") ~ "Temperate",
      startsWith(koppen, "D") ~ "Continental"
    )
  )

forecast_cities %>%
  group_by(koppen) %>%
  left_join(koppen_labels, by = "koppen") %>%
  summarize(
    "Climate Description" = first(climate_description), 
    "Example City" = first(example_city),
    "Number of Cities" = n(),
    climate_zone = first(climate_zone)
  ) %>%
  gt(rowname_col = "koppen", groupname_col = "climate_zone", row_group_as_column = TRUE)
Climate Zone  Köppen  Climate Description                              Example City         Number of Cities
Tropical      Af      Tropical Rainforest (no dry season)              West Palm Beach, FL  4
              Am      Tropical Monsoon (short dry season)              San Juan, PR         9
              As      Tropical Savanna (dry summer)                    Honolulu, HI         1
              Aw      Tropical Savanna (dry winter)                    Key West, FL         1
Dry           BSh     Hot Semi-Arid (steppe, hot)                      Tucson, AZ           1
              BSk     Cold Semi-Arid (steppe, cold)                    Denver, CO           24
              BWh     Hot Desert (arid, very hot)                      Phoenix, AZ          2
              BWk     Cold Desert (arid, cold)                         Las Vegas, NV        4
Temperate     Cfa     Humid Subtropical (hot summers, no dry season)   Atlanta, GA          89
              Cfb     Temperate Oceanic (warm summers, no dry season)  Santa Fe, NM         6
              Cfc     Subpolar Oceanic (cool summers, no dry season)   Old Harbor, AK       1
              Csa     Mediterranean (hot dry summers)                  Sacramento, CA       2
              Csb     Mediterranean (warm dry summers)                 Los Angeles, CA      15
Continental   Dfa     Humid Continental (hot summers, no dry season)   Chicago, IL          27
              Dfb     Humid Continental (warm summers, no dry season)  Milwaukee, WI        45
              Dfc     Subarctic (cool summers, severe winters)         Anchorage, AK        5

Identifying Data Quality Issues

The dataset contains multiple cities with the same name in different states, such as ‘Richmond’ and ‘Buffalo’ (e.g., Richmond, VA; Richmond, CA; Richmond, RI). It appears that when the observations were joined, the data was matched solely by city name without considering the state, resulting in duplicated and inaccurate entries for every city that shares a name. Note that the observed temperature and precipitation values are exactly the same for each same-named city, regardless of state.

weather_forecasts %>%
  filter(date == ymd(20210130),
         high_or_low == "high",
         forecast_hours_before == 48,
         city %in% c("BUFFALO", "RICHMOND")) %>%
  # select() picks and renames columns; summarize() is meant for aggregation
  select(city, state, date,
         "high or low" = high_or_low,
         "hours before" = forecast_hours_before,
         "temperature" = observed_temp,
         "precipitation" = observed_precip) %>%
  gt(groupname_col = "city", rowname_col = "state", row_group_as_column = TRUE)
city      state  date        high or low  hours before  temperature  precipitation
BUFFALO   NY     2021-01-30  high         48            28           0.00
          WY     2021-01-30  high         48            28           0.00
RICHMOND  VA     2021-01-30  high         48            40           0.08
          CA     2021-01-30  high         48            40           0.08
          RI     2021-01-30  high         48            40           0.08
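
The pattern in the table above could also be searched for systematically rather than by spot-checking two city names. The sketch below flags city names that occur in several states but carry only one distinct observation on a given date. The data frame `obs_toy` is a hypothetical toy example that mimics the table (plus one unaffected city); the column names are assumed to match `weather_forecasts`.

```r
library(dplyr)

# Hypothetical toy rows mimicking the duplicated-observation pattern
obs_toy <- data.frame(
  city            = c("BUFFALO", "BUFFALO", "RICHMOND", "RICHMOND", "DENVER"),
  state           = c("NY", "WY", "VA", "CA", "CO"),
  date            = as.Date("2021-01-30"),
  observed_temp   = c(28, 28, 40, 40, 35),
  observed_precip = c(0, 0, 0.08, 0.08, 0),
  stringsAsFactors = FALSE
)

suspect_cities <- obs_toy %>%
  group_by(city, date) %>%
  summarize(
    n_states = n_distinct(state),
    # Count distinct (temperature, precipitation) pairs within the group
    n_obs    = n_distinct(paste(observed_temp, observed_precip)),
    .groups  = "drop"
  ) %>%
  # Several states but a single distinct observation is suspicious
  filter(n_states > 1, n_obs == 1)
```

On a single date two genuinely different cities could match by chance, so on real data it would be safer to require the pattern to hold across many dates before treating a city name as affected.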

Calculating Forecast Error

weather_data <- weather_data %>%
  mutate(forecast_error = abs(forecast_temp - observed_temp))

Visualizing Forecast Errors

ggplot(weather_data, aes(x = forecast_error)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Distribution of Forecast Error", 
    x = "Forecast Error (°F)", 
    y = "Frequency")+
  scale_x_continuous(breaks = seq(0, 100, by = 10))

The histogram shows that forecast errors span a wide range, but the distribution is strongly right-skewed: the vast majority of forecasts are within 10 °F of the observed temperature.
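
The visual impression from the histogram can be backed by a few summary numbers. The sketch below uses made-up temperatures (the vectors are hypothetical, not values from the dataset) to compute the share of forecasts within 10 °F, the mean absolute error, and the mean signed error. The signed error is an addition not used elsewhere in this report; it shows whether forecasts run warm or cold on average, which the absolute error hides.

```r
# Hypothetical forecast and observed temperatures in °F
forecast_temp <- c(70, 65, 80, 55, 90, 32)
observed_temp <- c(72, 65, 68, 57, 89, 30)

abs_error    <- abs(forecast_temp - observed_temp)  # magnitude, like forecast_error
signed_error <- forecast_temp - observed_temp       # positive means forecast too warm

share_under_10 <- mean(abs_error < 10)  # fraction of forecasts within 10 °F
mean_abs_error <- mean(abs_error)
mean_bias      <- mean(signed_error)
```

On the real data the same three lines could be applied to `weather_data$forecast_temp` and `weather_data$observed_temp`.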

Analyzing Accuracy by Factors

accuracy_factors <- weather_data %>%
  group_by(koppen, elevation, distance_to_coast, wind) %>%
  summarize(mean_error = mean(forecast_error, na.rm = TRUE), .groups = "drop")

Mean Error vs Elevation

ggplot(accuracy_factors, aes(x = elevation, y = mean_error)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Elevation",
    y = "Mean Error in Temperature"
  ) +
  theme_minimal()

We see that there is a positive correlation between error and elevation of the cities.
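
A trend line alone does not show how strong the relationship is. As a rough sketch, the correlation coefficient and the slope of the same linear fit that `geom_smooth(method = "lm")` draws can be computed directly. The data frame `city_summary` and its numbers are invented for illustration; on the real data, `accuracy_factors` would take its place.

```r
# Hypothetical per-city summaries; the numbers are made up for illustration
city_summary <- data.frame(
  elevation  = c(10, 200, 650, 1200, 1600, 2100),
  mean_error = c(2.1, 2.3, 2.2, 2.9, 3.1, 3.4)
)

# Pearson correlation between elevation and mean error
r <- cor(city_summary$elevation, city_summary$mean_error)

# Ordinary least squares fit; the slope is the change in mean error per unit of elevation
fit   <- lm(mean_error ~ elevation, data = city_summary)
slope <- unname(coef(fit)["elevation"])
```

The same two calls work for `distance_to_coast` and `wind` by swapping the predictor column.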

Mean Error vs Köppen Classification

accuracy_factors %>%
  group_by(koppen) %>%
  filter(n() >= 3) %>% # Keep koppen classes with at least 3 rows in accuracy_factors
  ggplot(aes(x = koppen, y = mean_error)) +
  geom_violin() +
  labs(
    x = "Koppen Classification",
    y = "Mean Error"
  ) +
  theme_minimal()

We see that Am has the lowest mean error and Dfc the highest among the Köppen climate classifications shown. This is plausible: in warm, humid climates we would expect lower error, because high humidity and surrounding water moderate temperature swings. In contrast, the high error for the subarctic Dfc class may reflect more variable conditions at high latitudes, possibly related to the mix of gyres in the Arctic Circle. Note: we removed Af, As, Aw, BSh, BWh, BWk, and Csa due to low sample sizes.
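
To make the sample-size filter explicit, the classes that fall below the threshold can be listed rather than typed out by hand. This is a minimal sketch on toy data; `toy`, its values, and the name `min_cities` are hypothetical, and the threshold of 3 mirrors the `n() >= 3` filter in the plotting code.

```r
library(dplyr)

# Hypothetical per-city mean errors with uneven class sizes
toy <- data.frame(
  koppen     = c("Cfa", "Cfa", "Cfa", "Dfb", "Dfb", "Dfb", "Dfb", "As", "BWh", "BWh"),
  mean_error = c(2.0, 2.4, 2.2, 3.0, 2.8, 3.1, 2.9, 1.5, 2.6, 2.7),
  stringsAsFactors = FALSE
)

min_cities <- 3  # assumed threshold, matching n() >= 3

# Keep only classes with enough rows to draw a meaningful violin
kept <- toy %>%
  group_by(koppen) %>%
  filter(n() >= min_cities) %>%
  ungroup()

# Classes present in the input but removed by the filter
dropped <- setdiff(unique(toy$koppen), unique(kept$koppen))
```

Reporting `dropped` next to the plot keeps the list of excluded classes in sync with the threshold if it changes.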

Mean Error vs Distance to Coast

ggplot(accuracy_factors, aes(x = distance_to_coast, y = mean_error)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Distance to Coast",
    y = "Mean Error in Temperature"
  ) +
  theme_minimal()

We see that there is a positive correlation between error and distance to coast. This makes sense because water has a high heat capacity, so oceans keep nearby temperatures more stable and therefore easier to predict.

Mean Error vs Wind Speed

ggplot(accuracy_factors, aes(x = wind, y = mean_error)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Wind Speed",
    y = "Mean Error in Temperature"
  ) +
  theme_minimal()

We see that there is a positive correlation between error and wind speed. This makes sense because high wind speeds can lead to more rapid changes in temperature, making it harder to predict accurately.

Conclusion

In this report, we analyzed the accuracy of weather forecasts across the United States and identified data quality issues affecting the dataset. Forecast error was associated with elevation, climate type, distance to the coast, and wind speed. Future improvements could focus on ensuring that observations are matched to cities by both name and state, and on further refining forecasts for cities with high error rates.