Analyzing Potential Factors Causing Inaccuracy in Weather Forecasts

library(tidyverse)
library(lubridate)
library(here)
library(patchwork)
library(plotly)
library(ggthemes)
library(gt)

Introduction

This report analyzes the accuracy of weather forecasts in the United States, focusing on temperature predictions across 167 cities over a 16-month period. The analysis examines forecast errors for high and low temperatures, explores potential factors affecting forecast accuracy, and identifies data quality issues in the dataset.

Loading Data

forecast_cities <- read_csv(here("data","forecast_cities.csv"))
outlook_meanings <- read_csv(here("data","outlook_meanings.csv"))
weather_forecasts <- read_csv(here("data","weather_forecasts.csv"))

Data Cleaning and Merging

weather_forecasts <- weather_forecasts %>%
  mutate(date = as.Date(date),
         forecast_outlook = as.character(forecast_outlook)) %>%
  left_join(outlook_meanings, by = "forecast_outlook")

weather_data <- weather_forecasts %>%
  left_join(forecast_cities, by = c("city", "state")) %>%
  filter(!is.na(observed_temp) & !is.na(forecast_temp))

Data Overview

Below is a map of the cities in the contiguous United States that are included in the data set. Cities in Hawaii, Alaska, Puerto Rico, and the Virgin Islands are part of the data but are omitted from the map.

us_map <- map_data("state")
forecast_cities %>%
  # The base map only covers the contiguous states, so drop cities elsewhere
  filter(!state %in% c("HI", "AK", "PR", "VI")) %>%
  ggplot() +
  # Reuse the state outlines loaded above instead of calling map_data() again
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group),
               fill = "lightgray", color = "white") +
  geom_point(aes(x = lon, y = lat), color = "blue", size = 2) +
  theme_map() +
  labs(title = "Map of US Cities Included",
       x = "Longitude", y = "Latitude")

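The next table lists the Köppen climate classifications that appear in the data set, with a short description, an example city, and the number of cities in each class.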

koppen_labels <- tribble(
  ~koppen, ~climate_description, ~example_city,
  "Af",  "Tropical Rainforest (no dry season)", "West Palm Beach, FL",
  "Am",  "Tropical Monsoon (short dry season)", "San Juan, PR",
  "As",  "Tropical Savanna (dry summer)", "Honolulu, HI",
  "Aw",  "Tropical Savanna (dry winter)", "Key West, FL",
  "BSh", "Hot Semi-Arid (steppe, hot)", "Tucson, AZ",
  "BSk", "Cold Semi-Arid (steppe, cold)", "Denver, CO",
  "BWh", "Hot Desert (arid, very hot)", "Phoenix, AZ",
  "BWk", "Cold Desert (arid, cold)", "Las Vegas, NV",
  "Cfa", "Humid Subtropical (hot summers, no dry season)", "Atlanta, GA",
  "Cfb", "Temperate Oceanic (warm summers, no dry season)", "Santa Fe, NM",
  "Cfc", "Subpolar Oceanic (cool summers, no dry season)", "Old Harbor, AK",
  "Csa", "Mediterranean (hot dry summers)", "Sacramento, CA",
  "Csb", "Mediterranean (warm dry summers)", "Los Angeles, CA",
  "Dfa", "Humid Continental (hot summers, no dry season)", "Chicago, IL",
  "Dfb", "Humid Continental (warm summers, no dry season)", "Milwaukee, WI",
  "Dfc", "Subarctic (cool summers, severe winters)", "Anchorage, AK",
)
koppen_labels <- koppen_labels %>%
  mutate(
    climate_zone = case_when(
      startsWith(koppen, "A") ~ "Tropical",
      startsWith(koppen, "B") ~ "Dry",
      startsWith(koppen, "C") ~ "Temperate",
      startsWith(koppen, "D") ~ "Continental"
    )
  )

forecast_cities %>%
  group_by(koppen) %>%
  left_join(koppen_labels, by = "koppen") %>%
  summarize(
    "Climate Description" = first(climate_description), 
    "Example City" = first(example_city),
    "Number of Cities" = n(),
    climate_zone = first(climate_zone)
  ) %>%
  gt(rowname_col = "koppen", groupname_col = "climate_zone", row_group_as_column = TRUE)

Climate Zone  Köppen  Climate Description                                Example City         Number of Cities
Tropical      Af      Tropical Rainforest (no dry season)                West Palm Beach, FL  4
Tropical      Am      Tropical Monsoon (short dry season)                San Juan, PR         9
Tropical      As      Tropical Savanna (dry summer)                      Honolulu, HI         1
Tropical      Aw      Tropical Savanna (dry winter)                      Key West, FL         1
Dry           BSh     Hot Semi-Arid (steppe, hot)                        Tucson, AZ           1
Dry           BSk     Cold Semi-Arid (steppe, cold)                      Denver, CO           24
Dry           BWh     Hot Desert (arid, very hot)                        Phoenix, AZ          2
Dry           BWk     Cold Desert (arid, cold)                           Las Vegas, NV        4
Temperate     Cfa     Humid Subtropical (hot summers, no dry season)     Atlanta, GA          89
Temperate     Cfb     Temperate Oceanic (warm summers, no dry season)    Santa Fe, NM         6
Temperate     Cfc     Subpolar Oceanic (cool summers, no dry season)     Old Harbor, AK       1
Temperate     Csa     Mediterranean (hot dry summers)                    Sacramento, CA       2
Temperate     Csb     Mediterranean (warm dry summers)                   Los Angeles, CA      15
Continental   Dfa     Humid Continental (hot summers, no dry season)     Chicago, IL          27
Continental   Dfb     Humid Continental (warm summers, no dry season)    Milwaukee, WI        45
Continental   Dfc     Subarctic (cool summers, severe winters)           Anchorage, AK        5

Identifying Data Quality Issues

The dataset contains multiple cities with the same name, such as ‘Richmond’ and ‘Buffalo’, across different states (e.g., Richmond, VA; Richmond, CA; Richmond, RI). During the join operation, the data was matched solely by city name without considering the state, resulting in duplicated and inaccurate entries for each instance of these cities. Note that the observed temperature and precipitation values in the table that follows are identical across states for each city name.

weather_forecasts %>%
  filter(date == ymd("2021-01-30"),
         high_or_low == "high",
         forecast_hours_before == 48,
         city %in% c("BUFFALO", "RICHMOND")) %>%
  # select() keeps one row per forecast and renames columns for display;
  # summarize() is meant to return one row per group
  select(city, state, date, "high or low" = high_or_low, "hours before" = forecast_hours_before, "temperature" = observed_temp, "precipitation" = observed_precip) %>%
  gt(groupname_col = "city", rowname_col = "state", row_group_as_column = TRUE)

city      state  date        high or low  hours before  temperature  precipitation
BUFFALO   NY     2021-01-30  high         48            28           0.00
BUFFALO   WY     2021-01-30  high         48            28           0.00
RICHMOND  VA     2021-01-30  high         48            40           0.08
RICHMOND  CA     2021-01-30  high         48            40           0.08
RICHMOND  RI     2021-01-30  high         48            40           0.08
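
A check along the following lines could flag such cities automatically. This is only a minimal sketch on a small hand-made table: toy_forecasts is a stand-in for weather_forecasts, the BUFFALO and RICHMOND rows mirror the values shown above, the SPRINGFIELD rows are invented for contrast, and shared_names and suspect_cities are hypothetical names.

```r
library(dplyr)

# Toy stand-in for weather_forecasts. The SPRINGFIELD rows are made up to show
# a shared city name whose observations legitimately differ between states.
toy_forecasts <- tibble::tribble(
  ~city,         ~state, ~date,        ~high_or_low, ~observed_temp, ~observed_precip,
  "BUFFALO",     "NY",   "2021-01-30", "high",       28,             0.00,
  "BUFFALO",     "WY",   "2021-01-30", "high",       28,             0.00,
  "RICHMOND",    "VA",   "2021-01-30", "high",       40,             0.08,
  "RICHMOND",    "CA",   "2021-01-30", "high",       40,             0.08,
  "SPRINGFIELD", "IL",   "2021-01-30", "high",       30,             0.00,
  "SPRINGFIELD", "MO",   "2021-01-30", "high",       38,             0.10
)

# City names that occur in more than one state
shared_names <- toy_forecasts %>%
  distinct(city, state) %>%
  count(city, name = "n_states") %>%
  filter(n_states > 1)

# Among those, keep names whose observations are identical in every state
suspect_cities <- toy_forecasts %>%
  semi_join(shared_names, by = "city") %>%
  group_by(city, date, high_or_low) %>%
  summarize(
    n_states = n_distinct(state),
    n_distinct_obs = n_distinct(observed_temp, observed_precip),
    .groups = "drop"
  ) %>%
  filter(n_distinct_obs == 1)

suspect_cities
```

On the real data, a single matching day would not be strong evidence, so it would likely make sense to require identical observations across many dates before treating a city as affected.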

Calculating Forecast Error

weather_data <- weather_data %>%
  mutate(forecast_error = abs(forecast_temp - observed_temp))

Visualizing Forecast Errors

ggplot(weather_data, aes(x = forecast_error)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(
    title = "Distribution of Forecast Error", 
    x = "Forecast Error (°F)", 
    y = "Frequency")+
  scale_x_continuous(breaks = seq(0, 100, by = 10))

We can see that, although forecast errors span a wide range, the vast majority of forecasts are off by less than 10 °F.
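
The "vast majority" reading of the histogram could be backed by a number. The sketch below shows one way to do that, assuming a forecast_error column like the one computed earlier; toy_errors and its values are illustrative only and are not drawn from the data set.

```r
library(dplyr)

# Illustrative absolute errors in °F (not real data)
toy_errors <- tibble::tibble(
  forecast_error = c(0, 1, 2, 2, 3, 4, 5, 7, 12, 25)
)

error_summary <- toy_errors %>%
  summarize(
    # Share of forecasts that miss by less than 10 °F
    share_under_10 = mean(forecast_error < 10),
    median_error   = median(forecast_error),
    # unname() drops the "95%" label that quantile() attaches
    p95_error      = unname(quantile(forecast_error, 0.95))
  )

error_summary
```

Replacing toy_errors with weather_data would give the corresponding figures for the actual forecasts.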

Analyzing Accuracy by Factors

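# These attributes come from forecast_cities and are fixed for each city, so
# grouping by them gives (in effect) one mean error per city
accuracy_factors <- weather_data %>%
  group_by(koppen, elevation, distance_to_coast, wind) %>%
  summarize(mean_error = mean(forecast_error, na.rm = TRUE), .groups = "drop")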

Mean Error vs Elevation

ggplot(accuracy_factors, aes(x = elevation, y = mean_error)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Elevation",
    y = "Mean Error in Temperature"
  ) +
  theme_minimal()

We see that there is a positive correlation between mean forecast error and city elevation.
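
The direction of this relationship can also be checked numerically rather than read off the plot. The following is a rough sketch using base R on made-up values; toy_factors stands in for accuracy_factors, and the numbers are chosen only so that error rises with elevation.

```r
# Illustrative per-city values (not real data)
toy_factors <- data.frame(
  elevation  = c(5, 50, 200, 800, 1600, 2100),
  mean_error = c(1.8, 2.0, 2.1, 2.6, 3.0, 3.4)
)

# Pearson correlation between elevation and mean error
elevation_cor <- cor(toy_factors$elevation, toy_factors$mean_error,
                     use = "complete.obs")

# The same straight-line fit that geom_smooth(method = "lm") draws
elevation_fit <- lm(mean_error ~ elevation, data = toy_factors)
elevation_slope <- coef(elevation_fit)[["elevation"]]

elevation_cor
elevation_slope
```

A positive coefficient and slope on the real accuracy_factors table would support the statement above; cor.test() could be used in the same way if a p-value is wanted.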

Mean Error vs Köppen Classification

accuracy_factors %>%
  group_by(koppen) %>%
  filter(n() >= 3) %>% # Keep koppen values with at least 3 occurrences
  ggplot(aes(x = koppen, y = mean_error)) +
  geom_violin() +
  labs(
    x = "Köppen Classification",
    y = "Mean Error"
  ) +
  theme_minimal()

We see that, among the Köppen climate classifications shown, Am has the lowest mean error and Dfc the highest. This makes sense because we would expect warm, humid climates to have lower error: high humidity and surrounding water regulate the temperatures. In contrast, the high error in the subarctic Dfc climate may be due to the mix of gyres in the Arctic Circle. Note: we removed Af, As, Aw, BSh, BWh, BWk, and Csa due to low sample sizes.
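
The ranking described above can be tabulated alongside the violin plot. Here is a minimal sketch, assuming one row per city with koppen and mean_error columns as in accuracy_factors; toy_city_errors and its values are invented for illustration.

```r
library(dplyr)

# Illustrative per-city mean errors (not real data)
toy_city_errors <- tibble::tribble(
  ~koppen, ~mean_error,
  "Am",    1.2,
  "Am",    1.4,
  "Am",    1.3,
  "Dfc",   3.1,
  "Dfc",   3.5,
  "Dfc",   2.9,
  "BSh",   2.0
)

koppen_summary <- toy_city_errors %>%
  group_by(koppen) %>%
  summarize(
    n_cities   = n(),
    mean_error = mean(mean_error),
    .groups = "drop"
  ) %>%
  # Same threshold as the plot: drop classes with fewer than 3 cities
  filter(n_cities >= 3) %>%
  arrange(mean_error)

koppen_summary
```

Keeping n_cities in the output makes it easy to see how much data sits behind each class mean.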

Mean Error vs Distance to Coast

ggplot(accuracy_factors, aes(x = distance_to_coast, y = mean_error)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Distance to Coast",
    y = "Mean Error in Temperature"
  ) +
  theme_minimal()

We see that there is a positive correlation between error and distance to coast. This makes sense because the high heat capacity of water lets oceans hold nearby temperatures steady, which makes coastal temperatures easier to predict.

Mean Error vs Wind Speed

ggplot(accuracy_factors, aes(x = wind, y = mean_error)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Wind Speed",
    y = "Mean Error in Temperature"
  ) +
  theme_minimal()

We see that there is a positive correlation between error and wind speed. This makes sense because high wind speeds can lead to more rapid changes in temperature, which makes temperature harder to predict accurately.
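
Each plot above looks at one factor at a time. If elevation, distance to coast, and wind speed are related to one another (an assumption, not something checked in this report), a joint linear model is one possible way to look at them together. The sketch below uses made-up predictor values, and mean_error is deliberately built from known coefficients so that the fit can be verified; toy_factors and joint_fit are hypothetical names.

```r
# Illustrative per-city attributes (not real data)
toy_factors <- data.frame(
  elevation         = c(10, 50, 300, 900, 1500, 2000, 100, 1200),
  distance_to_coast = c(2, 150, 30, 400, 700, 900, 10, 250),
  wind              = c(8, 6, 10, 7, 9, 12, 5, 11)
)

# Construct mean_error from known coefficients (no noise) so that lm()
# should recover them; real data would of course not fit exactly
toy_factors$mean_error <- with(
  toy_factors,
  1.5 + 0.0005 * elevation + 0.001 * distance_to_coast + 0.05 * wind
)

# Fit all three factors at once
joint_fit <- lm(mean_error ~ elevation + distance_to_coast + wind,
                data = toy_factors)

coef(joint_fit)
```

On the real accuracy_factors table, each coefficient would estimate the association with one factor while holding the other two fixed, which is a different question from the single-factor trend lines.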

Conclusion

In this report, we analyzed the accuracy of weather forecasts across the United States and identified data quality issues affecting the data set. Forecast error was found to be associated with elevation, climate type, distance to the coast, and wind speed. Future improvements could focus on ensuring consistent city attributes and further refining forecasts for cities with high error rates.