library(tidyverse)
library(lubridate)
library(here)
library(patchwork)
library(plotly)
library(ggthemes)
library(gt)
This report analyzes the accuracy of weather forecasts in the United States, focusing on temperature predictions across 167 cities over a 16-month period. The analysis examines forecast errors for high and low temperatures, explores potential factors affecting forecast accuracy, and identifies data quality issues in the dataset.
forecast_cities <- read_csv(here("data","forecast_cities.csv"))
outlook_meanings <- read_csv(here("data","outlook_meanings.csv"))
weather_forecasts <- read_csv(here("data","weather_forecasts.csv"))
weather_forecasts <- weather_forecasts %>%
  mutate(date = as.Date(date),
         forecast_outlook = as.character(forecast_outlook)) %>%
  left_join(outlook_meanings, by = "forecast_outlook")
weather_data <- weather_forecasts %>%
  left_join(forecast_cities, by = c("city", "state")) %>%
  filter(!is.na(observed_temp) & !is.na(forecast_temp))
Below is a map of all the cities included in the data set. Note that this does not include cities in Hawaii, Alaska, Puerto Rico, or the Virgin Islands.
us_map <- map_data("state")
forecast_cities %>%
  filter(!state %in% c("HI", "AK", "PR", "VI")) %>%
  ggplot() +
  # Reuse the state outlines already loaded into us_map
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group),
               fill = "lightgray", color = "white") +
  geom_point(aes(x = lon, y = lat), color = "blue", size = 2) +
  theme_map() +
  labs(title = "Map of US Cities Included",
       x = "Longitude", y = "Latitude")
The table below lists the Köppen climate classifications that appear in the data, with an example city and the number of cities for each.
koppen_labels <- tribble(
  ~"koppen", ~"climate_description", ~"example_city",
  "Af", "Tropical Rainforest (no dry season)", "West Palm Beach, FL",
  "Am", "Tropical Monsoon (short dry season)", "San Juan, PR",
  "As", "Tropical Savanna (dry summer)", "Honolulu, HI",
  "Aw", "Tropical Savanna (dry winter)", "Key West, FL",
  "BSh", "Hot Semi-Arid (steppe, hot)", "Tucson, AZ",
  "BSk", "Cold Semi-Arid (steppe, cold)", "Denver, CO",
  "BWh", "Hot Desert (arid, very hot)", "Phoenix, AZ",
  "BWk", "Cold Desert (arid, cold)", "Las Vegas, NV",
  "Cfa", "Humid Subtropical (hot summers, no dry season)", "Atlanta, GA",
  "Cfb", "Temperate Oceanic (warm summers, no dry season)", "Santa Fe, NM",
  "Cfc", "Subpolar Oceanic (cool summers, no dry season)", "Old Harbor, AK",
  "Csa", "Mediterranean (hot dry summers)", "Sacramento, CA",
  "Csb", "Mediterranean (warm dry summers)", "Los Angeles, CA",
  "Dfa", "Humid Continental (hot summers, no dry season)", "Chicago, IL",
  "Dfb", "Humid Continental (warm summers, no dry season)", "Milwaukee, WI",
  "Dfc", "Subarctic (cool summers, severe winters)", "Anchorage, AK",
)
koppen_labels <- koppen_labels %>%
mutate(
climate_zone = case_when(
startsWith(koppen, "A") ~ "Tropical",
startsWith(koppen, "B") ~ "Dry",
startsWith(koppen, "C") ~ "Temperate",
startsWith(koppen, "D") ~ "Continental"
)
)
forecast_cities %>%
group_by(koppen) %>%
left_join(koppen_labels, by = "koppen") %>%
summarize(
"Climate Description" = first(climate_description),
"Example City" = first(example_city),
"Number of Cities" = n(),
climate_zone = first(climate_zone)
) %>%
gt(rowname_col = "koppen", groupname_col = "climate_zone", row_group_as_column = TRUE)
| Climate Zone | Köppen | Climate Description | Example City | Number of Cities |
|---|---|---|---|---|
| Tropical | Af | Tropical Rainforest (no dry season) | West Palm Beach, FL | 4 |
| | Am | Tropical Monsoon (short dry season) | San Juan, PR | 9 |
| | As | Tropical Savanna (dry summer) | Honolulu, HI | 1 |
| | Aw | Tropical Savanna (dry winter) | Key West, FL | 1 |
| Dry | BSh | Hot Semi-Arid (steppe, hot) | Tucson, AZ | 1 |
| | BSk | Cold Semi-Arid (steppe, cold) | Denver, CO | 24 |
| | BWh | Hot Desert (arid, very hot) | Phoenix, AZ | 2 |
| | BWk | Cold Desert (arid, cold) | Las Vegas, NV | 4 |
| Temperate | Cfa | Humid Subtropical (hot summers, no dry season) | Atlanta, GA | 89 |
| | Cfb | Temperate Oceanic (warm summers, no dry season) | Santa Fe, NM | 6 |
| | Cfc | Subpolar Oceanic (cool summers, no dry season) | Old Harbor, AK | 1 |
| | Csa | Mediterranean (hot dry summers) | Sacramento, CA | 2 |
| | Csb | Mediterranean (warm dry summers) | Los Angeles, CA | 15 |
| Continental | Dfa | Humid Continental (hot summers, no dry season) | Chicago, IL | 27 |
| | Dfb | Humid Continental (warm summers, no dry season) | Milwaukee, WI | 45 |
| | Dfc | Subarctic (cool summers, severe winters) | Anchorage, AK | 5 |
The dataset contains multiple cities with the same name, such as ‘Richmond’ and ‘Buffalo’, across different states (e.g., Richmond, VA; Richmond, CA; Richmond, RI). During the join operation, the data appears to have been matched solely by city name without considering the state, resulting in duplicated and inaccurate entries for each instance of these cities. Note that the observed temperature and precipitation values are identical across all states that share a city name.
weather_forecasts %>%
  filter(date == ymd(20210130),
         high_or_low == "high",
         forecast_hours_before == 48,
         city %in% c("BUFFALO", "RICHMOND")) %>%
  # select() keeps one row per forecast; summarize() is meant to return one row per group
  select(city, state, date, "high or low" = high_or_low, "hours before" = forecast_hours_before, "temperature" = observed_temp, "precipitation" = observed_precip) %>%
  gt(groupname_col = "city", rowname_col = "state", row_group_as_column = TRUE)
| City | State | date | high or low | hours before | temperature | precipitation |
|---|---|---|---|---|---|---|
| BUFFALO | NY | 2021-01-30 | high | 48 | 28 | 0.00 |
| | WY | 2021-01-30 | high | 48 | 28 | 0.00 |
| RICHMOND | VA | 2021-01-30 | high | 48 | 40 | 0.08 |
| | CA | 2021-01-30 | high | 48 | 40 | 0.08 |
| | RI | 2021-01-30 | high | 48 | 40 | 0.08 |
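The same check can be run across the whole dataset rather than for two hand-picked names. The sketch below is one possible approach, not part of the original analysis: `find_shared_observations()` and the `toy` tibble are hypothetical names and made-up values used only for illustration, and the column names are assumed to match those in `weather_forecasts`.

```r
library(dplyr)

# Hypothetical helper: flag city names that occur in more than one state
# and carry identical observed values on the same date.
find_shared_observations <- function(forecasts) {
  forecasts %>%
    distinct(city, state, date, high_or_low, observed_temp, observed_precip) %>%
    group_by(city, date, high_or_low) %>%
    summarize(
      n_states = n_distinct(state),
      # number of distinct (temperature, precipitation) pairs across states
      n_distinct_obs = n_distinct(observed_temp, observed_precip),
      .groups = "drop"
    ) %>%
    filter(n_states > 1, n_distinct_obs == 1)
}

# Toy data, invented for illustration (not taken from the dataset)
toy <- tibble(
  city = c("BUFFALO", "BUFFALO", "DENVER"),
  state = c("NY", "WY", "CO"),
  date = as.Date("2021-01-30"),
  high_or_low = "high",
  observed_temp = c(28, 28, 45),
  observed_precip = c(0, 0, 0)
)

shared <- find_shared_observations(toy)
shared
```

Applied to the real data, `find_shared_observations(weather_forecasts)` would list every city name, date, and high/low combination affected, which could then be excluded or corrected before computing errors.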
weather_data <- weather_data %>%
mutate(forecast_error = abs(forecast_temp - observed_temp))
ggplot(weather_data, aes(x = forecast_error)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
theme_minimal() +
labs(
title = "Distribution of Forecast Error",
x = "Forecast Error (°F)",
y = "Frequency")+
scale_x_continuous(breaks = seq(0, 100, by = 10))
We can see that, although forecast errors span a wide range, the vast majority of forecasts are off by less than 10 °F.
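To put a number on "the vast majority", the share of forecasts under a given error threshold can be computed directly. This is a minimal sketch: `share_within()` and `toy_errors` are hypothetical names, and the values are invented rather than taken from the dataset.

```r
# Hypothetical helper: fraction of non-missing absolute errors below a threshold
share_within <- function(errors, threshold = 10) {
  mean(errors < threshold, na.rm = TRUE)
}

# Toy absolute errors in °F (made up for illustration)
toy_errors <- c(0, 2, 5, 12, NA, 30)

# 3 of the 5 non-missing values are below 10
share <- share_within(toy_errors)
share
# On the report's data this would be: share_within(weather_data$forecast_error)
```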
accuracy_factors <- weather_data %>%
group_by(koppen, elevation, distance_to_coast, wind) %>%
summarize(mean_error = mean(forecast_error, na.rm = TRUE), .groups = "drop")
ggplot(accuracy_factors, aes(x = elevation, y = mean_error)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Elevation",
y = "Mean Error in Temperature"
) +
theme_minimal()
We see that there is a positive correlation between error and elevation of the cities.
accuracy_factors %>%
group_by(koppen) %>%
filter(n() >= 3) %>% # Keep koppen values with at least 3 occurrences
ggplot(aes(x = koppen, y = mean_error)) +
geom_violin() +
labs(
x = "Koppen Classification",
y = "Mean Error"
) +
theme_minimal()
We see that Am has the lowest mean error and Dfc the highest across the Köppen climate classifications. This is plausible: we would expect warm, humid climates to have lower error because humidity and surrounding water moderate temperatures. In contrast, the high error in the subarctic Dfc climates may be due to the mix of gyres near the Arctic Circle. Note: we removed Af, As, Aw, BSh, BWh, BWk, and Csa due to low sample sizes.
ggplot(accuracy_factors, aes(x = distance_to_coast, y = mean_error)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Distance to Coast",
y = "Mean Error in Temperature"
) +
theme_minimal()
We see that there is a positive correlation between error and distance to coast. This makes sense because water has a high heat capacity, so oceans keep nearby temperatures more stable and therefore easier to forecast.
ggplot(accuracy_factors, aes(x = wind, y = mean_error)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Wind Speed",
y = "Mean Error in Temperature"
) +
theme_minimal()
We see that there is a positive correlation between error and wind speed. This makes sense because high wind speeds can lead to more rapid changes in temperature, making it harder to predict accurately.
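The scatterplots suggest the direction of each relationship but not its strength. One way to quantify it is a Pearson correlation between mean error and each numeric factor. The sketch below assumes a data frame shaped like `accuracy_factors`; `factor_correlations()` and `toy_factors` are hypothetical names, and the toy values are invented for illustration.

```r
library(dplyr)

# Hypothetical helper: Pearson correlation of mean_error with each numeric factor
factor_correlations <- function(df) {
  df %>%
    summarize(across(
      c(elevation, distance_to_coast, wind),
      ~ cor(.x, mean_error, use = "complete.obs")
    ))
}

# Toy data (made up, not taken from the dataset)
toy_factors <- tibble(
  elevation = c(10, 200, 1500, 2500),
  distance_to_coast = c(5, 100, 400, 800),
  wind = c(3, 6, 4, 8),
  mean_error = c(1.5, 2.0, 2.8, 3.5)
)

correlations <- factor_correlations(toy_factors)
correlations
```

On the report's data, `factor_correlations(accuracy_factors)` would return one coefficient per factor. Values near zero would indicate that a fitted line with a positive slope reflects only a weak association.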
In this report, we analyzed the accuracy of weather forecasts across the United States and identified data quality issues affecting the dataset. Forecast error was associated with factors such as elevation, climate type, distance to the coast, and wind speed. Future improvements could focus on ensuring consistent city attributes and further refining forecasts for cities with high error rates.