The data used in this analysis came from the National Weather Service and includes sixteen months of weather forecasts and observations from 167 cities. Additionally, a separate data set provides geographical and climatic details about these cities, and other American cities. The analysis utilizes data from the following three files: forecast_cities.csv (city-specific information), outlook_meanings.csv (forecast outlook reference codes), weather_forecasts.csv (forecasted and observed weather data). The outlook_meanings.csv file serves as a reference for interpreting forecast outlook codes used in the weather dataset. These three datasets were joined together using dplyr join functions, and temperature error (temp_error) was computed for each observation to quantify forecast accuracy.
library(dplyr)
library(readr)
library(ggplot2)
library(tidyverse)
library(viridis)
weather_forecasts <- read_csv("data/weather_forecasts.csv")
forecast_cities <- read_csv("data/forecast_cities.csv")
outlook_meanings <- read_csv("data/outlook_meanings.csv")
weather_analysis <- weather_forecasts |>
left_join(forecast_cities, by = c("city", "state")) |>
left_join(outlook_meanings, by = "forecast_outlook") |>
mutate(temp_error = forecast_temp - observed_temp) |> # Calculate error
select(temp_error, forecast_hours_before, high_or_low,
lon, lat, elevation, distance_to_coast,
elevation_change_eight, wind, avg_annual_precip, meaning) |>
relocate(temp_error) #new element for the new element requirement
library(maps)
library(dplyr)
library(ggplot2)
# Get state boundaries
states <- map_data("state")
# Function to create map of forecast errors
create_error_plot <- function(data, temp_type) {
# Filter data for temperature type
plot_data <- data %>%
filter(high_or_low == temp_type)
max_error <- max(abs(plot_data$temp_error), na.rm = TRUE)
ggplot() +
geom_polygon(data = states,
aes(x = long, y = lat, group = group),
fill = NA, color = "black", size = 0.5) +
geom_point(data = plot_data,
aes(x = lon, y = lat,
color = abs(temp_error),
size = abs(temp_error)),
alpha = 0.6,
shape = 21,
fill = NA) +
# Color scale with limits
scale_color_gradientn(
colors = c("lightgreen", "yellow", "darkred"),
name = "Forecast Error (°F)",
breaks = seq(0, max_error, length.out = 5),
labels = scales::label_number(accuracy = 1),
guide = guide_colorbar(
direction = "vertical",
title.position = "top",
barwidth = 1,
barheight = 8,
ticks = TRUE
)
) +
scale_size_continuous(
range = c(1, 4),
name = "Forecast Error (°F)",
breaks = seq(0, max_error, length.out = 5),
labels = scales::label_number(accuracy = 1)
) +
coord_map() +
labs(title = sprintf("Geographic Distribution of %s Temperature Forecast Errors",
ifelse(temp_type == "high", "High", "Low")),
subtitle = "(Larger points indicate larger errors)",
x = "Longitude",
y = "Latitude") +
theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold"),
plot.subtitle = element_text(size = 10),
legend.position = "right",
legend.box = "vertical",
legend.margin = margin(6, 6, 6, 6),
legend.title = element_text(face = "bold"),
)
}
low_plot <- create_error_plot(weather_analysis, "low")
high_plot <- create_error_plot(weather_analysis, "high")
print(low_plot)
print(high_plot)
Higher forecast errors are more prevalent in the eastern United States, particularly in the Midwest and Texas, where certain locations have noticeably large discrepancies. The geographic distribution of low and high temperature forecast errors is highly similar, suggesting that errors in low-temperature predictions and high-temperature predictions tend to be similar among each forecast.
# Compared all data to high-error locations
error_analysis <- weather_analysis %>%
mutate(error_category = ifelse(
abs(temp_error) > quantile(abs(temp_error), 0.9, na.rm = TRUE),
"high_error", "normal_error"
)) %>%
group_by(high_or_low, error_category) %>%
summarise(
avg_elevation = mean(elevation, na.rm = TRUE),
avg_distance_coast = mean(distance_to_coast, na.rm = TRUE),
avg_elevation_change_eight = mean(elevation_change_eight, na.rm = TRUE),
avg_wind = mean(wind, na.rm = TRUE),
avg_annual_precip = mean(avg_annual_precip, na.rm = TRUE),
n = n(),
.groups = "drop"
) %>%
arrange(high_or_low, error_category)
print(as.data.frame(error_analysis))
## high_or_low error_category avg_elevation avg_distance_coast
## 1 high high_error 385.8698 383.8344
## 2 high normal_error 345.0003 315.6293
## 3 high <NA> 345.1939 317.5408
## 4 low high_error 537.1471 399.7696
## 5 low normal_error 331.1456 313.4344
## 6 low <NA> 345.1362 317.5086
## avg_elevation_change_eight avg_wind avg_annual_precip n
## 1 148.8429 3.468456 37.85841 20429
## 2 142.5122 3.391335 39.19201 267997
## 3 142.7964 3.389961 39.03318 37558
## 4 194.7854 3.413888 32.74012 23540
## 5 138.3727 3.395337 39.66565 266039
## 6 142.8076 3.389660 39.03367 36405
To better understand the factors influencing forecast errors, the top 10% of temperature forecast errors were compared to the rest of the dataset. The first summary table indicates that, in high-error cases, low-temperature forecasts are associated with higher average elevation and greater terrain complexity than high-temperature forecasts. However, when considering the overall dataset, locations with forecast errors tend to have lower elevation and lower terrain complexity compared to those in the top 10% of errors. Average wind speed was nearly identical between the top 10% of forecast errors and the full dataset for both high and low temperatures. Average annual precipitation was slightly lower in high-temperature forecast errors within the top 10% of errors.
#graph 1
ggplot(weather_analysis,
aes(x = factor(forecast_hours_before),
y = abs(temp_error),
fill = high_or_low)) +
geom_boxplot(
coef = 1) + # Reduce whisker length
coord_cartesian(y = c(0, 10)) + # Zoom in on boxes
scale_fill_viridis_d() +
labs(title = "Forecast Error by Hours Before Observation",
x = "Hours Before Observation",
y = "Absolute Temperature Error (°F)",
fill = "Temperature Type") +
theme_minimal()
To determine whether forecast lead time (the number of hours before observation) influenced error magnitude, box plots were created comparing absolute temperature error across different forecast lead times (12, 24, 36, and 48 hours before observation). Although the interquartile range was significantly higher for low-temperature forecasts at 36 and 48 hours before observation, the median error remained the same across all lead times. This indicates that forecast lead time does not significantly impact absolute temperature error within the relatively short timeframes analyzed. This matches expectations, as short-term weather forecasts (within 48 hours) are typically stable in accuracy. However, extending the analysis to forecasts made further in advance may reveal a stronger relationship between lead time and error.
# Calculate medians for each meaning and temperature type
error_summary <- weather_analysis %>%
group_by(meaning, high_or_low) %>%
summarise(
median_error = median(abs(temp_error), na.rm = TRUE),
mean_error = mean(abs(temp_error), na.rm = TRUE),
.groups = 'drop'
)
# Create bar plot with medians
ggplot(error_summary,
aes(x = factor(meaning),
y = median_error,
fill = high_or_low)) +
geom_col(position = "dodge", width = 0.8) +
scale_fill_manual(values = c("high" = "#5B4B8A", "low" = "#92B4EC"),
labels = c("High Temperature", "Low Temperature")) +
labs(title = "Median Forecast Error by Outlook",
x = "Outlook",
y = "Median Absolute Temperature Error (°F)",
fill = "Temperature Type") +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.minor = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x labels for better readability
)
The median absolute temperature error was generally around 2°F. However, high-temperature forecasts exhibited the largest errors (6°F) when the forecast outlook was missing. This suggests that unusual or complex weather conditions, which were not categorized under a specific outlook, posed greater forecasting challenges. High-temperature predictions were also more difficult during drizzle and fog, with median forecast errors around 3°F. Low-temperature predictions had higher errors during conditions such as blowing snow, dust storms, freezing drizzle, heavy cloud cover, and strong winds, with median errors around 3°F. These findings suggest that certain weather conditions may increase error rates among high and low temperatures.
# Function to calculate Spearman correlations by group
get_correlations <- function(data, x_var) {
# For high temperatures
high_data <- data[data$high_or_low == "high", ]
high_spearman <- cor.test(
high_data[[x_var]],
abs(high_data$temp_error),
method = "spearman"
)
# For low temperatures
low_data <- data[data$high_or_low == "low", ]
low_spearman <- cor.test(
low_data[[x_var]],
abs(low_data$temp_error),
method = "spearman"
)
return(list(
high = list(
spearman = high_spearman
),
low = list(
spearman = low_spearman
)
))
}
# Calculate correlations for each variable
variables <- c("elevation", "distance_to_coast", "elevation_change_eight", "wind", "avg_annual_precip")
names(variables) <- c("Elevation", "Distance to Coast", "Elevation Change", "wind", "avg_annual_precip")
# Summary data frame
results <- data.frame(
Variable = character(),
Temp_Type = character(),
Spearman_Rho = numeric(),
Spearman_P = numeric(),
stringsAsFactors = FALSE
)
for(var in variables) {
corr <- get_correlations(weather_analysis, var)
# High temperature results
results <- rbind(results, data.frame(
Variable = names(variables)[variables == var],
Temp_Type = "High",
Spearman_Rho = corr$high$spearman$estimate,
Spearman_P = corr$high$spearman$p.value
))
# Low temperature results
results <- rbind(results, data.frame(
Variable = names(variables)[variables == var],
Temp_Type = "Low",
Spearman_Rho = corr$low$spearman$estimate,
Spearman_P = corr$low$spearman$p.value
))
}
# Format p-values and correlations
results$Spearman_Rho <- round(results$Spearman_Rho, 3)
results$Spearman_P <- format.pval(results$Spearman_P, digits = 3)
print(results)
## Variable Temp_Type Spearman_Rho Spearman_P
## rho Elevation High 0.049 < 2e-16
## rho1 Elevation Low 0.106 < 2e-16
## rho2 Distance to Coast High 0.080 < 2e-16
## rho3 Distance to Coast Low 0.077 < 2e-16
## rho4 Elevation Change High 0.024 < 2e-16
## rho5 Elevation Change Low 0.091 < 2e-16
## rho6 wind High 0.047 < 2e-16
## rho7 wind Low 0.013 2.24e-12
## rho8 avg_annual_precip High -0.013 6.05e-12
## rho9 avg_annual_precip Low -0.113 < 2e-16
A Spearman correlation test was conducted to assess the influence of various geographical and meteorological factors on absolute temperature error. The results indicate that all correlations were statistically significant (p < 0.05), but their magnitudes were weak. Elevation and average annual precipitation exhibited the strongest correlation with low-temperature forecast errors (ρ = 0.106 and ρ = -0.113 respectively). Distance to the coast, elevation change, and wind speed had even weaker effects on temperature forecast errors. While all correlations were statistically significant, their small magnitudes indicate that forecast errors are likely driven by different factors instead of geographic features.
# ANOVA for high temperature errors
high_temp_anova <- aov(abs(temp_error) ~ meaning,
data = subset(weather_analysis, high_or_low == "high"))
# ANOVA for low temperature errors
low_temp_anova <- aov(abs(temp_error) ~ meaning,
data = subset(weather_analysis, high_or_low == "low"))
# Get summary of results
summary(high_temp_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## meaning 22 32770 1489.5 340.7 <2e-16 ***
## Residuals 288125 1259523 4.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 37836 observations deleted due to missingness
summary(low_temp_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## meaning 22 27535 1251.6 276.7 <2e-16 ***
## Residuals 289275 1308498 4.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 36686 observations deleted due to missingness
An ANOVA (Analysis of Variance) test was conducted to determine whether weather outlook (forecast conditions) significantly affects forecast errors. The results show that weather outlook has a highly significant impact, with p-values < 2e-16 and large F-values (340.7 and 276.7), confirming that forecast errors vary significantly depending on weather conditions.
However, despite this strong statistical significance, weather outlook alone does not explain most of the forecast error variance. The residual variance remains high (Sum of Squares: 32,770 vs. 1,259,523 and 27,535 vs. 1,308,498), indicating that other factors are contributing to forecast inaccuracies. Additionally, 37,836 and 36,686 observations were removed due to missing data, suggesting that certain weather conditions may have more incomplete records, which could affect the overall analysis.
This analysis examined temperature forecast errors across 167 U.S. cities, and considered elevation, distance to coast, elevation change (eight points), wind, average annual precipitation, and outlook. The results indicate that while elevation, distance to coast, elevation change, wind,and average annual precipitation, slightly impact low-temperature forecast errors, their effects are weak. Weather outlook plays a statistically significant role in error magnitude, particularly during complex weather conditions, but does not fully explain the observed variations. The time before observation (12-48 hours) does not significantly influence error rates, suggesting that short-term forecasts maintain consistent accuracy. Overall, the findings suggest that temperature forecast errors arise from a combination of factors related to outlook and other measures, instead of geographic features or wind speed.