install.packages("ggpmisc")
library(readr)
library(tidyr)
library(ggplot2)
library(tidyverse)
library(viridis)
library(maps)
library(mapproj)
library(ggthemes)
library(plotly)
library(stringr)
library(forcats)
library(ggpmisc) ## See scatterplots for information on this new element
The data used in this analysis comes from the National Weather Service and includes weather forecast data for 167 cities over a period of 16 months. This data was initially in three separate data sets and elements have been joined from different data sets to create interesting comparisons. The goal of this project is to investigate which areas in the U.S. struggle with weather predictions and explore possible reasons for forecast errors.
weather_forecasts <- read_csv("data/weather_forecasts.csv")
forecast_cities <- read_csv("data/forecast_cities.csv")
outlook_meanings <- read_csv("data/outlook_meanings.csv")
weather_and_cities <- weather_forecasts %>%
left_join(forecast_cities, by = c("city", "state")) %>%
filter(!state %in% c("AK", "HI", "VI", "PR", "DC")) %>%
mutate(forecast_error = abs(forecast_temp - observed_temp)) %>%
filter(!is.na(forecast_error))
sum(is.na(weather_and_cities))
## [1] 2856
To tidy the data I created a new variable called ‘weather_and_cities’ in which I have joined the ‘weather_forecasts’ data with the ‘forecast_cities’ data so that I can make plots using variables from both data sets easily. I have also chosen to filter out territories and states that are not part of the continental US so that as I visualize the data down the line it is easier to see information about my area of interest without the map spreading too far. Finally, in this chunk I have also chosen to create a new column called ‘forecast_error’ which was created by taking the absolute value of the difference between the forecasted and observed temperatures to find how far off the predictions were.
I then created a base map off of which I could alter a few variables of interest. This base map is a chloropleth map showing the continental united states with each state filled by a color representing the average error in forecast across all cities in the state present in the data. It is important to note that not all states have equal numbers of cities represented and this may skew the data.
state_avg_error <- weather_and_cities %>%
group_by(state) %>%
summarize(avg_error = mean(forecast_error, na.rm = TRUE))
states_map <- map_data("state") %>%
filter(!region %in% c("alaska", "hawaii"))
state_avg_error$state <- tolower(state_avg_error$state)
state_avg_error <- state_avg_error %>%
left_join(
tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
))
ggplot(state_avg_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E") +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
From the base map I created three different piece-wise iterations of the map in which I sliced the US up by different variables. This allowed me to take initial guesses at which way of dividing the states would group them best by shared error amounts. Finally, I created scatter plots with trend lines shown on the graphs to check correlations between error and the the two variables I found most interesting: elevation and precipitation.
The base choropleth map reveals that certain regions, especially in the central US, exhibit higher forecast errors. Montana stands out as having the highest overall forecast error of all the continental United States. Montana, the Dakotas and other nearby central states being higher in overall forecasting error suggests that forecast accuracy may be affected by geographic and environmental factors such as elevation and proximity to the coast. Three such factors were tested in following chloropleth maps to see if those patterns persist.
All included states had some high temperature error and some low temperature error. The data show that many states farther south in the US have lower high temperature error, and many northern and central states have higher high temperature error and much higher low temperature error. The East Coast has roughly the same error for both high and low.
high_temp_error <- weather_and_cities %>%
filter(high_or_low == "high") %>%
group_by(state) %>%
summarize(avg_high_error = mean(forecast_error, na.rm = TRUE))
low_temp_error <- weather_and_cities %>%
filter(high_or_low == "low") %>%
group_by(state) %>%
summarize(avg_low_error = mean(forecast_error, na.rm = TRUE))
state_name_mapping <- tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
)
high_temp_error$state <- tolower(high_temp_error$state)
low_temp_error$state <- tolower(low_temp_error$state)
high_temp_error <- high_temp_error %>%
left_join(state_name_mapping, by = "state")
low_temp_error <- low_temp_error %>%
left_join(state_name_mapping, by = "state")
library(patchwork)
p1 <- ggplot(high_temp_error) +
geom_map(aes(map_id = name, fill = avg_high_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average High Temperature Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p2 <- ggplot(low_temp_error) +
geom_map(aes(map_id = name, fill = avg_low_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Low Temperature Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p1 / p2
To examine elevation differences I have carved the continental US into higher average elevation regions (the Central United States) and lower elevation regions (the West Coast, Midwest and East Coast). When isolated, the trend in the differences in error in the Central US are emphasized. Elevation seemed to group states best by their similar error types, and this theory that the correlation is stronger here than with other variables is shown in supplemental scatter plot graphs with trendlines showing slope of correlation.
state_elev_error <- weather_and_cities %>%
group_by(state) %>%
summarize(
avg_error = mean(forecast_error, na.rm = TRUE),
avg_elevation = mean(elevation, na.rm = TRUE)
)
high_elev_error <- state_elev_error %>%
filter(avg_elevation >= 1000)
low_elev_error <- state_elev_error %>%
filter(avg_elevation < 1000)
state_name_mapping <- tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
)
high_elev_error$state <- tolower(high_elev_error$state)
low_elev_error$state <- tolower(low_elev_error$state)
high_elev_error <- high_elev_error %>%
left_join(state_name_mapping, by = "state")
low_elev_error <- low_elev_error %>%
left_join(state_name_mapping, by = "state")
p1 <- ggplot(high_elev_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75), ) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average High Elevation (m) Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p2 <- ggplot(low_elev_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Low Elevation (m) Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p1 / p2
ggplot(state_elev_error, aes(x = avg_elevation, y = avg_error)) +
geom_point(color = "black", size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
stat_poly_eq(aes(label = after_stat(eq.label)), ## this uses ggpmisc, a new install I used to help display the equation of the linear regression line (with the formula y ~ x) on the scatter plots. This allowed me to visualize the relationship between forecast error and factors like elevation or precipitation. This helps to convey the linear model equation directly on the plot for easy interpretation.
formula = y ~ x,
parse = TRUE,
color = "black",
size = 5,
vjust = 2,
hjust = -2) +
labs(
title = "Forecast Error vs Elevation by State",
x = "Average Elevation (m)",
y = "Average Forecast Error (°F)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12)
)
To further investigate which ways we can slice up the states to try and group error amounts together, precipitation was plotted across the map with high precipitation being above the average rainfall in the entire US per year (30 inches) and low rainfall being below. This trend is somewhat similar, with Montana dn the central states being mostly together at lower precipitation and the entire East Coast having high precipitation and generally lower error.
state_precip_error <- weather_and_cities %>%
group_by(state) %>%
summarize(
avg_error = mean(forecast_error, na.rm = TRUE),
avg_precip = mean(avg_annual_precip, na.rm = TRUE)
)
high_precip_error <- state_precip_error %>%
filter(avg_precip > 30)
low_precip_error <- state_precip_error %>%
filter(avg_precip < 30)
state_name_mapping <- tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
)
high_precip_error$state <- tolower(high_precip_error$state)
low_precip_error$state <- tolower(low_precip_error$state)
high_precip_error <- high_precip_error %>%
left_join(state_name_mapping, by = "state")
low_precip_error <- low_precip_error %>%
left_join(state_name_mapping, by = "state")
p1 <- ggplot(high_precip_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average High Precipitation Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p2 <- ggplot(low_precip_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Low Precipitation Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p1 / p2
ggplot(state_precip_error, aes(x = avg_precip, y = avg_error)) +
geom_point(color = "black", size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
stat_poly_eq(aes(label = after_stat(eq.label)),
formula = y ~ x,
parse = TRUE,
color = "black",
size = 5,
vjust = 2,
hjust = -2) +
labs(
title = "Forecast Error vs Precipitation by State",
x = "Average Precipitation (in)",
y = "Average Forecast Error (°F)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12)
)
After examining the data we could predict that the East Coast, having more precipitation, lower average elevation, and mostly middle-of-the-road error (around 2.5 at a glance) is quite consistent as a region in its error error. Forecasters on the East Coast could be aware of their tendencies to over and underestimate on occasion by a factor of around 2.5. Montana, the Dakotas, Nebraska and Colorado tend to be some of the states with the highest error, and Montana Colorado and some other central states have high average elevations. Perhaps something in the forecasting technique causes inaccuracies at higher elevations. These predictions are confirmed by scatter plots made later in the data analysis, where elevation shows a positive correlation between higher elevations and higher errors, where as precipitation shows a slightly negative correlation between high precipitation and error.
link for published page: