The data used in this analysis comes from the National Weather Service and includes weather forecast data for 167 cities over a period of 16 months. This data was initially in three separate data sets and elements have been joined from different data sets to create interesting comparisons. The goal of this project is to investigate which areas in the U.S. struggle with weather predictions and explore possible reasons for forecast errors.
To tidy the data I created a new variable called ‘weather_and_cities’ in which I have joined the ‘weather_forecasts’ data with the ‘forecast_cities’ data so that I can make plots using variables from both data sets easily. I have also chosen to filter out territories and states that are not part of the continental US so that as I visualize the data down the line it is easier to see information about my area of interest without the map spreading too far. Finally, in this chunk I have also chosen to create a new column called ‘forecast_error’ which was created by taking the absolute value of the difference between the forecasted and observed temperatures to find how far off the predictions were.
I then created a base map off of which I could alter a few variables of interest. This base map is a chloropleth map showing the continental united states with each state filled by a color representing the average error in forecast across all cities in the state present in the data. It is important to note that not all states have equal numbers of cities represented and this may skew the data.
From the base map I created three different piece-wise iterations of the map in which I sliced the US up by different variables. This allowed me to take initial guesses at which way of dividing the states would group them best by shared error amounts. Finally, I created scatter plots with trend lines shown on the graphs to check correlations between error and the the two variables I found most interesting: elevation and precipitation.
The base choropleth map reveals that certain regions, especially in the central US, exhibit higher forecast errors. Montana stands out as having the highest overall forecast error of all the continental United States. Montana, the Dakotas and other nearby central states being higher in overall forecasting error suggests that forecast accuracy may be affected by geographic and environmental factors such as elevation and proximity to the coast. Three such factors were tested in following chloropleth maps to see if those patterns persist.
All included states had some high temperature error and some low temperature error. The data show that many states farther south in the US have lower high temperature error, and many northern and central states have higher high temperature error and much higher low temperature error. The East Coast has roughly the same error for both high and low.
To examine elevation differences I have carved the continental US into higher average elevation regions (the Central United States) and lower elevation regions (the West Coast, Midwest and East Coast). When isolated, the trend in the differences in error in the Central US are emphasized. Elevation seemed to group states best by their similar error types, and this theory that the correlation is stronger here than with other variables is shown in supplemental scatter plot graphs with trendlines showing slope of correlation.
To further investigate which ways we can slice up the states to try and group error amounts together, precipitation was plotted across the map with high precipitation being above the average rainfall in the entire US per year (30 inches) and low rainfall being below. This trend is somewhat similar, with Montana dn the central states being mostly together at lower precipitation and the entire East Coast having high precipitation and generally lower error.
After examining the data we could predict that the East Coast, having more precipitation, lower average elevation, and mostly middle-of-the-road error (around 2.5 at a glance) is quite consistent as a region in its error error. Forecasters on the East Coast could be aware of their tendencies to over and underestimate on occasion by a factor of around 2.5. Montana, the Dakotas, Nebraska and Colorado tend to be some of the states with the highest error, and Montana Colorado and some other central states have high average elevations. Perhaps something in the forecasting technique causes inaccuracies at higher elevations. These predictions are confirmed by scatter plots made later in the data analysis, where elevation shows a positive correlation between higher elevations and higher errors, where as precipitation shows a slightly negative correlation between high precipitation and error.
link for published page:
install.packages("ggpmisc")
## Installing package into '/Accounts/greggr2/R/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(readr)
library(tidyr)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(viridis)
## Loading required package: viridisLite
library(maps)
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:viridis':
##
## unemp
##
## The following object is masked from 'package:purrr':
##
## map
library(mapproj)
library(ggthemes)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(stringr)
library(forcats)
library(ggpmisc)
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
## method from
## heightDetails.titleGrob ggplot2
## widthDetails.titleGrob ggplot2
##
## Attaching package: 'ggpp'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
weather_forecasts <- read_csv("data/weather_forecasts.csv")
## Rows: 651968 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): city, state, high_or_low, forecast_outlook, possible_error
## dbl (4): forecast_hours_before, observed_temp, forecast_temp, observed_precip
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
forecast_cities <- read_csv("data/forecast_cities.csv")
## Rows: 236 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): city, state, koppen
## dbl (8): lon, lat, elevation, distance_to_coast, wind, elevation_change_four...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
outlook_meanings <- read_csv("data/outlook_meanings.csv")
## Rows: 23 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): forecast_outlook, meaning
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weather_and_cities <- weather_forecasts %>%
left_join(forecast_cities, by = c("city", "state")) %>%
filter(!state %in% c("AK", "HI", "VI", "PR", "DC")) %>%
mutate(forecast_error = abs(forecast_temp - observed_temp)) %>%
filter(!is.na(forecast_error))
sum(is.na(weather_and_cities))
## [1] 2856
head(weather_and_cities)
## # A tibble: 6 × 20
## date city state high_or_low forecast_hours_before observed_temp
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2021-01-30 ABILENE TX high 12 70
## 2 2021-01-30 ABILENE TX low 24 42
## 3 2021-01-30 ABILENE TX low 12 42
## 4 2021-01-30 AKRON_CANTON OH high 12 29
## 5 2021-01-30 AKRON_CANTON OH low 24 26
## 6 2021-01-30 AKRON_CANTON OH low 12 26
## # ℹ 14 more variables: forecast_temp <dbl>, observed_precip <dbl>,
## # forecast_outlook <chr>, possible_error <chr>, lon <dbl>, lat <dbl>,
## # koppen <chr>, elevation <dbl>, distance_to_coast <dbl>, wind <dbl>,
## # elevation_change_four <dbl>, elevation_change_eight <dbl>,
## # avg_annual_precip <dbl>, forecast_error <dbl>
state_avg_error <- weather_and_cities %>%
group_by(state) %>%
summarize(avg_error = mean(forecast_error, na.rm = TRUE))
states_map <- map_data("state") %>%
filter(!region %in% c("alaska", "hawaii"))
state_avg_error$state <- tolower(state_avg_error$state)
state_avg_error <- state_avg_error %>%
left_join(
tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
))
## Joining with `by = join_by(state)`
head(state_avg_error)
## # A tibble: 6 × 3
## state avg_error name
## <chr> <dbl> <chr>
## 1 al 2.14 alabama
## 2 ar 2.26 arkansas
## 3 az 1.96 arizona
## 4 ca 2.21 california
## 5 co 2.84 colorado
## 6 ct 2.33 connecticut
ggplot(state_avg_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E") +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
high_temp_error <- weather_and_cities %>%
filter(high_or_low == "high") %>%
group_by(state) %>%
summarize(avg_high_error = mean(forecast_error, na.rm = TRUE))
low_temp_error <- weather_and_cities %>%
filter(high_or_low == "low") %>%
group_by(state) %>%
summarize(avg_low_error = mean(forecast_error, na.rm = TRUE))
state_name_mapping <- tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
)
high_temp_error$state <- tolower(high_temp_error$state)
low_temp_error$state <- tolower(low_temp_error$state)
high_temp_error <- high_temp_error %>%
left_join(state_name_mapping, by = "state")
low_temp_error <- low_temp_error %>%
left_join(state_name_mapping, by = "state")
head(high_temp_error)
## # A tibble: 6 × 3
## state avg_high_error name
## <chr> <dbl> <chr>
## 1 al 2.17 alabama
## 2 ar 2.37 arkansas
## 3 az 1.35 arizona
## 4 ca 2.23 california
## 5 co 2.59 colorado
## 6 ct 2.29 connecticut
head(low_temp_error)
## # A tibble: 6 × 3
## state avg_low_error name
## <chr> <dbl> <chr>
## 1 al 2.11 alabama
## 2 ar 2.15 arkansas
## 3 az 2.56 arizona
## 4 ca 2.19 california
## 5 co 3.09 colorado
## 6 ct 2.36 connecticut
library(patchwork)
p1 <- ggplot(high_temp_error) +
geom_map(aes(map_id = name, fill = avg_high_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average High Temperature Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p2 <- ggplot(low_temp_error) +
geom_map(aes(map_id = name, fill = avg_low_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Low Temperature Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p1 / p2
state_elev_error <- weather_and_cities %>%
group_by(state) %>%
summarize(
avg_error = mean(forecast_error, na.rm = TRUE),
avg_elevation = mean(elevation, na.rm = TRUE)
)
high_elev_error <- state_elev_error %>%
filter(avg_elevation >= 1000)
low_elev_error <- state_elev_error %>%
filter(avg_elevation < 1000)
state_name_mapping <- tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
)
high_elev_error$state <- tolower(high_elev_error$state)
low_elev_error$state <- tolower(low_elev_error$state)
high_elev_error <- high_elev_error %>%
left_join(state_name_mapping, by = "state")
low_elev_error <- low_elev_error %>%
left_join(state_name_mapping, by = "state")
head(high_elev_error)
## # A tibble: 6 × 4
## state avg_error avg_elevation name
## <chr> <dbl> <dbl> <chr>
## 1 co 2.84 1620. colorado
## 2 id 2.67 1140. idaho
## 3 mt 3.23 1091. montana
## 4 nm 2.36 1765. new mexico
## 5 nv 2.11 1035. nevada
## 6 ut 2.23 1288. utah
head(low_elev_error)
## # A tibble: 6 × 4
## state avg_error avg_elevation name
## <chr> <dbl> <dbl> <chr>
## 1 al 2.14 124. alabama
## 2 ar 2.26 108. arkansas
## 3 az 1.96 831. arizona
## 4 ca 2.21 72.7 california
## 5 ct 2.33 3.10 connecticut
## 6 de 2.09 15.2 delaware
head(high_elev_error)
## # A tibble: 6 × 4
## state avg_error avg_elevation name
## <chr> <dbl> <dbl> <chr>
## 1 co 2.84 1620. colorado
## 2 id 2.67 1140. idaho
## 3 mt 3.23 1091. montana
## 4 nm 2.36 1765. new mexico
## 5 nv 2.11 1035. nevada
## 6 ut 2.23 1288. utah
head(low_elev_error)
## # A tibble: 6 × 4
## state avg_error avg_elevation name
## <chr> <dbl> <dbl> <chr>
## 1 al 2.14 124. alabama
## 2 ar 2.26 108. arkansas
## 3 az 1.96 831. arizona
## 4 ca 2.21 72.7 california
## 5 ct 2.33 3.10 connecticut
## 6 de 2.09 15.2 delaware
p1 <- ggplot(high_elev_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75), ) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average High Elevation (m) Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p2 <- ggplot(low_elev_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Low Elevation (m) Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p1 / p2
state_precip_error <- weather_and_cities %>%
group_by(state) %>%
summarize(
avg_error = mean(forecast_error, na.rm = TRUE),
avg_precip = mean(avg_annual_precip, na.rm = TRUE)
)
high_precip_error <- state_precip_error %>%
filter(avg_precip > 30)
low_precip_error <- state_precip_error %>%
filter(avg_precip < 30)
state_name_mapping <- tibble(
state = str_to_lower(state.abb),
name = str_to_lower(state.name)
)
high_precip_error$state <- tolower(high_precip_error$state)
low_precip_error$state <- tolower(low_precip_error$state)
high_precip_error <- high_precip_error %>%
left_join(state_name_mapping, by = "state")
low_precip_error <- low_precip_error %>%
left_join(state_name_mapping, by = "state")
head(high_precip_error)
## # A tibble: 6 × 4
## state avg_error avg_precip name
## <chr> <dbl> <dbl> <chr>
## 1 al 2.14 60.8 alabama
## 2 ar 2.26 55.8 arkansas
## 3 ct 2.33 50.9 connecticut
## 4 de 2.09 50.4 delaware
## 5 fl 1.79 55.8 florida
## 6 ga 2.07 51.4 georgia
head(low_precip_error)
## # A tibble: 6 × 4
## state avg_error avg_precip name
## <chr> <dbl> <dbl> <chr>
## 1 az 1.96 11.2 arizona
## 2 ca 2.21 16.3 california
## 3 co 2.84 15.1 colorado
## 4 id 2.67 13.7 idaho
## 5 mt 3.23 15.2 montana
## 6 nd 2.89 22.2 north dakota
head(high_precip_error)
## # A tibble: 6 × 4
## state avg_error avg_precip name
## <chr> <dbl> <dbl> <chr>
## 1 al 2.14 60.8 alabama
## 2 ar 2.26 55.8 arkansas
## 3 ct 2.33 50.9 connecticut
## 4 de 2.09 50.4 delaware
## 5 fl 1.79 55.8 florida
## 6 ga 2.07 51.4 georgia
head(low_precip_error)
## # A tibble: 6 × 4
## state avg_error avg_precip name
## <chr> <dbl> <dbl> <chr>
## 1 az 1.96 11.2 arizona
## 2 ca 2.21 16.3 california
## 3 co 2.84 15.1 colorado
## 4 id 2.67 13.7 idaho
## 5 mt 3.23 15.2 montana
## 6 nd 2.89 22.2 north dakota
p1 <- ggplot(high_precip_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average High Precipitation Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p2 <- ggplot(low_precip_error) +
geom_map(aes(map_id = name, fill = avg_error),
color = "white",
map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +
coord_map() +
theme_void() +
theme(legend.position = "bottom") +
labs(
title = "Average Low Precipitation Forecast Error by State",
fill = "Error (°F)",
x = NULL,
y = NULL
)
p1 / p2
ggplot(state_elev_error, aes(x = avg_elevation, y = avg_error)) +
geom_point(color = "black", size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
stat_poly_eq(aes(label = after_stat(eq.label)),
formula = y ~ x,
parse = TRUE,
color = "black",
size = 5,
vjust = 2,
hjust = -2) +
labs(
title = "Forecast Error vs Elevation by State",
x = "Average Elevation (m)",
y = "Average Forecast Error (°F)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(state_precip_error, aes(x = avg_precip, y = avg_error)) +
geom_point(color = "black", size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
stat_poly_eq(aes(label = after_stat(eq.label)),
formula = y ~ x,
parse = TRUE,
color = "black",
size = 5,
vjust = 2,
hjust = -2) +
labs(
title = "Forecast Error vs Precipitation by State",
x = "Average Precipitation (in)",
y = "Average Forecast Error (°F)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12)
)
## `geom_smooth()` using formula = 'y ~ x'