Introduction

Data Overview

The data used in this analysis comes from the National Weather Service and includes weather forecast data for 167 cities over a period of 16 months. This data was initially in three separate data sets and elements have been joined from different data sets to create interesting comparisons. The goal of this project is to investigate which areas in the U.S. struggle with weather predictions and explore possible reasons for forecast errors.

Report Section 2: Analysis

To tidy the data I created a new variable called ‘weather_and_cities’ in which I have joined the ‘weather_forecasts’ data with the ‘forecast_cities’ data so that I can make plots using variables from both data sets easily. I have also chosen to filter out territories and states that are not part of the continental US so that as I visualize the data down the line it is easier to see information about my area of interest without the map spreading too far. Finally, in this chunk I have also chosen to create a new column called ‘forecast_error’ which was created by taking the absolute value of the difference between the forecasted and observed temperatures to find how far off the predictions were.

I then created a base map off of which I could alter a few variables of interest. This base map is a chloropleth map showing the continental united states with each state filled by a color representing the average error in forecast across all cities in the state present in the data. It is important to note that not all states have equal numbers of cities represented and this may skew the data.

From the base map I created three different piece-wise iterations of the map in which I sliced the US up by different variables. This allowed me to take initial guesses at which way of dividing the states would group them best by shared error amounts. Finally, I created scatter plots with trend lines shown on the graphs to check correlations between error and the the two variables I found most interesting: elevation and precipitation.

Overall Forecast Accuracy

The base choropleth map reveals that certain regions, especially in the central US, exhibit higher forecast errors. Montana stands out as having the highest overall forecast error of all the continental United States. Montana, the Dakotas and other nearby central states being higher in overall forecasting error suggests that forecast accuracy may be affected by geographic and environmental factors such as elevation and proximity to the coast. Three such factors were tested in following chloropleth maps to see if those patterns persist.

Temperature (High/Low)

All included states had some high temperature error and some low temperature error. The data show that many states farther south in the US have lower high temperature error, and many northern and central states have higher high temperature error and much higher low temperature error. The East Coast has roughly the same error for both high and low.

Elevation (> or < 1000)

To examine elevation differences I have carved the continental US into higher average elevation regions (the Central United States) and lower elevation regions (the West Coast, Midwest and East Coast). When isolated, the trend in the differences in error in the Central US are emphasized. Elevation seemed to group states best by their similar error types, and this theory that the correlation is stronger here than with other variables is shown in supplemental scatter plot graphs with trendlines showing slope of correlation.

Precipitation (> or < 30 inches/year)

To further investigate which ways we can slice up the states to try and group error amounts together, precipitation was plotted across the map with high precipitation being above the average rainfall in the entire US per year (30 inches) and low rainfall being below. This trend is somewhat similar, with Montana dn the central states being mostly together at lower precipitation and the entire East Coast having high precipitation and generally lower error.

Report Section 3: Conclusion

Key Insights

After examining the data we could predict that the East Coast, having more precipitation, lower average elevation, and mostly middle-of-the-road error (around 2.5 at a glance) is quite consistent as a region in its error error. Forecasters on the East Coast could be aware of their tendencies to over and underestimate on occasion by a factor of around 2.5. Montana, the Dakotas, Nebraska and Colorado tend to be some of the states with the highest error, and Montana Colorado and some other central states have high average elevations. Perhaps something in the forecasting technique causes inaccuracies at higher elevations. These predictions are confirmed by scatter plots made later in the data analysis, where elevation shows a positive correlation between higher elevations and higher errors, where as precipitation shows a slightly negative correlation between high precipitation and error.

link for published page:

http://rpubs.com/greggr2/weathererrorusa

Load Libraries/install packages

install.packages("ggpmisc")

## Installing package into '/Accounts/greggr2/R/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(readr)
library(tidyr)
library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(viridis)

## Loading required package: viridisLite

library(maps)

## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:viridis':
## 
##     unemp
## 
## The following object is masked from 'package:purrr':
## 
##     map

library(mapproj)
library(ggthemes)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(stringr)
library(forcats)
library(ggpmisc)

## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

Loading Data

weather_forecasts <- read_csv("data/weather_forecasts.csv")

## Rows: 651968 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): city, state, high_or_low, forecast_outlook, possible_error
## dbl  (4): forecast_hours_before, observed_temp, forecast_temp, observed_precip
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

forecast_cities <- read_csv("data/forecast_cities.csv")

## Rows: 236 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): city, state, koppen
## dbl (8): lon, lat, elevation, distance_to_coast, wind, elevation_change_four...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

outlook_meanings <- read_csv("data/outlook_meanings.csv")

## Rows: 23 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): forecast_outlook, meaning
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Tidying and Wrangling Data

weather_and_cities <- weather_forecasts %>%
  left_join(forecast_cities, by = c("city", "state")) %>% 
  filter(!state %in% c("AK", "HI", "VI", "PR", "DC")) %>%  
  mutate(forecast_error = abs(forecast_temp - observed_temp)) %>%
  filter(!is.na(forecast_error))

sum(is.na(weather_and_cities))

## [1] 2856

head(weather_and_cities)

## # A tibble: 6 × 20
##   date       city         state high_or_low forecast_hours_before observed_temp
##   <date>     <chr>        <chr> <chr>                       <dbl>         <dbl>
## 1 2021-01-30 ABILENE      TX    high                           12            70
## 2 2021-01-30 ABILENE      TX    low                            24            42
## 3 2021-01-30 ABILENE      TX    low                            12            42
## 4 2021-01-30 AKRON_CANTON OH    high                           12            29
## 5 2021-01-30 AKRON_CANTON OH    low                            24            26
## 6 2021-01-30 AKRON_CANTON OH    low                            12            26
## # ℹ 14 more variables: forecast_temp <dbl>, observed_precip <dbl>,
## #   forecast_outlook <chr>, possible_error <chr>, lon <dbl>, lat <dbl>,
## #   koppen <chr>, elevation <dbl>, distance_to_coast <dbl>, wind <dbl>,
## #   elevation_change_four <dbl>, elevation_change_eight <dbl>,
## #   avg_annual_precip <dbl>, forecast_error <dbl>

Create Chloropleth Map to Visualize Just Error and Location

state_avg_error <- weather_and_cities %>%
  group_by(state) %>%
  summarize(avg_error = mean(forecast_error, na.rm = TRUE))

states_map <- map_data("state") %>%
   filter(!region %in% c("alaska", "hawaii"))

state_avg_error$state <- tolower(state_avg_error$state) 

state_avg_error <- state_avg_error %>%
  left_join(
tibble(
  state = str_to_lower(state.abb),
  name = str_to_lower(state.name)
))

## Joining with `by = join_by(state)`

head(state_avg_error)

## # A tibble: 6 × 3
##   state avg_error name       
##   <chr>     <dbl> <chr>      
## 1 al         2.14 alabama    
## 2 ar         2.26 arkansas   
## 3 az         1.96 arizona    
## 4 ca         2.21 california 
## 5 co         2.84 colorado   
## 6 ct         2.33 connecticut

ggplot(state_avg_error) +
  geom_map(aes(map_id = name, fill = avg_error), 
           color = "white", 
           map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c(option = "E") +  
  coord_map() + 
  theme_void() +  
  theme(legend.position = "bottom") + 
  labs(
    title = "Average Forecast Error by State",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  )

Create Side by Side Chloropleth Maps for High and Low Temperatures

high_temp_error <- weather_and_cities %>%
  filter(high_or_low == "high") %>%
  group_by(state) %>%
  summarize(avg_high_error = mean(forecast_error, na.rm = TRUE))

low_temp_error <- weather_and_cities %>%
  filter(high_or_low == "low") %>%
  group_by(state) %>%
  summarize(avg_low_error = mean(forecast_error, na.rm = TRUE))

state_name_mapping <- tibble(
  state = str_to_lower(state.abb),
  name = str_to_lower(state.name)
)

high_temp_error$state <- tolower(high_temp_error$state)
low_temp_error$state <- tolower(low_temp_error$state)

high_temp_error <- high_temp_error %>%
  left_join(state_name_mapping, by = "state")

low_temp_error <- low_temp_error %>%
  left_join(state_name_mapping, by = "state")

head(high_temp_error)

## # A tibble: 6 × 3
##   state avg_high_error name       
##   <chr>          <dbl> <chr>      
## 1 al              2.17 alabama    
## 2 ar              2.37 arkansas   
## 3 az              1.35 arizona    
## 4 ca              2.23 california 
## 5 co              2.59 colorado   
## 6 ct              2.29 connecticut

head(low_temp_error)

## # A tibble: 6 × 3
##   state avg_low_error name       
##   <chr>         <dbl> <chr>      
## 1 al             2.11 alabama    
## 2 ar             2.15 arkansas   
## 3 az             2.56 arizona    
## 4 ca             2.19 california 
## 5 co             3.09 colorado   
## 6 ct             2.36 connecticut

library(patchwork)

p1 <- ggplot(high_temp_error) +
  geom_map(aes(map_id = name, fill = avg_high_error), 
           color = "white", 
           map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +  
  coord_map() + 
  theme_void() +  
  theme(legend.position = "bottom") + 
  labs(
    title = "Average High Temperature Forecast Error by State",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  )

p2 <- ggplot(low_temp_error) +
  geom_map(aes(map_id = name, fill = avg_low_error), 
           color = "white", 
           map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +  
  coord_map() + 
  theme_void() +  
  theme(legend.position = "bottom") + 
  labs(
    title = "Average Low Temperature Forecast Error by State",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  ) 

p1 / p2

Create Side by Side Chloropleth Maps for <1000 ft or >1000 ft elevation

state_elev_error <- weather_and_cities %>%
  group_by(state) %>%
  summarize(
    avg_error = mean(forecast_error, na.rm = TRUE), 
    avg_elevation = mean(elevation, na.rm = TRUE) 
  )

high_elev_error <- state_elev_error %>%
  filter(avg_elevation >= 1000)

low_elev_error <- state_elev_error %>%
  filter(avg_elevation < 1000)

state_name_mapping <- tibble(
  state = str_to_lower(state.abb),
  name = str_to_lower(state.name)
)

high_elev_error$state <- tolower(high_elev_error$state)
low_elev_error$state <- tolower(low_elev_error$state)

high_elev_error <- high_elev_error %>%
  left_join(state_name_mapping, by = "state")

low_elev_error <- low_elev_error %>%
  left_join(state_name_mapping, by = "state")

head(high_elev_error)

## # A tibble: 6 × 4
##   state avg_error avg_elevation name      
##   <chr>     <dbl>         <dbl> <chr>     
## 1 co         2.84         1620. colorado  
## 2 id         2.67         1140. idaho     
## 3 mt         3.23         1091. montana   
## 4 nm         2.36         1765. new mexico
## 5 nv         2.11         1035. nevada    
## 6 ut         2.23         1288. utah

head(low_elev_error)

## # A tibble: 6 × 4
##   state avg_error avg_elevation name       
##   <chr>     <dbl>         <dbl> <chr>      
## 1 al         2.14        124.   alabama    
## 2 ar         2.26        108.   arkansas   
## 3 az         1.96        831.   arizona    
## 4 ca         2.21         72.7  california 
## 5 ct         2.33          3.10 connecticut
## 6 de         2.09         15.2  delaware

head(high_elev_error)

## # A tibble: 6 × 4
##   state avg_error avg_elevation name      
##   <chr>     <dbl>         <dbl> <chr>     
## 1 co         2.84         1620. colorado  
## 2 id         2.67         1140. idaho     
## 3 mt         3.23         1091. montana   
## 4 nm         2.36         1765. new mexico
## 5 nv         2.11         1035. nevada    
## 6 ut         2.23         1288. utah

head(low_elev_error)

## # A tibble: 6 × 4
##   state avg_error avg_elevation name       
##   <chr>     <dbl>         <dbl> <chr>      
## 1 al         2.14        124.   alabama    
## 2 ar         2.26        108.   arkansas   
## 3 az         1.96        831.   arizona    
## 4 ca         2.21         72.7  california 
## 5 ct         2.33          3.10 connecticut
## 6 de         2.09         15.2  delaware

p1 <- ggplot(high_elev_error) +
  geom_map(aes(map_id = name, fill = avg_error), 
           color = "white", 
           map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75), ) +  
  coord_map() + 
  theme_void() +  
  theme(legend.position = "bottom") + 
  labs(
    title = "Average High Elevation (m) Forecast Error by State",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  )

p2 <- ggplot(low_elev_error) +
  geom_map(aes(map_id = name, fill = avg_error), 
           color = "white", 
           map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +  
  coord_map() + 
  theme_void() +  
  theme(legend.position = "bottom") + 
  labs(
    title = "Average Low Elevation (m) Forecast Error by State",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  ) 

p1 / p2

Create Side by Side Chloropleth Maps for <30 in or >30 in precipitation

state_precip_error <- weather_and_cities %>%
  group_by(state) %>%
  summarize(
    avg_error = mean(forecast_error, na.rm = TRUE),  
    avg_precip = mean(avg_annual_precip, na.rm = TRUE)  
  )

high_precip_error <- state_precip_error %>%
  filter(avg_precip > 30)

low_precip_error <- state_precip_error %>%
  filter(avg_precip < 30)

state_name_mapping <- tibble(
  state = str_to_lower(state.abb),
  name = str_to_lower(state.name)
)

high_precip_error$state <- tolower(high_precip_error$state)
low_precip_error$state <- tolower(low_precip_error$state)

high_precip_error <- high_precip_error %>%
  left_join(state_name_mapping, by = "state")

low_precip_error <- low_precip_error %>%
  left_join(state_name_mapping, by = "state")

head(high_precip_error)

## # A tibble: 6 × 4
##   state avg_error avg_precip name       
##   <chr>     <dbl>      <dbl> <chr>      
## 1 al         2.14       60.8 alabama    
## 2 ar         2.26       55.8 arkansas   
## 3 ct         2.33       50.9 connecticut
## 4 de         2.09       50.4 delaware   
## 5 fl         1.79       55.8 florida    
## 6 ga         2.07       51.4 georgia

head(low_precip_error)

## # A tibble: 6 × 4
##   state avg_error avg_precip name        
##   <chr>     <dbl>      <dbl> <chr>       
## 1 az         1.96       11.2 arizona     
## 2 ca         2.21       16.3 california  
## 3 co         2.84       15.1 colorado    
## 4 id         2.67       13.7 idaho       
## 5 mt         3.23       15.2 montana     
## 6 nd         2.89       22.2 north dakota

head(high_precip_error)

## # A tibble: 6 × 4
##   state avg_error avg_precip name       
##   <chr>     <dbl>      <dbl> <chr>      
## 1 al         2.14       60.8 alabama    
## 2 ar         2.26       55.8 arkansas   
## 3 ct         2.33       50.9 connecticut
## 4 de         2.09       50.4 delaware   
## 5 fl         1.79       55.8 florida    
## 6 ga         2.07       51.4 georgia

head(low_precip_error)

## # A tibble: 6 × 4
##   state avg_error avg_precip name        
##   <chr>     <dbl>      <dbl> <chr>       
## 1 az         1.96       11.2 arizona     
## 2 ca         2.21       16.3 california  
## 3 co         2.84       15.1 colorado    
## 4 id         2.67       13.7 idaho       
## 5 mt         3.23       15.2 montana     
## 6 nd         2.89       22.2 north dakota

p1 <- ggplot(high_precip_error) +
  geom_map(aes(map_id = name, fill = avg_error), 
           color = "white", 
           map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +  
  coord_map() + 
  theme_void() +  
  theme(legend.position = "bottom") + 
  labs(
    title = "Average High Precipitation Forecast Error by State",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  )

p2 <- ggplot(low_precip_error) +
  geom_map(aes(map_id = name, fill = avg_error), 
           color = "white", 
           map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  scale_fill_viridis_c(option = "E", limits = c(1.25, 3.75)) +  
  coord_map() + 
  theme_void() +  
  theme(legend.position = "bottom") + 
  labs(
    title = "Average Low Precipitation Forecast Error by State",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  ) 

p1 / p2

Create Scatterplot to Test Elevation Theory

ggplot(state_elev_error, aes(x = avg_elevation, y = avg_error)) +
  geom_point(color = "black", size = 3) + 
  geom_smooth(method = "lm", se = FALSE, color = "blue") + 
  stat_poly_eq(aes(label = after_stat(eq.label)), 
               formula = y ~ x, 
               parse = TRUE, 
               color = "black", 
               size = 5,
               vjust = 2, 
               hjust = -2) + 
  labs(
    title = "Forecast Error vs Elevation by State",
    x = "Average Elevation (m)",
    y = "Average Forecast Error (°F)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

## `geom_smooth()` using formula = 'y ~ x'

Create Scatter Plot of Precipitation Data

ggplot(state_precip_error, aes(x = avg_precip, y = avg_error)) +
  geom_point(color = "black", size = 3) + 
  geom_smooth(method = "lm", se = FALSE, color = "blue") + 
  stat_poly_eq(aes(label = after_stat(eq.label)), 
               formula = y ~ x, 
               parse = TRUE, 
               color = "black", 
               size = 5,
               vjust = 2, 
               hjust = -2) + 
  labs(
    title = "Forecast Error vs Precipitation by State",
    x = "Average Precipitation (in)",
    y = "Average Forecast Error (°F)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 12)
  )

## `geom_smooth()` using formula = 'y ~ x'

project2-writeup