Portfolio 2

I first combined the weather forecast data with the forecast outlook data so that each forecast includes the meaning of its outlook code. Then I used inner_join() to combine the result with the city data, which adds information about each city in the dataset. Next, I created a new variable, the forecasting error, calculated as the observed temperature minus the forecasted temperature. I averaged the errors across the cities in each state to get the average error for each state, and used this to plot a map of the error, in degrees, for each US state. I split the map by high and low temperature to see whether states were better at predicting one than the other.

I then identified the states with the greatest average error in high or low temperatures and plotted their errors. For these states in particular, I looked at how environmental factors such as average annual precipitation, elevation, and distance to a coast might relate to the errors. I also found the 5 most common forecast outlooks for each group (states with high temperature error and states with low temperature error) and plotted those findings. In every plot, light blue corresponds to states with high temperature error and darker blue to states with low temperature error.
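The error calculation described above can be sketched on a few made-up rows. The values, the state codes "AA" and "BB", and the names `toy` and `toy_summary` are illustrative assumptions; only the column names mirror the real data. The sketch also shows that a mean of signed errors measures bias, so misses in opposite directions cancel. The mean absolute error is included only for comparison and is not part of this analysis.

```r
library(dplyr)

# Toy data (made-up values) with the same column names as the real dataset
toy <- tibble(
  state.x       = c("AA", "AA", "BB", "BB"),
  high_or_low   = c("high", "high", "high", "high"),
  observed_temp = c(70, 60, 55, 45),
  forecast_temp = c(68, 62, 54, 44)
)

toy_summary <- toy %>%
  # positive error = observed temperature was warmer than forecast
  mutate(error = observed_temp - forecast_temp) %>%
  group_by(state.x, high_or_low) %>%
  summarise(
    average_error     = mean(error, na.rm = TRUE),       # signed: +2 and -2 cancel
    average_abs_error = mean(abs(error), na.rm = TRUE),  # magnitude only
    .groups = "drop"
  )

toy_summary
```

In this toy case, state "AA" has an average signed error of 0 even though each forecast missed by 2 degrees, so an average error near zero does not necessarily mean the forecasts were accurate.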

From the two maps of US states showing high and low temperature forecasting errors, I found that states tended to have greater error when forecasting the low temperature than the high. Low temperature forecasting errors tend to occur in the western half of the US, primarily in states like Nevada, Montana, Oregon, New Mexico, and Washington. On the east coast, Massachusetts also has a large error in forecasting low temperatures. States with the greatest error in forecasting high temperatures appear to be scattered throughout the country, with less of a regional pattern.

When I calculated the states with the greatest error in forecasting either low or high temperatures, I found results similar to what was visible on the map. Alaska and Hawaii were not plotted on the map, but Alaska appears among the top six states in both categories and Hawaii has the greatest low temperature error. The top low temperature error states had greater average forecasting errors than the top high temperature error states.
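The ranking step could also be done without typing the state codes by hand. The following is a minimal sketch on made-up averages; `avg_errors`, `top_states`, and the state codes are hypothetical, and `n = 2` stands in for the top six used here.

```r
library(dplyr)

# Made-up state averages in the same shape as the summarised data
avg_errors <- tibble(
  state.x       = rep(c("AA", "BB", "CC", "DD"), each = 2),
  high_or_low   = rep(c("high", "low"), times = 4),
  average_error = c(0.9, 1.2, 0.4, 2.1, 0.7, 0.3, 0.1, 1.8)
)

# Keep the n largest average errors within each temperature type
top_states <- avg_errors %>%
  group_by(high_or_low) %>%
  slice_max(average_error, n = 2) %>%
  ungroup()

top_states
```

Selecting the states from the data this way keeps the plots and the later comparisons tied to the same list of states.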

The boxplots show different trends in factors that may affect forecasting accuracy. The Precipitation plot shows that high temperature error states have higher average annual precipitation than low temperature error states, which may affect forecasting accuracy. Low temperature error states have a greater range in elevation, which may also contribute to forecasting error. Both groups of states had widely varying distances to a coast, so distance to a coast seems unlikely to explain the difference in temperature forecasting error.
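One detail matters when comparing the two groups: a state can belong to both (Massachusetts and Alaska appear near the top of both rankings in the output further down), and a single group label per state keeps it in only one of them. A minimal sketch of an alternative, assuming hypothetical state codes and made-up precipitation values, builds one row per state and group so that a shared state counts in both:

```r
library(dplyr)

# Hypothetical state codes; "BB" belongs to both groups
high_group <- c("AA", "BB")
low_group  <- c("BB", "CC")

# One row per (state, group) pair, so a shared state is listed twice
membership <- bind_rows(
  tibble(state.x = high_group, error_group = "High"),
  tibble(state.x = low_group,  error_group = "Low")
)

# Made-up per-state averages standing in for precipitation
state_values <- tibble(
  state.x      = c("AA", "BB", "CC"),
  state_precip = c(30, 45, 12)
)

# Attach the per-state value to every group the state belongs to
grouped_values <- membership %>%
  inner_join(state_values, by = "state.x")

grouped_values
```

Whether a shared state should count in both groups is a design choice; the sketch only shows how to make that choice explicit.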

I also found that the most common forecast outlook for both groups is sunny, which suggests that forecast outlook does not explain why these states have temperature forecasting errors.
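Because the two groups contain different numbers of forecasts, raw counts are not directly comparable between them. A minimal sketch, using made-up rows and illustrative outlook labels (`toy_forecasts` and `outlook_share` are hypothetical names), converts counts to within-group shares:

```r
library(dplyr)

# Made-up forecast rows; the meaning labels are illustrative
toy_forecasts <- tibble(
  error_group = c(rep("High", 4), rep("Low", 6)),
  meaning     = c("Sunny", "Sunny", "Rain", "Cloudy",
                  "Sunny", "Sunny", "Sunny", "Rain", "Rain", "Cloudy")
)

outlook_share <- toy_forecasts %>%
  count(error_group, meaning, name = "count") %>%
  group_by(error_group) %>%
  mutate(share = count / sum(count)) %>%  # fraction of that group's forecasts
  slice_max(count, n = 1) %>%             # most common outlook per group
  ungroup()

outlook_share
```

In this toy data the counts differ (2 versus 3), but the most common outlook makes up the same share of each group's forecasts.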

library(tidyverse)
library(patchwork)
library(ggthemes)
library(ggplot2)

library(readr)

# Load in the data sets

weather <- read_csv("data/weather_forecasts.csv")
cities <- read_csv("data/forecast_cities.csv")
outlook <- read_csv("data/outlook_meanings.csv")

# Use left_join() to add the outlook meanings to the weather data

weather <- weather %>%
  left_join(outlook, by = "forecast_outlook")

# Use inner_join() to add city information to the weather data

weather <- weather %>%
  inner_join(cities, by = "city")

# Create a variable called error (observed minus forecasted temperature),
# then average it for each state and temperature type (high or low)
weather2 <- weather %>%
  mutate(error = observed_temp - forecast_temp) %>%
  group_by(state.x, high_or_low) %>%
  summarise(average_error = mean(error, na.rm = TRUE))

# write state initials as full name
weather2 <- weather2 %>%
  mutate(state_full = recode(state.x,
    "AL" = "alabama", "AK" = "alaska", "AZ" = "arizona", "AR" = "arkansas", "CA" = "california",
    "CO" = "colorado", "CT" = "connecticut", "DE" = "delaware", "FL" = "florida", "GA" = "georgia",
    "HI" = "hawaii", "ID" = "idaho", "IL" = "illinois", "IN" = "indiana", "IA" = "iowa", 
    "KS" = "kansas", "KY" = "kentucky", "LA" = "louisiana", "ME" = "maine", "MD" = "maryland",
    "MA" = "massachusetts", "MI" = "michigan", "MN" = "minnesota", "MS" = "mississippi", "MO" = "missouri", 
    "MT" = "montana", "NE" = "nebraska", "NV" = "nevada", "NH" = "new hampshire", "NJ" = "new jersey", 
    "NM" = "new mexico", "NY" = "new york", "NC" = "north carolina", "ND" = "north dakota", "OH" = "ohio", 
    "OK" = "oklahoma", "OR" = "oregon", "PA" = "pennsylvania", "RI" = "rhode island", "SC" = "south carolina", 
    "SD" = "south dakota", "TN" = "tennessee", "TX" = "texas", "UT" = "utah", "VT" = "vermont", 
    "VA" = "virginia", "WA" = "washington", "WV" = "west virginia", "WI" = "wisconsin", "WY" = "wyoming"
  ))

# Map showing average high and low forecasting error in each state
states <- map_data("state")

ggplot(weather2) +
  geom_map(
    aes(map_id = state_full, fill = average_error), color = "white",
    map = states
  ) +
  scale_fill_distiller(palette = "Spectral", limits = c(-0.5, NA)) +
  expand_limits(x = states$long, y = states$lat) +
  coord_map() +
  facet_wrap(~high_or_low) +
  # A complete theme resets earlier theme() settings, so apply it first
  # and set the legend position afterwards
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(
    title = "Forecasting Error in each US State",
    subtitle = "(Observed - Forecasted Temperature)",
    fill = "Error (°F)",
    x = NULL,
    y = NULL
  )

# Find the states with greatest high error
high_error_states <- weather2 %>%
  filter(high_or_low == "high") %>%  
  arrange(desc(average_error))     

# Find the states with greatest low error
low_error_states <- weather2 %>%
  filter(high_or_low == "low") %>%   
  arrange(desc(average_error))   

# Print the results
high_error_states

## # A tibble: 53 × 4
## # Groups:   state.x [53]
##    state.x high_or_low average_error state_full   
##    <chr>   <chr>               <dbl> <chr>        
##  1 AK      high                0.972 alaska       
##  2 WI      high                0.944 wisconsin    
##  3 MA      high                0.770 massachusetts
##  4 MS      high                0.657 mississippi  
##  5 VT      high                0.640 vermont      
##  6 ID      high                0.624 idaho        
##  7 OR      high                0.532 oregon       
##  8 NJ      high                0.492 new jersey   
##  9 NH      high                0.491 new hampshire
## 10 PA      high                0.465 pennsylvania 
## # ℹ 43 more rows

low_error_states

## # A tibble: 53 × 4
## # Groups:   state.x [53]
##    state.x high_or_low average_error state_full   
##    <chr>   <chr>               <dbl> <chr>        
##  1 HI      low                  2.31 hawaii       
##  2 NV      low                  1.95 nevada       
##  3 MT      low                  1.87 montana      
##  4 MA      low                  1.77 massachusetts
##  5 AK      low                  1.75 alaska       
##  6 OR      low                  1.70 oregon       
##  7 NM      low                  1.50 new mexico   
##  8 AZ      low                  1.34 arizona      
##  9 WA      low                  1.27 washington   
## 10 CO      low                  1.22 colorado     
## # ℹ 43 more rows

# high and low states forecasting error plots 
p1 <- weather2 %>%
  filter(state.x %in% c("AK", "WI", "MA", "MS", "VT", "ID"), high_or_low == "high") %>%
  ggplot(aes(x = state.x, y = average_error, fill = high_or_low)) +
  geom_col() +
  coord_flip() +  
  theme_minimal() +
  labs(title = "High", x = NULL, y = "Error (°F)") +
  scale_fill_manual(values = c("high" = "lightblue")) +
  theme(legend.position = "none") +
  ylim(0, 3) +
  scale_x_discrete(labels = c("AK" = "Alaska", "WI" = "Wisconsin", "MA" = "Massachusetts", 
                              "MS" = "Mississippi", "VT" = "Vermont", "ID" = "Idaho"))

p2 <- weather2 %>%
  filter(state.x %in% c("HI", "NV", "MT", "MA", "AK", "OR"), high_or_low == "low") %>%
  ggplot(aes(x = state.x, y = average_error, fill = high_or_low)) +
  geom_col() +
  coord_flip() +  
  theme_minimal() +
  labs(title = "Low", x = NULL, y = "Error (°F)") +
  scale_fill_manual(values = c("low" = "steelblue")) +
  theme(legend.position = "none") +
  ylim(0, 3) +
  scale_x_discrete(labels = c("HI" = "Hawaii", "NV" = "Nevada", "MT" = "Montana", 
                              "MA" = "Massachusetts", "AK" = "Alaska", "OR" = "Oregon"))

p1 + p2 +
  plot_annotation(title = "States with the Greatest Temperature Forecasting Error", subtitle = "Grouped by high and low temperature")

# Box plots comparing the average annual precipitation, elevation, and distance from a coast of states with the greatest high and low temperature forecasting errors 

# PRECIP
# States with greatest high and low temperature errors by their initials
high_error_states_initials <- c("AK", "WI", "MA", "MS", "VT", "ID")
low_error_states_initials <- c("HI", "NV", "MT", "MA", "AK", "OR")

# Filter the weather data for the selected states based on their initials
weather_filtered <- weather %>%
  filter(state.x %in% c(high_error_states_initials, low_error_states_initials))

# Calculate the average PRECIP for each state
weather_precip <- weather_filtered %>%
  group_by(state.x) %>%
  summarise(state_precip = mean(avg_annual_precip, na.rm = TRUE))

# Add a new variable for the error group (high or low error)
weather_precip <- weather_precip %>%
  mutate(error_group = case_when(
    state.x %in% high_error_states_initials ~ "High",
    state.x %in% low_error_states_initials ~ "Low"
  ))

# Plot the comparison
p1 <- ggplot(weather_precip, aes(x = error_group, y = state_precip, fill = error_group)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Precipitation",
    x = NULL,
    y = "Average Annual Precipitation (inches)"
  ) +
  scale_fill_manual(values = c("High" = "lightblue", "Low" = "steelblue"))

# ELEVATION
# Calculate the average ELEVATION for each state
weather_elev <- weather_filtered %>%
  group_by(state.x) %>%
  summarise(state_elev = mean(elevation, na.rm = TRUE))

# Add a new variable for the error group (high or low error)
weather_elev <- weather_elev %>%
  mutate(error_group = case_when(
    state.x %in% high_error_states_initials ~ "High",
    state.x %in% low_error_states_initials ~ "Low"
  ))

# Plot the comparison
p2 <- ggplot(weather_elev, aes(x = error_group, y = state_elev, fill = error_group)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Elevation",
    x = NULL,
    y = "Average Elevation"
  ) +
  scale_fill_manual(values = c("High" = "lightblue", "Low" = "steelblue"))

# DISTANCE TO A COAST
# Calculate the average DISTANCE TO A COAST for each state
weather_coast <- weather_filtered %>%
  group_by(state.x) %>%
  summarise(state_coast = mean(distance_to_coast, na.rm = TRUE))

# Add a new variable for the error group (high or low error)
weather_coast <- weather_coast %>%
  mutate(error_group = case_when(
    state.x %in% high_error_states_initials ~ "High",
    state.x %in% low_error_states_initials ~ "Low"
  ))

# Plot the comparison
p3 <- ggplot(weather_coast, aes(x = error_group, y = state_coast, fill = error_group)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Distance to a Coast",
    x = NULL,
    y = "Average Distance to a Coast"
  ) +
  scale_fill_manual(values = c("High" = "lightblue", "Low" = "steelblue"))

p1 + p2 + p3 +
  plot_annotation(title = "Possible Reasons for Forecasting Errors", subtitle = "For states with the greatest high and low temperature forecasting errors")

# high and low states into their own datasets
weather_high <- weather_filtered %>%
    filter(state.x %in% c("AK", "WI", "MA", "MS", "VT", "ID"))

weather_low <- weather_filtered %>%
    filter(state.x %in% c("HI", "NV", "MT", "MA", "AK", "OR"))

# Get the top 5 most common meanings for high error states
high_error_meanings <- weather_high %>%
  group_by(meaning) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice_head(n = 5)  # Select top 5

# Get the top 5 most common meanings for low error states
low_error_meanings <- weather_low %>%
  group_by(meaning) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice_head(n = 5)  # Select top 5

# Plot for high error states
p1 <- ggplot(high_error_meanings, aes(x = reorder(meaning, count), y = count)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +  # Flip for better readability
  theme_minimal() +
  labs(
    title = "High Error States",
    x = "Meaning",
    y = "Count"
  ) +
  theme(legend.position = "none")

# Plot for low error states
p2 <- ggplot(low_error_meanings, aes(x = reorder(meaning, count), y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip for better readability
  theme_minimal() +
  labs(
    title = "Low Error States",
    x = "Meaning",
    y = "Count"
  ) +
  theme(legend.position = "none")

p1 + p2 +
  plot_annotation(title = "Top 5 Most Common Forecast Outlooks")


Lia Salomon