library(tidyverse)
library(patchwork)
library(ggthemes)
library(viridis)
forecast_cities <- read_csv("data/forecast_cities.csv")
outlook <- read_csv("data/outlook_meanings.csv")
weather_forecasts <- read_csv("data/weather_forecasts.csv")

This analysis uses two datasets. weather_forecasts is a large dataset from the National Weather Service covering sixteen months of weather data from 167 American cities, including both the forecasted and the observed temperature. forecast_cities describes the cities where those temperatures were recorded and provides geographical information such as elevation, distance from the coast, and wind. I joined the two datasets so that each row of weather_forecasts carries the city-level attributes from forecast_cities, which proved central to the analysis.

weather_forecasts <- weather_forecasts %>% 
  left_join(forecast_cities, by = c("city", "state"))
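
As a quick sanity check on a join like this one, the sketch below uses small made-up tibbles (the names `forecasts_toy` and `cities_toy` and all of their values are hypothetical, not the real data). It shows one way to confirm that a left join leaves the row count unchanged, and how `anti_join()` can list forecasts whose city has no match in the city table.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for weather_forecasts and forecast_cities
forecasts_toy <- tibble(
  city = c("DENVER", "DENVER", "MIAMI", "NOWHERE"),
  state = c("CO", "CO", "FL", "ZZ"),
  observed_temp = c(50, 55, 80, 60)
)
cities_toy <- tibble(
  city = c("DENVER", "MIAMI"),
  state = c("CO", "FL"),
  elevation = c(1609, 2)
)

joined_toy <- forecasts_toy %>%
  left_join(cities_toy, by = c("city", "state"))

# A left join against a table with unique keys should not add or drop rows
same_rows <- nrow(joined_toy) == nrow(forecasts_toy)

# Forecast rows whose (city, state) pair is missing from the city table
unmatched <- forecasts_toy %>%
  anti_join(cities_toy, by = c("city", "state")) %>%
  distinct(city, state)
```

If `same_rows` were FALSE on the real data, that would suggest duplicate (city, state) keys in the city table; any rows in `unmatched` would end up with missing geographic attributes after the join.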

My analysis began with some initial exploration of the data: a scatterplot of the forecasted temperature (forecast_temp) against the observed temperature (observed_temp). The plot includes a reference line with a slope of one and an intercept of zero, so points on the line are exact forecasts and the vertical distance from the line is the forecast error. The plot indicates that these errors are relatively consistent across temperature ranges.

weather_forecasts %>% 
  drop_na(forecast_temp) %>% 
  ggplot(aes(x = observed_temp, y = forecast_temp)) +
      geom_point(size = 0.2, alpha = 0.3) +
      geom_abline(slope = 1, intercept = 0, color = "darkgreen") +
      labs(
        x = "observed temp",
        y = "forecasted temp",
        title = "Observed vs Forecasted Temp",
        subtitle = "the green line represents an accurate forecast"
      )

To investigate further, I created a difference variable: the absolute value of the difference between the observed and forecasted temperatures. The average difference varies noticeably between states, from 3.23 in Montana downward, but no geographic pattern was apparent from a table alone.

weather_forecasts <- weather_forecasts %>% 
  drop_na(forecast_temp) %>% 
  # absolute forecast error, regardless of direction
  mutate(difference = abs(observed_temp - forecast_temp))

weather_forecasts %>% 
  group_by(state) %>% 
  summarize(avg_dif = mean(difference, na.rm = TRUE)) %>% 
  arrange(desc(avg_dif))
## # A tibble: 53 × 2
##    state avg_dif
##    <chr>   <dbl>
##  1 MT       3.23
##  2 AK       3.18
##  3 ND       2.89
##  4 SD       2.87
##  5 CO       2.84
##  6 NE       2.82
##  7 WV       2.70
##  8 WY       2.70
##  9 ID       2.67
## 10 NH       2.67
## # ℹ 43 more rows

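Note: relocate() is a dplyr function that we did not discuss in class. Here it moves the new difference column so that it sits directly after forecast_hours_before.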

weather_forecasts <- weather_forecasts %>% 
  relocate(difference, .after = forecast_hours_before)

To make any patterns more apparent, I created a map of the average difference in each contiguous state. The map suggests a concentration of higher average differences in inland states in the western U.S., with a few exceptions. This geographic concentration led me to consider a number of geographic variables for analysis.

states <- map_data("state")

#install.packages("usmap")
#library(usmap) 

weather_forecasts <- weather_forecasts %>% 
  mutate(
    full_state_name = case_when(
      state == "AL" ~ "alabama",
      state == "AK" ~ "alaska",
      state == "AZ" ~ "arizona",
      state == "AR" ~ "arkansas",
      state == "CA" ~ "california",
      state == "CO" ~ "colorado",
      state == "CT" ~ "connecticut",
      state == "DE" ~ "delaware",
      state == "FL" ~ "florida",
      state == "GA" ~ "georgia",
      state == "HI" ~ "hawaii",
      state == "ID" ~ "idaho",
      state == "IL" ~ "illinois",
      state == "IN" ~ "indiana",
      state == "IA" ~ "iowa",
      state == "KS" ~ "kansas",
      state == "KY" ~ "kentucky",
      state == "LA" ~ "louisiana",
      state == "ME" ~ "maine",
      state == "MD" ~ "maryland",
      state == "MA" ~ "massachusetts",
      state == "MI" ~ "michigan",
      state == "MN" ~ "minnesota",
      state == "MS" ~ "mississippi",
      state == "MO" ~ "missouri",
      state == "MT" ~ "montana",
      state == "NE" ~ "nebraska",
      state == "NV" ~ "nevada",
      state == "NH" ~ "new hampshire",
      state == "NJ" ~ "new jersey",
      state == "NM" ~ "new mexico",
      state == "NY" ~ "new york",
      state == "NC" ~ "north carolina",
      state == "ND" ~ "north dakota",
      state == "OH" ~ "ohio",
      state == "OK" ~ "oklahoma",
      state == "OR" ~ "oregon",
      state == "PA" ~ "pennsylvania",
      state == "RI" ~ "rhode island",
      state == "SC" ~ "south carolina",
      state == "SD" ~ "south dakota",
      state == "TN" ~ "tennessee",
      state == "TX" ~ "texas",
      state == "UT" ~ "utah",
      state == "VT" ~ "vermont",
      state == "VA" ~ "virginia",
      state == "WA" ~ "washington",
      state == "WV" ~ "west virginia",
      state == "WI" ~ "wisconsin",
      state == "WY" ~ "wyoming"
    )
  )
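
The long case_when() above could also be written as a lookup against base R's built-in `state.abb` and `state.name` vectors. The sketch below is an alternative, not the code used for this report; the helper name `abbr_to_name` is hypothetical. Like a case_when() without a default, it returns NA for abbreviations outside the 50 states, such as DC or territories.

```r
# state.abb and state.name are built-in base R vectors covering the 50 states
abbr_to_name <- function(abbr) {
  # match() finds each abbreviation's position; unmatched values become NA
  tolower(state.name[match(abbr, state.abb)])
}

example_names <- abbr_to_name(c("MT", "NH", "DC"))
```

Inside the pipeline this would be used as `mutate(full_state_name = abbr_to_name(state))`, which produces the lowercase names that map_data("state") expects.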


map <- weather_forecasts %>% 
  group_by(full_state_name) %>% 
  summarize(avg_dif = mean(difference, na.rm = TRUE)) %>% 
  ggplot() +
    geom_map(
      aes(map_id = full_state_name, fill = avg_dif), 
      map = states
    ) +
    expand_limits(x = states$long, y = states$lat) +
    scale_fill_viridis(option = "plasma", direction = -1) +
    coord_map() +
    theme_map() +
    labs(
      title = "Average Diff. b/w Forecast and Actual Temp",
      fill = "Average Difference"
    )

map

Note: the geom_smooth() layer is also something we haven’t talked about in class. With method = 'lm' it draws a fitted linear trendline with a shaded confidence band.

Building on the map, I created a series of scatterplots comparing the average difference between forecasted and actual temperatures, this time by city rather than by state, against four variables: elevation, distance_to_coast, elevation_change_four, and elevation_change_eight. I also added a linear trendline to each plot to show how the average difference changes with each variable. The plots suggest that each of these variables has some association with forecast accuracy. Cities with greater elevation change around them tend to have less accurate forecasts, but all of the elevation-related trendlines have wide confidence intervals, shown as the shaded bands on the graphs. The distance-to-coast plot shows a similar trend, with a slightly narrower confidence band.

p1 <- weather_forecasts %>% # distance to coast 
  group_by(city, state) %>% 
  summarize(avg_dif = mean(difference, na.rm = TRUE),
            distance_to_coast = first(distance_to_coast)) %>% 
  ggplot(aes(x = distance_to_coast, y = avg_dif)) +
    geom_point() +
    geom_smooth(method = 'lm') + 
    labs(
      title = "Distance to Coast vs. \n Forecast Accuracy",
      x = "dist. to coast",
      y = "average difference"
    )

p2 <-  weather_forecasts %>% # elevation 
  group_by(city, state) %>% 
  summarize(avg_dif = mean(difference, na.rm = TRUE),
            elevation = first(elevation)) %>% 
  ggplot(aes(x = elevation, y = avg_dif)) +
    geom_point() +
    geom_smooth(method = 'lm') +
    labs(
      title = "Elevation of City vs. \n Forecast Accuracy",
      y = "average difference"
    )

p3 <- weather_forecasts %>% # elevation change, four closest points
  group_by(city, state) %>% 
  summarize(avg_dif = mean(difference, na.rm = TRUE),
            elevation_change_four = first(elevation_change_four)) %>% 
  ggplot(aes(x = elevation_change_four, y = avg_dif)) +
    geom_point() +
    geom_smooth(method = 'lm') +
    labs(
      title = "Greatest Elevation \n Change w/in  \n Four Closest Points",
      y = "average difference",
      x = "elevation change (four closest)"
    )

p4 <- weather_forecasts %>% # elevation change, eight closest points
  group_by(city, state) %>% 
  summarize(avg_dif = mean(difference, na.rm = TRUE),
            elevation_change_eight = first(elevation_change_eight)) %>% 
  ggplot(aes(x = elevation_change_eight, y = avg_dif)) +
    geom_point() + 
    geom_smooth(method = 'lm') +
    labs(
      title = "Greatest Elevation \n Change \n w/in Eight \n Closest Points",
      y = "average difference",
      x = "elevation change (eight closest)"
    )

p1 + p2

p2 + p3 + p4
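
The four panels repeat the same group-and-summarize step. As a rough sketch of an alternative, the city-level summary could be computed once and then reused, for example to attach a correlation coefficient to each trend. The `toy` tibble and its values below are hypothetical stand-ins for weather_forecasts, and using Pearson correlation is my assumption rather than part of the original analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; the real analysis would start from weather_forecasts
toy <- tibble(
  city = c("A", "A", "B", "B", "C", "C"),
  state = c("CO", "CO", "FL", "FL", "MT", "MT"),
  difference = c(3, 5, 1, 1, 4, 6),
  elevation = c(1600, 1600, 2, 2, 1000, 1000),
  distance_to_coast = c(900, 900, 1, 1, 600, 600)
)

# One row per city: mean absolute error plus the (constant) city attributes
city_summary <- toy %>%
  group_by(city, state) %>%
  summarize(
    avg_dif = mean(difference, na.rm = TRUE),
    across(c(elevation, distance_to_coast), first),
    .groups = "drop"
  )

# Pearson correlation of each attribute with the average difference
cors <- city_summary %>%
  summarize(
    across(
      c(elevation, distance_to_coast),
      ~ cor(.x, avg_dif, use = "complete.obs")
    )
  )
```

Each plot could then start from `city_summary` instead of repeating the aggregation, and `.groups = "drop"` avoids the grouping message that summarize() prints after a multi-column group_by().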

This analysis indicates that cities farther from the coast and with more surrounding elevation change tend to have less accurate weather forecasts. There is a geographic trend in forecast accuracy, which could be tied to any of these variables or to a combination of them.