library(tidyverse)
library(readr)
library(patchwork)
library(ggthemes)
library(maps)
library(mapproj)
library(usdata)
cities <- read_csv("data/forecast_cities.csv")
weather <- read_csv("data/weather_forecasts.csv")
outlook <- read_csv("data/outlook_meanings.csv")
states <- map_data("state")

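forecast <- cities %>%
  left_join(weather, by = c("city", "state")) %>%
  mutate(
    # Full lowercase state names, to match the region names in map_data("state")
    state = tolower(abbr2state(state)),
    # Positive values mean the observed temperature was above the forecast
    dif = observed_temp - forecast_temp
  )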
forecast %>%
  group_by(state) %>%
  summarize(
    mean_dif = mean(dif, na.rm = TRUE)
  ) %>%
  ggplot(aes(map_id = state, fill = mean_dif)) +
  geom_map(map = states) +
  scale_fill_gradient2(low = "darkgreen", high = "blue", mid = "white", midpoint = 0, 
                       name = "Difference in Temperature") +
  expand_limits(x = states$long, y = states$lat) +
  coord_map() +
  theme_map() +
  labs(
    title = "Discrepancies in Temperature Prediction in the US"
  ) +
  theme(legend.position = "right")

forecast %>%
  group_by(state) %>%
  summarize(
    mean_dif = mean(dif, na.rm = TRUE)
  ) %>%
  slice_max(mean_dif, n=6)
## # A tibble: 6 × 2
##   state         mean_dif
##   <chr>            <dbl>
## 1 alaska           1.36 
## 2 massachusetts    1.27 
## 3 oregon           1.14 
## 4 hawaii           1.13 
## 5 montana          1.09 
## 6 nevada           0.992
forecast %>%
  filter(!is.na(high_or_low)) %>%
  ggplot(aes(x = high_or_low, y = dif)) +
  geom_boxplot() +
  labs(
    title = "Weather prediction by high or low temperatures",
    x = "High or Low",
    y = "Difference"
  )

ggplot(forecast, aes(x = avg_annual_precip, y = dif)) + 
  geom_point(alpha = 0.5, color = "darkmagenta") +
  labs(title = "Precipitation vs. Difference",
     x = "Average annual precipitation",
     y = "Difference") 

worst_states <- c("oregon", "montana", "nevada", "alaska", "massachusetts", 
                  "hawaii")
new_forecast <- forecast %>%
  filter(state %in% worst_states)

#scatterplot
ggplot(new_forecast, aes(x = avg_annual_precip, y = dif)) + 
  geom_point(alpha = 0.5, color = "darkmagenta") + geom_smooth() +
  labs(title = "Precipitation vs. Difference for states with highest difference",
     x = "Average annual precipitation",
     y = "Difference") 

#bar chart
ggplot(new_forecast, aes(x = avg_annual_precip, y = dif, fill = avg_annual_precip)) + 
  geom_bar(stat = "summary", fun = "mean", width = 10, alpha = 0.5)  +
  scale_fill_viridis_b() +
  labs(title = "Precipitation bar chart for states with highest difference",
       x = "Average annual precipitation",
       y = "Difference",
       fill = "Average annual precipitation")

ggplot(new_forecast, aes(x = state, y = forecast_outlook, fill = avg_annual_precip)) +
  geom_tile() + 
  scale_fill_gradient(low = "white", high = "darkmagenta") +
  labs(
    title = "Heat map of temperature predictions",
    x = "State",
    y = "Forecast Outlook",
    fill = "Avg. Annual Precip."
  )

This data set describes predicted and observed temperatures in 167 cities across the US. The weather data set contains the forecasted and observed temperatures for each city, along with other variables about each forecast. The cities data set provides information about each city in the weather data set. I joined the two data sets on city and state to create one combined data set.
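The join can be sketched on toy data as follows. The tables `cities_toy` and `weather_toy` and all of their values are made up for illustration and are not rows from the real files. The `anti_join()` check is an extra step I am assuming would be useful, not part of the analysis above.

```r
library(dplyr)

# Toy stand-ins for the cities and weather tables (all values are invented).
cities_toy <- tibble(
  city  = c("BOSTON", "PORTLAND", "RENO"),
  state = c("MA", "OR", "NV"),
  avg_annual_precip = c(43.8, 36.0, 7.4)
)
weather_toy <- tibble(
  city  = c("BOSTON", "BOSTON", "PORTLAND"),
  state = c("MA", "MA", "OR"),
  forecast_temp = c(50, 41, 60),
  observed_temp = c(52, 40, 61)
)

# left_join() keeps every city and repeats its attributes once per forecast row.
merged_toy <- left_join(cities_toy, weather_toy, by = c("city", "state"))

# anti_join() lists the cities that have no forecast rows at all.
unmatched_toy <- anti_join(cities_toy, weather_toy, by = c("city", "state"))
```

Cities without any forecasts still appear in the merged table, with missing temperatures, which is why the later summaries use `na.rm = TRUE`.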

To get an overall picture of the differences between observed and predicted temperatures, I created a choropleth map of the United States in which the color of each state represents its mean difference. I computed the difference by subtracting forecast_temp from observed_temp, so positive values mean the observed temperature was higher than forecast, and then averaged it within each state. Using the summarize() and slice_max() functions, I found that Alaska, Massachusetts, Oregon, Hawaii, Montana, and Nevada have the largest mean differences, meaning their temperatures are under-predicted the most on average. This map provides an overview of temperature prediction in the US by state, which is useful to understand before exploring possible reasons for these differences.
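Because the difference is signed, ranking by its mean finds the strongest positive bias, which is not always the largest error. The sketch below shows the distinction on invented values. The `mean_abs_dif` column is a hypothetical addition for comparison and is not used in the analysis above.

```r
library(dplyr)

# Toy per-row differences (observed minus forecast); values are invented.
toy <- tibble(
  state = c("a", "a", "b", "b", "c", "c"),
  dif   = c(2, 1, -4, -3, 0.5, -0.5)
)

by_state <- toy %>%
  group_by(state) %>%
  summarize(
    mean_dif     = mean(dif, na.rm = TRUE),      # signed bias
    mean_abs_dif = mean(abs(dif), na.rm = TRUE)  # typical error size
  )

# Largest positive bias: this is what ranking by mean_dif selects.
top_signed <- slice_max(by_state, mean_dif, n = 1)

# Largest error regardless of direction.
top_abs <- slice_max(by_state, mean_abs_dif, n = 1)
```

In this toy example the two rankings disagree: state "a" has the largest positive bias, while state "b" has the largest error in absolute terms because its forecasts run consistently too warm.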

Next, to see whether accuracy depends on the type of forecast, I created a box plot with forecasted high and low temperatures on the x axis and the difference between observed and forecasted temperatures on the y axis. Based on this box plot, whether a forecast is for the high or the low temperature does not have a large effect on forecast accuracy. There are a few data points in the high temperature group with much greater differences, but since there are only three, they can be treated as outliers that do not change the overall pattern. The same is true for low temperatures, where two or three data points fall at the low end of the difference measure. These data points can also be treated as outliers since there are so few. Based on these results, the high or low category is not strongly associated with the ability to forecast temperature.
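The visual comparison can be backed up with a small numeric summary per group. This is a sketch on invented values, assuming the same `high_or_low` and `dif` column names as the real data.

```r
library(dplyr)

# Toy differences for each forecast type; values are invented.
toy <- tibble(
  high_or_low = c("high", "high", "high", "low", "low", "low"),
  dif         = c(-1, 0, 1, -2, 0, 2)
)

# The median and IQR are the quantities a box plot draws, so they give
# a numeric version of the comparison between the two boxes.
spread_by_type <- toy %>%
  group_by(high_or_low) %>%
  summarize(
    median_dif = median(dif),
    iqr_dif    = IQR(dif),
    n          = n()
  )
```

Similar medians and IQRs across the two groups would support the reading that the forecast type matters little.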

Next, I made a scatterplot to visualize the relationship between average annual precipitation and temperature prediction difference. I chose the precipitation variable because it is related to temperature and differs across regions of the US. For the most part, the data points are clustered around y = 0, except for a group of points with a difference greater than 50 where average annual precipitation is between 20 and 40 inches. This range of average annual precipitation could be linked to problems with weather prediction, specifically with under-predicting temperatures. To target only the regions where temperature prediction is an issue, I made a second scatterplot that includes only Alaska, Massachusetts, Oregon, Hawaii, Montana, and Nevada, because these six states had the highest mean difference. Here, average annual precipitation around 15 inches is associated with forecasts that are lower than the observed temperature. For the rest of the precipitation values on the x axis, the difference is fairly consistent. Including both scatterplots makes it possible to compare the entire data set with only the regions that have the largest prediction issues.

The bar chart further reflects the relationship between difference and average annual precipitation for the six states only. The height of each bar is the mean difference at that precipitation value. This plot shows that lower average annual precipitation goes with greater difference values, meaning more predictions that underestimate the temperature. This matches the results from the scatterplot.
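One way to check this pattern numerically is to bin precipitation and average the difference within each bin. The sketch below uses invented values, and the break points (0, 20, 40, 60 inches) are arbitrary assumptions chosen only for illustration.

```r
library(dplyr)

# Toy city-level values; all numbers are invented.
toy <- tibble(
  avg_annual_precip = c(8, 12, 15, 18, 35, 42, 55),
  dif               = c(1, 3, 4, 3, 0, 1, 0)
)

# cut() assigns each row to a precipitation interval such as (0,20],
# and summarize() then averages the difference within each interval.
binned <- toy %>%
  mutate(precip_bin = cut(avg_annual_precip, breaks = c(0, 20, 40, 60))) %>%
  group_by(precip_bin) %>%
  summarize(mean_dif = mean(dif), n = n())
```

Reporting `n` alongside each mean shows how many rows support it, which matters when a bin holds only a handful of cities.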

This heat map is the element I included that we did not talk about in class. This plot type is available in ggplot2 through the geom_tile() function, which displays the relationship between x, y, and a third variable as a grid of colored tiles. By plotting states against forecast outlook, with each tile filled by average annual precipitation, we are able to see how these three factors are related. Some areas of the plot are empty because there are no data entries for those combinations of state and outlook. Massachusetts and Oregon have similar average annual precipitation, and their forecast outlooks also share similar values, which could contribute to their errors in temperature prediction. Thunderstorms, windy, sunny, rain showers, partly cloudy, mostly cloudy, and cloudy conditions appear for all six states, which could influence how well their temperatures are predicted. This plot is useful for seeing relationships between x and y, but in this case adding the average annual precipitation variable does not lead to much further analysis.
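A possible refinement, sketched here on invented values, is to aggregate before plotting. If several cities in a state share an outlook, geom_tile() draws their tiles on top of each other, so only one precipitation value is visible per cell. Summarizing first gives each cell a single, well-defined value. The names `tiles`, `mean_precip`, and `n_forecasts` are hypothetical, and the outlook labels are placeholders.

```r
library(dplyr)
library(ggplot2)

# Toy rows: two Oregon cities share the RAIN outlook; values are invented.
toy <- tibble(
  state             = c("oregon", "oregon", "oregon", "nevada"),
  forecast_outlook  = c("RAIN", "RAIN", "SUNNY", "SUNNY"),
  avg_annual_precip = c(36, 44, 36, 7)
)

# One row per state/outlook cell, with the mean precipitation and row count.
tiles <- toy %>%
  group_by(state, forecast_outlook) %>%
  summarize(
    mean_precip = mean(avg_annual_precip),
    n_forecasts = n(),
    .groups = "drop"
  )

# Same heat map layout as above, but each tile now has exactly one value.
heat <- ggplot(tiles, aes(x = state, y = forecast_outlook, fill = mean_precip)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkmagenta")
```

Mapping `n_forecasts` to fill instead would show how often each outlook occurs in each state, which may be more informative than precipitation for this layout.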

These findings suggest that temperature forecasting is worse in regions where average annual precipitation is around 15 inches. Whether a forecast is for the high or the low temperature says less about forecast accuracy than average annual precipitation does. The entire data set, which includes information about 167 cities, is hard to visualize because there are over 600,000 data entries. By making a new data set that includes only values from the six states with the highest mean difference between observed and forecast temperature, it is possible to visualize these relationships on a smaller scale.