knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(maps)
library(mapproj)
library(ggthemes)
library(basemaps)
library(sf)
states <- map_data("state")
cities <- read_csv("data/forecast_cities.csv")
outlook <- read_csv("data/outlook_meanings.csv")
#need to make state a factor so that I can later make a new variable called region
forecasts <- read_csv("data/weather_forecasts.csv",
  col_types = list(state = col_factor())
)
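#joining the data sets and creating a new variable for the residual of the forecast and actual temps
#(both data sets have a state column, so the join produces state.x and state.y)
forecasts <- forecasts %>%
  right_join(cities, join_by(city)) %>%
  mutate(
    #refer to the joined column directly; forecasts$forecast_temp would point at the pre-join data
    error = observed_temp - forecast_temp
  )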
#creating a variable for the different regions of the US
#state.x is the state column that came from the forecasts data set
forecasts <- forecasts %>%
  mutate(
    region = fct_collapse(state.x,
      Midwest = c("KS", "NE", "SD", "ND", "MN", "IA", "MO", "IL", "WI", "IN", "MI", "OH"),
      Northeast = c("PA", "NY", "MD", "DC", "DE", "NJ", "CT", "MA", "RI", "VT", "NH", "ME"),
      Southeast = c("WV", "VA", "KY", "TN", "NC", "AR", "SC", "GA", "AL", "MS", "LA", "FL"),
      Southwest = c("TX", "OK", "AZ", "NM"),
      West = c("CO", "UT", "WY", "MT", "ID", "WA", "OR", "CA", "NV"),
      Other = c("PR", "VI", "AK", "HI")
    )
  )
#na.omit() drops every row with a missing value in any column (including error),
#so I don't have to deal w/ NA values/issues
forecasts2 <- forecasts %>%
  na.omit() %>%
  filter(region != "Other")
#adding each city's mean error as a new column
forecasts2 <- forecasts2 %>%
  group_by(city) %>%
  mutate(mean_error = mean(error)) %>%
  ungroup()
#making an error contour map
#geom_density_2d() is a new function that we didn't learn in class
ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(color = "navy", fill = "lightblue") +
  coord_fixed() +
  theme_map() +
  scale_color_viridis_d() +
  geom_density_2d(data = forecasts2, aes(x = lon, y = lat, group = TRUE, colour = "mean error"), alpha = 0.8, show.legend = TRUE) +
  labs(title = "Contour map of the U.S., showing levels of\nerror in temperature forecasts")
As seen in the contour map, the data is concentrated in California and on the East Coast. The largest positive error is 116 degrees (the observed temperature was 116 degrees above the forecast), while the most negative error is -94 degrees (the observed temperature was 94 degrees below the forecast). To prepare the data, I joined the weather forecasts and forecast cities data sets so that I would be able to match latitude and longitude with the error for each forecast. I then created a region variable to separate the data into geographical regions of the U.S. The region “Other” represents Puerto Rico, the Virgin Islands, Hawaii, and Alaska, which I removed from the final data set because I wanted to focus on the continental United States. To create the contour map I used the geom_density_2d function, which draws contour lines from a two-dimensional density estimate of the points that you feed it, so the contours show where the forecast records are concentrated. I learned how to use this function from Alan Jackson's blog post on making contour maps in R (https://adelieresources.com/2022/10/making-contour-maps-in-r/). This was more challenging to create than I thought it would be, and I’m still not entirely sure how it works, but I’m proud that I was able to make a contour map.
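Since geom_density_2d contours the locations of the points rather than the error values themselves, one possible way to put the error on the map is to colour each city by its mean error. The following is only a minimal sketch under stated assumptions: `toy_forecasts` and all of its values are made up to stand in for forecasts2, and the object names `city_errors` and `error_map` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for forecasts2 (all values are made up)
toy_forecasts <- tibble(
  city  = c("A", "A", "B", "B", "C", "C"),
  lon   = c(-120, -120, -95, -95, -75, -75),
  lat   = c(37, 37, 40, 40, 41, 41),
  error = c(2, 4, -1, 1, 6, 8)
)

# Collapse to one row per city, keeping its coordinates and mean forecast error
city_errors <- toy_forecasts %>%
  group_by(city, lon, lat) %>%
  summarise(mean_error = mean(error), .groups = "drop")

# Colour each city by its mean error on a continuous viridis scale
error_map <- ggplot(city_errors, aes(x = lon, y = lat, colour = mean_error)) +
  geom_point(size = 3) +
  scale_colour_viridis_c() +
  coord_fixed() +
  labs(x = "Longitude", y = "Latitude", colour = "Mean error")

print(city_errors)
```

With the real data, the same layer could be added on top of the state polygons instead of being drawn on its own.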
ggplot(forecasts2, aes(x = elevation, y = error)) +
  #map colour to the region column itself rather than forecasts2$region
  geom_point(aes(colour = region)) +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(x = "Elevation (m)", y = "Error in Forecast (°F)", title = "Elevation vs. the difference between the observed\nand forecast temperature in the U.S.", colour = "Regions")
#making a linear model and getting summary stats
#lm() has no group argument, so region is not part of this model
forecast_model <- lm(error ~ elevation, data = forecasts2)
summary(forecast_model)
##
## Call:
## lm(formula = error ~ elevation, data = forecasts2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.386 -13.462 -0.312 13.541 111.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.484e+00 3.543e-02 126.57 <2e-16 ***
## elevation -5.933e-03 6.156e-05 -96.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.63 on 587517 degrees of freedom
## Multiple R-squared: 0.01557, Adjusted R-squared: 0.01556
## F-statistic: 9290 on 1 and 587517 DF, p-value: < 2.2e-16
The relationship between elevation and error in temperature forecasts throughout the U.S. is negative, with an intercept of 4.48. Because error is the observed temperature minus the forecast temperature, this means that at sea level the linear model predicts an observed temperature about 4.5 degrees Fahrenheit higher than the forecast, on average. Increasing the elevation by one meter is predicted to decrease the error of the temperature forecast by about 0.0059 degrees. Since elevation changes can be large, this is a larger change than it may seem initially: a gain of 1,000 meters corresponds to a predicted decrease of about 5.9 degrees. The p-value is less than \(2.2 \times 10^{-16}\), which is very low, so we reject the null hypothesis of no relationship between elevation and error in temperature forecasts. If there were no relationship, there would be less than a \(2.2 \times 10^{-14}\)% chance of seeing an F-statistic at least as extreme as the one we calculated from the data. At the same time, the R-squared of 0.0156 shows that elevation explains only about 1.6% of the variation in forecast error. I chose to create a linear model for the relationship between elevation and error in temperature forecasts because I was interested in creating a contour map for the relationship between elevation and the error in forecasts, but wanted a numeric interpretation as well.
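The arithmetic behind this kind of interpretation can be checked directly from the coefficients in the summary() output. This is a minimal sketch, with the intercept and slope copied by hand from that output; the elevations are arbitrary illustrative values and the variable names are hypothetical.

```r
# Coefficients copied by hand from the summary() output
intercept <- 4.484      # predicted error at an elevation of 0 m
slope <- -5.933e-03     # change in predicted error per meter of elevation

# Arbitrary illustrative elevations, in meters
elevation_m <- c(0, 500, 1000, 2000)

# Predicted error (observed minus forecast temperature) at each elevation
predicted_error <- intercept + slope * elevation_m

# The reported p-value is an upper bound; expressed as a percentage
p_upper <- 2.2e-16
p_upper_percent <- p_upper * 100

print(data.frame(elevation_m, predicted_error))
print(p_upper_percent)
```

Under this fitted line, the predicted error changes sign somewhere between 500 and 1,000 meters.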