In this portfolio project, I analyze weather forecast data alongside
observed temperatures.
The main goal is to assess forecast accuracy and explore potential
factors that influence prediction errors.
This analysis utilizes three datasets from the National Weather
Service, covering sixteen months of forecasts and observations across
167 cities.
Additionally, a supplementary dataset provides geographic and
climate-related information for these cities and other locations in the
U.S.
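As a rough illustration of the setup, the sketch below shows one way forecast records could be combined with the city table and an absolute temperature error derived from them. The toy tables, their values, and the column names `forecast_temp` and `observed_temp` are assumptions made for this example; the real data may define `temp_error` differently.

```r
library(dplyr)

# Toy stand-ins for the real tables (all values are made up for illustration).
forecasts_toy <- data.frame(
  city = c("A", "A", "B"),
  state = c("XX", "XX", "YY"),
  forecast_hours_before = c(12, 48, 12),
  forecast_temp = c(70, 66, 41),   # assumed column name
  observed_temp = c(71, 71, 40)    # assumed column name
)
cities_toy <- data.frame(
  city = c("A", "B"),
  state = c("XX", "YY"),
  elevation = c(10, 1600),
  distance_to_coast = c(5, 900)
)

# Attach the geographic columns to every forecast row, then compute the
# size of the miss regardless of direction.
weather_toy <- forecasts_toy %>%
  left_join(cities_toy, by = c("city", "state")) %>%
  mutate(temp_error = abs(forecast_temp - observed_temp))
```

Using the absolute difference means that over- and under-predictions do not cancel out when errors are averaged by state, city, or lead time.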
I first grouped the data by state to see whether some states have larger
temperature forecast errors than others.
I observed that the highest temperature errors occur in states
located in the northern part of the Southern U.S.
Therefore, I created a graph to examine the relationship between region
and temperature errors.
A noticeable trend is that temperature errors tend to be higher in
Midwestern and northwestern states.
Then, I tried to find the reason why those states tend to have larger
forecast errors.
I discovered that forecast error tends to increase with higher
elevation and greater distance from the coast.
This observation offers a plausible explanation for why Midwestern and
northwestern states exhibit larger forecast errors.
Cities in the Midwest are located farther from the coasts, while
northern states like Alaska have higher elevations.
As a result, forecasts in these regions tend to be less accurate.
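One way to put a number on these relationships is a rank correlation at the city level. The sketch below uses made-up values in a hypothetical `city_errors_toy` table; it is not the computation behind the figures in this report.

```r
library(dplyr)

# Made-up city-level summaries, for illustration only.
city_errors_toy <- data.frame(
  city = c("A", "B", "C", "D", "E"),
  elevation = c(5, 200, 650, 1200, 1600),
  distance_to_coast = c(2, 150, 400, 700, 950),
  avg_temp_error = c(1.8, 2.4, 2.1, 2.9, 3.3)
)

# Spearman correlation only assumes a monotonic relationship,
# not a straight-line one.
geo_cor <- city_errors_toy %>%
  summarise(
    cor_elevation = cor(elevation, avg_temp_error, method = "spearman"),
    cor_coast = cor(distance_to_coast, avg_temp_error, method = "spearman")
  )
```

A positive coefficient would be consistent with the pattern described above. Because elevation and distance from the coast tend to rise together, a correlation like this cannot separate the two effects on its own.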
This led me to consider whether there are other factors, unrelated to
location, that might influence forecast accuracy.
I identified a variable called forecast_hours_before, which measures the
time gap between the forecast and the actual observation.
This variable could potentially play a significant role in determining
forecast precision.
The graph demonstrates that as the time gap (forecast_hours_before)
increases, the temperature error also tends to rise.
This suggests that forecasts made further in advance are less accurate
compared to those made closer to the actual observation time.
Additionally, I identified another variable, forecast_outlook, which
represents the general weather outlook, such as the type of
precipitation.
This variable could also play a role in influencing forecast
accuracy.
It appears that the types of outlooks (forecast_outlook) also have a
significant impact on forecast accuracy.
This suggests that certain weather conditions may make forecasts less
precise.
Overall, the temperature forecast is quite accurate, with an average error of 2-4°F. This suggests that, in general, predictions are reliable.
However, forecast accuracy varies by location. Areas with higher elevations and those farther from coastlines tend to have larger errors.
Additionally, forecast accuracy improves as the time between forecast
and observation decreases.
Short-term forecasts (12-24 hours) are more precise, while longer-term
forecasts (36-48 hours) are less accurate.
Finally, the type of weather outlook significantly impacts forecast
accuracy.
Forecasts for storms and extreme weather events tend to have higher
errors.
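library(tidyverse)  # dplyr, tidyr, forcats, ggplot2
library(patchwork)  # needed to combine p1 + p2 in the last plot
weather_data %>%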
drop_na(temp_error) %>%
group_by(state) %>%
summarise(avg_temp_error = mean(temp_error, na.rm = TRUE)) %>%
ungroup() %>%
mutate(state = fct_lump(state, n = 10, w = avg_temp_error)) %>%
filter(state != "Other") %>%
ggplot() +
geom_col(aes(x = reorder(state, avg_temp_error), y = avg_temp_error), fill = "orange", color = "black") +
coord_flip() +
labs(title = "Temperature Forecast Errors based on States",
x = "State",
y = "Average Temperature Error") +
theme_minimal()
states <- map_data("state")
weather_data %>%
group_by(city, state, lat, lon) %>%
summarise(avg_temp_error = mean(temp_error, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = lon, y = lat, color = avg_temp_error)) +
geom_polygon(data = states, aes(x = long, y = lat, group = group), color = "lightgrey", fill = NA) +
geom_point(size = 3) +
scale_color_viridis_c(
direction = -1
) +
labs(title = "Geographic Distribution of Forecast Errors",
x = "",
y = "",
color = "Average Temp Error") +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank()
)
forecast_error_by_time <- weather_data %>%
group_by(forecast_hours_before) %>%
summarise(avg_error = mean(temp_error, na.rm = TRUE))
ggplot(forecast_error_by_time, aes(x = forecast_hours_before, y = avg_error)) +
geom_line(linewidth = 1.2, alpha = 0.6) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
geom_point(size = 4) +
labs(title = "Forecast Error vs. Time Lag Between Forecast and Observation",
x = "Hours Before Observation",
y = "Average Temperature Error") +
theme_minimal() +
scale_x_continuous(breaks = c(12,24,36,48))
weather_data %>%
drop_na(temp_error) %>%
group_by(forecast_hours_before) %>%
summarise(avg_temp_error = mean(temp_error, na.rm = TRUE)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = reorder(forecast_hours_before, avg_temp_error), y = avg_temp_error), fill = "orange", color = "black") +
coord_flip() +
labs(title = "Temperature Forecast Errors based on Hours Before Observation",
x = "Hours before the observation",
y = "Average Temperature Error") +
theme_minimal()
Note: This is another type of graph that shows the relationship between hours before the observation and the temperature error.
weather_data %>%
drop_na(temp_error) %>%
filter(temp_error <= 3) %>%
ggplot(aes(y = temp_error)) +
geom_boxplot() +
labs(title = "Temperature Forecast Errors",
x = "Hours before the observation",
y = "Temperature Error") +
theme_minimal() +
scale_y_continuous(limits = c(0, 5)) +
facet_wrap(~ forecast_hours_before)
Note: I initially expected a box plot to reveal the relationship between hours before observation and temperature error. However, this graph shows that relationship less clearly than the two graphs above. One possible reason is the presence of large temperature errors (60+), which are difficult to represent clearly in a box plot. For these reasons, I decided to include this graph in the Appendix.
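One possible way to keep extreme values in the data while still getting a readable box plot is to zoom the axis with `coord_cartesian()` rather than filtering rows or setting scale limits. The sketch below uses made-up values in a hypothetical `toy_errors` table; it is not the code behind the figures in this report.

```r
library(ggplot2)

# Made-up errors with one extreme value (60), for illustration only.
toy_errors <- data.frame(
  forecast_hours_before = rep(c(12, 24, 36, 48), each = 5),
  temp_error = c(0, 1, 1, 2, 3,
                 1, 1, 2, 3, 4,
                 1, 2, 2, 3, 60,
                 1, 2, 3, 4, 5)
)

p_box <- ggplot(toy_errors,
                aes(x = factor(forecast_hours_before), y = temp_error)) +
  geom_boxplot() +
  # Zoom the visible window only; every row still feeds the box statistics.
  coord_cartesian(ylim = c(0, 10)) +
  labs(x = "Hours before the observation", y = "Temperature Error") +
  theme_minimal()
```

Unlike `filter()` or `scale_y_continuous(limits = ...)`, which discard out-of-range values before the box statistics are computed, `coord_cartesian()` changes only what is drawn. Putting the lead time on the x-axis also places the four boxes side by side, which may make them easier to compare than separate facets.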
weather_data %>%
drop_na(forecast_hours_before) %>%
group_by(state) %>%
summarise(avg_forecast_hours_before = mean(forecast_hours_before, na.rm = TRUE)) %>%
ungroup() %>%
filter(state != "Other") %>%
ggplot() +
geom_col(aes(x = reorder(state, avg_forecast_hours_before), y = avg_forecast_hours_before), fill = "orange", color = "black") +
coord_flip() +
labs(title = "Average Hours Before Observation by State",
x = "State",
y = "Average hours before the observation") +
theme_minimal(base_size = 14) + # Increase base font size
theme(
axis.text.y = element_text(size = 6, margin = margin(r = 10)),
axis.title = element_text(size = 7)
)
Note: This graph shows that every state has the same average hours before observation, which is unrelated to the main focus of this report. However, it is evidence that regional differences in forecast accuracy are not driven by differences in hours before observation.
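Equal averages do not by themselves guarantee that the lead times are distributed the same way in every state. A more direct check, sketched below on a made-up `toy_lead_times` table, is to compare the share of rows at each lead time within each state.

```r
library(dplyr)

# Made-up rows: two hypothetical states, each with one forecast per lead time.
toy_lead_times <- data.frame(
  state = rep(c("XX", "YY"), each = 4),
  forecast_hours_before = rep(c(12, 24, 36, 48), times = 2)
)

# Share of each lead time within each state.
lead_time_share <- toy_lead_times %>%
  count(state, forecast_hours_before) %>%
  group_by(state) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```

If the shares match across states, differences in lead-time mix can be ruled out as a source of the regional differences in error.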
weather_data %>%
drop_na(temp_error) %>%
inner_join(outlook_meanings, by = "forecast_outlook") %>% # the join function we didn't learn in class :)
group_by(meaning) %>%
summarise(avg_temp_error = mean(temp_error, na.rm = TRUE)) %>%
ungroup() %>%
filter(!is.na(meaning), meaning != "NA") %>% # drop missing meanings, whether stored as NA or as the string "NA"
ggplot() +
geom_col(aes(x = reorder(meaning, avg_temp_error), y = avg_temp_error), fill = "orange", color = "black") +
coord_flip() +
labs(title = "Temperature Forecast Errors by Forecast Outlook",
x = "Forecast Outlook",
y = "Average Temperature Error") +
theme_minimal()
Note: I used inner_join, which we didn’t talk about in class.
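For readers who have not seen joins before, the sketch below contrasts `inner_join()` with `left_join()` on made-up data. The outlook codes and the `outlook_lookup_toy` table are hypothetical and only stand in for the real lookup table.

```r
library(dplyr)

# Made-up forecast rows; "XYZ" and NA have no entry in the lookup table.
forecasts_toy <- data.frame(
  forecast_outlook = c("SUNNY", "RAIN", NA, "XYZ"),
  temp_error = c(1, 3, 2, 4)
)
outlook_lookup_toy <- data.frame(
  forecast_outlook = c("SUNNY", "RAIN"),
  meaning = c("Sunny", "Rain")
)

# inner_join keeps only rows whose key appears in both tables.
inner <- inner_join(forecasts_toy, outlook_lookup_toy, by = "forecast_outlook")

# left_join keeps every forecast row and fills unmatched meanings with NA.
left <- left_join(forecasts_toy, outlook_lookup_toy, by = "forecast_outlook")
```

With `inner_join()`, forecasts whose outlook code is missing from the lookup table are silently dropped, so the averages are computed only over rows with a known meaning.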
elevation_vs_error <- weather_data %>%
group_by(elevation) %>%
summarise(avg_error = mean(temp_error, na.rm = TRUE))
p1 <- ggplot(elevation_vs_error, aes(x = elevation, y = avg_error)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(title = "Elevation vs. Forecast Error",
x = "Elevation", y = "Avg Temperature Error") +
theme_minimal()
coast_vs_error <- weather_data %>%
group_by(distance_to_coast) %>%
summarise(avg_error = mean(temp_error, na.rm = TRUE))
p2 <- ggplot(coast_vs_error, aes(x = distance_to_coast, y = avg_error)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Distance to Coast vs. Forecast Error",
x = "Distance to Coast", y = "Avg Temperature Error") +
theme_minimal()
p1 <- p1 + theme(plot.title = element_text(size = 12))
p2 <- p2 + theme(plot.title = element_text(size = 12))
(p1 + p2) +
plot_annotation(
title = "Effect of Geographic Factors on Forecast Error",
theme = theme(plot.title = element_text(size = 16, face = "bold"))
)