1 Introduction

In this portfolio project, I analyze weather forecast data alongside observed temperatures.
The main goal is to assess forecast accuracy and explore potential factors that influence prediction errors.

2 Data Description

This analysis utilizes three datasets from the National Weather Service, covering sixteen months of forecasts and observations across 167 cities.
Additionally, a supplementary dataset provides geographic and climate-related information for these cities and other locations in the U.S.
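
The appendix code relies on a temp_error column whose definition is not stated in this report. The following is a minimal sketch of how such a column could be derived and combined with city-level information. It assumes hypothetical forecast_temp and observed_temp columns (in °F) and an absolute-difference definition of the error; the table names and all values are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for the forecast and city tables; every column name and value
# here is an illustrative assumption, not the actual data.
forecasts <- tibble(
  city          = c("A", "A", "B", "B"),
  state         = c("XX", "XX", "YY", "YY"),
  forecast_temp = c(70, 68, 40, 45),
  observed_temp = c(72, 68, 35, NA)
)

cities <- tibble(
  city              = c("A", "B"),
  state             = c("XX", "YY"),
  elevation         = c(10, 1500),
  distance_to_coast = c(5, 800)
)

weather_sketch <- forecasts %>%
  # Assumed definition: absolute difference between forecast and observation
  mutate(temp_error = abs(forecast_temp - observed_temp)) %>%
  # Attach geographic attributes to every forecast row
  left_join(cities, by = c("city", "state"))
```

Rows with a missing observation keep a missing temp_error, which is why the appendix code drops or ignores NA values before averaging.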

3 Analysis

I first grouped the data by state to see whether some states have larger temperature forecast errors than others.

I observed that the highest temperature errors occur in states located in the northern part of the Southern U.S.
Therefore, I created a graph to examine the relationship between region and temperature errors.

A noticeable trend is that temperature errors tend to be higher in Midwestern and northwestern states.
I then looked for reasons why those states tend to have larger forecast errors.

I discovered that forecast error tends to increase with higher elevation and greater distance from the coast.
This observation offers a plausible explanation for why Midwestern and northwestern states exhibit larger forecast errors.
Cities in the Midwest are located farther from the coasts, while northern states like Alaska have higher elevations.
As a result, forecasts in these regions tend to be less accurate.
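
Elevation and distance from the coast often go together, so it can help to look at both at once. The sketch below uses made-up city-level values (not the real data) to show one way of quantifying the two relationships: a rank correlation for each factor, and a linear model that includes both.

```r
library(dplyr)

# Toy city-level data; values are invented purely for illustration.
city_errors <- tibble(
  city              = c("A", "B", "C", "D", "E"),
  elevation         = c(10, 200, 800, 1500, 2000),
  distance_to_coast = c(5, 50, 400, 900, 600),
  avg_temp_error    = c(1.8, 2.0, 2.6, 3.1, 3.0)
)

# Rank-based (Spearman) correlations are less sensitive to a few extreme cities
cor_elevation <- cor(city_errors$elevation, city_errors$avg_temp_error,
                     method = "spearman")
cor_coast <- cor(city_errors$distance_to_coast, city_errors$avg_temp_error,
                 method = "spearman")

# A linear model with both predictors estimates each association
# while holding the other predictor constant
fit <- lm(avg_temp_error ~ elevation + distance_to_coast, data = city_errors)
coef(fit)
```

On the real data, the size and sign of the coefficients would indicate whether each factor still matters once the other is accounted for.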

This led me to consider whether there are other factors, unrelated to location, that might influence forecast accuracy.
I identified a variable called forecast_hours_before, which measures the time gap between the forecast and the actual observation.
This variable could potentially play a significant role in determining forecast precision.

The graph demonstrates that as the time gap (forecast_hours_before) increases, the temperature error also tends to rise.
This suggests that forecasts made further in advance are less accurate than those made closer to the actual observation time.
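
To judge whether the differences between lead times are larger than sampling noise, the averages can be reported together with a standard error. The following is a minimal sketch on made-up values; only the column names forecast_hours_before and temp_error come from this report.

```r
library(dplyr)

# Toy data: three forecasts per lead time, values invented for illustration
toy_lead_times <- tibble(
  forecast_hours_before = rep(c(12, 24, 36, 48), each = 3),
  temp_error            = c(1, 2, 3,  2, 2, 3,  2, 3, 4,  3, 4, 5)
)

lead_time_summary <- toy_lead_times %>%
  group_by(forecast_hours_before) %>%
  summarise(
    n         = n(),                        # number of forecasts per lead time
    avg_error = mean(temp_error),           # average error
    se_error  = sd(temp_error) / sqrt(n)    # standard error of that average
  )

lead_time_summary
```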

Additionally, I identified another variable, forecast_outlook, which represents the general weather outlook, such as the type of precipitation.
This variable could also play a role in influencing forecast accuracy.

It appears that the type of outlook (forecast_outlook) is also associated with noticeable differences in forecast accuracy.
This suggests that certain weather conditions may make forecasts less precise.
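
Whether the differences between outlook types exceed chance variation is not formally tested in this report. One possible check, sketched here on made-up values, is a Kruskal-Wallis rank test, which does not assume normally distributed errors. The outlook labels and numbers below are hypothetical.

```r
# Toy data: five forecast errors for each of three hypothetical outlook types
toy_outlooks <- data.frame(
  meaning    = rep(c("Sunny", "Rain", "Snow"), each = 5),
  temp_error = c(1, 2, 1, 2, 3,  2, 3, 2, 4, 3,  4, 5, 3, 6, 5)
)

# Tests whether the error distributions differ across outlook types
kw <- kruskal.test(temp_error ~ meaning, data = toy_outlooks)
kw$p.value
```

A small p-value would support the claim that accuracy differs by outlook, though it would not say which outlook types differ from each other.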

4 Summary

Overall, the temperature forecast is quite accurate, with an average error of 2-4°F. This suggests that, in general, predictions are reliable.

However, forecast accuracy varies by location. Areas with higher elevations and those farther from coastlines tend to have larger errors.

Additionally, forecast accuracy improves as the time between forecast and observation decreases.
Short-term forecasts (12-24 hours) are more precise, while longer-term forecasts (36-48 hours) are less accurate.

Finally, the type of weather outlook significantly impacts forecast accuracy.
Forecasts for storms and extreme weather events tend to have higher errors.

5 Appendix

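library(tidyverse)  # dplyr, tidyr, and ggplot2 are used throughout
library(patchwork)  # combines p1 + p2 and provides plot_annotation()
# map_data("state") below also requires the maps package to be installed

# Ten states with the highest average temperature forecast error
weather_data %>%
  drop_na(temp_error) %>%
  group_by(state) %>%
  summarise(avg_temp_error = mean(temp_error, na.rm = TRUE)) %>%
  slice_max(avg_temp_error, n = 10) %>%  # keep the ten largest averages
  ggplot() +
    geom_col(aes(x = reorder(state, avg_temp_error), y = avg_temp_error), fill = "orange", color = "black") +
    coord_flip() +
    labs(title = "Average Temperature Forecast Error by State (Top 10)",
         x = "State",
         y = "Average Temperature Error (°F)") +
    theme_minimal()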

states <- map_data("state")

weather_data %>%
  group_by(city, state, lat, lon) %>%
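  summarise(avg_temp_error = mean(temp_error, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = lon, y = lat, color = avg_temp_error)) +
  # State outlines come from a separate data frame, so they do not inherit the city aesthetics
  geom_polygon(data = states, aes(x = long, y = lat, group = group),
               color = "lightgrey", fill = NA, inherit.aes = FALSE) +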
  geom_point(size = 3) +
  scale_color_viridis_c(
    direction = -1
  ) +
  labs(title = "Geographic Distribution of Forecast Errors",
       x = "",
       y = "",
       color = "Average Temp Error") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(), 
    axis.text.y = element_blank() 
  )

forecast_error_by_time <- weather_data %>%
  group_by(forecast_hours_before) %>%
  summarise(avg_error = mean(temp_error, na.rm = TRUE))

ggplot(forecast_error_by_time, aes(x = forecast_hours_before, y = avg_error)) +
  geom_line(linewidth = 1.2, alpha = 0.6) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 4) +
  labs(title = "Forecast Error vs. Time Lag Between Forecast and Observation",
       x = "Hours Before Observation",
       y = "Average Temperature Error") +
  theme_minimal() +
  scale_x_continuous(breaks = c(12,24,36,48))

weather_data %>%
  drop_na(temp_error) %>%
  group_by(forecast_hours_before) %>%
  summarise(avg_temp_error = mean(temp_error, na.rm = TRUE)) %>%
  ungroup() %>% 
  ggplot() +
    geom_col(aes(x = reorder(forecast_hours_before, avg_temp_error), y = avg_temp_error), fill = "orange", color = "black") +
    coord_flip() +  
    labs(title = "Temperature Forecast Errors based on Hours Before Observation",
         x = "Hours before the observation",
         y = "Average Temperature Error") +
    theme_minimal()

Note: This is another type of graph that shows the relationship between hours before the observation and the temperature error.

weather_data %>%
  drop_na(temp_error) %>%
  filter(temp_error <= 3) %>%
  ggplot(aes(y = temp_error)) +
    geom_boxplot() + 
    labs(title = "Temperature Forecast Errors",
         x = "Hours before the observation",
         y = "Temperature Error") +
    theme_minimal() +
    scale_y_continuous(limits = c(0, 5)) +
    facet_wrap(~ forecast_hours_before)

Note: I initially expected a box plot to reveal the relationship between hours before the observation and temperature error. However, this graph shows that relationship less clearly than the two graphs above. One possible reason is the presence of very large temperature errors (60+), which are difficult to represent clearly in a box plot. For these reasons, I placed this graph in the Appendix.
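
One way to keep a few extreme errors from flattening the boxes, sketched on made-up data: put the lead time on the x axis and zoom with coord_cartesian(). Unlike scale_y_continuous(limits = ...), which removes out-of-range values before the box plot statistics are computed, coord_cartesian() only changes the visible window. The values and the zoom range below are illustrative assumptions.

```r
library(ggplot2)

# Toy data: six errors per lead time, including one extreme value (60)
toy_box <- data.frame(
  forecast_hours_before = rep(c(12, 24, 36, 48), each = 6),
  temp_error = c(0, 1, 1, 2, 3, 60,
                 1, 1, 2, 3, 4, 8,
                 1, 2, 3, 3, 5, 9,
                 2, 2, 3, 4, 6, 12)
)

p_box <- ggplot(toy_box, aes(x = factor(forecast_hours_before), y = temp_error)) +
  geom_boxplot(outlier.alpha = 0.3) +
  # Zoom the view only; quartiles are still computed from all observations
  coord_cartesian(ylim = c(0, 15)) +
  labs(title = "Temperature Forecast Errors by Lead Time",
       x = "Hours before the observation",
       y = "Temperature Error (°F)") +
  theme_minimal()

p_box
```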

weather_data %>%
  drop_na(forecast_hours_before) %>%
  group_by(state) %>%
  summarise(avg_forecast_hours_before = mean(forecast_hours_before, na.rm = TRUE)) %>%
  ungroup() %>% 
  filter(state != "Other") %>%
  ggplot() +
    geom_col(aes(x = reorder(state, avg_forecast_hours_before), y = avg_forecast_hours_before), fill = "orange", color = "black") +
    coord_flip() +  
    labs(title = "Average Forecast Lead Time by State",
         x = "State",
         y = "Average hours before the observation") +
    theme_minimal(base_size = 14) +  # Increase base font size
    theme(
      axis.text.y = element_text(size = 6, margin = margin(r = 10)),  
      axis.title = element_text(size = 7)
    )

Note: This graph shows that every state has the same average number of hours before observation. Although this is not the main focus of the report, it is evidence that regional differences in forecast accuracy are not driven by differences in forecast lead time.
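
Equal averages do not by themselves guarantee that every state has the same mix of lead times. A slightly stronger check, sketched here on made-up data, compares the share of forecasts at each lead time within every state. The state codes and counts are hypothetical.

```r
library(dplyr)

# Toy data: state XX has twice as many forecasts as YY, but the same lead-time mix
toy_states <- tibble(
  state                 = c(rep("XX", 8), rep("YY", 4)),
  forecast_hours_before = c(rep(c(12, 24, 36, 48), 2), c(12, 24, 36, 48))
)

lead_time_share <- toy_states %>%
  count(state, forecast_hours_before) %>%
  group_by(state) %>%
  mutate(share = n / sum(n)) %>%  # proportion of each lead time within the state
  ungroup()

lead_time_share
```

If the shares match across states, differences in average error between states cannot be attributed to differences in lead time.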

weather_data %>%
  drop_na(temp_error) %>%
  inner_join(outlook_meanings, by = "forecast_outlook") %>% # the join function we didn't learn in class :)
  group_by(meaning) %>%
  summarise(avg_temp_error = mean(temp_error, na.rm = TRUE)) %>%
  ungroup() %>% 
  filter(!is.na(meaning), meaning != "NA") %>%  # drop missing meanings and the literal string "NA"
  ggplot() +
    geom_col(aes(x = reorder(meaning, avg_temp_error), y = avg_temp_error), fill = "orange", color = "black") +
    coord_flip() +  
    labs(title = "Temperature Forecast Errors by Forecast Outlook",
         x = "Forecast Outlook",
         y = "Average Temperature Error") +
    theme_minimal()

Note: I used inner_join, which we didn’t talk about in class.
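
For context, inner_join() keeps only the rows whose key appears in both tables, so forecasts with an outlook code missing from the lookup table are dropped silently. The sketch below contrasts it with left_join() and anti_join() on made-up data; the outlook codes and meanings are hypothetical.

```r
library(dplyr)

# Toy forecast rows, two of which have no match in the lookup table
toy_forecasts <- tibble(
  forecast_outlook = c("SUNNY", "RAIN", "FOG", NA),
  temp_error       = c(1, 3, 2, 4)
)

# Toy lookup table mapping outlook codes to readable meanings
toy_meanings <- tibble(
  forecast_outlook = c("SUNNY", "RAIN"),
  meaning          = c("Sunny", "Rain")
)

# Keeps only matching rows (SUNNY and RAIN)
inner <- inner_join(toy_forecasts, toy_meanings, by = "forecast_outlook")

# Keeps every forecast row; unmatched rows get meaning = NA
left <- left_join(toy_forecasts, toy_meanings, by = "forecast_outlook")

# Lists the forecast rows that inner_join() would drop
unmatched <- anti_join(toy_forecasts, toy_meanings, by = "forecast_outlook")
```

Checking the anti_join() result first shows how many forecasts are excluded from the outlook chart.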

elevation_vs_error <- weather_data %>%
  group_by(elevation) %>%
  summarise(avg_error = mean(temp_error, na.rm = TRUE))

p1 <- ggplot(elevation_vs_error, aes(x = elevation, y = avg_error)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(title = "Elevation vs. Forecast Error",
       x = "Elevation", y = "Avg Temperature Error") +
  theme_minimal()

coast_vs_error <- weather_data %>%
  group_by(distance_to_coast) %>%
  summarise(avg_error = mean(temp_error, na.rm = TRUE))

p2 <- ggplot(coast_vs_error, aes(x = distance_to_coast, y = avg_error)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Distance to Coast vs. Forecast Error",
       x = "Distance to Coast", y = "Avg Temperature Error") +
  theme_minimal()

p1 <- p1 + theme(plot.title = element_text(size = 12))  
p2 <- p2 + theme(plot.title = element_text(size = 12))  


(p1 + p2) + 
  plot_annotation(
    title = "Effect of Geographic Factors on Forecast Error",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )