Introduction

Weather forecasting services shape the daily lives of many Americans and play a vital role in protecting public safety during extreme weather. However, forecasts in many American cities consistently contain errors. This report explores the accuracy of temperature and precipitation forecasts across 167 US cities over a 16-month period using data from the National Weather Service. By identifying instances of unreliable forecasting, it aims to uncover which regions of the US struggle the most with weather prediction and why.

Methodology and Results

To analyze temperature and precipitation forecast errors, I first merged the city and weather forecast data sets using an inner join by city and state. This removes cities and states that do not have corresponding matches in both data sets. Using the merged data set, I calculated an absolute temperature error as the absolute difference between forecasted and observed temperatures. The precipitation error was calculated as an indicator: a forecast counts as an error (1) when its outlook is anything other than "none" but the observed precipitation is 0, and as no error (0) otherwise.
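The merge and the temperature error described above can be illustrated on a toy example. This is a minimal sketch, not the report's actual data: the two small data frames and their values are made up, and only the column names `city`, `state`, `forecast_temp`, and `observed_temp` follow the Code Index.

```r
library(dplyr)

# Toy data; all values are invented for illustration only.
cities <- data.frame(
  city  = c("A", "B", "C"),
  state = c("X", "Y", "Z")
)
forecasts <- data.frame(
  city          = c("A", "A", "B", "D"),
  state         = c("X", "X", "Y", "W"),
  forecast_temp = c(70, 65, 50, 80),
  observed_temp = c(73, 64, 50, 78)
)

# inner_join() keeps only city/state pairs present in both tables,
# so city C (no forecasts) and city D (no city record) are dropped.
merged <- cities %>%
  inner_join(forecasts, by = c("city", "state")) %>%
  mutate(temp_error = abs(forecast_temp - observed_temp))

print(merged)
```

Because the error is an absolute value, a forecast that is 3 degrees too low and one that is 3 degrees too high count equally.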

Figure 1: Comparison of average temperature prediction errors across different factors. The top plot examines errors in relation to whether the forecast was for a high or low temperature prediction. The bottom-left plot looks at the relationship between average temperature error and distance to the coast. The bottom-right plot looks at the relationship between average temperature error and a city’s Köppen climate classification.

To create my first figure, I grouped the data by city, state, and high/low temperature prediction. I used head() to isolate the 40 city and high/low combinations with the greatest average temperature error and created my three plots from this subset. The first plot is a box plot comparing average temperature error for high and low temperature predictions. There is a visible difference between the two: high predictions have a mean error of around 2.9 degrees, while low predictions have a mean error of around 3.2 degrees. This suggests that low temperature predictions may have a greater average temperature error. The second plot is a scatter plot showing average temperature error against distance to the coast in miles. It is hard to see any strong correlation, but there may be a weak positive correlation between average temperature error and distance to the coast. This could mean that the farther a city is from the coast, the harder it becomes to accurately predict the temperature, although the relationship is not strong. The third plot compares average temperature error across Köppen climate classifications. The cities in this subset fall within three categories: arid, temperate, and continental. This suggests that regions with these three climate classifications may be more susceptible to forecasts with greater average temperature error.
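The selection step above can be sketched on toy data. The example below uses made-up values and a cutoff of 3 rows instead of 40. It illustrates that, because the data are grouped by city and by high/low prediction, the top rows are city and prediction-type pairs, so the same city can appear more than once in the subset.

```r
library(dplyr)

# Toy summary table; all values are invented for illustration only.
errors <- data.frame(
  city           = c("A", "A", "B", "B", "C", "C"),
  high_or_low    = rep(c("high", "low"), times = 3),
  avg_temp_error = c(2.0, 3.5, 1.0, 1.5, 4.0, 2.5)
)

# Keep the 3 rows with the largest average error (the report keeps 40).
top3 <- errors %>%
  arrange(desc(avg_temp_error)) %>%
  head(3)

# Mean error by prediction type within the selected subset.
group_means <- top3 %>%
  group_by(high_or_low) %>%
  summarise(mean_error = mean(avg_temp_error), .groups = "drop")

print(top3)
print(group_means)
```

Here city C contributes both its high and its low row, which is worth keeping in mind when reading the subset as "40 cities".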

Figure 2: Comparison of average temperature error based on the number of hours before the actual observation that a forecast was made.

I created Figure 2 by grouping my data by forecast lead time. I created four data frames, one for each forecast lead time (12, 24, 36, and 48 hours), and then used bind_rows() to combine them again. Finally, I created a box plot to compare the distribution of average temperature errors across forecast lead times. The plot shows a clear relationship between average temperature error and how far in advance a forecast was made: average temperature error decreases as the forecast lead time grows shorter. While we saw some correlations between average temperature error and the earlier factors, forecast lead time is the most strongly correlated factor, with a strong positive correlation between average temperature error and forecast lead time.
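As a hedged illustration of the comparison above, the sketch below summarises error by lead time on toy data and computes a rank correlation. The values are made up, and the Spearman correlation is an assumption added here for illustration; the report itself reads the relationship from the box plot.

```r
library(dplyr)

# Toy forecasts; all values are invented for illustration only.
forecasts <- data.frame(
  city                  = rep(c("A", "B"), each = 4),
  forecast_hours_before = rep(c(12, 24, 36, 48), times = 2),
  temp_error            = c(1, 2, 3, 4, 2, 2, 4, 6)
)

# Average error for each lead time, computed in a single grouped summary.
by_lead <- forecasts %>%
  group_by(forecast_hours_before) %>%
  summarise(avg_temp_error = mean(temp_error), .groups = "drop")

# Rank correlation between lead time and error; a positive value means
# that error tends to grow as forecasts are made further in advance.
rho <- cor(forecasts$forecast_hours_before, forecasts$temp_error,
           method = "spearman")

print(by_lead)
print(rho)
```

A single group_by() over the lead-time column gives the same per-lead-time averages as filtering into separate data frames and binding them back together.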

Figure 3: Comparison of average error in precipitation forecasts against the average annual precipitation of an area.

I created Figure 3 to examine which factors most affect average precipitation error. Through exploratory data analysis (EDA), I found that average precipitation error is related to a region’s average annual precipitation. There is a negative correlation between average annual precipitation and average error in precipitation forecasts: areas with less annual rainfall tend to have a higher average precipitation error.
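The direction of such a relationship can be checked numerically as well as visually. The sketch below uses made-up numbers, not the report's data, and shows how a correlation coefficient and the slope of a linear fit (the same kind of fit that geom_smooth(method = "lm") draws) both come out negative when error falls as annual precipitation rises.

```r
# Toy values; invented for illustration only.
annual_precip <- c(10, 20, 30, 40, 50)            # average annual precipitation
precip_error  <- c(0.50, 0.42, 0.35, 0.30, 0.22)  # average precipitation error

# Pearson correlation between the two variables.
r <- cor(annual_precip, precip_error)

# Slope of a simple linear regression of error on annual precipitation.
fit   <- lm(precip_error ~ annual_precip)
slope <- coef(fit)[["annual_precip"]]

print(r)
print(slope)
```

A negative slope alone does not say how tightly the points follow the line, so reporting the correlation alongside the plot makes the strength of the relationship easier to judge.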

Code Index

library(ggplot2)
library(dplyr)
library(tidyverse)
library(patchwork)
library(lubridate)
library(ggthemes)

city_fore <- read_csv("data/forecast_cities.csv")
outlook <- read_csv("data/outlook_meanings.csv")
weather_fore <- read_csv("data/weather_forecasts.csv")

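# Keep only city/state pairs that appear in both data sets
combined <- city_fore %>% inner_join(weather_fore, by = c("city", "state"))
# Absolute difference between forecasted and observed temperature
combined <- combined %>% mutate(temp_error = abs(forecast_temp - observed_temp))
# 1 when the outlook is anything other than "none" but no precipitation was observed, 0 otherwise
combined <- combined %>% mutate(precip_error = as.integer(forecast_outlook != "none" & observed_precip == 0))
# Treat lead time and high/low prediction as categorical variables
combined <- combined %>% mutate(forecast_hours_before = as.factor(forecast_hours_before), high_or_low = as.factor(high_or_low))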

# Average errors for each city and high/low prediction type
city_errors <- combined %>%
  filter(!is.na(high_or_low)) %>%
  group_by(city, state, high_or_low) %>%
  summarise(
    total_forecasts = n(),
    avg_temp_error = mean(temp_error, na.rm = TRUE),
    avg_precip_error = mean(precip_error, na.rm = TRUE),
    .groups = "drop"
  )

final <- city_errors %>% left_join(city_fore, by = c("city", "state"))

four_plots <- final %>% arrange(desc(avg_temp_error)) %>% head(40)

# Box plot of average temperature error by high/low prediction
# (geom_smooth() removed: a smoother does not apply to a categorical axis)
high_or_low <- ggplot(four_plots, aes(x = avg_temp_error, y = high_or_low)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_clean() +
  labs(
    x = "Avg. Temp. Error",
    y = "High/Low Prediction",
    title = "Average Temperature Error Against:",
    subtitle = "High/Low Prediction") +
  theme(plot.subtitle = element_text(face = "bold", size = 12))

distance <- ggplot(four_plots, aes(x = avg_temp_error, y = distance_to_coast)) +
  geom_point() +
  theme_clean() +
  labs(
    x = "Avg. Temp. Error",
    y = "Distance (miles)",
    title = "Distance to Coast"
  ) +
  theme(axis.title = element_text(size = 10, face = "bold")) +
  theme(plot.title = element_text(face = "bold", size = 12))

koppen <- ggplot(four_plots, aes(x = avg_temp_error, y = koppen)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_clean() +
  labs(
    x = "Avg. Temp. Error",
    y = "Köppen Categorization",
    title = "Köppen Climate Type") +
  theme(axis.title = element_text(size = 10, face = "bold")) +
  theme(plot.title = element_text(face = "bold", size = 12))

(high_or_low) / (distance + koppen)

city_errors_12 <- combined %>%
  filter(forecast_hours_before == 12) %>% 
  group_by(city, state, forecast_hours_before) %>%
  summarise(
    total_forecasts = n(),
    avg_temp_error = mean(temp_error, na.rm = TRUE),
    avg_precip_error = mean(precip_error, na.rm = TRUE))
city_errors_24 <- combined %>%
  filter(forecast_hours_before == 24) %>% 
  group_by(city, state, forecast_hours_before) %>%
  summarise(
    total_forecasts = n(),
    avg_temp_error = mean(temp_error, na.rm = TRUE),
    avg_precip_error = mean(precip_error, na.rm = TRUE))
city_errors_36 <- combined %>%
  filter(forecast_hours_before == 36) %>% 
  group_by(city, state, forecast_hours_before) %>%
  summarise(
    total_forecasts = n(),
    avg_temp_error = mean(temp_error, na.rm = TRUE),
    avg_precip_error = mean(precip_error, na.rm = TRUE))
city_errors_48 <- combined %>%
  filter(forecast_hours_before == 48) %>% 
  group_by(city, state, forecast_hours_before) %>%
  summarise(
    total_forecasts = n(),
    avg_temp_error = mean(temp_error, na.rm = TRUE),
    avg_precip_error = mean(precip_error, na.rm = TRUE))

city_errors_combined <- bind_rows(city_errors_12, city_errors_24, city_errors_36, city_errors_48)
city_errors_combined <- city_errors_combined %>%
  mutate(forecast_hours_before = as.factor(forecast_hours_before))

#Graph For Temperature
ggplot(city_errors_combined, aes(x = forecast_hours_before, y = avg_temp_error)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Temperature Forecast Error by Forecast Hours Before",
       x = "Forecast Hours Before",
       y = "Average Temperature Error") +
  theme_clean()+
  theme(axis.title = element_text(size = 10, face = "bold"))

# Graph For Precipitation
# avg_precip_error is the mean of a 0/1 flag (see precip_error above),
# so it is a share of forecasts rather than a depth in inches
ggplot(final, aes(x = avg_precip_error, y = avg_annual_precip)) +
  geom_point() +
  geom_smooth(method = "lm", na.rm = TRUE) +
  theme_clean() +
  labs(
    x = "Average Error in Precipitation Forecast",
    y = "Average Annual Precipitation",
    title = "Precipitation Error Based on Annual Precipitation") +
  theme(axis.title = element_text(size = 10, face = "bold"))

# This plot suggests little, if any, relationship between forecast lead time and precipitation error
ggplot(city_errors_combined, aes(x = forecast_hours_before, y = avg_precip_error)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Precipitation Forecast Error by Forecast Hours Before",
       x = "Forecast Hours Before",
       y = "Average Precipitation Error") +
  theme_clean()

Portfolio Project 2

RPUBS:

Ethan Chan