library(tidyverse)
library(ggrepel)
library(knitr)
library(patchwork)
cities <- read.csv("data/forecast_cities.csv")
outlook <- read.csv("data/outlook_meanings.csv")
weather <- read.csv("data/weather_forecasts.csv")

1 Introduction

Weather prediction matters in everyday life, where people rely on forecasts to decide what to wear and what to bring. It is even more important for industries such as agriculture and aviation. Despite numerous technological advancements, forecasts for many regions remain inaccurate. This report seeks to identify the US cities where forecasts are least accurate and to explore factors that may contribute to the error.

2 Cities Struggling with Prediction

To determine which areas in the US struggle with weather prediction, I first calculate the difference between the forecasted temperature and the observed temperature. The data set “weather” provides both values for each day in each city. I take the absolute difference between the two to measure how inaccurate each forecast was, then group the results by city and state and average them. Ranking cities by this average identifies the cities in the US that struggle most with weather prediction or, more precisely, with temperature prediction. Once those cities are known, the next step is to look for possible causes.

To explore the causes, I split the cities data set into two subsets: the ten cities with the greatest average temperature difference and the ten with the least. I then compare variables between the two subsets to look for factors that might explain the difference in average error.

# Mean absolute forecast error for each city
city_diff <- weather %>%
  mutate(temp_diff = abs(forecast_temp - observed_temp)) %>%
  group_by(city, state) %>%
  summarize(avg_diff = mean(temp_diff, na.rm = TRUE), .groups = "drop")

# Ten cities with the largest and smallest average error
max_city <- slice_max(city_diff, avg_diff, n = 10)
min_city <- slice_min(city_diff, avg_diff, n = 10)

# Attach city attributes. Joining on both city and state keeps
# same-named cities in different states from matching.
target_cities <- cities %>%
  inner_join(max_city, by = c("city", "state")) %>%
  select(city, state, avg_diff, everything())

min_cities <- cities %>%
  inner_join(min_city, by = c("city", "state")) %>%
  select(city, state, avg_diff, everything())


# Show the high-error cities as a formatted table
target_cities %>%
  select(City = city, State = state, `Average Difference` = avg_diff) %>%
  kable(caption = "Cities with Highest Temperature Difference", align = "c")
Cities with Highest Temperature Difference

City           State   Average Difference
BISMARCK       ND      3.031017
CASPER         WY      3.310077
FAIRBANKS      AK      4.137941
GREAT_FALLS    MT      3.055780
HELENA         MT      3.740166
MISSOULA       MT      3.334495
NORTH_PLATTE   NE      3.081011
POCATELLO      ID      3.045006
PUEBLO         CO      3.017488
YAKIMA         WA      3.248114

# Show the low-error cities as a formatted table
min_cities %>%
  select(City = city, State = state, `Average Difference` = avg_diff) %>%
  kable(caption = "Cities with Lowest Temperature Difference", align = "c")
Cities with Lowest Temperature Difference

City            State   Average Difference
DAYTONA_BEACH   FL      1.811802
KEY_WEST        FL      1.465691
MIAMI_BEACH     FL      1.689895
ORLANDO         FL      1.600465
SAN_JUAN        PR      1.322235
SEATTLE         WA      1.756804
ST_PETERSBURG   FL      1.440767
ST_THOMAS       VI      1.598581
TAMPA           FL      1.599016
YUMA            AZ      1.673333
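
To put a number on the gap between the two tables, the group averages can be compared directly. The sketch below only re-types the values printed in the tables above; the object names (`high_err`, `low_err`, `group_means`, `ratio`) are illustrative and not part of the original analysis.

```r
# Average differences copied from the two tables above
high_err <- c(
  BISMARCK = 3.031017, CASPER = 3.310077, FAIRBANKS = 4.137941,
  GREAT_FALLS = 3.055780, HELENA = 3.740166, MISSOULA = 3.334495,
  NORTH_PLATTE = 3.081011, POCATELLO = 3.045006, PUEBLO = 3.017488,
  YAKIMA = 3.248114
)
low_err <- c(
  DAYTONA_BEACH = 1.811802, KEY_WEST = 1.465691, MIAMI_BEACH = 1.689895,
  ORLANDO = 1.600465, SAN_JUAN = 1.322235, SEATTLE = 1.756804,
  ST_PETERSBURG = 1.440767, ST_THOMAS = 1.598581, TAMPA = 1.599016,
  YUMA = 1.673333
)

# Mean of the ten city-level averages in each group
group_means <- c(high = mean(high_err), low = mean(low_err))

# How many times larger the typical error is in the high-error group
ratio <- group_means[["high"]] / group_means[["low"]]

print(round(group_means, 2))
print(round(ratio, 2))
```

Because each value is already a per-city average, this compares typical cities in each group rather than individual forecasts.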

3 Determining Causes

The first factor that appears to contribute to temperature prediction error is the location of the city. Mapping the cities with the greatest average temperature difference shows that they are mostly in the northwestern United States. The cities with the least temperature error, by contrast, are mostly in the Southeast. This evidence supports the claim that a city’s location (latitude and longitude) affects the accuracy of temperature prediction.

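# State outlines for the contiguous US (map_data() needs the maps package)
states <- map_data("state")

# Map of the ten cities with the highest average error
ggplot() +
  geom_polygon(data = states, aes(
    x = long,
    y = lat,
    group = group,  # draw each polygon separately so outlines are not joined
    fill = region)) +
  theme_classic() +
  theme(legend.position = "none") +
  geom_point(data = target_cities, aes(
    x = lon,
    y = lat
  )) +
  labs(
    title = "US Map of Cities with High Temperature Prediction Error"
  )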

# Map of the ten cities with the lowest average error
ggplot() +
  geom_polygon(data = states, aes(
    x = long,
    y = lat,
    group = group,  # draw each polygon separately so outlines are not joined
    fill = region)) +
  theme_classic() +
  theme(legend.position = "none") +
  geom_point(data = min_cities, aes(
    x = lon,
    y = lat
  )) +
  labs(
    title = "US Map of Cities with Low Temperature Prediction Error"
  )
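
The maps give a visual impression only, and the "state" outlines cover just the contiguous US, so cities such as Fairbanks, San Juan, and St. Thomas fall outside the drawn states. One possible numeric complement is to compare the average coordinates of the two groups. The helper `mean_location()` and the toy data frames below are hypothetical and use made-up coordinates purely to show the call; the sketch assumes only that the data have `lat` and `lon` columns, as in the plotting code above.

```r
# Hypothetical helper: mean latitude and longitude of a set of cities.
# Assumes the data frame has numeric columns `lat` and `lon`.
mean_location <- function(df) {
  colMeans(df[, c("lat", "lon")], na.rm = TRUE)
}

# Made-up coordinates for illustration only (not from the data set)
toy_high <- data.frame(lat = c(46, 48), lon = c(-112, -110))
toy_low  <- data.frame(lat = c(26, 28), lon = c(-82, -80))

high_loc <- mean_location(toy_high)
low_loc  <- mean_location(toy_low)

# Positive lat difference: high-error group sits farther north
# Negative lon difference: high-error group sits farther west
loc_gap <- high_loc - low_loc
print(loc_gap)
```

On the report's data, the equivalent calls would be `mean_location(target_cities)` and `mean_location(min_cities)`.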

Apart from location, another factor that may influence weather prediction is the city’s distance from the coast. Plotting average temperature difference against distance to coast shows that cities with higher temperature prediction error tend to be farther from the coast, whereas cities with low error are closer to it. Distance to the coast is therefore likely another factor that contributes to the accuracy of temperature prediction.

# Stack the two subsets and label each row by the subset it came from,
# rather than relying on a hard-coded cutoff of 3
full_cities <- bind_rows(
  high = target_cities,
  low = min_cities,
  .id = "error"
)

# Scatter plot of average error against distance to coast
ggplot(full_cities) +
  geom_point(aes(x = distance_to_coast, y = avg_diff, color = error)) +
  labs(
    x = "Distance to coast",
    y = "Average temperature difference",
    title = "Relationship between distance to coast and average forecast\ntemperature error in different cities"
  )
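
The scatter plot covers only the 20 most extreme cities, which can exaggerate a trend. As a rough check, a rank correlation across every city could be computed. The function `coast_error_cor()` and the toy data frames below are hypothetical; the sketch assumes a cities table with `city`, `state`, and `distance_to_coast` columns and an error table with `city`, `state`, and `avg_diff`, and it uses base R only to stay free of package dependencies.

```r
# Hypothetical helper: Spearman rank correlation between distance to
# coast and average forecast error, using every city present in both tables.
coast_error_cor <- function(cities, city_diff) {
  merged <- merge(cities, city_diff, by = c("city", "state"))
  cor(merged$distance_to_coast, merged$avg_diff,
      method = "spearman", use = "complete.obs")
}

# Made-up values for illustration only (not from the data set)
toy_cities <- data.frame(
  city = c("A", "B", "C", "D"),
  state = c("S1", "S1", "S2", "S2"),
  distance_to_coast = c(5, 50, 300, 800)
)
toy_diff <- data.frame(
  city = c("A", "B", "C", "D"),
  state = c("S1", "S1", "S2", "S2"),
  avg_diff = c(1.4, 1.9, 2.8, 3.5)
)

rho <- coast_error_cor(toy_cities, toy_diff)
print(rho)
```

On the report's data, the call would be `coast_error_cor(cities, city_diff)`. A value near 1 would support the pattern in the plot, while a value near 0 would suggest the pattern holds only for the extremes.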

To conclude, two factors that appear to influence the accuracy of temperature prediction are a city’s location within the United States and its distance from the coast.