library(tidyverse)
library(ggrepel)
library(knitr)
library(patchwork)
cities <- read.csv("data/forecast_cities.csv")
outlook <- read.csv("data/outlook_meanings.csv")
weather <- read.csv("data/weather_forecasts.csv")
Weather forecasts matter in everyday life: people consult them to decide what to wear and what to bring. They are even more important for industries such as agriculture and aviation. Despite numerous technological advances, forecasts remain noticeably less accurate in some regions than in others. This report seeks to identify the US cities where weather prediction is least accurate and to explore factors that may contribute to the inaccuracy.
To identify where in the US weather prediction is least accurate, I first calculate the difference between the forecast temperature and the observed temperature. The `weather` data set provides the forecast and observed temperatures for each day in each city. I take the absolute difference between the two to measure how inaccurate each forecast was, then group the results by city and state and compute the average absolute difference for each city. This identifies the cities that struggle most with weather prediction or, more precisely, with temperature prediction. Once these cities are known, the next step is to explore possible causes.
To explore the causes of inaccurate prediction, I split the `cities` data set into two subsets: the 10 cities with the greatest average temperature difference and the 10 cities with the least. I then compare variables between the two subsets to find factors that might explain the difference in average error.
# Mean absolute difference between forecast and observed temperature, per city
city_diff <- weather %>%
  mutate(temp_diff = abs(forecast_temp - observed_temp)) %>%
  group_by(city, state) %>%
  summarize(avg_diff = mean(temp_diff, na.rm = TRUE), .groups = "drop")
max_city <- slice_max(city_diff, avg_diff, n = 10)
min_city <- slice_min(city_diff, avg_diff, n = 10)
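# Attach city metadata to the 10 highest-error and 10 lowest-error cities.
# Joining on both city and state avoids matching same-named cities in other states.
target_cities <- cities %>%
  inner_join(max_city, by = c("city", "state")) %>%
  select(city, state, avg_diff, everything())
min_cities <- cities %>%
  inner_join(min_city, by = c("city", "state")) %>%
  select(city, state, avg_diff, everything())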
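The averages above use the absolute rather than the signed difference, and that choice matters. As a minimal sketch with made-up temperatures (not values from the `weather` data set), over-forecasts and under-forecasts cancel in a signed mean but not in a mean of absolute errors:

```r
# Hypothetical forecast and observed temperatures for one city (illustrative only)
forecast_temp <- c(70, 65, 80, 75)
observed_temp <- c(73, 62, 80, NA)  # NA mimics a missing observation

signed_error   <- forecast_temp - observed_temp
absolute_error <- abs(signed_error)

# Errors of opposite sign cancel in the signed mean ...
mean_signed <- mean(signed_error, na.rm = TRUE)
# ... but not in the mean absolute error, which is what avg_diff measures
mean_absolute <- mean(absolute_error, na.rm = TRUE)

print(c(mean_signed = mean_signed, mean_absolute = mean_absolute))
```

A city whose forecasts are wrong in both directions would look accurate under the signed mean, which is why the absolute difference is the more informative measure here.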
# Table of the cities with the highest average temperature difference
target_cities %>%
  select(City = city, State = state, `Average Difference` = avg_diff) %>%
  kable(caption = "Cities with Highest Temperature Difference", align = "c")
| City | State | Average Difference |
|---|---|---|
| BISMARCK | ND | 3.031017 |
| CASPER | WY | 3.310077 |
| FAIRBANKS | AK | 4.137941 |
| GREAT_FALLS | MT | 3.055780 |
| HELENA | MT | 3.740166 |
| MISSOULA | MT | 3.334495 |
| NORTH_PLATTE | NE | 3.081011 |
| POCATELLO | ID | 3.045006 |
| PUEBLO | CO | 3.017488 |
| YAKIMA | WA | 3.248114 |
# Table of the cities with the lowest average temperature difference
min_cities %>%
  select(City = city, State = state, `Average Difference` = avg_diff) %>%
  kable(caption = "Cities with Lowest Temperature Difference", align = "c")
| City | State | Average Difference |
|---|---|---|
| DAYTONA_BEACH | FL | 1.811802 |
| KEY_WEST | FL | 1.465691 |
| MIAMI_BEACH | FL | 1.689895 |
| ORLANDO | FL | 1.600465 |
| SAN_JUAN | PR | 1.322235 |
| SEATTLE | WA | 1.756804 |
| ST_PETERSBURG | FL | 1.440767 |
| ST_THOMAS | VI | 1.598581 |
| TAMPA | FL | 1.599016 |
| YUMA | AZ | 1.673333 |
The first factor that appears to contribute to error in temperature prediction is the location of the city. Mapping the cities with the greatest average temperature difference shows that they are mostly in the Northwest and northern interior of the United States. By contrast, the cities with the least temperature error are mostly in the Southeast, particularly Florida. This pattern suggests that a city’s location (latitude and longitude) affects the accuracy of temperature prediction.
states <- map_data("state")
# group = group keeps each state's separate polygons from being joined together
ggplot() +
  geom_polygon(
    data = states,
    aes(x = long, y = lat, group = group, fill = region)
  ) +
  geom_point(data = target_cities, aes(x = lon, y = lat)) +
  theme_classic() +
  theme(legend.position = "none") +
  labs(title = "US Map of Cities with High Temperature Prediction Error")
# Same map for the cities with the lowest average temperature difference
ggplot() +
  geom_polygon(
    data = states,
    aes(x = long, y = lat, group = group, fill = region)
  ) +
  geom_point(data = min_cities, aes(x = lon, y = lat)) +
  theme_classic() +
  theme(legend.position = "none") +
  labs(title = "US Map of Cities with Low Temperature Prediction Error")
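One caveat when reading these maps: `map_data("state")` contains only the contiguous states, so cities in Alaska, Puerto Rico, or the Virgin Islands are plotted away from the polygons. A minimal sketch of one way to separate such cities before plotting, using city and state pairs from the two tables above; `example_cities` and `outside_codes` are illustrative names, not part of the original analysis:

```r
# City/state pairs copied from the two tables above
example_cities <- data.frame(
  city  = c("FAIRBANKS", "HELENA", "SAN_JUAN", "ST_THOMAS", "TAMPA"),
  state = c("AK", "MT", "PR", "VI", "FL"),
  stringsAsFactors = FALSE
)

# Assumption: these codes mark locations not covered by the "state" map
outside_codes <- c("AK", "HI", "PR", "VI")

# Split the cities into those that fall on the map and those that do not
on_map  <- example_cities[!example_cities$state %in% outside_codes, ]
off_map <- example_cities[example_cities$state %in% outside_codes, ]

print(off_map$city)
```

Applied to the real data, the `on_map` subset could be passed to `geom_point()` in place of the full subset, with the off-map cities reported separately.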
Apart from location, another factor that appears to influence prediction accuracy is a city’s distance from the coast. In the plot of average temperature difference against distance to coast, the cities with high prediction error tend to lie farther from the coast, whereas the cities with low prediction error are closer to it. This suggests that distance to the coast is another factor associated with the accuracy of temperature prediction.
# Combine the two subsets and label each city by the subset it came from
full_cities <- bind_rows(high = target_cities, low = min_cities, .id = "error")
ggplot(full_cities) +
  geom_point(aes(x = distance_to_coast, y = avg_diff, color = error)) +
  labs(
    x = "Distance to coast",
    y = "Average temperature difference",
    title = "Relationship between distance to coast and average forecast\ntemperature error in different cities"
  )
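The visual impression from the scatter plot could be supplemented with a rank correlation. A minimal sketch, assuming a data frame with `distance_to_coast` and `avg_diff` columns as in `full_cities`; the numbers in `toy_cities` are invented for illustration and are not results from the data:

```r
# Made-up values standing in for full_cities (illustrative only)
toy_cities <- data.frame(
  distance_to_coast = c(1, 5, 20, 300, 450, 600),
  avg_diff          = c(1.4, 1.6, 1.5, 3.1, 3.0, 3.7)
)

# Spearman's rank correlation depends only on the ordering of the values,
# so a few very large distances do not dominate the result
rho <- cor(
  toy_cities$distance_to_coast,
  toy_cities$avg_diff,
  method = "spearman"
)
print(rho)
```

Because `full_cities` contains only the 10 highest-error and 10 lowest-error cities, a correlation computed on it would likely overstate the relationship; joining `city_diff` to `cities` and using every city would probably give a fairer estimate.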
To conclude, two factors appear to be associated with the accuracy of temperature prediction: where a city is located in the United States and how far it is from the coast.