# tidyverse attaches ggplot2, readr, and dplyr, so one call is enough
library(tidyverse)
forecast_cities <- read_csv("data/forecast_cities.csv", col_types = list(
  city = col_factor()
))
outlook_meanings <- read_csv("data/outlook_meanings.csv")
weather_forecasts <- read_csv("data/weather_forecasts.csv", col_types = list(
  city = col_factor()
))
The data sets weather_forecasts and
forecast_cities from the National Weather Service contain
data from 167 cities across the US. My goal was to determine which
cities had the most and least accurate weather forecasts, then look for
variables that could potentially explain these differences.
I combined the data sets forecast_cities and
weather_forecasts using a left_join. To measure the
accuracy of the weather forecast in each city, I added a new column,
temp_error: the absolute value of the difference between
forecasted temperature and observed temperature.
The variable city_avg represents the average temperature
error in forecasting for each city.
# Join city metadata onto each forecast, drop incomplete rows,
# and compute the absolute forecast error
forecasts_full <- weather_forecasts %>%
  left_join(forecast_cities, by = c("city", "state")) %>%
  drop_na() %>%
  mutate(temp_error = abs(observed_temp - forecast_temp))

# Attach each city's mean absolute error to every one of its rows
forecasts_full2 <- forecasts_full %>%
  group_by(city, state) %>%
  mutate(city_avg = mean(temp_error))
forecasts_full2 %>%
  ungroup() %>%
  # keep one row per city so slice_max() ranks cities, not individual forecasts
  distinct(city, state, city_avg) %>%
  slice_max(order_by = city_avg, n = 5)
## # A tibble: 5 × 3
##   city      state city_avg
##   <fct>     <chr>    <dbl>
## 1 FAIRBANKS AK        4.08
## 2 HELENA    MT        3.74
## 3 MISSOULA  MT        3.34
## 4 CASPER    WY        3.31
## 5 YAKIMA    WA        3.25
On average, the cities with the least accurate weather forecasts, ranked by mean absolute temperature error, are Fairbanks, AK (4.08); Helena, MT (3.74); Missoula, MT (3.34); Casper, WY (3.31); and Yakima, WA (3.25).
The cities with the most accurate forecasts are St. Petersburg, FL (1.44); Key West, FL (1.47); Orlando, FL (1.60); Tampa, FL (1.60); and Yuma, AZ (1.68).
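Both rankings depend on having one value per city. As a minimal sketch, assuming the same column names as the forecast data, the per-city average can be computed with summarise() and ranked in both directions with slice_max() and slice_min(). The `toy` tibble and its values are hypothetical and are not taken from the National Weather Service files.

```r
library(dplyr)

# Hypothetical toy data: three cities with two forecasts each
toy <- tibble(
  city = c("A", "A", "B", "B", "C", "C"),
  state = c("XX", "XX", "YY", "YY", "ZZ", "ZZ"),
  observed_temp = c(50, 60, 70, 80, 30, 40),
  forecast_temp = c(52, 57, 70, 81, 36, 32)
)

# One row per city: mean absolute forecast error
city_errors <- toy %>%
  mutate(temp_error = abs(observed_temp - forecast_temp)) %>%
  group_by(city, state) %>%
  summarise(city_avg = mean(temp_error), .groups = "drop")

# Largest and smallest average error
least_accurate <- slice_max(city_errors, order_by = city_avg, n = 1)
most_accurate <- slice_min(city_errors, order_by = city_avg, n = 1)
```

Because `city_errors` has exactly one row per city, `n` counts cities rather than individual forecasts.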
Next, I calculated summary statistics for the high- and low-error cities.
forecasts_full %>%
  # %in% tests membership; == would recycle the vector and silently drop rows
  filter(city %in% c("FAIRBANKS", "HELENA", "MISSOULA", "CASPER", "YAKIMA")) %>%
  select(city, state, elevation, forecast_hours_before, distance_to_coast, avg_annual_precip) %>%
  summary()
## city state elevation forecast_hours_before
## HELENA :707 Length:3416 Min. : 130.2 Min. :12.00
## YAKIMA :690 Class :character 1st Qu.: 321.0 1st Qu.:12.00
## CASPER :679 Mode :character Median : 974.1 Median :36.00
## FAIRBANKS:676 Mean : 845.9 Mean :30.06
## MISSOULA :664 3rd Qu.:1177.8 3rd Qu.:48.00
## ABILENE : 0 Max. :1620.8 Max. :48.00
## (Other) : 0
## distance_to_coast avg_annual_precip
## Min. :135.3 Min. :10.21
## 1st Qu.:243.1 1st Qu.:13.39
## Median :564.1 Median :15.56
## Mean :516.4 Mean :14.64
## 3rd Qu.:710.8 3rd Qu.:16.60
## Max. :927.0 Max. :17.65
##
forecasts_full %>%
  # %in% tests membership; == would recycle the vector and silently drop rows
  filter(city %in% c("ST_PETERSBURG", "KEY_WEST", "ORLANDO", "TAMPA", "YUMA")) %>%
  select(city, state, elevation, forecast_hours_before, distance_to_coast, avg_annual_precip) %>%
  summary()
## city state elevation forecast_hours_before
## ORLANDO :703 Length:3414 Min. : 0.00 Min. :12.00
## ST_PETERSBURG:682 Class :character 1st Qu.: 1.26 1st Qu.:24.00
## TAMPA :681 Mode :character Median : 5.45 Median :36.00
## YUMA :681 Mean :20.63 Mean :30.19
## KEY_WEST :667 3rd Qu.:31.64 3rd Qu.:48.00
## ABILENE : 0 Max. :64.04 Max. :48.00
## (Other) : 0
## distance_to_coast avg_annual_precip
## Min. : 0.26 Min. : 3.895
## 1st Qu.: 1.13 1st Qu.:46.710
## Median : 1.19 Median :53.017
## Mean :19.18 Mean :43.598
## 3rd Qu.:36.14 3rd Qu.:56.484
## Max. :56.25 Max. :57.627
##
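Filtering on several city names at once is a place where `==` and `%in%` give different answers. The sketch below uses a made-up vector of names (hypothetical, not drawn from the data set) to show how `==` recycles the shorter vector and compares position by position, while `%in%` tests each element for membership.

```r
# Hypothetical toy vector of city names (not from the data set)
city <- c("HELENA", "YAKIMA", "HELENA", "YAKIMA", "YAKIMA", "HELENA")
targets <- c("HELENA", "YAKIMA")

# `==` recycles `targets` along `city` and compares element by element
by_equals <- city == targets

# `%in%` checks whether each element of `city` appears anywhere in `targets`
by_in <- city %in% targets

sum(by_equals)  # rows kept by the == comparison
sum(by_in)      # rows kept by the membership test
```

Every element of `city` is one of the targets, yet the `==` comparison keeps fewer rows, because a match only counts when it happens to line up with the recycled position.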
The cities with the least accurate weather forecasts tended to be far from the coast: 516.4 miles on average. Their elevations ranged from 130.2 to 1620.8 meters. Their Köppen climate classifications were Dfc, BSk, Dfb, and Csb, and their average annual precipitation was 14.64 inches (median 15.56).
Four of the five cities with the most accurate weather forecasts are in Florida. These cities had very low elevations (0 to 64 meters) and were much closer to the coast, 19.18 miles on average. Their Köppen climate classifications were Cfa, Aw, and BWh. Their average annual precipitation was 43.60 inches.
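A side-by-side comparison of the two groups can also be made at one row per city, so that each city counts once regardless of how many forecasts it has. The following is a minimal sketch under that assumption; the `cities` tibble, its `error_group` column, and all values are invented for illustration.

```r
library(dplyr)

# Hypothetical one-row-per-city data (values invented for illustration)
cities <- tibble(
  city = c("A", "B", "C", "D"),
  error_group = c("high", "high", "low", "low"),
  elevation = c(1000, 1400, 10, 30),
  distance_to_coast = c(500, 700, 1, 39)
)

# Mean of each geographic variable within each error group
group_means <- cities %>%
  group_by(error_group) %>%
  summarise(across(c(elevation, distance_to_coast), mean), .groups = "drop")
```

The result has one row per group, which makes the contrast between high-error and low-error cities easier to read than two separate summary() printouts.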
To see if these trends could be generalized to the full data set, I
plotted the average forecasting error for all 167 cities
(city_avg) against elevation, distance to the coast, and
average annual precipitation.
library(patchwork)
forecasts_full3 <- forecasts_full2 %>%
  ungroup() %>%
  select(city, state, elevation, distance_to_coast, avg_annual_precip, city_avg)

# Point color is set in geom_point(); scale_fill_manual() has no effect
# here because no fill aesthetic is mapped
p1 <- forecasts_full3 %>%
  ggplot(mapping = aes(x = distance_to_coast, y = city_avg)) +
  geom_point(color = "navyblue") +
  labs(x = "distance to coast (miles)", y = "avg forecasting error")
p2 <- forecasts_full3 %>%
  ggplot(mapping = aes(x = elevation, y = city_avg)) +
  geom_point(color = "navyblue") +
  labs(x = "elevation (m)", y = "avg forecasting error")
p3 <- forecasts_full3 %>%
  ggplot(mapping = aes(x = avg_annual_precip, y = city_avg)) +
  geom_point(color = "navyblue") +
  labs(x = "average annual precipitation (inches)", y = "avg forecasting error")
(p1 + p2)/p3
The plots suggested that forecasting error was positively correlated with city elevation and distance to the coast, and negatively correlated with annual precipitation. To confirm, I calculated the correlation coefficient for each pair of variables.
cor(forecasts_full3$city_avg, forecasts_full3$distance_to_coast)
## [1] 0.4689017
cor(forecasts_full3$city_avg, forecasts_full3$elevation)
## [1] 0.4382862
cor(forecasts_full3$city_avg, forecasts_full3$avg_annual_precip)
## [1] -0.4148372
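Cities located farther from the coast and at higher elevation tended to have larger temperature forecasting errors (r = 0.47 and r = 0.44, respectively), while cities with more annual precipitation tended to have smaller errors (r = -0.41). All three correlations are moderate in strength.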
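A correlation coefficient on its own does not say how much uncertainty surrounds it. As a hedged sketch, base R's cor.test() reports the same Pearson coefficient together with a p-value and a confidence interval. The `city_level` data frame below is hypothetical, with one invented row per city, and is not the forecast data.

```r
# Hypothetical city-level values (invented for illustration)
city_level <- data.frame(
  city_avg = c(1.5, 1.8, 2.2, 2.9, 3.4, 4.0),
  elevation = c(5, 200, 150, 900, 1100, 800)
)

# Pearson correlation with a test of the null hypothesis of zero correlation
result <- cor.test(city_level$city_avg, city_level$elevation)

result$estimate  # sample correlation coefficient
result$p.value   # p-value for the null hypothesis of zero correlation
result$conf.int  # 95% confidence interval for the correlation
```

Running the test on one row per city, rather than one row per forecast, keeps the sample size equal to the number of cities, so the p-value is not inflated by repeated rows.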