Portfolio Project 2

Wrangling Weather Forecasts

library(tidyverse)  # attaches dplyr, ggplot2, readr, and tidyr
forecast_cities <- read_csv("data/forecast_cities.csv", col_types = list(
  city = col_factor()
))
outlook_meanings <- read_csv("data/outlook_meanings.csv")
weather_forecasts <- read_csv("data/weather_forecasts.csv", col_types = list(
  city = col_factor()
))

The data sets weather_forecasts and forecast_cities from the National Weather Service contain data from 167 cities across the US. My goal was to determine which cities had the most and least accurate weather forecasts, then look for variables that could potentially explain these differences.

I combined the data sets forecast_cities and weather_forecasts using a left_join. To measure the accuracy of the weather forecast in each city, I added a new column, temp_error: the absolute value of the difference between forecasted temperature and observed temperature.

The variable city_avg represents the average temperature error in forecasting for each city.

# Attach city metadata to each forecast, drop incomplete rows, and compute
# the absolute temperature error (kept numeric so it can be averaged)
forecasts_full <- weather_forecasts %>%
  left_join(forecast_cities, by = c("city", "state")) %>%
  drop_na() %>%
  mutate(temp_error = abs(observed_temp - forecast_temp))
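
Because drop_na() removes any row with a missing value in any column, forecasts for a city that has no match in forecast_cities would disappear silently after the join. A minimal sketch of a check with anti_join(), using small made-up tibbles (toy_forecasts and toy_cities are hypothetical stand-ins, not the real data):

```r
library(dplyr)

# Hypothetical toy data standing in for weather_forecasts and forecast_cities
toy_forecasts <- tibble(
  city  = c("A", "A", "B", "C"),
  state = c("XX", "XX", "YY", "ZZ"),
  observed_temp = c(50, 52, 70, 61),
  forecast_temp = c(48, 55, 70, 60)
)
toy_cities <- tibble(
  city  = c("A", "B"),
  state = c("XX", "YY"),
  elevation = c(100, 5)
)

# Forecast rows with no matching city/state in the metadata table
unmatched <- anti_join(toy_forecasts, toy_cities, by = c("city", "state"))
unmatched

# The same absolute-error calculation, applied to the toy data
toy_full <- toy_forecasts %>%
  left_join(toy_cities, by = c("city", "state")) %>%
  mutate(temp_error = abs(observed_temp - forecast_temp))
toy_full
```

If unmatched has zero rows on the real data, the join itself loses nothing and any rows removed by drop_na() were missing values in the original files.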

forecasts_full2 <- forecasts_full %>%
  group_by(city, state) %>%
  mutate(city_avg = mean(temp_error))
# Collapse to one row per city before ranking; otherwise slice_max()
# returns every forecast row from the top-ranked city
forecasts_full2 %>%
  ungroup() %>%
  distinct(city, city_avg) %>%
  slice_max(order_by = city_avg, n = 5)
## # A tibble: 5 × 2
##   city      city_avg
##   <fct>        <dbl>
## 1 FAIRBANKS     4.08
## 2 HELENA        3.74
## 3 MISSOULA      3.34
## 4 CASPER        3.31
## 5 YAKIMA        3.25

On average, the cities with the least accurate weather forecasts are Fairbanks, AK (4.081687); Helena, MT (3.736797); Missoula, MT (3.338381); Casper, WY (3.305379); and Yakima, WA (3.248114).

The cities with the most accurate forecasts are St. Petersburg, FL (1.439977); Key West, FL (1.466148); Orlando, FL (1.598547); Tampa, FL (1.599070); and Yuma, AZ (1.678218).
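
The ranking behind both lists can be sketched as follows: reduce the data to one row per city, then take the extremes with slice_max() and slice_min(). The tibble toy_errors and its values are made up for illustration only.

```r
library(dplyr)

# Hypothetical forecast errors for three cities (made-up values)
toy_errors <- tibble(
  city  = c("A", "A", "B", "B", "C", "C"),
  state = c("XX", "XX", "YY", "YY", "ZZ", "ZZ"),
  temp_error = c(4, 2, 1, 1, 5, 6)
)

# One row per city with its mean absolute error
city_means <- toy_errors %>%
  group_by(city, state) %>%
  summarise(city_avg = mean(temp_error), .groups = "drop")

# Largest and smallest average error
least_accurate <- slice_max(city_means, order_by = city_avg, n = 1)
most_accurate  <- slice_min(city_means, order_by = city_avg, n = 1)
```

Using summarise() rather than mutate() here yields a single row per city, so n counts cities rather than forecast rows.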

Next, I calculated summary statistics for the high- and low-error cities.

forecasts_full %>%
  filter(city == c("FAIRBANKS", "HELENA", "MISSOULA", "CASPER", "YAKIMA")) %>%
  group_by(city, state) %>%
  select(city, state, elevation, forecast_hours_before, distance_to_coast, avg_annual_precip) %>%
  summary()
##         city        state             elevation      forecast_hours_before
##  HELENA   :707   Length:3416        Min.   : 130.2   Min.   :12.00        
##  YAKIMA   :690   Class :character   1st Qu.: 321.0   1st Qu.:12.00        
##  CASPER   :679   Mode  :character   Median : 974.1   Median :36.00        
##  FAIRBANKS:676                      Mean   : 845.9   Mean   :30.06        
##  MISSOULA :664                      3rd Qu.:1177.8   3rd Qu.:48.00        
##  ABILENE  :  0                      Max.   :1620.8   Max.   :48.00        
##  (Other)  :  0                                                            
##  distance_to_coast avg_annual_precip
##  Min.   :135.3     Min.   :10.21    
##  1st Qu.:243.1     1st Qu.:13.39    
##  Median :564.1     Median :15.56    
##  Mean   :516.4     Mean   :14.64    
##  3rd Qu.:710.8     3rd Qu.:16.60    
##  Max.   :927.0     Max.   :17.65    
## 
forecasts_full %>%
  filter(city == c("ST_PETERSBURG", "KEY_WEST", "ORLANDO", "TAMPA", "YUMA")) %>%
  group_by(city, state) %>%
  select(city, state, elevation, forecast_hours_before, distance_to_coast, avg_annual_precip) %>%
  summary()
##             city        state             elevation     forecast_hours_before
##  ORLANDO      :703   Length:3414        Min.   : 0.00   Min.   :12.00        
##  ST_PETERSBURG:682   Class :character   1st Qu.: 1.26   1st Qu.:24.00        
##  TAMPA        :681   Mode  :character   Median : 5.45   Median :36.00        
##  YUMA         :681                      Mean   :20.63   Mean   :30.19        
##  KEY_WEST     :667                      3rd Qu.:31.64   3rd Qu.:48.00        
##  ABILENE      :  0                      Max.   :64.04   Max.   :48.00        
##  (Other)      :  0                                                           
##  distance_to_coast avg_annual_precip
##  Min.   : 0.26     Min.   : 3.895   
##  1st Qu.: 1.13     1st Qu.:46.710   
##  Median : 1.19     Median :53.017   
##  Mean   :19.18     Mean   :43.598   
##  3rd Qu.:36.14     3rd Qu.:56.484   
##  Max.   :56.25     Max.   :57.627   
## 
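
One caveat about the two filters above: city == c(...) compares element by element and recycles the five names down the column, so a row is kept only when its city lines up with the name in that position. Filtering this way can therefore return only a fraction of each city's rows. %in% tests set membership instead and keeps every matching row. A small sketch with made-up data (toy is hypothetical):

```r
library(dplyr)

# Hypothetical data: three rows each for two cities
toy <- tibble(city = rep(c("A", "B"), each = 3), value = 1:6)

# == recycles c("A", "B") down the column as A, B, A, B, A, B
kept_eq <- filter(toy, city == c("A", "B"))

# %in% keeps every row whose city is in the set
kept_in <- filter(toy, city %in% c("A", "B"))

nrow(kept_eq)
nrow(kept_in)
```

Elevation, distance to coast, and precipitation are constant within a city, so their minimum and maximum are not affected by which rows survive, but row counts and anything weighted by them (means, quartiles) can be.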

The cities with the least accurate weather forecasts tended to be far from the coast: 516.4 miles on average. Their elevation ranged from 130.2 to 1620.8 meters. Their Köppen climate classifications were Dfc, BSk, Dfb, and Csb, and their average annual precipitation was 14.64 inches (median 15.56).

Out of the five cities with the most accurate weather forecasts, four were located in Florida. These cities had very low elevation (0 to 64 meters) and were located much closer to the coast, 19.18 miles on average. Their Köppen climate classifications were Cfa, Aw, and BWh. Their average annual precipitation was 43.60 inches.

To see if these trends could be generalized to the full data set, I plotted the average forecasting error for all 167 cities (city_avg) against elevation, distance to the coast, and average annual precipitation.

library(patchwork)
forecasts_full3 <- forecasts_full2 %>%
  select(city, state, elevation, distance_to_coast, avg_annual_precip, city_avg) 

# Point colour is set in geom_point(); a fill scale has no effect here
p1 <- forecasts_full3 %>%
  ggplot(mapping = aes(x = distance_to_coast, y = city_avg)) +
  geom_point(color = "navyblue") +
  labs(x = "distance to coast (miles)", y = "avg forecasting error")

p2 <- forecasts_full3 %>%
  ggplot(mapping = aes(x = elevation, y = city_avg)) +
  geom_point(color = "navyblue") +
  labs(x = "elevation (m)", y = "avg forecasting error")

p3 <- forecasts_full3 %>%
  ggplot(mapping = aes(x = avg_annual_precip, y = city_avg)) +
  geom_point(color = "navyblue") +
  labs(x = "average annual precipitation (inches)", y = "avg forecasting error")

(p1 + p2)/p3

The plots suggested that forecasting error was positively correlated with city elevation and distance to coast, and negatively correlated with annual precipitation. To confirm, I used R to calculate the correlation coefficient for each pair of variables.

cor(forecasts_full3$city_avg, forecasts_full3$distance_to_coast)
## [1] 0.4689017
cor(forecasts_full3$city_avg, forecasts_full3$elevation)
## [1] 0.4382862
cor(forecasts_full3$city_avg, forecasts_full3$avg_annual_precip)
## [1] -0.4148372
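
The coefficients above are computed over every forecast row, so each city counts once per forecast. One alternative, sketched here with made-up numbers, is to reduce the data to one row per city before calling cor(), so that each city contributes equally; cor.test() then adds a confidence interval. The tibble toy_rows is hypothetical, and results on the real data would differ.

```r
library(dplyr)

# Hypothetical row-level data: elevation and city_avg repeat within a city
toy_rows <- tibble(
  city      = c("A", "A", "A", "B", "C", "C", "D"),
  elevation = c(10, 10, 10, 500, 900, 900, 1500),
  city_avg  = c(1.5, 1.5, 1.5, 2.8, 2.4, 2.4, 3.9)
)

# One row per city
toy_by_city <- distinct(toy_rows, city, elevation, city_avg)

# Row-level versus city-level Pearson correlation
r_rows   <- cor(toy_rows$city_avg, toy_rows$elevation)
r_cities <- cor(toy_by_city$city_avg, toy_by_city$elevation)

# Pearson test with a 95% confidence interval (needs at least 4 pairs)
cor.test(toy_by_city$city_avg, toy_by_city$elevation)
```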

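All three correlations were moderate. Cities located farther from the coast (r = 0.47) and at higher elevation (r = 0.44) tended to have larger temperature forecasting errors, while cities with more annual precipitation tended to have smaller ones (r = -0.41).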