# Load packages
library(tidyverse)
library(here)
library(gt)
# Import data
weather_data <- read_csv("weather_forecasts.csv")BGEN516 - Distributions Lab
Lab Overview
In this lab, I’ll examine the distribution and normality of forecast errors and compare error values between different cities and lead times. The dataset contains observed and forecast temperatures from various U.S. cities, with forecasts made at different time intervals.
Dataset description
The data comes from the USA National Weather Service. I learned about this dataset from Tidy Tuesday, specifically the #tidytuesday project:
TidyTuesday is a weekly podcast and community activity brought to you by the Data Science Learning Community. Our goal is to help data-science learners learn in real-world contexts.
Tidy Tuesday provides a weekly opportunity to practice cleaning, exploring, visualizing, and extracting insights from data. It’s a fun and safe way to learn and compare approaches with others in the world of data science.
Okay, back to the data! The file I’m using, weather_forecasts.csv, includes 16 months of forecasts and observations from American cities.
| variable | class | description |
|---|---|---|
| date | date | date described by the forecast and observation |
| city | factor | city |
| state | factor | state or territory |
| high_or_low | factor | weather the forecast is for the high temperature of the low temperature |
| forecast_hours_before | integer | the number of hours before the observation (one of 12, 24, 36, or 48) |
| observed_temp | integer | the actual observed temperature on that date |
| forecast_temp | integer | the predicted temperature on that date |
| observed_precip | double | the observed precipitation on that date, in inches; note that some observations lack an indication of precipitation, while others explicitly report 0 |
| forecast_outlook | factor | an abbreviation for the general outlook, such as precipitation type |
| possible_error | factor | either (1) “none” if the row contains no potential errors or (2) the name of the variable that is the cause of the potential error |
Given my focus on forecasting accuracy, I’ll primarily work with three columns: observed_temp, forecast_temp, and possible_error. I’ll use the temperature columns to quantify and evaluate accuracy, and the possible error column to reduce the likelihood that problematic values are present in the data.
Prep workspace
To start, I’ll load required packages. I’ll also import the data and store it as weather_data.
Before filtering, I’ll examine the data to better understand its structure and contents:
glimpse(weather_data)Rows: 651,968
Columns: 10
$ date <date> 2021-01-30, 2021-01-30, 2021-01-30, 2021-01-30,…
$ city <chr> "ABILENE", "ABILENE", "ABILENE", "ABILENE", "ABI…
$ state <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", …
$ high_or_low <chr> "high", "high", "high", "high", "low", "low", "l…
$ forecast_hours_before <dbl> 48, 36, 24, 12, 48, 36, 24, 12, 48, 36, 24, 12, …
$ observed_temp <dbl> 70, 70, 70, 70, 42, 42, 42, 42, 29, 29, 29, 29, …
$ forecast_temp <dbl> NA, NA, NA, 70, NA, NA, 39, 38, NA, NA, NA, 30, …
$ observed_precip <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
$ forecast_outlook <chr> NA, NA, NA, "DUST", NA, NA, "DUST", "SUNNY", NA,…
$ possible_error <chr> "none", "none", "none", "none", "none", "none", …
Based on this glimpse of the data, I know that the forecast_temp column contains NAs. It’s possible the observed_temp columns contains NAs as well.
Next, I’ll use base R functions table() and is.na() to check observed_temp for NAs:
# Count missing and non-missing values observed_temp
table(is.na(weather_data$observed_temp))
FALSE TRUE
604224 47744
observed_temp does in fact contain NAs and quite a few at that (47, 744)! Here’s how you can approach the same task using a tidyverse-based workflow, piping the results into gt() to produce a clean, report-ready table.
weather_data %>%
mutate(missing = is.na(observed_temp)) %>%
count(missing) %>%
gt() %>%
cols_label(missing = "Missing?") %>%
cols_align(align = "center") %>%
tab_style(style = cell_text(weight = "bold"),
locations = cells_column_labels(columns = c(missing, n)))| Missing? | n |
|---|---|
| FALSE | 604224 |
| TRUE | 47744 |
Okay, now onto possible_error.
From the glimpse of the data, I know it contains none as a value. What other values do observations in this column take?
The unique function returns an object with duplicate elements removed. It’s a handy way to get a quick overview of all the distinct values in a column, especially when working with factor or character variables, which are types of categorical data.
unique(weather_data$possible_error)[1] "none" "forecast_outlook" "forecast_temp" "observed_temp"
Here’s the tidy version which utilizes the distinct function from the dplyr package in the tidyverse:
weather_data %>%
distinct(possible_error) %>%
gt() %>%
cols_label(possible_error = "possible_error values") %>%
cols_align("center") %>%
tab_style(style = cell_text(weight = "bold"),
locations = cells_column_labels(columns = possible_error))| possible_error values |
|---|
| none |
| forecast_outlook |
| forecast_temp |
| observed_temp |
I now know that possible_error takes one of four possible values and none of those values are NA.
Filter the data
Next, I’ll filter out NAs and observations with known possible errors.
# Subset to keep error-free and complete observations
weather_filtered <- weather_data %>%
filter(
possible_error == "none",
!is.na(observed_temp),
!is.na(forecast_temp)
)
# Inspect to make sure it worked
glimpse(weather_filtered)Rows: 577,914
Columns: 10
$ date <date> 2021-01-30, 2021-01-30, 2021-01-30, 2021-01-30,…
$ city <chr> "ABILENE", "ABILENE", "ABILENE", "AKRON_CANTON",…
$ state <chr> "TX", "TX", "TX", "OH", "OH", "OH", "NY", "NY", …
$ high_or_low <chr> "high", "low", "low", "high", "low", "low", "hig…
$ forecast_hours_before <dbl> 12, 24, 12, 12, 24, 12, 12, 24, 12, 12, 24, 12, …
$ observed_temp <dbl> 70, 42, 42, 29, 26, 26, 17, -6, -6, 48, 24, 24, …
$ forecast_temp <dbl> 70, 39, 38, 30, 26, 26, 18, -1, -1, 48, 26, 24, …
$ observed_precip <dbl> 0.00, 0.00, 0.00, 0.09, 0.09, 0.09, 0.00, 0.00, …
$ forecast_outlook <chr> "DUST", "DUST", "SUNNY", "MOCLDY", "MOCLDY", "SN…
$ possible_error <chr> "none", "none", "none", "none", "none", "none", …
Looks great!
Create new variable
Now I want to create a new variable that represents the difference between observed temperature and forecast temperature. This difference, often called the forecast error, captures forecasting accuracy by showing how far off the predicted values were from the actual values.
weather_filtered <- weather_filtered %>%
mutate(error = observed_temp - forecast_temp)Summarize new variable
We’ve recently learned about the normal distribution. I want to use data summaries to explore and assess whether the variable’s values are approximately normally distributed or show deviations. The summary statistics provide key measurements, and the visualizations help me better understand how those measurements reflect the overall pattern and spread of the errors.
Descriptive Statistics
First, I’ll calculate summary statistics for my new variable and display them in a table. This will give me a numerical summary of the variable’s distribution.
Before we compute and interpret the summary statistics for the error variable, it’s important to understand what positive and negative error values mean:
A positive error value means the observed temperature is higher than the forecast temperature
A negative error value means the observed temperature is lower than the forecast temperature
# Calculate statistics
error_summary <- weather_filtered %>%
summarise(
mean_error = mean(error),
median_error = median(error),
sd_error = sd(error),
iqr_error = IQR(error),
min_error = min(error),
max_error = max(error)
) %>%
pivot_longer(everything(), names_to = "statistic", values_to = "value")
# Present in table with row groups
error_summary %>%
gt(rowname_col = "statistic") %>%
# Center the values in columns
cols_align("center") %>%
# Hide the header row
tab_options(column_labels.hidden = TRUE) %>%
# Create row groups
tab_row_group(label = "Central Tendency",
rows = matches("mean_error|median_error")) %>%
tab_row_group(label = "Dispersion",
rows = matches("sd_error|iqr_error")) %>%
tab_row_group(label = "Boundary",
rows = matches("min_error|max_error")) %>%
# Order the row groups
row_group_order(c("Central Tendency", "Dispersion", "Boundary")) %>%
# Bold row group labels
tab_style(style = cell_text(weight = "bold"), locations = cells_row_groups()) %>%
# Format values to 2 decimal places
fmt_number(columns = value, decimals = 2)| Central Tendency | |
| mean_error | 0.45 |
| median_error | 0.00 |
| Dispersion | |
| sd_error | 3.12 |
| iqr_error | 3.00 |
| Boundary | |
| min_error | −46.00 |
| max_error | 32.00 |
Alright, let’s interpret these values!
Central Tendency
The mean error is about 0.45, meaning on average the observed temperature is slightly higher than the forecast temperature.
The median error is 0. Half the errors are zero or less and half are above zero, suggesting errors are roughly centered around zero.
Dispersion
The standard deviation is 3.12. This tells me that the errors usually vary by about 3 degrees from the average error.
The interquartile range is 3. This means that the middle 50% of the errors are spread out over a range of 3 degrees, showing where most typical errors lie.
Boundary
The minimum error is -46, a very large negative outlier where the forecast was much higher than observed temperature.
The maximum error is 32, a large positive outlier where the forecast was much lower than observed temperature.
Overall, errors are generally close to zero with moderate spread. The distribution shows some extreme values on both ends, which may affect how closely the errors follow a normal distribution.
Visualization
Next, I’ll visualize the distribution of error values using two types of plots: a histogram and a Q-Q plot. Together, these will give me a visual summary of the variable’s distribution.
Histogram
First, I’ll visualize the distribution of error values with a histogram.
In the code chunk below, I wrote a function that calculates the bin width for a histogram using the Freedman-Diaconis rule. After applying it to the error data, I saw the result was too small, so I rounded it up to the nearest whole number. Then I used this bin width to create a histogram of the forecast errors with ggplot2.
# Create function to apply Freedman-Diaconis Rule
fd_rule_binwidth <- function(a_vector) {
n <- length(a_vector)
iqr <- IQR(a_vector)
bw <- 2 * iqr / (n^(1 / 3))
bw
}
# Apply function to error variable
error_bins <- fd_rule_binwidth(weather_filtered$error)
# error_bins == 0.07203278
# error_bins value is too close to zero,
# will round up to nearest whole number
# using ceiling()
# Plot distribution
weather_filtered %>%
ggplot(aes(x = error)) +
geom_histogram(
binwidth = ceiling(error_bins),
fill = "#70002e",
color = "white"
) +
labs(
title = "Histogram of Forecast Errors",
x = "Error",
y = "Frequency"
) +
theme_minimal()The x-axis shows the size of the forecast errors, and the y-axis shows how often each error value occurs. Most of the errors are clustered tightly around zero, indicating that the forecast is generally close to the observed temperature. The distribution is roughly symmetrical, with errors spread out evenly on both sides of zero. The frequencies decrease quickly as errors get larger, which suggests large errors are uncommon. While the shape appears somewhat bell-shaped, it’s important to use a more accurate assessment of normality.
Q-Q Plot
To further assess whether the errors follow a normal distribution, I’ll create a Q-Q plot. This plot will help me see how closely the error values match a normal distribution by comparing their quantiles.
weather_filtered %>%
ggplot(aes(sample = error)) +
stat_qq(color = "#70002e") +
stat_qq_line(color = "red") +
labs(
title = "Q-Q Plot of Forecast Errors",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()This plot compares the distribution of forecast errors to a theoretical normal distribution. The x-axis shows the theoretical quantiles I’d expect if the errors were perfectly normal, and the y-axis shows the actual quantiles from the sample.
Most of the points fall fairly close to the line, especially in the center, suggesting the errors are approximately normally distributed. However, the slight curve at both ends indicates some deviation in the tails, with a few more extreme values than I’d expect under a normal distribution.
Compare groups
Lastly, I want to explore how forecast error distributions differ by city and by lead time. The data set is large, so I need to focus on key insights related to forecasting accuracy. I’ll highlight groups where the forecast errors were either the largest or the smallest, looking at the size of the errors regardless of whether they were too high or too low (i.e., ignoring the sign). This will give a clearer picture of how location and lead time might be associated with accuracy.
Here’s what I’ll do to accomplish my goal:
Group the data by those two variables
Calculate the mean, standard deviation, and number of observations for the forecast error in each group
weather_grouped <- weather_filtered %>%
group_by(city, forecast_hours_before) %>%
summarise(
mean_error = mean(error),
sd_error = sd(error),
n = n(),
.groups = "drop"
)- Rank the groups by the absolute value of their mean error to find which have the largest and smallest average forecast errors
ranked_groups <- weather_grouped %>%
mutate(
# Replace underscore with space, convert to title case
city = str_to_title(str_replace_all(city, "_", " ")),
# Get absolute mean error
abs_mean_error = abs(mean_error),
# Largest absolute errors
rank_high = rank(-abs_mean_error),
# Smallest absolute errors
rank_low = rank(abs_mean_error),
rank_group = case_when(
rank_high <= 5 ~ "Largest 5",
rank_low <= 5 ~ "Smallest 5",
TRUE ~ NA_character_
)) %>%
filter(rank_group %in% c("Largest 5", "Smallest 5")) %>%
arrange(desc(abs_mean_error))- Create tables showing the groups with the largest and smallest error magnitudes for easy comparison and interpretation
# Largest errors
ranked_groups %>%
filter(rank_group == "Largest 5") %>%
select(-c(rank_high, rank_low, rank_group, abs_mean_error)) %>%
gt(caption = "Groups with Largest Absolute Mean Errors") %>%
cols_align("center") %>%
cols_label(
city = "City",
forecast_hours_before = "Lead Time",
mean_error = "Mean Error",
sd_error = "SD Error"
) %>%
tab_style(style = cell_text(weight = "bold"),
locations = cells_column_labels()) %>%
fmt_number(
columns = c(mean_error, sd_error),
decimals = 3
)| City | Lead Time | Mean Error | SD Error | n |
|---|---|---|---|---|
| Helena | 48 | 2.191 | 4.695 | 865 |
| Helena | 24 | 2.171 | 4.295 | 871 |
| Helena | 36 | 2.144 | 4.595 | 868 |
| Helena | 12 | 2.123 | 4.229 | 876 |
| Milwaukee | 48 | 2.081 | 3.539 | 857 |
This table highlights which city and lead time combinations have the largest average magnitude of forecasting errors. Helena, especially at longer lead times, appears to be the most challenging location for accurate temperature forecasts in this dataset. This insight could guide further investigation or model improvement efforts targeted at specific cities or forecast horizons.
# Smallest errors
ranked_groups %>%
filter(rank_group == "Smallest 5") %>%
select(-c(rank_high, rank_low, rank_group, abs_mean_error)) %>%
gt(caption = "Groups with Smallest Absolute Mean Errors") %>%
cols_align("center") %>%
cols_label(
city = "City",
forecast_hours_before = "Lead Time",
mean_error = "Mean Error",
sd_error = "SD Error"
) %>%
tab_style(style = cell_text(weight = "bold"),
locations = cells_column_labels()) %>%
fmt_number(
columns = c(mean_error, sd_error),
decimals = 3
)| City | Lead Time | Mean Error | SD Error | n |
|---|---|---|---|---|
| Evansville | 12 | 0.005 | 2.577 | 876 |
| Wilmington | 48 | −0.004 | 3.047 | 857 |
| Pittsburgh | 24 | −0.003 | 2.722 | 862 |
| Des Moines | 48 | −0.003 | 3.330 | 868 |
| Youngstown | 48 | −0.002 | 3.211 | 859 |
This table identifies city and lead time combinations where temperature forecasts tend to be most accurate. The mean errors are all very close to zero, indicating that forecasts for these groups are generally unbiased and highly accurate on average. For example, Evansville at 12 hours and Wilmington at 48 hours both have near-zero mean errors, reflecting reliable forecasts at different lead times. The standard deviations indicate some variability in errors, but overall these city and lead time combinations represent the most precise and consistent forecasting performance.
By examining which cities and lead times have the largest and smallest forecast errors, we gain valuable insight into where and when the forecasts tend to be more or less accurate. This understanding can help guide further investigation into the factors influencing forecast performance and inform efforts to improve prediction models in specific locations or timeframes.