BGEN516 - Distributions Lab

Author

Affiliation

University of Montana

Published

October 9, 2025

Lab Overview

In this lab, I’ll examine the distribution and normality of forecast errors and compare error values between different cities and lead times. The dataset contains observed and forecast temperatures from various U.S. cities, with forecasts made at different time intervals.

Dataset description

The data comes from the USA National Weather Service. I learned about this dataset from Tidy Tuesday, specifically the #tidytuesday project:

TidyTuesday is a weekly podcast and community activity brought to you by the Data Science Learning Community. Our goal is to help data-science learners learn in real-world contexts.

Tidy Tuesday provides a weekly opportunity to practice cleaning, exploring, visualizing, and extracting insights from data. It’s a fun and safe way to learn and compare approaches with others in the world of data science.

Okay, back to the data! The file I’m using, weather_forecasts.csv, includes 16 months of forecasts and observations from American cities.

variable	class	description
date	date	date described by the forecast and observation
city	factor	city
state	factor	state or territory
high_or_low	factor	weather the forecast is for the high temperature of the low temperature
forecast_hours_before	integer	the number of hours before the observation (one of 12, 24, 36, or 48)
observed_temp	integer	the actual observed temperature on that date
forecast_temp	integer	the predicted temperature on that date
observed_precip	double	the observed precipitation on that date, in inches; note that some observations lack an indication of precipitation, while others explicitly report 0
forecast_outlook	factor	an abbreviation for the general outlook, such as precipitation type
possible_error	factor	either (1) “none” if the row contains no potential errors or (2) the name of the variable that is the cause of the potential error

Given my focus on forecasting accuracy, I’ll primarily work with three columns: observed_temp, forecast_temp, and possible_error. I’ll use the temperature columns to quantify and evaluate accuracy, and the possible error column to reduce the likelihood that problematic values are present in the data.

Prep workspace

To start, I’ll load required packages. I’ll also import the data and store it as weather_data.

# Load packages
library(tidyverse)
library(here)
library(gt)

# Import data
weather_data <- read_csv("weather_forecasts.csv")

Before filtering, I’ll examine the data to better understand its structure and contents:

glimpse(weather_data)

Rows: 651,968
Columns: 10
$ date                  <date> 2021-01-30, 2021-01-30, 2021-01-30, 2021-01-30,…
$ city                  <chr> "ABILENE", "ABILENE", "ABILENE", "ABILENE", "ABI…
$ state                 <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", …
$ high_or_low           <chr> "high", "high", "high", "high", "low", "low", "l…
$ forecast_hours_before <dbl> 48, 36, 24, 12, 48, 36, 24, 12, 48, 36, 24, 12, …
$ observed_temp         <dbl> 70, 70, 70, 70, 42, 42, 42, 42, 29, 29, 29, 29, …
$ forecast_temp         <dbl> NA, NA, NA, 70, NA, NA, 39, 38, NA, NA, NA, 30, …
$ observed_precip       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
$ forecast_outlook      <chr> NA, NA, NA, "DUST", NA, NA, "DUST", "SUNNY", NA,…
$ possible_error        <chr> "none", "none", "none", "none", "none", "none", …

Based on this glimpse of the data, I know that the forecast_temp column contains NAs. It’s possible the observed_temp columns contains NAs as well.

Next, I’ll use base R functions table() and is.na() to check observed_temp for NAs:

# Count missing and non-missing values observed_temp
table(is.na(weather_data$observed_temp))


 FALSE   TRUE 
604224  47744

observed_temp does in fact contain NAs and quite a few at that (47, 744)! Here’s how you can approach the same task using a tidyverse-based workflow, piping the results into gt() to produce a clean, report-ready table.

weather_data %>% 
  mutate(missing = is.na(observed_temp)) %>% 
  count(missing) %>%
  gt() %>%
  cols_label(missing = "Missing?") %>%
  cols_align(align = "center") %>%
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels(columns = c(missing, n)))

Missing?	n
FALSE	604224
TRUE	47744

Okay, now onto possible_error.

From the glimpse of the data, I know it contains none as a value. What other values do observations in this column take?

The unique function returns an object with duplicate elements removed. It’s a handy way to get a quick overview of all the distinct values in a column, especially when working with factor or character variables, which are types of categorical data.

unique(weather_data$possible_error)

[1] "none"             "forecast_outlook" "forecast_temp"    "observed_temp"

Here’s the tidy version which utilizes the distinct function from the dplyr package in the tidyverse:

weather_data %>% 
  distinct(possible_error) %>%
  gt() %>%
  cols_label(possible_error = "possible_error values") %>%
  cols_align("center") %>%
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels(columns = possible_error))

possible_error values
none
forecast_outlook
forecast_temp
observed_temp

I now know that possible_error takes one of four possible values and none of those values are NA.

Filter the data

Next, I’ll filter out NAs and observations with known possible errors.

# Subset to keep error-free and complete observations
weather_filtered <- weather_data %>%
  filter(
    possible_error == "none",
    !is.na(observed_temp),
    !is.na(forecast_temp)
    )

# Inspect to make sure it worked
glimpse(weather_filtered)

Rows: 577,914
Columns: 10
$ date                  <date> 2021-01-30, 2021-01-30, 2021-01-30, 2021-01-30,…
$ city                  <chr> "ABILENE", "ABILENE", "ABILENE", "AKRON_CANTON",…
$ state                 <chr> "TX", "TX", "TX", "OH", "OH", "OH", "NY", "NY", …
$ high_or_low           <chr> "high", "low", "low", "high", "low", "low", "hig…
$ forecast_hours_before <dbl> 12, 24, 12, 12, 24, 12, 12, 24, 12, 12, 24, 12, …
$ observed_temp         <dbl> 70, 42, 42, 29, 26, 26, 17, -6, -6, 48, 24, 24, …
$ forecast_temp         <dbl> 70, 39, 38, 30, 26, 26, 18, -1, -1, 48, 26, 24, …
$ observed_precip       <dbl> 0.00, 0.00, 0.00, 0.09, 0.09, 0.09, 0.00, 0.00, …
$ forecast_outlook      <chr> "DUST", "DUST", "SUNNY", "MOCLDY", "MOCLDY", "SN…
$ possible_error        <chr> "none", "none", "none", "none", "none", "none", …

Looks great!

Create new variable

Now I want to create a new variable that represents the difference between observed temperature and forecast temperature. This difference, often called the forecast error, captures forecasting accuracy by showing how far off the predicted values were from the actual values.

weather_filtered <- weather_filtered %>%
  mutate(error = observed_temp - forecast_temp)

Summarize new variable

We’ve recently learned about the normal distribution. I want to use data summaries to explore and assess whether the variable’s values are approximately normally distributed or show deviations. The summary statistics provide key measurements, and the visualizations help me better understand how those measurements reflect the overall pattern and spread of the errors.

Descriptive Statistics

First, I’ll calculate summary statistics for my new variable and display them in a table. This will give me a numerical summary of the variable’s distribution.

Before we compute and interpret the summary statistics for the error variable, it’s important to understand what positive and negative error values mean:

A positive error value means the observed temperature is higher than the forecast temperature
A negative error value means the observed temperature is lower than the forecast temperature

# Calculate statistics
error_summary <- weather_filtered %>%
  summarise(
    mean_error = mean(error),
    median_error = median(error),
    sd_error = sd(error),
    iqr_error = IQR(error),
    min_error = min(error),
    max_error = max(error)
    ) %>%
  pivot_longer(everything(), names_to = "statistic", values_to = "value")

# Present in table with row groups
error_summary %>%
  gt(rowname_col = "statistic") %>%
  # Center the values in columns
  cols_align("center") %>%
  # Hide the header row 
  tab_options(column_labels.hidden = TRUE) %>%
  # Create row groups
  tab_row_group(label = "Central Tendency", 
                rows = matches("mean_error|median_error")) %>%
  tab_row_group(label = "Dispersion", 
                rows = matches("sd_error|iqr_error")) %>%
  tab_row_group(label = "Boundary", 
                rows = matches("min_error|max_error")) %>%
  # Order the row groups
  row_group_order(c("Central Tendency", "Dispersion", "Boundary")) %>%
  # Bold row group labels
  tab_style(style = cell_text(weight = "bold"), locations = cells_row_groups()) %>%
  # Format values to 2 decimal places
  fmt_number(columns = value, decimals = 2)

Central Tendency
mean_error	0.45
median_error	0.00
Dispersion
sd_error	3.12
iqr_error	3.00
Boundary
min_error	−46.00
max_error	32.00

Alright, let’s interpret these values!

Central Tendency

The mean error is about 0.45, meaning on average the observed temperature is slightly higher than the forecast temperature.
The median error is 0. Half the errors are zero or less and half are above zero, suggesting errors are roughly centered around zero.

Dispersion

The standard deviation is 3.12. This tells me that the errors usually vary by about 3 degrees from the average error.
The interquartile range is 3. This means that the middle 50% of the errors are spread out over a range of 3 degrees, showing where most typical errors lie.

Boundary

The minimum error is -46, a very large negative outlier where the forecast was much higher than observed temperature.
The maximum error is 32, a large positive outlier where the forecast was much lower than observed temperature.

Overall, errors are generally close to zero with moderate spread. The distribution shows some extreme values on both ends, which may affect how closely the errors follow a normal distribution.

Visualization

Next, I’ll visualize the distribution of error values using two types of plots: a histogram and a Q-Q plot. Together, these will give me a visual summary of the variable’s distribution.

Histogram

First, I’ll visualize the distribution of error values with a histogram.

In the code chunk below, I wrote a function that calculates the bin width for a histogram using the Freedman-Diaconis rule. After applying it to the error data, I saw the result was too small, so I rounded it up to the nearest whole number. Then I used this bin width to create a histogram of the forecast errors with ggplot2.

# Create function to apply Freedman-Diaconis Rule
fd_rule_binwidth <- function(a_vector) {
  n <- length(a_vector)
  iqr <- IQR(a_vector)
  bw <- 2 * iqr / (n^(1 / 3))
  bw
  }

# Apply function to error variable
error_bins <- fd_rule_binwidth(weather_filtered$error)

# error_bins == 0.07203278
# error_bins value is too close to zero, 
# will round up to nearest whole number
# using ceiling()

# Plot distribution
weather_filtered %>%
  ggplot(aes(x = error)) +
  geom_histogram(
    binwidth = ceiling(error_bins),
    fill = "#70002e",
    color = "white"
    ) +
  labs(
    title = "Histogram of Forecast Errors", 
    x = "Error", 
    y = "Frequency"
    ) +
  theme_minimal()

The x-axis shows the size of the forecast errors, and the y-axis shows how often each error value occurs. Most of the errors are clustered tightly around zero, indicating that the forecast is generally close to the observed temperature. The distribution is roughly symmetrical, with errors spread out evenly on both sides of zero. The frequencies decrease quickly as errors get larger, which suggests large errors are uncommon. While the shape appears somewhat bell-shaped, it’s important to use a more accurate assessment of normality.

Q-Q Plot

To further assess whether the errors follow a normal distribution, I’ll create a Q-Q plot. This plot will help me see how closely the error values match a normal distribution by comparing their quantiles.

weather_filtered %>%
  ggplot(aes(sample = error)) +
  stat_qq(color = "#70002e") +
  stat_qq_line(color = "red") +
  labs(
    title = "Q-Q Plot of Forecast Errors",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
    ) +
  theme_minimal()

This plot compares the distribution of forecast errors to a theoretical normal distribution. The x-axis shows the theoretical quantiles I’d expect if the errors were perfectly normal, and the y-axis shows the actual quantiles from the sample.

Most of the points fall fairly close to the line, especially in the center, suggesting the errors are approximately normally distributed. However, the slight curve at both ends indicates some deviation in the tails, with a few more extreme values than I’d expect under a normal distribution.

Compare groups

Lastly, I want to explore how forecast error distributions differ by city and by lead time. The data set is large, so I need to focus on key insights related to forecasting accuracy. I’ll highlight groups where the forecast errors were either the largest or the smallest, looking at the size of the errors regardless of whether they were too high or too low (i.e., ignoring the sign). This will give a clearer picture of how location and lead time might be associated with accuracy.

Here’s what I’ll do to accomplish my goal:

Group the data by those two variables
Calculate the mean, standard deviation, and number of observations for the forecast error in each group

weather_grouped <- weather_filtered %>%
  group_by(city, forecast_hours_before) %>%
  summarise(
    mean_error = mean(error),
    sd_error = sd(error),
    n = n(),
    .groups = "drop"
    )

Rank the groups by the absolute value of their mean error to find which have the largest and smallest average forecast errors

ranked_groups <- weather_grouped %>%
  mutate(
    # Replace underscore with space, convert to title case
    city = str_to_title(str_replace_all(city, "_", " ")),
    # Get absolute mean error
    abs_mean_error = abs(mean_error),
    # Largest absolute errors
    rank_high = rank(-abs_mean_error),
    # Smallest absolute errors
    rank_low = rank(abs_mean_error),
    rank_group = case_when(
      rank_high <= 5 ~ "Largest 5",
      rank_low <= 5 ~ "Smallest 5",
      TRUE ~ NA_character_
      )) %>%
  filter(rank_group %in% c("Largest 5", "Smallest 5")) %>%
  arrange(desc(abs_mean_error))

Create tables showing the groups with the largest and smallest error magnitudes for easy comparison and interpretation

# Largest errors
ranked_groups %>%
  filter(rank_group == "Largest 5") %>%
  select(-c(rank_high, rank_low, rank_group, abs_mean_error)) %>%
  gt(caption = "Groups with Largest Absolute Mean Errors") %>%
  cols_align("center") %>%
  cols_label(
    city = "City",
    forecast_hours_before = "Lead Time",
    mean_error = "Mean Error",
    sd_error = "SD Error"
    ) %>%
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) %>%
  fmt_number(
    columns = c(mean_error, sd_error), 
    decimals = 3
    )

Groups with Largest Absolute Mean Errors
City	Lead Time	Mean Error	SD Error	n
Helena	48	2.191	4.695	865
Helena	24	2.171	4.295	871
Helena	36	2.144	4.595	868
Helena	12	2.123	4.229	876
Milwaukee	48	2.081	3.539	857

This table highlights which city and lead time combinations have the largest average magnitude of forecasting errors. Helena, especially at longer lead times, appears to be the most challenging location for accurate temperature forecasts in this dataset. This insight could guide further investigation or model improvement efforts targeted at specific cities or forecast horizons.

# Smallest errors
ranked_groups %>%
  filter(rank_group == "Smallest 5") %>%
  select(-c(rank_high, rank_low, rank_group, abs_mean_error)) %>%
  gt(caption = "Groups with Smallest Absolute Mean Errors") %>%
  cols_align("center") %>%
  cols_label(
    city = "City",
    forecast_hours_before = "Lead Time",
    mean_error = "Mean Error",
    sd_error = "SD Error"
    ) %>%
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) %>%
  fmt_number(
    columns = c(mean_error, sd_error), 
    decimals = 3
    )

Groups with Smallest Absolute Mean Errors
City	Lead Time	Mean Error	SD Error	n
Evansville	12	0.005	2.577	876
Wilmington	48	−0.004	3.047	857
Pittsburgh	24	−0.003	2.722	862
Des Moines	48	−0.003	3.330	868
Youngstown	48	−0.002	3.211	859

This table identifies city and lead time combinations where temperature forecasts tend to be most accurate. The mean errors are all very close to zero, indicating that forecasts for these groups are generally unbiased and highly accurate on average. For example, Evansville at 12 hours and Wilmington at 48 hours both have near-zero mean errors, reflecting reliable forecasts at different lead times. The standard deviations indicate some variability in errors, but overall these city and lead time combinations represent the most precise and consistent forecasting performance.

By examining which cities and lead times have the largest and smallest forecast errors, we gain valuable insight into where and when the forecasts tend to be more or less accurate. This understanding can help guide further investigation into the factors influencing forecast performance and inform efforts to improve prediction models in specific locations or timeframes.