Background

When I was a kid my father and his father, my grandfather, use to say “if it rains on Easter Sunday it will rain the next seven Sundays.” Even as a kid I questioned, could that really be true? I remember trying to count how many consecutive Sundays it would rain following rain on Easter Sunday but I never managed to successfully keep track of it to know if it actually rained for the next seven Sundays or not.

This work will attempt to either prove the old saying true or debunk it by analyzing the weather data for one location. The one location used for this work will be Dubuque, Iowa and that is only because that is the station closest to where I grew up as a kid and also closest to where both my father and grandfather grew up. This work could easily be expanded and modified a little to study other locations as well.

The question this work will answer is:

The Data

The data required to answer this question is straight forward, only two datasets are needed:

The following sections will discuss the two datasets used for this work.

Weather Data

A few years ago when I was working on a different project that required weather data for building net load forecasting models for a utility company, a local meteorologist pointed me to the Iowa Environmental Mesonet site hosted by Iowa State University. You can find the site here: https://mesonet.agron.iastate.edu/

The disclaimer page explains how while they strive to provide accurate information there may be errors and therefore the data is provided as is without any warranty of accuracy. It states how the data is part of the public domain and can be freely used. It also points out how the Iowa Environmental Mesonet is a volunteer effort and receives no funding and that it may be discontinued at any time with little or no notice. See the disclaimer page for more details https://mesonet.agron.iastate.edu/disclaimer.php

For this simple work the source of the weather data seems to be a reliable and accurate source. However, it is a public data set, it is handled as volunteer effort, thus, if the accuracy of the data was highly critical for the project it would be worth trying to get in touch with the owners of the data to have a conversation to try to determine how trustworthy the data is or not. Additionally, other weather data sources could be found and used to compare with this data as well. However, that is out of the scope of this work.

The data used for this work can be found from the following page: https://mesonet.agron.iastate.edu/request/daily.phtml

This page lets you select the stations you want data for, the range of the data, what variables you want, how missing values should be treated, and the download format for the data. Once you make the desired selections it updates an example automation curl for automating the data extraction, which provides a direct link that can be used to request the data in the desired format.

The selected stations to gather data for are:

  • DBQ – Dubuque (1951-Now)
  • ALO – Waterloo (1949-Now)
  • BRL – Burlington (1931-Now)
  • CID – Cedar Rapids (1972-Now)
  • DSM – Des Moines (1928-Now)

Looking at the station list it is seen that Dubuque has data going back to 1951. At the time of this work it is 2025 so that is 75 years of data, so not a lot but enough for answering the question for this work.

The date range for the data will be from 1900-01-01 to 2025-07-22 (Year-Month-Day) Note that this is the default date range and was used as is, even though the data does not go all the way back to 1900. Start: Year = 1900, Month = Jan, Day = 1 End: Year = 2025, Month = Jul, Day = 22

By default all the available daily variables for the dataset are selected for the data request. The list of variables is below. Looking through the list the variable named precip_in is the variable for the Daily Precipitation in units of inches. This will be the key variable used for this work to know if it rained or not, i.e., if precip_in is greater than zero then it rained.

Select from available daily variables Note: Values of 0.0001 inches are Trace Reports

  • [max_temp_f] Maximum Air Temperature [F].
  • [min_temp_f] Minimum Air Temperature [F].
  • [max_dewpoint_f] Maximum Dew Point [F].
  • [min_dewpoint_f] Minimum Dew Point [F].
  • [precip_in] Daily Precipitation [inch].
  • [avg_wind_speed_kts] Average Wind Speed [knots]
  • [avg_wind_drct] Average Wind Direction [deg]
  • [min_rh] Minimum Relative Humidity [%]
  • [avg_rh] Average Relative Humidity [%]: computed by time averaging the available reports, it is not average of the daily max/min.
  • [max_rh] Maximum Relative Humidity [%]
  • [climo_high_f] NCEI 1991-2020 Daily High Temperature Climatology [F]
  • [climo_low_f] NCEI 1991-2020 Daily Low Temperature Climatology [F]
  • [climo_precip_in] NCEI 1991-2020 Daily Precipitation Climatology [inch]
  • [snow_in] Reported Snowfall [inch]
  • [snowd_in] Reported Snow Depth [inch]
  • [min_feel] Minimum ‘Feels Like’ (either wind chill or heat index) temperature. The value is always at least the air temperature.
  • [avg_feel] Average ‘Feels Like’ (either wind chill or heat index) temperature. The value is always at least the air temperature. Value is a time weighted average.
  • [max_feel] Maximum ‘Feels Like’ (either wind chill or heat index) temperature. The value is always at least the air temperature.
  • [max_wind_speed_kts] Maximum sustained wind speed in knots.
  • [max_wind_gust_kts] Maximum wind gust in knots.
  • [srad_mj] Daily Solar Radiation MJ/m2 (when available).

Easter Sunday Dates

The second dataset required to answer the question is the Easter Sunday dates. Finding this data was a little interesting as Easter is not a fixed day of the month every year. After an initial search I found how the date of Easter is calculated from the USCCB site see the link below.

USCC site that provides information for how Easter Sunday dates are determined. https://www.usccb.org/prayer-worship/liturgical-year/easter#tab--how-are-the-dates-for-easter-palm-sunday-and-ash-wednesday-determined

The site says this: “Easter is celebrated on the first Sunday after the Paschal full moon, which is the first full moon occurring either on or after the spring equinox (March 21). As a result, Easter will always fall between March 22 and April 25.”

I never knew that was how the Easter date was determined. The site provided a date range for which Easter could fall. It also means if we can determine the date of the Paschal full moon for each year, then we can determine the date of Easter.

After doing some research on the Paschal full moon this led me to the Wikipedia page on the Ecclesiastical full moon. The page explains how the paschal full moon corresponds to the ecclesiastical full moon for the northern spring. It also explains how the ecclesiastical equinox is a fixed date by the Gregorian reform of the calendar. Here is the link to the Wikipedia page where you can read all the details. https://en.wikipedia.org/wiki/Ecclesiastical_full_moon

At the end of the Wikipedia page under the References section are these two references:

After reviewing both sites from the references and comparing some of the Easter dates, both sites seemed to agree with each other. Note that I did not compare every single date.

Additionally, I checked the Easter dates on the census.gov site to the ‘Holidays in United States’ Google calendar for the years 2020-2030. Here is the link to ‘Holidays in United States’ calendar https://calendar.google.com/calendar/embed?src=en.usa%23holiday%40group.v.calendar.google.com&ctz=America%2FChicago

All the Easter dates on the census.gov site from 2020-2030 agree with the dates on the calendar. Therefore, I am assuming this is a good and accurate source for Easter dates.

For this work the Easter dates I am using are from the census.gov site. This is because a link to an Excel file and a text file for the dates was provided. I downloaded the Excel file and used it locally for this work.

Here are the direct links to the Excel and corresponding text file that contain the Easter dates used for this work.

Set up environment for analysis

The code below loads the packages used for this work.

# load required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(rlang)
## 
## Attaching package: 'rlang'
## 
## The following objects are masked from 'package:purrr':
## 
##     %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice
library(ggh4x)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(knitr)

Get, Prepare, and Process the weather data

Get the weather data

The code below will read and load the weather data for the selections described in the Weather Data section.

# read data from website
data <- read.csv('https://mesonet.agron.iastate.edu/cgi-bin/request/daily.py?network=IA_ASOS&stations=DBQ&stations=ALO&stations=BRL&stations=CID&stations=DSM&year1=1900&month1=1&day1=1&year2=2025&month2=7&day2=22&var=max_temp_f&var=min_temp_f&var=max_dewpoint_f&var=min_dewpoint_f&var=precip_in&var=avg_wind_speed_kts&var=avg_wind_drct&var=min_rh&var=avg_rh&var=max_rh&var=climo_high_f&var=climo_low_f&var=climo_precip_in&var=snow_in&var=snowd_in&var=min_feel&var=avg_feel&var=max_feel&var=max_wind_speed_kts&var=max_wind_gust_kts&var=srad_mj&na=blank&format=csv')

Preview the data

Now that we have the data, let’s preview it to start to understand it. The code below shows a sample of the first six rows of the data.

# preview
head(data)
##   station        day max_temp_f min_temp_f max_dewpoint_f min_dewpoint_f
## 1     DSM 1928-01-01         14          8             NA             NA
## 2     DSM 1928-01-02         24         22             NA             NA
## 3     DSM 1928-01-03         17         15             NA             NA
## 4     DSM 1928-01-04         24         23             NA             NA
## 5     DSM 1928-01-05         23         22             NA             NA
## 6     DSM 1928-01-06         12          7             NA             NA
##   precip_in avg_wind_speed_kts avg_wind_drct min_rh avg_rh max_rh snow_in
## 1        NA                 NA            NA     NA     NA     NA      NA
## 2        NA                 NA            NA     NA     NA     NA      NA
## 3        NA                 NA            NA     NA     NA     NA      NA
## 4        NA                 NA            NA     NA     NA     NA      NA
## 5        NA                 NA            NA     NA     NA     NA      NA
## 6        NA                 NA            NA     NA     NA     NA      NA
##   snowd_in min_feel avg_feel max_feel max_wind_speed_kts max_wind_gust_kts
## 1       NA       NA       NA       NA           26.24190                NA
## 2       NA       NA       NA       NA           22.00000                NA
## 3       NA       NA       NA       NA           13.21814                NA
## 4       NA       NA       NA       NA           13.21814                NA
## 5       NA       NA       NA       NA           13.21814                NA
## 6       NA       NA       NA       NA           13.21814                NA
##   srad_mj climo_high_f climo_low_f climo_precip_in
## 1      NA         31.7        15.4            0.04
## 2      NA         31.6        15.2            0.03
## 3      NA         31.4        15.0            0.04
## 4      NA         31.3        14.8            0.04
## 5      NA         31.2        14.7            0.03
## 6      NA         31.0        14.5            0.04

The code below gives us all the column names for the data set.

# column names
colnames(data)
##  [1] "station"            "day"                "max_temp_f"        
##  [4] "min_temp_f"         "max_dewpoint_f"     "min_dewpoint_f"    
##  [7] "precip_in"          "avg_wind_speed_kts" "avg_wind_drct"     
## [10] "min_rh"             "avg_rh"             "max_rh"            
## [13] "snow_in"            "snowd_in"           "min_feel"          
## [16] "avg_feel"           "max_feel"           "max_wind_speed_kts"
## [19] "max_wind_gust_kts"  "srad_mj"            "climo_high_f"      
## [22] "climo_low_f"        "climo_precip_in"

The column names look to match the list of variables shown above from the mesonet site.

Convert and Filter the Data

The following code converts the date information into a date format which will make it easier to work with later when we want to extract specific dates.

# Convert the date into a date format
data_date <- mutate(data, date = ymd(day))

Now for a quick check of the station codes to make sure we have only the five stations that were selected for data download.

# get a unique list of station codes
unique(data_date$station)
## [1] "DSM" "BRL" "ALO" "DBQ" "CID"

Perfect, the unique list of station codes has only the five stations that were requested for download.

As described in the Background section this work will focus on the Dubuque station, which is labeled as DBQ in the station column. The following code will filter the data for the Dubuque station.

# filter for Dubuque Station
dbq_data  <- data_date %>% filter(station == "DBQ")

Now for a double check on the filtered data to make sure we only have the data for Dubuque (DBQ) station, the following code will list the unique station codes.

# lets check the station column for unique stations
unique(dbq_data$station)
## [1] "DBQ"

Perfect, the only station is DBQ.

Preview the Dubuque Station Data

First, let’s check how many observations there are for the Dubuque station.

# we can find how many observations there are for the Dubuque station
nrow(dbq_data)
## [1] 27197

The following code provides some information and little preview of the data.

# get preview
head(dbq_data)
##   station        day max_temp_f min_temp_f max_dewpoint_f min_dewpoint_f
## 1     DBQ 1951-02-01         -1        -18            -11            -25
## 2     DBQ 1951-02-02          8        -21              2            -29
## 3     DBQ 1951-02-03         27          9             19              0
## 4     DBQ 1951-02-04         28         15             22             10
## 5     DBQ 1951-02-05         28          8             25              5
## 6     DBQ 1951-02-06         33         -4             33             -9
##   precip_in avg_wind_speed_kts avg_wind_drct   min_rh   avg_rh    max_rh
## 1         0          10.913043      317.3376 55.33802 67.53948  73.93644
## 2         0          14.086957      191.4964 52.94791 68.61010  78.80496
## 3         0          16.173914      180.0000 63.03075 75.55913  88.37287
## 4         0           8.652174      321.2700 71.25885 82.92670  91.98074
## 5         0          10.608696      141.5715 62.55473 84.04946  96.31438
## 6         0          17.130434      266.6721 75.68251 88.10319 100.00000
##   snow_in snowd_in   min_feel   avg_feel   max_feel max_wind_speed_kts
## 1      NA       NA -37.527310 -31.336327 -19.160774                 16
## 2      NA       NA -47.051180 -26.273819  -9.152055                 20
## 3      NA       NA  -9.089237   3.815424  13.505674                 21
## 4      NA       NA   2.474925  10.819743  18.589834                 13
## 5      NA       NA  -2.036095   8.175632  14.784925                 23
## 6      NA       NA -27.086733   4.213754  22.952017                 24
##   max_wind_gust_kts srad_mj climo_high_f climo_low_f climo_precip_in       date
## 1                NA      NA         27.1        11.2            0.05 1951-02-01
## 2                NA      NA         27.2        11.3            0.04 1951-02-02
## 3                NA      NA         27.4        11.5            0.05 1951-02-03
## 4                NA      NA         27.6        11.7            0.05 1951-02-04
## 5                NA      NA         27.8        11.9            0.05 1951-02-05
## 6                NA      NA         28.0        12.2            0.05 1951-02-06
# get some info
str(dbq_data)
## 'data.frame':    27197 obs. of  24 variables:
##  $ station           : chr  "DBQ" "DBQ" "DBQ" "DBQ" ...
##  $ day               : chr  "1951-02-01" "1951-02-02" "1951-02-03" "1951-02-04" ...
##  $ max_temp_f        : num  -1 8 27 28 28 33 1 -1 7 33 ...
##  $ min_temp_f        : num  -18 -21 9 15 8 -4 -10 -11 -15 8 ...
##  $ max_dewpoint_f    : num  -11 2 19 22 25 33 -8 -6 0 27 ...
##  $ min_dewpoint_f    : num  -25 -29 0 10 5 -9 -17 -17 -21 3 ...
##  $ precip_in         : num  0 0 0 0 0 0 0 0 0 0.01 ...
##  $ avg_wind_speed_kts: num  10.91 14.09 16.17 8.65 10.61 ...
##  $ avg_wind_drct     : num  317 191 180 321 142 ...
##  $ min_rh            : num  55.3 52.9 63 71.3 62.6 ...
##  $ avg_rh            : num  67.5 68.6 75.6 82.9 84 ...
##  $ max_rh            : num  73.9 78.8 88.4 92 96.3 ...
##  $ snow_in           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ snowd_in          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_feel          : num  -37.53 -47.05 -9.09 2.47 -2.04 ...
##  $ avg_feel          : num  -31.34 -26.27 3.82 10.82 8.18 ...
##  $ max_feel          : num  -19.16 -9.15 13.51 18.59 14.78 ...
##  $ max_wind_speed_kts: num  16 20 21 13 23 24 22 12 16 24 ...
##  $ max_wind_gust_kts : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ srad_mj           : logi  NA NA NA NA NA NA ...
##  $ climo_high_f      : num  27.1 27.2 27.4 27.6 27.8 28 28.3 28.5 28.8 29.1 ...
##  $ climo_low_f       : num  11.2 11.3 11.5 11.7 11.9 12.2 12.4 12.7 12.9 13.2 ...
##  $ climo_precip_in   : num  0.05 0.04 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.06 ...
##  $ date              : Date, format: "1951-02-01" "1951-02-02" ...
glimpse(dbq_data)
## Rows: 27,197
## Columns: 24
## $ station            <chr> "DBQ", "DBQ", "DBQ", "DBQ", "DBQ", "DBQ", "DBQ", "D…
## $ day                <chr> "1951-02-01", "1951-02-02", "1951-02-03", "1951-02-…
## $ max_temp_f         <dbl> -1, 8, 27, 28, 28, 33, 1, -1, 7, 33, 43, 36, 18, 17…
## $ min_temp_f         <dbl> -18, -21, 9, 15, 8, -4, -10, -11, -15, 8, 33, 20, 3…
## $ max_dewpoint_f     <dbl> -11, 2, 19, 22, 25, 33, -8, -6, 0, 27, 39, 30, 15, …
## $ min_dewpoint_f     <dbl> -25, -29, 0, 10, 5, -9, -17, -17, -21, 3, 27, 19, -…
## $ precip_in          <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
## $ avg_wind_speed_kts <dbl> 10.913043, 14.086957, 16.173914, 8.652174, 10.60869…
## $ avg_wind_drct      <dbl> 317.337600, 191.496370, 180.000000, 321.269960, 141…
## $ min_rh             <dbl> 55.33802, 52.94791, 63.03075, 71.25885, 62.55473, 7…
## $ avg_rh             <dbl> 67.53948, 68.61010, 75.55913, 82.92670, 84.04946, 8…
## $ max_rh             <dbl> 73.93644, 78.80496, 88.37287, 91.98074, 96.31438, 1…
## $ snow_in            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ snowd_in           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ min_feel           <dbl> -37.527310, -47.051180, -9.089237, 2.474925, -2.036…
## $ avg_feel           <dbl> -31.336327, -26.273819, 3.815424, 10.819743, 8.1756…
## $ max_feel           <dbl> -19.160774, -9.152055, 13.505674, 18.589834, 14.784…
## $ max_wind_speed_kts <dbl> 16, 20, 21, 13, 23, 24, 22, 12, 16, 24, 26, 16, 16,…
## $ max_wind_gust_kts  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ srad_mj            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ climo_high_f       <dbl> 27.1, 27.2, 27.4, 27.6, 27.8, 28.0, 28.3, 28.5, 28.…
## $ climo_low_f        <dbl> 11.2, 11.3, 11.5, 11.7, 11.9, 12.2, 12.4, 12.7, 12.…
## $ climo_precip_in    <dbl> 0.05, 0.04, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.0…
## $ date               <date> 1951-02-01, 1951-02-02, 1951-02-03, 1951-02-04, 19…

My preferred way of getting a feel for the data is using the skim_without_charts function. This gives a summary of the data, data types, group variables, and some statistics for each variable type.

# get a summary and some statistics
skim_without_charts(dbq_data)
Data summary
Name dbq_data
Number of rows 27197
Number of columns 24
_______________________
Column type frequency:
character 2
Date 1
logical 1
numeric 20
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
station 0 1 3 3 0 1 0
day 0 1 10 10 0 27197 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 1951-02-01 2025-07-22 1988-04-29 27197

Variable type: logical

skim_variable n_missing complete_rate mean count
srad_mj 27197 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
max_temp_f 2 1.00 55.96 22.45 -17.00 37.00 59.00 76.00 101.00
min_temp_f 1 1.00 38.74 20.51 -31.00 25.00 40.00 56.00 80.00
max_dewpoint_f 2 1.00 43.38 19.93 -28.00 29.00 44.06 60.98 82.00
min_dewpoint_f 2 1.00 30.57 20.36 -51.00 18.00 32.00 46.40 75.02
precip_in 13 1.00 0.08 0.30 0.00 0.00 0.00 0.02 22.45
avg_wind_speed_kts 262 0.99 9.20 3.71 0.00 6.51 8.79 11.45 33.38
avg_wind_drct 263 0.99 194.84 99.55 0.00 127.07 197.73 280.51 360.00
min_rh 4 1.00 51.99 17.62 4.74 38.81 51.37 64.88 100.00
avg_rh 267 0.99 71.32 12.71 23.41 62.50 71.75 80.56 100.00
max_rh 4 1.00 88.65 9.78 34.38 83.09 90.31 96.39 100.00
snow_in 21185 0.22 0.13 0.68 0.00 0.00 0.00 0.00 11.50
snowd_in 22544 0.17 0.90 2.51 0.00 0.00 0.00 0.00 19.00
min_feel 4 1.00 33.10 25.87 -59.04 15.34 34.43 57.00 80.56
avg_feel 263 0.99 42.68 26.09 -49.82 22.62 46.25 65.78 97.08
max_feel 4 1.00 53.18 26.94 -40.69 30.54 59.00 75.90 121.74
max_wind_speed_kts 1 1.00 15.38 5.35 4.00 12.00 15.00 18.00 95.00
max_wind_gust_kts 14060 0.48 24.30 9.86 6.08 19.00 23.00 28.00 278.70
climo_high_f 0 1.00 56.58 19.66 26.00 37.20 59.00 76.30 81.70
climo_low_f 0 1.00 38.10 17.77 10.40 21.60 38.80 55.80 62.20
climo_precip_in 0 1.00 0.10 0.04 0.04 0.07 0.11 0.14 0.19

This tells us there are 27,197 observations (number of rows) and 23 variables (number of columns). From summary of the ‘character’ variable type we can see there are no missing (n_missing column) values for station and day. This is good, it means we do not have any missing data we need to deal with.

For the ‘numeric’ variable type let’s focus on the precip_in variable, recall that is amount of precipitation, to check it’s statistics.

Two interesting things for the precip_in variable jump out that should be investigated:

  • First, the precip_in variable has 13 missing values. This may be a problem if any of the 13 missing values correspond to the time periods we are interested in.
  • Second, p100 for precip_in is 22.4 inches, that seems like a lot of precipitation for one day, so it may be worth checking into this to see if this is accurate or not.

Investigate the 13 missing values

First let’s investigate the 13 missing values. Being there are only 13 missing values, an easy thing to check and review would be to see what months have missing values. If the months do not correspond to months that Easter can fall in, then we can ignore them.

The code below will find the row locations within the data set that have the missing values.

# finding the locations of missing values for precip_in
locMissingPrecip <- is.na(dbq_data$precip_in)

Being we know there are only 13, one way to review them is simply print them out to review.

# print out the dates with missing values
dbq_data[locMissingPrecip, c(2,7)]
##              day precip_in
## 5625  1966-06-30        NA
## 7694  1972-02-28        NA
## 16460 1996-02-28        NA
## 22963 2013-12-18        NA
## 22973 2013-12-28        NA
## 23247 2014-09-28        NA
## 23248 2014-09-29        NA
## 23250 2014-10-01        NA
## 23254 2014-10-05        NA
## 23257 2014-10-08        NA
## 23258 2014-10-09        NA
## 23729 2016-01-23        NA
## 23730 2016-01-24        NA

While the list above gives us the information we want, we can put it into a more readable list by extracting the names of the months and listing them out.

# get the months with missing values for precip_in
missingPrecipMonths <- dbq_data[locMissingPrecip, 2] %>% ymd() %>% months() %>% unique()
# print out the list months
cat(paste("*", missingPrecipMonths), sep = "\n")
  • June
  • February
  • December
  • September
  • October
  • January

Reviewing the months with missing values and knowing that Easter must be sometime between March 22 to April 25 (see the Easter Sunday Dates section for details), we can conclude that none of the months with missing precipitation values correspond to months that Easter falls in. So, the missing values are not an issue.

Investigate the 22.4 inches of precipitation

Now let’s investigate the 22.4 inches of precipitation. That number seems really high for one day, so it would be good to try to determine if it is correct or not.

To start, let’s find the rows in the data set that have precipitation greater than 22 inches.

# find rows with precip_in > 22
loc22 <- dbq_data$precip_in > 22

Now let’s check how many observations have precipitation greater than 22 inches.

# check how many rows have values > 22
cat("Number of days with precipitation greater than 22 inches: ", length(which(loc22)) )
## Number of days with precipitation greater than 22 inches:  1

There is exactly one observation with a value greater than 22.

So, let’s take a look at that observation for that day.

# get the index
row_with_preceip_gt_22 <- which(loc22)

# lets take a look at that row
rowLoc22 <- dbq_data[row_with_preceip_gt_22,]

# print table with only station, day, and precip_in
data22 <- rowLoc22 %>% select(station, day, precip_in)
Day with precipitation greater than 22 inches
station day precip_in
9125 DBQ 1976-01-29 22.45

This event occurred on January 29, 1976. While this date is not of interest to us, it would be good to continue to check this data to see if it is correct. If it is found to not be correct, then maybe the accuracy of the entire data set should be questioned.

January is a cold winter month in Iowa . Thus, maybe the 22.4 inches was snowfall and not rain. So, let’s check the snow_in variable for the reported snowfall.

# print table with only station, day, precip_in, snow_in, snowd_in
rowLoc22 %>% select(station, day, precip_in, snow_in, snowd_in)
##      station        day precip_in snow_in snowd_in
## 9125     DBQ 1976-01-29     22.45      NA       NA

Well, the snowfall inches and snow depth are missing for that date. However, if we check the temperature for the day, maybe we can tell if it might have been snowfall or warm enough to have rained.

# print table with temperature
rowLoc22 %>% select(station, day, precip_in, max_temp_f, min_temp_f)
##      station        day precip_in max_temp_f min_temp_f
## 9125     DBQ 1976-01-29     22.45         24         12

The minimum temperature that day was 12 degrees F and the high temperature that day was 24 degrees F. Thus, it is probably safe to say the 22.4 inches of precipitation on January 29, 1976 was snow. Wow, that’s a lot of snow to get in one day, assuming this data is correct. Having lived in Iowa my entire life I know that it is certainly possible to get that much snow in one day, so I am assuming this data is correct. However, if this date was needed to answer our question, it would be worth investigating further to determine if it is correct.

Get, Prepare, and Process the Easter Sunday Dates

As described previously the Easter Sunday dates used for this work were taken from the Excel file downloaded from the census.gov site. See the Easter Sunday Dates section for more details. The direct link to the Excel file: https://www2.census.gov/software/x-13arima-seats/win-genhol/documentation/easter500.xls

The code below will load the Easter Sunday dates.

# read Easter dates
easter <- readxl::read_excel("easter500.xls")

Get a quick summary of the dates data.

# summary
skim_without_charts(easter)
Data summary
Name easter
Number of rows 501
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Month 1 1 3.77 0.42 3 4.00 4.0 4.00 4
Day 1 1 15.70 8.95 1 8.00 16.0 23.00 31
Year 1 1 1849.50 144.48 1600 1724.75 1849.5 1974.25 2099

The dataset has one missing value for each variable. So, let’s check what rows have the missing values.

# which row is missing
misDateRow <- which(is.na(easter$Month))
# print it
cat("Row with missing data:", misDateRow)
## Row with missing data: 501
easter[misDateRow,]
## # A tibble: 1 × 3
##   Month   Day  Year
##   <dbl> <dbl> <dbl>
## 1    NA    NA    NA

The row with the missing data is the last row of the dataset and has no data, so we can simply drop the row and ignore it without it having any impact on the analysis.

The code below drops the last row with the missing values.

# the last row has the missing data, so we can simply drop it
easter <- easter[1:500,]

Here is a summary of the updated Easter dates data.

# check summary
skim_without_charts(easter)
Data summary
Name easter
Number of rows 500
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Month 0 1 3.77 0.42 3 4.00 4.0 4.00 4
Day 0 1 15.70 8.95 1 8.00 16.0 23.00 31
Year 0 1 1849.50 144.48 1600 1724.75 1849.5 1974.25 2099

There are no missing values now. Next the dates will be converted into a date format and the corresponding decade for each year will be calculated.

# create a column for the date format and decade
easter_dates <- easter %>% mutate(Decade = floor(Year/10)*10, Date = make_date(year = easter$Year, month = easter$Month, day = easter$Day))

Preview the updated dataset, the beginning of it and end of it.

head(easter_dates)
## # A tibble: 6 × 5
##   Month   Day  Year Decade Date      
##   <dbl> <dbl> <dbl>  <dbl> <date>    
## 1     4     2  1600   1600 1600-04-02
## 2     4    22  1601   1600 1601-04-22
## 3     4     7  1602   1600 1602-04-07
## 4     3    30  1603   1600 1603-03-30
## 5     4    18  1604   1600 1604-04-18
## 6     4    10  1605   1600 1605-04-10
tail(easter_dates)
## # A tibble: 6 × 5
##   Month   Day  Year Decade Date      
##   <dbl> <dbl> <dbl>  <dbl> <date>    
## 1     4     4  2094   2090 2094-04-04
## 2     4    24  2095   2090 2095-04-24
## 3     4    15  2096   2090 2096-04-15
## 4     3    31  2097   2090 2097-03-31
## 5     4    20  2098   2090 2098-04-20
## 6     4    12  2099   2090 2099-04-12

The updated dataset looks to have the correct date and corresponding decade variables now.

Analyze & Visualize the Data

Now that we have all the data needed to answer the question, “if it rained on Easter Sunday, did it rain the next seven Sundays?”

Recalling that this study is only using the Dubuque station data. Therefore, only the Easter dates that the Dubuque weather station has data for are needed. The Easter dates will be filtered for only the dates in the Dubuque weather station data using the code below.

# filter Easter dates
dbq_easter_dates <- easter_dates %>% filter(Date %in% dbq_data$date)

In addition to the Easter Sunday dates, the dates for the next seven Sundays following each Easter Sunday are also needed. These dates will be calculated by simply adding weeks to the Easter Sunday date. These weeks will be added to the dataset as new variables.

# calculate next 7 Sundays
dbq_easter_Weeks <- dbq_easter_dates %>%
  mutate(week_1 = Date + weeks(1)) %>% 
  mutate(week_2 = Date + weeks(2)) %>%
  mutate(week_3 = Date + weeks(3)) %>%
  mutate(week_4 = Date + weeks(4)) %>%
  mutate(week_5 = Date + weeks(5)) %>%
  mutate(week_6 = Date + weeks(6)) %>%
  mutate(week_7 = Date + weeks(7))

Find How Many Easter Sundays had Rain

Before jumping to answer the question, the first question would be how many Easter Sundays and what percentage of Easter Sundays did it rain? This is important to check to know if the dataset is going to be large enough to answer the question reasonably accurately. This will be done by filtering the weather data for the Easter Sunday dates and calculating how many of those Easter Sundays that it rained and the percentage of them that it rained.

# filter weather data to only Easter Sunday
dbq_easter_data <- dbq_data %>% filter(date %in% dbq_easter_Weeks$Date)

# calculate percentage of Easter Sundays that it rained
easter_rain_percentage <- data.frame(event = c("Rain", "No Rain"), number_of_sundays = c(sum( dbq_easter_data$precip_in > 0 ), sum( dbq_easter_data$precip_in == 0 )), percentage = c(sum( dbq_easter_data$precip_in > 0 ) / nrow(dbq_easter_data) * 100, sum( dbq_easter_data$precip_in == 0 ) / nrow(dbq_easter_data) * 100))

Below is a plot showing the percentages for the Easter Sundays that had no rain and rain.

# plot Easter Sunday rain percentage
ggplot(data = easter_rain_percentage) +
  geom_col(mapping = aes(x = event, y = percentage), color = "steelblue", fill = "steelblue") +
  geom_text(mapping = aes(x = event, y = percentage-1.5, label = paste0( round(percentage, 1), "%")), color = "white") +
  scale_y_continuous(breaks = seq(0,60,10)) + 
  labs(title = "Dubuque (DBQ) Station", subtitle = paste0(nrow(dbq_easter_data), " Easter Sunday Observations"), x = "Easter Sunday", y = "Percentage (%)")

Below is a plot showing the number of Easter Sundays that had no rain and rain.

# plot number of Sundays with rain
ggplot(data = easter_rain_percentage) + 
  geom_col(mapping = aes(x = event, y = number_of_sundays), color = "steelblue", fill = "steelblue") + 
  geom_text(mapping = aes(x = event, y = number_of_sundays-1.5, label = number_of_sundays), color = "white") +
  labs(title = "Dubuque (DBQ) Station", subtitle = paste0(nrow(dbq_easter_data), " Easter Sunday Observations"), x = "Easter Sunday", y = "Number of Sundays")

As seen from the plots 32 of the 75 Easter Sundays (42.7%) had rain, so just a little less than half. That seems like a large enough percentage that the data can provide some meaningful insight for answering the question.

Rain on Easter Sunday → Number of Following Sundays it Rained

Now let’s answer the question, if it rained on Easter Sunday, did it rain the next 7 Sundays?

To answer this question the Dubuque station data will be filtered for only Easter Sundays that had a precipitation greater than zero, i.e., any recorded precipitation regardless of how much.

# expect the dates to be equal, but give an error if they are not equal
if ( ! identical(dbq_easter_data$date, dbq_easter_Weeks$Date) ) {
  # give an error message
  stop("The dates are not equal, check the datasets")
}

# first filter the Easter weeks to only the ones that it rained for
dbq_easter_rain <- dbq_easter_Weeks[ dbq_easter_data$precip_in > 0,]

Taking the filtered rainy Easter Sundays dataset, the dataset will be converted into a long format to put the dates of the seven Sundays following Easter Sunday into a column named “week_date”. Then using the long-formatted data, the data will be joined together with the weather data and a new variable column will be created named “rained” that will be a logical indicating if it rained (TRUE), i.e., precip_in > 0, on that day.

# pivot data into long format
dbq_easter_rain <- pivot_longer(data = dbq_easter_rain, cols = starts_with("week"), names_to = "week_number", values_to = "week_date")

# join Easter dates & weather data -- then add variable to check if it rained on the date
dbq_easter_rain_data <- inner_join(dbq_easter_rain, dbq_data, by = c("week_date" = "date")) %>% mutate(rained = precip_in > 0)

Now the dataset contains the dates and weather data for the seven Sundays following Easter Sunday for all Easter Sundays that it rained on. The next step is to calculate how many of the seven Sundays following Easter it rained.

If the old saying “if it rains on Easter Sunday it will rain the next seven Sundays” is true, then this calculated total should add up to be seven. The code below will sum up the number of Sundays out of the first seven Sundays following Easter that it rained.

# calculate how many of the seven sundays that it rained
dbq_easter_rain_summary <- dbq_easter_rain_data %>% group_by(Date) %>% summarise(Decade = median(Decade), number_rain_days = sum(rained))
# calculate year number for the decaded
dbq_easter_rain_summary <- dbq_easter_rain_summary %>% mutate(year = year(Date))

The visualization below shows the number of Sundays out of the first seven Sundays following Easter Sunday that it rained when it rained on Easter Sunday. The plots are broken up by decade to make them more readable.

# function to calculate breaks
calc_breaks <- function(x) {
  # calculate decade
  dec = as.integer( floor(median(x)/10) * 10 )
  # check if 1951
  if ( dec == 1950L ) {
    # set limits
    return( seq(1951, dec+9, 1) )
  } else if ( dec == 2020L ) {
    # set limits -- 2025 is the last year with data being this was created in 2025
    return( seq(dec, 2025, 1) )
  } else {
    # create breaks
    return( seq(dec, dec+9, 1) )
  } # end if
  # create breaks
  #seq(dec, dec+9, 1)
} # end calc_breaks

# function to calculate limits
calc_limits <- function(x) {
  # calculate decade
  dec = as.integer( floor(median(x)/10) * 10 )
  # check if 1951
  if ( dec == 1950L ) {
    # set limits
    return( c(1951, dec+9) )
  } else if ( dec == 2020L ) {
    # set limits -- 2025 is the last year with data being this was created in 2025
    return( c(dec, 2025) )
  } else {
    # create limits
    return( c(dec, dec+9) )
  } # end if
} # end calc_limits

# create x scale list 
x_scale_list <- lapply(unique(dbq_easter_rain_summary$Decade), function(dec) {
  new_formula(
    lhs = expr(Decade == !!dec),
    rhs = scale_x_continuous(limits = calc_limits, breaks = calc_breaks)
  )
})


# create y scale list
y_scale_list <- lapply(unique(dbq_easter_rain_summary$Decade), function(dec) {
  scale_y_continuous(limits = c(0,7), breaks = 0:7)
})


# plot summary -- by year and facet by decade
ggplot(data = dbq_easter_rain_summary) +
  geom_col(mapping = aes(x = year, y = number_rain_days), color = "steelblue", fill = "steelblue") +
  facet_wrap(~ Decade, scales = "free") +
  facetted_pos_scales(x = x_scale_list, y = y_scale_list) +
  theme(panel.grid.minor.y = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + 
  labs(title = "Number of Sundays following rain on Easter Sunday that also had rain (Dubuque 1951-2025)",
       subtitle = "Only the first 7 Sundays following Easter Sunday are considered.",
       x = "Year (years with no data means it did not rain on Easter Sunday that year)",
       y = "Number of Sundays it rained following rain on Easter Sunday")

Conclusion

Viewing the charts above, it is clearly seen that 2009 was the only year that it rained on Easter Sunday and rained the seven Sundays following Easter. There was only one year out of 32 when it rained on Easter Sunday and rained the following seven Sundays, which was 2009. Thus, it seems the old saying, “If it rains on Easter Sunday it will rain the next seven Sundays” is clearly not true for Dubuque based on the available 75 years of weather data from 1951-2025.