Initial Analysis

This is a very fun data set to work with, as to the human eye it is a very easily interpretable dataset but it needs some serious coercion to be usable for analysis by a machine. Kevin Havis was kind enough to provide a csv in his post, and deserves credit for said portion. As the raw data can be easily copied from the source into a spreadsheet it seems unnecessary to repeat this step.

Looking at the data we can immediately see some issues. There are several rows that repeat the column names, there is missing future data at the end, the snowfall measurements are not numeric due to the “T” values, and while the data appears like a good candidate for a time series the columns are split in a non-annualized way. We will start with the simplest cleanup first, and then proceed to the more difficult task of coercing the data into a tsibble.

raw_data <- read_csv('https://raw.githubusercontent.com/Tillmawitz/data_607/refs/heads/main/project_2/buffalo_weather.csv')
## Rows: 98 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): SEASON, JUL, AUG, SEP, OCT, NOV, DEC, JAN, FEB, MAR, APR, MAY, JUN...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw_data
## # A tibble: 98 × 14
##    SEASON  JUL   AUG   SEP   OCT   NOV   DEC   JAN   FEB   MAR   APR   MAY  
##    <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 1940-41 0     0     0     T     17.5  12.1  17.3  23.1  9.3   T     0    
##  2 1941-42 0     0     0     T     5     7.8   31    28    13.7  4.1   0    
##  3 1942-43 0     0     0     T     8.7   26.7  16.9  17.7  10.4  5.1   T    
##  4 1943-44 0     0     0     1.5   13.6  1.7   3.4   24.6  10.5  2.7   0    
##  5 1944-45 0     0     0     0     3.9   35.1  50.6  23.3  5.8   T     2    
##  6 1945-46 0     0     0     T     25.2  51.1  10.7  23.5  T     T     0    
##  7 1946-47 0     0     0     0     T     11.9  13    22.2  13.5  4     0.8  
##  8 1947-48 0     0     0     0     9.9   4.3   16.7  7     4.2   T     T    
##  9 1948-49 0     0     0     T     1.3   7     11.8  5.2   14.3  0.5   0    
## 10 1949-50 0     0     0     0     28.6  9.5   14.8  19.3  13.7  2.8   0    
## # ℹ 88 more rows
## # ℹ 2 more variables: JUN <chr>, ANNUAL <chr>

Simple Cleanup

Our first step is to simply remove the rows that consist of the duplicate headers.

simply_clean <- raw_data |>
  filter(SEASON != "SEASON")

simply_clean
## # A tibble: 90 × 14
##    SEASON  JUL   AUG   SEP   OCT   NOV   DEC   JAN   FEB   MAR   APR   MAY  
##    <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 1940-41 0     0     0     T     17.5  12.1  17.3  23.1  9.3   T     0    
##  2 1941-42 0     0     0     T     5     7.8   31    28    13.7  4.1   0    
##  3 1942-43 0     0     0     T     8.7   26.7  16.9  17.7  10.4  5.1   T    
##  4 1943-44 0     0     0     1.5   13.6  1.7   3.4   24.6  10.5  2.7   0    
##  5 1944-45 0     0     0     0     3.9   35.1  50.6  23.3  5.8   T     2    
##  6 1945-46 0     0     0     T     25.2  51.1  10.7  23.5  T     T     0    
##  7 1946-47 0     0     0     0     T     11.9  13    22.2  13.5  4     0.8  
##  8 1947-48 0     0     0     0     9.9   4.3   16.7  7     4.2   T     T    
##  9 1948-49 0     0     0     T     1.3   7     11.8  5.2   14.3  0.5   0    
## 10 1949-50 0     0     0     0     28.6  9.5   14.8  19.3  13.7  2.8   0    
## # ℹ 80 more rows
## # ℹ 2 more variables: JUN <chr>, ANNUAL <chr>

Next up we can use pivot longer to get season and month in each row to get an orderly progression in time. Additionally we remove the annual aggregation as we will be able to calculate it from our cleaned data.

pivoted <- simply_clean |>
  pivot_longer(cols = (JUL:JUN), names_to = "month", values_to = "snowfall") |>
  select(!ANNUAL)
pivoted
## # A tibble: 1,080 × 3
##    SEASON  month snowfall
##    <chr>   <chr> <chr>   
##  1 1940-41 JUL   0       
##  2 1940-41 AUG   0       
##  3 1940-41 SEP   0       
##  4 1940-41 OCT   T       
##  5 1940-41 NOV   17.5    
##  6 1940-41 DEC   12.1    
##  7 1940-41 JAN   17.3    
##  8 1940-41 FEB   23.1    
##  9 1940-41 MAR   9.3     
## 10 1940-41 APR   T       
## # ℹ 1,070 more rows

Complex Cleaning

In order to get the data nicely annualized we create a separate column for year from SEASON based on whether the month is in the first or last half of the season. We can keep the SEASON column in case we want to track by “winter” instead of “year”. There is some slight manual cleanup we need to do to change 1900 to 2000 due to how we parsed the season field.

begin_year <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN")
fetch_year <- pivoted |>
  mutate(
    year = case_when(
      month %in% begin_year ~ str_replace_all(SEASON, r"(\d{2}-)", ""),
      .default = str_extract(SEASON, r"(\d{4})")
    )
  ) |>
  mutate(year = if_else(year == "1900", "2000", year))

fetch_year
## # A tibble: 1,080 × 4
##    SEASON  month snowfall year 
##    <chr>   <chr> <chr>    <chr>
##  1 1940-41 JUL   0        1940 
##  2 1940-41 AUG   0        1940 
##  3 1940-41 SEP   0        1940 
##  4 1940-41 OCT   T        1940 
##  5 1940-41 NOV   17.5     1940 
##  6 1940-41 DEC   12.1     1940 
##  7 1940-41 JAN   17.3     1941 
##  8 1940-41 FEB   23.1     1941 
##  9 1940-41 MAR   9.3      1941 
## 10 1940-41 APR   T        1941 
## # ℹ 1,070 more rows

Using lubridate we can convert the year and month columns to a yearmonth type for easy conversion to a tsibble. There is additional manual cleanup of an error in the data which was found in a later verification step but is done here as it was easier.

pre_series <- fetch_year |>
  mutate(date = yearmonth(paste(year, month, sep="-"))) |>
  mutate(snowfall = gsub("[^0-9.T]", "", snowfall)) |>
  select(date, snowfall, SEASON) |>
  rename(season = SEASON)
 
rows_update(pre_series, tibble(date = yearmonth("1976 Apr"),snowfall = "2.5", season = "1975-76"))
## Matching, by = "date"
## # A tibble: 1,080 × 3
##        date snowfall season 
##       <mth> <chr>    <chr>  
##  1 1940 Jul 0        1940-41
##  2 1940 Aug 0        1940-41
##  3 1940 Sep 0        1940-41
##  4 1940 Oct T        1940-41
##  5 1940 Nov 17.5     1940-41
##  6 1940 Dec 12.1     1940-41
##  7 1941 Jan 17.3     1940-41
##  8 1941 Feb 23.1     1940-41
##  9 1941 Mar 9.3      1940-41
## 10 1941 Apr T        1940-41
## # ℹ 1,070 more rows
time_series <- pre_series |>  
  as_tsibble(key = season, index = date)
time_series
## # A tsibble: 1,080 x 3 [1M]
## # Key:       season [90]
##        date snowfall season 
##       <mth> <chr>    <chr>  
##  1 1940 Jul 0        1940-41
##  2 1940 Aug 0        1940-41
##  3 1940 Sep 0        1940-41
##  4 1940 Oct T        1940-41
##  5 1940 Nov 17.5     1940-41
##  6 1940 Dec 12.1     1940-41
##  7 1941 Jan 17.3     1940-41
##  8 1941 Feb 23.1     1940-41
##  9 1941 Mar 9.3      1940-41
## 10 1941 Apr T        1940-41
## # ℹ 1,070 more rows

Now that we have a nice time series, we need to decide what to do with the T values in snowfall. The T stands for a trace amount of snowfall, indicating that snow fell but the accumulation was so small as to be impossible to measure. Looking at the origin of the data, we can see that the limit for measurement is 1 inch of accumulation. Knowing this, it is a standard interpolation strategy to generate random values between 0 and this minimum, so we fill in the missing data this way rounded to the nearest tenth in keeping with the accuracy of the existing data.

set.seed(836465)

prelim_tidied <- time_series |>
  mutate(snowfall = replace(snowfall, snowfall == "T", round(runif(1),1))) |>
  mutate(snowfall = as.double(snowfall))

prelim_tidied
## # A tsibble: 1,080 x 3 [1M]
## # Key:       season [90]
##        date snowfall season 
##       <mth>    <dbl> <chr>  
##  1 1940 Jul      0   1940-41
##  2 1940 Aug      0   1940-41
##  3 1940 Sep      0   1940-41
##  4 1940 Oct      0.4 1940-41
##  5 1940 Nov     17.5 1940-41
##  6 1940 Dec     12.1 1940-41
##  7 1941 Jan     17.3 1940-41
##  8 1941 Feb     23.1 1940-41
##  9 1941 Mar      9.3 1940-41
## 10 1941 Apr      0.4 1940-41
## # ℹ 1,070 more rows

Finally, we can see if there are any more missed values. This is the original location the error April 1976 was found, but fixing the error here is significantly more complicated than in earlier steps. Now we see that the only rows missing data are those that are empty in the original dataset, indicating they should not be considered in any analysis.

prelim_tidied |>
  filter(is.na(snowfall))
## # A tsibble: 73 x 3 [1M]
## # Key:       season [7]
##        date snowfall season 
##       <mth>    <dbl> <chr>  
##  1 2024 Jun       NA 2023-24
##  2 2024 Jul       NA 2024-25
##  3 2024 Aug       NA 2024-25
##  4 2024 Sep       NA 2024-25
##  5 2024 Oct       NA 2024-25
##  6 2024 Nov       NA 2024-25
##  7 2024 Dec       NA 2024-25
##  8 2025 Jan       NA 2024-25
##  9 2025 Feb       NA 2024-25
## 10 2025 Mar       NA 2024-25
## # ℹ 63 more rows

Dropping these rows gives us our final time series, and an initial plot of the snowfall which, as one may expect, shows a high degree of seasonality. This data is ready for modeling or further analysis, but this is outside the scope of this analysis.

tidied_data <- prelim_tidied |>
  filter(!is.na(snowfall))

tidied_data |>
  ggplot(aes(x = date, y = snowfall)) +
  geom_line()

We can, however, take a look at the snowiest months in the dataset to answer the original question of when the worst storms occurred, as well as take a look at what the snowiest winters were.

tidied_data |>
  as_tibble() |>
  group_by(season) |>
  summarise(total_snow = sum(snowfall)) |>
  slice_max(order_by = total_snow, n = 5)
## # A tibble: 5 × 2
##   season  total_snow
##   <chr>        <dbl>
## 1 1976-77       199.
## 2 2000-01       159.
## 3 1977-78       155.
## 4 1995-96       142.
## 5 2022-23       134.
tidied_data |>
  slice_max(order_by = snowfall, n = 5)
## # A tsibble: 5 x 3 [1M]
## # Key:       season [5]
##       date snowfall season 
##      <mth>    <dbl> <chr>  
## 1 2001 Dec     82.7 2001-02
## 2 1985 Dec     68.4 1985-86
## 3 1977 Jan     68.3 1976-77
## 4 1985 Jan     65.9 1984-85
## 5 1999 Jan     65.1 1998-99