This is a very fun data set to work with, as to the human eye it is a very easily interpretable dataset but it needs some serious coercion to be usable for analysis by a machine. Kevin Havis was kind enough to provide a csv in his post, and deserves credit for said portion. As the raw data can be easily copied from the source into a spreadsheet it seems unnecessary to repeat this step.
Looking at the data we can immediately see some issues. There are several rows that repeat the column names, there is missing future data at the end, the snowfall measurements are not numeric due to the “T” values, and while the data appears like a good candidate for a time series the columns are split in a non-annualized way. We will start with the simplest cleanup first, and then proceed to the more difficult task of coercing the data into a tsibble.
raw_data <- read_csv('https://raw.githubusercontent.com/Tillmawitz/data_607/refs/heads/main/project_2/buffalo_weather.csv')
## Rows: 98 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): SEASON, JUL, AUG, SEP, OCT, NOV, DEC, JAN, FEB, MAR, APR, MAY, JUN...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw_data
## # A tibble: 98 × 14
## SEASON JUL AUG SEP OCT NOV DEC JAN FEB MAR APR MAY
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1940-41 0 0 0 T 17.5 12.1 17.3 23.1 9.3 T 0
## 2 1941-42 0 0 0 T 5 7.8 31 28 13.7 4.1 0
## 3 1942-43 0 0 0 T 8.7 26.7 16.9 17.7 10.4 5.1 T
## 4 1943-44 0 0 0 1.5 13.6 1.7 3.4 24.6 10.5 2.7 0
## 5 1944-45 0 0 0 0 3.9 35.1 50.6 23.3 5.8 T 2
## 6 1945-46 0 0 0 T 25.2 51.1 10.7 23.5 T T 0
## 7 1946-47 0 0 0 0 T 11.9 13 22.2 13.5 4 0.8
## 8 1947-48 0 0 0 0 9.9 4.3 16.7 7 4.2 T T
## 9 1948-49 0 0 0 T 1.3 7 11.8 5.2 14.3 0.5 0
## 10 1949-50 0 0 0 0 28.6 9.5 14.8 19.3 13.7 2.8 0
## # ℹ 88 more rows
## # ℹ 2 more variables: JUN <chr>, ANNUAL <chr>
Our first step is to simply remove the rows that consist of the duplicate headers.
simply_clean <- raw_data |>
filter(SEASON != "SEASON")
simply_clean
## # A tibble: 90 × 14
## SEASON JUL AUG SEP OCT NOV DEC JAN FEB MAR APR MAY
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1940-41 0 0 0 T 17.5 12.1 17.3 23.1 9.3 T 0
## 2 1941-42 0 0 0 T 5 7.8 31 28 13.7 4.1 0
## 3 1942-43 0 0 0 T 8.7 26.7 16.9 17.7 10.4 5.1 T
## 4 1943-44 0 0 0 1.5 13.6 1.7 3.4 24.6 10.5 2.7 0
## 5 1944-45 0 0 0 0 3.9 35.1 50.6 23.3 5.8 T 2
## 6 1945-46 0 0 0 T 25.2 51.1 10.7 23.5 T T 0
## 7 1946-47 0 0 0 0 T 11.9 13 22.2 13.5 4 0.8
## 8 1947-48 0 0 0 0 9.9 4.3 16.7 7 4.2 T T
## 9 1948-49 0 0 0 T 1.3 7 11.8 5.2 14.3 0.5 0
## 10 1949-50 0 0 0 0 28.6 9.5 14.8 19.3 13.7 2.8 0
## # ℹ 80 more rows
## # ℹ 2 more variables: JUN <chr>, ANNUAL <chr>
Next up we can use pivot longer to get season and month in each row to get an orderly progression in time. Additionally we remove the annual aggregation as we will be able to calculate it from our cleaned data.
pivoted <- simply_clean |>
pivot_longer(cols = (JUL:JUN), names_to = "month", values_to = "snowfall") |>
select(!ANNUAL)
pivoted
## # A tibble: 1,080 × 3
## SEASON month snowfall
## <chr> <chr> <chr>
## 1 1940-41 JUL 0
## 2 1940-41 AUG 0
## 3 1940-41 SEP 0
## 4 1940-41 OCT T
## 5 1940-41 NOV 17.5
## 6 1940-41 DEC 12.1
## 7 1940-41 JAN 17.3
## 8 1940-41 FEB 23.1
## 9 1940-41 MAR 9.3
## 10 1940-41 APR T
## # ℹ 1,070 more rows
In order to get the data nicely annualized we create a separate column for year from SEASON based on whether the month is in the first or last half of the season. We can keep the SEASON column in case we want to track by “winter” instead of “year”. There is some slight manual cleanup we need to do to change 1900 to 2000 due to how we parsed the season field.
begin_year <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN")
fetch_year <- pivoted |>
mutate(
year = case_when(
month %in% begin_year ~ str_replace_all(SEASON, r"(\d{2}-)", ""),
.default = str_extract(SEASON, r"(\d{4})")
)
) |>
mutate(year = if_else(year == "1900", "2000", year))
fetch_year
## # A tibble: 1,080 × 4
## SEASON month snowfall year
## <chr> <chr> <chr> <chr>
## 1 1940-41 JUL 0 1940
## 2 1940-41 AUG 0 1940
## 3 1940-41 SEP 0 1940
## 4 1940-41 OCT T 1940
## 5 1940-41 NOV 17.5 1940
## 6 1940-41 DEC 12.1 1940
## 7 1940-41 JAN 17.3 1941
## 8 1940-41 FEB 23.1 1941
## 9 1940-41 MAR 9.3 1941
## 10 1940-41 APR T 1941
## # ℹ 1,070 more rows
Using lubridate we can convert the year and month columns to a yearmonth type for easy conversion to a tsibble. There is additional manual cleanup of an error in the data which was found in a later verification step but is done here as it was easier.
pre_series <- fetch_year |>
mutate(date = yearmonth(paste(year, month, sep="-"))) |>
mutate(snowfall = gsub("[^0-9.T]", "", snowfall)) |>
select(date, snowfall, SEASON) |>
rename(season = SEASON)
rows_update(pre_series, tibble(date = yearmonth("1976 Apr"),snowfall = "2.5", season = "1975-76"))
## Matching, by = "date"
## # A tibble: 1,080 × 3
## date snowfall season
## <mth> <chr> <chr>
## 1 1940 Jul 0 1940-41
## 2 1940 Aug 0 1940-41
## 3 1940 Sep 0 1940-41
## 4 1940 Oct T 1940-41
## 5 1940 Nov 17.5 1940-41
## 6 1940 Dec 12.1 1940-41
## 7 1941 Jan 17.3 1940-41
## 8 1941 Feb 23.1 1940-41
## 9 1941 Mar 9.3 1940-41
## 10 1941 Apr T 1940-41
## # ℹ 1,070 more rows
time_series <- pre_series |>
as_tsibble(key = season, index = date)
time_series
## # A tsibble: 1,080 x 3 [1M]
## # Key: season [90]
## date snowfall season
## <mth> <chr> <chr>
## 1 1940 Jul 0 1940-41
## 2 1940 Aug 0 1940-41
## 3 1940 Sep 0 1940-41
## 4 1940 Oct T 1940-41
## 5 1940 Nov 17.5 1940-41
## 6 1940 Dec 12.1 1940-41
## 7 1941 Jan 17.3 1940-41
## 8 1941 Feb 23.1 1940-41
## 9 1941 Mar 9.3 1940-41
## 10 1941 Apr T 1940-41
## # ℹ 1,070 more rows
Now that we have a nice time series, we need to decide what to do with the T values in snowfall. The T stands for a trace amount of snowfall, indicating that snow fell but the accumulation was so small as to be impossible to measure. Looking at the origin of the data, we can see that the limit for measurement is 1 inch of accumulation. Knowing this, it is a standard interpolation strategy to generate random values between 0 and this minimum, so we fill in the missing data this way rounded to the nearest tenth in keeping with the accuracy of the existing data.
set.seed(836465)
prelim_tidied <- time_series |>
mutate(snowfall = replace(snowfall, snowfall == "T", round(runif(1),1))) |>
mutate(snowfall = as.double(snowfall))
prelim_tidied
## # A tsibble: 1,080 x 3 [1M]
## # Key: season [90]
## date snowfall season
## <mth> <dbl> <chr>
## 1 1940 Jul 0 1940-41
## 2 1940 Aug 0 1940-41
## 3 1940 Sep 0 1940-41
## 4 1940 Oct 0.4 1940-41
## 5 1940 Nov 17.5 1940-41
## 6 1940 Dec 12.1 1940-41
## 7 1941 Jan 17.3 1940-41
## 8 1941 Feb 23.1 1940-41
## 9 1941 Mar 9.3 1940-41
## 10 1941 Apr 0.4 1940-41
## # ℹ 1,070 more rows
Finally, we can see if there are any more missed values. This is the original location the error April 1976 was found, but fixing the error here is significantly more complicated than in earlier steps. Now we see that the only rows missing data are those that are empty in the original dataset, indicating they should not be considered in any analysis.
prelim_tidied |>
filter(is.na(snowfall))
## # A tsibble: 73 x 3 [1M]
## # Key: season [7]
## date snowfall season
## <mth> <dbl> <chr>
## 1 2024 Jun NA 2023-24
## 2 2024 Jul NA 2024-25
## 3 2024 Aug NA 2024-25
## 4 2024 Sep NA 2024-25
## 5 2024 Oct NA 2024-25
## 6 2024 Nov NA 2024-25
## 7 2024 Dec NA 2024-25
## 8 2025 Jan NA 2024-25
## 9 2025 Feb NA 2024-25
## 10 2025 Mar NA 2024-25
## # ℹ 63 more rows
Dropping these rows gives us our final time series, and an initial plot of the snowfall which, as one may expect, shows a high degree of seasonality. This data is ready for modeling or further analysis, but this is outside the scope of this analysis.
tidied_data <- prelim_tidied |>
filter(!is.na(snowfall))
tidied_data |>
ggplot(aes(x = date, y = snowfall)) +
geom_line()
We can, however, take a look at the snowiest months in the dataset to answer the original question of when the worst storms occurred, as well as take a look at what the snowiest winters were.
tidied_data |>
as_tibble() |>
group_by(season) |>
summarise(total_snow = sum(snowfall)) |>
slice_max(order_by = total_snow, n = 5)
## # A tibble: 5 × 2
## season total_snow
## <chr> <dbl>
## 1 1976-77 199.
## 2 2000-01 159.
## 3 1977-78 155.
## 4 1995-96 142.
## 5 2022-23 134.
tidied_data |>
slice_max(order_by = snowfall, n = 5)
## # A tsibble: 5 x 3 [1M]
## # Key: season [5]
## date snowfall season
## <mth> <dbl> <chr>
## 1 2001 Dec 82.7 2001-02
## 2 1985 Dec 68.4 1985-86
## 3 1977 Jan 68.3 1976-77
## 4 1985 Jan 65.9 1984-85
## 5 1999 Jan 65.1 1998-99