Overview

This project focuses on analyzing snowfall patterns in Buffalo, New York, from historical weather data. The main objective is to import a messy untidy data set, and clean the data accordingly as to allow for proper analysis.

The dataset used for this analysis includes monthly snowfall measurements from July to June for each season, alongside the total annual snowfall. Key steps include cleaning the data to handle missing values and erroneous entries, pivoting the data into long format, and visualizing snowfall trends over time in order to locate the largest storm events.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

Data Import and Cleaning

Several steps were taken to ensure the data set was clean and ready for analysis:

  1. Removing Missing Data:
    Rows where all monthly snowfall values were missing were filtered out to eliminate incomplete records.

  2. Filtering Header Rows:
    Some rows included text headers. These were removed by filtering out any rows that only contained non-numeric characters.

  3. Replacing “T” Values:
    “T” indicates trace amounts of snowfall. These values were replaced with 0 to reflect negligible snowfall.

  4. Correcting Input Errors:
    One entry contained problematic input (e.g., “2,.5” instead of 2.5), which needed to be corrected.

  5. Numeric Conversion:
    After resolving input issues, the data was converted to numeric.

  6. Pivoting Data:
    The data was reshaped from wide to long format for easier analysis, with each month’s snowfall recorded as a separate row.

weather_df <- read_csv("https://raw.githubusercontent.com/awrubes/Data607_Project2/main/buffalo_weather.csv")
## Rows: 98 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): SEASON, JUL, AUG, SEP, OCT, NOV, DEC, JAN, FEB, MAR, APR, MAY, JUN...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(weather_df)
## # A tibble: 6 × 14
##   SEASON JUL   AUG   SEP   OCT   NOV   DEC   JAN   FEB   MAR   APR   MAY   JUN  
##   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1940-… 0     0     0     T     17.5  12.1  17.3  23.1  9.3   T     0     0    
## 2 1941-… 0     0     0     T     5     7.8   31    28    13.7  4.1   0     0    
## 3 1942-… 0     0     0     T     8.7   26.7  16.9  17.7  10.4  5.1   T     0    
## 4 1943-… 0     0     0     1.5   13.6  1.7   3.4   24.6  10.5  2.7   0     0    
## 5 1944-… 0     0     0     0     3.9   35.1  50.6  23.3  5.8   T     2     0    
## 6 1945-… 0     0     0     T     25.2  51.1  10.7  23.5  T     T     0     0    
## # ℹ 1 more variable: ANNUAL <chr>
#handle rows with missing values
weather_df_missing <- weather_df %>%
  filter(!if_all(JUL:ANNUAL, is.na))

#get rid of rows that match header rows
weather_df_rows <- weather_df_missing %>%
  filter(if_any(JUL:ANNUAL, ~ !grepl("[A-Z]", .x)))

#replace T values with 0
weather_df_rows_num <- weather_df_rows %>%
  mutate(across(JUL:ANNUAL, ~ ifelse(.x == "T", 0, .x)))

weather_df_rows_num <- weather_df_rows_num %>%
  mutate(across(JUL:ANNUAL, ~ ifelse(.x == "2,.5", 2.5, .x)))

#change to numeric and replace missing value 
weather_df_missing_final <- weather_df_rows_num %>%
  mutate(across(JUL:ANNUAL, ~ as.numeric(.x)))%>%
  mutate(across(JUL:ANNUAL, ~ replace_na(.x, 0)))

#before pivoting need to remove ANNUAL column
weather_df_pivot <- weather_df_missing_final %>%
  select(-ANNUAL)

head(weather_df_pivot)
## # A tibble: 6 × 13
##   SEASON   JUL   AUG   SEP   OCT   NOV   DEC   JAN   FEB   MAR   APR   MAY   JUN
##   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1940-…     0     0     0   0    17.5  12.1  17.3  23.1   9.3   0       0     0
## 2 1941-…     0     0     0   0     5     7.8  31    28    13.7   4.1     0     0
## 3 1942-…     0     0     0   0     8.7  26.7  16.9  17.7  10.4   5.1     0     0
## 4 1943-…     0     0     0   1.5  13.6   1.7   3.4  24.6  10.5   2.7     0     0
## 5 1944-…     0     0     0   0     3.9  35.1  50.6  23.3   5.8   0       2     0
## 6 1945-…     0     0     0   0    25.2  51.1  10.7  23.5   0     0       0     0

Now with the data looking more presentable, we can pivot the data from a wide to long format for easier analysis. We’ll also create a factor so that the months appear in chronological order for future visualizations and analysis.

#pivot data to long format
weather_pivot <- weather_df_pivot %>%
  pivot_longer(
    cols = JUL:JUN,
    names_to="Month",
    values_to="Snowfall"
  )

print(weather_pivot)
## # A tibble: 1,008 × 3
##    SEASON  Month Snowfall
##    <chr>   <chr>    <dbl>
##  1 1940-41 JUL        0  
##  2 1940-41 AUG        0  
##  3 1940-41 SEP        0  
##  4 1940-41 OCT        0  
##  5 1940-41 NOV       17.5
##  6 1940-41 DEC       12.1
##  7 1940-41 JAN       17.3
##  8 1940-41 FEB       23.1
##  9 1940-41 MAR        9.3
## 10 1940-41 APR        0  
## # ℹ 998 more rows
#create factor so that the months are organized by seasaon, JUL-JUN
weather_pivot$Month <- factor(weather_pivot$Month, 
                                 levels = c("JUL", "AUG", "SEP", "OCT", "NOV", "DEC", 
                                            "JAN", "FEB", "MAR", "APR", "MAY", "JUN"))

Visualization and Analysis

To explore the snowfall distribution across different months, a boxplot was generated. This visualization illustrates how snowfall varies between months, highlighting the winter months where snowfall is most concentrated (typically from November to February).

#snowfall distribution by month over all years
ggplot(weather_pivot, aes(x = Month, y = Snowfall)) +
  geom_boxplot() +
  labs(x = "Month", y = "Snowfall (inches)", title = "Snowfall Distribution by Month")

The analysis also aimed to identify the largest storms. I thought it would be helpful to use a scatter plot in order to visualize the data to allow me to easily spot outliers. I created the the chart excluding small trace amounts of snow, since we’re only interested in large snowfall amounts.

#filter snowfall
filtered_df <- weather_pivot %>%
  filter(Snowfall > 5)

# Scatterplot of monthly snowfall over time
ggplot(filtered_df, aes(x = SEASON, y = Snowfall, color = Month)) +
  geom_point(size = 2) +
  labs(x = "Start Year", y = "Snowfall (inches)", title = "Monthly Snowfall Over Time") 

But even still, given the large amount of data spanning a large number of years, the horizontal axis is crowded and difficult to discern. That said, you can see that there are a number of large snowfall events but you can’t make out when those might have actually occurred. So, taking a different approach, I decided to capture the top 5% of snow amounts.

#get top 5% for snowfall
top_5_threshold <- quantile(weather_pivot$Snowfall, 0.95, na.rm = TRUE)

# Filter the data to get the top 5% of snowfall and sort descending order
top_5_snowfall <- weather_pivot %>%
  filter(Snowfall >= top_5_threshold)%>%
  arrange(desc(Snowfall))

# View the filtered data 
print(top_5_snowfall)
## # A tibble: 51 × 3
##    SEASON  Month Snowfall
##    <chr>   <fct>    <dbl>
##  1 2001-02 DEC       82.7
##  2 1985-86 DEC       68.4
##  3 1976-77 JAN       68.3
##  4 1984-85 JAN       65.9
##  5 1998-99 JAN       65.1
##  6 2022-23 DEC       64.7
##  7 2018-19 JAN       63.4
##  8 1995-96 DEC       61.2
##  9 1976-77 DEC       60.7
## 10 1977-78 JAN       56.5
## # ℹ 41 more rows

Through data cleaning and analysis, we were able to identify trends in Buffalo’s snowfall patterns and pinpoint the most significant snowfall events. The months of December through February tend to experience the heaviest snowfall, with certain years standing out as especially stormy. The analysis of top 5% snowfall events offers insight into Buffalo’s most severe winter storms, which is crucial for understanding and preparing for extreme weather conditions.