This project focuses on analyzing snowfall patterns in Buffalo, New York, from historical weather data. The main objective is to import a messy untidy data set, and clean the data accordingly as to allow for proper analysis.
The dataset used for this analysis includes monthly snowfall measurements from July to June for each season, alongside the total annual snowfall. Key steps include cleaning the data to handle missing values and erroneous entries, pivoting the data into long format, and visualizing snowfall trends over time in order to locate the largest storm events.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
Several steps were taken to ensure the data set was clean and ready for analysis:
Removing Missing Data:
Rows where all monthly snowfall values were missing were filtered out to
eliminate incomplete records.
Filtering Header Rows:
Some rows included text headers. These were removed by filtering out any
rows that only contained non-numeric characters.
Replacing “T” Values:
“T” indicates trace amounts of snowfall. These values were replaced with
0 to reflect negligible snowfall.
Correcting Input Errors:
One entry contained problematic input (e.g., “2,.5” instead of 2.5),
which needed to be corrected.
Numeric Conversion:
After resolving input issues, the data was converted to
numeric.
Pivoting Data:
The data was reshaped from wide to long format for easier analysis, with
each month’s snowfall recorded as a separate row.
weather_df <- read_csv("https://raw.githubusercontent.com/awrubes/Data607_Project2/main/buffalo_weather.csv")
## Rows: 98 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): SEASON, JUL, AUG, SEP, OCT, NOV, DEC, JAN, FEB, MAR, APR, MAY, JUN...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(weather_df)
## # A tibble: 6 × 14
## SEASON JUL AUG SEP OCT NOV DEC JAN FEB MAR APR MAY JUN
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1940-… 0 0 0 T 17.5 12.1 17.3 23.1 9.3 T 0 0
## 2 1941-… 0 0 0 T 5 7.8 31 28 13.7 4.1 0 0
## 3 1942-… 0 0 0 T 8.7 26.7 16.9 17.7 10.4 5.1 T 0
## 4 1943-… 0 0 0 1.5 13.6 1.7 3.4 24.6 10.5 2.7 0 0
## 5 1944-… 0 0 0 0 3.9 35.1 50.6 23.3 5.8 T 2 0
## 6 1945-… 0 0 0 T 25.2 51.1 10.7 23.5 T T 0 0
## # ℹ 1 more variable: ANNUAL <chr>
#handle rows with missing values
weather_df_missing <- weather_df %>%
filter(!if_all(JUL:ANNUAL, is.na))
#get rid of rows that match header rows
weather_df_rows <- weather_df_missing %>%
filter(if_any(JUL:ANNUAL, ~ !grepl("[A-Z]", .x)))
#replace T values with 0
weather_df_rows_num <- weather_df_rows %>%
mutate(across(JUL:ANNUAL, ~ ifelse(.x == "T", 0, .x)))
weather_df_rows_num <- weather_df_rows_num %>%
mutate(across(JUL:ANNUAL, ~ ifelse(.x == "2,.5", 2.5, .x)))
#change to numeric and replace missing value
weather_df_missing_final <- weather_df_rows_num %>%
mutate(across(JUL:ANNUAL, ~ as.numeric(.x)))%>%
mutate(across(JUL:ANNUAL, ~ replace_na(.x, 0)))
#before pivoting need to remove ANNUAL column
weather_df_pivot <- weather_df_missing_final %>%
select(-ANNUAL)
head(weather_df_pivot)
## # A tibble: 6 × 13
## SEASON JUL AUG SEP OCT NOV DEC JAN FEB MAR APR MAY JUN
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1940-… 0 0 0 0 17.5 12.1 17.3 23.1 9.3 0 0 0
## 2 1941-… 0 0 0 0 5 7.8 31 28 13.7 4.1 0 0
## 3 1942-… 0 0 0 0 8.7 26.7 16.9 17.7 10.4 5.1 0 0
## 4 1943-… 0 0 0 1.5 13.6 1.7 3.4 24.6 10.5 2.7 0 0
## 5 1944-… 0 0 0 0 3.9 35.1 50.6 23.3 5.8 0 2 0
## 6 1945-… 0 0 0 0 25.2 51.1 10.7 23.5 0 0 0 0
Now with the data looking more presentable, we can pivot the data from a wide to long format for easier analysis. We’ll also create a factor so that the months appear in chronological order for future visualizations and analysis.
#pivot data to long format
weather_pivot <- weather_df_pivot %>%
pivot_longer(
cols = JUL:JUN,
names_to="Month",
values_to="Snowfall"
)
print(weather_pivot)
## # A tibble: 1,008 × 3
## SEASON Month Snowfall
## <chr> <chr> <dbl>
## 1 1940-41 JUL 0
## 2 1940-41 AUG 0
## 3 1940-41 SEP 0
## 4 1940-41 OCT 0
## 5 1940-41 NOV 17.5
## 6 1940-41 DEC 12.1
## 7 1940-41 JAN 17.3
## 8 1940-41 FEB 23.1
## 9 1940-41 MAR 9.3
## 10 1940-41 APR 0
## # ℹ 998 more rows
#create factor so that the months are organized by seasaon, JUL-JUN
weather_pivot$Month <- factor(weather_pivot$Month,
levels = c("JUL", "AUG", "SEP", "OCT", "NOV", "DEC",
"JAN", "FEB", "MAR", "APR", "MAY", "JUN"))
To explore the snowfall distribution across different months, a boxplot was generated. This visualization illustrates how snowfall varies between months, highlighting the winter months where snowfall is most concentrated (typically from November to February).
#snowfall distribution by month over all years
ggplot(weather_pivot, aes(x = Month, y = Snowfall)) +
geom_boxplot() +
labs(x = "Month", y = "Snowfall (inches)", title = "Snowfall Distribution by Month")
The analysis also aimed to identify the largest storms. I thought it would be helpful to use a scatter plot in order to visualize the data to allow me to easily spot outliers. I created the the chart excluding small trace amounts of snow, since we’re only interested in large snowfall amounts.
#filter snowfall
filtered_df <- weather_pivot %>%
filter(Snowfall > 5)
# Scatterplot of monthly snowfall over time
ggplot(filtered_df, aes(x = SEASON, y = Snowfall, color = Month)) +
geom_point(size = 2) +
labs(x = "Start Year", y = "Snowfall (inches)", title = "Monthly Snowfall Over Time")
But even still, given the large amount of data spanning a large number of years, the horizontal axis is crowded and difficult to discern. That said, you can see that there are a number of large snowfall events but you can’t make out when those might have actually occurred. So, taking a different approach, I decided to capture the top 5% of snow amounts.
#get top 5% for snowfall
top_5_threshold <- quantile(weather_pivot$Snowfall, 0.95, na.rm = TRUE)
# Filter the data to get the top 5% of snowfall and sort descending order
top_5_snowfall <- weather_pivot %>%
filter(Snowfall >= top_5_threshold)%>%
arrange(desc(Snowfall))
# View the filtered data
print(top_5_snowfall)
## # A tibble: 51 × 3
## SEASON Month Snowfall
## <chr> <fct> <dbl>
## 1 2001-02 DEC 82.7
## 2 1985-86 DEC 68.4
## 3 1976-77 JAN 68.3
## 4 1984-85 JAN 65.9
## 5 1998-99 JAN 65.1
## 6 2022-23 DEC 64.7
## 7 2018-19 JAN 63.4
## 8 1995-96 DEC 61.2
## 9 1976-77 DEC 60.7
## 10 1977-78 JAN 56.5
## # ℹ 41 more rows
Through data cleaning and analysis, we were able to identify trends in Buffalo’s snowfall patterns and pinpoint the most significant snowfall events. The months of December through February tend to experience the heaviest snowfall, with certain years standing out as especially stormy. The analysis of top 5% snowfall events offers insight into Buffalo’s most severe winter storms, which is crucial for understanding and preparing for extreme weather conditions.