1st Data Set

Storm Surges

The first dataset to transform comes from the Meteorological Development Laboratory, and was provided by Kimberly Koon, the author of the discussion post.

The author focuses on ideas to transform the data: “The data is untidy in many ways - for instance, the tables are broken up into years, with each year having its own table. The first column follows the format Year - Storm, and then the rows also contain the year with the name of the storm. This column also appears to have certain tags, like (R) for retired name. Some of the formats are not consistent, such as having “’18-Storm Name” versus “2018-Storm Name”. Date ranges are stored in a “Date” column rather than in a “Storm Date Start” and “Storm Date End” column. The Surge data seems to be inconsistent in units between the tables, with Above Ground Level and Mean Higher High Water used. The Obs column has comma delimited lists, and the Cat,Pres,Dead,$bn column is for some reason one single column instead of having four columns for each variable. Area is also a comma delimited list, and for some reason the area is inconsistent in terms of State, City, etc.”

First, we’ll load the needed libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)

Second, we’ll load the data from a CSV file hosted on GitHub.

storms <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project2/stormdata.csv")

head(storms)
##     X2024.Storm       Date          Storm.Tide   Obs       Guidance
## 1   2024-Helene Sep 23-29    Obs w4: >9.3 mhhw (FEV)  Warn-A6 (1-8)
## 2                          Fcst w6: 15-20 mhhw                     
## 3                                                                  
## 4 2024-Francine     12-Sep          w2: 5 mhhw       Warn-A5 (1-15)
## 5                                                                  
## 6 2024-Gilma-cp     30-Aug          w1: 3 mhhw         PS-Adv 43,46
##        Cat..Pres..Dead...bn             Area
## 1 (Cat4, 938, 221+, $27.5+) N.W. FL, GA, AL,
## 2                             TN, KY, VA, WV
## 3                                           
## 4      (Cat2, 972, 0, $1.5)           LA, MS
## 5                                           
## 6         (Cat4, 949, -, -)               HI

Please note that for this exercise I’ll be focusing on transforming the data for the year 2024.

I will start by getting rid of the Obs and Guidance columns.

# Remove the "Obs" and "Guidance" columns
storms <- storms %>%
  select(-Obs, -Guidance)

Now, I want to split a column that contains four different types of information into four separate columns.

# Separate the "Cat, Pres, Dead, $bn" column into four separate columns
storms <- storms %>%
  separate(`Cat..Pres..Dead...bn`, into = c("Category", "Pressure", "Deaths", "Cost"), sep = ", ", fill = "right", extra = "merge")

Next, I want to transform the first column so that it keeps only the storm names, removing the “2024-” prefix.

# Remove "2024-" prefix from the "2024-Storm" column
storms <- storms %>%
  mutate(`X2024.Storm` = str_replace(`X2024.Storm`, "^2024-", ""))

head(storms)
##   X2024.Storm       Date          Storm.Tide Category Pressure Deaths    Cost
## 1      Helene Sep 23-29    Obs w4: >9.3 mhhw    (Cat4      938   221+ $27.5+)
## 2                        Fcst w6: 15-20 mhhw              <NA>   <NA>    <NA>
## 3                                                         <NA>   <NA>    <NA>
## 4    Francine     12-Sep          w2: 5 mhhw    (Cat2      972      0   $1.5)
## 5                                                         <NA>   <NA>    <NA>
## 6    Gilma-cp     30-Aug          w1: 3 mhhw    (Cat4      949      -      -)
##               Area
## 1 N.W. FL, GA, AL,
## 2   TN, KY, VA, WV
## 3                 
## 4           LA, MS
## 5                 
## 6               HI

For the Category column, I want to remove the “(Cat” prefix. Note that a double backslash is needed to escape the parenthesis in the regular expression (“\\(”).

# Remove "(Cat" prefix from the "Category" column
storms <- storms %>%
  mutate(`Category` = str_replace(`Category`, "\\(Cat", ""))
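
To quickly illustrate why the escape is needed (a standalone sketch, not part of the pipeline): an unescaped “(” opens a regex group and leaves it unclosed, which is an error, while “\\(” matches the literal character.

# Sketch: "(" is a regex metacharacter, so it must be escaped
str_replace("(Cat4", "\\(Cat", "")   # returns "4"
# str_replace("(Cat4", "(Cat", "")   # would error: missing closing parenthesis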

Rename Area as States Impacted

# Rename the "Area" column to "States Impacted"
storms <- storms %>%
  rename(`States Impacted` = Area)

Remove empty rows

storms <- storms %>%
  filter(!(is.na(`Cost`)))

head(storms)
##          X2024.Storm       Date        Storm.Tide Category Pressure Deaths
## 1             Helene Sep 23-29  Obs w4: >9.3 mhhw        4      938   221+
## 2           Francine     12-Sep        w2: 5 mhhw        2      972      0
## 3           Gilma-cp     30-Aug        w1: 3 mhhw        4      949      -
## 4            Hone-cp     25-Aug        w1: 3 mhhw        1      988      -
## 5 Post-Typhoon Ampil     22-Aug      w3: 6.3 mhhw        4      947      -
## 6              Debby    Aug 5,6      w2: 5.1 mhhw        1      979     10
##      Cost     States Impacted
## 1 $27.5+)    N.W. FL, GA, AL,
## 2   $1.5)              LA, MS
## 3      -)                  HI
## 4      -)                  HI
## 5      -)                  AK
## 6    >$1) N.W. FL, GA, SC, NC

The Cost column contains a lot of symbols that I’d like to remove. In this case we’ll keep only the digits and decimal points and strip everything else. Note that replacing the remaining NAs with the string “No data” coerces the column back to character.

storms <- storms %>%
  mutate(Cost = str_replace_all(Cost, "[^0-9.]", ""),  # Remove non-numeric characters except the period (.)
         Cost = as.numeric(Cost),  # Convert to numeric type
         Cost = ifelse(is.na(Cost), "No data", Cost))  # Replace NA with "No data" (coerces the column to character)

Now, convert the Storm.Tide column so that it keeps only a numeric value. In this case we can easily do this by extracting the number that appears right before “mhhw”, using a lookahead.

storms <- storms %>%
  mutate(Storm.Tide = str_extract(Storm.Tide, "\\d+\\.?\\d*(?= mhhw)"))

# Convert the extracted values to numeric type if needed
storms <- storms %>%
  mutate(Storm.Tide = as.numeric(Storm.Tide))
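
To see what the lookahead does, here is a quick sketch on sample strings from the raw data: the (?= mhhw) part requires the match to be immediately followed by “ mhhw”, without including that text in the result.

# Sketch: the lookahead grabs the number immediately before " mhhw"
str_extract("Obs w4: >9.3 mhhw (FEV)", "\\d+\\.?\\d*(?= mhhw)")  # "9.3"
str_extract("Fcst w6: 15-20 mhhw", "\\d+\\.?\\d*(?= mhhw)")      # "20"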

Next, we’ll transform the Deaths column so that its values are numeric only.

storms <- storms %>%
  mutate(Deaths = str_replace_all(Deaths, "\\+", "") %>% as.numeric(),
         Deaths = ifelse(is.na(Deaths), "No data", Deaths))  # Replace NA with "No data" (coerces the column to character)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Deaths = str_replace_all(Deaths, "\\+", "") %>% as.numeric()`.
## Caused by warning in `str_replace_all(Deaths, "\\+", "") %>% as.numeric()`:
## ! NAs introduced by coercion

Finally, and perhaps the most time-consuming step, we check for the different date formats in the data and standardize them. To simplify the process, I used ChatGPT to detect the formats present in the dataset and asked it to generate code to standardize them all to mm/dd/yyyy. The prompt used was: “Look for existing formats used in the data set for the column Date, and create code that applies to all the cases so that no value is lost and it can all be standardized to mm/dd/yyyy.”

storms <- storms %>%
  mutate(Date = case_when(
    # Format: "Sep 23-29" -> Convert to "09/23/2024" (taking the first day of the range)
    str_detect(Date, "[A-Za-z]{3} \\d{1,2}-\\d{1,2}") ~ paste0(str_pad(month(ymd(paste0("2024-", str_extract(Date, "^[A-Za-z]+"), "-01"))), 2, pad = "0"), "/",
                                                              str_extract(Date, "\\d{1,2}"), "/2024"),
    
    # Format: "12-Sep" -> Convert to "09/12/2024"
    str_detect(Date, "\\d{1,2}-[A-Za-z]{3}") ~ paste0(str_pad(month(ymd(paste0("2024-", str_extract(Date, "[A-Za-z]{3}"), "-01"))), 2, pad = "0"), "/",
                                                      str_extract(Date, "\\d{1,2}"), "/2024"),
    
    # Format: "Aug 5,6" -> Convert to "08/05/2024" (taking the first day only)
    str_detect(Date, "[A-Za-z]{3} \\d{1,2},\\d{1,2}") ~ paste0(str_pad(month(ymd(paste0("2024-", str_extract(Date, "^[A-Za-z]+"), "-01"))), 2, pad = "0"), "/",
                                                              str_extract(Date, "\\d{1,2}"), "/2024"),
    
    # Format: "8-Jul" -> Convert to "07/08/2024"
    str_detect(Date, "\\d{1,2}-[A-Za-z]{3}") ~ paste0(str_pad(month(ymd(paste0("2024-", str_extract(Date, "[A-Za-z]{3}"), "-01"))), 2, pad = "0"), "/",
                                                      str_extract(Date, "\\d{1,2}"), "/2024"),
    
    # Format: "Jun 19,20" -> Convert to "06/19/2024" (taking the first day only)
    str_detect(Date, "[A-Za-z]{3} \\d{1,2},\\d{1,2}") ~ paste0(str_pad(month(ymd(paste0("2024-", str_extract(Date, "^[A-Za-z]+"), "-01"))), 2, pad = "0"), "/",
                                                              str_extract(Date, "\\d{1,2}"), "/2024"),
    
    TRUE ~ Date  # Retain original date if it doesn't match any of the above patterns
  ))
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Date = case_when(...)`.
## Caused by warning:
## !  5 failed to parse.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
head(storms)
##          X2024.Storm       Date Storm.Tide Category Pressure  Deaths    Cost
## 1             Helene 09/23/2024        9.3        4      938     221    27.5
## 2           Francine 09/12/2024        5.0        2      972       0     1.5
## 3           Gilma-cp 08/30/2024        3.0        4      949 No data No data
## 4            Hone-cp 08/25/2024        3.0        1      988 No data No data
## 5 Post-Typhoon Ampil 08/22/2024        6.3        4      947 No data No data
## 6              Debby  08/5/2024        5.1        1      979      10       1
##       States Impacted
## 1    N.W. FL, GA, AL,
## 2              LA, MS
## 3                  HI
## 4                  HI
## 5                  AK
## 6 N.W. FL, GA, SC, NC

This completes the transformation of the first wide dataset. In the discussion, the author focuses more on what cleanup is needed. Here, we can also check whether there is a correlation between Category and Storm.Tide.

storms <- storms %>%
  mutate(Category = as.numeric(Category))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Category = as.numeric(Category)`.
## Caused by warning:
## ! NAs introduced by coercion
# Remove rows with NA values in Storm.Tide or Category
storms_complete <- storms %>%
  filter(!is.na(Storm.Tide) & !is.na(Category))


correlation <- cor(storms_complete$Storm.Tide, storms_complete$Category, use = "complete.obs")

print(paste("Correlation between Storm Tide and Category: ", correlation))
## [1] "Correlation between Storm Tide and Category:  0.574230805043916"

Conclusion

In this case we found a correlation of 0.5742, indicating a moderate positive association between the severity of a hurricane (measured by its category) and storm tide heights. In other words, stronger hurricanes (higher category numbers) tend to be associated with higher storm tides. However, the correlation is not very strong, which suggests other factors also influence storm tide levels.
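
To visualize this relationship, here is a quick sketch using the storms_complete table from above (the linear trend line is illustrative only):

# Sketch: scatter of storm tide vs. hurricane category with a linear trend
ggplot(storms_complete, aes(x = Category, y = Storm.Tide)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Storm Tide vs. Hurricane Category",
       x = "Category",
       y = "Storm Tide (relative to MHHW)") +
  theme_minimal()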

Now, on to our second data set

2nd Data Set

Employee Salaries 2023

The second dataset to transform comes from the Montgomery County, Maryland government website, and was provided by Crystal Quezada. It contains the yearly salaries of government employees in Montgomery County, MD.

Since we have already loaded the needed libraries in this file, we can go directly to loading the data from our trusted GitHub site.

salaries <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project2/EmployeeSalaries-2023.csv")

head(salaries)
##   Department           Department_Name                       Division Gender
## 1        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 2        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 3        ABS Alcohol Beverage Services          ABS 85 Administration      F
## 4        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 5        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 6        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
##   Base_Salary Overtime_Pay Longevity_Pay Grade
## 1   175873.00         0.00          0.00    M2
## 2   145613.36         0.00          0.00    M3
## 3   136970.00         0.00          0.00    M3
## 4    89432.69         0.00       2490.00    21
## 5    78947.00       456.68       6257.70    16
## 6    98228.00       518.80        998.28    21

Thankfully, this dataset is already pretty organized, so any needed transformations will be purely based on what the author of the post describes:

“The grade column does not contain a typical letter grade, but some have numbers and letters. Others are “NULL”. Naturally, I would filter observations with NULL grades, and overtime and longevity pays at zero. Not only that but, it has over 10,000 observations so to make specified analysis I would filter wages by whole number and a salary between 50 and 150k.”

Let’s start by removing the entries that have a NULL value in the Grade column.

# Create new table called salaries_clean
salaries_clean <- salaries %>%
  filter(Grade != "NULL")

# View the cleaned data
head(salaries_clean)
##   Department           Department_Name                       Division Gender
## 1        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 2        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 3        ABS Alcohol Beverage Services          ABS 85 Administration      F
## 4        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 5        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 6        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
##   Base_Salary Overtime_Pay Longevity_Pay Grade
## 1   175873.00         0.00          0.00    M2
## 2   145613.36         0.00          0.00    M3
## 3   136970.00         0.00          0.00    M3
## 4    89432.69         0.00       2490.00    21
## 5    78947.00       456.68       6257.70    16
## 6    98228.00       518.80        998.28    21

We now have a table called salaries_clean; the previous code removed 33 entries.
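
As a quick sanity check (a sketch, assuming both tables are still in memory), we can confirm how many rows were dropped:

# Sketch: confirm how many rows the NULL-grade filter removed
nrow(salaries) - nrow(salaries_clean)  # expected: 33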

The author also describes removing rows where overtime and longevity pay are zero. However, since there is a Base_Salary column, any calculation that needs to exclude employees with additional overtime or longevity pay can already be done with the data as is.

Instead, I propose creating a new column called Total_Compensation that adds up Base_Salary, Overtime_Pay, and Longevity_Pay. That way we can compare, for example, whether there are any differences between genders based only on base salary, and, if there are, check whether they increase or decrease when factoring in total compensation.

Let’s create the new column first.

salaries_clean <- salaries_clean %>%
  mutate(Total_Compensation = Base_Salary + Overtime_Pay + Longevity_Pay)

head(salaries_clean)
##   Department           Department_Name                       Division Gender
## 1        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 2        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 3        ABS Alcohol Beverage Services          ABS 85 Administration      F
## 4        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 5        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 6        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
##   Base_Salary Overtime_Pay Longevity_Pay Grade Total_Compensation
## 1   175873.00         0.00          0.00    M2          175873.00
## 2   145613.36         0.00          0.00    M3          145613.36
## 3   136970.00         0.00          0.00    M3          136970.00
## 4    89432.69         0.00       2490.00    21           91922.69
## 5    78947.00       456.68       6257.70    16           85661.38
## 6    98228.00       518.80        998.28    21           99745.08

Now let’s summarize the data

salary_summary <- salaries_clean %>%
  group_by(Gender) %>%
  summarize(Avg_Base_Salary = mean(Base_Salary, na.rm = TRUE),  # Calculate average base salary
            Median_Base_Salary = median(Base_Salary, na.rm = TRUE),  # Calculate median base salary
            Avg_Total_Comp = mean(Total_Compensation, na.rm = TRUE),  # Calculate average total compensation
            Median_Total_Comp = median(Total_Compensation, na.rm = TRUE)  # Calculate median total compensation
            ) %>%
  # Reshape the data to have one column for the "Stat" (Average/Median) and another for "Compensation"
  pivot_longer(cols = c(Avg_Base_Salary, Median_Base_Salary, Avg_Total_Comp, Median_Total_Comp), 
               names_to = "Stat", 
               values_to = "Compensation")

print(salary_summary)
## # A tibble: 8 × 3
##   Gender Stat               Compensation
##   <chr>  <chr>                     <dbl>
## 1 F      Avg_Base_Salary          87183.
## 2 F      Median_Base_Salary       85143.
## 3 F      Avg_Total_Comp           91260.
## 4 F      Median_Total_Comp        88967.
## 5 M      Avg_Base_Salary          92194.
## 6 M      Median_Base_Salary       89948 
## 7 M      Avg_Total_Comp          105932.
## 8 M      Median_Total_Comp       101492.

And now, on to plotting the summary we found:

ggplot(data = salary_summary, aes(x = Stat, y = Compensation, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average and Median Compensation by Gender",
       x = "Statistic",
       y = "Compensation ($)") +
  theme_minimal()

The plot shows that men have higher base salaries on average than women. However, this difference is more pronounced when factoring in overtime and longevity pay. But why is that? Is it a matter of how many employees have access to this additional pay? To check, we can count how many employees receive overtime and longevity pay, grouped by gender. We’ll include the total number of employees to provide context.

pay_summary <- salaries_clean %>%
  group_by(Gender) %>%
  summarize(
    Total_Employees = sum(Base_Salary > 0, na.rm = TRUE),  #Count total employees with base salary > 0
    Employees_with_Overtime_Pay = sum(Overtime_Pay > 0, na.rm = TRUE),  #Count employees with overtime pay > 0
    Employees_with_Longevity_Pay = sum(Longevity_Pay > 0, na.rm = TRUE)  #Count employees with longevity pay > 0
  ) %>%
  mutate(
    Proportion_with_Overtime_Pay = Employees_with_Overtime_Pay / Total_Employees * 100,  # Proportion for Overtime Pay
    Proportion_with_Longevity_Pay = Employees_with_Longevity_Pay / Total_Employees * 100  # Proportion for Longevity Pay
  )
print(pay_summary)
## # A tibble: 2 × 6
##   Gender Total_Employees Employees_with_Overtime_Pay Employees_with_Longevity_…¹
##   <chr>            <int>                       <int>                       <int>
## 1 F                 4345                        1485                        1135
## 2 M                 5913                        4155                        1701
## # ℹ abbreviated name: ¹​Employees_with_Longevity_Pay
## # ℹ 2 more variables: Proportion_with_Overtime_Pay <dbl>,
## #   Proportion_with_Longevity_Pay <dbl>
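
The print above truncates the proportion columns, so here is a quick sketch to display just those values (they are referenced in the discussion below):

# Sketch: view the proportion columns hidden by the truncated print
pay_summary %>%
  select(Gender, Proportion_with_Overtime_Pay, Proportion_with_Longevity_Pay)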

Let’s transform that table from wide format to long format to make it easier to plot, dropping the count values and leaving only the percentages.

pay_summary_long <- pay_summary %>%
  # Select only the proportion columns for reshaping
  pivot_longer(cols = c(Proportion_with_Overtime_Pay, Proportion_with_Longevity_Pay), 
               names_to = "Pay_Type", 
               values_to = "Percentage")

Finally, let’s create a new plot to visualize the proportions found.

ggplot(data = pay_summary_long, aes(x = Pay_Type, y = Percentage, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Employees with Overtime and Longevity Pay by Gender",
       x = "Pay Type",
       y = "Percentage (%)") +
  theme_minimal()

In this case the plot shows a significant difference between men and women in how many of them receive overtime pay (70.26% vs. 34.18%). This seems to be the main factor driving the difference in total compensation between the genders. Longevity pay also favors men slightly (28.77% vs. 26.12%), but the gap is not as pronounced as with overtime.

Conclusion

The data shows that, when comparing base salaries between men and women, men have slightly higher base salaries on average. Additionally, more men than women (both in proportion and in total count) receive overtime pay and longevity pay, bringing the average total compensation higher for men than for women. Future studies could look into the factors that affect access to overtime pay. Is it a matter of additional responsibilities outside work? Reduced availability for overtime opportunities?

Let’s move on to our final data set

3rd Data Set

Buffalo Blizzards

The third dataset comes from information posted on the class forum by Kevin Havis, and contains recorded monthly snowfall in Buffalo from the National Weather Association, covering 1940 through 2024.

As before, the needed libraries are already loaded, so we can go directly to loading the data from our trusted GitHub site.

snowfall <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project2/Snow-records-buffalo.csv")

head(snowfall)
##    SEASON JUL AUG SEP OCT  NOV  DEC  JAN  FEB  MAR APR MAY JUN ANNUAL
## 1 1940-41   0   0   0   T 17.5 12.1 17.3 23.1  9.3   T   0   0   79.3
## 2 1941-42   0   0   0   T    5  7.8   31   28 13.7 4.1   0   0   89.6
## 3 1942-43   0   0   0   T  8.7 26.7 16.9 17.7 10.4 5.1   T   0   85.5
## 4 1943-44   0   0   0 1.5 13.6  1.7  3.4 24.6 10.5 2.7   0   0     58
## 5 1944-45   0   0   0   0  3.9 35.1 50.6 23.3  5.8   T   2   0  120.7
## 6 1945-46   0   0   0   T 25.2 51.1 10.7 23.5    T   T   0   0  110.5

When we look at this dataset, we can see that it is mostly organized. Some rows need to be removed, since the header row is repeated on at least five occasions. Additionally, we see several “T” values in the data. What do those mean?

Years of studying meteorology (which I don’t have), plus a quick Google search (which I did), explain that the letter “T” typically stands for “Trace”. This indicates that a very small amount of snowfall was recorded, too small to be measured accurately. For example, “T” might mean that there was a dusting of snow, but not enough to reach a measurable quantity such as 0.1 inches or more.

So, what should we do with this value? If we consider the T insignificant and replace it with 0, we might underestimate the snowfall totals. Using a small value instead (0.05) is suggested, as it provides a more accurate reflection of the trace snowfall.
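
Before replacing them, a quick sketch can gauge how common these trace values are (run on the raw table, so the repeated header rows are still present and simply won’t match):

# Sketch: count "T" (trace) readings in each column of the raw data
snowfall %>%
  summarise(across(JUL:ANNUAL, ~ sum(. == "T", na.rm = TRUE)))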

First, let’s remove the repeated header rows.

snowfall_clean <- snowfall %>%
  filter(SEASON != "SEASON")

Some values have typos in their format, for example “2,.5”. If we replaced the Ts now, those cases would become NAs. To fix this, we first remove commas in the middle of the numbers, then collapse repeated periods.

snowfall_clean <- snowfall_clean %>%
  mutate(across(JUL:ANNUAL, ~ str_replace_all(., ",", "")))  # Removes commas

snowfall_clean <- snowfall_clean %>%
  mutate(across(JUL:ANNUAL, ~ str_replace_all(., "\\.{2,}", ".")))  # Replaces two or more periods with one

Now, let’s replace T symbols with 0.05

snowfall_clean <- snowfall_clean %>%
  mutate(across(JUL:ANNUAL, ~ ifelse(. == "T", 0.05, as.numeric(.))))
## Warning: There were 6 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(JUL:ANNUAL, ~ifelse(. == "T", 0.05, as.numeric(.)))`.
## Caused by warning in `ifelse()`:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.

Finally, let’s check whether any NA values remain.

remaining_issues <- snowfall_clean %>%
  summarise(across(JUL:ANNUAL, ~ sum(is.na(.))))

print(remaining_issues)
##   JUL AUG SEP OCT NOV DEC JAN FEB MAR APR MAY JUN ANNUAL
## 1   0   0   0   0   0   0   0   0   0   0   0   1      0

The only NA value left is June of 2024, which did not have data yet; based on historical data we can safely replace it with 0.

snowfall_clean <- snowfall_clean %>%
  mutate(JUN = ifelse(is.na(JUN), 0, JUN))

For the SEASON column, let’s replace the values that currently give a range with just the year the season begins. For example, 1940-41 will now appear as 1940.

snowfall_clean <- snowfall_clean %>%
  mutate(SEASON = str_extract(SEASON, "^\\d{4}"))

Now, let’s remove the columns that contain only 0s, since months with no recorded snowfall in any year add nothing to the analysis.

snowfall_clean <- snowfall_clean %>%
  select_if(~ !all(. == 0))

head(snowfall_clean)
##   SEASON SEP  OCT  NOV  DEC  JAN  FEB   MAR  APR  MAY ANNUAL
## 1   1940   0 0.05 17.5 12.1 17.3 23.1  9.30 0.05 0.00   79.3
## 2   1941   0 0.05  5.0  7.8 31.0 28.0 13.70 4.10 0.00   89.6
## 3   1942   0 0.05  8.7 26.7 16.9 17.7 10.40 5.10 0.05   85.5
## 4   1943   0 1.50 13.6  1.7  3.4 24.6 10.50 2.70 0.00   58.0
## 5   1944   0 0.00  3.9 35.1 50.6 23.3  5.80 0.05 2.00  120.7
## 6   1945   0 0.05 25.2 51.1 10.7 23.5  0.05 0.05 0.00  110.5

That removed three columns (JUL, AUG, and JUN), helping reduce how wide the table is.

Now let’s analyze the data and create some summary tables. We want to know:

- the total snowfall (in inches) per month
- the minimum and maximum values per month
- the average and median values per month, and in total

Let’s get the total snowfall per month

total_snowfall_per_month <- snowfall_clean %>%
  summarize(across(SEP:ANNUAL, \(x) sum(x, na.rm = TRUE)))

print(total_snowfall_per_month)
##   SEP   OCT   NOV    DEC    JAN  FEB    MAR   APR   MAY ANNUAL
## 1 0.1 41.35 858.4 1960.8 2106.7 1533 1010.2 243.7 14.25 7774.7

In total, January is the month with the most recorded snowfall in Buffalo (2106.7 inches), followed by December (1960.8), February (1533), March (1010.2), and November (858.4).

avg_snowfall_per_month <- snowfall_clean %>%
  summarize(across(SEP:ANNUAL, \(x) mean(x, na.rm = TRUE)))

median_snowfall_per_month <- snowfall_clean %>%
  summarize(across(SEP:ANNUAL, \(x) median(x, na.rm = TRUE)))

cat("\nAverage Snowfall Per Month (inches):\n")
## 
## Average Snowfall Per Month (inches):
print(avg_snowfall_per_month)
##           SEP       OCT      NOV      DEC      JAN   FEB      MAR     APR
## 1 0.001190476 0.4922619 10.21905 23.34286 25.07976 18.25 12.02619 2.90119
##         MAY   ANNUAL
## 1 0.1696429 92.55595
cat("\nMedian Snowfall Per Month (inches):\n")
## 
## Median Snowfall Per Month (inches):
print(median_snowfall_per_month)
##   SEP  OCT NOV   DEC  JAN   FEB   MAR APR MAY ANNUAL
## 1   0 0.05 8.8 20.05 19.3 17.25 10.65 2.2   0  91.35

On average, January is the month with the most snowfall (25.08 inches), closely followed by December (23.34 inches).

Let’s find the min and max values per month, combine those tables into one, and transform them from wide to long format.

# Minimum Value Per Month (Excluding ANNUAL)
min_snowfall_per_month <- snowfall_clean %>%
  summarize(across(SEP:MAY, \(x) min(x, na.rm = TRUE))) %>%
  mutate(Statistic = "Min")

# Maximum Value Per Month (Excluding ANNUAL)
max_snowfall_per_month <- snowfall_clean %>%
  summarize(across(SEP:MAY, \(x) max(x, na.rm = TRUE))) %>%
  mutate(Statistic = "Max")

combined_snowfall <- bind_rows(min_snowfall_per_month, max_snowfall_per_month)

combined_snowfall_long <- combined_snowfall %>%
  pivot_longer(cols = SEP:MAY, names_to = "Month", values_to = "Snowfall")  # Convert wide to long format

final_snowfall_summary <- combined_snowfall_long %>%
  pivot_wider(names_from = Statistic, values_from = Snowfall)

print(final_snowfall_summary)
## # A tibble: 9 × 3
##   Month   Min   Max
##   <chr> <dbl> <dbl>
## 1 SEP    0     0.05
## 2 OCT    0    22.6 
## 3 NOV    0    45.6 
## 4 DEC    1    82.7 
## 5 JAN    3.4  68.3 
## 6 FEB    1.8  54.2 
## 7 MAR    0.05 32.8 
## 8 APR    0.05 15   
## 9 MAY    0     7.9

Conclusion

The author challenged the class to find the worst snowfall month for Buffalo. With the min/max table we found the value 82.7 inches, which corresponds to December 2001, followed by 68.3 inches in January of the 1976-77 season.
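
For reference, here is a quick sketch to confirm which seasons those maxima belong to, using the cleaned table:

# Sketch: locate the seasons holding the December and January maxima
snowfall_clean %>%
  filter(DEC == max(DEC) | JAN == max(JAN)) %>%
  select(SEASON, DEC, JAN)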