The first dataset to modify comes from the Meteorological Development Laboratory, provided by the author of the post, Kimberly Koon.
The author focuses on ideas to transform the data: “The data is untidy in many ways - for instance, the tables are broken up into years, with each year having its own table. The first column follows the format Year - Storm, and then the rows also contain the year with the name of the storm. This column also appears to have certain tags, like (R) for retired name. Some of the formats are not consistent, such as having “’18-Storm Name” versus “2018-Storm Name”. Date ranges are stored in a “Date” column rather than in a “Storm Date Start” and “Storm Date End” column. The Surge data seems to be inconsistent in units between the tables, with Above Ground Level and Mean Higher High Water used. The Obs column has comma delimited lists, and the Cat,Pres,Dead,$bn column is for some reason one single column instead of having four columns for each variable. Area is also a comma delimited list, and for some reason the area is inconsistent in terms of State, City, etc.”
First, we’ll load the needed libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
Second, we’ll load the data from a CSV file hosted on GitHub.
storms <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project2/stormdata.csv")
head(storms)
## X2024.Storm Date Storm.Tide Obs Guidance
## 1 2024-Helene Sep 23-29 Obs w4: >9.3 mhhw (FEV) Warn-A6 (1-8)
## 2 Fcst w6: 15-20 mhhw
## 3
## 4 2024-Francine 12-Sep w2: 5 mhhw Warn-A5 (1-15)
## 5
## 6 2024-Gilma-cp 30-Aug w1: 3 mhhw PS-Adv 43,46
## Cat..Pres..Dead...bn Area
## 1 (Cat4, 938, 221+, $27.5+) N.W. FL, GA, AL,
## 2 TN, KY, VA, WV
## 3
## 4 (Cat2, 972, 0, $1.5) LA, MS
## 5
## 6 (Cat4, 949, -, -) HI
Please note that for this exercise I’ll be focusing on transforming the data for the year 2024.
I will start by removing the Obs and Guidance columns.
# Remove the "Obs" and "Guidance" columns
storms <- storms %>%
select(-Obs, -Guidance)
Now, I want to split a column that contains four different types of information into four separate columns.
# Separate the "Cat, Pres, Dead, $bn" column into four separate columns
storms <- storms %>%
separate(`Cat..Pres..Dead...bn`, into = c("Category", "Pressure", "Deaths", "Cost"), sep = ", ", fill = "right", extra = "merge")
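The fill and extra arguments do the heavy lifting here: fill = "right" pads rows with fewer than four pieces with NA, and extra = "merge" folds any surplus ", "-separated text into the last column instead of dropping it. A quick illustrative sketch on a made-up one-row tibble:
# Illustrative sketch of separate() with fill = "right" and extra = "merge"
tibble(x = "(Cat4, 938, 221+, $27.5+)") %>%
  separate(x, into = c("Category", "Pressure", "Deaths", "Cost"),
           sep = ", ", fill = "right", extra = "merge")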
Next, I want to transform the first column so that it keeps only the storm names, removing the “2024-” prefix.
# Remove "2024-" prefix from the "2024-Storm" column
storms <- storms %>%
mutate(`X2024.Storm` = str_replace(`X2024.Storm`, "^2024-", ""))
head(storms)
## X2024.Storm Date Storm.Tide Category Pressure Deaths Cost
## 1 Helene Sep 23-29 Obs w4: >9.3 mhhw (Cat4 938 221+ $27.5+)
## 2 Fcst w6: 15-20 mhhw <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 Francine 12-Sep w2: 5 mhhw (Cat2 972 0 $1.5)
## 5 <NA> <NA> <NA>
## 6 Gilma-cp 30-Aug w1: 3 mhhw (Cat4 949 - -)
## Area
## 1 N.W. FL, GA, AL,
## 2 TN, KY, VA, WV
## 3
## 4 LA, MS
## 5
## 6 HI
For the Category column, I want to remove the “(Cat” prefix. Double backslashes are needed in the pattern to escape the parenthesis.
# Remove "(Cat" prefix from the "Category" column
storms <- storms %>%
mutate(`Category` = str_replace(`Category`, "\\(Cat", ""))
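A quick sanity check of that escaping (illustrative only): an unescaped “(” is a regex metacharacter, while “\\(” matches a literal parenthesis.
str_replace("(Cat4", "\\(Cat", "")  # matches the literal "(Cat" and removes it
## [1] "4"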
Rename the Area column to States Impacted.
# Rename the "Area" column to "States Impacted"
storms <- storms %>%
rename(`States Impacted` = Area)
Remove the empty rows (those with no value in Cost).
storms <- storms %>%
filter(!(is.na(`Cost`)))
head(storms)
## X2024.Storm Date Storm.Tide Category Pressure Deaths
## 1 Helene Sep 23-29 Obs w4: >9.3 mhhw 4 938 221+
## 2 Francine 12-Sep w2: 5 mhhw 2 972 0
## 3 Gilma-cp 30-Aug w1: 3 mhhw 4 949 -
## 4 Hone-cp 25-Aug w1: 3 mhhw 1 988 -
## 5 Post-Typhoon Ampil 22-Aug w3: 6.3 mhhw 4 947 -
## 6 Debby Aug 5,6 w2: 5.1 mhhw 1 979 10
## Cost States Impacted
## 1 $27.5+) N.W. FL, GA, AL,
## 2 $1.5) LA, MS
## 3 -) HI
## 4 -) HI
## 5 -) AK
## 6 >$1) N.W. FL, GA, SC, NC
The Cost column contains a lot of symbols that I’d like to remove. In this case we’ll keep only the numbers and strip everything else.
storms <- storms %>%
  mutate(Cost = str_replace_all(Cost, "[^0-9.]", ""), # Remove everything except digits and periods
         Cost = as.numeric(Cost), # Convert to numeric type
         Cost = ifelse(is.na(Cost), "No data", Cost)) # Replace NA with "No data"
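One side effect worth noting: because ifelse() mixes the string “No data” into a numeric vector, the whole Cost column is coerced to character. If numeric summaries of Cost were wanted later, a sketch like the following (the Cost_num name is purely illustrative) keeps a numeric copy, with NA for missing values:
# Hypothetical variant: keep a numeric copy of Cost so functions like mean() still work
storms_costnum <- storms %>%
  mutate(Cost_num = suppressWarnings(as.numeric(str_replace_all(Cost, "[^0-9.]", ""))))
mean(storms_costnum$Cost_num, na.rm = TRUE)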
Now, convert the Storm.Tide column so that it keeps only a numeric value. We can do this easily by extracting the number that appears right before “mhhw”.
storms <- storms %>%
mutate(Storm.Tide = str_extract(Storm.Tide, "\\d+\\.?\\d*(?= mhhw)"))
# Convert the extracted values to numeric type if needed
storms <- storms %>%
mutate(Storm.Tide = as.numeric(Storm.Tide))
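The “(?= mhhw)” part of the pattern is a lookahead: it requires “ mhhw” to follow the digits without including it in the match. A quick illustration:
str_extract("Obs w4: >9.3 mhhw (FEV)", "\\d+\\.?\\d*(?= mhhw)")
## [1] "9.3"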
Next, we’ll transform the Deaths column so that its values are purely numeric.
storms <- storms %>%
  mutate(Deaths = str_replace_all(Deaths, "\\+", "") %>% as.numeric(), # Strip "+" and convert to numeric
         Deaths = ifelse(is.na(Deaths), "No data", Deaths)) # Replace NA with "No data"
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Deaths = str_replace_all(Deaths, "\\+", "") %>% as.numeric()`.
## Caused by warning in `str_replace_all(Deaths, "\\+", "") %>% as.numeric()`:
## ! NAs introduced by coercion
Finally, and perhaps the most time-consuming step, we check for the different date formats in the dataset and standardize them. To simplify the process I used ChatGPT to detect the formats present in the Date column and asked it to generate code to standardize them. The prompt used was: “Look for existing formats used in the data set for the column Date, and create code that applies to all the cases so that no value is lost and it can all be standardized to mm/dd/yyyy.”
storms <- storms %>%
  mutate(Date = case_when(
    # Formats like "Sep 23-29" or "Aug 5,6" -> take the first day: "09/23/2024", "08/05/2024"
    str_detect(Date, "[A-Za-z]{3} \\d{1,2}[-,]\\d{1,2}") ~ paste0(str_pad(month(ymd(paste0("2024-", str_extract(Date, "^[A-Za-z]+"), "-01"))), 2, pad = "0"), "/",
                                                                  str_extract(Date, "\\d{1,2}"), "/2024"),
    # Formats like "12-Sep" or "8-Jul" -> "09/12/2024", "07/08/2024"
    str_detect(Date, "\\d{1,2}-[A-Za-z]{3}") ~ paste0(str_pad(month(ymd(paste0("2024-", str_extract(Date, "[A-Za-z]{3}"), "-01"))), 2, pad = "0"), "/",
                                                      str_extract(Date, "\\d{1,2}"), "/2024"),
    TRUE ~ Date # Retain the original value if it matches no pattern
  ))
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Date = case_when(...)`.
## Caused by warning:
## ! 5 failed to parse.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
head(storms)
## X2024.Storm Date Storm.Tide Category Pressure Deaths Cost
## 1 Helene 09/23/2024 9.3 4 938 221 27.5
## 2 Francine 09/12/2024 5.0 2 972 0 1.5
## 3 Gilma-cp 08/30/2024 3.0 4 949 No data No data
## 4 Hone-cp 08/25/2024 3.0 1 988 No data No data
## 5 Post-Typhoon Ampil 08/22/2024 6.3 4 947 No data No data
## 6 Debby 08/5/2024 5.1 1 979 10 1
## States Impacted
## 1 N.W. FL, GA, AL,
## 2 LA, MS
## 3 HI
## 4 HI
## 5 AK
## 6 N.W. FL, GA, SC, NC
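As a side note, every 2024 entry boils down to one three-letter month abbreviation plus a first day number, so a single extraction could have replaced the whole case_when, and it would also zero-pad the day (note “08/5/2024” for Debby above). A sketch, assuming it is applied to the raw Date strings (the standardize_2024 helper name is hypothetical):
# Sketch: the month is the only run of letters, and the first number is the start day
standardize_2024 <- function(d) {
  format(mdy(paste(str_extract(d, "[A-Za-z]{3}"),
                   str_extract(d, "\\d{1,2}"), "2024")),
         "%m/%d/%Y")
}
standardize_2024(c("Sep 23-29", "12-Sep", "Aug 5,6"))
## [1] "09/23/2024" "09/12/2024" "08/05/2024"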
This completes the transformation of the first wide dataset. In the discussion, the author focuses more on what cleanup is needed. In this case, we can check whether there is a correlation between Category and Storm Tide.
storms <- storms %>%
mutate(Category = as.numeric(Category))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Category = as.numeric(Category)`.
## Caused by warning:
## ! NAs introduced by coercion
# Remove rows with NA values in Storm.Tide or Category
storms_complete <- storms %>%
filter(!is.na(Storm.Tide) & !is.na(Category))
correlation <- cor(storms_complete$Storm.Tide, storms_complete$Category, use = "complete.obs")
print(paste("Correlation between Storm Tide and Category: ", correlation))
## [1] "Correlation between Storm Tide and Category: 0.574230805043916"
In this case we found a correlation of 0.574, indicating a moderate positive association between the severity of a hurricane (measured by its category) and storm tide heights. This means stronger hurricanes (higher category numbers) tend to be associated with higher storm tides. However, the correlation is not very strong, so other factors are likely also influencing storm tide levels.
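To make the relationship visible, a quick scatter plot with a fitted linear trend (a sketch using the storms_complete table built above) could look like this:
# Sketch: visualize Storm Tide against Category with a linear fit
ggplot(storms_complete, aes(x = Category, y = Storm.Tide)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Storm Tide vs. Category",
       x = "Category", y = "Storm Tide (mhhw)") +
  theme_minimal()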
Now, on to our second data set
The second dataset to modify comes from the Montgomery County, Maryland government website, provided by Crystal Quezada; it covers the yearly salaries of Montgomery County government employees.
Since we have already loaded the needed libraries in this file, we can go directly to loading the data from our trusted GitHub site.
salaries <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project2/EmployeeSalaries-2023.csv")
head(salaries)
## Department Department_Name Division Gender
## 1 ABS Alcohol Beverage Services ABS 85 Administration M
## 2 ABS Alcohol Beverage Services ABS 85 Administration M
## 3 ABS Alcohol Beverage Services ABS 85 Administration F
## 4 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 5 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 6 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## Base_Salary Overtime_Pay Longevity_Pay Grade
## 1 175873.00 0.00 0.00 M2
## 2 145613.36 0.00 0.00 M3
## 3 136970.00 0.00 0.00 M3
## 4 89432.69 0.00 2490.00 21
## 5 78947.00 456.68 6257.70 16
## 6 98228.00 518.80 998.28 21
Thankfully, this dataset is already pretty organized, so any needed transformations will be purely based on what the author of the post describes:
“The grade column does not contain a typical letter grade, but some have numbers and letters. Others are “NULL”. Naturally, I would filter observations with NULL grades, and overtime and longevity pays at zero. Not only that but, it has over 10,000 observations so to make specified analysis I would filter wages by whole number and a salary between 50 and 150k.”
Let’s start by removing the entries that have the literal value “NULL” in the Grade column.
# Create new table called salaries_clean
salaries_clean <- salaries %>%
filter(Grade != "NULL")
# View the cleaned data
head(salaries_clean)
## Department Department_Name Division Gender
## 1 ABS Alcohol Beverage Services ABS 85 Administration M
## 2 ABS Alcohol Beverage Services ABS 85 Administration M
## 3 ABS Alcohol Beverage Services ABS 85 Administration F
## 4 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 5 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 6 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## Base_Salary Overtime_Pay Longevity_Pay Grade
## 1 175873.00 0.00 0.00 M2
## 2 145613.36 0.00 0.00 M3
## 3 136970.00 0.00 0.00 M3
## 4 89432.69 0.00 2490.00 21
## 5 78947.00 456.68 6257.70 16
## 6 98228.00 518.80 998.28 21
Now we have a table called salaries_clean; the previous code removed 33 entries.
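We can verify that count directly:
# Difference in row counts before and after the Grade filter
nrow(salaries) - nrow(salaries_clean)
## [1] 33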
The author also describes removing rows where overtime and longevity pay are zero. However, since there is a Base_Salary column, any calculation that needs to exclude employees with additional overtime or longevity pay can be done on the data as is.
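For instance, if a base-salary-only population were ever needed, a filter like this sketch (not applied here; the salaries_base_only name is purely illustrative) would isolate employees with no additional pay:
# Sketch: employees with neither overtime nor longevity pay
salaries_base_only <- salaries_clean %>%
  filter(Overtime_Pay == 0, Longevity_Pay == 0)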
Instead, I propose creating a new column called Total_Compensation that adds up Base_Salary, Overtime_Pay, and Longevity_Pay. That way we can compare, for example, whether there are any differences between genders based only on base salary, and, if there are, check whether they increase or decrease when factoring in total compensation.
Let’s create the new column first.
salaries_clean <- salaries_clean %>%
mutate(Total_Compensation = Base_Salary + Overtime_Pay + Longevity_Pay)
head(salaries_clean)
## Department Department_Name Division Gender
## 1 ABS Alcohol Beverage Services ABS 85 Administration M
## 2 ABS Alcohol Beverage Services ABS 85 Administration M
## 3 ABS Alcohol Beverage Services ABS 85 Administration F
## 4 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 5 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 6 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## Base_Salary Overtime_Pay Longevity_Pay Grade Total_Compensation
## 1 175873.00 0.00 0.00 M2 175873.00
## 2 145613.36 0.00 0.00 M3 145613.36
## 3 136970.00 0.00 0.00 M3 136970.00
## 4 89432.69 0.00 2490.00 21 91922.69
## 5 78947.00 456.68 6257.70 16 85661.38
## 6 98228.00 518.80 998.28 21 99745.08
Now let’s summarize the data
salary_summary <- salaries_clean %>%
group_by(Gender) %>%
summarize(Avg_Base_Salary = mean(Base_Salary, na.rm = TRUE), # Calculate average base salary
Median_Base_Salary = median(Base_Salary, na.rm = TRUE), # Calculate median base salary
Avg_Total_Comp = mean(Total_Compensation, na.rm = TRUE), # Calculate average total compensation
Median_Total_Comp = median(Total_Compensation, na.rm = TRUE) # Calculate median total compensation
) %>%
# Reshape the data to have one column for the "Stat" (Average/Median) and another for "Compensation"
pivot_longer(cols = c(Avg_Base_Salary, Median_Base_Salary, Avg_Total_Comp, Median_Total_Comp),
names_to = "Stat",
values_to = "Compensation")
print(salary_summary)
## # A tibble: 8 × 3
## Gender Stat Compensation
## <chr> <chr> <dbl>
## 1 F Avg_Base_Salary 87183.
## 2 F Median_Base_Salary 85143.
## 3 F Avg_Total_Comp 91260.
## 4 F Median_Total_Comp 88967.
## 5 M Avg_Base_Salary 92194.
## 6 M Median_Base_Salary 89948
## 7 M Avg_Total_Comp 105932.
## 8 M Median_Total_Comp 101492.
And now, on to plotting the summary we found:
ggplot(data = salary_summary, aes(x = Stat, y = Compensation, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average and Median Compensation by Gender",
       x = "Statistic",
       y = "Compensation ($)") +
  theme_minimal()
The plot shows that men have higher base salaries on average than women, and the difference becomes more pronounced when factoring in overtime and longevity pay. But why is that? Is it a matter of how many employees have access to this additional pay? To check, we can count how many employees receive overtime and longevity pay, grouped by gender. We’ll include the total number of employees to provide context.
pay_summary <- salaries_clean %>%
group_by(Gender) %>%
summarize(
Total_Employees = sum(Base_Salary > 0, na.rm = TRUE), #Count total employees with base salary > 0
Employees_with_Overtime_Pay = sum(Overtime_Pay > 0, na.rm = TRUE), #Count employees with overtime pay > 0
Employees_with_Longevity_Pay = sum(Longevity_Pay > 0, na.rm = TRUE) #Count employees with longevity pay > 0
) %>%
mutate(
Proportion_with_Overtime_Pay = Employees_with_Overtime_Pay / Total_Employees * 100, # Proportion for Overtime Pay
Proportion_with_Longevity_Pay = Employees_with_Longevity_Pay / Total_Employees * 100 # Proportion for Longevity Pay
)
print(pay_summary)
## # A tibble: 2 × 6
## Gender Total_Employees Employees_with_Overtime_Pay Employees_with_Longevity_…¹
## <chr> <int> <int> <int>
## 1 F 4345 1485 1135
## 2 M 5913 4155 1701
## # ℹ abbreviated name: ¹Employees_with_Longevity_Pay
## # ℹ 2 more variables: Proportion_with_Overtime_Pay <dbl>,
## # Proportion_with_Longevity_Pay <dbl>
Let’s transform that table from wide format to long format to make it easier to plot, removing the count values and leaving only the percentages.
pay_summary_long <- pay_summary %>%
# Select only the proportion columns for reshaping
pivot_longer(cols = c(Proportion_with_Overtime_Pay, Proportion_with_Longevity_Pay),
names_to = "Pay_Type",
values_to = "Percentage")
Finally, let’s create a new plot to visualize the proportions found.
ggplot(data = pay_summary_long, aes(x = Pay_Type, y = Percentage, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Proportion of Employees with Overtime and Longevity Pay by Gender",
x = "Pay Type",
y = "Percentage (%)") +
theme_minimal()
In this case the plot shows a significant difference between men and women in how many of them receive overtime pay (70.26% vs. 34.18%). This appears to be the main factor driving the difference in total compensation between the genders. Longevity pay also favors men slightly (28.77% vs. 26.12%), but the gap is not nearly as pronounced as for overtime.
The data shows that when comparing base salaries, men earn on average slightly more than women. Additionally, more men than women (in both proportion and total count) receive overtime and longevity pay, bringing the average total compensation higher for men than for women. Future studies could look into what factors impact access to overtime pay. Is it a matter of additional responsibilities outside work? Reduced availability for overtime opportunities?
Let’s move on to our final data set
The third dataset to modify comes from information posted on the class forum, provided by Kevin Havis; it contains monthly recorded snowfall in Buffalo from the National Weather Service from 1940 until 2024.
Since we have already loaded the needed libraries in this file, we can go directly to loading the data from our trusted GitHub site.
snowfall <- read.csv("https://raw.githubusercontent.com/Lfirenzeg/msds607labs/refs/heads/main/Project2/Snow-records-buffalo.csv")
head(snowfall)
## SEASON JUL AUG SEP OCT NOV DEC JAN FEB MAR APR MAY JUN ANNUAL
## 1 1940-41 0 0 0 T 17.5 12.1 17.3 23.1 9.3 T 0 0 79.3
## 2 1941-42 0 0 0 T 5 7.8 31 28 13.7 4.1 0 0 89.6
## 3 1942-43 0 0 0 T 8.7 26.7 16.9 17.7 10.4 5.1 T 0 85.5
## 4 1943-44 0 0 0 1.5 13.6 1.7 3.4 24.6 10.5 2.7 0 0 58
## 5 1944-45 0 0 0 0 3.9 35.1 50.6 23.3 5.8 T 2 0 120.7
## 6 1945-46 0 0 0 T 25.2 51.1 10.7 23.5 T T 0 0 110.5
When we look at this dataset, we can see that it is mostly organized. Some rows need to be removed, as the header row is repeated on at least five occasions. Additionally, there are several “T” values in the data. What do they mean?
Years of studying meteorology (which I don’t have), plus a quick Google search (which I did), explain that the letter “T” typically stands for “Trace”. It indicates that a very small amount of snowfall was recorded, too small to be measured accurately. For example, “T” might mean there was a dusting of snow, but not enough to reach a measurable quantity such as 0.1 inches.
So, what do we do with this value? If we treat the T as insignificant and replace it with 0, we might underestimate the snowfall totals. Including a small value instead (0.05) is suggested, as it provides a more accurate reflection of the trace snowfall.
First, let’s remove the repeated header rows.
snowfall_clean <- snowfall %>%
filter(SEASON != "SEASON")
Some values have typos in their format, for example “2,.5”. If we replaced the Ts now, those cases would become NAs. To fix this, we first remove commas that appear inside the numbers.
snowfall_clean <- snowfall_clean %>%
mutate(across(JUL:ANNUAL, ~ str_replace_all(., ",", ""))) # Removes commas
snowfall_clean <- snowfall_clean %>%
mutate(across(JUL:ANNUAL, ~ str_replace_all(., "\\.{2,}", "."))) # Replaces two or more periods with one
Now, let’s replace T symbols with 0.05
snowfall_clean <- snowfall_clean %>%
mutate(across(JUL:ANNUAL, ~ ifelse(. == "T", 0.05, as.numeric(.))))
## Warning: There were 6 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(JUL:ANNUAL, ~ifelse(. == "T", 0.05, as.numeric(.)))`.
## Caused by warning in `ifelse()`:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
Finally, let’s check whether there are any NA values left.
remaining_issues <- snowfall_clean %>%
summarise(across(JUL:ANNUAL, ~ sum(is.na(.))))
print(remaining_issues)
## JUL AUG SEP OCT NOV DEC JAN FEB MAR APR MAY JUN ANNUAL
## 1 0 0 0 0 0 0 0 0 0 0 0 1 0
The only NA value left is for June of 2024, which did not have data; based on historical data we can replace it with 0.
snowfall_clean <- snowfall_clean %>%
mutate(JUN = ifelse(is.na(JUN), 0, JUN))
For the SEASON column, let’s replace the values that currently give a range with just the year the season begins. For example, 1940-41 will now appear as 1940.
snowfall_clean <- snowfall_clean %>%
mutate(SEASON = str_extract(SEASON, "^\\d{4}"))
Now, let’s remove the columns that contain only 0s, since months with no recorded snowfall in any year add no information.
snowfall_clean <- snowfall_clean %>%
select_if(~ !all(. == 0))
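As a side note, select_if() has been superseded in recent versions of dplyr; the same result can be written with the where() helper (alternative sketch, not run here):
# Equivalent modern spelling of the zero-column removal
snowfall_alt <- snowfall_clean %>%
  select(where(~ !all(. == 0)))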
head(snowfall_clean)
## SEASON SEP OCT NOV DEC JAN FEB MAR APR MAY ANNUAL
## 1 1940 0 0.05 17.5 12.1 17.3 23.1 9.30 0.05 0.00 79.3
## 2 1941 0 0.05 5.0 7.8 31.0 28.0 13.70 4.10 0.00 89.6
## 3 1942 0 0.05 8.7 26.7 16.9 17.7 10.40 5.10 0.05 85.5
## 4 1943 0 1.50 13.6 1.7 3.4 24.6 10.50 2.70 0.00 58.0
## 5 1944 0 0.00 3.9 35.1 50.6 23.3 5.80 0.05 2.00 120.7
## 6 1945 0 0.05 25.2 51.1 10.7 23.5 0.05 0.05 0.00 110.5
That removed three columns, helping reduce how wide the table is.
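The dropped columns can also be listed directly:
# Columns present in the raw data but not in the cleaned table
setdiff(names(snowfall), names(snowfall_clean))
## [1] "JUL" "AUG" "JUN"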
Now let’s analyze the data and start creating some summary tables. We want to know:
- Total snowfall (in inches) per month
- Min and max values of the whole table
- Average and median values per month, and in total
Let’s get the total snowfall per month
total_snowfall_per_month <- snowfall_clean %>%
summarize(across(SEP:ANNUAL, \(x) sum(x, na.rm = TRUE)))
print(total_snowfall_per_month)
## SEP OCT NOV DEC JAN FEB MAR APR MAY ANNUAL
## 1 0.1 41.35 858.4 1960.8 2106.7 1533 1010.2 243.7 14.25 7774.7
In total, January is the month with the most recorded snowfall in Buffalo (2106.7 inches), followed by December (1960.8), February (1533), March (1010.2), and November (858.4).
avg_snowfall_per_month <- snowfall_clean %>%
summarize(across(SEP:ANNUAL, \(x) mean(x, na.rm = TRUE)))
median_snowfall_per_month <- snowfall_clean %>%
summarize(across(SEP:ANNUAL, \(x) median(x, na.rm = TRUE)))
cat("\nAverage Snowfall Per Month (inches):\n")
##
## Average Snowfall Per Month (inches):
print(avg_snowfall_per_month)
## SEP OCT NOV DEC JAN FEB MAR APR
## 1 0.001190476 0.4922619 10.21905 23.34286 25.07976 18.25 12.02619 2.90119
## MAY ANNUAL
## 1 0.1696429 92.55595
cat("\nMedian Snowfall Per Month (inches):\n")
##
## Median Snowfall Per Month (inches):
print(median_snowfall_per_month)
## SEP OCT NOV DEC JAN FEB MAR APR MAY ANNUAL
## 1 0 0.05 8.8 20.05 19.3 17.25 10.65 2.2 0 91.35
On average, January is the month with the most snowfall (25.08 inches), closely followed by December (23.34 inches).
Let’s find the min and max values per month, combine those tables into one, and reshape the result so each month is a row with its Min and Max.
# Minimum Value Per Month (Excluding ANNUAL)
min_snowfall_per_month <- snowfall_clean %>%
summarize(across(SEP:MAY, \(x) min(x, na.rm = TRUE))) %>%
mutate(Statistic = "Min")
# Maximum Value Per Month (Excluding ANNUAL)
max_snowfall_per_month <- snowfall_clean %>%
summarize(across(SEP:MAY, \(x) max(x, na.rm = TRUE))) %>%
mutate(Statistic = "Max")
combined_snowfall <- bind_rows(min_snowfall_per_month, max_snowfall_per_month)
combined_snowfall_long <- combined_snowfall %>%
pivot_longer(cols = SEP:MAY, names_to = "Month", values_to = "Snowfall") # Convert wide to long format
final_snowfall_summary <- combined_snowfall_long %>%
pivot_wider(names_from = Statistic, values_from = Snowfall)
print(final_snowfall_summary)
## # A tibble: 9 × 3
## Month Min Max
## <chr> <dbl> <dbl>
## 1 SEP 0 0.05
## 2 OCT 0 22.6
## 3 NOV 0 45.6
## 4 DEC 1 82.7
## 5 JAN 3.4 68.3
## 6 FEB 1.8 54.2
## 7 MAR 0.05 32.8
## 8 APR 0.05 15
## 9 MAY 0 7.9
The author challenged the class to find the worst snowfall month for Buffalo. With the min-max table we found the value 82.7 inches, which corresponds to December of 2001, followed by January of 1976 with 68.3 inches.
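For reference, the season holding that December maximum can be pulled straight from the cleaned table (a sketch; per the numbers above it should return the 2001 season):
# Locate the season with the December maximum
snowfall_clean %>%
  filter(DEC == max(DEC, na.rm = TRUE)) %>%
  select(SEASON, DEC)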