library(tidyverse)
library(openintro)
library(readr)
library(dplyr)
In this section, I analyzed the Historical Storms 2024
data to explore the relationship between storm surge and impact
variables, such as economic damage. Additionally, I investigated which
states are more frequently affected by named storms.
url <- "https://raw.githubusercontent.com/Amish22/DS607/refs/heads/main/Historical-Storm-Data.csv"
storm_data <- read.csv(url, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(storm_data)
## Storm Date Observed.Storm.Tide Forecasted.Storm.Tide
## 1 2024-Helene Sep 23-29 >9.3 mhhw 15-20 mhhw
## 2 2024-Francine 12-Sep 5 mhhw
## 3 2024-Gilma-cp 30-Aug 3 mhhw
## 4 2024-Hone-cp 25-Aug 3 mhhw
## 5 2024-Post-Typhoon Ampil 22-Aug 6.3 mhhw
## 6 2024-Debby Aug 5-6 5.1 mhhw
## Observation.System Warning.Level Category Pressure..mb. Casualties
## 1 FEV Warn-A6 (1-8) Cat4 938 221+
## 2 FEV Warn-A5 (1-15) Cat2 972 0
## 3 PS-Adv 43,46 Cat4 949 <NA>
## 4 PS-Adv 2-7,9-11,13,14 Cat1 988 <NA>
## 5 P-ETSS Cat4 947 <NA>
## 6 FEV Warn-A5 (6-20) Cat1 979 10
## Damage..bn.USD. Affected.Areas
## 1 27.5+ N.W. FL, GA, AL, TN, KY, VA, WV
## 2 1.5 LA, MS
## 3 <NA> HI
## 4 <NA> HI
## 5 <NA> AK
## 6 >1 N.W. FL, GA, SC, NC
To analyze the number of named storms by state, we need to split the
Affected Areas column into separate columns for each state.
To achieve this using the separate() function from
tidyr.
storm_data <- storm_data %>%
separate(`Affected.Areas`, into = paste0("Area", 1:7), sep = ", ", fill = "right")
Missing values can impact analysis and visualizations. To address
this, replace all NA values in the dataset with a
placeholder text "No Data".
storm_data <- storm_data %>%
mutate(across(everything(), ~ replace(., is.na(.), "No Data")))
We are particularly interested in understanding whether higher storm
surges are associated with greater economic damage. For this analysis,
convert the Damage (bn USD) and
Observed Storm-Tide columns to numeric values and compute
their correlation.
storm_data$Damage..bn.USD. <- as.numeric(gsub("[^0-9.]", "", storm_data$Damage..bn.USD.))
storm_data$Surge <- as.numeric(gsub("[^0-9.]", "", storm_data$Observed.Storm.Tide))
# Checking correlation between surge and damage
correlation <- cor(storm_data$Surge, storm_data$Damage..bn.USD., use = "complete.obs")
correlation
## [1] 0.8433445
Summary: A correlation value close to 1 or -1 would indicate a strong linear relationship. In our case, a positive correlation suggests that higher storm surges may be associated with increased economic damage.
Next, to identify which states have experienced the highest number of named storms, reshape the data and count the number of named storms for each state.
state_counts <- storm_data %>%
pivot_longer(cols = starts_with("Area"), values_drop_na = TRUE) %>%
group_by(value) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
print(state_counts)
## # A tibble: 15 × 2
## value Count
## <chr> <int>
## 1 No Data 36
## 2 LA 3
## 3 GA 2
## 4 HI 2
## 5 N.W. FL 2
## 6 TX 2
## 7 AK 1
## 8 AL 1
## 9 KY 1
## 10 MS 1
## 11 NC 1
## 12 SC 1
## 13 TN 1
## 14 VA 1
## 15 WV 1
Summary: The table above shows the frequency of named storms for each state, helping us identify the regions that are most frequently impacted.
Create a scatter plot to visualize the relationship between observed storm-tide (surge) and economic damage.
ggplot(storm_data, aes(x = Surge, y = Damage..bn.USD.)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Relationship between Surge and Damage (in billion USD)",
x = "Observed Storm-Tide (Surge)",
y = "Damage (bn USD)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
The second dataset focuses on employee salaries across different departments and divisions. Below I tried to understand the average compensation by gender, compensation distribution across departments, and how compensation varies with grade and division.
url <- "https://raw.githubusercontent.com/Amish22/DS607/refs/heads/main/Employee_Salaries_-_2023.csv"
employee_data <- read.csv(url)
head(employee_data)
## Department Department_Name Division Gender
## 1 ABS Alcohol Beverage Services ABS 85 Administration M
## 2 ABS Alcohol Beverage Services ABS 85 Administration M
## 3 ABS Alcohol Beverage Services ABS 85 Administration F
## 4 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 5 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 6 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## Base_Salary Overtime_Pay Longevity_Pay Grade
## 1 175873.00 0.00 0.00 M2
## 2 145613.36 0.00 0.00 M3
## 3 136970.00 0.00 0.00 M3
## 4 89432.69 0.00 2490.00 21
## 5 78947.00 456.68 6257.70 16
## 6 98228.00 518.80 998.28 21
To perform meaningful analysis, first convert the
Base_Salary, Overtime_Pay, and
Longevity_Pay columns to numeric values.
employee_data <- employee_data %>%
mutate(Base_Salary = as.numeric(Base_Salary),
Overtime_Pay = as.numeric(Overtime_Pay),
Longevity_Pay = as.numeric(Longevity_Pay))
head(employee_data)
## Department Department_Name Division Gender
## 1 ABS Alcohol Beverage Services ABS 85 Administration M
## 2 ABS Alcohol Beverage Services ABS 85 Administration M
## 3 ABS Alcohol Beverage Services ABS 85 Administration F
## 4 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 5 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## 6 ABS Alcohol Beverage Services ABS 85 Administrative Services F
## Base_Salary Overtime_Pay Longevity_Pay Grade
## 1 175873.00 0.00 0.00 M2
## 2 145613.36 0.00 0.00 M3
## 3 136970.00 0.00 0.00 M3
## 4 89432.69 0.00 2490.00 21
## 5 78947.00 456.68 6257.70 16
## 6 98228.00 518.80 998.28 21
Total compensation is calculated by summing up base salary, overtime pay, and longevity pay.
employee_data <- employee_data %>%
mutate(Total_Compensation = Base_Salary + Overtime_Pay + Longevity_Pay)
Explore whether there are differences in average compensation between genders.
gender_summary <- employee_data %>%
group_by(Gender) %>%
summarise(Average_Compensation = mean(Total_Compensation, na.rm = TRUE))
ggplot(gender_summary, aes(x = Gender, y = Average_Compensation, fill = Gender)) +
geom_bar(stat = "identity", width = 0.6) +
theme_minimal() +
labs(title = "Average Compensation by Gender",
x = "Gender",
y = "Average Total Compensation") +
scale_fill_manual(values = c("skyblue", "lightcoral"))
Calculate the average compensation by department and division to identify where salaries are concentrated.
department_summary <- employee_data %>%
group_by(Department, Division) %>%
summarise(Average_Department_Compensation = mean(Total_Compensation, na.rm = TRUE))
## `summarise()` has grouped output by 'Department'. You can override using the
## `.groups` argument.
Finally, create a table to view the average compensation by grade across different divisions.
compensation_by_grade <- employee_data %>%
group_by(Division, Grade) %>%
summarise(Avg_Compensation = mean(Total_Compensation, na.rm = TRUE)) %>%
pivot_wider(names_from = Grade, values_from = Avg_Compensation)
## `summarise()` has grouped output by 'Division'. You can override using the
## `.groups` argument.
Summary: The table above shows the average compensation for each grade across divisions, providing insights into pay distribution across different levels.
View(gender_summary)
View(department_summary)
View(compensation_by_grade)