library(tidyverse)
library(openintro)
library(readr)
library(dplyr)

Dataset 1: Historical Storms 2024 Data

In this section, I analyzed the Historical Storms 2024 data to explore the relationship between storm surge and impact variables, such as economic damage. Additionally, I investigated which states are more frequently affected by named storms.

Loading the Storm Dataset

url <- "https://raw.githubusercontent.com/Amish22/DS607/refs/heads/main/Historical-Storm-Data.csv"
storm_data <- read.csv(url, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(storm_data)

##                     Storm      Date Observed.Storm.Tide Forecasted.Storm.Tide
## 1             2024-Helene Sep 23-29           >9.3 mhhw            15-20 mhhw
## 2           2024-Francine    12-Sep              5 mhhw                      
## 3           2024-Gilma-cp    30-Aug              3 mhhw                      
## 4            2024-Hone-cp    25-Aug              3 mhhw                      
## 5 2024-Post-Typhoon Ampil    22-Aug            6.3 mhhw                      
## 6              2024-Debby   Aug 5-6            5.1 mhhw                      
##      Observation.System  Warning.Level Category Pressure..mb. Casualties
## 1                   FEV  Warn-A6 (1-8)     Cat4           938       221+
## 2                   FEV Warn-A5 (1-15)     Cat2           972          0
## 3          PS-Adv 43,46                    Cat4           949       <NA>
## 4 PS-Adv 2-7,9-11,13,14                    Cat1           988       <NA>
## 5                P-ETSS                    Cat4           947       <NA>
## 6                   FEV Warn-A5 (6-20)     Cat1           979         10
##   Damage..bn.USD.                  Affected.Areas
## 1           27.5+ N.W. FL, GA, AL, TN, KY, VA, WV
## 2             1.5                          LA, MS
## 3            <NA>                              HI
## 4            <NA>                              HI
## 5            <NA>                              AK
## 6              >1             N.W. FL, GA, SC, NC

Separating the States into Separate Columns

To analyze the number of named storms by state, we need to split the Affected Areas column into separate columns for each state. To achieve this using the separate() function from tidyr.

storm_data <- storm_data %>%
  separate(`Affected.Areas`, into = paste0("Area", 1:7), sep = ", ", fill = "right")

Handling Missing Values

Missing values can impact analysis and visualizations. To address this, replace all NA values in the dataset with a placeholder text "No Data".

storm_data <- storm_data %>%
  mutate(across(everything(), ~ replace(., is.na(.), "No Data")))

Analysis: Relationship Between Surge Data and Impact Variables

We are particularly interested in understanding whether higher storm surges are associated with greater economic damage. For this analysis, convert the Damage (bn USD) and Observed Storm-Tide columns to numeric values and compute their correlation.

storm_data$Damage..bn.USD. <- as.numeric(gsub("[^0-9.]", "", storm_data$Damage..bn.USD.))
storm_data$Surge <- as.numeric(gsub("[^0-9.]", "", storm_data$Observed.Storm.Tide))

# Checking correlation between surge and damage
correlation <- cor(storm_data$Surge, storm_data$Damage..bn.USD., use = "complete.obs")
correlation

## [1] 0.8433445

Summary: A correlation value close to 1 or -1 would indicate a strong linear relationship. In our case, a positive correlation suggests that higher storm surges may be associated with increased economic damage.

Analyzing the Number of Named Storms by State

Next, to identify which states have experienced the highest number of named storms, reshape the data and count the number of named storms for each state.

state_counts <- storm_data %>%
  pivot_longer(cols = starts_with("Area"), values_drop_na = TRUE) %>%
  group_by(value) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

print(state_counts)

## # A tibble: 15 × 2
##    value   Count
##    <chr>   <int>
##  1 No Data    36
##  2 LA          3
##  3 GA          2
##  4 HI          2
##  5 N.W. FL     2
##  6 TX          2
##  7 AK          1
##  8 AL          1
##  9 KY          1
## 10 MS          1
## 11 NC          1
## 12 SC          1
## 13 TN          1
## 14 VA          1
## 15 WV          1

Summary: The table above shows the frequency of named storms for each state, helping us identify the regions that are most frequently impacted.

Visualization: Relationship Between Surge and Damage

Create a scatter plot to visualize the relationship between observed storm-tide (surge) and economic damage.

ggplot(storm_data, aes(x = Surge, y = Damage..bn.USD.)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Surge and Damage (in billion USD)",
       x = "Observed Storm-Tide (Surge)",
       y = "Damage (bn USD)")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

Dataset 2: Employee Salary Data Analysis

The second dataset focuses on employee salaries across different departments and divisions. Below I tried to understand the average compensation by gender, compensation distribution across departments, and how compensation varies with grade and division.

Loading the Employee Salary Dataset

url <- "https://raw.githubusercontent.com/Amish22/DS607/refs/heads/main/Employee_Salaries_-_2023.csv"
employee_data <- read.csv(url)
head(employee_data)

##   Department           Department_Name                       Division Gender
## 1        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 2        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 3        ABS Alcohol Beverage Services          ABS 85 Administration      F
## 4        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 5        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 6        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
##   Base_Salary Overtime_Pay Longevity_Pay Grade
## 1   175873.00         0.00          0.00    M2
## 2   145613.36         0.00          0.00    M3
## 3   136970.00         0.00          0.00    M3
## 4    89432.69         0.00       2490.00    21
## 5    78947.00       456.68       6257.70    16
## 6    98228.00       518.80        998.28    21

Data Transformation: Converting Salary Columns to Numeric

To perform meaningful analysis, first convert the Base_Salary, Overtime_Pay, and Longevity_Pay columns to numeric values.

employee_data <- employee_data %>%
  mutate(Base_Salary = as.numeric(Base_Salary),
         Overtime_Pay = as.numeric(Overtime_Pay),
         Longevity_Pay = as.numeric(Longevity_Pay))
head(employee_data)

##   Department           Department_Name                       Division Gender
## 1        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 2        ABS Alcohol Beverage Services          ABS 85 Administration      M
## 3        ABS Alcohol Beverage Services          ABS 85 Administration      F
## 4        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 5        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
## 6        ABS Alcohol Beverage Services ABS 85 Administrative Services      F
##   Base_Salary Overtime_Pay Longevity_Pay Grade
## 1   175873.00         0.00          0.00    M2
## 2   145613.36         0.00          0.00    M3
## 3   136970.00         0.00          0.00    M3
## 4    89432.69         0.00       2490.00    21
## 5    78947.00       456.68       6257.70    16
## 6    98228.00       518.80        998.28    21

Calculating Total Compensation

Total compensation is calculated by summing up base salary, overtime pay, and longevity pay.

employee_data <- employee_data %>%
  mutate(Total_Compensation = Base_Salary + Overtime_Pay + Longevity_Pay)

Analysis: Average Compensation by Gender

Explore whether there are differences in average compensation between genders.

gender_summary <- employee_data %>%
  group_by(Gender) %>%
  summarise(Average_Compensation = mean(Total_Compensation, na.rm = TRUE))

ggplot(gender_summary, aes(x = Gender, y = Average_Compensation, fill = Gender)) +
  geom_bar(stat = "identity", width = 0.6) +
  theme_minimal() +
  labs(title = "Average Compensation by Gender",
       x = "Gender",
       y = "Average Total Compensation") +
  scale_fill_manual(values = c("skyblue", "lightcoral"))

Analyzing Compensation by Department

Calculate the average compensation by department and division to identify where salaries are concentrated.

department_summary <- employee_data %>%
  group_by(Department, Division) %>%
  summarise(Average_Department_Compensation = mean(Total_Compensation, na.rm = TRUE))

## `summarise()` has grouped output by 'Department'. You can override using the
## `.groups` argument.

Compensation by Grade and Division

Finally, create a table to view the average compensation by grade across different divisions.

compensation_by_grade <- employee_data %>%
  group_by(Division, Grade) %>%
  summarise(Avg_Compensation = mean(Total_Compensation, na.rm = TRUE)) %>%
  pivot_wider(names_from = Grade, values_from = Avg_Compensation)

## `summarise()` has grouped output by 'Division'. You can override using the
## `.groups` argument.

Summary: The table above shows the average compensation for each grade across divisions, providing insights into pay distribution across different levels.

View(gender_summary)
View(department_summary)
View(compensation_by_grade)

Project 2: Data Analysis on Storm and Employee Datasets

Amish Rasheed

2024-10-07