Synopsis

This analysis identifies which types of weather event in the United States between 1950 and 2011 were most harmful to public health and which had the greatest economic consequences. The data was downloaded from the web and processed with the dplyr package, which was used to create new variables and filter the results. Fatalities and injuries were added together into a new column for “public health damage”. Property damage was combined with its magnitude notation (e.g. a value of “1.5” with the code “B” for billions), and crop damage was treated the same way, so that a total cost could be calculated. The new columns were then summed by event type and sorted to find the largest contributors to public health damage and to economic damage respectively. Graphs were created from the processed data, and the conclusions of the analysis are given at the end.

Data Processing

———————————————————————————

Define the URL and the file path

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
path <- "/Users/seanmurphy/Desktop/DS_coursera/storm_data_project"
file_name <- "storm_data.csv.bz2"

Check if the directory exists; if not, create it

if (!dir.exists(path)) {
  dir.create(path, recursive = TRUE)
}

Construct the full file path

file_path <- file.path(path, file_name)

Download the file

download.file(url, file_path, mode = "wb")
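Re-running the analysis repeats this download every time. A minimal sketch of a guard that skips the download when the file is already present; `download_if_missing` is a hypothetical helper name and the URL in the demonstration is a placeholder, neither is part of the original analysis:

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
# Returns TRUE when a download was attempted, FALSE when the existing file was kept.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  download.file(url, dest, mode = "wb")
  TRUE
}

# Demonstration with a file that already exists, so no network access occurs.
# The URL is a placeholder and is never contacted here.
existing <- tempfile(fileext = ".csv.bz2")
file.create(existing)
downloaded <- download_if_missing("https://example.com/unused", existing)
```

In the analysis itself the call would take the `url` and `file_path` defined above.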

Confirm the download and record the date it was downloaded

message("File downloaded to: ", file_path)
## File downloaded to: /Users/seanmurphy/Desktop/DS_coursera/storm_data_project/storm_data.csv.bz2
dateDownloaded <- date()
dateDownloaded
## [1] "Sun Sep 29 16:25:31 2024"

Load the data in and save it as a variable “storms”

storms <- read.csv(file_path)

Load necessary libraries for analysis

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     storms
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Convert all column names to lower case and expand their abbreviations to make them human readable

colnames(storms) <- tolower(colnames(storms))
colnames(storms) <- gsub("bgn", "beginning", colnames(storms))
colnames(storms) <- gsub("ev", "event", colnames(storms))
colnames(storms) <- gsub("locati", "location", colnames(storms))
colnames(storms) <- gsub("azi", "azimuth", colnames(storms))
colnames(storms) <- gsub("prop", "property", colnames(storms))
colnames(storms) <- gsub("dmg", "damage", colnames(storms))
colnames(storms) <- gsub("wfo", "weather_forecast_office", colnames(storms))
colnames(storms) <- gsub("offic$", "office", colnames(storms))
colnames(storms) <- gsub("exp", "exponent", colnames(storms))
colnames(storms) <- gsub("refnum", "reference_number", colnames(storms))
colnames(storms) <- gsub("tstm", "thunderstorm", colnames(storms))
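Because `gsub()` replaces every substring match, each replacement also acts on the output of the earlier ones. A small sketch on two made-up column names (not read from the data) shows how a pattern can match again inside an already expanded name, and how anchoring it with `$` avoids that:

```r
# Made-up column names that mirror the pattern of the storm data
cols <- c("wfo", "stateoffic")

# First expansion introduces the text "office"
step1 <- gsub("wfo", "weather_forecast_office", cols)

# Unanchored: "offic" also matches inside the already expanded "office"
unanchored <- gsub("offic", "office", step1)

# Anchored with "$": only names that END in "offic" are changed
anchored <- gsub("offic$", "office", step1)

print(unanchored)
print(anchored)
```

The unanchored version turns the first name into `weather_forecast_officee`, while the anchored version leaves it intact and still expands `stateoffic`.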

————————————Q1————————————————–

Pre-processing for question 1 regarding the events with the largest impact on public health

Create a new dataframe that has combined fatalities and injuries into a single variable

storms_q1 <- storms %>%
  mutate(public_health_damage=fatalities + injuries)

Create a dataframe that has ordered the entire dataframe by its impact on public health in descending order

public_health_damage <- storms_q1 %>%
  arrange(desc(public_health_damage))

Create a dataset which combines event types by name to get a cumulative sum of each event type's impact on public health, i.e. combining all instances of “fire” or “flood”.

public_health_df <- aggregate(public_health_damage$public_health_damage, 
                       by=list(public_health_damage$eventtype), FUN=sum)

Arrange in descending order by public_health_damage

public_health_df %>%
  arrange(desc(x))
# a cut-off of 1000 keeps the top 14 event types

Retain only the top 14 contributors to public health damage and save it as a variable

largest_health_damage <- public_health_df %>%
  filter(x >= 1000)
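The same aggregation can be written entirely in dplyr, which keeps descriptive column names instead of the `Group.1` and `x` that `aggregate()` produces. A sketch on invented toy data, with a cut-off of 10 chosen only to suit the toy values:

```r
library(dplyr)

# Toy data standing in for the storm records (values are invented)
toy <- data.frame(
  eventtype  = c("FLOOD", "TORNADO", "FLOOD", "HAIL", "TORNADO"),
  fatalities = c(1, 10, 2, 0, 5),
  injuries   = c(3, 40, 4, 1, 20)
)

# Sum fatalities and injuries per event type, largest first
health_by_event <- toy %>%
  group_by(eventtype) %>%
  summarise(public_health_damage = sum(fatalities + injuries)) %>%
  arrange(desc(public_health_damage))

# Keep event types at or above a cut-off (10 here, chosen for the toy data)
largest <- health_by_event %>%
  filter(public_health_damage >= 10)
```

This is an alternative formulation, not the method used in the analysis above.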

————————————Q2————————————————–

Pre-processing data for question two regarding the events with the greatest economic damage.

Combine the numeric damage values with their magnitude codes, billions (“B”), millions (“M”), and thousands (“K”), to obtain the full property and crop cost in thousands of dollars.

# Convert a magnitude code to a multiplier in thousands of dollars.
# Blank or unrecognised codes leave the value unchanged (an assumption).
magnitude_multiplier <- function(code) {
  multiplier <- unname(c(K = 1, M = 1e3, B = 1e6)[toupper(code)])
  multiplier[is.na(multiplier)] <- 1
  multiplier
}

Multiply each damage figure by its multiplier rather than appending zeros to the text, so that decimal values such as 1.5 with the code “B” are scaled correctly.

# Financial figures for property damage
# (`$` partial matching resolves the abbreviated "...exp" column names)
storms$property_cost <- storms$propertydamage * magnitude_multiplier(storms$propertydamageexp)
# Financial figures for crop damage
storms$crop_cost <- storms$cropdamage * magnitude_multiplier(storms$cropdamageexp)
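To see why the conversion is best done by multiplication rather than by appending zeros to the text, here is a toy comparison on invented values. Appending zeros works for whole numbers but leaves a decimal value essentially unscaled:

```r
# Toy damage values and magnitude codes (invented for illustration)
value <- c(1.5, 25, 2.5)
code  <- c("B", "K", "M")

# Appending zeros to the text only scales whole numbers:
# "1.5B" becomes "1.5000000", which still parses as 1.5
pasted <- paste0(value, code)
pasted <- gsub("B", "000000", pasted)
pasted <- gsub("M", "000", pasted)
pasted <- gsub("K", "", pasted)
by_text <- as.numeric(pasted)

# Multiplying by a factor per code scales decimals correctly
# (units: thousands of dollars, so K = 1, M = 1e3, B = 1e6)
multiplier <- c(K = 1, M = 1e3, B = 1e6)
by_multiplication <- value * unname(multiplier[code])
```

Here `by_text` stays at 1.5, 25 and 2.5, whereas `by_multiplication` gives 1,500,000, 25 and 2,500 thousand dollars.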

Add a new variable which adds both crop and property cost together and save it as a new dataframe

storms_q2 <- mutate(storms, total_cost= property_cost+crop_cost) 

Create a dataframe that has ordered the entire dataframe by its economic impact

total_economic_damage <- storms_q2 %>%
  arrange(desc(total_cost))

Create a dataset which combines event types together to get a cumulative sum of each event type's impact on the economy (in terms of property and crop damage)

economic_damage_df <- aggregate(total_economic_damage$total_cost, 
                              by=list(total_economic_damage$eventtype), FUN=sum)
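Note that `sum()` returns `NA` for any group that contains a missing value, so a single missing cost would blank out the total for a whole event type. A sketch on invented data of how `na.rm = TRUE` changes the aggregate; whether missing costs should be skipped is a judgment call rather than something the analysis specifies:

```r
# Toy costs with one missing value (invented for illustration)
toy <- data.frame(
  eventtype  = c("FLOOD", "FLOOD", "HAIL"),
  total_cost = c(100, NA, 50)
)

# Default: a single NA makes the whole group's sum NA
with_na <- aggregate(toy$total_cost, by = list(toy$eventtype), FUN = sum)

# Extra arguments are passed on to FUN, so na.rm = TRUE skips missing costs
without_na <- aggregate(toy$total_cost, by = list(toy$eventtype),
                        FUN = sum, na.rm = TRUE)
```

A subsequent `filter(x >= ...)` silently drops groups whose total is `NA`, so checking for them before filtering is worthwhile.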

Arrange in descending order by total economic cost

economic_damage_df %>%
  arrange(desc(x))
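# a cut-off of 1000000 (costs are in thousands of dollars, so $1 billion) sits just below frost/freeze

Retain only the largest contributors to economic damage: event types with a total cost of at least 1,000,000 (in thousands of dollars)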

largest_economic_damage <- economic_damage_df %>%
  filter(x >= 1000000)

=================== RESULTS ==========================================

——————— Q1 ———————————————

Creating a plot that visually represents which types of events are most harmful with respect to population health.

The axes are labelled to match the variables, and the event names are rotated on the x axis for greater readability.

library(ggplot2)
ggplot(largest_health_damage, aes(x=Group.1, y=x)) + 
  geom_bar(stat = "identity", width=0.2) +
  labs(y="Fatalities & Injuries") +
  labs(x="Weather Group") +
  labs(title="Event Types with Largest Impact on Public Health") +
  expand_limits(x=c(0,12), y=c(0,6000)) +
  theme(axis.text.x = element_text(angle = 90))
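By default ggplot2 places discrete categories in alphabetical order, which makes the ranking harder to read. A sketch on invented totals of ordering the bars by height with `reorder()`; `geom_col()` is the shorthand for `geom_bar(stat = "identity")`:

```r
library(ggplot2)

# Toy aggregate with the same column names that aggregate() produces
toy <- data.frame(
  Group.1 = c("FLOOD", "HAIL", "TORNADO"),
  x       = c(10, 1, 75)
)

# reorder(Group.1, -x) orders the categories by descending total
bar_order <- levels(reorder(toy$Group.1, -toy$x))

p <- ggplot(toy, aes(x = reorder(Group.1, -x), y = x)) +
  geom_col() +
  labs(x = "Weather Group", y = "Fatalities & Injuries") +
  theme(axis.text.x = element_text(angle = 90))
```

Printing `p` draws the bars from the largest total to the smallest.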

——————— Q2 ———————————————–

Creating a plot that visually represents which types of events have the greatest economic consequences.

The axes are labelled to match the variables, and the event names are rotated on the x axis for greater readability.

ggplot(largest_economic_damage, aes(x=Group.1, y=x)) + 
  geom_bar(stat = "identity", width=0.2) +
  labs(y="Cost (thousands of $)") +
  labs(x="Weather Group") +
  labs(title="Event Types with Largest Economic Consequences") +
  expand_limits(x=c(0,12)) +
  theme(axis.text.x = element_text(angle = 90))

Conclusion

The two graphs show that a different weather event dominates each measure. As the graph for Q1 shows, tornadoes caused by far the largest number of fatalities and injuries in the United States from 1950 to 2011, approximately 97,000 in total. The graph for Q2 shows that floods caused the greatest total economic damage (property and crop combined) over the same period, approximately $150 billion.