This analysis follows the parameters of the project. The data were downloaded from the web and processed with the dplyr package, which was used to create new variables and to filter existing ones. Fatalities and injuries were combined into a new column for “public health damage”. Similarly, property damage and its magnitude notation (e.g. “1.5”, “B”) were combined, as were crop damage and its notation, so that a total cost could be computed. The new columns were then used to rank event types and identify the largest contributors to public health damage and to economic consequences, respectively. Graphs were created once the data were fully processed, and the conclusions of the analysis appear at the end.
Define the URL and the file path
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
path <- "/Users/seanmurphy/Desktop/DS_coursera/storm_data_project"
file_name <- "storm_data.csv.bz2"
Check if the directory exists; if not, create it
if (!dir.exists(path)) {
dir.create(path, recursive = TRUE)
}
Construct the full file path
file_path <- file.path(path, file_name)
Download the file
download.file(url, file_path, mode = "wb")
Confirm the download and date downloaded
message("File downloaded to: ", file_path)
## File downloaded to: /Users/seanmurphy/Desktop/DS_coursera/storm_data_project/storm_data.csv.bz2
dateDownloaded <- date()
dateDownloaded
## [1] "Sun Sep 29 16:25:31 2024"
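Re-running the steps above downloads the archive again every time. A minimal sketch of a guard against that is shown here; `fetch_once` is a hypothetical helper, not part of the original analysis, and it only assumes that an existing local copy is good enough to reuse.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# download a file only when no local copy exists, and return the local
# path either way.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the target directory if needed, then fetch in binary mode
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, dest, mode = "wb")
  }
  dest
}

# With the names defined above, the call would look like:
# file_path <- fetch_once(url, file.path(path, file_name))
```

Because the function returns the destination path, its result can be passed straight to read.csv().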
Load the data and save it as a variable “storms”
storms <- read.csv(file_path)
Load necessary libraries for analysis
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked _by_ '.GlobalEnv':
##
## storms
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Convert all column names to lower case and make them human readable
colnames(storms) <- tolower(colnames(storms))
colnames(storms) <- gsub("bgn", "beginning", colnames(storms))
colnames(storms) <- gsub("ev", "event", colnames(storms))
colnames(storms) <- gsub("locati", "location", colnames(storms))
colnames(storms) <- gsub("azi", "azimuth", colnames(storms))
colnames(storms) <- gsub("prop", "property", colnames(storms))
colnames(storms) <- gsub("dmg", "damage", colnames(storms))
colnames(storms) <- gsub("wfo", "weather_forecast_office", colnames(storms))
colnames(storms) <- gsub("offic$", "office", colnames(storms))
colnames(storms) <- gsub("exp", "exponent", colnames(storms))
colnames(storms) <- gsub("refnum", "reference_number", colnames(storms))
colnames(storms) <- gsub("tstm", "thunderstorm", colnames(storms))
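Sequential substitutions like these can interact: an unanchored "offic" pattern also matches inside "weather_forecast_office" and would turn it into "weather_forecast_officee". The sketch below shows one way to keep the replacements in a single ordered lookup and to anchor the risky pattern. The sample names are assumed to match a few columns of the raw header, and `renames` and `clean_names` are illustrative names only.

```r
# A few column names, assumed to match the raw file's header
raw_names <- c("BGN_DATE", "EVTYPE", "PROPDMGEXP", "WFO", "STATEOFFIC")

# Ordered pattern -> replacement pairs. "offic$" is anchored to the end of
# the name so it cannot match inside "weather_forecast_office".
renames <- c(bgn = "beginning", ev = "event", prop = "property",
             dmg = "damage", wfo = "weather_forecast_office",
             "offic$" = "office")

clean_names <- tolower(raw_names)
for (pattern in names(renames)) {
  clean_names <- gsub(pattern, renames[[pattern]], clean_names)
}
clean_names
```

Keeping the pairs in one vector makes the order of application explicit, which matters whenever one replacement produces text that a later pattern could match.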
Create a new dataframe that has combined fatalities and injuries into a single variable
storms_q1 <- storms %>%
mutate(public_health_damage=fatalities + injuries)
Create a dataframe that has ordered the entire dataframe by its impact on public health in descending order
public_health_damage <- storms_q1 %>%
arrange(desc(public_health_damage))
Create a dataset that groups rows by event type name to get the cumulative impact of each event type on public health, i.e. summing all rows labelled “fire” or all rows labelled “flood”.
public_health_df <- aggregate(public_health_damage$public_health_damage,
by=list(public_health_damage$eventtype), FUN=sum)
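The call above returns columns with the default names Group.1 and x. As a hedged alternative, the formula interface of aggregate() keeps the original variable names in the result. The sketch below uses a small made-up data frame (`toy`) standing in for storms_q1; the values are illustrative, not taken from the dataset.

```r
# Toy data standing in for storms_q1 (values are made up for illustration)
toy <- data.frame(
  eventtype = c("FLOOD", "TORNADO", "FLOOD", "HEAT", "TORNADO"),
  public_health_damage = c(2, 10, 3, 4, 5)
)

# The formula interface names the output columns after the variables,
# instead of the default Group.1 and x
harm_by_event <- aggregate(public_health_damage ~ eventtype,
                           data = toy, FUN = sum)
harm_by_event
```

Note that the formula method drops rows with missing values by default (na.action = na.omit), which differs from the vector form used above.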
Arrange in descending order by public_health_damage
public_health_df %>%
arrange(desc(x))
# top 10 cut off is ~1000 with avalanche
Retain only the top 14 contributors to public health damage (event types with at least 1,000 combined fatalities and injuries) and save them as a variable
largest_health_damage <- public_health_df %>%
filter(x >= 1000)
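A fixed threshold has to be read off the sorted output by hand. A small sketch of selecting the top n rows directly is given below; `top_n_events` is a hypothetical helper and `toy_totals` is made-up data that mimics the Group.1 and x columns produced by aggregate().

```r
# Made-up totals with the same column names that aggregate() produces
toy_totals <- data.frame(
  Group.1 = c("AVALANCHE", "FLOOD", "HEAT", "TORNADO"),
  x = c(3, 50, 20, 90)
)

# Hypothetical helper: sort by x in descending order and keep the first n rows
top_n_events <- function(df, n) {
  head(df[order(df$x, decreasing = TRUE), ], n)
}

top2 <- top_n_events(toy_totals, 2)
top2
```

This keeps the number of retained event types explicit, so it does not need to be revisited if the totals change.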
Combine the numeric damage values with their magnitude codes to capture the full property and crop impact in billions (“B”), millions (“M”), and thousands (“K”).
storms$property_cost <- paste(storms$propertydamage, storms$propertydamageexp, sep="")
storms$crop_cost <- paste(storms$cropdamage, storms$cropdamageexp, sep="")
Replace the magnitude codes for property and crop damage with trailing zeros, so that every figure is expressed in thousands of dollars, and then make the values numeric.
# Financial figures for property damage
storms$property_cost <- gsub("B", "000000", storms$property_cost)
storms$property_cost <- gsub("M", "000", storms$property_cost)
storms$property_cost <- gsub("K", "", storms$property_cost)
storms$property_cost <- as.numeric(storms$property_cost)
## Warning: NAs introduced by coercion
# Financial figures for crop damage
storms$crop_cost <- gsub("B", "000000", storms$crop_cost)
storms$crop_cost <- gsub("M", "000", storms$crop_cost)
storms$crop_cost <- gsub("K", "", storms$crop_cost)
storms$crop_cost <- as.numeric(storms$crop_cost)
## Warning: NAs introduced by coercion
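Appending zeros as text places them after any decimal point, so a value such as "2.5M" becomes "2.5000" and parses as 2.5. A numeric multiplier avoids this. The sketch below is one possible approach, not the method used in this analysis: `damage_in_dollars` is a hypothetical helper, it assumes only the K, M and B codes described above are meaningful, it treats a blank code as no multiplier, and it returns NA for any other code. Its result is in dollars, whereas the substitution above yields thousands of dollars.

```r
# Hypothetical helper: convert a damage value and its magnitude code to
# dollars. Assumes K = thousands, M = millions, B = billions; a blank code
# means no multiplier and any other code gives NA.
damage_in_dollars <- function(value, code) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(trimws(code))
  # Named-vector lookup returns NA for codes that are not K, M or B
  mult <- ifelse(code == "", 1, multipliers[code])
  value * unname(mult)
}

example_costs <- damage_in_dollars(c(25, 2.5, 115, 10, 7),
                                   c("K", "M", "B", "", "?"))
example_costs
```

If this form were adopted, any downstream thresholds would need to be scaled from thousands of dollars to dollars.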
Add a new variable that sums crop and property cost, and save the result as a new dataframe
storms_q2 <- mutate(storms, total_cost = property_cost + crop_cost)
Create a dataframe that has ordered the entire dataframe by its economic impact
total_economic_damage <- storms_q2 %>%
arrange(desc(total_cost))
Create a dataset that groups rows by event type to get the cumulative impact of each event type on the economy (in terms of property and crop damage)
economic_damage_df <- aggregate(total_economic_damage$total_cost,
by=list(total_economic_damage$eventtype), FUN=sum)
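Because the numeric coercion earlier can introduce NAs, sum() returns NA for any event type that has at least one missing cost. The sketch below contrasts the default behaviour with passing na.rm = TRUE through aggregate(); `toy_costs` is made-up data, and whether missing costs should be ignored is an analysis decision rather than something the source settles.

```r
# Made-up costs with one missing value
toy_costs <- data.frame(
  eventtype = c("FLOOD", "FLOOD", "HAIL"),
  total_cost = c(100, NA, 40)
)

# Default: a single NA makes the whole group's sum NA
strict <- aggregate(toy_costs$total_cost,
                    by = list(toy_costs$eventtype), FUN = sum)

# Extra arguments are passed on to FUN, so na.rm = TRUE skips missing costs
lenient <- aggregate(toy_costs$total_cost,
                     by = list(toy_costs$eventtype), FUN = sum, na.rm = TRUE)

strict
lenient
```

Groups whose sum is NA are silently dropped by a later filter() on x, so the choice affects which event types can appear in the final ranking.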
Arrange in descending order by total economic cost
economic_damage_df %>%
arrange(desc(x))
# top 15 events with a cut off at ~ 1000000 with frost/freeze
Retain only the 15 largest contributors to economic damage by total cost (at least 1,000,000 in thousands of dollars, i.e. $1 billion)
largest_economic_damage <- economic_damage_df %>%
filter(x >= 1000000)
Plot the event types with the largest impact on public health. The axes are labelled to match the variables, and the event names are rotated on the x-axis for readability
library(ggplot2)
ggplot(largest_health_damage, aes(x=Group.1, y=x)) +
geom_bar(stat = "identity", width=0.2) +
labs(y="Fatalities & Injuries") +
labs(x="Weather Group") +
labs(title="Event Types with Largest Impact on Public Health") +
expand_limits(x=c(0,12), y=c(0,6000)) +
theme(axis.text.x = element_text(angle = 90))
Plot the event types with the largest economic consequences. The axes are labelled to match the variables, and the event names are rotated on the x-axis for readability
ggplot(largest_economic_damage, aes(x=Group.1, y=x)) +
geom_bar(stat = "identity", width=0.2) +
labs(y="Cost (thousands of $)") +
labs(x="Weather Group") +
labs(title="Event Types with Largest Economic Consequences") +
expand_limits(x=c(0,12)) +
theme(axis.text.x = element_text(angle = 90))
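Both plots draw the bars in alphabetical order of event type. As a hedged variation, reorder() can sort the bars by value so that the largest contributor appears first. The sketch below uses made-up totals (`toy_totals`) with the same Group.1 and x column names; it is an illustration, not the plot used for the conclusions.

```r
library(ggplot2)

# Made-up totals with the same column names that aggregate() produces
toy_totals <- data.frame(
  Group.1 = c("FLOOD", "HEAT", "TORNADO"),
  x = c(50, 20, 90)
)

# reorder() sorts factor levels by the given score; negating x puts the
# largest total first on the x-axis
ordered_events <- reorder(toy_totals$Group.1, -toy_totals$x)

p <- ggplot(toy_totals, aes(x = reorder(Group.1, -x), y = x)) +
  geom_col() +
  labs(x = "Weather Group", y = "Fatalities & Injuries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

geom_col() is equivalent to geom_bar(stat = "identity") and is used here only for brevity.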
The graphs indicate that two different weather events have a disproportionate impact on public health and on economic damage, respectively. The graph for Q1 shows that tornadoes caused the largest number of fatalities and injuries in the United States from 1950 to 2011, approximately 97,000 combined. The graph for Q2 shows that floods contributed just under $150 billion in total economic damage (property and crop) in the United States over the same period.