This analysis uses the US. NOAA Storm Database collected from 1950-2011. In the following, we describe how we load in the data, process the data, and analyze the data for which events in the entire US were the most harmful (by morbidty, inury-alone, fatality-alone) and/or the most expensive in terms of property and or crop damage.
To begin, we…
load in our dataset by first creating a “filepath” object of the directory where the data is located.
Then, we load in the data.table package to read in the raw data file.
After, we create our data.table object (dt) using the fread function and defining the “” values included in the raw data as NAs.
Finally, we examine the structure of the data by using the str function.
#### Load in the data
options(scipen = 9999)
filepath <- "~/R Programming/PA2_Module4/repdata_data_StormData.csv/repdata_data_StormData.csv"
library(data.table)
dt <- fread(filepath,
na.strings = "")
str(dt)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
To start our harmful event analysis, we create the harm_events data.table object which contains the morbidity (or the total number of injuries and fatalities by events), the total injuries by event, and the total fatalities by event. We then include the percentages of each event (total, injuries, and fatalities). Next, we set the order of the harm_events object by descending order of the total so that we can visualize the most harmful event in aggregate. Finally, we view the harm_event data.table object and see that tornadoes have been responsible for the most injuries and/or deaths in aggregate, making them the most harmful events to have occurred during the duration of this study. However, caution should be taken as it is impossible to determine if one event could have caused a high proportion of death or injury, or whether some injuries could have led to some deaths within the data (double counting).
harm_events <- dt[, .(total = sum(INJURIES, FATALITIES), injuries = sum(INJURIES), fatalities = sum(FATALITIES)), EVTYPE]
harm_events[, total_percentage := total/sum(total) *100]
harm_events[, injury_percentage := injuries/sum(injuries) *100]
harm_events[, fatality_percentage := fatalities/sum(fatalities) *100]
setorder(x = harm_events, -total)
View(harm_events)
For our expensive event analysis, we create the expense_events data.table object which contains the total expense from both property and crop damages, as well as property and crop damages individually. We then include the percentages of each event (total expense, property expense, and crop expense). Next, we set the order of the expense_events object by descending order of the total so that we can visualize the most expensive event in aggregate. Finally, we view the harm_event data.table object and see that tornadoes have been responsible for the most property and crop related expenses in aggregate, making them the most expensive events to have occurred during the duration of this study.
expense_events <- dt[, .(total = sum(PROPDMG,CROPDMG),
property_damage_expense = sum(PROPDMG),
crop_damage_expense = sum(CROPDMG)), EVTYPE ]
expense_events[, total_percentage := total/sum(total) *100]
expense_events[, property_damage_expense_percentage := property_damage_expense/sum(property_damage_expense) *100]
expense_events[, crop_damage_expense_percentage := crop_damage_expense/sum(crop_damage_expense) *100]
setorder(x = expense_events, -total)
View(expense_events)
Since we have 985 individual events, we will plot only the top 10 events for both morbidty and expense. To do this, we first subset each object, both harm_event and expense event, to only include the first 10 rows of data as they are already in descending order and this will capture the top ten events for each category (harm and expense). Using the ggplot package we order the event type in descending order by the total (either total morbidity for the top 10 most harmful natural events or total expense for the top 10 most expensive event). The y-axis for both plots is the total morbidty or expense, respectively. We include the total count and percentages above the bars for our bar plots (created with geom_col), set the dimensions for our axis text using text(axis.text.x()), and define our labels. The results clearly show that tornadoes are both the most harmful and most expensive events to have occurred during the course of the study.
library(ggplot2)
#Top 10 most harmful events
ggplot(data = harm_events[1:10], mapping = aes(x = reorder(EVTYPE, -total), y = total)) +
geom_col(colour = "black", fill = "steelblue") +
geom_text(aes(label = paste0(total, " (", round(total_percentage, 1), "%)")), #add count and rounded percentages
vjust = -0.3, #vertical adjustment above bar of count/percent labels
size = 1.5) +
theme(axis.text.x = element_text(angle = 45, #angle of x-axis labels
hjust = 1,
vjust = 1,
size = 7,
color = "black"
)) +
labs(title = "Top 10 Most Harmful Natural Events in US",
x = "Event Type",
y = "Total Morbidity")
#Top 10 most expensive events
ggplot(data = expense_events[1:10], mapping = aes(x = reorder(EVTYPE, -total), y = total)) +
geom_col(colour = "black", fill = "steelblue") +
geom_text(aes(label = paste0(total, " (", round(total_percentage, 1), "%)")), #add count and rounded percentages
vjust = -0.3, #vertical adjustment above bar of count/percent labels
size = 1.5) +
theme(axis.text.x = element_text(angle = 45, #angle of x-axis labels
hjust = 1,
vjust = 1,
size = 7,
color = "black"
)) +
labs(title = "Top 10 Most Expensive Natural Events in US",
x = "Event Type",
y = "Total Expense")