This analysis uses the US. NOAA Storm Database collected from 1950-2011. In the following, we describe how we load in the data, process the data, and analyze the data for which events in the entire US were the most harmful (by morbidty, inury-alone, fatality-alone) and/or the most expensive in terms of property and or crop damage.
To begin, we…
load in our dataset by first creating a “filepath” object of the directory where the data is located.
Then, we load in the data.table package to read in the raw data file.
After, we create our data.table object (dt) using the fread function and defining the “” values included in the raw data as NAs.
Finally, we examine the structure of the data by using the str function.
#### Load in the data
options(scipen = 9999)
filepath <- "~/R Programming/PA2_Module4/repdata_data_StormData.csv/repdata_data_StormData.csv"
library(data.table)
dt <- fread(filepath,
na.strings = "")
str(dt)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
To start our harmful event analysis, we create the harm_events data.table object which contains the morbidity (or the total number of injuries and fatalities by events), the total injuries by event, and the total fatalities by event. We then include the percentages of each event (total, injuries, and fatalities). Next, we set the order of the harm_events object by descending order of the total so that we can visualize the most harmful event in aggregate. Finally, we view the harm_event data.table object and see that tornadoes have been responsible for the most injuries and/or deaths in aggregate, making them the most harmful events to have occurred during the duration of this study. However, caution should be taken as it is impossible to determine if one event could have caused a high proportion of death or injury, or whether some injuries could have led to some deaths within the data (double counting).
harm_events <- dt[, .(total = sum(INJURIES, FATALITIES), injuries = sum(INJURIES), fatalities = sum(FATALITIES)), EVTYPE]
harm_events[, total_percentage := total/sum(total) *100]
harm_events[, injury_percentage := injuries/sum(injuries) *100]
harm_events[, fatality_percentage := fatalities/sum(fatalities) *100]
setorder(x = harm_events, -total)
View(harm_events)
To determine the most expensive events, we…
first must convert all the alpha numerics into actual numeric values. To accomplishe this, for each type of damage (property or crop) we examine the unique values and then convert those unique values into actual numerics that the original symbols represent. For instance, K = 1000 and M = 1000000, etc..
We then add two additional variables to our data.table that contains these converted values and confirm that our conversion was successful by comparing both the old and new variables to eachother.
We then create a separate data.table that includes the total of the property and crop damage combined, along with the individual damages from each type of damage.
#Convert unique values in PROPMGEXP and make new variable
dt[,.N,.(PROPDMGEXP)] #Examine the unique values in variable
## PROPDMGEXP N
## <char> <int>
## 1: K 424665
## 2: M 11330
## 3: <NA> 465934
## 4: B 40
## 5: m 7
## 6: + 5
## 7: 0 216
## 8: 5 28
## 9: 6 4
## 10: ? 8
## 11: 4 4
## 12: 2 13
## 13: 3 4
## 14: h 1
## 15: 7 5
## 16: H 6
## 17: - 1
## 18: 1 25
## 19: 8 1
dt[PROPDMGEXP == "K", Prop_dam_exp := 10^3] #Make k 1000
dt[PROPDMGEXP == "M" | PROPDMGEXP == "m", Prop_dam_exp := 10^6] #Make M 1000000
dt[PROPDMGEXP == "B", Prop_dam_exp := 10^9] #Make B 1000000000
dt[PROPDMGEXP == "H" | PROPDMGEXP == "h", Prop_dam_exp := 10^2] #Make h/H 100
dt[PROPDMGEXP == "?" | PROPDMGEXP == "-" | is.na(PROPDMGEXP), Prop_dam_exp := 0] #Make - 0
dt[PROPDMGEXP %in% 0:8 , Prop_dam_exp := 10] #Make 0-8 10
dt[PROPDMGEXP == "+" , Prop_dam_exp := 1] #Make - 1
dt[, .N,keyby=.(Prop_dam_exp,PROPDMGEXP)] #Ensure that conversion worked appropriately
## Key: <Prop_dam_exp, PROPDMGEXP>
## Prop_dam_exp PROPDMGEXP N
## <num> <char> <int>
## 1: 0 <NA> 465934
## 2: 0 - 1
## 3: 0 ? 8
## 4: 1 + 5
## 5: 10 0 216
## 6: 10 1 25
## 7: 10 2 13
## 8: 10 3 4
## 9: 10 4 4
## 10: 10 5 28
## 11: 10 6 4
## 12: 10 7 5
## 13: 10 8 1
## 14: 100 H 6
## 15: 100 h 1
## 16: 1000 K 424665
## 17: 1000000 M 11330
## 18: 1000000 m 7
## 19: 1000000000 B 40
#Convert unique values in CROPMGEXP and make new variable
dt[,unique(CROPDMGEXP)] #Examine the unique values in variable
## [1] NA "M" "K" "m" "B" "?" "0" "k" "2"
dt[CROPDMGEXP == "K" | CROPDMGEXP == "k", Crop_dam_exp := 10^3] #Make k 1000
dt[CROPDMGEXP == "m" | CROPDMGEXP == "M", Crop_dam_exp := 10^6] #Make M 1000000
dt[CROPDMGEXP == "B", Crop_dam_exp := 10^9] #Make B 1000000000
dt[CROPDMGEXP == "?" | is.na(CROPDMGEXP), Crop_dam_exp := 0] #Make - 0
dt[CROPDMGEXP %in% 0:8, Crop_dam_exp := 0]
dt[, .N,keyby=.(Crop_dam_exp,CROPDMGEXP)] #Ensure that conversion worked appropriately
## Key: <Crop_dam_exp, CROPDMGEXP>
## Crop_dam_exp CROPDMGEXP N
## <num> <char> <int>
## 1: 0 <NA> 618413
## 2: 0 0 19
## 3: 0 2 1
## 4: 0 ? 7
## 5: 1000 K 281832
## 6: 1000 k 21
## 7: 1000000 M 1994
## 8: 1000000 m 1
## 9: 1000000000 B 9
expense_events <- dt[, .(total = sum(Prop_dam_exp,Crop_dam_exp),
prop_damage = sum(Prop_dam_exp),
crop_damage = sum(Crop_dam_exp)), EVTYPE ]
expense_events[, total_percent := total/sum(total) *100]
expense_events[, prop_damage_percent := prop_damage/sum(prop_damage) *100]
expense_events[, crop_damage_percent := crop_damage/sum(crop_damage) *100]
setorder(x = expense_events, -total)
View(expense_events)
Since we have 985 individual events, we will plot only the top 10 events for both morbidty and expense. To do this, we first subset each object, both harm_event and expense event, to only include the first 10 rows of data as they are already in descending order and this will capture the top ten events for each category (harm and expense). Using the ggplot package we order the event type in descending order by the total (either total morbidity for the top 10 most harmful natural events or total expense for the top 10 most expensive event). The y-axis for both plots is the total morbidty or expense, respectively. We include the total count and percentages above the bars for our bar plots (created with geom_col), set the dimensions for our axis text using text(axis.text.x()), and define our labels. The results clearly show that tornadoes are the most harmful event, while hurricane/typhoons are the most expensive events to have occurred during the course of the study.
library(ggplot2)
#Top 10 most harmful events
ggplot(data = harm_events[1:10], mapping = aes(x = reorder(EVTYPE, -total), y = total)) +
geom_col(colour = "black", fill = "steelblue") +
geom_text(aes(label = paste0(total, " (", round(total_percentage, 1), "%)")), #add count and rounded percentages
vjust = -0.3, #vertical adjustment above bar of count/percent labels
size = 1.5) +
theme(axis.text.x = element_text(angle = 45, #angle of x-axis labels
hjust = 1,
vjust = 1,
size = 7,
color = "black"
)) +
labs(title = "Top 10 Most Harmful Natural Events in US",
x = "Event Type",
y = "Total Morbidity")
#Top 10 most expensive events
ggplot(data = expense_events[1:10], mapping = aes(x = reorder(EVTYPE, -total), y = total)) +
geom_col(colour = "black", fill = "steelblue") +
geom_text(aes(label = paste0(total, " (", round(total_percent, 1), "%)")), #add count and rounded percentages
vjust = -0.3, #vertical adjustment above bar of count/percent labels
size = 1.5) +
theme(axis.text.x = element_text(angle = 45, #angle of x-axis labels
hjust = 1,
vjust = 1,
size = 7,
color = "black"
)) +
labs(title = "Top 10 Most Expensive Natural Events in US",
x = "Event Type",
y = "Total Expense")