Using the NOAA storm database, which types of weather events are the most harmful to populations, and which events have the greatest economic consequences?
In this analysis, we address these questions by exploring average deaths/injuries and average property/crop damage by event type. Liberal reference is made to [NOAA’s documentation](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) to understand this dataset and the acronyms it uses.
Set global options for the markdown file:
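The chunk that sets these options is not echoed above; a minimal sketch of what it might contain is below (the specific settings are assumptions, though echo = TRUE is consistent with the code being displayed throughout):
knitr::opts_chunk$set(echo = TRUE,    # show the code alongside its output
                      cache = TRUE)   # cache slow chunks such as the data read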
View the software environment in which this analysis will be conducted:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2.1 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8.1 knitr_1.12.3
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.8
Load the two libraries we will use for the remainder of the analysis:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Download and read data:
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","stormdata.csv.bz2")
stormdata <- tbl_df(read.csv(bzfile("stormdata.csv.bz2")))
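The compressed file is large, so re-downloading it on every knit is wasteful. A small guard like the following (a sketch, not part of the original analysis) skips the download when the file is already on disk:
# hypothetical refinement: only download if the compressed file is not already present
if(!file.exists("stormdata.csv.bz2")){
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  "stormdata.csv.bz2")
}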
To see what needs to be cleaned up and get a grasp on the volume of data, take a look at a summary of the raw data:
stormdata
## Source: local data frame [902,297 x 37]
##
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## (dbl) (fctr) (fctr) (fctr) (dbl) (fctr) (fctr)
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## .. ... ... ... ... ... ... ...
## Variables not shown: EVTYPE (fctr), BGN_RANGE (dbl), BGN_AZI (fctr),
## BGN_LOCATI (fctr), END_DATE (fctr), END_TIME (fctr), COUNTY_END (dbl),
## COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (fctr), END_LOCATI (fctr),
## LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
## INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (fctr), CROPDMG (dbl),
## CROPDMGEXP (fctr), WFO (fctr), STATEOFFIC (fctr), ZONENAMES (fctr),
## LATITUDE (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl),
## REMARKS (fctr), REFNUM (dbl)
As we can see above, the column names are not descriptive. Cleaning them up will give us a better understanding of what data we’re working with.
Here I normalize capitalization and make the names descriptive:
stormdata2<-stormdata
# lowercase the column names, then expand abbreviations into descriptive words
cols<-names(stormdata2)
cols<-tolower(cols)
cols<-sub("bgn","begin",cols)
cols<-sub("prop","property_",cols)
cols<-sub("ev","event_",cols)
cols<-sub("dmg","damage_",cols)
cols<-sub("ref","reference_",cols)
cols<-sub("num","number",cols)
cols<-sub("locati","location",cols)
cols<-sub("crop","crop_",cols)
cols<-sub("azi","azimuth",cols)
cols<-sub("zonenames","zone_names",cols)
cols<-sub("latitude_e","latitude_end",cols)
cols<-sub("longitude_","longitude_end",cols)
cols<-sub("offic","office",cols)
cols<-sub("wfo","weather_forecast_office",cols)
cols<-sub("stateoffice","state_office",cols)
cols<-sub("county","county_",cols)
cols<-sub("county_endn","county_end_nas",cols)
cols<-sub("exp","exponent",cols)
# tidy up doubled or trailing underscores left by the substitutions
cols<-sub("__","_",cols)
cols<-sub("_$","",cols)
# name the cryptic F and MAG columns directly, and rename STATE__ so it
# does not collide with the existing state column
cols[21:22]<-c("fujita_scale","magnitude")
cols[1]<-"state_1"
names(stormdata2)<-cols
If we take a look at event types, we can see that there is also some cleanup to do in that column. Per table 2.1.1 in NOAA’s documentation, there should be only 48 unique event types, but this dataset contains 985:
str(stormdata2$event_type)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
There must be some duplicates or typos. Viewing a snippet of the unique event types confirms this:
unique(stormdata2$event_type)[1:50]
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## [21] FLASH FLOODING HIGH WINDS
## [23] FUNNEL CLOUD TORNADO F0
## [25] THUNDERSTORM WINDS LIGHTNING THUNDERSTORM WINDS/HAIL
## [27] HEAT WIND
## [29] LIGHTING HEAVY RAINS
## [31] LIGHTNING AND HEAVY RAIN FUNNEL
## [33] WALL CLOUD FLOODING
## [35] THUNDERSTORM WINDS HAIL FLOOD
## [37] COLD HEAVY RAIN/LIGHTNING
## [39] FLASH FLOODING/THUNDERSTORM WI WALL CLOUD/FUNNEL CLOUD
## [41] THUNDERSTORM WATERSPOUT
## [43] EXTREME COLD HAIL 1.75)
## [45] LIGHTNING/HEAVY RAIN HIGH WIND
## [47] BLIZZARD BLIZZARD WEATHER
## [49] WIND CHILL BREAKUP FLOODING
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
There appear to be a few overarching reasons for an event to be miscategorized. To address each of them, I use the following approach:

1. Misspellings and abbreviations: I correct a couple of obvious ones (fixing every entry’s spelling individually would be unnecessarily time consuming).
2. Extra or incorrect punctuation, spacing, and numbers: remove all numbers, change all punctuation to a space, then remove any leading, trailing, or duplicate spaces.
3. More detail than necessary (e.g. “HURRICANE ERIN” vs. “HURRICANE”): search for several keyword strings within the vector; if a value contains a keyword, replace the whole value with that keyword.
4. Plural category names (e.g. “HEAVY RAINS” vs. “HEAVY RAIN”): remove a trailing letter “s” from all values.
events<-stormdata2$event_type
events<-tolower(events)
events<-sub("s$","",events)                  # drop a trailing "s" (plural category names)
events<-gsub("[[:digit:]]","",events)        # remove all numbers
events<-gsub("[[:punct:]]"," ",events)       # change punctuation to spaces
events<-gsub("tstm","thunderstorm",events)   # expand/correct common abbreviations and misspellings
events<-gsub("torndao","tornado",events)
events<-gsub("vog","fog",events)
events<-gsub("avalance","avalanche",events)
events<-gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", events, perl=TRUE)   # trim and collapse duplicate spaces
# collapse any entry containing a keyword into a single category
# (only the first matching keyword in the chain is applied)
for(i in 1:length(events)){
    if(grepl("blizzard",events[i])==TRUE){events[i]<-"blizzard"}
    else if(grepl("tornado",events[i])==TRUE){events[i]<-"tornado"}
    else if(grepl("hurricane",events[i])==TRUE){events[i]<-"hurricane"}
    else if(grepl("hail",events[i])==TRUE){events[i]<-"hail"}
    else if(grepl("volcan",events[i])==TRUE){events[i]<-"volcanic ash"}   # "volcan" catches the volcanic event entries
    else if(grepl("fog",events[i])==TRUE){events[i]<-"fog"}
    else if(grepl("sleet",events[i])==TRUE){events[i]<-"sleet"}
    else if(grepl("surf",events[i])==TRUE){events[i]<-"high surf"}
    else if(grepl("thunderstorm",events[i])==TRUE){events[i]<-"thunderstorm wind"}
    else if(grepl("tropical storm",events[i])==TRUE){events[i]<-"tropical storm"}
}
Has the number of unique values been cut down enough? There are still over 500 unique event types, but we want only 48:
str(unique(events))
## chr [1:536] "tornado" "thunderstorm wind" "hail" ...
If we look at a table of the unique events, we can see that fewer than 1% of all observations fall outside the 48 most frequent event type names (assuming that any event belonging to an official NOAA category occurs more often in the dataset than an event with a miscellaneous name). The ratio below compares the number of observations outside the top 48 to the number inside it:
eventfreqs<-arrange(tbl_df(as.data.frame(table(events))),desc(Freq))
names(eventfreqs)<-c("event_type_clean","frequency")
sum(eventfreqs$frequency[49:length(eventfreqs$frequency)])/sum(eventfreqs$frequency[1:48])
## [1] 0.00513654
This means that we can safely drop the miscategorized remainder and still glean any trends from the data that remains.
stormdata2$event_type_clean<-events
We must remove the non-official event categories before we calculate averages by event type; otherwise a single misspelled but extremely severe event would form its own tiny category and skew the results.
I use a subset of the event frequency data frame to isolate the “top 48” most frequent storm events. We then use that “top 48” data frame to create a subset of the main data frame with inner_join, which drops any row whose event type isn’t in the “top 48”:
top_48<-eventfreqs[1:48,1:2]
stormdata3<-inner_join(stormdata2,top_48,by="event_type_clean")
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
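As a design note, a filtering join such as dplyr’s semi_join would drop the same rows without carrying the frequency column into the result (a sketch, not the code used below; the stormdata3_alt name is hypothetical):
# keep only rows whose cleaned event type appears in top_48; no columns are added
stormdata3_alt <- semi_join(stormdata2, top_48, by = "event_type_clean")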
We can see that this worked - cutting out events not in the “top 48” decreased the number of rows in the stormdata data frame:
dim(stormdata2) #the old data frame
## [1] 902297 38
dim(stormdata3) #the new data frame, subset by the most frequent categories only
## [1] 897686 39
The economic cost of each storm event is spread across four columns that must be combined to come up with a total:

* property_damage and crop_damage
* property_damage_exponent and crop_damage_exponent
To do this, we first convert the exponents for property damage and crop damage to multipliers (normalizing capitalization first). If we take a look at the frequency of each exponent value, the letters seem obvious, but the digits are less so. The NOAA documentation does not include an explanation, and I am unconvinced they are actual exponents (why would so many values be raised to the power of 0?). In any case, those digit codes occur infrequently enough that I ignore them for the purposes of this analysis:
propmult<-tolower(stormdata3$property_damage_exponent)
cropmult<-tolower(stormdata3$crop_damage_exponent)
table(propmult)
## propmult
##             -      ?      +      0      1      2      3      4      5
## 462924      1      8      3    215     25     13      4      3     25
##      6      7      8      b      h      k      m
##      4      5      1     37      7 423201  11210
table(cropmult)
## cropmult
##             ?      0      2      b      k      m
## 614730      6     18      1      7 280983   1941
Assuming h = hundred, k = thousand, m = million, and b = billion, we now convert those letters to their corresponding multipliers; anything else, including the blank values that make up the bulk of both columns, gets a multiplier of 1:
# convert each exponent letter to its multiplier; anything else defaults to 1
for(i in 1:length(propmult)){
    if(grepl("h",propmult[i])==TRUE){propmult[i]<-100}
    else if(grepl("k",propmult[i])==TRUE){propmult[i]<-1000}
    else if(grepl("m",propmult[i])==TRUE){propmult[i]<-1000000}
    else if(grepl("b",propmult[i])==TRUE){propmult[i]<-1000000000}
    else{propmult[i]<-1}
}
for(i in 1:length(cropmult)){
    if(grepl("h",cropmult[i])==TRUE){cropmult[i]<-100}
    else if(grepl("k",cropmult[i])==TRUE){cropmult[i]<-1000}
    else if(grepl("m",cropmult[i])==TRUE){cropmult[i]<-1000000}
    else if(grepl("b",cropmult[i])==TRUE){cropmult[i]<-1000000000}
    else{cropmult[i]<-1}
}
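These loops iterate element by element over roughly 900,000 values each. A vectorized sketch using a named lookup vector (an alternative under the same h/k/m/b assumption, not the code used for the results below; to_multiplier is a hypothetical helper) would be much faster:
# hypothetical helper: map exponent letters to multipliers, defaulting to 1
mult_lookup <- c(h = 100, k = 1e3, m = 1e6, b = 1e9)
to_multiplier <- function(exponent){
    out <- mult_lookup[tolower(as.character(exponent))]  # NA for blanks, digits, "?", "+", "-"
    out[is.na(out)] <- 1                                 # everything unrecognized defaults to 1
    unname(out)
}
propmult <- to_multiplier(stormdata3$property_damage_exponent)
cropmult <- to_multiplier(stormdata3$crop_damage_exponent)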
Finally, multiply the property damage and crop damage columns by their multiplier to get total damage by category, then add those two vectors together to get a vector of total damage:
prop_dmg_total<-stormdata3$property_damage*as.numeric(propmult)
crop_dmg_total<-stormdata3$crop_damage*as.numeric(cropmult)
stormdata3$cost<-prop_dmg_total+crop_dmg_total
To answer both questions, I believe it makes the most sense to average fatalities/injuries and damage costs by event category rather than sum them. Some event types occur substantially more often than others, as we can see by looking at the frequencies of the most and least common of the top 48 below, so totals would mostly reflect how often an event occurs rather than how harmful a typical occurrence is.
head(top_48)
## Source: local data frame [6 x 2]
##
## event_type_clean frequency
## (fctr) (int)
## 1 thunderstorm wind 335690
## 2 hail 290400
## 3 tornado 60701
## 4 flash flood 54311
## 5 flood 25330
## 6 high wind 21751
tail(top_48)
## Source: local data frame [6 x 2]
##
## event_type_clean frequency
## (fctr) (int)
## 1 freezing rain 260
## 2 urban flood 254
## 3 extreme windchill 204
## 4 dry microburst 192
## 5 coastal flooding 183
## 6 light snow 176
To prepare the data for the first question (which weather events are most harmful with respect to population health), group the data frame by event type, then average the number of fatalities and injuries for each type.
avg_deaths<-stormdata3%>%
group_by(event_type_clean)%>%
summarize(mean(fatalities,na.rm=TRUE))
names(avg_deaths)<-c("Event","Fatalities")
avg_deaths<-arrange(tbl_df(avg_deaths),desc(Fatalities))
avg_inj<-stormdata3%>%
group_by(event_type_clean)%>%
summarize(mean(injuries,na.rm=TRUE))
names(avg_inj)<-c("Event","Injuries")
avg_inj<-arrange(tbl_df(avg_inj),desc(Injuries))
To view the most harmful event types, we will then subset the average fatalities and injuries data frames to include only the seven most harmful:
avg_deaths<-avg_deaths[1:7,1:2]
avg_inj<-avg_inj[1:7,1:2]
To prepare the data for the second question (which weather events are most costly), we are going to calculate the average cost (combined crop and property damage) of any given event by event category, then subset the top 5 most costly events. We will also divide cost by 1 million to make the graph more readable:
avg_cost<-stormdata3%>%
group_by(event_type_clean)%>%
summarize(mean(cost,na.rm=TRUE))
names(avg_cost)<-c("Event","Cost")
avg_cost<-arrange(tbl_df(avg_cost),desc(Cost))[1:5,1:2]
avg_cost$Cost_mil<-avg_cost$Cost/1000000
On average, heat events (both “excessive heat” and plain “heat”) kill more people than any other weather event type:
ggplot(avg_deaths,aes(x=Event,y=Fatalities))+
geom_bar(stat="identity",aes(fill=Fatalities))+
scale_fill_gradient(low="cyan",high="red")+
labs(y="Average fatalities",title="Average fatalities per weather event")
On average, more people are injured during a hurricane than during any other weather event, with excessive heat a close second:
ggplot(avg_inj,aes(x=Event,y=Injuries))+
geom_bar(stat="identity",aes(fill=Injuries))+
scale_fill_gradient(low="cyan",high="red")+
labs(y="Average injuries",title="Average injuries per weather event")
On average, hurricanes cost far more than any other weather event:
ggplot(avg_cost,aes(x=Event,y=Cost_mil))+
geom_bar(stat="identity",aes(fill=Cost_mil))+
scale_fill_gradient(low="cyan",high="green")+
labs(y="Average cost (in millions of dollars)",title="Average total cost per weather event")