Using the NOAA storm database, which types of weather events are the most harmful to populations, and which events have the greatest economic consequences?
In this analysis, we address these questions by exploring average deaths/injuries and average property/crop damage by event type. Liberal reference is made to [NOAA’s documentation](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) to understand this dataset and the acronyms it uses.
Set global options for the markdown file:
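The chunk that sets these options is not echoed above; a minimal sketch of what it might contain is below (the specific settings are assumptions, though echo = TRUE is consistent with the code being displayed throughout):
knitr::opts_chunk$set(echo = TRUE,    # show the code alongside its output
                      cache = TRUE)   # cache slow chunks such as the data read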
View the software environment in which this analysis will be conducted:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2.1 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8.1 knitr_1.12.3
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.8
Load the two libraries we will use for the remainder of the analysis:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Download and read data:
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","stormdata.csv.bz2")
stormdata <- tbl_df(read.csv(bzfile("stormdata.csv.bz2")))
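The compressed file is large, so re-downloading it on every knit is wasteful. A small guard like the following (a sketch, not part of the original analysis) skips the download when the file is already on disk:
# hypothetical refinement: only download if the compressed file is not already present
if(!file.exists("stormdata.csv.bz2")){
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  "stormdata.csv.bz2")
}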
To see what needs to be cleaned up and get a grasp on the volume of data, take a look at a summary of the raw data:
stormdata
## Source: local data frame [902,297 x 37]
##
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## (dbl) (fctr) (fctr) (fctr) (dbl) (fctr) (fctr)
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## .. ... ... ... ... ... ... ...
## Variables not shown: EVTYPE (fctr), BGN_RANGE (dbl), BGN_AZI (fctr),
## BGN_LOCATI (fctr), END_DATE (fctr), END_TIME (fctr), COUNTY_END (dbl),
## COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (fctr), END_LOCATI (fctr),
## LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
## INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (fctr), CROPDMG (dbl),
## CROPDMGEXP (fctr), WFO (fctr), STATEOFFIC (fctr), ZONENAMES (fctr),
## LATITUDE (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl),
## REMARKS (fctr), REFNUM (dbl)
As we can see above, the column names are not descriptive. Cleaning them up will give us a better understanding of what data we’re working with.
Here I normalize capitalization and make the names descriptive:
stormdata2<-stormdata
# lowercase the column names, then expand abbreviations into descriptive words
cols<-names(stormdata2)
cols<-tolower(cols)
cols<-sub("bgn","begin",cols)
cols<-sub("prop","property_",cols)
cols<-sub("ev","event_",cols)
cols<-sub("dmg","damage_",cols)
cols<-sub("ref","reference_",cols)
cols<-sub("num","number",cols)
cols<-sub("locati","location",cols)
cols<-sub("crop","crop_",cols)
cols<-sub("azi","azimuth",cols)
cols<-sub("zonenames","zone_names",cols)
cols<-sub("latitude_e","latitude_end",cols)
cols<-sub("longitude_","longitude_end",cols)
cols<-sub("offic","office",cols)
cols<-sub("wfo","weather_forecast_office",cols)
cols<-sub("stateoffice","state_office",cols)
cols<-sub("county","county_",cols)
cols<-sub("county_endn","county_end_nas",cols)
cols<-sub("exp","exponent",cols)
# tidy up doubled or trailing underscores left by the substitutions
cols<-sub("__","_",cols)
cols<-sub("_$","",cols)
# name the cryptic F and MAG columns directly, and rename STATE__ so it
# does not collide with the existing state column
cols[21:22]<-c("fujita_scale","magnitude")
cols[1]<-"state_1"
names(stormdata2)<-cols
If we take a look at event types, we can see that there is also some cleanup to do in that column. Per table 2.1.1 in NOAA’s documentation, there should be only 48 unique event types, but this dataset contains 985:
str(stormdata2$event_type)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
There must be some duplicates or typos. Viewing a snippet of the unique event types confirms this:
unique(stormdata2$event_type)[1:50]
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## [21] FLASH FLOODING HIGH WINDS
## [23] FUNNEL CLOUD TORNADO F0
## [25] THUNDERSTORM WINDS LIGHTNING THUNDERSTORM WINDS/HAIL
## [27] HEAT WIND
## [29] LIGHTING HEAVY RAINS
## [31] LIGHTNING AND HEAVY RAIN FUNNEL
## [33] WALL CLOUD FLOODING
## [35] THUNDERSTORM WINDS HAIL FLOOD
## [37] COLD HEAVY RAIN/LIGHTNING
## [39] FLASH FLOODING/THUNDERSTORM WI WALL CLOUD/FUNNEL CLOUD
## [41] THUNDERSTORM WATERSPOUT
## [43] EXTREME COLD HAIL 1.75)
## [45] LIGHTNING/HEAVY RAIN HIGH WIND
## [47] BLIZZARD BLIZZARD WEATHER
## [49] WIND CHILL BREAKUP FLOODING
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
There appear to be a few overarching reasons for an event to be miscategorized. To address each of them, I use the following approach:

1. Misspellings and abbreviations: I correct a couple of obvious ones (fixing every entry’s spelling individually would be unnecessarily time consuming).
2. Extra or incorrect punctuation, spacing, and numbers: remove all numbers, change all punctuation to a space, then remove any leading, trailing, or duplicate spaces.
3. More detail than necessary (e.g. “HURRICANE ERIN” vs. “HURRICANE”): search for several keyword strings within the vector; if a value contains a keyword, replace the whole value with that keyword.
4. Plural category names (e.g. “HEAVY RAINS” vs. “HEAVY RAIN”): remove a trailing letter “s” from all values.
events<-stormdata2$event_type
events<-tolower(events)
events<-sub("s$","",events)                  # drop a trailing "s" (plural category names)
events<-gsub("[[:digit:]]","",events)        # remove all numbers
events<-gsub("[[:punct:]]"," ",events)       # change punctuation to spaces
events<-gsub("tstm","thunderstorm",events)   # expand/correct common abbreviations and misspellings
events<-gsub("torndao","tornado",events)
events<-gsub("vog","fog",events)
events<-gsub("avalance","avalanche",events)
events<-gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", events, perl=TRUE)   # trim and collapse duplicate spaces
# collapse any entry containing a keyword into a single category
# (only the first matching keyword in the chain is applied)
for(i in 1:length(events)){
    if(grepl("blizzard",events[i])==TRUE){events[i]<-"blizzard"}
    else if(grepl("tornado",events[i])==TRUE){events[i]<-"tornado"}
    else if(grepl("hurricane",events[i])==TRUE){events[i]<-"hurricane"}
    else if(grepl("hail",events[i])==TRUE){events[i]<-"hail"}
    else if(grepl("volcan",events[i])==TRUE){events[i]<-"volcanic ash"}   # "volcan" catches the volcanic event entries
    else if(grepl("fog",events[i])==TRUE){events[i]<-"fog"}
    else if(grepl("sleet",events[i])==TRUE){events[i]<-"sleet"}
    else if(grepl("surf",events[i])==TRUE){events[i]<-"high surf"}
    else if(grepl("thunderstorm",events[i])==TRUE){events[i]<-"thunderstorm wind"}
    else if(grepl("tropical storm",events[i])==TRUE){events[i]<-"tropical storm"}
}
Has the number of unique values been cut down enough? There are still over 500 unique event types, but we want only 48:
str(unique(events))
## chr [1:536] "tornado" "thunderstorm wind" "hail" ...
If we look at a table of the unique events, we can see that fewer than 1% of all observations fall outside the 48 most frequent event type names (assuming that any event belonging to an official NOAA category occurs more often in the dataset than an event with a miscellaneous name). The ratio below compares the number of observations outside the top 48 to the number inside it:
eventfreqs<-arrange(tbl_df(as.data.frame(table(events))),desc(Freq))
names(eventfreqs)<-c("event_type_clean","frequency")
sum(eventfreqs$frequency[49:length(eventfreqs$frequency)])/sum(eventfreqs$frequency[1:48])
## [1] 0.00513654
This means that we can safely drop the miscategorized remainder and still glean any trends from the data that remains.
stormdata2$event_type_clean<-events
We must remove the non-official event categories before we calculate averages by event type; otherwise a single misspelled but extremely severe event would form its own tiny category and skew the results.
I use a subset of the event frequency data frame to isolate the “top 48” most frequent storm events. We then use that “top 48” data frame to create a subset of the main data frame with inner_join, which drops any row whose event type isn’t in the “top 48”:
top_48<-eventfreqs[1:48,1:2]
stormdata3<-inner_join(stormdata2,top_48,by="event_type_clean")
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
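As a design note, a filtering join such as dplyr’s semi_join would drop the same rows without carrying the frequency column into the result (a sketch, not the code used below; the stormdata3_alt name is hypothetical):
# keep only rows whose cleaned event type appears in top_48; no columns are added
stormdata3_alt <- semi_join(stormdata2, top_48, by = "event_type_clean")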
We can see that this worked - cutting out events not in the “top 48” decreased the number of rows in the stormdata data frame:
dim(stormdata2) #the old data frame
## [1] 902297 38
dim(stormdata3) #the new data frame, subset by the most frequent categories only
## [1] 897686 39
The economic cost of each storm event is spread across four columns that must be combined to come up with a total:

* property_damage and crop_damage
* property_damage_exponent and crop_damage_exponent
To do this, we first convert the exponents for property damage and crop damage to multipliers (normalizing capitalization first). If we take a look at the frequency of each exponent value, the letters seem obvious, but the digits are less so. The NOAA documentation does not include an explanation, and I am unconvinced they are actual exponents (why would so many values be raised to the power of 0?). In any case, those digit codes occur infrequently enough that I ignore them for the purposes of this analysis:
propmult<-tolower(stormdata3$property_damage_exponent)
cropmult<-tolower(stormdata3$crop_damage_exponent)
table(propmult)
## propmult
##             -      ?      +      0      1      2      3      4      5
## 462924      1      8      3    215     25     13      4      3     25
##      6      7      8      b      h      k      m
##      4      5      1     37      7 423201  11210
table(cropmult)
## cropmult
##             ?      0      2      b      k      m
## 614730      6     18      1      7 280983   1941
Assuming h = hundred, k = thousand, m = million, and b = billion, we now convert those letters to their corresponding multipliers; anything else, including the blank values that make up the bulk of both columns, gets a multiplier of 1:
# convert each exponent letter to its multiplier; anything else defaults to 1
for(i in 1:length(propmult)){
    if(grepl("h",propmult[i])==TRUE){propmult[i]<-100}
    else if(grepl("k",propmult[i])==TRUE){propmult[i]<-1000}
    else if(grepl("m",propmult[i])==TRUE){propmult[i]<-1000000}
    else if(grepl("b",propmult[i])==TRUE){propmult[i]<-1000000000}
    else{propmult[i]<-1}
}
for(i in 1:length(cropmult)){
    if(grepl("h",cropmult[i])==TRUE){cropmult[i]<-100}
    else if(grepl("k",cropmult[i])==TRUE){cropmult[i]<-1000}
    else if(grepl("m",cropmult[i])==TRUE){cropmult[i]<-1000000}
    else if(grepl("b",cropmult[i])==TRUE){cropmult[i]<-1000000000}
    else{cropmult[i]<-1}
}
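These loops iterate element by element over roughly 900,000 values each. A vectorized sketch using a named lookup vector (an alternative under the same h/k/m/b assumption, not the code used for the results below; to_multiplier is a hypothetical helper) would be much faster:
# hypothetical helper: map exponent letters to multipliers, defaulting to 1
mult_lookup <- c(h = 100, k = 1e3, m = 1e6, b = 1e9)
to_multiplier <- function(exponent){
    out <- mult_lookup[tolower(as.character(exponent))]  # NA for blanks, digits, "?", "+", "-"
    out[is.na(out)] <- 1                                 # everything unrecognized defaults to 1
    unname(out)
}
propmult <- to_multiplier(stormdata3$property_damage_exponent)
cropmult <- to_multiplier(stormdata3$crop_damage_exponent)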
Finally, multiply the property damage and crop damage columns by their multiplier to get total damage by category, then add those two vectors together to get a vector of total damage:
prop_dmg_total<-stormdata3$property_damage*as.numeric(propmult)
crop_dmg_total<-stormdata3$crop_damage*as.numeric(cropmult)
stormdata3$cost<-prop_dmg_total+crop_dmg_total
To answer both questions, I believe it makes the most sense to average fatalities/injuries and damage costs by event category rather than sum them. Some event types occur substantially more often than others, as we can see by looking at the frequencies of the most and least common of the top 48 below, so totals would mostly reflect how often an event occurs rather than how harmful a typical occurrence is.
head(top_48)
## Source: local data frame [6 x 2]
##
## event_type_clean frequency
## (fctr) (int)
## 1 thunderstorm wind 335690
## 2 hail 290400
## 3 tornado 60701
## 4 flash flood 54311
## 5 flood 25330
## 6 high wind 21751
tail(top_48)
## Source: local data frame [6 x 2]
##
## event_type_clean frequency
## (fctr) (int)
## 1 freezing rain 260
## 2 urban flood 254
## 3 extreme windchill 204
## 4 dry microburst 192
## 5 coastal flooding 183
## 6 light snow 176
To prepare the data for the first question (which weather events are most harmful with respect to population health), group the data frame by event type, then average the number of fatalities and injuries for each type.
avg_deaths<-stormdata3%>%
group_by(event_type_clean)%>%
summarize(mean(fatalities,na.rm=TRUE))
names(avg_deaths)<-c("Event","Fatalities")
avg_deaths<-arrange(tbl_df(avg_deaths),desc(Fatalities))
avg_inj<-stormdata3%>%
group_by(event_type_clean)%>%
summarize(mean(injuries,na.rm=TRUE))
names(avg_inj)<-c("Event","Injuries")
avg_inj<-arrange(tbl_df(avg_inj),desc(Injuries))
To view the most harmful event types, we will then subset the average fatalities and injuries data frames to include only the seven most harmful:
avg_deaths<-avg_deaths[1:7,1:2]
avg_inj<-avg_inj[1:7,1:2]
To prepare the data for the second question (which weather events are most costly), we are going to calculate the average cost (combined crop and property damage) of any given event by event category, then subset the top 5 most costly events. We will also divide cost by 1 million to make the graph more readable:
avg_cost<-stormdata3%>%
group_by(event_type_clean)%>%
summarize(mean(cost,na.rm=TRUE))
names(avg_cost)<-c("Event","Cost")
avg_cost<-arrange(tbl_df(avg_cost),desc(Cost))[1:5,1:2]
avg_cost$Cost_mil<-avg_cost$Cost/1000000
On average, heat events (both “excessive heat” and plain “heat”) kill more people than any other weather event type:
ggplot(avg_deaths,aes(x=Event,y=Fatalities))+
geom_bar(stat="identity",aes(fill=Fatalities))+
scale_fill_gradient(low="cyan",high="red")+
labs(y="Average fatalities",title="Average fatalities per weather event")
On average, more people are injured during a hurricane than during any other weather event, with excessive heat a close second:
ggplot(avg_inj,aes(x=Event,y=Injuries))+
geom_bar(stat="identity",aes(fill=Injuries))+
scale_fill_gradient(low="cyan",high="red")+
labs(y="Average injuries",title="Average injuries per weather event")
On average, hurricanes cost far more than any other weather event:
ggplot(avg_cost,aes(x=Event,y=Cost_mil))+
geom_bar(stat="identity",aes(fill=Cost_mil))+
scale_fill_gradient(low="cyan",high="green")+
labs(y="Average cost (in millions of dollars)",title="Average total cost per weather event")