This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property or crop damage. NOAA has maintained this database of storm events since 1950.
This analysis will investigate the costs and human impact of these disaster events. We will endeavor to answer two questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Note: Although the data goes as far back as 1950, it wasn’t until 1996 that NOAA adopted its standardized list of 48 event types. Events prior to 1996 were classed into fewer types, and even after 1996 many storm event reports do not stick to the official list. For the purposes of this analysis, we will look at the data both in total and with the post-1995 data separated out, and determine which, if either, is better suited to our needs.
In order to process the data and results, we first need to load a few libraries. The plyr, dplyr, and reshape2 packages are for data cleaning and processing, while ggplot2 and gridExtra are for presenting the results.
library(plyr) #load necessary libraries
library(dplyr)
library(reshape2)
library(ggplot2)
library(gridExtra)
The next step is to download and extract the data from NOAA. We’ll read the compressed file directly for simplicity, though there are potentially faster ways of handling such heavily compressed data. We’ll also copy the extracted data into a new variable, so that if we do any manipulations we can still refer back to the original extraction without having to re-read the file repeatedly.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2")
StormData_RAW<-read.csv("StormData.csv.bz2") #download and load data
StormData<-StormData_RAW #store RAW data for reference
The first question is: what does the structure of the raw data look like?
str(StormData) #look at data
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are a few things we can see from the output above:
- EVTYPE, which identifies the different types of events at the heart of our guiding questions, has 985 levels. However, in the documentation linked in the appendix, NOAA only identifies 48 distinct event types, so there is definitely some cleaning of this variable that will need to be done.
- Property damage (PROPDMG) and crop damage (CROPDMG) are each split across two variables apiece. These will need to be combined at some point for proper analysis, and based on the first several values of PROPDMGEXP and CROPDMGEXP, neither is a straightforward factor we can just multiply by.
- The begin and end dates (BGN_DATE and END_DATE) are stored as factors rather than proper Date values, so they will need to be converted before we can filter or summarize by date.
- There are 37 variables in total, far more than we will actually use.

For the moment, we can put the many-variables issue aside and focus on the other three cleaning issues. So let’s start by cleaning the event types.
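As a quick sanity check on the scale of the EVTYPE problem, we can count the distinct raw values directly (a minimal sketch; the same figure also appears in the str() output above).

length(unique(StormData$EVTYPE)) #985 distinct raw categories, versus NOAA's 48 official event types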
Since the event types variable appears to be in such dramatic disarray, the first step we’ll take is to create a clean vector of the official event types, as a control. We can get this from the data documentation as mentioned earlier.
#load official EVTYPE list for comparison
EVTYPE_official<-c("astronomicallowtide","avalanche","blizzard","coastalflood","coldwindchill",
"debrisflow","densefog","densesmoke","drought","dustdevil","duststorm",
"excessiveheat","extremecoldwindchill","flashflood","flood","freezingfog",
"frostfreeze","funnelcloud","hail","heat","heavyrain","heavysnow","highsurf",
"highwind","hurricanetyphoon","icestorm","lakeeffectsnow","lakeshoreflood",
"lightning","marinehail","marinehighwind","marinestrongwind","marinethunderstormwind",
"ripcurrent","seiche","sleet","stormsurgetide","strongwind","thunderstormwind",
"tornado","tropicaldepression","tropicalstorm","tsunami","volcanicash",
"waterspout","wildfire","winterstorm","winterweather")
Then we can begin to clean with some simple steps:
1. Make every EVTYPE value lowercase, to narrow down the number of variants.
2. Remove all non-alphabetic characters (anything outside [A-Za-z]), to remove any variance in punctuation and spaces (using periods or underscores instead of spaces, for instance).
3. For each official event type, wherever it appears inside an EVTYPE string, overwrite it and everything after it with just the official value.

Note: When we do the last step below, we actually only remove characters after a given official event type match, not before. This is because several of the event types in the official list are sub-strings of others, and the sub-string aspect is always the last part of the string. So in order to avoid losing any valid event types, for the time being, we’re only overwriting the end of given strings with the official vector values. We’ll test if more needs to be done shortly.
StormData$EVTYPE<-tolower(StormData$EVTYPE) #make EVTYPE lowercase to narrow down options
StormData$EVTYPE<-gsub("[^A-Za-z]","",StormData$EVTYPE) #remove all non alphabetic chars to narrow further
for(i in 1:length(EVTYPE_official)){ #sub in official EVTYPE for all EVTYPES that are similar
StormData$EVTYPE<-gsub(paste0(EVTYPE_official[i],".*"),paste0(EVTYPE_official[i]),StormData$EVTYPE)
}
Now it’s time to check the results of our cleaning thus far, and see if we’ve done enough to make EVTYPE a sufficiently usable variable.
In order to do this check, we can create a check table that stores a count of all of the remaining EVTYPE values that are still not cleaned. We can then also sort the resulting table, so that the unclean variables that appear most often are right at the top. This will allow us to find where we can get the biggest bang for our buck if we decide to do any further cleaning.
check1<-table(StormData$EVTYPE) #create table of all remaining EVTYPEs
check1<-check1[!(names(check1) %in% EVTYPE_official)] #narrow table to only unofficial EVTYPEs
check1<-check1[order(check1,decreasing=TRUE)] #sort table in order of decreasing EVTYPE frequency
Before we see what our check table contains, let’s first see if we’ve already done enough cleaning. We can do this by calculating what percentage of the total rows from the original data frame the sum of the check table represents.
print(sum(check1)/nrow(StormData)*100) #see what percent of the EVTYPE data is still uncleaned
## [1] 26.62183
Based on this calculation, it looks like EVTYPE still contains 26.6% unofficial EVTYPEs. That’s very high. Losing over a quarter of our data this early would be significant, so we should probably try to clean the variable further.
In order to better direct our continued cleaning, let’s see where we can make the most headway by looking at the first few values of our check table. We can also then look at the entire list of value names to see if any of those at the top are also similar to any other stragglers (given that this second part is optional and takes up a lot of space, the output is hidden below).
print(head(check1,25)) #see what the top 25 uncleaned EVTYPES are, to check clustering
##
## tstmwind marinetstmwind urbansmlstreamfld
## 219961 6175 3392
## wildforestfire tstmwindhail extremecold
## 1457 1028 657
## snow landslide fog
## 617 600 538
## urbanflood wind stormsurge
## 354 347 261
## freezingrain heavysurfhighsurf extremewindchill
## 260 228 210
## riverflood drymicroburst lightsnow
## 202 192 176
## hurricane recordwarmth unseasonablywarm
## 174 154 126
## astronomicalhightide moderatesnowfall wintrymix
## 103 101 94
## heavysurf
## 87
print(sort(names(check1))) #get list of all unofficial EVTYPES remaining <--optional
Wow, tstmwind is in EVTYPE over 200000 times! There are just over 900000 EVTYPE values total, so that’s extremely significant for just one value (over 24% of all values!). We can also see that the second most significant remaining unofficial value is marinetstmwind, which, while similar, fits into a completely different official event type. This will be important shortly.
Another notable item is that after about 20 values the count for the given values in our check table drops below 100. Any count below 100 is only 0.01% of all of the values in EVTYPE at most, so we can probably ignore anything below this mark. You could alternatively ignore any value with a count below 1000 (0.1% or less of all values), which would still give us quite precise results, but for the sake of the experience, let’s dig deeper than that.
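To put rough numbers on those thresholds, we can check how much of the data each cutoff would leave uncleaned (a small sketch using the check1 table built above):

sum(check1[check1<100])/nrow(StormData)*100 #share left uncleaned if we ignore values with fewer than 100 records
sum(check1[check1<1000])/nrow(StormData)*100 #share left uncleaned with the looser 1000-record threshold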
In order to fix the top 20-ish remaining values, we’ll go through piecemeal, and make sure that we really get them fixed. As mentioned earlier, we can also throw in anything similar to these values (from the full name list), since we’ll have a line of code for them anyway.
Here’s the process:
- We’d like to start with tstmwind, but then we could run into a problem with marinetstmwind, since tstmwind is a sub-string of it. So let’s start with marinetstmwind first. This appears to map to the marinethunderstormwind value in our official vector, and doesn’t appear to have any particularly similar stragglers.
- Next comes tstmwind. But actually, since there’s no official value for just “thunderstorm”, we can look for any value that still has tstm in it, period. And from the full list of names, we can see that we can also include anything that involves thunder.
- Then there’s urbansmlstreamfld. At first glance this doesn’t appear to map to anything in the official vector. But if we go to the documentation and search for the word “urban”, we can see that it is only mentioned in the context of the flood event type. So we can map anything with urban to the flood event type, and while we’re at it we might as well throw in riverflood.
- Next up is wildforestfire, which based on the list of names can be boiled down to just wild, and fits into the official category of wildfire.
- After that would be tstmwindhail, but we’ve already handled that with tstm.
- extremecold is next. This one is pretty easy to fit into extremecoldwindchill, and it looks like we can also fit in extremewindchill and extendedcold. One thing that’s different here is that extremecold is a sub-string of extremecoldwindchill. So to avoid any unnecessary or unwanted replacement, we should anchor the former to the end of the string.

We can continue on like this down the list, with a few quirks here and there (like distinguishing what goes into strongwind vs highwind, using the definitions in the documentation), and once we do, we end up with the series of substitutions below.
#use data documentation to fit top unofficial EVTYPES (and other similar ones) into official categories
StormData$EVTYPE<-gsub("marinetstm.*","marinethunderstormwind",StormData$EVTYPE)
StormData$EVTYPE<-gsub("tstm.*|thunder.*","thunderstormwind",StormData$EVTYPE)
StormData$EVTYPE<-gsub("urban.*|riverflood.*","flood",StormData$EVTYPE)
StormData$EVTYPE<-gsub("wild.*","wildfire",StormData$EVTYPE) #where we'd cut off with 1000 threshold
StormData$EVTYPE<-gsub("extremecold$|extremewindchill.*|extendedcold.*","extremecoldwindchill",StormData$EVTYPE)
StormData$EVTYPE<-gsub("^snow.*","heavysnow",StormData$EVTYPE)
StormData$EVTYPE<-gsub("landslide.*|landslump.*","debrisflow",StormData$EVTYPE)
StormData$EVTYPE<-gsub("^fog.*","densefog",StormData$EVTYPE)
StormData$EVTYPE<-gsub("^wind.*","strongwind",StormData$EVTYPE)
StormData$EVTYPE<-gsub("stormsurge.*","stormsurgetide",StormData$EVTYPE)
StormData$EVTYPE<-gsub("freezingrain.*","sleet",StormData$EVTYPE)
StormData$EVTYPE<-gsub("heavysurf.*|.*hightide.*","highsurf",StormData$EVTYPE)
StormData$EVTYPE<-gsub("^drymicroburst.*","highwind",StormData$EVTYPE)
StormData$EVTYPE<-gsub("lightsnow.*|moderatesnow.*|wintrymix.*","winterstorm",StormData$EVTYPE)
StormData$EVTYPE<-gsub("^hurricane.*|^typhoon.*","hurricanetyphoon",StormData$EVTYPE)
StormData$EVTYPE<-gsub("recordw.*|recordh.*","excessiveheat",StormData$EVTYPE)
StormData$EVTYPE<-gsub("unseasonablyw.*|unseasonablyh.*|unusuallyw.*","heat",StormData$EVTYPE)
Now that we’ve done this more in-depth cleaning, it’s time to check how our efforts fared. We can use the same check method as before.
check2<-table(StormData$EVTYPE) #create table of all remaining EVTYPEs
check2<-check2[!(names(check2) %in% EVTYPE_official)] #narrow table to only unofficial EVTYPEs
check2<-check2[order(check2,decreasing=TRUE)] #sort table in order of decreasing EVTYPE frequency
print(head(check2,10)) #see if the top 10 uncleaned EVTYPES have under 100 records per EVTYPE
##
## cold freeze recordcold gustywinds
## 82 76 68 65
## ice frost unseasonablydry smallhail
## 61 57 56 53
## other funnel
## 52 46
print(sum(check2)/nrow(StormData)*100) #see what percent of the EVTYPE data is still uncleaned
## [1] 0.2146743
From the results above we can see that the top remaining unofficial event type now has a mere 82 instances in EVTYPE, and all of the remaining unofficial event types amount to only 0.2% of the data. That’s a job well done.
Next on our list is cleaning the property and crop damage cost variables. In this case, however, we don’t really need to worry about the actual PROPDMG and CROPDMG variables themselves, as they’re in decent shape. The issue we saw earlier pertained to PROPDMGEXP and CROPDMGEXP, which, according to the documentation, are the exponential components of the cost values given.
Let’s start off by taking a closer look at the distribution of values in these two factors.
print(table(StormData$PROPDMGEXP)) #check the values and distribution of PROPDMGEXP and CROPDMGEXP
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
print(table(StormData$CROPDMGEXP))
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
As we can see in the tables above, we have a variety of values, from random-seeming punctuation, to numbers, to letters, to a large number of entries that are completely empty. Referring again to the documentation, it appears that the various values mean the following:

- h or H means 10^2
- k or K means 10^3
- m or M means 10^6
- b or B means 10^9

It looks like what we need to do is standardize this notation as factors of 10 that we can eventually multiply our base cost values by. This will allow us to make direct cost comparisons.
By standardize, we really mean replace, an operation that would usually be handled by a base or package function coupled with mutate. But no existing base or package function can navigate the specific quirks we just listed above, so first we need to create a function of our own that understands the nuances of the replacement we’d like to perform.
EXP<-function(x){ #create a function to translate PROPDMGEXP and CROPDMGEXP into usable integers
if(x %in% 0:9) return(as.numeric(paste0("1e",x)))
else if(x=="h"|x=="H") return(1e2)
else if(x=="k"|x=="K") return(1e3)
else if(x=="m"|x=="M") return(1e6)
else if(x=="b"|x=="B") return(1e9)
else return(1)
}
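A quick spot-check of the helper on a few representative inputs (a hedged aside, not part of the original pipeline) confirms it behaves as intended:

EXP("K") #1000
EXP("m") #1e+06
EXP("5") #1e+05
EXP("?") #1 (anything unrecognized falls through to a multiplier of 1)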
Then, we can plug PROPDMGEXP and CROPDMGEXP into that function and get them cleaned up.
StormData<-mutate(StormData,PROPDMGEXP=as.numeric(sapply(PROPDMGEXP,FUN=EXP)))%>%
mutate(CROPDMGEXP=as.numeric(sapply(CROPDMGEXP,FUN=EXP))) #run all EXP data through the function
Now, let’s check those distributions again, and see how things look.
print(table(StormData$PROPDMGEXP)) #re-check the values and distribution of PROPDMGEXP and CROPDMGEXP
##
## 1 10 100 1000 10000 1e+05 1e+06 1e+07 1e+08 1e+09
## 466164 25 20 424669 4 28 11341 5 1 40
print(table(StormData$CROPDMGEXP))
##
## 1 100 1000 1e+06 1e+09
## 618439 1 281853 1995 9
Looks much better!
Last on our original list of items to fix were the dates. This one is easy: we simply convert each of the date columns to Date class variables, parsing the "%m/%d/%Y" portion of each value.
StormData$BGN_DATE<-as.Date(StormData$BGN_DATE,format="%m/%d/%Y") #convert dates to date format
StormData$END_DATE<-as.Date(StormData$END_DATE,format="%m/%d/%Y")
While we’re looking at the dates, though, let’s turn for a moment to a side item mentioned in our synopsis: NOAA only finalized their official list of event types after 1995. This could lead us to conclude that we should filter the data down to only include values recorded from 1996 onward, to give us a more even playing field for comparison. But that raises the question: if we do filter the data, how much data would we be losing?
#see what percent of dates are before 1996
print(nrow(StormData[StormData$BGN_DATE<"1996-01-01",])/nrow(StormData)*100)
## [1] 27.57041
print(summary(StormData$BGN_DATE)) #get a summary of the date range and spread
## Min. 1st Qu. Median Mean 3rd Qu.
## "1950-01-03" "1995-04-20" "2002-03-18" "1998-12-27" "2007-07-28"
## Max.
## "2011-11-30"
What we can see above is that the pre-1996 data makes up over 27% of our values. That’s a lot to throw away. The summary of the dates shows this clearly as well, with the first quartile falling in April 1995.
We weren’t planning to do any filtering right now anyway, but this is going to be important to keep in mind once we start consolidating the data into summaries (which happens to be our next task).
Now that we’re done cleaning the data, it’s time to start shaping the storm data into a form that can answer our original questions. Just to refresh, those questions were:

1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Thinking about these questions, it’s clear that there are two roads we can take:

- We can look at the average impact per recorded event, which tells us how harmful or costly a single occurrence of each event type tends to be.
- We can look at the total impact across all recorded events, which tells us which event types have done the most cumulative harm.
Rather than choose just one approach, for now, let’s copy the data into two separate datasets, one that we’ll use for average summaries, and one that we’ll use for total summaries. However, since not all event types were recorded before 1996, we should only use data from 1996 onward in the totals summary, to maintain comparative equity (this doesn’t matter for the averages summary, as the extra data for the few pre-1996 event types will just help to create more robust means).
Also, while we’re splitting our dataset for summarization, we can filter our data down to just the (now cleaned up) official event types, combine the property and crop damage costs with their (now cleaned up) exponential values, and get rid of any variables we won’t be using going forward.
#narrow data to only official EVTYPES & condense damage costs
StormData_Complete<-filter(StormData,EVTYPE %in% EVTYPE_official)%>%
mutate(EVTYPE=as.factor(EVTYPE),PROPDMG=PROPDMG*PROPDMGEXP,CROPDMG=CROPDMG*CROPDMGEXP)%>%
select(STATE,EVTYPE,BGN_DATE,FATALITIES,INJURIES,PROPDMG,CROPDMG)%>% #narrow variables of interest
group_by(EVTYPE) #set factors for later summarization
StormData_Modern<-filter(StormData_Complete,BGN_DATE>="1996-01-01")%>% #create separate data set
group_by(EVTYPE) #for just post 1995 data
Now we can create our average and total summaries. For flexibility going forward, we’ll create separate summaries for each of the FATALITIES,INJURIES,PROPDMG and CROPDMG variables. We can always recombine them later if we feel we want/need to.
StormData_Fatalities_Avg<-summarize(StormData_Complete,FATALITIES=mean(FATALITIES))%>%
arrange(desc(FATALITIES)) #get average fatalities
StormData_Injuries_Avg<-summarize(StormData_Complete,INJURIES=mean(INJURIES))%>%
arrange(desc(INJURIES)) #get average injuries
StormData_PropDMG_Avg<-summarize(StormData_Complete,PROPDMG=mean(PROPDMG))%>%
arrange(desc(PROPDMG)) #get average property damage
StormData_CropDMG_Avg<-summarize(StormData_Complete,CROPDMG=mean(CROPDMG))%>%
arrange(desc(CROPDMG)) #get average crop damage
# use modern data set for totals, so it's a fair comparison
StormData_Fatalities_Tot<-summarize(StormData_Modern,FATALITIES=sum(FATALITIES))%>%
arrange(desc(FATALITIES)) #get total fatalities
StormData_Injuries_Tot<-summarize(StormData_Modern,INJURIES=sum(INJURIES))%>%
arrange(desc(INJURIES)) #get total injuries
StormData_PropDMG_Tot<-summarize(StormData_Modern,PROPDMG=sum(PROPDMG))%>%
arrange(desc(PROPDMG)) #get total property damage
StormData_CropDMG_Tot<-summarize(StormData_Modern,CROPDMG=sum(CROPDMG))%>%
arrange(desc(CROPDMG)) #get total crop damage
Now we can finally see what our summaries look like (average of each category comes first, then total).
print(head(StormData_Fatalities_Avg)) #take a look at the summary results
## # A tibble: 6 x 2
## EVTYPE FATALITIES
## <fctr> <dbl>
## 1 tsunami 1.6500000
## 2 heat 1.1297561
## 3 excessiveheat 0.9860248
## 4 ripcurrent 0.7425997
## 5 avalanche 0.5803109
## 6 hurricanetyphoon 0.4515050
print(head(StormData_Fatalities_Tot))
## # A tibble: 6 x 2
## EVTYPE FATALITIES
## <fctr> <dbl>
## 1 excessiveheat 1799
## 2 tornado 1511
## 3 flashflood 887
## 4 lightning 651
## 5 ripcurrent 542
## 6 flood 444
print(head(StormData_Injuries_Avg)) #similar distributions of effects for fatalities and injuries
## # A tibble: 6 x 2
## EVTYPE INJURIES
## <fctr> <dbl>
## 1 tsunami 6.450000
## 2 hurricanetyphoon 4.458194
## 3 excessiveheat 3.403209
## 4 heat 2.449756
## 5 tornado 1.505520
## 6 duststorm 1.025641
print(head(StormData_Injuries_Tot))
## # A tibble: 6 x 2
## EVTYPE INJURIES
## <fctr> <dbl>
## 1 tornado 20667
## 2 flood 6838
## 3 excessiveheat 6391
## 4 thunderstormwind 5129
## 5 lightning 4141
## 6 flashflood 1674
print(head(StormData_PropDMG_Avg))
## # A tibble: 6 x 2
## EVTYPE PROPDMG
## <fctr> <dbl>
## 1 hurricanetyphoon 285472943
## 2 stormsurgetide 117273164
## 3 tropicalstorm 11067992
## 4 tsunami 7203100
## 5 flood 4990060
## 6 wildfire 2006987
print(head(StormData_PropDMG_Tot))
## # A tibble: 6 x 2
## EVTYPE PROPDMG
## <fctr> <dbl>
## 1 flood 144129590200
## 2 hurricanetyphoon 81718889010
## 3 stormsurgetide 47834724000
## 4 tornado 24616945710
## 5 flashflood 15222258910
## 6 hail 14595143420
print(head(StormData_CropDMG_Avg)) #more crop dmg from droughts, more prop dmg from storm surges
## # A tibble: 6 x 2
## EVTYPE CROPDMG
## <fctr> <dbl>
## 1 hurricanetyphoon 18448554.5
## 2 drought 5586794.0
## 3 icestorm 2499807.6
## 4 tropicalstorm 996981.3
## 5 frostfreeze 814126.5
## 6 extremecoldwindchill 703343.7
print(head(StormData_CropDMG_Tot))
## # A tibble: 6 x 2
## EVTYPE CROPDMG
## <fctr> <dbl>
## 1 drought 13367566000
## 2 hurricanetyphoon 5350107800
## 3 flood 5013161500
## 4 hail 2476029450
## 5 flashflood 1334901700
## 6 extremecoldwindchill 1326023000
There’s a lot of useful information here. In fact, there’s so much useful information that we could just end our analysis here and answer our original questions. However, there may be ways to make more direct and easily understandable comparisons if we push a bit further, so we’ll keep going.
We can start by breaking down some of the information gleaned from the tibbles above:
- For the population health categories (FATALITIES and INJURIES), the averages are fairly similar (and small) across the board, while the totals only have slight outliers. However, for the economic impact categories (PROPDMG and CROPDMG), there are large outliers in both cases. The outliers in the average summaries, in particular, are huge. This is extremely important, as it means we should probably do some scaling on these values, or any graphs we put together are going to be extremely hard to read.
- The values for FATALITIES and INJURIES, and the values for PROPDMG and CROPDMG, are within very similar ranges respectively, for both their averages and their totals. Since the questions we’re trying to answer deal with overall human and economic impacts, it wouldn’t hurt to further condense our data and combine each of these pairs of summaries.

Let’s tackle the last point first, as it’ll give us slightly less to work with going forward.
To condense the datasets, we need to be careful. While we can add absolute totals together, we can’t just do that for averages or they won’t actually be averages anymore. So for each of the condensed average datasets, we will rerun the same summary code we just used, only this time the first step is to combine each of the respective impact columns. For the condensed total datasets we can do a simple merge of the summaries we already calculated and then combine the impacts after the fact.
As a side note, we can also divide all the costs by a billion, to bring their absolute scale down a bit. This doesn’t solve our scaling problem, but it will be useful for potential graphing purposes later, in terms of saving space and making things easier to read. We’ll deal with the actual scaling problem in a moment.
#Note: the difference in averages is much more extreme than the difference in totals
HumanImpact_Avg<-mutate(StormData_Complete,HealthImpacts=FATALITIES+INJURIES)%>% #condense for emphasis
summarize(HealthImpacts=mean(HealthImpacts))%>%rename(Disaster_Type=EVTYPE)%>% #make nicer for graph
arrange(desc(HealthImpacts))
HumanImpact_Tot<-merge(StormData_Fatalities_Tot,StormData_Injuries_Tot,all=TRUE)%>%
mutate(HealthImpacts=FATALITIES+INJURIES)%>%rename(Fatalities=FATALITIES,Injuries=INJURIES)%>%
arrange(desc(HealthImpacts))
EconomicCost_Avg<-mutate(StormData_Complete,TotalCost=PROPDMG+CROPDMG)%>%
summarize(TotalCost=mean(TotalCost)/1000000000)%>%rename(Disaster_Type=EVTYPE)%>% #scale down cost
arrange(desc(TotalCost))
EconomicCost_Tot<-merge(StormData_PropDMG_Tot,StormData_CropDMG_Tot,all=TRUE)%>%
mutate(TotalCost=(PROPDMG+CROPDMG)/1000000000,PROPDMG=PROPDMG/1000000000,CROPDMG=CROPDMG/1000000000)%>%
rename(Property=PROPDMG,Crop=CROPDMG)%>%arrange(desc(TotalCost))
So now we have our data summaries all set. Our next step in making our analysis more generally understandable will be to turn our summaries into visual representations.
Thinking about the breadth of our data, one thing is immediately clear: in order to have visually compelling graphs we’re going to need to narrow down our summaries a bit. Fitting 48 different disaster event types onto a single graph would be extremely busy (recall from earlier that that’s the number of official event types we narrowed our data down to). But we don’t want to cut out data arbitrarily, and potentially lose an important part of the narrative.
First we’ll need to determine what a representative sample of our summaries looks like. Let’s see what percentage of each summary’s total the top 10 event types represent.
# What is a representative slice of each category?
print(sum(HumanImpact_Avg$HealthImpacts[1:10])/sum(HumanImpact_Avg$HealthImpacts)*100) #top 10 is not
## [1] 85.16509
print(sum(HumanImpact_Tot$HealthImpacts[1:10])/sum(HumanImpact_Tot$HealthImpacts)*100) #for human impacts
## [1] 85.49317
print(sum(EconomicCost_Avg$TotalCost[1:10])/sum(EconomicCost_Avg$TotalCost)*100) #top 10 is
## [1] 98.84623
print(sum(EconomicCost_Tot$TotalCost[1:10])/sum(EconomicCost_Tot$TotalCost)*100) #for economic losses
## [1] 95.32703
For economic cost, for both the average and total summaries, the top 10 look to be very representative (98.8% and 95.3% respectively). However, for the human impact summaries, the numbers are not quite as sterling (around 85% apiece).
Let’s try sampling the top 20 from the human impacts summaries instead.
print(sum(HumanImpact_Avg$HealthImpacts[1:20])/sum(HumanImpact_Avg$HealthImpacts)*100) #top 20 is
## [1] 96.71101
print(sum(HumanImpact_Tot$HealthImpacts[1:20])/sum(HumanImpact_Tot$HealthImpacts)*100) #for human impacts
## [1] 96.36688
As we can see, they are both over 96% now, which is much better.
Now that we have our representative sample size set, we can finally perform the scaling of the economic costs mentioned earlier. To do this we can take the log10 of the average cost for each event type, re-running the summary with that transformation. However, we only really need to do this for the average cost summary.
EconomicCost_Avg<-mutate(StormData_Complete,TotalCost=PROPDMG+CROPDMG)%>% #use log10 for avg costs
summarize(TotalCost=log10(mean(TotalCost)))%>%rename(Disaster_Type=EVTYPE)%>% #for extreme range
arrange(desc(TotalCost))
One last thing: While our average summaries have the different impact types (FATALITIES,PROPDMG,etc) fully embedded in their values, the data in the total summaries still has the values for each impact type separate and distinct. This gives us an opportunity. If we can turn the impact types into factors themselves, we could plot stacked graphs. These would not only show the overall impacts and costs, but also the degree to which each impact type individually contributed, adding a whole other dimension to our analysis.
To get these new factors, we’re going to have to “melt” our representative sample from our total summaries, and then get rid of the added totals (since we’ll be stacking the factors to get the added totals instead). The resulting “molten” summaries will be what we use in our graphs.
#make type of impact/cost a factor
HumanImpactTot_Molten<-melt(HumanImpact_Tot[1:20,])%>%arrange(variable,desc(value))
## Using EVTYPE as id variables
HumanImpactTot_Molten<-HumanImpactTot_Molten[1:40,]%>%rename(Types_of_Impact=variable) #remove total damage
EconomicCostTot_Molten<-melt(EconomicCost_Tot[1:10,])%>%arrange(variable,desc(value))
## Using EVTYPE as id variables
EconomicCostTot_Molten<-EconomicCostTot_Molten[1:20,]%>%rename(Types_of_Damage=variable) #remove total cost
And that’s it. We could continue to manipulate the data in a million other ways (such as figuring out the top disaster type in each state), but at this point we have plenty of material to answer the questions at hand. It’s time to see our results.
When we started this analysis of the storm database from NOAA, we posed two questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
After thoroughly analyzing and summarizing the data, we’ve decided to approach both questions from two angles. The results of these approaches are below.
g1<-ggplot(data=HumanImpact_Avg[1:20,],aes(x=Disaster_Type,y=HealthImpacts))+ #create avg aesthetics
geom_bar(stat="identity",fill="forestgreen")+
labs(x="Event Type",y="Health Impacts",title="Top 20 Natural Disasters by Avg Health Impact")+
theme(axis.text.x=element_text(angle=90,size=12,hjust=1,vjust=-.1),
axis.title=element_text(size=15),title=element_text(size=15))
g2<-ggplot(data=EconomicCost_Avg[1:10,],aes(x=Disaster_Type,y=TotalCost))+
geom_bar(stat="identity",fill="firebrick1")+
labs(x="Event Type",y="log10(Cost ($))",title="Top 10 Natural Disasters by Avg Economic Cost")+
theme(axis.text.x=element_text(angle=90,size=12,hjust=1,vjust=-.1),
axis.title=element_text(size=15),title=element_text(size=15))
grid.arrange(g1,g2,ncol=2) #print first set of graphs together
Our first approach was to look at the average human impacts (from fatalities and injuries combined) and average economic costs (from both property and crop damage). This approach is shown above.
The first graph clearly shows that, on average, tsunamis are the most harmful disaster events, with respect to population health. However, the next three most harmful disasters (which all have fairly similar effects on average) are hurricanes, excessive heat and just heat. So one could say that heat in general is collectively more harmful to population health than even tsunamis.
The second graph is slightly less clear, as it is a scaled graph of log10(cost). With this in mind, however, it quickly becomes evident that, on average, hurricanes have costs that are literally orders of magnitude higher than those of other disaster events. That being said, storm surges are not too far behind. One thing to note, though, is that storm surges generally accompany hurricanes. So in reality the top two disasters on this graph are essentially one and the same, and their economic consequences are simply amplified further.
#create total impact aesthetic
g3<-ggplot(data=HumanImpactTot_Molten,aes(x=EVTYPE,y=value,fill=Types_of_Impact))+
geom_bar(stat="identity")+
labs(x="Event Type",y="Count of Impacts",title="Top 20 Natural Disasters in US by Total Human Impact since 1996")+
theme(axis.text.x=element_text(angle=90,size=12,hjust=1,vjust=-.1),
axis.title=element_text(size=15),title=element_text(size=15),legend.text=element_text(size=12))
#create total cost aesthetic
g4<-ggplot(data=EconomicCostTot_Molten,aes(x=EVTYPE,y=value,fill=Types_of_Damage))+
geom_bar(stat="identity")+
labs(x="Event Type",y="Cost (billion $)",title="Top 10 Natural Disasters in US by Total Economic Cost since 1996")+
theme(axis.text.x=element_text(angle=90,size=12,hjust=1,vjust=-.1),
axis.title=element_text(size=15),title=element_text(size=15),legend.text=element_text(size=12))
print(g3) #print second set of graphs separately
print(g4)
Our second approach was to look at the total human impacts (from fatalities and injuries) and total economic costs (from property and crop damage). This approach is shown in our final two figures.
In terms of population health, tornados have clearly caused by far the most harm. However, from the stacked figure, it appears that excessive heat has actually caused slightly more fatalities. Also of note is the fact that, here, tsunamis aren’t even on the list, in contrast to the average impact. This is likely due to tsunamis being a rare occurrence in the US, as opposed to tornados, which happen quite frequently.
As for economic consequences, this time floods cost the US the most by a long shot (the y axis is in billions of dollars). In this case, though, hurricanes (which had the highest average costs) do at least come in second, at about half the total cost of floods. This is again likely due to the frequency of floods versus hurricanes. One other very notable item is that droughts are far and away the most costly disasters in terms of crop damage, as can be seen from the stacked bar on the left. In fact, while property damage is generally quite a bit more costly than crop damage, the amount of crop damage caused by droughts puts their total on par with events that cause heavy property damage.
Overall, we can probably state that excessive heat is the most dangerous disaster for population health, as it’s present in the top 3 for both the average and the absolute approaches. Similarly we can safely say that hurricanes have the greatest economic consequences from a general standpoint.
It would not be wise to put too much weight on this analysis. Though we captured more than 99% of the values recorded, it is still limited in many ways. For one thing, the recording of event types is extremely imprecise, and therefore some of our re-working of the EVTYPE data may have been faulty. And considering the level of disarray of this variable, there are no doubt many other errors that we didn’t catch in this brief exercise. The improper coding of even just one crop or property damage exponent from a K or M to a B could throw off the entire analysis. Additionally, it is possible that one of the very few event types we ignored was a single record with huge human or economic impacts. Such an occurrence could severely skew our results.
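One way to probe that exponent risk, sketched below under the assumption that a legitimate category total is rarely dominated by a single report, would be to flag event types whose property damage total comes almost entirely from one record:

#sketch: event types whose property damage total is dominated by one record (possible EXP miscoding)
group_by(StormData_Complete,EVTYPE)%>%
  summarize(MaxProp=max(PROPDMG),TotalProp=sum(PROPDMG))%>%
  filter(TotalProp>0)%>%
  mutate(ShareOfTotal=MaxProp/TotalProp)%>% #share of the category total contributed by its single largest record
  arrange(desc(ShareOfTotal))%>%head(10)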
Another thing that would be worth considering is inflation adjusted figures in the economic analysis. Or we could observe the data year by year, to see the effects of climate change on natural disasters over time. Speaking of climate, it could be more useful to look at the results on a region by region or state by state basis, since the types of events in any particular region are heavily influenced by geography. The original storm dataset from NOAA has many additional variables to help with this kind of mapping.
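As a hedged sketch of the year-by-year idea (keeping in mind these would be nominal, not inflation-adjusted, dollars), one could total the combined damages per year:

#sketch: total combined damage cost per year, in billions of (nominal) dollars
ungroup(StormData_Complete)%>%
  mutate(YEAR=as.numeric(format(BGN_DATE,"%Y")))%>%
  group_by(YEAR)%>%
  summarize(TotalCost=sum(PROPDMG+CROPDMG)/1000000000)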
Suffice it to say that while we can draw broad conclusions from our analysis, this is only a first step in truly answering our core questions about natural disasters in the United States.
The data itself comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file here:
There is also documentation of the database available, where you can find how some of the variables are constructed/defined:
print(sessionInfo())
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 gridExtra_2.2.1 ggplot2_2.2.1 reshape2_1.4.2
## [5] dplyr_0.7.2 plyr_1.8.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 knitr_1.17 bindr_0.1 magrittr_1.5
## [5] munsell_0.4.3 colorspace_1.3-2 R6_2.2.2 rlang_0.1.2
## [9] stringr_1.2.0 tools_3.4.1 grid_3.4.1 gtable_0.2.0
## [13] htmltools_0.3.6 lazyeval_0.2.0 yaml_2.1.14 rprojroot_1.2
## [17] digest_0.6.12 assertthat_0.2.0 tibble_1.3.3 codetools_0.2-15
## [21] glue_1.1.1 evaluate_0.10.1 rmarkdown_1.6 labeling_0.3
## [25] stringi_1.1.5 compiler_3.4.1 scales_0.4.1 backports_1.1.0
## [29] pkgconfig_2.0.1