Synopsis

In this document we present an analysis of the NOAA storm database. We are interested in finding the main causes of (a) economic damage and (b) harm to population health, across the United States. Our analysis shows that the main sources of economic damage are hurricanes/typhoons, storm surges, floods, tornados and hail, and that the main threat to population health is tornados, followed (in different order, depending on whether we consider injuries or fatalities) by excessive heat, floods and wind. In the case of fatalities, the distribution across the main causes is fairly uniform, while in the case of injuries tornados are by far the main cause.

Data processing

Data is obtained from the Coursera assignment link, although the original source is the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The file is compressed using bzip2 (which R can read directly), so we have downloaded it to our data directory. This file will be used as our input; it was obtained on April 16th at 23:20.
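
For reproducibility, the download step can be scripted. A minimal sketch (the URL below is a placeholder for the Coursera assignment link, which is not reproduced here):

# hypothetical placeholder: substitute the actual Coursera assignment link
file.url <- "https://example.com/repdata_data_StormData.csv.bz2"
if (!dir.exists("data")) dir.create("data")
download.file(file.url, destfile = "data/repdata_data_StormData.csv.bz2", mode = "wb")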

Additional information about this dataset can be found in the official NOAA storm data documentation.

Reading and transforming data

A careful review of the source .csv file shows that it is a comma-separated values file with a header. A quick search through the file also reveals that ? is used as the NA value. We can use this knowledge to read the file (note that we can use read.csv directly, since R is able to uncompress the file on the fly):

data <- read.csv("data/repdata_data_StormData.csv.bz2", header=T, sep=",", check.names=T, na.strings="?")

Looking at the type of analysis we need to perform, we can check for NAs in our columns of interest:

sapply(c("EVTYPE","INJURIES","FATALITIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","BGN_DATE"),function(x) {summary(data[x])})
## $EVTYPE
##                EVTYPE      
##  HAIL             :288661  
##  TSTM WIND        :219940  
##  THUNDERSTORM WIND: 82563  
##  TORNADO          : 60652  
##  FLASH FLOOD      : 54277  
##  (Other)          :196203  
##  NA's             :     1  
## 
## $INJURIES
##     INJURIES        
##  Min.   :   0.0000  
##  1st Qu.:   0.0000  
##  Median :   0.0000  
##  Mean   :   0.1557  
##  3rd Qu.:   0.0000  
##  Max.   :1700.0000  
## 
## $FATALITIES
##    FATALITIES      
##  Min.   :  0.0000  
##  1st Qu.:  0.0000  
##  Median :  0.0000  
##  Mean   :  0.0168  
##  3rd Qu.:  0.0000  
##  Max.   :583.0000  
## 
## $PROPDMG
##     PROPDMG       
##  Min.   :   0.00  
##  1st Qu.:   0.00  
##  Median :   0.00  
##  Mean   :  12.06  
##  3rd Qu.:   0.50  
##  Max.   :5000.00  
## 
## $PROPDMGEXP
##    PROPDMGEXP    
##         :465934  
##  K      :424665  
##  M      : 11330  
##  0      :   216  
##  B      :    40  
##  (Other):   104  
##  NA's   :     8  
## 
## $CROPDMG
##     CROPDMG       
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  1.527  
##  3rd Qu.:  0.000  
##  Max.   :990.000  
## 
## $CROPDMGEXP
##    CROPDMGEXP    
##         :618413  
##  K      :281832  
##  M      :  1994  
##  k      :    21  
##  0      :    19  
##  (Other):    11  
##  NA's   :     7  
## 
## $BGN_DATE
##               BGN_DATE     
##  5/25/2011 0:00:00:  1202  
##  4/27/2011 0:00:00:  1193  
##  6/9/2011 0:00:00 :  1030  
##  5/30/2004 0:00:00:  1016  
##  4/4/2011 0:00:00 :  1009  
##  4/2/2006 0:00:00 :   981  
##  (Other)          :895866

There are only a few NAs in our columns of interest: one in EVTYPE, 8 in PROPDMGEXP (the exponent/modifier for the property damage value) and 7 in CROPDMGEXP (the analogous modifier for crop damage).

Let’s take a look at PROPDMG and PROPDMGEXP:

data[is.na(data$PROPDMGEXP),"PROPDMG"]
## [1] 0 0 0 0 0 0 0 0

Let’s do the same for CROPDMG and CROPDMGEXP:

data[is.na(data$CROPDMGEXP),"CROPDMG"]
## [1] 0 0 0 0 0 0 0

Since the damage value (PROPDMG/CROPDMG) is 0 in every case where PROPDMGEXP/CROPDMGEXP is NA, we can set those exponents to 0 and handle these rows the same way as the rest, without spurious NAs appearing later. We can also assign the single row with a NA EVTYPE to an “UNKNOWN” type:

data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- "0"
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- "0"
levels(data$EVTYPE) <- c(levels(data$EVTYPE),"UNKNOWN")
data$EVTYPE[is.na(data$EVTYPE)] <- "UNKNOWN"
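
A quick sanity check (a minimal sketch) that no NAs remain in those columns:

# each of the three counts should now be 0
sapply(c("EVTYPE","PROPDMGEXP","CROPDMGEXP"), function(x) sum(is.na(data[[x]])))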

We need to perform some analysis based on the type of event (EVTYPE), so one thing we can do is check what data is in that column.

length(levels(data$EVTYPE))
## [1] 985

As we can see, there are almost 1000 different event types, even though the original NOAA documentation defines far fewer. Since at first glance it looks like there is some duplication among event types (non-standardized values), we can take a quick rough cut to see what is going on:

unique(substr(levels(data$EVTYPE),1,4))
##   [1] "ABNO" "ACCU" "AGRI" "APAC" "ASTR" "AVAL" "BEAC" "Beac" "BELO" "BITT"
##  [11] "Blac" "BLAC" "BLIZ" "Bliz" "BLOW" "blow" "Blow" "BREA" "BRUS" "COAS"
##  [21] "Coas" " COA" "coas" "Cold" "COLD" "COOL" "CSTL" "Dama" "DAMA" "DAM "
##  [31] "DEEP" "DENS" "DOWN" "DRIE" "Drif" "DROU" "DROW" "DRY"  "DRY " "DRYN"
##  [41] "DUST" "Dust" "EARL" "Earl" "Eros" "EXCE" "Exce" "Exte" "Extr" "EXTR"
##  [51] "FALL" "FIRS" " FLA" "FLAS" "Floo" "FLOO" "FOG"  "FOG " "FORE" "Free"
##  [61] "FREE" "Fros" "FROS" "FUNN" "Funn" "Glaz" "GLAZ" "grad" "Grad" "GRAD"
##  [71] "GRAS" "GROU" "GUST" "Gust" "HAIL" "Hail" "HARD" "HAZA" "HEAT" "Heat"
##  [81] "HEAV" "Heav" "HIGH" "High" "   H" "Hot " "HOT/" "HOT " "HURR" "Hurr"
##  [91] "HVY " "HYPE" "HYPO" "Hypo" "ICE"  "ICE " "Ice " "Ice/" "ICE/" "Ices"
## [101] "Icy " "ICY " "LACK" "Lake" "LAKE" "LAND" "Land" "LARG" "LATE" "Late"
## [111] "LIGH" " LIG" "Ligh" "LIGN" "LOCA" "LOW " "MAJO" "Mari" "MARI" "Metr"
## [121] "Micr" "MICR" "Mild" "MILD" "MINO" "Mino" "MIXE" "Mixe" "MODE" "MONT"
## [131] "Mont" "Moun" "MUD/" "Muds" "MUDS" "MUD " "NEAR" "NONE" "NON " "NON-"
## [141] "NORM" "NORT" "No S" "Othe" "OTHE" "PATC" "Prol" "PROL" "RAIN" "Rain"
## [151] "RAPI" "Reco" "RECO" "RED " "REMN" "RIP " "RIVE" "Rive" "ROCK" "ROGU"
## [161] "ROTA" "ROUG" "RURA" "Saha" "SAHA" "Seas" "SEIC" "SEVE" "SLEE" "smal"
## [171] "Smal" "SMAL" "Sml " "SMOK" "Snow" "SNOW" "SOUT" "STOR" "STRE" "Stro"
## [181] "STRO" "Summ" "SUMM" "Temp" "THUD" "THUN" "Thun" "TIDA" "Tida" "TORN"
## [191] "TORR" "Torr" "TROP" "TSTM" "Tstm" " TST" "TSUN" "TUND" "TYPH" "Unse"
## [201] "UNSE" "UNUS" "URBA" "Urba" "VERY" "VOG"  "Volc" "VOLC" "WAKE" "WALL"
## [211] "WARM" "WATE" " WAT" "WAYT" "wet " "WET " "Wet " "Whir" "WHIR" "WILD"
## [221] "Wind" "WIND" " WIN" "WINT" "Wint" "WND"  "UNKN"

From here, we can extract some additional information:

  • Some event types have leading spaces in their names; we need to fix that.

  • There are some events of type “summ”. After reviewing the data, those correspond to “summary” entries that point to other (following) entries, so we can remove them.

  • There are many different names for the same event type. For example:

grep("^urban.+small",levels(data$EVTYPE), ignore.case=T, value=T)
##  [1] "URBAN AND SMALL"                "URBAN AND SMALL STREAM"        
##  [3] "URBAN AND SMALL STREAM FLOOD"   "URBAN AND SMALL STREAM FLOODIN"
##  [5] "URBAN SMALL"                    "URBAN/SMALL"                   
##  [7] "URBAN/SMALL FLOODING"           "URBAN/SMALL STREAM"            
##  [9] "URBAN SMALL STREAM FLOOD"       "URBAN/SMALL STREAM FLOOD"      
## [11] "URBAN/SMALL STREAM  FLOOD"      "URBAN/SMALL STREAM FLOODING"   
## [13] "URBAN/SMALL STRM FLDG"

In order to fix this, we can run a series of transformations on EVTYPE to unify types. The following transformations were chosen after reviewing the raw list of EVTYPE values:

# drop summaries
data <- data[!grepl("^summary",data$EVTYPE,ignore.case=T),]
# remove leading/trailing spaces
data$EVTYPE <- as.factor(trimws(data$EVTYPE,"b"))
# create a new column for the simplified event type; this also keeps the
# original EVTYPE in case we want to do additional exploration
data$type <- data$EVTYPE

These are the basic fixes for EVTYPE; now we can apply some more transformations:

levels(data$type) <- c(levels(data$type),"HAIL","SNOW","FLOOD","RAIN","TORNADO","UNSEASON","THUNDERSTORM","RECORD","FIRE")
data$type[grepl("wind|wins$|w inds|win$",data$EVTYPE,ignore.case = T)] <- as.factor("WIND")
data$type[grepl("hurricane",data$EVTYPE,ignore.case = T)] <- as.factor("HURRICANE")
data$type[grepl("hail",data$EVTYPE,ignore.case = T)] <- as.factor("HAIL")
data$type[grepl("snow",data$EVTYPE,ignore.case = T)] <- as.factor("SNOW")
data$type[grepl("flood",data$EVTYPE,ignore.case = T)] <- as.factor("FLOOD")
data$type[grepl("rain",data$EVTYPE,ignore.case = T)] <- as.factor("RAIN")
data$type[grepl("tornado",data$EVTYPE,ignore.case = T)] <- as.factor("TORNADO")
data$type[grepl("^unseason",data$EVTYPE,ignore.case = T)] <- as.factor("UNSEASON")
data$type[grepl("thunderstorm",data$EVTYPE,ignore.case = T)] <- as.factor("THUNDERSTORM")
data$type[grepl("record",data$EVTYPE,ignore.case = T)] <- as.factor("RECORD")
data$type[grepl("fire",data$EVTYPE,ignore.case = T)] <- as.factor("FIRE")
data$type <- droplevels(data$type)
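
After these transformations we can re-check the number of distinct event types; the exact count depends on the rules above, but it should be well below the original 985:

# number of distinct event types after unification
length(levels(data$type))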

Additionally, we can add the year as a separate column, to perform some additional data checking if needed:

data$year <- as.numeric(strftime(as.Date(data$BGN_DATE,"%m/%d/%Y"),"%Y"))
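
As a quick check on the new column (the dataset spans 1950 to 2011, as the summaries later confirm):

# expect 1950 and 2011
range(data$year)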

Once we have fixed the type column and added the year column, we can work on the PROPDMG and PROPDMGEXP columns. The first is a number quantifying property damage; the second is an exponent (modifier). We can find out which modifiers are present. For property damage:

levels(data$PROPDMGEXP)
##  [1] ""  "-" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"

And for crops damages:

levels(data$CROPDMGEXP)
## [1] ""  "0" "2" "B" "k" "K" "m" "M"

Most of these “exponents” have a reasonable interpretation (the empty string, “-”, “+” and “0” mean no multiplier; the digits 1..8 are treated as small direct multipliers; H/h = 100; K/k = 1,000; M/m = 10^6). These assumptions have been checked by selecting and reading some records corresponding to those exponents. For example, the ones with M correspond to big events, such as tornados, hurricanes, etc.
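
As an example of that kind of spot check, we can list a few records for a given modifier and confirm that the scale of the events matches the assumed multiplier (a quick sketch, here for the M/m modifier):

# sample of records whose property damage uses the M/m (millions) modifier
head(data[data$PROPDMGEXP %in% c("m","M"), c("EVTYPE","PROPDMG","PROPDMGEXP")])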

There’s also an important one to check: B, which presumably refers to “billions”. Since that is a very large multiplier, we can check the rows involved:

data[data$PROPDMGEXP=="B",c("year","PROPDMG","type","REFNUM")]
##        year PROPDMG             type REFNUM
## 187564 1993    5.00     WINTER STORM 187564
## 187566 1995    0.10        HURRICANE 187566
## 195000 1995    2.10        HURRICANE 194932
## 195001 1993    1.60          TORNADO 194933
## 195007 1995    1.00        HURRICANE 194939
## 198389 1993    5.00            FLOOD 198375
## 207175 1995    2.50             RAIN 207124
## 243445 1995    1.20     THUNDERSTORM 243394
## 298088 1997    3.00            FLOOD 298057
## 347872 1998    1.70        HURRICANE 347811
## 366694 1999    3.00        HURRICANE 366653
## 397334 2000    1.50             FIRE 398999
## 443782 2001    5.15   TROPICAL STORM 444407
## 485577 2003    1.00            FLOOD 485535
## 488045 2003    1.04             FIRE 488004
## 525145 2004    2.50        HURRICANE 525145
## 529351 2004    5.42        HURRICANE 529299
## 529363 2004    1.30             WIND 529311
## 529436 2004    4.83        HURRICANE 529384
## 529498 2004    4.00        HURRICANE 529446
## 565003 2005    1.00        HURRICANE 564962
## 569065 2005    1.50        HURRICANE 569065
## 569308 2005   10.00        HURRICANE 569288
## 577675 2005   16.93        HURRICANE 577615
## 577676 2005   31.30      STORM SURGE 577616
## 577683 2005    4.00        HURRICANE 577623
## 581533 2005    7.35        HURRICANE 581533
## 581535 2005   11.26      STORM SURGE 581535
## 581537 2005    5.88        HURRICANE 581537
## 598502 2005    2.09        HURRICANE 598472
## 605953 2006  115.00            FLOOD 605943
## 739575 2008    1.00        HURRICANE 739514
## 739576 2008    4.00 STORM SURGE/TIDE 739515
## 808257 2010    1.50            FLOOD 808257
## 834674 2010    1.80             HAIL 834634
## 856214 2011    1.00          TORNADO 859151
## 860386 2011    1.50          TORNADO 860355
## 862634 2011    2.80          TORNADO 862563
## 867749 2011    1.00            FLOOD 867679
## 868046 2011    2.00            FLOOD 867996

One row catches our attention: it says that in 2006 there was a flood that caused 115 billion dollars in damages. That number is about 4x the highest value in the rest of this set (which corresponds to Hurricane Katrina), which suggests the data might not be right. To be sure, since it looks like a really big event, I googled for the original report, which can be found in Google Books. The original report tells us this flood caused 115 million dollars in damages. We can fix this:

data$PROPDMG[data$REFNUM==605943] <- 115
data$PROPDMGEXP[data$REFNUM==605943] <- "M"
data[data$PROPDMGEXP=="B",c("year","PROPDMG","PROPDMGEXP","type","REFNUM")]
##        year PROPDMG PROPDMGEXP             type REFNUM
## 187564 1993    5.00          B     WINTER STORM 187564
## 187566 1995    0.10          B        HURRICANE 187566
## 195000 1995    2.10          B        HURRICANE 194932
## 195001 1993    1.60          B          TORNADO 194933
## 195007 1995    1.00          B        HURRICANE 194939
## 198389 1993    5.00          B            FLOOD 198375
## 207175 1995    2.50          B             RAIN 207124
## 243445 1995    1.20          B     THUNDERSTORM 243394
## 298088 1997    3.00          B            FLOOD 298057
## 347872 1998    1.70          B        HURRICANE 347811
## 366694 1999    3.00          B        HURRICANE 366653
## 397334 2000    1.50          B             FIRE 398999
## 443782 2001    5.15          B   TROPICAL STORM 444407
## 485577 2003    1.00          B            FLOOD 485535
## 488045 2003    1.04          B             FIRE 488004
## 525145 2004    2.50          B        HURRICANE 525145
## 529351 2004    5.42          B        HURRICANE 529299
## 529363 2004    1.30          B             WIND 529311
## 529436 2004    4.83          B        HURRICANE 529384
## 529498 2004    4.00          B        HURRICANE 529446
## 565003 2005    1.00          B        HURRICANE 564962
## 569065 2005    1.50          B        HURRICANE 569065
## 569308 2005   10.00          B        HURRICANE 569288
## 577675 2005   16.93          B        HURRICANE 577615
## 577676 2005   31.30          B      STORM SURGE 577616
## 577683 2005    4.00          B        HURRICANE 577623
## 581533 2005    7.35          B        HURRICANE 581533
## 581535 2005   11.26          B      STORM SURGE 581535
## 581537 2005    5.88          B        HURRICANE 581537
## 598502 2005    2.09          B        HURRICANE 598472
## 739575 2008    1.00          B        HURRICANE 739514
## 739576 2008    4.00          B STORM SURGE/TIDE 739515
## 808257 2010    1.50          B            FLOOD 808257
## 834674 2010    1.80          B             HAIL 834634
## 856214 2011    1.00          B          TORNADO 859151
## 860386 2011    1.50          B          TORNADO 860355
## 862634 2011    2.80          B          TORNADO 862563
## 867749 2011    1.00          B            FLOOD 867679
## 868046 2011    2.00          B            FLOOD 867996

We can also run a similar check for crop damages:

data[data$CROPDMGEXP=="B",c("CROPDMG","type","REFNUM")]
##        CROPDMG      type REFNUM
## 188633    0.40      HEAT 188633
## 198389    5.00     FLOOD 198375
## 199733    0.50   DROUGHT 201230
## 201256    0.20    FREEZE 201222
## 211900    5.00 ICE STORM 211887
## 581537    1.51 HURRICANE 581537
## 639347    1.00   DROUGHT 639314
## 899222    0.00   DROUGHT 899212
## 899608    0.00   DROUGHT 899608

In this case there seems to be no obvious outlier or error, except a crop damage value of “0” in a couple of rows (which does not seem reasonable, given that “B” is indicated as the exponent). After checking the corresponding records, there is no additional information to help us, so we decide to keep them as they are.
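
For reference, this is the kind of lookup used for that check; REFNUM identifies the records (899212 and 899608 for the two zero-value rows above):

# inspect the full records for the two zero-value "B" crop entries
data[data$REFNUM %in% c(899212, 899608), ]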

Once we have decided how to deal with the exponents, we can add a new column with the expanded economic damage for each record, applying the exponents/modifiers and combining property damage and crop damage into a single value:

# Create a mapping table of the previously observed exponents/modifiers
# ""  "-" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
# nothing to do for "", "-", "+" and "0"
# for the rest, we set up a data frame with multipliers; we also include the
# lowercase "k" seen in CROPDMGEXP, since merge() performs an inner join by
# default and would otherwise silently drop those rows
exps <- data.frame(l=c(levels(data$PROPDMGEXP),"k"),factor=c(1,1,1,1,1,2,3,4,5,6,7,8,1000000000,100,100,1000,1000000,1000000,1000))
# we can perform a join on PROPDMGEXP==l to add a column with the multiplier for property damages
data <- merge(data,exps,by.x = "PROPDMGEXP", by.y="l")
data$dollars <- data$PROPDMG * data$factor
data$factor <- NULL
# we replicate this operation for the crop damage multiplier
data <- merge(data,exps,by.x = "CROPDMGEXP", by.y="l")
# and now compute the sum of damages
data$dollars <- data$dollars + data$CROPDMG * data$factor
data$factor <- NULL
# now "dollars" holds the raw damage value in dollars
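
As a quick sanity check of the computation, we can verify a known large record: the 2005 storm surge with REFNUM 577616 was recorded as 31.30 with exponent B, so (assuming no crop damage on that record) its dollars value should be 3.13e10, which also matches the maximum in the dollars summary below:

# sanity check: 31.30 * 1e9 for the 2005 storm surge record
data[data$REFNUM == 577616, c("PROPDMG","PROPDMGEXP","dollars")]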

Before proceeding to the analysis, we can check again for NAs in our columns of interest:

sapply(c("EVTYPE","INJURIES","FATALITIES","dollars","year"),function(x) {summary(data[x])})
## $EVTYPE
##                EVTYPE      
##  HAIL             :288646  
##  TSTM WIND        :219944  
##  THUNDERSTORM WIND: 82563  
##  TORNADO          : 60652  
##  FLASH FLOOD      : 54278  
##  FLOOD            : 25326  
##  (Other)          :170792  
## 
## $INJURIES
##     INJURIES        
##  Min.   :   0.0000  
##  1st Qu.:   0.0000  
##  Median :   0.0000  
##  Mean   :   0.1558  
##  3rd Qu.:   0.0000  
##  Max.   :1700.0000  
## 
## $FATALITIES
##    FATALITIES      
##  Min.   :  0.0000  
##  1st Qu.:  0.0000  
##  Median :  0.0000  
##  Mean   :  0.0168  
##  3rd Qu.:  0.0000  
##  Max.   :583.0000  
## 
## $dollars
##     dollars         
##  Min.   :0.000e+00  
##  1st Qu.:0.000e+00  
##  Median :0.000e+00  
##  Mean   :4.007e+05  
##  3rd Qu.:1.000e+03  
##  Max.   :3.130e+10  
## 
## $year
##       year     
##  Min.   :1950  
##  1st Qu.:1995  
##  Median :2002  
##  Mean   :1999  
##  3rd Qu.:2007  
##  Max.   :2011

We can see that there are no NAs, so no further action is required.

Results

First we have to decide which data we want to analyse. The dataset contains data from 1950 to 2011. Since, I assume, we want to focus on “recent” data, it makes little sense to include data from the fifties, because (a) the damages (to population and economy) may no longer be representative, since new safety regulations (for buildings, for example) have been adopted since then, and (b) inflation: treating a dollar from the fifties as equal in value to a dollar today does not make sense. For these reasons, we decide to focus only on the last 25 years of data (that is, 1986-2011).

Since this decision may have an impact on the analysis, we can make some plots just to see what data we are planning to remove:

par(mfcol=c(3,1))
par(mar=c(4,4,1,0))
plot(aggregate(FATALITIES ~ year,data,sum),type="l", main="Fatalities, injuries and damages 1950-2011")
plot(aggregate(INJURIES ~ year,data,sum),type="l")
plot(aggregate(dollars ~ year,data,sum),type="l")

From the plots, we can see that the economic impact values do not look particularly accurate or relevant until the nineties. The effects on people (fatalities, injuries) do not show such a clear difference, although some difference is visible in the fatalities plot. Overall, the differences seem sufficient to uphold our decision to keep only the last 25 years of data.

Once we have decided this, we can filter our dataset (note that this is a subjective decision based on the rationale above):

data <- data[data$year >= 1986,]

Once we have performed all the necessary transformations in our data, we can proceed to answer the questions we’re interested in.

Events with greatest economic consequences

In order to find out which events have the greatest economic consequences, we can aggregate the damage to both property and crops (already combined in the new dollars column), grouping by type.

eco.impact <- aggregate(dollars ~ type,data,sum)
head(eco.impact)
##                     type  dollars
## 1         ABNORMALLY DRY        0
## 2         ABNORMALLY WET        0
## 3        ABNORMAL WARMTH        0
## 4    AGRICULTURAL FREEZE 28820000
## 5          APACHE COUNTY     5000
## 6 ASTRONOMICAL HIGH TIDE  9425000

Then we compute the percentage of total damage caused by each type of event:

eco.impact$pct <- (eco.impact$dollars / sum(eco.impact$dollars))*100
head(eco.impact)
##                     type  dollars          pct
## 1         ABNORMALLY DRY        0 0.000000e+00
## 2         ABNORMALLY WET        0 0.000000e+00
## 3        ABNORMAL WARMTH        0 0.000000e+00
## 4    AGRICULTURAL FREEZE 28820000 8.491820e-03
## 5          APACHE COUNTY     5000 1.473251e-06
## 6 ASTRONOMICAL HIGH TIDE  9425000 2.777078e-03

Now we can compute the accumulated impact. We start with the event type with the highest impact (highest percentage) and accumulate percentages as we go:

eco.impact <- eco.impact[order(-eco.impact$pct),]
eco.impact$cumsum.pct <- cumsum(eco.impact$pct)

Let’s see how many significant rows (economic impact > 0) we have:

eco.impact <- eco.impact[eco.impact$dollars > 0,]
dim(eco.impact)
## [1] 142   4

Now we can focus on the 142 rows with non-zero economic impact. Since we have computed the cumulative impact of all of them (ordered by percentage of impact), we can establish a cut point at the top “n” events. Let's look at the top 20:

head(eco.impact,n=20)
##                 type     dollars        pct cumsum.pct
## 156        HURRICANE 90271472810 26.5985105   26.59851
## 88             FLOOD 64895658548 19.1215209   45.72003
## 246      STORM SURGE 43323541000 12.7652914   58.48532
## 249          TORNADO 36858930778 10.8604925   69.34582
## 118             HAIL 19129871036  5.6366210   74.98244
## 49           DROUGHT 15018672000  4.4252552   79.40769
## 298             WIND 12092454868  3.5630446   82.97074
## 168        ICE STORM  8967041360  2.6421408   85.61288
## 313             FIRE  8904910130  2.6238338   88.23671
## 252   TROPICAL STORM  8382236550  2.4698280   90.70654
## 248     THUNDERSTORM  7085466181  2.0877343   92.79427
## 300     WINTER STORM  6715441251  1.9787063   94.77298
## 247 STORM SURGE/TIDE  4642038000  1.3677776   96.14076
## 219             RAIN  4179545492  1.2315041   97.37226
## 83      EXTREME COLD  1360710400  0.4009336   97.77319
## 244             SNOW  1151782858  0.3393731   98.11257
## 103     FROST/FREEZE  1103566000  0.3251660   98.43773
## 180        LIGHTNING   940751606  0.2771927   98.71493
## 16          BLIZZARD   771273950  0.2272560   98.94218
## 261          TYPHOON   601055000  0.1771010   99.11928

From this data, we decide to focus on the top 14 events, since they account for about 97% of damages and include every event type with more than 1% of the total impact.

top.impact <- head(eco.impact,n=14)

Let’s plot this data (reversing the order by percentage: barplot draws the first element at the bottom, so ascending order puts the top contributors on top) to see how each top event contributes to total economic damage.

par(mfrow=c(1,1))
par(mar=c(3,7,3,1))
top.impact <- top.impact[order(top.impact$pct),]
bb <- barplot(top.impact$pct,names.arg=top.impact$type,horiz=T,cex.names=0.60,cex.axis=.6,las=1,main="Top % of total economic damage by event type, 1986-2011",xlab="Percent")
title(ylab="Type",line=0,cex.lab=.6)
title(xlab="Percent",line=0.4,cex.lab=.6)
text(y=bb[1:10],top.impact$pct[1:10]+1,labels=sprintf("%.2f%%",top.impact$pct[1:10]),cex=.6)
text(y=bb[11:14],top.impact$pct[11:14]-1,labels=sprintf("%.2f%%",top.impact$pct[11:14]),cex=.6)

Looking at that plot we can conclude that there are five main causes of economic damage in the period 1986-2011:

rev(tail(top.impact$type,n=5))
## [1] HURRICANE   FLOOD       STORM SURGE TORNADO     HAIL       
## 313 Levels: ABNORMALLY DRY ABNORMALLY WET ... FIRE

Together, these five event types caused 74.98% of all damage in the period 1986-2011.
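
That figure can be read directly from the cumulative column, since eco.impact is still ordered by decreasing percentage:

# cumulative share of the top five event types (~74.98)
eco.impact$cumsum.pct[5]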

Most harmful events with respect to population health

When analyzing which events are most harmful with respect to population health, we can take two variables into consideration: injuries and fatalities. Since there does not seem to be a natural way to unify them into a single variable, we replicate the same steps as in the economic-loss case for each of the two variables.

First we aggregate them by type:

h.impact <- aggregate(cbind(FATALITIES,INJURIES) ~ type, data, sum)

Now we add a percentage column showing how much each row contributes to the total of each variable:

h.impact$fat.pct <- (h.impact$FATALITIES / sum(h.impact$FATALITIES))*100
h.impact$inj.pct <- (h.impact$INJURIES / sum(h.impact$INJURIES))*100

Once we have these new columns, we can compute the cumulative sum of those percentages to establish a cut point:

h.impact <- h.impact[order(-h.impact$fat.pct),]
h.impact$cumsum.fat.pct <- cumsum(h.impact$fat.pct)

h.impact <- h.impact[order(-h.impact$inj.pct),]
h.impact$cumsum.inj.pct <- cumsum(h.impact$inj.pct)

Now we can look at the first 20 values to decide where to establish our cut point. First for fatalities:

h.impact <- h.impact[order(-h.impact$fat.pct),]
head(h.impact, n=20)
##               type FATALITIES INJURIES    fat.pct    inj.pct cumsum.fat.pct
## 249        TORNADO       1936    30247 17.0407535 38.4162063       17.04075
## 76  EXCESSIVE HEAT       1903     6525 16.7502861  8.2872928       33.79104
## 88           FLOOD       1524     8604 13.4143121 10.9277958       47.20535
## 298           WIND       1145     8382 10.0783382 10.6458373       57.28369
## 121           HEAT        937     2100  8.2475134  2.6671747       65.53120
## 180      LIGHTNING        816     5230  7.1824663  6.6425351       72.71367
## 223    RIP CURRENT        368      232  3.2391515  0.2946593       75.95282
## 9        AVALANCHE        224      170  1.9716574  0.2159141       77.92448
## 248   THUNDERSTORM        210     2479  1.8484288  3.1485362       79.77291
## 300   WINTER STORM        206     1321  1.8132207  1.6777799       81.58613
## 224   RIP CURRENTS        204      297  1.7956166  0.3772147       83.38174
## 126      HEAT WAVE        172      309  1.5139512  0.3924557       84.89570
## 244           SNOW        164     1164  1.4435349  1.4783768       86.33923
## 83    EXTREME COLD        160      231  1.4083267  0.2933892       87.74756
## 156      HURRICANE        135     1328  1.1882757  1.6866705       88.93583
## 219           RAIN        114      305  1.0034328  0.3873754       89.93927
## 16        BLIZZARD        101      805  0.8890062  1.0224170       90.82827
## 143      HIGH SURF        101      152  0.8890062  0.1930526       91.71728
## 84    EXTREME HEAT         96      155  0.8449960  0.1968629       92.56227
## 313           FIRE         90     1608  0.7921838  2.0422938       93.35446
##     cumsum.inj.pct
## 249       38.41621
## 76        68.27713
## 88        49.34400
## 298       59.98984
## 121       80.73538
## 180       74.91967
## 223       97.20836
## 9         97.99200
## 248       78.06820
## 300       90.39309
## 224       96.91370
## 126       96.14911
## 244       91.87147
## 83        97.50175
## 156       88.71531
## 219       96.53648
## 16        92.89388
## 143       98.38191
## 84        98.18886
## 313       85.28609

And for injuries:

h.impact <- h.impact[order(-h.impact$inj.pct),]
head(h.impact, n=20)
##               type FATALITIES INJURIES    fat.pct    inj.pct cumsum.fat.pct
## 249        TORNADO       1936    30247 17.0407535 38.4162063       17.04075
## 88           FLOOD       1524     8604 13.4143121 10.9277958       47.20535
## 298           WIND       1145     8382 10.0783382 10.6458373       57.28369
## 76  EXCESSIVE HEAT       1903     6525 16.7502861  8.2872928       33.79104
## 180      LIGHTNING        816     5230  7.1824663  6.6425351       72.71367
## 248   THUNDERSTORM        210     2479  1.8484288  3.1485362       79.77291
## 121           HEAT        937     2100  8.2475134  2.6671747       65.53120
## 168      ICE STORM         89     1975  0.7833817  2.5084143       94.13784
## 313           FIRE         90     1608  0.7921838  2.0422938       93.35446
## 118           HAIL         20     1372  0.1760408  1.7425541       98.01954
## 156      HURRICANE        135     1328  1.1882757  1.6866705       88.93583
## 300   WINTER STORM        206     1321  1.8132207  1.6777799       81.58613
## 244           SNOW        164     1164  1.4435349  1.4783768       86.33923
## 16        BLIZZARD        101      805  0.8890062  1.0224170       90.82827
## 89             FOG         62      734  0.5457266  0.9322411       94.68357
## 70      DUST STORM         22      440  0.1936449  0.5588366       97.84350
## 303 WINTER WEATHER         33      398  0.2904674  0.5054931       97.15694
## 45       DENSE FOG         18      342  0.1584368  0.4343685       98.35402
## 252 TROPICAL STORM         58      340  0.5105184  0.4318283       95.19409
## 126      HEAT WAVE        172      309  1.5139512  0.3924557       84.89570
##     cumsum.inj.pct
## 249       38.41621
## 88        49.34400
## 298       59.98984
## 76        68.27713
## 180       74.91967
## 248       78.06820
## 121       80.73538
## 168       83.24379
## 313       85.28609
## 118       87.02864
## 156       88.71531
## 300       90.39309
## 244       91.87147
## 16        92.89388
## 89        93.82613
## 70        94.38496
## 303       94.89046
## 45        95.32482
## 252       95.75665
## 126       96.14911

Having analyzed both listings, we decide to stick with the top 20 for presenting the data, since in both cases they account for more than 90% of fatalities/injuries.
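
This coverage can be confirmed from the cumulative columns computed above (a quick check):

# cumulative coverage of the top 20 types for each variable
h.impact[order(-h.impact$fat.pct), "cumsum.fat.pct"][20]  # ~93.35
h.impact[order(-h.impact$inj.pct), "cumsum.inj.pct"][20]  # ~96.15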

We now split our dataset into two different datasets (filtering those top 20), and we can plot them to find out how each type of storm event affects people's health:

fatalities.data <- head(h.impact[order(-h.impact$fat.pct),c("type","fat.pct")],n=20)
injuries.data <- head(h.impact[order(-h.impact$inj.pct),c("type","inj.pct")],n=20)

par(mfrow=c(1,2))
par(mar=c(2,7,3,0))

injuries.data <- injuries.data[order(injuries.data$inj.pct),]
bb <- barplot(injuries.data$inj.pct,names=injuries.data$type,horiz=T,cex.names=.6,cex.axis=.6,las=1)
title("Injuries", line=0.5, cex.main=.8)
text(y=bb[1:19],injuries.data$inj.pct[1:19]+5,labels=sprintf("%.2f%%",injuries.data$inj.pct[1:19]),cex=.6)
text(y=bb[20],injuries.data$inj.pct[20]-5,labels=sprintf("%.2f%%",injuries.data$inj.pct[20]),cex=.6)
title(ylab="Type",line=0,cex.lab=.6)
title(xlab="Percent",line=0.4,cex.lab=.6)

fatalities.data <- fatalities.data[order(fatalities.data$fat.pct),]
barplot(fatalities.data$fat.pct,names=fatalities.data$type,horiz=T,cex.names=.6,cex.axis=.6,las=1)
title("Fatalities", line=0.5, cex.main=.8)
text(y=bb[1:14],fatalities.data$fat.pct[1:14]+2,labels=sprintf("%.2f%%",fatalities.data$fat.pct[1:14]),cex=.6)
text(y=bb[15:20],fatalities.data$fat.pct[15:20]-2,labels=sprintf("%.2f%%",fatalities.data$fat.pct[15:20]),cex=.6)
title(ylab="Type",line=0,cex.lab=.6)
title(xlab="Percent",line=0.4,cex.lab=.6)
mtext("% of total injuries and fatalities by event type, 1986-2011 (Top 20)", side = 1, line = -24.5, outer = TRUE)

From this data we can point to tornados as the main cause of both fatalities and injuries. For injuries, tornados are followed by floods, wind and excessive heat. For fatalities, the same three causes follow tornados but in a different order: excessive heat comes right after tornados, followed by floods and wind. For fatalities, excessive heat is also quite close to tornados, and floods and wind are not as distant in percentage as in the injuries case.