In this document we present an analysis of the NOAA storm database. We are interested in finding out the main causes of (a) economic damage and (b) health hazards across the United States. After our analysis we can affirm that the main sources of economic damage are hurricanes/typhoons, storm surges, floods, tornados and hail, and that the main source of danger to people's health is tornados, followed (in different order, depending on whether we consider injuries or fatalities) by excessive heat, floods and wind. In the case of fatalities, we also see a more uniform distribution among the main causes, while in the case of injuries, tornados are by far the main cause.
Data is obtained from the Coursera Assignment link, although the original source of the database is the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The file is compressed using bzip2 (which R can read directly), so we have downloaded it to our data directory. This data will be used as our input; it was obtained on April 16th at 23:20.
Additional information about this dataset can be found in the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ, both linked from the assignment page.
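For reproducibility, the download step can also be scripted; a minimal sketch (the URL below is a placeholder, not the real assignment link):

# hypothetical download step: replace the placeholder URL with the assignment link
url <- "https://example.com/repdata_data_StormData.csv.bz2"
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/repdata_data_StormData.csv.bz2")) {
    download.file(url, "data/repdata_data_StormData.csv.bz2", mode = "wb")
}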
A careful review of the original .csv file shows that it is a comma-separated values file with a header. A quick search through the file also reveals ? as the NA value. We can use this knowledge to read the csv file (note that we can use read.csv directly, since R is able to uncompress the file on-the-fly):
data <- read.csv("data/repdata_data_StormData.csv.bz2",header=T, sep=",", check.names = T, na.strings = "?")
Looking at the type of analysis we need to perform, we can check for NAs in our columns of interest:
sapply(c("EVTYPE","INJURIES","FATALITIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","BGN_DATE"),function(x) {summary(data[x])})
## $EVTYPE
## EVTYPE
## HAIL :288661
## TSTM WIND :219940
## THUNDERSTORM WIND: 82563
## TORNADO : 60652
## FLASH FLOOD : 54277
## (Other) :196203
## NA's : 1
##
## $INJURIES
## INJURIES
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.1557
## 3rd Qu.: 0.0000
## Max. :1700.0000
##
## $FATALITIES
## FATALITIES
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.0168
## 3rd Qu.: 0.0000
## Max. :583.0000
##
## $PROPDMG
## PROPDMG
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 12.06
## 3rd Qu.: 0.50
## Max. :5000.00
##
## $PROPDMGEXP
## PROPDMGEXP
## :465934
## K :424665
## M : 11330
## 0 : 216
## B : 40
## (Other): 104
## NA's : 8
##
## $CROPDMG
## CROPDMG
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
##
## $CROPDMGEXP
## CROPDMGEXP
## :618413
## K :281832
## M : 1994
## k : 21
## 0 : 19
## (Other): 11
## NA's : 7
##
## $BGN_DATE
## BGN_DATE
## 5/25/2011 0:00:00: 1202
## 4/27/2011 0:00:00: 1193
## 6/9/2011 0:00:00 : 1030
## 5/30/2004 0:00:00: 1016
## 4/4/2011 0:00:00 : 1009
## 4/2/2006 0:00:00 : 981
## (Other) :895866
It looks like there are few NAs in our columns of interest: only one in EVTYPE, 8 in PROPDMGEXP (the exponent/modifier for the property damage value) and 7 in CROPDMGEXP (the analogous modifier for crop damage).
Let's take a look at PROPDMG and PROPDMGEXP:
data[is.na(data$PROPDMGEXP),"PROPDMG"]
## [1] 0 0 0 0 0 0 0 0
Let’s do the same for CROPDMG and CROPDMGEXP:
data[is.na(data$CROPDMGEXP),"CROPDMG"]
## [1] 0 0 0 0 0 0 0
Since the damage value (PROPDMG/CROPDMG) is 0 in every case where PROPDMGEXP/CROPDMGEXP is NA, we can set those exponents to 0 and handle these rows the same way as the rest, without spurious NAs appearing. We can also assign the one record with a NA EVTYPE to an "UNKNOWN" type:
data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- 0
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- 0
levels(data$EVTYPE) <- c(levels(data$EVTYPE),"UNKNOWN")
data$EVTYPE[is.na(data$EVTYPE)] <- "UNKNOWN"
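As a quick verification, we can confirm that no NAs remain in these columns:

# each count should now be 0
sapply(c("EVTYPE","PROPDMGEXP","CROPDMGEXP"), function(x) sum(is.na(data[[x]])))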
Our analysis is based on the type of event (EVTYPE), so we should check what values that column contains.
length(levels(data$EVTYPE))
## [1] 985
As we can see, there are almost 1000 different event types, although the original NOAA documentation defines far fewer. We can do a quick rough cut to see what's going on, since at first glance there appears to be some duplication in event types (non-standardized values):
unique(substr(levels(data$EVTYPE),1,4))
## [1] "ABNO" "ACCU" "AGRI" "APAC" "ASTR" "AVAL" "BEAC" "Beac" "BELO" "BITT"
## [11] "Blac" "BLAC" "BLIZ" "Bliz" "BLOW" "blow" "Blow" "BREA" "BRUS" "COAS"
## [21] "Coas" " COA" "coas" "Cold" "COLD" "COOL" "CSTL" "Dama" "DAMA" "DAM "
## [31] "DEEP" "DENS" "DOWN" "DRIE" "Drif" "DROU" "DROW" "DRY" "DRY " "DRYN"
## [41] "DUST" "Dust" "EARL" "Earl" "Eros" "EXCE" "Exce" "Exte" "Extr" "EXTR"
## [51] "FALL" "FIRS" " FLA" "FLAS" "Floo" "FLOO" "FOG" "FOG " "FORE" "Free"
## [61] "FREE" "Fros" "FROS" "FUNN" "Funn" "Glaz" "GLAZ" "grad" "Grad" "GRAD"
## [71] "GRAS" "GROU" "GUST" "Gust" "HAIL" "Hail" "HARD" "HAZA" "HEAT" "Heat"
## [81] "HEAV" "Heav" "HIGH" "High" " H" "Hot " "HOT/" "HOT " "HURR" "Hurr"
## [91] "HVY " "HYPE" "HYPO" "Hypo" "ICE" "ICE " "Ice " "Ice/" "ICE/" "Ices"
## [101] "Icy " "ICY " "LACK" "Lake" "LAKE" "LAND" "Land" "LARG" "LATE" "Late"
## [111] "LIGH" " LIG" "Ligh" "LIGN" "LOCA" "LOW " "MAJO" "Mari" "MARI" "Metr"
## [121] "Micr" "MICR" "Mild" "MILD" "MINO" "Mino" "MIXE" "Mixe" "MODE" "MONT"
## [131] "Mont" "Moun" "MUD/" "Muds" "MUDS" "MUD " "NEAR" "NONE" "NON " "NON-"
## [141] "NORM" "NORT" "No S" "Othe" "OTHE" "PATC" "Prol" "PROL" "RAIN" "Rain"
## [151] "RAPI" "Reco" "RECO" "RED " "REMN" "RIP " "RIVE" "Rive" "ROCK" "ROGU"
## [161] "ROTA" "ROUG" "RURA" "Saha" "SAHA" "Seas" "SEIC" "SEVE" "SLEE" "smal"
## [171] "Smal" "SMAL" "Sml " "SMOK" "Snow" "SNOW" "SOUT" "STOR" "STRE" "Stro"
## [181] "STRO" "Summ" "SUMM" "Temp" "THUD" "THUN" "Thun" "TIDA" "Tida" "TORN"
## [191] "TORR" "Torr" "TROP" "TSTM" "Tstm" " TST" "TSUN" "TUND" "TYPH" "Unse"
## [201] "UNSE" "UNUS" "URBA" "Urba" "VERY" "VOG" "Volc" "VOLC" "WAKE" "WALL"
## [211] "WARM" "WATE" " WAT" "WAYT" "wet " "WET " "Wet " "Whir" "WHIR" "WILD"
## [221] "Wind" "WIND" " WIN" "WINT" "Wint" "WND" "UNKN"
From here, we can extract some additional information:
- Some event types have leading spaces in their names; we need to fix them.
- There are some events whose type starts with "summ". Those correspond (after reviewing the data) to "summary" entries that point to other (following) entries, so we can remove them.
- There are many different names for the same event type. For example:
grep("^urban.+small",levels(data$EVTYPE), ignore.case=T, value=T)
## [1] "URBAN AND SMALL" "URBAN AND SMALL STREAM"
## [3] "URBAN AND SMALL STREAM FLOOD" "URBAN AND SMALL STREAM FLOODIN"
## [5] "URBAN SMALL" "URBAN/SMALL"
## [7] "URBAN/SMALL FLOODING" "URBAN/SMALL STREAM"
## [9] "URBAN SMALL STREAM FLOOD" "URBAN/SMALL STREAM FLOOD"
## [11] "URBAN/SMALL STREAM FLOOD" "URBAN/SMALL STREAM FLOODING"
## [13] "URBAN/SMALL STRM FLDG"
In order to fix this, we can run a series of transformations on EVTYPE to unify types. The following transformations have been decided after reviewing the raw list of EVTYPE values:
# drop summaries
data <- data[!grepl("^summary",data$EVTYPE,ignore.case=T),]
# remove leading/trailing spaces
data$EVTYPE <- as.factor(trimws(data$EVTYPE,"b"))
# create a new column for the simplified event type; this also lets us keep the original event type in case we want to perform some additional exploration
data$type <- data$EVTYPE
These are the basic filters for EVTYPE; now we can apply some more transformations:
levels(data$type) <- c(levels(data$type),"HAIL","SNOW","FLOOD","RAIN","TORNADO","UNSEASON","THUNDERSTORM","RECORD","FIRE")
data$type[grepl("wind|wins$|w inds|win$",data$EVTYPE,ignore.case = T)] <- as.factor("WIND")
data$type[grepl("hurricane",data$EVTYPE,ignore.case = T)] <- as.factor("HURRICANE")
data$type[grepl("hail",data$EVTYPE,ignore.case = T)] <- as.factor("HAIL")
data$type[grepl("snow",data$EVTYPE,ignore.case = T)] <- as.factor("SNOW")
data$type[grepl("flood",data$EVTYPE,ignore.case = T)] <- as.factor("FLOOD")
data$type[grepl("rain",data$EVTYPE,ignore.case = T)] <- as.factor("RAIN")
data$type[grepl("tornado",data$EVTYPE,ignore.case = T)] <- as.factor("TORNADO")
data$type[grepl("^unseason",data$EVTYPE,ignore.case = T)] <- as.factor("UNSEASON")
data$type[grepl("thunderstorm",data$EVTYPE,ignore.case = T)] <- as.factor("THUNDERSTORM")
data$type[grepl("record",data$EVTYPE,ignore.case = T)] <- as.factor("RECORD")
data$type[grepl("fire",data$EVTYPE,ignore.case = T)] <- as.factor("FIRE")
data$type <- droplevels(data$type)
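To gauge the effect of this consolidation, we can count the remaining event types (a quick check; we do not rely on the exact number later):

# distinct event types after unification; should be well below the original 985
length(levels(data$type))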
Additionally, we can add the year as a separate column, to perform some additional data checking if needed:
data$year <- as.numeric(strftime(as.Date(data$BGN_DATE,"%m/%d/%Y"),"%Y"))
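A quick sanity check on the date parsing (the dataset spans 1950 to 2011, as the later summary confirms):

# expect 1950 and 2011; NAs here would indicate parsing problems
range(data$year)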
Once we have the type column fixed and the year column added, we can work on the PROPDMG and PROPDMGEXP columns. The first is a number quantifying the property damage; the second is an exponent (modifier). We can find out which modifiers are there. For property damages:
levels(data$PROPDMGEXP)
## [1] "" "-" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
And for crop damages:
levels(data$CROPDMGEXP)
## [1] "" "0" "2" "B" "k" "K" "m" "M"
Most of these are reasonable "exponents" (the empty string, 0, the digits 1..8, H/h = 100, K = 1000, M/m = 10^6). These assumptions have been checked by selecting and reading some records from the dataset corresponding to those exponents; for example, the ones with M correspond to big events, such as tornados, hurricanes, etc.
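The kind of spot check used for those assumptions looks like this (a sketch; the selected columns are just illustrative):

# read a few records with the "M"/"m" exponent to see whether they plausibly
# describe million-dollar events
head(data[data$PROPDMGEXP %in% c("m","M"), c("BGN_DATE","EVTYPE","PROPDMG","PROPDMGEXP")])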
There's also an important one to check: B, which presumably refers to "billions". Since that's a very large multiplier, we can check the rows involved:
data[data$PROPDMGEXP=="B",c("year","PROPDMG","type","REFNUM")]
## year PROPDMG type REFNUM
## 187564 1993 5.00 WINTER STORM 187564
## 187566 1995 0.10 HURRICANE 187566
## 195000 1995 2.10 HURRICANE 194932
## 195001 1993 1.60 TORNADO 194933
## 195007 1995 1.00 HURRICANE 194939
## 198389 1993 5.00 FLOOD 198375
## 207175 1995 2.50 RAIN 207124
## 243445 1995 1.20 THUNDERSTORM 243394
## 298088 1997 3.00 FLOOD 298057
## 347872 1998 1.70 HURRICANE 347811
## 366694 1999 3.00 HURRICANE 366653
## 397334 2000 1.50 FIRE 398999
## 443782 2001 5.15 TROPICAL STORM 444407
## 485577 2003 1.00 FLOOD 485535
## 488045 2003 1.04 FIRE 488004
## 525145 2004 2.50 HURRICANE 525145
## 529351 2004 5.42 HURRICANE 529299
## 529363 2004 1.30 WIND 529311
## 529436 2004 4.83 HURRICANE 529384
## 529498 2004 4.00 HURRICANE 529446
## 565003 2005 1.00 HURRICANE 564962
## 569065 2005 1.50 HURRICANE 569065
## 569308 2005 10.00 HURRICANE 569288
## 577675 2005 16.93 HURRICANE 577615
## 577676 2005 31.30 STORM SURGE 577616
## 577683 2005 4.00 HURRICANE 577623
## 581533 2005 7.35 HURRICANE 581533
## 581535 2005 11.26 STORM SURGE 581535
## 581537 2005 5.88 HURRICANE 581537
## 598502 2005 2.09 HURRICANE 598472
## 605953 2006 115.00 FLOOD 605943
## 739575 2008 1.00 HURRICANE 739514
## 739576 2008 4.00 STORM SURGE/TIDE 739515
## 808257 2010 1.50 FLOOD 808257
## 834674 2010 1.80 HAIL 834634
## 856214 2011 1.00 TORNADO 859151
## 860386 2011 1.50 TORNADO 860355
## 862634 2011 2.80 TORNADO 862563
## 867749 2011 1.00 FLOOD 867679
## 868046 2011 2.00 FLOOD 867996
One row catches our attention: it says that in 2006 there was a flood that caused 115 billion dollars in damages. That number is 4x the highest value in the rest of this set (which actually corresponds to Hurricane Katrina), which suggests the data might not be right. Just to be sure, since it looks like a really big event, I googled for the original report, which can be found in Google Books. The original report says this flood caused 115 million dollars in damages. We can fix this:
data$PROPDMG[data$REFNUM==605943] <- 115
data$PROPDMGEXP[data$REFNUM==605943] <- "M"
data[data$PROPDMGEXP=="B",c("year","PROPDMG","PROPDMGEXP","type","REFNUM")]
## year PROPDMG PROPDMGEXP type REFNUM
## 187564 1993 5.00 B WINTER STORM 187564
## 187566 1995 0.10 B HURRICANE 187566
## 195000 1995 2.10 B HURRICANE 194932
## 195001 1993 1.60 B TORNADO 194933
## 195007 1995 1.00 B HURRICANE 194939
## 198389 1993 5.00 B FLOOD 198375
## 207175 1995 2.50 B RAIN 207124
## 243445 1995 1.20 B THUNDERSTORM 243394
## 298088 1997 3.00 B FLOOD 298057
## 347872 1998 1.70 B HURRICANE 347811
## 366694 1999 3.00 B HURRICANE 366653
## 397334 2000 1.50 B FIRE 398999
## 443782 2001 5.15 B TROPICAL STORM 444407
## 485577 2003 1.00 B FLOOD 485535
## 488045 2003 1.04 B FIRE 488004
## 525145 2004 2.50 B HURRICANE 525145
## 529351 2004 5.42 B HURRICANE 529299
## 529363 2004 1.30 B WIND 529311
## 529436 2004 4.83 B HURRICANE 529384
## 529498 2004 4.00 B HURRICANE 529446
## 565003 2005 1.00 B HURRICANE 564962
## 569065 2005 1.50 B HURRICANE 569065
## 569308 2005 10.00 B HURRICANE 569288
## 577675 2005 16.93 B HURRICANE 577615
## 577676 2005 31.30 B STORM SURGE 577616
## 577683 2005 4.00 B HURRICANE 577623
## 581533 2005 7.35 B HURRICANE 581533
## 581535 2005 11.26 B STORM SURGE 581535
## 581537 2005 5.88 B HURRICANE 581537
## 598502 2005 2.09 B HURRICANE 598472
## 739575 2008 1.00 B HURRICANE 739514
## 739576 2008 4.00 B STORM SURGE/TIDE 739515
## 808257 2010 1.50 B FLOOD 808257
## 834674 2010 1.80 B HAIL 834634
## 856214 2011 1.00 B TORNADO 859151
## 860386 2011 1.50 B TORNADO 860355
## 862634 2011 2.80 B TORNADO 862563
## 867749 2011 1.00 B FLOOD 867679
## 868046 2011 2.00 B FLOOD 867996
We can also run a similar check for crop damages:
data[data$CROPDMGEXP=="B",c("CROPDMG","type","REFNUM")]
## CROPDMG type REFNUM
## 188633 0.40 HEAT 188633
## 198389 5.00 FLOOD 198375
## 199733 0.50 DROUGHT 201230
## 201256 0.20 FREEZE 201222
## 211900 5.00 ICE STORM 211887
## 581537 1.51 HURRICANE 581537
## 639347 1.00 DROUGHT 639314
## 899222 0.00 DROUGHT 899212
## 899608 0.00 DROUGHT 899608
In this case there seems to be no obvious outlier/error, except a "0" crop damage value in one of them (which doesn't seem reasonable, given that "B" is indicated as the exponent). After checking the corresponding record, there is no additional information to help us, so we decide to keep it as it is.
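For reference, that check can be done by reading the free-text REMARKS field of the affected rows (a minimal sketch; REMARKS holds the event narrative in the raw file):

# narratives for the "B" crop damage rows with a 0 value
data[data$CROPDMGEXP == "B" & data$CROPDMG == 0, c("CROPDMG","type","REMARKS")]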
Once we have decided how to deal with the exponents, we can add a new column with the expanded economic damage for each record, based on the exponents/modifiers, adding property damages and crop damages into a single value:
# Create a mapping table of the previously observed exponents/modifiers
# "" "-" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
# nothing to do for "", "-", "+" and "0"
# for the rest, we set up a dataframe with multipliers
exps <- data.frame(l=levels(data$PROPDMGEXP),factor=c(1,1,1,1,1,2,3,4,5,6,7,8,1000000000,100,100,1000,1000000,1000000))
# we can perform a join using PROPDMGEXP==l to add a column with the multiplier for property damages
data <- merge(data,exps,by.x = "PROPDMGEXP", by.y="l")
data$dollars <- data$PROPDMG * data$factor
data$factor <- NULL
# We can replicate this operation for the multiplier for crops damages
data <- merge(data,exps,by.x = "CROPDMGEXP", by.y="l")
# and now compute the sum of damages
data$dollars <- data$dollars + data$CROPDMG * data$factor
data$factor <- NULL
# now "dollars" holds the raw value in dollars of the damages
Before proceeding to the analysis, we can check again for NAs in our columns of interest:
sapply(c("EVTYPE","INJURIES","FATALITIES","dollars","year"),function(x) {summary(data[x])})
## $EVTYPE
## EVTYPE
## HAIL :288646
## TSTM WIND :219944
## THUNDERSTORM WIND: 82563
## TORNADO : 60652
## FLASH FLOOD : 54278
## FLOOD : 25326
## (Other) :170792
##
## $INJURIES
## INJURIES
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.1558
## 3rd Qu.: 0.0000
## Max. :1700.0000
##
## $FATALITIES
## FATALITIES
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.0168
## 3rd Qu.: 0.0000
## Max. :583.0000
##
## $dollars
## dollars
## Min. :0.000e+00
## 1st Qu.:0.000e+00
## Median :0.000e+00
## Mean :4.007e+05
## 3rd Qu.:1.000e+03
## Max. :3.130e+10
##
## $year
## year
## Min. :1950
## 1st Qu.:1995
## Median :2002
## Mean :1999
## 3rd Qu.:2007
## Max. :2011
We can see that there are no NAs, so no further action is required.
First we have to decide which data we want to analyse. The dataset contains data from 1950 to 2011. Since, I assume, we want to consider just "recent" data, it may make little sense to include data from the fifties, because (a) the damages (to population and economy) may no longer be representative, since new safety regulations (for buildings, for example) may have been adopted since then, and (b) inflation: treating a dollar from the fifties as equal in value to a dollar today doesn't make much sense. For these reasons, we decide to focus only on the last 25 years of data (that is, 1986-2011).
Since this decision may have impact in the analysis, we can make some plots just to see what data we are planning to remove:
par(mfcol=c(3,1))
par(mar=c(4,4,1,0))
plot(aggregate(FATALITIES ~ year,data,sum),type="l", main="Fatalities, injuries and damages 1950-2011")
plot(aggregate(INJURIES ~ year,data,sum),type="l")
plot(aggregate(dollars ~ year,data,sum),type="l")
From the plots, we can see that the economic impact values do not seem accurate or substantial until the nineties. The effects on people (fatalities, injuries) do not show such a clear difference, although we can see some in the "fatalities" plot. Overall, the differences seem enough to keep our decision of taking only the last 25 years of data.
Having decided this, we can filter our dataset (note that this is a subjective decision based on the rationale exposed above):
data <- data[data$year >= 1986,]
Once we have performed all the necessary transformations in our data, we can proceed to answer the questions we’re interested in.
In order to find out which events are the ones with greatest consequences on economy, we can aggregate data about damages in both properties and crops (that value is already in the new dollars column) grouping by type.
eco.impact <- aggregate(dollars ~ type,data,sum)
head(eco.impact)
## type dollars
## 1 ABNORMALLY DRY 0
## 2 ABNORMALLY WET 0
## 3 ABNORMAL WARMTH 0
## 4 AGRICULTURAL FREEZE 28820000
## 5 APACHE COUNTY 5000
## 6 ASTRONOMICAL HIGH TIDE 9425000
Then, we compute the percent of damage caused by every type of event:
eco.impact$pct <- (eco.impact$dollars / sum(eco.impact$dollars))*100
head(eco.impact)
## type dollars pct
## 1 ABNORMALLY DRY 0 0.000000e+00
## 2 ABNORMALLY WET 0 0.000000e+00
## 3 ABNORMAL WARMTH 0 0.000000e+00
## 4 AGRICULTURAL FREEZE 28820000 8.491820e-03
## 5 APACHE COUNTY 5000 1.473251e-06
## 6 ASTRONOMICAL HIGH TIDE 9425000 2.777078e-03
Now we can compute the accumulated impact. We'll start with the event type with the highest impact (highest percent) and accumulate percentages as we go:
eco.impact <- eco.impact[order(-eco.impact$pct),]
eco.impact$cumsum.pct <- cumsum(eco.impact$pct)
Let's see how many significant (economic impact > 0) rows we have:
eco.impact <- eco.impact[eco.impact$dollars > 0,]
dim(eco.impact)
## [1] 142 4
Now we can focus on the 142 rows with economic impact. As we have computed the cumulative impact of all of them (ordered by percent of impact), we can establish a cut point at the top "n" events. Let's see the top 20:
head(eco.impact,n=20)
## type dollars pct cumsum.pct
## 156 HURRICANE 90271472810 26.5985105 26.59851
## 88 FLOOD 64895658548 19.1215209 45.72003
## 246 STORM SURGE 43323541000 12.7652914 58.48532
## 249 TORNADO 36858930778 10.8604925 69.34582
## 118 HAIL 19129871036 5.6366210 74.98244
## 49 DROUGHT 15018672000 4.4252552 79.40769
## 298 WIND 12092454868 3.5630446 82.97074
## 168 ICE STORM 8967041360 2.6421408 85.61288
## 313 FIRE 8904910130 2.6238338 88.23671
## 252 TROPICAL STORM 8382236550 2.4698280 90.70654
## 248 THUNDERSTORM 7085466181 2.0877343 92.79427
## 300 WINTER STORM 6715441251 1.9787063 94.77298
## 247 STORM SURGE/TIDE 4642038000 1.3677776 96.14076
## 219 RAIN 4179545492 1.2315041 97.37226
## 83 EXTREME COLD 1360710400 0.4009336 97.77319
## 244 SNOW 1151782858 0.3393731 98.11257
## 103 FROST/FREEZE 1103566000 0.3251660 98.43773
## 180 LIGHTNING 940751606 0.2771927 98.71493
## 16 BLIZZARD 771273950 0.2272560 98.94218
## 261 TYPHOON 601055000 0.1771010 99.11928
From this data, we can decide to focus on the top-14 events, since they account for 97% of damages and contain all events with more than 1% of impact.
top.impact <- head(eco.impact,n=14)
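A quick check (using the cumulative column computed earlier) confirms that these 14 events cover about 97% of the damages:

# cumulative percentage covered by the selected top 14
tail(top.impact$cumsum.pct, n=1)  # ~97.37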
Let's plot this data (reversing the order by percent, since that is our current data frame order and we want the top contributors at the top) to see how each top event contributes to the total economic damage.
par(mfrow=c(1,1))
par(mar=c(3,7,3,1))
top.impact <- top.impact[order(top.impact$pct),]
bb <- barplot(top.impact$pct,names.arg=top.impact$type,horiz=T,cex.names=0.60,cex.axis=.6,las=1,main="Top % of total economic damage by event type, 1986-2011",xlab="Percent")
title(ylab="Type",line=0,cex.lab=.6)
title(xlab="Percent",line=0.4,cex.lab=.6)
text(y=bb[1:10],top.impact$pct[1:10]+1,labels=sprintf("%.2f%%",top.impact$pct[1:10]),cex=.6)
text(y=bb[11:14],top.impact$pct[11:14]-1,labels=sprintf("%.2f%%",top.impact$pct[11:14]),cex=.6)
Looking at that plot, we can conclude that there are five main causes of economic damage in the period 1986-2011:
rev(tail(top.impact$type,n=5))
## [1] HURRICANE FLOOD STORM SURGE TORNADO HAIL
## 313 Levels: ABNORMALLY DRY ABNORMALLY WET ... FIRE
Together, these five types of events caused 74.98% of all damages in the period 1986-2011.
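That figure can be read directly from the cumulative column:

# cumulative percentage of the five largest contributors
eco.impact$cumsum.pct[5]  # ~74.98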
When analyzing the data to find the events most harmful with respect to population health, we can consider two variables: injuries and fatalities. Since there doesn't seem to be a way to unify them into a single variable, we replicate the same steps as for the economic loss, but for the two variables separately.
First we aggregate them by type:
h.impact <- aggregate(cbind(FATALITIES,INJURIES) ~ type, data, sum)
Now we add a percent column computing how much each type contributes to the total of each variable:
h.impact$fat.pct <- (h.impact$FATALITIES / sum(h.impact$FATALITIES))*100
h.impact$inj.pct <- (h.impact$INJURIES / sum(h.impact$INJURIES))*100
Once we have these new columns, we can compute the cumulative sum of those percentages to establish a cut point:
h.impact <- h.impact[order(-h.impact$fat.pct),]
h.impact$cumsum.fat.pct <- cumsum(h.impact$fat.pct)
h.impact <- h.impact[order(-h.impact$inj.pct),]
h.impact$cumsum.inj.pct <- cumsum(h.impact$inj.pct)
Now we can look at the first 20 values to decide where to establish our cut point. First for fatalities:
h.impact <- h.impact[order(-h.impact$fat.pct),]
head(h.impact, n=20)
## type FATALITIES INJURIES fat.pct inj.pct cumsum.fat.pct
## 249 TORNADO 1936 30247 17.0407535 38.4162063 17.04075
## 76 EXCESSIVE HEAT 1903 6525 16.7502861 8.2872928 33.79104
## 88 FLOOD 1524 8604 13.4143121 10.9277958 47.20535
## 298 WIND 1145 8382 10.0783382 10.6458373 57.28369
## 121 HEAT 937 2100 8.2475134 2.6671747 65.53120
## 180 LIGHTNING 816 5230 7.1824663 6.6425351 72.71367
## 223 RIP CURRENT 368 232 3.2391515 0.2946593 75.95282
## 9 AVALANCHE 224 170 1.9716574 0.2159141 77.92448
## 248 THUNDERSTORM 210 2479 1.8484288 3.1485362 79.77291
## 300 WINTER STORM 206 1321 1.8132207 1.6777799 81.58613
## 224 RIP CURRENTS 204 297 1.7956166 0.3772147 83.38174
## 126 HEAT WAVE 172 309 1.5139512 0.3924557 84.89570
## 244 SNOW 164 1164 1.4435349 1.4783768 86.33923
## 83 EXTREME COLD 160 231 1.4083267 0.2933892 87.74756
## 156 HURRICANE 135 1328 1.1882757 1.6866705 88.93583
## 219 RAIN 114 305 1.0034328 0.3873754 89.93927
## 16 BLIZZARD 101 805 0.8890062 1.0224170 90.82827
## 143 HIGH SURF 101 152 0.8890062 0.1930526 91.71728
## 84 EXTREME HEAT 96 155 0.8449960 0.1968629 92.56227
## 313 FIRE 90 1608 0.7921838 2.0422938 93.35446
## cumsum.inj.pct
## 249 38.41621
## 76 68.27713
## 88 49.34400
## 298 59.98984
## 121 80.73538
## 180 74.91967
## 223 97.20836
## 9 97.99200
## 248 78.06820
## 300 90.39309
## 224 96.91370
## 126 96.14911
## 244 91.87147
## 83 97.50175
## 156 88.71531
## 219 96.53648
## 16 92.89388
## 143 98.38191
## 84 98.18886
## 313 85.28609
And for injuries:
h.impact <- h.impact[order(-h.impact$inj.pct),]
head(h.impact, n=20)
## type FATALITIES INJURIES fat.pct inj.pct cumsum.fat.pct
## 249 TORNADO 1936 30247 17.0407535 38.4162063 17.04075
## 88 FLOOD 1524 8604 13.4143121 10.9277958 47.20535
## 298 WIND 1145 8382 10.0783382 10.6458373 57.28369
## 76 EXCESSIVE HEAT 1903 6525 16.7502861 8.2872928 33.79104
## 180 LIGHTNING 816 5230 7.1824663 6.6425351 72.71367
## 248 THUNDERSTORM 210 2479 1.8484288 3.1485362 79.77291
## 121 HEAT 937 2100 8.2475134 2.6671747 65.53120
## 168 ICE STORM 89 1975 0.7833817 2.5084143 94.13784
## 313 FIRE 90 1608 0.7921838 2.0422938 93.35446
## 118 HAIL 20 1372 0.1760408 1.7425541 98.01954
## 156 HURRICANE 135 1328 1.1882757 1.6866705 88.93583
## 300 WINTER STORM 206 1321 1.8132207 1.6777799 81.58613
## 244 SNOW 164 1164 1.4435349 1.4783768 86.33923
## 16 BLIZZARD 101 805 0.8890062 1.0224170 90.82827
## 89 FOG 62 734 0.5457266 0.9322411 94.68357
## 70 DUST STORM 22 440 0.1936449 0.5588366 97.84350
## 303 WINTER WEATHER 33 398 0.2904674 0.5054931 97.15694
## 45 DENSE FOG 18 342 0.1584368 0.4343685 98.35402
## 252 TROPICAL STORM 58 340 0.5105184 0.4318283 95.19409
## 126 HEAT WAVE 172 309 1.5139512 0.3924557 84.89570
## cumsum.inj.pct
## 249 38.41621
## 88 49.34400
## 298 59.98984
## 76 68.27713
## 180 74.91967
## 248 78.06820
## 121 80.73538
## 168 83.24379
## 313 85.28609
## 118 87.02864
## 156 88.71531
## 300 90.39309
## 244 91.87147
## 16 92.89388
## 89 93.82613
## 70 94.38496
## 303 94.89046
## 45 95.32482
## 252 95.75665
## 126 96.14911
Having analyzed both listings, we decide to stick with the top 20 for presenting the data, since in both cases they account for more than 90% of injuries/fatalities.
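A quick check of that coverage, computed independently of the current row order:

# cumulative share of the top 20 for each variable
sum(head(sort(h.impact$fat.pct, decreasing=T), 20))  # ~93.35
sum(head(sort(h.impact$inj.pct, decreasing=T), 20))  # ~96.15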
We now split our dataset into two (filtering to those top 20) and plot them to find out how each type of storm affects people's health:
fatalities.data <- head(h.impact[order(-h.impact$fat.pct),c("type","fat.pct")],n=20)
injuries.data <- head(h.impact[order(-h.impact$inj.pct),c("type","inj.pct")],n=20)
par(mfrow=c(1,2))
par(mar=c(2,7,3,0))
injuries.data <- injuries.data[order(injuries.data$inj.pct),]
bb <- barplot(injuries.data$inj.pct,names=injuries.data$type,horiz=T,cex.names=.6,cex.axis=.6,las=1)
title("Injuries", line=0.5, cex.main=.8)
text(y=bb[1:19],injuries.data$inj.pct[1:19]+5,labels=sprintf("%.2f%%",injuries.data$inj.pct[1:19]),cex=.6)
text(y=bb[20],injuries.data$inj.pct[20]-5,labels=sprintf("%.2f%%",injuries.data$inj.pct[20]),cex=.6)
title(ylab="Type",line=0,cex.lab=.6)
title(xlab="Percent",line=0.4,cex.lab=.6)
fatalities.data <- fatalities.data[order(fatalities.data$fat.pct),]
bb <- barplot(fatalities.data$fat.pct,names=fatalities.data$type,horiz=T,cex.names=.6,cex.axis=.6,las=1)
title("Fatalities", line=0.5, cex.main=.8)
text(y=bb[1:14],fatalities.data$fat.pct[1:14]+2,labels=sprintf("%.2f%%",fatalities.data$fat.pct[1:14]),cex=.6)
text(y=bb[15:20],fatalities.data$fat.pct[15:20]-2,labels=sprintf("%.2f%%",fatalities.data$fat.pct[15:20]),cex=.6)
title(ylab="Type",line=0,cex.lab=.6)
title(xlab="Percent",line=0.4,cex.lab=.6)
mtext("% of total injuries and fatalities by event type, 1986-2011 (Top 20)", side = 1, line = -24.5, outer = TRUE)
From this data we can point to tornados as the main cause of both fatalities and injuries. For injuries, tornados are followed by floods, wind and excessive heat. For fatalities, the same three causes follow tornados, but in a different order: excessive heat comes right after tornados, followed by floods and wind. Also, for fatalities, excessive heat is quite close to tornados, and floods and wind are not as distant in percentage as in the injuries case.