title: “Storm data analysis” |
author: “I. Llatas” |
date: “February 28, 2016” |
output: html_document |
In this work we analize the NOAA Storm Database to answer which meteorological events are the more dangerous, both in human and economical losses. We conclude that by far, tornados are the most harmful of all events measured in human lives (fatalites and injuries) followed by heat and flood events. Also, the latter events causes the greatest accumulated monetary loss when the property and crop damages are combined. These findings must be corroborated by doing a deep cleaning of the data.
Severe weather events, such as tornados or extreme rainfall can cause losses in lifes and livelihoods when they occur. There is a growing body of evidence that this events will be even more extreme in the near future, due in part to the global climate change. This has prompted new research on how to deal with the risks [1],on the base that: To know is to be prepared. In this work it is examined the NOAA Storm Database to answer which types of events were the most harmful in the United States, during the time period from 1950 to Nov. 2011. The analysis will focus in four specific measurements of loss:
Human Health
+ Fatalities
+ Injuries
Economic Loss
+ Property Damage
+ Crop Damage
The data file was downloaded on Feb 21, 2015 from the URL shown below. Beaware, it is a 46.9 MB file and it takes time to download and to read in R. Code bellow check if these steps have already been taken. Here is the code to read the data and save a working environment for easy access.
# Prepare R, set libraries to be used
library(xtable, quietly = TRUE, warn.conflicts = FALSE)
library(lattice, quietly = TRUE, warn.conflicts = FALSE)
library(pander, quietly = TRUE, warn.conflicts = FALSE)
library(scales, quietly = TRUE, warn.conflicts = FALSE)
# Check for data files and read/load in R
if (!file.exists("storm.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"storm.csv.bz2", mode = "wb")
storm <- read.csv("storm.csv.bz2")
save(storm, file = "storm.RData")
} else if (file.exists("storm.RData")) {
load("storm.RData")
} else {
stop("No data file in this folder, please check!")
}
The storm data frame has 37 variables on 902297 events. The columns relevant for the analysis are named: STATE,EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
After looking at the data, using the str function, one can see that the classification variable EVTYPE has 985 levels and the identification variable STATE has 72 abbreviation names. To address this, one must make some pre-processing of the data. One of the problems is the use of lowercase and upper case letters, that can be easily solved in R, but there are some compound event names that correspond to different meteorological that happened at the same time; for example there are 73 events that start with the word “[Ss]ummary”.To change these names to match the **NOAA* directives for event reporting [2], requires a lot of data cleansing. Since this is a homework for reproducible research, the data will be used as is. This would mean that the results are underestimating the reported total for a given event, and it should be investigated if necessary.
Another issue happens with the economic damage data. One could assume is in nominal US dollars, i.e they are not corrected for inflation, so it is not wise to simply sum the amounts to obtain a number for total loss. However, it escapes the scope of this work to change the monetary damage into **real* US dollars, so it won’t be done. But one thing that is needed is to combine the columns DMG and DMGEXP to get the proper order of magnitude for the monetary loss.
In the docummentation for this problem it is stated that DMGEXP should be read as:
[] Multiply by 1
[hH] Multiply by 100
[kK] Multiply by 1,000
[mM] Multiply by 1,000,000
[bB] Multiply by 1,000,000,000
It is worth noticing that the DMGEXP has characters that are not in the list shown above, and those records will be treated as if the multiplier for the losses were equal to one.
# Data in dataframe <storm>
use.col <- c("STATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
storm.red<- storm[,use.col]
storm.red$EVTYPE<-toupper(storm.red$EVTYPE) #Use Uppercase characters
storm.red$PROPDMGEXP<-toupper(storm.red$PROPDMGEXP)
storm.red$CROPDMGEXP<-toupper(storm.red$CROPDMGEXP)
# Make numeric columms of dmg fix for magnitude
attach(storm.red)
propdmg<-PROPDMG
propdmg[PROPDMGEXP=="H"]<-propdmg[PROPDMGEXP=="H"]*100
propdmg[PROPDMGEXP=="K"]<-propdmg[PROPDMGEXP=="K"]*1000
propdmg[PROPDMGEXP=="M"]<-propdmg[PROPDMGEXP=="M"]*1000000
propdmg[PROPDMGEXP=="B"]<-propdmg[PROPDMGEXP=="B"]*1000000000
cropdmg<-CROPDMG
cropdmg[CROPDMGEXP=="H"]<-cropdmg[CROPDMGEXP=="H"]*100
cropdmg[CROPDMGEXP=="K"]<-cropdmg[CROPDMGEXP=="K"]*1000
cropdmg[CROPDMGEXP=="M"]<-cropdmg[CROPDMGEXP=="M"]*1000000
cropdmg[CROPDMGEXP=="B"]<-cropdmg[CROPDMGEXP=="B"]*1000000000
storm.red$propdmg <-propdmg
storm.red$cropdmg <-cropdmg
detach(2)
Note that all of the results shown along with the text are the result of inline code, (surrounding code with justs one back ticks and the letter r), but the code is not shown in the knitted output.
In total there are 6,974 events (0.77 %) that report one or more fatalities, for a total of 15145 deaths and 17604 events (1.95 %) that report one or more injuries, for a total of 140528. To find out which events are the most dangereous for humans one can look at how many fatalities or injuries are for each type of events. In the table below it is shown the top 15 events.
#fatalities
x1<-with(storm.red,aggregate(FATALITIES, list(EVTYPE), sum))
y1<-x1[x1[,2]>0,] # USe only those with fatalities
y1<-y1[order(y1[,2],decreasing=TRUE),]
#injuries
x2<-with(storm.red,aggregate(INJURIES, list(EVTYPE), sum))
y2<-x2[x2[,2]>0,]
y2<-y2[order(y2[,2],decreasing=TRUE),]
top.events<-cbind(y1[1:15,],y2[1:15,])
colnames(top.events)<-c("Hazard; Fatalities","Victims", "Hazard; Injuries","Victims")
rownames(top.events)<-NULL
pander(top.events, caption="Top events affecting humans in the USA, period 1950-2011. Left correspond to fatality victims, right to injuries")
Hazard; Fatalities | Victims | Hazard; Injuries | Victims |
---|---|---|---|
TORNADO | 5633 | TORNADO | 91346 |
EXCESSIVE HEAT | 1903 | TSTM WIND | 6957 |
FLASH FLOOD | 978 | FLOOD | 6789 |
HEAT | 937 | EXCESSIVE HEAT | 6525 |
LIGHTNING | 816 | LIGHTNING | 5230 |
TSTM WIND | 504 | HEAT | 2100 |
FLOOD | 470 | ICE STORM | 1975 |
RIP CURRENT | 368 | FLASH FLOOD | 1777 |
HIGH WIND | 248 | THUNDERSTORM WIND | 1488 |
AVALANCHE | 224 | HAIL | 1361 |
WINTER STORM | 206 | WINTER STORM | 1321 |
RIP CURRENTS | 204 | HURRICANE/TYPHOON | 1275 |
HEAT WAVE | 172 | HIGH WIND | 1137 |
EXTREME COLD | 162 | HEAVY SNOW | 1021 |
THUNDERSTORM WIND | 133 | WILDFIRE | 911 |
Notice that some of the rows of the table correspond to the same type of weather condition, for example “EXCESSIVE HEAT” and “HEAT”. This must be changed for a better understanding of the hazards.
As can be seen in the table above, tornados are the main cause of injuries and fatalities in USA. Notice that the second cause of mortality is excesive heat and it can be seen that more than half of deaths occured in one event, located in Chicago IL.
To see which states are most affected by the top conditions we looked at the following plot, where the number of fatalities caused by the five top events are plotted by states.
Notice that for the plot, only the records with a propper abbreviate state name are used.
top.events.death<-top.events[1:5,1]
bystate<-with(storm.red,aggregate(FATALITIES,list(STATE,EVTYPE),sum))
colnames(bystate)<-c("STATE","EVTYPE","FATALITIES")
bas<-bystate[(bystate$STATE %in% state.abb)&(bystate$EVTYPE %in% top.events.death),]
barchart(FATALITIES~EVTYPE|STATE,data=bas,layout=c(5,4),scales=list(x=list(rot=45)))
Notice that some states are “safer” than others.
We proceed with the economic losses by looking at the top events.
# property
x1 <- with(storm.red, aggregate(propdmg, list(EVTYPE), sum))
y1 <- x1[x1[, 2] > 0, ] # USe only those with damage
y1 <- y1[order(y1[, 2], decreasing = TRUE), ]
# crops
x2 <- with(storm.red, aggregate(cropdmg, list(EVTYPE), sum))
y2 <- x2[x2[, 2] > 0, ]
y2 <- y2[order(y2[, 2], decreasing = TRUE), ]
top.events <- cbind(y1[1:15, 1], comma(round(y1[1:15, 2]/1000, 0)), y2[1:15,
1], comma(round(y2[1:15, 2]/1000, 0)))
colnames(top.events) <- c("Hazard", "Property", "Hazard", "Crops")
rownames(top.events) <- NULL
pander(top.events, caption = "Top events affecting Properties and Crops in the USA, period 1950-2011. Left correspond to Property losses, right to Crops. Nominal Currency in Thousands of US Dollars")
Hazard | Property | Hazard | Crops |
---|---|---|---|
FLOOD | 144,657,710 | DROUGHT | 13,972,566 |
HURRICANE/TYPHOON | 69,305,840 | FLOOD | 5,661,968 |
TORNADO | 56,937,161 | RIVER FLOOD | 5,029,459 |
STORM SURGE | 43,323,536 | ICE STORM | 5,022,114 |
FLASH FLOOD | 16,140,812 | HAIL | 3,025,954 |
HAIL | 15,732,268 | HURRICANE | 2,741,910 |
HURRICANE | 11,868,319 | HURRICANE/TYPHOON | 2,607,873 |
TROPICAL STORM | 7,703,891 | FLASH FLOOD | 1,421,317 |
WINTER STORM | 6,688,497 | EXTREME COLD | 1,312,973 |
HIGH WIND | 5,270,046 | FROST/FREEZE | 1,094,186 |
RIVER FLOOD | 5,118,946 | HEAVY RAIN | 733,400 |
WILDFIRE | 4,765,114 | TROPICAL STORM | 678,346 |
STORM SURGE/TIDE | 4,641,188 | HIGH WIND | 638,571 |
TSTM WIND | 4,484,958 | TSTM WIND | 554,007 |
ICE STORM | 3,944,928 | EXCESSIVE HEAT | 492,402 |
Important Note: As can be seen in the table above, there is needed for data cleansing, to be able to rank the events properly. For example, for crop losses, the combined loss for “FLOOD”, “RIVER FLOOD” and “FLASH FLOOD” is almost as big as the “DROUGHT” loss. Neverless, the table gives information about the harmful events.
Finally, one can add up the property and economic loss to get an overall hazard rating. The following plot shows the top 15 causes.
totalEL<-with(storm.red,propdmg+cropdmg) #Total economic loss
x <-aggregate(totalEL,list(storm.red$EVTYPE),sum)
y<-x[x[,2]>0,] # USe only those with damage
y<-y[order(y[,2],decreasing=TRUE),]
colnames(y)<-c("EVTYPE","LOSS")
bas1<-factor(y$EVTYPE,levels=as.character(y$EVTYPE),ordered=TRUE)
y$EVTYPE1<-bas1
barchart(EVTYPE1~LOSS/10^9,data=y[1:15,],main="Monetary Losses (in Billions of US Dollars)",xlab="LOSS")
Clearly the type “FLOOD/FLASH FLOOD/RIVER FLOOD” is a major contributor for the monetary losses over the years.
There are some interesting facts that can be drawn from the analysis. One of them is that there is difference between the type of events that causes the most damage either if the focus is in the human loss or in the economical loss.
As it has been written in the preceding paragraphs, the results shown must be revised, given the fact that factor variables such as EVTYPE and STATE have many levels that are not in the usual “dictionary”. It has to be notice that we have not use spatial or temporal variables, that could give more information.
Dear evaluator: I apologize for the spelling errors you surely spotted while reading this work. English is not my first language and I didn’t have the time to run the text in a grammar/spelling corrector. I hope those have not make you wonder if this is written in english or not.
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_0.3.0 pander_0.6.0 lattice_0.20-33 xtable_1.8-2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.2 digest_0.6.9 plyr_1.8.3 grid_3.2.3
## [5] formatR_1.2.1 magrittr_1.5 evaluate_0.8 stringi_1.0-1
## [9] rmarkdown_0.9.2 tools_3.2.3 stringr_1.0.0 munsell_0.4.2
## [13] colorspace_1.2-6 htmltools_0.3 knitr_1.12.3
[1] IPCC, 2012: Summary for Policymakers. In: Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation [Field, C.B., V. Barros, T.F. Stocker, D. Qin, D.J. Dokken, K.L. Ebi, M.D. Mastrandrea, K.J. Mach, G.-K. Plattner, S.K. Allen, M. Tignor, and P.M. Midgley (eds.)]. A Special Report of Working Groups I and II of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, UK, and New York, NY, USA, pp. 1-19. Downloaded from http://www.ipcc-wg2.gov/SREX/images/uploads/SREX-SPMbrochure_FINAL.pdf, Sep 21, 2016.
[2] NOAA, 2007, STORM DATA PREPARATION Downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf, Sep 25, 2016