The U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database tracks characteristics of major storms and weather events in the United States.
Using that data, this analysis tries to find out which severe weather events cost the most, either in terms of fatalities and injuries or in terms of damaged goods.
The data is read from the file “repdata_data_StormData.csv.bz2”, which must be in the working directory of the project. The file is rather big, so reading it takes a while. The data.frame SD (StormData) stores the result. Strings are read as factors, because the exponent conversion further on works on factor levels.
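# stringsAsFactors = TRUE is required in R >= 4.0.0, where the default is FALSE.
SD <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, stringsAsFactors = TRUE)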
Out of the 37 variables, we use only 7:
EVTYPE: Factor with 985 levels giving the event type of each record
FATALITIES: Number of deaths caused by the event
INJURIES: Number of people injured by the event
PROPDMG and PROPDMGEXP: Together they give the property damage in dollars. They must be combined as PROPDMG · 10^PROPDMGEXP
CROPDMG and CROPDMGEXP: Together they give the crop damage in dollars. They must be combined as CROPDMG · 10^CROPDMGEXP
The first problem is that the exponent of these two quantities is not always numeric. For example, “k” or “K” sometimes stands for an exponent of three. So we create a “conversion dictionary”, the named list of vectors conv, and assign it to the levels of the exponent variables, PROPDMGEXP and CROPDMGEXP.
Once the EXP variables are corrected, we can calculate the total amounts of PROPDMG and CROPDMG.
# Each list name is a power of ten; its vector holds the raw codes mapped to it.
conv <- list("0"=c("","-","?","+","0"), "1"="1", "2"=c("2","h","H"),
             "3"=c("3","K","k"), "4"="4", "5"="5", "6"=c("6","M","m"), "7"="7",
             "8"="8", "9"=c("9","B","b"))
levels(SD$PROPDMGEXP) <- conv
levels(SD$CROPDMGEXP) <- conv
# Scale each amount by 10^exponent to obtain dollars.
SD$PROPDMG <- SD$PROPDMG * 10^as.integer(as.character(SD$PROPDMGEXP))
SD$CROPDMG <- SD$CROPDMG * 10^as.integer(as.character(SD$CROPDMGEXP))
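To make the level-recoding step easier to follow, here is a minimal sketch on made-up data. The names toy and toy_conv and the three amounts are illustrative assumptions, not part of the storm data; only the technique (assigning a named list to the levels of a factor, then scaling by a power of ten) mirrors the code above.

```r
# Toy data: three damage figures with mixed exponent codes (values are made up).
toy <- data.frame(
  DMG = c(25, 2.5, 10),
  EXP = factor(c("K", "m", ""))
)

# Cut-down version of the conversion list: each name is the new level,
# each element lists the old levels that are merged into it.
toy_conv <- list("0" = "", "3" = c("K", "k"), "6" = c("M", "m"))
levels(toy$EXP) <- toy_conv

# Turn the recoded factor into an integer exponent and scale the amounts.
toy$DOLLARS <- toy$DMG * 10^as.integer(as.character(toy$EXP))
print(toy)
```

Going through as.character() before as.integer() matters here: calling as.integer() directly on a factor returns the internal level codes, not the exponents.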
There are some missing values and errors in the EVTYPE, but we will ignore them in the analysis.
For each event type EVTYPE, we calculate the total sum of FATALITIES, INJURIES, PROPDMG and CROPDMG. We will use the pair of functions melt() and dcast() from the library reshape2 (Yes, I know that I should have loaded the library at the beginning; it is here just for the sake of clarity ;)
library(reshape2)
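# Long format: one row per (EVTYPE, variable, value).
SDC <- melt(data=SD[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","CROPDMG")],
            id=c("EVTYPE"),
            measure.vars=c("FATALITIES", "INJURIES", "PROPDMG","CROPDMG"))
# Back to wide format, summing each variable within each EVTYPE.
SDC <- dcast(SDC, EVTYPE ~ variable, sum)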
The result is stored in the new data.frame SDC (Storm Data-Clean).
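As a cross-check of what the melt()/dcast() pair computes, the same per-event totals can be sketched with base R's aggregate(). This is an alternative technique, not the one used in this report, and the toy data frame below is made up for illustration.

```r
# Toy data with two event types (values are made up).
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "TORNADO"),
  FATALITIES = c(1, 3, 2, 0),
  INJURIES   = c(4, 10, 0, 5)
)

# Sum every measure column within each EVTYPE.
toy_totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                        data = toy, FUN = sum)
print(toy_totals)
```

One difference to keep in mind: the formula interface of aggregate() drops rows containing NA by default, so the two approaches are only guaranteed to agree when the measure columns have no missing values.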
So, we are ready to process the data and answer the questions asked.
We calculate the sum of the two totals for each EVTYPE, FATALITIES+INJURIES, sort the result in decreasing order, and select the top 8 values.
Finally we use a barplot() to display the results.
# Top 8 event types by combined fatalities and injuries.
healthD <- head(SDC[order(SDC$FATALITIES + SDC$INJURIES, decreasing=TRUE),
                    c("EVTYPE", "FATALITIES", "INJURIES")], 8)
barplot(height=(healthD$FATALITIES + healthD$INJURIES)/1000, names.arg=healthD$EVTYPE,
        beside=TRUE, width=2, las=2, ylab="Thousands of Persons", col="red", ylim=c(0,100),
        main="Health Damage in USA (Fatalities + Injuries)", cex.names=0.6)
rownames(healthD) <- NULL
healthD
##            EVTYPE FATALITIES INJURIES
## 1         TORNADO       5633    91346
## 2  EXCESSIVE HEAT       1903     6525
## 3       TSTM WIND        504     6957
## 4           FLOOD        470     6789
## 5       LIGHTNING        816     5230
## 6            HEAT        937     2100
## 7     FLASH FLOOD        978     1777
## 8       ICE STORM         89     1975
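The ranking is driven by the combined total, not by fatalities alone, which is why LIGHTNING and HEAT sit below FLOOD even though they caused more deaths. A small sketch using four rows copied from the table above illustrates this; the names sub, by_total and by_fatal are illustrative only.

```r
# Four rows taken from the output table above.
sub <- data.frame(
  EVTYPE     = c("TSTM WIND", "FLOOD", "LIGHTNING", "HEAT"),
  FATALITIES = c(504, 470, 816, 937),
  INJURIES   = c(6957, 6789, 5230, 2100)
)

# Ranking used in this report: fatalities plus injuries.
by_total <- sub[order(sub$FATALITIES + sub$INJURIES, decreasing = TRUE), ]

# Alternative ranking by fatalities only, for comparison.
by_fatal <- sub[order(sub$FATALITIES, decreasing = TRUE), ]

print(by_total$EVTYPE)
print(by_fatal$EVTYPE)
```

Which criterion is more appropriate depends on how one weighs deaths against injuries; this report simply adds the two counts.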
So, it's easy to see that the TORNADO event is the most dangerous event in terms of harm to population health, followed by EXCESSIVE HEAT, TSTM WIND and FLOOD.
As in the previous question, we calculate the sum of the two totals for each EVTYPE, PROPDMG+CROPDMG, sort the result in decreasing order, and select the top 8 values.
Finally we use a barplot() to display the results.
# Top 8 event types by combined property and crop damage.
properD <- head(SDC[order(SDC$PROPDMG + SDC$CROPDMG, decreasing=TRUE),
                    c("EVTYPE", "PROPDMG", "CROPDMG")], 8)
barplot(height=(properD$PROPDMG + properD$CROPDMG)/1.E+9, names.arg=properD$EVTYPE,
        beside=TRUE, width=2, las=2, ylab="Billions of Dollars", col="blue",
        main="Total Damage in USA (Properties + Crops)", ylim = c(0,160), cex.names=0.7)
rownames(properD) <- NULL
properD
##              EVTYPE   PROPDMG   CROPDMG
## 1             FLOOD 1.447e+11 5.662e+09
## 2 HURRICANE/TYPHOON 6.931e+10 2.608e+09
## 3           TORNADO 5.695e+10 4.150e+08
## 4       STORM SURGE 4.332e+10 5.000e+03
## 5              HAIL 1.574e+10 3.026e+09
## 6       FLASH FLOOD 1.682e+10 1.421e+09
## 7           DROUGHT 1.046e+09 1.397e+10
## 8         HURRICANE 1.187e+10 2.742e+09
So, it's easy to see that the FLOOD event is the costliest event in terms of economic damage, followed by HURRICANE/TYPHOON, TORNADO and STORM SURGE.
Two questions have been answered in this report:
1 Which events are most harmful to population health?
TORNADO events are, by far, the most dangerous storm events in terms of fatalities and injuries.