author: liuyubobobo
date: Saturday, February 21, 2015
In this report, we use data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to analyze which weather events in the U.S. are most harmful to population health and to the economy. We identify the top 10 event types by fatalities, by injuries, and by property damage, and we find that a small number of event types account for most of the harm. Among those, TORNADO is the most serious; FLOOD and TSTM WIND are also very harmful. If we can forecast or prepare for these events, we can avoid a great deal of loss.
We first read in the Storm Data, which comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
More information about this data can be found in:
- National Weather Service Storm Data Documentation
- National Climatic Data Center Storm Events FAQ
data <- read.csv( bzfile("StormData.csv.bz2") )
First of all, we aggregate our data by event types and calculate the total numbers of fatalities and injuries.
eventDataForPopulationHealth <- aggregate( cbind( FATALITIES , INJURIES ) ~ EVTYPE , data = data , FUN = sum)
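To make the aggregation step concrete, here is a minimal sketch on a made-up data frame (the `toy` rows and the `toyTotals` name are illustrative assumptions, not part of the storm data). It shows how `cbind()` on the left-hand side of the formula sums both columns per event type in a single call.

```r
# Toy stand-in for the storm data; the values are invented for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 3, 1, 1),
  INJURIES   = c(10, 1, 5, 0, 2)
)

# cbind() on the left-hand side sums both columns within each EVTYPE group.
toyTotals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
print(toyTotals)
```

Note that the formula interface of `aggregate` drops rows with missing values by default (`na.action = na.omit`), which matters if either column contains `NA`.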
Then we sort the new data frame, eventDataForPopulationHealth, by the total number of fatalities, breaking ties by the number of injuries.
eventDataOrderByFatalities <- eventDataForPopulationHealth[ with( eventDataForPopulationHealth , order(FATALITIES , INJURIES , decreasing = TRUE) ) , ]
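The sort uses a second key to break ties. A small sketch with invented totals (the `toyTotals` rows and the `sorted` name are assumptions for illustration) shows how `order()` with two keys and `decreasing = TRUE` behaves:

```r
# Invented totals: events A and C tie on fatalities, so injuries decide their order.
toyTotals <- data.frame(
  EVTYPE     = c("A", "B", "C"),
  FATALITIES = c(5, 9, 5),
  INJURIES   = c(1, 0, 7)
)

# Sort by FATALITIES first, then by INJURIES, both in decreasing order.
sorted <- toyTotals[order(toyTotals$FATALITIES, toyTotals$INJURIES, decreasing = TRUE), ]
print(sorted)
```

Referring to the columns with `$` (or wrapping the call in `with()`) avoids `attach()`, which can leave stale names on the search path if `detach()` is skipped.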
We can now list the 10 event types that cause the most fatalities.
head( eventDataOrderByFatalities , 10)
##             EVTYPE FATALITIES INJURIES
## 834        TORNADO       5633    91346
## 130 EXCESSIVE HEAT       1903     6525
## 153    FLASH FLOOD        978     1777
## 275           HEAT        937     2100
## 464      LIGHTNING        816     5230
## 856      TSTM WIND        504     6957
## 170          FLOOD        470     6789
## 585    RIP CURRENT        368      232
## 359      HIGH WIND        248     1137
## 19       AVALANCHE        224      170
The share of all fatalities caused by these top 10 event types can be calculated as follows:
with( eventDataOrderByFatalities , sum(FATALITIES[1:10]) / sum(FATALITIES) )
## [1] 0.797689
This is really high: the top 10 event types account for about 80% of all fatalities, so they deserve particular attention.
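Since the same share is computed again for injuries, the calculation can be wrapped in a small helper. This is only a sketch; `topShare` is a hypothetical name and is not used elsewhere in this report.

```r
# Fraction of the total contributed by the n largest values of x.
# sort() drops NA values by default, so missing totals are ignored.
topShare <- function(x, n = 10) {
  x <- sort(x, decreasing = TRUE)
  sum(head(x, n)) / sum(x)
}

# Made-up example: the two largest values (50 and 30) out of a total of 100.
exampleShare <- topShare(c(50, 30, 10, 5, 5), n = 2)
print(exampleShare)
```

Applied to the report's data, `topShare(eventDataForPopulationHealth$FATALITIES)` would be expected to reproduce the fraction shown above, because the helper sorts its input itself.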
To make the top 10 event types easier to compare, we plot the data as follows:
eventDataOrderByFatalities$FATALITIES <- eventDataOrderByFatalities$FATALITIES / 1000
par( mar = c(10,6,2,1) , las = 2 )
barplot( height = eventDataOrderByFatalities$FATALITIES[1:10] , names.arg = eventDataOrderByFatalities$EVTYPE[1:10] , col = heat.colors(10) , main = "Top 10 event types by total fatalities" , ylab = "Total fatalities (thousands of people)" )
In the same way, we sort the data frame eventDataForPopulationHealth by the total number of injuries, breaking ties by the number of fatalities.
eventDataOrderByInjuries <- eventDataForPopulationHealth[ with( eventDataForPopulationHealth , order(INJURIES , FATALITIES , decreasing = TRUE) ) , ]
Then we list the 10 event types that cause the most injuries.
head( eventDataOrderByInjuries , 10)
##                EVTYPE FATALITIES INJURIES
## 834           TORNADO       5633    91346
## 856         TSTM WIND        504     6957
## 170             FLOOD        470     6789
## 130    EXCESSIVE HEAT       1903     6525
## 464         LIGHTNING        816     5230
## 275              HEAT        937     2100
## 427         ICE STORM         89     1975
## 153       FLASH FLOOD        978     1777
## 760 THUNDERSTORM WIND        133     1488
## 244              HAIL         15     1361
The share of all injuries caused by these top 10 event types can be calculated as follows:
with( eventDataOrderByInjuries , sum( INJURIES[1:10] ) / sum( INJURIES ) )
## [1] 0.893402
This is even higher: the top 10 event types account for about 89% of all injuries, so these events also need our attention.
To make these top 10 event types easier to compare, we plot the data as follows:
eventDataOrderByInjuries$INJURIES <- eventDataOrderByInjuries$INJURIES / 1000
par( mar = c(10,6,2,1) , las = 2 )
barplot( height = eventDataOrderByInjuries$INJURIES[1:10] , names.arg = eventDataOrderByInjuries$EVTYPE[1:10] , col = heat.colors(10) , main = "Top 10 event types by total injuries" , ylab = "Total injuries (thousands of people)" )
To emphasize the harm of these events, we can also estimate how widely they occur by counting the counties in which they were recorded.
In our data, the number of distinct values of the COUNTY column is:
length( unique(data$COUNTY) )
## [1] 557
The number of distinct COUNTY values in which the top 10 event types by fatalities occur is:
length(unique(data[ data$EVTYPE %in% eventDataOrderByFatalities[1:10,"EVTYPE"] , "COUNTY"]))
## [1] 446
As a proportion of all COUNTY values, this is:
length(unique(data[ data$EVTYPE %in% eventDataOrderByFatalities[1:10,"EVTYPE"] , "COUNTY"])) / length( unique(data$COUNTY) )
## [1] 0.8007181
The number of distinct COUNTY values in which the top 10 event types by injuries occur is:
length(unique(data[ data$EVTYPE %in% eventDataOrderByInjuries[1:10,"EVTYPE"] , "COUNTY"]))
## [1] 424
As a proportion of all COUNTY values, this is:
length(unique(data[ data$EVTYPE %in% eventDataOrderByInjuries[1:10,"EVTYPE"] , "COUNTY"])) / length( unique(data$COUNTY) )
## [1] 0.7612208
These numbers are pretty high. They mean that these events not only cause serious harm to population health, but also occur widely.
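One caveat about the county counts above: if COUNTY is a numeric code that can repeat from one state to another, then counting distinct codes alone merges counties from different states. A hedged sketch of counting distinct (STATE, COUNTY) pairs instead is shown below. It runs on made-up rows; the assumption that the storm data has a STATE column usable this way, and the helper name `countyCoverage`, are ours.

```r
# Invented rows: county code 1 appears in both TX and OK, code 3 in TX and KS.
toy <- data.frame(
  STATE  = c("TX", "TX", "OK", "OK", "KS"),
  COUNTY = c(1, 3, 1, 1, 3),
  EVTYPE = c("TORNADO", "HAIL", "TORNADO", "FLOOD", "HAIL")
)

# Counting codes alone treats TX/1 and OK/1 as the same county.
codesOnly <- length(unique(toy$COUNTY))

# Counting distinct (STATE, COUNTY) pairs keeps them apart.
statePairs <- nrow(unique(toy[, c("STATE", "COUNTY")]))

# Fraction of distinct (STATE, COUNTY) pairs in which any of `events` was recorded.
countyCoverage <- function(df, events) {
  allPairs <- unique(df[, c("STATE", "COUNTY")])
  hitPairs <- unique(df[df$EVTYPE %in% events, c("STATE", "COUNTY")])
  nrow(hitPairs) / nrow(allPairs)
}

tornadoCoverage <- countyCoverage(toy, "TORNADO")
print(c(codesOnly = codesOnly, statePairs = statePairs, tornadoCoverage = tornadoCoverage))
```

Whether this changes the proportions reported above depends on how the codes are assigned in the real data, so it is offered as a check rather than a correction.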
To assess economic consequences, we first aggregate the data by event type and calculate the total of the property damage column, PROPDMG.
eventDataForPropdmg <- aggregate( PROPDMG ~ EVTYPE , data = data , FUN = sum)
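One caveat on units: as far as we understand the NOAA storm data, PROPDMG holds only the leading figure of each estimate, and a companion column, PROPDMGEXP, holds its magnitude (for example K, M or B). The sum above does not use that column. Below is a minimal sketch of scaling before aggregating, on made-up rows. Treating any code other than K, M and B as a multiplier of 1 is an assumption, and `PROPDMG_USD` is a hypothetical column name.

```r
# Invented rows: 25 thousand, 1.5 million and 2 billion dollars.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  PROPDMG    = c(25, 1.5, 2, 7),
  PROPDMGEXP = c("K", "M", "B", "")
)

# Map magnitude codes to multipliers; unknown or empty codes fall back to 1 (assumption).
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
mult <- unname(multiplier[toupper(as.character(toy$PROPDMGEXP))])
mult[is.na(mult)] <- 1

# Scale each estimate to dollars, then total by event type.
toy$PROPDMG_USD <- toy$PROPDMG * mult
toyDamage <- aggregate(PROPDMG_USD ~ EVTYPE, data = toy, FUN = sum)
print(toyDamage)
```

If the real data follow this convention, the ranking of event types by scaled damage may differ from the ranking by the raw PROPDMG sum.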
Then we sort the new data frame, eventDataForPropdmg, by total property damage.
eventDataOrderByPropdmg <- eventDataForPropdmg[ order(eventDataForPropdmg$PROPDMG , decreasing = TRUE) , ]
We can now list the 10 event types with the greatest economic consequences.
head( eventDataOrderByPropdmg , 10)
##                 EVTYPE   PROPDMG
## 834            TORNADO 3212258.2
## 153        FLASH FLOOD 1420124.6
## 856          TSTM WIND 1335965.6
## 170              FLOOD  899938.5
## 760  THUNDERSTORM WIND  876844.2
## 244               HAIL  688693.4
## 464          LIGHTNING  603351.8
## 786 THUNDERSTORM WINDS  446293.2
## 359          HIGH WIND  324731.6
## 972       WINTER STORM  132720.6
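The table lists TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS as separate rows, which suggests that one kind of event is recorded under several labels. A rough sketch of merging such variants before aggregating follows. The function name `normalizeEvtype` and the choice of which labels to merge are assumptions, and the real data may contain other variants that this does not cover.

```r
# Collapse a few thunderstorm-wind spellings into one label (illustrative only).
normalizeEvtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- sub("^TSTM WINDS?$", "THUNDERSTORM WIND", x)
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)
  x
}

labels <- c("TSTM WIND", "Thunderstorm Winds", "THUNDERSTORM WIND", " HAIL ")
cleaned <- normalizeEvtype(labels)
print(cleaned)
```

In the report's workflow this could be applied as `data$EVTYPE <- normalizeEvtype(data$EVTYPE)` before the calls to `aggregate`, at the cost of changing the row labels shown in the tables.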
To make the top 10 event types easier to compare, we plot the data as follows:
eventDataOrderByPropdmg$PROPDMG <- eventDataOrderByPropdmg$PROPDMG / 1000000
par( mar = c(10,6,2,1) , las = 2 )
barplot( height = eventDataOrderByPropdmg$PROPDMG[1:10] , names.arg = eventDataOrderByPropdmg$EVTYPE[1:10] , col = heat.colors(10) , main = "Top 10 event types by total property damage" , ylab = "Sum of PROPDMG (millions, not scaled by PROPDMGEXP)" )
From the above, we can see that the event types in all three top 10 lists are serious for the U.S., in terms of both population health and economic damage. If we look at these lists more closely, we find that they do not contain 30 different event types in total: some critical event types cause not only fatalities and injuries, but also economic damage. Among them, TORNADO is the most serious one. Besides that, FLOOD and TSTM WIND are also very harmful. If we can forecast or prepare for these events, we can avoid a great deal of loss.
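The overlap between the three lists can be checked directly. The sketch below copies the event types from the three top 10 tables in this report into vectors (the vector names are ours) and counts the distinct types and those common to all three lists.

```r
# Event types copied from the three top 10 tables in this report.
topFatalities <- c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING",
                   "TSTM WIND", "FLOOD", "RIP CURRENT", "HIGH WIND", "AVALANCHE")
topInjuries   <- c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING",
                   "HEAT", "ICE STORM", "FLASH FLOOD", "THUNDERSTORM WIND", "HAIL")
topPropdmg    <- c("TORNADO", "FLASH FLOOD", "TSTM WIND", "FLOOD", "THUNDERSTORM WIND",
                   "HAIL", "LIGHTNING", "THUNDERSTORM WINDS", "HIGH WIND", "WINTER STORM")

topLists <- list(topFatalities, topInjuries, topPropdmg)

# Distinct event types across the three lists, and those present in every list.
allTop <- Reduce(union, topLists)
inAll  <- Reduce(intersect, topLists)

print(length(allTop))
print(inAll)
```

With these inputs there are 15 distinct event types rather than 30, and five of them (TORNADO, FLASH FLOOD, LIGHTNING, TSTM WIND and FLOOD) appear in all three lists.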