Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
With the data collected by U.S. National Oceanic and Atmospheric Administration, analysis is carried out to to address the following questions:
Key Fields the analysis are looking at:
The data file can be downloaded from the course web site: Storm Data (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library("dplyr")
library("knitr")
library("ggplot2")
library("DataCombine")
library("R.cache")
library("reshape2")
key=list("storm")
stormdata <- loadCache(key)
if (is.null(stormdata)) {
stormdata<-read.csv(bzfile("repdata_data_StormData.csv.bz2"))
cachestorm<-saveCache(stormdata, key)
}
stormdata <- loadCache(key)
Data at a glance:
dim(stormdata)
## [1] 902297 37
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
length(unique(stormdata$EVTYPE))
## [1] 985
head(stormdata, 3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
Since the focus is to see what events are harmful to population, properties and crops, data that have value more than zero are to be analysed.
fipcdata<-subset(stormdata, FATALITIES>0|INJURIES>0|PROPDMG>0|CROPDMG>0)
orievtype<-length(unique(fipcdata$EVTYPE))
paste("No. of event type", sep=": ", orievtype)
## [1] "No. of event type: 488"
Cleansing EVTYPE type as the number of event type is too many.
fipcdata$EVTYPE<-toupper(fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("[A-Z]*[0-9]+(.*)[0-9]*", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("[0-9]+[0-9]+[0-9]", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("\\(\\)", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("^\\s+|\\s+$", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("[\\|/|-]", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TSTM", fipcdata$EVTYPE), "THUNDERSTORM WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("THUNDER", fipcdata$EVTYPE), "THUNDERSTORM WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TUNDER", fipcdata$EVTYPE), "THUNDERSTORM WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HIGH WIN", fipcdata$EVTYPE), "HIGH WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("COASTAL", fipcdata$EVTYPE), "COASTAL FLOOD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("ICE", fipcdata$EVTYPE), "ICE STORM")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("LANDSLIDE|MUD|SLIDE", fipcdata$EVTYPE), "LANDSLIDE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("SURF", fipcdata$EVTYPE), "HIGH SURF")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("(.*)FLASH FLO", fipcdata$EVTYPE), "FLASHFL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FLOOD", fipcdata$EVTYPE), "FLOOD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FLASHFL", fipcdata$EVTYPE), "FLASH FLOOD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HURRICANE", fipcdata$EVTYPE), "HURRICANE (TYPHOON)")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HAIL", fipcdata$EVTYPE), "HAIL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("STRONG WIN", fipcdata$EVTYPE), "STRONG WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HEAVY RAI", fipcdata$EVTYPE), "HEAVY RAIN")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("(H(EA)*VY)* +(SNOW)", fipcdata$EVTYPE), "HEAVY SNOW")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("EXCESSIVE HEAT", fipcdata$EVTYPE), "EXCESSIVE HEAT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FUNNEL", fipcdata$EVTYPE), "FUNNEL CLOUD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TORNADO|TORNDAO", fipcdata$EVTYPE), "TORNADO")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FROST|FREEZE", fipcdata$EVTYPE), "FROST/FREEZE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("CHILL", fipcdata$EVTYPE), "CHILL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("DROUGHT", fipcdata$EVTYPE), "DROUGHT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("BLIZZARD", fipcdata$EVTYPE), "BLIZZARD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WATERSPOUT", fipcdata$EVTYPE), "WATERSPOUT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WINTER STORM", fipcdata$EVTYPE), "WINTER STORM")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("RIP CURRENT", fipcdata$EVTYPE), "RIP CURRENT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("AVALAN", fipcdata$EVTYPE), "AVALANCHE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("COLD", fipcdata$EVTYPE), "COLD/WIND CHILL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WINTER W|WINTRY MIX", fipcdata$EVTYPE), "WINTER WEATHER")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HYPOTHERMIA(.*)", fipcdata$EVTYPE), "HYPOTHERMIA")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("LIGHTING|LIGHTNING", fipcdata$EVTYPE), "LIGHTNING")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("ROUGH|TIDE", fipcdata$EVTYPE), "STORM SURGE/TIDE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("GUSTY", fipcdata$EVTYPE), "GUSTY WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WILD(.*)FIRE(S)*", fipcdata$EVTYPE), "WILDFIRE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TROPICAL STORM", fipcdata$EVTYPE), "TROPICAL STORM")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HEAT WAVE(.*)", fipcdata$EVTYPE), "HEAT WAVE")
afterevtype<-length(unique(fipcdata$EVTYPE))
paste("No. of Event Type (before)", sep=": ", orievtype)
## [1] "No. of Event Type (before): 488"
paste("No. of Event Type (after)", sep=": ", afterevtype)
## [1] "No. of Event Type (after): 133"
After cleansing, the type of “EVTYPE” has reduced from 488 to 133. Further cleansing could be carried out. However, it is decided to stop further cleansing as it will not impact much on the top 25 events.
“FATALITIES” and “INJURIES” value are sum up to see the total harm to population health for each event. Top 25 most harmful events to population health in US are as follow:
sumfatalinj<-summarise(group_by(fipcdata, EVTYPE), TOTAL_INJURIES=sum(INJURIES), TOTAL_FATALITIES=sum(FATALITIES), TOTALFATALINJ=sum(FATALITIES, INJURIES))
sumfatalinj<-arrange(sumfatalinj, desc(TOTALFATALINJ))
toptwofivefatalinj<-sumfatalinj[1:25,]
toptwofivefatalinj
## Source: local data frame [25 x 4]
##
## EVTYPE TOTAL_INJURIES TOTAL_FATALITIES TOTALFATALINJ
## 1 TORNADO 91407 5636 97043
## 2 THUNDERSTORM WIND 9545 756 10301
## 3 EXCESSIVE HEAT 6525 1922 8447
## 4 FLOOD 6804 494 7298
## 5 LIGHTNING 5231 817 6048
## 6 HEAT 2100 937 3037
## 7 FLASH FLOOD 1800 1035 2835
## 8 ICE STORM 2166 102 2268
## 9 HIGH WIND 1523 299 1822
## 10 WILDFIRE 1606 90 1696
## 11 WINTER STORM 1338 216 1554
## 12 HURRICANE (TYPHOON) 1328 133 1461
## 13 HAIL 1371 15 1386
## 14 HEAVY SNOW 1042 144 1186
## 15 RIP CURRENT 529 572 1101
## 16 BLIZZARD 805 101 906
## 17 FOG 734 62 796
## 18 WINTER WEATHER 615 62 677
## 19 HEAT WAVE 379 177 556
## 20 COLD/WIND CHILL 280 217 497
## 21 DUST STORM 440 22 462
## 22 TROPICAL STORM 383 66 449
## 23 STRONG WIND 323 125 448
## 24 HIGH SURF 246 166 412
## 25 AVALANCHE 170 225 395
Property damages exposure is cleaned up so as to have a standardized base for the property damage amount for comparison. There are magnitudes like 2, 4, or 5 in the PROPDMGEXP field, and assuming 2 referring to 100, 4 referring to 10000 and so for. “PROPDMGV” (new field after multiplying PROPDMG with PROPDMGEXP) value are then sum up to see the total damages in properties.
pcdata<-fipcdata
pcdata$PROPDMGEXP<-as.character(pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-toupper(pcdata$PROPDMGEXP)
unique(pcdata$PROPDMGEXP)
## [1] "K" "M" "" "B" "+" "0" "5" "6" "4" "H" "2" "7" "3" "-"
pcdata$PROPDMGEXP<-gsub("[-|+|0]+", "1", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[H|2]+", "100", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[K|3]+", "1000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[M|6]+", "1000000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[B]+", "1000000000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[4]+", "10000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[5]+", "100000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[7]+", "10000000", pcdata$PROPDMGEXP)
pcdata[pcdata$PROPDMGEXP %in% c(""), "PROPDMGEXP"] <- "1"
pcdata$PROPDMGEXP<-as.numeric(pcdata$PROPDMGEXP)
pcdata<-mutate(pcdata, PROPDMGV=PROPDMG*PROPDMGEXP)
Crops damages exposure is cleaned up so as to have a standardized base of crop damages amount for comparison. “CROPDMGV” (new field after multiplying CROPDMG with CROPDMGEXP) value are then sum up to see the total damages in crops.
pcdata$CROPDMGEXP<-as.character(pcdata$CROPDMGEXP)
pcdata$CROPDMGEXP<-toupper(pcdata$CROPDMGEXP)
unique(pcdata$CROPDMGEXP)
## [1] "" "M" "K" "B" "?" "0"
pcdata$CROPDMGEXP<-gsub("[?|0]+", "1", pcdata$CROPDMGEXP)
pcdata$CROPDMGEXP<-gsub("[K]+", "1000", pcdata$CROPDMGEXP)
pcdata$CROPDMGEXP<-gsub("[M]+", "1000000", pcdata$CROPDMGEXP)
pcdata$CROPDMGEXP<-gsub("[B]+", "1000000000", pcdata$CROPDMGEXP)
pcdata[pcdata$CROPDMGEXP %in% c(""), "CROPDMGEXP"] <- "1"
pcdata$CROPDMGEXP<-as.numeric(pcdata$CROPDMGEXP)
pcdata<-mutate(pcdata, CROPDMGV=CROPDMG*CROPDMGEXP)
Top 25 harmful events to properties and crops in US are as follow (in million $):
sumpropcrop<-summarise(group_by(pcdata, EVTYPE), TOTAL_PROP_DAMAGE=(sum(PROPDMGV))/1000000, TOTAL_CROP_DAMAGE=(sum(CROPDMGV))/1000000, TOTALPROPCROP=(sum(PROPDMGV, CROPDMGV))/1000000)
sumpropcrop<-arrange(sumpropcrop, desc(TOTALPROPCROP))
toptwofivepropcrop<-sumpropcrop[1:25,]
toptwofivepropcrop
## Source: local data frame [25 x 4]
##
## EVTYPE TOTAL_PROP_DAMAGE TOTAL_CROP_DAMAGE TOTALPROPCROP
## 1 FLOOD 150612.6927 10842.86195 161455.5547
## 2 HURRICANE (TYPHOON) 84656.1800 5505.29280 90161.4728
## 3 TORNADO 57003.3184 414.96152 57418.2799
## 4 STORM SURGE 43323.5360 0.00500 43323.5410
## 5 STORM SURGE/TIDE 5697.2392 13973.46600 19670.7052
## 6 FLASH FLOOD 17587.9911 1532.19715 19120.1882
## 7 HAIL 15977.5645 3046.88762 19024.4521
## 8 THUNDERSTORM WIND 12785.4907 1274.20899 14059.6997
## 9 ICE STORM 3974.0614 5027.11430 9001.1757
## 10 WILDFIRE 8491.5635 402.78163 8894.3451
## 11 TROPICAL STORM 7714.3906 694.89600 8409.2866
## 12 HIGH WIND 6166.3001 701.82190 6868.1220
## 13 WINTER STORM 6688.9973 27.44400 6716.4413
## 14 HEAVY RAIN 3212.0711 793.89980 4005.9709
## 15 FROST/FREEZE 19.2000 1997.06100 2016.2610
## 16 COLD/WIND CHILL 125.4414 1409.11550 1534.5569
## 17 HEAVY SNOW 996.3021 134.65310 1130.9552
## 18 LIGHTNING 935.4144 12.09209 947.5065
## 19 BLIZZARD 659.8139 112.06000 771.8740
## 20 TYPHOON 600.2300 0.82500 601.0550
## 21 EXCESSIVE HEAT 7.7537 492.40778 500.1615
## 22 HEAT 1.7970 401.46150 403.2585
## 23 LANDSLIDE 326.8821 20.01700 346.8991
## 24 STRONG WIND 179.7926 69.95350 249.7461
## 25 TSUNAMI 144.0620 0.02000 144.0820
toptwofivefatalinj<-toptwofivefatalinj[,1:3]
toptwofivefatalinj<-melt(toptwofivefatalinj, id.vars = c("EVTYPE"))
p<-ggplot(toptwofivefatalinj, aes(x=reorder(EVTYPE, -value), y=value))
p+geom_bar(aes(fill=variable), stat="identity", postion = "stack") + labs(x = "Event", y = "No. of Fatalities / Injuries") + theme(axis.text.x = element_text(size = 10, angle = 90, hjust = 1, vjust = 0.25)) + scale_fill_discrete("Harmful Effect")
From the result, there are more injuries than fatalities for most of the events. The most harmful events to population health are Tornado, Thunderstorm Wind and Excessive Heat. Tornado is almost 6 to 7 times more harmful than Thunderstom Wind that came in second.
toptwofivepropcrop<-toptwofivepropcrop[,1:3]
toptwofivepropcrop<-melt(toptwofivepropcrop, id.vars = c("EVTYPE"))
p<-ggplot(toptwofivepropcrop, aes(x=reorder(EVTYPE, -value), y=value))
p+geom_bar(aes(fill=variable), stat="identity", postion = "stack") + labs(x = "Event", y = "Loss in Property/Crop (million $)") + theme(axis.text.x = element_text(size = 10, angle = 90, hjust = 1, vjust = 0.25)) + scale_fill_discrete("Harmful Effect")
From the result, the events cause more economic loss in properties than crops. The most harmful events to properties and crops are Flood, Hurricane and Tornado.
Tornado is in the top 3 of both the results, ie. it is harmful to both population health and also incur a lot of loss in properties and crops.