1. Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

With the data collected by U.S. National Oceanic and Atmospheric Administration, analysis is carried out to to address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Key Fields the analysis are looking at:

Data

The data file can be downloaded from the course web site: Storm Data (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

2. Data Processing

2.1 Loading necessary libraries

library("dplyr")
library("knitr")
library("ggplot2")
library("DataCombine")
library("R.cache")
library("reshape2")

2.2 Loading and preprocessing the data

  1. Load the data (i.e. read.csv()) and store as cache for easy reloading
  2. Take a look at the data
  3. Process/transform the data into a format suitable for analysis
  • remove event that cause no harm to health or property/crop damage, ie. value less than one in “FATALITIES”, “INJURIES”, “PROPDMG” and “CROPDMG”
  • convert fields to upper case for easy comparison
  • clear unwanted characters in “EVTYPE” such as number, special character, space before or after word, etc
  • Try to standardise the “EVTYPE” to follow the list in National Weather Service Storm Data Documentation, para 2.1.1 as close as possible, except for some unfound eventy type. eg. “TSTM WIND”, “THUNDERSTORM WINDS” “TUNDERSTORM WIND” are standardised as “THUNDERSTORM WIND”
key=list("storm")
stormdata <- loadCache(key)
if (is.null(stormdata)) {
        stormdata<-read.csv(bzfile("repdata_data_StormData.csv.bz2"))
        cachestorm<-saveCache(stormdata, key)
}
stormdata <- loadCache(key)

Data at a glance:

dim(stormdata)
## [1] 902297     37
names(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
length(unique(stormdata$EVTYPE))
## [1] 985
head(stormdata, 3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3

Since the focus is to see what events are harmful to population, properties and crops, data that have value more than zero are to be analysed.

fipcdata<-subset(stormdata, FATALITIES>0|INJURIES>0|PROPDMG>0|CROPDMG>0)
orievtype<-length(unique(fipcdata$EVTYPE))
paste("No. of event type", sep=": ", orievtype)
## [1] "No. of event type: 488"

Cleansing EVTYPE type as the number of event type is too many.

fipcdata$EVTYPE<-toupper(fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("[A-Z]*[0-9]+(.*)[0-9]*", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("[0-9]+[0-9]+[0-9]", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("\\(\\)", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("^\\s+|\\s+$", "", fipcdata$EVTYPE)
fipcdata$EVTYPE<-gsub("[\\|/|-]", "", fipcdata$EVTYPE) 

fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TSTM", fipcdata$EVTYPE), "THUNDERSTORM WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("THUNDER", fipcdata$EVTYPE), "THUNDERSTORM WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TUNDER", fipcdata$EVTYPE), "THUNDERSTORM WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HIGH WIN", fipcdata$EVTYPE), "HIGH WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("COASTAL", fipcdata$EVTYPE), "COASTAL FLOOD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("ICE", fipcdata$EVTYPE), "ICE STORM")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("LANDSLIDE|MUD|SLIDE", fipcdata$EVTYPE), "LANDSLIDE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("SURF", fipcdata$EVTYPE), "HIGH SURF")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("(.*)FLASH FLO", fipcdata$EVTYPE), "FLASHFL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FLOOD", fipcdata$EVTYPE), "FLOOD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FLASHFL", fipcdata$EVTYPE), "FLASH FLOOD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HURRICANE", fipcdata$EVTYPE), "HURRICANE (TYPHOON)")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HAIL", fipcdata$EVTYPE), "HAIL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("STRONG WIN", fipcdata$EVTYPE), "STRONG WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HEAVY RAI", fipcdata$EVTYPE), "HEAVY RAIN")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("(H(EA)*VY)* +(SNOW)", fipcdata$EVTYPE), "HEAVY SNOW")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("EXCESSIVE HEAT", fipcdata$EVTYPE), "EXCESSIVE HEAT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FUNNEL", fipcdata$EVTYPE), "FUNNEL CLOUD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TORNADO|TORNDAO", fipcdata$EVTYPE), "TORNADO")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("FROST|FREEZE", fipcdata$EVTYPE), "FROST/FREEZE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("CHILL", fipcdata$EVTYPE), "CHILL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("DROUGHT", fipcdata$EVTYPE), "DROUGHT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("BLIZZARD", fipcdata$EVTYPE), "BLIZZARD")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WATERSPOUT", fipcdata$EVTYPE), "WATERSPOUT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WINTER STORM", fipcdata$EVTYPE), "WINTER STORM")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("RIP CURRENT", fipcdata$EVTYPE), "RIP CURRENT")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("AVALAN", fipcdata$EVTYPE), "AVALANCHE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("COLD", fipcdata$EVTYPE), "COLD/WIND CHILL")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WINTER W|WINTRY MIX", fipcdata$EVTYPE), "WINTER WEATHER")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HYPOTHERMIA(.*)", fipcdata$EVTYPE), "HYPOTHERMIA")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("LIGHTING|LIGHTNING", fipcdata$EVTYPE), "LIGHTNING")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("ROUGH|TIDE", fipcdata$EVTYPE), "STORM SURGE/TIDE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("GUSTY", fipcdata$EVTYPE), "GUSTY WIND")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("WILD(.*)FIRE(S)*", fipcdata$EVTYPE), "WILDFIRE")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("TROPICAL STORM", fipcdata$EVTYPE), "TROPICAL STORM")
fipcdata$EVTYPE<-replace(fipcdata$EVTYPE, grep("HEAT WAVE(.*)", fipcdata$EVTYPE), "HEAT WAVE")

afterevtype<-length(unique(fipcdata$EVTYPE))

paste("No. of Event Type (before)", sep=": ", orievtype)
## [1] "No. of Event Type (before): 488"
paste("No. of Event Type (after)", sep=": ", afterevtype)
## [1] "No. of Event Type (after): 133"

After cleansing, the type of “EVTYPE” has reduced from 488 to 133. Further cleansing could be carried out. However, it is decided to stop further cleansing as it will not impact much on the top 25 events.

“FATALITIES” and “INJURIES” value are sum up to see the total harm to population health for each event. Top 25 most harmful events to population health in US are as follow:

sumfatalinj<-summarise(group_by(fipcdata, EVTYPE), TOTAL_INJURIES=sum(INJURIES), TOTAL_FATALITIES=sum(FATALITIES), TOTALFATALINJ=sum(FATALITIES, INJURIES))
sumfatalinj<-arrange(sumfatalinj, desc(TOTALFATALINJ))
toptwofivefatalinj<-sumfatalinj[1:25,]

toptwofivefatalinj
## Source: local data frame [25 x 4]
## 
##                 EVTYPE TOTAL_INJURIES TOTAL_FATALITIES TOTALFATALINJ
## 1              TORNADO          91407             5636         97043
## 2    THUNDERSTORM WIND           9545              756         10301
## 3       EXCESSIVE HEAT           6525             1922          8447
## 4                FLOOD           6804              494          7298
## 5            LIGHTNING           5231              817          6048
## 6                 HEAT           2100              937          3037
## 7          FLASH FLOOD           1800             1035          2835
## 8            ICE STORM           2166              102          2268
## 9            HIGH WIND           1523              299          1822
## 10            WILDFIRE           1606               90          1696
## 11        WINTER STORM           1338              216          1554
## 12 HURRICANE (TYPHOON)           1328              133          1461
## 13                HAIL           1371               15          1386
## 14          HEAVY SNOW           1042              144          1186
## 15         RIP CURRENT            529              572          1101
## 16            BLIZZARD            805              101           906
## 17                 FOG            734               62           796
## 18      WINTER WEATHER            615               62           677
## 19           HEAT WAVE            379              177           556
## 20     COLD/WIND CHILL            280              217           497
## 21          DUST STORM            440               22           462
## 22      TROPICAL STORM            383               66           449
## 23         STRONG WIND            323              125           448
## 24           HIGH SURF            246              166           412
## 25           AVALANCHE            170              225           395

Property damages exposure is cleaned up so as to have a standardized base for the property damage amount for comparison. There are magnitudes like 2, 4, or 5 in the PROPDMGEXP field, and assuming 2 referring to 100, 4 referring to 10000 and so for. “PROPDMGV” (new field after multiplying PROPDMG with PROPDMGEXP) value are then sum up to see the total damages in properties.

pcdata<-fipcdata
pcdata$PROPDMGEXP<-as.character(pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-toupper(pcdata$PROPDMGEXP)

unique(pcdata$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "+" "0" "5" "6" "4" "H" "2" "7" "3" "-"
pcdata$PROPDMGEXP<-gsub("[-|+|0]+", "1", pcdata$PROPDMGEXP) 
pcdata$PROPDMGEXP<-gsub("[H|2]+", "100", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[K|3]+", "1000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[M|6]+", "1000000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[B]+", "1000000000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[4]+", "10000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[5]+", "100000", pcdata$PROPDMGEXP)
pcdata$PROPDMGEXP<-gsub("[7]+", "10000000", pcdata$PROPDMGEXP)
pcdata[pcdata$PROPDMGEXP %in% c(""), "PROPDMGEXP"] <- "1"

pcdata$PROPDMGEXP<-as.numeric(pcdata$PROPDMGEXP)
pcdata<-mutate(pcdata, PROPDMGV=PROPDMG*PROPDMGEXP)

Crops damages exposure is cleaned up so as to have a standardized base of crop damages amount for comparison. “CROPDMGV” (new field after multiplying CROPDMG with CROPDMGEXP) value are then sum up to see the total damages in crops.

pcdata$CROPDMGEXP<-as.character(pcdata$CROPDMGEXP)
pcdata$CROPDMGEXP<-toupper(pcdata$CROPDMGEXP)

unique(pcdata$CROPDMGEXP)
## [1] ""  "M" "K" "B" "?" "0"
pcdata$CROPDMGEXP<-gsub("[?|0]+", "1", pcdata$CROPDMGEXP) 
pcdata$CROPDMGEXP<-gsub("[K]+", "1000", pcdata$CROPDMGEXP)
pcdata$CROPDMGEXP<-gsub("[M]+", "1000000", pcdata$CROPDMGEXP)
pcdata$CROPDMGEXP<-gsub("[B]+", "1000000000", pcdata$CROPDMGEXP)
pcdata[pcdata$CROPDMGEXP %in% c(""), "CROPDMGEXP"] <- "1"

pcdata$CROPDMGEXP<-as.numeric(pcdata$CROPDMGEXP)
pcdata<-mutate(pcdata, CROPDMGV=CROPDMG*CROPDMGEXP)

Top 25 harmful events to properties and crops in US are as follow (in million $):

sumpropcrop<-summarise(group_by(pcdata, EVTYPE), TOTAL_PROP_DAMAGE=(sum(PROPDMGV))/1000000, TOTAL_CROP_DAMAGE=(sum(CROPDMGV))/1000000, TOTALPROPCROP=(sum(PROPDMGV, CROPDMGV))/1000000)
sumpropcrop<-arrange(sumpropcrop, desc(TOTALPROPCROP))
toptwofivepropcrop<-sumpropcrop[1:25,]

toptwofivepropcrop
## Source: local data frame [25 x 4]
## 
##                 EVTYPE TOTAL_PROP_DAMAGE TOTAL_CROP_DAMAGE TOTALPROPCROP
## 1                FLOOD       150612.6927       10842.86195   161455.5547
## 2  HURRICANE (TYPHOON)        84656.1800        5505.29280    90161.4728
## 3              TORNADO        57003.3184         414.96152    57418.2799
## 4          STORM SURGE        43323.5360           0.00500    43323.5410
## 5     STORM SURGE/TIDE         5697.2392       13973.46600    19670.7052
## 6          FLASH FLOOD        17587.9911        1532.19715    19120.1882
## 7                 HAIL        15977.5645        3046.88762    19024.4521
## 8    THUNDERSTORM WIND        12785.4907        1274.20899    14059.6997
## 9            ICE STORM         3974.0614        5027.11430     9001.1757
## 10            WILDFIRE         8491.5635         402.78163     8894.3451
## 11      TROPICAL STORM         7714.3906         694.89600     8409.2866
## 12           HIGH WIND         6166.3001         701.82190     6868.1220
## 13        WINTER STORM         6688.9973          27.44400     6716.4413
## 14          HEAVY RAIN         3212.0711         793.89980     4005.9709
## 15        FROST/FREEZE           19.2000        1997.06100     2016.2610
## 16     COLD/WIND CHILL          125.4414        1409.11550     1534.5569
## 17          HEAVY SNOW          996.3021         134.65310     1130.9552
## 18           LIGHTNING          935.4144          12.09209      947.5065
## 19            BLIZZARD          659.8139         112.06000      771.8740
## 20             TYPHOON          600.2300           0.82500      601.0550
## 21      EXCESSIVE HEAT            7.7537         492.40778      500.1615
## 22                HEAT            1.7970         401.46150      403.2585
## 23           LANDSLIDE          326.8821          20.01700      346.8991
## 24         STRONG WIND          179.7926          69.95350      249.7461
## 25             TSUNAMI          144.0620           0.02000      144.0820

3. Results

3.1 Top Harmful Events to Public Health

toptwofivefatalinj<-toptwofivefatalinj[,1:3]
toptwofivefatalinj<-melt(toptwofivefatalinj, id.vars = c("EVTYPE"))

p<-ggplot(toptwofivefatalinj, aes(x=reorder(EVTYPE, -value), y=value))
p+geom_bar(aes(fill=variable), stat="identity", postion = "stack")  + labs(x = "Event", y = "No. of Fatalities / Injuries") + theme(axis.text.x = element_text(size = 10, angle = 90, hjust = 1, vjust = 0.25)) + scale_fill_discrete("Harmful Effect")

From the result, there are more injuries than fatalities for most of the events. The most harmful events to population health are Tornado, Thunderstorm Wind and Excessive Heat. Tornado is almost 6 to 7 times more harmful than Thunderstom Wind that came in second.

3.2 Top Harmful Events to Economy (Property and Crop)

toptwofivepropcrop<-toptwofivepropcrop[,1:3]
toptwofivepropcrop<-melt(toptwofivepropcrop, id.vars = c("EVTYPE"))

p<-ggplot(toptwofivepropcrop, aes(x=reorder(EVTYPE, -value), y=value))
p+geom_bar(aes(fill=variable), stat="identity", postion = "stack")  + labs(x = "Event", y = "Loss in Property/Crop (million $)") + theme(axis.text.x = element_text(size = 10, angle = 90, hjust = 1, vjust = 0.25)) + scale_fill_discrete("Harmful Effect")

From the result, the events cause more economic loss in properties than crops. The most harmful events to properties and crops are Flood, Hurricane and Tornado.

Tornado is in the top 3 of both the results, ie. it is harmful to both population health and also incur a lot of loss in properties and crops.