Synopsis

This Stormdata Analysis is looking to find the most harmful nature events for both human health and the economy. It does so by aggregating the effects of nature events from 1950 and end in November 2011. In the following Analysis one will be able to see that the economic damage on property is far more substantial than the one on crop and that not necessairly the same events lead to fatalities as they do to injuries.

Data Processing

Loading the Data into R from the downloaded CSV file, using the read.csv command. Also the command cache=TRUE is used, so that the data does not need to be reloaded. Loading necessary packages:

library(dplyr)
library(knitr)
setwd("~/Downloads")
df <- read.csv("repdata-data-StormData.csv.bz2")

Familiarising yourself with the dataset

head(df)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

dim(df)

## [1] 902297     37

names(df)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Subsetting the dataset by choosing the columns relevatn to determine most harmful events

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df.sub <- select(df, BGN_DATE, STATE, EVTYPE,END_DATE, MAG, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)

The data and PROPDMGEXP/CROPDMGEXP variables

CROPDMGEXP and PROPDMGEXP are the unit variables for property and crop damage. These need to be taken into account to gauge the actual value of damage caused. Lets first look at the differnet types of units.

unique(df.sub$CROPDMGEXP)

## [1]   M K m B ? 0 k 2
## Levels:  0 2 ? B K M k m

From the documntaion it can be taken that K=1000, M=1000000, B=1000000000. However there is no documentation of the other variables. Lets see how frequent they are and if they can be excluded.

dim(filter(df.sub,CROPDMGEXP=="k"))

## [1] 21 11

dim(filter(df.sub,CROPDMGEXP=="?"))

## [1]  7 11

dim(filter(df.sub,CROPDMGEXP=="m"))

## [1]  1 11

dim(filter(df.sub,CROPDMGEXP=="2"))

## [1]  1 11

dim(filter(df.sub,CROPDMGEXP=="0"))

## [1] 19 11

So all together only 49 observations have units of this scale compared to K, which has 281832 observations and M which has 1994. Therefore it seems justified to remove these observations with unknown untis from the dataset instead of making unfounded assumptions.

df.sub.1 <- df.sub[ ! df.sub$CROPDMGEXP %in% c("?","m", "k","2", "0"), ]

unique(df.sub.1$PROPDMGEXP)

##  [1] K M   B + 0 5 m 6 ? 4 2 3 h 7 H - 1 8
## Levels:  + - 0 1 2 3 4 5 6 7 8 ? B H K M h m

dim(filter(df.sub.1,PROPDMGEXP=="-"))

## [1]  1 11

dim(filter(df.sub.1,PROPDMGEXP=="+"))

## [1]  5 11

dim(filter(df.sub.1,PROPDMGEXP=="m"))

## [1]  6 11

dim(filter(df.sub.1,PROPDMGEXP=="h"))

## [1]  1 11

dim(filter(df.sub.1,PROPDMGEXP=="k"))

## [1]  0 11

dim(filter(df.sub.1,PROPDMGEXP=="?"))

## [1]  8 11

dim(filter(df.sub.1,PROPDMGEXP=="1"))

## [1] 25 11

dim(filter(df.sub.1,PROPDMGEXP=="2"))

## [1] 13 11

dim(filter(df.sub.1,PROPDMGEXP=="3"))

## [1]  4 11

dim(filter(df.sub.1,PROPDMGEXP=="4"))

## [1]  4 11

dim(filter(df.sub.1,PROPDMGEXP=="5"))

## [1] 28 11

dim(filter(df.sub.1,PROPDMGEXP=="6"))

## [1]  4 11

dim(filter(df.sub.1,PROPDMGEXP=="7"))

## [1]  5 11

dim(filter(df.sub.1,PROPDMGEXP=="8"))

## [1]  1 11

In the case of proeprty damage there are a few more units that are unknown. However again, K,M and B consititue the highest obseravtion count with 435993 and as we have no certain documentation for what the other units stand for it is safer to remove those.

df.sub.2 <- df.sub[ ! df.sub.1$PROPDMGEXP %in% c("?","m", "+","-","h", "1", "2", "3","4","5","6","7","8", "0"), ]

Now we replace the units by their actual values by creating two new variables cropdmgexpn and propdmgexpn:

df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "K"] <- 1000
df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "M"] <- 1e+06
df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "B"] <- 1e+09
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "K"] <- 1000
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "M"] <- 1e+06
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "B"] <- 1e+09

Then we multiply units and damage values for Poperty and Crop damage and replacing any NA values that are created due to no record of damage values with 0:

df.sub.3 <- mutate(df.sub.2, fullpropdmg = PROPDMG*propdmgexpn, fullcropdmg = CROPDMG*cropdmgexpn)
df.sub.3[is.na(df.sub.3)]<-0

Familiarising yourself with the Event Types

How many different events are there?

a <- unique(df.sub.3$EVTYPE)
length(a)

## [1] 982

There are 985 different events. ###Which events are most frequent?

c <- df.sub.3 %>% 
    group_by(EVTYPE) %>%
    summarise(total.count=n())
c.sort <- c[order(-c$total.count),]
head(c.sort, 10)

## Source: local data frame [10 x 2]
## 
##                EVTYPE total.count
## 1                HAIL      288585
## 2           TSTM WIND      219940
## 3   THUNDERSTORM WIND       82550
## 4             TORNADO       60635
## 5         FLASH FLOOD       54261
## 6               FLOOD       25315
## 7  THUNDERSTORM WINDS       20711
## 8           HIGH WIND       20211
## 9           LIGHTNING       15738
## 10         HEAVY SNOW       15702

So HAIL and TSTM WIND seem to be by far the most frequent events.

Results

Next we group the data set by event type and summarise the damage values per event into a new dataset called df.total.

 df.total <- df.sub.3 %>% 
    group_by(EVTYPE) %>%
    summarise(TOT.FATALITIES=sum(FATALITIES), TOT.INJURIES=sum(INJURIES),TOT.PROPDMG=sum(fullpropdmg),TOT.CROPDMG=sum(fullcropdmg))

Now we sort thesummarised damage values for each type of damage, health and economic, to then I dentify the top 10 most damaging event in terms of fatalities, injuries, crop damage and property damage.

library(knitr)
fatal.sort <- df.total[order(-df.total$TOT.FATALITIES),]
l<-head(fatal.sort, 10)
l1<-data.frame(l$EVTYPE,l$TOT.FATALITIES)
injuries.sort <- df.total[order(-df.total$TOT.INJURIES),]
m<-head(injuries.sort, 10)
m1<-data.frame(m$EVTYPE,m$TOT.INJURIES)
prop.sort <- df.total[order(-df.total$TOT.PROPDMG),]
n<-head(prop.sort, 10)
n1<-data.frame(n$EVTYPE,n$TOT.PROPDMG)
crop.sort <- df.total[order(-df.total$TOT.CROPDMG),]
o<-head(crop.sort, 10)
o1<-data.frame(o$EVTYPE,o$TOT.CROPDMG)

TOT<-data.frame(l1,m1,n1,o1)
names(TOT)<-c("Event", "Fatalities","Event","Injuries", "Event", "Propdmg", "Event", "Cropdmg")
kable(TOT,format = "html", align ="l")

Event	Fatalities	Event	Injuries	Event	Propdmg	Event	Cropdmg
TORNADO	5633	TORNADO	91311	FLOOD	144656987750	DROUGHT	13972566000
EXCESSIVE HEAT	1903	TSTM WIND	6957	HURRICANE/TYPHOON	69305840000	FLOOD	5656957900
FLASH FLOOD	978	FLOOD	6783	TORNADO	56914430480	RIVER FLOOD	5029459000
HEAT	937	EXCESSIVE HEAT	6525	STORM SURGE	43323536000	ICE STORM	5022113500
LIGHTNING	816	LIGHTNING	5230	FLASH FLOOD	16140124510	HAIL	3025036950
TSTM WIND	504	HEAT	2100	HAIL	15726839170	HURRICANE	2741910000
FLOOD	463	ICE STORM	1975	HURRICANE	11868319010	HURRICANE/TYPHOON	2607872800
RIP CURRENT	368	FLASH FLOOD	1775	TROPICAL STORM	7703890550	FLASH FLOOD	1421316600
HIGH WIND	248	THUNDERSTORM WIND	1488	WINTER STORM	6638497250	EXTREME COLD	1292973000
AVALANCHE	224	HAIL	1360	HIGH WIND	5270046260	FROST/FREEZE	1094086000

Looking closeat health damage

Which of the top 10 most damaging events for fatalieites were also in the top 10 for injuries

a <- which( l$EVTYPE %in% m$EVTYPE  )
l$EVTYPE[a]

## [1] TORNADO        EXCESSIVE HEAT FLASH FLOOD    HEAT          
## [5] LIGHTNING      TSTM WIND      FLOOD         
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... wet micoburst

What is the total health damage of the events that are in the top 10 for both fatalieites and injuries:

b <- which( m$EVTYPE %in% l$EVTYPE  )
fatal.injur<-merge(l[a,],m[b,],"EVTYPE")
total <- fatal.injur$TOT.FATALITIES.x+fatal.injur$TOT.INJURIES.y
top.fatal.injur<-data.frame(fatal.injur$EVTYPE,total)
top.fatal.injur[order(-top.fatal.injur[,2]),]

##   fatal.injur.EVTYPE total
## 6            TORNADO 96944
## 1     EXCESSIVE HEAT  8428
## 7          TSTM WIND  7461
## 3              FLOOD  7246
## 5          LIGHTNING  6046
## 4               HEAT  3037
## 2        FLASH FLOOD  2753

The above table shows that TORNADO and EXCESSIVE HEAT have the most damaging health effect both fatalities and injuries combined. However the third most damaging event is not TSTM WIND, but it is FLASH FLOOD, which is not in the top 10 of both fatalities and injuries, however combined it takes 3rd place as can be seen in the following aggregation:

df.total.health <- data.frame(df.total$EVTYPE, df.total$TOT.FATALITIES, df.total$TOT.INJURIES,df.total$TOT.FATALITIES+df.total$TOT.FATALITIES)
names(df.total.health)<-c("Event","Fatalities", "Injuries","Total")
df.total.health <- df.total.health[order(-df.total.health[,4]),]
head(df.total.health,10)

##              Event Fatalities Injuries Total
## 823        TORNADO       5633    91311 11266
## 124 EXCESSIVE HEAT       1903     6525  3806
## 151    FLASH FLOOD        978     1775  1956
## 271           HEAT        937     2100  1874
## 452      LIGHTNING        816     5230  1632
## 843      TSTM WIND        504     6957  1008
## 167          FLOOD        463     6783   926
## 571    RIP CURRENT        368      232   736
## 343      HIGH WIND        248     1137   496
## 19       AVALANCHE        224      170   448

plot <- as.matrix(df.total.health[,2:3])
barplot(head(plot,10),main="Total health damage",
  xlab="Event", col=c("darkblue","red"),
     legend = head(df.total.health$Event,10))

Looking closer at economic damage

Which of the top 10 most damaging events for crop where also in the top 10 for property damage

p <- which( o$EVTYPE %in% n$EVTYPE  )
o$EVTYPE[p]

## [1] FLOOD             HAIL              HURRICANE         HURRICANE/TYPHOON
## [5] FLASH FLOOD      
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... wet micoburst

What is the total crop damage of the events that are in the top 10 for both property and crop damage:

q <- which( n$EVTYPE %in% o$EVTYPE  )
crop.prop<-merge(o[p,],n[q,],"EVTYPE")
total <- crop.prop$TOT.CROPDMG.x+crop.prop$TOT.PROPDMG.y
top.crop.prop<-data.frame(crop.prop$EVTYPE,total)
top.crop.prop[order(-top.crop.prop[,2]),]

##    crop.prop.EVTYPE        total
## 2             FLOOD 150313945650
## 5 HURRICANE/TYPHOON  71913712800
## 3              HAIL  18751876120
## 1       FLASH FLOOD  17561441110
## 4         HURRICANE  14610229010

The above table shows that FLOOD and HURRICANE/TYPHOON have the most damaging economic effect when both crop and property damage is combined. However the third most damaging event is not HAIL as one can see that TORNADO causes a damage of 56925485480 in property damage alone which is higher than the HAIL value combined.This shows that while FLOOD and HURRICANE/TYPHOON are damaging for both property and crop, Tornado who is not in the top damaging events for crop, has enough economic damage to be the 3rd most damaging event in property alone. The same is true for STORM SURGE,which has a very strong effect on Property and not on Crop. One can therefore say that the top 5 damagin evetns for economic are FLOOD, HURRICANE/TYPHOON and typhoon.

df.total.cp <- data.frame(df.total$EVTYPE, df.total$TOT.PROPDMG, df.total$TOT.CROPDMG,df.total$TOT.PROPDMG+df.total$TOT.CROPDMG)
names(df.total.cp)<-c("Event","Property", "Crop","Total")
df.total.cp <- df.total.cp[order(-df.total.cp[,4]),]
head(df.total.cp,5)

##                 Event     Property       Crop        Total
## 167             FLOOD 144656987750 5656957900 150313945650
## 392 HURRICANE/TYPHOON  69305840000 2607872800  71913712800
## 823           TORNADO  56914430480  414953110  57329383590
## 655       STORM SURGE  43323536000       5000  43323541000
## 241              HAIL  15726839170 3025036950  18751876120

plot.1 <- as.matrix(df.total.cp[,2:3])
barplot(head(plot.1,5),main="Total economic damage",
  xlab="Event", col=c("darkblue","red"),
     legend = head(df.total.health$Event,5))

Conclusion

Overall one could say that TORNADO is probably the most harmful event.It takes number 1 place for health damage and 3rd place for economic damage. Otherwise the most damaging evetns for health and economic damage differ, with FLOOD and HURRICANE/TYPHOON being most harmful economically and EXCESSIVE HEAT and FLASH FLOOD health wise.

Stormdata Analysis

Veronika Rauch

26 Jul 2015