Synopsis

This Stormdata Analysis is looking to find the most harmful nature events for both human health and the economy. It does so by aggregating the effects of nature events from 1950 and end in November 2011. In the following Analysis one will be able to see that the economic damage on property is far more substantial than the one on crop and that not necessairly the same events lead to fatalities as they do to injuries.

Data Processing

Loading the Data into R from the downloaded CSV file, using the read.csv command. Also the command cache=TRUE is used, so that the data does not need to be reloaded. Loading necessary packages:

library(dplyr)
library(knitr)
setwd("~/Downloads")
df <- read.csv("repdata-data-StormData.csv.bz2")

Familiarising yourself with the dataset

head(df)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
dim(df)
## [1] 902297     37
names(df)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Subsetting the dataset by choosing the columns relevatn to determine most harmful events

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df.sub <- select(df, BGN_DATE, STATE, EVTYPE,END_DATE, MAG, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)

The data and PROPDMGEXP/CROPDMGEXP variables

CROPDMGEXP and PROPDMGEXP are the unit variables for property and crop damage. These need to be taken into account to gauge the actual value of damage caused. Lets first look at the differnet types of units.

unique(df.sub$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  0 2 ? B K M k m

From the documntaion it can be taken that K=1000, M=1000000, B=1000000000. However there is no documentation of the other variables. Lets see how frequent they are and if they can be excluded.

dim(filter(df.sub,CROPDMGEXP=="k"))
## [1] 21 11
dim(filter(df.sub,CROPDMGEXP=="?"))
## [1]  7 11
dim(filter(df.sub,CROPDMGEXP=="m"))
## [1]  1 11
dim(filter(df.sub,CROPDMGEXP=="2"))
## [1]  1 11
dim(filter(df.sub,CROPDMGEXP=="0"))
## [1] 19 11

So all together only 49 observations have units of this scale compared to K, which has 281832 observations and M which has 1994. Therefore it seems justified to remove these observations with unknown untis from the dataset instead of making unfounded assumptions.

df.sub.1 <- df.sub[ ! df.sub$CROPDMGEXP %in% c("?","m", "k","2", "0"), ]
unique(df.sub.1$PROPDMGEXP)
##  [1] K M   B + 0 5 m 6 ? 4 2 3 h 7 H - 1 8
## Levels:  + - 0 1 2 3 4 5 6 7 8 ? B H K M h m
dim(filter(df.sub.1,PROPDMGEXP=="-"))
## [1]  1 11
dim(filter(df.sub.1,PROPDMGEXP=="+"))
## [1]  5 11
dim(filter(df.sub.1,PROPDMGEXP=="m"))
## [1]  6 11
dim(filter(df.sub.1,PROPDMGEXP=="h"))
## [1]  1 11
dim(filter(df.sub.1,PROPDMGEXP=="k"))
## [1]  0 11
dim(filter(df.sub.1,PROPDMGEXP=="?"))
## [1]  8 11
dim(filter(df.sub.1,PROPDMGEXP=="1"))
## [1] 25 11
dim(filter(df.sub.1,PROPDMGEXP=="2"))
## [1] 13 11
dim(filter(df.sub.1,PROPDMGEXP=="3"))
## [1]  4 11
dim(filter(df.sub.1,PROPDMGEXP=="4"))
## [1]  4 11
dim(filter(df.sub.1,PROPDMGEXP=="5"))
## [1] 28 11
dim(filter(df.sub.1,PROPDMGEXP=="6"))
## [1]  4 11
dim(filter(df.sub.1,PROPDMGEXP=="7"))
## [1]  5 11
dim(filter(df.sub.1,PROPDMGEXP=="8"))
## [1]  1 11

In the case of proeprty damage there are a few more units that are unknown. However again, K,M and B consititue the highest obseravtion count with 435993 and as we have no certain documentation for what the other units stand for it is safer to remove those.

df.sub.2 <- df.sub[ ! df.sub.1$PROPDMGEXP %in% c("?","m", "+","-","h", "1", "2", "3","4","5","6","7","8", "0"), ]

Now we replace the units by their actual values by creating two new variables cropdmgexpn and propdmgexpn:

df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "K"] <- 1000
df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "M"] <- 1e+06
df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "B"] <- 1e+09
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "K"] <- 1000
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "M"] <- 1e+06
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "B"] <- 1e+09

Then we multiply units and damage values for Poperty and Crop damage and replacing any NA values that are created due to no record of damage values with 0:

df.sub.3 <- mutate(df.sub.2, fullpropdmg = PROPDMG*propdmgexpn, fullcropdmg = CROPDMG*cropdmgexpn)
df.sub.3[is.na(df.sub.3)]<-0

Familiarising yourself with the Event Types

How many different events are there?

a <- unique(df.sub.3$EVTYPE)
length(a)
## [1] 982

There are 985 different events. ###Which events are most frequent?

c <- df.sub.3 %>% 
    group_by(EVTYPE) %>%
    summarise(total.count=n())
c.sort <- c[order(-c$total.count),]
head(c.sort, 10)
## Source: local data frame [10 x 2]
## 
##                EVTYPE total.count
## 1                HAIL      288585
## 2           TSTM WIND      219940
## 3   THUNDERSTORM WIND       82550
## 4             TORNADO       60635
## 5         FLASH FLOOD       54261
## 6               FLOOD       25315
## 7  THUNDERSTORM WINDS       20711
## 8           HIGH WIND       20211
## 9           LIGHTNING       15738
## 10         HEAVY SNOW       15702

So HAIL and TSTM WIND seem to be by far the most frequent events.

Results

Next we group the data set by event type and summarise the damage values per event into a new dataset called df.total.

 df.total <- df.sub.3 %>% 
    group_by(EVTYPE) %>%
    summarise(TOT.FATALITIES=sum(FATALITIES), TOT.INJURIES=sum(INJURIES),TOT.PROPDMG=sum(fullpropdmg),TOT.CROPDMG=sum(fullcropdmg))

Now we sort thesummarised damage values for each type of damage, health and economic, to then I dentify the top 10 most damaging event in terms of fatalities, injuries, crop damage and property damage.

library(knitr)
fatal.sort <- df.total[order(-df.total$TOT.FATALITIES),]
l<-head(fatal.sort, 10)
l1<-data.frame(l$EVTYPE,l$TOT.FATALITIES)
injuries.sort <- df.total[order(-df.total$TOT.INJURIES),]
m<-head(injuries.sort, 10)
m1<-data.frame(m$EVTYPE,m$TOT.INJURIES)
prop.sort <- df.total[order(-df.total$TOT.PROPDMG),]
n<-head(prop.sort, 10)
n1<-data.frame(n$EVTYPE,n$TOT.PROPDMG)
crop.sort <- df.total[order(-df.total$TOT.CROPDMG),]
o<-head(crop.sort, 10)
o1<-data.frame(o$EVTYPE,o$TOT.CROPDMG)

TOT<-data.frame(l1,m1,n1,o1)
names(TOT)<-c("Event", "Fatalities","Event","Injuries", "Event", "Propdmg", "Event", "Cropdmg")
kable(TOT,format = "html", align ="l")
Event Fatalities Event Injuries Event Propdmg Event Cropdmg
TORNADO 5633 TORNADO 91311 FLOOD 144656987750 DROUGHT 13972566000
EXCESSIVE HEAT 1903 TSTM WIND 6957 HURRICANE/TYPHOON 69305840000 FLOOD 5656957900
FLASH FLOOD 978 FLOOD 6783 TORNADO 56914430480 RIVER FLOOD 5029459000
HEAT 937 EXCESSIVE HEAT 6525 STORM SURGE 43323536000 ICE STORM 5022113500
LIGHTNING 816 LIGHTNING 5230 FLASH FLOOD 16140124510 HAIL 3025036950
TSTM WIND 504 HEAT 2100 HAIL 15726839170 HURRICANE 2741910000
FLOOD 463 ICE STORM 1975 HURRICANE 11868319010 HURRICANE/TYPHOON 2607872800
RIP CURRENT 368 FLASH FLOOD 1775 TROPICAL STORM 7703890550 FLASH FLOOD 1421316600
HIGH WIND 248 THUNDERSTORM WIND 1488 WINTER STORM 6638497250 EXTREME COLD 1292973000
AVALANCHE 224 HAIL 1360 HIGH WIND 5270046260 FROST/FREEZE 1094086000

Looking closeat health damage

Which of the top 10 most damaging events for fatalieites were also in the top 10 for injuries

a <- which( l$EVTYPE %in% m$EVTYPE  )
l$EVTYPE[a]
## [1] TORNADO        EXCESSIVE HEAT FLASH FLOOD    HEAT          
## [5] LIGHTNING      TSTM WIND      FLOOD         
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... wet micoburst

What is the total health damage of the events that are in the top 10 for both fatalieites and injuries:

b <- which( m$EVTYPE %in% l$EVTYPE  )
fatal.injur<-merge(l[a,],m[b,],"EVTYPE")
total <- fatal.injur$TOT.FATALITIES.x+fatal.injur$TOT.INJURIES.y
top.fatal.injur<-data.frame(fatal.injur$EVTYPE,total)
top.fatal.injur[order(-top.fatal.injur[,2]),]
##   fatal.injur.EVTYPE total
## 6            TORNADO 96944
## 1     EXCESSIVE HEAT  8428
## 7          TSTM WIND  7461
## 3              FLOOD  7246
## 5          LIGHTNING  6046
## 4               HEAT  3037
## 2        FLASH FLOOD  2753

The above table shows that TORNADO and EXCESSIVE HEAT have the most damaging health effect both fatalities and injuries combined. However the third most damaging event is not TSTM WIND, but it is FLASH FLOOD, which is not in the top 10 of both fatalities and injuries, however combined it takes 3rd place as can be seen in the following aggregation:

df.total.health <- data.frame(df.total$EVTYPE, df.total$TOT.FATALITIES, df.total$TOT.INJURIES,df.total$TOT.FATALITIES+df.total$TOT.FATALITIES)
names(df.total.health)<-c("Event","Fatalities", "Injuries","Total")
df.total.health <- df.total.health[order(-df.total.health[,4]),]
head(df.total.health,10)
##              Event Fatalities Injuries Total
## 823        TORNADO       5633    91311 11266
## 124 EXCESSIVE HEAT       1903     6525  3806
## 151    FLASH FLOOD        978     1775  1956
## 271           HEAT        937     2100  1874
## 452      LIGHTNING        816     5230  1632
## 843      TSTM WIND        504     6957  1008
## 167          FLOOD        463     6783   926
## 571    RIP CURRENT        368      232   736
## 343      HIGH WIND        248     1137   496
## 19       AVALANCHE        224      170   448
plot <- as.matrix(df.total.health[,2:3])
barplot(head(plot,10),main="Total health damage",
  xlab="Event", col=c("darkblue","red"),
     legend = head(df.total.health$Event,10))

Looking closer at economic damage

Which of the top 10 most damaging events for crop where also in the top 10 for property damage

p <- which( o$EVTYPE %in% n$EVTYPE  )
o$EVTYPE[p]
## [1] FLOOD             HAIL              HURRICANE         HURRICANE/TYPHOON
## [5] FLASH FLOOD      
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... wet micoburst

What is the total crop damage of the events that are in the top 10 for both property and crop damage:

q <- which( n$EVTYPE %in% o$EVTYPE  )
crop.prop<-merge(o[p,],n[q,],"EVTYPE")
total <- crop.prop$TOT.CROPDMG.x+crop.prop$TOT.PROPDMG.y
top.crop.prop<-data.frame(crop.prop$EVTYPE,total)
top.crop.prop[order(-top.crop.prop[,2]),]
##    crop.prop.EVTYPE        total
## 2             FLOOD 150313945650
## 5 HURRICANE/TYPHOON  71913712800
## 3              HAIL  18751876120
## 1       FLASH FLOOD  17561441110
## 4         HURRICANE  14610229010

The above table shows that FLOOD and HURRICANE/TYPHOON have the most damaging economic effect when both crop and property damage is combined. However the third most damaging event is not HAIL as one can see that TORNADO causes a damage of 56925485480 in property damage alone which is higher than the HAIL value combined.This shows that while FLOOD and HURRICANE/TYPHOON are damaging for both property and crop, Tornado who is not in the top damaging events for crop, has enough economic damage to be the 3rd most damaging event in property alone. The same is true for STORM SURGE,which has a very strong effect on Property and not on Crop. One can therefore say that the top 5 damagin evetns for economic are FLOOD, HURRICANE/TYPHOON and typhoon.

df.total.cp <- data.frame(df.total$EVTYPE, df.total$TOT.PROPDMG, df.total$TOT.CROPDMG,df.total$TOT.PROPDMG+df.total$TOT.CROPDMG)
names(df.total.cp)<-c("Event","Property", "Crop","Total")
df.total.cp <- df.total.cp[order(-df.total.cp[,4]),]
head(df.total.cp,5)
##                 Event     Property       Crop        Total
## 167             FLOOD 144656987750 5656957900 150313945650
## 392 HURRICANE/TYPHOON  69305840000 2607872800  71913712800
## 823           TORNADO  56914430480  414953110  57329383590
## 655       STORM SURGE  43323536000       5000  43323541000
## 241              HAIL  15726839170 3025036950  18751876120
plot.1 <- as.matrix(df.total.cp[,2:3])
barplot(head(plot.1,5),main="Total economic damage",
  xlab="Event", col=c("darkblue","red"),
     legend = head(df.total.health$Event,5))

Conclusion

Overall one could say that TORNADO is probably the most harmful event.It takes number 1 place for health damage and 3rd place for economic damage. Otherwise the most damaging evetns for health and economic damage differ, with FLOOD and HURRICANE/TYPHOON being most harmful economically and EXCESSIVE HEAT and FLASH FLOOD health wise.