This Stormdata Analysis is looking to find the most harmful nature events for both human health and the economy. It does so by aggregating the effects of nature events from 1950 and end in November 2011. In the following Analysis one will be able to see that the economic damage on property is far more substantial than the one on crop and that not necessairly the same events lead to fatalities as they do to injuries.
Loading the Data into R from the downloaded CSV file, using the read.csv command. Also the command cache=TRUE is used, so that the data does not need to be reloaded. Loading necessary packages:
library(dplyr)
library(knitr)
setwd("~/Downloads")
df <- read.csv("repdata-data-StormData.csv.bz2")
head(df)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
dim(df)
## [1] 902297 37
names(df)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df.sub <- select(df, BGN_DATE, STATE, EVTYPE,END_DATE, MAG, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)
CROPDMGEXP and PROPDMGEXP are the unit variables for property and crop damage. These need to be taken into account to gauge the actual value of damage caused. Lets first look at the differnet types of units.
unique(df.sub$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: 0 2 ? B K M k m
From the documntaion it can be taken that K=1000, M=1000000, B=1000000000. However there is no documentation of the other variables. Lets see how frequent they are and if they can be excluded.
dim(filter(df.sub,CROPDMGEXP=="k"))
## [1] 21 11
dim(filter(df.sub,CROPDMGEXP=="?"))
## [1] 7 11
dim(filter(df.sub,CROPDMGEXP=="m"))
## [1] 1 11
dim(filter(df.sub,CROPDMGEXP=="2"))
## [1] 1 11
dim(filter(df.sub,CROPDMGEXP=="0"))
## [1] 19 11
So all together only 49 observations have units of this scale compared to K, which has 281832 observations and M which has 1994. Therefore it seems justified to remove these observations with unknown untis from the dataset instead of making unfounded assumptions.
df.sub.1 <- df.sub[ ! df.sub$CROPDMGEXP %in% c("?","m", "k","2", "0"), ]
unique(df.sub.1$PROPDMGEXP)
## [1] K M B + 0 5 m 6 ? 4 2 3 h 7 H - 1 8
## Levels: + - 0 1 2 3 4 5 6 7 8 ? B H K M h m
dim(filter(df.sub.1,PROPDMGEXP=="-"))
## [1] 1 11
dim(filter(df.sub.1,PROPDMGEXP=="+"))
## [1] 5 11
dim(filter(df.sub.1,PROPDMGEXP=="m"))
## [1] 6 11
dim(filter(df.sub.1,PROPDMGEXP=="h"))
## [1] 1 11
dim(filter(df.sub.1,PROPDMGEXP=="k"))
## [1] 0 11
dim(filter(df.sub.1,PROPDMGEXP=="?"))
## [1] 8 11
dim(filter(df.sub.1,PROPDMGEXP=="1"))
## [1] 25 11
dim(filter(df.sub.1,PROPDMGEXP=="2"))
## [1] 13 11
dim(filter(df.sub.1,PROPDMGEXP=="3"))
## [1] 4 11
dim(filter(df.sub.1,PROPDMGEXP=="4"))
## [1] 4 11
dim(filter(df.sub.1,PROPDMGEXP=="5"))
## [1] 28 11
dim(filter(df.sub.1,PROPDMGEXP=="6"))
## [1] 4 11
dim(filter(df.sub.1,PROPDMGEXP=="7"))
## [1] 5 11
dim(filter(df.sub.1,PROPDMGEXP=="8"))
## [1] 1 11
In the case of proeprty damage there are a few more units that are unknown. However again, K,M and B consititue the highest obseravtion count with 435993 and as we have no certain documentation for what the other units stand for it is safer to remove those.
df.sub.2 <- df.sub[ ! df.sub.1$PROPDMGEXP %in% c("?","m", "+","-","h", "1", "2", "3","4","5","6","7","8", "0"), ]
Now we replace the units by their actual values by creating two new variables cropdmgexpn and propdmgexpn:
df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "K"] <- 1000
df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "M"] <- 1e+06
df.sub.2$cropdmgexpn[df.sub.2$CROPDMGEXP == "B"] <- 1e+09
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "K"] <- 1000
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "M"] <- 1e+06
df.sub.2$propdmgexpn[df.sub.2$PROPDMGEXP == "B"] <- 1e+09
Then we multiply units and damage values for Poperty and Crop damage and replacing any NA values that are created due to no record of damage values with 0:
df.sub.3 <- mutate(df.sub.2, fullpropdmg = PROPDMG*propdmgexpn, fullcropdmg = CROPDMG*cropdmgexpn)
df.sub.3[is.na(df.sub.3)]<-0
a <- unique(df.sub.3$EVTYPE)
length(a)
## [1] 982
There are 985 different events. ###Which events are most frequent?
c <- df.sub.3 %>%
group_by(EVTYPE) %>%
summarise(total.count=n())
c.sort <- c[order(-c$total.count),]
head(c.sort, 10)
## Source: local data frame [10 x 2]
##
## EVTYPE total.count
## 1 HAIL 288585
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82550
## 4 TORNADO 60635
## 5 FLASH FLOOD 54261
## 6 FLOOD 25315
## 7 THUNDERSTORM WINDS 20711
## 8 HIGH WIND 20211
## 9 LIGHTNING 15738
## 10 HEAVY SNOW 15702
So HAIL and TSTM WIND seem to be by far the most frequent events.
Next we group the data set by event type and summarise the damage values per event into a new dataset called df.total.
df.total <- df.sub.3 %>%
group_by(EVTYPE) %>%
summarise(TOT.FATALITIES=sum(FATALITIES), TOT.INJURIES=sum(INJURIES),TOT.PROPDMG=sum(fullpropdmg),TOT.CROPDMG=sum(fullcropdmg))
Now we sort thesummarised damage values for each type of damage, health and economic, to then I dentify the top 10 most damaging event in terms of fatalities, injuries, crop damage and property damage.
library(knitr)
fatal.sort <- df.total[order(-df.total$TOT.FATALITIES),]
l<-head(fatal.sort, 10)
l1<-data.frame(l$EVTYPE,l$TOT.FATALITIES)
injuries.sort <- df.total[order(-df.total$TOT.INJURIES),]
m<-head(injuries.sort, 10)
m1<-data.frame(m$EVTYPE,m$TOT.INJURIES)
prop.sort <- df.total[order(-df.total$TOT.PROPDMG),]
n<-head(prop.sort, 10)
n1<-data.frame(n$EVTYPE,n$TOT.PROPDMG)
crop.sort <- df.total[order(-df.total$TOT.CROPDMG),]
o<-head(crop.sort, 10)
o1<-data.frame(o$EVTYPE,o$TOT.CROPDMG)
TOT<-data.frame(l1,m1,n1,o1)
names(TOT)<-c("Event", "Fatalities","Event","Injuries", "Event", "Propdmg", "Event", "Cropdmg")
kable(TOT,format = "html", align ="l")
| Event | Fatalities | Event | Injuries | Event | Propdmg | Event | Cropdmg |
|---|---|---|---|---|---|---|---|
| TORNADO | 5633 | TORNADO | 91311 | FLOOD | 144656987750 | DROUGHT | 13972566000 |
| EXCESSIVE HEAT | 1903 | TSTM WIND | 6957 | HURRICANE/TYPHOON | 69305840000 | FLOOD | 5656957900 |
| FLASH FLOOD | 978 | FLOOD | 6783 | TORNADO | 56914430480 | RIVER FLOOD | 5029459000 |
| HEAT | 937 | EXCESSIVE HEAT | 6525 | STORM SURGE | 43323536000 | ICE STORM | 5022113500 |
| LIGHTNING | 816 | LIGHTNING | 5230 | FLASH FLOOD | 16140124510 | HAIL | 3025036950 |
| TSTM WIND | 504 | HEAT | 2100 | HAIL | 15726839170 | HURRICANE | 2741910000 |
| FLOOD | 463 | ICE STORM | 1975 | HURRICANE | 11868319010 | HURRICANE/TYPHOON | 2607872800 |
| RIP CURRENT | 368 | FLASH FLOOD | 1775 | TROPICAL STORM | 7703890550 | FLASH FLOOD | 1421316600 |
| HIGH WIND | 248 | THUNDERSTORM WIND | 1488 | WINTER STORM | 6638497250 | EXTREME COLD | 1292973000 |
| AVALANCHE | 224 | HAIL | 1360 | HIGH WIND | 5270046260 | FROST/FREEZE | 1094086000 |
Which of the top 10 most damaging events for fatalieites were also in the top 10 for injuries
a <- which( l$EVTYPE %in% m$EVTYPE )
l$EVTYPE[a]
## [1] TORNADO EXCESSIVE HEAT FLASH FLOOD HEAT
## [5] LIGHTNING TSTM WIND FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... wet micoburst
What is the total health damage of the events that are in the top 10 for both fatalieites and injuries:
b <- which( m$EVTYPE %in% l$EVTYPE )
fatal.injur<-merge(l[a,],m[b,],"EVTYPE")
total <- fatal.injur$TOT.FATALITIES.x+fatal.injur$TOT.INJURIES.y
top.fatal.injur<-data.frame(fatal.injur$EVTYPE,total)
top.fatal.injur[order(-top.fatal.injur[,2]),]
## fatal.injur.EVTYPE total
## 6 TORNADO 96944
## 1 EXCESSIVE HEAT 8428
## 7 TSTM WIND 7461
## 3 FLOOD 7246
## 5 LIGHTNING 6046
## 4 HEAT 3037
## 2 FLASH FLOOD 2753
The above table shows that TORNADO and EXCESSIVE HEAT have the most damaging health effect both fatalities and injuries combined. However the third most damaging event is not TSTM WIND, but it is FLASH FLOOD, which is not in the top 10 of both fatalities and injuries, however combined it takes 3rd place as can be seen in the following aggregation:
df.total.health <- data.frame(df.total$EVTYPE, df.total$TOT.FATALITIES, df.total$TOT.INJURIES,df.total$TOT.FATALITIES+df.total$TOT.FATALITIES)
names(df.total.health)<-c("Event","Fatalities", "Injuries","Total")
df.total.health <- df.total.health[order(-df.total.health[,4]),]
head(df.total.health,10)
## Event Fatalities Injuries Total
## 823 TORNADO 5633 91311 11266
## 124 EXCESSIVE HEAT 1903 6525 3806
## 151 FLASH FLOOD 978 1775 1956
## 271 HEAT 937 2100 1874
## 452 LIGHTNING 816 5230 1632
## 843 TSTM WIND 504 6957 1008
## 167 FLOOD 463 6783 926
## 571 RIP CURRENT 368 232 736
## 343 HIGH WIND 248 1137 496
## 19 AVALANCHE 224 170 448
plot <- as.matrix(df.total.health[,2:3])
barplot(head(plot,10),main="Total health damage",
xlab="Event", col=c("darkblue","red"),
legend = head(df.total.health$Event,10))
Which of the top 10 most damaging events for crop where also in the top 10 for property damage
p <- which( o$EVTYPE %in% n$EVTYPE )
o$EVTYPE[p]
## [1] FLOOD HAIL HURRICANE HURRICANE/TYPHOON
## [5] FLASH FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... wet micoburst
What is the total crop damage of the events that are in the top 10 for both property and crop damage:
q <- which( n$EVTYPE %in% o$EVTYPE )
crop.prop<-merge(o[p,],n[q,],"EVTYPE")
total <- crop.prop$TOT.CROPDMG.x+crop.prop$TOT.PROPDMG.y
top.crop.prop<-data.frame(crop.prop$EVTYPE,total)
top.crop.prop[order(-top.crop.prop[,2]),]
## crop.prop.EVTYPE total
## 2 FLOOD 150313945650
## 5 HURRICANE/TYPHOON 71913712800
## 3 HAIL 18751876120
## 1 FLASH FLOOD 17561441110
## 4 HURRICANE 14610229010
The above table shows that FLOOD and HURRICANE/TYPHOON have the most damaging economic effect when both crop and property damage is combined. However the third most damaging event is not HAIL as one can see that TORNADO causes a damage of 56925485480 in property damage alone which is higher than the HAIL value combined.This shows that while FLOOD and HURRICANE/TYPHOON are damaging for both property and crop, Tornado who is not in the top damaging events for crop, has enough economic damage to be the 3rd most damaging event in property alone. The same is true for STORM SURGE,which has a very strong effect on Property and not on Crop. One can therefore say that the top 5 damagin evetns for economic are FLOOD, HURRICANE/TYPHOON and typhoon.
df.total.cp <- data.frame(df.total$EVTYPE, df.total$TOT.PROPDMG, df.total$TOT.CROPDMG,df.total$TOT.PROPDMG+df.total$TOT.CROPDMG)
names(df.total.cp)<-c("Event","Property", "Crop","Total")
df.total.cp <- df.total.cp[order(-df.total.cp[,4]),]
head(df.total.cp,5)
## Event Property Crop Total
## 167 FLOOD 144656987750 5656957900 150313945650
## 392 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 823 TORNADO 56914430480 414953110 57329383590
## 655 STORM SURGE 43323536000 5000 43323541000
## 241 HAIL 15726839170 3025036950 18751876120
plot.1 <- as.matrix(df.total.cp[,2:3])
barplot(head(plot.1,5),main="Total economic damage",
xlab="Event", col=c("darkblue","red"),
legend = head(df.total.health$Event,5))
Overall one could say that TORNADO is probably the most harmful event.It takes number 1 place for health damage and 3rd place for economic damage. Otherwise the most damaging evetns for health and economic damage differ, with FLOOD and HURRICANE/TYPHOON being most harmful economically and EXCESSIVE HEAT and FLASH FLOOD health wise.