In this report we will look at the infliction of various calamities on population and economic(property+crop) damage over the years from 1950 to 2011. Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
The question we are interested in answering are as follows:
Across the United States, which types of events(EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The dataset was taken from National Weather Service Storm Data. There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
Loading libraries required for our analysis.
library(plyr)
library(dplyr)
library(ggplot2)
data<-read.csv("repdata_data_StormData.csv.bz2")
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
So we see there many unwanted variables that are redundant and we don’t need them at least not for our analysis. ### Data transformation
#variables we need to solve this analysis
dframe<-data[,c("STATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
dframe$CROPDMGEXP<-as.character(dframe$CROPDMGEXP)
dframe$PROPDMGEXP<-as.character(dframe$PROPDMGEXP)
head(dframe)
## STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 AL TORNADO 0 15 25.0 K 0
## 2 AL TORNADO 0 0 2.5 K 0
## 3 AL TORNADO 0 2 25.0 K 0
## 4 AL TORNADO 0 2 2.5 K 0
## 5 AL TORNADO 0 2 2.5 K 0
## 6 AL TORNADO 0 6 2.5 K 0
Now we add two more variables ,one would be total population damage variable showing the sum of injuries and fatalities caused by each calamity which will show total population damage and second variable would be total economic damage which was inflicted upon by the calamity.
dframe$POPU.DAMAGE<-dframe[,"FATALITIES"]+dframe[,"INJURIES"] #first variable
dframe$TOTAL.PROP.DAMAGE<-rep(0,902297) #second variable initialized with zero in all rows
Now we will subset from the dframe data frame.The rows in PROPDMGEXP and CROPDMGEXP have various factors of which we will only include denomination marked as K(for Thousand),M(for Million),B(for Billion) and rest levels would be droped.
deno<-c("K","M","B")
df<-filter(dframe,PROPDMGEXP %in% deno & CROPDMGEXP %in% deno)
We have our subset in df dataframe and now we can fill our TOTAL.PROP.DAMAGE column by adding columns CROPDMG and PROPDMG according to the denominations as indicated by CROPDMGEXP and PROPDMGEXP respectively.
for(i in 1:nrow(df)){
if(df$PROPDMGEXP[i] == "K"&&df$CROPDMGEXP[i] == "K")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000 + df$CROPDMG[i]*1000
else if(df$PROPDMGEXP[i]=="K"&&df$CROPDMGEXP[i] == "M")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000 + df$CROPDMG[i]*1000000
else if(df$PROPDMGEXP[i]=="K"&&df$CROPDMGEXP[i] == "B")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000 + df$CROPDMG[i]*1000000000
else if(df$PROPDMGEXP[i] == "M"&&df$CROPDMGEXP[i] == "K")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000 + df$CROPDMG[i]*1000
else if(df$PROPDMGEXP[i]=="M"&&df$CROPDMGEXP[i] == "M")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000 + df$CROPDMG[i]*1000000
else if(df$PROPDMGEXP[i]=="M"&& df$CROPDMGEXP[i] == "B")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000 + df$CROPDMG[i]*1000000000
else if(df$PROPDMGEXP[i] == "B"&&df$CROPDMGEXP[i] == "K")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000000 + df$CROPDMG[i]*1000
else if(df$PROPDMGEXP[i]=="B"&&df$CROPDMGEXP[i] == "M")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000000 + df$CROPDMG[i]*1000000
else if(df$PROPDMGEXP[i]=="B"&& df$CROPDMGEXP[i] == "B")
df$TOTAL.PROP.DAMAGE[i]<- df$PROPDMG[i]*1000000000 + df$CROPDMG[i]*1000000000
}
Now we have prepared our dataframe df for further analysis.
head(df)
## STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 AL HURRICANE OPAL/HIGH WINDS 2 0 0.1 B
## 2 AL THUNDERSTORM WINDS 0 0 5.0 M
## 3 AL HURRICANE ERIN 0 0 25.0 M
## 4 AL HURRICANE OPAL 0 0 48.0 M
## 5 AL THUNDERSTORM WINDS 0 0 50.0 K
## 6 AL THUNDERSTORM WINDS 0 2 500.0 K
## CROPDMG CROPDMGEXP POPU.DAMAGE TOTAL.PROP.DAMAGE
## 1 10 M 2 1.1e+08
## 2 500 K 0 5.5e+06
## 3 1 M 0 2.6e+07
## 4 4 M 0 5.2e+07
## 5 50 K 0 1.0e+05
## 6 50 K 2 5.5e+05
Filter df on the basis of EVTYPE and reassign it to df.
df1<-df%>%group_by(EVTYPE)%>%summarise(TOT.POPU.DAMAGE.PER.EVTYPE=sum(POPU.DAMAGE,na.rm = T),TOT.ECONOMIC.DAMAGE.PER.EVTYPE=sum(TOTAL.PROP.DAMAGE,na.rm = T))
summary(df1)
## EVTYPE TOT.POPU.DAMAGE.PER.EVTYPE
## ASTRONOMICAL HIGH TIDE: 1 Min. : 0.0
## ASTRONOMICAL LOW TIDE : 1 1st Qu.: 0.0
## AVALANCHE : 1 Median : 1.0
## BLIZZARD : 1 Mean : 272.9
## COASTAL FLOOD : 1 3rd Qu.: 39.5
## COASTAL FLOODING : 1 Max. :13024.0
## (Other) :117
## TOT.ECONOMIC.DAMAGE.PER.EVTYPE
## Min. :0.000e+00
## 1st Qu.:2.776e+05
## Median :4.264e+06
## Mean :2.125e+09
## 3rd Qu.:1.555e+08
## Max. :1.380e+11
##
After seeing summary we see that max fatalities in US is 13000+ and still mean is just 200+ so we will divide this dataset into 2 parts one part with total fatalities >1000 and second set with fatalaties<1000 to get a clear picture. Since we are interested in knowing which event was most harmful to population so we will only look fatalities >1000
df2<-filter(df1,TOT.POPU.DAMAGE.PER.EVTYPE>=1000)
df2
## # A tibble: 8 x 3
## EVTYPE TOT.POPU.DAMAGE.PER.EVTYPE TOT.ECONOMIC.DAMAGE.PER.EVTYPE
## <fct> <dbl> <dbl>
## 1 EXCESSIVE HEAT 1070 493803200
## 2 FLASH FLOOD 1054 8715295130
## 3 FLOOD 6756 138007444500
## 4 HEAT 1376 2390000
## 5 ICE STORM 1629 5925147300
## 6 LIGHTNING 1183 320786130
## 7 THUNDERSTORM WIND 1542 3813647990
## 8 TORNADO 13024 16520148150
g1<-ggplot(df2,aes(EVTYPE,TOT.POPU.DAMAGE.PER.EVTYPE,fill=EVTYPE))
g1+geom_bar(stat = "identity",show.legend = F)+theme_bw() +geom_text(aes(label=TOT.POPU.DAMAGE.PER.EVTYPE),vjust=-0.3)+ggtitle("POPULATION DAMAGE(>1000)")+ylab("POPULATION DAMAGE PER EVTYPE")+ theme(axis.text.x = element_text(angle = 60, hjust = 1))
Here we can see that max damage is caused by tornado and is followed by flood and then ice storm. Therefore population wise ranking of top 5 event causing max damage is:
Data analysis on this dataset shows that Tornado is by far population damage wise the most devastating calamity in US when compared to other calamities. When we sum up the economic loss of all calamities other than Tornado, this figure is no where close to loss of lives inflicted by Tornado.
So government should be prepared for Tornado and should invest in proper technolohgy and infrastructure that could inform people beforehand about such calamities (esp. Tornado) and reduce the Tornado’s impact on human lives.
Now in the TOT.ECONOMIC.DAMAGE.PER.EVTYPE column we see alot variation and since we are interested in answering which types of events have the greatest economic consequences so well filter out the rows showing damage greater than 2B$.
df3<-filter(df1,TOT.ECONOMIC.DAMAGE.PER.EVTYPE>2000000000)
df3
## # A tibble: 13 x 3
## EVTYPE TOT.POPU.DAMAGE.PER.EVTYPE TOT.ECONOMIC.DAMAGE.PER.EVTYPE
## <fct> <dbl> <dbl>
## 1 FLASH FLOOD 1054 8715295130
## 2 FLOOD 6756 138007444500
## 3 HAIL 305 10019978590
## 4 HIGH WIND 418 3057666640
## 5 HURRICANE 64 12405268000
## 6 HURRICANE OPAL 1 2157000000
## 7 HURRICANE/TYPHOON 949 29348167800
## 8 ICE STORM 1629 5925147300
## 9 RIVER FLOOD 0 10108369000
## 10 STORM SURGE/TIDE 16 4641493000
## 11 THUNDERSTORM WIND 1542 3813647990
## 12 TORNADO 13024 16520148150
## 13 WILDFIRE 614 3684468370
g2<-ggplot(df3,aes(EVTYPE,TOT.ECONOMIC.DAMAGE.PER.EVTYPE/1000000000,fill=EVTYPE))
g2+geom_bar(stat = "identity",show.legend = F)+theme_bw() +geom_text(aes(label=round(TOT.ECONOMIC.DAMAGE.PER.EVTYPE/1000000000,2)),vjust=-0.3)+ggtitle("ECONOMIC DAMAGE(>2B$)")+ylab("ECONOMIC DAMAGE PER EVTYPE(in Billion Dollars)")+ theme(axis.text.x = element_text(angle = 60, hjust = 1))
Max economic destruction caused by flood with 138.01B dollars followed by hurricane/typhoon with 29.34B dollars loss 3rd place is occupied by tornado with 16.52B$ loss of property. Therefore property wise ranking of top 5 event causing max damage in dollar are:
Data analysis on this dataset shows that flood is by far economically most devastating calamity in US when compared to other calamities. When we sum up the economic loss of all calamities other than flood , this figure is no where close to economic loss inflicted by flood.
So government should be prepared for flood more than anything and should invest in proper infrastructure that could reduce the flood’s economic impact.