Synopsis

This report analyzes the impact of storm events on population health and their economic cost. The analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which covers 1950 to 2011 and tracks major storms and weather events along with estimates of any fatalities, injuries, and property damage. The impact on population health was determined by looking at the basic event types and also at broader event categories. Tornadoes caused a staggering 37% of the total fatalities reported and an even higher 65% of the injuries. Property damage caused by storm events was much larger in magnitude than crop damage and contributed most of the economic impact. Floods and hurricanes were the leading causes of economic damage from storm events recorded in the database.

Data Processing

The storm database (47 MB) was downloaded from the NOAA site as a compressed .bz2 file (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). This was done once and the file was stored locally to save download time. The database is in CSV format. The working directory is set to the directory where the storm data file was saved, and the file is then read into a variable stormdata using the read.csv() command, which can read compressed files automatically.

setwd("~/Coursera/DataScience/Projects/ReproducibleResearch/RepData_PeerAssessment2")
stormdata<-read.csv("repdata-data-StormData.csv.bz2")
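
The download-once step described above is not shown in the report's code. A minimal sketch of how it might look follows; the helper name fetch_once and its downloader argument are assumptions introduced here for illustration, not part of the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
# `downloader` defaults to download.file; it is a parameter so the logic can
# be exercised without network access.
fetch_once <- function(url, dest, downloader = download.file) {
  if (!file.exists(dest)) {
    downloader(url, destfile = dest, mode = "wb")  # binary mode keeps the .bz2 intact
    return(TRUE)   # a download happened
  }
  FALSE            # the cached copy was reused
}

# Usage with the URL and file name from this report (not run here):
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "repdata-data-StormData.csv.bz2")
# stormdata <- read.csv("repdata-data-StormData.csv.bz2")
```

Guarding on file.exists() keeps the report reproducible from a clean directory while avoiding a repeated 47 MB download on every run.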

Here is a basic summary of the stormdata dataset, which contains 37 variables and more than 900,000 observations.

str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The analysis needs only a few of the 37 variables: the type of event, the population health information, and the economic damage indicators. The variable “EVTYPE” contains the type of event; “FATALITIES” and “INJURIES” contain the population health impact of that event; and “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, and “CROPDMGEXP” contain information about the monetary damage due to each event. A smaller dataset, aData, is therefore created with just this subset of the original variables. A summary of aData is shown below.

aData<-subset(stormdata, select=c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"))
summary(aData)
##                EVTYPE         FATALITIES     INJURIES       PROPDMG    
##  HAIL             :288661   Min.   :  0   Min.   :   0   Min.   :   0  
##  TSTM WIND        :219940   1st Qu.:  0   1st Qu.:   0   1st Qu.:   0  
##  THUNDERSTORM WIND: 82563   Median :  0   Median :   0   Median :   0  
##  TORNADO          : 60652   Mean   :  0   Mean   :   0   Mean   :  12  
##  FLASH FLOOD      : 54277   3rd Qu.:  0   3rd Qu.:   0   3rd Qu.:   0  
##  FLOOD            : 25326   Max.   :583   Max.   :1700   Max.   :5000  
##  (Other)          :170878                                              
##    PROPDMGEXP        CROPDMG      CROPDMGEXP    
##         :465934   Min.   :  0          :618413  
##  K      :424665   1st Qu.:  0   K      :281832  
##  M      : 11330   Median :  0   M      :  1994  
##  0      :   216   Mean   :  2   k      :    21  
##  B      :    40   3rd Qu.:  0   0      :    19  
##  5      :    28   Max.   :990   B      :     9  
##  (Other):    84                 (Other):     9

Processing population health data

Assessing the population health impact first requires a look at the total number of fatalities and injuries for each event type. The question of which event type is most harmful to population health is addressed by studying the FATALITIES and INJURIES totals for each event type and ranking the event types on these two variables. A new data frame phealth is created by summing the number of fatalities and injuries by event type, then sorted in decreasing order of fatalities. This dataset contains the total fatalities and injuries for every event type.

fatality<-by(aData$FATALITIES, aData$EVTYPE, sum)
injury<-by(aData$INJURIES, aData$EVTYPE, sum)
phealth<-data.frame(cbind(fatality, injury))
phealth<-phealth[order(fatality, decreasing=T),]
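
The same group-and-rank idea can be sketched with aggregate() on a small toy data frame. The values below are invented purely for illustration and are not taken from the storm database; aggregate() is shown as an alternative to the by() calls above, not as the method the report uses.

```r
# Toy stand-in for aData; the values are invented for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES   = c(10, 0, 5, 1, 2)
)

# Sum both health columns per event type in one call.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank event types by total fatalities, largest first.
totals <- totals[order(totals$FATALITIES, decreasing = TRUE), ]

# Share of all fatalities attributable to the top-ranked event type, in percent.
top_share <- totals$FATALITIES[1] / sum(totals$FATALITIES) * 100
```

Keeping the event type as an ordinary column, rather than as row names, can make later plotting and merging a little simpler.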

Simplifying Event categories

Analyzing the event types, we find that there are 985 types of events recorded in that variable. On close inspection of the levels, many similar events can be broadly considered part of one event category such as HURRICANE, WIND, or SNOW. To get a better picture of the events, a new variable called CATEGORY was added to represent the broad general categories FLOOD, HEAT, RIP, LIGHTNING, RAIN, SNOW, HAIL, HURRICANE, ICE, STORMS, TORNADO, BLIZZARD and WIND. This variable was filled in conditionally on the value of EVTYPE, with the rules applied in sequence so that a later match overrides an earlier one. As an example, all event types that have SNOW in them are categorized as SNOW. The smaller and uncommon event types are all broadly classified under OTHER.

aData$CATEGORY<-"OTHER"
aData<-within(aData, CATEGORY[grepl("SNOW",EVTYPE,ignore.case = T)]<-"SNOW")
aData<-within(aData, CATEGORY[grepl("FLOOD",EVTYPE,ignore.case = T)]<-"FLOOD")
aData<-within(aData, CATEGORY[grepl("RAIN",EVTYPE,ignore.case = T)]<-"RAIN")
aData<-within(aData, CATEGORY[grepl("LIGHTNING",EVTYPE,ignore.case = T)]<-"LIGHTNING")
aData<-within(aData, CATEGORY[grepl("HAIL",EVTYPE,ignore.case = T)]<-"HAIL")
aData<-within(aData, CATEGORY[grepl("HURRICANE",EVTYPE,ignore.case = T)]<-"HURRICANE")
aData<-within(aData, CATEGORY[grepl("ICE",EVTYPE,ignore.case = T)]<-"ICE")
aData<-within(aData, CATEGORY[grepl("THUNDERSTORM",EVTYPE,ignore.case = T)]<-"STORMS")
aData<-within(aData, CATEGORY[grepl("TORNADO",EVTYPE,ignore.case = T)]<-"TORNADO")
aData<-within(aData, CATEGORY[grepl("TROPICAL",EVTYPE,ignore.case = T)]<-"STORMS")
aData<-within(aData, CATEGORY[grepl("BLIZZARD",EVTYPE,ignore.case = T)]<-"BLIZZARD")
aData<-within(aData, CATEGORY[grepl("WINTER STORM",EVTYPE,ignore.case = T)]<-"BLIZZARD")
aData<-within(aData, CATEGORY[grepl("WIND",EVTYPE,ignore.case = T)]<-"WIND")
aData<-within(aData, CATEGORY[grepl("HEAT",EVTYPE,ignore.case = T)]<-"HEAT")
aData<-within(aData, CATEGORY[grepl("RIP",EVTYPE,ignore.case = T)]<-"RIP")
aData$CATEGORY<-as.factor(aData$CATEGORY)
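
Because the rules run in sequence, the last matching pattern decides the category. The sketch below restates the same rules as two parallel vectors and applies them in a loop; the names patterns, categories and categorize are introduced here for illustration, and the test strings are made-up examples rather than rows from the database.

```r
# Patterns and their categories, in the same order as the statements above.
patterns   <- c("SNOW", "FLOOD", "RAIN", "LIGHTNING", "HAIL", "HURRICANE", "ICE",
                "THUNDERSTORM", "TORNADO", "TROPICAL", "BLIZZARD", "WINTER STORM",
                "WIND", "HEAT", "RIP")
categories <- c("SNOW", "FLOOD", "RAIN", "LIGHTNING", "HAIL", "HURRICANE", "ICE",
                "STORMS", "TORNADO", "STORMS", "BLIZZARD", "BLIZZARD",
                "WIND", "HEAT", "RIP")

# Hypothetical helper: start from OTHER and let each later match overwrite.
categorize <- function(evtype) {
  out <- rep("OTHER", length(evtype))
  for (i in seq_along(patterns)) {
    hit <- grepl(patterns[i], evtype, ignore.case = TRUE)
    out[hit] <- categories[i]
  }
  out
}

example <- categorize(c("THUNDERSTORM WIND", "Heavy Snow", "TROPICAL STORM",
                        "DROUGHT", "RIP CURRENT"))
```

One consequence worth noting is that an event type such as THUNDERSTORM WIND ends up in WIND rather than STORMS, since the WIND rule is applied after the THUNDERSTORM rule.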

Processing population health data by event category

Next we obtain the population health data for the event categories created in the earlier step. This provides an understanding of how each broad category of events impacts population health.

fatality<-by(aData$FATALITIES, aData$CATEGORY, sum)
injury<-by(aData$INJURIES, aData$CATEGORY, sum)
phbig<-data.frame(cbind(fatality, injury))
phbig<-phbig[order(fatality, decreasing=T),]

Processing data for economic impact calculation

The property and crop damage values are reported in the variables PROPDMG and CROPDMG respectively. Each is reported as a number rounded to three significant digits, and the exponent in PROPDMGEXP or CROPDMGEXP indicates whether it is in thousands, millions, or billions (K, M, or B). We combine each value with its exponent into a new variable expressed in thousands of dollars: the value in PROPDMG for an observation is multiplied by 1 if PROPDMGEXP is K, by 1000 if it is M, and by 1000000 if it is B. Observations with any other exponent code are left at 0. The same processing is applied to CROPDMG. The two new variables, PROPVALUE and CROPVALUE, are thus in units of thousands of dollars (K dollars).

aData$PROPVALUE<-0L
aData$CROPVALUE<-0L
index<-grepl("M",aData$PROPDMGEXP,ignore.case = T)
aData<-within(aData, PROPVALUE[index]<-PROPDMG[index]*1000)
index<-grepl("B",aData$PROPDMGEXP,ignore.case = T)
aData<-within(aData, PROPVALUE[index]<-PROPDMG[index]*1000000)
index<-grepl("K",aData$PROPDMGEXP,ignore.case = T)
aData<-within(aData, PROPVALUE[index]<-PROPDMG[index])
index<-grepl("M",aData$CROPDMGEXP,ignore.case = T)
aData<-within(aData, CROPVALUE[index]<-CROPDMG[index]*1000)
index<-grepl("B",aData$CROPDMGEXP,ignore.case = T)
aData<-within(aData, CROPVALUE[index]<-CROPDMG[index]*1000000)
index<-grepl("K",aData$CROPDMGEXP,ignore.case = T)
aData<-within(aData, CROPVALUE[index]<-CROPDMG[index])
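
The exponent handling can also be written as a lookup table. This is a sketch of an alternative, not the code the report runs: the names mult_k and to_thousands are hypothetical, and it assumes each exponent code is a single character, so that an exact match behaves like the grepl() calls above.

```r
# Multipliers that express a damage figure in thousands of dollars.
mult_k <- c(K = 1, M = 1e3, B = 1e6)

# Hypothetical helper: scale `dmg` by the multiplier for its exponent code.
to_thousands <- function(dmg, exp) {
  m <- mult_k[toupper(as.character(exp))]  # NA for blank or unrecognized codes
  m[is.na(m)] <- 0                         # such rows contribute nothing, as in the report
  unname(dmg * m)
}

# Made-up inputs: 25 K, 2.5 M, 1.2 b (lower case) and 7 with a blank exponent.
vals <- to_thousands(c(25, 2.5, 1.2, 7), c("K", "M", "b", ""))
```

A single lookup replaces the six index-and-assign steps and makes it easy to see which exponent codes are recognized.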

Next, the PROPVALUE and CROPVALUE variables are summed over all observations for each event type. A new variable econ_dmg holds the total economic damage, obtained by adding the property and crop damage for each event type. A new data frame called econImpact contains these three variables for all event types and is sorted in decreasing order of total damage. The top 10 entries are then stored in a new dataset econData10, which is used for further analysis and for presenting results.

library(reshape2)
prop_dmg<-by(aData$PROPVALUE, aData$EVTYPE, sum)
crop_dmg<-by(aData$CROPVALUE, aData$EVTYPE, sum)
econ_dmg<-prop_dmg+crop_dmg
econImpact<-data.frame(event=rownames(econ_dmg), property=as.numeric(prop_dmg), crop=as.numeric(crop_dmg), total=as.numeric(econ_dmg))
econImpact<-econImpact[order(econImpact$total, decreasing=T),]
econData10<-econImpact[1:10,]
econData10$event<-as.character(econData10$event)
econ10l<-melt(econData10, id.vars="event", variable.name="Damage", value.name="Value")

Results

Which event is most harmful for population health?

Here is a bar chart showing the distribution of fatalities and injuries across the different event types in the US as per the storm database. We plot the 10 event types with the most fatalities. The left panel shows fatalities in decreasing order, while the right panel shows the corresponding number of injuries. It is very clear that TORNADO has the most impact on population health.

library(ggplot2)
library(gridExtra)
b<-phealth[1:10,]
b$names<-rownames(b)
b$names<-factor(b$names, levels=b$names[order(b$fatality, decreasing=T)])
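# value.name must be lower case; the original "Value.name" was silently ignored
bl<-melt(b, id.vars="names", variable.name="Impact", value.name="value")
# The y axis is shared by the fatality and injury panels, so label it generically
ggplot(data=bl, aes(x=names,y=value, fill=names))+geom_bar(stat="identity") + xlab("Storm Event Type") + ylab("Total Count") + theme(axis.text.x=element_blank()) + facet_wrap(~Impact)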

Fig 1: Leading Causes of Storm related Fatalities and Injuries. Many more injuries happen and Tornadoes are the leading cause for both

The figure shows that the number of injuries is generally much larger than the number of fatalities across the leading storm event types. Tornadoes account for the most fatalities and injuries by a large margin among all the event types.

perc_fatal <- phealth$fatality[1]/sum(phealth$fatality)*100
perc_inj <- phealth$injury[1]/sum(phealth$injury)*100

Tornadoes contributed 37.19% of the total fatalities and 65% of the total injuries reported in the database.

Next the broad event category types are used to further understand the impact of storms on population health. The results from the event categories are shown below.

# Use a name other than c to avoid shadowing base::c()
topcat<-phbig[1:10,]
topcat$names<-rownames(topcat)
topcat$names<-factor(topcat$names, levels=topcat$names[order(topcat$fatality, decreasing=T)])
g1<-ggplot(data=topcat, aes(x=names,y=fatality, fill=names))+geom_bar(stat="identity") + xlab("Event Category") + ylab("Total Fatalities") + theme(axis.text.x=element_blank())
g2<-ggplot(data=topcat, aes(x=names,y=injury, fill=names))+geom_bar(stat="identity") + xlab("Event Category") + ylab("Total Injuries") + theme(axis.text.x=element_blank())
grid.arrange(g1, g2, ncol=2, top="Top 10 Categories of Fatalities and Injuries")

Fig 2: Leading Causes of Death and Injury by broad category of storm events. TORNADO has the most impact

Which events have the greatest economic impact?

The economic impact of storm events is assessed using the 10 event types with the greatest combined damage to property and crops, as obtained earlier. The table below shows economic damage in thousands of US dollars for these 10 events. The columns report the event type, the property damage, the crop damage, and the combined total.

list(econData10)
## [[1]]
##                 event  property     crop     total
## 170             FLOOD 144657710  5661968 150319678
## 411 HURRICANE/TYPHOON  69305840  2607873  71913713
## 834           TORNADO  56937160   414953  57352114
## 670       STORM SURGE  43323536        5  43323541
## 244              HAIL  15732267  3025954  18758221
## 153       FLASH FLOOD  16140812  1421317  17562129
## 95            DROUGHT   1046106 13972566  15018672
## 402         HURRICANE  11868319  2741910  14610229
## 590       RIVER FLOOD   5118946  5029459  10148405
## 427         ICE STORM   3944928  5022114   8967041
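
As a quick check on how property and crop damage compare, the table above can be re-entered by hand and the crop share of each total computed. The figures below are transcribed from the table; the column names total_bn and crop_share are introduced here for illustration.

```r
# Values transcribed from the table above (thousands of USD).
econ <- data.frame(
  event    = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL",
               "FLASH FLOOD", "DROUGHT", "HURRICANE", "RIVER FLOOD", "ICE STORM"),
  property = c(144657710, 69305840, 56937160, 43323536, 15732267,
               16140812, 1046106, 11868319, 5118946, 3944928),
  crop     = c(5661968, 2607873, 414953, 5, 3025954,
               1421317, 13972566, 2741910, 5029459, 5022114)
)

# Convert thousands of dollars to billions for easier reading.
econ$total_bn <- (econ$property + econ$crop) / 1e6

# Fraction of each event's total damage that is crop damage.
econ$crop_share <- econ$crop / (econ$property + econ$crop)

# Events ordered from the most crop-dominated to the least.
by_crop <- econ[order(econ$crop_share, decreasing = TRUE),
                c("event", "total_bn", "crop_share")]
```

Sorting by crop share singles out the few event types, DROUGHT in particular, where crops rather than property account for most of the loss.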

Now we look at the economic impact of the storm events graphically below.

g1<-ggplot(data=econ10l, aes(x=event,y=Value, fill=event)) +geom_bar(position="stack",stat="identity") + xlab("Storm Event Type") + ylab("Total Economic Damage (in '000s USD)") + theme(axis.text.x=element_blank()) + facet_wrap(~Damage)
g1

Fig 3: Events that cause the most economic damage

From the above figure it is clear that most of the economic damage is property damage: for the leading event types, crop damage contributes only a small share of the total. The table shows the exceptions, namely DROUGHT, where crop damage dominates, and RIVER FLOOD and ICE STORM, where crop and property damage are of similar size. Floods have the most impact, followed by hurricanes.