Damage to Public Health and Economy in the USA was Highly Effected by Only Few Storm Types between 1950 and 2011

=========================================================================================

Synopsis

Storm events can cause both public health and economic problems for communities. Many severe events can result in fatalities, injuries, and property damage.

Our hypothesis is that only few storm types highly contributed to the overall health and economic damage in the United States recorded between 1950 and 2011. So, we will explore the US National Oceanic and Atmospheric Administration’s (NOAA) storm database and use the Pareto principle to prove our hypothesis.

From this database we found that there is one undisputable type that was most harmful to public health. Whereas, economic consequences where highly effected by three specific types.

Data

The data for this analysis comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The data file-existing under the link Storm Data- will be automatically downloaded in the next section of the report.

The National Weather Service Storm Data Documentation is also a useful documentation of the database where you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

This section describes (in words and code) how the data was downloaded from the source, loaded into R, and processed for analysis.

Downloading the raw data

The analysis starts by downloading the zipped raw data file. Then, the file is unzipped.

fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dataFileZIP <- "./repdata_data_StormData.csv.bz2"

if (file.exists(dataFileZIP) == FALSE) {
  download.file(fileURL, destfile = dataFileZIP)
}

Loading the raw data file into R

storms <- read.csv("repdata_data_StormData.csv.bz2", stringsAsFactors=FALSE)

Loading necessary R packages

The analysis uses some packages that need to be installed before running the code. The code below loads these packages. Yet, you need to make sure they are installed in your R version before loading.

library(dplyr)
library(ggplot2)
library(reshape2)

Processing data for analysis

To get an overview of the data before any processing let’s see the data characteristics in terms of dimensions (rows and columns) and some of the variables existing in the data.

dim<-dim(storms)
obs<-dim[1]
var<-dim[2]

We have 902297 observations with 37 variables.

To get sense of the data let’s take a look at some of the variables.

options(width = 100)
head(storms[,1:8],5)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO

In this report we are concerned with the events that are most harmful to public health and those having the greatest economic consequences. By reading through the Storm Data Documentation and exploring the variables in the raw data we can conclude the following:

Impact on public health is represented in the following variables:
- FATALITIES (count of persons)
- INJURIES (count of persons)
Impact on economy is represented in the following variables:
- PROPDMG (numeric variable representing the Property Damage estimate)
- PROPDMGEXP (character variable representing the significance of the property damage estimate using “K” for thousands, “M” for millions, and “B” for billions).
- CROPDMG (numeric variable representing the Crop Damage estimate)
- CROPDMGEXP (character variable representing the significance of the crop damage estimate using “K” for thousands, “M” for millions, and “B” for billions).

Besides, we will use some of the following variables in the analysis:

STATE
BGN_DATE (the approximate date that the event started to have an impact or became a nuisance)
BGN_TIME (the approximate time that the event started to have an impact or became a nuisance)
EVTYPE (type of storm event)

Hence, we will create a subset of the raw data containing the required variables only.

stormsSub<-storms[,c(2,3,7,8,23:28)]

By looking at the event type (EVTYPE) in the dataset we can see that some descriptions contain spaces that need trimming. So, as an initial processing step, we better remove the leading and trailing spaces.

stormsSub[,"EVTYPE"]<-gsub("(^ +)|( +$)", "", stormsSub$EVTYPE)

As the events in the raw data were not entered consistently throughout the years we need to devise a strategy to replace the EVTYPE descriptions with standard ones as listed in the Storm Data Documentation(2.1.1 Storm Data Event Table, pp.6).

However, replacing all EVTYPE descriptions is a tedious process. So, we will follow the vital few principle to correct the event descriptions by correcting the names of the descriptions that have the highest count of observations in the raw data.

To achieve this we need to calculate the frequency of events descriptions in the raw data and their cumulative share to the overall count of observations.

First, we create a dataset that summarizes the available descriptions with each one’s count of observations sorted in descending order.

EVType<-as.factor(stormsSub$EVTYPE)
EVFreq<-arrange(as.data.frame(table(EVType)),desc(Freq))

Second, we calculate the cumulative sum of observations across all events then its cumulative share of the total number of observations.

TotObs<-nrow(storms) # total number of observations
EVFreq$CumSum<- cumsum(EVFreq$Freq)
EVFreq$CumShare<- round(EVFreq$CumSum/TotObs,digits = 3)

By looking at this dataset (EVFreq) we find that the first 47 descriptions contribute to 99% of total observations as shown below.

head(EVFreq,47)

##                      EVType   Freq CumSum CumShare
## 1                      HAIL 288661 288661    0.320
## 2                 TSTM WIND 219944 508605    0.564
## 3         THUNDERSTORM WIND  82563 591168    0.655
## 4                   TORNADO  60652 651820    0.722
## 5               FLASH FLOOD  54278 706098    0.783
## 6                     FLOOD  25326 731424    0.811
## 7        THUNDERSTORM WINDS  20843 752267    0.834
## 8                 HIGH WIND  20212 772479    0.856
## 9                 LIGHTNING  15755 788234    0.874
## 10               HEAVY SNOW  15708 803942    0.891
## 11               HEAVY RAIN  11723 815665    0.904
## 12             WINTER STORM  11433 827098    0.917
## 13           WINTER WEATHER   7026 834124    0.924
## 14             FUNNEL CLOUD   6839 840963    0.932
## 15         MARINE TSTM WIND   6175 847138    0.939
## 16 MARINE THUNDERSTORM WIND   5812 852950    0.945
## 17               WATERSPOUT   3797 856747    0.950
## 18              STRONG WIND   3566 860313    0.953
## 19     URBAN/SML STREAM FLD   3392 863705    0.957
## 20                 WILDFIRE   2761 866466    0.960
## 21                 BLIZZARD   2719 869185    0.963
## 22                  DROUGHT   2488 871673    0.966
## 23                ICE STORM   2006 873679    0.968
## 24           EXCESSIVE HEAT   1678 875357    0.970
## 25               HIGH WINDS   1533 876890    0.972
## 26         WILD/FOREST FIRE   1457 878347    0.973
## 27             FROST/FREEZE   1342 879689    0.975
## 28                DENSE FOG   1293 880982    0.976
## 29       WINTER WEATHER/MIX   1104 882086    0.978
## 30           TSTM WIND/HAIL   1028 883114    0.979
## 31  EXTREME COLD/WIND CHILL   1002 884116    0.980
## 32                     HEAT    767 884883    0.981
## 33                HIGH SURF    725 885608    0.982
## 34           TROPICAL STORM    690 886298    0.982
## 35           FLASH FLOODING    682 886980    0.983
## 36             EXTREME COLD    655 887635    0.984
## 37            COASTAL FLOOD    651 888286    0.984
## 38         LAKE-EFFECT SNOW    636 888922    0.985
## 39        FLOOD/FLASH FLOOD    624 889546    0.986
## 40                LANDSLIDE    600 890146    0.987
## 41                     SNOW    587 890733    0.987
## 42          COLD/WIND CHILL    539 891272    0.988
## 43                      FOG    538 891810    0.988
## 44              RIP CURRENT    470 892280    0.989
## 45              MARINE HAIL    442 892722    0.989
## 46               DUST STORM    427 893149    0.990
## 47                AVALANCHE    386 893535    0.990

Hence, applying the vital few principle we need to make sure these 47 descriptions match the standard ones in the Documentation table. This gives us high confidence in the analysis accuracy as well as efficiency in data processing.

After examining these 47 events descriptions and comparing them to those in the Storm Data Event Table we find that only 13 of them need to be renamed (shown below). Re-naming mis-spelt descriptions will take place in the subsetted data (stormsSub).

stormsSub[,"EVTYPE"]<-gsub("TSTM WIND","THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("THUNDERSTORM WINDS","THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("MARINE TSTM WIND","MARINE THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("URBAN/SML STREAM FLD","FLOOD",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("HIGH WINDS","HIGH WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("WILD/FOREST FIRE","WILDFIRE",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("WINTER WEATHER/MIX","WINTER WEATHER",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("TSTM WIND/HAIL","THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("FLASH FLOODING","FLASH FLOOD",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("EXTREME COLD","EXTREME COLD/WIND CHILL",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("FLOOD/FLASH FLOOD","FLASH FLOOD",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("SNOW","HEAVY SNOW",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("FOG","DENSE FOG",stormsSub[,"EVTYPE"])

Another aspect to be processed before analysis is the unification of damage estimate numbers. Since the significance of damage estimate is represented in a character (K,M, or B) for both property and crop damages we need to convert this significance into numeric values then reflect it to the damage estimates (PROPDMG and CROPDMG).

We start by re-calculating the property damage estimates.

stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="K"]<- stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="K"]*1000
stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="M"]<- stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="M"]*1000000
stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="B"]<- stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="B"]*1000000000

Then, we re-calculate the crop damage estimates.

stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="K"]<- stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="K"]*1000
stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="M"]<- stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="M"]*1000000
stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="B"]<- stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="B"]*1000000000

Now we have same-unit property and crop damage estimates, we assume that overall economic damage can be represented by the sum of both property and crop damages. So, we create a new variable (ECODMG) calculated by the sum of both variables (PROPDMG and CROPDMG) for each event type.

stormsSub$ECODMG<-stormsSub$PROPDMG+stormsSub$CROPDMG

Similarly, we will assume that total event impact on public health can be represented by both fatalities and injuries caused by that event. So, we will create a new variable (HLTHDMG) calculated by the sum of both variables (FATALITIES and INJURIES) for each event type.

stormsSub$HLTHDMG<-stormsSub$FATALITIES+stormsSub$INJURIES

Now, we can drop the property and crop significance variables as they become irrelevant.

stormsSub<-stormsSub[,-c(8,10)]

Results

Summarizing Impact of Events

So far, we have a processed dataset (stormsSub) which we can use to do the calculations needed for the analysis. In order to see the overall impact of each event type we will create a summary dataset (stormsSummary) grouped by event type along with its overall damage estimates to each of public health and to economy.

stormsSummary<-stormsSub %>% 
  group_by(EVTYPE)%>%
  summarize(Total_FATAL=sum(FATALITIES),
            Total_INJR=sum(INJURIES),
            Total_HLTHDMG=sum(HLTHDMG),
            Total_PROPDMG=sum(PROPDMG),
            Total_CROPDMG=sum(CROPDMG),
            Total_ECODMG=sum(ECODMG))

Finding Vital Few Events

We need to find out the events that are most harmful to public health. To do that we will arrange events in descending order by their total health damage (Total_HLTHDMG), and, for easier analysis, store the ordered list in a new dataset (OrdHLTH).

OrdHLTH<-arrange(stormsSummary,desc(Total_HLTHDMG))

On the other hand, we need to see the events that have the greatest economic consequences. So, we will also arrange events in descending order by their total economic damage(Total_ECODMG) and store the ordered list in a new dataset (OrdECO).

OrdECO<-arrange(stormsSummary,desc(Total_ECODMG))

Then, we will apply Pareto principle and find out the vital few events most harmful to each of health and economy. We calculate the cumulative sum of health damage across all events then calculate its share across all events top-down. We’ll do this for both ordered datasets (OrdHLTH,OrdECO). A subset of each dataset is shown below.

TotHLTHDMG<-sum(OrdHLTH$Total_HLTHDMG) # overall health damage
OrdHLTH$HLTHCumSum<- cumsum(OrdHLTH$Total_HLTHDMG)
OrdHLTH$HLTHCumShare<- round(OrdHLTH$HLTHCumSum/TotHLTHDMG,digits = 3)

head(OrdHLTH[,c(1:4,9)],20)

## Source: local data frame [20 x 5]
## 
##               EVTYPE Total_FATAL Total_INJR Total_HLTHDMG HLTHCumShare
## 1            TORNADO        5633      91346         96979        0.623
## 2  THUNDERSTORM WIND         701       9353         10054        0.688
## 3     EXCESSIVE HEAT        1903       6525          8428        0.742
## 4              FLOOD         498       6868          7366        0.789
## 5          LIGHTNING         816       5230          6046        0.828
## 6               HEAT         937       2100          3037        0.847
## 7        FLASH FLOOD        1014       1800          2814        0.865
## 8          ICE STORM          89       1975          2064        0.879
## 9          HIGH WIND         283       1439          1722        0.890
## 10          WILDFIRE          87       1456          1543        0.900
## 11      WINTER STORM         206       1321          1527        0.909
## 12              HAIL          15       1361          1376        0.918
## 13 HURRICANE/TYPHOON          64       1275          1339        0.927
## 14  HEAVY HEAVY SNOW         127       1021          1148        0.934
## 15          BLIZZARD         101        805           906        0.940
## 16         DENSE FOG          62        734           796        0.945
## 17       RIP CURRENT         368        232           600        0.949
## 18    WINTER WEATHER          61        470           531        0.952
## 19      RIP CURRENTS         204        297           501        0.956
## 20         HEAT WAVE         172        309           481        0.959

TotECODMG<-sum(OrdECO$Total_ECODMG) # overall economic damage
OrdECO$ECOCumSum<- cumsum(OrdECO$Total_ECODMG)
OrdECO$ECOCumShare<- round(OrdECO$ECOCumSum/TotECODMG,digits = 3)

head(OrdECO[,c(1,5:7,9)],17)

## Source: local data frame [17 x 5]
## 
##               EVTYPE Total_PROPDMG Total_CROPDMG Total_ECODMG ECOCumShare
## 1              FLOOD  144716019457    5670456550 150386476007       0.316
## 2  HURRICANE/TYPHOON   69305840000    2607872800  71913712800       0.467
## 3            TORNADO   56925660790     414953270  57340614060       0.587
## 4        STORM SURGE   43323536000          5000  43323541000       0.678
## 5               HAIL   15727367053    3025537890  18752904943       0.717
## 6        FLASH FLOOD   16622664680    1531457160  18154121840       0.755
## 7            DROUGHT    1046106000   13972566000  15018672000       0.787
## 8          HURRICANE   11868319010    2741910000  14610229010       0.818
## 9  THUNDERSTORM WIND    9709667805    1159501192  10869168997       0.840
## 10       RIVER FLOOD    5118945500    5029459000  10148404500       0.862
## 11         ICE STORM    3944927860    5022113500   8967041360       0.881
## 12    TROPICAL STORM    7703890550     678346000   8382236550       0.898
## 13          WILDFIRE    7766943500     402269630   8169213130       0.915
## 14      WINTER STORM    6688497251      26944000   6715441251       0.929
## 15         HIGH WIND    5878370043     679291900   6557661943       0.943
## 16  STORM SURGE/TIDE    4641188000        850000   4642038000       0.953
## 17    HURRICANE OPAL    3152846020       9000010   3161846030       0.960

Event types that are most harmful with respect to population health, and those that have the greatest economic consequences can be easily picked in a bar chart constructed in descending order. Therefore, we will plot the significant event types in 2 bar charts; one representing the impact on public health and another on economy. Then, we can pin-point those vital few event types that will be reported as most harmful to health and those with greatest economic consequences.

Notice that the significant event types plotted in the bar charts are those having 96% of the total impact on either public health or economy.

PLOT-1:

Using the ggplot2 package we will plot the events having about 96% of total impact on public health.

mltdHL<-melt(OrdHLTH[1:20,],id=c(1),measure.vars=c(2,3)) # convert data to long form

g<-ggplot(mltdHL,aes(EVTYPE,value,fill=factor(variable,labels=c("Fatalities","Injuries"))))
g<-g+geom_bar(stat="identity")+   
        labs(x="Event Type")+
        labs(y="Fatalities and Injuries")+
        theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1),
              plot.title = element_text(size = 18, face = "bold", colour = "black", vjust = +1),legend.title=element_blank(),legend.position=c(0.9,0.8))+        
        ggtitle(expression(atop("Events Impact on Population Health",
                                atop(italic("Events that Contributed to 96% of Total Health Damage"),
                                     atop("Across USA, 1950-2011")))))+
        annotate("text",x=8,y=60000,label= "Tornado is the most harmful to public health",size=3)

# order the bar chart in descending order to see impact of the vital few events
mltdHL0<- mltdHL
mltdHL0$EVTYPE <- with(mltdHL0, reorder(EVTYPE, -value))
g <- g %+% mltdHL0
g

PLOT-2:

Using the ggplot2 package we will plot the events having about 96% of total impact on economy.

mltdEC<-melt(OrdECO[1:17,],id=c(1),measure.vars=c(5,6)) # convert data to long form

p<-ggplot(mltdEC,aes(EVTYPE,value,fill=factor(variable,labels=c("Property","Crop"))))
p<-p+geom_bar(stat="identity")+  
        labs(x="Event Type")+
        labs(y="Property and Crop Damages (USD)")+
        theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1),
              plot.title = element_text(size = 18, face = "bold", colour = "black", vjust = +1),legend.title=element_blank(),legend.position=c(0.9,0.8))+        
        ggtitle(expression(atop("Events Impact on Economy",
                                atop(italic("Events that Contributed to 96% of Total Economic Damage"),
                                     atop("Across USA, 1950-2011")))))+
        annotate("text",x=8,y=1.0e+11,label= "Flood has the greatest economic \nconsequences followed by Hurricane and Tornado",size=3)

# order the bar chart in descending order to see impact of the vital few events
mltdEC0<- mltdEC
mltdEC0$EVTYPE <- with(mltdEC0, reorder(EVTYPE, -value))
p <- p %+% mltdEC0
p

Although the vital few events are clear in Plot-2 (Events Impact on Economy), we can add another plot showing the cumulative share to the total economic damage across all events. This plot will guide us to select the vital few events based on their impact share (percentage) compared to others.

PLOT-3:

s<-ggplot(OrdECO[1:17,],aes(EVTYPE,ECOCumShare,color=ECOCumShare<=0.60))
s<-s+geom_point()+
        labs(x="Event Type")+
        labs(y="Cumulative Share")+
        theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, vjust=1),
              plot.title = element_text(size = 18, face = "bold", colour = "black", vjust = +1))+        
        ggtitle(expression(atop("Cumulative Share of Events Impact on Economy",
                                atop(italic("Events that Contributed to 96% of Total Economic Damage"),
                                     atop("Across USA, 1950-2011")))))+
        annotate("text",x=10,y=.6,label= "Flood, Hurricane and Tornado are the vital few\nthat have the greatest economic consequences",size=3)

# order the bar chart in descending order to see impact of the vital few events
OrdECO0<- OrdECO[1:17,]
OrdECO0$EVTYPE <- with(OrdECO0, reorder(EVTYPE, ECOCumShare))
s <- s %+% OrdECO0
s

Conclusion

Events impact on population health

Based on the analysis presented in this report we can conclude that Tornado is the undisputable event type that has been most harmful to population health amongst all other events. Tornado got the highest contribution of 62.3% to the overall health damage. The following figures summarizes the impact of this harmful event:

Fatalities: 5,633 persons
Injuries: 91,346 persons
Total Health Damage: 96,979 persons

Events impact on economy

Based on the analysis presented in this report we can conclude that the greatest economic consequences were caused by three major events; namely, Flood, Hurricane/Typhoon, and Tornado with a total contribution of 58.7% to the overall economic damage. The following figures summarizes the impact of these harmful events:

Flood: total damage of USD 150.4 billion
Hurricane/Typhoon: total damage of USD 71.9 billion
Tornado: total damage of USD 57.3 billion

Flood event recorded the highest economic impact of USD 150.4 billion. And, the overall economic damage caused by these three major events was about USD 279.6 billion for the entire span of data from 1950 till November 2011.

======================================END OF REPORT======================================