Introduction

The NOAA has collected and made available data on the impact of weather on human life and property/crop damage. This study aims to summarize these data in order to aid decisionmaking.

Data Processing

The available data were first downloaded to the home directory.

if(!file.exists("stormdata.csv")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","stormdata.csv",method="curl")
}

For the analysis, ggplot2 graphics package and (d)plyr packages were used in order to sort the data and create plots.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
if(!exists("stormdata")){
  print("Reading stormdata.csv...")
  stormdata<-read.csv("stormdata.csv")
}
## [1] "Reading stormdata.csv..."
cache=TRUE

There were several steps taken in order to improve the quality of the data set itself. First, dates were converted to the proper format. In order to see the year in which an event occured.

# Convert begin and end dates to POSIXct
stormdata$Year<-as.numeric(format(as.POSIXct(strptime(as.character(stormdata$BGN_DATE),"%m/%d/%Y %H:%M:%S")),"%Y"))

There were quite a few typos and unneccessary abbreviations that were corrected and consolidated.

# Consolidate Event types and abbreviations
stormdata$EVTYPE <- sub("^TSTM.*", "THUNDERSTORM", stormdata$EVTYPE)
stormdata$EVTYPE <- sub(".*THUNDERSTROM.*|.*THUNDERSTORMW.*|.*THUNDERTSORM.*", "THUNDERSTORM", stormdata$EVTYPE)
stormdata$EVTYPE <- sub(".*THUNDERSTORM WIND.*", "THUNDERSTORM WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^FLASH FLOOD.*", "FLASH FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^HIGH WIND.*", "HIGH WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^HURRICANE.*", "HURRICANE", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^TORNADO.*", "TORNADO", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^TROPICAL STORM.*", "TROPICAL STORM", stormdata$EVTYPE)

Next, the property and crop damages were changed to reflect actual numbers.

#Convert crop/property damage into dollar amounts
levels(stormdata$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

As an example, the PROPDMGEXP variable contains letters. For example, “B” means “billions” or a base ten exponent of 9 and “h” means “Hundreds” or a base ten exponent of 2. These were remapped to their numeric counterparts:

# convert the lettering to exponents
stormdata$PROPDMGEXP <- mapvalues(stormdata$PROPDMGEXP,
                                  from = c("B", "b", "M", "m", "K", "k", "H", "h"),
                                  to = c("9", "9", "6", "6", "3", "3", "2", "2"),
                                  warn_missing = F)
stormdata$CROPDMGEXP <- mapvalues(stormdata$CROPDMGEXP, from = c("B", "b", "M", "m", "K", "k", "H", "h"),
                                  to = c("9", "9", "6", "6", "3", "3", "2", "2"),
                                  warn_missing = F)

stormdata <- subset(stormdata, stormdata$PROPDMGEXP != "+" & stormdata$PROPDMGEXP != 
                      "-" & stormdata$PROPDMGEXP != "?")
stormdata <- subset(stormdata, stormdata$CROPDMGEXP != "+" & stormdata$CROPDMGEXP != 
                      "-" & stormdata$CROPDMGEXP != "?")

# turn the exponents into numbers
stormdata$PROPDMGEXP<-as.numeric(as.character(stormdata$PROPDMGEXP))
stormdata$CROPDMGEXP<-as.numeric(as.character(stormdata$CROPDMGEXP))

At this point the Storm Data were much neater than when we found them and the analysis proceeded.

Results

Damage and loss of life are a natural part of any natural disaster. The following subsections look at the NOAA data set on the basis of the cost of human life and cost of damage to crops and property.

Injuries and Fatalities

I chose to look at these values separately to enhance the richness of the analysis. One could also choose to sum fatalities and injuries into a “casualties” indicator, however, there are certainly hazardous weather systems that produce a large number of injuries with fewer fatalities.

To start, I looked at the dataset as a whole, summing up the fatalities and injuries for all years and choosing the top 10 EVTYPES of injuries and fatalities. Later, it will be shown that, while arbitrary, this top-10 ranking covers a large proportion of the entire dataset. I also tracked the number of times a given EVTYPE was reported.

# All Fatalities
fatalEventsAlltime<-ddply(stormdata,"EVTYPE",summarize,
                   sum=sum(FATALITIES),
                   count=length(FATALITIES))
top10fatalAlltime=head(arrange(fatalEventsAlltime,desc(sum)),n=10)

# All Injuries
injuryEventsAlltime<-ddply(stormdata,"EVTYPE",summarize,
                    sum=sum(INJURIES),
                    count=length(INJURIES))

top10injureAlltime=head(arrange(injuryEventsAlltime,desc(sum)),n=10)

There is some overlap. Notably, the highest-ranking event type for injuries and fatalities is a Tornado. However, flash floods,for example, rank 2nd in most fatalities and 8th in most injuries.

Even more interesting was the absence of any deaths or injuries from an event other than Tornados before 1983.

min(filter(stormdata,EVTYPE!="TORNADO" & (FATALITIES>0 | INJURIES>0))$Year)
## [1] 1983

Because it is unlikely that Tornados were the only cause of weather-related injuries and deaths before 1983, I chose to look at only the data after 1983, where other EVTYPE labels came into play.

stormdataPost1983<-filter(stormdata,Year>="1983")
fatalEvents<-ddply(stormdataPost1983,"EVTYPE",summarize,
                   sum=sum(FATALITIES),
                   count=length(FATALITIES))
top10fatal=head(arrange(fatalEvents,desc(sum)),n=10)

# Injuries
injuryEvents<-ddply(stormdataPost1983,"EVTYPE",summarize,
                   sum=sum(INJURIES),
                   count=length(INJURIES))

top10injure=head(arrange(injuryEvents,desc(sum)),n=10)

Looking at the combination of the all-time and post-1983 rankings, two things are clear: 1) Tornados still rank highest in both injuries and fatalities. 2) The number of fatalities, casualties, and counts dropped.

merge(top10fatal,top10fatalAlltime,by="EVTYPE",suffixes=c(".AllTime",".post1983"))
##            EVTYPE sum.AllTime count.AllTime sum.post1983 count.post1983
## 1       AVALANCHE         224           386          224            386
## 2  EXCESSIVE HEAT        1903          1678         1903           1678
## 3     FLASH FLOOD        1018         55034         1018          55034
## 4           FLOOD         470         25326          470          25326
## 5            HEAT         937           767          937            767
## 6       HIGH WIND         291         21796          291          21796
## 7       LIGHTNING         816         15754          816          15754
## 8     RIP CURRENT         368           470          368            470
## 9    THUNDERSTORM         511        179817          511         221105
## 10        TORNADO        2183         35823         5658          60685
merge(top10injure,top10injureAlltime,by="EVTYPE",suffixes=c(".AllTime",".post1983"))
##               EVTYPE sum.AllTime count.AllTime sum.post1983 count.post1983
## 1     EXCESSIVE HEAT        6525          1678         6525           1678
## 2        FLASH FLOOD        1785         55034         1785          55034
## 3              FLOOD        6789         25326         6789          25326
## 4               HEAT        2100           767         2100            767
## 5          HIGH WIND        1471         21796         1471          21796
## 6          ICE STORM        1975          2006         1975           2006
## 7          LIGHTNING        5230         15754         5230          15754
## 8       THUNDERSTORM        7095        179817         7095         221105
## 9  THUNDERSTORM WIND        2428        109438         2428         109438
## 10           TORNADO       34758         35823        91364          60685

As mentioned earlier, top 10 lists are “cute” but sometimes arbitrary. In this case, the top 10 EVTYPES for fatalities accounted for approximately 80% of all fatalities reported. This is true for all-time:

sum(top10fatalAlltime$sum)/sum(fatalEventsAlltime$sum)*100
## [1] 80.53886

and post-1983 data:

sum(top10fatal$sum)/sum(fatalEvents$sum)*100
## [1] 74.74289

In the case of injuries, the top 10 injuring EVTYPES accounted for approximately 90% of all injuries reported:

sum(top10injureAlltime$sum)/sum(injuryEventsAlltime$sum)*100
## [1] 90.20409
sum(top10injure$sum)/sum(injuryEvents$sum)*100
## [1] 83.59667

I chose to plot these data using a scatter plot where the y-axis was the total victim count (injuries or fatalities) and the x-axis was the total number of times that EVTYPE occured. These were plotted in log-log fashion and the point sizes were scaled to represent a “normalized” incident rating. That is, the total number of incidents was divided by the total number of event instances.

In such a plot, events in the upper right region are frequent and afflict more victims whereas those in the lower left are less frequent and afflict fewer victims. Larger points indicate that there are more victims occur per event occurence, and smaller points indicate that fewer victims occur per event occurence.

# Apply labels for plotting purposes
top10fatal$outcome<-"Fatalities"
top10injure$outcome<-"Injuries"
humanFactors<-rbind(top10fatal,top10injure)

ggplot(data=humanFactors,aes(x=log10(count),y=log10(sum), col=outcome))+
  geom_point(aes(size=sum/count))+
  geom_text(col="black",aes(label=EVTYPE),hjust=0.2,vjust=-1,size=4,family=c("serif"))+
  xlab("log(Event Count)")+
  xlim(c(2,6))+
  ylab("log(Incident Count)")+
  ggtitle("Highest fatality and injury events, total incidents vs. event occurence")

From this plot, we can see, for example, that tornados are responsible for the most injuries of all other events and that more people are injured by tornados than killed by tornados. In fact, in most cases, the injury count (blue dot) is higher than the fatality count (red dot), except for avalanches, rip currents, and high wind which may indicate that these events more often result in death than injury.

Property and Crop Damage

Weather events can lead not only to damage of property but also affect the agriculture industry by damaging crops.

As in the above analysis, I started with the all-time data.

# All-time Property Damage
propertyAlltime<-ddply(stormdata,"EVTYPE",summarize,
                       TotalCost=sum(PROPDMG*10^PROPDMGEXP,na.rm=T),
                       TotalCount=length(PROPDMG)
                       )

top10propdmgAlltime=head(arrange(propertyAlltime,desc(TotalCost)),n=10)

# All-time Crop Damage
cropAlltime<-ddply(stormdata,"EVTYPE",summarize,
                       TotalCost=sum(CROPDMG*10^CROPDMGEXP,na.rm=T),
                       TotalCount=length(CROPDMG)
)

top10cropdmgAlltime=head(arrange(propertyAlltime,desc(TotalCost)),n=10)

This time, the raw data would have one believe that no property or crop damage occured due to any other event than tornados before 1993. A similar approach was taken and only property and crop damage data from 1993 onward were taken into account.

min(filter(stormdata,EVTYPE!="TORNADO" & (PROPDMG>0 | CROPDMG>0))$Year)
## [1] 1993
stormdataPost1993<-filter(stormdata,Year>=1993)
propertyDmg<-ddply(stormdataPost1993,"EVTYPE",summarize,
               TotalCost=sum(PROPDMG*10^PROPDMGEXP,na.rm=T),
               TotalCount=length(PROPDMG)
               )

cropDmg<-ddply(stormdataPost1993,"EVTYPE",summarize,
                   TotalCost=sum(CROPDMG*10^CROPDMGEXP,na.rm=T),
                   TotalCount=length(CROPDMG)
               )

top10prop=head(arrange(propertyDmg,desc(TotalCost)),n=10)

top10crop=head(arrange(cropDmg,desc(TotalCost)),n=10)

merge(top10propdmgAlltime,top10prop,by="EVTYPE",suffix=c(".AllTime",".post1993"))
##               EVTYPE TotalCost.AllTime TotalCount.AllTime
## 1        FLASH FLOOD       17414680872              55034
## 2              FLOOD      144657709800              25326
## 3               HAIL       15735267456             288658
## 4          HIGH WIND        6003352990              21796
## 5          HURRICANE       84756180010                286
## 6        STORM SURGE       43323536000                261
## 7  THUNDERSTORM WIND        5431671378             109438
## 8            TORNADO       58552151814              60685
## 9     TROPICAL STORM        7714390550                697
## 10      WINTER STORM        6688497251              11433
##    TotalCost.post1993 TotalCount.post1993
## 1         17414680872               55034
## 2        144657709800               25326
## 3         15735267456              226826
## 4          6003352990               21796
## 5         84756180010                 286
## 6         43323536000                 261
## 7          5431671378              109438
## 8         27953953244               25921
## 9          7714390550                 697
## 10         6688497251               11433
merge(top10cropdmgAlltime,top10crop,by="EVTYPE",suffix=c(".AllTime",".post1993"))
##        EVTYPE TotalCost.AllTime TotalCount.AllTime TotalCost.post1993
## 1 FLASH FLOOD       17414680872              55034         1437163150
## 2       FLOOD      144657709800              25326         5661968450
## 3        HAIL       15735267456             288658         3025954470
## 4   HURRICANE       84756180010                286         5515292800
##   TotalCount.post1993
## 1               55034
## 2               25326
## 3              226826
## 4                 286

In this case, the top 10 events leading to property damage account for approximately 88% of all damages. This is true for all-time:

sum(top10propdmgAlltime$TotalCost)/sum(propertyAlltime$TotalCost)*100
## [1] 91.13864

and post-1993 data:

sum(top10prop$TotalCost)/sum(propertyDmg$TotalCost)*100
## [1] 90.45674

In the case of crop damage, the top 10 injuring events that accounted for approximately 85% of all crop damage costs reported:

sum(top10crop$TotalCost)/sum(cropDmg$TotalCost)*100
## [1] 87.13101

These data were plotted in a similar fashion to injuries/fatalities. This time, the x-axis shows the log of the Event Count and the y-axis shows the total cost of damages to crops or property. The upper right portion shows events that occurred frequently and led to higher sums in damages whereas those in the lower right indicate infrequent events that led to less damage. The plot size indicates the sum of damages (in USD) normalized by the number of event occurences.

# apply a label for plotting purposes
top10prop$Label<-"Property"
top10crop$Label<-"Crop"
damagefactors<-rbind(top10prop,top10crop)

ggplot(data=damagefactors,aes(x=log10(TotalCount),y=log(TotalCost), col=Label))+
  geom_point(aes(size=TotalCost/TotalCount))+
  geom_text(col="black",aes(label=EVTYPE),hjust=0.2,vjust=-1,size=4,family=c("serif"))+
  xlab("Count of Damage Events")+
  ylab("Total Cost of Damages")+
  ggtitle("Highest Property & Crop Damage events, total cost vs. event occurence")

As one might expect, there is some overlap in the placement of crop/property damage and injuries/fatalities: often in dangerous weather situations when property is at risk, human life is in danger. It was interesting to note that hurricanes and typhoons, although occuring somewhat less frequently, led to high damage costs to property not only overall, but per incident (as indicated by the relative size of the plot point).