The NOAA has collected and made available data on the impact of weather on human life and property/crop damage. This study aims to summarize these data in order to aid decisionmaking.
The available data were first downloaded to the home directory.
if(!file.exists("stormdata.csv")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","stormdata.csv",method="curl")
}
For the analysis, ggplot2 graphics package and (d)plyr packages were used in order to sort the data and create plots.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
if(!exists("stormdata")){
print("Reading stormdata.csv...")
stormdata<-read.csv("stormdata.csv")
}
## [1] "Reading stormdata.csv..."
cache=TRUE
There were several steps taken in order to improve the quality of the data set itself. First, dates were converted to the proper format. In order to see the year in which an event occured.
# Convert begin and end dates to POSIXct
stormdata$Year<-as.numeric(format(as.POSIXct(strptime(as.character(stormdata$BGN_DATE),"%m/%d/%Y %H:%M:%S")),"%Y"))
There were quite a few typos and unneccessary abbreviations that were corrected and consolidated.
# Consolidate Event types and abbreviations
stormdata$EVTYPE <- sub("^TSTM.*", "THUNDERSTORM", stormdata$EVTYPE)
stormdata$EVTYPE <- sub(".*THUNDERSTROM.*|.*THUNDERSTORMW.*|.*THUNDERTSORM.*", "THUNDERSTORM", stormdata$EVTYPE)
stormdata$EVTYPE <- sub(".*THUNDERSTORM WIND.*", "THUNDERSTORM WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^FLASH FLOOD.*", "FLASH FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^HIGH WIND.*", "HIGH WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^HURRICANE.*", "HURRICANE", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^TORNADO.*", "TORNADO", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("^TROPICAL STORM.*", "TROPICAL STORM", stormdata$EVTYPE)
Next, the property and crop damages were changed to reflect actual numbers.
#Convert crop/property damage into dollar amounts
levels(stormdata$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
As an example, the PROPDMGEXP variable contains letters. For example, “B” means “billions” or a base ten exponent of 9 and “h” means “Hundreds” or a base ten exponent of 2. These were remapped to their numeric counterparts:
# convert the lettering to exponents
stormdata$PROPDMGEXP <- mapvalues(stormdata$PROPDMGEXP,
from = c("B", "b", "M", "m", "K", "k", "H", "h"),
to = c("9", "9", "6", "6", "3", "3", "2", "2"),
warn_missing = F)
stormdata$CROPDMGEXP <- mapvalues(stormdata$CROPDMGEXP, from = c("B", "b", "M", "m", "K", "k", "H", "h"),
to = c("9", "9", "6", "6", "3", "3", "2", "2"),
warn_missing = F)
stormdata <- subset(stormdata, stormdata$PROPDMGEXP != "+" & stormdata$PROPDMGEXP !=
"-" & stormdata$PROPDMGEXP != "?")
stormdata <- subset(stormdata, stormdata$CROPDMGEXP != "+" & stormdata$CROPDMGEXP !=
"-" & stormdata$CROPDMGEXP != "?")
# turn the exponents into numbers
stormdata$PROPDMGEXP<-as.numeric(as.character(stormdata$PROPDMGEXP))
stormdata$CROPDMGEXP<-as.numeric(as.character(stormdata$CROPDMGEXP))
At this point the Storm Data were much neater than when we found them and the analysis proceeded.
Damage and loss of life are a natural part of any natural disaster. The following subsections look at the NOAA data set on the basis of the cost of human life and cost of damage to crops and property.
I chose to look at these values separately to enhance the richness of the analysis. One could also choose to sum fatalities and injuries into a “casualties” indicator, however, there are certainly hazardous weather systems that produce a large number of injuries with fewer fatalities.
To start, I looked at the dataset as a whole, summing up the fatalities and injuries for all years and choosing the top 10 EVTYPES of injuries and fatalities. Later, it will be shown that, while arbitrary, this top-10 ranking covers a large proportion of the entire dataset. I also tracked the number of times a given EVTYPE was reported.
# All Fatalities
fatalEventsAlltime<-ddply(stormdata,"EVTYPE",summarize,
sum=sum(FATALITIES),
count=length(FATALITIES))
top10fatalAlltime=head(arrange(fatalEventsAlltime,desc(sum)),n=10)
# All Injuries
injuryEventsAlltime<-ddply(stormdata,"EVTYPE",summarize,
sum=sum(INJURIES),
count=length(INJURIES))
top10injureAlltime=head(arrange(injuryEventsAlltime,desc(sum)),n=10)
There is some overlap. Notably, the highest-ranking event type for injuries and fatalities is a Tornado. However, flash floods,for example, rank 2nd in most fatalities and 8th in most injuries.
Even more interesting was the absence of any deaths or injuries from an event other than Tornados before 1983.
min(filter(stormdata,EVTYPE!="TORNADO" & (FATALITIES>0 | INJURIES>0))$Year)
## [1] 1983
Because it is unlikely that Tornados were the only cause of weather-related injuries and deaths before 1983, I chose to look at only the data after 1983, where other EVTYPE labels came into play.
stormdataPost1983<-filter(stormdata,Year>="1983")
fatalEvents<-ddply(stormdataPost1983,"EVTYPE",summarize,
sum=sum(FATALITIES),
count=length(FATALITIES))
top10fatal=head(arrange(fatalEvents,desc(sum)),n=10)
# Injuries
injuryEvents<-ddply(stormdataPost1983,"EVTYPE",summarize,
sum=sum(INJURIES),
count=length(INJURIES))
top10injure=head(arrange(injuryEvents,desc(sum)),n=10)
Looking at the combination of the all-time and post-1983 rankings, two things are clear: 1) Tornados still rank highest in both injuries and fatalities. 2) The number of fatalities, casualties, and counts dropped.
merge(top10fatal,top10fatalAlltime,by="EVTYPE",suffixes=c(".AllTime",".post1983"))
## EVTYPE sum.AllTime count.AllTime sum.post1983 count.post1983
## 1 AVALANCHE 224 386 224 386
## 2 EXCESSIVE HEAT 1903 1678 1903 1678
## 3 FLASH FLOOD 1018 55034 1018 55034
## 4 FLOOD 470 25326 470 25326
## 5 HEAT 937 767 937 767
## 6 HIGH WIND 291 21796 291 21796
## 7 LIGHTNING 816 15754 816 15754
## 8 RIP CURRENT 368 470 368 470
## 9 THUNDERSTORM 511 179817 511 221105
## 10 TORNADO 2183 35823 5658 60685
merge(top10injure,top10injureAlltime,by="EVTYPE",suffixes=c(".AllTime",".post1983"))
## EVTYPE sum.AllTime count.AllTime sum.post1983 count.post1983
## 1 EXCESSIVE HEAT 6525 1678 6525 1678
## 2 FLASH FLOOD 1785 55034 1785 55034
## 3 FLOOD 6789 25326 6789 25326
## 4 HEAT 2100 767 2100 767
## 5 HIGH WIND 1471 21796 1471 21796
## 6 ICE STORM 1975 2006 1975 2006
## 7 LIGHTNING 5230 15754 5230 15754
## 8 THUNDERSTORM 7095 179817 7095 221105
## 9 THUNDERSTORM WIND 2428 109438 2428 109438
## 10 TORNADO 34758 35823 91364 60685
As mentioned earlier, top 10 lists are “cute” but sometimes arbitrary. In this case, the top 10 EVTYPES for fatalities accounted for approximately 80% of all fatalities reported. This is true for all-time:
sum(top10fatalAlltime$sum)/sum(fatalEventsAlltime$sum)*100
## [1] 80.53886
and post-1983 data:
sum(top10fatal$sum)/sum(fatalEvents$sum)*100
## [1] 74.74289
In the case of injuries, the top 10 injuring EVTYPES accounted for approximately 90% of all injuries reported:
sum(top10injureAlltime$sum)/sum(injuryEventsAlltime$sum)*100
## [1] 90.20409
sum(top10injure$sum)/sum(injuryEvents$sum)*100
## [1] 83.59667
I chose to plot these data using a scatter plot where the y-axis was the total victim count (injuries or fatalities) and the x-axis was the total number of times that EVTYPE occured. These were plotted in log-log fashion and the point sizes were scaled to represent a “normalized” incident rating. That is, the total number of incidents was divided by the total number of event instances.
In such a plot, events in the upper right region are frequent and afflict more victims whereas those in the lower left are less frequent and afflict fewer victims. Larger points indicate that there are more victims occur per event occurence, and smaller points indicate that fewer victims occur per event occurence.
# Apply labels for plotting purposes
top10fatal$outcome<-"Fatalities"
top10injure$outcome<-"Injuries"
humanFactors<-rbind(top10fatal,top10injure)
ggplot(data=humanFactors,aes(x=log10(count),y=log10(sum), col=outcome))+
geom_point(aes(size=sum/count))+
geom_text(col="black",aes(label=EVTYPE),hjust=0.2,vjust=-1,size=4,family=c("serif"))+
xlab("log(Event Count)")+
xlim(c(2,6))+
ylab("log(Incident Count)")+
ggtitle("Highest fatality and injury events, total incidents vs. event occurence")
From this plot, we can see, for example, that tornados are responsible for the most injuries of all other events and that more people are injured by tornados than killed by tornados. In fact, in most cases, the injury count (blue dot) is higher than the fatality count (red dot), except for avalanches, rip currents, and high wind which may indicate that these events more often result in death than injury.
Weather events can lead not only to damage of property but also affect the agriculture industry by damaging crops.
As in the above analysis, I started with the all-time data.
# All-time Property Damage
propertyAlltime<-ddply(stormdata,"EVTYPE",summarize,
TotalCost=sum(PROPDMG*10^PROPDMGEXP,na.rm=T),
TotalCount=length(PROPDMG)
)
top10propdmgAlltime=head(arrange(propertyAlltime,desc(TotalCost)),n=10)
# All-time Crop Damage
cropAlltime<-ddply(stormdata,"EVTYPE",summarize,
TotalCost=sum(CROPDMG*10^CROPDMGEXP,na.rm=T),
TotalCount=length(CROPDMG)
)
top10cropdmgAlltime=head(arrange(propertyAlltime,desc(TotalCost)),n=10)
This time, the raw data would have one believe that no property or crop damage occured due to any other event than tornados before 1993. A similar approach was taken and only property and crop damage data from 1993 onward were taken into account.
min(filter(stormdata,EVTYPE!="TORNADO" & (PROPDMG>0 | CROPDMG>0))$Year)
## [1] 1993
stormdataPost1993<-filter(stormdata,Year>=1993)
propertyDmg<-ddply(stormdataPost1993,"EVTYPE",summarize,
TotalCost=sum(PROPDMG*10^PROPDMGEXP,na.rm=T),
TotalCount=length(PROPDMG)
)
cropDmg<-ddply(stormdataPost1993,"EVTYPE",summarize,
TotalCost=sum(CROPDMG*10^CROPDMGEXP,na.rm=T),
TotalCount=length(CROPDMG)
)
top10prop=head(arrange(propertyDmg,desc(TotalCost)),n=10)
top10crop=head(arrange(cropDmg,desc(TotalCost)),n=10)
merge(top10propdmgAlltime,top10prop,by="EVTYPE",suffix=c(".AllTime",".post1993"))
## EVTYPE TotalCost.AllTime TotalCount.AllTime
## 1 FLASH FLOOD 17414680872 55034
## 2 FLOOD 144657709800 25326
## 3 HAIL 15735267456 288658
## 4 HIGH WIND 6003352990 21796
## 5 HURRICANE 84756180010 286
## 6 STORM SURGE 43323536000 261
## 7 THUNDERSTORM WIND 5431671378 109438
## 8 TORNADO 58552151814 60685
## 9 TROPICAL STORM 7714390550 697
## 10 WINTER STORM 6688497251 11433
## TotalCost.post1993 TotalCount.post1993
## 1 17414680872 55034
## 2 144657709800 25326
## 3 15735267456 226826
## 4 6003352990 21796
## 5 84756180010 286
## 6 43323536000 261
## 7 5431671378 109438
## 8 27953953244 25921
## 9 7714390550 697
## 10 6688497251 11433
merge(top10cropdmgAlltime,top10crop,by="EVTYPE",suffix=c(".AllTime",".post1993"))
## EVTYPE TotalCost.AllTime TotalCount.AllTime TotalCost.post1993
## 1 FLASH FLOOD 17414680872 55034 1437163150
## 2 FLOOD 144657709800 25326 5661968450
## 3 HAIL 15735267456 288658 3025954470
## 4 HURRICANE 84756180010 286 5515292800
## TotalCount.post1993
## 1 55034
## 2 25326
## 3 226826
## 4 286
In this case, the top 10 events leading to property damage account for approximately 88% of all damages. This is true for all-time:
sum(top10propdmgAlltime$TotalCost)/sum(propertyAlltime$TotalCost)*100
## [1] 91.13864
and post-1993 data:
sum(top10prop$TotalCost)/sum(propertyDmg$TotalCost)*100
## [1] 90.45674
In the case of crop damage, the top 10 injuring events that accounted for approximately 85% of all crop damage costs reported:
sum(top10crop$TotalCost)/sum(cropDmg$TotalCost)*100
## [1] 87.13101
These data were plotted in a similar fashion to injuries/fatalities. This time, the x-axis shows the log of the Event Count and the y-axis shows the total cost of damages to crops or property. The upper right portion shows events that occurred frequently and led to higher sums in damages whereas those in the lower right indicate infrequent events that led to less damage. The plot size indicates the sum of damages (in USD) normalized by the number of event occurences.
# apply a label for plotting purposes
top10prop$Label<-"Property"
top10crop$Label<-"Crop"
damagefactors<-rbind(top10prop,top10crop)
ggplot(data=damagefactors,aes(x=log10(TotalCount),y=log(TotalCost), col=Label))+
geom_point(aes(size=TotalCost/TotalCount))+
geom_text(col="black",aes(label=EVTYPE),hjust=0.2,vjust=-1,size=4,family=c("serif"))+
xlab("Count of Damage Events")+
ylab("Total Cost of Damages")+
ggtitle("Highest Property & Crop Damage events, total cost vs. event occurence")
As one might expect, there is some overlap in the placement of crop/property damage and injuries/fatalities: often in dangerous weather situations when property is at risk, human life is in danger. It was interesting to note that hurricanes and typhoons, although occuring somewhat less frequently, led to high damage costs to property not only overall, but per incident (as indicated by the relative size of the plot point).