The purpose of the paper is to analyze the NOAA data on extreme weather events in order to answer two questions.
First: What extreme weather events are the most damaging to public health?
Second: What extreme weather events cause the most damage to property?
There are a number of ways to interpret this question but for the purposes of this paper the main focus will be on the average total cost and the average causualties per event. The reason for this decision was that so many events had no recorded damage/causuallies the median is 0 for most of the events. Total would have been influenced by how many events of this type were recorded. Finally, although I chose median, it is not perfect as the standard deviatons tend to be quite large. Since record keeping has changed all All Event Types tracking started in 1996 I reserved my analysis to 1996 on. Our results show that Hurricanes are the most costly per incident (followed by storm surges and tropical storms), while hurricanes are the most dangerous to people per incident followed by tsunami and excessive heat.
The first step was loading libraries and reading in the data:
require(tidyr)
require(dtplyr)
##require(data.table)
require(stringdist)
require(plyr)
require(ggplot2)
require(gridExtra)
require(captioner)
if(!dir.exists(file.path("./data1"))){
dir.create(file.path("./data1"), showWarnings = FALSE)
}
if(!file.exists("./data1/stormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",dest = "./data1/stormData.csv.bz2")
}
df<-read.csv("./data1/stormData.csv.bz2")
Once we have the data read in it’s time to process it. First I select the columns I am going to need. Next I limit the data to events that occurred in 1996 or later. I chose to limit the data to 1996 on as, mentioned earlier, this is when all extreme weather events started be tracked and is therefore more consistant and reliable. I then trim and upper case the event type, remove events named “Summary” (data entry errors) and change events with TSTM or THUNDERSTORM to “THUNDERSTORM WIND” to better match the “official” event types
dfSmall<-select(df,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,MAG,BGN_DATE)
dfSmall<-dfSmall[year(as.Date(df$BGN_DATE,"%m/%d/%Y %H:%M:%S"))>=1996,]
dfSmall$EVTYPE<-toupper(trimws(dfSmall$EVTYPE,"b"))
dfSmall<-dfSmall[!grepl("SUMMARY",dfSmall$EVTYPE,ignore.case = TRUE),] ##remove rows with "summary"
dfSmall[grepl("TSTM | THUNDERSTORM",dfSmall$EVTYPE),]$EVTYPE<-"THUNDERSTORM WIND"
The next step is actually mapping the event types to “official” event types.Note I took the list of “offical” event types and copied them into the code as a vector. Thus the list (storme) is created programatically so as to be exactly reproducible. I then use amatch to create a vector that allows one to select the appropriate store event based on it’s match to EVTYPE (event type.) This new column is added to the data frame and will be the key to the analysis. (Note that the original EVTYPE is stilll there for comparavie purposes.) Finally we remove NA’s. In the end we still have more than 600k observations.
storme<-c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow",
"Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
"Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow",
"High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood",
"Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current",
"Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression",
"Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")
storme<-toupper(storme)
##I went with 3 because it was the most restrictive that gave me 46 out of the 48 items
dist<-amatch(dfSmall$EVTYPE,storme,maxDist=3)
newev<-storme[dist]
dfSmall$NewEvent<-newev
dfSmall<-dfSmall[complete.cases(dfSmall),]
The final preparatory step is to cacluate total costs and casualty figures. We use PROPDMGEXP and CROPDMGEXP to apply the appropriate multiplier to the data in the damage columns. We do this by spliting the data based on the exponent value (dropping factors with no items..it speeds up calculations.) This will leave us with dfSmallNew having 642077 observations to complete our analysis.
##now calculating the costs
dfSmall$cor <- dfSmall$PROPDMG
dfSmall$cor2<-dfSmall$CROPDMG
dfSmallA<-split(dfSmall,dfSmall$PROPDMGEXP, drop = TRUE)
##if df$PROPDMGEXP in 0:9 then mult by 10
for(i in 1:length(dfSmallA) ) {
if(nrow(dfSmallA[[i]]) > 0) {
corval <- as.character(dfSmallA[[i]][1,]$PROPDMGEXP)
if(corval %in% c("0","1","2","3","4","5","6","7","8","9")) {
corint<-as.integer(corval)
dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*(10^corint);
} else if(corval %in% c("h","H")) {
dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*100
}else if(corval == "K") {
dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000
}
else if(corval %in% c("m","M")) {
dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000000
}
else if(corval == "B") {
dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000000000
}
}
}
dfSmallNew<-rbindlist(dfSmallA)
dfSmallB<-split(dfSmallNew,dfSmall$CROPDMGEX, drop = TRUE)
##if df$PROPDMGEXP in 0:9 then mult by 10
for(i in 1:length(dfSmallB) ) {
if(nrow(dfSmallB[[i]]) > 0) {
corval <- as.character(dfSmallB[[i]][1,]$CROPDMGEXP)
if(corval %in% c("0","1","2","3","4","5","6","7","8","9")) {
corint<-as.integer(corval)
dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*(10^corint)
} else if(corval %in% c("h","H")) {
dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*100
}else if(corval == "K") {
dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000
}
else if(corval %in% c("m","M")) {
dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000000
}
else if(corval == "B") {
dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000000000
}
}
}
dfSmallNew<-rbindlist(dfSmallB)
dfSmallNew$TotalCost<-dfSmallNew$cor + dfSmallNew$cor2
dfSmallNew$TotalCasualties<- dfSmallNew$FATALITIES + dfSmallNew$INJURIES
In this section I will present the code I used to aggregate the data which will lead to my results. We created two separate aggregate data frames, one for cost, the other for casulaties. Since it would be arbitrary to weigh deaths vs injury (and there is no indication of seriousness of injury) I summed fatalities and injuries to form one total casualty number (as shown in the previous code section,) Note that I created the log of the mean cost as the variance of costs were so large as to make it difficult to plot. By taking the log of the means it brings the results into a closer range of values while providing a consistant answer to the question of what events are most costly.
I also had to replace -Inf (log of 0) with 0 to prevent a plotting error. NOTE: we only ended up with 46 out of the 48 possible types. The two that didn’t register were debris flow and freezing fog…not major events. I also wanted to see the total number of observations making up each aggrigate value so I included that as well.
I also set options so that we didn’t have numbers coming back in scientific notation.
options(scipen=100, digits=2)
##use ddply to calcluate the aggregate for cost
aa<-ddply(dfSmallNew,~NewEvent,summarise,median=median(TotalCost),mean = mean(TotalCost),sd=sd(TotalCost),sum=sum(TotalCost),max=max(TotalCost))
##calculate count ... this is the same for both data frams
cc<-count(dfSmallNew$NewEvent)
results<-cbind(aa,cc$freq)
results$LogMean<-log(results$mean)
## remove infinate value
results[results$LogMean == -Inf,]$LogMean = 0
##use ddply to calcuate the aggregate for casualties
aa1<-ddply(dfSmallNew,~NewEvent,summarise,median=median(TotalCasualties),mean = mean(TotalCasualties),sd=sd(TotalCasualties),sum=sum(TotalCasualties),
max=max(TotalCasualties))
results2<-cbind(aa1,cc$freq)
##decided to calculate fatalites separately
aa2<-ddply(dfSmallNew,~NewEvent,summarise,median=median(FATALITIES),mean = mean(TotalCasualties),sd=sd(FATALITIES),sum=sum(FATALITIES),
max=max(FATALITIES))
results3<-cbind(aa2,cc$freq)
The first table is the aggregate data based on cost
#initalizing capioners
figs <- captioner(prefix="Figure")
tbls <- captioner(prefix="Table")
#first change the names
names(results)<-c("Event","Median Cost","Mean Cost","Std","Total Cost", "Max Cost", "Sample Size","Log of Mean")
writeLines("td, th { padding : 6px } th { background-color : black ; color : white; border : 1px solid white; } td { color : black ; border : 1px solid black }", con = "mystyle.css")
##resultsOrder<-results[order(results$`Mean Cost`),]
knitr::kable(results,format="html")
| Event | Median Cost | Mean Cost | Std | Total Cost | Max Cost | Sample Size | Log of Mean |
|---|---|---|---|---|---|---|---|
| ASTRONOMICAL LOW TIDE | 0 | 1839.1 | 17637 | 320000 | 200000 | 174 | 7.5 |
| AVALANCHE | 0 | 9819.6 | 82252 | 3711800 | 1100000 | 378 | 9.2 |
| BLIZZARD | 0 | 199642.6 | 3083425 | 525659017 | 102000000 | 2633 | 12.2 |
| COASTAL FLOOD | 0 | 477432.2 | 3849456 | 355209560 | 75000000 | 744 | 13.1 |
| COLD/WIND CHILL | 0 | 3693.1 | 37063 | 1990600 | 600000 | 539 | 8.2 |
| DENSE FOG | 0 | 6134.9 | 57373 | 7319000 | 1450000 | 1193 | 8.7 |
| DENSE SMOKE | 0 | 10000.0 | 31623 | 100000 | 100000 | 10 | 9.2 |
| DROUGHT | 0 | 429977.1 | 13325098 | 1046134294 | 645150000 | 2433 | 13.0 |
| DUST DEVIL | 300 | 4844.0 | 11766 | 663630 | 75000 | 137 | 8.5 |
| DUST STORM | 0 | 13130.9 | 51106 | 5475602 | 640002 | 417 | 9.5 |
| EXCESSIVE HEAT | 0 | 4664.4 | 110152 | 7724194 | 3300000 | 1656 | 8.4 |
| EXTREME COLD/WIND CHILL | 0 | 8630.8 | 163710 | 8648050 | 5000000 | 1002 | 9.1 |
| FLASH FLOOD | 0 | 298472.9 | 6180060 | 15222414977 | 1000000000 | 51001 | 12.6 |
| FLOOD | 0 | 5788447.9 | 729776449 | 143958699876 | 115000000033 | 24870 | 15.6 |
| FREEZING FOG | 0 | 47434.8 | 294586 | 2182000 | 2000000 | 46 | 10.8 |
| FROST/FREEZE | 0 | 7808.7 | 117272 | 10487134 | 3020005 | 1343 | 9.0 |
| FUNNEL CLOUD | 0 | 22.1 | 828 | 134100 | 50000 | 6068 | 3.1 |
| HAIL | 0 | 70266.4 | 5291859 | 14595942009 | 1800000000 | 207723 | 11.2 |
| HEAT | 0 | 2123.2 | 56059 | 1520177 | 1500000 | 716 | 7.7 |
| HEAVY RAIN | 0 | 50730.8 | 1104674 | 584875418 | 70000002 | 11529 | 10.8 |
| HEAVY SNOW | 0 | 45153.7 | 887175 | 635809132 | 56500002 | 14081 | 10.7 |
| HIGH SURF | 0 | 114034.6 | 1403751 | 83929500 | 30000000 | 736 | 11.6 |
| HIGH WIND | 0 | 263605.1 | 12523804 | 5248377628 | 1300000000 | 19910 | 12.5 |
| HURRICANE (TYPHOON) | 6765001 | 787566418.2 | 2445917029 | 69305844798 | 16930000000 | 88 | 20.5 |
| ICE STORM | 0 | 1938397.6 | 16316807 | 3642249035 | 360000000 | 1879 | 14.5 |
| LAKE-EFFECT SNOW | 0 | 61253.1 | 796534 | 40182000 | 20000000 | 656 | 11.0 |
| LAKESHORE FLOOD | 0 | 327826.1 | 1103276 | 7540000 | 4500000 | 23 | 12.7 |
| LIGHTNING | 5000 | 56276.8 | 302132 | 743078983 | 15000000 | 13204 | 10.9 |
| MARINE HAIL | 0 | 9.1 | 134 | 4000 | 2000 | 442 | 2.2 |
| MARINE HIGH WIND | 0 | 9607.5 | 86508 | 1297010 | 1000000 | 135 | 9.2 |
| MARINE STRONG WIND | 10 | 8715.2 | 43458 | 418330 | 300000 | 48 | 9.1 |
| RIP CURRENT | 0 | 222.1 | 3459 | 163000 | 80000 | 734 | 5.4 |
| SEICHE | 0 | 42608.7 | 155058 | 980000 | 750000 | 23 | 10.7 |
| SLEET | 0 | 0.0 | 0 | 0 | 0 | 58 | 0.0 |
| STORM SURGE/TIDE | 0 | 31359384.1 | 331065125 | 4641188850 | 4000000000 | 148 | 17.3 |
| STRONG WIND | 3000 | 47123.5 | 1179123 | 176995857 | 70000048 | 3756 | 10.8 |
| THUNDERSTORM WIND | 0 | 35485.9 | 2374042 | 7919633460 | 750000000 | 223177 | 10.5 |
| TORNADO | 1000 | 1063187.2 | 24055725 | 24617035839 | 2800000000 | 23154 | 13.9 |
| TROPICAL DEPRESSION | 6500 | 28950.0 | 129715 | 1737000 | 1000000 | 60 | 10.3 |
| TROPICAL STORM | 5000 | 11205983.6 | 198820310 | 7642480814 | 5150000000 | 682 | 16.2 |
| TSUNAMI | 115000 | 7203101.0 | 18617710 | 144062020 | 81000020 | 20 | 15.8 |
| VOLCANIC ASH | 0 | 21739.1 | 73587 | 500000 | 300000 | 23 | 10.0 |
| WATERSPOUT | 0 | 1689.3 | 85960 | 5730200 | 5000000 | 3392 | 7.4 |
| WILDFIRE | 0 | 1741827.0 | 27714269 | 4758671364 | 1040000007 | 2732 | 14.4 |
| WINTER STORM | 0 | 135437.4 | 7112176 | 1532745214 | 750000000 | 11317 | 11.8 |
| WINTER WEATHER | 0 | 2986.4 | 53603 | 20866015 | 3000000 | 6987 | 8.0 |
tbls(name="costtab","Cost by event type")
[1] “Table 1: Cost by event type”
The second table is the aggregate data based on casualties
#first change the names
names(results2)<-c("Event","Median Casulaties","Mean Casualties","Std","Total Casualties", "Max Casualties", "Sample Size")
names(results3)<-c("Event","Median Fatalities","Mean Fatalities","Std","Total Fatalities", "Max Fatalities", "Sample Size")
writeLines("td {padding : 7px}, th { padding : 7px } th { background-color : black ; color : white; border : 1px solid white; } td { color : black ; border : 1px solid black }", con = "mystyle2.css")
knitr::kable(results2,format="html")
| Event | Median Casulaties | Mean Casualties | Std | Total Casualties | Max Casualties | Sample Size |
|---|---|---|---|---|---|---|
| ASTRONOMICAL LOW TIDE | 0 | 0.00 | 0.00 | 0 | 0 | 174 |
| AVALANCHE | 1 | 1.00 | 1.16 | 379 | 8 | 378 |
| BLIZZARD | 0 | 0.17 | 3.34 | 455 | 152 | 2633 |
| COASTAL FLOOD | 0 | 0.01 | 0.13 | 8 | 2 | 744 |
| COLD/WIND CHILL | 0 | 0.20 | 0.67 | 107 | 10 | 539 |
| DENSE FOG | 0 | 0.13 | 1.28 | 152 | 24 | 1193 |
| DENSE SMOKE | 0 | 0.00 | 0.00 | 0 | 0 | 10 |
| DROUGHT | 0 | 0.00 | 0.08 | 4 | 4 | 2433 |
| DUST DEVIL | 0 | 0.30 | 1.42 | 41 | 14 | 137 |
| DUST STORM | 0 | 0.93 | 4.09 | 387 | 40 | 417 |
| EXCESSIVE HEAT | 0 | 4.94 | 27.47 | 8188 | 521 | 1656 |
| EXTREME COLD/WIND CHILL | 0 | 0.15 | 0.71 | 149 | 12 | 1002 |
| FLASH FLOOD | 0 | 0.05 | 1.34 | 2561 | 159 | 51001 |
| FLOOD | 0 | 0.32 | 11.18 | 7978 | 802 | 24870 |
| FREEZING FOG | 0 | 0.00 | 0.00 | 0 | 0 | 46 |
| FROST/FREEZE | 0 | 0.00 | 0.00 | 0 | 0 | 1343 |
| FUNNEL CLOUD | 0 | 0.00 | 0.01 | 1 | 1 | 6068 |
| HAIL | 0 | 0.00 | 0.32 | 720 | 88 | 207723 |
| HEAT | 0 | 2.04 | 16.22 | 1459 | 232 | 716 |
| HEAVY RAIN | 0 | 0.03 | 0.54 | 324 | 32 | 11529 |
| HEAVY SNOW | 0 | 0.06 | 1.18 | 852 | 100 | 14081 |
| HIGH SURF | 0 | 0.35 | 2.28 | 255 | 55 | 736 |
| HIGH WIND | 0 | 0.07 | 0.82 | 1318 | 70 | 19910 |
| HURRICANE (TYPHOON) | 0 | 15.22 | 90.66 | 1339 | 787 | 88 |
| ICE STORM | 0 | 0.21 | 2.04 | 400 | 53 | 1879 |
| LAKE-EFFECT SNOW | 0 | 0.00 | 0.00 | 0 | 0 | 656 |
| LAKESHORE FLOOD | 0 | 0.00 | 0.00 | 0 | 0 | 23 |
| LIGHTNING | 0 | 0.36 | 1.27 | 4792 | 51 | 13204 |
| MARINE HAIL | 0 | 0.00 | 0.00 | 0 | 0 | 442 |
| MARINE HIGH WIND | 0 | 0.01 | 0.17 | 2 | 2 | 135 |
| MARINE STRONG WIND | 0 | 0.75 | 1.52 | 36 | 8 | 48 |
| RIP CURRENT | 1 | 1.42 | 2.29 | 1045 | 35 | 734 |
| SEICHE | 0 | 0.00 | 0.00 | 0 | 0 | 23 |
| SLEET | 0 | 0.00 | 0.00 | 0 | 0 | 58 |
| STORM SURGE/TIDE | 0 | 0.11 | 0.99 | 16 | 11 | 148 |
| STRONG WIND | 0 | 0.11 | 0.57 | 409 | 13 | 3756 |
| THUNDERSTORM WIND | 0 | 0.02 | 0.49 | 5559 | 70 | 223177 |
| TORNADO | 0 | 0.96 | 13.57 | 22178 | 1308 | 23154 |
| TROPICAL DEPRESSION | 0 | 0.00 | 0.00 | 0 | 0 | 60 |
| TROPICAL STORM | 0 | 0.58 | 8.24 | 395 | 201 | 682 |
| TSUNAMI | 0 | 8.10 | 35.99 | 162 | 161 | 20 |
| VOLCANIC ASH | 0 | 0.00 | 0.00 | 0 | 0 | 23 |
| WATERSPOUT | 0 | 0.00 | 0.04 | 4 | 2 | 3392 |
| WILDFIRE | 0 | 0.36 | 3.02 | 986 | 104 | 2732 |
| WINTER STORM | 0 | 0.13 | 2.26 | 1483 | 170 | 11317 |
| WINTER WEATHER | 0 | 0.05 | 1.89 | 376 | 138 | 6987 |
tbls(name="castab","Casualty by event type")
[1] “Table 2: Casualty by event type”
costPlot=ggplot(data=results, aes(x=Event,y=`Log of Mean`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8, angle = 70, hjust = 1))
casualtyPlot=ggplot(data=results2, aes(x=Event,y=`Mean Casualties`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8,angle = 70, hjust = 1))
##grid.arrange(costPlot,casualtyPlot,ncol=1)
print(costPlot)
figs(name="cost","(Log)Mean Cost by event type")
## [1] "Figure 1: (Log)Mean Cost by event type"
As you can see hurricanes have the largest cost per incident followed by storm surges and tropical storms.
print(casualtyPlot)
figs(name="cas","Mean Casualties by event type")
## [1] "Figure 2: Mean Casualties by event type"
As you can see here that hurricanes cause the most casualties per incident, followed by tsuamis and excessive heat. The toll from hurricanes were a little surprising as there is often sufficient warning but oftem people don’t head them. Excessive heat is not surprising as it poses a real danger especially to elderly.
Looking at these findings we decided to look at fatalites separately
fatalityPlot=ggplot(data=results3, aes(x=Event,y=`Mean Fatalities`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8,angle = 70, hjust = 1))
print(fatalityPlot)
figs(name="fatal","Mean Fatalities by event type")
## [1] "Figure 3: Mean Fatalities by event type"
Looking at the chart of fatalities it appears to lead to the same conclusions as total casualties. Hurricanes, tsuamis and excessive heat are the most danagerous to humans.
Based on our analysis, hurricanes are both the most dangerous and the most costly per incident. Storm surges and tropical storms are also very costly. Tsunamis and excessive heat follow in terms of danger to people but it should also be pointed out that tsnamis also rank high in terms of cost per incident. In terms of public policy, evacuations prior to hurricanes are critical as well as providing opportunities for people who don’t have air conditioning to cool off (ex. “cooling centers”) during heat waves.