Synopsis

The purpose of the paper is to analyze the NOAA data on extreme weather events in order to answer two questions.
First: What extreme weather events are the most damaging to public health?
Second: What extreme weather events cause the most damage to property?

There are a number of ways to interpret this question but for the purposes of this paper the main focus will be on the average total cost and the average causualties per event. The reason for this decision was that so many events had no recorded damage/causuallies the median is 0 for most of the events. Total would have been influenced by how many events of this type were recorded. Finally, although I chose median, it is not perfect as the standard deviatons tend to be quite large. Since record keeping has changed all All Event Types tracking started in 1996 I reserved my analysis to 1996 on. Our results show that Hurricanes are the most costly per incident (followed by storm surges and tropical storms), while hurricanes are the most dangerous to people per incident followed by tsunami and excessive heat.

Data Processing

The first step was loading libraries and reading in the data:

require(tidyr)
require(dtplyr)
##require(data.table)
require(stringdist)
require(plyr)
require(ggplot2)
require(gridExtra)
require(captioner)
if(!dir.exists(file.path("./data1"))){
  dir.create(file.path("./data1"), showWarnings = FALSE)
}
if(!file.exists("./data1/stormData.csv.bz2")) {
 download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",dest = "./data1/stormData.csv.bz2")
}
df<-read.csv("./data1/stormData.csv.bz2")

Once we have the data read in it’s time to process it. First I select the columns I am going to need. Next I limit the data to events that occurred in 1996 or later. I chose to limit the data to 1996 on as, mentioned earlier, this is when all extreme weather events started be tracked and is therefore more consistant and reliable. I then trim and upper case the event type, remove events named “Summary” (data entry errors) and change events with TSTM or THUNDERSTORM to “THUNDERSTORM WIND” to better match the “official” event types

dfSmall<-select(df,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,MAG,BGN_DATE)
dfSmall<-dfSmall[year(as.Date(df$BGN_DATE,"%m/%d/%Y %H:%M:%S"))>=1996,]
dfSmall$EVTYPE<-toupper(trimws(dfSmall$EVTYPE,"b"))
dfSmall<-dfSmall[!grepl("SUMMARY",dfSmall$EVTYPE,ignore.case = TRUE),]  ##remove rows with "summary"
dfSmall[grepl("TSTM | THUNDERSTORM",dfSmall$EVTYPE),]$EVTYPE<-"THUNDERSTORM WIND"

The next step is actually mapping the event types to “official” event types.Note I took the list of “offical” event types and copied them into the code as a vector. Thus the list (storme) is created programatically so as to be exactly reproducible. I then use amatch to create a vector that allows one to select the appropriate store event based on it’s match to EVTYPE (event type.) This new column is added to the data frame and will be the key to the analysis. (Note that the original EVTYPE is stilll there for comparavie purposes.) Finally we remove NA’s. In the end we still have more than 600k observations.

storme<-c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow",
          "Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
          "Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow",
          "High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood",
          "Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current",
          "Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression",
          "Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")
storme<-toupper(storme)

##I went with 3 because it was the most restrictive that gave me 46 out of the 48 items
dist<-amatch(dfSmall$EVTYPE,storme,maxDist=3)
newev<-storme[dist]
dfSmall$NewEvent<-newev
dfSmall<-dfSmall[complete.cases(dfSmall),]

The final preparatory step is to cacluate total costs and casualty figures. We use PROPDMGEXP and CROPDMGEXP to apply the appropriate multiplier to the data in the damage columns. We do this by spliting the data based on the exponent value (dropping factors with no items..it speeds up calculations.) This will leave us with dfSmallNew having 642077 observations to complete our analysis.

##now calculating the costs

dfSmall$cor <- dfSmall$PROPDMG
dfSmall$cor2<-dfSmall$CROPDMG
dfSmallA<-split(dfSmall,dfSmall$PROPDMGEXP, drop = TRUE)
##if df$PROPDMGEXP in 0:9 then mult by 10
for(i in 1:length(dfSmallA) ) {
  if(nrow(dfSmallA[[i]]) > 0) {
    corval <- as.character(dfSmallA[[i]][1,]$PROPDMGEXP)
    if(corval %in% c("0","1","2","3","4","5","6","7","8","9")) {
      corint<-as.integer(corval)
     dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*(10^corint);
    } else if(corval %in% c("h","H")) {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*100
    }else if(corval == "K") {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000
    }
    else if(corval %in% c("m","M")) {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000000
    }
    else if(corval == "B") {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000000000
    }
  }
}
dfSmallNew<-rbindlist(dfSmallA)



dfSmallB<-split(dfSmallNew,dfSmall$CROPDMGEX, drop = TRUE)
##if df$PROPDMGEXP in 0:9 then mult by 10
for(i in 1:length(dfSmallB) ) {
  if(nrow(dfSmallB[[i]]) > 0) {
    corval <- as.character(dfSmallB[[i]][1,]$CROPDMGEXP)
    if(corval %in% c("0","1","2","3","4","5","6","7","8","9")) {
      corint<-as.integer(corval)
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*(10^corint)
    } else if(corval %in% c("h","H")) {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*100
    }else if(corval == "K") {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000
    }
    else if(corval %in% c("m","M")) {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000000
    }
    else if(corval == "B") {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000000000
    }
  }
}
dfSmallNew<-rbindlist(dfSmallB)

dfSmallNew$TotalCost<-dfSmallNew$cor + dfSmallNew$cor2
dfSmallNew$TotalCasualties<- dfSmallNew$FATALITIES + dfSmallNew$INJURIES

Aggriate Data

In this section I will present the code I used to aggregate the data which will lead to my results. We created two separate aggregate data frames, one for cost, the other for casulaties. Since it would be arbitrary to weigh deaths vs injury (and there is no indication of seriousness of injury) I summed fatalities and injuries to form one total casualty number (as shown in the previous code section,) Note that I created the log of the mean cost as the variance of costs were so large as to make it difficult to plot. By taking the log of the means it brings the results into a closer range of values while providing a consistant answer to the question of what events are most costly.
I also had to replace -Inf (log of 0) with 0 to prevent a plotting error. NOTE: we only ended up with 46 out of the 48 possible types. The two that didn’t register were debris flow and freezing fog…not major events. I also wanted to see the total number of observations making up each aggrigate value so I included that as well.

I also set options so that we didn’t have numbers coming back in scientific notation.

options(scipen=100, digits=2)

##use ddply to calcluate the aggregate for cost
aa<-ddply(dfSmallNew,~NewEvent,summarise,median=median(TotalCost),mean = mean(TotalCost),sd=sd(TotalCost),sum=sum(TotalCost),max=max(TotalCost))

##calculate count ... this is the same for both data frams
cc<-count(dfSmallNew$NewEvent)
results<-cbind(aa,cc$freq)
results$LogMean<-log(results$mean)

## remove infinate value
results[results$LogMean == -Inf,]$LogMean = 0

##use ddply to calcuate the aggregate for casualties
aa1<-ddply(dfSmallNew,~NewEvent,summarise,median=median(TotalCasualties),mean = mean(TotalCasualties),sd=sd(TotalCasualties),sum=sum(TotalCasualties),
           max=max(TotalCasualties))
results2<-cbind(aa1,cc$freq)

##decided to calculate fatalites separately
aa2<-ddply(dfSmallNew,~NewEvent,summarise,median=median(FATALITIES),mean = mean(TotalCasualties),sd=sd(FATALITIES),sum=sum(FATALITIES),
           max=max(FATALITIES))
results3<-cbind(aa2,cc$freq)

Results

The first table is the aggregate data based on cost

#initalizing capioners
figs <- captioner(prefix="Figure")
tbls <- captioner(prefix="Table")
#first change the names
names(results)<-c("Event","Median Cost","Mean Cost","Std","Total Cost", "Max Cost", "Sample Size","Log of Mean")
writeLines("td, th { padding : 6px } th { background-color : black ; color : white; border : 1px solid white; } td { color : black ; border : 1px solid black }", con = "mystyle.css")
##resultsOrder<-results[order(results$`Mean Cost`),]
knitr::kable(results,format="html")
Event Median Cost Mean Cost Std Total Cost Max Cost Sample Size Log of Mean
ASTRONOMICAL LOW TIDE 0 1839.1 17637 320000 200000 174 7.5
AVALANCHE 0 9819.6 82252 3711800 1100000 378 9.2
BLIZZARD 0 199642.6 3083425 525659017 102000000 2633 12.2
COASTAL FLOOD 0 477432.2 3849456 355209560 75000000 744 13.1
COLD/WIND CHILL 0 3693.1 37063 1990600 600000 539 8.2
DENSE FOG 0 6134.9 57373 7319000 1450000 1193 8.7
DENSE SMOKE 0 10000.0 31623 100000 100000 10 9.2
DROUGHT 0 429977.1 13325098 1046134294 645150000 2433 13.0
DUST DEVIL 300 4844.0 11766 663630 75000 137 8.5
DUST STORM 0 13130.9 51106 5475602 640002 417 9.5
EXCESSIVE HEAT 0 4664.4 110152 7724194 3300000 1656 8.4
EXTREME COLD/WIND CHILL 0 8630.8 163710 8648050 5000000 1002 9.1
FLASH FLOOD 0 298472.9 6180060 15222414977 1000000000 51001 12.6
FLOOD 0 5788447.9 729776449 143958699876 115000000033 24870 15.6
FREEZING FOG 0 47434.8 294586 2182000 2000000 46 10.8
FROST/FREEZE 0 7808.7 117272 10487134 3020005 1343 9.0
FUNNEL CLOUD 0 22.1 828 134100 50000 6068 3.1
HAIL 0 70266.4 5291859 14595942009 1800000000 207723 11.2
HEAT 0 2123.2 56059 1520177 1500000 716 7.7
HEAVY RAIN 0 50730.8 1104674 584875418 70000002 11529 10.8
HEAVY SNOW 0 45153.7 887175 635809132 56500002 14081 10.7
HIGH SURF 0 114034.6 1403751 83929500 30000000 736 11.6
HIGH WIND 0 263605.1 12523804 5248377628 1300000000 19910 12.5
HURRICANE (TYPHOON) 6765001 787566418.2 2445917029 69305844798 16930000000 88 20.5
ICE STORM 0 1938397.6 16316807 3642249035 360000000 1879 14.5
LAKE-EFFECT SNOW 0 61253.1 796534 40182000 20000000 656 11.0
LAKESHORE FLOOD 0 327826.1 1103276 7540000 4500000 23 12.7
LIGHTNING 5000 56276.8 302132 743078983 15000000 13204 10.9
MARINE HAIL 0 9.1 134 4000 2000 442 2.2
MARINE HIGH WIND 0 9607.5 86508 1297010 1000000 135 9.2
MARINE STRONG WIND 10 8715.2 43458 418330 300000 48 9.1
RIP CURRENT 0 222.1 3459 163000 80000 734 5.4
SEICHE 0 42608.7 155058 980000 750000 23 10.7
SLEET 0 0.0 0 0 0 58 0.0
STORM SURGE/TIDE 0 31359384.1 331065125 4641188850 4000000000 148 17.3
STRONG WIND 3000 47123.5 1179123 176995857 70000048 3756 10.8
THUNDERSTORM WIND 0 35485.9 2374042 7919633460 750000000 223177 10.5
TORNADO 1000 1063187.2 24055725 24617035839 2800000000 23154 13.9
TROPICAL DEPRESSION 6500 28950.0 129715 1737000 1000000 60 10.3
TROPICAL STORM 5000 11205983.6 198820310 7642480814 5150000000 682 16.2
TSUNAMI 115000 7203101.0 18617710 144062020 81000020 20 15.8
VOLCANIC ASH 0 21739.1 73587 500000 300000 23 10.0
WATERSPOUT 0 1689.3 85960 5730200 5000000 3392 7.4
WILDFIRE 0 1741827.0 27714269 4758671364 1040000007 2732 14.4
WINTER STORM 0 135437.4 7112176 1532745214 750000000 11317 11.8
WINTER WEATHER 0 2986.4 53603 20866015 3000000 6987 8.0
tbls(name="costtab","Cost by event type")

[1] “Table 1: Cost by event type”

The second table is the aggregate data based on casualties

#first change the names
names(results2)<-c("Event","Median Casulaties","Mean Casualties","Std","Total Casualties", "Max Casualties", "Sample Size")
names(results3)<-c("Event","Median Fatalities","Mean Fatalities","Std","Total Fatalities", "Max Fatalities", "Sample Size")
writeLines("td {padding : 7px}, th { padding : 7px } th { background-color : black ; color : white; border : 1px solid white; } td { color : black ; border : 1px solid black }", con = "mystyle2.css")
knitr::kable(results2,format="html")
Event Median Casulaties Mean Casualties Std Total Casualties Max Casualties Sample Size
ASTRONOMICAL LOW TIDE 0 0.00 0.00 0 0 174
AVALANCHE 1 1.00 1.16 379 8 378
BLIZZARD 0 0.17 3.34 455 152 2633
COASTAL FLOOD 0 0.01 0.13 8 2 744
COLD/WIND CHILL 0 0.20 0.67 107 10 539
DENSE FOG 0 0.13 1.28 152 24 1193
DENSE SMOKE 0 0.00 0.00 0 0 10
DROUGHT 0 0.00 0.08 4 4 2433
DUST DEVIL 0 0.30 1.42 41 14 137
DUST STORM 0 0.93 4.09 387 40 417
EXCESSIVE HEAT 0 4.94 27.47 8188 521 1656
EXTREME COLD/WIND CHILL 0 0.15 0.71 149 12 1002
FLASH FLOOD 0 0.05 1.34 2561 159 51001
FLOOD 0 0.32 11.18 7978 802 24870
FREEZING FOG 0 0.00 0.00 0 0 46
FROST/FREEZE 0 0.00 0.00 0 0 1343
FUNNEL CLOUD 0 0.00 0.01 1 1 6068
HAIL 0 0.00 0.32 720 88 207723
HEAT 0 2.04 16.22 1459 232 716
HEAVY RAIN 0 0.03 0.54 324 32 11529
HEAVY SNOW 0 0.06 1.18 852 100 14081
HIGH SURF 0 0.35 2.28 255 55 736
HIGH WIND 0 0.07 0.82 1318 70 19910
HURRICANE (TYPHOON) 0 15.22 90.66 1339 787 88
ICE STORM 0 0.21 2.04 400 53 1879
LAKE-EFFECT SNOW 0 0.00 0.00 0 0 656
LAKESHORE FLOOD 0 0.00 0.00 0 0 23
LIGHTNING 0 0.36 1.27 4792 51 13204
MARINE HAIL 0 0.00 0.00 0 0 442
MARINE HIGH WIND 0 0.01 0.17 2 2 135
MARINE STRONG WIND 0 0.75 1.52 36 8 48
RIP CURRENT 1 1.42 2.29 1045 35 734
SEICHE 0 0.00 0.00 0 0 23
SLEET 0 0.00 0.00 0 0 58
STORM SURGE/TIDE 0 0.11 0.99 16 11 148
STRONG WIND 0 0.11 0.57 409 13 3756
THUNDERSTORM WIND 0 0.02 0.49 5559 70 223177
TORNADO 0 0.96 13.57 22178 1308 23154
TROPICAL DEPRESSION 0 0.00 0.00 0 0 60
TROPICAL STORM 0 0.58 8.24 395 201 682
TSUNAMI 0 8.10 35.99 162 161 20
VOLCANIC ASH 0 0.00 0.00 0 0 23
WATERSPOUT 0 0.00 0.04 4 2 3392
WILDFIRE 0 0.36 3.02 986 104 2732
WINTER STORM 0 0.13 2.26 1483 170 11317
WINTER WEATHER 0 0.05 1.89 376 138 6987
tbls(name="castab","Casualty by event type")

[1] “Table 2: Casualty by event type”

costPlot=ggplot(data=results, aes(x=Event,y=`Log of Mean`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8, angle = 70, hjust = 1))
casualtyPlot=ggplot(data=results2, aes(x=Event,y=`Mean Casualties`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8,angle = 70, hjust = 1))
##grid.arrange(costPlot,casualtyPlot,ncol=1)
print(costPlot)

figs(name="cost","(Log)Mean Cost by event type")
## [1] "Figure  1: (Log)Mean Cost by event type"

As you can see hurricanes have the largest cost per incident followed by storm surges and tropical storms.

print(casualtyPlot)

figs(name="cas","Mean Casualties by event type")
## [1] "Figure  2: Mean Casualties by event type"

As you can see here that hurricanes cause the most casualties per incident, followed by tsuamis and excessive heat. The toll from hurricanes were a little surprising as there is often sufficient warning but oftem people don’t head them. Excessive heat is not surprising as it poses a real danger especially to elderly.

Looking at these findings we decided to look at fatalites separately

fatalityPlot=ggplot(data=results3, aes(x=Event,y=`Mean Fatalities`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8,angle = 70, hjust = 1))
print(fatalityPlot)

figs(name="fatal","Mean Fatalities by event type")
## [1] "Figure  3: Mean Fatalities by event type"

Looking at the chart of fatalities it appears to lead to the same conclusions as total casualties. Hurricanes, tsuamis and excessive heat are the most danagerous to humans.

Conclusions

Based on our analysis, hurricanes are both the most dangerous and the most costly per incident. Storm surges and tropical storms are also very costly. Tsunamis and excessive heat follow in terms of danger to people but it should also be pointed out that tsnamis also rank high in terms of cost per incident. In terms of public policy, evacuations prior to hurricanes are critical as well as providing opportunities for people who don’t have air conditioning to cool off (ex. “cooling centers”) during heat waves.