Reproducible Research - Extreme Weather Events

Synopsis

The purpose of the paper is to analyze the NOAA data on extreme weather events in order to answer two questions.
First: What extreme weather events are the most damaging to public health?
Second: What extreme weather events cause the most damage to property?

There are a number of ways to interpret this question but for the purposes of this paper the main focus will be on the average total cost and the average causualties per event. The reason for this decision was that so many events had no recorded damage/causuallies the median is 0 for most of the events. Total would have been influenced by how many events of this type were recorded. Finally, although I chose median, it is not perfect as the standard deviatons tend to be quite large. Since record keeping has changed all All Event Types tracking started in 1996 I reserved my analysis to 1996 on. Our results show that Hurricanes are the most costly per incident (followed by storm surges and tropical storms), while hurricanes are the most dangerous to people per incident followed by tsunami and excessive heat.

Data Processing

The first step was loading libraries and reading in the data:

require(tidyr)
require(dtplyr)
##require(data.table)
require(stringdist)
require(plyr)
require(ggplot2)
require(gridExtra)
require(captioner)
if(!dir.exists(file.path("./data1"))){
  dir.create(file.path("./data1"), showWarnings = FALSE)
}
if(!file.exists("./data1/stormData.csv.bz2")) {
 download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",dest = "./data1/stormData.csv.bz2")
}
df<-read.csv("./data1/stormData.csv.bz2")

Once we have the data read in it’s time to process it. First I select the columns I am going to need. Next I limit the data to events that occurred in 1996 or later. I chose to limit the data to 1996 on as, mentioned earlier, this is when all extreme weather events started be tracked and is therefore more consistant and reliable. I then trim and upper case the event type, remove events named “Summary” (data entry errors) and change events with TSTM or THUNDERSTORM to “THUNDERSTORM WIND” to better match the “official” event types

dfSmall<-select(df,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,MAG,BGN_DATE)
dfSmall<-dfSmall[year(as.Date(df$BGN_DATE,"%m/%d/%Y %H:%M:%S"))>=1996,]
dfSmall$EVTYPE<-toupper(trimws(dfSmall$EVTYPE,"b"))
dfSmall<-dfSmall[!grepl("SUMMARY",dfSmall$EVTYPE,ignore.case = TRUE),]  ##remove rows with "summary"
dfSmall[grepl("TSTM | THUNDERSTORM",dfSmall$EVTYPE),]$EVTYPE<-"THUNDERSTORM WIND"

The next step is actually mapping the event types to “official” event types.Note I took the list of “offical” event types and copied them into the code as a vector. Thus the list (storme) is created programatically so as to be exactly reproducible. I then use amatch to create a vector that allows one to select the appropriate store event based on it’s match to EVTYPE (event type.) This new column is added to the data frame and will be the key to the analysis. (Note that the original EVTYPE is stilll there for comparavie purposes.) Finally we remove NA’s. In the end we still have more than 600k observations.

storme<-c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow",
          "Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill",
          "Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow",
          "High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood",
          "Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current",
          "Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression",
          "Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")
storme<-toupper(storme)

##I went with 3 because it was the most restrictive that gave me 46 out of the 48 items
dist<-amatch(dfSmall$EVTYPE,storme,maxDist=3)
newev<-storme[dist]
dfSmall$NewEvent<-newev
dfSmall<-dfSmall[complete.cases(dfSmall),]

The final preparatory step is to cacluate total costs and casualty figures. We use PROPDMGEXP and CROPDMGEXP to apply the appropriate multiplier to the data in the damage columns. We do this by spliting the data based on the exponent value (dropping factors with no items..it speeds up calculations.) This will leave us with dfSmallNew having 642077 observations to complete our analysis.

##now calculating the costs

dfSmall$cor <- dfSmall$PROPDMG
dfSmall$cor2<-dfSmall$CROPDMG
dfSmallA<-split(dfSmall,dfSmall$PROPDMGEXP, drop = TRUE)
##if df$PROPDMGEXP in 0:9 then mult by 10
for(i in 1:length(dfSmallA) ) {
  if(nrow(dfSmallA[[i]]) > 0) {
    corval <- as.character(dfSmallA[[i]][1,]$PROPDMGEXP)
    if(corval %in% c("0","1","2","3","4","5","6","7","8","9")) {
      corint<-as.integer(corval)
     dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*(10^corint);
    } else if(corval %in% c("h","H")) {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*100
    }else if(corval == "K") {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000
    }
    else if(corval %in% c("m","M")) {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000000
    }
    else if(corval == "B") {
      dfSmallA[[i]]$cor <- dfSmallA[[i]]$cor*1000000000
    }
  }
}
dfSmallNew<-rbindlist(dfSmallA)



dfSmallB<-split(dfSmallNew,dfSmall$CROPDMGEX, drop = TRUE)
##if df$PROPDMGEXP in 0:9 then mult by 10
for(i in 1:length(dfSmallB) ) {
  if(nrow(dfSmallB[[i]]) > 0) {
    corval <- as.character(dfSmallB[[i]][1,]$CROPDMGEXP)
    if(corval %in% c("0","1","2","3","4","5","6","7","8","9")) {
      corint<-as.integer(corval)
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*(10^corint)
    } else if(corval %in% c("h","H")) {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*100
    }else if(corval == "K") {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000
    }
    else if(corval %in% c("m","M")) {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000000
    }
    else if(corval == "B") {
      dfSmallB[[i]]$cor2 <- dfSmallB[[i]]$cor2*1000000000
    }
  }
}
dfSmallNew<-rbindlist(dfSmallB)

dfSmallNew$TotalCost<-dfSmallNew$cor + dfSmallNew$cor2
dfSmallNew$TotalCasualties<- dfSmallNew$FATALITIES + dfSmallNew$INJURIES

Aggriate Data

In this section I will present the code I used to aggregate the data which will lead to my results. We created two separate aggregate data frames, one for cost, the other for casulaties. Since it would be arbitrary to weigh deaths vs injury (and there is no indication of seriousness of injury) I summed fatalities and injuries to form one total casualty number (as shown in the previous code section,) Note that I created the log of the mean cost as the variance of costs were so large as to make it difficult to plot. By taking the log of the means it brings the results into a closer range of values while providing a consistant answer to the question of what events are most costly.
I also had to replace -Inf (log of 0) with 0 to prevent a plotting error. NOTE: we only ended up with 46 out of the 48 possible types. The two that didn’t register were debris flow and freezing fog…not major events. I also wanted to see the total number of observations making up each aggrigate value so I included that as well.

I also set options so that we didn’t have numbers coming back in scientific notation.

options(scipen=100, digits=2)

##use ddply to calcluate the aggregate for cost
aa<-ddply(dfSmallNew,~NewEvent,summarise,median=median(TotalCost),mean = mean(TotalCost),sd=sd(TotalCost),sum=sum(TotalCost),max=max(TotalCost))

##calculate count ... this is the same for both data frams
cc<-count(dfSmallNew$NewEvent)
results<-cbind(aa,cc$freq)
results$LogMean<-log(results$mean)

## remove infinate value
results[results$LogMean == -Inf,]$LogMean = 0

##use ddply to calcuate the aggregate for casualties
aa1<-ddply(dfSmallNew,~NewEvent,summarise,median=median(TotalCasualties),mean = mean(TotalCasualties),sd=sd(TotalCasualties),sum=sum(TotalCasualties),
           max=max(TotalCasualties))
results2<-cbind(aa1,cc$freq)

##decided to calculate fatalites separately
aa2<-ddply(dfSmallNew,~NewEvent,summarise,median=median(FATALITIES),mean = mean(TotalCasualties),sd=sd(FATALITIES),sum=sum(FATALITIES),
           max=max(FATALITIES))
results3<-cbind(aa2,cc$freq)

Results

The first table is the aggregate data based on cost

#initalizing capioners
figs <- captioner(prefix="Figure")
tbls <- captioner(prefix="Table")
#first change the names
names(results)<-c("Event","Median Cost","Mean Cost","Std","Total Cost", "Max Cost", "Sample Size","Log of Mean")
writeLines("td, th { padding : 6px } th { background-color : black ; color : white; border : 1px solid white; } td { color : black ; border : 1px solid black }", con = "mystyle.css")
##resultsOrder<-results[order(results$`Mean Cost`),]
knitr::kable(results,format="html")

Event	Median Cost	Mean Cost	Std	Total Cost	Max Cost	Sample Size	Log of Mean
ASTRONOMICAL LOW TIDE	0	1839.1	17637	320000	200000	174	7.5
AVALANCHE	0	9819.6	82252	3711800	1100000	378	9.2
BLIZZARD	0	199642.6	3083425	525659017	102000000	2633	12.2
COASTAL FLOOD	0	477432.2	3849456	355209560	75000000	744	13.1
COLD/WIND CHILL	0	3693.1	37063	1990600	600000	539	8.2
DENSE FOG	0	6134.9	57373	7319000	1450000	1193	8.7
DENSE SMOKE	0	10000.0	31623	100000	100000	10	9.2
DROUGHT	0	429977.1	13325098	1046134294	645150000	2433	13.0
DUST DEVIL	300	4844.0	11766	663630	75000	137	8.5
DUST STORM	0	13130.9	51106	5475602	640002	417	9.5
EXCESSIVE HEAT	0	4664.4	110152	7724194	3300000	1656	8.4
EXTREME COLD/WIND CHILL	0	8630.8	163710	8648050	5000000	1002	9.1
FLASH FLOOD	0	298472.9	6180060	15222414977	1000000000	51001	12.6
FLOOD	0	5788447.9	729776449	143958699876	115000000033	24870	15.6
FREEZING FOG	0	47434.8	294586	2182000	2000000	46	10.8
FROST/FREEZE	0	7808.7	117272	10487134	3020005	1343	9.0
FUNNEL CLOUD	0	22.1	828	134100	50000	6068	3.1
HAIL	0	70266.4	5291859	14595942009	1800000000	207723	11.2
HEAT	0	2123.2	56059	1520177	1500000	716	7.7
HEAVY RAIN	0	50730.8	1104674	584875418	70000002	11529	10.8
HEAVY SNOW	0	45153.7	887175	635809132	56500002	14081	10.7
HIGH SURF	0	114034.6	1403751	83929500	30000000	736	11.6
HIGH WIND	0	263605.1	12523804	5248377628	1300000000	19910	12.5
HURRICANE (TYPHOON)	6765001	787566418.2	2445917029	69305844798	16930000000	88	20.5
ICE STORM	0	1938397.6	16316807	3642249035	360000000	1879	14.5
LAKE-EFFECT SNOW	0	61253.1	796534	40182000	20000000	656	11.0
LAKESHORE FLOOD	0	327826.1	1103276	7540000	4500000	23	12.7
LIGHTNING	5000	56276.8	302132	743078983	15000000	13204	10.9
MARINE HAIL	0	9.1	134	4000	2000	442	2.2
MARINE HIGH WIND	0	9607.5	86508	1297010	1000000	135	9.2
MARINE STRONG WIND	10	8715.2	43458	418330	300000	48	9.1
RIP CURRENT	0	222.1	3459	163000	80000	734	5.4
SEICHE	0	42608.7	155058	980000	750000	23	10.7
SLEET	0	0.0	0	0	0	58	0.0
STORM SURGE/TIDE	0	31359384.1	331065125	4641188850	4000000000	148	17.3
STRONG WIND	3000	47123.5	1179123	176995857	70000048	3756	10.8
THUNDERSTORM WIND	0	35485.9	2374042	7919633460	750000000	223177	10.5
TORNADO	1000	1063187.2	24055725	24617035839	2800000000	23154	13.9
TROPICAL DEPRESSION	6500	28950.0	129715	1737000	1000000	60	10.3
TROPICAL STORM	5000	11205983.6	198820310	7642480814	5150000000	682	16.2
TSUNAMI	115000	7203101.0	18617710	144062020	81000020	20	15.8
VOLCANIC ASH	0	21739.1	73587	500000	300000	23	10.0
WATERSPOUT	0	1689.3	85960	5730200	5000000	3392	7.4
WILDFIRE	0	1741827.0	27714269	4758671364	1040000007	2732	14.4
WINTER STORM	0	135437.4	7112176	1532745214	750000000	11317	11.8
WINTER WEATHER	0	2986.4	53603	20866015	3000000	6987	8.0

tbls(name="costtab","Cost by event type")

[1] “Table 1: Cost by event type”

The second table is the aggregate data based on casualties

#first change the names
names(results2)<-c("Event","Median Casulaties","Mean Casualties","Std","Total Casualties", "Max Casualties", "Sample Size")
names(results3)<-c("Event","Median Fatalities","Mean Fatalities","Std","Total Fatalities", "Max Fatalities", "Sample Size")
writeLines("td {padding : 7px}, th { padding : 7px } th { background-color : black ; color : white; border : 1px solid white; } td { color : black ; border : 1px solid black }", con = "mystyle2.css")
knitr::kable(results2,format="html")

Event	Median Casulaties	Mean Casualties	Std	Total Casualties	Max Casualties	Sample Size
ASTRONOMICAL LOW TIDE	0	0.00	0.00	0	0	174
AVALANCHE	1	1.00	1.16	379	8	378
BLIZZARD	0	0.17	3.34	455	152	2633
COASTAL FLOOD	0	0.01	0.13	8	2	744
COLD/WIND CHILL	0	0.20	0.67	107	10	539
DENSE FOG	0	0.13	1.28	152	24	1193
DENSE SMOKE	0	0.00	0.00	0	0	10
DROUGHT	0	0.00	0.08	4	4	2433
DUST DEVIL	0	0.30	1.42	41	14	137
DUST STORM	0	0.93	4.09	387	40	417
EXCESSIVE HEAT	0	4.94	27.47	8188	521	1656
EXTREME COLD/WIND CHILL	0	0.15	0.71	149	12	1002
FLASH FLOOD	0	0.05	1.34	2561	159	51001
FLOOD	0	0.32	11.18	7978	802	24870
FREEZING FOG	0	0.00	0.00	0	0	46
FROST/FREEZE	0	0.00	0.00	0	0	1343
FUNNEL CLOUD	0	0.00	0.01	1	1	6068
HAIL	0	0.00	0.32	720	88	207723
HEAT	0	2.04	16.22	1459	232	716
HEAVY RAIN	0	0.03	0.54	324	32	11529
HEAVY SNOW	0	0.06	1.18	852	100	14081
HIGH SURF	0	0.35	2.28	255	55	736
HIGH WIND	0	0.07	0.82	1318	70	19910
HURRICANE (TYPHOON)	0	15.22	90.66	1339	787	88
ICE STORM	0	0.21	2.04	400	53	1879
LAKE-EFFECT SNOW	0	0.00	0.00	0	0	656
LAKESHORE FLOOD	0	0.00	0.00	0	0	23
LIGHTNING	0	0.36	1.27	4792	51	13204
MARINE HAIL	0	0.00	0.00	0	0	442
MARINE HIGH WIND	0	0.01	0.17	2	2	135
MARINE STRONG WIND	0	0.75	1.52	36	8	48
RIP CURRENT	1	1.42	2.29	1045	35	734
SEICHE	0	0.00	0.00	0	0	23
SLEET	0	0.00	0.00	0	0	58
STORM SURGE/TIDE	0	0.11	0.99	16	11	148
STRONG WIND	0	0.11	0.57	409	13	3756
THUNDERSTORM WIND	0	0.02	0.49	5559	70	223177
TORNADO	0	0.96	13.57	22178	1308	23154
TROPICAL DEPRESSION	0	0.00	0.00	0	0	60
TROPICAL STORM	0	0.58	8.24	395	201	682
TSUNAMI	0	8.10	35.99	162	161	20
VOLCANIC ASH	0	0.00	0.00	0	0	23
WATERSPOUT	0	0.00	0.04	4	2	3392
WILDFIRE	0	0.36	3.02	986	104	2732
WINTER STORM	0	0.13	2.26	1483	170	11317
WINTER WEATHER	0	0.05	1.89	376	138	6987

tbls(name="castab","Casualty by event type")

[1] “Table 2: Casualty by event type”

costPlot=ggplot(data=results, aes(x=Event,y=`Log of Mean`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8, angle = 70, hjust = 1))
casualtyPlot=ggplot(data=results2, aes(x=Event,y=`Mean Casualties`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8,angle = 70, hjust = 1))
##grid.arrange(costPlot,casualtyPlot,ncol=1)
print(costPlot)

figs(name="cost","(Log)Mean Cost by event type")

## [1] "Figure  1: (Log)Mean Cost by event type"

As you can see hurricanes have the largest cost per incident followed by storm surges and tropical storms.

print(casualtyPlot)

figs(name="cas","Mean Casualties by event type")

## [1] "Figure  2: Mean Casualties by event type"

As you can see here that hurricanes cause the most casualties per incident, followed by tsuamis and excessive heat. The toll from hurricanes were a little surprising as there is often sufficient warning but oftem people don’t head them. Excessive heat is not surprising as it poses a real danger especially to elderly.

Looking at these findings we decided to look at fatalites separately

fatalityPlot=ggplot(data=results3, aes(x=Event,y=`Mean Fatalities`)) + geom_bar(stat="identity") + xlab("Weather Event") + theme(axis.text.x = element_text(size=8,angle = 70, hjust = 1))
print(fatalityPlot)

figs(name="fatal","Mean Fatalities by event type")

## [1] "Figure  3: Mean Fatalities by event type"

Looking at the chart of fatalities it appears to lead to the same conclusions as total casualties. Hurricanes, tsuamis and excessive heat are the most danagerous to humans.

Conclusions

Based on our analysis, hurricanes are both the most dangerous and the most costly per incident. Storm surges and tropical storms are also very costly. Tsunamis and excessive heat follow in terms of danger to people but it should also be pointed out that tsnamis also rank high in terms of cost per incident. In terms of public policy, evacuations prior to hurricanes are critical as well as providing opportunities for people who don’t have air conditioning to cool off (ex. “cooling centers”) during heat waves.