library(ggplot2)
opts_chunk$set(fig.width=20, autodep=T)

US Weather impact on health and damage

Synopsis

Using the NOAA Storm database, exploration will be done in order to understand the economic and health impacts within the United States. Specifically the types of events that are most harmful with respect to the populations health and the events which have the greatest economic consequences. This data is aimed at the organization responsible for preparing for severe weather events and prioritizing resources in such an event.

For each impact topic, economic and health, we discover the top events - defaulted to 20 as our recommendation, but changeable by top_event_amount variable - by their accumulated impact. For better understanding of impact spread, we create a comparison of those events with box plots only using the occurrences where impact was found i.e omitting zero values. display the most relevant. Finally we create a summary table with stats(occurrences, max, average, frequency, impact probability) to better understand the impact spread observed in the boxplots.

Data Processing

Data is taken directly from the bzip CSV format

storm_data_file <- 'repdata-data-StormData.csv.bz2'
if(!file.exists(storm_data_file)) stop(sprintf('%s is required to be in the working director(%s)', storm_data_file, getwd()))

storm_data <- read.csv(storm_data_file)[,c("EVTYPE", "PROPDMG", "CROPDMG", "INJURIES", "FATALITIES")]

We make little attempt to clean the data by removing the summary data and unknown events. We combine crop and property damage as damage to represent damage impact, we combine fatalities and injuries as casualties to represent health impact. A summary grouped by event is made from the storm data containing sum, max and average impact. Finally we do a probabilty calculus of belief for impact given an event occurs. A limited set of results are calculated and displayed controlled by top_events_amount which is ordered by the overall accumulated impact.

#Attemp for some data cleaning

#Remove summary data as this does not contribute to a particular disaster, introducing no new information
storm_data <- storm_data[-grep(('\\?|summary'),storm_data$EVTYPE, ignore.case=T),]
storm_data$EVTYPE <- factor(storm_data$EVTYPE) #synchronize levels

top_event_amount <- 20 #For ease of adjusting the cut off of 'top' events

#Add boolean impact occurences
storm_data$dmg_occur <- storm_data$PROPDMG+storm_data$CROPDMG > 0
storm_data$cas_occur <- storm_data$INJURIES+storm_data$FATALITIES > 0

#Get the cost sum, mean and max of damage and health impact for events
evt_summary <- as.data.frame(as.list(aggregate(cbind(dmg=PROPDMG+CROPDMG,cas=INJURIES + FATALITIES) ~ EVTYPE, storm_data, 
                                               FUN=function(x) c(sum=sum(x), max=max(x)))))

#calculate occurrences
evt_summary$occur <- aggregate(PROPDMG~EVTYPE, storm_data, FUN=length)[,2]

#calculate occurrences
evt_summary$freq <- evt_summary$occur/nrow(storm_data)

#calculate mean of impact only when impact occurs - else average is too misleading
evt_summary$cas.mean <- evt_summary$cas.sum/aggregate(cas_occur ~ EVTYPE, storm_data, FUN=sum)[,2]
evt_summary$dmg.mean <- evt_summary$dmg.sum/aggregate(dmg_occur ~ EVTYPE, storm_data, FUN=sum)[,2]

#Filter out top events (by accumulated damage/casualties)
top_evt_idx <- data.frame(dmg=head(order(evt_summary$dmg.sum, decreasing=T), top_event_amount),
                     cas=head(order(evt_summary$cas.sum, decreasing=T), top_event_amount))

#create impact occurrence contingency table
imp_occur_dmg <- table(storm_data[,c('EVTYPE','dmg_occur')]) #columns: No Impact then Impact
imp_occur_cas <- table(storm_data[,c('EVTYPE','cas_occur')]) #columns: No Impact then Impact

#Get the damage of event per occurence for top damaging events
top_evt_dmg <- storm_data[storm_data$EVTYPE %in% evt_summary[top_evt_idx$dmg,'EVTYPE'], #rows
                              c('EVTYPE', 'PROPDMG', 'CROPDMG')] #columns

top_evt_cas <- storm_data[storm_data$EVTYPE %in% evt_summary[top_evt_idx$cas,'EVTYPE'], #rows
                              c('EVTYPE', 'INJURIES', 'FATALITIES')] #columns


#Calculus of belief for impact occuring
cas_nume <- (imp_occur_cas[top_evt_idx$cas,2]/margin.table(imp_occur_cas,1)[top_evt_idx$cas])*margin.table(imp_occur_cas,2)[2]/margin.table(imp_occur_cas)
cas_denom <- cas_nume+(imp_occur_cas[top_evt_idx$cas,1]/margin.table(imp_occur_cas,1)[top_evt_idx$cas])*margin.table(imp_occur_cas,2)[1]/margin.table(imp_occur_cas)

dmg_nume <- (imp_occur_dmg[top_evt_idx$dmg,2]/margin.table(imp_occur_dmg,1)[top_evt_idx$dmg])*margin.table(imp_occur_dmg,2)[2]/margin.table(imp_occur_dmg)
dmg_denom <- dmg_nume+(imp_occur_dmg[top_evt_idx$dmg,1]/margin.table(imp_occur_dmg,1)[top_evt_idx$dmg])*margin.table(imp_occur_dmg,2)[1]/margin.table(imp_occur_dmg)

bayes <- cbind(cas=cas_nume/cas_denom, dmg=dmg_nume/dmg_denom)

Results

Figure 1 - Top Events by health impact (with zeros omitted to mitigate skewness)

qplot(EVTYPE,  #substitute space with new line for prettiness
      log10(INJURIES+FATALITIES+1), #Use log as data is still heavily skewed
      data=top_evt_cas[top_evt_cas$INJURIES+top_evt_cas$FATALITIES > 0,], 
      geom='boxplot') + xlab('Event') + ylab('Log(Casualties)') + ggtitle('Casualty spread') + theme(axis.text.x=element_text(angle = 90, hjust = 1))

plot of chunk Top economic damage events box plots

In figure 1 above, we are able to understand how the spread of casualties per event lies compared with each of the top events. Note that we have removed the event occurrences that did not have any casualty impact. We use the log transformation as data is still heavily skewed after omitting zeros. There is a rare tendency for high health impact given by the large amounts of outliers and the marginal difference between the lower quartile and median.

Table 1 - Top Events by health impact

cas_col_names <- c('EVTYPE', 'occur', 'freq', grep('cas', names(evt_summary), value=T))
cas_cols <- cbind(evt_summary[top_evt_idx$cas, c(cas_col_names)], bayes[,1])
kable(
  setNames(cas_cols, c('Event', 'Total Occurences', 'Frequency', 'Total Casualties', 'Most casualties', 'Average casualties during impact', 'Impact Probability')), 
  row.names=F)
Event Total Occurences Frequency Total Casualties Most casualties Average casualties during impact Impact Probability
TORNADO 60652 0.0672 96979 1742 12.232 0.0037
EXCESSIVE HEAT 1678 0.0019 8428 521 12.431 0.0166
TSTM WIND 219940 0.2438 7461 60 2.546 0.0003
FLOOD 25326 0.0281 7259 802 17.705 0.0004
LIGHTNING 15754 0.0175 6046 51 1.829 0.0066
HEAT 767 0.0009 3037 583 14.531 0.0092
FLASH FLOOD 54277 0.0602 2755 159 2.959 0.0004
ICE STORM 2006 0.0022 2064 1569 21.726 0.0012
THUNDERSTORM WIND 82563 0.0915 1621 70 2.377 0.0002
WINTER STORM 11433 0.0127 1527 170 6.727 0.0005
HIGH WIND 20212 0.0224 1385 70 2.638 0.0007
HAIL 288661 0.3199 1376 109 4.778 0.0000
HURRICANE/TYPHOON 88 0.0001 1339 787 51.500 0.0103
HEAVY SNOW 15708 0.0174 1148 185 6.074 0.0003
WILDFIRE 2761 0.0031 986 104 5.031 0.0019
THUNDERSTORM WINDS 20843 0.0231 972 50 2.708 0.0004
BLIZZARD 2719 0.0030 906 390 10.535 0.0008
FOG 538 0.0006 796 80 8.652 0.0051
RIP CURRENT 470 0.0005 600 11 1.504 0.1228
WILD/FOREST FIRE 1457 0.0016 557 52 4.126 0.0025

In table 1 above, we show a summary of the top events ordered by accumulated casualties(fatalities and injuries). The average impact is a calculation only for events that had an impact. The Impact probability shows how likely an event will have a casualty impact given the occurrence of such an event. This is to better our understanding as to how likely such an event will cause casualties. The frequency is the occurrences divided by the total events, zeros inclusive, within the raw data.

Figure 2 - Top Events relative damage spread (with zeros omitted to mitigate skewness)

qplot(EVTYPE,
      log10(PROPDMG+CROPDMG+1), #Use log as data is still heavily skewed
      data=top_evt_dmg[top_evt_dmg$CROPDMG+top_evt_dmg$PROPDMG > 0,], 
      geom='boxplot',
      labs(title='Damage')) + xlab('Event') + ylab('Log(Damage)') + ggtitle('Relative damage spread') + theme(axis.text.x=element_text(angle = 90, hjust = 1))

plot of chunk Top health impact events box plots

In figure 2 above, we are able to understand how the spread of damage per event lies compared with each of the top events. Note that we have removed the event occurrences that did not have any damage impact. We use the log transformation as data is still heavily skewed after omitting zeros. It seems that most weather has a regular damage pattern showing a good degree of consistency, however hail, TSTM winds, thunderstorms and strong winds on odd occasions have extremely high damage from the norm. Finally, Flash flooding shows a consistly high damage given an event of impact indicated by the marginal gap between the upper quartile and the median.

Table 2 - Top Events by damage impact

dmg_col_names <- c('EVTYPE', 'occur', 'freq', grep('dmg', names(evt_summary), value=T))
dmg_cols <- cbind(evt_summary[top_evt_idx$dmg, c(dmg_col_names)], bayes[,2])

kable(
  setNames(dmg_cols, c('Event', 'Total Occurences', 'Frequency', 'Total cost(USD)', 'Highest cost(USD)', 'Average cost(USD) during impact', 'Impact Probability')), 
  row.names=F)
Event Total Occurences Frequency Total cost(USD) Highest cost(USD) Average cost(USD) during impact Impact Probability
TORNADO 60652 0.0672 3312277 4410 84.15 0.4080
FLASH FLOOD 54277 0.0602 1599325 5000 77.42 0.1864
TSTM WIND 219940 0.2438 1445168 1213 23.51 0.1264
HAIL 288661 0.3199 1268290 1300 48.84 0.0355
FLOOD 25326 0.0281 1067976 3000 106.18 0.1972
THUNDERSTORM WIND 82563 0.0915 943636 5000 21.71 0.2930
LIGHTNING 15754 0.0175 606932 1000 58.58 0.4173
THUNDERSTORM WINDS 20843 0.0231 464978 1000 38.73 0.3362
HIGH WIND 20212 0.0224 342015 3000 63.90 0.1184
WINTER STORM 11433 0.0127 134700 1100 96.98 0.0490
HEAVY SNOW 15708 0.0174 124418 950 100.01 0.0311
WILDFIRE 2761 0.0031 88824 1000 114.17 0.1276
ICE STORM 2006 0.0022 67690 1000 101.48 0.1566
STRONG WIND 3566 0.0040 64611 1000 19.70 0.8099
HEAVY RAIN 11723 0.0130 61965 1050 60.04 0.0347
HIGH WINDS 1533 0.0017 57385 1000 93.92 0.1981
TROPICAL STORM 690 0.0008 54323 1100 133.47 0.3490
WILD/FOREST FIRE 1457 0.0016 43534 1000 134.78 0.0960
DROUGHT 2488 0.0028 37998 1000 142.85 0.0427
FLASH FLOODING 682 0.0008 33623 1000 114.36 0.2203

In table 2 above, we show a summary of the top events ordered by accumulated damage(property and crops). The average impact is a calculation only for events that had an impact.The Impact probability shows how likely an event will have a damage impact given the occurrence of such an event. This is to better our understanding as to how likely such an event will cause damage. The frequency is the occurrences divided by the total events within the raw data.