Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

The report is intended for a government or municipal manager who might be responsible for preparing for severe weather events and needs to prioritize resources for different types of events. However, no specific recommendations are needed in this report.

Data Processing

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
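
One way to check how sparse the early records are is to count events per year. The snippet below is a minimal sketch on a small made-up sample: the `sample_dates` vector is hypothetical and only mimics the month/day/year layout of the `BGN_DATE` column, so the counts it prints say nothing about the real data.

```r
# Hypothetical sample in the same text format as BGN_DATE
sample_dates <- c('4/18/1950 0:00:00', '2/20/1951 0:00:00',
                  '6/8/1951 0:00:00', '11/15/1951 0:00:00')

# as.Date() parses the leading date part; the trailing time is ignored
event_year <- format(as.Date(sample_dates, format = '%m/%d/%Y'), '%Y')

# number of recorded events per year
events_per_year <- table(event_year)
print(events_per_year)
```

Applied to the full `BGN_DATE` column, the same two steps would give a per-year count that could be plotted to see where coverage starts to level off.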

The data loading process is described in R code below:

Load the Data
storm_data <- read.csv(bzfile('repdata_data_StormData.csv.bz2'))
str(storm_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels ""," ","  ","   ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(storm_data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
Variables of Interest

For this analysis, impacts are grouped by EVTYPE (event type), i.e. the name of the severe weather event.

Variables related to Health impacts are:
- FATALITIES : number of fatalities
- INJURIES : number of injuries

Variables related to Economic impacts are:
- PROPDMG : value of property damage
- PROPDMGEXP : magnitude code (e.g. K, M, B) that scales the above property damage value
- CROPDMG : value of crop damage
- CROPDMGEXP : magnitude code (e.g. K, M, B) that scales the above crop damage value

Refining Event Names

Some events are counted separately because of likely typos or inconsistencies in naming, such as:
- COASTAL STORM and COASTALSTORM
- FLASH FLOODS and FLASH FLOOD
- HEAT WAVES and HEAT WAVE
- etc

To handle these, a simple approach based on the Levenshtein (edit) distance is used to correct some of the inconsistencies. Event names whose distance is below a threshold of 3 (i.e. a distance of 1 or 2) are merged, which captures inconsistencies due to typos and plurals like those above.
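
As a quick illustration of why this threshold catches the pairs listed above, the sketch below computes their edit distances with `adist()` from base R's utils package. The `pairs` data frame is a hypothetical helper used only for this example.

```r
# pairs of event names taken from the examples above
pairs <- data.frame(
  a = c('COASTAL STORM', 'FLASH FLOODS', 'HEAT WAVES'),
  b = c('COASTALSTORM',  'FLASH FLOOD',  'HEAT WAVE'),
  stringsAsFactors = FALSE
)

# adist() returns a matrix, so take the single cell for each pair
pairs$distance <- mapply(function(x, y) adist(x, y)[1, 1],
                         pairs$a, pairs$b, USE.NAMES = FALSE)
print(pairs)
```

Each pair differs by a single inserted or deleted character, so all three fall below the threshold and would be merged.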

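lv <- function(events) {
  # pairwise Levenshtein distances between all event names
  distance_events <- adist(events$EVTYPE, events$EVTYPE)
  # index pairs of names that differ, but by fewer than 3 edits
  similar_events <- which((distance_events != 0) & (distance_events < 3), arr.ind = TRUE)

  # seq_len() keeps the loop from running when no similar pairs are found
  for (i in seq_len(nrow(similar_events))) {
    earlier <- min(similar_events[i, 'row'], similar_events[i, 'col'])
    later   <- max(similar_events[i, 'row'], similar_events[i, 'col'])

    #print(paste(paste(events[later,'EVTYPE'],' >> '),events[earlier,'EVTYPE']))

    # rename the later entry to match the earlier one
    events[later, 'EVTYPE'] <- events[earlier, 'EVTYPE']
  }

  # return the refined event names
  events[, 'EVTYPE']
}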

This refinement method can introduce unwanted changes, especially for four-letter words (as the threshold used is 3) and for sub-groups (e.g. TSTM WIND (G40) and TSTM WIND (G35) are merged under a single name). We need to keep this in mind and drill further if the final list of top events warrants it (i.e. if incorrect grouping is a possibility).
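
The sketch below makes these two pitfalls concrete. The RAIN/HAIL pair is chosen purely for illustration, and `is_similar` with its `max_ratio` parameter is a hypothetical length-aware rule that is not used anywhere in this report; it is shown only as one possible direction if the grouping needs tightening.

```r
# short, unrelated names can still fall under the threshold of 3
d_short <- adist('RAIN', 'HAIL')[1, 1]

# sub-groups that differ only in a numeric suffix also fall under it
d_gust <- adist('TSTM WIND (G40)', 'TSTM WIND (G35)')[1, 1]

# hypothetical alternative: merge only when the distance is small
# relative to the length of the longer name
is_similar <- function(a, b, max_ratio = 0.2) {
  d <- adist(a, b)[1, 1]
  d > 0 && d / max(nchar(a), nchar(b)) <= max_ratio
}

print(c(short = d_short, gust = d_gust))
print(is_similar('RAIN', 'HAIL'))
print(is_similar('FLASH FLOODS', 'FLASH FLOOD'))
```

Under this relative rule the RAIN/HAIL pair is no longer merged while the plural pair still is. The two TSTM WIND sub-groups would still be merged, so a ratio alone does not resolve the sub-grouping issue.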

Analysis

Health
#subsetting Health variables
storm_health <- storm_data[,c('EVTYPE', 'FATALITIES', 'INJURIES')]

#subsetting events with non-zero impacts
storm_health <- subset(storm_health, FATALITIES + INJURIES != 0)

#aggregate based on EVTYPE
storm_health <- aggregate(cbind(FATALITIES, INJURIES)~toupper(EVTYPE), data=storm_health, sum)
names(storm_health) <- c('EVTYPE','FATALITIES','INJURIES')

#refinement -- the above aggregate is done to make refinement faster
storm_health$EVTYPE <- lv(storm_health)

#aggregate based on EVTYPE -- the first aggregate was only to make refinement faster
storm_health <- aggregate(cbind(FATALITIES, INJURIES)~EVTYPE, data=storm_health, sum)

Top 10 Events resulting in Fatalities:

knitr::kable(head(storm_health[order(storm_health$FATALITIES, decreasing=T),c(1,2)], 10))
    EVTYPE         FATALITIES
147 TORNADO              5633
25  EXCESSIVE HEAT       1903
34  FLASH FLOOD           980
57  HEAT                  937
99  LIGHTNING             817
122 RIP CURRENT           572
153 TSTM WIND             504
38  FLOOD                 470
75  HIGH WIND             283
1   AVALANCE              225

Top 10 Events resulting in Injuries:

knitr::kable(head(storm_health[order(storm_health$INJURIES, decreasing=T),c(1,3)], 10))
    EVTYPE              INJURIES
147 TORNADO                91346
153 TSTM WIND               6957
38  FLOOD                   6789
25  EXCESSIVE HEAT          6525
99  LIGHTNING               5230
141 THUNDERSTORMS WINDS     2411
57  HEAT                    2100
96  ICE STORM               1975
34  FLASH FLOOD             1777
75  HIGH WIND               1439

Economic

The following helper function is used to calculate Total Damage to Properties and Crops:

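caldmg <- function(value, unit) {
  # note: when called through apply() on a data frame, value arrives as a
  # character string, so these comparisons are lexical rather than numeric
  if((value<0) | (value>9))
    return(0)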
  
  if('H' == toupper(unit)) {
    return(as.numeric(value)*(10^2)) # hundred
  }
  else if('K' == toupper(unit)) {
    return(as.numeric(value)*(10^3)) # thousand
  }
  else if('M' == toupper(unit)) {
    return(as.numeric(value)*(10^6)) # million
  }
  else if('B' == toupper(unit)) {
    return(as.numeric(value)*(10^9)) # billion
  }
  
  return(as.numeric(value))
}

Two helper variables are created to hold the total damage:
- PROPTOTDMG : total property damage, calculated from PROPDMG (value) and PROPDMGEXP (unit)
- CROPTOTDMG : total crop damage, calculated from CROPDMG (value) and CROPDMGEXP (unit)
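
To make the value and unit combination concrete, the following is a minimal vectorised sketch of the exponent mapping. The names `exp_multiplier` and `total_damage` are hypothetical and are not part of this analysis. The sketch covers only the H/K/M/B scaling, not the range check at the start of `caldmg`, so it is an illustration of the idea rather than a drop-in replacement.

```r
# multiplier for each recognised exponent code (same codes as caldmg)
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# hypothetical helper: scale a vector of values by their exponent codes
total_damage <- function(value, unit) {
  mult <- exp_multiplier[toupper(as.character(unit))]
  mult[is.na(mult)] <- 1   # unrecognised or empty codes: no scaling
  unname(value * mult)
}

example_damage <- total_damage(c(25, 2.5, 10, 3), c('K', 'm', '', 'B'))
print(example_damage)
```

For instance, a value of 25 with unit K corresponds to 25,000, and a value with an empty unit is left unscaled. Working on whole columns at once also avoids the row-by-row `apply()` call, which can be slow on a table of this size.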

#subsetting Economic variables
storm_econ <- storm_data[,c('EVTYPE', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')]

#calculate Total Damage
storm_econ$PROPTOTDMG <- apply(storm_econ, 1, function(x) caldmg(x['PROPDMG'], x['PROPDMGEXP']))
storm_econ$CROPTOTDMG <- apply(storm_econ, 1, function(x) caldmg(x['CROPDMG'], x['CROPDMGEXP']))

#subsetting events with non-zero impacts
storm_econ <- storm_econ[,c('EVTYPE', 'PROPTOTDMG', 'CROPTOTDMG')]
storm_econ <- subset(storm_econ, PROPTOTDMG + CROPTOTDMG != 0)

#aggregate based on EVTYPE
storm_econ <- aggregate(cbind(PROPTOTDMG, CROPTOTDMG)~toupper(EVTYPE), data=storm_econ, sum)
names(storm_econ) <- c('EVTYPE','PROPTOTDMG','CROPTOTDMG')

#refinement -- the above aggregate is done to make refinement faster
storm_econ$EVTYPE <- lv(storm_econ)

#aggregate based on EVTYPE -- the first aggregate was only to make refinement faster
storm_econ <- aggregate(cbind(PROPTOTDMG, CROPTOTDMG)~EVTYPE, data=storm_econ, sum)

Top 10 Events damaging Properties:

knitr::kable(head(storm_econ[order(storm_econ$PROPTOTDMG, decreasing=T),c(1,2)], 10))
    EVTYPE              PROPTOTDMG
57  FLOOD             144316876207
149 HURRICANE/TYPHOON  69116410000
247 TORNADO            56469280029
226 STORM SURGE        43304851000
47  FLASH FLOOD        15910819467
85  HAIL               14598761393
141 HURRICANE          11677929010
251 TROPICAL STORM      7691812550
288 WINTER STORM        6674132351
281 WILDFIRE            5484989600

Top 10 Events damaging Crops:

knitr::kable(head(storm_econ[order(storm_econ$CROPTOTDMG, decreasing=T),c(1,3)], 10))
    EVTYPE             CROPTOTDMG
30  DROUGHT           13962176000
57  FLOOD              5628178450
196 RIVER FLOOD        5029450000
157 ICE STORM          5022113500
85  HAIL               3012529473
141 HURRICANE          2732910000
149 HURRICANE/TYPHOON  2424672800
47  FLASH FLOOD        1319819100
43  EXTREME COLD       1312973000
74  FROST/FREEZE       1085186000

Results

Combining the impacts on health (i.e. both fatalities and injuries), Tornado is by far the most harmful weather event. For economic impacts, combining damage to property and crops, Flood is the most damaging (although Drought causes the most damage to crops, with Flood coming second).

The above combined impacts are summarized in the two plots below:

library(reshape2)
library(ggplot2)


# HEALTH
##########################################
#extract Top 10,
top_storm_health <- head(storm_health[with(storm_health,order(FATALITIES+INJURIES, decreasing=T)),], 10)

#re-order for ggplot2
top_storm_health$EVTYPE <- factor(top_storm_health$EVTYPE, levels=top_storm_health[,'EVTYPE'])

#plot!
ggplot(data=melt(top_storm_health, id.var='EVTYPE'), aes(x=EVTYPE, y=value, fill=variable)) + 
  geom_bar(stat='identity') +
  xlab('Weather Events') +  ylab('Total Casualties') +
  theme(axis.text.x=element_text(angle=50)) +
  scale_y_continuous(breaks=seq(0,100000,5000)) +
  scale_fill_discrete(name='Casualty Type', labels=c('Fatalities', 'Injuries')) +
  ggtitle('Health Impacts from Severe Weather')

# ECONOMIC
##########################################
#extract Top 10,
top_storm_econ <- head(storm_econ[with(storm_econ,order(PROPTOTDMG+CROPTOTDMG, decreasing=T)),], 10)

#re-order for ggplot2
top_storm_econ$EVTYPE <- factor(top_storm_econ$EVTYPE, levels=top_storm_econ[,'EVTYPE'])

#plot!
ggplot(data=melt(top_storm_econ, id.var='EVTYPE'), aes(x=EVTYPE, y=(value/(10^9)), fill=variable)) + 
  geom_bar(stat='identity') +
  xlab('Weather Events') +  ylab('Total Damage (in Billions of USD)') +
  theme(axis.text.x=element_text(angle=50)) +
  scale_y_continuous(breaks=seq(0,200,10)) +
  scale_fill_discrete(name='Damage Type', labels=c('Properties', 'Crops')) +
  ggtitle('Economic Impacts from Severe Weather')

References

  1. U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database
  2. National Weather Service Storm Data Documentation
  3. National Climatic Data Center Storm Events FAQ
  4. National Climatic Data Center Storm Events Database