Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
This report is intended to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, no specific recommendations are needed in this report.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The data loading process is described in R code below:
storm_data <- read.csv(bzfile('repdata_data_StormData.csv.bz2'))
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels ""," "," "," ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(storm_data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
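The earlier remark that the first years of the database contain fewer records can be checked by counting events per year. The following is a minimal sketch, not part of the original analysis: `event_year` is a hypothetical helper, and the sample strings are copied from the BGN_DATE column of the head() output above.

```r
# Hypothetical helper: extract the calendar year from BGN_DATE-style strings
event_year <- function(bgn_date) {
  # as.Date() parses the leading month/day/year and ignores the trailing time
  dates <- as.Date(as.character(bgn_date), format = "%m/%d/%Y")
  format(dates, "%Y")
}

# Sample values copied from the first six rows shown above
sample_dates <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:00",
                  "6/8/1951 0:00:00", "11/15/1951 0:00:00", "11/15/1951 0:00:00")

years <- event_year(sample_dates)
events_per_year <- table(years)
events_per_year
```

Applied to the full data, the same helper could be called as `table(event_year(storm_data$BGN_DATE))`, assuming every BGN_DATE value follows the month/day/year layout seen in the sample.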
For this analysis, impacts are grouped by EVTYPE (Event Type), i.e. the name of the severe weather event.
Variables related to Health impacts are:
- FATALITIES : number of fatalities
- INJURIES : number of injuries
Variables related to Economic impacts are:
- PROPDMG : value of property damage
- PROPDMGEXP : unit of the property damage value above
- CROPDMG : value of crop damage
- CROPDMGEXP : unit of the crop damage value above
The units (PROPDMGEXP, CROPDMGEXP) are expressed as letter codes; the helper function used later in this analysis interprets H as hundred, K as thousand, M as million, and B as billion.
There are duplicates of events that are counted separately, likely due to typos or inconsistencies in naming, such as:
- COASTAL STORM and COASTALSTORM
- FLASH FLOODS and FLASH FLOOD
- HEAT WAVES and HEAT WAVE
- etc
To handle the above, a simple correction based on Levenshtein distance is implemented to fix some of these inconsistencies. A threshold of 3 is used (names whose edit distance is below 3 are merged) to capture inconsistencies due to typos and plurals as above.
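lv <- function(events) {
  # pairwise Levenshtein distance between all event names
  distance_events <- adist(events$EVTYPE, events$EVTYPE)
  # index pairs of distinct names that differ by fewer than 3 edits
  similar_events <- which((distance_events != 0) & (distance_events < 3), arr.ind = TRUE)
  # seq_len() avoids iterating over 1:0 when no similar pairs are found
  for (i in seq_len(nrow(similar_events))) {
    earlier <- min(similar_events[i, 'row'], similar_events[i, 'col'])
    later <- max(similar_events[i, 'row'], similar_events[i, 'col'])
    #print(paste(paste(events[later,'EVTYPE'],' >> '),events[earlier,'EVTYPE']))
    # rename the later entry to the name found at the earlier position
    events[later, 'EVTYPE'] <- events[earlier, 'EVTYPE']
  }
  events[, 'EVTYPE']
}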
There could be unwanted changes from the above refinement method, especially for 4-letter words (as the threshold used is 3) and sub-groupings (e.g. TSTM WIND (G40) and TSTM WIND (G35) are grouped together in the latter case). We need to keep this in mind and drill further if the final result of top events warrants it (i.e. if there is a possibility of incorrect grouping).
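To make the threshold concrete, the sketch below computes the edit distance for the name pairs mentioned above using `adist()` from the utils package. The RAIN / HAIL pair is an illustrative assumption added here to show how two unrelated 4-letter names can fall under the threshold; it is not taken from the results of this analysis.

```r
# Pairs of event names; the last pair is a hypothetical illustration
names_a <- c("COASTAL STORM", "FLASH FLOODS", "HEAT WAVES", "TSTM WIND (G40)", "RAIN")
names_b <- c("COASTALSTORM", "FLASH FLOOD", "HEAT WAVE", "TSTM WIND (G35)", "HAIL")

# adist() returns the full distance matrix; the diagonal holds the
# distance between names_a[i] and names_b[i]
d <- diag(adist(names_a, names_b))

# A pair is merged when its distance is non-zero and below 3
comparison <- data.frame(names_a, names_b, distance = d,
                         merged = (d != 0) & (d < 3))
comparison
```

All five pairs fall below the threshold, including the two that probably should not be merged, which is why the top results need a sanity check.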
#subsetting Health variables
storm_health <- storm_data[,c('EVTYPE', 'FATALITIES', 'INJURIES')]
#subsetting events with non-zero impacts
storm_health <- subset(storm_health, FATALITIES + INJURIES != 0)
#aggregate based on EVTYPE
storm_health <- aggregate(cbind(FATALITIES, INJURIES)~toupper(EVTYPE), data=storm_health, sum)
names(storm_health) <- c('EVTYPE','FATALITIES','INJURIES')
#refinement -- the above aggregate is done to make refinement faster
storm_health$EVTYPE <- lv(storm_health)
#aggregate based on EVTYPE -- the first aggregate was only to make refinement faster
storm_health <- aggregate(cbind(FATALITIES, INJURIES)~EVTYPE, data=storm_health, sum)
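The subset-then-aggregate steps above can be tried on a small data frame first. This is a minimal sketch with made-up numbers; `toy` and `totals` are hypothetical names, and only the pattern (filter zero-impact rows, then sum by upper-cased event name) mirrors the analysis.

```r
# Made-up records with inconsistent capitalisation of the event name
toy <- data.frame(
  EVTYPE     = c("Tornado", "TORNADO", "Flood", "FLOOD", "HAIL"),
  FATALITIES = c(1, 2, 0, 3, 0),
  INJURIES   = c(10, 5, 2, 0, 0)
)

# Drop rows with no health impact (the HAIL row here)
toy <- subset(toy, FATALITIES + INJURIES != 0)

# Sum both measures per upper-cased event name
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ toupper(EVTYPE), data = toy, sum)
names(totals) <- c("EVTYPE", "FATALITIES", "INJURIES")
totals
```

Upper-casing inside the formula merges "Tornado" with "TORNADO" before any distance-based refinement is needed.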
Top 10 Events resulting in Fatalities:
knitr::kable(head(storm_health[order(storm_health$FATALITIES, decreasing=T),c(1,2)], 10))
| | EVTYPE | FATALITIES |
|---|---|---|
| 147 | TORNADO | 5633 |
| 25 | EXCESSIVE HEAT | 1903 |
| 34 | FLASH FLOOD | 980 |
| 57 | HEAT | 937 |
| 99 | LIGHTNING | 817 |
| 122 | RIP CURRENT | 572 |
| 153 | TSTM WIND | 504 |
| 38 | FLOOD | 470 |
| 75 | HIGH WIND | 283 |
| 1 | AVALANCE | 225 |
Top 10 Events resulting in Injuries:
knitr::kable(head(storm_health[order(storm_health$INJURIES, decreasing=T),c(1,3)], 10))
| | EVTYPE | INJURIES |
|---|---|---|
| 147 | TORNADO | 91346 |
| 153 | TSTM WIND | 6957 |
| 38 | FLOOD | 6789 |
| 25 | EXCESSIVE HEAT | 6525 |
| 99 | LIGHTNING | 5230 |
| 141 | THUNDERSTORMS WINDS | 2411 |
| 57 | HEAT | 2100 |
| 96 | ICE STORM | 1975 |
| 34 | FLASH FLOOD | 1777 |
| 75 | HIGH WIND | 1439 |
The following helper function is used to calculate Total Damage to Properties and Crops:
caldmg <- function(value, unit) {
  # guard kept as in the original analysis; note that apply() passes each row
  # as a character vector, so this comparison is made on strings
  if((value<0) | (value>9))
    return(0)
  if('H' == toupper(unit)) {
    return(as.numeric(value)*(10^2)) # hundred
  }
  else if('K' == toupper(unit)) {
    return(as.numeric(value)*(10^3)) # thousand
  }
  else if('M' == toupper(unit)) {
    return(as.numeric(value)*(10^6)) # million
  }
  else if('B' == toupper(unit)) {
    return(as.numeric(value)*(10^9)) # billion
  }
  # any other unit code: use the value as is
  return(as.numeric(value))
}
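The unit conversion can also be written as a vectorised lookup, which avoids calling a function once per row. The following is a sketch under stated assumptions, not the method used in this report: `unit_multiplier` and `to_dollars` are hypothetical names, and the range check at the top of `caldmg` is deliberately not reproduced.

```r
# Multiplier for each recognised unit letter (assumed mapping, as in caldmg)
unit_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical vectorised helper: value and unit are vectors of equal length
to_dollars <- function(value, unit) {
  mult <- unit_multiplier[toupper(as.character(unit))]
  # unknown or empty unit codes yield NA from the lookup; treat them as 1
  mult[is.na(mult)] <- 1
  as.numeric(value) * unname(mult)
}

# Made-up example values, including a lower-case and an empty unit
example_damage <- to_dollars(c(25, 2.5, 3, 1), c("K", "m", "", "B"))
example_damage
```

With the real data this could be called on whole columns, e.g. `to_dollars(storm_data$PROPDMG, storm_data$PROPDMGEXP)`, though totals may differ from the tables in this report because the guard is omitted.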
Two helper variables are created to calculate Total Damage:
- PROPTOTDMG : total property damage, calculated from PROPDMG (value) and PROPDMGEXP (unit)
- CROPTOTDMG : total crop damage, calculated from CROPDMG (value) and CROPDMGEXP (unit)
#subsetting Economic variables
storm_econ <- storm_data[,c('EVTYPE', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')]
#calculate Total Damage
storm_econ$PROPTOTDMG <- apply(storm_econ, 1, function(x) caldmg(x['PROPDMG'], x['PROPDMGEXP']))
storm_econ$CROPTOTDMG <- apply(storm_econ, 1, function(x) caldmg(x['CROPDMG'], x['CROPDMGEXP']))
#subsetting events with non-zero impacts
storm_econ <- storm_econ[,c('EVTYPE', 'PROPTOTDMG', 'CROPTOTDMG')]
storm_econ <- subset(storm_econ, PROPTOTDMG + CROPTOTDMG != 0)
#aggregate based on EVTYPE
storm_econ <- aggregate(cbind(PROPTOTDMG, CROPTOTDMG)~toupper(EVTYPE), data=storm_econ, sum)
names(storm_econ) <- c('EVTYPE','PROPTOTDMG','CROPTOTDMG')
#refinement -- the above aggregate is done to make refinement faster
storm_econ$EVTYPE <- lv(storm_econ)
#aggregate based on EVTYPE -- the first aggregate was only to make refinement faster
storm_econ <- aggregate(cbind(PROPTOTDMG, CROPTOTDMG)~EVTYPE, data=storm_econ, sum)
Top 10 Events damaging Properties:
knitr::kable(head(storm_econ[order(storm_econ$PROPTOTDMG, decreasing=T),c(1,2)], 10))
| | EVTYPE | PROPTOTDMG |
|---|---|---|
| 57 | FLOOD | 144316876207 |
| 149 | HURRICANE/TYPHOON | 69116410000 |
| 247 | TORNADO | 56469280029 |
| 226 | STORM SURGE | 43304851000 |
| 47 | FLASH FLOOD | 15910819467 |
| 85 | HAIL | 14598761393 |
| 141 | HURRICANE | 11677929010 |
| 251 | TROPICAL STORM | 7691812550 |
| 288 | WINTER STORM | 6674132351 |
| 281 | WILDFIRE | 5484989600 |
Top 10 Events damaging Crops:
knitr::kable(head(storm_econ[order(storm_econ$CROPTOTDMG, decreasing=T),c(1,3)], 10))
| | EVTYPE | CROPTOTDMG |
|---|---|---|
| 30 | DROUGHT | 13962176000 |
| 57 | FLOOD | 5628178450 |
| 196 | RIVER FLOOD | 5029450000 |
| 157 | ICE STORM | 5022113500 |
| 85 | HAIL | 3012529473 |
| 141 | HURRICANE | 2732910000 |
| 149 | HURRICANE/TYPHOON | 2424672800 |
| 47 | FLASH FLOOD | 1319819100 |
| 43 | EXTREME COLD | 1312973000 |
| 74 | FROST/FREEZE | 1085186000 |
Combining the impacts on Health (i.e. both Fatalities and Injuries), Tornado is by far the most damaging weather event. For Economic impacts, combining damage to Properties and Crops, Flood is the most damaging (although Drought causes the most damage to crops, with Flood coming in second).
The above combined impacts are summarized in the two plots below:
library(reshape2)
library(ggplot2)
# HEALTH
##########################################
#extract Top 10,
top_storm_health <- head(storm_health[with(storm_health,order(FATALITIES+INJURIES, decreasing=T)),], 10)
#re-order for ggplot2
top_storm_health$EVTYPE <- factor(top_storm_health$EVTYPE, levels=top_storm_health[,'EVTYPE'])
#plot!
ggplot(data=melt(top_storm_health, id.var='EVTYPE'), aes(x=EVTYPE, y=value, fill=variable)) +
  geom_bar(stat='identity') +
  xlab('Weather Events') + ylab('Total Casualties') +
  theme(axis.text.x=element_text(angle=50)) +
  scale_y_continuous(breaks=seq(0,100000,5000)) +
  scale_fill_discrete(name='Casualty Type', labels=c('Fatalities', 'Injuries')) +
  ggtitle('Health Impacts from Severe Weather')
# ECONOMIC
##########################################
#extract Top 10,
top_storm_econ <- head(storm_econ[with(storm_econ,order(PROPTOTDMG+CROPTOTDMG, decreasing=T)),], 10)
#re-order for ggplot2
top_storm_econ$EVTYPE <- factor(top_storm_econ$EVTYPE, levels=top_storm_econ[,'EVTYPE'])
#plot!
ggplot(data=melt(top_storm_econ, id.var='EVTYPE'), aes(x=EVTYPE, y=(value/(10^9)), fill=variable)) +
  geom_bar(stat='identity') +
  xlab('Weather Events') + ylab('Total Damages (in Billions)') +
  theme(axis.text.x=element_text(angle=50)) +
  scale_y_continuous(breaks=seq(0,200,10)) +
  scale_fill_discrete(name='Damage Type', labels=c('Properties', 'Crops')) +
  ggtitle('Economic Impacts from Severe Weather')