Synopsis

In this report, we will try to answer the following questions about weather events in the United States:
1. Which types of events are most harmful with respect to population health?
2. Which types of events have the greatest economic consequences?
To answer these questions, we will explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which includes data about health and economic consequences for most weather events between 1950 and 2011.

Data Processing

This document assumes you have already downloaded the data file and unzipped it into a directory named ‘data’ under your current working directory. The following code loads the raw CSV file into a data frame and shows its dimensions:

data <- read.csv("data/repdata-data-StormData.csv")
dim(data)
## [1] 902297     37

There are 902297 observations of 37 variables. We first look at the variable names to see which ones are relevant to our study.

names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

For this study, we are only interested in variables related to the event type, population health, and economic consequences. After looking at the documentation, we determine that the event type is in the “EVTYPE” variable (column 8), health data is in the “FATALITIES” and “INJURIES” variables (columns 23 and 24), and economic consequences are in the variables containing “DMG” (columns 25-28). We create a data frame with only these variables for our analysis and view the summary statistics.

storm <- data[,c(8,23:28)]
summary(storm)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878                                         
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9
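Selecting columns by position, as in data[,c(8,23:28)], depends on the column order of the file. A minimal sketch of the same selection by name follows, run on a small made-up data frame. The names raw, keep, and subset_by_name are illustrative and not part of the analysis.

```r
# Toy stand-in for the raw storm data; only the column names matter here.
raw <- data.frame(
  STATE      = c("AL", "TX"),
  EVTYPE     = c("TORNADO", "HAIL"),
  FATALITIES = c(0, 1),
  INJURIES   = c(15, 0),
  PROPDMG    = c(25, 2.5),
  PROPDMGEXP = c("K", "M"),
  CROPDMG    = c(0, 0),
  CROPDMGEXP = c("", "K"),
  REFNUM     = c(1, 2)
)

# Columns of interest, listed by name rather than by position.
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Fail early if any expected column is missing from the input.
stopifnot(all(keep %in% names(raw)))

subset_by_name <- raw[, keep]
names(subset_by_name)
```

Selecting by name keeps working if columns are added or reordered, at the cost of a slightly longer call.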

The documentation suggests that the “PROPDMGEXP” and “CROPDMGEXP” variables should only have the values ‘’, ‘K’, ‘M’, or ‘B’ to indicate the factor by which the damage should be multiplied. Our summary shows that a small percentage of rows do not have one of these values, so we will filter out those rows.

library(dplyr)
exp <- c('','K','M','B')
df <- storm %>% filter(PROPDMGEXP %in% exp, CROPDMGEXP %in% exp)
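To see what an exact-match filter of this kind keeps and drops, here is a small base R sketch on made-up data. The names toy, valid, ok, and kept are illustrative. Note that %in% is case-sensitive, so a lowercase ‘k’ (which appears in the summary above) is dropped along with the numeric codes.

```r
# Toy exponent codes, including two that the documentation does not list.
toy <- data.frame(
  PROPDMG    = c(10, 5, 2, 1, 3),
  PROPDMGEXP = c("K", "", "M", "5", "k"),
  stringsAsFactors = FALSE
)
valid <- c("", "K", "M", "B")

# TRUE where the code is one of the documented values (exact, case-sensitive).
ok <- toy$PROPDMGEXP %in% valid
kept <- toy[ok, ]

# Number of rows that survive, and a count of each dropped code.
nrow(kept)
table(toy$PROPDMGEXP[!ok])
```

Tabulating the dropped codes before discarding them is a cheap way to confirm that only a small share of rows is lost.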

We now multiply the damage variables by the factors encoded in their respective ‘EXP’ variables to create two new columns holding the total amount of property damage and crop damage.

First, create two tables mapping the value of each “EXP” variable to the appropriate multiplication factor.

propmult <- data.frame(PROPDMGEXP = exp, pmult = c(1, 1e3, 1e6, 1e9))
cropmult <- data.frame(CROPDMGEXP = exp, cmult = c(1, 1e3, 1e6, 1e9))
propmult
##   PROPDMGEXP pmult
## 1            1e+00
## 2          K 1e+03
## 3          M 1e+06
## 4          B 1e+09

Next, merge these tables with our main data frame to add the “pmult” and “cmult” columns.

df2 <- merge(df, propmult)
df3 <- merge(df2, cropmult)

Finally, create the PROPTOTAL and CROPTOTAL columns (total amount of property/crop damage) by multiplying PROPDMG and CROPDMG by pmult and cmult, respectively.

df3$PROPTOTAL <- df3$PROPDMG * df3$pmult
df3$CROPTOTAL <- df3$CROPDMG * df3$cmult
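As a worked example of the multiplication, the sketch below applies the same factors to a few made-up rows. It uses match() as an alternative to merge(), which is a substitution for illustration and not the method used in this report. The names toy, codes, and factors are hypothetical.

```r
# Made-up damage amounts with their exponent codes.
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 1.2, 7),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Parallel vectors: codes[i] corresponds to factors[i].
codes   <- c("", "K", "M", "B")
factors <- c(1, 1e3, 1e6, 1e9)

# match() returns the position of each code in `codes`;
# indexing `factors` with that position gives the multiplier per row.
toy$pmult     <- factors[match(toy$PROPDMGEXP, codes)]
toy$PROPTOTAL <- toy$PROPDMG * toy$pmult
toy
```

Unlike merge(), which sorts on the key column by default, this lookup leaves the rows in their original order. That difference does not affect this analysis, which only uses sums.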

We can now get rid of the PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP, pmult, and cmult columns. This leaves 5 columns indicating the event type, fatalities, injuries, total property damage, and total crop damage.

final <- df3 %>% select(EVTYPE, FATALITIES, INJURIES, PROPTOTAL, CROPTOTAL)

In our final table, there are 985 different event types, which is too many to compare visually in a meaningful way, so we will add one more column (EVENT) that places each event into one of 12 categories using regular expressions.

final$EVENT <- 'other'
final[grep('HEAT|WARM|DRY|DROUGHT|dry|DUST', final$EVTYPE), 'EVENT'] <- 'heat/drought'
final[grep('WIND|Wind|wind', final$EVTYPE), 'EVENT'] <- 'wind'
final[grep('RAIN', final$EVTYPE), 'EVENT'] <- 'rain'
final[grep('TSTM|LIGHTNING|THUNDERSTORM|STORM SURGE', final$EVTYPE),
      'EVENT'] <- 'thunderstorm'
final[grep('CHILL|COLD|LOW TEMPERATURE|Cold|EXPOS|Expos|HYPO|FROST|FREEZE',
           final$EVTYPE), 'EVENT'] <- 'cold'
final[grep('WINTER|SNOW|ICE|FREEZING|BLIZZ|SLEET|ICY|MIX|WINTRY|snow|Snow',
           final$EVTYPE), 'EVENT'] <- 'snow/ice'
final[grep('HAIL', final$EVTYPE), 'EVENT'] <- 'hail'
final[grep('TORNADO|WATERSPOUT|FUNNEL', final$EVTYPE), 'EVENT'] <- 'tornado'
# paste0() joins the two halves so that no line break ends up inside the pattern
final[grep(paste0('CURRENT|SURF|MARINE|HIGH WAVES|SEAS|SWELLS|Surf|TROPICAL|COASTAL|',
                  'HURRICANE|surf|Marine|TIDE|TSUNAMI'),
           final$EVTYPE), 'EVENT'] <- 'tropical'
final[grep('FLOOD|FLD|RISING|HIGH WATER|Flood', final$EVTYPE), 'EVENT'] <- 'flood'
final[grep('FIRE', final$EVTYPE), 'EVENT'] <- 'wildfire'
table(final$EVENT)
## 
##         cold        flood         hail heat/drought        other 
##         4191        86096       289896         6111         3507 
##         rain     snow/ice thunderstorm      tornado     tropical 
##        11829        42705       339465        71505        16253 
##     wildfire         wind 
##         4240        26123

The table above shows that nearly all the events fit into these categories, with only 3507 events left in the ‘other’ category.
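Two properties of this kind of sequential grep() assignment are worth keeping in mind. Later assignments overwrite earlier ones, and a line break typed inside a quoted pattern becomes part of the regular expression. The sketch below illustrates both on a few made-up event names. The names evtype, coastal_terms, coastal_pattern, event, and wrapped are illustrative.

```r
evtype <- c("TSTM WIND", "HIGH WIND", "Heavy surf", "FLASH FLOOD", "FOG")

# Build a long pattern from pieces so that no line break ends up inside it.
coastal_terms   <- c("SURF", "Surf", "surf", "TIDE", "HURRICANE")
coastal_pattern <- paste(coastal_terms, collapse = "|")

event <- rep("other", length(evtype))
event[grep("WIND", evtype)] <- "wind"
# Later assignments win: "TSTM WIND" matches both and ends up as thunderstorm.
event[grep("TSTM", evtype)] <- "thunderstorm"
event[grep(coastal_pattern, evtype)] <- "tropical"
event[grep("FLOOD", evtype)] <- "flood"
data.frame(evtype, event)

# A pattern wrapped across lines inside the quotes contains a newline and
# leading spaces, so its second alternative no longer matches plain "surf".
wrapped <- "SURF|
            surf"
grepl(wrapped, "Heavy surf")          # FALSE
grepl(coastal_pattern, "Heavy surf")  # TRUE
```

Because the order of the assignments decides which category wins for names that match several patterns, changing the order changes the counts.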

Results

Which types of events are most harmful with respect to population health?

To answer this question, we will first find the total number of fatalities for each event category, sort the totals in decreasing order, and create a bar plot to compare them visually.

fat_by_event <- aggregate(FATALITIES ~ EVENT, data = final, sum)
fat_by_event <- fat_by_event[order(fat_by_event$FATALITIES, decreasing=TRUE),]
barplot(fat_by_event[1:5, 'FATALITIES'], names.arg=fat_by_event[1:5, 'EVENT'],
        ylim = c(0,6000), main = 'Total Number of Fatalities', xlab='Event Type')

The plot shows the top five causes of death among the different event categories. Tornadoes have caused nearly twice as many deaths as the second-ranked cause, which is heat/drought. Now we can do the same comparison with injuries.
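The aggregate-then-sort step can be checked by hand on a tiny example. The sketch below uses made-up fatality counts, and toy and totals are illustrative names.

```r
# Made-up records: two tornado rows, two flood rows, one heat/drought row.
toy <- data.frame(
  EVENT      = c("tornado", "flood", "tornado", "heat/drought", "flood"),
  FATALITIES = c(5, 1, 3, 4, 2)
)

# Sum fatalities within each category (one row per EVENT in the result).
totals <- aggregate(FATALITIES ~ EVENT, data = toy, sum)

# Sort so that the largest total comes first.
totals <- totals[order(totals$FATALITIES, decreasing = TRUE), ]
totals
```

Here tornado sums to 8, heat/drought to 4, and flood to 3, so the sorted table lists them in that order.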

inj_by_event <- aggregate(INJURIES ~ EVENT, data = final, sum)
inj_by_event <- inj_by_event[order(inj_by_event$INJURIES, decreasing=TRUE),]
barplot(inj_by_event[1:5, 'INJURIES'], names.arg=inj_by_event[1:5, 'EVENT'],
        ylim = c(0,80000), main = 'Total Number of Injuries', xlab='Event Type')

This plot reinforces our conclusion that tornadoes are the most dangerous weather event, causing more than six times as many injuries as the next closest category, which is thunderstorms. We would rank heat/drought as the second-most dangerous category. Although it accounts for fewer injuries than thunderstorms, it accounts for far more fatalities.
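When the ranking depends on both fatalities and injuries, it can help to view the two totals side by side. A minimal sketch on made-up numbers follows, using cbind() on the left-hand side of the aggregate() formula. The names toy and health are illustrative.

```r
# Made-up records with both health outcomes.
toy <- data.frame(
  EVENT      = c("tornado", "thunderstorm", "heat/drought", "tornado", "thunderstorm"),
  FATALITIES = c(6, 1, 4, 2, 1),
  INJURIES   = c(40, 9, 3, 30, 6)
)

# cbind() on the left-hand side sums both columns in a single call.
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVENT, data = toy, sum)

# Sort by fatalities first, using injuries to break ties.
health <- health[order(health$FATALITIES, health$INJURIES, decreasing = TRUE), ]
health
```

In this toy data, heat/drought ranks above thunderstorm on fatalities (4 versus 2) even though thunderstorm has more injuries (15 versus 3), the same pattern described above.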

Which types of events have the greatest economic consequences?

To answer this question, we will look at the combined total amount of property damage and crop damage caused by each of the event categories. Then we will use a bar graph to compare them visually.

prop_by_event <- aggregate(PROPTOTAL ~ EVENT, data = final, sum)
crop_by_event <- aggregate(CROPTOTAL ~ EVENT, data = final, sum)
total_damage <- merge(prop_by_event, crop_by_event)
total_damage$TOTAL <- (total_damage$PROPTOTAL + total_damage$CROPTOTAL) / 1e9
total_damage <- arrange(total_damage, desc(TOTAL))
barplot(total_damage[1:5, 'TOTAL'], names.arg=total_damage[1:5, 'EVENT'],
        xlab='Event Type', ylab = 'Damage in billions of dollars',
        main='Economic Damage Caused By Weather Events')

Floods are the leading cause of damage at around $180 billion, nearly twice as much as the tropical category, which came in second.
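A comparison such as "nearly twice as much" can be computed directly from the sorted totals. The sketch below does this on deliberately made-up dollar amounts, which are not the values from the storm data. The names toy and ratio are illustrative.

```r
# Made-up damage totals in dollars (NOT the real figures).
toy <- data.frame(
  EVENT     = c("tornado", "flood", "tropical"),
  PROPTOTAL = c(1e9, 4e9, 2e9),
  CROPTOTAL = c(0, 1e9, 0.5e9)
)

# Combined damage, expressed in billions of dollars.
toy$TOTAL <- (toy$PROPTOTAL + toy$CROPTOTAL) / 1e9
toy <- toy[order(toy$TOTAL, decreasing = TRUE), ]

# Ratio of the largest total to the second largest.
ratio <- toy$TOTAL[1] / toy$TOTAL[2]
ratio
```

With these toy numbers the top category totals 5 billion and the second 2.5 billion, so the ratio is 2.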