This document presents my analysis of the “NOAA Storm Database” dataset. It shows how the raw data is processed and the results are generated. Due conflicting deadlines, I have performed only a minimal analysis and have not completely cleaned up the dataset. Although, this may render the results less valid, my analysis can be reproduced and I point out some of the problems of the analysis. I think that this document still proves that I have understood the main concepts behind “reproducible research”.

Data Processing

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

The dataset needs to be cleaned up extensively. However, due to time constraints I will only show an example how this would be performed.

After loading the CSV-file, the values of the variables EVTYPE, CROPDMGEXP and PROPDMGEXP are converted to lowercase or uppercase, because lower-/uppercase is not used consequently in the CSV-file.

storm_data <- 
  read.csv("~/Coursera Data Science/Course Project Week 4/repdata_data_StormData.csv") %>%
  mutate(EVTYPE = factor(tolower(EVTYPE)),
         CROPDMGEXP = toupper(CROPDMGEXP),
         PROPDMGEXP = toupper(PROPDMGEXP))

There are only 48 possible event types described in the manual. However, the variable EVTYPE has many more factors.

summary(storm_data$EVTYPE)
##                     hail                tstm wind        thunderstorm wind 
##                   288661                   219942                    82564 
##                  tornado              flash flood                    flood 
##                    60652                    54277                    25327 
##       thunderstorm winds                high wind                lightning 
##                    20843                    20214                    15754 
##               heavy snow               heavy rain             winter storm 
##                    15708                    11742                    11433 
##           winter weather             funnel cloud         marine tstm wind 
##                     7045                     6844                     6175 
## marine thunderstorm wind               waterspout              strong wind 
##                     5812                     3796                     3569 
##     urban/sml stream fld                 wildfire                 blizzard 
##                     3392                     2761                     2719 
##                  drought                ice storm           excessive heat 
##                     2488                     2006                     1678 
##               high winds         wild/forest fire             frost/freeze 
##                     1533                     1457                     1343 
##                dense fog       winter weather/mix           tstm wind/hail 
##                     1293                     1104                     1028 
##  extreme cold/wind chill                     heat                high surf 
##                     1002                      767                      734 
##           tropical storm           flash flooding             extreme cold 
##                      690                      682                      657 
##            coastal flood         lake-effect snow        flood/flash flood 
##                      656                      636                      625 
##                     snow                landslide          cold/wind chill 
##                      617                      600                      539 
##                      fog              rip current              marine hail 
##                      538                      470                      442 
##               dust storm                avalanche                     wind 
##                      427                      386                      346 
##             rip currents              storm surge            freezing rain 
##                      304                      261                      260 
##              urban flood     heavy surf/high surf        extreme windchill 
##                      251                      228                      204 
##             strong winds           dry microburst         coastal flooding 
##                      204                      186                      183 
##               light snow    astronomical low tide                hurricane 
##                      176                      174                      174 
##              river flood            record warmth               dust devil 
##                      173                      154                      149 
##         storm surge/tide         marine high wind        unseasonably warm 
##                      148                      135                      126 
##                 flooding   astronomical high tide        moderate snowfall 
##                      120                      103                      101 
##           urban flooding               wintry mix        hurricane/typhoon 
##                       99                       94                       88 
##            funnel clouds               heavy surf                     cold 
##                       87                       87                       82 
##              record heat                   freeze                heat wave 
##                       82                       76                       75 
##              record cold              gusty winds                      ice 
##                       67                       65                       61 
##  thunderstorm winds hail      tropical depression                    sleet 
##                       61                       60                       59 
##                    frost         unseasonably dry               small hail 
##                       57                       56                       53 
##                    other      thunderstorm windss       marine strong wind 
##                       52                       51                       48 
##             freezing fog                   funnel             thunderstorm 
##                       46                       46                       45 
##                    glaze       temperature record          tstm wind (g45) 
##                       43                       43                       39 
##      mixed precipitation              waterspouts    monthly precipitation 
##                       37                       37                       36 
##                  (Other) 
##                     2677

There are several problems:

  1. Typos / Abbreviatons: These need to be cleaned up manually. I have only done this for the event “avalanche”.
  2. Combination events: Several events are a combination of events. I have used the strategy to classify all events that start with “blizzard” as blizzard.
  3. Events like “Summary of April 13” or “Tropical Storm Gordon”: These would have to be cleaned up manually by looking in the description. I have omitted that.
storm_data$EVTYPE[storm_data$EVTYPE %in% c("avalance", "avalanche")] <- "avalanche"
storm_data$EVTYPE[substr(storm_data$EVTYPE, 1, 8) == "blizzard"] <- "blizzard"

Furthermore, the variables ending with “EXP” should contain “K”, “M” and “B” according to the manual. I have chosen to set other values to NA, although it might have been possible to guess what is meant by them.

storm_data$CROPDMGEXP[!(storm_data$CROPDMGEXP %in% c("", "K", "M", "B"))] <- NA
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == ""] <- 1
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == "K"] <- 1000
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == "M"] <- 1000000
storm_data$CROPDMGEXP[storm_data$CROPDMGEXP == "B"] <- 1000000000
storm_data$PROPDMGEXP[!(storm_data$PROPDMGEXP %in% c("", "K", "M", "B"))] <- NA
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == ""] <- 1
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == "K"] <- 1000
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == "M"] <- 1000000
storm_data$PROPDMGEXP[storm_data$PROPDMGEXP == "B"] <- 1000000000


storm_data <- 
  storm_data %>%
  mutate(crop_damage = CROPDMG * as.numeric(CROPDMGEXP),
         property_damage = PROPDMG * as.numeric(PROPDMGEXP),
         total_damage = crop_damage + property_damage)

Results

Which types of events are most harmful with respect to population health?

It is difficult to decide how the harmfulness can be compared. Is an event with only 8 fatalities and 43 injuries less harmful than an event with 14 fatalities and no injuries?

Here, you can se a table listing the top 10 most harmful events based on average fatalities. Because I have not extensively cleaned up the dataset, I have chosen to exclude events with a number of observations below 10, because they often refer to singular events (“tropical storm gordon”).

storm_data %>%
  group_by(EVTYPE) %>%
  summarize(injuries = mean(INJURIES),
            fatalities = mean(FATALITIES),
            n_observations = n()) %>%
  arrange(desc(fatalities), desc(injuries)) %>%
  filter(n_observations > 10) %>%
  top_n(10, fatalities)
## Source: local data frame [10 x 4]
## 
##                       EVTYPE   injuries fatalities n_observations
## 1               extreme heat  7.0454545  4.3636364             22
## 2                  heat wave  5.0533333  2.2933333             75
## 3  unseasonably warm and dry  0.0000000  2.2307692             13
## 4                    tsunami  6.4500000  1.6500000             20
## 5                       heat  2.7379400  1.2216428            767
## 6             excessive heat  3.8885578  1.1340882           1678
## 7                rip current  0.4936170  0.7829787            470
## 8          hurricane/typhoon 14.4886364  0.7272727             88
## 9               rip currents  0.9769737  0.6710526            304
## 10         flash flood/flood  0.0000000  0.6363636             22

Which types of events have the greatest economic consequences?

Here, I show a list with the top 10 of events with the greates economic consequences and a barplot of the top 5. Note that there are problems with EVTYPE variable, because the events “hurricane/typhoon” and “hurricane” should be one category.

  storm_data %>%
    group_by(EVTYPE) %>%
    summarize(damage = mean(total_damage),
              n_observations = n()) %>%
    arrange(desc(damage)) %>%
    filter(n_observations > 10) %>%
    top_n(10, damage)
## Source: local data frame [10 x 3]
## 
##                 EVTYPE    damage n_observations
## 1    hurricane/typhoon 817201282             88
## 2          storm surge 165990579            261
## 3  severe thunderstorm  92735385             13
## 4            hurricane  83966833            174
## 5          river flood  58661298            173
## 6              typhoon  54641364             11
## 7     storm surge/tide  31365122            148
## 8    flash flood/flood  12409318             22
## 9       tropical storm  12148169            690
## 10             tsunami   7204100             20
  top5_damage <- 
    storm_data %>%
    group_by(EVTYPE) %>%
    summarize(damage = mean(total_damage),
              n_observations = n()) %>%
    arrange(desc(damage)) %>%
    filter(n_observations > 10) %>%
    top_n(5, damage)
  
  barplot(height = top5_damage$damage, names.arg = top5_damage$EVTYPE, cex.names = .8)
  title("Top 5 - Total damage")

Conclusions

Although storms seem to have the biggest economic impact, extreme heat seems to be more harmful to the population.