Overview

Goal of this document is analysis of Storm Data collected by National Oceanic and Atmospheric Administration (NOAA), USA, focusing on identification of type of storms that cause maximum harm to human health and those that cause maximum economic losses. Storm Data is downloaded from Storm data web location. Data dictionary is available at Storm data documentation and FAQ available at Storm data FAQ site.

Results obtained from this analysis shows that, over the years, tornadoes caused maximum life loss or injuries to humans. However, as far as maximum economic loss is concerned, floods were most damaging.

Data loading

Storm data is downloaded from Storm data web location and unzipped as a .csv file. Though handling this heavy data set could be easier by using cacher package, due to time contstraints, pursued simple loading from .csv file.

library(knitr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(stringr, warn.conflicts = FALSE, quietly = TRUE)

# Download data only if it does not already exist locally.
if (!file.exists("StormData.csv"))
{
     download.file(
          "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
          destfile = "StormData.csv")
}

# Load data.
stormdf <- read.csv("StormData.csv", 
                    comment.char = "", 
                    header = TRUE, 
                    stringsAsFactors = FALSE)
dim(stormdf)
## [1] 902297     37

Data Processing

Out of 37 columns of the dataset, our focus is on EVTYPE (type of storm), INJURIES and FATALITIES (harm to human health), PROPDMG and PROPDMGEXP (loss of property), CROPDMG and CROPDMGEXP (loss of agricultural produce).

Raw data here shows 985 unique types of events, however Storm data documentation shows that there are 48 unique types of events. As part of the cleaning and preprocessing of storm data, we trim and capitalize event type and if it does not match with any of these 48 types, we eliminate the record from the data set to be analyzed.

stdEvTypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
                "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
                "Drought", "Dust Devil", "Dust Storm", "Excessive Heat",
                "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog",
                "Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", 
                "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon",
                "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning",
                "Marine Hail", "Marine High Wind", "Marine Strong Wind", 
                "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
                "Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
                "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
                "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")

We also need to reformulate values of PROPDMG and CROPDMG columns based on the values of CROPDMGEXP and PROPDMGEXP columns, respectively. First we check the unique values of these two fields:

unique(stormdf$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
unique(stormdf$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

To correct the values of CROPDMG and PROPDMG, we’ll multiply the values by a multiplier based on corresponding values of CROPDMGEXP and PROPDMGEXP, respectively. As per standard notations, the multipliers will be 1000 for K, 1000000 for M, 1000000000 for B, 10^n for single digit numbers, 1 for ‘+’ and 0 for other values " " (empty or null), ‘-’ and ‘?’. Before making these modifications on PROPDMG and CROPTDMG, we’ll capitalize (to capital letters) values of CROPDMGEXP and PROPDMGEXP.

zeroMultGroup <- c("", "-", "?")
oneMultGroup <- c("+")
powerMultGroup <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
thousandMultGroup <- c("K")
mMultGroup <- c("M")
bMultGroup <- c("B")

stormdf <- stormdf %>%
  mutate(EVTYPE = str_to_upper(str_trim(EVTYPE)), 
         CROPDMGEXP = str_to_upper(str_trim(CROPDMGEXP)), 
         PROPDMGEXP = str_to_upper(str_trim(PROPDMGEXP))) %>%
  filter(EVTYPE %in% str_to_upper(str_trim(stdEvTypes))) %>%
  mutate(PROPDMG = PROPDMG * 
           ifelse(PROPDMGEXP %in% zeroMultGroup, 0, 
                  ifelse(PROPDMGEXP %in% oneMultGroup, 1, 
                         ifelse(PROPDMGEXP %in% powerMultGroup, 10^as.integer(PROPDMGEXP), 
                                ifelse(PROPDMGEXP %in% thousandMultGroup, 1000, 
                                       ifelse(PROPDMGEXP %in% mMultGroup, 1000000, 
                                              ifelse(PROPDMGEXP %in% bMultGroup, 1000000000, 0))))))) %>%
    mutate(PROPDMG = CROPDMG * 
           ifelse(CROPDMGEXP %in% zeroMultGroup, 0, 
                  ifelse(CROPDMGEXP %in% oneMultGroup, 1, 
                         ifelse(CROPDMGEXP %in% powerMultGroup, 10^as.integer(PROPDMGEXP), 
                                ifelse(CROPDMGEXP %in% thousandMultGroup, 1000, 
                                       ifelse(CROPDMGEXP %in% mMultGroup, 1000000, 
                                              ifelse(CROPDMGEXP %in% bMultGroup, 1000000000, 0)))))))
## Warning in ifelse(PROPDMGEXP %in% powerMultGroup,
## 10^as.integer(PROPDMGEXP), : NAs introduced by coercion
## Warning in ifelse(CROPDMGEXP %in% powerMultGroup,
## 10^as.integer(PROPDMGEXP), : NAs introduced by coercion
dim(stormdf)
## [1] 635295     37

After this cleanup, dataset has 46 unique values of event types.

Results

Finding types of storms causing maximum damage on human health

We group the data set based on event types, to find aggregate values in total damage on human health, considering the variables FATALITIES and INJURIES. Then we take a look at top ten types of storms causing maximum damages.

hddf <- stormdf %>%
  mutate(hhdmg = FATALITIES + INJURIES) %>%
  group_by(EVTYPE) %>%
  summarise(sumdmg = sum(hhdmg), meandmg = round(mean(hhdmg), 5), sddmg = sd(hhdmg)) %>%
  filter(sumdmg > 0)
topsumdmg <- top_n(hddf, 10, sumdmg)
topmeandmg <- top_n(hddf, 10, meandmg)
topdmgtbl <- unique(rbind(topsumdmg, topmeandmg))
kable(topdmgtbl, col.names = c("Storm Type", "Total Life Harm", "Mean Life Harm", "Std. Dev. of Life Harm"))
Storm Type Total Life Harm Mean Life Harm Std. Dev. of Life Harm
EXCESSIVE HEAT 8428 5.02265 27.4492771
FLASH FLOOD 2755 0.05076 1.3079760
FLOOD 7259 0.28661 11.0441607
HEAT 3037 3.95958 29.6934237
HIGH WIND 1385 0.06852 0.8555176
ICE STORM 2064 1.02891 35.0936781
LIGHTNING 6046 0.38375 1.3187508
THUNDERSTORM WIND 1621 0.01963 0.4743736
TORNADO 96979 1.59894 18.2551559
WINTER STORM 1527 0.13356 2.2496219
AVALANCHE 394 1.02073 1.2911630
DUST STORM 462 1.08197 4.4784647
HURRICANE/TYPHOON 1339 15.21591 90.6595816
MARINE STRONG WIND 36 0.75000 1.5228752
RIP CURRENT 600 1.27660 1.2129741
TSUNAMI 162 8.10000 35.9896184

Clearly, over the years, tornados caused maximum damage to human health in terms of injuries or fatalities. Now we plot this data: to keep the damage values comparable between storm types, we use logarithm on Mean Life Harm.

hdplot <- ggplot(data = topdmgtbl, 
                 aes(x = EVTYPE, 
                     y = log(meandmg), 
                     fill = sumdmg))
hdplot + labs(title = "Life Harm from Storms") + 
    geom_bar(stat = "identity") +
    xlab("Storm Type") + 
    theme(axis.text.x = element_text(angle=45, size=8)) +
    ylab("log of Mean Life Harm") + 
    scale_fill_continuous(name = "Total Life Harm") +
    geom_errorbar(aes(ymin = log(meandmg)-log(sddmg), ymax = log(meandmg)+log(sddmg)), 
                  width=.2, position=position_dodge(.9))

Finding types of storms causing maximum economic loss

We group the data set based on event types, to find aggregate values in total economic loss in properties and agricultural area, considering the variables PROPDMG and CROPDMG Then we take a look at top ten types of storms causing maximum economic loss

eldf <- stormdf %>%
  mutate(ecoloss = CROPDMG + PROPDMG) %>%
  group_by(EVTYPE) %>%
  summarise(sumloss = sum(ecoloss), meanloss = round(mean(ecoloss), 5), sdloss = sd(ecoloss)) %>%
  filter(sumloss > 0)
topsumloss <- top_n(eldf, 10, sumloss)
topmeanloss <- top_n(eldf, 10, meanloss)
toplosstbl <- unique(rbind(topsumloss, topmeanloss))
kable(toplosstbl, col.names = c("Storm Type", "Total Economic Loss", "Mean Economic Loss", "Std. Dev. of Economic Loss"))
Storm Type Total Economic Loss Mean Economic Loss Std. Dev. of Economic Loss
EXCESSIVE HEAT 492402494 293446.063 12020499.1
FLASH FLOOD 1421496300 26189.180 1173704.4
FLOOD 5662136488 223561.278 5697337.5
FROST/FREEZE 1094193134 814738.000 10069811.0
HEAVY RAIN 733410923 62460.477 2679109.4
HIGH WIND 638588583 31591.401 1488544.7
HURRICANE/TYPHOON 2607877598 29634972.710 169005303.3
ICE STORM 5022115189 2503546.954 111636119.6
THUNDERSTORM WIND 414909841 5025.312 252139.4
TROPICAL STORM 678351899 983118.694 9511706.0
BLIZZARD 112060172 41213.745 1365693.7
HEAT 401462163 523418.726 14443112.8
WILDFIRE 295477164 107018.169 2483440.9

Clearly, over the years, floods caused maximum total economic loss. Though ice storms and hurricanes/typhoons also caused very high total econmic loss, but variation in such losses were also high as intensities of storms can vary. Now we plot this data: to keep the economic damanges comparable between storm types, we use logarithm on Mean Economic Loss.

elplot <- ggplot(data = toplosstbl, 
                 aes(x = EVTYPE, 
                     y = log(meanloss), 
                     fill = sumloss))
elplot + labs(title = "Economic Loss from Storms") + 
    geom_bar(stat = "identity") +
    xlab("Storm Type") + 
    theme(axis.text.x = element_text(angle=45, size=8)) +
    ylab("log of Mean Economic Loss") + 
    scale_fill_continuous(name = "Total Economic Loss") +
    geom_errorbar(aes(ymin = log(meanloss)-log(sdloss), ymax = log(meanloss)+log(sdloss)), 
                  width=.2, position=position_dodge(.9))