Storm Data Analysis

Overview

Goal of this document is analysis of Storm Data collected by National Oceanic and Atmospheric Administration (NOAA), USA, focusing on identification of type of storms that cause maximum harm to human health and those that cause maximum economic losses. Storm Data is downloaded from Storm data web location. Data dictionary is available at Storm data documentation and FAQ available at Storm data FAQ site.

Results obtained from this analysis shows that, over the years, tornadoes caused maximum life loss or injuries to humans. However, as far as maximum economic loss is concerned, floods were most damaging.

Data loading

Storm data is downloaded from Storm data web location and unzipped as a .csv file. Though handling this heavy data set could be easier by using cacher package, due to time contstraints, pursued simple loading from .csv file.

library(knitr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(stringr, warn.conflicts = FALSE, quietly = TRUE)

# Download data only if it does not already exist locally.
if (!file.exists("StormData.csv"))
{
     download.file(
          "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
          destfile = "StormData.csv")
}

# Load data.
stormdf <- read.csv("StormData.csv", 
                    comment.char = "", 
                    header = TRUE, 
                    stringsAsFactors = FALSE)
dim(stormdf)

## [1] 902297     37

Data Processing

Out of 37 columns of the dataset, our focus is on EVTYPE (type of storm), INJURIES and FATALITIES (harm to human health), PROPDMG and PROPDMGEXP (loss of property), CROPDMG and CROPDMGEXP (loss of agricultural produce).

Raw data here shows 985 unique types of events, however Storm data documentation shows that there are 48 unique types of events. As part of the cleaning and preprocessing of storm data, we trim and capitalize event type and if it does not match with any of these 48 types, we eliminate the record from the data set to be analyzed.

stdEvTypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
                "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
                "Drought", "Dust Devil", "Dust Storm", "Excessive Heat",
                "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog",
                "Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain", 
                "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon",
                "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning",
                "Marine Hail", "Marine High Wind", "Marine Strong Wind", 
                "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
                "Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
                "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
                "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")

We also need to reformulate values of PROPDMG and CROPDMG columns based on the values of CROPDMGEXP and PROPDMGEXP columns, respectively. First we check the unique values of these two fields:

unique(stormdf$CROPDMGEXP)

## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

unique(stormdf$PROPDMGEXP)

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

To correct the values of CROPDMG and PROPDMG, we’ll multiply the values by a multiplier based on corresponding values of CROPDMGEXP and PROPDMGEXP, respectively. As per standard notations, the multipliers will be 1000 for K, 1000000 for M, 1000000000 for B, 10^n for single digit numbers, 1 for ‘+’ and 0 for other values " " (empty or null), ‘-’ and ‘?’. Before making these modifications on PROPDMG and CROPTDMG, we’ll capitalize (to capital letters) values of CROPDMGEXP and PROPDMGEXP.

zeroMultGroup <- c("", "-", "?")
oneMultGroup <- c("+")
powerMultGroup <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
thousandMultGroup <- c("K")
mMultGroup <- c("M")
bMultGroup <- c("B")

stormdf <- stormdf %>%
  mutate(EVTYPE = str_to_upper(str_trim(EVTYPE)), 
         CROPDMGEXP = str_to_upper(str_trim(CROPDMGEXP)), 
         PROPDMGEXP = str_to_upper(str_trim(PROPDMGEXP))) %>%
  filter(EVTYPE %in% str_to_upper(str_trim(stdEvTypes))) %>%
  mutate(PROPDMG = PROPDMG * 
           ifelse(PROPDMGEXP %in% zeroMultGroup, 0, 
                  ifelse(PROPDMGEXP %in% oneMultGroup, 1, 
                         ifelse(PROPDMGEXP %in% powerMultGroup, 10^as.integer(PROPDMGEXP), 
                                ifelse(PROPDMGEXP %in% thousandMultGroup, 1000, 
                                       ifelse(PROPDMGEXP %in% mMultGroup, 1000000, 
                                              ifelse(PROPDMGEXP %in% bMultGroup, 1000000000, 0))))))) %>%
    mutate(PROPDMG = CROPDMG * 
           ifelse(CROPDMGEXP %in% zeroMultGroup, 0, 
                  ifelse(CROPDMGEXP %in% oneMultGroup, 1, 
                         ifelse(CROPDMGEXP %in% powerMultGroup, 10^as.integer(PROPDMGEXP), 
                                ifelse(CROPDMGEXP %in% thousandMultGroup, 1000, 
                                       ifelse(CROPDMGEXP %in% mMultGroup, 1000000, 
                                              ifelse(CROPDMGEXP %in% bMultGroup, 1000000000, 0)))))))

## Warning in ifelse(PROPDMGEXP %in% powerMultGroup,
## 10^as.integer(PROPDMGEXP), : NAs introduced by coercion

## Warning in ifelse(CROPDMGEXP %in% powerMultGroup,
## 10^as.integer(PROPDMGEXP), : NAs introduced by coercion

dim(stormdf)

## [1] 635295     37

After this cleanup, dataset has 46 unique values of event types.

Results

Finding types of storms causing maximum damage on human health

We group the data set based on event types, to find aggregate values in total damage on human health, considering the variables FATALITIES and INJURIES. Then we take a look at top ten types of storms causing maximum damages.

hddf <- stormdf %>%
  mutate(hhdmg = FATALITIES + INJURIES) %>%
  group_by(EVTYPE) %>%
  summarise(sumdmg = sum(hhdmg), meandmg = round(mean(hhdmg), 5), sddmg = sd(hhdmg)) %>%
  filter(sumdmg > 0)
topsumdmg <- top_n(hddf, 10, sumdmg)
topmeandmg <- top_n(hddf, 10, meandmg)
topdmgtbl <- unique(rbind(topsumdmg, topmeandmg))
kable(topdmgtbl, col.names = c("Storm Type", "Total Life Harm", "Mean Life Harm", "Std. Dev. of Life Harm"))

Storm Type	Total Life Harm	Mean Life Harm	Std. Dev. of Life Harm
EXCESSIVE HEAT	8428	5.02265	27.4492771
FLASH FLOOD	2755	0.05076	1.3079760
FLOOD	7259	0.28661	11.0441607
HEAT	3037	3.95958	29.6934237
HIGH WIND	1385	0.06852	0.8555176
ICE STORM	2064	1.02891	35.0936781
LIGHTNING	6046	0.38375	1.3187508
THUNDERSTORM WIND	1621	0.01963	0.4743736
TORNADO	96979	1.59894	18.2551559
WINTER STORM	1527	0.13356	2.2496219
AVALANCHE	394	1.02073	1.2911630
DUST STORM	462	1.08197	4.4784647
HURRICANE/TYPHOON	1339	15.21591	90.6595816
MARINE STRONG WIND	36	0.75000	1.5228752
RIP CURRENT	600	1.27660	1.2129741
TSUNAMI	162	8.10000	35.9896184

Clearly, over the years, tornados caused maximum damage to human health in terms of injuries or fatalities. Now we plot this data: to keep the damage values comparable between storm types, we use logarithm on Mean Life Harm.

hdplot <- ggplot(data = topdmgtbl, 
                 aes(x = EVTYPE, 
                     y = log(meandmg), 
                     fill = sumdmg))
hdplot + labs(title = "Life Harm from Storms") + 
    geom_bar(stat = "identity") +
    xlab("Storm Type") + 
    theme(axis.text.x = element_text(angle=45, size=8)) +
    ylab("log of Mean Life Harm") + 
    scale_fill_continuous(name = "Total Life Harm") +
    geom_errorbar(aes(ymin = log(meandmg)-log(sddmg), ymax = log(meandmg)+log(sddmg)), 
                  width=.2, position=position_dodge(.9))

Finding types of storms causing maximum economic loss

We group the data set based on event types, to find aggregate values in total economic loss in properties and agricultural area, considering the variables PROPDMG and CROPDMG Then we take a look at top ten types of storms causing maximum economic loss

eldf <- stormdf %>%
  mutate(ecoloss = CROPDMG + PROPDMG) %>%
  group_by(EVTYPE) %>%
  summarise(sumloss = sum(ecoloss), meanloss = round(mean(ecoloss), 5), sdloss = sd(ecoloss)) %>%
  filter(sumloss > 0)
topsumloss <- top_n(eldf, 10, sumloss)
topmeanloss <- top_n(eldf, 10, meanloss)
toplosstbl <- unique(rbind(topsumloss, topmeanloss))
kable(toplosstbl, col.names = c("Storm Type", "Total Economic Loss", "Mean Economic Loss", "Std. Dev. of Economic Loss"))

Storm Type	Total Economic Loss	Mean Economic Loss	Std. Dev. of Economic Loss
EXCESSIVE HEAT	492402494	293446.063	12020499.1
FLASH FLOOD	1421496300	26189.180	1173704.4
FLOOD	5662136488	223561.278	5697337.5
FROST/FREEZE	1094193134	814738.000	10069811.0
HEAVY RAIN	733410923	62460.477	2679109.4
HIGH WIND	638588583	31591.401	1488544.7
HURRICANE/TYPHOON	2607877598	29634972.710	169005303.3
ICE STORM	5022115189	2503546.954	111636119.6
THUNDERSTORM WIND	414909841	5025.312	252139.4
TROPICAL STORM	678351899	983118.694	9511706.0
BLIZZARD	112060172	41213.745	1365693.7
HEAT	401462163	523418.726	14443112.8
WILDFIRE	295477164	107018.169	2483440.9

Clearly, over the years, floods caused maximum total economic loss. Though ice storms and hurricanes/typhoons also caused very high total econmic loss, but variation in such losses were also high as intensities of storms can vary. Now we plot this data: to keep the economic damanges comparable between storm types, we use logarithm on Mean Economic Loss.

elplot <- ggplot(data = toplosstbl, 
                 aes(x = EVTYPE, 
                     y = log(meanloss), 
                     fill = sumloss))
elplot + labs(title = "Economic Loss from Storms") + 
    geom_bar(stat = "identity") +
    xlab("Storm Type") + 
    theme(axis.text.x = element_text(angle=45, size=8)) +
    ylab("log of Mean Economic Loss") + 
    scale_fill_continuous(name = "Total Economic Loss") +
    geom_errorbar(aes(ymin = log(meanloss)-log(sdloss), ymax = log(meanloss)+log(sdloss)), 
                  width=.2, position=position_dodge(.9))