Goal of this document is analysis of Storm Data collected by National Oceanic and Atmospheric Administration (NOAA), USA, focusing on identification of type of storms that cause maximum harm to human health and those that cause maximum economic losses. Storm Data is downloaded from Storm data web location. Data dictionary is available at Storm data documentation and FAQ available at Storm data FAQ site.
Results obtained from this analysis shows that, over the years, tornadoes caused maximum life loss or injuries to humans. However, as far as maximum economic loss is concerned, floods were most damaging.
Storm data is downloaded from Storm data web location and unzipped as a .csv file. Though handling this heavy data set could be easier by using cacher package, due to time contstraints, pursued simple loading from .csv file.
library(knitr, warn.conflicts = FALSE, quietly = TRUE)
library(ggplot2, warn.conflicts = FALSE, quietly = TRUE)
library(dplyr, warn.conflicts = FALSE, quietly = TRUE)
library(stringr, warn.conflicts = FALSE, quietly = TRUE)
# Download data only if it does not already exist locally.
if (!file.exists("StormData.csv"))
{
download.file(
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "StormData.csv")
}
# Load data.
stormdf <- read.csv("StormData.csv",
comment.char = "",
header = TRUE,
stringsAsFactors = FALSE)
dim(stormdf)
## [1] 902297 37
Out of 37 columns of the dataset, our focus is on EVTYPE (type of storm), INJURIES and FATALITIES (harm to human health), PROPDMG and PROPDMGEXP (loss of property), CROPDMG and CROPDMGEXP (loss of agricultural produce).
Raw data here shows 985 unique types of events, however Storm data documentation shows that there are 48 unique types of events. As part of the cleaning and preprocessing of storm data, we trim and capitalize event type and if it does not match with any of these 48 types, we eliminate the record from the data set to be analyzed.
stdEvTypes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
"Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
"Drought", "Dust Devil", "Dust Storm", "Excessive Heat",
"Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Freezing Fog",
"Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain",
"Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon",
"Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
"Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
"Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
"Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
We also need to reformulate values of PROPDMG and CROPDMG columns based on the values of CROPDMGEXP and PROPDMGEXP columns, respectively. First we check the unique values of these two fields:
unique(stormdf$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
unique(stormdf$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
To correct the values of CROPDMG and PROPDMG, we’ll multiply the values by a multiplier based on corresponding values of CROPDMGEXP and PROPDMGEXP, respectively. As per standard notations, the multipliers will be 1000 for K, 1000000 for M, 1000000000 for B, 10^n for single digit numbers, 1 for ‘+’ and 0 for other values " " (empty or null), ‘-’ and ‘?’. Before making these modifications on PROPDMG and CROPTDMG, we’ll capitalize (to capital letters) values of CROPDMGEXP and PROPDMGEXP.
zeroMultGroup <- c("", "-", "?")
oneMultGroup <- c("+")
powerMultGroup <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
thousandMultGroup <- c("K")
mMultGroup <- c("M")
bMultGroup <- c("B")
stormdf <- stormdf %>%
mutate(EVTYPE = str_to_upper(str_trim(EVTYPE)),
CROPDMGEXP = str_to_upper(str_trim(CROPDMGEXP)),
PROPDMGEXP = str_to_upper(str_trim(PROPDMGEXP))) %>%
filter(EVTYPE %in% str_to_upper(str_trim(stdEvTypes))) %>%
mutate(PROPDMG = PROPDMG *
ifelse(PROPDMGEXP %in% zeroMultGroup, 0,
ifelse(PROPDMGEXP %in% oneMultGroup, 1,
ifelse(PROPDMGEXP %in% powerMultGroup, 10^as.integer(PROPDMGEXP),
ifelse(PROPDMGEXP %in% thousandMultGroup, 1000,
ifelse(PROPDMGEXP %in% mMultGroup, 1000000,
ifelse(PROPDMGEXP %in% bMultGroup, 1000000000, 0))))))) %>%
mutate(PROPDMG = CROPDMG *
ifelse(CROPDMGEXP %in% zeroMultGroup, 0,
ifelse(CROPDMGEXP %in% oneMultGroup, 1,
ifelse(CROPDMGEXP %in% powerMultGroup, 10^as.integer(PROPDMGEXP),
ifelse(CROPDMGEXP %in% thousandMultGroup, 1000,
ifelse(CROPDMGEXP %in% mMultGroup, 1000000,
ifelse(CROPDMGEXP %in% bMultGroup, 1000000000, 0)))))))
## Warning in ifelse(PROPDMGEXP %in% powerMultGroup,
## 10^as.integer(PROPDMGEXP), : NAs introduced by coercion
## Warning in ifelse(CROPDMGEXP %in% powerMultGroup,
## 10^as.integer(PROPDMGEXP), : NAs introduced by coercion
dim(stormdf)
## [1] 635295 37
After this cleanup, dataset has 46 unique values of event types.
We group the data set based on event types, to find aggregate values in total damage on human health, considering the variables FATALITIES and INJURIES. Then we take a look at top ten types of storms causing maximum damages.
hddf <- stormdf %>%
mutate(hhdmg = FATALITIES + INJURIES) %>%
group_by(EVTYPE) %>%
summarise(sumdmg = sum(hhdmg), meandmg = round(mean(hhdmg), 5), sddmg = sd(hhdmg)) %>%
filter(sumdmg > 0)
topsumdmg <- top_n(hddf, 10, sumdmg)
topmeandmg <- top_n(hddf, 10, meandmg)
topdmgtbl <- unique(rbind(topsumdmg, topmeandmg))
kable(topdmgtbl, col.names = c("Storm Type", "Total Life Harm", "Mean Life Harm", "Std. Dev. of Life Harm"))
| Storm Type | Total Life Harm | Mean Life Harm | Std. Dev. of Life Harm |
|---|---|---|---|
| EXCESSIVE HEAT | 8428 | 5.02265 | 27.4492771 |
| FLASH FLOOD | 2755 | 0.05076 | 1.3079760 |
| FLOOD | 7259 | 0.28661 | 11.0441607 |
| HEAT | 3037 | 3.95958 | 29.6934237 |
| HIGH WIND | 1385 | 0.06852 | 0.8555176 |
| ICE STORM | 2064 | 1.02891 | 35.0936781 |
| LIGHTNING | 6046 | 0.38375 | 1.3187508 |
| THUNDERSTORM WIND | 1621 | 0.01963 | 0.4743736 |
| TORNADO | 96979 | 1.59894 | 18.2551559 |
| WINTER STORM | 1527 | 0.13356 | 2.2496219 |
| AVALANCHE | 394 | 1.02073 | 1.2911630 |
| DUST STORM | 462 | 1.08197 | 4.4784647 |
| HURRICANE/TYPHOON | 1339 | 15.21591 | 90.6595816 |
| MARINE STRONG WIND | 36 | 0.75000 | 1.5228752 |
| RIP CURRENT | 600 | 1.27660 | 1.2129741 |
| TSUNAMI | 162 | 8.10000 | 35.9896184 |
Clearly, over the years, tornados caused maximum damage to human health in terms of injuries or fatalities. Now we plot this data: to keep the damage values comparable between storm types, we use logarithm on Mean Life Harm.
hdplot <- ggplot(data = topdmgtbl,
aes(x = EVTYPE,
y = log(meandmg),
fill = sumdmg))
hdplot + labs(title = "Life Harm from Storms") +
geom_bar(stat = "identity") +
xlab("Storm Type") +
theme(axis.text.x = element_text(angle=45, size=8)) +
ylab("log of Mean Life Harm") +
scale_fill_continuous(name = "Total Life Harm") +
geom_errorbar(aes(ymin = log(meandmg)-log(sddmg), ymax = log(meandmg)+log(sddmg)),
width=.2, position=position_dodge(.9))
We group the data set based on event types, to find aggregate values in total economic loss in properties and agricultural area, considering the variables PROPDMG and CROPDMG Then we take a look at top ten types of storms causing maximum economic loss
eldf <- stormdf %>%
mutate(ecoloss = CROPDMG + PROPDMG) %>%
group_by(EVTYPE) %>%
summarise(sumloss = sum(ecoloss), meanloss = round(mean(ecoloss), 5), sdloss = sd(ecoloss)) %>%
filter(sumloss > 0)
topsumloss <- top_n(eldf, 10, sumloss)
topmeanloss <- top_n(eldf, 10, meanloss)
toplosstbl <- unique(rbind(topsumloss, topmeanloss))
kable(toplosstbl, col.names = c("Storm Type", "Total Economic Loss", "Mean Economic Loss", "Std. Dev. of Economic Loss"))
| Storm Type | Total Economic Loss | Mean Economic Loss | Std. Dev. of Economic Loss |
|---|---|---|---|
| EXCESSIVE HEAT | 492402494 | 293446.063 | 12020499.1 |
| FLASH FLOOD | 1421496300 | 26189.180 | 1173704.4 |
| FLOOD | 5662136488 | 223561.278 | 5697337.5 |
| FROST/FREEZE | 1094193134 | 814738.000 | 10069811.0 |
| HEAVY RAIN | 733410923 | 62460.477 | 2679109.4 |
| HIGH WIND | 638588583 | 31591.401 | 1488544.7 |
| HURRICANE/TYPHOON | 2607877598 | 29634972.710 | 169005303.3 |
| ICE STORM | 5022115189 | 2503546.954 | 111636119.6 |
| THUNDERSTORM WIND | 414909841 | 5025.312 | 252139.4 |
| TROPICAL STORM | 678351899 | 983118.694 | 9511706.0 |
| BLIZZARD | 112060172 | 41213.745 | 1365693.7 |
| HEAT | 401462163 | 523418.726 | 14443112.8 |
| WILDFIRE | 295477164 | 107018.169 | 2483440.9 |
Clearly, over the years, floods caused maximum total economic loss. Though ice storms and hurricanes/typhoons also caused very high total econmic loss, but variation in such losses were also high as intensities of storms can vary. Now we plot this data: to keep the economic damanges comparable between storm types, we use logarithm on Mean Economic Loss.
elplot <- ggplot(data = toplosstbl,
aes(x = EVTYPE,
y = log(meanloss),
fill = sumloss))
elplot + labs(title = "Economic Loss from Storms") +
geom_bar(stat = "identity") +
xlab("Storm Type") +
theme(axis.text.x = element_text(angle=45, size=8)) +
ylab("log of Mean Economic Loss") +
scale_fill_continuous(name = "Total Economic Loss") +
geom_errorbar(aes(ymin = log(meanloss)-log(sdloss), ymax = log(meanloss)+log(sdloss)),
width=.2, position=position_dodge(.9))