Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
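One way to check this in the data is to count the recorded events per year. The sketch below is illustrative only: it assumes a BGN_DATE column holding strings such as "4/18/1950 0:00:00" (both the column name and the date format are assumptions not stated above), and the toy rows are made up.

```r
# Made-up rows standing in for the storm data; BGN_DATE is an assumed column.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
               "11/28/2011 0:00:00", "11/30/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse the date part (the trailing time is ignored) and extract the year.
toy$YEAR <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Count how many events were recorded in each year.
events.per.year <- table(toy$YEAR)
```

Applied to the full data set, a table like events.per.year would show how sparse the early years are compared with recent ones.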
The analysis addresses the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The analysis was conducted in R, and the markdown file was written with knitr.
To analyze the data, we carried out the following preprocessing steps:
-Set the working directory to conduct the analysis
setwd("C:/Users/Eirr/Desktop/stormdata")
-Load the libraries
library(ggplot2)
library(reshape2)
library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(knitr)
-Load the bz2-compressed data file and convert lower-case exponent codes such as k/K and m/M in PROPDMGEXP and CROPDMGEXP to upper case so that they are treated as the same value. EVTYPE is left as a character column so that an "Other" row can be appended to the summaries later.
stormdata <- read.csv("StormData.csv.bz2", as.is = TRUE)
# within() returns the modified data frame; with() would discard the assignments
stormdata <- within(stormdata, {
    PROPDMGEXP <- toupper(PROPDMGEXP)
    PROPDMGEXP[PROPDMGEXP == ""] <- "0"
    CROPDMGEXP <- toupper(CROPDMGEXP)
    CROPDMGEXP[CROPDMGEXP == ""] <- "0"
})
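A small sketch of the exponent normalization on made-up data may help here. It also illustrates a point worth keeping in mind for this step: with() evaluates its block in a temporary environment, so changes persist only when they are made through within() or direct assignment. The data frame toy and the variables unchanged and normalized are hypothetical names used for illustration.

```r
# Made-up exponent codes standing in for PROPDMGEXP.
toy <- data.frame(PROPDMGEXP = c("k", "K", "m", "", "B"),
                  stringsAsFactors = FALSE)

# with() evaluates the block, but the assignment is lost afterwards.
with(toy, {
  PROPDMGEXP <- toupper(PROPDMGEXP)
})
unchanged <- toy$PROPDMGEXP

# within() returns a modified copy that can be assigned back.
toy <- within(toy, {
  PROPDMGEXP <- toupper(PROPDMGEXP)
  PROPDMGEXP[PROPDMGEXP == ""] <- "0"
})
normalized <- toy$PROPDMGEXP
```

After the within() call, "k" and "K" share one code and empty entries carry the explicit exponent "0".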
-Subset the data to get the columns of the interest to run the analysis
subsetdata <- subset(stormdata, select = c("EVTYPE", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "FATALITIES"))
To answer the first question, we considered the FATALITIES and INJURIES variables and plotted the 20 event types with the most fatalities across the United States, with the fill color indicating the range of total injuries.
healthdamage.count <- 20
healthdamage <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, subsetdata,
sum, na.rm = TRUE)
healthdamage.top <- healthdamage[order(-healthdamage$FATALITIES)[1:healthdamage.count],
]
healthdamage.top$INJURIES <- cut2(healthdamage.top$INJURIES, g = 7)
# Bars are ordered by fatalities; the fill shows the binned injury totals
ggplot(healthdamage.top, aes(x = reorder(EVTYPE, -FATALITIES), y = FATALITIES,
    fill = INJURIES)) + geom_bar(stat = "identity") + scale_fill_brewer(palette = 11) +
    guides(fill = guide_legend(reverse = TRUE)) + theme(axis.text.x = element_text(angle = 90,
    hjust = 1)) + xlab(NULL) + ggtitle(paste("Top", healthdamage.count,
    "most harmful weather events in the United States"))
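To make the aggregation and top-N selection concrete, here is a minimal sketch on made-up numbers. The toy data frame and the names totals and top2 are hypothetical; only the aggregate() and order() pattern mirrors the step above.

```r
# Made-up events: two tornadoes, two floods, one heat event.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(5, 1, 3, 4, 0),
  INJURIES   = c(20, 2, 10, 0, 1),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, toy, sum)

# Keep the two event types with the most fatalities, largest first.
top2 <- totals[order(-totals$FATALITIES)[1:2], ]
```

Here top2 holds TORNADO (8 fatalities, 30 injuries) followed by HEAT (4 fatalities, 0 injuries).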
The plot shows that TORNADO causes the most fatalities of any event type.
A second plot shows the share of injuries caused by each type of weather event:
top.inj.count <- 10
top.inj.order <- order(-healthdamage$INJURIES)
data.health.inj <- healthdamage[top.inj.order[1:top.inj.count], ]
# Parenthesize (top.inj.count + 1) because `:` binds tighter than `+`
other.inj <- sum(healthdamage$INJURIES[top.inj.order[(top.inj.count + 1):nrow(healthdamage)]],
na.rm = TRUE)
data.health.inj <- rbind(data.health.inj, list("Other", 0, other.inj))
ggplot(data.health.inj, aes(x = factor(1), y = INJURIES, fill = reorder(EVTYPE,
-INJURIES))) + geom_bar(stat = "identity") + coord_polar(theta = "y") +
theme(legend.title = element_blank()) + xlab(NULL) + ylab(NULL) + ggtitle("Injuries from weather events in the United States") +
scale_fill_brewer(palette = "Spectral")
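The "Other" slice is the sum of everything outside the top entries. The sketch below, on made-up numbers, shows one way to compute that remainder and why the index range needs care: in R, `:` binds tighter than `+`. All names here (injuries, n, ord, wrong_idx, right_idx, other) are hypothetical.

```r
# Made-up injury totals for five event types.
injuries <- c(A = 50, B = 30, C = 10, D = 5, E = 5)
n <- 2                       # number of top entries shown individually
ord <- order(-injuries)      # positions sorted by decreasing injuries

# n + 1:length(ord) is parsed as n + (1:length(ord)) and runs past the end.
wrong_idx <- n + 1:length(ord)

# Parentheses give the intended range of remaining positions.
right_idx <- (n + 1):length(ord)

# Sum of everything outside the top n.
other <- sum(injuries[ord[right_idx]])
```

With the unparenthesized index, the out-of-range positions produce NA values, which only go unnoticed when na.rm = TRUE is passed to sum().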
The two plots above show that TORNADO is the event type most harmful to population health in the United States.
To answer the second question, we considered the PROPDMG and CROPDMG variables.
The damage amounts are recorded with different unit codes, so they must be converted to a common unit before the total damage can be computed. We did this with the following function:
decode.units <- function(d) {
switch(d, H = 100, K = 1000, M = 1e+06, B = 1e+09, `0` = 1, `1` = 10, `2` = 100,
`3` = 1000, `4` = 10000, `5` = 1e+05, `6` = 1e+06, `7` = 1e+07, `8` = 1e+08,
`9` = 1e+09, 0)
}
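Because switch() handles one value at a time, the function has to be applied element by element. As an alternative sketch (not the method used in this analysis), the same multipliers can be stored in a named vector and looked up for a whole column at once. The names unit.multiplier and decode.units.vec are hypothetical.

```r
# Named lookup table with the same multipliers as decode.units.
unit.multiplier <- c(H = 100, K = 1e3, M = 1e6, B = 1e9,
                     setNames(10^(0:9), as.character(0:9)))

# Vectorized decoder: unknown codes map to 0, like the switch default.
decode.units.vec <- function(d) {
  m <- unname(unit.multiplier[d])
  m[is.na(m)] <- 0
  m
}

# Example: 2.5 with code "K" and 3 with code "M".
damage <- c(2.5, 3) * decode.units.vec(c("K", "M"))
```

In this example damage is 2500 and 3000000, and an unrecognized code such as "?" contributes a multiplier of 0.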
Finally, we computed and plotted the total economic damage (property plus crop damage) for each type of weather event:
top.damage.count <- 25
stormdata$DAMAGE <- stormdata$PROPDMG * sapply(stormdata$PROPDMGEXP, decode.units) +
stormdata$CROPDMG * sapply(stormdata$CROPDMGEXP, decode.units)
data.damage <- aggregate(DAMAGE ~ EVTYPE, stormdata, sum, na.rm = T)
data.damage.top <- data.damage[order(-data.damage$DAMAGE)[1:top.damage.count],
]
ggplot(data.damage.top, aes(x = reorder(EVTYPE, -DAMAGE), y = DAMAGE)) + geom_bar(stat = "identity",
fill = "grey") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab(NULL) + ylab("Damage, $") + ggtitle(paste("Top", top.damage.count,
"events which have the greatest economic consequences in the United States"))
The plot shows that FLOOD is the weather event with the greatest economic impact across the United States.