Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation / National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Your data analysis must address the following questions:
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
The data are downloaded from the internet and unzipped before being read into a data frame in R. The data are first reduced to the columns needed to answer the two questions about harm to population health and economic damage; this requires some steps to “clean” the data. In a second phase, the data are further reduced to more recent events so that the results reflect current conditions. The questions are answered mainly by 6 plots in 2 figures, together with the statements in this markdown document, which is rendered to an HTML file.
Download the data from the internet if necessary. Then check whether the CSV file exists; if not, decompress the archive with bunzip2 (requires the package R.utils). Finally, read the CSV file into the data frame DF. This step is cached to save time.
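library(R.utils)  # provides bunzip2()
# Download the compressed file only if it is not already present
if (!file.exists('repdata_data_StormData.csv.bz2')) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url, destfile = "repdata_data_StormData.csv.bz2", mode = "wb")
}
# Decompress only if the CSV file does not exist yet; keep the archive
if (!file.exists('StormData.csv')) {
  bunzip2('repdata_data_StormData.csv.bz2', destname = "StormData.csv", remove = FALSE)
}
DF <- read.csv('StormData.csv', header = TRUE)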
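As a side note to the decompression step, base R can usually read a bzip2-compressed CSV directly, because read.csv() opens the path through file(), which detects the compression when reading. The following is a minimal sketch on hypothetical toy data (the temporary file and the data frame toy are illustrative and not part of the original analysis):

```r
# Write a small, hypothetical data set to a temporary bzip2-compressed CSV.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0)),
          con, row.names = FALSE)
close(con)

# Read the compressed file directly, without an explicit bunzip2() step.
toy <- read.csv(tmp, header = TRUE)
print(toy)

# Remove the temporary file again.
unlink(tmp)
```

Whether this is preferable to the explicit unzip step depends on how often the file is re-read; the uncompressed CSV is typically faster to parse repeatedly.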
Add a YEAR variable, extracted from BGN_DATE, to facilitate timeline analysis.
DF <- data.frame(YEAR=format(as.Date(DF$BGN_DATE,format="%m/%d/%Y"),"%Y"),DF)
Variables considered for the first question, events most harmful to population health:
We need the columns “YEAR”, “EVTYPE”, “FATALITIES”, “INJURIES”.
Variables considered for the second question, events with the greatest economic consequences:
We need the columns “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP”.
DF_reduced<- DF[c("YEAR","EVTYPE","FATALITIES","INJURIES","PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
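# Convert a damage exponent code (K, M, B) into its numeric multiplier;
# any other code is treated as a multiplier of 1
clean <- function(x){
  y <- rep(1, length(x))
  y[x %in% c("B","b")] <- 1000000000
  y[x %in% c("M","m")] <- 1000000
  y[x %in% c("K","k")] <- 1000
  return (y)
}
# Convert to numeric and replace missing values with 0
clear <- function(x){
  y <- as.numeric(x)
  y[is.na(y)] <- 0
  return (y)
}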
DF_reduced$ECONOMIC_PROP <- clear(DF_reduced$PROPDMG)*clean(DF_reduced$PROPDMGEXP)
DF_reduced$ECONOMIC_CROP <- clear(DF_reduced$CROPDMG)*clean(DF_reduced$CROPDMGEXP)
DF_reduced <- DF_reduced[c("YEAR","EVTYPE","FATALITIES", "INJURIES","ECONOMIC_PROP","ECONOMIC_CROP")]
DF_reduced <- DF_reduced[with(DF_reduced, order(YEAR)), ]
DF_reduced$ECONOMIC_LOSS <- DF_reduced$ECONOMIC_PROP+DF_reduced$ECONOMIC_CROP
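The exponent conversion above can also be expressed as a named lookup vector, which makes it easier to add further codes later. The sketch below is only an illustration: the helper name exp_multiplier is hypothetical, and treating "H" as "hundred" is an assumption that is not part of the original clean() function.

```r
# Hypothetical helper (not part of the original analysis): map exponent
# codes to multipliers through a named lookup vector.
exp_multiplier <- function(x) {
  # "H" = hundred is an assumption; K, M and B match clean() above.
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  m <- unname(lookup[toupper(as.character(x))])
  m[is.na(m)] <- 1  # unknown or empty codes fall back to 1, as in clean()
  m
}

# Example call on a few codes, including an empty and an unknown one.
result <- exp_multiplier(c("K", "m", "B", "", "h", "?"))
print(result)
```

Apart from the assumed "H" code, this returns the same multipliers as clean() for the codes K, M and B in either case.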
As noted above, the events in the database start in 1950 and end in November 2011, and the earlier years contain fewer recorded events, most likely due to a lack of good records. The number of reported events per year illustrates this:
plot(table(DF_reduced$YEAR), ylab="frequency",main="Number of reported Events per Year")
Figure 1
The plot suggests dropping the observations before 1995 if we want to focus on more up-to-date and more completely recorded events. We therefore create a subset of the data frame DF_reduced:
Selection <- 1995:2011
Selection <- as.character(Selection)
DF_reduced <- DF_reduced[DF_reduced$YEAR %in% Selection,]
DF_reduced$FandI <- DF_reduced$FATALITIES+DF_reduced$INJURIES
# No 1: total number of fatalities by event type, sorted in decreasing order
total_fatalities <- by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, sum)
FATALITIES <- sort(total_fatalities, decreasing=T)
# No 2: mean number of fatalities per event, by event type
mean_fatalities <- by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, mean)
FATALITIESmean <- sort(mean_fatalities, decreasing=T)
# No 3: total number of injuries by event type
total_injuries <- by(DF_reduced$INJURIES, DF_reduced$EVTYPE, sum)
INJURIES <- sort(total_injuries, decreasing=T)
# No 4: mean number of injuries per event, by event type
mean_injuries <- by(DF_reduced$INJURIES, DF_reduced$EVTYPE, mean)
INJURIESmean <- sort(mean_injuries, decreasing=T)
# Set up plots
layout(matrix(c(1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4), 8, 2, byrow=T))
par(mar=c(4, 8, 4, 2))
# no 1 barplot of total fatalities by event type
barplot(sort(by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, sum), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Total Number of Fatalities by Event Type")
mtext("Total Number of Fatalities (top 20)", side=3, line=1, cex=1, font=2)
# no 2 barplot of top mean fatalities per event
barplot(sort(by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, mean), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Mean Number of Fatalities by Event Type")
mtext("Mean Number of Fatalities (top 20)", side=3, line=1, cex=1, font=2)
# no 3 barplot of total injuries by event type
barplot(sort(by(DF_reduced$INJURIES, DF_reduced$EVTYPE, sum), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Total Number of Injuries by Event Type")
mtext("Total Number of Injuries (top 20)", side=3, line=1, cex=1, font=2)
# no 4 barplot of top mean injuries per event
barplot(sort(by(DF_reduced$INJURIES, DF_reduced$EVTYPE, mean), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Mean Number of Injuries by Event Type")
mtext("Mean Number of Injuries (top 20)", side=3, line=1, cex=1, font=2)
Figure 2
There are differences between the total numbers and the mean values. Some event types occur less often but are worse in terms of the mean per event; others occur more often but are less severe per event. Depending on the goal of the prevention measures, one needs to focus on one or the other (or even on both).
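To make the difference between totals and means concrete, it can help to tabulate the number of events next to both statistics. The following is a minimal sketch on hypothetical toy data (the data frame toy and the table summary_tbl are illustrative names, and the numbers are invented, not taken from the NOAA database):

```r
# Hypothetical toy data standing in for DF_reduced.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "TORNADO", "TSUNAMI"),
  FATALITIES = c(10, 0, 15, 15, 30)
)

# Count, total and mean per event type; tapply() returns the groups
# in the same (sorted) order for all three statistics.
n     <- tapply(toy$FATALITIES, toy$EVTYPE, length)
total <- tapply(toy$FATALITIES, toy$EVTYPE, sum)
avg   <- tapply(toy$FATALITIES, toy$EVTYPE, mean)

summary_tbl <- data.frame(EVTYPE = names(total),
                          n      = as.vector(n),
                          total  = as.vector(total),
                          mean   = as.vector(avg))
print(summary_tbl)
```

In this toy example the frequent event type has the larger total, while the single rare event has the larger mean; a mean based on very few events should be read with that small count in mind.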
It also seems that the event types could be grouped: several distinct event types appear to be very close in definition.
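One possible way to approach such a grouping is to normalize the labels and then collapse them with a few pattern rules. The sketch below is only an illustration: the function name group_evtype, the patterns and the group names are hypothetical choices, not an official NOAA classification and not part of the original analysis.

```r
# Hypothetical grouping rules; patterns and group names are illustrative only.
group_evtype <- function(x) {
  # Normalize case and surrounding whitespace before matching.
  x <- toupper(trimws(as.character(x)))
  out <- x
  out[grepl("TSTM|THUNDERSTORM", x)] <- "THUNDERSTORM"
  out[grepl("HEAT", x)]              <- "HEAT"
  out[grepl("FLOOD", x)]             <- "FLOOD"
  out  # labels without a matching rule are kept in normalized form
}

grouped <- group_evtype(c("TSTM WIND", " Thunderstorm Wind",
                          "EXCESSIVE HEAT", "Flash Flood", "HAIL"))
print(grouped)
```

Applied to the EVTYPE column before aggregation, rules of this kind would merge near-duplicate labels, at the cost of decisions (for example, whether flash floods and river floods belong together) that should be checked against the database documentation.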
# Get total economic loss (by event type)
total_Loss <- by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, sum)
LOSS <- sort(total_Loss, decreasing=T)
# Get mean economic loss per event (by event type)
mean_Loss <- by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, mean)
LOSSmean <- sort(mean_Loss, decreasing=T)
# Set up plots
layout(matrix(c(1,2,1,2,1,2,1,2), 4, 2, byrow=T))
par(mar=c(4, 10, 4, 2))
# No 1 barplot of top total loss by event type
barplot(sort(by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, sum), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Total Loss of Events by Event Type")
mtext("Total Loss by Event (top 20)", side=3, line=1, cex=1, font=2)
# No 2 barplot of top mean losses by event type
barplot(sort(by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, mean), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Mean Loss of Events by Event Type")
mtext("Mean Loss by Event (top 20)", side=3, line=1, cex=1, font=2)
Figure 3
There are differences between the total costs and the mean values. Some event types occur less often but are more expensive in terms of the mean per event; others occur more often but are less costly per event. Depending on the goal of the prevention measures, one needs to focus on one or the other (or even on both).
It also seems that the event types could be grouped: several distinct event types appear to be very close in definition.
This is a first attempt to get to know the provided data better. Many questions arose while working on this assignment that could not be answered in the time available.