Introduction (from the course site)

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data (47 Mb)

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation / National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

Questions

Your data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.

Synopsis

The data are downloaded from the internet and unzipped before being read into a data frame in R. The data are first reduced to the columns needed to answer the two questions about harm to population health and economic damage. This requires some steps to “clean” the data. In a second phase the data are further reduced to focus on more recent events, so that the results reflect current conditions. The questions are answered mainly by 6 plots in 2 figures and by statements in this markdown document, which is rendered to an HTML file.

Data Processing

Downloading the data from the Internet

Download the data from the internet if the compressed file is not yet present. Then check whether the CSV file exists. If not, unzip the download with bunzip2 (requires the package R.utils). Finally, read the CSV file into the data frame DF. This step is cached to save time.

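library(R.utils)  # provides bunzip2()
# Download the compressed file only if it is not already present
if(!file.exists('repdata_data_StormData.csv.bz2')){
        url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        # mode="wb" keeps the binary file intact on Windows
        download.file(url, destfile="repdata_data_StormData.csv.bz2", mode="wb")
        }

# Decompress only if the CSV file does not exist yet
if(!file.exists('StormData.csv')){
        bunzip2('repdata_data_StormData.csv.bz2', destname="StormData.csv", remove=FALSE)
        }

DF <- read.csv('StormData.csv', header=TRUE)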
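As a side note, the decompression step can probably be skipped. The following is a minimal sketch, not the method used in this report: it assumes that read.csv() is given the path of a bzip2-compressed file, which R's file connections detect and decompress while reading. The toy file and its contents are made up for illustration.

```r
# Write a tiny bzip2-compressed CSV (hypothetical stand-in for the real download)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "FLOOD,1"), con)
close(con)

# read.csv() opens the path via file(), which detects the bzip2 compression
toy <- read.csv(tmp, header = TRUE)
print(toy)
```

For the real file this would amount to calling read.csv() on 'repdata_data_StormData.csv.bz2' directly, at the cost of decompressing on every read.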

Add a variable YEAR to facilitate timeline analysis.

DF <- data.frame(YEAR=format(as.Date(DF$BGN_DATE,format="%m/%d/%Y"),"%Y"),DF)
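To see what the conversion above does, here is a small sketch on made-up values. It assumes that BGN_DATE entries are strings of the form month/day/year followed by a time; as.Date() only reads as much of the string as the format needs, so the trailing time is ignored.

```r
# Made-up sample values in the assumed BGN_DATE layout
bgn <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse the date part, then keep only the four-digit year as text
yr <- format(as.Date(bgn, format = "%m/%d/%Y"), "%Y")
print(yr)
```

The result is a character vector, which is why the later year selection compares against character values.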

Results

Events Most Harmful to Humans

The following variables are considered for the analysis of events most harmful to population health:

  1. FATALITIES for fatalities due to the recorded event
  2. INJURIES for injuries due to the recorded event

We will need the columns “YEAR”,“EVTYPE”,“FATALITIES”,“INJURIES” to go further.

Greatest Economic Consequences

The following variables are considered for the analysis of events with the greatest economic consequences:

  1. PROPDMG for numeric extent of damage to property
  2. PROPDMGEXP for unit of numeric damage to property
  3. CROPDMG for numeric extent of damage to crop
  4. CROPDMGEXP for unit of numeric damage to crop in dollar terms
  5. Unknown symbols in PROPDMGEXP and CROPDMGEXP are treated as an exponent of 0, i.e. a multiplier of 1.

We will need the columns “PROPDMG”, “PROPDMGEXP”, “CROPDMG”, “CROPDMGEXP” to go further.

Reduce the data frame DF to important columns

DF_reduced<- DF[c("YEAR","EVTYPE","FATALITIES","INJURIES","PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

Clean data frame, replace codes and calculate economic loss

# Translate a unit code (K, M, B; upper or lower case) into its multiplier;
# any other code gets the multiplier 1
clean <- function(x){
        y <- rep(1, length(x))
        y[x %in% c("B","b")] <- 1000000000
        y[x %in% c("M","m")] <- 1000000
        y[x %in% c("K","k")] <- 1000
        return (y)
        }

# Convert to numeric and replace missing values by 0
clear <- function(x){
        y <- as.numeric(x)
        y[is.na(y)] <- 0
        return (y)
        }

DF_reduced$ECONOMIC_PROP <- clear(DF_reduced$PROPDMG)*clean(DF_reduced$PROPDMGEXP)
DF_reduced$ECONOMIC_CROP <- clear(DF_reduced$CROPDMG)*clean(DF_reduced$CROPDMGEXP)

DF_reduced <- DF_reduced[c("YEAR","EVTYPE","FATALITIES", "INJURIES","ECONOMIC_PROP","ECONOMIC_CROP")]

DF_reduced <- DF_reduced[with(DF_reduced, order(YEAR)), ]
DF_reduced$ECONOMIC_LOSS <- DF_reduced$ECONOMIC_PROP+DF_reduced$ECONOMIC_CROP
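The unit conversion can be checked on a few made-up values. The sketch below is an illustration only: it uses a named lookup vector (the name mult is hypothetical) instead of the clean() function, but follows the same rule that K, M and B stand for thousands, millions and billions and that any other code gets the multiplier 1.

```r
# Named lookup: recognized unit codes and their multipliers
mult <- c(K = 1e3, M = 1e6, B = 1e9)

dmg  <- c(25, 2.5, 10, 5)        # made-up damage values
code <- c("K", "m", "B", "?")    # made-up unit codes, one of them unknown

m <- unname(mult[toupper(code)]) # NA for codes that are not in the lookup
m[is.na(m)] <- 1                 # unknown code: multiplier 1
loss <- dmg * m
print(loss)
```

The four toy records come out as 25 thousand, 2.5 million, 10 billion and 5 dollars.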

Timeline of observations

As noted in the introduction, the events in the database start in 1950 and end in November 2011, and the earlier years contain fewer recorded events. The following plot of the number of events per year illustrates this:

plot(table(DF_reduced$YEAR), ylab="frequency",main="Number of reported Events per Year")

Figure 1

The plot suggests that the observations before 1995 can be dropped if we want to focus on more recent and more completely recorded events. We therefore create a subset of the data frame DF_reduced:

Selection <- 1995:2011
Selection <- as.character(Selection)

DF_reduced <- DF_reduced[DF_reduced$YEAR %in% Selection,]

Calculate Fatalities and Injuries

DF_reduced$FandI <- DF_reduced$FATALITIES+DF_reduced$INJURIES

# No 1: total number of fatalities by event type
total_fatalities <- by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, sum)
FATALITIES <- sort(total_fatalities, decreasing=TRUE)

# No 2: mean number of fatalities per event, by event type
mean_fatalities <- by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, mean)
FATALITIESmean <- sort(mean_fatalities, decreasing=TRUE)

# No 3: total number of injuries by event type
total_injuries <- by(DF_reduced$INJURIES, DF_reduced$EVTYPE, sum)
INJURIES <- sort(total_injuries, decreasing=TRUE)

# No 4: mean number of injuries per event, by event type
mean_injuries <- by(DF_reduced$INJURIES, DF_reduced$EVTYPE, mean)
INJURIESmean <- sort(mean_injuries, decreasing=TRUE)

# Set up plots
layout(matrix(c(1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4), 8, 2, byrow=T))
par(mar=c(4, 8, 4, 2))

# no 1 barplot of total fatalities by event type
barplot(sort(by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, sum), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Total Number of Fatalities by Event Type")
mtext("Total Number of Fatalities (top 20)", side=3, line=1, cex=1, font=2)

# no 2 barplot of top mean fatalities per event
barplot(sort(by(DF_reduced$FATALITIES, DF_reduced$EVTYPE, mean), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Mean Number of Fatalities by Event Type")
mtext("Mean Number of Fatalities (top 20)", side=3, line=1, cex=1, font=2)

# no 3 barplot of total injuries by event type
barplot(sort(by(DF_reduced$INJURIES, DF_reduced$EVTYPE, sum), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Total Number of Injuries by Event Type")
mtext("Total Number of Injuries (top 20)", side=3, line=1, cex=1, font=2)

# no 4 barplot of top mean injuries per event
barplot(sort(by(DF_reduced$INJURIES, DF_reduced$EVTYPE, mean), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Mean Number of Injuries by Event Type")
mtext("Mean Number of Injuries (top 20)", side=3, line=1, cex=1, font=2)

Figure 2

The totals and the mean values differ. Some event types occur less often but are worse in terms of mean values; others occur more often but are less severe per event. Depending on the goal of the prevention measures, one needs to focus on one or the other (or even on both).
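A toy example may make the difference between totals and means concrete. The event types "A" and "B" and their counts are made up, and the sketch uses tapply(), which returns a plain named vector, rather than the by() calls of the analysis.

```r
# Hypothetical records: "A" occurs often with few fatalities, "B" once with many
toy <- data.frame(
  EVTYPE     = c("A", "A", "A", "A", "B"),
  FATALITIES = c(1, 2, 1, 2, 5)
)

totals <- tapply(toy$FATALITIES, toy$EVTYPE, sum)   # sum per event type
means  <- tapply(toy$FATALITIES, toy$EVTYPE, mean)  # mean per event type
print(totals)
print(means)
```

Here "A" leads in total fatalities while "B" leads in fatalities per event, which is the pattern seen in the figure.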

It also seems that the event types could be grouped; several event types appear to be very close in definition…
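Such a grouping was not carried out in this report. A minimal sketch of how it might look, under the assumption that similar labels differ mainly in case, stray spaces and abbreviations, is given below; the sample labels and the group names are illustrative only.

```r
# Made-up sample of event type labels
ev <- c("Heat Wave", "HEAT WAVE", " EXCESSIVE HEAT", "TSTM WIND", "THUNDERSTORM WIND")

# Normalize case and surrounding whitespace first
norm <- toupper(trimws(ev))

# Then map labels to hypothetical groups via pattern matching
group <- norm
group[grepl("HEAT", norm)] <- "HEAT"
group[grepl("TSTM|THUNDERSTORM", norm)] <- "THUNDERSTORM WIND"
print(table(group))
```

Any real grouping would need to be checked against the Storm Data documentation, since patterns this broad can merge event types that should stay separate.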

Statements
  1. The event type with the most fatalities in total is EXCESSIVE HEAT with 1903 fatalities.
  2. The event type with the highest mean number of fatalities is COLD AND SNOW with 14 fatalities per event.
  3. The event type with the most injuries in total is TORNADO with 21765 injuries.
  4. The event type with the highest mean number of injuries is Heat Wave with 70 injuries per event.

Calculate the economic damage

# Get total economic loss (by event type)
total_Loss <- by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, sum)
LOSS <- sort(total_Loss, decreasing=TRUE)

# Get mean economic loss per event (by event type)
mean_Loss <- by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, mean)
LOSSmean <- sort(mean_Loss, decreasing=TRUE)

# Set up plots
layout(matrix(c(1,2,1,2,1,2,1,2), 4, 2, byrow=T))
par(mar=c(4, 10, 4, 2))

# No 1 barplot of top total loss by event type
barplot(sort(by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, sum), decreasing=T)[20:1], horiz=T, las=1, cex.names=0.7, xlab="Total Loss of Events by Event Type")
mtext("Total Loss by Event (top 20)", side=3, line=1, cex=1, font=2)

# No 2 barplot of top mean losses by event type
barplot(sort(by(DF_reduced$ECONOMIC_LOSS, DF_reduced$EVTYPE, mean), decreasing=T)[20:1], horiz=T,  las=1, cex.names=0.7, xlab="Mean Loss of Events by Event Type")
mtext("Mean Loss by Event (top 20)", side=3, line=1, cex=1, font=2)

Figure 3

The total costs and the mean values differ. Some event types occur less often but are more expensive per event; others occur more often but are less costly per event. Depending on the goal of the prevention measures, one needs to focus on one or the other (or even on both).

It also seems that the event types could be grouped; several event types appear to be very close in definition…

Statements
  1. The most expensive event type in total is FLOOD with a loss of 149 billion $.
  2. The most expensive event type per event is HEAVY RAIN/SEVERE WEATHER with a mean loss of 1250 million $ per event.

Conclusion

This is a first attempt to get to know the data provided. Many questions arose while working on this assignment that could not be answered in the time available.