Reproducible Research Assignment 2

======================================================

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

The analysis of the storm event database shows that tornadoes are the weather event most harmful to population health, followed by excessive heat. The economic impact of weather events was also analyzed. Flash floods and thunderstorm winds caused billions of dollars in property damage between 1950 and 2011. The largest crop damage was caused by drought, followed by flood and hail.

Data Processing

The analysis was performed on the Storm Events Database, provided by the National Climatic Data Center. The data come from a comma-separated-value file available here. Some documentation of the data is also available here.

The first step is to read the data into a data frame.

# The file is comma-separated, so read.csv's default separator applies
storm <- read.csv("/Users/Malter/Desktop/Coursera/Reproducible Research/Assessment 2/repdata-data-StormData.csv")

Before the analysis, the data need some preprocessing. Event types do not follow a consistent format. For instance, the types Frost/Freeze, FROST/FREEZE and FROST\\FREEZE clearly refer to the same kind of event.
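One way to reduce this inconsistency is to normalize the labels before aggregating. The sketch below is an illustration rather than a step of the analysis that follows. The helper name `normalize_evtype` is hypothetical, and it assumes that differences in case, surrounding whitespace, and backslash versus slash are the only variations to reconcile.

```r
# Hypothetical helper (not part of the original analysis): collapse
# trivial spelling variants of an event type into one label.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))           # unify case, drop outer whitespace
  x <- gsub("[\\\\/]+", "/", x)     # any run of slashes or backslashes -> "/"
  gsub("[[:space:]]+", " ", x)      # squeeze repeated whitespace
}

variants <- c("Frost/Freeze", "FROST/FREEZE", "FROST\\FREEZE", " frost/freeze ")
unique(normalize_evtype(variants))
```

Applied to the real data, this would be `data$EVTYPE <- normalize_evtype(data$EVTYPE)` before the aggregation step. Genuinely different spellings of the same event would still need a manual mapping.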

Load required packages

library(ggplot2)
library(car)

Read data from file

file <- "/Users/Malter/Desktop/Coursera/Reproducible Research/Assessment 2/repdata-data-StormData.csv"
# Optional: inspect column classes on a small sample before the full read
# init <- read.csv(file, sep = ",", header = TRUE, nrows = 5000,
#                  stringsAsFactors = FALSE, quote = "")
# classes <- sapply(init, class)
# cols <- colnames(init)

# The file is comma-separated, so the default separator is used
data <- read.csv(file, header = TRUE, stringsAsFactors = FALSE)
head(data)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

We’ll be using the following variables:
EVTYPE - event type (e.g. flood, tornado)
FATALITIES - fatalities
INJURIES - injuries
PROPDMG - property damage
CROPDMG - crop damage

Clean up the data. Convert PROPDMG and CROPDMG to a common scale (US dollars) by multiplying each amount by the magnitude encoded in PROPDMGEXP and CROPDMGEXP.

data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$PROPDMGEXP[data$PROPDMGEXP == "" | data$PROPDMGEXP == "+" | data$PROPDMGEXP == "?" | data$PROPDMGEXP == "-"] <- "1"
data$PROPDMGEXP[data$PROPDMGEXP == "H" | data$PROPDMGEXP == "h"] <- "100"
data$PROPDMGEXP[data$PROPDMGEXP == "K" | data$PROPDMGEXP == "k"] <- "1000"
data$PROPDMGEXP[data$PROPDMGEXP == "M" | data$PROPDMGEXP == "m"] <- "1000000"
data$PROPDMGEXP[data$PROPDMGEXP == "B" | data$PROPDMGEXP == "b"] <- "1000000000"
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$PROPDMGUSD <- data$PROPDMG * data$PROPDMGEXP


data$CROPDMGEXP <- as.character(data$CROPDMGEXP)
data$CROPDMGEXP[data$CROPDMGEXP == "" | data$CROPDMGEXP == "?"] <- "1"
data$CROPDMGEXP[data$CROPDMGEXP == "B" | data$CROPDMGEXP == "b"] <- "1000000000"
data$CROPDMGEXP[data$CROPDMGEXP == "M" | data$CROPDMGEXP == "m"] <- "1000000"
data$CROPDMGEXP[data$CROPDMGEXP == "K" | data$CROPDMGEXP == "k"] <- "1000"
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$CROPDMGUSD <- data$CROPDMG * data$CROPDMGEXP
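The two conversions above repeat the same mapping, so they could be expressed once as a lookup. The following is a minimal sketch under stated assumptions: `exp_to_multiplier` is a hypothetical helper, and, like the code above, it treats blank and symbol codes as a multiplier of 1 and uses any numeric code as the literal multiplier. Whether numeric codes should instead be read as powers of ten is a question this report does not settle.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 100, K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(code))
  mult <- unname(lookup[code])               # NA for codes not in the table
  symbols <- code %in% c("", "+", "-", "?")
  mult[symbols] <- 1                         # blank or symbol: amount used as is
  digits <- is.na(mult) & grepl("^[0-9]+$", code)
  mult[digits] <- as.numeric(code[digits])   # assumption: digit is the multiplier
  mult
}

# Worked example: an amount of 25.0 with code "K" is 25,000 USD
25.0 * exp_to_multiplier("K")
```

With such a helper, each damage column would reduce to one line, for example `data$PROPDMG * exp_to_multiplier(data$PROPDMGEXP)`.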

Aggregate the data per event type and calculate two new columns:
health - total of fatalities and injuries
damage - total of property and crop damage

# Aggregate data per EVTYPE
agg <- aggregate(cbind(FATALITIES, INJURIES, PROPDMGUSD, CROPDMGUSD) ~ EVTYPE, data = data, FUN = sum)
# Add calculated column 'health' as the sum of FATALITIES and INJURIES
agg$health <- agg$FATALITIES + agg$INJURIES
# Add calculated column 'damage' as the sum of PROPDMGUSD and CROPDMGUSD
agg$damage <- agg$PROPDMGUSD + agg$CROPDMGUSD
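To make the aggregation step concrete, here is a small illustration on made-up rows. The `toy` data frame is invented for the example and is not taken from the storm database. It shows how `aggregate()` with `cbind()` on the left-hand side of the formula sums several columns per event type at once.

```r
# Invented example rows, not real storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO"),
  FATALITIES = c(1, 0, 2),
  INJURIES   = c(10, 3, 5),
  stringsAsFactors = FALSE
)

# One row per event type, with each measure summed within the group
toy_agg <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
toy_agg$health <- toy_agg$FATALITIES + toy_agg$INJURIES
toy_agg
```

Note that the formula interface of `aggregate()` drops rows with missing values in any listed column by default (`na.action = na.omit`), which matters if any of the summed columns contain `NA`.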

Prepare data sets for graphing of health impact

# Examine fatalities on their own: top 10 event types by fatalities
fatalities <- agg[order(agg$FATALITIES, decreasing = TRUE), ][1:10, ]

# Order the factor levels so bars are drawn from highest to lowest
fatalities <- transform(fatalities, EVTYPE = reorder(EVTYPE, -FATALITIES))
fatalities$TYPE <- "FATALITIES"

# Examine combined fatalities and injuries: top 10 event types
health <- agg[order(agg$health, decreasing = TRUE), ][1:10, ]

# Stack fatalities and injuries into long format for a stacked bar chart
healthFatalities <- health[, c("EVTYPE", "FATALITIES")]
names(healthFatalities)[2] <- "PEOPLE"
healthFatalities$TYPE <- "FATALITIES"

healthInjuries <- health[, c("EVTYPE", "INJURIES")]
names(healthInjuries)[2] <- "PEOPLE"
healthInjuries$TYPE <- "INJURIES"

healthPlot <- rbind(healthFatalities, healthInjuries)
healthPlot <- transform(healthPlot, EVTYPE = reorder(EVTYPE, -PEOPLE))

Prepare data sets for graphing of economic impact

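# Top 10 event types by combined property and crop damage
damage <- agg[order(agg$damage, decreasing = TRUE), ][1:10, ]

# Stack property and crop damage into long format for a stacked bar chart
damageProperty <- damage[, c("EVTYPE", "PROPDMGUSD")]
names(damageProperty)[2] <- "DAMAGE"
damageProperty$TYPE <- "PROPERTY"

damageCrop <- damage[, c("EVTYPE", "CROPDMGUSD")]
names(damageCrop)[2] <- "DAMAGE"
damageCrop$TYPE <- "CROP"

damagePlot <- rbind(damageProperty, damageCrop)
damagePlot <- transform(damagePlot, EVTYPE = reorder(EVTYPE, -DAMAGE))

# Express damage in millions of USD for plotting
damagePlot$DAMAGE <- damagePlot$DAMAGE / 1000000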
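The health and damage data sets are both built by the same pattern: take two measure columns, rename each to a common name, label it, and stack the results. As a sketch only, that pattern could be wrapped in one function. `stack_measures` and its arguments are hypothetical names introduced here, and base R's `reshape()` would be an alternative.

```r
# Hypothetical helper: stack several measure columns into long format,
# with one row per (event type, measure) pair.
stack_measures <- function(df, id, measures, value_name) {
  parts <- lapply(names(measures), function(label) {
    out <- df[, c(id, measures[[label]])]
    names(out)[2] <- value_name
    out$TYPE <- label
    out
  })
  do.call(rbind, parts)
}

# Invented example rows, not real storm data
toy <- data.frame(EVTYPE = c("FLOOD", "HAIL"),
                  PROPDMGUSD = c(5e6, 1e6), CROPDMGUSD = c(2e6, 3e6),
                  stringsAsFactors = FALSE)
long <- stack_measures(toy, "EVTYPE",
                       c(PROPERTY = "PROPDMGUSD", CROP = "CROPDMGUSD"),
                       "DAMAGE")
long
```

The result has one `DAMAGE` column and a `TYPE` column that can be mapped to the fill colour of a stacked bar chart.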

Results

Which event is the most harmful to public health?

Number of fatalities
Tornadoes are the most dangerous event type when the number of fatalities is examined.

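# Bar chart of fatalities per event type, built with ggplot() because
# qplot() is deprecated in recent ggplot2 releases
ggplot(fatalities, aes(x = EVTYPE, y = FATALITIES, fill = TYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Fatalities", x = "", y = "Number of people") +
  scale_fill_discrete("") +
  theme(axis.text.x = element_text(angle = 90))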

[Figure: bar chart "Fatalities", number of people by event type]

Combined number of fatalities and injuries
Tornadoes are the most dangerous event type when the combined number of fatalities and injuries is examined.

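# Stacked bar chart of fatalities and injuries per event type, built with
# ggplot() because qplot() is deprecated in recent ggplot2 releases
ggplot(healthPlot, aes(x = EVTYPE, y = PEOPLE, fill = TYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Fatalities & Injuries", x = "", y = "Number of people") +
  scale_fill_discrete("") +
  theme(axis.text.x = element_text(angle = 90))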

[Figure: stacked bar chart "Fatalities & Injuries", number of people by event type]

Which event causes the most economic damage?
Floods are the most economically damaging event type.

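# Stacked bar chart of property and crop damage per event type, built with
# ggplot() because qplot() is deprecated in recent ggplot2 releases
ggplot(damagePlot, aes(x = EVTYPE, y = DAMAGE, fill = TYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Economic damage", x = "", y = "Damage in million $") +
  scale_fill_discrete("") +
  theme(axis.text.x = element_text(angle = 90))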

[Figure: stacked bar chart "Economic damage", damage in million $ by event type]