Data Processing

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

First, we load the required packages.

library(ggplot2)
library(plyr)
library(reshape2)

To begin the analysis, we download the compressed data and read it into R.

#setInternet2(TRUE)

## Create temporary file
f <- tempfile()

## Download the data
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
    f)

## Import the data into R
data <- read.csv(bzfile(f), stringsAsFactors = FALSE)

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. Because measurements from the early years may be prone to large errors, we include only events that began on or after 1 January 1991 (roughly the last two decades of data) in our analysis.

# Set format
data$BGN_DATE <- as.Date(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
# Restrict to events from 1991 onward
data <- data[data$BGN_DATE >= as.Date("1991-01-01"), ]
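
As a small illustration of the date handling above, the following sketch applies the same format string and cutoff to a toy data frame. The three date strings and the names toy, cutoff and recent are made up for the example and are not taken from the storm database.

```r
# Toy stand-in for the BGN_DATE column (illustrative values only)
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1991 0:00:00", "11/28/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# The format string describes date and time; as.Date() keeps only the date
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")

# Keep events that began on or after the cutoff date
cutoff <- as.Date("1991-01-01")
recent <- toy[toy$BGN_DATE >= cutoff, , drop = FALSE]
print(recent)
```

Because the comparison uses >=, an event dated exactly on the cutoff is kept.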

Processing the categories data

Note that the event categories are not perfectly clean: there are cases of misspelling and duplication of categories. In further research, it might be interesting to clean the data in more detail. However, due to the time constraint, we performed only the basic data processing steps: converting all characters to upper case and trimming leading and trailing spaces from the categories.

# Define trim function
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
# Clean Event Type...
data$EVTYPE <- toupper(data$EVTYPE)
data$EVTYPE <- trim(data$EVTYPE)
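
To see what these two steps achieve, here is a minimal sketch on a few made-up labels (the vector raw is hypothetical, not a sample from the database). It also checks that base R's trimws() gives the same result as the trim helper for these inputs.

```r
# Same helper as above: strip leading and trailing whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical labels that differ only in case and surrounding spaces
raw <- c(" Tstm Wind", "TSTM WIND  ", "tstm wind", "Flash Flood")
clean <- trim(toupper(raw))

# The three spellings of the first label collapse into one category
print(table(clean))

# trimws() from base R is an equivalent alternative here
print(identical(clean, trimws(toupper(raw))))
```

This basic cleaning merges only variants that differ in case or padding; genuinely misspelled categories remain separate.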

Processing the public health damage data

The public health damage data need to be summarized to show the number of injuries and fatalities by event type. Next, the top 10 events resulting in injuries and the top 10 events resulting in fatalities are selected. Finally, the data are melted with the reshape2 package so that we can use them later in ggplot charts.

# Make sums of injuries and fatalities
sumhealth <- ddply(data, .(EVTYPE), summarise, fatalities = sum(FATALITIES), 
    injuries = sum(INJURIES))
# Select ten most harmful events

topfatalities <- head(sumhealth[order(sumhealth$fatalities, decreasing = TRUE), 
    ], n = 10)[, c(1, 2)]
topinjuries <- head(sumhealth[order(sumhealth$injuries, decreasing = TRUE), ], 
    n = 10)[, c(1, 3)]
## Prepare data for the barchart
forchart <- melt(topfatalities)
## Using EVTYPE as id variables
forchart2 <- melt(topinjuries)
## Using EVTYPE as id variables
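
The summarise-then-rank pattern can also be sketched in base R, without plyr. The example below is an illustration on a made-up data frame (toy, sums and top are hypothetical names, and the counts are invented); it is not the code used to produce the results in this report.

```r
# Toy event records (invented numbers)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  FATALITIES = c(3, 5, 1, 0, 2, 1),
  INJURIES   = c(10, 0, 4, 2, 1, 0),
  stringsAsFactors = FALSE
)

# Sum both measures per event type
sums <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank by fatalities and keep the top two event types
top <- head(sums[order(sums$FATALITIES, decreasing = TRUE),
                 c("EVTYPE", "FATALITIES")], n = 2)
print(top)
```

Ranking each measure separately, as done above for fatalities and injuries, means the two top-10 lists can contain different event types.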

Processing the economic damage data

The economic damage data is stored as a base value and a multiplier, the latter encoded as an abbreviation (for example, K for thousands, M for millions and B for billions). Hence, we multiply the base numbers by the corresponding multipliers.

# Property damage multiplier: prepare and use to multiply the damage
data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- 0
data$PROPDMGEXP[data$PROPDMGEXP == ""] <- 1
data$PROPDMGEXP[grep("[-+?]", data$PROPDMGEXP)] <- 1
data$PROPDMGEXP[grep("[Hh]", data$PROPDMGEXP)] <- 100
data$PROPDMGEXP[grep("[Kk]", data$PROPDMGEXP)] <- 1000
data$PROPDMGEXP[grep("[Mm]", data$PROPDMGEXP)] <- 1e+06
data$PROPDMGEXP[grep("[Bb]", data$PROPDMGEXP)] <- 1e+09
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$PROPDMG <- data$PROPDMGEXP * data$PROPDMG

# Crop damage multiplier: prepare and use to multiply the damage
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- 0
data$CROPDMGEXP[data$CROPDMGEXP == ""] <- 1
data$CROPDMGEXP[grep("[-+?]", data$CROPDMGEXP)] <- 1
data$CROPDMGEXP[grep("[Hh]", data$CROPDMGEXP)] <- 100
data$CROPDMGEXP[grep("[Kk]", data$CROPDMGEXP)] <- 1000
data$CROPDMGEXP[grep("[Mm]", data$CROPDMGEXP)] <- 1e+06
data$CROPDMGEXP[grep("[Bb]", data$CROPDMGEXP)] <- 1e+09
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$CROPDMG <- data$CROPDMGEXP * data$CROPDMG
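
The same mapping can be expressed with a named lookup vector, which avoids repeating the substitutions for each column. The sketch below is one possible alternative, not the code used in this analysis: exp_to_multiplier is a hypothetical helper, and it assumes that every unrecognised code (blank, symbol or digit) should count as a multiplier of 1, whereas the code above leaves digit codes as their own numeric value.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  # Assumption: blank, symbol and other unrecognised codes mean "times 1"
  mult[is.na(mult)] <- 1
  # Missing codes are treated as zero, mirroring the is.na() step above
  mult[is.na(code)] <- 0
  mult
}

# Made-up base values and codes
dmg   <- c(25, 2.5, 10, 3, 7)
codes <- c("K", "M", "", "b", NA)
damage <- dmg * exp_to_multiplier(codes)
print(damage)
```

Both PROPDMGEXP and CROPDMGEXP could be passed through the same helper, so any change to the mapping is made in one place.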

As with the health data, the economic damage figures are first summarized by event type. Subsequently, the top 10 events with the highest economic impact (defined as damage to crops plus damage to property) are selected. Finally, the data are prepared for ggplot with the melt function.

# Make sums of crop and property damage
sumecon <- ddply(data, .(EVTYPE), summarise, cropdmg = sum(CROPDMG), propdmg = sum(PROPDMG))
sumecon$totaldamage <- sumecon$cropdmg + sumecon$propdmg
## Select top 10
topecon <- head(sumecon[order(sumecon$totaldamage, decreasing = T), ], n = 10)
## Prepare data for the barchart
forchart3 <- melt(topecon)
## Using EVTYPE as id variables
forchart3 <- forchart3[forchart3$variable != "totaldamage", ]
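
To make the shape of the melted data concrete, the sketch below builds the same long layout (one row per event type and damage type, with columns EVTYPE, variable and value) by hand in base R. The data frame wide and its numbers are invented for illustration; the analysis itself relies on melt from reshape2.

```r
# Toy summary in wide form: one row per event type (invented values)
wide <- data.frame(
  EVTYPE  = c("HAIL", "DROUGHT"),
  cropdmg = c(3, 14),
  propdmg = c(16, 1),
  stringsAsFactors = FALSE
)

# Long form: stack the two damage columns and record which one each value came from
long <- data.frame(
  EVTYPE   = rep(wide$EVTYPE, times = 2),
  variable = rep(c("cropdmg", "propdmg"), each = nrow(wide)),
  value    = c(wide$cropdmg, wide$propdmg),
  stringsAsFactors = FALSE
)
print(long)
```

In this layout the variable column can be mapped directly to the fill colour of a stacked bar chart.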

Results

Question 1: Public health

The following table and chart present the 10 most damaging events from the perspective of fatalities.

topfatalities

##             EVTYPE fatalities
## 108 EXCESSIVE HEAT       1903
## 750        TORNADO       1699
## 130    FLASH FLOOD        978
## 235           HEAT        937
## 410      LIGHTNING        816
## 146          FLOOD        470
## 516    RIP CURRENT        368
## 771      TSTM WIND        285
## 312      HIGH WIND        248
## 11       AVALANCHE        224

## Make the barchart
ggplot(forchart, aes(x = factor(EVTYPE), y = value, fill = variable)) + 
    geom_bar(stat = "identity", position = "dodge") + theme(axis.text.x = element_text(angle = -270), 
    plot.title = element_text(face = "bold")) + labs(x = "Weather event", y = "Number of fatalities") + 
    theme(legend.position = "none") + ggtitle("Fatalities")

The event with the highest number of fatalities in the period analyzed (1991 to November 2011) was EXCESSIVE HEAT, followed closely by TORNADO.

The following chart presents the 10 most damaging events from the perspective of injuries; the underlying figures are held in topinjuries.

topinjuries

## Make the barchart
ggplot(forchart2, aes(x = factor(EVTYPE), y = value, fill = variable)) + 
    geom_bar(stat = "identity", position = "dodge") + theme(axis.text.x = element_text(angle = -270), 
    plot.title = element_text(face = "bold")) + labs(x = "Weather event", y = "Number of injuries") + 
    theme(legend.position = "none") + ggtitle("Injuries")

The event with the highest number of injuries in the period analyzed (1991 to November 2011) was TORNADO.

To sum up, TORNADO seems to be the most damaging event from the perspective of public health.

Question 2: Economic Damage

The following table and chart present the 10 most damaging events from the perspective of economic damage.

topecon

##                EVTYPE     cropdmg     propdmg totaldamage
## 364 HURRICANE/TYPHOON  2607872800 69305840000 71913712800
## 591       STORM SURGE        5000 43323536000 43323541000
## 750           TORNADO   414953110 28897947814 29312900924
## 204              HAIL  3025954453 15732267427 18758221880
## 130       FLASH FLOOD  1421317100 16140862294 17562179394
## 76            DROUGHT 13972566000  1046106000 15018672000
## 355         HURRICANE  2741910000 11868319010 14610229010
## 521       RIVER FLOOD  5029459000  5118945500 10148404500
## 379         ICE STORM  5022113500  3944927810  8967041310

## Make the barchart
ggplot(forchart3, aes(x = factor(EVTYPE), y = value, fill = variable)) + 
    geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = -270), 
    plot.title = element_text(face = "bold")) + labs(x = "Weather event", y = "Economic damage") + 
    scale_fill_discrete(name = "Type of damage", labels = c("Crop", "Property")) + 
    theme(legend.position = "top") + ggtitle("Economic impact")

Analysis

From the above we can see that the largest aggregate property damage is again due to Tornadoes. Coastal Erosion has the highest average per-event property damage.

Tornadoes dominate weather events harmful to population health. Figure 1 above displays the most harmful weather events, fatalities vs. injuries, conditional on either injuries or fatalities being reported. The size of the bubbles indicates the total count of such events. After Tornadoes, the next most dangerous event types as measured by fatalities are events involving Heat and events involving Flooding. Tornadoes are also much more prevalent among weather events involving either injuries or fatalities. In terms of injuries, Heat, Floods, and Thunder are about equally damaging (after Tornadoes).

Tornadoes appear to be the deadliest phenomenon occurring in the USA, followed by heat and fires and then hurricanes. The number of tornado victims exceeds 5,000 over the period considered.

Hurricanes hurt the economy the most because of their combined adverse effects, such as floods, followed by storms and tornadoes. The total damage from hurricanes and the effects they cause exceeds $300B over the reported period.