Synopsis

This report studies the economic and personal harm caused by natural events in the United States. It is based on an analysis of NOAA storm data for the years 1985-2011. The data goes back to 1950; older records were cut because they are sparse and dated. It was found that tornadoes inflict the most personal harm, with approximately 32,000 fatalities and injuries over the study period, or roughly 1,200 per year on average. Thunderstorm winds inflict the greatest economic damage, approximately $6B in total over the years in which that event type was recorded.

Data Processing

Load the dataset into R. The date fields are read as text, which we convert to date-times using the as.POSIXct() function. From the field BGN_DATE we then pull a YEAR variable and treat it as the year in which the event occurred. According to the storm data documentation on the NOAA website, these values are recorded every month.

getwd()

## [1] "/Users/brad/Projects/JHU Coursera"

setwd("~/Desktop")
getwd()

## [1] "/Users/brad/Desktop"

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, last

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.3

# Download the compressed NOAA file; read.csv() can read the .bz2 archive directly
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
storm_data <- read.csv("StormData.csv.bz2", header = TRUE, sep = ",")
# Convert the begin/end fields from text to date-times
# (only BGN_DATE is used in the rest of this analysis)
storm_data$BGN_DATE <- as.POSIXct(storm_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
storm_data$BGN_TIME <- as.POSIXct(storm_data$BGN_TIME, format = "%m/%d/%Y %H:%M:%S")
storm_data$END_DATE <- as.POSIXct(storm_data$END_DATE, format = "%m/%d/%Y %H:%M:%S")
storm_data$END_TIME <- as.POSIXct(storm_data$END_TIME, format = "%m/%d/%Y %H:%M:%S")
# Store the four-digit year of each event as a character string
storm_data$YEAR <- format(storm_data$BGN_DATE, "%Y")
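As a small illustration of the date handling described above, the sketch below parses a few made-up date strings in the same layout and extracts the year. The sample values and the `tz = "UTC"` argument are assumptions added for the example. Note that format() returns the year as text, so a comparison such as `YEAR > "1985"` is a string comparison; it behaves as expected for four-digit years, but converting to an integer makes the intent explicit.

```r
# Hypothetical sample values in the "month/day/year hour:minute:second" layout
raw_dates <- c("4/18/1950 0:00:00", "11/15/1985 0:00:00", "6/1/2011 0:00:00")

# Parse the text into date-times; tz = "UTC" is an assumption made here so the
# result does not depend on the local time zone
parsed <- as.POSIXct(raw_dates, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Extract the four-digit year as text, then convert it to an integer
year_chr <- format(parsed, "%Y")
year_num <- as.integer(year_chr)

# Comparing integers avoids relying on string ordering
recent <- year_num > 1985

print(data.frame(raw_dates, year_chr, year_num, recent))
```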

Exploring the dataset more closely, we find that the EVTYPE factor has 985 levels. Many are not genuine event types and are of no use to our analysis - for example, values such as “Summary of March 23”, “Summary Jan 17”, and “Summary: Sept. 18”.
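If such values did need to be cleaned, one possible approach is to flag labels that begin with "Summary" and drop them. This is only a sketch on made-up labels; the pattern and the trimming and upper-casing steps are assumptions, not steps taken in this analysis.

```r
# Hypothetical sample of event-type labels, including summary rows
evtype <- c("TORNADO", "Summary of March 23", "HAIL", "Summary Jan 17",
            "Summary: Sept. 18", "   HIGH SURF ADVISORY")

# Flag labels that start with "summary", ignoring case and leading whitespace
is_summary <- grepl("^[[:space:]]*summary", evtype, ignore.case = TRUE)

# Keep the rest, trimming stray whitespace and normalising case
cleaned <- toupper(trimws(evtype[!is_summary]))

print(cleaned)
```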

We will look for the Top 5 event types for both economic damage and physical harm. If these erroneous values appear among them, we will clean them; if not, we will let them be. We should also note that there are no NA values in any of the personal harm fields (which we define as “FATALITIES” and “INJURIES”) or economic damage fields (which we define as “PROPDMG” and “CROPDMG”). We do notice, however, that the property and crop damage values each have an additional “EXP” field (“PROPDMGEXP” and “CROPDMGEXP”) which gives the magnitude of the damage: “K” for thousands, “M” for millions, and “B” for billions.

str(storm_data$EVTYPE)

##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...

sum(is.na(storm_data$FATALITIES))

## [1] 0

sum(is.na(storm_data$INJURIES))

## [1] 0

sum(is.na(storm_data$PROPDMG))

## [1] 0

sum(is.na(storm_data$CROPDMG))

## [1] 0

head(storm_data$CROPDMGEXP, 30)

##  [1]                              
## Levels:  ? 0 2 B k K m M
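The magnitude codes can be turned into numeric multipliers with a small lookup table. The sketch below uses made-up values and works in dollars. Treating lower-case codes like their upper-case forms and giving every other code (blank, "?", digits) a multiplier of 1 are assumptions made for illustration; the scaling applied later in this analysis is different.

```r
# Multipliers for the three documented magnitude codes, in dollars
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical damage figures and their magnitude codes
dmg     <- c(25, 2.5, 1.2, 10, 5)
dmg_exp <- c("K", "M", "B", "k", "")

# Upper-case the code so "k" and "m" match, then look up the multiplier;
# codes missing from the table come back as NA
mult <- unname(exp_multiplier[toupper(dmg_exp)])

# Assumption: any code outside K/M/B gets a multiplier of 1
mult[is.na(mult)] <- 1

dmg_dollars <- dmg * mult
print(dmg_dollars)
```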

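I find the piping of the dplyr package intuitive, so I used it to filter the data before building the summary tables with ddply(). The intermediate table “storm_intermediate” cuts down the size of the data: the filter as written keeps events from years after 1985, which reduces the rows from ~900K to ~785K. That isn’t much of a reduction for dropping 35 years of data, but it helps. Personal harm is then summed per event type and year, and the yearly totals are summarised per event type as a Sum, Mean, and SD. The Top 5 are chosen by Sum.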

storm_intermediate <- storm_data %>% filter(YEAR > "1985")

summary_intermediate <- ddply(storm_intermediate, .(EVTYPE, YEAR), summarize, personal_harm = sum(FATALITIES, INJURIES))
summary_intermediate <- ddply(summary_intermediate, .(EVTYPE), summarise, Sum = sum(personal_harm), Mean = mean(personal_harm), SD = sd(personal_harm))

storm_summary <- top_n(summary_intermediate %>% filter(SD >0), 5, Sum)
storm_summary <- arrange(storm_summary, desc(Sum))
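The two-stage summary above (per type and year, then per type) can also be sketched in base R on a toy table, which makes it easier to see what the Sum, Mean, and SD columns mean. The data below are invented, and the base R functions stand in for ddply() and top_n(); this is an illustration, not the code used for the results.

```r
# Hypothetical toy data: one row per recorded event
events <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "FLOOD", "FLOOD", "HAIL"),
  YEAR       = c("1990", "1990", "1991", "1990", "1991", "1990"),
  FATALITIES = c(2, 1, 0, 1, 0, 0),
  INJURIES   = c(10, 5, 3, 0, 2, 1),
  stringsAsFactors = FALSE
)

# Stage 1: total harm per event type and year
events$harm <- events$FATALITIES + events$INJURIES
per_year <- aggregate(harm ~ EVTYPE + YEAR, data = events, FUN = sum)

# Stage 2: total, yearly mean, and yearly SD per event type
by_type <- split(per_year$harm, per_year$EVTYPE)
per_type <- data.frame(
  EVTYPE = names(by_type),
  Sum    = sapply(by_type, sum),
  Mean   = sapply(by_type, mean),
  SD     = sapply(by_type, sd),
  stringsAsFactors = FALSE
)

# Drop types seen in only one year (SD is NA) or with no spread, then rank by Sum
ranked <- per_type[!is.na(per_type$SD) & per_type$SD > 0, ]
ranked <- ranked[order(-ranked$Sum), ]
print(ranked)
```

Here Mean and SD are taken over the years in which an event type appears, so a type recorded in only a few years can have a high Mean with a modest Sum.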

However, I ran into processing errors (i.e., R would close) when using the dplyr package to build the damage_intermediate table. After a bit of googling, I found that the data.table package is much more efficient for this type of problem.

damage_intermediate <- data.table(storm_intermediate$EVTYPE, storm_intermediate$YEAR, storm_intermediate$PROPDMG, storm_intermediate$CROPDMG, storm_intermediate$PROPDMGEXP, storm_intermediate$CROPDMGEXP)

colnames(damage_intermediate) <- c("EVTYPE", "YEAR", "PROPDMG", "CROPDMG", "PROPDMGEXP", "CROPDMGEXP")
                      
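damage_intermediate <- damage_intermediate %>% mutate( 
       # Scale damage to thousands of dollars: "K" values are left as-is,
       # "M" values are multiplied by 1,000, and every other code
       # (including "B", blank, and lower-case codes) is multiplied by 1,000,000
       PROPDMG_FIX = ifelse(PROPDMGEXP == "K", 
                        PROPDMG, 
                        ifelse(PROPDMGEXP == "M", 
                               PROPDMG * 1000, 
                               PROPDMG * 1000000)),
       # Same scaling for crop damage
       CROPDMG_FIX = ifelse(CROPDMGEXP == "K",
                        CROPDMG,
                        ifelse(CROPDMGEXP == "M",
                               CROPDMG * 1000,
                               CROPDMG * 1000000)))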
       
economic_intermediate <- ddply(damage_intermediate, .(EVTYPE, YEAR), summarize, economic_harm = sum(PROPDMG_FIX, CROPDMG_FIX))

second_intermediate <- ddply(economic_intermediate, .(EVTYPE), summarise, Sum = sum(economic_harm), Mean = mean(economic_harm), SD = sd(economic_harm))

economic_summary <- top_n(second_intermediate %>% filter(SD >0), 5, Sum)
economic_summary <- arrange(economic_summary, desc(Sum))

Results

The chart below shows the Top 5 causes of personal harm, defined as fatalities plus injuries. On average, tornadoes inflict the greatest personal harm in the United States: their yearly average of about 1,234 is well above that of the other four event types.

p <- ggplot(data = storm_summary, mapping = aes(x=EVTYPE, y=log(Mean)))
p <- p + ylab("Log of Harm") + xlab("Event Type") + labs(title = "Personal Harm from Natural Events (1985-2011)")
p <- p + geom_point() 
p <- p + geom_errorbar(aes(ymin=log(Mean)-log(2*SD), ymax=log(Mean)+log(2*SD)), width=.2)

storm_summary

##           EVTYPE   Sum      Mean        SD
## 1        TORNADO 32094 1234.3846 1199.9910
## 2 EXCESSIVE HEAT  8428  468.2222  491.3538
## 3          FLOOD  7259  382.0526 1408.4724
## 4      TSTM WIND  6863  326.8095  198.7095
## 5      LIGHTNING  6046  318.2105  113.2429
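One point about the error bars in the plotting code above: the bounds are computed as log(Mean) -/+ log(2*SD), which equal log(Mean / (2*SD)) and log(2 * Mean * SD), so they do not mark a band of two standard deviations around the mean. The sketch below compares those bounds with one possible alternative, which forms Mean -/+ 2*SD on the original scale and then takes logs. The alternative is an assumption offered for comparison, not the method used for the chart, and it leaves the lower bound missing whenever Mean - 2*SD is not positive.

```r
# Mean and SD for two rows of the personal-harm table above
m <- c(TORNADO = 1234.3846, LIGHTNING = 318.2105)
s <- c(TORNADO = 1199.9910, LIGHTNING = 113.2429)

# Bounds as written in the plotting code: log(Mean) -/+ log(2 * SD)
as_written <- cbind(lower = log(m) - log(2 * s),
                    upper = log(m) + log(2 * s))

# Alternative (an assumption, not the author's method): build the interval
# on the original scale first, then take logs
lo <- m - 2 * s
alt_lower <- setNames(rep(NA_real_, length(lo)), names(lo))
alt_lower[lo > 0] <- log(lo[lo > 0])   # a non-positive bound has no logarithm
alt <- cbind(lower = alt_lower, upper = log(m + 2 * s))

print(as_written)
print(alt)
```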

The chart below shows the Top 5 causes of economic damage, defined as property damage plus crop damage. On average, thunderstorm winds inflict the greatest economic damage in the United States, with a yearly average far higher than that of the other four event types. Note that each Mean is taken over the years in which that event type appears in the data.

e <- ggplot(data = economic_summary, mapping = aes(x=EVTYPE, y=log(Mean)))
e <- e + ylab("Log of Average Damage") + xlab("Event Type") + labs(title = "Economic Damage from Natural Events (1985-2011)")
e <- e + geom_point() 
e <- e + geom_errorbar(aes(ymin=log(Mean)-log(2*SD), ymax=log(Mean)+log(2*SD)), width=.2)

economic_summary

##               EVTYPE        Sum       Mean         SD
## 1 THUNDERSTORM WINDS 6269624104 2089874701 3527226077
## 2               HAIL  791352904   30436650  136035362
## 3        FLASH FLOOD  574662129   30245375   90736158
## 4            TORNADO  505388984   19438038   72118204
## 5          LIGHTNING  167640751    8823197   26042480

Analysis of NOAA Storm Data to Determine Economic and Health Effects

Brad Allen

January 19, 2016
