This is a study of the economic and personal harm caused by natural events, based on an analysis of NOAA storm data for the years 1986-2011. The data go back to 1950, but older records were cut because of their age and lack of depth. On average, tornadoes inflict the most personal harm in the United States: about 32,000 fatalities and injuries over the period, or roughly 1,200 per year. Thunderstorm winds account for the greatest economic damage in the summary below, with a total of about 6.3 billion in the units produced by the calculation (thousands of dollars), or roughly 2.1 billion per year in which that category appears.
Load the dataset into R. The timing is read as a “chr” variable, which we convert to a date-time using the as.POSIXct() function. We then extract the year from the field BGN_DATE into a new YEAR variable and treat it as the time at which the event occurred. According to the storm data documentation on the NOAA website, these values are recorded every month.
getwd()
## [1] "/Users/brad/Projects/JHU Coursera"
setwd("~/Desktop")
getwd()
## [1] "/Users/brad/Desktop"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, last
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
# Read the compressed CSV directly; read.csv() can open the .bz2 file as is
storm_data <- read.csv("StormData.csv.bz2", header = TRUE, sep = ",")
# Convert the date/time fields; values that do not match the format become NA
storm_data$BGN_DATE <- as.POSIXct(storm_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
storm_data$BGN_TIME <- as.POSIXct(storm_data$BGN_TIME, format = "%m/%d/%Y %H:%M:%S")
storm_data$END_DATE <- as.POSIXct(storm_data$END_DATE, format = "%m/%d/%Y %H:%M:%S")
storm_data$END_TIME <- as.POSIXct(storm_data$END_TIME, format = "%m/%d/%Y %H:%M:%S")
# Extract the four-digit year (as a character string) from the begin date
storm_data$YEAR <- format(storm_data$BGN_DATE, "%Y")
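As a quick check of the date handling above, the sketch below parses a few made-up strings with the same "%m/%d/%Y %H:%M:%S" format and extracts the year. The sample values and the `tz = "UTC"` argument are assumptions for illustration only; the code in this report relies on the local time zone.

```r
# Hypothetical sample values; the last one deliberately does not match the format
sample_dates <- c("4/18/1950 0:00:00", "11/15/2011 0:00:00", "not a date")

# Strings that do not match the format are returned as NA
parsed <- as.POSIXct(sample_dates, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# format() returns the year as a character string
years <- format(parsed, "%Y")
print(years)

# Because the year is character, this is a string comparison, not a numeric one
print(years > "1985")
```

Since the year is stored as text, a filter such as `YEAR > "1985"` keeps 1986 and later and excludes 1985 itself.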
Exploring the dataset more closely, we find that EVTYPE is a factor with 985 levels. Many of these are not usable event types for our analysis, for example “Summary of March 23”, “Summary Jan 17”, and “Summary: Sept. 18”.
We will find the Top 5 event types for both economic damage and personal harm. If erroneous values appear among them, we will clean them; if not, we will leave them as they are. Note that there are no NA values in any of the personal harm fields (“FATALITIES” and “INJURIES”) or the economic damage fields (“PROPDMG” and “CROPDMG”). The property and crop damage values do, however, each have an accompanying “EXP” field (“PROPDMGEXP” and “CROPDMGEXP”) that gives the magnitude of the damage: “K” for thousands, “M” for millions, and “B” for billions.
str(storm_data$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
sum(is.na(storm_data$FATALITIES))
## [1] 0
sum(is.na(storm_data$INJURIES))
## [1] 0
sum(is.na(storm_data$PROPDMG))
## [1] 0
sum(is.na(storm_data$CROPDMG))
## [1] 0
head(storm_data$CROPDMGEXP, 30)
## [1]
## Levels: ? 0 2 B k K m M
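The exponent codes described above can be turned into multipliers with a named lookup vector. The sketch below is one possible approach, not the method used in this report: the helper name `exp_multiplier` is hypothetical, it treats lowercase "k" and "m" like their uppercase forms, and it assumes that any other code (blank, "?", digits) means no scaling.

```r
# Hypothetical helper: map an exponent code to a dollar multiplier
exp_multiplier <- function(exp_code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(exp_code))  # works for factors and strings
  mult <- unname(lookup[code])             # NA for codes not in the lookup
  mult[is.na(mult)] <- 1                   # assumption: unrecognised codes mean "no scaling"
  mult
}

# Made-up example values
dmg  <- c(25, 2.5, 1, 10, 3)
code <- c("K", "m", "B", "", "?")
dollars <- dmg * exp_multiplier(code)
print(dollars)
```

How to treat the unrecognised codes is a judgment call; the point of the lookup is that each code is handled explicitly instead of falling through to a default.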
I find the piping in the dplyr package intuitive and used it, together with ddply() from plyr, to build a summary table. I first created an intermediate table, “storm_intermediate”, to cut down the size of the data. The filter keeps events from 1986 onward and cuts the rows from ~900K to ~785K, which isn’t much of a reduction for 26 years of data, but it helps.
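# Keep events after 1985; YEAR is a character string, so this is a string comparison
storm_intermediate <- storm_data %>% filter(YEAR > "1985")
# Total fatalities and injuries per event type and year
summary_intermediate <- ddply(storm_intermediate, .(EVTYPE, YEAR), summarise, personal_harm = sum(FATALITIES, INJURIES))
# Sum, mean, and standard deviation of the yearly totals per event type
summary_intermediate <- ddply(summary_intermediate, .(EVTYPE), summarise, Sum = sum(personal_harm), Mean = mean(personal_harm), SD = sd(personal_harm))
# Drop event types whose SD is zero or NA (e.g. seen in only one year), then keep the five largest totals
storm_summary <- top_n(summary_intermediate %>% filter(SD > 0), 5, Sum)
storm_summary <- arrange(storm_summary, desc(Sum))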
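The two-step aggregation above (yearly totals per event type, then statistics across years) can be illustrated on a toy data frame. This sketch uses base R instead of ddply(), and all values are made up.

```r
# Made-up records: two event types over two years
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "FLOOD", "FLOOD"),
  YEAR       = c("2010", "2010", "2011", "2010", "2011"),
  FATALITIES = c(1, 0, 2, 0, 1),
  INJURIES   = c(10, 5, 20, 1, 3)
)
toy$personal_harm <- toy$FATALITIES + toy$INJURIES

# Step 1: total harm per event type and year
yearly <- aggregate(personal_harm ~ EVTYPE + YEAR, data = toy, FUN = sum)

# Step 2: sum, mean, and standard deviation of the yearly totals per event type
totals <- split(yearly$personal_harm, yearly$EVTYPE)
by_type <- data.frame(
  EVTYPE = names(totals),
  Sum    = sapply(totals, sum),
  Mean   = sapply(totals, mean),
  SD     = sapply(totals, sd),
  row.names = NULL
)
print(by_type)
```

Note that Mean is the average over the years in which an event type appears, so Sum divided by Mean gives the number of such years.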
However, R kept crashing (i.e., the session would close) when I used the dplyr package to build the damage_intermediate table. After a bit of googling, I found that the data.table package is much more efficient for this type of problem.
damage_intermediate <- data.table(storm_intermediate$EVTYPE, storm_intermediate$YEAR, storm_intermediate$PROPDMG, storm_intermediate$CROPDMG, storm_intermediate$PROPDMGEXP, storm_intermediate$CROPDMGEXP)
colnames(damage_intermediate) <- c("EVTYPE", "YEAR", "PROPDMG", "CROPDMG", "PROPDMGEXP", "CROPDMGEXP")
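# Express damage in thousands of dollars: "K" -> x1, "M" -> x1,000, anything else -> x1,000,000.
# Note: the final branch is meant for "B", but as written it also applies to every other
# exponent code (blank, "?", digits, and lowercase "k" / "m").
damage_intermediate <- damage_intermediate %>% mutate(
  PROPDMG_FIX = ifelse(PROPDMGEXP == "K",
                       PROPDMG,
                       ifelse(PROPDMGEXP == "M",
                              PROPDMG * 1000,
                              PROPDMG * 1000000)),
  CROPDMG_FIX = ifelse(CROPDMGEXP == "K",
                       CROPDMG,
                       ifelse(CROPDMGEXP == "M",
                              CROPDMG * 1000,
                              CROPDMG * 1000000)))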
economic_intermediate <- ddply(damage_intermediate, .(EVTYPE, YEAR), summarize, economic_harm = sum(PROPDMG_FIX, CROPDMG_FIX))
second_intermediate <- ddply(economic_intermediate, .(EVTYPE), summarise, Sum = sum(economic_harm), Mean = mean(economic_harm), SD = sd(economic_harm))
economic_summary <- top_n(second_intermediate %>% filter(SD >0), 5, Sum)
economic_summary <- arrange(economic_summary, desc(Sum))
The chart below shows the Top 5 causes of personal harm, defined as the sum of fatalities and injuries. On average, tornadoes inflict the most personal harm in the United States, and their yearly average is considerably higher than that of the other four event types.
p <- ggplot(data = storm_summary, mapping = aes(x=EVTYPE, y=log(Mean)))
p <- p + ylab("Log of Harm") + xlab("Event Type") + labs(title = "Personal Harm from Natural Events (1986-2011)")
p <- p + geom_point()
p <- p + geom_errorbar(aes(ymin=log(Mean)-log(2*SD), ymax=log(Mean)+log(2*SD)), width=.2)
storm_summary
##           EVTYPE   Sum      Mean        SD
## 1        TORNADO 32094 1234.3846 1199.9910
## 2 EXCESSIVE HEAT  8428  468.2222  491.3538
## 3          FLOOD  7259  382.0526 1408.4724
## 4      TSTM WIND  6863  326.8095  198.7095
## 5      LIGHTNING  6046  318.2105  113.2429
p
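It helps to spell out what the error-bar expression in the plotting code evaluates to. Since log(a) - log(b) = log(a/b) and log(a) + log(b) = log(ab), the bounds are the logs of Mean / (2 SD) and Mean × 2 SD, which is not the same as the log of Mean ± 2 SD. The check below uses the TORNADO row from the table above; the variable names are illustrative.

```r
# Mean and SD for TORNADO, copied from the summary table
m <- 1234.3846
s <- 1199.9910

lower <- log(m) - log(2 * s)   # lower bound as written in the plotting code
upper <- log(m) + log(2 * s)   # upper bound as written in the plotting code

# A difference of logs is the log of a ratio; a sum of logs is the log of a product
print(all.equal(lower, log(m / (2 * s))))
print(all.equal(upper, log(m * 2 * s)))

# For comparison, Mean - 2 * SD is negative here, so its log is undefined
print(m - 2 * s)
```

The bars therefore describe a multiplicative spread around the mean on the log scale and should be read with that in mind.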
The chart below shows the Top 5 causes of economic damage, defined as the sum of property and crop damage. On average, thunderstorm winds inflict the greatest economic damage in the United States, and their yearly average is considerably higher than that of the other four event types.
e <- ggplot(data = economic_summary, mapping = aes(x=EVTYPE, y=log(Mean)))
e <- e + ylab("Log of Average Damage") + xlab("Event Type") + labs(title = "Economic Damage from Natural Events (1986-2011)")
e <- e + geom_point()
e <- e + geom_errorbar(aes(ymin=log(Mean)-log(2*SD), ymax=log(Mean)+log(2*SD)), width=.2)
economic_summary
##               EVTYPE        Sum       Mean         SD
## 1 THUNDERSTORM WINDS 6269624104 2089874701 3527226077
## 2               HAIL  791352904   30436650  136035362
## 3        FLASH FLOOD  574662129   30245375   90736158
## 4            TORNADO  505388984   19438038   72118204
## 5          LIGHTNING  167640751    8823197   26042480
e