Author: Nathan Smith
In this paper I discuss which types of weather events are most harmful to population health and which have the greatest economic consequences. The study covers the United States exclusively. The dataset I used is the National Weather Service Storm Data.
This file is rather large, so it will take a minute to load into R. Then we’ll take a look at the structure of the file.
setwd("/Users/nathansmith/")
data <- read.csv("repdata_data_StormData.csv", header=TRUE)
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
How many unique event types are there?
length(unique(data$EVTYPE))
## [1] 985
The variables in the dataset that contain information regarding population health are:
* FATALITIES
* INJURIES
So let’s make a new dataset that contains the event type and these relevant values.
library(data.table)
PopHealth <- as.data.table(data[,c("EVTYPE", "FATALITIES", "INJURIES")])
head(PopHealth)
## EVTYPE FATALITIES INJURIES
## 1: TORNADO 0 15
## 2: TORNADO 0 0
## 3: TORNADO 0 2
## 4: TORNADO 0 2
## 5: TORNADO 0 2
## 6: TORNADO 0 6
Now we need tables that total the fatalities and the injuries for each event type, so we can see the cumulative impact across the whole record. We’ll look at the top ten of each in descending order.
fatals <- PopHealth[,sum(FATALITIES), by=EVTYPE][order(V1, decreasing=TRUE)]
head(fatals,10)
## EVTYPE V1
## 1: TORNADO 5633
## 2: EXCESSIVE HEAT 1903
## 3: FLASH FLOOD 978
## 4: HEAT 937
## 5: LIGHTNING 816
## 6: TSTM WIND 504
## 7: FLOOD 470
## 8: RIP CURRENT 368
## 9: HIGH WIND 248
## 10: AVALANCHE 224
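The aggregation above leaves the summed column with the default name V1. As a minimal sketch, assuming a small made-up table in place of PopHealth, the same data.table idiom can name the columns and total both measures in one pass. The names toy, totals, and top_fatal are illustrative only.

```r
library(data.table)

# Toy stand-in for PopHealth; the values are made up for illustration.
toy <- data.table(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 1, 3, 4, 0),
  INJURIES   = c(10, 0, 5, 1, 2)
)

# Sum both measures per event type and give the result columns explicit names.
totals <- toy[, .(fatalities = sum(FATALITIES),
                  injuries   = sum(INJURIES)),
              by = EVTYPE]

# Sort by total fatalities, largest first.
top_fatal <- totals[order(-fatalities)]
print(top_fatal)
```

Named columns make the later ordering and plotting steps easier to read than references to V1.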
injur <- PopHealth[,sum(INJURIES), by=EVTYPE][order(V1, decreasing=TRUE)]
head(injur,10)
## EVTYPE V1
## 1: TORNADO 91346
## 2: TSTM WIND 6957
## 3: FLOOD 6789
## 4: EXCESSIVE HEAT 6525
## 5: LIGHTNING 5230
## 6: HEAT 2100
## 7: ICE STORM 1975
## 8: FLASH FLOOD 1777
## 9: THUNDERSTORM WIND 1488
## 10: HAIL 1361
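The injury table lists TSTM WIND and THUNDERSTORM WIND as separate event types, and the str() output shows at least one level with a leading space (" HIGH SURF ADVISORY"), so some of the 985 levels are probably spelling variants of each other. The sketch below shows one possible clean-up step. The function name normalize_evtype is hypothetical, and treating TSTM as an abbreviation of THUNDERSTORM is an assumption, not something this analysis did.

```r
# Hypothetical helper: tidy event labels before aggregating.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))        # drop outer spaces, unify case
  x <- gsub("\\s+", " ", x, perl = TRUE)       # collapse repeated whitespace
  # Assumption: TSTM is shorthand for THUNDERSTORM.
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)
  x
}

raw <- c(" HIGH SURF ADVISORY", "TSTM WIND", "Thunderstorm  Wind", "TORNADO")
cleaned <- normalize_evtype(raw)
print(cleaned)
```

Applying a step like this before summing would merge such variants into one row, which could change the rankings reported here.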
The variables in the dataset that contain information regarding economic impact are:
* PROPDMG (i.e., property damage)
* PROPDMGEXP
* CROPDMG (i.e., crop damage)
* CROPDMGEXP
EconImpact <- as.data.table(data[c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")])
head(EconImpact)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1: TORNADO 25.0 K 0
## 2: TORNADO 2.5 K 0
## 3: TORNADO 25.0 K 0
## 4: TORNADO 2.5 K 0
## 5: TORNADO 2.5 K 0
## 6: TORNADO 2.5 K 0
PROPDMGEXP holds the code (for example K, M, or B) that scales the PROPDMG value. To decide which codes we need to handle, we first find out which ones are used most frequently.
table(EconImpact$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      h      H      K      m      M
##      4      5      1     40      1      6 424665      7  11330
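To make the frequency argument concrete, the short sketch below re-enters the counts from the table above by hand and computes what share of the non-blank codes are K, M, or B. The variable names are illustrative.

```r
# Counts copied from the table(EconImpact$PROPDMGEXP) output above.
labels <- c("", "-", "?", "+", "0", "1", "2", "3", "4", "5",
            "6", "7", "8", "B", "h", "H", "K", "m", "M")
counts <- c(465934, 1, 8, 5, 216, 25, 13, 4, 4, 28,
            4, 5, 1, 40, 1, 6, 424665, 7, 11330)

non_blank <- sum(counts[labels != ""])                 # rows with any code
kmb       <- sum(counts[labels %in% c("K", "M", "B")]) # rows coded K, M or B
share     <- kmb / non_blank

print(round(share, 4))  # 0.9992
```

Only 328 of the 436,363 coded rows use something other than uppercase K, M, or B, which is why the remaining codes have little influence on the totals.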
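We need to combine PROPDMG with PROPDMGEXP to get the cost in dollars. Only the codes B (billions), M (millions), and K (thousands) are converted; every other code, including blank and lowercase m, is given a cost of 0.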
EconImpact$PROPCOST <- with(EconImpact,
  ifelse(PROPDMGEXP == 'B', PROPDMG*1000000000,   # billions
  ifelse(PROPDMGEXP == 'M', PROPDMG*1000000,      # millions
  ifelse(PROPDMGEXP == 'K', PROPDMG*1000, 0))))   # thousands; all other codes count as 0
head(EconImpact)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP PROPCOST
## 1: TORNADO 25.0 K 0 25000
## 2: TORNADO 2.5 K 0 2500
## 3: TORNADO 25.0 K 0 25000
## 4: TORNADO 2.5 K 0 2500
## 5: TORNADO 2.5 K 0 2500
## 6: TORNADO 2.5 K 0 2500
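The nested ifelse() above matches only uppercase B, M, and K. As a hedged alternative, the sketch below uses a named lookup vector and matches case-insensitively, so the handful of lowercase m rows would also be converted. That differs from the code above, so totals computed this way would not exactly match the ones reported here. The function exp_multiplier and the toy data frame are hypothetical, and the CROPCOST column simply shows the same conversion applied to the crop variables.

```r
# Hypothetical helper: map an exponent code to a dollar multiplier.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]  # unmatched codes give NA
  mult[is.na(mult)] <- 0                       # blank, digits, symbols -> 0
  unname(mult)
}

# Toy rows, made up for illustration.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 3, 10),
  PROPDMGEXP = c("K", "m", "B", "", "?"),
  CROPDMG    = c(0, 5, 0, 1, 0),
  CROPDMGEXP = c("", "K", "", "M", "")
)

toy$PROPCOST <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$CROPCOST <- toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
print(toy)
```

A lookup vector keeps the code-to-multiplier mapping in one place, so adding or removing a code does not require another level of nesting.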
Property <- EconImpact[,sum(PROPCOST), by=EVTYPE][order(V1, decreasing=TRUE)]
head(Property,10)
## EVTYPE V1
## 1: FLOOD 144657709800
## 2: HURRICANE/TYPHOON 69305840000
## 3: TORNADO 56925660480
## 4: STORM SURGE 43323536000
## 5: FLASH FLOOD 16140811510
## 6: HAIL 15727366720
## 7: HURRICANE 11868319010
## 8: TROPICAL STORM 7703890550
## 9: WINTER STORM 6688497250
## 10: HIGH WIND 5270046260
library(ggplot2)
library(plyr)   # provides desc(), used below to order the bars from largest to smallest
First we’ll look at fatality counts and then at injury counts.
ggplot(fatals[1:10,], aes(reorder(EVTYPE, desc(V1)), V1)) +
  geom_bar(colour="black", fill="red3", width=.7, stat="identity") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) + guides(fill="none") +
  xlab("Event Type") + ylab("Total Fatalities") + ggtitle("Top Ten Event Types for Fatalities")
ggplot(injur[1:10,], aes(reorder(EVTYPE, desc(V1)), V1)) +
  geom_bar(colour="black", fill="navyblue", width=.7, stat="identity") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) + guides(fill="none") +
  xlab("Event Type") + ylab("Total Injuries") + ggtitle("Top Ten Event Types for Injuries")
Tornadoes cause by far the most fatalities (5,633) and injuries (91,346).
For property damage, floods have the largest economic impact, at roughly $145 billion in total.
ggplot(Property[1:10,], aes(reorder(EVTYPE, desc(V1)), V1)) +
  geom_bar(colour="black", fill="grey69", width=.7, stat="identity") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) + guides(fill="none") +
  xlab("Event Type") + ylab("Total Property Damage ($)") + ggtitle("Top Ten Event Types by Property Damage")
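The three bar charts differ only in the input table, the fill colour, and the labels, so they could share one helper. The sketch below is one way to do that, assuming a table with the columns EVTYPE and V1 like the ones built above. The function names top_events and plot_top_events are hypothetical, and the example rows reuse the top three fatality totals from this report in scrambled order.

```r
library(ggplot2)

# Hypothetical helper: keep the n rows with the largest values in value_col.
top_events <- function(df, value_col = "V1", n = 10) {
  df  <- as.data.frame(df)
  top <- head(df[order(-df[[value_col]]), , drop = FALSE], n)
  # Order the factor levels so bars are drawn from largest to smallest.
  top$EVTYPE <- reorder(as.character(top$EVTYPE), -top[[value_col]])
  top
}

# Hypothetical helper: bar chart of the top n event types.
plot_top_events <- function(df, y_label, title, fill_colour,
                            value_col = "V1", n = 10) {
  top <- top_events(df, value_col, n)
  ggplot(top, aes(x = EVTYPE, y = .data[[value_col]])) +
    geom_col(colour = "black", fill = fill_colour, width = 0.7) +
    theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
    labs(x = "Event Type", y = y_label, title = title)
}

# Example rows taken from the fatality table in this report.
example <- data.frame(
  EVTYPE = c("FLASH FLOOD", "TORNADO", "EXCESSIVE HEAT"),
  V1     = c(978, 5633, 1903)
)
top <- top_events(example, n = 3)
p <- plot_top_events(example, "Total Fatalities",
                     "Top Event Types for Fatalities", "red3", n = 3)
```

Because the helper orders the bars itself with reorder(), it does not depend on desc() from plyr.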