Using the data.table and ggplot2 packages in R, I inspected the overall impact of severe weather events on population health and economic damage. The data set came from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. After some data cleaning and preprocessing, I found that the top causes of injuries and fatalities were Tornadoes, Excessive Heat, Thunderstorm Wind, Flooding, and Lightning. In terms of economic damage, the most devastating events were Flooding, Hurricanes, Tornadoes, Storm Surges, and Hail. These findings are visualized in the included plots.
The following packages are used for this analysis:
require(ggplot2)
## Loading required package: ggplot2
require(data.table)
## Loading required package: data.table
Check for file in current working directory. If it doesn’t exist, download it. Then read in as data frame and convert to data table.
if(!file.exists("repdata%Fdata%2FStormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile="repdata%Fdata%2FStormData.csv.bz2")
}
stormdf<-read.csv("repdata%Fdata%2FStormData.csv.bz2")
stormdt<-as.data.table(stormdf)
rm(stormdf)
First look at the dimensions of our dataset.
dim(stormdt)
## [1] 902297 37
This table includes more variables than needed, so use data.table functionality to get the subset of desired variables. Look at the dimensions, check for missing values, then take a look at the first few observations.
stormdt[,c(1:7,9:22,29:37):=NULL]
dim(stormdt)
## [1] 902297 7
sum(is.na(stormdt))
## [1] 0
head(stormdt)
##     EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1: TORNADO          0       15    25.0          K       0
## 2: TORNADO          0        0     2.5          K       0
## 3: TORNADO          0        2    25.0          K       0
## 4: TORNADO          0        2     2.5          K       0
## 5: TORNADO          0        2     2.5          K       0
## 6: TORNADO          0        6     2.5          K       0
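Dropping columns by position works for this file, but it depends on the column order staying the same. A minimal sketch of the same subsetting done by name, on a small made-up table (the `STATE` and `REMARKS` columns stand in for the unneeded variables, and all values are invented):

```r
library(data.table)

# Toy table standing in for the full storm data (values are made up)
toy <- data.table(
  STATE      = c("AL", "AL"),
  EVTYPE     = c("TORNADO", "HAIL"),
  FATALITIES = c(0, 1),
  INJURIES   = c(15, 0),
  PROPDMG    = c(25, 2.5),
  PROPDMGEXP = c("K", "K"),
  CROPDMG    = c(0, 0),
  CROPDMGEXP = c("", ""),
  REMARKS    = c("", "")
)

# The seven variables kept in this analysis
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Remove every other column by reference
drop <- setdiff(names(toy), keep)
toy[, (drop) := NULL]

names(toy)
```

Wrapping `drop` in parentheses tells data.table to use the names stored in the variable rather than a column called `drop`.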
For processing values of property damage and crop damage, we have to clean up the data. The documentation provided gives us a key: K=Thousands, M=Millions, B=Billions. It is not unreasonable to assume that lower case values (k,m,b) are equivalent to uppercase. Likewise, it is reasonable to assume that H/h=Hundreds. However, there are other values in both the PROPDMGEXP and CROPDMGEXP variables that are unexplained in the documentation.
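The documented key can be written as a lookup vector. The following is a minimal sketch in base R with made-up inputs; `exp_key` and `to_dollars` are illustrative names that are not part of the analysis code, and the H = hundreds entry is the assumption made above, not something the documentation states:

```r
# Documented multipliers plus the assumed H = hundreds
exp_key <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Convert a damage value and its exponent code to dollars.
# Codes that are not in the key (e.g. "", "?", "5") give NA so they can be reviewed.
to_dollars <- function(dmg, exp_code) {
  mult <- exp_key[toupper(as.character(exp_code))]
  unname(dmg * mult)
}

# 25 K, 2.5 m, 1 B, and an undocumented "?" code
res <- to_dollars(c(25, 2.5, 1, 3), c("K", "m", "B", "?"))
res
```

Returning NA for unknown codes keeps the undocumented cases visible instead of silently treating them as zero.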
We can see the extra factors in the following code output.
stormdt[,.(Count=.N),by=.(PROPDMGEXP)]
##     PROPDMGEXP  Count
##  1:          K 424665
##  2:          M  11330
##  3:            465934
##  4:          B     40
##  5:          m      7
##  6:          +      5
##  7:          0    216
##  8:          5     28
##  9:          6      4
## 10:          ?      8
## 11:          4      4
## 12:          2     13
## 13:          3      4
## 14:          h      1
## 15:          7      5
## 16:          H      6
## 17:          -      1
## 18:          1     25
## 19:          8      1
stormdt[,.(Count=.N),by=.(CROPDMGEXP)]
##    CROPDMGEXP  Count
## 1:            618413
## 2:          M   1994
## 3:          K 281832
## 4:          m      1
## 5:          B      9
## 6:          ?      7
## 7:          0     19
## 8:          k     21
## 9:          2      1
Unfortunately, not every observation with an undocumented *EXP value has zero damage, so these rows cannot simply be treated as zeros:
stormdt[,.(Nonzero=sum(PROPDMG>0)),by=.(PROPDMGEXP)]
##     PROPDMGEXP Nonzero
##  1:          K  227481
##  2:          M   11319
##  3:                 76
##  4:          B      40
##  5:          m       7
##  6:          +       5
##  7:          0     209
##  8:          5      18
##  9:          6       3
## 10:          ?       0
## 11:          4       4
## 12:          2       1
## 13:          3       1
## 14:          h       1
## 15:          7       2
## 16:          H       6
## 17:          -       1
## 18:          1       0
## 19:          8       0
stormdt[,.(Nonzero=sum(CROPDMG>0)),by=.(CROPDMGEXP)]
##    CROPDMGEXP Nonzero
## 1:                  3
## 2:          M    1918
## 3:          K   20137
## 4:          m       1
## 5:          B       7
## 6:          ?       0
## 7:          0      12
## 8:          k      21
## 9:          2       0
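This code chunk cleans up the data. Observations with zero damage and a missing or undocumented multiplier are kept and relabeled with a placeholder code ("Z", later mapped to a multiplier of 0), which preserves as much data as possible. Observations with non-zero damage and an undocumented multiplier are removed, because their dollar value cannot be determined.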
stormdt[(PROPDMGEXP %in% c("","-","?","+","0","1","2","3","4","5","6","7",
"8")) & PROPDMG==0, ]$PROPDMGEXP<-"Z"
stormdt[(CROPDMGEXP %in% c("","?","0",
"2")) & CROPDMG==0,]$CROPDMGEXP<-"Z"
setkey(stormdt,PROPDMGEXP)
stormdmgdt<-stormdt[c("B","h","H","K","m","M","Z"),]
setkey(stormdmgdt,CROPDMGEXP)
stormdmgdt<-stormdmgdt[c("B","k","K","m","M","Z"),]
stormdmgdt$PROPDMGEXP<-as.factor(toupper(as.character(stormdmgdt$PROPDMGEXP)))
stormdmgdt$CROPDMGEXP<-as.factor(toupper(as.character(stormdmgdt$CROPDMGEXP)))
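# as.factor() sorts levels alphabetically (B, H, K, M, Z), so the multipliers are listed in that order
levels(stormdmgdt$PROPDMGEXP)<-c(1000000000,100,1000,1000000,0)
# CROPDMGEXP levels are B, K, M, Z
levels(stormdmgdt$CROPDMGEXP)<-c(1000000000,1000,1000000,0)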
stormdmgdt[,.(Count=.N),by=.(PROPDMGEXP)]
## PROPDMGEXP Count
## 1: 1e+09 40
## 2: 1000 424652
## 3: 0 465927
## 4: 1e+06 11336
## 5: 100 7
stormdmgdt[,.(Count=.N),by=.(CROPDMGEXP)]
## CROPDMGEXP Count
## 1: 1e+09 9
## 2: 1000 281843
## 3: 1e+06 1993
## 4: 0 618117
Now there is a new data set with no ambiguous data; zero values are still zero, while non-zero values with an unknown multiplier have been removed. Character multipliers have been replaced by numeric factors. The final step in data processing is to create numeric columns of damage in dollars.
stormdmgdt[,REALPROPDMG := PROPDMG * as.numeric(as.character(PROPDMGEXP))]
stormdmgdt[,REALCROPDMG := CROPDMG * as.numeric(as.character(CROPDMGEXP))]
stormdmgdt[,TOTALDMG := REALPROPDMG+REALCROPDMG]
To determine which event types are most harmful to population health, first define what “most harmful” means. For this analysis, we will look at two measurements: mean harm (fatalities plus injuries) per event, and total harm. Because there are 895 unique event types, we will limit our analysis to the top 5 in each category. Subset the data to find the top 5 sources of fatalities, injuries, and the sum of the two.
# Total fatalities, injuries, and their sum for each event type
totals<-stormdt[ , .(Fatalities=sum(FATALITIES), Injuries=sum(INJURIES), FatalitiesAndInjuries=sum(FATALITIES,INJURIES), Count=.N), by=EVTYPE]
# setkey() sorts ascending, so tail() returns the 5 largest
setkey(totals,Fatalities)
top5fat<-tail(totals[,.(EVTYPE,Fatalities)],5)
setkey(totals,Injuries)
top5inj<-tail(totals[,.(EVTYPE,Injuries)],5)
setkey(totals,FatalitiesAndInjuries)
top5tot<-tail(totals[,.(EVTYPE,Fatalities,Injuries)],5)
# Long format (one row per event type and harm type) for the stacked bar plot
top5tot<-melt(data=top5tot,id.vars="EVTYPE")
# Mean harm per recorded event
means<-totals[ ,.(EVTYPE=EVTYPE, Count=Count, MeanHarm=FatalitiesAndInjuries/Count)]
setkey(means,MeanHarm)
top5mean<-tail(means,5)
First look at the top mean sources of harm; that is, the event types that cause the most injuries and fatalities per event.
top5mean
## EVTYPE Count MeanHarm
## 1: TORNADOES, TSTM WIND, HAIL 1 25.00
## 2: THUNDERSTORMW 1 27.00
## 3: WILD FIRES 4 38.25
## 4: TROPICAL STORM GORDON 1 51.00
## 5: Heat Wave 1 70.00
Unfortunately, this step shows us that our data is not fully homogenized: 4 of the 5 top mean sources of harm come from single data points. Further cleaning of the dataset is outside the scope of this assignment, but we can bypass input errors by subsetting a selection of event types which occur more frequently; say at least 10 times.
means2<-totals[Count>=10, .(EventType=EVTYPE, Count=Count, MeanTotalHarm=FatalitiesAndInjuries/Count)]
setkey(means2,MeanTotalHarm)
top5mean2<-means2[ ,tail(means2[,1:3],5)]
top5mean2
## EventType Count MeanTotalHarm
## 1: HEAT WAVE 74 6.50000
## 2: GLAZE 32 6.96875
## 3: TSUNAMI 20 8.10000
## 4: EXTREME HEAT 22 11.40909
## 5: HURRICANE/TYPHOON 88 15.21591
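The frequency cut-off and the ranking can also be written as one chained expression. A minimal sketch on a made-up table (the event names and numbers are invented, and the threshold of 2 replaces the 10 used above so the toy data stays small):

```r
library(data.table)

# Made-up per-event-type totals
toy_totals <- data.table(
  EVTYPE = c("A", "B", "C", "D"),
  Harm   = c(70, 30, 12, 5),
  Count  = c(1, 3, 4, 10)
)

min_count <- 2  # illustrative threshold, not the value used in the analysis

# Keep event types seen at least min_count times, compute mean harm per event,
# then sort in descending order and take the top two rows
top_mean <- toy_totals[Count >= min_count,
                       .(EVTYPE, Count, MeanHarm = Harm / Count)
                       ][order(-MeanHarm)][1:2]
top_mean
```

Event type "A" has the highest mean harm but is recorded only once, so the threshold excludes it, which mirrors the single-observation problem seen above.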
Next we’ll plot the total fatalities and injuries from the top weather events. To arrange several plots in one figure, let’s use a function from the Cookbook for R called multiplot:
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  if (is.null(layout)) {
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    for (i in 1:numPlots) {
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
p1<-ggplot(data=top5fat, aes(x=reorder(EVTYPE,-Fatalities),y=Fatalities, fill=EVTYPE)) + geom_col() + labs(x="Event Type",y="Total Fatalities",title="Top 5 sources of fatalities") + scale_fill_brewer(palette="Set1") + theme(text=element_text(size=6), legend.position="none")
p2<-ggplot(data=top5inj, aes(x=reorder(EVTYPE,-Injuries), y=Injuries, fill=EVTYPE)) + geom_col() + labs(x="Event Type",y="Total Injuries",title="Top 5 sources of injuries") + scale_fill_brewer(palette="Accent") + theme(text=element_text(size=6), legend.position="none")
p3<-ggplot(data=top5tot, aes(x=reorder(EVTYPE,-value), y=value, fill=variable)) + geom_col() + labs(x="Event Type", y="Fatalities and Injuries", title="Top 5 sources of harm", fill="Source of harm:") + scale_fill_brewer(palette="Spectral") + theme(text=element_text(size=6)) + theme(legend.position="top")
multiplot(p1,p2,p3,cols=2,layout=matrix(c(1,2,3),nrow=1,byrow=TRUE))
These plots show that the majority of injuries and deaths come from Tornadoes, Excessive Heat, Thunderstorm Wind, Flooding, and Lightning. However, the means above show that the less common Hurricane/Typhoon, Extreme Heat, Tsunami, Glaze, and Heat Wave events are more dangerous on a per-event basis.
Further efforts to consolidate similar event types may produce slightly different results, and may be worth looking into.
We will define “greatest economic consequence” as those events whose damage and frequency cause the greatest total damage. The 10 events with the greatest economic impact are shown in the plot below.
totalecon<-stormdmgdt[ , .(Count=.N, CropDamage=sum(REALCROPDMG), PropertyDamage=sum(REALPROPDMG), TotalDamage=sum(TOTALDMG)),by=EVTYPE]
setkey(totalecon,TotalDamage)
top10econ<-totalecon[,tail(totalecon[,c(1,3,4)],10)]
top10econ<-melt(top10econ,id=1)
ggplot(data=top10econ,aes(x=EVTYPE,y=value,fill=variable)) + geom_col() + labs(x="Event Type",y="Damage (USD)",title="Economic losses from the top 10 severe weather events") + theme(axis.text.x=element_text(angle=90, hjust=1))
We see in this plot that floods are by far the most economically devastating weather event, with around $150 billion in losses. Following floods, the three most significant sources of economic loss were hurricanes, tornadoes, and storm surges.
Upon analysis of the 902297 observations, the majority of injuries and fatalities occurred during tornadoes, heat events, and thunderstorm wind. Looking at the 901962 observations that remained after cleaning the damage data shows that the majority of economic losses occurred during floods, hurricanes, and tornadoes.
There is further data cleaning that may bring new trends to light. In particular, the analysis would benefit from consolidating similar event types into a single category, e.g. “excessive heat” and “extreme heat,” or “hurricane/typhoon” and “hurricane.” However, that is a job for someone with better regex skills and more time than me.
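As a possible starting point for that consolidation, here is a minimal sketch in base R. The helper name `consolidate_evtype`, the patterns, and the target labels are hypothetical choices for illustration, not an official NOAA mapping, and they cover only a handful of labels that appear in this report:

```r
# Hypothetical helper: normalise case and whitespace, then merge a few
# near-duplicate labels. The mapping is illustrative, not exhaustive.
consolidate_evtype <- function(evtype) {
  x <- toupper(trimws(as.character(evtype)))
  x <- gsub("\\s+", " ", x)  # collapse repeated whitespace
  x[grepl("^(EXCESSIVE|EXTREME) HEAT$|^HEAT WAVE$", x)] <- "EXCESSIVE HEAT"
  x[grepl("^HURRICANE", x)] <- "HURRICANE/TYPHOON"
  x[grepl("^(TSTM|THUNDERSTORM) WINDS?$", x)] <- "THUNDERSTORM WIND"
  x
}

cleaned <- consolidate_evtype(c("Heat Wave", "EXTREME HEAT", "HURRICANE",
                                " hurricane/typhoon", "TSTM WIND", "HAIL"))
cleaned
```

Whether heat waves belong with excessive heat, or every label starting with "HURRICANE" belongs in one group, is a judgment call that would change the rankings above, so any such mapping should be checked against the event type definitions in the NOAA documentation.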