This study uses data from the NOAA Storm Database. The data is fairly messy and requires thorough cleaning. After cleaning the data, we explore the weather events that have caused the most fatalities and injuries. We find that excessive heat is the number one killer, with tornados second. In terms of injuries, tornados are the number one cause, followed by floods and excessive heat. The final section covers property damage and crop damage from weather-related events. We find floods to be the largest cause of property damage, with hurricanes second. In terms of crop damage, droughts are by far the most destructive, with floods second.
We begin by reading in the data. The data is a simple csv compressed in bz2 format. The read.csv() function will automatically decompress the file before trying to read the csv.
data <- read.csv("repdata-data-StormData.csv.bz2")
This is a rather large data set with 902297 observations and 37 variables.
dim(data)
## [1] 902297 37
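The automatic decompression described above can be checked on a small file before loading the full data set. The sketch below is only an illustration: the toy data frame and the temporary file are made up and are not part of the storm data.

```r
# Toy data frame standing in for the storm data (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write it as a bzip2-compressed csv in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading
toy_back <- read.csv(tmp)
dim(toy_back)
```

If the dimensions match the toy data frame (2 rows, 2 columns), the compressed file was read without a separate decompression step.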
Next, we will clean the data. To begin, we multiply the property damage (PROPDMG) and crop damage (CROPDMG) variables by the multipliers encoded in PROPDMGEXP and CROPDMGEXP to obtain the actual dollar amounts, which are easier to analyze.
PROPTOT <- data$PROPDMG
CROPTOT <- data$CROPDMG
for(i in 1:nrow(data)){
if(data$PROPDMGEXP[i]=='H'){PROPTOT[i]<-PROPTOT[i]*100}else
if(data$PROPDMGEXP[i]=='K'){PROPTOT[i]<-PROPTOT[i]*1000}else
if(data$PROPDMGEXP[i]=='M'){PROPTOT[i]<-PROPTOT[i]*1000000}else
if(data$PROPDMGEXP[i]=='B'){PROPTOT[i]<-PROPTOT[i]*1000000000}
if(data$CROPDMGEXP[i]=='H'){CROPTOT[i]<-CROPTOT[i]*100}else
if(data$CROPDMGEXP[i]=='K'){CROPTOT[i]<-CROPTOT[i]*1000}else
if(data$CROPDMGEXP[i]=='M'){CROPTOT[i]<-CROPTOT[i]*1000000}else
if(data$CROPDMGEXP[i]=='B'){CROPTOT[i]<-CROPTOT[i]*1000000000}
}
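The row-by-row loop can be slow on roughly 900,000 observations. As a rough alternative, the same scaling can be sketched with a named lookup vector. The helper name scale_damage and the toy values are assumptions made for illustration; like the loop, the sketch leaves any code other than H, K, M, or B unscaled.

```r
# Multipliers for the exponent codes handled by the loop
mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale damage values by their exponent code
scale_damage <- function(dmg, exp_code) {
  m <- mult[as.character(exp_code)]  # NA where the code is not H, K, M, or B
  m[is.na(m)] <- 1                   # leave such values unscaled, as the loop does
  dmg * unname(m)
}

# Toy values (not taken from the storm data)
scaled <- scale_damage(c(25, 2.5, 10, 7), c("K", "M", "", "B"))
scaled
```

Applied to the real data, this would read along the lines of scale_damage(data$PROPDMG, data$PROPDMGEXP), assuming the columns are as described above.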
The dates are not stored in an easy-to-use format, so we convert them to Date objects:
STARTDATE <- as.Date(data$BGN_DATE,format="%m/%d/%Y")
ENDDATE <- as.Date(data$END_DATE,format="%m/%d/%Y")
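To see what the conversion does, here is a small sketch on hand-written strings. The exact look of the raw values (month/day/year followed by a time) is an assumption based on the format string used above; as.Date() reads only as much of each string as the format needs.

```r
# Example strings in the month/day/year style that format = "%m/%d/%Y" expects;
# the trailing time text is an assumption about how the raw values look
raw <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00")

d <- as.Date(raw, format = "%m/%d/%Y")
d

# Date objects can be compared directly, which is what a date filter relies on
recent <- d >= as.Date("1996-01-01")
recent
```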
For this analysis, not all of the variables are necessary. Using existing variables from the data set and the created variables above, the finished data set to be used is created using the code below.
cdata <- cbind(STARTDATE,ENDDATE,data[,c(3,13,4,6,7,8,9:11,16:18,19:24,32:33)])
cdata <- cbind(cdata,PROPTOT,CROPTOT)
The aim of this study is to focus on the health and economic impact of weather events today. Several factors have likely changed since the 1950s, including building standards and public awareness and preparation, so including the older records could bias the results.
Furthermore, looking through the data, we observe an obvious change in record keeping beginning around 1996. One such discrepancy is the way in which the start time was recorded. The data also seems to be much more complete after this time.
For both of these reasons, it will be best to focus on the data from 1996 onwards:
cdata <- subset(cdata, STARTDATE > "1995-12-31")
The event type will be an important variable in our analysis. We can view all of the event types by using the count() function from the plyr package. I will not evaluate this code here as it shows 508 different event types.
count(cdata$EVTYPE)
Looking through these event types, one can see many redundancies, inconsistent capitalization, and typos. Some of the cleaning will occur as results are processed, but to simplify things at least a little, all event types will be converted to capital letters. This reduces the number of distinct event types to 430.
cdata$EVTYPE <- toupper(cdata$EVTYPE)
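The effect of the case conversion can be illustrated on a handful of labels. The labels below are made up for the example and are only meant to mimic the kind of capitalization differences described above.

```r
# Made-up event labels that differ only in capitalization
ev <- c("Tstm Wind", "TSTM WIND", "tstm wind", "Flood", "FLOOD")

# Number of distinct labels before and after converting to upper case
n_before <- length(unique(ev))
n_after <- length(unique(toupper(ev)))
c(before = n_before, after = n_after)
```

The same before-and-after count on cdata$EVTYPE is one way to check the reduction in distinct event types.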
Since the EVTYPE variable has so many problems, we will have to be careful when creating this dataset. If we only look at event types with more than 50 total fatalities or more than 50 total injuries, we are left with a manageable number of event types to combine. (evaluation omitted)
library(plyr)
F1 <- ddply(cdata, "EVTYPE", summarize, FATALITIES = sum(FATALITIES))
I1 <- ddply(cdata, "EVTYPE", summarize, INJURIES = sum(INJURIES))
numFAT <- subset(F1, FATALITIES>50)
numINJ <- subset(I1, INJURIES>50)
numFAT
numINJ
I will combine event types that are essentially the same; for example, EXTREME COLD/WIND CHILL is treated as EXTREME COLD.
for(i in 1:nrow(cdata)){
if(cdata$EVTYPE[i]=="EXTREME COLD/WIND CHILL"){
cdata$EVTYPE[i]<-"EXTREME COLD"
}
if(cdata$EVTYPE[i]=="COLD/WIND CHILL"){
cdata$EVTYPE[i]<-"EXTREME COLD"
}
if(cdata$EVTYPE[i]=="FLASH FLOOD"){
cdata$EVTYPE[i]<-"FLOOD"
}
if(cdata$EVTYPE[i]=="HURRICANE/TYPHOON"){
cdata$EVTYPE[i]<-"HURRICANE"
}
if(cdata$EVTYPE[i]=="RIP CURRENT"){
cdata$EVTYPE[i]<-"RIP CURRENTS"
}
if(cdata$EVTYPE[i]=="HIGH WIND"){
cdata$EVTYPE[i]<-"STRONG WIND"
}
if(cdata$EVTYPE[i]=="WIND"){
cdata$EVTYPE[i]<-"STRONG WIND"
}
if(cdata$EVTYPE[i]=="TSTM WIND"){
cdata$EVTYPE[i]<-"THUNDERSTORM WIND"
}
if(cdata$EVTYPE[i]=="DENSE FOG"){
cdata$EVTYPE[i]<-"FOG"
}
if(cdata$EVTYPE[i]=="HEAT"){
cdata$EVTYPE[i]<-"EXCESSIVE HEAT"
}
if(cdata$EVTYPE[i]=="HEAT WAVE"){
cdata$EVTYPE[i]<-"EXCESSIVE HEAT"
}
if(cdata$EVTYPE[i]=="TSTM WIND/HAIL"){
cdata$EVTYPE[i]<-"HAIL"
}
if(cdata$EVTYPE[i]=="WILD/FOREST FIRE"){
cdata$EVTYPE[i]<-"WILDFIRE"
}
if(cdata$EVTYPE[i]=="WINTER WEATHER"){
cdata$EVTYPE[i]<-"WINTER STORM"
}
if(cdata$EVTYPE[i]=="WINTER WEATHER MIX"){
cdata$EVTYPE[i]<-"WINTER STORM"
}
if(cdata$EVTYPE[i]=="WINTER WEATHER/MIX"){
cdata$EVTYPE[i]<-"WINTER STORM"
}
if(cdata$EVTYPE[i]=="WINTRY MIX"){
cdata$EVTYPE[i]<-"WINTER STORM"
}
}
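The same recoding can also be written without an explicit loop. The sketch below uses a named lookup vector with only a few of the pairs from the loop above; the names recode_map and recode_evtype are hypothetical. It replaces each label in a single pass, which gives the same result as the loop provided no replacement label is itself a label to be replaced.

```r
# Lookup table: names are the labels to replace, values are the replacements
# (only a few of the pairs from the loop are shown)
recode_map <- c(
  "EXTREME COLD/WIND CHILL" = "EXTREME COLD",
  "FLASH FLOOD"             = "FLOOD",
  "TSTM WIND"               = "THUNDERSTORM WIND",
  "HEAT"                    = "EXCESSIVE HEAT"
)

# Hypothetical helper: replace labels found in the table, keep the rest
recode_evtype <- function(ev) {
  ev <- as.character(ev)
  hit <- ev %in% names(recode_map)
  ev[hit] <- unname(recode_map[ev[hit]])
  ev
}

recoded <- recode_evtype(c("FLASH FLOOD", "TORNADO", "HEAT", "TSTM WIND"))
recoded
```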
The dataset should now be functional for the analysis performed in the following sections. To summarize the cleaning process:
Total property and crop damage were calculated.
Many similar weather events were aggregated.
Only observations from 1996 onwards were kept.
This section will look at the effects of different weather-related events. First, I will look at the impact of various events on health (fatalities and injuries). Second, I will focus on the economic impact (property damage and crop damage). Since we have so many different events, I will only focus on those events that do the most harm.
This section will look at the impact of the top 10 weather-related events on fatalities and injuries since 1996. The following code shows the total number of fatalities and injuries since 1996 for each of the top 10 events.
F1 <- ddply(cdata, "EVTYPE", summarize, FATALITIES = sum(FATALITIES))
I1 <- ddply(cdata, "EVTYPE", summarize, INJURIES = sum(INJURIES))
numFAT <- subset(F1, FATALITIES>150) #Subsets top 10
numINJ <- subset(I1, INJURIES>830) #Subsets top 10
numFAT <- numFAT[order(numFAT[,2], decreasing=TRUE),] #Puts them in order
numINJ <- numINJ[order(numINJ[,2], decreasing=TRUE),] #Puts them in order
numFAT
## EVTYPE FATALITIES
## 67 EXCESSIVE HEAT 2034
## 349 TORNADO 1511
## 84 FLOOD 1301
## 174 LIGHTNING 651
## 238 RIP CURRENTS 542
## 345 THUNDERSTORM WIND 371
## 273 STRONG WIND 356
## 74 EXTREME COLD 335
## 419 WINTER STORM 253
## 16 AVALANCHE 223
numINJ
## EVTYPE INJURIES
## 349 TORNADO 20667
## 84 FLOOD 8432
## 67 EXCESSIVE HEAT 7683
## 345 THUNDERSTORM WIND 5029
## 174 LIGHTNING 4141
## 419 WINTER STORM 1852
## 411 WILDFIRE 1456
## 273 STRONG WIND 1445
## 141 HURRICANE 1321
## 88 FOG 855
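The thresholds used above (150 fatalities and 830 injuries) were chosen by hand so that exactly ten event types remain. A possible alternative is to sort the totals and keep the first n rows. The sketch below uses base R with toy data; the helper name top_events is hypothetical and is not part of the analysis above.

```r
# Toy data (not from the storm data)
toy <- data.frame(
  EVTYPE     = c("A", "B", "A", "C", "B", "D"),
  FATALITIES = c(5, 2, 1, 9, 5, 0)
)

# Total fatalities per event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Hypothetical helper: the n rows with the largest value in column col
top_events <- function(df, col, n) {
  head(df[order(df[[col]], decreasing = TRUE), ], n)
}

top3 <- top_events(totals, "FATALITIES", 3)
top3
```

This avoids having to adjust the threshold whenever the data or the recoding of event types changes.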
The following plots show this data. Be sure to install the ggplot2 and gridExtra packages before running this code.
library(ggplot2)
library(gridExtra)
library(grid)
g1 <- ggplot(numFAT,aes(x=EVTYPE,y=FATALITIES))
g1 <- g1 + geom_bar(stat="identity")
g1 <- g1 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g1 <- g1 + theme(legend.position="none")
g1 <- g1 + xlab("")
g2 <- ggplot(numINJ,aes(EVTYPE,INJURIES))
g2 <- g2 + geom_bar(stat="identity")
g2 <- g2 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 <- g2 + xlab("")
grid.arrange(g1,g2,ncol=2,top="Top 10 Weather-related Events on Population Health")
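By default the bars are drawn in alphabetical order of EVTYPE rather than by size. If bars sorted by value are preferred, one option is to reorder the factor levels before plotting. This is only a sketch: it uses three rows from the fatalities table above, and the plotting step is skipped when ggplot2 is not installed.

```r
# Three rows taken from the fatalities table above
toy <- data.frame(
  EVTYPE     = c("AVALANCHE", "FLOOD", "TORNADO"),
  FATALITIES = c(223, 1301, 1511)
)

# Order the factor levels by decreasing number of fatalities
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$FATALITIES)
levels(toy$EVTYPE)

# The plotting code is unchanged; the bars follow the level order
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = EVTYPE, y = FATALITIES)) +
    geom_bar(stat = "identity")
}
```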
As we can see, the number one weather-related killer is excessive heat, followed by tornados. In terms of injuries, tornados are number one, with floods and excessive heat second and third, respectively.
This section will look at the impact of the top 10 weather-related events on property damage and crop damage. The following code shows the total amount of property damage and crop damage since 1996 from each of the top 10 events.
P1 <- ddply(cdata, "EVTYPE", summarize, PROPTOT = sum(PROPTOT))
C1 <- ddply(cdata, "EVTYPE", summarize, CROPTOT = sum(CROPTOT))
moneyPROP <- subset(P1, PROPTOT>4500000000) #Subsets top 10
moneyCROP <- subset(C1, CROPTOT>500000000) #Subsets top 10
moneyPROP <- moneyPROP[order(moneyPROP[,2], decreasing=TRUE),]#Puts them in order
moneyCROP <- moneyCROP[order(moneyCROP[,2], decreasing=TRUE),]#Puts them in order
moneyPROP
## EVTYPE PROPTOT
## 84 FLOOD 159167037460
## 141 HURRICANE 81118659010
## 270 STORM SURGE 43193536000
## 349 TORNADO 24616945710
## 109 HAIL 14639478920
## 345 THUNDERSTORM WIND 7860710880
## 411 WILDFIRE 7760449500
## 353 TROPICAL STORM 7642475550
## 273 STRONG WIND 5424909310
## 271 STORM SURGE/TIDE 4641188000
moneyCROP
## EVTYPE CROPTOT
## 51 DROUGHT 13367566000
## 84 FLOOD 6309680100
## 141 HURRICANE 5349282800
## 109 HAIL 2540725700
## 74 EXTREME COLD 1309623000
## 96 FROST/FREEZE 1094186000
## 345 THUNDERSTORM WIND 952246350
## 116 HEAVY RAIN 728169800
## 273 STRONG WIND 698814800
## 353 TROPICAL STORM 677711000
The following plots show this data.
g3 <- ggplot(moneyPROP,aes(x=EVTYPE,y=PROPTOT))
g3 <- g3 + geom_bar(stat="identity")
g3 <- g3 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3 <- g3 + theme(legend.position="none")
g3 <- g3 + ylab("Property Damage (Dollars)")
g3 <- g3 + xlab("")
g4 <- ggplot(moneyCROP,aes(EVTYPE,CROPTOT))
g4 <- g4 + geom_bar(stat="identity")
g4 <- g4 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g4 <- g4 + ylab("Crop Damage (Dollars)")
g4 <- g4 + xlab("")
grid.arrange(g3,g4,ncol=2,top="Top 10 Weather-related Events on the Economy")
Floods are the leading weather-related cause of property damage, followed by hurricanes. Droughts are the leading cause of crop damage, followed by floods.