Synopsis

This study uses data from the NOAA Storm Database. The data is fairly messy and requires thorough cleaning. After cleaning the data, we explore which weather events have caused the most fatalities and injuries since 1996. We find that excessive heat is the number one killer, with tornados second. In terms of injuries, tornados are the number one culprit, followed by floods and excessive heat. The final section covers the property damage and crop damage from weather-related events. We find floods to be the largest cause of property damage, with hurricanes second. In terms of crop damage, droughts are by far the most destructive, with floods second.

Data Processing

Loading the Data

We begin by reading in the data. The data is a simple CSV file compressed in bz2 format. The read.csv() function decompresses the file automatically as it reads it.

data <- read.csv("repdata-data-StormData.csv.bz2")

This is a rather large data set with 902297 observations and 37 variables.

dim(data)
## [1] 902297     37
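
As a quick check of the decompression behaviour described above, here is a minimal sketch that writes a tiny CSV through a bzip2 connection and reads it straight back with read.csv(). The temporary file, the object names (tmp, con, small), and the two data rows are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (throwaway temporary file).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close(con)

# read.csv() opens the compressed file directly; no manual decompression step.
small <- read.csv(tmp)
dim(small)

unlink(tmp)  # clean up the temporary file
```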

Selecting and Creating Variables

Next, we will clean the data. To begin, we multiply the property damage (PROPDMG) and crop damage (CROPDMG) variables by the multipliers encoded in PROPDMGEXP and CROPDMGEXP (H, K, M, and B for hundreds, thousands, millions, and billions) to get the actual damage in dollars, which is easier to analyze.

PROPTOT <- data$PROPDMG
CROPTOT <- data$CROPDMG
for(i in 1:nrow(data)){
        if(data$PROPDMGEXP[i]=='H'){PROPTOT[i]<-PROPTOT[i]*100}else
        if(data$PROPDMGEXP[i]=='K'){PROPTOT[i]<-PROPTOT[i]*1000}else
        if(data$PROPDMGEXP[i]=='M'){PROPTOT[i]<-PROPTOT[i]*1000000}else
        if(data$PROPDMGEXP[i]=='B'){PROPTOT[i]<-PROPTOT[i]*1000000000}

        if(data$CROPDMGEXP[i]=='H'){CROPTOT[i]<-CROPTOT[i]*100}else
        if(data$CROPDMGEXP[i]=='K'){CROPTOT[i]<-CROPTOT[i]*1000}else
        if(data$CROPDMGEXP[i]=='M'){CROPTOT[i]<-CROPTOT[i]*1000000}else
        if(data$CROPDMGEXP[i]=='B'){CROPTOT[i]<-CROPTOT[i]*1000000000}
}
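
The row-by-row loop above works, but it is slow on 902297 rows. The same idea can be sketched in vectorized form with a named lookup vector. This is only a sketch on a toy data frame with made-up values; the names toy, mult, and factor_by_row are illustrative and do not appear in the analysis. As in the loop, only the codes H, K, M, and B are scaled, and any other code leaves the value unchanged.

```r
# Toy stand-in for the PROPDMG / PROPDMGEXP columns (values are made up).
toy <- data.frame(PROPDMG    = c(2.5, 10, 1, 3, 7),
                  PROPDMGEXP = c("K", "M", "B", "H", ""),
                  stringsAsFactors = FALSE)

# Lookup table for the four exponent codes handled in the loop.
mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Index the table by code; codes missing from the table come back as NA
# and are given a factor of 1 (value left unchanged).
factor_by_row <- unname(mult[as.character(toy$PROPDMGEXP)])
factor_by_row[is.na(factor_by_row)] <- 1

toy$PROPTOT <- toy$PROPDMG * factor_by_row
toy
```

The same three lines could be applied to the crop damage columns by swapping in CROPDMG and CROPDMGEXP.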

Next, the dates are stored as text, which is not an easy-to-use format. We convert them to Date objects:

STARTDATE <- as.Date(data$BGN_DATE,format="%m/%d/%Y")
ENDDATE <- as.Date(data$END_DATE,format="%m/%d/%Y")
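
To illustrate what this conversion does, here is a small sketch on two sample strings. I am assuming the raw values look like "4/18/1950 0:00:00" (a date followed by a time); the vector names raw and parsed are illustrative. as.Date() reads only as much of each string as the format needs and ignores the trailing time.

```r
# Assumed sample values in month/day/year form with a trailing time.
raw <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00")

# Only the date part is matched by the format; the rest is ignored.
parsed <- as.Date(raw, format = "%m/%d/%Y")
parsed
class(parsed)
```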

For this analysis, not all of the variables are necessary. The code below builds the working data set from a subset of the original columns plus the variables created above.

cdata <- cbind(STARTDATE,ENDDATE,data[,c(3,13,4,6,7,8,9:11,16:18,19:24,32:33)])
cdata <- cbind(cdata,PROPTOT,CROPTOT)

Selecting and Cleaning Observations

The aim of this study is to focus on the health and economic impact of weather events today. Several factors have likely changed since the 1950s, including building standards and public awareness and preparation, so including older records could bias the results.

Furthermore, looking through the data, we observe an obvious change in record keeping beginning around 1996. One such discrepancy is the way in which the start time was recorded. The data also seems to be much more complete after this time.

For both of these reasons, it will be best to focus on the data from 1996 onwards:

cdata <- subset(cdata, STARTDATE > "1995-12-31")
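
The filter above compares a Date column with a character string, which R converts for the comparison. A slightly more explicit sketch of the same cut-off, on a toy data frame with made-up rows (toy and kept are illustrative names), shows that an event starting on 1996-01-01 is retained while one on 1995-12-31 is dropped.

```r
# Three made-up events around the cut-off date.
toy <- data.frame(STARTDATE = as.Date(c("1995-12-31", "1996-01-01", "2005-08-29")),
                  EVTYPE    = c("HAIL", "FLOOD", "HURRICANE"),
                  stringsAsFactors = FALSE)

# Keep events that start on or after 1 January 1996.
kept <- subset(toy, STARTDATE >= as.Date("1996-01-01"))
kept$EVTYPE
```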

The event type will be an important variable in our analysis. We can view all of the event types by using the count() function from the plyr package. I will not evaluate this code here as it shows 508 different event types.

library(plyr)
count(cdata$EVTYPE)

Looking through these event types, one can see many redundancies, inconsistent capitalization, and typos. Some of the cleaning will occur as results are processed, but to simplify at least a little, all event types will be converted to capital letters. This reduces the number of distinct event types to 430.

cdata$EVTYPE <- toupper(cdata$EVTYPE)
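
To see why converting to capitals reduces the number of distinct event types, here is a minimal sketch on a made-up vector (ev is an illustrative name, and the labels are invented examples of inconsistent capitalization). It uses base R's unique() rather than count().

```r
# Made-up labels that differ only in capitalization.
ev <- c("Heavy Rain", "HEAVY RAIN", "heavy rain", "Tornado", "TORNADO")

n_before <- length(unique(ev))           # distinct labels as recorded
n_after  <- length(unique(toupper(ev)))  # distinct labels after toupper()
c(before = n_before, after = n_after)
```

Running the same two length(unique(...)) calls on cdata$EVTYPE is one way to check the 508 and 430 figures quoted above.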

Since the EVTYPE variable has so many problems, we will have to be careful when creating this dataset. If we only look at event types with more than 50 total fatalities or more than 50 total injuries, we are left with a reasonable number of event types to try to combine. (evaluation omitted)

library(plyr)
F1 <- ddply(cdata, "EVTYPE", summarize, FATALITIES = sum(FATALITIES))
I1 <- ddply(cdata, "EVTYPE", summarize, INJURIES = sum(INJURIES))
numFAT <- subset(F1, FATALITIES>50)
numINJ <- subset(I1, INJURIES>50)
numFAT
numINJ

I will combine event types that are basically the same, e.g. EXTREME COLD/WIND CHILL is treated as EXTREME COLD.

for(i in 1:nrow(cdata)){
        if(cdata$EVTYPE[i]=="EXTREME COLD/WIND CHILL"){
                cdata$EVTYPE[i]<-"EXTREME COLD"
        }
        if(cdata$EVTYPE[i]=="COLD/WIND CHILL"){
                cdata$EVTYPE[i]<-"EXTREME COLD"
        }
        if(cdata$EVTYPE[i]=="FLASH FLOOD"){
                cdata$EVTYPE[i]<-"FLOOD"
        }
        if(cdata$EVTYPE[i]=="HURRICANE/TYPHOON"){
                cdata$EVTYPE[i]<-"HURRICANE"
        }
        if(cdata$EVTYPE[i]=="RIP CURRENT"){
                cdata$EVTYPE[i]<-"RIP CURRENTS"
        }
        if(cdata$EVTYPE[i]=="HIGH WIND"){
                cdata$EVTYPE[i]<-"STRONG WIND"
        }
        if(cdata$EVTYPE[i]=="WIND"){
                cdata$EVTYPE[i]<-"STRONG WIND"
        }
        if(cdata$EVTYPE[i]=="TSTM WIND"){
                cdata$EVTYPE[i]<-"THUNDERSTORM WIND"
        }
        if(cdata$EVTYPE[i]=="DENSE FOG"){
                cdata$EVTYPE[i]<-"FOG"
        }
        if(cdata$EVTYPE[i]=="HEAT"){
                cdata$EVTYPE[i]<-"EXCESSIVE HEAT"
        }
        if(cdata$EVTYPE[i]=="HEAT WAVE"){
                cdata$EVTYPE[i]<-"EXCESSIVE HEAT"
        }
        if(cdata$EVTYPE[i]=="TSTM WIND/HAIL"){
                cdata$EVTYPE[i]<-"HAIL"
        }
        if(cdata$EVTYPE[i]=="WILD/FOREST FIRE"){
                cdata$EVTYPE[i]<-"WILDFIRE"
        }
        if(cdata$EVTYPE[i]=="WINTER WEATHER"){
                cdata$EVTYPE[i]<-"WINTER STORM"
        }
        if(cdata$EVTYPE[i]=="WINTER WEATHER MIX"){
                cdata$EVTYPE[i]<-"WINTER STORM"
        }
        if(cdata$EVTYPE[i]=="WINTER WEATHER/MIX"){
                cdata$EVTYPE[i]<-"WINTER STORM"
        }
        if(cdata$EVTYPE[i]=="WINTRY MIX"){
                cdata$EVTYPE[i]<-"WINTER STORM"
        }
}
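
The recoding loop above can also be sketched without a loop by using a named lookup vector. The version below is a sketch on a made-up vector with only four of the mappings; recode_map, ev, and hit are illustrative names. Because none of the replacement labels in the loop is itself one of the labels being replaced, a single lookup pass gives the same result as the chain of if statements.

```r
# A few of the mappings from the loop, as old label -> new label.
recode_map <- c("EXTREME COLD/WIND CHILL" = "EXTREME COLD",
                "FLASH FLOOD"             = "FLOOD",
                "TSTM WIND"               = "THUNDERSTORM WIND",
                "HEAT"                    = "EXCESSIVE HEAT")

# Made-up event types; TORNADO and FLOOD have no mapping and stay as they are.
ev <- c("FLASH FLOOD", "TORNADO", "TSTM WIND", "HEAT", "FLOOD")

# Replace only the entries that appear as names in the lookup vector.
hit <- ev %in% names(recode_map)
ev[hit] <- unname(recode_map[ev[hit]])
ev
```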

The dataset should now be functional for the analysis performed in the following sections. To summarize the cleaning process:

  1. Total property and crop damage were calculated.

  2. Many similar weather events were aggregated.

  3. Only observations from 1996 onward were kept.

Results

This section will look at the effects of different weather-related events. First, I will look at the impact of various events on health (fatalities and injuries). Second, I will focus on the economic impact (property damage and crop damage). Since we have so many different events, I will only focus on those events that do the most harm.

Weather Events on Health

This section will look at the impact of the top 10 weather-related events on fatalities and injuries since 1996. The following code shows the total number of fatalities and injuries since 1996 for the top 10 events.

F1 <- ddply(cdata, "EVTYPE", summarize, FATALITIES = sum(FATALITIES))
I1 <- ddply(cdata, "EVTYPE", summarize, INJURIES = sum(INJURIES))
numFAT <- subset(F1, FATALITIES>150)  #Subsets top 10 
numINJ <- subset(I1, INJURIES>830)    #Subsets top 10 
numFAT <- numFAT[order(numFAT[,2], decreasing=TRUE),]  #Puts them in order
numINJ <- numINJ[order(numINJ[,2], decreasing=TRUE),] #Puts them in order
numFAT
##                EVTYPE FATALITIES
## 67     EXCESSIVE HEAT       2034
## 349           TORNADO       1511
## 84              FLOOD       1301
## 174         LIGHTNING        651
## 238      RIP CURRENTS        542
## 345 THUNDERSTORM WIND        371
## 273       STRONG WIND        356
## 74       EXTREME COLD        335
## 419      WINTER STORM        253
## 16          AVALANCHE        223
numINJ
##                EVTYPE INJURIES
## 349           TORNADO    20667
## 84              FLOOD     8432
## 67     EXCESSIVE HEAT     7683
## 345 THUNDERSTORM WIND     5029
## 174         LIGHTNING     4141
## 419      WINTER STORM     1852
## 411          WILDFIRE     1456
## 273       STRONG WIND     1445
## 141         HURRICANE     1321
## 88                FOG      855
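
The code above picks the top 10 by hard-coding thresholds (150 fatalities, 830 injuries) that happen to leave ten rows. An alternative sketch that selects the top rows directly is shown below. The helper top_n_rows() is hypothetical and not part of the analysis; the toy data frame reuses five of the fatality totals from the table above.

```r
# Five of the fatality totals reported above, in arbitrary order.
totals <- data.frame(EVTYPE     = c("AVALANCHE", "TORNADO", "LIGHTNING",
                                    "EXCESSIVE HEAT", "FLOOD"),
                     FATALITIES = c(223, 1511, 651, 2034, 1301),
                     stringsAsFactors = FALSE)

# Hypothetical helper: sort by one column (descending) and keep the first n rows.
top_n_rows <- function(df, column, n) {
  ranked <- df[order(df[[column]], decreasing = TRUE), ]
  head(ranked, n)
}

top3 <- top_n_rows(totals, "FATALITIES", 3)
top3$EVTYPE
```

With this approach the cut-off does not need to be retuned if the underlying data changes.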

The following plots show the top 10 fatality and injury totals. Be sure to install the ggplot2 and gridExtra packages before running this code.

library(ggplot2)
library(gridExtra)
library(grid)
g1 <- ggplot(numFAT,aes(x=EVTYPE,y=FATALITIES))
g1 <- g1 + geom_bar(stat="identity")
g1 <- g1 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g1 <- g1 + theme(legend.position="none")
g1 <- g1 + xlab("")

g2 <- ggplot(numINJ,aes(EVTYPE,INJURIES))
g2 <- g2 + geom_bar(stat="identity")
g2 <- g2 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 <- g2 + xlab("")

grid.arrange(g1,g2,ncol=2,top="Top 10 Weather-related Events on Population Health")

As we can see, the number one weather-related killer is excessive heat, followed by tornados. When looking at the number of injuries, tornados are number one, with floods and excessive heat second and third respectively.
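
One note on the plots above: ggplot2 places a character x variable in alphabetical order, so the bars do not appear in rank order. A possible fix, sketched here on a toy data frame with four of the fatality totals (numFAT_toy is an illustrative name), is to turn EVTYPE into a factor whose levels follow the totals before plotting.

```r
# Four of the fatality totals reported above, in alphabetical order.
numFAT_toy <- data.frame(EVTYPE     = c("AVALANCHE", "EXCESSIVE HEAT", "FLOOD", "TORNADO"),
                         FATALITIES = c(223, 2034, 1301, 1511),
                         stringsAsFactors = FALSE)

# Level order = descending FATALITIES; a discrete axis follows factor level order.
ranked_labels <- numFAT_toy$EVTYPE[order(numFAT_toy$FATALITIES, decreasing = TRUE)]
numFAT_toy$EVTYPE <- factor(numFAT_toy$EVTYPE, levels = ranked_labels)
levels(numFAT_toy$EVTYPE)
```

Applying the same step to numFAT and numINJ before the ggplot() calls should draw the bars from largest to smallest.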

Weather Events on the Economy

This section will look at the impact of the top 10 weather-related events on property damage and crop damage. The following code shows the total property damage and crop damage, in dollars, since 1996 from each of the top 10 events.

P1 <- ddply(cdata, "EVTYPE", summarize, PROPTOT = sum(PROPTOT))
C1 <- ddply(cdata, "EVTYPE", summarize, CROPTOT = sum(CROPTOT))
moneyPROP <- subset(P1, PROPTOT>4500000000) #Subsets top 10
moneyCROP <- subset(C1, CROPTOT>500000000)  #Subsets top 10
moneyPROP <- moneyPROP[order(moneyPROP[,2], decreasing=TRUE),]#Puts them in order
moneyCROP <- moneyCROP[order(moneyCROP[,2], decreasing=TRUE),]#Puts them in order
moneyPROP
##                EVTYPE      PROPTOT
## 84              FLOOD 159167037460
## 141         HURRICANE  81118659010
## 270       STORM SURGE  43193536000
## 349           TORNADO  24616945710
## 109              HAIL  14639478920
## 345 THUNDERSTORM WIND   7860710880
## 411          WILDFIRE   7760449500
## 353    TROPICAL STORM   7642475550
## 273       STRONG WIND   5424909310
## 271  STORM SURGE/TIDE   4641188000
moneyCROP
##                EVTYPE     CROPTOT
## 51            DROUGHT 13367566000
## 84              FLOOD  6309680100
## 141         HURRICANE  5349282800
## 109              HAIL  2540725700
## 74       EXTREME COLD  1309623000
## 96       FROST/FREEZE  1094186000
## 345 THUNDERSTORM WIND   952246350
## 116        HEAVY RAIN   728169800
## 273       STRONG WIND   698814800
## 353    TROPICAL STORM   677711000

The following plots show the top 10 property and crop damage totals.

g3 <- ggplot(moneyPROP,aes(x=EVTYPE,y=PROPTOT))
g3 <- g3 + geom_bar(stat="identity")
g3 <- g3 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3 <- g3 + theme(legend.position="none")
g3 <- g3 + ylab("Property Damage (Dollars)")
g3 <- g3 + xlab("")

g4 <- ggplot(moneyCROP,aes(EVTYPE,CROPTOT))
g4 <- g4 + geom_bar(stat="identity")
g4 <- g4 + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g4 <- g4 + ylab("Crop Damage (Dollars)")
g4 <- g4 + xlab("")

grid.arrange(g3,g4,ncol=2,top="Top 10 Weather-related Events on the Economy")

Floods are the number one weather-related cause of property damage, followed by hurricanes. Droughts are the number one cause of crop damage, followed by floods.
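
The dollar totals above run to twelve digits, which makes the tables and plot axes hard to read. A small sketch of rescaling to billions of dollars is shown below, using the three largest property damage totals from the table. The names moneyPROP_toy and PROPTOT_BN are illustrative.

```r
# The three largest property damage totals reported above (dollars).
moneyPROP_toy <- data.frame(EVTYPE  = c("FLOOD", "HURRICANE", "STORM SURGE"),
                            PROPTOT = c(159167037460, 81118659010, 43193536000),
                            stringsAsFactors = FALSE)

# Express the totals in billions of dollars, rounded to one decimal place.
moneyPROP_toy$PROPTOT_BN <- round(moneyPROP_toy$PROPTOT / 1e9, 1)
moneyPROP_toy[, c("EVTYPE", "PROPTOT_BN")]
```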