Synopsis

The objective of this study is to identify the severe weather events that have the greatest impact on the human population in terms of fatalities, injuries, and damage to property and crops, based on data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which covers 1950-2011. Because there have been significant weather changes in the world over that time, I restricted the analysis to the 10-year period from 2002 to 2011 so that the results reflect the more recent weather events with the greatest impact. According to the World Resources Institute, “The world must brace for more extreme weather”. The data show that tornado and heat had the greatest impact on the human population in terms of fatalities and injuries. For property and crop damage, the greatest impact came from flood, hurricane and storm surge.

Loading the Raw Data

The raw data (a comma-separated-value compressed file) was downloaded from this website: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2, which contains the Storm Data across the U.S. from 1950-2011.
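The download step is not shown in this report, so the following is only a minimal sketch of how it might be done. The helper `fetch_if_missing` is hypothetical, and the destination file name is an assumption chosen to match the file read in later.

```r
# URL quoted above; the local file name is assumed to match the one read later.
storm_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_file <- "repdata_data_StormData.csv.bz2"

# Hypothetical helper: download `url` to `dest` only if `dest` is not present.
# Returns TRUE when a download was performed, FALSE when the file was reused.
fetch_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  # mode = "wb" keeps the compressed file byte-for-byte intact on Windows.
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  TRUE
}
```

Calling `fetch_if_missing(storm_url, storm_file)` at the top of the analysis would avoid downloading the file again each time the report is rebuilt.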

Loading the Packages

For the data processing and analyses, I used the following packages:

library(data.table)
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
library(car)
library(ggplot2)

Data Processing

Reading in the 1950-2011 raw data

I first read in the 1950-2011 data from the raw data file that I had downloaded into my working directory, and then checked the number of observations (902297) and variables (37) in the data frame.

stormFile<-read.table("repdata_data_StormData.csv.bz2", sep=",",header=TRUE)
dim(stormFile)
## [1] 902297     37

Since the data for more recent years are more complete, I did a preliminary study of the data from 1992-2011 and decided to focus this study on the most recent 10-year period, 2002-2011. To subset the data by date, I first had to convert the BGN_DATE variable to a date variable. Some of the more significant events are recorded under multiple codes, so these were consolidated into a single code for each event. Finally, since the study is concerned with the economic consequences and the fatalities/injuries caused by the various weather events, only the 8 relevant variables were kept from the original data file.

# Parse the event start date; the time-of-day part is dropped
stormFile$BGN_DATE <- as.Date(stormFile$BGN_DATE, format = '%m/%d/%Y %H:%M:%S')

# Merge related event labels into one code per event (grep matches substrings,
# so "FLOOD" also covers "RIVER FLOOD" and "FLASH FLOOD")
stormFile[grep("TSTM WIND|THUNDERSTORM", stormFile$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
stormFile[grep("HURRICANE|TYPHOON", stormFile$EVTYPE), "EVTYPE"] <- "HURRICANE"
stormFile[grep("HEAT", stormFile$EVTYPE), "EVTYPE"] <- "HEAT"
stormFile[grep("FLOOD", stormFile$EVTYPE), "EVTYPE"] <- "FLOOD"

# Keep only the 8 variables needed for the analysis
stormFile <- select(stormFile, BGN_DATE, EVTYPE, FATALITIES:CROPDMGEXP)
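To illustrate how the pattern matching consolidates event codes, here is a small sketch on a made-up character vector. The labels are invented for illustration in the style of EVTYPE values and are not taken from the data set.

```r
# Toy event labels (made up for illustration).
evtype <- c("TSTM WIND", "THUNDERSTORM WINDS", "FLASH FLOOD",
            "RIVER FLOOD", "EXCESSIVE HEAT", "HURRICANE/TYPHOON", "TORNADO")

# grepl() matches substrings, so a short pattern such as "FLOOD"
# already covers "FLASH FLOOD" and "RIVER FLOOD".
evtype[grepl("TSTM WIND|THUNDERSTORM", evtype)] <- "THUNDERSTORM"
evtype[grepl("HURRICANE|TYPHOON", evtype)]      <- "HURRICANE"
evtype[grepl("HEAT", evtype)]                   <- "HEAT"
evtype[grepl("FLOOD", evtype)]                  <- "FLOOD"

# Count how many labels ended up under each consolidated code.
table(evtype)
```

The sketch works on a character vector. When the column is a factor, as EVTYPE is in this report, such an assignment only succeeds if the new label is already one of the factor's levels; otherwise R inserts NA with a warning.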

Selecting and Processing the 2002-2011 data

This subset of the original data set was selected to reflect the most recent impact of the weather events on property/crop damage and human fatalities/injuries during the 10-year period. After creating this subset, I checked its dimensions. It contained 453730 observations of 8 variables.

stormData1 <- subset(stormFile, stormFile$BGN_DATE>"2001-12-31") # Subset from 2002-2011
dim(stormData1)
## [1] 453730      8
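The date conversion and the subsetting step can be checked on a few rows before running them on the full file. The sketch below uses a made-up data frame and assumes the date strings follow the month/day/year pattern implied by the format string used in this report.

```r
# Toy rows (made up for illustration) with dates in month/day/year form.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "12/31/2001 0:00:00",
               "1/1/2002 0:00:00", "6/4/2002 0:00:00"),
  EVTYPE   = c("TORNADO", "FLOOD", "HAIL", "THUNDERSTORM"),
  stringsAsFactors = FALSE
)

# Parse month/day/year; as.Date() keeps only the calendar date.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")

# Keep events that began on or after 1 January 2002.
recent <- toy[toy$BGN_DATE >= as.Date("2002-01-01"), ]
```

Comparing against an explicit `as.Date()` value makes the cutoff unambiguous, although comparing a Date column with a string in year-month-day form gives the same result.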

Next, I removed all observations that had zero values in all four of “PROPDMG”, “CROPDMG”, “FATALITIES” and “INJURIES” to make the analysis more efficient, and then checked that this had indeed produced a smaller, more focused dataset by looking at its structure.

stormData1 <- filter(stormData1,PROPDMG >0 | CROPDMG > 0| FATALITIES > 0 | INJURIES> 0)
str(stormData1)
## 'data.frame':    134528 obs. of  8 variables:
##  $ BGN_DATE  : Date, format: "2002-06-04" "2002-06-04" ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 244 753 464 753 753 464 753 753 753 753 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 1 1 0 1 0 0 0 0 ...
##  $ PROPDMG   : num  2 2 3 5 2 0 4 8 1 10 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 1 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 7 7 7 7 7 1 7 7 1 7 ...

The PROPDMGEXP and CROPDMGEXP variables are assumed to encode the powers of 10 (for example, K for thousands, M for millions and B for billions) by which the PROPDMG and CROPDMG variables must be multiplied to obtain the dollar amount of property and crop damage for each weather event. The codes in PROPDMGEXP and CROPDMGEXP were therefore converted to the corresponding numeric multiplier before multiplying with the associated PROPDMG and CROPDMG.

stormData1$PROPDMG<-stormData1$PROPDMG * as.numeric(recode(toupper(stormData1$PROPDMGEXP), "''=1;'NA'=0;'0'=1;'1'=10;'2'=100;'3'=1000;'4'=10000;
     '5'=100000;'6'=1000000;'7'=10000000;'8'=100000000;'H'=100; 'B'=1000000000;'K'=1000; 'M'=1000000;else=0", as.factor.result=FALSE)) #calculate property damage $

stormData1$CROPDMG <-stormData1$CROPDMG * as.numeric(recode(toupper(stormData1$CROPDMGEXP),"''=1;'0'=1;'2'=100;'B'=1000000000;'K'=1000; 'M'=1000000;else=0", as.factor.result=FALSE)) 
dim(stormData1) # 134528 obs of 8 variables
## [1] 134528      8
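As a way to check the multiplier logic in isolation, the same mapping can be sketched with a base R lookup vector instead of `recode`. This is an alternative sketch, not the code used in this report: the helper `damage_dollars` is hypothetical, and it assumes that any unlisted code maps to 0 and that single-digit codes are powers of ten.

```r
# Multipliers for the letter codes, mirroring the recode rules above.
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a damage figure and its exponent code to dollars.
damage_dollars <- function(amount, exp_code) {
  code <- toupper(as.character(exp_code))
  mult <- unname(exp_multiplier[code])       # NA for codes not in the table
  mult[is.na(mult)] <- 0                     # assumption: unknown codes count as 0
  mult[which(code == "")] <- 1               # blank code: take the amount as-is
  digit <- grepl("^[0-9]$", code)
  mult[digit] <- 10^as.numeric(code[digit])  # assumption: digit d means 10^d
  amount * mult
}

# 2.5 with code "K" is 2,500 dollars; 3 with code "B" is 3 billion dollars.
damage_dollars(c(2.5, 3), c("K", "B"))
```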

The next step is to compute the total property/crop damage, and the total number of fatalities and injuries associated with each weather event using the following code:

data2 <- ddply(stormData1, c("EVTYPE"), summarise,    
               sumF = sum(FATALITIES),
               sumI = sum(INJURIES),
               sumP = sum(PROPDMG),
               sumC = sum(CROPDMG),
               sumTotal = (sumP + sumC))
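The same per-event totals can be reproduced on a small example, which helps confirm what each summary column contains. The sketch below uses base R's `aggregate()` rather than `ddply`, and the data frame is made up for illustration.

```r
# Toy observations (made up for illustration).
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HEAT", "TORNADO"),
  FATALITIES = c(1, 3, 0, 5, 2),
  INJURIES   = c(0, 10, 2, 4, 8),
  PROPDMG    = c(2e6, 5e5, 1e6, 0, 2.5e5),
  CROPDMG    = c(1e5, 0, 4e5, 5e4, 0),
  stringsAsFactors = FALSE
)

# Sum each measure within every event type.
totals <- aggregate(toy[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")],
                    by = list(EVTYPE = toy$EVTYPE), FUN = sum)
names(totals) <- c("EVTYPE", "sumF", "sumI", "sumP", "sumC")

# Combined property and crop damage per event type.
totals$sumTotal <- totals$sumP + totals$sumC
```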

I then ranked the weather events by their total impact on combined property/crop damage, on the total number of fatalities, and on the total number of injuries. Since there were altogether 69 different events in the final results, I selected only the top 6 events in each ranking for display in the bar plots.

data2PC <-data2[order(-data2$sumTotal,-data2$sumP),] # Check EVTYPEs that cause most damage
data02PC<-data2PC[1:6,]

data2FI <- data2[order(-data2$sumF, -data2$sumI),] # Sort EVTYPEs by most FATALITIES, ties broken by most INJURIES
data02F<- data2FI[1:6,]
data02I<- data2[order(-data2$sumI),]
data02I<-data02I[1:6,]
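The ranking and top-N selection can be wrapped in a small function so that the three rankings share one code path. This is only a sketch: `top_n_by` is a hypothetical helper, the totals are made up, and it assumes ties should be broken in descending order of the second column.

```r
# Toy per-event totals (made up for illustration).
totals <- data.frame(
  EVTYPE = c("FLOOD", "HEAT", "TORNADO", "HAIL"),
  sumF   = c(5, 9, 9, 0),
  sumI   = c(2, 4, 18, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n rows with the largest values in column `by`,
# using column `tie` (also descending) to break ties.
top_n_by <- function(df, by, tie, n = 6) {
  ranked <- df[order(-df[[by]], -df[[tie]]), ]
  head(ranked, n)
}

# HEAT and TORNADO tie on fatalities, so injuries decide their order.
top_fatal <- top_n_by(totals, by = "sumF", tie = "sumI", n = 2)
```

Using `head()` rather than indexing with `1:6` also avoids rows of NA when fewer than `n` event types are available.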

Results

Economic Consequences: Property and Crop Loss

The following code plots the total amount of damage (in billions of dollars) associated with each of the top 6 weather events.

ggplot(data02PC, aes(x=reorder(EVTYPE,-sumTotal),y=sumTotal/1e9)) + geom_bar(stat="identity",fill="blue") + labs(x="Event Type", y="Billion $")

The bar plot shows that flood caused the greatest total damage to property and crops, at more than $150 billion, with hurricane, storm surge and tornado following behind in terms of total damage.

Impact on Fatalities and Injuries

We then plotted the numbers given in the above computation for total fatalities and injuries.

ggplot(data02F, aes(x=reorder(EVTYPE,-sumF),y=sumF)) + geom_bar(stat="identity",fill="red") + labs(x="Event Type", y="Total number of Fatalities")

The code for plotting the total number of injuries is shown below:

ggplot(data02I, aes(x=reorder(EVTYPE,-sumI),y=sumI)) + geom_bar(stat="identity",fill="brown") + labs(x="Event Type", y="Total number of Injuries")

The above plots show that tornadoes caused the most fatalities, more than 1,100, followed by heat and flood. Tornadoes also caused the most injuries, more than 13,000 during the 10-year period, followed by heat and thunderstorm. These results show that tornado and heat have the greatest impact on the human population in terms of both fatalities and injuries.

In conclusion, this study of the impact of weather events across the United States has shown that, during the 10-year period from 2002 to 2011, the greatest economic damage was caused by flood and hurricane, while the greatest impact on the human population was caused by tornadoes and heat.