Synopsis

The goal of this report was to explore the economic and population health-related consequences of severe weather events described in the NOAA Storm Database and find which types of events have the biggest impact.

After cleaning the data and performing the analysis the following conclusions were reached:

Data Processing

Loading the data

Load the data

source_data <- read.csv("repdata-data-StormData.csv.bz2")

The original data file contains some fields that do not impact the analysis. We can subset the data, retaining only the event type and damage data, for faster processing

data <- source_data[,c(8,23:28)]

Data Clean-up

The EVTYPE field in the original data is very non-tidy, as it seems to have been collected from a variety of sources without any attempts at standardization. While cleaning the data completely will be time-consuming and well beyound the scope of this project, we will attempt to apply some rudimentary cleaning to improve the quality of the dataset.

Transform the factor field into character and see how many different event types are in the dataset

data <- transform(data,EVTYPE=as.character(EVTYPE))
length(unique(data$EVTYPE))
## [1] 985
#Convert everything to uppercase
data <- transform(data,EVTYPE=toupper(EVTYPE))
#Remove brackets
data <- transform(data,EVTYPE=gsub("\\(","",EVTYPE))
data <- transform(data,EVTYPE=gsub("\\)","",EVTYPE))
#Replace dots with spaces
data <- transform(data,EVTYPE=gsub("[.]"," ",EVTYPE))
#Remove leading spaces
data <- transform(data,EVTYPE=gsub("^ +","",EVTYPE))
#Replace sequences of spaces with a single space
data <- transform(data,EVTYPE=gsub(" +"," ",EVTYPE))
#Remove trailing spaces
data <- transform(data,EVTYPE=gsub(" +$","",EVTYPE))

See how many unique event types remained

length(unique(data$EVTYPE))
## [1] 876

As mentioned above, the provided dataset is very non-tidy. One of the problems with the data is that some event types correspond to a specific event (e.h. Hurricane Emily), rather than to a general event category. Since we are trying to come up with strategies for dealing with the future events, it makes sense to get rid of the events that had only happened once. This approach has an added benefit of removing some of the erroneous data as well.

#Count the number of events of each type
types <-  aggregate(list(count=data$EVTYPE), by=list(event=data$EVTYPE), FUN=length)
#Find multiple events
multiple_events <- types$event[types$count>1]
#Filter the data
data <- data[data$EVTYPE %in% multiple_events,]

Check again how many unique event types remained

length(unique(data$EVTYPE))
## [1] 439

Calculating the economic damage values

Property and crop damage representation in this dataset is divided into two fields each. One (PROPDMG and CROPDMG) with the numeric base value and another (PROPDMGEXP and CROPDMGEXP) with the exponent multiplier.

First we convert the damage exponent fields to uppercase

data <- transform(data,PROPDMGEXP=as.factor(toupper(as.character(PROPDMGEXP))))
data <- transform(data,CROPDMGEXP=as.factor(toupper(as.character(CROPDMGEXP))))

Then we can see the possible values

summary(data$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
## 465618      1      8      3    216     25     13      4      4     28 
##      6      7      8      B      H      K      M 
##      4      5      1     38      7 424566  11319
summary(data$CROPDMGEXP)
##             ?      0      2      B      K      M 
## 617999      6     18      1      9 281840   1987

After exploring the data documentation (Page 12) we can see that K corresponds to a multipler of 1000, M to 1000000 and B to 1000000000. We also assume that H is 100. Unfortunately, there are no hints as to what other values can correspond and, given the data tidyness issues, they can simply be errors. Fortunately, the multiplers described above (along with the NULL value for no multipler) account for the vast majority of records and we can filter out the remaining fields without impacting the result significantly.

data <- data[data$PROPDMGEXP %in% c("","B","H","K","M") & data$CROPDMGEXP %in% c("","B","H","K","M"),]

Then we calculate actual values for the property and crop damage and add them together to derive the total damage

data <- transform(data,PD=PROPDMG,CD=CROPDMG)
data[data$PROPDMGEXP=="H",] <- transform(data[data$PROPDMGEXP=="H",],PD=100*PROPDMG)
data[data$CROPDMGEXP=="H",] <- transform(data[data$CROPDMGEXP=="H",],CD=100*CROPDMG)
data[data$PROPDMGEXP=="K",] <- transform(data[data$PROPDMGEXP=="K",],PD=1000*PROPDMG)
data[data$CROPDMGEXP=="K",] <- transform(data[data$CROPDMGEXP=="K",],CD=1000*CROPDMG)
data[data$PROPDMGEXP=="M",] <- transform(data[data$PROPDMGEXP=="M",],PD=1000000*PROPDMG)
data[data$CROPDMGEXP=="M",] <- transform(data[data$CROPDMGEXP=="M",],CD=1000000*CROPDMG)
data[data$PROPDMGEXP=="B",] <- transform(data[data$PROPDMGEXP=="B",],PD=1000000000*PROPDMG)
data[data$CROPDMGEXP=="B",] <- transform(data[data$CROPDMGEXP=="B",],CD=1000000000*CROPDMG)
data <- transform(data,DMG=PD+CD)

Results

Population health effects

For each of the weather events the dataset contains information on both injuries and fatalities. Since the goal of our analysis is to prepare for the consequences of severe weather events and the necessary preparation measures for dealing with injuries and preventing fatalities would differ, we should consider both parameters separately, rather than aggregate them into one metric. In addition some events might be rare, but catastrophic (i.e. have a lot of injuries) while others common, but relatively benign. Looking not just at the average injuries/fatalities for the event type, but also at the total, would help us perepare for such events.

#Calculate the total fatalities for the event type
fatalities_et_sum <- aggregate(list(fatalities=data$FATALITIES),
                           by=list(event=data$EVTYPE),
                           FUN=sum,na.rm=TRUE)
#Calculate the average fatalities for the event type
fatalities_et_avg <- aggregate(list(fatalities=data$FATALITIES),
                           by=list(event=data$EVTYPE),
                           FUN=mean,na.rm=TRUE)
#Calculate the total injuries for the event type
injuries_et_sum <- aggregate(list(injuries=data$INJURIES),
                           by=list(event=data$EVTYPE),
                           FUN=sum,na.rm=TRUE)
#Calculate the average injuries for the event type
injuries_et_avg <- aggregate(list(injuries=data$INJURIES),
                           by=list(event=data$EVTYPE),
                           FUN=mean,na.rm=TRUE)
#Select top 5 event types in each category
top5_injuries_avg <- injuries_et_avg[order(injuries_et_avg$injuries,decreasing=TRUE)[1:5],]
top5_injuries_sum <- injuries_et_sum[order(injuries_et_sum$injuries,decreasing=TRUE)[1:5],]
top5_fatalities_avg <- fatalities_et_avg[order(fatalities_et_avg$fatalities,decreasing=TRUE)[1:5],]
top5_fatalities_sum <- fatalities_et_sum[order(fatalities_et_sum$fatalities,decreasing=TRUE)[1:5],]

#Plot the injuries by event type
par(mfcol=c(1,2),mar=c(8.1,4.1,4.1,2.1), oma = c(0, 0, 2, 0),cex.main=0.8)
barplot(top5_injuries_avg$injuries,space=1, main="Average",ylab="Injuries")
text(seq(1.5,9.5,by=2), par("usr")[3]-0.25, srt = 75, adj= 1, xpd = TRUE,labels = top5_injuries_avg$event, cex=0.65)
barplot(top5_injuries_sum$injuries,space=1, main="Total",ylab="Injuries")
text(seq(1.5,9.5,by=2), par("usr")[3]-0.25, srt = 75, adj= 1, xpd = TRUE,labels = top5_injuries_sum$event, cex=0.65)
mtext("Injuries from Severe Weather Events",outer=TRUE,cex=1.5)

plot of chunk unnamed-chunk-12

#Plot the fatalities by event type
par(mfcol=c(1,2),mar=c(9.1,4.1,4.1,2.1), oma = c(0, 0, 2, 0),cex.main=0.8)
barplot(top5_fatalities_avg$fatalities,space=1, main="Average",ylab="Fatalities")
text(seq(1.5,9.5,by=2), par("usr")[3]-0.25, srt = 75, adj= 1, xpd = TRUE,labels = top5_fatalities_avg$event, cex=0.65)
barplot(top5_fatalities_sum$fatalities,space=1, main="Total",ylab="Fatalities")
text(seq(1.5,9.5,by=2), par("usr")[3]-0.25, srt = 75, adj= 1, xpd = TRUE,labels = top5_fatalities_sum$event, cex=0.65)
mtext("Fatalities from Severe Weather Events",outer=TRUE,cex=1.5)

plot of chunk unnamed-chunk-12

As we can see from these graphs, tornadoes are responsible by far for the largest total number of both injuries and fatalities. On average, however, wildfires cause more injuries and record/excessive heat (especially if combined with extreme heat (which makes conceptual sense)) cause more fatalities than other event types.

Economic consequences

To assess the economic consequences of the sever weather events we will use the same approach as above (i.e. consider the average and total damages separately), but will combine the pproperty and crop damage for each event into a single damage number.

#Calculate the total damage for the event type
damage_et_sum <- aggregate(list(damage=data$DMG),
                           by=list(event=data$EVTYPE),
                           FUN=sum)
#Calculate the average damage for the event type
damage_et_avg <- aggregate(list(damage=data$DMG),
                           by=list(event=data$EVTYPE),
                           FUN=mean)

#Select top 5 event types in each category
top5_dmg_avg <- damage_et_avg[order(damage_et_avg$damage,decreasing=TRUE)[1:5],]
top5_dmg_sum <- damage_et_sum[order(damage_et_sum$damage,decreasing=TRUE)[1:5],]

#Plot the economic damage by event type
par(mfcol=c(1,2),mar=c(9.1,4.1,4.1,2.1), oma = c(0, 0, 2, 0),cex.main=0.8)
barplot(top5_dmg_avg$damage,space=1, main="Average",ylab="Damage ($)")
text(seq(1.5,9.5,by=2), par("usr")[3]-0.25, srt = 75, adj= 1, xpd = TRUE,labels = top5_dmg_avg$event, cex=0.65)
barplot(top5_dmg_sum$damage,space=1, main="Total",ylab="Damage ($)")
text(seq(1.5,9.5,by=2), par("usr")[3]-0.25, srt = 75, adj= 1, xpd = TRUE,labels = top5_dmg_sum$event, cex=0.65)
mtext("Economic damage from weather events",outer=TRUE,cex=1.5)

plot of chunk unnamed-chunk-13

As we can see from these graphs, floods are responsible for the heaviest total economic damage. On average, however, heavy rain/severe weather causes more economic damage than other event types.