Stormdata Analysis

Synopsis

It is impossible to control the behaviour of nature. However, if well prepared, the consequences of natural disaster can be minimized. In this anylisiseNational Weather Service Storm Data Documentation which includes all natural disasters from 1951-2011 was explored (965405 observations of 37 variables). The most harmful events with respect to population health and the events that had the most serious economic consequences was then extracted from the data. The purpose of this analysis is to understand the consequences of different types of natural disasters in order to be better prepared when they hit next.

library(R.utils)

## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v2.0.0 (2015-02-28) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings

library(plyr)

Data Processing

The R.utils package was useful to get the data from the web which then was read into RStudio using read.csv.

if(!file.exists("Stormdata.csv.bz2")){
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","Stormdata.csv.bz2",method="curl")
}
if(!file.exists("Stormdata.csv")) {
        bunzip2("Stormdata.csv.bz2", overwrite=F, remove=F)
}
data=read.csv("Stormdata.csv")

The data is a little messy in the raw format. First, the variables needed were extracted from the data, omitting 30 variables. Many of the eventtypes in the bariable EVTYPE have typos in them and some types are not included in the documentation of the data set. According to the Storm Data Documentation there are 46 types of events. Therefore a vector of the eventtypes was made followed by a vector of eventtypes including regular expressions to allow the events with typos to be included. Using grep, the rows that included the right eventtypes was than extracted using a for loop.

slimdata = with(data,data.frame(EVTYPE, FATALITIES, INJURIES, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP))
#remove(data)
events <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", 
            "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", 
            "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme cold/Wind Chill", 
            "Flash Flood", "Flood", "Freezing", "Frost/Freeze", "Funnel Cloud", "Hail", 
            "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon", 
            "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning", "Marine Hail", 
            "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", 
            "Seiche", "Sleet", "Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", 
            "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", 
            "Wildfire", "Winter Storm", "Winter Weather")

eventsregexpr <- c("Astronomical Low Tide|Low Tide", "Avalanche", "Blizzard", 
                        "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", 
                        "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme cold/Wind Chill|Extreme Cold|Wind Chill", 
                        "Flash Flood", "Flood", "Freezing", "Frost/Freeze|Frost|Freeze", "Funnel Cloud", 
                        "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon|Hurricane|Typhoon", 
                        "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning", "Marine Hail", 
                        "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind|Marine tstm Wind", 
                        "Rip Current", "Seiche", "Sleet", "Storm Tide", "Strong Wind", "Thunderstorm Wind|tstm wind", 
                        "Tornado|Torndao", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", 
                        "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")

Cleandata = data.frame(EVTYPE = character(0), FATALITIES = numeric(0), INJURIES = numeric(0), 
                       CROPDMG=numeric(0), CROPDMGEXP=character(0), PROPDMG=numeric(0), PROPDMGEXP=character(0))
for(i in 1:length(eventsregexpr)){
        eventsfromvector=grep(eventsregexpr[i],slimdata[,1],ignore.case=T)
        Rightrows=slimdata[eventsfromvector,]
        name=rep(events[i],nrow(Rightrows))
        Rightrows=cbind(Rightrows,name)
        Cleandata=rbind(Cleandata,Rightrows)
}

Next, the CROPDMGEXP and PROPDMGEXP variables were cleaned. These variables determined if the numbers in the CROPDMG and PROPDMG variables were in hundred thousands, millions or billions of dollars(H,K,B,M). Using the unique command the different letters and signs of the CROPDMGEXP and PROPDMGEXP variables were exlplored first:

unique(Cleandata$CROPDMGEXP)

## [1] K   M B 0 ? k m 2
## Levels:  0 2 ? B K M k m

unique(Cleandata$PROPDMGEXP)

##  [1] K   M 0 5 4 7 ? 1 B + 2 m 8 H - 3 6 h
## Levels:  + - 0 1 2 3 4 5 6 7 8 ? B H K M h m

Which lead to the use of gsub and chartr to substitute the letters with exponents of 10 e.g. from K to 3.

Cleandata$CROPDMGEXP=gsub("\\?|2|$^","0",Cleandata$CROPDMGEXP)
Cleandata$CROPDMGEXP=chartr("KkMmBb","336699",Cleandata$CROPDMGEXP)

Cleandata$PROPDMGEXP=gsub("\\-|\\+|\\?|0|1|2|3|4|5|6|7|8|$^","0",Cleandata$PROPDMGEXP)
Cleandata$PROPDMGEXP=chartr("HhKkMmBb","22336699",Cleandata$PROPDMGEXP)

After the substitution the unique command gave:

unique(Cleandata$CROPDMGEXP)

## [1] "3" "0" "6" "9"

unique(Cleandata$PROPDMGEXP)

## [1] "3" "0" "6" "9" "2"

Results

The most common events

The most common events were counted and showed in the following barplot.

occurences=table(Cleandata$name)
occurences=sort(occurences, decreasing=T)
par(cex.main=0.9,cex.lab=0.8,cex.axis=0.6,mfrow=c(1,1),oma = c(1,2,1,1))
barplot(head(occurences),las=2,ylab="Number of Occurences",las=2,col="brown",main="The Top 6 Most Common Natural Disasters in the USA 1951-2011")

Which types of events are the most harmful with respect to population health?

Using the plyrs ddply both the mean and the sum of fatalities and injuries were found to get a full picture of the events. The extracted data was then ordered and the top 6 most harmful events with respect to population health plotted in the following barplot

fatalmeandata=ddply(Cleandata,.(name),summarize,Fatalities=mean(FATALITIES))
injuredmeandata=ddply(Cleandata,.(name),summarize, Injuries=mean(INJURIES))
fataltotaldata=ddply(Cleandata,.(name),summarize,Fatalities=sum(FATALITIES))
injuredtotaldata=ddply(Cleandata,.(name),summarize, Injuries=sum(INJURIES))

fatalmeandata=fatalmeandata[order(-fatalmeandata$Fatalities),]
injuredmeandata=injuredmeandata[order(-injuredmeandata$Injuries),]
fataltotaldata=fataltotaldata[order(-fataltotaldata$Fatalities),]
injuredtotaldata=injuredtotaldata[order(-injuredtotaldata$Injuries),]

par(cex.main=0.8,cex.lab=0.6,cex.axis=0.6,mfrow=c(2,2),oma = c(1,2,1,1))
barplot(head(fatalmeandata[,2]),names.arg=head(fatalmeandata[,1]),ylab="Fatalities",las=2,col="green",main="The Top 6 Average Most Harmful Natural Disasters")
barplot(head(fataltotaldata[,2]),names.arg=head(fataltotaldata[,1]),ylab="Fatalities",las=2,col="blue",main="The Top 6 Total Most Harmful Natural Disasters")
barplot(head(injuredmeandata[,2]),names.arg=head(injuredmeandata[,1]),ylab="Injuries",las=2,col="green")
barplot(head(injuredtotaldata[,2]),names.arg=head(injuredtotaldata[,1]),ylab="Injuries",las=2,col="blue")

Across the United States, which types of events have the greatest economic consequences?

Using the plyrs ddply both the mean and the sum of crop damage and property damage were found to get a full picture of the economic consequences of the events. The extracted data was then ordered and the top 6 most expensive events were plotted in the following barplot

Cleandata$CROPDMG=as.numeric(Cleandata$CROPDMG)*10^(as.numeric(Cleandata$CROPDMGEXP))
Cleandata$PROPDMG=as.numeric(Cleandata$PROPDMG)*10^(as.numeric(Cleandata$PROPDMGEXP))

cropmeandmg=ddply(Cleandata,.(name),summarize,CROPDMG=mean(CROPDMG))
propmeandmg=ddply(Cleandata,.(name),summarize,PROPDMG=mean(PROPDMG))
croptotaldmg=ddply(Cleandata,.(name),summarize,CROPDMG=sum(CROPDMG))
proptotaldmg=ddply(Cleandata,.(name),summarize,PROPDMG=sum(PROPDMG))

cropmeandmg=cropmeandmg[order(-cropmeandmg$CROPDMG),]
propmeandmg=propmeandmg[order(-propmeandmg$PROPDMG),]
croptotaldmg=croptotaldmg[order(-croptotaldmg$CROPDMG),]
proptotaldmg=proptotaldmg[order(-proptotaldmg$PROPDMG),]

par(cex.main=0.8,cex.axis=0.6,cex.lab=0.6,mfrow=c(2,2),oma = c(1,2,1,1))
barplot(head(cropmeandmg[,2]*10^-6),names.arg=head(cropmeandmg[,1]),ylab="Crop damage (in millions)",las=2,col="red",main="The Top 6 Average Most Expensive Natural Disasters")
barplot(head(croptotaldmg[,2]*10^-6),names.arg=head(croptotaldmg[,1]),ylab="Crop damage (in millions)",las=2,col="purple",main="The Top 6 Total Most Expensive Natural Disasters")
barplot(head(propmeandmg[,2]*10^-6),names.arg=head(propmeandmg[,1]),ylab="Property damage (in millions)",las=2,col="red")
barplot(head(proptotaldmg[,2]*10^-6),names.arg=head(proptotaldmg[,1]),ylab="Property damage (in millions)",las=2,col="purple")

In the last barplot there are surprising results. The eventtype flood is the total most expensive event in terms of property damage. However, on average, Hurricane/Typhoon is the most expensive eventtype by far, flood being the fourth most expensive on average in terms of property damage. This is suspicious. Could there be a big outlier in the dataset in one event of type flood?

Cleandata[which.max(Cleandata$PROPDMG),]

##        EVTYPE FATALITIES INJURIES  CROPDMG CROPDMGEXP  PROPDMG PROPDMGEXP
## 605953  FLOOD          0        0 32500000          6 1.15e+11          9
##         name
## 605953 Flood

data[605953,]

##        STATE__         BGN_DATE    BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 605953       6 1/1/2006 0:00:00 12:00:00 AM       PST     55       NAPA
##        STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI         END_DATE
## 605953    CA  FLOOD         0         COUNTYWIDE 1/1/2006 0:00:00
##           END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 605953 07:00:00 AM          0         NA         0         COUNTYWIDE
##        LENGTH WIDTH  F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953      0     0 NA   0          0        0     115          B    32.5
##        CROPDMGEXP WFO          STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 605953          M MTR CALIFORNIA, Western               3828     12218
##        LATITUDE_E LONGITUDE_
## 605953       3828      12218
##                                                                                                                                                                                                                                                                                                                                                                                               REMARKS
## 605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
##        REFNUM
## 605953 605943

There is obviously a typo in the PROPDMGEXP variable for this event when the Remarks variable is read. There is little chance that this one event in Napa costed 115 billion dollars in property damage cost. Instead of B there should be an M. This skews the data quite a bit in the total property damage barplot. This has to be taken in to the account when the reading the previous barplot. ##Conclusion This analysis was quite informative and gave some interesting results. It is not surprising that tornados are the total most harmful events as they occur quite often. By taking the mean of the most harmful events it showed that tsunamis are the most dangerous types of events on average. In terms of crop damage hurricanes and drought were the the most expensive. In terms of property damage, hurricanes were the most expensive. Floods are also expensive but the property total damage plot is skewed by one huge outlier.