Synopsis

The Storm Data dataset contains a wide variety of different extreme weather events and their associated impact on the affected area.

We are going to examine the data set to determine which types of events (the ENVTYPE field in the dataset) have the largest impact on the health of the population, and which types of events cause the largest amount of economic damage.

Data Processing

We will begin our data processing by downloading the Storm Data file, and reading it’s content into memory. Be advised that this may take some time as the dataset is quite large as well as compressed using Bzip2.

setInternet2(use = TRUE) #fix for windows 8
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile="repdata%2Fdata%2FStormData.csv.bz2")
rawdata <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", stringsAsFactors = FALSE)

For the purposes of this document we are going to assume that “impact on the health of the population” is defined as the number of people injured or killed by the event. The Storm Data dataset has two columns related to fatalities and injuries respectively, we will examine these separately.

In a similar vein, the dataset contains two discrete values representing economic damage to the stricken area; property damage and crop damage. This distinction is not relevant for determining the overall damage done, and as such, during processing we are going to unify these values.

Overall this leaves us with only 7 of the variables in the dataset being relevant to our research:

We will begin processing by creating a new data frame containing only these values:

data <- data.frame(
    rawdata$EVTYPE, 
    rawdata$FATALITIES, 
    rawdata$INJURIES, 
    rawdata$PROPDMG, 
    rawdata$PROPDMGEXP, 
    rawdata$CROPDMG, 
    rawdata$CROPDMGEXP,
    stringsAsFactors = FALSE
)
names(data) <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP") #fix the column names for ease of use.
head(data, n=3)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0

We are left with a much smaller dataset, but it is not necessarily “tidy” yet. Because we will be performing some mathematical operations between columns, and we want to avoid running into issues involving NA values, we will replace all missing values in the numerical columns (e.g. FATALITIES, INJURIES, PROPDMG, CROPDMG) with “0”.

#We use the square bracket notation to define a subset using a logical operation
#in this case 'value == """' and then insert a 0 in the field.
data$FATALITIES[(data$FATALITIES == "")] <- 0
data$INJURIES[(data$INJURIES == "")] <- 0
data$PROPDMG[(data$PROPDMG == "")] <- 0
data$CROPDMG[(data$CROPDMG == "")] <- 0

Next we must clean the PROPDMGEXP and CROPDMGEXP columns, as turn them into a numerical value. These columns essentially represent the exponent for their associated PROPDMG and CROPDMG values, it would therefore be much more useful to have a simple integer than a “class value”.

Some of the entries already have a proper numerical exponent value listed. We can easily check all current unique values using the unique() command:

unique(data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Information on how these exponents are represented is available in the Storm Data documentation.

#first we will replace any missing fields with "0" using the same method as
#we used for the numerical fields.
data$PROPDMGEXP[(data$PROPDMGEXP == "")] <- 0
data$CROPDMGEXP[(data$CROPDMGEXP == "")] <- 0

#Now we have to transform the EXP values ( +/-/?, h/H, k/K, m/M, B) into their
#associated numerical exponent values (1, 2, 3, 6, 9 respectively)
#we need to take care as the dataset is not specific about the use of capital
#letters in these fields.

##first the PROPDMGEXP column.
#replace the +/-/? values
data$PROPDMGEXP[(data$PROPDMGEXP == "+") | (data$PROPDMGEXP == "-") | (data$PROPDMGEXP == 
    "?")] <- 1
#replace the h/H values
data$PROPDMGEXP[(data$PROPDMGEXP == "h") | (data$PROPDMGEXP == "H")] <- 2

#replace the k/K values
data$PROPDMGEXP[(data$PROPDMGEXP == "k") | (data$PROPDMGEXP == "K")] <- 3

#replace the m/M values
data$PROPDMGEXP[(data$PROPDMGEXP == "m") | (data$PROPDMGEXP == "M")] <- 6

#replace the B values
data$PROPDMGEXP[(data$PROPDMGEXP == "B")] <- 9

##Next, the CROPDMGEXP column.
#replace the +/-/? values
data$CROPDMGEXP[(data$CROPDMGEXP == "+") | (data$CROPDMGEXP == "-") | (data$CROPDMGEXP == 
    "?")] <- 1
#replace the h/H values
data$CROPDMGEXP[(data$CROPDMGEXP == "h") | (data$CROPDMGEXP == "H")] <- 2

#replace the k/K values
data$CROPDMGEXP[(data$CROPDMGEXP == "k") | (data$CROPDMGEXP == "K")] <- 3

#replace the m/M values
data$CROPDMGEXP[(data$CROPDMGEXP == "m") | (data$CROPDMGEXP == "M")] <- 6

#replace the B values
data$CROPDMGEXP[(data$CROPDMGEXP == "B")] <- 9

#convert to integer so we can perform mathematical operations with the values.
data$PROPDMGEXP <- as.integer(data$PROPDMGEXP)
data$CROPDMGEXP <- as.integer(data$CROPDMGEXP)

We will create a singular damage total for each observation by adding the values of: PROPDMG * 10^PROPDMGEXP and CROPDMG * 10^CROPDMGEXP

#calculate the values
DAMAGE <- data$PROPDMG * 10^data$PROPDMGEXP + data$CROPDMG * 10^data$CROPDMGEXP

#column bind the result to our data frame.
data2 <- cbind(data, DAMAGE)

We now have a tidy dataset on which to perform our analysis.

head(data2, n=3)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP DAMAGE
## 1 TORNADO          0       15    25.0          3       0          0  25000
## 2 TORNADO          0        0     2.5          3       0          0   2500
## 3 TORNADO          0        2    25.0          3       0          0  25000

Analysis and results.

We will first analyse the number of instances of injury and fatalities caused by the different event types, and order them by severity.

We will do this by creating two new data frames, where we take the sum of injuries and fatalities for each type of event and sort them. This is most easily done using the plyr package.

library(plyr)
## Warning: package 'plyr' was built under R version 3.2.1
#subset the injuries and fatalities
injuriesdata <- aggregate(data=data2, INJURIES~EVTYPE, FUN=sum)
fatalitiesdata <- aggregate(data=data2, FATALITIES~EVTYPE, FUN=sum)

#arrange both subsets by the instances of harm, in decreasing order.
injuriesdata <- arrange(injuriesdata, -INJURIES)
fatalitiesdata <- arrange(fatalitiesdata, -FATALITIES)

We can now easily find the top 5 most harmful types of events in respect to both fatalities and injuries.

Injuries

head(injuriesdata, n=5)
##           EVTYPE INJURIES
## 1        TORNADO    91346
## 2      TSTM WIND     6957
## 3          FLOOD     6789
## 4 EXCESSIVE HEAT     6525
## 5      LIGHTNING     5230

Fatalities

head(fatalitiesdata, n=5)
##           EVTYPE FATALITIES
## 1        TORNADO       5633
## 2 EXCESSIVE HEAT       1903
## 3    FLASH FLOOD        978
## 4           HEAT        937
## 5      LIGHTNING        816

Using the ggplot2 package we can represent this information in two bar charts:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.1
injuriesplot <- ggplot(data=head(injuriesdata, n=5), aes(x=EVTYPE, y=INJURIES)) + 
    geom_bar(stat="identity", fill="paleturquoise4") + 
    geom_text(aes(label=INJURIES), vjust=-0.2) + 
    ggtitle("Top 5 types of event by Injuries caused") + 
    theme(plot.title = element_text(lineheight=.8, face="bold"))
injuriesplot

fatalitiesplot <- ggplot(data=head(fatalitiesdata, n=5), aes(x=EVTYPE, y=FATALITIES)) + 
    geom_bar(stat="identity", fill="wheat4") + 
    geom_text(aes(label=FATALITIES), vjust=-0.2) + 
    ggtitle("Top 5 types of event by Fatalities caused") + 
    theme(plot.title = element_text(lineheight=.8, face="bold"))
fatalitiesplot

In both cases we can see that Tornadoes are by far the most likely type of event to inflict injury or death on the population. Interestingly we can also see that (relatively speaking) excessive heat is disproportionately likely to cause a fatality, likely this is because the number of possible injuries one might sustain from excessive heat without dying is small.

Economic damage

In order to analyse the economic damage done by each type of event, we will first subset and sort our data.

library(plyr)

#subset the economic damage data
economicdata <- aggregate(data=data2, DAMAGE~EVTYPE, FUN=sum)

#arrange the subsets by the amount of damage, in decreasing order.
economicdata <- arrange(economicdata, -DAMAGE)

Again we can look at the top 5 events as concerns damage done to the economy:

head(economicdata, n=5)
##              EVTYPE       DAMAGE
## 1             FLOOD 150319678257
## 2 HURRICANE/TYPHOON  71913712800
## 3           TORNADO  57362334487
## 4       STORM SURGE  43323541000
## 5              HAIL  18761221986

We can now plot this data as a bar chart, in doing so it became clear that the top 4 of event types causes significantly more damage than the next 6 types in line. To illustrate this the top 4 have been highlighted in red.

economicsplot <- ggplot(data=head(economicdata, n=15), aes(x=EVTYPE, y=DAMAGE)) + 
    geom_bar(stat="identity") + 
    theme(axis.text.x = element_text(angle=90, hjust=1, vjust=1)) + 
    geom_bar(data=head(economicdata, n=4), fill="red", stat="identity") + 
    ggtitle("Top 15 types of event by damage caused") + 
    theme(plot.title = element_text(lineheight=.8, face="bold"))

economicsplot