In this analysis I investigate various types of weather events and their consequences on: human health in terms of injury and fatality, and their economic toll measured through property and crop damage. My main hypothesis is that the events that cause the most economic damage are also the types of natural disasters that lead to the most injuries and fatalities. To test this assertion, I used data from the U.S. National Oceanic and Atmospheric Administration's storm database. These data record statistics on the characteristics of major storms and weather events. My findings indicate that the leading causes of fatalities and injuries are not the events that lead to the most financial damage. Most injuries and fatalities are the result of tornados, whereas, most of the financial damage across the United States is caused by flooding.
The data was downloaded from the class website and is a 'csv' file that comes as a 'bzip2' file. Asumming the zip file is in your working directory, the following code will read in the data.
data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), header = TRUE)
First, subset the relevant variables and only the observations that include fatalities or injuries.
subData <- subset(data, data$FATALITIES != 0 & data$INJURIES != 0, select = c("EVTYPE",
"FATALITIES", "INJURIES"))
The following code makes a data frame with the number of fatalities per disaster type then creates a vector(“fatalOrd”) with the types in a descending order which will be passed to the plot function.
fatalFrame <- na.omit(as.data.frame(tapply(subData$FATALITIES, subData$EVTYPE,
sum)))
names(fatalFrame)[1] <- "Number.Deaths"
fatalOrd <- fatalFrame[order(fatalFrame$Number.Deaths, decreasing = TRUE), ]
Show the top ten causes of fatality.
head(fatalOrd, 10)
## TORNADO EXCESSIVE HEAT LIGHTNING TSTM WIND FLASH FLOOD
## 5227 402 283 199 171
## FLOOD HIGH WIND WINTER STORM HEAT WILDFIRE
## 104 102 85 73 55
Do the same as above but for the number of injuries per disaster type.
injurFrame <- na.omit(as.data.frame(tapply(subData$INJURIES, subData$EVTYPE,
sum)))
names(injurFrame)[1] <- "Number.Injury"
injurOrd <- injurFrame[order(injurFrame$Number.Injury, decreasing = TRUE), ]
Show the top ten causes of injury.
head(injurOrd, 10)
## TORNADO EXCESSIVE HEAT FLOOD ICE STORM
## 60187 4791 2679 1720
## HEAT HURRICANE/TYPHOON BLIZZARD LIGHTNING
## 1420 1219 718 649
## TSTM WIND FLASH FLOOD
## 646 641
Plot the results.
par(mfrow = c(2, 1), font.lab = 4, cex.axis = 0.75, mar = c(4.5, 4.5, 2, 1))
barplot(fatalOrd[1:5], col = "lightblue", xlab = "", ylab = "Number of Fatalities",
main = "Most Harmful Events Across U.S.")
text(c(0.75, 2, 3, 4.3, 5.5), c(4500, 1000, 1000, 1000, 1000), labels = c("5227",
"402", "283", "199", "171"))
barplot(injurOrd[1:5], col = "red", xlab = "Event Types", ylab = "Number of Injuries")
text(c(0.75, 2, 3, 4.3, 5.5), c(50000, 10000, 10000, 10000, 10000), labels = c("60187",
"4791", "2679", "1720", "1420"))
From the plots, we can see that tornados account for most fatalities and injuries across the United States and the difference between tornados and other event types is visibly large.
First we subset the data by event type(“EVTYPE”),and by the variables that account for damage(“PROPDMG”,“PROPDMGEXP”,“CROPDMG”,“CROPDMGEXP”).
damageData <- subset(data, data$PROPDMG != 0 & data$CROPDMG != 0, select = c("EVTYPE",
"PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
Here's what the data look like.
str(damageData)
## 'data.frame': 16242 obs. of 5 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 409 786 405 408 408 786 786 834 834 812 ...
## $ PROPDMG : num 0.1 5 25 48 20 50 500 500 500 5 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 14 19 19 19 18 17 17 17 17 17 ...
## $ CROPDMG : num 10 500 1 4 10 50 50 5 50 15 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 9 7 9 9 8 7 7 7 7 7 ...
Inspecting the “PROPDMGEXP” and “CROPDMGEXP” variables, we can see that they are multi-leveled.
levels(damageData$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(damageData$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Page twelve of the 'Storm Data Documentation' which can be found at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf, explains that these letters refer to amount classifications for the corresponding observation linking this value to the appropriate damage variables. For instance, if 'PROPDMG' equals 2.1 and “PROPDMGEXP” is 'B', then the property damage for that observation is 2.1 billion. “B” stands for 'billions', “K” for 'thousands' and “M” for 'millions'. “H” is unaccounted for but if the pattern holds, then it should stand for 'hundreds'. The other symbols don't have any classifications that were mentioned in the documentation.
Checking for the proportion of observations with known classifications, shows that these known categories make up most of the observations.
nrow(damageData[damageData$PROPDMGEXP %in% c("B", "h", "H", "K", "m", "M") &
damageData$CROPDMGEXP %in% c("B", "k", "K", "m", "M"), ])/nrow(damageData)
## [1] 0.9984
Given that the unknown classifications make up a small fraction of the observations, they will be left out of the analysis.
Next, we will subset the observations that have defined classifications.
subDamDat <- damageData[damageData$PROPDMGEXP %in% c("B", "h", "H", "K", "m",
"M") & damageData$CROPDMGEXP %in% c("B", "k", "K", "m", "M"), ]
We will convert each letter to upper case then map these to their numerical value for later computation.
subDamDat$PROPDMGEXP <- toupper(subDamDat$PROPDMGEXP)
subDamDat$CROPDMGEXP <- toupper(subDamDat$CROPDMGEXP)
We'll use the 'mapvalues' function from the 'plyr' package to change the factor levels to their numerical equivalent.
library(plyr)
subDamDat$PROPDMGEXP <- mapvalues(subDamDat$PROPDMGEXP, from = c("B", "M", "K",
"H"), to = c(10^9, 10^6, 10^3, 10^2))
## The following `from` values were not present in `x`: H
subDamDat$CROPDMGEXP <- mapvalues(subDamDat$CROPDMGEXP, from = c("B", "M", "K"),
to = c(10^9, 10^6, 10^3))
subDamDat$PROPDMGEXP <- as.numeric(subDamDat$PROPDMGEXP)
subDamDat$CROPDMGEXP <- as.numeric(subDamDat$CROPDMGEXP)
Now we will create a new variable that will be the total sum of damages to both property and crops.
subDamDat <- transform(subDamDat, totalDamage = (PROPDMG * PROPDMGEXP) + (CROPDMG *
CROPDMGEXP))
Here's how the current data looks.
str(subDamDat)
## 'data.frame': 16216 obs. of 6 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 409 786 405 408 408 786 786 834 834 812 ...
## $ PROPDMG : num 0.1 5 25 48 20 50 500 500 500 5 ...
## $ PROPDMGEXP : num 1e+09 1e+06 1e+06 1e+06 1e+06 1e+03 1e+03 1e+03 1e+03 1e+03 ...
## $ CROPDMG : num 10 500 1 4 10 50 50 5 50 15 ...
## $ CROPDMGEXP : num 1e+06 1e+03 1e+06 1e+06 1e+06 1e+03 1e+03 1e+03 1e+03 1e+03 ...
## $ totalDamage: num 1.10e+08 5.50e+06 2.60e+07 5.20e+07 3.00e+07 1.00e+05 5.50e+05 5.05e+05 5.50e+05 2.00e+04 ...
head(subDamDat)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 187566 HURRICANE OPAL/HIGH WINDS 0.1 1e+09 10 1e+06
## 187571 THUNDERSTORM WINDS 5.0 1e+06 500 1e+03
## 187581 HURRICANE ERIN 25.0 1e+06 1 1e+06
## 187583 HURRICANE OPAL 48.0 1e+06 4 1e+06
## 187584 HURRICANE OPAL 20.0 1e+06 10 1e+06
## 187653 THUNDERSTORM WINDS 50.0 1e+03 50 1e+03
## totalDamage
## 187566 1.1e+08
## 187571 5.5e+06
## 187581 2.6e+07
## 187583 5.2e+07
## 187584 3.0e+07
## 187653 1.0e+05
Make a data frame that has the sum of damages separated by levels of the event type, and order it in descending order.
damageFrame <- na.omit(as.data.frame(tapply(subDamDat$totalDamage, subDamDat$EVTYPE,
sum)))
names(damageFrame)[1] <- "Sum.Damage"
damageOrd <- damageFrame[order(damageFrame$Sum.Damage, decreasing = TRUE), ]
A call to 'head' gives you the top ten event types that cause the most damage.
head(damageOrd, 10)
## FLOOD HURRICANE/TYPHOON HURRICANE RIVER FLOOD
## 1.260e+11 2.935e+10 1.050e+10 1.011e+10
## ICE STORM FLASH FLOOD HAIL TORNADO
## 5.109e+09 4.309e+09 3.814e+09 2.336e+09
## HURRICANE OPAL HIGH WIND
## 2.187e+09 1.919e+09
Finally, plot the results.
par(cex.axis = 0.7, mar = c(7.5, 4, 3, 1))
barplot(damageOrd[1:5], main = "Most Economically Damaging Events(US)", xlab = "",
ylab = "Damage (in Billions of $)", col = "orange", yaxt = "n", xaxt = "n")
axis(side = 2, at = c(0, 4e+10, 8e+10, 1.2e+11), labels = c(0, 40, 80, 120),
las = 2)
axis(side = 1, at = c(0.6, 2, 3.2, 4.4, 5.7), labels = c("FLOOD", "HURRICANE/TYPHOON",
"HURRICANE", "RIVER FLOOD", "ICE STORM"), las = 2)
This plot displays the top five most damaging events with 'FLOOD' being more than three times more damaging than the other categories.