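The goal of this analysis is to address two questions: which types of storm events are most harmful with respect to population health, and which have the greatest economic consequences?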
First, the data is processed and cleaned. Due to RAM constraints, the original dataset is reduced to the relevant columns and aggregated by event type. Then, a weighted ranking system is created to assess which event types are most harmful with respect to population health, and which event types have the greatest economic consequences. The result is one ranking of event types by harm to population health and one by economic damage.
The following code downloads the original csv.bz2 file, reads it in, and keeps only the columns necessary for the analysis.
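# create a local data directory if it does not already exist
if (!dir.exists("data")) {
  dir.create("data")
}
# download the compressed file to a temporary location;
# read.csv() decompresses bz2 files transparently
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = temp)
storm <- read.csv(temp)
unlink(temp)  # remove the temporary download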
dim(storm)
## [1] 902297 37
columns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
storm <- storm[, columns]
dim(storm)
## [1] 902297 5
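Since RAM is the constraint, one alternative is to skip the unneeded columns while reading instead of dropping them afterwards. The sketch below illustrates the idea on a small made-up CSV (the file contents and the names `toy`, `keep`, `classes` and `small` are hypothetical); it relies on the `colClasses` argument of `read.csv()`, where `"NULL"` skips a column and `NA` lets R infer the type. It has not been run against the full storm file.

```r
# Toy CSV standing in for the storm file (made-up values).
toy <- tempfile(fileext = ".csv")
writeLines(c(
  "STATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,CROPDMG,REMARKS",
  "AL,TORNADO,0,15,25,0,a",
  "AL,HAIL,0,0,2.5,1,b"
), toy)

keep <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# Read a single row just to learn the column names.
header <- names(read.csv(toy, nrows = 1))

# "NULL" skips a column entirely; NA lets read.csv() guess its type.
classes <- rep("NULL", length(header))
classes[header %in% keep] <- NA

small <- read.csv(toy, colClasses = classes)
dim(small)
unlink(toy)
```

With this approach the skipped columns are never held in memory as part of the data frame, which may matter for a file with 37 columns and about 900,000 rows.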
Then, as shown below, each variable of interest is summed by event type. The summed values are then placed into a single data frame with one row per event type.
# summing each of the four variables of interest by event type
fatalities <- aggregate(storm$FATALITIES, by = list(storm$EVTYPE), FUN = sum)
injuries <- aggregate(storm$INJURIES, by = list(storm$EVTYPE), FUN = sum)
property <- aggregate(storm$PROPDMG, by = list(storm$EVTYPE), FUN = sum)
crop <- aggregate(storm$CROPDMG, by = list(storm$EVTYPE), FUN = sum)
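# cbind() on a character column and numeric columns yields a character matrix,
# so the four numeric columns are converted back to numeric below
storm <- cbind(as.character(fatalities$Group.1), fatalities$x, injuries$x,
               property$x, crop$x)
storm <- data.frame(storm)
colnames(storm) <- c("event", "fatalities", "injuries", "property.dmg", "crop.dmg")
# as.character() first, so that factor columns convert to their printed values
storm[2:5] <- lapply(storm[2:5], as.character)
storm[2:5] <- lapply(storm[2:5], as.numeric)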
head(storm)
## event fatalities injuries property.dmg crop.dmg
## 1 HIGH SURF ADVISORY 0 0 200 0
## 2 COASTAL FLOOD 0 0 0 0
## 3 FLASH FLOOD 0 0 50 0
## 4 LIGHTNING 0 0 0 0
## 5 TSTM WIND 0 0 108 0
## 6 TSTM WIND (G45) 0 0 8 0
dim(storm)
## [1] 985 5
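The four `aggregate()` calls and the conversion back to numeric can also be collapsed into one call using the formula interface of `aggregate()`. The sketch below shows this on a small made-up data frame (`toy` and `totals` are hypothetical names); it is an alternative formulation, not the code that produced the output above. The formula interface drops rows containing `NA` by default, which may change results on data with missing values.

```r
# Made-up data with the same column names as the storm subset.
toy <- data.frame(
  EVTYPE = c("TORNADO", "HAIL", "TORNADO", "FLOOD"),
  FATALITIES = c(1, 0, 2, 0),
  INJURIES = c(10, 0, 5, 1),
  PROPDMG = c(25, 2.5, 50, 10),
  CROPDMG = c(0, 1, 0, 5)
)

# Sum all four variables by event type in a single call.
# The columns stay numeric, so no conversion step is needed.
totals <- aggregate(
  cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ EVTYPE,
  data = toy,
  FUN = sum
)
totals
```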
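In order to find the most harmful event types, I used a simple weighted ranking system. The health score of an event type is fatalities * 2 + injuries, so each fatality counts twice as much as an injury. The economic score is the sum of property damage and crop damage. These sums use the raw PROPDMG and CROPDMG values; the magnitude columns of the original file (PROPDMGEXP and CROPDMGEXP) were not kept, so the economic score is not a dollar amount.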
A column is added to the dataset representing the health score, and the most harmful events with respect to population health are shown below in descending order.
storm$health <- storm$fatalities * 2 + storm$injuries
healthranked <- storm[order(-storm$health),]
healthranked[1:10,]
## event fatalities injuries property.dmg crop.dmg health
## 834 TORNADO 5633 91346 3212258.16 100018.52 102612
## 130 EXCESSIVE HEAT 1903 6525 1460.00 494.40 10331
## 856 TSTM WIND 504 6957 1335965.61 109202.60 7965
## 170 FLOOD 470 6789 899938.48 168037.88 7729
## 464 LIGHTNING 816 5230 603351.78 3580.61 6862
## 275 HEAT 937 2100 298.50 662.70 3974
## 153 FLASH FLOOD 978 1777 1420124.59 179200.46 3733
## 427 ICE STORM 89 1975 66000.67 1688.95 2153
## 760 THUNDERSTORM WIND 133 1488 876844.17 66791.45 1754
## 972 WINTER STORM 206 1321 132720.59 1978.99 1733
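The factor of 2 for fatalities is a choice, and the ranking may depend on it. As a rough illustration, the sketch below recomputes the score for the top five rows of the table above with an adjustable weight. The helper `health_score()` and the weight of 10 are hypothetical and only for comparison; a ranking over all 985 event types could differ from what these five rows show.

```r
# Hypothetical helper: health score with an adjustable fatality weight.
health_score <- function(fatalities, injuries, weight = 2) {
  fatalities * weight + injuries
}

# Totals copied from the top five rows of the health ranking above.
top <- data.frame(
  event = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING"),
  fatalities = c(5633, 1903, 504, 470, 816),
  injuries = c(91346, 6525, 6957, 6789, 5230)
)

# A weight of 2 reproduces the health column of the report.
top$health <- health_score(top$fatalities, top$injuries)

# A heavier fatality weight (10 is an arbitrary choice) for comparison.
top$health.w10 <- health_score(top$fatalities, top$injuries, weight = 10)

top[order(-top$health.w10), ]
```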
The same is done with respect to economic consequences and the most economically harmful events are shown below.
storm$econ <- storm$property.dmg + storm$crop.dmg
econranked <- storm[order(-storm$econ),]
econranked[1:10,]
## event fatalities injuries property.dmg crop.dmg health econ
## 834 TORNADO 5633 91346 3212258.2 100018.52 102612 3312276.7
## 153 FLASH FLOOD 978 1777 1420124.6 179200.46 3733 1599325.1
## 856 TSTM WIND 504 6957 1335965.6 109202.60 7965 1445168.2
## 244 HAIL 15 1361 688693.4 579596.28 1391 1268289.7
## 170 FLOOD 470 6789 899938.5 168037.88 7729 1067976.4
## 760 THUNDERSTORM WIND 133 1488 876844.2 66791.45 1754 943635.6
## 464 LIGHTNING 816 5230 603351.8 3580.61 6862 606932.4
## 786 THUNDERSTORM WINDS 64 908 446293.2 18684.93 1036 464978.1
## 359 HIGH WIND 248 1137 324731.6 17283.21 1633 342014.8
## 972 WINTER STORM 206 1321 132720.6 1978.99 1733 134699.6
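In the raw storm file, PROPDMG and CROPDMG are accompanied by magnitude columns (PROPDMGEXP and CROPDMGEXP) that this analysis does not read. The sketch below shows, on made-up rows, how dollar amounts might be derived before summing. It assumes that only the codes K, M and B are meaningful (thousands, millions, billions) and treats every other code as a multiplier of 1. That fallback is an assumption, and the names `toy`, `multiplier` and `prop.usd` are hypothetical.

```r
# Made-up rows; PROPDMGEXP holds the magnitude code for PROPDMG.
toy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "HAIL"),
  PROPDMG = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  stringsAsFactors = FALSE
)

# Assumed mapping from magnitude code to multiplier.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; unknown or empty codes come back as NA.
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1  # assumption: unrecognized codes mean "no scaling"

toy$prop.usd <- toy$PROPDMG * mult

# Sum the scaled amounts by event type.
aggregate(prop.usd ~ EVTYPE, data = toy, FUN = sum)
```

Applying such a step would likely change the economic ranking, since a single event recorded in billions outweighs many events recorded in thousands.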
The bar plots below compare the ten highest-ranked event types under each score.
library(ggplot2)
g <- ggplot(data = healthranked[1:10, ], aes(x = reorder(event, -health), y = health)) +
  geom_bar(stat = "identity") +
  xlab("event type") +
  ylab("health score") +
  ggtitle("10 Most Harmful Health Events") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
g
g <- ggplot(data = econranked[1:10, ], aes(x = reorder(event, -econ), y = econ)) +
  geom_bar(stat = "identity") +
  xlab("event type") +
  ylab("total economic damage (PROPDMG + CROPDMG, raw units)") +
  ggtitle("10 Most Harmful Economic Events") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
g
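In conclusion, tornadoes are by far the most harmful event type on both measures. Their health score (102612) is roughly ten times that of the next event type, excessive heat (10331), and their economic score (3312276.7) is about twice that of flash floods (1599325.1).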