In this assignment, I analyzed natural-event data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. First, I read in the data and cleaned up some event types by consulting the codebook. Then I used the aggregate function to sum fatalities, injuries, property damage, and crop damage by event type. After processing the data, I identified the events most harmful to human health and the events that cause the most economic damage. The results suggest that tornadoes, thunderstorms, floods, and excessive heat are the most harmful events to human health, while floods, hurricanes, tornadoes, and storms have the greatest impact on property and crops.
This section explains how the raw data was prepared for analysis. The first step was reading in the data from the original .csv file, and the second was loading appropriate packages for analyses.
setwd("/Users/ericweber/Desktop/NewEducation/rstuff/reprod")
data<-read.csv(bzfile("StormData.csv.bz2"))
library(ggplot2)
library(car)
After loading the original file, I examined the dimensions, the first lines of the data, and the column names to get a feel for what steps might be necessary to prepare the data for analysis. Part of this step involved examining the proportion of missing data, which was roughly 5 percent of all cells. With a large data set, this percentage was not an immediate concern.
dim(data)
## [1] 902297 37
head(data, n = 5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
mean(is.na(data))
## [1] 0.05229737
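The single figure above averages over every cell in the table, so it cannot show whether the missing values are spread evenly or concentrated in a few columns. A minimal sketch of a per-column check is given below. It uses a small made-up data frame (`toy_na` is hypothetical and is not the storm data), so the numbers are illustrative only.

```r
# Hypothetical toy data frame; values are illustrative only.
toy_na <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Overall proportion of missing cells (the same kind of figure as
# mean(is.na(data)) in the report)
overall_na <- mean(is.na(toy_na))

# Proportion of missing values within each column, largest first
na_by_column <- colMeans(is.na(toy_na))
print(sort(na_by_column, decreasing = TRUE))
```

Applied to the storm data, the same call would be `colMeans(is.na(data))`. I have not run it here, so no per-column results are reported.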
Given the parameters of the assignment and the questions at hand, I identified variables that would be relevant for the analyses, including event type, injuries, fatalities, and economic loss in the form of crop and property damage. However, the crop and property damage values were recorded in different units, with the unit for each value stored in a separate column (PROPDMGEXP and CROPDMGEXP). Thus, the next step involved converting the crop and property damage amounts to US dollars for ease of comparison. Blank entries and the symbols "+", "?", and "-" were assigned a multiplier of 1.
data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$PROPDMGEXP[data$PROPDMGEXP == "" | data$PROPDMGEXP == "+" | data$PROPDMGEXP == "?" | data$PROPDMGEXP == "-"] <- "1"
data$PROPDMGEXP[data$PROPDMGEXP == "H" | data$PROPDMGEXP == "h"] <- "100"
data$PROPDMGEXP[data$PROPDMGEXP == "K" | data$PROPDMGEXP == "k"] <- "1000"
data$PROPDMGEXP[data$PROPDMGEXP == "M" | data$PROPDMGEXP == "m"] <- "1000000"
data$PROPDMGEXP[data$PROPDMGEXP == "B" | data$PROPDMGEXP == "b"] <- "1000000000"
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$PROPDMGUSD <- data$PROPDMG * data$PROPDMGEXP
data$CROPDMGEXP <- as.character(data$CROPDMGEXP)
data$CROPDMGEXP[data$CROPDMGEXP == "" | data$CROPDMGEXP == "?"] <- "1"
data$CROPDMGEXP[data$CROPDMGEXP == "K" | data$CROPDMGEXP == "k"] <- "1000"
data$CROPDMGEXP[data$CROPDMGEXP == "M" | data$CROPDMGEXP == "m"] <- "1000000"
data$CROPDMGEXP[data$CROPDMGEXP == "B" | data$CROPDMGEXP == "b"] <- "1000000000"
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$CROPDMGUSD <- data$CROPDMG * data$CROPDMGEXP
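The symbol-to-multiplier mapping above can also be written as a lookup table, which keeps the rules in one place. The sketch below is an alternative formulation, not the code used for the results in this report. The helper name `exp_to_multiplier` is hypothetical, and it makes one assumption that differs from the code above: every symbol not in the table, including digit codes, is treated as a multiplier of 1.

```r
# Hypothetical helper (not part of the original analysis): map an exponent
# symbol to a numeric multiplier via a named lookup vector.
exp_to_multiplier <- function(symbol) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  multiplier <- unname(lookup[toupper(as.character(symbol))])
  # Assumption: anything not in the table (blank, "?", "+", "-", digits)
  # gets a multiplier of 1. The report's code instead passes digit codes
  # through as.numeric() and uses them as literal multipliers.
  multiplier[is.na(multiplier)] <- 1
  multiplier
}

# Usage on a few illustrative symbols
toy_multipliers <- exp_to_multiplier(c("K", "m", "", "?", "B", "h"))
print(toy_multipliers)
```

Because the two approaches treat digit codes differently, swapping this helper into the analysis could change the totals slightly.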
Prior to doing the analysis, I created a subset of the original data focusing only on the variables identified as relevant: event type, fatalities, injuries, property damage, and crop damage. The subset consisted of the sums of fatalities, injuries, property damage, and crop damage for each event type, the basic structure of which is shown below. I also created two variables (combineHealth and combineEcon) that represented the sum of fatalities and injuries, and the sum of the two types of damage, respectively.
merged <- aggregate(cbind(FATALITIES, INJURIES, PROPDMGUSD, CROPDMGUSD) ~ EVTYPE, data = data, FUN = sum)
head(merged, n=5)
## EVTYPE FATALITIES INJURIES PROPDMGUSD CROPDMGUSD
## 1 HIGH SURF ADVISORY 0 0 200000 0
## 2 COASTAL FLOOD 0 0 0 0
## 3 FLASH FLOOD 0 0 50000 0
## 4 LIGHTNING 0 0 0 0
## 5 TSTM WIND 0 0 8100000 0
merged$combineHealth <- merged$FATALITIES + merged$INJURIES
merged$combineEcon <- merged$PROPDMGUSD + merged$CROPDMGUSD
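To make the aggregation step concrete, the sketch below repeats the same pattern on a small made-up table (`toy_events` is hypothetical, not drawn from the storm database). It also illustrates one property of the formula interface worth keeping in mind: by default, rows containing a missing value are dropped before summing.

```r
# Hypothetical toy events; not taken from the storm database.
toy_events <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "TORNADO", "TORNADO", "HAIL"),
  FATALITIES = c(1, 2, 5, NA, 0),
  INJURIES   = c(0, 4, 10, 3, 1)
)

# Sum both outcome columns within each event type.
# Rows containing NA are dropped by the formula method's default
# na.action (na.omit), so the fourth row does not contribute at all.
toy_sums <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                      data = toy_events, FUN = sum)

# Combined health impact, mirroring combineHealth in the report
toy_sums$combineHealth <- toy_sums$FATALITIES + toy_sums$INJURIES
print(toy_sums)
```

In the toy table the second TORNADO row has 3 injuries, but they are excluded because its fatality count is missing.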
This description has explained how I transformed the raw data to a data set consisting of relevant variables and how I created new variables to represent economic and human impact. In the results section, I describe how I made use of this transformed data set to answer the research questions.
The results section focuses on the two main questions for the analyses. First, I examine which types of events are most harmful with respect to population health, measured by injuries and fatalities. Second, I examine which types of events are most harmful with respect to economic impact, measured by property and crop damage.
I used two approaches to examine which types of events are most harmful to population health. First, I focused only on fatalities. Second, I focused on both fatalities and injuries. To accomplish the first part, I ordered the event types by number of fatalities.
To accomplish the second part, I created two subsets from an ordered set of event types. The data were sorted in decreasing order by the combined number of fatalities and injuries, and the top 15 event types were kept. This was done before reshaping the data for plotting.
combineHealth <- merged[order(merged$combineHealth, decreasing = TRUE), ][1:15, ]
combineFatalities<- combineHealth[,c(1,2)]
names(combineFatalities)[2]<- "PERSONS"
combineFatalities$NAME <- "FATALITIES"
combineInjuries<- combineHealth[,c(1,3)]
names(combineInjuries)[2]<- "PERSONS"
combineInjuries$NAME<- "INJURIES"
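The reshaping above stacks two columns into a single PERSONS column with a NAME label, so that one plot can distinguish the two measures by fill color. A sketch of the same idea on a made-up table is shown below. `wide`, `measures`, and `long` are hypothetical names. Unlike the code above, this version selects columns by name rather than by position.

```r
# Hypothetical top-events table in wide form; values are illustrative only.
wide <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD"),
  FATALITIES = c(50, 20),
  INJURIES   = c(400, 30)
)

# Stack each measure column into a common PERSONS column, tagging rows
# with the measure's name so the two can be told apart when plotting.
measures <- c("FATALITIES", "INJURIES")
long <- do.call(rbind, lapply(measures, function(m) {
  data.frame(EVTYPE = wide$EVTYPE, PERSONS = wide[[m]], NAME = m)
}))
print(long)
```

Selecting by name means the result does not depend on the column order of the aggregated table.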
I created two plots. The first focuses only on fatalities; the second focuses on both fatalities and injuries. The first plot, below, shows fatalities by event type, in decreasing order. It should be noted that not all event types are displayed, only the 15 with the highest combined number of fatalities and injuries, ordered here by fatalities. The figure suggests that tornadoes, flooding, heat, and wind are the most destructive event types as measured by number of fatalities.
combineFatalitiesPlot<-transform(combineFatalities, EVTYPE = reorder(EVTYPE, -PERSONS))
# Bar chart of fatalities by event type (ggplot2 was loaded earlier)
ggplot(combineFatalitiesPlot, aes(x = EVTYPE, y = PERSONS, fill = NAME)) +
  geom_bar(stat = "identity") +
  labs(title = "Fatalities", x = "", y = "Persons") +
  scale_fill_discrete("") +
  theme(axis.text.x = element_text(angle = 90))
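The bars appear in decreasing order because the event types are stored as a factor whose levels were reordered by the negated counts. The sketch below shows that mechanism in isolation on a made-up table (`toy_bars` is hypothetical). Without the reordering step, the event types would be plotted in alphabetical order.

```r
# Hypothetical counts; values are illustrative only.
toy_bars <- data.frame(
  EVTYPE  = c("HAIL", "TORNADO", "FLOOD"),
  PERSONS = c(5, 90, 40)
)

# Default factor levels are alphabetical
default_levels <- levels(factor(toy_bars$EVTYPE))

# reorder() sorts levels by a score in increasing order, so negating
# PERSONS puts the largest count first
toy_bars$EVTYPE <- reorder(toy_bars$EVTYPE, -toy_bars$PERSONS)
ordered_levels <- levels(toy_bars$EVTYPE)

print(default_levels)
print(ordered_levels)
```

When an event type appears in several rows, `reorder()` uses the mean of the scores for that type by default.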
The second plot, below, shows fatalities and injuries by event type, in decreasing order. It should be noted that not all event types are displayed, only those with the highest combined number of fatalities and injuries. The analysis suggests that tornadoes, floods, heat, and wind are the event types with the most combined injuries and fatalities.
combineInjFatPlot<- rbind(combineFatalities, combineInjuries)
combineInjFatPlot<- transform(combineInjFatPlot, EVTYPE = reorder(EVTYPE, -PERSONS))
# Stacked bar chart of fatalities and injuries by event type
ggplot(combineInjFatPlot, aes(x = EVTYPE, y = PERSONS, fill = NAME)) +
  geom_bar(stat = "identity") +
  labs(title = "Fatalities & Injuries Combined", x = "", y = "Persons") +
  scale_fill_discrete("") +
  theme(axis.text.x = element_text(angle = 90))
This phase of the results examines the event types with the largest economic impact as measured by property and crop damage. The first step for this phase of the analysis consisted of creating subsets focused on property damage and crop damage, respectively.
combineDamage <- merged[order(merged$combineEcon, decreasing = TRUE), ][1:15, ]
combineProp<- combineDamage[,c(1,4)]
names(combineProp)[2]<- "DAMAGE"
combineProp$NAME<- "PROPERTY"
combineCrop<- combineDamage[,c(1,5)]
names(combineCrop)[2]<- "DAMAGE"
combineCrop$NAME<- "CROP"
I then merged these subsets and ordered them so that the event types with the most damage would appear in decreasing order. This also helped prepare the data for plotting. Additionally, I rescaled the damage variable to millions of dollars to make the plot easier to read. This phase of the analysis suggests that property damage is greater in magnitude than crop damage, and that floods, tornadoes, hail, and drought have the greatest economic impact.
combineDamagePlot<- rbind(combineProp, combineCrop)
combineDamagePlot<- transform(combineDamagePlot, EVTYPE = reorder(EVTYPE, -DAMAGE))
# Rescale from dollars to millions of dollars to match the axis label
combineDamagePlot$DAMAGE <- combineDamagePlot$DAMAGE / 10^6
# Stacked bar chart of property and crop damage by event type
ggplot(combineDamagePlot, aes(x = EVTYPE, y = DAMAGE, fill = NAME)) +
  geom_bar(stat = "identity") +
  labs(title = "Economic Damage", x = "", y = "Damage (Millions of Dollars)") +
  scale_fill_discrete("") +
  theme(axis.text.x = element_text(angle = 90))