In this assignment, we use NOAA’s storm database to assess which types of weather events are most hazardous to human health and which cause the greatest economic damage. We assess these using four variables in the NOAA data: (1) injuries, (2) fatalities, (3) property damage, and (4) crop damage. After downloading the data from the website into R, we perform a basic inspection of the variables to determine their contents. We then create new variables that capture the monetary value of property and crop damage, and sum all four outcomes for each type of weather event. For each of our four health and economic harm variables, we determine the top 6 weather event causes and present them in two 2X1 sets of barplots, the first containing injuries and fatalities and the second containing property and crop damage.
In the interest of producing fully reproducible research, we downloaded the storm data directly from the host website to a temporary file and read the compressed csv into an R object:
##Download data directly from website
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
tmp <- tempfile()
download.file(url, tmp, mode = "wb")
data <- read.csv(bzfile(tmp))
After this, we conducted an extensive investigation of the contents of the data using documentation from the National Weather Service and the data’s FAQ page. We used several data browsing commands: head() to inspect the first few rows, str() for a variable list with data types, unique() to inspect the contents of the key event type variable (EVTYPE) and of the property and crop damage multiplier variables (PROPDMGEXP, CROPDMGEXP), and table() to get a sense of the frequency of the different event types and damage multiplier categories. Due to the length of some outputs, we present the code but suppress the outputs:
##Browse data
head(data)
str(data)
unique(data$EVTYPE)
table(data$EVTYPE, useNA = "always")
unique(data$PROPDMGEXP)
table(data$PROPDMGEXP, useNA = "always")
unique(data$CROPDMGEXP)
table(data$CROPDMGEXP, useNA = "always")
EVTYPE is a string variable describing different types of weather events and FATALITIES and INJURIES are counts of the fatalities and injuries for each event. These variables do not require pre-processing.
Property and crop damage, however, are each stored in two variables: one gives a number (PROPDMG, CROPDMG) and the other contains a character, a number, or a blank (PROPDMGEXP, CROPDMGEXP). From the documentation, we ascertained that the characters are multipliers that apply to the numbers in PROPDMG and CROPDMG: a blank indicates that the number is the total damage in US dollars (USD), k/K indicates that it is in thousands, m/M that it is in millions, and b/B that it is in billions. The other characters account for a negligible fraction of the data, so we ignore them and set the corresponding damage values to missing.
We generate and add to the dataset the following two variables, which show the total dollar amounts of property and crop damage respectively:
##Generate interpretable property and crop damage variables
data$prop_damage <- with(data,
  ifelse(PROPDMGEXP == "", PROPDMG,
  ifelse(toupper(PROPDMGEXP) == "B", PROPDMG * 1e9,
  ifelse(toupper(PROPDMGEXP) == "M", PROPDMG * 1e6,  #toupper() also catches lowercase "m"
  ifelse(toupper(PROPDMGEXP) == "K", PROPDMG * 1e3, NA)))))
data$crop_damage <- with(data,
ifelse(CROPDMGEXP == "", CROPDMG,
ifelse(CROPDMGEXP == "B", CROPDMG * 1e9,
ifelse(CROPDMGEXP == "M" |
CROPDMGEXP == "m", CROPDMG * 1e6,
ifelse(CROPDMGEXP == "K" |
CROPDMGEXP == "k", CROPDMG * 1e3, NA)))))
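The same conversion can also be written with a named lookup vector instead of nested ifelse() calls. The following is only a minimal sketch: the function name damage_in_dollars and the toy data frame are hypothetical and not part of the analysis, and the sketch assumes that exponent codes should be treated case-insensitively, as described in the text.

```r
# Hypothetical helper: convert a damage figure and its exponent code to dollars.
# Codes other than blank, K, M, or B (in either case) yield NA.
damage_in_dollars <- function(amount, exp_code) {
  multipliers <- c(B = 1e9, M = 1e6, K = 1e3)
  code <- toupper(as.character(exp_code))
  mult <- unname(multipliers[code])  # NA for blank or unrecognized codes
  mult[code == ""] <- 1              # blank means the figure is already in dollars
  amount * mult
}

# Toy input (made up for illustration, not taken from the NOAA data)
toy <- data.frame(DMG = c(25, 2.5, 10, 5, 3),
                  EXP = c("K", "m", "B", "", "?"))
toy$damage <- damage_in_dollars(toy$DMG, toy$EXP)
toy
```

Keeping the multipliers in one vector means that a new code can be handled by adding a single entry rather than another level of nesting.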
Now that our four health and economic variables are in usable forms, we create four data frames by summing each for each type of weather event. Then we merge these four data frames into a single data frame of 985 observations that contains each weather event type, and the total number of injuries and fatalities and total property and crop damage that they caused:
##Sum Injuries/Fatalities/Property Damage/Crop Damage by EVTYPE
ev_injury<-aggregate(INJURIES~EVTYPE, data=data, FUN=sum, na.rm=TRUE)
ev_fatal<-aggregate(FATALITIES~EVTYPE, data=data, FUN=sum, na.rm=TRUE)
ev_prop<-aggregate(prop_damage~EVTYPE, data=data, FUN=sum, na.rm=TRUE)
ev_crop<-aggregate(crop_damage~EVTYPE, data=data, FUN=sum, na.rm=TRUE)
##Combine above into one dataset
data_list<-list(ev_injury, ev_fatal, ev_prop, ev_crop)
analysis_data <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "EVTYPE"), data_list)
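To illustrate why the merge uses all = TRUE, here is a small sketch on made-up data (the toy data frame and its values are assumptions for illustration only). The formula interface of aggregate() drops rows with missing values before summing, so an event type whose damage values are all missing disappears from that summary table and would be lost in a default inner merge.

```r
# Toy data (made up for illustration): FOG has only missing property damage
toy <- data.frame(
  EVTYPE      = c("FLOOD", "FLOOD", "HAIL", "HAIL", "FOG"),
  INJURIES    = c(2, 3, 0, 1, 4),
  prop_damage = c(1000, 500, 200, NA, NA)
)

inj  <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
prop <- aggregate(prop_damage ~ EVTYPE, data = toy, FUN = sum)

# Rows with NA are dropped before summing, so FOG is absent from prop
nrow(inj)   # 3 event types
nrow(prop)  # 2 event types

# all = TRUE keeps FOG in the merged table, with NA for its property damage
merged <- merge(inj, prop, by = "EVTYPE", all = TRUE)
merged
```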
The assignment guidelines call for a set of figures showing the weather events that caused the greatest harm to human health and the greatest economic damage. We show the top six causes of each of these four health and economic harm outcomes. This requires one more bit of data processing, in which we create four objects from analysis_data, each containing the six event types with the highest totals for one outcome, ordered from highest to lowest:
top_fatal<- analysis_data[order(-analysis_data$FATALITIES), ][1:6, ]
top_injury<- analysis_data[order(-analysis_data$INJURIES), ][1:6, ]
top_prop<- analysis_data[order(-analysis_data$prop_damage), ][1:6, ]
top_crop<- analysis_data[order(-analysis_data$crop_damage), ][1:6, ]
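The ordering step can be wrapped in a small helper. This is a sketch under stated assumptions: the function top_n_by and the toy data are hypothetical and are not used elsewhere in the analysis. Unlike indexing with [1:6, ], it drops missing totals before ranking and returns fewer rows when fewer than six are available.

```r
# Hypothetical helper: the n rows of df with the largest values in `column`
top_n_by <- function(df, column, n = 6) {
  keep <- order(df[[column]], decreasing = TRUE, na.last = NA)  # na.last = NA drops NAs
  head(df[keep, , drop = FALSE], n)
}

# Toy data (made up for illustration)
toy <- data.frame(EVTYPE = c("A", "B", "C", "D"),
                  FATALITIES = c(10, NA, 50, 5))

top_toy <- top_n_by(toy, "FATALITIES")          # 3 rows: C, A, D
padded  <- toy[order(-toy$FATALITIES), ][1:6, ] # 6 rows, the last two all NA
```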
We create two 2X1 plots using R’s base plotting system. The first shows the top 6 fatality-causing weather events (top) and the top 6 injury-causing weather events (bottom). We’ve made several alterations to the plots including adjusting the height of the plots, making the title and axis labels smaller, and creating custom y-axis scales so that the axis extends beyond the highest bar.
##Make Barplots of top 6 events for Population Health (Injuries, Fatalities)
par(mfrow = c(2, 1), mar = c(4.2, 4, 1, 1)) #Plot in 2 rows, 1 column, adjusting margins
par(mgp = c(3, 0.5, 0)) #Move x tick labels closer to plot
options(scipen=999) #Suppress scientific notation
bp1<-barplot(top_fatal$FATALITIES,
names.arg = top_fatal$EVTYPE,
las = 2,
main = "Top 6 Fatality-Causing Events",
ylab = "Number of Fatalities Caused",
ylim = c(0,6000),
cex.main = 0.9,
cex.names = 0.5,
cex.axis = 0.5,
cex.lab = 0.8)
bp2<-barplot(top_injury$INJURIES,
names.arg = top_injury$EVTYPE,
las = 2,
main = "Top 6 Injury-Causing Events",
ylab = "Number of Injuries Caused",
ylim = c(0,100000),
yaxt = "n",
cex.main = 0.9,
cex.names = 0.5,
cex.axis = 0.5,
cex.lab = 0.8)
axis(side = 2, at = seq(0, 100000, by = 20000),
labels = seq(0, 100000, by = 20000),
las = 1,
cex.axis = 0.7) #Set y-axis labels manually to improve appearance
As we can see above, the causes of injuries and fatalities are similar, with tornadoes being by far the foremost cause of both, and wind, flooding, heat, and lightning being the other top causes. The ordering of these next five causes differs, however, with wind being the second-highest cause of injury but only the sixth-highest cause of death. Likewise, excessive heat is the second-highest cause of death but only the fourth-highest cause of injury. Floods/flash floods are the third-highest cause of both injury and death, and lightning is the fifth-highest cause of both.
One important caveat is that the labeling of events in the weather events data is inconsistent, with the same type of event often given several different, sometimes idiosyncratic names. This is an issue here as well, with both ‘heat’ and ‘excessive heat’ being among the top 6 causes of both injury and death. These categories should likely be combined, which would make heat the second-highest cause of both injury and death. But this issue applies to all types of weather events, and fixing it would require a thorough inspection of the contents of the events variable to generate a crosswalk that combines like events into fewer categories. For the purposes of this assignment, we treat these different category names as different types of weather events.
Next we create the same two plots for property damage and crop damage, with similar adjustments to title and label sizes. One difference is that because the top event for each damage type caused more than $1 billion in damage, we divide the top six damage amounts by 1 billion and present the y-axis in billions of dollars.
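As a rough idea of what such a crosswalk could look like, the sketch below maps regular-expression patterns to broader categories. The patterns, the category names, and the function recode_event are illustrative assumptions, not an official NOAA classification, and a real crosswalk would need many more entries chosen after inspecting the data.

```r
# Hypothetical crosswalk: regex patterns (names) mapped to broader categories (values)
crosswalk <- c(
  "HEAT"              = "HEAT",
  "FLOOD"             = "FLOOD",
  "TSTM|THUNDERSTORM" = "THUNDERSTORM WIND"
)

recode_event <- function(evtype) {
  cleaned <- toupper(trimws(as.character(evtype)))
  out <- cleaned  # labels that match no pattern are kept as-is
  for (pattern in names(crosswalk)) {
    # Later patterns overwrite earlier ones when a label matches several
    out[grepl(pattern, cleaned)] <- crosswalk[[pattern]]
  }
  out
}

recoded <- recode_event(c("Excessive Heat", "HEAT", "FLASH FLOOD",
                          "TSTM WIND", "TORNADO"))
recoded
```

Applying a function like this to EVTYPE before aggregating would pool the heat and flood variants, at the cost of hiding distinctions (such as flood versus flash flood) that may matter for some questions.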
par(mfrow = c(2, 1), mar = c(4.2, 4, 1, 1)) #Plot in 2 rows, 1 column, adjusting margins
par(mgp = c(3, 0.5, 0)) #Move x tick labels closer to plot
bp1<-barplot(top_prop$prop_damage/1000000000,
names.arg = top_prop$EVTYPE,
las = 2,
main = "Top 6 Property-Damage-Causing Events",
ylab = "Property Damage (in billions usd)",
ylim = c(0,150),
cex.main = 0.9,
cex.names = 0.5,
cex.axis = 0.7,
cex.lab = 0.8)
bp2<-barplot(top_crop$crop_damage/1000000000,
names.arg = top_crop$EVTYPE,
las = 2,
main = "Top 6 Crop-Damage-Causing Events",
ylab = "Crop damage (in billions usd)",
ylim = c(0,15),
cex.main = 0.9,
cex.names = 0.5,
cex.axis = 0.7,
cex.lab = 0.8)
As we can see above, the top causes of property and crop damage are different. Drought causes by far the greatest damage to crops but is not one of the top six causes of property damage. Flooding is the greatest cause of property damage and is also the second-greatest cause of crop damage. And as with injuries and fatalities, there are multiple categories of flooding among the top six causes of property and crop damage, meaning that flooding altogether is more damaging than indicated by this cursory data analysis.
Hurricanes are the second-greatest cause of property damage but only the sixth-greatest cause of crop damage. It is somewhat surprising that they are among the top causes of crop damage at all, given that most crops are grown in areas such as the Midwest, the Great Plains, and central California, which do not experience hurricanes. Hurricanes are likely an even greater cause of property damage than indicated here because they are the primary cause of storm surge, which is the fourth-greatest cause of property damage. Ice storms are the fourth-greatest cause of crop damage, while hail is the fifth-greatest cause of crop damage and the sixth-greatest cause of property damage.
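A note on the plots: the ylim values in the barplots were set by hand so that the axis extends beyond the tallest bar. A possible alternative is to derive the limit from the data. The helper nice_ylim below is a hypothetical sketch, and the totals are made up for illustration rather than taken from the analysis.

```r
# Hypothetical helper: y-axis limits that end on a round value above the tallest bar
nice_ylim <- function(values, headroom = 1.05) {
  top <- max(values, na.rm = TRUE) * headroom
  c(0, max(pretty(c(0, top))))  # pretty() returns round breakpoints covering the range
}

# Toy totals (made up for illustration)
toy_totals <- c(TORNADO = 5200, HEAT = 1800, FLOOD = 950)
lims <- nice_ylim(toy_totals)

barplot(toy_totals, ylim = lims, las = 2, cex.names = 0.5)
```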